{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0",
   "metadata": {},
   "source": [
    "# How to: Create and Use a FeatureSet\n",
    "\n",
    "The `FeatureSet` is the central data container in ModularML. It organizes your data into three domains:\n",
    "\n",
    "- **Features**: model inputs (e.g., time-series signals, sensor readings)\n",
    "- **Targets**: values to predict (e.g., state-of-health, capacity)\n",
    "- **Tags**: metadata for grouping and filtering (e.g., cell ID, temperature)\n",
    "\n",
    "Under the hood, a `FeatureSet` wraps a `SampleCollection`, which stores all data in a columnar [Apache Arrow](https://arrow.apache.org/) table. Each column follows the naming convention `<domain>.<key>.<representation>` (e.g., `features.voltage.raw`). A `SampleSchema` tracks the structure, shapes, and data types.\n",
    "\n",
    "This notebook covers the complete `FeatureSet` API:\n",
    "\n",
    "- {ref}`01-create-featureset-creating-a-featureset`\n",
    "- {ref}`01-create-featureset-inspecting-a-featureset`\n",
    "- {ref}`01-create-featureset-accessing-data`\n",
    "- {ref}`01-create-featureset-row-subsetting-and-filtering`\n",
    "- {ref}`01-create-featureset-column-subsetting`\n",
    "- {ref}`01-create-featureset-splitting-data`\n",
    "- {ref}`01-create-featureset-transforms-and-scaling`\n",
    "- {ref}`01-create-featureset-serialization-save-and-load`\n",
    "- {ref}`01-create-featureset-references-for-model-graph-wiring`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1",
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "import modularml as mml\n",
    "from modularml import FeatureSet"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2",
   "metadata": {},
   "source": [
    "We'll use synthetic battery pulse-response data throughout this notebook. Each sample contains a 101-point voltage time-series, a scalar state-of-health (SOH) target, and metadata tags."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3",
   "metadata": {},
   "outputs": [],
   "source": [
    "N_SAMPLES = 1000\n",
    "N_CELLS = 20\n",
    "N_GROUPS = 5\n",
    "TIME = np.linspace(0, 100, 101)\n",
    "\n",
    "rng = np.random.default_rng(42)\n",
    "\n",
    "# Assign each sample a cell, group, pulse type, and SOC\n",
    "cell_ids = rng.integers(1, N_CELLS + 1, size=N_SAMPLES)\n",
    "group_ids = rng.integers(1, N_GROUPS + 1, size=N_SAMPLES)\n",
    "pulse_types = rng.choice([\"chg\", \"dchg\"], size=N_SAMPLES)\n",
    "pulse_socs = rng.choice([10, 20, 30, 40, 50, 60, 70, 80, 90], size=N_SAMPLES)\n",
    "\n",
    "# SOH degrades with group_id (higher group = more degraded)\n",
    "soh = 100.0 - (group_ids - 1) * 8.0 + rng.normal(0, 2, size=N_SAMPLES)\n",
    "\n",
    "# Synthetic voltage: baseline + pulse shape, shifted by SOC and degraded by SOH\n",
    "voltage = np.zeros((N_SAMPLES, 101))\n",
    "for i in range(N_SAMPLES):\n",
    "    base = 3.2 + pulse_socs[i] / 100.0 * 0.5\n",
    "    amplitude = 0.3 * (soh[i] / 100.0)\n",
    "    sign = 1.0 if pulse_types[i] == \"chg\" else -1.0\n",
    "    curve = sign * amplitude * (1 - np.exp(-TIME / 15.0))\n",
    "    voltage[i] = base + curve + rng.normal(0, 0.002, size=101)\n",
    "\n",
    "data = {\n",
    "    \"voltage\": voltage.tolist(),\n",
    "    \"soh\": soh.tolist(),\n",
    "    \"cell_id\": cell_ids.tolist(),\n",
    "    \"group_id\": group_ids.tolist(),\n",
    "    \"pulse_type\": pulse_types.tolist(),\n",
    "    \"pulse_soc\": pulse_socs.tolist(),\n",
    "}\n",
    "\n",
    "print(f\"Samples: {N_SAMPLES}\")\n",
    "print(f\"Voltage shape per sample: {voltage[0].shape}\")\n",
    "print(f\"SOH range: [{soh.min():.1f}, {soh.max():.1f}]\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5",
   "metadata": {},
   "source": [
    "(01-create-featureset-creating-a-featureset)=\n",
    "## Creating a FeatureSet"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6",
   "metadata": {},
   "source": [
    "\n",
    "Three class methods are available for construction."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7",
   "metadata": {},
   "source": [
    "### `from_dict()`: From a Python dictionary\n",
    "\n",
    "The most common constructor. Pass a dict where each key maps to a list/array of values (one entry per sample), then specify which keys are features, targets, and tags."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs = FeatureSet.from_dict(\n",
    "    label=\"PulseData\",\n",
    "    data=data,\n",
    "    feature_keys=\"voltage\",\n",
    "    target_keys=\"soh\",\n",
    "    tag_keys=[\"cell_id\", \"group_id\", \"pulse_type\", \"pulse_soc\"],\n",
    ")\n",
    "print(fs)\n",
    "print(f\"Feature shapes: {fs.get_feature_shapes()}\")\n",
    "print(f\"Target shapes:  {fs.get_target_shapes()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9",
   "metadata": {},
   "source": [
    "When accessing FeatureSet data, you'll notice that all keys are returned in the `<domain>.<key>.<representation>` by default.\n",
    "You can modify the the returned string with the `include_rep_suffix` and `include_domain_prefix` arguments in all `FeatureSet.get_<>` methods.\n",
    "\n",
    "*Note that certain string-component omissions will raise an error if it results in a non-unique key (e.g., you have two representations of the same feature column)*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\n",
    "    f\"Feature shapes: {fs.get_feature_shapes(include_domain_prefix=False, include_rep_suffix=False)}\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11",
   "metadata": {},
   "source": [
    "### `from_pandas()`: From a Pandas DataFrame\n",
    "\n",
    "Allows for FeatureSet structuring directly from a Pandas DataFrame.\n",
    "We similarly need to assign column names to features, targets, and tags."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "12",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a simple DataFrame example\n",
    "df = pd.DataFrame(\n",
    "    {\n",
    "        \"temperature\": np.random.default_rng(0).normal(25, 5, size=100),\n",
    "        \"humidity\": np.random.default_rng(1).normal(60, 10, size=100),\n",
    "        \"output_power\": np.random.default_rng(2).normal(100, 15, size=100),\n",
    "        \"site_id\": np.repeat([\"A\", \"B\", \"C\", \"D\"], 25),\n",
    "        \"timestamp\": np.arange(25).tolist() * 4,\n",
    "    },\n",
    ")\n",
    "\n",
    "fs_from_df = FeatureSet.from_pandas(\n",
    "    label=\"WeatherData\",\n",
    "    df=df,\n",
    "    feature_cols=[\"temperature\", \"humidity\"],\n",
    "    target_cols=\"output_power\",\n",
    "    tag_cols=\"site_id\",\n",
    ")\n",
    "print(fs_from_df)\n",
    "print(f\"Feature shapes: {fs_from_df.get_feature_shapes()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13",
   "metadata": {},
   "source": [
    "Note that the above approach treats every row in the dataframe as a unique sample for modeling.\n",
    "\n",
    "If that's not the case, grouping will need to be performed to aggregate rows in the Pandas dataframe belonging to each sample.\n",
    "The `from_pandas` constructor provides the `group_by` and `sort_by` arguments to do just that.\n",
    "\n",
    "Below, we group all rows in our dataframe by the `'site_id'` at which the data was measured, and then ensure all data points are sorted by `'time_stamp'` within each sample. Notice how the feauture now have a shape of (25,)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs_grouped = FeatureSet.from_pandas(\n",
    "    label=\"WeatherGrouped\",\n",
    "    df=df,\n",
    "    feature_cols=[\"temperature\", \"humidity\"],\n",
    "    target_cols=\"output_power\",\n",
    "    group_by=\"site_id\",\n",
    "    sort_by=\"timestamp\",\n",
    "    tag_cols=[\"site_id\", \"timestamp\"],\n",
    ")\n",
    "print(fs_grouped)\n",
    "print(f\"Feature shapes: {fs_grouped.get_feature_shapes()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15",
   "metadata": {},
   "source": [
    "### `from_pyarrow_table()`: From an Arrow table\n",
    "\n",
    "If you already have a `pyarrow.Table` with columns following the `<domain>.<key>.<rep>` naming convention, you can wrap it directly.\n",
    "\n",
    "*Unless you are certain that the existing table uses the appropriate schema, it is recommended to use `table.to_pandas()`, then use the `from_pandas()` constructor.*\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "16",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pyarrow as pa\n",
    "\n",
    "table = pa.table(\n",
    "    {\n",
    "        \"features.x.raw\": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],\n",
    "        \"targets.y.raw\": [0.5, 1.5],\n",
    "        \"tags.group.raw\": [\"a\", \"b\"],\n",
    "    },\n",
    ")\n",
    "\n",
    "fs_arrow = FeatureSet.from_pyarrow_table(label=\"ArrowExample\", table=table)\n",
    "print(fs_arrow)\n",
    "print(f\"Feature shapes: {fs_arrow.get_feature_shapes()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "18",
   "metadata": {},
   "source": [
    "(01-create-featureset-inspecting-a-featureset)=\n",
    "## Inspecting a FeatureSet"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19",
   "metadata": {},
   "source": [
    "\n",
    "Use the following properties and methods to understand the structure of your data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "20",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Basic info\n",
    "print(f\"Label:      {fs.label}\")\n",
    "print(f\"Samples:    {len(fs)}\")\n",
    "print(f\"repr:       {fs!r}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "21",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Column keys by domain\n",
    "print(\"Feature keys:\", fs.get_feature_keys())\n",
    "print(\"Target keys: \", fs.get_target_keys())\n",
    "print(\"Tag keys:    \", fs.get_tag_keys())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "22",
   "metadata": {},
   "outputs": [],
   "source": [
    "# All keys with full qualification (domain prefix + rep suffix)\n",
    "fs.get_all_keys(include_domain_prefix=True, include_rep_suffix=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "23",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Shapes and dtypes\n",
    "print(\"Feature shapes:\", fs.get_feature_shapes())\n",
    "print(\"Target shapes: \", fs.get_target_shapes())\n",
    "print(\"Tag shapes:    \", fs.get_tag_shapes())\n",
    "print()\n",
    "print(\"Feature dtypes:\", fs.get_feature_dtypes())\n",
    "print(\"Target dtypes: \", fs.get_target_dtypes())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "24",
   "metadata": {},
   "source": [
    "Note that most data containing classes in ModularML also support a `summary()` method.\n",
    "\n",
    "Printing the results provides a formatted summary of all characteristics of that object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "25",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(fs.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "26",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "27",
   "metadata": {},
   "source": [
    "(01-create-featureset-accessing-data)=\n",
    "## Accessing Data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28",
   "metadata": {},
   "source": [
    "\n",
    "Data can be retrieved in multiple formats via the `fmt` parameter. Accepted values include `\"numpy\"`, `\"pandas\"`, `\"dict_numpy\"`, `\"torch\"`, and more."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29",
   "metadata": {},
   "source": [
    "### Domain-specific accessors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "30",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get all features as a dict of numpy arrays (default)\n",
    "features = fs.get_features()\n",
    "print(f\"Type: {type(features)}\")\n",
    "print(f\"Keys: {list(features.keys())}\")\n",
    "print(f\"Voltage shape: {features['voltage'].shape}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "31",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get a single feature by name, as numpy\n",
    "voltage = fs.get_features(fmt=\"numpy\", features=\"voltage\")\n",
    "print(f\"Shape: {voltage.shape}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "32",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get targets as a pandas DataFrame\n",
    "targets_df = fs.get_targets(fmt=\"pandas\")\n",
    "targets_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "33",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get specific tags as a dict of numpy arrays\n",
    "tags = fs.get_tags(fmt=\"dict_numpy\", tags=[\"cell_id\", \"pulse_type\"])\n",
    "print(f\"Type: {type(tags)}\")\n",
    "print(f\"Cell IDs (first 5): {tags['cell_id'][:5]}\")\n",
    "print(f\"Pulse types (first 5): {tags['pulse_type'][:5]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "34",
   "metadata": {},
   "source": [
    "### Unified accessor: `get_data()`\n",
    "\n",
    "Retrieve columns from multiple domains in a single call. Supports wildcards and a default `rep` parameter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "35",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Specific columns from different domains\n",
    "result = fs.get_data(\n",
    "    features=\"voltage\",\n",
    "    targets=\"soh\",\n",
    "    tags=\"*\",\n",
    "    fmt=\"pandas\",\n",
    ")\n",
    "result.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "36",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sample UUIDs - each sample has a unique identifier\n",
    "uuids = fs.get_sample_uuids()\n",
    "print(f\"First 3 UUIDs: {uuids[:3]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "37",
   "metadata": {},
   "source": [
    "### Export to pandas or Arrow"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "38",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Full export to pandas\n",
    "df_all = fs.to_pandas()\n",
    "df_all"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "39",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Export to Arrow table\n",
    "arrow_table = fs.to_arrow()\n",
    "print(f\"Arrow schema:\\n{arrow_table.schema}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "40",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "41",
   "metadata": {},
   "source": [
    "(01-create-featureset-row-subsetting-and-filtering)=\n",
    "## Row Subsetting and Filtering"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42",
   "metadata": {},
   "source": [
    "\n",
    "All row-subsetting operations return a `FeatureSetView` - a lightweight, zero-copy window over the parent `FeatureSet`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "43",
   "metadata": {},
   "source": [
    "### `filter()`: Condition-based filtering\n",
    "\n",
    "Conditions are a dict mapping fully-qualified column names to:\n",
    "- A **scalar** (equality match)\n",
    "- A **list/set** (membership test)\n",
    "- A **callable** (row-wise predicate)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "44",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Filter by equality\n",
    "view_chg = fs.filter(conditions={\"tags.pulse_type.raw\": \"chg\"})\n",
    "print(f\"Charge-only: {view_chg}\")\n",
    "print(np.unique(view_chg.get_tags(fmt=\"np\", tags=\"pulse_type\")))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "45",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Filter by list membership\n",
    "view_cells = fs.filter(conditions={\"tags.cell_id.raw\": [1, 2, 3]})\n",
    "print(f\"Cells 1-3: {view_cells}\")\n",
    "print(np.unique(view_cells.get_tags(fmt=\"np\", tags=\"cell_id\")))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "46",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Filter with callable + multiple conditions (AND-composed)\n",
    "view_healthy_chg = fs.filter(\n",
    "    conditions={\n",
    "        \"tags.pulse_type.raw\": \"chg\",\n",
    "        \"targets.soh.raw\": lambda x: x >= 90.0,\n",
    "    },\n",
    ")\n",
    "print(f\"Healthy charge pulses: {view_healthy_chg}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "47",
   "metadata": {},
   "source": [
    "### `take()`: By relative index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "48",
   "metadata": {},
   "outputs": [],
   "source": [
    "view_first10 = fs.take([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], label=\"first_10\")\n",
    "print(view_first10)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49",
   "metadata": {},
   "source": [
    "### `take_sample_uuids()`: By UUID"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "50",
   "metadata": {},
   "outputs": [],
   "source": [
    "some_uuids = fs.get_sample_uuids()[:5].tolist()\n",
    "view_by_uuid = fs.take_sample_uuids(some_uuids, label=\"uuid_subset\")\n",
    "print(view_by_uuid)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "51",
   "metadata": {},
   "source": [
    "### Set operations between views"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "52",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Intersection: samples in both views\n",
    "view_a = fs.filter(conditions={\"tags.pulse_type.raw\": \"chg\"})\n",
    "print(\"View A:\")\n",
    "print(\" - view:\", view_a)\n",
    "print(\" - cells:\", np.unique(view_a.get_tags(fmt=\"np\", tags=\"cell_id\")))\n",
    "print(\" - pulse_types:\", np.unique(view_a.get_tags(fmt=\"np\", tags=\"pulse_type\")))\n",
    "\n",
    "view_b = fs.filter(conditions={\"tags.cell_id.raw\": [1, 2, 3]})\n",
    "print(\"\\nView B:\")\n",
    "print(\" - view:\", view_b)\n",
    "print(\" - cells:\", np.unique(view_b.get_tags(fmt=\"np\", tags=\"cell_id\")))\n",
    "print(\" - pulse_types:\", np.unique(view_b.get_tags(fmt=\"np\", tags=\"pulse_type\")))\n",
    "\n",
    "\n",
    "view_intersect = view_a.take_intersection(view_b, label=\"chg_cells_1to3\")\n",
    "print(\"\\nIntersection:\")\n",
    "print(\" - view:\", view_intersect)\n",
    "print(\" - cells:\", np.unique(view_intersect.get_tags(fmt=\"np\", tags=\"cell_id\")))\n",
    "print(\n",
    "    \" - pulse_types:\",\n",
    "    np.unique(view_intersect.get_tags(fmt=\"np\", tags=\"pulse_type\")),\n",
    ")\n",
    "\n",
    "# Difference: samples in A but not in B\n",
    "view_diff = view_a.take_difference(view_b, label=\"chg_not_cells_1to3\")\n",
    "print(\"\\nDifference:\")\n",
    "print(\" - view:\", view_diff)\n",
    "print(\" - cells:\", np.unique(view_diff.get_tags(fmt=\"np\", tags=\"cell_id\")))\n",
    "print(\" - pulse_types:\", np.unique(view_diff.get_tags(fmt=\"np\", tags=\"pulse_type\")))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "53",
   "metadata": {},
   "source": [
    "We can also check view overlap via the `is_disjoint_with` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "54",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"view_diff does not contain view_b samples: \", view_b.is_disjoint_with(view_diff))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "55",
   "metadata": {},
   "source": [
    "### Converting a view back to a FeatureSet\n",
    "\n",
    "A `FeatureSetView` is a lightweight reference of indices in the parent FeatureSet. \n",
    "Any modification to the FeatureSet with change the data access through its child views.\n",
    "\n",
    "To create an independent `FeatureSet` from a view, use `to_featureset()`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "56",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs_charge = view_chg.to_featureset(label=\"ChargePulses\")\n",
    "print(fs_charge)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "57",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "58",
   "metadata": {},
   "source": [
    "(01-create-featureset-column-subsetting)=\n",
    "## Column Subsetting"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "59",
   "metadata": {},
   "source": [
    "\n",
    "Use `select()` to create a view with only specific columns. Row indices are preserved.\n",
    "\n",
    "Select supports the same wildcard usage as `filter()`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "60",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select specific features and targets\n",
    "view_slim = fs.select(features=\"voltage.*\", targets=\"soh\")\n",
    "print(f\"Columns: {view_slim.get_all_keys()}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "61",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select by domain and representation\n",
    "view_raw_only = fs.select(features=\"voltage\", rep=\"raw\")\n",
    "print(\n",
    "    f\"Columns: {view_raw_only.get_all_keys(include_domain_prefix=True, include_rep_suffix=True)}\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "62",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "63",
   "metadata": {},
   "source": [
    "(01-create-featureset-splitting-data)=\n",
    "## Splitting Data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64",
   "metadata": {},
   "source": [
    "\n",
    "Splitting creates named `FeatureSetView` partitions that are registered (optional) on the parent FeatureSet."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "65",
   "metadata": {},
   "source": [
    "### Random splitting\n",
    "\n",
    "Random splitting takes a `ratios` argument, defining the proportions of all samples in the calling container to be assigned to each key.\n",
    "The ratio values must add up to 1.\n",
    "\n",
    "By default, splits views are not returned and automatically registered to the parent FeatureSet.\n",
    "This behaviour can be specified via the `return_views` and `register` arguments."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "66",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs_charge.clear_splits()\n",
    "fs_charge.split_random(\n",
    "    ratios={\"train\": 0.6, \"val\": 0.2, \"test\": 0.2},\n",
    "    seed=42,\n",
    ")\n",
    "\n",
    "print(f\"Available splits: {fs_charge.available_splits}\")\n",
    "for name, view in fs_charge.splits.items():\n",
    "    print(f\"  {name}: {len(view)} samples\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67",
   "metadata": {},
   "source": [
    "Use `group_by` to keep all samples sharing a tag value in the same split (prevents data leakage):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "68",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs_charge.clear_splits()\n",
    "\n",
    "fs_charge.split_random(\n",
    "    ratios={\"train\": 0.5, \"val\": 0.3, \"test\": 0.2},\n",
    "    group_by=\"group_id\",\n",
    "    seed=1,\n",
    ")\n",
    "\n",
    "for name, view in fs_charge.splits.items():\n",
    "    group_ids = view.get_tags(fmt=\"numpy\", tags=\"group_id\")\n",
    "    print(f\"  {name}: {len(view)} samples, groups: {np.unique(group_ids)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "69",
   "metadata": {},
   "source": [
    "Use `stratify_by` to ensure all splits have representative distributions of the calling source.\n",
    "\n",
    "Note that grouping and stratification are mutually exclusive (you can't use both at the same time)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "70",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs_charge.clear_splits()\n",
    "\n",
    "fs_charge.split_random(\n",
    "    ratios={\"train\": 0.5, \"val\": 0.3, \"test\": 0.2},\n",
    "    stratify_by=\"group_id\",\n",
    "    seed=1,\n",
    ")\n",
    "\n",
    "for name, view in fs_charge.splits.items():\n",
    "    group_ids = view.get_tags(fmt=\"numpy\", tags=\"group_id\")\n",
    "    print(f\"  {name}: {len(view)} samples, groups: {np.unique(group_ids)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "71",
   "metadata": {},
   "source": [
    "### Condition-based splitting\n",
    "\n",
    "Assign samples to splits using explicit conditions on any column.\n",
    "\n",
    "`split_by_condition` takes a `conditions` argument: a nested dict of `{split_name: {column: condition}}`.\n",
    "\n",
    "Conditions can be specified as:\n",
    "- A **scalar**: equality match, e.g. `\"chg\"`\n",
    "- A **list**: membership test, e.g. `[1, 2, 3]`\n",
    "- A **callable**: row-wise predicate, e.g. `lambda x: x >= 90`\n",
    "- A **`Predicate` instance**: e.g. `GTE(90)`, `In([1, 2, 3])`\n",
    "\n",
    "Samples satisfying all conditions within a named split are included in that split.\n",
    "A warning is raised if the produced splits are not mutually exclusive.\n",
    "\n",
    "#### Predicates and serialization\n",
    "\n",
    "Raw scalars and lists work as shorthand, but **lambdas are not serializable**: they cannot be saved and\n",
    "restored when a `FeatureSet` is persisted to disk. To make `split_by_condition` calls fully round-trip\n",
    "safe, use the `Predicate` classes from `modularml.predicates`:\n",
    "\n",
    "| Class | Condition |\n",
    "|-------|-----------|\n",
    "| `LT(v)` | `x < v` |\n",
    "| `LTE(v)` | `x <= v` |\n",
    "| `GT(v)` | `x > v` |\n",
    "| `GTE(v)` | `x >= v` |\n",
    "| `EQ(v)` | `x == v` |\n",
    "| `NE(v)` | `x != v` |\n",
    "| `In([...])` | `x in [...]` |\n",
    "| `NotIn([...])` | `x not in [...]` |\n",
    "| `Lambda(\"lambda x: ...\")` | arbitrary logic |\n",
    "\n",
    "`Lambda` stores the **source string** explicitly, so it survives serialization; unlike a bare Python `lambda`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "72",
   "metadata": {},
   "outputs": [],
   "source": [
    "from modularml.predicates import EQ, GT, GTE, LT, LTE, NE, In, Lambda, NotIn\n",
    "\n",
    "fs_charge.clear_splits()\n",
    "\n",
    "fs_charge.split_by_condition(\n",
    "    {\n",
    "        \"train\": {\"tags.group_id.raw\": In([1, 2, 3])},\n",
    "        \"val\": {\"tags.group_id.raw\": EQ(4)},\n",
    "        \"test\": {\"tags.group_id.raw\": EQ(5)},\n",
    "    },\n",
    ")\n",
    "\n",
    "for name, view in fs_charge.splits.items():\n",
    "    group_ids = view.get_tags(fmt=\"numpy\", tags=\"group_id\")\n",
    "    print(f\"  {name}: {len(view)} samples, groups: {np.unique(group_ids)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "73",
   "metadata": {},
   "source": [
    "### Nested splits\n",
    "\n",
    "Splitting can be called on any existing split, in addition to directly on the parent FeatureSet.\n",
    "The nested split conditions will only draw from the samples available in the calling view.\n",
    "\n",
    "This allows us to \"nest\" split conditions to create more complex modeling setups.\n",
    "\n",
    "*Note that the sub-splits will inherently overlap with the calling view, and care should be taking when using these splits in downstream modeling.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "74",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs_charge.clear_splits()\n",
    "\n",
    "fs_charge.split_by_condition(\n",
    "    {\n",
    "        \"source\": {\n",
    "            \"targets.soh.raw\": GTE(90),\n",
    "            \"tags.group_id.raw\": In([1, 2, 3]),\n",
    "        },\n",
    "        \"test\": {\n",
    "            \"targets.soh.raw\": LT(90),\n",
    "            \"tags.group_id.raw\": In([4, 5]),\n",
    "        },\n",
    "    },\n",
    ")\n",
    "fs_charge.get_split(\"source\").split_random(\n",
    "    ratios={\"train\": 0.8, \"val\": 0.2},\n",
    "    stratify_by=\"group_id\",\n",
    ")\n",
    "\n",
    "for name, view in fs_charge.splits.items():\n",
    "    group_ids = view.get_tags(fmt=\"numpy\", tags=\"group_id\")\n",
    "    print(f\"  {name}: {len(view)} samples, groups: {np.unique(group_ids)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "75",
   "metadata": {},
   "source": [
    "### Manual split registration with `filter()` and `add_split()`\n",
    "\n",
    "You can build splits from arbitrary filters and register them manually with `add_split()`.\n",
    "This is useful when your partition logic doesn't fit neatly into a ratio-based or condition-based split."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "76",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs_charge.clear_splits()\n",
    "\n",
    "train_view = fs_charge.filter(\n",
    "    conditions={\n",
    "        \"tags.group_id.raw\": [1, 2, 3],\n",
    "        \"targets.soh.raw\": Lambda(\"lambda x: x >= 80.0\"),\n",
    "    },\n",
    "    label=\"train\",\n",
    ")\n",
    "test_view = fs_charge.filter(\n",
    "    conditions={\"tags.group_id.raw\": [4, 5]},\n",
    "    label=\"test\",\n",
    ")\n",
    "\n",
    "# Register them as named splits\n",
    "fs_charge.add_split(train_view)\n",
    "fs_charge.add_split(test_view)\n",
    "\n",
    "for name, view in fs_charge.splits.items():\n",
    "    print(f\"  {name}: {len(view)} samples\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "77",
   "metadata": {},
   "source": [
    "### Returning views directly with `return_views=True`\n",
    "\n",
    "Any split method can return the produced views directly by passing `return_views=True`.\n",
    "This is useful when you want immediate access to the views without going through `get_split()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "78",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs_charge.clear_splits()\n",
    "\n",
    "split_views = fs_charge.split_random(\n",
    "    ratios={\"train\": 0.6, \"val\": 0.2, \"test\": 0.2},\n",
    "    group_by=\"group_id\",\n",
    "    seed=42,\n",
    "    return_views=True,\n",
    ")\n",
    "\n",
    "for name, view in split_views.items():\n",
    "    print(f\"  {name}: {len(view)} samples\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "79",
   "metadata": {},
   "source": [
    "Note that most ModularML core classes implement a `.visualize()` method.\n",
    "For FeatureSets, this displays a Mermaid diagram of all splits registered to the FeatureSet.\n",
    "\n",
    "*Note that you will need to install a Mermaid rendering extension for your IDE. I use \"Markdown Preview Mermaid Support\".*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "80",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs_charge.visualize(show_overlaps=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "81",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "82",
   "metadata": {},
   "source": [
    "(01-create-featureset-transforms-and-scaling)=\n",
    "## Transforms and Scaling"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "83",
   "metadata": {},
   "source": [
    "\n",
    "Apply preprocessing transforms to features or targets. Transforms are tracked and can be undone.\n",
    "\n",
    "Several scalers are built into ModularML and can be listed via the `Scaler.get_supported_scalers()` command.\n",
    "You can also create custom scalers, as outlined below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "84",
   "metadata": {},
   "outputs": [],
   "source": [
    "# List all available scalers\n",
    "mml.supported_scalers"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "85",
   "metadata": {},
   "source": [
    "Let's create a little utility to plot our voltages so we can verify our transforms:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "86",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "\n",
    "def plot_voltages(\n",
    "    fs: FeatureSet,\n",
    "    n_samples: int = 200,\n",
    "    rep: str = \"transformed\",\n",
    "    seed: int = 13,\n",
    "):\n",
    "    \"\"\"\n",
    "    Plot the 'voltage' feature contained in the FeatureSet.\n",
    "\n",
    "    Each split will get its own panel.\n",
    "    Colors by SOH (dark blue = high SOH, light blue = low SOH)\n",
    "\n",
    "    Args:\n",
    "        fs (FeatureSet): FeatureSet to use.\n",
    "        n_samples (int, optional): The number of samples in `fs` that will\n",
    "              get plotted. Defaults to 200.\n",
    "        rep (str): The representation of the data to plot (e.g., \"raw\" or \"transformed\").\n",
    "        seed (int, optional): A seed to ensure the same samples get plotted\n",
    "              with repeated calls. Defaults to 13.\n",
    "\n",
    "    \"\"\"\n",
    "\n",
    "    def order_splits(values: list[str]) -> list[str]:\n",
    "        priority = {\"train\": 0, \"val\": 1, \"test\": 2}\n",
    "        return sorted(values, key=lambda x: priority.get(x, 99))\n",
    "\n",
    "    rng = np.random.default_rng(seed)\n",
    "    scm = plt.cm.ScalarMappable(\n",
    "        cmap=plt.cm.Blues,\n",
    "        norm=plt.Normalize(vmin=50, vmax=100),\n",
    "    )\n",
    "\n",
    "    # Verify rep exists\n",
    "    avail_reps = fs.collection._get_rep_keys(domain=\"features\", key=\"voltage\")\n",
    "    if rep not in avail_reps:\n",
    "        rep = \"raw\"\n",
    "\n",
    "    # Create figure with panels for each split\n",
    "    fig, axes = plt.subplots(\n",
    "        figsize=(7, 2.5),\n",
    "        ncols=fs.n_splits,\n",
    "        sharex=True,\n",
    "        sharey=True,\n",
    "    )\n",
    "    split_names = order_splits(fs.available_splits)\n",
    "    for i, split_label in enumerate(split_names):\n",
    "        # For each split, get the voltage features and SOH targets\n",
    "        split_view = fs.get_split(split_label)\n",
    "        voltages = np.squeeze(\n",
    "            split_view.get_features(features=\"voltage\", fmt=\"numpy\", rep=rep),\n",
    "        )\n",
    "        sohs = np.squeeze(split_view.get_targets(targets=\"soh\", fmt=\"numpy\", rep=\"raw\"))\n",
    "\n",
    "        # Select n_samples\n",
    "        sample_idxs = rng.choice(np.arange(0, len(voltages)), size=n_samples)\n",
    "        for idx in sample_idxs:\n",
    "            axes[i].plot(voltages[idx], color=scm.to_rgba(sohs[idx]))\n",
    "\n",
    "        axes[i].set_title(split_label, fontsize=10)\n",
    "        axes[i].set_xlabel(\"Time (s)\", fontsize=10)\n",
    "    axes[0].set_ylabel(\"Voltage (V)\", fontsize=10)\n",
    "\n",
    "    # Adjust main subplot area to leave space on the right for colorbar\n",
    "    fig.tight_layout(pad=1)\n",
    "    fig.subplots_adjust(right=0.85)\n",
    "\n",
    "    # Add colorbar as a dedicated panel on the far right\n",
    "    cbar_ax = fig.add_axes([0.87, 0.19, 0.02, 0.7])  # [left, bottom, width, height]\n",
    "    cbar = fig.colorbar(scm, cax=cbar_ax)\n",
    "    cbar.set_label(\"SOH (%)\", fontsize=10)\n",
    "    return fig, axes\n",
    "\n",
    "\n",
    "fig, axes = plot_voltages(fs_charge, n_samples=200)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "87",
   "metadata": {},
   "source": [
    "### Applying a transform\n",
    "\n",
    "`fit_transform()` fits a scaler and stores the result as a `\"transformed\"` representation alongside the original `\"raw\"` data.\n",
    "\n",
    "- `domain`: `\"features\"` or `\"targets\"`\n",
    "- `keys`: which keys to transform (default: all in domain)\n",
    "- `fit_to_split`: fit only on this split's data (prevents data leakage)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "88",
   "metadata": {},
   "outputs": [],
   "source": [
    "from modularml import Scaler\n",
    "\n",
    "# Apply MinMaxScaler to voltage, fitted on training data only\n",
    "fs_charge.fit_transform(\n",
    "    scaler=Scaler(\"MinMaxScaler\"),\n",
    "    domain=\"features\",\n",
    "    keys=\"voltage\",\n",
    "    fit_to_split=\"train\",\n",
    ")\n",
    "\n",
    "# Raw data is preserved - access both representations\n",
    "raw = fs_charge[\"train\"].get_features(fmt=\"numpy\", features=\"voltage\", rep=\"raw\")\n",
    "transformed = fs_charge[\"train\"].get_features(\n",
    "    fmt=\"numpy\",\n",
    "    features=\"voltage\",\n",
    "    rep=\"transformed\",\n",
    ")\n",
    "\n",
    "fig, axes = plot_voltages(fs_charge, n_samples=200, rep=\"transformed\")\n",
    "plt.show()"
   ]
  },
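  {
   "cell_type": "markdown",
   "id": "88-sketch",
   "metadata": {},
   "source": [
    "To see why fitting on a single split matters, here is a minimal sketch outside ModularML that uses scikit-learn's `MinMaxScaler` directly on small synthetic arrays (the data and variable names are illustrative only). It is assumed, not confirmed, that `fit_to_split=\"train\"` behaves analogously: the scaler learns its statistics from the training rows and the same mapping is then applied everywhere.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.preprocessing import MinMaxScaler\n",
    "\n",
    "# Illustrative data: training values span [3.0, 4.0]; test values go slightly beyond\n",
    "train = np.array([[3.0], [3.5], [4.0]])\n",
    "test = np.array([[2.9], [4.1]])\n",
    "\n",
    "# Fit on the training rows only, then apply the same mapping to both sets\n",
    "scaler = MinMaxScaler().fit(train)\n",
    "train_scaled = scaler.transform(train)\n",
    "test_scaled = scaler.transform(test)\n",
    "\n",
    "print(\"train:\", train_scaled.ravel())  # bounded to [0, 1]\n",
    "print(\"test: \", test_scaled.ravel())  # can fall outside [0, 1]\n",
    "```\n",
    "\n",
    "Held-out values landing slightly outside `[0, 1]` is expected here: no test-set statistics were used to fit the scaler."
   ]
  },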
  {
   "cell_type": "markdown",
   "id": "89",
   "metadata": {},
   "source": [
    "### Chaining transforms\n",
    "\n",
    "Multiple transforms can be applied sequentially. Each call transforms the current `\"transformed\"` representation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "90",
   "metadata": {},
   "outputs": [],
   "source": [
    "# First undo, then chain: zero-start -> min-max\n",
    "fs_charge.undo_all_transforms(domain=\"features\")\n",
    "\n",
    "fs_charge.fit_transform(\n",
    "    scaler=\"PerSampleZeroStart\",\n",
    "    domain=\"features\",\n",
    "    keys=\"voltage\",\n",
    "    fit_to_split=\"train\",\n",
    ")\n",
    "\n",
    "fig, axes = plot_voltages(fs_charge, n_samples=200, rep=\"transformed\")\n",
    "plt.show()\n",
    "\n",
    "fs_charge.fit_transform(\n",
    "    scaler=\"MinMaxScaler\",\n",
    "    domain=\"features\",\n",
    "    keys=\"voltage\",\n",
    "    fit_to_split=\"train\",\n",
    ")\n",
    "\n",
    "fig, axes = plot_voltages(fs_charge, n_samples=200, rep=\"transformed\")\n",
    "plt.show()"
   ]
  },
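  {
   "cell_type": "markdown",
   "id": "90-sketch",
   "metadata": {},
   "source": [
    "Conceptually, a chain of transforms is function composition, and undoing it means applying the inverses in reverse order. The NumPy sketch below illustrates this with two hypothetical steps. It assumes that a zero-start transform subtracts each sample's first point (inferred from the name `PerSampleZeroStart`, not from its source) and uses a single global min-max for simplicity.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Two illustrative 4-point voltage samples\n",
    "x = np.array(\n",
    "    [\n",
    "        [3.0, 3.2, 3.4, 3.5],\n",
    "        [3.6, 3.7, 3.9, 4.1],\n",
    "    ],\n",
    ")\n",
    "\n",
    "# Step 1 (assumed zero-start behavior): subtract each sample's first point\n",
    "offsets = x[:, :1]\n",
    "step1 = x - offsets\n",
    "\n",
    "# Step 2: min-max scaling applied to the output of step 1\n",
    "lo, hi = step1.min(), step1.max()\n",
    "step2 = (step1 - lo) / (hi - lo)\n",
    "\n",
    "# Undo in reverse order: invert min-max first, then restore the offsets\n",
    "undo_step2 = step2 * (hi - lo) + lo\n",
    "undo_step1 = undo_step2 + offsets\n",
    "\n",
    "print(np.allclose(undo_step1, x))\n",
    "```\n",
    "\n",
    "Undoing only the last step returns the zero-started curves, which matches the idea of reverting one transform while keeping the earlier one."
   ]
  },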
  {
   "cell_type": "markdown",
   "id": "91",
   "metadata": {},
   "source": [
    "### Scaling targets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "92",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import MinMaxScaler\n",
    "\n",
    "# You can also pass sklearn instances directly\n",
    "fs_charge.fit_transform(\n",
    "    scaler=MinMaxScaler(),\n",
    "    domain=\"targets\",\n",
    "    keys=\"soh\",\n",
    "    fit_to_split=\"train\",\n",
    ")\n",
    "\n",
    "soh_raw = fs_charge[\"test\"].get_targets(fmt=\"numpy\", targets=\"soh\", rep=\"raw\")\n",
    "soh_scaled = fs_charge[\"test\"].get_targets(\n",
    "    fmt=\"numpy\",\n",
    "    targets=\"soh\",\n",
    "    rep=\"transformed\",\n",
    ")\n",
    "print(f\"SOH raw range:     [{soh_raw.min():.1f}, {soh_raw.max():.1f}]\")\n",
    "print(f\"SOH scaled range:  [{soh_scaled.min():.3f}, {soh_scaled.max():.3f}]\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "93",
   "metadata": {},
   "source": [
    "### Undoing transforms"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "94",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Undo the last *feature* transform (MinMaxScaler), keeping PerSampleZeroStart\n",
    "# Note that the target transform (although more recent) is not undone\n",
    "fs_charge.undo_last_transform(domain=\"features\", keys=\"voltage\")\n",
    "\n",
    "transformed = fs_charge[\"train\"].get_features(\n",
    "    fmt=\"numpy\",\n",
    "    features=\"voltage\",\n",
    "    rep=\"transformed\",\n",
    ")\n",
    "print(\"After undo last:\")\n",
    "print(f\"  min={transformed.min():.4f} (should be ~0.0)\")\n",
    "print(f\"  max={transformed.max():.4f} (no longer bounded to 1.0)\")\n",
    "\n",
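    "# Re-read the targets to confirm the SOH scaling is still in place\n",
    "soh_scaled = fs_charge[\"test\"].get_targets(\n",
    "    fmt=\"numpy\",\n",
    "    targets=\"soh\",\n",
    "    rep=\"transformed\",\n",
    ")\n",
    "print(f\"SOH:  [{soh_scaled.min():.3f}, {soh_scaled.max():.3f}]\")"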
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "95",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Undo all remaining transforms (no `domain` argument is passed here)\n",
    "fs_charge.undo_all_transforms()\n",
    "\n",
    "# Verify: after undoing all, 'transformed' rep no longer exists\n",
    "print(\n",
    "    \"Keys after undo:\",\n",
    "    fs_charge.get_all_keys(include_domain_prefix=True, include_rep_suffix=True),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "96",
   "metadata": {},
   "source": [
    "### Inverse-scaling external data\n",
    "\n",
    "Use `unscale_data_for_cols()` to apply inverse transforms to data that lives outside the FeatureSet (e.g., model predictions):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "97",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs_charge.fit_transform(\n",
    "    scaler=MinMaxScaler(),\n",
    "    domain=\"targets\",\n",
    "    keys=\"soh\",\n",
    "    fit_to_split=\"train\",\n",
    ")\n",
    "\n",
    "# Simulate model predictions in scaled space\n",
    "fake_predictions = np.array([[0.5], [0.8], [0.2]])\n",
    "\n",
    "# Inverse-transform back to original SOH scale\n",
    "original_scale = fs_charge.unscale_data_for_cols(\n",
    "    data=fake_predictions,\n",
    "    domain=\"targets\",\n",
    "    columns=\"soh\",\n",
    ")\n",
    "print(f\"Scaled predictions:   {fake_predictions.ravel()}\")\n",
    "print(f\"Original-scale SOH:   {original_scale.ravel()}\")"
   ]
  },
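  {
   "cell_type": "markdown",
   "id": "97-sketch",
   "metadata": {},
   "source": [
    "For intuition, inverse-scaling with a min-max scaler amounts to `x = x_scaled * (max - min) + min`, using the statistics learned at fit time. The sketch below shows this with scikit-learn's `MinMaxScaler.inverse_transform` on made-up SOH values. It is a standalone illustration and makes no claim about how `unscale_data_for_cols()` is implemented internally.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.preprocessing import MinMaxScaler\n",
    "\n",
    "# Illustrative SOH targets (%) used to fit the scaler\n",
    "soh_train = np.array([[60.0], [80.0], [100.0]])\n",
    "scaler = MinMaxScaler().fit(soh_train)\n",
    "\n",
    "# Predictions expressed in the scaled [0, 1] space\n",
    "preds_scaled = np.array([[0.5], [0.8], [0.2]])\n",
    "\n",
    "# Map the predictions back to the original SOH scale\n",
    "preds_soh = scaler.inverse_transform(preds_scaled)\n",
    "\n",
    "# The same result computed by hand from the fitted min and max\n",
    "by_hand = preds_scaled * (scaler.data_max_ - scaler.data_min_) + scaler.data_min_\n",
    "print(preds_soh.ravel())\n",
    "print(np.allclose(preds_soh, by_hand))\n",
    "```\n",
    "\n",
    "The result depends entirely on which data the scaler was fitted to, which is why the fitted scaler and the predictions must come from the same FeatureSet and split setup."
   ]
  },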
  {
   "cell_type": "markdown",
   "id": "98",
   "metadata": {},
   "source": [
    "You can access the fitted scalers on a particular column and use them for unscaling (shown above). However, it is best practice to go back to the original FeatureSet, filter it to the sample IDs on which your scaled data was produced, and then access the \"transformed\" version directly. This is the only way to fully guarantee that the correct scaler is applied.\n",
    "\n",
    "We'll cover this in more depth in a separate notebook on working with model outputs and results."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "99",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "100",
   "metadata": {},
   "source": [
    "(01-create-featureset-serialization-save-and-load)=\n",
    "## Serialization (Save and Load)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "101",
   "metadata": {},
   "source": [
    "\n",
    "FeatureSets can be saved to disk and fully restored, including splits and transforms."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "102",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Re-apply splits and transforms before saving\n",
    "fs_charge.clear_splits()\n",
    "fs_charge.undo_all_transforms()\n",
    "\n",
    "fs_charge.split_random(\n",
    "    ratios={\"train\": 0.6, \"val\": 0.2, \"test\": 0.2},\n",
    "    group_by=\"group_id\",\n",
    "    seed=42,\n",
    ")\n",
    "\n",
    "fs_charge.fit_transform(\n",
    "    scaler=\"PerSampleZeroStart\",\n",
    "    domain=\"features\",\n",
    "    keys=\"voltage\",\n",
    "    fit_to_split=\"train\",\n",
    ")\n",
    "fig, axes = plot_voltages(fs_charge, n_samples=200, rep=\"transformed\")\n",
    "plt.show()\n",
    "fs_charge.fit_transform(\n",
    "    scaler=\"MinMaxScaler\",\n",
    "    domain=\"features\",\n",
    "    keys=\"voltage\",\n",
    "    fit_to_split=\"train\",\n",
    ")\n",
    "fig, axes = plot_voltages(fs_charge, n_samples=200, rep=\"transformed\")\n",
    "plt.show()\n",
    "fs_charge.fit_transform(\n",
    "    scaler=\"MinMaxScaler\",\n",
    "    domain=\"targets\",\n",
    "    keys=\"soh\",\n",
    "    fit_to_split=\"train\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "103",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "from tempfile import TemporaryDirectory\n",
    "\n",
    "# Save to a temporary directory\n",
    "SAVE_DIR = TemporaryDirectory()\n",
    "\n",
    "save_path = fs_charge.save(Path(SAVE_DIR.name) / \"fs_charge_demo\", overwrite=True)\n",
    "print(f\"Saved to: {save_path}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "104",
   "metadata": {},
   "source": [
    "Now we can reload this FeatureSet.\n",
    "\n",
    "Note that ModularML assigns every \"node\" in an Experiment a unique ID.\n",
    "This becomes important when we move to Experiments and ModelGraphs, but for now we can ignore the collision warning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "105",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reload from file\n",
    "fs_rel = FeatureSet.load(save_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "106",
   "metadata": {},
   "source": [
    "We can pick up exactly where we left off; all history is preserved.\n",
    "\n",
    "This means we can undo the last transform."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "107",
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plot_voltages(fs_rel, n_samples=200, rep=\"transformed\")\n",
    "plt.show()\n",
    "\n",
    "fs_rel.undo_last_transform(domain=\"features\")\n",
    "fig, axes = plot_voltages(fs_rel, n_samples=200, rep=\"transformed\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "109",
   "metadata": {},
   "source": [
    "### Copying a FeatureSet\n",
    "\n",
    "The Node ID Collision warning appears because the saved FeatureSet carries an internal ID\n",
    "that already exists in the active context (it was saved from `fs_charge`). ModularML\n",
    "resolves this by assigning a new ID to the loaded FeatureSet; `fs_rel` and `fs_charge`\n",
    "are distinct objects.\n",
    "\n",
    "To explicitly create an independent copy of a FeatureSet within the same session, use `.copy()`.\n",
    "\n",
    "By default, the copy shares the underlying PyArrow data buffer with the original (zero-copy).\n",
    "Setting `share_raw_data_buffer=False` produces a fully independent copy with no shared state.\n",
    "\n",
    "Use `restore_splits` and `restore_scalers` to control whether the new copy inherits\n",
    "existing split definitions and fitted scalers from the original."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "110",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Shallow copy with raw data only\n",
    "fs_copy_raw = fs_charge.copy(label=\"CopyRawOnly\", share_raw_data_buffer=True)\n",
    "print(\n",
    "    f\"Copy (raw only) keys: {fs_copy_raw.get_all_keys(include_domain_prefix=True, include_rep_suffix=True)}\",\n",
    ")\n",
    "print(f\"Copy splits: {fs_copy_raw.available_splits}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "111",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Full copy with splits and scalers restored\n",
    "fs_copy_full = fs_charge.copy(\n",
    "    label=\"CopyFull\",\n",
    "    share_raw_data_buffer=False,\n",
    "    restore_splits=True,\n",
    "    restore_scalers=True,\n",
    "    register=True,\n",
    ")\n",
    "print(\n",
    "    f\"Full copy keys: {fs_copy_full.get_all_keys(include_domain_prefix=True, include_rep_suffix=True)}\",\n",
    ")\n",
    "print(f\"Full copy splits: {fs_copy_full.available_splits}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "112",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "113",
   "metadata": {},
   "source": [
    "(01-create-featureset-references-for-model-graph-wiring)=\n",
    "## References (for Model Graph Wiring)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "114",
   "metadata": {},
   "source": [
    "\n",
    "When connecting a FeatureSet to a `ModelStage` in a model graph, you create symbolic references rather than passing data directly.\n",
    "\n",
    "Below is a quick overview; more details are provided in the following notebooks:\n",
    "* {doc}`02_create_modelnode`\n",
    "* {doc}`03_create_modelgraph`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "115",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Multi-column reference (used for ModelStage inputs)\n",
    "ref = fs_charge.reference(features=\"voltage\", targets=\"soh\", rep=\"transformed\")\n",
    "print(f\"Reference type: {type(ref).__name__}\")\n",
    "print(f\"Reference: {ref}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "116",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Single-column reference (specify rep when multiple representations exist)\n",
    "col_ref = fs_charge.column_reference(feature=\"voltage\", rep=\"transformed\")\n",
    "print(f\"Column reference type: {type(col_ref).__name__}\")\n",
    "print(f\"Column reference: {col_ref}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "117",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "118",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "| Task | Method |\n",
    "|------|--------|\n",
    "| Create from dict | `FeatureSet.from_dict(label, data, feature_keys, target_keys, tag_keys)` |\n",
    "| Create from DataFrame | `FeatureSet.from_pandas(label, df, feature_cols, target_cols, tag_cols, groupby_cols)` |\n",
    "| Create from Arrow | `FeatureSet.from_pyarrow_table(label, table)` |\n",
    "| Inspect keys | `get_feature_keys()`, `get_target_keys()`, `get_tag_keys()`, `get_all_keys()` |\n",
    "| Inspect shapes/dtypes | `get_feature_shapes()`, `get_feature_dtypes()` |\n",
    "| Get data | `get_features(fmt=...)`, `get_targets(fmt=...)`, `get_tags(fmt=...)` |\n",
    "| Unified access | `get_data(features=..., targets=..., tags=..., fmt=...)` |\n",
    "| Filter rows | `filter(conditions={...})` |\n",
    "| Subset by index | `take(indices)` |\n",
    "| Select columns | `select(features=..., targets=..., rep=...)` |\n",
    "| Split randomly | `split_random(ratios, group_by, seed)` |\n",
    "| Split by condition | `split_by_condition({split_name: {col: condition}})` |\n",
    "| Apply transform | `fit_transform(scaler, domain, keys, fit_to_split)` |\n",
    "| Undo transform | `undo_last_transform(domain, keys)` / `undo_all_transforms()` |\n",
    "| Inverse-scale data | `unscale_data_for_cols(data, domain, columns)` |\n",
    "| Save / Load | `save(path)` / `FeatureSet.load(path)` |\n",
    "| Copy | `copy(restore_splits, restore_scalers)` |\n",
    "| Create reference | `reference(features, targets)` / `column_reference(feature)` |"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv (3.13.5)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}