{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0",
   "metadata": {},
   "source": [
    "# How to: Create and Use a FeatureSet\n",
    "\n",
    "The `FeatureSet` is the central data container in ModularML. It organizes your data into three domains:\n",
    "\n",
    "- **Features**: model inputs (e.g., time-series signals, sensor readings)\n",
    "- **Targets**: values to predict (e.g., state-of-health, capacity)\n",
    "- **Tags**: metadata for grouping and filtering (e.g., cell ID, temperature)\n",
    "\n",
    "Under the hood, a `FeatureSet` wraps a `SampleCollection`, which stores all data in a columnar [Apache Arrow](https://arrow.apache.org/) table. Each column follows the naming convention `<domain>.<key>.<representation>` (e.g., `features.voltage.raw`). A `SampleSchema` tracks the structure, shapes, and data types.\n",
    "\n",
    "This notebook covers the complete `FeatureSet` API:\n",
    "\n",
    "- {ref}`01-create-featureset-creating-a-featureset`\n",
    "- {ref}`01-create-featureset-inspecting-a-featureset`\n",
    "- {ref}`01-create-featureset-accessing-data`\n",
    "- {ref}`01-create-featureset-row-subsetting-and-filtering`\n",
    "- {ref}`01-create-featureset-column-subsetting`\n",
    "- {ref}`01-create-featureset-splitting-data`\n",
    "- {ref}`01-create-featureset-transforms-and-scaling`\n",
    "- {ref}`01-create-featureset-serialization-save-and-load`\n",
    "- {ref}`01-create-featureset-references-for-model-graph-wiring`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "import modularml as mml\n",
    "from modularml import FeatureSet"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2",
   "metadata": {},
   "source": [
    "We'll use synthetic battery pulse-response data throughout this notebook. Each sample contains a 101-point voltage time-series, a scalar state-of-health (SOH) target, and metadata tags."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3",
   "metadata": {},
   "outputs": [],
   "source": [
    "N_SAMPLES = 1000\n",
    "N_CELLS = 20\n",
    "N_GROUPS = 5\n",
    "TIME = np.linspace(0, 100, 101)\n",
    "\n",
    "rng = np.random.default_rng(42)\n",
    "\n",
    "# Assign each sample a cell, group, pulse type, and SOC\n",
    "cell_ids = rng.integers(1, N_CELLS + 1, size=N_SAMPLES)\n",
    "group_ids = rng.integers(1, N_GROUPS + 1, size=N_SAMPLES)\n",
    "pulse_types = rng.choice([\"chg\", \"dchg\"], size=N_SAMPLES)\n",
    "pulse_socs = rng.choice([10, 20, 30, 40, 50, 60, 70, 80, 90], size=N_SAMPLES)\n",
    "\n",
    "# SOH degrades with group_id (higher group = more degraded)\n",
    "soh = 100.0 - (group_ids - 1) * 8.0 + rng.normal(0, 2, size=N_SAMPLES)\n",
    "\n",
    "# Synthetic voltage: baseline + pulse shape, shifted by SOC and degraded by SOH\n",
    "voltage = np.zeros((N_SAMPLES, 101))\n",
    "for i in range(N_SAMPLES):\n",
    "    base = 3.2 + pulse_socs[i] / 100.0 * 0.5\n",
    "    amplitude = 0.3 * (soh[i] / 100.0)\n",
    "    sign = 1.0 if pulse_types[i] == \"chg\" else -1.0\n",
    "    curve = sign * amplitude * (1 - np.exp(-TIME / 15.0))\n",
    "    voltage[i] = base + curve + rng.normal(0, 0.002, size=101)\n",
    "\n",
    "data = {\n",
    "    \"voltage\": voltage.tolist(),\n",
    "    \"soh\": soh.tolist(),\n",
    "    \"cell_id\": cell_ids.tolist(),\n",
    "    \"group_id\": group_ids.tolist(),\n",
    "    \"pulse_type\": pulse_types.tolist(),\n",
    "    \"pulse_soc\": pulse_socs.tolist(),\n",
    "}\n",
    "\n",
    "print(f\"Samples: {N_SAMPLES}\")\n",
    "print(f\"Voltage shape per sample: {voltage[0].shape}\")\n",
    "print(f\"SOH range: [{soh.min():.1f}, {soh.max():.1f}]\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5",
   "metadata": {},
   "source": [
    "(01-create-featureset-creating-a-featureset)=\n",
    "## Creating a FeatureSet"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6",
   "metadata": {},
   "source": [
    "\n",
    "Three class methods are available for construction."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7",
   "metadata": {},
   "source": [
    "### `from_dict()` — From a Python dictionary\n",
    "\n",
    "The most common constructor. Pass a dict where each key maps to a list/array of values (one entry per sample), then specify which keys are features, targets, and tags."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs = FeatureSet.from_dict(\n",
    "    label=\"PulseData\",\n",
    "    data=data,\n",
    "    feature_keys=\"voltage\",\n",
    "    target_keys=\"soh\",\n",
    "    tag_keys=[\"cell_id\", \"group_id\", \"pulse_type\", \"pulse_soc\"],\n",
    ")\n",
    "print(fs)\n",
    "print(f\"Feature shapes: {fs.get_feature_shapes()}\")\n",
    "print(f\"Target shapes:  {fs.get_target_shapes()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9",
   "metadata": {},
   "source": [
    "When accessing FeatureSet data, you'll notice that all keys are returned in the `<domain>.<key>.<representation>` format by default.\n",
    "You can modify the returned string with the `include_rep_suffix` and `include_domain_prefix` arguments in all `FeatureSet.get_<>` methods.\n",
    "\n",
    "*Note that omitting a key component raises an error if the result is a non-unique key (e.g., you have two representations of the same feature column).*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\n",
    "    f\"Feature shapes: {fs.get_feature_shapes(include_domain_prefix=False, include_rep_suffix=False)}\",\n",
    ")"
   ]
  },
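  {
   "cell_type": "markdown",
   "id": "10a",
   "metadata": {},
   "source": [
    "The effect of dropping key components can be sketched in plain Python. The helpers `shorten_key` and `shorten_all` below are hypothetical and not part of ModularML; they assume that no key contains a dot, and they only illustrate why omitting the representation suffix becomes ambiguous once a column has more than one representation.\n",
    "\n",
    "```python\n",
    "def shorten_key(full_key, include_domain_prefix=True, include_rep_suffix=True):\n",
    "    # Split '<domain>.<key>.<representation>' into its three parts\n",
    "    domain, key, rep = full_key.split('.')\n",
    "    parts = [key]\n",
    "    if include_domain_prefix:\n",
    "        parts.insert(0, domain)\n",
    "    if include_rep_suffix:\n",
    "        parts.append(rep)\n",
    "    return '.'.join(parts)\n",
    "\n",
    "\n",
    "def shorten_all(full_keys, **kwargs):\n",
    "    short = [shorten_key(k, **kwargs) for k in full_keys]\n",
    "    # Refuse to return names that no longer identify a single column\n",
    "    if len(set(short)) != len(short):\n",
    "        raise ValueError(f'Non-unique keys after shortening: {short}')\n",
    "    return short\n",
    "\n",
    "\n",
    "print(shorten_all(['features.voltage.raw', 'targets.soh.raw'], include_rep_suffix=False))\n",
    "\n",
    "# A second representation of the same column makes the short form ambiguous\n",
    "try:\n",
    "    shorten_all(\n",
    "        ['features.voltage.raw', 'features.voltage.transformed'],\n",
    "        include_rep_suffix=False,\n",
    "    )\n",
    "except ValueError as err:\n",
    "    print(err)\n",
    "```\n",
    "\n",
    "In practice, keeping the full `<domain>.<key>.<representation>` form is the safest choice whenever a column may gain additional representations later."
   ]
  },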
  {
   "cell_type": "markdown",
   "id": "11",
   "metadata": {},
   "source": [
    "### `from_pandas()` — From a Pandas DataFrame\n",
    "\n",
    "Builds a FeatureSet directly from a Pandas DataFrame.\n",
    "As with `from_dict()`, we need to assign column names to features, targets, and tags."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "12",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a simple DataFrame example\n",
    "df = pd.DataFrame(\n",
    "    {\n",
    "        \"temperature\": np.random.default_rng(0).normal(25, 5, size=100),\n",
    "        \"humidity\": np.random.default_rng(1).normal(60, 10, size=100),\n",
    "        \"output_power\": np.random.default_rng(2).normal(100, 15, size=100),\n",
    "        \"site_id\": np.repeat([\"A\", \"B\", \"C\", \"D\"], 25),\n",
    "        \"timestamp\": np.arange(25).tolist() * 4,\n",
    "    },\n",
    ")\n",
    "\n",
    "fs_from_df = FeatureSet.from_pandas(\n",
    "    label=\"WeatherData\",\n",
    "    df=df,\n",
    "    feature_cols=[\"temperature\", \"humidity\"],\n",
    "    target_cols=\"output_power\",\n",
    "    tag_cols=\"site_id\",\n",
    ")\n",
    "print(fs_from_df)\n",
    "print(f\"Feature shapes: {fs_from_df.get_feature_shapes()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13",
   "metadata": {},
   "source": [
    "Note that the above approach treats every row in the DataFrame as a unique sample for modeling.\n",
    "\n",
    "If that's not the case, the rows belonging to each sample must be grouped and aggregated.\n",
    "The `from_pandas` constructor provides the `group_by` and `sort_by` arguments to do just that.\n",
    "\n",
    "Below, we group all rows in our DataFrame by the `'site_id'` at which the data was measured, and then sort the data points by `'timestamp'` within each sample. Notice how the features now have a shape of (25,)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs_grouped = FeatureSet.from_pandas(\n",
    "    label=\"WeatherGrouped\",\n",
    "    df=df,\n",
    "    feature_cols=[\"temperature\", \"humidity\"],\n",
    "    target_cols=\"output_power\",\n",
    "    group_by=\"site_id\",\n",
    "    sort_by=\"timestamp\",\n",
    "    tag_cols=[\"site_id\", \"timestamp\"],\n",
    ")\n",
    "print(fs_grouped)\n",
    "print(f\"Feature shapes: {fs_grouped.get_feature_shapes()}\")"
   ]
  },
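  {
   "cell_type": "markdown",
   "id": "14a",
   "metadata": {},
   "source": [
    "Conceptually, `group_by` and `sort_by` turn many rows into one sequence-valued sample per group. The plain-Python sketch below illustrates that idea only; `rows_to_samples` is a hypothetical helper and makes no claim about how ModularML implements the grouping internally.\n",
    "\n",
    "```python\n",
    "from collections import defaultdict\n",
    "\n",
    "\n",
    "def rows_to_samples(rows, group_by, sort_by, value_cols):\n",
    "    # Collect the rows that belong to each group\n",
    "    grouped = defaultdict(list)\n",
    "    for row in rows:\n",
    "        grouped[row[group_by]].append(row)\n",
    "\n",
    "    samples = {}\n",
    "    for group, members in grouped.items():\n",
    "        # Order the rows inside each group before stacking them\n",
    "        ordered = sorted(members, key=lambda r: r[sort_by])\n",
    "        samples[group] = {col: [r[col] for r in ordered] for col in value_cols}\n",
    "    return samples\n",
    "\n",
    "\n",
    "demo_rows = [\n",
    "    {'site_id': 'A', 'timestamp': 1, 'temperature': 21.0},\n",
    "    {'site_id': 'B', 'timestamp': 0, 'temperature': 30.0},\n",
    "    {'site_id': 'A', 'timestamp': 0, 'temperature': 20.0},\n",
    "]\n",
    "demo_samples = rows_to_samples(demo_rows, 'site_id', 'timestamp', ['temperature'])\n",
    "print(demo_samples)\n",
    "```\n",
    "\n",
    "Each group becomes a single sample whose feature is a time-ordered sequence, which is why the grouped FeatureSet above reports one sequence per site rather than one scalar per row."
   ]
  },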
  {
   "cell_type": "markdown",
   "id": "15",
   "metadata": {},
   "source": [
    "### `from_pyarrow_table()` — From an Arrow table\n",
    "\n",
    "If you already have a `pyarrow.Table` with columns following the `<domain>.<key>.<rep>` naming convention, you can wrap it directly.\n",
    "\n",
    "*Unless you are certain that the existing table uses the appropriate schema, it is recommended to convert it with `table.to_pandas()` and then use the `from_pandas()` constructor.*\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "16",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pyarrow as pa\n",
    "\n",
    "table = pa.table(\n",
    "    {\n",
    "        \"features.x.raw\": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],\n",
    "        \"targets.y.raw\": [0.5, 1.5],\n",
    "        \"tags.group.raw\": [\"a\", \"b\"],\n",
    "    },\n",
    ")\n",
    "\n",
    "fs_arrow = FeatureSet.from_pyarrow_table(label=\"ArrowExample\", table=table)\n",
    "print(fs_arrow)\n",
    "print(f\"Feature shapes: {fs_arrow.get_feature_shapes()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "18",
   "metadata": {},
   "source": [
    "(01-create-featureset-inspecting-a-featureset)=\n",
    "## Inspecting a FeatureSet"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19",
   "metadata": {},
   "source": [
    "\n",
    "Use the following properties and methods to understand the structure of your data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "20",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Basic info\n",
    "print(f\"Label:      {fs.label}\")\n",
    "print(f\"Samples:    {len(fs)}\")\n",
    "print(f\"repr:       {fs!r}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "21",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Column keys by domain\n",
    "print(\"Feature keys:\", fs.get_feature_keys())\n",
    "print(\"Target keys: \", fs.get_target_keys())\n",
    "print(\"Tag keys:    \", fs.get_tag_keys())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "22",
   "metadata": {},
   "outputs": [],
   "source": [
    "# All keys with full qualification (domain prefix + rep suffix)\n",
    "fs.get_all_keys(include_domain_prefix=True, include_rep_suffix=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "23",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Shapes and dtypes\n",
    "print(\"Feature shapes:\", fs.get_feature_shapes())\n",
    "print(\"Target shapes: \", fs.get_target_shapes())\n",
    "print(\"Tag shapes:    \", fs.get_tag_shapes())\n",
    "print()\n",
    "print(\"Feature dtypes:\", fs.get_feature_dtypes())\n",
    "print(\"Target dtypes: \", fs.get_target_dtypes())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "24",
   "metadata": {},
   "source": [
    "Note that most data-containing classes in ModularML also support a `summary()` method.\n",
    "\n",
    "Printing the result gives a formatted summary of the object's characteristics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "25",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(fs.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "26",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "27",
   "metadata": {},
   "source": [
    "(01-create-featureset-accessing-data)=\n",
    "## Accessing Data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28",
   "metadata": {},
   "source": [
    "\n",
    "Data can be retrieved in multiple formats via the `fmt` parameter. Accepted values include `\"numpy\"`, `\"pandas\"`, `\"dict_numpy\"`, `\"torch\"`, and more."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29",
   "metadata": {},
   "source": [
    "### Domain-specific accessors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "30",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get all features as a dict of numpy arrays (default)\n",
    "features = fs.get_features()\n",
    "print(f\"Type: {type(features)}\")\n",
    "print(f\"Keys: {list(features.keys())}\")\n",
    "print(f\"Voltage shape: {features['voltage'].shape}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "31",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get a single feature by name, as numpy\n",
    "voltage = fs.get_features(fmt=\"numpy\", features=\"voltage\")\n",
    "print(f\"Shape: {voltage.shape}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "32",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get targets as a pandas DataFrame\n",
    "targets_df = fs.get_targets(fmt=\"pandas\")\n",
    "targets_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "33",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get specific tags as a dict of numpy arrays\n",
    "tags = fs.get_tags(fmt=\"dict_numpy\", tags=[\"cell_id\", \"pulse_type\"])\n",
    "print(f\"Type: {type(tags)}\")\n",
    "print(f\"Cell IDs (first 5): {tags['cell_id'][:5]}\")\n",
    "print(f\"Pulse types (first 5): {tags['pulse_type'][:5]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "34",
   "metadata": {},
   "source": [
    "### Unified accessor: `get_data()`\n",
    "\n",
    "Retrieve columns from multiple domains in a single call. Supports wildcards and a default `rep` parameter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "35",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Specific columns from different domains\n",
    "result = fs.get_data(\n",
    "    features=\"voltage\",\n",
    "    targets=\"soh\",\n",
    "    tags=\"*\",\n",
    "    fmt=\"pandas\",\n",
    ")\n",
    "result.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "36",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sample UUIDs - each sample has a unique identifier\n",
    "uuids = fs.get_sample_uuids()\n",
    "print(f\"First 3 UUIDs: {uuids[:3]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "37",
   "metadata": {},
   "source": [
    "### Export to pandas or Arrow"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "38",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Full export to pandas\n",
    "df_all = fs.to_pandas()\n",
    "df_all"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "39",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Export to Arrow table\n",
    "arrow_table = fs.to_arrow()\n",
    "print(f\"Arrow schema:\\n{arrow_table.schema}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "40",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "41",
   "metadata": {},
   "source": [
    "(01-create-featureset-row-subsetting-and-filtering)=\n",
    "## Row Subsetting and Filtering"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42",
   "metadata": {},
   "source": [
    "\n",
    "All row-subsetting operations return a `FeatureSetView` - a lightweight, zero-copy window over the parent `FeatureSet`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "43",
   "metadata": {},
   "source": [
    "### `filter()` — Condition-based filtering\n",
    "\n",
    "Conditions are a dict mapping fully-qualified column names to:\n",
    "- A **scalar** (equality match)\n",
    "- A **list/set** (membership test)\n",
    "- A **callable** (row-wise predicate)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "44",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Filter by equality\n",
    "view_chg = fs.filter(conditions={\"tags.pulse_type.raw\": \"chg\"})\n",
    "print(f\"Charge-only: {view_chg}\")\n",
    "print(np.unique(view_chg.get_tags(fmt=\"np\", tags=\"pulse_type\")))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "45",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Filter by list membership\n",
    "view_cells = fs.filter(conditions={\"tags.cell_id.raw\": [1, 2, 3]})\n",
    "print(f\"Cells 1-3: {view_cells}\")\n",
    "print(np.unique(view_cells.get_tags(fmt=\"np\", tags=\"cell_id\")))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "46",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Filter with callable + multiple conditions (AND-composed)\n",
    "view_healthy_chg = fs.filter(\n",
    "    conditions={\n",
    "        \"tags.pulse_type.raw\": \"chg\",\n",
    "        \"targets.soh.raw\": lambda x: x >= 90.0,\n",
    "    },\n",
    ")\n",
    "print(f\"Healthy charge pulses: {view_healthy_chg}\")"
   ]
  },
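  {
   "cell_type": "markdown",
   "id": "46a",
   "metadata": {},
   "source": [
    "The three condition types and their AND composition can be sketched without ModularML. The function `matches` below is hypothetical; it applies each callable to one value at a time, whereas the library may evaluate predicates differently (for example, vectorized over a whole column).\n",
    "\n",
    "```python\n",
    "def matches(row, conditions):\n",
    "    # Every condition must hold for the row to be kept (AND composition)\n",
    "    for column, cond in conditions.items():\n",
    "        value = row[column]\n",
    "        if callable(cond):\n",
    "            ok = bool(cond(value))      # predicate\n",
    "        elif isinstance(cond, (list, set, tuple)):\n",
    "            ok = value in cond          # membership test\n",
    "        else:\n",
    "            ok = value == cond          # equality match\n",
    "        if not ok:\n",
    "            return False\n",
    "    return True\n",
    "\n",
    "\n",
    "filter_rows = [\n",
    "    {'tags.pulse_type.raw': 'chg', 'targets.soh.raw': 95.0},\n",
    "    {'tags.pulse_type.raw': 'chg', 'targets.soh.raw': 82.0},\n",
    "    {'tags.pulse_type.raw': 'dchg', 'targets.soh.raw': 97.0},\n",
    "]\n",
    "filter_conditions = {\n",
    "    'tags.pulse_type.raw': 'chg',\n",
    "    'targets.soh.raw': lambda x: x >= 90.0,\n",
    "}\n",
    "kept = [i for i, row in enumerate(filter_rows) if matches(row, filter_conditions)]\n",
    "print(kept)\n",
    "```\n",
    "\n",
    "Only the first row satisfies both conditions: the second fails the SOH predicate and the third fails the equality match."
   ]
  },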
  {
   "cell_type": "markdown",
   "id": "47",
   "metadata": {},
   "source": [
    "### `take()` — By relative index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "48",
   "metadata": {},
   "outputs": [],
   "source": [
    "view_first10 = fs.take([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], label=\"first_10\")\n",
    "print(view_first10)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49",
   "metadata": {},
   "source": [
    "### `take_sample_uuids()` — By UUID"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "50",
   "metadata": {},
   "outputs": [],
   "source": [
    "some_uuids = fs.get_sample_uuids()[:5].tolist()\n",
    "view_by_uuid = fs.take_sample_uuids(some_uuids, label=\"uuid_subset\")\n",
    "print(view_by_uuid)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "51",
   "metadata": {},
   "source": [
    "### Set operations between views"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "52",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Intersection: samples in both views\n",
    "view_a = fs.filter(conditions={\"tags.pulse_type.raw\": \"chg\"})\n",
    "print(\"View A:\")\n",
    "print(\" - view:\", view_a)\n",
    "print(\" - cells:\", np.unique(view_a.get_tags(fmt=\"np\", tags=\"cell_id\")))\n",
    "print(\" - pulse_types:\", np.unique(view_a.get_tags(fmt=\"np\", tags=\"pulse_type\")))\n",
    "\n",
    "view_b = fs.filter(conditions={\"tags.cell_id.raw\": [1, 2, 3]})\n",
    "print(\"\\nView B:\")\n",
    "print(\" - view:\", view_b)\n",
    "print(\" - cells:\", np.unique(view_b.get_tags(fmt=\"np\", tags=\"cell_id\")))\n",
    "print(\" - pulse_types:\", np.unique(view_b.get_tags(fmt=\"np\", tags=\"pulse_type\")))\n",
    "\n",
    "\n",
    "view_intersect = view_a.take_intersection(view_b, label=\"chg_cells_1to3\")\n",
    "print(\"\\nIntersection:\")\n",
    "print(\" - view:\", view_intersect)\n",
    "print(\" - cells:\", np.unique(view_intersect.get_tags(fmt=\"np\", tags=\"cell_id\")))\n",
    "print(\n",
    "    \" - pulse_types:\",\n",
    "    np.unique(view_intersect.get_tags(fmt=\"np\", tags=\"pulse_type\")),\n",
    ")\n",
    "\n",
    "# Difference: samples in A but not in B\n",
    "view_diff = view_a.take_difference(view_b, label=\"chg_not_cells_1to3\")\n",
    "print(\"\\nDifference:\")\n",
    "print(\" - view:\", view_diff)\n",
    "print(\" - cells:\", np.unique(view_diff.get_tags(fmt=\"np\", tags=\"cell_id\")))\n",
    "print(\" - pulse_types:\", np.unique(view_diff.get_tags(fmt=\"np\", tags=\"pulse_type\")))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "53",
   "metadata": {},
   "source": [
    "We can also check view overlap via the `is_disjoint_with` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "54",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"view_diff does not contain view_b samples: \", view_b.is_disjoint_with(view_diff))"
   ]
  },
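  {
   "cell_type": "markdown",
   "id": "54a",
   "metadata": {},
   "source": [
    "Because a view is described as a window over the parent's rows, these operations behave like ordinary set algebra on row indices. The sketch below uses Python's built-in `set` with made-up index values; it is an analogy rather than ModularML code.\n",
    "\n",
    "```python\n",
    "view_a_idx = {0, 1, 2, 5, 8}  # e.g. rows where pulse_type == 'chg'\n",
    "view_b_idx = {1, 2, 3, 4}     # e.g. rows where cell_id is 1, 2, or 3\n",
    "\n",
    "intersection_idx = view_a_idx & view_b_idx  # rows in both A and B\n",
    "difference_idx = view_a_idx - view_b_idx    # rows in A but not in B\n",
    "\n",
    "print(sorted(intersection_idx))\n",
    "print(sorted(difference_idx))\n",
    "\n",
    "# By construction, A minus B shares no rows with B\n",
    "print(view_b_idx.isdisjoint(difference_idx))\n",
    "```\n",
    "\n",
    "The last line mirrors the `is_disjoint_with` check above: a difference can never overlap the view that was subtracted."
   ]
  },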
  {
   "cell_type": "markdown",
   "id": "55",
   "metadata": {},
   "source": [
    "### Converting a view back to a FeatureSet\n",
    "\n",
    "A `FeatureSetView` is a lightweight reference to row indices in the parent FeatureSet.\n",
    "Any modification to the parent FeatureSet will change the data accessed through its child views.\n",
    "\n",
    "To create an independent `FeatureSet` from a view, use `to_featureset()`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "56",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs_charge = view_chg.to_featureset(label=\"ChargePulses\")\n",
    "print(fs_charge)"
   ]
  },
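  {
   "cell_type": "markdown",
   "id": "56a",
   "metadata": {},
   "source": [
    "The difference between a view and an independent copy can be illustrated with a small stand-in class. `RowView` is hypothetical and far simpler than `FeatureSetView`; it only shows that a view stores indices and looks data up in the parent at access time.\n",
    "\n",
    "```python\n",
    "class RowView:\n",
    "    '''Hypothetical stand-in for a view: stores row indices, not data.'''\n",
    "\n",
    "    def __init__(self, parent, indices):\n",
    "        self.parent = parent\n",
    "        self.indices = list(indices)\n",
    "\n",
    "    def values(self):\n",
    "        # Data is looked up in the parent every time it is requested\n",
    "        return [self.parent[i] for i in self.indices]\n",
    "\n",
    "    def materialize(self):\n",
    "        # Independent copy, analogous in spirit to to_featureset()\n",
    "        return list(self.values())\n",
    "\n",
    "\n",
    "parent_rows = [10, 20, 30, 40]\n",
    "row_view = RowView(parent_rows, [1, 3])\n",
    "snapshot = row_view.materialize()\n",
    "\n",
    "parent_rows[1] = 99  # modify the parent after the view was created\n",
    "\n",
    "print(row_view.values())  # reflects the change in the parent\n",
    "print(snapshot)           # unaffected by the change\n",
    "```\n",
    "\n",
    "This is the reason to call `to_featureset()` when a subset must stay stable regardless of later changes to the parent."
   ]
  },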
  {
   "cell_type": "markdown",
   "id": "57",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "58",
   "metadata": {},
   "source": [
    "(01-create-featureset-column-subsetting)=\n",
    "## Column Subsetting"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "59",
   "metadata": {},
   "source": [
    "\n",
    "Use `select()` to create a view with only specific columns. Row indices are preserved.\n",
    "\n",
    "`select()` supports the same wildcard usage as `get_data()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "60",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select specific features and targets\n",
    "view_slim = fs.select(features=\"voltage.*\", targets=\"soh\")\n",
    "print(f\"Columns: {view_slim.get_all_keys()}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "61",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select by domain and representation\n",
    "view_raw_only = fs.select(features=\"voltage\", rep=\"raw\")\n",
    "print(\n",
    "    f\"Columns: {view_raw_only.get_all_keys(include_domain_prefix=True, include_rep_suffix=True)}\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "62",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "63",
   "metadata": {},
   "source": [
    "(01-create-featureset-splitting-data)=\n",
    "## Splitting Data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64",
   "metadata": {},
   "source": [
    "\n",
    "Splitting creates named `FeatureSetView` partitions that can optionally be registered on the parent FeatureSet."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "65",
   "metadata": {},
   "source": [
    "### Random splitting\n",
    "\n",
    "Random splitting takes a `ratios` argument that defines the proportion of samples in the calling container assigned to each split name.\n",
    "The ratio values must sum to 1.\n",
    "\n",
    "By default, split views are not returned and are automatically registered on the parent FeatureSet.\n",
    "This behavior can be changed via the `return_views` and `register` arguments."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "66",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs_charge.clear_splits()\n",
    "fs_charge.split_random(\n",
    "    ratios={\"train\": 0.6, \"val\": 0.2, \"test\": 0.2},\n",
    "    seed=42,\n",
    ")\n",
    "\n",
    "print(f\"Available splits: {fs_charge.available_splits}\")\n",
    "for name, view in fs_charge.splits.items():\n",
    "    print(f\"  {name}: {len(view)} samples\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67",
   "metadata": {},
   "source": [
    "Use `group_by` to keep all samples sharing a tag value in the same split (prevents data leakage):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "68",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs_charge.clear_splits()\n",
    "\n",
    "fs_charge.split_random(\n",
    "    ratios={\"train\": 0.5, \"val\": 0.3, \"test\": 0.2},\n",
    "    group_by=\"group_id\",\n",
    "    seed=1,\n",
    ")\n",
    "\n",
    "for name, view in fs_charge.splits.items():\n",
    "    group_ids = view.get_tags(fmt=\"numpy\", tags=\"group_id\")\n",
    "    print(f\"  {name}: {len(view)} samples, groups: {np.unique(group_ids)}\")"
   ]
  },
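  {
   "cell_type": "markdown",
   "id": "68a",
   "metadata": {},
   "source": [
    "The idea behind a grouped split can be sketched with the standard library: whole groups, not individual samples, are assigned to splits. `grouped_split` below is a hypothetical helper that applies the ratios to the number of groups; ModularML's own allocation and rounding rules may differ.\n",
    "\n",
    "```python\n",
    "import random\n",
    "\n",
    "\n",
    "def grouped_split(group_ids, ratios, seed=0):\n",
    "    # Shuffle the unique groups, then hand out whole groups to each split\n",
    "    groups = sorted(set(group_ids))\n",
    "    random.Random(seed).shuffle(groups)\n",
    "\n",
    "    names = list(ratios)\n",
    "    assignment = {}\n",
    "    start = 0\n",
    "    for k, name in enumerate(names):\n",
    "        # The last split takes whatever groups remain\n",
    "        if k == len(names) - 1:\n",
    "            stop = len(groups)\n",
    "        else:\n",
    "            stop = start + round(ratios[name] * len(groups))\n",
    "        for g in groups[start:stop]:\n",
    "            assignment[g] = name\n",
    "        start = stop\n",
    "\n",
    "    # Every sample follows its group into a single split\n",
    "    splits = {name: [] for name in names}\n",
    "    for idx, g in enumerate(group_ids):\n",
    "        splits[assignment[g]].append(idx)\n",
    "    return splits\n",
    "\n",
    "\n",
    "demo_group_ids = [g for g in range(10) for _ in range(3)]  # 10 groups, 3 samples each\n",
    "demo_splits = grouped_split(demo_group_ids, {'train': 0.6, 'val': 0.2, 'test': 0.2}, seed=42)\n",
    "for name, idxs in demo_splits.items():\n",
    "    print(name, len(idxs), sorted({demo_group_ids[i] for i in idxs}))\n",
    "```\n",
    "\n",
    "No group appears in more than one split, so a model evaluated on `test` never sees samples from a group it was trained on. The trade-off is that sample counts only approximate the requested ratios when group sizes are uneven."
   ]
  },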
  {
   "cell_type": "markdown",
   "id": "69",
   "metadata": {},
   "source": [
    "Use `stratify_by` to ensure each split reproduces the distribution of the chosen column (here, `group_id`) found in the calling source.\n",
    "\n",
    "Note that grouping and stratification are mutually exclusive (you can't use both at the same time)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "70",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs_charge.clear_splits()\n",
    "\n",
    "fs_charge.split_random(\n",
    "    ratios={\"train\": 0.5, \"val\": 0.3, \"test\": 0.2},\n",
    "    stratify_by=\"group_id\",\n",
    "    seed=1,\n",
    ")\n",
    "\n",
    "for name, view in fs_charge.splits.items():\n",
    "    group_ids = view.get_tags(fmt=\"numpy\", tags=\"group_id\")\n",
    "    print(f\"  {name}: {len(view)} samples, groups: {np.unique(group_ids)}\")"
   ]
  },
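  {
   "cell_type": "markdown",
   "id": "70a",
   "metadata": {},
   "source": [
    "Stratification works the other way around from grouping: every label value is spread across all splits in the requested proportions. The helper `stratified_split` below is a hypothetical, standard-library sketch of that idea and is not ModularML's implementation.\n",
    "\n",
    "```python\n",
    "import random\n",
    "from collections import defaultdict\n",
    "\n",
    "\n",
    "def stratified_split(labels, ratios, seed=0):\n",
    "    shuffler = random.Random(seed)\n",
    "\n",
    "    # Bucket sample indices by their label\n",
    "    buckets = defaultdict(list)\n",
    "    for idx, label in enumerate(labels):\n",
    "        buckets[label].append(idx)\n",
    "\n",
    "    names = list(ratios)\n",
    "    splits = {name: [] for name in names}\n",
    "    for label in sorted(buckets):\n",
    "        idxs = buckets[label]\n",
    "        shuffler.shuffle(idxs)\n",
    "        start = 0\n",
    "        # Apply the same ratios inside every bucket\n",
    "        for k, name in enumerate(names):\n",
    "            if k == len(names) - 1:\n",
    "                stop = len(idxs)\n",
    "            else:\n",
    "                stop = start + round(ratios[name] * len(idxs))\n",
    "            splits[name].extend(idxs[start:stop])\n",
    "            start = stop\n",
    "    return splits\n",
    "\n",
    "\n",
    "strat_labels = [1] * 10 + [2] * 10\n",
    "strat_splits = stratified_split(strat_labels, {'train': 0.5, 'val': 0.3, 'test': 0.2}, seed=1)\n",
    "for name, idxs in strat_splits.items():\n",
    "    print(name, len(idxs), sorted({strat_labels[i] for i in idxs}))\n",
    "```\n",
    "\n",
    "Here every split contains both label values, which is exactly what a grouped split forbids. That contrast is why the two options cannot be combined."
   ]
  },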
  {
   "cell_type": "markdown",
   "id": "71",
   "metadata": {},
   "source": [
    "### Condition-based splitting\n",
    "\n",
    "Assign samples to splits using explicit conditions on any column.\n",
    "\n",
    "Condition-based splitting takes a `condition` argument, which is a nested dict of the form `{split_name: {column: condition}}`.\n",
    "\n",
    "Samples that satisfy all conditions listed under a split name are assigned to that split.\n",
    "A warning is raised if the resulting splits are not mutually exclusive."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "72",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs_charge.clear_splits()\n",
    "\n",
    "fs_charge.split_by_condition(\n",
    "    {\n",
    "        \"train\": {\"tags.group_id.raw\": [1, 2, 3]},\n",
    "        \"val\": {\"tags.group_id.raw\": [4]},\n",
    "        \"test\": {\"tags.group_id.raw\": [5]},\n",
    "    },\n",
    ")\n",
    "\n",
    "for name, view in fs_charge.splits.items():\n",
    "    group_ids = view.get_tags(fmt=\"numpy\", tags=\"group_id\")\n",
    "    print(f\"  {name}: {len(view)} samples, groups: {np.unique(group_ids)}\")"
   ]
  },
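  {
   "cell_type": "markdown",
   "id": "72a",
   "metadata": {},
   "source": [
    "A mutual-exclusivity check of the kind described above can be sketched as a pairwise comparison of the row indices in each split. `find_overlaps` is a hypothetical helper with made-up indices, shown only to make the idea concrete.\n",
    "\n",
    "```python\n",
    "from itertools import combinations\n",
    "\n",
    "\n",
    "def find_overlaps(split_indices):\n",
    "    # Report every pair of splits that shares at least one sample\n",
    "    overlaps = {}\n",
    "    for (name_a, idx_a), (name_b, idx_b) in combinations(split_indices.items(), 2):\n",
    "        shared = set(idx_a) & set(idx_b)\n",
    "        if shared:\n",
    "            overlaps[(name_a, name_b)] = sorted(shared)\n",
    "    return overlaps\n",
    "\n",
    "\n",
    "overlap_demo = find_overlaps({'train': [0, 1, 2], 'val': [2, 3], 'test': [4]})\n",
    "print(overlap_demo)\n",
    "```\n",
    "\n",
    "An empty result means the conditions partition the samples cleanly; any entry points to conditions that should be tightened before training."
   ]
  },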
  {
   "cell_type": "markdown",
   "id": "73",
   "metadata": {},
   "source": [
    "### Nested splits\n",
    "\n",
    "Splitting can be called on any existing split, in addition to directly on the parent FeatureSet.\n",
    "The nested split conditions will only draw from the samples available in the calling view.\n",
    "\n",
    "This allows us to \"nest\" split conditions to create more complex modeling setups.\n",
    "\n",
    "*Note that the sub-splits will inherently overlap with the calling view, and care should be taken when using these splits in downstream modeling.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "74",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs_charge.clear_splits()\n",
    "\n",
    "fs_charge.split_by_condition(\n",
    "    {\n",
    "        \"source\": {\n",
    "            \"targets.soh.raw\": lambda soh: soh >= 90,\n",
    "            \"tags.group_id.raw\": [1, 2, 3],\n",
    "        },\n",
    "        \"test\": {\n",
    "            \"targets.soh.raw\": lambda soh: soh < 90,\n",
    "            \"tags.group_id.raw\": [4, 5],\n",
    "        },\n",
    "    },\n",
    ")\n",
    "fs_charge.get_split(\"source\").split_random(\n",
    "    ratios={\"train\": 0.8, \"val\": 0.2},\n",
    "    stratify_by=\"group_id\",\n",
    ")\n",
    "\n",
    "for name, view in fs_charge.splits.items():\n",
    "    group_ids = view.get_tags(fmt=\"numpy\", tags=\"group_id\")\n",
    "    print(f\"  {name}: {len(view)} samples, groups: {np.unique(group_ids)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "75",
   "metadata": {},
   "source": [
    "### Registering custom views with `add_split()`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "76",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs_charge.clear_splits()\n",
    "\n",
    "# Build custom views using any filter logic\n",
    "train_view = fs_charge.filter(\n",
    "    conditions={\n",
    "        \"tags.group_id.raw\": [1, 2, 3],\n",
    "        \"targets.soh.raw\": lambda x: x >= 80.0,\n",
    "    },\n",
    "    label=\"train\",\n",
    ")\n",
    "test_view = fs_charge.filter(\n",
    "    conditions={\"tags.group_id.raw\": [4, 5]},\n",
    "    label=\"test\",\n",
    ")\n",
    "\n",
    "# Register them as named splits\n",
    "fs_charge.add_split(train_view)\n",
    "fs_charge.add_split(test_view)\n",
    "\n",
    "for name, view in fs_charge.splits.items():\n",
    "    print(f\"  {name}: {len(view)} samples\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "77",
   "metadata": {},
   "source": [
    "Let's restore a simple train/val/test split for scaling. Passing `return_views=True` also returns the split views directly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "78",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs_charge.clear_splits()\n",
    "\n",
    "split_views = fs_charge.split_random(\n",
    "    ratios={\"train\": 0.6, \"val\": 0.2, \"test\": 0.2},\n",
    "    group_by=\"group_id\",\n",
    "    seed=42,\n",
    "    return_views=True,\n",
    ")\n",
    "\n",
    "for name, view in split_views.items():\n",
    "    print(f\"  {name}: {len(view)} samples\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "79",
   "metadata": {},
   "source": [
    "Note that most ModularML core classes implement a `.visualize()` method.\n",
    "For FeatureSets, this displays a Mermaid diagram of all splits registered to the FeatureSet.\n",
    "\n",
    "*Note that you will need to install a Mermaid rendering extension for your IDE. I use \"Markdown Preview Mermaid Support\".*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "80",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs_charge.visualize(show_overlaps=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "81",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "82",
   "metadata": {},
   "source": [
    "(01-create-featureset-transforms-and-scaling)=\n",
    "## Transforms and Scaling"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "83",
   "metadata": {},
   "source": [
    "\n",
    "Apply preprocessing transforms to features or targets. Transforms are tracked and can be undone.\n",
    "\n",
    "Several scalers are built into ModularML and can be listed via the `Scaler.get_supported_scalers()` command.\n",
    "You can also create custom scalers, as outlined below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "84",
   "metadata": {},
   "outputs": [],
   "source": [
    "# List all available scalers\n",
    "mml.supported_scalers"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "85",
   "metadata": {},
   "source": [
    "Let's create a little utility to plot our voltages so we can verify our transforms:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "86",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "\n",
    "def plot_voltages(\n",
    "    fs: FeatureSet,\n",
    "    n_samples: int = 200,\n",
    "    rep: str = \"transformed\",\n",
    "    seed: int = 13,\n",
    "):\n",
    "    \"\"\"\n",
    "    Plot the 'voltage' feature contained in the FeatureSet.\n",
    "\n",
    "    Each split will get its own panel.\n",
    "    Colors by SOH (dark blue = high SOH, light blue = low SOH)\n",
    "\n",
    "    Args:\n",
    "        fs (FeatureSet): FeatureSet to use.\n",
    "        n_samples (int, optional): The number of samples in `fs` that will\n",
    "              get plotted. Defaults to 200.\n",
    "        rep (str, optional): The representation of the data to plot (e.g., \"raw\"\n",
    "              or \"transformed\"). Defaults to \"transformed\".\n",
    "        seed (int, optional): A seed to ensure the same samples get plotted\n",
    "              with repeated calls. Defaults to 13.\n",
    "\n",
    "    \"\"\"\n",
    "\n",
    "    def order_splits(values: list[str]) -> list[str]:\n",
    "        priority = {\"train\": 0, \"val\": 1, \"test\": 2}\n",
    "        return sorted(values, key=lambda x: priority.get(x, 99))\n",
    "\n",
    "    rng = np.random.default_rng(seed)\n",
    "    scm = plt.cm.ScalarMappable(\n",
    "        cmap=plt.cm.Blues,\n",
    "        norm=plt.Normalize(vmin=50, vmax=100),\n",
    "    )\n",
    "\n",
    "    # Verify rep exists\n",
    "    avail_reps = fs.collection._get_rep_keys(domain=\"features\", key=\"voltage\")\n",
    "    if rep not in avail_reps:\n",
    "        rep = \"raw\"\n",
    "\n",
    "    # Create figure with panels for each split\n",
    "    fig, axes = plt.subplots(\n",
    "        figsize=(7, 2.5),\n",
    "        ncols=fs.n_splits,\n",
    "        sharex=True,\n",
    "        sharey=True,\n",
    "    )\n",
    "    split_names = order_splits(fs.available_splits)\n",
    "    for i, split_label in enumerate(split_names):\n",
    "        # For each split, get all voltage features and group_ids\n",
    "        split_view = fs.get_split(split_label)\n",
    "        voltages = np.squeeze(\n",
    "            split_view.get_features(features=\"voltage\", fmt=\"numpy\", rep=rep),\n",
    "        )\n",
    "        sohs = np.squeeze(split_view.get_targets(targets=\"soh\", fmt=\"numpy\", rep=\"raw\"))\n",
    "\n",
    "        # Select n_samples\n",
    "        sample_idxs = rng.choice(np.arange(0, len(voltages)), size=n_samples)\n",
    "        for idx in sample_idxs:\n",
    "            axes[i].plot(voltages[idx], color=scm.to_rgba(sohs[idx]))\n",
    "\n",
    "        axes[i].set_title(split_label, fontsize=10)\n",
    "        axes[i].set_xlabel(\"Time (s)\", fontsize=10)\n",
    "    axes[0].set_ylabel(\"Voltage (V)\", fontsize=10)\n",
    "\n",
    "    # Adjust main subplot area to leave space on the right for colorbar\n",
    "    fig.tight_layout(pad=1)\n",
    "    fig.subplots_adjust(right=0.85)\n",
    "\n",
    "    # Add colorbar as a dedicated panel on the far right\n",
    "    cbar_ax = fig.add_axes([0.87, 0.19, 0.02, 0.7])  # [left, bottom, width, height]\n",
    "    cbar = fig.colorbar(scm, cax=cbar_ax)\n",
    "    cbar.set_label(\"SOH (%)\", fontsize=10)\n",
    "    return fig, axes\n",
    "\n",
    "\n",
    "fig, axes = plot_voltages(fs_charge, n_samples=200)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "87",
   "metadata": {},
   "source": [
    "### Applying a transform\n",
    "\n",
    "`fit_transform()` fits a scaler and stores the result as a `\"transformed\"` representation alongside the original `\"raw\"` data.\n",
    "\n",
    "- `domain`: `\"features\"` or `\"targets\"`\n",
    "- `keys`: which keys to transform (default: all in domain)\n",
    "- `fit_to_split`: fit only on this split's data (prevents data leakage)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "88",
   "metadata": {},
   "outputs": [],
   "source": [
    "from modularml import Scaler\n",
    "\n",
    "# Apply MinMaxScaler to voltage, fitted on training data only\n",
    "fs_charge.fit_transform(\n",
    "    scaler=Scaler(\"MinMaxScaler\"),\n",
    "    domain=\"features\",\n",
    "    keys=\"voltage\",\n",
    "    fit_to_split=\"train\",\n",
    ")\n",
    "\n",
    "# Raw data is preserved - access both representations\n",
    "raw = fs_charge[\"train\"].get_features(fmt=\"numpy\", features=\"voltage\", rep=\"raw\")\n",
    "transformed = fs_charge[\"train\"].get_features(\n",
    "    fmt=\"numpy\",\n",
    "    features=\"voltage\",\n",
    "    rep=\"transformed\",\n",
    ")\n",
    "print(f\"Raw range:         [{raw.min():.3f}, {raw.max():.3f}]\")\n",
    "print(f\"Transformed range: [{transformed.min():.3f}, {transformed.max():.3f}]\")\n",
    "\n",
    "fig, axes = plot_voltages(fs_charge, n_samples=200, rep=\"transformed\")\n",
    "plt.show()"
   ]
  },
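  {
   "cell_type": "markdown",
   "id": "88a",
   "metadata": {},
   "source": [
    "To see why `fit_to_split=\"train\"` prevents leakage, the sketch below reproduces the idea with plain scikit-learn on made-up numbers. It does not call ModularML, and the `demo_*` names are illustrative only.\n",
    "\n",
    "The scaler's minimum and maximum are learned from the training rows alone and then reused for the test rows, so test values may land outside [0, 1]. That is expected: no test-set statistics enter the preprocessing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "88b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.preprocessing import MinMaxScaler\n",
    "\n",
    "# Made-up 1-D feature: training values span [3.0, 4.0]\n",
    "demo_train = np.array([[3.0], [3.5], [4.0]])\n",
    "# Test values extend slightly beyond the training range\n",
    "demo_test = np.array([[2.9], [3.6], [4.2]])\n",
    "\n",
    "# Fit on the training split only, then transform both splits\n",
    "demo_scaler = MinMaxScaler().fit(demo_train)\n",
    "demo_train_scaled = demo_scaler.transform(demo_train)\n",
    "demo_test_scaled = demo_scaler.transform(demo_test)\n",
    "\n",
    "print(\"train scaled:\", demo_train_scaled.ravel())\n",
    "print(\"test scaled: \", demo_test_scaled.ravel())"
   ]
  },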
  {
   "cell_type": "markdown",
   "id": "89",
   "metadata": {},
   "source": [
    "### Chaining transforms\n",
    "\n",
    "Multiple transforms can be applied sequentially. Each call transforms the current `\"transformed\"` representation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "90",
   "metadata": {},
   "outputs": [],
   "source": [
    "# First undo, then chain: zero-start -> min-max\n",
    "fs_charge.undo_all_transforms(domain=\"features\")\n",
    "\n",
    "fs_charge.fit_transform(\n",
    "    scaler=\"PerSampleZeroStart\",\n",
    "    domain=\"features\",\n",
    "    keys=\"voltage\",\n",
    "    fit_to_split=\"train\",\n",
    ")\n",
    "\n",
    "fig, axes = plot_voltages(fs_charge, n_samples=200, rep=\"transformed\")\n",
    "plt.show()\n",
    "\n",
    "fs_charge.fit_transform(\n",
    "    scaler=\"MinMaxScaler\",\n",
    "    domain=\"features\",\n",
    "    keys=\"voltage\",\n",
    "    fit_to_split=\"train\",\n",
    ")\n",
    "\n",
    "fig, axes = plot_voltages(fs_charge, n_samples=200, rep=\"transformed\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "91",
   "metadata": {},
   "source": [
    "### Scaling targets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "92",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import MinMaxScaler\n",
    "\n",
    "# You can also pass sklearn instances directly\n",
    "fs_charge.fit_transform(\n",
    "    scaler=MinMaxScaler(),\n",
    "    domain=\"targets\",\n",
    "    keys=\"soh\",\n",
    "    fit_to_split=\"train\",\n",
    ")\n",
    "\n",
    "soh_raw = fs_charge[\"test\"].get_targets(fmt=\"numpy\", targets=\"soh\", rep=\"raw\")\n",
    "soh_scaled = fs_charge[\"test\"].get_targets(\n",
    "    fmt=\"numpy\",\n",
    "    targets=\"soh\",\n",
    "    rep=\"transformed\",\n",
    ")\n",
    "print(f\"SOH raw range:     [{soh_raw.min():.1f}, {soh_raw.max():.1f}]\")\n",
    "print(f\"SOH scaled range:  [{soh_scaled.min():.3f}, {soh_scaled.max():.3f}]\")"
   ]
  },
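  {
   "cell_type": "markdown",
   "id": "92a",
   "metadata": {},
   "source": [
    "Passing scikit-learn instances suggests one route to a custom scaler: write a scikit-learn-style transformer. The sketch below is only an illustration under assumptions. `FixedRangeScaler` is a hypothetical class, not part of ModularML or scikit-learn, and this notebook does not confirm that `fit_transform()` accepts user-defined transformers in addition to scikit-learn's built-in ones.\n",
    "\n",
    "The class maps a known range (here 0 to 100 % SOH) onto [0, 1] and is exercised on a plain NumPy array."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "92b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.base import BaseEstimator, TransformerMixin\n",
    "\n",
    "\n",
    "class FixedRangeScaler(TransformerMixin, BaseEstimator):\n",
    "    \"\"\"Hypothetical scaler mapping a known [lower, upper] range onto [0, 1].\"\"\"\n",
    "\n",
    "    def __init__(self, lower=0.0, upper=100.0):\n",
    "        self.lower = lower\n",
    "        self.upper = upper\n",
    "\n",
    "    def fit(self, X, y=None):\n",
    "        # Nothing is learned from the data; the range is fixed up front\n",
    "        return self\n",
    "\n",
    "    def transform(self, X):\n",
    "        return (np.asarray(X, dtype=float) - self.lower) / (self.upper - self.lower)\n",
    "\n",
    "    def inverse_transform(self, X):\n",
    "        return np.asarray(X, dtype=float) * (self.upper - self.lower) + self.lower\n",
    "\n",
    "\n",
    "# Made-up SOH values in percent\n",
    "demo_soh = np.array([[60.0], [80.0], [100.0]])\n",
    "fixed_scaler = FixedRangeScaler(lower=0.0, upper=100.0)\n",
    "demo_soh_scaled = fixed_scaler.fit_transform(demo_soh)\n",
    "demo_soh_restored = fixed_scaler.inverse_transform(demo_soh_scaled)\n",
    "\n",
    "print(\"scaled:  \", demo_soh_scaled.ravel())\n",
    "print(\"restored:\", demo_soh_restored.ravel())"
   ]
  },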
  {
   "cell_type": "markdown",
   "id": "93",
   "metadata": {},
   "source": [
    "### Undoing transforms"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "94",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Undo the last *feature* transform (MinMaxScaler), keeping PerSampleZeroStart.\n",
    "# The target transform, although more recent, is not inverted.\n",
    "fs_charge.undo_last_transform(domain=\"features\", keys=\"voltage\")\n",
    "\n",
    "transformed = fs_charge[\"train\"].get_features(\n",
    "    fmt=\"numpy\",\n",
    "    features=\"voltage\",\n",
    "    rep=\"transformed\",\n",
    ")\n",
    "print(\"After undo last:\")\n",
    "print(f\"  min={transformed.min():.4f} (should be ~0.0)\")\n",
    "print(f\"  max={transformed.max():.4f} (no longer bounded to 1.0)\")\n",
    "\n",
    "# Re-read the scaled targets to confirm they are unaffected\n",
    "soh_scaled = fs_charge[\"test\"].get_targets(\n",
    "    fmt=\"numpy\",\n",
    "    targets=\"soh\",\n",
    "    rep=\"transformed\",\n",
    ")\n",
    "print(f\"SOH:  [{soh_scaled.min():.3f}, {soh_scaled.max():.3f}]\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "95",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Undo all transforms (no domain is specified here)\n",
    "fs_charge.undo_all_transforms()\n",
    "\n",
    "# Verify: after undoing all, 'transformed' rep no longer exists\n",
    "print(\n",
    "    \"Keys after undo:\",\n",
    "    fs_charge.get_all_keys(include_domain_prefix=True, include_rep_suffix=True),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "96",
   "metadata": {},
   "source": [
    "### Inverse-scaling external data\n",
    "\n",
    "Use `unscale_data_for_cols()` to apply inverse transforms to data that lives outside the FeatureSet (e.g., model predictions):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "97",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs_charge.fit_transform(\n",
    "    scaler=MinMaxScaler(),\n",
    "    domain=\"targets\",\n",
    "    keys=\"soh\",\n",
    "    fit_to_split=\"train\",\n",
    ")\n",
    "\n",
    "# Simulate model predictions in scaled space\n",
    "fake_predictions = np.array([[0.5], [0.8], [0.2]])\n",
    "\n",
    "# Inverse-transform back to original SOH scale\n",
    "original_scale = fs_charge.unscale_data_for_cols(\n",
    "    data=fake_predictions,\n",
    "    domain=\"targets\",\n",
    "    columns=\"soh\",\n",
    ")\n",
    "print(f\"Scaled predictions:   {fake_predictions.ravel()}\")\n",
    "print(f\"Original-scale SOH:   {original_scale.ravel()}\")"
   ]
  },
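  {
   "cell_type": "markdown",
   "id": "97a",
   "metadata": {},
   "source": [
    "For intuition, inverse min-max scaling is the forward formula solved for the original value: `x = x_scaled * (x_max - x_min) + x_min`, where `x_min` and `x_max` come from the data the scaler was fitted on (assuming the default output range of [0, 1]).\n",
    "\n",
    "The sketch below checks this with plain scikit-learn on made-up SOH values. It is independent of ModularML, the `demo_*` names are illustrative, and it is not a description of how `unscale_data_for_cols()` is implemented."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "97b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.preprocessing import MinMaxScaler\n",
    "\n",
    "# Made-up training targets: SOH between 60 % and 100 %\n",
    "demo_soh_train = np.array([[60.0], [75.0], [100.0]])\n",
    "demo_target_scaler = MinMaxScaler().fit(demo_soh_train)\n",
    "\n",
    "# Predictions expressed in the scaled [0, 1] space\n",
    "demo_preds_scaled = np.array([[0.5], [0.8], [0.2]])\n",
    "\n",
    "# Invert via scikit-learn and via the explicit formula\n",
    "demo_preds_sklearn = demo_target_scaler.inverse_transform(demo_preds_scaled)\n",
    "x_min, x_max = demo_soh_train.min(), demo_soh_train.max()\n",
    "demo_preds_manual = demo_preds_scaled * (x_max - x_min) + x_min\n",
    "\n",
    "print(\"sklearn:\", demo_preds_sklearn.ravel())\n",
    "print(\"manual: \", demo_preds_manual.ravel())"
   ]
  },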
  {
   "cell_type": "markdown",
   "id": "98",
   "metadata": {},
   "source": [
    "While you can use the scalers fitted on a particular column to unscale external data (shown above), the best practice is to return to the original FeatureSet, filter it to the sample IDs from which your scaled data was produced, and access the \"transformed\" representation directly. This is the only way to fully guarantee that the correct scaler is applied.\n",
    "\n",
    "We'll cover this more in depth in: \n",
    "TODO: $\\textcolor{red}{\\text{add notebook link to \"working with model outputs / results\"}}$"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "99",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "100",
   "metadata": {},
   "source": [
    "(01-create-featureset-serialization-save-and-load)=\n",
    "## Serialization (Save and Load)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "101",
   "metadata": {},
   "source": [
    "\n",
    "FeatureSets can be saved to disk and fully restored, including splits and transforms."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "102",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reset, then re-apply splits and transforms before saving\n",
    "fs_charge.clear_splits()\n",
    "fs_charge.undo_all_transforms()\n",
    "\n",
    "fs_charge.split_random(\n",
    "    ratios={\"train\": 0.6, \"val\": 0.2, \"test\": 0.2},\n",
    "    group_by=\"group_id\",\n",
    "    seed=42,\n",
    ")\n",
    "\n",
    "fs_charge.fit_transform(\n",
    "    scaler=\"PerSampleZeroStart\",\n",
    "    domain=\"features\",\n",
    "    keys=\"voltage\",\n",
    "    fit_to_split=\"train\",\n",
    ")\n",
    "fig, axes = plot_voltages(fs_charge, n_samples=200, rep=\"transformed\")\n",
    "plt.show()\n",
    "fs_charge.fit_transform(\n",
    "    scaler=\"MinMaxScaler\",\n",
    "    domain=\"features\",\n",
    "    keys=\"voltage\",\n",
    "    fit_to_split=\"train\",\n",
    ")\n",
    "fig, axes = plot_voltages(fs_charge, n_samples=200, rep=\"transformed\")\n",
    "plt.show()\n",
    "fs_charge.fit_transform(\n",
    "    scaler=\"MinMaxScaler\",\n",
    "    domain=\"targets\",\n",
    "    keys=\"soh\",\n",
    "    fit_to_split=\"train\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "103",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "from tempfile import TemporaryDirectory\n",
    "\n",
    "# Save to a temporary directory\n",
    "SAVE_DIR = TemporaryDirectory()\n",
    "\n",
    "save_path = fs_charge.save(Path(SAVE_DIR.name) / \"fs_charge_demo\", overwrite=True)\n",
    "print(f\"Saved to: {save_path}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "104",
   "metadata": {},
   "source": [
    "Now we can reload this FeatureSet.\n",
    "\n",
    "Note that ModularML assigns every \"node\" in an Experiment a unique ID, so loading a FeatureSet that is already in memory triggers a node collision warning.\n",
    "This matters once we move to Experiments and ModelGraphs, but we can ignore the warning for now."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "105",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reload from file\n",
    "fs_rel = FeatureSet.load(save_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "106",
   "metadata": {},
   "source": [
    "We can pick up exactly where we left off; all history is preserved.\n",
    "\n",
    "This means we can undo the last transform."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "107",
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plot_voltages(fs_rel, n_samples=200, rep=\"transformed\")\n",
    "plt.show()\n",
    "\n",
    "fs_rel.undo_last_transform(domain=\"features\")\n",
    "fig, axes = plot_voltages(fs_rel, n_samples=200, rep=\"transformed\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "109",
   "metadata": {},
   "source": [
    "### Copying a FeatureSet\n",
    "\n",
    "Remember that little node collision warning?\n",
    "It told us that the loaded FeatureSet was identical to the one already in memory, so instead of loading a copy, ModularML returned the existing FeatureSet.\n",
    "\n",
    "That means the last `undo_last_transform` call modified our original `fs_charge` FeatureSet too (they're the same object).\n",
    "\n",
    "While this is great for memory, there are times when we want an actual copy of a FeatureSet.\n",
    "This is done with the `.copy()` method.\n",
    "\n",
    "By default, the copy still shares internal buffers with the same underlying PyArrow table (i.e., it's not a true copy).\n",
    "Setting `share_raw_data_buffer=False` ensures the new FeatureSet is fully independent of the original.\n",
    "\n",
    "The `restore_splits` and `restore_scalers` arguments control whether the copy also restores splits and scalers or keeps just the raw data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "110",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Shallow copy with raw data only\n",
    "fs_copy_raw = fs_charge.copy(label=\"CopyRawOnly\", share_raw_data_buffer=True)\n",
    "print(\n",
    "    f\"Copy (raw only) keys: {fs_copy_raw.get_all_keys(include_domain_prefix=True, include_rep_suffix=True)}\",\n",
    ")\n",
    "print(f\"Copy splits: {fs_copy_raw.available_splits}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "111",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Full copy with splits and scalers restored\n",
    "fs_copy_full = fs_charge.copy(\n",
    "    label=\"CopyFull\",\n",
    "    share_raw_data_buffer=False,\n",
    "    restore_splits=True,\n",
    "    restore_scalers=True,\n",
    "    register=True,\n",
    ")\n",
    "print(\n",
    "    f\"Full copy keys: {fs_copy_full.get_all_keys(include_domain_prefix=True, include_rep_suffix=True)}\",\n",
    ")\n",
    "print(f\"Full copy splits: {fs_copy_full.available_splits}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "112",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "113",
   "metadata": {},
   "source": [
    "(01-create-featureset-references-for-model-graph-wiring)=\n",
    "## References (for Model Graph Wiring)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "114",
   "metadata": {},
   "source": [
    "\n",
    "When connecting a FeatureSet to a `ModelStage` in a model graph, you create symbolic references rather than passing data directly.\n",
    "\n",
    "Below is a quick overview; more details are provided in the following notebooks:\n",
    "* {doc}`02_create_modelnode`\n",
    "* {doc}`03_create_modelgraph`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "115",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Multi-column reference (used for ModelStage inputs)\n",
    "ref = fs_charge.reference(features=\"voltage\", targets=\"soh\", rep=\"transformed\")\n",
    "print(f\"Reference type: {type(ref).__name__}\")\n",
    "print(f\"Reference: {ref}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "116",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Single-column reference (specify rep when multiple representations exist)\n",
    "col_ref = fs_charge.column_reference(feature=\"voltage\", rep=\"transformed\")\n",
    "print(f\"Column reference type: {type(col_ref).__name__}\")\n",
    "print(f\"Column reference: {col_ref}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "117",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "118",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "| Task | Method |\n",
    "|------|--------|\n",
    "| Create from dict | `FeatureSet.from_dict(label, data, feature_keys, target_keys, tag_keys)` |\n",
    "| Create from DataFrame | `FeatureSet.from_pandas(label, df, feature_cols, target_cols, tag_cols, groupby_cols)` |\n",
    "| Create from Arrow | `FeatureSet.from_pyarrow_table(label, table)` |\n",
    "| Inspect keys | `get_feature_keys()`, `get_target_keys()`, `get_tag_keys()`, `get_all_keys()` |\n",
    "| Inspect shapes/dtypes | `get_feature_shapes()`, `get_feature_dtypes()` |\n",
    "| Get data | `get_features(fmt=...)`, `get_targets(fmt=...)`, `get_tags(fmt=...)` |\n",
    "| Unified access | `get_data(features=..., targets=..., tags=..., fmt=...)` |\n",
    "| Filter rows | `filter(conditions={...})` |\n",
    "| Subset by index | `take(indices)` |\n",
    "| Select columns | `select(features=..., targets=..., rep=...)` |\n",
    "| Split randomly | `split_random(ratios, group_by, seed)` |\n",
    "| Split by condition | `split_by_condition({split_name: {col: condition}})` |\n",
    "| Apply transform | `fit_transform(scaler, domain, keys, fit_to_split)` |\n",
    "| Undo transform | `undo_last_transform(domain, keys)` / `undo_all_transforms()` |\n",
    "| Inverse-scale data | `unscale_data_for_cols(data, domain, columns)` |\n",
    "| Save / Load | `save(path)` / `FeatureSet.load(path)` |\n",
    "| Copy | `copy(restore_splits, restore_scalers)` |\n",
    "| Create reference | `reference(features, targets)` / `column_reference(feature)` |"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv (3.10.18)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}