{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0",
   "metadata": {},
   "source": [
    "# How to: Create and Use Scalers\n",
    "\n",
    "ModularML's `Scaler` class provides a unified interface for applying preprocessing transforms to\n",
    "`FeatureSet` data. It wraps any scikit-learn-compatible transformer and integrates with\n",
    "`fit_transform`, undo history, and serialization.\n",
    "\n",
    "This notebook covers:\n",
    "\n",
    "- {ref}`06-scalers-data-and-setup`\n",
    "- {ref}`06-scalers-built-in-scalers`\n",
    "- {ref}`06-scalers-the-scaler-wrapper`\n",
    "- {ref}`06-scalers-per-sample-zero-start`\n",
    "- {ref}`06-scalers-per-sample-min-max`\n",
    "- {ref}`06-scalers-segmented-scaler`\n",
    "- {ref}`06-scalers-negate-and-absolute`\n",
    "- {ref}`06-scalers-chaining-transforms`\n",
    "- {ref}`06-scalers-creating-a-custom-scaler`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1",
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "\n",
    "import modularml as mml\n",
    "from modularml import FeatureSet, Scaler\n",
    "from modularml.scalers import (\n",
    "    Absolute,\n",
    "    Negate,\n",
    "    PerSampleMinMaxScaler,\n",
    "    PerSampleZeroStart,\n",
    "    SegmentedScaler,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2",
   "metadata": {},
   "source": [
    "We'll use synthetic HPPC (Hybrid Pulse Power Characterization) battery data throughout this\n",
    "notebook. Each sample simulates a standard HPPC pulse sequence:\n",
    "\n",
    "1. **OCV observation** (10 s) - cell resting at open-circuit voltage\n",
    "2. **Charge pulse** (10 s) — 1.2 A applied; ohmic jump then exponential rise\n",
    "3. **Rest after charge** (40 s) — current removed; ohmic recovery then slow relaxation\n",
    "4. **Discharge pulse** (10 s) — 1.2 A drawn; ohmic drop then exponential decay\n",
    "5. **Rest after discharge** (40 s) — ohmic recovery then slow relaxation back to OCV\n",
    "\n",
    "Cells span a range of state-of-health (SOH) values, degrading from 100% to ~50%."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3",
   "metadata": {},
   "outputs": [],
   "source": [
    "from utils.hppc_data_gen import get_mock_hppc_data\n",
    "\n",
    "voltage, soh, cell_ids, group_ids = get_mock_hppc_data(n_samples=1000)\n",
    "\n",
    "print(f\"Samples:          {voltage.shape[0]}\")\n",
    "print(f\"Voltage shape:    {voltage.shape}\")\n",
    "print(f\"OCV range:        [{voltage[:, 0].min():.2f}, {voltage[:, 0].max():.2f}] V\")\n",
    "print(f\"SOH range:        [{soh.min():.1f}, {soh.max():.1f}] %\")\n",
    "print(f\"Voltage overall:  [{voltage.min():.3f}, {voltage.max():.3f}] V\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5",
   "metadata": {},
   "source": [
    "(06-scalers-data-and-setup)=\n",
    "## Data and Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs = FeatureSet.from_dict(\n",
    "    label=\"HPPCData\",\n",
    "    data={\n",
    "        \"voltage\": voltage.tolist(),\n",
    "        \"soh\": soh.tolist(),\n",
    "        \"cell_id\": cell_ids.tolist(),\n",
    "        \"group_id\": group_ids.tolist(),\n",
    "    },\n",
    "    feature_keys=\"voltage\",\n",
    "    target_keys=\"soh\",\n",
    "    tag_keys=[\"cell_id\", \"group_id\"],\n",
    ")\n",
    "print(fs)\n",
    "print(f\"Feature shapes: {fs.get_feature_shapes()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7",
   "metadata": {},
   "source": [
    "Split by cell group to prevent data leakage between train / val / test."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs.split_random(\n",
    "    ratios={\"train\": 0.6, \"val\": 0.2, \"test\": 0.2},\n",
    "    group_by=\"group_id\",\n",
    "    seed=42,\n",
    ")\n",
    "\n",
    "for name, view in fs.splits.items():\n",
    "    groups = view.get_tags(fmt=\"numpy\", tags=\"group_id\")\n",
    "    print(f\"  {name}: {len(view)} samples, groups: {np.unique(groups)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9",
   "metadata": {},
   "source": [
    "Define a reusable plotting helper. Each split gets its own panel;\n",
    "traces are colored by SOH (dark blue = high, light blue = low)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10",
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_timeseries(\n",
    "    fs: FeatureSet,\n",
    "    columns: str | list[str],\n",
    "    splits: list[str] | None = None,\n",
    "    n_samples: int = 100,\n",
    "    color_by: str = \"targets.soh.raw\",\n",
    "    color_vbounds: tuple = (50, 100),\n",
    "    xlabel: str = \"Time (s)\",\n",
    "    ylabel: str = \"Voltage (V)\",\n",
    "    clabel: str = \"SOH (%)\",\n",
    "    marker=\"-\",\n",
    "    seed: int = 13,\n",
    "):\n",
    "    \"\"\"\n",
    "    Plot time-series columns from a FeatureSet, one panel per split.\n",
    "\n",
    "    Args:\n",
    "        fs:            FeatureSet to visualise.\n",
    "        columns:       Fully-qualified column name(s), e.g. ``\"features.voltage.raw\"``.\n",
    "                       Multiple columns are flattened and horizontally stacked.\n",
    "        splits:        Splits to include. Defaults to all registered splits.\n",
    "        n_samples:     Number of traces to draw per panel.\n",
    "        color_by:      Fully-qualified scalar column used for the colormap.\n",
    "        color_vbounds: ``(vmin, vmax)`` for the colormap.\n",
    "        xlabel:        Axis x-label.\n",
    "        ylabel:        Axis y-label.\n",
    "        clabel:        Colorbar label.\n",
    "        marker:        Marker style.\n",
    "        seed:          RNG seed for reproducible sample selection.\n",
    "\n",
    "    \"\"\"\n",
    "\n",
    "    def order_splits(values: list[str]) -> list[str]:\n",
    "        priority = {\"train\": 0, \"val\": 1, \"test\": 2}\n",
    "        return sorted(values, key=lambda x: priority.get(x, 99))\n",
    "\n",
    "    rng = np.random.default_rng(seed)\n",
    "    scm = plt.cm.ScalarMappable(\n",
    "        cmap=plt.cm.Blues,\n",
    "        norm=plt.Normalize(vmin=color_vbounds[0], vmax=color_vbounds[1]),\n",
    "    )\n",
    "\n",
    "    columns = columns if isinstance(columns, list) else [columns]\n",
    "    split_names = order_splits(splits or fs.available_splits)\n",
    "\n",
    "    fig, axes = plt.subplots(\n",
    "        figsize=(7, 2.5),\n",
    "        ncols=len(split_names),\n",
    "        sharex=True,\n",
    "        sharey=True,\n",
    "    )\n",
    "\n",
    "    for i, split_label in enumerate(split_names):\n",
    "        split_view = fs.get_split(split_label)\n",
    "\n",
    "        res = split_view.get_data(\n",
    "            columns=columns,\n",
    "            fmt=\"dict_numpy\",\n",
    "            include_domain_prefix=True,\n",
    "            include_rep_suffix=True,\n",
    "        )\n",
    "\n",
    "        # Match user-supplied column specs to actual keys returned by get_data\n",
    "        ordered_keys = []\n",
    "        for c in columns:\n",
    "            parts = [c.replace(\"*\", \"\")]\n",
    "            if \".\" in c:\n",
    "                parts = c.replace(\"*\", \"\").split(\".\")\n",
    "            for k in res:\n",
    "                if any(p == k.split(\".\")[1] for p in parts):\n",
    "                    ordered_keys.append(k)\n",
    "                    break\n",
    "\n",
    "        color_vals = split_view.get_data(columns=[color_by], fmt=\"np\").reshape(-1)\n",
    "        flat_data = np.column_stack(\n",
    "            [res[k].reshape(len(color_vals), -1) for k in ordered_keys],\n",
    "        )\n",
    "\n",
    "        sample_idxs = rng.choice(np.arange(len(color_vals)), size=n_samples)\n",
    "        for idx in sample_idxs:\n",
    "            axes[i].plot(flat_data[idx], marker, color=scm.to_rgba(color_vals[idx]))\n",
    "\n",
    "        axes[i].set_title(split_label, fontsize=10)\n",
    "        axes[i].set_xlabel(xlabel, fontsize=10)\n",
    "\n",
    "    axes[0].set_ylabel(ylabel, fontsize=10)\n",
    "    fig.tight_layout(pad=1)\n",
    "    fig.subplots_adjust(right=0.85)\n",
    "    cbar_ax = fig.add_axes([0.87, 0.19, 0.02, 0.7])\n",
    "    cbar = fig.colorbar(scm, cax=cbar_ax)\n",
    "    cbar.set_label(clabel, fontsize=10)\n",
    "    return fig, axes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11",
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plot_timeseries(fs, columns=\"features.voltage.raw\")\n",
    "plt.suptitle(\"Raw HPPC voltage\", y=1.02, fontsize=11)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "12",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13",
   "metadata": {},
   "source": [
    "(06-scalers-built-in-scalers)=\n",
    "## Built-in Scalers\n",
    "\n",
    "All registered scalers are accessible via `mml.supported_scalers`.\n",
    "The registry contains both ModularML-native and scikit-learn scalers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14",
   "metadata": {},
   "outputs": [],
   "source": [
    "mml.supported_scalers"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15",
   "metadata": {},
   "source": [
    "The ModularML-native (non-sklearn) scalers are:\n",
    "\n",
    "| Scaler | What it does |\n",
    "|--------|--------------|\n",
    "| `PerSampleZeroStart` | Shifts each sample so its first value equals zero |\n",
    "| `PerSampleMinMaxScaler` | Scales each sample independently to a target range (default [0, 1]) |\n",
    "| `SegmentedScaler` | Applies independent scalers to contiguous feature sub-regions |\n",
    "| `Negate` | Multiplies all values by −1 |\n",
    "| `Absolute` | Replaces each value with its absolute value |"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17",
   "metadata": {},
   "source": [
    "(06-scalers-the-scaler-wrapper)=\n",
    "## The Scaler Wrapper\n",
    "\n",
    "`Scaler` is a thin adapter that gives any sklearn-compatible transformer a consistent ModularML\n",
    "interface. It can be constructed three ways:\n",
    "\n",
    "```python\n",
    "Scaler(\"MinMaxScaler\")          # by registry name (case-insensitive)\n",
    "Scaler(MinMaxScaler)            # by class\n",
    "Scaler(MinMaxScaler(clip=True)) # by instance\n",
    "```\n",
    "\n",
    "You can also pass a string, class, or instance **directly** to `FeatureSet.fit_transform` — it\n",
    "will be wrapped automatically.\n",
    "\n",
    "Key methods:\n",
    "\n",
    "| Method | Description |\n",
    "|--------|-------------|\n",
    "| `fit(X)` | Learn parameters from data |\n",
    "| `transform(X)` | Apply the fitted transform |\n",
    "| `fit_transform(X)` | Fit and transform in one step |\n",
    "| `inverse_transform(X)` | Reverse the transform (if supported) |\n",
    "| `clone_unfitted()` | Return a fresh copy with the same config but no learned state |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "18",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import MinMaxScaler\n",
    "\n",
    "# Three equivalent constructors\n",
    "s1 = Scaler(\"MinMaxScaler\")\n",
    "s2 = Scaler(MinMaxScaler)\n",
    "s3 = Scaler(MinMaxScaler())\n",
    "\n",
    "print(f\"Name:             {s1.scaler_name}\")\n",
    "print(f\"Is fit (before):  {s1._is_fit}\")\n",
    "\n",
    "X = np.random.default_rng(0).normal(size=(10, 5))\n",
    "s1.fit(X)\n",
    "print(f\"Is fit (after):   {s1._is_fit}\")\n",
    "\n",
    "X_scaled = s1.transform(X)\n",
    "print(f\"Scaled range:     [{X_scaled.min():.3f}, {X_scaled.max():.3f}]\")\n",
    "\n",
    "# Clone: same config, no learned state\n",
    "s1_clone = s1.clone_unfitted()\n",
    "print(f\"Clone is fit:     {s1_clone._is_fit}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "20",
   "metadata": {},
   "source": [
    "(06-scalers-per-sample-zero-start)=\n",
    "## PerSampleZeroStart\n",
    "\n",
    "**What it does:** Subtracts the first value of each sample from every element in that sample:\n",
    "\n",
    "$$x_i^\\prime = x_i - x_i[0]$$\n",
    "\n",
    "**Why it's useful for HPPC data:** The absolute OCV varies with SOC and cell-to-cell spread\n",
    "(here 2.0–3.6 V). Subtracting the initial value removes this offset so all traces start at\n",
    "zero and only the *delta-V* response to the current pulse is retained. Models trained on\n",
    "zero-started data learn the electrochemical dynamics rather than the SOC level.\n",
    "\n",
    "**Fitting behaviour:** Statistics are computed per-sample at `transform` time (no global\n",
    "statistics), so `fit_to_split` does not affect the result. Specifying it is still good practice."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "21",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs.undo_all_transforms()\n",
    "\n",
    "fs.fit_transform(\n",
    "    scaler=PerSampleZeroStart,\n",
    "    domain=\"features\",\n",
    "    keys=\"voltage\",\n",
    "    fit_to_split=\"train\",\n",
    ")\n",
    "\n",
    "v_raw = fs[\"train\"].get_features(fmt=\"numpy\", features=\"voltage\", rep=\"raw\")\n",
    "v_zs = fs[\"train\"].get_features(fmt=\"numpy\", features=\"voltage\", rep=\"transformed\")\n",
    "\n",
    "print(\n",
    "    f\"Raw first-value range:       [{v_raw[:, 0].min():.3f}, {v_raw[:, 0].max():.3f}] V\"\n",
    ")\n",
    "print(f\"ZeroStart first-value |max|: {np.abs(v_zs[:, 0]).max():.3e} (all ~0)\")\n",
    "\n",
    "fig, axes = plot_timeseries(fs, columns=\"features.voltage.transformed\")\n",
    "plt.suptitle(\"After PerSampleZeroStart\", y=1.02, fontsize=11)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "23",
   "metadata": {},
   "source": [
    "(06-scalers-per-sample-min-max)=\n",
    "## PerSampleMinMaxScaler\n",
    "\n",
    "**What it does:** Scales each sample independently so its values span `feature_range` (default [0, 1]):\n",
    "\n",
    "$$x_i^\\prime = \\frac{x_i - \\min(x_i)}{\\max(x_i) - \\min(x_i)}$$\n",
    "\n",
    "**Contrast with sklearn's `MinMaxScaler`:** sklearn's version computes min and max *across the\n",
    "training set* — one scalar per feature dimension. `PerSampleMinMaxScaler` uses per-sample\n",
    "statistics, making it invariant to the absolute voltage level and amplitude differences between\n",
    "cells and SOC levels.\n",
    "\n",
    "**Fitting behaviour:** Like `PerSampleZeroStart`, statistics are recomputed per-sample at\n",
    "transform time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs.undo_all_transforms()\n",
    "\n",
    "fs.fit_transform(\n",
    "    scaler=PerSampleMinMaxScaler(),\n",
    "    domain=\"features\",\n",
    "    keys=\"voltage\",\n",
    "    fit_to_split=\"train\",\n",
    ")\n",
    "\n",
    "v_scaled = fs[\"train\"].get_features(fmt=\"numpy\", features=\"voltage\", rep=\"transformed\")\n",
    "fig, axes = plot_timeseries(fs, columns=\"features.voltage.transformed\")\n",
    "plt.suptitle(\"After PerSampleMinMaxScaler\", y=1.02, fontsize=11)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "26",
   "metadata": {},
   "source": [
    "(06-scalers-segmented-scaler)=\n",
    "## SegmentedScaler\n",
    "\n",
    "**What it does:** Partitions the feature vector into contiguous segments and fits an independent\n",
    "scaler on each segment:\n",
    "\n",
    "```\n",
    "[ ─── segment 0 ─── | ─── segment 1 ─── | ─── ... ─── ]\n",
    "  boundaries[0:1]     boundaries[1:2]\n",
    "```\n",
    "\n",
    "A cloned copy of the template scaler is fit independently on each slice of the training data.\n",
    "\n",
    "**Why it's useful for HPPC data:** The OCV, charge pulse, rest, discharge, and rest regions\n",
    "occupy very different voltage ranges. A single global scaler compresses or expands regions\n",
    "unevenly. `SegmentedScaler` applies an independent normalization to each protocol region,\n",
    "preserving its full dynamic range.\n",
    "\n",
    "**Boundaries** must be a tuple of strictly increasing integers, starting at 0 and ending at the\n",
    "total feature length.\n",
    "\n",
    "**Fitting behaviour:** The underlying scaler (e.g., `MinMaxScaler`) learns global statistics\n",
    "*across training samples* for each segment, so `fit_to_split=\"train\"` **is** important here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "27",
   "metadata": {},
   "outputs": [],
   "source": [
    "import itertools\n",
    "\n",
    "# Visualise the HPPC segment layout on a representative trace\n",
    "SEGMENT_LABELS = [\"OCV\", \"Charge\", \"Rest 1\", \"Discharge\", \"Rest 2\"]\n",
    "SEGMENT_COLORS = [\"#cce5f0\", \"#f0cccc\", \"#ccf0cc\", \"#f0e0cc\", \"#e0ccf0\"]\n",
    "HPPC_BOUNDARIES = [0, 9, 20, 59, 70, 110]\n",
    "\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(8, 3))\n",
    "sample = fs.get_features(fmt=\"numpy\", features=\"voltage\", rep=\"raw\")[0][0]\n",
    "ax.plot(sample, \"k-\", lw=1.5)\n",
    "\n",
    "for (start, end), color, label in zip(\n",
    "    itertools.pairwise(HPPC_BOUNDARIES),\n",
    "    SEGMENT_COLORS,\n",
    "    SEGMENT_LABELS,\n",
    "    strict=True,\n",
    "):\n",
    "    ax.axvspan(start, end - 0.5, alpha=0.4, color=color)\n",
    "    ax.text(\n",
    "        (start + end) / 2, ax.get_ylim()[0], label, ha=\"center\", va=\"bottom\", fontsize=8\n",
    "    )\n",
    "\n",
    "ax.set_xlabel(\"Time (s)\")\n",
    "ax.set_ylabel(\"Voltage (V)\")\n",
    "ax.set_title(\n",
    "    f\"HPPC Protocol — SegmentedScaler boundaries: {HPPC_BOUNDARIES}\", fontsize=10\n",
    ")\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "28",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs.undo_all_transforms()\n",
    "\n",
    "new_bounds = [0, 20, 60, 70, 110]\n",
    "fs.fit_transform(\n",
    "    scaler=SegmentedScaler(boundaries=new_bounds, scaler=PerSampleZeroStart),\n",
    "    domain=\"features\",\n",
    "    keys=\"voltage\",\n",
    "    fit_to_split=\"train\",\n",
    ")\n",
    "\n",
    "v_seg = fs[\"train\"].get_features(fmt=\"numpy\", features=\"voltage\", rep=\"transformed\")\n",
    "fig, axes = plot_timeseries(fs, columns=\"features.voltage.transformed\", marker=\".\")\n",
    "plt.suptitle(\"After SegmentedScaler(PerSampleZeroStart)\", y=1.03, fontsize=11)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29",
   "metadata": {},
   "source": [
    "Note that segment boundaries in the val / test panels may slightly exceed [0, 1] because those\n",
    "splits contain cell groups not seen during training."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "31",
   "metadata": {},
   "source": [
    "(06-scalers-negate-and-absolute)=\n",
    "## Negate and Absolute\n",
    "\n",
    "These are simple element-wise transforms, most useful as building blocks in a transform chain.\n",
    "\n",
    "- **`Negate`** — multiplies every value by −1. Useful when a model or loss expects positive\n",
    "  values (e.g., converting a discharge voltage drop to a positive deviation).\n",
    "- **`Absolute`** — replaces every value with its absolute value, recording the per-element sign\n",
    "  mask so the transform can be inverted exactly.\n",
    "\n",
    "Both have a no-op `fit`, so `fit_to_split` has no effect on the output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "32",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs.undo_all_transforms()\n",
    "\n",
    "# Negate\n",
    "fs.fit_transform(\n",
    "    scaler=Negate(), domain=\"features\", keys=\"voltage\", fit_to_split=\"train\"\n",
    ")\n",
    "\n",
    "v_raw = fs[\"train\"].get_features(fmt=\"numpy\", features=\"voltage\", rep=\"raw\")\n",
    "v_neg = fs[\"train\"].get_features(fmt=\"numpy\", features=\"voltage\", rep=\"transformed\")\n",
    "print(\"Negate\")\n",
    "print(f\"  Raw range:     [{v_raw.min():.3f}, {v_raw.max():.3f}] V\")\n",
    "print(f\"  Negated range: [{v_neg.min():.3f}, {v_neg.max():.3f}] V\")\n",
    "\n",
    "fs.undo_all_transforms()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "33",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Absolute\n",
    "# Apply ZeroStart first so voltage deviations straddle zero (+/-);\n",
    "# Absolute then maps both directions into the positive half-plane.\n",
    "fs.fit_transform(\n",
    "    scaler=PerSampleZeroStart, domain=\"features\", keys=\"voltage\", fit_to_split=\"train\"\n",
    ")\n",
    "fs.fit_transform(\n",
    "    scaler=Absolute(), domain=\"features\", keys=\"voltage\", fit_to_split=\"train\"\n",
    ")\n",
    "\n",
    "v_abs = fs[\"train\"].get_features(fmt=\"numpy\", features=\"voltage\", rep=\"transformed\")\n",
    "print(\"Absolute (after PerSampleZeroStart)\")\n",
    "print(f\"  Min value: {v_abs.min():.6f}  (all non-negative)\")\n",
    "\n",
    "fs.undo_all_transforms()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "34",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "35",
   "metadata": {},
   "source": [
    "(06-scalers-chaining-transforms)=\n",
    "## Chaining Transforms\n",
    "\n",
    "Each call to `fit_transform` builds on the current `\"transformed\"` representation, creating a\n",
    "chain that can be unwound one step at a time (`undo_last_transform`) or all at once\n",
    "(`undo_all_transforms`).\n",
    "\n",
    "A natural preprocessing pipeline for HPPC voltage:\n",
    "\n",
    "1. **PerSampleZeroStart** — removes the per-sample OCV offset so all traces start at 0\n",
    "2. **SegmentedScaler(MinMaxScaler)** — independently normalises each protocol region to [0, 1]\n",
    "3. **MinMaxScaler** on the SOH target — normalises the prediction target across training samples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "36",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs.undo_all_transforms()\n",
    "\n",
    "# Step 1: Remove per-sample OCV offset\n",
    "fs.fit_transform(\n",
    "    scaler=PerSampleZeroStart,\n",
    "    domain=\"features\",\n",
    "    keys=\"voltage\",\n",
    "    fit_to_split=\"train\",\n",
    ")\n",
    "\n",
    "fig, axes = plot_timeseries(fs, columns=\"features.voltage.transformed\")\n",
    "plt.suptitle(\"Step 1 - PerSampleZeroStart\", y=1.02, fontsize=11)\n",
    "plt.show()\n",
    "\n",
    "# Step 2: Normalise each HPPC segment independently\n",
    "fs.fit_transform(\n",
    "    scaler=SegmentedScaler(boundaries=new_bounds, scaler=\"MinMaxScaler\"),\n",
    "    domain=\"features\",\n",
    "    keys=\"voltage\",\n",
    "    fit_to_split=\"train\",\n",
    ")\n",
    "\n",
    "fig, axes = plot_timeseries(fs, columns=\"features.voltage.transformed\")\n",
    "plt.suptitle(\"Step 2 - SegmentedScaler(MinMaxScaler)\", y=1.02, fontsize=11)\n",
    "plt.show()\n",
    "\n",
    "# Step 3: Scale SOH target across training samples\n",
    "fs.fit_transform(\n",
    "    scaler=MinMaxScaler(),\n",
    "    domain=\"targets\",\n",
    "    keys=\"soh\",\n",
    "    fit_to_split=\"train\",\n",
    ")\n",
    "\n",
    "soh_raw = fs[\"test\"].get_targets(fmt=\"numpy\", targets=\"soh\", rep=\"raw\")\n",
    "soh_scaled = fs[\"test\"].get_targets(fmt=\"numpy\", targets=\"soh\", rep=\"transformed\")\n",
    "print(f\"SOH raw range:    [{soh_raw.min():.1f}, {soh_raw.max():.1f}] %\")\n",
    "print(f\"SOH scaled range: [{soh_scaled.min():.3f}, {soh_scaled.max():.3f}]\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "37",
   "metadata": {},
   "source": [
    "Undo the last feature transform (SegmentedScaler) while keeping PerSampleZeroStart active."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "38",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs.undo_last_transform(domain=\"features\", keys=\"voltage\")\n",
    "\n",
    "v = fs[\"train\"].get_features(fmt=\"numpy\", features=\"voltage\", rep=\"transformed\")\n",
    "print(\"After undoing SegmentedScaler (PerSampleZeroStart still active):\")\n",
    "print(\n",
    "    f\"  First-value |max|: {np.abs(v[:, 0]).max():.2e}  (ZeroStart preserved, all ~0)\"\n",
    ")\n",
    "print(\n",
    "    f\"  Overall range:     [{v.min():.3f}, {v.max():.3f}]  (no longer bounded to [0,1])\"\n",
    ")\n",
    "\n",
    "# SOH target transform is unaffected (different domain)\n",
    "soh_scaled = fs[\"test\"].get_targets(fmt=\"numpy\", targets=\"soh\", rep=\"transformed\")\n",
    "print(\n",
    "    f\"  SOH scaled range:  [{soh_scaled.min():.3f}, {soh_scaled.max():.3f}]  (unchanged)\"\n",
    ")\n",
    "\n",
    "fig, axes = plot_timeseries(fs, columns=\"features.voltage.transformed\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "40",
   "metadata": {},
   "source": [
    "(06-scalers-creating-a-custom-scaler)=\n",
    "## Creating a Custom Scaler\n",
    "\n",
    "Any object that satisfies the scikit-learn estimator interface can be wrapped by `Scaler` and\n",
    "used directly in `FeatureSet.fit_transform`:\n",
    "\n",
    "```python\n",
    "class MyScaler(BaseEstimator, TransformerMixin):\n",
    "    def fit(self, X, y=None): ...\n",
    "    def transform(self, X): ...\n",
    "    def inverse_transform(self, X): ...  # optional but enables undo\n",
    "```\n",
    "\n",
    "The hard requirements are `fit` and `transform`. Adding `inverse_transform` enables\n",
    "`undo_last_transform` and `unscale_data_for_cols`.\n",
    "\n",
    "### Example: `PerSampleStandardScaler`\n",
    "\n",
    "sklearn's `StandardScaler` standardizes *across samples* using training-set statistics.\n",
    "The custom scaler below standardizes *each sample independently* — useful when every\n",
    "measurement has its own mean and variance that should be removed.\n",
    "\n",
    "*Note: Custom scalers cannot be serialized unless they are defined in a separate file,\n",
    "or registered to the `mml.supported_scalers` registry.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "41",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.base import BaseEstimator, TransformerMixin\n",
    "\n",
    "\n",
    "class PerSampleStandardScaler(BaseEstimator, TransformerMixin):\n",
    "    \"\"\"\n",
    "    Standardize each sample to zero mean and unit variance.\n",
    "\n",
    "    Unlike sklearn's ``StandardScaler``, statistics are computed per sample\n",
    "    at transform time; no global state is learned from the training set.\n",
    "    \"\"\"\n",
    "\n",
    "    def __init__(self):\n",
    "        self._sample_mean = None\n",
    "        self._sample_std = None\n",
    "\n",
    "    def fit(self, X, y=None):\n",
    "        # No global statistics to learn; all computation is deferred to transform\n",
    "        return self\n",
    "\n",
    "    def transform(self, X):\n",
    "        if X.ndim != 2:\n",
    "            msg = f\"Expected 2D array, got shape {X.shape}\"\n",
    "            raise ValueError(msg)\n",
    "        X = np.asarray(X)\n",
    "        self._sample_mean = X.mean(axis=1, keepdims=True)\n",
    "        self._sample_std = X.std(axis=1, keepdims=True)\n",
    "        # Guard against constant samples (std = 0)\n",
    "        self._sample_std = np.where(self._sample_std == 0, 1.0, self._sample_std)\n",
    "        return (X - self._sample_mean) / self._sample_std\n",
    "\n",
    "    def inverse_transform(self, X):\n",
    "        if self._sample_mean is None:\n",
    "            raise RuntimeError(\"Scaler has not been applied yet.\")\n",
    "        return X * self._sample_std + self._sample_mean"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "42",
   "metadata": {},
   "outputs": [],
   "source": [
    "fs.undo_all_transforms()\n",
    "\n",
    "try:\n",
    "    fs.fit_transform(\n",
    "        scaler=Scaler(PerSampleStandardScaler()),\n",
    "        domain=\"features\",\n",
    "        keys=\"voltage\",\n",
    "        fit_to_split=\"train\",\n",
    "    )\n",
    "except RuntimeError as e:\n",
    "    print(e)\n",
    "\n",
    "# We can register it to the builtin scaler register\n",
    "mml.scaler_registry.register(\n",
    "    \"PerSampleStandardScaler\",\n",
    "    PerSampleStandardScaler,\n",
    ")\n",
    "\n",
    "# And now we can use the scaler\n",
    "fs.fit_transform(\n",
    "    scaler=PerSampleStandardScaler(),\n",
    "    domain=\"features\",\n",
    "    keys=\"voltage\",\n",
    "    fit_to_split=\"train\",\n",
    ")\n",
    "fig, axes = plot_timeseries(fs, columns=\"features.voltage.transformed\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "43",
   "metadata": {},
   "source": [
    "However, adding custom classes to the built-in registry only allows usage of that class in this same environment.\n",
    "Restarting the kernel would clear user-added items in the registry.\n",
    "\n",
    "For more robust serailization, move the the scaler class to its own Python file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "44",
   "metadata": {},
   "outputs": [],
   "source": [
    "from utils.my_scaler import PerSampleStandardScaler as MyScaler\n",
    "\n",
    "fs.undo_all_transforms()\n",
    "fs.fit_transform(\n",
    "    scaler=MyScaler(),\n",
    "    domain=\"features\",\n",
    "    keys=\"voltage\",\n",
    "    fit_to_split=\"train\",\n",
    ")\n",
    "\n",
    "v_std = fs[\"train\"].get_features(fmt=\"numpy\", features=\"voltage\", rep=\"transformed\")\n",
    "fig, axes = plot_timeseries(fs, columns=\"features.voltage.transformed\")\n",
    "plt.suptitle(\"After PerSampleStandardScaler (custom)\", y=1.02, fontsize=11)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "45",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "46",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "| Task | Code |\n",
    "|------|------|\n",
    "| List all scalers | `mml.supported_scalers` |\n",
    "| Create scaler by name | `Scaler(\"MinMaxScaler\")` |\n",
    "| Create scaler by instance | `Scaler(MinMaxScaler())` |\n",
    "| Apply a transform | `fs.fit_transform(scaler, domain, keys, fit_to_split)` |\n",
    "| Undo last transform | `fs.undo_last_transform(domain, keys)` |\n",
    "| Undo all transforms | `fs.undo_all_transforms()` |\n",
    "| Per-sample OCV removal | `PerSampleZeroStart` |\n",
    "| Per-sample normalisation | `PerSampleMinMaxScaler()` |\n",
    "| Segment-wise normalisation | `SegmentedScaler(boundaries=(...), scaler=\"MinMaxScaler\")` |\n",
    "| Sign flip | `Negate()` |\n",
    "| Absolute value | `Absolute()` |\n",
    "| Custom scaler | Subclass `BaseEstimator, TransformerMixin`; implement `fit` + `transform` |\n",
    "| Inverse-scale external data | `fs.unscale_data_for_cols(data, domain, columns)` |"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv (3.10.18)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}