{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# How to: Create and Use Scalers\n", "\n", "ModularML's `Scaler` class provides a unified interface for applying preprocessing transforms to\n", "`FeatureSet` data. It wraps any scikit-learn-compatible transformer and integrates with\n", "`fit_transform`, undo history, and serialization.\n", "\n", "This notebook covers:\n", "\n", "- {ref}`06-scalers-data-and-setup`\n", "- {ref}`06-scalers-built-in-scalers`\n", "- {ref}`06-scalers-the-scaler-wrapper`\n", "- {ref}`06-scalers-per-sample-zero-start`\n", "- {ref}`06-scalers-per-sample-min-max`\n", "- {ref}`06-scalers-segmented-scaler`\n", "- {ref}`06-scalers-negate-and-absolute`\n", "- {ref}`06-scalers-chaining-transforms`\n", "- {ref}`06-scalers-creating-a-custom-scaler`" ] }, { "cell_type": "code", "execution_count": null, "id": "1", "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "import modularml as mml\n", "from modularml import FeatureSet, Scaler\n", "from modularml.scalers import (\n", " Absolute,\n", " Negate,\n", " PerSampleMinMaxScaler,\n", " PerSampleZeroStart,\n", " SegmentedScaler,\n", ")" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "We'll use synthetic HPPC (Hybrid Pulse Power Characterization) battery data throughout this\n", "notebook. Each sample simulates a standard HPPC pulse sequence:\n", "\n", "1. **OCV observation** (10 s) - cell resting at open-circuit voltage\n", "2. **Charge pulse** (10 s) — 1.2 A applied; ohmic jump then exponential rise\n", "3. **Rest after charge** (40 s) — current removed; ohmic recovery then slow relaxation\n", "4. **Discharge pulse** (10 s) — 1.2 A drawn; ohmic drop then exponential decay\n", "5. **Rest after discharge** (40 s) — ohmic recovery then slow relaxation back to OCV\n", "\n", "Cells span a range of state-of-health (SOH) values, degrading from 100% to ~50%." 
] }, { "cell_type": "code", "execution_count": null, "id": "3", "metadata": {}, "outputs": [], "source": [ "from utils.hppc_data_gen import get_mock_hppc_data\n", "\n", "voltage, soh, cell_ids, group_ids = get_mock_hppc_data(n_samples=1000)\n", "\n", "print(f\"Samples: {voltage.shape[0]}\")\n", "print(f\"Voltage shape: {voltage.shape}\")\n", "print(f\"OCV range: [{voltage[:, 0].min():.2f}, {voltage[:, 0].max():.2f}] V\")\n", "print(f\"SOH range: [{soh.min():.1f}, {soh.max():.1f}] %\")\n", "print(f\"Voltage overall: [{voltage.min():.3f}, {voltage.max():.3f}] V\")" ] }, { "cell_type": "markdown", "id": "4", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "(06-scalers-data-and-setup)=\n", "## Data and Setup" ] }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "fs = FeatureSet.from_dict(\n", " label=\"HPPCData\",\n", " data={\n", " \"voltage\": voltage.tolist(),\n", " \"soh\": soh.tolist(),\n", " \"cell_id\": cell_ids.tolist(),\n", " \"group_id\": group_ids.tolist(),\n", " },\n", " feature_keys=\"voltage\",\n", " target_keys=\"soh\",\n", " tag_keys=[\"cell_id\", \"group_id\"],\n", ")\n", "print(fs)\n", "print(f\"Feature shapes: {fs.get_feature_shapes()}\")" ] }, { "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "Split by cell group to prevent data leakage between train / val / test." ] }, { "cell_type": "code", "execution_count": null, "id": "8", "metadata": {}, "outputs": [], "source": [ "fs.split_random(\n", " ratios={\"train\": 0.6, \"val\": 0.2, \"test\": 0.2},\n", " group_by=\"group_id\",\n", " seed=42,\n", ")\n", "\n", "for name, view in fs.splits.items():\n", " groups = view.get_tags(fmt=\"numpy\", tags=\"group_id\")\n", " print(f\" {name}: {len(view)} samples, groups: {np.unique(groups)}\")" ] }, { "cell_type": "markdown", "id": "9", "metadata": {}, "source": [ "Define a reusable plotting helper. 
Each split gets its own panel;\n", "traces are colored by SOH (dark blue = high, light blue = low)." ] }, { "cell_type": "code", "execution_count": null, "id": "10", "metadata": {}, "outputs": [], "source": [ "def plot_timeseries(\n", " fs: FeatureSet,\n", " columns: str | list[str],\n", " splits: list[str] | None = None,\n", " n_samples: int = 100,\n", " color_by: str = \"targets.soh.raw\",\n", " color_vbounds: tuple = (50, 100),\n", " xlabel: str = \"Time (s)\",\n", " ylabel: str = \"Voltage (V)\",\n", " clabel: str = \"SOH (%)\",\n", " marker=\"-\",\n", " seed: int = 13,\n", "):\n", " \"\"\"\n", " Plot time-series columns from a FeatureSet, one panel per split.\n", "\n", " Args:\n", " fs: FeatureSet to visualise.\n", " columns: Fully-qualified column name(s), e.g. ``\"features.voltage.raw\"``.\n", " Multiple columns are flattened and horizontally stacked.\n", " splits: Splits to include. Defaults to all registered splits.\n", " n_samples: Number of traces to draw per panel.\n", " color_by: Fully-qualified scalar column used for the colormap.\n", " color_vbounds: ``(vmin, vmax)`` for the colormap.\n", " xlabel: Axis x-label.\n", " ylabel: Axis y-label.\n", " clabel: Colorbar label.\n", " marker: Marker style.\n", " seed: RNG seed for reproducible sample selection.\n", "\n", " \"\"\"\n", "\n", " def order_splits(values: list[str]) -> list[str]:\n", " priority = {\"train\": 0, \"val\": 1, \"test\": 2}\n", " return sorted(values, key=lambda x: priority.get(x, 99))\n", "\n", " rng = np.random.default_rng(seed)\n", " scm = plt.cm.ScalarMappable(\n", " cmap=plt.cm.Blues,\n", " norm=plt.Normalize(vmin=color_vbounds[0], vmax=color_vbounds[1]),\n", " )\n", "\n", " columns = columns if isinstance(columns, list) else [columns]\n", " split_names = order_splits(splits or fs.available_splits)\n", "\n", " fig, axes = plt.subplots(\n", " figsize=(7, 2.5),\n", " ncols=len(split_names),\n", " sharex=True,\n", " sharey=True,\n", " )\n", "\n", " for i, split_label in 
enumerate(split_names):\n", " split_view = fs.get_split(split_label)\n", "\n", " res = split_view.get_data(\n", " columns=columns,\n", " fmt=\"dict_numpy\",\n", " include_domain_prefix=True,\n", " include_rep_suffix=True,\n", " )\n", "\n", " # Match user-supplied column specs to actual keys returned by get_data\n", " ordered_keys = []\n", " for c in columns:\n", " parts = [c.replace(\"*\", \"\")]\n", " if \".\" in c:\n", " parts = c.replace(\"*\", \"\").split(\".\")\n", " for k in res:\n", " if any(p == k.split(\".\")[1] for p in parts):\n", " ordered_keys.append(k)\n", " break\n", "\n", " color_vals = split_view.get_data(columns=[color_by], fmt=\"np\").reshape(-1)\n", " flat_data = np.column_stack(\n", " [res[k].reshape(len(color_vals), -1) for k in ordered_keys],\n", " )\n", "\n", " sample_idxs = rng.choice(np.arange(len(color_vals)), size=n_samples)\n", " for idx in sample_idxs:\n", " axes[i].plot(flat_data[idx], marker, color=scm.to_rgba(color_vals[idx]))\n", "\n", " axes[i].set_title(split_label, fontsize=10)\n", " axes[i].set_xlabel(xlabel, fontsize=10)\n", "\n", " axes[0].set_ylabel(ylabel, fontsize=10)\n", " fig.tight_layout(pad=1)\n", " fig.subplots_adjust(right=0.85)\n", " cbar_ax = fig.add_axes([0.87, 0.19, 0.02, 0.7])\n", " cbar = fig.colorbar(scm, cax=cbar_ax)\n", " cbar.set_label(clabel, fontsize=10)\n", " return fig, axes" ] }, { "cell_type": "code", "execution_count": null, "id": "11", "metadata": {}, "outputs": [], "source": [ "fig, axes = plot_timeseries(fs, columns=\"features.voltage.raw\")\n", "plt.suptitle(\"Raw HPPC voltage\", y=1.02, fontsize=11)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "12", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "13", "metadata": {}, "source": [ "(06-scalers-built-in-scalers)=\n", "## Built-in Scalers\n", "\n", "All registered scalers are accessible via `mml.supported_scalers`.\n", "The registry contains both ModularML-native and scikit-learn scalers." 
] }, { "cell_type": "code", "execution_count": null, "id": "14", "metadata": {}, "outputs": [], "source": [ "mml.supported_scalers" ] }, { "cell_type": "markdown", "id": "15", "metadata": {}, "source": [ "The ModularML-native (non-sklearn) scalers are:\n", "\n", "| Scaler | What it does |\n", "|--------|--------------|\n", "| `PerSampleZeroStart` | Shifts each sample so its first value equals zero |\n", "| `PerSampleMinMaxScaler` | Scales each sample independently to a target range (default [0, 1]) |\n", "| `SegmentedScaler` | Applies independent scalers to contiguous feature sub-regions |\n", "| `Negate` | Multiplies all values by −1 |\n", "| `Absolute` | Replaces each value with its absolute value |" ] }, { "cell_type": "markdown", "id": "16", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "17", "metadata": {}, "source": [ "(06-scalers-the-scaler-wrapper)=\n", "## The Scaler Wrapper\n", "\n", "`Scaler` is a thin adapter that gives any sklearn-compatible transformer a consistent ModularML\n", "interface. 
It can be constructed three ways:\n", "\n", "```python\n", "Scaler(\"MinMaxScaler\") # by registry name (case-insensitive)\n", "Scaler(MinMaxScaler) # by class\n", "Scaler(MinMaxScaler(clip=True)) # by instance\n", "```\n", "\n", "You can also pass a string, class, or instance **directly** to `FeatureSet.fit_transform` — it\n", "will be wrapped automatically.\n", "\n", "Key methods:\n", "\n", "| Method | Description |\n", "|--------|-------------|\n", "| `fit(X)` | Learn parameters from data |\n", "| `transform(X)` | Apply the fitted transform |\n", "| `fit_transform(X)` | Fit and transform in one step |\n", "| `inverse_transform(X)` | Reverse the transform (if supported) |\n", "| `clone_unfitted()` | Return a fresh copy with the same config but no learned state |" ] }, { "cell_type": "code", "execution_count": null, "id": "18", "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import MinMaxScaler\n", "\n", "# Three equivalent constructors\n", "s1 = Scaler(\"MinMaxScaler\")\n", "s2 = Scaler(MinMaxScaler)\n", "s3 = Scaler(MinMaxScaler())\n", "\n", "print(f\"Name: {s1.scaler_name}\")\n", "print(f\"Is fit (before): {s1._is_fit}\")\n", "\n", "X = np.random.default_rng(0).normal(size=(10, 5))\n", "s1.fit(X)\n", "print(f\"Is fit (after): {s1._is_fit}\")\n", "\n", "X_scaled = s1.transform(X)\n", "print(f\"Scaled range: [{X_scaled.min():.3f}, {X_scaled.max():.3f}]\")\n", "\n", "# Clone: same config, no learned state\n", "s1_clone = s1.clone_unfitted()\n", "print(f\"Clone is fit: {s1_clone._is_fit}\")" ] }, { "cell_type": "markdown", "id": "19", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "20", "metadata": {}, "source": [ "(06-scalers-per-sample-zero-start)=\n", "## PerSampleZeroStart\n", "\n", "**What it does:** Subtracts the first value of each sample from every element in that sample:\n", "\n", "$$x_i^\\prime = x_i - x_i[0]$$\n", "\n", "**Why it's useful for HPPC data:** The absolute OCV varies with SOC and cell-to-cell 
spread\n", "(here 2.0–3.6 V). Subtracting the initial value removes this offset so all traces start at\n", "zero and only the *delta-V* response to the current pulse is retained. Models trained on\n", "zero-started data learn the electrochemical dynamics rather than the SOC level.\n", "\n", "**Fitting behaviour:** Statistics are computed per-sample at `transform` time (no global\n", "statistics), so `fit_to_split` does not affect the result. Specifying it is still good practice." ] }, { "cell_type": "code", "execution_count": null, "id": "21", "metadata": {}, "outputs": [], "source": [ "fs.undo_all_transforms()\n", "\n", "fs.fit_transform(\n", " scaler=PerSampleZeroStart,\n", " domain=\"features\",\n", " keys=\"voltage\",\n", " fit_to_split=\"train\",\n", ")\n", "\n", "v_raw = fs[\"train\"].get_features(fmt=\"numpy\", features=\"voltage\", rep=\"raw\")\n", "v_zs = fs[\"train\"].get_features(fmt=\"numpy\", features=\"voltage\", rep=\"transformed\")\n", "\n", "print(\n", " f\"Raw first-value range: [{v_raw[:, 0].min():.3f}, {v_raw[:, 0].max():.3f}] V\"\n", ")\n", "print(f\"ZeroStart first-value |max|: {np.abs(v_zs[:, 0]).max():.3e} (all ~0)\")\n", "\n", "fig, axes = plot_timeseries(fs, columns=\"features.voltage.transformed\")\n", "plt.suptitle(\"After PerSampleZeroStart\", y=1.02, fontsize=11)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "22", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "23", "metadata": {}, "source": [ "(06-scalers-per-sample-min-max)=\n", "## PerSampleMinMaxScaler\n", "\n", "**What it does:** Scales each sample independently so its values span `feature_range` (default [0, 1]):\n", "\n", "$$x_i^\\prime = \\frac{x_i - \\min(x_i)}{\\max(x_i) - \\min(x_i)}$$\n", "\n", "**Contrast with sklearn's `MinMaxScaler`:** sklearn's version computes min and max *across the\n", "training set* — one scalar per feature dimension. 
`PerSampleMinMaxScaler` uses per-sample\n", "statistics, making it invariant to the absolute voltage level and amplitude differences between\n", "cells and SOC levels.\n", "\n", "**Fitting behaviour:** Like `PerSampleZeroStart`, statistics are recomputed per-sample at\n", "transform time." ] }, { "cell_type": "code", "execution_count": null, "id": "24", "metadata": {}, "outputs": [], "source": [ "fs.undo_all_transforms()\n", "\n", "fs.fit_transform(\n", " scaler=PerSampleMinMaxScaler(),\n", " domain=\"features\",\n", " keys=\"voltage\",\n", " fit_to_split=\"train\",\n", ")\n", "\n", "v_scaled = fs[\"train\"].get_features(fmt=\"numpy\", features=\"voltage\", rep=\"transformed\")\n", "fig, axes = plot_timeseries(fs, columns=\"features.voltage.transformed\")\n", "plt.suptitle(\"After PerSampleMinMaxScaler\", y=1.02, fontsize=11)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "25", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "26", "metadata": {}, "source": [ "(06-scalers-segmented-scaler)=\n", "## SegmentedScaler\n", "\n", "**What it does:** Partitions the feature vector into contiguous segments and fits an independent\n", "scaler on each segment:\n", "\n", "```\n", "[ ─── segment 0 ─── | ─── segment 1 ─── | ─── ... ─── ]\n", " boundaries[0:1] boundaries[1:2]\n", "```\n", "\n", "A cloned copy of the template scaler is fit independently on each slice of the training data.\n", "\n", "**Why it's useful for HPPC data:** The OCV, charge pulse, rest, discharge, and rest regions\n", "occupy very different voltage ranges. A single global scaler compresses or expands regions\n", "unevenly. 
`SegmentedScaler` applies an independent normalization to each protocol region,\n", "preserving its full dynamic range.\n", "\n", "**Boundaries** must be a tuple of strictly increasing integers, starting at 0 and ending at the\n", "total feature length.\n", "\n", "**Fitting behaviour:** When the template scaler learns global statistics (e.g., `MinMaxScaler`), it learns them\n", "*across training samples* for each segment, so `fit_to_split=\"train\"` **is** important here.\n", "Per-sample templates such as `PerSampleZeroStart` learn nothing during `fit`." ] }, { "cell_type": "code", "execution_count": null, "id": "27", "metadata": {}, "outputs": [], "source": [ "import itertools\n", "\n", "# Visualise the HPPC segment layout on a representative trace\n", "SEGMENT_LABELS = [\"OCV\", \"Charge\", \"Rest 1\", \"Discharge\", \"Rest 2\"]\n", "SEGMENT_COLORS = [\"#cce5f0\", \"#f0cccc\", \"#ccf0cc\", \"#f0e0cc\", \"#e0ccf0\"]\n", "HPPC_BOUNDARIES = [0, 9, 20, 59, 70, 110]\n", "\n", "\n", "fig, ax = plt.subplots(figsize=(8, 3))\n", "sample = fs.get_features(fmt=\"numpy\", features=\"voltage\", rep=\"raw\")[0][0]\n", "ax.plot(sample, \"k-\", lw=1.5)\n", "\n", "# Shade each segment and label it at the bottom of the axes\n", "for (start, end), color, label in zip(\n", "    itertools.pairwise(HPPC_BOUNDARIES),\n", "    SEGMENT_COLORS,\n", "    SEGMENT_LABELS,\n", "    strict=True,\n", "):\n", "    ax.axvspan(start, end - 0.5, alpha=0.4, color=color)\n", "    ax.text(\n", "        (start + end) / 2, ax.get_ylim()[0], label, ha=\"center\", va=\"bottom\", fontsize=8\n", "    )\n", "\n", "ax.set_xlabel(\"Time (s)\")\n", "ax.set_ylabel(\"Voltage (V)\")\n", "ax.set_title(\n", "    f\"HPPC Protocol - SegmentedScaler boundaries: {HPPC_BOUNDARIES}\", fontsize=10\n", ")\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "28", "metadata": {}, "outputs": [], "source": [ "fs.undo_all_transforms()\n", "\n", "# Four segments: OCV + charge pulse (merged), rest, discharge pulse, rest\n", "new_bounds = [0, 20, 60, 70, 110]\n", "fs.fit_transform(\n", "    scaler=SegmentedScaler(boundaries=new_bounds, scaler=PerSampleZeroStart),\n", "    domain=\"features\",\n", "    keys=\"voltage\",\n", "    fit_to_split=\"train\",\n", ")\n", "\n", "v_seg = fs[\"train\"].get_features(fmt=\"numpy\", features=\"voltage\", rep=\"transformed\")\n", "fig, axes = plot_timeseries(fs, columns=\"features.voltage.transformed\", marker=\".\")\n", "plt.suptitle(\"After SegmentedScaler(PerSampleZeroStart)\", y=1.03, fontsize=11)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "29", "metadata": {}, "source": [ "With `PerSampleZeroStart` as the template, each segment of every trace restarts at zero.\n", "With a fitted template such as `MinMaxScaler`, values in the val / test panels may slightly exceed\n", "[0, 1] because those splits contain cell groups not seen during training." ] }, { "cell_type": "markdown", "id": "30", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "31", "metadata": {}, "source": [ "(06-scalers-negate-and-absolute)=\n", "## Negate and Absolute\n", "\n", "These are simple element-wise transforms, most useful as building blocks in a transform chain.\n", "\n", "- **`Negate`**: multiplies every value by −1. Useful when a model or loss expects positive\n", "  values (e.g., converting a discharge voltage drop to a positive deviation).\n", "- **`Absolute`**: replaces every value with its absolute value, recording the per-element sign\n", "  mask so the transform can be inverted exactly.\n", "\n", "Both have a no-op `fit`, so `fit_to_split` has no effect on the output."
] }, { "cell_type": "code", "execution_count": null, "id": "32", "metadata": {}, "outputs": [], "source": [ "fs.undo_all_transforms()\n", "\n", "# Negate\n", "fs.fit_transform(\n", " scaler=Negate(), domain=\"features\", keys=\"voltage\", fit_to_split=\"train\"\n", ")\n", "\n", "v_raw = fs[\"train\"].get_features(fmt=\"numpy\", features=\"voltage\", rep=\"raw\")\n", "v_neg = fs[\"train\"].get_features(fmt=\"numpy\", features=\"voltage\", rep=\"transformed\")\n", "print(\"Negate\")\n", "print(f\" Raw range: [{v_raw.min():.3f}, {v_raw.max():.3f}] V\")\n", "print(f\" Negated range: [{v_neg.min():.3f}, {v_neg.max():.3f}] V\")\n", "\n", "fs.undo_all_transforms()" ] }, { "cell_type": "code", "execution_count": null, "id": "33", "metadata": {}, "outputs": [], "source": [ "# Absolute\n", "# Apply ZeroStart first so voltage deviations straddle zero (+/-);\n", "# Absolute then maps both directions into the positive half-plane.\n", "fs.fit_transform(\n", " scaler=PerSampleZeroStart, domain=\"features\", keys=\"voltage\", fit_to_split=\"train\"\n", ")\n", "fs.fit_transform(\n", " scaler=Absolute(), domain=\"features\", keys=\"voltage\", fit_to_split=\"train\"\n", ")\n", "\n", "v_abs = fs[\"train\"].get_features(fmt=\"numpy\", features=\"voltage\", rep=\"transformed\")\n", "print(\"Absolute (after PerSampleZeroStart)\")\n", "print(f\" Min value: {v_abs.min():.6f} (all non-negative)\")\n", "\n", "fs.undo_all_transforms()" ] }, { "cell_type": "markdown", "id": "34", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "35", "metadata": {}, "source": [ "(06-scalers-chaining-transforms)=\n", "## Chaining Transforms\n", "\n", "Each call to `fit_transform` builds on the current `\"transformed\"` representation, creating a\n", "chain that can be unwound one step at a time (`undo_last_transform`) or all at once\n", "(`undo_all_transforms`).\n", "\n", "A natural preprocessing pipeline for HPPC voltage:\n", "\n", "1. 
**PerSampleZeroStart** — removes the per-sample OCV offset so all traces start at 0\n", "2. **SegmentedScaler(MinMaxScaler)** — independently normalises each protocol region to [0, 1]\n", "3. **MinMaxScaler** on the SOH target — normalises the prediction target across training samples" ] }, { "cell_type": "code", "execution_count": null, "id": "36", "metadata": {}, "outputs": [], "source": [ "fs.undo_all_transforms()\n", "\n", "# Step 1: Remove per-sample OCV offset\n", "fs.fit_transform(\n", " scaler=PerSampleZeroStart,\n", " domain=\"features\",\n", " keys=\"voltage\",\n", " fit_to_split=\"train\",\n", ")\n", "\n", "fig, axes = plot_timeseries(fs, columns=\"features.voltage.transformed\")\n", "plt.suptitle(\"Step 1 - PerSampleZeroStart\", y=1.02, fontsize=11)\n", "plt.show()\n", "\n", "# Step 2: Normalise each HPPC segment independently\n", "fs.fit_transform(\n", " scaler=SegmentedScaler(boundaries=new_bounds, scaler=\"MinMaxScaler\"),\n", " domain=\"features\",\n", " keys=\"voltage\",\n", " fit_to_split=\"train\",\n", ")\n", "\n", "fig, axes = plot_timeseries(fs, columns=\"features.voltage.transformed\")\n", "plt.suptitle(\"Step 2 - SegmentedScaler(MinMaxScaler)\", y=1.02, fontsize=11)\n", "plt.show()\n", "\n", "# Step 3: Scale SOH target across training samples\n", "fs.fit_transform(\n", " scaler=MinMaxScaler(),\n", " domain=\"targets\",\n", " keys=\"soh\",\n", " fit_to_split=\"train\",\n", ")\n", "\n", "soh_raw = fs[\"test\"].get_targets(fmt=\"numpy\", targets=\"soh\", rep=\"raw\")\n", "soh_scaled = fs[\"test\"].get_targets(fmt=\"numpy\", targets=\"soh\", rep=\"transformed\")\n", "print(f\"SOH raw range: [{soh_raw.min():.1f}, {soh_raw.max():.1f}] %\")\n", "print(f\"SOH scaled range: [{soh_scaled.min():.3f}, {soh_scaled.max():.3f}]\")" ] }, { "cell_type": "markdown", "id": "37", "metadata": {}, "source": [ "Undo the last feature transform (SegmentedScaler) while keeping PerSampleZeroStart active." 
] }, { "cell_type": "code", "execution_count": null, "id": "38", "metadata": {}, "outputs": [], "source": [ "fs.undo_last_transform(domain=\"features\", keys=\"voltage\")\n", "\n", "v = fs[\"train\"].get_features(fmt=\"numpy\", features=\"voltage\", rep=\"transformed\")\n", "print(\"After undoing SegmentedScaler (PerSampleZeroStart still active):\")\n", "print(\n", " f\" First-value |max|: {np.abs(v[:, 0]).max():.2e} (ZeroStart preserved, all ~0)\"\n", ")\n", "print(\n", " f\" Overall range: [{v.min():.3f}, {v.max():.3f}] (no longer bounded to [0,1])\"\n", ")\n", "\n", "# SOH target transform is unaffected (different domain)\n", "soh_scaled = fs[\"test\"].get_targets(fmt=\"numpy\", targets=\"soh\", rep=\"transformed\")\n", "print(\n", " f\" SOH scaled range: [{soh_scaled.min():.3f}, {soh_scaled.max():.3f}] (unchanged)\"\n", ")\n", "\n", "fig, axes = plot_timeseries(fs, columns=\"features.voltage.transformed\")\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "39", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "40", "metadata": {}, "source": [ "(06-scalers-creating-a-custom-scaler)=\n", "## Creating a Custom Scaler\n", "\n", "Any object that satisfies the scikit-learn estimator interface can be wrapped by `Scaler` and\n", "used directly in `FeatureSet.fit_transform`:\n", "\n", "```python\n", "class MyScaler(BaseEstimator, TransformerMixin):\n", " def fit(self, X, y=None): ...\n", " def transform(self, X): ...\n", " def inverse_transform(self, X): ... # optional but enables undo\n", "```\n", "\n", "The hard requirements are `fit` and `transform`. 
Adding `inverse_transform` enables\n", "`undo_last_transform` and `unscale_data_for_cols`.\n", "\n", "### Example: `PerSampleStandardScaler`\n", "\n", "sklearn's `StandardScaler` standardizes *across samples* using training-set statistics.\n", "The custom scaler below standardizes *each sample independently*, which is useful when every\n", "measurement has its own mean and variance that should be removed.\n", "\n", "*Note: Custom scalers cannot be serialized unless they are defined in a separate file\n", "or registered to the `mml.supported_scalers` registry.*" ] }, { "cell_type": "code", "execution_count": null, "id": "41", "metadata": {}, "outputs": [], "source": [ "from sklearn.base import BaseEstimator, TransformerMixin\n", "\n", "\n", "class PerSampleStandardScaler(BaseEstimator, TransformerMixin):\n", "    \"\"\"\n", "    Standardize each sample to zero mean and unit variance.\n", "\n", "    Unlike sklearn's ``StandardScaler``, statistics are computed per sample\n", "    at transform time; no global state is learned from the training set.\n", "    \"\"\"\n", "\n", "    def __init__(self):\n", "        self._sample_mean = None\n", "        self._sample_std = None\n", "\n", "    def fit(self, X, y=None):\n", "        # No global statistics to learn; all computation is deferred to transform\n", "        return self\n", "\n", "    def transform(self, X):\n", "        # Convert first so that array-likes (e.g. lists) also expose .ndim / .shape\n", "        X = np.asarray(X)\n", "        if X.ndim != 2:\n", "            msg = f\"Expected 2D array, got shape {X.shape}\"\n", "            raise ValueError(msg)\n", "        self._sample_mean = X.mean(axis=1, keepdims=True)\n", "        self._sample_std = X.std(axis=1, keepdims=True)\n", "        # Guard against constant samples (std = 0)\n", "        self._sample_std = np.where(self._sample_std == 0, 1.0, self._sample_std)\n", "        return (X - self._sample_mean) / self._sample_std\n", "\n", "    def inverse_transform(self, X):\n", "        # Uses the statistics cached by the most recent transform call,\n", "        # so X must correspond to that same batch of samples\n", "        if self._sample_mean is None:\n", "            raise RuntimeError(\"Scaler has not been applied yet.\")\n", "        return X * self._sample_std + self._sample_mean" ] }, { "cell_type": "code", "execution_count": null, "id": "42", "metadata": {}, "outputs": [], "source": [ "fs.undo_all_transforms()\n", "\n", "# Using the notebook-defined class before registering it may raise a RuntimeError\n", "try:\n", "    fs.fit_transform(\n", "        scaler=Scaler(PerSampleStandardScaler()),\n", "        domain=\"features\",\n", "        keys=\"voltage\",\n", "        fit_to_split=\"train\",\n", "    )\n", "except RuntimeError as e:\n", "    print(e)\n", "\n", "# We can register it in the built-in scaler registry\n", "mml.scaler_registry.register(\n", "    \"PerSampleStandardScaler\",\n", "    PerSampleStandardScaler,\n", ")\n", "\n", "# And now we can use the scaler\n", "fs.fit_transform(\n", "    scaler=PerSampleStandardScaler(),\n", "    domain=\"features\",\n", "    keys=\"voltage\",\n", "    fit_to_split=\"train\",\n", ")\n", "fig, axes = plot_timeseries(fs, columns=\"features.voltage.transformed\")\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "43", "metadata": {}, "source": [ "However, registering a custom class in the built-in registry only makes it usable in the current environment.\n", "Restarting the kernel clears user-added entries from the registry.\n", "\n", "For more robust serialization, move the scaler class to its own Python file."
] }, { "cell_type": "code", "execution_count": null, "id": "44", "metadata": {}, "outputs": [], "source": [ "from utils.my_scaler import PerSampleStandardScaler as MyScaler\n", "\n", "fs.undo_all_transforms()\n", "fs.fit_transform(\n", " scaler=MyScaler(),\n", " domain=\"features\",\n", " keys=\"voltage\",\n", " fit_to_split=\"train\",\n", ")\n", "\n", "v_std = fs[\"train\"].get_features(fmt=\"numpy\", features=\"voltage\", rep=\"transformed\")\n", "fig, axes = plot_timeseries(fs, columns=\"features.voltage.transformed\")\n", "plt.suptitle(\"After PerSampleStandardScaler (custom)\", y=1.02, fontsize=11)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "45", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "46", "metadata": {}, "source": [ "## Summary\n", "\n", "| Task | Code |\n", "|------|------|\n", "| List all scalers | `mml.supported_scalers` |\n", "| Create scaler by name | `Scaler(\"MinMaxScaler\")` |\n", "| Create scaler by instance | `Scaler(MinMaxScaler())` |\n", "| Apply a transform | `fs.fit_transform(scaler, domain, keys, fit_to_split)` |\n", "| Undo last transform | `fs.undo_last_transform(domain, keys)` |\n", "| Undo all transforms | `fs.undo_all_transforms()` |\n", "| Per-sample OCV removal | `PerSampleZeroStart` |\n", "| Per-sample normalisation | `PerSampleMinMaxScaler()` |\n", "| Segment-wise normalisation | `SegmentedScaler(boundaries=(...), scaler=\"MinMaxScaler\")` |\n", "| Sign flip | `Negate()` |\n", "| Absolute value | `Absolute()` |\n", "| Custom scaler | Subclass `BaseEstimator, TransformerMixin`; implement `fit` + `transform` |\n", "| Inverse-scale external data | `fs.unscale_data_for_cols(data, domain, columns)` |" ] } ], "metadata": { "kernelspec": { "display_name": ".venv (3.10.18)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", 
"pygments_lexer": "ipython3", "version": "3.10.18" } }, "nbformat": 4, "nbformat_minor": 5 }