{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# How to: Use Cross-Validation\n", "\n", "Cross-validation (CV) evaluates how well a model generalizes by repeatedly training\n", "and validating on rotating subsets of data. ModularML provides `CrossValidation`\n", "and `CVBinding` to integrate CV directly with the `Experiment` API.\n", "\n", "> **Prerequisites:** This notebook uses `Evaluation` and `EvalLossMetric` callbacks.\n", "> Read {doc}`08_use_callbacks` first if you are not familiar with them.\n", "\n", "This notebook covers:\n", "\n", "- {ref}`09-cv-dataset`\n", "- {ref}`09-cv-model-experiment`\n", "- {ref}`09-cv-execution-plan`\n", "- {ref}`09-cv-binding`\n", "- {ref}`09-cv-running`\n", "- {ref}`09-cv-results`\n", "- {ref}`09-cv-summary`" ] }, { "cell_type": "code", "execution_count": null, "id": "1", "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "\n", "from modularml import (\n", " AppliedLoss,\n", " EvalPhase,\n", " Experiment,\n", " FeatureSet,\n", " Loss,\n", " ModelGraph,\n", " ModelNode,\n", " Optimizer,\n", " TrainPhase,\n", ")\n", "from modularml.models.torch import SequentialMLP\n", "from modularml.samplers import SimpleSampler" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "(09-cv-dataset)=\n", "## Dataset Setup\n", "\n", "We create a synthetic dataset that mimics a battery health monitoring scenario:\n", "50 sensors, each providing 10 voltage readings, with a scalar state-of-health\n", "target (`soh`). A `sensor_id` tag column identifies which sensor each sample\n", "belongs to.\n", "\n", "We split the data into two stages:\n", "\n", "1. **Source / Test split**: separate sensors held out for final testing from those\n", " used for cross-validation. We keep sensor groups intact (`group_by=\"sensor_id\"`).\n", "2. 
**Train / Val split within source**: randomly divide the source readings into\n", " `train` and `val` splits. These are the splits that will rotate during CV." ] }, { "cell_type": "code", "execution_count": null, "id": "4", "metadata": {}, "outputs": [], "source": [ "rng = np.random.default_rng(13)\n", "n_sensors = 50\n", "n_readings_per_sensor = 10\n", "n_samples = n_sensors * n_readings_per_sensor # 500 total\n", "\n", "# sensor_id repeats for each reading within a sensor\n", "sensor_ids = np.repeat(np.arange(n_sensors), n_readings_per_sensor).astype(str)\n", "\n", "fs = FeatureSet.from_dict(\n", " label=\"SensorData\",\n", " data={\n", " \"voltage\": list(rng.standard_normal((n_samples, 50))),\n", " \"soh\": list(rng.standard_normal((n_samples, 1))),\n", " \"sensor_id\": list(sensor_ids),\n", " },\n", " feature_keys=\"voltage\",\n", " target_keys=\"soh\",\n", " tag_keys=\"sensor_id\",\n", ")\n", "print(f\"Total samples: {len(fs)}\")\n", "print(f\"Tags: {fs.get_tag_keys()}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "5", "metadata": {}, "outputs": [], "source": [ "fs.clear_splits()\n", "\n", "# Stage 1: split by sensor_id - keeps all readings from a sensor together\n", "# source = 40 sensors (80%), test = 10 sensors (20%)\n", "fs.split_random(\n", " ratios={\"source\": 0.8, \"test\": 0.2},\n", " group_by=\"sensor_id\",\n", " seed=13,\n", ")\n", "\n", "# Stage 2: randomly split source readings into train and val\n", "fs.get_split(\"source\").split_random(\n", " ratios={\"train\": 0.7, \"val\": 0.3},\n", " seed=42,\n", ")\n", "\n", "fs.visualize()" ] }, { "cell_type": "markdown", "id": "6", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "(09-cv-model-experiment)=\n", "## Model and Experiment\n", "\n", "We set up a simple MLP model graph and create an experiment.\n", "The setup is identical to the approach in {doc}`05_create_experiment`." 
] }, { "cell_type": "code", "execution_count": null, "id": "8", "metadata": {}, "outputs": [], "source": [ "fs_ref = fs.reference(features=\"voltage\", targets=\"soh\")\n", "\n", "mn_mlp = ModelNode(\n", " label=\"MLP\",\n", " model=SequentialMLP(output_shape=(1, 1), n_layers=2, hidden_dim=16),\n", " upstream_ref=fs_ref,\n", ")\n", "\n", "graph = ModelGraph(\n", " label=\"SimpleGraph\",\n", " nodes=[mn_mlp],\n", " optimizer=Optimizer(\"adam\", opt_kwargs={\"lr\": 1e-3}, backend=\"torch\"),\n", ")\n", "graph.build()\n", "graph.visualize()\n", "\n", "exp = Experiment.from_active_context(label=\"my_experiment\")" ] }, { "cell_type": "markdown", "id": "9", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "10", "metadata": {}, "source": [ "(09-cv-execution-plan)=\n", "## Execution Plan\n", "\n", "We define the execution plan that will run **inside each fold** of the\n", "cross-validation. This plan consists of:\n", "\n", "1. A `TrainPhase` on the `train` split, with an `Evaluation` callback that\n", " monitors validation loss on the `val` split after every epoch.\n", "2. A final `EvalPhase` on the held-out `test` split." 
] }, { "cell_type": "code", "execution_count": null, "id": "11", "metadata": {}, "outputs": [], "source": [ "from modularml.callbacks import EvalLossMetric, Evaluation\n", "\n", "mse_loss = AppliedLoss(\n", " loss=Loss(\"mse\", backend=\"torch\"),\n", " on=\"MLP\",\n", " inputs=[\"outputs\", \"targets\"],\n", ")\n", "\n", "# Evaluation callback: run on val split after every epoch\n", "eval_cb = Evaluation.from_split(\n", " label=\"eval_val\",\n", " split=\"val\",\n", " every_n_epochs=1,\n", " metrics=[\n", " EvalLossMetric(\n", " name=\"val_loss\",\n", " loss=AppliedLoss(\n", " loss=Loss(\"mse\", backend=\"torch\"),\n", " on=\"MLP\",\n", " inputs=[\"targets\", \"outputs\"],\n", " ),\n", " ),\n", " ],\n", ")\n", "\n", "train_phase = TrainPhase.from_split(\n", " label=\"train\",\n", " split=\"train\",\n", " sampler=SimpleSampler(batch_size=4, shuffle=True, seed=42),\n", " losses=[mse_loss],\n", " n_epochs=2,\n", " callbacks=[eval_cb],\n", ")\n", "\n", "# Final eval on held-out test split\n", "eval_phase = EvalPhase.from_split(\n", " label=\"eval\",\n", " split=\"test\",\n", " losses=[mse_loss],\n", ")\n", "\n", "exp.execution_plan.add_phase(train_phase)\n", "exp.execution_plan.add_phase(eval_phase)" ] }, { "cell_type": "markdown", "id": "12", "metadata": {}, "source": [ "We can verify this plan before running cross-validation with `preview_run`.\n", "Unlike the `run_` methods, `preview_` methods do not mutate the Experiment state.\n", "\n", "This allows us to verify execution plans before running a final phase sequence without worrying about accidentally pre-training the ModelGraph." 
] }, { "cell_type": "code", "execution_count": null, "id": "13", "metadata": {}, "outputs": [], "source": [ "# Verify the plan by running the experiment once before starting CV\n", "exp.preview_run()\n", "print(\"Single experiment run completed.\")" ] }, { "cell_type": "markdown", "id": "14", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "15", "metadata": {}, "source": [ "(09-cv-binding)=\n", "## CVBinding\n", "\n", "A `CVBinding` tells `CrossValidation` which `FeatureSet` to fold over and which\n", "existing splits form the CV pool.\n", "\n", "```python\n", " CVBinding(\n", " fs: str | FeatureSet,\n", " source_splits: list[str],\n", " *,\n", " group_by: str | list[str] | None = None,\n", " stratify_by: str | list[str] | None = None,\n", " train_split_name: str = \"train\",\n", " val_split_name: str = \"val\",\n", " val_size: float | None = None,\n", " )\n", "```\n", "\n", "| Parameter | Type | Default | Description |\n", "|-----------|------|---------|-------------|\n", "| `fs` | `str \\| FeatureSet` | (required) | The `FeatureSet` to fold. |\n", "| `source_splits` | `list[str]` | (required) | Existing splits to pool before folding. |\n", "| `group_by` | `str \\| list[str] \\| None` | `None` | Keep groups together across fold boundaries. |\n", "| `stratify_by` | `str \\| list[str] \\| None` | `None` | Balance strata across folds (mutually exclusive with `group_by`). |\n", "| `train_split_name` | `str` | `\"train\"` | The split name that receives each fold's training data. |\n", "| `val_split_name` | `str` | `\"val\"` | The split name that receives each fold's validation data. |\n", "| `val_size` | `float \\| None` | `None` | Explicit validation proportion per fold. If `None`, uses `1 / n_folds`. |\n", "\n", "### How folding works\n", "\n", "`source_splits` specifies which existing splits are **pooled** into the CV data.\n", "In our case `source_splits=[\"train\", \"val\"]` combines the train and val samples\n", "into one pool. 
This pool is then split into `n_folds` equal pieces. Each fold\n", "uses one piece as validation and the remainder as training, replacing the\n", "`train` and `val` splits in the `FeatureSet` for that fold's execution.\n", "\n", "*Note that we could just use the `source` split as our pool, as it is the union of the `train` and `val` samples. We use a distinct list of splits to show that any views can be merged into the CV pool; they do not need to originate from the same parent view (but they do need to belong to the same FeatureSet).*\n", "\n", "The `test` split is **not** included in `source_splits`, so it remains unchanged\n", "across all folds." ] }, { "cell_type": "code", "execution_count": null, "id": "16", "metadata": {}, "outputs": [], "source": [ "from modularml import CrossValidation, CVBinding\n", "\n", "cv = CrossValidation(\n", " bindings=CVBinding(\n", " fs=fs,\n", " source_splits=[\"train\", \"val\"],\n", " group_by=\"sensor_id\", # keep all readings from a sensor in the same fold\n", " ),\n", " n_folds=5,\n", " seed=13,\n", " experiment=exp,\n", ")\n", "print(f\"CrossValidation: {cv.n_folds} folds\")\n", "print(f\"Phase template: {[e.label for e in cv.phase_template.all]}\")" ] }, { "cell_type": "markdown", "id": "17", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "18", "metadata": {}, "source": [ "(09-cv-running)=\n", "## Running Cross-Validation\n", "\n", "Call `cv.run()` to execute all folds. For each fold, `CrossValidation`:\n", "\n", "1. Partitions the pooled source data into `n_folds` non-overlapping pieces.\n", "2. Creates a temporary context where `train` = all-but-one piece,\n", " `val` = the held-out piece, and `test` remains unchanged.\n", "3. Runs the full execution plan inside the temporary context.\n", "4. Restores the original context (the original `FeatureSet` and `Experiment` are the same after CV as they were before).\n", "\n", "`cv.run()` returns a `CVResults` object containing one `PhaseGroupResults`\n", "per fold." 
] }, { "cell_type": "code", "execution_count": null, "id": "19", "metadata": {}, "outputs": [], "source": [ "cv_res = cv.run()\n", "print(cv_res)\n", "print(f\"Fold labels: {cv_res.fold_labels}\")" ] }, { "cell_type": "markdown", "id": "20", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "21", "metadata": {}, "source": [ "(09-cv-results)=\n", "## Accessing Results\n", "\n", "### Per-fold results\n", "\n", "`CVResults` extends `PhaseGroupResults`. Each fold's results are accessed with\n", "`get_fold(i)` (by index) or `get_fold(\"fold_i\")` (by label)." ] }, { "cell_type": "code", "execution_count": null, "id": "22", "metadata": {}, "outputs": [], "source": [ "# Access the first fold\n", "fold_0 = cv_res.get_fold(0)\n", "print(f\"Fold 0 results: {fold_0}\")\n", "\n", "# Training results for fold 0\n", "train_res_0 = fold_0.get_train_result(\"train\")\n", "print(f\" train results: {train_res_0}\")\n", "\n", "# Final eval results for fold 0\n", "eval_res_0 = fold_0.get_eval_result(\"eval\")\n", "print(f\" eval results: {eval_res_0}\")" ] }, { "cell_type": "markdown", "id": "23", "metadata": {}, "source": [ "### Validation loss tracked during training\n", "\n", "The `EvalLossMetric` inside the `Evaluation` callback logged `val_loss` to\n", "the `MetricStore` each epoch. Access it via `TrainResults.metrics`." ] }, { "cell_type": "code", "execution_count": null, "id": "24", "metadata": {}, "outputs": [], "source": [ "for fold_label in cv_res.fold_labels:\n", " fold = cv_res.get_fold(fold_label)\n", " train_res = fold.get_train_result(\"train\")\n", " val_loss = train_res.metrics().where(name=\"val_loss\").last(sort_by=\"epoch\").value\n", " print(f\"{fold_label}: final val_loss = {val_loss:.4f}\")" ] }, { "cell_type": "markdown", "id": "25", "metadata": {}, "source": [ "### Cross-fold training losses\n", "\n", "`CVResults.losses()` collects training losses across all folds and returns\n", "an `AxisSeries` keyed by `(fold, epoch, batch, label)`. 
Use `.where()`, `.collapse()`, and\n", "`.at()` from the `AxisSeries` API to filter and aggregate." ] }, { "cell_type": "code", "execution_count": null, "id": "26", "metadata": {}, "outputs": [], "source": [ "# Training losses over all folds and epochs\n", "train_losses = cv_res.losses(node=\"MLP\", phase=\"train\")\n", "print(f\"Axes: {train_losses.axes}\")\n", "\n", "# Mean across batches, then across folds\n", "mean_by_epoch = (\n", " train_losses\n", " .collapse(axis=\"batch\", reducer=\"mean\")\n", " .collapse(axis=\"fold\", reducer=\"mean\")\n", " .squeeze()\n", ")\n", "print(\"Mean train loss per epoch (averaged across batches and folds):\")\n", "for epoch, loss_record in mean_by_epoch.items():\n", " print(f\" epoch {epoch}: {loss_record.trainable:.4f}\")" ] }, { "cell_type": "markdown", "id": "27", "metadata": {}, "source": [ "### Custom fold extraction with `collect()`\n", "\n", "`CVResults.collect()` applies an arbitrary extractor to each fold, returning\n", "an `AxisSeries` with a `fold` axis prepended." ] }, { "cell_type": "code", "execution_count": null, "id": "28", "metadata": {}, "outputs": [], "source": [ "# Collect the final-epoch val_loss scalar from each fold\n", "final_val_losses = cv_res.collect(\n", " lambda fold: (\n", " fold.get_train_result(\"train\")\n", " .metrics()\n", " .where(name=\"val_loss\")\n", " .last(sort_by=\"epoch\")\n", " ),\n", ")\n", "print(\"Final val_loss per fold:\")\n", "for fold_label, metric_entry in final_val_losses.items():\n", " print(f\" {fold_label}: {metric_entry.value:.4f}\")" ] }, { "cell_type": "markdown", "id": "29", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "30", "metadata": {}, "source": [ "(09-cv-summary)=\n", "## Summary\n", "\n", "### `CrossValidation` Constructor\n", "\n", "| Parameter | Type | Default | Description |\n", "|-----------|------|---------|-------------|\n", "| `bindings` | `CVBinding \\| list[CVBinding]` | (required) | Fold configurations per `FeatureSet`. 
|\n", "| `n_folds` | `int` | `5` | Number of folds. |\n", "| `seed` | `int` | `13` | Random seed for fold generation. |\n", "| `label` | `str` | `\"CV\"` | Label applied to generated fold groups. |\n", "| `phase` | `TrainPhase \\| PhaseGroup \\| None` | `None` | Phase to run per fold. If `None`, uses the experiment's execution plan. |\n", "| `experiment` | `Experiment \\| None` | `None` | Experiment to execute. Defaults to the active experiment. |\n", "\n", "### `CVBinding` Constructor\n", "\n", "| Parameter | Type | Default | Description |\n", "|-----------|------|---------|-------------|\n", "| `fs` | `str \\| FeatureSet` | (required) | `FeatureSet` to fold over. |\n", "| `source_splits` | `list[str]` | (required) | Splits pooled into the CV data. |\n", "| `group_by` | `str \\| list[str] \\| None` | `None` | Tag column(s) for group-based folding. |\n", "| `stratify_by` | `str \\| list[str] \\| None` | `None` | Tag column(s) for stratified folding. |\n", "| `train_split_name` | `str` | `\"train\"` | Split name replaced with fold training data. |\n", "| `val_split_name` | `str` | `\"val\"` | Split name replaced with fold validation data. |\n", "| `val_size` | `float \\| None` | `None` | Explicit validation size per fold (`1/n_folds` if `None`). |\n", "\n", "### `CVResults` API\n", "\n", "| Method / Property | Returns | Description |\n", "|-------------------|---------|-------------|\n", "| `n_folds` | `int` | Number of completed folds. |\n", "| `fold_labels` | `list[str]` | Fold labels in execution order. |\n", "| `get_fold(fold)` | `PhaseGroupResults` | Results for a specific fold (by index or label). |\n", "| `losses(node, phase)` | `AxisSeries[(fold, epoch, batch, label)]` | Training losses across all folds. |\n", "| `collect(extractor)` | `AxisSeries` | Apply a function to each fold; merge results with `fold` axis prepended. 
|\n", "\n", "### Data Flow During Cross-Validation\n", "\n", "```\n", "FeatureSet (unchanged after CV completes)\n", " ├─ source (pooled into CV)\n", " │ ├─ train <-- replaced with fold training data\n", " │ └─ val <-- replaced with fold validation data\n", " └─ test <-- unchanged in all folds (not in source_splits)\n", "```\n", "\n", "Each fold creates a temporary context where `train` and `val` are swapped out.\n", "The original `FeatureSet` is never mutated." ] } ], "metadata": { "kernelspec": { "display_name": ".venv (3.13.5)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.5" } }, "nbformat": 4, "nbformat_minor": 5 }