{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# How to: Use Hardware Acceleration\n", "\n", "ModularML supports GPU and Apple Silicon acceleration for both model training and\n", "data preprocessing. Device placement is configured through the `Accelerator` class\n", "and can be specified at two levels:\n", "\n", "- **Phase-level**: a default device applied to all active model nodes in that phase\n", "- **Node-level**: a per-node override that takes priority over the phase-level setting\n", "\n", "When a phase begins, ModularML pre-places all model nodes on their resolved devices\n", "and pre-materializes all input batches to device-resident tensors **before the epoch\n", "loop starts**. This eliminates repeated data-transfer overhead across epochs and\n", "can make training considerably faster, especially for larger models and datasets.\n", "\n", "This notebook covers:\n", "\n", "- {ref}`07-hw-accel-accelerator-class`\n", "- {ref}`07-hw-accel-phase-level`\n", "- {ref}`07-hw-accel-node-level`\n", "- {ref}`07-hw-accel-mixed-device`\n", "- {ref}`07-hw-accel-tensorflow`\n", "- {ref}`07-hw-accel-checking-availability`\n", "- {ref}`07-hw-accel-summary`\n", "\n", "*Note that hardware acceleration may require a different PyTorch or TensorFlow build than the one installed with `modularml`.*\n", "*For PyTorch, see the compute platform options at: [https://pytorch.org/get-started/locally/](https://pytorch.org/get-started/locally/)*" ] }, { "cell_type": "code", "execution_count": null, "id": "1", "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "\n", "from modularml import (\n", "    AppliedLoss,\n", "    EvalPhase,\n", "    Experiment,\n", "    FeatureSet,\n", "    Loss,\n", "    ModelGraph,\n", "    ModelNode,\n", "    Optimizer,\n", "    TrainPhase,\n", ")\n", "from modularml.samplers import SimpleSampler\n", "from modularml.utils.nn.accelerator import Accelerator" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "---" ] }, { 
"cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "(07-hw-accel-accelerator-class)=\n", "## The Accelerator Class\n", "\n", "The `Accelerator` class is the single configuration object for hardware device\n", "placement. It wraps a device string and an optional `pin_memory` flag, and provides\n", "backend-specific helpers for PyTorch and TensorFlow.\n", "\n", "```python\n", "Accelerator(\n", " device: str = \"cpu\",\n", " *,\n", " pin_memory: bool = False,\n", ")\n", "```\n", "\n", "| Parameter | Type | Default | Description |\n", "|-----------|------|---------|-------------|\n", "| `device` | `str` | `\"cpu\"` | Device identifier string (see table below). |\n", "| `pin_memory` | `bool` | `False` | If `True`, CPU tensors are pinned before GPU transfer, enabling asynchronous DMA. CUDA only. |" ] }, { "cell_type": "markdown", "id": "4", "metadata": {}, "source": [ "### Supported Device Strings\n", "\n", "| Device string | Backend | Meaning |\n", "|---------------|---------|--------|\n", "| `\"cpu\"` | PyTorch / TensorFlow | Host CPU |\n", "| `\"cuda\"` | PyTorch | Default CUDA GPU (index 0) |\n", "| `\"cuda:0\"`, `\"cuda:1\"`, … | PyTorch | Specific CUDA GPU by index |\n", "| `\"gpu\"` | PyTorch / TensorFlow | Generic GPU alias (maps to `cuda` / `/GPU:0`) |\n", "| `\"gpu:0\"`, `\"gpu:1\"`, … | PyTorch / TensorFlow | Generic GPU with index |\n", "| `\"mps\"` | PyTorch | Apple Silicon Metal Performance Shaders |\n", "\n", "Internally, `Accelerator.torch_device_str()` translates these to PyTorch’s format\n", "(e.g. `\"gpu:1\"` → `\"cuda:1\"`), and `Accelerator.tf_device_str()` produces the\n", "TensorFlow format (e.g. `\"cuda:1\"` → `\"/GPU:1\"`)." ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "### Constructor variants\n", "\n", "Both a direct constructor and convenience classmethods are available." 
] }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "# Direct construction\n", "acc_cpu = Accelerator(\"cpu\")\n", "acc_cuda = Accelerator(\"cuda:0\", pin_memory=True)\n", "acc_mps = Accelerator(\"mps\")\n", "\n", "# Convenience classmethods (equivalent)\n", "acc_cpu2 = Accelerator.cpu()\n", "acc_cuda2 = Accelerator.cuda(index=0, pin_memory=True)\n", "acc_mps2 = Accelerator.mps()\n", "acc_gpu = Accelerator.gpu(index=0)  # backend-agnostic alias\n", "\n", "print(f\"CPU torch device: {acc_cpu.torch_device_str()}\")\n", "print(f\"CUDA torch device: {acc_cuda.torch_device_str()}\")\n", "print(f\"MPS torch device: {acc_mps.torch_device_str()}\")\n", "print(f\"GPU torch device: {acc_gpu.torch_device_str()}\")\n", "print()\n", "print(f\"CUDA TF device: {acc_cuda.tf_device_str()}\")\n", "# TF has no MPS support, so \"mps\" maps to the CPU device string\n", "print(f\"MPS TF device: {acc_mps.tf_device_str()}\")" ] }, { "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "8", "metadata": {}, "source": [ "## Experiment Setup\n", "\n", "We will reuse a simple single-node experiment throughout this notebook.\n", "The model and data setup is similar to the experiment notebook ({doc}`05_create_experiment`);\n", "however, in this notebook we also use the `accelerator=` argument.\n", "\n", "*Note that benefits of hardware acceleration only become obvious for larger models/datasets. 
We keep the model and dataset small in this example to limit documentation compilation time.*" ] }, { "cell_type": "code", "execution_count": null, "id": "9", "metadata": {}, "outputs": [], "source": [ "from modularml.models.torch import SequentialMLP\n", "\n", "rng = np.random.default_rng(42)\n", "\n", "# Synthetic data: 500 samples, 50-d feature, 1-d target\n", "fs = FeatureSet.from_dict(\n", "    label=\"SensorData\",\n", "    data={\n", "        \"voltage\": list(rng.standard_normal((500, 50))),\n", "        \"soh\": list(rng.standard_normal((500, 1))),\n", "    },\n", "    feature_keys=\"voltage\",\n", "    target_keys=\"soh\",\n", ")\n", "fs.split_random(ratios={\"train\": 0.8, \"test\": 0.2}, seed=13)\n", "fs_ref = fs.reference(features=\"voltage\", targets=\"soh\")\n", "\n", "\n", "# Create model node\n", "mn_mlp = ModelNode(\n", "    label=\"MLP\",\n", "    model=SequentialMLP(output_shape=(1, 1), n_layers=2, hidden_dim=16),\n", "    upstream_ref=fs_ref,\n", ")\n", "\n", "# Create model graph with a global optimizer\n", "graph = ModelGraph(\n", "    label=\"SimpleGraph\",\n", "    nodes=[mn_mlp],\n", "    optimizer=Optimizer(\"adam\", opt_kwargs={\"lr\": 1e-3}, backend=\"torch\"),\n", ")\n", "\n", "# Build the graph (infers shapes)\n", "graph.build()\n", "graph.visualize()\n", "\n", "exp = Experiment.from_active_context(label=\"my_experiment\")" ] }, { "cell_type": "code", "execution_count": null, "id": "10", "metadata": {}, "outputs": [], "source": [ "# Pick the best device available on this machine\n", "def best_accelerator() -> Accelerator:\n", "    for acc in [Accelerator.gpu(), Accelerator.mps()]:\n", "        if acc.is_available():\n", "            return acc\n", "    # CPU is always available, so it serves as the fallback\n", "    return Accelerator.cpu()\n", "\n", "\n", "device = best_accelerator()\n", "print(fs)\n", "print(f\"Selected device: {device.device}\")" ] }, { "cell_type": "markdown", "id": "11", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "12", "metadata": {}, "source": [ "(07-hw-accel-phase-level)=\n", "## Phase-Level Acceleration\n", "\n", "The 
simplest way to enable GPU training is to pass an `accelerator` to the phase.\n", "All active nodes are moved to that device before the first epoch begins.\n", "\n", "The `accelerator` parameter is available on `TrainPhase`, `EvalPhase`, and `FitPhase`.\n", "It accepts either an `Accelerator` instance or a plain device string - ModularML\n", "wraps strings automatically." ] }, { "cell_type": "code", "execution_count": null, "id": "13", "metadata": {}, "outputs": [], "source": [ "mse_loss = AppliedLoss(\n", "    loss=Loss(\"mse\", backend=\"torch\"),\n", "    on=\"MLP\",\n", "    inputs=[\"outputs\", \"targets\"],\n", ")\n", "\n", "# Pass an Accelerator instance. A plain string also works and is wrapped\n", "# automatically: \"cuda:0\" == Accelerator(\"cuda:0\")\n", "train_phase = TrainPhase.from_split(\n", "    label=\"train\",\n", "    split=\"train\",\n", "    sampler=SimpleSampler(batch_size=4, shuffle=True, seed=42),\n", "    losses=[mse_loss],\n", "    n_epochs=2,\n", "    accelerator=device,\n", ")\n", "\n", "\n", "eval_phase = EvalPhase.from_split(\n", "    label=\"eval\",\n", "    split=\"test\",\n", "    losses=[mse_loss],\n", "    accelerator=device,\n", ")\n", "\n", "print(f\"TrainPhase accelerator: {train_phase.accelerator}\")\n", "print(f\"EvalPhase accelerator: {eval_phase.accelerator}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "14", "metadata": {}, "outputs": [], "source": [ "results = exp.run_phase(train_phase)\n", "print(\"Training complete.\")" ] }, { "cell_type": "markdown", "id": "15", "metadata": {}, "source": [ "### What happens under the hood\n", "\n", "When `iter_execution()` is called on the phase, two steps run **once before the\n", "epoch loop** rather than repeatedly inside it:\n", "\n", "**1. 
Node placement** - `ModelGraph.pre_place_nodes()` iterates over all active\n", "`ModelNode` instances and calls `node._ensure_node_on_device(accelerator)` on each:\n", "\n", "- `torch_module.to(\"cuda:0\")` - moves all model parameters and buffers in-place\n", "- For already-built optimizers: iterates `optimizer.instance.state` and calls\n", " `.to(\"cuda:0\")` on every momentum / variance tensor, preventing a device mismatch\n", " on the first `optimizer.step()`\n", "\n", "**2. Batch pre-materialization** - `_pre_materialize_sampler_execs()` converts every\n", "lazy `BatchView` (a zero-copy index slice into a PyArrow table) into a concrete\n", "`Batch` of torch tensors already resident on the target device.\n", "\n", "After these two steps, the epoch loop runs with zero PyArrow overhead and zero\n", "device-transfer overhead per step." ] }, { "cell_type": "markdown", "id": "16", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "17", "metadata": {}, "source": [ "(07-hw-accel-node-level)=\n", "## Node-Level Acceleration\n", "\n", "Individual nodes can declare their own accelerator directly on the `ModelNode`.\n", "A node-level accelerator always takes priority over the phase-level setting.\n", "This is useful when different nodes in the same graph should run on different devices." 
] }, { "cell_type": "code", "execution_count": null, "id": "18", "metadata": {}, "outputs": [], "source": [ "# Node explicitly pinned to CPU, regardless of phase accelerator\n", "mn_cpu = ModelNode(\n", "    label=\"MLP\",\n", "    model=SequentialMLP(output_shape=(1, 1), n_layers=2, hidden_dim=32),\n", "    upstream_ref=fs_ref,\n", "    accelerator=Accelerator.cpu(),  # node-level override\n", ")\n", "graph.replace_node(mn_mlp, mn_cpu).build()\n", "\n", "# Phase accelerator is None - node overrides it\n", "train_node_acc = TrainPhase.from_split(\n", "    label=\"train\",\n", "    split=\"train\",\n", "    sampler=SimpleSampler(batch_size=64, shuffle=True, seed=42),\n", "    losses=[mse_loss],\n", "    n_epochs=2,\n", "    accelerator=None,\n", ")\n", "\n", "print(f\"Phase accelerator: {train_node_acc.accelerator}\")\n", "print(f\"Node accelerator: {mn_cpu._accelerator}\")" ] }, { "cell_type": "markdown", "id": "19", "metadata": {}, "source": [ "### Priority rules\n", "\n", "ModularML resolves the effective device for each node with the following priority:\n", "\n", "```\n", "node._accelerator > phase.accelerator > CPU (fallback)\n", "```\n", "\n", "This logic lives in `ModelGraph._resolve_node_accelerator(node, phase_accelerator)`:\n", "\n", "```python\n", "node_acc = getattr(node, \"_accelerator\", None)\n", "if node_acc is not None:\n", "    return node_acc  # node-level wins\n", "return phase_accelerator  # phase-level wins (may be None -> CPU)\n", "```\n", "\n", "If neither the node nor the phase specifies an accelerator, `pre_place_nodes` falls\n", "back to `Accelerator(\"cpu\")` so all nodes are always in a known, deterministic state." 
] }, { "cell_type": "markdown", "id": "20", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "21", "metadata": {}, "source": [ "(07-hw-accel-mixed-device)=\n", "## Mixed-Device Graphs\n", "\n", "In graphs with multiple model nodes, you can assign different devices to different nodes.\n", "However, this adds the overhead of moving tensors between devices during the forward pass.\n", "Pinned memory can reduce the cost of host-to-device copies, but it does not remove it.\n", "\n", "Generally, hardware acceleration is best reserved for `ModelGraph`s that use a common backend.\n", "You are unlikely to see large speed-ups for mixed-device workflows." ] }, { "cell_type": "code", "execution_count": null, "id": "22", "metadata": {}, "outputs": [], "source": [ "# Encoder on GPU 0, head on GPU 1 (each falls back to CPU if its GPU is unavailable)\n", "enc_device = Accelerator.cuda(0) if Accelerator.cuda(0).is_available() else Accelerator.cpu()\n", "head_device = Accelerator.cuda(1) if Accelerator.cuda(1).is_available() else Accelerator.cpu()\n", "\n", "encoder = ModelNode(\n", "    label=\"Encoder\",\n", "    model=SequentialMLP(output_shape=(1, 16), n_layers=2, hidden_dim=64),\n", "    upstream_ref=fs_ref,\n", "    accelerator=enc_device,\n", ")\n", "head = ModelNode(\n", "    label=\"Head\",\n", "    model=SequentialMLP(output_shape=(1, 1), n_layers=1, hidden_dim=16),\n", "    upstream_ref=encoder.reference(),\n", "    accelerator=head_device,\n", ")\n", "\n", "print(f\"Encoder device: {encoder._accelerator.device}\")\n", "print(f\"Head device: {head._accelerator.device}\")" ] }, { "cell_type": "markdown", "id": "23", "metadata": {}, "source": [ "During `pre_place_nodes`, each node is moved to its own resolved device\n", "independently. Data flowing between nodes on different devices is handled\n", "automatically: the graph's forward pass checks whether each incoming tensor is\n", "already on the correct device and only calls `accelerator.move_torch_tensor()`\n", "when a transfer is actually needed." 
] }, { "cell_type": "markdown", "id": "24", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "25", "metadata": {}, "source": [ "### `pin_memory` for faster host-to-device transfers\n", "\n", "Setting `pin_memory=True` on a CUDA accelerator places CPU tensors in page-locked\n", "(pinned) memory before the GPU transfer. This enables **asynchronous PCIe DMA**:\n", "the CPU can continue executing Python while the GPU is receiving data over the bus.\n", "\n", "Under the hood, `Accelerator.move_torch_tensor()` dispatches:\n", "\n", "```python\n", "# With pin_memory=True\n", "tensor.pin_memory().to(device_str, non_blocking=True)\n", "\n", "# With pin_memory=False (default)\n", "tensor.to(device_str)\n", "```\n", "\n", "`pin_memory` is most useful when batches are transferred from CPU to GPU during training,\n", "since pinned host memory can enable asynchronous/non-blocking copies.\n", "If the full dataset is pre-materialized onto the GPU before the epoch loop,\n", "then `pin_memory` does not affect per-step training throughput; \n", "at most, it can affect the one-time CPU->GPU materialization step, \n", "and whether that is actually faster depends on the workload and hardware." 
] }, { "cell_type": "code", "execution_count": null, "id": "26", "metadata": {}, "outputs": [], "source": [ "acc_pinned = Accelerator(\"cuda:0\", pin_memory=True)\n", "print(f\"device: {acc_pinned.device}\")\n", "print(f\"pin_memory: {acc_pinned.pin_memory}\")" ] }, { "cell_type": "markdown", "id": "27", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "28", "metadata": {}, "source": [ "(07-hw-accel-tensorflow)=\n", "## TensorFlow Acceleration\n", "\n", "ModularML supports TensorFlow models through the same `ModelGraph` / phase API.\n", "The `Accelerator` class translates device strings to TensorFlow's `/GPU:N` format\n", "automatically.\n", "\n", "For TensorFlow nodes, device placement uses a `tf.device()` context manager rather\n", "than `.to()` calls. `Accelerator.tf_device_scope()` returns this context manager and\n", "is called internally during the forward pass of TF-backend nodes." ] }, { "cell_type": "code", "execution_count": null, "id": "29", "metadata": {}, "outputs": [], "source": [ "# TF device string translation\n", "for device_str in [\"cpu\", \"cuda\", \"cuda:1\", \"gpu:0\", \"mps\"]:\n", " acc = Accelerator(device_str)\n", " print(f\" {device_str:<10} torch: {acc.torch_device_str():<12} tf: {acc.tf_device_str()}\")" ] }, { "cell_type": "markdown", "id": "30", "metadata": {}, "source": [ "> **Note:** MPS is not supported by TensorFlow. `tf_device_str()` maps `\"mps\"` to\n", "> `\"/CPU:0\"` so that TF-backend nodes run without error on Apple Silicon machines.\n", "\n", "When PyTorch and TensorFlow nodes coexist in a graph, each node is placed on its\n", "resolved device using the correct backend API. The same `accelerator=` argument at\n", "the phase or node level drives both." 
] }, { "cell_type": "markdown", "id": "31", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "32", "metadata": {}, "source": [ "(07-hw-accel-checking-availability)=\n", "## Checking Availability and Serialization\n", "\n", "### Availability\n", "\n", "`Accelerator.is_available()` probes the hardware using the relevant backend library:\n", "\n", "- **CUDA / GPU**: `torch.cuda.is_available()` and `torch.cuda.device_count()`\n", "- **MPS**: `torch.backends.mps.is_available()`\n", "- **CPU**: always `True`" ] }, { "cell_type": "code", "execution_count": null, "id": "33", "metadata": {}, "outputs": [], "source": [ "print(\"CUDA available:\", Accelerator.gpu().is_available())\n", "print(\"MPS available: \", Accelerator.mps().is_available())\n", "print(\"CPU available: \", Accelerator.cpu().is_available())" ] }, { "cell_type": "markdown", "id": "34", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "35", "metadata": {}, "source": [ "(07-hw-accel-summary)=\n", "## Summary\n", "\n", "### Accelerator constructor\n", "\n", "| Parameter | Type | Default | Description |\n", "|-----------|------|---------|-------------|\n", "| `device` | `str` | `\"cpu\"` | Device string: `\"cpu\"`, `\"cuda\"`, `\"cuda:N\"`, `\"gpu\"`, `\"gpu:N\"`, `\"mps\"`. |\n", "| `pin_memory` | `bool` | `False` | Pin CPU tensors before GPU transfer (async DMA). CUDA only. |\n", "\n", "### Accelerator classmethods\n", "\n", "| Method | Returns |\n", "|--------|---------|\n", "| `Accelerator.cpu()` | CPU accelerator |\n", "| `Accelerator.cuda(index=0, pin_memory=False)` | CUDA accelerator |\n", "| `Accelerator.mps()` | Apple Silicon MPS accelerator |\n", "| `Accelerator.gpu(index=0, pin_memory=False)` | Generic GPU alias (maps to CUDA) |\n", "\n", "### Accelerator methods\n", "\n", "| Method | Returns | Description |\n", "|--------|---------|-------------|\n", "| `is_available()` | `bool` | Probes whether the device exists on this machine. 
|\n", "| `torch_device_str()` | `str` | PyTorch device string, e.g. `\"cuda:0\"`. |\n", "| `tf_device_str()` | `str` | TensorFlow device string, e.g. `\"/GPU:0\"`. |\n", "| `setup_torch_model(module)` | `None` | Calls `module.to(device)` in-place. |\n", "| `move_torch_tensor(tensor)` | `Tensor` | Moves tensor to device (with optional pinning). |\n", "| `tf_device_scope()` | context manager | `tf.device(…)` context for TF placement. |\n", "| `get_config()` | `dict` | Serializable config. |\n", "| `from_config(config)` | `Accelerator` | Reconstruct from config. |\n", "\n", "### Phase and node parameters\n", "\n", "| Location | Parameter | Effect |\n", "|----------|-----------|--------|\n", "| `TrainPhase`, `EvalPhase`, `FitPhase` | `accelerator=` | Default device for all active nodes in the phase. |\n", "| `ModelNode` | `accelerator=` | Per-node override; takes priority over the phase accelerator. |\n", "\n", "### Device resolution order\n", "\n", "```\n", "node._accelerator > phase.accelerator > CPU (fallback)\n", "```\n", "\n", "### What happens at phase start\n", "\n", "| Step | What runs | Under the hood |\n", "|------|-----------|----------------|\n", "| Node placement | `ModelGraph.pre_place_nodes()` | `module.to(device)` + optimizer state tensors `.to(device)` |\n", "| Batch pre-materialization | `_pre_materialize_sampler_execs()` | PyArrow `.take()` -> numpy -> `torch.as_tensor()` -> `.to(device)` |\n", "| Progress spinner | `ProgressTask(style=\"spinner\", total=None)` | Indeterminate spinner with elapsed-time counter |\n", "| Epoch loop | Pre-built `Batch` objects reused each epoch | Zero Arrow / conversion / device-transfer overhead |" ] } ], "metadata": { "kernelspec": { "display_name": ".venv (3.13.5)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", 
"version": "3.13.5" } }, "nbformat": 4, "nbformat_minor": 5 }