How to: Use Hardware Acceleration#

ModularML supports GPU and Apple Silicon acceleration for both model training and data preprocessing. Device placement is configured through the Accelerator class and can be specified at two levels:

Phase-level: a default device applied to all active model nodes in that phase
Node-level: a per-node override that takes priority over the phase-level setting

When a phase begins, ModularML pre-places all model nodes on their resolved devices and pre-materializes all input batches to device-resident tensors before the epoch loop starts. This removes repeated data-transfer overhead from the epoch loop, which can make multi-epoch training noticeably faster.

This notebook covers:

The Accelerator Class
Experiment Setup
Phase-Level Acceleration
Node-Level Acceleration
Mixed-Device Graphs
TensorFlow Acceleration
Checking Availability and Serialization
Summary

Note that hardware acceleration may require different builds of the PyTorch or TensorFlow packages than the ones installed with modularml. For PyTorch, see the compute platform options at: https://pytorch.org/get-started/locally/

%matplotlib inline
import numpy as np

from modularml import (
    AppliedLoss,
    EvalPhase,
    Experiment,
    FeatureSet,
    Loss,
    ModelGraph,
    ModelNode,
    Optimizer,
    TrainPhase,
)
from modularml.samplers import SimpleSampler
from modularml.utils.nn.accelerator import Accelerator

The Accelerator Class#

The Accelerator class is the single configuration object for hardware device placement. It wraps a device string and an optional pin_memory flag, and provides backend-specific helpers for PyTorch and TensorFlow.

Accelerator(
    device: str = "cpu",
    *,
    pin_memory: bool = False,
)

Parameter	Type	Default	Description
`device`	`str`	`"cpu"`	Device identifier string (see table below).
`pin_memory`	`bool`	`False`	If `True`, CPU tensors are pinned before GPU transfer, enabling asynchronous DMA. CUDA only.

Supported Device Strings#

Device string	Backend	Meaning
`"cpu"`	PyTorch / TensorFlow	Host CPU
`"cuda"`	PyTorch	Default CUDA GPU (index 0)
`"cuda:0"`, `"cuda:1"`, …	PyTorch	Specific CUDA GPU by index
`"gpu"`	PyTorch / TensorFlow	Generic GPU alias (maps to `cuda` / `/GPU:0`)
`"gpu:0"`, `"gpu:1"`, …	PyTorch / TensorFlow	Generic GPU with index
`"mps"`	PyTorch	Apple Silicon Metal Performance Shaders

Internally, Accelerator.torch_device_str() translates these to PyTorch’s format (e.g. "gpu:1" → "cuda:1"), and Accelerator.tf_device_str() produces the TensorFlow format (e.g. "cuda:1" → "/GPU:1").
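
The translation rules in the table above can be sketched in plain Python. The helper names below (parse_device, to_torch_device_str, to_tf_device_str) are hypothetical stand-ins and not ModularML's implementation. The mapping of "cpu" to "/CPU:0" is an assumption, chosen to match the "mps" behavior this notebook describes for TensorFlow.

```python
from __future__ import annotations

_KNOWN_KINDS = {"cpu", "cuda", "gpu", "mps"}


def parse_device(device: str) -> tuple[str, int | None]:
    """Split a device string such as "gpu:1" into ("gpu", 1)."""
    kind, sep, index = device.strip().lower().partition(":")
    if kind not in _KNOWN_KINDS:
        raise ValueError(f"Unsupported device string: {device!r}")
    return kind, (int(index) if sep else None)


def to_torch_device_str(device: str) -> str:
    """Map the generic "gpu" alias onto PyTorch's "cuda" naming."""
    kind, index = parse_device(device)
    if kind == "gpu":
        kind = "cuda"
    return kind if index is None else f"{kind}:{index}"


def to_tf_device_str(device: str) -> str:
    """Map GPU-style strings to "/GPU:N" and everything else to "/CPU:0"."""
    kind, index = parse_device(device)
    if kind in {"cuda", "gpu"}:
        return f"/GPU:{0 if index is None else index}"
    return "/CPU:0"  # assumption: "cpu" and "mps" both resolve to the host CPU


for name in ["cpu", "cuda", "cuda:1", "gpu:0", "mps"]:
    print(f"{name:<8} torch={to_torch_device_str(name):<8} tf={to_tf_device_str(name)}")
```

Keeping the parsing step separate from the two formatters means an unsupported string fails early with a clear error, rather than producing a malformed device name for one backend only.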

Constructor variants#

Both a direct constructor and convenience classmethods are available.

# Direct construction
acc_cpu  = Accelerator("cpu")
acc_cuda = Accelerator("cuda:0", pin_memory=True)
acc_mps  = Accelerator("mps")

# Convenience classmethods (equivalent)
acc_cpu2  = Accelerator.cpu()
acc_cuda2 = Accelerator.cuda(index=0, pin_memory=True)
acc_mps2  = Accelerator.mps()
acc_gpu   = Accelerator.gpu(index=0)   # backend-agnostic alias

print(f"CPU torch device:   {acc_cpu.torch_device_str()}")
print(f"CUDA torch device:  {acc_cuda.torch_device_str()}")
print(f"MPS torch device:   {acc_mps.torch_device_str()}")
print(f"GPU torch device:   {acc_gpu.torch_device_str()}")
print()
print(f"CUDA TF device:     {acc_cuda.tf_device_str()}")
print(f"MPS TF device:      {acc_mps.tf_device_str()}  # TF has no MPS support; maps to CPU")

Experiment Setup#

We will reuse a simple single-node experiment throughout this notebook. The model and data setup is similar to the experiment notebook (How to: Create and Use an Experiment); however, in this notebook we utilize the accelerator= argument.

Note that the benefits of hardware acceleration only become apparent with larger models and datasets. We keep the sizes small in this example to limit documentation build time.

from modularml.models.torch import SequentialMLP

rng = np.random.default_rng(42)

# Synthetic data: 500 samples, 50-d feature, 1-d target
fs = FeatureSet.from_dict(
    label="SensorData",
    data={
        "voltage": list(rng.standard_normal((500, 50))),
        "soh": list(rng.standard_normal((500, 1))),
    },
    feature_keys="voltage",
    target_keys="soh",
)
fs.split_random(ratios={"train": 0.8, "test": 0.2}, seed=13)
fs_ref = fs.reference(features="voltage", targets="soh")


# Create model node
mn_mlp = ModelNode(
    label="MLP",
    model=SequentialMLP(output_shape=(1, 1), n_layers=2, hidden_dim=16),
    upstream_ref=fs_ref,
)

# Create model graph with a global optimizer
graph = ModelGraph(
    label="SimpleGraph",
    nodes=[mn_mlp],
    optimizer=Optimizer("adam", opt_kwargs={"lr": 1e-3}, backend="torch"),
)

# Build the graph (infers shapes)
graph.build()
graph.visualize()

exp = Experiment.from_active_context(label="my_experiment")

# Pick the best device available on this machine
def best_accelerator() -> Accelerator:
    for acc in [Accelerator.gpu(), Accelerator.mps()]:
        if acc.is_available():
            return acc
    # CPU is always available, so it is the guaranteed fallback
    return Accelerator.cpu()

device = best_accelerator()
print(fs)
print(f"Selected device: {device.device}")

Phase-Level Acceleration#

The simplest way to enable GPU training is to pass an accelerator to the phase. All active nodes are moved to that device before the first epoch begins.

The accelerator parameter is available on TrainPhase, EvalPhase, and FitPhase. It accepts either an Accelerator instance or a plain device string; ModularML wraps strings automatically.

mse_loss = AppliedLoss(
    loss=Loss("mse", backend="torch"),
    on="MLP",
    inputs=["outputs", "targets"],
)

# Pass an Accelerator instance, or a plain device string such as "cuda:0",
# which is wrapped automatically: "cuda:0" == Accelerator("cuda:0")
train_phase = TrainPhase.from_split(
    label="train",
    split="train",
    sampler=SimpleSampler(batch_size=4, shuffle=True, seed=42),
    losses=[mse_loss],
    n_epochs=2,
    accelerator=device,
)


eval_phase = EvalPhase.from_split(
    label="eval",
    split="test",
    losses=[mse_loss],
    accelerator=device,
)

print(f"TrainPhase accelerator: {train_phase.accelerator}")
print(f"EvalPhase accelerator:  {eval_phase.accelerator}")

results = exp.run_phase(train_phase)
print("Training complete.")

What happens under the hood#

When iter_execution() is called on the phase, two steps run once before the epoch loop rather than repeatedly inside it:

1. Node placement - ModelGraph.pre_place_nodes() iterates over all active ModelNode instances and calls node._ensure_node_on_device(accelerator) on each:

torch_module.to("cuda:0") - moves all model parameters and buffers in-place
For already-built optimizers: iterates optimizer.instance.state and calls .to("cuda:0") on every momentum / variance tensor, preventing a device mismatch on the first optimizer.step()

2. Batch pre-materialization - _pre_materialize_sampler_execs() converts every lazy BatchView (a zero-copy index slice into a PyArrow table) into a concrete Batch of torch tensors already resident on the target device.

After these two steps, the epoch loop runs with no per-step PyArrow conversion and no per-step host-to-device batch transfer.
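
The benefit of materializing once before the epoch loop can be illustrated with a toy counter. This is a minimal sketch using only the standard library: LazyView, CountingMaterializer, run_lazy, and run_prematerialized are hypothetical names, and the dictionary "batch" merely stands in for device-resident tensors. It does not reproduce ModularML's internals.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class LazyView:
    """Stand-in for a lazy batch view: only row indices into a table."""

    indices: tuple[int, ...]


class CountingMaterializer:
    """Turns a LazyView into a concrete batch and counts how often it runs."""

    def __init__(self, table: list[list[float]], device: str) -> None:
        self.table = table
        self.device = device
        self.calls = 0

    def __call__(self, view: LazyView) -> dict:
        self.calls += 1  # each call models one conversion plus one device transfer
        return {"device": self.device, "rows": [self.table[i] for i in view.indices]}


def run_lazy(views, materialize, n_epochs: int) -> None:
    for _ in range(n_epochs):
        for view in views:
            materialize(view)  # conversion repeated on every step of every epoch


def run_prematerialized(views, materialize, n_epochs: int) -> None:
    batches = [materialize(view) for view in views]  # done once, before the loop
    for _ in range(n_epochs):
        for _batch in batches:
            pass  # a real training step would consume the batch here


table = [[float(i), float(i) * 2.0] for i in range(8)]
views = [LazyView((0, 1)), LazyView((2, 3)), LazyView((4, 5)), LazyView((6, 7))]

lazy = CountingMaterializer(table, "cuda:0")
eager = CountingMaterializer(table, "cuda:0")
run_lazy(views, lazy, n_epochs=3)
run_prematerialized(views, eager, n_epochs=3)
print(lazy.calls, eager.calls)  # 12 conversions vs 4
```

The trade-off is memory: holding every batch on the device at once only works when the materialized dataset fits in device memory.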

Node-Level Acceleration#

Individual nodes can declare their own accelerator directly on the ModelNode. A node-level accelerator always takes priority over the phase-level setting. This is useful when different nodes in the same graph should run on different devices.

# Node explicitly assigned to the CPU, regardless of the phase accelerator
mn_cpu = ModelNode(
    label="MLP",
    model=SequentialMLP(output_shape=(1, 1), n_layers=2, hidden_dim=32),
    upstream_ref=fs_ref,
    accelerator=Accelerator.cpu(),  # node-level override
)
graph.replace_node(mn_mlp, mn_cpu).build()

# Phase accelerator is None, so the node-level accelerator is the only setting in effect
train_node_acc = TrainPhase.from_split(
    label="train",
    split="train",
    sampler=SimpleSampler(batch_size=64, shuffle=True, seed=42),
    losses=[mse_loss],
    n_epochs=2,
    accelerator=None,
)

print(f"Phase accelerator: {train_node_acc.accelerator}")
print(f"Node accelerator:  {mn_cpu._accelerator}")

Priority rules#

ModularML resolves the effective device for each node with the following priority:

node._accelerator  >  phase.accelerator  >  CPU (fallback)

This logic lives in ModelGraph._resolve_node_accelerator(node, phase_accelerator):

node_acc = getattr(node, "_accelerator", None)
if node_acc is not None:
    return node_acc          # node-level wins
return phase_accelerator     # phase-level wins (may be None -> CPU)

If neither the node nor the phase specifies an accelerator, pre_place_nodes falls back to Accelerator("cpu") so all nodes are always in a known, deterministic state.
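
The full resolution order, including the CPU fallback, can be sketched with plain strings in place of Accelerator objects. NodeSketch and resolve_device are hypothetical names used for illustration only; the real logic lives in the ModularML methods named above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class NodeSketch:
    """Stand-in for a model node with an optional node-level device."""

    label: str
    accelerator: str | None = None


def resolve_device(node: NodeSketch, phase_accelerator: str | None) -> str:
    """Apply the priority: node-level > phase-level > CPU fallback."""
    if node.accelerator is not None:
        return node.accelerator
    if phase_accelerator is not None:
        return phase_accelerator
    return "cpu"


print(resolve_device(NodeSketch("Encoder", "cuda:1"), "cuda:0"))  # cuda:1
print(resolve_device(NodeSketch("Head"), "cuda:0"))               # cuda:0
print(resolve_device(NodeSketch("Head"), None))                   # cpu
```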

Mixed-Device Graphs#

In graphs with multiple model nodes you can assign different devices to different nodes. However, this introduces the additional overhead of moving tensors between devices on every step. Pinned memory can reduce the cost of host-to-device copies, but it does not eliminate this overhead.

Generally, hardware acceleration is best reserved for ModelGraphs whose nodes share a common backend and device. You likely won’t see large speed-ups for mixed-device workflows.

# Encoder on GPU 0, head on GPU 1; each falls back to CPU if its GPU is unavailable
enc_device  = Accelerator.cuda(0) if Accelerator.cuda(0).is_available() else Accelerator.cpu()
head_device = Accelerator.cuda(1) if Accelerator.cuda(1).is_available() else Accelerator.cpu()

encoder = ModelNode(
    label="Encoder",
    model=SequentialMLP(output_shape=(1, 16), n_layers=2, hidden_dim=64),
    upstream_ref=fs_ref,
    accelerator=enc_device,
)
head = ModelNode(
    label="Head",
    model=SequentialMLP(output_shape=(1, 1), n_layers=1, hidden_dim=16),
    upstream_ref=encoder.reference(),
    accelerator=head_device,
)

print(f"Encoder device: {enc_device.device}")
print(f"Head device:    {head_device.device}")

During pre_place_nodes, each node is moved to its own resolved device independently. Data flowing between nodes on different devices is handled automatically: the graph’s forward pass checks whether each incoming tensor is already on the correct device and only calls accelerator.move_torch_tensor() when a transfer is actually needed.
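
The "transfer only when needed" check can be sketched without any deep learning backend. FakeTensor and move_if_needed are hypothetical stand-ins: FakeTensor only records a device name, and its to() method imitates the shape of a tensor move without copying any data.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FakeTensor:
    """Stand-in for a tensor that only remembers which device it lives on."""

    device: str

    def to(self, device: str) -> FakeTensor:
        return FakeTensor(device)


def move_if_needed(tensor: FakeTensor, target_device: str, transfers: list) -> FakeTensor:
    """Return the tensor unchanged if it is already on the target device."""
    if tensor.device == target_device:
        return tensor
    transfers.append((tensor.device, target_device))  # record the actual transfer
    return tensor.to(target_device)


transfers: list[tuple[str, str]] = []
encoder_out = FakeTensor("cuda:0")

same = move_if_needed(encoder_out, "cuda:0", transfers)   # no transfer recorded
moved = move_if_needed(encoder_out, "cuda:1", transfers)  # one transfer recorded
print(same is encoder_out, moved.device, transfers)
```

In a mixed-device graph this check runs on every edge that crosses a device boundary, which is why such graphs still pay a per-step transfer cost.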

`pin_memory` for faster host-to-device transfers#

Setting pin_memory=True on a CUDA accelerator places CPU tensors in page-locked (pinned) memory before the GPU transfer. This enables asynchronous PCIe DMA: the CPU can continue executing Python while the GPU is receiving data over the bus.

Under the hood, Accelerator.move_torch_tensor() dispatches:

# With pin_memory=True
tensor.pin_memory().to(device_str, non_blocking=True)

# With pin_memory=False (default)
tensor.to(device_str)

pin_memory is most useful when batches are transferred from CPU to GPU during training, since pinned host memory can enable asynchronous/non-blocking copies. If the full dataset is pre-materialized onto the GPU before the epoch loop, then pin_memory does not affect per-step training throughput; at most, it can affect the one-time CPU->GPU materialization step, and whether that is actually faster depends on the workload and hardware.

acc_pinned = Accelerator("cuda:0", pin_memory=True)
print(f"device:      {acc_pinned.device}")
print(f"pin_memory:  {acc_pinned.pin_memory}")

TensorFlow Acceleration#

ModularML supports TensorFlow models through the same ModelGraph / phase API. The Accelerator class translates device strings to TensorFlow’s /GPU:N format automatically.

For TensorFlow nodes, device placement uses a tf.device() context manager rather than .to() calls. Accelerator.tf_device_scope() returns this context manager and is called internally during the forward pass of TF-backend nodes.

# TF device string translation
for device_str in ["cpu", "cuda", "cuda:1", "gpu:0", "mps"]:
    acc = Accelerator(device_str)
    print(f"  {device_str:<10}  torch: {acc.torch_device_str():<12}  tf: {acc.tf_device_str()}")

Note: MPS is not supported by TensorFlow. tf_device_str() maps "mps" to "/CPU:0" so that TF-backend nodes run without error on Apple Silicon machines.

When PyTorch and TensorFlow nodes coexist in a graph, each node is placed on its resolved device using the correct backend API. The same accelerator= argument at the phase or node level drives both.

Checking Availability and Serialization#

Availability#

Accelerator.is_available() probes the hardware using the relevant backend library:

CUDA / GPU: torch.cuda.is_available() and torch.cuda.device_count()
MPS: torch.backends.mps.is_available()
CPU: always True
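
The probes listed above can be sketched as a standalone function. This is an approximation under stated assumptions: probe_device is a hypothetical name, the function reports False for GPU and MPS devices when PyTorch is not installed, and it is not ModularML's implementation of is_available().

```python
def probe_device(device: str) -> bool:
    """Return True if `device` looks usable on this machine (illustrative only)."""
    kind, _, index = device.strip().lower().partition(":")
    if kind not in {"cpu", "cuda", "gpu", "mps"}:
        raise ValueError(f"Unsupported device string: {device!r}")
    if kind == "cpu":
        return True  # the host CPU is always present

    try:
        import torch
    except ImportError:
        return False  # assumption: without PyTorch there is nothing to probe with

    if kind in {"cuda", "gpu"}:
        idx = int(index) if index else 0
        # The requested index must exist, not just "some" CUDA device
        return bool(torch.cuda.is_available() and idx < torch.cuda.device_count())

    # kind == "mps": older PyTorch builds may not expose torch.backends.mps
    mps = getattr(torch.backends, "mps", None)
    return bool(mps is not None and mps.is_available())


for name in ["cpu", "cuda:0", "mps"]:
    print(f"{name:<7} available: {probe_device(name)}")
```

Checking the index against the device count matters for strings like "cuda:1", which would otherwise report as available on a single-GPU machine.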

print("CUDA available:", Accelerator.gpu().is_available())
print("MPS available: ", Accelerator.mps().is_available())
print("CPU available: ", Accelerator.cpu().is_available())
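
For serialization, the summary below lists get_config() and from_config(config). A round trip through JSON might look like the following sketch. AcceleratorSketch is a hypothetical stand-in class, and the config keys ("device", "pin_memory") are an assumption based on the constructor parameters; the real config may contain different or additional fields.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AcceleratorSketch:
    """Stand-in holding the two constructor parameters described in this notebook."""

    device: str = "cpu"
    pin_memory: bool = False

    def get_config(self) -> dict:
        # Assumed config layout: one key per constructor argument
        return {"device": self.device, "pin_memory": self.pin_memory}

    @classmethod
    def from_config(cls, config: dict) -> AcceleratorSketch:
        return cls(
            device=config["device"],
            pin_memory=config.get("pin_memory", False),
        )


original = AcceleratorSketch("cuda:0", pin_memory=True)
payload = json.dumps(original.get_config())  # plain JSON, safe to store on disk
restored = AcceleratorSketch.from_config(json.loads(payload))

print(payload)
print(restored == original)
```

Storing only the device string and flag keeps the config portable, but a restored accelerator may name a device that does not exist on the new machine, so an availability check after loading is a sensible precaution.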

Summary#

Accelerator constructor#

Parameter	Type	Default	Description
`device`	`str`	`"cpu"`	Device string: `"cpu"`, `"cuda"`, `"cuda:N"`, `"gpu"`, `"gpu:N"`, `"mps"`.
`pin_memory`	`bool`	`False`	Pin CPU tensors before GPU transfer (async DMA). CUDA only.

Accelerator classmethods#

Method	Returns
`Accelerator.cpu()`	CPU accelerator
`Accelerator.cuda(index=0, pin_memory=False)`	CUDA accelerator
`Accelerator.mps()`	Apple Silicon MPS accelerator
`Accelerator.gpu(index=0, pin_memory=False)`	Generic GPU alias (maps to CUDA)

Accelerator methods#

Method	Returns	Description
`is_available()`	`bool`	Probes whether the device exists on this machine.
`torch_device_str()`	`str`	PyTorch device string, e.g. `"cuda:0"`.
`tf_device_str()`	`str`	TensorFlow device string, e.g. `"/GPU:0"`.
`setup_torch_model(module)`	`None`	Calls `module.to(device)` in-place.
`move_torch_tensor(tensor)`	`Tensor`	Moves tensor to device (with optional pinning).
`tf_device_scope()`	context manager	`tf.device(…)` context for TF placement.
`get_config()`	`dict`	Serializable config.
`from_config(config)`	`Accelerator`	Reconstruct from config.

Phase and node parameters#

Location	Parameter	Effect
`TrainPhase`, `EvalPhase`, `FitPhase`	`accelerator=`	Default device for all active nodes in the phase.
`ModelNode`	`accelerator=`	Per-node override; takes priority over the phase accelerator.

Device resolution order#

node._accelerator  >  phase.accelerator  >  CPU (fallback)

What happens at phase start#

Step	What runs	Under the hood
Node placement	`ModelGraph.pre_place_nodes()`	`module.to(device)` + optimizer state tensors `.to(device)`
Batch pre-materialization	`_pre_materialize_sampler_execs()`	PyArrow `.take()` -> numpy -> `torch.as_tensor()` -> `.to(device)`
Progress spinner	`ProgressTask(style="spinner", total=None)`	Indeterminate spinner with elapsed-time counter
Epoch loop	Pre-built `Batch` objects reused each epoch	Zero Arrow / conversion / device-transfer overhead