How to: Use Hardware Acceleration#
ModularML supports GPU and Apple Silicon acceleration for both model training and
data preprocessing. Device placement is configured through the Accelerator class
and can be specified at two levels:
Phase-level: a default device applied to all active model nodes in that phase
Node-level: a per-node override that takes priority over the phase-level setting
When a phase begins, ModularML pre-places all model nodes on their resolved devices and pre-materializes all input batches to device-resident tensors before the epoch loop starts. This eliminates repeated data-transfer overhead across epochs and results in significantly faster training.
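This notebook covers the Accelerator class, phase-level and node-level acceleration, mixed-device graphs, pin_memory, TensorFlow acceleration, and availability checks.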
Note that hardware acceleration may require different installs of the PyTorch or TensorFlow packages than those included with modularml.
For PyTorch, see the compute platform options at: https://pytorch.org/get-started/locally/
%matplotlib inline
import numpy as np
from modularml import (
AppliedLoss,
EvalPhase,
Experiment,
FeatureSet,
Loss,
ModelGraph,
ModelNode,
Optimizer,
TrainPhase,
)
from modularml.samplers import SimpleSampler
from modularml.utils.nn.accelerator import Accelerator
The Accelerator Class#
The Accelerator class is the single configuration object for hardware device
placement. It wraps a device string and an optional pin_memory flag, and provides
backend-specific helpers for PyTorch and TensorFlow.
Accelerator(
device: str = "cpu",
*,
pin_memory: bool = False,
)
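| Parameter | Type | Default | Description |
|---|---|---|---|
| device | str | "cpu" | Device identifier string (see table below). |
| pin_memory | bool | False | If True, CPU tensors are placed in page-locked (pinned) memory before the GPU transfer. |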
Supported Device Strings#
| Device string | Backend | Meaning |
|---|---|---|
| "cpu" | PyTorch / TensorFlow | Host CPU |
| "cuda" | PyTorch | Default CUDA GPU (index 0) |
| "cuda:N" | PyTorch | Specific CUDA GPU by index |
| "gpu" | PyTorch / TensorFlow | Generic GPU alias (maps to "cuda") |
| "gpu:N" | PyTorch / TensorFlow | Generic GPU with index |
| "mps" | PyTorch | Apple Silicon Metal Performance Shaders |
Internally, Accelerator.torch_device_str() translates these to PyTorch’s format
(e.g. "gpu:1" → "cuda:1"), and Accelerator.tf_device_str() produces the
TensorFlow format (e.g. "cuda:1" → "/GPU:1").
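The translation rules above can be sketched in a few lines of standalone Python. This is a minimal illustration, not ModularML's implementation: parse_device, to_torch_device_str, and to_tf_device_str are hypothetical helpers, and the handling of "cpu" and of index-less strings on the TensorFlow side is an assumption extrapolated from the examples in this notebook.

```python
from typing import Optional, Tuple


def parse_device(device: str) -> Tuple[str, Optional[int]]:
    """Split 'cuda:1' into ('cuda', 1) and 'cpu' into ('cpu', None)."""
    kind, _, index = device.strip().lower().partition(":")
    return kind, (int(index) if index else None)


def to_torch_device_str(device: str) -> str:
    """Hypothetical stand-in for Accelerator.torch_device_str()."""
    kind, index = parse_device(device)
    if kind == "gpu":  # the generic alias maps to CUDA
        kind = "cuda"
    return kind if index is None else f"{kind}:{index}"


def to_tf_device_str(device: str) -> str:
    """Hypothetical stand-in for Accelerator.tf_device_str()."""
    kind, index = parse_device(device)
    if kind in ("cuda", "gpu"):
        return f"/GPU:{index or 0}"  # assumption: no index means GPU 0
    return "/CPU:0"  # assumption: "cpu" and "mps" both resolve to the host CPU


for name in ["cpu", "cuda", "cuda:1", "gpu:0", "mps"]:
    print(f"{name:<8} torch: {to_torch_device_str(name):<8} tf: {to_tf_device_str(name)}")
```

Keeping the parsing in one helper means both backends agree on what the index is, which is the property the real class needs regardless of how it is written.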
Constructor variants#
Both a direct constructor and convenience classmethods are available.
# Direct construction
acc_cpu = Accelerator("cpu")
acc_cuda = Accelerator("cuda:0", pin_memory=True)
acc_mps = Accelerator("mps")
# Convenience classmethods (equivalent)
acc_cpu2 = Accelerator.cpu()
acc_cuda2 = Accelerator.cuda(index=0, pin_memory=True)
acc_mps2 = Accelerator.mps()
acc_gpu = Accelerator.gpu(index=0) # backend-agnostic alias
print(f"CPU torch device: {acc_cpu.torch_device_str()}")
print(f"CUDA torch device: {acc_cuda.torch_device_str()}")
print(f"MPS torch device: {acc_mps.torch_device_str()}")
print(f"GPU torch device: {acc_gpu.torch_device_str()}")
print()
print(f"CUDA TF device: {acc_cuda.tf_device_str()}")
print(f"MPS TF device: {acc_mps.tf_device_str()}")  # TF has no MPS support; maps to CPU
Experiment Setup#
We will reuse a simple single-node experiment throughout this notebook.
The model and data setup is similar to the experiment notebook (How to: Create and Use an Experiment);
however, in this notebook we utilize the accelerator= argument.
Note that the benefits of hardware acceleration only become apparent for larger models and datasets. We keep the sizes small in this example to limit documentation compilation time.
from modularml.models.torch import SequentialMLP
rng = np.random.default_rng(42)
# Synthetic data: 500 samples, 50-d feature, 1-d target
fs = FeatureSet.from_dict(
label="SensorData",
data={
"voltage": list(rng.standard_normal((500, 50))),
"soh": list(rng.standard_normal((500, 1))),
},
feature_keys="voltage",
target_keys="soh",
)
fs.split_random(ratios={"train": 0.8, "test": 0.2}, seed=13)
fs_ref = fs.reference(features="voltage", targets="soh")
# Create model node
mn_mlp = ModelNode(
label="MLP",
model=SequentialMLP(output_shape=(1, 1), n_layers=2, hidden_dim=16),
upstream_ref=fs_ref,
)
# Create model graph with a global optimizer
graph = ModelGraph(
label="SimpleGraph",
nodes=[mn_mlp],
optimizer=Optimizer("adam", opt_kwargs={"lr": 1e-3}, backend="torch"),
)
# Build the graph (infers shapes)
graph.build()
graph.visualize()
exp = Experiment.from_active_context(label="my_experiment")
# Pick the best device available on this machine
def best_accelerator() -> Accelerator | None:
    for acc in [Accelerator.gpu(), Accelerator.mps(), Accelerator.cpu()]:
        if acc.is_available():
            return acc
    return None
device = best_accelerator()
print(fs)
print(f"Selected device: {device.device}")
Phase-Level Acceleration#
The simplest way to enable GPU training is to pass an accelerator to the phase.
All active nodes are moved to that device before the first epoch begins.
The accelerator parameter is available on TrainPhase, EvalPhase, and FitPhase.
It accepts either an Accelerator instance or a plain device string - ModularML
wraps strings automatically.
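The string-wrapping behaviour can be pictured with a small standalone sketch. DeviceSpec and coerce_accelerator are hypothetical names used only for illustration; they are not part of ModularML, and the library's own normalization may differ.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class DeviceSpec:
    """Hypothetical stand-in for Accelerator (not the ModularML class)."""

    device: str = "cpu"
    pin_memory: bool = False


def coerce_accelerator(value: Union[DeviceSpec, str, None]) -> Optional[DeviceSpec]:
    """Accept an instance, a device string, or None, and return a uniform type."""
    if value is None or isinstance(value, DeviceSpec):
        return value  # already normalized, or intentionally unset
    if isinstance(value, str):
        return DeviceSpec(device=value)  # wrap the plain string
    raise TypeError(f"Expected DeviceSpec, str, or None; got {type(value).__name__}")


print(coerce_accelerator("cuda:0"))
print(coerce_accelerator(DeviceSpec("mps")))
print(coerce_accelerator(None))
```

Normalizing at the boundary lets the rest of the code assume a single type, so a string and an equivalent instance behave identically downstream.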
mse_loss = AppliedLoss(
loss=Loss("mse", backend="torch"),
on="MLP",
inputs=["outputs", "targets"],
)
# Pass an Accelerator instance
train_phase = TrainPhase.from_split(
label="train",
split="train",
sampler=SimpleSampler(batch_size=4, shuffle=True, seed=42),
losses=[mse_loss],
n_epochs=2,
accelerator=device, # Or pass a plain string; it is wrapped automatically: "cuda:0" == Accelerator("cuda:0")
)
eval_phase = EvalPhase.from_split(
label="eval",
split="test",
losses=[mse_loss],
accelerator=device,
)
print(f"TrainPhase accelerator: {train_phase.accelerator}")
print(f"EvalPhase accelerator: {eval_phase.accelerator}")
results = exp.run_phase(train_phase)
print("Training complete.")
What happens under the hood#
When iter_execution() is called on the phase, two steps run once before the
epoch loop rather than repeatedly inside it:
1. Node placement - ModelGraph.pre_place_nodes() iterates over all active
ModelNode instances and calls node._ensure_node_on_device(accelerator) on each:
torch_module.to("cuda:0") - moves all model parameters and buffers in place
For already-built optimizers: iterates optimizer.instance.state and calls .to("cuda:0") on every momentum / variance tensor, preventing a device mismatch on the first optimizer.step()
2. Batch pre-materialization - _pre_materialize_sampler_execs() converts every
lazy BatchView (a zero-copy index slice into a PyArrow table) into a concrete
Batch of torch tensors already resident on the target device.
After these two steps, the epoch loop runs with zero PyArrow overhead and zero device-transfer overhead per step.
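To see why materializing once pays off, here is a toy model of the two loop shapes. It uses plain Python lists in place of PyArrow tables and torch tensors, and LazyView is a hypothetical stand-in for a lazy batch view, so treat it as an illustration of the counting argument rather than of ModularML's internals.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class LazyView:
    """Toy stand-in for a lazy batch: indices into a shared table."""

    table: Sequence[float]
    indices: List[int]

    def materialize(self, counter: List[int]) -> List[float]:
        counter[0] += 1  # count every (expensive) conversion
        return [self.table[i] for i in self.indices]


table = [float(i) for i in range(8)]
views = [LazyView(table, [0, 1, 2, 3]), LazyView(table, [4, 5, 6, 7])]
n_epochs = 3

# Naive loop: convert each view again inside every epoch.
naive_conversions = [0]
for _ in range(n_epochs):
    for view in views:
        batch = view.materialize(naive_conversions)

# Pre-materialized loop: convert once, then reuse the concrete batches.
pre_conversions = [0]
batches = [view.materialize(pre_conversions) for view in views]
for _ in range(n_epochs):
    for batch in batches:
        pass  # a training step would consume `batch` here

print(f"naive: {naive_conversions[0]} conversions, pre-materialized: {pre_conversions[0]}")
```

The naive loop performs one conversion per batch per epoch, while the pre-materialized loop performs one per batch in total, so the saving grows linearly with the number of epochs.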
Node-Level Acceleration#
Individual nodes can declare their own accelerator directly on the ModelNode.
A node-level accelerator always takes priority over the phase-level setting.
This is useful when different nodes in the same graph should run on different devices.
# Node explicitly pinned to CPU, regardless of phase accelerator
mn_cpu = ModelNode(
label="MLP",
model=SequentialMLP(output_shape=(1, 1), n_layers=2, hidden_dim=32),
upstream_ref=fs_ref,
accelerator=Accelerator.cpu(), # node-level override
)
graph.replace_node(mn_mlp, mn_cpu).build()
# Phase accelerator is None - node overrides it
train_node_acc = TrainPhase.from_split(
label="train",
split="train",
sampler=SimpleSampler(batch_size=64, shuffle=True, seed=42),
losses=[mse_loss],
n_epochs=2,
accelerator=None,
)
print(f"Phase accelerator: {train_node_acc.accelerator}")
print(f"Node accelerator: {mn_cpu._accelerator}")
Priority rules#
ModularML resolves the effective device for each node with the following priority:
node._accelerator > phase.accelerator > CPU (fallback)
This logic lives in ModelGraph._resolve_node_accelerator(node, phase_accelerator):
node_acc = getattr(node, "_accelerator", None)
if node_acc is not None:
    return node_acc  # node-level wins
return phase_accelerator  # phase-level wins (may be None -> CPU)
If neither the node nor the phase specifies an accelerator, pre_place_nodes falls
back to Accelerator("cpu") so all nodes are always in a known, deterministic state.
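The full resolution chain, including the CPU fallback, can be exercised without ModularML installed. In this sketch FakeNode and resolve_device are hypothetical stand-ins, and devices are plain strings rather than Accelerator instances.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeNode:
    """Hypothetical stand-in for a ModelNode; only the accelerator matters here."""

    label: str
    _accelerator: Optional[str] = None


def resolve_device(node: FakeNode, phase_accelerator: Optional[str]) -> str:
    """Apply the priority node > phase > CPU fallback."""
    node_acc = getattr(node, "_accelerator", None)
    if node_acc is not None:
        return node_acc  # node-level wins
    if phase_accelerator is not None:
        return phase_accelerator  # phase-level default
    return "cpu"  # nothing specified anywhere


pinned = FakeNode("Head", _accelerator="cpu")
unpinned = FakeNode("Encoder")

print(resolve_device(pinned, "cuda:0"))    # node override beats the phase
print(resolve_device(unpinned, "cuda:0"))  # phase default applies
print(resolve_device(unpinned, None))      # falls back to CPU
```

Folding the fallback into the same function is a choice made for this sketch; per the text above, the library applies it in pre_place_nodes instead.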
Mixed-Device Graphs#
In graphs with multiple model nodes, you can assign different devices to different nodes. However, this introduces the additional overhead of moving batches between devices; pinned memory can reduce the cost of host-to-device copies but does not remove it.
Generally, hardware acceleration is best reserved for ModelGraphs whose nodes share a common backend. You likely won’t see large speed-ups for mixed-device workflows.
# Encoder on GPU 0, head on GPU 1 (falls back to CPU if only one GPU)
enc_device = Accelerator.cuda(0) if Accelerator.cuda(0).is_available() else Accelerator.cpu()
head_device = Accelerator.cuda(1) if Accelerator.cuda(1).is_available() else Accelerator.cpu()
encoder = ModelNode(
label="Encoder",
model=SequentialMLP(output_shape=(1, 16), n_layers=2, hidden_dim=64),
upstream_ref=fs_ref,
accelerator=enc_device,
)
head = ModelNode(
label="Head",
model=SequentialMLP(output_shape=(1, 1), n_layers=1, hidden_dim=16),
upstream_ref=encoder.reference(),
accelerator=head_device,
)
print(f"Encoder device: {encoder._accelerator.device}")
print(f"Head device: {head_device.device}")
During pre_place_nodes, each node is moved to its own resolved device
independently. Data flowing between nodes on different devices is handled
automatically: the graph’s forward pass checks whether each incoming tensor is
already on the correct device and only calls accelerator.move_torch_tensor()
when a transfer is actually needed.
pin_memory for faster host-to-device transfers#
Setting pin_memory=True on a CUDA accelerator places CPU tensors in page-locked
(pinned) memory before the GPU transfer. This enables asynchronous PCIe DMA:
the CPU can continue executing Python while the GPU is receiving data over the bus.
Under the hood, Accelerator.move_torch_tensor() dispatches:
# With pin_memory=True
tensor.pin_memory().to(device_str, non_blocking=True)
# With pin_memory=False (default)
tensor.to(device_str)
pin_memory is most useful when batches are transferred from CPU to GPU during training,
since pinned host memory can enable asynchronous/non-blocking copies.
If the full dataset is pre-materialized onto the GPU before the epoch loop,
then pin_memory does not affect per-step training throughput;
at most, it can affect the one-time CPU->GPU materialization step,
and whether that is actually faster depends on the workload and hardware.
acc_pinned = Accelerator("cuda:0", pin_memory=True)
print(f"device: {acc_pinned.device}")
print(f"pin_memory: {acc_pinned.pin_memory}")
TensorFlow Acceleration#
ModularML supports TensorFlow models through the same ModelGraph / phase API.
The Accelerator class translates device strings to TensorFlow’s /GPU:N format
automatically.
For TensorFlow nodes, device placement uses a tf.device() context manager rather
than .to() calls. Accelerator.tf_device_scope() returns this context manager and
is called internally during the forward pass of TF-backend nodes.
# TF device string translation
for device_str in ["cpu", "cuda", "cuda:1", "gpu:0", "mps"]:
acc = Accelerator(device_str)
print(f" {device_str:<10} torch: {acc.torch_device_str():<12} tf: {acc.tf_device_str()}")
Note: MPS is not supported by TensorFlow.
tf_device_str() maps "mps" to "/CPU:0" so that TF-backend nodes run without error on Apple Silicon machines.
When PyTorch and TensorFlow nodes coexist in a graph, each node is placed on its
resolved device using the correct backend API. The same accelerator= argument at
the phase or node level drives both.
Checking Availability and Serialization#
Availability#
Accelerator.is_available() probes the hardware using the relevant backend library:
CUDA / GPU: torch.cuda.is_available() and torch.cuda.device_count()
MPS: torch.backends.mps.is_available()
CPU: always True
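A probe along these lines can be written directly against PyTorch. The function below is a sketch under stated assumptions: probe_device is a hypothetical name, and the way the CUDA index is checked against the device count is a guess at reasonable behaviour, not a description of what Accelerator.is_available() does. The import is guarded so the snippet also runs where PyTorch is not installed.

```python
def probe_device(device: str) -> bool:
    """Return True if the named device looks usable on this machine."""
    kind, _, index = device.strip().lower().partition(":")
    if kind == "cpu":
        return True  # the host CPU is always present
    try:
        import torch
    except ImportError:
        return False  # no backend library to probe with
    if kind in ("cuda", "gpu"):
        wanted = int(index) if index else 0
        # Assumption: the requested index must be below the visible GPU count.
        return torch.cuda.is_available() and wanted < torch.cuda.device_count()
    if kind == "mps":
        # torch.backends.mps is missing on older PyTorch builds.
        mps = getattr(torch.backends, "mps", None)
        return mps is not None and bool(mps.is_available())
    return False  # unknown device kind


for name in ["cpu", "cuda:0", "mps"]:
    print(f"{name:<8} available: {probe_device(name)}")
```

Checking the index as well as general CUDA availability matters on single-GPU machines, where "cuda:1" would otherwise appear usable until the first transfer fails.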
print("CUDA available:", Accelerator.gpu().is_available())
print("MPS available: ", Accelerator.mps().is_available())
print("CPU available: ", Accelerator.cpu().is_available())
Summary#
Accelerator constructor#
| Parameter | Type | Default | Description |
|---|---|---|---|
| device | str | "cpu" | Device string; see Supported Device Strings. |
| pin_memory | bool | False | Pin CPU tensors before GPU transfer (async DMA). CUDA only. |
Accelerator classmethods#
| Method | Returns |
|---|---|
| Accelerator.cpu() | CPU accelerator |
| Accelerator.cuda(index) | CUDA accelerator |
| Accelerator.mps() | Apple Silicon MPS accelerator |
| Accelerator.gpu(index) | Generic GPU alias (maps to CUDA) |
Accelerator methods#
| Method | Returns | Description |
|---|---|---|
| is_available() | bool | Probes whether the device exists on this machine. |
| torch_device_str() | str | PyTorch device string, e.g. "cuda:1" |
| tf_device_str() | str | TensorFlow device string, e.g. "/GPU:1" |
|  |  | Calls |
| move_torch_tensor() |  | Moves tensor to device (with optional pinning). |
| tf_device_scope() | context manager |  |
|  |  | Serializable config. |
|  |  | Reconstruct from config. |
Phase and node parameters#
| Location | Parameter | Effect |
|---|---|---|
| TrainPhase / EvalPhase / FitPhase | accelerator | Default device for all active nodes in the phase. |
| ModelNode | accelerator | Per-node override; takes priority over the phase accelerator. |
Device resolution order#
node._accelerator > phase.accelerator > CPU (fallback)
What happens at phase start#
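| Step | What runs | Under the hood |
|---|---|---|
| Node placement | ModelGraph.pre_place_nodes() | node._ensure_node_on_device(accelerator) on each active node |
| Batch pre-materialization | _pre_materialize_sampler_execs() | PyArrow BatchView converted to a device-resident Batch |
| Progress spinner |  | Indeterminate spinner with elapsed-time counter |
| Epoch loop | Pre-built Batch objects | Zero Arrow / conversion / device-transfer overhead |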