How to: Use Hardware Acceleration#

ModularML supports GPU and Apple Silicon acceleration for both model training and data preprocessing. Device placement is configured through the Accelerator class and can be specified at two levels:

  • Phase-level: a default device applied to all active model nodes in that phase

  • Node-level: a per-node override that takes priority over the phase-level setting

When a phase begins, ModularML pre-places all model nodes on their resolved devices and pre-materializes all input batches to device-resident tensors before the epoch loop starts. This removes repeated host-to-device transfers from the epoch loop, which can substantially shorten training, especially over many epochs.

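This notebook covers:

  • The Accelerator class and its supported device strings

  • Phase-level acceleration

  • Node-level acceleration and priority rules

  • Mixed-device graphs

  • pin_memory for faster host-to-device transfers

  • TensorFlow acceleration

  • Checking device availability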

Note that hardware acceleration may require a different PyTorch or TensorFlow build than the one installed with modularml. For PyTorch, see the compute platform options at: https://pytorch.org/get-started/locally/
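Before configuring an accelerator, it can help to confirm which PyTorch build is actually installed. The following is a minimal sketch and not part of ModularML: the helper name describe_torch_build is hypothetical, and it relies only on standard PyTorch attributes (torch.version.cuda, torch.cuda.is_available(), torch.backends.mps.is_available()). It returns early when PyTorch is not installed.

```python
import importlib.util


def describe_torch_build() -> dict:
    """Summarize the installed PyTorch build (hypothetical helper, not a ModularML API)."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}

    import torch

    # torch.backends.mps only exists in newer PyTorch releases, so guard the lookup.
    mps_backend = getattr(torch.backends, "mps", None)
    return {
        "installed": True,
        "version": torch.__version__,
        # None for CPU-only builds, otherwise the CUDA toolkit version string.
        "cuda_build": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "mps_available": bool(mps_backend is not None and mps_backend.is_available()),
    }


for key, value in describe_torch_build().items():
    print(f"{key:<16} {value}")
```

If cuda_build is None on a machine with an NVIDIA GPU, the installed wheel is likely a CPU-only build, and a different install from the link above is probably needed.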

%matplotlib inline
import numpy as np

from modularml import (
    AppliedLoss,
    EvalPhase,
    Experiment,
    FeatureSet,
    Loss,
    ModelGraph,
    ModelNode,
    Optimizer,
    TrainPhase,
)
from modularml.samplers import SimpleSampler
from modularml.utils.nn.accelerator import Accelerator

The Accelerator Class#

The Accelerator class is the single configuration object for hardware device placement. It wraps a device string and an optional pin_memory flag, and provides backend-specific helpers for PyTorch and TensorFlow.

Accelerator(
    device: str = "cpu",
    *,
    pin_memory: bool = False,
)

| Parameter | Type | Default | Description |
|---|---|---|---|
| device | str | "cpu" | Device identifier string (see table below). |
| pin_memory | bool | False | If True, CPU tensors are pinned before GPU transfer, enabling asynchronous DMA. CUDA only. |

Supported Device Strings#

| Device string | Backend | Meaning |
|---|---|---|
| "cpu" | PyTorch / TensorFlow | Host CPU |
| "cuda" | PyTorch | Default CUDA GPU (index 0) |
| "cuda:0", "cuda:1", … | PyTorch | Specific CUDA GPU by index |
| "gpu" | PyTorch / TensorFlow | Generic GPU alias (maps to cuda / /GPU:0) |
| "gpu:0", "gpu:1", … | PyTorch / TensorFlow | Generic GPU with index |
| "mps" | PyTorch | Apple Silicon Metal Performance Shaders |

Internally, Accelerator.torch_device_str() translates these to PyTorch’s format (e.g. "gpu:1" → "cuda:1"), and Accelerator.tf_device_str() produces the TensorFlow format (e.g. "cuda:1" → "/GPU:1").
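The translation described above can be pictured with a small standalone sketch. This is an illustration of the mapping in the table, not ModularML's actual implementation: the function names to_torch_device and to_tf_device are hypothetical, and the handling of "cpu" on the TensorFlow side ("/CPU:0") and of an unindexed "gpu" on the PyTorch side ("cuda") are assumptions.

```python
def to_torch_device(device: str) -> str:
    """Map a generic device string to PyTorch's naming (illustrative only)."""
    kind, _, index = device.lower().partition(":")
    if kind == "gpu":
        kind = "cuda"  # the generic alias resolves to CUDA for PyTorch
    return f"{kind}:{index}" if index else kind


def to_tf_device(device: str) -> str:
    """Map a generic device string to TensorFlow's naming (illustrative only)."""
    kind, _, index = device.lower().partition(":")
    if kind in ("cuda", "gpu"):
        # An unindexed GPU string is assumed to mean the first GPU.
        return f"/GPU:{index or 0}"
    # TensorFlow has no MPS device, so "mps" (like "cpu") maps to the host CPU.
    return "/CPU:0"


for name in ["cpu", "cuda", "cuda:1", "gpu:1", "mps"]:
    print(f"{name:<8} torch: {to_torch_device(name):<8} tf: {to_tf_device(name)}")
```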

Constructor variants#

Both a direct constructor and convenience classmethods are available.

# Direct construction
acc_cpu  = Accelerator("cpu")
acc_cuda = Accelerator("cuda:0", pin_memory=True)
acc_mps  = Accelerator("mps")

# Convenience classmethods (equivalent)
acc_cpu2  = Accelerator.cpu()
acc_cuda2 = Accelerator.cuda(index=0, pin_memory=True)
acc_mps2  = Accelerator.mps()
acc_gpu   = Accelerator.gpu(index=0)   # backend-agnostic alias

print(f"CPU torch device:   {acc_cpu.torch_device_str()}")
print(f"CUDA torch device:  {acc_cuda.torch_device_str()}")
print(f"MPS torch device:   {acc_mps.torch_device_str()}")
print(f"GPU torch device:   {acc_gpu.torch_device_str()}")
print()
print(f"CUDA TF device:     {acc_cuda.tf_device_str()}")
print(f"MPS TF device:      {acc_mps.tf_device_str()}  # TF has no MPS support; maps to CPU")

Experiment Setup#

We will reuse a simple single-node experiment throughout this notebook. The model and data setup is similar to the experiment notebook (How to: Create and Use an Experiment); however, in this notebook we utilize the accelerator= argument.

Note that the benefits of hardware acceleration only become apparent for larger models and datasets. We keep both small in this example to limit documentation build time.

from modularml.models.torch import SequentialMLP

rng = np.random.default_rng(42)

# Synthetic data: 500 samples, 50-d feature, 1-d target
fs = FeatureSet.from_dict(
    label="SensorData",
    data={
        "voltage": list(rng.standard_normal((500, 50))),
        "soh": list(rng.standard_normal((500, 1))),
    },
    feature_keys="voltage",
    target_keys="soh",
)
fs.split_random(ratios={"train": 0.8, "test": 0.2}, seed=13)
fs_ref = fs.reference(features="voltage", targets="soh")


# Create model node
mn_mlp = ModelNode(
    label="MLP",
    model=SequentialMLP(output_shape=(1, 1), n_layers=2, hidden_dim=16),
    upstream_ref=fs_ref,
)

# Create model graph with a global optimizer
graph = ModelGraph(
    label="SimpleGraph",
    nodes=[mn_mlp],
    optimizer=Optimizer("adam", opt_kwargs={"lr": 1e-3}, backend="torch"),
)

# Build the graph (infers shapes)
graph.build()
graph.visualize()

exp = Experiment.from_active_context(label="my_experiment")

# Pick the best device available on this machine
def best_accelerator() -> Accelerator:
    for acc in [Accelerator.gpu(), Accelerator.mps()]:
        if acc.is_available():
            return acc
    # CPU is always available, so it serves as the final fallback
    return Accelerator.cpu()

device = best_accelerator()
print(fs)
print(f"Selected device: {device.device}")

Phase-Level Acceleration#

The simplest way to enable GPU training is to pass an accelerator to the phase. All active nodes are moved to that device before the first epoch begins.

The accelerator parameter is available on TrainPhase, EvalPhase, and FitPhase. It accepts either an Accelerator instance or a plain device string; ModularML wraps strings automatically.

mse_loss = AppliedLoss(
    loss=Loss("mse", backend="torch"),
    on="MLP",
    inputs=["outputs", "targets"],
)

# Pass an Accelerator instance
train_phase = TrainPhase.from_split(
    label="train",
    split="train",
    sampler=SimpleSampler(batch_size=4, shuffle=True, seed=42),
    losses=[mse_loss],
    n_epochs=2,
    # Or pass a plain string; it is wrapped automatically: "cuda:0" == Accelerator("cuda:0")
    accelerator=device,
)


eval_phase = EvalPhase.from_split(
    label="eval",
    split="test",
    losses=[mse_loss],
    accelerator=device,
)

print(f"TrainPhase accelerator: {train_phase.accelerator}")
print(f"EvalPhase accelerator:  {eval_phase.accelerator}")

# Run the training phase on the selected device
results = exp.run_phase(train_phase)
print("Training complete.")

What happens under the hood#

When iter_execution() is called on the phase, two steps run once before the epoch loop rather than repeatedly inside it:

1. Node placement - ModelGraph.pre_place_nodes() iterates over all active ModelNode instances and calls node._ensure_node_on_device(accelerator) on each:

  • torch_module.to("cuda:0") - moves all model parameters and buffers in-place

  • For already-built optimizers: iterates optimizer.instance.state and calls .to("cuda:0") on every momentum / variance tensor, preventing a device mismatch on the first optimizer.step()

2. Batch pre-materialization - _pre_materialize_sampler_execs() converts every lazy BatchView (a zero-copy index slice into a PyArrow table) into a concrete Batch of torch tensors already resident on the target device.

After these two steps, the epoch loop runs with zero PyArrow overhead and zero device-transfer overhead per step.
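The benefit of doing the conversion once, outside the epoch loop, can be sketched without any ML library. The classes below are hypothetical stand-ins (LazyBatchView is not ModularML's BatchView, and a dict of lists stands in for the PyArrow table); the point is only that the number of conversions depends on the number of batches, not on the number of epochs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyBatchView:
    """Stand-in for a lazy batch: row indices into a shared table, no copied data."""

    table: dict
    indices: tuple

    def materialize(self) -> dict:
        # The expensive step: gather the selected rows into a concrete batch.
        return {name: [column[i] for i in self.indices] for name, column in self.table.items()}


def run_epochs(views, n_epochs, step_fn):
    """Materialize every view once, then reuse the concrete batches each epoch."""
    batches = [view.materialize() for view in views]  # runs once, before the loop
    for _ in range(n_epochs):
        for batch in batches:
            step_fn(batch)
    return len(batches)  # number of conversions actually performed


table = {"voltage": [0.1, 0.2, 0.3, 0.4], "soh": [1.0, 0.9, 0.8, 0.7]}
views = [LazyBatchView(table, (0, 1)), LazyBatchView(table, (2, 3))]

seen = []
conversions = run_epochs(views, n_epochs=3, step_fn=seen.append)
print(f"{conversions} conversions served {len(seen)} training steps")
```

In the real pipeline the materialized batches would additionally be moved to the target device, so the same reasoning applies to device transfers.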


Node-Level Acceleration#

Individual nodes can declare their own accelerator directly on the ModelNode. A node-level accelerator always takes priority over the phase-level setting. This is useful when different nodes in the same graph should run on different devices.

# Node explicitly pinned to CPU, regardless of phase accelerator
mn_cpu = ModelNode(
    label="MLP",
    model=SequentialMLP(output_shape=(1, 1), n_layers=2, hidden_dim=32),
    upstream_ref=fs_ref,
    accelerator=Accelerator.cpu(),  # node-level override
)
graph.replace_node(mn_mlp, mn_cpu).build()

# Phase accelerator is None - node overrides it
train_node_acc = TrainPhase.from_split(
    label="train",
    split="train",
    sampler=SimpleSampler(batch_size=64, shuffle=True, seed=42),
    losses=[mse_loss],
    n_epochs=2,
    accelerator=None,
)

print(f"Phase accelerator: {train_node_acc.accelerator}")
print(f"Node accelerator:  {mn_cpu._accelerator}")

Priority rules#

ModularML resolves the effective device for each node with the following priority:

node._accelerator  >  phase.accelerator  >  CPU (fallback)

This logic lives in ModelGraph._resolve_node_accelerator(node, phase_accelerator):

node_acc = getattr(node, "_accelerator", None)
if node_acc is not None:
    return node_acc          # node-level wins
return phase_accelerator     # phase-level wins (may be None -> CPU)

If neither the node nor the phase specifies an accelerator, pre_place_nodes falls back to Accelerator("cpu") so all nodes are always in a known, deterministic state.
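The three possible outcomes of this priority order can be checked with a self-contained sketch. AcceleratorStub and NodeStub are hypothetical stand-ins used so the example runs without ModularML; the resolution function mirrors the logic described above, including the CPU fallback.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AcceleratorStub:
    """Hypothetical stand-in for Accelerator; only the device string matters here."""

    device: str = "cpu"


@dataclass
class NodeStub:
    """Hypothetical stand-in for ModelNode with an optional node-level accelerator."""

    label: str
    accelerator: Optional[AcceleratorStub] = None


def resolve_device(node: NodeStub, phase_accelerator: Optional[AcceleratorStub]) -> str:
    """Apply the priority order: node-level, then phase-level, then CPU."""
    if node.accelerator is not None:
        return node.accelerator.device
    if phase_accelerator is not None:
        return phase_accelerator.device
    return AcceleratorStub("cpu").device


pinned = NodeStub("Encoder", AcceleratorStub("cuda:1"))
plain = NodeStub("Head")
phase_acc = AcceleratorStub("cuda:0")

print(resolve_device(pinned, phase_acc))  # node-level setting wins
print(resolve_device(plain, phase_acc))   # phase-level default applies
print(resolve_device(plain, None))        # nothing set, CPU fallback
```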


Mixed-Device Graphs#

In graphs with multiple model nodes, you can assign different devices to different nodes. However, this introduces the overhead of moving intermediate tensors between devices at every step. Pinned memory can speed up host-to-device copies, but it does not remove this cost.

Generally, hardware acceleration works best when all nodes in a ModelGraph share a common backend and device. You likely won’t see large speed-ups for mixed-device workflows.

# Encoder on GPU 0, head on GPU 1 (falls back to CPU if only one GPU)
enc_device  = Accelerator.cuda(0) if Accelerator.cuda(0).is_available() else Accelerator.cpu()
head_device = Accelerator.cuda(1) if Accelerator.cuda(1).is_available() else Accelerator.cpu()

encoder = ModelNode(
    label="Encoder",
    model=SequentialMLP(output_shape=(1, 16), n_layers=2, hidden_dim=64),
    upstream_ref=fs_ref,
    accelerator=enc_device,
)
head = ModelNode(
    label="Head",
    model=SequentialMLP(output_shape=(1, 1), n_layers=1, hidden_dim=16),
    upstream_ref=encoder.reference(),
    accelerator=head_device,
)

print(f"Encoder device: {encoder._accelerator.device}")
print(f"Head device:    {head._accelerator.device}")

During pre_place_nodes, each node is moved to its own resolved device independently. Data flowing between nodes on different devices is handled automatically: the graph’s forward pass checks whether each incoming tensor is already on the correct device and only calls accelerator.move_torch_tensor() when a transfer is actually needed.
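The "move only when needed" check can be illustrated with a toy model. TensorStub and TransferLog are hypothetical stand-ins that track a device label instead of real memory; the sketch only shows that a copy is counted when the incoming tensor's device differs from the node's device.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TensorStub:
    """Hypothetical stand-in for a tensor that only tracks where it lives."""

    name: str
    device: str


class TransferLog:
    """Counts how many cross-device copies were actually performed."""

    def __init__(self) -> None:
        self.copies = 0

    def ensure_on(self, tensor: TensorStub, device: str) -> TensorStub:
        if tensor.device == device:
            return tensor  # already in place, nothing to do
        self.copies += 1
        return replace(tensor, device=device)


def forward_chain(batch: TensorStub, node_devices: list, log: TransferLog) -> TensorStub:
    """Pass a tensor through nodes in order, moving it only when devices differ."""
    current = batch
    for device in node_devices:
        current = log.ensure_on(current, device)
        # A real node would compute here; its output stays on the node's device.
    return current


log = TransferLog()
out = forward_chain(TensorStub("batch", "cuda:0"), ["cuda:0", "cuda:1"], log)
print(out.device, log.copies)
```

Here the batch already sits on the encoder's device, so the only copy happens at the boundary between the two nodes.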


pin_memory for faster host-to-device transfers#

Setting pin_memory=True on a CUDA accelerator places CPU tensors in page-locked (pinned) memory before the GPU transfer. This enables asynchronous PCIe DMA: the CPU can continue executing Python while the GPU is receiving data over the bus.

Under the hood, Accelerator.move_torch_tensor() dispatches:

# With pin_memory=True
tensor.pin_memory().to(device_str, non_blocking=True)

# With pin_memory=False (default)
tensor.to(device_str)

pin_memory is most useful when batches are transferred from CPU to GPU during training, since pinned host memory can enable asynchronous/non-blocking copies. If the full dataset is pre-materialized onto the GPU before the epoch loop, then pin_memory does not affect per-step training throughput; at most, it can affect the one-time CPU->GPU materialization step, and whether that is actually faster depends on the workload and hardware.

acc_pinned = Accelerator("cuda:0", pin_memory=True)
print(f"device:      {acc_pinned.device}")
print(f"pin_memory:  {acc_pinned.pin_memory}")

TensorFlow Acceleration#

ModularML supports TensorFlow models through the same ModelGraph / phase API. The Accelerator class translates device strings to TensorFlow’s /GPU:N format automatically.

For TensorFlow nodes, device placement uses a tf.device() context manager rather than .to() calls. Accelerator.tf_device_scope() returns this context manager and is called internally during the forward pass of TF-backend nodes.

# TF device string translation
for device_str in ["cpu", "cuda", "cuda:1", "gpu:0", "mps"]:
    acc = Accelerator(device_str)
    print(f"  {device_str:<10}  torch: {acc.torch_device_str():<12}  tf: {acc.tf_device_str()}")

Note: MPS is not supported by TensorFlow. tf_device_str() maps "mps" to "/CPU:0" so that TF-backend nodes run without error on Apple Silicon machines.

When PyTorch and TensorFlow nodes coexist in a graph, each node is placed on its resolved device using the correct backend API. The same accelerator= argument at the phase or node level drives both.


Checking Availability and Serialization#

Availability#

Accelerator.is_available() probes the hardware using the relevant backend library:

  • CUDA / GPU: torch.cuda.is_available() and torch.cuda.device_count()

  • MPS: torch.backends.mps.is_available()

  • CPU: always True

print("CUDA available:", Accelerator.gpu().is_available())
print("MPS available: ", Accelerator.mps().is_available())
print("CPU available: ", Accelerator.cpu().is_available())
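On serialization, the Accelerator API lists get_config(), which returns a serializable dict, and from_config(config), which reconstructs an accelerator from it. The round trip can be sketched with a stand-in class; AcceleratorConfigSketch is hypothetical, and the config keys ("device", "pin_memory") are an assumption based on the constructor arguments rather than ModularML's documented format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AcceleratorConfigSketch:
    """Hypothetical stand-in that mirrors the Accelerator constructor arguments."""

    device: str = "cpu"
    pin_memory: bool = False

    def get_config(self) -> dict:
        # Assumed layout: one key per constructor argument.
        return asdict(self)

    @classmethod
    def from_config(cls, config: dict) -> "AcceleratorConfigSketch":
        return cls(device=config["device"], pin_memory=config.get("pin_memory", False))


original = AcceleratorConfigSketch("cuda:0", pin_memory=True)
payload = json.dumps(original.get_config())  # plain types only, so JSON works
restored = AcceleratorConfigSketch.from_config(json.loads(payload))

print(payload)
print(restored == original)
```

With the real class, the equivalent calls would be Accelerator(...).get_config() and Accelerator.from_config(config).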

Summary#

Accelerator constructor#

| Parameter | Type | Default | Description |
|---|---|---|---|
| device | str | "cpu" | Device string: "cpu", "cuda", "cuda:N", "gpu", "gpu:N", "mps". |
| pin_memory | bool | False | Pin CPU tensors before GPU transfer (async DMA). CUDA only. |

Accelerator classmethods#

| Method | Returns |
|---|---|
| Accelerator.cpu() | CPU accelerator |
| Accelerator.cuda(index=0, pin_memory=False) | CUDA accelerator |
| Accelerator.mps() | Apple Silicon MPS accelerator |
| Accelerator.gpu(index=0, pin_memory=False) | Generic GPU alias (maps to CUDA) |

Accelerator methods#

| Method | Returns | Description |
|---|---|---|
| is_available() | bool | Probes whether the device exists on this machine. |
| torch_device_str() | str | PyTorch device string, e.g. "cuda:0". |
| tf_device_str() | str | TensorFlow device string, e.g. "/GPU:0". |
| setup_torch_model(module) | None | Calls module.to(device) in-place. |
| move_torch_tensor(tensor) | Tensor | Moves tensor to device (with optional pinning). |
| tf_device_scope() | context manager | tf.device(…) context for TF placement. |
| get_config() | dict | Serializable config. |
| from_config(config) | Accelerator | Reconstruct from config. |

Phase and node parameters#

| Location | Parameter | Effect |
|---|---|---|
| TrainPhase, EvalPhase, FitPhase | accelerator= | Default device for all active nodes in the phase. |
| ModelNode | accelerator= | Per-node override; takes priority over the phase accelerator. |

Device resolution order#

node._accelerator  >  phase.accelerator  >  CPU (fallback)

What happens at phase start#

| Step | What runs | Under the hood |
|---|---|---|
| Node placement | ModelGraph.pre_place_nodes() | module.to(device) + optimizer state tensors .to(device) |
| Batch pre-materialization | _pre_materialize_sampler_execs() | PyArrow .take() -> numpy -> torch.as_tensor() -> .to(device) |
| Progress spinner | ProgressTask(style="spinner", total=None) | Indeterminate spinner with elapsed-time counter |
| Epoch loop | Pre-built Batch objects reused each epoch | Zero Arrow / conversion / device-transfer overhead |