How to: Create and Use a FeatureSet#

The FeatureSet is the central data container in ModularML. It organizes your data into three domains:

  • Features: model inputs (e.g., time-series signals, sensor readings)

  • Targets: values to predict (e.g., state-of-health, capacity)

  • Tags: metadata for grouping and filtering (e.g., cell ID, temperature)

Under the hood, a FeatureSet wraps a SampleCollection, which stores all data in a columnar Apache Arrow table. Each column follows the naming convention <domain>.<key>.<representation> (e.g., features.voltage.raw). A SampleSchema tracks the structure, shapes, and data types.
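
The naming convention described above can be illustrated with a small stand-alone helper. This is only a sketch: `ColumnName` and `parse_column_name` are hypothetical names introduced for illustration, not part of the ModularML API, and the sketch assumes that keys themselves contain no dots.

```python
from typing import NamedTuple

VALID_DOMAINS = ("features", "targets", "tags")


class ColumnName(NamedTuple):
    """Hypothetical container for the three parts of a column name."""

    domain: str
    key: str
    rep: str


def parse_column_name(name: str) -> ColumnName:
    """Split '<domain>.<key>.<representation>' into its three parts."""
    parts = name.split(".")
    if len(parts) != 3:
        raise ValueError(f"Expected '<domain>.<key>.<rep>', got {name!r}")
    domain, key, rep = parts
    if domain not in VALID_DOMAINS:
        raise ValueError(f"Unknown domain {domain!r}")
    return ColumnName(domain, key, rep)


col = parse_column_name("features.voltage.raw")
print(col.domain, col.key, col.rep)  # features voltage raw
```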

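This notebook covers the complete FeatureSet API, from construction and inspection through filtering, splitting, transforms, serialization, and references.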

%matplotlib inline
import numpy as np
import pandas as pd

import modularml as mml
from modularml import FeatureSet

We’ll use synthetic battery pulse-response data throughout this notebook. Each sample contains a 101-point voltage time-series, a scalar state-of-health (SOH) target, and metadata tags.

N_SAMPLES = 1000
N_CELLS = 20
N_GROUPS = 5
TIME = np.linspace(0, 100, 101)

rng = np.random.default_rng(42)

# Assign each sample a cell, group, pulse type, and SOC
cell_ids = rng.integers(1, N_CELLS + 1, size=N_SAMPLES)
group_ids = rng.integers(1, N_GROUPS + 1, size=N_SAMPLES)
pulse_types = rng.choice(["chg", "dchg"], size=N_SAMPLES)
pulse_socs = rng.choice([10, 20, 30, 40, 50, 60, 70, 80, 90], size=N_SAMPLES)

# SOH degrades with group_id (higher group = more degraded)
soh = 100.0 - (group_ids - 1) * 8.0 + rng.normal(0, 2, size=N_SAMPLES)

# Synthetic voltage: baseline + pulse shape, shifted by SOC and degraded by SOH
voltage = np.zeros((N_SAMPLES, 101))
for i in range(N_SAMPLES):
    base = 3.2 + pulse_socs[i] / 100.0 * 0.5
    amplitude = 0.3 * (soh[i] / 100.0)
    sign = 1.0 if pulse_types[i] == "chg" else -1.0
    curve = sign * amplitude * (1 - np.exp(-TIME / 15.0))
    voltage[i] = base + curve + rng.normal(0, 0.002, size=101)

data = {
    "voltage": voltage.tolist(),
    "soh": soh.tolist(),
    "cell_id": cell_ids.tolist(),
    "group_id": group_ids.tolist(),
    "pulse_type": pulse_types.tolist(),
    "pulse_soc": pulse_socs.tolist(),
}

print(f"Samples: {N_SAMPLES}")
print(f"Voltage shape per sample: {voltage[0].shape}")
print(f"SOH range: [{soh.min():.1f}, {soh.max():.1f}]")

Creating a FeatureSet#

Three class methods are available for construction.

from_dict(): From a Python dictionary#

The most common constructor. Pass a dict where each key maps to a list/array of values (one entry per sample), then specify which keys are features, targets, and tags.

fs = FeatureSet.from_dict(
    label="PulseData",
    data=data,
    feature_keys="voltage",
    target_keys="soh",
    tag_keys=["cell_id", "group_id", "pulse_type", "pulse_soc"],
)
print(fs)
print(f"Feature shapes: {fs.get_feature_shapes()}")
print(f"Target shapes:  {fs.get_target_shapes()}")

When accessing FeatureSet data, you’ll notice that all keys are returned in the <domain>.<key>.<representation> format by default. You can modify the returned string with the include_rep_suffix and include_domain_prefix arguments, which are available in all FeatureSet.get_<> methods.

Note that omitting a component raises an error if the result is a non-unique key (e.g., when you have two representations of the same feature column).

print(
    f"Feature shapes: {fs.get_feature_shapes(include_domain_prefix=False, include_rep_suffix=False)}",
)
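
To see why dropping a name component can be ambiguous, consider the following sketch. `shorten_keys` is a hypothetical helper written for this illustration; it mimics the described behaviour but is not ModularML's implementation.

```python
def shorten_keys(columns, include_domain_prefix=True, include_rep_suffix=True):
    """Drop name components, refusing to produce duplicate keys."""
    short = []
    for col in columns:
        domain, key, rep = col.split(".")
        parts = [key]
        if include_domain_prefix:
            parts.insert(0, domain)
        if include_rep_suffix:
            parts.append(rep)
        short.append(".".join(parts))
    if len(set(short)) != len(short):
        raise ValueError(f"Shortened keys are not unique: {short}")
    return short


print(shorten_keys(["features.voltage.raw"], False, False))  # ['voltage']

# Two representations of the same column collide once the suffix is dropped.
try:
    shorten_keys(
        ["features.voltage.raw", "features.voltage.transformed"],
        include_rep_suffix=False,
    )
except ValueError as err:
    print(err)
```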

from_pandas(): From a Pandas DataFrame#

Builds a FeatureSet directly from a Pandas DataFrame. As with from_dict(), we need to assign column names to features, targets, and tags.

# Create a simple DataFrame example
df = pd.DataFrame(
    {
        "temperature": np.random.default_rng(0).normal(25, 5, size=100),
        "humidity": np.random.default_rng(1).normal(60, 10, size=100),
        "output_power": np.random.default_rng(2).normal(100, 15, size=100),
        "site_id": np.repeat(["A", "B", "C", "D"], 25),
        "timestamp": np.arange(25).tolist() * 4,
    },
)

fs_from_df = FeatureSet.from_pandas(
    label="WeatherData",
    df=df,
    feature_cols=["temperature", "humidity"],
    target_cols="output_power",
    tag_cols="site_id",
)
print(fs_from_df)
print(f"Feature shapes: {fs_from_df.get_feature_shapes()}")

Note that the above approach treats every row in the dataframe as a unique sample for modeling.

If that’s not the case, grouping will need to be performed to aggregate rows in the Pandas dataframe belonging to each sample. The from_pandas constructor provides the group_by and sort_by arguments to do just that.

Below, we group all rows in our DataFrame by the 'site_id' at which the data was measured, and then ensure all data points are sorted by 'timestamp' within each sample. Notice how the features now have a shape of (25,).

fs_grouped = FeatureSet.from_pandas(
    label="WeatherGrouped",
    df=df,
    feature_cols=["temperature", "humidity"],
    target_cols="output_power",
    group_by="site_id",
    sort_by="timestamp",
    tag_cols=["site_id", "timestamp"],
)
print(fs_grouped)
print(f"Feature shapes: {fs_grouped.get_feature_shapes()}")
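
Conceptually, grouping and sorting turn many rows into one ordered sequence per sample. The sketch below shows that idea in plain Python; `group_rows` is a hypothetical helper and the tiny dataset is made up for illustration, so it says nothing about how from_pandas() is implemented internally.

```python
from collections import defaultdict

rows = [
    {"site_id": "A", "timestamp": 1, "temperature": 21.0},
    {"site_id": "B", "timestamp": 0, "temperature": 30.0},
    {"site_id": "A", "timestamp": 0, "temperature": 20.0},
    {"site_id": "B", "timestamp": 1, "temperature": 31.0},
]


def group_rows(rows, group_by, sort_by, value_col):
    """Collect one ordered sequence of `value_col` per unique `group_by` value."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[group_by]].append(row)
    return {
        group: [r[value_col] for r in sorted(members, key=lambda r: r[sort_by])]
        for group, members in grouped.items()
    }


samples = group_rows(rows, "site_id", "timestamp", "temperature")
print(samples)  # {'A': [20.0, 21.0], 'B': [30.0, 31.0]}
```

Four rows become two samples, each holding a length-2 sequence, which mirrors how 100 rows become four samples with features of shape (25,) above.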

from_pyarrow_table(): From an Arrow table#

If you already have a pyarrow.Table with columns following the <domain>.<key>.<rep> naming convention, you can wrap it directly.

Unless you are certain that the existing table uses the appropriate schema, it is recommended to use table.to_pandas(), then use the from_pandas() constructor.

import pyarrow as pa

table = pa.table(
    {
        "features.x.raw": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
        "targets.y.raw": [0.5, 1.5],
        "tags.group.raw": ["a", "b"],
    },
)

fs_arrow = FeatureSet.from_pyarrow_table(label="ArrowExample", table=table)
print(fs_arrow)
print(f"Feature shapes: {fs_arrow.get_feature_shapes()}")

Inspecting a FeatureSet#

Use the following properties and methods to understand the structure of your data.

# Basic info
print(f"Label:      {fs.label}")
print(f"Samples:    {len(fs)}")
print(f"repr:       {fs!r}")
# Column keys by domain
print("Feature keys:", fs.get_feature_keys())
print("Target keys: ", fs.get_target_keys())
print("Tag keys:    ", fs.get_tag_keys())
# All keys with full qualification (domain prefix + rep suffix)
fs.get_all_keys(include_domain_prefix=True, include_rep_suffix=True)
# Shapes and dtypes
print("Feature shapes:", fs.get_feature_shapes())
print("Target shapes: ", fs.get_target_shapes())
print("Tag shapes:    ", fs.get_tag_shapes())
print()
print("Feature dtypes:", fs.get_feature_dtypes())
print("Target dtypes: ", fs.get_target_dtypes())

Note that most data-containing classes in ModularML also support a summary() method.

Printing its result provides a formatted summary of that object's characteristics.

print(fs.summary())

Accessing Data#

Data can be retrieved in multiple formats via the fmt parameter. Accepted values include "numpy", "pandas", "dict_numpy", "torch", and more.

Domain-specific accessors#

# Get all features as a dict of numpy arrays (default)
features = fs.get_features()
print(f"Type: {type(features)}")
print(f"Keys: {list(features.keys())}")
print(f"Voltage shape: {features['voltage'].shape}")
# Get a single feature by name, as numpy
voltage = fs.get_features(fmt="numpy", features="voltage")
print(f"Shape: {voltage.shape}")
# Get targets as a pandas DataFrame
targets_df = fs.get_targets(fmt="pandas")
targets_df.head()
# Get specific tags as a dict of numpy arrays
tags = fs.get_tags(fmt="dict_numpy", tags=["cell_id", "pulse_type"])
print(f"Type: {type(tags)}")
print(f"Cell IDs (first 5): {tags['cell_id'][:5]}")
print(f"Pulse types (first 5): {tags['pulse_type'][:5]}")

Unified accessor: get_data()#

Retrieve columns from multiple domains in a single call. Supports wildcards and a default rep parameter.

# Specific columns from different domains
result = fs.get_data(
    features="voltage",
    targets="soh",
    tags="*",
    fmt="pandas",
)
result.head()
# Sample UUIDs - each sample has a unique identifier
uuids = fs.get_sample_uuids()
print(f"First 3 UUIDs: {uuids[:3]}")

Export to pandas or Arrow#

# Full export to pandas
df_all = fs.to_pandas()
df_all
# Export to Arrow table
arrow_table = fs.to_arrow()
print(f"Arrow schema:\n{arrow_table.schema}")

Row Subsetting and Filtering#

All row-subsetting operations return a FeatureSetView - a lightweight, zero-copy window over the parent FeatureSet.

filter(): Condition-based filtering#

Conditions are a dict mapping fully-qualified column names to:

  • A scalar (equality match)

  • A list/set (membership test)

  • A callable (row-wise predicate)

# Filter by equality
view_chg = fs.filter(conditions={"tags.pulse_type.raw": "chg"})
print(f"Charge-only: {view_chg}")
print(np.unique(view_chg.get_tags(fmt="np", tags="pulse_type")))
# Filter by list membership
view_cells = fs.filter(conditions={"tags.cell_id.raw": [1, 2, 3]})
print(f"Cells 1-3: {view_cells}")
print(np.unique(view_cells.get_tags(fmt="np", tags="cell_id")))
# Filter with callable + multiple conditions (AND-composed)
view_healthy_chg = fs.filter(
    conditions={
        "tags.pulse_type.raw": "chg",
        "targets.soh.raw": lambda x: x >= 90.0,
    },
)
print(f"Healthy charge pulses: {view_healthy_chg}")
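
The three condition types and their AND-composition can be sketched without ModularML. `matches` and `filter_indices` are hypothetical helpers for illustration; they operate on a plain dict of columns rather than on a FeatureSet.

```python
def matches(value, condition):
    """Evaluate one condition against one cell value."""
    if callable(condition):
        return bool(condition(value))          # row-wise predicate
    if isinstance(condition, (list, set, tuple)):
        return value in condition              # membership test
    return value == condition                  # equality match


def filter_indices(columns, conditions):
    """Return row indices for which every condition holds (logical AND)."""
    n_rows = len(next(iter(columns.values())))
    return [
        i
        for i in range(n_rows)
        if all(matches(columns[col][i], cond) for col, cond in conditions.items())
    ]


columns = {
    "tags.pulse_type.raw": ["chg", "dchg", "chg", "chg"],
    "targets.soh.raw": [95.0, 97.0, 82.0, 91.0],
}
kept = filter_indices(
    columns,
    {"tags.pulse_type.raw": "chg", "targets.soh.raw": lambda x: x >= 90.0},
)
print(kept)  # [0, 3]
```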

take(): By relative index#

view_first10 = fs.take([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], label="first_10")
print(view_first10)

take_sample_uuids(): By UUID#

some_uuids = fs.get_sample_uuids()[:5].tolist()
view_by_uuid = fs.take_sample_uuids(some_uuids, label="uuid_subset")
print(view_by_uuid)

Set operations between views#

# Intersection: samples in both views
view_a = fs.filter(conditions={"tags.pulse_type.raw": "chg"})
print("View A:")
print(" - view:", view_a)
print(" - cells:", np.unique(view_a.get_tags(fmt="np", tags="cell_id")))
print(" - pulse_types:", np.unique(view_a.get_tags(fmt="np", tags="pulse_type")))

view_b = fs.filter(conditions={"tags.cell_id.raw": [1, 2, 3]})
print("\nView B:")
print(" - view:", view_b)
print(" - cells:", np.unique(view_b.get_tags(fmt="np", tags="cell_id")))
print(" - pulse_types:", np.unique(view_b.get_tags(fmt="np", tags="pulse_type")))


view_intersect = view_a.take_intersection(view_b, label="chg_cells_1to3")
print("\nIntersection:")
print(" - view:", view_intersect)
print(" - cells:", np.unique(view_intersect.get_tags(fmt="np", tags="cell_id")))
print(
    " - pulse_types:",
    np.unique(view_intersect.get_tags(fmt="np", tags="pulse_type")),
)

# Difference: samples in A but not in B
view_diff = view_a.take_difference(view_b, label="chg_not_cells_1to3")
print("\nDifference:")
print(" - view:", view_diff)
print(" - cells:", np.unique(view_diff.get_tags(fmt="np", tags="cell_id")))
print(" - pulse_types:", np.unique(view_diff.get_tags(fmt="np", tags="pulse_type")))

We can also check view overlap via the is_disjoint_with method.

print("view_diff does not contain view_b samples: ", view_b.is_disjoint_with(view_diff))
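
Because a view is defined by the rows it references, these operations behave like ordinary set algebra on row indices. The following sketch uses Python's built-in set type with made-up indices; it illustrates the semantics only and is not how FeatureSetView is implemented.

```python
view_a = {0, 2, 3, 5}   # e.g. row indices of charge pulses
view_b = {2, 4, 5, 7}   # e.g. row indices of cells 1-3

intersection = view_a & view_b   # rows in both views
difference = view_a - view_b     # rows in A but not in B

print(sorted(intersection))            # [2, 5]
print(sorted(difference))              # [0, 3]
print(view_b.isdisjoint(difference))   # True
```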

Converting a view back to a FeatureSet#

A FeatureSetView is a lightweight reference to a set of indices in the parent FeatureSet. Any modification to the FeatureSet will change the data accessed through its child views.

To create an independent FeatureSet from a view, use to_featureset():

fs_charge = view_chg.to_featureset(label="ChargePulses")
print(fs_charge)
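
The difference between a view and an independent copy can be sketched with a toy class. `ListView` is a hypothetical stand-in that stores indices into a plain Python list; it is meant only to illustrate the reference semantics described above.

```python
class ListView:
    """Hypothetical view: stores row indices, not the data itself."""

    def __init__(self, parent, indices):
        self.parent = parent
        self.indices = list(indices)

    def values(self):
        # Data is looked up in the parent at access time
        return [self.parent[i] for i in self.indices]

    def materialize(self):
        # Copy the referenced rows into a new, independent list
        return list(self.values())


parent = [10, 20, 30, 40]
view = ListView(parent, [1, 3])
snapshot = view.materialize()

parent[1] = 99            # modify the parent after creating the view
print(view.values())      # [99, 40]  -> the view sees the change
print(snapshot)           # [20, 40]  -> the independent copy does not
```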

Column Subsetting#

Use select() to create a view with only specific columns. Row indices are preserved.

select() supports the same wildcard usage as get_data().

# Select specific features and targets
view_slim = fs.select(features="voltage.*", targets="soh")
print(f"Columns: {view_slim.get_all_keys()}")
# Select by domain and representation
view_raw_only = fs.select(features="voltage", rep="raw")
print(
    f"Columns: {view_raw_only.get_all_keys(include_domain_prefix=True, include_rep_suffix=True)}",
)
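
As a rough mental model of wildcard selection, the sketch below matches column names against a pattern using the standard-library fnmatch module. Treating the wildcard as a shell-style glob is an assumption made for this illustration, and `select_columns` is a hypothetical helper, not a ModularML function.

```python
from fnmatch import fnmatchcase

columns = [
    "features.voltage.raw",
    "features.voltage.transformed",
    "features.current.raw",
    "targets.soh.raw",
]


def select_columns(columns, domain, pattern):
    """Return the columns in `domain` whose '<key>.<rep>' part matches `pattern`."""
    prefix = domain + "."
    return [
        col
        for col in columns
        if col.startswith(prefix) and fnmatchcase(col[len(prefix):], pattern)
    ]


print(select_columns(columns, "features", "voltage.*"))
# ['features.voltage.raw', 'features.voltage.transformed']
print(len(select_columns(columns, "features", "*")))  # 3
```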

Splitting Data#

Splitting creates named FeatureSetView partitions that can optionally be registered on the parent FeatureSet.

Random splitting#

Random splitting takes a ratios argument, defining the proportions of all samples in the calling container to be assigned to each key. The ratio values must add up to 1.

By default, split views are not returned and are automatically registered on the parent FeatureSet. This behavior can be changed via the return_views and register arguments.

fs_charge.clear_splits()
fs_charge.split_random(
    ratios={"train": 0.6, "val": 0.2, "test": 0.2},
    seed=42,
)

print(f"Available splits: {fs_charge.available_splits}")
for name, view in fs_charge.splits.items():
    print(f"  {name}: {len(view)} samples")

Use group_by to keep all samples sharing a tag value in the same split (prevents data leakage):

fs_charge.clear_splits()

fs_charge.split_random(
    ratios={"train": 0.5, "val": 0.3, "test": 0.2},
    group_by="group_id",
    seed=1,
)

for name, view in fs_charge.splits.items():
    group_ids = view.get_tags(fmt="numpy", tags="group_id")
    print(f"  {name}: {len(view)} samples, groups: {np.unique(group_ids)}")
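
The idea behind grouped splitting is to shuffle and partition the groups rather than the individual samples. The sketch below shows one simple way to do this; `grouped_split` is a hypothetical helper, and its rounding strategy is an assumption that may differ from what split_random() does.

```python
import random


def grouped_split(group_of_sample, ratios, seed=0):
    """Assign whole groups to splits so that no group spans two splits."""
    groups = sorted(set(group_of_sample))
    random.Random(seed).shuffle(groups)

    # Turn the ratios into cumulative cut points over the shuffled group list
    names = list(ratios)
    assignment, start, cumulative = {}, 0, 0.0
    for name in names:
        cumulative += ratios[name]
        stop = len(groups) if name == names[-1] else round(cumulative * len(groups))
        for g in groups[start:stop]:
            assignment[g] = name
        start = stop

    # Every sample follows its group into the assigned split
    splits = {name: [] for name in names}
    for idx, g in enumerate(group_of_sample):
        splits[assignment[g]].append(idx)
    return splits


group_ids = [1, 1, 2, 3, 3, 3, 4, 5, 5, 2]
splits = grouped_split(group_ids, {"train": 0.6, "val": 0.2, "test": 0.2}, seed=42)
for name, idxs in splits.items():
    print(name, sorted({group_ids[i] for i in idxs}))
```

In this sketch the ratios apply to the number of groups, so the resulting sample counts only approximate the requested proportions when groups differ in size.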

Use stratify_by to give every split a representative distribution of the given tag, matching the distribution in the calling container.

Note that grouping and stratification are mutually exclusive (you can’t use both at the same time).

fs_charge.clear_splits()

fs_charge.split_random(
    ratios={"train": 0.5, "val": 0.3, "test": 0.2},
    stratify_by="group_id",
    seed=1,
)

for name, view in fs_charge.splits.items():
    group_ids = view.get_tags(fmt="numpy", tags="group_id")
    print(f"  {name}: {len(view)} samples, groups: {np.unique(group_ids)}")
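
Stratification can be thought of as applying the ratios separately within each label and then merging the pieces. The following is a minimal sketch of that idea; `stratified_split` is a hypothetical helper and is not taken from ModularML.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios, seed=0):
    """Split each label's samples by `ratios`, then merge the pieces."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    names = list(ratios)
    splits = {name: [] for name in names}
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        start, cumulative = 0, 0.0
        for name in names:
            cumulative += ratios[name]
            stop = len(idxs) if name == names[-1] else round(cumulative * len(idxs))
            splits[name].extend(idxs[start:stop])
            start = stop
    return splits


labels = [1] * 10 + [2] * 10
splits = stratified_split(labels, {"train": 0.5, "val": 0.3, "test": 0.2}, seed=1)
for name, idxs in splits.items():
    counts = {lab: sum(labels[i] == lab for i in idxs) for lab in (1, 2)}
    print(name, counts)
# train {1: 5, 2: 5}
# val {1: 3, 2: 3}
# test {1: 2, 2: 2}
```

Unlike the grouped case, every label appears in every split, which is why the two options cannot be combined.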

Condition-based splitting#

Assign samples to splits using explicit conditions on any column.

split_by_condition takes a conditions argument: a nested dict of {split_name: {column: condition}}.

Conditions can be specified as:

  • A scalar: equality match, e.g. "chg"

  • A list: membership test, e.g. [1, 2, 3]

  • A callable: row-wise predicate, e.g. lambda x: x >= 90

  • A Predicate instance: e.g. GTE(90), In([1, 2, 3])

Samples satisfying all conditions within a named split are included in that split. A warning is raised if the produced splits are not mutually exclusive.

Predicates and serialization#

Raw scalars and lists work as shorthand, but lambdas are not serializable: they cannot be saved and restored when a FeatureSet is persisted to disk. To make split_by_condition calls fully round-trip safe, use the Predicate classes from modularml.predicates:

| Class | Condition |
| --- | --- |
| LT(v) | x < v |
| LTE(v) | x <= v |
| GT(v) | x > v |
| GTE(v) | x >= v |
| EQ(v) | x == v |
| NE(v) | x != v |
| In([...]) | x in [...] |
| NotIn([...]) | x not in [...] |
| Lambda("lambda x: ...") | arbitrary logic |

Lambda stores the source string explicitly, so it survives serialization, unlike a bare Python lambda.
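
The reason predicate objects serialize while bare lambdas do not can be shown with a toy example. `Threshold` below is a hypothetical class written for this sketch (it is not one of the modularml.predicates classes), and JSON is used only as a convenient stand-in for whatever format ModularML uses on disk.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Threshold:
    """Hypothetical serializable predicate: compares x against `value`."""

    op: str       # one of ">=", "<", "=="
    value: float

    def __call__(self, x) -> bool:
        if self.op == ">=":
            return x >= self.value
        if self.op == "<":
            return x < self.value
        if self.op == "==":
            return x == self.value
        raise ValueError(f"Unsupported operator {self.op!r}")


pred = Threshold(">=", 90.0)

# The predicate is plain data, so it round-trips through JSON unchanged.
restored = Threshold(**json.loads(json.dumps(asdict(pred))))
print(restored == pred, restored(92.5), restored(85.0))  # True True False

# A bare lambda carries no such description, and JSON cannot encode it.
try:
    json.dumps(lambda x: x >= 90.0)
except TypeError as err:
    print("lambda:", err)
```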

from modularml.predicates import EQ, GT, GTE, LT, LTE, NE, In, Lambda, NotIn

fs_charge.clear_splits()

fs_charge.split_by_condition(
    {
        "train": {"tags.group_id.raw": In([1, 2, 3])},
        "val": {"tags.group_id.raw": EQ(4)},
        "test": {"tags.group_id.raw": EQ(5)},
    },
)

for name, view in fs_charge.splits.items():
    group_ids = view.get_tags(fmt="numpy", tags="group_id")
    print(f"  {name}: {len(view)} samples, groups: {np.unique(group_ids)}")

Nested splits#

Splitting can be called on any existing split, in addition to directly on the parent FeatureSet. The nested split conditions will only draw from the samples available in the calling view.

This allows us to “nest” split conditions to create more complex modeling setups.

Note that the sub-splits will inherently overlap with the calling view, and care should be taken when using these splits in downstream modeling.

fs_charge.clear_splits()

fs_charge.split_by_condition(
    {
        "source": {
            "targets.soh.raw": GTE(90),
            "tags.group_id.raw": In([1, 2, 3]),
        },
        "test": {
            "targets.soh.raw": LT(90),
            "tags.group_id.raw": In([4, 5]),
        },
    },
)
fs_charge.get_split("source").split_random(
    ratios={"train": 0.8, "val": 0.2},
    stratify_by="group_id",
)

for name, view in fs_charge.splits.items():
    group_ids = view.get_tags(fmt="numpy", tags="group_id")
    print(f"  {name}: {len(view)} samples, groups: {np.unique(group_ids)}")

Manual split registration with filter() and add_split()#

You can build splits from arbitrary filters and register them manually with add_split(). This is useful when your partition logic doesn’t fit neatly into a ratio-based or condition-based split.

fs_charge.clear_splits()

train_view = fs_charge.filter(
    conditions={
        "tags.group_id.raw": [1, 2, 3],
        "targets.soh.raw": Lambda("lambda x: x >= 80.0"),
    },
    label="train",
)
test_view = fs_charge.filter(
    conditions={"tags.group_id.raw": [4, 5]},
    label="test",
)

# Register them as named splits
fs_charge.add_split(train_view)
fs_charge.add_split(test_view)

for name, view in fs_charge.splits.items():
    print(f"  {name}: {len(view)} samples")

Returning views directly with return_views=True#

Any split method can return the produced views directly by passing return_views=True. This is useful when you want immediate access to the views without going through get_split().

fs_charge.clear_splits()

split_views = fs_charge.split_random(
    ratios={"train": 0.6, "val": 0.2, "test": 0.2},
    group_by="group_id",
    seed=42,
    return_views=True,
)

for name, view in split_views.items():
    print(f"  {name}: {len(view)} samples")

Note that most ModularML core classes implement a .visualize() method. For FeatureSets, this displays a Mermaid diagram of all splits registered to the FeatureSet.

Note that you will need to install a Mermaid rendering extension for your IDE. I use “Markdown Preview Mermaid Support”.

fs_charge.visualize(show_overlaps=True)

Transforms and Scaling#

Apply preprocessing transforms to features or targets. Transforms are tracked and can be undone.

Several scalers are built into ModularML and are accessible via the Scaler.get_supported_scalers() command. You can also create custom scalers, as outlined below.

# List all available scalers
mml.supported_scalers

Let’s create a little utility to plot our voltages so we can verify our transforms:

import matplotlib.pyplot as plt


def plot_voltages(
    fs: FeatureSet,
    n_samples: int = 200,
    rep: str = "transformed",
    seed: int = 13,
):
    """
    Plot the 'voltage' feature contained in the FeatureSet.

    Each split will get its own panel.
    Colors by SOH (dark blue = high SOH, light blue = low SOH)

    Args:
        fs (FeatureSet): FeatureSet to use.
        n_samples (int, optional): The number of samples in `fs` that will
              get plotted. Defaults to 200.
        rep (str): The representation of the data to plot (e.g., "raw" or "transformed").
        seed (int, optional): A seed to ensure the same samples get plotted
              with repeated calls. Defaults to 13.

    """

    def order_splits(values: list[str]) -> list[str]:
        priority = {"train": 0, "val": 1, "test": 2}
        return sorted(values, key=lambda x: priority.get(x, 99))

    rng = np.random.default_rng(seed)
    scm = plt.cm.ScalarMappable(
        cmap=plt.cm.Blues,
        norm=plt.Normalize(vmin=50, vmax=100),
    )

    # Verify rep exists
    avail_reps = fs.collection._get_rep_keys(domain="features", key="voltage")
    if rep not in avail_reps:
        rep = "raw"

    # Create figure with panels for each split
    fig, axes = plt.subplots(
        figsize=(7, 2.5),
        ncols=fs.n_splits,
        sharex=True,
        sharey=True,
    )
    split_names = order_splits(fs.available_splits)
    for i, split_label in enumerate(split_names):
        # For each split, get all voltage features and SOH targets
        split_view = fs.get_split(split_label)
        voltages = np.squeeze(
            split_view.get_features(features="voltage", fmt="numpy", rep=rep),
        )
        sohs = np.squeeze(split_view.get_targets(targets="soh", fmt="numpy", rep="raw"))

        # Select n_samples
        sample_idxs = rng.choice(np.arange(0, len(voltages)), size=n_samples)
        for idx in sample_idxs:
            axes[i].plot(voltages[idx], color=scm.to_rgba(sohs[idx]))

        axes[i].set_title(split_label, fontsize=10)
        axes[i].set_xlabel("Time (s)", fontsize=10)
    axes[0].set_ylabel("Voltage (V)", fontsize=10)

    # Adjust main subplot area to leave space on the right for colorbar
    fig.tight_layout(pad=1)
    fig.subplots_adjust(right=0.85)

    # Add colorbar as a dedicated panel on the far right
    cbar_ax = fig.add_axes([0.87, 0.19, 0.02, 0.7])  # [left, bottom, width, height]
    cbar = fig.colorbar(scm, cax=cbar_ax)
    cbar.set_label("SOH (%)", fontsize=10)
    return fig, axes


fig, axes = plot_voltages(fs_charge, n_samples=200)
plt.show()

Applying a transform#

fit_transform() fits a scaler and stores the result as a "transformed" representation alongside the original "raw" data.

  • domain: "features" or "targets"

  • keys: which keys to transform (default: all in domain)

  • fit_to_split: fit only on this split’s data (prevents data leakage)

from modularml import Scaler

# Apply MinMaxScaler to voltage, fitted on training data only
fs_charge.fit_transform(
    scaler=Scaler("MinMaxScaler"),
    domain="features",
    keys="voltage",
    fit_to_split="train",
)

# Raw data is preserved - access both representations
raw = fs_charge["train"].get_features(fmt="numpy", features="voltage", rep="raw")
transformed = fs_charge["train"].get_features(
    fmt="numpy",
    features="voltage",
    rep="transformed",
)

fig, axes = plot_voltages(fs_charge, n_samples=200, rep="transformed")
plt.show()
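
Why fitting on the training split matters can be seen in a few lines of plain Python. This is a sketch of min-max scaling with made-up numbers; `fit_min_max` and `apply_min_max` are hypothetical helpers, not ModularML or scikit-learn functions.

```python
def fit_min_max(values):
    """Learn the scaling parameters from the given values only."""
    return min(values), max(values)


def apply_min_max(values, params):
    """Scale values with previously fitted parameters."""
    lo, hi = params
    return [(v - lo) / (hi - lo) for v in values]


train = [3.0, 3.5, 4.0]
test = [2.9, 4.2]          # contains values outside the training range

params = fit_min_max(train)            # fitted on train only
print(apply_min_max(train, params))    # [0.0, 0.5, 1.0]
print(apply_min_max(test, params))     # test values can fall outside [0, 1]
```

Test values landing slightly outside [0, 1] is the expected consequence of not letting test data influence the fitted parameters.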

Chaining transforms#

Multiple transforms can be applied sequentially. Each call transforms the current "transformed" representation.

# First undo, then chain: zero-start -> min-max
fs_charge.undo_all_transforms(domain="features")

fs_charge.fit_transform(
    scaler="PerSampleZeroStart",
    domain="features",
    keys="voltage",
    fit_to_split="train",
)

fig, axes = plot_voltages(fs_charge, n_samples=200, rep="transformed")
plt.show()

fs_charge.fit_transform(
    scaler="MinMaxScaler",
    domain="features",
    keys="voltage",
    fit_to_split="train",
)

fig, axes = plot_voltages(fs_charge, n_samples=200, rep="transformed")
plt.show()

Scaling targets#

from sklearn.preprocessing import MinMaxScaler

# You can also pass sklearn instances directly
fs_charge.fit_transform(
    scaler=MinMaxScaler(),
    domain="targets",
    keys="soh",
    fit_to_split="train",
)

soh_raw = fs_charge["test"].get_targets(fmt="numpy", targets="soh", rep="raw")
soh_scaled = fs_charge["test"].get_targets(
    fmt="numpy",
    targets="soh",
    rep="transformed",
)
print(f"SOH raw range:     [{soh_raw.min():.1f}, {soh_raw.max():.1f}]")
print(f"SOH scaled range:  [{soh_scaled.min():.3f}, {soh_scaled.max():.3f}]")

Undoing transforms#

# Undo the last *feature* transform (MinMaxScaler), keeping PerSampleZeroStart
# Note that the target transform (although more recent) is not inverted
fs_charge.undo_last_transform(domain="features", keys="voltage")

transformed = fs_charge["train"].get_features(
    fmt="numpy",
    features="voltage",
    rep="transformed",
)
print("After undo last:")
print(f"  min={transformed.min():.4f} (should be ~0.0)")
print(f"  max={transformed.max():.4f} (no longer bounded to 1.0)")

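# Re-fetch the targets to confirm the target transform is still applied
soh_scaled = fs_charge["test"].get_targets(fmt="numpy", targets="soh", rep="transformed")
print(f"SOH:  [{soh_scaled.min():.3f}, {soh_scaled.max():.3f}]")
# Undo all transforms (no domain specified, so every domain is reverted)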
fs_charge.undo_all_transforms()

# Verify: after undoing all, 'transformed' rep no longer exists
print(
    "Keys after undo:",
    fs_charge.get_all_keys(include_domain_prefix=True, include_rep_suffix=True),
)
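
Chaining and undoing transforms behaves like a stack: each transform is applied on top of the previous result, and undoing removes the most recent one first. The sketch below models this with a hypothetical `TransformStack` class operating on a flat list of numbers; it is a conceptual illustration, not ModularML's internal bookkeeping.

```python
class TransformStack:
    """Hypothetical record of applied transforms, undone in reverse order."""

    def __init__(self, raw):
        self.current = list(raw)
        self.history = []  # inverse functions, most recent last

    def apply(self, forward, inverse):
        self.current = [forward(v) for v in self.current]
        self.history.append(inverse)

    def undo_last(self):
        inverse = self.history.pop()
        self.current = [inverse(v) for v in self.current]


stack = TransformStack([2.0, 4.0, 6.0])
stack.apply(lambda v: v - 2.0, lambda v: v + 2.0)   # shift so the first value is 0
stack.apply(lambda v: v / 4.0, lambda v: v * 4.0)   # scale into [0, 1]
print(stack.current)   # [0.0, 0.5, 1.0]

stack.undo_last()      # reverts only the scaling step
print(stack.current)   # [0.0, 2.0, 4.0]

stack.undo_last()      # reverts the shift, recovering the raw values
print(stack.current)   # [2.0, 4.0, 6.0]
```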

Inverse-scaling external data#

Use unscale_data_for_cols() to apply inverse transforms to data that lives outside the FeatureSet (e.g., model predictions):

fs_charge.fit_transform(
    scaler=MinMaxScaler(),
    domain="targets",
    keys="soh",
    fit_to_split="train",
)

# Simulate model predictions in scaled space
fake_predictions = np.array([[0.5], [0.8], [0.2]])

# Inverse-transform back to original SOH scale
original_scale = fs_charge.unscale_data_for_cols(
    data=fake_predictions,
    domain="targets",
    columns="soh",
)
print(f"Scaled predictions:   {fake_predictions.ravel()}")
print(f"Original-scale SOH:   {original_scale.ravel()}")
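
For a min-max scaler, the inverse transform is simply the scaling formula solved for the original value. The sketch below works through that arithmetic; `min_max_inverse` is a hypothetical helper, and the SOH range of 60 to 100 is an invented example rather than a value taken from the data above.

```python
def min_max_inverse(scaled, data_min, data_max):
    """Invert x_scaled = (x - min) / (max - min)."""
    return [s * (data_max - data_min) + data_min for s in scaled]


# Hypothetical training-set SOH range, used here only for illustration
soh_min, soh_max = 60.0, 100.0

predictions_scaled = [0.5, 0.75, 0.25]
print(min_max_inverse(predictions_scaled, soh_min, soh_max))  # [80.0, 90.0, 70.0]
```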

While you can access the fitted scalers for a particular column and use them for unscaling (shown above), it is best practice to use the original FeatureSet, filter to the sample IDs from which your scaled data was produced, and then access the “transformed” version directly. This is the only way to fully guarantee that you are applying the correct scaler.

We’ll cover this in more depth in a separate notebook on working with model outputs and results (TODO: add notebook link).


Serialization (Save and Load)#

FeatureSets can be saved to disk and fully restored, including splits and transforms.

import matplotlib.pyplot as plt

# Re-apply splits and transforms before saving
fs_charge.clear_splits()
fs_charge.undo_all_transforms()

fs_charge.split_random(
    ratios={"train": 0.6, "val": 0.2, "test": 0.2},
    group_by="group_id",
    seed=42,
)

fs_charge.fit_transform(
    scaler="PerSampleZeroStart",
    domain="features",
    keys="voltage",
    fit_to_split="train",
)
fig, axes = plot_voltages(fs_charge, n_samples=200, rep="transformed")
plt.show()
fs_charge.fit_transform(
    scaler="MinMaxScaler",
    domain="features",
    keys="voltage",
    fit_to_split="train",
)
fig, axes = plot_voltages(fs_charge, n_samples=200, rep="transformed")
plt.show()
fs_charge.fit_transform(
    scaler="MinMaxScaler",
    domain="targets",
    keys="soh",
    fit_to_split="train",
)
from pathlib import Path
from tempfile import TemporaryDirectory

# Save to a temporary directory
SAVE_DIR = TemporaryDirectory()

save_path = fs_charge.save(Path(SAVE_DIR.name) / "fs_charge_demo", overwrite=True)
print(f"Saved to: {save_path}")

Now we can reload this FeatureSet.

Note that ModularML assigns all “nodes” in an Experiment a unique ID. This is important when we move to Experiments and ModelGraphs, but we can just ignore the collision warning for now.

# Reload from file
fs_rel = FeatureSet.load(save_path)

We can pick up exactly where we left off; all history is preserved.

This means we can undo the last transform.

fig, axes = plot_voltages(fs_rel, n_samples=200, rep="transformed")
plt.show()

fs_rel.undo_last_transform(domain="features")
fig, axes = plot_voltages(fs_rel, n_samples=200, rep="transformed")
plt.show()

Copying a FeatureSet#

The Node ID Collision warning appears because the saved FeatureSet carries an internal ID that already exists in the active context (it was saved from fs_charge). ModularML resolves this by assigning a new ID to the loaded FeatureSet; fs_rel and fs_charge are distinct objects.

To explicitly create an independent copy of a FeatureSet within the same session, use .copy().

By default, the copy shares the underlying PyArrow data buffer with the original (zero-copy). Setting share_raw_data_buffer=False produces a fully independent copy with no shared state.

Use restore_splits and restore_scalers to control whether the new copy inherits existing split definitions and fitted scalers from the original.

# Shallow copy with raw data only
fs_copy_raw = fs_charge.copy(label="CopyRawOnly", share_raw_data_buffer=True)
print(
    f"Copy (raw only) keys: {fs_copy_raw.get_all_keys(include_domain_prefix=True, include_rep_suffix=True)}",
)
print(f"Copy splits: {fs_copy_raw.available_splits}")
# Full copy with splits and scalers restored
fs_copy_full = fs_charge.copy(
    label="CopyFull",
    share_raw_data_buffer=False,
    restore_splits=True,
    restore_scalers=True,
    register=True,
)
print(
    f"Full copy keys: {fs_copy_full.get_all_keys(include_domain_prefix=True, include_rep_suffix=True)}",
)
print(f"Full copy splits: {fs_copy_full.available_splits}")
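
The difference between sharing a data buffer and duplicating it can be sketched with ordinary Python objects. The dicts below are hypothetical stand-ins for two copies of a container; they only illustrate what "shared" versus "independent" means and do not reflect how ModularML or PyArrow manage memory.

```python
import copy

raw_buffer = [[3.20, 3.21], [3.40, 3.41]]

# Shared: the new container points at the very same buffer object
shared = {"label": "Shared", "data": raw_buffer}

# Independent: the buffer is duplicated into separate memory
independent = {"label": "Independent", "data": copy.deepcopy(raw_buffer)}

print(shared["data"] is raw_buffer)        # True: no data was duplicated
print(independent["data"] is raw_buffer)   # False: separate memory
print(independent["data"] == raw_buffer)   # True: same values
```

Sharing is cheap but ties the two containers to one buffer, whereas the independent copy costs extra memory in exchange for having no shared state.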

References (for Model Graph Wiring)#

When connecting a FeatureSet to a ModelStage in a model graph, you create symbolic references rather than passing data directly.

Below is a quick overview, but more details are provided in the following notebook:

# Multi-column reference (used for ModelStage inputs)
ref = fs_charge.reference(features="voltage", targets="soh", rep="transformed")
print(f"Reference type: {type(ref).__name__}")
print(f"Reference: {ref}")
# Single-column reference (specify rep when multiple representations exist)
col_ref = fs_charge.column_reference(feature="voltage", rep="transformed")
print(f"Column reference type: {type(col_ref).__name__}")
print(f"Column reference: {col_ref}")

Summary#

| Task | Method |
| --- | --- |
| Create from dict | FeatureSet.from_dict(label, data, feature_keys, target_keys, tag_keys) |
| Create from DataFrame | FeatureSet.from_pandas(label, df, feature_cols, target_cols, tag_cols, group_by, sort_by) |
| Create from Arrow | FeatureSet.from_pyarrow_table(label, table) |
| Inspect keys | get_feature_keys(), get_target_keys(), get_tag_keys(), get_all_keys() |
| Inspect shapes/dtypes | get_feature_shapes(), get_feature_dtypes() |
| Get data | get_features(fmt=...), get_targets(fmt=...), get_tags(fmt=...) |
| Unified access | get_data(features=..., targets=..., tags=..., fmt=...) |
| Filter rows | filter(conditions={...}) |
| Subset by index | take(indices) |
| Select columns | select(features=..., targets=..., rep=...) |
| Split randomly | split_random(ratios, group_by, seed) |
| Split by condition | split_by_condition({split_name: {col: condition}}) |
| Apply transform | fit_transform(scaler, domain, keys, fit_to_split) |
| Undo transform | undo_last_transform(domain, keys) / undo_all_transforms() |
| Inverse-scale data | unscale_data_for_cols(data, domain, columns) |
| Save / Load | save(path) / FeatureSet.load(path) |
| Copy | copy(restore_splits, restore_scalers) |
| Create reference | reference(features, targets) / column_reference(feature) |