How to: Create and Use a FeatureSet#

The FeatureSet is the central data container in ModularML. It organizes your data into three domains:

  • Features: model inputs (e.g., time-series signals, sensor readings)

  • Targets: values to predict (e.g., state-of-health, capacity)

  • Tags: metadata for grouping and filtering (e.g., cell ID, temperature)

Under the hood, a FeatureSet wraps a SampleCollection, which stores all data in a columnar Apache Arrow table. Each column follows the naming convention <domain>.<key>.<representation> (e.g., features.voltage.raw). A SampleSchema tracks the structure, shapes, and data types.

This notebook covers the complete FeatureSet API:

import numpy as np
import pandas as pd

import modularml as mml
from modularml import FeatureSet
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[1], line 4
      1 import numpy as np
      2 import pandas as pd
----> 4 import modularml as mml
      5 from modularml import FeatureSet

File ~/checkouts/readthedocs.org/user_builds/modular-ml/envs/stable/lib/python3.10/site-packages/modularml/__init__.py:1
----> 1 from modularml.api import (
      2     AppliedLoss,
      3     BaseModel,
      4     Checkpointing,
      5     ConcatNode,
      6     CrossValidation,
      7     CVBinding,
      8     EarlyStopping,
      9     EvalLossMetric,
     10     EvalPhase,
     11     EvalResults,
     12     Experiment,
     13     ExperimentContext,
     14     FeatureSet,
     15     FeatureSetView,
     16     FitPhase,
     17     FitResults,
     18     InputBinding,
     19     Loss,
     20     ModelGraph,
     21     ModelNode,
     22     Optimizer,
     23     PhaseGroup,
     24     PhaseGroupResults,
     25     ResultRecording,
     26     Scaler,
     27     SimilarityCondition,
     28     TensorflowBaseModel,
     29     TorchBaseModel,
     30     TrainPhase,
     31     TrainResults,
     32     supported_scalers,
     33 )
     34 from modularml.registry import register_all
     36 register_all()

File ~/checkouts/readthedocs.org/user_builds/modular-ml/envs/stable/lib/python3.10/site-packages/modularml/api.py:4
      1 # ================================================
      2 # Experiment & Phases
      3 # ================================================
----> 4 from modularml.core.experiment.experiment import Experiment
      5 from modularml.core.experiment.experiment_context import ExperimentContext
      6 from modularml.core.experiment.phases.phase_group import PhaseGroup

File ~/checkouts/readthedocs.org/user_builds/modular-ml/envs/stable/lib/python3.10/site-packages/modularml/core/experiment/experiment.py:20
     11 from modularml.core.experiment.callbacks.experiment_callback import (
     12     ExperimentCallback,
     13 )
     14 from modularml.core.experiment.checkpointing import (
     15     EXPERIMENT_HOOKS,
     16     EXPERIMENT_NAME_TEMPLATE,
     17     EXPERIMENT_PLACEHOLDERS,
     18     Checkpointing,
     19 )
---> 20 from modularml.core.experiment.experiment_context import (
     21     ExperimentContext,
     22     RegistrationPolicy,
     23 )
     24 from modularml.core.experiment.phases.eval_phase import EvalPhase
     25 from modularml.core.experiment.phases.fit_phase import FitPhase

File ~/checkouts/readthedocs.org/user_builds/modular-ml/envs/stable/lib/python3.10/site-packages/modularml/core/experiment/experiment_context.py:11
      8 from typing import TYPE_CHECKING, Any
      9 from weakref import ref
---> 11 from matplotlib.pylab import Enum
     13 from modularml.utils.environment.environment import IN_NOTEBOOK
     14 from modularml.utils.logging.warnings import warn

ModuleNotFoundError: No module named 'matplotlib'

We’ll use synthetic battery pulse-response data throughout this notebook. Each sample contains a 101-point voltage time-series, a scalar state-of-health (SOH) target, and metadata tags.

N_SAMPLES = 1000
N_CELLS = 20
N_GROUPS = 5
TIME = np.linspace(0, 100, 101)

rng = np.random.default_rng(42)

# Assign each sample a cell, group, pulse type, and SOC
cell_ids = rng.integers(1, N_CELLS + 1, size=N_SAMPLES)
group_ids = rng.integers(1, N_GROUPS + 1, size=N_SAMPLES)
pulse_types = rng.choice(["chg", "dchg"], size=N_SAMPLES)
pulse_socs = rng.choice([10, 20, 30, 40, 50, 60, 70, 80, 90], size=N_SAMPLES)

# SOH degrades with group_id (higher group = more degraded)
soh = 100.0 - (group_ids - 1) * 8.0 + rng.normal(0, 2, size=N_SAMPLES)

# Synthetic voltage: baseline + pulse shape, shifted by SOC and degraded by SOH
voltage = np.zeros((N_SAMPLES, 101))
for i in range(N_SAMPLES):
    base = 3.2 + pulse_socs[i] / 100.0 * 0.5
    amplitude = 0.3 * (soh[i] / 100.0)
    sign = 1.0 if pulse_types[i] == "chg" else -1.0
    curve = sign * amplitude * (1 - np.exp(-TIME / 15.0))
    voltage[i] = base + curve + rng.normal(0, 0.002, size=101)

data = {
    "voltage": voltage.tolist(),
    "soh": soh.tolist(),
    "cell_id": cell_ids.tolist(),
    "group_id": group_ids.tolist(),
    "pulse_type": pulse_types.tolist(),
    "pulse_soc": pulse_socs.tolist(),
}

print(f"Samples: {N_SAMPLES}")
print(f"Voltage shape per sample: {voltage[0].shape}")
print(f"SOH range: [{soh.min():.1f}, {soh.max():.1f}]")

Creating a FeatureSet#

Three class methods are available for construction.

from_dict() — From a Python dictionary#

The most common constructor. Pass a dict where each key maps to a list/array of values (one entry per sample), then specify which keys are features, targets, and tags.

fs = FeatureSet.from_dict(
    label="PulseData",
    data=data,
    feature_keys="voltage",
    target_keys="soh",
    tag_keys=["cell_id", "group_id", "pulse_type", "pulse_soc"],
)
print(fs)
print(f"Feature shapes: {fs.get_feature_shapes()}")
print(f"Target shapes:  {fs.get_target_shapes()}")

When accessing FeatureSet data, you’ll notice that all keys are returned in the <domain>.<key>.<representation> by default. You can modify the the returned string with the include_rep_suffix and include_domain_prefix arguments in all FeatureSet.get_<> methods.

Note that certain string-component omissions will raise an error if it results in a non-unique key (e.g., you have two representations of the same feature column)

print(
    f"Feature shapes: {fs.get_feature_shapes(include_domain_prefix=False, include_rep_suffix=False)}",
)

from_pandas() — From a Pandas DataFrame#

Allows for FeatureSet structuring directly from a Pandas DataFrame. We similarly need to assign column names to features, targets, and tags.

# Create a simple DataFrame example
df = pd.DataFrame(
    {
        "temperature": np.random.default_rng(0).normal(25, 5, size=100),
        "humidity": np.random.default_rng(1).normal(60, 10, size=100),
        "output_power": np.random.default_rng(2).normal(100, 15, size=100),
        "site_id": np.repeat(["A", "B", "C", "D"], 25),
        "timestamp": np.arange(25).tolist() * 4,
    },
)

fs_from_df = FeatureSet.from_pandas(
    label="WeatherData",
    df=df,
    feature_cols=["temperature", "humidity"],
    target_cols="output_power",
    tag_cols="site_id",
)
print(fs_from_df)
print(f"Feature shapes: {fs_from_df.get_feature_shapes()}")

Note that the above approach treats every row in the dataframe as a unique sample for modeling.

If that’s not the case, grouping will need to be performed to aggregate rows in the Pandas dataframe belonging to each sample. The from_pandas constructor provides the group_by and sort_by arguments to do just that.

Below, we group all rows in our dataframe by the 'site_id' at which the data was measured, and then ensure all data points are sorted by 'time_stamp' within each sample. Notice how the feauture now have a shape of (25,).

fs_grouped = FeatureSet.from_pandas(
    label="WeatherGrouped",
    df=df,
    feature_cols=["temperature", "humidity"],
    target_cols="output_power",
    group_by="site_id",
    sort_by="timestamp",
    tag_cols=["site_id", "timestamp"],
)
print(fs_grouped)
print(f"Feature shapes: {fs_grouped.get_feature_shapes()}")

from_pyarrow_table() — From an Arrow table#

If you already have a pyarrow.Table with columns following the <domain>.<key>.<rep> naming convention, you can wrap it directly.

Unless you are certain that the existing table uses the appropriate schema, it is recommended to use table.to_pandas(), then use the from_pandas() constructor.

import pyarrow as pa

table = pa.table(
    {
        "features.x.raw": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
        "targets.y.raw": [0.5, 1.5],
        "tags.group.raw": ["a", "b"],
    },
)

fs_arrow = FeatureSet.from_pyarrow_table(label="ArrowExample", table=table)
print(fs_arrow)
print(f"Feature shapes: {fs_arrow.get_feature_shapes()}")

Inspecting a FeatureSet#

Use the following properties and methods to understand the structure of your data.

# Basic info
print(f"Label:      {fs.label}")
print(f"Samples:    {len(fs)}")
print(f"repr:       {fs!r}")
# Column keys by domain
print("Feature keys:", fs.get_feature_keys())
print("Target keys: ", fs.get_target_keys())
print("Tag keys:    ", fs.get_tag_keys())
# All keys with full qualification (domain prefix + rep suffix)
fs.get_all_keys(include_domain_prefix=True, include_rep_suffix=True)
# Shapes and dtypes
print("Feature shapes:", fs.get_feature_shapes())
print("Target shapes: ", fs.get_target_shapes())
print("Tag shapes:    ", fs.get_tag_shapes())
print()
print("Feature dtypes:", fs.get_feature_dtypes())
print("Target dtypes: ", fs.get_target_dtypes())

Note that most data containing classes in ModularML also support a summary() method.

Printing the results provides a formatted summary of all characteristics of that object.

print(fs.summary())

Accessing Data#

Data can be retrieved in multiple formats via the fmt parameter. Accepted values include "numpy", "pandas", "dict_numpy", "torch", and more.

Domain-specific accessors#

# Get all features as a dict of numpy arrays (default)
features = fs.get_features()
print(f"Type: {type(features)}")
print(f"Keys: {list(features.keys())}")
print(f"Voltage shape: {features['voltage'].shape}")
# Get a single feature by name, as numpy
voltage = fs.get_features(fmt="numpy", features="voltage")
print(f"Shape: {voltage.shape}")
# Get targets as a pandas DataFrame
targets_df = fs.get_targets(fmt="pandas")
targets_df.head()
# Get specific tags as a dict of numpy arrays
tags = fs.get_tags(fmt="dict_numpy", tags=["cell_id", "pulse_type"])
print(f"Type: {type(tags)}")
print(f"Cell IDs (first 5): {tags['cell_id'][:5]}")
print(f"Pulse types (first 5): {tags['pulse_type'][:5]}")

Unified accessor: get_data()#

Retrieve columns from multiple domains in a single call. Supports wildcards and a default rep parameter.

# Specific columns from different domains
result = fs.get_data(
    features="voltage",
    targets="soh",
    tags="*",
    fmt="pandas",
)
result.head()
# Sample UUIDs - each sample has a unique identifier
uuids = fs.get_sample_uuids()
print(f"First 3 UUIDs: {uuids[:3]}")

Export to pandas or Arrow#

# Full export to pandas
df_all = fs.to_pandas()
df_all
# Export to Arrow table
arrow_table = fs.to_arrow()
print(f"Arrow schema:\n{arrow_table.schema}")

Row Subsetting and Filtering#

All row-subsetting operations return a FeatureSetView - a lightweight, zero-copy window over the parent FeatureSet.

filter() — Condition-based filtering#

Conditions are a dict mapping fully-qualified column names to:

  • A scalar (equality match)

  • A list/set (membership test)

  • A callable (row-wise predicate)

# Filter by equality
view_chg = fs.filter(conditions={"tags.pulse_type.raw": "chg"})
print(f"Charge-only: {view_chg}")
print(np.unique(view_chg.get_tags(fmt="np", tags="pulse_type")))
# Filter by list membership
view_cells = fs.filter(conditions={"tags.cell_id.raw": [1, 2, 3]})
print(f"Cells 1-3: {view_cells}")
print(np.unique(view_cells.get_tags(fmt="np", tags="cell_id")))
# Filter with callable + multiple conditions (AND-composed)
view_healthy_chg = fs.filter(
    conditions={
        "tags.pulse_type.raw": "chg",
        "targets.soh.raw": lambda x: x >= 90.0,
    },
)
print(f"Healthy charge pulses: {view_healthy_chg}")

take() — By relative index#

view_first10 = fs.take([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], label="first_10")
print(view_first10)

take_sample_uuids() — By UUID#

some_uuids = fs.get_sample_uuids()[:5].tolist()
view_by_uuid = fs.take_sample_uuids(some_uuids, label="uuid_subset")
print(view_by_uuid)

Set operations between views#

# Intersection: samples in both views
view_a = fs.filter(conditions={"tags.pulse_type.raw": "chg"})
print("View A:")
print(" - view:", view_a)
print(" - cells:", np.unique(view_a.get_tags(fmt="np", tags="cell_id")))
print(" - pulse_types:", np.unique(view_a.get_tags(fmt="np", tags="pulse_type")))

view_b = fs.filter(conditions={"tags.cell_id.raw": [1, 2, 3]})
print("\nView B:")
print(" - view:", view_b)
print(" - cells:", np.unique(view_b.get_tags(fmt="np", tags="cell_id")))
print(" - pulse_types:", np.unique(view_b.get_tags(fmt="np", tags="pulse_type")))


view_intersect = view_a.take_intersection(view_b, label="chg_cells_1to3")
print("\nIntersection:")
print(" - view:", view_intersect)
print(" - cells:", np.unique(view_intersect.get_tags(fmt="np", tags="cell_id")))
print(
    " - pulse_types:",
    np.unique(view_intersect.get_tags(fmt="np", tags="pulse_type")),
)

# Difference: samples in A but not in B
view_diff = view_a.take_difference(view_b, label="chg_not_cells_1to3")
print("\nDifference:")
print(" - view:", view_diff)
print(" - cells:", np.unique(view_diff.get_tags(fmt="np", tags="cell_id")))
print(" - pulse_types:", np.unique(view_diff.get_tags(fmt="np", tags="pulse_type")))

We can also check view overlap via the is_disjoint_with method.

print("view_diff does not contain view_b samples: ", view_b.is_disjoint_with(view_diff))

Converting a view back to a FeatureSet#

A FeatureSetView is a lightweight reference of indices in the parent FeatureSet. Any modification to the FeatureSet with change the data access through its child views.

To create an independent FeatureSet from a view, use to_featureset():

fs_charge = view_chg.to_featureset(label="ChargePulses")
print(fs_charge)

Column Subsetting#

Use select() to create a view with only specific columns. Row indices are preserved.

Select supports the same wildcard usage as filter()

# Select specific features and targets
view_slim = fs.select(features="voltage.*", targets="soh")
print(f"Columns: {view_slim.get_all_keys()}")
# Select by domain and representation
view_raw_only = fs.select(features="voltage", rep="raw")
print(
    f"Columns: {view_raw_only.get_all_keys(include_domain_prefix=True, include_rep_suffix=True)}",
)

Splitting Data#

Splitting creates named FeatureSetView partitions that are registered (optional) on the parent FeatureSet.

Random splitting#

Random splitting takes a ratios argument, defining the proportions of all samples in the calling container to be assigned to each key. The ratio values must add up to 1.

By default, splits views are not returned and automatically registered to the parent FeatureSet. This behaviour can be specified via the return_views and register arguments.

fs_charge.clear_splits()
fs_charge.split_random(
    ratios={"train": 0.6, "val": 0.2, "test": 0.2},
    seed=42,
)

print(f"Available splits: {fs_charge.available_splits}")
for name, view in fs_charge.splits.items():
    print(f"  {name}: {len(view)} samples")

Use group_by to keep all samples sharing a tag value in the same split (prevents data leakage):

fs_charge.clear_splits()

fs_charge.split_random(
    ratios={"train": 0.5, "val": 0.3, "test": 0.2},
    group_by="group_id",
    seed=1,
)

for name, view in fs_charge.splits.items():
    group_ids = view.get_tags(fmt="numpy", tags="group_id")
    print(f"  {name}: {len(view)} samples, groups: {np.unique(group_ids)}")

Use stratify_by to ensure all splits have representative distributions of the calling source.

Note that grouping and stratification are mutually exclusive (you can’t use both at the same time).

fs_charge.clear_splits()

fs_charge.split_random(
    ratios={"train": 0.5, "val": 0.3, "test": 0.2},
    stratify_by="group_id",
    seed=1,
)

for name, view in fs_charge.splits.items():
    group_ids = view.get_tags(fmt="numpy", tags="group_id")
    print(f"  {name}: {len(view)} samples, groups: {np.unique(group_ids)}")

Condition-based splitting#

Assign samples to splits using explicit conditions on any column.

Condition-based splitting takes a condition argument, which is a nested dict of: {split_name: {column: condition}}.

Samples that satisfy all conditions within the named split are returned in the split. A warning will be thrown if the produces splits are not mutually exclusive.

fs_charge.clear_splits()

fs_charge.split_by_condition(
    {
        "train": {"tags.group_id.raw": [1, 2, 3]},
        "val": {"tags.group_id.raw": [4]},
        "test": {"tags.group_id.raw": [5]},
    },
)

for name, view in fs_charge.splits.items():
    group_ids = view.get_tags(fmt="numpy", tags="group_id")
    print(f"  {name}: {len(view)} samples, groups: {np.unique(group_ids)}")

Nested splits#

Splitting can be called on any existing split, in addition to directly on the parent FeatureSet. The nested split conditions will only draw from the samples available in the calling view.

This allows us to “nest” split conditions to create more complex modeling setups.

Note that the sub-splits will inherently overlap with the calling view, and care should be taking when using these splits in downstream modeling.

fs_charge.clear_splits()

fs_charge.split_by_condition(
    {
        "source": {
            "targets.soh": lambda soh: soh >= 90,
            "tags.group_id.raw": [1, 2, 3],
        },
        "test": {
            "targets.soh": lambda soh: soh < 90,
            "tags.group_id.raw": [4, 5],
        },
    },
)
fs_charge.get_split("source").split_random(
    ratios={"train": 0.8, "val": 0.2},
    stratify_by="group_id",
)

for name, view in fs_charge.splits.items():
    group_ids = view.get_tags(fmt="numpy", tags="group_id")
    print(f"  {name}: {len(view)} samples, groups: {np.unique(group_ids)}")

Getting splits directly with return_views#

fs_charge.clear_splits()

# Build custom views using any filter logic
train_view = fs_charge.filter(
    conditions={
        "tags.group_id.raw": [1, 2, 3],
        "targets.soh.raw": lambda x: x >= 80.0,
    },
    label="train",
)
test_view = fs_charge.filter(
    conditions={"tags.group_id.raw": [4, 5]},
    label="test",
)

# Register them as named splits
fs_charge.add_split(train_view)
fs_charge.add_split(test_view)

for name, view in fs_charge.splits.items():
    print(f"  {name}: {len(view)} samples")

Let’s restore a simple train/val/test for scaling

fs_charge.clear_splits()

split_views = fs_charge.split_random(
    ratios={"train": 0.6, "val": 0.2, "test": 0.2},
    group_by="group_id",
    seed=42,
    return_views=True,
)

for name, view in split_views.items():
    print(f"  {name}: {len(view)} samples")

Note that most ModularML core classes implement a .visualize() method. For FeatureSets, this displays a Mermaid diagram of all splits registered to the FeatureSet.

Note that you will need to install a Mermaid rendering extension for your IDE. I use “Markdown Preview Mermaid Support”.

fs_charge.visualize(show_overlaps=True)

Transforms and Scaling#

Apply preprocessing transforms to features or targets. Transforms are tracked and can be undone.

Several scalers are built into to ModularML and accessible via the Scaler.get_supported_scalers() command. You can also create custom scalers, as outlined below.

# List all available scalers
mml.supported_scalers

Let’s create a little utility to plot our voltages so we can verify our transforms:

import matplotlib.pyplot as plt


def plot_voltages(
    fs: FeatureSet,
    n_samples: int = 200,
    rep: str = "transformed",
    seed: int = 13,
):
    """
    Plot the 'voltage' feature contained in the FeatureSet.

    Each split will get its own panel.
    Colors by SOH (dark blue = high SOH, light blue = low SOH)

    Args:
        fs (FeatureSet): FeatureSet to use.
        n_samples (int, optional): The number of samples in `fs` that will
              get plotted. Defaults to 200.
        rep (str): The representation of the data to plot (eg, "raw" and "transformed")
        seed (int, optional): A seed to ensure the same samples get plotted
              with repeated calls. Defaults to 13.

    """

    def order_splits(values: list[str]) -> list[str]:
        priority = {"train": 0, "val": 1, "test": 2}
        return sorted(values, key=lambda x: priority.get(x, 99))

    rng = np.random.default_rng(seed)
    scm = plt.cm.ScalarMappable(
        cmap=plt.cm.Blues,
        norm=plt.Normalize(vmin=50, vmax=100),
    )

    # Verify rep exists
    avail_reps = fs.collection._get_rep_keys(domain="features", key="voltage")
    if rep not in avail_reps:
        rep = "raw"

    # Create figure with panels for each split
    fig, axes = plt.subplots(
        figsize=(7, 2.5),
        ncols=fs.n_splits,
        sharex=True,
        sharey=True,
    )
    split_names = order_splits(fs.available_splits)
    for i, split_label in enumerate(split_names):
        # For each split, get all voltage features and group_ids
        split_view = fs.get_split(split_label)
        voltages = np.squeeze(
            split_view.get_features(features="voltage", fmt="numpy", rep=rep),
        )
        sohs = np.squeeze(split_view.get_targets(targets="soh", fmt="numpy", rep="raw"))

        # Select n_samples
        sample_idxs = rng.choice(np.arange(0, len(voltages)), size=n_samples)
        for idx in sample_idxs:
            axes[i].plot(voltages[idx], color=scm.to_rgba(sohs[idx]))

        axes[i].set_title(split_label, fontsize=10)
        axes[i].set_xlabel("Time (s)", fontsize=10)
    axes[0].set_ylabel("Voltage (V)", fontsize=10)

    # Adjust main subplot area to leave space on the right for colorbar
    fig.tight_layout(pad=1)
    fig.subplots_adjust(right=0.85)

    # Add colorbar as a dedicated panel on the far right
    cbar_ax = fig.add_axes([0.87, 0.19, 0.02, 0.7])  # [left, bottom, width, height]
    cbar = fig.colorbar(scm, cax=cbar_ax)
    cbar.set_label("SOH (%)", fontsize=10)
    return fig, axes


fig, axes = plot_voltages(fs_charge, n_samples=200)
plt.show()

Applying a transform#

fit_transform() fits a scaler and stores the result as a "transformed" representation alongside the original "raw" data.

  • domain: "features" or "targets"

  • keys: which keys to transform (default: all in domain)

  • fit_to_split: fit only on this split’s data (prevents data leakage)

from modularml import Scaler

# Apply MinMaxScaler to voltage, fitted on training data only
fs_charge.fit_transform(
    scaler=Scaler("MinMaxScaler"),
    domain="features",
    keys="voltage",
    fit_to_split="train",
)

# Raw data is preserved - access both representations
raw = fs_charge["train"].get_features(fmt="numpy", features="voltage", rep="raw")
transformed = fs_charge["train"].get_features(
    fmt="numpy",
    features="voltage",
    rep="transformed",
)

fig, axes = plot_voltages(fs_charge, n_samples=200, rep="transformed")
plt.show()

Chaining transforms#

Multiple transforms can be applied sequentially. Each call transforms the current "transformed" representation.

# First undo, then chain: zero-start -> min-max
fs_charge.undo_all_transforms(domain="features")

fs_charge.fit_transform(
    scaler="PerSampleZeroStart",
    domain="features",
    keys="voltage",
    fit_to_split="train",
)

fig, axes = plot_voltages(fs_charge, n_samples=200, rep="transformed")
plt.show()

fs_charge.fit_transform(
    scaler="MinMaxScaler",
    domain="features",
    keys="voltage",
    fit_to_split="train",
)

fig, axes = plot_voltages(fs_charge, n_samples=200, rep="transformed")
plt.show()

Scaling targets#

from sklearn.preprocessing import MinMaxScaler

# You can also pass sklearn instances directly
fs_charge.fit_transform(
    scaler=MinMaxScaler(),
    domain="targets",
    keys="soh",
    fit_to_split="train",
)

soh_raw = fs_charge["test"].get_targets(fmt="numpy", targets="soh", rep="raw")
soh_scaled = fs_charge["test"].get_targets(
    fmt="numpy",
    targets="soh",
    rep="transformed",
)
print(f"SOH raw range:     [{soh_raw.min():.1f}, {soh_raw.max():.1f}]")
print(f"SOH scaled range:  [{soh_scaled.min():.3f}, {soh_scaled.max():.3f}]")

Undoing transforms#

# Undo the last *feature* transform (MinMaxScaler), keeping PerSampleZeroStart
# Note that the target transform (although more recent) is not inversed
fs_charge.undo_last_transform(domain="features", keys="voltage")

transformed = fs_charge["train"].get_features(
    fmt="numpy",
    features="voltage",
    rep="transformed",
)
print("After undo last:")
print(f"  min={transformed.min():.4f} (should be ~0.0)")
print(f"  max={transformed.max():.4f} (no longer bounded to 1.0)")

print(f"SOH:  [{soh_scaled.min():.3f}, {soh_scaled.max():.3f}]")
# Undo all transforms in a domain
fs_charge.undo_all_transforms()

# Verify: after undoing all, 'transformed' rep no longer exists
print(
    "Keys after undo:",
    fs_charge.get_all_keys(include_domain_prefix=True, include_rep_suffix=True),
)

Inverse-scaling external data#

Use unscale_data_for_cols() to apply inverse transforms to data that lives outside the FeatureSet (e.g., model predictions):

fs_charge.fit_transform(
    scaler=MinMaxScaler(),
    domain="targets",
    keys="soh",
    fit_to_split="train",
)

# Simulate model predictions in scaled space
fake_predictions = np.array([[0.5], [0.8], [0.2]])

# Inverse-transform back to original SOH scale
original_scale = fs_charge.unscale_data_for_cols(
    data=fake_predictions,
    domain="targets",
    columns="soh",
)
print(f"Scaled predictions:   {fake_predictions.ravel()}")
print(f"Original-scale SOH:   {original_scale.ravel()}")

While you can access the fit scalers on a particular column and use for unscaling (shown above), it is best practice to use the original FeatureSet, filter to the sample IDs on which your scaled data was produced, and then access the “transformed” version directly. This is the only way to fully guarantee that you are “applying” the correct scaler.

We’ll cover this more in depth in: TODO: $\textcolor{red}{\text{add notebook link to “working with model outputs / results”}}$


Serialization (Save and Load)#

FeatureSets can be saved to disk and fully restored, including splits and transforms.

import matplotlib.pyplot as plt

# Re-apply splits and transforms before saving
fs_charge.clear_splits()
fs_charge.undo_all_transforms()

fs_charge.split_random(
    ratios={"train": 0.6, "val": 0.2, "test": 0.2},
    group_by="group_id",
    seed=42,
)

fs_charge.fit_transform(
    scaler="PerSampleZeroStart",
    domain="features",
    keys="voltage",
    fit_to_split="train",
)
fig, axes = plot_voltages(fs_charge, n_samples=200, rep="transformed")
plt.show()
fs_charge.fit_transform(
    scaler="MinMaxScaler",
    domain="features",
    keys="voltage",
    fit_to_split="train",
)
fig, axes = plot_voltages(fs_charge, n_samples=200, rep="transformed")
plt.show()
fs_charge.fit_transform(
    scaler="MinMaxScaler",
    domain="targets",
    keys="soh",
    fit_to_split="train",
)
from pathlib import Path
from tempfile import TemporaryDirectory

# Save to temp file
SAVE_DIR = TemporaryDirectory()

save_path = fs_charge.save(Path(SAVE_DIR.name) / "fs_charge_demo", overwrite=True)
print(f"Saved to: {save_path}")

Now we can reload this FeatureSet.

Note that ModularML assign all “nodes” in an Experiment a unique ID. This is important when we move to Experiments and ModelGraphs, but we can just ignore the collision warning for now.

# Reload from file
fs_rel = FeatureSet.load(save_path)

We can pick up exactly where we left off; all history is preserved.

This means we can undo the last transform.

fig, axes = plot_voltages(fs_rel, n_samples=200, rep="transformed")
plt.show()

fs_rel.undo_last_transform(domain="features")
fig, axes = plot_voltages(fs_rel, n_samples=200, rep="transformed")
plt.show()

Copying a FeatureSet#

Remember that little Node Collision warning? Well it told us that the loaded FeatureSet was identical to the one in memory, and that instead of loading a copy, it just returned the existing FeatureSet.

That means those last undo_transforms call modified our original fs_charge FeatureSet too (they’re the same object).

While this is great for memory, there are times we’d want to create a copy of a FeatureSet. This is done with the .copy() method.

By default, this also shares internal buffers with the same underlying PyArrow table (i.e., its not a true copy). Setting share_raw_data_buffer=False ensures the new FeatureSet is fully independent of the original.

Note that we can also choose to restore splits and scalers, or just the raw data with the restore_splits and restore_scalers arguments.

# Shallow copy with raw data only
fs_copy_raw = fs_charge.copy(label="CopyRawOnly", share_raw_data_buffer=True)
print(
    f"Copy (raw only) keys: {fs_copy_raw.get_all_keys(include_domain_prefix=True, include_rep_suffix=True)}",
)
print(f"Copy splits: {fs_copy_raw.available_splits}")
# Full copy with splits and scalers restored
fs_copy_full = fs_charge.copy(
    label="CopyFull",
    share_raw_data_buffer=False,
    restore_splits=True,
    restore_scalers=True,
    register=True,
)
print(
    f"Full copy keys: {fs_copy_full.get_all_keys(include_domain_prefix=True, include_rep_suffix=True)}",
)
print(f"Full copy splits: {fs_copy_full.available_splits}")

References (for Model Graph Wiring)#

When connecting a FeatureSet to a ModelStage in a model graph, you create symbolic references rather than passing data directly.

Below is a quick overview, but more details are provided in the following notebook:

# Multi-column reference (used for ModelStage inputs)
ref = fs_charge.reference(features="voltage", targets="soh", rep="transformed")
print(f"Reference type: {type(ref).__name__}")
print(f"Reference: {ref}")
# Single-column reference (specify rep when multiple representations exist)
col_ref = fs_charge.column_reference(feature="voltage", rep="transformed")
print(f"Column reference type: {type(col_ref).__name__}")
print(f"Column reference: {col_ref}")

Summary#

Task

Method

Create from dict

FeatureSet.from_dict(label, data, feature_keys, target_keys, tag_keys)

Create from DataFrame

FeatureSet.from_pandas(label, df, feature_cols, target_cols, tag_cols, groupby_cols)

Create from Arrow

FeatureSet.from_pyarrow_table(label, table)

Inspect keys

get_feature_keys(), get_target_keys(), get_tag_keys(), get_all_keys()

Inspect shapes/dtypes

get_feature_shapes(), get_feature_dtypes()

Get data

get_features(fmt=...), get_targets(fmt=...), get_tags(fmt=...)

Unified access

get_data(features=..., targets=..., tags=..., fmt=...)

Filter rows

filter(conditions={...})

Subset by index

take(indices)

Select columns

select(features=..., targets=..., rep=...)

Split randomly

split_random(ratios, group_by, seed)

Split by condition

split_by_condition({split_name: {col: condition}})

Apply transform

fit_transform(scaler, domain, keys, fit_to_split)

Undo transform

undo_last_transform(domain, keys) / undo_all_transforms()

Inverse-scale data

unscale_data_for_cols(data, domain, columns)

Save / Load

save(path) / FeatureSet.load(path)

Copy

copy(restore_splits, restore_scalers)

Create reference

reference(features, targets) / column_reference(feature)