Philosophy¶
Design principles behind neuralset.
Pydantic Everywhere¶
neuralset uses pydantic as the backbone for
every configurable object: Events, Studies, Extractors, Segmenters, and
Transforms are all BaseModel subclasses.
This gives you:

- Typed configs — every parameter is annotated; your IDE auto-completes and type-checks.
- Validation — pydantic validates inputs at construction time. Pass a string where an `int` is expected and you get a clear error, not a silent bug downstream.
- Serialization — any config can round-trip to/from a dictionary, JSON, or YAML. This makes experiment configs easy to log, reproduce, and share.
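A minimal sketch of the validation principle using plain pydantic (the `SegmentConfig` class here is hypothetical, not a real neuralset class): a wrong type fails loudly at construction time rather than surfacing later as a silent bug.

```python
from pydantic import BaseModel, ValidationError

class SegmentConfig(BaseModel):
    # Hypothetical parameters, for illustration only.
    duration: float   # seconds
    n_channels: int

cfg = SegmentConfig(duration=2.0, n_channels=64)   # valid: types check out

raised = False
try:
    # A string where an int is expected raises immediately, naming the field.
    SegmentConfig(duration=2.0, n_channels="sixty-four")
except ValidationError:
    raised = True
```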
Because Events are pydantic models, you can create them from dicts
(Event.from_dict(d)), convert back (event.to_dict()), and store them
in a pandas DataFrame where each row is a validated event.
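The round-trip idea can be sketched with pydantic v2's built-in methods (`model_validate` / `model_dump`, which neuralset's `from_dict` / `to_dict` presumably wrap; the `Event` fields below are invented for illustration):

```python
import pandas as pd
from pydantic import BaseModel

class Event(BaseModel):
    # Simplified stand-in for neuralset's Event; fields are hypothetical.
    onset: float
    duration: float
    kind: str

rows = [
    {"onset": 0.0, "duration": 1.5, "kind": "image"},
    {"onset": 2.0, "duration": 1.0, "kind": "sound"},
]
events = [Event.model_validate(r) for r in rows]      # dict -> validated model
df = pd.DataFrame([e.model_dump() for e in events])   # model -> dict -> DataFrame row
```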
exca — Caching and Infrastructure¶
neuralset relies on exca for
caching expensive computation and for submitting work to compute clusters.
What caching does. When an extractor’s prepare(events) is called, it
runs _get_data() for every event and stores the result on disk. Subsequent
calls read from cache instead of recomputing. This is critical for large
datasets where feature extraction (e.g. running a HuggingFace model on
every image) would otherwise dominate wall-clock time.
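The per-event caching idea can be sketched as follows (an illustrative toy, not exca's actual machinery): compute each event's features once, write them to disk keyed on the event's contents, and reuse them on later calls.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

CACHE = Path(tempfile.mkdtemp())   # stand-in for the configured cache folder

def _get_data(event: dict) -> list[float]:
    # Stand-in for an expensive per-event computation (e.g. a model forward pass).
    return [event["onset"] * 2.0]

def prepare(events: list[dict]) -> None:
    for ev in events:
        # Derive a stable key from the event's contents.
        key = hashlib.sha1(repr(sorted(ev.items())).encode()).hexdigest()
        path = CACHE / f"{key}.pkl"
        if not path.exists():                      # only compute on a cache miss
            path.write_bytes(pickle.dumps(_get_data(ev)))

events = [{"onset": 1.0}, {"onset": 2.0}]
prepare(events)   # computes and writes both results
prepare(events)   # second call finds everything on disk; nothing recomputed
```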
What prepare() triggers. The Segmenter.apply() method calls
prepare() on all extractors automatically. You can also call it manually
on individual extractors.
Infrastructure configs. The infra parameter on extractors, studies,
and chain steps controls where and how computation runs:
- `folder` — where cached results are stored on disk.
- `keep_in_ram` — whether to also keep results in memory for fast access.
- `cluster` — the compute backend: `None` (in-process), `"local"` (subprocess), or a Slurm configuration for cluster submission.
See Caching & Cluster Execution for configuration details and examples, or the exca documentation for the full API reference.
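A hypothetical configuration fragment combining the three parameters described above (exact field names and accepted values may differ; see the linked docs):

```python
# Assumed infra settings, for illustration only.
infra = {
    "folder": "/tmp/neuralset_cache",  # where cached results live on disk
    "keep_in_ram": True,               # also keep results in memory
    "cluster": None,                   # run in-process; "local" or a Slurm config otherwise
}
```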
Lazy Loading¶
neuralset keeps everything lightweight until you actually need the data:
- A Study's `run()` returns a pandas DataFrame of event metadata — no raw signals are loaded.
- Transforms modify this DataFrame (adding columns, filtering rows) without touching raw data.
- The Segmenter defines time windows and assigns extractors, but still doesn't load data.
- Actual data loading and feature extraction happen lazily in `dataset.__getitem__()` — i.e. when you iterate through a DataLoader or call `dataset[i]`.
This means you can configure a full pipeline, inspect events, filter
segments, and validate shapes without waiting for heavy I/O or GPU
computation. When you’re ready, prepare() precomputes everything in one
pass.
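The lazy-loading pattern can be sketched with a generic map-style dataset (a plain class standing in for neuralset's actual Dataset): construction stores only metadata, and the expensive load happens inside `__getitem__`.

```python
import pandas as pd

class LazyDataset:
    def __init__(self, events: pd.DataFrame):
        self.events = events   # metadata only; no signals touched yet
        self.loads = 0         # count how often heavy loading actually runs

    def _load(self, row) -> list[float]:
        # Stand-in for heavy I/O or feature extraction.
        self.loads += 1
        return [row.onset, row.onset + row.duration]

    def __getitem__(self, i: int) -> list[float]:
        return self._load(self.events.iloc[i])

    def __len__(self) -> int:
        return len(self.events)

ds = LazyDataset(pd.DataFrame({"onset": [0.0, 2.0], "duration": [1.0, 1.0]}))
assert ds.loads == 0   # configuring and inspecting costs nothing
x = ds[0]              # loading happens only here
assert ds.loads == 1
```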
Modularity¶
The neuralset pipeline is built from four composable building blocks:
| Component | Role |
|---|---|
| Study | Interface to an external dataset: download, iterate timelines, produce events |
| Transform | Modifies the events DataFrame (filter, split, enrich) |
| Extractor | Converts events in a time window into a tensor |
| Segmenter | Creates time-locked segments and wires extractors to a Dataset |
Each component can be used independently — you can call an extractor on its own without a Segmenter, or apply a transform without a Chain. When composed, they form a reproducible pipeline where each step is independently cacheable and configurable.
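The four roles above can be sketched conceptually with plain functions (standing in for the real Study/Transform/Extractor/Segmenter classes, whose actual signatures differ):

```python
import pandas as pd

def study() -> pd.DataFrame:                       # Study: produce events
    return pd.DataFrame({"onset": [0.0, 2.0, 4.0], "kind": ["a", "b", "a"]})

def transform(df: pd.DataFrame) -> pd.DataFrame:   # Transform: filter/enrich rows
    return df[df["kind"] == "a"].reset_index(drop=True)

def extractor(row) -> list[float]:                 # Extractor: event -> "tensor"
    return [row.onset, row.onset + 1.0]

def segmenter(df: pd.DataFrame) -> list[list[float]]:  # Segmenter: wire it together
    return [extractor(r) for r in df.itertuples()]

# Compose: each step is independently usable and swappable.
segments = segmenter(transform(study()))
print(segments)  # [[0.0, 1.0], [4.0, 5.0]]
```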