Segmenter & Dataset

The Segmenter creates time-locked segments from events, pairs them with extractors, and produces a SegmentDataset — a standard PyTorch Dataset ready for a DataLoader.

What is a Segment?

A Segment is a time window defined by a start and a duration. It holds references to all events that overlap the window, plus an optional trigger event that anchored it.

import pandas as pd

import neuralset as ns

tl = "sub-01_run-01"
events = ns.events.standardize_events(
    pd.DataFrame(
        [
            dict(type="Stimulus", start=0, duration=60, timeline=tl, code=0),
            dict(type="Word", start=5.0, duration=0.3, text="hello", timeline=tl),
            dict(type="Word", start=10.0, duration=0.4, text="world", timeline=tl),
            dict(type="Word", start=15.0, duration=0.3, text="foo", timeline=tl),
        ]
    )
)
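The "events that overlap" membership rule can be sketched as a standard interval-intersection test. The helpers below are illustrative, not part of the neuralset API:

```python
# Sketch of how a segment collects the events that overlap its window.
# `overlaps` and `events_in_segment` are illustrative helpers, not neuralset API.
def overlaps(seg_start, seg_duration, ev_start, ev_duration):
    """True if the event's interval intersects the segment's window."""
    return seg_start < ev_start + ev_duration and ev_start < seg_start + seg_duration

events = [
    {"type": "Stimulus", "start": 0.0, "duration": 60.0},
    {"type": "Word", "start": 5.0, "duration": 0.3},
    {"type": "Word", "start": 10.0, "duration": 0.4},
]

def events_in_segment(seg_start, seg_duration, events):
    return [
        ev for ev in events
        if overlaps(seg_start, seg_duration, ev["start"], ev["duration"])
    ]

# A 2-second window at t=4.5 overlaps the Stimulus and the first Word only.
hits = events_in_segment(4.5, 2.0, events)
```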

Trigger-based Segmentation

Use list_segments() with a triggers mask to create one segment per matching event. start is relative to the trigger:

raw_segments = ns.segments.list_segments(
    events,
    triggers=events.type == "Word",
    start=-0.5,
    duration=2.0,
)
print(f"{len(raw_segments)} segments created")
for seg in raw_segments:
    print(f"  {repr(seg)}")
3 segments created
  Segment(start=4.5, duration=2.0, timeline='sub-01_run-01', _trigger_idx=1)
  Segment(start=9.5, duration=2.0, timeline='sub-01_run-01', _trigger_idx=2)
  Segment(start=14.5, duration=2.0, timeline='sub-01_run-01', _trigger_idx=3)
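The arithmetic behind those windows is simple: one segment per matching event, with the window start offset from the trigger's own start. A minimal re-implementation sketch (illustrative, not the neuralset internals):

```python
# One segment per trigger event; window start = trigger start + offset.
events = [
    {"type": "Stimulus", "start": 0.0},
    {"type": "Word", "start": 5.0},
    {"type": "Word", "start": 10.0},
    {"type": "Word", "start": 15.0},
]

offset, duration = -0.5, 2.0
segments = [
    {"start": ev["start"] + offset, "duration": duration, "trigger_idx": i}
    for i, ev in enumerate(events)
    if ev["type"] == "Word"
]
# starts 4.5, 9.5, 14.5 and trigger indices 1, 2, 3 — matching the output above
```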

The Segmenter

Segmenter brings together extractors and segmentation parameters into a single config. apply(events) creates segments and returns a SegmentDataset:

segmenter = ns.dataloader.Segmenter(
    extractors={
        "pulse": {"name": "Pulse", "event_types": "Word", "aggregation": "trigger"}
    },
    trigger_query="type == 'Word'",
    start=-0.5,
    duration=2.0,
)
dataset = segmenter.apply(events)
print(f"Dataset: {len(dataset)} segments")
Dataset: 3 segments

Key parameters:

Parameter         Description
extractors        dict of name → extractor
trigger_query     pandas query selecting trigger events
start             offset relative to trigger (negative = before)
duration          segment length; None = use trigger event duration
drop_incomplete   remove segments missing events for any extractor

Accessing Data

Each dataset[i] returns a Batch with two fields:

  • .data — a dict mapping extractor names to tensors

  • .segments — the list of segments for this item

item = dataset[0]
print(f"data keys: {list(item.data.keys())}")
print(f"pulse shape: {item.data['pulse'].shape}")
data keys: ['pulse']
pulse shape: torch.Size([1, 1])
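The shape of that container can be sketched with a plain dataclass holding the same two fields (an illustrative stand-in, not neuralset's own Batch class):

```python
# Sketch of the Batch container: a dict of per-extractor data plus the
# segments that produced it. Illustrative dataclass, not the neuralset API.
from dataclasses import dataclass

@dataclass
class Batch:
    data: dict      # extractor name -> array/tensor
    segments: list  # one Segment per item

item = Batch(data={"pulse": [[1.0]]}, segments=["segment-0"])
```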

Batching with DataLoader

Use a standard PyTorch DataLoader with the dataset’s collate_fn to properly merge items into batches:

from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=2, collate_fn=dataset.collate_fn)
for batch in loader:
    print(f"Batch pulse shape: {batch.data['pulse'].shape}")
    print(f"Batch segments: {len(batch.segments)}")
    break
Batch pulse shape: torch.Size([2, 1])
Batch segments: 2
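What a collate function must do can be sketched with plain lists: merge each data key across items along a new leading batch dimension and concatenate the segment lists. The real dataset.collate_fn does the tensor equivalent:

```python
# Sketch of collation: stack per-key data across items (batch dimension
# first) and concatenate segment lists. Plain-list stand-in for tensors.
def collate(items):
    keys = items[0]["data"].keys()
    return {
        "data": {k: [it["data"][k] for it in items] for k in keys},
        "segments": [seg for it in items for seg in it["segments"]],
    }

batch = collate([
    {"data": {"pulse": [1.0]}, "segments": ["s0"]},
    {"data": {"pulse": [2.0]}, "segments": ["s1"]},
])
# batch["data"]["pulse"] has length 2, mirroring the [2, 1] tensor shape above
```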

Subselection

.select() accepts integer indices or boolean masks to create a subset of the dataset:

sub = dataset.select([0, 2])
print(f"Selected: {len(sub)} segments")

sub = dataset.select(dataset.triggers.text == "hello")
print(f"Segments for 'hello': {len(sub)}")
Selected: 2 segments
Segments for 'hello': 1
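The two selection modes behave like standard indexing: integer indices pick positions, a boolean mask keeps positions where it is True. An illustrative sketch, not the neuralset internals:

```python
# Sketch of select(): integer indices pick items; a same-length boolean
# mask keeps items where the mask is True.
def select(items, sel):
    if all(isinstance(x, bool) for x in sel):  # boolean mask
        return [it for it, keep in zip(items, sel) if keep]
    return [items[i] for i in sel]             # integer indices

texts = ["hello", "world", "foo"]
sub_idx = select(texts, [0, 2])
sub_mask = select(texts, [t == "hello" for t in texts])
```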

Strided Segments

Use stride to create regularly spaced segments tiled across the recording, rather than one segment per event. Some windows land where no Word events exist, so drop_incomplete=True silently discards those segments:

strided_segmenter = ns.dataloader.Segmenter(
    extractors={"pulse": {"name": "Pulse", "event_types": "Word", "aggregation": "sum"}},
    trigger_query="type == 'Stimulus'",
    start=0.0,
    duration=5.0,
    stride=2.5,
    drop_incomplete=True,
)
strided_dataset = strided_segmenter.apply(events)
print(f"Strided dataset: {len(strided_dataset)} segments")
Strided dataset: 6 segments
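The count of 6 can be reproduced by hand: tile 5-second windows at a 2.5-second stride across the 60-second Stimulus, then keep only windows that overlap at least one Word. An illustrative sketch of that logic:

```python
# Sketch of strided windowing with drop_incomplete: tile fixed windows at a
# regular stride, keep only those overlapping a Word event. Illustrative only.
words = [(5.0, 5.3), (10.0, 10.4), (15.0, 15.3)]  # (start, end) of each Word

duration, stride, total = 5.0, 2.5, 60.0
windows = []
start = 0.0
while start + duration <= total:
    windows.append((start, start + duration))
    start += stride

def overlaps_any(w_start, w_end, intervals):
    return any(w_start < e and s < w_end for s, e in intervals)

kept = [w for w in windows if overlaps_any(w[0], w[1], words)]
# 6 windows survive, matching the output above
```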

Validation

The Segmenter validates that:

  • trigger_query matches at least one event

  • All extractors have matching event types in the DataFrame

  • drop_incomplete removes segments where an extractor has no matching events

Use find_incomplete_segments() to inspect which segments would be dropped before building the dataset.

Memory Considerations

Data is loaded lazily in __getitem__ — only the requested segment’s events are passed through the extractors at access time. For large datasets:

  • Call extractor.prepare(events) to precompute and cache heavy operations before iteration

  • Set infra.keep_in_ram=True on extractors for frequently accessed data

  • Use drop_unused_events=True (default in Segmenter) to reduce the events DataFrame size

  • Use drop_incomplete=True to skip segments that would fail due to missing event types
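The lazy-access and prepare/cache pattern behind the first two bullets can be sketched as follows; the extractor class and its method names here are hypothetical, not the neuralset API:

```python
# Sketch of lazy extraction with an optional precomputed cache: nothing
# heavy happens until a segment is accessed, and prepare() runs once up
# front. Hypothetical extractor, not the neuralset API.
class PulseExtractorSketch:
    def __init__(self):
        self._cache = None

    def prepare(self, events):
        # Precompute once; later extract() calls reuse the cached subset.
        self._cache = [ev for ev in events if ev["type"] == "Word"]

    def extract(self, segment, events):
        source = self._cache if self._cache is not None else events
        s, e = segment["start"], segment["start"] + segment["duration"]
        return sum(1 for ev in source if s <= ev["start"] < e)

events = [{"type": "Word", "start": t} for t in (5.0, 10.0, 15.0)]
ex = PulseExtractorSketch()
ex.prepare(events)
count = ex.extract({"start": 4.5, "duration": 2.0}, events)
# count == 1: only the word at t=5.0 falls in [4.5, 6.5)
```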

Next Steps

  • Compose studies and transforms with chains: Chains
