DataPipeline
- final class fairseq2.data.DataPipeline[source]
Bases: object
fairseq2 native data pipeline.
The pipeline state can be persisted to disk, allowing it to be resumed later. It is a Python Iterable, but it also holds the iterator state.
Calling iter() twice creates two iterators that read from the same data loader and share the same state, so iteration will behave inconsistently.
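For example, a minimal sketch of persisting and resuming the position; it assumes read_sequence() and and_return() from fairseq2.data (not part of this class) to build a small pipeline:

    # Assumption: read_sequence()/and_return() from fairseq2.data build a DataPipeline.
    from fairseq2.data import read_sequence

    pipeline = read_sequence(list(range(10))).and_return()

    it = iter(pipeline)
    print(next(it), next(it))      # consume the first two examples

    state = pipeline.state_dict()  # snapshot the current position

    # Later, e.g. after a restart: rebuild the same pipeline and resume.
    resumed = read_sequence(list(range(10))).and_return()
    resumed.load_state_dict(state)
    print(list(resumed))           # the remaining examples, 2 through 9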
- __iter__()[source]
Return an iterator over the examples in the data pipeline.
The iterator will modify the internal state of this DataPipeline, so it is not safe to have several iterators over the same DataPipeline.
- static concat(pipelines)[source]
Concatenate examples from pipelines.
- Parameters:
pipelines (Sequence[DataPipeline]) – The data pipelines to concatenate.
- Return type:
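For example, a sketch assuming read_sequence() and and_return() from fairseq2.data, and that concat() returns a builder finalized with and_return():

    from fairseq2.data import DataPipeline, read_sequence

    # Assumption: read_sequence()/and_return() build the finite input pipelines.
    first = read_sequence([1, 2, 3]).and_return()
    second = read_sequence([4, 5]).and_return()

    # Yields the examples of `first`, then the examples of `second`.
    pipeline = DataPipeline.concat([first, second]).and_return()

    print(list(pipeline))  # [1, 2, 3, 4, 5]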
- static constant(example, key=None)[source]
Repeatedly yield example.
This pipeline is pseudo-infinite; when used with functions that combine pipelines (e.g. sample, round_robin, zip), it will yield examples only as long as other pipelines yield examples.
See Pseudo-infinite and Infinite Pipelines for more details.
- Parameters:
- Return type:
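For example, a sketch of the pseudo-infinite behaviour when combined with zip() (assuming read_sequence() and and_return() from fairseq2.data):

    from fairseq2.data import DataPipeline, read_sequence

    sentences = read_sequence(["hello", "world"]).and_return()

    # Pseudo-infinite: yields "en" only as long as `sentences` yields examples.
    lang = DataPipeline.constant("en").and_return()

    pipeline = DataPipeline.zip([sentences, lang]).and_return()
    print(list(pipeline))  # [["hello", "en"], ["world", "en"]]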
- static count(start=0, step=1, key=None)[source]
Count from start in steps of size step.
This pipeline is pseudo-infinite; when used with functions that combine pipelines (e.g. sample, round_robin, zip), it will yield examples only as long as other pipelines yield examples.
See Pseudo-infinite and Infinite Pipelines for more details.
- Parameters:
- Return type:
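For example, a sketch that pairs a counter with a finite pipeline via zip() (same assumptions as above):

    from fairseq2.data import DataPipeline, read_sequence

    texts = read_sequence(["foo", "bar", "baz"]).and_return()

    # Pseudo-infinite counter 1, 2, 3, ...; it stops when `texts` stops.
    line_numbers = DataPipeline.count(start=1).and_return()

    pipeline = DataPipeline.zip([line_numbers, texts], names=["line_no", "text"]).and_return()
    print(list(pipeline))
    # [{"line_no": 1, "text": "foo"}, {"line_no": 2, "text": "bar"}, {"line_no": 3, "text": "baz"}]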
- load_state_dict(state_dict)[source]
Restore the state of the data pipeline from state_dict.
- Parameters:
state_dict (Mapping[str, Any]) – A state dictionary previously returned by state_dict().
- reset(reset_rng=False)[source]
Move back to the first example in the data pipeline.
- Parameters:
reset_rng (bool) – If True, resets all random number generators in the pipeline.
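For example, a sketch of re-reading a pipeline for a second epoch (assuming read_sequence() and and_return() from fairseq2.data):

    from fairseq2.data import read_sequence

    pipeline = read_sequence([1, 2, 3]).and_return()

    print(list(pipeline))  # first pass: [1, 2, 3]

    pipeline.reset()       # move back to the first example
    print(list(pipeline))  # second pass: [1, 2, 3]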
- static round_robin(pipelines, stop_at_shortest=False, allow_repeats=True)[source]
Extract examples from pipelines in round-robin order.
- Parameters:
pipelines (Sequence[DataPipeline]) – The data pipelines to round robin.
stop_at_shortest (bool) – If True, stops round_robin when the first pipeline reaches its end.
allow_repeats (bool) – If True, circles around finished pipelines until all pipelines reach their end. If False, does not repeat pipelines that have reached their end.
- Return type:
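For example, a sketch under the same assumptions; the interleaving shown in the comment is illustrative:

    from fairseq2.data import DataPipeline, read_sequence

    p1 = read_sequence([1, 2, 3, 4]).and_return()
    p2 = read_sequence([10, 20]).and_return()

    # Alternate between the pipelines; with the default allow_repeats=True the
    # shorter pipeline wraps around until the longer one is exhausted,
    # e.g. [1, 10, 2, 20, 3, 10, 4, 20].
    pipeline = DataPipeline.round_robin([p1, p2]).and_return()
    print(list(pipeline))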
- static sample(pipelines, weights=None, seed=None, allow_repeats=True)[source]
Extract examples from pipelines by sampling based on weights. Circles around pipelines until all have reached their end at least once.
- Parameters:
pipelines (Sequence[DataPipeline]) – The data pipelines to sample from.
weights (Sequence[float] | None) – Desired distribution over the pipelines. If None, uses a uniform distribution.
allow_repeats (bool) – If True, circles around finished pipelines until all pipelines reach their end. If False, does not repeat pipelines that have reached their end.
- Return type:
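For example, a sketch of weighted sampling from two pipelines (same assumptions as above):

    from fairseq2.data import DataPipeline, read_sequence

    p1 = read_sequence([1, 2, 3]).and_return()
    p2 = read_sequence(["a", "b", "c"]).and_return()

    # Draw roughly 70% of examples from p1 and 30% from p2; a fixed seed makes
    # the draw order reproducible.
    pipeline = DataPipeline.sample([p1, p2], weights=[0.7, 0.3], seed=2).and_return()

    for example in pipeline:
        print(example)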
- state_dict(strict=True)[source]
Return a dictionary containing the state of the data pipeline.
The current position of the data pipeline can be restored by passing the returned state dictionary to load_state_dict().
- static zip(pipelines, names=None, zip_to_shortest=False, flatten=False, disable_parallelism=False)[source]
Zip together examples read from pipelines.
- Parameters:
pipelines (Sequence[DataPipeline]) – The data pipelines to zip.
names (Sequence[str] | None) – The names to assign to the data pipelines. If None, yields examples as lists.
zip_to_shortest (bool) – If True, stops yielding examples after the shortest pipeline terminates. Otherwise, all pipelines (that are not pseudo-infinite) must have the same number of examples.
flatten (bool) – If True, flattens examples from each pipeline into one dictionary or list. All pipelines must return the same type (dict or non-dict).
disable_parallelism (bool) – If True, calls each data pipeline sequentially.
- Return type:
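For example, a sketch that zips two equally sized pipelines into named fields (same assumptions as above):

    from fairseq2.data import DataPipeline, read_sequence

    sources = read_sequence(["s1", "s2"]).and_return()
    targets = read_sequence(["t1", "t2"]).and_return()

    # With `names`, each example is a dictionary keyed by the given names.
    pipeline = DataPipeline.zip([sources, targets], names=["source", "target"]).and_return()

    print(list(pipeline))
    # [{"source": "s1", "target": "t1"}, {"source": "s2", "target": "t2"}]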
- property is_broken: bool
Return True if the data pipeline is broken.
If True, any future operation on this data pipeline will raise a DataPipelineError.
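For example, a hedged sketch of checking the property after a failed iteration; whether a particular error actually leaves the pipeline broken is not asserted here, and map() is assumed from the fairseq2.data builder API:

    from fairseq2.data import read_sequence

    pipeline = read_sequence([1, 2, 3]).map(lambda x: x / 0).and_return()

    try:
        list(pipeline)  # the map function raises, so iteration fails
    except Exception:
        pass

    if pipeline.is_broken:
        # Any further operation would raise DataPipelineError; rebuild instead.
        pipeline = read_sequence([1, 2, 3]).and_return()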