fairseq2.data

fairseq2.data provides a Python API to build a C++ DataPipeline.

Because the pipeline itself runs in C++, the data loader can use several worker threads, side-stepping the limitations of Python's Global Interpreter Lock (GIL) and typically outperforming a pure Python data loader.

Building a DataPipeline looks like this:

from fairseq2.data import text

data = (
    text.read_text("file.tsv")
    # Keep the second tab-separated column, lowercased.
    .map(lambda x: str(x.split("\t")[1]).lower())
    # Drop lines of 10 or more characters.
    .filter(lambda x: len(x) < 10)
    .and_return()
)
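Once built, the pipeline is a regular Python iterable. A minimal usage sketch (assuming file.tsv exists on disk):

for line in data:
    print(line)

# A DataPipeline can be rewound and iterated again.
data.reset()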

Functions to build a DataPipeline:

DataPipeline()

fairseq2 native data pipeline.

DataPipelineBuilder()

API to create a DataPipeline.

list_files(path[, pattern])

Recursively list all files under path that match pattern.

read_sequence(seq)

Read every element in seq.

read_zipped_records(path)

Read each file in a zip archive.

text.read_text(path[, key, encoding, ...])

Open a text file and return a data pipeline reading lines one by one.

FileMapper([root_dir, cached_fd_count])

For a given file name, returns the file content as bytes.

Collater([pad_value, pad_to_multiple, overrides])

Concatenate a list of inputs into a single batch (see the sketch after this list).

CollateOptionsOverride(selector[, ...])

Overrides how the collater creates batches for a particular column.
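As a sketch of how Collater fits into a pipeline (the bucket size and pad value here are arbitrary choices for illustration):

import torch

from fairseq2.data import Collater, read_sequence

pipeline = (
    read_sequence([torch.tensor([1, 2]), torch.tensor([3, 4, 5])])
    # Group consecutive examples into lists of two.
    .bucket(2)
    # Pad the shorter sequence with zeros so each bucket collates into one batch.
    .map(Collater(pad_value=0))
    .and_return()
)

for batch in pipeline:
    print(batch)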

Column syntax

The data items going through the pipeline don't have to be flat tensors; they can be tuples or Python dictionaries. Several operators have a syntax to specify a particular column of the input data. Notably, the DataPipelineBuilder.map() operator has a selector argument to choose the column the function is applied to.

If the data item is a tuple, the selector "[3]" selects the element at index 3 (indexing is zero-based, as in Python). If the data item is a dictionary, "foo" selects the value corresponding to the key "foo". Selectors can be nested, using . to separate keys, following a Python-like syntax. For the data item {"foo": [{"x": 1, "y": 2}, {"x": 3, "y": 4, "z": 5}], "bar": 6}, the selector "foo[1].y" refers to the value 4.

Functions that accept several selectors take them as a comma-separated list. For example, .map(lambda x: x * 10, selector="foo[1].y,bar") will multiply the values 4 and 6 by 10, leaving the other values unmodified.
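Putting the selector syntax together in a runnable sketch:

from fairseq2.data import read_sequence

data = [{"foo": [{"x": 1, "y": 2}, {"x": 3, "y": 4, "z": 5}], "bar": 6}]

pipeline = (
    read_sequence(data)
    # Apply the function only to the two selected columns.
    .map(lambda v: v * 10, selector="foo[1].y,bar")
    .and_return()
)

print(next(iter(pipeline)))
# {'foo': [{'x': 1, 'y': 2}, {'x': 3, 'y': 40, 'z': 5}], 'bar': 60}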

Pseudo-infinite and Infinite Pipelines

The DataPipeline.count() and DataPipeline.constant() static methods create pseudo-infinite pipelines. When used with operators that combine multiple pipelines (e.g. DataPipeline.sample(), DataPipeline.round_robin(), DataPipeline.zip()), they will only yield examples as long as the other pipelines yield examples.

For example:

from fairseq2.data import DataPipeline, read_sequence

pipeline1 = DataPipeline.constant(0).and_return()
pipeline2 = read_sequence([1, 2, 3]).and_return()

for example in DataPipeline.round_robin(pipeline1, pipeline2).and_return():
    print(example)

only produces 0, 1, 0, 2, 0, 3.
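DataPipeline.count() behaves the same way; a minimal sketch zipping it with a finite pipeline (following the call style of the example above):

from fairseq2.data import DataPipeline, read_sequence

pipeline1 = DataPipeline.count().and_return()
pipeline2 = read_sequence(["a", "b", "c"]).and_return()

for example in DataPipeline.zip(pipeline1, pipeline2).and_return():
    print(example)

stops after three examples, pairing 0, 1, 2 with "a", "b", "c".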

Infinite pipelines (pipelines created through DataPipelineBuilder.repeat() with no arguments) do not exhibit this behavior; they will yield examples indefinitely even when combined with other pipelines.

For example:

from fairseq2.data import DataPipeline, read_sequence

pipeline1 = read_sequence([0]).repeat().and_return()
pipeline2 = read_sequence([1, 2, 3]).and_return()

for example in DataPipeline.round_robin(pipeline1, pipeline2).and_return():
    print(example)

produces 0, 1, 0, 2, 0, 3, 0, 1, 0, 2, 0, 3… indefinitely.

Public classes used in the fairseq2 API:

ByteStreamError

Raised when a dataset file can't be read.

DataPipelineError

Raised when an error occurs while reading from a data pipeline.

RecordError

Raised when a corrupt record is encountered while reading a dataset.

VocabularyInfo(size, unk_idx, bos_idx, ...)

Describes the vocabulary used by a tokenizer.

Helper methods:

get_last_failed_example()

Return the example that caused the last pipeline error.
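As a debugging sketch (assuming a failure inside map() surfaces as a DataPipelineError):

from fairseq2.data import (
    DataPipelineError,
    get_last_failed_example,
    read_sequence,
)

pipeline = read_sequence([1, 0]).map(lambda x: 1 // x).and_return()

try:
    for example in pipeline:
        print(example)
except DataPipelineError:
    # Inspect the input that made the map() call fail.
    print(get_last_failed_example())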

fairseq2.data.text

Tools to tokenize text, converting it from bytes to tensors.

TextTokenizer()

Represents a tokenizer to encode and decode text.

TextTokenDecoder()

Decodes text from tokens or token indices.

TextTokenEncoder()

Encodes text into tokens or token indices.

StrSplitter([sep, names, indices, exclude])

Split a string on a given character.

StrToIntConverter([base])

Parses integers in a given base.

StrToTensorConverter([size, dtype])

Converts a string into a tensor.

SentencePieceModel(path[, control_symbols])

Represents a SentencePiece model.

SentencePieceEncoder(model[, prefix_tokens, ...])

Encodes text with a SentencePiece model.

SentencePieceDecoder(model[, reverse])

Decodes text with a SentencePiece model.

vocab_info_from_sentencepiece(model)

Return the vocabulary information of model.

LineEnding(value)

Specifies the line ending of a text file.
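To show how the SentencePiece classes fit together, a minimal sketch (spm.model is a placeholder path to a trained SentencePiece model):

from fairseq2.data.text import (
    SentencePieceDecoder,
    SentencePieceEncoder,
    SentencePieceModel,
)

model = SentencePieceModel("spm.model")

encoder = SentencePieceEncoder(model)
decoder = SentencePieceDecoder(model)

indices = encoder("hello world")  # a tensor of token indices
decoded = decoder(indices)        # back to text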