fairseq2.data

fairseq2.data provides a Python API to build a C++ DataPipeline.
The resulting data loader can leverage several threads, working around the limitations of Python's Global Interpreter Lock, and offers better performance than a pure Python data loader.
Building a DataPipeline looks like this:
from fairseq2.data import text

data = (
    text.read_text("file.tsv")
    .map(lambda x: str(x.split("\t")[1]).lower())  # keep the second column, lowercased
    .filter(lambda x: len(x) < 10)                 # drop long entries
    .and_return()                                  # turn the builder into a DataPipeline
)
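Once built, the pipeline is a regular Python iterable ("file.tsv" above is a placeholder path):

for line in data:
    print(line)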
Functions to build a DataPipeline:
DataPipeline: fairseq2 native data pipeline.
DataPipelineBuilder: API to create a DataPipeline.
list_files: Recursively list all files under a given path.
read_sequence: Read every element in a given sequence.
read_zipped_records: Read each file in a zip archive.
read_text: Open a text file and return a data pipeline reading lines one by one.
FileMapper: For a given file name, return the file content as bytes.
Collater: Concatenate a list of inputs into a single input.
CollateOptionsOverride: Overrides how the collater creates a batch for a particular column.
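A short sketch of how these pieces compose, assuming fairseq2 is installed (the in-memory data below is made up for illustration):

import torch

from fairseq2.data import Collater, read_sequence

pipeline = (
    read_sequence([torch.tensor([1, 2]), torch.tensor([3, 4])])
    .bucket(2)        # group two consecutive examples into a list
    .map(Collater())  # stack same-sized tensors into a single batch tensor
    .and_return()
)

for batch in pipeline:
    print(batch)  # tensor([[1, 2], [3, 4]])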
Column syntax
The data items going through the pipeline don't have to be flat tensors; they can be tuples or Python dictionaries.
Several operators accept a syntax to select a specific column of the input data.
Notably, the DataPipelineBuilder.map() operator has a selector argument to choose the column to apply the function to.
If the data item is a tuple, then the selector "[3]" selects the element at index 3 (indexing is zero-based, as in Python).
If the data item is a dictionary, then "foo" selects the value corresponding to the key "foo".
You can nest selectors, using "." to separate key selectors, following a Python-like syntax.
For the data item {"foo": [{"x": 1, "y": 2}, {"x": 3, "y": 4, "z": 5}], "bar": 6}, the selector "foo[1].y" refers to the value 4.
Functions that accept several selectors take them as a comma-separated list.
For example, .map(lambda x: x * 10, selector="foo[1].y,bar") multiplies the values 4 and 6 by 10, but leaves the other values unmodified.
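A minimal runnable sketch of the selector syntax, using the data item from the example above (assumes fairseq2 is installed):

from fairseq2.data import read_sequence

item = {"foo": [{"x": 1, "y": 2}, {"x": 3, "y": 4, "z": 5}], "bar": 6}

pipeline = (
    read_sequence([item])
    .map(lambda v: v * 10, selector="foo[1].y,bar")  # only these two fields are touched
    .and_return()
)

print(next(iter(pipeline)))
# {'foo': [{'x': 1, 'y': 2}, {'x': 3, 'y': 40, 'z': 5}], 'bar': 60}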
Public classes used in the fairseq2 API:
CString: Represents an immutable UTF-8 string that supports zero-copy marshalling between Python and native code.
ByteStreamError: Raised when a dataset file can't be read.
DataPipelineError: Raised when an error occurs while reading from a data pipeline.
RecordError: Raised when a corrupt record is encountered while reading a dataset.
VocabularyInfo: Describes the vocabulary used by a tokenizer.
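VocabularyInfo is a simple value object; a minimal construction sketch (the size and indices below are made-up values, normally provided by the tokenizer):

from fairseq2.data import VocabularyInfo

vocab_info = VocabularyInfo(
    size=32000, unk_idx=0, bos_idx=1, eos_idx=2, pad_idx=3
)

print(vocab_info.size, vocab_info.pad_idx)  # 32000 3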
fairseq2.data.text
Tools to tokenize text, converting it from bytes to tensors.
TextTokenizer: Represents a tokenizer to encode and decode text.
TextTokenDecoder: Decodes text from tokens or token indices.
TextTokenEncoder: Encodes text into tokens or token indices.
StrSplitter: Splits a string on a given character.
StrToIntConverter: Parses integers in a given base.
vocab_info_from_sentencepiece: Return the vocabulary information of a given SentencePiece model.
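A small sketch combining these helpers with the pipeline API, assuming fairseq2 is installed (the two-column data is made up):

from fairseq2.data import read_sequence
from fairseq2.data.text import StrSplitter, StrToIntConverter

split = StrSplitter(sep="\t")        # "cat\t1" -> ["cat", "1"]
to_int = StrToIntConverter(base=10)  # "1" -> 1

pipeline = (
    read_sequence(["cat\t1", "dog\t2"])
    .map(split)
    .map(to_int, selector="[1]")     # convert only the second field
    .and_return()
)

print(list(pipeline))  # [['cat', 1], ['dog', 2]]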