fairseq2.data¶
The data module provides data processing pipelines and utilities for text, audio, and structured data. It includes high-performance data loaders, text tokenizers, and processing utilities for machine learning workflows.
Key Features:
High-Performance Data Pipelines: Optimized C++-based data loading and processing
Text Processing: Comprehensive tokenization and text preprocessing utilities
Audio Processing: Tools for audio data loading and feature extraction
Structured Data: Support for Parquet, JSON, and other structured formats
Memory Efficient: Streaming and batched processing for large datasets
Integration: Works directly with fairseq2’s training and evaluation systems
Note
The fairseq2 data module is designed for high-throughput machine learning workloads. It exposes a Python API, with performance-critical operations implemented in C++.
Text Processing¶
Data Pipeline Components¶
Coming Soon: Documentation for data pipeline components including:
DataPipeline: Core pipeline abstraction for chaining data transformations
DataLoader: High-performance data loading with batching and shuffling
Collators: Utilities for batching variable-length sequences
Samplers: Various sampling strategies for training and evaluation
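Until that documentation is available, the idea of chaining transformations and batching variable-length sequences can be illustrated with plain Python generators. The sketch below is not the fairseq2 API: `map_items`, `bucket`, `pad_collate`, and `toy_encode` are hypothetical helpers written for illustration, and padding with `0` is an assumption.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def map_items(
    source: Iterable[str], fn: Callable[[str], List[int]]
) -> Iterator[List[int]]:
    """Lazily apply `fn` to every element of `source`."""
    for item in source:
        yield fn(item)


def bucket(source: Iterable[List[int]], size: int) -> Iterator[List[List[int]]]:
    """Group consecutive elements into lists of at most `size` items."""
    batch: List[List[int]] = []
    for item in source:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


def pad_collate(
    batch: List[List[int]], pad_value: int = 0
) -> Tuple[List[List[int]], List[int]]:
    """Right-pad every sequence to the longest one and return the true lengths."""
    lengths = [len(seq) for seq in batch]
    max_len = max(lengths)
    padded = [seq + [pad_value] * (max_len - len(seq)) for seq in batch]
    return padded, lengths


def toy_encode(text: str) -> List[int]:
    """Stand-in for a tokenizer: maps each word to its character count."""
    return [len(word) for word in text.split()]


sentences = ["a bb ccc", "dddd", "ee ff"]

# Each stage consumes the previous one lazily; nothing runs until iteration.
pipeline = (
    pad_collate(b) for b in bucket(map_items(sentences, toy_encode), size=2)
)
batches = list(pipeline)
```

Because every stage is a generator, only one batch is held in memory at a time, which is the property a streaming pipeline relies on for large datasets.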
Audio Processing¶
Coming Soon: Documentation for audio processing utilities including:
AudioDataset: Dataset classes for audio files and features
AudioCollator: Batching utilities for variable-length audio sequences
Feature Extraction: Mel-spectrogram, MFCC, and other audio features
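As background for the mel-based features listed above, the following sketch shows one common convention for the mel scale (the HTK-style formula) and how band centre frequencies can be spaced evenly in mel. It is a standalone illustration, not fairseq2 code; the function names and the 0 to 8000 Hz range are assumptions chosen for the example.

```python
import math
from typing import List


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of `hz_to_mel`."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_spaced_frequencies(low_hz: float, high_hz: float, count: int) -> List[float]:
    """Return `count` frequencies in Hz, evenly spaced on the mel scale."""
    if count < 2:
        raise ValueError("count must be at least 2")
    low_mel = hz_to_mel(low_hz)
    high_mel = hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (count - 1)
    return [mel_to_hz(low_mel + i * step) for i in range(count)]


# Ten points between 0 Hz and 8 kHz: dense at low frequencies, sparse at high.
centers = mel_spaced_frequencies(0.0, 8000.0, 10)
```

The spacing between consecutive values grows with frequency, which is why mel filterbanks resolve low frequencies more finely than high ones.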
Structured Data¶
Coming Soon: Documentation for structured data processing including:
ParquetDataset: Efficient loading of Parquet files
JsonDataset: JSON and JSONL file processing
CSVDataset: CSV file loading with type inference
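For orientation, JSONL processing amounts to parsing one JSON document per line while streaming, so the whole file never has to be in memory. The sketch below uses only the Python standard library; `read_jsonl` is a hypothetical helper, not a fairseq2 function, and skipping blank lines is an assumption about the desired behavior.

```python
import io
import json
from typing import Any, Iterator, TextIO


def read_jsonl(stream: TextIO) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of `stream`."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"invalid JSON on line {line_number}") from exc


# An in-memory stream stands in for an open file in this example.
sample = io.StringIO('{"id": 1, "text": "hello"}\n\n{"id": 2, "text": "world"}\n')
records = list(read_jsonl(sample))
```

With a real file, the same function can be called as `read_jsonl(open(path, encoding="utf-8"))`, where `path` is whatever file is being loaded.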
See Also¶
fairseq2.nn - Neural network components and BatchLayout
fairseq2.datasets - High-level dataset abstractions
fairseq2.models - Model implementations that consume data