Decoding Process Overview ========================== This section explains how media data is processed internally in SPDL. Understanding this multi-stage process will help you customize decoding effectively. The Decoding Process -------------------- Media decoding in SPDL follows a multi-stage process based on FFmpeg. The decoding process consists of the following stages: 1. **Demuxing**: Extract packets from the source 2. **Decoding**: Decode packets into frames 3. **Filtering**: Apply transformations to frames (optional but commonly used) 4. **Buffer Conversion**: Merge frames into a contiguous array The following diagram illustrates this process: .. mermaid:: flowchart TD subgraph Demuxing direction LR b(Byte String) --> |Demuxing| p1(Packet) p1 --> |Bitstream Filtering (Optional)| p2(Packet) end subgraph Decoding direction LR p3(Packet) --> |Decoding| f1(Frame) f1 --> |Filtering| f2(Frame) end subgraph c["Buffer Conversion"] subgraph ff[" "] f3(Frame) f4(Frame) f5(Frame) end ff --> |Buffer Conversion| b2(Buffer) end Demuxing --> Decoding --> c Low-Level Functions ~~~~~~~~~~~~~~~~~~~ While the high-level loading functions (:py:func:`spdl.io.load_audio`, :py:func:`spdl.io.load_video`, :py:func:`spdl.io.load_image`) provide a simple interface for common use cases, SPDL also exposes low-level functions that give you fine-grained control over each stage of the decoding process. These low-level functions are: 1. **Demuxing functions**: Extract packets from the source - :py:func:`spdl.io.demux_audio` - :py:func:`spdl.io.demux_video` - :py:func:`spdl.io.demux_image` 2. **Decoding function**: Decode packets into frames - :py:func:`spdl.io.decode_packets` 3. **Buffer conversion function**: Convert frames into a contiguous buffer - :py:func:`spdl.io.convert_frames` 4. **Transfer function**: Transfer buffer to GPU (optional) - :py:func:`spdl.io.transfer_buffer` The relationship between high-level and low-level functions can be expressed as: .. code-block:: python # load_audio is equivalent to: packets = spdl.io.demux_audio(src, timestamp=timestamp, demux_config=demux_config) frames = spdl.io.decode_packets(packets, decode_config=decode_config, filter_desc=filter_desc) buffer = spdl.io.convert_frames(frames) # Optionally: buffer = spdl.io.transfer_buffer(buffer, device_config=device_config) # load_video is equivalent to: packets = spdl.io.demux_video(src, timestamp=timestamp, demux_config=demux_config) frames = spdl.io.decode_packets(packets, decode_config=decode_config, filter_desc=filter_desc) buffer = spdl.io.convert_frames(frames) # Optionally: buffer = spdl.io.transfer_buffer(buffer, device_config=device_config) # load_image is equivalent to: packets = spdl.io.demux_image(src, demux_config=demux_config) frames = spdl.io.decode_packets(packets, decode_config=decode_config, filter_desc=filter_desc) buffer = spdl.io.convert_frames(frames) # Optionally: buffer = spdl.io.transfer_buffer(buffer, device_config=device_config) Using low-level functions is useful when you need to: - Inspect or modify packets/frames between steps - Apply custom processing logic - Integrate with other libraries or custom decoders - Debug decoding issues - Reuse demuxed packets with different decoding parameters Stage 1: Demuxing ------------------ `Demuxing `_ (short for demultiplexing) is the process of splitting input data into smaller chunks called **packets**. Media files typically contain multiple streams (e.g., one audio stream and one video stream) interleaved together. Demuxing identifies the boundaries between these packets and extracts them one by one. .. mermaid:: block-beta columns 1 b["0101010101100101...................................."] space block:demuxed p0[["Header"]] p1(["Audio 0"]) p2["Video 0"] p3["Video 1"] p4["Video 2"] p5["Video 3"] p6(["Audio 1"]) p7["Video 4"] p8["Video 5"] end b-- "demuxing" -->demuxed **In SPDL:** - :py:func:`spdl.io.demux_audio` - Demux audio packets - :py:func:`spdl.io.demux_video` - Demux video packets - :py:func:`spdl.io.demux_image` - Demux image packets **Example:** .. code-block:: python import spdl.io # Demux video packets from a file packets = spdl.io.demux_video("video.mp4") # Demux audio packets for a specific time window packets = spdl.io.demux_audio("audio.mp3", timestamp=(5.0, 10.0)) Stage 2: Decoding ------------------ Decoding is the process of decompressing packets to recover the original media data. Media files are typically encoded (compressed) to reduce file size. The decoder reverses this process to produce **frames**. **Frames** contain the actual media samples: - For audio: waveform samples - For video/image: pixel data for each frame **In SPDL:** - :py:func:`spdl.io.decode_packets` - Decode packets into frames **Example:** .. code-block:: python import spdl.io # Demux and decode packets = spdl.io.demux_video("video.mp4") frames = spdl.io.decode_packets(packets) Stage 3: Filtering ------------------- Filtering is a versatile stage that can apply various transformations to frames. This is where format conversion, resizing, cropping, and other preprocessing operations occur. FFmpeg provides a rich set of filters through its `filter graph system `_. In SPDL, filtering is controlled by the ``filter_desc`` parameter. **Common filtering operations:** - **Format conversion**: Convert pixel format (e.g., YUV to RGB) or audio sample format - **Resizing**: Scale video/image to different dimensions - **Cropping**: Extract a region of interest - **Frame rate adjustment**: Change video frame rate - **Trimming**: Remove frames outside a specified time window - **Augmentation**: Apply random transformations for data augmentation **In SPDL:** The ``filter_desc`` parameter in :py:func:`spdl.io.decode_packets` controls filtering. Helper functions generate filter descriptions: - :py:func:`spdl.io.get_audio_filter_desc` - :py:func:`spdl.io.get_video_filter_desc` - :py:func:`spdl.io.get_filter_desc` **Example:** .. code-block:: python import spdl.io packets = spdl.io.demux_video("video.mp4") # Create a filter description filter_desc = spdl.io.get_video_filter_desc( scale_width=256, scale_height=256, pix_fmt="rgb24" ) # Decode with filtering frames = spdl.io.decode_packets(packets, filter_desc=filter_desc) See :doc:`filtering` for detailed information about filter customization. Stage 4: Buffer Conversion --------------------------- Buffer conversion is the final stage where multiple frames are merged into a single contiguous memory region. This creates an array-like buffer that can be easily converted to NumPy arrays, PyTorch tensors, or other array types. **In SPDL:** - :py:func:`spdl.io.convert_frames` - Convert frames to a buffer **Example:** .. code-block:: python import spdl.io packets = spdl.io.demux_video("video.mp4") frames = spdl.io.decode_packets(packets) buffer = spdl.io.convert_frames(frames) # Convert to PyTorch tensor tensor = spdl.io.to_torch(buffer) Optional: Bitstream Filtering ------------------------------ Bitstream filtering is an optional stage that modifies packets before decoding. This is less commonly used but necessary for certain scenarios. **Common use cases:** - Converting H.264/HEVC packets to Annex B format for hardware-accelerated decoding - Extracting specific data from packets - Modifying packet metadata **In SPDL:** - :py:class:`spdl.io.BSF` - Bitstream filter class - :py:func:`spdl.io.apply_bsf` - Apply bitstream filtering to packets **Example:** .. code-block:: python import spdl.io # For hardware-accelerated video decoding, H.264 packets need conversion packets = spdl.io.demux_video("video.mp4") packets = spdl.io.apply_bsf(packets, "h264_mp4toannexb") # Now decode with hardware decoder frames = spdl.io.decode_packets_nvdec( packets, device_config=spdl.io.cuda_config(device_index=0) ) Complete Example ---------------- Here's a complete example showing all stages: .. code-block:: python import spdl.io # Source file src = "video.mp4" # Stage 1: Demuxing packets = spdl.io.demux_video(src, timestamp=(0.0, 5.0)) print(f"Demuxed {len(packets)} packets") # Stage 2 & 3: Decoding with Filtering filter_desc = spdl.io.get_video_filter_desc( scale_width=224, scale_height=224, pix_fmt="rgb24", num_frames=30 ) frames = spdl.io.decode_packets(packets, filter_desc=filter_desc) print(f"Decoded {len(frames)} frames") # Stage 4: Buffer Conversion buffer = spdl.io.convert_frames(frames) print(f"Buffer shape: {buffer.shape}") # Convert to array tensor = spdl.io.to_torch(buffer) print(f"Tensor shape: {tensor.shape}") # (30, 224, 224, 3) .. note:: The high-level :py:func:`spdl.io.load_video` function performs all four stages (demuxing, decoding, filtering, and buffer conversion) automatically. The equivalent call would be: .. code-block:: python import spdl.io buffer = spdl.io.load_video( "video.mp4", timestamp=(0.0, 5.0), filter_desc=spdl.io.get_video_filter_desc( scale_width=224, scale_height=224, pix_fmt="rgb24", num_frames=30 ) ) tensor = spdl.io.to_torch(buffer) This performs stages 1-4 internally and returns the final buffer ready for conversion. Performance Considerations -------------------------- While this multi-stage design provides great flexibility and features, it comes with overhead due to the complexity of FFmpeg-based processing. For certain formats, specialized libraries or direct byte manipulation can be significantly faster. For example, WAV format stores raw audio samples without compression. Instead of going through the full demux-decode-filter-convert process, it's more efficient to simply reinterpret the incoming bytes directly as an array. **In SPDL:** :py:func:`spdl.io.load_wav` is optimized for this use case and bypasses the full FFmpeg-based process for better performance when working with WAV files. When choosing between :py:func:`spdl.io.load_audio` and :py:func:`spdl.io.load_wav`: - Use :py:func:`spdl.io.load_wav` for WAV files when you need maximum performance and don't require complex preprocessing - Use :py:func:`spdl.io.load_audio` when you need: - Support for multiple formats (MP3, FLAC, AAC, etc.) - Complex filtering and preprocessing - Timestamp-based seeking - Consistent API across different formats