Decoding Process Overview

This section explains how media data is processed internally in SPDL. Understanding this multi-stage process will help you customize decoding effectively.

The Decoding Process

Media decoding in SPDL is built on FFmpeg and proceeds through the following stages:

  1. Demuxing: Extract packets from the source

  2. Decoding: Decode packets into frames

  3. Filtering: Apply transformations to frames (optional but commonly used)

  4. Buffer Conversion: Merge frames into a contiguous array

The following diagram illustrates this process:

flowchart TD
    subgraph Demuxing
        direction LR
        b(Byte String) --> |Demuxing| p1(Packet)
        p1 --> |Bitstream Filtering &#40;Optional&#41;| p2(Packet)
    end
    subgraph Decoding
        direction LR
        p3(Packet) --> |Decoding| f1(Frame)
        f1 --> |Filtering| f2(Frame)
    end
    subgraph c["Buffer Conversion"]
        subgraph ff[" "]
            f3(Frame)
            f4(Frame)
            f5(Frame)
        end
        ff --> |Buffer Conversion| b2(Buffer)
    end
    Demuxing --> Decoding --> c

Low-Level Functions

While the high-level loading functions (spdl.io.load_audio(), spdl.io.load_video(), spdl.io.load_image()) provide a simple interface for common use cases, SPDL also exposes low-level functions that give you fine-grained control over each stage of the decoding process.

These low-level functions are:

  1. Demuxing functions: Extract packets from the source

  2. Decoding function: Decode packets into frames

  3. Buffer conversion function: Convert frames into a contiguous buffer

  4. Transfer function: Transfer buffer to GPU (optional)

The relationship between high-level and low-level functions can be expressed as:

# load_audio is equivalent to:
packets = spdl.io.demux_audio(src, timestamp=timestamp, demux_config=demux_config)
frames = spdl.io.decode_packets(packets, decode_config=decode_config, filter_desc=filter_desc)
buffer = spdl.io.convert_frames(frames)
# Optionally:
buffer = spdl.io.transfer_buffer(buffer, device_config=device_config)

# load_video is equivalent to:
packets = spdl.io.demux_video(src, timestamp=timestamp, demux_config=demux_config)
frames = spdl.io.decode_packets(packets, decode_config=decode_config, filter_desc=filter_desc)
buffer = spdl.io.convert_frames(frames)
# Optionally:
buffer = spdl.io.transfer_buffer(buffer, device_config=device_config)

# load_image is equivalent to:
packets = spdl.io.demux_image(src, demux_config=demux_config)
frames = spdl.io.decode_packets(packets, decode_config=decode_config, filter_desc=filter_desc)
buffer = spdl.io.convert_frames(frames)
# Optionally:
buffer = spdl.io.transfer_buffer(buffer, device_config=device_config)

Using low-level functions is useful when you need to:

  • Inspect or modify packets/frames between steps

  • Apply custom processing logic

  • Integrate with other libraries or custom decoders

  • Debug decoding issues

  • Reuse demuxed packets with different decoding parameters

Stage 1: Demuxing

Demuxing (short for demultiplexing) is the process of splitting input data into smaller chunks called packets.

Media files typically contain multiple streams (e.g., one audio stream and one video stream) whose packets are interleaved. Demuxing identifies the boundaries between packets and extracts them one by one.

block-beta
    columns 1
    b["0101010101100101...................................."]
    space
    block:demuxed
        p0[["Header"]]
        p1(["Audio 0"])
        p2["Video 0"]
        p3["Video 1"]
        p4["Video 2"]
        p5["Video 3"]
        p6(["Audio 1"])
        p7["Video 4"]
        p8["Video 5"]
    end
    b-- "demuxing" -->demuxed
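
To make the idea of packet boundaries concrete, here is a minimal sketch of demuxing over a toy container layout. The layout (a 1-byte stream id and a 4-byte length before each payload) and the `mux`/`demux` helpers are assumptions made up for illustration; real containers such as MP4 are far more involved, and this is not how SPDL or FFmpeg implement demuxing.

```python
import struct
from collections import defaultdict

# Hypothetical toy layout, for illustration only:
# each packet = 1-byte stream id + 4-byte big-endian payload length + payload.
HEADER = struct.Struct(">BI")


def mux(packets):
    """Interleave (stream_id, payload) pairs into one byte string."""
    return b"".join(HEADER.pack(sid, len(data)) + data for sid, data in packets)


def demux(blob):
    """Walk the byte string, find packet boundaries, and group payloads by stream."""
    streams = defaultdict(list)
    offset = 0
    while offset < len(blob):
        sid, size = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        streams[sid].append(blob[offset:offset + size])
        offset += size
    return dict(streams)


AUDIO, VIDEO = 0, 1
blob = mux([(AUDIO, b"a0"), (VIDEO, b"v0"), (VIDEO, b"v1"), (AUDIO, b"a1")])
streams = demux(blob)
print(streams[AUDIO])  # [b'a0', b'a1']
print(streams[VIDEO])  # [b'v0', b'v1']
```

Note that the payloads stay compressed throughout: demuxing only locates and separates packets.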

In SPDL, packets are demuxed with spdl.io.demux_audio(), spdl.io.demux_video(), and spdl.io.demux_image().

Example:

import spdl.io

# Demux video packets from a file
packets = spdl.io.demux_video("video.mp4")

# Demux audio packets for a specific time window
packets = spdl.io.demux_audio("audio.mp3", timestamp=(5.0, 10.0))

Stage 2: Decoding

Decoding is the process of decompressing packets to recover the original media data. Media files are typically encoded (compressed) to reduce file size. The decoder reverses this process to produce frames.

Frames contain the actual media samples:

  • For audio: waveform samples

  • For video/image: pixel data for each frame
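
The packet-to-frame relationship can be sketched with a stand-in codec. The snippet below uses `zlib` purely as an assumption for illustration: it is lossless and treats every packet independently, whereas real media codecs are usually lossy and often depend on neighboring packets.

```python
import zlib

# Three 64-byte "frames" of raw sample data.
raw_frames = [bytes([i]) * 64 for i in range(3)]

# "Encoding": compress each frame into a smaller packet.
packets = [zlib.compress(frame) for frame in raw_frames]

# "Decoding": decompress each packet to recover the frame.
frames = [zlib.decompress(packet) for packet in packets]

print(all(len(p) < len(f) for p, f in zip(packets, raw_frames)))  # True
print(frames == raw_frames)  # True
```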

In SPDL, packets are decoded with spdl.io.decode_packets().

Example:

import spdl.io

# Demux and decode
packets = spdl.io.demux_video("video.mp4")
frames = spdl.io.decode_packets(packets)

Stage 3: Filtering

Filtering is a versatile stage that can apply various transformations to frames. This is where format conversion, resizing, cropping, and other preprocessing operations occur.

FFmpeg provides a rich set of filters through its filter graph system. In SPDL, filtering is controlled by the filter_desc parameter.

Common filtering operations:

  • Format conversion: Convert pixel format (e.g., YUV to RGB) or audio sample format

  • Resizing: Scale video/image to different dimensions

  • Cropping: Extract a region of interest

  • Frame rate adjustment: Change video frame rate

  • Trimming: Remove frames outside a specified time window

  • Augmentation: Apply random transformations for data augmentation
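
In FFmpeg's filter graph syntax, a chain of such operations is written as comma-separated filters, each with `key=value` options. The sketch below builds a string of that form by hand. The `build_filter_desc` helper is hypothetical, and treating the result as a valid `filter_desc` value assumes that SPDL accepts a plain FFmpeg filter graph string there.

```python
def build_filter_desc(width, height, pix_fmt):
    """Chain a scale filter and a format filter in FFmpeg filter graph syntax."""
    filters = [
        f"scale=width={width}:height={height}",  # resize
        f"format=pix_fmts={pix_fmt}",            # pixel format conversion
    ]
    return ",".join(filters)


desc = build_filter_desc(256, 256, "rgb24")
print(desc)  # scale=width=256:height=256,format=pix_fmts=rgb24
```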

In SPDL:

The filter_desc parameter in spdl.io.decode_packets() controls filtering. Helper functions such as spdl.io.get_video_filter_desc() generate filter descriptions.

Example:

import spdl.io

packets = spdl.io.demux_video("video.mp4")

# Create a filter description
filter_desc = spdl.io.get_video_filter_desc(
    scale_width=256,
    scale_height=256,
    pix_fmt="rgb24"
)

# Decode with filtering
frames = spdl.io.decode_packets(packets, filter_desc=filter_desc)

See Filter Graphs for detailed information about filter customization.

Stage 4: Buffer Conversion

Buffer conversion is the final stage where multiple frames are merged into a single contiguous memory region. This creates an array-like buffer that can be easily converted to NumPy arrays, PyTorch tensors, or other array types.
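
Conceptually, this stage copies separately allocated frames into one block of memory and attaches a shape to it. Here is a minimal sketch using only the Python standard library. The frame sizes and the `(frames, height, width, channels)` layout are assumptions for illustration, not a description of SPDL's internal implementation.

```python
# Three toy RGB frames of 2x2 pixels, each held in its own bytes object.
height, width, channels = 2, 2, 3
frame_size = height * width * channels
frames = [bytes([i]) * frame_size for i in range(3)]

# Copy every frame into one contiguous block of memory.
buffer = bytearray(len(frames) * frame_size)
for i, frame in enumerate(frames):
    buffer[i * frame_size:(i + 1) * frame_size] = frame

# View the block as an array of shape (frames, height, width, channels).
view = memoryview(buffer).cast("B", shape=[len(frames), height, width, channels])
print(view.shape)         # (3, 2, 2, 3)
print(view.c_contiguous)  # True
print(view[2, 1, 1, 0])   # 2
```

Because the result is a single contiguous region with a known shape, array libraries can wrap it without copying it again.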

In SPDL, frames are merged into a buffer with spdl.io.convert_frames().

Example:

import spdl.io

packets = spdl.io.demux_video("video.mp4")
frames = spdl.io.decode_packets(packets)
buffer = spdl.io.convert_frames(frames)

# Convert to PyTorch tensor
tensor = spdl.io.to_torch(buffer)

Optional: Bitstream Filtering

Bitstream filtering is an optional stage that modifies packets before decoding. This is less commonly used but necessary for certain scenarios.

Common use cases:

  • Converting H.264/HEVC packets to Annex B format for hardware-accelerated decoding

  • Extracting specific data from packets

  • Modifying packet metadata
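
The first use case can be illustrated with a simplified sketch. In MP4 files, each H.264 NAL unit is stored with a big-endian length prefix, while Annex B marks each unit with a start code. The `avcc_to_annexb` function below is a hypothetical helper that shows only this framing change, assuming 4-byte length prefixes. A real bitstream filter such as h264_mp4toannexb also handles parameter sets stored in the codec extradata, which this sketch ignores.

```python
START_CODE = b"\x00\x00\x00\x01"


def avcc_to_annexb(packet, length_size=4):
    """Replace each big-endian length prefix with an Annex B start code."""
    out = bytearray()
    offset = 0
    while offset < len(packet):
        nal_size = int.from_bytes(packet[offset:offset + length_size], "big")
        offset += length_size
        out += START_CODE + packet[offset:offset + nal_size]
        offset += nal_size
    return bytes(out)


# Two fake NAL units in length-prefixed form.
nal_a, nal_b = b"\x65\xaa\xbb", b"\x41\xcc"
packet = len(nal_a).to_bytes(4, "big") + nal_a + len(nal_b).to_bytes(4, "big") + nal_b
converted = avcc_to_annexb(packet)
print(converted.hex())  # 0000000165aabb0000000141cc
```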

In SPDL, bitstream filters are applied to packets with spdl.io.apply_bsf().

Example:

import spdl.io

# For hardware-accelerated video decoding, H.264 packets need conversion
packets = spdl.io.demux_video("video.mp4")
packets = spdl.io.apply_bsf(packets, "h264_mp4toannexb")

# Now decode with hardware decoder
frames = spdl.io.decode_packets_nvdec(
    packets,
    device_config=spdl.io.cuda_config(device_index=0)
)

Complete Example

Here’s a complete example showing all stages:

import spdl.io

# Source file
src = "video.mp4"

# Stage 1: Demuxing
packets = spdl.io.demux_video(src, timestamp=(0.0, 5.0))
print(f"Demuxed {len(packets)} packets")

# Stages 2 & 3: Decoding with Filtering
filter_desc = spdl.io.get_video_filter_desc(
    scale_width=224,
    scale_height=224,
    pix_fmt="rgb24",
    num_frames=30
)
frames = spdl.io.decode_packets(packets, filter_desc=filter_desc)
print(f"Decoded {len(frames)} frames")

# Stage 4: Buffer Conversion
buffer = spdl.io.convert_frames(frames)
print(f"Buffer shape: {buffer.shape}")

# Convert to array
tensor = spdl.io.to_torch(buffer)
print(f"Tensor shape: {tensor.shape}")  # (30, 224, 224, 3)

Note

The high-level spdl.io.load_video() function performs all four stages (demuxing, decoding, filtering, and buffer conversion) automatically. The equivalent call would be:

import spdl.io

buffer = spdl.io.load_video(
    "video.mp4",
    timestamp=(0.0, 5.0),
    filter_desc=spdl.io.get_video_filter_desc(
        scale_width=224,
        scale_height=224,
        pix_fmt="rgb24",
        num_frames=30
    )
)
tensor = spdl.io.to_torch(buffer)

The call returns the final buffer, ready to be converted to a tensor.

Performance Considerations

This multi-stage design is flexible, but FFmpeg-based processing adds overhead at every stage. For certain formats, specialized libraries or direct byte manipulation can be significantly faster.

For example, WAV format stores raw audio samples without compression. Instead of going through the full demux-decode-filter-convert process, it’s more efficient to simply reinterpret the incoming bytes directly as an array.

In SPDL: spdl.io.load_wav() is optimized for this use case and bypasses the full FFmpeg-based process for better performance when working with WAV files.
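
To show why WAV allows this shortcut, the sketch below builds a small 16-bit mono WAV file in memory and reads the samples back by reinterpreting the payload bytes, with no decoder involved. It uses only the Python standard library and assumes the canonical 44-byte header layout. It is an illustration of the idea, not a description of how spdl.io.load_wav() is implemented.

```python
import array
import io
import struct
import sys
import wave

# Build a tiny 16-bit mono WAV file in memory.
samples = [0, 1000, -1000, 32767, -32768]
wav_io = io.BytesIO()
with wave.open(wav_io, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(8000)
    w.writeframes(struct.pack(f"<{len(samples)}h", *samples))
wav_bytes = wav_io.getvalue()

# Locate the "data" chunk. This assumes the canonical 44-byte header;
# real files may carry extra chunks, so a robust parser must walk them.
assert wav_bytes[36:40] == b"data"
(data_size,) = struct.unpack_from("<I", wav_bytes, 40)
payload = wav_bytes[44:44 + data_size]

# Reinterpret the payload as signed 16-bit integers (WAV is little-endian).
pcm = array.array("h")
pcm.frombytes(payload)
if sys.byteorder == "big":
    pcm.byteswap()
print(pcm.tolist())  # [0, 1000, -1000, 32767, -32768]
```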

When choosing between spdl.io.load_audio() and spdl.io.load_wav():

  • Use spdl.io.load_wav() for WAV files when you need maximum performance and don’t require complex preprocessing

  • Use spdl.io.load_audio() when you need:

    • Support for multiple formats (MP3, FLAC, AAC, etc.)

    • Complex filtering and preprocessing

    • Timestamp-based seeking

    • Consistent API across different formats