Decoding Process Overview

This section explains how media data is processed internally in SPDL. Understanding this multi-stage process will help you customize decoding effectively.

The Decoding Process

Media decoding in SPDL follows a multi-stage process based on FFmpeg. The decoding process consists of the following stages:

1. Demuxing: Extract packets from the source
2. Decoding: Decode packets into frames
3. Filtering: Apply transformations to frames (optional but commonly used)
4. Buffer Conversion: Merge frames into a contiguous array

The following diagram illustrates this process:

Low-Level Functions

While the high-level loading functions (spdl.io.load_audio(), spdl.io.load_video(), spdl.io.load_image()) provide a simple interface for common use cases, SPDL also exposes low-level functions that give you fine-grained control over each stage of the decoding process.

These low-level functions are:

Demuxing functions: Extract packets from the source
- spdl.io.demux_audio()
- spdl.io.demux_video()
- spdl.io.demux_image()
Decoding function: Decode packets into frames
- spdl.io.decode_packets()
Buffer conversion function: Convert frames into a contiguous buffer
- spdl.io.convert_frames()
Transfer function: Transfer buffer to GPU (optional)
- spdl.io.transfer_buffer()

The relationship between high-level and low-level functions can be expressed as:

# load_audio is equivalent to:
packets = spdl.io.demux_audio(src, timestamp=timestamp, demux_config=demux_config)
frames = spdl.io.decode_packets(packets, decode_config=decode_config, filter_desc=filter_desc)
buffer = spdl.io.convert_frames(frames)
# Optionally:
buffer = spdl.io.transfer_buffer(buffer, device_config=device_config)

# load_video is equivalent to:
packets = spdl.io.demux_video(src, timestamp=timestamp, demux_config=demux_config)
frames = spdl.io.decode_packets(packets, decode_config=decode_config, filter_desc=filter_desc)
buffer = spdl.io.convert_frames(frames)
# Optionally:
buffer = spdl.io.transfer_buffer(buffer, device_config=device_config)

# load_image is equivalent to:
packets = spdl.io.demux_image(src, demux_config=demux_config)
frames = spdl.io.decode_packets(packets, decode_config=decode_config, filter_desc=filter_desc)
buffer = spdl.io.convert_frames(frames)
# Optionally:
buffer = spdl.io.transfer_buffer(buffer, device_config=device_config)

Using low-level functions is useful when you need to:

Inspect or modify packets/frames between steps
Apply custom processing logic
Integrate with other libraries or custom decoders
Debug decoding issues
Reuse demuxed packets with different decoding parameters

Stage 1: Demuxing

Demuxing (short for demultiplexing) is the process of splitting input data into smaller chunks called packets.

Media files typically contain multiple streams (e.g., one audio stream and one video stream) whose packets are interleaved. Demuxing identifies the boundaries between these packets and extracts them one by one.

block-beta
  columns 1
  b["0101010101100101...................................."]
  space
  block:demuxed
    p0[["Header"]]
    p1(["Audio 0"])
    p2["Video 0"]
    p3["Video 1"]
    p4["Video 2"]
    p5["Video 3"]
    p6(["Audio 1"])
    p7["Video 4"]
    p8["Video 5"]
  end
  b-- "demuxing" -->demuxed
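
To make the idea concrete, here is a minimal sketch of demuxing over a made-up container layout. The byte layout (1-byte stream ID, 2-byte big-endian payload length, payload) and the names `mux` and `demux` are assumptions for illustration only. Real containers such as MP4 are far more involved, and this is not how SPDL or FFmpeg implement demuxing.

```python
import struct
from collections import defaultdict

# Hypothetical toy container: each packet is
#   1 byte stream ID | 2 bytes big-endian payload length | payload
HEADER = struct.Struct(">BH")


def mux(packets):
    """Interleave (stream_id, payload) pairs into one byte string."""
    return b"".join(HEADER.pack(sid, len(data)) + data for sid, data in packets)


def demux(blob):
    """Split the byte string back into per-stream lists of payloads."""
    streams = defaultdict(list)
    offset = 0
    while offset < len(blob):
        sid, size = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        streams[sid].append(blob[offset:offset + size])
        offset += size
    return dict(streams)


AUDIO, VIDEO = 0, 1
blob = mux([(AUDIO, b"a0"), (VIDEO, b"v0"), (VIDEO, b"v1"), (AUDIO, b"a1")])
streams = demux(blob)
print(streams[VIDEO])  # the video packets, in their original order
```

The demuxer never looks inside a payload. It only reads the headers to find where each packet starts and ends, which is why demuxing is cheap compared to decoding.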

In SPDL:

spdl.io.demux_audio() - Demux audio packets
spdl.io.demux_video() - Demux video packets
spdl.io.demux_image() - Demux image packets

Example:

import spdl.io

# Demux video packets from a file
packets = spdl.io.demux_video("video.mp4")

# Demux audio packets for a specific time window
packets = spdl.io.demux_audio("audio.mp3", timestamp=(5.0, 10.0))

Stage 2: Decoding

Decoding is the process of decompressing packets to recover the original media data. Media files are typically encoded (compressed) to reduce file size. The decoder reverses this process to produce frames.

Frames contain the actual media samples:

For audio: waveform samples
For video/image: pixel data for each frame
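
As a loose analogy for the packet-to-frame relationship, the sketch below uses zlib from the Python standard library. This is an assumption made purely for illustration: zlib is a lossless general-purpose compressor, not a media codec, and real codecs such as H.264 are typically lossy and rely on neighboring frames.

```python
import zlib

# Stand-in for one raw video frame: 64x64 pixels, 3 bytes per pixel, all black.
raw_frame = bytes(64 * 64 * 3)

# "Encoding" produces a compact packet; "decoding" recovers the frame.
packet = zlib.compress(raw_frame)
frame = zlib.decompress(packet)

print(len(packet), len(frame))
```

The packet is much smaller than the frame it expands into, which is why decoded frames dominate memory use in a loading pipeline.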

In SPDL:

spdl.io.decode_packets() - Decode packets into frames

Example:

import spdl.io

# Demux and decode
packets = spdl.io.demux_video("video.mp4")
frames = spdl.io.decode_packets(packets)

Stage 3: Filtering

Filtering applies transformations to decoded frames. Format conversion, resizing, cropping, and other preprocessing operations happen in this stage.

FFmpeg provides a rich set of filters through its filter graph system. In SPDL, filtering is controlled by the filter_desc parameter.

Common filtering operations:

Format conversion: Convert pixel format (e.g., YUV to RGB) or audio sample format
Resizing: Scale video/image to different dimensions
Cropping: Extract a region of interest
Frame rate adjustment: Change video frame rate
Trimming: Remove frames outside a specified time window
Augmentation: Apply random transformations for data augmentation

In SPDL:

The filter_desc parameter in spdl.io.decode_packets() controls filtering. Helper functions such as spdl.io.get_video_filter_desc() generate filter descriptions.

Example:

import spdl.io

packets = spdl.io.demux_video("video.mp4")

# Create a filter description
filter_desc = spdl.io.get_video_filter_desc(
    scale_width=256,
    scale_height=256,
    pix_fmt="rgb24"
)

# Decode with filtering
frames = spdl.io.decode_packets(packets, filter_desc=filter_desc)

See Filter Graphs for detailed information about filter customization.
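
For intuition about what a filter description looks like: FFmpeg filter graphs are written as comma-separated filters, each with colon-separated options. The sketch below assembles such a string by hand. `build_filter_desc` is a hypothetical helper, not part of SPDL, and the exact string that spdl.io.get_video_filter_desc() produces may differ.

```python
def build_filter_desc(width=None, height=None, pix_fmt=None, fps=None):
    """Assemble an FFmpeg filter-graph string (hypothetical helper, not SPDL API)."""
    filters = []
    if fps is not None:
        filters.append(f"fps={fps}")
    if width is not None or height is not None:
        # -1 asks FFmpeg's scale filter to preserve the aspect ratio.
        w = -1 if width is None else width
        h = -1 if height is None else height
        filters.append(f"scale=w={w}:h={h}")
    if pix_fmt is not None:
        filters.append(f"format=pix_fmts={pix_fmt}")
    # "null" is FFmpeg's pass-through video filter.
    return ",".join(filters) or "null"


desc = build_filter_desc(width=256, height=256, pix_fmt="rgb24")
print(desc)  # scale=w=256:h=256,format=pix_fmts=rgb24
```

Filters run left to right, so placing the format conversion last means scaling happens in the source pixel format.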

Stage 4: Buffer Conversion

Buffer conversion is the final stage where multiple frames are merged into a single contiguous memory region. This creates an array-like buffer that can be easily converted to NumPy arrays, PyTorch tensors, or other array types.

In SPDL:

spdl.io.convert_frames() - Convert frames to a buffer

Example:

import spdl.io

packets = spdl.io.demux_video("video.mp4")
frames = spdl.io.decode_packets(packets)
buffer = spdl.io.convert_frames(frames)

# Convert to PyTorch tensor
tensor = spdl.io.to_torch(buffer)
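
What "merging frames into a contiguous memory region" means can be sketched with the standard library alone. The frame sizes and variable names below are assumptions for illustration, and this is not how spdl.io.convert_frames() is implemented.

```python
# Three stand-in frames, each 2x2 pixels with 3 bytes per pixel (RGB).
height, width, channels = 2, 2, 3
frames = [bytes([i]) * (height * width * channels) for i in range(3)]

# Copy every frame into a single contiguous block of memory...
merged = bytearray()
for frame in frames:
    merged += frame

# ...and view that block as an array of shape (frames, height, width, channels).
batch = memoryview(merged).cast("B", shape=(len(frames), height, width, channels))

print(batch.shape, batch.c_contiguous)
print(batch[2, 1, 1, 0])  # first channel of the last pixel of the last frame
```

Because all frames live in one block, array libraries can wrap the memory without copying each frame again.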

Optional: Bitstream Filtering

Bitstream filtering is an optional stage that modifies packets before decoding. It is used less often than the other stages, but some scenarios require it.

Common use cases:

Converting H.264/HEVC packets to Annex B format for hardware-accelerated decoding
Extracting specific data from packets
Modifying packet metadata

In SPDL:

spdl.io.BSF - Bitstream filter class
spdl.io.apply_bsf() - Apply bitstream filtering to packets

Example:

import spdl.io

# For hardware-accelerated video decoding, H.264 packets need conversion
packets = spdl.io.demux_video("video.mp4")
packets = spdl.io.apply_bsf(packets, "h264_mp4toannexb")

# Now decode with hardware decoder
frames = spdl.io.decode_packets_nvdec(
    packets,
    device_config=spdl.io.cuda_config(device_index=0)
)
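
As background on what the Annex B conversion involves: MP4 files store each H.264 NAL unit behind a length prefix, while Annex B streams separate NAL units with start codes. The sketch below shows only that core rewrite, assuming a 4-byte big-endian length prefix. `to_annexb` is a hypothetical name, and the real h264_mp4toannexb filter does more, such as inserting parameter sets taken from the stream's extradata.

```python
START_CODE = b"\x00\x00\x00\x01"


def to_annexb(payload, length_size=4):
    """Replace each big-endian length prefix with an Annex B start code."""
    out = bytearray()
    offset = 0
    while offset < len(payload):
        prefix = payload[offset:offset + length_size]
        if len(prefix) < length_size:
            raise ValueError("truncated length prefix")
        nal_size = int.from_bytes(prefix, "big")
        offset += length_size
        nal = payload[offset:offset + nal_size]
        if len(nal) < nal_size:
            raise ValueError("truncated NAL unit")
        out += START_CODE + nal
        offset += nal_size
    return bytes(out)


# Made-up NAL unit payloads, length-prefixed the way an MP4 packet stores them.
nal_units = [b"\x67\x42", b"\x68\xce", b"\x65\x88\x84"]
payload = b"".join(len(n).to_bytes(4, "big") + n for n in nal_units)
annexb = to_annexb(payload)
print(annexb.hex(" "))
```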

Complete Example

Here’s a complete example showing all stages:

import spdl.io

# Source file
src = "video.mp4"

# Stage 1: Demuxing
packets = spdl.io.demux_video(src, timestamp=(0.0, 5.0))
print(f"Demuxed {len(packets)} packets")

# Stages 2 & 3: Decoding with filtering
filter_desc = spdl.io.get_video_filter_desc(
    scale_width=224,
    scale_height=224,
    pix_fmt="rgb24",
    num_frames=30
)
frames = spdl.io.decode_packets(packets, filter_desc=filter_desc)
print(f"Decoded {len(frames)} frames")

# Stage 4: Buffer Conversion
buffer = spdl.io.convert_frames(frames)
print(f"Buffer shape: {buffer.shape}")

# Convert to a PyTorch tensor
tensor = spdl.io.to_torch(buffer)
print(f"Tensor shape: {tensor.shape}")  # (30, 224, 224, 3)

Note

The high-level spdl.io.load_video() function performs all four stages (demuxing, decoding, filtering, and buffer conversion) automatically. The equivalent call would be:

import spdl.io

buffer = spdl.io.load_video(
    "video.mp4",
    timestamp=(0.0, 5.0),
    filter_desc=spdl.io.get_video_filter_desc(
        scale_width=224,
        scale_height=224,
        pix_fmt="rgb24",
        num_frames=30
    )
)
tensor = spdl.io.to_torch(buffer)

The returned buffer is ready for conversion to a tensor or array.

Decoding to Different Color Formats

The spdl.io module supports decoding images and videos into various color formats beyond RGB, including YUV420p, NV12, and other pixel formats supported by FFmpeg.

Example: Decode to YUV420p:

import spdl.io

# Decode video in YUV420p format
buffer = spdl.io.load_video(
    "video.mp4",
    filter_desc=spdl.io.get_video_filter_desc(
        scale_width=224,
        scale_height=224,
        pix_fmt="yuv420p"  # Keep in YUV420p format
    )
)
tensor = spdl.io.to_torch(buffer)
# tensor.shape: (num_frames, 1, height * 3 // 2, width)
# YUV420p stores the Y plane at full resolution and the U/V planes at half resolution in each dimension

Example: Decode image to NV12:

import spdl.io

# Decode image in NV12 format
buffer = spdl.io.load_image(
    "image.jpg",
    filter_desc=spdl.io.get_video_filter_desc(
        scale_width=224,
        scale_height=224,
        pix_fmt="nv12"  # NV12 format (Y plane + interleaved UV plane)
    )
)
array = spdl.io.to_numpy(buffer)
# array.shape: (1, height * 3 // 2, width)
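
The `height * 3 // 2` in the shapes above follows from 4:2:0 chroma subsampling: the Y plane has one sample per pixel, while U and V each have one sample per 2x2 block of pixels. The sketch below works through the arithmetic. `yuv420_buffer_shape` is a hypothetical helper, not part of SPDL, and it assumes even dimensions.

```python
def yuv420_buffer_shape(height, width):
    """Rows and columns of a single-channel buffer holding one 4:2:0 frame."""
    if height % 2 or width % 2:
        raise ValueError("this sketch assumes even dimensions")
    luma = height * width                      # Y: one sample per pixel
    chroma = 2 * (height // 2) * (width // 2)  # U and V: one sample per 2x2 block
    rows = (luma + chroma) // width
    return rows, width


print(yuv420_buffer_shape(224, 224))  # (336, 224)
```

The same count applies to both yuv420p and nv12, since the two formats differ only in how the chroma samples are arranged (separate planes versus interleaved).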

Common pixel formats:

"rgb24" - RGB with 8 bits per channel (default for images/videos)
"bgr24" - BGR with 8 bits per channel
"yuv420p" - YUV 4:2:0 planar format
"nv12" - YUV 4:2:0 with interleaved UV plane
"gray" - Grayscale (single channel)

For a complete list of supported pixel formats, see the FFmpeg Pixel Formats documentation.

Performance Considerations

While this multi-stage design is flexible, FFmpeg-based processing adds overhead at every stage. For certain formats, specialized libraries or direct byte manipulation can be significantly faster.

For example, WAV format stores raw audio samples without compression. Instead of going through the full demux-decode-filter-convert process, it’s more efficient to simply reinterpret the incoming bytes directly as an array.

In SPDL: spdl.io.load_wav() is optimized for this use case and bypasses the full FFmpeg-based process for better performance when working with WAV files.
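
The idea of reinterpreting bytes can be sketched with the standard library alone. The example below writes a tiny 16-bit mono WAV file into memory and reads it back without any decoder. It is an illustration under those assumptions (sample values and sample rate are made up), not the implementation of spdl.io.load_wav().

```python
import array
import io
import wave

expected = [0, 1000, -1000, 32767, -32768]

# Build a tiny mono, 16-bit PCM WAV file in memory to have something to read.
wav_bytes = io.BytesIO()
with wave.open(wav_bytes, "wb") as writer:
    writer.setnchannels(1)
    writer.setsampwidth(2)      # 2 bytes per sample = 16-bit
    writer.setframerate(16000)
    writer.writeframes(array.array("h", expected).tobytes())

# Reading needs no decoder: locate the PCM payload and reinterpret its bytes.
wav_bytes.seek(0)
with wave.open(wav_bytes, "rb") as reader:
    raw = reader.readframes(reader.getnframes())

samples = memoryview(raw).cast("h")  # zero-copy view as signed 16-bit integers
print(samples.tolist())
```

The only work done on the read side is parsing the header and creating a typed view over the payload, which is why this path can be so much cheaper than the full pipeline.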

When choosing between spdl.io.load_audio() and spdl.io.load_wav():

Use spdl.io.load_wav() for WAV files when you need maximum performance and don’t require complex preprocessing
Use spdl.io.load_audio() when you need:
- Support for multiple formats (MP3, FLAC, AAC, etc.)
- Complex filtering and preprocessing
- Timestamp-based seeking
- Consistent API across different formats