Decoding Process Overview¶
This section explains how media data is processed internally in SPDL. Understanding this multi-stage process will help you customize decoding effectively.
The Decoding Process¶
Media decoding in SPDL follows a multi-stage process based on FFmpeg. The decoding process consists of the following stages:
Demuxing: Extract packets from the source
Decoding: Decode packets into frames
Filtering: Apply transformations to frames (optional but commonly used)
Buffer Conversion: Merge frames into a contiguous array
The following diagram illustrates this process:
Low-Level Functions¶
While the high-level loading functions (spdl.io.load_audio(), spdl.io.load_video(), spdl.io.load_image()) provide a simple interface for common use cases, SPDL also exposes low-level functions that give you fine-grained control over each stage of the decoding process.
These low-level functions are:
Demuxing functions: Extract packets from the source
Decoding function: Decode packets into frames
Buffer conversion function: Convert frames into a contiguous buffer
Transfer function: Transfer buffer to GPU (optional)
The relationship between high-level and low-level functions can be expressed as:
# load_audio is equivalent to:
packets = spdl.io.demux_audio(src, timestamp=timestamp, demux_config=demux_config)
frames = spdl.io.decode_packets(packets, decode_config=decode_config, filter_desc=filter_desc)
buffer = spdl.io.convert_frames(frames)
# Optionally:
buffer = spdl.io.transfer_buffer(buffer, device_config=device_config)
# load_video is equivalent to:
packets = spdl.io.demux_video(src, timestamp=timestamp, demux_config=demux_config)
frames = spdl.io.decode_packets(packets, decode_config=decode_config, filter_desc=filter_desc)
buffer = spdl.io.convert_frames(frames)
# Optionally:
buffer = spdl.io.transfer_buffer(buffer, device_config=device_config)
# load_image is equivalent to:
packets = spdl.io.demux_image(src, demux_config=demux_config)
frames = spdl.io.decode_packets(packets, decode_config=decode_config, filter_desc=filter_desc)
buffer = spdl.io.convert_frames(frames)
# Optionally:
buffer = spdl.io.transfer_buffer(buffer, device_config=device_config)
Using low-level functions is useful when you need to:
Inspect or modify packets/frames between steps
Apply custom processing logic
Integrate with other libraries or custom decoders
Debug decoding issues
Reuse demuxed packets with different decoding parameters
Stage 1: Demuxing¶
Demuxing (short for demultiplexing) is the process of splitting input data into smaller chunks called packets.
Media files typically contain multiple streams (e.g., one audio stream and one video stream) interleaved together. Demuxing identifies the boundaries between these packets and extracts them one by one.
In SPDL:
spdl.io.demux_audio()- Demux audio packetsspdl.io.demux_video()- Demux video packetsspdl.io.demux_image()- Demux image packets
Example:
import spdl.io
# Demux video packets from a file
packets = spdl.io.demux_video("video.mp4")
# Demux audio packets for a specific time window
packets = spdl.io.demux_audio("audio.mp3", timestamp=(5.0, 10.0))
Stage 2: Decoding¶
Decoding is the process of decompressing packets to recover the original media data. Media files are typically encoded (compressed) to reduce file size. The decoder reverses this process to produce frames.
Frames contain the actual media samples:
For audio: waveform samples
For video/image: pixel data for each frame
In SPDL:
spdl.io.decode_packets()- Decode packets into frames
Example:
import spdl.io
# Demux and decode
packets = spdl.io.demux_video("video.mp4")
frames = spdl.io.decode_packets(packets)
Stage 3: Filtering¶
Filtering is a versatile stage that can apply various transformations to frames. This is where format conversion, resizing, cropping, and other preprocessing operations occur.
FFmpeg provides a rich set of filters through its filter graph system.
In SPDL, filtering is controlled by the filter_desc parameter.
Common filtering operations:
Format conversion: Convert pixel format (e.g., YUV to RGB) or audio sample format
Resizing: Scale video/image to different dimensions
Cropping: Extract a region of interest
Frame rate adjustment: Change video frame rate
Trimming: Remove frames outside a specified time window
Augmentation: Apply random transformations for data augmentation
In SPDL:
The filter_desc parameter in spdl.io.decode_packets() controls filtering.
Helper functions generate filter descriptions:
Example:
import spdl.io
packets = spdl.io.demux_video("video.mp4")
# Create a filter description
filter_desc = spdl.io.get_video_filter_desc(
scale_width=256,
scale_height=256,
pix_fmt="rgb24"
)
# Decode with filtering
frames = spdl.io.decode_packets(packets, filter_desc=filter_desc)
See Filter Graphs for detailed information about filter customization.
Stage 4: Buffer Conversion¶
Buffer conversion is the final stage where multiple frames are merged into a single contiguous memory region. This creates an array-like buffer that can be easily converted to NumPy arrays, PyTorch tensors, or other array types.
In SPDL:
spdl.io.convert_frames()- Convert frames to a buffer
Example:
import spdl.io
packets = spdl.io.demux_video("video.mp4")
frames = spdl.io.decode_packets(packets)
buffer = spdl.io.convert_frames(frames)
# Convert to PyTorch tensor
tensor = spdl.io.to_torch(buffer)
Optional: Bitstream Filtering¶
Bitstream filtering is an optional stage that modifies packets before decoding. This is less commonly used but necessary for certain scenarios.
Common use cases:
Converting H.264/HEVC packets to Annex B format for hardware-accelerated decoding
Extracting specific data from packets
Modifying packet metadata
In SPDL:
spdl.io.BSF- Bitstream filter classspdl.io.apply_bsf()- Apply bitstream filtering to packets
Example:
import spdl.io
# For hardware-accelerated video decoding, H.264 packets need conversion
packets = spdl.io.demux_video("video.mp4")
packets = spdl.io.apply_bsf(packets, "h264_mp4toannexb")
# Now decode with hardware decoder
frames = spdl.io.decode_packets_nvdec(
packets,
device_config=spdl.io.cuda_config(device_index=0)
)
Complete Example¶
Here’s a complete example showing all stages:
import spdl.io
# Source file
src = "video.mp4"
# Stage 1: Demuxing
packets = spdl.io.demux_video(src, timestamp=(0.0, 5.0))
print(f"Demuxed {len(packets)} packets")
# Stage 2 & 3: Decoding with Filtering
filter_desc = spdl.io.get_video_filter_desc(
scale_width=224,
scale_height=224,
pix_fmt="rgb24",
num_frames=30
)
frames = spdl.io.decode_packets(packets, filter_desc=filter_desc)
print(f"Decoded {len(frames)} frames")
# Stage 4: Buffer Conversion
buffer = spdl.io.convert_frames(frames)
print(f"Buffer shape: {buffer.shape}")
# Convert to array
tensor = spdl.io.to_torch(buffer)
print(f"Tensor shape: {tensor.shape}") # (30, 224, 224, 3)
Note
The high-level spdl.io.load_video() function performs all four stages (demuxing, decoding,
filtering, and buffer conversion) automatically. The equivalent call would be:
import spdl.io
buffer = spdl.io.load_video(
"video.mp4",
timestamp=(0.0, 5.0),
filter_desc=spdl.io.get_video_filter_desc(
scale_width=224,
scale_height=224,
pix_fmt="rgb24",
num_frames=30
)
)
tensor = spdl.io.to_torch(buffer)
This performs stages 1-4 internally and returns the final buffer ready for conversion.
Performance Considerations¶
While this multi-stage design provides great flexibility and features, it comes with overhead due to the complexity of FFmpeg-based processing. For certain formats, specialized libraries or direct byte manipulation can be significantly faster.
For example, WAV format stores raw audio samples without compression. Instead of going through the full demux-decode-filter-convert process, it’s more efficient to simply reinterpret the incoming bytes directly as an array.
In SPDL: spdl.io.load_wav() is optimized for this use case and bypasses the full FFmpeg-based process for better performance when working with WAV files.
When choosing between spdl.io.load_audio() and spdl.io.load_wav():
Use
spdl.io.load_wav()for WAV files when you need maximum performance and don’t require complex preprocessingUse
spdl.io.load_audio()when you need:Support for multiple formats (MP3, FLAC, AAC, etc.)
Complex filtering and preprocessing
Timestamp-based seeking
Consistent API across different formats