Loading NumPy Arrays

NumPy serialization formats (NPY, NPZ) are often used in data collection. The numpy.save(), numpy.savez(), and numpy.savez_compressed() functions are versatile: they can serialize almost any data, which is why they are a popular choice. However, loading such data with numpy.load() is slow, and the loading path is difficult to optimize.

SPDL provides optimized functions for loading NumPy arrays from byte strings (data already in memory).

See also

Data Format and Performance

Case study comparing different data serialization formats for performance and efficiency.

Overview

NumPy serializes arrays in two formats:

  • NPY format (.npy files): Single array per file

  • NPZ format (.npz files): Multiple arrays in a ZIP archive
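The two layouts can be told apart from the bytes alone. The following is a minimal sketch that uses only NumPy and the standard library, not SPDL; the helper name `describe_payload` is hypothetical. It relies on two properties of the formats: an NPY payload begins with the magic string `\x93NUMPY`, and an NPZ payload is a ZIP archive whose members are NPY files.

```python
import io
import zipfile

import numpy as np

NPY_MAGIC = b"\x93NUMPY"


def describe_payload(data: bytes) -> str:
    """Classify a byte string as NPY, NPZ, or unknown (illustrative helper)."""
    if data[:6] == NPY_MAGIC:
        return "npy"
    if zipfile.is_zipfile(io.BytesIO(data)):
        # An NPZ archive stores each array as a "<name>.npy" member.
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return "npz: " + ", ".join(archive.namelist())
    return "unknown"


# Build one payload of each kind in memory.
npy_buffer = io.BytesIO()
np.save(npy_buffer, np.arange(3))

npz_buffer = io.BytesIO()
np.savez(npz_buffer, x=np.arange(3), y=np.ones(2))

npy_kind = describe_payload(npy_buffer.getvalue())
npz_kind = describe_payload(npz_buffer.getvalue())
```

A check like this can be useful when a download endpoint serves both kinds of payload and the caller has to pick a loader.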

SPDL provides functions optimized for loading arrays from byte strings that are already in memory. These are designed for data downloaded from remote storage, network APIs, or other sources where the binary data is already loaded into RAM.

Important

These functions work exclusively with byte strings (bytes or bytearray). They do not accept file paths or file-like objects (e.g., BytesIO). They are optimized for scenarios where data is already loaded into memory.

Note

To load NumPy arrays with SPDL IO functions, the data must not contain pickled objects. numpy.save(), numpy.savez(), and numpy.savez_compressed() store numeric arrays without pickling, so numeric arrays saved with default settings can be loaded; passing allow_pickle=False to numpy.save() makes the requirement explicit.
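As an illustration of the pickling requirement, the sketch below saves with `allow_pickle=False` on the NumPy side. It uses only NumPy; the helper name `save_without_pickle` is hypothetical. Numeric arrays serialize normally, while object arrays are rejected at save time with a `ValueError`.

```python
import io

import numpy as np


def save_without_pickle(array: np.ndarray) -> bytes:
    """Serialize to NPY bytes, refusing anything that would need pickle."""
    buffer = io.BytesIO()
    np.save(buffer, array, allow_pickle=False)
    return buffer.getvalue()


# Numeric arrays never need pickle, so this succeeds.
numeric_bytes = save_without_pickle(np.arange(4, dtype=np.int32))

# Object arrays can only be stored via pickle, so NumPy raises ValueError.
try:
    save_without_pickle(np.array([{"key": "value"}], dtype=object))
    object_rejected = False
except ValueError:
    object_rejected = True
```

Enforcing this at write time means an unsupported payload is caught when the data is produced rather than when it is loaded.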

Key benefits:

  • Works with byte strings: Accepts bytes or bytearray objects directly

  • Fast array creation: Creates NumPy arrays from in-memory data without computation

  • Zero-copy loading: No intermediate copies for supported formats

  • Memory efficiency: Direct memory mapping when possible

  • Optimized for in-memory data: No file I/O overhead, works directly with downloaded data

Performance characteristics:

  • Because the GIL is not released, throughput does not scale with the number of threads

  • However, these functions are faster than standard NumPy functions when working with in-memory data because they do not perform any computation and work directly with byte data

Loading NPY Files

spdl.io.load_npy() loads a single NumPy array from a byte string in NPY format. The input must be bytes or bytearray, not a file path or file-like object.

Basic Usage

import spdl.io
import numpy as np
from io import BytesIO

# Create and save an array
original = np.arange(100).reshape(10, 10)
buffer = BytesIO()
np.save(buffer, original)

# Load using SPDL
data = buffer.getvalue()
restored = spdl.io.load_npy(data)

assert np.array_equal(restored, original)

Comparison with numpy.load

SPDL’s implementation is more efficient than using numpy.load() with io.BytesIO when working with data already in memory:

import spdl.io
import numpy as np
from io import BytesIO

# Serialize an array to NPY bytes
buffer = BytesIO()
np.save(buffer, np.arange(100).reshape(10, 10))
data = buffer.getvalue()  # bytes object

# NumPy's approach (slower for in-memory data)
restored_numpy = np.load(BytesIO(data))  # Wraps bytes in file-like object

# SPDL's approach (faster for in-memory data)
restored_spdl = spdl.io.load_npy(data)  # Works directly with bytes

Why SPDL is faster for in-memory data:

  1. No intermediate BytesIO wrapper: SPDL works directly with byte strings

  2. Zero-copy when possible: Avoids unnecessary memory allocation

  3. No computation: Creates array objects from in-memory data without processing

  4. Designed for bytes in memory: Optimized for data already downloaded/loaded
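To make the "no computation" point concrete, here is a minimal sketch of how an NPY payload can be turned into an array without copying the data. It is not SPDL's implementation: the function name `parse_npy` is hypothetical, it handles only simple dtypes in C order, and it relies on `numpy.frombuffer` to create a view over the input buffer.

```python
import ast
import io
import math
import struct

import numpy as np


def parse_npy(data) -> np.ndarray:
    """Build an array view over NPY bytes (illustrative, simple dtypes only)."""
    if bytes(data[:6]) != b"\x93NUMPY":
        raise ValueError("not an NPY payload")
    major = data[6]
    if major == 1:
        # Version 1.x: 2-byte little-endian header length at offset 8.
        (header_len,) = struct.unpack_from("<H", data, 8)
        offset = 10
    else:
        # Version 2.x and 3.x: 4-byte little-endian header length.
        (header_len,) = struct.unpack_from("<I", data, 8)
        offset = 12
    header_text = bytes(data[offset:offset + header_len]).decode("latin1")
    header = ast.literal_eval(header_text)  # a Python dict literal
    if header["fortran_order"]:
        raise ValueError("Fortran order is not handled by this sketch")
    shape = header["shape"]
    # frombuffer creates a view; no array data is copied.
    flat = np.frombuffer(
        data,
        dtype=np.dtype(header["descr"]),
        count=math.prod(shape),
        offset=offset + header_len,
    )
    return flat.reshape(shape)


original = np.arange(12, dtype=np.int32).reshape(3, 4)
buffer = io.BytesIO()
np.save(buffer, original)
payload = bytearray(buffer.getvalue())
parsed = parse_npy(payload)
```

The only work done is reading a short header; the array data itself is referenced in place, which is why this style of loading avoids both the file-like wrapper and an extra allocation.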

Note: If you need to load from a file path, use numpy.load() directly. SPDL functions are designed for byte strings, not file paths.

Zero-Copy Loading

By default, spdl.io.load_npy() returns a view into the original byte data without copying:

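import spdl.io
import numpy as np
from io import BytesIO

# Serialize a 1-D array to NPY bytes
buffer = BytesIO()
np.save(buffer, np.arange(10))
npy_bytes = buffer.getvalue()

# Load without copying (default)
data = bytearray(npy_bytes)
array = spdl.io.load_npy(data)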

# Modifying the array affects the original data
array[0] = 999
# The underlying byte data is also modified

# Force a copy if needed
array = spdl.io.load_npy(data, copy=True)
array[0] = 999
# Now the original byte data is unchanged

Warning

When using zero-copy mode (copy=False), the returned array shares memory with the input data. Ensure the input data remains valid for the lifetime of the array.
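The sharing semantics described in this warning can be observed with plain NumPy. The sketch below uses `numpy.frombuffer` as a stand-in for a zero-copy loader (an assumption made for illustration; it does not call SPDL) and shows that writes pass through to the buffer, that the buffer cannot be resized while a view exists, and that an explicit copy is independent.

```python
import numpy as np

# A mutable buffer standing in for downloaded data.
raw = bytearray(np.arange(4, dtype=np.uint8).tobytes())

# Zero-copy: the array is a view over `raw`.
view = np.frombuffer(raw, dtype=np.uint8)
view[0] = 99  # writes through to `raw`

# While the view is alive, Python refuses to resize the bytearray.
try:
    raw.append(0)
    resize_blocked = False
except BufferError:
    resize_blocked = True

# An explicit copy owns its memory and is independent of `raw`.
independent = view.copy()
independent[1] = 77
```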

Supported Data Types

spdl.io.load_npy() supports the following NumPy dtypes:

  • Integer types: uint8, int16, uint16, int32, uint32, int64, uint64

  • Floating point: float16, float32, float64

  • Boolean: bool

Limitations:

  • No object dtype support: Arrays with dtype=object are not supported

  • No Fortran order: Only C-contiguous arrays are supported

import spdl.io
import numpy as np

# Supported: numeric types
int_array = np.array([1, 2, 3], dtype=np.int32)
float_array = np.array([1.0, 2.0, 3.0], dtype=np.float64)

# Not supported: object dtype
# obj_array = np.array([{"key": "value"}], dtype=object)  # Will fail
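Because Fortran-order arrays are not supported, it may help to normalize the layout before saving. The following sketch uses only NumPy; the helper name `saved_fortran_flag` is hypothetical. It reads back the `fortran_order` flag that NumPy writes into the NPY header and shows that `numpy.ascontiguousarray` clears it.

```python
import io

import numpy as np
from numpy.lib import format as npy_format


def saved_fortran_flag(array: np.ndarray) -> bool:
    """Return the fortran_order flag NumPy writes into the NPY header."""
    buffer = io.BytesIO()
    np.save(buffer, array, allow_pickle=False)
    buffer.seek(0)
    version = npy_format.read_magic(buffer)
    if version == (1, 0):
        _, fortran_order, _ = npy_format.read_array_header_1_0(buffer)
    else:
        _, fortran_order, _ = npy_format.read_array_header_2_0(buffer)
    return fortran_order


# A 2-D array in Fortran (column-major) layout.
fortran_array = np.asfortranarray(np.arange(6, dtype=np.float32).reshape(2, 3))
# The same values, converted to C (row-major) layout before saving.
c_array = np.ascontiguousarray(fortran_array)

flag_before = saved_fortran_flag(fortran_array)
flag_after = saved_fortran_flag(c_array)
```

Converting at write time costs one copy on the producer side and keeps the saved file within the supported subset.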

Loading NPZ Files

spdl.io.load_npz() loads multiple NumPy arrays from a byte string containing NPZ (ZIP archive) data. The input must be bytes or bytearray, not a file path or file-like object.

Basic Usage

import spdl.io
import numpy as np
from io import BytesIO

# Create and save multiple arrays
x = np.arange(10)
y = np.sin(x)
z = np.cos(x)

buffer = BytesIO()
np.savez(buffer, x=x, y=y, z=z)

# Load using SPDL
data = buffer.getvalue()
npz_file = spdl.io.load_npz(data)

# Access arrays by name
assert np.array_equal(npz_file["x"], x)
assert np.array_equal(npz_file["y"], y)
assert np.array_equal(npz_file["z"], z)

NpzFile Interface

spdl.io.load_npz() returns a spdl.io.NpzFile object that mimics numpy.lib.npyio.NpzFile.

The NpzFile class implements the collections.abc.Mapping interface:

import spdl.io
import numpy as np

npz_file = spdl.io.load_npz(data)

# Dictionary-like access
x = npz_file["x"]

# Check if key exists
if "x" in npz_file:
    print("x is in the archive")

# List all arrays
print(npz_file.files)  # ['x', 'y', 'z']

# Iterate over keys
for key in npz_file:
    print(f"{key}: {npz_file[key].shape}")

# Get number of arrays
print(len(npz_file))  # 3

Accessing Arrays

Arrays can be accessed with or without the .npy suffix:

import spdl.io
import numpy as np

npz_file = spdl.io.load_npz(data)

# Both work the same
x1 = npz_file["x"]
x2 = npz_file["x.npy"]

assert np.array_equal(x1, x2)
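For readers curious how a mapping with this suffix behavior can be put together, here is a minimal sketch built on `zipfile` and `collections.abc.Mapping`. The class name `SimpleNpz` is hypothetical, and it decodes members with `numpy.load`, so it illustrates the interface only and not how SPDL implements NpzFile.

```python
import io
import zipfile
from collections.abc import Mapping

import numpy as np


class SimpleNpz(Mapping):
    """Read-only mapping over NPZ bytes (illustrative, not SPDL's class)."""

    def __init__(self, npz_bytes: bytes) -> None:
        self._zip = zipfile.ZipFile(io.BytesIO(npz_bytes))
        # Expose names without the ".npy" suffix, like NumPy's NpzFile.files.
        self.files = [n.removesuffix(".npy") for n in self._zip.namelist()]

    def __getitem__(self, key: str) -> np.ndarray:
        name = key.removesuffix(".npy")  # accept "x" and "x.npy" alike
        if name not in self.files:
            raise KeyError(key)
        with self._zip.open(name + ".npy") as member:
            return np.load(io.BytesIO(member.read()), allow_pickle=False)

    def __iter__(self):
        return iter(self.files)

    def __len__(self) -> int:
        return len(self.files)


buffer = io.BytesIO()
np.savez(buffer, x=np.arange(3), y=np.ones(2))
archive = SimpleNpz(buffer.getvalue())
```

Deriving from `Mapping` provides `in`, `keys()`, `values()`, and `items()` on top of the three methods defined here.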

Compressed NPZ Files

spdl.io.load_npz() supports both uncompressed and DEFLATE-compressed archives:

import spdl.io
import numpy as np
from io import BytesIO

x = np.arange(1000)
y = np.random.random(1000)

# Compressed NPZ (savez_compressed)
buffer = BytesIO()
np.savez_compressed(buffer, x=x, y=y)

data = buffer.getvalue()
npz_file = spdl.io.load_npz(data)

assert np.array_equal(npz_file["x"], x)
assert np.array_equal(npz_file["y"], y)
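The difference between the two archive flavors is visible in the ZIP metadata. The sketch below uses only NumPy and the standard library; the helper name `member_compression` is hypothetical. It reports the compression method of each member: `numpy.savez` stores members uncompressed, while `numpy.savez_compressed` uses DEFLATE.

```python
import io
import zipfile

import numpy as np


def member_compression(npz_bytes: bytes) -> dict[str, int]:
    """Map each archive member to its ZIP compression method."""
    with zipfile.ZipFile(io.BytesIO(npz_bytes)) as archive:
        return {info.filename: info.compress_type for info in archive.infolist()}


x = np.zeros(1000)

plain = io.BytesIO()
np.savez(plain, x=x)

packed = io.BytesIO()
np.savez_compressed(packed, x=x)

plain_types = member_compression(plain.getvalue())
packed_types = member_compression(packed.getvalue())
```

Note that a DEFLATE-compressed member has to be decompressed into a new buffer before it can back an array, so a view into the original archive bytes is only conceivable for stored members.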

Positional Arguments

Arrays saved without names get automatic arr_0, arr_1 naming:

import spdl.io
import numpy as np
from io import BytesIO

x = np.arange(10)
y = np.sin(x)

# Save with positional arguments (no names)
buffer = BytesIO()
np.savez(buffer, x, y)

data = buffer.getvalue()
npz_file = spdl.io.load_npz(data)

# Access using auto-generated names
assert np.array_equal(npz_file["arr_0"], x)
assert np.array_equal(npz_file["arr_1"], y)

Use Cases

Loading from Remote Storage

These functions are specifically designed for loading data from remote storage or network APIs, where the data is first downloaded into memory as bytes:

import boto3
import numpy as np
import requests
import spdl.io

s3_client = boto3.client("s3")

def load_from_s3(bucket: str, key: str) -> np.ndarray:
    # Download bytes from S3 into memory
    data = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
    # `data` is now a bytes object in memory

    # Load efficiently with SPDL - works directly with the bytes
    return spdl.io.load_npy(data)

def load_from_http(url: str) -> np.ndarray:
    # Download from HTTP endpoint; fail early on HTTP errors
    response = requests.get(url)
    response.raise_for_status()
    data = response.content  # bytes object

    # Load directly from the downloaded bytes
    return spdl.io.load_npy(data)

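# Use in data pipeline (`data_keys` is a placeholder for your own object keys)
data_keys = ["example/array_0.npy", "example/array_1.npy"]
for key in data_keys: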
    array = load_from_s3("my-bucket", key)
    # Process array...

Why these functions are ideal for remote storage:

  • Data is already downloaded into memory as bytes

  • No need to write to disk and read back

  • Efficient conversion from bytes to NumPy arrays

  • Optimized for this specific use case

Performance Considerations

GIL Behavior

Neither spdl.io.load_npy() nor spdl.io.load_npz() releases the GIL. They create NumPy array objects from in-memory data without performing any computation. Because most of the time is spent creating Python objects, which requires holding the GIL, the GIL cannot be released for this work.

Implications:

  • No multi-threading scalability: Performance does not scale with multiple threads

  • Still faster than NumPy: Despite not releasing the GIL, these functions are faster than standard NumPy functions because they avoid computation and work directly with byte data

  • Best for single-threaded or I/O-bound pipelines: Use when data loading is not the bottleneck
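One way to act on these implications is to parallelize only the download step, which is I/O-bound, and decode in the consuming thread. The sketch below is an assumption-laden illustration: `FAKE_STORE`, `fetch`, `decode`, and `iter_arrays` are hypothetical names, the "remote store" is an in-memory dictionary so the example runs offline, and `decode` uses `numpy.load` as a stand-in that could be swapped for `spdl.io.load_npy` in a real pipeline.

```python
import io
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, Iterator

import numpy as np


def _to_npy_bytes(array: np.ndarray) -> bytes:
    buffer = io.BytesIO()
    np.save(buffer, array, allow_pickle=False)
    return buffer.getvalue()


# Hypothetical in-memory "remote store" so the sketch runs offline.
FAKE_STORE = {f"item_{i}": _to_npy_bytes(np.full(4, i)) for i in range(8)}


def fetch(key: str) -> bytes:
    """Stand-in for a network download (e.g. S3 or HTTP)."""
    return FAKE_STORE[key]


def decode(data: bytes) -> np.ndarray:
    """Stand-in decoder; swap in spdl.io.load_npy in a real pipeline."""
    return np.load(io.BytesIO(data), allow_pickle=False)


def iter_arrays(keys: Iterable[str], num_threads: int = 4) -> Iterator[np.ndarray]:
    # Downloads overlap in worker threads; decoding stays in the caller's
    # thread because it holds the GIL and would not scale across threads.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        for data in pool.map(fetch, keys):  # results arrive in input order
            yield decode(data)


arrays = list(iter_arrays(list(FAKE_STORE)))
```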

Memory Usage

Zero-copy mode (copy=False) is more memory-efficient but requires careful lifetime management:

import spdl.io

# Memory-efficient: No copy
array = spdl.io.load_npy(data, copy=False)
# 'data' must remain valid while 'array' is in use

# Memory-safe: Independent copy
array = spdl.io.load_npy(data, copy=True)
# 'data' can be deleted, 'array' is independent

When to Use

Use SPDL’s NumPy loaders when:

  • Working with byte strings downloaded from remote storage (S3, HTTP, etc.)

  • Data is already in memory as bytes or bytearray

  • Loading from network APIs or cloud storage

  • Memory efficiency is important (zero-copy loading)

  • Performance is critical for in-memory data conversion

Use standard numpy.load when:

  • Working with file paths on disk (numpy.load('file.npy'))

  • Working with file-like objects (e.g., BytesIO, file handles)

  • Need support for object dtype

  • Need support for Fortran-order arrays

  • Working with advanced NumPy features

Note

SPDL functions accept only byte strings (bytes/bytearray), not file paths or file-like objects. They are optimized for scenarios where data has already been downloaded or loaded into memory.

See Also