Loading NumPy Arrays
====================

NumPy serialization formats (NPY, NPZ) are often used in data collection.
The :py:func:`numpy.save`, :py:func:`numpy.savez`, and
:py:func:`numpy.savez_compressed` functions are versatile: they can serialize
almost any data, which is why they are a popular choice.
However, loading such data with :py:func:`numpy.load` is slow, and
this is difficult to optimize.

SPDL provides optimized functions for loading NumPy arrays from byte strings
(data already in memory).

.. seealso::

   :doc:`../case_studies/data_format`
      Case study comparing different data serialization formats for
      performance and efficiency.

Overview
--------

NumPy serializes arrays in two formats:

- **NPY format** (``.npy`` files): Single array per file
- **NPZ format** (``.npz`` files): Multiple arrays in a ZIP archive

SPDL provides functions optimized for loading arrays from byte strings that
are already in memory. These are designed for data downloaded from remote
storage, network APIs, or other sources where the binary data is already
loaded into RAM.

.. important::

   These functions work **exclusively with byte strings** (``bytes`` or
   ``bytearray``). They do **not** accept file paths or file-like objects
   (e.g., ``BytesIO``). They are optimized for scenarios where data is
   already loaded into memory.

.. note::

   To load NumPy arrays with SPDL IO functions, the data must be serialized
   without pickling (``allow_pickle=False``). Numeric arrays are never
   pickled by :py:func:`numpy.save`, :py:func:`numpy.savez`, or
   :py:func:`numpy.savez_compressed`; pickling only applies to arrays with
   object dtype.

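
As a rough illustration of what "byte strings in NPY or NPZ format" means
here, the following sketch serializes arrays into ``bytes`` objects with plain
NumPy and checks the leading magic bytes of each format. The helper names
``to_npy_bytes`` and ``to_npz_bytes`` are hypothetical; they are not part of
SPDL or NumPy.

.. code-block:: python

   from io import BytesIO

   import numpy as np


   def to_npy_bytes(array: np.ndarray) -> bytes:
       """Serialize one array to NPY bytes (hypothetical helper)."""
       buffer = BytesIO()
       # allow_pickle=False makes np.save raise for object arrays
       # instead of silently pickling them.
       np.save(buffer, array, allow_pickle=False)
       return buffer.getvalue()


   def to_npz_bytes(**arrays: np.ndarray) -> bytes:
       """Serialize named arrays to NPZ bytes (hypothetical helper)."""
       buffer = BytesIO()
       np.savez(buffer, **arrays)
       return buffer.getvalue()


   npy_bytes = to_npy_bytes(np.arange(6, dtype=np.int32).reshape(2, 3))
   npz_bytes = to_npz_bytes(x=np.arange(3), y=np.ones(3))

   # An NPY payload starts with the magic string b"\x93NUMPY".
   assert npy_bytes[:6] == b"\x93NUMPY"
   # An NPZ payload is a ZIP archive, so it starts with the ZIP signature.
   assert npz_bytes[:2] == b"PK"

Byte strings produced this way are the kind of input that the loaders
described below expect.
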
**Key benefits:**

- **Works with byte strings**: Accepts ``bytes`` or ``bytearray`` objects directly
- **Fast array creation**: Creates NumPy arrays from in-memory data without computation
- **Zero-copy loading**: No intermediate copies for supported formats
- **Memory efficiency**: Direct memory mapping when possible
- **Optimized for in-memory data**: No file I/O overhead, works directly with downloaded data

**Performance characteristics:**

- Since the GIL is not released, performance does not scale with multiple threads
- However, these functions are faster than standard NumPy functions when
  working with in-memory data because they do not perform any computation
  and work directly with byte data

Loading NPY Files
------------------

:py:func:`spdl.io.load_npy` loads a single NumPy array from a byte string in
NPY format. The input must be ``bytes`` or ``bytearray``, not a file path or
file-like object.

Basic Usage
~~~~~~~~~~~

.. code-block:: python

   import spdl.io
   import numpy as np
   from io import BytesIO

   # Create and save an array
   original = np.arange(100).reshape(10, 10)
   buffer = BytesIO()
   np.save(buffer, original)

   # Load using SPDL
   data = buffer.getvalue()
   restored = spdl.io.load_npy(data)

   assert np.array_equal(restored, original)

Comparison with numpy.load
~~~~~~~~~~~~~~~~~~~~~~~~~~~

SPDL's implementation is more efficient than using :py:func:`numpy.load` with
:py:class:`io.BytesIO` when working with data already in memory:

.. code-block:: python

   import spdl.io
   import numpy as np
   from io import BytesIO

   # `buffer` holds NPY data, as created in the Basic Usage example

   # NumPy's approach (slower for in-memory data)
   data = buffer.getvalue()  # bytes object
   restored_numpy = np.load(BytesIO(data))  # Wraps bytes in file-like object

   # SPDL's approach (faster for in-memory data)
   restored_spdl = spdl.io.load_npy(data)  # Works directly with bytes

**Why SPDL is faster for in-memory data:**

1. **No intermediate BytesIO wrapper**: SPDL works directly with byte strings
2. **Zero-copy when possible**: Avoids unnecessary memory allocation
3. **No computation**: Creates array objects from in-memory data without processing
4. **Designed for bytes in memory**: Optimized for data already downloaded/loaded

**Note:** If you need to load from a file path, use :py:func:`numpy.load`
directly. SPDL functions are designed for byte strings, not file paths.

Zero-Copy Loading
~~~~~~~~~~~~~~~~~

By default, :py:func:`spdl.io.load_npy` returns a view into the original byte
data without copying:

.. code-block:: python

   import spdl.io
   import numpy as np

   # `npy_bytes` is NPY-format data already in memory

   # Load without copying (default)
   data = bytearray(npy_bytes)
   array = spdl.io.load_npy(data)

   # Modifying the array affects the original data
   array[0] = 999  # The underlying byte data is also modified

   # Force a copy if needed
   array = spdl.io.load_npy(data, copy=True)
   array[0] = 999  # Now the original byte data is unchanged

.. warning::

   When using zero-copy mode (``copy=False``), the returned array shares
   memory with the input data. Ensure the input data remains valid for the
   lifetime of the array.

Supported Data Types
~~~~~~~~~~~~~~~~~~~~

:py:func:`spdl.io.load_npy` supports all numeric NumPy dtypes:

- **Integer types**: ``uint8``, ``int16``, ``uint16``, ``int32``, ``uint32``, ``int64``, ``uint64``
- **Floating point**: ``float16``, ``float32``, ``float64``
- **Boolean**: ``bool``

**Limitations:**

- **No object dtype support**: Arrays with ``dtype=object`` are not supported
- **No Fortran order**: Only C-contiguous arrays are supported

.. code-block:: python

   import spdl.io
   import numpy as np

   # Supported: numeric types
   int_array = np.array([1, 2, 3], dtype=np.int32)
   float_array = np.array([1.0, 2.0, 3.0], dtype=np.float64)

   # Not supported: object dtype
   # obj_array = np.array([{"key": "value"}], dtype=object)  # Will fail

Loading NPZ Files
------------------

:py:func:`spdl.io.load_npz` loads multiple NumPy arrays from a byte string
containing NPZ (ZIP archive) data. The input must be ``bytes`` or
``bytearray``, not a file path or file-like object.

Basic Usage
~~~~~~~~~~~

.. code-block:: python

   import spdl.io
   import numpy as np
   from io import BytesIO

   # Create and save multiple arrays
   x = np.arange(10)
   y = np.sin(x)
   z = np.cos(x)

   buffer = BytesIO()
   np.savez(buffer, x=x, y=y, z=z)

   # Load using SPDL
   data = buffer.getvalue()
   npz_file = spdl.io.load_npz(data)

   # Access arrays by name
   assert np.array_equal(npz_file["x"], x)
   assert np.array_equal(npz_file["y"], y)
   assert np.array_equal(npz_file["z"], z)

NpzFile Interface
~~~~~~~~~~~~~~~~~

:py:func:`spdl.io.load_npz` returns a :py:class:`spdl.io.NpzFile` object that
mimics :py:class:`numpy.lib.npyio.NpzFile`. The :py:class:`~spdl.io.NpzFile`
class implements the :py:class:`collections.abc.Mapping` interface:

.. code-block:: python

   import spdl.io
   import numpy as np

   # `data` is the NPZ byte string from the Basic Usage example
   npz_file = spdl.io.load_npz(data)

   # Dictionary-like access
   x = npz_file["x"]

   # Check if key exists
   if "x" in npz_file:
       print("x is in the archive")

   # List all arrays
   print(npz_file.files)  # ['x', 'y', 'z']

   # Iterate over keys
   for key in npz_file:
       print(f"{key}: {npz_file[key].shape}")

   # Get number of arrays
   print(len(npz_file))  # 3

Accessing Arrays
~~~~~~~~~~~~~~~~

Arrays can be accessed with or without the ``.npy`` suffix:

.. code-block:: python

   import spdl.io
   import numpy as np

   # `data` is the NPZ byte string from the Basic Usage example
   npz_file = spdl.io.load_npz(data)

   # Both work the same
   x1 = npz_file["x"]
   x2 = npz_file["x.npy"]

   assert np.array_equal(x1, x2)

Compressed NPZ Files
~~~~~~~~~~~~~~~~~~~~

:py:func:`spdl.io.load_npz` supports both uncompressed and DEFLATE-compressed
archives:

.. code-block:: python

   import spdl.io
   import numpy as np
   from io import BytesIO

   x = np.arange(1000)
   y = np.random.random(1000)

   # Compressed NPZ (savez_compressed)
   buffer = BytesIO()
   np.savez_compressed(buffer, x=x, y=y)
   data = buffer.getvalue()

   npz_file = spdl.io.load_npz(data)

   assert np.array_equal(npz_file["x"], x)
   assert np.array_equal(npz_file["y"], y)

Positional Arguments
~~~~~~~~~~~~~~~~~~~~

Arrays saved without names get automatic ``arr_0``, ``arr_1`` naming:

.. code-block:: python

   import spdl.io
   import numpy as np
   from io import BytesIO

   x = np.arange(10)
   y = np.sin(x)

   # Save with positional arguments (no names)
   buffer = BytesIO()
   np.savez(buffer, x, y)
   data = buffer.getvalue()

   npz_file = spdl.io.load_npz(data)

   # Access using auto-generated names
   assert np.array_equal(npz_file["arr_0"], x)
   assert np.array_equal(npz_file["arr_1"], y)

Use Cases
---------

Loading from Remote Storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These functions are specifically designed for loading data from remote
storage or network APIs, where the data is first downloaded into memory as
bytes:

.. code-block:: python

   import boto3
   import numpy as np
   import requests
   import spdl.io

   # Assumes boto3 and requests are installed and credentials are configured
   s3_client = boto3.client("s3")

   def load_from_s3(bucket: str, key: str) -> np.ndarray:
       # Download bytes from S3 into memory
       data = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
       # `data` is now a bytes object in memory

       # Load efficiently with SPDL - works directly with the bytes
       return spdl.io.load_npy(data)

   def load_from_http(url: str) -> np.ndarray:
       # Download from HTTP endpoint
       response = requests.get(url)
       data = response.content  # bytes object

       # Load directly from the downloaded bytes
       return spdl.io.load_npy(data)

   # Use in data pipeline (`data_keys` is an iterable of object keys)
   for key in data_keys:
       array = load_from_s3("my-bucket", key)
       # Process array...

**Why these functions are ideal for remote storage:**

- Data is already downloaded into memory as bytes
- No need to write to disk and read back
- Efficient conversion from bytes to NumPy arrays
- Optimized for this specific use case

Performance Considerations
---------------------------

GIL Behavior
~~~~~~~~~~~~

Both :py:func:`spdl.io.load_npy` and :py:func:`spdl.io.load_npz` do **not**
release the GIL. They create NumPy array objects from in-memory data without
performing any computation. Since the majority of time is spent on Python
object creation, it is not possible to release the GIL.

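
The idea of creating array objects from in-memory data without computation
can be illustrated with plain NumPy. The following is a minimal sketch, not
SPDL's actual implementation: a hypothetical helper ``view_npy`` parses the
NPY header by hand and wraps the payload with :py:func:`numpy.frombuffer`,
which creates a view instead of copying. It assumes a C-contiguous,
non-object array.

.. code-block:: python

   import ast
   import math
   import struct
   from io import BytesIO

   import numpy as np


   def view_npy(data):
       """Return an array viewing the payload of NPY bytes (illustrative only)."""
       if bytes(data[:6]) != b"\x93NUMPY":
           raise ValueError("not NPY data")
       # Format 1.x stores the header length as uint16, later versions as uint32.
       if data[6] == 1:
           (header_len,) = struct.unpack_from("<H", data, 8)
           header_start = 10
       else:
           (header_len,) = struct.unpack_from("<I", data, 8)
           header_start = 12
       payload_start = header_start + header_len
       # The header is a Python dict literal describing dtype, order and shape.
       header = ast.literal_eval(
           bytes(data[header_start:payload_start]).decode("latin1")
       )
       dtype = np.dtype(header["descr"])
       if header["fortran_order"] or dtype.hasobject:
           raise ValueError("not handled in this sketch")
       shape = header["shape"]
       # frombuffer wraps the existing memory; no element is read or converted.
       flat = np.frombuffer(
           data, dtype=dtype, count=math.prod(shape), offset=payload_start
       )
       return flat.reshape(shape)


   original = np.arange(6, dtype=np.int32).reshape(2, 3)
   buffer = BytesIO()
   np.save(buffer, original)
   data = bytearray(buffer.getvalue())

   array = view_npy(data)
   assert np.array_equal(array, original)

   # The array is a view: writing through it changes the underlying bytearray.
   array[0, 0] = 999
   assert view_npy(data)[0, 0] == 999

In this sketch only the small header is parsed and the payload is never
touched, so the cost is mostly Python object creation, which has to happen
while holding the GIL.
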
**Implications:**

- **No multi-threading scalability**: Performance does not scale with multiple threads
- **Still faster than NumPy**: Despite not releasing the GIL, these functions
  are faster than standard NumPy functions because they avoid computation
  and work directly with byte data
- **Best for single-threaded or I/O-bound pipelines**: Use when data loading is not the bottleneck

Memory Usage
~~~~~~~~~~~~

Zero-copy mode (``copy=False``) is more memory-efficient but requires careful
lifetime management:

.. code-block:: python

   import spdl.io

   # Memory-efficient: No copy
   array = spdl.io.load_npy(data, copy=False)
   # 'data' must remain valid while 'array' is in use

   # Memory-safe: Independent copy
   array = spdl.io.load_npy(data, copy=True)
   # 'data' can be deleted, 'array' is independent

When to Use
~~~~~~~~~~~

**Use SPDL's NumPy loaders when:**

- Working with **byte strings** downloaded from remote storage (S3, HTTP, etc.)
- Data is already in memory as ``bytes`` or ``bytearray``
- Loading from network APIs or cloud storage
- Memory efficiency is important (zero-copy loading)
- Performance is critical for in-memory data conversion

**Use standard numpy.load when:**

- Working with **file paths** on disk (``numpy.load('file.npy')``)
- Working with **file-like objects** (e.g., ``BytesIO``, file handles)
- Need support for object dtype
- Need support for Fortran-order arrays
- Working with advanced NumPy features

.. note::

   SPDL functions accept **only byte strings** (``bytes``/``bytearray``), not
   file paths or file-like objects. They are optimized for scenarios where
   data has already been downloaded or loaded into memory.

See Also
--------

- :doc:`basic` - High-level media loading functions
- :doc:`decoding_overview` - Understanding the decoding process
- :py:func:`numpy.save` - Save arrays to NPY format
- :py:func:`numpy.savez` - Save multiple arrays to NPZ format
- :py:func:`numpy.savez_compressed` - Save compressed NPZ files