FLACArray
This package provides a set of tools for compressing multi-dimensional arrays where the last array dimension consists of "streams" of data. These streams are compressed with the FLAC algorithm and can be written to different file formats as well as decompressed back into numpy arrays.
FLAC compression is particularly suited to "noisy" timestreams that do not compress well with DEFLATE algorithms used by zip / gzip. This type of data is found in audio signals, scientific timestreams, etc.
In the flacarray package we use only a small subset of the features found in the libFLAC library. In particular, each data stream is compressed as either one or two 32 bit "channels". Stream data consisting of 32 or 64 bit integers is compressed in a lossless fashion. Floating point data is converted to 32 or 64 bit integers with a user-specified precision or quantization.
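As a minimal sketch of this behavior (using the FlacArray class described later in this documentation), integer data round-trips exactly while floating point data requires a quantization choice:
import numpy as np
from flacarray import FlacArray

# Integer streams are compressed losslessly.
ints = np.random.randint(-1000, 1000, size=(4, 10000), dtype=np.int32)
flc_int = FlacArray.from_array(ints)
print(np.array_equal(flc_int.to_array(), ints))  # True

# Floating point streams need a quantization choice (quanta or precision).
floats = np.random.normal(size=(4, 10000))
flc_flt = FlacArray.from_array(floats, quanta=1.0e-6)
restored = flc_flt.to_array()  # lossy, at roughly the quanta level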
If you are specifically working with audio data and want to write flac format audio files, you should look at other software tools such as pyflac.
Installation
For most use cases, you can just install flacarray from pre-built Python wheels or conda packages. For specialized use cases or development, it is straightforward to build the package from source, using either a conda environment for dependencies or with those obtained through your OS package manager.
Python Wheels
You can install pre-built wheels from PyPI using pip within a virtualenv:
pip install flacarray
Or, if you are using a shared python environment you can install to a user location with:
pip install --user flacarray
Conda Packages
If you are using a conda environment, you can install the conda package for flacarray from the conda-forge channel:
conda install -c conda-forge flacarray
To-Do
flacarray is not yet on conda-forge.
Building From Source
In order to build from source, you will need a C compiler and the FLAC development libraries installed.
Building Within a Conda Environment
If you have conda available, you can create an environment with all the dependencies needed to build flacarray from source. In this example, we create an environment called "flacarray". First create the environment with all dependencies and activate it (FIXME: add a requirements file for dev):
conda create -n flacarray \
c-compiler numpy libflac cython meson-python pkgconfig
conda activate flacarray
Now you can go into your local git checkout of the flacarray source and do:
pip install .
This builds and installs the package.
To also work on docs, install additional packages:
conda install mkdocs mkdocstrings mkdocstrings-python mkdocs-jupyter
pip install mkdocs-print-site-plugin
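Assuming the repository ships a mkdocs.yml configuration (not verified here), you can then preview and build the documentation with the standard mkdocs commands:
mkdocs serve   # live preview at http://127.0.0.1:8000
mkdocs build   # build the static site into the site/ directory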
Other Ways of Building
To-Do
Discuss OS packages, document apt, rpm, homebrew options.
Tutorial
The flacarray package has tools for working with compressed arrays in memory, as well as saving and loading them to and from several file formats. This tutorial makes use of some interactive helper functions in the flacarray.demo package.
import numpy as np
import h5py # For optional I/O operations
import zarr # For optional I/O operations
FlacArray - Compressed Arrays in Memory
The primary class for working with compressed arrays in memory is the FlacArray class. You can construct one of these from a numpy array with a class method. First create some fake data in a numpy array for testing. This is a small 3-D array and the final dimension is always the one that is compressed. This last dimension should consist of "streams" of data.
from flacarray import FlacArray, demo
# Create a 3D array where the last dimension is the "streams" we are compressing.
arr, _ = demo.create_fake_data((4, 3, 10000))
# How large is this in memory?
print(f"Input array is {arr.nbytes} bytes")
Input array is 960000 bytes
# Plot these streams
demo.plot_data(arr)
Create From Array
Now create a FlacArray from this. Integer data is compressed in a lossless fashion, but since this is floating point data, we must choose either a quantization value or a fixed precision (number of decimal places) when converting to integers.
# Create a compressed array
flcarr = FlacArray.from_array(arr, precision=10)
# Properties of the compressed array
print(flcarr)
<FlacArray float64 shape=(4, 3, 10000) bytes=522899>
Decompress Back to Array
Now decompress back to a numpy array. The result will only be bitwise identical to the original for arrays of 32 or 64 bit integers; for floating point data the round trip is lossy at the level of the chosen quantization.
# Restore back to an array
restored = flcarr.to_array()
demo.plot_data(restored)
# Plot the residual
residual = restored - arr
demo.plot_data(residual)
Slicing
A subset of the full array can be decompressed on the fly. Any fancy array indexing can be used for the leading dimensions, but only contiguous slices or individual samples are supported in the last dimension.
subarr = flcarr[1:2, :, 200:300]
demo.plot_data(subarr)
Writing and Reading
The FlacArray class has methods to write the internal compressed data and metadata to both h5py and zarr groups. The data members written to these file formats are simple arrays and scalars. Supporting other formats in the future would be straightforward. When decompressing data from disk, you can choose to decompress only a subset of the streams. Here is an example writing the compressed array to HDF5 and loading it back in.
with h5py.File("flcarr.h5", "w") as hf:
    flcarr.write_hdf5(hf)
# We can load this back into a new FlacArray using a class method
with h5py.File("flcarr.h5", "r") as hf:
    new_flcarr = FlacArray.read_hdf5(hf)
# The compressed representations should be equal...
print(new_flcarr == flcarr)
True
You can also load in just a subset of the streams using a "keep" mask. This is a boolean array with the same shape as the leading dimensions of the original array.
leading_shape = arr.shape[:-1]
keep = np.zeros(leading_shape, dtype=bool)
# Select the first and last stream on the second row
keep[1, 0] = True
keep[1, -1] = True
# Load just these streams
with h5py.File("flcarr.h5", "r") as hf:
    sub_flcarr = FlacArray.read_hdf5(hf, keep=keep)
# Decompress and plot
demo.plot_data(sub_flcarr.to_array())
Direct I/O and Compression of Numpy Arrays
For some use cases, there is no need to keep the full compressed data in memory (in a FlacArray). Instead, a normal numpy array is compressed when writing to a file and decompressed back into a numpy array when reading. The package has high-level functions for performing this kind of operation. When decompressing, a subset of streams can be loaded from disk, and a sample range can be specified when doing the decompression.
HDF5
The hdf5 sub-module has helper functions for direct I/O to HDF5.
import flacarray.hdf5
# Write a numpy array directly to HDF5. This is equivalent to doing:
#
#   temp = FlacArray.from_array(arr, precision=10)
#   with h5py.File("test.h5", "w") as hf:
#       temp.write_hdf5(hf)
#
with h5py.File("test.h5", "w") as hf:
    flacarray.hdf5.write_array(arr, hf, precision=10)
with h5py.File("test.h5", "r") as hf:
    restored = flacarray.hdf5.read_array(hf)
demo.plot_data(restored)
# Load only a subset of streams and a slice of samples in those streams.
# This is equivalent to the following code:
#
#   with h5py.File("test.h5", "r") as hf:
#       restored = FlacArray.read_hdf5(hf, keep=keep)
#   sub_restored = restored[:, 200:300]
#
with h5py.File("test.h5", "r") as hf:
    sub_restored = flacarray.hdf5.read_array(hf, keep=keep, stream_slice=slice(200, 300, 1))
demo.plot_data(sub_restored)
Zarr
The zarr package provides an h5py-like interface for creating groups with attributes and "datasets" (arrays) on disk. Given an existing zarr.hierarchy.Group, you can compress and write an array and then load it back in. This is almost identical to the HDF5 syntax above. The flacarray package includes a helper class (ZarrGroup) which acts as a context manager around an open file. However, if you already have a handle to a Group, you can pass that directly to flacarray.zarr.write_array() and flacarray.zarr.read_array().
import flacarray.zarr
with flacarray.zarr.ZarrGroup("test.zarr", mode="w") as zf:
    flacarray.zarr.write_array(arr, zf, precision=10)
with flacarray.zarr.ZarrGroup("test.zarr", mode="r") as zf:
    restored = flacarray.zarr.read_array(zf)
demo.plot_data(restored)
# Specifying a keep mask and sample slice also works.
with flacarray.zarr.ZarrGroup("test.zarr", mode="r") as zf:
    sub_restored = flacarray.zarr.read_array(zf, keep=keep, stream_slice=slice(200, 300, 1))
demo.plot_data(sub_restored)
Cook Book
This notebook contains some more advanced examples addressing common usage patterns. Look at the Tutorial first to get a better sense of the big picture of the tools.
# Set the number of OpenMP threads to use
import os
os.environ["OMP_NUM_THREADS"] = "4"
import time
import numpy as np
import h5py
from flacarray import FlacArray, demo
import flacarray.hdf5
Random Access to Large Arrays
Consider a common case where we have a 2D array that represents essentially a "list" of timestreams of data. We might have thousands of timestreams, each with millions of samples. Now we want to decompress and access a subset of those streams and / or samples. To reduce memory in this notebook we are using a slightly smaller array.
# Create a 2D array of streams
arr, _ = demo.create_fake_data((1000, 100000), dtype=np.float32)
# How large is this in memory?
print(f"Input array is {arr.nbytes} bytes")
Input array is 400000000 bytes
# Compress this with threads
start = time.perf_counter()
flcarr = FlacArray.from_array(arr, quanta=1.0e-7, use_threads=True)
stop = time.perf_counter()
print(f"Elapsed = {stop-start:0.3} seconds")
Elapsed = 1.23 seconds
# Compress this without threads
start = time.perf_counter()
flcarr = FlacArray.from_array(arr, quanta=1.0e-7, use_threads=False)
stop = time.perf_counter()
print(f"Elapsed = {stop-start:0.3} seconds")
Elapsed = 2.6 seconds
# Decompress the whole thing
start = time.perf_counter()
restored = flcarr.to_array()
stop = time.perf_counter()
print(f"Elapsed = {stop-start:0.3} seconds")
Elapsed = 0.447 seconds
# Decompress the whole thing with threads
del restored
start = time.perf_counter()
restored = flcarr.to_array(use_threads=True)
stop = time.perf_counter()
print(f"Elapsed = {stop-start:0.3} seconds")
Elapsed = 0.439 seconds
Subset of Samples for All Streams
If our 2D array of streams contains co-sampled data, we might want to examine a slice in time of all streams. Imagine we want to get data near the end of the array for all streams:
n_end = 10000
start = time.perf_counter()
end_arr = flcarr.to_array(stream_slice=slice(-n_end, None, 1))
stop = time.perf_counter()
print(f"Elapsed = {stop-start:0.3} seconds")
Elapsed = 0.43 seconds
Subset of Samples for a Few Streams
Imagine we want the last 10000 samples of one stream in the middle. We can use a "keep" mask combined with a sample slice:
n_end = 10000
keep = np.zeros(arr.shape[:-1], dtype=bool)
keep[500] = True
start = time.perf_counter()
sub_arr = flcarr.to_array(keep=keep, stream_slice=slice(-n_end, None, 1))
stop = time.perf_counter()
print(f"Elapsed = {stop-start:0.3} seconds")
print(sub_arr)
Elapsed = 0.00201 seconds
[[ 1.3499217 1.1607051 -1.0080613 ... 0.2447555 1.0821551 0.03497732]]
So, we can see that decompressing a small number of random samples from a multi-GB dataset in memory is very fast.
Quantization Effects
As mentioned previously, 32 and 64 bit integer data is compressed in a lossless fashion with FLAC, using either one or two audio channels. When compressing 32 or 64 bit floating point data, a choice must be made about the "amount" of floating point value assigned to each integer. This quanta parameter is a tradeoff between fidelity of the input data and compression factor. One "safe" choice is to pick a quantization value that is near the machine epsilon for float32 or float64. Although this will ensure nearly lossless compression, the compression ratio will be very poor.
To achieve better compression of floating point data, you should consider the dynamic range and actual precision of this data. As an example, consider some fake 32 bit floating point data:
# Create a single timestream
arr, _ = demo.create_fake_data((10000,), dtype=np.float32)
# How large is this in memory?
print(f"Input array is {arr.nbytes} bytes")
Input array is 40000 bytes
# Plot this
demo.plot_data(arr)
Now compress / decompress this data with a very conservative quanta value:
# Create a compressed array
flcarr = FlacArray.from_array(arr, quanta=1.0e-8)
How big is this compressed array?
print(f"FlacArray is {flcarr.nbytes} bytes")
FlacArray is 36161 bytes
So we see that our compression factor is only about 0.9, which is not a useful saving.
# Restore back to an array
restored = flcarr.to_array()
demo.plot_data(restored)
# Difference
residual = restored - arr
demo.plot_data(residual)
The residual difference after the roundtrip compression / decompression is close to the machine precision for float32.
Decreased Precision
Now imagine that we know the underlying precision of our data above is not really at the level of machine precision for float32. Instead, we know that our data came from measurements with a precision of 1.0e-4 in the units of this dataset. We can use that information when compressing:
# Create a compressed array
flcarr = FlacArray.from_array(arr, quanta=1.0e-4)
How big is this lower-precision compressed array?
print(f"FlacArray is {flcarr.nbytes} bytes")
FlacArray is 19664 bytes
So we have used information about our data to avoid storing unnecessary precision and have improved our compression ratio.
# Restore back to an array
restored = flcarr.to_array()
demo.plot_data(restored)
# Difference
residual = restored - arr
demo.plot_data(residual)
As expected, the residual is now comparable to the size of the quanta that we used.
Parallel I/O
To-Do: Discuss
- Interaction of threads, OpenMP versus libFLAC pthreads
- Use of MPI HDF5 with h5py
API Reference
The flacarray package consists of a primary class (FlacArray) plus a variety of helper functions.
Compressed Array Representation
The FlacArray class stores a compressed representation of an N-dimensional array where the last dimension consists of "streams" of numbers to be compressed.
flacarray.FlacArray
FLAC compressed array representation.
This class holds a compressed representation of an N-dimensional array. The final (fastest changing) dimension is the axis along which the data is compressed. Each of the vectors in this last dimension is called a "stream" here. The leading dimensions of the original matrix form an array of these streams.
Internally, the data is stored as a contiguous concatenation of the bytes from these compressed streams. A separate array contains the starting byte of each stream in the overall bytes array. The shape of the starting array corresponds to the shape of the leading, un-compressed dimensions of the original array.
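For example, a small sketch that inspects this layout through the compressed, stream_starts, and stream_nbytes properties documented below (assuming these properties are returned as numpy arrays):
import numpy as np
from flacarray import FlacArray

arr = np.random.normal(size=(4, 3, 10000))
flc = FlacArray.from_array(arr, quanta=1.0e-6)

# The starting-byte array has the shape of the leading dimensions...
print(flc.stream_starts.shape)  # (4, 3)
# ...and indexes into a single concatenated buffer of compressed bytes.
print(len(flc.compressed), int(np.sum(flc.stream_nbytes)))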
If the input data is 32bit or 64bit integers, each stream in the array is compressed directly with FLAC.
If the input data is 32bit or 64bit floating point numbers, then you must specify exactly one of either quanta or precision when calling from_array(). For floating point data, the mean of each stream is computed and rounded to the nearest whole quanta. This "offset" per stream is recorded and subtracted from the stream. The offset-subtracted stream data is then rescaled and truncated to integers (int32 or int64 depending on the bit width of the input array). If quanta is specified, the data is rescaled by 1 / quanta. The quanta may either be a scalar applied to all streams, or an array of values, one per stream. If instead the precision (integer number of decimal places) is specified, this is converted to a quanta by dividing the stream RMS by 10^{precision}. Similar to quanta, the precision may be specified as a single value for all streams, or as an array of values, one per stream.
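As a sketch of that conversion (based only on the description above; the package's exact rounding behavior may differ):
import numpy as np

stream = np.random.normal(scale=2.0, size=10000)
precision = 3
rms = np.sqrt(np.mean(stream**2))
quanta = rms / 10**precision                          # ~0.002 for an RMS of ~2.0
offset = quanta * np.round(np.mean(stream) / quanta)  # mean rounded to a whole quanta
ints = np.round((stream - offset) / quanta).astype(np.int32)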
If you choose a quanta value that is close to machine epsilon (e.g. 1e-7 for 32bit or 1e-16 for 64bit), then the compression amount will be negligible but the results nearly lossless. Compression of floating point data should not be done blindly and you should consider the underlying precision of the data you are working with in order to achieve the best compression possible.
The following rules summarize the data conversion that is performed depending on the input type:

- int32: No conversion. Compressed to a single channel FLAC bytestream.
- int64: No conversion. Compressed to a 2-channel (stereo) FLAC bytestream.
- float32: Subtract the offset per stream and scale data based on the quanta value or precision (see above). Then round to the nearest 32bit integer.
- float64: Subtract the offset per stream and scale data based on the quanta value or precision (see above). Then round to the nearest 64bit integer.
After conversion to integers, each stream's data is separately compressed into a sequence of FLAC bytes, which is appended to the bytestream. The offset in bytes for each stream is recorded.
A FlacArray is only constructed directly when making a copy. Use the class methods to create FlacArrays from numpy arrays or on-disk representations.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| other | FlacArray | Construct a copy of the input FlacArray. | required |
- compressed (property): The concatenated raw bytes of all streams on the local process.
- dtype (property): The dtype of the uncompressed array.
- global_leading_shape (property): The global shape of leading uncompressed dimensions across all processes.
- global_nbytes (property): The sum of total bytes used by compressed data across all processes.
- global_nstreams (property): Number of global streams (product of entries of global_leading_shape).
- global_process_nbytes (property): The bytes used by compressed data on each process.
- global_shape (property): The global shape of array across any MPI communicator.
- global_stream_nbytes (property): The array of nbytes within the global compressed data.
- global_stream_starts (property): The array of starting bytes within the global compressed data.
- leading_shape (property): The local shape of leading uncompressed dimensions.
- mpi_comm (property): The MPI communicator over which the array is distributed.
- mpi_dist (property): The range of the leading dimension assigned to each MPI process.
- nbytes (property): The total number of bytes used by compressed data on the local process.
- nstreams (property): The number of local streams (product of entries of leading_shape).
- shape (property): The shape of the local, uncompressed array.
- stream_gains (property): The gain factor for each stream during conversion to int32.
- stream_nbytes (property): The array of nbytes for each stream on the local process.
- stream_offsets (property): The value subtracted from each stream during conversion to int32.
- stream_size (property): The uncompressed length of each stream.
- stream_starts (property): The array of starting bytes for each stream on the local process.
- typestr (property): A string representation of the original data type.
__getitem__(raw_key)
Decompress a slice of data on the fly.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| raw_key | tuple | A tuple of slices or integers. | required |

Returns:

| Type | Description |
|---|---|
| array | The decompressed array slice. |
from_array(arr, level=5, quanta=None, precision=None, mpi_comm=None, use_threads=False)
classmethod
Construct a FlacArray from a numpy ndarray.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| arr | ndarray | The input array data. | required |
| level | int | Compression level (0-8). | 5 |
| quanta | float or array | For floating point data, the floating point increment of each 32bit integer value. Optionally an iterable of increments, one per stream. | None |
| precision | int or array | Number of significant digits to retain in float-to-int conversion. Alternative to quanta. | None |
| mpi_comm | Comm | If specified, the input array is assumed to be distributed across the communicator at the leading dimension. The local piece of the array is passed in on each process. | None |
| use_threads | bool | If True, use OpenMP threads to parallelize decoding. This is only beneficial for large arrays. | False |
Returns:

| Type | Description |
|---|---|
| FlacArray | A newly constructed FlacArray. |
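A usage sketch with a per-stream quanta array (the values here are arbitrary and purely illustrative):
import numpy as np
from flacarray import FlacArray

data = np.random.normal(size=(8, 50000))
# One quanta value per stream, matching the leading shape of the array.
per_stream_quanta = np.full(data.shape[:-1], 1.0e-5)
flc = FlacArray.from_array(data, level=5, quanta=per_stream_quanta)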
read_hdf5(hgrp, keep=None, mpi_comm=None, mpi_dist=None, no_flatten=False)
classmethod
Construct a FlacArray from an HDF5 Group.
This function loads all information about the array from an HDF5 group. If mpi_comm is specified, the created array is distributed over that communicator. If you also wish to use MPI I/O to read data from the group, then you must be using an MPI-enabled h5py and you should pass in a valid handle to the group on all processes.

If mpi_dist is specified, it should be an iterable with the number of leading dimension elements assigned to each process. If None, the leading dimension will be distributed uniformly.

If keep is specified, this should be a boolean array with the same shape as the leading dimensions of the original array. True values in this array indicate that the stream should be kept.

If keep is specified, the returned array WILL NOT have the same shape as the original. Instead it will be a 2D array of decompressed streams: the streams corresponding to True values in the keep mask.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| hgrp | Group | The open Group for reading. | required |
| keep | array | Bool array of streams to keep in the decompression. | None |
| mpi_comm | Comm | If specified, the communicator over which to distribute the leading dimension. | None |
| mpi_dist | array | If specified, assign blocks of these sizes to processes when distributing the leading dimension. | None |
| no_flatten | bool | If True, for single-stream arrays, leave the leading dimension of (1,) in the result. | False |
Returns:

| Type | Description |
|---|---|
| FlacArray | A newly constructed FlacArray. |
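A sketch of the MPI-distributed read (assuming mpi4py is available and that it is acceptable to open the file read-only on every process; whether the underlying read is parallel depends on your h5py build, as noted above):
from mpi4py import MPI
import h5py
from flacarray import FlacArray

comm = MPI.COMM_WORLD
with h5py.File("flcarr.h5", "r") as hf:
    flc = FlacArray.read_hdf5(hf, mpi_comm=comm)
# Each process now holds its share of the leading dimension.
print(comm.rank, flc.leading_shape)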
read_zarr(zgrp, keep=None, mpi_comm=None, mpi_dist=None, no_flatten=False)
classmethod
Construct a FlacArray from a Zarr Group.
This function loads all information about the array from a zarr group. If mpi_comm is specified, the created array is distributed over that communicator.

If mpi_dist is specified, it should be an iterable with the number of leading dimension elements assigned to each process. If None, the leading dimension will be distributed uniformly.

If keep is specified, this should be a boolean array with the same shape as the leading dimensions of the original array. True values in this array indicate that the stream should be kept.

If keep is specified, the returned array WILL NOT have the same shape as the original. Instead it will be a 2D array of decompressed streams: the streams corresponding to True values in the keep mask.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| zgrp | Group | The open Group for reading. | required |
| keep | array | Bool array of streams to keep in the decompression. | None |
| mpi_comm | Comm | If specified, the communicator over which to distribute the leading dimension. | None |
| mpi_dist | array | If specified, assign blocks of these sizes to processes when distributing the leading dimension. | None |
| no_flatten | bool | If True, for single-stream arrays, leave the leading dimension of (1,) in the result. | False |
Returns:

| Type | Description |
|---|---|
| FlacArray | A newly constructed FlacArray. |
to_array(keep=None, stream_slice=None, keep_indices=False, use_threads=False)
Decompress local data into a numpy array.
This uses the compressed representation to reconstruct a normal numpy array. The returned data type will be either int32, int64, float32, or float64 depending on the original data type.
If stream_slice is specified, the returned array will have only that range of samples in the final dimension.

If keep is specified, this should be a boolean array with the same shape as the leading dimensions of the original array. True values in this array indicate that the stream should be kept.

If keep is specified, the returned array WILL NOT have the same shape as the original. Instead it will be a 2D array of decompressed streams: the streams corresponding to True values in the keep mask.

If keep_indices is True and keep is specified, then a tuple of two values is returned. The first is the array of decompressed streams. The second is a list of tuples, each of which specifies the indices of the stream in the original array.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| keep | array | Bool array of streams to keep in the decompression. | None |
| stream_slice | slice | A python slice with step size of one, indicating the sample range to extract from each stream. | None |
| keep_indices | bool | If True, also return the original indices of the streams. | False |
| use_threads | bool | If True, use OpenMP threads to parallelize decoding. This is only beneficial for large arrays. | False |
write_hdf5(hgrp)
Write data to an HDF5 Group.
The internal object properties are written to an open HDF5 group. If you wish to use MPI I/O to write data to the group, then you must be using an MPI enabled h5py and you should pass in a valid handle to the group on all processes.
If the FlacArray is distributed over an MPI communicator, but the h5py implementation does not support MPI I/O, then all data will be communicated to the rank zero process for writing. In this case, the hgrp argument should be None except on the root process.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| hgrp | Group | The open Group for writing. | required |
Returns:

| Type | Description |
|---|---|
| None | |
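A sketch of the serial-h5py case described above, where only the root process opens the file and the other ranks pass None (assuming mpi4py and a FlacArray flc distributed over the communicator):
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
# Only rank zero opens the output file; other ranks pass None.
hgrp = h5py.File("flcarr.h5", "w") if comm.rank == 0 else None
flc.write_hdf5(hgrp)
if hgrp is not None:
    hgrp.close()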
write_zarr(zgrp)
Write data to a Zarr Group.
The internal object properties are written to an open zarr group.
If the FlacArray is distributed over an MPI communicator, then all data will be communicated to the rank zero process for writing. In this case, the zgrp argument should be None except on the root process.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| zgrp | Group | The open Group for writing. | required |
Returns:

| Type | Description |
|---|---|
| None | |
Direct I/O
Sometimes code has no need to store compressed arrays in memory. Instead, it may be desirable to have full arrays in memory and compressed arrays on disk. In those situations, you can use several helper functions to write and read numpy arrays directly to / from files.
HDF5
You can write to / read from an h5py Group using functions in the hdf5 submodule.
flacarray.hdf5.write_array(arr, hgrp, level=5, quanta=None, precision=None, mpi_comm=None, use_threads=False)
Compress a numpy array and write to an HDF5 group.
This function is useful if you do not need to access the compressed array in memory and only wish to write it directly to HDF5. The input array is compressed and then the write_compressed() function is called.

If the input array is int32 or int64, the compression is lossless and the compressed bytes and ancillary data is written to datasets within the output group. If the array is float32 or float64, either the quanta or precision must be specified. See the discussion in the FlacArray class documentation about how the offsets and gains are computed for a given quanta. The offsets and gains are also written as datasets within the output group.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| arr | array | The input numpy array. | required |
| hgrp | Group | The Group to use. | required |
| level | int | Compression level (0-8). | 5 |
| quanta | float or array | For floating point data, the floating point increment of each 32bit integer value. Optionally an iterable of increments, one per stream. | None |
| precision | int or array | Number of significant digits to retain in float-to-int conversion. Alternative to quanta. | None |
| mpi_comm | Comm | If specified, the input array is assumed to be distributed across the communicator at the leading dimension. The local piece of the array is passed in on each process. | None |
| use_threads | bool | If True, use OpenMP threads to parallelize decoding. This is only beneficial for large arrays. | False |
Returns:

| Type | Description |
|---|---|
| None | |
flacarray.hdf5.read_array(hgrp, keep=None, stream_slice=None, keep_indices=False, mpi_comm=None, mpi_dist=None, use_threads=False)
Load a numpy array from compressed HDF5.
This function is useful if you do not need to store a compressed representation of the array in memory. Each stream will be read individually from the file and the desired slice decompressed. This avoids storing the full compressed data.
This function acts as a dispatch to the correct version of the reading function. The function is selected based on the format version string in the data.
If stream_slice is specified, the returned array will have only that range of samples in the final dimension.

If keep is specified, this should be a boolean array with the same shape as the leading dimensions of the original array. True values in this array indicate that the stream should be kept.

If keep is specified, the returned array WILL NOT have the same shape as the original. Instead it will be a 2D array of decompressed streams: the streams corresponding to True values in the keep mask.

If keep_indices is True and keep is specified, then an additional list is returned containing the indices of each stream that was kept.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| hgrp | Group | The group to read. | required |
| keep | array | Bool array of streams to keep in the decompression. | None |
| stream_slice | slice | A python slice with step size of one, indicating the sample range to extract from each stream. | None |
| keep_indices | bool | If True, also return the original indices of the streams. | False |
| mpi_comm | Comm | The optional MPI communicator over which to distribute the leading dimension of the array. | None |
| mpi_dist | list | The optional list of tuples specifying the first / last element of the leading dimension to assign to each process. | None |
| use_threads | bool | If True, use OpenMP threads to parallelize decoding. This is only beneficial for large arrays. | False |
Returns:

| Type | Description |
|---|---|
| array | The loaded and decompressed data OR the array and the kept indices. |
Zarr
You can write to / read from a zarr hierarchy Group using functions in the zarr submodule.
flacarray.zarr.write_array(arr, zgrp, level=5, quanta=None, precision=None, mpi_comm=None, use_threads=False)
Compress a numpy array and write to a Zarr group.
This function is useful if you do not need to access the compressed array in memory and only wish to write it directly to Zarr files. The input array is compressed and then the write_compressed() function is called.

If the input array is int32 or int64, the compression is lossless and the compressed bytes and ancillary data is written to datasets within the output group. If the array is float32 or float64, either the quanta or precision must be specified. See the discussion in the FlacArray class documentation about how the offsets and gains are computed for a given quanta. The offsets and gains are also written as datasets within the output group.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| arr | array | The input numpy array. | required |
| zgrp | Group | The Group to use. | required |
| level | int | Compression level (0-8). | 5 |
| quanta | float or array | For floating point data, the floating point increment of each 32bit integer value. Optionally an iterable of increments, one per stream. | None |
| precision | int or array | Number of significant digits to retain in float-to-int conversion. Alternative to quanta. | None |
| mpi_comm | Comm | If specified, the input array is assumed to be distributed across the communicator at the leading dimension. The local piece of the array is passed in on each process. | None |
| use_threads | bool | If True, use OpenMP threads to parallelize decoding. This is only beneficial for large arrays. | False |
Returns:

| Type | Description |
|---|---|
| None | |
flacarray.zarr.read_array(zgrp, keep=None, stream_slice=None, keep_indices=False, mpi_comm=None, mpi_dist=None, use_threads=False, no_flatten=False)
Load a numpy array from a compressed Zarr group.
This function is useful if you do not need to store a compressed representation of the array in memory. Each stream will be read individually from the file and the desired slice decompressed. This avoids storing the full compressed data.
This function acts as a dispatch to the correct version of the reading function. The function is selected based on the format version string in the data.
If stream_slice is specified, the returned array will have only that range of samples in the final dimension.

If keep is specified, this should be a boolean array with the same shape as the leading dimensions of the original array. True values in this array indicate that the stream should be kept.

If keep is specified, the returned array WILL NOT have the same shape as the original. Instead it will be a 2D array of decompressed streams: the streams corresponding to True values in the keep mask.

If keep_indices is True and keep is specified, then an additional list is returned containing the indices of each stream that was kept.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| zgrp | Group | The group to read. | required |
| keep | array | Bool array of streams to keep in the decompression. | None |
| stream_slice | slice | A python slice with step size of one, indicating the sample range to extract from each stream. | None |
| keep_indices | bool | If True, also return the original indices of the streams. | False |
| mpi_comm | Comm | The optional MPI communicator over which to distribute the leading dimension of the array. | None |
| mpi_dist | list | The optional list of tuples specifying the first / last element of the leading dimension to assign to each process. | None |
| use_threads | bool | If True, use OpenMP threads to parallelize decoding. This is only beneficial for large arrays. | False |
| no_flatten | bool | If True, for single-stream arrays, leave the leading dimension of (1,) in the result. | False |
Returns:

| Type | Description |
|---|---|
| array | The loaded and decompressed data OR the array and the kept indices. |
Interactive Tools
The flacarray.demo submodule contains a few helper functions that are not imported by default. You will need to have the optional dependencies (matplotlib) installed to use the visualization tools. For testing, it is convenient to generate arrays consisting of random timestreams with some structure. The create_fake_data function can be used for this.
flacarray.demo.create_fake_data(local_shape, sigma=1.0, dtype=np.float64, seed=123456789, comm=None, dc_sigma=5)
Create fake random data for testing.
This is a helper function to generate some random data for testing.

If sigma is None, uniform random values are returned. If sigma is not None, samples drawn from a Gaussian distribution are returned.

If comm is not None, the data is created on one process and then pieces are distributed among the processes.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| local_shape | tuple | The local shape of the data on this process. | required |
| sigma | float | The width of the distribution or None. | 1.0 |
| dtype | dtype | The data type of the returned array. | float64 |
| seed | int | The optional seed for np.random. | 123456789 |
| comm | Comm | The MPI communicator or None. | None |
Returns:

| Type | Description |
|---|---|
| tuple | (The random data on the local process, MPI distribution). |
Most data arrays in practice have 2 or 3 dimensions. If the number of streams is relatively small, then an uncompressed array can be plotted with the plot_data function.
flacarray.demo.plot_data(data, keep=None, stream_slc=slice(None), file=None)
Low-Level Tools
For specialized use cases, you can also work directly with the compressed bytestream and auxiliary arrays and convert to / from numpy arrays.
flacarray.compress.array_compress(arr, level=5, quanta=None, precision=None, use_threads=False)
Compress a numpy array with optional floating point conversion.
If arr is an int32 array, the returned stream offsets and gains will be None. If arr is an int64 array, the returned stream offsets and gains will be None and the calling code is responsible for tracking that the compressed bytes are associated with a 64bit stream.

If the input array is float32 or float64, exactly one of quanta or precision must be specified. Both float32 and float64 data will have floating point offset and gain arrays returned. See the discussion in the FlacArray class documentation about how the offsets and gains are computed for a given quanta.

The returned auxiliary arrays (starts, nbytes, etc.) will have a shape corresponding to the leading shape of the input array. If the input array is a single stream, the returned auxiliary information will be arrays with a single element.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| arr | ndarray | The input array data. | required |
| level | int | Compression level (0-8). | 5 |
| quanta | float or array | For floating point data, the floating point increment of each integer value. Optionally an array of increments, one per stream. | None |
| precision | int or array | Number of significant digits to retain in float-to-int conversion. Alternative to quanta. | None |
| use_threads | bool | If True, use OpenMP threads to parallelize decoding. This is only beneficial for large arrays. | False |
Returns:

| Type | Description |
|---|---|
| tuple | The (compressed bytes, stream starts, stream_nbytes, stream offsets, stream gains). |
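A minimal round-trip sketch using this function together with array_decompress (documented below); the tuple unpacking follows the return value listed above:
import numpy as np
from flacarray.compress import array_compress
from flacarray.decompress import array_decompress

arr = np.random.normal(size=(4, 10000)).astype(np.float32)
comp, starts, nbytes, offsets, gains = array_compress(arr, level=5, quanta=1.0e-5)
restored = array_decompress(
    comp,
    arr.shape[-1],  # stream_size
    starts,
    nbytes,
    stream_offsets=offsets,
    stream_gains=gains,
)
print(np.max(np.abs(restored - arr)))  # roughly at the quanta level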
flacarray.decompress.array_decompress(compressed, stream_size, stream_starts, stream_nbytes, stream_offsets=None, stream_gains=None, first_stream_sample=None, last_stream_sample=None, is_int64=False, use_threads=False, no_flatten=False)
Decompress a FLAC encoded array and restore original data type.
If both stream_gains and stream_offsets are specified, the output will be floating point data. If neither is specified, the output will be integer data. It is an error to specify only one of those options.

The compressed byte stream might contain either int32 or int64 data, and the calling code is responsible for tracking this. The is_int64 parameter should be set to True if the byte stream contains 64bit integers.

To decompress a subset of samples in all streams, specify the first_stream_sample and last_stream_sample values. None values or negative values disable this feature.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| compressed | array | The array of compressed bytes. | required |
| stream_size | int | The length of the decompressed final dimension. | required |
| stream_starts | array | The array of starting bytes in the bytestream. | required |
| stream_nbytes | array | The array of number of bytes in each stream. | required |
| stream_offsets | array | The array of offsets, one per stream. | None |
| stream_gains | array | The array of gains, one per stream. | None |
| first_stream_sample | int | The first sample of every stream to decompress. | None |
| last_stream_sample | int | The last sample of every stream to decompress. | None |
| is_int64 | bool | If True, the compressed stream contains 64bit integers. | False |
| use_threads | bool | If True, use OpenMP threads to parallelize decoding. This is only beneficial for large arrays. | False |
| no_flatten | bool | If True, for single-stream arrays, leave the leading dimension of (1,) in the result. | False |
Returns:

| Type | Description |
|---|---|
| array | The output array. |
flacarray.decompress.array_decompress_slice(compressed, stream_size, stream_starts, stream_nbytes, stream_offsets=None, stream_gains=None, keep=None, first_stream_sample=None, last_stream_sample=None, is_int64=False, use_threads=False, no_flatten=False)
Decompress a slice of a FLAC encoded array and restore original data type.
If both stream_gains and stream_offsets are specified, the output will be floating point data. If neither is specified, the output will be integer data. It is an error to specify only one of those options.

The compressed byte stream might contain either int32 or int64 data, and the calling code is responsible for tracking this. The is_int64 parameter should be set to True if the byte stream contains 64bit integers.

To decompress a subset of samples in all streams, specify the first_stream_sample and last_stream_sample values. None values or negative values disable this feature.

To decompress a subset of streams, pass a boolean array to the keep argument. This should have the same shape as the starts array. Only streams with a True value in the keep array will be decompressed.

If the keep array is specified, the output tuple will contain the 2D array of streams that were kept, as well as a list of tuples indicating the original array indices for each stream in the output. If the keep array is None, the output tuple will contain an array with the original N-dimensional leading array shape and the trailing number of samples. The second element of the tuple will be None.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| compressed | array | The array of compressed bytes. | required |
| stream_size | int | The length of the decompressed final dimension. | required |
| stream_starts | array | The array of starting bytes in the bytestream. | required |
| stream_nbytes | array | The array of number of bytes in each stream. | required |
| stream_offsets | array | The array of offsets, one per stream. | None |
| stream_gains | array | The array of gains, one per stream. | None |
| keep | array | Bool array of streams to keep in the decompression. | None |
| first_stream_sample | int | The first sample of every stream to decompress. | None |
| last_stream_sample | int | The last sample of every stream to decompress. | None |
| is_int64 | bool | If True, the compressed stream contains 64bit integers. | False |
| use_threads | bool | If True, use OpenMP threads to parallelize decoding. This is only beneficial for large arrays. | False |
| no_flatten | bool | If True, for single-stream arrays, leave the leading dimension of (1,) in the result. | False |
Returns:

| Type | Description |
|---|---|
| tuple | The (output array, list of stream indices). |
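A sketch of decompressing a sample range from a single stream with a keep mask (self-contained, reusing array_compress from above; the endpoint semantics follow the parameter descriptions here):
import numpy as np
from flacarray.compress import array_compress
from flacarray.decompress import array_decompress_slice

arr = np.random.normal(size=(4, 10000)).astype(np.float32)
comp, starts, nbytes, offsets, gains = array_compress(arr, quanta=1.0e-5)
keep = np.zeros(starts.shape, dtype=bool)
keep[0] = True  # decompress only the first stream
sub, indices = array_decompress_slice(
    comp, arr.shape[-1], starts, nbytes,
    stream_offsets=offsets, stream_gains=gains,
    keep=keep, first_stream_sample=100, last_stream_sample=200,
)
print(sub.shape, indices)  # kept stream samples in the requested range, plus its original index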
Developer Notes
To-Do
Discuss:
- Code formatting (ruff)
- PR workflow