FLACArray
This package provides a set of tools for compressing multi-dimensional arrays where the last array dimension consists of "streams" of data. These streams are compressed with the FLAC algorithm and can be written to different file formats as well as decompressed back into numpy arrays.
FLAC compression is particularly suited to "noisy" timestreams that do not compress well with DEFLATE algorithms used by zip / gzip. This type of data is found in audio signals, scientific timestreams, etc.
In the flacarray package we use only a small subset of the features found in the libFLAC library. In particular, each data stream is compressed as either one or two 32 bit "channels". Stream data consisting of 32 or 64 bit integers is compressed in a lossless fashion. Floating point data is converted to 32 or 64 bit integers with a user-specified precision or quantization.
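As a minimal sketch of this behavior (using the FlacArray class described later in this documentation), integer data round-trips exactly while floating point data requires a quantization choice:
import numpy as np
from flacarray import FlacArray

# Integer streams are compressed losslessly.
ints = np.random.randint(-1000, 1000, size=(4, 10000), dtype=np.int32)
flc_int = FlacArray.from_array(ints)
print(np.array_equal(flc_int.to_array(), ints))  # True

# Floating point streams need a quantization choice (quanta or precision).
floats = np.random.normal(size=(4, 10000))
flc_flt = FlacArray.from_array(floats, quanta=1.0e-6)
restored = flc_flt.to_array()  # lossy, at roughly the quanta level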
If you are specifically working with audio data and want to write flac format audio files, you should look at other software tools such as pyflac.
Installation
For most use cases, you can just install flacarray from pre-built Python wheels or conda packages. For specialized use cases or development, it is straightforward to build the package from source, using either a conda environment for dependencies or with those obtained through your OS package manager.
Python Wheels
You can install pre-built wheels from PyPI using pip within a virtualenv:
pip install flacarray
Or, if you are using a shared python environment you can install to a user location with:
pip install --user flacarray
Conda Packages
If you are using a conda environment, you can install the conda package for flacarray from the conda-forge channel:
conda install -c conda-forge flacarray
To-Do
flacarray is not yet on conda-forge.
Building From Source
In order to build from source, you will need a C compiler and the FLAC development libraries installed.
Building Within a Conda Environment
If you have conda available, you can create an environment with all the dependencies needed to build flacarray from source. In this example, we create an environment called "flacarray". First create the environment with all dependencies and activate it (FIXME: add a requirements file for dev):
conda create -n flacarray \
c-compiler numpy libflac cython meson-python pkgconfig
conda activate flacarray
Now you can go into your local git checkout of the flacarray source and do:
pip install .
This builds and installs the package.
To also work on docs, install additional packages:
conda install mkdocs mkdocstrings mkdocstrings-python mkdocs-jupyter
pip install mkdocs-print-site-plugin
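Assuming the repository ships a mkdocs.yml configuration (not verified here), you can then preview and build the documentation with the standard mkdocs commands:
mkdocs serve   # live preview at http://127.0.0.1:8000
mkdocs build   # build the static site into the site/ directory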
Other Ways of Building
To-Do
Discuss OS packages, document apt, rpm, homebrew options.
Tutorial
The flacarray package has tools for working with compressed arrays in memory, as well as saving and loading them to and from several file formats. This tutorial makes use of some interactive helper functions in the flacarray.demo package.
import numpy as np
import h5py # For optional I/O operations
import zarr # For optional I/O operations
FlacArray - Compressed Arrays in Memory
The primary class for working with compressed arrays in memory is the FlacArray class. You can construct one of these from a numpy array with a class method. First create some fake data in a numpy array for testing. This is a small 3-D array and the final dimension is always the one that is compressed. This last dimension should consist of "streams" of data.
from flacarray import FlacArray, demo
# Create a 3D array where the last dimension is the "streams" we are compressing.
arr, _ = demo.create_fake_data((4, 3, 10000))
# How large is this in memory?
print(f"Input array is {arr.nbytes} bytes")
Input array is 960000 bytes
# Plot these streams
demo.plot_data(arr)
Create From Array
Now create a FlacArray from this. Integer data is compressed in a lossless fashion, but since this is floating point data, we must choose either a quantization value or a fixed precision (number of decimal places) when converting to integers.
# Create a compressed array
flcarr = FlacArray.from_array(arr, precision=10)
# Properties of the compressed array
print(flcarr)
<FlacArray float64 shape=(4, 3, 10000) bytes=522899>
Decompress Back to Array
Now decompress back to a numpy array. The result will only be bitwise identical to the original for arrays of 32 or 64 bit integers; for floating point data the round trip is lossy at the level of the chosen quantization.
# Restore back to an array
restored = flcarr.to_array()
demo.plot_data(restored)
# Plot the residual
residual = restored - arr
demo.plot_data(residual)
Slicing
A subset of the full array can be decompressed on the fly. Any fancy array indexing can be used for the leading dimensions, but only contiguous slices or individual samples are supported in the last dimension.
subarr = flcarr[1:2, :, 200:300]
demo.plot_data(subarr)
Writing and Reading
The FlacArray class has methods to write the internal compressed data and metadata to both h5py and zarr groups. The data members written to these file formats are simple arrays and scalars. Supporting other formats in the future would be straightforward. When decompressing data from disk, you can choose to decompress only a subset of the streams. Here is an example writing the compressed array to HDF5 and loading it back in.
with h5py.File("flcarr.h5", "w") as hf:
    flcarr.write_hdf5(hf)
# We can load this back into a new FlacArray using a class method
with h5py.File("flcarr.h5", "r") as hf:
    new_flcarr = FlacArray.read_hdf5(hf)
# The compressed representations should be equal...
print(new_flcarr == flcarr)
True
You can also load in just a subset of the streams using a "keep" mask. This is a boolean array with the same shape as the leading dimensions of the original array.
leading_shape = arr.shape[:-1]
keep = np.zeros(leading_shape, dtype=bool)
# Select the first and last stream on the second row
keep[1, 0] = True
keep[1, -1] = True
# Load just these streams
with h5py.File("flcarr.h5", "r") as hf:
    sub_flcarr = FlacArray.read_hdf5(hf, keep=keep)
# Decompress and plot
demo.plot_data(sub_flcarr.to_array())
Direct I/O and Compression of Numpy Arrays
For some use cases, there is no need to keep the full compressed data in memory (in a FlacArray). Instead, a normal numpy array is compressed when writing to a file and decompressed back into a numpy array when reading. The package has high-level functions for performing this kind of operation. When decompressing, a subset of streams can be loaded from disk, and a sample range can be specified when doing the decompression.
HDF5
The hdf5 sub-module has helper functions for direct I/O to HDF5.
import flacarray.hdf5
# Write a numpy array directly to HDF5. This is equivalent to doing:
#
#   temp = FlacArray.from_array(arr, precision=10)
#   with h5py.File("test.h5", "w") as hf:
#       temp.write_hdf5(hf)
#
with h5py.File("test.h5", "w") as hf:
    flacarray.hdf5.write_array(arr, hf, precision=10)
with h5py.File("test.h5", "r") as hf:
    restored = flacarray.hdf5.read_array(hf)
demo.plot_data(restored)
# Load only a subset of streams and a slice of samples in those streams.
# This is equivalent to the following code:
#
#   with h5py.File("test.h5", "r") as hf:
#       restored = FlacArray.read_hdf5(hf, keep=keep)
#   sub_restored = restored[:, 200:300]
#
with h5py.File("test.h5", "r") as hf:
    sub_restored = flacarray.hdf5.read_array(hf, keep=keep, stream_slice=slice(200, 300, 1))
demo.plot_data(sub_restored)
Zarr
The zarr package provides an h5py-like interface for creating groups with attributes and "datasets" (arrays) on disk. Given an existing zarr.hierarchy.Group, you can compress and write an array and then load it back in. This is almost identical to the HDF5 syntax above. The flacarray package includes a helper class (ZarrGroup) which acts as a context manager around an open file. However, if you already have a handle to a Group, you can pass that directly to flacarray.zarr.write_array() and flacarray.zarr.read_array().
import flacarray.zarr
with flacarray.zarr.ZarrGroup("test.zarr", mode="w") as zf:
    flacarray.zarr.write_array(arr, zf, precision=10)
with flacarray.zarr.ZarrGroup("test.zarr", mode="r") as zf:
    restored = flacarray.zarr.read_array(zf)
demo.plot_data(restored)
# Specifying a keep mask and sample slice also works.
with flacarray.zarr.ZarrGroup("test.zarr", mode="r") as zf:
    sub_restored = flacarray.zarr.read_array(zf, keep=keep, stream_slice=slice(200, 300, 1))
demo.plot_data(sub_restored)
Cook Book
This notebook contains some more advanced examples addressing common usage patterns. Look at the Tutorial first to get a better sense of the big picture of the tools.
# Set the number of OpenMP threads to use
import os
os.environ["OMP_NUM_THREADS"] = "4"
import time
import numpy as np
import h5py
from flacarray import FlacArray, demo
import flacarray.hdf5
Random Access to Large Arrays
Consider a common case where we have a 2D array that represents essentially a "list" of timestreams of data. We might have thousands of timestreams, each with millions of samples. Now we want to decompress and access a subset of those streams and / or samples. To reduce memory in this notebook we are using a slightly smaller array.
# Create a 2D array of streams
arr, _ = demo.create_fake_data((1000, 100000), dtype=np.float32)
# How large is this in memory?
print(f"Input array is {arr.nbytes} bytes")
Input array is 400000000 bytes
# Compress this with threads
start = time.perf_counter()
flcarr = FlacArray.from_array(arr, quanta=1.0e-7, use_threads=True)
stop = time.perf_counter()
print(f"Elapsed = {stop-start:0.3} seconds")
Elapsed = 1.23 seconds
# Compress this without threads
start = time.perf_counter()
flcarr = FlacArray.from_array(arr, quanta=1.0e-7, use_threads=False)
stop = time.perf_counter()
print(f"Elapsed = {stop-start:0.3} seconds")
Elapsed = 2.6 seconds
# Decompress the whole thing
start = time.perf_counter()
restored = flcarr.to_array()
stop = time.perf_counter()
print(f"Elapsed = {stop-start:0.3} seconds")
Elapsed = 0.447 seconds
# Decompress the whole thing with threads
del restored
start = time.perf_counter()
restored = flcarr.to_array(use_threads=True)
stop = time.perf_counter()
print(f"Elapsed = {stop-start:0.3} seconds")
Elapsed = 0.439 seconds
Subset of Samples for All Streams
If our 2D array of streams contains co-sampled data, we might want to examine a slice in time of all streams. Imagine we want to get data near the end of the array for all streams:
n_end = 10000
start = time.perf_counter()
end_arr = flcarr.to_array(stream_slice=slice(-n_end, None, 1))
stop = time.perf_counter()
print(f"Elapsed = {stop-start:0.3} seconds")
Elapsed = 0.43 seconds
Subset of Samples for a Few Streams
Imagine we want the last 10000 samples of one stream in the middle. We can use a "keep" mask combined with a sample slice:
n_end = 10000
keep = np.zeros(arr.shape[:-1], dtype=bool)
keep[500] = True
start = time.perf_counter()
sub_arr = flcarr.to_array(keep=keep, stream_slice=slice(-n_end, None, 1))
stop = time.perf_counter()
print(f"Elapsed = {stop-start:0.3} seconds")
print(sub_arr)
Elapsed = 0.00201 seconds
[[ 1.3499217 1.1607051 -1.0080613 ... 0.2447555 1.0821551 0.03497732]]
So, we can see that decompressing a small number of random samples from a multi-GB dataset in memory is very fast.
Quantization Effects
As mentioned previously, 32 and 64 bit integer data is compressed in a lossless fashion with FLAC, using either one or two audio channels. When compressing 32 or 64 bit floating point data, a choice must be made about the "amount" of floating point value assigned to each integer. This quanta parameter is a tradeoff between fidelity of the input data and compression factor. One "safe" choice is to pick a quantization value that is near the machine epsilon for float32 or float64. Although this will ensure nearly lossless compression, the compression ratio will be very poor.
To achieve better compression of floating point data, you should consider the dynamic range and actual precision of this data. As an example, consider some fake 32 bit floating point data:
# Create a single timestream
arr, _ = demo.create_fake_data((10000,), dtype=np.float32)
# How large is this in memory?
print(f"Input array is {arr.nbytes} bytes")
Input array is 40000 bytes
# Plot this
demo.plot_data(arr)
Now compress / decompress this data with a very conservative quanta value:
# Create a compressed array
flcarr = FlacArray.from_array(arr, quanta=1.0e-8)
How big is this compressed array?
print(f"FlacArray is {flcarr.nbytes} bytes")
FlacArray is 36161 bytes
So we see that our compression factor is only about 0.9, which is not a useful saving.
# Restore back to an array
restored = flcarr.to_array()
demo.plot_data(restored)
# Difference
residual = restored - arr
demo.plot_data(residual)
The residual difference after the roundtrip compression / decompression is close to the machine precision for float32.
Decreased Precision
Now imagine that we know the underlying precision of our data above is not really at the level of machine precision for float32. Instead, we know that our data came from measurements with a precision of 1.0e-4 in the units of this dataset. We can use that information when compressing:
# Create a compressed array
flcarr = FlacArray.from_array(arr, quanta=1.0e-4)
How big is this lower-precision compressed array?
print(f"FlacArray is {flcarr.nbytes} bytes")
FlacArray is 19664 bytes
So we have used information about our data to avoid storing unnecessary precision and have improved our compression ratio.
# Restore back to an array
restored = flcarr.to_array()
demo.plot_data(restored)
# Difference
residual = restored - arr
demo.plot_data(residual)
As expected, the residual is now comparable to the size of the quanta that we used.
Parallel I/O
To-Do: Discuss
- Interaction of threads, OpenMP versus libFLAC pthreads
- Use of MPI HDF5 with h5py
API Reference
The flacarray package consists of a primary class (FlacArray) plus a variety of helper functions.
Compressed Array Representation
The FlacArray class stores a compressed representation of an N-dimensional array where the last dimension consists of "streams" of numbers to be compressed.
flacarray.FlacArray
FLAC compressed array representation.
This class holds a compressed representation of an N-dimensional array. The final (fastest changing) dimension is the axis along which the data is compressed. Each of the vectors in this last dimension is called a "stream" here. The leading dimensions of the original matrix form an array of these streams.
Internally, the data is stored as a contiguous concatenation of the bytes from these compressed streams. A separate array contains the starting byte of each stream in the overall bytes array. The shape of the starting array corresponds to the shape of the leading, un-compressed dimensions of the original array.
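For example, a small sketch that inspects this layout through the compressed, stream_starts, and stream_nbytes properties documented below (assuming these properties are returned as numpy arrays):
import numpy as np
from flacarray import FlacArray

arr = np.random.normal(size=(4, 3, 10000))
flc = FlacArray.from_array(arr, quanta=1.0e-6)

# The starting-byte array has the shape of the leading dimensions...
print(flc.stream_starts.shape)  # (4, 3)
# ...and indexes into a single concatenated buffer of compressed bytes.
print(len(flc.compressed), int(np.sum(flc.stream_nbytes)))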
If the input data is 32bit or 64bit integers, each stream in the array is compressed directly with FLAC.
If the input data is 32bit or 64bit floating point numbers, then you must specify exactly one of either quanta or precision when calling from_array(). For floating point data, the mean of each stream is computed and rounded to the nearest whole quanta. This "offset" per stream is recorded and subtracted from the stream. The offset-subtracted stream data is then rescaled and truncated to integers (int32 or int64 depending on the bit width of the input array). If quanta is specified, the data is rescaled by 1 / quanta. The quanta may either be a scalar applied to all streams, or an array of values, one per stream. If instead the precision (integer number of decimal places) is specified, this is converted to a quanta by dividing the stream RMS by 10^{precision}. Similar to quanta, the precision may be specified as a single value for all streams, or as an array of values, one per stream.
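As a sketch of that conversion (based only on the description above; the package's exact rounding behavior may differ):
import numpy as np

stream = np.random.normal(scale=2.0, size=10000)
precision = 3
rms = np.sqrt(np.mean(stream**2))
quanta = rms / 10**precision                          # ~0.002 for an RMS of ~2.0
offset = quanta * np.round(np.mean(stream) / quanta)  # mean rounded to a whole quanta
ints = np.round((stream - offset) / quanta).astype(np.int32)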
If you choose a quanta value that is close to machine epsilon (e.g. 1e-7 for 32bit or 1e-16 for 64bit), then the compression amount will be negligible but the results nearly lossless. Compression of floating point data should not be done blindly and you should consider the underlying precision of the data you are working with in order to achieve the best compression possible.
The following rules summarize the data conversion that is performed depending on the input type:

- int32: No conversion. Compressed to a single channel FLAC bytestream.
- int64: No conversion. Compressed to a 2-channel (stereo) FLAC bytestream.
- float32: Subtract the offset per stream and scale data based on the quanta value or precision (see above). Then round to the nearest 32bit integer.
- float64: Subtract the offset per stream and scale data based on the quanta value or precision (see above). Then round to the nearest 64bit integer.
After conversion to integers, each stream's data is separately compressed into a sequence of FLAC bytes, which is appended to the bytestream. The offset in bytes for each stream is recorded.
A FlacArray is only constructed directly when making a copy. Use the class methods to create FlacArrays from numpy arrays or on-disk representations.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| other | FlacArray | Construct a copy of the input FlacArray. | required |
- compressed (property): The concatenated raw bytes of all streams on the local process.
- dtype (property): The dtype of the uncompressed array.
- global_leading_shape (property): The global shape of leading uncompressed dimensions across all processes.
- global_nbytes (property): The sum of total bytes used by compressed data across all processes.
- global_nstreams (property): Number of global streams (product of entries of global_leading_shape).
- global_process_nbytes (property): The bytes used by compressed data on each process.
- global_shape (property): The global shape of array across any MPI communicator.
- global_stream_nbytes (property): The array of nbytes within the global compressed data.
- global_stream_starts (property): The array of starting bytes within the global compressed data.
- leading_shape (property): The local shape of leading uncompressed dimensions.
- mpi_comm (property): The MPI communicator over which the array is distributed.
- mpi_dist (property): The range of the leading dimension assigned to each MPI process.
- nbytes (property): The total number of bytes used by compressed data on the local process.
- nstreams (property): The number of local streams (product of entries of leading_shape).
- shape (property): The shape of the local, uncompressed array.
- stream_gains (property): The gain factor for each stream during conversion to int32.
- stream_nbytes (property): The array of nbytes for each stream on the local process.
- stream_offsets (property): The value subtracted from each stream during conversion to int32.
- stream_size (property): The uncompressed length of each stream.
- stream_starts (property): The array of starting bytes for each stream on the local process.
- typestr (property): A string representation of the original data type.
__getitem__(raw_key)
Decompress a slice of data on the fly.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| raw_key | tuple | A tuple of slices or integers. | required |

Returns:

| Type | Description |
|---|---|
| array | The decompressed array slice. |
from_array(arr, level=5, quanta=None, precision=None, mpi_comm=None, use_threads=False)
classmethod
Construct a FlacArray from a numpy ndarray.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| arr | ndarray | The input array data. | required |
| level | int | Compression level (0-8). | 5 |
| quanta | float or array | For floating point data, the floating point increment of each 32bit integer value. Optionally an iterable of increments, one per stream. | None |
| precision | int or array | Number of significant digits to retain in float-to-int conversion. Alternative to quanta. | None |
| mpi_comm | Comm | If specified, the input array is assumed to be distributed across the communicator at the leading dimension. The local piece of the array is passed in on each process. | None |
| use_threads | bool | If True, use OpenMP threads to parallelize decoding. This is only beneficial for large arrays. | False |
Returns:

| Type | Description |
|---|---|
| FlacArray | A newly constructed FlacArray. |
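A usage sketch with a per-stream quanta array (the values here are arbitrary and purely illustrative):
import numpy as np
from flacarray import FlacArray

data = np.random.normal(size=(8, 50000))
# One quanta value per stream, matching the leading shape of the array.
per_stream_quanta = np.full(data.shape[:-1], 1.0e-5)
flc = FlacArray.from_array(data, level=5, quanta=per_stream_quanta)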
read_hdf5(hgrp, keep=None, mpi_comm=None, mpi_dist=None, no_flatten=False)
classmethod
Construct a FlacArray from an HDF5 Group.
This function loads all information about the array from an HDF5 group. If mpi_comm is specified, the created array is distributed over that communicator. If you also wish to use MPI I/O to read data from the group, then you must be using an MPI-enabled h5py and you should pass in a valid handle to the group on all processes.

If mpi_dist is specified, it should be an iterable with the number of leading dimension elements assigned to each process. If None, the leading dimension will be distributed uniformly.

If keep is specified, this should be a boolean array with the same shape as the leading dimensions of the original array. True values in this array indicate that the stream should be kept.

If keep is specified, the returned array WILL NOT have the same shape as the original. Instead it will be a 2D array of decompressed streams: the streams corresponding to True values in the keep mask.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| hgrp | Group | The open Group for reading. | required |
| keep | array | Bool array of streams to keep in the decompression. | None |
| mpi_comm | Comm | If specified, the communicator over which to distribute the leading dimension. | None |
| mpi_dist | array | If specified, assign blocks of these sizes to processes when distributing the leading dimension. | None |
| no_flatten | bool | If True, for single-stream arrays, leave the leading dimension of (1,) in the result. | False |
Returns:

| Type | Description |
|---|---|
| FlacArray | A newly constructed FlacArray. |
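A sketch of the MPI-distributed read (assuming mpi4py is available and that it is acceptable to open the file read-only on every process; whether the underlying read is parallel depends on your h5py build, as noted above):
from mpi4py import MPI
import h5py
from flacarray import FlacArray

comm = MPI.COMM_WORLD
with h5py.File("flcarr.h5", "r") as hf:
    flc = FlacArray.read_hdf5(hf, mpi_comm=comm)
# Each process now holds its share of the leading dimension.
print(comm.rank, flc.leading_shape)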
read_zarr(zgrp, keep=None, mpi_comm=None, mpi_dist=None, no_flatten=False)
classmethod
Construct a FlacArray from a Zarr Group.
This function loads all information about the array from a zarr group. If mpi_comm is specified, the created array is distributed over that communicator.

If mpi_dist is specified, it should be an iterable with the number of leading dimension elements assigned to each process. If None, the leading dimension will be distributed uniformly.

If keep is specified, this should be a boolean array with the same shape as the leading dimensions of the original array. True values in this array indicate that the stream should be kept.

If keep is specified, the returned array WILL NOT have the same shape as the original. Instead it will be a 2D array of decompressed streams: the streams corresponding to True values in the keep mask.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| zgrp | Group | The open Group for reading. | required |
| keep | array | Bool array of streams to keep in the decompression. | None |
| mpi_comm | Comm | If specified, the communicator over which to distribute the leading dimension. | None |
| mpi_dist | array | If specified, assign blocks of these sizes to processes when distributing the leading dimension. | None |
| no_flatten | bool | If True, for single-stream arrays, leave the leading dimension of (1,) in the result. | False |
Returns:

| Type | Description |
|---|---|
| FlacArray | A newly constructed FlacArray. |
to_array(keep=None, stream_slice=None, keep_indices=False, use_threads=False)
Decompress local data into a numpy array.
This uses the compressed representation to reconstruct a normal numpy array. The returned data type will be either int32, int64, float32, or float64 depending on the original data type.
If stream_slice is specified, the returned array will have only that range of samples in the final dimension.

If keep is specified, this should be a boolean array with the same shape as the leading dimensions of the original array. True values in this array indicate that the stream should be kept.

If keep is specified, the returned array WILL NOT have the same shape as the original. Instead it will be a 2D array of decompressed streams: the streams corresponding to True values in the keep mask.

If keep_indices is True and keep is specified, then a tuple of two values is returned. The first is the array of decompressed streams. The second is a list of tuples, each of which specifies the indices of the stream in the original array.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| keep | array | Bool array of streams to keep in the decompression. | None |
| stream_slice | slice | A python slice with step size of one, indicating the sample range to extract from each stream. | None |
| keep_indices | bool | If True, also return the original indices of the streams. | False |
| use_threads | bool | If True, use OpenMP threads to parallelize decoding. This is only beneficial for large arrays. | False |
write_hdf5(hgrp)
Write data to an HDF5 Group.
The internal object properties are written to an open HDF5 group. If you wish to use MPI I/O to write data to the group, then you must be using an MPI enabled h5py and you should pass in a valid handle to the group on all processes.
If the FlacArray is distributed over an MPI communicator, but the h5py implementation does not support MPI I/O, then all data will be communicated to the rank zero process for writing. In this case, the hgrp argument should be None except on the root process.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| hgrp | Group | The open Group for writing. | required |
Returns:

| Type | Description |
|---|---|
| None | |
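A sketch of the serial-h5py case described above, where only the root process opens the file and the other ranks pass None (assuming mpi4py and a FlacArray flc distributed over the communicator):
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
# Only rank zero opens the output file; other ranks pass None.
hgrp = h5py.File("flcarr.h5", "w") if comm.rank == 0 else None
flc.write_hdf5(hgrp)
if hgrp is not None:
    hgrp.close()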
write_zarr(zgrp)
Write data to a Zarr Group.
The internal object properties are written to an open zarr group.
If the FlacArray is distributed over an MPI communicator, then all data will be communicated to the rank zero process for writing. In this case, the zgrp argument should be None except on the root process.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| zgrp | Group | The open Group for writing. | required |
Returns:

| Type | Description |
|---|---|
| None | |
Direct I/O
Sometimes code has no need to store compressed arrays in memory. Instead, it may be desirable to have full arrays in memory and compressed arrays on disk. In those situations, you can use several helper functions to write and read numpy arrays directly to / from files.
HDF5
You can write to / read from an h5py Group using functions in the hdf5 submodule.
flacarray.hdf5.write_array(arr, hgrp, level=5, quanta=None, precision=None, mpi_comm=None, use_threads=False)
Compress a numpy array and write to an HDF5 group.
This function is useful if you do not need to access the compressed array in memory and only wish to write it directly to HDF5. The input array is compressed and then the write_compressed() function is called.

If the input array is int32 or int64, the compression is lossless and the compressed bytes and ancillary data is written to datasets within the output group. If the array is float32 or float64, either the quanta or precision must be specified. See the discussion in the FlacArray class documentation about how the offsets and gains are computed for a given quanta. The offsets and gains are also written as datasets within the output group.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| arr | array | The input numpy array. | required |
| hgrp | Group | The Group to use. | required |
| level | int | Compression level (0-8). | 5 |
| quanta | float or array | For floating point data, the floating point increment of each 32bit integer value. Optionally an iterable of increments, one per stream. | None |
| precision | int or array | Number of significant digits to retain in float-to-int conversion. Alternative to quanta. | None |
| mpi_comm | Comm | If specified, the input array is assumed to be distributed across the communicator at the leading dimension. The local piece of the array is passed in on each process. | None |
| use_threads | bool | If True, use OpenMP threads to parallelize decoding. This is only beneficial for large arrays. | False |
Returns:

| Type | Description |
|---|---|
| None | |
flacarray.hdf5.read_array(hgrp, keep=None, stream_slice=None, keep_indices=False, mpi_comm=None, mpi_dist=None, use_threads=False)
Load a numpy array from compressed HDF5.
This function is useful if you do not need to store a compressed representation of the array in memory. Each stream will be read individually from the file and the desired slice decompressed. This avoids storing the full compressed data.
This function acts as a dispatch to the correct version of the reading function. The function is selected based on the format version string in the data.
If stream_slice is specified, the returned array will have only that range of samples in the final dimension.

If keep is specified, this should be a boolean array with the same shape as the leading dimensions of the original array. True values in this array indicate that the stream should be kept.

If keep is specified, the returned array WILL NOT have the same shape as the original. Instead it will be a 2D array of decompressed streams: the streams corresponding to True values in the keep mask.

If keep_indices is True and keep is specified, then an additional list is returned containing the indices of each stream that was kept.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| hgrp | Group | The group to read. | required |
| keep | array | Bool array of streams to keep in the decompression. | None |
| stream_slice | slice | A python slice with step size of one, indicating the sample range to extract from each stream. | None |
| keep_indices | bool | If True, also return the original indices of the streams. | False |
| mpi_comm | Comm | The optional MPI communicator over which to distribute the leading dimension of the array. | None |
| mpi_dist | list | The optional list of tuples specifying the first / last element of the leading dimension to assign to each process. | None |
| use_threads | bool | If True, use OpenMP threads to parallelize decoding. This is only beneficial for large arrays. | False |
Returns:

| Type | Description |
|---|---|
| array | The loaded and decompressed data OR the array and the kept indices. |
Zarr
You can write to / read from a zarr hierarchy Group using functions in the zarr submodule.
flacarray.zarr.write_array(arr, zgrp, level=5, quanta=None, precision=None, mpi_comm=None, use_threads=False)
Compress a numpy array and write to a Zarr group.
This function is useful if you do not need to access the compressed array in memory and only wish to write it directly to Zarr files. The input array is compressed and then the write_compressed() function is called.

If the input array is int32 or int64, the compression is lossless and the compressed bytes and ancillary data is written to datasets within the output group. If the array is float32 or float64, either the quanta or precision must be specified. See the discussion in the FlacArray class documentation about how the offsets and gains are computed for a given quanta. The offsets and gains are also written as datasets within the output group.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| arr | array | The input numpy array. | required |
| zgrp | Group | The Group to use. | required |
| level | int | Compression level (0-8). | 5 |
| quanta | float or array | For floating point data, the floating point increment of each 32bit integer value. Optionally an iterable of increments, one per stream. | None |
| precision | int or array | Number of significant digits to retain in float-to-int conversion. Alternative to quanta. | None |
| mpi_comm | Comm | If specified, the input array is assumed to be distributed across the communicator at the leading dimension. The local piece of the array is passed in on each process. | None |
| use_threads | bool | If True, use OpenMP threads to parallelize decoding. This is only beneficial for large arrays. | False |
Returns:

| Type | Description |
|---|---|
| None | |
flacarray.zarr.read_array(zgrp, keep=None, stream_slice=None, keep_indices=False, mpi_comm=None, mpi_dist=None, use_threads=False, no_flatten=False)
Load a numpy array from a compressed Zarr group.
This function is useful if you do not need to store a compressed representation of the array in memory. Each stream will be read individually from the file and the desired slice decompressed. This avoids storing the full compressed data.
This function acts as a dispatch to the correct version of the reading function. The function is selected based on the format version string in the data.
If stream_slice is specified, the returned array will have only that range of samples in the final dimension.

If keep is specified, this should be a boolean array with the same shape as the leading dimensions of the original array. True values in this array indicate that the stream should be kept.

If keep is specified, the returned array WILL NOT have the same shape as the original. Instead it will be a 2D array of decompressed streams: the streams corresponding to True values in the keep mask.

If keep_indices is True and keep is specified, then an additional list is returned containing the indices of each stream that was kept.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| zgrp | Group | The group to read. | required |
| keep | array | Bool array of streams to keep in the decompression. | None |
| stream_slice | slice | A python slice with step size of one, indicating the sample range to extract from each stream. | None |
| keep_indices | bool | If True, also return the original indices of the streams. | False |
| mpi_comm | Comm | The optional MPI communicator over which to distribute the leading dimension of the array. | None |
| mpi_dist | list | The optional list of tuples specifying the first / last element of the leading dimension to assign to each process. | None |
| use_threads | bool | If True, use OpenMP threads to parallelize decoding. This is only beneficial for large arrays. | False |
| no_flatten | bool | If True, for single-stream arrays, leave the leading dimension of (1,) in the result. | False |
Returns:

| Type | Description |
|---|---|
| array | The loaded and decompressed data OR the array and the kept indices. |
Interactive Tools
The flacarray.demo submodule contains a few helper functions that are not imported by default. You will need to have the optional dependencies (matplotlib) installed to use the visualization tools. For testing, it is convenient to generate arrays consisting of random timestreams with some structure. The create_fake_data function can be used for this.
flacarray.demo.create_fake_data(local_shape, sigma=1.0, dtype=np.float64, seed=123456789, comm=None, dc_sigma=5)
Create fake random data for testing.
This is a helper function to generate some random data for testing.

If sigma is None, uniform random values are returned. If sigma is not None, samples drawn from a Gaussian distribution are returned.

If comm is not None, the data is created on one process and then pieces are distributed among the processes.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| local_shape | tuple | The local shape of the data on this process. | required |
| sigma | float | The width of the distribution or None. | 1.0 |
| dtype | dtype | The data type of the returned array. | float64 |
| seed | int | The optional seed for np.random. | 123456789 |
| comm | Comm | The MPI communicator or None. | None |
Returns:

| Type | Description |
|---|---|
| tuple | (The random data on the local process, MPI distribution). |
Most data arrays in practice have 2 or 3 dimensions. If the number of streams is relatively small, then an uncompressed array can be plotted with the plot_data function.
flacarray.demo.plot_data(data, keep=None, stream_slc=slice(None), file=None)
Low-Level Tools
For specialized use cases, you can also work directly with the compressed bytestream and auxiliary arrays and convert to / from numpy arrays.
flacarray.compress.array_compress(arr, level=5, quanta=None, precision=None, use_threads=False)
Compress a numpy array with optional floating point conversion.
If arr is an int32 array, the returned stream offsets and gains will be None. If arr is an int64 array, the returned stream offsets and gains will be None and the calling code is responsible for tracking that the compressed bytes are associated with a 64bit stream.

If the input array is float32 or float64, exactly one of quanta or precision must be specified. Both float32 and float64 data will have floating point offset and gain arrays returned. See the discussion in the FlacArray class documentation about how the offsets and gains are computed for a given quanta.

The returned auxiliary arrays (starts, nbytes, etc.) will have a shape corresponding to the leading shape of the input array. If the input array is a single stream, the returned auxiliary information will be arrays with a single element.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| arr | ndarray | The input array data. | required |
| level | int | Compression level (0-8). | 5 |
| quanta | float or array | For floating point data, the floating point increment of each integer value. Optionally an array of increments, one per stream. | None |
| precision | int or array | Number of significant digits to retain in float-to-int conversion. Alternative to quanta. | None |
| use_threads | bool | If True, use OpenMP threads to parallelize decoding. This is only beneficial for large arrays. | False |
Returns:

| Type | Description |
|---|---|
| tuple | The (compressed bytes, stream starts, stream_nbytes, stream offsets, stream gains). |
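A minimal round-trip sketch using this function together with array_decompress (documented below); the tuple unpacking follows the return value listed above:
import numpy as np
from flacarray.compress import array_compress
from flacarray.decompress import array_decompress

arr = np.random.normal(size=(4, 10000)).astype(np.float32)
comp, starts, nbytes, offsets, gains = array_compress(arr, level=5, quanta=1.0e-5)
restored = array_decompress(
    comp,
    arr.shape[-1],  # stream_size
    starts,
    nbytes,
    stream_offsets=offsets,
    stream_gains=gains,
)
print(np.max(np.abs(restored - arr)))  # roughly at the quanta level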
flacarray.decompress.array_decompress(compressed, stream_size, stream_starts, stream_nbytes, stream_offsets=None, stream_gains=None, first_stream_sample=None, last_stream_sample=None, is_int64=False, use_threads=False, no_flatten=False)
Decompress a FLAC encoded array and restore original data type.
If both stream_gains and stream_offsets are specified, the output will be floating point data. If neither is specified, the output will be integer data. It is an error to specify only one of those options.

The compressed byte stream might contain either int32 or int64 data, and the calling code is responsible for tracking this. The is_int64 parameter should be set to True if the byte stream contains 64bit integers.

To decompress a subset of samples in all streams, specify the first_stream_sample and last_stream_sample values. None values or negative values disable this feature.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| compressed | array | The array of compressed bytes. | required |
| stream_size | int | The length of the decompressed final dimension. | required |
| stream_starts | array | The array of starting bytes in the bytestream. | required |
| stream_nbytes | array | The array of number of bytes in each stream. | required |
| stream_offsets | array | The array of offsets, one per stream. | None |
| stream_gains | array | The array of gains, one per stream. | None |
| first_stream_sample | int | The first sample of every stream to decompress. | None |
| last_stream_sample | int | The last sample of every stream to decompress. | None |
| is_int64 | bool | If True, the compressed stream contains 64bit integers. | False |
| use_threads | bool | If True, use OpenMP threads to parallelize decoding. This is only beneficial for large arrays. | False |
| no_flatten | bool | If True, for single-stream arrays, leave the leading dimension of (1,) in the result. | False |
Returns:

| Type | Description |
|---|---|
| array | The output array. |
flacarray.decompress.array_decompress_slice(compressed, stream_size, stream_starts, stream_nbytes, stream_offsets=None, stream_gains=None, keep=None, first_stream_sample=None, last_stream_sample=None, is_int64=False, use_threads=False, no_flatten=False)
Decompress a slice of a FLAC encoded array and restore original data type.
If both stream_gains and stream_offsets are specified, the output will be floating point data. If neither is specified, the output will be integer data. It is an error to specify only one of those options.

The compressed byte stream might contain either int32 or int64 data, and the calling code is responsible for tracking this. The is_int64 parameter should be set to True if the byte stream contains 64bit integers.

To decompress a subset of samples in all streams, specify the first_stream_sample and last_stream_sample values. None values or negative values disable this feature.

To decompress a subset of streams, pass a boolean array to the keep argument. This should have the same shape as the starts array. Only streams with a True value in the keep array will be decompressed.

If the keep array is specified, the output tuple will contain the 2D array of streams that were kept, as well as a list of tuples indicating the original array indices for each stream in the output. If the keep array is None, the output tuple will contain an array with the original N-dimensional leading array shape and the trailing number of samples. The second element of the tuple will be None.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| compressed | array | The array of compressed bytes. | required |
| stream_size | int | The length of the decompressed final dimension. | required |
| stream_starts | array | The array of starting bytes in the bytestream. | required |
| stream_nbytes | array | The array of number of bytes in each stream. | required |
| stream_offsets | array | The array of offsets, one per stream. | None |
| stream_gains | array | The array of gains, one per stream. | None |
| keep | array | Bool array of streams to keep in the decompression. | None |
| first_stream_sample | int | The first sample of every stream to decompress. | None |
| last_stream_sample | int | The last sample of every stream to decompress. | None |
| is_int64 | bool | If True, the compressed stream contains 64bit integers. | False |
| use_threads | bool | If True, use OpenMP threads to parallelize decoding. This is only beneficial for large arrays. | False |
| no_flatten | bool | If True, for single-stream arrays, leave the leading dimension of (1,) in the result. | False |
Returns:

| Type | Description |
|---|---|
| tuple | The (output array, list of stream indices). |
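A sketch of decompressing a sample range from a single stream with a keep mask (self-contained, reusing array_compress from above; the endpoint semantics follow the parameter descriptions here):
import numpy as np
from flacarray.compress import array_compress
from flacarray.decompress import array_decompress_slice

arr = np.random.normal(size=(4, 10000)).astype(np.float32)
comp, starts, nbytes, offsets, gains = array_compress(arr, quanta=1.0e-5)
keep = np.zeros(starts.shape, dtype=bool)
keep[0] = True  # decompress only the first stream
sub, indices = array_decompress_slice(
    comp, arr.shape[-1], starts, nbytes,
    stream_offsets=offsets, stream_gains=gains,
    keep=keep, first_stream_sample=100, last_stream_sample=200,
)
print(sub.shape, indices)  # kept stream samples in the requested range, plus its original index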
Developer Notes
To-Do
Discuss:
- Code formatting (ruff)
- PR workflow