Cook Book¶
This notebook contains more advanced examples addressing common usage patterns. Read the Tutorial first to get a sense of the big picture of these tools.
# Set the number of OpenMP threads to use.  Do this before importing
# anything that initializes the OpenMP runtime.
import os
os.environ["OMP_NUM_THREADS"] = "4"
import time
import numpy as np
import h5py
from flacarray import FlacArray, demo
import flacarray.hdf5
Random Access to Large Arrays¶
Consider a common case where we have a 2D array that represents essentially a "list" of timestreams of data. We might have thousands of streams, each with millions of samples. Now we want to decompress and access a subset of those streams and/or samples. To keep memory usage modest in this notebook, we use a somewhat smaller array.
# Create a 2D array of streams
arr = demo.create_fake_data((1000, 100000), dtype=np.float32)
# How large is this in memory?
print(f"Input array is {arr.nbytes} bytes")
Input array is 400000000 bytes
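That number is exactly what the shape and dtype predict: 1000 streams times 100000 samples times 4 bytes per float32 element. A quick sanity check (pure numpy, nothing flacarray-specific):

```python
import numpy as np

# Expected footprint: n_streams * n_samples * bytes-per-element
n_streams, n_samples = 1000, 100000
itemsize = np.dtype(np.float32).itemsize  # 4 bytes for float32
expected = n_streams * n_samples * itemsize
print(expected)  # 400000000
```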
# Compress this with threads
start = time.perf_counter()
flcarr = FlacArray.from_array(arr, use_threads=True)
stop = time.perf_counter()
print(f"Elapsed = {stop-start:0.3} seconds")
Elapsed = 1.33 seconds
# Compress this without threads
start = time.perf_counter()
flcarr = FlacArray.from_array(arr, use_threads=False)
stop = time.perf_counter()
print(f"Elapsed = {stop-start:0.3} seconds")
Elapsed = 3.06 seconds
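The start/stop timing pattern repeats in every cell of this notebook. A small helper (a sketch, not part of flacarray) tidies it up:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print the wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    yield
    stop = time.perf_counter()
    print(f"{label}: Elapsed = {stop - start:0.3} seconds")

# Example usage with a trivial workload:
with timed("sum"):
    total = sum(range(1000))
```

With this helper, a compression cell above reduces to `with timed("compress"): flcarr = FlacArray.from_array(arr, use_threads=True)`.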
# Decompress the whole thing
start = time.perf_counter()
restored = flcarr.to_array()
stop = time.perf_counter()
print(f"Elapsed = {stop-start:0.3} seconds")
Elapsed = 0.506 seconds
# Decompress the whole thing with threads
del restored
start = time.perf_counter()
restored = flcarr.to_array(use_threads=True)
stop = time.perf_counter()
print(f"Elapsed = {stop-start:0.3} seconds")
Elapsed = 0.493 seconds
Subset of Samples for All Streams¶
If our 2D array of streams contains co-sampled data, we might want to examine a time slice across all streams. Imagine we want the data near the end of the array for every stream:
n_end = 10000
start = time.perf_counter()
end_arr = flcarr.to_array(stream_slice=slice(-n_end, None, 1))
stop = time.perf_counter()
print(f"Elapsed = {stop-start:0.3} seconds")
Elapsed = 0.496 seconds
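The `slice(-n_end, None, 1)` passed as `stream_slice` is ordinary Python slice notation: it selects the final `n_end` samples of every stream. The equivalent operation on an uncompressed array (a pure-numpy check, independent of flacarray) looks like this:

```python
import numpy as np

n_end = 4
small = np.arange(12, dtype=np.float32).reshape(2, 6)  # 2 streams, 6 samples
tail = small[:, slice(-n_end, None, 1)]  # same as small[:, -n_end:]
print(tail.shape)  # (2, 4)
print(tail[0])     # [2. 3. 4. 5.]
```

On the real data above, `end_arr` should therefore match `arr[:, -n_end:]`.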
Subset of Samples for a Few Streams¶
Imagine we want the last 10000 samples of a single stream in the middle of the array. We can use a "keep" mask combined with a sample slice:
n_end = 10000
keep = np.zeros(arr.shape[:-1], dtype=bool)
keep[500] = True
start = time.perf_counter()
sub_arr = flcarr.to_array(keep=keep, stream_slice=slice(-n_end, None, 1))
stop = time.perf_counter()
print(f"Elapsed = {stop-start:0.3} seconds")
print(sub_arr)
Elapsed = 0.00223 seconds
[[ 1.3499216 1.1607051 -1.0080613 ... 0.2447555 1.0821551 0.03497726]]
So, we can see that decompressing a small number of random samples from a multi-GB dataset in memory is very fast.
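The `keep` mask has the shape of the leading (stream) axes, so the same pattern selects any subset of streams, not just one. A pure-numpy illustration of the selection logic (an equivalence sketch; flacarray produces this result from the compressed data without decompressing the rest):

```python
import numpy as np

small = np.arange(20, dtype=np.float32).reshape(4, 5)  # 4 streams, 5 samples
keep = np.zeros(small.shape[:-1], dtype=bool)
keep[[1, 3]] = True  # keep streams 1 and 3

n_end = 2
# Boolean mask picks the streams; the slice picks the trailing samples.
subset = small[keep][:, -n_end:]
print(subset.shape)  # (2, 2)
```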
Parallel I/O¶
For some use cases, there is no need to keep the full compressed data in memory (in a FlacArray). Instead, a normal numpy array is compressed when writing to a file and decompressed back into a numpy array when reading.
To-Do: Discuss
- Interaction of threads, OpenMP versus libFLAC pthreads
- Use of MPI HDF5 with h5py
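As a baseline for the file-based workflow described above, here is a plain h5py round trip of an uncompressed array. This is a sketch only: it shows the write-then-read pattern with stock h5py calls, while the flacarray.hdf5 module provides the analogous path with FLAC compression applied on write and removed on read, per the description above.

```python
import os
import tempfile
import numpy as np
import h5py

data = np.random.default_rng(42).normal(size=(8, 1000)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "streams.h5")
    # Write: the array goes to disk; only `data` stays resident.
    with h5py.File(path, "w") as hf:
        hf.create_dataset("signal", data=data)
    # Read: load the dataset straight back into a numpy array.
    with h5py.File(path, "r") as hf:
        restored = hf["signal"][:]

print(np.array_equal(data, restored))  # True
```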