What is cooler?
We use the term genomically-labeled array to refer to a data structure that assigns unique quantitative values to tuples of genomic bins obtained from an interval partition of a reference genome assembly. The tuples of bins make up the coordinates of the array’s elements. By omitting elements with zero or missing values, the representation becomes sparse.
Cooler was designed for the storage and manipulation of extremely large Hi-C datasets at any resolution, but is not limited to Hi-C data in any way.
We can describe two tabular representations of such data.
By extending the bedGraph format, we can encode a 2D array with the following header: chrom1, start1, end1, chrom2, start2, end2, value. Other bin-related attributes (e.g. X and Y) can be appended as columns X1, X2, Y1, Y2, and so on. One problem with this representation is that each bin-related attribute can be repeated many times throughout the table, leading to great redundancy.
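For illustration, a few records in this BG2 style might look like the following (coordinates and counts are invented):

```text
chrom1  start1  end1    chrom2  start2  end2    count
chr1    0       10000   chr1    0       10000   12
chr1    0       10000   chr1    10000   20000   4
chr1    10000   20000   chr2    30000   40000   1
```

Note how the bin (chr1, 0, 10000) is spelled out in full every time it participates in an element.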
bedGraph is technically different from BED: the former describes a quantitative track supported by non-overlapping intervals (a step function), while the latter describes genomic intervals with no such restrictions. BG2 is different from BEDPE in the same way: intervals on the same axis are non-overlapping and interval pairs are not repeated (describing a heatmap).
A simple solution is to decompose or “normalize” the single table into two files. The first is a bin table that describes the genomic bin segmentation on both axes of the matrix (in the one-dimensional bedGraph style). The second table contains single columns that reference the rows of the bin table, providing a condensed representation of the nonzero elements of the array. Conveniently, this corresponds to the classic coordinate list (COO) sparse matrix representation. This two-table representation is used as a text format by HiC-Pro.
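As a sketch of this normalization (in plain Python, with invented bins and counts), the paired-interval records can be decomposed into a bin table plus COO triples:

```python
# Sketch: normalize paired-interval (BG2-style) records into a bin table
# and a COO-style pixel table. All values are invented for illustration.
bg2_records = [
    # (chrom1, start1, end1, chrom2, start2, end2, count)
    ("chr1", 0, 10000, "chr1", 0, 10000, 12),
    ("chr1", 0, 10000, "chr1", 10000, 20000, 4),
    ("chr1", 10000, 20000, "chr2", 0, 10000, 1),
]

# 1. Collect the distinct bins, sorted by (chrom, start); their row
#    numbers become the implicit bin IDs.
bins = sorted({(c1, s1, e1) for c1, s1, e1, *_ in bg2_records} |
              {(c2, s2, e2) for *_, c2, s2, e2, v in bg2_records})
bin_id = {b: i for i, b in enumerate(bins)}

# 2. Replace each interval with its bin ID, yielding COO (row, col, value).
pixels = [(bin_id[(c1, s1, e1)], bin_id[(c2, s2, e2)], v)
          for c1, s1, e1, c2, s2, e2, v in bg2_records]
print(pixels)  # each bin is now stored once, in the bin table
```

Each bin's attributes now appear exactly once, and the pixel table carries only integer IDs and values.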
The table of elements (non-zero pixels) is often too large to hold in memory, but for any small selection of elements we can reconstitute the bin-related attributes by “joining” the bin IDs against the bin table. We refer to this process as element annotation.
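A minimal sketch of element annotation: a small selection of pixels is joined back against the bin table by bin ID (the tables here are invented):

```python
# Sketch of "element annotation": join pixel bin IDs against the bin table.
# All values are invented for illustration.
bins = [  # row number == bin ID
    ("chr1", 0, 10000),
    ("chr1", 10000, 20000),
    ("chr2", 0, 10000),
]
pixels = [(0, 1, 4), (1, 2, 1)]  # (bin1_id, bin2_id, count)

# Look up each bin ID's attributes to reconstitute BG2-style records.
annotated = [bins[b1] + bins[b2] + (count,) for b1, b2, count in pixels]
# Each record is now (chrom1, start1, end1, chrom2, start2, end2, count).
print(annotated[0])
```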
We model a genomically-labeled sparse matrix using three tables, corresponding to the bin and element (pixel) tables above. A third chromosome description table is included for completeness, along with indexes to support random access.

chroms

- Required columns: name [, length]
- Order: enumeration

A semantic ordering of the chromosomes, scaffolds or contigs of the assembly the data is mapped onto. This information can be extracted from the bin table below, but is included separately for convenience. This enumeration is the intended ordering of the chromosomes as they would appear in a global genomic matrix. Additional columns can provide metadata on the chromosomes, such as their length.
bins

- Required columns: chrom, start, end [, weight]
- Order: chrom (by the chromosome enumeration), then start

An enumeration of the concatenated genomic bins that make up a single dimension or axis of the global genomic matrix. Genomic bins can be of fixed size or variable sizes (e.g. restriction fragments). A genomic bin is defined by the triple (chrom, start, end), where start is zero-based and end is 1-based. The order is significant: the bins are sorted by chromosome (based on the chromosome enumeration), then by start, and each genomic bin is implicitly endowed with a 0-based bin ID given by its row number in the table. A reserved but optional column called weight can store weights for normalization or matrix balancing. Additional columns can be added to describe other bin-associated properties such as additional normalization vectors or bin-level masks.
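For fixed-size bins, the implicit bin ID can be computed arithmetically from cumulative per-chromosome bin counts rather than looked up. A sketch, with invented chromosome lengths:

```python
# Sketch: compute implicit 0-based bin IDs for a fixed bin size, using
# cumulative per-chromosome bin counts. The chromosome order and lengths
# are invented for illustration.
binsize = 10_000
chromsizes = [("chr1", 25_000), ("chr2", 18_000)]  # assembly order matters

# Bins per chromosome, rounding the last (possibly shorter) bin up.
nbins = {name: -(-length // binsize) for name, length in chromsizes}

# Offset of each chromosome's first bin in the global bin table.
offsets, total = {}, 0
for name, _ in chromsizes:
    offsets[name] = total
    total += nbins[name]

def bin_id(chrom, start):
    """ID of the bin containing position `start` on `chrom`."""
    return offsets[chrom] + start // binsize

print(bin_id("chr2", 5_000))
```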
pixels

- Required columns: bin1_id, bin2_id, count
- Order: bin1_id, then bin2_id

The array is stored as a single table containing only the nonzero upper-triangle elements, assuming the ordering of the bins given by the bin table. Each row defines a nonzero element of the genomic matrix. Additional columns can be appended to store pixel-associated properties such as pixel-level masks or filtered and transformed versions of the data. Currently, the pixels are sorted lexicographically by the bin ID of the first axis (matrix row), then by the bin ID of the second axis (matrix column).
The sort order on the pixels and the types of indexing strategies that can be used are strongly related. We stipulate that the records of the pixel table must be sorted lexicographically by the bin ID along the first axis, then by the bin ID along the second axis. This way, the bin1_id column can be substituted with its run-length encoding, which serves as a lookup index for the rows of the matrix. With this index, we obtain a compressed sparse row (CSR) sparse matrix representation.
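The run-length encoding of bin1_id can be stored as an array of row offsets, in the style of a CSR indptr array. A sketch with invented pixel rows:

```python
# Sketch: build a CSR-style row offset index from a sorted bin1_id column.
# offsets[i] .. offsets[i+1] delimit the pixel rows whose bin1_id == i.
# The pixel rows are invented for illustration.
n_bins = 4
bin1_ids = [0, 0, 1, 3]  # already sorted

offsets = [0] * (n_bins + 1)
for b1 in bin1_ids:
    offsets[b1 + 1] += 1          # count pixels per matrix row
for i in range(n_bins):
    offsets[i + 1] += offsets[i]  # cumulative sum -> row offsets

def row_slice(i):
    """Pixel-table rows belonging to matrix row i."""
    return slice(offsets[i], offsets[i + 1])

print(offsets)
```

Fetching all pixels of matrix row i then costs a single contiguous slice of the pixel table, with no search required.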
Given an enumeration of chromosomes, the bin table must also be lexicographically sorted by chromosome then by start coordinate. Then similarly, the chrom column of the bin table will reference the rows of the chrom table, and can also be substituted with a run length encoding.
The reference implementation of this data model uses HDF5 as the container format. HDF5 is a hierarchical data format for homogeneously typed multidimensional arrays, which supports chunking, compression, and random access. The HDF5 file specification and open source standard library are maintained by the nonprofit HDF Group.
HDF5 files consist of three fundamental entities: groups, datasets, and attributes. The hierarchical organization of an HDF5 file is conceptually analogous to a file system: groups are akin to directories and datasets (arrays) are akin to files. Additionally, key-value metadata can be attached to groups and datasets using attributes. The standard library provides the ability to access and manipulate these entities, and there are bindings for virtually every platform and programming environment. To learn more about HDF5 in detail, I recommend the book Python and HDF5 by Andrew Collette, the author of h5py.
To implement the data model in HDF5, data tables are stored in a columnar representation as HDF5 groups of 1D array datasets of equal length. Metadata is stored using top-level attributes. See the schema.
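A sketch of this layout using h5py: each table becomes a group of equal-length 1D datasets, with metadata attached as attributes. The group and column names follow the schema described above, but the data values here are invented:

```python
# Sketch: store columnar tables as HDF5 groups of equal-length 1D
# datasets, in the spirit of the layout described above. The file name
# and data values are invented for illustration.
import h5py
import numpy as np

with h5py.File("toy.cool", "w") as f:
    f.attrs["format"] = "toy-example"  # file-level metadata as attributes
    bins = f.create_group("bins")
    bins.create_dataset("chrom", data=np.array([0, 0, 1], dtype="i4"))
    bins.create_dataset("start", data=np.array([0, 10000, 0], dtype="i8"))
    bins.create_dataset("end", data=np.array([10000, 20000, 10000], dtype="i8"))
    pixels = f.create_group("pixels")
    pixels.create_dataset("bin1_id", data=np.array([0, 0, 1], dtype="i8"))
    pixels.create_dataset("bin2_id", data=np.array([0, 1, 2], dtype="i8"))
    pixels.create_dataset("count", data=np.array([12, 4, 1], dtype="i4"))

# Random access: read a slice of one column without loading the rest.
with h5py.File("toy.cool", "r") as f:
    counts = f["pixels/count"][0:2]
print(counts.tolist())
```

Because each column is a separate dataset, a query touching only count never has to decompress or read the coordinate columns.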
HDF5 is not a database system and is not journalled. It supports concurrent read access but not simultaneous reads and writes (with upcoming support for the SWMR access pattern). One must be careful using multi-process concurrency based on Unix fork(): if a file is already open before the fork, the child processes will inherit state such that they won’t play well with each other on that file. HDF5 will work fine with Python’s multiprocessing as long as you make sure to close file handles before creating a process pool. Otherwise, you’ll need to use locks or avoid opening the file in worker processes completely (see this blog post for a simple workaround). For more information on using multiprocessing safely, see this discussion.
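The safe pattern above can be sketched as follows: no HDF5 handle is open in the parent when the pool is created, and each worker opens its own handle. The file name and dataset are invented for illustration, and the fork start method assumes a POSIX system:

```python
# Sketch of a fork-safe pattern for reading HDF5 from multiple processes:
# close all handles in the parent before creating the pool, and open a
# fresh handle inside each worker. File name and dataset are invented.
import multiprocessing as mp
import h5py
import numpy as np

PATH = "mp_example.h5"

def read_value(i):
    # Open a fresh handle inside the worker; never reuse the parent's.
    with h5py.File(PATH, "r") as f:
        return int(f["counts"][i])

def run():
    with h5py.File(PATH, "w") as f:  # handle is closed on exit...
        f.create_dataset("counts", data=np.arange(8))
    # ...so the forked children inherit no open HDF5 state.
    with mp.get_context("fork").Pool(2) as pool:
        return pool.map(read_value, range(8))

if __name__ == "__main__":
    print(run())
```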