Schema

Schema Version

3

The following document describes a compressed sparse row (CSR) storage scheme for a matrix (i.e., a quantitative heatmap) with genomically labeled dimensions/axes.

HDF5 does not natively implement sparse arrays or relational data structures: its datasets are dense multidimensional arrays. We implement tables and sparse array indexes in HDF5 using groups of 1D arrays. The descriptions of tables and indexes in this document specify required groups and arrays, conventional column orders, and default data types.

Summary of changes

Version 3 introduces the storage-mode metadata attribute to accommodate square matrices that are non-symmetric. Version 2 files that lack the storage-mode attribute should be interpreted as using the “symmetric-upper” storage mode. See Storage mode.
The multi-resolution cooler file layout has been standardized. See File flavors.

Data collection

We refer to the object hierarchy describing a single matrix as a cooler data collection. A cooler data collection consists of tables, indexes and metadata describing a genomically-labelled sparse matrix.

A typical data collection has the following structure. At the top level, there are four HDF5 Groups, each containing 1D arrays (HDF5 Datasets). The depiction below shows an example group hierarchy as a tree, with arrays at the leaves, printed with their shapes in parentheses and their data type symbols.

/
 ├── chroms
 │   ├── length (24,) int32
 │   └── name (24,) |S64
 ├── bins
 │   ├── chrom (3088281,) int32
 │   ├── start (3088281,) int32
 │   ├── end (3088281,) int32
 │   └── weight (3088281,) float64
 ├── pixels
 │   ├── bin1_id (271958554,) int64
 │   ├── bin2_id (271958554,) int64
 │   └── count (271958554,) int32
 └── indexes
     ├── bin1_offset (3088282,) int64
     └── chrom_offset (25,) int64

URI syntax

We identify a cooler data collection using a URI string to its top-level group, separating the system path to the container file from the group path within the container file by a double colon ::.

path/to/container.cool::/path/to/cooler/group

For any URI, the leading slash after the :: may be omitted. To reference the root group /, the entire ::/ suffix may be omitted (i.e., just a file path).
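The URI rules above can be sketched as a small parser. This is only an illustration: the function name parse_cooler_uri is hypothetical and is not presented as the cooler library's own API. It assumes the file path itself contains no double colon.

```python
def parse_cooler_uri(uri):
    """Split a URI of the form 'file::group' into (file_path, group_path)."""
    file_path, sep, group_path = uri.partition("::")
    if not sep or not group_path:
        # No '::' suffix (or an empty one) refers to the root group.
        return file_path, "/"
    if not group_path.startswith("/"):
        # The leading slash after '::' may be omitted, so restore it.
        group_path = "/" + group_path
    return file_path, group_path


print(parse_cooler_uri("path/to/container.cool::/path/to/cooler/group"))
print(parse_cooler_uri("XYZ.1000.mcool::resolutions/10000"))
print(parse_cooler_uri("XYZ.1000.cool"))
```

Normalizing every group path to an absolute one means downstream code only has to handle a single form.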

Tables

A table is a group of equal-length 1D arrays representing columns.

Additional groups and tables may be added to a data collection as long as they are not nested under the group of another table.

The schema does not enforce specific column orders, but conventional orders for required columns are provided in the listings below.

The schema does not set limits on the number or length of columns. Additional arrays may be inserted into a table to form new columns, but they must conform to the common length of the table.
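The equal-length rule can be illustrated with a plain dictionary of lists standing in for an HDF5 group of 1D arrays. The helper add_column and the column name gc_content are hypothetical and used only for this sketch.

```python
def add_column(table, name, values):
    """Add a column to a table, enforcing the table's common length."""
    lengths = {len(col) for col in table.values()}
    if len(lengths) > 1:
        raise ValueError("existing columns have unequal lengths")
    if lengths and len(values) != lengths.pop():
        raise ValueError(f"column {name!r} does not match the table length")
    table[name] = list(values)


# A three-row bin table with the required columns.
bins = {"chrom": [0, 0, 1], "start": [0, 1000, 0], "end": [1000, 2000, 1000]}

# A supplementary column of the same length is accepted.
add_column(bins, "gc_content", [0.41, 0.39, 0.52])

# A column of the wrong length is rejected.
try:
    add_column(bins, "too_short", [1.0])
except ValueError as err:
    print("rejected:", err)
```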

The table descriptions below are given in the datashape layout language. The column data types are given as numpy equivalents. They are only defaults and may be altered as desired.

GZIP is chosen as the default compression filter for all columns. This is for portability reasons, since all versions of the HDF5 library ship with it.

chroms

chroms: {
  # REQUIRED
  name:     typevar['Nchroms'] * string['ascii'],
  length:   typevar['Nchroms'] * int32
}

In HDF5, name is a null-padded, fixed-length ASCII array, which maps to numpy’s S dtype.

bins

bins: {
  # REQUIRED
  chrom:    typevar['Nbins'] * categorical[typevar['name'], type=string, ordered=True],
  start:    typevar['Nbins'] * int32,
  end:      typevar['Nbins'] * int32,

  # RESERVED
  weight:   typevar['Nbins'] * float64
}

In HDF5, we use the integer-backed ENUM type to encode the chrom column. For data collections with a very large number of scaffolds, the ENUM type information may be too large to fit in the object’s metadata header. In that case, the chrom column is stored using raw integers and the enumeration is inferred from the chroms table.

Genomic intervals are stored using a 0-start, half-open representation. The first interval in a scaffold should have start = 0 and the last interval should have end = the chromosome length. Intervals are sorted by chrom, then by start.
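As an illustration of the interval conventions, the following sketch tiles each chromosome with fixed-size, 0-start, half-open bins. The function make_fixed_bins and the toy chromosome sizes are assumptions made for this example. The final bin of each chromosome is truncated so that its end equals the chromosome length.

```python
def make_fixed_bins(chromsizes, binsize):
    """Return (chrom, start, end) tuples sorted by chrom, then by start."""
    bins = []
    for chrom, length in chromsizes:
        for start in range(0, length, binsize):
            # Half-open interval [start, end); clip the last bin.
            end = min(start + binsize, length)
            bins.append((chrom, start, end))
    return bins


# Toy chromosome sizes, listed in the order of the chroms table.
chromsizes = [("chr1", 2500), ("chr2", 1000)]
bins = make_fixed_bins(chromsizes, 1000)
for row in bins:
    print(row)
```

With half-open intervals, the end of one bin equals the start of the next, so adjacent bins never overlap and bin widths are simply end - start.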

Because they measure the same quantity in the same units, the coordinate columns chroms/length, bins/start and bins/end should be encoded using the same data type.

The cooler balance command stores balancing weights in a column called weight by default. NaN values indicate genomic bins that were blacklisted during the balancing procedure.

pixels

pixels: {
  # REQUIRED
  bin1_id:  typevar['Nnz'] * int64,
  bin2_id:  typevar['Nnz'] * int64,

  # RESERVED
  count:    typevar['Nnz'] * int32
}

In the matrix coordinate system, bin1_id refers to the ith axis and bin2_id refers to the jth. Bin IDs are zero-based, i.e. we start counting at 0. Pixels are sorted by bin1_id then by bin2_id.

The count column is integer by default, but floating point types can be substituted. Additional columns are to be interpreted as supplementary value columns.

Warning

float16 has limited support from 3rd party libraries and is not recommended. For floating point value columns consider using either single- (float32) or double-precision (float64).

Indexes

Indexes are stored as 1D arrays in a separate group called indexes. They can be thought of as run-length encodings of the bins/chrom and pixels/bin1_id columns, respectively. Both arrays are required.

indexes: {
  chrom_offset:  (typevar['Nchroms'] + 1) * int64,
  bin1_offset:   (typevar['Nbins'] + 1) * int64
}

chrom_offset: gives, for each chromosome, the row of the bin table at which it first appears. The last element stores the length of the bin table.
bin1_offset: gives, for each bin1 ID, the row of the pixel table at which it first appears. The last element stores the length of the pixel table. This index is usually called indptr in CSR data structures.
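A minimal sketch of how both offset arrays could be derived from sorted ID columns, and how bin1_offset locates the pixels of one matrix row. It uses NumPy's searchsorted. The helper build_offsets and the toy arrays are assumptions for illustration, and bins/chrom is taken to hold integer codes.

```python
import numpy as np


def build_offsets(sorted_ids, n):
    """Return n + 1 offsets; offsets[k] is the first row whose ID is >= k."""
    return np.searchsorted(sorted_ids, np.arange(n + 1), side="left").astype(np.int64)


# bins/chrom as integer codes: 2 chromosomes, 5 bins.
bin_chrom = np.array([0, 0, 0, 1, 1])
chrom_offset = build_offsets(bin_chrom, 2)

# Pixels sorted by bin1_id, then by bin2_id.
bin1_id = np.array([0, 0, 1, 3, 3])
bin2_id = np.array([0, 2, 1, 3, 4])
bin1_offset = build_offsets(bin1_id, 5)

# All pixels in matrix row 3 occupy one contiguous slice of the pixel table.
lo, hi = bin1_offset[3], bin1_offset[4]
print(chrom_offset, bin1_offset, bin2_id[lo:hi])
```

A bin with no pixels, such as bin 2 here, yields two equal consecutive offsets and therefore an empty slice.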

Storage mode

Storing a symmetric matrix requires only the upper triangular part, including the diagonal, since the remaining elements can be reconstructed from the former ones. To indicate the use of this mode of matrix storage to client software, the value of the metadata attribute storage-mode must be set to "symmetric-upper" (see Metadata).

New in version 3: To indicate the absence of a special storage mode, e.g. for non-symmetric matrices, storage-mode must be set to "square". This storage mode indicates to client software that 2D range queries should not be symmetrized.

Warning

In schema v2 and earlier, the symmetric-upper storage mode is always assumed.
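The difference between the two storage modes can be sketched by expanding a small pixel table into a dense array. The function to_dense is hypothetical and assumes that each (bin1_id, bin2_id) pair occurs at most once.

```python
import numpy as np


def to_dense(bin1_id, bin2_id, count, n, storage_mode):
    """Build an n x n dense matrix from pixel columns."""
    mat = np.zeros((n, n), dtype=count.dtype)
    mat[bin1_id, bin2_id] = count
    if storage_mode == "symmetric-upper":
        # Mirror the off-diagonal entries into the lower triangle.
        # Diagonal entries are stored once and must not be duplicated.
        off_diag = bin1_id != bin2_id
        mat[bin2_id[off_diag], bin1_id[off_diag]] = count[off_diag]
    return mat


bin1_id = np.array([0, 0, 1])
bin2_id = np.array([0, 2, 1])
count = np.array([5, 1, 7], dtype=np.int32)

symmetric = to_dense(bin1_id, bin2_id, count, 3, "symmetric-upper")
square = to_dense(bin1_id, bin2_id, count, 3, "square")
print(symmetric)
print(square)
```

In "square" mode the pixels are taken at face value, so the entry at (2, 0) stays zero unless it is stored explicitly.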

Metadata

Essential key-value properties are stored as HDF5 attributes at the top-level group of the data collection. Note that depending on where the data collection is located in the file, this can be different from the root group of the entire file /.

Required attributes

format : string (constant): “HDF5::Cooler”

format-version : int: The schema version used.

bin-type : { "fixed", "variable" }: Indicates whether the resolution is constant along both axes.

bin-size : int or "null": Size of genomic bins in base pairs if bin-type is “fixed”. Otherwise, “null”.

storage-mode : { "symmetric-upper", "square" }: Indicates whether ordinary sparse matrix encoding is used (“square”) or whether a symmetric matrix is encoded by storing only the upper triangular elements (“symmetric-upper”).

Reserved, but optional

assembly : string: Name of the genome assembly, e.g. “hg19”.

generated-by : string: Agent that created the file, e.g. “cooler-x.y.z”.

creation-date : datetime string: The moment the collection was created.

metadata : JSON: Arbitrary JSON-compatible user metadata about the experiment.

All scalar string attributes, including serialized JSON, must be stored as variable-length UTF-8 encoded strings.

Warning

When assigning scalar string attributes in Python 2, always store values having unicode type. In h5py, assigning a Python text string (Python 3 str or Python 2 unicode) to an HDF5 attribute results in variable-length UTF-8 storage.

Additional metadata may be stored in other top-level attributes and the attributes of table groups and columns.
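One way a client might resolve the storage mode from these attributes, including the fallback for files written before version 3, is sketched below. A plain dictionary stands in for the HDF5 attributes, and resolve_storage_mode is a hypothetical helper rather than part of any library.

```python
STORAGE_MODES = ("symmetric-upper", "square")


def resolve_storage_mode(attrs):
    """Return the effective storage mode for a data collection."""
    version = attrs["format-version"]
    mode = attrs.get("storage-mode")
    if mode is None:
        if version >= 3:
            raise ValueError("storage-mode is required from schema version 3")
        # Files from v2 and earlier are interpreted as symmetric-upper.
        return "symmetric-upper"
    if mode not in STORAGE_MODES:
        raise ValueError(f"unknown storage-mode: {mode!r}")
    return mode


v2_attrs = {"format": "HDF5::Cooler", "format-version": 2,
            "bin-type": "fixed", "bin-size": 1000}
v3_attrs = dict(v2_attrs, **{"format-version": 3, "storage-mode": "square"})

print(resolve_storage_mode(v2_attrs))
print(resolve_storage_mode(v3_attrs))
```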

File flavors

Many cooler data collections can be stored in a single file. We recognize the following conventional layouts:

Single-resolution

A single-resolution cooler file contains a single data collection under the / group. Conventional file extension: .cool.

XYZ.1000.cool
/
 ├── bins
 ├── chroms
 ├── pixels
 └── indexes

Multi-resolution

A multi-resolution cooler file contains multiple “coarsened” resolutions or “zoom-levels” derived from the same dataset. Multires cooler files should store each data collection underneath a group called /resolutions within a sub-group whose name is the bin size (e.g., XYZ.1000.mcool::resolutions/10000). If the base cooler has variable-length bins, then use 1 to designate the base resolution, and use the coarsening multiplier (e.g. 2, 4, 8, etc.) to name the lower resolutions. Conventional file extension: .mcool.

XYZ.1000.mcool
/
 └── resolutions
     ├── 1000
     │   ├── bins
     │   ├── chroms
     │   ├── pixels
     │   └── indexes
     ├── 2000
     │   ├── bins
     │   ├── chroms
     │   ├── pixels
     │   └── indexes
     ├── 5000
     │   ├── bins
     │   ├── chroms
     │   ├── pixels
     │   └── indexes
     ├── 10000
     │   ├── bins
     │   ├── chroms
     │   ├── pixels
     │   └── indexes
     .
     .
     .
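The naming convention above can be sketched as follows. The helpers mcool_uri and sort_resolutions are hypothetical. Because group names are strings, they are sorted by integer value rather than lexically.

```python
def mcool_uri(file_path, name):
    """Build the URI of one resolution inside a multi-resolution file.

    `name` is the bin size for fixed bins, or the coarsening multiplier
    (1 for the base resolution) for variable-length bins.
    """
    return f"{file_path}::/resolutions/{int(name)}"


def sort_resolutions(group_names):
    """Order resolution group names from finest to coarsest."""
    return sorted(group_names, key=int)


names = ["10000", "1000", "5000", "2000"]
print(sort_resolutions(names))
print(mcool_uri("XYZ.1000.mcool", 10000))
```

A lexical sort would place "10000" before "2000", which is why the sort key converts each name to an integer first.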

In addition, a multi-resolution cooler file may indicate to clients that it is using this layout with the following /-level attributes:

format : string (constant): “HDF5::MCOOL”

format-version : int: 2

bin-type : { "fixed", "variable" }: Indicates whether the resolution is constant along both axes.

Note

The old multi-resolution layout used resolutions strictly in increments of powers of 2. In this layout (MCOOL version 2), the data collections are named by zoom level, starting with XYZ.1000.mcool::0 being the coarsest resolution up until the finest or “base” resolution (e.g., XYZ.1000.mcool::14 for 14 levels of coarsening).

Changed in version 0.8: Both the legacy layout and the new mcool layout are supported by HiGlass. Prior to cooler 0.8, the new layout was produced only when requesting a specific list of resolutions. As of cooler 0.8, the new layout is always produced by the cooler zoomify command unless the --legacy option is given. Files produced by cooler.zoomify_cooler(), hic2cool, and the mcools from the 4DN data portal also follow the new layout.

Single-cell (single-resolution)

A single-cell cooler file contains all the matrices of a single-cell Hi-C data set. All cells are stored under a group called /cells, and all cells share the primary bin table columns, i.e. bins['chrom'], bins['start'] and bins['end'], which are hardlinked to the root-level bin table. Any individual cell can be accessed using the regular cooler.Cooler interface. Conventional file extension: .scool.

XYZ.scool
/
 ├── bins
 ├── chroms
 └── cells
     ├── cell_id1
     │   ├── bins
     │   ├── chroms
     │   ├── pixels
     │   └── indexes
     ├── cell_id2
     │   ├── bins
     │   ├── chroms
     │   ├── pixels
     │   └── indexes
     ├── cell_id3
     │   ├── bins
     │   ├── chroms
     │   ├── pixels
     │   └── indexes
     ├── cell_id4
     │   ├── bins
     │   ├── chroms
     │   ├── pixels
     │   └── indexes
     .
     .
     .

In addition, a single-cell single-resolution cooler file may indicate to clients that it is using this layout with the following /-level attributes:

format : string (constant): “HDF5::SCOOL”

format-version : int: 1

bin-type : { "fixed", "variable" }: Indicates whether the resolution is constant along both axes.

bin-size : int: The bin resolution

nbins : int: The number of bins

nchroms : int: The number of chromosomes of the cells

ncells : int: The number of stored cells

Previous schema versions

v1
v2