API Reference

Quick reference

Cooler class

cooler.Cooler(store[, root]) A convenient interface to a cooler data collection.
cooler.Cooler.binsize Resolution in base pairs if uniform else None
cooler.Cooler.chromnames List of reference sequence names
cooler.Cooler.chromsizes Ordered mapping of reference sequences to their lengths in bp
cooler.Cooler.bins(**kwargs) Bin table selector
cooler.Cooler.pixels([join]) Pixel table selector
cooler.Cooler.matrix([field, balance, …]) Contact matrix selector
cooler.Cooler.open([mode]) Open the HDF5 group containing the Cooler with h5py
cooler.Cooler.info File information and metadata
cooler.Cooler.offset(region) Bin ID containing the left end of a genomic region
cooler.Cooler.extent(region) Bin IDs containing the left and right ends of a genomic region

Creation/reduction

cooler.create_cooler(cool_uri, bins, pixels) Create a cooler from bins and pixels at the specified URI.
cooler.merge_coolers(output_uri, input_uris, …) Merge multiple coolers with identical axes.
cooler.coarsen_cooler(base_uri, output_uri, …) Coarsen a cooler to a lower resolution by an integer factor k.
cooler.zoomify_cooler(base_uris, outfile, …) Generate multiple cooler resolutions by recursive coarsening.

Manipulation

cooler.annotate(pixels, bins[, replace]) Add bin annotations to a data frame of pixels.
cooler.balance_cooler(clr[, chunksize, map, …]) Iterative correction or matrix balancing of a sparse Hi-C contact map in Cooler HDF5 format.
cooler.rename_chroms(clr, rename_dict[, h5opts]) Substitute existing scaffold names for new ones.

File operations

cooler.fileops.is_cooler(uri) Determine if a URI string references a cooler data collection.
cooler.fileops.is_multires_file(filepath[, …]) Determine if a file is a multi-res cooler file.
cooler.fileops.list_coolers(filepath) List group paths to all cooler data collections in a file.
cooler.fileops.cp(src_uri, dst_uri[, overwrite]) Copy a group or dataset from one file to another or within the same file.
cooler.fileops.mv(src_uri, dst_uri[, overwrite]) Rename a group or dataset within the same file.
cooler.fileops.ln(src_uri, dst_uri[, soft, …]) Create a hard link to a group or dataset in the same file.

Sandbox

cooler.sandbox.dask.read_table(group_uri[, …]) Create a dask dataframe around a column-oriented table in HDF5.

cooler

class cooler.Cooler(store, root=None, **kwargs)[source]

A convenient interface to a cooler data collection.

Parameters:
  • store (str, h5py.File or h5py.Group) – Path to a cooler file, URI string, or open handle to the root HDF5 group of a cooler data collection.
  • root (str, optional [deprecated]) – HDF5 Group path to root of cooler group if store is a file. This option is deprecated. Instead, use a URI string of the form <file_path>::<group_path>.
  • kwargs (optional) – Options to be passed to h5py.File() upon every access. By default, the file is opened with the default driver and mode=’r’.

Notes

If store is a file path, the file will be opened temporarily in when performing operations. This allows Cooler objects to be serialized for multiprocess and distributed computations.

Metadata is accessible as a dictionary through the info property.

Table selectors, created using chroms(), bins(), and pixels(), perform range queries over table rows, returning pd.DataFrame and pd.Series.

A matrix selector, created using matrix(), performs 2D matrix range queries, returning numpy.ndarray or scipy.sparse.coo_matrix.

bins(**kwargs)[source]

Bin table selector

Returns:Table selector
binsize

Resolution in base pairs if uniform else None

chromnames

List of reference sequence names

chroms(**kwargs)[source]

Chromosome table selector

Returns:Table selector
chromsizes

Ordered mapping of reference sequences to their lengths in bp

extent(region)[source]

Bin IDs containing the left and right ends of a genomic region

Parameters:region (str or tuple) – Genomic range
Returns:2-tuple of ints

Examples

>>> c.extent('chr3')  # doctest: +SKIP
(1311, 2131)
info

File information and metadata

Returns:dict
matrix(field=None, balance=True, sparse=False, as_pixels=False, join=False, ignore_index=True, max_chunk=500000000)[source]

Contact matrix selector

Parameters:
  • field (str, optional) – Which column of the pixel table to fill the matrix with. By default, the ‘count’ column is used.
  • balance (bool, optional) – Whether to apply pre-calculated matrix balancing weights to the selection. Default is True and uses a column named ‘weight’. Alternatively, pass the name of the bin table column containing the desired balancing weights. Set to False to return untransformed counts.
  • sparse (bool, optional) – Return a scipy.sparse.coo_matrix instead of a dense 2D numpy array.
  • as_pixels (bool, optional) – Return a DataFrame of the corresponding rows from the pixel table instead of a rectangular sparse matrix. False by default.
  • join (bool, optional) – If requesting pixels, specifies whether to expand the bin ID columns into (chrom, start, end). Has no effect when requesting a rectangular matrix. Default is True.
  • ignore_index (bool, optional) – If requesting pixels, don’t populate the index column with the pixel IDs to improve performance. Default is True.
Returns:

Matrix selector

Notes

If as_pixels=True, only data explicitly stored in the pixel table will be returned: if the cooler’s storage mode is symmetric-upper, lower triangular elements will not be generated. If as_pixels=False, those missing non-zero elements will automatically be filled in.

offset(region)[source]

Bin ID containing the left end of a genomic region

Parameters:region (str or tuple) – Genomic range
Returns:int

Examples

>>> c.offset('chr3')  # doctest: +SKIP
1311
open(mode='r', **kwargs)[source]

Open the HDF5 group containing the Cooler with h5py

Functions as a context manager. Any open_kws passed during construction are ignored.

Parameters:mode (str, optional [default: 'r']) –
  • 'r' (readonly)
  • 'r+' or 'a' (read/write)

Notes

For other parameters, see h5py.File.

pixels(join=False, **kwargs)[source]

Pixel table selector

Parameters:join (bool, optional) – Whether to expand bin ID columns into chrom, start, and end columns. Default is False.
Returns:Table selector
storage_mode

Indicates whether ordinary sparse matrix encoding is used ("square") or whether a symmetric matrix is encoded by storing only the upper triangular elements ("symmetric-upper").

cooler.annotate(pixels, bins, replace=False)[source]

Add bin annotations to a data frame of pixels.

This is done by performing a relational “join” against the bin IDs of a table that describes properties of the genomic bins. New columns will be appended on the left of the output data frame.

Changed in version 0.8.0: The default value of replace changed to False.

Parameters:
  • pixels (DataFrame) – A data frame containing columns named bin1_id and/or bin2_id. If columns bin1_id and bin2_id are both present in pixels, the adjoined columns will be suffixed with ‘1’ and ‘2’ accordingly.
  • bins (DataFrame or DataFrame selector) – Data structure that contains a full description of the genomic bins of the contact matrix, where the index corresponds to bin IDs.
  • replace (bool, optional) – Remove the original bin1_id and bin2_id columns from the output. Default is False.
Returns:

DataFrame

cooler.create_cooler(cool_uri, bins, pixels, columns=None, dtypes=None, metadata=None, assembly=None, ordered=False, symmetric_upper=True, mode=None, mergebuf=20000000, delete_temp=True, temp_dir=None, max_merge=200, boundscheck=True, dupcheck=True, triucheck=True, ensure_sorted=False, h5opts=None, lock=None)[source]

Create a cooler from bins and pixels at the specified URI.

Because the number of pixels is often very large, the input pixels are normally provided as an iterable (e.g., an iterator or generator) of DataFrame chunks that fit in memory.

New in version 0.8.0.

Parameters:
  • cool_uri (str) – Path to cooler file or URI string. If the file does not exist, it will be created.
  • bins (pandas.DataFrame) – Segmentation of the chromosomes into genomic bins as a BED-like DataFrame with columns chrom, start and end. May contain additional columns.
  • pixels (DataFrame, dictionary, or iterable of either) – A table, given as a dataframe or a column-oriented dict, containing columns labeled bin1_id, bin2_id and count, sorted by (bin1_id, bin2_id). If additional columns are included in the pixel table, their names and dtypes must be specified using the columns and dtypes arguments. For larger input data, an iterable can be provided that yields the pixel data as a sequence of chunks. If the input is a dask DataFrame, it will also be processed one chunk at a time.
  • columns (sequence of str, optional) – Specify here the names of any additional value columns from the input to store in the cooler. The standard columns (bin1_id, bin2_id, count) can be provided, but are already assumed and do not need to be given explicitly. Additional value columns will be given dtype float64 unless specified using dtypes.
  • dtypes (dict, optional) – Dictionary mapping column names in the pixel table to dtypes. Can be used to override the default dtypes of bin1_id, bin2_id or count or assign dtypes to other value columns. Any additional value columns not also provided in the columns argument will be ignored.
  • metadata (dict, optional) – Experiment metadata to store in the file. Must be JSON compatible.
  • assembly (str, optional) – Name of genome assembly.
  • ordered (bool, optional [default: False]) – If the input chunks of pixels are sorted and ordered lexicographically, set this to True to create the cooler in one step. Otherwise, we create the cooler in two steps using an external sort mechanism. Default is False (i.e., two-step). See Notes.
  • symmetric_upper (bool, optional [default: True]) – If True, sets the file’s storage-mode property to symmetric-upper: use this only if the input data references the upper triangle of a symmetric matrix! For all other cases, set this option to False.
  • mode ({'w' , 'a'}, optional [default: 'w']) – Write mode for the output file. ‘a’: if the output file exists, append the new cooler to it. ‘w’: if the output file exists, it will be truncated. Default is ‘w’.
Other Parameters:
 
  • mergebuf (int, optional) – Maximum number of records to buffer in memory at any give time during the merge step.
  • delete_temp (bool, optional) – Whether to delete temporary files when finished. Useful for debugging. Default is False.
  • temp_dir (str, optional) – Create temporary files in the specified directory instead of the system one.
  • max_merge (int, optional) – If merging more than max_merge chunks, do the merge recursively in two passes.
  • boundscheck (bool, optional)
  • dupcheck (bool, optional)
  • triucheck (bool, optional)
  • ensure_sorted (bool, optional)
  • h5opts (dict, optional) – HDF5 dataset filter options to use (compression, shuffling, checksumming, etc.). Default is to use autochunking and GZIP compression, level 6.
  • lock (multiprocessing.Lock, optional) – Optional lock to control concurrent access to the output file.

Notes

If the pixel chunks are ordered and sorted lexicographically, then the cooler can be created in a single step by setting ordered=True.

If not, the cooler is created in two steps via an external sort mechanism. In the first pass, the sequence of pixel chunks are processed and sorted in memory and saved to temporary coolers. In the second pass, the temporary coolers are merged into the output file. This way the individual chunks do not need to be provided in any particular order. When ordered=False, the following options for the merge step are available: mergebuf, delete_temp, temp_dir, max_merge.

Each chunk of pixels will go through a validation pipeline, which can be customized with the following options: boundscheck, triucheck, dupcheck, ensure_sorted.

cooler.merge_coolers(output_uri, input_uris, mergebuf, columns=None, dtypes=None, agg=None, **kwargs)[source]

Merge multiple coolers with identical axes.

The merged cooler is stored at output_uri.

New in version 0.8.0.

Parameters:
  • output_uri (str) – Output cooler file path or URI.
  • input_uris (list of str) – List of input file path or URIs of coolers to combine.
  • mergebuf (int) – Maximum number of pixels processed at a time.
  • columns (list of str, optional) – Specify which pixel value columns to include in the aggregation. Default is to use all available value columns.
  • dtypes (dict, optional) – Specific dtypes to use for value columns. Default is to propagate the current dtypes of the value columns.
  • agg (dict, optional) – Functions to use for aggregating each value column. Pass the same kind of dict accepted by pandas.DataFrame.groupby.agg. Default is to apply ‘sum’ to every value column.
  • kwargs – Passed to cooler.create.
cooler.coarsen_cooler(base_uri, output_uri, factor, chunksize, nproc=1, columns=None, dtypes=None, agg=None, **kwargs)[source]

Coarsen a cooler to a lower resolution by an integer factor k.

This is done by pooling k-by-k neighborhoods of pixels and aggregating. Each chromosomal block is coarsened individually. Result is a coarsened cooler stored at output_uri.

New in version 0.8.0.

Parameters:
  • base_uri (str) – Input cooler file path or URI.
  • output_uri (str) – Input cooler file path or URI.
  • factor (int) – Coarsening factor.
  • chunksize (int) – Number of pixels processed at a time per worker.
  • nproc (int, optional) – Number of workers for batch processing of pixels. Default is 1, i.e. no process pool.
  • columns (list of str, optional) – Specify which pixel value columns to include in the aggregation. Default is to use all available value columns.
  • dtypes (dict, optional) – Specific dtypes to use for value columns. Default is to propagate the current dtypes of the value columns.
  • agg (dict, optional) – Functions to use for aggregating each value column. Pass the same kind of dict accepted by pandas.DataFrame.groupby.agg. Default is to apply ‘sum’ to every value column.
  • kwargs – Passed to cooler.create.
cooler.zoomify_cooler(base_uris, outfile, resolutions, chunksize, nproc=1, columns=None, dtypes=None, agg=None, **kwargs)[source]

Generate multiple cooler resolutions by recursive coarsening.

Result is a “zoomified” or “multires” cool file stored at outfile using the MCOOL v2 layout, where coolers are stored under a hierarchy of the form resolutions/<r> for each resolution r.

New in version 0.8.0.

Parameters:
  • base_uris (str or sequence of str) – One or more cooler URIs to use as “base resolutions” for aggregation.
  • outfile (str) – Output multires cooler (mcool) file path.
  • resolutions (list of int) – A list of target resolutions to generate.
  • chunksize (int) – Number of pixels processed at a time per worker.
  • nproc (int, optional) – Number of workers for batch processing of pixels. Default is 1, i.e. no process pool.
  • columns (list of str, optional) – Specify which pixel value columns to include in the aggregation. Default is to use all available value columns.
  • dtypes (dict, optional) – Specific dtypes to use for value columns. Default is to propagate the current dtypes of the value columns.
  • agg (dict, optional) – Functions to use for aggregating each value column. Pass the same kind of dict accepted by pandas.DataFrame.groupby.agg. Default is to apply ‘sum’ to every value column.
  • kwargs – Passed to cooler.create.

cooler.create

cooler.create.sanitize_pixels(bins, **kwargs)[source]

Builds a function to sanitize an already-binned genomic data with genomic bin assignments.

Parameters:
  • bins (DataFrame) – Bin table to compare pixel records against.
  • is_one_based (bool, optional) – Whether the input bin IDs are one-based, rather than zero-based. They will be converted to zero-based.
  • tril_action ('reflect', 'drop', 'raise' or None) – How to handle lower triangle (“tril”) pixels. If set to ‘reflect’ [default], tril pixels will be flipped or “reflected” to their mirror image: “sided” column pairs will have their values swapped. If set to ‘drop’, tril pixels will be discarded. This is useful if your input data is duplexed, i.e. contains mirror duplicates of every record. If set to ‘raise’, an exception will be raised if any tril record is encountered.
  • bin1_field (str) – Name of the column representing ith (row) axis of the matrix. Default is ‘bin1_id’.
  • bin2_field (str) – Name of the column representing jth (col) axis of the matrix. Default is ‘bin2_id’.
  • sided_fields (sequence of str) – Base names of column pairs to swap values between when mirror-reflecting pixels.
  • suffixes (pair of str) – Suffixes used to identify pairs of sided columns. e.g.: (‘1’, ‘2’), (‘_x’, ‘_y’), etc.
  • sort (bool) – Whether to sort the output dataframe by bin_id and bin2_id.
  • validate (bool) – Whether to do type- and bounds-checking on the bin IDs. Raises BadInputError.
Returns:

callable – Function of one argument that takes a raw dataframe and returns a sanitized dataframe.

cooler.create.sanitize_records(bins, schema=None, **kwargs)[source]

Builds a funtion to sanitize and assign bin IDs to a data frame of paired genomic positions based on a provided genomic bin segmentation.

Parameters:
  • bins (DataFrame) – Bin table to compare records against.
  • schema (str, optional) – Use pre-defined parameters for a particular format. Any options can be overriden via kwargs. If not provided, values for all the options below must be given.
  • decode_chroms (bool) – Convert string chromosome names to integer IDs based on the order given in the bin table. Set to False if the chromosomes are already given as an enumeration, starting at 0. Records with either chrom ID < 0 are dropped.
  • is_one_based (bool) – Whether the input anchor coordinates are one-based, rather than zero-based. They will be converted to zero-based.
  • tril_action ('reflect', 'drop', 'raise' or None) – How to handle lower triangle (“tril”) records. If set to ‘reflect’, tril records will be flipped or “reflected” to their mirror image: “sided” column pairs will have their values swapped. If set to ‘drop’, tril records will be discarded. This is useful if your input data is symmetric, i.e. contains mirror duplicates of every record. If set to ‘raise’, an exception will be raised if any tril record is encountered.
  • chrom_field (str) – Base name of the two chromosome/scaffold/contig columns.
  • anchor_field (str) – Base name of the positional anchor columns.
  • sided_fields (sequence of str) – Base names of column pairs to swap values between when mirror-reflecting records.
  • suffixes (pair of str) – Suffixes used to identify pairs of sided columns. e.g.: (‘1’, ‘2’), (‘_x’, ‘_y’), etc.
  • sort (bool) – Whether to sort the output dataframe by bin_id and bin2_id.
  • validate (bool) – Whether to do type- and bounds-checking on the anchor position columns. Raises BadInputError.
Returns:

callable – Function of one argument that takes a raw dataframe and returns a sanitized dataframe with bin IDs assigned.

cooler.fileops

cooler.fileops.is_cooler(uri)[source]

Determine if a URI string references a cooler data collection. Returns False if the file or group path doesn’t exist.

cooler.fileops.is_multires_file(filepath, min_version=1)[source]

Determine if a file is a multi-res cooler file. Returns False if the file doesn’t exist.

cooler.fileops.list_coolers(filepath)[source]

List group paths to all cooler data collections in a file.

Parameters:filepath (str) –
Returns:list – Cooler group paths in the file.
cooler.fileops.cp(src_uri, dst_uri, overwrite=False)[source]

Copy a group or dataset from one file to another or within the same file.

cooler.fileops.mv(src_uri, dst_uri, overwrite=False)[source]

Rename a group or dataset within the same file.

cooler.fileops.ln(src_uri, dst_uri, soft=False, overwrite=False)[source]

Create a hard link to a group or dataset in the same file. Also supports soft links (in the same file) or external links (different files).

cooler.util

cooler.util.partition(start, stop, step)[source]

Partition an integer interval into equally-sized subintervals. Like builtin range(), but yields pairs of end points.

Examples

>>> for lo, hi in partition(0, 9, 2):
       print(lo, hi)
0 2
2 4
4 6
6 8
8 9
cooler.util.fetch_chromsizes(db, **kwargs)[source]

Download chromosome sizes from UCSC as a pandas.Series, indexed by chromosome label.

cooler.util.read_chromsizes(filepath_or, name_patterns=('^chr[0-9]+$', '^chr[XY]$', '^chrM$'), all_names=False, **kwargs)[source]

Parse a <db>.chrom.sizes or <db>.chromInfo.txt file from the UCSC database, where db is a genome assembly name.

Parameters:
  • filepath_or (str or file-like) – Path or url to text file, or buffer.
  • name_patterns (sequence, optional) – Sequence of regular expressions to capture desired sequence names. Each corresponding set of records will be sorted in natural order.
  • all_names (bool, optional) – Whether to return all contigs listed in the file. Default is False.
Returns:

pandas.Series – Series of integer bp lengths indexed by sequence name.

References

cooler.util.binnify(chromsizes, binsize)[source]

Divide a genome into evenly sized bins.

Parameters:
  • chromsizes (Series) – pandas Series indexed by chromosome name with chromosome lengths in bp.
  • binsize (int) – size of bins in bp
Returns:

bins (pandas.DataFrame) – Dataframe with columns: chrom, start, end.

cooler.util.digest(fasta_records, enzyme)[source]

Divide a genome into restriction fragments.

Parameters:
  • fasta_records (OrderedDict) – Dictionary of chromosome names to sequence records.
  • enzyme (str) – Name of restriction enzyme (e.g., ‘DpnII’).
Returns:

frags (pandas.DataFrame) – Dataframe with columns: chrom, start, end.

cooler.sandbox

cooler.sandbox.dask.read_table(group_uri, keys=None, chunksize=10000000, index=None, lock=None)[source]

Create a dask dataframe around a column-oriented table in HDF5.

A table is a group containing equal-length 1D datasets.

Parameters:
  • group_uri (str) – URI to the HDF5 group storing the table.
  • keys (list, optional) – list of HDF5 Dataset keys, default is to use all keys in the group
  • chunksize (int, optional) – Chunk size
  • index (str, optional) – Sorted column to use as index
  • lock (multiprocessing.Lock, optional) – Lock to serialize HDF5 read/write access. Default is no lock.
Returns:

dask.dataframe.DataFrame

Notes

Learn more about the dask project.