API Reference

Quick reference

cooler.Cooler(store[, root]) Convenient interface to a Cooler in a COOL file.
cooler.Cooler.chromnames List of reference sequence names
cooler.Cooler.chromsizes Ordered mapping of reference sequences to their lengths in bp
cooler.Cooler.bins(**kwargs) Bin table selector
cooler.Cooler.pixels([join]) Pixel table selector
cooler.Cooler.matrix([field, balance, ...]) Contact matrix selector
cooler.Cooler.open([mode]) Open the HDF5 group containing the Cooler with h5py
cooler.Cooler.info File information and metadata
cooler.Cooler.offset(region) Bin ID containing the left end of a genomic region
cooler.Cooler.extent(region) Bin IDs containing the left and right ends of a genomic region
cooler.annotate(pixels, bins[, replace]) Add bin annotations to a data frame of pixels.
cooler.io.ls(filepath) Traverse a file’s data hierarchy and list all Cooler nodes.
cooler.io.is_cooler(filepath[, group]) Determine if a file contains a valid Cooler data hierarchy.
cooler.io.create(cool_uri, bins, pixels[, ...]) Create a new Cooler.
cooler.io.create_from_unordered(cool_uri, ...) Create a Cooler in two passes via an external sort mechanism.
cooler.io.append(cool_uri, table, data[, ...]) Append one or more data columns to an existing table.
cooler.io.sanitize_records(bins[, schema]) Generates a function to sanitize and assign bin IDs to a data frame of paired genomic positions based on a provided genomic bin segmentation.
cooler.io.sanitize_pixels(bins, **kwargs) Generates a function to sanitize pre-binned genomic data with assigned bin IDs.
cooler.ice.iterative_correction(clr[, ...]) Iterative correction or matrix balancing of a sparse Hi-C contact map in Cooler HDF5 format.

Sandbox

cooler.contrib.dask.daskify(filepath, grouppath) Create a dask dataframe around a column-oriented table in HDF5

URI String

In a single-cooler .cool file, the Cooler resides by default in the root group / of the HDF5 file, so the group does not need to be specified explicitly.

>>> import cooler
>>> c = cooler.Cooler('data/WT.DpnII.10kb.cool')
>>> c = cooler.Cooler('data/WT.DpnII.10kb.cool::/')  # same as above

However, coolers can be stored at any level of the HDF5 hierarchy and qualified using a URI string of the form /path/to/cool/file::/path/to/cooler/group.

>>> c1 = cooler.Cooler('data/WT.DpnII.multi.cool::10kb')
>>> c2 = cooler.Cooler('data/WT.DpnII.multi.cool::1kb')

Data selection

Several cooler.Cooler methods return data selectors. Selectors don’t retrieve any data from disk until queried, and they can be queried in several ways. Genomic intervals can be provided as UCSC-style strings '{chrom}:{start}-{end}' or as (chrom, start, end) triples of type (str, int, int). For regions whose start and end are not multiples of the resolution, selectors return the smallest range of bins that fully contains the half-open interval [start, end).

Table selectors (chroms, bins, pixels)

  • lazily select columns or lists of columns, returning new selectors
  • query table rows using integer/slice indexing syntax
  • bins supports fetching genomic ranges using the fetch method
  • pixels supports fetching genomic ranges along the bin1 axis
>>> c.bins()
<cooler.core.RangeSelector1D at 0x7fdb2e4f0710>

>>> c.bins()[:10]
chrom    start       end    weight
0  chr1        0   1000000       NaN
1  chr1  1000000   2000000  1.243141
2  chr1  2000000   3000000  1.313995
3  chr1  3000000   4000000  1.291705
4  chr1  4000000   5000000  1.413288
5  chr1  5000000   6000000  1.165382
6  chr1  6000000   7000000  0.811824
7  chr1  7000000   8000000  1.056107
8  chr1  8000000   9000000  1.058915
9  chr1  9000000  10000000  1.035910

>>> c.pixels()[:10]
   bin1_id  bin2_id  count
0        0        0  18578
1        0        1  11582
2        0        2    446
3        0        3    196
4        0        4     83
5        0        5    112
6        0        6    341
7        0        7    255
8        0        8    387
9        0        9    354

>>> c.bins()['weight']
 <cooler.core.RangeSelector1D at 0x7fdb2e509240>

>>> weights = c.bins()['weight'].fetch('chr3')
>>> weights.head()
494    1.144698
495    1.549848
496    1.212580
497    1.097539
498    0.871931
Name: weight, dtype: float64

>>> mybins1 = c.bins().fetch('chr3:10,000,000-20,000,000')
>>> mybins2 = c.bins().fetch( ('chr3', 10000000, 20000000) )
>>> mybins2.head()
    chrom     start       end    weight
504  chr3  10000000  11000000  0.783160
505  chr3  11000000  12000000  0.783806
506  chr3  12000000  13000000  0.791204
507  chr3  13000000  14000000  0.821171
508  chr3  14000000  15000000  0.813079

Matrix selector

  • 2D bin range queries using slice indexing syntax
  • 2D genomic range queries using the fetch method
>>> c.matrix(balance=False)[1000:1005, 1000:1005]
array([[120022,  34107,  17335,  14053,   4137],
       [ 34107,  73396,  47427,  16125,   3642],
       [ 17335,  47427,  80458,  25105,   5394],
       [ 14053,  16125,  25105, 104536,  27214],
       [  4137,   3642,   5394,  27214, 114135]])

>>> matrix = c.matrix(sparse=True, balance=False)
>>> matrix
<cooler.core.RangeSelector2D at 0x7fdb2e245908>

>>> matrix[:]
<3114x3114 sparse matrix of type '<class 'numpy.int64'>'
    with 8220942 stored elements in COOrdinate format>

>>> c.matrix(balance=False, as_pixels=True, join=True)[1000:1005, 1000:1005]
   chrom1     start1       end1 chrom2     start2       end2   count
0    chr5  115000000  116000000   chr5  115000000  116000000  120022
1    chr5  115000000  116000000   chr5  116000000  117000000   34107
2    chr5  115000000  116000000   chr5  117000000  118000000   17335
3    chr5  115000000  116000000   chr5  118000000  119000000   14053
4    chr5  115000000  116000000   chr5  119000000  120000000    4137
5    chr5  116000000  117000000   chr5  116000000  117000000   73396
6    chr5  116000000  117000000   chr5  117000000  118000000   47427
7    chr5  116000000  117000000   chr5  118000000  119000000   16125
8    chr5  116000000  117000000   chr5  119000000  120000000    3642
9    chr5  117000000  118000000   chr5  117000000  118000000   80458
10   chr5  117000000  118000000   chr5  118000000  119000000   25105
11   chr5  117000000  118000000   chr5  119000000  120000000    5394
12   chr5  118000000  119000000   chr5  118000000  119000000  104536
13   chr5  118000000  119000000   chr5  119000000  120000000   27214
14   chr5  119000000  120000000   chr5  119000000  120000000  114135


>>> A1 = c.matrix().fetch('chr1')
>>> A2 = c.matrix().fetch('chr3:10,000,000-20,000,000')
>>> A3 = c.matrix().fetch( ('chr3', 10000000, 20000000) )
>>> A4 = c.matrix().fetch('chr2', 'chr3')

Dask

Dask data structures provide a way to manipulate and distribute computations on larger-than-memory data using familiar APIs. The current daskify function can be used to generate a dask dataframe backed by the pixel table of a Cooler as follows:

>>> from cooler.contrib.dask import daskify
>>> df = daskify(c.filename, 'pixels')

>>> df
Dask DataFrame Structure:
                bin1_id bin2_id  count
npartitions=223
0                 int64   int64  int64
9999999             ...     ...    ...
...                 ...     ...    ...
2219999999          ...     ...    ...
2220472929          ...     ...    ...
Dask Name: daskify, 223 tasks

>>> df = cooler.annotate(df, c.bins(), replace=False)
>>> df
Dask DataFrame Structure:
                chrom1 start1   end1  weight1  chrom2 start2   end2  weight2 bin1_id bin2_id  count
npartitions=31
None            object  int64  int64  float64  object  int64  int64  float64   int64   int64  int64
None               ...    ...    ...      ...     ...    ...    ...      ...     ...     ...    ...
...                ...    ...    ...      ...     ...    ...    ...      ...     ...     ...    ...
None               ...    ...    ...      ...     ...    ...    ...      ...     ...     ...    ...
None               ...    ...    ...      ...     ...    ...    ...      ...     ...     ...    ...
Dask Name: getitem, 125 tasks

>>> df = df[df.chrom1 == df.chrom2]
>>> grouped = df.groupby(df.bin2_id - df.bin1_id)
>>> x = grouped['count'].sum()
>>> x
Dask Series Structure:
npartitions=1
None    int64
None      ...
Name: count, dtype: int64
Dask Name: series-groupby-sum-agg, 378 tasks

>>> x.compute()
0       476155231
1       284724453
2       139952477
3        96520218
4        71962080
5        56085850
6        45176881
7        37274367
8        31328555
9        26781986
10       23212616
11       20366934
12       18066135
13       16159826
14       14584058
15       13249443
16       12117854
17       11149845
...

Learn more about the Dask project: https://github.com/dask/dask-tutorial

cooler

class cooler.Cooler(store, root=None, **kwargs)[source]

Convenient interface to a Cooler in a COOL file.

Notes

  • Metadata is accessible as a dictionary through the info property.
  • Data table range queries are provided through table selectors: chroms, bins, and pixels, which return DataFrames or Series.
  • Matrix range queries are provided via a matrix selector, matrix, which returns NumPy arrays or SciPy sparse coo_matrix objects.
bins(**kwargs)[source]

Bin table selector

Returns:
Table selector
binsize

Resolution in base pairs if bins are uniform; otherwise None

chromnames

List of reference sequence names

chroms(**kwargs)[source]

Chromosome table selector

Returns:
Table selector
chromsizes

Ordered mapping of reference sequences to their lengths in bp

extent(region)[source]

Bin IDs containing the left and right ends of a genomic region

Parameters:
region : str or tuple

Genomic range

Returns:
2-tuple of ints

Examples

>>> c.extent('chr3')  
(1311, 2131)
info

File information and metadata

Returns:
dict
matrix(field=None, balance=True, sparse=False, as_pixels=False, join=False, ignore_index=True, max_chunk=500000000)[source]

Contact matrix selector

Parameters:
field : str, optional

Which column of the pixel table to fill the matrix with. By default, the ‘count’ column is used.

balance : bool, optional

Whether to apply pre-calculated matrix balancing weights to the selection. Default is True and uses a column named ‘weight’. Alternatively, pass the name of the bin table column containing the desired balancing weights. Set to False to return untransformed counts.

sparse: bool, optional

Return a scipy.sparse.coo_matrix instead of a dense 2D numpy array.

as_pixels: bool, optional

Return a DataFrame of the corresponding rows from the pixel table instead of a rectangular sparse matrix. False by default.

join : bool, optional

If requesting pixels, specifies whether to expand the bin ID columns into (chrom, start, end). Has no effect when requesting a rectangular matrix. Default is False.

ignore_index : bool, optional

If requesting pixels, don’t populate the index column with the pixel IDs to improve performance. Default is True.

Returns:
Matrix selector
offset(region)[source]

Bin ID containing the left end of a genomic region

Parameters:
region : str or tuple

Genomic range

Returns:
int

Examples

>>> c.offset('chr3')  
1311
open(mode='r', **kwargs)[source]

Open the HDF5 group containing the Cooler with h5py

Functions as a context manager. Any open_kws passed during construction are ignored.

Parameters:
mode : str

r (readonly) or r+ (read/write), default: ‘r’

Additional keywords

See h5py.File

pixels(join=False, **kwargs)[source]

Pixel table selector

Parameters:
join : bool, optional

Whether to expand bin ID columns into chrom, start, and end columns. Default is False.

Returns:
Table selector
cooler.get(grp, lo=0, hi=None, fields=None, convert_enum=True, as_dict=False)[source]

Query a range of rows from a table as a dataframe.

A table is an HDF5 group containing equal-length 1D datasets serving as columns.

Parameters:
grp : h5py.Group or any dict-like of array-likes

Handle to an HDF5 group containing only 1D datasets or any similar collection of 1D datasets or arrays

lo, hi : int, optional

Range of rows to select from the table.

fields : str or sequence of str, optional

Column or list of columns to query. Defaults to all available columns. A single string returns a Series instead of a DataFrame.

convert_enum : bool, optional

Whether to convert HDF5 enum datasets into pandas.Categorical columns instead of plain integer columns. Default is True.

kwargs : optional

Options to pass to pandas.DataFrame or pandas.Series.

Returns:
DataFrame or Series

Notes

HDF5 ASCII datasets are converted to Unicode.

cooler.info(h5)[source]

File and user metadata dict.

Parameters:
h5 : h5py.File or h5py.Group

Open handle to cooler file.

Returns:
dict
cooler.chroms(h5, lo=0, hi=None, fields=None, **kwargs)[source]

Table describing the chromosomes/scaffolds/contigs used. They appear in the same order they occur in the heatmap.

Parameters:
h5 : h5py.File or h5py.Group

Open handle to cooler file.

lo, hi : int, optional

Range of rows to select from the table.

fields : sequence of str, optional

Subset of columns to select from table.

Returns:
DataFrame
cooler.bins(h5, lo=0, hi=None, fields=None, **kwargs)[source]

Table describing the genomic bins that make up the axes of the heatmap.

Parameters:
h5 : h5py.File or h5py.Group

Open handle to cooler file.

lo, hi : int, optional

Range of rows to select from the table.

fields : sequence of str, optional

Subset of columns to select from table.

Returns:
DataFrame
cooler.pixels(h5, lo=0, hi=None, fields=None, join=True, **kwargs)[source]

Table describing the nonzero upper triangular pixels of the Hi-C contact heatmap.

Parameters:
h5 : h5py.File or h5py.Group

Open handle to cooler file.

lo, hi : int, optional

Range of rows to select from the table.

fields : sequence of str, optional

Subset of columns to select from table.

join : bool, optional

Whether or not to expand bin ID columns to their full bin description (chrom, start, end). Default is True.

Returns:
DataFrame
cooler.matrix(h5, i0, i1, j0, j1, field=None, balance=True, sparse=False, as_pixels=False, join=True, ignore_index=True, max_chunk=500000000)[source]

Two-dimensional range query on the Hi-C contact heatmap. Depending on the options, returns either a 2D NumPy array, a rectangular sparse coo_matrix, or a data frame of upper triangle pixels.

Parameters:
h5 : h5py.File or h5py.Group

Open handle to cooler file.

i0, i1 : int, optional

Bin range along the 0th (row) axis of the heatmap.

j0, j1 : int, optional

Bin range along the 1st (col) axis of the heatmap.

field : str, optional

Which column of the pixel table to fill the matrix with. By default, the ‘count’ column is used.

balance : bool, optional

Whether to apply pre-calculated matrix balancing weights to the selection. Default is True and uses a column named ‘weight’. Alternatively, pass the name of the bin table column containing the desired balancing weights. Set to False to return untransformed counts.

sparse: bool, optional

Return a scipy.sparse.coo_matrix instead of a dense 2D numpy array.

as_pixels: bool, optional

Return a DataFrame of the corresponding rows from the pixel table instead of a rectangular sparse matrix. False by default.

join : bool, optional

If requesting pixels, specifies whether to expand the bin ID columns into (chrom, start, end). Has no effect when requesting a rectangular matrix. Default is True.

ignore_index : bool, optional

If requesting pixels, don’t populate the index column with the pixel IDs to improve performance. Default is True.

Returns:
ndarray, coo_matrix or DataFrame

Notes

Use the toarray() method to convert a sparse matrix to a dense NumPy array.

cooler.annotate(pixels, bins, replace=True)[source]

Add bin annotations to a data frame of pixels.

This is done by performing a relational “join” against the bin IDs of a table that describes properties of the genomic bins. New columns will be appended on the left of the output data frame.

Parameters:
pixels : DataFrame

A data frame containing columns named bin1_id and/or bin2_id. If columns bin1_id and bin2_id are both present in pixels, the adjoined columns will be suffixed with ‘1’ and ‘2’ accordingly.

bins : DataFrame or DataFrame view

Data structure that contains a full description of the genomic bins of the contact matrix, where the index corresponds to bin IDs.

replace : bool, optional

Whether to remove the original bin1_id and bin2_id columns from the output. Default is True.

Returns:
DataFrame

cooler.io

cooler.io.append(cool_uri, table, data, chunked=False, force=False, h5opts=None, lock=None)[source]

Append one or more data columns to an existing table.

Parameters:
cool_uri : str

Path to Cooler file or URI to Cooler group.

table : str

Name of table (HDF5 group).

data : dict-like

DataFrame, Series or mapping of column names to data. If the input is a dask DataFrame or Series, the data is written in chunks.

chunked : bool, optional

If True, the values of the data dict are treated as separate chunk iterators of column data.

force : bool, optional

If True, replace existing columns with the same name as the input.

h5opts : dict, optional

HDF5 dataset filter options to use (compression, shuffling, checksumming, etc.). Default is to use autochunking and GZIP compression, level 6.

lock : multiprocessing.Lock, optional

Optional lock to synchronize concurrent HDF5 file access.

cooler.io.create(cool_uri, bins, pixels, metadata=None, assembly=None, dtypes='<deprecated>', h5opts=None, append=False, lock=None, columns=None, dtype=None, boundscheck=True, triucheck=True, dupcheck=True, ensure_sorted=False, chromsizes='<deprecated>')[source]

Create a new Cooler.

Parameters:
cool_uri : str

Path to Cooler file or URI to Cooler group. If the file does not exist, it will be created.

bins : pandas.DataFrame

Segmentation of the chromosomes into genomic bins as a BED-like DataFrame with columns chrom, start and end. May contain additional columns.

pixels : DataFrame, dictionary, or iterable of either

A table (represented as a data frame or a column-oriented dict) containing columns labeled ‘bin1_id’, ‘bin2_id’ and ‘count’, sorted by (‘bin1_id’, ‘bin2_id’). If additional columns are included in the pixel table, their names and dtypes must be specified in the dtypes argument. Alternatively, for larger data, an iterable can be provided that yields the pixel data as a sequence of “chunks”. If the input is a dask DataFrame, it will also be processed one chunk at a time.

dtypes : dict, optional

Deprecated in favor of dtype to mirror the pandas constructor.

metadata : dict, optional

Experiment metadata to store in the file. Must be JSON compatible.

assembly : str, optional

Name of genome assembly.

h5opts : dict, optional

HDF5 dataset filter options to use (compression, shuffling, checksumming, etc.). Default is to use autochunking and GZIP compression, level 6.

append : bool, optional

Append new Cooler to the file if it exists. If False, an existing file with the same name will be truncated. Default is False.

lock : multiprocessing.Lock, optional

Optional lock to control concurrent access to the output file.

columns : sequence of str, optional

Specify here the names of any additional value columns from the input besides ‘count’ to store in the Cooler. The standard columns [‘bin1_id’, ‘bin2_id’, ‘count’] can be provided, but are already assumed and don’t need to be given explicitly. Additional value columns provided here will be stored as np.float64 unless otherwise specified using dtype.

dtype : dict, optional

Dictionary mapping column names in the pixel table to dtypes. Can be used to override the default dtypes of ‘bin1_id’, ‘bin2_id’ or ‘count’. Any additional value column dtypes must also be provided in the columns argument, or will be ignored.

cooler.io.create_from_unordered(cool_uri, bins, chunks, columns=None, dtype=None, mergebuf=20000000, delete_temp=True, temp_dir=None, multifile_merge=False, **kwargs)[source]

Create a Cooler in two passes via an external sort mechanism. In the first pass, a sequence of data chunks is processed, sorted in memory, and saved to temporary Coolers. In the second pass, the temporary Coolers are merged into the output. This way the individual chunks do not need to be provided in any particular order.

Parameters:
cool_uri : str

Path to Cooler file or URI to Cooler group. If the file does not exist, it will be created.

bins : DataFrame

Segmentation of the chromosomes into genomic bins. May contain additional columns.

chunks : iterable of DataFrames

Sequence of chunks that get processed and written to separate Coolers and then subsequently merged.

columns : sequence of str, optional

Specify here the names of any additional value columns from the input besides ‘count’ to store in the Cooler. The standard columns [‘bin1_id’, ‘bin2_id’, ‘count’] can be provided, but are already assumed and don’t need to be given explicitly. Additional value columns provided here will be stored as np.float64 unless otherwise specified using dtype.

dtype : dict, optional

Dictionary mapping column names in the pixel table to dtypes. Can be used to override the default dtypes of ‘bin1_id’, ‘bin2_id’ or ‘count’. Any additional value column dtypes must also be provided in the columns argument, or will be ignored.

mergebuf : int, optional

Maximum number of records to buffer in memory at any given time during the merge step.

delete_temp : bool, optional

Whether to delete temporary files when finished. Default is True; set to False to keep them for debugging.

temp_dir : str, optional

Create temporary files in this directory.

multifile_merge : bool, optional

Store temporary merge chunks as separate .cool files rather than in a single multi-cooler file. Default is False.

metadata : dict, optional

Experiment metadata to store in the file. Must be JSON compatible.

assembly : str, optional

Name of genome assembly.

h5opts : dict, optional

HDF5 dataset filter options to use (compression, shuffling, checksumming, etc.). Default is to use autochunking and GZIP compression, level 6.

append : bool, optional

Append new Cooler to the file if it exists. If False, an existing file with the same name will be truncated. Default is False.

lock : multiprocessing.Lock, optional

Optional lock to control concurrent access to the output file.

cooler.io.is_cooler(filepath, group=None)[source]

Determine if a file contains a valid Cooler data hierarchy.

Parameters:
filepath : str
group : str, optional

Path to specific group to check. Otherwise returns True if any Cooler paths are detected.

Returns:
bool
cooler.io.ls(filepath)[source]

Traverse a file’s data hierarchy and list all Cooler nodes.

Parameters:
filepath : str
Returns:
list of Cooler group paths in the file
cooler.io.parse_cooler_uri(s)[source]

Parse a Cooler URI string

e.g. /path/to/mycoolers.cool::/path/to/cooler

cooler.io.sanitize_pixels(bins, **kwargs)[source]

Generates a function to sanitize pre-binned genomic data with assigned bin IDs

Parameters:
bins : DataFrame

Bin table to compare pixel records against.

is_one_based : bool, optional

Whether the input bin IDs are one-based, rather than zero-based. They will be converted to zero-based.

tril_action : ‘reflect’, ‘drop’, ‘raise’ or None

How to handle lower triangle (“tril”) pixels. If set to ‘reflect’ [default], tril pixels will be flipped or “reflected” to their mirror image: “sided” column pairs will have their values swapped. If set to ‘drop’, tril pixels will be discarded. This is useful if your input data is symmetric, i.e. contains mirror duplicates of every record. If set to ‘raise’, an exception will be raised if any tril record is encountered.

bin1_field : str

Name of the column representing the ith (row) axis of the matrix. Default is ‘bin1_id’.

bin2_field : str

Name of the column representing the jth (col) axis of the matrix. Default is ‘bin2_id’.

sided_fields : sequence of str

Base names of column pairs to swap values between when mirror-reflecting pixels.

suffixes : pair of str

Suffixes used to identify pairs of sided columns. e.g.: (‘1’, ‘2’), (‘_x’, ‘_y’), etc.

sort : bool

Whether to sort the output dataframe by bin1_id and bin2_id.

validate : bool

Whether to do type- and bounds-checking on the bin IDs. Raises BadInputError.

Returns:
Function of one argument that takes a raw dataframe and returns a sanitized
dataframe.
cooler.io.sanitize_records(bins, schema=None, **kwargs)[source]

Generates a function to sanitize and assign bin IDs to a data frame of paired genomic positions based on a provided genomic bin segmentation.

Parameters:
bins : DataFrame

Bin table to compare records against.

schema : str, optional

Use pre-defined parameters for a particular format. Any options can be overridden via kwargs. If not provided, values for all the options below must be given.

decode_chroms : bool

Convert string chromosome names to integer IDs based on the order given in the bin table. Set to False if the chromosomes are already given as an enumeration, starting at 0. Records in which either chrom ID is < 0 are dropped.

is_one_based : bool

Whether the input anchor coordinates are one-based, rather than zero-based. They will be converted to zero-based.

tril_action : ‘reflect’, ‘drop’, ‘raise’ or None

How to handle lower triangle (“tril”) records. If set to ‘reflect’, tril records will be flipped or “reflected” to their mirror image: “sided” column pairs will have their values swapped. If set to ‘drop’, tril records will be discarded. This is useful if your input data is symmetric, i.e. contains mirror duplicates of every record. If set to ‘raise’, an exception will be raised if any tril record is encountered.

chrom_field : str

Base name of the two chromosome/scaffold/contig columns.

anchor_field : str

Base name of the positional anchor columns.

sided_fields : sequence of str

Base names of column pairs to swap values between when mirror-reflecting records.

suffixes : pair of str

Suffixes used to identify pairs of sided columns. e.g.: (‘1’, ‘2’), (‘_x’, ‘_y’), etc.

sort : bool

Whether to sort the output dataframe by bin1_id and bin2_id.

validate : bool

Whether to do type- and bounds-checking on the anchor position columns. Raises BadInputError.

Returns:
Function of one argument that takes a raw dataframe and returns a sanitized
dataframe with bin IDs assigned.

cooler.ice

cooler.ice.iterative_correction(clr, chunksize=None, map=<class 'map'>, tol=1e-05, min_nnz=0, min_count=0, mad_max=0, cis_only=False, trans_only=False, ignore_diags=False, max_iters=200, rescale_marginals=True, use_lock=False, blacklist=None, x0=None, store=False, store_name='weight')[source]

Iterative correction or matrix balancing of a sparse Hi-C contact map in Cooler HDF5 format.

Parameters:
clr : cooler.Cooler

Cooler object

chunksize : int, optional

Split the contact matrix pixel records into equally sized chunks to save memory and/or parallelize. Default is to use all the pixels at once.

map : callable, optional

Map function to dispatch the matrix chunks to workers. Default is the builtin map, but alternatives include parallel map implementations from a multiprocessing pool.

tol : float, optional

Convergence criterion: iteration stops when the variance of the marginal (row/col) sum vector drops below this value.

min_nnz : int, optional

Pre-processing bin-level filter. Drop bins with fewer nonzero elements than this value.

min_count : int, optional

Pre-processing bin-level filter. Drop bins with lower marginal sum than this value.

mad_max : int, optional

Pre-processing bin-level filter. Drop bins whose log marginal sum is less than mad_max median absolute deviations below the median log marginal sum.

cis_only : bool, optional

Do iterative correction on intra-chromosomal data only. Inter-chromosomal data is ignored.

trans_only : bool, optional

Do iterative correction on inter-chromosomal data only. Intra-chromosomal data is ignored.

blacklist : list or 1D array, optional

An explicit list of IDs of bad bins to filter out when performing balancing.

ignore_diags : int or False, optional

Drop elements occurring on the first ignore_diags diagonals of the matrix (including the main diagonal).

max_iters : int, optional

Iteration limit.

rescale_marginals : bool, optional

Normalize the balancing weights such that the balanced matrix has rows / columns that sum to 1.0. The scale factor is stored in the stats output dictionary.

x0 : 1D array, optional

Initial weight vector to use. Default is to start with ones(n_bins).

store : bool, optional

Whether to store the results in the file when finished. Default is False.

store_name : str, optional

Name of the column of the bin table to save to. Default name is ‘weight’.

Returns:
bias : 1D array, of length equal to the number of bins in clr.

Vector of bin bias weights to normalize the observed contact map. Dropped bins will be assigned the value NaN. N[i, j] = O[i, j] * bias[i] * bias[j]

stats : dict

Summary of parameters used to perform balancing and the average magnitude of the corrected matrix’s marginal sum at convergence.

cooler.util

cooler.util.bedslice(grouped, chromsizes, region)[source]

Range query on a BED-like dataframe with non-overlapping intervals.

cooler.util.binnify(chromsizes, binsize)[source]

Divide a genome into evenly sized bins.

Parameters:
chromsizes : Series

pandas Series indexed by chromosome name with chromosome lengths in bp.

binsize : int

size of bins in bp

Returns:
Dataframe with columns: ‘chrom’, ‘start’, ‘end’.
cooler.util.digest(fasta_records, enzyme)[source]

Divide a genome into restriction fragments.

Parameters:
fasta_records : OrderedDict

Dictionary of chromosome names to sequence records.

enzyme: str

Name of restriction enzyme.

Returns:
Dataframe with columns: ‘chrom’, ‘start’, ‘end’.
cooler.util.fetch_chromsizes(db, **kwargs)[source]

Download chromosome sizes from UCSC as a pandas.Series, indexed by chromosome label.

cooler.util.get_binsize(bins)[source]

Infer bin size from a bin DataFrame. Assumes that the last bin of each contig is allowed to differ in size from the rest.

Returns:
int or None if bins are non-uniform
cooler.util.get_chromsizes(bins)[source]

Infer chromsizes Series from a bin DataFrame. Assumes that the last bin of each contig is allowed to differ in size from the rest.

Returns:
Series of chromosome lengths in bp, indexed by chromosome name
cooler.util.get_meta(columns, dtype=None, index_columns=None, index_names=None, default_dtype=...)[source]

Extracted and modified from pandas/io/parsers.py : _get_empty_meta (BSD licensed).
cooler.util.infer_meta(x, index=None)[source]

Extracted and modified from dask/dataframe/utils.py : make_meta (BSD licensed).

Create an empty pandas object containing the desired metadata.

Parameters:
x : dict, tuple, list, pd.Series, pd.DataFrame, pd.Index, dtype, scalar

To create a DataFrame, provide a dict mapping of {name: dtype}, or an iterable of (name, dtype) tuples. To create a Series, provide a tuple of (name, dtype). If a pandas object, names, dtypes, and index should match the desired output. If a dtype or scalar, a scalar of the same dtype is returned.

index : pd.Index, optional

Any pandas index to use in the metadata. If none provided, a RangeIndex will be used.

Examples

>>> make_meta([('a', 'i8'), ('b', 'O')])
Empty DataFrame
Columns: [a, b]
Index: []
>>> make_meta(('a', 'f8'))
Series([], Name: a, dtype: float64)
>>> make_meta('i8')
1
cooler.util.lexbisect(arrays, values, side='left', lo=0, hi=None)[source]

Bisection search on lexically sorted arrays.

Parameters:
arrays : sequence of k 1-D array-like

Each “array” can be any sequence that supports scalar integer indexing. The arrays are assumed to have the same length and values lexsorted from left to right.

values : sequence of k values

Values that would be inserted into the arrays.

side : {‘left’, ‘right’}, optional

If ‘left’, the index of the first suitable location found is given. If ‘right’, return the last such index. If there is no suitable index, return either 0 or N (where N is the length of each array).

lo, hi : int, optional

Bound the slice to be searched (default ‘lo’ is 0 and ‘hi’ is N).

Returns:
i : int

Insertion index.

Examples

>>> h5 = h5py.File('mytable.h5', 'r')  
>>> lexbisect([h5['chrom'], h5['start']], [1, 100000], side='right')  
2151688
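The search can be sketched in pure Python without HDF5 (a hypothetical minimal re-implementation; the library version additionally accepts lo/hi bounds and operates on on-disk datasets as in the example above):

```python
def lexbisect(arrays, values, side="left"):
    # Binary search over k parallel arrays treated as lexicographically
    # sorted tuples; returns the insertion index for `values`.
    lo, hi = 0, len(arrays[0])
    key = tuple(values)
    while lo < hi:
        mid = (lo + hi) // 2
        row = tuple(arr[mid] for arr in arrays)
        if row < key or (side == "right" and row == key):
            lo = mid + 1
        else:
            hi = mid
    return lo
```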
cooler.util.load_fasta(names, *filepaths)[source]

Load lazy FASTA records from one or multiple files without reading them into memory.

Parameters:
names : sequence of str

Names of sequence records in FASTA file or files.

filepaths : str

Paths to one or more FASTA files to gather records from.

Returns:
OrderedDict of sequence name -> sequence record
cooler.util.make_bintable(chromsizes, binsize)

Divide a genome into evenly sized bins.

Parameters:
chromsizes : Series

pandas Series indexed by chromosome name with chromosome lengths in bp.

binsize : int

size of bins in bp

Returns:
DataFrame with columns: ‘chrom’, ‘start’, ‘end’.
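The binning logic can be sketched as follows (a hypothetical minimal re-implementation with numpy/pandas; the library's own version may differ in details such as dtypes):

```python
import numpy as np
import pandas as pd

def make_bintable(chromsizes, binsize):
    # Divide each chromosome into fixed-size bins; the last bin of each
    # chromosome is truncated to the chromosome length.
    frames = []
    for chrom, clen in chromsizes.items():
        n_bins = int(np.ceil(clen / binsize))
        starts = np.arange(n_bins) * binsize
        ends = np.minimum(starts + binsize, clen)
        frames.append(pd.DataFrame({"chrom": chrom, "start": starts, "end": ends}))
    return pd.concat(frames, ignore_index=True)
```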
cooler.util.open_hdf5(fp, mode='r', *args, **kwargs)[source]

Context manager like h5py.File but accepts already open HDF5 file handles which do not get closed on teardown.

Parameters:
fp : str or h5py.File object

If an open file object is provided, it passes through unchanged, provided that the requested mode is compatible. If a filepath is passed, the context manager will close the file on tear down.

mode : str
  • r Readonly, file must exist
  • r+ Read/write, file must exist
  • a Read/write if exists, create otherwise
  • w Truncate if exists, create otherwise
  • w- or x Fail if exists, create otherwise
cooler.util.parse_region(reg, chromsizes=None)[source]

Genomic regions are represented as half-open intervals (0-based starts, 1-based ends) along the length coordinate of a contig/scaffold/chromosome.

Parameters:
reg : str or tuple

UCSC-style genomic region string, or triple (chrom, start, end), where start or end may be None.

chromsizes : mapping, optional

Lookup table of scaffold lengths to check against chrom and the end coordinate. Required if end is not supplied.

Returns:
A well-formed genomic region triple (str, int, int)

cooler.util.parse_region_string(s)[source]

Parse a UCSC-style genomic region string into a triple.

Parameters:
s : str

UCSC-style string, e.g. “chr5:10,100,000-30,000,000”. Ensembl and FASTA style sequence names are allowed. End coordinate must be greater than or equal to start.

Returns:
(str, int or None, int or None)
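The parsing can be sketched with a regular expression (a hypothetical simplified version; the library's parser additionally handles Ensembl- and FASTA-style sequence names with richer character sets):

```python
import re

def parse_region_string(s):
    # e.g. "chr5:10,100,000-30,000,000" -> ("chr5", 10100000, 30000000)
    m = re.match(r"^(\w[\w.\-]*)(?::([\d,]+)?-([\d,]+)?)?$", s)
    if m is None:
        raise ValueError("Could not parse region: {}".format(s))
    chrom, start, end = m.groups()
    start = int(start.replace(",", "")) if start else None
    end = int(end.replace(",", "")) if end else None
    if start is not None and end is not None and end < start:
        raise ValueError("End coordinate must be >= start")
    return chrom, start, end
```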
cooler.util.read_chromsizes(filepath_or, name_patterns=('^chr[0-9]+$', '^chr[XY]$', '^chrM$'), all_names=False, **kwargs)[source]

Parse a <db>.chrom.sizes or <db>.chromInfo.txt file from the UCSC database, where db is a genome assembly name.

Parameters:
filepath_or : str or file-like

Path or url to text file, or buffer.

name_patterns : sequence, optional

Sequence of regular expressions to capture desired sequence names. Each corresponding set of records will be sorted in natural order.

all_names : bool, optional

Whether to return all contigs listed in the file. Default is False.

Returns:
Series of integer bp lengths indexed by sequence name.

Notes

UCSC assembly terminology: <http://genome.ucsc.edu/FAQ/FAQdownloads.html#download9> NCBI assembly terminology: <https://www.ncbi.nlm.nih.gov/grc/help/definitions>
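The pattern-based filtering and natural-order sorting can be sketched as follows (a hypothetical simplified version of the name selection step only; the real function also fetches and parses the two-column chromsizes file):

```python
import re

def natural_key(s):
    # Split into digit and non-digit runs so "chr10" sorts after "chr2".
    return [int(t) if t.isdigit() else t for t in re.split(r"(\d+)", s)]

def filter_names(names, patterns=(r"^chr[0-9]+$", r"^chr[XY]$", r"^chrM$")):
    # For each pattern, collect the matching names in natural order.
    out = []
    for pat in patterns:
        matched = [n for n in names if re.match(pat, n)]
        out.extend(sorted(matched, key=natural_key))
    return out
```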

cooler.util.rlencode(array, chunksize=None)[source]

Run length encoding. Based on http://stackoverflow.com/a/32681075, which is based on the rle function from R.

Parameters:
array : 1D array_like

Input array to encode.

chunksize : int, optional

Size of chunks in which to read the input (useful for large on-disk arrays). Default is to process the whole array at once.

Returns:
start positions, run lengths, run values
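The vectorized encoding can be sketched in numpy (a hypothetical minimal version without the chunked-reading option; it follows the same run-boundary idea as the Stack Overflow answer cited above):

```python
import numpy as np

def rlencode(array):
    # Run boundaries are the positions where consecutive values differ.
    x = np.asarray(array)
    n = len(x)
    if n == 0:
        return (np.array([], dtype=int), np.array([], dtype=int),
                np.array([], dtype=x.dtype))
    starts = np.r_[0, np.flatnonzero(x[1:] != x[:-1]) + 1]
    lengths = np.diff(np.r_[starts, n])
    values = x[starts]
    return starts, lengths, values
```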

cooler.contrib

cooler.contrib.dask.daskify(filepath, grouppath, keys=None, chunksize=10000000, index=None, lock=None)[source]

Create a dask dataframe around a column-oriented table in HDF5

A table is a group containing equal-length 1D datasets.

Parameters:
filepath : str

Path to HDF5 file

grouppath : str

HDF5 group path

keys : list, optional

list of HDF5 Dataset keys, default is to use all keys in the group

chunksize : int, optional

Chunk size

index : str, optional

Sorted column to use as index

lock : multiprocessing.Lock, optional

Lock to serialize HDF5 read/write access. Default is no lock.

Returns:
dask.dataframe.DataFrame

Notes

Learn more about dask: <https://github.com/dask/dask-tutorial>.
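The chunked, column-oriented access pattern can be illustrated without dask or HDF5 (iter_table_chunks is a hypothetical in-memory analogue that yields eager DataFrame chunks; the real function instead returns a single lazy dask DataFrame whose partitions read from the HDF5 datasets on demand):

```python
import numpy as np
import pandas as pd

def iter_table_chunks(columns, chunksize):
    # columns: dict of equal-length 1D arrays (the "table").
    # Yield successive row slices as DataFrames of at most `chunksize` rows.
    n = len(next(iter(columns.values())))
    for lo in range(0, n, chunksize):
        hi = min(lo + chunksize, n)
        yield pd.DataFrame({k: np.asarray(v[lo:hi]) for k, v in columns.items()})
```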