API Reference
Quick reference
Cooler class
A convenient interface to a cooler data collection. |
|
Resolution in base pairs if uniform else None |
|
List of reference sequence names |
|
Ordered mapping of reference sequences to their lengths in bp |
|
|
Bin table selector |
|
Pixel table selector |
|
Contact matrix selector |
|
Open the HDF5 group containing the Cooler with |
File information and metadata |
|
|
Bin ID containing the left end of a genomic region |
|
Bin IDs containing the left and right ends of a genomic region |
Creation/reduction
|
Create a cooler from bins and pixels at the specified URI. |
|
Merge multiple coolers with identical axes. |
|
Coarsen a cooler to a lower resolution by an integer factor k. |
|
Generate multiple cooler resolutions by recursive coarsening. |
|
Create a single-cell (scool) file. |
Manipulation
|
Add bin annotations to a data frame of pixels. |
|
Iterative correction or matrix balancing of a sparse Hi-C contact map in Cooler HDF5 format. |
|
Substitute existing chromosome/contig names for new ones. |
File operations
Determine if a URI string references a cooler data collection. |
|
|
Determine if a file is a multi-res cooler file. |
|
List group paths to all cooler data collections in a file. |
|
Copy a group or dataset from one file to another or within the same file. |
|
Rename a group or dataset within the same file. |
|
Create a hard link to a group or dataset in the same file. |
cooler
- class cooler.Cooler[source]
A convenient interface to a cooler data collection.
- Parameters:
store (str,
h5py.File
orh5py.Group
) – Path to a cooler file, URI string, or open handle to the root HDF5 group of a cooler data collection.root (str, optional [deprecated]) – HDF5 Group path to root of cooler group if
store
is a file. This option is deprecated. Instead, use a URI string of the form<file_path>::<group_path>
.kwargs (optional) – Options to be passed to
h5py.File()
upon every access. By default, the file is opened with the default driver and mode=’r’.
Notes
If
store
is a file path, the file will be opened temporarily in when performing operations. This allowsCooler
objects to be serialized for multiprocess and distributed computations.Metadata is accessible as a dictionary through the
info
property.Table selectors, created using
chroms()
,bins()
, andpixels()
, perform range queries over table rows, returningpd.DataFrame
andpd.Series
.A matrix selector, created using
matrix()
, performs 2D matrix range queries, returningnumpy.ndarray
orscipy.sparse.coo_matrix
.- property binsize: int | None
Resolution in base pairs if uniform else None
- property chromnames: list[str]
List of reference sequence names
- chroms(**kwargs)[source]
Chromosome table selector
- Returns:
Table selector
- Return type:
RangeSelector1D
- property chromsizes: Series
Ordered mapping of reference sequences to their lengths in bp
- extent(region)[source]
Bin IDs containing the left and right ends of a genomic region
- Parameters:
region (str or tuple) – Genomic range
- Returns:
2-tuple of ints
- Return type:
tuple[int, int]
Examples
>>> c.extent('chr3') (1311, 2131)
- property info: dict
File information and metadata
- Returns:
dict
- matrix(field=None, balance=True, sparse=False, as_pixels=False, join=False, ignore_index=True, divisive_weights=None, chunksize=10000000)[source]
Contact matrix selector
- Parameters:
field (str, optional) – Which column of the pixel table to fill the matrix with. By default, the ‘count’ column is used.
balance (bool, optional) – Whether to apply pre-calculated matrix balancing weights to the selection. Default is True and uses a column named ‘weight’. Alternatively, pass the name of the bin table column containing the desired balancing weights. Set to False to return untransformed counts.
sparse (bool, optional) – Return a scipy.sparse.coo_matrix instead of a dense 2D numpy array.
as_pixels (bool, optional) – Return a DataFrame of the corresponding rows from the pixel table instead of a rectangular sparse matrix. False by default.
join (bool, optional) – If requesting pixels, specifies whether to expand the bin ID columns into (chrom, start, end). Has no effect when requesting a rectangular matrix. Default is True.
ignore_index (bool, optional) – If requesting pixels, don’t populate the index column with the pixel IDs to improve performance. Default is True.
divisive_weights (bool, optional) – Force balancing weights to be interpreted as divisive (True) or multiplicative (False). Weights are always assumed to be multiplicative by default unless named KR, VC or SQRT_VC, in which case they are assumed to be divisive by default.
chunksize (int)
- Returns:
Matrix selector
- Return type:
RangeSelector2D
Notes
If
as_pixels=True
, only data explicitly stored in the pixel table will be returned: if the cooler’s storage mode is symmetric-upper, lower triangular elements will not be generated. Ifas_pixels=False
, those missing non-zero elements will automatically be filled in.
- offset(region)[source]
Bin ID containing the left end of a genomic region
- Parameters:
region (str or tuple) – Genomic range
- Returns:
int
- Return type:
int
Examples
>>> c.offset('chr3') 1311
- open(mode='r', **kwargs)[source]
Open the HDF5 group containing the Cooler with
h5py
Functions as a context manager. Any
open_kws
passed during construction are ignored.- Parameters:
mode (str, optional [default: 'r']) –
'r'
(readonly)'r+'
or'a'
(read/write)
- Return type:
Group
Notes
For other parameters, see
h5py.File
.
- pixels(join=False, **kwargs)[source]
Pixel table selector
- Parameters:
join (bool, optional) – Whether to expand bin ID columns into chrom, start, and end columns. Default is
False
.- Returns:
Table selector
- Return type:
RangeSelector1D
- property storage_mode: str
Indicates whether ordinary sparse matrix encoding is used (
"square"
) or whether a symmetric matrix is encoded by storing only the upper triangular elements ("symmetric-upper"
).
- cooler.annotate(pixels, bins, replace=False)[source]
Add bin annotations to a data frame of pixels.
This is done by performing a relational “join” against the bin IDs of a table that describes properties of the genomic bins. New columns will be appended on the left of the output data frame.
Changed in version 0.8.0: The default value of
replace
changed to False.- Parameters:
pixels (
DataFrame
) – A data frame containing columns namedbin1_id
and/orbin2_id
. If columnsbin1_id
andbin2_id
are both present inpixels
, the adjoined columns will be suffixed with ‘1’ and ‘2’ accordingly.bins (
DataFrame
or DataFrame selector) – Data structure that contains a full description of the genomic bins of the contact matrix, where the index corresponds to bin IDs.replace (bool, optional) – Remove the original
bin1_id
andbin2_id
columns from the output. Default is False.
- Returns:
DataFrame
- Return type:
DataFrame
- cooler.create_cooler(cool_uri, bins, pixels, columns=None, dtypes=None, metadata=None, assembly=None, ordered=False, symmetric_upper=True, mode='w', mergebuf=20000000, delete_temp=True, temp_dir=None, max_merge=200, boundscheck=True, dupcheck=True, triucheck=True, ensure_sorted=False, h5opts=None, lock=None)[source]
Create a cooler from bins and pixels at the specified URI.
Because the number of pixels is often very large, the input pixels are normally provided as an iterable (e.g., an iterator or generator) of DataFrame chunks that fit in memory.
Added in version 0.8.0.
- Parameters:
cool_uri (str) – Path to cooler file or URI string. If the file does not exist, it will be created.
bins (pandas.DataFrame) – Segmentation of the chromosomes into genomic bins as a BED-like DataFrame with columns
chrom
,start
andend
. May contain additional columns.pixels (DataFrame, dictionary, or iterable of either) – A table, given as a dataframe or a column-oriented dict, containing columns labeled
bin1_id
,bin2_id
andcount
, sorted by (bin1_id
,bin2_id
). If additional columns are included in the pixel table, their names and dtypes must be specified using thecolumns
anddtypes
arguments. For larger input data, an iterable can be provided that yields the pixel data as a sequence of chunks. If the input is a dask DataFrame, it will also be processed one chunk at a time.columns (sequence of str, optional) – Customize which value columns from the input pixels to store in the cooler. Non-standard value columns will be given dtype
float64
unless overriden using thedtypes
argument. IfNone
, we only attempt to store a value column named"count"
.dtypes (dict, optional) – Dictionary mapping column names to dtypes. Can be used to override the default dtypes of
bin1_id
,bin2_id
orcount
or assign dtypes to custom value columns. Non-standard value columns given indtypes
must also be provided in thecolumns
argument or they will be ignored.metadata (dict, optional) – Experiment metadata to store in the file. Must be JSON compatible.
assembly (str, optional) – Name of genome assembly.
ordered (bool, optional [default: False]) – If the input chunks of pixels are provided with correct triangularity and in ascending order of (
bin1_id
,bin2_id
), set this toTrue
to write the cooler in one step. IfFalse
(default), we create the cooler in two steps using an external sort mechanism. See Notes for more details.symmetric_upper (bool, optional [default: True]) – If True, sets the file’s storage-mode property to
symmetric-upper
: use this only if the input data references the upper triangle of a symmetric matrix! For all other cases, set this option to False.mode ({'w' , 'a'}, optional [default: 'w']) – Write mode for the output file. ‘a’: if the output file exists, append the new cooler to it. ‘w’: if the output file exists, it will be truncated. Default is ‘w’.
mergebuf (int, optional) – Maximum number of records to buffer in memory at any give time during the merge step.
delete_temp (bool, optional) – Whether to delete temporary files when finished. Useful for debugging. Default is False.
temp_dir (str, optional) – Create temporary files in a specified directory instead of the same directory as the output file. Pass
-
to use the system default.max_merge (int, optional) – If merging more than
max_merge
chunks, do the merge recursively in two passes.h5opts (dict, optional) – HDF5 dataset filter options to use (compression, shuffling, checksumming, etc.). Default is to use autochunking and GZIP compression, level 6.
lock (multiprocessing.Lock, optional) – Optional lock to control concurrent access to the output file.
ensure_sorted (bool, optional) – Ensure that each input chunk is properly sorted.
boundscheck (bool, optional) – Input validation: Check that all bin IDs lie in the expected range.
dupcheck (bool, optional) – Input validation: Check that no duplicate pixels exist within any chunk.
triucheck (bool, optional) – Input validation: Check that
bin1_id
<=bin2_id
when creating coolers in symmetric-upper mode.
- Return type:
None
Notes
If the pixel chunks are provided in the correct order required for the output to be properly sorted, then the cooler can be created in a single step by setting
ordered=True
.If not, the cooler is created in two steps via an external sort mechanism. In the first pass, the sequence of pixel chunks are processed and sorted in memory and saved to temporary coolers. In the second pass, the temporary coolers are merged into the output file. This way the individual chunks do not need to be provided in any particular order. When
ordered=False
, the following options for the merge step are available:mergebuf
,delete_temp
,temp_dir
,max_merge
.Each chunk of pixels will go through a validation pipeline, which can be customized with the following options:
boundscheck
,triucheck
,dupcheck
,ensure_sorted
.
- cooler.merge_coolers(output_uri, input_uris, mergebuf, columns=None, dtypes=None, agg=None, **kwargs)[source]
Merge multiple coolers with identical axes.
The merged cooler is stored at
output_uri
.Added in version 0.8.0.
- Parameters:
output_uri (str) – Output cooler file path or URI.
input_uris (list of str) – List of input file path or URIs of coolers to combine.
mergebuf (int) – Maximum number of pixels processed at a time.
columns (list of str, optional) – Specify which pixel value columns to include in the aggregation. Default is to use all available value columns.
dtypes (dict, optional) – Specific dtypes to use for value columns. Default is to propagate the current dtypes of the value columns.
agg (dict, optional) – Functions to use for aggregating each value column. Pass the same kind of dict accepted by
pandas.DataFrame.groupby.agg
. Default is to apply ‘sum’ to every value column.kwargs – Passed to
cooler.create
.
- Return type:
None
Notes
The default output file mode is ‘w’. If appending output to an existing file, pass mode=’a’.
See also
- cooler.coarsen_cooler(base_uri, output_uri, factor, chunksize, nproc=1, columns=None, dtypes=None, agg=None, **kwargs)[source]
Coarsen a cooler to a lower resolution by an integer factor k.
This is done by pooling k-by-k neighborhoods of pixels and aggregating. Each chromosomal block is coarsened individually. Result is a coarsened cooler stored at
output_uri
.Added in version 0.8.0.
- Parameters:
base_uri (str) – Input cooler file path or URI.
output_uri (str) – Input cooler file path or URI.
factor (int) – Coarsening factor.
chunksize (int) – Number of pixels processed at a time per worker.
nproc (int, optional) – Number of workers for batch processing of pixels. Default is 1, i.e. no process pool.
columns (list of str, optional) – Specify which pixel value columns to include in the aggregation. Default is to use all available value columns.
dtypes (dict, optional) – Specific dtypes to use for value columns. Default is to propagate the current dtypes of the value columns.
agg (dict, optional) – Functions to use for aggregating each value column. Pass the same kind of dict accepted by
pandas.DataFrame.groupby.agg
. Default is to apply ‘sum’ to every value column.kwargs – Passed to
cooler.create
.
- Return type:
None
See also
- cooler.zoomify_cooler(base_uris, outfile, resolutions, chunksize, nproc=1, columns=None, dtypes=None, agg=None, **kwargs)[source]
Generate multiple cooler resolutions by recursive coarsening.
Result is a “zoomified” or “multires” cool file stored at
outfile
using the MCOOL v2 layout, where coolers are stored under a hierarchy of the formresolutions/<r>
for each resolutionr
.Added in version 0.8.0.
- Parameters:
base_uris (str or sequence of str) – One or more cooler URIs to use as “base resolutions” for aggregation.
outfile (str) – Output multires cooler (mcool) file path.
resolutions (list of int) – A list of target resolutions to generate.
chunksize (int) – Number of pixels processed at a time per worker.
nproc (int, optional) – Number of workers for batch processing of pixels. Default is 1, i.e. no process pool.
columns (list of str, optional) – Specify which pixel value columns to include in the aggregation. Default is to use only the column named ‘count’ if it exists.
dtypes (dict, optional) – Specific dtypes to use for value columns. Default is to propagate the current dtypes of the value columns.
agg (dict, optional) – Functions to use for aggregating each value column. Pass the same kind of dict accepted by
pandas.DataFrame.groupby.agg
. Default is to apply ‘sum’ to every value column.kwargs – Passed to
cooler.create
.
- Return type:
None
See also
- cooler.balance_cooler(clr, *, cis_only=False, trans_only=False, ignore_diags=2, mad_max=5, min_nnz=10, min_count=0, blacklist=None, rescale_marginals=True, x0=None, tol=1e-05, max_iters=200, chunksize=10000000, map=<class 'map'>, use_lock=False, store=False, store_name='weight')[source]
Iterative correction or matrix balancing of a sparse Hi-C contact map in Cooler HDF5 format.
- Parameters:
clr (cooler.Cooler) – Cooler object
cis_only (bool, optional) – Do iterative correction on intra-chromosomal data only. Inter-chromosomal data is ignored.
trans_only (bool, optional) – Do iterative correction on inter-chromosomal data only. Intra-chromosomal data is ignored.
ignore_diags (int or False, optional) – Drop elements occurring on the first
ignore_diags
diagonals of the matrix (including the main diagonal).chunksize (int or None, optional) – Split the contact matrix pixel records into equally sized chunks to save memory and/or parallelize. Set to
None
to use all the pixels at once.mad_max (int, optional) – Pre-processing bin-level filter. Drop bins whose log marginal sum is less than
mad_max
median absolute deviations below the median log marginal sum.min_nnz (int, optional) – Pre-processing bin-level filter. Drop bins with fewer nonzero elements than this value.
min_count (int, optional) – Pre-processing bin-level filter. Drop bins with lower marginal sum than this value.
blacklist (list or 1D array, optional) – An explicit list of IDs of bad bins to filter out when performing balancing.
rescale_marginals (bool, optional) – Normalize the balancing weights such that the balanced matrix has rows / columns that sum to 1.0. The scale factor is stored in the
stats
output dictionary.map (callable, optional) – Map function to dispatch the matrix chunks to workers. Default is the builtin
map
, but alternatives include parallel map implementations from a multiprocessing pool.x0 (1D array, optional) – Initial weight vector to use. Default is to start with ones(n_bins).
tol (float, optional) – Convergence criterion is the variance of the marginal (row/col) sum vector.
max_iters (int, optional) – Iteration limit.
store (bool, optional) – Whether to store the results in the file when finished. Default is False.
store_name (str, optional) – Name of the column of the bin table to save to. Default name is ‘weight’.
use_lock (bool)
- Returns:
bias (1D array, whose shape is the number of bins in
h5
.) – Vector of bin bias weights to normalize the observed contact map. Dropped bins will be assigned the value NaN. N[i, j] = O[i, j] * bias[i] * bias[j]stats (dict) – Summary of parameters used to perform balancing and the average magnitude of the corrected matrix’s marginal sum at convergence.
- Return type:
tuple[ndarray, dict]
- cooler.rename_chroms(clr, rename_dict, h5opts=None)[source]
Substitute existing chromosome/contig names for new ones. They will be written to the file and the Cooler object will be refreshed.
- Parameters:
clr (Cooler) – Cooler object that can be opened with write permissions.
rename_dict (dict) – Dictionary of old -> new chromosome names. Any names omitted from the dictionary will be kept as is.
h5opts (dict, optional) – HDF5 filter options.
- Return type:
None
- cooler.create_scool(cool_uri, bins, cell_name_pixels_dict, columns=None, dtypes=None, metadata=None, assembly=None, ordered=False, symmetric_upper=True, mode='w', mergebuf=20000000, delete_temp=True, temp_dir=None, max_merge=200, boundscheck=True, dupcheck=True, triucheck=True, ensure_sorted=False, h5opts=None, lock=None, **kwargs)[source]
Create a single-cell (scool) file.
For each cell store a cooler matrix under /cells, where all matrices have the same dimensions.
Each cell is a regular cooler data collection, so the input must be a bin table and pixel table for each cell. The pixel tables are provided as a dictionary where the key is a unique cell name. The bin tables can be provided as a dict with the same keys or a single common bin table can be given.
Added in version 0.8.9.
- Parameters:
cool_uri (str) – Path to scool file or URI string. If the file does not exist, it will be created.
bins (
pandas.DataFrame
or Dict[str, DataFrame]) – A single bin table or dictionary of cell names to bins tables. A bin table is a dataframe with columnschrom
,start
andend
. May contain additional columns.cell_name_pixels_dict (Dict[str, DataFrame]) – Cell name as key and pixel table DataFrame as value. A table, given as a dataframe or a column-oriented dict, containing columns labeled
bin1_id
,bin2_id
andcount
, sorted by (bin1_id
,bin2_id
). If additional columns are included in the pixel table, their names and dtypes must be specified using thecolumns
anddtypes
arguments. For larger input data, an iterable can be provided that yields the pixel data as a sequence of chunks. If the input is a dask DataFrame, it will also be processed one chunk at a time.columns (sequence of str, optional) – Customize which value columns from the input pixels to store in the cooler. Non-standard value columns will be given dtype
float64
unless overriden using thedtypes
argument. IfNone
, we only attempt to store a value column named"count"
.dtypes (dict, optional) – Dictionary mapping column names to dtypes. Can be used to override the default dtypes of
bin1_id
,bin2_id
orcount
or assign dtypes to custom value columns. Non-standard value columns given indtypes
must also be provided in thecolumns
argument or they will be ignored.metadata (dict, optional) – Experiment metadata to store in the file. Must be JSON compatible.
assembly (str, optional) – Name of genome assembly.
ordered (bool, optional [default: False]) – If the input chunks of pixels are provided with correct triangularity and in ascending order of (
bin1_id
,bin2_id
), set this toTrue
to write the cooler in one step. IfFalse
(default), we create the cooler in two steps using an external sort mechanism. See Notes for more details.symmetric_upper (bool, optional [default: True]) – If True, sets the file’s storage-mode property to
symmetric-upper
: use this only if the input data references the upper triangle of a symmetric matrix! For all other cases, set this option to False.mode ({'w' , 'a'}, optional [default: 'w']) – Write mode for the output file. ‘a’: if the output file exists, append the new cooler to it. ‘w’: if the output file exists, it will be truncated. Default is ‘w’.
mergebuf (int, optional) – Maximum number of records to buffer in memory at any give time during the merge step.
delete_temp (bool, optional) – Whether to delete temporary files when finished. Useful for debugging. Default is False.
temp_dir (str, optional) – Create temporary files in a specified directory instead of the same directory as the output file. Pass
-
to use the system default.max_merge (int, optional) – If merging more than
max_merge
chunks, do the merge recursively in two passes.h5opts (dict, optional) – HDF5 dataset filter options to use (compression, shuffling, checksumming, etc.). Default is to use autochunking and GZIP compression, level 6.
lock (multiprocessing.Lock, optional) – Optional lock to control concurrent access to the output file.
ensure_sorted (bool, optional) – Ensure that each input chunk is properly sorted.
boundscheck (bool, optional) – Input validation: Check that all bin IDs lie in the expected range.
dupcheck (bool, optional) – Input validation: Check that no duplicate pixels exist within any chunk.
triucheck (bool, optional) – Input validation: Check that
bin1_id
<=bin2_id
when creating coolers in symmetric-upper mode.
- Return type:
None
See also
Notes
If the pixel chunks are provided in the correct order required for the output to be properly sorted, then the cooler can be created in a single step by setting
ordered=True
.If not, the cooler is created in two steps via an external sort mechanism. In the first pass, the sequence of pixel chunks are processed and sorted in memory and saved to temporary coolers. In the second pass, the temporary coolers are merged into the output file. This way the individual chunks do not need to be provided in any particular order. When
ordered=False
, the following options for the merge step are available:mergebuf
,delete_temp
,temp_dir
,max_merge
.Each chunk of pixels will go through a validation pipeline, which can be customized with the following options:
boundscheck
,triucheck
,dupcheck
,ensure_sorted
.
cooler.create
- cooler.create.sanitize_pixels(bins, **kwargs)[source]
Builds a function to sanitize an already-binned genomic data with genomic bin assignments.
- Parameters:
bins (DataFrame) – Bin table to compare pixel records against.
is_one_based (bool, optional) – Whether the input bin IDs are one-based, rather than zero-based. They will be converted to zero-based.
tril_action ('reflect', 'drop', 'raise' or None) – How to handle lower triangle (“tril”) pixels. If set to ‘reflect’ [default], tril pixels will be flipped or “reflected” to their mirror image: “sided” column pairs will have their values swapped. If set to ‘drop’, tril pixels will be discarded. This is useful if your input data is duplexed, i.e. contains mirror duplicates of every record. If set to ‘raise’, an exception will be raised if any tril record is encountered.
bin1_field (str) – Name of the column representing ith (row) axis of the matrix. Default is ‘bin1_id’.
bin2_field (str) – Name of the column representing jth (col) axis of the matrix. Default is ‘bin2_id’.
sided_fields (sequence of str) – Base names of column pairs to swap values between when mirror-reflecting pixels.
suffixes (pair of str) – Suffixes used to identify pairs of sided columns. e.g.: (‘1’, ‘2’), (‘_x’, ‘_y’), etc.
sort (bool) – Whether to sort the output dataframe by bin_id and bin2_id.
- Returns:
callable – Function of one argument that takes a raw dataframe and returns a sanitized dataframe.
- Return type:
Callable[[DataFrame], DataFrame]
- cooler.create.sanitize_records(bins, schema=None, **kwargs)[source]
Builds a funtion to sanitize and assign bin IDs to a data frame of paired genomic positions based on a provided genomic bin segmentation.
- Parameters:
bins (DataFrame) – Bin table to compare records against.
schema (str, optional) – Use pre-defined parameters for a particular format. Any options can be overriden via kwargs. If not provided, values for all the options below must be given.
decode_chroms (bool) – Convert string chromosome names to integer IDs based on the order given in the bin table. Set to False if the chromosomes are already given as an enumeration, starting at 0. Records with either chrom ID < 0 are dropped.
is_one_based (bool) – Whether the input anchor coordinates are one-based, rather than zero-based. They will be converted to zero-based.
tril_action ('reflect', 'drop', 'raise' or None) – How to handle lower triangle (“tril”) records. If set to ‘reflect’, tril records will be flipped or “reflected” to their mirror image: “sided” column pairs will have their values swapped. If set to ‘drop’, tril records will be discarded. This is useful if your input data is symmetric, i.e. contains mirror duplicates of every record. If set to ‘raise’, an exception will be raised if any tril record is encountered.
chrom_field (str) – Base name of the two chromosome/scaffold/contig columns.
anchor_field (str) – Base name of the positional anchor columns.
sided_fields (sequence of str) – Base names of column pairs to swap values between when mirror-reflecting records.
suffixes (pair of str) – Suffixes used to identify pairs of sided columns. e.g.: (‘1’, ‘2’), (‘_x’, ‘_y’), etc.
sort (bool) – Whether to sort the output dataframe by bin_id and bin2_id.
validate (bool) – Whether to do type- and bounds-checking on the anchor position columns. Raises BadInputError.
- Returns:
callable – Function of one argument that takes a raw dataframe and returns a sanitized dataframe with bin IDs assigned.
- Return type:
Callable[[DataFrame], DataFrame]
cooler.fileops
- cooler.fileops.is_cooler(uri)[source]
Determine if a URI string references a cooler data collection. Returns False if the file or group path doesn’t exist.
- Parameters:
uri (str)
- Return type:
bool
- cooler.fileops.is_multires_file(filepath, min_version=1)[source]
Determine if a file is a multi-res cooler file. Returns False if the file doesn’t exist.
- Parameters:
filepath (str)
min_version (int)
- Return type:
bool
- cooler.fileops.list_coolers(filepath)[source]
List group paths to all cooler data collections in a file.
- Parameters:
filepath (str)
- Returns:
list – Cooler group paths in the file.
- Return type:
list[str]
- cooler.fileops.cp(src_uri, dst_uri, overwrite=False)[source]
Copy a group or dataset from one file to another or within the same file.
- Parameters:
src_uri (str)
dst_uri (str)
overwrite (bool)
- Return type:
None
- cooler.fileops.mv(src_uri, dst_uri, overwrite=False)[source]
Rename a group or dataset within the same file.
- Parameters:
src_uri (str)
dst_uri (str)
overwrite (bool)
- Return type:
None
- cooler.fileops.ln(src_uri, dst_uri, soft=False, overwrite=False)[source]
Create a hard link to a group or dataset in the same file. Also supports soft links (in the same file) or external links (different files).
- Parameters:
src_uri (str)
dst_uri (str)
soft (bool)
overwrite (bool)
- Return type:
None
cooler.util
- cooler.util.partition(start, stop, step)[source]
Partition an integer interval into equally-sized subintervals. Like builtin
range()
, but yields pairs of end points.Examples
>>> for lo, hi in partition(0, 9, 2): print(lo, hi) 0 2 2 4 4 6 6 8 8 9
- Parameters:
start (int)
stop (int)
step (int)
- Return type:
Iterator[tuple[int, int]]
- cooler.util.fetch_chromsizes(db, **kwargs)[source]
Download chromosome sizes from UCSC as a
pandas.Series
, indexed by chromosome label.- Parameters:
db (str)
- Return type:
Series
- cooler.util.read_chromsizes(filepath_or, name_patterns=('^chr[0-9]+$', '^chr[XY]$', '^chrM$'), all_names=False, **kwargs)[source]
Parse a
<db>.chrom.sizes
or<db>.chromInfo.txt
file from the UCSC database, wheredb
is a genome assembly name.- Parameters:
filepath_or (str or file-like) – Path or url to text file, or buffer.
name_patterns (sequence, optional) – Sequence of regular expressions to capture desired sequence names. Each corresponding set of records will be sorted in natural order.
all_names (bool, optional) – Whether to return all contigs listed in the file. Default is
False
.
- Returns:
pandas.Series
– Series of integer bp lengths indexed by sequence name.- Return type:
Series
References
- cooler.util.binnify(chromsizes, binsize)[source]
Divide a genome into evenly sized bins.
- Parameters:
chromsizes (Series) – pandas Series indexed by chromosome name with chromosome lengths in bp.
binsize (int) – size of bins in bp
- Returns:
bins (
pandas.DataFrame
) – Dataframe with columns:chrom
,start
,end
.- Return type:
DataFrame
- cooler.util.digest(fasta_records, enzyme)[source]
Divide a genome into restriction fragments.
- Parameters:
fasta_records (OrderedDict) – Dictionary of chromosome names to sequence records.
enzyme (str) – Name of restriction enzyme (e.g., ‘DpnII’).
- Returns:
frags (
pandas.DataFrame
) – Dataframe with columns:chrom
,start
,end
.- Return type:
DataFrame