API Documentation

File formats

class tsinfer.SampleData(sequence_length=0, *, path=None, num_flush_threads=0, compressor=DEFAULT_COMPRESSOR, chunk_size=1024, max_file_size=None)

Class representing input sample data used for inference. See sample data file format specifications for details on the structure of this file.

The most common usage for this class will be to import data from some external source and save it to file for later use. This will usually follow a pattern like:

sample_data = tsinfer.SampleData(path="mydata.samples")
sample_data.add_site(position=1234, genotypes=[0, 0, 1, 0], alleles=["G", "C"])
sample_data.add_site(position=5678, genotypes=[1, 1, 1, 0], alleles=["A", "T"])
sample_data.finalise()

This creates a sample data file for four haploid samples and two sites, and saves it in the file “mydata.samples”. Note that the call to finalise() is essential here to ensure that all data will be correctly flushed to disk. For convenience, a context manager may also be used to ensure this is done:

with tsinfer.SampleData(path="mydata.samples") as sample_data:
    sample_data.add_site(1234, [0, 0, 1, 0], ["G", "C"])
    sample_data.add_site(5678, [1, 1, 1, 0], ["A", "T"])

More complex data models consisting of populations and polyploid individuals can also be specified. For example, we might have:

with tsinfer.SampleData(path="mydata.samples") as sample_data:
    # Define populations
    sample_data.add_population(metadata={"name": "CEU"})
    sample_data.add_population(metadata={"name": "YRI"})
    # Define individuals
    sample_data.add_individual(ploidy=2, population=0, metadata={"name": "NA12"})
    sample_data.add_individual(ploidy=2, population=0, metadata={"name": "NA13"})
    sample_data.add_individual(ploidy=2, population=0, metadata={"name": "NA14"})
    sample_data.add_individual(ploidy=2, population=1, metadata={"name": "NA15"})
    # Define sites and genotypes
    sample_data.add_site(1234, [0, 1, 1, 1, 0, 0, 0, 0], ["G", "C"])
    sample_data.add_site(5678, [0, 0, 0, 0, 0, 0, 1, 1], ["A", "T"])

In this example we defined two populations and four diploid individuals, and so our genotypes arrays are of length eight. Thus, at first site the first individual is heterozygous, the second is homozygous with the derived allele and the other two individuals are homozygous with the ancestral allele. To illustrate how we can use site and population metadata to link up with external data sources we use the 1000 genomes identifiers (although of course the genotype data is fake). Here we suppose that we have the famous NA12878 trio from the CEU population, and one other individual from the YRI population. This metadata is then embedded in the final tree sequence that we infer, allowing us to use it conveniently in downstream analyses.

Note

If a path is specified, the max_file_size option puts an upper limit on the possible size of the created file. On non-Windows systems, space for this file is not immediately allocated but just “reserved” using sparse file systems. However, on Windows systems the file is allocated immediately, so max_file_size takes a smaller default value, to avoid allocating very large files for no reason. Users who wish to run large inferences on Windows may therefore need to explictly set an appropriate max_file_size. Note that the max_file_size is only used while the file is being built: one the file has been finalised, it is shrunk to its minimum size.

Parameters:
  • sequence_length (float) – If specified, this is the sequence length that will be associated with the tree sequence output by tsinfer.infer() and tsinfer.match_samples(). If provided site coordinates must be less than this value.
  • path (str) – The path of the file to store the sample data. If None, the information is stored in memory and not persistent.
  • num_flush_threads (int) – The number of background threads to use for compressing data and flushing to disc. If <= 0, do not spawn any threads but use a synchronous algorithm instead. Default=0.
  • compressor (numcodecs.abc.Codec) – A numcodecs.abc.Codec instance to use for compressing data. Any codec may be used, but problems may occur with very large datasets on certain codecs as they cannot compress buffers >2GB. If None, do not use any compression. Default=:class:numcodecs.zstd.Zstd.
  • chunk_size (int) – The chunk size used for zarr arrays. This affects compression level and algorithm performance. Default=1024.
  • max_file_size (int) – If a file is being used to store this data, set a maximum size in bytes for the stored file. If None, the default value of 1GiB (2**30 bytes) is used on Windows and 1TiB (2**40 bytes) on other platforms (see above for details).
add_individual(ploidy=1, metadata=None, population=None, location=None, time=0, flags=None)

Adds a new Individual to this SampleData and returns its ID and those of the resulting additional samples. Adding an individual with ploidy k results in k new samples being added, and each of these samples will be associated with the new individual. Each new sample will also be associated with the specified population ID. It is an error to specify a population ID that does not correspond to a population defined using add_population().

All calls to this method must be made after populations are defined using add_population() and before sites are defined using add_site().

Parameters:
  • ploidy (int) – The ploidy of this individual. This corresponds to the number of samples added that refer to this individual. Defaults to 1 (haploid).
  • metadata (dict) – A JSON encodable dict-like object containing metadata that is to be associated with this individual.
  • population (int) – The ID of the population to associate with this individual (or more precisely, with the samples for this individual). If not specified or None, defaults to the null population (-1).
  • location (arraylike) – An array-like object defining n-dimensional spatial location of this individual. If not specified or None, the empty location is stored.
  • time (float) – The historical time into the past when the samples associated with this individual were taken. By default we assume that all samples come from the present time (i.e. the default time is 0).
  • flags (int) – The bitwise flags for this individual.
Returns:

The ID of the newly added individual and a list of the sample IDs also added.

Return type:

tuple(int, list(int))

add_population(metadata=None)

Adds a new Population to this SampleData and returns its ID.

All calls to this method must be made before individuals or sites are defined.

Parameters:metadata (dict) – A JSON encodable dict-like object containing metadata that is to be associated with this population.
Returns:The ID of the newly added population.
Return type:int
add_provenance(timestamp, record)

Adds a new provenance record with the specified timestamp and record. Timestamps should ISO8601 formatted, and record is some JSON encodable object.

add_site(position, genotypes, alleles=None, metadata=None, inference=None, time=None)

Adds a new site to this SampleData and returns its ID.

At a minimum, the new site must specify the position and genotypes. Sites must be added in increasing order of position; duplicate positions are not supported. For each site a list of alleles may be supplied. This list defines the ancestral and derived states at the site. For example, if we set alleles=["A", "T"] then the ancestral state is “A” and the derived state is “T”. The observed state for each sample is then encoded using the genotypes parameter. Thus if we have n samples then this must be a one dimensional array-like object with length n. The genotypes index into the list of alleles, so that for a given array g and sample index j, g[j] should contain 0 if sample j carries the ancestral state at this site and 1 if it carries the derived state. For multiple derived states, there may be more than 2 alleles`, and ``g[j] can be greater than 1, but such sites are not used for inference. All sites must have genotypes for the same number of samples.

All populations and individuals must be defined before this method is called. If no individuals have been defined using add_individual(), the first call to this method adds n haploid individuals, where n is the length of the genotypes array.

Parameters:
  • position (float) – The floating point position of this site. Must be less than the sequence_length if provided to the SampleData constructor. Must be greater than all previously added sites.
  • genotypes (arraylike) – An array-like object defining the sample genotypes at this site. The array of genotypes corresponds to the observed alleles for each sample, represented by indexes into the alleles array. Missing sample data can be represented by tskit.MISSING_DATA in this array. The input is converted to a numpy array with dtype np.int8; therefore, for maximum efficiency ensure that the input array is also of this type.
  • alleles (list(str)) – A list of strings defining the alleles at this site. The zero’th element of this list is the ancestral state and further elements are the derived states. Only biallelic sites can currently be used for inference. Sites with 3 or more alleles cannot have inference (below) set to True. If not specified or None, defaults to [“0”, “1”].
  • metadata (dict) – A JSON encodable dict-like object containing metadata that is to be associated with this site.
  • time (float) – The time of occurence (pastwards) of the mutation to the derived state at this site. If not specified or None, the frequency of the derived alleles (i.e., the proportion of non-zero values in the genotypes, out of all the non-missing values) will be used in inference. For biallelic sites this frequency should provide a reasonable estimate of the relative time, as used to order ancestral haplotypes during the inference process. For sites not used in inference, such as singletons or sites with more than two alleles or when the time is specified as np.nan, then the value is unused. Defaults to None.
Returns:

The ID of the newly added site.

Return type:

int

arrays()

Returns a list of all the zarr arrays in this DataContainer.

assert_data_equal(other)

The same as data_equal(), but raises an assertion rather than returning False. This is useful for testing.

clear_provenances()

Clear all provenances in this instance

close()

Close this DataContainer. Any read or write operations attempted after calling this will fail.

copy(path=None, max_file_size=None)

Returns a copy of this DataContainer opened in ‘edit’ mode. If path is specified, this must not be equal to the path of the current data container. The new container will have a different UUID to the current.

data_equal(other)

Returns True if all the data attributes of this input file and the specified input file are equal. This compares every attribute except the UUID and provenance.

To compare two SampleData instances for exact equality of all data including UUIDs and provenance data, use s1 == s2.

Parameters:other (SampleData) – The other SampleData instance to compare with.
Returns:True if the data held in this SampleData instance is identical to the date held in the other instance.
Return type:bool
file_size

Returns the size of the underlying file, or -1 if we do not have a file associated.

finalise()

Ensures that the state of the data is flushed and writes the provenance for the current operation. The specified ‘command’ is used to fill the corresponding entry in the provenance dictionary.

classmethod from_tree_sequence(ts, use_sites_time=True, use_individuals_time=True, **kwargs)

Create a SampleData instance from the sample nodes in an existing tree sequence. Each sample node in the tree sequence results in a sample created in the returned object. Populations in the tree sequence will be copied into the returned object. Individuals in the tree sequence that are associated with any sample nodes will also be incorporated: the ploidy of each individual is assumed to be the number of sample nodes which reference that individual; individuals with no sample nodes are omitted. A new haploid individual is created for any sample node which lacks an associated individual in the existing tree sequence. Thus a tree sequence with u sample nodes but no individuals will be translated into a SampleData file with u haploid individuals and u samples.

Metadata associated with individuals, populations, sites, and at the top level of the tree sequence, is also stored in the appropriate places in the returned SampleData instance. Any such metadata must either have a schema defined or be JSON encodable text. See the tskit documentation for more details on metadata schemas.

Parameters:
  • ts (TreeSequence) – The tskit.TreeSequence from which to generate samples.
  • use_sites_time (bool) – If True (default), the times of nodes in the tree sequence are used to set a time for each site (which affects the relative temporal order of ancestors during inference). Times for a site are only used if there is a single mutation at that site, in which case the node immediately below the mutation is taken as the origination time for the variant. If False, the frequency of the variant is used as a proxy for the relative variant time (see add_site()).
  • use_individuals_time (bool) – If True (default), set a time for individuals that contain historical sample nodes. If False, all individuals are set at time 0.
  • **kwargs – Further arguments passed to the SampleData constructor.
Returns:

A SampleData object.

Return type:

SampleData

haplotypes(samples=None, sites=None)

Returns an iterator over the (sample_id, haplotype) pairs.

Parameters:
  • samples (list) – The sample IDs for which haplotypes are returned. If None, return haplotypes for all sample nodes, otherwise this may be a numpy array (or array-like) object (converted to dtype=np.int32).
  • sites (array) – A numpy array of sites to use.
info

Returns a string containing the zarr info for each array.

merge(other, **kwargs)

Returns a copy of this SampleData file merged with the specified other SampleData file. Subsequent keyword arguments are passed to the SampleData constructor for the returned merged dataset.

The datasets are merged by following process:

  1. We add the populations from this dataset to the result, followed by the populations from other. Population references from the two datasets are updated accordingly.
  2. We add individual data from this dataset to the result, followed by the individuals from the other dataset.
  3. We merge the variant data from the two datasets by comparing sites by their position. If two sites in the datasets have the same position we combine the genotype data. The alleles from this dataset are updated to include any new alleles in other, and we then combine and update the genotypes accordingly. It is an error if sites with the same position have different ancestral state, time, or metadata values. For sites that exist in one dataset and not the other, we insert the site with MISSING_DATA present in the genotypes for the dataset that does not contain the site.
  4. We add the provenances for this dataset, followed by the provenances for the other dataset.
Parameters:other (SampleData) – The other SampleData instance to to merge.
Returns:A new SampleData instance which contains the merged data from the two datasets.
Return type:SampleData
min_site_times(individuals_only=False)

Returns a numpy array of the lower bound of the time of sites in the SampleData file. Each individual with a nonzero time (from the individuals_time array) gives a lower bound on the age of sites where the individual carries a derived allele.

Returns:A numpy array of the lower bound for each sites time.
Return type:numpy.ndarray(dtype=float64)
num_alleles(sites=None)

Returns a numpy array of the number of alleles at each site. Missing data is not counted as an allele.

Parameters:sites (array) – A numpy array of sites for which to return data. If None (default) return all sites.
Returns:A numpy array of the number of alleles at each site.
Return type:numpy.ndarray(dtype=uint32)
provenances()

Returns an iterator over the (timestamp, record) pairs representing the provenances for this data container.

record_provenance(command=None, **kwargs)

Records the provenance information for this file using the tskit provenances schema.

sites(ids=None)

Returns an iterator over the Site objects. A subset of the sites can be returned using the ids parameter. This must be a list of integer site IDs.

subset(individuals=None, sites=None, **kwargs)

Returns a subset of this sample data file consisting of the specified individuals and sites. It is important to note that these are individual IDs and not sample IDs (corresponding to distinct haplotypes within an individual). When working with haploid data, the individual and site IDs are guaranteed to be the same, and so can be used interchangably.

Parameters:
  • individuals (arraylike) – The individual IDs to retain in the returned subset. IDs must be unique, and refer to valid individuals in the current dataset. IDs can be supplied in any order, and the order will be preserved in the returned data file (i.e., individuals[0] will be the first individual in the new dataset, etc).
  • sites (arraylike) – The site IDs to retain in the returned subset. IDs must be unique, and refer to valid sites in the current dataset. Site IDs can be supplied in any order, but the order will not be preserved in the returned data file, so that sites are always in position sorted order in the output.
  • **kwargs – Further arguments passed to the SampleData constructor.
Returns:

A SampleData object.

Return type:

SampleData

variants(sites=None)

Returns an iterator over the Variant objects. This is equivalent to the TreeSequence.variants iterator.

Todo

  1. Add documentation for the data attributes in read-mode.
  2. Define copy mode.
  3. Provide example of updating inference_sites
class tsinfer.AncestorData(sample_data, *, path=None, num_flush_threads=0, compressor=None, chunk_size=1024, max_file_size=None)

Class representing the stored ancestor data produced by generate_ancestors(). See the ancestor data file format specifications for details on the structure of this file.

See the documentation for SampleData for a discussion of the max_file_size parameter.

Parameters:
  • sample_data (SampleData) – The SampleData instance that this ancestor data file was generated from.
  • path (str) – The path of the file to store the ancestor data. If None, the information is stored in memory and not persistent.
  • num_flush_threads (int) – The number of background threads to use for compressing data and flushing to disc. If <= 0, do not spawn any threads but use a synchronous algorithm instead. Default=0.
  • compressor (numcodecs.abc.Codec) – A numcodecs.abc.Codec instance to use for compressing data. Any codec may be used, but problems may occur with very large datasets on certain codecs as they cannot compress buffers >2GB. If None, do not use any compression. Default=:class:numcodecs.zstd.Zstd.
  • chunk_size (int) –

    The chunk size used for zarr arrays. This affects compression level and algorithm performance. Default=1024.

  • max_file_size (int) – If a file is being used to store this data, set a maximum size in bytes for the stored file. If None, the default value of 1GiB is used on Windows and 1TiB on other platforms (see above for details).
add_ancestor(start, end, time, focal_sites, haplotype)

Adds an ancestor with the specified haplotype, with ancestral material over the interval [start:end], that is associated with the specified timepoint and has new mutations at the specified list of focal sites. Ancestors should be added in time order, with the oldest first. The id of the added ancestor is returned.

add_provenance(timestamp, record)

Adds a new provenance record with the specified timestamp and record. Timestamps should ISO8601 formatted, and record is some JSON encodable object.

ancestors()

Returns an iterator over all the ancestors.

ancestors_length

Returns the length of ancestors in physical coordinates.

arrays()

Returns a list of all the zarr arrays in this DataContainer.

clear_provenances()

Clear all provenances in this instance

close()

Close this DataContainer. Any read or write operations attempted after calling this will fail.

copy(path=None, max_file_size=None)

Returns a copy of this DataContainer opened in ‘edit’ mode. If path is specified, this must not be equal to the path of the current data container. The new container will have a different UUID to the current.

data_equal(other)

Returns True if all the data attributes of this input file and the specified input file are equal. This compares every attribute except the UUID.

file_size

Returns the size of the underlying file, or -1 if we do not have a file associated.

finalise()

Ensures that the state of the data is flushed and writes the provenance for the current operation. The specified ‘command’ is used to fill the corresponding entry in the provenance dictionary.

info

Returns a string containing the zarr info for each array.

insert_proxy_samples(sample_data, *, sample_ids=None, epsilon=None, allow_mutation=False, require_same_sample_data=True, **kwargs)

Take a set of samples from a sample_data instance and create additional “proxy sample ancestors” from them, returning a new AncestorData instance including both the current ancestors and the additional ancestors at the appropriate time points.

A proxy sample ancestor is an ancestor based upon a known sample. At sites used in the full inference process, the haplotype of this ancestor is identical to that of the sample on which it is based. The time of the ancestor is taken to be a fraction epsilon older than the sample on which it is based.

A common use of this function is to provide ancestral nodes for anchoring historical samples at the correct time when matching them into a tree sequence during the tsinfer.match_samples() stage of inference. For this reason, by default, the samples chosen from sample_data are those associated with historical (i.e. non-contemporary) individuals. This can be altered by using the sample_ids parameter.

Note

The proxy sample ancestors inserted here will correspond to extra nodes in the inferred tree sequence. At sites which are not used in the full inference process (e.g. sites unique to a single historical sample), these proxy sample ancestor nodes may have a different genotype from their corresponding sample.

Parameters:
  • sample_data (SampleData) – The SampleData instance from which to select the samples used to create extra ancestors.
  • sample_ids (list(int)) – A list of sample ids in the sample_data instance that will be selected to create the extra ancestors. If None (default) select all the historical samples, i.e. those associated with an Individual whose time is greater than zero. The order of ids is ignored, as are duplicate ids.
  • epsilon (list(float)) – An list of small time increments determining how much older each proxy sample ancestor is than the corresponding sample listed in sample_ids. A single value is also allowed, in which case it is used as the time increment for all selected proxy sample ancestors. If None (default) find \({\delta}t\), the smallest time difference between between the sample times and the next oldest ancestor in the current AncestorData instance, setting epsilon = \({\delta}t / 100\) (or, if all selected samples are at least as old as the oldest ancestor, take \({\delta}t\) to be the smallest non-zero time difference between existing ancestors).
  • allow_mutation (bool) – If False (the default), any site in a proxy sample ancestor that has a derived allele must have a pre-existing mutation in an older (non-proxy) ancestor, otherwise an error is raised. Alternatively, if allow_mutation is True, proxy ancestors can contain a de-novo mutation at a site that also has a mutation elsewhere (i.e. breaking the infinite sites assumption), allowing them to possess derived alleles at sites where there are no pre-existing mutations in older ancestors.
  • require_same_sample_data (bool) – If True (default) then the the sample_data parameter must point to the same SampleData instance as that used to generate the current ancestors. If False, this requirement is not enforced, and it is the user’s responsibility to ensure that the encoding of alleles in sample_data matches the encoding in the current AncestorData instance (i.e. that in the original SampleData instance on which the current ancestors are based).
  • **kwargs – Further arguments passed to the constructor when creating the new AncestorData instance which will be returned.
Returns:

A new AncestorData object.

Return type:

AncestorData

provenances()

Returns an iterator over the (timestamp, record) pairs representing the provenances for this data container.

record_provenance(command=None, **kwargs)

Records the provenance information for this file using the tskit provenances schema.

set_inference_sites(site_ids)

Sets the sites used for inference (i.e., the sites at which ancestor haplotypes are defined) to the specified list of site IDs. This must be a subset of the sites in the sample data file, and the IDs must be in increasing order.

This must be called before the first call to add_ancestor().

truncate_ancestors(lower_time_bound, upper_time_bound, length_multiplier=2, **kwargs)

Truncates the length of ancestors above a given time and returns a new AncestorData instance.

Given a set of haplotypes H such that lower_time_bound <= h.time < upper_time_bound, we let max_len = length_multiplier * max(max(h.length) for h in H). Then, we truncate all haplotypes containing at least one focal site where h.time >= upper, ensuring these haplotypes extend no further than half of max_len to the either side of the leftmost and rightmost focal sites of the ancestral haplotype. Note that ancestors above upper_time_bound may still be longer than max_len if the ancestor contains greater than 2 focal sites.

This function should be used when tsinfer.generate_ancestors() generates old ancestors which are very long, as these can significantly slow down matching time. Older ancestors should generally be shorter than younger ancestors, so truncating the lengths of older ancestors has negligible effect on inference accuracy.

Note

Please ensure that the time values provided to lower_time_bound and upper_time_bound match the units used in the AncestorData file, i.e. if your ancestors do not have site times specified, upper_time_bound should be between 0 and 1.

Parameters:
  • lower_time_bound (float) – Defines the lower bound (inclusive) of the half open interval where we search for a truncation value.
  • upper_time_bound (float) – Defines the upper bound (exclusive) of the half open interval where we search for a truncation value. The truncation value is the length of the longest haplotype in this interval multiplied by length_multiplier. The length of ancestors as old or older than upper_time_bound will be truncated using this value.
  • length_multiplier (float) – A multiplier for the length of the longest ancestor in the half-open interval between lower_time_bound (inclusive) and uppper_time_bound (exclusive), i.e. if the longest ancestor in the interval is 1 megabase, a length_multiplier of 2 creates a maximum length of 2 megabases.
  • **kwargs – Further arguments passed to the AncestorData.copy() when creating the new AncestorData instance which will be returned.
Returns:

A new AncestorData object.

Return type:

AncestorData

Todo

  1. Add documentation for the data attributes in read-mode.
tsinfer.load(path)

Loads a tsinfer SampleData or AncestorData file from the specified path. The correct class will be determined by the content of the file. If the file is format not recognised a FileFormatError will be thrown.

Parameters:path (str) – The path of the file we wish to load.
Returns:The corresponding SampleData or AncestorData instance opened in read only mode.
Return type:AncestorData or SampleData.
Raises:FileFormatError if the file cannot be read.

Running inference

tsinfer.infer(sample_data, *, recombination_rate=None, mismatch_ratio=None, path_compression=True, simplify=True, exclude_positions=None, num_threads=0)

Runs the full inference pipeline on the specified SampleData instance and returns the inferred tskit.TreeSequence. See matching ancestors & samples in the documentation for details of recombination_rate, mismatch_ratio and path_compression.

Note

For finer grained control over inference, for example to set different mismatch ratios when matching ancestors versus samples, run tsinfer.generate_ancestors(), tsinfer.match_ancestors() and tsinfer.match_samples() separately.

Parameters:
  • sample_data (SampleData) – The input SampleData instance representing the observed data that we wish to make inferences from.
  • recombination_rate (float, msprime.RateMap) – Either a floating point value giving a constant rate \(\rho\) per unit length of genome, or an msprime.RateMap object. This is used to calculate the probability of recombination between adjacent sites. If None, all matching conflicts are resolved by recombination and all inference sites will have a single mutation (equivalent to mismatch_ratio near zero)
  • mismatch_ratio (float) – The probability of a mismatch relative to the median probability of recombination between adjacent sites: can only be used if a recombination rate has been set (default: 1)
  • path_compression (bool) – Whether to merge edges that share identical paths (essentially taking advantage of shared recombination breakpoints).
  • simplify (bool) – Whether to remove extra tree nodes and edges that are not on a path between the root and any of the samples. To do so, the final tree sequence is simplified by appling the tskit.TreeSequence.simplify() method with filter_sites, filter_populations and filter_individuals set to False, and keep_unary set to True (default = True).
  • exclude_positions (array_like) – A list of site positions to exclude for full inference. Sites with these positions will not be used to generate ancestors, and not used during the copying process. Any such sites that exist in the sample data file will be included in the trees after the main inference process using parsimony. The list does not need to be in to be in any particular order, and can include site positions that are not present in the sample data file.
  • num_threads (int) – The number of worker threads to use in parallelised sections of the algorithm. If <= 0, do not spawn any threads and use simpler sequential algorithms (default).
Returns:

The tskit.TreeSequence object inferred from the input sample data.

Return type:

tskit.TreeSequence

tsinfer.generate_ancestors(sample_data, *, path=None, exclude_positions=None, num_threads=0, **kwargs)

Runs the ancestor generation algorithm on the specified SampleData instance and returns the resulting AncestorData instance. If you wish to store generated ancestors persistently on file you must pass the path keyword argument to this function. For example,

ancestor_data = tsinfer.generate_ancestors(sample_data, path="mydata.ancestors")

Other keyword arguments are passed to the AncestorData constructor, which may be used to control the storage properties of the generated file.

Parameters:
  • sample_data (SampleData) – The SampleData instance that we are genering putative ancestors from.
  • path (str) – The path of the file to store the sample data. If None, the information is stored in memory and not persistent.
  • exclude_positions (array_like) – A list of site positions to exclude for full inference. Sites with these positions will not be used to generate ancestors, and not used during the copying process. The list does not need be in any particular order.
  • num_threads (int) – The number of worker threads to use. If < 1, use a simpler synchronous algorithm.
Returns:

The inferred ancestors stored in an AncestorData instance.

Return type:

AncestorData

tsinfer.match_ancestors(sample_data, ancestor_data, *, recombination_rate=None, mismatch_ratio=None, path_compression=True, num_threads=0)

Run the ancestor matching algorithm on the specified SampleData and AncestorData instances, returning the resulting tskit.TreeSequence representing the complete ancestry of the putative ancestors. See matching ancestors & samples in the documentation for details of recombination_rate, mismatch_ratio and path_compression.

Parameters:
  • sample_data (SampleData) – The SampleData instance representing the input data.
  • ancestor_data (AncestorData) – The AncestorData instance representing the set of ancestral haplotypes for which we are finding a history.
  • recombination_rate (float, msprime.RateMap) – Either a floating point value giving a constant rate \(\rho\) per unit length of genome, or an msprime.RateMap object. This is used to calculate the probability of recombination between adjacent sites. If None, all matching conflicts are resolved by recombination and all inference sites will have a single mutation (equivalent to mismatch_ratio near zero)
  • mismatch_ratio (float) – The probability of a mismatch relative to the median probability of recombination between adjacent sites: can only be used if a recombination rate has been set (default: 1)
  • path_compression (bool) – Whether to merge edges that share identical paths (essentially taking advantage of shared recombination breakpoints).
  • num_threads (int) – The number of match worker threads to use. If this is <= 0 then a simpler sequential algorithm is used (default).
Returns:

The ancestors tree sequence representing the inferred history of the set of ancestors.

Return type:

tskit.TreeSequence

tsinfer.match_samples(sample_data, ancestors_ts, *, recombination_rate=None, mismatch_ratio=None, path_compression=True, simplify=True, indexes=None, force_sample_times=False, num_threads=0)

Runs the sample matching algorithm on the specified SampleData instance and ancestors tree sequence, returning the final tskit.TreeSequence instance containing the full inferred history for all samples and sites. See matching ancestors & samples in the documentation for details of recombination_rate, mismatch_ratio and path_compression.

Parameters:
  • sample_data (SampleData) – The SampleData instance representing the input data.
  • ancestors_ts (tskit.TreeSequence) – The tskit.TreeSequence instance representing the inferred history among ancestral ancestral haplotypes.
  • recombination_rate (float, msprime.RateMap) – Either a floating point value giving a constant rate \(\rho\) per unit length of genome, or an msprime.RateMap object. This is used to calculate the probability of recombination between adjacent sites. If None, all matching conflicts are resolved by recombination and all inference sites will have a single mutation (equivalent to mismatch_ratio near zero)
  • mismatch_ratio (float) – The probability of a mismatch relative to the median probability of recombination between adjacent sites: can only be used if a recombination rate has been set (default: 1)
  • path_compression (bool) – Whether to merge edges that share identical paths (essentially taking advantage of shared recombination breakpoints).
  • simplify (bool) – Whether to remove extra tree nodes and edges that are not on a path between the root and any of the samples. To do so, the final tree sequence is simplified by appling the tskit.TreeSequence.simplify() method with filter_sites, filter_populations and filter_individuals set to False and keep_unary set to True (default = True).
  • indexes (array_like) – An array of indexes into the sample_data file of the samples to match, or None for all samples.
  • force_sample_times (bool) – After matching, should an attempt be made to adjust the time of “historical samples” (those associated with an individual having a non-zero time) such that the sample nodes in the tree sequence appear at the time of the individual with which they are associated.
  • num_threads (int) – The number of match worker threads to use. If this is <= 0 then a simpler sequential algorithm is used (default).
Returns:

The tree sequence representing the inferred history of the sample.

Return type:

tskit.TreeSequence

tsinfer.augment_ancestors(sample_data, ancestors_ts, indexes, *, recombination_rate=None, mismatch_ratio=None, path_compression=True, num_threads=0)

Runs the sample matching algorithm on the specified SampleData instance and ancestors tree sequence, for the specified subset of sample indexes, returning the tskit.TreeSequence instance including these samples. This tree sequence can then be used as an ancestors tree sequence for subsequent matching against all samples. See matching ancestors & samples in the documentation for details of recombination_rate, mismatch_ratio and path_compression.

Parameters:
  • sample_data (SampleData) – The SampleData instance representing the input data.
  • ancestors_ts (tskit.TreeSequence) – The tskit.TreeSequence instance representing the inferred history among ancestral ancestral haplotypes.
  • indexes (array) – The sample indexes to insert into the ancestors tree sequence.
  • recombination_rate (float, msprime.RateMap) – Either a floating point value giving a constant rate \(\rho\) per unit length of genome, or an msprime.RateMap object. This is used to calculate the probability of recombination between adjacent sites. If None, all matching conflicts are resolved by recombination and all inference sites will have a single mutation (equivalent to mismatch_ratio near zero)
  • mismatch_ratio (float) – The probability of a mismatch relative to the median probability of recombination between adjacent sites: can only be used if a recombination rate has been set (default: 1)
  • path_compression (bool) – Whether to merge edges that share identical paths (essentially taking advantage of shared recombination breakpoints).
  • num_threads (int) – The number of match worker threads to use. If this is <= 0 then a simpler sequential algorithm is used (default).
Returns:

The specified ancestors tree sequence augmented with copying paths for the specified sample.

Return type:

tskit.TreeSequence

Exceptions

exception tsinfer.FileFormatError

Exception raised when a malformed file is encountered.