Calculate source match rates

Input

To calculate source match rates, users should provide a VCF file containing genotypes from the reference, target, and source populations (e.g. test.match.rate.data.vcf). Users also need to provide three files listing the names of individuals from the reference, target, and source populations (e.g. ref.ind.list, tgt.ind.list, and nean.ind.list). A file containing S* scores produced by sstar score (e.g. test.match.rate.score.exp.results) is also required.

Users can calculate source match rates with the following command:

sstar matchrate --vcf test.match.rate.data.vcf --ref ref.ind.list --tgt tgt.ind.list --src nean.ind.list --score test.match.rate.score.exp.results --output test.match.rate.results

The expected output of the command above can be found in test.match.rate.exp.results.

Output

An example of the output is shown below:

chrom start end sample hap_index match_rate src_sample
21 9400000 9450000 NA06986 NA 0.0454545 Nean

The first five columns have the same meanings as in the output of sstar score. The remaining columns are:

  • The match_rate column is the source match rate for the current region.
  • The src_sample column is the name of the individual from the source population used for calculating the source match rate.
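
The column layout above can be read programmatically. The following is a minimal Python sketch, not part of sstar itself: the names MatchRateRow and parse_match_rate_lines are hypothetical, and the sketch assumes the output has a header line, uses whitespace-separated columns, and always contains a numeric match_rate. An hap_index of NA is mapped to None.

```python
from typing import Iterable, List, NamedTuple, Optional


class MatchRateRow(NamedTuple):
    """One row of sstar matchrate output (hypothetical helper type)."""
    chrom: str
    start: int
    end: int
    sample: str
    hap_index: Optional[int]  # None when the file contains NA
    match_rate: float
    src_sample: str


def parse_match_rate_lines(lines: Iterable[str]) -> List[MatchRateRow]:
    """Parse header plus data lines; assumes whitespace-separated columns."""
    rows: List[MatchRateRow] = []
    header = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if header is None:
            header = fields  # first non-blank line is the header
            continue
        rec = dict(zip(header, fields))
        hap = None if rec["hap_index"] == "NA" else int(rec["hap_index"])
        rows.append(
            MatchRateRow(
                chrom=rec["chrom"],
                start=int(rec["start"]),
                end=int(rec["end"]),
                sample=rec["sample"],
                hap_index=hap,
                match_rate=float(rec["match_rate"]),
                src_sample=rec["src_sample"],
            )
        )
    return rows


# The example output shown above, as a list of lines.
example = [
    "chrom start end sample hap_index match_rate src_sample",
    "21 9400000 9450000 NA06986 NA 0.0454545 Nean",
]
rows = parse_match_rate_lines(example)
```

To read a real result file, the same function can be applied to an open file handle, e.g. parse_match_rate_lines(open("test.match.rate.results")).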

Settings

By default, sstar assumes the reference allele is the ancestral allele and the alternative allele is the derived allele. Users can use the argument --anc-allele with a BED format file (e.g. test.anc.allele.bed) to define the ancestral allele for each variant. If --anc-allele is used, variants without ancestral allele information will be removed.

Users can provide a BED file (e.g. test.mapped.region.bed) defining non-overlapping mapped regions with --mapped-region. If --mapped-region is not provided, the full window is treated as mapped.

By default, sstar matchrate reports one row per target sample and preserves the hap_index value from the score file, which is usually NA for score files generated without --phased. In this default mode, the current implementation calculates the two haplotype-level match rates over the candidate window and reports their average as one individual-level match rate. Therefore, for a diploid individual, a source-matching segment carried by only one haplotype can produce an individual-level match rate of approximately 0.5 when the other haplotype does not match.

Users can use --phased to output source match rates for phased haplotypes. When --phased is used, the command outputs one row per target haplotype and sets hap_index to 1 or 2 for diploid data.
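
The averaging in the default (unphased) mode can be illustrated with a small Python sketch. This is only an illustration of the arithmetic described above, not sstar's code: the function name individual_match_rate is hypothetical, and the haplotype-level match rates are taken as given inputs rather than computed from genotypes.

```python
from typing import Sequence


def individual_match_rate(hap_rates: Sequence[float]) -> float:
    """Average haplotype-level match rates into one individual-level rate."""
    if not hap_rates:
        raise ValueError("at least one haplotype-level match rate is required")
    return sum(hap_rates) / len(hap_rates)


# Diploid individual: one haplotype fully matches the source (1.0),
# the other does not match at all (0.0).
one_matching_haplotype = individual_match_rate([1.0, 0.0])

# Both haplotypes fully match the source.
both_matching_haplotypes = individual_match_rate([1.0, 1.0])

print(one_matching_haplotype)    # 0.5
print(both_matching_haplotypes)  # 1.0
```

This is why an individual-level match rate near 0.5 in the default mode can still correspond to a segment that matches the source well on a single haplotype; running with --phased reports the two haplotypes separately.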

Users can use --thread to specify the number of CPUs used for multiprocessing.