Skip to content

Calculate Statistics

sai provides the score command to calculate different statistics.

To see available options, we can use the following command:

sai score -h

This will show information for each argument:

Argument Description
-h, --help show this help message and exit
--vcf Path to the VCF file containing variant data
--chr-name Chromosome name to analyze from the VCF file.
--win-len Length of each genomic window in base pairs. Default: 50,000.
--win-step Step size in base pairs between consecutive windows. Default: 10,000.
--anc-alleles Path to the BED file with ancestral allele information. If ancestral allele information is not provided, filtering will be performed for each variant based on whether the allele frequency of any allele (assuming biallelic) meets the specified condition during the calculation of the statistics. Default: None.
--output Output file path for saving results.
--config Path to the YAML configuration file specifying the statistics to compute, ploidy settings, and population group file paths.

Input files

sai requires a biallelic VCF file containing all populations, including the reference population (assumed to have no introgression), the target population (assumed to have received introgressed material), and the source population (assumed to have provided the introgressed material). It also assumes that all variants observed in these populations are already included in the input VCF file. The program does not fill in missing genotypes (e.g., treating unobserved genotypes as homozygous reference, as done in MaLAdapt). An example VCF file can be found here.

Three tab-delimited files are required to specify the population information for individuals in the reference, target, and source populations. An example file is shown below and can also be found here.

NEA AltaiNeandertal
DEN Denisova

This file specifies the individuals in the source populations. In this example, there are two source populations, NEA and DEN (first column), and the second column lists the corresponding sample names matching those in the VCF header.

A configuration file in YAML format is required to specify the statistics to compute, ploidy settings, and population group file paths. An example can be found here. In the example below, the statistics section lists which statistics will be computed. Empty braces ({}) indicate that no additional parameters are required for that statistic, while entries such as \(U\) and \(Q\) specify reference (ref), target (tgt), and source (src) populations with filtering conditions. The ploidies section defines the ploidy level for each population, typically 2 for diploids. The populations section provides file paths to sample lists that assign individuals to each population.

statistics:
  Danc: {}
  DD: {}
  Dplus: {}
  df: {}
  fd: {}
  U:
    ref: 
      AFR+EAS: 0.01
    tgt: 
      EUR: 0.5
    src:
      NEA: "=1"
      DEN: "=1"
  Q:
    ref: 
      AFR+EAS: 0.01
    tgt: 
      EUR: 0.95
    src:
      NEA: "=1"
      DEN: "=1"

ploidies:
  ref:
    AFR+EAS: 2
  tgt:
    EUR: 2
  src:
    NEA: 2
    DEN: 2

populations:
  ref: "examples/data/1KG.ref.samples.txt"
  tgt: "examples/data/1KG.tgt.samples.txt"
  src: "examples/data/1KG.nea_den.samples.txt"

Output files

An example output is shown below.

Chrom Start End Ref Tgt Src N(Variants) Danc.NEA Danc.DEN DD.NEA DD.DEN Dplus.NEA Dplus.DEN df.NEA df.DEN fd.NEA fd.DEN U Q
9 16400001 16440000 AFR+EAS EUR NEA,DEN 1082 -0.12355029565477203 -0.09559566661094993 -7.478853072043918 -4.739448231247437 -0.14512264357342178 -0.09212831462235394 -0.021122428026740657 -0.005965214141977462 -0.03412663364767112 -0.011254553402083362 0 nan
9 16440001 16480000 AFR+EAS EUR NEA,DEN 983 0.23853287087522393 0.09720568452718485 15.824720328820732 2.6519198302249976 0.24254139814192216 0.04198624490861517 0.06679895300702358 -0.027612395797831944 0.09662599519731511 -0.06072062935032934 0 nan
9 16480001 16520000 AFR+EAS EUR NEA,DEN 985 0.11191758509638665 0.1371820707366662 4.845944176212569 6.488856543280008 0.08062408365129729 0.10781131967371557 -0.0059518440634471394 0.006339448215182395 -0.009777268672625093 0.011579864693126562 0 nan
9 16520001 16560000 AFR+EAS EUR NEA,DEN 966 0.008890710141582193 0.0017874385890340458 -3.2895208116381127 -3.762141437091735 -0.07409036881532824 -0.08473527378923923 -0.08168801665077327 -0.0882036818431493 -0.10989219575082872 -0.11581875355584634 0 nan
9 16560001 16600000 AFR+EAS EUR NEA,DEN 1029 0.0024719610094924406 0.013019347645432937 -3.0949185837356765 -2.3295899207926993 -0.07022371194458008 -0.05285840228768851 -0.024174513563260166 -0.021736551375934953 -0.05323265434587958 -0.051743268070762606 0 0.0008946322067594431
9 16600001 16640000 AFR+EAS EUR NEA,DEN 1063 0.022253891584592442 -0.05846778040121223 3.258157404777677 -3.199939253368683 0.07998846049314343 -0.06970294049702096 0.025940599557366877 -0.0056209360304606635 0.10418231535859726 -0.01922836158621726 0 0.0
9 16640001 16680000 AFR+EAS EUR NEA,DEN 1153 0.10931478521694758 -0.02727035560668536 11.405249061188428 -2.9954578087033354 0.17733519261177239 -0.04657505369825686 0.07092801200718006 -0.01271958977736748 0.14831905506347323 -0.0603694010038118 0 nan
9 16680001 16720000 AFR+EAS EUR NEA,DEN 999 0.15113133237922183 0.14710421254212552 18.309291078923295 17.05495597841523 0.25761194634556367 0.24391495232933705 0.1898707100981595 0.17809065689988668 0.22245899486649723 0.3768632297159813 1 0.6282306163021869
9 16720001 16760000 AFR+EAS EUR NEA,DEN 1007 0.4129294073240463 0.20572780178794584 53.56721772539367 29.49165522736596 0.49454474017007327 0.2692256133787326 0.45835432326244324 0.2539470613190382 0.5958374858338679 0.354370074530527 2 0.6828528827037773
9 16760001 16800000 AFR+EAS EUR NEA,DEN 1060 0.40683688578049354 0.1438067334121798 40.19722340559816 16.581497128341063 0.5263549295804029 0.2153953867869153 0.5848435316604244 0.21982990433919478 0.5444309862037052 0.23506396309346742 1 0.6988071570576541
9 16800001 16840000 AFR+EAS EUR NEA,DEN 960 0.40565740562034003 0.22930692442169218 24.16718735207801 13.914000126226764 0.5158532156040262 0.29611831851629133 0.3520651233378468 0.1984114833463481 0.6411180111472388 0.5318202119747509 1 0.7176938369781312
9 16840001 16880000 AFR+EAS EUR NEA,DEN 923 0.21619895332973812 0.19275643241641574 12.284513159140396 10.489264807977534 0.22271722145226366 0.19016951529508055 0.09059952770351988 0.058657408877032656 0.08230019337435454 0.06443318954095166 0 0.15382703777335985
9 16880001 16920000 AFR+EAS EUR NEA,DEN 512 0.3881193782134191 0.3589872405156586 20.739761825870175 19.506952333617342 0.48085645624316853 0.4522734663012016 0.5199698990819357 0.5044977777777778 0.47736503902045885 0.55657933398083 0 nan

Each row in the output file corresponds to a genomic window, summarizing the result for that region. The columns are as follows:

Column Description
Chrom Chromosome number where the window is located.
Start Start position (1-based, inclusive) of the window.
End End position (inclusive) of the window.
Ref Reference population.
Tgt Target population.
Src Source population.
N(Variants) Number of variants found in this genomic window. Sites with missing genotypes in any population are excluded. If ancestral allele information is provided, variants whose ancestral allele differ from both the reference and alternative allele in the VCF file are also excluded.
Danc.src1 Value of the \(D_{anc}\) statistic using src1 as the source population
Danc.src2 Value of the \(D_{anc}\) statistic using src2 as the source population
DD.src1 Value of the \(D_D\) statistic using src1 as the source population
DD.src2 Value of the \(D_D\) statistic using src2 as the source population
Dplus.src1 Value of the \(D^+\) statistic using src1 as the source population
Dplus.src2 Value of the \(D^+\) statistic using src2 as the source population
df.src1 Value of the \(d_f\) statistic using src1 as the source population
df.src2 Value of the \(d_f\) statistic using src2 as the source population
fd.src1 Value of the \(f_d\) statistic using src1 as the source population
fd.src2 Value of the \(f_d\) statistic using src2 as the source population
U Value of the \(U\) statistic using both src1 and src2 as the source populations
Q Value of the \(Q\) statistic using both src1 and src2 as the source populations

Examples

The following example estimates all the statistics defined by the example configuration above with biallelic single nucleotide polymorphisms (SNPs) in the region chr9:16400001-16900000 from the 1000 Genomes Project. The Neanderthal genome was obtained from here, and the Denisovan genome from here. The reference population includes all African populations (ESN, GWD, LWK, MSL, YRI), excluding the admixed African populations (ACB and ASW), and all East Asian populations (CDX, CHB, CHS, JPT, KHV). The target population consists of all European populations (CEU, FIN, GBR, IBS, TSI). The source populations comprise one Neanderthal individual (AltaiNeandertal) and one Denisovan individual (Denisova).

To detect adaptively introgressed variants from both Neanderthals and Denisovans, we can use the following command:

sai score --vcf examples/data/1KG.nea_den.chr9.example.vcf.gz \
          --anc-alleles examples/data/hg19.chr9.example.anc.alleles.bed \
          --chr-name 9 \
          --win-len 40000 --win-step 40000 \
          --config examples/data/1KG.nea_den.chr9.example.both.config.yaml \
          --output examples/results/both/1KG.nea_den.chr9.example.both.stats.scores.tsv

We provide a BED file to specify the ancestral allele for each variant, based on alignments from Ensembl. In this example, for the \(U\) statistic, the ref entry in the configuration file specifies that the allele frequency must be less than 0.01 in the AFR+EAS population, the src entries require the allele to be fixed (=1) in both Neanderthal and Denisovan genomes, and the tgt entry indicates that the allele frequency must be greater than 0.5 in the EUR population. Similarly, for the \(Q\) statistic, the tgt entry specifies the quantile of the derived allele frequencies in the EUR population at 0.95. The output is shown above and can be found here.

The src entries can be set as Nea: "=1", Den: "=0" or the reverse to identify lineage-specific variants:

statistics:
  U:
    ref: 
      AFR+EAS: 0.01
    tgt: 
      EUR: 0.5
    src:
      NEA: "=1"
      DEN: "=0"
  Q:
    ref: 
      AFR+EAS: 0.01
    tgt: 
      EUR: 0.95
    src:
      NEA: "=1"
      DEN: "=0"

An example configuration file for Neanderthal-specific introgressed variants can be found here, with its corresponding output available here. Similarly, an example configuration file for Denisovan-specific introgressed variants is provided here, with the output available here. Note that only the \(U\) and \(Q\) statistics support multi-source introgression filtering.

In addition, log files corresponding to the \(U\) or \(Q\) statistics are generated to record the positions of variants used in their calculation. An example is shown below:

Chrom Start End U_SNP
9 16400001 16440000 NA
9 16440001 16480000 NA
9 16480001 16520000 NA
9 16520001 16560000 NA
9 16560001 16600000 NA
9 16600001 16640000 NA
9 16640001 16680000 NA
9 16680001 16720000 9:16715657
9 16720001 16760000 9:16758710,9:16759357
9 16760001 16800000 9:16770229
9 16800001 16840000 9:16800789
9 16840001 16880000 NA
9 16880001 16920000 NA