Calculate Statistics
sai
provides the score
command to calculate the number of uniquely shared sites (U statistic) and the quantile statistic (Q statistic).
To see available options, we can use the following command:
sai score -h
This will show information for each argument:
Argument | Description |
---|---|
-h, --help | show this help message and exit |
--vcf | Path to the VCF file containing variant data |
--chr-name | Chromosome name to analyze from the VCF file. |
--ref | Path to the file with reference population identifiers. |
--tgt | Path to the file with target population identifiers. |
--src | Path to the file with source population identifiers. |
--win-len | Length of each genomic window in base pairs. Default: 50,000. |
--win-step | Step size in base pairs between consecutive windows. Default: 10,000. |
--anc-alleles | Path to the BED file with ancestral allele information. If ancestral allele information is not provided, filtering will be performed for each variant based on whether the allele frequency of any allele (assuming biallelic) meets the specified condition during the calculation of the statistics. Default: None. |
--w | Frequency threshold for variants in the reference population; only variants with frequencies below this threshold are included in the analysis. Default: 0.01. |
--x | Frequency threshold for variants in the target population; only variants with frequencies exceeding this threshold are included in the analysis. This argument is omitted when estimating the Q statistic. Default: 0.9. |
--y | List of allele frequency conditions for the source populations. Each value must be in the form =X, >X, --src ; the order of the allele frequency conditions should also correspond to the order of source populations in that file. Default: =1. |
--output | Output file path for saving results. |
--stat | Type of statistic to compute: U or QXX, where 'XX' represents the percentile (e.g., Q90 , Q95 ). |
Input files
sai
requires a biallelic VCF file containing all populations, including the reference population (assumed to have no introgression), the target population (assumed to have received introgressed material), and the source population (assumed to have provided the introgressed material). It also assumes that all variants observed in these populations are already included in the input VCF file. The program does not fill in missing genotypes (e.g., treating unobserved genotypes as homozygous reference, as done in MaLAdapt). An example VCF file can be found here.
Three tab-delimited files are required to specify the population information for individuals in the reference, target, and source populations. An example file is shown below and can also be found here.
NEA AltaiNeandertal
DEN Denisova
This file specifies the individuals in the source populations. In this example, there are two source populations, NEA and DEN (first column), and the second column lists the corresponding sample names matching those in the VCF header.
Output files
An example output of the U statistic calculation is shown below.
Chrom | Start | End | Ref | Tgt | Src | N(Variants) | U | Candidate |
---|---|---|---|---|---|---|---|---|
9 | 16400001 | 16440000 | AFR+EAS | EUR | NEA,DEN | 1082 | 0 | NA |
9 | 16440001 | 16480000 | AFR+EAS | EUR | NEA,DEN | 983 | 0 | NA |
9 | 16480001 | 16520000 | AFR+EAS | EUR | NEA,DEN | 985 | 0 | NA |
9 | 16520001 | 16560000 | AFR+EAS | EUR | NEA,DEN | 966 | 0 | NA |
9 | 16560001 | 16600000 | AFR+EAS | EUR | NEA,DEN | 1029 | 0 | NA |
9 | 16600001 | 16640000 | AFR+EAS | EUR | NEA,DEN | 1063 | 0 | NA |
9 | 16640001 | 16680000 | AFR+EAS | EUR | NEA,DEN | 1152 | 0 | NA |
9 | 16680001 | 16720000 | AFR+EAS | EUR | NEA,DEN | 999 | 1 | 9:16715657 |
9 | 16720001 | 16760000 | AFR+EAS | EUR | NEA,DEN | 1007 | 2 | 9:16758710,9:16759357 |
9 | 16760001 | 16800000 | AFR+EAS | EUR | NEA,DEN | 1060 | 1 | 9:16770229 |
9 | 16800001 | 16840000 | AFR+EAS | EUR | NEA,DEN | 960 | 1 | 9:16800789 |
9 | 16840001 | 16880000 | AFR+EAS | EUR | NEA,DEN | 923 | 0 | NA |
9 | 16880001 | 16920000 | AFR+EAS | EUR | NEA,DEN | 512 | 0 | NA |
Each row in the output file corresponds to a genomic window, summarizing the result for that region. The columns are as follows:
Column | Description |
---|---|
Chrom | Chromosome number where the window is located. |
Start | Start position (1-based, inclusive) of the window. |
End | End position (inclusive) of the window. |
Ref | Reference population. |
Tgt | Target population. |
Src | Source population. |
N(Variants) | Number of variants found in this genomic window. Sites with missing genotypes in any population are excluded. If ancestral allele information is provided, variants whose ancestral allele differ from both the reference and alternative allele in the VCF file are also excluded. |
U or QXX | Number of uniquely shared sites or the XXth percentile in this window. |
Candidate | Variant(s) (chrom:pos ) that meet the specified condition. For the U statistic, this refers to variant(s) whose allele frequency in the reference population is below a specified threshold, allele frequency in the target population exceeds a specified threshold, and allele frequency in the source population meets a defined condition (e.g., fixed). For the Q statistic, this refers to variant(s) that satisfy the following three conditions: (1) the allele frequency in the reference population is below a specified threshold; (2) the allele frequency in the source population meets a defined condition (e.g., fixed); and (3) among all variants meeting (1) and (2), the allele frequency in the target population is greater than or equal to a specified percentile (e.g., the 95th percentile) within this subset of variants in the window. |
Examples
The following example estimates the U statistic using biallelic single nucleotide polymorphisms (SNPs) in the region chr9:16400000-16900000
from the 1000 Genomes Project. The Neanderthal genome was obtained from here, and the Denisovan genome from here. The reference population includes all African populations (ESN, GWD, MSL, YRI), excluding the admixed African populations (ACB and ASW), and all East Asian populations (CHB, CHS, DAI, KHV, JPT). The target population consists of all European populations (CEU, FIN, GBR, IBS, TSI). The source populations comprise one Neanderthal individual (AltaiNeandertal) and one Denisovan individual (Denisova).
To detect adaptively introgressed variants from both Neanderthals and Denisovans, we can use the following command:
sai score --vcf examples/data/1KG.nea_den.chr9.example.vcf.gz \
--ref examples/data/1KG.ref.samples.txt \
--tgt examples/data/1KG.tgt.samples.txt \
--src examples/data/1KG.nea_den.samples.txt \
--anc-alleles examples/data/hg19.chr9.example.anc.alleles.bed \
--chr-name 9 --stat U --w 0.01 --x 0.5 --y =1 =1 \
--win-len 40000 --win-step 40000 \
--output examples/results/1KG.nea_den.chr9.example.both.U.scores.tsv
We provide a BED file to specify the ancestral allele for each variant, based on alignments from Ensembl.
In this example, the --w
, --x
, and --y
arguments define the filtering conditions: the allele frequency of a variant must be less than 0.01 in the reference population, greater than 0.5 in the target population, and fixed (=1
) in both Neanderthal and Denisovan genomes. The output is shown above and can be found here.
To estimate the Q95 statistic for the same data, simply change --stat U
to --stat Q95
:
sai score --vcf examples/data/1KG.nea_den.chr9.example.vcf.gz \
--ref examples/data/1KG.ref.samples.txt \
--tgt examples/data/1KG.tgt.samples.txt \
--src examples/data/1KG.nea_den.samples.txt \
--anc-alleles examples/data/hg19.chr9.example.anc.alleles.bed \
--chr-name 9 --stat Q95 --w 0.01 --y =1 =1 \
--win-len 40000 --win-step 40000 \
--output examples/results/1KG.nea_den.chr9.example.both.Q95.scores.tsv
The --x
argument is not needed when calculating the Q statistic. If provided, it will be ignored. The output can be found here.
To detect adaptively introgressed variants from Neanderthal introgression only, we can use:
sai score --vcf examples/data/1KG.nea_den.chr9.example.vcf.gz \
--ref examples/data/1KG.ref.samples.txt \
--tgt examples/data/1KG.tgt.samples.txt \
--src examples/data/1KG.nea_den.samples.txt \
--anc-alleles examples/data/hg19.chr9.example.anc.alleles.bed \
--chr-name 9 --stat U --w 0.01 --x 0.5 --y =1 =0 \
--win-len 40000 --win-step 40000 \
--output examples/results/1KG.nea_den.chr9.example.nea.specific.U.scores.tsv
Here, we set --y =1 =0
to require that the variant be fixed in the Neanderthal individual and absent in the Denisovan. This ensures that only Neanderthal-specific variants are considered as potential candidates for introgression. The output can be found here.
To detect adaptively introgressed variants from Denisovan introgression only, we can use:
sai score --vcf examples/data/1KG.nea_den.chr9.example.vcf.gz \
--ref examples/data/1KG.ref.samples.txt \
--tgt examples/data/1KG.tgt.samples.txt \
--src examples/data/1KG.nea_den.samples.txt \
--anc-alleles examples/data/hg19.chr9.example.anc.alleles.bed \
--chr-name 9 --stat U --w 0.01 --x 0.5 --y =0 =1 \
--win-len 40000 --win-step 40000 \
--output examples/results/1KG.nea_den.chr9.example.den.specific.U.scores.tsv
In this case, --y =0 =1
requires that the variant be fixed in the Denisovan individual and absent in the Neanderthal, thereby identifying variants potentially introgressed from Denisovans only. The output can be found here.
Any number of source populations is supported, including one, two, or more. However, note that the order of values provided to --y
must match the order of source populations as specified in the file given by --src
. For example, in examples/data/1KG.nea_den.samples.txt
, the first source population is Neanderthal and the second is Denisovan. Therefore, --y =1 =0
specifies that the variant must be fixed in Neanderthals and absent in Denisovans, while --y =0 =1
specifies the reverse.
Providing --y
values in the wrong order (e.g., reversing the values without changing the order of the populations in --src
) will result in incorrect interpretation of source-specific introgression signals.