Calculate S* scores

Input

To calculate S* scores, users should provide a VCF file containing genotypes from the reference and target populations (e.g. test.score.data.vcf). The genotypes in the VCF file should be segregating in the combined reference and target populations used for scoring, and should not contain non-variable sites across these individuals. If additional individuals are included in the VCF file, such as potential source individuals used in later analyses, sites that are variable only in those additional individuals should not be included. Users also need to provide two files containing names of individuals from the reference and target populations (e.g. test.ref.ind.list and test.tgt.ind.list) for analysis.
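The segregating-site requirement above can be checked before scoring. The following is a minimal sketch, not part of sstar: it assumes biallelic genotypes that have already been decoded into per-sample allele tuples (0 for the reference allele, 1 for the alternative allele, None for missing). The function name `is_segregating` and the sample names are hypothetical.

```python
def is_segregating(genotypes, samples):
    """Return True if both alleles are observed among `samples`.

    genotypes: dict mapping sample name -> tuple of allele codes
               (0 = reference, 1 = alternative, None = missing).
    samples:   names of the reference and target individuals combined.
    """
    observed = set()
    for name in samples:
        for allele in genotypes[name]:
            if allele is not None:
                observed.add(allele)
    # A site segregates only if both alleles are present.
    return observed >= {0, 1}


# Hypothetical site that is variable only in an extra individual ("src1").
site = {"ref1": (0, 0), "tgt1": (0, 0), "src1": (0, 1)}
print(is_segregating(site, ["ref1", "tgt1"]))          # False
print(is_segregating(site, ["ref1", "tgt1", "src1"]))  # True
```

In this sketch the site would be excluded, because only the reference and target individuals are passed to the check.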

Users can calculate S* scores with the following command:

sstar score --vcf test.score.data.vcf --ref test.ref.ind.list --tgt test.tgt.ind.list --output test.score.results

The expected output of this command can be found in test.score.exp.results.

Output

An example of the output is shown below:

chrom	start	end	sample	hap_index	S*_score	region_ind_SNP_number	S*_SNP_number	S*_SNPs
21	0	50000	ind1	NA	51470	11	6	2309,25354,26654,29724,40809,45079

The columns have the following meanings:

The chrom column is the name of the chromosome.
The start column is the start position of the current window for calculating S* scores.
The end column is the end position of the current window for calculating S* scores.
The sample column is the name of the target individual.
The hap_index column is the haplotype index. It is NA in the default dosage-based mode. With --phased, it identifies the target haplotype.
The S*_score column is the estimated S* score.
The region_ind_SNP_number column is the number of shared derived variants in the current window between all the individuals from the reference population and the current target individual or haplotype.
The S*_SNP_number column is the number of S* SNPs found in the current target individual or haplotype.
The S*_SNPs column lists the positions of the S* SNPs found in the current target individual or haplotype.
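For downstream analysis, the output can be read as a tab-separated table. The following is a minimal sketch using only the Python standard library. The function names `read_scores` and `to_int` are hypothetical, and treating `NA` as a missing value in every numeric column is an assumption.

```python
import csv
import io

# The example row from the output above, tab-separated as in the file.
example = (
    "chrom\tstart\tend\tsample\thap_index\tS*_score\t"
    "region_ind_SNP_number\tS*_SNP_number\tS*_SNPs\n"
    "21\t0\t50000\tind1\tNA\t51470\t11\t6\t"
    "2309,25354,26654,29724,40809,45079\n"
)


def to_int(text):
    """Convert a field to int, mapping 'NA' to None (assumed convention)."""
    return None if text == "NA" else int(text)


def read_scores(handle):
    """Yield one dict per output row with numeric fields converted."""
    for row in csv.DictReader(handle, delimiter="\t"):
        for key in ("start", "end", "S*_score",
                    "region_ind_SNP_number", "S*_SNP_number"):
            row[key] = to_int(row[key])
        snps = row["S*_SNPs"]
        # Split the comma-separated positions into a list of integers.
        row["S*_SNPs"] = [] if snps == "NA" else [int(p) for p in snps.split(",")]
        yield row


rows = list(read_scores(io.StringIO(example)))
print(rows[0]["sample"], rows[0]["S*_score"], rows[0]["S*_SNPs"])
```

To read a real result file, pass an open file handle (for example `open("test.score.results")`) instead of the in-memory string.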

Settings

By default, sstar assumes the reference allele is the ancestral allele and the alternative allele is the derived allele. Users can use the argument --anc-allele with a BED format file (e.g. test.anc.allele.bed) to define the ancestral allele for each variant. If --anc-allele is used, variants without ancestral allele information will be removed.
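The effect of supplying ancestral alleles can be illustrated with a small sketch. This is not sstar's implementation: the function name `polarize` is hypothetical, genotypes are assumed to be diploid dosages of the alternative allele, and dropping a variant whose ancestral allele matches neither allele is an assumption made here for simplicity.

```python
def polarize(ref, alt, alt_dosage, ancestral=None, use_anc_file=True):
    """Return the derived-allele dosage (0, 1 or 2), or None to drop the variant.

    ref, alt:      alleles from the VCF record.
    alt_dosage:    number of alternative alleles carried (0, 1 or 2).
    ancestral:     ancestral allele for this variant, or None if unknown.
    use_anc_file:  False mimics the default (reference allele = ancestral).
    """
    if not use_anc_file:
        return alt_dosage          # default: alternative allele is derived
    if ancestral is None:
        return None                # no ancestral information: variant removed
    if ancestral == ref:
        return alt_dosage          # alternative allele is derived
    if ancestral == alt:
        return 2 - alt_dosage      # reference allele is derived: flip dosage
    return None                    # assumption: neither allele matches, drop


print(polarize("A", "G", 2, ancestral="A"))   # 2
print(polarize("A", "G", 2, ancestral="G"))   # 0
print(polarize("A", "G", 1, ancestral=None))  # None
```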

sstar score uses a window size of 50,000 bp and a step size of 10,000 bp by default. Users can change these settings with --win-len and --win-step.
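To see how the window length and step size interact, the following sketch enumerates overlapping windows with the default values. It is an illustration only: the function name `sliding_windows` is hypothetical, and the choice of starting at position 0 and stopping once the start passes the last position is an assumption, not a description of how sstar treats chromosome ends.

```python
def sliding_windows(first_start, last_pos, win_len=50000, win_step=10000):
    """Yield (start, end) pairs for windows whose start lies before last_pos."""
    start = first_start
    while start < last_pos:
        yield start, start + win_len
        start += win_step


windows = list(sliding_windows(0, 100000))
print(len(windows))   # 10
print(windows[:3])    # [(0, 50000), (10000, 60000), (20000, 70000)]
```

With a 50,000 bp window and a 10,000 bp step, each position is covered by up to five windows, so the same variant can contribute to several output rows.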

The main command-line options are:

Option	Default	Description
`--vcf`	required	Path to the VCF file containing genotype data.
`--ref`	required	Path to the file containing reference individual IDs.
`--tgt`	required	Path to the file containing target individual IDs.
`--output`	required	Path to the output score file.
`--anc-allele`	`None`	Path to a BED format file containing ancestral allele information. Variants without ancestral allele information are removed.
`--win-len`	`50000`	Length of sliding windows in base pairs.
`--win-step`	`10000`	Step size for sliding windows in base pairs.
`--match-bonus`	`5000`	Bonus for matching genotypes between two variants.
`--max-mismatch`	`5`	Maximum genotype distance allowed before a pair is discarded.
`--mismatch-penalty`	`-10000`	Penalty for mismatching genotypes between two variants.
`--phased`	disabled	Calculate scores on phased haplotypes instead of genotype dosages.
`--thread`	`1`	Number of CPUs used for multiprocessing.
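One way the three scoring options (--match-bonus, --max-mismatch and --mismatch-penalty) could fit together is sketched below, following the pairwise scoring scheme commonly described for S*. This is an assumption about the scoring rule, not sstar's actual code: the function names `pair_score` and `chain_score` are hypothetical, further rules (such as special handling of very close variants) are omitted, and the sketch only scores a given chain of variants rather than searching for the best-scoring subset. The positions are taken from the example output row, while the genotype distances are invented for illustration.

```python
import math


def pair_score(bp_distance, genotype_distance,
               match_bonus=5000, max_mismatch=5, mismatch_penalty=-10000):
    """Score one pair of variants (assumed S*-style rule)."""
    if genotype_distance > max_mismatch:
        return -math.inf                  # pair is discarded
    if genotype_distance == 0:
        return match_bonus + bp_distance  # matching genotypes
    return mismatch_penalty               # tolerated mismatch


def chain_score(positions, genotype_distances, **options):
    """Sum pair scores over consecutive variants in an ordered chain."""
    return sum(
        pair_score(b - a, d, **options)
        for a, b, d in zip(positions, positions[1:], genotype_distances)
    )


positions = [2309, 25354, 26654, 29724, 40809, 45079]
distances = [0, 1, 0, 0, 0]  # invented genotype distances, one per adjacent pair
print(chain_score(positions, distances))  # 51470
```

With these invented distances the sum happens to equal the score in the example row, but the real genotype distances behind that row are not given here.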