Calculate S* scores
Input
To calculate S* scores, users should provide a VCF file containing genotypes from the reference and target populations (e.g. test.score.data.vcf). The genotypes in the VCF file should be segregating in the combined reference and target populations used for scoring, and should not contain non-variable sites across these individuals. If additional individuals are included in the VCF file, such as potential source individuals used in later analyses, sites that are variable only in those additional individuals should not be included. Users also need to provide two files containing names of individuals from the reference and target populations (e.g. test.ref.ind.list and test.tgt.ind.list) for analysis.
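The site requirement above can be checked before running the command. The following is a minimal sketch, not part of sstar: the function names `is_segregating` and `segregating_records` are hypothetical, the VCF is assumed to be uncompressed and tab-delimited with `GT` as the first FORMAT field, and "segregating" is taken to mean that more than one distinct allele is called among the chosen reference and target individuals.

```python
def is_segregating(genotypes):
    """Return True if more than one distinct allele is called among the genotypes."""
    alleles = set()
    for gt in genotypes:
        # GT is assumed to be the first colon-separated field; alleles are split by '/' or '|'.
        for allele in gt.split(":")[0].replace("|", "/").split("/"):
            if allele != ".":  # skip missing calls
                alleles.add(allele)
    return len(alleles) > 1


def segregating_records(vcf_lines, samples):
    """Yield header lines plus the records that vary among the given samples."""
    columns = None
    for line in vcf_lines:
        if line.startswith("##"):
            yield line
        elif line.startswith("#CHROM"):
            header = line.rstrip("\n").split("\t")
            # Column positions of the reference and target individuals only.
            columns = [header.index(name) for name in samples]
            yield line
        else:
            fields = line.rstrip("\n").split("\t")
            if is_segregating(fields[i] for i in columns):
                yield line
```

Here `samples` would be the combined names from the reference and target lists. Any additional individuals in the VCF file are ignored when deciding whether a site is kept, so sites that vary only in those individuals are dropped.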
Users can calculate S* scores with the following command:
sstar score --vcf test.score.data.vcf --ref test.ref.ind.list --tgt test.tgt.ind.list --output test.score.results
The expected output of this command can be found in test.score.exp.results.
Output
An example of the output is shown below:
| chrom | start | end | sample | hap_index | S*_score | region_ind_SNP_number | S*_SNP_number | S*_SNPs |
|---|---|---|---|---|---|---|---|---|
| 21 | 0 | 50000 | ind1 | NA | 51470 | 11 | 6 | 2309,25354,26654,29724,40809,45079 |
The meaning of each column:
- The `chrom` column is the name of the chromosome.
- The `start` column is the start position of the current window for calculating S* scores.
- The `end` column is the end position of the current window for calculating S* scores.
- The `sample` column is the name of the target individual.
- The `hap_index` column is the haplotype index. It is `NA` in the default dosage-based mode. With `--phased`, it identifies the target haplotype.
- The `S*_score` column is the estimated S* score.
- The `region_ind_SNP_number` column is the number of shared derived variants in the current window between all the individuals from the reference population and the current target individual or haplotype.
- The `S*_SNP_number` column is the number of S* SNPs found in the current target individual or haplotype.
- The `S*_SNPs` column is the positions for S* SNPs found in the current target individual or haplotype.
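The columns described above can be loaded for downstream analysis with a few lines of Python. This is only a sketch: the helper `parse_score_rows` is hypothetical, and it assumes the output file is tab-delimited with a header row, which should be checked against the actual file. The example row is the one shown in the table above.

```python
import csv
import io

INT_COLUMNS = ("start", "end", "region_ind_SNP_number", "S*_SNP_number")


def parse_score_rows(handle, delimiter="\t"):
    """Yield one dict per output row, converting numeric columns."""
    for row in csv.DictReader(handle, delimiter=delimiter):
        # Treat the literal string NA as a missing value.
        record = {key: (None if value == "NA" else value) for key, value in row.items()}
        for name in INT_COLUMNS:
            if record[name] is not None:
                record[name] = int(record[name])
        if record["S*_score"] is not None:
            record["S*_score"] = float(record["S*_score"])
        # S*_SNPs is a comma-separated list of positions.
        if record["S*_SNPs"] is not None:
            record["S*_SNPs"] = [int(p) for p in record["S*_SNPs"].split(",")]
        yield record


# Usage with the example row (delimiter assumed to be a tab).
example = (
    "chrom\tstart\tend\tsample\thap_index\tS*_score\t"
    "region_ind_SNP_number\tS*_SNP_number\tS*_SNPs\n"
    "21\t0\t50000\tind1\tNA\t51470\t11\t6\t2309,25354,26654,29724,40809,45079\n"
)
rows = list(parse_score_rows(io.StringIO(example)))
print(rows[0]["sample"], rows[0]["S*_score"], len(rows[0]["S*_SNPs"]))
```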
Settings
By default, sstar assumes the reference allele is the ancestral allele and the alternative allele is the derived allele. Users can use the argument --anc-allele with a BED format file (e.g. test.anc.allele.bed) to define the ancestral allele for each variant. If --anc-allele is used, variants without ancestral allele information will be removed.
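The two polarization modes described above can be illustrated with a short sketch. It is not sstar's code: the function `polarize` is hypothetical, and ancestral alleles are passed as a plain dictionary keyed by `(chrom, pos)` rather than read from a BED file, because the exact column layout of that file is not described here. Dropping variants whose ancestral allele matches neither the reference nor the alternative allele is also an assumption.

```python
def polarize(variants, anc_alleles=None):
    """Return (chrom, pos, ancestral, derived) tuples for biallelic variants.

    variants: iterable of (chrom, pos, ref, alt).
    anc_alleles: optional dict mapping (chrom, pos) to the ancestral allele.
    """
    result = []
    for chrom, pos, ref, alt in variants:
        if anc_alleles is None:
            # Default behaviour: reference allele is ancestral, alternative is derived.
            result.append((chrom, pos, ref, alt))
            continue
        anc = anc_alleles.get((chrom, pos))
        if anc is None:
            continue  # no ancestral allele information: variant is removed
        if anc == ref:
            result.append((chrom, pos, ref, alt))
        elif anc == alt:
            result.append((chrom, pos, alt, ref))
        # Assumption: an ancestral allele matching neither ref nor alt drops the variant.
    return result
```

With a dictionary supplied, only variants that have an ancestral allele survive, which mirrors the removal behaviour described for `--anc-allele`.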
sstar score uses a window size of 50,000 bp and a step size of 10,000 bp by default. Users can change these settings with --win-len and --win-step.
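To see how the window length and step interact, the sketch below enumerates overlapping windows. The function `sliding_windows` is hypothetical, and two details are assumptions: that windows start at position 0 (as in the example output row) and that enumeration stops once a window reaches the last position of interest. sstar may handle chromosome boundaries differently.

```python
def sliding_windows(last_pos, win_len=50000, win_step=10000):
    """Yield (start, end) pairs of overlapping windows covering 0..last_pos."""
    start = 0
    while True:
        end = start + win_len
        yield start, end
        if end >= last_pos:
            break  # this window already reaches the last position
        start += win_step


# With the default settings, the first windows are
# (0, 50000), (10000, 60000), (20000, 70000), ...
for window in sliding_windows(80000):
    print(window)
```

Because the default step is one fifth of the default window length, a given position can fall into up to five windows, so the same variant may contribute to several rows of the output.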
The main command-line options are:
| Option | Default | Description |
|---|---|---|
| `--vcf` | required | Path to the VCF file containing genotype data. |
| `--ref` | required | Path to the file containing reference individual IDs. |
| `--tgt` | required | Path to the file containing target individual IDs. |
| `--output` | required | Path to the output score file. |
| `--anc-allele` | `None` | Path to a BED format file containing ancestral allele information. Variants without ancestral allele information are removed. |
| `--win-len` | `50000` | Length of sliding windows in base pairs. |
| `--win-step` | `10000` | Step size for sliding windows in base pairs. |
| `--match-bonus` | `5000` | Bonus for matching genotypes between two variants. |
| `--max-mismatch` | `5` | Maximum genotype distance allowed before a pair is discarded. |
| `--mismatch-penalty` | `-10000` | Penalty for mismatching genotypes between two variants. |
| `--phased` | disabled | Calculate scores on phased haplotypes instead of genotype dosages. |
| `--thread` | `1` | Number of CPUs used for multiprocessing. |
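The three scoring options can be easier to reason about with a small worked example. The sketch below follows the commonly described S* pairwise rule (a matching pair earns the bonus plus the distance in base pairs, a mismatching pair receives the penalty, and a pair beyond the maximum genotype distance is discarded). Whether sstar applies exactly these rules is an assumption, the functions `pair_score` and `chain_score` are hypothetical, and the sketch only scores a given chain of variants; it does not search for the highest-scoring subset.

```python
def pair_score(pos_a, pos_b, geno_dist,
               match_bonus=5000, max_mismatch=5, mismatch_penalty=-10000):
    """Score one pair of variants from their positions and genotype distance."""
    if geno_dist > max_mismatch:
        return float("-inf")  # pair is discarded
    if geno_dist > 0:
        return mismatch_penalty  # genotypes differ, but within the allowed distance
    return match_bonus + abs(pos_b - pos_a)  # identical genotypes


def chain_score(positions, geno_dists, **params):
    """Sum pair scores over adjacent variants in an ordered chain."""
    pairs = zip(positions, positions[1:], geno_dists)
    return sum(pair_score(a, b, d, **params) for a, b, d in pairs)


# Positions from the example output row; the genotype distances are made up.
positions = [2309, 25354, 26654, 29724, 40809, 45079]
geno_dists = [0, 1, 0, 0, 0]  # hypothetical: only the second pair mismatches
print(chain_score(positions, geno_dists))
```

Under these made-up genotype distances, four matching pairs contribute 4 × 5000 plus 41470 bp of distance, and the single mismatching pair contributes -10000, which sums to 51470, the score in the example row. This agreement is illustrative only and does not show which pairs actually mismatched in the example data.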