Calculate significant S* score thresholds
Input
To determine the significance of S* scores, users should use data from simulations. Users need to provide data from a simulated dataset (e.g. gravel_asn_scale_60k.simulated.data) for building a generalized additive model to predict the expected S* score under a demographic model without introgression. This file can be created with the command sstar quantile or the sstar pipeline sstar.snake in sstar-analysis.
In the simulation, S* is calculated for a given number of SNPs per window and, when a recombination map is used, a given recombination rate. When specifying a recombination map for real data, simulations with sstar quantile should be performed across recombination-rate steps so that the simulated table contains the recombination-rate range needed by sstar threshold. For each value of the number of SNPs, or for each grid point of number of SNPs and recombination rate, the simulation should be of sufficient scale (e.g. simulating 20,000 windows), use the same window size used for calculating S* in real data (e.g. 50 kbp), and use the same number of individuals used in real data.
An example for the simulated dataset is below:
| S*_score | SNP_num | quantile | log(local_recomb_rate) |
|---|---|---|---|
| 0 | 10 | 0.8 | -7.7500000013365 |
The meaning of each column:
- The
S*_scorecolumn is the S* score at a given quantile. - The
SNP_numcolumn is the number of SNPs within windows of a simulated dataset. - The
quantilecolumn is the quantile at which the S* score is observed in the simulated dataset. - The
log(local_recomb_rate)column is the base-10 logarithm of the local recombination rate of a simulated dataset.
Users also need to provide the file containing S* scores estimated from real data (e.g. test.score.exp.results). Users can use a BED-like file (e.g. hum.windows.50k.10k.recomb.map) to specify local recombination rates across the genome. If a window is not found in the recombination map but is present in the output file from sstar score, this window will be ignored when calculating significant thresholds.
Users can estimate significant S* score thresholds at quantile 0.99 with the following command:
sstar threshold --score test.score.exp.results --sim-data gravel_asn_scale_60k.simulated.data --recomb-map hum.windows.50k.10k.recomb.map --quantile 0.99 --output test.threshold.results
The expected result above can be found in test.threshold.exp.results.
Output
An example for the output is below:
| chrom | start | end | sample | S*_score | expected_S*_score | local_recomb_rate | quantile | significant |
|---|---|---|---|---|---|---|---|---|
| 21 | 9480000 | 9530000 | NA06984 | 48493 | 53263.269232 | 1.29162 | 0.99 | False |
| 21 | 9650000 | 9700000 | NA06984 | 55639 | 53754.835563 | 1.29162 | 0.99 | True |
The meanings of the first four columns are the same as those in the output from sstar score. The meanings of the remaining columns:
- The
S*_scorecolumn is the observed S* score. - The
expected_S*_scorecolumn is the expected S* score calculated under the null model without introgression. - The
local_recomb_ratecolumn is the local recombination rate in the current window. - The
quantilecolumn is the quantile used to determine whether the S* score is significant. - The
significantcolumn indicates whether the S* score is significant, defined asS*_score>expected_S*_score.
The threshold output keeps the first four score-file columns (chrom, start, end, and sample) plus the observed S*_score; it does not write the hap_index column.
Settings
Users can use --phased when thresholding score files generated by sstar score --phased. When --phased is used, the score file must contain non-NA values in the hap_index column. When --phased is not used, the score file must not contain phased rows.
Users can use --k to set the basis dimension passed to the MGCV smooth term. The default is 8. This option mainly affects how flexible the fitted GAM can be. When the simulated table includes local recombination rate, the useful value of --k depends on how many SNP-number, quantile, and recombination-rate grid points are available; increasing --k is not useful if the simulation grid is too sparse.
When --recomb-map is provided, the GAM includes local recombination rate and uses the values in the recombination-map file. Windows missing from the recombination map are skipped.
The --recomb-rate option is available and defaults to 1e-8. In the current implementation, when --recomb-map is not provided, the GAM is fitted using SNP number and quantile only, and the local_recomb_rate column in the output is written as nan.