Calculate quantiles of expected S*
sstar provides a quantile subcommand to estimate quantiles of expected S* scores from simulated data without introgression. If users need to estimate quantiles of expected S* with large-scale simulations, we recommend users follow our sstar pipelines in sstar-analysis, for example, sstar.snake.
Input
Users need to install the ms program from Hudson Lab and provide a demographic model (e.g. BonoboGhost_4K19_no_introgression.yaml) in Demes YAML format. The reference and target sample sizes should match the number of chromosomes in the real data to be analyzed. Population index numbers used in the generated ms command are defined by the order of population appearance in the Demes file, so users should check that --ref-pop and --tgt-pop refer to the intended populations. Then users can use the following command:
sstar quantile --model BonoboGhost_4K19_no_introgression.yaml --ms-dir ./msdir/ --N0 1000 --nsamp 22 --nreps 20000 --ref-pop Western --ref-size 20 --tgt-pop Central --tgt-size 2 --mut-rate 1.2e-8 --rec-rate 0.7e-8 --seq-len 40000 --snp-num-range 25 30 5 --output-dir quantiles --thread 2
The expected result above can be found in test.quantile.exp.summary.
Output
An example for the output is below:
| S*_score | SNP_num | quantile | log(local_recomb_rate) |
|---|---|---|---|
| 62168.0 | 25 | 0.5 | -8.154901959985743 |
The meaning of each column:
- The
S*_scorecolumn is the S* score. - The
SNP_numcolumn is the number of SNPs. - The
quantilecolumn is the quantile of the expected S* score in the simulated dataset. - The
log(local_recomb_rate)column is the base-10 logarithm of the local recombination rate.
Settings
The main simulation settings are:
| Option | Description |
|---|---|
--model |
Path to the demographic model file used for simulation. |
--ms-dir |
Path to the directory containing the ms executable. |
--N0 |
Reference effective population size used for ms parameter scaling. |
--nsamp |
Haploid sample size used in ms simulation. |
--nreps |
Number of simulation replicates. |
--ref-pop |
Name of the reference population in the demographic model. |
--ref-size |
Haploid sample size of the reference population. |
--tgt-pop |
Name of the target population in the demographic model. |
--tgt-size |
Haploid sample size of the target population. |
--mut-rate |
Mutation rate per site per generation. |
--rec-rate |
Recombination rate per site per generation. |
--seq-len |
Length of the simulated sequence. |
--snp-num-range |
Minimum SNP count, maximum SNP count, and step size used for ms simulations. |
--output-dir |
Path to the output directory. |
Users can use --phased to run sstar score on phased haplotypes during simulation scoring. Use phased quantiles with score files generated by sstar score --phased.
Users can use --seeds to pass three random seeds to ms, for example --seeds 1 2 3. Users can use --keep-simulated-data to keep intermediate simulation directories instead of deleting them after the summary file is created. Users can use --thread to specify the number of CPUs used for multiprocessing.
To simulate a range of recombination rates, run sstar quantile separately for each recombination-rate step and merge the resulting summary tables before using them with sstar threshold.