Skip to content

Calculate quantiles of expected S*

sstar provides a quantile subcommand to estimate quantiles of expected S* scores from simulated data without introgression. If users need to estimate quantiles of expected S* with large-scale simulations, we recommend users follow our sstar pipelines in sstar-analysis, for example, sstar.snake.

Input

Users need to install the ms program from Hudson Lab and provide a demographic model (e.g. BonoboGhost_4K19_no_introgression.yaml) in Demes YAML format. The reference and target sample sizes should match the number of chromosomes in the real data to be analyzed. Population index numbers used in the generated ms command are defined by the order of population appearance in the Demes file, so users should check that --ref-pop and --tgt-pop refer to the intended populations. Then users can use the following command:

sstar quantile --model BonoboGhost_4K19_no_introgression.yaml --ms-dir ./msdir/ --N0 1000 --nsamp 22 --nreps 20000 --ref-pop Western --ref-size 20 --tgt-pop Central --tgt-size 2 --mut-rate 1.2e-8 --rec-rate 0.7e-8 --seq-len 40000 --snp-num-range 25 30 5 --output-dir quantiles --thread 2

The expected result above can be found in test.quantile.exp.summary.

Output

An example for the output is below:

S*_score SNP_num quantile log(local_recomb_rate)
62168.0 25 0.5 -8.154901959985743

The meaning of each column:

  • The S*_score column is the S* score.
  • The SNP_num column is the number of SNPs.
  • The quantile column is the quantile of the expected S* score in the simulated dataset.
  • The log(local_recomb_rate) column is the base-10 logarithm of the local recombination rate.

Settings

The main simulation settings are:

Option Description
--model Path to the demographic model file used for simulation.
--ms-dir Path to the directory containing the ms executable.
--N0 Reference effective population size used for ms parameter scaling.
--nsamp Haploid sample size used in ms simulation.
--nreps Number of simulation replicates.
--ref-pop Name of the reference population in the demographic model.
--ref-size Haploid sample size of the reference population.
--tgt-pop Name of the target population in the demographic model.
--tgt-size Haploid sample size of the target population.
--mut-rate Mutation rate per site per generation.
--rec-rate Recombination rate per site per generation.
--seq-len Length of the simulated sequence.
--snp-num-range Minimum SNP count, maximum SNP count, and step size used for ms simulations.
--output-dir Path to the output directory.

Users can use --phased to run sstar score on phased haplotypes during simulation scoring. Use phased quantiles with score files generated by sstar score --phased.

Users can use --seeds to pass three random seeds to ms, for example --seeds 1 2 3. Users can use --keep-simulated-data to keep intermediate simulation directories instead of deleting them after the summary file is created. Users can use --thread to specify the number of CPUs used for multiprocessing.

To simulate a range of recombination rates, run sstar quantile separately for each recombination-rate step and merge the resulting summary tables before using them with sstar threshold.