Configuration

For the train and infer commands, sstar2 requires a YAML configuration file specifying parameters for simulation, preprocessing, and the machine learning model (GradientBoostingRegressor). An example configuration file can be found here and is shown below:

simulation:
  nrep: 10000
  nref: 5
  ntgt: 10
  ref_id: YRI
  tgt_id: CEU
  seq_len: 50_000
  mut_rate: 1.29e-8
  rec_rate: 1e-8
  ploidy: 2
  is_phased: false
  nfeature: 100_000
  is_shuffled: false
  nprocess: 2
  seed: 12345

preprocessing:
  vcf_file: examples/data/sstar2.example.biallelic.snps.vcf.gz
  chr_name: "21"
  ref_ind_file: examples/data/ref.samples.list
  tgt_ind_file: examples/data/tgt.samples.list
  win_len: 50_000
  win_step: 10000
  is_phased: false
  nprocess: 2
  ploidy: 2

model:
  params:
    loss: "quantile"
    alpha: 0.99
    n_estimators: 200
    max_depth: 3
    random_state: 12345

The configuration file has three top-level sections: simulation, preprocessing, and model.
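
As a rough illustration of this layout, the sketch below checks a parsed configuration (for example, the dictionary returned by PyYAML's `yaml.safe_load`) for the three sections and the keys used in the example above. The names `REQUIRED_KEYS` and `missing_keys` are hypothetical, and treating every key in the example as required is an assumption; sstar2's own validation may differ.

```python
# Hypothetical helper, not part of sstar2: lists entries missing from a
# configuration that has already been parsed into nested dictionaries.
# Assumption: every key shown in the example file is required.
REQUIRED_KEYS = {
    "simulation": {
        "nrep", "nref", "ntgt", "ref_id", "tgt_id", "seq_len", "mut_rate",
        "rec_rate", "ploidy", "is_phased", "nfeature", "is_shuffled",
        "nprocess", "seed",
    },
    "preprocessing": {
        "vcf_file", "chr_name", "ref_ind_file", "tgt_ind_file", "win_len",
        "win_step", "is_phased", "nprocess", "ploidy",
    },
    "model": {"params"},
}


def missing_keys(config: dict) -> list[str]:
    """Return dotted names of expected entries that are absent from config."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        block = config.get(section)
        if not isinstance(block, dict):
            # The whole section is absent or is not a mapping.
            missing.append(section)
            continue
        missing.extend(f"{section}.{key}" for key in sorted(keys - set(block)))
    return missing
```

An empty list means all three sections and their example keys are present.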

In the simulation section:

| Parameter | Description |
|---|---|
| `nrep` | Number of simulations run in each batch. Additional batches are simulated until at least `nfeature` genomic windows are obtained. |
| `nref` | Number of reference individuals. |
| `ntgt` | Number of target individuals. |
| `ref_id` | Population label used for the reference population in the simulated data. |
| `tgt_id` | Population label used for the target population in the simulated data. |
| `seq_len` | Length of the simulated sequence, in base pairs. |
| `mut_rate` | Mutation rate used in simulation, per base pair per generation. |
| `rec_rate` | Recombination rate used in simulation, per base pair per generation. |
| `ploidy` | Ploidy of the simulated individuals. |
| `is_phased` | Whether the simulated data are treated as phased. |
| `nfeature` | Minimum number of genomic windows to generate for model training. |
| `is_shuffled` | Whether the generated genomic windows are shuffled before model training. |
| `nprocess` | Number of processes used for simulation. |
| `seed` | Random seed for reproducibility. |
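
The interaction between `nrep` and `nfeature` can be sketched as a simple loop. This is only an illustration of the stopping rule described above, not sstar2's implementation: `simulate_until` and the `run_batch` callback are hypothetical names, and how many windows one batch yields depends on details not covered here.

```python
from typing import Callable


def simulate_until(nfeature: int, nrep: int,
                   run_batch: Callable[[int], int]) -> tuple[int, int]:
    """Run batches of nrep simulations until at least nfeature windows exist.

    run_batch is a hypothetical stand-in for the simulator: it receives nrep
    and returns how many genomic windows that batch produced.
    Returns (number of batches run, total number of windows).
    """
    if nfeature <= 0 or nrep <= 0:
        raise ValueError("nfeature and nrep must be positive")
    batches = 0
    total = 0
    while total < nfeature:
        produced = run_batch(nrep)
        if produced <= 0:
            # Guard against looping forever if a batch yields nothing.
            raise RuntimeError("a batch produced no windows")
        total += produced
        batches += 1
    return batches, total
```

Because whole batches are added, the final number of windows can exceed `nfeature`; `nfeature` is a minimum, not an exact count.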

In the preprocessing section:

| Parameter | Description |
|---|---|
| `vcf_file` | Path to the input VCF file for inference. |
| `chr_name` | Chromosome name to analyze. |
| `ref_ind_file` | Path to the file listing reference individuals. |
| `tgt_ind_file` | Path to the file listing target individuals. |
| `win_len` | Window length used to divide the genome, in base pairs. |
| `win_step` | Step size between the start positions of adjacent windows, in base pairs. |
| `is_phased` | Whether the input VCF is treated as phased. |
| `nprocess` | Number of processes used for preprocessing. |
| `ploidy` | Ploidy of the individuals in the input VCF. |
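
To make the relationship between `win_len` and `win_step` concrete, here is a minimal sketch of sliding-window generation. The function name `sliding_windows`, the 0-based half-open coordinates, and the choice to skip a trailing partial window are all assumptions made for illustration; sstar2 may use different conventions.

```python
from typing import Iterator


def sliding_windows(chrom_len: int, win_len: int,
                    win_step: int) -> Iterator[tuple[int, int]]:
    """Yield (start, end) for each full window along a chromosome.

    Assumptions: coordinates are 0-based and half-open, and a trailing
    window shorter than win_len is not emitted.
    """
    if win_len <= 0 or win_step <= 0:
        raise ValueError("win_len and win_step must be positive")
    start = 0
    while start + win_len <= chrom_len:
        yield start, start + win_len
        start += win_step


# Example values: win_len and win_step as in the example configuration,
# applied to a made-up region of 100,000 positions.
windows = list(sliding_windows(100_000, 50_000, 10_000))
```

With `win_len: 50_000` and `win_step: 10000`, consecutive windows overlap by 40,000 positions. Setting `win_step` equal to `win_len` would give non-overlapping windows.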

In the model section, parameters under params are passed to GradientBoostingRegressor:

| Parameter | Description |
|---|---|
| `loss` | Loss function used by the gradient boosting model. Must be set to `quantile`. |
| `alpha` | Quantile level to estimate; for example, `0.99` fits the 99th percentile of the training target. |
| `n_estimators` | Number of boosting stages. |
| `max_depth` | Maximum depth of each regression tree. |
| `random_state` | Random seed used by the model. |
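
The role of `alpha` is easier to see from the quantile (pinball) loss itself. The sketch below is a plain-Python illustration of that loss, written for this document; it is not scikit-learn's code, and `pinball_loss` is a hypothetical name.

```python
from typing import Sequence


def pinball_loss(y_true: Sequence[float], y_pred: Sequence[float],
                 alpha: float) -> float:
    """Mean quantile (pinball) loss at quantile level alpha."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction (diff > 0) is weighted by alpha,
        # over-prediction (diff < 0) by 1 - alpha.
        total += alpha * diff if diff >= 0 else (alpha - 1.0) * diff
    return total / len(y_true)


under = pinball_loss([10.0], [8.0], alpha=0.99)  # prediction too low by 2
over = pinball_loss([8.0], [10.0], alpha=0.99)   # prediction too high by 2
```

With `alpha: 0.99`, predicting too low costs 99 times as much as predicting too high by the same amount, which pushes the fitted model toward the 99th percentile of the training target rather than its mean.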

For other available model parameters, please see the official scikit-learn documentation for GradientBoostingRegressor.