Configuration

For the train and infer commands, sstar2 requires a YAML configuration file specifying parameters for simulation, preprocessing, and the machine learning model (GradientBoostingRegressor). An example configuration file can be found here and is shown below:

simulation:
  nrep: 100
  nref: 10
  ntgt: 10
  ref_id: "Western"
  tgt_id: "Bonobo"
  seq_len: 40000
  mut_rate: 1.2e-8
  rec_rate: 0.7e-8
  ploidy: 2
  is_phased: False
  nfeature: 100
  is_shuffled: False
  nprocess: 2
  seed: 4836

preprocessing:
  vcf_file: "examples/data/example.vcf"
  chr_name: "1"
  ref_ind_file: "examples/data/ref.ind.list"
  tgt_ind_file: "examples/data/tgt.ind.list"
  win_len: 40000
  win_step: 10000
  is_phased: False
  nprocess: 2
  ploidy: 2

model:
  params:
    loss: "quantile"
    alpha: 0.999
    n_estimators: 200
    max_depth: 3
    random_state: 4836

The configuration file has three top-level sections: simulation, preprocessing, and model.

In the simulation section:

Parameter	Description
`nrep`	Number of simulations run in each batch. Additional batches are simulated until at least `nfeature` genomic windows are obtained.
`nref`	Number of reference individuals.
`ntgt`	Number of target individuals.
`ref_id`	Population label used for the reference population in the simulated data.
`tgt_id`	Population label used for the target population in the simulated data.
`seq_len`	Length of the simulated sequence.
`mut_rate`	Mutation rate used in simulation.
`rec_rate`	Recombination rate used in simulation.
`ploidy`	Ploidy of the simulated individuals.
`is_phased`	Whether the simulated data are treated as phased.
`nfeature`	Minimum number of genomic windows to generate for model training.
`is_shuffled`	Whether the generated features are shuffled before model training.
`nprocess`	Number of processes used for simulation.
`seed`	Random seed for reproducibility.

In the preprocessing section:

Parameter	Description
`vcf_file`	Path to the input VCF file for inference.
`chr_name`	Chromosome name to analyze.
`ref_ind_file`	Path to the file listing reference individuals.
`tgt_ind_file`	Path to the file listing target individuals.
`win_len`	Window length used to divide the genome.
`win_step`	Step size between adjacent windows.
`is_phased`	Whether the input VCF is treated as phased.
`nprocess`	Number of processes used for preprocessing.
`ploidy`	Ploidy of the individuals in the input VCF.

In the model section, parameters under params are passed to GradientBoostingRegressor:

Parameter	Description
`loss`	Loss function used by the gradient boosting model. Must be set to `quantile`.
`alpha`	Quantile level used.
`n_estimators`	Number of boosting stages.
`max_depth`	Maximum depth of each regression tree.
`random_state`	Random seed used by the model.

For other available model parameters, please see the official scikit-learn documentation for GradientBoostingRegressor.