Configuration
For the train and infer commands, sstar2 requires a YAML configuration file specifying parameters for simulation, preprocessing, and the machine learning model (GradientBoostingRegressor). An example configuration file can be found here and is shown below:
simulation:
  nrep: 10000
  nref: 5
  ntgt: 10
  ref_id: YRI
  tgt_id: CEU
  seq_len: 50_000
  mut_rate: 1.29e-8
  rec_rate: 1e-8
  ploidy: 2
  is_phased: false
  nfeature: 100_000
  is_shuffled: false
  nprocess: 2
  seed: 12345
preprocessing:
  vcf_file: examples/data/sstar2.example.biallelic.snps.vcf.gz
  chr_name: "21"
  ref_ind_file: examples/data/ref.samples.list
  tgt_ind_file: examples/data/tgt.samples.list
  win_len: 50_000
  win_step: 10000
  is_phased: false
  nprocess: 2
  ploidy: 2
model:
  params:
    loss: "quantile"
    alpha: 0.99
    n_estimators: 200
    max_depth: 3
    random_state: 12345
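One practical detail when reading a file like this: YAML parsers differ in how they type numbers. PyYAML, for instance, follows YAML 1.1 and loads `1e-8` (no decimal point) as a string, while `1.29e-8` becomes a float. Whether sstar2 normalizes such values itself is not stated here, so the sketch below only shows a defensive coercion step one might apply to an already-loaded dictionary. The helper `as_number` is hypothetical and not part of sstar2.

```python
def as_number(value):
    """Return value as int or float; strings such as "1e-8" are converted."""
    if isinstance(value, (int, float)):
        return value
    try:
        return int(value)      # handles "50000" and "50_000"
    except ValueError:
        return float(value)    # handles "1e-8"; raises ValueError if not numeric


# Hypothetical values as a YAML 1.1 parser might return them.
raw = {"seq_len": 50000, "mut_rate": 1.29e-8, "rec_rate": "1e-8"}
clean = {key: as_number(val) for key, val in raw.items()}
print(clean)
```

Writing the value as `1.0e-8` in the configuration file is another way to sidestep the ambiguity.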
The configuration file has three top-level sections: simulation, preprocessing, and model.
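A quick structural check before running a command can catch a misspelled or missing section early. The sketch below assumes the YAML file has already been loaded into a Python dictionary; `check_config` is a hypothetical helper, not an sstar2 function, and it tests only the three top-level sections plus the documented requirement that `loss` be `quantile`.

```python
REQUIRED_SECTIONS = ("simulation", "preprocessing", "model")


def check_config(config):
    """Raise ValueError if a top-level section is missing or loss is not 'quantile'."""
    missing = [name for name in REQUIRED_SECTIONS if name not in config]
    if missing:
        raise ValueError(f"missing section(s): {', '.join(missing)}")
    params = config["model"].get("params", {})
    if params.get("loss") != "quantile":
        raise ValueError("model.params.loss must be 'quantile'")
    return config


# Abbreviated dictionary mirroring the example configuration.
example = {
    "simulation": {"nrep": 10000, "nfeature": 100_000},
    "preprocessing": {"win_len": 50_000, "win_step": 10000},
    "model": {"params": {"loss": "quantile", "alpha": 0.99}},
}
check_config(example)  # passes silently
```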
In the simulation section:
| Parameter | Description |
|---|---|
| nrep | Number of simulations run in each batch. Additional batches are simulated until at least nfeature genomic windows are obtained. |
| nref | Number of reference individuals. |
| ntgt | Number of target individuals. |
| ref_id | Population label used for the reference population in the simulated data. |
| tgt_id | Population label used for the target population in the simulated data. |
| seq_len | Length of the simulated sequence. |
| mut_rate | Mutation rate used in simulation. |
| rec_rate | Recombination rate used in simulation. |
| ploidy | Ploidy of the simulated individuals. |
| is_phased | Whether the simulated data are treated as phased. |
| nfeature | Minimum number of genomic windows to generate for model training. |
| is_shuffled | Whether the generated genomic windows are shuffled before model training. |
| nprocess | Number of processes used for simulation. |
| seed | Random seed for reproducibility. |
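The relationship between nrep and nfeature can be read as a simple loop: simulate a batch of nrep replicates, count the windows obtained, and repeat until the total reaches nfeature. The sketch below models only that counting. `count_batches` is a hypothetical helper, and `windows_per_rep` is a hypothetical stand-in for however many usable windows one replicate contributes, which this section does not specify.

```python
def count_batches(nrep, nfeature, windows_per_rep):
    """Return (batches, total_windows) needed to reach at least nfeature windows."""
    if nrep <= 0 or windows_per_rep <= 0:
        raise ValueError("nrep and windows_per_rep must be positive")
    total = 0
    batches = 0
    while total < nfeature:
        total += nrep * windows_per_rep  # one full batch is always simulated
        batches += 1
    return batches, total


# Assuming one window per replicate (an illustration, not an sstar2 guarantee):
print(count_batches(10_000, 100_000, 1))  # (10, 100000)
```

Because whole batches are simulated, the final count can exceed nfeature, which is why nfeature is described as a minimum.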
In the preprocessing section:
| Parameter | Description |
|---|---|
| vcf_file | Path to the input VCF file for inference. |
| chr_name | Chromosome name to analyze. |
| ref_ind_file | Path to the file listing reference individuals. |
| tgt_ind_file | Path to the file listing target individuals. |
| win_len | Window length used to divide the genome. |
| win_step | Step size between adjacent windows. |
| is_phased | Whether the input VCF is treated as phased. |
| nprocess | Number of processes used for preprocessing. |
| ploidy | Ploidy of the individuals in the input VCF. |
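To see how win_len and win_step interact, the sketch below enumerates window coordinates over a region. When win_step is smaller than win_len, as in the example configuration, consecutive windows overlap. The function `sliding_windows`, the 0-based half-open coordinates, and the choice to drop a trailing partial window are all assumptions made for illustration; sstar2's own coordinate and boundary conventions may differ.

```python
def sliding_windows(region_len, win_len, win_step):
    """Return (start, end) pairs for full-length windows, stepping by win_step."""
    if win_len <= 0 or win_step <= 0:
        raise ValueError("win_len and win_step must be positive")
    return [
        (start, start + win_len)
        for start in range(0, region_len - win_len + 1, win_step)
    ]


# A hypothetical 100 kb region with the example win_len and win_step.
windows = sliding_windows(100_000, 50_000, 10_000)
print(len(windows))  # 6
print(windows[:2])   # [(0, 50000), (10000, 60000)]
```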
In the model section, parameters under params are passed to GradientBoostingRegressor:
| Parameter | Description |
|---|---|
| loss | Loss function used by the gradient boosting model. Must be set to quantile. |
| alpha | Quantile level targeted by the quantile loss. |
| n_estimators | Number of boosting stages. |
| max_depth | Maximum depth of each regression tree. |
| random_state | Random seed used by the model. |
For other available model parameters, please see the official scikit-learn documentation for GradientBoostingRegressor.
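For intuition about loss and alpha: a quantile loss (also called pinball loss) penalizes under-prediction with weight alpha and over-prediction with weight 1 - alpha, so with alpha set to 0.99 the model is pushed toward the 0.99 quantile of the response. The sketch below is a plain-Python illustration of that loss, not scikit-learn's implementation, and `pinball_loss` is a hypothetical helper.

```python
def pinball_loss(y_true, y_pred, alpha):
    """Mean quantile (pinball) loss of predictions y_pred at quantile level alpha."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-prediction (diff >= 0) costs alpha per unit, over-prediction 1 - alpha.
        total += alpha * diff if diff >= 0 else (alpha - 1) * diff
    return total / len(y_true)


values = list(range(1, 101))  # toy response: 1, 2, ..., 100

# Compare two constant predictions under alpha = 0.99.
loss_median = pinball_loss(values, [50] * 100, alpha=0.99)
loss_high = pinball_loss(values, [99] * 100, alpha=0.99)
print(loss_high < loss_median)  # True
```

A prediction near the upper tail scores better than the median here, which is the behavior the loss and alpha settings in the configuration ask the regressor to learn.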