Configuration
GAISHI uses one YAML configuration file for both training and inference.
The configuration file has three top-level blocks: simulation, preprocessing, and model.
The train command uses simulation and model.
The infer command uses preprocessing and model.
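The command-to-block mapping above can be written down as a small check. This is a minimal sketch, not GAISHI's own validation code: `REQUIRED_BLOCKS` and `missing_blocks` are hypothetical names, and the configuration is assumed to be already parsed into a Python dictionary.

```python
# Hypothetical helper: the top-level blocks each command reads,
# as described above. Not part of the GAISHI API.
REQUIRED_BLOCKS = {
    "train": ("simulation", "model"),
    "infer": ("preprocessing", "model"),
}


def missing_blocks(config: dict, command: str) -> list[str]:
    """Return the top-level blocks that `command` needs but `config` lacks."""
    return [block for block in REQUIRED_BLOCKS[command] if block not in config]


# A configuration that only defines the blocks used for training.
config = {
    "simulation": {"sim_type": "feature_vector"},
    "model": {"name": "logistic_regression"},
}
print(missing_blocks(config, "train"))  # prints []
print(missing_blocks(config, "infer"))  # prints ['preprocessing']
```

A file that defines all three blocks passes the check for both commands, which is why one file can serve training and inference.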
Feature-vector configuration
Feature-vector configuration is used for logistic_regression and extra_trees_classifier.

    simulation:
      sim_type: "feature_vector"
      nrep: 100
      nref: 5
      ntgt: 5
      ref_id: "Reference"
      tgt_id: "Target"
      src_id: "Source"
      ploidy: 2
      is_phased: true
      seq_len: 50000
      mut_rate: 1.2e-8
      rec_rate: 1e-8
      nprocess: 2
      feature_config_file: "examples/configs/ArchIE.features.yaml"
      intro_prop: 0.7
      non_intro_prop: 0.3
      output_prefix: "example.lr"
      output_dir: "examples/results/data/training"
      seed: 12345
      nfeature: 100
      is_shuffled: false
      force_balanced: false
      keep_sim_data: false
    preprocessing:
      process_type: "feature_vector"
      vcf_file: "examples/data/example.vcf"
      chr_name: "1"
      ref_ind_file: "examples/data/ref.ind.list"
      tgt_ind_file: "examples/data/tgt.ind.list"
      win_len: 50000
      win_step: 50000
      is_phased: true
      feature_config_file: "examples/configs/ArchIE.features.yaml"
      output_dir: "examples/results/data/infer"
      output_prefix: "example.lr"
      nprocess: 2
      ploidy: 2
    model:
      name: "logistic_regression"
      params:
        solver: "newton-cg"
        max_iter: 10000
        random_state: 12345

Genotype-matrix configuration
Genotype-matrix configuration is used for unet++.

    simulation:
      sim_type: "genotype_matrix"
      nrep: 100
      nref: 50
      ntgt: 50
      ref_id: "Reference"
      tgt_id: "Target"
      src_id: "Source"
      ploidy: 2
      is_phased: true
      seq_len: 50000
      mut_rate: 1.2e-8
      rec_rate: 1e-8
      nprocess: 2
      output_prefix: "example.unet"
      output_dir: "examples/results/data/training"
      seed: 12345
      force_balanced: false
      keep_sim_data: false
      num_polymorphisms: 192
      num_genotype_matrices: 100
    preprocessing:
      process_type: "genotype_matrix"
      vcf_file: "examples/data/example.vcf"
      chr_name: "1"
      ref_ind_file: "examples/data/ref.ind.list"
      tgt_ind_file: "examples/data/tgt.ind.list"
      anc_allele_file: null
      num_polymorphisms: 192
      step_size: 56
      output_dir: "examples/results/data/infer"
      output_prefix: "example.unet"
      ploidy: 2
      is_phased: true
      nprocess: 2
    model:
      name: "unet++"
      params:
        add_rnn: false
        learning_rate: 0.001
        batch_size: 32
        n_early: 10
        n_epochs: 20
        min_delta: 1e-4
        val_prop: 0.05
        seed: 12345

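The unet++ example sets only some of the parameters described in the model section of this page; the rest (device, num_workers, use_amp, recent_window) are left to their documented defaults. The sketch below shows one way to picture that resolution. It assumes user-supplied values simply override the defaults; `UNET_DEFAULTS` and `resolve_params` are hypothetical names, not GAISHI functions.

```python
# Defaults as listed in the unet++ parameter table of this page.
UNET_DEFAULTS = {
    "add_rnn": False,
    "batch_size": 32,
    "n_early": 10,
    "n_epochs": 100,
    "learning_rate": 0.001,
    "min_delta": 1e-4,
    "val_prop": 0.05,
    "seed": None,
    "device": None,
    "num_workers": 0,
    "use_amp": False,
    "recent_window": 500,
}


def resolve_params(user_params: dict) -> dict:
    """Overlay user-supplied params on the documented defaults (assumed behaviour)."""
    return {**UNET_DEFAULTS, **user_params}


# The params block from the genotype-matrix example configuration.
example_params = {
    "add_rnn": False,
    "learning_rate": 0.001,
    "batch_size": 32,
    "n_early": 10,
    "n_epochs": 20,
    "min_delta": 1e-4,
    "val_prop": 0.05,
    "seed": 12345,
}
resolved = resolve_params(example_params)
print(resolved["n_epochs"], resolved["num_workers"], resolved["device"])  # prints 20 0 None
```

Note that the example lowers n_epochs from the documented default of 100 to 20.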
simulation
| Field | Description |
|---|---|
| sim_type | Simulation type. Use feature_vector or genotype_matrix. |
| nrep | Number of simulation replicates per batch. |
| nref | Number of reference individuals. |
| ntgt | Number of target individuals. |
| ref_id | Reference population ID in the DEMES model. |
| tgt_id | Target population ID in the DEMES model. |
| src_id | Source population ID in the DEMES model. |
| ploidy | Sample ploidy. |
| is_phased | Whether the simulated data are phased. |
| seq_len | Simulated sequence length in base pairs. |
| mut_rate | Mutation rate per base pair per generation. |
| rec_rate | Recombination rate per base pair per generation. |
| nprocess | Number of processes. Default: 1. |
| output_prefix | Prefix of output files. |
| output_dir | Output directory. |
| seed | Random seed. |
| force_balanced | Enforce balanced introgressed and non-introgressed classes. Default: false. |
| keep_sim_data | Keep raw simulation files. Default: false. |
Feature-vector only:
| Field | Description |
|---|---|
| feature_config_file | Path to the feature configuration file. |
| intro_prop | Minimum introgressed proportion for labeling a window as introgressed. |
| non_intro_prop | Maximum introgressed proportion for labeling a window as non-introgressed. |
| nfeature | Number of feature vectors to generate. |
| is_shuffled | Shuffle feature rows before output. Default: true. |
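The two thresholds intro_prop and non_intro_prop define a labeling rule with a gap between them. The sketch below illustrates that rule with the values from the example configuration (0.7 and 0.3). It is not GAISHI's implementation: `label_window` is a hypothetical function, and both the inclusive comparisons and the handling of windows that fall between the thresholds (returned as `None` here) are assumptions.

```python
from typing import Optional


def label_window(introgressed_fraction: float,
                 intro_prop: float = 0.7,
                 non_intro_prop: float = 0.3) -> Optional[int]:
    """Label a window from the fraction of it that is introgressed.

    Returns 1 (introgressed), 0 (non-introgressed), or None when the
    fraction lies strictly between the two thresholds. The inclusive
    bounds and the None case are assumptions of this sketch.
    """
    if introgressed_fraction >= intro_prop:
        return 1
    if introgressed_fraction <= non_intro_prop:
        return 0
    return None


for fraction in (0.9, 0.5, 0.1):
    print(fraction, label_window(fraction))
```

With these thresholds, a window that is 90% introgressed is labeled 1, a window that is 10% introgressed is labeled 0, and a window that is 50% introgressed receives neither label.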
Genotype-matrix only:
| Field | Description |
|---|---|
| num_polymorphisms | Number of polymorphic sites per genotype matrix. |
| num_genotype_matrices | Number of genotype matrices to generate. |
preprocessing
| Field | Description |
|---|---|
| process_type | Preprocessing type. Use feature_vector or genotype_matrix. |
| vcf_file | Path to the input VCF/BCF file. |
| chr_name | Chromosome name to process. |
| ref_ind_file | File containing reference individual IDs. |
| tgt_ind_file | File containing target individual IDs. |
| output_dir | Output directory. |
| output_prefix | Prefix of output files. |
| nprocess | Number of processes. Default: 1. |
| ploidy | Sample ploidy. Default: 2. |
| is_phased | Whether the VCF genotypes are phased. Default: true. |
| anc_allele_file | Optional ancestral allele file. Default: null. |
Feature-vector only:
| Field | Description |
|---|---|
| win_len | Window length in base pairs. |
| win_step | Window step size in base pairs. |
| feature_config_file | Path to the feature configuration file. |
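To see how win_len and win_step interact, the following sketch enumerates window coordinates over a region. `window_bounds` is a hypothetical helper, and the 200,000 bp region length is an illustrative value. The sketch assumes 0-based, half-open coordinates and drops a trailing partial window; GAISHI's own boundary handling may differ.

```python
def window_bounds(region_len: int, win_len: int, win_step: int) -> list[tuple[int, int]]:
    """Return half-open (start, end) windows of length win_len, advanced by win_step.

    Assumes 0-based coordinates and omits a trailing partial window.
    """
    if win_len <= 0 or win_step <= 0:
        raise ValueError("win_len and win_step must be positive")
    return [(start, start + win_len)
            for start in range(0, region_len - win_len + 1, win_step)]


# win_step == win_len (as in the example configuration): adjacent windows, no overlap.
tiled = window_bounds(200_000, 50_000, 50_000)
print(len(tiled), tiled[0], tiled[-1])  # prints 4 (0, 50000) (150000, 200000)

# A step of half the window length makes consecutive windows overlap by 50%.
overlapping = window_bounds(200_000, 50_000, 25_000)
print(len(overlapping))  # prints 7
```

In the example configuration both values are 50000, which also equals seq_len in the simulation block.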
Genotype-matrix only:
| Field | Description |
|---|---|
| num_polymorphisms | Number of polymorphic sites per genotype matrix. |
| step_size | Step size between genotype matrices. |
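The example values num_polymorphisms: 192 and step_size: 56 imply overlapping matrices if the step is counted in polymorphic sites. That unit is an assumption here, as the table does not state it. Under that assumption, the sketch below computes the start index of each matrix; `matrix_starts` is a hypothetical helper and the 1,000-site input is illustrative.

```python
def matrix_starts(n_sites: int, num_polymorphisms: int, step_size: int) -> list[int]:
    """Start indices of consecutive blocks of num_polymorphisms sites.

    Assumes step_size is measured in polymorphic sites and that an
    incomplete final block is skipped.
    """
    return list(range(0, n_sites - num_polymorphisms + 1, step_size))


starts = matrix_starts(n_sites=1000, num_polymorphisms=192, step_size=56)
overlap = 192 - 56  # sites shared by two consecutive matrices

print(len(starts), starts[:3], starts[-1])  # prints 15 [0, 56, 112] 784
print(overlap)  # prints 136
```

Under this reading, each site away from the ends is covered by several matrices.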
model
| Field | Description |
|---|---|
| name | Model name. Use logistic_regression, extra_trees_classifier, or unet++. |
| params | Model-specific parameters. |
For logistic_regression, parameters are passed to sklearn.linear_model.LogisticRegression.
For extra_trees_classifier, parameters are passed to sklearn.ensemble.ExtraTreesClassifier.
For unet++, commonly used parameters include:
| Field | Description |
|---|---|
| add_rnn | Use the RNN/4-channel UNet++ variant. Default: false. |
| batch_size | Batch size. Default: 32. |
| n_early | Early stopping patience. Default: 10. |
| n_epochs | Maximum number of epochs. Default: 100. |
| learning_rate | Learning rate. Default: 0.001. |
| min_delta | Minimum validation-loss improvement. Default: 1e-4. |
| val_prop | Validation proportion. Default: 0.05. |
| seed | Random seed. Default: null. |
| device | Device, such as cpu or cuda:0. Default: null. |
| num_workers | Number of PyTorch DataLoader workers. Default: 0. |
| use_amp | Use CUDA AMP. Default: false. |
| recent_window | Window size for recent training-log metrics. Default: 500. |
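The parameters n_early and min_delta work together: training stops once the validation loss has failed to improve by more than min_delta for n_early consecutive epochs. The sketch below shows a common patience-based scheme in plain Python. It is not GAISHI's training loop: `stopping_epoch` is a hypothetical function, and the strict comparison against the best loss seen so far is an assumption.

```python
def stopping_epoch(val_losses, n_early=10, min_delta=1e-4):
    """Return the 1-based epoch at which patience-based early stopping halts,
    or None if every epoch in val_losses is run."""
    best = float("inf")
    stale = 0  # consecutive epochs without a sufficient improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if best - loss > min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= n_early:
                return epoch
    return None


# Illustrative validation losses: the loss plateaus after epoch 3.
losses = [0.90, 0.70, 0.60, 0.60005, 0.59995, 0.61, 0.62]
print(stopping_epoch(losses, n_early=3, min_delta=1e-4))  # prints 6
print(stopping_epoch(losses))                             # prints None
```

In this sketch, epoch 5 lowers the loss by only 0.00005, which is below min_delta, so it still counts as an epoch without improvement.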