Configuration

GAISHI uses one YAML configuration file for both training and inference.

The configuration file has three top-level blocks:

  • simulation
  • preprocessing
  • model

The train command uses simulation and model.

The infer command uses preprocessing and model.
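The mapping between commands and blocks can be sketched as a small check. This is a minimal illustration, not part of GAISHI: `REQUIRED_BLOCKS` and `missing_blocks` are hypothetical names, and the configuration is assumed to have already been parsed from YAML into a plain dictionary.

```python
# Hypothetical helper (not part of GAISHI): which top-level blocks each
# command reads, following the description above.
REQUIRED_BLOCKS = {
    "train": ("simulation", "model"),
    "infer": ("preprocessing", "model"),
}


def missing_blocks(config, command):
    """Return the top-level blocks that `command` needs but `config` lacks."""
    return [block for block in REQUIRED_BLOCKS[command] if block not in config]


# A configuration parsed from YAML is assumed to be a plain dict like this.
config = {
    "simulation": {"sim_type": "feature_vector"},
    "model": {"name": "logistic_regression"},
}

print(missing_blocks(config, "train"))  # []
print(missing_blocks(config, "infer"))  # ['preprocessing']
```

A file that contains only simulation and model is therefore enough for training in this sketch, while inference would additionally need a preprocessing block.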

Feature-vector configuration

Use the feature-vector configuration with the logistic_regression and extra_trees_classifier models. The example below configures logistic_regression.

simulation:
  sim_type: "feature_vector"
  nrep: 100
  nref: 5
  ntgt: 5
  ref_id: "Reference"
  tgt_id: "Target"
  src_id: "Source"
  ploidy: 2
  is_phased: true
  seq_len: 50000
  mut_rate: 1.2e-8
  rec_rate: 1e-8
  nprocess: 2
  feature_config_file: "examples/configs/ArchIE.features.yaml"
  intro_prop: 0.7
  non_intro_prop: 0.3
  output_prefix: "example.lr"
  output_dir: "examples/results/data/training"
  seed: 12345
  nfeature: 100
  is_shuffled: false
  force_balanced: false
  keep_sim_data: false

preprocessing:
  process_type: "feature_vector"
  vcf_file: "examples/data/example.vcf"
  chr_name: "1"
  ref_ind_file: "examples/data/ref.ind.list"
  tgt_ind_file: "examples/data/tgt.ind.list"
  win_len: 50000
  win_step: 50000
  is_phased: true
  feature_config_file: "examples/configs/ArchIE.features.yaml"
  output_dir: "examples/results/data/infer"
  output_prefix: "example.lr"
  nprocess: 2
  ploidy: 2

model:
  name: "logistic_regression"
  params:
    solver: "newton-cg"
    max_iter: 10000
    random_state: 12345

Genotype-matrix configuration

Use the genotype-matrix configuration with the unet++ model.

simulation:
  sim_type: "genotype_matrix"
  nrep: 100
  nref: 50
  ntgt: 50
  ref_id: "Reference"
  tgt_id: "Target"
  src_id: "Source"
  ploidy: 2
  is_phased: true
  seq_len: 50000
  mut_rate: 1.2e-8
  rec_rate: 1e-8
  nprocess: 2
  output_prefix: "example.unet"
  output_dir: "examples/results/data/training"
  seed: 12345
  force_balanced: false
  keep_sim_data: false
  num_polymorphisms: 192
  num_genotype_matrices: 100

preprocessing:
  process_type: "genotype_matrix"
  vcf_file: "examples/data/example.vcf"
  chr_name: "1"
  ref_ind_file: "examples/data/ref.ind.list"
  tgt_ind_file: "examples/data/tgt.ind.list"
  anc_allele_file: null
  num_polymorphisms: 192
  step_size: 56
  output_dir: "examples/results/data/infer"
  output_prefix: "example.unet"
  ploidy: 2
  is_phased: true
  nprocess: 2

model:
  name: "unet++"
  params:
    add_rnn: false
    learning_rate: 0.001
    batch_size: 32
    n_early: 10
    n_epochs: 20
    min_delta: 1e-4
    val_prop: 0.05
    seed: 12345

simulation

Field           Description
sim_type        Simulation type. Use feature_vector or genotype_matrix.
nrep            Number of simulation replicates per batch.
nref            Number of reference individuals.
ntgt            Number of target individuals.
ref_id          Reference population ID in the DEMES model.
tgt_id          Target population ID in the DEMES model.
src_id          Source population ID in the DEMES model.
ploidy          Sample ploidy.
is_phased       Whether the simulated data are phased.
seq_len         Simulated sequence length in base pairs.
mut_rate        Mutation rate per base pair per generation.
rec_rate        Recombination rate per base pair per generation.
nprocess        Number of processes. Default: 1.
output_prefix   Prefix of output files.
output_dir      Output directory.
seed            Random seed.
force_balanced  Enforce balanced introgressed and non-introgressed classes. Default: false.
keep_sim_data   Keep raw simulation files. Default: false.

Feature-vector only:

Field                Description
feature_config_file  Path to the feature configuration file.
intro_prop           Minimum introgressed proportion for labeling a window as introgressed.
non_intro_prop       Maximum introgressed proportion for labeling a window as non-introgressed.
nfeature             Number of feature vectors to generate.
is_shuffled          Shuffle feature rows before output. Default: true.
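The two thresholds intro_prop and non_intro_prop can be read as a three-way labeling rule. The sketch below is an illustration under assumptions, not GAISHI's implementation: `label_window` is a hypothetical function, the comparisons are assumed to be inclusive, and windows whose introgressed proportion falls between the two thresholds are assumed to receive no label.

```python
def label_window(intro_fraction, intro_prop=0.7, non_intro_prop=0.3):
    """Label one window from the fraction of it that is introgressed.

    Returns 1 (introgressed), 0 (non-introgressed), or None when the
    fraction falls strictly between the two thresholds (assumed unlabeled).
    """
    if not 0.0 <= non_intro_prop <= intro_prop <= 1.0:
        raise ValueError("expected 0 <= non_intro_prop <= intro_prop <= 1")
    if intro_fraction >= intro_prop:  # inclusive bound is an assumption
        return 1
    if intro_fraction <= non_intro_prop:  # inclusive bound is an assumption
        return 0
    return None


for fraction in (0.9, 0.5, 0.1):
    print(fraction, label_window(fraction))
```

With the example values 0.7 and 0.3, a window that is 90% introgressed is labeled 1, a window that is 10% introgressed is labeled 0, and a window that is 50% introgressed gets no label in this sketch.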

Genotype-matrix only:

Field                  Description
num_polymorphisms      Number of polymorphic sites per genotype matrix.
num_genotype_matrices  Number of genotype matrices to generate.

preprocessing

Field            Description
process_type     Preprocessing type. Use feature_vector or genotype_matrix.
vcf_file         Path to the input VCF/BCF file.
chr_name         Chromosome name to process.
ref_ind_file     File containing reference individual IDs.
tgt_ind_file     File containing target individual IDs.
output_dir       Output directory.
output_prefix    Prefix of output files.
nprocess         Number of processes. Default: 1.
ploidy           Sample ploidy. Default: 2.
is_phased        Whether the VCF genotypes are phased. Default: true.
anc_allele_file  Optional ancestral allele file. Default: null.

Feature-vector only:

Field                Description
win_len              Window length in base pairs.
win_step             Window step size in base pairs.
feature_config_file  Path to the feature configuration file.
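How win_len and win_step interact can be illustrated with a sliding-window sketch. This is not GAISHI's code: `window_bounds` is a hypothetical function, and the coordinate convention (half-open intervals) and the clipping of the last window at the region end are assumptions made only for this example.

```python
def window_bounds(start, end, win_len, win_step):
    """Yield (window_start, window_end) pairs over the region [start, end).

    Each window starts win_step base pairs after the previous one and spans
    win_len base pairs; the final window is clipped at `end` (an assumption).
    """
    if win_len <= 0 or win_step <= 0:
        raise ValueError("win_len and win_step must be positive")
    pos = start
    while pos < end:
        yield pos, min(pos + win_len, end)
        pos += win_step


# win_step == win_len gives adjacent, non-overlapping windows.
print(list(window_bounds(0, 150000, win_len=50000, win_step=50000)))

# win_step < win_len gives overlapping windows.
print(list(window_bounds(0, 100000, win_len=50000, win_step=25000)))
```

In the example configuration both values are 50000, so consecutive windows do not overlap. A win_step smaller than win_len produces overlapping windows.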

Genotype-matrix only:

Field              Description
num_polymorphisms  Number of polymorphic sites per genotype matrix.
step_size          Step size between genotype matrices.
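One way to picture num_polymorphisms and step_size together is as a window sliding over the polymorphic sites of the chromosome. The sketch below assumes that step_size is counted in polymorphic sites and that only full matrices are kept; neither is stated above, and `matrix_slices` is a hypothetical function rather than a GAISHI API.

```python
def matrix_slices(n_sites, num_polymorphisms, step_size):
    """Return (first, last_exclusive) site-index ranges, one per matrix.

    Assumes step_size is measured in polymorphic sites and that a trailing
    partial matrix is dropped.
    """
    if num_polymorphisms <= 0 or step_size <= 0:
        raise ValueError("num_polymorphisms and step_size must be positive")
    last_start = n_sites - num_polymorphisms
    return [(i, i + num_polymorphisms) for i in range(0, last_start + 1, step_size)]


# Values from the example configuration, applied to 400 polymorphic sites.
for first, last in matrix_slices(400, num_polymorphisms=192, step_size=56):
    print(first, last)
```

Under these assumptions, the example values 192 and 56 make consecutive matrices share 136 sites.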

model

Field   Description
name    Model name. Use logistic_regression, extra_trees_classifier, or unet++.
params  Model-specific parameters.

For logistic_regression, parameters are passed to sklearn.linear_model.LogisticRegression.

For extra_trees_classifier, parameters are passed to sklearn.ensemble.ExtraTreesClassifier.
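Passing params through to scikit-learn can be sketched as keyword-argument unpacking. The following is a minimal illustration rather than GAISHI's actual code: `MODEL_CLASSES`, `resolve_class`, and `build_model` are hypothetical names, and scikit-learn is imported only when a model is built.

```python
import importlib

# Hypothetical registry: model name -> dotted path of the scikit-learn class
# named in the text above.
MODEL_CLASSES = {
    "logistic_regression": "sklearn.linear_model.LogisticRegression",
    "extra_trees_classifier": "sklearn.ensemble.ExtraTreesClassifier",
}


def resolve_class(dotted_path):
    """Import and return the class named by a dotted path."""
    module_name, _, class_name = dotted_path.rpartition(".")
    return getattr(importlib.import_module(module_name), class_name)


def build_model(model_cfg, resolver=resolve_class):
    """Instantiate the estimator described by a `model` block.

    Every key under `params` is forwarded unchanged as a keyword argument,
    so an unsupported key fails inside the estimator's constructor.
    """
    name = model_cfg["name"]
    if name not in MODEL_CLASSES:
        raise ValueError(f"no scikit-learn class registered for {name!r}")
    params = model_cfg.get("params") or {}
    return resolver(MODEL_CLASSES[name])(**params)
```

With scikit-learn installed, `build_model({"name": "logistic_regression", "params": {"solver": "newton-cg", "max_iter": 10000, "random_state": 12345}})` would be equivalent to calling `LogisticRegression(solver="newton-cg", max_iter=10000, random_state=12345)`. Any constructor argument documented by scikit-learn can therefore appear under params.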

For unet++, commonly used parameters include:

Field          Description
add_rnn        Use the RNN/4-channel UNet++ variant. Default: false.
batch_size     Batch size. Default: 32.
n_early        Early stopping patience. Default: 10.
n_epochs       Maximum number of epochs. Default: 100.
learning_rate  Learning rate. Default: 0.001.
min_delta      Minimum validation-loss improvement. Default: 1e-4.
val_prop       Validation proportion. Default: 0.05.
seed           Random seed. Default: null.
device         Device, such as cpu or cuda:0. Default: null.
num_workers    Number of PyTorch DataLoader workers. Default: 0.
use_amp        Use CUDA AMP. Default: false.
recent_window  Window size for recent training-log metrics. Default: 500.
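The defaults listed above mean that a params block only needs to contain the values that differ from them. The sketch below shows one way such a merge could work; `UNET_DEFAULTS` and `resolve_unet_params` are hypothetical names, the default values are copied from the table, and how GAISHI itself applies defaults is an assumption.

```python
# Defaults copied from the table above (YAML null is None in Python).
UNET_DEFAULTS = {
    "add_rnn": False,
    "batch_size": 32,
    "n_early": 10,
    "n_epochs": 100,
    "learning_rate": 0.001,
    "min_delta": 1e-4,
    "val_prop": 0.05,
    "seed": None,
    "device": None,
    "num_workers": 0,
    "use_amp": False,
    "recent_window": 500,
}


def resolve_unet_params(user_params):
    """Overlay user-supplied params on the documented defaults."""
    return {**UNET_DEFAULTS, **(user_params or {})}


# Only the values that differ from the defaults need to be given.
params = resolve_unet_params({"n_epochs": 20, "seed": 12345})
print(params["n_epochs"], params["seed"], params["batch_size"])
```

In the genotype-matrix example configuration, n_epochs: 20 and seed: 12345 are the only entries that differ from the defaults; the remaining entries restate default values.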