Unveiling Sorghum's Genetic Secrets: A Comprehensive Methodological Approach
This document outlines the rigorous methodologies employed in the study of sorghum genetics, from material preparation to advanced genomic and phenotyping analyses.
Plant Material and Nucleic Acid Extraction
Young sorghum seedlings underwent a 24–30 hour dark treatment to optimize DNA yield. Tissue was then flash-frozen in liquid nitrogen and stored at –80 °C.
DNA was meticulously extracted using a modified CTAB protocol, followed by chloroform-isoamyl alcohol purification, isopropanol precipitation, and RNAse treatment.
DNA purity, concentration, and size were subsequently validated. For RNA analysis, samples were collected from a diverse array of reference genotypes, spanning key developmental stages and various organ types—including stem, leaf, root, inflorescences, grain, and seedlings. These samples were obtained under both well-watered and water-stressed conditions, then flash-frozen for RNA extraction.
Genome Assembly
The project leveraged long-read sequencing technologies, specifically PACBIO CLR and HiFi, alongside various assemblers such as CANU, MECAT, and HiFiAsm+HIC, for pangenome reference assemblies.
Assemblies were further refined through polishing with ARROW or RACON. Misjoins were systematically identified using syntenic markers and annotated genes. Subsequently, contigs were ordered, oriented, and assembled into ten chromosomes, with 10,000 Ns padding each chromosome join. Telomeric sequences were properly oriented, and contaminant scaffolds were meticulously removed. Furthermore, chloroplast and mitochondrial genomes were generated, and homozygous SNPs and indels were corrected using Illumina fragment reads.
Genome Annotation
Genome annotation was executed in two comprehensive rounds of gene prediction. The initial round independently predicted gene models for each genome. These models were then propagated across the pangenome in the second round.
Transcript assemblies were generated from Illumina RNA sequencing and PacBio Iso-Seq reads, utilizing tools like PERTRAN, GSNAP, and PASA.
A species-specific repeat library was constructed, and genomes were soft-masked to facilitate accurate gene identification. Gene loci were identified through a combination of transcript assembly and protein alignments from various plant species. Gene models were predicted using FGENESH+, AUGUSTUS, and EXONERATE, then refined by PASA. The second annotation round employed hard-masked genomes to project high-confidence models from other pangenome members.
Pangenome Graph Construction and Exploration
A pangenome graph was constructed for each chromosome using Minigraph-Cactus, with the BTx623 V5 genome serving as the primary reference. Clipped chromosome graphs were merged and visualized using vg and sequenceTubeMap. ODGI was employed to inspect genome representation. Specific loci, such as the dhurrin BGC and SH1, underwent further processing to reduce allelic complexity and retain larger variants.
Comparative Genomics
Gene families were calculated with OrthoFinder and subsequently parsed with GENESPACE. Synteny maps were generated using DEEPSPACE. Pairwise variant detection from whole-genome alignments was performed using SyRI and minimap2. Pangenome graph saturation curves were calculated using Panacus to assess the completeness of the pangenome.
DNA Re-sequencing and Variant Detection
A total of 2,145 diversity samples were re-sequenced using Illumina HiSeq X10 and NovaSeq 6000 platforms, with reads downsampled to a maximum of 50x coverage.
SNPs and short indels were called by aligning Illumina reads to the BTx623 V5 reference using bwa mem.
This process was followed by filtering duplicates, realigning around indels, and multisample SNP calling with SAMtools mpileup and Varscan. Variants located within 25 bp of a 24-mer repeat were removed, and only SNPs with minor allele frequencies (MAFs) greater than 0.001 were retained for the majority of analyses. Phasing of variants was performed using SHAPEIT.
Pangenome-based Genotyping
K-mer based genotyping was specifically applied to the SH1 and dhurrin BGC loci to project diversity patterns within these regions. The methodology involved extracting ancestry-informative 80-mers that were globally single-copy within individual references and absent in at least one pangenome member. The established pipeline converted reference genomes to k-mers, flagged multicopy k-mers, combined hashes, isolated k-mers specific to regions of interest (ROI), and genotyped Illumina libraries by counting k-mer frequencies. Clustering was performed using de novo methods for the dhurrin BGC locus and a priori bins for the SH1 locus.
Diversity Panel Source Material and Passport Data
The sorghum diversity panel comprised 1,984 unique accessions from structured panels and curated regional collections, representing broad genetic, phenotypic, and geographic variation across Africa, Asia, and the Americas.
Passport data, encompassing botanical classification, breeding status, country of origin, and geographical coordinates, were compiled from multiple sources.
These sources included GRIN-Global, Genesis Global, ICRISAT, and published studies, with careful cross-referencing to resolve any discrepancies.
Population Genetics
ADMIXTURE was applied to 1,984 genotypes, which were filtered for MAF < 0.05, missingness, and subsequently LD-pruned. This analysis estimated an optimal K of approximately 10 ancestral groups. For African landraces, effective migration surfaces were estimated using FEEMS. Wavelet transformations were utilized to model allele frequency change at 15 spatial scales, aiming to identify outlier loci potentially important in adaptation. Putative selective sweeps were identified using normalized xpnsl, comparing individuals from high-drought prevalence sites to those from low-drought prevalence sites across geographical regions.
Assignment of Landraces to Climate-based Clusters
The Cycles agroecosystem model was employed to characterize stage-specific water stress variables for 326 sorghum landrace accessions, simulating plant growth using historical weather data and soil parameters. Accessions were then clustered into low-drought and high-drought prevalence groups using k-means clustering. An artificial neural network (ANN) was subsequently used to predict cluster membership for additional landraces based on bioclimatic predictors from the WorldClim 2.1 dataset.
Dhurrin Phenotyping
Dhurrin concentration and hydrogen cyanide potential (HCNp) were evaluated in diverse sorghum germplasm under controlled conditions. HCNp was measured at both seedling and early vegetative stages using Cyantesmo test strips and image analysis. Dhurrin concentration was accurately quantified using ultrahigh-performance liquid chromatography coupled with tandem mass spectrometry (UHPLC–MS/MS) on dried leaf samples, with quantification based on robust standard curves.
Whole Plant Phenotyping
Field-based measurements were collected across multiple locations in 2023 and 2024 to capture key agronomic traits, including plant height, flowering time, panicle architecture, and yield components. Drone-based imagery provided high-resolution temporal vegetation indices, offering a dynamic view of plant growth. Controlled-environment phenotyping quantified water use efficiency and drought response, enabling the classification of genotypes by their drought response strategies. Greenhouse-based evaluations provided additional trait measurements under standardized conditions.
Environmental Gradients and Genomic Variation
Redundancy analysis (RDA) was used to examine how various environmental gradients explained genome-wide SNP variation among 443 sorghum landraces. These gradients included climatic factors, elevation, water balance, radiation, soil physicochemical properties, and striga risk. The model indicated that environmental variables jointly explained approximately 19.5% of genome-wide SNP variation.
Dhurrin and Cyanide Genome-Wide Association
Genome-Wide Association Studies (GWAS) were performed using a linear mixed model implemented in GEMMA. Variants were filtered to retain those with a minor allele frequency (MAF) of 0.05 or greater. Association statistics were computed using the Wald test, and significance was determined by FDR-correcting raw P values through the Benjamini–Hochberg method.
In silico Ligand Docking of Tyrosine to CYP79A1 Protein Variants
In silico ligand docking was performed using SwissDock to evaluate the impact of sequence variants on the binding affinity of CYP79A1 for its substrate, tyrosine. Homology models of wild-type and mutant CYP79A1 proteins were generated and oriented within the endoplasmic reticulum membrane. The binding affinity of tyrosine to each CYP79A1 variant was compared by identifying the lowest predicted binding free energy (ΔG).
Dhurrin BGC cis-element analysis
Intergenic variants associated with seedling dhurrin content in the dhurrin BGC were meticulously examined using the Plant PAN 4.0 and SorghumBase resources. Genomic coordinates were assessed for overlap with known accessible chromatin regions derived from ATAC–seq profiling. Putative transcription factor (TF) binding sites overlapping variant positions were identified using PlantPAN to investigate potential mutation of functional cis-regulatory elements.
Statistics and Reproducibility
Sequencing and genotyping of all Illumina libraries were performed once. In cases of duplicated genotypes, the library with the higher depth of uniquely mapping reads was retained for analysis. Phenotyping campaigns were spatially and temporally replicated as described in the supplementary material, ensuring robust and reproducible data collection.