Comprehensive Methods for PARM Model Development and Validation
This document details the methodologies employed in a study leveraging the PARM deep learning model for predicting promoter activity. It encompasses the selection of Transcription Start Sites (TSSs), construction of MPRA libraries, cell and organoid culture, various experimental treatments, sequencing, data processing, and advanced bioinformatics analyses.
TSS Selection
Transcription Start Sites (TSSs) were meticulously selected using GENCODE definitions and their observed activity in at least one cell type or tissue within the FANTOM5 database. This stringent process culminated in a curated set of 30,607 TSSs.
Genome-wide MPRA Dataset
Initial PARM models for HepG2 and K562 cells were trained on publicly available genome-wide MPRA data (GEO accession GSE128325). The focus was on DNA fragments overlapping a -300 bp to +100 bp window relative to the selected TSSs.
MPRA Library Construction
Focused Library
The construction of the focused library involved several precise steps:
- DNA was extracted from the human cell line HG02601, then fragmented to a size range of 200-400 bp, followed by end-repair and A-tailing.
- Custom dsDNA adapters with T-overhangs were ligated, containing overlaps specifically designed for Gibson assembly into a linear barcoded p101 vector.
- A preliminary PCR amplification was performed for eight cycles.
- A high-stringency hybridization capture library (Twist), comprising 127,575 probes, was utilized to capture promoter regions (from -300 to +100 bp) corresponding to the 30,607 selected TSSs.
- To prevent non-specific binding of probes, custom blockers were designed and implemented.
- Fragments were hybridized, enriched through streptavidin binding, and subsequently amplified via PCR for nine cycles.
- The captured fragments were purified and then cloned into the barcoded p101 vector using Gibson assembly.
- The resulting assembly mix was transformed into MegaX DH10B ultracompetent bacteria, from which the plasmid library was then isolated.
Oligonucleotide-based Libraries
Three distinct synthetic libraries—synthetic promoters, ISM (In silico Mutagenesis), and motif insertions—were created using custom synthetic oligonucleotides from Twist.
- These oligonucleotides underwent amplification, bead-purification, end-repair, and digestion with EcoRI-HF and NheI-HF.
- The processed fragments were then ligated into a modified p101 vector and transformed into bacteria following the same procedure as for the focused library.
Cell Culture and Transfection
Various human cell lines, including K562, U2OS, HCT116, LNCaP, HepG2, HEK293, MCF7, AGS, and HAP1, obtained from ATCC, were cultured in their respective specific media. These media were supplemented with FBS (Fetal Bovine Serum) and penicillin-streptomycin. Regular testing for mycoplasma contamination was performed.
Transfections were conducted as previously described. Specific nucleofector programs were utilized, while Lipofectamine or Xfect reagents were employed for certain cell lines. For focused libraries, 10 million cells (or 50 million for K562) were used per replicate, whereas 5 million cells were used for oligonucleotide-based libraries.
Organoid Culture and Electroporation
CRC organoids, derived from surgical resections, were cultured in a specific medium and authenticated via SNP array. Routine testing for mycoplasma was carried out.
- For electroporation, organoids were dissociated into single cells, then resuspended in Opti-MEM containing the focused library.
- Electroporation was performed using a NEPA21 system with defined settings.
- Following electroporation, cells were incubated and subsequently resuspended in culture medium supplemented with BME.
Conditions and Treatments
Specific experimental conditions and treatments were applied:
- Heat-shock: Cells were first incubated at 37 °C for 24 hours, then transferred to 43 °C for 3 hours before RNA isolation.
- Chemical perturbations: Nutlin-3a (8 µM) and PMA (1 µM) were added immediately post-transfection, and cells were collected 24 hours later.
Library Sequencing
Focused library characterization (barcode-to-fragment association) and input (pDNA) generation were achieved using iPCR methods. For plasmid library digestion, NcoI-HF was employed. Subsequent single-read sequencing was performed on Illumina NextSeq or NovaSeq 6000 platforms.
RNA Sample Processing
RNA was isolated using TRIzol reagent, followed by DNase I treatment to remove genomic DNA. It was then reverse transcribed into cDNA using a gene-specific primer. The resulting cDNA was PCR-amplified and bead-purified prior to sequencing.
Read Processing
Sequencing reads were processed to generate raw read counts per fragment for both cDNA (representing activity) and iPCR/pDNA (representing input). These raw values were converted to reads per million (RPM).
Fragment ratios were calculated by dividing cDNA RPM by iPCR RPM, with the 0.1 quantile of non-zero ratios added for normalization. Fragment activity was subsequently defined as the mean log2 of these normalized ratios across all biological replicates.
Model Input
The PARM model accepts genomic sequences of fragments as input. These sequences are represented as a one-hot-encoded DNA matrix of shape (600, 4). Shorter sequences were padded with 'N' to achieve the 600 bp length, using a random padding distribution. Adapter sequences from the plasmid backbone (CAGTGAT and ACGACTG for genome-wide, CAGTGAT and CACGACG for focused libraries) were included on both ends of the input sequences.
DL Model Architecture
PARM's architecture, an adaptation of Enformer, is a Convolutional Neural Network (CNN) incorporating self-attention pooling layers.
The architecture is structured as follows:
- An initial 1D convolution block (kernel size 7, filter size 125, batch normalization, GeLU activation, followed by another 1D convolution with kernel size 1).
- This initial block is succeeded by five repeated sequential layers, each consisting of:
- Batch normalization
- GeLU activation
- 1D convolution with a skip connection
- Batch normalization
- GeLU activation
- Another 1D convolution
- The architecture is completed by an attention pooling layer, global max pooling, and a dense layer with GeLU activation.
All models were implemented in PyTorch and individually optimized for each specific cell type.
Model Training and Evaluation
Promoters were systematically divided into six equally sized folds, ensuring no overlap of sequence fragments across these folds. One fold was reserved for the final assessment, while the remaining five were utilized in a cross-validation scheme (with four folds for training and one for hyperparameter tuning).
A Poisson negative log-likelihood or heteroscedastic loss function was used for model optimization.
The network was trained for 10 epochs with a batch size of 128. The AdamW optimizer was employed, coupled with a warm-up and cosine decay learning rate schedule. Training was exclusively conducted on an NVIDIA Quadro RTX 6000 GPU.
Generating Active Synthetic Promoters
A genetic algorithm, implemented within the DEAP-ER framework, was utilized to generate synthetic human promoters.
- The population size was set to 200, consisting of randomly generated 232 bp sequences.
- Evolution proceeded for 300 generations, with a mutation probability of 0.8 and a two-point crossover probability of 0.5.
- Tournament selection was used, and the PARM's K562 model prediction served as the evaluation operator for promoter activity.
Synthetic Promoter Library
To experimentally validate the genetic algorithm, a comprehensive synthetic promoter library was constructed, including:
- 390 fragments total: Comprising the least, median, and maximum predicted active fragments from 13 different generations across 10 independent genetic algorithm runs.
- Synthetic promoter (strongest): The single most active predicted fragment from each of the 10 runs.
- Mutated versions: These were derived from the 10 strongest synthetic promoters, with modifications designed to reduce their predicted activity to 20%.
- Control sequences: Natural promoters spanning a range of activity levels, including the 15 most active in K562 cells.
Fragment activity was calculated as the mean log2[RNA/DNA] ratio across four associated barcodes.
ISM Validation Library
For the ISM validation library, fragment activity was determined as the mean of three barcodes. Promoter-cell line combinations were selected based on an autocorrelation cutoff of 0.5 for the effect of mutations, resulting in 30 reliable combinations.
Motif Insertion Library
Fragment activity in the motif insertion library was calculated as the mean of five barcodes, incorporating a pseudocount of 0.01. Fragments were required to link to at least four barcodes and exhibit a pDNA count greater than 100. This process yielded 19 promoters for each of the YY1, NFYA, NRF1, and SP1 motif insertions.
ZNF48 Protein Expression and Pull-down
A codon-optimized ZNF48 gene block, featuring a 10Ă—His-2Ă—StrepII-tag, was cloned into a pFastBac vector for expression in Sf9 insect cells using the Bac-to-Bac method.
- Cells were harvested, lysed, and debris was removed.
- Recombinant ZNF48 pull-downs utilized biotinylated synthetic oligonucleotides (either the wild-type TMEM sequence or a mutated counterpart), which were annealed to streptavidin sepharose beads.
- Sf9 cell lysate was incubated with the DNA-bound beads, washed thoroughly, and the eluted proteins were resolved by SDS-PAGE. Detection was performed via Western blotting using an anti-ZNF48 antibody.
Cell Lysate Preparation for Mass Spectrometry Analysis
Crude nuclear extracts were prepared through a sequence of differential centrifugation and dounce homogenization steps. Cells were washed, resuspended in buffer, lysed, and crude nuclei were collected. Nuclei were then washed again, resuspended in buffer C, and incubated. Following centrifugation, the soluble nuclear fraction was collected, aliquoted, and stored for subsequent analysis.
DNA Pull-downs and Mass Spectrometry Analysis
DNA oligonucleotides with 5'-biotinylation were annealed and incubated with streptavidin-sepharose beads. Nuclear extracts were then incubated with these DNA-bound beads, washed meticulously, and the bound proteins were eluted using a urea buffer.
- Proteins were reduced with DTT, alkylated with iodoacetamide, and digested with trypsin.
- Samples underwent purification using StageTips. Dimethyl labelling was performed on StageTips as previously described.
- Peptides were analyzed by LC-MS/MS on an Orbitrap Exploris 480 mass spectrometer interfaced with an Evosep One LC system, utilizing a pre-programmed extended method and specific DDA acquisition parameters.
Mass Spectrometry Data Analysis
Raw mass spectrometry spectra were processed with Maxquant software (v.2.4.9.0) and searched against the UniProt human proteome database. Identified proteins were filtered to exclude common contaminants. For dimethyl-labelled samples, quantification was performed using a modified 3plex method, leveraging both light and heavy dimethyl-labelled peptides. Only proteins quantified across all four channels were included in downstream analysis. Outlier statistics were applied to identify significant proteins, defined by one interquartile range for both forward and reverse experiments.
ISM Assay
The In silico Mutagenesis (ISM) assay involves predicting promoter activity for a reference sequence and for all possible single nucleotide mutations at a position of interest.
The importance score was calculated as the reference sequence score minus the average score of all possible mutations at that specific position.
ICGC Data
International Cancer Genome Consortium (ICGC) data was obtained from release 28. Specifically, whole-genome sequencing donors with data in chromosome 5 (1,295,152 to 1,295,462 bp) in the hg19 reference genome were considered.
Enformer Predictions
For comparisons with Enformer, variant effects were predicted by computing CAGE outputs, log-transforming these predictions, and then averaging across both strands, three neighboring bins around SNP positions, and available CAGE tracks for relevant cell lines. Missing cell lines were mapped to similar available Enformer cell lines.
Borzoi Predictions
Borzoi predictions followed an identical variant effect prediction procedure as Enformer, utilizing CAGE outputs and mapping missing cell lines to similar available ones.
Prediction of the Direction of Effect of Cis-acting eQTLs
Model accuracy was evaluated by comparing the predicted direction of effect with the sign of the measured beta of fine-mapped GTEx v8 cis-acting eQTLs.
Only fine-mapped single nucleotide variants (SNVs) with a posterior causal probability (PIP) ≥0.9 were included in the analysis.
Concordance in direction was calculated as the sum of cases where (variant effect * beta) > 0, divided by the total number of eQTLs. A baseline for comparison was established by randomly sampling from a uniform distribution.
Motif Scanning and Regulatory SNP (RS) Identification
Importance scores of the 30,607 promoters were scanned using Hocomoco (v.11) human database motifs with a stride of 1.
- Pearson's correlation coefficient (
r position) between motif information content (IC) and importance score was computed. - The mean of effect scores multiplied by IC (
IS position) was also calculated.
Hits were retained if r position was ≥|0.75| and IS position met a second cutoff that accounted for motif length and IC. This cutoff was derived from the Importance score x% (95th percentile for activators, 5th for repressors). A Regulatory SNP (RS) was defined as a motif satisfying both cutoffs in the ISM profile. For measured data, r position was calculated similarly, and the IS position cutoff from models was converted to a Z-score and applied.
RS Distribution Comparisons
The Kolmogorov–Smirnov test was employed to establish significant differences between the RS distribution and a generic motif distribution. A permutation test was leveraged to specifically assess for a shift in the mode between activating and repressive RS distributions, with modes determined after smoothing with a 20 bp sliding window.
Identifying Unknown Motifs
Regulatory SNPs (RS) that did not match known motifs were identified by scanning for patches of at least five nucleotides where importance scores were either equal to or higher than the 95th percentile for activators, or equal to or lower than the 5th percentile for repressors. Consecutive RS separated by a maximum of two nucleotides were treated as a single RS. The nucleotide importance of the full RS region was correlated with motifs in Hocomoco (v.11) or the Jaspar2024 CORE vertebrates non-redundant database. If no hit with R ≥ 0.75 was found, the motif was annotated as an 'unknown motif'.
Clustering Motifs to Obtain Motif Families
Pairwise similarity scores of position frequency matrices were computed using TOMTOM (v5.4.1), specifying a minimum overlap of 6 bp and using Pearson’s correlation. These scores were converted to E values by log transformation. Hierarchical clustering with complete linkage and Pearson’s correlation was then performed, and the resulting tree was cut at a height of 0.8 to define distinct motif families.
Matching a RS with its TF Expression
For further detailed analysis, only RSs that were associated with expressed Transcription Factors (TFs) were included. Expression status was determined using data from the Human Protein Atlas (v.23.0), where TFs with ln(TPM+1) ≥ 1 were considered to be expressed.
Consensus Motif
To construct a consensus motif within a cluster of unknown motifs, the motif exhibiting the lowest E value was chosen as the representative. Position frequency matrices of other motifs within the cluster were then added to this representative, carefully considering optimal offset and orientation. Each position was subsequently normalized by the number of motifs contributing to it, and positions contributed by less than 30% of motifs were removed.
Sequence Scanning
A naive scan of all promoter sequences (spanning from 250 bp upstream to 50 bp downstream of the TSS) was executed using FIMO (v5.4.1) with default arguments and the Hocomoco (v.11) human database.
RS Insertion Analysis
In silico insertion of YY1, SP1, NRF1, and NFYA motifs was performed across all promoters. For each transcription factor, three variants were inserted: the consensus sequence (shortened from Hocomoco v.11) and two mutated negative controls. Insertions were made at 15 distinct positions, ranging from 160 bp upstream to 63 bp downstream of the TSS, in 14-nucleotide steps.
Endogenous Promoter Activity
An estimate of endogenous promoter activity in K562 cells was obtained by aggregating data from GRO-cap, csRNA, NetCage, and Stripe-Seq datasets.
- Raw fastq files were trimmed and aligned first to human rDNA, and then to the hg19 reference genome.
- Bigwig files were generated, overlaid with the 30,607 TSSs, and principal component analysis was subsequently used to combine these measures into a single, comprehensive activity score.
Measured Differential Promoter Activity Analysis
To analyze differential promoter activity in measured data across various conditions, a LOESS correction was applied to the promoter activity values. A Wilcoxon signed-rank test was then used to evaluate the effects of different conditions, and the raw P values were adjusted using FDR (False Discovery Rate) correction.
Promoters with adjusted P values < 0.05 and a log2[fold change] > 0.5 were classified as upregulated. Conversely, those with a log2[fold change] < -0.5 were considered downregulated.