PeptideAtlas: A Massive Repository of Human Proteomic and Immunopeptidomic Data
The PeptideAtlas project has assembled non-HLA proteomic and HLA immunopeptidomic data from public datasets into two builds that together encompass over 3.7 billion MS/MS spectra.
Two Major Builds
Non-HLA Build (2023-06)
- Scale: 295 datasets, 1,172 experiments, and 3.5 billion MS/MS spectra
- Search Engine: MSFragger v3.7 with semi-enzymatic settings
- Database: THISP level 4 (2023-02), including 7,264 Ribo-seq ORFs
- Variable Modifications: Methionine oxidation, protein N-terminal acetylation, peptide N-terminal pyro-glutamic acid, Asn and Gln deamidation
- Fixed Modification: Alkylation (typically carbamidomethylated cysteine)
- Validation: Trans-Proteomic Pipeline (TPP) v7.0 with PeptideProphet, iProphet, PTMProphet. Mapping to human proteome via ProteoMapper and to genome via ENSEMBL
- FDR Estimation: Target–decoy entrapment with scrambled decoy sequences at 1:1 ratio
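The idea behind a target-decoy FDR estimate can be illustrated with a minimal sketch. With decoys at a 1:1 ratio, the number of decoy hits passing a score threshold approximates the number of false target hits. The function name and input layout are hypothetical, and this is the generic counting estimate, not the model-based computation performed by the TPP tools.

```python
def estimate_fdr(psms, threshold):
    """Estimate the FDR among target PSMs scoring at or above `threshold`.

    `psms` is a list of (score, is_decoy) pairs. Assumes a 1:1
    target:decoy database, so decoy hits approximate false target hits.
    """
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return decoys / targets
```

Raising the threshold until the estimate falls below a chosen level is the usual way such an estimate is turned into a score cutoff.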
HLA Build (2023-11)
- Scale: 118 datasets, 592 experiments, 240 million MS/MS spectra from 9,776 runs
- Search Engine: MSFragger v3.7 with no-enzyme mode
- Database: THISP level 4 (2023-07), including 7,264 Ribo-seq ORFs and 299 common contaminants
- Variable Modifications: Methionine oxidation, cysteine cysteinylation, protein N-terminal acetylation, peptide N-terminal pyro-glutamic acid, Asn and Gln deamidation
- Validation: Same as non-HLA build
Protein Identification and Categorization
Peptides are mapped with ProteoMapper to neXtProt (2023 version), which covers 20,389 core proteome entries and their isoforms.
Canonical proteins are defined by ≥2 uniquely mapping non-nested peptides of length ≥9, covering at least 18 amino acids.
Protein Categories: canonical, non-core canonical, indistinguishable representative, indistinguishable, representative, marginally distinguished, subsumed, weak, insufficient evidence, identical, not detected
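One way to read the canonical criteria above is sketched below. It assumes each uniquely mapping peptide is given as a half-open (start, end) residue interval on the protein, treats a peptide as nested when it lies entirely inside another, and interprets the 18-residue requirement as the total number of distinct residues covered. These interpretations and the function name are assumptions, not the PeptideAtlas implementation.

```python
def is_canonical(peptides, min_len=9, min_coverage=18):
    """Check the canonical criteria for one protein.

    `peptides` is a list of (start, end) half-open intervals for
    uniquely mapping peptides (hypothetical input layout).
    """
    long_enough = [(s, e) for s, e in set(peptides) if e - s >= min_len]
    # Discard peptides fully contained in another, distinct peptide.
    non_nested = [
        (s, e) for s, e in long_enough
        if not any(s2 <= s and e <= e2 and (s2, e2) != (s, e)
                   for s2, e2 in long_enough)
    ]
    if len(non_nested) < 2:
        return False
    covered = set()
    for s, e in non_nested:
        covered.update(range(s, e))
    return len(covered) >= min_coverage
```

Under this reading, two non-overlapping 9-mers qualify, while two 9-mers that overlap by five residues cover only 13 residues and do not.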
Manual Validation of PSMs
Manually inspected ncORF peptide PSMs are assigned to one of five classes: excellent, good, false positive, close but false positive, low information
Criteria for 'excellent': nearly complete b and y ion series, no prominent unannotated peaks, minimal mass modifications, no gaps with plausible misassignment
HLA Peptide Categorization
- 865,922 peptides filtered (homopolymer starting peptides removed)
- Length ≥8, mapping to ≤30 UniProtKB/Swiss-Prot entries: considered canonical
- Length ≥8, mapping to ncORFs (not canonical) with ≤10 distinct mappings: considered ncORF-derived
- Remaining peptides: 'other' category
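The three-way rule above can be sketched as a small function. It assumes the mapping counts are already available per peptide and that "not canonical" means zero UniProtKB/Swiss-Prot mappings; the function and parameter names are hypothetical.

```python
def categorize_hla_peptide(sequence, n_swissprot, n_ncorf):
    """Assign an HLA peptide to 'canonical', 'ncORF-derived' or 'other'.

    n_swissprot: number of UniProtKB/Swiss-Prot entries the peptide maps to.
    n_ncorf: number of distinct ncORF mappings.
    """
    long_enough = len(sequence) >= 8
    if long_enough and 1 <= n_swissprot <= 30:
        return "canonical"
    # Assumption: ncORF-derived peptides have no Swiss-Prot mapping at all.
    if long_enough and n_swissprot == 0 and 1 <= n_ncorf <= 10:
        return "ncORF-derived"
    return "other"
```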
ncORF Expression in Cancer Tissues
Each ncORF peptide is categorized by the source of its MS runs (cancer vs non-cancer). ncORFs are classified as cancer-related if they originate from genes listed in the Cancer Gene Census.
Enriched Ubiquitination Datasets
Eleven public datasets were reanalysed with Comet (semi-tryptic, missed cleavages=4, variable modifications: Gly-Gly ubiquitination, Met oxidation, N-terminal acetylation), followed by TPP postprocessing and FDR estimation via a decoy database.
HLA Binding Predictions
NetMHCpan v4.1 for MS runs with known four-digit HLA typing. Peptides with rank score ≤2 considered binders.
Detectability Determinants
Canonical proteins and ncORFs categorized as detected or undetected. Significance determined by two-sided Wilcoxon rank-sum test.
Tier Classification System
- Tier 1A: Two non-nested peptides in MS proteome + Ribo-seq
- Tier 1B: Two non-nested peptides in HLA immunopeptidomics + Ribo-seq
- Tier 2A: One peptide in MS proteome + Ribo-seq
- Tier 2B: One peptide in HLA immunopeptidomics + Ribo-seq
- Tier 3: Any immunopeptidomics or tryptic proteome evidence without Ribo-seq
- Tier 4: Ribo-seq evidence without proteomics
- Tier 5: In silico prediction without Ribo-seq or proteomics
Machine Learning Models
MLP Classifier
- Dataset: 677 ncORF 9-mer peptides, 22 attributes
- Architecture: Hidden layer size 280, tanh activation, alpha=0.01
- Training: 80% of data, hyperparameter tuning via grid search
TensorFlow Keras Model
- Dataset: 7,264 ncORFs (1,785 detected, 5,479 undetected)
- Architecture: 16-neuron input, ReLU, L2 regularization, batch normalization, dropout, sigmoid output
- Training: 60 epochs, batch size 12, balanced class weights
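The source states only that balanced class weights were used. A common convention, assumed here, is the scikit-learn "balanced" heuristic, which weights each class by n_samples / (n_classes * class_count). The sketch below applies it to the detected and undetected counts given above; the function name is hypothetical.

```python
def balanced_class_weights(counts):
    """Weight each class inversely to its frequency.

    Assumption: weight = n_samples / (n_classes * class_count),
    the scikit-learn 'balanced' convention.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


weights = balanced_class_weights({"detected": 1785, "undetected": 5479})
```

Under this assumption the minority class (detected) receives a weight of roughly 2.03 and the majority class roughly 0.66, so both classes contribute equally to the loss in aggregate.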
Evolutionary Conservation and Constraint (ORBL)
- ORBLv: Phylogenetic branch length fraction for conserved species
- ORBLq: Empirical P value against matched untranslated ORFs of same biotype
- Biotype determination: GENCODE v42 with strict 'pure' criteria; mixed biotypes excluded
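An empirical P value of the kind described for ORBLq can be sketched as the fraction of matched background ORFs scoring at least as high as the observed ORF. The function name and the add-one correction are assumptions; the source does not state the exact estimator.

```python
def empirical_p_value(observed, background):
    """Fraction of background scores >= observed, with add-one correction.

    `background` holds scores of matched untranslated ORFs of the same
    biotype (hypothetical input). The +1 terms keep P strictly above zero.
    """
    at_least = sum(1 for score in background if score >= observed)
    return (1 + at_least) / (1 + len(background))
```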
CRISPR Screening and Functional Genomics
The CRISPR screen targeted 2,196 ncORFs from GENCODE Phase I and 1,245 from a previous library.
Library Design: 27,464 sgRNAs, designed with CRISPick (modified settings). Selection criteria: max 3 ncORFs per gene, min size 12 aa, ≥4 predicted targeting gRNAs.
Controls: Pan-essential, non-targeting, and cutting controls included.
Experimental Workflow:
- Lentiviral infection in triplicate, 500 cells/gRNA representation
- Puromycin selection
- Genomic DNA extraction at initial and day 14 timepoints
- Sequencing, normalization to RPM, log2 fold change relative to initial
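The last workflow step, normalization to reads per million (RPM) followed by a log2 fold change against the initial timepoint, can be sketched as follows. The pseudocount of 1 is an assumption added to avoid taking the logarithm of zero, and the function names are hypothetical.

```python
import math


def reads_per_million(counts):
    """Scale raw sgRNA read counts so that they sum to one million."""
    total = sum(counts.values())
    return {guide: c * 1_000_000 / total for guide, c in counts.items()}


def log2_fold_change(initial_counts, final_counts, pseudocount=1.0):
    """log2 of final RPM over initial RPM for each sgRNA.

    Assumption: a pseudocount is added to both RPM values.
    """
    rpm_initial = reads_per_million(initial_counts)
    rpm_final = reads_per_million(final_counts)
    return {
        guide: math.log2((rpm_final[guide] + pseudocount)
                         / (rpm_initial[guide] + pseudocount))
        for guide in rpm_initial
    }
```

A guide whose share of the library halves between the initial and final timepoints gets a log2 fold change close to -1.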
CRISPR Data Analysis:
- sgRNA mapping with Bowtie2 (stringent) to GRCh38
- Chronos v2.0.8 with CNV data
- Loss-of-function hits: Chronos score < -0.5 or > 0.5
- Cross-validation with RNA-seq/Ribo-seq (TPM thresholds: ≥10 RNA-seq, ≥5 Ribo-seq)
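The two thresholds in the list above can be expressed as simple predicates. Whether the expression check requires both the RNA-seq and the Ribo-seq threshold to be met is not stated in the source; the sketch assumes both, and the function names are hypothetical.

```python
def is_hit(chronos_score, cutoff=0.5):
    """Flag a score beyond the cutoff in either direction, as stated above."""
    return chronos_score < -cutoff or chronos_score > cutoff


def is_expressed(rna_tpm, ribo_tpm, min_rna=10.0, min_ribo=5.0):
    """Assumption: both TPM thresholds must be met for cross-validation."""
    return rna_tpm >= min_rna and ribo_tpm >= min_ribo
```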
Pooled c10riboseqorf92 Knockout
A pooled screen across 486 barcoded cancer cell lines tested two targeting gRNAs against c10riboseqorf92.
- Controls: Non-cutting LacZ control, cutting control Chr2-2
- Cell line abundance determined by barcode RNA-seq
- log2FC at day 15 vs input: Linear regression identified outliers (residuals >2 s.d.)
- Spearman correlation with DepMap gene dependency profiles
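The residual-based outlier call in the list above can be sketched with an ordinary least-squares fit. Which variables were regressed against each other is not stated in the source; the sketch assumes one x value and one y value per cell line (for example, control and targeting log2FC), and the function name is hypothetical.

```python
import statistics


def regression_outliers(x, y, n_sd=2.0):
    """Return indices whose residual from the fit y = a + b*x exceeds
    n_sd sample standard deviations of the residuals in absolute value."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sd = statistics.stdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > n_sd * sd]
```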
siRNA and Overexpression Experiments
siRNAs targeting OLMALINC tested in A375 and A549 cells. Knockdown efficiency measured by qPCR (GAPDH, ACTB controls). Proliferation monitored by Incucyte, confluency calculated, AUC compared statistically.
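The area under a confluency curve can be approximated with the trapezoidal rule, sketched below. The source does not say how the AUC was computed, so the method, the function name, and the input layout (imaging times with matching confluency values) are assumptions.

```python
def trapezoid_auc(times, values):
    """Trapezoidal area under a curve sampled at `times`."""
    area = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        area += width * (values[i] + values[i - 1]) / 2
    return area
```

One AUC per well then gives a single summary value per replicate that can be compared between knockdown and control conditions.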
Bulk RNA-seq
- RNA extracted, poly(A) enriched, sequenced on Illumina NovaSeq or Element AVITI
- Alignment with STAR, quantification with RSEM (GENCODE v45)
- Differential expression with DESeq2 (design: ~ treatment + background + interaction)
- Significance: adj.P<0.05, |log2FC|>0.5
- Functional enrichment with clusterProfiler (MSigDB Hallmark)
Multiplexed Single-cell Transcriptional Response
The single-cell experiment used 21 human cell lines expressing SpCas9, each transduced with 4 sgRNAs.
Experimental Design:
- Non-targeting control, two targeting c10riboseqorf92, KIF11 positive control
- Puromycin selection
- Processed with 10x Chromium (GEM-X On-chip Multiplexing, 3' v4)
- Sequenced to ~30,000 reads/cell
Data Processing:
- Cell Ranger, QC (MAD outlier detection), SoupX ambient correction
- Normalization to 10,000 counts, log transformation
- Cell line deconvolution with demuxalot, dropulation, scSplit
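The normalization step in the list above can be sketched for a single cell. The source says only "log transformation"; the sketch assumes the common log(1 + x) form, and the function name is hypothetical.

```python
import math


def normalize_log1p(cell_counts, target_sum=10_000):
    """Scale one cell's counts to `target_sum`, then apply log(1 + x).

    Assumption: the log transformation is the natural-log log1p form.
    """
    total = sum(cell_counts.values())
    return {gene: math.log1p(count * target_sum / total)
            for gene, count in cell_counts.items()}
```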
Pseudobulk differential expression: limma-voom, Benjamini-Hochberg FDR
Gene set enrichment: gseGO (GO Biological Process)
Co-expression modules: hdWGCNA with metacell construction
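The Benjamini-Hochberg adjustment named above for the pseudobulk analysis can be written in a few lines. This is a plain sketch of the standard procedure, not the limma code path, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```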
Perturbation distances:
- E-distance in PCA space (15 components) between conditions
- Hierarchical clustering
- Consensus NMF with K from 5 to 29, 5,000 HVGs, 50 iterations
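The E-distance between two conditions can be sketched as the classical energy distance between two sets of cell embeddings. The sketch assumes Euclidean distance and averages over all pairs, including each point with itself; some single-cell implementations use squared distances or exclude self-pairs, so values may differ from the published analysis. Function names are hypothetical.

```python
import math
from itertools import product


def _mean_pairwise(a, b):
    """Mean Euclidean distance over all pairs drawn from a and b."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def energy_distance(x, y):
    """E = 2*mean|x - y| - mean|x - x'| - mean|y - y'|.

    `x` and `y` are lists of equal-length coordinate tuples,
    for example cells in a 15-component PCA space.
    """
    return (2 * _mean_pairwise(x, y)
            - _mean_pairwise(x, x)
            - _mean_pairwise(y, y))
```

The value is zero when both conditions have identical embeddings and grows as the conditions separate relative to their internal spread.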
Multiplexed PRM MS of ncORF Targets
Cell lines tested: HEK293, HeLa S3, K562 with two sample preparation protocols per line.
Protocol 1: Lysis in 8M guanidine HCl, reduction (TCEP), alkylation (iodoacetamide), dilution, trypsin digestion, desalting
Protocol 2: Acetonitrile-based small protein extraction, then reduction, alkylation, digestion, desalting
Mass Spectrometry: Orbitrap Astral (20 min gradient, PRM mode) and ZenoTOF 8600 (nanoflow, 135 min gradient)
Data Analysis: Skyline for detection (signal-to-noise), with heavy-labelled synthetic peptide analogues and iRT peptides spiked in
Reporting Summary
Further details available in the Nature Portfolio Reporting Summary.