PeptideAtlas: A Massive Repository of Human Proteomic and Immunopeptidomic Data

The PeptideAtlas project has assembled an enormous collection of non-HLA and HLA immunopeptidomics data from public datasets, encompassing over 3.7 billion MS/MS spectra.

Two Major Builds

Non-HLA Build (2023-06)

Scale: 295 datasets, 1,172 experiments, and 3.5 billion MS/MS spectra
Search Engine: MSFragger v3.7 with semi-enzymatic settings
Database: THISP level 4 (2023-02), including 7,264 Ribo-seq ORFs
Variable Modifications: Methionine oxidation, protein N-terminal acetylation, peptide N-terminal pyro-glutamic acid, Asn and Gln deamidation
Fixed Modification: Alkylation (typically carbamidomethylated cysteine)
Validation: Trans-Proteomic Pipeline (TPP) v7.0 with PeptideProphet, iProphet, and PTMProphet; peptides mapped to the human proteome via ProteoMapper and to the genome via ENSEMBL
FDR Estimation: Target–decoy entrapment with scrambled decoy sequences at 1:1 ratio
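
As a rough illustration of decoy-based FDR estimation with a 1:1 target-to-decoy ratio, the sketch below counts decoy hits against target hits above a score cutoff. The function name, the (score, is_decoy) tuple layout, and the example values are hypothetical; the TPP tools compute model-based probabilities, which this simple counting estimate does not reproduce.

```python
def estimate_fdr(psms, threshold):
    """Estimate FDR as decoys / targets among PSMs scoring >= threshold.

    psms: iterable of (score, is_decoy) pairs (hypothetical layout).
    Assumes a 1:1 target-to-decoy database, so each decoy hit stands in
    for one expected false target hit.
    """
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return decoys / targets


# Illustrative values only, not taken from either build.
example_psms = [(0.99, False), (0.97, False), (0.95, True), (0.93, False),
                (0.90, False), (0.40, True), (0.35, False)]
fdr_at_090 = estimate_fdr(example_psms, 0.90)
```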

HLA Build (2023-11)

Scale: 118 datasets, 592 experiments, 240 million MS/MS spectra from 9,776 runs
Search Engine: MSFragger v3.7 with no-enzyme mode
Database: THISP level 4 (2023-07), including 7,264 Ribo-seq ORFs and 299 common contaminants
Variable Modifications: Methionine oxidation, cysteine cysteinylation, protein N-terminal acetylation, peptide N-terminal pyro-glutamic acid, Asn and Gln deamidation
Validation: Same as non-HLA build

Protein Identification and Categorization

Peptides are mapped with ProteoMapper to neXtProt (2023 version), which covers 20,389 core proteome entries and their isoforms.

Canonical proteins are defined by ≥2 uniquely mapping non-nested peptides of length ≥9, covering at least 18 amino acids.
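
The canonical rule can be sketched as a small check. This is one reading of the criteria, not PeptideAtlas code: peptides are assumed to be given as half-open (start, end) positions on the protein, all already known to map uniquely, and a peptide counts as nested when it lies entirely inside another.

```python
def is_canonical(peptide_spans, min_len=9, min_peptides=2, min_coverage=18):
    """Check the canonical criteria for one protein.

    peptide_spans: (start, end) half-open positions of uniquely mapping
    peptides on the protein (hypothetical input layout).
    """
    # Keep only peptides that are long enough; a set drops duplicates.
    spans = sorted({(s, e) for s, e in peptide_spans if e - s >= min_len})
    # Drop peptides fully contained in another peptide.
    non_nested = [a for a in spans
                  if not any(b != a and b[0] <= a[0] and a[1] <= b[1]
                             for b in spans)]
    # Count the residues covered by the remaining peptides.
    covered = set()
    for s, e in non_nested:
        covered.update(range(s, e))
    return len(non_nested) >= min_peptides and len(covered) >= min_coverage
```

Two adjacent 9-mers pass (18 residues covered), whereas two heavily overlapping 9-mers can fail the coverage requirement even though neither is nested in the other.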

Protein Categories: canonical, non-core canonical, indistinguishable representative, indistinguishable, representative, marginally distinguished, subsumed, weak, insufficient evidence, identical, not detected

Manual Validation of PSMs

Manual inspection assigns each ncORF peptide PSM to one of five classes: excellent, good, false positive, close but false positive, low information

Criteria for 'excellent': nearly complete b and y ion series, no prominent unannotated peaks, minimal mass modifications, no gaps with plausible misassignment

HLA Peptide Categorization

865,922 peptides filtered (homopolymer starting peptides removed)
Length ≥8, mapping to ≤30 UniProtKB/Swiss-Prot entries: considered canonical
Length ≥8, mapping to ncORFs (not canonical) with ≤10 distinct mappings: considered ncORF-derived
Remaining peptides: 'other' category
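
The three rules above can be written as a small decision function. This is a sketch under stated assumptions: the argument names are hypothetical, "canonical" is assumed to require at least one Swiss-Prot mapping, and the ncORF rule is assumed to apply to any peptide that did not qualify as canonical.

```python
def categorize_hla_peptide(length, n_swissprot, n_ncorf):
    """Assign an HLA peptide to 'canonical', 'ncORF-derived', or 'other'.

    n_swissprot: number of UniProtKB/Swiss-Prot entries the peptide maps to.
    n_ncorf: number of distinct ncORF mappings.
    """
    if length >= 8 and 1 <= n_swissprot <= 30:
        return "canonical"
    # Assumption: checked only for peptides that were not called canonical.
    if length >= 8 and 1 <= n_ncorf <= 10:
        return "ncORF-derived"
    return "other"
```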

ncORF Expression in Cancer Tissues

Each ncORF peptide is categorized by source (cancer vs non-cancer MS runs). ncORFs are classified as cancer-related if they derive from genes in the Cancer Gene Census.

Enriched Ubiquitination Datasets

11 public datasets reanalysed with Comet (semi-tryptic, missed cleavages=4, variable modifications: Gly-Gly ubiquitination, Met oxidation, N-terminal acetylation). TPP postprocessing, FDR via decoy database.

HLA Binding Predictions

NetMHCpan v4.1 for MS runs with known four-digit HLA typing. Peptides with rank score ≤2 considered binders.

Detectability Determinants

Canonical proteins and ncORFs categorized as detected or undetected. Significance determined by two-sided Wilcoxon rank-sum test.
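
For readers unfamiliar with the test, the following is a minimal standard-library sketch of a two-sided Wilcoxon rank-sum test using the normal approximation. It applies no tie or continuity correction, and it is not necessarily the implementation used for the analysis; the function name is hypothetical.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (z, p). Ties receive average ranks; the variance is not
    tie-corrected, which is a simplifying assumption of this sketch.
    """
    combined = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    # Rank sum of the first sample, compared with its null expectation.
    w = sum(r for r, (_, g) in zip(ranks, combined) if g == 0)
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p
```

A negative z indicates that values in the first group (for example, a feature measured on undetected ncORFs) tend to rank lower than those in the second.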

Tier Classification System

Tier 1A: Two non-nested peptides in MS proteome + Ribo-seq
Tier 1B: Two non-nested peptides in HLA immunopeptidomics + Ribo-seq
Tier 2A: One peptide in MS proteome + Ribo-seq
Tier 2B: One peptide in HLA immunopeptidomics + Ribo-seq
Tier 3: Any immunopeptidomics or tryptic proteome evidence without Ribo-seq
Tier 4: Ribo-seq evidence without proteomics
Tier 5: In silico prediction without Ribo-seq or proteomics
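
The tier rules can be expressed as a lookup function. In this sketch the argument names are hypothetical, the peptide counts are assumed to be counts of non-nested peptides, and when an ncORF satisfies several rules the best tier is assumed to win (checked in the order 1A, 1B, 2A, 2B).

```python
def assign_tier(n_ms_peptides, n_hla_peptides, has_riboseq):
    """Return the evidence tier for one ncORF as a string.

    n_ms_peptides: non-nested peptides seen in the MS proteome build.
    n_hla_peptides: non-nested peptides seen in HLA immunopeptidomics.
    has_riboseq: True if the ncORF has Ribo-seq support.
    """
    if has_riboseq:
        if n_ms_peptides >= 2:
            return "1A"
        if n_hla_peptides >= 2:
            return "1B"
        if n_ms_peptides >= 1:
            return "2A"
        if n_hla_peptides >= 1:
            return "2B"
        return "4"  # Ribo-seq only
    if n_ms_peptides >= 1 or n_hla_peptides >= 1:
        return "3"  # peptide evidence without Ribo-seq
    return "5"  # prediction only
```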

Machine Learning Models

MLP Classifier

Dataset: 677 ncORF 9-mer peptides, 22 attributes
Architecture: Hidden layer size 280, tanh activation, alpha=0.01
Training: 80% of data, hyperparameter tuning via grid search

TensorFlow Keras Model

Dataset: 7,264 ncORFs (1,785 detected, 5,479 undetected)
Architecture: 16-neuron input, ReLU, L2 regularization, batch normalization, dropout, sigmoid output
Training: 60 epochs, batch size 12, balanced class weights
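
To make "balanced class weights" concrete, the sketch below computes weights with the common heuristic n_samples / (n_classes * class_count), applied to the detected and undetected counts given above. That the model used exactly this heuristic is an assumption, and the function name is hypothetical.

```python
def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}


# Class sizes from the dataset description: 1,785 detected, 5,479 undetected.
weights = balanced_class_weights({"detected": 1785, "undetected": 5479})
```

With these counts the minority (detected) class receives a weight of roughly 2.03 and the majority class roughly 0.66, so errors on detected ncORFs cost about three times as much during training.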

Evolutionary Conservation and Constraint (ORBL)

ORBLv: Phylogenetic branch length fraction for conserved species
ORBLq: Empirical P value against matched untranslated ORFs of same biotype
Biotype determination: GENCODE v42 with strict 'pure' criteria; mixed biotypes excluded

CRISPR Screening and Functional Genomics

The CRISPR screen targeted 2,196 ncORFs from GENCODE Phase I and 1,245 from a previous library.

Library Design: 27,464 sgRNAs, designed with CRISPick (modified settings). Selection criteria: max 3 ncORFs per gene, min size 12 aa, ≥4 predicted targeting gRNAs.

Controls: Pan-essential, non-targeting, and cutting controls included.

Experimental Workflow:

Lentiviral infection in triplicate, 500 cells/gRNA representation
Puromycin selection
Genomic DNA extraction at initial and day 14 timepoints
Sequencing, normalization to RPM, log2 fold change relative to initial
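
The last step of the workflow, RPM normalization followed by log2 fold change against the initial timepoint, might look like the sketch below. The pseudocount of 1 RPM and the dictionary layout (sgRNA name to raw read count) are assumptions added for illustration.

```python
import math


def reads_per_million(counts):
    """Scale raw sgRNA read counts so that they sum to one million."""
    total = sum(counts.values())
    return {guide: c * 1e6 / total for guide, c in counts.items()}


def log2_fold_change(initial_counts, final_counts, pseudocount=1.0):
    """log2(final RPM / initial RPM) per sgRNA.

    The pseudocount (an assumption of this sketch) avoids division by zero
    for guides that drop out completely.
    """
    rpm_initial = reads_per_million(initial_counts)
    rpm_final = reads_per_million(final_counts)
    return {guide: math.log2((rpm_final.get(guide, 0.0) + pseudocount)
                             / (rpm_initial[guide] + pseudocount))
            for guide in rpm_initial}


# Illustrative counts: guide "a" halves in relative abundance by day 14.
lfc = log2_fold_change({"a": 500, "b": 500}, {"a": 250, "b": 750})
```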

CRISPR Data Analysis:

sgRNA mapping with Bowtie2 (stringent) to GRCh38
Chronos v2.0.8 with CNV data
Loss-of-function hits: Chronos score < -0.5 or > 0.5
Cross-validation with RNA-seq/Ribo-seq (TPM thresholds: ≥10 RNA-seq, ≥5 Ribo-seq)
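
The hit and expression thresholds above reduce to two predicates. This sketch assumes that both TPM thresholds must be met for an ncORF to count as expressed; the function names are hypothetical.

```python
def is_lof_hit(chronos_score, cutoff=0.5):
    """Hit if the Chronos score is below -cutoff or above +cutoff."""
    return chronos_score < -cutoff or chronos_score > cutoff


def is_expressed(rna_tpm, ribo_tpm, min_rna_tpm=10.0, min_ribo_tpm=5.0):
    """Assumption: both the RNA-seq and Ribo-seq thresholds must be met."""
    return rna_tpm >= min_rna_tpm and ribo_tpm >= min_ribo_tpm
```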

Pooled c10riboseqorf92 Knockout

A pooled screen across 486 barcoded cancer cell lines tested two targeting gRNAs against c10riboseqorf92.

Controls: Non-cutting LacZ control, cutting control Chr2-2
Cell line abundance determined by barcode RNA-seq
log2FC at day 15 vs input: Linear regression identified outliers (residuals >2 s.d.)
Spearman correlation with DepMap gene dependency profiles
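
The outlier step can be illustrated with an ordinary least-squares fit followed by a residual cutoff. Which variables were regressed is not stated above; this sketch assumes, purely for illustration, that per-cell-line log2FC for a targeting gRNA (y) is regressed on log2FC for a control gRNA (x). The function name is hypothetical.

```python
import statistics


def regression_outliers(x, y, n_sd=2.0):
    """Return indices whose residual from the least-squares line y ~ x
    exceeds n_sd standard deviations of all residuals."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sd = statistics.stdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > n_sd * sd]
```

Each returned index corresponds to a cell line whose response departs from the trend shared by the rest of the pool.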

siRNA and Overexpression Experiments

siRNAs targeting OLMALINC tested in A375 and A549 cells. Knockdown efficiency measured by qPCR (GAPDH, ACTB controls). Proliferation monitored by Incucyte, confluency calculated, AUC compared statistically.
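
One common way to summarize a confluency curve as a single AUC value is the trapezoidal rule, sketched below. The time points and confluency values are invented for illustration, and the source does not state which integration method was used.

```python
def trapezoid_auc(times, values):
    """Area under a curve sampled at the given time points (trapezoidal rule)."""
    area = 0.0
    for i in range(1, len(times)):
        area += (times[i] - times[i - 1]) * (values[i] + values[i - 1]) / 2
    return area


# Hypothetical confluency (%) at 0, 24 and 48 hours.
auc = trapezoid_auc([0, 24, 48], [10, 30, 70])
```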

Bulk RNA-seq

RNA extracted, poly(A) enriched, sequenced on Illumina NovaSeq or Element AVITI
Alignment with STAR, quantification with RSEM (GENCODE v45)
Differential expression with DESeq2 (design: ~ treatment + background + interaction)
Significance: adj.P<0.05, |log2FC|>0.5
Functional enrichment with clusterProfiler (MSigDB Hallmark)

Multiplexed Single-cell Transcriptional Response

The single-cell experiment used 21 human cell lines expressing SpCas9, each transduced with 4 sgRNAs.

Experimental Design:

sgRNAs: one non-targeting control, two targeting c10riboseqorf92, and one KIF11 positive control
Puromycin selection
Processed with 10x Chromium (GEM-X On-chip Multiplexing, 3' v4)
Sequenced to ~30,000 reads/cell

Data Processing:

Cell Ranger, QC (MAD outlier detection), SoupX ambient correction
Normalization to 10,000 counts, log transformation
Cell line deconvolution with demuxalot, dropulation, scSplit
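
Two of the processing steps above, MAD-based outlier detection and per-cell normalization, are sketched below using only the standard library. The cutoff of 5 MADs and the use of log(1 + x) are assumptions; the source gives neither, and the function names are hypothetical.

```python
import math
import statistics


def mad_outliers(values, n_mads=5.0):
    """Flag values more than n_mads median absolute deviations from the median."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return [abs(v - med) > n_mads * mad for v in values]


def normalize_log(cell_counts, target_sum=10_000):
    """Scale one cell's counts to target_sum, then apply log(1 + x)."""
    total = sum(cell_counts)
    return [math.log1p(c * target_sum / total) for c in cell_counts]
```

In practice the QC metric passed to mad_outliers would be something like total counts or detected genes per cell, and normalize_log would be applied to each cell that passes QC.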

Pseudobulk differential expression: limma-voom, Benjamini-Hochberg FDR
Gene set enrichment: gseGO (GO Biological Process)
Co-expression modules: hdWGCNA with metacell construction
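
The Benjamini-Hochberg adjustment applied to the pseudobulk P values works as follows. This is a plain standard-library sketch of the procedure, not the limma implementation; the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Each raw P value is multiplied by m / rank, and the running minimum ensures that a smaller raw P value never ends up with a larger adjusted value.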

Perturbation distances:

E-distance in PCA space (15 components) between conditions
Hierarchical clustering
Consensus NMF with K from 5 to 29, 5,000 HVGs, 50 iterations
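
The E-distance between two conditions can be sketched as an energy distance over cell coordinates in PCA space. Euclidean distance and simple all-pairs means are assumed here; implementations differ on both choices, so this is illustrative rather than a reproduction of the analysis. The function names are hypothetical.

```python
import math


def _mean_pairwise(a, b):
    """Mean Euclidean distance over all pairs (p in a, q in b)."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))


def e_distance(x, y):
    """Energy distance: 2 * between-group mean - the two within-group means.

    x, y: lists of equal-length coordinate tuples, e.g. 15 PCA components
    per cell.
    """
    return (2 * _mean_pairwise(x, y)
            - _mean_pairwise(x, x)
            - _mean_pairwise(y, y))
```

A value near zero means that the two groups of cells occupy the same region of PCA space; larger values indicate a stronger transcriptional shift between conditions.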

Multiplexed PRM MS of ncORF Targets

Cell lines tested: HEK293, HeLa S3, K562 with two sample preparation protocols per line.

Protocol 1: Lysis in 8M guanidine HCl, reduction (TCEP), alkylation (iodoacetamide), dilution, trypsin digestion, desalting
Protocol 2: Acetonitrile-based small protein extraction, then reduction, alkylation, digestion, desalting

Mass Spectrometry: Orbitrap Astral (20 min gradient, PRM mode) and ZenoTOF 8600 (nanoflow, 135 min gradient)

Data Analysis: Skyline for detection (signal-to-noise), with heavy-labelled synthetic peptide analogues and iRT peptides spiked in

Reporting Summary

Further details available in the Nature Portfolio Reporting Summary.

PeptideAtlas database construction and searching methodology