PeptideAtlas database construction and searching methodology

PeptideAtlas: A Massive Repository of Human Proteomic and Immunopeptidomic Data

The PeptideAtlas project has assembled non-HLA proteomic and HLA immunopeptidomic data from public datasets into two builds that together encompass over 3.7 billion MS/MS spectra.

Two Major Builds

Non-HLA Build (2023-06)

  • Scale: 295 datasets, 1,172 experiments, and 3.5 billion MS/MS spectra
  • Search Engine: MSFragger v3.7 with semi-enzymatic settings
  • Database: THISP level 4 (2023-02), including 7,264 Ribo-seq ORFs
  • Variable Modifications: Methionine oxidation, protein N-terminal acetylation, peptide N-terminal pyro-glutamic acid, Asn and Gln deamidation
  • Fixed Modification: Alkylation (typically carbamidomethylated cysteine)
  • Validation: Trans-Proteomic Pipeline (TPP) v7.0 with PeptideProphet, iProphet, PTMProphet. Mapping to human proteome via ProteoMapper and to genome via ENSEMBL
  • FDR Estimation: Target–decoy entrapment with scrambled decoy sequences at 1:1 ratio
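To illustrate the idea behind decoy-based FDR estimation with a 1:1 decoy ratio, here is a minimal sketch in Python. It shows only the plain decoys-over-targets estimate at a score threshold; the builds themselves rely on the TPP tools listed above. The function name, the (score, is_decoy) layout, and the scores are hypothetical.

```python
def estimate_fdr(psms, threshold):
    """Estimate the FDR among target PSMs scoring at or above `threshold`.

    psms: iterable of (score, is_decoy) pairs (hypothetical layout).
    Assumes a 1:1 target:decoy database, so each passing decoy stands in
    for one expected false target hit.
    """
    psms = list(psms)
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return decoys / targets


# Illustrative scores only, not real data.
example_psms = [(0.99, False), (0.97, False), (0.95, True), (0.93, False),
                (0.90, False), (0.85, True), (0.80, False)]
print(estimate_fdr(example_psms, 0.9))
```

With a decoy ratio other than 1:1, the decoy count would need to be rescaled before dividing.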

HLA Build (2023-11)

  • Scale: 118 datasets, 592 experiments, 240 million MS/MS spectra from 9,776 runs
  • Search Engine: MSFragger v3.7 with no-enzyme mode
  • Database: THISP level 4 (2023-07), including 7,264 Ribo-seq ORFs and 299 common contaminants
  • Variable Modifications: Methionine oxidation, cysteine cysteinylation, protein N-terminal acetylation, peptide N-terminal pyro-glutamic acid, Asn and Gln deamidation
  • Validation: Same as non-HLA build

Protein Identification and Categorization

Peptides are mapped with ProteoMapper to neXtProt (2023 version), which covers 20,389 core proteome entries and their isoforms.

Canonical proteins are defined by ≥2 uniquely mapping non-nested peptides of length ≥9, covering at least 18 amino acids.
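The canonical rule can be sketched as a small check over peptide positions on one protein. This is one reading of the rule, not the PeptideAtlas implementation: it assumes the 18 amino acids are counted as the union of positions covered by all qualifying peptides, and it takes uniquely mapping peptides as hypothetical 0-based half-open (start, end) spans.

```python
def is_canonical(spans, min_len=9, min_coverage=18):
    """Check the canonical rule for one protein.

    spans: (start, end) positions of uniquely mapping peptides,
    0-based and half-open (hypothetical input layout).
    """
    # Keep peptides of sufficient length; the set drops exact duplicates.
    long_enough = {(s, e) for s, e in spans if e - s >= min_len}
    # Drop any peptide fully contained within another peptide.
    non_nested = [
        (s, e) for s, e in long_enough
        if not any((qs, qe) != (s, e) and qs <= s and e <= qe
                   for qs, qe in long_enough)
    ]
    # Coverage is the union of residues covered by the remaining peptides.
    covered = set()
    for s, e in non_nested:
        covered.update(range(s, e))
    return len(non_nested) >= 2 and len(covered) >= min_coverage


print(is_canonical([(0, 9), (9, 18)]))   # two adjacent 9-mers
print(is_canonical([(0, 12), (2, 11)]))  # second peptide is nested in the first
```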

Protein Categories: canonical, non-core canonical, indistinguishable representative, indistinguishable, representative, marginally distinguished, subsumed, weak, insufficient evidence, identical, not detected

Manual Validation of PSMs

Manual inspection of ncORF peptides assigns each PSM to one of five categories: excellent, good, false positive, close but false positive, low information

Criteria for 'excellent': nearly complete b and y ion series, no prominent unannotated peaks, minimal mass modifications, no gaps with plausible misassignment

HLA Peptide Categorization

  • 865,922 peptides filtered (homopolymer starting peptides removed)
  • Length ≥8, mapping to ≤30 UniProtKB/Swiss-Prot entries: considered canonical
  • Length ≥8, mapping to ncORFs (not canonical) with ≤10 distinct mappings: considered ncORF-derived
  • Remaining peptides: 'other' category
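The categorization rules above can be sketched as a short function. The argument names are hypothetical, and two points are assumptions: a canonical peptide maps to at least one Swiss-Prot entry, and the ncORF rule is applied only to peptides that did not already qualify as canonical.

```python
def categorize_hla_peptide(length, n_swissprot, n_ncorf):
    """Assign an HLA peptide to 'canonical', 'ncORF-derived', or 'other'.

    length: peptide length in residues
    n_swissprot: number of UniProtKB/Swiss-Prot entries the peptide maps to
    n_ncorf: number of distinct ncORF mappings
    """
    if length >= 8 and 1 <= n_swissprot <= 30:
        return "canonical"
    # Reached only when the peptide is not canonical (assumed rule order).
    if length >= 8 and 1 <= n_ncorf <= 10:
        return "ncORF-derived"
    return "other"


print(categorize_hla_peptide(9, 1, 0))
print(categorize_hla_peptide(9, 0, 3))
print(categorize_hla_peptide(7, 1, 0))
```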

ncORF Expression in Cancer Tissues

Each ncORF peptide is categorized by the source of its MS runs (cancer vs non-cancer). An ncORF is classified as cancer-related if it derives from a gene listed in the Cancer Gene Census.

Enriched Ubiquitination Datasets

11 public datasets reanalysed with Comet (semi-tryptic, missed cleavages=4, variable modifications: Gly-Gly ubiquitination, Met oxidation, N-terminal acetylation). TPP postprocessing, FDR via decoy database.

HLA Binding Predictions

NetMHCpan v4.1 for MS runs with known four-digit HLA typing. Peptides with rank score ≤2 considered binders.

Detectability Determinants

Canonical proteins and ncORFs categorized as detected or undetected. Significance determined by two-sided Wilcoxon rank-sum test.
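For readers unfamiliar with the test, a two-sided Wilcoxon rank-sum comparison of a feature between detected and undetected groups can be sketched with the standard library alone. This sketch uses the normal approximation without tie or continuity corrections, which is an assumption for brevity; a statistics package (for example `scipy.stats.ranksums`) would normally be used. The input values are invented.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Returns (z, p). No tie or continuity correction is applied.
    """
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at position i.
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)  # rank sum of x
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p


# Invented feature values for two groups.
z, p = rank_sum_test([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(z, p)
```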

Tier Classification System

Tier requirements:

  • Tier 1A: Two non-nested peptides in MS proteome + Ribo-seq
  • Tier 1B: Two non-nested peptides in HLA immunopeptidomics + Ribo-seq
  • Tier 2A: One peptide in MS proteome + Ribo-seq
  • Tier 2B: One peptide in HLA immunopeptidomics + Ribo-seq
  • Tier 3: Any immunopeptidomics or tryptic proteome evidence without Ribo-seq
  • Tier 4: Ribo-seq evidence without proteomics
  • Tier 5: In silico prediction without Ribo-seq or proteomics
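The tier requirements can be read as a decision rule. The sketch below is one such reading, with hypothetical argument names. The order in which tiers are tested (for example, that two HLA peptides outrank a single MS peptide) is an assumption, since the source does not state how overlapping evidence is resolved.

```python
def assign_tier(n_ms_peptides, n_hla_peptides, has_riboseq):
    """Assign an ncORF to a tier from its evidence.

    n_ms_peptides: non-nested peptides seen in the MS proteome build
    n_hla_peptides: non-nested peptides seen in HLA immunopeptidomics
    has_riboseq: whether Ribo-seq evidence exists
    """
    if has_riboseq:
        if n_ms_peptides >= 2:
            return "1A"
        if n_hla_peptides >= 2:
            return "1B"
        if n_ms_peptides >= 1:
            return "2A"
        if n_hla_peptides >= 1:
            return "2B"
        return "4"  # Ribo-seq only, no proteomics evidence
    if n_ms_peptides >= 1 or n_hla_peptides >= 1:
        return "3"  # peptide evidence without Ribo-seq
    return "5"      # in silico prediction only


print(assign_tier(2, 0, True))
print(assign_tier(0, 1, False))
```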

Machine Learning Models

MLP Classifier

  • Dataset: 677 ncORF 9-mer peptides, 22 attributes
  • Architecture: Hidden layer size 280, tanh activation, alpha=0.01
  • Training: 80% of data, hyperparameter tuning via grid search

TensorFlow Keras Model

  • Dataset: 7,264 ncORFs (1,785 detected, 5,479 undetected)
  • Architecture: 16-neuron input, ReLU, L2 regularization, batch normalization, dropout, sigmoid output
  • Training: 60 epochs, batch size 12, balanced class weights
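As a worked example of balanced class weights for the detected/undetected split above, the sketch below applies the common heuristic weight = n_samples / (n_classes × class count), the same formula scikit-learn uses for its "balanced" mode. Whether the model used exactly this formula is an assumption; the function name is hypothetical.

```python
def balanced_class_weights(counts):
    """Return per-class weights inversely proportional to class frequency.

    counts: mapping from class label to number of examples.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Class sizes taken from the dataset description above.
weights = balanced_class_weights({"detected": 1785, "undetected": 5479})
for label, weight in weights.items():
    print(label, round(weight, 3))
```

The minority class (detected) receives the larger weight, so each class contributes equally to the total weighted loss.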

Evolutionary Conservation and Constraint (ORBL)

  • ORBLv: Phylogenetic branch length fraction for conserved species
  • ORBLq: Empirical P value against matched untranslated ORFs of same biotype
  • Biotype determination: GENCODE v42 with strict 'pure' criteria; mixed biotypes excluded

CRISPR Screening and Functional Genomics

The CRISPR screen targeted 2,196 ncORFs from GENCODE Phase I and 1,245 from a previous library.

Library Design: 27,464 sgRNAs, designed with CRISPick (modified settings). Selection criteria: max 3 ncORFs per gene, min size 12 aa, ≥4 predicted targeting gRNAs.

Controls: Pan-essential, non-targeting, and cutting controls included.

Experimental Workflow:

  • Lentiviral infection in triplicate, 500 cells/gRNA representation
  • Puromycin selection
  • Genomic DNA extraction at initial and day 14 timepoints
  • Sequencing, normalization to RPM, log2 fold change relative to initial
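The last step of the workflow, normalization to reads per million (RPM) followed by a log2 fold change against the initial timepoint, can be sketched as follows. The pseudocount of 1 is an assumption added to avoid taking the log of zero, and the guide names and counts are invented.

```python
import math


def rpm(counts):
    """Scale raw read counts per guide to reads per million."""
    total = sum(counts.values())
    return {guide: c * 1e6 / total for guide, c in counts.items()}


def log2_fold_change(initial_counts, final_counts, pseudocount=1.0):
    """log2 ratio of final to initial RPM for each guide.

    pseudocount is an assumed stabilizer, not a value from the source.
    """
    rpm_initial = rpm(initial_counts)
    rpm_final = rpm(final_counts)
    return {
        guide: math.log2((rpm_final[guide] + pseudocount)
                         / (rpm_initial[guide] + pseudocount))
        for guide in rpm_initial
    }


# Invented counts for two guides at the initial and day 14 timepoints.
initial = {"sgA": 100, "sgB": 100}
day14 = {"sgA": 50, "sgB": 150}
lfc = log2_fold_change(initial, day14)
print(lfc)
```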

CRISPR Data Analysis:

  • sgRNA mapping with Bowtie2 (stringent) to GRCh38
  • Chronos v2.0.8 with CNV data
  • Loss-of-function hits: Chronos score < -0.5 (knockout reduces fitness) or > 0.5 (knockout increases fitness)
  • Cross-validation with RNA-seq/Ribo-seq (TPM thresholds: ≥10 RNA-seq, ≥5 Ribo-seq)
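The thresholds in this list can be combined into a simple filter. The sketch below assumes that a hit is kept only when both expression thresholds are also met; how the score and expression criteria were actually combined is not stated in the source. Record fields and values are hypothetical.

```python
def is_hit(chronos_score, threshold=0.5):
    """True if the Chronos score is below -threshold or above +threshold."""
    return chronos_score < -threshold or chronos_score > threshold


def is_expressed(rna_tpm, ribo_tpm, min_rna=10.0, min_ribo=5.0):
    """True if both RNA-seq and Ribo-seq TPM meet their minimums."""
    return rna_tpm >= min_rna and ribo_tpm >= min_ribo


def supported_hits(records):
    """Keep names of records that are hits and pass the expression check."""
    return [
        r["name"] for r in records
        if is_hit(r["chronos"]) and is_expressed(r["rna_tpm"], r["ribo_tpm"])
    ]


# Invented records for illustration.
records = [
    {"name": "orf_a", "chronos": -0.8, "rna_tpm": 25.0, "ribo_tpm": 7.0},
    {"name": "orf_b", "chronos": -0.9, "rna_tpm": 3.0, "ribo_tpm": 9.0},
    {"name": "orf_c", "chronos": 0.1, "rna_tpm": 40.0, "ribo_tpm": 12.0},
]
print(supported_hits(records))
```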

Pooled c10riboseqorf92 Knockout

A pooled screen across 486 barcoded cancer cell lines tested two targeting gRNAs against c10riboseqorf92.

  • Controls: Non-cutting LacZ control, cutting control Chr2-2
  • Cell line abundance determined by barcode RNA-seq
  • log2FC at day 15 vs input: Linear regression identified outliers (residuals >2 s.d.)
  • Spearman correlation with DepMap gene dependency profiles
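The outlier step can be sketched as an ordinary least-squares fit followed by a residual cutoff. Which two variables were regressed is not stated in the source; the sketch assumes per-cell-line log2FC for a control guide (x) against a targeting guide (y), and uses the population standard deviation of the residuals. All values are invented.

```python
import statistics


def regression_outliers(x, y, n_sd=2.0):
    """Return indices whose residual from a linear fit exceeds n_sd s.d."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    sd = statistics.pstdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > n_sd * sd]


# Invented log2FC values; the cell line at index 5 drops out strongly.
control_lfc = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
target_lfc = [0.0, 1.0, 2.0, 3.0, 4.0, -5.0, 6.0, 7.0, 8.0, 9.0]
print(regression_outliers(control_lfc, target_lfc))
```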

siRNA and Overexpression Experiments

siRNAs targeting OLMALINC tested in A375 and A549 cells. Knockdown efficiency measured by qPCR (GAPDH, ACTB controls). Proliferation monitored by Incucyte, confluency calculated, AUC compared statistically.

Bulk RNA-seq

  • RNA extracted, poly(A) enriched, sequenced on Illumina NovaSeq or Element AVITI
  • Alignment with STAR, quantification with RSEM (GENCODE v45)
  • Differential expression with DESeq2 (design: ~ treatment + background + interaction)
  • Significance: adj.P<0.05, |log2FC|>0.5
  • Functional enrichment with clusterProfiler (MSigDB Hallmark)

Multiplexed Single-cell Transcriptional Response

The single-cell experiment used 21 human cell lines expressing SpCas9, transduced with 4 sgRNAs.

Experimental Design:

  • Non-targeting control, two targeting c10riboseqorf92, KIF11 positive control
  • Puromycin selection
  • Processed with 10x Chromium (GEM-X On-chip Multiplexing, 3' v4)
  • Sequenced to ~30,000 reads/cell

Data Processing:

  • Cell Ranger, QC (MAD outlier detection), SoupX ambient correction
  • Normalization to 10,000 counts, log transformation
  • Cell line deconvolution with demuxalot, dropulation, scSplit
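Two of the processing steps above, MAD-based outlier detection and normalization to 10,000 counts with log transformation, can be sketched for a single QC metric and a single cell. The cutoff of 5 MADs and the use of the natural log of (1 + x) are assumptions; the source gives neither.

```python
import math
import statistics


def mad_outliers(values, n_mads=5.0):
    """Flag values more than n_mads median absolute deviations from the median."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return [abs(v - med) > n_mads * mad for v in values]


def normalize_log(counts, target_sum=10_000):
    """Scale one cell's counts to target_sum, then apply log(1 + x)."""
    total = sum(counts)
    return [math.log1p(c * target_sum / total) for c in counts]


# Invented per-cell QC metric: the last cell is an obvious outlier.
print(mad_outliers([10, 11, 9, 10, 100]))
# Invented gene counts for one cell.
print(normalize_log([1, 1, 2]))
```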

Pseudobulk differential expression: limma-voom, Benjamini-Hochberg FDR
Gene set enrichment: gseGO (GO Biological Process)
Co-expression modules: hdWGCNA with metacell construction
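For reference, the Benjamini-Hochberg adjustment named above can be written in a few lines. This is a generic sketch of the standard procedure, not code from the analysis pipeline, and the p-values are invented.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented p-values for four genes.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```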

Perturbation distances:

  • E-distance in PCA space (15 components) between conditions
  • Hierarchical clustering
  • Consensus NMF with K from 5 to 29, 5,000 HVGs, 50 iterations
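The E-distance between two conditions can be sketched as twice the mean between-group distance minus the two mean within-group distances. The sketch below uses plain Euclidean distance and includes self-pairs in the within-group means; some implementations use squared distances or exclude self-pairs, so treat both choices as assumptions. Points are invented 2-D stand-ins for cells in PCA space.

```python
import math
from itertools import product


def mean_pairwise_distance(a, b):
    """Mean Euclidean distance over all pairs (p, q) with p in a, q in b."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def e_distance(x, y):
    """Energy distance between two point sets."""
    return (2 * mean_pairwise_distance(x, y)
            - mean_pairwise_distance(x, x)
            - mean_pairwise_distance(y, y))


# Invented coordinates for two conditions.
control = [(0.0, 0.0), (1.0, 0.0)]
perturbed = [(3.0, 0.0), (4.0, 0.0)]
print(e_distance(control, perturbed))
```

In the setting described above, each point would be a cell's coordinates on the 15 principal components.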

Multiplexed PRM MS of ncORF Targets

Cell lines tested: HEK293, HeLa S3, K562 with two sample preparation protocols per line.

Protocol 1: Lysis in 8M guanidine HCl, reduction (TCEP), alkylation (iodoacetamide), dilution, trypsin digestion, desalting
Protocol 2: Acetonitrile-based small protein extraction, then reduction, alkylation, digestion, desalting

Mass Spectrometry: Orbitrap Astral (20 min gradient, PRM mode) and ZenoTOF 8600 (nanoflow, 135 min gradient)

Data Analysis: Skyline for detection (signal-to-noise), with heavy-labelled synthetic peptide analogues and iRT peptides spiked in

Reporting Summary

Further details available in the Nature Portfolio Reporting Summary.