
Scientists at the Icahn School of Medicine at Mount Sinai have developed a gene set foundation model (GSFM) that learns relationships between genes from over one million gene sets derived from published studies and transcriptomics data.

The model, inspired by large language models, predicts missing genes in incomplete sets and generates embeddings that capture biological patterns. GSFM outperformed existing models including Geneformer and scGPT on benchmarks such as KEGG pathways and Gene Ontology Biological Processes.

Key Innovations

GSFM predicted genes involved in ferroptosis, including PLIN2, an association that was later reported in the scientific literature. The model uses a denoising autoencoder architecture with a 256-dimensional hidden layer, trained on Rummagene and RummaGEO data.
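To make the denoising setup concrete, here is a minimal sketch of how one training pair could be built from a gene set. It assumes gene sets are encoded as multi-hot vectors over a fixed vocabulary and that corruption means randomly hiding member genes. The function names, the 0.5 drop probability, and the six-gene toy vocabulary are illustrative assumptions, not details taken from the GSFM code.

```python
import random

def multi_hot(gene_set, vocab):
    """Encode a gene set as a 0/1 vector over a fixed gene vocabulary."""
    index = {gene: i for i, gene in enumerate(vocab)}
    vec = [0.0] * len(vocab)
    for gene in gene_set:
        if gene in index:  # genes outside the vocabulary are ignored
            vec[index[gene]] = 1.0
    return vec

def corrupt(vec, drop_prob, rng):
    """Randomly zero out member genes; non-member positions stay at 0."""
    return [0.0 if v == 1.0 and rng.random() < drop_prob else v for v in vec]

def training_pair(gene_set, vocab, drop_prob=0.5, seed=0):
    """Return (corrupted input, full target) for one denoising step.

    drop_prob is a hypothetical setting chosen for illustration.
    """
    rng = random.Random(seed)
    target = multi_hot(gene_set, vocab)
    return corrupt(target, drop_prob, rng), target

# Toy vocabulary for illustration only; not drawn from the training data.
vocab = ["ACSL4", "GPX4", "PLIN2", "SLC7A11", "TFRC", "TP53"]
noisy, target = training_pair({"ACSL4", "GPX4", "SLC7A11", "TFRC"}, vocab)
```

An autoencoder trained on such pairs learns to reconstruct the full set from the partial one, so genes that score highly in the output but were absent from the input become candidate members.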

Training required about 30 minutes on standard hardware
The model is publicly available on GitHub and HuggingFace

Technical Specifications

GSFM trained on 626,000+ filtered gene sets covering ~97,000 genes
Final architecture: denoising autoencoder with hidden layer size 256, trained for ~50 epochs
Tested on KEGG, Gene Ontology, GWAS Catalog, and ChEA datasets
Achieved strongest AUROC scores when trained on Rummagene
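For readers unfamiliar with the metric, the sketch below shows how an AUROC can be computed for a missing-gene prediction: each gene gets a model score, held-out true members are labeled 1, and all other genes are labeled 0. This is a plain pairwise implementation written for illustration, with made-up scores; it is not the evaluation code used by the authors.

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical scores for five genes; 1 marks a held-out true member.
scores = [0.9, 0.8, 0.3, 0.4, 0.1]
labels = [1, 0, 1, 0, 0]
result = auroc(scores, labels)
```

A value of 0.5 corresponds to random ranking and 1.0 to every held-out gene ranking above every non-member. The pairwise loop is quadratic, so a rank-based formulation would be preferable at the scale of a full gene vocabulary.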

Example Prediction

GSFM identified PLIN2 as a candidate ferroptosis gene, a prediction later confirmed in studies on oligodendrocytes.

Future Directions

The researchers plan several follow-up developments:

Combine with language models for plain-language gene function explanations
Integrate with drug-focused AI systems to predict medicine-cell interactions
Explore applications in precision medicine and drug discovery

Code and pretrained weights are available on GitHub and HuggingFace, making the model accessible to the broader scientific community.
