Awesome Genomics
Genomics is one of the most active areas in Awesome AI for Science, with 1,562 papers in this collection, evaluated on datasets such as ProteinGym and The Cancer Genome Atlas (TCGA). A strong starting point is "An AI system to help scientists write expert-level empirical software".
Datasets & benchmarks
Key papers
- An AI system to help scientists write expert-level empirical software (2025) · Eser Aygün et al. · 12.64
- DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA (2024) · Aman Patel et al. · 11.97
- InternAgent-1.5: A Unified Agentic Framework for Long-Horizon Autonomous Scientific Discovery (2026) · Shiyang Feng et al. · 11.14
- BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology (2025) · Ludovico Mitchener et al. · 10.71
- Towards an AI co-scientist (2025) · Juraj Gottweis et al. · 9.77
- DeepScientist: Advancing Frontier-Pushing Scientific Findings Progressively (2025) · Yixuan Weng et al. · 9.55
- Accurate RNA 3D structure prediction using a language model-based deep learning approach (2022) · Tao Shen et al. · 9.35
- Artificial Intelligence and Deep Learning Algorithms for Epigenetic Sequence Analysis: A Review for Epigeneticists and AI Experts (2025) · Muhammad Tahir et al. · 8.64
- MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research (2025) · James Burgess et al. · 8.49
- Multi-Objective-Guided Discrete Flow Matching for Controllable Biological Sequence Design (2025) · Tong Chen et al. · 8.29
- HEIST: A Graph Foundation Model for Spatial Transcriptomics and Proteomics Data (2025) · Hiren Madhu et al. · 7.70
- RiNALMo: General-Purpose RNA Language Models Can Generalize Well on Structure Prediction Tasks (2024) · Rafael Josip Penić et al. · 7.62
- drGT: Attention-Guided Gene Assessment of Drug Response Utilizing a Drug-Cell-Gene Heterogeneous Network (2024) · Yoshitaka Inoue et al. · 7.38
- A Text-guided Protein Design Framework (2023) · Shengchao Liu et al. · 7.32
- Gumbel-Softmax Flow Matching with Straight-Through Guidance for Controllable Biological Sequence Generation (2025) · Sophia Tang et al. · 7.24
- Contextualizing biological perturbation experiments through language (2025) · Menghua Wu et al. · 7.19
- BioDiscoveryAgent: An AI Agent for Designing Genetic Perturbation Experiments (2024) · Yusuf Roohani et al. · 7.00
- Materials Graph Library (MatGL), an open-source graph deep learning library for materials science and chemistry (2025) · Tsz Wai Ko et al. · 6.89
- Understanding protein function with a multimodal retrieval-augmented foundation model (2025) · Timothy Fei Truong Jr et al. · 6.75
- PerturBench: Benchmarking Machine Learning Models for Cellular Perturbation Analysis (2024) · Yan Wu et al. · 6.50
- Kosmos: An AI Scientist for Autonomous Discovery (2025) · Ludovico Mitchener et al. · 6.40
- RNA-FrameFlow: Flow Matching for de novo 3D RNA Backbone Design (2024) · Rishabh Anand et al. · 6.39
- Molecular-driven Foundation Model for Oncologic Pathology (2025) · Anurag Vaidya et al. · 6.12
- Multi-Exit Kolmogorov-Arnold Networks: enhancing accuracy and parsimony (2025) · James Bagrow and Josh Bongard · 6.12
- DualEquiNet: A Dual-Space Hierarchical Equivariant Network for Large Biomolecules (2025) · Junjie Xu et al. · 6.12
- scDrugMap: Benchmarking Large Foundation Models for Drug Response Prediction (2025) · Qing Wang et al. · 6.07
- Machine Learning Methods for Gene Regulatory Network Inference (2025) · Akshata Hegde et al. · 6.01
- Interpretable Graph Kolmogorov-Arnold Networks for Multi-Cancer Classification and Biomarker Identification using Multi-Omics Data (2025) · Fadi Alharbi et al. · 5.96
- Whole-Genome Phenotype Prediction with Machine Learning: Open Problems in Bacterial Genomics (2025) · Tamsin James et al. · 5.90
- Iterative Distillation for Reward-Guided Fine-Tuning of Diffusion Models in Biomolecular Design (2025) · Xingyu Su et al. · 5.87
- AI-driven multi-omics integration for multi-scale predictive modeling of causal genotype-environment-phenotype relationships (2024) · You Wu et al. · 5.78
- GBDTSVM: Combined Support Vector Machine and Gradient Boosting Decision Tree Framework for efficient snoRNA-disease association prediction (2025) · Ummay Maria Muna et al. · 5.76
- CellVerse: Do Large Language Models Really Understand Cell Biology? (2025) · Fan Zhang et al. · 5.76
- Benchmarking AI scientists for omics data driven biological discovery (2025) · Erpai Luo et al. · 5.76
- Universal Biological Sequence Reranking for Improved De Novo Peptide Sequencing (2025) · Zijie Qiu et al. · 5.76
- LOCO-EPI: Leave-one-chromosome-out (LOCO) as a benchmarking paradigm for deep learning based prediction of enhancer-promoter interactions (2025) · Muhammad Tahir et al. · 5.70
- Multi-modal AI for comprehensive breast cancer prognostication (2024) · Jan Witowski et al. · 5.68
- Democratizing AI scientists using ToolUniverse (2025) · Shanghua Gao et al. · 5.63
- Protein Large Language Models: A Comprehensive Survey (2025) · Yijia Xiao et al. · 5.59
- Hyperbolic Genome Embeddings (2025) · Raiyan R. Khan et al. · 5.52
- ProtChatGPT: Towards Understanding Proteins with Large Language Models (2024) · Chao Wang et al. · 5.51
- JanusDNA: A Powerful Bi-directional Hybrid DNA Foundation Model (2025) · Qihao Duan et al. · 5.40
- MAMMAL -- Molecular Aligned Multi-Modal Architecture and Language (2024) · Yoel Shoshan et al. · 5.37
- On learning functions over biological sequence space: relating Gaussian process priors, regularization, and gauge fixing (2025) · Samantha Petti et al. · 5.35
- Diffusion on language model encodings for protein sequence generation (2024) · Viacheslav Meshchaninov et al. · 5.29
- Differentiable Folding for Nearest Neighbor Model Optimization (2025) · Ryan K. Krueger et al. · 5.29
- Towards Scientific Intelligence: A Survey of LLM-based Scientific Agents (2025) · Shuo Ren et al. · 5.29
- HybriDNA: A Hybrid Transformer-Mamba2 Long-Range DNA Language Model (2025) · Mingqian Ma et al. · 5.24
- Learning to Discover Regulatory Elements for Gene Expression Prediction (2025) · Xingyu Su et al. · 5.24
- LLMs for Bayesian Optimization in Scientific Domains: Are We There Yet? (2025) · Rushil Gupta et al. · 5.21
- Comparative Performance Evaluation of Large Language Models for Extracting Molecular Interactions and Pathway Knowledge (2023) · Gilchan Park et al. · 5.12
- Benchmarking and Evaluation of AI Models in Biology: Outcomes and Recommendations from the CZI Virtual Cells Workshop (2025) · Elizabeth Fahsbender et al. · 5.10
- Virtual Cells: Predict, Explain, Discover (2025) · Emmanuel Noutahi et al. · 4.98
- PROTOCOL: Late Interaction Retrieval for Protein Homolog Search (2026) · Gabrielle Cohn et al. · 4.95
- PLM-eXplain: Divide and Conquer the Protein Embedding Space (2025) · Jan van Eck et al. · 4.93
- Hallucination, reliability, and the role of generative AI in science (2025) · Charles Rathkopf · 4.93
- A Phylogenetic Approach to Genomic Language Modeling (2025) · Carlos Albors et al. · 4.87
- SciHorizon: Benchmarking AI-for-Science Readiness from Scientific Data to Large Language Models (2025) · Chuan Qin et al. · 4.87
- BAnG: Bidirectional Anchored Generation for Conditional RNA Design (2025) · Roman Klypa et al. · 4.82
- Distribution-Conditioned Transport (2026) · Nic Fishman et al. · 4.81