DNA methylation is a critical epigenetic modification that regulates gene expression and plays a significant role in development and disease processes. Here, we present the Cytosine-phosphate-Guanine Pretrained Transformer (CpGPT), a novel foundation model pretrained on over 1,500 DNA methylation datasets encompassing over 100,000 samples from diverse tissues and conditions. CpGPT leverages an improved transformer architecture to learn comprehensive representations of methylation patterns, allowing it to impute and reconstruct genome-wide methylation profiles from limited input data. By capturing sequence, positional, and epigenetic contexts, CpGPT outperforms specialized models when finetuned for aging-related tasks, including chronological age prediction, mortality risk, and morbidity assessments. The model is highly adaptable across different methylation platforms and tissue types. Furthermore, analysis of sample-specific attention weights enables the identification of the most influential CpG sites for individual predictions. As a foundation model, CpGPT sets a new benchmark for DNA methylation analysis, achieving strong performance in the Biomarkers of Aging Challenge, where it placed second overall in chronological age estimation and first on the public leaderboard in methylation-based mortality prediction.
Competing Interest Statement
L.P.D.L.C. is the Head of Machine Learning at Shift Bioscience. R.S. has received consulting fees from TruDiagnostic, LongevityTech.fund and Cambrian BioPharma. A.H.C. has received consulting fees from TruDiagnostic and FOXO Biosciences. S.H. works for Altos Labs Limited UK and is a founder and consultant of the nonprofit Epigenetic Clock Development Foundation. B.W. serves as a scientific advisor to Shift Bioscience, Vevo Therapeutics and Deep Genomics. B.W. receives consulting fees from Arsenal Bioscience and Viecure Inc. The methodology described in this manuscript is the subject of a pending patent application where L.P.D.L.C. is named as the sole inventor. L.P.D.L.C. is the primary owner, and R.S. is the secondary owner.