https://www.nature.com/articles/s41591-026-04359-w
ChatGPT:
Summary
This Nature Medicine paper presents Reti-Pioneer, an AI system that uses colour fundus photographs of the retina, plus clinical metadata, to screen for six endocrine/metabolic diseases: type 2 diabetes, hypertension, hyperlipidaemia, gout, osteoporosis and thyroid disease. The premise is that the retina contains vascular, neural and possibly systemic-health signals that can act as a low-cost “oculomics” window into broader disease risk.
The system was trained/fine-tuned on 107,730 retinal images from 53,865 people, using UK Biobank and Chinese hospital/community datasets. It combines a quality-aware image module with three frozen/pre-trained visual foundation models: Swin Transformer, Vision Mamba and RETFound. Internally, AUROCs were strongest for type 2 diabetes (0.833), gout (0.832) and osteoporosis (0.787), moderate for hypertension (0.740) and hyperlipidaemia (0.736), and weakest for thyroid disease (0.699).
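For readers less familiar with the metric: an AUROC is the probability that a randomly chosen case receives a higher model score than a randomly chosen non-case, so 0.833 for T2DM means the model ranks a diabetic above a non-diabetic about 83% of the time. A minimal rank-based sketch, using made-up illustrative scores rather than data from the paper:

```python
# AUROC as the probability that a random positive outranks a random negative.
# The scores below are illustrative assumptions, not figures from the paper.

def auroc(pos_scores, neg_scores):
    """Rank-based AUROC over all positive/negative pairs; ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical model scores for screened participants
positives = [0.9, 0.8, 0.75, 0.4]   # people with the disease
negatives = [0.7, 0.5, 0.3, 0.2]    # people without it
print(auroc(positives, negatives))  # → 0.875
```

This pairwise definition is equivalent to the area under the ROC curve, which is why an AUROC near 0.5 (as thyroid disease approaches here) indicates ranking little better than chance.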
External validation was done across Chinese resource-limited and high-resource settings and the Singapore SEED multi-ethnic cohort. Performance varied substantially by disease and cohort: for example, in resource-limited Chinese settings the model achieved AUROC 0.821 for T2DM, 0.805 for hypertension, 0.904 for osteoporosis, but only 0.628 for hyperlipidaemia. In SEED, AUROCs were 0.686 for T2DM, 0.749 for hypertension and 0.615 for hyperlipidaemia.
The paper also tested longitudinal prediction in UK Biobank. The model predicted 5- and 10-year incident disease with moderate accuracy: for example, 5-year T2DM AUROC 0.755, falling to 10-year T2DM AUROC 0.736; hypertension was 0.755 at 5 years and 0.719 at 10 years; hyperlipidaemia was 0.748 at 5 years and 0.735 at 10 years.
For interpretability, the authors used saliency maps and linked retinal latent features to plasma proteomic profiles. Some proteins, including SCARA5 for T2DM and PLA2G7, PTPRF and APOM for hyperlipidaemia, associated with model-derived retinal components after adjustment. Genetic risk-score associations were much weaker.
In clinical workflow testing, a silent primary-care trial of 1,017 participants showed that the system produced results in about 30.6 ± 6.0 seconds, far faster than laboratory report workflows taking roughly 8 hours. A later clinical pilot with 606 participants found AUROCs of 0.776 for T2DM, 0.843 for hypertension, 0.699 for hyperlipidaemia, 0.804 for gout, 0.877 for osteoporosis and 0.646 for thyroid disease. For T2DM, it outperformed FINDRISC, and participant/clinician acceptance was reported as high.
Novelty
The main novelty is not simply “AI from retinal photos”, which already exists, but the combination of several elements:
- Multidisease screening from one retinal image workflow. Earlier oculomics studies often focused on a single disease, especially diabetes or cardiovascular risk. This paper attempts a unified screening framework for six common endocrine/metabolic diseases.
- Use of foundation models rather than training from scratch. The architecture combines frozen/pre-trained visual models, which is intended to reduce training cost and improve generalisability.
- Explicit quality-aware modelling. Instead of excluding poor-quality retinal images, the system uses image quality as part of the decision process. That is clinically relevant because primary-care and resource-limited screening often produces variable-quality images.
- Multi-setting validation. The paper includes internal testing, external Chinese datasets, Singapore multi-ethnic data, longitudinal UK Biobank prediction, a silent trial and a prospective pilot. This is more translationally ambitious than many retrospective AI-imaging papers.
- Biological plausibility layer. The proteomic association analysis is a notable attempt to show that the model is not merely exploiting arbitrary image artefacts, although it does not prove mechanism.
Critique
The paper is impressive as an engineering and translational study, but the clinical claim should be interpreted cautiously.
The strongest use case is probably triage or risk stratification, not diagnosis. For several diseases the AUROCs are only moderate, especially thyroid disease and hyperlipidaemia. Even where AUROC is good, the positive predictive values in a screening population can be modest. For example, in the prospective table, T2DM had a high NPV but low PPV, meaning the model may be better at ruling out or prioritising people for testing than replacing blood tests.
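The PPV point follows directly from Bayes' rule: at screening-level prevalence, even a reasonably sensitive and specific test yields many false positives relative to true positives. A quick sketch with illustrative numbers (the sensitivity, specificity and prevalence here are assumptions for demonstration, not figures from the paper):

```python
# Predictive values from sensitivity, specificity and prevalence (Bayes' rule).
# All numbers are illustrative assumptions, not taken from the paper.

def ppv_npv(sens, spec, prev):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    tp = sens * prev              # true positives per screened person
    fp = (1 - spec) * (1 - prev)  # false positives
    tn = spec * (1 - prev)        # true negatives
    fn = (1 - sens) * prev        # false negatives
    return tp / (tp + fp), tn / (tn + fn)

# e.g. 80% sensitivity, 75% specificity, 8% prevalence in a screening population
ppv, npv = ppv_npv(0.80, 0.75, 0.08)
print(f"PPV={ppv:.2f}, NPV={npv:.2f}")  # → PPV=0.22, NPV=0.98
```

This asymmetry (high NPV, low PPV) is exactly the pattern reported in the prospective table, and it is why the tool looks more like a rule-out/triage instrument than a replacement for confirmatory blood tests.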
There is also a risk that the model is partly detecting age, vascular damage, frailty, obesity, medication patterns, healthcare access, or image-acquisition artefacts, rather than disease-specific biology. The quality-aware module is clever, but it also raises a concern: poor image quality may itself correlate with age, cataract, frailty, comorbidity or setting. That can improve prediction while reducing disease specificity.
The external validation is valuable, but performance variation is substantial. SEED performance for T2DM and hyperlipidaemia was only modest, and the paper itself notes potential explanations such as ethnicity, different imaging protocols, label noise and under-ascertainment. That means deployment in a new population, such as the NHS, would require local validation rather than assuming portability.
The biological interpretation is suggestive but not decisive. Proteomic correlations show plausibility, not causality. The authors also acknowledge that retinal features are proxies for protein signatures rather than independently established disease mechanisms. Saliency maps can be visually reassuring but are not a robust mechanistic explanation.
The prospective work is promising but still preliminary. A silent trial mainly proves workflow speed and technical feasibility, not patient benefit. The 606-person pilot is relatively small, and follow-up was short. The key unanswered question is whether using Reti-Pioneer actually improves outcomes: earlier diagnosis, better treatment initiation, fewer missed cases, cost-effectiveness, and fewer unnecessary referrals.
Bottom line
This is a strong paper showing that retinal AI can act as a fast, low-cost multidisease screening or triage tool, especially where blood testing is hard to deliver. Its most credible near-term role is to flag people for confirmatory testing, not to diagnose endocrine/metabolic disease on its own. The novelty lies in the multidisease foundation-model framework, quality-aware retinal imaging, broad validation and real-world workflow testing. The main weaknesses are moderate performance for some diseases, possible confounding, variable external generalisability, limited causal interpretability and lack of long-term outcome evidence.