ChatGPT and AI in Healthcare Thread

The latest Gemini 3.1 Pro release has been extolled for its reasoning capabilities in maths, physics and elsewhere. Some people in the field (for example Dr. Nate B. Jones) are of the opinion that Google has issued a model with the intrinsic reasoning capabilities of the famous AlphaFold model, which won the Nobel Prize for Demis Hassabis, director of Google DeepMind.

So presently, I’m watching it closely (together with other models, and especially the Grok 4.20 beta release with its 4-agent architecture). In a while, maybe they’ll be able to provide answers, based on first principles, that are still eluding researchers.

My latest attempt was incredible. Gemini 3.1 Pro, at my request, elaborated a mathematical model of the time response of the upstream signals of mTOR. It actually went beyond my request, providing a quantitative, personalized, actionable scheme with monitoring strategies: a precise optimization procedure for alternating upstream signaling on and off to bend mTOR signaling toward improved longevity without compromising the immune system or the integrity of musculoskeletal tissue. It was amazing. It deserves its own thread, though.

It would be interesting to see what Grok gave you on this.

This is the discussion thread with the complete answer from Gemini 3.1 Pro; I have yet to submit it to Grok 4.2 beta (but I’m going to do that soon).

1 Like

Microsoft’s New AI Health Tool Can Read Your Medical Records and Give Advice

A new feature within the Copilot app will offer personalized healthcare advice and make it easy to upload test results, fitness data and more

Microsoft is betting on healthcare as a path to becoming more competitive in artificial intelligence. The company’s biggest push yet: a new tool it describes as an AI concierge doctor, one that can access your medical records and health data, with your consent.

The company on Thursday unveiled Copilot Health, a feature within the Copilot app that lets the chatbot dispense personalized healthcare advice informed by the user’s disease history, test results, medications, doctors’ visit notes and biometric data as recorded by wearable devices.

Health data imported into the feature will be encrypted and firewalled from the rest of the app to address the privacy concerns of handing over one’s medical records to a generative AI platform, Microsoft AI Chief Executive Mustafa Suleyman said in an interview.

“It’s something that Microsoft is uniquely placed to do with our scale, with our regulatory experience, with the kind of trust and confidence that people have in our security and the history that we have as a mature, stable player,” Suleyman said.

The software giant is counting on the new health service to drive engagement for Copilot and attract new users to its app, which trails competitors such as OpenAI’s ChatGPT and Google’s Gemini. Microsoft plans to eventually charge users for the feature, which is first launching in the U.S. in a phased rollout.

In focusing the general consumer version of Copilot on healthcare, Microsoft is following user behavior: The most common category of questions asked on the mobile app is health, the company said.

[Image: A mobile phone displaying the welcome screen for Copilot Health. Caption: Microsoft hopes its new health service will attract users.]

The new tool, which appears as a tab within the Copilot app, allows users to connect their hospital and lab data, as well as data from wearables such as Apple Watch and Fitbit, to receive personalized answers to inquiries about conditions or symptoms. For users who don’t plug in their personal data, the tool can provide more generalized answers.

The service could especially benefit those managing chronic medical conditions, executives said. The tool can plug into information from more than 50,000 U.S. hospitals and provider organizations, including lab results from those institutions or through Function Health.

Once users authenticate their identity through the identification service Clear, their data is pulled by vendor HealthEx, which adheres to the federal initiative known as Tefca, a nationwide framework for accessing health records. The data is then streamed into Copilot Health. Microsoft said users can manage and delete their information, and any data and conversations are kept separate from the general Copilot chat on the app using encryption and strict access controls.

Read the full story: Microsoft’s New AI Health Tool Can Read Your Medical Records and Give Advice (WSJ)

2 Likes

Claude and Gemini both do that quite well. They have become very adept at reading charts and photos that I upload. They have both gleaned information about me, which I find a little disturbing. Claude even correctly read the name of my doctor from a photo I took of a lab report, and that doctor’s name was in very fine print.

I don’t believe Claude and Gemini take in the “biometric data as recorded by wearable devices,” do they?

I agree both Claude and Gemini are very health-aware in the context of individuals. Copilot is way behind, even at the enterprise level and not just the consumer level. For example, I have lately been using the Claude plug-in for Excel spreadsheets. It’s a breeze to convert simple conversations into formulas and graphs. I almost never think of Copilot; the few times I tried, it just didn’t seem intuitive, even though it has OpenAI/Claude DNA.
I guess this forum would agree that it’s not just about the DNA; it’s the mitochondria that hold the key… :rofl:

AI being applied to health insurance claims in the USA: (we need personalized agents to keep submitting claims, it seems)…

from: https://x.com/StuartBlitz/status/2033288359976571200?s=20

1 in 3 Americans now use AI chatbots for health information, a figure that has almost doubled in a year.

64% do it weekly. 81% take action afterward: schedule a doctor’s visit, change a medication, try a new behavior.

The detail that should stop you: 74% are using ChatGPT or Gemini. Not a clinical tool. Not an FDA-cleared system. A general-purpose chatbot.

I remember when ChatGPT was first launched, the medical community had the most heated debate about “is AI ready for healthcare?” … Now this debate has already been decided by users. They didn’t wait for the system to be ready. They just started using it and acting on it.

Meanwhile 71% of physicians say accuracy and reliability are their top concerns with AI.

The reality: consumers acting on general-purpose AI while clinicians don’t trust it seems to be the defining tension in health AI right now. The question isn’t whether people will use AI for health decisions. They already do. The question is whether anyone builds models actually calibrated for the stakes involved.

Source: https://x.com/BoWang87/status/2037047528055857645?s=20

1 Like

Health AI is going D2C (Direct to consumer):

Anthropic, OpenAI, Perplexity, etc are becoming the front door to care – integrating health records, biometrics, and wellness data to deliver personalized insights, while partners gain reach and a new consumer interface

1 Like

Be careful with how you use the chatbots… from the Financial Times:

AI chatbots misdiagnose in over 80% of early medical cases, study finds

Top models including OpenAI and DeepSeek make judgments too quickly when patient data is incomplete

Consumer AI chatbots falter when used to make medical diagnoses, particularly when faced with incomplete information, according to new research highlighting the risks of relying on them as digital doctors.

The study finds that leading large language models struggle to suggest a range of possible diagnoses when patient data is limited, frequently narrowing too quickly to a single answer.

The results point to a broader limitation in AI: while chatbots can identify likely conditions once a case is fully specified, they are less reliable at the earlier, more uncertain stages of clinical reasoning.

The findings highlight the dangers of relying on the technology alone to pinpoint health problems, particularly in cases where the data users input may be vague or patchy.

“These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn’t much information,” said Arya Rao, the study’s lead author and a researcher at the Massachusetts-based Mass General Brigham healthcare system.

The study, published in JAMA Network Open on Monday, tested AI models using 29 clinical vignettes based on a standard medical reference text.

Read the full story: AI chatbots misdiagnose in over 80% of early medical cases, study finds (Financial Times)

Related Paper:

Large Language Model Performance and Clinical Reasoning Tasks

Conclusions and Relevance In this cross-sectional study of 21 LLMs, frontier LLMs achieved high accuracy on final diagnoses but performed poorly in generating differential diagnoses and navigating uncertainty relative to other reasoning stages. The PrIME-LLM framework provided greater separation than raw accuracy, revealing critical reasoning gaps obscured by traditional benchmarks. Thus, despite version-based improvements and advantages in reasoning-optimized models, off-the-shelf LLMs have not yet achieved the intelligence required for safe deployment and remain limited in demonstrating advanced clinical reasoning.

Full paper: https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2847679

The AI paradox: How medicine’s embrace of technology could erode doctors’ expertise

AI can improve care, but early evidence of skill erosion is raising alarms at hospitals and medical schools

Key Takeaways

The integration of artificial intelligence in health care is happening at a rapid pace, leading to efficiencies as well as risks.

There is growing evidence that overreliance on AI is causing medical professionals’ skills to deteriorate — or, in the case of young doctors, never develop in the first place.

Medical schools are using a range of approaches in training physicians in the era of AI, with the vast majority now incorporating AI in some way. But few schools have found a clear solution to the challenge, coming to be known as “de-skilling,” of physicians who lose or never hone their ability to perform tasks that are increasingly automated.

Read the full story: The AI paradox: How medicine’s embrace of technology could erode doctors’ expertise (Washington Post)

ChatGPT for Clinicians:

ChatGPT for Clinicians

Verified access

Why verify? Verification confirms you’re a licensed clinician so we can unlock clinician-only workflows in a workspace designed for HIPAA-ready use.

Healthcare-grade privacy: Built for HIPAA-ready workflows and designed to protect sensitive data. Your privacy and security come first. Your content is not used to train our models. We include enterprise-grade controls and secure data handling to help safeguard Protected Health Information (PHI).

Built for clinical workflows: ChatGPT adapts to your specialty, care setting, and documentation style.

https://openai.com/index/making-chatgpt-better-for-clinicians/

For us non-medical users: I asked Perplexity to rank flagship LLM models, based on the info on the artificialanalysis.ai site, according to their efficiency in specific areas, including health. The list includes all 4 main companies plus Moonshot and DeepSeek.
Perplexity used the following benchmark for the health scoring: Health / scientific reasoning → GPQA Diamond (graduate-level STEM, “Google-proof” science questions).
It remains to be seen whether it is the best one for our individual use.

Key Takeaways by Use Case

Health/Medical reasoning: Claude Opus 4.7 and GPT-5.5 trade the top spot on GPQA Diamond (~94–95%), making both the safest choices for medical literature analysis or clinical reasoning tasks. Kimi K2.6 is surprisingly strong at 91.1% given that it is open-weights.

Reasoning & Mathematics: GPT-5.5 and Gemini 3.1 Pro tie at the top of the HLE leaderboard (~44%), the hardest public reasoning benchmark to date. DeepSeek V4 Pro is highly competitive on math (GSM8K ~90.2%) at a fraction of the cost.

Excel / Agentic work tasks: Kimi K2.6 achieves an Elo of 1520 on GDPval-AA (agentic work tasks benchmark), a marked jump over its predecessor. GPT-5.5 and Claude Opus 4.7 likely score higher, but specific GDPval-AA Elo values are not yet published for all variants as of April 2026.

OCR & document reading: Gemini 3.1 Pro and Grok 4.20 have the largest context windows (up to 2M tokens), making them structurally superior for ingesting large PDF/document batches. For pure visual OCR, MMMU-Pro scores (multimodal) favor Gemini’s native integration with Google’s vision stack.

Report writing from documents: Claude Opus 4.7 has the lowest hallucination rate among frontier models (36%) per AA-Omniscience, meaning it is least likely to fabricate citations or figures in long-form reports. GPT-5.5 leads on AA-LCR (Long Context Reasoning) at 75.7%, indicating strong synthesis of long documents.

Cost efficiency: DeepSeek V4 Pro is ~85% cheaper than GPT-5.5 at comparable intelligence tiers — a strong choice when API costs matter at scale. Kimi K2.6, while open-weights, is noted as “particularly expensive” relative to other open-weight models of similar size.
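To make the logic of such a per-use-case ranking concrete, here is a minimal sketch of how benchmark scores can be combined into use-case rankings with different weightings. All model names, scores, and weights below are hypothetical placeholders invented for illustration; they are not the real artificialanalysis.ai leaderboard values.

```python
# Sketch: combine benchmark scores into per-use-case rankings.
# All numbers below are HYPOTHETICAL, not real leaderboard values.

# Hypothetical benchmark scores per model (0-100 scale).
scores = {
    "Model A": {"gpqa": 95.0, "hle": 44.0, "lcr": 75.7},
    "Model B": {"gpqa": 94.0, "hle": 44.0, "lcr": 70.0},
    "Model C": {"gpqa": 91.1, "hle": 38.0, "lcr": 65.0},
}

# Each use case weights the benchmarks differently (weights sum to 1).
use_case_weights = {
    "health_reasoning": {"gpqa": 0.7, "hle": 0.3, "lcr": 0.0},
    "report_writing":   {"gpqa": 0.0, "hle": 0.2, "lcr": 0.8},
}

def rank(use_case: str) -> list[tuple[str, float]]:
    """Return (model, weighted score) pairs sorted best-first."""
    weights = use_case_weights[use_case]
    weighted = {
        model: sum(weights[bench] * score for bench, score in bench_scores.items())
        for model, bench_scores in scores.items()
    }
    return sorted(weighted.items(), key=lambda kv: kv[1], reverse=True)

for model, score in rank("health_reasoning"):
    print(f"{model}: {score:.1f}")
```

The weighting step is exactly where the GPQA-as-health-proxy question (discussed below in the thread) matters: if the “health” column is really weighting a physics/chemistry benchmark, the ranking inherits that mismatch.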

1 Like

The above ranking was based upon the artificialanalysis.ai site, an independent, reliable organization, but of course they only carry a limited number of main benchmarks. According to Gemini 3.1 Pro, GPQA Diamond, although a plausible proxy, is not specific enough.

Is GPQA Diamond a Good Proxy for Health Issue Accuracy?

The short answer is no. It is completely understandable to assume that an advanced science benchmark would translate to medical competence. However, while GPQA Diamond is a fantastic indicator of a model’s raw intelligence and logical reasoning capabilities, it is not a reliable proxy for evaluating an LLM’s accuracy, safety, or “goodness” regarding human health issues.

Here is why:

  • Domain Mismatch: GPQA is strictly an academic science test (Physics, Chemistry, Biology). While biology and chemistry are foundational to medicine, understanding graduate-level quantum mechanics or molecular biology does not equate to understanding clinical diagnostics, pharmacology, or standard-of-care practices.
  • Theoretical vs. Clinical Application: GPQA evaluates theoretical scientific problem-solving. Navigating health issues requires clinical judgment—such as weighing patient histories, interpreting symptoms, and understanding medical guidelines.
  • Safety and Nuance: Giving accurate health information requires a high degree of safety-awareness and bedside manner (knowing when to advise a user to see a doctor immediately). GPQA does not measure safety guardrails or clinical empathy at all.

Better Proxies for Medical AI

If you are looking to rank an LLM’s accuracy in healthcare and medicine, you should look at benchmarks specifically tailored to the medical field:

  • MedQA: Evaluates models using questions derived from the United States Medical Licensing Examination (USMLE). It tests real clinical knowledge.
  • PubMedQA: Tests a model’s ability to answer complex biomedical questions based on medical literature and abstracts.
  • MedMCQA: A large-scale medical dataset derived from professional medical entrance exams.

The Bottom Line: A high score on GPQA Diamond tells you an AI is highly capable, logical, and excellent at hard sciences. But if you need an AI to reliably process health issues, you must evaluate it using clinical medical benchmarks, not a physics and chemistry exam.
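For intuition about what these clinical benchmarks actually measure, here is a minimal sketch of how accuracy on MedQA-style multiple-choice items is computed. The two sample items and the toy “model” are invented for illustration; a real evaluation would call an actual LLM API over the full USMLE-derived dataset.

```python
# Sketch: scoring a model on MedQA-style multiple-choice items.
# Items and the toy "model" are INVENTED for illustration only.

from dataclasses import dataclass

@dataclass
class MCQItem:
    question: str
    options: dict[str, str]  # e.g. {"A": "...", "B": "..."}
    answer: str              # gold answer key, e.g. "B"

# Two toy USMLE-style items (hypothetical, heavily simplified).
items = [
    MCQItem(
        question="First-line pharmacologic therapy for type 2 diabetes?",
        options={"A": "Insulin", "B": "Metformin", "C": "Sulfonylurea"},
        answer="B",
    ),
    MCQItem(
        question="Vitamin deficiency associated with scurvy?",
        options={"A": "Vitamin C", "B": "Vitamin D", "C": "Vitamin K"},
        answer="A",
    ),
]

def toy_model(item: MCQItem) -> str:
    """Stand-in for an LLM call; naively picks the first option key."""
    return next(iter(item.options))

def accuracy(predict, dataset) -> float:
    """Fraction of items where the predicted key matches the gold key."""
    correct = sum(predict(item) == item.answer for item in dataset)
    return correct / len(dataset)

print(f"Accuracy: {accuracy(toy_model, items):.0%}")
```

Note that a benchmark like this tests knowledge recall against a fixed answer key; it still doesn’t capture the safety and bedside-manner dimensions mentioned above, which need separate evaluation.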

1 Like

This is the latest, more refined and more specific answer. It confirms the rumours on the grapevine: ChatGPT 5.5 is probably the absolute best, followed closely by Gemini 3.1 Pro.


I’m attaching the whole answer in .pdf format
2026_Flagship_Medical_AI_Ranking.pdf (22.2 KB)

1 Like