
Implications of AI Chatbots Performing Poorly at Differential Diagnosis


Research published in JAMA Network Open shows that AI chatbots are getting better at diagnostic accuracy when presented with comprehensive clinical information, but they perform poorly at differential diagnosis when information is limited. One of the paper’s authors, Marc Succi, M.D., executive director of the MESH Incubator at Mass General Brigham, spoke with Healthcare Innovation about the implications of the research.

Succi, whose MESH Incubator is a system-wide innovation and entrepreneurship center, explained that the team did an original study in 2023 on public large language models (LLMs) and clinical decision support. The new paper is a follow-up study in which they tested 21 LLMs in a series of clinical scenarios.

“Three years later, I wanted to see what changed — if they were better or if they were worse,” he said. “There's a lot of buzz about AI replacing doctors — more so than in previous years. I felt like it was an appropriate time to re-evaluate our original study and see where the field was.”

The research team explained that for the new study they developed a more holistic measure of LLM performance that looks beyond accuracy, called PrIME-LLM, which evaluates a model’s competency across different stages of clinical reasoning: coming up with potential diagnoses, conducting appropriate tests, arriving at a final diagnosis, and managing treatment. When a model performs well in one area but poorly in another, that imbalance is reflected in the PrIME-LLM score, as opposed to averaging competency across tasks, which can mask areas of weakness.
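The article does not give the PrIME-LLM formula, but the general idea, that averaging stage scores can hide a weak stage while an imbalance-sensitive aggregate exposes it, can be sketched as follows. The stage names, numbers, and the use of a worst-stage minimum here are illustrative assumptions, not the paper's actual method.

```python
# Hypothetical illustration of why averaging per-stage scores can mask a
# weak stage, while an imbalance-sensitive aggregate (here, the minimum
# stage score) exposes it. The real PrIME-LLM formula is not described in
# this article; all names and numbers below are made up for illustration.

def mean_score(stages):
    """Simple average: a strong stage can compensate for a weak one."""
    return sum(stages.values()) / len(stages)

def bottleneck_score(stages):
    """Worst-stage aggregate: the score is capped by the weakest stage."""
    return min(stages.values())

# Model A is uniformly decent; Model B aces the final diagnosis but fails
# the differential, the pattern the study reports.
model_a = {"differential": 0.70, "testing": 0.70, "final": 0.70, "management": 0.70}
model_b = {"differential": 0.20, "testing": 0.80, "final": 0.95, "management": 0.85}

print(mean_score(model_a), mean_score(model_b))              # identical averages: 0.70 each
print(bottleneck_score(model_a), bottleneck_score(model_b))  # 0.70 vs 0.20: B's weakness shows
```

Under a plain average the two hypothetical models look interchangeable; only an aggregate that penalizes imbalance reveals that Model B fails at the earliest stage of the visit.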

Succi said that what these models do well is arrive at a final diagnosis when it is an open-book test and they have all the information, images and lab tests included, organized well. “If you feed them really good information, they're good at making a diagnosis,” he said. “But unfortunately, that's not how medicine is practiced, so they're very poor — just like in the original study — at making a differential diagnosis, which is at the earliest part of the medical visit.”

A patient might come in to the ED with shortness of breath, and the physician may know only the patient's demographics, he said. There are one to five plausible diagnoses, and the physician has to work from minimal, uncertain information to determine which lab tests to order, which in turn determines how much information is gathered and how fast the final diagnosis is reached. “That is where they actually failed more than 80% of the time in getting the full list of the differential diagnoses,” Succi said. “For me, the art of medicine is physicians navigating uncertain, weak, disparate information toward the final diagnosis. So that's where all the AI models come up short.”

I asked Succi whether they could get better at that aspect of the physician’s role or if there was some limiting factor here. 

He responded that he had thought they would be better. But his belief is that it's an inherent limit of the architecture of LLMs because they're pattern predictors. “To predict patterns, you need to have as much information as possible. But they're not very good at getting that information. Just like hallucinations are always going to be baked in — you can try to minimize them. You can try to have non-doctors provide information, and have patients fill out forms, but that’s always going to be a limitation.”

He said the research reinforces the idea that LLMs are not ready for prime-time clinical decision support, but he said he’s hopeful that they will continue to prove useful in tasks like ambient documentation. “Those are great use cases because they're low-risk. This just supports the need for more humans in the loop to critically appraise the output of these LLMs, because if you have a patient reading the output and the LLMs sound confident, they can be confidently wrong.”

But what if the study had found the LLMs were great at differential diagnosis? What would be the implications for health systems? Wouldn't there be huge issues about transparency and liability of trying to deploy them in higher-risk settings?

Succi responded that even if they were great at everything, including the differential diagnoses, issues around regulation and liability are unsolved.

“I always think about how planes can be operated essentially autonomously. I still wouldn't get on a plane without a pilot,” he said. “While I think the technology may get there in five to 20 years, in terms of actually implementing it for use at scale, I don't think that's going to happen for decades.”

I asked about using LLMs to augment clinical reasoning, whether clinicians in practice and medical schools are having to work through how much they should use LLMs, and whether people might become too reliant on them.

Succi noted that he is on the board of a medical school in Boston that is grappling with this exact question. They are exposing medical students in their first year to understanding how to use LLMs and how to appraise the output, because many of the LLMs don't explain themselves, he said. He added that there seems to be a push for policies in medical schools and residency programs to limit the allowed use of LLMs, somewhat like taking a math test without a calculator, where you have to learn the underlying mechanics first.

“I think schools are grappling with how much they should allow students to use it as well as residents and faculty,” Succi said. “The other issue I see is a lot of de-skilling, where over-reliance on this technology, even in the course of months, can de-skill even seasoned physicians on how to do procedures, how to read and write notes. It's really a muscle memory function, so that's something I'm a little concerned about, to be honest, but we're keeping an eye on it.”
