Are we ready for Doctor AI?

ChatGPT, Gemini, Claude, and other Large Language Models (LLMs) are impressive at medical diagnosis; in one small study, GPT-4 outperformed physicians at diagnosing illness. A closer look, however, shows that AI in medical diagnosis is another example of the cognitive dissonance of AI.

  • Thought – A paper by researchers at the University of Oxford found that LLMs could correctly identify relevant conditions 94.9% of the time when presented directly with test scenarios.
  • Thought – Human participants using LLMs to diagnose the same scenarios identified the relevant conditions less than 34.5% of the time.

What went wrong?

Looking back at the transcripts, researchers found that participants provided incomplete information to the LLMs and that the LLMs misinterpreted their prompts. For instance, one participant who was supposed to exhibit symptoms of gallstones merely told the LLM: “I get severe stomach pains lasting up to an hour. It can make me vomit and seems to coincide with a takeaway,” omitting the location of the pain, its severity, and its frequency.

It appears physicians know how to identify the relevant details and how to state them clearly to a chatbot; lay users often do not. The Oxford study highlights a problem, not with humans or even with LLMs, but with the way we sometimes measure LLM performance.
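To make that measurement point concrete, here is a minimal sketch in Python of the two evaluation setups. Every name below (query_llm, paraphrase, the vignette fields, the stub model) is a hypothetical illustration, not the study's actual code or methodology:

```python
# A minimal sketch (not the study's code) contrasting the two ways
# diagnostic accuracy gets measured: feeding the model the complete
# test scenario vs. feeding it only what a human user chooses to say.

def direct_eval(vignettes, query_llm):
    """Benchmark-style: the model sees the complete test scenario."""
    hits = sum(case["condition"] in query_llm(case["full_scenario"])
               for case in vignettes)
    return hits / len(vignettes)

def human_mediated_eval(vignettes, query_llm, paraphrase):
    """Human-in-the-loop: the model sees only the user's retelling."""
    hits = sum(case["condition"] in query_llm(paraphrase(case["full_scenario"]))
               for case in vignettes)
    return hits / len(vignettes)

if __name__ == "__main__":
    vignettes = [{
        "full_scenario": ("severe right-upper-quadrant pain lasting up to an "
                          "hour, with vomiting, recurring after takeaway meals"),
        "condition": "gallstones",
    }]
    # Stub model: answers correctly only if the key detail survives.
    stub_llm = lambda text: ("gallstones" if "right-upper-quadrant" in text
                             else "indigestion")
    # Lossy retelling, like the gallstones participant above: the
    # location of the pain is dropped entirely.
    lossy = lambda text: "I get severe stomach pains lasting up to an hour."

    print(direct_eval(vignettes, stub_llm))                 # 1.0
    print(human_mediated_eval(vignettes, stub_llm, lossy))  # 0.0
```

Same model, same cases; the only thing that changes is whether the input channel preserves the clinically relevant details. That is the gap the 94.9% and 34.5% figures straddle.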

  • Thought – LLMs can pass medical licensing tests, real estate licensing exams, and state bar exams.
  • Thought – Yet LLMs often provide poor personal medical, real estate, and legal advice.