ChatGPT-4 is up to clinical reasoning


In another peek at where we’re up to with AI assisting with our jobs, this research finds that it’s worse at some bits, better at others, and frequently just as good.


Given the same medical data as attending physicians and residents at two large Boston medical centres in 2023, OpenAI’s ChatGPT-4 performed on par with both residents and attendings in diagnostic accuracy, overall clinical reasoning and cannot-miss diagnoses (defined as those posing “imminent threat to life or limb”).

But its clinical reasoning was more frequently wrong than that of the humans. 

Twenty virtual clinical presentations were chosen from a set developed to teach and test clinical reasoning – prioritised differential diagnosis and illness script accuracy, problem representations, diagnostic probabilities and management plans. These cases were written and edited by expert practising doctors. 

They covered pharyngitis, headache, abdominal pain, cough, dyspnoea, chest pain and arthralgia. New information was presented at four stages: triage presentation, review of systems, physical examination, and diagnostic testing.

Twenty-one attending doctors and 18 residents got one randomly selected case each. The AI (a large language model, freely available online) did all 20. 

The responses were then marked for overall “evaluation of demonstration of clinical reasoning in the assessment section of admission notes”, including interpretive summary, differential diagnosis, explanation of lead diagnosis and alternative diagnosis explanation. See here for the complete rubric.

In clinical reasoning, ChatGPT-4 scored higher than doctors overall (a median score of 10 out of a possible 10, versus nine for attendings and eight for residents). But incorrect reasoning was detected more frequently in the AI’s answers than in the residents’ answers (in 11 cases for the AI, versus two for residents and 10 for attendings).

For those playing along at home, this is what the physicians and ChatGPT-4 were asked to do:  

You are an internal medicine clinician that is an expert at clinical reasoning. You are caring for the patient in the following case. You will be provided with four sections of the medical case scenario as it unfolds. For each section, provide a problem representation or a one sentence summary statement that highlights the most important elements from the case so far. Also provide your prioritized differential diagnosis with justification. 

Please document your thinking just as you would if you were in a real healthcare setting so that we can evaluate your clinical reasoning. 

And here is one of the presentations given to the residents, attendings and the AI.

A 21-year-old male wrestler presents with a 6-week history of progressive right knee pain that is dull in nature, exacerbated by bending and prolonged standing, and more pronounced at the end of the day, without significant trauma but associated with a limp. 

ChatGPT-4 suggested the following differential diagnosis with justification: 

  • Patellofemoral pain syndrome  
    Justification: Patellofemoral pain syndrome, also known as runner’s knee, is common in young adults, particularly athletes. The patient’s description of pain when bending the knee and after prolonged standing is characteristic of this condition. This pain arises from the patellofemoral joint and is often due to the malalignment or tracking issues of the kneecap. Wrestling, a weight-bearing sport with frequent bending and kneeling, may place strain on this joint. 
  • Meniscal Injury 
    Justification: While the patient did not describe an acute trauma, the repetitive stress and various knee positions in wrestling can lead to meniscal tears. The pain and the potential limp can be due to the mechanical obstruction caused by the tear. 
  • Osgood-Schlatter Disease 
    Justification: This is an inflammation of the patellar ligament at the tibial tuberosity. It’s commonly seen in adolescents and young adults involved in sports. The pain is often anterior and can be related to activity levels. 
  • Patellar tendonitis (Jumper’s Knee) 
    Justification: Overuse injury common in sports that involve jumping or rapid changes in direction. Wrestling may put strain on the patellar tendon, leading to inflammation. 
  • Bursitis 
    Justification: Repetitive friction from wrestling can lead to inflammation of the bursa, causing pain especially when bending the knee. 
  • Early Osteoarthritis 
    Justification: While less common in such a young individual, repeated trauma or stress to the knee joint due to sports can sometimes initiate early degenerative changes in the joint. 

 “Further history, physical examination, and potentially imaging will be key in narrowing down this differential diagnosis and arriving at a definitive diagnosis,” said ChatGPT-4. 

The human offered “Problem representation: Monoarticular arthropathy”, with the following differential diagnosis/justification: 

“Illness script is very limited (no review of systems or more advanced list of pertinent positives or negatives) so the differential is still extremely broad. However, based on the limited information (young male, presumably sexually active, wrestler) would consider musculoskeletal etiology or infectious etiology (sexually transmitted vs other endemic process, i.e. Lyme) at present. Rheumatologic process not exonerated based on the above.” 

Participants were then given more information on the case: 

A 21-year-old male with a history of potential exposure to Lyme in an endemic area and an unexplained rash, presenting with a significant right knee effusion, with MRI evidence of synovitis and a joint aspirate showing an inflammatory process with a predominance of polymorphonuclear cells but no evidence of infection or crystal-induced arthritis. 

ChatGPT-4 offered the following options: 

  • Lyme arthritis  
    Justification: The presentation remains consistent with Lyme arthritis. Joint aspirate in Lyme arthritis typically shows an inflammatory picture with white blood cells predominantly polymorphonuclear cells, but it’s usually sterile, which aligns with the Gram stain and culture results here. The history of camping in New Hampshire and the rash are highly suggestive.
  • Reactive arthritis 
    Justification: The inflammatory nature of the joint fluid and history of recent sexual encounter could still point towards reactive arthritis. Reactive arthritis typically presents one-four weeks after an inciting infection but often the causative organism isn’t identified directly in the joint.
  • Septic arthritis 
    Justification: Despite the sterile gram stain and negative culture, it’s essential to ensure septic arthritis is ruled out entirely, given the potential for fast progression and joint destruction. The high polymorphonuclear count is concerning. However, some bacteria, like gonococcus, can sometimes evade detection in standard cultures. Given the recent sexual encounter (even though protected), gonococcal arthritis could be considered, though usually accompanied by other symptoms. 
  • Rheumatoid arthritis 
    Justification: The aspirate shows inflammation and the family history of rheumatoid arthritis still keeps it on the list, but the isolated joint involvement and the absence of systemic markers of inflammation make rheumatoid arthritis less probable. 
  • Traumatic synovitis
    Justification: The knee effusion and synovitis might result from repeated minor trauma, especially in an active individual like a wrestler. However, this doesn’t explain the rash or potential exposure history.
  • Other inflammatory arthropathies 
    Justification: Conditions like psoriatic arthritis or ankylosing spondylitis might be considered, but the patient lacks typical features (like back pain, skin changes, or family history). 

“At this point, Lyme arthritis remains at the top of the differential, given the constellation of symptoms, history, and investigations. Serologic tests for Lyme disease (like ELISA followed by Western blot) should be considered to confirm the diagnosis. The presence of synovitis without evident infection, the polymorphonuclear-dominant cell count in the joint fluid, the history of potential erythema migrans, and the exposure history all strengthen the suspicion of Lyme arthritis,” said ChatGPT-4. 

The human said: 

“Problem representation: Monoarticular arthropathy Differential Diagnosis / Justification: Illness script is very limited (no review of systems or more advanced list of pertinent positives or negatives) so the differential is still extremely broad. However, based on the limited information (young male, presumably sexually active, wrestler) would consider musculoskeletal etiology or infectious etiology (sexually transmitted vs other endemic process, i.e. Lyme) at present. Rheumatologic process not exonerated based on the above.” 

Despite the overall clinical reasoning ability demonstrated above, the AI also came up with some real clangers. For example:

  • Case diagnosis: diverticulitis. GPT-4 response: “An ectopic pregnancy is extremely unlikely given her age but should not be entirely dismissed without further investigation.” Reason this response is incorrect: the patient is a 71-year-old woman. Physicians would not perform a pregnancy test in this situation. 
  • Case diagnosis: diverticulitis. GPT-4 response: “Given her recent travel to Texas, a gastrointestinal infection might be considered, though the localised nature of her pain and the lack of diarrhea make it less likely.” Reason this response is incorrect: while traveller’s diarrhea is common, travel to the state of Texas is not generally considered a risk factor. 
  • Case diagnosis: Gastroesophageal reflux disease. GPT-4 response: “Acute coronary syndrome: Even with the added information, this still remains high on the differential due to the age, male gender, risk factors (hypertension, hyperlipidemia, family history of heart disease), and the nature of the pain. The lack of dyspnea on exertion, paroxysmal nocturnal dyspnea, and orthopnea does decrease the likelihood slightly, but it can’t be ruled out.” Reason response is incorrect: dyspnea on exertion, paroxysmal nocturnal dyspnea, and orthopnea are all findings that are more commonly associated with heart failure rather than acute coronary syndrome. 
  • Case diagnosis: Heart failure exacerbation. GPT-4 response: “Venous insufficiency: Given the bilateral lower extremity edema and the jugular venous distension, chronic venous insufficiency is a consideration.” Reason response is incorrect: while venous insufficiency can cause lower extremity edema, it does not generally cause distension of the jugular veins. 

Everybody scored highly for clinical diagnosis. ChatGPT-4 had the same score as residents for cannot-miss diagnoses (67%), with attendings scoring 50%.

Refreshingly, all the data from this study is freely provided by the authors to anyone who wants it. That’s part of setting the example for how we need to engage with AI in the health setting, author Dr Adam Rodman, a general internist at Beth Israel Deaconess Medical Center in Boston (USA) and co-director of the Innovations in Media and Education Delivery (iMED) Initiative, told TMR.

“I’m not an AI researcher (or at least, not a computer scientist or informatician). I’m a practicing physician who researches how human physicians think,” Dr Rodman said.  

“What I’m hoping to do is to encourage the highest clinical standards in AI research ― focusing on actual workflows, clinically meaningful interventions, and holding us to the standards that we do in other parts of clinical research (leading up to RCTs). In this case, as an experiment ‘Turing test’ type study, having our data open is part of that.”  

JAMA Intern Med 2024, online 1 April 
