Med-PaLM
Google’s answer to ChatGPT is its AI chatbot called Med-PaLM, a large language model (LLM).
LLMs are the technical heart of these bots, so it is important to understand what they are. Rather than harbouring some digital form of human thought, the term is a grand name for a brute statistical process: the bots analyse vast amounts of data and learn the patterns and connections between the words and phrases used in it.
Through this statistical approach, the bot has no semantic understanding of what it generates, but it can mimic human responses. The hope is that, if this works reliably, bots can be used for medical knowledge retrieval, clinical decision support and creating summaries of key findings, as well as for triage when given information about patient conditions.
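The "patterns and connections" idea can be sketched with a toy example. The following is a minimal, hypothetical next-word predictor that simply counts which word tends to follow which; real LLMs use neural networks trained on vastly more data, but the underlying statistical principle — predict the likely continuation from observed patterns — is the same.

```python
from collections import Counter, defaultdict

# Toy illustration of the statistical idea behind LLMs: learn which
# word tends to follow which, then predict by picking the most common
# continuation seen in the training text.
corpus = (
    "chest pain may indicate a cardiac event . "
    "chest pain may also be muscular . "
    "abdominal pain may indicate appendicitis ."
).split()

# Count, for each word, what follows it.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict(word):
    """Return the continuation most often seen after `word`."""
    return follows[word].most_common(1)[0][0]

print(predict("pain"))  # 'may' — the most frequent word after 'pain' here
```

The predictor has no idea what "pain" means; it only knows what usually comes next, which is exactly the sense in which the article says these bots lack semantic understanding.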
In a paper published in Nature in July, scientists from Google Research gave Med-PaLM the usual test-and-learn treatment, asking it more than 3000 common medical questions and seeing if it could come up with something clinically useful.
They found that 93% of the time it generated answers on a par with doctor-generated responses.
When it came to the type of questions used in the US Medical Licensing Examination, it was less successful, but it still offered accurate responses 68% of the time, a score it should improve on in future.
A more important endpoint is how often wrong answers could be clinically risky.
Med-PaLM generated answers considered potentially harmful 5.9% of the time. The paper gives few details on just how risky those answers were (life-threatening?), but the rate compares with 5.7% for the answers coming from doctors.
Dr Karen DeSalvo told The Guardian that LLMs could become the “best intern for a doctor by placing every textbook in the world at their fingertips”.
But she also said they should never replace a doctor’s diagnosis and treatment.
And she acknowledged the bots’ weaknesses: they are prone to “AI hallucinations”, making up material when they do not ‘know’ how to respond to the question asked.
One solution to this unreliability is for the bots to acknowledge doubt or, as the Google researchers put it, to give a risk rating along with their responses.
However, the researchers acknowledged that “uncertainty measures over LLM output sequences remains an open area of research”, which feels like a way of saying they are not sure it will ever be possible for LLMs to know what they don’t know, a core skill for doctors.
As a result, they say “guardrails to mitigate against over-reliance on the output of a medical assistant” may be needed.
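One way to read the researchers’ “risk rating” idea is as an abstention rule: only surface an answer when the model’s own confidence clears a threshold, and otherwise defer to a clinician. The sketch below is a hypothetical illustration of that guardrail; the confidence score, threshold and function names are assumptions for illustration, not anything described in the Nature paper.

```python
# Hypothetical sketch of a "risk rating" guardrail: the model returns
# an answer plus a self-reported confidence, and low-confidence answers
# are deferred to a clinician instead of being shown to the user.
CONFIDENCE_THRESHOLD = 0.8  # assumed cut-off, not from the paper

def answer_with_guardrail(model_output):
    """model_output: a (answer_text, confidence) pair, confidence in [0, 1]."""
    answer, confidence = model_output
    if confidence >= CONFIDENCE_THRESHOLD:
        return f"{answer} (confidence {confidence:.0%})"
    return "Uncertain - please consult a clinician."

print(answer_with_guardrail(("Likely appendicitis", 0.92)))
print(answer_with_guardrail(("Possibly an ovarian cyst", 0.41)))
```

The hard part, as the researchers note, is producing a confidence number that actually tracks correctness; the guardrail logic itself is trivial by comparison.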
Emergency medicine
There is a good example of how bad the technology can be.
Back in April, a US emergency physician, Dr Josh Tamayo-Sarver, read about the alleged wonders of ChatGPT and decided to feed it the detailed medical histories of 35 to 40 of his patients, including the symptoms that brought them into the ED. The notes were anonymised. The question he asked ChatGPT: “What are the differential diagnoses for this patient presenting to the [ED] [insert patient notes here]?”
“OpenAI’s chatbot did a decent job of bringing up common diagnoses I wouldn’t want to miss, as long as everything I told it was precise and highly detailed,” he said.
“Correctly diagnosing a patient as having nursemaid’s elbow, for instance, required about 200 words; identifying another patient’s orbital wall blowout fracture took the entire 600 words of my [medical notes] on them.”
He said ChatGPT came up with six possible diagnoses for around half his patients, one of which he said was the correct one.
But he noted that a 50% hit rate in the context of emergency medicine is not a good outcome.
“ChatGPT’s worst performance happened with a 21-year-old female patient who came into the [ED] with right lower quadrant abdominal pain.
“I fed her [notes] into ChatGPT, which instantly came back with a differential diagnosis of appendicitis or an ovarian cyst, among other possibilities.”
But he said ChatGPT missed a “somewhat important diagnosis”: she had an ectopic pregnancy.
“Diagnosed too late, it can be fatal, resulting in death from internal bleeding. Fortunately for my patient, we were able to rush her into the operating room for immediate treatment.”
He continued: “Notably, when she saw me in the [ED], this patient did not even know she was pregnant.
“This is not an atypical scenario, and often only emerges after some gentle inquiring…
“We will often ask: ‘Any chance you’re pregnant?’
“Sometimes a patient will reply with something like ‘I can’t be.’
“‘But how do you know?’
“If the response to that follow-up does not refer to an IUD or a specific medical condition, it’s more likely the patient is actually saying they don’t want to be pregnant, for any number of reasons: infidelity, trouble with the family, or other external factors.
“Again, this is not an uncommon scenario; about 8% of pregnancies discovered in the [ED] are of women who report that they’re not sexually active.
“But looking through ChatGPT’s diagnosis, I noticed not a single thing in its response suggested my patient was pregnant. It didn’t even know to ask.”