Med-PaLM
Google’s answer to ChatGPT is its AI chatbot called Med-PaLM, a large language model (LLM).
LLMs are the technical heart of these bots, so it is important to understand what they are. Rather than harbouring some digital form of human thought, the term is a grand name for a brute statistical process: the bots analyse vast amounts of data and learn the patterns and connections between the words and phrases used in it.
Through this statistical approach, the bot has no semantic understanding of what it generates, but it can mimic human responses. The hope is that, if this works reliably, bots can be used for medical knowledge retrieval, clinical decision support and creating summaries of key findings, as well as for triage when given information about patient conditions.
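The "patterns and connections" idea can be sketched with a toy example. The following is a minimal, hypothetical next-word predictor that simply counts which word tends to follow which; real LLMs use neural networks trained on vastly more data, but the underlying statistical principle — predict the likely continuation from observed patterns — is the same.

```python
from collections import Counter, defaultdict

# Toy illustration of the statistical idea behind LLMs: learn which
# word tends to follow which, then predict by picking the most common
# continuation seen in the training text.
corpus = (
    "chest pain may indicate a cardiac event . "
    "chest pain may also be muscular . "
    "abdominal pain may indicate appendicitis ."
).split()

# Count, for each word, what follows it.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict(word):
    """Return the continuation most often seen after `word`."""
    return follows[word].most_common(1)[0][0]

print(predict("pain"))  # 'may' — the most frequent word after 'pain' here
```

The predictor has no idea what "pain" means; it only knows what usually comes next, which is exactly the sense in which the article says these bots lack semantic understanding.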
In a paper published in Nature in July, scientists from Google Research gave Med-PaLM the usual test-and-learn treatment, asking it more than 3000 common medical questions and seeing if it could come up with something clinically useful.
They found that 93% of the time it generated answers on a par with doctor-generated responses.
When it came to the type of questions used in the US Medical Licensing Examination, it was less successful, but it still offered accurate responses 68% of the time, a score it should improve on in future.
A more important endpoint is how often wrong answers could be clinically risky.
Med-PaLM generated answers considered potentially harmful 5.9% of the time. The paper gives few details on just how risky those answers were (life-threatening?), but the rate compares with 5.7% for the answers coming from doctors.
Dr Karen DeSalvo told The Guardian that LLMs could become the “best intern for a doctor by placing every textbook in the world at their fingertips”.
But she also said they should never replace a doctor’s diagnosis and treatment.
And she acknowledged the bots’ weaknesses: they are prone to “AI hallucinations”, making up material when they do not ‘know’ how to respond to the question asked.
One solution to this unreliability is for the bots to acknowledge doubt or, as the Google researchers put it, to give a risk rating along with their responses.
However, the researchers acknowledged that “uncertainty measures over LLM output sequences remains an open area of research”, which feels like a way of saying they are not sure it will ever be possible for LLMs to know what they don’t know, a core skill for doctors.
As a result, they say “guardrails to mitigate against over-reliance on the output of a medical assistant” may be needed.
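One way to read the researchers’ “risk rating” idea is as an abstention rule: only surface an answer when the model’s own confidence clears a threshold, and otherwise defer to a clinician. The sketch below is a hypothetical illustration of that guardrail; the confidence score, threshold and function names are assumptions for illustration, not anything described in the Nature paper.

```python
# Hypothetical sketch of a "risk rating" guardrail: the model returns
# an answer plus a self-reported confidence, and low-confidence answers
# are deferred to a clinician instead of being shown to the user.
CONFIDENCE_THRESHOLD = 0.8  # assumed cut-off, not from the paper

def answer_with_guardrail(model_output):
    """model_output: a (answer_text, confidence) pair, confidence in [0, 1]."""
    answer, confidence = model_output
    if confidence >= CONFIDENCE_THRESHOLD:
        return f"{answer} (confidence {confidence:.0%})"
    return "Uncertain - please consult a clinician."

print(answer_with_guardrail(("Likely appendicitis", 0.92)))
print(answer_with_guardrail(("Possibly an ovarian cyst", 0.41)))
```

The hard part, as the researchers note, is producing a confidence number that actually tracks correctness; the guardrail logic itself is trivial by comparison.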
Emergency medicine
There is a good example of how bad the technology can be.
Back in April, a US emergency physician, Dr Josh Tamayo-Sarver, read about the alleged wonders of ChatGPT and decided to feed it the detailed medical histories of 35 to 40 of his patients, including the symptoms that brought them into the ED. The notes were anonymised. The question he asked ChatGPT: “What are the differential diagnoses for this patient presenting to the [ED] [insert patient notes here]?”
“OpenAI’s chatbot did a decent job of bringing up common diagnoses I wouldn’t want to miss, as long as everything I told it was precise and highly detailed,” he said.
“Correctly diagnosing a patient as having nursemaid’s elbow, for instance, required about 200 words; identifying another patient’s orbital wall blowout fracture took the entire 600 words of my [medical notes] on them.”
He said ChatGPT came up with six possible diagnoses for around half his patients, one of which he said was the correct one.
But he noted that a 50% hit rate in the context of emergency medicine is not a good outcome.
“ChatGPT’s worst performance happened with a 21-year-old female patient who came into the [ED] with right lower quadrant abdominal pain.
“I fed her [notes] into ChatGPT, which instantly came back with a differential diagnosis of appendicitis or an ovarian cyst, among other possibilities.”
But he said ChatGPT missed a “somewhat important diagnosis”: she had an ectopic pregnancy.
“Diagnosed too late, it can be fatal, resulting in death from internal bleeding. Fortunately for my patient, we were able to rush her into the operating room for immediate treatment.”
He continued: “Notably, when she saw me in the [ED], this patient did not even know she was pregnant.
“This is not an atypical scenario, and often only emerges after some gentle inquiring…
“We will often ask: ‘Any chance you’re pregnant?’
“Sometimes a patient will reply with something like ‘I can’t be.’
“‘But how do you know?’
“If the response to that follow-up does not refer to an IUD or a specific medical condition, it’s more likely the patient is actually saying they don’t want to be pregnant, for any number of reasons: infidelity, trouble with the family, or other external factors.
“Again, this is not an uncommon scenario; about 8% of pregnancies discovered in the [ED] are of women who report that they’re not sexually active.
“But looking through ChatGPT’s diagnosis, I noticed not a single thing in its response suggested my patient was pregnant. It didn’t even know to ask.”