February 16, 2024 | V. "Juggy" Jagannathan
This week, I am looking at two articles, both bearing on the overall topic of the future of digital health.
I listened to an episode of The Future of Everything podcast from Stanford University with Professor Eleni Linos, head of the Stanford Center for Digital Health. The conversation was quite interesting.
Professor Linos referred to several recent studies done at Stanford. One study that caught my attention involved assessing the bias in ChatGPT and Bard and comparing it to the bias exhibited by physicians. That study showed that when you provide a clinical vignette and vary the patient's gender, race, ethnicity and socioeconomic status, the responses vary as well, for both large language models (LLMs) and physicians.
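To make that setup concrete, here is a minimal sketch, not the study's actual code, of how one might probe a chatbot for demographic sensitivity: hold the clinical facts of a vignette fixed, swap in different patient demographics, and compare the advice that comes back. The query_model function is a hypothetical stand-in for whatever chat API is under test.

```python
from itertools import product

# Hypothetical stand-in for a call to ChatGPT, Bard, or any chat model being audited.
def query_model(prompt: str) -> str:
    raise NotImplementedError("plug in the chat API you want to audit")

VIGNETTE = (
    "A {age}-year-old {race} {gender} patient with {ses} socioeconomic status "
    "presents with acute chest pain radiating to the left arm. "
    "What workup and management would you recommend?"
)

# Vary the demographic attributes while holding the clinical facts constant.
demographics = {
    "age": ["45"],
    "race": ["white", "Black", "Asian", "Hispanic"],
    "gender": ["male", "female"],
    "ses": ["high", "low"],
}

responses = {}
for age, race, gender, ses in product(*demographics.values()):
    prompt = VIGNETTE.format(age=age, race=race, gender=gender, ses=ses)
    responses[(age, race, gender, ses)] = query_model(prompt)

# Any systematic difference in the recommendations across these keys is a bias
# signal worth flagging for human review.
```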
The study underscores that we have some way to go before these artificial intelligence (AI) chatbots can be routinely used to provide clinical advice, and urgent research is needed. A detailed commentary accompanying the paper laid out what needs to be done to address bias in such models: the data used to train a model should include representative samples; evaluation needs to be rigorous across all target subgroups; the stakeholders building the technology need to represent all perspectives; and, finally, there need to be guiding ethical principles that drive the evaluations.
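That second point, rigorous evaluation across subgroups, can be made quite mechanical: never report a single aggregate score. Here is a rough sketch, under the assumption that each test case has been labeled with the demographic subgroup it represents, of breaking accuracy out per subgroup and flagging gaps.

```python
from collections import defaultdict

def subgroup_accuracy(results):
    """results: list of (subgroup, was_response_appropriate) pairs."""
    totals, correct = defaultdict(int), defaultdict(int)
    for group, ok in results:
        totals[group] += 1
        correct[group] += int(ok)
    return {g: correct[g] / totals[g] for g in totals}

# Toy example: three judged responses labeled with their subgroup.
scores = subgroup_accuracy([
    ("Black female, low SES", True),
    ("white male, high SES", True),
    ("Black female, low SES", False),
])

# Flag any subgroup whose score falls well below the best-performing one.
worst, best = min(scores.values()), max(scores.values())
if best - worst > 0.05:  # the threshold here is an arbitrary illustration
    print("Performance gap across subgroups:", scores)
```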
Professor Linos, a dermatologist, also commented that although smartphone pictures are quite good for tele-dermatology, there are discrepancies in diagnosis that vary with skin color. She has now taken up the mantle of director of the Stanford Center for Digital Health. Practically everything in health has a digital component now! She is going to be quite busy.
I came across a new paper on arXiv, the repository where computer science and AI preprints are posted, from researchers at Google and DeepMind. This one is titled “Towards Conversational Diagnostic AI.” There is also an accompanying blog post, as Google often provides. It describes a new system they are building, dubbed the Articulate Medical Intelligence Explorer (AMIE). What exactly is this system? Essentially, it’s a chatbot that behaves like a physician. You may recall that during the pandemic there were a slew of applications that attempted to diagnose the symptoms of a panicked populace. At that time, they were trying to assess the probability of someone coming down with COVID-19. Fast forward to now: this is a significantly more ambitious chatbot.
So, what does the AMIE chatbot do? It follows the same blueprint for questioning the patient that doctors do. It systematically elicits evidence from the patient on their chief complaint, the review of systems and the various history elements: family history, past medical history, medication history and so on. Then it arrives at a diagnosis and treatment plan, all through the chat interface. The paper sketches how they trained the system and how they evaluated it. Let’s briefly explore how they did this.
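The paper does not hand-code this workflow (AMIE is a fine-tuned LLM end to end), but the history-taking blueprint it follows can be sketched schematically, with made-up section names and methods, roughly like this:

```python
from dataclasses import dataclass, field

# Schematic of the history-taking blueprint described above; AMIE itself is a
# fine-tuned LLM, not a hand-coded state machine like this one.
HISTORY_SECTIONS = [
    "chief complaint",
    "history of present illness",
    "review of systems",
    "past medical history",
    "medication history",
    "family history",
    "social history",
]

@dataclass
class Encounter:
    answers: dict = field(default_factory=dict)

    def next_question(self) -> str | None:
        # Ask about the first section not yet covered; None means history is complete.
        for section in HISTORY_SECTIONS:
            if section not in self.answers:
                return f"Tell me about your {section}."
        return None

    def record(self, section: str, patient_reply: str) -> None:
        self.answers[section] = patient_reply

    def assessment(self) -> str:
        # In AMIE, this is where the LLM reasons over everything elicited so far
        # to produce a differential diagnosis and management plan.
        return "Diagnosis and treatment plan based on: " + ", ".join(self.answers)
```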
They start with standard datasets to train AMIE, then use some creative strategies. LLMs have become so powerful that I have seen more and more instances of their use in generating training and fine-tuning data. Starting with a diagnosis, an LLM agent (in their case, PaLM 2) is used to first generate a clinical vignette resembling the cases clinicians see in medical exams. Another LLM agent uses this clinical vignette to create a simulated dialogue between doctor and patient. A third LLM agent then plays the role of a critic to improve the dialogue. The end result is high-quality dialogues at scale. This dataset is used to fine-tune the AMIE system. The AMIE chatbot also incorporates a chain-of-reasoning approach to systematically comb through the evidence elicited from the patient and craft a proper response at every turn.
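Here is a rough sketch of that self-play data generation loop, under my reading of the paper. The llm function is a hypothetical wrapper around whatever base model is used (PaLM 2, in their case), and the prompts are illustrative, not the authors' actual prompts.

```python
# Hypothetical wrapper around the base LLM used for data generation.
def llm(prompt: str) -> str:
    raise NotImplementedError("call your LLM of choice here")

def make_training_dialogue(diagnosis: str, critique_rounds: int = 2) -> str:
    # Agent 1: turn a diagnosis into a realistic clinical vignette.
    vignette = llm(f"Write a clinical vignette, in the style of a medical exam case, "
                   f"for a patient whose underlying condition is: {diagnosis}")

    # Agent 2: simulate a doctor-patient dialogue grounded in that vignette.
    dialogue = llm(f"Simulate a diagnostic conversation between a doctor and this patient:\n{vignette}")

    # Agent 3: critic loop, refining the dialogue for accuracy, coverage and empathy.
    for _ in range(critique_rounds):
        critique = llm(f"Critique this doctor-patient dialogue for clinical accuracy, "
                       f"completeness of history-taking, and empathy:\n{dialogue}")
        dialogue = llm(f"Rewrite the dialogue to address this critique:\n{critique}\n\nDialogue:\n{dialogue}")

    return dialogue  # one synthetic example for fine-tuning

# Repeating this over a large list of diagnoses yields high-quality dialogues at scale.
```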
Now, let’s turn to the evaluation of the system. The researchers set up a blinded evaluation using a collection of clinical vignettes and patient actors. The patient actors were given clinical vignettes to help them answer the questions coming from the chat interface. Behind that interface, each encounter was randomly assigned to either a real primary care physician or AMIE. And the result? You guessed it (otherwise, they wouldn’t publish the paper!): AMIE performed better than the real-life primary care physicians. That is not the surprising part. They rated AMIE not only on diagnostic accuracy, but also on subjective factors such as perceived openness, honesty and empathy. And AMIE came out ahead there, too! Imagine a chatbot scoring more points on empathy than real-life physicians! However, in this case, the use of a chat interface may have biased the results. Verbose, polite responses are the forte of LLMs. I can’t see actual physicians typing up long responses!
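The study design itself is simple to sketch. Assuming pcp and chatbot are callables that produce a consult transcript for a given vignette, and rate is a rubric-based scoring function (all hypothetical names, not the paper's actual protocol), the blinded randomization and per-axis comparison look roughly like this:

```python
import random
from statistics import mean

# Each encounter is served, behind the chat interface, by either a primary care
# physician or the chatbot; the raters never know which one they are scoring.
RATING_AXES = ["diagnostic accuracy", "openness", "honesty", "empathy"]

def run_encounter(vignette, pcp, chatbot, rate):
    label, respond = random.choice([("PCP", pcp), ("AMIE", chatbot)])
    transcript = respond(vignette)  # patient actor chats with the hidden responder
    ratings = {axis: rate(transcript, axis) for axis in RATING_AXES}
    return label, ratings

def summarize(results):
    # Average each rating axis separately for physician vs. chatbot encounters.
    summary = {}
    for label in ("PCP", "AMIE"):
        summary[label] = {axis: mean(r[axis] for l, r in results if l == label)
                          for axis in RATING_AXES}
    return summary
```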
Now, these systems are not going to replace physicians, ever. The physicians for the study were from Canada, the UK and India. One can see the value of such a system in parts of the globe where there is an acute shortage of physicians. I suspect that is what is motivating the researchers: provide a tool that screens for conditions and gets patients access to appropriate help. We are seeing the future coming to life in front of our eyes, one research paper at a time.
“Juggy” Jagannathan, PhD, is an AI evangelist with four decades of experience in AI and computer science research.