Doctors are from Venus, Data Scientists from Mars: or Why AI/ML is Moving so Slowly in Healthcare
The world of healthcare may look like the most fertile field for AI/ML apps but in practice it’s fraught with barriers. These range from cultural differences, to the failure of developers to really understand the environment they are trying to enhance, to regulatory and logical Catch 22s that work against adoption.
According to data compiled by research firm Startup Health, funding for digital healthcare totalled $14.6 billion in 2018.
This is part 3 of our three-part series on AI/ML in healthcare. The content is the result of my attending the AIMed Conference last December, which is unique because its attendees are 80% clinicians and hospital CIOs and administrators rather than data scientists.
The really unique value of this conference is seeing the AI/ML landscape through the eyes of users, and not through the overly optimistic eyes of data scientists and VCs.
* In part 1 we talked about the extremely low adoption rate of AI/ML in healthcare, on the order of only 1% of hospitals.
* In part 2 we tried to set those opportunities into an orderly landscape organized around the physician and their patients.
* In part 3 we’ll describe what those physicians and administrators told us that’s holding back adoption.
Here are some of the many reasons stated by clinicians attending the AIMed Conference.
Too Many False Positives
Data scientists don’t need to think twice about the fact that all of our techniques are probabilistic and contain both false positive and false negative errors.
In healthcare, however, a false negative, that is, failing to detect a disease state, is the ultimate failure, to be avoided at all costs. As a result, applications designed, for example, to automatically detect cancer or other diseases in medical images are tuned to minimize these type 2 errors.
This necessarily increases false positives that can only be reduced by increasing overall model accuracy. And that can only happen where a large amount of training data is available. More on that later.
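The trade-off can be sketched in a few lines of Python. This is only a minimal illustration: the scores and labels below are invented, and `confusion_counts` is a hypothetical helper, not part of any product discussed at the conference. It shows that lowering the decision threshold to catch every true case also flags more healthy ones.

```python
# Hypothetical model scores (higher = more suspicious) and true labels
# (1 = disease present). All numbers are invented for illustration.
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.55, 0.60, 0.80, 0.95]
labels = [0, 0, 0, 0, 1, 0, 0, 1, 1, 1]


def confusion_counts(scores, labels, threshold):
    """Return (false_negatives, false_positives) at a decision threshold."""
    # A false negative is a diseased case scored below the threshold.
    false_negatives = sum(
        1 for s, y in zip(scores, labels) if y == 1 and s < threshold
    )
    # A false positive is a healthy case scored at or above the threshold.
    false_positives = sum(
        1 for s, y in zip(scores, labels) if y == 0 and s >= threshold
    )
    return false_negatives, false_positives


# Lowering the threshold trades missed cases for false alarms.
for threshold in (0.7, 0.5, 0.3):
    fn, fp = confusion_counts(scores, labels, threshold)
    print(f"threshold={threshold}: missed cases={fn}, false alarms={fp}")
```

On this toy data, the only threshold that misses no cases is also the one that raises the most false alarms, which is the situation the radiologists describe.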
Radiologists and pathologists complain that false positives slow them down too much, as they are forced to examine every portion of the image flagged by the model. In fact, they spend even more time on the false positive indications, not wanting to miss something important.
The radiologists who spoke at the conference said the impact was so severe that not only were there no time savings, but evaluations using these systems actually took more time. The model itself may be fast, but the total time to evaluate was longer.
One solution suggested, aside from diminishing error, was for the model to describe specifically what caused a particular area to be called out as abnormal, for example by identifying a particular type of cell abnormality that the trained radiologist or pathologist would recognize. Think of this as the clinician’s version of a plea for transparency.
Similarly, with the many IoT-type applications being promoted to monitor in-patients for critical events, clinicians, nurses, and other professional staff reported ‘alarm fatigue’ from too many false positives, reducing the likelihood that they would respond with urgency.
Turn Down the Hype
While we’re on the topic of automated image evaluation, radiologists and pathologists would like the press to turn down the hype on these ‘breakthroughs’ often described as new levels of accuracy in the detection of this or that cancer.
They remind us that the job of radiologists and pathologists is not to tell the treating doctor that they have discovered a cancer in the image, but rather to say that a specific area looks suspicious and requires the doctor to examine it more closely.
Don’t Disrupt my Workflow Bro
Unlike businesses in the broader world of commerce, hospitals are staffed by a very high percentage of very highly educated workers (clinicians) who seem to be always scheduled to the very edge of efficiency or exhaustion, depending on your point of view.
It’s in the nature of running a hospital not to have too few or too many of any particular clinical specialty for reasons of cost. That means that healthcare professionals are quite possibly the most overworked or at least critically scheduled category of workers anywhere. There is seldom a moment when they are under-utilized.
Where in the world of general commerce, for example, would you have your specialist sleep at the office so you could wake him in the middle of the night to evaluate a problem?
The result is the adoption of natural workflow patterns that allow attending physicians to see as many patients as possible (without causing harm) or for radiologists and pathologists to examine as many images or slides as possible in the shortest amount of time. For example, the average radiologist is said to serve 200 patients per day, evaluating 3,000 medical images at least 90% of which will be normal.
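To connect this volume to the false-positive complaint above, here is a rough back-of-the-envelope sketch. The 3,000 images and 90% normal share come from the figures just quoted; the sensitivity and specificity values are assumptions chosen purely for illustration, and `daily_flag_load` is a hypothetical helper.

```python
def daily_flag_load(images_per_day, normal_fraction, sensitivity, specificity):
    """Estimate how many flagged images per day are true vs. false alarms."""
    normal = images_per_day * normal_fraction
    abnormal = images_per_day - normal
    # Normal images wrongly flagged by the model.
    false_flags = normal * (1 - specificity)
    # Abnormal images correctly flagged by the model.
    true_flags = abnormal * sensitivity
    return true_flags, false_flags


# 3,000 images and 90% normal are from the article; 0.99 sensitivity and
# 0.95 specificity are assumed values, not measurements of any real system.
true_flags, false_flags = daily_flag_load(3000, 0.90, 0.99, 0.95)
share_false = false_flags / (true_flags + false_flags)
print(f"true flags per day:  {true_flags:.0f}")
print(f"false flags per day: {false_flags:.0f}")
print(f"share of flags that are false alarms: {share_false:.0%}")
```

Under these assumed rates, a model that sounds accurate on paper would still hand the radiologist well over a hundred false flags a day, each one an interruption to a workflow built around moving quickly through normal images.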
These folks have all learned and developed workflow techniques that maximize their effectiveness and efficiency. It’s the heart of a unique culture that is the opposite of the equally unique culture of the healthcare data science startup, which sets out to disrupt the status quo with its innovative breakthrough du jour.
For example, in automated image classification, some solutions have attempted to move radiologists off their film-based media and pathologists off their microscope-slide-based tools by offering large hi-def holographic screens instead. This seemingly harmless improvement was cited as the source of a work slowdown that caused these solutions to be rejected.
This is at the heart of the Mars/Venus analogy in the title. What we heard repeatedly is that new innovations need to seamlessly integrate with existing workflows and practices. Evidently this obvious UX element is missing in many proposed solutions. For adoption to take place, data scientists need to understand the culture, and particularly the importance of integrating with current workflow and practices.
FOSS versus FOMO
By the way, FOSS (fear of small startups) has more weight than FOMO (fear of missing out). Even where AI/ML solutions like automated image classification were shown to be promising, hospital administrators showed the same reluctance to contract with new, small startups that any competent commercial enterprise would.
When you are investing in an embedded AI/ML solution you are betting that the vendor will be around to continue to support and upgrade that solution. Small startups in any industry have the obvious risk of not surviving.
Administrators and clinicians asked that embedded AI/ML solution providers, particularly those offering automated imaging solutions, partner with the ‘hard metal’ machine providers to prove or at least ensure their staying power.
The Electronic Healthcare Record (EHR) – A Deal with the Devil
A survey from the AMA this year continues to show that the EHR and related clinical systems are the chief reason for physician burnout. It’s widely quoted that the introduction of these systems has burdened the physician with two hours of administration for every hour of patient-facing time.
If you’ve been to your doctor recently you know that the keyboard is now an almost physical barrier between your physician and you. It’s reported that this is the first generation of physicians NOT to recommend that their children enter medicine.
And yet to get to the benefits of AI/ML in healthcare requires the data that starts here in the EHR.
There are many structural and procedural problems with health data but key among them is extracting that data from these electronic health records.
China has an initiative underway employing 50,000 medical students to extract and transcribe this data into databases. But the feeling among US clinicians is that medical students aren’t qualified to extract this data accurately, and that extraction needs to happen at the time of data capture.
This practically screams out NLP to any data scientist, and some applications are making inroads. However, some still haven’t learned the lesson above about integrating into existing workflows.
For example, for both data capture and for liability reasons, hospitals would like to have step-by-step documentation of what happens during a surgery. One proposed solution was to put an elaborate headset on the surgeon that would both record video and allow the surgeon to dictate his actions. Guess how well that was accepted.
A major challenge to NLP solutions, and indeed to all types of data capture in healthcare, is interoperability among different data sets. The consistency and standardization aren’t there today, which restricts most data sets to a relatively small size and makes the blending of data sets chancy at best.
This is a huge pain point and barrier where NLP promises improvement and where movement is underway toward the standardization necessary. It’s not there yet.
Data is Too Thin and Won’t Generalize
The problems with the healthcare data necessary to train AI/ML solutions don’t end with extraction. The first major problem is that the data is simply too thin and won’t generalize.
There are a few large databases in the 100,000-record range, but the effort to roll up patient data into data-science-worthy databases is still at an early stage. Some of the obvious problems:
* A solution from one country or even one hospital won’t necessarily generalize to other populations, inside or outside of the US.
* Both privacy regulations and the feeling on the part of many hospitals that they should be compensated for their data are dramatically limiting sharing. For example, although medical imaging accounts for 90% of all healthcare data, 97% goes unanalyzed or unused, per Keith Bigelow, SVP for GE Healthcare.
* In clinical informatics, the first step toward the AI/ML-augmented physician, models to predict better outcomes or prevent worse ones are being attempted. But data is so thin at the hospital level that the models suffer from what data scientists understand as ‘leakage from the future’. That is, when the first promising results are found and practices are changed, the data from the new group impacted by the model is mixed back with the original data, eliminating a proper control group. Still, having immediate benefit is seen as preferable.
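One common guard against this kind of contamination is to keep records apart according to whether the model could have influenced care. The sketch below is a minimal illustration, not a description of what any hospital at the conference does: the go-live date, the record fields, and the `split_by_deployment` helper are all hypothetical.

```python
from datetime import date

# Hypothetical date on which the first model began influencing practice.
MODEL_GO_LIVE = date(2018, 6, 1)

# Invented patient records; field names are assumptions for this sketch.
records = [
    {"patient_id": 1, "admitted": date(2018, 3, 10), "outcome": 0},
    {"patient_id": 2, "admitted": date(2018, 5, 22), "outcome": 1},
    {"patient_id": 3, "admitted": date(2018, 6, 15), "outcome": 0},
    {"patient_id": 4, "admitted": date(2018, 9, 2), "outcome": 0},
]


def split_by_deployment(records, go_live):
    """Separate records untouched by the model from those it may have shaped."""
    pre = [r for r in records if r["admitted"] < go_live]
    post = [r for r in records if r["admitted"] >= go_live]
    return pre, post


pre_model, post_model = split_by_deployment(records, MODEL_GO_LIVE)

# Retrain only on the untouched baseline; score the later group separately
# rather than mixing it back into the training data.
print("baseline ids:", [r["patient_id"] for r in pre_model])
print("model-influenced ids:", [r["patient_id"] for r in post_model])
```

The cost of this discipline is exactly the one the clinicians point to: holding back the newer records shrinks an already thin data set, which is why blending is tempting in practice.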
Continuous Learning is Broken
What we all want is to continuously improve our models. Thin data, which forces new and old data to be blended and invites leakage from the future, is only part of the problem.
The siloing or hoarding of data at the hospital level is reported to be an equal problem. When a hospital accepts an embedded AI/ML solution from a new vendor, there are several factors that may prevent results data from flowing back for improvement.
One issue is interoperability even at the machine level. Several examples have been shown in imaging where data from similar machines from different manufacturers, or even with different settings, is not comparable.
Second, some hospitals take the position that they are financial partners in this arrangement and need to be compensated for the returning data.
An even more formidable barrier is raised by the FDA in approving imaging based solutions. On the one hand, the FDA has taken a very permissive approach to approving AI/ML image classifying solutions based on training with as few as 100 to 300 images.
However, those approvals are then frozen and require reapplication before an improved solution can be released. Not to mention that using such a small number of images for training implies transfer learning and inaccuracies that may arise from using a base model that actually does not transfer weights well in differing circumstances.
Not So Fast With Those Rollouts
This problem lies squarely at the feet of those data science healthcare vendors still in the ‘move fast and break things’ mindset. The best example is the use of chatbots which are a natural for patient scheduling, intake, and even determining when or whether a patient should see a doctor.
In a widely touted rollout just a few months ago, a chatbot called Babylon was rolled out in the UK to provide diagnostic advice on common ailments without human interaction. The app would then also vet access to the single-payer UK system presumably reducing the cost of the initial interaction and offering accurate and timely referrals to the correct physicians and hospitals.
However, as reported in a recent Forbes article, a group of auditing physicians “found that around 10% to 15% of the chatbot’s 100 most frequently suggested outcomes, such as a chest infection, either missed warning signs of a more serious condition like cancer or sepsis or were just flat-out wrong”.
The problem was simply shortcuts taken during training and too great an emphasis on rolling out fast before being audited.
It’s likely that most non-data scientists think that chatbots are actually smarter than they are. The AI in chatbot refers to its NLP ability to understand free form text and voice input and create similar output.
What most folks don’t realize is that the internal logic that decides which answers to provide is still a hand-coded decision tree in 95% of chatbots, not the result of some exotic AI/ML-related search or automated intelligence (which might not be better anyway).
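To make the point concrete, here is a minimal sketch of what a hand-coded decision tree looks like inside a chatbot. Everything in it is invented for illustration (the questions, the replies, and the `walk` helper), and it uses a scheduling flow rather than anything diagnostic. The NLP layer would only map the user's free-form text to "yes" or "no"; the choice of reply is plain lookup logic.

```python
# A hand-written tree: each node is either a question with yes/no branches
# or a final reply string. All content here is invented for illustration.
TREE = {
    "question": "Do you already have an appointment?",
    "yes": {
        "question": "Do you want to cancel it?",
        "yes": "Your appointment has been flagged for cancellation.",
        "no": "Please tell us the new date you prefer.",
    },
    "no": "Let's book a new appointment.",
}


def walk(tree, answers):
    """Follow a sequence of 'yes'/'no' answers and return the next prompt."""
    node = tree
    for answer in answers:
        if isinstance(node, str):
            break  # Already at a final reply; ignore extra answers.
        node = node[answer]
    # Either ask the next question or give the final reply.
    return node["question"] if isinstance(node, dict) else node


print(walk(TREE, []))
print(walk(TREE, ["yes"]))
print(walk(TREE, ["yes", "no"]))
```

Nothing here learns from data. Every path was typed in by a person, so the quality of the answers depends entirely on how carefully those paths were written and audited.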
So buyers beware and be sure to satisfy yourself about the accuracy of any chatbot or similar AI/ML solutions before you put them in production. That’s one of the first lessons we learn in commercial rollouts.
Real physical business
One of the conference speakers, Ted Shortliffe, who is a physician, data scientist, and widely credited author, observed that perhaps the reason for slow adoption in healthcare has to do with the fact that it is a ‘real physical business’.
He contrasts this with where AI/ML had its earliest successes in the e-world of finance, ecommerce, and entertainment, none of which is hobbled by an existing physical operational environment.
Perhaps he’s right that the brick and mortar world is a much tougher nut to crack and requires a slower, more deliberate approach. Especially when complicated with the specialized processes and skills needed in healthcare.
For data scientists hoping to capitalize in this market, there are a few important lessons.
* Slow down a little and make sure you understand how your disruptive application can actually be integrated into this world of specialized workflows.
* Make sure you understand both the current restrictions on data size and accuracy, and how long it may take before that gets better.
* Don’t rush the rollout. People’s health is at stake.
And if your VC is pushing you a little too hard, remember that the early bird may get the worm, but the second mouse gets the cheese.