IHPI Seminar: White Coat, Black Box: Augmenting Clinical Care with AI in the Era of Deep Learning

– We’re gonna go ahead and introduce our speaker for today. Jenna Wiens is a Morris Wellman Assistant Professor of Computer Science and Engineering here at the University of Michigan. So she’s one of our own. It’s an incredible privilege to have her here to speak with us today. She received her PhD from MIT in 2014, and in 2015, one year after she got her PhD, she was named to the Forbes 30 Under
30 in Science and Health care. She received an NSF CAREER Award in 2016 and recently was named in
the MIT Tech Review list of Innovators Under 35. Her research focuses on machine learning, data mining and their intersection with health care. And we asked her here to talk to us about the future of machine learning, data mining and big data in health care, and where she sees this going. So, welcome. – Thank you very much. (applause) So thank you so much
for inviting me to come and speak today. I’m very excited to be here to give my talk on White Coat, Black Box, Augmenting Clinical
Care with AI in the Era of Deep Learning. And I have to give credit to the CBC, the Canadian Broadcasting Corporation for inspiring this title. They have a popular podcast
called White Coat, Black Art. So in hospitals today, we’re collecting an immense
amount of patient data. So everything from medications to lab results, to clinical notes. And in my lab, we develop and we use machine learning to detect patterns in these data. Patterns that can be used to sort a patient from low risk to high risk for a particular
adverse outcome or a disease. And given the challenges that today’s health
care systems are facing, there’s a critical need for such clinical decision support tools. Currently, the demand for clinical care far exceeds the supply, and economists anticipate that this is only going to
get worse in the near future. By the year 2025, we anticipate a shortage of over a hundred thousand
clinicians in the US. And this shortage exacerbates what’s already a
serious issue in the field, physician burnout. Today, the increasing computerization of the field is widely
viewed as a problem rather than a solution. Clinicians are spending
ever more time entering data about their patients and studying data about their
patients while still ignoring the vast majority of it. And this burnout combined
with a lack of tools to make sense of all these data, has contributed to a large
number of medical errors. Data published by the CDC estimate that medical errors are the third most common cause of death in the US, after heart disease and cancer. These issues combined
highlight an important need but also an opportunity for AI. For example, in our work, we show how machine
learning tools can be used to identify which patients
are at greatest risk of acquiring an infection
during their hospital stay, by leveraging the contents of
the electronic medical record. So, using all of those data that we’re collecting on patients and machine learning tools, we can accurately identify at-risk patients five days in advance of clinical suspicion. This is really far enough in advance that you could actually intervene and change the outcome of a patient, whether the intervention is
targeted environmental cleaning or even whom to test, and this in turn could lead
to decreased length of stay, decreased transmission, reduced costs and improved patient outcomes. But we’re not stopping there. The data and techniques that we’ve used to try to predict healthcare
associated infections can be extended to other
inpatient outcomes, like acute respiratory distress syndrome or graft versus host
disease in HCT patients, or more common outcomes like shock and acute respiratory failure. Or now that we’ve been collecting data for over a decade in EHR systems, we can start to look at
long term trajectories of patients that lead to perhaps
neurodegenerative diseases like Alzheimer’s disease. We can even start to look at data collected outside the hospital. Data from wearables or microbiome data can help us shed light on how to best manage chronic
diseases like diabetes or cystic fibrosis. And I’d like to pause
here just to take a moment to acknowledge all of our
clinical collaborators that work with us on these projects. This work would not be possible
without frequent meetings, where they update us or help guide us through
many of the problems that we run into and keep us on track. So while our work is motivated by important clinical problems, ultimately we’re interested in tackling the technical challenges. What do we have to do to
make AI work for healthcare? To make AI work for clinicians? In my lab, we work across a number of different technical domains, from time series analysis
to causal inference, to representation learning. And today, I’m going to focus
on a subset of our technical contributions starting with
representation learning. So just a show of hands, how many people have heard
of representation learning? No one, okay.
(laughter) How many people have heard of deep learning? All the hands go up, okay. So these are very closely related, great. So roadmap, here’s the outline for
the remainder of my talk. We’re gonna start off with
how representation learning and deep learning are related. Then move on to some of our
technical contributions, diving into the details and then finally, I’m gonna come back out and
end with the big picture. What does this mean for healthcare? All right, so I’ll just
start with the background. So, one of the first steps
in applying machine learning to any dataset is deciding
how to represent the data. Okay, so let’s take a simple
toy example where I’m trying to classify heartbeats. I’m trying to identify ectopic heartbeats, so normal sinus rhythm versus everything else. I have two heartbeats here that I’ve extracted from the EKG, and I wanna represent them as a feature vector and a binary label. So negative one for a
normal sinus rhythm beat, positive one, for an ectopic beat. And we need to think of a
good feature vector, right? So how are we gonna represent
these data to the algorithm? One idea we might have is to consider the pre-peak time, so the RR interval, and the amplitude of the QRS complex. We can calculate both of those different
features for each heartbeat and we get a two
dimensional feature vector, that we can then plot in
this two dimensional space. We can collect more heartbeats and we can then plot all of the data in our two
dimensional feature space. Then the goal here is to identify a decision boundary or a classifier that separates the positive from the negative examples. And there are various learning algorithms to solve this supervised learning task.
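As a rough illustration of this hand-engineered setup (a minimal sketch with synthetic placeholder data and a crude peak finder, not anything from the talk), one could compute the two features per beat and fit a linear classifier in that two-dimensional space:

```python
# Minimal sketch (synthetic data, crude peak finding; illustrative only):
# represent each beat by two hand-crafted features and fit a linear classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

def handcrafted_features(beat, fs=250):
    """Return [pre-peak time in seconds, QRS amplitude] for one beat."""
    r_idx = int(np.argmax(beat))                # crude R-peak location
    pre_peak_time = r_idx / fs                  # stand-in for the RR interval
    amplitude = beat[r_idx] - np.median(beat)   # peak height above baseline
    return [pre_peak_time, amplitude]

rng = np.random.default_rng(0)
beats = rng.normal(size=(200, 250))             # placeholder beats, 1 s at 250 Hz
labels = rng.choice([-1, 1], size=200)          # -1 normal sinus rhythm, +1 ectopic

X = np.array([handcrafted_features(b) for b in beats])
clf = LogisticRegression().fit(X, labels)       # decision boundary in the 2-D space
print(clf.score(X, labels))
```

Swapping in the T-wave features mentioned next would only change handcrafted_features, which is the point: the representation, not the learner, does much of the work.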
But hopefully you’re starting to see how fundamental the representation can be. And so if instead I chose a less informative representation and decided to represent each beat in terms of the width of the T-wave and the start of the T-wave, I’d be setting up the problem to be more difficult
than it was otherwise. And researchers for some
time have been spending a lot of time thinking of how to
best represent their data. And so this was true, you know, a decade or two ago and
it’s even still true today. Researchers still spend time
engineering these features, thinking about how to
best represent the data. And this is a really
labor intensive process and it can be error prone, right? If I don’t have the domain expertise, I might struggle to come up
with a good representation. So that’s where representation
learning comes in. Instead of sitting down and thinking, what are the best ways
to represent my data, I can learn that representation. So in addition to learning
how to weight each feature, I can also learn what
the features should be. And to do that, we start with
some original representation. And so think of this as just the value of the signal. So each of these corresponds to a different point in time, and that’s the input. I can feed that into a
multilayer perceptron. So this is just, you can think
of it as a deep neural net, some deep learning. And then think of, look at the last layer. So the last layer here corresponds to a linear combination of my features. So I have three features here, X1, X2, X3 that are all being fed in to that last neuron that’s
going to output a prediction, normal sinus rhythm beat or not. So even though I’m feeding in this original representation, what I’m getting out is this X prime, the representation at the last layer. And so I can think of that as my learned representation. This is my new feature space, because I have some three-dimensional feature space in this case. And the same thing, I wanna learn a classifier, right? But I was able to learn the classifier and the representation jointly. This is representation learning, and the foundation for deep learning as we know it today.
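As a rough sketch of that idea in PyTorch (illustrative only, not the network from the slide), the hidden layers below learn the representation x' and the final linear layer is the classifier, trained jointly:

```python
# Hedged sketch: feed the raw signal into a small multilayer perceptron and treat
# the last hidden layer as the learned representation x', with a final linear
# layer acting as the classifier. Both are updated by the same loss.
import torch
import torch.nn as nn

class BeatMLP(nn.Module):
    def __init__(self, n_samples=250, hidden=32, rep_dim=3):
        super().__init__()
        self.encoder = nn.Sequential(            # learns the representation
            nn.Linear(n_samples, hidden), nn.ReLU(),
            nn.Linear(hidden, rep_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(rep_dim, 1)  # linear combination of x'

    def forward(self, x):
        x_prime = self.encoder(x)                # learned feature space
        return self.classifier(x_prime), x_prime

model = BeatMLP()
beats = torch.randn(8, 250)                      # placeholder batch of raw beats
logits, learned_features = model(beats)
targets = torch.randint(0, 2, (8,)).float()      # placeholder labels
loss = nn.BCEWithLogitsLoss()(logits.squeeze(1), targets)
loss.backward()                                  # classifier and representation learned jointly
```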
So in recent years, deep learning has led to breakthroughs across a number of domains. The computer vision community has benefited tremendously; for many image recognition tasks, these algorithms can do as
well as or surpass humans. Similarly, machine translation
tasks can also benefit from these deep learning techniques. They can be trained on tens
of millions of sentence pairs and better capture context
leading to better translations. We’ve also seen impact in healthcare. So researchers at Stanford
have achieved dermatologist-level classification
of skin cancer using ML and others at MIT and MGH have built tools to successfully summarize
pathology reports. So the potential impact of
machine learning in healthcare warrants genuine enthusiasm. But the limited adoption to date highlights the fact that we still have a long way to go. Compared to other domains that have benefited immensely from recent advances in machine learning, like deep learning, healthcare presents a number of additional challenges. So, in addition to the increased
risks and responsibility that come along with
working with health data, there are unique technical
challenges related to model interpretability and sample size. And in the remainder of my talk, I’m going to dive into details
of each of these issues and what we’re doing to tackle them. So first interpretability. So going back to our
toy heartbeat example, the model on the left or
the feature representation on the left here is inherently
interpretable, right? By design you came up with the features or we came up with the
features of pre-peak time and amplitude. So we know what the
features correspond to. Whereas, on the right, the deep learning approach
generates some representation, and while it might be a good representation, we don’t really know what these dimensions correspond to. And so some people say deep learning generates these sorts of black box models. We don’t know what those representations correspond to. And this is precisely what makes them useful in the first place, right? They’re learning the representations. So arguably, the simplest models to understand, the most
interpretable models, are hand-crafted risk scores, like the TIMI Risk Score for non-ST elevation myocardial infarction. This is a risk score that’s based
on four historical factors and three presentation factors, and each one contributes just a single point. So the points are like the weights, and they’re not learned.
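Purely to illustrate that structure (the factor names below are placeholders, not the actual TIMI items), such a score is just a sum of fixed, designer-chosen points:

```python
# Illustrative sketch of a hand-crafted point score (placeholder factor names):
# each binary factor contributes a fixed single point, so the "weights" are set
# by the designer rather than learned from data.
HISTORICAL_FACTORS = ["factor_h1", "factor_h2", "factor_h3", "factor_h4"]
PRESENTATION_FACTORS = ["factor_p1", "factor_p2", "factor_p3"]

def point_score(patient: dict) -> int:
    """Sum one point per factor that is present for this patient."""
    factors = HISTORICAL_FACTORS + PRESENTATION_FACTORS
    return sum(int(bool(patient.get(f, False))) for f in factors)

print(point_score({"factor_h1": True, "factor_p2": True}))  # -> 2
```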
But this model is interpretable both to the user, so the clinician who’s going to ask these questions and sum up the points, and also to the
person who designed it. That’s not always true in
deep learning approaches. And for some time researchers
have sought interpretable models in high stakes
domains like health care, since they allow the user, the clinician, to check the model, to test their intuition against the model. In my lab, we think interpretability is really important since
it allows the designer, the ML researcher, to check
the model and find bugs. And so for example this is
a project that was done in collaboration with Dr.
Michael Schottenstein here at Michigan where we
were looking at a cohort of patients with acute
respiratory failure. And we had students working
on predicting which patients were going to die during
the current hospitalization. So the task here is to predict in-hospital mortality in this ICU setting. And we had two undergraduate
students working on this problem. They are both in computer science here and about two weeks into the project, they came back to us and they had this incredible
area under the receiver operating characteristic curve. It was like 0.95. It was really good. And if any of you have
seen me speak before, you know I’m very suspicious
of results that are too good. So we wanted to see, well
what had the model learned. So this was an interpretable model, so we were able to check it and we noticed that one of
the most important features was this drug here, scopolamine. We didn’t know what that was. We did a little bit of Googling, but then we sat down with our
collaborator, and Mike said, you know, this is a drug that’s almost exclusively used for end-of-life care. So it wasn’t that surprising that the model was
picking up on this signal. So it wasn’t that the model was wrong, it was a real pattern in the data, but it wasn’t useful, right? If we want to then go and use this model, we wouldn’t be telling the clinicians anything they didn’t already know. So we’re not really
augmenting clinical care then, but we were able to fix it because the
model was interpretable. We were able to just
exclude patients who are on comfort measures
only from the analysis. And then this resolved the problem. This particular feature
was no longer informative and AUC dropped. (laughter) So though it may seem
like the arguments for interpretability are sound, there’s still this ongoing
debate in the community. And in particular, at NIPS 2017, which happened just over a year ago, they hosted the first ever NIPS debate, and the topic
was Interpretability is Necessary for Machine Learning. And on one side, we had Rich Caruana from Microsoft Research arguing
for interpretability, and then Yann LeCun, one of the founders of this
whole deep learning community arguing the other side. And the debate generated a ton
of excitement and enthusiasm, especially in the machine
learning for healthcare community. So we’re getting this play-by-play from David Kale, who was blowing up on Twitter, relaying every single thing that everyone had said,
all of the back and forth. The debate lasted over an hour. I don’t have time to show
you the entire debate, so I won’t, but I wanna summarize a few of the key arguments that were made. So I think my volume is turned up. Listen carefully. I’m also displaying the text
just in case it’s unclear. – [Rich] We think that you just won’t know about these problems, you wouldn’t have anticipated them in advance, if you can’t open up your model and see what’s happening. So we think interpretation is very, very important for these models, or else the model is just gonna learn risky things. – [Yann] Interpretability
schminterpretability. (laughter) So it’s not that interpretability
is completely useless, it’s not nearly as useful as you think. First of all, the vast
majority of decisions made by automatic systems so we know them. There’s a small number of domains where interpretability is not
only useful but required, like legal decisions, and we concede for certain
types of medical decisions. – [Rich] You just can’t trust performance on a test set in many domains, especially when your intended use of the model is something that is sort of causal or counterfactual. You didn’t evaluate the model that way when you trained it, so now you have to do something different. We find that every time we put on these promised magic glasses and we see what’s inside the model, we find an amazing number of
things that are wonderful, beautiful that do really
make the model high accuracy, but we also see things
that are very surprising, very disturbing and that no human would
ever let get deployed. – [Yann] Machines are gonna
make stupid mistakes until they have the same kind of background
knowledge that we have, until they get common sense. So that could be an argument to be careful and be more thorough with testing. – So that’s the gist of it. You can go and watch the
entire thing on YouTube. But in the end, Yann LeCun, or actually very early on in the debate, Yann LeCun concedes that
interpretability is really important in certain domains
including healthcare. In part because AI or models
today lack common sense. And this lack of common
sense is not limited to just deep models; interpretable models can
also lack common sense. So let me give you an example here. This lack of common sense can lead to really strange behavior, when you’re looking at
what a model has learned. So for example, suppose we have some
object recognition task. I wanna recognize dogs in images, and the model has learned, and I’ve tested it, and I go to my test data and say, it’s a dog. Look great, good performance, I’m doing well. But if I look at it more carefully, let’s say it’s an interpretable model and I can look at what the models learned. For some reason, my data
were biased in such a way that every dog appeared with a frisbee. (laughter) So the model is actually just
picking up on the frisbee. But what I really want is for the model to recognize the dog. I can address this by fixing
the bias in my data set, right? But I won’t know it’s an issue
unless it’s interpretable. And the same thing can
happen in healthcare. So back to the healthcare
associated infections example that I gave at the beginning of the talk. If we look at what features received the highest weight for our model on predicting infections
with Clostridium difficile, we can be reassured by the first few features. And so a diagnosis in the past year received a high weight, meaning you’re at high risk, which makes sense because roughly 20% of cases will relapse within 60 days. But we don’t have to go too far down the
list before we start to see some things that might be puzzling, and it might just be that these features, the frisbee, it’s showing up here because it’s highly correlated with
some other known risk factors. But we have common sense or we have clinical domain expertise. So can’t we just somehow incorporate this domain expertise into the model, help it choose between
that frisbee and the dog. So to do this, we propose a new regularization penalty. And so we incorporate
expert knowledge regarding what is known to cause an
increase in patient risk in the form of this regularization term, which we call the EYE penalty, based on its shape. For every input variable, so we have d features, for every input variable, we have a binary variable r telling us whether or not that feature is known to lead to an increase in risk. So for example, cephalosporins would lead to an increase in risk. So given this information, the model can then favor a solution that is sparse in the set of unknown features and dense in the set of known features, meaning it will keep the dog over the frisbee, because the dog is known to lead to the label of a dog.
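As a hedged sketch of that idea (this is not the exact published EYE penalty, just one way to encode the same sparse-in-unknown, dense-in-known intuition), one could add an L1 term on unflagged features and a ridge-like term on expert-flagged ones:

```python
# Hedged sketch only: NOT the exact published EYE penalty. It encodes the same
# intuition: L1 (sparsity) on features the expert did not flag, L2 (dense,
# ridge-like) on expert-flagged risk factors, added to a logistic-regression loss.
import torch

def eye_style_penalty(weights, known_mask, lam=0.1):
    unknown = weights * (1 - known_mask)        # features not flagged by the expert
    known = weights * known_mask                # expert-flagged risk factors
    return lam * (unknown.abs().sum() + known.pow(2).sum())

d = 20
X = torch.randn(500, d)                          # placeholder feature matrix
y = torch.randint(0, 2, (500,)).float()          # placeholder infection labels
known_mask = torch.zeros(d)
known_mask[:5] = 1.0                             # e.g. cephalosporin indicators

w = torch.zeros(d, requires_grad=True)
opt = torch.optim.Adam([w], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = torch.nn.functional.binary_cross_entropy_with_logits(X @ w, y)
    (loss + eye_style_penalty(w, known_mask)).backward()
    opt.step()
```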
So applied to the task of predicting healthcare-associated infections with C. diff, we identified known risk factors based on the clinical literature, and we measured how often the
known risk factors appeared near the top of our ranked list, using average precision.
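Measuring that is straightforward; here is an illustrative version with toy numbers, ranking features by the magnitude of their learned weights and treating the expert flags as labels:

```python
# Illustrative sketch: how often do expert-identified risk factors appear near
# the top of the model's ranked feature list? Ranking by |weight| and treating
# the "known" flags as labels reduces this to average precision (toy numbers).
import numpy as np
from sklearn.metrics import average_precision_score

weights = np.array([2.1, -0.1, 1.7, 0.05, -1.9, 0.3])   # learned model weights
known = np.array([1, 0, 1, 0, 1, 0])                     # 1 = known risk factor

ap = average_precision_score(known, np.abs(weights))     # higher = known factors rank higher
print(round(ap, 3))                                       # 1.0 here: all known factors on top
```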
And so here I’m showing you the method; average precision, or AP, where higher is better; and then the area under the receiver operating characteristic curve, which is a measure of the discriminative performance of the model. And I’m comparing to expert features only. So if we know what causes C. diff or what leads to an increase in risk, why do I even need the other variables? Let’s just go with the expert knowledge. And you can see that
this trivially achieves an average precision of one, because of the way we’re
measuring average precision, we’re only including
those known risk factors, so they’re gonna appear near the top. But the discriminative
performance is abysmal. So remember, AUC 0.5 is a coin flip, no better than random, and we’re at 0.59. So there’s signal there, right? The features do in fact mean something, but we would suffer considerably by just throwing out all the other data. The EYE penalty leads to worse average precision, because it’s considering many more features, but better discriminative performance, and compared to other
regularization penalties. So you might be more familiar with lasso or something like ridge. So here I’m comparing it to lasso, weighted lasso and elastic net. And we see that compared to all of these approaches we’re
doing on par in terms of the discriminative performance, the AUC, but an order of magnitude better in terms of the average precision. And so it’s encouraging more of those known risk factors to float up near the top. This work has been published and we shared all the code on GitHub, so if you want, you can go and start training credible models, or models with the EYE penalty, today. So to wrap up our discussion
on interpretability, it’s important to understand that it’s neither necessary nor sufficient. And many researchers will cite
this example of black boxes that are already deployed in healthcare. And so I’ve heard clinicians
give this example where they say, well, you know, we use devices or tools every day that
we don’t understand. For example, a digital thermometer. So how many clinicians know precisely how a digital thermometer works? Probably very few, but you still use it. So the argument to that is that, well, the clinicians might not know how it works, but the manufacturers do, right? So there are physics behind how these digital thermometers work. They’ve been tested, we know they work. But then there are still some treatments that no one knows why they work. There are certain drugs
where it’s not well understood what the mechanism is
behind why something works. But such treatments have
been thoroughly vetted. They’ve gone through several
stages of clinical trials. You might be thinking, well, let’s just do the same
with machine learning, with black boxes. And we could, and this is in fact what Yann LeCun
argues for in the debate. But look at this timeline, and it can take several years if not longer to vet some of these drugs. So we should do as much
as we can in advance to vet the model. And that’s definitely easier
with interpretable models. So I’m gonna pause there before I move on to the second technical challenge and ask if there are any questions. No questions? Yes. – [Audience Member] Sure, where do the known risk factors come from and how do you trust that
they’re not just garbage? – Great, that’s an excellent question. Where do the known risk factors come from and how do you trust
that they’re not garbage? So in our work, the known risk factors
came from the literature. But you’re right, going forward, they might come from your
clinical collaborator, and there’s this risk. Experts can be wrong. Experts have been wrong in the past. Clinical literature has
been wrong in the past. So the nice thing about our approach is that it won’t trade off accuracy for average precision. Okay, so it’s not going to force or bias the model towards believing the expert if that feature doesn’t help, right? So it still needs to have a
relationship with the outcome. And in the paper we have many more baselines that I didn’t show on the slide, but we compare to a random expert. So we say, what if we just randomly permuted r, that vector that encoded what was known and what was unknown, which would be the same as an expert that was giving us garbage. And the accuracy was the same. The average precision was lower, but that’s actually a good thing. So we think that’s a feature of the model: if you have expert
knowledge, you incorporate it. But then if the average precision is about the same as all the other regularization techniques, it’s a sign that your
expert knowledge is off or it doesn’t agree with the data. Great. Any other questions? Yes. – [Audience Member] Have
there been models run of interpretable models
versus autonomous models to see which ones provide the best outcome in terms of the credibility of the model
against the gold standard? – You mean when you say credibility, do you mean performance? – [Audience Member] So if I’m looking at, for instance, the unexpected death, not one that’s expected because
they’re pre-ordained to die, if that’s the outcome
that I’m trying to get to, have there been head-to-head comparisons of interpretable versus the autonomous? – Yes, yes, absolutely. And on some tasks deep
learning wins hands down, but on many tasks it doesn’t. And on many tasks you actually
see similar performance between the deep model and
the interpretable model. And in such scenarios, even Yann LeCun concedes that you would prefer
the interpretable model. So I’m gonna get in to more examples of precisely where deep learning
works really really well. So moving on and I’ll give opportunity for more questions at the end. There’s this small sample size issue. And so machine learning and deep learning techniques benefit when there are large amounts of training data. Right? And this could be an
argument for collecting and sharing more patient data so we can train better
deep learning models, but even if we had data on every single patient in the world, there are some outcomes or some diseases that are so rare that you
might not learn meaningful relationships from those approaches alone. So what do we do about it? Well, first what do other domains do? How do other domains deal with this? Well, domains involving natural images, and this is where deep learning approaches have worked really really well, have for some time recognized important invariances present in task and they’ve developed architectures or techniques that are designed to exploit these invariances. So what do I mean by invariance? An invariance is simply a transformation that’s applied to the input. So the image in this case
that doesn’t change the label. So for an object recognition
task on natural images, it doesn’t matter where this
Schnauzer appears in the image, it’s still a Schnauzer. And so there’s some spatial
invariance in the task of object recognition. But what invariances hold for
tasks involving health data? And do existing techniques apply? So techniques that have been designed to specifically exploit things like translational invariance. Well the short answer
is no, they don’t apply. Health data are unique in
several aspects and require us to rethink some of the
common assumptions made by conventional ML approaches. And here I’ll describe our
recent work in challenging some of these assumptions. So the first project focuses
on images of the brain. And this work was led by Pascal Sturmfels, an alum from our undergraduate
CS program here at Michigan. And we started to explore CNN
architecture choices in the context of structural neuroimaging data. And when we reviewed the
literature in this area, we noticed that a lot of researchers were taking off-the-shelf models that had been designed specifically for tasks involving natural images and applying them to neuroimaging data. But brain images,
specifically structural MRI, are all aligned. And there’s a lot of
pre-processing steps that align all of these images. So does this spatial invariance really hold here? We hypothesized that in fact the answer is no, and that an architecture that looks
for different patterns in different regions of the
brain would outperform a conventional CNN. So to test this hypothesis, we considered two modifications
to conventional CNNs, those are convolutional neural networks. So think of it as another
form of deep learning. The first modification segmented the brain into consecutive regions
and treated each region as an input channel, thus encouraging the network to learn different region
dependent patterns. So whereas a dog is a dog
anywhere it appears in the image, a particular pattern might have a different meaning if it
appears on the left hand side of the brain versus the
right hand side of the brain. So at a high level, instead of convolving each filter or you can think of this
as a feature detector. So the blue box is a feature detector. I’m looking for a specific
pattern within the entire image. This is what a CNN would do. We considered an architecture that had different filters for different regions. In our proposed setup, the learned feature weights are not shared across regions, and this allows the network to learn region-specific patterns. However, this region-specific information gets lost after the
very first convolutional layer since everything gets combined. So to address this, we proposed a second modification in which we applied more filters
earlier on in the network. So we’re trying to detect
more patterns at lower levels. And this is in contrast to
a typical CNN architecture. So what people normally do if they’re working with natural images, is they’ll have few filters, few feature detectors at
the bottom of the network. So at the first layer, and then more later on. And with the intuition being that if you’re trying
to recognize objects, they’re more distinguishable
at a higher level. We thought the opposite might
be true for brain images. So we hypothesized that an architecture that focuses on more distinguishing details would outperform a typical CNN.
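Here is a hedged sketch of those two modifications (written for 2D slices for brevity; this is not the published architecture): regions become input channels with unshared filters via grouped convolutions, and the first layer gets more filters than a typical natural-image CNN would.

```python
# Hedged sketch, not the published model: (1) split the aligned brain image into
# consecutive regions stacked as input channels, with grouped convolutions so
# filters are NOT shared across regions; (2) use many filters in the first layer.
import torch
import torch.nn as nn

def image_to_region_channels(img, n_regions=4):
    """img: (H, W) tensor -> (n_regions, H/n_regions, W), one channel per region."""
    h = img.shape[0] // n_regions
    return torch.stack([img[i * h:(i + 1) * h] for i in range(n_regions)])

class RegionCNN(nn.Module):
    def __init__(self, n_regions=4, filters_per_region=16):
        super().__init__()
        # groups=n_regions -> each region channel gets its own filter bank
        self.conv1 = nn.Conv2d(n_regions, n_regions * filters_per_region,
                               kernel_size=3, padding=1, groups=n_regions)
        self.conv2 = nn.Conv2d(n_regions * filters_per_region, 32,
                               kernel_size=3, padding=1)
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(32, 1))

    def forward(self, x):                        # x: (batch, n_regions, H', W)
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        return self.head(x)

img = torch.randn(64, 64)                        # placeholder "aligned" slice
x = image_to_region_channels(img).unsqueeze(0)   # (1, 4, 16, 64)
print(RegionCNN()(x).shape)                      # torch.Size([1, 1])
```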
The details are in our paper, but together, both of these changes led to not only more accurate predictions, but faster predictions. And that’s really important when you’re working with these large 3D volumes. So here I was just showing
you 2D volumes or 2D images, but they’re in fact 3D volumes. The computational efficiency of these algorithms is often a bottleneck. So beyond images, there are
many other types of health data. For example, the data types that we often work with
are clinical time series. So measurements like vitals over time, or waveform data, right? So going back to the
heartbeat example here, suppose we’re trying to not
just detect ectopic heartbeats, but classify arrhythmias. Do we have any cardiologists
in the audience? There was one, but he left. All right, so this is an example of ventricular tachycardia. Cardiologists would be
able to tell you that. This example below is also ventricular tachycardia, right? Even though they don’t
line up, same label. So this is a phase invariance. You could think of it almost like a translational invariance. And just like the dog was a dog, no matter where it appeared in the image, V-tach is V-tach no matter
where it appears in the sample. Okay. So this suggests that
at least for some tasks involving time series data, there’s some phase or
translational invariance. But there are many other types of invariances present
in these types of tasks. And for some time, researchers have applied pre-processing techniques like dynamic time warping to magnify relevant similarities between patients. So dynamic time warping finds an alignment between two signals that can minimize task-irrelevant differences. And such techniques
were first used in speech processing where the speed at which I issue a command to Amazon Alexa has no effect on the meaning of that command, right? So the speed shouldn’t matter. So I wanna find the alignment. So this is finding the alignment
between two waveforms. However, such techniques rely on dynamic programming and they’re computationally expensive at runtime.
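For reference, here is a minimal version of classic dynamic time warping; the nested dynamic-programming loop is exactly where the quadratic runtime cost comes from:

```python
# Minimal sketch of classic dynamic time warping (illustrative, O(n*m) dynamic
# programming over a cumulative-cost table).
import numpy as np

def dtw_distance(a, b):
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # best of match, insertion, deletion from the previous cells
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    return cost[n, m]

x = np.sin(np.linspace(0, 2 * np.pi, 80))
y = np.sin(np.linspace(0, 2 * np.pi, 120))       # same shape, different speed
print(dtw_distance(x, y))                        # small despite the length mismatch
```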
So what do other domains do? Again, when working with natural images, when multiple invariances arise, they can be exploited through data augmentation. And so one can rotate,
flip or crop an image. So I can rotate this flower. I can do that to all of the
images in my training set, and I can exploit those invariances by increasing the size of my training data.
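For natural images that recipe is one or two lines with standard tooling (illustrative sketch, assuming a recent torchvision that supports tensor inputs):

```python
# Sketch of standard image augmentation: random flips, rotations and crops exploit
# known invariances of natural images to enlarge the training set.
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.RandomResizedCrop(size=64, scale=(0.8, 1.0)),
])

image = torch.rand(3, 64, 64)                    # placeholder RGB image tensor
augmented = [augment(image) for _ in range(4)]   # four label-preserving variants
print([a.shape for a in augmented])
```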
But back to clinical time series data, we might have data pertaining to labs, data pertaining to vitals, data pertaining to medications. How are we going to flip, crop, or rotate a patient, right? And even if you sat down with a clinical expert, and I’ve done this and asked, well, what invariances do you think apply? What transformations can I
apply to augment our data? Because we only have 200 patients and I’d really like a thousand. They don’t have any answer. It’s not clear. There might be some invariances
for some types of data, other invariances for others. So data augmentation is
not straightforward. So given this challenge, we set out to learn invariances
directly from the data. And similar approaches have been proposed for image-based tasks. So in spatial transformer networks, an image input, so here a handwritten digit, is transformed in such a way that it reduces intra-class variation and is then fed through a classifier
that predicts it’s a five. And so it takes all of
these warped images, lines them all up, learns how to line them all up, like dynamic time warping would do, but it learns how to do it. So inspired by this, we proposed sequence transformer networks, where we take this time series and, instead of applying something
like dynamic time warping, we learned the transformation. And this is work done by
one of my PhD students, Jeeheh Oh and others in my lab. So our approach is based on a
sequence transformer network composed of multiple 1D
convolutional layers. And if you aren’t familiar with how to read these network
architectures, that’s okay. You can think of this as a black box. So this part is a black box. It takes as input, the time series, and so the input might be measurements of heart rate over time or
respiratory rate over time. And then it outputs two parameters, theta and phi. Theta and phi depend on the input, so depending on the input, I’ll get different thetas and phis. I then transform the input: theta dictates the temporal transformation that’s applied to the input, and phi tells me
the magnitude transformation. So should I increase it or stretch it? Should I compress it? Should I shift it? So we apply these transformations and then you have your new output, which you can just feed in to a conventional CNN. And so we end up with something like this, where you first transform the input. So we go from X to X prime and then we pass that
through our final classifier. That’s going to predict whether or not the patient’s at
high risk for C. diff or for in-hospital mortality or any other outcome. And we compare this to a baseline. So again, you can think
of this as a black box that didn’t have the sequence
transformer network piece. And so it didn’t try to align the signals. We train the sequence transformer network, so how do we train the parameters that will learn the mapping
from inputs to phi and theta? Well, we train the whole thing end-to-end. And so anytime I make an error at the output, that error helps me update the parameters in the sequence transformer network.
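As a hedged sketch of the overall wiring (not the published sequence transformer network), a small 1D convolutional module can predict a shift theta and a scale phi from the input itself, apply them with a differentiable resampling step, and feed the result to a conventional CNN so the whole thing trains end to end:

```python
# Hedged sketch only, not the published architecture: a 1-D conv module predicts
# a temporal shift (theta) and a magnitude scale (phi) from the input series,
# applies a simple differentiable warp, then a conventional 1-D CNN classifies.
import torch
import torch.nn as nn

class SequenceTransformer(nn.Module):
    def __init__(self, n_channels=1):
        super().__init__()
        self.param_net = nn.Sequential(
            nn.Conv1d(n_channels, 8, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(8, 2),                      # -> [theta (shift), phi (scale)]
        )

    def forward(self, x):                         # x: (batch, channels, time)
        theta_phi = self.param_net(x)
        theta = torch.tanh(theta_phi[:, :1])      # temporal shift in [-1, 1]
        phi = torch.exp(theta_phi[:, 1:])         # positive magnitude scale
        t = torch.linspace(-1, 1, x.shape[-1], device=x.device)
        grid = (t.unsqueeze(0) + theta).clamp(-1, 1)               # (batch, time)
        grid = torch.stack([grid, torch.zeros_like(grid)], dim=-1).unsqueeze(1)
        warped = nn.functional.grid_sample(x.unsqueeze(2), grid, align_corners=True)
        return phi.unsqueeze(-1) * warped.squeeze(2)

model = nn.Sequential(
    SequenceTransformer(),
    nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, 1),  # risk logit
)
series = torch.randn(4, 1, 48)                    # e.g. 48 hourly vital-sign values
print(model(series).shape)                        # torch.Size([4, 1])
```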
And we compared it to a baseline that had the same capacity, because you might be thinking, oh well, you have more parameters when you have this sequence transformer network, therefore it’s going to do better. But that wasn’t the case. So we evaluated our proposed
approach on a benchmark task using publicly available
data from MIMIC-III, and given data collected during the first 48 hours of a patient’s ICU stay, we aimed to predict, again,
in-hospital mortality. This is a common benchmark task. And we can see that
compared to a baseline, we do better in both the area under the receiver operating
characteristic curve and in terms of the area under the precision-recall curve. The increase in performance is modest, but it’s consistent
across these two metrics. And when we break down the
type of transformation, so we have the sequence
transformer network produce only the temporal transformation or only the magnitude transformation, we again see improvements
over the baseline, but the further improvement is gained by our approach in which
we combine both of these, suggesting that these two sorts of transformations are complementary, which we would expect. So the sequence transformer network is reducing intra-class differences. So patients with similar outcomes are made to look more similar, which we believe contributes to the overall increase in
discriminative performance. So admittedly these differences are small, but we think this is a promising direction that could yield further improvements as we either get more data or we start to explore other
types of transformations. Okay. So this brings us to the
last part of the talk. We’ve spent a considerable amount of time working through some
of the technical challenges of getting ML to work for healthcare. But you might be wondering, okay, how close are we really
to augmenting clinical care? So I wanna go back to the
example from the beginning of my talk, the work on predicting healthcare-associated infections. Many of our projects are at various stages of the pipeline, and this work is probably the closest to augmenting clinical care. And again, here we have an interpretable model that, using the contents of the electronic health record, can predict which patients are at greatest risk of acquiring an infection with C. diff, a really nasty type of
healthcare associated infection. So based on the contents of the EHR, this model automatically
estimates daily risk. And so you get an estimated risk score for each day, and it gets updated based on changes to the patient’s treatment, their in-hospital location, et cetera. And at this point we’ve validated our approach on retrospectively collected data. So we published our paper in ICHE, where we were validating on data from 2016, and then we recently
re-validated on data from 2018. The model works well or yields good discriminative performance. But that’s not enough, and so right now, we’re at the stage of prospective validation. So this involves integrating the model into the EHR system so that it can produce daily
estimates in near real time. And while it sounds straightforward, or may seem straightforward, it requires quite a bit of work and infrastructure to get
those real time data feeds. And so over the last few months, we’ve been working with folks in the research data warehouse
to accomplish this goal. So why do we need to
prospectively validate? We’ve retrospectively validated this across many institutions. But prospective validation
gives us the opportunity to test the model in silent
mode before we act on the model. And this serves two purposes. One, it highlights any
relevant differences that might surprise us. So if there are changes
in clinical protocol that the model can’t adapt to, we can identify that and fix it. And two, it gives us the opportunity to review errors in real time. And so we can review errors
on retrospective data, but we can’t go and ask the
clinician who is treating the patient why they think
the model made the error. Right? And by having that ability to query those errors in real time, we have the potential to
perhaps identify something that’s missing from the model or a blind spot of the model
that we can then improve. So once we have this infrastructure in place for near real time predictions and we have prospectively
validated the model, the next step will be a clinical study to test interventions based
on the model’s predictions. So you can imagine that you have predictions for every day and you have some decision threshold: for patients above that decision threshold, you intervene, or if they always lie below that decision threshold, you know they’re low risk, and maybe you might do something like decide whom not to test for C. diff.
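Operationally, that last step is just a threshold rule over the daily scores; a toy sketch (the threshold and actions here are made up, not the deployed policy):

```python
# Illustrative sketch: turn a stream of daily risk estimates into actions with a
# single decision threshold (hypothetical operating point and rules).
from typing import List

HIGH_RISK_THRESHOLD = 0.8   # made-up operating point

def daily_decisions(daily_risk: List[float]) -> List[str]:
    decisions = []
    for risk in daily_risk:
        if risk >= HIGH_RISK_THRESHOLD:
            decisions.append("flag for intervention")   # e.g. targeted cleaning
        else:
            decisions.append("routine care")            # consistently low risk
    return decisions

print(daily_decisions([0.12, 0.35, 0.83, 0.91]))
```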
So we’re currently working with clinicians here and at Massachusetts
General Hospital in Boston to explore potential interventions. So in summary, there’s a critical
need for AI in healthcare. By augmenting clinical practice, AI has the potential to reduce costs and medical errors through
decision support tools. And it’s critical that we start to build that infrastructure now. We need to be able to
prospectively test these models and efficiently run clinical studies. And though we should proceed with caution, we shouldn’t let black boxes
stop us from making progress. Interpretability is neither
necessary nor sufficient. Black boxes might be okay if they’ve been thoroughly
vetted in an environment that mimics how they will be used, and even interpretable
models can lack common sense. But both of these issues
can be mitigated with domain expertise, the white coats. Domain expertise can
inform our approach, either through regularization techniques or domain-specific architectures. Thank you. (applause) And I’m happy to take questions, but I’d just like to acknowledge
all of my collaborators and my students, and my funding sources, without them this work
would not be possible. Yes. – That was very interesting, and I really appreciate your emphasis on interpretable models, or at least the consideration of them. So one of the biggest problems with EMRs is that it’s really a measurement problem, which is that there’s tons of data in it, but the incentives for actually sort of taking seriously the entry of the data vary enormously. And I think you had an example about a protocol changing in your C. diff model, and it having a sort of effect on, I can’t remember exactly what it was. But it would be incredibly useful actually, as part of this sort of work, to sort of come up with automated ways to actually detect measurement problems in EMR data.
– Yeah. Because there’s not much
point in us collecting data that’s clearly spurious, of which there’s tons. And there may be incentives to sort of clean up certain areas of it, but it’s an enormous task, and this is where it would seem like you could make some rule-based approaches to detecting major measurement problems in common data elements in a medical record. Do you have any thoughts about that? – Yeah, so absolutely. I agree with you 100%. There’s definitely perhaps an incentives issue, where a lot of this is long term rather than short term. A clinician won’t
necessarily see the benefit of them entering the data at this moment, but if they enter the data correctly. – [Audience Member] They
have a different agenda. They can’t get out the door and put the next patient in
without closing that box. – So absolutely, and
people are working on this. In particular, as it relates to temporal changes because
once you deploy these models, no one really knows how often you should be updating the models or how do you even check
when to update the models. So you can do something like a data diff. When did my data really start to change from the training data? That might be one suggestion, but then there are feedback issues. If my model is working really well, then all the patients who
I predicted were high risk, and then intervened on, are now going to look low risk. So this is an active area of research, people don’t have answers yet, but we’re definitely working on it. Are there questions? Yes. – So even for something straightforward like image analysis, how much of a problem is
the different manufacturers data processing to generate the image as you then try to apply one of the machine learning algorithms? – So differences in how the data were collected can lead to spurious relationships or strange results. So you have to be really careful of that. And this is again, an issue
with black boxes, all right? So if you have a model that’s trying to predict Alzheimer’s disease, let’s say based on an MRI, and all of the controls were collected at one site and, this is extreme, all the cases were collected at another, and they used different protocols, then your model might just be detecting a difference in protocol, or a difference in the machines, and not really a difference
in true underlying relationships between the
data and Alzheimer’s disease. This can happen in other
domains other than imaging. So this is where, again, helpful to have that interpretability, to be able to open up the black
box or as Rich Caruana says, put on those magic glasses and see what the model has learned. – Great talk Jenna. I just wanted to ask a question
just along the reasoning that you were just explaining. It seems to me that there’d
be a tremendous value to machine learning in allowing it to just go and muck through basically uninterpretable data, but then in retrospect go back and open up the uninterpretable data and make it interpretable. Because we’re assuming
that the human element, the clinician element, i.e. interpretability, is the gold standard, and it may not be. There may be things we simply have not thought of. So what is the prospect
for opening the black box as you say? – So this is definitely a
really active area of research, a lot of people are trying
to explain deep models and so given a deep model, try to come up with a simple
explanation for its outputs, try essentially to infer what that feature representation
corresponds to. There are many different
techniques for this that have been published
in the last few years. But it’s not always the case that you actually need the deep model. Oftentimes in our work, like the C. Diff model, that’s an interpretable model. And we’ve compared it to deep approaches and it doesn’t do any better. So there’s this assumption that you’re always going to be trading off
interpretability and accuracy, but that doesn’t always hold. So yes, people are working on it, but it’s not necessarily a silver bullet. Yes. – [Audience Member] Nice talk Jenna. I was wondering how you deal with all of the different types of data available. Like you could have one piece of information represented in five or several more different ways, but they’re actually
the same piece of data. So for example, there’s a diagnosis, but then there are several
different ICD codes, or encounter diagnoses, or problem lists. So how do you choose, or how does AI choose, which is the best way to use it? How does it prevent duplicative data on the same data point? – So, it learns with a lot of data. So with more data, I can figure out which ICD-9 codes are actually
telling me the same thing. But often times we don’t
have a lot of data, so that’s where the feature
engineering comes in. And for example, with the C. diff work, we’ll do something like represent medications at various levels of abstraction. So we have medications that include the ingredient, the dose, the route, and then we have another level of abstraction that just tells me the ingredient. And then another level of abstraction that just tells me the class, so whether they are cephalosporins or fluoroquinolones or not, and this then allows the model to choose. For some meds, you might wanna go down to the route. It might be really important that it’s oral vancomycin
versus IV vancomycin, and for other drugs that might not matter.
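A rough illustration of that multi-level encoding (the names and class map below are hypothetical examples, not the actual feature pipeline):

```python
# Rough illustration only: encode one medication exposure at several levels of
# abstraction so the model can pick whichever level carries signal.
DRUG_CLASS = {"vancomycin": "glycopeptide", "cefazolin": "cephalosporin"}  # hypothetical map

def medication_features(ingredient: str, dose: str, route: str) -> set:
    return {
        f"med:{ingredient}:{dose}:{route}",  # most specific: ingredient + dose + route
        f"med:{ingredient}:{route}",         # ingredient + route (oral vs. IV can matter)
        f"med:{ingredient}",                 # ingredient only
        f"class:{DRUG_CLASS.get(ingredient, 'other')}",  # broadest: drug class
    }

print(sorted(medication_features("vancomycin", "125mg", "oral")))
```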
– Hi, you kind of touched on this in your answer to the first question, but I wanna ask a little
more about the feedback loop. The idea of if this is actually
implemented in the system, then how you use that new data given that you change the outcome potentially. What are some kind of
proposed solutions to this? – So like I said, it’s an
active area of research. So I don’t know that anyone has the be-all and end-all answer. One thing you could think of is, so long as it’s recorded, that you’ve intervened, then you can set those data aside, right? But you won’t ever have
the counterfactual. You won’t know what happened
if you weren’t to intervene. But you wouldn’t retrain necessarily, you definitely wouldn’t
retrain on those data, because then you’d start
flipping back and forth. But I think it’s a really
interesting question, and one that we’ll be better equipped to answer the further we get. And so I don’t think that we should view this as a roadblock, but rather move forward with caution, collect more data, and learn from it. – Are you doing anything to
assess the predictive power of interactions over time
in multivariate time series? – Yeah, so interactions over time. So for example, a relationship between a med given at one time point and a lab result at a
different time point. – [Audience Member] Or
functional imaging. – Yeah, oh, okay, okay. So in time series, in our analysis of the
sequence transformer network, absolutely. We’re looking at relationships over time. For the brain images, right now we’ve just
stuck to structural MRI. We wanna get to fMRI, but those are really big datasets. So we’re still working
out some of those issues. Well, I’m going to stick
around for the reception, and so I’ll be just outside and thank you everyone for your attention. (applause)
