Messy Data, Real Answers

In a world teeming with health data—from smart watch accelerometry to millions of hospital system electronic records—how do researchers find out which medical treatments truly work? Biostatistician Rebecca Hubbard discusses the messiness of real-world data, the limits of randomized controlled trials and how both of these powerful—but imperfect—methods are essential for building a trustworthy ‘edifice of evidence.’

Public health research shapes our country’s health care policies, tests treatments and interventions for safety and effectiveness, and informs and enables us to make decisions that lead to healthier lives.

In our latest episode of Humans in Public Health, host Megan Hall is joined by Rebecca Hubbard, the Carl Kawaja and Wendy Holcombe Professor of Public Health and professor of biostatistics and data science at Brown, to discuss how data is collected, analyzed and used in world-class, evidence-based research. Below is an edited transcript of the podcast.


How did you get interested in biostatistics and research methodology?

Hubbard: I’ve always been curious about everything—and about the process of science itself. I like doing statistics because it lets me be part of figuring out how we generate evidence and how we learn what’s real and true. And the best part is I don’t have to choose just one topic.

Data are everywhere now because everything is electronic: accelerometry data from smartwatches, social media posts, environmental monitors that track air pollution and things like that.

Data are everywhere, from accelerometry data from smartwatches, to social media data, to environmental monitors of heat and air pollution. 

Each time I start on a new topic, I meet with the clinicians and researchers in that field, and they tell me all about the science they’re passionate about. Listening to them talk, I get excited too. And I think, okay, I get to work on another fascinating project and help make science better in a new area, in a new way.

You often use real-world data in your work examining health outcomes. What are common pitfalls when it comes to data quality? 

The important thing to remember about health care–derived data is that they come from people coming in to receive care in their usual ways. That means the data we have are the ones needed to deliver care and bill for it, not necessarily the data we’d collect if our goal were research.

In statistical terms, that creates all kinds of bias: you have differential information for different people, which is always problematic in research, and the quality and amount of information you have may be related to the outcomes you’re trying to study. So you’re more likely to capture those outcomes for some people than for others, which is very problematic from a methodological perspective.

What do you do about that problem?

The first and most important thing is recognizing that it’s a problem. One thing I noticed, especially during the pandemic when everyone was excited about using real-world health care data, was this idea that if you just have a huge dataset—say, data on ten million people—and throw it into an algorithm, you’ll automatically get the right answer. But from a statistical perspective, that’s completely wrong. More data doesn’t eliminate bias.

So the first step is realizing that these data aren’t the same as research data. 

What I like to do is think hypothetically: if I had designed a research study, what data would I have collected? What would I really need to know about each person—why they received a treatment, how their outcomes were measured and so on? Then I take the health care data and try to harmonize them with that design. Once you do that, you start to see where the gaps are.
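Hubbard’s point that more data doesn’t eliminate bias can be illustrated with a small simulation. Everything below is invented for illustration (a scenario where sicker patients are more likely to receive a drug that, in truth, has no effect); it is a toy sketch, not an analysis from her actual studies:

```python
import random

def biased_estimate(n, seed=0):
    """Naive treatment-effect estimate from confounded observational data.

    Sicker patients are more likely to receive drug A, and sickness
    independently worsens the outcome. The true effect of drug A is zero,
    yet the naive comparison stays biased no matter how large n gets.
    """
    rng = random.Random(seed)
    a_outcomes, b_outcomes = [], []
    for _ in range(n):
        severity = rng.random()                        # 0 = healthy, 1 = very sick
        gets_a = rng.random() < severity               # sicker -> more likely to get A
        outcome = 10 - 5 * severity + rng.gauss(0, 1)  # drug choice has no effect
        (a_outcomes if gets_a else b_outcomes).append(outcome)
    return sum(a_outcomes) / len(a_outcomes) - sum(b_outcomes) / len(b_outcomes)

# The true difference is 0, but the estimate hovers near -1.7 at every sample size:
for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7}: estimated A-vs-B difference = {biased_estimate(n):+.2f}")
```

Collecting a hundred times more records sharpens the estimate around the wrong value; only accounting for why patients got the drug (or randomizing that choice) removes the bias.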

“ One thing I noticed, especially during the pandemic when everyone was excited about using real-world health care data, was this idea that if you just have a huge dataset—say, data on ten million people—and throw it into an algorithm, you’ll automatically get the right answer. But from a statistical perspective, that’s completely wrong. More data doesn’t eliminate bias. ”

Rebecca Hubbard, the Carl Kawaja and Wendy Holcombe Professor of Public Health and professor of biostatistics and data science at Brown
 
Rebecca Hubbard, the Carl Kawaja and Wendy Holcombe Professor of Public Health and professor of biostatistics and data science at Brown

Randomized controlled trials are often considered the gold standard of research. What are their strengths? Why have they traditionally been considered the best form of research?

In a randomized controlled trial, or RCT, we’re comparing two different health care interventions—often two drugs, surgical procedures or other treatments—head-to-head. We randomly assign patients to receive one treatment or the other.

Randomized controlled trials are considered the ‘gold standard’ for single-study evidence because there are no systematic differences between the people receiving the two treatments.

They’re considered the ‘gold standard’ for single-study evidence because randomization eliminates confounding. That means there are no systematic differences between people receiving the two treatments—which is both crucial and very hard to achieve without randomization.
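What “no systematic differences” means can be seen in a toy simulation (invented numbers, purely illustrative, not drawn from any study in the interview): when a coin flip assigns treatment, the two arms end up with the same average severity of illness; when sicker patients preferentially get treated, they don’t.

```python
import random

def arm_severity(n, randomized, seed=1):
    """Average illness severity in the treated and control arms.

    With randomization, a coin flip assigns treatment, so the arms are
    balanced. Without it, sicker patients are more likely to be treated,
    and the arms differ systematically (confounding).
    """
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        severity = rng.random()                       # 0 = healthy, 1 = very sick
        if randomized:
            gets_treatment = rng.random() < 0.5       # coin flip
        else:
            gets_treatment = rng.random() < severity  # sicker -> more often treated
        (treated if gets_treatment else control).append(severity)
    return sum(treated) / len(treated), sum(control) / len(control)

print("randomized assignment :", arm_severity(100_000, randomized=True))   # both near 0.50
print("severity-driven choice:", arm_severity(100_000, randomized=False))  # roughly 0.67 vs 0.33
```

Because the randomized arms are alike on severity (and on every other patient trait, measured or not), any difference in outcomes can be attributed to the treatment itself.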

Another key reason is the quality and pre-specification of the data. Pre-specification is essential—it’s not just that the data happen to be complete with little missingness, but that before the study even begins, the researchers decide exactly what information they need. They define the outcome, determine how and when it will be assessed and make sure it’s done uniformly for everyone.

That said, despite all the papers I’ve started by saying RCTs are the gold standard, not all RCTs are created equal. You can have a really well-designed RCT that produces high-quality evidence, or a poorly designed one that doesn’t. Each needs to be evaluated on its own merits—how well the outcome was defined, how complete the data collection was and so on. So yes, they’re regarded as the gold standard—but they have flaws, too.

What are some of those flaws, even if it’s conducted well? What are the things that RCTs can miss or do wrong?

From my perspective, the biggest limitation of RCTs is lack of generalizability. Because they are conducted in such a precise, specific way, the care that patients receive in the context of an RCT does not necessarily look like the care they would receive in routine practice. So the results observed in a clinical trial population often do not fully translate to a real-world population.

In addition to differences in the care environment and the intervention that patients are receiving, there are also big differences in the patient population. I've seen this a lot, especially in oncology, where the patient population that’s specified for an RCT is the population that the investigators think has the highest likelihood of benefiting from the intervention, and that usually means they're healthier, and more likely to have a good prognosis.

But once the trial is done, the drug is approved and it enters real-world use, everyone receives it—not just the patients with good prognoses, but also those with poorer outcomes or who might have faced barriers to accessing clinical trials in the first place. So it’s not surprising that there can be pretty big differences between the results you see in the real world and those observed in RCTs.

“ Not all RCTs are created equal. You can have a really well-designed RCT that produces high-quality evidence, or a poorly designed one that doesn’t. Each needs to be evaluated on its own merits—how well the outcome was defined, how complete the data collection was and so on. So yes, they’re regarded as the gold standard—but they have flaws, too. ”

Rebecca Hubbard, the Carl Kawaja and Wendy Holcombe Professor of Public Health and professor of biostatistics and data science at Brown

Let's talk more about using real-world data. We briefly touched upon the challenges, but quickly tell us about the pros and cons of using big data sets from the real world.

To me the major advantage of using real-world data is specifically that it addresses that generalizability limitation of RCTs.

Real-world data are data that are generated as a byproduct of health care, and therefore they reflect the real patient population who will actually receive this intervention as it's delivered in the real world. That's a huge strength. The strength of RCTs lies in their randomization and the high-quality, carefully planned data they produce. 

Real-world data, on the other hand, don’t have that randomization. You’re just looking at the patients who happened to receive a particular treatment, and the data weren’t collected according to a pre-specified plan. Essentially, you’re limited to whatever information was gathered as part of that patient’s routine care and billing—and that’s what you have to work with.

All of this information is from real people’s real health care records. Should we be worried about our information being used for research?

I do definitely think about that. When I’m analyzing electronic health records data, I think, ‘I could be in this data set,’ or ‘my partner could be in this data set.’ And that makes me want to make sure that the students and trainees I’m working with think about the ethics and really treat it this way: every data point is a person, and every data point should be treated with respect. Because it’s a gift that we’re being allowed to analyze these data.

What are the ethical lessons you give your students about using data?

Probably the most important one is the concept of minimum necessary; that we want to access only the data that we need for research, and nothing beyond that. We certainly don't want to go poking around in people’s medical records finding out information about them that we don't need.

Researchers access only the data from health care records that are needed for research, nothing more.

When I’m conducting a study with electronic health records or claims, right at the start I’ll do that hypothetical exercise of thinking through: Okay, what would a designed research study look like using these data? Then those are the data I’ll request. So, I request only the information that I need, and I always want to see as little of people’s private information as I possibly can.

As a statistician, I do not need to know your name, your Social Security number or your address, and I never want to see that kind of personal, private information.

Can you tell me about a real-world example of comparing randomized controlled trials with real-world data, showing their differences?

About five to 10 years ago, I was working with oncologists at the University of Pennsylvania, and at that time immunotherapy had just been approved for the treatment of advanced bladder cancer. It had been approved through a pathway the FDA calls accelerated approval.

In accelerated approval, you don’t have to do an RCT to get your new drug approved. Immunotherapy had come on the market, but there were no RCT data comparing it head-to-head with the existing standard of care, which was chemotherapy.

The oncologists came to me and said, ‘This is on the market now. We are using it, but we don't actually know if it’s benefiting our patients. There is an RCT that’s ongoing to answer this question, head-to-head comparison of immunotherapy versus chemotherapy, but it’s not expected to complete for some time.’

In the meantime, they had this evidentiary gap where they were asking: What should I do? What’s best for my patients?

Because immunotherapy was being used in routine practice, we were able to take oncology EHR data, look at the patients who were getting immunotherapy, compare them to patients who were getting chemotherapy, do the head-to-head comparison in the absence of the RCT and generate real-world evidence about how patients were doing.

What we found was immunotherapy looked worse than chemotherapy early in follow-up, but if you followed patients out long enough, the patients who survived over time ultimately did better on immunotherapy than chemotherapy. About six months after we published that result, the RCT completed; they published their results and their results were amazingly similar. We were able to fill that evidentiary gap, move the evidence forward sooner, and answer this question for oncologists. 

So what does that tell you? What's the conclusion from that head-to-head comparison?

I really think of all the evidence we generate from different study designs as fitting together in complementary ways—building an edifice of evidence that we should consider as a whole, rather than focusing on any single study.

We can look at an RCT and say, “I’m concerned about how generalizable this is,” or point out specific flaws. Real-world data can help address some of those limitations—but it also comes with its own biases, some of which can’t be completely solved. The lack of randomization means there’s always the possibility of confounding; you can never be entirely sure you’ve accounted for every relevant patient variable.

That’s why we need both. We need real-world data studies to help confirm that RCT findings actually translate to real-world settings. But we also need RCTs—the high-quality, carefully controlled data that provide the bedrock of evidence we can be confident in. I wouldn’t trust any one study on its own; the answer comes from considering all the evidence together.

And I think it’s also important to remember that science is self-correcting. When I present about electronic health records and real-world data, people sometimes say, “But the data are so messy. How do you know you’re getting the right answer?” My response is that we’re doing the best, highest-quality study we can while acknowledging the biases and limitations. And we’ll do another study. This isn’t the end of the chain of evidence. Future research will continue to refine what we know, update the evidence, and bring us closer to the right answer.

“ We need real-world data studies to help confirm that RCT findings actually translate to real-world settings. But we also need RCTs—the high-quality, carefully controlled data that provide the bedrock of evidence we can be confident in. I wouldn’t trust any one study on its own; the answer comes from considering all the evidence together. ”

Rebecca Hubbard, the Carl Kawaja and Wendy Holcombe Professor of Public Health and professor of biostatistics and data science at Brown

It sounds like randomized controlled trials and real-world data are kind of in conversation with each other.

That's what I think. RCTs are sometimes called “pivotal,” as if the pivotal trial gives us the final answer about whether a treatment is effective. But in reality, science never really ends. Patient populations change over time, as do disease characteristics and risk factors, so we always need to keep updating the evidence. In that sense, the RCT starts the conversation, and the real-world data help carry it forward.