Public health research shapes our country’s health care policies, tests treatments and interventions for safety and effectiveness, and informs and enables us to make decisions that lead to healthier lives.
In our latest episode of Humans in Public Health, host Megan Hall is joined by Rebecca Hubbard, the Carl Kawaja and Wendy Holcombe Professor of Public Health and professor of biostatistics and data science at Brown, to discuss how data is collected, analyzed and used in world-class, evidence-based research. Below is an edited transcript of the podcast.
How did you get interested in biostatistics and research methodology?
Hubbard: I’ve always been curious about everything—and about the process of science itself. I like doing statistics because it lets me be part of figuring out how we generate evidence and how we learn what’s real and true. And the best part is I don’t have to choose just one topic.
Data are everywhere now because everything is electronic: accelerometry data from smartwatches, posts that people share on social media, environmental monitors that track air pollution, and so on.
Each time I start on a new topic, I meet with the clinicians and researchers in that field, and they tell me all about the science they’re passionate about. Listening to them talk, I get excited too. And I think, okay, I get to work on another fascinating project and help make science better in a new area, in a new way.
You often use real-world data in your work examining health outcomes. What are common pitfalls when it comes to data quality?
The important thing to remember about health care–derived data is that they come from people coming in to receive care in their usual ways. That means the data we have are the ones needed to deliver care and bill for it, not necessarily the data we’d collect if our goal were research.
In statistical terms, that creates all kinds of bias. You have differential information for different people, and the quality and amount of information you have may be related to the very outcomes you're trying to study. That means you're more likely to capture those outcomes for some people than for others, which is a serious methodological problem.
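To make that concrete, here is a minimal simulation, not from the episode and with invented numbers: two groups have the same true outcome rate, but one group visits the clinic more often, so more of its outcomes end up in the records, and the observed rates differ even though the truth doesn't.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

true_rate = 0.10                      # identical true outcome rate in both groups
frequent_user = rng.random(n) < 0.5   # half the cohort uses health care often
outcome = rng.random(n) < true_rate   # true outcomes, same rate for everyone

# An outcome is only recorded if the person shows up for care:
# frequent users are captured 90% of the time, infrequent users 40%.
p_capture = np.where(frequent_user, 0.9, 0.4)
recorded = outcome & (rng.random(n) < p_capture)

print("Observed rate, frequent users:  ", recorded[frequent_user].mean())   # ~0.09
print("Observed rate, infrequent users:", recorded[~frequent_user].mean())  # ~0.04
```

The recorded data suggest frequent users have more than twice the outcome rate of infrequent users, purely because their outcomes are more likely to be captured.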
What do you do about that problem?
The first and most important thing is recognizing that it’s a problem. One thing I noticed, especially during the pandemic when everyone was excited about using real-world health care data, was this idea that if you just have a huge dataset—say, data on ten million people—and throw it into an algorithm, you’ll automatically get the right answer. But from a statistical perspective, that’s completely wrong. More data doesn’t eliminate bias.
So the first step is realizing that these data aren’t the same as research data.
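As a sketch of why more data doesn't eliminate bias, again with invented numbers rather than anything from the episode: if people with higher values are more likely to end up in the dataset, the estimate converges to the wrong answer, and a larger sample only makes the uncertainty around that wrong answer look smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean = 50.0

for n in (1_000, 100_000, 10_000_000):
    values = rng.normal(true_mean, 10.0, size=n)
    # Inclusion probability rises with the value itself -> selection bias.
    p_include = 1 / (1 + np.exp(-(values - true_mean) / 10.0))
    sample = values[rng.random(n) < p_include]
    se = sample.std() / np.sqrt(len(sample))
    print(f"n={n:>10,}: estimate={sample.mean():6.2f} +/- {1.96 * se:.3f} (truth={true_mean})")
```

At every sample size the estimate sits around 54 rather than 50; the confidence interval shrinks, but around the biased answer.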
What I like to do is think hypothetically: if I had designed a research study, what data would I have collected? What would I really need to know about each person, like why they received a treatment and how their outcomes were measured? Then I take the health care data and try to harmonize them with that hypothetical study. Once you do that, you start to see where the gaps are.