Professor Alyssa Bilinski set out to answer a seemingly simple question: how often are pregnant people included in medical trials? But finding the answer was anything but simple. With 90,000 records to analyze, she turned to AI for help—but ensuring the accuracy of the results required a creative approach.
Ever since ChatGPT came out about two years ago, we've had a hard time escaping two letters: “AI.” There’s a lot of debate about the role of artificial intelligence in our lives. When is it a useful tool? And when can it create harm? Academic researchers are grappling with these questions too. Our interviewee today, Alyssa Bilinski, is one of them. Alyssa is a professor at Brown University, and she’s not a computer scientist or a programmer. She’s a health policy researcher.
Bilinski: At a high level, I think there are two main ways that AI is changing research. First, it's changing the conduct of research: it can help us with things like coding, reading papers, even making a podcast of your papers. And in my work in particular, where we're gathering information from many, many different sources, it can change our ability to get information.
The second piece that we're thinking a lot about is not just how AI helps us do research, but how we can improve the AI available to us to make it better for research. And for both of these tasks, it's really hard to overstate the potential benefits of AI, but there are really big risks as well.
But if you've ever played around with ChatGPT, you know that it makes errors, it makes unexpected errors, and it makes them really confidently. That can make it really hard to actually apply AI to the problems we want to work on.
Can you give an example of how you use AI to gather information without making big errors?
Sure. First, think about how clinical trials work: you take a group of people and randomly give some of them a treatment and others not. This is the most reliable way to learn how well a drug does or doesn't work, and what its side effects are.
Our research looked at studies involving pregnant people. Over 90 million women in the U.S. have given birth. That's more than 70% of women between the ages of roughly 18 and 85. But traditionally, pregnant women have been excluded from drug development clinical trials.
So we wanted to ask what might seem like a pretty basic question: How often are pregnant people included in these clinical trials, and how has that changed over time? And this is a deceptively hard question to answer, because there's a really good database covering nearly all clinical drug trials in the United States, but there's no actual field related to pregnancy inclusion. Sometimes it's going to say, ‘Had to have a negative pregnancy test’. Sometimes it might say, ‘Be postmenopausal’ or ‘Have a positive pregnancy test’. But it's completely unstandardized.
And this is where the AI comes in?
Right. What we didn't want to do was tell a poor research assistant to go read between 40,000 and 60,000 of those studies. It would probably be a summer's worth of work for an R.A. to do, and a really unpleasant summer at that.
So AI helped us out by basically reading these blurbs for us and telling us whether pregnant people were or were not included, along with some additional information. But I think the key is that we didn't just say, ‘Here, ChatGPT, tell us whether pregnant people are included’.
We pulled a small group of studies and asked the AI to tell us whether it thought pregnant people were included in each clinical trial, to give us the reason for that classification, and to give us a quote that supports the claim.
So don't just give us the answer: support your work.
Yep, support your work. And that helped us to both catch edge cases that we maybe hadn't thought about, like, what if it talks about breastfeeding, but not pregnancy directly? And it also helped us to catch cases where the AI was likely to hallucinate.
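A minimal sketch of what that first-pass prompt might look like, assuming the OpenAI Python client; the model name, field names, and prompt wording are illustrative, not the team's actual pipeline:

```python
import json
from openai import OpenAI  # any LLM client would do; this one is an assumption

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical prompt: ask for a classification, a reason, and a supporting
# quote, mirroring the "support your work" approach described above.
PROMPT = """You will be shown the eligibility criteria of a clinical trial.
Respond in JSON with three fields:
  "classification": "included" or "excluded" (for pregnant participants),
  "reason": a one-sentence justification,
  "quote": the exact text from the criteria that supports your answer.

Criteria:
{criteria}
"""

def classify_trial(criteria: str) -> dict:
    """First pass: classify one trial and make the model cite its evidence."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice, not the study's model
        messages=[{"role": "user", "content": PROMPT.format(criteria=criteria)}],
        response_format={"type": "json_object"},  # ask for well-formed JSON
    )
    return json.loads(response.choices[0].message.content)

# e.g., an unstandardized blurb like the ones described above:
# classify_trial("Participants must have a negative pregnancy test at screening.")
```

Requiring a supporting quote gives a human reviewer something concrete to spot-check against the source text, which is what surfaces the edge cases and hallucinations she mentions.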
The cases it had the most trouble with were ones without any information to make a call. The model would classify the trial as excluding pregnant participants, but we wanted it to tell us there was no data. So we added a second step where a second AI agent would take a look at the result from the first agent and ask, ‘Did you classify this correctly? And in particular, are you making this error of being too confident?’
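A similar sketch of that second-pass check, reusing the hypothetical `classify_trial` setup above; the verifier prompt and the `no_information` label are my illustration of the idea, not the paper's exact wording:

```python
# Hypothetical second-agent prompt: re-examine the first answer, specifically
# probing for the overconfidence failure described above.
VERIFY_PROMPT = """A model read the eligibility criteria below and classified
pregnancy inclusion as "{classification}", citing this quote: "{quote}".

Criteria:
{criteria}

Did it classify this correctly? In particular, if the criteria contain no
information about pregnancy at all, the right answer is "no_information",
not "excluded". Respond in JSON with fields "classification" and "reason".
"""

def verify_classification(criteria: str, first_pass: dict) -> dict:
    """Second pass: the same model, called again, checks the first answer."""
    content = VERIFY_PROMPT.format(
        classification=first_pass["classification"],
        quote=first_pass["quote"],
        criteria=criteria,
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # same underlying model, just invoked a second time
        messages=[{"role": "user", "content": content}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```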
Was the second AI trained differently? Because if it's the exact same algorithm, won't it just make the same mistake?
Interestingly, no. Calling it a second time, it could often catch itself, just like when you play with ChatGPT and it gives you an error: you might say, ‘That's not right,’ and it will say, ‘Oops, sorry. You're right. I'm wrong.’
So interestingly, that worked really well. Then we had a human, to whom we are very grateful, label 1,000 studies as a larger training set. We went through this process on a larger scale, really trying to refine the prompt and the different steps at play. And after that process, what we found was that the model was more than 98% accurate.
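The validation she describes boils down to comparing the two-pass pipeline against the human labels; a minimal sketch, assuming the labels live in simple (criteria, label) pairs:

```python
def pipeline(criteria: str) -> str:
    """Run both passes and return the final label."""
    first = classify_trial(criteria)
    return verify_classification(criteria, first)["classification"]

def accuracy(labeled: list[tuple[str, str]]) -> float:
    """Fraction of human-labeled studies where the pipeline agrees with the human."""
    hits = sum(pipeline(criteria) == label for criteria, label in labeled)
    return hits / len(labeled)

# e.g., accuracy(human_labeled_1000); the refined pipeline reportedly exceeded 0.98
```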
How would you have done this research if you didn't have AI?
If we didn't have AI, we would only have been able to look at a much smaller sample of trials, and we would have had a much less well-rounded understanding of this phenomenon.
In contrast, the work I'm describing here allowed us to very nimbly answer this question, as well as a number of related questions we looked at in the study, very, very quickly.
Actually running this analysis with AI on the full sample takes about an hour. The training takes longer, but it's still much, much faster, and it totally changed how we thought about our ability to do this kind of work.
And what did the research find?
So what we found is perhaps not surprising, but not very heartening: less than 1% of the studies included pregnant participants. And what's more, despite calls to more broadly include pregnant participants in randomized controlled trials, this rate has been completely flat over the past 15 years.
And if we think it's potentially risky to experiment on pregnant people, it's even worse to make every pregnant person take imperfect information and make a guess. It's kind of like experimenting on all of them, but not learning from it.
We’re asking women to make a hard call during a really important time, with less data than people would normally have available to them to make decisions about taking medications.
So what are your thoughts now that you've done this research?
I think at a really high level, what we like to emphasize is that it was only in 1962 that the FDA required that companies submit evidence that medications were safe and effective. And for quite a long time, not just pregnant people, but all women of childbearing age were barred from participating in clinical trials.
And it was only in 1993 that federal law required including non-pregnant people and broader notions of representation in clinical trials. That was just 30 years ago. And so our hope is that, 30 years from now, not including pregnant people in clinical trials and not having high-quality evidence about the safety of medications, both for pregnant people and for their babies, will seem just as odd and unusual as not including women in clinical trials seems to us today.
What's your perspective on the role of this tool moving forward, specifically in research?
I think we should be cautious, but engaged. I think that AI is only going to become more ubiquitous, and it's to our benefit and the benefit of the research we do in the communities we work with to harness what it can do to improve our research.
At the same time, on the AI front, if there's one thing I'd want people to take away from this, it's the phrase ‘test-driven development’. Anytime we use AI, I want us to stop and ask the question, ‘How will we know if the results we get are correct and reliable?’ Only after answering that should we go ahead and use AI.
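In code terms, that habit might look like writing the check before scaling up; a tiny illustrative test, assuming the hypothetical functions sketched earlier live in a module called `pipeline`:

```python
# test_pipeline.py -- hypothetical; run with `pytest` before trusting the full run
from pipeline import accuracy, load_human_labels  # assumed names, for illustration

def test_pipeline_is_accurate_enough():
    # Decide the bar for "correct and reliable" before classifying 40,000+ trials.
    labeled_sample = load_human_labels("human_labels.csv")
    assert accuracy(labeled_sample) >= 0.95
```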