Wednesday, 19 January 2011

Data mining: It's about research efficiency.

I have taken a little break from my series on reviewing the literature -- work has superseded all other pursuits for a little while. But I did want to do a brief post today, since this JAMA Commentary really intrigued me.

The first thing that interested me was the authors. Now, I know who Benjamin Djulbegovic is -- you have to live under a rock as an outcomes researcher not to have heard of him. But who is Mia Djulbegovic? It is an unusual enough surname to make me think that she is somehow related to Benjamin. So, I queried the mighty Google, and it spat out 1,700 hits like nothing. But only one was useful in helping me identify this person, and that was a link to her 2010 BMJ paper on prostate cancer screening. On this paper (her only one listed in Medline so far), she is the first author, and her credentials are listed as "student", more specifically in the Department of Urology at the University of Florida College of Medicine in Gainesville, FL. The penultimate author on the paper is none other than Benjamin Djulbegovic, at the University of South Florida in Tampa, FL. So, I am surmising from this circumstantial evidence that Mia is Benjamin's kid, and either a college or a medical student.

Why does this matter? Well, there seem to be so few papers in high-impact journals authored by people without an advanced degree, let alone in the first position, that I am in awe of this young woman, now with two major journals to her name -- BMJ and JAMA. This is evidence that parental mentorship counts for a lot (assuming that I am correct about their relationship). But regardless, kudos to her!

Secondly, the title of the essay really grabbed me: what is the "principle of question propagation", and what does it have to do with comparative effectiveness research (CER) and data mining? Well, basically, the principle of question propagation is something we talk about here a lot: questions beget questions, and the further you go down any rabbit hole, the smaller and more detailed the questions become. This is the beauty and richness of science, as well as what I have referred to as the "unidirectional skepticism" of science, meaning that much of the time, building on existing concepts, we simply continue in the same direction within a particular research pursuit. This is why Max Planck was right when he said:
A new scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die, and a new generation grows up that is familiar with it.
So, yes, we build upon previous work and continue our journey down a single rabbit hole for our entire career. Though of course there are countless rabbit holes all being explored at the same time; it is really more of a fractal-like situation than a single linear progression. What is clear, as the authors of the Commentary point out, is that this results in the ever-escalating theoretical complexity of scientific concepts. What does this have to do with anything? This, the authors state, argues for the continued use of theory-driven hypothesis testing, given that medical knowledge will forever be incomplete. And this brings them to data mining.

Here is where I get a little confused and annoyed. They caution the powers that be against consigning all clinical research to data mining, at the expense of more rigorous studies that pursue hypothesis testing. They argue that mining data that already exist is limiting precisely because such mining is constrained by the scope of our current knowledge, and that we cannot use these data to generate new associations and new treatment paradigms. They further state that emerging knowledge will require updating these data sets with new data points, which, according to the authors,
...creates a paradox, which is particularly evident when searching for treatment effects in subgroups—one of the purported goals of the IT CER initiative. As new research generates new evidence of the importance for tailoring treatments to a given subpopulation of patients, the existing databases will need to be updated, in turn undermining the original purpose to discover new relationships via existing records.
Come again? And then they say that "consequently, the data mining approach can never result in credible discoveries that will obviate the need for new data collection". Mmhm, and so? Is this the punch line? Well, OK, they also say that because of all this we will still need to do hypothesis-testing research. Is this not self-evident?

I don't know about you, but I have never thought that retrospective data mining would be the only answer to our research needs. Rather, the way to view this type of research is as an opportunistic pursuit of information from massive repositories of existing data. We can look for details that are unavailable in the interventional literature, zoom in on the potentially important bits, and use this information to inform more focused (and therefore pragmatically more realistic) interventional studies.
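To make this concrete, here is a minimal sketch of what hypothesis-generating data mining might look like in practice. It uses Python with pandas on a synthetic stand-in for an existing clinical registry -- all the column names (treated, age_group, sex, outcome) and the screening threshold are my own illustrative assumptions, not anything from the Commentary. The idea is simply to screen subgroups for treated-vs-untreated differences in event rates; anything that stands out becomes a candidate question for a focused prospective study, not a conclusion in itself.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for an existing clinical registry (hypothetical columns).
rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),                    # 1 = received treatment
    "age_group": rng.choice(["<50", "50-70", ">70"], n),
    "sex": rng.choice(["F", "M"], n),
    "outcome": rng.integers(0, 2, n),                    # 1 = event occurred
})

# Screen each subgroup for a treated-vs-untreated difference in event rates.
# Anything past the (arbitrary) threshold is a hypothesis to test prospectively,
# not a finding in its own right.
candidates = []
for col in ["age_group", "sex"]:
    for level, sub in df.groupby(col):
        rates = sub.groupby("treated")["outcome"].mean()  # event rate by arm
        if len(rates) == 2:
            diff = rates[1] - rates[0]
            if abs(diff) > 0.05:
                candidates.append((col, level, round(diff, 3)))

print("Subgroups worth a focused prospective study:", candidates)
```

The point of the threshold is not statistical rigor; it is triage. The mining pass merely ranks which questions deserve the expense of a real interventional study.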

Don't get me wrong, I am happy that the Djulbegovics published this Commentary. It is really designed more as an appeal to policy makers, who, in their perennial search for one-size-fits-all panaceas, may misinterpret our zeal for data mining as the singular answer to all our questions. No indeed, hypothesis testing will continue. But using these vast repositories of data should make us smarter and more efficient at asking the right questions and designing the appropriate studies to answer them. And then generate further questions. And then answer those. And then... Well, you get the picture.
