Saturday, 2 April 2011

Invalidated Results Watch, Ivan?

My friend Ivan Oransky runs a highly successful blog called Retraction Watch; if you have not yet discovered it, you should! In it he and his colleague Adam Marcus document (with shocking regularity) retractions of scientific papers. While most of the studies are from the bench setting, some are in the clinical arena. One of the questions they have raised is what should happen with citations of these retracted studies by other researchers? How do we deal with this proliferation of oftentimes fraudulent and occasionally simply mistaken data?

A more subtle but no less difficult conundrum arises when cited papers are recognized to be of poor quality, yet are still used to build a defense of one's thesis. The latest case in point comes from the paper I discussed at length yesterday, describing the success of the Keystone VAP prevention initiative. And even though I am very critical of the data, I do not mean to single out these particular researchers. It is simply that, because I am intimately familiar with the literature in this area, I can judge what is being cited. I have seen similar transgressions from other authors, and I am sure that they are ubiquitous. But let me be specific.

In the Methods section on page 306, the investigators lay out the rationale for their approach (bundles) by stating that the "ventilator care bundle has been an effective strategy to reduce VAP..." As supporting evidence they cite references #16-19. Well, it just so happens that these are the references that yours truly had included in her systematic review of the VAP bundle studies, and the conclusions of that review are largely summarized here. I hope that you will forgive me for citing myself again:
A systematic approach to understanding this research revealed multiple shortcomings. First, since all of the papers reported positive results and none reported negative ones, there is a potential for publication bias. For example, a recent story in a non-peer-reviewed trade publication questioned the effectiveness of bundle implementation in a trauma ICU, where the VAP rate actually increased directionally from 10 cases per 1,000 MV days in the period before to 11.9 cases per 1,000 MV days in the period after implementation of the bundle (24). This was in contradistinction to the medical ICU in the same institution, which achieved a reduction from 7.8 to 2.0 cases per 1,000 MV days with the same intervention (24). Since the results did not appear in a peer-reviewed form, it is difficult to judge the quality or significance of these data; however, the report does highlight the need for further investigation, particularly focusing on groups at heightened risk for VAP, such as trauma and neurological critically ill (25).             
Second, each of the four reported studies suffers from a great potential for selection bias, which was likely present in the way VAP was diagnosed. Since all of the studies were naturalistic and none was blinded, and since all of the participants were aware of the overarching purpose of the intervention, the diagnostic accuracy of VAP may have been different before as compared to after the intervention. This concern is heightened by the fact that only one study reports employing the same team approach to VAP identification in the two periods compared (23). In other studies, although all used the CDC-NNIS VAP definition, there was either no reporting of or heterogeneity in the personnel and methods of applying these definitions. Given the likely pressure to show measurable improvement to the management, it is possible that VAP classification suffered from a bias. 
Third, although interventional in nature, naturalistic quality improvement studies can suffer from confounding much in the same way that observational epidemiologic studies do. Since none of the studies addressed issues related to case mix, seasonal variations, secular trends in VAP, and since in each of the studies adjunct measures were employed to prevent VAP, there is a strong possibility that some or all of these factors, if examined, would alter the strength of the association between the bundle intervention and VAP development. Additional components that may have played a role in the success of any intervention are the size and academic affiliation of the hospital. In a study of interventions aimed at reducing the risk of CRBSI, Pronovost et al. found that smaller institutions had a greater magnitude of success with the intervention than their larger counterparts (26). Similarly, in a study looking at an educational program to reduce the risk of VAP, investigators found that community hospital staff were less likely to complete the educational module than the staff at an academic institution; in turn, the rate of VAP was correlated with the completion of the educational program (27). Finally, although two of the studies included in this review represent data from over 20 ICUs each (20, 22), the generalizability of the findings in each remains in question. For example, the study by Unahalekhaka and colleagues was performed in the institutions in Thailand, where patient mix and the systems of care for the critically ill may differ dramatically from those in the US and other countries in the developed world (22). On the other hand, while the study by Resar and coworkers represents a cross section of institutions within the US and Canada, no descriptions are given of the particular ICUs with respect to the structure and size of their institutions, patient mix or ICU care model (e.g., open vs. closed; intensivists present vs. intensivists absent, etc.) (20). 
This aggregate presentation of the results gives one little room to judge what settings may benefit most and least from the described interventions. The third study includes data from only two small ICUs in two community institutions in the US (21), while the remaining study represents a single ICU in a community hospital where ICU patients are not cared for by an intensivist (23).  Since it is acknowledged that a dedicated intensivist model leads to improved ICU outcomes (28, 29), the latter study has limited usefulness to institutions that have a more rigorous ICU care model.
OK, you say, maybe the investigators did not buy into my questions about the validity of the "findings." Maybe not, but the evidence suggests otherwise. In the Discussion section on page 311 they actually say:
While the bundle has been published as an effective strategy for VAP prevention and is advocated by national organizations, there is significant concern about its internal validity.
And guess what they cite? Yup, you guessed it, the paper excerpted above. So, to me it feels like they are trying to have it both ways -- the evidence FOR implementing the bundle is the same evidence AGAINST its internal validity. Much like Bertrand Russell, I am not that great at dealing with paradoxes. Will this contradiction persist in our psyche, or will sense prevail? Perhaps Ivan and Adam need to start a new blog: Invalidated Results Watch. Oh? Did you say that peer review is supposed to be the answer to this? Right.  
    

Friday, 1 April 2011

Another swing at the windmill of VAP

Sorry, folks, but I have been so swamped with work that I have been unable to produce anything cogent here. I see today as a gift day, as my plans to travel to SHEA were foiled by Mother Nature's sense of humor. So, here I am trying to catch up on some reading and writing before the next big thing. To be sure, I have not been wasting time, but have completed some rather interesting analyses and ruminations, which, if I am lucky, I will be able to share with you in a few weeks.

Anyhow, I am finally taking a very close look at the much touted Keystone VAP prevention study. I have written quite a bit about VAP prevention here, and my diatribes about the value proposition of "evidence" in this area are well known and tiresome to my reader by now. Yet, I must dissect the most recent installment in this fallacy-laden field, where random chance occurrences and willful reclassifications are deemed causal of dramatic performance improvements.

So, the paper. Here is the link to the abstract, and if you subscribe to the journal, you can read the whole study. But fear not, I will describe it to you in detail.

In its design it was quite similar to the central line-associated bloodstream infection prevention study published in the New England Journal in 2006, and similarly the sample frame included Keystone ICUs in Michigan. Now, recall that the reason this demonstration project happened in Michigan is their astronomical healthcare-associated infection (HAI) rates. Just to digress briefly, I am sure you have all heard of MRSA; but have you heard of VRSA? VRSA stands for vancomycin-resistant Staphylococcus aureus, MRSA's even more troubling cousin, vancomycin being a drug to which MRSA is susceptible. Now, thankfully, VRSA has not yet emerged as an endemic phenomenon, but of the handful of cases of this virtually untreatable scourge that have been reported, Michigan has had a plurality. So, you get the picture: Michigan is an outlier (and not in the desirable direction) when it comes to HAIs.

Why is it important to remember Michigan's outlier status? Because of the deceptively simple yet devilishly confounding concept of regression to the mean. The idea is that in an outlier situation, at least some of the extreme value is due to chance. Therefore, if the performance of an extreme outlier is measured twice, the second measurement will tend to be closer to the population mean by chance alone, even if nothing has been done in between. But I do not want to get too deeply into this somewhat muddy concept right now -- I will reserve a longer discussion of it for another post. For now I would like to focus on some of the more tangible aspects of the study. As usual, two or three features of the study design substantially reduce the likelihood that the causal inference is correct.
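For readers who like to see regression to the mean in action, here is a minimal simulation sketch in Python. Everything in it is hypothetical: the function name, the number of units, the rates and the amount of noise are illustrative assumptions of mine, not data from the Keystone study. Each imaginary unit has a stable underlying rate plus random measurement noise; we pick the worst performers on the first measurement and simply measure them again, with no intervention at all.

```python
import random


def simulate_regression_to_mean(n_units=1000, population_mean=5.0,
                                true_sd=1.0, noise_sd=2.0,
                                worst_fraction=0.1, seed=0):
    """Return (first, second) mean measured rates for the worst units.

    All parameters are made-up illustrative values in arbitrary units.
    No intervention happens between the two measurements.
    """
    rng = random.Random(seed)

    # Each unit has its own stable underlying rate.
    true_rates = [rng.gauss(population_mean, true_sd) for _ in range(n_units)]

    # Each measurement is the true rate plus independent random noise.
    first = [rate + rng.gauss(0.0, noise_sd) for rate in true_rates]
    second = [rate + rng.gauss(0.0, noise_sd) for rate in true_rates]

    # Select the outliers: the units that looked worst the first time.
    k = max(1, int(n_units * worst_fraction))
    worst = sorted(range(n_units), key=lambda i: first[i], reverse=True)[:k]

    mean_first = sum(first[i] for i in worst) / k
    mean_second = sum(second[i] for i in worst) / k
    return mean_first, mean_second


if __name__ == "__main__":
    before, after = simulate_regression_to_mean()
    print(f"Worst units, first measurement:  {before:.2f}")
    print(f"Same units, second measurement: {after:.2f}")
```

Under these assumptions the second number should come out noticeably lower than the first, even though nothing was done to any unit between the two measurements. That "improvement" is what an uncontrolled before-and-after study of an outlier can mistake for a treatment effect.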

The first feature is the training period. Prior to the implementation of the protocol, which by the way consisted of the famous VAP bundle that we have discussed on this blog ad nauseam, there was intensive educational training of the personnel on a "culture of change", as well as on the proper definitions of the interventions and outcomes. It is at this time that the "trained hospital infection prevention personnel" were intimately focused on the definition of VAP that they were using. And even though the protocol states that the surveillance definition of VAP would not change throughout the study period, what are the chances that this intensified education and emphasis did not alter at least some of the classification practices?

Skeptical? Good. Here is another piece of evidence supporting my stance. A study by Michael Klompas of Harvard examined inter-rater variability in the assessment of VAP, looking at the same surveillance definition applied in the Keystone study (and many others). Here is what he wrote:
Three infection control personnel assessing 50 patients for VAP disagreed on 38% of patients and reported an almost 2-fold variation in the total number of patients with VAP. Agreement was similarly limited for component criteria of the CDC VAP definition (radiographic infiltrates, fever, abnormal leukocyte count, purulent sputum, and worsening gas exchange) as well as on final determination of whether VAP was present or absent.
And here is his conclusion:
High interobserver variability in the determination of VAP renders meaningful comparison of VAP rates between institutions and within a single institution with multiple observers questionable. More objective measures of ventilator-associated complication rates are needed to facilitate benchmarking and quality improvement efforts. 
Yet, the Keystone team writes this in their Methods section:
Using infection preventionists minimized the potential for diagnosis bias because they are trained to conduct surveillance for VAP and other healthcare-associated infections by using standardized definitions and methods provided by the CDC in its National Healthcare Safety Network (NHSN).
Really? Am I cynical to invoke circular reasoning here? Have I convinced you yet that VAP diagnosis is a moving target? And as such it can be moved by cognitive biases, such as the one introduced by the pre-implementation training of study personnel? No? OK, consider this additional piece from the Keystone study. The investigators state that "teams were instructed to submit at least 3 months of baseline VAP data." What they do not state is whether this was a retrospective collection or a prospective one, and this matters a little. First, retrospective reporting in this case would be a lot more representative of what has been, since these rates of VAP are already recorded for posterity and presumably cannot be altered. On the other hand, if the reporting is prospective, I can still conceive of ways to introduce a bias into this baseline measure. Imagine, if you will, that you are employed by a hospital that is under scrutiny for a particular transgression, and that you know the hospital will look bad if you do not demonstrate improvement following a very popular and "common-sense" intervention. Might you be a tad more liberal with identifying these transgressive episodes in your baseline period than after the intervention has been instituted? This is a subtle, yet all too real conflict of interest, which, as we know so well, can introduce a substantial bias into any study. Still don't believe me? OK, come to my office after school and we will discuss. In the meantime, let's move on.

The next nugget is in the graph in Figure 1, where VAP trends over the pre-specified time periods are plotted (you can find the identical graph in this presentation on slide #20). Look at the mean, rather than the median line. (The reason I want you to look at the mean is that the median is zero, and therefore not credible. Additionally, if we want to assess the overall impact of the intervention, we need to be embracing the outliers, which the median ignores.) What is tremendously interesting to me is that there is a precipitous drop in VAP during the period called "intervention", followed by much smaller fluctuations around the new mean across the subsequent time periods. This to me confirms the high probability that reclassification (and the Hawthorne effect), rather than an actual improvement in VAP rates, is the cause of the drop.
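To put some arithmetic behind the reclassification worry, here is a small Python sketch. The function name and every number in it are hypothetical illustrations of mine, not figures from the Keystone paper. It assumes the same number of suspected cases and the same number of ventilator days in both periods, and changes only the fraction of suspected cases that surveillance personnel end up labeling as VAP.

```python
def observed_vap_rate(suspected_cases, ventilator_days, fraction_called_vap):
    """Reported VAP cases per 1,000 ventilator days.

    fraction_called_vap is the share of suspected cases that the
    reviewers classify as VAP (a hypothetical knob, between 0 and 1).
    """
    if ventilator_days <= 0:
        raise ValueError("ventilator_days must be positive")
    if not 0.0 <= fraction_called_vap <= 1.0:
        raise ValueError("fraction_called_vap must be between 0 and 1")
    return 1000.0 * suspected_cases * fraction_called_vap / ventilator_days


if __name__ == "__main__":
    # Identical patients and exposure in both periods (made-up numbers);
    # only the readiness to call a suspected case "VAP" differs.
    baseline = observed_vap_rate(20, 2000, fraction_called_vap=0.6)
    post = observed_vap_rate(20, 2000, fraction_called_vap=0.3)
    print(f"Baseline period: {baseline:.1f} per 1,000 ventilator days")
    print(f"After training:  {post:.1f} per 1,000 ventilator days")
```

In this toy example the reported rate halves although not a single patient fared any differently. A shift of this size in classification is an assumption on my part, chosen only to show how a subjective definition can move the rate.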

Another piece of data makes me think that it was not the bundle that "did it." Figure 2 in the paper depicts the rates of compliance with all 5 of the bundle components in the corresponding time periods. Here, as in the VAP rates graph, the greatest jump in adherence to all 5 strategies is observed in the intervention period. However, there is still a substantial linear increase in this metric from the intervention period through the 25-27 month period. Yet, looking back at the VAP data, no commensurate reduction is observed over that same span. While this is somewhat circumstantial, it makes me that much more wary of trusting this study.

So, does this study add anything to our understanding of what bundles do for VAP prevention? I would say not, and it actually muddies the waters. What would have been helpful to see is whether any of the downstream outcomes, such as antibiotic administration, time on the ventilator and length of stay, were impacted. Without impacting these outcomes, our efforts are quixotic, merely swinging at windmills, mistaking them for a real threat.