Control in Clinical Trials

I am going to use diet comparison studies for examples throughout this post. In a previous post, A Matter of Control, I discussed the concept of control in experiments compared with the general English language meaning of the word. As used here, then, the word control says nothing about how well a particular study was implemented with respect to compliance, completeness of data, etc.

I do want to stress, however, that the "proper" usage of the term in clinical trial design can be rendered all but meaningless if the compliance cannot be properly assessed and verified. Without this, we have GIGO = Garbage In, Garbage Out. If you are studying the effects of a daily pill, you can have a perfect experiment, but if the subjects don't take that pill according to schedule, the outcome will always be shadowed with doubt.

Before I go on, I also want to make it clear that this is in no way intended to be a detailed discussion of statistical methods, etc. I do, however, want to expand on the concept and its various meanings in clinical trials.

Treatment Control in Experiments:

Let us look at a dietary weight loss study as an example, and just for fun, let's say I want to test the superiority of an 85% Fat Fast for weight loss. I could do an uncontrolled experiment and just find X number of people willing to follow the fat fast protocol for a month and measure their weights at various time points along the way.

If they lose weight, it could be due just to the fact that the diet is 85% fat, but I cannot draw any grand conclusions due to any number of possible confounding variables. A confounding variable is any factor other than the one being studied that might be responsible for an observed effect. Therefore, before I can shout from the rooftops that 85% fat melts away pounds, I would need to take into account what role other factors such as (but by no means limited to) caloric intake, protein intake and activity levels might have played in the outcome.

A controlled experiment is one in which a control group (or alternately the study group at different times, "crossover") receives the exact same treatment protocol with the only difference being the factor that is being investigated. The closer the "everything else but" is between the groups, the more "well controlled" the experiment is considered to be. In the fat fast example, good control would be achieved if the control group received the same calories and protein. Better control might be achieved by providing similar food-types such as shakes or muffins, and specifying an eating schedule. Theoretically, "complete control" can be exerted on this aspect of the experiment, though in practice it is all but impossible. Still, I always look to the methods section for indications of this control as a measure of the quality of the study.

Control of Subject Variables

I am going to make a distinction here that I don't always see made when discussing clinical trials, and that is treatment variables vs. subject variability. In an ideal world, as I just discussed, one can have virtually perfect control over treatment variables. Meanwhile, the heterogeneity of human populations makes that level of control over the subjects virtually impossible. Another way to look at this is that you can apply the same treatment to two people who appear to be quite similar in a number of variables, but who may well react differently based on some unknown factor that differs between them.

Therefore, in addition to treatment control, a well controlled experiment is one in which the control group is as similar as possible to the study group (or groups) in as many as possible of the variables that might influence the outcome. Some factors, by no means an all-inclusive list: gender, age, weight, body composition, race, reproductive status, socioeconomic status, occupation, health history, active disease, medications, activity, lifestyle, etc. Control in this area is achieved through subject selection and assignment.

As an aside, this is yet another aspect of animal studies that sometimes goes under-appreciated. Most are done in specific strains of animals whose characteristics have been bred for generations, are well characterized and are consistent within the strain or different between strains in predictable ways. This is rare to impossible in humans for practical and ethical reasons.

Control via Pre-Screening:

Just about every clinical trial has some pre-screening criteria which must be met in order to be considered as a participant. These are usually listed as inclusion and exclusion criteria. So for example, a study on the effect of diet on newly diagnosed middle aged diabetics may limit age to between 45 and 60 with a diagnosis based on XYZ criteria within the past 6 months. Out of that group may be excluded anyone with current unrelated health issues, those diagnosed with very high XYZ indicating longer standing diabetes possibly prior to diagnosis, those taking medications for glucose and/or lipids and/or blood pressure and/or anything else, smokers, excessive drinkers, etc.

Pre-screening can be a bit of a double-edged sword. On the one hand, extensive pre-screening can avoid results being questioned left and right due to potential differences in these confounders. On the other hand, the ultimate results would have more limited application to the general population -- a study in overweight post menopausal prediabetic women may offer sound information to someone fitting that description, but it offers little that is meaningful to the lean male twenty-something elite athlete.

There's clearly no right or wrong way to screen here, but it should definitely be given utmost consideration so as to generate data that will be applicable to a significant segment of the population while not being so lax as to render the whole exercise practically moot. Personally, I'd rather see 5 studies in groups of 30 with specific criteria than one study in a group of 150 ... I think there is much more to be learned from the former.

Control via Group Assignment

The "R" in RCT stands for randomized. It is said that the RCT is the "gold standard", but I'm going to challenge that. I know, I know. Who am I ... but hear me out! Even randomized can mean different things.

Below is Slide 73/76 from the PowerPoint slides that accompany what would appear to be the 11th edition of Elementary Statistics by Mario Triola. This specifically deals with how best to "control" for variability in the subjects at the level of assigning subjects to groups.

Triola's terms are not universally used nor are the explanations on this PPT slide very detailed. So, for example, what he calls Randomized Block Design I believe refers to what is called Stratified Randomization in this other reference shown here. (I can no longer get this page to load).

Nonetheless, I think these two "slides" give an idea of different strategies that can be employed on the subject end of the experimental design so as to effectively and best control for subject variability. The idea is to keep the random as much as possible but to force it when necessary. In the above example, we see that if gender and age are important, the resulting groups end up being well matched in terms of the numbers of each. Likely they broke the initial group up into the 4 categories and then did random assignments to A or B from each sub-group. A further advantage here is that subgroup analyses can be easily conducted.
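The "break the group into 4 categories, then randomize within each" procedure I guessed at above can be sketched in a few lines of Python. This is only my illustration of the general idea, not the method from either slide; the subject records, the field names (gender, age_band) and the function name stratified_assign are all made up for the example.

```python
import random
from collections import defaultdict

def stratified_assign(subjects, strata_keys, arms=("A", "B"), seed=None):
    """Randomly assign subjects to arms separately within each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for s in subjects:
        strata[tuple(s[k] for k in strata_keys)].append(s)
    assignment = {}
    for members in strata.values():
        rng.shuffle(members)  # the "random" part happens inside the stratum
        # Deal the shuffled members out to the arms in turn so each stratum is
        # split as evenly as possible; the random offset avoids always giving
        # any leftover subject to the first arm.
        offset = rng.randrange(len(arms))
        for i, s in enumerate(members):
            assignment[s["id"]] = arms[(i + offset) % len(arms)]
    return assignment

# Hypothetical pool: 40 people, 10 in each gender x age-band category.
combos = [(g, a) for g in ("F", "M") for a in ("<50", "50+") for _ in range(10)]
subjects = [{"id": i, "gender": g, "age_band": a} for i, (g, a) in enumerate(combos)]

assignment = stratified_assign(subjects, ("gender", "age_band"), seed=1)
```

Each of the four categories ends up split 5 and 5 between A and B, which is the "forced" part, while who lands in which arm within a category is still left to chance.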

It would appear to me -- and I have never been involved at this level of human studies so I really have no clue if there are standard protocols all must follow when they go to get approval for and register a clinical trial, or what -- but it would appear to me that "completely randomized", or something very close to that, is favored over all other approaches. And ... I'm not entirely sure why. Perhaps there are some reading this with more experience in these matters who can enlighten me as to why some of the other approaches aren't utilized more often.

In general, if you have sufficient sample sizes, then completely random assignment should serve to "control" for potential confounders in the subjects. The idea is that any factor -- let's say height -- whether suspected to matter or not -- would "average out" ... a tall person here, a short person there, a short person here, a tall person there ... and so it goes. In the strictest sense, the study has not been specifically designed to control for ANY of these variables by this method. In essence the hope is that, with the laws of probability on our side, randomization produces two (or more) groups that are similar in most if not all characteristics.
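To put the "average out" idea in concrete terms, here is a small simulation sketch. The heights are invented (drawn from a bell curve centered on 170 cm with a 10 cm spread), and the function names are mine; the point is only to see how the typical gap between two randomly formed groups behaves as the groups get bigger.

```python
import random
import statistics

def mean_height_gap(n_per_group, rng):
    """Randomly split 2*n hypothetical people; return |difference in mean height|."""
    heights = [rng.gauss(170, 10) for _ in range(2 * n_per_group)]  # cm, made up
    rng.shuffle(heights)  # completely random assignment
    group_a, group_b = heights[:n_per_group], heights[n_per_group:]
    return abs(statistics.fmean(group_a) - statistics.fmean(group_b))

def typical_gap(n_per_group, trials=500, seed=0):
    """Average gap over many repeated randomizations."""
    rng = random.Random(seed)
    return statistics.fmean(mean_height_gap(n_per_group, rng) for _ in range(trials))

for n in (15, 150, 600):
    print(n, "per group -> typical gap of", round(typical_gap(n), 2), "cm")
```

The gap shrinks as the groups grow, but only roughly with the square root of the group size, so it never hits zero, and any single randomization can land well above the typical value.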

I had a post here at the Asylum on randomizing and replication that I've temporarily (yet for the foreseeable future) reverted to draft and replaced with this one. In it I referenced a Chris Masterjohn blog post: When Standing At the Brink of the Abyss, Staring Into the Great Unknown, We Randomize. It's kinda funny how things go around, because in his discussion on the limits of randomizing and confounding variables, he discusses the LA Veterans study. In that study, involving almost 850 subjects, randomizing failed to equalize out smoking behaviors between the two groups. [Yeah, this is where I'm going with this and I'll be addressing this very soon in the Nina Teicholz, Shai'ster series.] Chris' major point in his blog was that if a known confounder of cardiovascular risk -- smoking -- was not "averaged out" by randomizing in such a relatively huge study, what other yet-to-be-identified confounders may be lurking within the more modestly populated clinical trials upon which we depend? It's a valid point.

It bothers me that I've seen some reasonably large studies where randomization failed to produce an equitable distribution in baseline measures. This is even more troublesome when those measures are designated as primary or secondary outcomes of the study! By that I mean if you are looking at an intervention to alter some lipid biomarker, it would be beneficial for the study groups to have as similar lipid profiles as possible to begin with.*

Then, as in Masterjohn's example and countless others, there are what I'd call the obvious confounders. Smoking, active disease, medications etc. being obvious here. Usually this is "controlled for" in the pre-screening so as not to be a factor. But obvious factors not screened out by exclusion should be pro-actively controlled. By that I mean, if randomization is used, check and see before proceeding that the groups did indeed "shake out" to be relatively equivalent in the various factors.
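One simple way to "check and see" is to compute, for each baseline measure, the difference in group means divided by the pooled standard deviation (a standardized mean difference). The sketch below is just one way to do that; the numbers are invented, and the 0.1 cutoff is a rule of thumb I am assuming for illustration, not a requirement from any protocol.

```python
import statistics

def standardized_mean_difference(a, b):
    """(mean of a - mean of b) divided by the pooled standard deviation."""
    pooled_sd = ((statistics.variance(a) + statistics.variance(b)) / 2) ** 0.5
    return (statistics.fmean(a) - statistics.fmean(b)) / pooled_sd

def flag_imbalances(baseline_a, baseline_b, threshold=0.1):
    """Return the baseline measures whose |SMD| exceeds the threshold."""
    flagged = {}
    for name in baseline_a:
        smd = standardized_mean_difference(baseline_a[name], baseline_b[name])
        if abs(smd) > threshold:
            flagged[name] = round(smd, 2)
    return flagged

# Hypothetical baseline data for two small groups after randomization.
group_a = {"age": [52, 55, 49, 58, 51, 54], "hdl_mg_dl": [38, 41, 36, 40, 39, 37]}
group_b = {"age": [53, 54, 50, 57, 52, 53], "hdl_mg_dl": [48, 51, 45, 50, 47, 49]}

print(flag_imbalances(group_a, group_b))
```

In this made-up case age is balanced and HDL is flagged. Unlike a p-value, this kind of measure does not look "fine" merely because the groups are small.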

There will always remain, as Chris put it, those factors in the abyss, but I believe if efforts are made to better ensure control over the knowns, these become less and less of a concern. Therefore, in this computer age, I don't see why, with all the thousands and millions of dollars being spent on these trials, first and foremost everything possible isn't being done to level the playing field at the subject level to begin with.

I keep feeling like I must be missing something obvious. I mean how about having the computer randomize "simply" a hundred times and employ an algorithm to determine a best matched (weighted for the more important factors) set of groups that was "stumbled upon" by this method? Then randomly assign those groups to treatment or control? Or establish narrow boundaries for a level playing field and have a computer randomize until it hits on a combination that meets those boundaries? I'm sure some brilliant minds can do better than that and develop an algorithm to optimize randomizing, but even my cruder suggestions should be acceptable. Simply randomizing once, and hoping for the best on the first shot doesn't seem to produce great starting points at times. It is not unheard of for there to be statistically significant differences at baseline for a measure that is part of the study outcomes. This should be a cue to re-randomize or apply more "rigor" to the process ... otherwise months, years and millions of dollars really are being wasted.
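Here is a minimal sketch of my two cruder suggestions: "randomize a hundred times and keep the best matched split" and "randomize until the split falls within set boundaries" (the general approach is sometimes called rerandomization). Everything in it is an assumption for illustration: the subject data are simulated, the weights and the 0.5 boundary are arbitrary, and the imbalance score (a weighted sum of standardized mean differences) is just one of many possible choices.

```python
import random
import statistics

def imbalance(group_a, group_b, weights):
    """Weighted sum of absolute standardized mean differences (lower is better)."""
    total = 0.0
    for var, w in weights.items():
        a = [s[var] for s in group_a]
        b = [s[var] for s in group_b]
        pooled_sd = ((statistics.variance(a) + statistics.variance(b)) / 2) ** 0.5
        total += w * abs(statistics.fmean(a) - statistics.fmean(b)) / pooled_sd
    return total

def random_split(subjects, rng):
    """One simple randomization into two groups of (nearly) equal size."""
    shuffled = subjects[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def best_of_n(subjects, weights, n=100, seed=None):
    """Randomize n times, keep the best matched split, then randomly label it."""
    rng = random.Random(seed)
    splits = (random_split(subjects, rng) for _ in range(n))
    best = min(splits, key=lambda g: imbalance(g[0], g[1], weights))
    groups = list(best)
    rng.shuffle(groups)  # which group gets the treatment is still random
    return {"treatment": groups[0], "control": groups[1]}

def randomize_until(subjects, weights, max_imbalance, max_tries=10_000, seed=None):
    """Keep randomizing until a split meets the boundary (or give up)."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        a, b = random_split(subjects, rng)
        if imbalance(a, b, weights) <= max_imbalance:
            return a, b
    raise RuntimeError("no acceptable split found; loosen the boundary")

# Simulated pool of 34 subjects; HbA1c is weighted as the more important factor.
pool_rng = random.Random(42)
subjects = [
    {"id": i, "hba1c": pool_rng.gauss(7.0, 0.8), "hdl": pool_rng.gauss(45, 10)}
    for i in range(34)
]
weights = {"hba1c": 2.0, "hdl": 1.0}

groups = best_of_n(subjects, weights, n=100, seed=7)
print(round(imbalance(groups["treatment"], groups["control"], weights), 3))
```

One caveat I am fairly sure of: once the allocation is restricted this way, the statistical analysis should take that restriction into account rather than treat the split as a single simple randomization.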

A Badly Controlled Study -

I'm obviously going somewhere with all this groundwork, and expect that post soon discussing the misuse of "rigorously controlled" and whether or not a variable is "controlled for" as concerns the Shai diet comparison trial and related trials cited by Nina Teicholz. But for now it occurs to me that quite recently I discussed an RCT that was a case study in misusing the "C" in RCT. That study would be Saslow et al., blogged on here, involving 34 diabetics, with some prediabetics, some taking various medications, etc. It is a perfect example of control gone wrong at essentially every turn. This was ostensibly a pilot study to assess the effect of two dietary interventions, along with education and coaching in behavioral techniques, on glycemic control and other diabetic risk factors. I won't rehash the entire blog post, but in terms of the topics discussed in this post, we have (not all inclusive lists):

Experimental Control Problems:

The instructional sessions for one group were led by the lead investigator involved in study design while the other group was led by an RD brought on board for this task.
The diets were not matched for any of the macros or calories.
Types of foods were not specified
Compliance for one group was carefully monitored through routine (daily to twice weekly) measurements while there was no equivalent for the other group.

Pre-screen Subject Control Problems:

Overweight and obese
Some pre-diabetics
Wide range of medications allowed and used
Some unmedicated

Problems with "Randomized Control" of Subject Variables: Some sort of block/stratified random allocation was apparently used (groups of 4) but this resulted in:

Differing gender makeup between groups
Differing racial/ethnic makeup between groups
Different proportions of prediabetics between groups
Differences in distribution of HbA1c between groups, where all but one subject in one group were at or below HbA1c = 7.0 at baseline while one-third in the other group were above 7.0
Similar types of differences, though not as dramatic, between groups for FBG and triglycerides
Impossible to even compare medications with different types and doses though distribution could possibly have been better.
Statistically significant differences in HDL at baseline

The thing about this study is that due to its relatively small size (16 + 18 = 34 subjects) and wide variation, none of the above differences (except HDL) rose to statistical significance for the means (or proportions). In other words, "on paper" and "by the numbers", these groups were "the same". This is an artifact of the small numbers, however, and a perfect example of where things should have been "forced". This is what I was talking about above at the blue asterisk *. There were statistically significant differences in an outcome measure (HDL), but the lack of significance in the other measures did not reflect the magnitude of the differences at baseline; the groups were simply too small for those differences to be detected statistically.
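To illustrate the small-numbers point, here is a sketch using Fisher's exact test on an HbA1c split like the one described above. The counts are my assumption for the sake of the example: I take "all but one" to mean 1 of 16 above 7.0 in one group and "one-third" to mean 6 of 18 in the other, and which group had 16 and which had 18 is a guess. The test is written out by hand with the standard library so nothing is hidden.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(k):
        # Hypergeometric probability of seeing k in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k) / comb(total, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    return sum(
        p for p in (prob(k) for k in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )

# Assumed counts: (above 7.0, at or below 7.0) for each group.
p_small_trial = fisher_exact_two_sided(1, 15, 6, 12)

# The very same proportions with ten times as many subjects.
p_large_trial = fisher_exact_two_sided(10, 150, 60, 120)

print(f"34 subjects: p = {p_small_trial:.3f}")
print(f"340 subjects: p = {p_large_trial:.2g}")
```

With these assumed counts the p-value for 34 subjects comes out around 0.09, so "not significant", while the identical proportions in a trial ten times the size would be overwhelmingly significant. The difference between the groups is the same in both cases; only the ability to detect it changes.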

I submit that there was room for improvement at almost every level of "control" here, and the failures render the results in that study all but meaningless.

"Observational Control" in Experiments, aka The Post Hoc Analysis

If a study is large enough and the data obtained are of good integrity, I want to point out that all is not necessarily lost if one has failed to control for a variable at the outset. From Wikipedia:

In the design and analysis of experiments, post-hoc analysis (from Latin post hoc, "after this") consists of looking at the data—after the experiment has concluded—for patterns that were not specified a priori. It is sometimes called by critics data dredging to evoke the sense that the more one looks the more likely something will be found.

While I can certainly understand and agree to large extent with the sentiment about data dredging, I disagree with the overall negative view regarding post hoc analyses. In a way, not doing any post hoc analysis could be viewed as anti-scientific at its core.

If an expected outcome occurs, would it not be prudent to see if other causal factors may have been missed with the current study design?

Often outcomes are designated as primary and secondary from the outset. This is arbitrary, really. Let's use two outcomes for a diet study: weight and FBG. If my dietary intervention alters FBG but not weight, is its "success" or usefulness altered by whether I have designated weight as the primary outcome and FBG secondary or vice versa?

Perhaps the data is there, but the primary researchers just never thought to look at it from a particular angle because that wasn't the focus of their study. Should it be considered nefarious to look at things a different way just because we didn't set out "a priori" to do so? I tend to believe the "accidental gem" studies provide some of the best evidence. By that I mean if investigators conduct an experiment to demonstrate A, but the data can be useful to support B (which may not even be related to A), then that data is most likely the purest of all. After all, it would be difficult for the researchers to inject any bias into something they didn't even consider because that wasn't their focus.

In observational studies, scientists "control" all the time. Data from 1000 subjects can be analyzed by dividing into any number of subgroups according to any number of variables and then compared within or between such groups. For example risk assessments are often done comparing the lowest and highest quartiles for a given variable in terms of the outcome in question.

In experiments there may not always be sufficient numbers to do detailed post hoc analyses, and yet, quite often the data is there and just never examined. If your data pool lacks integrity due to lack of compliance, what happens when you just look at the "good data" from those that did comply? If you didn't screen for smoking what happens when you control for smoking (as you would in an observational study) and look at the outcomes? Or what happens if smokers are just excluded from the analysis? Do the results change? All manner of "what ifs" can be looked into, indeed this is often addressed with apparent outliers.
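These "what ifs" amount to re-running the same comparison on different subsets of the data. A bare-bones sketch follows; the eight records are entirely made up, and the field names are hypothetical.

```python
import statistics

# Made-up trial records; weight_change is in kg over the study period.
records = [
    {"arm": "diet", "smoker": False, "compliant": True, "weight_change": -4.1},
    {"arm": "diet", "smoker": False, "compliant": True, "weight_change": -3.5},
    {"arm": "diet", "smoker": True, "compliant": False, "weight_change": -0.5},
    {"arm": "diet", "smoker": True, "compliant": True, "weight_change": -2.8},
    {"arm": "control", "smoker": False, "compliant": True, "weight_change": -1.0},
    {"arm": "control", "smoker": False, "compliant": True, "weight_change": -1.6},
    {"arm": "control", "smoker": True, "compliant": True, "weight_change": -0.2},
    {"arm": "control", "smoker": False, "compliant": False, "weight_change": -2.0},
]

def mean_difference(rows, outcome="weight_change"):
    """Mean outcome in the diet arm minus mean outcome in the control arm."""
    diet = [r[outcome] for r in rows if r["arm"] == "diet"]
    control = [r[outcome] for r in rows if r["arm"] == "control"]
    return statistics.fmean(diet) - statistics.fmean(control)

# The same comparison, run on the full data and on two "what if" subsets.
analyses = {
    "all subjects": records,
    "non-smokers only": [r for r in records if not r["smoker"]],
    "compliant only": [r for r in records if r["compliant"]],
}

for label, rows in analyses.items():
    print(f"{label}: n = {len(rows)}, difference = {mean_difference(rows):.2f} kg")
```

If the answer holds up across the subsets, that is reassuring; if it swings around, that is worth reporting too. The subsets are no longer protected by the original randomization, so I would treat them as checks rather than as the main result.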

It is easy to see how such an exercise could devolve into the dreaded data dredging, but rather than simply list limitations, many experimental reports could be made better if the analysis were done -- even if just to say "we did it and the numbers were insufficient to determine a darned thing to statistical significance". Then "I need a cool mill for further research" - grin.

R should stand for Robust!

[Image: tangentially relevant ....]

Rather than easily misinterpreted terms like "rigorously controlled" ....

Rather than tacking the "gold standard" label on R for Randomized ....

How about a new term. I like ROBUST.

We used to use that term when describing an analytical process that consistently yielded reliable results, even if every little step along the way wasn't perfect. There may well have existed another process to yield superior results, but that only did so when Jupiter aligned with Mars. Perhaps useful for some contexts, but useless for daily analyses and such. I think that this term can be applied to clinical trials.

A robust clinical trial is ...

... well designed and controlled at the experimental level, at a minimum for all known and/or obvious confounders.
... conducted on subject groups of as similar composition as possible -- e.g. controlled for subject variables -- especially where outcome measures are concerned.
... one where adherence to the protocol is monitored and verified, and/or there is some measure of the degree of adherence assessed.

If you strive for a robust study model, then chances are the study will yield meaningful results even if things deviate from plan here and there a bit along the way.

If you are doing secondary research -- e.g. combing the peer review literature as I do these days -- look for robust clinical trials. They are few and far between, but they are out there.

Lastly ... If Nina Teicholz calls a trial well or rigorously controlled, you can pretty much bet it's not. Until the next installment ... Tei¢holz, Shai'ster! ... later peeps.

Comments

MacSmiley said…

I tend to believe the "accidental gem" studies provide some of the best evidence. By that I mean if investigators conduct an experiment to demonstrate A, but the data can be useful to support B (which may not even be related to A), then that data is most likely the purest of all.

Do you consider Fleming's Flow Mediated Dilation study which included vegetarians who had unexpectedly switched to an Atkins diet one of those "accidental gems"??

June 12, 2014 at 1:19 PM

The Carb-Sane Asylum