Science Made Simple ~ Of Causation & Correlation

Summary:

If A and B are significantly correlated, then:

  • A MAY cause B, or
  • B MAY cause A, or
  • Any number of other factors could correlate with A and B such that these two variables "track" together.

If A and B are not correlated, then it is exceedingly improbable that any of the above are true.
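The third bullet (a lurking common cause) is easy to simulate. In this hypothetical Python sketch -- the variable names and noise levels are arbitrary, chosen only to make the point -- a hidden factor C drives both A and B, which then correlate strongly even though neither causes the other:

```python
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(0)
# A hidden common cause C drives both A and B; neither causes the other.
c = [random.gauss(0, 1) for _ in range(5000)]
a = [ci + random.gauss(0, 0.5) for ci in c]
b = [ci + random.gauss(0, 0.5) for ci in c]
print(round(pearson_r(a, b), 2))  # strong correlation, zero A->B causation
```

The observed correlation here is entirely an artifact of C; nothing about A acts on B at all.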

Shorter Summary:

Correlation does not equal causation, but without correlation, there's no causation.






We often hear that a true scientist devotes their life to falsifying or disproving their hypothesis.  In reality, such endeavors would be folly: one could spend an entire career disproving one hypothesis/mechanism after another, and ultimately this may still lend little if any support to a favored hypothesis/mechanism.  This is because scientific hypotheses rarely exist in competing pairs, where either one is correct or the other, with no third (fourth, fifth, etc.) option available.  When one talks about a null hypothesis and an alternative hypothesis, these are statistical terms.  Scientific hypotheses are a different animal from statistical hypotheses.  I wrote about this several years back.

The whole science of biomarkers is based on observations.  How do biomarkers come about?  If we are seeking a predictor of B, we look at some measure A that correlates strongly with B.  It bears remembering that this is what Ancel Keys was doing with his "other" Minnesota Study.  In order to establish a biomarker -- which does not establish causality! -- the following MUST be true:
  • Levels of A differ between treatment groups and/or populations, AND
  • Levels of B differ between treatment groups and/or populations, AND
  • The differences are correlated
Needless to say, the differences and/or correlations would need to rise to the level of statistical significance.  If either of the first two bullet points is not satisfied, the third is impossible.  If there are no differences in A despite a spread of B values, then A is not even correlated with B.  The same goes for any case where there is a range of variation in A with relative consistency in B.
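That impossibility can be shown with made-up numbers: when A has no variation, its covariance with any B is exactly zero, so no correlation -- and hence no biomarker relationship -- can be detected:

```python
import random

random.seed(42)
a = [5.0] * 100                               # candidate biomarker A is flat across all subjects
b = [random.gauss(0, 1) for _ in range(100)]  # outcome B varies freely

ma = sum(a) / len(a)
mb = sum(b) / len(b)
# Every (x - ma) term is zero, so the covariance is exactly zero.
cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)
print(cov)  # 0.0
```

Pearson's r is formally undefined here (it would divide by A's zero variance), which is just another way of saying there is nothing to correlate.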

If one has observational data wherein either of the first two bullet points is not satisfied, it would be mighty unscientific to hypothesize over some correlation between A & B, let alone any sort of causal one.  It would be an even mightier waste of time, money and effort to construct and execute experiments to test your hypothesis.  Yet this is the case with a lot of the sugar consumption vs. metabolic data ... but I digress.

In the absence of concrete observations of A or B, a reasonable scientific scenario, based on a feasible mechanism, could still be warranted.  For example, if one had no data regarding the number of accidental hits to the bodies of construction workers, but had data regarding a wide variation in the incidence of bruising in the construction worker population, it would be reasonable to hypothesize that there is a correlation between such accidental hits and bruising.  One could do an observational study wherein the numbers of hits and bruises are collected, or one could even do something more definitive of causality, which would be to follow up on every "hit" to determine whether bruising occurred.  This probably sounds like a ridiculous experiment, but extend me some license here as I'm just trying to make a point.  By contrast, if one observed a variation in acne in construction workers but had no data to compare accidental hits and acne incidence, it would seem an absurd leap to propose that getting hit by 2x4's was related to acne.

So we do see a number of studies of the hit-bruise variety in nutrition and other medical research: some incidental observations and/or feasible mechanistic relationships between A & B can warrant at least a small-scale experiment to tease out whether there is anything more there.  If that fails to bear fruit, the "scientific" thing to do would be to abandon the hypothesis ... start looking for other mechanisms, etc., that fit the observations.  Realize this: the results of experiments almost always add to our observational knowledge, regardless of whether they were favorable or conclusive for the hypothesis being tested.

If you test A causes B, and cannot even establish that A correlates with B, it's time to look elsewhere.



Comments

Erik Arnesen said…
However, one must bear in mind the biases that often lead to false zero-correlations. As Skeaff & Miller wrote in their review on dietary fat and CHD:

«The null results of the observational studies of dietary lipids and CHD do not negate the importance of the underlying associations, but reflect the combined effects of limitations of dietary assessment methods, inadequate numbers of participants studied and the prolonged follow-up of individuals.»
Bris Vegas said…
The problem is that medical researchers try to apply experimental methods designed for the physical sciences. The reality is that the so-called 'Gold Standard' of double-blinded placebo-controlled clinical trials only works when the treatment effect is so strong (eg antibiotics to treat pneumonia) that variables have little effect on outcomes. In most (almost all, according to Dr John Ioannidis) clinical trials the results are likely to be nothing more than statistical anomalies. This means valid treatments are ignored and worthless treatments get adopted.
carbsane said…
Good point Erik ... I think the problem with meta-analyses is that the pooled studies are often too dissimilar in methodology to combine results. Looking at that particular study, we also have a situation where a full range of variation in A is not examined. The lowest fat intake was 23%, and "low fat" ranged from 23-30% vs 38-47% for "high fat". Such ranges are irrelevant to those who contend that 20% fat (or, gasp, less!) is unhealthy and promote 50-80% fat diets as healthy for all.


Taking your quotation into account -- and I would stress study-length limitations as well -- I think it calls into question whether the experiment is sufficient to test the hypothesis, because most health issues take years, often decades, to develop.
StellaBarbone said…
I think your point about time frame is particularly relevant. We often look at surrogate end-points in studies because looking at what we are really interested in creates an overly complex, long, or expensive test. Meta-analyses then almost always lump studies with varying end-points into one pool and the noise really swamps the signal.
LWC said…
Or the experimental design can't measure what they're trying to test.

This is an old paper from 1979 (http://www.ncbi.nlm.nih.gov/pubmed/313701) that was referenced in a video by Dr Greger (a vegan MD whose answer to just about everything is "eat more plants" -- just to be upfront about the inherent bias in the analysis that I am referencing). Have you read it? The link is to the abstract only, which is all I can access.

Greger's point in his video (http://nutritionfacts.org/video/the-saturated-fat-studies-set-up-to-fail/) was that because everyone's baseline cholesterol is genetically different, cross-sectional, observational population studies looking at cholesterol and saturated fat intake are meaningless. It's the change in cholesterol from where you start that matters. The recommendation to lower saturated fat is based on metabolic ward studies where baseline cholesterol level was established for each subject, and then diet changed. In all of those studies, lower saturated fat intake lowered cholesterol level. In fact, there's a linear equation that predicts how cholesterol will change with saturated fat intake.
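The linear equation alluded to here is commonly identified with the Keys equation. Purely as an illustration -- that attribution, and the coefficients below, are my assumption, not something stated in the comment -- it can be sketched as:

```python
import math

def keys_delta_cholesterol(d_sat, d_poly, chol_start, chol_end):
    """Predicted change in serum total cholesterol (mg/dL), Keys-equation form.

    d_sat, d_poly: change in % of calories from saturated / polyunsaturated fat.
    chol_start, chol_end: dietary cholesterol before/after, in mg per 1000 kcal.
    """
    d_z = math.sqrt(chol_end) - math.sqrt(chol_start)
    return 1.35 * (2 * d_sat - d_poly) + 1.5 * d_z

# Dropping saturated fat from 18% to 8% of calories, all else held equal:
print(keys_delta_cholesterol(-10, 0, 300, 300))  # -27.0 mg/dL predicted
```

Note how the equation predicts a *change* from each subject's own baseline, which is exactly the point being made about metabolic ward studies versus cross-sectional comparisons.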



If this is true, it's not the time frame that's relevant. It's the experimental design.