2022-04-13

Correlation and Causation

What is Correlation

Correlation is a statistical measure that describes the strength and direction of the relationship between two variables. The correlation coefficient, often denoted by r, quantifies this relationship and ranges from -1 to +1. A coefficient close to +1 indicates a strong positive correlation, meaning the variables tend to move in the same direction. A coefficient close to -1 indicates a strong negative correlation, meaning the variables tend to move in opposite directions. A coefficient of zero indicates no linear relationship between the variables, although a nonlinear relationship may still exist.
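As a minimal sketch of how r behaves, the snippet below computes the Pearson correlation coefficient by hand for three small made-up data sets. The function name `pearson_r` and all of the numbers are illustrative assumptions, not part of the original article.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the summed squared deviations.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

hours = [1, 2, 3, 4, 5]
same_direction = [2, 4, 6, 8, 10]      # rises as hours rise
opposite_direction = [10, 8, 6, 4, 2]  # falls as hours rise
u_shaped = [4, 1, 0, 1, 4]             # equals (hours - 3) ** 2, not linear

r_pos = pearson_r(hours, same_direction)      # +1: perfect positive correlation
r_neg = pearson_r(hours, opposite_direction)  # -1: perfect negative correlation
r_u = pearson_r(hours, u_shaped)              # 0: no linear relationship
```

The third case is worth noting: `u_shaped` is completely determined by `hours`, yet r is zero, because r only measures linear association.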

Correlation is an essential tool in statistics as it quantifies the degree to which two variables are related. However, it's crucial to remember that correlation does not imply causation. It cannot explain why the variables move together, only that they do.

What is Causation

Causation, or causality, is a far more complex concept. It refers to the cause and effect relationship between variables. Establishing causation means proving that a change in one variable causes a change in another.

To infer causation, three criteria must typically be met:

  • The cause must occur before the effect (temporal precedence).
  • The cause and effect must be statistically correlated.
  • There must be no plausible alternative explanation for the effect other than the cause (non-spuriousness).

Causation is much harder to establish than correlation. It often requires carefully designed experiments that control for potential confounding variables, that is, third variables that influence both the presumed cause and the presumed effect and so create a spurious correlation.
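Why a designed experiment helps can be sketched with a small simulation. In the toy model below, a hypothetical confounder called `health` drives the outcome, while the treatment itself has no effect at all. When people choose the treatment themselves, the treated group still looks better; when treatment is assigned by coin flip, the gap disappears. All names and numbers are assumptions made up for illustration.

```python
import random

random.seed(0)
N = 20000

def mean(values):
    return sum(values) / len(values)

def outcome_gap(randomize):
    """Mean outcome of treated minus untreated; the true treatment effect is 0."""
    treated, untreated = [], []
    for _ in range(N):
        health = random.gauss(0, 1)  # hypothetical confounder
        if randomize:
            # Experiment: a coin flip decides who is treated.
            takes_treatment = random.random() < 0.5
        else:
            # Observational data: healthier people tend to opt in.
            takes_treatment = health + random.gauss(0, 1) > 0
        # The outcome depends on health only, never on the treatment.
        outcome = health + random.gauss(0, 1)
        (treated if takes_treatment else untreated).append(outcome)
    return mean(treated) - mean(untreated)

observational_gap = outcome_gap(randomize=False)  # clearly above zero
randomized_gap = outcome_gap(randomize=True)      # close to zero
```

Random assignment breaks the link between the confounder and the treatment, which is why the second gap shrinks toward the true effect of zero.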

Differences Between Correlation and Causation

Correlation and causation differ primarily in what they can tell us about the relationship between variables. Correlation tells us that two variables change together, but it does not tell us why. Causation tells us not only that two variables change together, but also that changes in one variable lead to changes in the other.

To understand this concept, consider the well-known saying in statistics: "correlation does not imply causation." This means that just because two variables are correlated does not mean that one variable is causing the changes in the other.

There might be a third variable causing changes in both variables, or the correlation might be entirely coincidental. For example, ice cream sales and shark attacks in a given area may be highly correlated. But this does not mean that buying ice cream causes shark attacks or vice versa. Instead, a third variable, such as warm weather, might be influencing both.
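The ice cream example can be sketched with made-up data. In the simulation below, temperature drives both series and neither affects the other. The raw correlation is strong, but once the part explained by temperature is removed from each series (a simple partial correlation via straight-line residuals), almost nothing is left. The coefficients, noise levels, and helper names are assumptions chosen for illustration, and the "shark attacks" values are continuous toy numbers rather than real counts.

```python
import math
import random

random.seed(1)

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / spread

def residuals(ys, xs):
    """Remove the part of ys explained by a straight-line fit on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]

# Made-up daily data: temperature drives both series independently.
temperature = [random.uniform(10, 35) for _ in range(2000)]
ice_cream_sales = [20 * t + random.gauss(0, 60) for t in temperature]
shark_attacks = [0.3 * t + random.gauss(0, 1.5) for t in temperature]

raw_r = pearson_r(ice_cream_sales, shark_attacks)
partial_r = pearson_r(residuals(ice_cream_sales, temperature),
                      residuals(shark_attacks, temperature))
```

Here `raw_r` is large while `partial_r` is close to zero, which is the signature of a correlation produced by a shared cause. This only works when the confounder has been measured; an unmeasured one cannot be adjusted away like this.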

Misinterpretations and Misuses of Correlation and Causation

Misinterpretations and misuses of correlation and causation occur when individuals overlook the crucial distinction between these two concepts. The consequences can range from benign misunderstandings to potentially harmful mistakes in various fields, including public health, economics, and policy-making.

Common Misinterpretations

One common misinterpretation is to infer causation from correlation. For instance, if there's a positive correlation between children's shoe sizes and their reading skills, it would be incorrect to conclude that having bigger feet causes children to read better. In reality, a third factor, age, influences both shoe size and reading ability.

Misuses in Daily Life

Another frequent misuse is to act on an assumed causal relationship that may not exist. For instance, a business may notice that its highest-earning months are those with the highest advertising spend. However, it would be a mistake to automatically increase advertising expenditure without considering other factors, like holiday seasons, which could be driving both higher sales and higher advertising spend.

Misuses in Media and Policy-making

Media and policy-making are not immune to such misinterpretations. For example, a news report might highlight a study that found a correlation between a specific diet and lower risk of a particular disease, implying that the diet directly reduces disease risk. Such an inference could be misleading if the study didn't account for confounding factors, such as participants' overall lifestyle or genetic predispositions.

Pseudo Correlation

Pseudo correlation refers to an apparent relationship between two variables which, upon closer inspection, turns out to be spurious or coincidental. Such a correlation arises from confounding variables or random chance rather than from any meaningful relationship.

The concept rests on the principle that correlation does not imply causation. When two variables appear to be correlated, it may be tempting to assume that changes in one variable cause changes in the other. However, without a rigorous statistical or experimental investigation, that assumption can turn a pseudo correlation into a causal claim that is misleading at best and incorrect at worst.

Examples of Pseudo Correlation

A classic example of a pseudo correlation is the relationship between the number of films Nicolas Cage starred in during a particular year and the number of people who drowned by falling into a pool in the same year. While there appears to be a correlation between these two variables, it is purely coincidental.

Another example is the correlation between the use of Internet Explorer and murder rates in the US. These two variables appeared to be strongly correlated over a certain period, but it is illogical to assert that using a specific web browser could influence crime rates.
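How easily coincidences like these arise can be sketched with a simulation. The snippet below generates one short "yearly" series and 1000 other series that are unrelated to it by construction, then looks for the strongest correlation among them. The series length, the number of candidates, and the variable names are assumptions picked for illustration; no real data is used.

```python
import math
import random

random.seed(2)

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / spread

YEARS = 10
# One target series, e.g. a yearly count of something.
target = [random.gauss(0, 1) for _ in range(YEARS)]
# 1000 candidate series generated independently of the target.
candidates = [[random.gauss(0, 1) for _ in range(YEARS)] for _ in range(1000)]

correlations = [pearson_r(target, series) for series in candidates]
strongest = max(correlations, key=abs)  # the best "match" found by searching
share_above_half = sum(abs(r) > 0.5 for r in correlations) / len(correlations)
```

With only ten data points per series, a noticeable share of purely random series correlate with the target at |r| > 0.5, and the strongest match looks impressive. Searching through enough unrelated variables will nearly always turn up a striking but meaningless correlation.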

Implications of Pseudo Correlation

Pseudo correlations can lead to false conclusions and misguided decisions if not identified. For example, in business, a company might see a correlation between a marketing campaign's launch and an increase in sales and assume that the campaign caused the sales increase. However, if the sales increase was actually due to a different factor, such as a seasonal trend, the company could make poor future marketing decisions based on this pseudo correlation.

Identifying and Avoiding Pseudo Correlation

The best way to avoid pseudo correlation is to approach any correlation with a healthy dose of skepticism and critical thinking. Correlations should be used as starting points for further investigation rather than ends in themselves. Experimental research, controlling for confounding variables, and using statistical tests can help distinguish genuine relationships from pseudo correlations.
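One possible statistical test is a permutation test, sketched below under simple assumptions. It shuffles one variable many times to see how often chance alone produces a correlation at least as strong as the observed one. The function name `permutation_p_value`, the sample size, and the data are all hypothetical. Note that this kind of test only addresses random chance; it does not rule out confounding.

```python
import math
import random

random.seed(3)

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / spread

def permutation_p_value(xs, ys, n_permutations=2000):
    """Share of shuffles whose |r| is at least as large as the observed |r|."""
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        random.shuffle(shuffled)  # destroys any real pairing between xs and ys
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

x = [random.gauss(0, 1) for _ in range(40)]
related = [xi + random.gauss(0, 1) for xi in x]      # built to depend on x
unrelated = [random.gauss(0, 1) for _ in range(40)]  # independent of x

p_related = permutation_p_value(x, related)
p_unrelated = permutation_p_value(x, unrelated)
```

A small p-value, as for `related`, suggests the correlation is unlikely to be a chance artifact. For `unrelated`, the p-value can land anywhere between 0 and 1, since there is no real relationship to detect.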

Ryusei Kakujo


Focusing on data science for mobility
