External Publication

Thinking Clearly about Association Studies (Risk Factors and Causal Salad included)

Datamethods Discussion Forum [Unofficial] March 29, 2026

besttd:

Can there be something like an “association study” that does not have a descriptive, predictive, or causal intent? Can there be any study at all that does not have one of the three goals?

Kezios went and looked, which is more than most critics of this practice have done. In a random sample of 100 observational epidemiologic studies, 13% stated explicitly causal aims and 69% used associational framing while clearly pursuing a causal question. Two were predictive. Two were descriptive. Fifteen had goals so vague that classification was impossible. The “association study” as a distinct scientific category is essentially a fiction; in practice, it’s nearly always a causal study that has decided not to say so.

Her term for the dominant mode is “seemingly causal.” A specific exposure-disease relationship is front and center, but the aims are written as “to examine the independent association between X and Y” or “is X a risk factor for Y?” This isn’t a genuine fourth epistemological category sitting alongside description, prediction, and causation. It’s a rhetorical posture: causal intent with the causal accountability stripped out. The authors want you to read the finding causally. They just don’t want to be held to the standards that causal inference requires.

The alignment data make this concrete. Only 9% of studies achieved full alignment of goal, methods, and interpretation. But among explicitly causal studies the rate was 38%, versus 4% for the seemingly causal ones. Stating your goal in causal terms (effect, cause, intervention) more than triples your probability of actually executing the analysis correctly. The associational framing isn’t a neutral stylistic choice. It’s operationally corrosive. When you don’t name the causal question, you don’t build the DAG, you don’t specify the sufficient adjustment set, and you don’t feel obligated to defend your identifiability assumptions, because officially you never made any.

Among studies with unclear goals and adequately reported methods, 91% used an outcome-focused variable selection strategy, selecting covariates based on their association with Y alone, and every single one of them went on to interpret or discuss coefficients inappropriately. This is the “known risk factors for the outcome” heuristic and its companion, the events-per-variable (EPV) rule, operating exactly as trained. It feels principled. It produces a tidy multivariable table with p-values and confidence intervals. What it actually does is ignore the exposure-confounder relationship entirely, create systematic risk of mediator and collider inclusion, and generate effect estimates whose bias is uncharacterized and unacknowledged. The analysis looks like causal inference. It satisfies none of its requirements.
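To make the mediator and collider problem tangible, here is a minimal simulation sketch. Nothing in it comes from the Kezios paper: the DAG (Z confounds X and Y, M mediates, C is a collider of X and Y), the coefficients, and the helper names `ols` and `exposure_coefficients` are all assumptions chosen for illustration. The true total effect of X on Y is set to 1.0. Both M and C are strongly associated with Y, so an outcome-focused screen would happily select them.

```python
import random

def ols(y, columns):
    """Least-squares coefficients of y on an intercept plus the given columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for p in range(k):
        best = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[best] = A[best], A[p]
        for r in range(k):
            if r != p:
                f = A[r][p] / A[p][p]
                A[r] = [a - f * b for a, b in zip(A[r], A[p])]
    return [A[r][k] / A[r][r] for r in range(k)]

def exposure_coefficients(n=20000, seed=1):
    """Coefficient on X under four adjustment sets (hypothetical DAG)."""
    rng = random.Random(seed)
    z, x, m, y, c = [], [], [], [], []
    for _ in range(n):
        zi = rng.gauss(0, 1)                   # confounder: Z -> X, Z -> Y
        xi = 0.8 * zi + rng.gauss(0, 1)        # exposure
        mi = xi + rng.gauss(0, 1)              # mediator: X -> M -> Y
        # Direct effect 0.5 plus 0.5 via M gives a total effect of 1.0.
        yi = 0.5 * xi + 0.5 * mi + 0.8 * zi + rng.gauss(0, 1)
        ci = xi + yi + rng.gauss(0, 1)         # collider: X -> C <- Y
        z.append(zi); x.append(xi); m.append(mi); y.append(yi); c.append(ci)
    # X is always the first column, so its coefficient sits at index 1.
    return {
        "Z": ols(y, [x, z])[1],          # sufficient adjustment set
        "Z+M": ols(y, [x, z, m])[1],     # mediator added
        "Z+C": ols(y, [x, z, c])[1],     # collider added
        "Z+M+C": ols(y, [x, z, m, c])[1] # everything associated with Y
    }

results = exposure_coefficients()
for label, b in results.items():
    print(f"adjusting for {label:6s} -> coefficient on X = {b:+.2f}")
```

Under this assumed data-generating process, adjusting for Z alone recovers something close to the total effect of 1.0. Adding the mediator pulls the coefficient toward the direct effect of 0.5, and adding the collider pushes it below zero. Every one of these models would produce a clean multivariable table; only the one built from the DAG answers the causal question.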

I review manuscripts in veterinary medicine, where this methodology is essentially the default, and the Kezios numbers are a reasonable lower bound on the problem. Her sample came from the top five general epidemiology journals, in fields with far more methodological infrastructure and causal reasoning tradition than veterinary clinical research. If 91% of unclear-goal studies in those journals used outcome-focused variable selection and then over-interpreted the results, the corresponding figure in equine or small animal observational literature is unlikely to be lower.

Associationstudies.pdf (680.0 KB)
