Thinking Clearly about Association Studies (Risk Factors and Causal Salad included)

Datamethods Discussion Forum [Unofficial] March 30, 2026

ESMD:

The circular logic required to answer this question always does my head in. If everyone agrees that DAG-free studies touting “associations” are insufficient for causal inference, how can we then justify using these same studies as “evidence” to inform construction of a DAG for a subsequent observational study on the same topic (??)

This has also been confusing and bothering me for a long time. We use expert consensus to draw a DAG, but that consensus is itself based on poor studies and on the clinical experience of us biased humans.

I think it’s helpful to keep in mind that causal inference with DAGs is deductive. We try to agree on some theory (the DAG), and if we believe that theory to be true, our conclusions should be true as well. You could have a very different theory than @Pavlos_Msaouel and thus not believe the conclusions. I think that DAGs serve mostly a bureaucratic function (as does much of statistics, imo). They allow stakeholders to speak a common language and agree on some rules that guide decisions.

besttd:

Am I correct in reading f2harrell’s characterization of descriptive studies as: A descriptive study can hardly include a multivariable model of Y? (exceptions to this clunky rule are granted)

I’d also be very interested in @f2harrell’s view here. Personally, I’m on the fence. It might still be useful to make ceteris paribus comparisons, which don’t necessarily have a causal interpretation. I could, for example, be interested in whether men have a larger VO2max than women, but want to compare men and women of equal height and weight.

I could model \(y \sim \text{gender} * (\text{height} + \text{weight})\) and use that to predict some contrasts for men and women of different heights and weights. I think the interpretation of these results can still be descriptive, i.e. “men and women of equal height and weight have different VO2max”, vs. “being male causes higher VO2max”. Happy to be convinced otherwise though.
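To make the idea concrete, here is a minimal sketch of that model in plain Python, fitted by ordinary least squares on simulated data. Everything in it is an assumption for illustration only: the coefficients, the VO2max units (L/min), the centering values, and the helper names (`design_row`, `fit_ols`, `sex_contrast`) are invented, and in practice one would use a proper regression package rather than hand-rolled normal equations. The point is only that the output is a set of model-based contrasts at fixed height and weight, which can be read descriptively.

```python
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def design_row(male, height, weight):
    # Columns of y ~ gender * (height + weight): intercept, gender,
    # height, weight, gender:height, gender:weight.
    # Height and weight are centered at arbitrary (assumed) values.
    h, w = height - 170.0, weight - 70.0
    return [1.0, male, h, w, male * h, male * w]

def fit_ols(rows, y):
    # Least squares via the normal equations (X'X) beta = X'y.
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

def predict(beta, male, height, weight):
    return sum(b * x for b, x in zip(beta, design_row(male, height, weight)))

def sex_contrast(beta, height, weight):
    # Predicted male minus female VO2max at the SAME height and weight.
    return predict(beta, 1.0, height, weight) - predict(beta, 0.0, height, weight)

# Simulated data: every number below is invented, not taken from any study.
rng = random.Random(1)
TRUE_BETA = [2.5, 0.6, 0.02, 0.01, 0.005, 0.002]
rows, y = [], []
for _ in range(400):
    male = float(rng.random() < 0.5)
    height = rng.gauss(165 + 12 * male, 7)   # cm
    weight = rng.gauss(62 + 14 * male, 10)   # kg
    row = design_row(male, height, weight)
    rows.append(row)
    y.append(sum(b * x for b, x in zip(TRUE_BETA, row)) + rng.gauss(0, 0.3))

beta_hat = fit_ols(rows, y)
for height, weight in [(165, 60), (175, 75), (185, 90)]:
    diff = sex_contrast(beta_hat, height, weight)
    print(f"{height} cm, {weight} kg: male - female VO2max = {diff:.2f} L/min")
```

Because of the interaction terms, the contrast is allowed to differ across the height and weight grid, so what gets reported is a small table of conditional differences rather than a single "effect of gender".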

I’d appreciate some concrete input on the following:

- We are interested in the pretest probability of a certain biomarker, based on some clinical variables.
- Our sample size is too low for a true prediction model, and the study won’t get larger.
- There’s no literature on which clinical variables might be causal / predictive.
Is it really better to not try to identify (through careful multivariable modeling strategies as taught in RMS) some variables that might be important to inform future data collection? Sure, our power is low, so we might miss some, but you still have to start somewhere

For example, we might find that radiotherapy appears to be an important predictor. This could be because of radiotherapy itself or because people who receive radiotherapy have some other important features. As long as we clearly acknowledge this, isn’t this at least worth knowing and potentially exploring?
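A toy simulation of that second possibility, as a sketch only: here a hypothetical third variable (called "advanced stage" purely for illustration) makes both radiotherapy and a positive biomarker more likely, while radiotherapy itself has no effect on the biomarker. All probabilities are invented.

```python
import random

def risk(flags):
    # Proportion of biomarker-positive patients in a group.
    return sum(flags) / len(flags)

def risk_difference(records, stratum=None):
    # Biomarker risk with radiotherapy minus risk without it,
    # optionally restricted to one level of the third variable.
    treated = [b for s, rt, b in records
               if rt and (stratum is None or s == stratum)]
    untreated = [b for s, rt, b in records
                 if not rt and (stratum is None or s == stratum)]
    return risk(treated) - risk(untreated)

# Simulated patients; the mechanism and all numbers are assumptions.
rng = random.Random(7)
records = []
for _ in range(20000):
    advanced = rng.random() < 0.5
    radiotherapy = rng.random() < (0.8 if advanced else 0.2)
    # The biomarker depends on stage only, not on radiotherapy.
    biomarker = rng.random() < (0.6 if advanced else 0.2)
    records.append((advanced, radiotherapy, biomarker))

print("marginal risk difference:      ", round(risk_difference(records), 3))
print("within advanced-stage patients:", round(risk_difference(records, True), 3))
print("within early-stage patients:   ", round(risk_difference(records, False), 3))
```

In this setup radiotherapy looks like a strong predictor marginally, and the difference largely disappears within strata of the third variable. That is consistent with the reading above: the finding can still be worth reporting and exploring, as long as it is described as predictive rather than causal.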
