Consider the problem of extremes that was mentioned in the previous post. You have a set of flood data concerning water discharges at a river station. As an engineer, you are concerned with building a dam that is both cost-effective and able to prevent inundations (water going to places that it shouldn’t).

A 19th or early 20th century scientist or engineer would likely analyze the flood data directly and design a dam that would withstand any event in their dataset, mixing data with engineering intuition to arrive at a solution. The difficulty with this is that they have to plan several decades in advance, and over such a long period there are plenty of chances for a flood to exceed the largest one in the dataset.

The fundamental fallacy of pre-20th century science, one shared by these engineers and by many people today, is the belief that science is concerned with analyzing and describing data. This is very much not the case. In spite of this, education systems around the globe do students the disservice of giving them the impression that to be scientific is to have the 19th century mindset: listen to the data.

What makes 20th century scientific disciplines such as quantum mechanics and modern biology even possible is the realization that we are not concerned with analyzing or modelling the data itself, but the underlying distribution or data generating process that gave rise to that data. In the case of the flood, we are not interested in analyzing the flood data; we are interested in analyzing the underlying, unseen distribution of extreme discharges of water.
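To make the flood case concrete, here is a minimal sketch of modelling the underlying distribution rather than the data itself: fitting a Gumbel (Type I extreme value) distribution to annual maximum discharges and asking it for a 100-year return level. The data are synthetic and every number is invented for illustration.

```python
# Sketch: estimating the unseen distribution of extreme discharges.
# All data here are synthetic; the parameters are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical 50 years of annual maximum discharges (m^3/s).
annual_maxima = stats.gumbel_r.rvs(loc=800, scale=150, size=50,
                                   random_state=rng)

# Fit a Gumbel (Type I extreme value) distribution to the maxima.
loc, scale = stats.gumbel_r.fit(annual_maxima)

# The 100-year return level: the discharge exceeded with
# probability 1/100 in any given year, according to the model.
return_level_100 = stats.gumbel_r.ppf(1 - 1 / 100, loc=loc, scale=scale)
print(f"Estimated 100-year flood: {return_level_100:.0f} m^3/s")
```

The fitted distribution lets the engineer reason about events more extreme than anything in the record, which is exactly what fifty years of raw observations cannot do on their own.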

This was the statistical revolution, and it was a great trade-off. Traditionally, the object of interest within science was directly observable. Scientists saw themselves, and many still do (despite the methodologies they use in practice), as studying planets, plants, genes and so forth. What their statistics classes often fail to teach them is that the moment you sell your soul to statistics, you become a scholar of unobservable distributions, and those objects you think you are concerned with are merely shadows or expressions of the thing you are actually studying.

Consider the case of a really basic randomized controlled trial for testing a weight-loss drug. You have a treatment group and a control group. The treatment group receives the drug and the control group receives either no drug or a placebo. Subjects are randomly assigned to the groups using a random number generator, which makes the treatment itself uncorrelated with individual factors of the subjects. If we are testing a weight-loss pill, we don’t want to compare the outcomes of fat people who take the medicine with skinny people who don’t.
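The random assignment step can be sketched in a few lines: shuffle the subject IDs with a random number generator and split the result. The subject count is arbitrary.

```python
# Sketch of random assignment in an RCT. The number of subjects
# (100) is a made-up illustration.
import numpy as np

rng = np.random.default_rng(0)
subjects = np.arange(100)   # hypothetical subject IDs
rng.shuffle(subjects)

treatment = subjects[:50]
control = subjects[50:]
# Because assignment depends only on the RNG, group membership is
# independent of any subject characteristic (weight, age, ...).
```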

After the outcomes are measured, scientists will typically compute the means of the two groups and subtract one from the other before testing whether the means differ significantly in a statistical sense. They are essentially estimating whether there is a desirable average treatment effect. What many lay people don’t realize, however, is that by going this route the scientists are no longer studying the impact of the medicine on individuals; they are studying the distribution of the difference in the means of two distributions (the distribution of outcomes for the treatment group and the distribution of outcomes for the control group).
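The difference-in-means test described above can be sketched as follows, using simulated outcomes (weight change in kilograms; every number is invented) and a two-sample t-test as a stand-in for whatever significance test a given study uses.

```python
# Sketch of estimating an average treatment effect from two groups.
# Outcomes are simulated; means, spreads, and sizes are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
treated = rng.normal(loc=-3.0, scale=2.0, size=50)   # drug group
control = rng.normal(loc=-0.5, scale=2.0, size=50)   # placebo group

# Estimated average treatment effect: difference of group means.
effect = treated.mean() - control.mean()

# Is the difference statistically distinguishable from zero?
t_stat, p_value = stats.ttest_ind(treated, control)
print(f"estimated ATE: {effect:.2f} kg, p = {p_value:.4f}")
```

Note that both `effect` and `p_value` are statements about the two outcome distributions, not about what the drug did to any particular person.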

The implications of this are staggering. Yes, we have to make extra-data assumptions using our own judgment when modelling phenomena, but we gain so much safety, precision and expressivity by doing so. By modelling the extreme value distribution of floods using statistics, we got much better dams, safer streets, safer cars, the ability to apply physics to problems with much larger degrees of freedom (variables) via statistical mechanics and so forth. We can make precise statements about the differences in distributions rather than highly biased statements about differences in individuals who took a drug vs those who didn’t.

Embracing this perspective engenders both a healthy respect for science and a healthy skepticism of scientific models, as well as of other data, such as the data you’ll find in finance and business intelligence. In the age of big data, it is a fad to say “follow the data”, but know that this advice invites you back to a much more primitive time in science’s past. Data says nothing; analysis does the talking. Data is there to check that the models we constructed using our scientific knowledge are still consistent with the real world, and that is all it can do. There is no replacing you, the analyst.