# 1 Introduction


This book is about data science. This term has no precise definition. Data science involves some statistics, some probability, some computing, and above all, some knowledge of your data set (the “science” part).

The goal of data science is to help us understand patterns of variation in data: economic growth rates, dinosaur skull volumes, student SAT scores, genes in a population, Congressional party affiliations, drug dosage levels, your choice of toothpaste versus mine ... really any variable that can be measured.

To do that, we often use models. A model is a metaphor, a description of a system that helps us to reason more clearly. Like all metaphors, models are approximations, and will never account for every last detail. A useful mantra here, attributed to George Box, is: all models are wrong, but some models are useful. Aerospace engineers work with physical models (blueprints, simulations, mock-ups, wind-tunnel prototypes) to help them understand a proposed airplane design. Geneticists work with animal models (fruit flies, mice, zebrafish) to help them understand heredity. In data science, we work with statistical models to help us understand variation.

Like the weather, most variation in the world exhibits some features that are predictable, and some that are unpredictable. Will it snow on Christmas Day? It’s more likely in Boston than Austin, and more likely still at the North Pole; that’s predictable variation. But even as late as Christmas Eve, and even at the North Pole, nobody knows for sure; that’s unpredictable variation.

Statistical models describe both the predictable and the unpredictable variation in some system. More than that, they allow us to partition observed variation into its predictable and unpredictable components. This focus on the structured quantification of uncertainty is what distinguishes data science from ordinary evidence-based reasoning.
It’s important to know what the evidence says, goes this line of thinking. But it’s also important to know what it doesn’t say. Sometimes that’s the tricky part.
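
To make the idea of partitioning variation concrete, here is a minimal sketch in Python. It is not a method taken from this book: it assumes the simplest possible model, in which the predictable part of a measurement is its group average, and it uses invented temperature readings for three cities. The city means, the spread, the sample size, and the function name `partition_variation` are all hypothetical choices made for illustration.

```python
import random
from statistics import mean


def partition_variation(groups):
    """Split the total sum of squares into a between-group part
    (predictable from group membership) and a within-group part
    (left over after accounting for the group)."""
    all_values = [y for values in groups.values() for y in values]
    grand_mean = mean(all_values)
    total_ss = sum((y - grand_mean) ** 2 for y in all_values)

    predictable_ss = 0.0
    unpredictable_ss = 0.0
    for values in groups.values():
        group_mean = mean(values)
        # Variation explained by knowing which group a value belongs to.
        predictable_ss += len(values) * (group_mean - grand_mean) ** 2
        # Variation that remains around each group's own average.
        unpredictable_ss += sum((y - group_mean) ** 2 for y in values)
    return total_ss, predictable_ss, unpredictable_ss


# Hypothetical data: the means and spread below are invented, not measured.
rng = random.Random(0)
assumed_means = {"Austin": 12.0, "Boston": 1.0, "North Pole": -30.0}
readings = {
    city: [rng.gauss(mu, 4.0) for _ in range(50)]
    for city, mu in assumed_means.items()
}

total, predictable, unpredictable = partition_variation(readings)
print(f"predictable share of variation:   {predictable / total:.2f}")
print(f"unpredictable share of variation: {unpredictable / total:.2f}")
```

With group averages as the model, the two parts add up to the total exactly, so the two printed shares sum to one. A richer model would change how much variation counts as predictable, but not the logic of the split.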