and riddles of interest. Because we will think systematically through the research question,
an appropriate analytical framework will emerge from the process, which we then can subject to
a regression model for empirical testing.
Let us first think of the

statistical significance. The coefficient for bedrooms (bdrms) is 15956.22. I illustrated how to
estimate the coefficients in a simple regression model in Equation 7.10. This suggests that each
additional bedroom adds $15,956 to the price of the house
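
To make the interpretation of the bedroom coefficient concrete, here is a minimal Python sketch. Only the coefficient 15956.22 comes from the text; the intercept is a hypothetical placeholder, and all other regressors are assumed to be held fixed.

```python
# Only the bedroom coefficient is taken from the text.
BDRMS_COEF = 15956.22
# Hypothetical intercept (an assumption, not an estimate from the book);
# it stands in for the intercept plus all other regressors held fixed.
ASSUMED_INTERCEPT = 100000.0


def predicted_price(bdrms: int) -> float:
    """Linear prediction of price for a given number of bedrooms."""
    return ASSUMED_INTERCEPT + BDRMS_COEF * bdrms


# The gap between a 4-bedroom and a 3-bedroom house equals the coefficient,
# whatever value the intercept takes.
difference = predicted_price(4) - predicted_price(3)
print(round(difference, 2))
```

Because the model is linear, the same difference holds between any two houses that differ by one bedroom.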

Thinking about tables, and more importantly their comparison with graphics, reminds us of
Daniel Kahneman, who is one of the most prominent thinkers of our time. Professor Kahneman
received a Nobel for his research in decision-making, which he conducted e

         vars      n  mean   sd median trimmed   mad   min   max range   se
age         1 463.00 48.37 9.80  48.00   48.35 11.86 29.00 73.00 44.00 0.46
beauty      2 463.00  0.00 0.79  -0.07   -0.05  0.87 -1.45  1.97  3.42 0.04
eval        3 463.00  4.00 0.55   4.00    4.03  0.59  2.10  5.00  2.90 0.03
students    4 463.0

7. We illustrate time series plots in Chapter 11.
8. Piketty, T. (2014). Capital in the Twenty-First Century. Translated by Arthur Goldhammer. Harvard University Press. Cambridge, MA.
9. Blau, F. D. and Kahn, L. M. (n.d.). Gender differences in pay. The

Regression models, though very powerful tools, are vulnerable to misspecification and
violations of the assumptions that make them work. Applied statisticians, data scientists, and
analysts sometimes forget, or worse, are ignorant of the conditi

In this particular case, I set the coefficients for being unemployed to 0. This implies that
$$V_u = 0 = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n$$
Therefore,
$$\Pr(\text{work}) = \frac{\exp(V_w)}{1 + \exp(V_w)}$$
because exp(0) = 1.
If you divide both the numerator and the denominator by exp(V_w), y
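
The division described here can be checked numerically. The following Python sketch assumes the binary case stated in the text, where the systematic utility of being unemployed is 0; the utility values fed to it are hypothetical.

```python
import math


def pr_work_ratio(v_w: float) -> float:
    """Pr(work) = exp(V_w) / (1 + exp(V_w)), using exp(V_u) = exp(0) = 1."""
    return math.exp(v_w) / (1.0 + math.exp(v_w))


def pr_work_divided(v_w: float) -> float:
    """The same probability after dividing top and bottom by exp(V_w)."""
    return 1.0 / (1.0 + math.exp(-v_w))


# Hypothetical values of V_w, chosen only for illustration.
for v_w in (-2.0, 0.0, 0.5, 3.0):
    assert math.isclose(pr_work_ratio(v_w), pr_work_divided(v_w))
    print(v_w, round(pr_work_ratio(v_w), 4))
```

Both forms give the same probability. The second form is often preferred in code because it needs only one call to the exponential function.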

and West Asia, income inequality in Toronto, teaching evaluations in Texas, commuting times
in New York, and religiosity and extramarital affairs are all examples that make GSDS resonate
with what is current and critical today.
As I mentioned earlier, in

The z-transformation for 3.5 returns a z-score of -0.899. I again use Figure 6.13 to
first locate 0.8 in the first column and then 0.09 in the first row and search for the corresponding
p-value that is located at the intersection of the two. The r
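
The table lookup can be cross-checked in software. This Python sketch assumes that the table in Figure 6.13 reports the area to the left of a negative z-score under the standard normal curve; that reading is an assumption, not something the excerpt confirms.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

z = -0.899
# Area to the left of the unrounded z-score.
p_exact = std_normal.cdf(z)
# A printed table forces rounding to two decimals: row 0.8, column 0.09.
p_table = std_normal.cdf(-0.89)

print(f"{p_exact:.4f} {p_table:.4f}")
```

The two values differ slightly because the printed table cannot represent the third decimal of the z-score.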

alternative hypothesis.
For large samples (and, again, there is no fixed threshold for how large a sample must be
to count as statistically large), the critical value for a two-tailed t-test at the 5 percent significance level is 1.96.
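
The value 1.96 can be recovered from the standard normal distribution, which the t distribution approaches as the sample grows. This Python sketch assumes a two-tailed test at the 5 percent significance level.

```python
from statistics import NormalDist

alpha = 0.05  # assumed significance level
# A two-tailed test places alpha / 2 in each tail.
critical = NormalDist().inv_cdf(1 - alpha / 2)

print(round(critical, 2))  # 1.96
```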
Inferences Concerning 0
In statistical analyses, discussin

lower propensity to smoke.
Education
Highly educated individuals smoke more than others because of lifestyles associated with higher education, such as writers, editors, and so on.
Education does not affect cigarette smoking.
Highly educated individuals s

learned can be applied to all problems. Such a conclusion would be erroneous.
Recall the story of European settlers who spotted a black swan in Western Australia that
immediately contradicted their belief that all swans were white. The settlers could have

draw inferences and devise strategies. The summary table is, for all intents and purposes, small
data. After it is summarized in a table or a graph, data becomes easier to comprehend. We can see
patterns and trends and devise strategies to develop our com

concepts later in the book; however, at this stage, it is sufficient to say that I will plot a line
that will attempt to capture the underlying relationship in the data set.
Figure 5.4 builds on Figure 5.3 by adding the best-fit line to the data set. We s

sciences did not speak a western language as their mother tongue. In addition, more than 60 percent of
engineering graduates were visible minorities, suggesting that the supply chain of highly qualified
professional talent in Canada, and largely in North Americ

including how to sort the resulting tabulation. Table 4.14 shows cross-tabulation of age and Internet
after sorting responses.
Table 4.14 Cross-Tabulation of Age and Internet After Sorting Responses
             Internet users
Age cohorts    no   yes  Sum
youth        28.8  71.2  10

in determining whether the difference in teaching evaluations was statistically significant on its
own; that is, when we do not consider other relevant factors. I demonstrated how one could use
a t-test to determine the statistical significance in average

browsing the Internet. A long list of Net-enabled devices, including smartphones, tablets, and
even some TVs, competes with the way we have browsed the Internet in the past, that is, laptops
and desktops. The constant flow of information implies that a large

Unlike in the Middle East, where the Arab governments do not allow assimilation of
migrant workers, the Canadian government and society largely do not create systematic
barriers that might limit immigrants' ability to succeed and assimilate. Thi

have to ask ourselves, is it necessary to use decimal points in reporting percentages? Situations
where valid and statistically significant statistics end up below one percent warrant the use of
decimal points. Otherwise, you can reduce the number of digi

Figure 8.1 and Figure 8.2 suggest that the impact of income on smoking is rather limited compared
to the other three explanatory variables.
310 Chapter 8 To Be or Not to Be
Figure 8.2 Added variable plot for the OLS model generated in Stata
Interpreting M

Consider the following calculation of the forecasted probability at the mean values of all
explanatory variables.
Equation 8.2 shows the logit equation:
$$P = \frac{1}{1 + e^{-Z}}$$ Equation 8.2
Where Z is the equation obtained from the l

big data?
The answer is simple. It is quite likely that by the time you review this book, the definition
of big data will have evolved. More importantly, this book is intended to be the very first step
on one's journey to becoming a data scientist. The fu

sites in Minnesota. Several varieties of barley were grown at each site. Subsequent yields
were recorded for each type of barley grown at each site. Figure 5.16 presents the multi-facet data
in one coherent graphic. The x-axis represents the yield and the

(0.493)
4.196
(0.481)
44.24
(45.54)
74.89
0.405
0.090
0.060
0.869
-306
79
All Lower Division Upper Division
Means with standard deviations in parentheses. All statistics except for those describing the number
of students, the percent evaluating the instru

SPSS by default outputs the exponentiated coefficients for the logit and probit models.
Compare the Exp(B) column in Figure 8.29 with the model labeled (2) in Table 8.10. You will
see that the exponentiated coefficients are identical in Table 8.10 and Fi

Figure 4.5 Restricting to instructor-specific observations for age and beauty
Descriptive Statistics by Categorical Variables
Now I illustrate several features to generate descriptive statistics in Stata.
htopen using ht_beauty, append
htput <h3> Descript

America and Europe. Before the recession, people like columnist Margaret Wente, who were fast
approaching retirement, had a 10-year plan. But then a black swan pooped all over it. 1
Nassim Nicholas Taleb, a New York-based professor of finance and a former

Let us interpret the probit model in light of our initial four hypotheses. Briefly, we wanted to
determine the impact of age, income, education, and the price of cigarettes on the probability of
smoking. The estimated coefficients for probit models cannot

Figure 5.38 Scatter plot between population density and transit use (axes: census tract density in person/sq. km. and percent of transit commutes)
At the same time, I find evidence for longer commute times for

Figure 6.33 shows the graphical display.
Figure 6.33 Graphical output for equal variances, right-tailed test (comparison of mean test; t distribution, dof = 461, t-test = 3.25, p-value = 0.000619; vertical axis: density)

Variable
Interpretation of the Coefficient
Level-level y x
A unit change in x results in a unit change in y. Models presented in Table 7.9 follow this formulation.
Log-log Log(y) Log(x)
1
A percentage change in x results in a percent change in y. T

The Congested Lives in Big Cities
I demonstrate illustrative graphics further with an example of traffic congestion in large cities.
If you happen to live in Chicago, New York, or San Francisco, your mobility is constrained by
congested arterials and over