and riddles of interest. Because we will think systematically through the research question,
an appropriate analytical framework will emerge from the process, which we then can subject to
a regression model for empirical testing.
Let us first think of the
statistical significance. The coefficient for bedrooms (bdrms) is 15956.22. I illustrated how to
estimate the coefficients in a simple regression model in Equation 7.10. This suggests that each
additional bedroom adds $15,956 to the price of the house
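The interpretation of a slope coefficient can be checked with a minimal sketch of the simple least-squares estimator. The data below are hypothetical and built so that each bedroom adds exactly $15,000; they are not the housing data used in the text, and the function name `simple_ols` is an assumption for illustration only.

```python
def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope is the sum of cross-deviations divided by the sum of squared x-deviations.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: price rises by exactly 15,000 per bedroom.
bedrooms = [2, 3, 3, 4, 5]
price = [200000 + 15000 * b for b in bedrooms]

intercept, slope = simple_ols(bedrooms, price)
print(f"Each additional bedroom adds about ${slope:,.0f} to the price")
```

The slope is read the same way as the bedroom coefficient above: the expected change in price for one more bedroom, holding the rest of the model fixed.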
Thinking about tables, and more importantly their comparison with graphics, reminds us of
Daniel Kahneman, who is one of the most prominent thinkers of our time. Professor Kahneman
received a Nobel for his research in decision-making, which he conducted e
         vars      n   mean    sd  median  trimmed    mad    min    max  range    se
age         1 463.00  48.37  9.80   48.00    48.35  11.86  29.00  73.00  44.00  0.46
beauty      2 463.00   0.00  0.79   -0.07    -0.05   0.87  -1.45   1.97   3.42  0.04
eval        3 463.00   4.00  0.55    4.00     4.03   0.59   2.10   5.00   2.90  0.03
students    4 463.0
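Two columns of the summary table can be derived from the others, which is a handy way to read it. The sketch below assumes the usual definitions (range = max - min, se = sd / sqrt(n)) and uses only the values printed in the table; the dictionary layout and function names are illustrative.

```python
import math

n = 463  # number of observations reported in the table

# Values copied from the summary table above.
summary = {
    "age":    {"sd": 9.80, "min": 29.00, "max": 73.00},
    "beauty": {"sd": 0.79, "min": -1.45, "max": 1.97},
    "eval":   {"sd": 0.55, "min": 2.10,  "max": 5.00},
}


def standard_error(sd, n):
    """Standard error of the mean: sd divided by the square root of n."""
    return sd / math.sqrt(n)


def value_range(row):
    """Range is the distance between the maximum and the minimum."""
    return row["max"] - row["min"]


for name, row in summary.items():
    se = round(standard_error(row["sd"], n), 2)
    rng = round(value_range(row), 2)
    print(name, "range =", rng, "se =", se)
```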
7. We illustrate time series plots in Chapter 11.
8. Piketty, T. (2014). Capital in the Twenty-First Century. Translated by Arthur Goldhammer. Harvard
University Press. Cambridge, MA.
9. Blau, F. D. and Kahn, L. M. (n.d.). Gender differences in pay. The
Regression models, though very powerful tools, are vulnerable to misspecification and to violations
of the assumptions that make them work. Applied statisticians, data scientists, and
analysts sometimes forget, or worse, are ignorant of the conditi
In this particular case, I set the coefficients for being unemployed to 0. This implies that
$V_u = 0 = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n$
because exp(0) = 1.
If you divide both the numerator and the denominator by exp(V_w), y
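The role of the base category can be sketched in a few lines. The example assumes the standard multinomial logit form, in which each probability is exp(V) divided by the sum of exp(V) over all alternatives; the alternative labels and the utility value 0.8 are hypothetical, not estimates from the text.

```python
import math


def choice_probabilities(utilities):
    """Multinomial logit probabilities: exp(V_j) / sum over k of exp(V_k)."""
    exp_v = {name: math.exp(v) for name, v in utilities.items()}
    total = sum(exp_v.values())
    return {name: e / total for name, e in exp_v.items()}


# Base category: all coefficients set to 0, so V_u = 0 and exp(V_u) = 1.
utilities = {"u": 0.0, "w": 0.8}  # 0.8 is a made-up value for V_w
probs = choice_probabilities(utilities)

# Dividing numerator and denominator by exp(V_w) gives an equivalent form.
v_w = utilities["w"]
p_w_alt = 1.0 / sum(math.exp(v - v_w) for v in utilities.values())

print(probs)
print(p_w_alt)
```

Because exp(0) = 1, the base category contributes exactly 1 to the denominator, which is why only the other alternatives need estimated coefficients.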
and West Asia, income inequality in Toronto, teaching evaluations in Texas, commuting times
in New York, and religiosity and extramarital affairs are all examples that make GSDS resonate
with what is current and critical today.
As I mentioned earlier, in
The z-transformation for 3.5 returns a z-score of -0.899. I again use Figure 6.13 to
first locate 0.8 in the first column and then 0.09 in the first row and search for the corresponding
p-value that is located at the intersection of the two. The r
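The table lookup can be cross-checked numerically. This sketch assumes the table reports the area under the standard normal curve to the left of z, and computes it from the error function in Python's standard library; the function name `normal_cdf` is illustrative.

```python
import math


def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


z = -0.899  # z-score reported in the text for an evaluation of 3.5
p_below = normal_cdf(z)

print(f"P(Z < {z}) is roughly {p_below:.3f}")
```

A printed table rounds z to two decimals (0.8 in the column, 0.09 in the row), so the value read from it can differ slightly from the computed one.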
For large samples, and again there are no fixed thresholds for how large a sample should be
to be considered statistically large, the critical value for a two-tailed t-test at the 5 percent
significance level is approximately 1.96.
Inferences Concerning 0
In statistical analyses, discussin
lower propensity to smoke.
Education | Highly educated individuals smoke more than others because of lifestyles associated with higher education, such as writers, editors, and so on. | Education does not affect cigarette smoking. | Highly educated individuals
learned can be applied to all problems. Such a conclusion would be erroneous.
Recall the story of European settlers who spotted a black swan in Western Australia that
immediately contradicted their belief that all swans were white. The settlers could have
draw inferences and devise strategies. The summary table is, for all intents and purposes, small
data. After it is summarized in a table or a graph, data becomes easier to comprehend. We can see
patterns and trends and devise strategies to develop our com
concepts later in the book; however, at this stage, it is sufficient to say that I will plot a line
that will attempt to capture the underlying relationship in the data set.
Figure 5.4 builds on Figure 5.3 by adding the best-fit line to the data set. We s
sciences did not speak a western language as mother tongue. In addition, more than 60 percent of
engineering graduates were visible minorities, suggesting that the supply chain of highly qualified
professional talent in Canada, and largely in North Americ
including how to sort the resulting tabulation. Table 4.14 shows cross-tabulation of age and Internet
after sorting responses.
Table 4.14 Cross-Tabulation of Age and Internet After Sorting Responses
Age cohorts no yes Sum
youth 28.8 71.2 10
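A cross-tabulation with row percentages and a chosen row order can be sketched with the standard library alone. The responses below are hypothetical and do not reproduce Table 4.14; the function name and the cohort labels other than "youth" are assumptions for illustration.

```python
from collections import Counter


def row_percent_crosstab(pairs, row_order, col_order):
    """Cross-tabulate (row, col) pairs as row percentages, in the given row order."""
    counts = Counter(pairs)
    table = {}
    for row in row_order:
        row_total = sum(counts[(row, col)] for col in col_order)
        if row_total == 0:
            continue  # skip cohorts with no responses
        shares = {
            col: round(100.0 * counts[(row, col)] / row_total, 1)
            for col in col_order
        }
        # Sum of the rounded shares; may differ from 100 by rounding error.
        shares["Sum"] = round(sum(shares.values()), 1)
        table[row] = shares
    return table


# Hypothetical survey responses: (age cohort, uses the Internet?).
responses = (
    [("youth", "yes")] * 3 + [("youth", "no")]
    + [("senior", "yes")] * 2 + [("senior", "no")] * 2
)

table = row_percent_crosstab(responses, ["youth", "senior"], ["no", "yes"])
for cohort, shares in table.items():
    print(cohort, shares)
```

Passing the row order explicitly is one simple way to sort the tabulation; dictionaries in Python keep insertion order, so the rows print in the order requested.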
in determining whether the difference in teaching evaluations was statistically significant on its
own; that is, when we do not consider other relevant factors. I demonstrated how one could use
a t-test to determine the statistical significance in average
browsing the Internet. A long list of Net-enabled devices, including smartphones, tablets, and
even some TVs, compete with the way we have browsed the Internet in the past, that is, laptops
and desktops. The constant flow of information implies that a large
Unlike in the Middle East, where the Arab governments do not allow assimilation of
migrant workers, the Canadian government and society largely do not create systematic
barriers that might limit the immigrants' ability to succeed and assimilate. Thi
have to ask ourselves, is it necessary to use decimal points in reporting percentages? Situations
where valid and statistically significant statistics end up below one percent warrant the use of
decimal points. Otherwise, you can reduce the number of digi
Figure 8.1 and Figure 8.2 suggest that the impact of income on smoking is rather limited compared
to the other three explanatory variables.
310 Chapter 8 To Be or Not to Be
Figure 8.2 Added variable plot for the OLS model generated in Stata
Consider the following calculation of the forecasted probability at the mean values of all
Equation 8.2 shows the logit equation:
$P = \frac{1}{1 + e^{-Z}}$ (Equation 8.2)
Where Z is the equation obtained from the l
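The logit transformation can be sketched as follows, assuming the standard logistic form in which the probability equals 1 / (1 + exp(-Z)). The coefficients and mean values are hypothetical placeholders, not estimates from the smoking model in the text.

```python
import math


def logistic(z):
    """Convert the linear predictor Z into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients and sample means (illustration only).
coefficients = {"intercept": 0.5, "age": -0.02, "educ": -0.05}
means = {"age": 40.0, "educ": 12.0}

# Z is the fitted linear equation evaluated at the mean values.
z = coefficients["intercept"] + sum(
    coefficients[name] * value for name, value in means.items()
)
probability = logistic(z)

print(f"Z = {z:.2f}, forecasted probability = {probability:.3f}")
```

The same function also shows why a logit forecast never leaves the unit interval: very negative values of Z map close to 0 and very positive values map close to 1.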
The answer is simple. It is quite likely that by the time you review this book, the definition
of big data will have evolved. More importantly, this book is intended to be the very first step
on one's journey to becoming a data scientist. The fu
sites in Minnesota. Several varieties of barley were grown at each site. Subsequent yields
were recorded for each type of barley grown at each site. Figure 5.16 presents the multi-facet data
in one coherent graphic. The x-axis represents the yield and the
All Lower Division Upper Division
Means with standard deviations in parentheses. All statistics except for those describing the number
of students, the percent evaluating the instru
SPSS by default outputs the exponentiated coefficients for the logit and probit models.
Compare the Exp(B) column in Figure 8.29 with the model labeled (2) in Table 8.10. You will
see that the exponentiated coefficients are identical in Table 8.10 and Fi
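Exponentiating a coefficient is a one-line operation, sketched below with hypothetical values rather than the estimates in Table 8.10. For a logit model, the exponentiated coefficient is commonly read as an odds ratio: a value below 1 lowers the odds and a value above 1 raises them.

```python
import math

# Hypothetical raw logit coefficients (illustration only).
coefficients = {"age": -0.02, "educ": 0.05}

# Exp(B): the exponentiated coefficients.
exp_b = {name: math.exp(b) for name, b in coefficients.items()}

for name, value in exp_b.items():
    direction = "lowers" if value < 1 else "raises"
    print(f"{name}: Exp(B) = {value:.3f}, a one-unit increase {direction} the odds")
```

Taking the natural log of Exp(B) recovers the raw coefficient, which is how the two ways of reporting the same model can be reconciled.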
Figure 4.5 Restricting to instructor-specific observations for age and beauty
Descriptive Statistics by Categorical Variables
Now I illustrate several features to generate descriptive statistics in Stata.
htopen using ht_beauty, append
htput <h3> Descript
America and Europe. Before the recession, people like columnist Margaret Wente, who were fast
approaching retirement, had a 10-year plan. But then a black swan pooped all over it. 1
Nassim Nicholas Taleb, a New York-based professor of finance and a former
Let us interpret the probit model in light of our initial four hypotheses. Briefly, we wanted to
determine the impact of age, income, education, and the price of cigarettes on the probability of
smoking. The estimated coefficients for probit models cannot
[Figure: scatter plot; axes are census tract density (person/sq. km.) and percent of transit commutes]
Figure 5.38 Scatter plot between population density and transit use
At the same time, I find evidence for longer commute times for
Figure 6.33 shows the graphical display.
[Figure: Comparison of mean test; dof = 461, t-test = 3.25, p-value = 0.000619]
Figure 6.33 Graphical output for equal variances, right-tailed test
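The right-tailed p-value in the figure can be approximated without a statistics package. The sketch below integrates the Student t density with Simpson's rule; it assumes a non-negative test statistic, and the function names and step count are illustrative choices rather than the method used to produce the figure.

```python
import math


def t_pdf(t, dof):
    """Density of the Student t distribution with dof degrees of freedom."""
    log_const = (
        math.lgamma((dof + 1) / 2.0)
        - math.lgamma(dof / 2.0)
        - 0.5 * math.log(dof * math.pi)
    )
    return math.exp(log_const - (dof + 1) / 2.0 * math.log1p(t * t / dof))


def right_tail_p(t_stat, dof, steps=2000):
    """P(T > t_stat) for t_stat >= 0: 0.5 minus the area between 0 and t_stat."""
    h = t_stat / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, dof) + t_pdf(t_stat, dof)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_pdf(i * h, dof)
    area = total * h / 3.0
    return 0.5 - area


p_value = right_tail_p(3.25, 461)
print(f"right-tailed p-value is roughly {p_value:.6f}")
```

Because the t statistic in the figure is rounded to two decimals, the computed value should be close to, but need not exactly match, the reported p-value.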
Interpretation of the
Level-level | y | x | A unit change in x results in a unit change in y. Models presented in Table 7.9 follow this formulation.
Log-log | Log(y) | Log(x) | A percentage change in x results in a percent change in y. T
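The difference between the two functional forms can be illustrated numerically. The sketch below uses hypothetical coefficients (a = 1.0, b = 0.5) and shows that in a level-level model a one-unit change in x moves y by b units, while in a log-log model a one percent change in x moves y by roughly b percent.

```python
import math

a, b = 1.0, 0.5  # hypothetical intercept and slope, for illustration only


def level_level(x):
    """y = a + b * x"""
    return a + b * x


def log_log(x):
    """log(y) = a + b * log(x), so y = exp(a + b * log(x))"""
    return math.exp(a + b * math.log(x))


def pct_change(new, old):
    """Percentage change from old to new."""
    return 100.0 * (new - old) / old


# Level-level: raise x by one unit and measure the change in y in units.
unit_effect = level_level(101.0) - level_level(100.0)

# Log-log: raise x by one percent and measure the change in y in percent.
pct_effect = pct_change(log_log(101.0), log_log(100.0))

print(f"level-level: y changes by {unit_effect:.2f} units")
print(f"log-log: y changes by about {pct_effect:.2f} percent")
```

The log-log result is an approximation that is most accurate for small percentage changes in x.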
The Congested Lives in Big Cities
I demonstrate illustrative graphics further with an example of traffic congestion in large cities.
If you happen to live in Chicago, New York, or San Francisco, your mobility is constrained by
congested arterials and over