subject matter; faulty, misleading, or imprecise interpretation of the
data and results; incorrect or inadequate analytical methodology. In the
present book we concentrate on how to choose adequate analytical
methodologies and give precise interpretation
reader finds a description of the most important aspects of the normal
distribution, including the reason for its broad applicability.
1.5 Beyond a Reasonable Doubt.
Consider, for instance, the dataset of Example 1.3 and the statement
the 100 electrical re
further, that we wanted to statistically assess the statement "the
student performance is 3 or above". Denoting by p the probability of
the event "the student performance is 3 or above", we derive from the
dataset an estimate of p, known as point estimate and
1.8 Software Tools
In the following we use courier type font for denoting SPSS and
SPSS
The menu bar of the SPSS user interface is shown in Figure
1.8 (with the data file Meteo.sav in current operation). The contents of
ending with , where we present which commands (or functions in the
MATLAB and R cases) to use in order to perform the explained statistical
operations. The MATLAB functions listed in Commands are, unless
otherwise stated, from the MATLAB Base or Statistic
In STATISTICA variables are symbolically denoted by v followed by a
number representing the position of the variable column in the
spreadsheet. Since Pmax happens to be the first column, it is then
denoted v1. The cases column is v0. It is also possible t
2.2 Presenting the Data
Figure 2.12. Specification of a bar chart for variable PClass (Example 2.1)
using STATISTICA. The category codes can be filled in directly or by
clicking the All button.
Figure 2.13. Bar graph, obtained with STATISTICA, represen
methods explained in this book are illustrated with real-life problems.
The real datasets used in the book examples and exercises are stored in
EXCEL files. They are described in Appendix E and included in the book
CD. Dataset names correspond to the resp
For MATLAB and R functions, ; is simply a separator. Alternative menu
options or functions are separated by |.
Example frames end with .
In Table 2.1 the counts are shown in the column headed by Frequency,
and the frequencies, gi
the comparison of foetal heart rate baseline measurements proposed in
Exercise 4.11. The heart rate
minute, bpm), after discarding rhythm acceleration or deceleration
episodes. The comparison proposed in Exercise 4.11 concerns
measurements obtained in
Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0),
a free software product for statistical computing, are popular
http://www.r-project.org/. This site explains the R history and indicates
a set of URLs (the so-called CRAN mirrors)
command lines must be terminated with the Return or the Enter
The <- symbol is the assignment operator. The c function fills the
vector with the list of values. The symbol x is the vector identifier.
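As a minimal sketch of the assignment just described (the numeric values are illustrative assumptions and are not taken from the book's datasets), one could type:

```r
# Fill a vector with a list of illustrative values and assign it to
# the identifier x using the <- assignment operator.
x <- c(1, 2, 3, 4, 5, 6)

# Typing the identifier alone displays the vector contents.
x

# Functions can then be applied to the whole vector.
mean(x)
```

Each command line is entered at the R prompt and terminated with the Return or Enter key.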
Object identifiers in R
1 Introduction
> x 
This book explains how to apply SPSS, STATISTICA, MATLAB or R to
solving statistical problems. The explanation is guided by solved
examples where we usually use one of the software products and
provide indications (in specific
STATISTICA 7.0, MATLAB 7.1 w
producing a value assigned to a variable (considered to be a matrix).
This is illustrated next, with the command that computes the mean of a
sequence of values structured as a row vector:
v=[1 2 3 4 5 6];
mean(v)
ans =
    3.5000
y=mean(v)
y =
would fill in the above formula using the respective variable identifiers;
in this case: 1+(pmax>20)+(pmax>80). Looking at Figure 2.6 one may
rightly suspect that a large number of functions are available in SPSS for
building arbitrarily complex formulas.
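The same kind of formula can be tried outside SPSS. The following R sketch is only an analogue of the SPSS formula above, not the SPSS procedure itself; the pmax values and the identifier pclass are hypothetical and chosen merely to exercise the three classes:

```r
# Hypothetical maximum precipitation values (NOT the Meteo dataset).
pmax <- c(10, 20, 21, 80, 81, 125)

# Each comparison yields TRUE or FALSE, which count as 1 or 0 in
# arithmetic, so the sum assigns class 1, 2 or 3 to every case.
pclass <- 1 + (pmax > 20) + (pmax > 80)
pclass

# Counts of cases per class.
table(pclass)
```

Values not exceeding 20 fall in class 1, values above 20 and not exceeding 80 in class 2, and values above 80 in class 3.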
Figure 2.15. MATLAB figure window, containing the bar graph of PClass.
The graph itself can be copied to the clipboard using the Copy Figure
option of the Edit menu.
Figure 2.16. Bar graphs of PClass obtained with R: a) Using gr
button over the respective column heading. The specification box
shown in Figure 2.4 is then displayed. Note the possibility of specifying
a variable label (describing the variable meaning) or a formula (this last
possibility will be used later). Missing
files (*.sav for SPSS, *.sta for STATISTICA, *.mat for MATLAB, files
containing data frames for R).
2.2.1 Counts and Bar Graphs
Tables of counts and bar graphs are used
to present discrete data. Denoting by X the discrete random variable
associated to the
Several discrete distributions are described in Appendix B. An important
one, since it occurs frequently in statistical studies, is the binomial
trial is denoted p. The complementary probability of the failure is 1 − p,
also denoted q. Deta
125 36 40 38 Montalegre 80
102 37 36 35 Mirandela
37 37 35 Régua
111 34 33 31 Bragança
98 40 40 38 M. Douro
109 41 41 40 .
2.1.2 Operating with the Data
After having read in a data set, one is
often confronted with the need of defining ne
such as the use of external code (DLLs) and application programming
interfaces (API), as well as the possibility of developing specific routines
in a Basic-like programming language.
perform menu operations. SPSS and STATISTICA are examples of
Figure 2.14. The STATISTICA All Options window that allows the user to
completely customise the graphic output. This window has several
sub-windows that can be opened with the left tabs. The sub-window
corresponding to the axis units is shown.
hits a particular point in a target is zero (the variable domain is here
$\mathbb{R}^2$). For a continuous variable, X (with value denoted by the same
lower case letter, x), one can assign infinitesimal probabilities $\Delta p(x)$ to
infinitesimal intervals $\Delta x$:
$$\Delta p(x) = f(x)\,\Delta x$$
Commands 2.3. SPSS, STATISTICA, MATLAB and R commands used to
obtain histograms.
SPSS        Graphs; Histogram | Interactive; Histogram
STATISTICA  Graphs; Histograms
MATLAB      hist(y,x)
R           hist(x)
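To illustrate the R entry of the command list, here is a minimal sketch using simulated values. The data, the seed and the identifier h are assumptions made for illustration only; they are not the cork stopper data used in the book:

```r
# Simulated illustrative data: 150 values (NOT the cork stopper data).
set.seed(1)
x <- rnorm(150, mean = 700, sd = 300)

# Compute the histogram bins and counts without drawing the plot.
h <- hist(x, plot = FALSE)
h$breaks   # bin limits chosen by R
h$counts   # number of cases falling in each bin

# Draw the histogram with the default bin settings.
hist(x)
```

With plot = FALSE, hist returns the bin limits and counts, so the frequencies can be inspected alongside the graph.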
The commands used to obtain histograms of c
the bin width (Step size) and the starting point of the bins. With
MATLAB one obtains both the frequencies and the histogram with the
hist command. Consider the following commands applied to the cork
stopper data stored in the MATLAB cork matrix:
prt = c
matrix from the Meteodata.mat file as can be confirmed by displaying
its contents with: meteo.
R Data Entry
The tabular form of data in R is called a data frame. A
data frame is an aggregate of column vectors, corresponding to the
1193.111<x<=1360.667   11   141   7.33333    94.0000
1360.667<x<=1528.222    8   149   5.33333    99.3333
1528.222<x<=1695.778    1   150   0.66667   100.0000
Missing                 0   150   0.00000   100.0000
This variable measures the total perimeter of cork defects, and can be
examples of open products. R can be downloaded from the Internet
illustrates that after writing down the command help stats (ending with
(functions) of the MATLAB Statistical toolbox. One could go on and
write, for instance,
bar graphs. The | symbol separates alternative options or functions.
Figure 2.10. For the frequency bar graph one must check the % of
2 Presenting and Summarising the Data
hist(y,x), plots a bar graph of the y frequencies, using a vector x with t
data spreadsheet. This file can then be comfortably opened in a
following session with the Open option of the File menu.
Figure 2.3. Variable View spreadsheet of SPSS for the meteorological
data. Notice the fields for filling in variable labels and missin