Chi-Square Tests
Chapter 12.1-12.3 and elsewhere

Categorical Data
When analyzing continuous data, the underlying theory is from the normal distribution.
If the data represent attributes, the theory is from the binomial.
If there are multiple categories, there is a multinomial generalization.
We will look at these, but first revisit some categorical data analysis we've already done.

10.3: Comparing two proportions
Take two (large) samples of size n1 and n2.
Let X1 denote the number in sample 1 that have some characteristic, and X2 the same in sample 2.
Compute the sample proportions p1 = X1/n1 and p2 = X2/n2.
Use these to test the hypothesis about the population proportions, H0: π1 – π2 = 0.

Example
Weekend Sampled         Merchandise at Regular Price   During Annual Mega Sale
Paid by credit card     201                            312
Transactions sampled    300                            416
Do people use credit cards more frequently during sales?
From an upscale store's database, we sampled a number of transactions over two weekends.

Is there a difference?
p1 = 201/300 = .67
p2 = 312/416 = .75
How much is significant?
The SE (page 355) is .03414.

PHStat results
Critical values: ±1.96
ZCALC = 2.34
There is more CC usage during sales.
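
The Z test on the preceding slides can be sketched in a few lines of Python. This is a minimal sketch, assuming the SE quoted from page 355 is the usual pooled standard error for two proportions; that assumption does reproduce the .03414 shown. The variable names are illustrative. The sign of z depends only on which sample is subtracted from which.

```python
from math import sqrt

# Counts from the example slide (variable names are illustrative).
x1, n1 = 201, 300   # regular-price weekend: credit card sales, transactions sampled
x2, n2 = 312, 416   # Mega Sale weekend: credit card sales, transactions sampled

p1, p2 = x1 / n1, x2 / n2

# Assumed pooled form of the standard error under H0: pi1 - pi2 = 0.
p_pooled = (x1 + x2) / (n1 + n2)
se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))

z = (p1 - p2) / se   # negative here because p1 (regular) is smaller than p2 (sale)

print(f"p1 = {p1:.2f}, p2 = {p2:.2f}, SE = {se:.5f}, |z| = {abs(z):.2f}")
```

Since |z| exceeds 1.96, the sketch reaches the same conclusion as the PHStat output.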

12.1: An alternate test
It is difficult to generalize this Z-test to a problem with more than two populations; for example, weekdays and weekends during both sales and regular-priced days (four samples).
It would be even more difficult to accommodate multiple categories like payment by cash, check, credit card, debit card and gift card.
We will look at a test that can handle all these.
But first, let's apply it to the current example.

Contingency tables
These are tables cross-tabulating count data by two factors (here, time period and type of payment).
In general, we are going to see if the rows differ by the columns, on a proportionate basis.
We thus have the time period on the columns and the type of payment on the rows.

Our example again
                  Time Period
Type of Payment   During Mega Sale   At Regular Prices   Total
Credit Card       312                201                 513
Cash              104                99                  203
Total             416                300                 716
This is a 2-by-2 table, so there are four "cells".

Methodology
Method: compute an expected frequency for each "cell" and compare it to what we actually observed.
In each cell, compute the difference between what was observed and what was expected.
If the payment methods were used at about the same frequency in both time periods, the differences would be small.

Overall credit card usage
$p_{Credit} = \frac{201 + 312}{300 + 416} = \frac{513}{716} = .71648$
Expected during sales =
During regular =
Cash usage expected =
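
The blanks on this slide come from applying the overall proportion to each sample size. The sketch below shows one way to do that arithmetic; the dictionary layout and names are illustrative assumptions, not part of the course materials.

```python
# Observed credit card counts and sample sizes from the 2-by-2 example.
credit = {"sale": 312, "regular": 201}
sampled = {"sale": 416, "regular": 300}

# Overall credit card proportion: 513 / 716.
p_credit = sum(credit.values()) / sum(sampled.values())

# If usage were the same in both periods, each period would split its
# transactions in the overall proportion.
expected = {}
for period, n in sampled.items():
    expected[("credit", period)] = p_credit * n
    expected[("cash", period)] = (1 - p_credit) * n

for cell, value in expected.items():
    print(cell, round(value, 1))
```

The rounded values agree with the expected counts that appear on the "Observed and expected" slide (298.1, 117.9, 214.9 and 85.1).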

Observed and expected
Type of Payment   During Sales            At Regular Prices       Total
                  Observed   Expected     Observed   Expected
Credit Card       312        298.1        201        214.9        513
Cash              104        117.9        99         85.1         203
Total             416                     300                     716

The Chi-Square statistic
Next, square the error and express it relative to what we expect.
Add these up across all four cells.
They call this sum the Chi-Square or χ² statistic.
$\chi^2 = \sum_{i=1}^{K} \frac{(O_i - E_i)^2}{E_i}$
K is the number of "cells" in the table
$O_i$ is the observed count per "cell"
$E_i$ is the expected count
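
As a rough illustration of the formula, the sum can be computed directly for the 2-by-2 example. The sketch gets each expected count as row total times column total divided by n, which for this table is the same as applying the overall proportion to each sample size. The names are illustrative.

```python
observed = [
    [312, 201],   # credit card: during sale, at regular prices
    [104, 99],    # cash:        during sale, at regular prices
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n   # expected count for this cell
        chi_square += (o - e) ** 2 / e          # squared error relative to expected

print(round(chi_square, 2))
```

For these counts the statistic comes out at about 5.49.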

The Chi-Square Distribution
[Figure: chi-square density f(x) with 5 (dashed) and 10 (solid) degrees of freedom, for x from 0 to 30]

Where is upper 5% of distribution?
In our text, the Chi-Square table (Table E.4) is on page 741.
We want to find the value that puts 5% of the area in the upper tail.
For a 2-by-2 table, we use a χ² distribution with one degree of freedom.
The critical value is thus 3.841.
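
For one degree of freedom, the table value can be checked without a chi-square table. A chi-square variable with 1 df is the square of a standard normal, so its upper 5% point is the square of the two-sided 5% normal critical value. The sketch below uses Python's standard-library `statistics.NormalDist`; it is only a cross-check, not how the textbook table was produced.

```python
from statistics import NormalDist

alpha = 0.05

# Two-sided normal critical value (about 1.96).
z_two_sided = NormalDist().inv_cdf(1 - alpha / 2)

# Upper-tail chi-square critical value with 1 df is its square.
chi2_critical_1df = z_two_sided ** 2

print(round(z_two_sided, 2), round(chi2_critical_1df, 3))
```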

The Critical Value (at α = .05)
Critical Values of Chi-Square
      Upper-Tail Areas (α)
df    0.10     0.05     0.025    0.01     0.005
1     2.706    3.841    5.024    6.635    7.879
2     4.605    5.991    7.378    9.210    10.597
3     6.251    7.815    9.348    11.345   12.838
4     7.779    9.488    11.143   13.277   14.860
5     9.236    11.070   12.833   15.086   16.750
6     10.645   12.592   14.449   16.812   18.548
7     12.017   14.067   16.013   18.475   20.278
8     13.362   15.507   17.535   20.090   21.955
9     14.684   16.919   19.023   21.666   23.589
10    15.987   18.307   20.483   23.209   25.188

Is credit card usage the same?
Hypotheses
Test statistic
Decision rule
Results

PHStat calculation
[PHStat output screenshot]

This procedure not needed?
True – we can always use the Z test. It actually is the same test; take the Zcalc value and square it to get the Chi-Square value.
False – it is a useful starting point for larger problems with more rows and columns.
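
The claim that squaring Zcalc gives the Chi-Square value can be checked numerically on the credit card example. The sketch below assumes the pooled standard error for the Z test; the names are illustrative.

```python
from math import isclose, sqrt

credit = [312, 201]   # credit card sales: during sale, at regular prices
sizes = [416, 300]    # transactions sampled in each period

p_bar = sum(credit) / sum(sizes)

# Z statistic with the (assumed) pooled standard error.
z = (credit[0] / sizes[0] - credit[1] / sizes[1]) / sqrt(
    p_bar * (1 - p_bar) * (1 / sizes[0] + 1 / sizes[1]))

# Chi-square statistic summed over the four cells of the 2-by-2 table.
chi_square = 0.0
for x, n in zip(credit, sizes):
    for observed, share in ((x, p_bar), (n - x, 1 - p_bar)):
        expected = share * n
        chi_square += (observed - expected) ** 2 / expected

print(round(z, 2), round(z ** 2, 2), round(chi_square, 2))
```

The two routes agree to floating-point precision, and 1.96² is about 3.841, so the two decision rules also agree.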

More general procedures
Let r = the number of row levels.
Let c = the number of column levels.
We call it an r × c or r-by-c table.
Methodology is the same. The Chi-Square statistic has (r - 1)(c - 1) d.f.

12.2: Multiple populations
We still have two row categories but now there are c populations on the columns.
This is a 2-by-c analysis and there are thus (2 - 1)(c - 1) = c - 1 d.f.
The hypothesis is that the proportions in each row are the same across columns.

Our example expanded
                  Time Period
Type of Payment   Weekend Sales   Weekday Sales   Weekend Regular   Weekday Regular   Total
Credit Card       312             289             201               184               986
Cash              104             97              99                61                361
Total             416             386             300               245               1347

Some of the computations
1. Overall proportion by credit card is 986/1347 = .7320
2. Expected during weekend sales is thus .7320(416) = 304.51
3. (Observed - Expected) = 7.49
4. (O - E)²/E = (312 - 304.51)²/304.51 = .1842
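
The four numbered steps can be repeated for every cell of the 2-by-4 table and summed. The sketch below first reproduces the weekend-sales credit card cell and then accumulates the full statistic. The names are illustrative. The total of about 7.57 is computed here; it is not quoted on the slides.

```python
credit = [312, 289, 201, 184]   # weekend sales, weekday sales, weekend regular, weekday regular
cash = [104, 97, 99, 61]
sizes = [a + b for a, b in zip(credit, cash)]   # 416, 386, 300, 245

p_credit = sum(credit) / sum(sizes)                           # step 1: 986 / 1347
expected_first = p_credit * sizes[0]                          # step 2
difference_first = credit[0] - expected_first                 # step 3
contribution_first = difference_first ** 2 / expected_first   # step 4

# Same four steps for all eight cells, summed.
chi_square = 0.0
for row, share in ((credit, p_credit), (cash, 1 - p_credit)):
    for observed, n in zip(row, sizes):
        expected = share * n
        chi_square += (observed - expected) ** 2 / expected

print(round(p_credit, 4), round(expected_first, 2),
      round(difference_first, 2), round(contribution_first, 4))
print(round(chi_square, 2))
```

With (c - 1) = 3 df the 5% critical value is 7.815, so a statistic of about 7.57 falls just short of significance.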

Here, there are (c - 1) = (4 - 1) = 3 df
Critical Values of Chi-Square
      Upper-Tail Areas (α)
df    0.10     0.05     0.025    0.01     0.005
1     2.706    3.841    5.024    6.635    7.879
2     4.605    5.991    7.378    9.210    10.597
3     6.251    7.815    9.348    11.345   12.838
4     7.779    9.488    11.143   13.277   14.860
5     9.236    11.070   12.833   15.086   16.750
6     10.645   12.592   14.449   16.812   18.548
7     12.017   14.067   16.013   18.475   20.278
8     13.362   15.507   17.535   20.090   21.955
9     14.684   16.919   19.023   21.666   23.589
10    15.987   18.307   20.483   23.209   25.188

Now, is credit card usage similar?
Hypotheses
Test statistic
Decision rule
Results

In PHStat
[PHStat output screenshot]

Suppose it were just a little different?
Because we did not reject H0, we would not go looking for significant differences between time periods. There aren't any!
Suppose we find a data transcription error. During weekday sales, 298 (not 289) of the 386 sales were by credit card.
The Chi-Square is now 10.02.
We might want to find out what differences are significant.
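
The effect of the transcription error can be checked with a compact formula that holds for 2-by-c tables: the statistic equals the sum of n_j (p_j - p̄)² over the columns, divided by p̄(1 - p̄), where p̄ is the overall proportion. This is algebraically the same as summing (O - E)²/E over all cells. The function name below is a hypothetical helper for illustration.

```python
def chi_square_2_by_c(successes, sizes):
    """Chi-square statistic for a 2-by-c table given success counts and sample sizes."""
    p_bar = sum(successes) / sum(sizes)
    weighted = sum(n * (x / n - p_bar) ** 2 for x, n in zip(successes, sizes))
    return weighted / (p_bar * (1 - p_bar))

sizes = [416, 386, 300, 245]   # weekend sales, weekday sales, weekend regular, weekday regular

as_recorded = chi_square_2_by_c([312, 289, 201, 184], sizes)
corrected = chi_square_2_by_c([312, 298, 201, 184], sizes)   # 298, not 289

critical_3df = 7.815   # upper 5% point with 3 df, from the table of critical values
print(round(as_recorded, 2), round(corrected, 2), corrected > critical_3df)
```

Changing a single count by nine transactions moves the statistic from below the critical value to above it.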

The Marascuilo procedure
Similar to Tukey-Kramer analysis for a One-Way ANOVA.
It figures out a critical range value, and any two proportions that differ by this amount or more will be called significant.
It only applies to a 2-by-c table.
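
The slide does not give the critical range formula. The sketch below assumes the form commonly stated in textbooks: for groups j and k, the critical range is the square root of the chi-square critical value (with c - 1 df) times the square root of p_j(1 - p_j)/n_j + p_k(1 - p_k)/n_k. It is applied to the corrected data (298 weekday-sale credit card transactions), and the names are illustrative.

```python
from itertools import combinations
from math import sqrt

# (credit card count, transactions sampled) per time period, corrected data.
groups = {
    "Weekend Sales": (312, 416),
    "Weekday Sales": (298, 386),
    "Weekend Regular": (201, 300),
    "Weekday Regular": (184, 245),
}
chi2_critical = 7.815   # upper 5% point with c - 1 = 3 df

significant = []
for (name_a, (xa, na)), (name_b, (xb, nb)) in combinations(groups.items(), 2):
    pa, pb = xa / na, xb / nb
    # Assumed textbook form of the Marascuilo critical range for this pair.
    critical_range = sqrt(chi2_critical) * sqrt(pa * (1 - pa) / na + pb * (1 - pb) / nb)
    is_significant = abs(pa - pb) >= critical_range
    if is_significant:
        significant.append((name_a, name_b))
    print(f"{name_a} vs {name_b}: |diff| = {abs(pa - pb):.4f}, "
          f"critical range = {critical_range:.4f}, significant = {is_significant}")
```

Under these assumptions, only the weekday-sales versus weekend-regular comparison exceeds its critical range.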

Marascuilo output
[PHStat output screenshot]
Conclude?

12.3: χ² test of independence
The hypothesis is that the row variable is independent of the column variable.
The alternative is that there is some kind of relationship between them.
Under this H0, the expected count per cell is related to its row total and column total:
$E_{ij} = R_i C_j / n$

Data Layout (3 by 4)
        Col 1   Col 2   Col 3   Col 4   Total
Row 1   O11     O12     O13     O14     R1
Row 2   O21     O22     O23     O24     R2
Row 3   O31     O32     O33     O34     R3
Total   C1      C2      C3      C4      n

Example
The human resource manager for a large firm wants to assess the popularity of three alternative flextime plans among workers in four offices.
If the plans are arrayed on the rows and offices across columns, we have r = 3 and c = 4.
The Chi-Square statistic thus has (r - 1)(c - 1) = (3 - 1)(4 - 1) = 2*3 = 6 df.

Survey results
        Office1   Office2   Office3   Office4   Total
Plan1   15        32        18        5         70
Plan2   8         29        23        18        78
Plan3   1         20        25        22        68
Total   24        81        66        45        216
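
As a cross-check on the spreadsheet route, the test of independence can be sketched by hand in Python using the expected-count rule E_ij = R_i C_j / n. The function name is a hypothetical helper, not a PHStat or library call.

```python
observed = [
    [15, 32, 18, 5],    # Plan 1 across Offices 1-4
    [8, 29, 23, 18],    # Plan 2
    [1, 20, 25, 22],    # Plan 3
]

def chi_square_independence(table):
    """Return (statistic, degrees of freedom, expected counts) for an r-by-c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum((o - e) ** 2 / e
                    for o_row, e_row in zip(table, expected)
                    for o, e in zip(o_row, e_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected

statistic, df, expected = chi_square_independence(observed)
print(round(statistic, 3), df)
print(round(min(min(row) for row in expected), 2))   # smallest expected count
```

The statistic and degrees of freedom match the PHStat results shown later (27.135 with 6 df), and the smallest expected count is above 5.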

Using PHStat
[PHStat dialog screenshot with callouts for α, r and c]
That produces this blank table
[Blank PHStat worksheet screenshot; callout: Fill these in]
Filled In table
[Filled-in PHStat worksheet screenshot]

Results
Data
Level of Significance        0.05
Number of Rows               3
Number of Columns            4
Degrees of Freedom           6
Results
Critical Value               12.59159
Chi-Square Test Statistic    27.135
p-Value                      0.000137
Reject the null hypothesis
Strong significance
Expected frequency assumption is met.
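
The p-value in the output can be reproduced without statistical software, because the chi-square upper-tail area has a closed form when the degrees of freedom are even: with df = 2m, the area is exp(-x/2) times the sum of (x/2)^k / k! for k = 0 to m - 1. The helper below is a hypothetical name and is only valid for even df.

```python
from math import exp, factorial

def chi2_upper_tail_even_df(x, df):
    """Upper-tail area of a chi-square distribution; valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2
    return exp(-half) * sum(half ** k / factorial(k) for k in range(df // 2))

p_value = chi2_upper_tail_even_df(27.135, 6)          # test statistic from the output
tail_at_critical = chi2_upper_tail_even_df(12.59159, 6)   # should be close to alpha

print(f"{p_value:.6f}", f"{tail_at_critical:.4f}")
```

The same function applied to the critical value 12.59159 returns approximately 0.05, which is a quick consistency check on the output.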

Conclusions
H0 is rejected, so we can say that some offices had different preferences for the plans than others.
To dig a little deeper, we can look at the table of (Oij – Eij)²/Eij values.
Large values here show "extra information".

The (Oij – Eij)²/Eij Table
(fo-fe)^2/fe
        Office1    Office2    Office3    Office4
Plan1   6.706349   1.259524   0.536941   6.297619
Plan2   0.051282   0.002137   0.029138   0.188462
Plan3   5.687908   1.186275   0.857992   4.331373
The workers in Office 1 liked Plan 1 but not Plan 3.
In Office 4 it was pretty much the opposite.
Offices 2 and 3 had no strong preferences.
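
The "extra information" reading can be sketched as follows: compute each cell's contribution, note whether the observed count is above or below the expected count, and list the largest contributions. The labels and names are illustrative.

```python
plans = ["Plan1", "Plan2", "Plan3"]
offices = ["Office1", "Office2", "Office3", "Office4"]
observed = [[15, 32, 18, 5], [8, 29, 23, 18], [1, 20, 25, 22]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

cells = []
for i, plan in enumerate(plans):
    for j, office in enumerate(offices):
        e = row_totals[i] * col_totals[j] / n
        o = observed[i][j]
        direction = "more than expected" if o > e else "fewer than expected"
        cells.append(((o - e) ** 2 / e, plan, office, direction))

# The four largest contributions carry most of the chi-square statistic.
top_four = sorted(cells, reverse=True)[:4]
for contribution, plan, office, direction in top_four:
    print(f"{plan} in {office}: {contribution:.3f} ({direction})")
```

The four largest cells are the Plan 1 and Plan 3 entries for Offices 1 and 4, with opposite directions in the two offices, which matches the comments on the slide.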

Note on "Expected Frequency"
PHStat said the expected frequency assumption was met.
Essentially, this test is based on an approximation that is met as long as most of the Eij values are at least 5.
If you had a "sparse" table, you might have to combine some categories to meet this assumption.

On an exam
Obviously, this is a computer procedure.
You would not have to do one of these on an exam.
However, I could give you the PHStat output and ask you to figure it out.
The "extra information" analysis, too.

Goodness-of-Fit Tests
These are tests to see if the observed data follow an expected pattern.
For example, suppose we have several categories and want to see if they are all equal in size.
Another example: do people choose the same type of chocolate that they did before?

First Example
Are technical support calls equal across all days of the week?
We sample data for 10 days for each day of week.
Day of Week   No. Calls
Monday        290
Tuesday       250
Wednesday     238
Thursday      257
Friday        265
Saturday      230
Sunday        192
Total         1722

Logic of the test
If calls are uniformly distributed, the 1722 calls would be expected to be equally divided across the 7 days.
This would mean that each day would have 1722 ÷ 7 = 246 support calls.
Does the data agree with this?

Observed versus Expected
Day of Week   Observed   Expected
Monday        290        246
Tuesday       250        246
Wednesday     238        246
Thursday      257        246
Friday        265        246
Saturday      230        246
Sunday        192        246

The Chi-Square Test
H0: The distribution of calls is uniform across days of the week
H1: The distribution of calls is not uniform across days of the week
The test statistic is:
$\chi^2 = \sum_{i=1}^{K} \frac{(O_i - E_i)^2}{E_i}$ (where d.f. = K - 1)
where:
K = number of categories
$O_i$ = observed frequency for category i
$E_i$ = expected frequency for category i

The Critical Value (at α = .05)
Critical Values of Chi-Square
      Upper-Tail Areas (α)
df    0.10     0.05     0.025    0.01     0.005
1     2.706    3.841    5.024    6.635    7.879
2     4.605    5.991    7.378    9.210    10.597
3     6.251    7.815    9.348    11.345   12.838
4     7.779    9.488    11.143   13.277   14.860
5     9.236    11.070   12.833   15.086   16.750
6     10.645   12.592   14.449   16.812   18.548
7     12.017   14.067   16.013   18.475   20.278
8     13.362   15.507   17.535   20.090   21.955
9     14.684   16.919   19.023   21.666   23.589
10    15.987   18.307   20.483   23.209   25.188

Computations
For each day of the week, fill in the O - E and (O - E)²/E columns next to the Observed and Expected counts.

Completed Table
Day of Week   Observed   Expected   O - E   (O - E)²/E
Monday        290        246        44      7.8699
Tuesday       250        246        4       0.0650
Wednesday     238        246        -8      0.2602
Thursday      257        246        11      0.4919
Friday        265        246        19      1.4675
Saturday      230        246        -16     1.0407
Sunday        192        246        -54     11.8537
Total                                       23.0488

Comments
1. We rejected H0, so we end up concluding that the assumption of uniformity across days is not true.
2. If you examine the results more closely, however, you see that it wasn't a bad assumption except on Sunday and Monday.
3. You can often get "extra information" by examining the (Oi – Ei)²/Ei column for large numbers.
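
The completed table and the comments on it can be reproduced with a short sketch. It assumes the uniform pattern from the slides (equal expected calls each day) and takes the 5% critical value for K - 1 = 6 df from the table of critical values. The names are illustrative.

```python
calls = {
    "Monday": 290, "Tuesday": 250, "Wednesday": 238, "Thursday": 257,
    "Friday": 265, "Saturday": 230, "Sunday": 192,
}

total = sum(calls.values())
expected = total / len(calls)   # uniform pattern: 1722 / 7

# Per-day contribution (O - E)^2 / E.
contributions = {day: (o - expected) ** 2 / expected for day, o in calls.items()}
chi_square = sum(contributions.values())
df = len(calls) - 1
critical = 12.592   # upper 5% point with 6 df

print(round(chi_square, 4), df, chi_square > critical)

# Days that contribute most to the statistic.
largest_two = sorted(contributions, key=contributions.get, reverse=True)[:2]
print(largest_two)
```

Sunday and Monday account for most of the statistic, which is the "extra information" noted in the comments.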

On an exam
This would have been too large a problem for me to expect you to do it by hand on an exam.
A smaller problem (say K = 4) would be "fair game".
Don't be surprised if I ask you to do that, assuming I gave you the pattern to test for.

Another example
Historical data suggest that customer preferences for chocolate bars are: Mr. Goodbar (30%), Hershey's Milk Chocolate (50%), Special Dark (15%), Krackel (5%).
In a marketing research lab, a survey of 200 students showed 50 selected Mr. Goodbar, 93 Milk Chocolate, 45 Special Dark and 12 Krackel.
Are student preferences different?

Is local preference different?
Hypotheses
Test statistic
Decision rule
Results
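
One way to work through this outline is sketched below. It assumes the historical percentages are the pattern to test against and uses the 5% critical value for K - 1 = 3 df (7.815) from the table of critical values. The statistic is computed here rather than quoted from the slides, and the names are illustrative.

```python
historical_pct = {"Mr. Goodbar": 30, "Milk Chocolate": 50, "Special Dark": 15, "Krackel": 5}
observed = {"Mr. Goodbar": 50, "Milk Chocolate": 93, "Special Dark": 45, "Krackel": 12}

n = sum(observed.values())   # 200 students

# Expected counts if students follow the historical pattern.
expected = {bar: pct * n / 100 for bar, pct in historical_pct.items()}

contributions = {bar: (observed[bar] - expected[bar]) ** 2 / expected[bar]
                 for bar in observed}
chi_square = sum(contributions.values())
df = len(observed) - 1
critical = 7.815   # upper 5% point with 3 df

print(expected)
print(round(chi_square, 2), df, chi_square > critical)
print(max(contributions, key=contributions.get))   # category that departs most from the pattern
```

Under these assumptions the statistic exceeds the critical value, and Special Dark contributes the most to it.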