lecture-2

Course: SOC 740, Fall 2009
School: St. Mary MD
Examining and Transforming Data
Sociology 740
John Fox

Lecture Notes 2. Examining and Transforming Data
Copyright 2009 by John Fox

1. Goals

- To motivate the inspection and exploration of data as a necessary preliminary to statistical modeling.
- To review (quickly) familiar graphical displays (histograms, boxplots, scatterplots).
- To introduce displays that may not be familiar (nonparametric density estimates, quantile-comparison plots, scatterplot matrices, jittered scatterplots).
- To introduce the `family' of power transformations.
- To show how power transformations can be used to correct common problems in data analysis, including skewness, nonlinearity, and nonconstant spread.
- To introduce the logit transformation for proportions (time permitting).

2. A Preliminary Example

- Careful data analysis begins with inspection of the data, and techniques for examining and transforming data find direct application to the analysis of data using linear models.
- The data for the four plots in Figure 1, given in the table below, were cleverly contrived by Anscombe (1973) so that the least-squares regression line and all other common regression `outputs' are identical in the four datasets.

   X_{a,b,c}   Y_a      Y_b     Y_c      X_d    Y_d
   10          8.04     9.14    7.46     8      6.58
   8           6.95     8.14    6.77     8      5.76
   13          7.58     8.74    12.74    8      7.71
   9           8.81     8.77    7.11     8      8.84
   11          8.33     9.26    7.81     8      8.47
   14          9.96     8.10    8.84     8      7.04
   6           7.24     6.13    6.08     8      5.25
   4           4.26     3.10    5.39     19     12.50
   12          10.84    9.13    8.15     8      5.56
   7           4.82     7.26    6.42     8      7.91
   5           5.68     4.74    5.73     8      6.89

- It is clear, however, that each graph tells a different story about the data:
- In (a), the linear regression line is a reasonable descriptive summary of the tendency of $Y$ to increase with $X$.
- In (b), the linear regression fails to capture the clearly curvilinear relationship between the two variables; we would do much better to fit a quadratic function here, $Y = a + bX + cX^2 + E$.
- In (c), there is a perfect linear relationship between $Y$ and $X$ for all but one outlying data point. The least-squares line is pulled strongly towards the outlier, distorting the relationship between the two variables for the rest of the data. When we encounter an outlier in real data we should look for an explanation.
- Finally, in (d), the values of $X$ are invariant (all are equal to 8), with the exception of one point (which has an $X$-value of 19); the least-squares line would be undefined but for this point. We are usually uncomfortable having the result of a data analysis depend so centrally on a single influential observation. Only in this fourth dataset is the problem immediately apparent from inspecting the numbers.

[Figure 1. Anscombe's "quartet": Each data set has the same linear least-squares regression of $Y$ on $X$. Panels (a) to (d); horizontal axis X, vertical axis Y.]

3. Univariate Displays

3.1 Histograms

- Figure 2 shows two histograms for the distribution of infant mortality rate per 1000 live births for 193 nations of the world (using 1998 data from the UN).
- The range of infant mortality is dissected into equal-width class intervals (called `bins'); the number of observations falling into each interval is counted; and these frequency counts are displayed in a bar graph.
- Both histograms use bins of width 10; they differ in that the bins in (a) start at 0, while those in (b) start at -5.
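The binning procedure just described can be sketched in a few lines of Python. This is only an illustration: the `rates` values are made up (they are not the UN data), the helper name `histogram_counts` is hypothetical, and the sketch assumes the chosen origin is no larger than the smallest observation.

```python
import numpy as np

def histogram_counts(values, width, origin):
    """Count observations in equal-width bins that start at `origin`."""
    values = np.asarray(values, dtype=float)
    # Enough bins to cover the largest observation.
    n_bins = int(np.floor((values.max() - origin) / width)) + 1
    edges = origin + width * np.arange(n_bins + 1)
    counts, _ = np.histogram(values, bins=edges)
    return edges, counts

# Made-up, positively skewed values standing in for infant-mortality rates.
rates = np.array([3, 4, 7, 9, 12, 14, 18, 25, 33, 47, 62, 88, 110, 154])

# Same bin width (10), two different origins (0 and -5).
edges_a, counts_a = histogram_counts(rates, width=10, origin=0)
edges_b, counts_b = histogram_counts(rates, width=10, origin=-5)

print("origin  0:", counts_a)
print("origin -5:", counts_b)
```

Both calls count the same observations, but the frequencies in the first few bins differ, which is the dependence on the bin origin that the two panels illustrate.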
- The two histograms in Figure 2 are more similar than different but they do give slightly different impressions of the shape of the distribution.

[Figure 2. Two histograms with the same bin width but different origins for infant mortality in the United Nations data. Panels (a) and (b); horizontal axis: Infant Mortality Rate (per 1000); vertical axis: Frequency.]

- Histograms are very useful graphs, but they suffer from several problems:
- The visual impression of the data conveyed by a histogram can depend upon the arbitrary origin of the bin system.
- Because the bin system dissects the range of the variable into class intervals, the histogram is discontinuous (i.e., rough) even if, as in the case of income, the variable is continuous.
- The form of the histogram depends upon the arbitrary width of the bins.
- If we use bins that are narrow enough to capture detail where data are plentiful -- usually near the center of the distribution -- then they may be too narrow to avoid `noise' where data are sparse -- usually in the tails of the distribution.

3.2 Density Estimation

- Nonparametric density estimation addresses the deficiencies of traditional histograms by averaging and smoothing.
- The kernel density estimator continuously moves a window of fixed width across the data, calculating a locally weighted average of the number of observations falling in the window -- a kind of running proportion.
- The smoothed plot is scaled so that it encloses an area of one.
- Selecting the window width for the kernel estimator is primarily a matter of trial and error -- we want a value small enough to reveal detail but large enough to suppress random noise.
- The adaptive kernel estimator is similar, except that the window width is adjusted so that the window is narrower where data are plentiful and wider where data are sparse.
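A minimal sketch of a fixed-width kernel estimator follows. It assumes a Gaussian (normal) weight function, which these notes do not specify, and it uses made-up values rather than the UN data; `kernel_density` and `bandwidth` are illustrative names, with `bandwidth` playing the role of the window width.

```python
import numpy as np

def kernel_density(values, grid, bandwidth):
    """Fixed-width kernel density estimate evaluated at each point of `grid`.

    Assumes a Gaussian kernel; `bandwidth` is the window width.
    """
    values = np.asarray(values, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every observation.
    z = (grid[:, None] - values[None, :]) / bandwidth
    weights = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    # Average the weights and rescale so that the curve encloses area one.
    return weights.sum(axis=1) / (len(values) * bandwidth)

# Made-up, positively skewed values standing in for infant-mortality rates.
rates = np.array([3, 4, 7, 9, 12, 14, 18, 25, 33, 47, 62, 88, 110, 154])

grid = np.linspace(-100.0, 300.0, 2001)
density = kernel_density(rates, grid, bandwidth=10.0)

# Numerical check that the estimate integrates to (approximately) one.
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 4))
```

Rerunning the sketch with a smaller or larger `bandwidth` shows the trade-off between revealing detail and suppressing noise; an adaptive estimator would let the width vary from point to point instead of fixing it.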
- Details of these estimators are in the text.
- An example is shown in Figure 3.

[Figure 3. Kernel (broken line) and adaptive-kernel (solid line) density estimators for the distribution of infant mortality. A "one-dimensional scatterplot" (or "rug plot") of the observations is shown at the bottom. Horizontal axis: Infant Mortality Rate (per 1000); vertical axis: Estimated Density.]

3.3 Quantile-Comparison Plots

- Quantile-comparison plots are useful for comparing an empirical sample distribution with a theoretical distribution, such as the normal distribution. A strength of the display is that it does not require the use of arbitrary bins or windows.
- Let $P(x)$ represent the theoretical cumulative distribution function (CDF) to which we wish to compare the data; that is, $\Pr(X \le x) = P(x)$.
- A simple (but flawed) procedure is to calculate the empirical cumulative distribution function (ECDF) for the observed data, which is simply the proportion of data below each $x$:

   $$\hat{P}(x) = \frac{\#_{i=1}^{n}(X_i \le x)}{n}$$

- As illustrated in Figure 4, however, the ECDF is a `stair-step' function, while the CDF is typically smooth, making the comparison difficult.

[Figure 4. (a) Typical ECDF; (b) typical CDF.]

- The quantile-comparison plot avoids this problem by never constructing the ECDF explicitly:

1. Order the data values from smallest to largest, denoted $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$. The $X_{(i)}$ are called the order statistics of the sample.
2. By convention, the cumulative proportion of the data `below' $X_{(i)}$ is given by $P_i = \frac{i - 1/2}{n}$ (or a similar formula).
3. Use the inverse of the CDF (the quantile function) to find the value $z_i$ corresponding to the cumulative probability $P_i$; that is, $z_i = P^{-1}\left(\frac{i - 1/2}{n}\right)$.
4. Plot the $z_i$ as horizontal coordinates against the $X_{(i)}$ as vertical coordinates.
   - If $X$ is sampled from the distribution $P$, then $X_{(i)} \approx z_i$.
   - If the distributions are identical except for location, then $X_{(i)} \approx \mu + z_i$.
   - If the distributions are identical except for scale, then $X_{(i)} \approx \sigma z_i$.
   - If the distributions differ both in location and scale but have the same shape, then $X_{(i)} \approx \mu + \sigma z_i$.
5. It is often helpful to place a comparison line on the plot to facilitate the perception of departures from linearity.
6. We expect some departure from linearity because of sampling variation; it therefore assists interpretation to display the expected degree of sampling error in the plot. The standard error of the order statistic $X_{(i)}$ is

   $$SE(X_{(i)}) = \frac{\hat{\sigma}}{p(z_i)} \sqrt{\frac{P_i (1 - P_i)}{n}}$$

   where $p(z)$ is the probability-density function corresponding to the CDF $P(z)$.
   - The values along the fitted line are given by $\hat{X}_{(i)} = \hat{\mu} + \hat{\sigma} z_i$.
   - An approximate 95 percent confidence `envelope' around the fitted line is therefore $\hat{X}_{(i)} \pm 2 \times SE(X_{(i)})$.

Figure 5 displays normal quantile-comparison plots for several illustrative distributions:
(a) A sample of $n = 100$ observations from a normal distribution with mean $\mu = 50$ and standard deviation $\sigma = 10$.
(b) A sample of $n = 100$ observations from the highly positively skewed $\chi^2$ distribution with two degrees of freedom.
(c) A sample of $n = 100$ observations from the very-heavy-tailed $t$ distribution with two degrees of freedom.

A normal quantile-comparison plot for the infant-mortality data appears in Figure 6. The positive skew of the distribution is readily apparent.
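The numbered steps for a normal quantile-comparison plot can be sketched as follows. The notes do not say how $\hat{\mu}$ and $\hat{\sigma}$ are obtained; the sketch assumes the sample median and the inter-quartile range divided by 1.349, which is one common robust choice. The function name and the simulated sample are illustrative only.

```python
from statistics import NormalDist

import numpy as np

def normal_quantile_comparison(sample):
    """Coordinates, fitted line, and approximate 95% envelope for a normal Q-Q plot."""
    x = np.sort(np.asarray(sample, dtype=float))         # step 1: order statistics X_(i)
    n = len(x)
    p = (np.arange(1, n + 1) - 0.5) / n                  # step 2: P_i = (i - 1/2)/n
    std_normal = NormalDist()
    z = np.array([std_normal.inv_cdf(pi) for pi in p])   # step 3: z_i = P^{-1}(P_i)

    # Assumed estimates of location and scale (not specified in the notes).
    mu_hat = np.median(x)
    q1, q3 = np.percentile(x, [25, 75])
    sigma_hat = (q3 - q1) / 1.349

    fitted = mu_hat + sigma_hat * z                      # step 5: comparison line
    density = np.array([std_normal.pdf(zi) for zi in z])
    se = (sigma_hat / density) * np.sqrt(p * (1 - p) / n)  # step 6: SE of X_(i)
    return z, x, fitted, fitted - 2 * se, fitted + 2 * se

# Simulated sample resembling panel (a): n = 100 from N(50, 10^2).
rng = np.random.default_rng(0)
z, x, fitted, lower, upper = normal_quantile_comparison(rng.normal(50, 10, 100))

inside = np.mean((x >= lower) & (x <= upper))
print(f"share of points inside the envelope: {inside:.2f}")
```

Plotting `x` against `z` (step 4), together with `fitted`, `lower`, and `upper`, gives the display; the envelope is widest in the tails, where the normal density $p(z_i)$ is small.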
The multi-modal character of the data, however, is not easily discerned in the normal quantile-comparison plot of Figure 6.

- Quantile-comparison plots highlight the tails of distributions. This is important, because the behavior of the tails is often problematic for standard estimation methods like least-squares, but it is useful to supplement quantile-comparison plots with other displays.
- Quantile-comparison plots are usually used not to plot a variable directly but for derived quantities, such as residuals from a regression model.

[Figure 5. Normal quantile-comparison plots for samples of size $n = 100$ drawn from three distributions. Horizontal axes: Normal Quantiles.]

[Figure 6. Normal quantile-comparison plot for infant mortality. Horizontal axis: Normal Quantiles; vertical axis: Infant Mortality Rate (per 1000).]

3.4 Boxplots

- Boxplots (due to John Tukey) present summary information on center, spread, skewness, and outliers.
- An illustrative boxplot, for the infant-mortality data, appears in Figure 7.
- This plot is constructed according to these conventions:

1. A scale is laid off to accommodate the extremes of the data.
2. The central box is drawn between the hinges, which are simply defined quartiles, and therefore encompasses the middle half of the data. The line in the central box represents the median.
3. The following rule is used to identify outliers, which are shown individually in the boxplot:
   - The hinge-spread (or inter-quartile range) is the difference between the hinges: $H\text{-spread} = H_U - H_L$
   - The `fences' are located 1.5 hinge-spreads beyond the hinges:
     $F_L = H_L - 1.5 \times H\text{-spread}$
     $F_U = H_U + 1.5 \times H\text{-spread}$
   - Observations beyond the fences are identified as outliers. The fences themselves are not shown in the display. (Points more than $3 \times H$-spread beyond the hinges are extreme outliers.)
4. The `whisker' growing from each end of the central box extends either to the extreme observation on its side of the distribution (as at the low end of the infant-mortality data) or to the most extreme non-outlying observation, called the `adjacent value' (as at the high end of the infant-mortality distribution).

[Figure 7. Boxplot of infant mortality. Vertical axis: Infant Mortality Rate (per 1000); labeled outliers: Sierra Leone, Liberia, Afghanistan.]

- The boxplot of the infant-mortality distribution clearly reveals the skewness of the distribution:
  - The lower whisker is shorter than the upper whisker; and there are outlying observations at the upper end of the distribution, but not at the lower end.
  - The median is closer to the lower hinge than to the upper hinge.
- The apparent multi-modality of the infant-mortality data is not represented in the boxplot.
- Boxplots are most useful as adjuncts to other displays (e.g., in the margins of a scatterplot) or for comparing several distributions.

4. Plotting Bivariate Data

- The scatterplot -- a direct geometric representation of observations on two quantitative variables (generically, $Y$ and $X$) -- is the most useful of all statistical graphs.
- Scatterplots are familiar, so I will limit this presentation to a few points (see Figure 8):
  - It is convenient to work in a computing environment that permits the interactive identification of observations in a scatterplot.
  - Since relationships between variables in the social sciences are often weak, scatterplots can be dominated visually by `noise.' It often helps to enhance the plot with a non-parametric regression of $Y$ on $X$.
  - Scatterplots in which one or both variables are highly skewed are difficult to examine because the bulk of the data congregate in a small part of the display. It often helps to `correct' substantial skews prior to examining the relationship between $Y$ and $X$.
  - Scatterplots in which the variables are discrete are difficult to examine.

[Figure 8. Scatterplot of infant mortality by GDP per capita, for the UN data. The solid line is for a lowess smooth with a span of .5. Horizontal axis: GDP Per Capita (US dollars); vertical axis: Infant Mortality Rate (per 1000); labeled points: Sierra Leone, Afghanistan, Iraq, Gabon, Libya, French Guiana.]

- An extreme instance of this phenomenon is shown in Figure 9, which plots scores on a ten-item vocabulary test included in NORC's General Social Survey against years of education.
  - One solution -- particularly useful when only $X$ is discrete -- is to focus on the conditional distribution of $Y$ for each value of $X$. Boxplots, for example, can be employed to represent the conditional distributions.
  - Another solution is to separate overlapping points by adding a small random quantity to the discrete scores. For example, I have added a uniform random variable on the interval $[-0.4, +0.4]$ to each of vocabulary and education.
- As mentioned, when the explanatory variable is discrete, parallel boxplots can be used to display the conditional distributions of $Y$.
- One common case occurs when the explanatory variable is a qualitative/categorical variable.
- An example is shown in Figure 10, using data collected by Michael Ornstein (1976) on interlocking directorates among the 248 largest Canadian firms.
  - The response variable in this graph is the number of interlocking directorships and executive positions maintained by each firm with others in the group of 248.
  - The explanatory variable is the nation in which the corporation is controlled, coded as Canada, United Kingdom, United States, and other foreign.
  - It is relatively difficult to discern detail in this display: first, because the conditional distributions of interlocks are positively skewed; and, second, because there is an association between level and spread.

[Figure 9. Vocabulary score by education: (a) original scatterplot; (b) jittered, with the least-squares lines, lowess line (for span = .2), and conditional means.]

[Figure 10. Parallel boxplots of number of interlocks by nation of control, for Ornstein's interlocking-directorate data. Horizontal axis: Nation of Control (Other, Canada, UK, US).]

5. Plotting Multivariate Data

- Because paper and computer screens are two-dimensional, graphical display of multivariate data is intrinsically difficult.
- Multivariate displays for quantitative data often project the higher-dimensional `point cloud' of the data onto a two-dimensional space.
- The essential trick of effective multidimensional display is to select projections that reveal important characteristics of the data.
- In certain circumstances, projections can be selected on the basis of a statistical model fit to the data or on the basis of explicitly stated criteria.
- A simple approach to multivariate data, which does not require a statistical model, is to examine bivariate scatterplots for all pairs of variables. Arraying these plots in a `scatterplot matrix' produces a graphical analog to the correlation matrix.
- Figure 11 shows an illustrative scatterplot matrix, for data from Duncan (1961) on the prestige, education, and income levels of 45 U.S. occupations.

[Figure 11. Scatterplot matrix for prestige, income, and education in Duncan's occupational prestige data. Labeled points: minister, conductor, RR engineer.]

- It is important to understand an essential limitation of the scatterplot matrix as a device for analyzing multivariate data:
  - By projecting the multidimensional point cloud onto pairs of axes, the plot focuses on the marginal relationships between the corresponding pairs of variables.
  - The object of data analysis for several variables is typically to investigate partial relationships, not marginal associations.
  - $Y$ can be related marginally to a particular $X$ even when there is no partial relationship between the two variables controlling for other $X$'s.
  - It is also possible for there to be a partial association between $Y$ and an $X$ but no marginal association.
  - Furthermore, if the $X$'s themselves are nonlinearly related, then the marginal relationship between $Y$ and a specific $X$ can be nonlinear even when their partial relationship is linear.
- Despite this intrinsic limitation, scatterplot matrices often uncover interesting features of the data, and this is indeed the case here, where the display reveals three unusual observations: Ministers, railroad conductors, and railroad engineers.
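The distinction between marginal and partial relationships can be illustrated with simulated data. In the sketch below, which is a contrived example and not drawn from Duncan's data, `y` depends only on `x2`, but `x1` is correlated with `x2`; the variable names, coefficients, and seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(740)
n = 5000

x2 = rng.normal(size=n)
x1 = x2 + rng.normal(size=n)        # x1 is correlated with x2
y = 2.0 * x2 + rng.normal(size=n)   # y depends on x2 only

# Marginal relationship: least-squares slope of y on x1 alone.
marginal_slope = np.polyfit(x1, y, 1)[0]

# Partial relationship: coefficient of x1 when x2 is also in the regression.
design = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
partial_slope_x1 = coef[1]

print(f"marginal slope of y on x1: {marginal_slope:.3f}")
print(f"partial slope of y on x1:  {partial_slope_x1:.3f}")
```

A scatterplot of `y` against `x1` would show a clear upward trend, even though the coefficient of `x1` is essentially zero once `x2` is controlled.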
- Information about a categorical third variable may be entered on a bivariate scatterplot by coding the plotting symbols. The most effective codes use different colors to represent categories, but degrees of fill, distinguishable shapes, and distinguishable letters can also be effective. (See, e.g., Figure 12, which uses Davis's data on weight and reported weight.)

[Figure 12. Measured by reported weight for 183 men (M) and women (F) engaged in regular exercise. Horizontal axis: Reported Weight (kg).]

- Another useful multivariate display, directly applicable only to three variables at a time, is the three-dimensional scatterplot.
  - This display is an illusion produced by modern statistical software, since the graph really represents a projection of a three-dimensional scatterplot onto a two-dimensional computer screen.
  - Nevertheless, motion (e.g., rotation) and the ability to interact with the display -- sometimes combined with the effective use of perspective, color, depth-cueing, fitted surfaces, and other visual devices -- can produce a vivid impression of directly examining a three-dimensional space.

6. Transformations: The Family of Powers and Roots

- `Classical' statistical models make strong assumptions about the structure of data, assumptions which often fail to hold in practice.
  - One solution is to abandon classical methods.
  - Another solution is to transform the data so that they conform more closely to the assumptions.
  - As well, transformations can often assist in the examination of data in the absence of a statistical model.
- A particularly useful group of transformations is the `family' of powers and roots: $X \to X^p$
  - If $p$ is negative, then the transformation is an inverse power: $X^{-1} = 1/X$, and $X^{-2} = 1/X^2$.
  - If $p$ is a fraction, then the transformation represents a root: $X^{1/3} = \sqrt[3]{X}$ and $X^{-1/2} = 1/\sqrt{X}$.
- It is sometimes convenient to define the family of power transformations in a slightly more complex manner (called the Box-Cox family):

   $$X \to X^{(p)} \equiv \frac{X^p - 1}{p}$$

- Since $X^{(p)}$ is a linear function of $X^p$, the two transformations have the same essential effect on the data, but, as is apparent in Figure 13, $X^{(p)}$ reveals the essential unity of the family of powers and roots:
  - Dividing by $p$ preserves the direction of $X$, which otherwise would be reversed when $p$ is negative:

   X     X^{-1}    X^{(-1)} = (X^{-1} - 1)/(-1)
   1     1         0
   2     1/2       1/2
   3     1/3       2/3
   4     1/4       3/4

  - The transformations $X^{(p)}$ are `matched' above $X = 1$ both in level and slope.
  - The power transformation $X^0$ is useless, but the very useful log transformation is a kind of `zeroth' power:

   $$\lim_{p \to 0} \frac{X^p - 1}{p} = \log_e X$$

    where $e \approx 2.718$ is the base of the natural logarithms. Thus, we will take $X^{(0)} \equiv \log(X)$.
  - It is generally more convenient to use logs to the base 10 or base 2, which are more easily interpreted than logs to the base $e$. Changing bases is equivalent to multiplying by a constant.

[Figure 13. The Box-Cox family of modified power transformations, $X^{(p)} = (X^p - 1)/p$, for values of $p = -1, 0, 1, 2, 3$. When $p = 0$, $X^{(p)} = \log_e X$.]

- Review of logs: logs are exponents: $\log_b x = y$ ("the log of $x$ to the base $b$ is $y$") means that $b^y = x$.
- Descending the `ladder' of powers and roots from $p = 1$ (i.e., no transformation) towards $X^{(-1)}$ compresses the large values of $X$ and spreads out the small ones.
- Ascending the ladder of powers and roots towards $X^{(2)}$ has the opposite effect.
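The Box-Cox family and the ladder of powers can be explored numerically with a short sketch. The function name `box_cox` is illustrative, and rejecting non-positive inputs is a simplifying assumption of this sketch rather than part of the definition above.

```python
import math

def box_cox(x, p):
    """Modified power transformation X^(p) = (x**p - 1)/p, with log(x) at p = 0."""
    if x <= 0:
        # Simplifying assumption: only positive data are handled here.
        raise ValueError("this sketch handles positive values only")
    if p == 0:
        return math.log(x)   # natural log, the limiting `zeroth' power
    return (x ** p - 1.0) / p

values = [1, 2, 3, 4]
for p in (-1, 0, 1, 2, 3):
    transformed = [round(box_cox(x, p), 3) for x in values]
    print(f"p = {p:>2}: {transformed}")
```

Every member of the family maps $X = 1$ to 0, and for $p = -1$ the transformed values still increase with $X$ while the gaps between them shrink, which is the compression of large values described for the lower rungs of the ladder.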
- Some examples of logs, each paired with the equivalent statement in exponents:

   $\log_{10} 100 = 2$          $10^2 = 100$
   $\log_{10} 0.01 = -2$        $10^{-2} = 1/10^2 = 0.01$
   $\log_{10} 10 = 1$           $10^1 = 10$
   $\log_2 8 = 3$               $2^3 = 8$
   $\log_2 \frac{1}{8} = -3$    $2^{-3} = 1/2^3 = 1/8$
   $\log_b 1 = 0$               $b^0 = 1$

- The effect of moving along the ladder of powers and roots on the spacing of the values $X = 1, 2, 3, 4$ (the differences between adjacent values are shown after each brace):

   -1/X          log2 X        X          X^2         X^3
   -1            0             1          1           1
      } 1/2         } 1           } 1        } 3         } 7
   -1/2          1             2          4           8
      } 1/6         } 0.59        } 1        } 5         } 19
   -1/3          1.59          3          9           27
      } 1/12        } 0.41        } 1        } 7         } 37
   -1/4          2             4          16          64

- Power transformations are sensible only when all of the values of $X$ are positive.
  - First of all, some of the transformations, such as log and square root, are undefined for negative or zero values.
  - Second, the power transformations are not monotone when there are both positive and negative values in the data:

   X      X^2
   -2     4
   -1     1
   0      0
   1      1
   2      4

  - We can add a positive constant (called a `start') to each data value to make all of the values positive: $X \to (X + s)^p$:

   X      (X + 3)^2
   -2     1
   -1     4
   0      9
   1      16
   2      25

- Power transformations are effective only when the ratio of the biggest data values to the smallest ones is sufficiently large; if this ratio is close to 1, then power transformations are nearly linear; in the following example, $1995/1991 = 1.002 \approx 1$:

   X             log10 X
   1991          3.2991
      } 1           } 0.0002
   1992          3.2993
      } 1           } 0.0002
   1993          3.2995
      } 1           } 0.0002
   1994          3.2997
      } 1           } 0.0002
   1995          3.2999

  - Using a negative start produces the desired effect:

   X             log10(X - 1990)
   1991          0
      } 1           } 0.301
   1992          0.301
      } 1           } 0.176
   1993          0.477
      } 1           } 0.125
   1994          0.602
      } 1           } 0.097
   1995          0.699

- Using reasonable starts, if necessary, an adequate power transformation can usually be found in the range $-2 \le p \le 3$.

7. Transforming Skewness

- Power transformations can make a skewed distribution more symmetric. But why should we bother?
- Highly skewed distributions are difficult to examine.
- Apparently outlying values in the direction of the skew are brought in towards the main body of the data.
- Unusual values in the direction opposite to the skew can be hidden prior to transforming the data.
- Statistical methods such as least-squares regression summarize distributions using means. The mean of a skewed distribution is not a good summary of its center.
- How a power transformation can eliminate a positive skew:

   X             log10 X
   1             0
      } 9           } 1
   10            1
      } 90          } 1
   100           2
      } 900         } 1
   1000          3

  - Descending the ladder of powers to $\log X$ makes the distribution more symmetric by pulling in the right tail.
  - Ascending the ladder of powers (towards $X^2$ and $X^3$) can `correct' a negative skew.
- For infant mortality in the UN data, the log transformation works well, as shown in Figure 14.

[Figure 14. Boxplots for various transformations down the ladder of powers and roots for infant mortality in the UN data.]

- If we have a choice between transformations that perform roughly equally well, we may prefer one transformation to another because of interpretability:
  - The log transformation has a convenient multiplicative interpretation (e.g., adding 1 to $\log_2 X$ doubles $X$; adding 1 to $\log_{10} X$ multiplies $X$ by 10).
  - In certain contexts, other transformations may have specific substantive meanings:
    - The inverse of time required to travel a fixed distance (e.g., hours for 1 km) is speed (km per hour).
    - The inverse of response latency (e.g., in a psychophysical experiment, in milliseconds) is response frequency (responses per millisecond, i.e., per 1/1000 second).
    - The square root of a measure of area (say, in $m^2$) is a linear measure of size (in meters).
    - The cube of a linear measure (say in cm) can be interpreted as a volume ($cm^3$).
  - One can also label an axis with the original units, as in Figure 15.

[Figure 15. Adaptive-kernel density estimate for log-transformed infant mortality. Lower axis: $\log_{10}$ Infant Mortality Rate; upper axis: Infant Mortality Rate (per 1000); vertical axis: Estimated Density.]

8. Transforming Nonlinearity

- Power transformations can also be used to make many nonlinear relationships more nearly linear. Again, why bother?
  - Linear relationships -- expressible in the form $\hat{Y} = a + bX$ -- are particularly simple.
  - When there are several explanatory variables, the alternative of nonparametric regression may not be feasible or may be difficult to visualize.
  - There is a simple and elegant statistical theory for linear models.
  - There are certain technical advantages to having linear relationships among the explanatory variables in a regression analysis.
- The following simple example suggests how a power transformation can serve to straighten a nonlinear relationship; here, $Y = \frac{1}{5} X^2$ (with no residual):

   X     Y
   1     0.2
   2     0.8
   3     1.8
   4     3.2
   5     5.0

  - These `data' are graphed in part (a) of Figure 16.
  - We could replace $Y$ by $Y' = \sqrt{Y}$, in which case $Y' = \sqrt{\frac{1}{5}}\, X$ [see (b)].
  - We could replace $X$ by $X' = X^2$, in which case $Y = \frac{1}{5} X'$ [see (c)].

[Figure 16. Transforming a nonlinear relationship (a) to linearity, (b) or (c). Panel (a): $Y$ by $X$; panel (b): $Y'$ by $X$; panel (c): $Y$ by $X' = X^2$.]

- A power transformation works here because the relationship between $Y$ and $X$ is both monotone and simple. In Figure 17:
  - the curve in (a) is simple and monotone;
  - in (b) monotone, but not simple;
  - in (c) simple but not monotone.
  - In (c), we could fit a quadratic model, $\hat{Y} = a + b_1 X + b_2 X^2$.

[Figure 17. (a) A simple monotone relationship. (b) A monotone relationship that is not simple. (c) A simple nonmonotone relationship.]

- Figure 18 introduces Mosteller and Tukey's `bulging rule' for selecting a transformation.
  - For example, if the `bulge' points down and to the right, we need to transform $Y$ down the ladder of powers or $X$ up (or both).

[Figure 18. Mosteller and Tukey's bulging rule for selecting linearizing transformations. Direction labels: $Y$ up: $Y^2$, $Y^3$; $Y$ down: $\sqrt{Y}$, $\log Y$; $X$ up: $X^2$, $X^3$; $X$ down: $\log X$, $\sqrt{X}$.]

- Recall the relationship between prestige and income for 102 Canadian occupations, shown again in Figure 19.
  - The relationship between prestige and income is clearly monotone and nonlinear.
  - Since the bulge points up and to the left, we can try transforming prestige up the ladder of powers or income down.
  - The cube-root transformation of income works reasonably well.

[Figure 19. Transforming the relationship between prestige and income to (near) linearity: (left) original scatterplot; (right) with income transformed. Left panel: Prestige by Average Income (dollars); right panel: Prestige by Income$^{1/3}$.]

- A more extreme example appears in Figure 20, which shows the relationship between the infant-mortality rate and GDP per capita in the UN data.
  - The skewness of infant mortality and income makes the scatterplot difficult to interpret; the nonparametric regression reveals a nonlinear but monotone relationship.
  - The bulging rule suggests that infant mortality or income should be transformed down the ladder of powers and roots.
[Figure 20. Transforming the relationship between infant mortality and GDP per capita. Left panel: Infant Mortality Rate (per 1000) by GDP Per Capita (US dollars); right panel: $\log_{10}$ Infant Mortality Rate by $\log_{10}$ GDP Per Capita. Labeled points include Sierra Leone, Liberia, Afghanistan, Iraq, Sudan, Gabon, Libya, Bosnia, French Guiana, Sao Tome, and Tonga.]

  - Transforming both variables by taking logs makes the relationship nearly linear; the least-squares fit is:

   $$\widehat{\log_{10} \text{Infant mortality}} = 3.06 - 0.493 \times \log_{10} \text{GDP}$$

  - Because both variables are expressed on log scales to the same base, the slope of this relationship has a simple interpretation: A one-percent increase in per-capita income is associated on average with an approximate half-percent decline in the infant-mortality rate. Economists call this type of number an `elasticity.'

9. Transforming Non-Constant Spread

- When a variable has very different degrees of variation in different groups, it becomes difficult to examine the data and to compare differences in level across the groups.
- Recall Ornstein's Canadian interlocking-directorate data, examining the relationship between number of interlocks and nation of control.
- Differences in spread are often systematically related to differences in level.
- Using the median and hinge-spread (inter-quartile range) as indices of level and spread, respectively, the following table shows that there is indeed an association, if an imperfect one, between spread and level for Ornstein's data:

   Nation of Control   Lower Hinge   Median   Upper Hinge   Hinge Spread
   Other               3             14.5     23            20
   Canada              5             12.0     29            24
   United Ki...
