# Stat133Lecture4

Announcements:

Regina Wu, Hana Ueda, and John Jimenez will be helping to answer your questions on bSpace and in lab. Homework 1 is due next Wednesday night. There have been some problems on bSpace. Please make sure you get a verification email when you upload your assignment.

Today's topics:

• Review of data structures and how to index them
• The apply mechanism, revisited
• Reading and writing data from within R
• Keeping track of your commands
• High-level graphics

Tuesday, September 9, 2008

The types of data structures and how to index them:

Vectors: [index]

    > x[1:10]; x[-3]; x[x>3]

Matrices: [rowindex, colindex]

    > m[1,2]; m[1:2, ]; m[ ,"a"]

Arrays: [index1, index2, ..., indexK]

    > a[1, 3, ]; a[v==TRUE,,]

Data frames: [rowindex, colindex], $name

    > cars$Cars6; cars[,3:4]; cars[cars$Junction == "7 to 8",]

Lists: $name, [index], [[index]]

    > ingredients$meat; ingredients[1:2]; ingredients[[1]]

Note: both $ and [[]] can index only one element.

Last time we started talking about the apply function. Let's review how this works for matrices.

    > args(apply)
    function (X, MARGIN, FUN, ...)
    NULL

• X: the matrix
• MARGIN: which dimension to operate on (1 for rows, 2 for columns)
• FUN: the function
• ...: any additional arguments to FUN

    > m <- matrix(1:4, nrow = 2)
    > m
         [,1] [,2]
    [1,]    1    3
    [2,]    2    4
    > apply(m, 2, paste, collapse = "")
    [1] "12" "34"

The lapply and sapply functions both apply a specified function FUN to each element of a list. The former returns a list object and the latter returns a vector when possible. Again, both allow passing of additional arguments to FUN through the "..." argument.
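As a small illustration of passing extra arguments through "...", here is a sketch using a made-up list, scores, that is not part of the lecture data. The na.rm = TRUE argument is handed on to mean() unchanged.

```r
# Hypothetical list with one missing value, for illustration only
scores <- list(a = c(1, 2, 3), b = c(4, NA, 6))

# Without na.rm, the NA propagates into the mean of b
plain_means <- sapply(scores, mean)

# na.rm = TRUE is passed through "..." to mean()
clean_means <- sapply(scores, mean, na.rm = TRUE)

# lapply takes the same extra argument but returns a list
clean_list <- lapply(scores, mean, na.rm = TRUE)

print(plain_means)
print(clean_means)
print(clean_list)
```

The same pattern applies to apply and tapply, which also forward unmatched arguments to FUN.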
    > random.draws <- list(x1 = rnorm(10), x2 = rnorm(100000))
    > lapply(random.draws, mean)
    $x1
    [1] 0.0827779

    $x2
    [1] 0.001470952

    > sapply(random.draws, mean)
             x1          x2
    0.082777901 0.001470952

The tapply function allows us to apply a function to different parts of a vector, where the parts are indexed by a factor or list of factors.

Single factor:

    > grp <- factor(rep(c("Control", "Treatment"), each = 4))
    > grp
    [1] Control   Control   Control   Control
    [5] Treatment Treatment Treatment Treatment
    Levels: Control Treatment
    >
    > effect <- rnorm(8)  # Make up some fake data
    > tapply(effect, INDEX = grp, FUN = mean)
       Control  Treatment
     0.2180109 -0.2433582

Multiple factors:

    > sex <- factor(rep(c("Female", "Male"), times = 4))
    > sex
    [1] Female Male   Female Male   Female Male   Female Male
    Levels: Female Male
    > tapply(effect, INDEX = list(grp, sex), FUN = mean)
                  Female        Male
    Control    0.3634973  0.07252456
    Treatment -0.2860360 -0.20068040

Many data sets are stored as tables in text files. The easiest way to read these into R is using either the read.table or read.csv function. As you can see in help(read.table), there are quite a few options that can be changed. Some of the important ones are

• file - name or URL
• header - are column names at the top of the file?
• sep - what divides elements of the table
• na.strings - symbol for missing values, like 9999
• skip - number of lines at the top of the file to ignore

read.csv is like read.table, but with different defaults for CSV (comma separated value) files.

By default (in R versions before 4.0.0), all strings are read in as factors. If a file doesn't contain column names, you can add them after the fact.
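To see several of these options working together, here is a minimal sketch that reads a small made-up table from a string rather than from a file. The data, the column names, and the object names raw, obs, and noheader are assumptions for illustration; the text argument stands in for a file name.

```r
# Hypothetical file contents, inlined as a string for illustration only
raw <- "station readings (made up)
id,temp,site
1,20.5,north
2,9999,south
3,18.2,north"

# skip = 1 ignores the title line; 9999 is read as a missing value;
# stringsAsFactors = TRUE is set explicitly because the default
# changed in R 4.0.0
obs <- read.csv(text = raw, skip = 1, header = TRUE,
                na.strings = "9999", stringsAsFactors = TRUE)
print(obs)

# Without a header row, R assigns V1, V2, ...; names can be set afterwards
noheader <- read.csv(text = "1,20.5\n2,18.2", header = FALSE)
names(noheader) <- c("id", "temp")
print(noheader)
```

Setting na.strings replaces the default missing-value marker rather than adding to it, so a file that uses both 9999 and NA would need na.strings = c("9999", "NA").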
Here's how I created the R objects for the assignment last week:

    > cars <- read.csv("~/Desktop/friday13thcars.csv",
    +                  header = FALSE)
    > cars[1:2,]
        V1   V2     V3     V4      V5
    1 1990 July 139246 138548  7 to 8
    2 1990 July 134012 132908 9 to 10
    > names(cars) <- c("Year", "Month", "Cars6",
    +                  "Cars13", "Junction")
    > cars[1:2,]
      Year Month  Cars6 Cars13 Junction
    1 1990  July 139246 138548   7 to 8
    2 1990  July 134012 132908  9 to 10

Earthquakes Example: Data from the California Geological Survey

    > CAquakes <- read.table(file = "http://www.consrv.ca.gov/cgs/rghm/quakes/Documents/ms49epicenters.txt",
    +                        header = TRUE)
    > dim(CAquakes)
    [1] 383   4
    > CAquakes[1:3,]
          Date Latitude Longitude   M
    1 18001011     36.8    -121.5 5.5
    2 18001122     32.9    -117.8 6.3
    3 18030000     34.2    -118.1 5.5
    > mode(CAquakes$Date)
    [1] "numeric"

How can we extract the years/months/days from the Date column?

    > datechar <- as.character(CAquakes$Date)
    > substring(datechar, 1, 4)[1:3]
    [1] "1800" "1800" "1803"
    > CAquakes$Year <- as.numeric(substring(datechar, 1, 4))
    > CAquakes$Month <- as.numeric(substring(datechar, 5, 6))
    > CAquakes$Day <- as.numeric(substring(datechar, 7, 8))
    > CAquakes[1:3,]
          Date Latitude Longitude   M Year Month Day
    1 18001011     36.8    -121.5 5.5 1800    10  11
    2 18001122     32.9    -117.8 6.3 1800    11  22
    3 18030000     34.2    -118.1 5.5 1803     0   0
    > CAquakes$Month[CAquakes$Month == 0] <- NA
    > CAquakes$Day[CAquakes$Day == 0] <- NA
    > summary(CAquakes$Month)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
      1.000   4.000   6.000   6.281   9.000  12.000   2.000
    > summary(CAquakes$Day)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
       1.00    9.00   18.00   16.64   24.00   31.00    3.00

To save your R commands, use a plain text editor. Here are two I like:

• R for Mac and Windows has a built-in text editor. Access commands related to it, such as New Document and Save, under the File menu. One nice feature is that it automatically prints the arguments for functions at the bottom of the window.
• The Emacs editor has a special package called ESS, for "Emacs Speaks Statistics," that makes working with .R files very easy. It's installed on all the 342 lab computers. It includes keyboard shortcuts to evaluate the code, rather than cutting and pasting. (See http://stat.ethz.ch/ESS/refcard.pdf.)

Whichever editor you choose, you can run all the commands in a particular file using source("myfile.R"). A few more notes:

• If you don't save your files as plain text, this won't work, since R cannot interpret any extra formatting commands. So I do NOT recommend you use Microsoft Word.
• If you're cutting and pasting from the R session window back into the text editor, be sure not to copy the prompt (> symbol) as well.
• If you want to keep your results in your .R file, put a # in front of each line to mark them as comments.

Graphics in R

Part 1: High-level graphics functions

We'll be working in this section with many of R's built-in data sets. To see a list of them, just type

    > data()
    Data sets in package 'datasets':

    AirPassengers           Monthly Airline Passenger Numbers 1949-1960
    BJsales                 Sales Data with Leading Indicator
    BJsales.lead (BJsales)  Sales Data with Leading Indicator
    BOD                     Biochemical Oxygen Demand
    CO2                     Carbon Dioxide uptake in grass plants
    ChickWeight             Weight versus age of chicks different diets
    . . . many more

1. Barplots

    > x <- 1:5; names(x) <- letters[1:5]
    > barplot(x)

[Figure: barplot of x with bars a to e, vertical axis from 0 to 5]

    > VADeaths
          Rural Male Rural Female Urban Male Urban Female
    50-54       11.7          8.7       15.4          8.4
    55-59       18.1         11.7       24.3         13.6
    60-64       26.9         20.3       37.0         19.3
    65-69       41.0         30.9       54.6         35.1
    70-74       66.0         54.3       71.1         50.0
    > barplot(VADeaths, legend = TRUE)

[Figure: stacked barplot of VADeaths, one bar per population group, legend showing the five age groups]

This stacked barplot makes it hard to read anything but the bottom category and the total.
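One way to see what the stacked plot does show clearly is to compute the bar totals directly. This sketch uses only the built-in VADeaths matrix; the names totals and shares are illustrative and not part of the lecture.

```r
# VADeaths ships with R's built-in datasets package

# Column sums are the full heights of the stacked bars
totals <- colSums(VADeaths)
print(totals)

# Within-column shares of each age group: the middle segments that are
# hard to compare by eye in the stacked version
shares <- prop.table(VADeaths, margin = 2)
print(round(shares, 2))
```

Comparing a middle row of shares across columns is the kind of reading the stacked plot makes difficult, since those segments do not start from a common baseline.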
Making a good plot in R is often a matter of iterative improvement.

    > barplot(VADeaths, beside = TRUE, legend = TRUE)

[Figure: grouped barplot of VADeaths, bars side by side within each population group, legend showing the five age groups]

    > barplot(VADeaths, beside = TRUE, legend = TRUE,
    +         ylab = "Deaths per 1000",
    +         main = "Death rates in Virginia, 1940")

[Figure: grouped barplot titled "Death rates in Virginia, 1940", vertical axis "Deaths per 1000"]

Saving your plots as graphics files

If you call a high-level plot command, R will automatically start a graphics device or window. To save the contents of the already open device to a file, use dev.print.

    > barplot(VADeaths, legend = TRUE)
    > dev.print(device = pdf, file = "mybar.pdf",
    +           height = 5, width = 6)      # Inches
    > dev.print(device = jpeg, file = "mybar.jpeg",
    +           height = 500, width = 600)  # Pixels

See help(device) for a list of other graphics formats.

To close the device (shut the window), type

    > dev.off()

Alternatively, you can open up the device with a given file name, run the commands, then use dev.off(). The device itself won't appear as a window. This is useful if you want to run your commands in BATCH mode.

    > pdf(file = "mybar.pdf", height = 6, width = 6)
    > barplot(VADeaths, legend = TRUE)
    > dev.off()

2. Pie charts

    > pie(c(1, 1, 2), labels = letters[1:3])

[Figure: pie chart with slices a, b, c]

Note that elements of the vector are normalized by their sum, so that the total gives 100% of the pie.

    > Titanic
    , , Age = Child, Survived = No

          Sex
    Class  Male Female
      1st     0      0
      2nd     0      0
      3rd    35     17
      Crew    0      0

    , , Age = Adult, Survived = No

          Sex
    Class  Male Female
      1st   118      4
      2nd   154     13
      3rd   387     89
      Crew  670      3

    . . .
(two more matrices not printed here, with survivors)

Did all groups have an equal survival rate?

    > apply(Titanic, 1, sum)  # Total passengers, each class
     1st  2nd  3rd Crew
     325  285  706  885
    > pie(apply(Titanic, 1, sum), main = "Total Passengers")
    > pie(apply(Titanic[,,,"Yes"], 1, sum),
    +     main = "Survivors")

[Figure: two pie charts, "Total Passengers" and "Survivors", each with slices 1st, 2nd, 3rd, Crew]

Studies of human perception show we are not very good at comparing areas, volumes, or angles.

• When making bar plots, start the axis at zero and keep all bars the same width, so that length and area are proportional.
• Try to avoid pie charts for anything requiring a precise comparison.

3. Histograms

    > precip[1:4]  # Average annual precipitation in cities
         Mobile      Juneau     Phoenix Little Rock
           67.0        54.7         7.0        48.5
    > hist(precip)

[Figure: "Histogram of precip", horizontal axis "precip", vertical axis "Frequency"]

The height of the bars shows the number of observations falling into each bin.

There are several ways to change the cutoff points.

    > hist(precip, breaks = 10)  # Only a suggestion
    > hist(precip, breaks = seq(min(precip), max(precip),
    +                           length = 11))  # Force it

[Figure: two versions of "Histogram of precip" with different bin cutoffs]

Again, let's add meaningful axis labels and a title.

    > hist(precip, breaks = 10, xlab = "Inches",
    +      main = "Yearly Average Rainfall for US Cities")

[Figure: histogram titled "Yearly Average Rainfall for US Cities", horizontal axis "Inches", vertical axis "Frequency"]

4. Boxplots

    > boxplot(precip, ylab = "Inches",
    +         main = "Yearly Average Rainfall for US Cities")

[Figure: boxplot titled "Yearly Average Rainfall for US Cities", vertical axis "Inches", annotated with the labels below]

• Outliers: points beyond the whiskers
• Upper whisker: reaches at most upper quartile + 1.5 IQR, drawn at the most extreme data point within that limit
• Upper quartile
• Median
• Lower quartile
• Inter-quartile range (IQR): the distance between the lower and upper quartiles
• Lower whisker: reaches at most lower quartile - 1.5 IQR, drawn at the most extreme data point within that limit

    > mtcars[1:2,1:5]
                  mpg cyl disp  hp drat
    Mazda RX4      21   6  160 110  3.9
    Mazda RX4 Wag  21   6  160 110  3.9
    > boxplot(mpg~cyl, data = mtcars, xlab = "Cylinders",
    +         ylab = "Miles per Gallon",
    +         main = "Fuel Consumption")

[Figure: side-by-side boxplots titled "Fuel Consumption", horizontal axis "Cylinders" (4, 6, 8), vertical axis "Miles per Gallon"]

5. Scatterplots

    > state.x77[1:2,1:4]
            Population Income Illiteracy Life Exp
    Alabama       3615   3624        2.1    69.05
    Alaska         365   6315        1.5    69.31
    > plot(state.x77[,"Income"], state.x77[,"Life Exp"])

[Figure: scatterplot with default axis labels state.x77[, "Income"] and state.x77[, "Life Exp"]]

    > plot(state.x77[,"Income"], state.x77[,"Life Exp"],
    +      xlab = "Per Capita Income (Dollars)",
    +      ylab = "Life Expectancy (Years)",
    +      main = "Income and Life Expectancy in U.S., 1970s")

[Figure: scatterplot titled "Income and Life Expectancy in U.S., 1970s", horizontal axis "Per Capita Income (Dollars)", vertical axis "Life Expectancy (Years)"]

How can we label the interesting cases?

ESS [Emacs Speaks Statistics] Reference Card for S and R

Updated for ESS 5.3.0, as of April 4, 2006

1. Nota Bene: S is the language, R is one dialect!
2. This is a list of the more widely used key shortcuts. Many more are available, and most are accessible from the Emacs menus such as iESS, ESS, etc.
Interacting with the S process

For use in a process buffer '*R*' (inferior-ess-mode):

    RET          Send a command
    TAB          Complete S object name
    C-c C-c      Break
    C-g          Interrupt Emacs' waiting for S
    C-a / C-e    Beginning / End of command
    C-c C-u      Delete this command
    C-c C-w      Delete last word

Command history (part of Menu 'In/Out')

    M-p          Previous command
    M-n          Next command
    C-c C-l      List command history (& choose!)
    C-c M-r      Previous similar command
    C-c M-s      Next similar command
    C-c RET      Copy current input
    C-c C-r      Top of last output
    C-c C-o      Delete last output

Hot keys

    C-c C-v      Help for S object
    C-c C-l      Load source file (+ error check!)
    C-c C-x      List objects
    C-c C-s      Display search list
    C-c C-a      Attach a directory
    C-c C-d      Edit an object (dump to file)

Others

    C-c `        Jump to error after C-c C-l
    C-c C-q      Quit from S
    C-c C-z      Kill the S process

Editing S source

For use in ess-mode edit buffers (*.R files):

    TAB          Indent this line
    C-c TAB      Complete S object name
    M-TAB        Complete file- / path-name
    M-C-a        Beginning of function
    M-C-e        End of function
    M-C-q        Indent this expression (use at '{')
    M-C-h        Mark this function

Evaluation commands (Prefix C-u: in/visibly)

    C-c C-l      Load this buffer - detect errors!
    C-c C-n      Step through code - line by line
    C-c C-e      Evaluate an expression
    C-c C-j      Evaluate this line
    C-c M-j      Evaluate this line and go
    M-C-x        Evaluate this function
    C-c C-f      Evaluate this function
    C-c M-f      Evaluate this function and go
    C-c C-p      Evaluate this paragraph and step
    C-c C-c      Evaluate this para. or function & step
    C-c C-r      Evaluate this region
    C-c M-r      Evaluate this region and go
    C-c C-b      Evaluate this buffer
    C-c M-b      Evaluate this buffer and go

Others

    C-c C-v      Help for S object
    C-c C-d      "dump" - Edit another object
    C-c C-z      Return to S process (at prompt)

At SfS, or activated by M-x ess-add-MM-keys

    C-c f        Insert function() definition outline

Inside S Transcripts (I + O)

Inside ESS transcript buffers (*.Rout files):

    RET          Send and Move
    C-c C-n      Next prompt
    C-c C-p      Previous prompt
    C-c C-w      Clean Region (→ input only)

Reading help files

For use in '*help[R](. . .)*' help buffers:

    SPC          Next page
    DEL          Previous page
    b            Previous page ('back')
    /            Search forwards
    n            Next section
    p            Previous section
    s            Skip ('jump') to a named section
    s e          e.g., skip to "Examples:"
    l            Evaluate one 'Example' line
    r            Evaluate current region
    h            Help on another object
    ?            Help for this mode
    q            Return to S process ('quit')
    x            Kill this buffer and return ('exit')

...
## This note was uploaded on 10/08/2010 for the course STAT 133 taught by Professor Staff during the Spring '08 term at Berkeley.
