A Tutorial on R Programming
Ping Ma

Introduction
R is the GNU counterpart of S-Plus: a flexible programming language for statistical computing.
A multitude of packages exists for computational biology analyses, notably the BioConductor Project.
Some programming gems:
• Fantastic graphics!
• Extensibility – ports to Python, Java, GTK, HTML, etc.
• Support – active user community, especially in computational biology.
• Open source in design and nature.
http://www.r-project.org
http://cran.r-project.org

R Projects
The BioConductor Project www.bioconductor.org
A suite of statistical and graphical methods for analyzing genomic data. For example, software is available for DNA microarray analysis and normalization, CGH data, GO analysis, tiling arrays, plus some proteomics analyses (AP-MS).

CRAN – Comprehensive R Archive Network
All areas of mathematical and statistical software applications: finance modeling, time series, spatial modeling, high-performance parallel computing.

Outline
• Data Structures
• Functionality
• Input/Output
• Workspace Management

Getting Started
Installation: (usually) a snap – download the file, unzip it and run the wizard…
Start up: via the icon, or inside a shell:
> R

R Basics
• Note: everything in R is case sensitive.
• Assignments are made with "<-"; they can also be made using "=".
• Variable names may be delimited by a ".":
> a.meaningful.name <- 6
• Indices always begin with 1.
• Comments start with #.
> z <- 1:4
> z
[1] 1 2 3 4
> z[1]
[1] 1
> y <- c(1,2,3,4)
> y
[1] 1 2 3 4
> x <- 1 + 5
> x
[1] 6

Mathematical Operators
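As a small sketch of how the arithmetic operators act on the vectors z and y defined above: they work element by element. The result names (sums, squares, quotient, remainder) are illustrative and not part of the original slides.

```r
# Vectors as defined in the basics above
z <- 1:4
y <- c(1, 2, 3, 4)

# Arithmetic operators are applied element-wise to vectors
sums    <- z + y      # 2 4 6 8
squares <- z^2        # 1 4 9 16

# Integer division and remainder
quotient  <- 7 %/% 2  # 3
remainder <- 7 %% 2   # 1

print(sums)
print(squares)
```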
R as a calculator:
> 2+3
[1] 5
> 3*4/6 + 2*(1 + 9)
[1] 22
> A%*%B    # matrix multiplication

Built-In R Functions
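The matrix product A%*%B is shown without defining A and B. A minimal sketch with two small, made-up matrices contrasts it with the element-wise product and uses the built-in functions t(), diag() and solve(). The values here are assumptions chosen for illustration.

```r
# Two small example matrices (illustrative values only)
A <- matrix(1:4, nrow = 2)   # filled by column: columns are (1,2) and (3,4)
B <- diag(2)                 # 2 x 2 identity matrix

elementwise <- A * A         # multiplies matching entries
product     <- A %*% A       # true matrix product
same.as.A   <- A %*% B       # multiplying by the identity returns A

A.inv <- solve(A)            # built-in matrix inverse
A.t   <- t(A)                # built-in transpose

print(product)
print(A %*% A.inv)           # numerically the identity matrix
```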
R comes with a suite of built-in mathematical and statistical functions.
> sqrt(54)
[1] 7.348469
> mean(1:5)
[1] 3
> lm(y~x)    # simple linear regression
For more specialized functions, look at CRAN or BioConductor.

Matrices
Matrices are two-dimensional vectors.
> A <- matrix(1:9, nrow=3, ncol=3, byrow=T)
> A
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> row.names(A) <- c("a", "b", "c")
> colnames(A) <- c("f", "g", "h")
> A
  f g h
a 1 2 3
b 4 5 6
c 7 8 9

Extracting and Extending Matrices
Extract information from the matrix using indices.
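A few more indexing forms, sketched on the same 3 x 3 matrix (rebuilt here so the block runs on its own). Besides positions, R accepts names, negative indices and logical conditions. The result names are illustrative.

```r
# Rebuild the example matrix with its row and column names
A <- matrix(1:9, nrow = 3, byrow = TRUE,
            dimnames = list(c("a", "b", "c"), c("f", "g", "h")))

elem     <- A[2, 3]               # single element by position: 6
by.name  <- A["b", "h"]           # the same element by name
no.first <- A[-1, ]               # negative index drops the first row
big.rows <- A[A[, "f"] > 1, ]     # rows where column f exceeds 1
row.mat  <- A[1, , drop = FALSE]  # keep the result as a 1 x 3 matrix

print(no.first)
print(row.mat)
```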
> A[,1]
a b c
1 4 7
> A[1,]
f g h
1 2 3
Extend the matrix by adding rows or columns.
> B <- cbind(A, c(10,20,30))
> B
  f g h
a 1 2 3 10
b 4 5 6 20
c 7 8 9 30
> C <- rbind(A, c(10,20,30))
> C
   f  g  h
a  1  2  3
b  4  5  6
c  7  8  9
  10 20 30
A matrix can hold only one data type, e.g. numeric or character.

Interrogating a Matrix Object
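The one-data-type rule can be seen by interrogating a matrix after binding a character column onto numeric data: R silently converts every entry to character. A minimal sketch with made-up values:

```r
# A small numeric matrix (illustrative values)
num.mat <- matrix(c(1, 2, 3, 4), nrow = 2)
typeof(num.mat)                       # "double"

# Binding a character column coerces the whole matrix to character
mixed <- cbind(num.mat, c("x", "y"))
print(mixed)

typeof(mixed)                         # "character"
dim(mixed)                            # 2 rows, 3 columns
```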
Useful functions are:
> dim(A)
[1] 3 3
> ncol(A)
[1] 3
> nrow(A)
[1] 3
> length(A)
[1] 9
Similarly for a vector object:
> length(x)

Operating on Matrices
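A short sketch of row-wise and column-wise operations with apply, checked against the dimension helpers above. The matrix M and the result names are made up for illustration.

```r
# Example matrix with rows (1, 2, 3) and (4, 5, 6)
M <- matrix(1:6, nrow = 2, byrow = TRUE)

row.sums <- apply(M, 1, sum)   # margin 1 = over rows:    6 15
col.max  <- apply(M, 2, max)   # margin 2 = over columns: 4 5 6

# One result per row, and one per column
length(row.sums) == nrow(M)
length(col.max) == ncol(M)

# For sums and means, the built-in rowSums/colMeans give the same answers
print(rowSums(M))
```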
A really useful function for matrices is the apply function. It lets us apply a given function row-wise or column-wise.
> apply(A, 1, mean)    # the 1 means row-wise; use 2 for column-wise
[1] 2 5 8

Data Frame
A data frame is a collection of column vectors. A useful way to store table-like information.

              Gpdh   Sod   Xdh  AvRate  Myr
Drosophila     1.5  25.7  30.4    22.4   55
Fungi         40.0  24.9  13.7    21.4  300
Animal Phyla  13.2  19.2  19.2    17.5  600

> molclock <- data.frame(Gpdh=c(1.50, 40, 13.2),
+     Sod=c(25.7, 24.9, 19.2), Xdh=c(30.4, 13.7, 19.2),
+     AvRate=c(22.4, 21.4, 17.5), Myr=c(55, 300, 600),
+     row.names=c("Drosophila", "Fungi", "Animal Phyla"))

Working with Data Frame
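Once the data frame exists, a few idioms cover most day-to-day work. This sketch rebuilds molclock so it runs on its own; the 100 Myr cut-off and the name old.groups are arbitrary choices for illustration.

```r
# Rebuild the example data frame
molclock <- data.frame(Gpdh = c(1.50, 40, 13.2),
                       Sod = c(25.7, 24.9, 19.2),
                       Xdh = c(30.4, 13.7, 19.2),
                       AvRate = c(22.4, 21.4, 17.5),
                       Myr = c(55, 300, 600),
                       row.names = c("Drosophila", "Fungi", "Animal Phyla"))

str(molclock)        # compact overview: one line per column
molclock$Sod         # a column, accessed with $

# Keep only the rows that satisfy a condition
old.groups <- molclock[molclock$Myr > 100, ]
print(rownames(old.groups))
```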
To extract data from a data frame by column, we can use indices or names:
> molclock[,1]
[1]  1.5 40.0 13.2
> molclock[,"Gpdh"]
[1]  1.5 40.0 13.2
For rows, we use row indices (row names also work):
> molclock[2,]
      Gpdh  Sod  Xdh AvRate Myr
Fungi   40 24.9 13.7   21.4 300
Recall: a data.frame object is a collection of column vectors.
> class(molclock[,1])
[1] "numeric"
> class(molclock[2,])
[1] "data.frame"

List Structures
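The reason a single row comes back as a data.frame while a single column comes back as a vector is that a data frame is itself a list, with one element per column. A small sketch, using a made-up two-column data frame df:

```r
# A tiny data frame with one character and one numeric column
df <- data.frame(gene = c("actin", "gapdh"),
                 level = c(1.3, 2.45),
                 stringsAsFactors = FALSE)

is.list(df)                       # TRUE: a data frame is a list of columns
length(df)                        # 2, the number of columns

level.col  <- df[["level"]]       # list-style access to one column
col.classes <- sapply(df, class)  # each column keeps its own class
print(col.classes)
```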
Up until now, our data structures have needed a uniform data type (within each column, in the case of a data frame). List structures are powerful because we can store multiple data types in the same object.
> miscObjs <- list("actin"=c(1.3, 99.6, 2.45),
+     "gapdh"=matrix(rnorm(100), nrow=10), "atp"=molclock)
We extract data from a list using names or indices.
> names(miscObjs)
[1] "actin" "gapdh" "atp"
> miscObjs$actin
[1]  1.30 99.60  2.45
> miscObjs[[1]]
[1]  1.30 99.60  2.45

Visualizing Data: Plot Function
A simple scatter plot:
> x.dat <- rnorm(100)    # 100 N(0,1) rvs
> plot(x.dat, xlab="Index", ylab="Normal RVS",
+     main="Figure 1: Scatter Plot")

Exporting Graphics
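Writing the scatter plot to a file follows an open-device, plot, close-device pattern. The sketch below uses the pdf() device, which ships with every R installation, and a temporary file path; both are illustrative choices, not the ones from the slides.

```r
x.dat <- rnorm(100)                       # 100 N(0,1) random values

out.file <- tempfile(fileext = ".pdf")    # illustrative output path

pdf(out.file)                             # 1. open a file-based graphics device
plot(x.dat, xlab = "Index", ylab = "Normal RVS",
     main = "Figure 1: Scatter Plot")     # 2. draw the plot into it
invisible(dev.off())                      # 3. close the device to finish the file

file.exists(out.file)
```

The file is only complete after dev.off() has been called.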
In Windows:
• right mouse click to copy to clipboard.
For most operating systems:
> bitmap("file.bmp")
> plot(x.dat)    # <- insert code for making plot here
> dev.off()
You can export graphics to many file formats – bitmap, jpeg, gif, postscript, etc.

Classes
A class describes the way an object in R is stored.
Strings: "Homo sapiens"
Numeric: 3.141593
Boolean: TRUE, FALSE
We can interrogate an object to find out its class:
> a <- FALSE
> class(a)
[1] "logical"
> is.numeric(a)
[1] FALSE
Classes also reflect the data structure, e.g. matrix, data.frame, function.

Working with Strings
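Strings have class "character". A minimal sketch of checking that class and of a few basic string helpers, including conversion between character and numeric (the variable names are illustrative):

```r
species <- "Homo sapiens"

species.class <- class(species)          # "character"
n.chars       <- nchar(species)          # number of characters, including the space
genus         <- substr(species, 1, 4)   # characters 1 to 4
shouted       <- toupper(species)

# Converting between classes
pi.text   <- "3.14"
pi.number <- as.numeric(pi.text)         # character -> numeric

print(species.class)
print(class(pi.number))
```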
While Perl or Python are more competent languages for text parsing, R does have capabilities for manipulating and creating strings.
Pasting Strings Together
> paste("Cat", "Dog", sep="")
[1] "CatDog"
Splitting Strings
> strsplit("Seuss", "")
[[1]]
[1] "S" "e" "u" "s" "s"
Searching for Patterns
> grep("and", "Brown eggs and ham")
[1] 1
# grep also lets you search with regexp patterns

Boolean Algebra
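Pattern searches tie strings to TRUE/FALSE values: grepl returns one logical flag per string, where grep returns the matching positions. A sketch on a made-up vector of gene names:

```r
genes <- c("actin", "myosin", "gapdh", "beta-actin")

has.actin <- grepl("actin", genes)   # one TRUE/FALSE per element
actin.genes <- genes[has.actin]      # use the flags to subset

starts.with.a <- grep("^a", genes)   # regexp: positions of names starting with "a"

# collapse joins the elements of one vector into a single string
joined <- paste(actin.genes, collapse = ", ")
print(joined)
```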
In R, to test for equality use "==", and for inequality use "!=".
> 1 == 3
[1] FALSE
> 1 != 3    # inequality
[1] TRUE
Another powerful tip: we can test for inclusion in a vector with "%in%".
> x <- 1:10 ; even.numbers <- seq(from=2, to=10, by=2)
> x
 [1]  1  2  3  4  5  6  7  8  9 10
> even.numbers
[1]  2  4  6  8 10
> x %in% even.numbers
 [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
We can subset vectors with TRUE/FALSE flags:
> x[x %in% even.numbers]
[1]  2  4  6  8 10

Missing Values
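Logical flags and missing values interact: a comparison involving NA is itself NA, and an NA flag carries through into the subset. A minimal sketch with a made-up vector v:

```r
v <- c(3, NA, 8, 1)

big <- v > 2                 # TRUE NA TRUE FALSE: the NA propagates
with.na <- v[big]            # subsetting with an NA flag yields an NA entry

# Two ways to drop the missing entry while subsetting
clean.which <- v[which(big)]             # which() keeps only the TRUE positions
clean.test  <- v[!is.na(v) & v > 2]      # or test for NA explicitly

print(with.na)
print(clean.which)
```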
NA is the all-inclusive symbol for a missing value in R.
> mean(c(1, 4, NA))
[1] NA
> mean(c(1, 4, NA), na.rm=T)
[1] 2.5
We can test whether an object is a missing value.
> NA == NA
[1] NA    # this doesn't work!
> is.na(NA)
[1] TRUE
> na.omit(c(1, 4, NA))
[1] 1 4
Other objects: NaN, Inf.

For Loops
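As a sketch combining the two ideas, a for loop can step over a vector and skip the missing entries with is.na. The built-in sum with na.rm=TRUE gives the same total in one call. The vector values is made up.

```r
values <- c(1, 4, NA, 7)

total <- 0
for (v in values) {
  if (is.na(v)) next      # skip missing entries
  total <- total + v
}
print(total)

# The vectorised equivalent
builtin.total <- sum(values, na.rm = TRUE)
```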
For loops are very simple in R.
> for( m in 1:3 ){
+     print(m)
+ }
[1] 1
…
> for( m in c("actin", "myosin", "gapdh") ){
+     print(m)
+ }
[1] "actin"
…
Note: R does not process for loops very quickly; try to avoid them for large data if you can (e.g. use apply).

Conditional Statements
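A sketch of the same element-by-element decision written twice: once as a for loop with a condition inside, once with the vectorised ifelse function, which avoids the loop altogether. The input vector and result names are illustrative.

```r
x <- c(-2, 0.5, 3, -1)

# Loop version: test each element in turn
signs.loop <- character(length(x))
for (i in seq_along(x)) {
  if (x[i] > 0) {
    signs.loop[i] <- "positive"
  } else {
    signs.loop[i] <- "negative"
  }
}

# Vectorised version: one call handles the whole vector
signs.vec <- ifelse(x > 0, "positive", "negative")

print(signs.vec)
```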
We can use conditional statements to automate tasks and functions.
If..Else Block
If( condition 1 holds ) then do task 1. Else, do task 2.
> if( x > 0 ){ print("positive")
+ } else { print("negative") }
While Block
While( condition 1 holds ) do task 1. Once condition 1 no longer holds, stop.
> while( x > 0 ){ x <- x + rnorm(1) }
You can put the break command inside an if( … ) to break out of the loop.

Writing Your Own Functions
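The while loop and the break command fit naturally inside a function. The helper below, count.halvings, is hypothetical and written only for illustration: it halves a number until it drops below a threshold, with break as a safety stop after max.iter steps.

```r
# Hypothetical helper: how many halvings until x falls below threshold?
count.halvings <- function(x, threshold = 1, max.iter = 100) {
  steps <- 0
  while (x >= threshold) {
    if (steps >= max.iter) break   # safety stop for inputs that never shrink
    x <- x / 2
    steps <- steps + 1
  }
  return(steps)
}

count.halvings(8)                    # 8 -> 4 -> 2 -> 1 -> 0.5
count.halvings(0.5)                  # already below the threshold
count.halvings(Inf, max.iter = 10)   # stopped by break
```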
Imagine you need to write a simple function that returns both the mean and the standard deviation of a vector in a list structure.
> mean.and.sd <- function(x){
+     res.mean <- mean(x) ; res.sd <- sd(x)
+     res = list(mean=res.mean, sd=res.sd)
+     return(res)
+ }
> mean.and.sd(rpois(10,5))
$mean
[1] 4.4
$sd
[1] 0.9660918
You can use the args function to find out what arguments a function needs.
> args(mean.and.sd)
function (x)
NULL

Inputting Data into R
R has capabilities for reading in data files of many different formats.
For simple ASCII text files we can use the read.table function.
The arguments of read.table are:
> args(read.table)
function (file, header = FALSE, sep = "",
    quote = "\"'", dec = ".", row.names, col.names,
    as.is = FALSE, na.strings = "NA", colClasses = NA,
    nrows = -1, skip = 0, check.names = TRUE,
    fill = !blank.lines.skip, strip.white = FALSE,
    blank.lines.skip = TRUE, comment.char = "#",
    allowEscapes = FALSE)
Other read-in functions: read.csv, scan, readLines.

Outputting Data from R
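A round trip shows how read.table pairs with write.table: write a small data frame to a tab-delimited text file, then read it back. The data frame scores and the temporary file path are made up for illustration.

```r
# A small data frame to write out
scores <- data.frame(gene = c("actin", "gapdh"),
                     level = c(1.3, 2.45),
                     stringsAsFactors = FALSE)

out.file <- tempfile(fileext = ".txt")   # illustrative file path

# Write a tab-delimited file with a header row and no row names
write.table(scores, file = out.file, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read it back; header = TRUE takes the column names from the first line
scores.in <- read.table(out.file, header = TRUE, sep = "\t",
                        stringsAsFactors = FALSE)
print(scores.in)
```

The sep and header settings on the reading side need to match how the file was written.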
To output data to a simple table text file, we can use write.table.
> args(write.table)
function (x, file = "", append = FALSE, quote = TRUE,
    sep = " ", eol = "\n", na = "NA", dec = ".",
    row.names = TRUE, col.names = TRUE,
    qmethod = c("escape", "double"))
Other write functions: write, cat.

Porting to Other Languages
A port is a piece of software that provides a means to get one programming language to communicate with another.
The Omega Project for Statistical Computing
An umbrella project to link different programming languages seamlessly.
Some packages available: RSPython, RSPerl, RMatlab (plus a variety of others).
Example: RSPython
To call Python from R: load RSPython, call py commands using .Python(func, args1, args2, …)
To call R from Python: load RS module, RS.call("plot", x, y).

Workspace Management
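A minimal sketch of a save, remove and load round trip for one workspace object. The object name counts and the temporary file location are assumptions made so the example cleans up after itself.

```r
counts <- 1:5                                   # an object worth keeping

rdata.file <- file.path(tempdir(), "fileName.RData")   # illustrative location
save(counts, file = rdata.file)                 # write the object to disk

rm(counts)                                      # remove it from the workspace
gone <- !exists("counts", inherits = FALSE)     # TRUE: no longer defined here

load(rdata.file)                                # restore it under its original name
print(counts)
```

Note that load restores objects under the names they were saved with, overwriting any current objects of the same name.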
Where am I?
> getwd()    # returns the working directory
> setwd("C:/Jess")    # sets the working directory
> dir()    # lists files in working directory
> list.files()
How can I tell what objects I have?
> ls()
To remove individual objects use rm():
> rm("name.of.object")
To save specific objects use save():
> save(x, file="fileName.RData")
At a later date, you can load this into your workspace:
> load("fileName.RData")

Libraries
Libraries are collections of R functions that together perform a specialized analysis or task.
For example: the genetics package.
CRAN description:
Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, estimating and testing for Hardy-Weinberg disequilibrium, estimating and testing for linkage disequilibrium, ...
Consult CRAN for more: http://cran.us.r-project.org/

Helpful Functions
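A sketch of attaching a library and then discovering what is available. It uses the stats package because it ships with R; the genetics package would be attached the same way with library(genetics), assuming it has been installed from CRAN first.

```r
# Attach a package; stats is part of every R installation
library(stats)

# The search path lists what is currently attached
attached <- "package:stats" %in% search()

# apropos finds objects by name; here, everything starting with "read."
hits <- apropos("^read\\.")
print(head(hits))

# help(package = "stats") would open the package's index of help pages
```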
To boot up HTML help files:
> help.start()
To pop up a help file on an individual function:
> help(mean)    # replace mean with the name of the function
To search for help around a topic or function:
> help.search("plot")
To search for objects whose names contain a string:
> apropos("string")

More Info & Resources
For R tutorials and simple documents to learn more about R, consult the R website for lots of resources: www.r-project.org/ (go to Documentation > Other > Contributed Documentation).
Really great HTML tutorial: Kickstarting R by Jim Lemon
http://cran.r-project.org/doc/contrib/Lemon-kickstart/index.html
"R for Beginners" by Emmanuel Paradis [short pdf]
There are also reference cards that contain the most important R functions (and their descriptions) you need to know (like a cheat sheet).
"R Reference Card" by Jonathan Baron [1 page list]
...
 Spring '07
 Ma
 data frame, Bioconductor
