R for beginners

Emmanuel Paradis

1 What is R ?
2 The few things to know before starting
  2.1 The operator <-
  2.2 Listing and deleting the objects in memory
  2.3 The on-line help
3 Data with R
  3.1 The ‘objects’
  3.2 Reading data from files
  3.3 Saving data
  3.4 Generating data
    3.4.1 Regular sequences
    3.4.2 Random sequences
  3.5 Manipulating objects
    3.5.1 Accessing a particular value of an object
    3.5.2 Arithmetics and simple functions
    3.5.3 Matrix computation
4 Graphics with R
  4.1 Managing graphic windows
    4.1.1 Opening several graphic windows
    4.1.2 Partitioning a graphic window
  4.2 Graphic functions
  4.3 Low-level plotting commands
  4.4 Graphic parameters
5 Statistical analyses with R
6 The programming language R
  6.1 Loops and conditional executions
  6.2 Writing your own functions
7 How to go farther with R ?
8 Index

The goal of the present document is to give a starting point for people newly interested in R. I have tried to simplify the explanations as much as I could to make them understandable by all, while giving useful details, sometimes with tables. Commands, instructions and examples are written in Courier font.

I thank Julien Claude, Christophe Declercq, Friedrich Leisch and Mathieu Ros for their comments and suggestions on an earlier version of this document. I am also grateful to all the members of the R Development Core Team for their considerable efforts in developing R and animating the discussion list ‘r-help’. Thanks also to the R users whose questions or comments helped me to write “R for beginners”.

© 2000, Emmanuel Paradis (20 October 2000)

1 What is R ?

R is a statistical analysis system created by Ross Ihaka & Robert Gentleman (1996, J. Comput. Graph. Stat., 5: 299-314). R is both a language and a piece of software; its most remarkable features are:

• an effective data handling and storage facility,
• a suite of operators for calculations on arrays, matrices, and other complex operations,
• a large, coherent, integrated collection of tools for statistical analysis,
• numerous graphical facilities which are particularly flexible, and
• a simple and effective programming language which includes many facilities.

R is a language considered as a dialect of the language S created by AT&T Bell Laboratories. S is available as the software S-PLUS commercialized by MathSoft (see http://www.splus.mathsoft.com/ for more information).
There are important differences in the conceptions of R and S, but they are not of interest to us here: those who want to know more on this point can read the paper by Ihaka & Gentleman (1996) or the R-FAQ (http://cran.r-project.org/doc/FAQ/R-FAQ.html), a copy of which is also distributed with the software.

R is freely distributed under the terms of the GNU General Public Licence of the Free Software Foundation (for more information: http://www.gnu.org/); its development and distribution are carried on by several statisticians known as the R Development Core Team. A key element in this development is the Comprehensive R Archive Network (CRAN).

R is available in several forms: the sources written in C (and some routines in Fortran77) ready to be compiled, essentially for Unix and Linux machines, or some binaries ready for use (after a very easy installation) according to the following table.

Architecture    Operating system(s)
Intel           Windows 95/98/NT4.0/2000
                Linux (Debian 2.2, Mandrake 7.1, RedHat 6.x, SuSe 5.3/6.4/7.0)
PPC             MacOS
                LinuxPPC 5.0
Alpha Systems   Digital Unix 4.0
                Linux (RedHat 6.x)
Sparc           Linux (RedHat 6.x)

The files to install these binaries are at http://cran.r-project.org/bin/ (except for the Macintosh version¹), where you can also find the installation instructions for each operating system.

¹ The Macintosh port of R has just been finished by Stefano Iacus, and should be available soon on CRAN.

R is a language with many functions for statistical analyses and graphics; the latter are visualized immediately in their own window and can be saved in various formats (for example, jpg, png, bmp, eps, or wmf under Windows, ps, bmp, pictex under Unix). The results from a statistical analysis can be displayed on the screen; some intermediate results (P-values, regression coefficients) can be written in a file or used in subsequent analyses. The R language allows the user, for instance, to program loops of commands to successively analyse several data sets.
It is also possible to combine in a single program different statistical functions to perform more complex analyses. R users may benefit from a large number of routines written for S and available on the internet (for example: http://stat.cmu.edu/S/); most of these routines can be used directly with R.

At first, R could seem too complex for a non-specialist (for instance, a biologist). This may not actually be true. In fact, a prominent feature of R is its flexibility. Whereas classical software (SAS, SPSS, Statistica, ...) displays (almost) all the results of an analysis, R stores these results in an object, so that an analysis can be done with no result displayed. The user may be surprised by this, but such a feature is very useful. Indeed, the user can extract only the part of the results which is of interest. For example, if one runs a series of 20 regressions and wants to compare the different regression coefficients, R can display only the estimated coefficients: thus the results will take 20 lines, whereas classical software could well open 20 results windows. One could cite many other examples illustrating the superiority of a system such as R compared to classical software; I hope the reader will be convinced of this after reading this document.
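
The idea of running 20 regressions and keeping only their coefficients could be sketched as follows. This is only an illustration under stated assumptions: the data are simulated, and the names x, y, fit and coefs are invented here; lm() and coef() are standard R functions that this part of the text does not introduce.

```r
# Hypothetical illustration: 20 regressions on simulated data,
# keeping only the estimated coefficients of each fit.
set.seed(1)                       # make the simulated data reproducible
x <- rnorm(50)                    # a common explanatory variable (assumed)
coefs <- matrix(NA, nrow = 20, ncol = 2,
                dimnames = list(NULL, c("intercept", "slope")))
for (i in 1:20) {
  y <- 2 + 3 * x + rnorm(50)      # simulated response (assumed model)
  fit <- lm(y ~ x)                # the full result is stored in an object ...
  coefs[i, ] <- coef(fit)         # ... and only the coefficients are extracted
}
print(coefs)                      # 20 lines, one per regression
```

Nothing is displayed until print() is called, so the 20 analyses produce a single compact table rather than 20 separate result displays.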

2 The few things to know before starting

Once R is installed on your computer, the software is accessed by launching the corresponding executable (RGui.exe or Rterm.exe under Windows, R under Unix). The prompt ‘>’ indicates that R is waiting for your commands.

Under Windows, some commands related to the system (accessing the on-line help, opening files, ...) can be executed via the pull-down menus, but most of them have to be typed on the keyboard. We shall first see three aspects of R: creating and modifying elements, listing and deleting objects in memory, and accessing the on-line help.

2.1 The operator <-

R is an object-oriented language: the variables, data, matrices, functions, results, etc. are stored in the active memory of the computer in the form of objects which have a name: one has just to type the name of the object to display its content.
For example, if an object n has the value 10:

> n
[1] 10

The digit 1 within brackets indicates that the display starts at the first element of n (see § 3.5.1). The symbol assign is used to give a value to an object. This symbol is written with a bracket (< or >) together with a minus sign so that they make a small arrow which can be directed from left to right, or the reverse:

> n <- 15
> n
[1] 15
> 5 -> n
> n
[1] 5

The value which is so given can be the result of an arithmetic expression:

> n <- 10+2
> n
[1] 12

Note that you can simply type an expression without assigning its value to an object; the result is then displayed on the screen but not stored in memory:

> (10+2)*5
[1] 60

2.2 Listing and deleting the objects in memory

The function ls() simply lists the objects in memory: only the names of the objects are displayed.

> name <- "Laure"; n1 <- 10; n2 <- 100; m <- 0.5
> ls()
[1] "m"    "n1"   "n2"   "name"

Note the use of the semi-colon ";" to separate distinct commands on the same line. If there are a lot of objects in memory, it may be useful to list those which contain a given character in their name: this can be done with the option pattern (which can be abbreviated with pat):

> ls(pat="m")
[1] "m"    "name"

If we want to restrict the list to the objects whose names start with this character:

> ls(pat="^m")
[1] "m"

To display the details on the objects in memory, we can use the function ls.str():

> ls.str()
m:  num 0.5
n1:  num 10
n2:  num 100
name:  chr "Laure"

The option pattern can also be used as described above. Another useful option of ls.str() is max.level which specifies the level of details to be displayed for composite objects. By default, ls.str() displays the details of all objects in memory, including the columns of data frames, matrices and lists, which can result in a very long display. One can avoid displaying all these details with the option max.level=-1:

> M <- data.frame(n1,n2,m)
> ls.str(pat="M")
M: ‘data.frame’:  1 obs. of  3 variables:
 $ n1: num 10
 $ n2: num 100
 $ m : num 0.5
> ls.str(pat="M", max.level=-1)
M: ‘data.frame’:  1 obs. of  3 variables:

To delete objects in memory, we use the function rm(): rm(x) deletes the object x, rm(x,y) both objects x and y, rm(list=ls()) deletes all the objects in memory; the same options mentioned for the function ls() can then be used to delete selectively some objects: rm(list=ls(pat="m")).

2.3 The on-line help

The on-line help of R gives some very useful information on how to use the functions. The help in html format is called by typing:

> help.start()

A search with key-words is possible with this html help. A search of functions can be done with apropos("what") which lists the functions with "what" in their name:

> apropos("anova")
 [1] "anova"           "anova.glm"       "anova.glm.null"  "anova.glmlist"
 [5] "anova.lm"        "anova.lm.null"   "anova.mlm"       "anovalist.lm"
 [9] "print.anova"     "print.anova.glm" "print.anova.lm"  "stat.anova"

The help is available in text format for a given function, for instance:

> ?lm

displays the help file for the function lm(). The function help(lm) or help("lm") has the same effect. This last function must be used to access the help with non-conventional characters:

> ?*
Error: syntax error
> help("*")
Arithmetic              package:base              R Documentation

Arithmetic Operators
...
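
The commands of this section can be tried together in a short session such as the sketch below. The object names are those used in the text; the helper names before, after and found are invented for this illustration, and the exact list returned by apropos() depends on the version of R and on the packages that are loaded.

```r
# Create a few objects (same names as in the text)
name <- "Laure"; n1 <- 10; n2 <- 100; m <- 0.5

before <- ls(pattern = "^n")       # objects whose names start with "n"
print(before)

rm(n1)                             # delete a single object
rm(list = ls(pattern = "^n2"))     # delete the objects matched by a pattern
after <- ls(pattern = "^n")
print(after)

# List the functions whose name contains "anova"
found <- apropos("anova")
print(found)
```

Writing pattern in full rather than pat is a matter of taste; both forms select the same argument of ls().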

3 Data with R

3.1 The ‘objects’

R works with objects which all have two intrinsic attributes: mode and length. The mode is the kind of elements of an object; there are four modes: numeric, character, complex, and logical. There are other modes but these do not characterize data, for instance: function, expression, or formula. The length is the total number of elements of the object. The following table summarizes the different objects manipulated by R.

object       possible modes                             several modes possible in the same object ?
vector       numeric, character, complex, or logical    No
factor       numeric, or character                      No
array        numeric, character, complex, or logical    No
matrix       numeric, character, complex, or logical    No
data.frame   numeric, character, complex, or logical    Yes
ts           numeric, character, complex, or logical    Yes
list         numeric, character, complex, logical,      Yes
             function, expression, or formula

A vector is a variable in the commonly admitted meaning. A factor is a categorical variable. An array is a table with k dimensions, a matrix being a particular case of array with k = 2. Note that the elements of an array or of a matrix are all of the same mode.
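
The two intrinsic attributes can be inspected with the standard functions mode() and length(). The following is a minimal sketch; the object names v, s, b, z, m and mixed are chosen here only for illustration.

```r
v <- c(1.5, 2, 10)           # numeric vector
s <- c("a", "b")             # character vector
b <- c(TRUE, FALSE, TRUE)    # logical vector
z <- 2 + 3i                  # complex value

print(c(mode(v), mode(s), mode(b), mode(z)))
print(length(v))             # number of elements of v

m <- matrix(1:6, nrow = 2)   # a matrix holds a single mode
print(mode(m))
print(length(m))             # total number of elements: 2 * 3

mixed <- c(1, "a")           # mixing modes in a vector forces a single mode
print(mode(mixed))
```

The last lines show why the table answers "No" for vectors: the number 1 is silently converted to the character string "1".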
A data.frame is a table composed of several vectors all of the same length but possibly of different modes. A ts is a time-series data set and so contains supplementary attributes such as the frequency and the dates.

Among the non-intrinsic attributes of an object, one is to be kept in mind: it is dim, which corresponds to the dimensions of a multivariate object. For example, a matrix with 2 lines and 2 columns has for dim the pair of values [2,2], but its length is 4.

It is useful to know that R discriminates, for the names of the objects, the upper-case characters from the lower-case ones (i.e., it is case-sensitive), so that x and X can be used to name distinct objects (even under Windows):

> x <- 1; X <- 10
> ls()
[1] "X" "x"
> X
[1] 10
> x
[1] 1

3.2 Reading data from files

R can read data stored in text (ASCII) files; three functions can be used: read.table() (which has two variants: read.csv() and read.csv2()), scan() and read.fwf(). For example, if we have a file data.dat, one can just type:

> mydata <- read.table("data.dat")

mydata will then be a data.frame, and each variable will be named, by default, V1, V2, ... and can be accessed individually by mydata$V1, mydata$V2, ..., or by mydata["V1"], mydata["V2"], ..., or, still another solution, by mydata[,1], mydata[,2], etc.² There are several options available for the function read.table(); their default values (i.e. those used by R if omitted by the user) and other details are given in the following table:

> read.table(file, header=FALSE, sep="", quote="\"'", dec=".", row.names=, col.names=, as.is=FALSE, na.strings="NA", skip=0, check.names=TRUE, strip.white=FALSE)

file          the name of the file (within ""), possibly with its path (the symbol \ is not allowed and must be replaced by /, even under Windows)
header        a logical (FALSE or TRUE) indicating if the file contains the names of the variables on its first line
sep           the field separator used in the file, for instance sep="\t" if it is a tabulation
quote         the characters used to cite the variables of mode character
dec           the character used for the decimal point
row.names     a vector with the names of the lines which can be a vector of mode character, or the number (or the name) of a variable of the file (by default: 1, 2, 3, ...)
col.names     a vector with the names of the variables (by default: V1, V2, V3, ...)
as.is         controls the conversion of character variables as factors (if FALSE) or keeps them as characters (TRUE)
na.strings    the value given to missing data (converted as NA)
skip          the number of lines to be skipped before reading the data
check.names   if TRUE, checks that the variable names are valid for R
strip.white   (conditional to sep) if TRUE, scan deletes extra spaces before and after the character variables

Two variants of read.table() are useful because they have different default options:

read.csv(file, header = TRUE, sep = ",", quote="\"", dec=".", ...)
read.csv2(file, header = TRUE, sep = ";", quote="\"", dec=",", ...)

The function scan() is more flexible than read.table() and has more options. The main difference is that it is possible to specify the mode of the variables, for example:

> mydata <- scan("data.dat", what=list("",0,0))

reads in the file data.dat three variables, the first is of mode character and the next two are of mode numeric. The options are as follows.

> scan(file="", what=double(0), nmax=-1, n=-1, sep="", quote=if (sep=="\n") "" else "'\"", dec=".", skip=0, nlines=0, na.strings="NA", flush=FALSE, strip.white=FALSE, quiet=FALSE)

file          the name of the file (within ""), possibly with its path (the symbol \ is not allowed and must be replaced by /, even under Windows); if file="", the data are input with the keyboard (the entry is terminated with a blank line)
what          specifies the mode(s) of the data
nmax          the number of data to read, or, if what is a list, the number of lines to read (by default, scan reads the data up to the end of file)
n             the number of data to read (by default, no limit)
sep           the field separator used in the file
quote         the characters used to cite the variables of mode character
dec           the character used for the decimal point
skip          the number of lines to be skipped before reading the data
nlines        the number of lines to read
na.strings    the value given to missing data (converted as NA)
flush         a logical, if TRUE, scan goes to the next line once the number of columns has been reached (allows the user to add comments in the data file)
strip.white   (conditional to sep) if TRUE, scan deletes extra spaces before and after the character variables
quiet         a logical, if FALSE, scan displays a line showing which fields have been read

² Nevertheless, there is a difference: mydata$V1 and mydata[,1] are vectors whereas mydata["V1"] is a data.frame.

The function read.fwf() can be used to read in a file some data in fixed width format:

> read.fwf(file, widths, sep="\t", as.is=FALSE, skip=0, row.names, col.names)

The options are the same as for read.table() except widths which specifies the width of the fields. For example, if the file data.txt has the following data:

A1.501.2
A1.551.3
B1.601.4
B1.651.5
C1.701.6
C1.751.7

one can read them with:

> mydata <- read.fwf("data.txt", widths=c(1,4,3))
> mydata
  V1   V2  V3
1  A 1.50 1.2
2  A 1.55 1.3
3  B 1.60 1.4
4  B 1.65 1.5
5  C 1.70 1.6
6  C 1.75 1.7

3.3 Saving data

The function write(x, file="data.txt") writes an object x (a vector, a matrix, or an array) in the file data.txt. There are two options: nc (or ncol) which defines the number of columns in the file (by default nc=1 if x is of mode character, nc=5 for the other modes), and append (a logical) to add the data without erasing those possibly already present in the file (TRUE), or erasing these (FALSE, the default value).

The function write.table() writes in a file a data.frame. The options are:

> write.table(x, file, append=FALSE, quote=TRUE, sep=" ", eol="\n", na="NA", dec=".", row.names=TRUE, col.names=TRUE)

sep         the field separator used in the file
col.names   a logical indicating whether the names of the columns are written in the file
row.names   id. for the names of the lines
quote       a logical or a numeric vector; if TRUE, the variables of mode character are quoted with ""; if a numeric vector, its elements give the indices of the columns to be quoted with "". In both cases, the names of the lines and of the columns are also quoted with "" if they are written.
dec         the character to be used for the decimal point
na          the character to be used for missing data
eol         the character to be used at the end of each line ("\n" is a newline)

To record a group of objects in a binary form, we can use the function save(x, y, z, file="Mystuff.RData"). To ease the transfer of data between different machines, the option ascii=TRUE can be used. The data (which are now called image) can be loaded later in memory with load("Mystuff.RData"). The function save.image() is a short-cut for save(list=ls(all=TRUE), file=".RData").
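
Writing and reading can be checked with a small round trip, sketched below. The data frame d, its column names and the temporary file names are invented for this illustration; tempfile() is a standard R function used here so that no existing file is overwritten.

```r
# A small data.frame with invented values
d <- data.frame(id = c("A", "B", "C"), height = c(1.50, 1.60, 1.70))

# Text format: write a tab-separated file, then read it back
f <- tempfile(fileext = ".txt")
write.table(d, file = f, sep = "\t", quote = FALSE, row.names = FALSE)
d2 <- read.table(f, header = TRUE, sep = "\t")
print(d2)

# Binary format: save the object, delete it, then restore it
rd <- tempfile(fileext = ".RData")
save(d, file = rd)
rm(d)
load(rd)                     # d is back in memory under its original name
print(d)
```

Note that load() restores the objects under the names they had when they were saved, whereas the result of read.table() must be assigned to a name chosen by the user.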

3.4 Generating data

3.4.1 Regular sequences

A regular sequence of integers, for example from 1 to 30, can be generated with:

> x <- 1:30

The resulting vector x has 30 elements. The operator ‘:’ has priority over the arithmetic operators within an expression:

> 1:10-1
 [1] 0 1 2 3 4 5 6 7 8 9
> 1:(10-1)
[1] 1 2 3 4 5 6 7 8 9

The function seq() can generate sequences of real numbers as follows:

> seq(1, 5, 0.5)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

where the first number indicates the start of the sequence, the second one the end, and the third one the increment to be used to generate the sequence. One can also use:

> seq(length=9, from=1, to=5)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

It is also possible to type the values directly using the function c():

> c(1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

which gives exactly the same result, but is obviously longer. We shall see later that the function c() is more useful in other situations.
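
The equivalence of the three ways of building the same sequence can be verified directly. In this sketch the names a, b, s1, s2 and s3 are chosen for illustration, and the arguments of seq() are written in full (by, length.out) where the text uses positional or abbreviated forms.

```r
x <- 1:30
print(length(x))                             # number of elements of x

a <- 1:10 - 1        # ':' is evaluated first, so this is (1:10) - 1
b <- 1:(10 - 1)      # the parentheses force the sequence 1:9
print(a)
print(b)

s1 <- seq(1, 5, by = 0.5)                    # start, end, increment
s2 <- seq(from = 1, to = 5, length.out = 9)  # start, end, number of values
s3 <- c(1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5)   # values typed directly
print(all.equal(s1, s2))
print(all.equal(s1, s3))
```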
The function rep() creates a vector whose elements are all identical:

    > rep(1, 30)
     [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

The function sequence() creates a series of sequences of integers, each ending at one of the numbers given as arguments:

    > sequence(4:5)
    [1] 1 2 3 4 1 2 3 4 5
    > sequence(c(10,5))
     [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5

The function gl() is very useful because it generates regular series of factor variables. The usage of this function is gl(k, n), where k is the number of levels (or classes) and n is the number of replications in each level. Two options may be used: length to specify the number of data produced, and labels to specify the names of the levels. Examples:

    > gl(3,5)
     [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
    Levels: 1 2 3
    > gl(3,5,30)
     [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
    Levels: 1 2 3
    > gl(2,8, labels=c("Control","Treat"))
     [1] Control Control Control Control Control Control Control Control Treat
    [10] Treat   Treat   Treat   Treat   Treat   Treat   Treat
    Levels: Control Treat
    > gl(2, 1, 20)
     [1] 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2
    Levels: 1 2
    > gl(2, 2, 20)
     [1] 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2
    Levels: 1 2

Finally, expand.grid() creates a data.frame with all combinations of the vectors or factors given as arguments:

    > expand.grid(h=seq(60,80,10), w=seq(100,300,100), sex=c("Male","Female"))
        h   w    sex
    1  60 100   Male
    2  70 100   Male
    3  80 100   Male
    4  60 200   Male
    5  70 200   Male
    6  80 200   Male
    7  60 300   Male
    8  70 300   Male
    9  80 300   Male
    10 60 100 Female
    11 70 100 Female
    12 80 100 Female
    13 60 200 Female
    14 70 200 Female
    15 80 200 Female
    16 60 300 Female
    17 70 300 Female
    18 80 300 Female

3.3.2 Random sequences

It is common in statistics to generate random data, and R can do so for a large number of probability distributions. These functions all have the following form:

    > rfunc(n, p[1], p[2], ...)

where func indicates the probability law, n the number of data to generate, and p[1], p[2], ... the values of the parameters of the law. The following table gives the details for each law, with the default values where they exist (if no default value is indicated, the parameter must be specified by the user).

    law                      command
    Gaussian (normal)        rnorm(n, mean=0, sd=1)
    exponential              rexp(n, rate=1)
    gamma                    rgamma(n, shape, scale=1)
    Poisson                  rpois(n, lambda)
    Weibull                  rweibull(n, shape, scale=1)
    Cauchy                   rcauchy(n, location=0, scale=1)
    beta                     rbeta(n, shape1, shape2)
    ‘Student’ (t)            rt(n, df)
    Fisher (F)               rf(n, df1, df2)
    Pearson (χ2)             rchisq(n, df)
    binomial                 rbinom(n, size, prob)
    geometric                rgeom(n, prob)
    hypergeometric           rhyper(nn, m, n, k)
    logistic                 rlogis(n, location=0, scale=1)
    lognormal                rlnorm(n, meanlog=0, sdlog=1)
    negative binomial        rnbinom(n, size, prob)
    uniform                  runif(n, min=0, max=1)
    Wilcoxon’s statistics    rwilcox(nn, m, n), rsignrank(nn, n)

Note that in all these functions the letter r can be replaced with d, p or q to get, respectively, the probability density (dfunc(x)), the cumulative probability (pfunc(x)), and the quantile (qfunc(p), with 0 < p < 1).

3.4 Manipulating objects

3.4.1 Accessing a particular value of an object

To access, for example, the third value of a vector x, we just type x[3]. If x is a matrix or a data.frame, the value in the ith row and jth column is accessed with x[i,j]. To change all values of the third column, we can type:

    > x[,3] <- 10.2

This indexing system generalises easily to arrays, with as many indices as the array has dimensions (for example, for a three-dimensional array: x[i,j,k], x[,,3], ...). It is useful to keep in mind that indexing is done with square brackets, whereas parentheses are used for the arguments of a function:

    > x(1)
    Error: couldn’t find function "x"

Indexing can also be used to suppress one or several rows or columns. For example, x[-1,] will suppress the first row, and x[-c(1,15),] will do the same for the 1st and 15th rows.
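The indexing rules above can be tried on a small object. The following is only a sketch: the matrix x and its values are made up for illustration and do not come from the text.

```r
# A 4 x 3 matrix filled column by column (illustrative values only)
x <- matrix(1:12, nrow = 4, ncol = 3)

elem <- x[2, 3]              # value in row 2, column 3
x[, 3] <- 10.2               # replace every value of the third column
x_no_first <- x[-1, ]        # all rows except the first
x_no_ends  <- x[-c(1, 4), ]  # all rows except the 1st and the 4th

elem
x
dim(x_no_first)
dim(x_no_ends)
```

Negative indices only remove rows or columns from the result; x itself is unchanged unless the result is assigned back to it.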
For vectors, matrices and arrays, it is possible to access the values of the elements with a comparison expression as index:

    > x <- 1:10
    > x[x >= 5] <- 20
    > x
     [1]  1  2  3  4 20 20 20 20 20 20
    > x[x == 1] <- 25
    > x
     [1] 25  2  3  4 20 20 20 20 20 20

The six comparison operators used by R are: < (less than), > (greater than), <= (less than or equal to), >= (greater than or equal to), == (equal to), and != (different from). Note that these operators return a variable of mode logical (TRUE or FALSE).

3.4.2 Arithmetics and simple functions

There are numerous functions in R to manipulate data. We have already seen the simplest one, c(), which concatenates the objects listed in parentheses. For example:

    > c(1:5, seq(10, 11, 0.2))
     [1]  1.0  2.0  3.0  4.0  5.0 10.0 10.2 10.4 10.6 10.8 11.0

Vectors can be manipulated with classical arithmetic expressions:

    > x <- c(1,2,3,4)
    > y <- c(1,1,1,1)
    > z <- x + y
    > z
    [1] 2 3 4 5

Vectors of different lengths can be added; in this case, the shortest vector is recycled. Examples:

    > x <- c(1,2,3,4)
    > y <- c(1,2)
    > z <- x+y
    > z
    [1] 2 4 4 6
    > x <- c(1,2,3)
    > y <- c(1,2)
    > z <- x+y
    Warning message:
    longer object length
     is not a multiple of shorter object length in: x + y
    > z
    [1] 2 4 4

Note that R has given a warning message and not an error message, thus the operation has been done.
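To make the recycling rule concrete, the sketch below compares the implicit recycling with an explicit call to rep(), and then reuses a comparison as an index. The variable names z_explicit and big are hypothetical and introduced only for this illustration.

```r
x <- c(1, 2, 3, 4)
y <- c(1, 2)

z <- x + y                                        # y is recycled as 1 2 1 2
z_explicit <- x + rep(y, length.out = length(x))  # the same thing, spelled out

identical(z, z_explicit)

# A comparison returns a logical vector, which can be used as an index
big <- z > 3
big
z[big]
```

Writing the recycling out with rep() makes the intent visible and avoids surprises when the lengths are not multiples of each other.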
If we want to add (or multiply) the same value to all the elements of a vector:

    > x <- c(1,2,3,4)
    > a <- 10
    > z <- a*x
    > z
    [1] 10 20 30 40

The arithmetic operators are +, -, *, /, and ^ for powers.

Two other useful operators are x %% y for “x modulo y”, and x %/% y for integer division (it returns the integer part of the division of x by y).

The functions available in R are too many to be listed here. One can find all the basic mathematical functions (log, exp, log10, log2, sin, cos, tan, asin, acos, atan, abs, sqrt, ...), special functions (gamma, digamma, beta, besselI, ...), as well as diverse functions useful in statistics. Some of these functions are detailed in the following table.

    sum(x)                  sum of the elements of x
    prod(x)                 product of the elements of x
    max(x)                  maximum of the elements of x
    min(x)                  minimum of the elements of x
    which.max(x)            returns the index of the greatest element of x
    which.min(x)            returns the index of the smallest element of x
    range(x)                gives the same result as c(min(x),max(x))
    length(x)               number of elements in x
    mean(x)                 mean of the elements of x
    median(x)               median of the elements of x
    var(x) or cov(x)        variance of the elements of x (calculated on n - 1); if x is a matrix or a data.frame, the variance-covariance matrix is calculated
    cor(x)                  correlation matrix of x if it is a matrix or a data.frame (1 if x is a vector)
    var(x,y) or cov(x,y)    covariance between x and y, or between the columns of x and the columns of y if they are matrices or data.frames
    cor(x,y)                linear correlation between x and y, or correlation matrix if they are matrices or data.frames

These functions return a single value (thus a vector of length one), except range(), which returns a vector of length two, and var(), cov() and cor(), which may return a matrix. The following functions return more complex results.

    round(x,n)       rounds the elements of x to n decimals
    rev(x)           reverses the elements of x
    sort(x)          sorts the elements of x in increasing order; to sort in decreasing order: rev(sort(x))
    rank(x)          ranks of the elements of x
    log(x,base)      computes the logarithm of x with base base
    pmin(x,y,...)    a vector whose ith element is the minimum of x[i], y[i], ...
    pmax(x,y,...)    id. for the maximum
    cumsum(x)        a vector whose ith element is the sum from x[1] to x[i]
    cumprod(x)       id. for the product
    cummin(x)        id. for the minimum
    cummax(x)        id. for the maximum
    match(x,y)       returns a vector of the same length as x giving, for each element of x, its position in y (NA if it is not in y)
    which(x==a)      returns a vector of the indices of x for which the comparison is true (TRUE), i.e. the values of i for which x[i]==a; the argument of this function must be a variable of mode logical
    which(x!=a)      id. for the values of i for which x[i]!=a
    choose(n,k)      computes the number of combinations of k events among n repetitions = n!/[(n - k)!k!]
    na.omit(x)       suppresses the observations with missing data (NA) (suppresses the corresponding row if x is a matrix or a data.frame)
    na.fail(x)       returns an error message if x contains NA(s)
    table(x)         returns a table with the counts of the different values of x (typically for integers or factors)
    subset(x,...)    returns a selection of x with respect to criteria (...) depending on the mode of x (typically comparisons: x$V1 < 10); if x is a data.frame, the option select allows the user to identify the variables to be kept (or dropped, using a minus sign -)
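A few of the functions from the two tables above can be tried on small vectors. This is only a sketch; the vectors x and y below are made up for illustration.

```r
x <- c(3, 1, 4, 1, 5)
y <- c(1, 4, 9)

cs  <- cumsum(x)      # running sum of x
pm  <- pmin(x, 2)     # element-wise minimum of x[i] and 2
pos <- match(x, y)    # position in y of each element of x, NA if absent
idx <- which(x == 1)  # indices at which the comparison is TRUE
tab <- table(x)       # how many times each distinct value occurs

cs
pm
pos
idx
tab
```

Note that match() returns positions in its second argument, whereas which() returns positions in the vector being compared.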
3.4.3 Matrix computation

R has facilities for matrix computation and manipulation. A matrix can be created with the function matrix():

    > matrix(data=5, nr=2, nc=2)
         [,1] [,2]
    [1,]    5    5
    [2,]    5    5
    > matrix(1:6, nr=2, nc=3)
         [,1] [,2] [,3]
    [1,]    1    3    5
    [2,]    2    4    6

The functions rbind() and cbind() bind matrices by rows or by columns, respectively:

    > m1 <- matrix(data=1, nr=2, nc=2)
    > m2 <- matrix(data=2, nr=2, nc=2)
    > rbind(m1,m2)
         [,1] [,2]
    [1,]    1    1
    [2,]    1    1
    [3,]    2    2
    [4,]    2    2
    > cbind(m1,m2)
         [,1] [,2] [,3] [,4]
    [1,]    1    1    2    2
    [2,]    1    1    2    2

The operator for the product of two matrices is ‘%*%’. For example, considering the two matrices m1 and m2 above:

    > rbind(m1,m2) %*% cbind(m1,m2)
         [,1] [,2] [,3] [,4]
    [1,]    2    2    4    4
    [2,]    2    2    4    4
    [3,]    4    4    8    8
    [4,]    4    4    8    8
    > cbind(m1,m2) %*% rbind(m1,m2)
         [,1] [,2]
    [1,]   10   10
    [2,]   10   10

The transposition of a matrix is done with the function t(); this function also works with a data.frame.

The function diag() can be used to extract or modify the diagonal of a matrix, or to build a diagonal matrix.
    > diag(m1)
    [1] 1 1
    > diag(rbind(m1,m2) %*% cbind(m1,m2))
    [1] 2 2 8 8
    > v <- c(10,20,30)
    > diag(v)
         [,1] [,2] [,3]
    [1,]   10    0    0
    [2,]    0   20    0
    [3,]    0    0   30
    > diag(2.1, nr=3, nc=5)
         [,1] [,2] [,3] [,4] [,5]
    [1,]  2.1  0.0  0.0    0    0
    [2,]  0.0  2.1  0.0    0    0
    [3,]  0.0  0.0  2.1    0    0
    > diag(3)
         [,1] [,2] [,3]
    [1,]    1    0    0
    [2,]    0    1    0
    [3,]    0    0    1
    > diag(m1) <- 10
    > m1
         [,1] [,2]
    [1,]   10    1
    [2,]    1   10

4 Graphics with R

R offers a remarkable variety of graphics. To get an idea, one can type demo(graphics). It is not possible to detail here the possibilities of R in terms of graphics, particularly since each graphic function has a large number of options, which makes the production of graphics very flexible. I will first give a few details on how to manage graphic windows.

4.1 Managing graphic windows

4.1.1 Opening several graphic windows

When a graphic function is typed, a graphic window is opened with the required graph. It is possible to open another window by typing:

    > x11()

The window opened in this way becomes the active window, and the subsequent graphs will be displayed on it. To list the graphic windows which are currently open:

    > dev.list()
    windows windows
          2       3

The figures displayed under windows are the numbers of the windows, which can be used to change the active window:

    > dev.set(2)
    windows
          2

4.1.2 Partitioning a graphic window

The function split.screen() partitions the active graphic window.
For instance, split.screen(c(1,2)) divides the window into two parts which can be selected with screen(1) or screen(2); erase.screen() erases the last drawn graph.

The function layout() allows more complex partitions: it partitions the active graphic window into several parts where the graphs will be displayed successively. For example, to divide the window into four equal parts:

    > layout(matrix(c(1,2,3,4), 2, 2))

where the vector gives the numbers of the sub-windows, and the two 2s indicate that the window will be divided into two rows and two columns. The command:

    > layout(matrix(c(1,2,3,4,5,6), 3, 2))

will create six sub-windows arranged in three rows and two columns, whereas:

    > layout(matrix(c(1,2,3,4,5,6), 2, 3))

will also create six sub-windows, but arranged in two rows and three columns. The sub-windows may be of different sizes:

    > layout(matrix(c(1,2,3,3), 2, 2))

will open two sub-windows, one above the other, in the left half of the window, and a third sub-window in the right half. Finally, to create an inset in a graphic:

    > layout(matrix(c(1,1,2,1), 2, 2), c(3,1), c(1,3))

the vectors c(3,1) and c(1,3) giving the relative dimensions of the sub-windows.

To visualize the partition created by layout() before drawing the graphs, we can use the function layout.show(2), if, for example, two sub-windows have been defined.
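The steps above can be put together in a short sketch. It assumes a non-interactive session, so it opens a null PDF device with pdf(NULL) instead of a screen window; the data plotted and the name n_panels are arbitrary choices for illustration.

```r
# Null device: nothing is written to disk (assumption: non-interactive use;
# in an interactive session this line and dev.off() can be omitted)
pdf(NULL)

# Two stacked sub-windows on the left, one tall sub-window on the right
m <- matrix(c(1, 2, 3, 3), 2, 2)
n_panels <- layout(m)    # layout() returns the number of sub-windows

layout.show(n_panels)    # outline the partition before drawing

# Each high-level plot fills the next sub-window in turn
plot(1:10, main = "sub-window 1")
plot(10:1, main = "sub-window 2")
hist(c(1, 2, 2, 3, 3, 3, 4, 4, 5), main = "sub-window 3")

dev.off()
```

Passing the value returned by layout() to layout.show() avoids counting the sub-windows by hand.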
4.2 Graphic functions

Here is a brief overview of the graphic functions in R.

    plot(x)                plot of the values of x (on the y-axis) ordered on the x-axis
    plot(x,y)              bivariate plot of x (on the x-axis) and y (on the y-axis)
    sunflowerplot(x,y)     id. as plot() but the points with similar coordinates are drawn as flowers whose number of petals represents the number of points
    piechart(x)            circular pie-chart
    boxplot(x)             “box-and-whiskers” plot
    stripplot(x)           plot of the values of x on a line (an alternative to boxplot() for small sample sizes)
    coplot(x~y|z)          bivariate plot of x and y for each value of z (if z is a factor)
    interaction.plot(f1,f2,y)
                           if f1 and f2 are factors, plots the means of y (on the y-axis) with respect to the values of f1 (on the x-axis) and of f2 (different curves); the option fun= allows the user to choose the summary statistic of y (by default fun=mean)
    matplot(x,y)           bivariate plot of the first column of x vs. the first one of y, the second one of x vs. the second one of y, etc.
    dotplot(x)             if x is a data.frame, plots a Cleveland dot plot (stacked plots line-by-line and column-by-column)
    pairs(x)               if x is a matrix or a data.frame, draws all possible bivariate plots between the columns of x
    plot.ts(x)             if x is an object of class ts, plot of x with respect to time; x may be multivariate but the series must have the same frequency and dates
    ts.plot(x)             id. but if x is multivariate the series may have different dates and must have the same frequency [3]
    hist(x)                histogram of the frequencies of x
    barplot(x)             histogram of the values of x
    qqnorm(x)              quantiles of x with respect to the values expected under a normal law
    qqplot(x,y)            quantiles of y with respect to the quantiles of x
    contour(x,y,z)         creates a contour plot (data are interpolated to draw the curves); x and y must be vectors and z must be a matrix so that dim(z)=c(length(x),length(y))
    image(x,y,z)           id. but with colours (actual data are plotted)
    persp(x,y,z)           id. but in 3-D (actual data are plotted)

[3] The function ts.plot() is in the package ts and not in base as for the other graphic functions listed in the table (see § 5 for details on packages in R).

For each function, the options may be found with the on-line help in R. Some of these options are identical for several graphic functions; here are the main ones (with their possible default values):
    add=FALSE        if TRUE superposes the plot on the previous one (if it exists)
    axes=TRUE        if FALSE does not draw the axes
    type="p"         specifies the type of plot, "p": points, "l": lines, "b": points connected by lines, "o": id. but the lines are over the points, "h": vertical lines, "s": steps, the data are represented by the top of the vertical lines, "S": id. but the data are represented by the bottom of the vertical lines
    xlab=, ylab=     annotates the axes, must be variables of mode character (either a character variable, or a string within "")
    main=            main title, must be a variable of mode character
    sub=             sub-title (written in a smaller font)

4.3 Low-level plotting commands

R has a set of graphic functions which affect an already existing graph: they are called low-level plotting commands. Here are the main ones:

    points(x,y)              adds points (the option type= can be used)
    lines(x,y)               id. but with lines
    text(x,y,labels,...)     adds text given by labels at coordinates (x,y); a typical usage is: plot(x,y,type="n"); text(x,y,names)
    segments(x0,y0,x1,y1)    draws a line from point (x0,y0) to point (x1,y1)
    arrows(x0,y0,x1,y1, angle=30, code=2)
                             id. with an arrow at point (x1,y1) if code=2, at point (x0,y0) if code=1, or at both points if code=3; angle controls the angle from the shaft of the arrow to the edge of the arrow head
    abline(a,b)              draws a line of slope b and intercept a
    abline(h=y)              draws a horizontal line at ordinate y
    abline(v=x)              draws a vertical line at abscissa x
    abline(lm.obj)           draws the regression line given by lm.obj (see § 5)
    rect(x1,y1,x2,y2)        draws a rectangle whose left, right, bottom, and top limits are x1, x2, y1, and y2, respectively
    polygon(x,y)             draws a polygon linking the points with coordinates given by x and y
    legend(x,y,legend)       adds the legend at the point (x,y) with symbols given by legend
    title()                  adds a title and optionally a sub-title
    axis(side,vect)          adds an axis at the bottom (side=1), on the left (2), at the top (3), or on the right (4); vect (optional) gives the abscissae (or ordinates) where tick-marks are drawn
    rug(x)                   draws the data x on the x-axis as small vertical lines
    locator(n, type="n", ...)
                             returns the coordinates (x,y) after the user has clicked n times on the plot with the mouse; also draws symbols (type="p") or lines (type="l") with respect to optional graphic parameters (...); by default nothing is drawn (type="n")
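As a sketch of how these low-level commands are layered on an existing plot, the example below starts from an empty plot and adds points, a regression line, a horizontal line, a label and a legend. The data and the coordinates of the annotations are made up for illustration, and pdf(NULL) is an assumption for non-interactive use.

```r
pdf(NULL)  # null device (assumption: non-interactive session)

x <- 1:10
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9)

plot(x, y, type = "n")        # set up the axes without drawing the data
points(x, y)                  # add the observations
fit <- lm(y ~ x)
abline(fit)                   # regression line taken from the lm object
abline(h = mean(y), lty = 2)  # dashed horizontal line at the mean of y
text(2, 18, "fitted line", pos = 4)
legend(6, 6, legend = c("data", "regression", "mean"),
       pch = c(1, NA, NA), lty = c(NA, 1, 2))

dev.off()
```

Starting with type="n" gives full control over the order in which the elements are drawn.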
Note the possibility of adding mathematical expressions to a plot with text(x, y, expression(...)), where the function expression() transforms its argument into a mathematical equation according to a coding similar to the one used in the type-setting system TeX. For example, text(x, y, expression(Uk[37]==over(1, 1+e^{-epsilon*(T-theta)}))) will display, on the plot, the following equation at the point of coordinates (x,y):

    $$Uk_{37} = \frac{1}{1 + e^{-\epsilon (T - \theta)}}$$

To include a variable in an expression we can use the function substitute() together with the function as.expression(); for example, to include a value of R^2 (previously computed and stored in an object named Rsquared):

    > text(x, y, as.expression(substitute(R^2==r, list(r=Rsquared))))

will display on the plot at the point of coordinates (x,y):

    $$R^2 = 0.9856298$$

To display only three decimals, we can modify the code as follows:

    > text(x, y, as.expression(substitute(R^2==r, list(r=round(Rsquared,3)))))

which will result in:

    $$R^2 = 0.986$$

Finally, to write the R in italics (as is the mathematical convention):

    > text(x, y, as.expression(substitute(italic(R)^2==r, list(r=round(Rsquared,3)))))

    $$\mathit{R}^2 = 0.986$$

4.4 Graphic parameters

In addition to low-level plotting commands, the presentation of graphics can be improved with graphic parameters. They can be used either as options of graphic functions (but this does not work for all of them), or with the function par() to change the graphic parameters permanently, i.e. the subsequent plots will be drawn with respect to the parameters specified by the user. For instance, the following command:

    > par(bg="yellow")

will draw all subsequent plots with a yellow background. There are 68 graphic parameters, some of which have very similar functions. The exhaustive list of graphic parameters can be read with ?par; I will limit the following table to the most usual ones.

    adj      controls text justification (0 left-justified, 0.5 centered, 1 right-justified)
    bg       specifies the colour of the background (e.g.: bg="red", bg="blue", ...; the list of the 657 available colours is displayed with colors())
    bty      controls the type of box drawn around the plot, allowed values are: "o", "l", "7", "c", "u" or "]" (the box looks like the corresponding character); if bty="n" the box is not drawn
    cex      a value controlling the size of texts and symbols with respect to the default; the following parameters have the same control for numbers on the axes, cex.axis, annotations on the axes, cex.lab, the title, cex.main, and the sub-title, cex.sub
    col      controls the colour of symbols; as for cex there are: col.axis, col.lab, col.main, col.sub
    font     an integer which controls the style of text (1: normal, 2: bold, 3: italics, 4: bold italics); as for cex there are: font.axis, font.lab, font.main, font.sub
    las      an integer which controls the orientation of annotations on the axes (0: parallel to the axes, 1: horizontal, 2: perpendicular to the axes, 3: vertical)
    lty      controls the type of lines, can be an integer (1: solid, 2: dashed, 3: dotted, 4: dotdash, 5: longdash, 6: twodash), or a string of up to eight characters (between "0" and "9") which specifies alternately the length, in points or pixels, of the drawn elements and the blanks, for example lty="44" will have the same effect as lty=2
    lwd      a numeric which controls the width of lines
    mar      a vector of 4 numeric values which control the space between the axes and the border of the figure, of the form c(bottom, left, top, right); the default values are c(5.1, 4.1, 4.1, 2.1)
    mfcol    a vector of the form c(nr,nc) which partitions the graphic window as a matrix of nr rows and nc columns; the plots are then drawn in columns (cf. § 4.1.2)
    mfrow    id. but the plots are drawn in rows (cf. § 4.1.2)
    pch      controls the type of symbol, either an integer between 1 and 25, or any single character within ""
    ps       an integer which controls the size in points of texts and symbols
    pty      a character which specifies the type of the plotting region, "s": square, "m": maximal
    tck      a value which specifies the length of tick-marks on the axes as a fraction of the smallest of the width or height of the plot; if tck=1 a grid is drawn
    tcl      a value which specifies the length of tick-marks on the axes as a fraction of the height of a line of text (by default tcl=-0.5)
    xaxt     if xaxt="n" the x-axis is set but not drawn (useful in conjunction with axis(side=1, ...))
    yaxt     if yaxt="n" the y-axis is set but not drawn (useful in conjunction with axis(side=2, ...))
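Because par() changes the parameters for all subsequent plots, a common pattern is to save the previous values and restore them afterwards. The sketch below relies on the fact that par() returns the old values of the parameters it changes; the parameter values and the names old, current and restored are arbitrary illustrations, and pdf(NULL) is an assumption for non-interactive use.

```r
pdf(NULL)  # null device (assumption: non-interactive session)

# par() returns the previous values of the parameters being changed
old <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1), bg = "yellow")

# Parameters can also be passed as options of a graphic function,
# in which case they apply to that call only
plot(1:10, pch = 16, col = "blue", main = "pch = 16")
plot(1:10, type = "l", lty = 2, lwd = 2, main = "lty = 2")

current <- par("mfrow")   # value while the new settings are active
par(old)                  # restore the saved settings
restored <- par("mfrow")

dev.off()
```

Restoring the saved values keeps a change made for one figure from leaking into the following ones.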
#   @ h  H   H " m   @ H     @        Þ 23 % 5 Statistical analyses with R ö ÷ q w r p q p r ø § § Even more than for graphics, it is impossible here to go in the details of the possibilities offered by R with respect to statistical analyses. A wide range of functions is available in the base package and in others distributed with base. ¥ ¥ ¥ ¥ ¥ ¥ G ¤ ¦ R § R ¥ G ¦ ¦ ¦ ` ` ¦ G R R ¦ c ¦ ¤ ¦ § ` ` € ¥ ¥ S ¥ S c S S U F ¥ ¥ ¥ ¥ ¥ S T ¤ G ¦ § ¥ ¦ ¦ ¥ F ¤ ¦ ¥ F T v ¥ › F U F S S › G Several contributed packages increase the potentialities of R. They are distributed separately and must be loaded in memory to be used by R. An exhaustive list of the contributed packages, together with their descriptions, is at the following URL: http://cran.rproject.org/src/contrib/PACKAGES.html. Among the most remarkable ones, there are: R § ¦ ¦ ¤ ` ¦ ¦ R ¦ ¤ ¦ v § ¦ R a ¥ ¥ F T ¥ G § ¦ ¥ F F U ¤ ` ¦ ¥ ¥ ¥ S T R ¦ ¤ ¥ G F § S ¥ G § F U ¦ S R F c § € ¥ ¥ ¥ F U ¥ ¥ S … ¥ c U S y T ¥ U F T b b S b U S … ƒ ¥ ¥ ¥ ¥ ¥ ¥ G ¥ ¥ ¥ G § § … ¥ ¥ ¥ … ¥ ¥ generalised estimating equations; multivariate analyses, includes correspondance analysis (by contrats to mva which is distributed with base) ; linear and nonlinear models with mixed-effects; survival analyses; trees and classification; time-series analyses (has more methods than ts which is distributed with base). ¥ ¥ R R ¦ R ¦ ¦ R ¥ S T S S ¤ G F F ¦ ¥ S U § ¦ ¦ ¥ ¦ ¦ ¥ S T ¤ G ¥ U ¦ G † … ¥ gee multiv F c ’ b U ¤ ™ e  § ¥ ¥ ¥ ¥ › F U F ¥ S T  ™ ¥ T c c U e ¥ R ¦ ¦ § ¥ ¦ ¦ ¥ ¦ ¤ ¦ ¤ ¤ F ¥ b ¤ ¥ ¥ ¦ S T ¤ nlme survival5 tree tseries ’ ’ ¤ ¥ ¥ › † ’ F U § S b F b § Jim K. 
Lindsey distributes on his site (http://alpha.luc.ac.be/~jlindsey/rcode.html) several interesting packages:

dna        manipulation of molecular sequences (includes the ports of
           ClustalW and of flip);
gnlm       nonlinear generalized models;
stable     probability functions and generalized regressions for stable
           distributions;
growth     models for normal repeated measures;
repeated   models for non-normal repeated measures;
event      models and procedures for historical processes (branching,
           Poisson, ...);
rmutil     tools for nonlinear regressions and repeated measures.

“An Introduction to R” (pp 51-63) gives an excellent introduction to statistical models with R. Only some points are given here so that a new user can take his or her first steps. There are six main statistical functions in the base package:

lm         linear models;
glm        generalised linear models;
aov        analysis of variance;
anova      comparison of models;
loglin     log-linear models;
nlm        nonlinear minimisation of functions.

For example, if we have two vectors x and y each with five observations, and we wish to perform a linear regression of y on x:

> x <- 1:5
> y <- rnorm(5)
> lm(y~x)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
     0.2252       0.1809

As for any function in R, the result of lm(y~x) can be copied in an object:

> mymodel <- lm(y~x)

If we type mymodel, the display will be the same as previously.
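Because y is drawn with rnorm(), the estimates change at each run. The sketch below is one way to make the example repeatable and to see what lm() estimates in this simple case. It assumes a fixed seed (set.seed() is not used in the text above) and compares the fit with the usual least-squares formulas for a single predictor, slope = cov(x, y)/var(x) and intercept = mean(y) - slope*mean(x). The function coef() returns the parameter estimates of a fitted model.

```r
set.seed(1)                  # assumption: fixed seed, only for repeatability
x <- 1:5
y <- rnorm(5)

mymodel <- lm(y ~ x)         # the fit is stored in an object

# Least-squares estimates computed by hand for one predictor
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# Both lines should display the same two values (up to rounding)
print(c(intercept, slope))
print(coef(mymodel))
```

The two results agree up to numerical precision, which shows that the object returned by lm() holds the ordinary least-squares fit.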
Several functions allow the user to display details relative to a statistical model. Among the most useful ones, summary() displays details on the results of a model fitting procedure (statistical tests, ...), residuals() displays the regression residuals, predict() displays the values predicted by the model, and coef() displays a vector with the parameter estimates.

> summary(mymodel)

Call:
lm(formula = y ~ x)

Residuals:
      1       2       3       4       5
 1.0070 -1.0711 -0.2299 -0.3550  0.6490

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.2252     1.0062   0.224    0.837
x             0.1809     0.3034   0.596    0.593

Residual standard error: 0.9594 on 3 degrees of freedom
Multiple R-Squared: 0.1059,     Adjusted R-squared: -0.1921
F-statistic: 0.3555 on 1 and 3 degrees of freedom,      p-value: 0.593

> residuals(mymodel)
         1          2          3          4          5
 1.0070047 -1.0710587 -0.2299374 -0.3549681  0.6489594

> predict(mymodel)
        1         2         3         4         5
0.4061329 0.5870257 0.7679186 0.9488115 1.1297044

> coef(mymodel)
(Intercept)           x
  0.2252400   0.1808929

It may be useful to use these values in subsequent computations, for instance, to compute the values predicted by a model for new data:

> a <- coef(mymodel)[1]
> b <- coef(mymodel)[2]
> newdata <- c(11, 13, 18)
> a + b*newdata
[1] 2.215062 2.576847 3.481312

To list the elements in the results of an analysis, we can use the function names(); in fact, this function may be used with any object in R.
> names(mymodel)
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"

> names(summary(mymodel))
 [1] "call"          "terms"         "residuals"     "coefficients"
 [5] "sigma"         "df"            "r.squared"     "adj.r.squared"
 [9] "fstatistic"    "cov.unscaled"

The elements may be extracted in the following way:

> summary(mymodel)["r.squared"]
$r.squared
[1] 0.09504547

Formulae are a key ingredient in statistical analyses with R: the notation used is actually the same for (almost) all functions. A formula is typically of the form y ~ model where y is the analysed response and model is a set of terms for which some parameters are to be estimated. These terms are separated with arithmetic symbols, but these symbols have a particular meaning here.

a+b        additive effects of a and of b
a:b        interactive effect between a and b
a*b        identical to a+b+a:b
poly(a,n)  polynomials of a up to degree n
^n         includes all interactions up to level n, for example
           (a+b+c)^2 is identical to a+b+c+a:b+a:c+b:c
b%in%a     the effects of b are nested in a (identical to a+a:b)
a-b        removes the effect of b, for example: (a+b+c)^2-a:b is
           identical to a+b+c+a:c+b:c; y~x-1 forces the regression
           through the origin (same for y~x+0 or y~0+x)

We see that the arithmetic operators of R have a different meaning in a formula than in a classical expression.
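The equivalences listed in the table can be checked directly. The sketch below relies on terms(), a function not covered in this document, which expands a formula without needing any data; the helper name expand is hypothetical and only used for this illustration.

```r
# Hypothetical helper: list the terms that a formula expands to
expand <- function(f) attr(terms(f), "term.labels")

# a*b gives the same terms as a+b+a:b
identical(expand(y ~ a * b), expand(y ~ a + b + a:b))

# (a+b+c)^2 gives the main effects and all two-way interactions
setequal(expand(y ~ (a + b + c)^2),
         expand(y ~ a + b + c + a:b + a:c + b:c))

# Subtracting a term removes it from the expansion
setequal(expand(y ~ (a + b + c)^2 - a:b),
         expand(y ~ a + b + c + a:c + b:c))

# The "intercept" attribute is 1 when an intercept is fitted, 0 otherwise
attr(terms(y ~ x), "intercept")
attr(terms(y ~ x - 1), "intercept")
```

Each comparison should return TRUE, and the last two lines show that y~x-1 drops the intercept, which is what forces the regression through the origin.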
For example, the formula y~x1+x2 defines the model $y = \beta_1 x_1 + \beta_2 x_2 + \alpha$, and not (as it would if the operator + had its usual meaning) $y = \beta (x_1 + x_2) + \alpha$. To include arithmetic operations in a formula, we can use the function I(): the formula y~I(x1+x2) defines the model $y = \beta (x_1 + x_2) + \alpha$.

The following table lists standard packages which are distributed with base.

Package    Description

ctest      classical tests (Fisher, “Student”, Wilcoxon, Pearson, Bartlett,
           Kolmogorov-Smirnov, ...)
eda        methods described in “Exploratory data analysis” by Tukey
lqs        resistant regression and estimation of covariance
modreg     modern regression: smoothing and local regression
mva        multivariate analyses
nls        nonlinear regression
splines    splines
stepfun    empirical distribution functions
ts         time-series analyses

A package must be loaded in memory to be used:

> library(eda)

6 The programming language R

6.1 Loops and conditional executions

An advantage of R compared to software with pull-down menus is the possibility of simply programming a series of analyses which will be executed successively. Let us consider a few examples to get an idea.

Suppose we have a vector x, and for each element of x with the value b, we want to give the value 0 to another variable y, else 1.
We first create a vector y of the same length as x:

> y <- numeric(length(x))
> for (i in 1:length(x)) if (x[i] == b) y[i] <- 0 else y[i] <- 1

Several instructions can be executed if they are placed within braces:

> for (i in 1:length(x))
> {
>     y[i] <- 0
      ...
> }

> if (x[i] == b)
> {
>     y[i] <- 0
      ...
> }

Another possible situation is to execute an instruction as long as a condition is true:

> while (myfun > minimum)
> {
      ...
> }

Typically, an R program is written in a file saved in ASCII format and named with the extension .R. In the following example, we want to do the same plot for three different species, the data being in three distinct files; the file names and species names are thus used as variables. The first command partitions the graphic window into three parts arranged as rows.

layout(matrix(c(1,2,3), 3, 1))       # partition the window
for(i in 1:3) {
    if (i==1) { file <- "Swal.dat"; species <- "swallow" }
    if (i==2) { file <- "Wren.dat"; species <- "wren" }
    if (i==3) { file <- "Dunn.dat"; species <- "dunnock" }
    data <- read.table(file)         # read the data
    plot(data$V1, data$V2, type="l")
    title(species)                   # add the title
}

The character # is used to add comments in a program: R ignores the rest of the line and goes to the next one. Note that there are no quotes "" around file in the function read.table() since it is a variable of mode character. The command title adds a title on a plot already displayed.
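Coming back to the first loop of this section, the same result can often be obtained without for. The sketch below uses ifelse(), a vectorised function which is not introduced in this document, and compares it with the loop; the values of x and b are made up for the example.

```r
x <- c(3, 7, 3, 1, 7)        # made-up data
b <- 3                       # made-up value to look for

# Loop version, as written in the text
y <- numeric(length(x))
for (i in 1:length(x)) if (x[i] == b) y[i] <- 0 else y[i] <- 1

# Vectorised version: the test x == b is applied to all elements at once
y2 <- ifelse(x == b, 0, 1)

identical(y, y2)             # both versions give the same vector
```

For short vectors the two forms are interchangeable; the vectorised form is usually shorter to write and avoids creating y beforehand.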
A variant of this program is given by:

layout(matrix(c(1,2,3), 3, 1))       # partition the window
species <- c("swallow", "wren", "dunnock")
file <- c("Swal.dat", "Wren.dat", "Dunn.dat")
for(i in 1:3) {
    data <- read.table(file[i])      # read the data
    plot(data$V1, data$V2, type="l")
    title(species[i])                # add the title
}

These programs will work correctly if the data files *.dat are located in the working directory of R; if they are not, the user must either change the working directory, or specify the path in the program (for example: file <- "C:/data/Swal.dat"). If the program is written in the file Mybirds.R, it will be called by typing:

> source("C:/data/Mybirds.R")

or by selecting it with the appropriate pull-down menu under Windows. Note that you must use the symbol “slash” (/) and not “backslash” (\), even under Windows.

6.2 Writing your own functions

We have seen that most of the work of R is done with functions whose arguments are given within parentheses. The user can actually write his/her own functions, and they will have the same properties as any other function in R. Writing your own functions allows a more efficient, flexible, and rational use of R. Let us come back to the above example of reading data in a file, then plotting them.
If we want to do a similar analysis with any species, it may be a good idea to write a function to do this job:

myfun <- function(S, F) {
    data <- read.table(F)
    plot(data$V1, data$V2)
    title(S)
}

Then, we can, with a single command, read the data and plot them, for example myfun("swallow", "Swal.dat"). To do as in the two previous programs, we can type:

> layout(matrix(c(1,2,3), 3, 1))
> myfun("swallow", "Swal.dat")
> myfun("wren", "Wren.dat")
> myfun("dunnock", "Dunn.dat")

As a second example, here is a function to get a bootstrap sample with a pseudo-random resampling of a variable x. The technique used here is to select randomly an observation with the pseudo-random number generator according to the uniform law; the operation is repeated as many times as the number of observations. The first step is to extract the sample size of x with the function length and store it in n; then x is copied in a vector named sample (this operation ensures that sample will have the same characteristics (mode, ...) as x). A random number uniformly distributed between 0 and n is drawn and rounded up to the next integer value using the function ceiling(), which is a variant of round() (see ?round for more details and other variants): this results in drawing randomly an integer between 1 and n. The corresponding value of x is extracted and stored in sample, which is finally returned using the function return().
bootsamp <- function(x) {
    n <- length(x)
    sample <- x
    for (i in 1:n) {
        u <- ceiling(runif(1, 0, n))
        sample[i] <- x[u]
    }
    return(sample)
}

Thus, one can, with a few relatively simple lines of code, program a bootstrap method with R. This function can then be called by a second function to compute the standard error of an estimated parameter, for instance the mean:

meanboot <- function(x, rep=500) {
    M <- numeric(rep)
    for (i in 1:rep) M[i] <- mean(bootsamp(x))
    print(sqrt(var(M)))    # standard error = square root of the variance
}

Note the default value of the argument rep, so that it can be omitted if we are satisfied with 500 replications. The two following commands are thus equivalent:

> meanboot(x)
> meanboot(x, rep=500)

If we want to increase the number of replications, we can do, for example:

> meanboot(x, rep=5000)

The use of default values in the definition of a function is, of course, very useful and adds to the flexibility of the system.

The last example of function is not purely statistical, but it illustrates well the flexibility of R.
Suppose we wish to study the behaviour of a nonlinear model, Ricker’s model, defined by:

$$N_{t+1} = N_t \exp\left[ r \left( 1 - \frac{N_t}{K} \right) \right]$$

This model is widely used in population dynamics, particularly of fish. We want, using a function, to simulate this model with respect to the growth rate r and the initial number in the population N0 (the carrying capacity K is often taken equal to 1 and this value will be taken as default); the results will be displayed as a plot of numbers with respect to time. We will add an option to allow the user to display only the numbers in the last time steps (by default all results will be plotted). The function below can do this numerical analysis of Ricker’s model.

ricker <- function(nzero, r, K=1, time=100, from=0, to=time) {
    N <- numeric(time+1)
    N[1] <- nzero
    for (i in 1:time) N[i+1] <- N[i]*exp(r*(1 - N[i]/K))
    Time <- 0:time
    plot(Time, N, type="l", xlim=c(from,to))
}

Try it yourself with:

> layout(matrix(1:3, 3, 1))
> ricker(0.1, 1); title("r = 1")
> ricker(0.1, 2); title("r = 2")
> ricker(0.1, 3); title("r = 3")

7 How to go farther with R ?

The basic reference on R is a collective document by its developers (the “R Development Core Team”):

R Development Core Team. 2000. An Introduction to R. http://cran.r-project.org/doc/manuals/R-intro.pdf.
If you install the latest version of R (1.1.1), you will find in the directory RHOME/doc/manual/ (RHOME is the path where R is installed) three files in PDF format, including “An Introduction to R” and the reference manual “The R Reference Index” (refman.pdf) detailing all functions of R.

The R-FAQ gives a very general introduction to R: http://cran.r-project.org/doc/FAQ/R-FAQ.html

For those interested in the history and development of R:

Ihaka R. 1998. R: Past and Future History. http://cran.r-project.org/doc/html/interface98-paper/paper.html.

There are three discussion lists on R; to subscribe see: http://cran.r-project.org/doc/html/mail.html

Several statisticians have written documents on R, for example:

Altham P.M.E. 1998. Introduction to generalized linear modelling in R. University of Cambridge, Statistical Laboratory. http://www.statslab.cam.ac.uk/~pat.

Maindonald J.H. 2000. Data Analysis and Graphics Using R - An Introduction. Statistical Consulting Unit of the Graduate School, Australian National University. http://room.anu.edu.au/~johnm/

Finally, if you mention R in a publication, cite the original article:

Ihaka R. & Gentleman R. 1996. R: a language for data analysis and graphics. Journal of Computational and Graphical Statistics 5: 299-314.
8 Index

This index is on the functions and operators introduced in this document.

Operators:

#  %%  %*%  %/%  %in%  *  +  -  /  !=  :  ;  <  <=  <-  ==  >  >=  ?  ^  {}  ~

Functions:

abline, abs, acos, anova, aov, apropos, array, arrows, asin, atan, axis,
barplot, besselI, beta, boxplot,
c, cbind, character, choose, coef, complex, contour, coplot, cor, cos, cov,
cummax, cummin, cumprod, cumsum,
data.frame, dev.list, dev.set, diag, digamma, dim, dotplot,
else, erase.screen, exp, expand.grid, expression,
for, function,
gamma, gl, glm,
help, help.start, hist,
I, if, image,
layout, layout.show, legend, length, library, lines, list, lm, load,
locator, log, log10, log2, logical, loglin, ls, ls.str,
match, matplot, matrix, max, mean, median, min, mode,
na.fail, na.omit, names, nlm, numeric,
pairs, par, persp, piechart, plot, plot.ts, pmax, pmin, points, poly,
polygon, predict, prod,
qqnorm, qqplot,
range, rank, rbeta, rbind, rbinom, rcauchy, rchisq, read.csv, read.csv2,
read.fwf, read.table, rect, rep, residuals, rev, rexp, rf, rgamma, rgeom,
rhyper, rlnorm, rlogis, rm, rnbinom, rnorm, round, rpois, rsignrank, rt,
rug, runif, rweibull, rwilcox,
save, save.image, scan, screen, segments, seq, sequence, sin, sort, source,
split.screen, sqrt, stripplot, subset, substitute, sum, summary,
sunflowerplot,
t, table, tan, text, title, ts, ts.plot,
var,
which, which.max, which.min, while, write, write.table,
x11

This note was uploaded on 11/17/2011 for the course STOR 664 taught by Professor Staff during the Fall '11 term at UNC.
