lect 16.9 - Gene Set Enrichment Analysis R. Gentleman and...

Info iconThis preview shows pages 1–3. Sign up to view the full content.

View Full Document Right Arrow Icon
Gene Set Enrichment Analysis R. Gentleman and Seth Falcon July 31, 2006 1 Introduction We begin by loading up the appropriate libraries. > library("Biobase") > library("annotate") > library("Category") > library("hgu95av2") > library("genefilter") In this case study we will see how to use gene set enrichment analysis (Subramanian et al., 2005; Tian et al., 2005). We will make use of the data from an investigation into acute lymphoblastic leukemia (ALL) reported in Chiaretti et al. (2004) for our examples. We will primarily concentrate on the two sample problem, where the data can be divided into two distinct groups, and we want to understand differences in gene expression between the two groups. For the ALL data we will compare those samples with BCR/ABL to those that have no observed cytogenetic abnormalities (NEG). The basic idea behind gene set enrichment analysis is that we want to use predefined sets of genes, perhaps based on function, in order to better interpret the observed gene expression data. In some ways the ideas here are quite similar to those that the usual Hypergeomtric testing is based on. Preprocessing As for all analyses, we must first load the data and process it. In the code chunk below we load the data and then select the subset we are interested in. Question 1 How many samples are in our subset? How many are BCR/ABL and how many NEG? Next, we remove from our data set those probes that have no mapping to EntrezGene, since we will not be able to find any metadata for these probes. > entrezIds <- mget(geneNames(eset), envir = hgu95av2LOCUSID) > haveEntrezId <- names(entrezIds)[sapply(entrezIds, function(x) !is.na(x))] > numNoEntrezId <- length(geneNames(eset)) - length(haveEntrezId) > eset <- eset[haveEntrezId, ] 1
Background image of page 1

Info iconThis preview has intentionally blurred sections. Sign up to view the full version.

View Full DocumentRight Arrow Icon
Next we do some basic prefiltering. My preference is to filter genes according to their variability across samples. In the code below we compute the IQR (approximately) and then select for our gene set those genes that have an IQR above the median value. > lowQ = rowQ(eset, floor(0.25 * numBN)) > upQ = rowQ(eset, ceiling(0.75 * numBN)) > iqrs = upQ - lowQ > selected <- iqrs > 0.5 > nsFiltered <- eset[selected, ] In the next code chunk, we find all probes that map to a single gene. We want only one probe set to represent each gene (otherwise we have to do a lot of downstream adjustments) and our decision here is to chose the one that shows the most variation, as measured by the IQR, across samples. > nsFilteredIqr <- iqrs[selected] > uniqGenes <- findLargest(geneNames(nsFiltered), nsFilteredIqr, + "hgu95av2") > nsFiltered <- nsFiltered[uniqGenes, ] > numSelected <- length(geneNames(nsFiltered)) > numBcrAbl <- sum(nsFiltered$mol.biol == "BCR/ABL") > numNeg <- sum(nsFiltered$mol.biol == "NEG") Question 2 How many genes have been selected for our analysis? Since we have done this selection without regard to phenotype, or the type of gene sets we
Background image of page 2
Image of page 3
This is the end of the preview. Sign up to access the rest of the document.

This note was uploaded on 07/29/2010 for the course BIOC BIOC2808 taught by Professor Dr.jjwang during the Fall '09 term at HKU.

Page1 / 11

lect 16.9 - Gene Set Enrichment Analysis R. Gentleman and...

This preview shows document pages 1 - 3. Sign up to view the full document.

View Full Document Right Arrow Icon
Ask a homework question - tutors are online