ubc_2006-199169-1 - COMPUTATIONAL APPROAC...


COMPUTATIONAL APPROACHES TO THE STUDY OF GENOMIC ROLES OF REPEATED DNA SEQUENCES

by

LOUIE NATHAN VAN DE LAGEMAAT

B.A.Sc., The University of British Columbia, 1996

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

in

THE FACULTY OF GRADUATE STUDIES
(Genetics Graduate Program)

THE UNIVERSITY OF BRITISH COLUMBIA

April 2006

© Louie Nathan van de Lagemaat, 2006

Abstract

Repeated sequences make up nearly half of the bulk of mammalian genomes and vary widely in structure and function. This thesis describes computational approaches for assessment of interaction of repeats and their host genomes. Following public release of the human genome sequence, initial investigations focused on overall distributions of retroelements with respect to sequence composition and genic position. Exclusion of various retroelements from regions both within and surrounding protein-coding genes suggested selection against the presence of these elements. Directional biases of accepted retroelements in these regions further supported this notion. Directional biases are understood to reflect differential mutagenicity by sequences in one direction vs. the other. To examine the relationship between protein-coding genes and mobile elements further, mappings of genomic transposable elements in the human and mouse genomes were examined in relationship to positions of all exons of protein-coding gene mRNAs. I found that approximately one quarter of mRNAs of protein-coding genes harbor sequence contributed by transposable elements. The fact that transposable element sequence is most often found in untranslated regions (UTRs) suggests a highly significant role for these sequences in modulation of translation efficiency in addition to roles in transcription.
Further investigations used directional biases of retroelements in transcribed regions in humans and mice to show that transposable elements transcribed by RNA polymerase II (pol II) exert varying effects upon insertion, depending on the sequence of the element. Finally, a bioinformatic study done on global insertion patterns of retroelements since human-chimpanzee divergence revealed that some transposable elements polymorphic for presence or absence in primate genomes actually represented deletions rather than de novo insertions. These deletions were flanked by short tracts of identical sequence, suggesting deletion by recombinational mechanisms. The relative rarity of these events lends support to the assumed stability of transposable element insertions while illustrating the recombinational activity of even low-copy, nonadjacent, short repeated sequences, such as those found flanking transposable element insertions. In summary, these bioinformatic studies lend insight into the biological roles and genomic effects of mammalian genomic repeats, especially transposable elements.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgements
Chapter 1: Introduction
  1.1 Thesis overview
  1.2 Sequence motion: what, how, why, and where to?
    1.2.1 Discovery of repetitive DNA
    1.2.2 Mobile elements and their modes of transposition
    1.2.3 Mutagenic roles of transposable elements
    1.2.4 Relationships between initial and long-term distributions of TEs
  1.3 Neutral or beneficial roles for TEs in the transcriptome
    1.3.1 Regulatory motifs donated by TEs
    1.3.2 Protein domains donated by TEs
  1.4 Stability of repetitive sequence
    1.4.1 Assumptions of stability and use of retrotransposed sequence as population markers
    1.4.2 A role for DNA double-strand break (DSB) repair in creating de novo insertions, deletions, and tandem duplications
  1.5 Repeat finding methods
  1.6 Thesis objectives and chapter summaries
Chapter 2: Retroelement distributions in the human genome
  2.1 Introduction
  2.2 Methods
    2.2.1 Description of retroelements
    2.2.2 Data sources
    2.2.3 Density analysis
  2.3 Results and Discussion
    2.3.1 Distributions of retroelements in different GC domains
    2.3.2 Arrangements of retroelements with respect to genes
    2.3.3 Shifting retroelement distributions with age
    2.3.4 Length differences do not account for the shifting patterns
    2.3.5 Delay of Alu density changes on the Y chromosome
    2.3.6 Potential explanations for Alu distribution patterns
  2.4 Concluding remarks
Chapter 3: Analysis of transposable elements in the human and mouse transcriptomes
  3.1 Introduction
  3.2 Methods
    3.2.1 Prevalence of TEs in human and mouse gene transcripts
    3.2.2 Variation of TE prevalence with gene class or function
  3.3 Results and Discussion
    3.3.1 Prevalence of TEs in human and mouse gene transcripts
    3.3.2 TEs serve as alternative promoters of many genes
    3.3.3 TE prevalence varies with gene class or function
    3.3.4 TEs are more prevalent in mRNAs of rapidly evolving and mammalian-specific genes
  3.4 Conclusions
Chapter 4: Analysis of genic distributions of endogenous retroviral long terminal repeat families in humans
  4.1 Introduction
  4.2 Methods
    4.2.1 Directional bias of insertions in transcribed regions in mice
    4.2.2 Directional bias of retroelements in the human genome
  4.3 Results and Discussion
    4.3.1 Opposite orientation bias of fixed versus mutation-causing retroviral insertions
    4.3.2 Variation in density of genic insertions of different HERV families
    4.3.3 Density profiles of ERVs across transcriptional units
    4.3.4 Distinct pattern of HERV9 elements with respect to genes
    4.3.5 SVA SINE elements display distribution patterns similar to LTRs
  4.4 Concluding remarks
Chapter 5: Analysis of repeats and genomic stability
  5.1 Introduction
  5.2 Methods
    5.2.1 Direct assessment of retroelement deletion rate
    5.2.2 Detection of deletions internal to Alu elements
    5.2.3 Assessment of deletion frequency due to illegitimate recombination
    5.2.4 Genomic PCR and sequencing
  5.3 Results and Discussion
    5.3.1 Direct assessment of retroelement deletion frequency
    5.3.2 Analysis of random genomic deletion by illegitimate recombination
    5.3.3 Direct confirmation of Alu element deletions
  5.4 Concluding remarks
Chapter 6: Summary and conclusions
  6.1 Summary
  6.2 Initial and long-term genomic localization of mobile elements are related in complex ways
  6.3 Some TEs interact strongly with genes, leading to population biases in regions surrounding genes
  6.4 Many gene UTRs are associated with TE-derived sequence
  6.5 Short repeated sequences are involved in genomic deletions
  6.6 Conclusions
References
Appendix A

List of Tables

Table 1.1 Human TE types, copy numbers, genomic coverage, and mutation occurrence
Table 2.1 Significance (p-values) of retroelement locations with respect to genes
Table 2.2 Significance (p-values) of distributional differences between divergence cohorts
Table 2.3 Significance (p-values) of distributional difference between Alus on the Y chromosome versus the whole genome
Table 3.1 RefSeq transcripts beginning within a previously unrecognized TE
Table 3.2 Domains associated with TE enrichment or exclusion in mRNAs
Table 4.1 Annotated copy numbers and evolutionary ages of various ERV families
Table 5.1 AluS indels assayed in primates by PCR and BLAST

List of Figures

Figure 1.1 Full length endogenous retrovirus-like elements
Figure 1.2 L1 structure
Figure 1.3 Typical Alu element of ~300 bp
Figure 1.4 Structure of a typical SVA element, found at human chromosome 13q14.11
Figure 1.5 Target-primed reverse transcription (TPRT)
Figure 1.6 Transcription control elements of the MLV LTR
Figure 1.7 Orientation of TEs that contain human gene transcriptional start sites
Figure 2.1 Density of retroelements in different GC fractions in the human genome, calculated over 20-kb windows across the genome sequence
Figure 2.2 Density of retroelements as a function of average GC content of each human chromosome
Figure 2.3 Ratios of observed to predicted retroelement densities with respect to genes in the human genome
Figure 2.4 Retroelement densities of different divergence classes in various GC fractions of the human genome
Figure 2.5 Length distribution of retroelements with respect to surrounding GC content
Figure 2.6 Density of Alu divergence cohorts in different GC fractions on chromosome Y compared to the whole genome
Figure 3.1 TEs in genes by species and orientation
Figure 3.2 Examples of genes with apparent TE-derived promoters
Figure 3.3 Prevalence of TEs in mRNAs of various gene classes
Figure 4.1 Directional bias of retroelements in mouse transcribed regions
Figure 4.2 Orientation bias of various full length ERV sequences in genes
Figure 4.3 Patterns of annotated ERV presence in equal-sized bins across transcriptional units
Figure 4.4 Insertion pattern of SVA and AluY retroelements across transcriptional units
Figure 5.1 Deletions due to DNA double strand break repair
Figure 5.2 Prevalence of direct repeats at deletion boundaries
Figure 5.3 PCR and sequence evidence for precise Alu element deletion
Figure 5.4 Sequence evidence for precise Alu element deletion

List of Abbreviations

ALV      Avian Leukosis Virus
BLAST    Basic Local Alignment Search Tool
BLAT     Blast-like Alignment Tool
DSB      double strand break
ERV      endogenous retrovirus(like)
ETn      Early Transposon
GO       Gene Ontology
HERV     human ERV
HIV      Human Immunodeficiency Virus
HR       homologous recombination
IAP      intracisternal A particle
indel    insertion/deletion
IPR      InterPro
IR       inverted repeat
KOG      euKaryotic clusters of Orthologous Groups
LINE     long interspersed nuclear element
LTR      long terminal repeat
MaLR     Mammalian apparent LTR Retrotransposon
MER4     MEdium-Reiterated repeat 4
MLT      Mammalian LTR Transposon
MLV      (Moloney) Murine Leukemia Virus
MRN      MRE11/RAD50/NBS1 (protein complex)
NCBI     National Center for Biotechnology Information
NHEJ     nonhomologous end joining
ORF      open reading frame
PCR      polymerase chain reaction
pol II   RNA polymerase II
RPA      replication protein A
SINE     short interspersed nuclear element
SIV      Simian Immunodeficiency Virus
SSA      single strand annealing
ssDNA    single stranded DNA
TE       transposable element
TPRT     target-primed reverse transcription
TSD      target site duplication
UCSC     University of California Santa Cruz
UTR      untranslated region

Acknowledgements

Reflecting over the work described here, I would like to thank many people for their input and give them credit for the role they have played in my work. First of all, many thanks go to my supervisor, Dr. Dixie Mager. Besides being a smart and insightful person, she has been a fun and interactive supervisor, and conversation with her has always been profitable. I really appreciated the measure of freedom she gave me to pursue particular aspects of the questions at hand, and she was always ready to discuss new data, especially when there were graphs to look at. Furthermore, it's been fun getting to know past and present members of the Mager Lab and the broader BC Cancer Research Centre community, and I thank them all for their friendship and input.

Further credit is due to many people who have been involved with my projects in various supportive roles. First in this regard have been my thesis committee members, who have contributed in various ways to my success. Drs. Ann Rose and Steven Jones have been a source of good questions about my data, and in more than one instance, have led to positive development of my projects. Dr. Holger Hoos's contributions in the realm of algorithm development have led to some of the software developed in the course of this work. In addition to my committee, I thank the Natural Sciences and Engineering Research Council of Canada and the Canadian Institutes of Health Research for generous funding during the course of my work.
Very importantly, I feel very grateful to Patrik Medstrand for his involvement all my PhD long. I suppose our relationship goes beyond a supervisor-student relationship into the realm of personal friendship. I have learned a great deal from Patrik, and the exchange visits to his lab in Sweden have been productive, and more than that, lots of fun. Thanks also go to Patrik's wife Lilly, and my hosts in Lund, the Bruce family.

Last and definitely not least, I thank my parents, Wulf and Henrietta, for all their love and interest in my work. I have always enjoyed the discussions we have had.

Chapter 1: Introduction

1.1 Thesis overview

The goal of this thesis project was to use global computational genomic analyses of repeated sequences in mammalian genomes to understand interactions of these elements with their host genome. Repetitive sequences, including transposable elements and tandem and segmental duplications, make up nearly half of mammalian genomes. Repeated sequences influence the host genome in several main ways, from the obvious role in genome expansion to generation of distributed similar sequences susceptible to ectopic recombination, to provision of transcriptional and regulatory signals. As early as Barbara McClintock's experiments in maize in the 1950s, there was some appreciation of the cytogenetic and concomitant regulatory consequences of DNA that could move about in a genome. However, only more recently have molecular studies begun to unravel some of the mechanisms involved in amplification of repetitive sequence and the mechanisms mediating the regulatory roles of repetitive elements fixed in mammalian genomes. These advances in understanding, coupled with the availability of the sequenced genomes of humans and various model organisms, have enabled further genome-wide studies of repeated sequences and their role in organismal biology.
The research reported in this thesis attempted to clarify the dual roles of transposable elements and other repetitive sequences and their potentials for both benefit and harm. Primarily, bioinformatic and mapping techniques were used to elucidate these roles, with occasional additional validation using wet laboratory approaches. These approaches make it possible to address questions regarding global genomic processes and how repetitive DNA has shaped genome architecture.

1.2 Sequence motion: what, how, why, and where to?

1.2.1 Discovery of repetitive DNA

Mammalian genomes may best be described as a patchwork of many types of sequences, including genes, control regions, and, not mutually exclusively, repeated sequences. The first hints of the pervasive presence of repeated sequence in genomes date back to the early 1950s, when it was observed that the DNA content of cells, which was presumed to contain the genes, was poorly correlated with the overall level of complexity of the organism (the so-called C-value paradox), reviewed in Gregory (2001). One early explanation proposed that the extra DNA was 'junk', or non-coding pseudogenes, while later theories suggested it to be self-replicating 'selfish DNA' (Doolittle and Sapienza 1980; Orgel and Crick 1980). The discovery of mobile or transposable elements by Barbara McClintock a half century ago (McClintock 1950; McClintock 1956) and subsequent discovery of retrotransposable elements has better explained the nature of this 'selfishness'.

1.2.2 Mobile elements and their modes of transposition

Thorough study of the human genome has shown that recognizable repeats make up nearly half of its bulk (International Human Genome Sequencing Consortium 2001).
Lesser coverage by repeats in other mammalian genomes, for example in rodents, has been attributed to the incompleteness of the library of known rodent repeats as well as faster substitution rates in the rodent lineage, and therefore the estimates of approximately 40% repetitiveness of rodent genomes are considered to be lower bounds (Mouse Genome Sequencing Consortium 2002; Rat Genome Sequencing Project Consortium 2004).

Several broad categories of repetitive sequence exist. One of the most basic distinctions one can make is related to copy number. Low copy number repeats, including tandem and segmental duplications, are generated by recombinational mechanisms and are discussed in Section 1.4. High copy number repeats, also known as mobile or transposable elements (TEs), are considered to move about using proteins encoded either by themselves or in trans by another mobile element (Table 1.1). These elements have amplified over the course of evolution and attained copy numbers ranging from several tens to over a million in mammalian genomes. TEs may be characterized on the basis of several aspects, including whether or not the element is autonomous, presence or absence of terminal repeats, and the mechanisms by which the elements move.

Table 1.1 Human TE types, copy numbers, genomic coverage, and mutation occurrence

Type/family               Copies per haploid human   Genomic         Human mutations
                          genome (x1000) (a)         coverage (%)    documented? (b)
SINEs                     1558                       13.14
  Alu                     1090                       10.60           22
  MIR                      468                        2.54
LINEs                      868                       20.42
  L1                       516                       16.89           13
  L2                       315                        3.22
  L3                        37                        0.31
LTR elements               443                        8.29
  ERV class I              112                        2.89
  ERV class II (ERV-K)       8                        0.31
  ERV class III (ERV-L)     83                        1.44
  MaLR                     240                        3.65
SVA (c)                      3                        0.15            4
DNA elements               294                        2.84
Total                                                44.84           39

(a) Taken from International Human Genome Sequencing Consortium (2001), unless otherwise noted
(b) From Chen et al. (2005)
(c) From Wang et al. (2005)

DNA transposons, no longer mobile in mammals, are exemplified by the P-elements in Drosophila (Pinsker et al. 2001). Active elements of this type move by a cut-and-paste mechanism (Kazazian 2004). Autonomous elements encode their own transposase protein, which binds in a sequence-specific manner to the terminal inverted repeats flanking the element and cleaves the DNA, excising the element (Miskey et al. 2005). Non-coding DNA with similar flanking inverted repeats is also susceptible to cutting and pasting by the same proteins. In rice, multiple DNA transposons exist, including autonomous Pong and non-autonomous mPing and Mutator-like elements (Jiang et al. 2003; Jiang et al. 2004). Proliferation of Mutator-like non-autonomous elements harboring coding-competent genic sequence has also been documented (Jiang et al. 2004). In mammals, these elements are no longer functional. However, their recombinase functions persist in the co-opted RAG genes used in V(D)J recombination (Brandt and Roth 2004; Schatz 2004).

Retroelements, in contrast to DNA transposons, move by a copy-and-paste mechanism (Kazazian 2004). RNA intermediates are reverse transcribed into DNA using a reverse transcriptase protein. Similar to DNA transposons, autonomous retroelements code for their own reverse transcriptase, while non-autonomous elements depend on this protein in trans. Several broad classes of retroelements exist, including long interspersed nuclear elements (LINEs), short interspersed nuclear elements (SINEs), pseudogenes, and elements with long terminal repeats (LTRs) (Kazazian 2004).

Approximately 100 families of LTR-containing elements have been found in humans, varying in length from several hundred base pairs for solitary LTRs to full length elements as long as 10 kb (Jurka 2000; International Human Genome Sequencing Consortium 2001).
These include endogenous retroviruses (ERVs), which are presumed to have resulted from germline infections by exogenous viruses, LTR retrotransposons, and repetitive elements with an LTR-like structure for which no corresponding full-length structure has been identified (Mager and Medstrand 2003; Medstrand et al. 2005). Approximately 85% of LTR-retroelement insertions exist in the human genome as solitary LTRs, a result of recombination between the terminal repeats (International Human Genome Sequencing Consortium 2001).

ERVs are so named due to their structural similarity to the integrated provirus form of exogenous retroviruses (Figure 1.1). The flanking LTRs contain the regulatory motifs necessary for their transcription in three regions, termed U3, R, and U5, described further below. The internal sequence encodes the proteins necessary for their retrotransposition. Both autonomous ERVs and LTR retrotransposons increase their copy number through mRNA intermediates which are reverse transcribed and reinserted into the host genome, and their overall genomic organization includes several features related to their life cycle (Wilkinson et al. 1994). Just 3' of the upstream LTR are a tRNA primer binding site, which primes reverse strand synthesis, and a packaging signal, which interacts with the nucleocapsid protein. The internal sequence of these elements includes gag and pol genes, which code for a nucleocapsid protein, a protease, RNAse H, reverse transcriptase, integrase, and other proteins. Just 5' of the downstream LTR is a poly-purine tract. Finally, the LTR begins and ends with TG and CA dinucleotides, which are important for insertion. Some ERVs have an additional env gene, which codes for an envelope protein and allows ERVs to form infectious particles that can escape the host cell and reinfect adjacent cells.
However, it should be noted that in humans the vast majority of these elements are defective, with mutations in some, usually all, of their internal sequences. No disease-causing ERV insertions have been found in humans, although several loci are polymorphic for presence or absence of an ERV in humans (Turner et al. 2001; Bennett et al. 2004; Belshaw et al. 2005). As with all retroviruses, insertions of ERVs are flanked by short tracts of identical sequence, called a target site duplication (TSD), usually 4-6 bp in length.

Figure 1.1 Full length endogenous retrovirus-like elements. A. Full-length HERV-H, as described by Jern et al. (2005). HERV-H has genomic organization and genes related to those of infectious retroviruses, as well as canonical pol-II transcriptional signals. This suggests that it has been an autonomous element that entered the primate germline as an infection by an exogenous virus. PBS and PPT are primer binding site and polypurine tract, respectively. B. HUERS-P1 element, first described by Harada et al. (1987). The HUERS-P1 element lacks discernible open reading frames, although it has presumptive pol-II transcriptional signals in its LTRs, and is therefore presumed to be non-autonomous.

Another class of LTR-containing elements exists whose internal sequence bears little or no resemblance to that of exogenous viruses. These elements include the Mammalian apparent LTR retrotransposons (MaLRs) and so-called medium-reiterated 4 elements (MER4s), which lack internal similarity to retroviral genes (Smit 1993). However, these elements do contain the obligatory LTRs with regulatory motifs as well as polypurine tracts, which leaves open the question of how these elements have been mobilized.

Active full-length LINE elements are typified by the L1 family (Figure 1.2). These elements are transcribed from an internal RNA polymerase II (pol II) promoter (Swergold 1990; Tchenio et al. 2000; Yang et al. 2003; Athanikar et al. 2004) and have two open reading frames (ORFs), which encode a nucleic acid binding protein, a reverse transcriptase, and an endonuclease; these proteins are necessary for their own transposition (Kazazian 2004). These elements terminate with a polyadenylation signal and poly-A tract. The origin of LINE elements is unknown; however, many classes of eukaryotes contain them, including mammals, fish, invertebrates, plants, and fungi (Furano 2000). L1s exhibit a marked cis preference, which means that the encoded proteins most often act only on the mRNA that encoded them (Wei et al. 2001). In addition, however, L1s have also been shown to mobilize other RNA species, for example processed mRNAs and SINEs in human (Esnault et al. 2000; Dewannieux et al. 2003). Antisense promoter activity from the 5' UTR has also been reported (Nigumann et al. 2002). Species of mRNA mobilized by L1s usually have a TSD 7-20 bp in length.

[Figure 1.2 schematic: Promoter / 5' UTR, ORF1, ORF2, 3' UTR; annotated binding sites YY1 (13-21), RUNX3 (83-101), SOX (472-477 and 572-577); antisense promoter.]

Figure 1.2 L1 structure. The overall genomic organization of the currently active L1 (Hs) consists of a 5' UTR/promoter, two open reading frames (ORF1 and ORF2), and a short 3' UTR, which terminates in a canonical polyadenylation signal followed immediately by a poly-A tract. The promoter region has recognized binding sites for YY1, RUNX3, and SOX-family transcription factors (Swergold 1990; Tchenio et al. 2000; Yang et al. 2003; Athanikar et al. 2004). An antisense promoter has been described (Nigumann et al. 2002).

Active SINE elements are typified in humans by the Alus and SVA elements. Alus are non-protein coding, RNA polymerase III-driven, 7SL RNA-derived dimeric elements (Ullu and Tschudi 1984) (Figure 1.3).
Alus number in excess of one million copies in the haploid human genome (International Human Genome Sequencing Consortium 2001) and are still actively retrotransposing in the primate lineage. Their current amplification rate, estimated at one in 200 live births (Deininger and Batzer 1999), is 100-fold lower than at the peak of their activity (Shen et al. 1991). Elements polymorphic for presence or absence in the human population, which number approximately 1000 in the average human (Bennett et al. 2004), have also demonstrated usefulness in distinguishing relationships of human populations (Batzer and Deininger 2002). In addition to Alus, primate genomes contain a family of mammalian-wide tRNA-derived SINEs called MIR, which are older than Alus and are not known to be transcribed in humans (Smit and Riggs 1995). Other tRNA-derived SINEs are active in mice (Dewannieux and Heidmann 2005).

[Figure 1.3 schematic: ~50-bp similar regions.]

Figure 1.3 Typical Alu element of ~300 bp. Similar regions share ~70% identity.

In addition to Alu elements, SVAs comprise another relatively numerous superfamily of SINE elements. Numbering 2762 in the haploid human genome (Ono et al. 1987; Wang et al. 2005), these elements are believed to be confined to the hominoid primates, indicating a relatively recent evolutionary origin (Kim et al. 1999; Wang et al. 2005). They consist of a hexamer repeat, homologies to inverted Alu elements, a variable number of tandem repeats, and a deleted partial copy of an ERV-K element including the terminal part of an env gene and a partially deleted LTR (Figure 1.4). SVAs end with a polyadenylation signal and poly-A tract (Ono et al. 1987; Zhu et al. 1992; Shen et al. 1994; Wang et al. 2005). The fact that SVA elements contain a major portion of an ERV-K LTR suggests that these active elements may have LTR-like effects.
Furthermore, mRNA evidence exists supporting a role for these elements as alternative promoters, for example of the hyaluronoglucosaminidase 1 and the G protein-coupled receptor MRGX3 genes (University of California Santa Cruz Genome Browser, http://genome.ucsc.edu).

[Figure 1.4 schematic: (CCCTCT)n hexamer repeat; two Alu-homologous regions; VNTR; partial HERVK10 env and partial LTR; scale marks at 500, 1000, and 1500.]

Figure 1.4 Structure of a typical SVA element, found at human chromosome 13q14.11. SVAs consist of a tract of CCCTCT hexamer repeats, two regions homologous to Alu elements, a region containing a variable number of tandem repeats (VNTR), and a partial HERVK10 env gene and partial LTR, ending with the polyadenylation signal and an immediate poly-A tract.

L1-driven retrotransposition is targeted to TT/AAAA sites, which are widely dispersed through mammalian genomes (Jurka 1997). The process is known as target-primed reverse transcription (TPRT), reviewed in Kazazian (2004), and begins with opposite-strand nicking at the insertion site, uncovering a short poly-T tract complementary to the mRNA to be reverse-transcribed. The mRNA poly-A tail anneals to the poly-T tract, which then serves as a primer for reverse transcription (Figure 1.5). After insertion, new elements segregate as Mendelian genes in the population. The likelihood of any given insertion reaching fixation in a population is very small.

[Figure 1.5 schematic, legible panel labels: a) First strand cleavage; c) Second strand cleavage; d) Integration; e) DNA synthesis; result: new insert flanked by target site duplications (TSD).]

Figure 1.5 Target-primed reverse transcription (TPRT). Taken from Ostertag and Kazazian (2001).

Due to the copy-and-paste strategy employed by retroelements in their amplification, new retroelement insertions are ideally identical to their ancestor element.
Independent mutations occurring in an individual element either during or after retrotransposition introduce variation which is then faithfully inherited by derivative copies of the element. This process results in an increasing genomic population of elements marked by diagnostic mutations. These mutations may then be exploited to infer family relationships of accumulated retroelements (International Human Genome Sequencing Consortium 2001). Furthermore, the likely sequence of the ancestral elements can be reconstructed from sequence alignments of distributed copies of the element. Divergence from this consensus element can then be computed and, assuming a molecular clock, the approximate age of each element may be estimated. This type of analysis has been described elsewhere (Smit 1993). The molecular clock is usually calibrated using fossil evidence, with fossil ages derived from radioactive dating of the rocks in which the fossils are found (Goodman et al. 1998).

1.2.3 Mutagenic roles of transposable elements

As early as Barbara McClintock's experiments with DNA transposons in corn, an appreciation of the 'gene-controlling' effect of these 'jumping genes' began to take root (McClintock 1950; McClintock 1956). In those experiments, it was noted even in physical examination of chromosomes that new insertions could drastically alter local chromosomal structure and that these could account for alterations in phenotypic characteristics such as kernel color.

Mutagenic roles of TEs may be grouped into several types. Most basic of all, insertional mutagenesis involves disruption of a conserved and functionally important region of DNA. Perhaps because they are more readily analyzed, most well-studied cases of insertional mutagenesis involve disruption of a coding exon of a transcribed gene, which then introduces a stop codon or frame-shift resulting in a prematurely terminated or non-functional transcript.
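
As a toy illustration of how an insertion into a coding exon can shift the reading frame and introduce a premature stop codon, the following minimal sketch may help. The coding sequence, the inserted fragments, and the function names are hypothetical and chosen only for the demonstration; only the standard genetic code is taken as given.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate from position 0, stopping after the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        residue = CODON_TABLE[cds[i:i + 3]]
        protein.append(residue)
        if residue == "*":
            break
    return "".join(protein)


def describe_insertion(cds, element, position):
    """Insert `element` into `cds` at `position` and compare translations."""
    mutant = cds[:position] + element + cds[position:]
    return {
        # Any insert whose length is not a multiple of 3 shifts the frame.
        "frameshift": len(element) % 3 != 0,
        "wild_type": translate(cds),
        "mutant": translate(mutant),
    }


# Hypothetical 18-bp coding sequence and 8-bp insert (not real gene data).
toy_cds = "ATGGCTGAAACCGGTTAA"
print(describe_insertion(toy_cds, "TTTTAGCC", 6))
```

In this toy case the insert both shifts the frame and carries an in-frame TAG, so the mutant translation terminates after three residues. Real TE insertions are hundreds to thousands of base pairs long, and their effects also depend on splicing, which this sketch ignores.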
Examples of this type of mutagenesis reported in the literature include inactivation of the human cholinesterase gene by an Alu insertion (Muratani et al. 1991) and disruption of the Factor VIII gene by an L1 element, causing hemophilia A (Kazazian et al. 1988). More examples are described by Deininger and Batzer (1999) and reviewed by Chen et al. (2005). As a variation on this theme, in one report a sequence believed to have been 3'-transduced by an SVA element was found to have disrupted exon 5 of the alpha spectrin gene, resulting in exon skipping (Hassoun et al. 1994; Ostertag et al. 2003).

Several more subtle regulatory roles exist for TEs, usually related to the structure of the consensus element. For example, although the phenomenon is apparently fairly rare, mutations in intronic antisense Alu 3' ends can lead to generation of an efficient splice acceptor site, which may result in disease (Sorek et al. 2002; Lev-Maor et al. 2003; Sorek et al. 2004). L1s, on the other hand, have recently been shown to act fairly universally as transcriptional rheostats, reducing transcription of genes they are found in (Han et al. 2004). This activity is believed to be due to the A-rich consensus element, which is believed to cause the pol-II holoenzyme to fall off its template with high frequency. Reduction in this A-richness while conserving the protein sequence of the element resulted in highly efficient transcription (Han and Boeke 2004). Furthermore, presence of A-rich L1 sequence in introns of genes was correlated with overall reduction in transcription efficiency. Conversely, L1s integrated into introns in a direction antisense to the gene's direction of transcription have a demonstrated polyadenylation activity (Perepelitsa-Belancio and Deininger 2003; Han et al. 2004), so that intronic L1s reduce transcription in either orientation. Intronic L1s have also been linked to disease (Kimberland et al. 1999).
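
The notion of A-richness used above can be made concrete with a short sketch that measures adenine content on the sense strand, both overall and in sliding windows. The two nine-base sequences are hypothetical toy examples; under the standard genetic code both encode Lys-Glu-Thr, which illustrates how A content can be lowered without changing the encoded protein. This is not the procedure used in the cited studies.

```python
def base_fraction(seq, base="A"):
    """Fraction of positions in `seq` equal to `base` (0.0 for empty input)."""
    seq = seq.upper()
    return seq.count(base) / len(seq) if seq else 0.0


def windowed_fraction(seq, window, base="A"):
    """Base fraction in every full-length window, stepping one base at a time."""
    seq = seq.upper()
    return [
        base_fraction(seq[i:i + window], base)
        for i in range(len(seq) - window + 1)
    ]


# Hypothetical synonymous pair: both read Lys-Glu-Thr in the standard code.
a_rich = "AAAGAAACA"   # AAA GAA ACA
recoded = "AAGGAGACC"  # AAG GAG ACC

print(base_fraction(a_rich), base_fraction(recoded))
print(windowed_fraction("AAAAGGGG", 4))
```

Strand matters here: the sequence should be read in the direction of transcription, since an A-rich sense strand is T-rich when viewed from the opposite strand.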
Another type of mutagenesis related to the element's sequence is that posed by ERVs and their LTRs. As described above, active elements of this type code for several genes related to their life cycle. In order to produce correct amounts of each protein, these elements often splice out part of their coding sequence, requiring the presence of functional splice donor and acceptor sites in the full-length element (Rabson and Graves 1997). Mutations due to these splice sites have been demonstrated in mice (Maksakova et al. 2006).

More important than the donation of splicing motifs by full-length ERVs is the fact that LTRs harbor transcriptional control elements, including fully functional promoters and polyadenylation signals. The general structure of LTRs has been studied extensively (Temin 1982; Rabson and Graves 1997). A classical LTR consists of three regions, termed U3, R, and U5. The U3 region extends from the 5' end of the LTR or proviral copy through the promoter to the start of transcription (Figure 1.1). Together with the R region, the U3 region does double duty as the 3' untranslated region (UTR) of retroviral transcripts. The R region extends from the start of transcription to the polyadenylation site of viral transcripts and contains the polyadenylation signal. Lastly, the U5 region, with the R region, forms the 5' UTR of the viral transcript.

The most fruitful mammalian examples of LTR mutagenesis come from inbred strains of mice, in which 10 percent of characterized new mutations are due to insertional mutagenesis as a result of ERV or LTR insertions (Maksakova et al. 2006). This is often associated with intronic localization (Baust et al. 2002) in the same transcriptional orientation as the gene, which often leads to premature polyadenylation (Maksakova et al. 2006).
The mutagenic nature of LTR polyadenylation motifs has been used in gene trapping experiments, in which constructs with a selectable marker and LTR polyadenylation signal were randomly inserted into mouse embryonic stem cells (Friedrich and Soriano 1991; von Melchner et al. 1992; Boeke and Stoye 1997). A total of 28 clones with expression of the selectable marker were used to create transgenic lines of mice and then bred to homozygosity. Eleven of the 28, or approximately 40% of genic insertions, proved to be embryonic lethal, highlighting a high-frequency outcome of mutagenesis by polyadenylation: death of the involved cell.

Potentially more insidious is the role of ERV insertions as ectopic promoters. LTR promoters consist of a core promoter and regulatory elements composed of arrays of protein binding sites that act as enhancers and repressors to control expression of viral genes (Temin 1982). For example, transcription of the Moloney Murine Leukemia Virus (MLV) is controlled by its LTR, shown in Figure 1.6. In general, the core MLV promoter consists of a TATA box and cis-acting CAAT enhancer (bound by C/EBP). The more distal enhancer region contains a tandemly duplicated group of transcription factor binding sites. While methylation likely silences many TE insertions (Lavie et al. 2005; Meunier et al. 2005), MLV is silenced by repressor binding sites found upstream of its LTR enhancer. A more complete discussion of transcriptional control by the LTRs of MLV and other exogenous viruses is found in Rabson and Graves (1997).

Figure 1.6 Transcription control elements of the MLV LTR. Taken from Rabson and Graves (1997).

Long experience with mutational mechanisms in mice, particularly due to spontaneous ERV insertions, has resulted in a large number of these mutations being characterized.
The binding sites provided by de novo ERV insertions have frequently been found to cause oncogene activation in mice, functioning either as enhancers or promoters. Relevant mechanisms and frequency have been reviewed by Rosenberg and Jolicoeur (1997). Similarly for humans, an enhancer effect by exogenous retroviral LTRs has been implicated in two cases of secondary leukemia after gene therapy treatment for X-SCID in 11 individuals (Hacein-Bey-Abina et al. 2003). However, this small number of trials leaves the overall expected frequency of adverse events in a gene therapy context in doubt. In another trial, the hematopoietic systems of 42 immunoablated rhesus monkeys were repopulated with cells having therapeutic insertions of MLV and simian immunodeficiency virus (SIV) vectors (Kiem et al. 2004; Dunbar 2005). Stable, polyclonal, virally marked hematopoiesis was observed, with no secondary leukemias over 6 months to 6 years.

In addition to their sense-oriented promoter, enhancer, and polyadenylation effects, there is some evidence that LTRs can be damaging in the antisense direction, upstream of genes or within genes. In one documented case, loss of epigenetic silencing of an antisense intracisternal A particle (IAP) ERV upstream of the agouti gene in mice resulted in ectopic expression of the agouti gene from a cryptic ERV promoter (Morgan et al. 1999). Within introns, aberrant splicing of antisense ERV sequences from cryptic splice signals may be a prominent mutagenic mechanism of ERVs, as also highlighted for IAPs in mice in a recent review (Maksakova et al. 2006). The observation that LTRs are less likely to be found in genes in the antisense orientation than expected by random chance is in agreement with this view (see Chapter 4).

A couple of de novo mutagenic roles have been elucidated even for very old TEs.
As discussed more fully in terms of mechanism in section 1.4.3, Alus have been shown to engage in Alu-Alu recombination (Deininger and Batzer 1999). The most recent literature describes many examples of this type of event, including mediation of MLL-CBP gene fusion in leukemia (Zhang et al. 2004), deletion in BRCA1 and BRCA2 genes in cancers (Tournier et al. 2004; Ward et al. 2005), and Factor VIII deletions in hemophilia (Nakaya et al. 2004).

Finally, for some mutagenic insertions such as those of the hominid SVA retroelement family, the precise mechanism of mutagenesis has not been determined. To date, four cases of de novo mutations due to SVA retroelements have been described. Interestingly, all were within the borders of known genes and are in the same transcriptional orientation as the enclosing gene (Chen et al. 2005). One hint as to a possible mode of mutagenesis comes from the fact that SVAs retain the HERV-K10 endogenous retroviral-derived hormone response element and the core enhancer sequences and a polyadenylation signal derived from the same LTR, suggesting that SVAs can act in a similar way to LTR elements (Ono et al. 1987; Wang et al. 2005). This topic is addressed in Chapter 4.

1.2.4 Relationships between initial and long-term distributions of TEs

It has been observed that for some elements, for example Alus, the final genomic distribution does not correspond to that of newly-inserted active elements of the same type (International Human Genome Sequencing Consortium 2001) (See also Chapter 2). In the case of Alus, their consensus TTAAAA insertion site is expected to be found more often in regions of high AT sequence content and, indeed, recently inserted Alus tend to be located in AT-rich regions, similar to the pattern seen for L1 elements (Smit 1999; International Human Genome Sequencing Consortium 2001). However, the vast majority of Alus are found in GC-rich regions.
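The contrast between AT-rich insertion sites and GC-rich long-term locations can be quantified by computing the GC fraction of the sequence flanking each element and assigning it to a composition class. The following is a minimal sketch of that idea, not the procedure used in this thesis: the function names, the default flank size, and the bin edges are illustrative assumptions.

```python
def gc_fraction(seq):
    """Return the fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Sequences consisting only of ambiguous bases (e.g. N) are reported as 0.0.
    return gc / total if total else 0.0


# Hypothetical composition bins (lower bound inclusive, upper bound exclusive).
# These edges are assumptions for illustration only.
GC_BINS = [(0.0, 0.37), (0.37, 0.42), (0.42, 0.47), (0.47, 0.52), (0.52, 1.01)]


def gc_bin(frac):
    """Return the index of the composition bin that contains `frac`."""
    for index, (low, high) in enumerate(GC_BINS):
        if low <= frac < high:
            return index
    raise ValueError(f"GC fraction out of range: {frac}")


def flank_gc(genome_seq, start, end, flank=10000):
    """GC fraction of the sequence flanking an element at [start, end).

    The element itself is excluded so that its own composition
    (GC-rich for Alus, AT-rich for L1s) does not bias the estimate.
    """
    left = genome_seq[max(0, start - flank):start]
    right = genome_seq[end:end + flank]
    return gc_fraction(left + right)
```

Applied to elements grouped by family or by divergence from consensus, a tally of `gc_bin(flank_gc(...))` values would give the kind of composition profile compared in the discussion above.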
Several theories have attempted to explain this disparity. Pavlicek et al. (2001) noted that, in spite of using the same target site, the observed long term distributions of Alus and L1s were biased to different GC content isochores. In their data set, the presumed slightly older AluYb8 subfamily was more skewed to higher GC isochores than the slightly younger AluYa5 subfamily. Therefore, they proposed that the GC-richness of Alu elements makes them more stable in regions where the surrounding GC content is similar to that of the Alu consensus, and that excision by an unknown mechanism happens more frequently in high-AT isochores (Pavlicek et al. 2001).

Others have hypothesized that Alu elements are selectively retained in GC-rich regions because of a functional benefit to genes, which reside in high-GC regions. For example, Britten cites individual Alu elements from earlier literature conferring functional binding sites for various proteins to nearby genes (Britten 1997). Schmid hypothesized that Alus might be involved in chromatin remodeling and signaling of double-stranded RNA-dependent protein kinase in response to cell stress, conferring a selective advantage for having Alus in transcribed regions (Schmid 1998). While these studies are tantalizing and suggest selective advantage of individual Alus, no study since that time has demonstrated an overall selective advantage for accumulation of Alus in GC-rich regions. Instead, the developmentally critical HoxD gene cluster is almost devoid of retroelements (International Human Genome Sequencing Consortium 2001), suggesting that some classes of genes may need to exclude such sequences from their environment to ensure proper function or regulation.
A more neutralist third hypothesis proposes that Alus are maintained in GC-rich regions because deletions are unlikely to be precise and deletions of these elements in gene-rich regions would likely also involve important adjacent regulatory sequence (Brookfield 2001). While this hypothesis is more satisfying in that it attempts to explain the Alu distribution without invoking a functional role, rates of deletion in gene-rich vs. gene-poor regions in primates have not been conclusively assessed. Furthermore, given that only five percent of mammalian genomes is estimated to be under purifying selection (Mouse Genome Sequencing Consortium 2002) and therefore vulnerable, it is unlikely that this explanation alone can account for the massive accumulation of Alus in gene-rich regions.

A fourth mechanism theorized to contribute to the relative paucity of Alus in gene-poor, high-AT regions is Alu-Alu recombination. Closely-spaced Alu pairs are found only occasionally in the human genome (Lobachev et al. 2000; Stenger et al. 2001), possibly because of clearance of these elements through the mechanism of inverted repeat (IR)-mediated recombination (Leach 1994). Later studies have linked this phenomenon to sister chromatid exchange (Nag et al. 2005). In addition, unequal homologous recombination results in loss of the sequence separating closely spaced directly repeated Alus (Stenger et al. 2001). Even Alus up to 20% divergent have been found to recombine efficiently (Lobachev et al. 2000). That this process is ongoing is evidenced in its observed involvement in human disease, discussed above. This mechanism provides an additional explanation for the enrichment of Alus in GC-rich regions without requiring a functional role.

While the above theories address likelihood of fixation of inserted elements, some interesting recent work motivated by the use of retroviral vectors in gene therapy has focused instead on mechanisms involved at the time of insertion.
For example, work using unselected in vitro integrations of HIV and HIV-based vectors consistently demonstrated a propensity to integrate in transcribed regions, with higher frequency in highly transcribed genes (Schroder et al. 2002; Mitchell et al. 2004; Barr et al. 2005). MLV, on the other hand, has a marked preference to integrate into the start sites of more active genes (Wu et al. 2003), and avian leukosis virus (ALV) demonstrated a weaker, though highly significant, preference for transcribed regions (Mitchell et al. 2004; Barr et al. 2005).

The question arises why the different viruses target active genes, and why they do so at varying locations within genes. Early experiments in Swiss mouse cells found that the majority of MLV insertion sites were sensitive to micrococcal nuclease and DNase I, suggesting that viral integrations chiefly target accessible DNA, which is found in regions of open chromatin (Panet and Cedar 1977). While access to targets may well partially explain integration targeting, it fails to explain the additional marked preference of MLV for 5' regions of genes. Instead, a tethering model has been proposed, whereby interactions between the viral preintegration particle and host factors bound to DNA facilitate targeting of integrations to specific regions within open chromatin (Bushman 2003; Bushman et al. 2005). In such a model, MLV might be tethered by transcription factors, while HIV might interact with proteins that bind within transcription units. One candidate tethering factor has been identified for HIV (Ciuffi et al. 2005). Additional evidence for this phenomenon comes from the recent discovery of symmetrical base pair profiles around sites of insertion of HIV-1, ALV, and MLV (Holman and Coffin 2005). It should be noted, in any case, that studies of in vitro insertions represent insertions before they have a chance to be tested during organismal development.
Only after escaping purifying selection during development and the lifetime of an organism do germline mutant alleles have a chance to segregate in the population and attain a small chance of spreading to fixation.

1.3 Neutral or beneficial roles for TEs in the transcriptome

1.3.1 Regulatory motifs donated by TEs

In addition to their potent role as mobile genomic mutagens, TEs have been increasingly appreciated for the positive roles they can fulfill (Kidwell and Lisch 1997). One early role elucidated for TEs was as the parotid-specific enhancer of the human amylase gene (Ting et al. 1992). In that investigation, a 700 bp fragment approximately 300 bp upstream of the transcription start and derived entirely from a human endogenous retrovirus E (HERV-E) LTR was sufficient to confer salivary expression on a reporter gene in transgenic mice. Curiously, an LTR element has also been found to act in the opposite role as a strong upstream repressor of annexin A5 transcription (Carcedo et al. 2001).

Many more examples of LTRs acting as alternative promoters of cellular genes exist. A growing body of literature describes genes in many roles under the control of ERV LTRs (for reviews, see Leib-Mosch et al. 2005; Medstrand et al. 2005; see also Chapter 3). An early investigation identified an ERV9 LTR that acts as a tissue-specific promoter of the zinc finger gene ZNF80 and drives expression in several hematopoietic cell lineages (Di Cristofano et al. 1995). An ERV9 element in the human globin locus control region has also been shown to participate in expression of downstream genes, likely by participation in chromatin remodeling (Yu et al. 2005). Several examples of HERV-E LTRs acting as alternative promoters, including involvement in MID1, apolipoprotein CI and endothelin B receptor expression, have been identified as well (Medstrand et al. 2001; Landry et al. 2002).
More recently, an LTR of the ERV-L superfamily has been shown to form the dominant promoter of the human beta1,3-galactosyltransferase 5 gene (Dunn et al. 2003). This promoter was found to be more conserved than expected by chance, suggesting purifying selection preserving these retroviral sequences (Dunn et al. 2005). This example is instructive, as the orthologous mouse gene is also expressed in colon despite the lack of an LTR promoter, leading to the conclusion that the insertion of this ERV has been co-opted because it provided useful motifs in the approximately correct position rather than because it is essential. This example may also suggest external control by a separate colon-specific enhancer.

Finally, a genome-wide bioinformatic survey showed that most TE types have some basal capacity to form 5' transcriptional start sites, presumably by contributing preexisting promoter-related sequences or by evolving them in situ (Dunn et al. 2005; see also Chapter 3). However, LTR elements upstream of genes in the same transcriptional orientation as the gene form the transcriptional start sites of genes far more often than other TE types, relative to their local genomic density (Figure 1.7). LTRs that are antisense to the gene transcriptional direction form the 5' ends of genes next most often, suggesting a significant, though secondary, role for transcription factor binding sites donated by the LTR as a nearby enhancer or downstream promoter.

Figure 1.7 Orientation of TEs that contain human gene transcriptional start sites. TEs within the genomic region 5kb upstream and 5kb downstream of human RefSeq gene transcriptional start sites were grouped by class (SINE, LINE, LTR, DNA) and orientation (sense, antisense) with respect to the direction of gene transcription.
The fractions of sense and antisense TEs that contain transcriptional start sites are depicted for each class by gray and black bars, respectively. Taken from Dunn et al. (2005).

Most of the above-discussed roles for TEs involve LTRs in control of transcription, but this balance likely reflects some bias in the current research interests of the groups involved. TEs have also been shown to perform other neutral or beneficial roles. As mentioned above, L1s have been shown to have antisense promoter activity. This has been shown to involve several cellular transcripts (Nigumann et al. 2002). Furthermore, LTRs have been shown to donate polyadenylation signals to cellular genes. In one example, ERV-H LTRs provide polyadenylation signals to the HHLA2 and HHLA3 genes (Mager et al. 1999). The absence of these LTRs in baboon was associated with use of alternate polyadenylation signals. Another example involves polyadenylation of receptor tyrosine kinase FLT4 and other transcripts by human ERV-K LTRs (Baust et al. 2000). Lastly, an array of several retroelements has been shown to exert a modulatory effect on transcription and translation efficiency of the zinc finger gene ZNF177 (Landry et al. 2001).

The above considerations, taken together, are strong evidence of a neutral or positive role for some TEs, particularly LTRs, and motivated a bioinformatic survey of the contributions of TEs to the transcripts of protein-coding genes, as reported in Chapter 3.

1.3.2 Protein domains donated by TEs

In addition to roles in transcription control, TEs have also been demonstrated to have contributed to mammalian proteomes. Perhaps the best known example is that of the recombination activating (RAG) genes involved in V(D)J recombination, which are derived from DNA transposons (reviewed in Brandt and Roth 2004; Schatz 2004).
The RAG genes are found in jawed vertebrates and mediate precise, site-specific combinatorial joining of gene segments making up antigen receptors in T and B cells.

In addition, ERV sequences have also been found to perform useful functions. In two known and completely unrelated cases, the fusogenic properties of coding-competent retroviral env genes from ERV-W and ERV-FRD have been co-opted as two different syncytin genes, both with a role in placenta formation (Mi et al. 2000; Renard et al. 2005). These and other potential roles of retroviral-derived proteins in health and disease are discussed by Bannert and Kurth (2004).

1.4 Stability of repetitive sequence

1.4.1 Assumptions of stability and use of retrotransposed sequence as population markers

Insertions of retroelements have undergone intense scrutiny as a destabilizing influence in mammalian genomes. In particular, insertions of recent families of SINEs and LINEs in primates have been shown to be associated with deletions and other rearrangements upon insertion (Gilbert et al. 2002; Symer et al. 2002; Callinan et al. 2005). Furthermore, as mentioned above, high copy number genomic elements may cause disease by homologous recombination with each other (Deininger and Batzer 1999).

New insertions of retroelements that are at worst mildly deleterious segregate as Mendelian alleles in the population and have a small chance of reaching fixation. Once fixed, elements are generally assumed to be stable, except when involved in infrequent rearrangements. As a result, assessments of relative activity of retroelements have assumed that presence of an element at a given site is an unambiguous indication of insertion (Liu et al. 2003). A corollary of this assumed unidirectionality of retroelement insertions with no known mechanism for precise deletion is that genomic retroelement loci may be assumed identical by descent rather than identical by state.
This property makes active families of Alu and L1 elements ideal for use in studying relationships between human populations (Perna et al. 1992; Sheen et al. 2000; Carroll et al. 2001; Roy-Engel et al. 2001). For example, 20 to 30 percent of insertion loci of some active Alu families are polymorphic for presence or absence of the element between human populations (Carroll et al. 2001; Roy-Engel et al. 2001). Similar analyses have also been used in the resolution of primate phylogeny. For example, one study used evidence from the AluYe5 family to show strong support for a sister grouping of chimpanzees and humans (Salem et al. 2003b).

1.4.2 A role for DNA double-strand break (DSB) repair in creating de novo insertions, deletions, and tandem duplications

While insertion of pol-II transcribed retroelements has the potential to explain many genomic regulatory changes, many more sequence differences between genomes may be explained by the paradigm of DNA DSB repair. The association of single gene mutations in this pathway with known diseases (for review, see Thompson and Schild 2002) and the feasibility of studying DSB repair in cell culture systems have made it possible to investigate many of the most important proteins and their mechanisms.

Double strand breaks in DNA, induced by gamma rays or free radical insult, are repaired by one of two main mechanisms. One is slower and involves similarity between the sequences to be joined, while the other is faster and homology-independent. The interactions between these pathways have been characterized as competitive in eukaryotic cells (Prudden et al. 2003), and resolution of a DSB sometimes involves both mechanisms. Before repair begins, however, broken and damaged DNA ends are trimmed by an endonuclease complex (Helleday 2003).
The fast, homology-independent mechanism of DSB repair, commonly termed non-homologous end joining (NHEJ), initiates with the binding of a Ku heterodimer in a ring around each broken DNA end, stabilizing it (Walker et al. 2001). The DNA-bound Ku heterodimers are then bound together and the DNA ends trimmed (Helleday 2003). Finally, ligases rejoin the two ends.

The slower, homology-driven mechanism of DSB repair is believed to begin with a 5' to 3' peeling back or resection of the DNA by an unknown exonuclease, exposing a 3' single-stranded DNA (ssDNA) end. Frequently, exposure of repeats several hundred base pairs long, such as Alus, in two 3' ssDNA ends in mammalian cells results in repair by a mechanism known as single-strand annealing (SSA) (Elliott et al. 2005). SSA is an error-prone mechanism which results in loss of one of the flanking repeats and the intervening sequence. In the absence of long similar sequences, shorter sequences are also used, but less frequently, and the precise mechanisms are uncertain (Sankaranarayanan and Wassom 2005).

As an alternative to SSA, homologous recombination (HR), which involves BRCA1, BRCA2, and RAD51 proteins, may occur. In this case, RAD51/ssDNA filaments invade nearby DNA duplexes and pair with homologous DNA (Helleday 2003). The sister chromatid, available during the S and G2 phases of the cell cycle, is the homologous DNA duplex most often used as a template for repair (Johnson and Jasin 2000), rather than the homologous chromosome, use of which could result in loss of heterozygosity (Moynahan and Jasin 1997). Upon strand invasion, a Holliday junction is formed and DNA synthesis occurs, often continuing beyond the original site of DNA breakage. In any case, final resolution of the break may occur by NHEJ, secondary SSA, or reinvasion of the original duplex by the nascent DNA strand at the homologous location beyond the site of the original break.
Resolution by secondary NHEJ or SSA has the potential of causing deletions or tandem DNA insertions, while secondary SSA or reinvasion of the original duplex may result in error-free repair. Finally, aberrant template choice, for example in the case of DNA breakage at Alu elements or other ubiquitous motifs, may result in ectopic sequence duplication (for review, see Helleday 2003).

Chapter 5 discusses observations of the absolute rate with which retroelement insertions actually revert or undergo precise deletion, presumably by a DNA DSB homologous repair mechanism. The relative paucity of these events shows that identity by descent is a relatively safe assumption for retroelement loci compared across species. It also addresses the role of homologous direct repeats in sequence removal from primate genomes.

1.5 Repeat finding methods

While the goal of this thesis was to understand the interaction of repetitive elements with their host genomes, it depended heavily on repeat annotation provided by RepeatMasker (A.F.A. Smit and P. Green, unpublished) as tracks in the UCSC Genome Browser. RepeatMasker makes use of the Repbase libraries (Jurka 2000) to iteratively mask genomic sequence, excising fully embedded repeats and allowing more confident identification of the targeted repeat (A.F.A. Smit and P. Green, unpublished). MaskerAid (Bedell et al. 2000) is a suite of perl scripts that adapts WU-BLAST (W. Gish, unpublished) for use with RepeatMasker, functionally replacing the cross_match aligner. MaskerAid increases the speed of RepeatMasker analysis dramatically, by 40 fold at most RepeatMasker sensitivity settings, while identifying repeats with nearly identical sensitivity and specificity (Bedell et al. 2000). This reflects the fact that both methods rely on pairwise sequence alignment of genomic sequence with consensus sequences, typically from Repbase Update (Jurka 2000), to identify repetitive regions in genomes.
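The shared principle, scoring a local alignment between a stretch of genomic sequence and a repeat consensus, can be illustrated with a plain Smith-Waterman score. This is only a sketch of the general technique and is not the cross_match or WU-BLAST algorithm; the scoring parameters, the function names, and the idea of a fixed score threshold are assumptions made for illustration.

```python
def local_alignment_score(query, consensus, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty.

    Only the score is returned (no traceback), so two rows of the
    dynamic programming matrix are sufficient.
    """
    prev = [0] * (len(consensus) + 1)
    best = 0
    for q in query:
        curr = [0]
        for j, c in enumerate(consensus, start=1):
            diagonal = prev[j - 1] + (match if q == c else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            if score > best:
                best = score
        prev = curr
    return best


def looks_repetitive(window, consensus, min_score):
    """Flag a genomic window whose best local score reaches `min_score`.

    `min_score` is a hypothetical threshold; real tools derive their
    cutoffs from score statistics and sensitivity settings.
    """
    return local_alignment_score(window, consensus) >= min_score
```

With these assumed parameters, a window containing an exact copy of a seven-base toy consensus scores 14, and a copy interrupted by a single inserted base scores 12, which shows how each insertion or deletion lowers the score of an otherwise recognizable repeat.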
RECON, by contrast, uses a tunable, heuristic analysis of results of a self-BLAST to perform de novo definition of repeat boundaries and thus repeat families (Bao and Eddy 2002). The algorithm is tunable on at least two levels, that of the sensitivity of the BLAST search it is based on and the 'willingness' of the heuristic analysis to split elements up (Bao and Eddy 2002; Holmes 2002). This and related methods of constructing repeat family consensus elements can make a valuable contribution to analysis of newly sequenced genomes.

RepeatScout represents a further development on the RECON concept (Price et al. 2005). RepeatScout uses overrepresented short sequences as seeds for local pairwise alignments, which then are reconstructed into longer repeat consensus sequence. In testing, this de novo repeat finding method identified a more complete set of rodent repeat families than had been included in Repbase Update (Jurka 2000). However, repeat identification by masking with RepeatScout-generated human libraries found fewer repeats than masking with Repbase libraries. This observation was attributed to the fact that the human genome has been better studied, allowing for many years of curation of human repeat families.

In summary, all these methods are limited in that they rely on pairwise alignment methods to find repeats. Repeats that undergo many short insertions or deletions, despite being recognizable by eye, are unduly penalized for being broken up. It seems clear that a better model of sequence evolution is required before repeat finding can reach optimal sensitivity. In the interim, repeats found by RepeatMasker and Repbase libraries have provided an opportunity to assess the nature of the effects of transposable elements on their host genome.

1.6 Thesis objectives and chapter summaries

Repetitive sequence is ubiquitous in mammalian genomes and has an array of consequences for the organism, both positive and negative.
In some cases, positive roles for repetitive sequence are related to their mutagenic role. For example, binding sites donated by ERV LTRs can function as transcriptional enhancers, repressors, and polyadenylation signals. While these functions can interfere with the function of nearby genes, in many cases they can also provide regulatory diversity. While a few cases of neutral or positive roles have been identified for LTRs, there are likely many more.

Nonadjacent repeated sequences, on the other hand, can have a different role. Whether in the form of high-copy repeats or short random nonadjacent segments of identity, these sequences can function as substrates for DNA repair. While the potential for deletion of tumor suppressors and other genes is obvious, nonadjacent repeats may also have a positive function in genome size attenuation.

The main goal of this thesis work was to use automated sequence analysis and global statistical analyses of mammalian, primarily human, genomic repeat distributions to gain understanding of the roles of repetitive sequence in defining organisms at the genetic and hopefully also phenotypic levels. As the project progressed, more and more information became available that increased our ability to ask fundamental questions about these roles. Time was spent at the outset of the project developing methods and algorithms; however, as more data became available, the data formats and repositories evolved in response. The availability of a draft of the human genome sequence provided the positive stimulus for the development of the various genome browsers as well as long term choices being made on data formats. The information infrastructure and tools made available with the data have considerably aided in this analysis. Subsequent availability of a draft of the mouse genome sequence enabled us to use comparative approaches to assess TE involvement in the human and mouse transcriptomes.
At the same time, availability of raw sequencing traces from multiple human sources enabled others to assess the levels of repeat polymorphism present in the human. Using these published data sets, we were able to investigate detrimental impacts of retroelement families active in humans and compare them to the projected detrimental impacts of older, inactive families. Finally, availability of sequence traces from Rhesus monkey allowed us to assess precise deletion of retroelements in the human and chimpanzee lineages.

Chapter 2 describes our initial analyses of the draft sequence of the human genome. The analyses performed were essentially collations of repetitive sequence and assessment of their location with respect to genomic features like sequence composition and genic positions. This analysis showed that TEs of different ages localize to different regions of the human genome and that LTR elements demonstrate orientation bias up to 5kb upstream and downstream of transcribed regions.

Chapter 3 describes our analysis of TEs in transcripts. Mappings of human and mouse protein coding transcripts and TEs were compared to find transcripts that contained TE sequence. Databases of gene classification information were then developed and classification of human and mouse genes performed to determine which classes of genes had TEs as part of their transcripts more often. The same classes of genes in both human and mouse, such as those with more organism-specific functions, tended to be permissive for the presence of TE sequence in transcripts. Highly conserved genes and genes with important housekeeping or developmental functions were less permissive.

Chapter 4 reviews our work on TE directional biases in human and mouse transcribed regions. Different families of LTR elements showed widely varying retention patterns within transcribed regions.
SVA retroelements, which retain a partial LTR in their consensus, also demonstrated an LTR-like pattern, suggesting that TEs transcribed by RNA polymerase II (pol II) exert varying effects upon insertion, depending on the sequence of the element.

Chapter 5 is concerned with analysis of nonadjacent repeated sequence and its role in recombination and deletion. Retroelement insertions, normally assumed to be stable with no known mechanism for precise deletion, are shown to be deleted precisely in some cases, most likely due to a mechanism involving DNA double strand break repair and the flanking short tracts of identical sequence. Analysis of random deletions showed that a large fraction of these events are also mediated by short nonadjacent tracts of identical sequence.

Chapter 2: Retroelement distributions in the human genome

A version of this chapter has been published: Medstrand, P.*, L.N. van de Lagemaat*, and D.L. Mager. 2002. Retroelement distributions in the human genome: variations associated with age and proximity to genes. Genome Res 12: 1483-1495.
* these authors contributed equally to this work

I performed all data analysis and wrote sections of the paper. P.M. and D.L.M. performed postprocessing of the data in Figure 2.2 and wrote a significant share of the paper.

2.1 Introduction

Since Barbara McClintock discovered transposable elements (TEs) in maize (McClintock 1956), it has become well established that such elements are universal. While there are examples of both loss and increase of host fitness due to the activity of transposable elements, their population dynamics are far from being understood, and the forces underlying their genomic distributions and maintenance in populations are a matter of debate (Biemont et al. 1997; Charlesworth et al. 1997).
The prevailing view is that TEs are essentially selfish DNA parasites with little functional relevance for their hosts (Doolittle and Sapienza 1980; Orgel and Crick 1980; Yoder et al. 1997). According to this hypothesis, the interaction of TEs with the host is primarily neutral or detrimental and their abundance is a direct result of the ability to replicate autonomously. It is generally accepted that selection is the major mechanism controlling the spread and distribution of TEs in natural populations of model organisms (Charlesworth and Langley 1991). While the exact mechanisms through which selection acts are controversial, the processes controlling transposition involve selection against the deleterious effects of TE insertions close to genes (Charlesworth and Charlesworth 1983; Kaplan and Brookfield 1983) and selection against rearrangements caused by unequal recombination (ectopic exchange) in meiosis (Langley et al. 1988). More recently, the ubiquitous nature of TEs has gained increasing attention and it is now becoming accepted that TEs give rise to selectively advantageous adaptive variability which contributes to evolution of their hosts (McDonald 1995; Brosius 1999). However, the mechanisms responsible for maintenance, dispersion, fixation and genomic clearance of TEs remain largely unknown.

While most work on TEs has focused on model organisms, sequencing of the human genome has revealed that nearly half of our DNA is derived from ancient TEs, mainly retroelements (Smit 1999; International Human Genome Sequencing Consortium 2001). The wealth of human genomic information now allows comprehensive explorations into the evolutionary history and genomic distribution patterns of transposable elements with a view to increasing our understanding of the forces that have shaped our genome and its mobile inhabitants.
The retroelements present in the human genome are divided into two major types, the non-LTR and LTR retroelements (International Human Genome Sequencing Consortium 2001). The non-LTR retroelements are represented by the autonomous L1 and L2 elements (LINE repeats) and the non-autonomous Alu and MIR (SINE) repeats and have been extensively studied (Smit 1999; International Human Genome Sequencing Consortium 2001; Ostertag and Kazazian 2001; Batzer and Deininger 2002), but appreciation of the heterogeneous collection of LTR retroelements is more limited. LTR retroelements make up 8% of the human genome (International Human Genome Sequencing Consortium 2001) and include defective endogenous retroviruses (ERVs) (Wilkinson et al. 1994; Sverdlov 2000; Tristem 2000), related solitary LTRs, and sequences with LTR-like features for which no homologous proviral structure has been found. Over 200 families of LTR retroelements are defined in Repbase (Jurka 2000) but they can be grouped into six broad superfamilies (see Methods). While some of the LTR retroelement families, particularly members of class I and II ERVs, presumably entered the primate germline as infectious retroviruses and then amplified via retrotransposition (Wilkinson et al. 1994; Sverdlov 2000; Tristem 2000), other LTR families likely represent ancient retrotransposons that amplified at different stages during mammalian evolution (Smit 1993).

The vast majority of human retroelements were actively transposing at various stages prior to and during the radiation of mammals and are now deeply fixed in the primate lineage. Essentially only the youngest subtypes of Alu (Batzer and Deininger 2002) and L1 elements (Ostertag and Kazazian 2001) are still actively retrotransposing in humans. Some ERVs belonging to the Class II HERV-K family are human-specific (Medstrand and Mager 1998) and a few are polymorphic (Turner et al.
2001) but no current activity of human ERVs has been documented. Here we show that genomic densities of human retroelements vary with distance from genes and that their distributions with respect to surrounding GC content also shift as a function of their age.

2.2 Methods

2.2.1 Description of retroelements

Human retroelements are classified into two major classes: non-LTR and LTR retroelements. The former comprise the LINEs, represented by the L1 and L2 elements, and the SINEs, represented by the Alu and MIR elements. For this analysis LTR retroelements were divided into the following six groups (Smit 1999; Jurka 2000; International Human Genome Sequencing Consortium 2001; Mager and Medstrand 2003): class I ERVs, which are similar to type C or gamma retroviruses such as murine leukemia virus; class II ERVs, which are similar to type B or beta retroviruses like mouse mammary tumor virus; class III ERVs (also called ERV-L), which have limited similarity to spuma retroviruses; MER4 elements, which are non-autonomous class I related ERVs; and MST (named for a common restriction enzyme site MstII) and MLT (mammalian LTR transposon) elements, which are both part of the large non-autonomous Mammalian apparent LTR Retrotransposon (MaLR) superfamily. Solitary LTRs outnumber LTR elements with internal sequences by approximately 10 fold.

2.2.2 Data sources

Genomic sequence and annotated gene data for all figures were derived from the August 6, 2001 draft human genome assembly at http://genome.ucsc.edu. Retroelement locations derived from RepeatMasker (http://ftp.genome.washington.edu/RM/RepeatMasker.html), GC content calculated in non-overlapping windows of 20 kb, sequence gap data, and known gene data from the Reference Sequence database were all downloaded from this site. After compilation, data points were included in graphs only if supported by more than 100 retroelements.
Element count was calculated to reflect as nearly as possible the number of individual integrations of the element. That is, nearby repeat segments (within 20 kb of each other) having the same family name and RepeatMasker alignment parameters (alignment score, substitution and gap levels) were combined and treated as a single element. Subfamily assignments and divergence values were taken directly from RepeatMasker output files. Internal sequences of LTR elements were excluded from the analysis.

Data were further conditionally discarded in figures where retroelement divergence is used as a measure of age. In some cases where element length was short (below 150 bp), it was noted that RepeatMasker assigned an artificially low divergence value due to the alignment method used in finding repeats. This was a particular problem for the old MIR and L2 sequences. An attempt was therefore made to ensure that relative divergence indeed represented age by plotting element length versus assigned divergence values. Since repeats in general grow shorter as they age (e.g. see Figure 2.5), retroelement divergence cohorts were considered anomalous and discarded if they did not follow this trend.

2.2.3 Density analysis

The retroelement data were compiled by repeat superfamily, divergence from consensus, and surrounding genomic GC content. The density function in Figures 2.1, 2.4 and 2.6 was calculated as follows: the fraction of the retroelement base pairs in a given GC bin divided by the fraction of the genome in that GC bin. Thus, it affords a measure of preference of a particular age class for different GC contents. When an age class of an element had a significant presence in only some of the GC bins, the effective genome size for that age class was calculated from the sizes of only those GC bins. Thus for the Figure 2.6 genomic data, the 'whole genome' is that fraction of the genome with GC content less than 46%.
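As an illustration only, and not the script used in this work, the density function defined above can be sketched in Python as follows. The function name, the dictionary layout, the optional `bins` argument (used here to restrict the effective genome to the GC bins in which an age class is considered present) and the base-pair counts are all assumptions made for this sketch.

```python
def gc_bin_density(element_bp, genome_bp, bins=None):
    """Return {GC bin: density} for one retroelement age class.

    density = (fraction of the element's base pairs in the bin)
              / (fraction of the effective genome in the bin)
    """
    # Assumption: the caller passes `bins` when only some GC bins have a
    # significant presence; the effective genome is then just those bins.
    if bins is None:
        bins = list(genome_bp)
    total_element_bp = sum(element_bp.get(b, 0) for b in bins)
    total_genome_bp = sum(genome_bp[b] for b in bins)
    density = {}
    for b in bins:
        element_fraction = element_bp.get(b, 0) / total_element_bp
        genome_fraction = genome_bp[b] / total_genome_bp
        density[b] = element_fraction / genome_fraction
    return density


# Hypothetical base-pair counts per GC bin, for illustration only.
genome_bp = {"<34": 100, "34-36": 300, "36-38": 600}
element_bp = {"<34": 20, "34-36": 30, "36-38": 50}

all_bins = gc_bin_density(element_bp, genome_bp)
low_gc_only = gc_bin_density(element_bp, genome_bp, bins=["<34", "34-36"])
```

Under this definition a value above 1 means the age class holds a larger share of its base pairs in that bin than the genome does, and a value below 1 means the opposite. Restricting `bins` changes both denominators, which is why the same counts give different densities in `all_bins` and `low_gc_only`.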
In Figure 2.2, the bin considered was an individual chromosome. With these considerations in mind, the calculations of density are identical. For Figure 2.2 (retroelement density versus GC content on each chromosome), correlation coefficients (r) and levels of significance (p-values) were calculated for each data set. The graphs of chromosomal retroelement density as a function of gene density are not shown, but are almost identical because of the highly significant correlation between GC content and gene density (International Human Genome Sequencing Consortium 2001).

For Figure 2.3, a script divided the chromosomes into eleven segment types or bins: within the transcript start and end positions of known (annotated) genes and 0-5, 5-10, 10-20, 20-30, and >30 kb upstream and downstream of genes. The majority of the genome was located either within genes (22% of the total) or at distances greater than 30 kb from genes (63% of the total). In each segment, the script determined the base pair contribution of each retroelement type and noted the orientation of the element with respect to the nearest gene. The GC content of each segment was calculated and then the density data from Figure 2.1 were used to predict the base pair contribution by each retroelement type in the segment. Predictions done within genes or at distances >30 kb from genes were compiled from predictions made from 10 kb sub-segments. Half of the predicted retroelement base pairs were assumed to be in the sense orientation and half in antisense. Finally, the observed base pairs in each bin were divided by the cumulative predicted base pairs for each retroelement type.

P values shown in Tables 2.1, 2.2, and 2.3 and variability of the data in Figures 2.3, 2.4 and 2.6 were calculated as follows. The sequence segments comprising the whole genome were divided up into four 'subgenomes' of equal composition.
The retroelement distributions were calculated in each subgenome, and their means and standard deviations across the subgenomes were then computed. After appropriate normalization, the significance (p-value) of the difference between different retroelement distributions was tested by the one-tailed unpaired t-test.

2.3 Results and Discussion

2.3.1 Distributions of retroelements in different GC domains

To begin our analysis, we measured the density of various retroelements with respect to GC content in 20-kb windows across the human genome sequence. As reported previously (Smit 1999; International Human Genome Sequencing Consortium 2001), L1 elements are predominantly found in the AT-rich regions, L2 elements are more uniformly distributed, whereas Alu and MIR repeats reside in the higher GC fractions of the genome (Figure 2.1A) in comparison to the entire genome, which has an average GC content of 40% (International Human Genome Sequencing Consortium 2001). For the different LTR superfamilies, an uneven distribution in GC occupancy is also observed. The relatively young Class I ERVs and the non-autonomous MER4 sequences, which may have been propagated by Class I elements, have very similar broad distributions that peak in regions of medium GC. Class II ERVs, which include the youngest known HERVs (Medstrand and Mager 1998; Turner et al. 2001), have a distribution more skewed toward higher GC regions (Figure 2.1B). Distributions of the older Class III ERVs and their distantly related MLT and MST elements are generally biased toward low GC regions, except for MLT elements which are spread more uniformly (Figure 2.1C).
[Figure 2.1 plots, panels A to C. x-axis: genomic GC (%), binned from <34 to >54. Legend series: Alu, L1, MIR, L2, Class I ERV, Class II ERV, Class III ERV, MLT, MST, MER4.]

Figure 2.1 Density of retroelements in different GC fractions in the human genome, calculated over 20-kb windows across the genome sequence. Panels A to C show the density of various retroelement classes and those represented in each panel are indicated in the box below the graphs. The bins from left to right correspond to an increasing 2% GC fraction.

To determine if retroelement densities on each chromosome agree with overall densities shown in Figure 2.1, we plotted densities against estimated gene density (data not shown) or average GC content of each chromosome (Figure 2.2). As expected, the two distribution profiles are almost identical because of the strong correlation between GC content and gene density (International Human Genome Sequencing Consortium 2001). The density of Alu elements increases as a strict function of increasing GC content and MIR elements also generally follow this trend (Figure 2.2A, C). In contrast, there is generally a negative or no correlation between the density of L1, L2 or LTR elements and gene density or GC content (Figure 2.2). The Class II ERVs and the MLT elements show little, if any, bias for GC-poor chromosomes, while the L1, Class I, III and MST groups are overrepresented on these chromosomes. Class I-II elements are dramatically overrepresented on chromosome Y, as noted before (Kjellman et al. 1995; Smit 1999; International Human Genome Sequencing Consortium 2001), and also somewhat on chromosome 19.
Abundance of the youngest ERVs on chromosome Y may be due to recombination isolation and absence of major recent rearrangements on much of this chromosome (Graves 1995; Lahn et al. 2001). Since chromosome 19 is much more gene-dense than the other chromosomes (International Human Genome Sequencing Consortium 2001), one possible explanation for the overrepresentation of the same ERVs on this autosome is that these elements had an initial integration preference for regions near genes or gene-related features such as CpG islands. We also noted an underrepresentation on Y of the old L2, MIR and MLT retroelements, which is consistent with major rearrangements and deletions of Y during mammalian evolution (Lahn et al. 2001). Similar trends are observed for MER4 distributions and their autonomous class I counterparts (overrepresentation on Y and 19), and for the non-autonomous MaLR (MLT and MST) elements and their apparent autonomous class III ERVs (overrepresentation on 21). Alu, L1, MER4, class I and II ERV sequences represent the younger elements which have actively amplified during the last 40 Myr of primate evolution, whereas other element types were already inactivated for transposition by this time (International Human Genome Sequencing Consortium 2001). All younger retroelements except Alu sequences are overrepresented on Y. Even though some of the LTR superfamilies show a stronger negative correlation than others, the distribution profiles demonstrate that various retroelement families cluster preferentially in different genomic landscapes and are in agreement with the general trends observed in Figure 2.1.
[Figure 2.2 plots, panels A to J. x-axis: GC (%), 36 to 48. Panel titles and correlation p-values: A, Alu, p<1e-11; B, L1, p<1e-11; C, MIR, p<1e-5; D, L2, p=0.46; E, MLT, p=0.012; F, MST, p<1e-8; G, MER4, p=0.0015; H, Class III ERV, p<1e-9; I, Class I ERV, p<1e-8; J, Class II ERV, p=0.26.]

Figure 2.2 Density of retroelements as a function of average GC content of each human chromosome. The line connecting solid diamonds indicates the general correlation trend between retroelement density and GC content of individual chromosomes. The level of significance (p-value) of the correlation for each data set is indicated. Open diamonds were excluded from the correlation analysis and indicate over- or underrepresentation of retroelement density on a particular chromosome. Chromosomes 20, 21 and 22 were excluded from the Class II graph (panel J) due to having fewer than 100 supporting elements.

2.3.2 Arrangements of retroelements with respect to genes

Given the results in Figures 2.1 and 2.2, we looked in more detail at the distribution of retroelements by locating all elements in the human genome relative to annotated genes. While it is reasonable to assume that locations with respect to genes affect retroelement dispersal and fixation patterns, the aim of this analysis was to obtain a measure of this effect. Our strategy was to determine how closely retroelement densities with respect to genes could be predicted based on the surrounding GC content.
DNA regions located upstream of each gene's transcriptional start site and downstream of the polyadenylation site were divided into segments of various size fractions (see Methods) and the density of each retroelement class in either transcriptional orientation with respect to the gene was determined. Regions within the boundaries of a gene, including the introns, were assigned a single segment. The local GC content of each segment was also calculated and used to determine an expected retroelement density based on the whole genome distributions indicated in Figure 2.1 (see Methods); the results are shown in Figure 2.3. To obtain estimates of the variation associated with this type of analysis, we divided the genome into four 'subgenomes' as detailed in Methods and performed the analysis independently for each. The points in the graphs represent the mean and standard deviation derived from values obtained for each subgenome.

[Figure 2.3 plots, one panel per retroelement group. x-axis: position relative to genes (gene, <5, 5-10, 10-20, 20-30 and >30 kb, downstream and upstream). Series: sense and antisense.]

Figure 2.3 Ratios of observed to predicted retroelement densities with respect to genes in the human genome. The points above 'gene' and '<5' of each graph indicate the density in gene regions, and in the first 5 kb either 5' or 3' of genes. The other bins are 5-10, 10-20, 20-30, and >30 kb either upstream or downstream of genes. Open symbols and dashed lines indicate elements in the same or sense orientation with respect to the nearest gene and solid symbols and lines indicate elements in the reverse direction.
Standard deviation error bars, which are too small to see in some cases, were determined as described in Methods. Solid boxes below the graphs represent gene regions and the lines indicate the distance bins of the intergenic regions. It should be noted that the vast majority of retroelements within genes are located in introns.

Dividing the genome based on proximity to genes revealed several intriguing patterns. First, densities of the relatively old MIR and L2 elements in intergenic regions generally conform to that predicted from the GC content of each region. That is, the ratio of observed to expected density is close to one (Figure 2.3C, D). Second, for the SINE (Alu and MIR) elements, densities within genes are close to that predicted or are overrepresented based on average GC content of gene regions (Figure 2.3A, C). In contrast, L1 elements and all six LTR classes, particularly those in the same transcriptional direction, are underrepresented within genes (Figure 2.3B, E-J). L1 sequences and the older MLT, MST and Class III elements are also underrepresented in the 0-5 kb regions both upstream and downstream of genes, while the younger class I and MER4 elements are underrepresented in the downstream region only. The higher tendency for LTR elements and L1s within genes to be oriented in the antisense direction has been noted previously (Smit 1999) and likely reflects lower fixation rates resulting from interference by retroelement regulatory motifs, such as polyadenylation signals, when genes and elements are located in the same transcriptional direction. However, this is the first study to demonstrate lower densities of LTR and L1 elements within genes relative to that predicted based on the surrounding GC content. In addition, the fact that an orientation bias for some elements extends to significant distances away from genes has not been reported previously.
Moreover, our analysis indicates that the densities of most LTR elements and L1s are highest in regions furthest from genes. These patterns suggest that L1 and LTR elements are excluded from genes and nearby regions by selection.

Interestingly, the density distribution of Alu elements with respect to genes is opposite to that observed for L1 and most LTR elements in that the density is lowest in regions most distant from genes and they are overrepresented (as predicted by GC content) in regions within and near genes. It is also noteworthy that densities of the relatively young LTR class II elements peak in the region 5-20 kb 5' or 3' of genes and, indeed, are overrepresented in these areas compared to the expected densities based on regional GC content (Figure 2.3J). Such a pattern may reflect a preference for this class of elements to integrate near genes.

The statistical significance of these results is shown in Table 2.1, which lists the resulting p-values for three sets of comparisons. The top part of the table compares the sense versus antisense distributions and confirms the significance of the orientation biases discussed above. MIR elements are the only group to show no significant orientation bias. In contrast, an orientation bias extends up to 20 kb 5' of genes for MLT and MST elements. The bottom two panels in Table 2.1 compare densities of retroelements in each orientation at each intergenic location to the densities of retroelements in regions most distant (>30 kb) from genes. These latter comparisons illustrate that the retroelement density differences plotted relative to gene location are highly significant. For example, the densities of Alu sequences at all locations are highly significantly different from their density in regions >30 kb from genes.

Table 2.1 Significance (p-values) of retroelement locations with respect to genes
[Table body not recoverable from the extracted text. Panels: sense vs. antisense; antisense vs. >30 kb from genes; sense vs. >30 kb from genes. Columns: Alu, L1, MIR, L2, MLT, MST, MER4, Class I, Class II, Class III. Rows: in, and distance bins in kb dnst and upst.]
Shaded regions are significant (p < 0.05). in: within a gene; dnst: kb downstream of the nearest gene; upst: kb upstream of the nearest gene.

2.3.3 Shifting retroelement distributions with age

It is apparent that the retroelement distributions in genes and intergenic regions (Figure 2.3) do not fully conform to the genome-wide distribution patterns of elements observed in Figures 2.1 and 2.2. Furthermore, for Alu repeats, it has been reported previously that young elements (< 1 Myr) have a preference for AT-rich regions whereas older Alus show an increasing density in GC-rich DNA (Smit 1999; International Human Genome Sequencing Consortium 2001) (see Figure 2.4A) and hypotheses to explain this phenomenon have been proposed (Schmid 1998; Brookfield 2001; International Human Genome Sequencing Consortium 2001; Pavlicek et al. 2001) (see Section 1.2.4). Transposition into AT-rich regions might be expected to lead to accumulation of TEs in this gene poor part of the genome (e.g. the heterochromatin) where recombination is strongly reduced and element interference with genes is less pronounced. However, the observed density differences of the youngest Alu elements (present in AT-rich regions) as opposed to older elements (in GC-rich regions) do not follow this expectation. A possible explanation for the age-related Alu density differences is that these retroelements are removed preferentially from their initial integration sites in the AT-rich regions of the genome prior to fixation.
However, because there is a gradual density increase of Alu elements by age in the GC-rich fraction, it is possible that already fixed elements are gradually lost from the AT-rich region while they are maintained in GC-rich regions.

[Figure 2.4 plots, one panel per retroelement group. x-axis: genomic GC (%), binned from <34 to >54. Legend, divergence classes (%): 0-1, 1-5, 0-5, 5-10, 10-15, 15-20, 20-25, 25-30, 30-35, 35-40, 40+.]

Figure 2.4 Retroelement densities of different divergence classes in various GC fractions of the human genome. The density distribution of each retroelement divergence cohort was plotted in GC bins as indicated in the legend to Figure 2.1. The divergence classes are indicated in % divergence from the consensus sequence below the graphs. Data points missing in traces are due to GC bins containing fewer than 100 elements. Standard deviations were calculated (see Methods) but are not shown in the interest of clarity.

To investigate if other retroelements also change their genomic distribution with age, we determined the distribution patterns of LTR elements, SINEs and LINEs of different ages as a function of GC content (Figure 2.4). As discussed above, it is apparent that the youngest Alu elements (0-1% divergent), many of which are polymorphic insertions (Carroll et al. 2001; Batzer and Deininger 2002), are distributed differently than the next youngest (fixed) Alus of the 1-5% divergence group and that the densities of the next two Alu age cohorts (5-15% divergent) are skewed even further to GC-rich regions (Figure 2.4A).
Notably, this figure also reveals that the oldest Alu repeats are less prevalent in GC-rich domains and, indeed, have a density distribution closer to that of the youngest age class. This density pattern of the oldest Alu elements was not evident in a similar analysis reported previously (International Human Genome Sequencing Consortium 2001). In that study, Alu elements were divided by subfamily instead of divergence and the density of the oldest subfamily, AluJ, was still highly skewed to GC-rich regions. However, the AluJ subfamily was considered as a single large cohort, the members of which have divergences ranging from less than 10% to greater than 25%. When the more divergent AluJ members of 15-20% and 20-25% divergence are separated into their own groups, their densities are essentially identical to the patterns presented in Figure 2.4A (data not shown). Thus, the different methods for separating Alu elements account for the differences between our analysis and that in the genome consortium study.

Results of similar analyses conducted for the other retroelements reveal some provocative trends. As noted before (Smit 1999) and as shown in Figure 2.4B, young L1 elements are preferentially found in the AT-rich fraction in the genome and older elements tend to be found in the most AT-dense part of the genome. Analysis of the ancient L2 and MIR repeats was hampered by the short average length of most elements, which prevented an accurate determination of their divergence from a consensus sequence (age) (see Methods for details). However, for the two divergence classes that could be reliably determined, the oldest L2 and MIR sequences also show an increased density in the less GC-rich sections of the genome compared to their younger counterparts (Figure 2.4C, D). For most of the LTR elements, we observe a trend similar to that seen for the L2 and MIR sequences.
For elements belonging to the MLT, MST, MER4, Class I and III ERV groups, densities of the youngest members of these superfamilies peak in regions of higher GC compared to their older relatives (Figure 2.4E-I). That is, the highest concentrations of these elements appear to gradually shift to regions of lower GC with increasing age. This tendency is not evident for the Class II ERVs (Figure 2.4J). Potential explanations for this trend will be discussed below.

To determine if the shifting patterns observed in Figure 2.4 are statistically significant, we again divided the genome into four subgenomes and repeated the analysis for each of these. Each point in the graphs could then be assigned a mean and standard deviation based on values obtained for each subgenome. The t-test was used to determine if the density distribution of a particular age cohort was significantly different when compared to the next oldest cohort. Table 2.2 lists the p values resulting from this analysis. For all retroelements except the Class II ERVs, the majority of the density points are significantly different (p < 0.05) for at least one comparison between adjoining age cohorts. Indeed, for the most numerous elements, Alu and L1, almost all comparisons are statistically significant. If the youngest and oldest age cohorts of each superfamily are compared, all except the Class II ERVs are highly significant (data not shown).
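The cohort comparison described above can be illustrated with a short Python sketch. This is a sketch under stated assumptions rather than the analysis code used here: the text specifies a one-tailed unpaired t-test on values from four subgenomes, but the pooled-variance (Student) form, the function name, the critical-value lookup and the density values below are assumptions introduced for illustration, and the normalization step mentioned in Methods is not reproduced.

```python
import math
from statistics import mean, variance


def unpaired_t(sample_a, sample_b):
    """Student's unpaired t statistic with pooled variance.

    Returns (t, degrees_of_freedom). The pooled-variance form is an
    assumption; the text only states that the test was unpaired.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(sample_a)
                  + (n_b - 1) * variance(sample_b)) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / std_err, df


# Hypothetical densities of two adjoining age cohorts in one GC bin,
# one value per subgenome (four subgenomes each).
younger = [1.30, 1.25, 1.35, 1.28]
older = [1.10, 1.05, 1.12, 1.08]

t_stat, df = unpaired_t(younger, older)

# One-tailed critical value of t at alpha = 0.05 for 6 degrees of freedom.
T_CRIT_DF6 = 1.943
significant = t_stat > T_CRIT_DF6
```

With four subgenomes per cohort the test has only six degrees of freedom, so a comparison reaches p < 0.05 only when the difference between cohort means is large relative to the spread among subgenomes.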
Table 2.2 Significance (p-values) of distributional differences between divergence cohorts
[Table body not recoverable from the extracted text. Rows: GC content (%) bins <34, 34-36, 36-38, 38-40, 40-42, 42-44, 44-46, 46-48, 48-50, 50-52, 52-54 and >54. Columns: pairs of adjoining divergence cohorts compared within each retroelement group (Alu, L1, MIR, L2, MLT, MST, MER4 and the ERV classes).]

One qualification regarding these data concerns the method used to identify retroelements of different ages. Elements were classified as belonging to divergence cohorts based on percent substitution from their consensus sequence (Jurka 2000). The consensus sequence corresponds to the approximate sequence at the time of integration in the genome, so retroelements in higher divergence cohorts integrated earlier than retroelements with lower divergence values (Li and Graur 1991; Shen et al. 1991; Smit et al. 1995; International Human Genome Sequencing Consortium 2001). Therefore, the validity of this method is highly dependent on having accurate consensus sequences for all subfamilies. It is quite possible and even likely that some elements have been assigned an incorrect age due to extreme heterogeneity of some of the retroelement classes, particularly among the LTR groups.
However, if this were a major problem, one would not expect to observe a consistent shift in density in one direction, namely toward lower-GC regions with increasing divergence.

2.3.4 Length differences do not account for the shifting patterns

To investigate potential mechanisms that may underlie the age-related distribution differences, we used two different methods to try to determine if differential rates of retroelement deletion in different genomic GC regions account for the shifting patterns observed in Figure 2.4. First, we examined the relative length of elements in different GC fractions. The results of this analysis indicated that retroelements gradually become shorter as they age, presumably due to small deletions or loss of recognition of diverged segments by RepeatMasker, but the shortening is largely independent of the surrounding GC content (data not shown). The two exceptions to this general observation are represented by L1 elements and older Alu sequences (Figure 2.5). The average length of younger L1 elements (<10% divergence) peaks in the 38-42% GC fractions, which might explain the abundance of L1 base pairs in this region (Figure 2.4B). In the case of Alu elements in the 20-30% divergence cohorts, there is a slight decrease in apparent length with increasing GC content (Figure 2.5B), but this is not enough to account for the density pattern of this age group (Figure 2.4A). In addition, the small degree of shortening as measured here does not explain the rapid enrichment of younger Alu elements in higher GC fractions.

Figure 2.5 Length distribution of retroelements with respect to surrounding GC content. Retroelements of each group were classified as belonging to divergence cohorts as described in the text. The average length in base pairs (bp) of each retroelement divergence cohort contained within each GC bin (see legend to Figure 2.1) is shown for L1 and Alu elements (panels A and B).
GC bins containing fewer than 100 elements were excluded from the graphs.

2.3.5 Delay of Alu density changes on the Y chromosome

As another way of investigating the change in distribution of younger Alus toward GC-rich regions, we analyzed Alu density patterns on the Y chromosome, much of which does not recombine (Graves 1995), and detected a major difference on this chromosome compared to the whole genome (Figure 2.6). Alu elements on chromosome Y less than 5% divergent are not numerous enough to include in this analysis. However, the density pattern of Alus in the 5-10% divergence class is strikingly opposite to that observed in the whole genome, in that they are much more prevalent in AT-rich regions than in GC-rich regions (Figure 2.6C). The distributions of older Alu elements (>10% divergent from the consensus) with respect to GC content are consistent with the patterns seen in the entire genome (Figure 2.6D-F). Table 2.3 shows the p values resulting from this analysis. This finding suggests that the density shift of Alus from AT-rich to GC-rich regions during evolution was significantly delayed on the Y chromosome and, therefore, that the ability to recombine with a homologous chromosome greatly facilitated this shift.

Table 2.3 Significance (p-values) of distributional difference between Alus on the Y chromosome versus the whole genome
[Table body not recoverable from the extracted text. Columns: divergence cohorts 5-10, 10-15, 15-20 and 20-25%. Rows: GC content (%) bins.]

Figure 2.6 Density of Alu divergence cohorts in different GC fractions on chromosome Y compared to the whole genome. Solid lines indicate Alu elements on chromosome Y whereas dashed lines represent the Alu density in the whole genome. Parts A to F indicate the density of a specific divergence class, which is indicated at the top of each panel (A, 0-1% diverged; B, 1-5%; C, 5-10%; D, 10-15%; E, 15-20%; F, 20-25%). There were insufficient numbers of Alu elements on the Y chromosome in the first two divergence cohorts to be plotted in panels A and B. The density distribution of each Alu divergence class is plotted against the local 20-kb genomic GC content (%). Standard deviations were calculated as described in Methods.

2.3.6 Potential explanations for Alu distribution patterns

The density patterns of Alu elements do not conform to trends observed for other retroelements. These elements integrate into the AT-rich part of the genome but accumulate in GC-rich DNA (International Human Genome Sequencing Consortium 2001) (Figure 2.4A), and at least three hypotheses have been proposed to account for this phenomenon. One proposed explanation is that the GC-rich Alu elements are more stable in regions where the surrounding GC content is similar (Pavlicek et al. 2001). However, we have observed that partial deletions or apparent shortening of various Alu age groups are uniformly distributed irrespective of GC occupancy (Figure 2.5B).
This finding does not seem to support such a hypothesis, although it is possible that the tendency of retroelements to remain in regions of matching GC content does play some role.

A second hypothesis proposes that Alu elements are selectively retained in GC-rich regions because having these elements close to genes is of functional benefit (Britten 1997; Kidwell and Lisch 1997; Schmid 1998). Figure 2.3A shows that the Alu density near genes is higher than predicted based on GC content. That is, the tendency of Alu elements to be located near genes is not fully explained by the general GC-richness associated with coding regions, and such a pattern may therefore reflect a functional role for these elements. However, other observations appear discordant with this view. For example, it is known that the developmentally critical HoxD gene cluster is almost devoid of retroelements (International Human Genome Sequencing Consortium 2001). A recent study has also found that SINEs (Alu and MIR elements) are less frequently associated with imprinted than with non-imprinted genomic regions (Greally 2002). Certain classes of genes may therefore need to exclude such sequences from their environment to ensure proper function or regulation.

A third hypothesis proposes that the maintenance of Alus in GC-rich regions may be due to the adverse effects that deletions and unequal recombinations could have in gene-rich regions (Brookfield 2001). Indeed, due to the vast numbers of Alu elements in the genome, it is likely that specific recombinational mechanisms have been a major force in shaping the distribution of Alus in the genome. It has recently been demonstrated that the efficiency of Alu-Alu recombination in yeast increases as a pair of elements are placed closer together (Lobachev et al. 2000). Such closely spaced Alu pairs are found only occasionally in the human genome (Lobachev et al. 2000; Stenger et al.
2001), possibly because of clearance of these elements through the mechanism of inverted repeat (IR)-mediated recombination (Leach 1994). Alu elements seem quite promiscuous for recombination because two elements up to 20% divergent are still able to recombine efficiently (Lobachev et al. 2000). Furthermore, there are many examples of Alu-mediated recombination resulting in mutations in humans (Batzer and Deininger 2002).

These findings suggest a possible explanation for the changing Alu distribution profiles shown in Figure 2.4A and their enrichment near genes. Considering the high number of genomic Alu elements and the fact that they preferentially target AT-rich regions, these domains must have suffered a massive build-up of Alu integrations. Such accumulation likely resulted in increased recombination as the occurrence of closely spaced, highly related Alus increased, which could have led to loss of both newly integrated and fixed Alu elements in the AT-rich fraction of the genome. In regions close to genes, it is possible that Alu-Alu recombination events are less likely to be allowed or become fixed because of an increased chance of simultaneously removing gene regulatory domains (Brookfield 2001). This could help explain the over-representation of Alu elements near genes without invoking a functional role. The fact that we observe no increased density in GC- or gene-rich regions for the oldest Alus could be explained by the fact that Alus in these age cohorts are much less numerous and therefore would have been less subject to loss via recombination in AT-rich regions. Alu elements of 20-30% divergence are present in only ~25,000 copies, whereas younger Alus in the 5-10, 10-15 and 15-20% divergence classes are present in ~300,000, 480,000 and 210,000 copies, respectively.
Furthermore, due to their higher divergence values, the oldest Alus would also have been less able to recombine with their younger, more numerous relatives when the latter populated the genome.

Differences in recombination are likely also responsible for the fact that Alu elements are not over-represented on chromosome Y as are other younger retroelements such as Class I and II ERVs (International Human Genome Sequencing Consortium 2001) (Figure 2.2). This finding suggests that Alus are lost more readily than the LTR elements. However, loss of Alu elements on the Y appears delayed compared to the autosomes (Figure 2.6), likely because only intrachromosomal/IR recombination can operate on most of the Y. IR recombination seems to work more efficiently when two elements are closely located (Lobachev et al. 2000), and it is likely that this is also true for intrachromosomal recombination in general. Thus we postulate that LTR elements are removed less efficiently than Alu elements due to their much lower copy number and, therefore, larger average inter-element distance.

2.4 Concluding remarks

One view of transposable elements considers them to be selfish DNA of no use to the host (Doolittle and Sapienza 1980; Orgel and Crick 1980; Yoder et al. 1997), while others hypothesize that their fixation reflects functional interactions with the host (McDonald 1995; Brosius 1999). Our data support the idea that retroelements have a general negative impact on the host because of a gradual accumulation of most retroelement superfamilies in the AT-rich fraction and on the Y chromosome (which is predicted to occur according to the selfish DNA hypothesis) (Charlesworth et al. 1997). However, these findings also support a concept in which retroelements are gradually cleared from (or maintained in) the host genome, a relationship that seems dependent on the age of their association (Di Franco et al. 1997; Junakovic et al. 1998; Torti et al.
2000; Kidwell and Lisch 2001). The fact that densities of old MIR and L2 retroelements near genes are close to that predicted by average GC content suggests a relatively benign relationship between these retroelements and genes. In contrast, retroviral elements may have interfered more often with gene function due to initial integration site preference for gene-rich regions. The density pattern of the relatively young Class II ERVs (Figure 2.3J) supports this suggestion. Of those LTR elements which have been fixed in the population (i.e. almost all those in humans), our analyses have revealed that the highest densities of the older elements gradually shift with age to AT-rich or gene-poor DNA. Furthermore, we have shown that all types of LTR retroelements are significantly underrepresented within genes. Since LTRs carry transcriptional regulatory signals very similar to those in cellular genes (Majors 1990), it seems reasonable that insertion of an LTR close to or within a gene would frequently be disadvantageous unless it is efficiently silenced by methylation or other mechanisms (Yoder et al. 1997; Whitelaw and Martin 2001). Such insertions with a marked negative impact will be selected against, with no chance to spread to fixation. However, it is known that a mutation with a selective disadvantage can still be fixed through genetic drift, especially if the effective population size is small (Li and Graur 1991). It is possible that some LTR elements, despite being fixed in the species, had a slight negative impact and were gradually eliminated with time. Alternatively, mechanisms unrelated to selection, such as differential rates of recombination in different GC domains, may also explain the shifting density patterns of LTR retroelements.
The fact that the youngest Class II ERVs do not show the same density pattern shifts as seen for most of the LTR superfamilies could be because there has not been sufficient evolutionary time for their distribution to be shaped by selective forces and/or recombination. Once fixed in the population, it is not possible for an insertion to be eliminated unless insert-free alleles are re-created. While unequal crossing-over between homologous chromosomes may be the main mechanism responsible for elimination of retroelements in GC-rich regions, which have higher rates of recombination (Fullerton et al. 2001), intrachromosomal deletions and IR-mediated recombination might enhance this effect, especially in regions of high retroelement density. Such processes could regenerate insert-free alleles and again provide an opportunity for the original insertion to be lost from the population through natural selection or drift.

While these studies have attempted to address some of the potential mechanisms or forces that have shaped the genomic distributions of human retroelements, further studies are warranted to elucidate the complex evolutionary and functional relationships between these sequences and their host genome.

Chapter 3: Analysis of transposable elements in the human and mouse transcriptomes

A version of this chapter has been published: van de Lagemaat, L.N., J.R. Landry, D.L. Mager, and P. Medstrand. 2003. Transposable elements in mammals promote regulatory variation and diversification of genes with specialized functions. Trends Genet 19: 530-536. I performed all bioinformatic analyses, except that in Table 3.2, and wrote sections of the paper. J.R.L. created Table 3.1 and Figure 3.2. D.L.M. and P.M. are senior authors on the paper.
3.1 Introduction

TEs (primarily retroelements) comprise at least 45% of the human genome and 40% of the mouse genome, and ancient elements which have diverged beyond recognition have also undoubtedly contributed to the composition of mammalian chromosomes (Deininger and Batzer 2002; Mouse Genome Sequencing Consortium 2002). While the negative effects of TEs in causing mutations in individuals are well recognized (Ostertag and Kazazian 2001; Deininger and Batzer 2002; Mouse Genome Sequencing Consortium 2002), their major impact may be their ability to induce changes in gene regulation (Murnane and Morales 1995; Brosius 1999; Hamdi et al. 2000; Medstrand et al. 2001; Nigumann et al. 2002; Jordan et al. 2003; Kashkush et al. 2003) or coding potential (Murnane and Morales 1995; Nekrutenko and Li 2001) without destroying existing gene functions. The primary goal of this study was to test the hypothesis that TEs foster variation in some gene classes while being excluded from others.

3.2 Methods

3.2.1 Prevalence of TEs in human and mouse gene transcripts

Genomic coordinates of TEs and mRNAs contained in the RefSeq database were downloaded from the human June 2002 and mouse February 2003 Genome Browser at the University of California Santa Cruz (http://genome.ucsc.edu). Genes were defined by their mRNAs, which were required to have non-zero-length 5' and 3' UTRs as mapped to the sequence assemblies. We excluded 1998 human and 2557 mouse mRNAs contained in RefSeq from the analysis because they lacked a 5' UTR, a 3' UTR, or both. For each genome, we constructed MySQL databases containing the mapping data and wrote Perl scripts that conducted automated queries to determine genomic overlaps between TEs and UTR exons. Overlaps of greater than one bp were allowed but only 2% of detected TEs had an annotated overlap of <5 bp.
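The overlap queries described above were run with MySQL and Perl; as an illustration only, the same interval logic can be sketched in Python. The record layout, the 0-based half-open coordinate convention, the `min_bp` parameter and the example names are assumptions made for this sketch, not details taken from the original scripts.

```python
from collections import namedtuple

# Hypothetical record type; coordinates assumed 0-based, half-open [start, end).
Interval = namedtuple("Interval", "chrom start end name")

def overlap_bp(a, b):
    """Number of base pairs shared by two intervals (0 if none)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start))

def te_utr_overlaps(tes, utr_exons, min_bp=1):
    """List (exon name, TE name, overlap in bp) for every qualifying pair.

    A quadratic scan is enough for a sketch; a genome-scale run would
    sort or index the intervals first.
    """
    hits = []
    for exon in utr_exons:
        for te in tes:
            n = overlap_bp(te, exon)
            if n >= min_bp:
                hits.append((exon.name, te.name, n))
    return hits

# Hypothetical example: one UTR exon and three annotated repeats.
tes = [
    Interval("chr1", 100, 400, "AluSx"),
    Interval("chr1", 1000, 1300, "L2"),
    Interval("chr2", 100, 400, "MIR"),
]
utr_exons = [Interval("chr1", 350, 500, "exampleGene_5utr")]
print(te_utr_overlaps(tes, utr_exons))
```

Keeping the overlap length in each hit makes it straightforward to count how many detected TEs overlap a UTR by only a few base pairs.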
To eliminate false-positive TEs (which can occur in regions where the local GC content differs from the surroundings), genomic sequences of all putative repeats in UTRs were remasked using RepeatMasker (http://repeatmasker.genome.washington.edu) with the -s (sensitive) and -gccalc (expected GC content equal to that of the repeat itself) settings. Finally, all 490 human Alu repeat elements identified in the sense orientation at the 3' end of transcripts were moved to the 3' 'internal' UTR category because many appear to represent oligo-dT mispriming on the A-rich Alu terminus during cDNA synthesis.

3.2.2 Variation of TE prevalence with gene class or function

The Gene Ontology (GO) analysis of RefSeq transcripts was carried out as follows. The RefSeq database was downloaded from the online NCBI repository and each record was parsed to obtain the accession numbers of the nucleotide source records of each RefSeq transcript; these links were recorded in a database table. Further, the SwissProt database was downloaded and parsed to obtain the links between SwissProt identifiers and the accession numbers in the nucleotide database upon which each SwissProt record was based. We then also constructed a table of links between GO terms and SwissProt identifiers, parsed from a table downloaded from the GO online repository (http://www.geneontology.org). With these tables in a large MySQL database, a three-way table join allowed us to classify most RefSeq transcripts. We then also downloaded the database of GO terms and the links between them and used a MySQL database system to translate the assigned GO terms to more general ones within the molecular function and biological process classification trees. The conservation-based analyses of Ka/Ks were done using basic assumptions.
Alignments of all mouse and all human RefSeq transcripts were constructed by finding the best human-mouse hit using ungapped BLASTn (Altschul et al. 1990). The aligned results were parsed and analyzed in three reading frames. The optimal reading frame was chosen by minimization of the sum of stop codons and non-synonymous substitutions. The Ka/Ks ratio was calculated in this reading frame.

3.3 Results and Discussion

3.3.1 Prevalence of TEs in human and mouse gene transcripts

We first determined the overall prevalence of TEs in the UTRs of human and mouse genes in the RefSeq database (http://www.ncbi.nlm.nih.gov/RefSeq/). This analysis revealed that 27.4% of 12179 human RefSeq loci with annotated UTRs (referred to from now on as human genes) have at least one mRNA with TE-derived sequence within the 5' or 3' UTR. The percentages of genes with TEs in different UTR locations are shown in Figure 3.1a. These data are in general agreement with a recent survey of the Mammalian Gene Collection (http://mgc.nci.nih.gov/), which showed that close to 20% of human genes in that dataset contain TE sequences in a 5' or 3' UTR (Jordan et al. 2003). Analysis of the mouse RefSeq database in the same way revealed that 18.4% of 10064 mouse RefSeq loci (or mouse genes) contain at least one TE within their UTRs. The lower TE coverage in mouse genes is likely to be more apparent than real, due to an incomplete rodent repeat database and to the higher nucleotide substitution rate in the mouse lineage, which results in fewer detectable ancient TEs in mouse compared to human (Mouse Genome Sequencing Consortium 2002).
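The reading-frame selection step described in the Methods (Section 3.2.2) can be sketched as follows. This is a simplified illustration, not the original implementation: it assumes the ungapped alignment is supplied as two equal-length strings, and it counts codon pairs that encode different amino acids as a stand-in for non-synonymous substitutions. All function and variable names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def frame_score(seq_a, seq_b, frame):
    """Stop codons plus amino-acid-changing codon pairs in one frame."""
    stops = changes = 0
    n = min(len(seq_a), len(seq_b))
    for i in range(frame, n - 2, 3):
        aa_a = CODON_TABLE.get(seq_a[i:i + 3])
        aa_b = CODON_TABLE.get(seq_b[i:i + 3])
        if aa_a is None or aa_b is None:
            continue  # skip codons containing ambiguous bases
        stops += (aa_a == "*") + (aa_b == "*")
        changes += aa_a != aa_b
    return stops + changes

def best_frame(seq_a, seq_b):
    """Frame (0, 1 or 2) with the smallest score; ties go to the lowest frame."""
    return min(range(3), key=lambda f: frame_score(seq_a, seq_b, f))

# Hypothetical aligned fragments differing only at synonymous sites in frame 0.
human = "ATGGCTAAAGGT"
mouse = "ATGGCCAAAGGA"
print(best_frame(human, mouse))
```

The idea is that a true coding frame tends to be free of stop codons and to show mostly silent differences between orthologs, so it minimizes the score.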
Figure 3.1 TEs in genes by species and orientation. (a) Fraction of human and mouse genes with TEs in UTRs. The fraction of genes with one or more RefSeq mRNAs having at least one TE extending across the 5' or 3' end of the transcript ('terminus'), or totally within ('internal') the 5' or 3' UTR is shown. (b) Transcriptional orientation of LTR elements in introns compared to those spanning mRNA termini. The left scale is the absolute number of LTR elements within introns of genes and the right-hand scale shows numbers overlapping mRNA termini. Note that over 99% of TEs within genes are in introns. The orientation bias of all four categories shown is significant (p<0.01).

3.3.2 TEs serve as alternative promoters of many genes

A search of the Human Promoter Database (http://zlab.bu.edu/~mfrith/HPD.html) has previously shown that close to 25% of analyzed promoter regions contain some TE-derived sequence (Jordan et al. 2003), and several individual cases showing a role for TEs in human gene transcription have been reported (Murnane and Morales 1995; Brosius 1999; Hamdi et al. 2000; Medstrand et al. 2001; Nekrutenko and Li 2001; Nigumann et al. 2002). Many of these cases were detected by our method, and we also found numerous new examples of apparent usage of a TE-derived promoter where TE involvement has not been reported previously (see Table 3.1 for a partial list). Genes are candidates for having a TE-derived promoter if the 5' end of their 5' UTR resides in a TE sequence, and those in Table 3.1 were examined in more detail.
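The candidate criterion stated above, that the 5' end of the 5' UTR lies inside a TE, can be expressed as a short strand-aware check. The sketch below is illustrative only: the record types, the 0-based half-open coordinates and the example names are assumptions, and the orientation label is added for convenience.

```python
from collections import namedtuple

# Hypothetical record types; coordinates assumed 0-based, half-open.
Transcript = namedtuple("Transcript", "name chrom start end strand")
Repeat = namedtuple("Repeat", "name chrom start end strand")

def five_prime_base(tx):
    """Genomic position of the first transcribed base of a transcript."""
    return tx.start if tx.strand == "+" else tx.end - 1

def promoter_candidates(transcripts, repeats):
    """Return (transcript, TE, orientation) where the 5' end lies in the TE."""
    found = []
    for tx in transcripts:
        pos = five_prime_base(tx)
        for te in repeats:
            if te.chrom == tx.chrom and te.start <= pos < te.end:
                orientation = "sense" if te.strand == tx.strand else "antisense"
                found.append((tx.name, te.name, orientation))
    return found

# Hypothetical example data.
transcripts = [
    Transcript("geneA", "chr1", 1000, 5000, "+"),
    Transcript("geneB", "chr1", 2000, 8000, "-"),
    Transcript("geneC", "chr2", 1000, 5000, "+"),
]
repeats = [
    Repeat("LTR_example", "chr1", 900, 1100, "+"),
    Repeat("L1_example", "chr1", 7900, 8200, "+"),
]
print(promoter_candidates(transcripts, repeats))
```

For minus-strand genes the 5' end is the highest genomic coordinate, which is why the check cannot simply use the transcript start column.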
This analysis illustrated several ways in which TE-derived promoters might contribute to gene expression, including examples of a) different expression patterns in human and mouse orthologs that correlate with the TE insertion (CYP19, TMPRSS3, HYAL4, ENTPD1, CASPR4, MKKS); b) the same TE insertion correlating with a tissue-specific promoter in both species (CA1, SPAM1, KLK11); c) presence of the TE in both species but apparent usage as a promoter in human only (MSLN); d) presence of the TE only in human but similar overall expression patterns as in the mouse (BAAT, SIAT1, CLDN14, MAD1L1); and e) a human multigene family where the member with a TE has a different expression pattern compared to other family members (FUT5, ILT2).

Table 3.1 RefSeq transcripts beginning within a previously unrecognized TE

Gene | Full name | Functional role (disease) | Probable TE involvement in human | TE in mouse (a) | Human expression (b) | Mouse expression (c)
CYP19 | Aromatase | Estrogen synth (repro abnormal) | LTR one of at least 6 promoters | No | LTR drives very high placental expression | No placental exp.
TMPRSS3 | Transmembrane protease, serine 3 | Serine protease (deafness) | LTR/Alu as alternate promoter | No | TE form exp. primarily in PBLs. Other forms widespread | Inner ear, kidney, stomach, testis
HYAL4 | Hyaluronidase 4 | Hyaluronan catabolism | Antisense L1/Alu as only known promoter | No | Primarily placenta | Primarily skin
ENTPD1/CD39 | Ectonucleoside triphosphate diphosphohydrolase 1 | Lymphoid cell activation antigen | LTR as 1 of 2 promoters & results in HERV-derived N-terminus | No | LTR drives exp. in placenta & melanoma. Overall expression widespread | Widespread
CASPR4 | Contactin associated protein-like 4 | Brain cell adhesion | LTR is one of 3 promoters & donates protein N-terminus | No | LTR form exp. in brain, testis, tumors. High exp. in brain & sp. cord | Brain
MKKS | McKusick-Kaufman syndrome gene | Chaperonin (McKusick-Kaufman) | LTR/L2 as alternate promoter | No | TE form in testis and fetal tissues. Overall expr. widespread | Widespread
CA1 | Carbonic anhydrase 1 | Carbon metabolism | LTR one of 2 major promoters | Yes | LTR drives erythroid exp. | LTR drives erythroid exp.
SPAM1/PH20 | Sperm adhesion molecule 1 | Sperm-egg adhesion | Antisense ERV as only known promoter | Yes | Primarily testis | Primarily testis
KLK11 | Kallikrein 11 | Serine protease | MIR one of 3 promoters & leads to alt. N-terminus | Yes | Widespread | Brain & prostate
MSLN/MPF | Mesothelin | Megakaryocyte potentiating factor | LTR as 1 of 2 promoters & part of 5' UTR of other transcript form; alt. promoter is an MIR | Yes for both | Widespread | Widespread but not from LTR or MIR
BAAT | Bile acid CoA | Bile metab. (hypercholanemia) | LTR only known promoter | No | Liver | Liver
MAD1L1 | Mitotic arrest-deficient 1 like 1 | Cell cycle regulation (cancer) | LTR as 1 of 2 promoters & part of 5' UTR of other form | No | LTR form in tumors. Other form widespread | Widespread
CLDN14 | Claudin-14 | Tight junct component (deafness) | LTR as 1 of 2 promoters | No | LTR form: melanoma/skin and kidney, other form in liver | Widespread
SIAT1 | Sialyltransferase 1 | Humoral immunity | ERV one of at least 3 promoters | No | ERV form in mature B cells. Other forms in various tissues | B-cell & liver exp. from multipromoters
FUT5 | Fucosyltransferase-5 | Cell adhesion | Antisense Alu/L1 as only known promoter | N/A | Colon, liver. Much lower exp. compared to related FUTs | No mouse ortholog
ILT2/LIR1/LILRB1 | Ig-like transcript-2 | Immune inhibitory receptor | ERV as alternate promoter | N/A | Only ILT known to be expressed in natural killer cells | ILTs expanded after human-mouse split

(a) Presence of TE in mouse was determined by Genome Browser annotation, BLAST and dotplot alignments. (b) Information from literature, where available, or expression databases. Expression pattern of the TE-initiated form is given if known. (c) Information from literature or databases with attention to patterns that differ from human.

One of the most striking examples of a TE insertion involved in new tissue-specific expression is the CYP19 gene (Table 3.1 and Figure 3.2a). CYP19 encodes aromatase P450, the key enzyme in estrogen biosynthesis, and is expressed only in the gonads and brain of most mammals, but the primate gene is also expressed at high levels in the syncytiotrophoblast layer of the placenta (Kamat et al. 2002). Placental-specific transcription of CYP19 is driven by a well-characterized alternative promoter located ~100 kb upstream of the coding region (Kamat et al. 2002), and our analysis has revealed that this promoter is actually an endogenous long terminal repeat (LTR), a fact that has escaped previous notice (Figure 3.2a). Using genomic PCR, we found this LTR present in Old World monkeys and in one New World monkey (marmoset) (data not shown). Therefore, this insertion early during primate evolution appears to have provided a placental-specific promoter that assumed an important role in transcription of CYP19 and, consequently, in controlling estrogen levels during pregnancy.
Figure 3.2 Examples of genes with apparent TE-derived promoters: (a) CYP19, (b) CA1, (c) MSLN and (d) BAAT, each shown for human and mouse. Exon sequences derived from TEs are depicted by boxes shaded in a similar color to the element they are derived from. SINE elements are shown in green and retrovirus-like (ERV) sequences are in blue, where the thick arrows represent LTRs. The specific type of element is indicated below. Protein-coding exons are represented by black boxes and non-TE-derived UTRs are indicated by white boxes. Splicing of the alternative first exons is represented by broken lines. For the MSLN gene (part c), transcripts with the 1a exon either splice or read through to exon 1b. Alternative transcription start sites are illustrated by black vertical arrows and tissue-specific expression patterns are indicated above each promoter. Gene expression patterns were deduced from the literature and/or database sources. See Supplementary Table of van de Lagemaat et al. (2003). The figure is not drawn to scale.

Many other instances of putative TE-derived promoters or polyadenylation signals are worthy of mention. The high levels of carbonic anhydrase in human and mouse red blood cells (Brady et al. 1989) appear to be due to an LTR-derived promoter of the CA1 gene (Figure 3.2b). SPAM1 and HYAL4, closely linked members of the same hyaluronidase gene family (Csoka et al. 2001), have different putative TE promoters giving rise to different expression patterns.
For SPAM1, the ERV element is mostly deleted in mouse compared to human but the promoter region is retained, suggesting functional conservation of this segment. The human MSLN (or MPF) gene (Urwin and Lake 2000) appears to have two promoters, both of which are TE-derived. Both TE insertions are present in the mouse and rat genomes but we found no transcripts in the databases initiating from either TE in rodents (Figure 3.2c). The only apparent promoter of the liver-specific BAAT gene, which has recently been implicated in familial hypercholanemia (Carlton et al. 2003), is an ancient LTR in human but not in mouse (Figure 3.2d). FUT5 and ILT2 belong to gene families that amplified after the mouse-human divergence (Cameron et al. 1995; Martin et al. 2002). As shown in Table 3.1, these genes have putative TE-derived promoters and have acquired an expression pattern distinct from other family members, although it is not known if the TE is the cause of the differential expression. We also found many examples of TEs serving as polyadenylation sites. Disease-associated genes with primary transcripts terminating in a TE include the F8 (factor 8) gene, which is polyadenylated in an LTR, and the ING1 (or p33ING1) tumor suppressor gene, which ends in a DNA TE.

We (Chapter 2, Medstrand et al. 2002) and others (Smit 1999) have observed that some classes of TEs found within introns of genes are more likely to be oriented in the antisense transcriptional direction. This is particularly true for LTR elements and L1 sequences and is thought to reflect the fact that regulatory motifs such as polyadenylation signals within these elements are more likely to be detrimental by, for example, leading to truncated proteins (and thus less likely to be fixed) if oriented in the same direction as the gene.
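An orientation bias of this kind can be checked with a simple count-based test. The text does not name the test behind the reported significance levels, so the exact two-sided binomial test below (null hypothesis: sense and antisense insertions are equally likely) is an assumption, and the counts in the example are hypothetical.

```python
from math import comb

def orientation_bias_p(sense, antisense):
    """Exact two-sided binomial p-value against a 50:50 sense/antisense split."""
    n = sense + antisense
    tail = min(sense, antisense)
    # Under p = 0.5 the distribution is symmetric, so the two-sided
    # p-value is twice the smaller tail, capped at 1.
    p = 2 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, p)

# Hypothetical counts: 2 sense versus 8 antisense elements.
print(orientation_bias_p(2, 8))
```

With only ten elements, a 2:8 split is not significant at p<0.01, which illustrates why element counts matter when comparing introns with transcript termini.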
This study revealed that, in contrast to their intronic antisense orientation bias, LTR elements located at the 5' and 3' termini of both human and mouse UTRs are significantly more likely to be oriented in the same transcriptional direction as the gene transcript (Figure 3.1b). This observation supports the concept that LTR elements at transcript termini are not merely tolerated by the gene but actually participate in transcript formation by providing promoters and polyadenylation signals which function only in the sense direction.

3.3.3 TE prevalence varies with gene class or function

To ascertain if certain types of genes were more or less likely to have TE-containing mature transcripts, we used several methods to classify genes. First we used the Gene Ontology database (http://www.geneontology.org/) to classify genes according to their biological process or molecular function and determined the fraction of human and mouse genes containing TEs in transcripts. A remarkably similar pattern in the two species emerged from this analysis. For several classes of genes, the fraction of TE-containing transcripts was significantly less than the overall average (Figure 3.3a, b). Members of these categories are involved in basic housekeeping functions and many are evolutionarily conserved with few identified paralogs (International Human Genome Sequencing Consortium 2001; Venter et al. 2001). In contrast, genes involved in functions such as defense, stress response and response to external stimuli were more likely to have TEs than most of the other gene classes (Figure 3.3a, b).

[Figure 3.3: four panels: (a) Biological process, (b) Molecular function, (c) Protein conservation, (d) Species conservation; series: Human and Mouse.]

Figure 3.3 Prevalence of TEs in mRNAs of various gene classes.
Symbols show the observed fraction and vertical bars represent the expected fraction of genes containing a TE sequence in the UTR. This expected fraction was determined by assuming a random distribution of TEs in UTRs of all genes and by considering the number of genes belonging to each class. The expected range or length of the bar is shown for p<0.01. (a) Gene classification by 'biological process' using the Gene Ontology (GO) database (http://www.geneontology.org). (b) Gene classification by GO 'molecular function'. For parts a and b, the scale for human genes (blue symbols) is indicated on the left and the mouse scale (orange symbols) is shown on the right of each panel. The 'other' category includes genes of unknown function as well as those of other functional groups. (c) Human and mouse genes separated into those having a Ka/Ks ratio less than or greater than 0.115, the median ratio reported previously for all known human-mouse orthologues (Mouse Genome Sequencing Consortium 2002). Ka/Ks values were calculated for 7296 mouse-human gene pairs in our RefSeq dataset. (d) Genes grouped using the KOG database (euKaryotic Clusters of Orthologous Groups; http://www.ncbi.nlm.nih.gov/COG/new/shokog.cgi). Human: genes found only in human (as the sole mammalian representative in the KOG database) plus gene groups conserved in other eukaryotes but expanded to 10 or more members in human. Animal: genes conserved in C. elegans, D. melanogaster, and H. sapiens. Eukaryote: genes conserved in animal, Arabidopsis thaliana (mustard weed), Saccharomyces cerevisiae (budding yeast), Schizosaccharomyces pombe (fission yeast), and Encephalitozoon cuniculi (Microsporidia).

We next classified genes using the InterPro (IPR) database of functional protein domains (http://www.ebi.ac.uk/interpro/) and again found good agreement between human and mouse. TEs were either significantly enriched or reduced in genes containing 33 of the 80 most abundant IPR domains.
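The observed-versus-expected comparison behind Figure 3.3 and Table 3.2 can be illustrated with a small calculation. The sketch below is not the original analysis code; it assumes a one-degree-of-freedom goodness-of-fit chi-squared test, and the function name `chi_squared_1df` and the overall fraction of 0.25 are assumptions made for illustration (0.25 was chosen because it reproduces the expected count of 29 listed for the human KRAB box row of Table 3.2a).

```python
import math


def chi_squared_1df(observed_with_te, observed_without_te, expected_fraction):
    """Goodness-of-fit chi-squared (1 d.f.) for one gene class.

    expected_fraction is the share of all genes with a TE in the UTR,
    i.e. the value expected if TEs were randomly distributed among genes.
    """
    total = observed_with_te + observed_without_te
    expected_with = total * expected_fraction
    expected_without = total - expected_with
    chi2 = ((observed_with_te - expected_with) ** 2 / expected_with
            + (observed_without_te - expected_without) ** 2 / expected_without)
    # Upper-tail probability of a chi-squared variable with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return expected_with, chi2, p_value


# Human KRAB box genes (Table 3.2a): 79 with a TE in the UTR, 36 without.
# The 0.25 overall fraction is an assumption, not a value stated in the table.
expected, chi2, p = chi_squared_1df(79, 36, 0.25)
print(f"expected={expected:.1f}, chi2={chi2:.1f}, significant at 0.01: {p < 0.01}")
```

A class is flagged when the p-value falls below 0.01, which for one degree of freedom corresponds to a chi-squared statistic above roughly 6.63.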
Table 3.2 shows a list of those IPR domains with at least 20 genes in the human RefSeq database and where the observed fraction of genes with TEs in UTRs differs from the expected fraction, based on a random distribution of TEs (p<0.01). We found that transcripts of genes encoding Ig/MHC, C-type lectin or some cytokine domains were significantly enriched for TEs, as were genes with KRAB/Zn-finger transcription factor domains. In contrast, genes important in development, transcription and replication and those with some enzymatic domains were much less likely to include TEs in their mRNAs (Table 3.2b).

Table 3.2 Domains associated with TE enrichment or exclusion in mRNAs

(a) Domains associated with TE enrichment

InterPro ID  Human    Human       Mouse    Mouse      Percentage of total genes  Name
             TE-free  TE-genes    TE-free  TE-genes   Human  Mouse  Fly
IPR001909    36       79 (29)**   15       13 (4)**   1.7    0.7    0            KRAB box
IPR007087    202      149 (88)**  93       27 (18)**  5.0    3.0    3.6          Zn-finger, C2H2 type
IPR003006    157      102 (65)**  92       30 (18)**  3.2    3.2    1.7          Immunoglob./major histocompatibility complex
IPR000276    80       60 (35)**   75       21 (14)**  4.6    3.9    1.1          Rhodopsin-like GPCR superfamily
IPR000315    26       23 (12)**   6        7 (2)** !  0.5    0.3    0.1          Zn-finger, B-box
IPR001304    34       27 (15)**   33       15 (7)**   0.5    0.8    0.5          C-type lectin
IPR003877    32       24 (14)**   6        6 (2)** !  0.5    0.3    0.1          SPla/RYanodine receptor SPRY
IPR002996    10       12 (6)**    12       5 (3) !    0.2    0.3    0            Cytokine receptor, common beta/gamma chain

(b) Domains associated with TE exclusion

InterPro ID  Human    Human       Mouse    Mouse      Comparative domain coverage  Name
             TE-free  TE-genes    TE-free  TE-genes   Human  Mouse  Fly
IPR000504    143      23 (42)**   55       10 (10)    1.6    1.6    1.7            RNA-binding region RNP-1 (RNA recognition motif)
IPR001356    93       13 (27)**   84       6 (13)**   1.4    1.8    1.2            Homeobox
IPR004046    22       0 (6)**     12       1 (2) !    0.2    0.3    0.4            Glutathione S-transferase, C terminal
IPR000629    23       0 (6)**     9        0 (1) !    0.2    0.2    0.3            ATP-dependent helicase, DEAD-box
IPR000387    53       6 (15)**    28       5 (5)      0.6    0.7    0.4            Tyrosine specific protein phosphatase and dual specificity protein phosphatase
IPR000225    28       1 (7)**     7        0 (1) !    0.2    0.3    0.2            Armadillo repeat

Notes to Table 3.2:
Domains in bold are under reduced purifying selection or increased diversifying selection in mammals (Mouse Genome Sequencing Consortium 2002).
TE-free: observed number of genes without a TE in the UTR.
TE-genes: observed number of genes with a TE in the UTR and, in parentheses, expected number of genes with a TE assuming a random distribution of TEs in UTRs of all genes.
** indicates cases where the observed number differs from the expected with p < 0.01 (chi-squared).
! indicates fewer than 20 genes in RefSeq.
Percentage of total genes / comparative domain coverage: percentage of total genes with the domain in the genomes of the species listed, using the Ensembl gene classification (http://www.ensembl.org/). Significant domain expansions (p<0.01 for chi-squared considering the number of genes in each species) for human vs. fly and mouse vs. fly are indicated in bold.

3.3.4 TEs are more prevalent in mRNAs of rapidly evolving and mammalian-specific genes

TE prevalence in UTRs was next determined in genes separated according to their sequence conservation, measured by their Ka/Ks value, which is the ratio of the rate of nonsynonymous to synonymous change in coding sequences (Hurst 2002).
A median Ka/Ks value of 0.115 for mouse-human orthologous gene pairs was recently determined (Mouse Genome Sequencing Consortium 2002) and, relative to this median, we found that genes with low Ka/Ks values are significantly less likely to have TEs in UTRs compared to those with values above the median (Figure 3.3c). These results indicate that genes with rapidly evolving coding sequences are, in general, more likely to have TEs in their UTRs. Earlier analysis showed that at least eight functional domains are under increased positive diversifying selection or reduced purifying selection based on their high Ka/Ks ratio of >0.15 (Mouse Genome Sequencing Consortium 2002). Three of these domains are also significantly associated with genes with TE overrepresentation in their UTRs and these are bolded in Table 3.2a. Furthermore, it is noteworthy that 6 out of 8 domains associated with enrichment of TEs in genes (Table 3.2a) are represented at significantly higher numbers in mammalian genomes compared to the fruit fly (Drosophila melanogaster), whereas four of the six domains associated with a reduced TE content in UTRs are equally represented in human and mouse vs. fly. Taken together, these data suggest that TEs are preferentially found in mRNAs containing rapidly diversifying domains, many of which have expanded during vertebrate evolution.

Finally, we examined TE prevalence in genes divided into three categories (eukaryotic, animal and human) based on presence of orthologous proteins in different species using the euKaryotic Clusters of Orthologous Groups (KOG) database (http://www.ncbi.nlm.nih.gov/COG/new/shokog.cgi). This analysis showed that mammalian (human)-specific gene mRNAs are significantly enriched in TEs compared to transcripts of genes with orthologs in other animals and/or all eukaryotes.
This enrichment was most apparent when we expanded our definition of mammalian-specific genes to include all genes with an ancient origin but which have expanded in humans (and likely other mammals) (Figure 3.3d). Transcripts of 'old' gene classes that are not expanded in mammals have a low prevalence of TEs, while genes specific to mammals or those associated with mammalian expansions are significantly more likely to harbor TE sequences in their mRNAs.

We considered two simple reasons for the above patterns. First, we addressed the possibility that TE prevalence in mRNAs is fully or partially dependent on genomic features rather than on gene function. For example, TEs are rarely found in UTRs of genes with homeobox domains (Table 3.2b), an expected observation given that the genomic regions encompassing the human and mouse homeobox gene clusters are nearly devoid of TEs (International Human Genome Sequencing Consortium 2001; Mouse Genome Sequencing Consortium 2002). We grouped genes by functional class and found that genomic parameters, such as number of TEs available, TE density, gene size, number of exons, length of introns, and local GC content, did not differ significantly between the functional groups. Correlation analysis confirmed this observation, demonstrating that a gene's functional category and its genomic surroundings independently influence the number of TEs within UTRs (data not shown). Second, we considered the possibility that transcripts enriched in TEs are more likely to be derived from non-functional but expressed pseudogenes. To address this possibility, we separated RefSeq loci into two categories: 'reviewed' and 'provisional' loci, which represent relatively well documented genes, and 'predicted' loci, for which the support is less strong (http://www.ncbi.nlm.nih.gov/RefSeq/). We obtained similar TE frequencies with these two gene sets.
Thus, neither genomic features nor the presence of pseudogenes fully account for the observed patterns shown in Figure 3.3 and Table 3.2. Rather, the data suggest that gene function and conservation act as independent variables in determining TE prevalence in mRNAs.

3.4 Conclusions

A growing appreciation for the role of TEs in genome evolution and gene regulation is evident from a number of recent studies (Murnane and Morales 1995; Sverdlov 1998; Brosius 1999; Hamdi et al. 2000; Makalowski 2000; Kidwell and Lisch 2001; Medstrand et al. 2001; Nekrutenko and Li 2001; Ostertag and Kazazian 2001; Deininger and Batzer 2002; Mouse Genome Sequencing Consortium 2002; Nigumann et al. 2002; Jordan et al. 2003; Kashkush et al. 2003). Here we have shown that highly conserved genes, such as those with essential functions in metabolism, development or cell structure, have a low prevalence of TEs in their mRNAs. This finding suggests, as might be predicted, that changes in expression of fundamental genes due to TE insertions cannot be tolerated by the host and are strongly selected against. In contrast, younger or mammalian-specific genes, such as those involved in immunity and those that have expanded during mammalian evolution, are enriched for TEs in their mRNAs. It is possible that, due to functional redundancy, such genes are initially more tolerant of TE insertions, some of which may then evolve a role in gene expression. These results suggest that TEs have had a major impact on the rapid evolution and functional diversification of gene families in humans and other mammals.

Chapter 4: Analysis of genic distributions of endogenous retroviral long terminal repeat families in humans

A version of this chapter has been submitted for publication: van de Lagemaat, L.N., P. Medstrand, and D.L. Mager. 2006. Insertion patterns of endogenous retroviruses and SVA elements: Insights into their initial effects on genes.
I planned and performed all analyses in this paper and wrote the paper. P.M. was involved in planning this research and writing the paper. D.L.M. helped plan research and is the senior author.

4.1 Introduction

Transposable elements (TEs), including endogenous retroviruses (ERVs), have profoundly affected eukaryotic genomes (Kidwell and Lisch 1997; Deininger and Batzer 2002; Kazazian 2004). Similar to exogenous retroviruses, ERV insertions can disrupt gene expression by causing aberrant splicing, premature polyadenylation, and oncogene activation resulting in pathogenesis (Boeke and Stoye 1997; Rosenberg and Jolicoeur 1997; Maksakova et al. 2006). While ERV activity in modern humans has apparently ceased, about 10% of characterized mouse mutations are due to ERV insertions (Maksakova et al. 2006). In rare cases, elements that become fixed in a population can provide enhancers (Ting et al. 1992), repressors (Carcedo et al. 2001), alternative promoters (Di Cristofano et al. 1995; Medstrand et al. 2001; Dunn et al. 2003; Jordan et al. 2003; van de Lagemaat et al. 2003, Chapter 3; Bannert and Kurth 2004; Leib-Mosch et al. 2005) and polyadenylation signals (Mager et al. 1999; Baust et al. 2000) to cellular genes due to transcriptional signals in their long terminal repeats (LTRs).

It has been previously shown that LTRs/ERVs fixed in gene introns are preferentially oriented antisense to the gene's transcriptional direction (Smit 1999; Medstrand et al. 2002; Cutter et al. 2005). In contrast, studies on initial insertion patterns of exogenous retroviruses or retroviral vectors in vitro have not found any such bias for these unselected insertions (Schroder et al. 2002; Barr et al. 2005).
Therefore, the antisense bias exhibited by fixed ERVs/LTRs in genes strongly suggests that retroviral elements found in the same transcriptional orientation within a gene are much more likely to have a negative effect and be eliminated from the population by selection. In this study, we closely examined genic distribution patterns of individual ERV families in the human genome and found significant differences. These differences provide clues to the original activity profile of each element type, helping to explain how biases in patterns of insertion arose.

4.2 Methods

4.2.1 Directional bias of insertions in transcribed regions in mice

We assessed the transcriptional orientation of Early Transposon (ETn) LTR retroelements in mouse transcribed regions. Retroelement and gene annotation from the April 2004 UCSC Mouse Genome Browser (Karolchik et al. 2003) was used to assess insertion frequency and orientation of insertions within the longest RefSeq transcribed regions of mouse genes. ETn LTR elements were represented by the RLTRETN family of ETn/MusD LTRs, and pairs of elements within 10 kb of each other and in the same orientation were assumed to belong to the same original insertion. The antisense bias observed in the genic ETn LTR population was then compared to genic orientation bias in a data set of documented mutagenic ETn/MusD LTR insertions coming from earlier studies (Baust et al. 2002; Mouse Genome Sequencing Consortium 2002; Maksakova et al. 2006).

4.2.2 Directional bias of retroelements in the human genome

RepeatMasker annotations of solitary LTRs and LTRs plus internal sequences of endogenous retroviral elements from the July 2003 UCSC Human Genome Browser were compiled and compared to annotated transcribed region start and end points of the per-chromosome longest transcript of each RefSeq gene, defined by its HUGO gene name.
Annotations for internal elements were matched with their respective LTRs as follows: HERVE or Harlequin internal sequences were matched with LTR2, LTR2B, or LTR2C; HERVK (HML-2) with LTR5, LTR5_Hs, LTR5A, or LTR5B; HERV17 with LTR17 (where HERV17 represents HERV-W); HERVH with LTR7, LTR7A, or LTR7B; HERV9 with LTR12, LTR12B, LTR12C, LTR12D, LTR12E, or LTR12_; MLT2x with ERVL-x (where x is a unique identifier); MST* with MST*-int (where * represents a wildcard); THE1* with THE1*-int; and MLT1* with MLT1*-int. Groups of LTR element segments of the same type, with internal sequence all in the same orientation, and occurring within 10 kb were deemed part of the same composite element. Manual checks confirmed the validity of this criterion. Names of consensus elements occurring in each composite element were recorded, as well as names of the ERV type. Composite elements without internal sequence were deemed LTR-only, and elements with contributions from at least two consensus elements were deemed to contain LTR and internal sequence and therefore were considered full-length. Again, manual checking confirmed the validity of this criterion.

The assigned genomic position of each LTR or full-length element was computed as the average of the beginning and end coordinates of each composite element and compared against the positions of longest transcribed regions of each gene. Each transcribed region was divided into ten equal bins and the TE location within a gene was specified by which of these bins it fell into. In addition to the ten intragenic bins, two bins upstream and two bins downstream of the gene, of the same size as the intragenic bins, were also considered. Counts of elements for each orientation were computed for each bin. A similar approach was used for computing SVA and Alu distributions across genes as for LTR elements.
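The merging and binning steps described above can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the code used in the thesis: the function names (`merge_fragments`, `assign_bin`, `classify`) and the tuple layout are hypothetical, "within 10 kb" is interpreted as a gap of at most 10,000 bp between the end of one fragment and the start of the next, and fragments are assumed to have already been labeled with their ERV type using the LTR/internal-sequence pairings listed in the text.

```python
import math


def merge_fragments(fragments, max_gap=10_000):
    """Merge same-type, same-strand fragments separated by at most max_gap bp.

    fragments: iterable of (chrom, start, end, strand, erv_type) tuples.
    Returns composite elements in the same tuple layout.
    """
    groups = {}
    for chrom, start, end, strand, erv_type in fragments:
        groups.setdefault((chrom, strand, erv_type), []).append((start, end))

    composites = []
    for (chrom, strand, erv_type), spans in groups.items():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start - cur_end <= max_gap:   # close enough: extend the composite
                cur_end = max(cur_end, end)
            else:                            # too far: close it and start a new one
                composites.append((chrom, cur_start, cur_end, strand, erv_type))
                cur_start, cur_end = start, end
        composites.append((chrom, cur_start, cur_end, strand, erv_type))
    return composites


def assign_bin(position, gene_start, gene_end, gene_strand, n_bins=10, flank_bins=2):
    """Label the bin containing position: '0'..'9' inside the gene,
    '-2'/'-1' upstream, '+1'/'+2' downstream, or None if farther away.
    Assumes gene_end > gene_start."""
    bin_size = (gene_end - gene_start) / n_bins
    # Distance from the transcription start, measured along the direction of transcription.
    offset = position - gene_start if gene_strand == "+" else gene_end - position
    index = math.floor(offset / bin_size)
    if index < -flank_bins or index >= n_bins + flank_bins:
        return None
    if index >= n_bins:
        return f"+{index - n_bins + 1}"
    return str(index)


def classify(element, gene):
    """Return (bin label, 'sense' or 'antisense') for one composite element.

    element: (chrom, start, end, strand, erv_type); gene: (start, end, strand).
    """
    _, start, end, strand, _ = element
    gene_start, gene_end, gene_strand = gene
    midpoint = (start + end) / 2             # average of begin and end coordinates
    orientation = "sense" if strand == gene_strand else "antisense"
    return assign_bin(midpoint, gene_start, gene_end, gene_strand), orientation
```

Tallying the pairs returned by `classify` over all composite elements and genes (for example with `collections.Counter`) gives per-bin counts for each orientation. Flanking bins are made the same size as the intragenic bins of the gene in question, so their absolute length varies from gene to gene, as in the description above.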
4.3 Results and Discussion

4.3.1 Opposite orientation bias of fixed versus mutation-causing retroviral insertions

As mentioned above, in vitro studies of de novo retroviral insertions within gene introns have not detected any bias in proviral orientation with respect to the transcribed direction of the gene (Schroder et al. 2002; Barr et al. 2005). The fact that integrations that have not yet been tested for deleterious effect during organismal development show no directional bias indicates that the retroviral integration machinery itself does not distinguish between DNA strands in transcribed regions. Presumably, then, any orientation biases observed for endogenous retroviral elements must reflect the forces of selection. In support of this premise is a recent study by Bushman's group that was the first to directly compare genomic insertion patterns of exogenous avian leukosis virus (ALV) after infection in vitro with patterns of fixed endogenous elements of the same family (Barr et al. 2005). Endogenous elements in transcriptional units were four times more likely to be found antisense to the transcriptional direction, suggesting strong selection against ALV in the sense direction.

We reasoned that, if the marked orientation biases of LTRs/ERVs were a reflection of detrimental impact by sense-oriented insertions, then we would also expect a dominant sense orientation among insertions with known detrimental effects. While no mutagenic or disease-causing ERV insertions are known in humans, significant numbers have been studied in the mouse. We analyzed element orientation of fixed, non-pathogenic insertions of the still active ETn/MusD family of ERVs in the mouse (Figure 4.1), and found similar degrees of antisense bias as seen for human and chicken ERVs (data not shown).
However, as expected, 15/18 new mouse germ-line ETn/MusD insertions in transcribed regions that are associated with mutations are in the sense orientation (Maksakova et al. 2006) (Figure 4.1). Moreover, in most of these cases, the predominant effect of the ERV was to cause premature polyadenylation of the gene through use of LTR polyA signals, accompanied by aberrant splicing (Maksakova et al. 2006).

[Figure 4.1: sense and antisense insertions for the 'Mouse-all' and 'Mutagenic' element sets.]

Figure 4.1 Directional bias of retroelements in mouse transcribed regions. ETn elements were those annotated as RLTRETN in the UCSC May 2004 mouse genome repeat annotation. The mutagenic population of ETn elements was reported in earlier reviews (Baust et al. 2002; Mouse Genome Sequencing Consortium 2002; Maksakova et al. 2006). Expected variability in the data was calculated from Poisson statistics, which describe randomized gene resampling.

4.3.2 Variation in density of genic insertions of different HERV families

ERVs/LTR elements in the human genome actually comprise hundreds of distinct families of different ages and structures, many of which remain poorly characterized (Gifford and Tristem 2003; Mager and Medstrand 2003). Thus, grouping such heterogeneous sequences together, as has been done for previous studies on orientation bias (Smit 1999; Medstrand et al. 2002), may well mask variable genomic effects of distinct families. To investigate genic insertion patterns of different human ERV families, we chose nine well studied Repbase-annotated (Jurka 2000) families or groups of related families to analyze in more detail. These families, their copy numbers and their approximate evolutionary ages are listed in Table 4.1.

Table 4.1 Annotated copy numbers and evolutionary ages of various ERV families

Name            Copy number(a)  Full length(b)  Evolutionary age (Myr)  Reference
MLT1            160,000         36,000          >100                    (Smit 1993)
MST             34,000          5175            75                      (Smit 1993)
THE1            37,000          9019            55                      (Smit 1993)
HERV-L          25,000          4777            >80                     (Cordonnier et al. 1995)
HERV-W          675             242             40-55                   (Blond et al. 1999)
HERV-E          1138            294             25                      (Taruscio et al. 2002)
HERV-H          2508            1284            >40                     (Jern et al. 2004)
HERV9           4837            697             15                      (Costas and Naveira 2000)
HERV-K (HML2)   1206            178             30                      (Bannert and Kurth 2004; Belshaw et al. 2005)

(a) Including LTRs with no internal sequence and LTRs with associated internal sequence.
(b) Elements including both LTR and internal sequence.

As a first step, we plotted the fraction of total elements in either orientation found within RefSeq (http://www.ncbi.nlm.nih.gov/RefSeq/) transcriptional units (see Methods) and the results are shown in Figure 4.2. To put our results in context, we considered a model of random initial integration throughout the genome. Since 34% of the sequenced genome falls within our analyzed set of RefSeq transcriptional units, we would expect 34% of ERV insertions, 17% in either direction, to be found in these regions. This is a conservative model since initial integration patterns of most exogenous retroviruses are biased toward genic regions (Panet and Cedar 1977; Schroder et al. 2002; Mitchell et al. 2004; Barr et al. 2005) (see also below). An earlier study (Medstrand et al. 2002) noted that, for large superfamilies of LTR elements, elements in either orientation are less prevalent in genic regions than expected by chance. In the present study, we were particularly interested in the fact that, for many LTR types, there are fewer antisense LTRs than expected by chance. This may be a reflection of enhancer effects by these elements, an effect often seen in mice and reviewed elsewhere (Rosenberg and Jolicoeur 1997). Closer analysis of our chosen individual families revealed significant variation in the magnitude of this effect (Figure 4.2).
For example, antisense LTRs of the MLT1 and HERV-K (HML-2) families are relatively more prevalent in genes, suggesting that the presence of these sequences in the antisense direction is less likely to negatively affect the enclosing gene. An alternative explanation is that the initial integration preference of these families was biased more heavily to transcriptional units, compared to most other families. The density patterns of most of the other families are qualitatively similar to each other, with a moderate, though significant, under-representation of antisense elements and a further 2- to 3-fold reduction in sense elements. Similarly, a recent study of a chimpanzee-specific family of gammaretroviruses by Yohn et al. (2005) showed that all 13 of these ERVs found within transcribed regions were oriented antisense to the direction of gene transcription. The exception to this pattern is HERV9 (ERV9), which will be discussed further below.

[Figure 4.2: sense and antisense fractions plotted by element type.]

Figure 4.2 Orientation bias of various full-length ERV sequences in genes. ERV families are as annotated by RepeatMasker in the human genome and are listed in Table 4.1. Fraction of all genomic elements actually found in genes in the sense and antisense orientations is presented, with neutral prediction (dotted line) based on fraction of total genomic elements expected in sense and antisense directions in genes under assumption of uniform random insertion.

4.3.3 Density profiles of ERVs across transcriptional units

At least three factors could account for the antisense bias exhibited by most ERV families. First, the sense-oriented polyadenylation signal in the LTR could cause premature termination of transcripts and therefore be subject to negative selection. Gene transcript termination within LTRs commonly occurs in ERV-induced mouse mutations (Maksakova et al.
2006) and this effect has been proposed as the most likely explanation for the orientation bias (Smit 1999). Second, splice signals within the interior of proviruses could induce aberrant RNA processing, a phenomenon also frequently observed in mouse mutations (Maksakova et al. 2006). To test this second possibility, we plotted graphs similar to Figure 4.2 separately for solitary LTRs, which comprise the majority of retroviral elements in the genome (International Human Genome Sequencing Consortium 2001; Mager and Medstrand 2003), and for composite elements containing LTR and internal sequence (data not shown). While the numbers of the latter are much lower than for solitary LTRs for most families, we detected no significant differences in the density patterns, suggesting that signals within the LTRs, and not interior splice signals, are the primary determinant of the orientation bias. A third factor that could result in orientation bias is the presence of the LTR transcriptional promoter with a potential to cause ectopic expression of the gene, resulting in detrimental consequences, as occurs in cases of oncogene activation by retroviruses (Rosenberg and Jolicoeur 1997). If introduction of an LTR promoter was the primary target of negative selection, one would predict that sense-oriented LTRs located just 5' or 3' to a gene's native promoter would be equally damaging and therefore subject to similar degrees of selection.

To look for evidence of these effects, and to more closely examine the distributions of ERVs within genes, we measured the absolute numbers of ERVs/LTRs of different families in bins across the length of RefSeq transcriptional units and in equal-sized bins upstream and downstream (see Methods). The results of this analysis (Figure 4.3) revealed density profiles that shift dramatically at gene borders.
Specifically, for most ERV families, we found that the prevalence of sense-oriented elements drops markedly inside the 5' terminus of a gene, remains relatively low across the gene and then jumps just as markedly 3' of the gene. This type of pattern does not indicate a strong detrimental effect of LTR sense-oriented promoter motifs. Rather, these profiles suggest that presence of the LTR polyadenylation signal, which would generally only affect a gene if located within its transcriptional borders, is the regulatory signal primarily responsible for the resulting lack of sense-oriented LTR elements within introns.

[Figure 4.3: one panel per family (MLT1, MST, THE1, HERVL, HERVW, HERVE, HERVH, HERV9, HERVK); x-axis: transcription unit bins; series: antisense and sense.]

Figure 4.3 Patterns of annotated ERV presence in equal-sized bins across transcriptional units. Ten bins, numbered 0-9, were considered within transcribed regions. Four bins, two in either direction outside gene borders and equal in length to intragenic bins, were considered, and are shown as bins -2 and -1 upstream and +1 and +2 downstream.
4.3.4 Distinct pattern of HERV9 elements with respect to genes

By separating ERVs into different families, we uncovered a unique genic distribution pattern of HERV9 elements. As shown in Figure 4.2, HERV9 has a more significant deficit of antisense elements and little orientation bias compared to other ERV families. The distinct HERV9 density profile across gene regions illustrates the same point in greater detail (Figure 4.3H). These data suggest that HERV9 elements within introns are nearly equally likely to adversely affect the gene, regardless of orientation. This is the only HERV family we have analyzed that displays this type of distribution pattern, which is likely the result of the complex structure of HERV9 LTRs. These LTRs are 0.7-1.5 kb long and extraordinarily rich in the CpG dinucleotide: examination of the consensus elements from RepBase (Jurka 2000) reveals that all HERV9 LTR (LTR12, in RepBase nomenclature) subfamily members are CpG rich, with approximately 90 CpGs spread over the consensus of the most abundant LTR12C LTR. A relatively CpG-poor tract in the LTR's U3 region contains 5-17 repeats of a sequence rich in transcription factor binding sites, which have recently been shown to bind a transcription factor complex involved in the regulation of the beta globin locus (Yu et al. 2005). HERV9s are under-represented in both orientations in all bins within transcribed regions compared to the nearest regions upstream and downstream of genes. These results suggest exclusion of these ERVs from genic regions in both orientations, likely due to a transcription defect similar to simple polyadenylation. However, analysis for polyadenylation signals using DNAFSMiner (Liu et al.
2005) reveals the presence of polyadenylation signals in all LTR12 family consensus sequences and the absence of a polyadenylation signal on the opposite strand, suggesting that polyadenylation alone cannot account for this distribution pattern. One possible explanation comes from recent work by Lorincz et al. (2004), which showed that methylated intragenic CpG dinucleotides were associated with transcriptional elongation defects, likely due to induction of closed chromatin. This observation, coupled with the fact that HERV9 LTRs are CpG rich, has the potential to explain strong selection against intragenic insertions of HERV9 LTRs in either orientation.

4.3.5 SVA SINE elements display distribution patterns similar to LTRs

SVA elements are composite SINE sequences composed of a tandem hexamer repeat, a partial Alu element, a variable number of tandem repeats, a partial ERV-K LTR, and a poly-A tail (Ono et al. 1987; Zhu et al. 1992; Shen et al. 1994; Wang et al. 2005). These elements contain an internal promoter and the LTR-derived poly-A signal. Like Alu elements (Dewannieux et al. 2003), SVA elements are thought to utilize L1-encoded machinery to retrotranspose (Wang et al. 2005). SVA elements are a relatively young and actively transposing family in humans and have caused several mutations (Ostertag et al. 2003; Chen et al. 2005). These elements have a stronger antisense bias in genic regions, compared to AluY elements (Figure 4.4A, C), suggesting interference with pol II transcriptional machinery. Similar to in vitro insertions of exogenous viruses (discussed above), antisense SVAs are found in transcribed regions more frequently than expected by random chance, suggesting that genic regions are favored targets for SVA insertions.
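The random-insertion baseline against which these densities are judged (34% of the sequenced genome within RefSeq transcriptional units, hence 17% expected in each orientation) can be written out as a short calculation. The sketch below is illustrative only: the function names are hypothetical, the element counts are invented for the example rather than taken from the thesis data, and the z-score based on a normal approximation to the binomial is a stand-in chosen here, not the statistical procedure reported in the text.

```python
import math


def neutral_expectation(total_elements, genic_fraction=0.34):
    """Expected per-orientation fraction and count under uniform random insertion."""
    per_orientation = genic_fraction / 2     # half sense, half antisense
    return per_orientation, total_elements * per_orientation


def z_score(observed, total_elements, p):
    """Normal approximation to a binomial count with success probability p."""
    expected = total_elements * p
    sd = math.sqrt(total_elements * p * (1 - p))
    return (observed - expected) / sd


# Hypothetical counts, for illustration only (not taken from the thesis data).
total = 1000
p, expected = neutral_expectation(total)
z_antisense = z_score(210, total, p)   # positive: more elements than the neutral model predicts
z_sense = z_score(90, total, p)        # negative: fewer elements than the neutral model predicts
print(f"expected per orientation: {expected:.0f}; z(antisense)={z_antisense:.2f}; z(sense)={z_sense:.2f}")
```

An excess of antisense elements over the 17% baseline, as described for SVAs, would show up as a positive z-score, while depletion of sense elements would show up as a negative one.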
Analysis of the profiles of sense and antisense SVA elements across transcription units revealed that, as with LTR elements, there was a drastic change in antisense bias at the boundaries of transcribed regions, mostly due to a sudden drop in density of sense-oriented insertions. This low density of sense-oriented insertions persisted across transcriptional units (Figure 4.4B), again suggesting that polyadenylation plays a significant role in deleterious consequences of germline SVA insertions.

Figure 4.4 Insertion pattern of SVA and AluY retroelements across transcriptional units. A, C. Fraction of annotated genomic SVA and AluY elements found in the sense and antisense directions in transcribed regions. Dotted line shows expected fraction (17%) assuming initially uniform random insertion. B, D. Cumulative insertions of SVA and AluY elements in ten bins, numbered 0-9, across transcriptional units. Similar to Figure 4.3, four extra bins were considered, two in either direction upstream and downstream of genes, and denoted -2, -1, +1 and +2.

In contrast to SVAs, the AluYs, which comprise the youngest superfamily of Alu elements (International Human Genome Sequencing Consortium 2001), have a significantly smaller antisense bias, both overall and across transcriptional units (Figure 4.4C, D). Given the similar mechanism by which SVAs and Alus have likely inserted, these results suggest that intronic insertions of Alus in both orientations have been significantly less likely to be selected against than sense-oriented insertions of SVAs. An intriguing feature of both the SVA and Alu distribution profiles across transcriptional units is the slightly higher prevalence of sense-oriented elements in the 5' regions of genes (bins 0-3) compared to more 3' regions. This pattern could reflect original insertion site preferences favoring the 5' parts of genes.
The biochemical mechanisms are unclear but could be analyzed using in vitro retrotransposition assays. It is interesting to note that SVAs, which also have a large number of CpG dinucleotides, do not show a similar pattern to HERV9 LTRs. CpG dinucleotides in genomic SVA elements do appear to be methylated, given their accelerated mutation rate relative to non-CpG sites (Wang et al. 2005). Why these elements fail to cause a potential elongation defect is unclear and requires further analysis, perhaps using experimental approaches.

4.4 Concluding remarks

The preferential antisense orientation of LTRs/ERVs fixed in gene introns has been shown before (Smit 1999; Medstrand et al. 2002; Cutter et al. 2005). However, patterns of in vitro insertions by exogenous retroviruses have not demonstrated this bias (Schroder et al. 2002; Barr et al. 2005). Using patterns of insertions across transcribed regions for individual families, we have shown that individual ERVs have had differing impacts upon original insertion. A feature that most share, however, is deleterious impact by the polyadenylation signal, evidenced by the sharp increase in antisense bias immediately downstream of the start of transcription, corresponding to a sharp decrease in the density of sense-oriented insertions. The anomalous insertion pattern of HERV9 in transcribed regions suggests an adverse impact on genes regardless of orientation, perhaps due to induction of closed chromatin as a result of methylation of its many CpG dinucleotides. However, SVA elements, which are similarly rich in the CpG dinucleotide, show a robust antisense bias. In conclusion, although some functions of LTRs, primarily their promoters, may have been inactivated or repressed, modern patterns of insertions can be used to deduce original selective forces acting on these elements.
Furthermore, LTR sequence, presumably mobilized as part of the still active SVA SINE, continues to have an LTR-like impact on the human genome.

Chapter 5: Analysis of repeats and genomic stability

A version of this chapter has been published: van de Lagemaat, L.N., L. Gagnier, P. Medstrand, and D.L. Mager. 2005. Genomic deletions and precise removal of transposable elements mediated by short identical DNA segments in primates. Genome Res 15: 1243-1249. I performed all bioinformatic analysis and wrote sections of the paper. L.G. performed PCR assays. P.M. and D.L.M. discussed research and wrote sections of the paper.

5.1 Introduction

Current genome size in mammals and other eukaryotes has been greatly affected by massive amplifications of transposable elements (TEs) or retroelements throughout evolution (Brosius 1999; Kidwell 2002; Liu et al. 2003). In mammals, close to 50% of the genome is recognizably TE-derived (International Human Genome Sequencing Consortium 2001; Mouse Genome Sequencing Consortium 2002; Rat Genome Sequencing Project Consortium 2004) and in some plant species, the figure is nearly 80% (SanMiguel et al. 1998; Li et al. 2004). The various classes of TEs and their distributions in genomes have been widely studied in many species (Adams et al. 2000; International Human Genome Sequencing Consortium 2001; Aparicio et al. 2002; Kidwell 2002; Mouse Genome Sequencing Consortium 2002; Yu et al. 2002; Kirkness et al. 2003; Baillie et al. 2004; Ma and Bennetzen 2004; Rat Genome Sequencing Project Consortium 2004). In contrast, much less is known about mechanisms that attenuate genome size. Studies in plants have shown that retroelement-driven genome expansion is counteracted by deletions within retroelements, likely mediated by illegitimate recombination between short flanking segments of identity (Devos et al. 2002).
Comparison of related rice genomes has also revealed that illegitimate recombination has deleted both retroelement-derived sequences and unique nuclear DNA (Ma and Bennetzen 2004). A number of studies have documented the prevalence of small deletions and insertions (indels) in primate genomes (Britten et al. 2003; Liu et al. 2003; Watanabe et al. 2004), but there has been no genome-wide analysis to determine the molecular mechanisms which generate these events. Recent availability of the chimpanzee draft sequence has afforded the opportunity to analyze the spectrum of genomic deletions that have occurred in the last 5-6 million years of primate evolution. Moreover, a large-scale comparison of the human and chimpanzee genomes allows examination of the genomic stability of retroelement insertions, which are generally considered to be irreversible, with no known mechanism for precise excision from the genome (Hamdi et al. 1999; Roy-Engel et al. 2001; Batzer and Deininger 2002; Salem et al. 2003a; Salem et al. 2003b). Due to this 'unidirectional' property, retroelements, particularly Alu elements, are widely viewed as ideal markers for human population genetic studies (Carroll et al. 2001; Roy-Engel et al. 2001; Batzer and Deininger 2002; Salem et al. 2003a) and elucidation of primate phylogenetic relationships (Hamdi et al. 1999; Salem et al. 2003b; Gibbons et al. 2004). In primates, Alu sequences are the most abundant family of retroelements, comprising over 10% of the human genome (International Human Genome Sequencing Consortium 2001; Batzer and Deininger 2002). While most of the one million Alu elements retrotransposed over 40 million years ago, several thousand have integrated into the human genome since divergence from the great apes, and close to a thousand of the youngest Alus are polymorphic (Carroll et al. 2001; Roy-Engel et al. 2001; Batzer and Deininger 2002; Salem et al. 2003a; Bennett et al. 2004).
Most are associated with flanking direct repeats or target site duplications (TSDs) of 10-20 bp (Jurka 1997). In this study, we have obtained evidence that Alu elements can be precisely deleted from the genome via recombination between these flanking repeats. Similarly, a significant fraction of 200-500 bp deletions of non-repetitive sequence have likely taken place due to recombination between short regions of identical sequence flanking the deleted fragment. We demonstrate that this fraction is much greater than expected if blunt-end joining were responsible for generating all these deletions. Our results are in agreement with a model of genomic deletion occurring both by non-homologous and error-prone homology-driven mechanisms of DNA double strand break repair (Helleday 2003).

5.2 Methods

5.2.1 Direct assessment of retroelement deletion rate

Putative retroelement insertions were obtained from the chimpanzee scaffold alignments to the UCSC July 2003 human genome (Kent et al. 2002) using RepeatMasker (A.F.A. Smit & P. Green, unpublished data), MaskerAid (Bedell et al. 2000), and libraries from the RepBase Update (Jurka 2000). Pseudogenes were detected using BLAT (Kent 2002) and the human RefSeq mRNA records. Insertions were defined as having a single retroelement (including pseudogenes) filling all but up to 90 bp of the indel and not extending beyond the indel by more than 10 bp on either side. Search queries were then constructed of the 50-bp sequences upstream and downstream of each putative retroelement insertion location. Scripted discontiguous megablast searches of the relevant NCBI trace archive were then carried out using Perl scripts and the QBLAST application programming interface (http://www.ncbi.nlm.nih.gov/Traces/trace.cgi; http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html) (Altschul et al. 1990; McGinnis and Madden 2004).
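The query construction described above can be illustrated with a minimal sketch. It assumes that the two 50-bp flanks are simply concatenated into one 100-bp query representing the site without the retroelement; the function name `empty_site_query` and the half-open coordinate convention are hypothetical choices for illustration and are not taken from the original Perl scripts.

```python
def empty_site_query(sequence, indel_start, indel_end, flank=50):
    """Build a query from the flanks of a putative insertion.

    `sequence` is the longer allele; the putative retroelement occupies
    the half-open interval [indel_start, indel_end). The query is the
    `flank` bases upstream joined to the `flank` bases downstream
    (an assumed layout; the thesis only states that both flanks are used).
    """
    if indel_start > indel_end:
        raise ValueError("indel_start must not exceed indel_end")
    if indel_start < flank or indel_end + flank > len(sequence):
        raise ValueError("not enough flanking sequence around the indel")
    upstream = sequence[indel_start - flank:indel_start]
    downstream = sequence[indel_end:indel_end + flank]
    return upstream + downstream


# Toy example: a 4-bp "insertion" between two 50-bp flanks.
toy = "A" * 50 + "GGGG" + "C" * 50
query = empty_site_query(toy, 50, 54)
```

A hit spanning the junction in the middle of such a query would indicate a trace that lacks the element at that site.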
Our BLAST queries used a non-coding template of size 21 and required only one seed hit per high-scoring segment pair. To minimize false positives and ensure non-redundant hits, we required that 75% of the query be free of known human repeats. Further, the accepted hits were required to match the query at least 30 bp on either side of the putative breakpoint. All traces not fulfilling these requirements were ignored. Deletions in human relative to chimpanzee or vice versa were diagnosed by the presence in Rhesus of an insertion at least 80% of the size expected and no traces with less than this amount of sequence. A site was considered an insertion if one or more Rhesus traces matched the empty site and no traces had extra sequence. Putative deletions in human or chimpanzee were further individually aligned with their Rhesus counterpart using ClustalW version 1.82 (Higgins et al. 1996) and the alignments were edited using Jalview (Clamp et al. 2004) to check for the presence of the same element in the expected position in the Rhesus trace. The alignments are provided in Appendix A.

5.2.2 Detection of deletions internal to Alu elements

All indel loci in the chimpanzee scaffold alignments to the UCSC July 2003 human genome, masked as described above, were reanalyzed. Deletions occurring entirely within Alu elements were analyzed for involvement of the approximately 80 bp and 50 bp internal homologies. Putative deletions occurring between the 80 bp similar regions were detected by having one deletion endpoint occurring within positions 1-84 of the consensus and the other endpoint within positions 136 to 219. Similarly, deletions between positions 85-135 and 219 to the end of the consensus were considered as occurring between the 50-bp similar regions.

5.2.3 Assessment of deletion frequency due to illegitimate recombination

Human chromosomal sequence files with human repeats pre-masked to lower case by RepeatMasker were used.
These files were further masked using Tandem Repeats Finder 3.21 (Benson 1999). All repetitive sequence was excised, including human repeats and tandem repeats. We then constructed a C++ program that used an alignment method to find all nonredundant nonadjacent identical segments up to 20 bp long between 200 and 500 bp apart in the nonrepetitive genome. These were tallied by length, giving the expected distribution of potential sites for illegitimate recombination in this distance range. We then analyzed all fully-sequenced insertions and deletions (indels) 200-500 bp long present in the alignments of the UCSC July 2003 human sequence to the chimpanzee scaffolds (Kent et al. 2002) for the presence of flanking identical segments beginning at the deletion breakpoints. Specifically, indels with 50 bp flanking sequence on each side were analyzed, and those containing putative new retroelement insertions, tandem duplications, or flanking homologous retroelements corresponding to the indel breakpoints were removed from consideration, as were all indels occurring inside transposable elements. The remaining indels were classified as deletions. Retroelement insertions were detected as described above. Other insertions were diagnosed if the indel internal sequence was found by BLAT elsewhere in the human genome. Tandem duplications were diagnosed by running Tandem Repeats Finder (Benson 1999) on a sequence including the indel extra sequence and equivalent lengths of sequence flanking the indel upstream and downstream. An indel was considered to be a case of tandem duplication if a tandem duplication was found covering one of the breakpoints and extending to within one bp of the other. Manual checks confirmed the validity of this criterion.
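As an illustration of the test for flanking identical segments beginning at the deletion breakpoints, the following is a minimal sketch rather than the C++ program used in this work. It assumes that the undeleted allele is available as a string and that the identical tract is measured by comparing bases at the two breakpoints with no mismatches allowed; extending the comparison in both directions so that the result does not depend on where an aligner placed the gap is an assumption made here. The function name `flanking_repeat_length` and the 20-bp cap as a parameter are hypothetical.

```python
def flanking_repeat_length(sequence, start, end, max_len=20):
    """Length of the perfect direct repeat at the breakpoints of a deletion.

    `sequence` is the undeleted allele and the deleted segment is the
    half-open interval [start, end). Bases are compared at equal offsets
    from the two breakpoints; any mismatch ends the tract.
    """
    indel_len = end - start
    if indel_len <= 0:
        raise ValueError("end must be greater than start")
    limit = min(max_len, indel_len)

    # Extend rightwards: first bases of the deleted segment versus the
    # first bases following it.
    right = 0
    while (right < limit and end + right < len(sequence)
           and sequence[start + right] == sequence[end + right]):
        right += 1

    # Extend leftwards: bases preceding the deleted segment versus the
    # last bases inside it (covers alternative gap placements).
    left = 0
    while (left < limit - right and start - 1 - left >= 0
           and sequence[start - 1 - left] == sequence[end - 1 - left]):
        left += 1

    return left + right


# Toy allele: a 6-bp repeat (ACGTAC) flanks a 10-bp unique segment.
up, rep, mid, down = "TTCTT", "ACGTAC", "GGTGGAGGTG", "CCACC"
allele = up + rep + mid + rep + down
repeat_len = flanking_repeat_length(allele, len(up), len(up) + len(rep) + len(mid))
```

Tallying this length over all qualifying indels would give an observed distribution comparable in form to the tally of identical segments by length described above.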
After disqualifying putative insertions, tandem duplications, and indels with flanking homologous repeats, the remaining 1927 indels were termed random deletions and were analyzed for the distribution of flanking repeat sizes. Flanking repeats were considered to begin at the breakpoint positions and consist of a tract of identical sequence. No mismatches in the flanking repeat or offsets of the identical segment from the indel breakpoints were allowed.

5.2.4 Genomic PCR and sequencing

Primate genomic DNA was isolated from various cell lines as described previously (Goodchild et al. 1993). Additional chimpanzee DNA samples were kindly provided by Dr. Peter Parham (Stanford University). 150 ng of human or primate genomic DNA was amplified in a 50 µl reaction with 200 µM each dNTP, 200 nM each primer (see Appendix A), 1.5 mM MgCl2, and 1 unit of Platinum Taq (Invitrogen) in 1X PCR buffer (Invitrogen). The conditions for the PCR were 94°C for 1 min followed by 30 cycles of the amplification step (94°C for 30 s, 48-60°C for 30 s, and 72°C for 30 s-1 min). The annealing temperature and extension time varied for different primer combinations. Sequencing was performed directly on PCR products using the BigDye Terminator v3.1 Cycle Sequencing Kit (ABI) in an ABI PRISM® 3730XL DNA Analyzer system at the McGill University sequencing facility.

5.3 Results and Discussion

5.3.1 Direct assessment of retroelement deletion frequency

During an analysis to identify transposable element (TE) insertions that occurred after divergence of human and chimpanzee, we detected some apparent insertional differences involving Alu elements of older subfamilies. The AluY subfamily is the only family known to have been active in the last few million years of human evolution (Batzer and Deininger 2002).
However, we identified 187 Alu elements from older families such as AluS and AluJ (98 in human and 89 in chimpanzee) which appeared to be insertional differences. This finding raised the possibility that at least some of these cases represent deletions in one species rather than new insertions in the other. To explore this possibility, scripted BLAST searches of the Rhesus macaque whole-genome shotgun trace archive were used to assess the ancestral state of apparent retroelement insertional differences in humans and chimpanzees (see Methods). It should be noted that our requirement that 75% of the total 100 bp of flanking sequence be free of known repeats resulted in only 8389 of 14765 retroelement loci being tested, and therefore we expect that our findings represent an underestimate of the overall level of precise deletion of retroelements. Of 7120 human-chimp indel sites with accepted Rhesus trace matches, 7010 were identified as insertions by our criteria (see Methods). That is, the retroelement was absent in Rhesus. The other 110 sites were examined more closely. Fifty-two of these cases appeared to be rearrangements or multi-copy regions in the Rhesus genome due to the existence of multiple Rhesus traces covering the region, some with and some without the retroelement. Three further cases with partial poor trace alignments were likely genomic rearrangements. The remaining 55 cases were subjected to more detailed analysis to confirm that the indel was a case of deletion in human or chimpanzee and not an insertion or other rearrangement. Multiple sequence alignments of the human, chimpanzee, and Rhesus sequences were done in each of the 55 cases (reproduced in Appendix A). One (#23) resulted from poor sequence quality in the chimpanzee assembly. Another (#51) was a tandemly-duplicated L2 element.
Four other cases (#5, 13, 18, and 31) showed evidence of independent insertions in the same site or in sites only several base pairs apart. Independent insertions at the same site have been reported before (Conley et al. 2005). The remaining 49 cases appeared to be retroelement deletions. Twelve cases, six in humans and six in chimpanzees, were imprecise deletions, removing sequence from older retroelements such as L2 and MIR. A similar case of imprecise Alu deletion has been previously reported (Edwards and Gibbs 1992). In each case, our 12 imprecise deletions had little or no similarity at the deletion breakpoints, suggesting a nonhomologous deletion mechanism as an explanation for these events. Thirty-seven cases represented apparent precise deletion of previously retrotransposed sequence and all cases but one were Alu elements. The one anomaly (case #6) was a polyadenylated sequence flanked by apparent target site duplications. This is a fragment of a ~340 bp sequence with ~20 copies mutually ~6-10% divergent in the human genome, suggesting possible earlier mobilization as a retrotransposable element. We found 36 cases of apparent precise deletions of Alu elements. The loss of the Alu was also associated with loss of one copy of the TSD, leaving behind the original, pre-integration site only. This observation raised the possibility that these deletions were mediated by recombination between the flanking identical regions. A possible example of precise Alu deletion on human chromosome 21 has been reported recently by Hedges et al. (2004), but the authors considered Alu excision to be a remote possibility, instead favoring other explanations. Unfortunately, there is no coverage of this region in the chimpanzee scaffolds. Furthermore, recent PCR analysis of human-chimpanzee indels on chimpanzee chromosome 22 revealed two precisely deleted Alu elements; however, sequences and positions of these events were not given.
These deletions resulted in loss of the Alu and deletion of one of the TSD copies, leading the authors to speculate that a homology-dependent recombination mechanism might be responsible for these deletions (Watanabe et al. 2004). We reasoned that under a null hypothesis of deletions mediated by nonhomologous mechanisms, very few should be flanked by short identical segments. Instead, the majority of the 49 deletions (37 with flanking identical segments and 12 without) had identical regions of 10 bp or more. Compared to the null hypothesis, this association between deletion and flanking identical DNA was highly significant (p < 1 x 10^-100; Chi-squared test). The skeptical reader could argue that we were only looking at deletions with breakpoints near retroelements, and therefore we would be more likely to find breakpoints located within TSDs, even with a non-homologous deletion mechanism. However, the likelihood of locating the breakpoints precisely at the same location within the TSD in the vast majority of the cases by random chance alone remains extremely small. Our findings strongly suggest that short, nonadjacent identical segments recombine, likely during double-strand break repair, to mediate deletion of these sequences. Consistent with this notion is the fact that at least 20-fold more deletions that involve Alus are actually internal to Alu elements and have occurred between the roughly 80-bp and 50-bp homologous regions internal to intact Alu elements (Figure 5.1A; see Methods). These findings suggest that double strand DNA breaks internal to Alus are repaired using the internal Alu homologies, obviating use of the flanking TSDs as repair templates and thus retaining remnants of the Alu element. The proposed mechanism of double-strand break repair is illustrated in Figure 5.1B, which shows a specific non-Alu small deletion in chimpanzee.
Figure 5.1 Deletions due to DNA double strand break repair. A) Whole and partial Alu element deletions. A full-length Alu is shown in the middle and black arrows represent target site duplications. Shaded and white internal regions represent internal ~70% identical regions. Deletions involving the 84 bp internal Alu homologies (shaded regions) were found 740 times in the human-chimpanzee alignments (top left). Alu internal deletions occurring between the other homologies (white regions) were found 242 times (top right). Precise deletion of entire Alu elements, likely involving the target site duplication (black arrows), was found in 36 cases (lower) in relatively repeat-free regions since human-chimpanzee divergence. B) A non-Alu deletion in chimpanzee at human chr1:1448280-1448311. Precise deletions of Alu elements, internal deletions within Alus, and other deletions are explained by an error-prone homology-dependent repair mechanism, involving 1. a double-strand DNA break, 2. resection of DNA and exposure of 3' tails, 3. homology search, and 4. ligation. In this case, a 4-bp homology mediated a 16-bp deletion.

Several of the apparent deletions from chimpanzee corresponded in human to human-specific Alu families, such as AluYa5 and AluYb8. However, in each case the corresponding element in Rhesus monkey shared identical TSDs and was also an AluY. Two explanations can account for this observation: multiple independent insertions at the identical site, or recent gene conversion in human which converted an existing older AluY insertion into an apparent human-specific family.
Although we cannot rule out independent insertions as an explanation in these cases, we believe gene conversion, reported previously to occur between Alu elements (Salem et al. 2003a), is more likely. It should be noted that both deletion in the chimpanzee lineage and gene conversion in the human lineage, rather than controverting one another, are dual lines of evidence suggesting elevated recombinational or double-strand DNA break repair activity in these loci in recent evolutionary time. We further noticed a relative paucity of precise deletions in human vs. chimpanzee (only 9/37 occurred in the human lineage). Without further study, it is unclear what this might mean. However, further BLAT alignments confirmed that, with the exception of two events (case #25, deleted in human, and case #43, deleted in chimpanzee), these events have all occurred in single-copy regions of the human and chimpanzee genomes. Furthermore, we used discontiguous megablast against the chimpanzee sequence trace database at NCBI to check for the possibility that some of the putative deletions in chimpanzee were a result of anomalous assembly, in which an Alu-containing trace at a locus was over-ruled by traces not containing the Alu. No such cases were found. By comparison, the numbers of random deletions between 200 and 500 bp long, discussed below, were more similar between human and chimpanzee (1011 and 916, respectively).

5.3.2 Analysis of random genomic deletion by illegitimate recombination

To further investigate the genomic prevalence of deletions that might be mediated by short repeats during the last few million years of primate evolution, we examined all length differences of 200-500 bp (thus approximating the 300-bp size of Alu elements) between human and chimpanzee and looked for flanking repeats at the breakpoints.
After eliminating cases of tandem duplications, insertions (including sequence having additional copies elsewhere in the human genome), indels within transposable elements, and deletions between homologous transposable elements (see Methods), 1927 indels remained, and we termed these random deletions. It should be noted that our method did not exclude genomic deletions having one or both breakpoints within repetitive sequence, as long as the repetitive sequence at the endpoints did not belong to homologous repeats. We found that the endpoints of 367, or 19.0%, of 200-500 bp random deletions in the human and chimpanzee lineages are associated with flanking identical repeats of at least 10 bp. To put this observation in the context of non-random sequence composition in primate genomes, we attempted to measure the background density of nonadjacent homologies 200-500 bp apart occurring in nonrepetitive human genome sequence. Therefore, repetitive sequence recognized by RepeatMasker (A.F.A. Smit & P. Green, unpublished data; http://www.repeatmasker.org) and tandem repeats found by Tandem Repeats Finder 3.21 (Benson 1999) were excised from the genome. This left 1.58 Gbp, or 55.6% of the human genome. A C++ program was constructed that computed alignments between all genomic positions 200 to 500 bp apart. From the banded alignments, the program directly calculated the length distribution of randomly-occurring identical segments flanking sequence tracts 200-500 bp long. We then extrapolated the observed homology counts to compute the expected random homology occurrence in a complete genome. This method projected that 1.62 million random homologies of 10+ bp would exist 200-500 bp apart in the full size 2.84 Gbp human genome. The 367 random deletions that we observe with 10+ bp flanking repeats therefore account for 0.0226% of all such homologies available in the genome.
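The figures in this paragraph can be checked with a short calculation. The sketch below assumes that the extrapolation scales the homology count linearly with sequence length, which reproduces the reported projection but is not stated explicitly as the method used; the input of 0.903 million homologies is the unextrapolated count reported for the nonrepetitive genome in this chapter. Variable and function names are illustrative only.

```python
NONREPETITIVE_BP = 1.58e9         # sequence left after excising repeats
GENOME_BP = 2.84e9                # full-size human genome used in the text
HOMOLOGIES_NONREP = 0.903e6       # 10+ bp homologies counted, unextrapolated
DELETIONS_WITH_10BP_REPEATS = 367
RANDOM_DELETIONS = 1927


def extrapolate_count(count, sampled_bp, total_bp):
    """Scale a count linearly from the sampled sequence to the whole genome
    (an assumed model of the extrapolation)."""
    return count * total_bp / sampled_bp


nonrep_percent = 100 * NONREPETITIVE_BP / GENOME_BP
projected = extrapolate_count(HOMOLOGIES_NONREP, NONREPETITIVE_BP, GENOME_BP)
deletion_percent = 100 * DELETIONS_WITH_10BP_REPEATS / RANDOM_DELETIONS
share_percent = 100 * DELETIONS_WITH_10BP_REPEATS / projected

print(f"nonrepetitive fraction:       {nonrep_percent:.1f}%")
print(f"projected 10+ bp homologies:  {projected / 1e6:.2f} million")
print(f"deletions with 10+ bp repeats: {deletion_percent:.1f}%")
print(f"share of homologies used:     {share_percent:.4f}%")
```

Under the linear-scaling assumption, the four printed values correspond to the 55.6%, 1.62 million, 19.0%, and 0.0226% figures quoted in the text.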
This observation again fits well within the paradigm of deletion-prone homology-driven DNA double strand break repair, known as single-strand annealing (Karran 2000; Helleday 2003). In that model, DNA breakage results in binding of complexes that initiate peeling back of DNA, followed by a stochastic homology search in regions adjacent to the broken ends. In this type of DNA repair, many local homologies may be bypassed before fortuitous matching occurs. Exonucleases break down loose DNA ends, followed by ligation of the broken ends (Figure 5.1B). This mechanism accounts for deletion sizes over several orders of magnitude (data not shown), and for varying flanking repeat sizes (Figure 5.2).

Figure 5.2 Prevalence of direct repeats at deletion boundaries. 1927 random deletions 200-500 bp in length were observed in the UCSC chimpanzee scaffold alignments to the July 2003 human genome. Observed flanking repeat occurrence (black bars) and expected occurrence if these deletions occurred by nonhomologous end joining alone (grey bars) are displayed as a function of flanking repeat size (bp). Flanking repeats 7 bp in size and above are expected to occur in <1/1927 cases.

As observed with Alu deletions, the observed association of random deletions with 10+ bp flanking repeats appeared much greater than would occur if homology played no role. Indeed, the suggestion that nonadjacent homologies play a role in genomic deletions has also been made based on studies in plants, although no statistical analysis has been done (Devos et al. 2002; Ma and Bennetzen 2004). To statistically confirm a strong association between flanking repeats and deletion, our results were compared to what would be expected in a process of purely random breakage followed by blunt-end rejoining (Figure 5.2).
We reasoned that, under the hypothesis of no association between homology at breakpoints and deletion occurrence, homology occurrence at breakpoints of 200-500 bp deletions should mirror that observed 200-500 bp apart in the nonrepetitive genome. Using the data described above without extrapolation, 0.903 million randomly-occurring homologies occur in the nonrepetitive genome, in which each of the 1.58 billion positions can pair with roughly 300 others, giving 0.474 trillion position combinations 200-500 bp apart. Thus 10+ bp homologies occur randomly at a frequency of 1.9 x 10^-6 of any two positions 200-500 bp apart. Therefore, if homology plays no role in these deletions, we would expect much less than one occurrence of 10+ bp homology in our set of 1927 deletions (1927 x 1.9 x 10^-6 = about 0.0036 occurrences), compared to the observed 367 occurrences (P << 1 x 10^-100; Chi-squared test). Furthermore, by plotting the observed number of deletions associated with different lengths of flanking identity, we found that flanking repeats as short as two base pairs were overrepresented in the data set (Figure 5.2). This strong association of short flanking identities with deletion further confirms that illegitimate recombination between such short sequences has played a highly significant role in sequence deletion during primate evolution.

5.3.3 Direct confirmation of Alu element deletions

Finally, to confirm our findings, we chose 9 cases of AluS elements present in human but absent in the draft chimpanzee sequence to examine in more detail. These loci were chosen within and at varying distances from genes. To avoid regions of poor or anomalous alignments, we only investigated cases where the percentage identity between human and chimpanzee sequence surrounding the Alu is very high (>98%) and the Alu is a complete element with recognizable target site duplications (TSDs).
Five of the cases (#14, 33, 42, 43, and 52; see Appendix A) were predicted to be deletions in chimpanzee, and as a control we selected four cases expected to be insertions in human (#C1-C4). The presence or absence of each of these Alus in a range of primate species was then determined using genomic PCR, and the results are summarized in Table 5.1.

Table 5.1 AluS indels assayed in primates by PCR and BLAST

#(1)  Fam.  Position(2)   TSD(3)  H(4)  C(4)    G(4)  O(4)  Gi(4)  B(4)  R(5)  Location/nearest genes
C1    Sx    20:11512274   13      Y(6)  N(6)    N     N     N      N     N     ~354 kb 5' of BTBD3 (BTB/POZ domain containing-3)
C2    Sg    15:83819720   16      Y     N       N     N     N      N     N     In intron of AKAP13 (A-kinase anchor protein)
C3    Sg    7:104197804   19      Y     N       N     N     I(7)   N     N     ~17 kb 5' of MLL5 (Myeloid/lymphoid leukemia 5)
C4    Sg    20:18254452   15      Y     N       N     N     N      N     N     ~9.7 kb 5' of ZNF133 (Kruppel Zn-finger protein)
14    Sg    3:127318836   17      Y     N       Y     Y     ?      ?     Y     ~63 kb 3' of KLF15 (Kruppel-like factor 15)
33    Sx    12:48585272   16      Y     N       Y     Y     Y      Y     Y     1.3 kb 5' of FAIM2 (Fas apoptotic inhibitory molecule 2)
42    Sq    16:69279114   16      Y     N       Y     Y     Y      Y     Y     5.2 kb 3' of CYB5-M (cytochrome b5)
43    Sx    16:74232245   17      Y     Y/N(8)  Y     Y     Y      ?     Y     In intron of LOC348174 (secretory protein)
52    Sq    22:45658137   15      Y     N       Y     ?     Y      ?     Y     In intron of C22orf4 (putative GTPase activator)

(1) Case number. Cases Cn are controls, and others refer to case number in Supplementary information.
(2) Chromosome and position in July 2003 Human Genome Browser (http://genome.ucsc.edu).
(3) Size of Target Site Duplication (bp).
(4) H, human; C, chimpanzee; G, gorilla; O, orangutan; Gi, gibbon; B, baboon; assayed by PCR.
(5) R, Discontiguous MegaBLAST results from Rhesus monkey trace archive.
(6) Y = Alu is present; N = Alu is absent (as determined by PCR or Discontiguous MegaBLAST); ? = primers did not amplify or product of unexpected size.
(7) I = Independent Alu insertion in same region in gibbon.
(8) Alu #43 is 'polymorphic' in all chimpanzees tested. Region is triplicated in human with all 3 having the Alu in human and one region lacking the Alu in chimpanzee.
Figure 5.3 PCR and sequence evidence for precise Alu element deletion. A-F) cases C4, 14, 33, 42, 52, 43 from Table 5.1; lanes are human (H), chimpanzee (C), gorilla (G), orangutan (O), gibbon (Gi), baboon (Ba), and no-template control (-). G) case 43, genomic PCR in 6 additional chimpanzees, labeled 1-6.

As expected, our four controls demonstrate AluS presence only in human and no other primate, consistent with insertion in the human lineage after divergence from chimpanzee (Table 5.1, Figure 5.3A). In accord with this finding is a study suggesting that some AluSx elements may still be active (Johanning et al. 2003). Therefore, some of the non-AluY differences between human and chimpanzee may reflect recent low levels of retrotranspositional activity of AluS elements. An alternative explanation is that young AluY elements inserted in these locations, followed by gene conversion templated by older AluS elements. We therefore more carefully examined these Alu sequences to look for nucleotide positions diagnostic of young AluY subfamilies (Batzer and Deininger 2002). Although we found no convincing evidence for partial gene conversion, this mechanism cannot be ruled out. Interestingly, in control #3, gibbon has an independent AluY insertion at this locus, offset by 4 bp (NCBI Accession no. AY953324). Independent parallel retroelement insertions at or near the same genomic site have been previously noted (Salem et al. 2003a; Conley et al. 2005).

In the remaining five cases, PCR evidence confirms deletion in chimpanzee rather than lineage-specific insertion in human (Table 5.1). In four cases (#14, 33, 42, and 52), the Alu element was found to be uniformly present in 10 of 10 humans and absent in 10 of 10 chimpanzee DNA samples (data not shown).
These four regions are apparently unique in the human genome with no evidence of segmental duplication. Insertion of these Alu elements could be verified by PCR in orangutan, which diverged from the higher apes 12-15 mya (Glazko and Nei 2003), or in even more distantly related primates (Figure 5.3B-E). (For case #14 in gibbon, the PCR product was of unexpected size, Figure 5.3E, suggesting rearrangement or other insertions in the region.) Given these long periods of time, it is unlikely that these loci reflect lineage sorting of ancestral polymorphisms, proposed previously to explain unexpected Alu presence/absence relationships in the great apes (Salem et al. 2003b; Hedges et al. 2004). Rather, these results suggest that pre-existing fixed Alu elements have been deleted in the chimpanzee lineage.

To verify that the loci in other primates contain the same Alu insertion, we sequenced the region in gorilla for cases #33 and 52 (NCBI Accession nos. AY953323 and AY953322) and compared them to the human, chimpanzee, and Rhesus macaque genomic sequences from the databases (Figure 5.4A,B). In both cases, the gorilla and Rhesus loci are occupied by the same ancestral Alu as in human with the same target site duplication (TSD). Moreover, the sequence in chimpanzee has the expected structure of the preintegration locus with only one copy of the TSD generated upon Alu insertion.
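The "expected structure of the preintegration locus" can be made concrete with a toy model. An occupied locus has the form left flank + TSD + element + TSD + right flank; a precise deletion by recombination between the two TSD copies leaves left flank + TSD + right flank, which is exactly what the site looked like before insertion. The sketch below illustrates this under stated assumptions: the function name is hypothetical, the sequences are invented toy strings rather than the loci of Figure 5.4, and real loci would need tolerance for mismatches between the TSD copies.

```python
def excise_between_repeats(locus, tsd):
    """Simulate precise deletion between two direct copies of `tsd`.

    Removes the first repeat copy and everything up to the second copy,
    leaving a single copy behind. Returns None when the locus does not
    contain two copies. Hypothetical helper using exact matching only.
    """
    first = locus.find(tsd)
    if first == -1:
        return None
    second = locus.find(tsd, first + len(tsd))
    if second == -1:
        return None
    return locus[:first] + locus[second:]


# Toy sequences (invented for illustration).
left_flank = "CCTGTAGTC"
right_flank = "TTGACCGTA"
tsd = "AAGCTTGGCATTCAG"
element = "GGCCGGGCGCGGTGGCTCAAAAA"  # stand-in for an Alu with its A-rich tail

occupied = left_flank + tsd + element + tsd + right_flank
preintegration = left_flank + tsd + right_flank

after_deletion = excise_between_repeats(occupied, tsd)
print(after_deletion == preintegration)
```

Because the product of such an event is identical to the empty site, sequence from the deleted species alone cannot distinguish a precise deletion from a locus that was never occupied; that is why the outgroup comparisons described above are needed.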
Figure 5.4 Sequence evidence for precise Alu element deletion. A,B) cases 33 and 52, sequenced in gorilla and compared to the database sequences of human, chimpanzee, and Rhesus macaque. C) case 43, showing available human, chimpanzee, and Rhesus loci. Target site duplications are boxed.

The final case (#43) is more complex in that chimpanzee appears to have both occupied and unoccupied alleles or loci (Figure 5.3F). This pattern was seen in DNA from 6 of 6 additional chimpanzees tested (Figure 5.3G), suggesting that it does not reflect allelic polymorphism.
Indeed, database analysis revealed that this locus is part of complex segmental duplications that resulted in three copies in the human genome, all of which have the Alu insertion. The draft chimpanzee sequence has two copies, one of which lacks the Alu insertion. We cannot determine if a third copy exists in chimpanzee because of gaps and poor sequence coverage in these regions. An alignment of the three human and two chimpanzee sequences, as well as one Rhesus sequence, is depicted in Figure 5.4C and shows that the chimpanzee locus without the Alu has the expected structure of a pre-integration allele. We confirmed the database entries by sequencing the two loci in chimpanzee (NCBI Accession nos. AY953325 and AY953326). The most probable explanation for this finding is that the Alu integrated prior to duplication of the region, followed by loss of the Alu in one chimpanzee copy.

5.4 Concluding remarks

In summary, our analysis strongly suggests an important role for short nonadjacent segments of DNA identity in genomic deletions. In rare cases, even retroelement insertions deeply fixed in the primate lineage can apparently be precisely excised from the genome in a manner involving the flanking TSDs, leaving behind no footprint of their insertion. We believe that illegitimate recombination between short identical stretches of DNA, likely involving a DNA double-strand break repair mechanism, is the most likely and simplest molecular mechanism to explain the findings reported here. This conclusion is supported by the fact that a large fraction of non-TE associated deletions distinguishing human and chimpanzee have short repeats at the breakpoints. Furthermore, this study provides new insights into genomic attenuation and contradicts a rigid view that all insertions of retroelements represent unidirectional events.
On the other hand, this study demonstrates that, for Alu elements in particular, homoplasy freedom is a mostly valid assumption and implicates internal homologous regions as preventing wholesale deletion of Alus.

Finally, an aspect of Alu biology that has provoked interest is the slight preferential localization of younger elements in AT-rich regions but higher density of older elements in more GC-rich DNA (International Human Genome Sequencing Consortium 2001). Several theories have been proposed to explain the differences in Alu distributions with element age (Schmid 1998; Brookfield 2001; Pavlicek et al. 2001; Medstrand et al. 2002; Jurka 2004). While our findings indicate that precise deletion of Alu elements makes reversal of retroelement insertions possible, the phenomenon is nevertheless quite rare (~0.5% of length polymorphisms) and is likely insufficient to explain the shifts in Alu distribution. However, ectopic illegitimate recombination not involving TSDs may help to explain overall Alu sequence loss and distribution patterns.

Chapter 6: Summary and conclusions

6.1 Summary

The purpose of this thesis work was to use global analyses of populations of repeated sequences of various kinds in mammalian genomes to understand the interactions of these sequences and their host genomes. The methods developed in this pursuit were almost exclusively computational and involved analysis of sequenced human, chimpanzee, and mouse genomes and related sequence data. Trends identified in our analyses have provided insight into the global effects of transposable elements and their impact on the host. Below I briefly discuss several considerations arising out of this work.
6.2 Initial and long-term genomic localization of mobile elements are related in complex ways

Appropriate normalization of the currently observed distributions of repetitive elements allows comparison of the distributions of elements with widely varying total population sizes. While most element types seem simply to be lost from high-GC regions over time, the Alu distribution is particularly intriguing, in that very young and very old Alus seem to have less of a high-GC sequence composition preference, while Alus of intermediate ages seem to be strongly clustered in regions of high-GC content, which, in many cases, coincide with gene-rich regions. Several explanations have been advanced that may account for these observations, each perhaps explaining some part of the nascence of the observed mobile element distributions. Our own work pointed to a role for recombination in shaping the distributions of Alu elements (Chapter 2). However, a complete understanding of the localization of different retroelements in regions of varying sequence composition requires knowledge of all processes taking place at and after the time of insertion. The recent discoveries that transcripts, particularly those of exogenous viruses such as HIV, can be tethered in genic regions, resulting in an altered propensity to insert there (Ciuffi et al. 2005; Lewinski and Bushman 2005), suggest that a similar mechanism may have accounted for the Alu accumulation in GC-rich regions. This theory is particularly pleasing in that it could explain the differential GC-content localization of the different Alu families based on their consensus sequence alone, without invoking any functional role or any more significant interactions with the genome. Furthermore, the recent discovery of base pair preferences in the vicinity of exogenous retroviral insertion sites (Holman and Coffin 2005) further suggests binding by tethering factors.
Not least, differences in insertion patterns relative to genes for L1, SVA, and Alu retroelements, all of which are presumed to insert using L1-encoded machinery, strongly suggest the presence of auxiliary factors influencing the localization of these elements.

6.3 Some TEs interact strongly with genes, leading to population biases in regions surrounding genes

Although the localization of insertions of mobile elements is apparently determined, at least in part, by the sequence composition of the region, a separate, gene-specific interaction may also be measured. It must be noted that, given the strong association between high-GC content and high gene density, some of the apparent genomic sequence composition effect on TE localization may be due to the presence of genes. In any case, mapping of TEs with respect to genes and comparison of this mapping with that expected by consideration of sequence composition alone allowed a conservative assessment of the effect of genes on TE localization (Chapter 2). In short, TEs whose life cycle involves transcription by the pol-II machinery, which therefore contain internal active pol-II transcriptional signals in their sequence, seem to be at least partially disallowed from entry into pol-II transcribed regions, likely due to pathogenicity upon insertion there. This has been suggested by others (Smit 1999). However, disallowance of TEs in the same transcriptional direction as genes extends upstream and downstream of genes as well, especially for LTR-containing elements (Chapter 2), suggesting that the transcriptional signals provided by these elements are detrimental at some distance from genes. The fact that pol-II signal-containing TEs are not totally excluded from transcribed regions raises several interesting questions. If pol-II driven TEs are so pathogenic, why have they been allowed to remain at some level in transcribed regions? Are they disabled in some way?
To what extent, if any, does methylation silencing of TE pol-II promoters affect the permissiveness of genic regions for insertion of these elements? Do they have weaker promoters or polyadenylation signals? Are there adjacent cellular sequences that override these dangerous motifs in some way? Are there perhaps binding sites for tethering proteins that help keep the pol-II holoenzyme on track in spite of these polyadenylation signals? These and other questions suggest a fruitful avenue for further research. For example, it might be interesting to search for association between multi-species conserved sequences, which have recently been shown to occur at higher density within long introns (Sironi et al. 2005a; Sironi et al. 2005b), and sense-oriented pol-II driven TEs found in genic regions.

6.4 Many gene UTRs are associated with TE-derived sequence

Analysis of transcripts revealed that 27.4% of human genes have permitted inclusion of TE sequence in the UTRs of one or more of their transcripts (Chapter 3). The corresponding figure for mouse is 18.4%. This lower figure may reflect the higher mutation rate in the mouse lineage, by which repetitive sequence becomes undetectable by alignment methods. Indeed, mutually aligning genomic regions for which an older repeat is found in humans often have no annotated repeat in mouse, reflecting an inability of current alignment methods to find these repeats in mouse. In addition, high-copy repetitive sequence has, in general, been less well studied in the mouse lineage, and the lower detected genomic coverage by repeats in rodent genomes is a reflection of this fact (Mouse Genome Sequencing Consortium 2002; Baillie et al. 2004). A synthesis of the mapping data we have in both humans and mouse results in a perhaps unsurprisingly consistent picture of both systems. In both human and mouse, TEs appear to affect expression of many genes through donation of transcriptional regulatory signals.
Furthermore, recently expanded gene classes, such as those involved in immunity or response to external stimuli, have transcripts enriched in TEs, whereas TEs are excluded from mRNAs of highly conserved genes with basic functions in development or metabolism. These results could support one of two views. On one hand, one might argue that TEs have played a significant role in the diversification and evolution of mammalian genes. On the other hand, a more neutralist conclusion might be that permissive genes are so because their product is less dosage-critical, and therefore interference with expression due to TE donation of signals or sequence is less likely to adversely affect such genes. In this regard, it would be interesting to study inclusion of TE sequence in transcripts in the context of variations in expression level. One might expect that genes whose expression level is critical and conserved across species would most strongly exclude TEs. A first attempt to study this might entail compilation of a list of haploinsufficient genes followed by assessment of TE content in their transcripts.

6.5 Short repeated sequences are involved in genomic deletions

Insertion of transposable elements is a major cause of genomic expansion in eukaryotes. Less is understood, however, about mechanisms underlying contraction of genomes. A combination of global bioinformatic analyses and PCR-based approaches showed that retroelements can, in rare cases, be precisely deleted from primate genomes, most likely via recombination between 10-20 bp TSDs flanking the retroelement (Chapter 5). The deleted loci are indistinguishable from pre-integration sites, effectively reversing the insertion. It is estimated that 0.5 to 1% of apparent retroelement insertions distinguishing humans and chimpanzees actually represent deletions.
Furthermore, 19% of genomic deletions of 200-500 bp that have occurred since the human-chimpanzee divergence are associated with flanking identical repeats of at least 10 bp. A large number of deletions internal to Alu elements are also flanked by similar sequence. These results suggest that illegitimate recombination between short direct repeats has played, and likely continues to play, a significant role in human genomic deletion processes and is likely implicated in DNA deletion syndromes such as cancer. In short, while this study lends support to the view that insertions of retroelements are mostly irreversible, it is the first to conclusively demonstrate precise reversion of these events and estimate the rate of precise deletion of these elements. The data presented also suggested that the same mechanism is responsible for a large number of random genomic deletions.

In addition to the insights it provided, our study of short direct repeats and their role in deletion processes suggests further questions. Perhaps we can learn more about the frequency of error-prone modes of operation of ideally error-free DNA DSB repair pathways. To answer this question, one might perform further study of putative deletions of all sizes, using other primate genomes to determine the ancestral state of indels. This approach would help to elucidate in further detail the contributions of the various DNA DSB repair mechanisms to genomic sequence change. Such insights can help us understand the etiologies of cancer and similar diseases caused by DNA breakage and repair.

6.6 Conclusions

Repeated sequences, primarily TEs, make up a large fraction of mammalian genomes. Bioinformatic analysis of genomic data is a uniquely powerful technique that has allowed us to conduct global genomic analyses of repeats and address their involvement in genomic-scale phenomena.
The theme that has emerged repeatedly is the familiar one where the survival of the organism is the ultimate determinant of whether sequence change is accepted or not. Within this paradigm, apparently selfish sequence, such as that of apparently viral origin, may be co-opted by a genome to perform a useful function. However, it remains subservient to the needs of the organism, only rarely persisting when its overall impact is negative, and then only if the negative impact is slight. In rare cases, positive effect has been argued with convincing evidence, such as in the case of salivary expression of a digestive enzyme under the control of an enhancer of viral origin (Ting et al. 1992). As more detailed genomic analyses are done of insertions and deletions and genes affected by such events, it may be expected that more of these events will come to light. Especially interesting in this regard is the recent availability of the chimpanzee genome and its alignment with the human genome. Availability of more sequenced genomes promises to shed additional light on the role of repetitive DNA of all kinds in making the broad array of organisms what they are.

Although their global nature makes bioinformatic analyses powerful, it is also limiting. Genome-wide analyses, though they survey all genes, cannot address every factor governing each individual case. This is largely because many influences on gene expression, for example chromatin remodeling and RNA interference to name just two, though characterized on some level for individual genes, are at present far from being well enough understood in terms of their regulation and their regulatory effect on other genes to apply that knowledge on a genomic scale. It is exciting to think that, as more data on the various phenomena become available, we may gain sufficient understanding to map such information to genomes and conduct global analyses of them.
This and an increasing trove of sequencing and phenotypic data in the public databases promise to provide grist for bioinformatic analyses addressing more and more sophisticated biological questions as time progresses. At no time, however, must the researcher lose sight of the critical importance of complementation of bioinformatic studies by wet laboratory approaches. Rather, bioinformatics and the wet laboratory are envisioned as partners in an iterative process, in which the wet lab provides fundamental understandings which are then used in global bioinformatic analyses. The goal of those analyses is to synthesize diverse data into a more systems-level picture of biology, offering a whole new round of hypotheses amenable to testing in wet-lab environments.

References

Adams, M.D., S.E. Celniker, R.A. Holt, C.A. Evans, J.D. Gocayne, P.G. Amanatides, S.E. Scherer, P.W. Li, R.A. Hoskins, R.F. Galle et al. 2000. The genome sequence of Drosophila melanogaster. Science 287: 2185-2195.
Altschul, S.F., W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. 1990. Basic local alignment search tool. J Mol Biol 215: 403-410.
Aparicio, S., J. Chapman, E. Stupka, N. Putnam, J.M. Chia, P. Dehal, A. Christoffels, S. Rash, S. Hoon, A. Smit et al. 2002. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science 297: 1301-1310.
Athanikar, J.N., R.M. Badge, and J.V. Moran. 2004. A YY1-binding site is required for accurate human LINE-1 transcription initiation. Nucleic Acids Res 32: 3846-3855.
Baillie, G.J., L.N. van de Lagemaat, C. Baust, and D.L. Mager. 2004. Multiple groups of endogenous betaretroviruses in mice, rats, and other mammals. J Virol 78: 5784-5798.
Bannert, N. and R. Kurth. 2004. Retroelements and the human genome: new perspectives on an old relation. Proc Natl Acad Sci USA 101 Suppl 2: 14572-14579.
Bao, Z. and S.R. Eddy. 2002.
Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res 12: 1269-1276.
Barr, S.D., J. Leipzig, P. Shinn, J.R. Ecker, and F.D. Bushman. 2005. Integration targeting by avian sarcoma-leukosis virus and human immunodeficiency virus in the chicken genome. J Virol 79: 12035-12044.
Batzer, M.A. and P.L. Deininger. 2002. Alu repeats and human genomic diversity. Nat Rev Genet 3: 370-379.
Baust, C., G.J. Baillie, and D.L. Mager. 2002. Insertional polymorphisms of ETn retrotransposons include a disruption of the wiz gene in C57BL/6 mice. Mamm Genome 13: 423-428.
Baust, C., W. Seifarth, H. Germaier, R. Hehlmann, and C. Leib-Mosch. 2000. HERV-K-T47D-Related long terminal repeats mediate polyadenylation of cellular transcripts. Genomics 66: 98-103.
Bedell, J.A., I. Korf, and W. Gish. 2000. MaskerAid: a performance enhancement to RepeatMasker. Bioinformatics 16: 1040-1041.
Belshaw, R., A.L. Dawson, J. Woolven-Allen, J. Redding, A. Burt, and M. Tristem. 2005. Genomewide screening reveals high levels of insertional polymorphism in the human endogenous retrovirus family HERV-K(HML2): implications for present-day activity. J Virol 79: 12507-12514.
Bennett, E.A., L.E. Coleman, C. Tsui, W.S. Pittard, and S.E. Devine. 2004. Natural genetic variation caused by transposable elements in humans. Genetics 168: 933-951.
Benson, G. 1999. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 27: 573-580.
Biemont, C., A. Tsitrone, C. Vieira, and C. Hoogland. 1997. Transposable element distribution in Drosophila. Genetics 147: 1997-1999.
Blond, J.L., F. Beseme, L. Duret, O. Bouton, F. Bedin, H. Perron, B. Mandrand, and F. Mallet. 1999. Molecular characterization and placental expression of HERV-W, a new human endogenous retrovirus family. J Virol 73: 1175-1185.
Boeke, J.D. and J.P. Stoye. 1997.
Retrotransposons, Endogenous Retroviruses, and the Evolution of Retroelements. In Retroviruses (eds. J.M. Coffin, S.H. Hughes, and H.E. Varmus), pp. 343-436. Cold Spring Harbor Laboratory Press, Plainview, New York, USA.
Brady, H.J., J.C. Sowden, M. Edwards, N. Lowe, and P.H. Butterworth. 1989. Multiple GF-1 binding sites flank the erythroid specific transcription unit of the human carbonic anhydrase I gene. FEBS Lett 257: 451-456.
Brandt, V.L. and D.B. Roth. 2004. V(D)J recombination: how to tame a transposase. Immunol Rev 200: 249-260.
Britten, R.J. 1997. Mobile elements inserted in the distant past have taken on important functions. Gene 205: 177-182.
Britten, R.J., L. Rowen, J. Williams, and R.A. Cameron. 2003. Majority of divergence between closely related DNA samples is due to indels. Proc Natl Acad Sci USA 100: 4661-4665.
Brookfield, J.F. 2001. Selection on Alu sequences? Curr Biol 11: R900-901.
Brosius, J. 1999. Genomes were forged by massive bombardments with retroelements and retrosequences. Genetica 107: 209-238.
Bushman, F., M. Lewinski, A. Ciuffi, S. Barr, J. Leipzig, S. Hannenhalli, and C. Hoffmann. 2005. Genome-wide analysis of retroviral DNA integration. Nat Rev Microbiol 3: 848-858.
Bushman, F.D. 2003. Targeting survival: integration site selection by retroviruses and LTR-retrotransposons. Cell 115: 135-138.
Callinan, P.A., J. Wang, S.W. Herke, R.K. Garber, P. Liang, and M.A. Batzer. 2005. Alu retrotransposition-mediated deletion. J Mol Biol 348: 791-800.
Cameron, H.S., D. Szczepaniak, and B.W. Weston. 1995. Expression of human chromosome 19p alpha(1,3)-fucosyltransferase genes in normal tissues. Alternative splicing, polyadenylation, and isoforms. J Biol Chem 270: 20112-20122.
Carcedo, M.T., J.M. Iglesias, P. Bances, R.O. Morgan, and M.P. Fernandez. 2001.
Functional analysis of the human annexin A5 gene promoter: a downstream DNA element and an upstream long terminal repeat regulate transcription. Biochem J 356: 571-579.
Carlton, V.E., B.Z. Harris, E.G. Puffenberger, A.K. Batta, A.S. Knisely, D.L. Robinson, K.A. Strauss, B.L. Shneider, W.A. Lim, G. Salen et al. 2003. Complex inheritance of familial hypercholanemia with associated mutations in TJP2 and BAAT. Nat Genet 34: 91-96.
Carroll, M.L., A.M. Roy-Engel, S.V. Nguyen, A.H. Salem, E. Vogel, B. Vincent, J. Myers, Z. Ahmad, L. Nguyen, M. Sammarco et al. 2001. Large-scale analysis of the Alu Ya5 and Yb8 subfamilies and their contribution to human genomic diversity. J Mol Biol 311: 17-40.
Charlesworth, B. and D. Charlesworth. 1983. The population dynamics of transposable elements. Genet Res 42: 1-27.
Charlesworth, B. and C.H. Langley. 1991. Population genetics of transposable elements in Drosophila. In Evolution at the molecular level (eds. R.K. Selander, A.G. Clark, and T.S. Whittam), pp. 150-176. Sinauer Associates, Sunderland, MA, USA.
Charlesworth, B., C.H. Langley, and P.D. Sniegowski. 1997. Transposable element distributions in Drosophila. Genetics 147: 1993-1995.
Chen, J.M., P.D. Stenson, D.N. Cooper, and C. Ferec. 2005. A systematic analysis of LINE-1 endonuclease-dependent retrotranspositional events causing human genetic disease. Hum Genet 117: 411-427.
Ciuffi, A., M. Llano, E. Poeschla, C. Hoffmann, J. Leipzig, P. Shinn, J.R. Ecker, and F. Bushman. 2005. A role for LEDGF/p75 in targeting HIV DNA integration. Nat Med 11: 1287-1289.
Clamp, M., J. Cuff, S.M. Searle, and G.J. Barton. 2004. The Jalview Java alignment editor. Bioinformatics 20: 426-427.
Conley, M.E., J.D. Partain, S.M. Norland, S.A. Shurtleff, and H.H. Kazazian, Jr. 2005. Two independent retrotransposon insertions at the same site within the coding region of BTK. Hum Mutat 25: 324-325.
Cordonnier, A., J.F. Casella, and T. Heidmann. 1995. Isolation of novel human endogenous retrovirus-like elements with foamy virus-related pol sequence. J Virol 69: 5890-5897.
Costas, J. and H. Naveira. 2000. Evolutionary history of the human endogenous retrovirus family ERV9. Mol Biol Evol 17: 320-330.
Csoka, A.B., G.I. Frost, and R. Stern. 2001. The six hyaluronidase-like genes in the human and mouse genomes. Matrix Biol 20: 499-508.
Cutter, A.D., J.M. Good, C.T. Pappas, M.A. Saunders, D.M. Starrett, and T.J. Wheeler. 2005. Transposable element orientation bias in the Drosophila melanogaster genome. J Mol Evol 61: 733-741.
Deininger, P.L. and M.A. Batzer. 1999. Alu repeats and human disease. Mol Genet Metab 67: 183-193.
Deininger, P.L. and M.A. Batzer. 2002. Mammalian retroelements. Genome Res 12: 1455-1465.
Devos, K.M., J.K. Brown, and J.L. Bennetzen. 2002. Genome size reduction through illegitimate recombination counteracts genome expansion in Arabidopsis. Genome Res 12: 1075-1079.
Dewannieux, M., C. Esnault, and T. Heidmann. 2003. LINE-mediated retrotransposition of marked Alu sequences. Nat Genet 35: 41-48.
Dewannieux, M. and T. Heidmann. 2005. L1-mediated retrotransposition of murine B1 and B2 SINEs recapitulated in cultured cells. J Mol Biol 349: 241-247.
Di Cristofano, A., M. Strazullo, L. Longo, and G. La Mantia. 1995. Characterization and genomic mapping of the ZNF80 locus: expression of this zinc-finger gene is driven by a solitary LTR of ERV9 endogenous retroviral family. Nucleic Acids Res 23: 2823-2830.
Di Franco, C., A. Terrinoni, P. Dimitri, and N. Junakovic. 1997. Intragenomic distribution and stability of transposable elements in euchromatin and heterochromatin of Drosophila melanogaster: elements with inverted repeats Bari 1, hobo, and pogo. J Mol Evol 45: 247-252.
Doolittle, W.F. and C. Sapienza. 1980. Selfish genes, the phenotype paradigm and genome evolution.
Nature 284: 601-603.
Dunbar, C.E. 2005. Stem cell gene transfer: insights into integration and hematopoiesis from primate genetic marking studies. Ann N Y Acad Sci 1044: 178-182.
Dunn, C.A., P. Medstrand, and D.L. Mager. 2003. An endogenous retroviral long terminal repeat is the dominant promoter for human beta1,3-galactosyltransferase 5 in the colon. Proc Natl Acad Sci USA 100: 12841-12846.
Dunn, C.A., L.N. van de Lagemaat, G.J. Baillie, and D.L. Mager. 2005. Endogenous retrovirus long terminal repeats as ready-to-use mobile promoters: The case of primate beta3GAL-T5. Gene 364: 2-12.
Edwards, M.C. and R.A. Gibbs. 1992. A human dimorphism resulting from loss of an Alu. Genomics 14: 590-597.
Elliott, B., C. Richardson, and M. Jasin. 2005. Chromosomal translocation mechanisms at intronic alu elements in mammalian cells. Mol Cell 17: 885-894.
Esnault, C., J. Maestre, and T. Heidmann. 2000. Human LINE retrotransposons generate processed pseudogenes. Nat Genet 24: 363-367.
Friedrich, G. and P. Soriano. 1991. Promoter traps in embryonic stem cells: a genetic screen to identify and mutate developmental genes in mice. Genes Dev 5: 1513-1523.
Fullerton, S.M., A. Bernardo Carvalho, and A.G. Clark. 2001. Local rates of recombination are positively correlated with GC content in the human genome. Mol Biol Evol 18: 1139-1142.
Furano, A.V. 2000. The biological properties and evolutionary dynamics of mammalian LINE-1 retrotransposons. Prog Nucleic Acid Res Mol Biol 64: 255-294.
Gibbons, R., L.J. Dugaiczyk, T. Girke, B. Duistermars, R. Zielinski, and A. Dugaiczyk. 2004. Distinguishing humans from great apes with AluYb8 repeats. J Mol Biol 339: 721-729.
Gifford, R. and M. Tristem. 2003. The evolution, distribution and diversity of endogenous retroviruses. Virus Genes 26: 291-315.
Gilbert, N., S. Lutz-Prigge, and J.V. Moran. 2002. Genomic deletions created upon LINE-1 retrotransposition. Cell 110: 315-325.
Glazko, G.V.
and M . N ei. 2003. Estimation of divergence times for major lineages of primate species. Mol Biol Evol 20: 424-434. G oodchild, N .L., D A . W ilkinson, and D .L. Mager. 1993. Recent evolutionary expansion o f a subfamily of R T V L - H human endogenous retrovirus-like elements. Virology 196: 778-788. Goodman, M . , C A . Porter, J. Czelusniak, S .L. Page, H . Schneider, J. Shoshani, G. G unnell, and C P . Groves. 1998. Toward a phylogenetic classification of Primates based on D N A evidence complemented by fossil evidence. Mol Phylogenet Evol 9: 585-598. Graves, J .A. 1995. The o rigin and function of the mammalian Y chromosome and Y borne genes—an evolving understanding. Bioessays 17: 311-320. Greally, J .M. 2002. Short interspersed transposable elements ( SINEs) are excluded from imprinted regions in the human genome. Proc Natl Acad Sci USA 99: 327-332. Gregory, T .R. 2001. Coincidence, coevolution, or causation? D N A content, cell size, and the C-value enigma. Biol Rev Camb Philos Soc 76: 65-101. H acein-Bey-Abina, S., C. Von K alle, M . Schmidt, M .P. M cCormack, N . Wulffraat, P. L eboulch, A . L im, C .S. Osborne, R. Pawliuk, E. M orillon et a l. 2003. L M 0 2 associated clonal T c ell proliferation in two patients after gene therapy for S CIDX I . Science 302: 415-419. H amdi, H ., H . N ishio, R . Z ielinski, and A . Dugaiczyk. 1999. O rigin and phylogenetic distribution o f A l u D N A repeats: irreversible events i n the evolution of primates. J Mol Biol 289: 861-871. Hamdi, H .K., H . N ishio, J . Tavis, R. Z ielinski, and A . Dugaiczyk. 2000. Alu-mediated phylogenetic novelties in gene regulation and development. J Mol Biol 299: 931939. 130 Han, J.S. and J .D. Boeke. 2004. A highly active synthetic mammalian retrotransposon. Nature 429: 314-318. Han, J.S., S.T. Szak, and J .D. Boeke. 2004. Transcriptional disruption by the L I retrotransposon and implications for mammalian transcriptomes. Nature 429: 268-274. Harada, F., N . Tsukada, and N . Kato. 
1987. Isolation of three kinds of human endogenous retrovirus-like sequences using t RNA(Pro) as a probe. Nucleic Acids Res 15: 9153-9162. Hassoun, H ., T .L. Coetzer, J .N. Vassiliadis, K . E . Sahr, G .J. Maalouf, S.T. Saad, L . Catanzariti, and J. Palek. 1994. A novel mobile element inserted in the alpha spectrin gene: spectrin dayton. A truncated alpha spectrin associated with hereditary elliptocytosis. J Clin Invest 94: 643-648. Hedges, D .J., P A . C allinan, R . Cordaux, J. X ing, E . Barnes, and M A . Batzer. 2004. Differential alu mobilization and polymorphism among the human and chimpanzee lineages. Genome Res 14: 1068-1075. Helleday, T. 2003. Pathways for mitotic homologous recombination in mammalian cells. Mutat Res 532: 103-115. Higgins, D .G., J .D. Thompson, and T .J. Gibson. 1996. U sing C L U S T A L for multiple sequence alignments. Methods Enzymol 266: 383-402. Holman, A . G . and J .M. C offin. 2005. Symmetrical base preferences surrounding H IV-1, avian sarcoma/leukosis virus, and murine leukemia virus integration sites. Proc Natl Acad Sci USA 102: 6103-6107. Holmes, I. 2002. Transcendent elements: whole-genome transposon screens and open evolutionary questions. Genome Res 12: 1152-1155. Hurst, L .D. 2002. The K a/Ks ratio: diagnosing the form of sequence evolution. Trends Genet 18: 486. International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860-921. Jern, P., G .O. Sperber, G. A hlsen, and J. Blomberg. 2005. Sequence variability, gene structure, and expression of full-length human endogenous retrovirus H . J Virol 79: 6325-6337. Jern, P., G .O. Sperber, and J. Blomberg. 2004. Definition and variation of human endogenous retrovirus H . Virology 327: 93-110. Jiang, N . , Z . Bao, X . Zhang, S.R. Eddy, and S.R. Wessler. 2004. P ack-MULE transposable elements mediate gene evolution in plants. Nature 431: 569-573. Jiang, N . , Z . Bao, X . Zhang, H . H irochika, S.R. Eddy, S.R. 
M cCouch, and S.R. Wessler. 2003. A n active D N A transposon family i n rice. Nature 421: 163-167. Johanning, K ., C A . Stevenson, O.O. Oyeniran, Y . M . G ozal, A . M . R oy-Engel, J . Jurka, and P .L. Deininger. 2003. Potential for retroposition by old A l u subfamilies. J Mol Evol 56: 658-664. Johnson, R .D. and M . Jasin. 2000. Sister chromatid gene conversion is a prominent double-strand break repair pathway in mammalian cells. Embo J19: 3398-3407. Jordan, I .K., L B . R ogozin, G .V. G lazko, and E .V. K oonin. 2003. O rigin o f a substantial fraction o f human regulatory sequences from transposable elements. Trends Genet 19: 68-72. Junakovic, N . , A . Terrinoni, C. D i Franco, C. V ieira, and C. Loevenbruck. 1998. A ccumulation o f transposable elements i n the heterochromatin and on the Y chromosome of Drosophila simulans and Drosophila melanogaster. J Mol Evol 131 4 6: 6 61-668. J urka, J . 1997. Sequence patterns i ndicate an enzymatic involvement in integration o f m ammalian r etroposons. Proc Natl Acad Sci U S A 9 4: 1872-1877. J urka, J . 2000. Repbase update: a database a nd an electronic journal o f repetitive e lements. Trends Genet 1 6: 418-420. J urka, J . 2004. Evolutionary impact o f human A l u r epetitive elements. Curr Opin Genet Dev 1 4: 603-608. K amat, A . , M . M . H inshelwood, B . A . M urry, and C .R. M endelson. 2002. Mechanisms in t issue-specific r egulation o f estrogen biosynthesis in humans. Trends Endocrinol Metab 1 3: 122-128. K aplan, N . L . a nd J.F. B rookfield. 1 983. The effect o f homozygosity o f selective d ifferences between sites o f transposable elements. Theor Popul Biol 2 3: 273280. K arolchik, D . , R. Baertsch, M . Diekhans, T.S. Furey, A . H inrichs, Y . T . L u , K . M . R oskin, M . S chwartz, C W . Sugnet, D . J . T homas et al. 2003. The U C S C G enome Browser D atabase. Nucleic Acids Res 3 1: 51-54. K arran, P . 2000. D N A d ouble strand break repair in mammalian c ells. 
Curr Opin Genet £ > e v l 0 : 1 44-150. K ashkush, K . , M . F eldman, and A . A . L evy. 2 003. Transcriptional activation o f r etrotransposons alters the expression o f adjacent genes i n wheat. Nat Genet 3 3: 1 02-106. K azazian, H . H . , J r. 2004. M o b i l e e lements: drivers o f genome evolution. Science 3 03: 1 626-1632. K azazian, H . H . , J r., C . W ong, H . Youssoufian, A . F . S cott, D . G . P hillips, a nd S.E. A ntonarakis. 1 988. Haemophilia A resulting from de novo insertion o f L I sequences represents a n ovel mechanism for mutation in man. Nature 3 32: 164166. K ent, W . J . 2002. B LAT—the B L A S T - l i k e a lignment t ool. Genome Res 1 2: 656-664. K ent, W .J., C W . Sugnet, T.S. Furey, K . M . R oskin, T . H . P ringle, A . M . Z ahler, and D . H aussler. 2 002. The human genome browser at U C S C . Genome Res 1 2: 9961006. K i d w e l l , M . G . 2 002. Transposable elements and the evolution o f genome size in e ukaryotes. Genetica 1 15: 49-63. K i d w e l l , M . G . a nd D . L isch. 1 997. Transposable elements as sources o f variation in a nimals and plants. Proc Natl Acad Sci USA 94: 7 704-7711. K i d w e l l , M . G . a nd D . R . L isch. 2 001. Perspective: transposable elements, parasitic D N A , a nd genome evolution. Evolution IntJ Org Evolution 5 5: 1-24. K i e m , H . P . , S. Sellers, B . Thomasson, J . C M orris, J .F. Tisdale, P . A . H orn, P . Hematti, R . A dler, K . Kuramoto, B . Calmels et al. 2004. Long-term c linical a nd molecular f ollow-up o f large animals receiving retrovirally transduced stem and progenitor c ells: n o progression to c lonal h ematopoiesis or leukemia. Mol Ther 9 : 389-395. K i m , H .S., O . Takenaka, and T .J. C row. 1999. C loning a nd nucleotide sequence o f r etroposons specific to hominoid primates derived from an endogenous retrovirus ( H E R V - K ) . AIDS Res Hum Retroviruses 1 5: 595-601. K imberland, M . L . , V . D i v o k y , J . P rchal, U . Schwann, W . Berger, and H . H . 
K azazian, J r. 1999. F ull-length human L I insertions retain the capacity for high frequency r etrotransposition i n cultured c ells. Hum Mol Genet 8: 1557-1560. K irkness, E .F., V . Bafna, A . L . H alpern, S. L evy, K . Remington, D . B . R usch, A . L . 132 Delcher, M . Pop, W. Wang, C M . Fraser et al. 2003. The dog genome: survey sequencing and comparative analysis. Science 301: 1898-1903. Kjellman, C , H .O. Sjogren, and B . Widegren. 1995. The Y chromosome: a graveyard for endogenous retroviruses. Gene 161: 163-170. Lahn, B .T., N . M . Pearson, and K . Jegalian. 2001. The human Y chromosome, in the light o f evolution. Nat Rev Genet 2: 207-216. Landry, J.R., P. Medstrand, and D .L. Mager. 2001. Repetitive elements i n the 5' untranslated region of a human zinc-finger gene modulate transcription and translation efficiency. Genomics 76: 110-116. Landry, J .R., A . R ouhi, P . Medstrand, and D .L. Mager. 2002. The Opitz syndrome gene M i d i is transcribed from a human endogenous retroviral promoter. Mol Biol Evol 19: 1934-1942. Langley, C .H., E . Montgomery, R. Hudson, N . K aplan, and B . Charlesworth. 1988. On the role of unequal exchange in the containment of transposable element copy number. Genet Res 52: 223-235. L avie, L ., M . K itova, E . Maldener, E. Meese, and J. Mayer. 2005. C pG methylation directly regulates transcriptional activity of the human endogenous retrovirus family H E R V - K ( H M L - 2 ) . J Virol 79: 876-883. Leach, D .R. 1994. L ong D N A palindromes, cruciform structures, genetic instability and secondary structure repair. Bioessays 16: 893-900. L eib-Mosch, C , W. Seifarth, and U . Schon. 2005. Influence of Human Endogenous Retroviruses on Cellular Gene Expression. In Retroviruses and Primate Genome Evolution (ed. E .D. Sverdlov), pp. 123-143. Landes Bioscience, Georgetown, Texas, U SA. L ev-Maor, G ., R . Sorek, N . Shomron, and G . A st. 2003. 
The birth of an alternatively spliced exon: 3' splice-site selection in A lu exons. Science 300: 1288-1291. L ewinski, M . K . and F .D. Bushman. 2005. Retroviral D N A integration—mechanism and consequences. Adv Genet 55: 147-181. L i , W ., P. Zhang, J.P. Fellers, B . Friebe, and B .S. G ill. 2004. Sequence composition, organization, and evolution of the core Triticeae genome. Plant J40: 500-511. L i , W .H. and D. Graur. 1991. Fundamentals of Molecular Evolution. Sinauer Associates, Sunderland, M A , USA. L iu, G ., S. Zhao, J .A. B ailey, S . C Sahinalp, C. A lkan, E . Tuzun, E .D. Green, and E .E. Eichler. 2003. A nalysis o f primate genomic variation reveals a repeat-driven expansion of the human genome. Genome Res 13: 358-368. L iu, H ., H . H an, J. L i , and L . W ong. 2005. D NAFSMiner: a web-based software toolbox to recognize two types o f functional sites i n D N A sequences. Bioinformatics 21: 671-673. Lobachev, K .S., J.E. Stenger, O .G. K ozyreva, J . Jurka, D .A. Gordenin, and M . A . Resnick. 2000. Inverted A lu repeats unstable i n yeast are excluded from the human genome. EmboJ 19: 3822-3830. L orincz, M . C , D.R. D ickerson, M . Schmitt, and M . Groudine. 2004. Intragenic D N A methylation alters chromatin structure and elongation efficiency in mammalian cells. Nat Struct Mol Biol 11: 1068-1075. M a, J. and J .L. Bennetzen. 2004. R apid recent growth and divergence of rice nuclear genomes. Proc Natl Acad Sci USA 101: 12404-12410. Mager, D .L., D.G. Hunter, M . Schertzer, and J .D. Freeman. 1999. Endogenous retroviruses provide the primary polyadenylation signal for two new human genes 133 ( HHLA2 and H HLA3). Genomics 59: 255-263. Mager, D .L. and P. Medstrand. 2003. Retroviral repeat sequences. In Nature Encyclopedia of the Human Genome, Volume 5, pp. 57-63. M acmillan Publishers L td., L ondon, U . K . Majors, J . 1990. The structure and function of retroviral long terminal repeats. Curr Top Microbiol Immunol 157: 49-92. M akalowski, W . 
2000. Genomic scrap yard: how genomes utilize all that junk. Gene 259: 61-67. Maksakova, I .A., M .T. Romanish, L . Gagnier, C A . D unn, L . N . van de Lagemaat, and D .L. Mager. 2006. Retroviral elements and their hosts: insertional mutagenesis in the mouse germ line. PLoS Genetics 2: e2. M artin, A . M . , J .K. K ulski, C . Witt, P . Pontarotti, and F .T. Christiansen. 2002. Leukocyte Ig-like receptor complex ( LRC) i n mice and men. Trends Immunol 23: 81-88. M cClintock, B . 1950. The origin and behavior of mutable l oci i n maize. Proc Natl Acad Sci USA 36: 344-355. M cClintock, B . 1956. C ontrolling elements and the gene. Cold Spring Harb Symp Quant Biol 21: 197-216. M cDonald, J.F. 1995. Transposable elements: possible catalysts of organismic evolution. Trends Ecol Evol 10: 123-126. M cGinnis, S. and T .L. Madden. 2004. B L A S T : at the core of a powerful and diverse set o f sequence analysis tools. Nucleic Acids Res 32: W 20-25. Medstrand, P., J.R. Landry, and D .L. Mager. 2001. L ong terminal repeats are used as alternative promoters for the endothelin B receptor and apolipoprotein C-I genes i n humans. J Biol Chem 276: 1896-1903. Medstrand, P. and D .L. Mager. 1998. Human-specific integrations of the H E R V - K endogenous retrovirus family. J Virol 72: 9782-9787. Medstrand, P., L . N . van de Lagemaat, C A . Dunn, J.R. Landry, D. Svenback, and D .L. Mager. 2005. Impact of transposable elements on the evolution of mammalian gene regulation. Cytogenet Genome Res 110: 342-352. Medstrand, P., L . N . van de Lagemaat, and D .L. Mager. 2002. Retroelement distributions i n the human genome: variations associated with age and proximity to genes. Genome Res 12: 1483-1495. Meunier, J ., A . K helifi, V . N avratil, and L . Duret. 2005. Homology-dependent methylation in primate repetitive D N A . Proc Natl Acad Sci USA 102: 54715476. M i , S., X . Lee, X . L i , G . M . V eldman, H . Finnerty, L. Racie, E. L aVallie, X . Y . Tang, P. Edouard, S. Howes et a l. 
2000. Syncytin is a captive retroviral envelope protein involved i n human placental morphogenesis. Nature 403: 785-789. M iskey, C , Z. Izsvak, K . K awakami, and Z . Ivies. 2005. D N A transposons i n vertebrate functional genomics. Cell Mol Life Sci 62: 629-641. M itchell, R .S., B .F. B eitzel, A .R. Schroder, P. Shinn, H . Chen, C .C. Berry, J.R. Ecker, and F .D. Bushman. 2004. Retroviral D N A integration: A S L V , H IV, and M L V show distinct target site preferences. PLoS Biol 2: E 234. Morgan, H .D., H .G. Sutherland, D .I. M artin, and E. Whitelaw. 1999. Epigenetic inheritance at the agouti locus in the mouse. Nat Genet 23: 314-318. Mouse Genome Sequencing Consortium. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520-562. Moynahan, M . E . and M . Jasin. 1997. Loss of heterozygosity induced by a chromosomal 134 double-strand break. Proc Natl Acad Sci USA9A: 8988-8993. Muratani, K ., T. Hada, Y . Yamamoto, T. Kaneko, Y . Shigeto, T. Ohue, J. Furuyama, and K . Higashino. 1991. Inactivation o f the cholinesterase gene by A lu insertion: possible mechanism for human gene transposition. Proc Natl Acad Sci USA 88: 11315-11319. Murnane, J.P. and J.F. Morales. 1995. Use of a mammalian interspersed repetitive ( MIR) element in the coding and processing sequences o f mammalian genes. Nucleic Acids Res 23: 2837-2839. N ag, D .K., M . Fasullo, Z. Dong, and A . Tronnes. 2005. Inverted repeat-stimulated sisterchromatid exchange events are R A D 1-independent but reduced in a msh2 mutant. Nucleic Acids Res 33: 5243-5249. Nakaya, S .M., T .C. H su, S.J. Geraghty, M .J. Manco-Johnson, and A .R. Thompson. 2004. Severe hemophilia A due to a 1.3 kb factor V III gene deletion including exon 24: homologous recombination between 41 bp w ithin an A l u repeat sequence i n introns 23 and 24. J Thromb Haemost 2: 1941-1945. Nekrutenko, A . and W .H. L i . 2001. Transposable elements are found in a large number of human protein-coding genes. 
Trends Genet 17: 619-621. Nigumann, P ., K . Redik, K . M atlik, and M . Speek. 2002. M any human genes are transcribed from the antisense promoter of L I retrotransposon. Genomics 79: 628-634. Ono, M . , M . K awakami, and T. Takezawa. 1987. A novel human nonviral retroposon derived from an endogenous retrovirus. Nucleic Acids Res 15: 8725-8737. Orgel, L .E. and F .H. C rick. 1980. Selfish D N A : the ultimate parasite. Nature 284: 604607. Ostertag, E . M . , J .L. Goodier, Y . Zhang, and H .H. K azazian, Jr. 2003. S V A elements are nonautonomous retrotransposons that cause disease i n humans. Am J Hum Genet 73: 1444-1451. Ostertag, E . M . and H .H. K azazian, Jr. 2001. B iology o f mammalian L I retrotransposons. Annu Rev Genet 35: 501-538. Panet, A . and H . Cedar. 1977. Selective degradation of integrated murine leukemia proviral D N A by deoxyribonucleases. Cell 11: 933-940. Pavlicek, A ., K . Jabbari, J. Paces, V . Paces, J .V. Hejnar, and G . Bernardi. 2001. Similar integration but different stability of A lus and L INEs i n the human genome. Gene 276: 39-45. Perepelitsa-Belancio, V . and P. Deininger. 2003. R N A truncation by premature polyadenylation attenuates human mobile element activity. Nat Genet 35: 363366. Perna, N .T., M . A . Batzer, P .L. Deininger, and M . Stoneking. 1992. A l u insertion polymorphism: a new type of marker for human population studies. Hum Biol 64: 641-648. Pinsker, W ., E. Haring, S. Hagemann, and W .J. M iller. 2001. The evolutionary life history of P transposons: from horizontal invaders to domesticated neogenes. Chromosoma 110: 148-158. Price, A . L . , N . C . Jones, and P .A. Pevzner. 2005. De novo identification of repeat families i n large genomes. Bioinformatics 21 Suppl 1: i351-i358. Prudden, J., J.S. Evans, S.P. Hussey, B. Deans, P. O 'Neill, J . Thacker, and T. Humphrey. 2003. Pathway utilization i n response to a site-specific D N A double-strand break i n fission yeast. EmboJll: 1419-1430. 135 Rabson, A . B . 
and B .J. Graves. 1997. Synthesis and Processing of V iral R N A . In Retroviruses (eds. J .M. Coffin S .H. Hughes, and H .E. Varmus), pp. 205-261. C old Spring Harbor Laboratory Press, Plainview, N ew Y ork, U SA. Rat Genome Sequencing Project Consortium. 2004. Genome sequence o f the B rown N orway rat yields insights into mammalian evolution. Nature 428: 493-521. Renard, M . , P .F. V arela, C . Letzelter, S. Duquerroy, F A . R ey, and T. Heidmann. 2005. Crystal structure o f a pivotal domain of human syncytin-2, a 40 m illion years old endogenous retrovirus fusogenic envelope gene captured by primates. J Mol Biol 352: 1029-1034. Rosenberg, N . and P. Jolicoeur. 1997. Retroviral Pathogenesis. In Retroviruses (eds. J .M. Coffin S .H. Hughes, and H .E. Varmus), pp. 475-586. C old Spring Harbor Laboratory Press, P lainview, N ew Y ork, U SA. R oy-Engel, A . M . , M . L . C arroll, E . V ogel, R .K. Garber, S .V. Nguyen, A . H . Salem, M A . Batzer, and P .L. Deininger. 2001. A lu insertion polymorphisms for the study of human genomic diversity. Genetics 159: 279-290. Salem, A . H . , G .E. K ilroy, W .S. Watkins, L .B. Jorde, and M A . Batzer. 2003a. Recently integrated A lu elements and human genomic diversity. Mol Biol Evol 20: 13491361. Salem, A . H . , D A . R ay, J. X ing, P A . C allinan, J.S. M yers, D .J. Hedges, R .K. Garber, D .J. Witherspoon, L . B . Jorde, and M A . Batzer. 2003b. A l u elements and hominid phylogenetics. Proc Natl Acad Sci USA 100: 12787-12791. Sankaranarayanan, K . and J.S. Wassom. 2005. Ionizing radiation and genetic risks X IV. Potential research directions in the post-genome era based on knowledge of repair o f radiation-induced D N A double-strand breaks in mammalian somatic cells and the o rigin o f deletions associated with human genomic disorders. Mutat Res 578: 333-370. S anMiguel, P ., B .S. Gaut, A . Tikhonov, Y . Nakajima, and J .L. Bennetzen. 1998. The paleontology of intergene retrotransposons o f maize. 
Nat Genet 20: 43-45. Schatz, D .G. 2004. Antigen receptor genes and the evolution of a recombinase. Semin Immunol 16: 245-256. Schmid, C W . 1998. Does S INE evolution preclude A lu function? Nucleic Acids Res 26: 4541-4550. Schroder, A .R., P . Shinn, H . Chen, C. Berry, J.R. Ecker, and F. Bushman. 2002. H IV-1 integration in the human genome favors active genes and local hotspots. Cell 110: 521-529. Sheen, F . M . , S.T. Sherry, G . M . R isch, M . Robichaux, I. Nasidze, M . Stoneking, M . A . Batzer, and G .D. Swergold. 2000. Reading between the L INEs: human genomic variation induced by L INE-1 retrotransposition. Genome Res 10: 1496-1508. Shen, L ., L . C W u, S. Sanlioglu, R . Chen, A .R. Mendoza, A . W . Dangel, M . C C arroll, W .B. Z ipf, and C .Y. Y u. 1994. Structure and genetics of the partially duplicated gene R P located immediately upstream of the complement C 4A and the C4B genes i n the H L A class III region. Molecular cloning, exon-intron structure, composite retroposon, and breakpoint of gene duplication. J Biol Chem 269: 8466-8476. Shen, M .R., M . A . Batzer, and P .L. Deininger. 1991. E volution o f the master A l u gene(s). J Mol Evol 33: 311-320. Sironi, M . , G . M enozzi, G .P. C omi, N . Bresolin, R . C agliani, and U . P ozzoli. 2005a. F ixation o f conserved sequences shapes human intron size and influences 136 transposon-insertion dynamics. Trends Genet 21: 484-488. Sironi, M . , G . M enozzi, G .P. C omi, R . C agliani, N . B resolin, and U . P ozzoli. 2005b. A nalysis o f intronic conserved elements indicates that functional complexity might represent a major source of negative selection on non-coding sequences. Hum Mol Genet 14: 2533-2546. Smit, A .F. 1993. Identification of a new, abundant superfamily of mammalian L TRtransposons. Nucleic Acids Res 21: 1863-1872. Smit, A .F. 1999. Interspersed repeats and other mementos o f transposable elements i n mammalian genomes. Curr Opin Genet Dev 9: 657-663. Smit, A .F. and A . D . R iggs. 
1995. M IRs are classic, t RNA-derived S INEs that amplified before the mammalian radiation. Nucleic Acids Res 23: 98-102. Smit, A .F., G . Toth, A . D . Riggs, and J. Jurka. 1995. Ancestral, mammalian-wide subfamilies of L INE-1 repetitive sequences. J Mol Biol 246: 401-417. Sorek, R., G. Ast, and D. Graur. 2002. Alu-containing exons are alternatively spliced. Genome Res 12: 1060-1067. Sorek, R., G. L ev-Maor, M . R eznik, T. Dagan, F. B elinky, D . Graur, and G. Ast. 2004. M inimal conditions for exonization of intronic sequences: 5' splice site formation i n alu exons. Mol Cell 14: 221-231. Stenger, J .E., K .S. Lobachev, D. Gordenin, T A . Darden, J. Jurka, and M . A . Resnick. 2001. Biased distribution of inverted and direct A lus i n the human genome: implications for insertion, exclusion, and genome stability. Genome Res 11: 1227. Sverdlov, E .D. 1998. Perpetually mobile footprints of ancient infections in human genome. FEBS Lett 428: 1-6. Sverdlov, E .D. 2000. Retroviruses and primate evolution. Bioessays 22: 161-171. Swergold, G .D. 1990. Identification, characterization, and c ell specificity of a human L INE-1 promoter. Mol Cell Biol 10: 6718-6729. Symer, D .E., C . Connelly, S.T. Szak, E . M . Caputo, G .J. Cost, G. Parmigiani, and J.D. Boeke. 2002. Human 11 retrotransposition is associated with genetic instability in v ivo. Cell 110: 327-338. Taruscio, D ., G . F loridia, G .K. Zoraqi, A . Mantovani, and V . F albo. 2002. Organization and integration sites i n the human genome o f endogenous retroviral sequences belonging to H E R V - E family. Mamm Genome 13: 216-222. Tchenio, T ., J.F. Casella, and T. Heidmann. 2000. Members o f the S R Y family regulate the human L INE retrotransposons. Nucleic Acids Res 28: 411-415. Temin, H . M . 1982. Function of the retrovirus long terminal repeat. Cell 28: 3-5. Thompson, L . H . and D. S child. 2002. Recombinational D N A repair and human disease. MutatRes 509: 49-78. T ing, C .N., M .P. Rosenberg, C M . 
Snow, L . C Samuelson, and M . H . M eisler. 1992. Endogenous retroviral sequences are required for tissue-specific expression of a human salivary amylase gene. Genes Dev 6: 1457-1465. Torti, C , L . M . G omulski, D . M oralli, E . Raimondi, H . M . Robertson, P. Capy, G. Gasperi, and A .R. M alacrida. 2000. E volution o f different subfamilies of mariner elements w ithin the medfly genome inferred from abundance and chromosomal distribution. Chromosoma 108: 523-532. Tournier, I., B .B. Paillerets, H . Sobol, D. Stoppa-Lyonnet, R. Lidereau, M . Barrois, S. Mazoyer, F . Coulet, A . Hardouin, A . Chompret et al. 2004. Significant contribution of germline B R C A 2 rearrangements i n male breast cancer families. 137 Cancer Res 64: 8143-8147. Tristem, M . 2000. Identification and characterization of novel human endogenous retrovirus families by phylogenetic screening of the human genome mapping project database. J Virol 74: 3715-3730. Turner, G ., M . Barbulescu, M . Su, M .I. Jensen-Seaman, K . K . K idd, and J. L enz. 2001. Insertional polymorphisms of full-length endogenous retroviruses in humans. Curr Biol 11: 1531-1535. U llu, E . and C. Tschudi. 1984. A l u sequences are processed 7SL R N A genes. Nature 312: 171-172. U rwin, D . and R .A. L ake. 2000. Structure o f the M esothelin/MPF gene and characterization of its promoter. Mol Cell Biol Res Commun 3: 26-32. van de Lagemaat, L . N . , J .R. Landry, D .L. Mager, and P. Medstrand. 2003. Transposable elements i n mammals promote regulatory variation and diversification of genes w ith specialized functions. Trends Genet 19: 530-536. Venter, J .C., M . D . Adams, E .W. M yers, P .W. L i , R .J. M ural, G .G. Sutton, H .O. Smith, M . Y andell, C .A. Evans, R .A. H olt et a l. 2001. The sequence o f the human genome. Science 291: 1304-1351. von Melchner, H ., J .V. D eGregori, H . Rayburn, S. Reddy, C. Friedel, and H .E. Ruley. 1992. Selective disruption of genes expressed in totipotent embryonal stem cells. 
Genes Dev 6: 919-927. Walker, J.R., R .A. C orpina, and J. Goldberg. 2001. Structure o f the K u heterodimer bound to D N A and its implications for double-strand break repair. Nature 412: 607-614. Wang, H ., J . X ing, D . Grover, D .J. Hedges, K . H an, J .A. Walker, and M . A . Batzer. 2005. S V A Elements: A Hominid-specific Retroposon F amily. J Mol Biol 354: 9941007. Ward, B .D., B .C. Hendrickson, T. Judkins, A . M . Deffenbaugh, B . L eclair, B .E. Ward, and T. S choll. 2005. A multi-exonic B RCA1 deletion identified in multiple families through single nucleotide polymorphism haplotype pair analysis and gene amplification with widely dispersed primer sets. J Mol Diagn 7: 139-142. Watanabe, H ., A . Fujiyama, M . Hattori, T .D. Taylor, A . Toyoda, Y . K uroki, H . N oguchi, A . B enKahla, H . Lehrach, R. Sudbrak et a l. 2004. D N A sequence and comparative analysis of chimpanzee chromosome 22. Nature 429: 382-388. W ei, W ., N . Gilbert, S .L. O oi, J .F. Lawler, E . M . Ostertag, H .H. K azazian, J .D. Boeke, and J .V. M oran. 2001. Human L I retrotransposition: cis preference versus trans complementation. Mol Cell Biol 21: 1429-1439. Whitelaw, E . and D .I. M artin. 2001. Retrotransposons as epigenetic mediators of phenotypic variation in mammals. Nat Genet 27: 361-365. W ilkinson, D .A., D .L. Mager, and J .C. Leong. 1994. Endogenous Human Retroviruses. In The Retroviridae (ed. J .A. L evy), pp. 465-535. Plenum Press, New Y ork, N Y . W u, X . , Y . L i , B . C rise, and S .M. Burgess. 2003. Transcription start regions in the human genome are favored targets for M L V integration. Science 300: 1749-1751. Y ang, N . , L . Zhang, Y . Zhang, and H .H. K azazian, Jr. 2003. A n important role for R U N X 3 i n human L I transcription and retrotransposition. Nucleic Acids Res 31: 4929-4940. Yoder, J .A., C P . Walsh, and T .H. Bestor. 1997. Cytosine methylation and the ecology of intragenomic parasites. Trends Genet 13: 335-340. Y ohn, C .T., Z . Jiang, S.D. 
M cGrath, K . E . Hayden, P. K haitovich, M . E . Johnson, M . Y . 138 Eichler, J.D. McPherson, S. Zhao, S. Paabo et al. 2005. Lineage-specific expansions of retroviral insertions within the genomes of African great apes but not humans and orangutans. PLoS Biol 3: e l 10. Y u, J., S. Hu, J. Wang, G .K. Wong, S. L i, B . Liu, Y. Deng, L . Dai, Y. Zhou, X . Zhang et al. 2002. A draft sequence of the rice genome (Oryza sativa L . ssp. indica). Science 2 96: 79-92. Y u, X ., X. Zhu, W. Pi, J. Ling, L . Ko, Y. Takeda, and D. Tuan. 2005. The long terminal repeat (LTR) of E RV-9 human endogenous retrovirus binds to N F-Y in the assembly of an active L TR enhancer complex N F-Y/MZF1/GATA-2. J Biol Chem 2 80: 35184-35194. Zhang, Y ., N . Zeleznik-Le, N . Emmanuel, N . Jayathilaka, J. Chen, P. Strissel, R. Strick, L . L i, M.B. Neilly, T. Taki et al. 2004. Characterization of genomic breakpoints in M L L and C BP in leukemia patients with t (l 1;16). Genes Chromosomes Cancer 41: 257-265. Zhu, Z.B., S.L. Hsieh, D.R. Bentley, R.D. Campbell, and J.E. Volanakis. 1992. A variable number of tandem repeats locus within the human complement C2 gene is associated with a retroposon derived from a human endogenous retrovirus. J Exp Med 175: 1783-1787. 139 Appendix A 140 H uman-chimpanzee i n d e l l o c i a s s a y e d i n R hesus monkey t r a c e archive. R ecord s t r u c t u r e : 1: H eader, s howing human c hromosome: chromosome s t a r t p o s i t i o n i n J u l y 2003 0CSC genome b rowser: c hromosome e nd p o s i t i o n ( i f n e c e s s a r y ) : c himpanzee s c a f f o l d name : s c a f f o l d s t a r t p o s i t i o n : s c a f f o l d e nd p o s i t i o n ( i f n e c e s s a r y ) : o r i e n t a t i o n o f s caffold/chromosome alignment. 2: s h o r t d e s c r i p t i o n o f c ase 3: M u l t i p l e s equence a l i g n m e n t o f human, c himpanzee, a nd R hesus s equences. 
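The colon-delimited header described in item 1 could be read programmatically along the following lines. This is only a minimal sketch in Python and is not part of the original analysis: the names `IndelHeader` and `parse_indel_header` are hypothetical, and the sketch assumes that the scaffold name is the first non-numeric field after the chromosome, that optional end positions are simply omitted, and that any dots trailing the header are separators rather than data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndelHeader:
    """One record header (hypothetical container, field names are illustrative)."""
    chromosome: str
    chrom_start: int
    chrom_end: Optional[int]
    scaffold: str
    scaffold_start: int
    scaffold_end: Optional[int]
    orientation: Optional[str]


def parse_indel_header(header: str) -> IndelHeader:
    """Split a header such as 'chr1:100:200:scaffold_1:300:+' into its fields."""
    # Assumption: trailing dots and spaces are separators, not part of the header.
    cleaned = header.strip().rstrip(". ")
    fields = [f.strip() for f in cleaned.split(":")]

    # Assumption: the scaffold name is the first non-numeric field after the chromosome.
    scaffold_idx = next(
        (i for i, f in enumerate(fields) if i > 0 and f and not f.isdigit()),
        None,
    )
    if scaffold_idx is None:
        raise ValueError(f"no scaffold name found in header: {header!r}")

    # One or two positions precede the scaffold name (start, optional end).
    chrom_pos = [int(f) for f in fields[1:scaffold_idx]]
    tail = fields[scaffold_idx + 1:]
    # One or two positions follow it, then an optional orientation field.
    scaf_pos = [int(f) for f in tail if f.isdigit()]
    others = [f for f in tail if f and not f.isdigit()]

    if not 1 <= len(chrom_pos) <= 2 or not 1 <= len(scaf_pos) <= 2:
        raise ValueError(f"unexpected number of positions in header: {header!r}")

    return IndelHeader(
        chromosome=fields[0],
        chrom_start=chrom_pos[0],
        chrom_end=chrom_pos[1] if len(chrom_pos) == 2 else None,
        scaffold=fields[scaffold_idx],
        scaffold_start=scaf_pos[0],
        scaffold_end=scaf_pos[1] if len(scaf_pos) == 2 else None,
        orientation=others[-1] if others else None,
    )
```

Under these assumptions, a header such as chr1:144298621:144298777:scaffold_37562:1315854:+ would be read as a human interval on chr1 with both start and end positions, a single scaffold start position with no end, and a + orientation. Missing optional fields are returned as `None` rather than guessed.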
Underlining highlights identical segments flanking the indels.

chr1:144298621:144298777:scaffold_37562:1315854:+...
imprecise deletion of AluSx fragment in chimpanzee
CLUSTAL

case1h/1-256              GAAGAAATTTGGAAGAATTGCCACATGTGGAGCTATCTCTATATATAACATATACATATC
gnl|ti|518618788/314-567  GAAGAAATTTGGAAGAATTGCCATATATAGAGCTATCTCTGCATATAACATATACATATC
case1c/1-101              GAAGAAATATGGAAGGATTGCCATATGTGGAGCCATCTCTACTTATAACAGA

case1h/1-256              TACATATAACACCTGTAATCCCAGTACTTTGGGAGAATGAGGCAGGTGGATCACCTGAGG
gnl|ti|518618788/314-567  TACATATAATGCCTGTAATCCCAACACTTTGGGAGGCTGAGGCAGGTGGATCACCTGAGG
case1c/1-101

case1h/1-256              TCAGGAGTTTGAGACCAGCCTGGCCAACATGGTGAAACCCCCGTCTCTACCAAAAATACA
gnl|ti|518618788/314-567  TCAGGAGTTTGAGACCGGCCTGGCCAAGATGGTGAAACCCCCTTCTGTACTAAAAATACA
case1c/1-101

case1h/1-256              AAAATTAGCCGGGTGTGGTGGTACCAGCCCACTTCCCCCAGGCCCAATCCCAGAGATTGT
gnl|ti|518618788/314-567  AAAGTGAGCCAGGCATGGTGGCACCAGCCCACTTCCCCCAGGCCCAACCCCAGAGATTGT
case1c/1-101              ACTGGCCCACCTCCCCCAGGCCCACCCCCAGATATTAT

case1h/1-256              TAGATGTATCAGGAGC
gnl|ti|518618788/314-567  TA--TCTATCAGGAGC
case1c/1-101              CTATAAGGAGC

chr2:50683419:50683739:scaffold_37688:19585334:...
AluY in Rhesus, gene conversion to AluYa5 in human, precise deletion in chimpanzee
CLUSTAL

case2h/1-482              TGTAAGTTTCAGAATCATACTTTAAAAAA ATCTTTTTGGGGGCAGTTTTGTTTC
gnl|ti|508398655/154-589  TGTAAGCTGCAGAATCATACTTTAAAAAAAAAAAAAAATTTTGGGGGGCAGTTTTGTTTC
case2c/1-163              TGTAAGTTTCAGAATCATACTTTTAAAAAA ATCTTTTTTGGGGCAGTTTTGTTTC

case2h/1-482              AAAATTTAAATAAGAAACAAAAGTTCGGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAG
gnl|ti|508398655/154-589  AAAATTTAAATAAGAAACAAAAGTTC ATCACGCCTGTAATCCCAG
case2c/1-163              AAAATTTAAATAAGAAACAAAAGTTC

case2h/1-482              CACTTTGGGAGGCCGAGGCGGGCGGATCACGAGGTCAGGAGATCGAGACCATCCCGGCTA
gnl|ti|508398655/154-589  CACTTTGGGAGGCCGAGCCGGGCGGATCATGAGGTCAAGAAATCAAGACCATCCTGGCTA
case2c/1-163

case2h/1-482              AAACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAAAAATTAGCCGGGCGTAGTGGCG
gnl|ti|508398655/154-589  ACATGGTGAAACCCTGTCTCTACTGAAAACACAAAAAA TTAGCTAGGCGTGGTGGCG
case2c/1-163

case2h/1-482              GGCGCCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATAGCGTGAACCCGGGAG
gnl|ti|508398655/154-589  GGCGCCTGTA CTCGGGAGACTGAGGCAGGAGAATGGCGTGAACCCGAGAG
case2c/1-163

case2h/1-482              GCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACTCCAGCCTGGGCGACAGAGCGAG
gnl|ti|508398655/154-589  GCGGAGCTTGCAGTGAGCAGTGATTGTGCCACTGCACTCCATCCTGGGTGACAGTGCAAG
case2c/1-163

case2h/1-482              ACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAGAAACAAAAGTTCAGAATATTGAAAGA
gnl|ti|508398655/154-589  ACTCTGTCTTAAAAAAAAAAAA TTCAAAATATCGAAAGA
case2c/1-163              AGAATATTGAAAGA

case2h/1-482              ATTATATGTTCTGTTGATATGGATAATAACAATATTAACTCCCCAAAACTCACTACAGGA
gnl|ti|508398655/154-589  ATTATCTGTTCTGTTCATATGGATAA CAATATTAACTCCCCAACACTCACTACAGGA
case2c/1-163              ATTATATGTTCTGTTGATATGGATAATAACAATATTAACTCCCCAAAACTCACTACAGGA

case2h/1-482              AAACGTTA
gnl|ti|508398655/154-589  AAACTTTA
case2c/1-163              AAACGTTA

chr2:69566745:scaffold_37688:564230:564543:...
precise deletion of AluY in human
CLUSTAL

case3c/1-464              AAAAAAAAAAAAAAAAAGAAGTAGCTGATTCTTA TTTTTCATATAAGCTATCTTT
gnl|ti|507941191/90-555   AAAACAAACAAACAAAAAAAGTAACTGGTTCTTAATTTATTTTTCATATAAGCTATCTTT
case3h/1-150              AAAAAAAAAAAAAAAAAGAAGTAGCTGATTCTTATTTTTCATATAAGCTATCTTT

case3c/1-464              TCTTGGGGCTTCTTAAAAAAAATGGACAGCTCCAGGCCGGGCGCAGTGGCTCACGCCTGT
gnl|ti|507941191/90-555   TCTTGTGGCTTCTTAAAAAAAAAAAAGAAC GGCCGGGCACGGTGGCTCAAGCCTGT
case3h/1-150              TCTTGGGGTTTCTTAAAAAAA-TGGACAGCTCCA

case3c/1-464              AATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCACGAGGTCAGGAGATCGAGACCAT
gnl|ti|507941191/90-555   AATCCCAGCACTTTGGGAGGCCGAGACAGGCGGATCACGAGGTCAGGAGATCGAGACCAT
case3h/1-150

case3c/1-464              CCTGGCTAACACGGTGAAACCCCGTCTCTACTAAAAA-TACAAAA-AATTAGCCGGGCGT
gnl|ti|507941191/90-555   CCTGGCTAACACGGTGAAACCCCGTTTTTATTAAAAAATACAAAACAACTAGCCGGGGGA
case3h/1-150

case3c/1-464              GGTAGCGGGCGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAAC
gnl|ti|507941191/90-555   GGTGGGGGGTGCCTGTAGTCCCAGCTACTCGGGAGGGTGAGGCAGGAGAATGGGGTAAAC
case3h/1-150

case3c/1-464              CCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCGCGCCACTGCACTCCAGCCTGGGCGACA
gnl|ti|507941191/90-555   CCGGGAGGGGGAGCTTGCAGTGAGCTGAGATCCGGCCACTGCACTCCAGCCTGGGCAACA
case3h/1-150

case3c/1-464              GAGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAATGGACAGCTCCAGTTATTAACTA
gnl|ti|507941191/90-555   GAGCAAGACTCCGTTTCAAAAAAAAAAAAAAAAAAAA-GGACAGCTCCAGTTATTAACTA
case3h/1-150              GTTATTAACTA

case3c/1-464              GTAATGCTCTTAATTTCCTAAATATAAAATTAATTTGGCTAAGAACCCAGA
gnl|ti|507941191/90-555   GTAATGCTCTTAATTTCCTAAATATAAAATTCATTTAGCTAAGAACCCAGA
case3h/1-150              GTAATGCTCTTAATTTCCTAAATATAAAATTAATTTGGCTAAGAACCCAGA

chr2:84195010:scaffold_36190:2002525:2002866:...
precise deletion of AluY in human
CLUSTAL

case4c/1-491              GTGTACACATATGAATCTCAAAGCTGACATCTTTGTAACTAACATCTTAAAAAGCCTGAA
gnl|ti|332419397/459-927  AGGTACACATATGAATCTCAAAGCTGACATCTTTGCAACTAAAATCTTAAAAAGCCTGAA
case4h/1-150              GTGTACACATATGAATCTCAAAGCTGACATCTTTGTAACTAACATCTTAAAAAGCCTGAA

case4c/1-491              ATCTTAAAAATCAACATCTTGGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTT
gnl|ti|332419397/459-927  ATCTTAACAATCAGCATGTT GGTGGCTCACGCCTGTAATCCTAGCACTTT
case4h/1-150              ATCTTAAAAATCAACATCTT

case4c/1-491              GGGAGGCCGAGGCGGGCGGATCACGAGGTCAGGAGATCGAGACCATCCTGGCTAACACGG
gnl|ti|332419397/459-927  GGGAGGCCGGGACTGGTGGATCACGAGGTCAAGAGATGCAGACCATCGTGGCTAACATGG
case4h/1-150

case4c/1-491              TGAAACCCCGTCTCTACT AAAAATACAAAAAATTAGCCGGGCGTGGTA
gnl|ti|332419397/459-927  TGAAAACCCGTCTCTTCTTAAAAAAAAAAAAAAAAAAAAAAAAAATAGCCAGGGGAGGTG
case4h/1-150

case4c/1-491              GCGGGCGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGG
gnl|ti|332419397/459-927  GTGGGCGCCTGTAGTCCCAGCTACTCGGGAGGCCGAGGCAGGAGAATGGTGTGAACCCGG
case4h/1-150

case4c/1-491              GAGGCGGAGCTTGCAGTGAGCCGAGATCGCGCCACTGCACTCCAGCCTGGGCGACAGAGC
gnl|ti|332419397/459-927  GAGGCGGAGCTTGCAGTGAGCCGAGATCGCACCACTGCACTCCAGCCTGGGCGACAGAGC
case4h/1-150

case4c/1-491              GAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
gnl|ti|332419397/459-927  CAGACTCCATCTCAAACAAACAAACAAATGAACAAAAAA
case4h/1-150

case4c/1-491              AATTCAACTTCTTAGTCATTTAAAAATCTGATATCTCACCTTCATGAAACAAAAAAGAGT
gnl|ti|332419397/459-927  TCAACATCTTAATCATTTAAAAATCTGATATCTCAGCTTCATGAAACAGAAAAGAGT
case4h/1-150              AGTCATTTAAAAATCTAATATCTCAACTTCATGAAACAGAAAAGAGT

case4c/1-491              AGGATCAGTGCAGTAAAAAGAAG
gnl|ti|332419397/459-927  AGGATCAGTGCAGTAGAAAGAAG
case4h/1-150              AGGATCAGTGCAGTAGAAAGAAG

chr2:157669833:scaffold_37694:13104195:13104426:...
rhesus sequence filling this indel is found multiple times in the human genome. Chimp sequence is an L1PA2 without a discernable TSD.
L i k e l y m u l t i p l e i ndependent i n s e r t i o n s . CLUSTAL c ase5h/1-100 AGATTTAAAATCATTTTGATGATTTCTCTAAGATTATGTTGAGTTTCT g nl|ti|540008912/212-794 A GATTTTAAATCATTTTGATGATTTCTCTAAAATTATGTTGGGTTCCTTTTTTTTTTTTT c ase5c/1-331 AGATTTTAAATCATTTTGATGATTTCTCTAAGATTATGTTGAGTTTTT c ase5h/l-100 g n l I t i I 5 40008912/212-794 case5c/l-331 c ase5h/l-100 g n l | t i | 5 40008912/212-794 case5c/l-331 c ase5h/l-100 g nl|ti|540008912/212-794 case5c/l-331 c ase5h/l-100 g nl|ti|540008912/212-794 case5c/l-331 c ase5h/l-100 g nl|ti|540008912/212-794 case5c/l-331 TTTTGAATAGCAAATAGTTTATTGGTAAGTACATGGTTTCAACAAGAGTAATAAATTCAC ATGAAAAGGAGACAATAATCAAGTCAAAAGAATAAATGCTTACTAATCATCAGAAAATCT GTGGCCATTAGGGCTGGCACGTAAAAATCCAAAATCACTCAAAGGCCAAATCTTAAAGAA GATTCGTCCTCTTATTAGTCCATATGGAATAGGTCCATAGTACACAGAATCTGTAGAATT AA CTGTAGATTATCACCTTCTAACCAAACATGACCCATTGGCACCATAGGTCTGTTCTACAA AATTGAACAATGAGATCACATGGACACATGAAGGGGAATATCACACTCTGGGGACTGTGG c ase5h/l-100 g n l | t i | 5 40008912/212-794 TTCATTCAATTTTTACGGAAGTCACAGGCCTTCCAGAAAAAAAAAAAAATTAAAGCATGT case5c/l-331 TGGGGTGGGGGGAGGGGGGAGGGATAGCATTGGGAGATATACCTAAGGCTAGATGACGAG c ase5h/l-100 g nl|ti|540008912/212-794 case5c/l-331 c ase5h/l-100 g nl|ti|540008912/212-794 case5c/l-331 c ase5h/1-100 g nl|ti|540008912/212-794 c ase5c/1-331 TTCAAAGCATAGGTGGATGATCCATGATTTCAGGAATCCTCGGGCTCCAAAGAACCCTGA TTAGTGGGTGCAGCGCACCAGCATGGCACATGTATACATATGTAACTAACCTGCACAATG AAAACTTGG AGACCCTCAACCAGGACACAGGTGGGCCTTTCTCACCTATGTTGGATTTTTAAAAGTTGG TGCACATGTACCCTAAAACTTAAAGTATAAAAACAAAAACAAAAAAA- - AAATAACTTGG TTTGTGTTAAATCTGTTCATTGATTTGGAGACTGACAGATATA TTTGCATTAAATCTGTTCATTGATTTGGAGACTGACAGATATA TTTGTGTTAAATCTGTTCATTGATTTGGAGACTGACAGATATA 143 c hr2:18382 9 2 3 8 : s c a f f o l d _ 3 7 6 34:13637014:13637174:+ . . . 
non-transposable element sequence, likely previously inserted with TSD, also found on human chromosome 5, precise deletion in human

CLUSTAL

case6c/1-260              GTCAGAGGGGTAAGAAAGCCAAGAGAGAGTAGTGATAAATGCTAAAAAAGAAATAAGAAA
gnl|ti|513283760/530-796  GTCAGAGGGGTAAGAAAGCCAAGAGAGAGTAGTGATAAATGCTAAAAAAGAAATAAGAAA
case6h/1-100              GTCAGAGGGGTAAGAAAGCCAAGAGAGAGTAGTGATAAATGCTAAAAAAGA AAGAAA

case6c/1-260              GGAGTGGTCAGCCTCACACAGTCCTTCTGAGATTGTTAAGCAGATTACTTCCACCAGTAT
gnl|ti|513283760/530-796  GGAGTGGTCAGCCTCGCACAGTCCTTCTGAGATTGTTAAGCAGATTACTTCCATCAGTAT
case6h/1-100              GGAGTGGTCAGC

case6c/1-260              TGAGCCAGGAGTTGAGGTGGAAGTCACCATTGCAGATGCTTAAGTCAACTATTTTAATAA
gnl|ti|513283760/530-796  TGAGCCAGGAGTTGAGGTGGAAGTCACCATTGTAGATGCTTAAGTCAACTATTTTAATAA
case6h/1-100

case6c/1-260              ATTGATTACCAATTGTTTTAAAAAAAAAAAA AAGAAAGGAGTGGTCA
gnl|ti|513283760/530-796  ATTGATTACCAGTTGTTTTAAAAAAAAAAAAAAAAAAAAAAAAAA GAGTGGTCA
case6h/1-100

case6c/1-260              GCAGTAAACATCAGTTGCGCAATAAGAAGACCA
gnl|ti|513283760/530-796  GCAGTAAACATCAGTTGCCCAATAAGATCAAAA
case6h/1-100              --AGTAAACATCAGTTGCGCAATAAGAAGACCA

chr2:187720209:187720500:scaffold_37634:17562752:+ .
..AluY i n R hesus, p a r t i a l d e l e t i o n a nd gene c o n v e r s i o n t o A luYb8 i n human, p r e c i s e d e l e t i o n i n c himpanzee CLUSTAL c ase7h/l-438 g n l I t i l 5 07795985/79-552 case7c/l-146 c ase7h/l-438 g n l | t i I 5 07795985/79-552 case7c/l-146 c ase7h/l-438 g n l | t i l 5 07795985/7 9-552 case7c/l-146 c ase7h/l-438 g nl|ti|507795985/79-552 case7c/l-146 c ase7h/l-438 g n l | t i I 5 07795985/79-552 case7c/l-146 c ase7h/l-438 g n l | t i I 5 07795985/79-552 case7c/l-146 c ase7h/l-438 g n l | t i | 5 0 7 7 9 5 9 8 5 / 7 9-552 case7c/l-146 c ase7h/l-438 g n l | t i l 5077 95985/7 9-552 case7c/l-146 c ase7h/l-438 TGATATGTATGAGAAAGATACTGGTTTCTACATTTTGCTTTTAAATACTGTGAAGTAAAG TGATATGTATGAGAAAGATACTCGTTTCTACATTTTGCTTTTAAATACTATGAGGTAAAG TGATATGTATGAGAAAGATACTGGTTTCTACATTTTGCTTTTAAATACTGTGAAGTAAAG CACGAGACAACTTAAAAAAATATCTATAATG . GATGAGACAACTTAAAAAA-TATCTATAATGGGCCGGGCGCGGTGGCTCAAGCCTGTAAT CACGAGACAACTTAAAAAA-TATCTATAGTG AGCACTTTGGGAGGCCGAGGCGGGTGGATCATGAGGTCAGGAGATCGAGACCATCCT CCCAGCACTTTGGGAGGCCAAGACGGGCAGATCATGAGGTCAGGAGATCGAGACCATCAT GGCTAACAAGGTGAAACCCCGTCTCTACTAAAA—ATACAAAAAATTAGCCGGGCGTGTT GGCTAACACGGTGAAACCCCGTCTATACTAAAAAAATACAAAAAACTAGCCAGGCGAGGT GGTGGGCGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCG GGTGGGCGCCTGTAGCCCCAGCT TGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCG GGAAGCGGAGCTTGCAGTGAGCCGAGATTGCGCCACTGCAGTCTGCAGTCCGGCCTGGGC GGAGGCGGAGCTTGCAGTGAGCTGAGATCCGGCCACTGCACTC CAGCCTGGGT GACAGAGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAA-AAAAAAAAT GACAGAGCGAGACTCCATCTCAAAAAAAAAAAAAAAAAAAATATATATATATATATATAT CTATAATGAAAATAGAGCAAAAATTACTGAAGTAGCATACAAATGACAAAGATGAGAGAA ATAAAATGAAAATAGAGCAAAAATTACTGAAGTACCATACAAATGACAAAGATGAGAGAA AAAATAGAGCAAAAATTACTGAAGTAGCATAAAA-TGACAAAGATGAGAGAA AGAAT 144 g n l I t i l 5 07795985/79-552 case7c/l-146 AGAAT AGAAT c hr2:192411818:192412134:scaffold_37634:22298194:+ . 
..AluY i n R hesus, C LUSTAL c ase8h/l-476 AAAGAACTTTGCCTGTGTTGACTTTGTAAATGTTGTCTTAACCAAGCTTGGGCAACTATC g n l | t i I 4 98009529/103-591 AAAGAACTTTGCCTGTCTTGACTTTGTAAATGTTGTCTTAACCGAGCTTGAGCAACTATT case8c/l-160 AAAGAACTTTGCCTGTGTTGACTTTGTAAATGTTGTCTTAACCAAGCTTGAGCAACTATT case8h/l-47 6 g nl I ti|498009529/103-591 case8c/l-160 c ase8h/l-47 6 g n l | t i | 4 98009529/103-591 case8c/l-160 case8h/l-4 7 6 g nl|ti|498009529/103-591 case8c/l-160 case8h/l-476 g n l | ti|498009529/103-591 case8c/l-160 case8h/l-47 6 g nl|ti|498009529/103-591 case8c/l-160 TTTTTAAGAATCAAAAACACA—GCCGGGCGTGGTGGCTCACGCCTGTAATCCCAGCACT TTTTTAAGAATCAAAAACACAAAGCCGGGCGCGGTGGCTCAAGCCTGTAATCCCAGCACT TTTTTAAGAATCAAAAACACAA TTGGGAGGCCGAGGGGGGTGGATCACGAGGTCAGGAGATCGAGACCATCCCGGCTAAAAC TTGGGAGGCCGAGGCGGGCGGATCACGAGGTCAGGAGATCGAGACCATCCTGGCTAACAT g ene c o n v e r s i o n t o A luYa5 i n human, p r e c i s e d e l e t i o n i n c himpanzee GGTGAAACCCCGTCTCTACTAAAAATACAAAAAA TTAGCCGGGCGTAGTGGCGGG GGTGAAACCCCGTCTCTACTAAAAATACAAAAAAAAAAACTAGCCGGGCGTGGTGGCGGG CGCCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGC CACCTGTAGTCCCAGCTACTCGG-AGACTGAGGCAGGAGAATGGCGTGAACCTGGGAGGC GGAGCTTGCAGTGAGCCGAGATCCCGCCGCTGCACTCCAGCCTGGGCGACAGAGCG—AG GGAGCTTGCAGTGAGCCGAGATCGCACCACTGCACTCCAGCCTGGGTGACACAGCGCGAG case8h/l-47 6 ACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAGAATC AAAAACACGAAAAAGCCA g n l | t i | 4 98009529/103-591 ACTCCGTCTCAAAAAAAAAAAAACAAAAACAAAAACAAAACAAAAAACACAAAAAAGCCA case8c/l-160 AAAAGCCA c ase8h/l-476 g nl|ti|498009529/103-591 case8c/l-160 c ase8h/l-476 ' g nl|ti|498009529/103-591 case8c/l-160 CTCCACACAAAATATAATGCATCAAGTGTGGCTGCTGAATTACCGGAGTTAACATAGGTA CCCCATATAAAATACAGTGCATCAGGTGTGGCTGCTAAATTACCAGAGTTAACATATGTA CTCCACACAAAATATAATGCATCAAGTGTGGCTGCTGAATTACTGGAGTTAACATAGGTA AGCACACACC AACATACACC AGCACACACC c hr2:210695342:210695583:scaffold_36996:2672593:+ . . . 
p r e c i s e d e l e t i o n o f A l u S g i n c himpanzee CLUSTAL c ase9h/1-363 gnl|ti|562549776/185-546 case9c/l-122 c ase 9h/1-3 63 gnl|ti|562549776/185-546 case9c/l-122 c a s e 9 h / l - 3 63 gnl|ti|562549776/185-546 case9c/l-122 case9h/l-363 gnl|ti|562549776/185-546 case9c/l-122 c a s e 9 h / l - 3 63 TAACACATGTACTTCTAAATGTGTTTGAAGTTTGAGTCTTACTAACTTATGTAGAATTCA TAATACATGCAGTTCTAAATGTGTTTGAAGTTTGAGTCTTACTGACTTATATAGAATTCA TAACACATGTACTTCTAAATGTGTTTGAAGTTTGAGTCTTACTAACTTATGTAGAATTCA CATATTTTTCGGCCGGGCACGGTGACTCACGCC-TGTAATCCCAG—CACTTTGGGAGGC CATATTTTTCGGCCAAACATGGTGAC—ACACC-TGTAATCACAG—AACTTTGGGAGGC CATATTTTTC CGAGGCAGACAGATCACAAGGTCAGGAGTTTGAGACCAGCCTGACCAACATGGTGAAACC CGAGGCGGGCAGATCACAAGGTCAGGAGTTTGAGACCAGTCTGACCGACATGGTGAAACC CCCGTCTCTACTAAAAATA-AAAAATTAGCCGGGCATGGTGGCACGTGCCTGTAATCCCA TCCATCTCTACTAAAAATACAAAAATTAGCCAGGCTTGGTGGCAGGCATCTGTAATCCCA GCTACTCAGGAGGCTGAAGAAGGAGAATCGCTTGAACCCGGGAGACGGAGGTTGCAGTGA 145 g nl|ti|562549776/185-546 case9c/l-122 case9h/l-363 g nl|ti|562549776/185-546 case9c/l-122 case9h/l-363 g n l | t i I 5 62549776/185-546 case9c/l-122 GCTACTCAGGAGGCTGAGGCAGGAGAATCGCTGGAACCCAGGAGGCAGAGGTTGCAGTGA GCCGAGATATTTTTCACACAAAAACCTTAAATATTACAGTGCAGTGGTTAGCAACTCTGA GCCAAGATATTTTTCACACGAAAACCTTAAATATTACAGTGCAGTGGTTAGCAACTCTGA ACACAAAAACCTTAAATATTACAGTGCAGTGATTAGCAACTCTGA AAGTTTG AAGTTTG AAGTTTG c hr2:231119183=231119489:scaffold_34265=2315868:+ . 
..AluY i n R hesus, g ene c o n v e r s i o n t o A luYg6 i n human, p r e c i s e d e l e t i o n i n c himpanzee C LUSTAL c aselOh/1-461 TCAGTGCA-TTTATTTTATATATTATTTCTATAATGTGCATATTTAATAACAAATAACAT g nl|ti|540018070/562-1046 TCAATGCA-TTTATTTTATATATTATTTCTATAATGTGCATATTTAATAACAAATAACAT c aselOc/1-156 TCAGTGTAGTTTATTTTATATATTATTTCTATAATGTGCATATTTAATAACAAATAACAT c aselOh/1-461 g nl|ti|540018070/562-104 6 casel0c/l-156 c aselOh/1-461 g n l | t i I 5 40018070/562-104 6 casel0c/l-156 c ase10h/1-461 g n l | t i I 5 40018070/562-104 6 casel0c/l-156 c aselOh/1-461 g nl|ti|540018070/562-104 6 casel0c/l-156 c ase 10h/1-4 61 g n l | t i I 5 40018070/562-1046 casel0c/l-156 TTATATAAAGTATGTTTA GGCCGGGCGCGGTGGCTCACGCCTGTAA TTATATAAAGTATGTTTAATATATTAAGTCTAGGCCGGGCGCGGTGGCTCACGCCTGTAA TTATATAAAGTATGTTTAATATATTAAGTCTA TCCCAGCACTTTGGGAGGCCGAGACGGGCGGATCACGAGGTCAGGAGATCGAGACCATCC TCCCAGCACTTTGGAAGGCCGAGACAGGCAGATCATGAGGTCAGGAGATCGAGATCATCC TGGCTAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAAA TTAGCCAGGCA TGGCTAACACGGTGAAACCCCCTCTCTACTAAAAATACAAAAAAAAAAATTAGCCGGGCG TGGTGGCGTGCGCCTGTAGTCCCAGCTACACGGGAGGCTGAGGCAGGAGAATGGCGCGAA TGGTGGTGGGCGCCTGTAGTCCCAGCTACACGGGAGGCTGAGGCAGGAGAATGGCATAAA CCCGGGAGGCGGAGCTTGCAGTGAGTCGAGATCGCGCCACTGCACTCCAGCCTGGGCAAC CCCGGGGAGCGGAGCTTGCAGTGAGCTGAGATCGCGCCACTGCACTCCAGCCTGGGCAAC c aselOh/1-461 AGAGCTAAACTCCGTCTCAAAAAAAAAAAAAA AAAGTATGTTTAATATATTAAGTC g n l | t i I 5 40018070/562-1046 AGAGCGAGACTCCGTTTCAAAAAAAAAAAAAAAAATATATATATATATATATATAAAGTC casel0c/l-156 c aselOh/1-461 g nl|ti|540018070/562-1046 c aselOc/1-156 c aselOh/1-461 g nl|ti|540018070/562-1046 c aselOc/1-156 TATTCTAAGGAGTTAATTTACTGAATTATTTCCTGTAATGCAACAGAGTTAATTAAATTA TATTCTAAGAAGTTGCTTTACTGAATTATTTCCTGTAATGCAGCAGAGTTAATTATATTA —TTCTAAGGAGTTAATTTACTGAATTATTTCCTGTAATGCAACAGAGTTAATTAAATTA ATAAGT ATAAGT ATAAAT C hr2=234434078=234434368:scaffold_37338:1917127:...AluY i n R hesus, g ene c o n v e r s i o n t o A luYb8 i n human, p r e c i s e d e l e t i o n i n c himpanzee CLUSTAL casellh/1-437 g 
nl|ti|538232282/352-812 casellc/1-147 casellh/1-437 g nl|ti|538232282/352-812 casellc/1-147 casellh/1-437 GAAACTAGTTATTGTCAGAAGAGACCATTGGGATGAAAAGTGTGTATCTTCGAAACCACT GAAACTAGTTATTGTCAGAAGAGACCATTGGGATGAAAAGTGTGTATCTTCAAAACCACT GAAACTAGTTATTGTCAGAAGAGACCATTGGGATGAAAAGTGTGTATCTTCAAAACCACT AAAGAAAAACTC ACACTTTGGGAGGCC AAAGAAAAACTCGGCCGGGCACGGTGGCTCAAGCCTGTAATCCCAGCACTTTGGGAGGCC AAAGAAAAACTC GAGGCGGGTGGATCATGAGGTCAGGAGATCAAGACCATCCTGGCTAACAAGGTGAAACCC 146 g nl|ti|538232282/352-812 casellc/1-147 GAGATGGGTGGATCACGAAGTCAGGAGATCGAGACCATCCTGGCTAACAGGGTGAAACCC casellh/1-437 TGTCTCTACTAAAAA-TACAAA AAATTAGCCGGGCGCGGTGGCGGGCGCCTGTAGT g n l | t i | 5 38232282/352-812 CGTCTCTACTAAAAAATACAAAAAAAAAATTAGCCGGGCGTGGTGGCGGGTGCCTGTAGT casellc/1-147 casellh/1-437 g nl|ti|538232282/352-812 casellc/1-147 c a s e l l h / 1 - 4 37 g n l | t i I 5 38232282/352-812 casellc/1-147 casellh/1-437 g n l | t i I 5 38232282/352-812 casellc/1-147 CCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAAGCGAAGCTTGCA CCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCCTGAACCCGGGAGGCAGAGCTTGCA GTGAGCCGAGATTGCACCACTGCAGTCCGCAGTCCGGCCTGGGCGACAGAGCGAGACTCC GTGACCTGAGATCCGGCCACTGCACTCCA GCCTGGGTGACAGAGCAAGACTCC GTCTCAAAAAAAAAAAAAAGAAAAAAAGAAAGAAAAACTCAGTCTGTACCTTCTATAATA GTCTCAAAAAAAAAAAAAAAAAAA GAAAAACTCAGTCTGTACCTTCTATAATA AGTCTGTACCTTCTATAATA casellh/1-437 ATAATTAGTAATGCCAGTAACAGCAGGTAACATTTATTGAATGTATACAGTGTGT g nl|ti|538232282/352-812 ATAATTAGTAATGCCAGTAACAGAAAGTAATATTTATTGAATGTGTACAATGTGT casellc/1-147 ATAATTAGTAATGCCAGTAACAGCAGGTAACATTTATTGAATGTATACAGTGTGT c hr3=3802176:scaffold_32934:3840390:3840497:+ . . . 
i m p r e c i s e d e l e t i o n o f L2 f ragment i n human, no s i m i l a r i t y CLUSTAL casel2c/l-207 gnl|ti|583293545/557-763 casel2h/l-101 casel2c/l-207 gnl|ti|583293545/557-763 casel2h/l-101 TAATTCCTCATGTCCCCACCTTAGCCCCCATATGATCCTCTCAGGCCTTGCTAAAGCTCA TCATCCCTCATGTCCCCACCTCTGCCCCCACATGATCCTCTCAGGCCTTGCTAAAGCTGA TAATTCCTCATGTCCCCACCTTAGCCCCCATATGATCCTCTCAGGCCTTGTT GAGCCAAAAAACAGTTCTTAGAACACAGTAGAAACTCAACAAATATTTGCTGAATTAGTA GAGCCAAAAAACAGTTCTTAGAACACAGTAGAAACTCAACAAATATTTGCTGAATTAGTG at breakpoint c ase12c/1-207 TCCATGTTCGTCTAGTCTCCCAGTTCATAGGCCATCATGTTTTACATGTAGCTGATGTTT g n l | t i I 5 83293545/557-763 TCCATGTTTGTCTAGTCTCCCAGTTCATAGGCCATCATGTTTTACAAGTAGCTGATGTTT casel2h/l-101 GTTTTACATGTAGCTGATGTTT casel2c/l-207 g nl|ti|583293545/557-763 casel2h/l-101 GTCGATGTCTACTGAGTCAAATATAGA GTCGATGTCTACTGAGTCAAATATAGA GTCGATGTCTACTGAGTCAAATATAGA c hr3=34069680:scaffold_37683=15182979=15183196:...independent i n s e r t i o n s o f an A luY i n R hesus a nd L1PA2 i n c himpanzee i n t h e same s i t e C LUSTAL c asel3c/285-601 g nl|ti|583406125/478-897 casel3h/l-99 c asel3c/285-601 g n l I t i I 5 83406125/478-897 casel3h/l-99 c asel3c/285-601 g nl|ti|583406125/478-897 casel3h/l-99 e a s e l 3c/2 85-601 g nl|ti|583406125/478-897 casel3h/l-99 c asel3c/285-601 g nl|ti|583406125/478-897 AAGCAGATGGGGGTCAGGAAGCAAGAAGAGGTAGGTTAGAAAGCCTTTACGGAAGGGGAA AAG-GGATGGGGGTCAGGAAGCAAGAAGAGGTAGATTAGAAAGGCTTTACCAGCCGGGCG AAG-AGATGGGGGTCAGGAAGCAAGAAGAGGTAGATTAGAAAGGCTTTAC TA CCATACTCTGGGGACTGTGGTGGG--GAGGGGGGAGG CGGTGGCTCACGCCTGTAATCCCAGAACTTTGGCAGGCCGAGGCGGGCGGATCACGAGGT — GGGGAGGGATAGCAT- -TGGGAGATATACCTAA TGCTAGA CAGGAGATCGAGACCATCCTGGCTAACACAGTGAAACCCCGTCTCTACTAAAAATACAAA TGACGAGT—TAGTGGGTGCAGCGCACCAGCATGGC AAGAAAAAAATTAGCCGGGCATGGTGGCGAGCGCCTGTAGTCGCAGC-TACTCGGGAGGC ACATGTATACATATGTAACTAACCTG CACAATGTGCACA TGAGGCAGGAGAATGGCGTGAACCTGGGAGGCGGAGCTTGCAGTGAGCTGGATCGCGCCA 147 casel3h/l-99 c asel3c/285-601 g nl|ti|583406125/478-897 casel3h/l-99 c asel3c/285-601 g nl|ti|583406125/478-897 e 
a s e l 3 h/l-99 c asel3c/285-601 g nl|ti|583406125/478-897 casel3h/l-99 -TGTACCCTAA AACTTAAAGTATAATAATA AAAAAAAAAAAAAAAAAAGAA CTGCACTCCAGCCTGGGTGAAAGAGCGAGACTCCTTCTCAAAAGAAAGAAAAAAAAAGAA AGCCTTTACCAAAAAAAA CATGGTGTTTGAACTGATACTTAAACTGGTTCTTAAGCT AGGCTTTACCAAAAAAAAAACCACAGTGTTTGAACTGATACTTAAGCTGGTTCTTAAGCT CAAAAAAAA CATGGTGTTTGAACTGATACTTAAACTGGTTCTTAAGCT GG GG GG c hr3:127318836=127319143:scaffold_36943=880680:. . . p r e c i s e d e l e t i o n o f A l u S g i n c himpanzee CLUSTAL c a s e l 4 h / l - 4 62 g n l | t i I 5 01884074/122-632 casel4c/l-155 c a s e l 4 h / l - 4 62 g nl|ti|501884074/122-632 c ase 1 4c/1-155 c ase14h/1-462 g nl|ti|501884074/122-632 casel4c/l-155 CGGTGAGTATTCACTTAGCACCCAGGATCTT TAAACCAACCAGTGCAT CGGTGAGTATTCACTTAGCATCCGGGATCTTCACACTCTGCCCTAAACCAACCAGTGCAT CGGTGAGTATTCACTTAGCACCTAGGATCTT TAAACCAACCAGTGCAT TCCCTTCTGAAGAGCCTTCAACTTCACTTTAAAAAATCCTGTGATGGGGCGGGCGTGGTG TCACTTCTGAGGATCCTTCAACTTCACTTAAAAAAATCCTGTGACGGGGCGGGCATGGTG TCCCTTCTGAAGAGCCTTCAACTTCACTTTAAAAAATCCTGTGATGG GCTCATGCCTGCAATCCCAGCACTTTGGGAGGTCAAGGTGGGCAGATCACGAGGTCAGGA GCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCAAGGCGGGCGGATCACCAGGTCAGGA casel4h/l-462 GTTCGACACCAGCCTTACCAACATGGTGAAACCCTGTCTCTACTAAAAATACAAAAATTA g n l | t i | 5 01884074/122-632 GCTCAACACCAGCCTTACCAACATGGTGAAACCCTGTCTTTACTAAAAATACAAAAATTA casel4c/l-155 c ase14h/1-462 g nl|ti|501884074/122-632 casel4c/l-155 c ase14h/1-462 g nl|ti|501884074/122-632 casel4c/l-155 GCCGGGTGTGGTGACACGCGCCTGTAATCCCAGCTACTCAGGAGGCTGAGGCAGGAGAAT GCCAGGTGTGGTGGAGTGCGCCTGTAATCCCAGCTACTCAGGAGGCTGAGGCAGGAGAAT CACTTGAATCCGGGAGGTGGAGGTTGCAGTGAGCCGAAATCATGCCACTGCACTCCAGCC CACTTCAACCCGGGAGGTGGAGGTTGCAGTGAGCCGCGATCATGCCACTGCACTCCAGCC casel4h/l-462 TGGGCAACAGAACGAGACTTTGTCTAAAAAAAAAAAAA g n l | t i | 5 01884074/122-632 TGGGCGACAGAGTGAGATTCTGTCAAAAAAAAAAAAAAAAAAAAAAAAACACACACACAC casel4c/l-155 casel4h/l-462 g nl|ti|501884074/122-632 e a s e l 4 c/1-155 casel4h/l-462 g nl|ti|501884074/122-632 casel4c/l-155 AAAAATCCTGTGATGGCAAAGACGTTCT-CAGGCTAAATTCAACTC 
ACAACAAAACAAAATAAAATGCTGTGATGGCAAAGACGTTTTTCAGGCTAAATTCAACTC CAAAGACGTTCT-CAGGCTAAATTCAACTC ATGTATTTTTTACATACATAATATTTGAAGG ATGTATTTTTTACATAGATAATATTTGAAGG ATGTATTTTTTACATACATAATATTTGAAGG C hr4=62743877.-62744205:scaffold_37623=10184651:+ . ..AluY i n R hesus, p a r t i a l TSD d e l e t i o n a nd gene c o n v e r s i o n t o A luYb8 i n human, p r e c i s e d e l e t i o n i n c himpanzee CLUSTAL casel5h/l-494 g nl|ti|523759330/256-733 case15c/l-166 casel5h/l-494 g nl|ti|523759330/256-733 AGTGACAAAGTGCAGGTCTACAGAACAAAGGGCACAAATAAAACTAGGCAATTTTGTGTG ACTGACAAAGTGCAGGTCTACAGAAAAAAAGGCACAAATAAAACTAGGCAATTTTGTGTG AGTGACAAAGTGCAGGTCTACAGAACAAAGGGCACAAATAAAACTAGGCAATTTTGTGTG TTGACTATAAAAGAGCATTTTG GGCCGGGCGCGGTG TTGACTATAAAAGAGCATTTTGATGTTGTTGAAAATATTTTACACAGGCCGGGCGCGGTG 148 casel5c/l-166 case15h/1-494 g n l | t i I 5 23759330/256-733 casel5c/l-166 TTGACTATAAAATAGCATTTTGATGTCATTGAAAATATTTTACACAGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGTGGATCATGAGGTCAGGA GCTCAAGCCTGTAATCCCAGCACTTTGGGAGGCCAAGACGGGCGGATCACGAGGTCAGGA c ase 15h/1-4 94 GATCGAGACCATCCTGACTAACAAGGTGAAACCCCGTCTCTACTAAAAA—TACAAAAAA g n l | t i I 5 23759330/256-733 GATCGAGACCATCCTGGCTAACACAGTGAAACCCCGTCTCTACTAAAAAAATACAAAAAA casel5c/l-166 casel5h/l-494 g n l | t i I 5 23759330/256-733 casel5c/l-166 casel5h/l-494 g nl|ti|523759330/256-733 casel5c/l-166 TTAGCCGGGCGCGGTGGTGGGCGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAG CTAGCCGGGCGAGGTGGCGGGCGCCTGTAGTCCCAGCTACTCAGGAGGCTGAGGCAAGAG AATGGCGTGAACCCGGGAAGCGGAGCTTGCAGTGAGCCGAGATTGCGCCACTGCAGTCCG AATGGCGTAAATCCGGGAGGCGGAGCTTGCAGTGAGCCGACATCCGGCCACTGCACTCCA e a s e l 5 h/l-4 94 CAGTCCGGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAA g n l | t i | 5 23759330/256-733 GCCTGGGCAACAGAGCGAGACTCCGCCTCAAAAAAAAAAAAA casel5c/l-166 casel5h/l-494 AAAAAAGAGCATTTTGATGTCGTTGAAAATATTTTACACAAATGAAAAAGTGGTGATGGT g n l | t i I 5 23759330/256-733 GAAAATATTTTACACAAATGAAAAAGTGGTGATGGT casel5c/l-166 AATGAAAAAGTGGTGATGGT casel5h/l-494 g nl|ti|523759330/256-733 e asel5c/1-166 
TATCACGTGAGGGGGTTCTGTGGTCTCCCCTTCTCTTAGT TATCATGTGAAGGGGTTCTGTGGTCTCCCTTTCTCTTAGT TATCACATGAGGGGGTTCTGTGGTCTCCCCTTCTCTTAGT c hr4:110847016:110847328:scaffold_37491:1921071:+ . . . p r e c i s e d e l e t i o n o f A luY i n c himpanzee CLUSTAL e asel6h/1-4 7 0 TCATTTGCTTTATTTTAGAAAAGCCTATTAGTACATTATAATTCAGTACTAGAACTTCAA g nl|ti|540783035/118-571 TCATTTGCTTTATTTTAGAAAAGCCTATTAGTACATTATAATTCAGTACTAGAACTTCAA casel6c/l-158 TCATTTGCTTTATTTTAGAAAAGCTTATTAGTACATTATAATTCAGTACTAGAACTTCAA e a s e l 6 h/l-470 g nl|ti|540783035/118-571 casel6c/l-158 c ase16h/1-470 g nl|ti|540783035/118-571 casel6c/l-158 e asel6h/1-4 7 0 g nl|ti|540783035/118-571 casel6c/l-158 casel6h/l-470 g n l | t i I 5 40783035/118-571 casel6c/l-158 TTATTTTATTAAAAAGTCCACTCCAG-CCGGGCGCGGTGGCTCACGCCTGTAATCCCAGC TTATTTTATTAAAAAGTCCACTCCGGGCTGGGCATGGTGGCTCACGCCTGTAATCCCAGC TTATTTTATTAAAAAGTCCACTCCA ACTTTGGGAGGCCGAGGCGGGCGGATCACGAGGTCAGGAGATCGAGACCATCCTGGCTAA ACTTTGGGAGGCTAAGGCAGGCAGATCATGAGGTCAGGAGATCGAGACCATCCTGGCAAA CACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCCGGGCGTGGTAGCGGGCG CACGGTGAAACCCCGTCTCTACTAAATATACAAAAAATTAGCTGGGCAAGGTGGCGGGCG CCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGG CCTGTAGTCCCAGCTTCTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGG casel6h/l-470 AGCTTGCAGTGAGCCGAGATCGCGCCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTC g n l | t i | 5 40783035/118-571 AGCTTGCAGTGAGCCGAGATCGCGCCACTGCACTCCAGCCTGGGGGACAGAGCGAGACTC casel6c/l-158 e a s e l 6 h/l-470 CGTCTCAAAAAAAAAAAAAAAAAAAAAAGTCCGCTCCAAGTATGTATGTTTTGAGTGTTA g n l | t i | 5 40783035/118-571 CGTCTCAAAAA GCCCACTCCAAGTATGTATGTTTTGAGTGTTG casel6c/l-158 AGTATGTATGTTTTGAGTGTTA easel6h/l-470 g nl|ti|540783035/118-571 casel6c/l-158 CATTAGTTGTTAAAGTTGGTTGCACTTTTGGCTAGTGTTTAAAAGGTGTCA CATTAGTTGTTAACTTAGGTTGCACTTCTGGCTAGTGTTTAAAAGGTGTCA CATTAGTTGTTAAAGTTGGTTGCACTTTTGGCTAGTGTTTAAAAGGTATCA 149 c hr4:113908780:113908932:scaffold_37491:5060691: + . 
...AluY i n R hesus, p a r t i a l d e l e t i o n / g e n e c o n v e r s i o n t o A luYb9 i n human, p r e c i s e d e l e t i o n i n c himpanzee CLUSTAL casel7h/l-252 TATGAAATTTGTGTTGTGTCTTCAGGTGATTTAAAAAAATATATGACAT casel7c/l-100 TATGAAATTTGTGTTGTGTCTTC AGGTGATTTAAAAAAATATATGACAT g n l | t i | 5 41340560/355-730 TATGAAATTTGTGTTGTGTCTTCAGG GGCGCGGTGGC casel7h/l-252 casel7c/l-100 g n l | t i I 5 41340560/355-730 casel7h/l-252 casel7c/l-100 g nl|ti|541340560/355-730 casel7h/l-252 casel7c/l-100 g n l I t i I 5 41340560/355-730 casel7h/l-252 casel7c/l-100 g nl|ti|541340560/355-730 casel7h/l-252 casel7c/l-100 g nl|ti|541340560/355-730 TCAAGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCACAAGGTCAGGAGA TCGAGACCACAGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCCGGGCGCGGTG CGG ACGGGCGCCTGTAGTCCCAGCGACTCAGGAGGCTGAGGCAGGAGAATGGCGGGAACCCGG GAAGCGGAGCTTGCAGTGAGCCGAGATTGCGCCACTGCAGTCCGCAGTCCAGCCTGGGCG ' • GAGGCGGAGCTTGCAGTGAGCCGAGATCGCGCCACTGCACTCC AGCCTGTGCA ACAGAGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATATAT ACAGCGTGAGACTCCGTCTCACACGNCCNNNNNNNACATNAAAAAAA casel7h/l-252 ATATATATATATATATATATATATGACATCTATCCTGTCAAGTTGATGTTAATTTGG-AT casel7c/l-100 CTATCCTGTCAAGTTGATGTTAATTTGG-AT g n l | t i | 5 41340560/355-730 TGACATCTATCCCGTTAAGTTGATGTTGATTTTGCAT casel7h/l-252 casel7c/l-100 g nl|ti|541340560/355-730 AGAATTGGC-TTTATAGTTGTA AGAATTGGC-TTTATAGTTGTA GGAATTGCCCTTTCCATTTGTA c hr4:127295460:127295786:scaffold_34705:5896785:... independent i n s e r t i o n s o f L 1PA2 i n human, and A luY i n R hesus. due t o m i c r o d e l e t i o n h ere? 
CLUSTAL c asel8h/209-634 g n l | t i | 5 5 8 3 5 2 0 0 8 / 2 93-7 68 casel8c/l-114 c asel8h/209-634 g nl|ti|558352008/293-7 68 casel8c/l-114 c asel8h/209-634 g n l | t i | 5 5 8 3 5 2 0 0 8 / 2 93-7 68 casel8c/l-114 c asel8h/209-634 g nl|ti|558352008/293-768 casel8c/l-114 c asel8h/209-634 g nl|ti|558352008/293-7 68 casel8c/l-114 c asel8h/209-634 g nl|ti|558352008/293-7 68 casel8c/l-114 c asel8h/209-634 Imprecise localization ATAGTAATCTAAATTGCTCCAGACAAATATTTTCTGTTTTAAAAATAGTA ATAGTCGTCTAAACTGATCTCGACAAAGATTTTCTGTTTTAAAAATAGTATTAACTACGT ATAGTAATCTAAATTGATCCAGACAAATATTTTCTGTTTTAAAAATAGTATTAACTACAT AGGACATGGATGAAATTGGAA TATAAGAAAATTTAGCAGGGTGCAGTCGTTCGTGCCTGTAATCCCGCACTTTGGGAGGGC TATTAGAAAATTT ATCATCATTCTCAGTAAACTATCGCAAGAACAAAAAACCAAACACCGCATATTCTCACTC GAGGCGGGTGGATCACGAAGTCAGGAGATCAAGACCACCCTGGCTAACACAGTGAAACCC ATAGGTGGGAATTGAACAATGAGATCACATGGACACAGGAAGGGGAATATCACACTCTGG CATCTTTCCTAAAAATACAAAAAAATTAACCAGGCATTGTGGCGGGCGCCTGTAGTTCCA GGACTGTGGTGGGGTGGGGGGAGGGGGGAGGGATAGCATTGGGAGATATACCTAATGCTA GCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCTGGGAGGCAGAGCTTGCGGTGA GATGACGAGTTAGTGGGTGCAGCGCA-CCAGCATGGCACATGTATACATATGTAACTAAC GCCAAGATCACACCACTGCACTCCAGCCTGGGAGGGAAAGGTGTTTTTGGGAGACAGAGC CTGCACAATGTGCACATGTACCCTAAAACTTAAAGTATAATAAAAAAATAAATAAATAAA 150 g n l | t i | 5 58352008/293-768 casel8c/l-114 c asel8h/209-634 g nl|ti|558352008/293-768 casel8c/l-114 AAAACTCCGTATCAAAAAAAAAAAAAAAAAAAACCACACAAAAAACTCCGTCTCAAAAAA TAAAAAAGAAAATTTAAATAGGTACAATGACCATTTATAGAAACATAATTCACTAG AAAAAAAAAAAAATTACATAGGTATAATGACCATTTATAGAAAAATAATTCACTGG AAATAGGTACAATGACCATTTATAGAAACATAATTCACTAG C hr4:180115809:scaffold_31924:122 6 7927:12268281:+ . . . 
i m p r e c i s e d e l e t i o n o f A luY i n human, but i n v o l v i n g TSD and f o r t u i t o u s u pstream m atch CLUSTAL casel9c/l-504 g n l | t i I 4 58957840/166-649 casel9h/l-150 casel9c/l-504 g nl|ti|458957840/166-649 casel9h/l-150 casel9c/l-504 g n l | t i I 4 58957840/166-649 casel9h/l-150 casel9c/l-504 g nl|ti|458957840/166-649 casel9h/l-150 casel9c/l-504 g n l | t i I 4 58957840/166-649 casel9h/l-150 casel9c/l-504 g nl|ti|458957840/166-649 casel9h/l-150 casel9c/l-504 g nl|ti|458957840/166-649 casel9h/l-150 casel9c/l-504 g nl|ti|458957840/166-649 casel9h/l-150 easel9c/l-504 g nl|ti|458957840/166-649 c ase19h/l-150 TGATCCTGGGCAGCAACATAAGAAGGGTTGGGTCATGGGGCCAGGATGCTTTTGTAATGG TGATCTTGGGCGGCAACATGAGAAGGGTTGGGTCATGGGGCCAGGATGCTTTTCTAAAGG TGATCCTGGGCAGCAACATAAGAAGGGTTGGGTCGTGGGGCCAGGATGCTTTTGTAATGG AGTCTTCTCCAAAAGAAGCATGAGCCTGAGGAAAAGAAAAGCCCCTTTTAGGCCAGGCAC GGTCTTCTCCAAAAGAAGCATGAGCCTGAGGAAAAGAAAAGTCCCCTCTAGGCCAGGCGC AGTCTTCTCCAAAAG TGTGGCTCATGCCTTTAATCCCAGCACTTTGGGAAACCCAGGTGGGCGGATCACCTGAGG GGTGGATCATGCCTTTAATCCCAGCACTTTGAGAAACCCAGGTGGGCAGATCACCTCAGG TCAGGAGTTCGAGACAAAACTGGCCAACGTGGCAAAACCTCATCTCTACTAAAAATACAA TCAGGAGTTCCAGACCAAACTGGCCAACATGGCAAAACCTCATCTTTACTAAAAATACAA AAATTAAGCGGGCATGGTGGCTTATGCTGGTAAACCCAGCTACTCGGGAGGCTGATGCAT AAATTAACCGGGCATGATGGCTCATGTCGGTAAACGCAGCTACTCGGGAGGCTGACGCAT GAGAATCGCGTGAACCAGGGTGGCAGAGGCTGCAGTGAGCCGAGAACATGCCACAGCACT GAGAATCGCTTGTACCAGGGTGGCGGAGTTTGCAGTGAGCCGTGAACACGCCACTGCACT CCAGGCTGGGCCACAGAGTGAGACTCCATCTGAAAAAAAAAAAAAAAGAAAAAGAAAAAG CCAGCCTGGGCGACAGAGTGAGACTCCGTCTGAAAAAAAAAAAA AAAAAAAAGCCCCCTCTAGCATAAACTTATTTCAGAAATAACACTAGTGAACAAGTAACC AAAAGCCCCCTCTAGCGTAAACTTATTTCAGAAATAACACTAGCGAACAAGTAACC CCCCCTCTAGCATAAACTTATTTCAGAAATAACACTAGTGAACAAGTAACC CTTTCCAAATTATTTTGAATCTAA CTTTCCACATTATTTTGAATCTAA CTTTCCAAATTATTTTGAATCTAA c hr5 = 2 9115431:scaffold_3707 8 :3005315:3005552:... i m p r e c i s e . 
d e l e t i o n o f HAL1 f ragment i n human, no f l a n k i n g CLUSTAL c ase20c/1-337 g nl|ti|509178888/301-662 c ase2Oh/1-100 c ase20c/1-337 g n l | t i I 5 09178888/301-662 c ase20h/l-100 c ase20c/1-337 gnl|ti|509178888/301-662 c ase20h/l-100 AATACTGAAGAGACTGTCAGTGAGGCCCCTCCTTAATGCACCATTCCTATCAACTTTTGC AATACTGAAGAGACTATCAGTGAGGCCTCTCCTTTATGCACCATTCCTGTCAACATTCGT AATACTGAAGAGACTGTCAGTGAGGCCCCTCCTTAATGCACCATTCCTAT TCTTTGGAAACAATAACAGAAAACAGAAAGTTAATGAAAATATTGACTCCATGATATAAA TCTTTGGAAACAATAATGGAAAACAGAAAGTTAATGAAAATATTGACTCTATGATATAAA GCAGGATACTATAAAAAGGGATATTTAAAGAAAAAA-TAGAAATTAAAAACATGATTGTGCAGGATACTATAAAAAGGGATATTTAGAGAAAAAAATGGAAATTAAAAACATGATAGTT identity c ase20c/l-337 GAT AT ATATATATAT AT AT AT AT AT AT TTGAATGTTT g n l | t i | 5 09178888/301-662 TTATAAATATATATATATATAAATATATATATATATAAATATATATAATTTTGAATGTTT 151 c ase20h/l-100 case20c/l-337 g nl|ti|509178888/301-662 c ase20h/l-100 case20c/l-337 g nl|ti|509178888/301-662 c ase2 Oh/1-100 case20c/l-337 g nl]ti|509178888/301-662 c ase20h/l-100 GT GT GT GGAAGACAAAGTTGTAGACATAACTTAGAGAGTCAGAGACGTGGAGAATAGAAGACAAAG GGAAGACAAAGTTGTGGACATAACTTAGAGAGTGAGAGACATGGAGAACAGAAGACAAAG GGAGGAGCAACTGAACAGATTAAGTATAAACATATACATTGTTTCCAAAAAGAAAACAGT GGAGGACCAACTGACCACATTAAGTATAAAGATATAGATTGCTTCCAAAAAGAAGACAGT GAACAGATTACGCATAAACCTATACATTGTTTCCAAAAAGAAAACAGT c hr5:169226842:scaffold_37615:12572677:12572918:+ . . . 
i m p r e c i s e d e l e t i o n o f a MIRb e lement, 3 bp i d e n t i t y blunt-end d e l e t i o n C LUSTAL case21c/l-341 g n l | t i I 5 72959929/194-536 c ase21h/l-100 case21c/l-341 g n l | t i | 5 7 2 9 5 9 9 2 9/194-536 c ase21h/l-100 case21c/l-341 g n l | t i | 5 7 2 9 5 9 9 2 9/194-536 c ase21h/l-100 case21c/l-341 g nl|ti|572959929/194-536 case21h/l-100 case21c/l-341 g nl|ti|572959929/194-536 c ase21h/l-100 case21c/l-341 g n l | t i | 5 7 2 95992 9/194-536 c ase21h/l-100 f l a n k i n g the d e l e t i o n , probably TTACGTACCCTTTGATGAAGTATAAGCAAAAAGTTTATATTTGGACAAATTAAATTCTGG TTACGTACCCTTTAATGAAGTATAAGCAAAAAGTTTAGATTTGGACAAATTAAATTCTGG TTACGTACCCTTTGATGAAGTATAAGCAAAAAGTTTATATTTGGACAAAT CCCTGCCACTAACTTGCTCTGTAGCCTGAGTTTACTTATTTCAACTCACATAAGCCTCAA CCCTGCCACTAACTTGCTCTGTAGCCTGAGTTTACTTATTTCAACTCATATAAGCCTCAG TTTGCTCATCAGTAACATGGAGATGATAACACCTTACTCAAAGAATTGTGGTAGAAATAA TTTGCTCAACAGTAACATGGAGATGATAACAGCTTACTCAAAGAATTGTGGTAGAAATAA ACTGACTTCTGAATATAAAGTGCTTAGCACAGAGTTGGGCTTATAGCA TTCATTAA ACTGACTTCTGAATATAAAGGGCTTAGGACAGAGTTGGGCTTATAGCAAGCATTCATTAA CATGAATAACATTATGTCAGTATTTTTAAAACAAGTACCCACCATGAATTAGAATATAAA CATGAATAACATTATGTCAATATTTTTAAAACAAGTATCCACCATGAATTATAATATAAA ATAAA GTATGCCTGAATAATTAAGATGAAACATAAGACTGAATTCAAATA GTATGCATGAATAATTAAGATAAAACATAAGACTGAATTCAAATA GTATGCATGAATAATTAAGATGAAACATAAGACTGAATTCAAATA c hr5:17 6660475:17 6 660732:scaffold_34495:399572 :. . . 
p r e c i s e d e l e t i o n o f A l u J o i n c himpanzee C LUSTAL c ase22h/l-387 g n l | t i I 5 37031087/370-730 c ase22c/1-130 c ase22h/l-387 g nl|ti|537031087/370-730 case22c/l-130 c ase22h/l-387 g nl|ti|537031087/370-730 case22c/l-130 c ase22h/1-387 g n l | t i | 5 37031087/370-730 case22c/l-130 c ase22h/l-387 CCTGATTTCTTCACTGTTTACATGCTGTAACATCTACACATCATGCTAAGAAAAAAAAAA CCTGAGTTCTTCACTGTTTACATGCTGTAACATCTACATATCATGCTTAAAAAAAAAAAA CCTGATTTCTTCACTGTTTACATGCTGTAACATCTACACATCATGCTAAGAAAAAAAAAA AAAAAAAGGAGAGAGAGAGCGAGAGAAC A CTTTCC AGCCTGGGC AAC AT AGTAAGACCCC GAGAGA—GAGAGAGAGAGTGAGAGAACACTTTCCAGCCTGGGCAACATAGTTAAGACCC AAAAA CTTTCTCAACAAAAAATAAAAAAA-ATTGCCCACGCATGGTGGCAAGTGGCTGCAGTCCC CCATCTCAATAAAAAATAAAAAAATATTACCCATGCATGGTGGCAAGTGGCTGTAGTCCC AGCTACTTGGGAAGCTGAGTTGGGAGGATTGCTTGAGCCCAGGAGCTCAAGACTACAATG "AGCTACTTGGGAAGCTGA TTGCTTGAGCCCAGGAGCTCAAGACTACAATG AGCTATGATCACGCCACTGTACTCAAACCTGGACAACAAGACCTCATCTCTTATTAAAAA 12 5 g n l | t i I 5 37031087/370-730 case22c/l-130 c ase22h/l-387 g nl|ti|537031087/370-730 case22c/l-130 c ase22h/l-387 g nl|ti|537031087/370-730 c ase22c/1-130 AGCTACGATCACGCCACTGCACTCCAACCTGGACAA GACCTCATCTCTTATTTAAAA AAAAAAAAAAAAAAAAAAAAAAAGAACACTTTACTTTAGCGCAGCTTTATGAAGTGCTTT AC AAAAAAAAAAGAACACTTTACTT-AGCACAGCTTTATGAAGTGCTTT GAACACTTTACTTTAGCACAGCTTTATGAAGTGCTTT ACCAGCTGGTCTCAGCTGACTATTCCTA ACCAGCTGGTCTCAGCTGACTATCCCTA ACCAGCTGGTCTCAGCTGACTAGTCCTA c hr6:12 9040862:scaffold_37501:5172375:5172 650:...bad match.... 
chimpanzee and rhesus loci do not match
>case23c
TAAGAAATACATCGAGAATATAAAAAGATAATTTAAACATCCTGAAAATA
TTGGGAGGCCGAGACGGGCGGATCACGAGGTCAGGAGATCGAGACCATCT
TGGCTAACACGGTGAAACCCCGTTTCTACTAAAAATACAAAAAATTAGCC
GGGCGTGTTGGCGGGCGCCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCA
GGAGAATGGCATGAACCTGGGAGGCGGAGCTTGCAGTGAGCCGAGATCGC
GCCACTGCACTCCAACCTGGGAGACACAGCGAGACTCCGTCTCAAAAAAA
AAAAAAAAAAAACATCCTGAAAATAAGAAATAATGCCATTAAAAGAAAAT
TCAATACATATGATTAATATAAAAA
>case23h
TAAGAAATACATCGAGAATATAAAAAGATAATTTAAACATCCTGAAAATA
AGAAATAATGCCATTAAAATAAAATTCAATACATATGATTAATATAAAAA
>gnl|ti|540771525
NNNAAAACCTTTATCNNNNNGTGGGCCCGCGTTGTCATCAGATGGGGGGG
CTCATAACACTCGTTCCAGCTACCCAGGTGTTTAGGTCGGAGAATCACCT
AAGCCCAGGTGGTCATGGCTGCACCGAGTCACGATCGCATCACCGCATTC
GAAGCCTGGGTGACAGAGTGAGACCCCGTCTGGGGGTAAAAAAAAAAAAA
AGTAAGTAAGTAAAGTAAAGTAGTGAAGTAAAGAAAAAAAATAGAGAATC
TTAAGAAATCCATCGAGAATATAGAAAGATCGTTTTAAAATAAAAACAAT
GCCATTAAGGGAGGAGACCACCCCTCATATCATCTTATGCCCAAGTTCTG
CCTCCAAAGAATGAAGAAGTAAAAACTAAAAGGCAGAAATGAAGTCTACA
GGCAGACAGCCCGGCGCTGCACCCTGGGCCTGGTAGTTAAAGATCTATCT
AATCGGTTCTGTTATCTGTAGATTACAGACATTGTATAGAAATGCACTGT
GAAAATTCCTATCTTGTTTTGTTCCAATTACCGGTGCATGCAGCCCCCAG
TCACGTACCCCCTCCTTGCTCAATCAATCATGACCCTCTCACATGCACCC
CCTCAGAGTTGGGAGCCCTTAAAAGGGACAGGAATTGCTCACTCAGGGGG
CTGGGCTCTTGAGACAGGAGTCTTGTGGACACCCCCAGCCGAATAAACCC
CTTCCTTCTTTAACTTGGTGTCTGAGGGGTTTTGTCTGCAGCTCATCATG
CTATACCATTAAAATAATATTCAATACATATGATTAATATAAAAATGGTA
ACAACTACAGAATGACTAGGGAGATGACATAAGGATGGAACTCATCTTGC
ATACAGTTCAGAGAAGCTAGAATAAACATAGAAATGTAATTGAGAATATA
GAAGGTAAGGTTGAAAGCTAAAACTCTTTCTTNATGNNGAAGACTGGTAG
AGATAGAATATTTGAGCGATACTGCTTTGAGATGCTCCAGAATTTTCATG
GGTTCCCATCTGTGNTTTCCCTTATGCTAGNCACTGCGTTTCAAAATTAA
ATGGAAATTCCNGAATAAAATTCNTTTTAATTTGTTAGGAATACTGCTCC
AAACAGGGTTCCCTTATTCTGTTCCCACCAAAAAGGGTTCAAAATCTGTT
GTAAAAAAAAGGTTTCAATATCTGTTTCATAAAAAGGTGGGGGCTGTTTT
TTATTTAATCACAATCGGAAGTAAAAACAAGTTTCCTGACTTCACAATAA
AAAGAAGCTCGCCCCTTTTCTTTTTTTTTTTTTTTACCCCCACGCCCAAA
AATTTATGGTTTGTTATTATTAAATTNNNNNNNNNNNNNNNNNNNNTTTT
ATTTTTTTTTAACCACAAAAAAAAAACAACCTTTTTTTTTTTTTTTNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCCNN
NNNT

chr7:5779663:scaffold_37557:3099237:3099555:+ ...
precise deletion of AluSg in human
CLUSTAL

case24c/1-468             TGAACTGCAACTGTGATACTTATTAGCATGTCACCTGGTGTTTTGTTTTCATTTCACTTT
gnl|ti|567316288/263-732  TGCACTGCAATAGTGATACTTATCAGCACGTCACCTGGGGTTTTGTTTTCATTTCACTTT
case24h/1-150             TGAACTGCAATAGTGATCCTTATCAGCATGTCACCTGGTGTTTTGTTTTCATTTCACTTT

case24c/1-468             AAGAATGTGCCATGTTGGCCGGGTGTGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGA
gnl|ti|567316288/263-732  AAGAATGTGCCATGTTGGCCGGGCGTGGTGGCTCATGCCTGTAATCCCAGCACTTTGGGA
case24h/1-150             AAGAATGTGCCATGT

case24c/1-468             GGCCGAGGCGGGAGGATCACAAGGTCAGGAGTTTGAAACCAGCCTGACCAACATAGTGAA
gnl|ti|567316288/263-732  GGCCGAGGCGGGCGGATAACAAGGTCACAAGTTCAAAACCTGCCTGACCAACATGGTGAA
case24h/1-150

case24c/1-468             ACTCCATCTCTACTAAAAATACAAAAAAATAAAAAATTAGCCAGACATGGTGGCGCATGC
gnl|ti|567316288/263-732  ACACCATCTCTACTAAATATACAAA      AAAAGTCAGCCAGGCGTGGTGGCGCATGC
case24h/1-150

case24c/1-468             CTGTAATCCCAGCTACTCTGGAGGCTGAGGCAGGAGAATCGCTTGAACCTGGGAGGCAGA
gnl|ti|567316288/263-732  CTGTAATCGCAGCTACTGGGGAGGCTGAGGCAGGAGAATCGCTTGAACCTGGGAGATGGA
case24h/1-150

case24c/1-468             GGTTGCAGTGAGCCGAGATCGTGCCATTGCACTCCAGCCTGGGCAATGAAAGTGAAACTC
gnl|ti|567316288/263-732  GGTTGCAGTGAGCCGAGATCGTGCCATTGCACTCCAGCCTGGGCAATAAGAGTGAAACTC
case24h/1-150

case24c/1-468             CATCTCAAAAAAAAAAAG        AAGAATGTGCCGTGTGATCGTATTTCTAATCCCT
gnl|ti|567316288/263-732  CATCTCAAAAAAAAAAAAAAAAAAAAAAGAATTTGCCATGTGATCGTATTTCTAATCCTT
case24h/1-150                                                      GATCGTATTTCTAATCCCT

case24c/1-468             TTCACTCTGGAATCCTGCTCTTACCATATTAATGTTGATTAGCATCTCAGGTTTCA
gnl|ti|567316288/263-732  TTCACTCTGGAATCCTGTCCTCACCATATTAATGTTGAATAGCATCTCAGGTTTCA
case24h/1-150             TTCACTCTGGAATCCTGCCCTTACCATATTAATGTTGAATAGCATCTCAGGTTTCA

chr7:24969020:scaffold_36484:641689:641996:+ ...
precise deletion of AluSc in human: 2 of these sites in human and chimpanzee genomes,
both filled with AluSc in chimpanzee, one empty and one filled site in human
CLUSTAL

case25c/1-461             AGGCCCGTTCTGTGTCCGGAGGGGCTGTGATCCTATCAGGACAGGAATCCAGCTCGGAGC
gnl|ti|541662238/247-704  AGGCCCGTTCTGTGTCCGGAGGGGCTGTGATCTTATCAGGACAGGAATGCAGCTCGGATC
case25h/1-150             TGGCCTGTTCTGTGTCCAGAGGGACTGTGA     TCAGGACAGGAACCCAGCTCGGAGC

case25c/1-461             TCCTATTA-AAGATGACTGTTGGCCGGGTGCAGTGGCTCCCGCCTGTAATCCCAGCACTT
gnl|ti|541662238/247-704  TCCTGTGGTAAGATGACTGTTGGCCGG-TGCGGTGGCTCCCGCCTGTAATCCCAGCACTT
case25h/1-150             TCCTGTGATAAGATGACTGTT

case25c/1-461             CAGGAGGTCGAAGCGGGCAGATCATGAGGTCAAGAGATCGAGACCGTCATGGCCAACATC
gnl|ti|541662238/247-704  CAGGAGGCCGATGAGGGCAGATCACAAGGTCAAGAGATTGAGACCATCATGGCCAACATG
case25h/1-150

case25c/1-461             GTGAAACCCCGTCTCTACTAAAAATACAAAAATTAGCTGGGTGTGGTGGCAGGCACCTGT
gnl|ti|541662238/247-704  GTGAAACCCCGTCTCCACTGAAAATAGAAAAATTAGCTGGATGTGGTGGCAGGCGCCTGT
case25h/1-150

case25c/1-461             AGTCACAGCTATTTGGGAGGCTGAGGCAGGAGAATCGCTTGAACCCAGGAGGTGGAGGTT
gnl|ti|541662238/247-704  AGTCCCAGAAACTTGGGAGGCCAAGGCAGGAGAAACGCTTAAACCTGGGAGGTGGAGGTT
case25h/1-150

case25c/1-461             GCAGTGAGCCAAGATCGCGCCACTGCACTCCAGCCTGGTGACAGAACGAGACTACGTCTA
gnl|ti|541662238/247-704  GCAGTGAGCCAAGATTGCGCCACTGTACTCCAGCCTGGTGACAGAATGAGACTCCATCTC
case25h/1-150

case25c/1-461             AAAAAAAAAAAAAAAAAAAAAGACTGTAGATACTAATATAAACCCCACCTTCCCCAAATC
gnl|ti|541662238/247-704  AAAAAAAAAAAAAAGAT    GACTGTTGATATTAATATAAACCCCACCTTCCCAAAATT
case25h/1-150                                         GATATTAATATAAACCCCACCTTCCAAAAACC

case25c/1-461             TGTTTATCCTGATCTTAAGATACGGC-ACATGTTAATGAGTTG
gnl|ti|541662238/247-704  TATTTATCCTGATCTTAAGATATGGCTACGTGTTAAAGAGTGG
case25h/1-150             TATTTATCCTAATCCTAAGATAAGAC-ACATTTTAATGAGTTT

chr7:128710174:scaffold_37671:707170:707496:...
p r e c i s e d e l e t i o n o f A l u S x i n human CLUSTAL c ase2 6 c / l - 4 7 6 g nl|ti|513419099/206-664 c ase2 6h/1-150 case26c/l-476 g nl|ti|513419099/206-664 c ase26h/l-150 case26c/l-476 g nl I ti|513419099/206-664 c ase26h/l-150 c ase2 6 c / l - 4 7 6 g nl|ti|513419099/206-664 c ase26h/l-150 case26c/l-476 g nl|ti|513419099/206-664 c ase26h/l-150 c ase2 6 c / l - 4 7 6 g nl|ti|513419099/206-664 c ase26h/l-150 TTGCTGTGAAGCTATGGGGGCTTTTTATGAGGAAGGCTCTATGCAAGGGTGAGGATGGAA TTGCTGTGAAACTATGGGGGCTTTTTGTGAGGAAGGCTCTATGCAAGGGAGAGGATGGAT TTGCTGTGAAGCTATGGGGGCTTTTTATGAGGAAGGCTCTATGCAAGGGTGAGGATGGAT ATGGGGTTTGATGGTTAGAAAGTGGGAGAGAAGGCCAGGTGCAGTGGCTCACACCTGTAT ACGGGACTTGATGGTTAGAAAGTGGGAGAGAAGTCCAGGTGCAGTGGCTCACACCTGTAA ATGGGGCTTGATGGTTAGAAAGTGGGAGAGAA TCCCAGCACTTTGGGAGGCCGAAGTGGGTGGATCACCTGAAGTCAGGAGTTCAAGACCAG TCCCAGCACTTTGGGAGGCCAAAATGGGTGGATCACCTGAGGTCAAGAGTTCGAGACCAG CCTGGCCAACATGGTGAAACCCTGTCTCTACTAAAAATAAAAAAATTAGCCGGGCATGGC CCTGGCCAATGTGGTAAAACCCCGTCTCTCCTAAAAATAAAAAAATTAGCTGGGCATGGT GGCGTACACCTGTAATCCCAGCTACTCGGAAGGCCGAGGC AGAAGAATTGCTTGTAC GGCAGGCGCCTGTAGTCCCAGCTACTCGGGAGGCCAAGGCCAAGGGAGAATGGATTGAAC CTGGGAGGTAGATGTTGCAGTGAGCCAAGATTGCACCATTGCACTCCAGCCTGGGTGACA CTGGGAGCTTGAGCTTGCAGTGAGCCAAGATCACGCCACTGTACTCCAGCCTGGGTGACA c ase2 6 c / l - 4 7 6 GGGGGAGACTCCGTCTCAAAAAAAAAAAAAAAAAGAAAAAGAAAAAGAAAGTGGGAGAGA g n l | t i | 5 13419099/206-664 GAGTGAGACT--ATCTCAAAAAAAAAAAAAAA GTGGGAGAGA c ase26h/l-150 c ase2 6 c / l - 4 7 6 g nl|ti|513419099/206-664 c ase2 6h/l-150 AGGGAACAGTGACTGCAGGAATAGAAAAAATGTCCACAGAGGAGGAATGAGGACATGGC AGGGAACGGTGACTACAGGAATAGAAAAAGTGTCCACTGAGGAGGAAGGAGGACTTGGC -GGGAACGGTGACTGCAGGAATAGAAAAAATGTCCACAGAGGAGGAAGGAGGACATGGC c hr7:132523884:132524216:scaffold_32923:964569:...AluY i n R hesus, g ene c o n v e r s i o n t o A luYb9 i n human, p r e c i s e d e l e t i o n i n c himpanzee CLUSTAL c ase27h/l-500 TTACTACAGAAATTATGGGAAATGCTTACTTTTTAAAAGGAGAAAGGGAAAGAATCTCAT g n l I t i I 4 96205884/177-647 
TTACTGCAGAAATTGTGGGAAATGCTTACTTTTTAAAAGGAGAAAGGGAAAGAATCTCAT c ase27c/l-168 TTACTACAGAATTTGTGGGAAATGCTTACTTTTTAAAAGGAGAAAGGGAAAGAATCTCAT c ase27h/l-500 GTTGAAAATTGTCATTAGGTAAAGTAATAGAATTAATGGAGCAGGCCGGGCGCGGTGGCT g n l | t i | 4 96205884/177-647 GTTGAAAATTATC GCGGTGGCT c ase27c/l-168 GTTGAAAATTGTCATTAGGTAAAGTAATAGAATTAATGGAGCA c ase27h/l-500 g nl|ti|496205884/177-647 c ase27c/l-168 CACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGTGGATCATGAGGTCAGGAGAT CAAGCCTGTAATCCCAGCACTTTGGGAGGCCGAGATGGGCGGATCACGAGCTCAGGAGAT c ase27h/l-500 CGAGACCATCCTGGCTAACAAGGTGAAACCCCGTCTCTACTAA-AAATACAAAAAATTAG g n l I t i I 4 96205884/177-647 CGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTTTATTAAGAAATACAAAAAAT-AG c ase27c/l-168 c ase27h/l-500 g nl|ti|496205884/177-647 c ase27c/l-168 case27h/l-500 g n l | t i I 4 96205884/177-647 c ase27c/l-168 CCGGGCGCGGTGGCGGGCGCCTGTAGTCCCAGCTACTGGGGAGGCTGAGGCAGGAGAATG CCGGGCGAGGTGGCAG-CGCCTGTAGTCCCAGCTACTCGGGAGACTGAGGCCGGAGAATG • GCGTTGAACCCGGGAAGCGGAGCTTGCAGTGAGCCGAGATTGCGCCACTGCAGTCCGCAG GCAT-GAACCCGGGAGGCGGAGCTTGCAGTGAGCTGAGATCCGGCCACTGCACTCC 155 c ase27h/l-500 g n l | t i I 4 96205884/177-647 c ase27c/l-168 c ase27h/l-500 g nl|ti|496205884/177-647 c ase27c/l-168 c ase27h/l-500 g nl|ti|496205884/177-647 c ase27c/l-168 TCCAGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAA— -—AGCCTGGGCTACAGAGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAGAAAAT AATAGAATTAATGGAGCAATCCAAAATGAGCTAAATAAGTTTCT TATCACTAGGTAAAGTAATAGAATTAACGGAGCAATTCAAAATGAGCTAAATAAGTTTCT ATCCAAAATGAGCTAAATAAGTTTCT TTGTGAGAATTCTTCTAGAGATTAATTCTAGAATTCTTC TTGTGAGAATTCTTTTAGAGATTAATTCTAGAATTCTTC TTGTGAGAATTCTTCTAGAGATTAATTCCAGAATTCTTC c h r 9 : 3 2 2 4 9 9 9 2 : s c a f f o l d _ 3 7 4 1 9 : 6 540548:6540685:. . . 
i m p r e c i s e d e l e t i o n o f a MIRb e lement i n human, no f l a n k i n g CLUSTAL c ase28c/l-237 g n l | t i I 5 11659877/91-328 c ase28h/l-100 identity CTCCTTAAAGGAAAGTGAATGTGAACTGAAGTCTAGACCAACACC-TTATGCACAGTCTA CTCCTTAAAAGAAAGGAAATGTGAACTGAAGTTGAGAACGACACCCTTATGCACAGTATA CTCCTTAAAGGAAAGTGAATGTGAACTGAAGTCTAGACCAACACCTT c ase28c/l-237 AGAGGAAAGAATATGAGACTGAAGCTATTGAGGACTGGGTTTAAGTCACTATCCTATTAT g n l | t i | 5 11659877/91-328 AGAGGAAAGAATATGAGACTGAAGCTGTTGAGGACTGGATTTAAGTCACTATCCCATTAT c ase28h/l-100 c ase28c/l-237 g nl I ti|511659877/91-328 c ase28h/l-100 c ase28c/l-237 g nl|ti|511659877/91-328 c ase28h/1-100 TTACTATCTTTGCAATCTAGGATAAAGTACTTAACTCCCCTGAACCTTGGATCCTGAAAC TTACTATCTTTGCAATCTAGGACAAAGTACTTAACTCCCCTGAACCTTGGATTCTGAAAC CACACATGGGAAGACATTGTTACCTTTGTAGCACAGGATTGATGTGTTAACAAATGGG TATACATGGGAAGACACTGTTACCTTTGTAGCACAGGATTGAGGTGTTAACAAATGGG ATGGGAAGACATTGTTACCTTTGTAGCACAGGATTGATGTGTTAACAAATGGG c hr9:677288 6 1:67729179:scaffold_34695:1245496:+ . . . i n v e r s i o n o f A luY i n R hesus, gene c o n v e r s i o n t o A luYb8 i n human, p r e c i s e d e l e t i o n i n c himpanzee CLUSTAL c ase29h/l-479 g n l I t i I 5 56286157/11-486 case29c/l-161 c ase29h/l-47 9 g n l | t i I 5 56286157/11-486 case29c/l-161 c ase2 9 h/l-47 9 gnl|ti|556286157/ll-486 case29c/l-161 c ase2 9 h/l-47 9 g nl|ti|556286157/11-486 case29c/l-161 c ase2 9 h/l-479 g n l | t i I 5 56286157/11-486 case29c/l-161 c ase29h/l-47 9 g nl|ti|556286157/11-486 case29c/l-161 CTTTCTACC-CACTTAACAGACTTGTCTTCTCTTCCCCATGGGAATAAAATTTTAGGCAG GATGGTAAGTAGTTCATC—ACTTGTCTTCTCTTCACCATTGGAATAAAATTTTAGGCAG CTTTCTACC-CACTTAACAGACTTGTCTTCTCTTCCCCATGGGAATAAAATTTTAGGCAG GTCTCTTAGATCTCCCGGTTA GGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGC GTCTCTCAGATCTCCTGGTTATTXXGGCTGGGCGTGGTGGCTCATGCCTGTAATCCCAGC GTCTCTTAGATCTCCTGGTTATT ACCTTGGGAGGCCGAGGCGGGTGGATCATGAGGTCAGGAGATCGAGACCATCCTGGCTAA CCTTTGGGAGGCCAAGGGGGGCGGATCACAAGGTCAGGAGATGGAGAGCATCCTGGCTAA CACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCCGGGCGCGGTGGCGGGCG 
CACAGTGAAACCCTATCTGTACTAAAAATACAAAAAATTAGCTGGGCATGGTGGTGGGCG CCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAAGCGG CCTGTAGTCCCAGCTACTGGGGAGGCTGAGGCAGGAGAATGGCATGAACCCAGGAGGCAG AGCTTGCAGTGAGCCAAGACAGCGCCACTGCAGTCCGCAGTCCGGCCTGGGCGACAGAGC AGCTTGCAGTGAACCGAGATCACGCCACTATAC CACTCCAGCCTCGGCGAAACAGC c ase29h/l-479 GAGACTCCGTCTCAAAAAAAAAAAAAAAAAGAT—CTCCCGGTTATTCTTAAGTGGTGAA g n l | t i | 5 56286157/11-486 AAGACTCGGTCTCAAAATAATAATAATAATGATXXCT GGTTAGTCTTAAGTGGTGAA case29c/l-161 CTTAAGTGATGAA 156 c ase29h/l-479 g nl I ti|556286157/11-486 case29c/l-161 c ase29h/l-479 g nl|ti|556286157/11-486 case29c/l-161 AAGTTTGACTACTAGTCACATAATACATCCAATGTGATAATAAATAACTCAAATTCTCAT GAGTTTGACTACTAGTCACATAATACATCCAGTATAATAATAAA-AACTCAAATTCTCAT AAGTTTGACTACTAGTCACATAATACATCCAATGTGATAATAAATAACTCAAATTCTCAT CTAATA CTAATA CTAATA c hr9:84428070:84 4 28205:scaffold_36211:2258065:. . . p r e c i s e d e l e t i o n o f an A luSg/x fragment i n c himpanzee CLUSTAL A l u b oundary shown b y | ' 1 c ase30h/l-306 g nl|ti|542095096/367-694 case30c/l-173 c ase30h/l-306 g nl|ti|542095096/367-694 case30c/l-173 AGTAAGACTCTGTCTCAAAAAAAAAAAA— TTAAATCCTTAGAAAGA AGTAGGATTCTGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAATTTAAATCCTTAGAAAGA AGTGAGACTCTGTCTCAAAAAAAAAAAAA ATTAAATCCTTAGAAAGA TGTATCCTCTGCTGCTGCTCTAGAACTCTAGGT|GGAGGCAAAGGTTGCAGCCAGCGGAGA TGTATCCTCTGCTGCTACTCTAGAACTCTAGGT|GGAGGCGGAGGTTGCAGTGAGTGGAGA TGTATCCTCTGCTGCTGCTCTAGAACTCTAGAT|GGAGGCAGA c ase3Oh/1-306 TCATGCCACTGCACTCCAGCCTGGGCAACAGAGTGGCACTCCATCTCAAAAAAAAAAAAA g n l | t i | 5 42095096/367-694 TCATGCCACTGCCCTCCAGCCTGGGCAACAGAGTGACACTCCATCTCAAAAAAAAAAAAA case30c/l-173 c ase30h/l-306 g nl|ti|542095096/367-694 case30c/l-173 c ase30h/l-306 g nl|ti|542095096/367-694 case30c/l-173 c ase30h/l-306 g n l | t i I 5 42095096/367-694 case30c/l-173 AAATCCTTAAAAGTATGTATCCTCTGCTGCTACTCTAGAACTCTAGGTGGAGG TGTATTCAAATCCTTAAAAAGATGTGTCCTCTACTGCTACTCTAGAACTCTAGGTGGAGG CAGATTTCTAAACCAGTTCTTTTGCACAGGAATTACTGGTTTGGGTGGGGCGTGGTGACG 
CAGATTTCTAAACCAGTTCTTTTGTACAGGAATTACTGGTTTGGGTGGGGCGTAGTGACG TTTCTAAACCAGTTCTTTTGCACAGGAATTACTGGTTTGGGTGGGGCGTGGTGACG CTGCTGTCTCAGAATTAATGTAATCATG CTGCTGTCTCAGAATTAATGTAATCATG CTGCTGTCTCAGAATTAATGTAATCATG c hrlO:35597212:35597542:scaffold_37564:16236237:+ . ..independent i n s e r t i o n s o f A luYa5 i n human and L 1PA5 i n R hesus CLUSTAL c ase31h/l-430 case31c/l-100 g n l | t i I 4 70892232/187-735 c ase31h/l-430 case31c/l-100 g n l | t i I 4 70892232/187-735 c ase31h/l-430 c ase31c/l-100 g n l | t i I 4 70892232/187-735 c ase31h/l-430 c ase31c/l-100 g n l | t i I 4 70892232/187-735 c ase31h/l-430 case31c/l-100 g n l | t i I 4 70892232/187-735 c ase31h/l-430 case31c/l-100 AAACCTGCAATATGCTAACCAAACCACTTTTAATTAAAAGGAGAAAAAAGGCCGGGAGCG AAACCTGCAATATGCTAACCAAACCACTTTTAATTAAAAGAAGAAAAAA AAACCTGCAATATGCTAACAAAACCACTTTTAATTAAAAGAA GTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCACGAGGTCA TATTGCGGCACTATTCACAATAGCAAAGACTTGGAACCAACCCAAATGTCCA AGAGATCGAGACCATCCC-GGCTAAAACGGTGAAACCCCGTCTCTACTAAAAATACAAAA TCAATGATAGACTGGATTAAGAAAATGTGGCACATATACACCATGGAGTACTGTGCAGCC AAATTAGCCGGGCGTAGTGGCGGGCGCCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAG ATAGAAAAGGATGAGTTCATGTCCTTTGTAGGGACATGGATGAAGCTGGAAACCATCATT GAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACT CTGAGCAAACTATCACAAGGACAGAAAACCAAACACTGCATGTTCTCACTCATAGGTGTG CCAGCCTGGGCGACAGAGCGAGACTCCGT-CTCAAAAAAAAATAAAAAATAAAAAAAAAA 1 57 g n l | t i l 4708 9 2232/187-735 c ase31h/l-430 case31c/l-100 g n l | t i I 4 70892232/187-735 c ase31h/l-430 case31c/l-100 g nl|ti|470892232/187-735 c ase31h/l-430 case31c/l-100 g n l | t i I 4 70892232/187-735 c ase31h/l-430 case31c/l-100 g nl|ti|470892232/187-735 AATCGAACAATGAGAACACTTGGACACAGGAAGGGGAACATCACACACCAGGGCATGTCG AAAAAAT GGGTGGGGGGGATAGCATTAGGAGATATACCTAATGTAAATGACGAGTTAATGGGTGCAG CACACTAATATGGCACATGTATACATATGTAACAAACCTGCAGGTTGTGCACATGTACCC AAAAGGAGAAAAAAGCCTTTTCAAGATCTT GCCTTTTCAAGAACTT TAGAACTTAAAGTATAATAATTAAAAAAAAAGAAGAAGAAAAAAGCCTTTTCAAGAATTT ACAACGGCTCTTATTAGATTATAAATTGTAACCTC 
ACAATGGCTCTTATTAGATTATAAATTGTAACCTC ACAATGGCTCTTATTAGATTATAAAGTGTAACTTC c h r l l : 8 2826588:scaffold_37428:2205199 = 2 205501:...no TSD d i s c e r n a b l e , i m p r e c i s e d e l e t i o n o f L1PA13 e lement i n human, 2 bp f l a n k i n g i d e n t i t y , probably blunt-end d e l e t i o n C LUSTAL case32c/l-402 g n l | t i I 5 29982443/82-484 c ase32h/l-100 case32c/l-402 g nl|ti|529982443/82-484 c ase32h/l-100 case32c/l-402 g nl|ti|529982443/82-484 c ase32h/l-100 case32c/l-402 g nl|ti|529982443/82-484 c ase32h/l-100 case32c/l-402 g n l | t i I 5 29982443/82-484 c ase32h/l-100 case32c/l-402 g nl|ti|529982443/82-484 c ase32h/l-100 case32c/l-402 g nl|ti|529982443/82-484 c ase32h/l-100 TGCAGCAGAATTTATGAGGTAGTGATTATTATGTTAAGTGCAAGAACT-GAAGTGGGAGC TGCAGAAAAAGGTATGAGGTAGTGATTATTATGTTAAGTGCAAGAATTAGAAGTGAGAGT TGCAGCAGAATTTATGAGGTAGTGATTATTATGTTAAGTGCAAGAACT-GAAG TAAATGATGAGAACACCTGGACACATAGAGGTGAACAACACACACTGGGGCCTATTGGAG TAAATGATGAGAACACATGGACACATAGAGGGGAGCAACACACACTGGGGCCTGTTGGAG : GGTAAAGAGTAGGAGGAGGGAAAGGATCAGGAAAAATAGCTAATGGGTACTAGGCTTAAC GGTAAAGAGTAGGAGGAGAGAGAGGATCAGGAAAAATAGCTAATGGGTTCTAGGCTTAAC ACCTGAATGACAAAATAATGTGTACAGCAAACCCCCATGA ATTTACCTATCTAAC ACCTGGGTGACAAAACAATGTGTACAGCAAAACCCCGTGACACAAATTTACCTATATAAA AAACCTGCACACGTACCCCTGGAACTTGAAAATTAAACTTTAAAAACCTGAAAATACAGA AAACTTGCACATGTACCCCTGGAATTTACAAGTTAAATTTTTAAAAACTGAGA GA ATGTCAAAAAA-TGATTTTCTATCTTTGTAAGTTTGGGATAATGGAAATCCACAGAGACA ATGTCAAGAAAGTGATGTTCTATCTTTGTAAGTTTGGGATGATGGAAATCCACAGAGATA GAATAAATAAGGTTTCAAGATCCCCTACAATCCTTATAGAGAAGCTGAG GAATAAATAAGGTTTCAAGATCCC-TACAATCCTTATAGAGAAGCTGAG -AATAAATAAGGTTTCAAGATCCCCTACAATCCTTATAGAGAAGCTGAG c hrl2:48585272:48585595:scaffold_37077:4241894:. . . 
p r e c i s e d e l e t i o n o f A l u S x i n c himpanzee CLUSTAL c ase33h/l-48 6 g n l | t i I 5 40393548/121-601 case33c/l-163 c ase33h/l-486 g nl|ti|540393548/121-601 case33c/l-163 c ase33h/l-4 86 TTAAGTGTGTTTGTGCAAAATTGGTGGTTATGGAGTGGGGGACAAAGGGCAAAAAGGGGT TTAAGTGTGTTTGTACAAATTTGGTGGTTATGGAGTGGGGGACAAAGGGCAAAAAAGGAT TTAAGTGTGCTTGTGCAAAATTGGTGGTTATGGAGTGGGGGACAAAGGGCAAAAAGGGGT GGGAT AAAGACTTTGATAATT AGGCC AGGCACGGTGGCTCACACCTGTAATCCCAGCACT GGGATAAAGACTTTGATAATTAGGCCAGGCGTAGTAGTTCATGTCTGTAATCCCAGCCCT GGGATAAAGACTTTGATAATTT TTGGGAGGCTGAGGAGGGTGGATCACTTGAGGTCAGGAGTTGGAGACCAGCCTGGCCAA- 158 g n l I t i | 5 4 0 3 9 3 5 4 8 /121-601 case33c/l-163 c ase33h/l-4 8 6 g nl|ti|540393548/121-601 case33c/l-163 c ase33h/l-486 g nl|ti|540393548/121-601 case33c/l-163 c ase33h/l-486 g n l | t i I 5 40393548/121-601 case33c/l-163 TTGGGAGGCTGAGGTGGGCGGATCACTTGAGGTCAGGAGTTGGAGACCAGCCTGGTCCAG CATGGTGAAACCCTGTCTCTACTAAAAATACAAAA-TTAGCCGGGCGTGGT TGCATTTCACATGGTAAAACCCTGTCTCTACTAAAAATACAAAAATTAGCCGGGCGTGGT GGTGCGCGCCTGTAATCCCAGCTACTCAGGAGGCTGAGGCAGGAGAATCACTTGAACCCG GGTGCATGCCTGTAATCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATCACTTGAACCCA GAAGACAGAGGTTGCAGTGAGCCAAGATTGTGCCACTGCACTCCAGCCTGAGCAACAGAA GAAGACAGAGGTGGTAGTGAGCCGAGATTGTGCCGCTGCACTCCAGCCTGAGCAACAGAG c ase33h/l-4 8 6 CAAGATTTTCTCTGTAAAAAAAAAAAAAAAAAAAAAAAAAAAGACTTTGATAATTTGTCT g n l | t i I 5 40393548/121-601 CAAGACTCTGTCTC AAAAAAAAAAAAAAAGGCTT—ATAATTTGTCT case33c/l-163 GTCT c ase33h/l-48 6 g n l | t i I 5 40393548/121-601 case33c/l-163 c ase33h/l-486 g nl|ti|540393548/121-601 case33c/l-163 GCCTACCTGGGCCCTGCTTTCTCAGGGACTTTAGGTGGGTCCCTGGGCAGTGAGAGGGAA GCCCACC-GGGCCCTGCTTTCTCAAAGTCTTTAGGTGGGTCCCTGGGCAGCGAGAGGGAA GCCTACCTGGGCCCTGCTTTCTCAGGGACTTTAGGTGGGTCCCTGGGCAGTGAGAGGGAA GCAGAAGGTAAGTGGCC GCAGAAGGTAAGTGGCC GCAGAAGGTAAGTGGCC c hr12:5530564 8 :55305956:scaffold_36205:1170504 : + . . . 
p r e c i s e d e l e t i o n o f A luY i n c himpanzee CLUSTAL c ase34h/l-4 64 g nl|ti|460701445/103-576 case34c/l-156 c ase34h/l-4 64 g nl|ti|460701445/103-576 case34c/l-156 c ase34h/l-4 64 g nl|ti|460701445/103-576 case34c/l-156 c ase34h/l-464 g nl|ti|460701445/103-576 case34c/l-156 c ase34h/l-464 g nl|ti|460701445/103-576 case34c/l-156 c ase34h/l-464 g n l | t i I 4 60701445/103-576 case34c/l-156 AGGTTAAGGACCCCATTCATGAGGCAGGTTACAGAGTCCAACCTCAAAAGACTAAAAGTA AGGTTAAGGACCCCATTCGCGAGGCAGGTTACAGAGTCCAACCTCAAAAGACTAAAAGTA AGGTTAAGGACCCCATTCATGAGGCAGGTTACAGAGTCCAACCTCAAAAGACTAAAAGTA GGAAGACCATCTATCTTTTAAAAACTTCT TGGCTCACGCCTGTAATC GGAAGATGACCTATCTCTTAAAAACTTCTAGGCCAGGCGCGGTGGCTCAAGCCTGTAATC GGAAGACCATCTATCTCTTAAAAACTTCT CCAGCACTTTGGGAGGCCGAGACGGGCGGATCATGAGGTCAGGAGATCGAGACCATCCTG CCAGCACTTTGGGAGGCCGAGACGGGCGGATCACGAGGTCAGGAGATCAAGACCATCCTG GCTAACACGGTGAAACCCCGTCTCTACTAAAAA—TACAAAAAT-TAGCCGGGCATGGTG GCTAACCCGGTGAAACCCCATCTCTACTAAAAAAATACAAAAATCTAGCCGGGCGAGACG GCGCGCGCCTGTAGTCCCAGCTACACGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGG GCGGGCTCCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGG GAGGCGGAGCTTGCAGTGAGTCGAGATCGCGCCACTGCACTCCAGCCTGGGCGACACAGC GAGGCAGAGCTTGCAGTGAGCTGAGATCCGGCCACTGCACTCCAGCCTGGGCGACAGAGC c ase34h/l-4 64 GAAACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAACAACAACAAAAAACTTCTTCCTTTC g n l | t i | 4 60701445/103-576 CAGACTCCGTCTCAGGAAAAAAAAAAAAAAAAAAAAAA AACACTTCTTCCTTTG case34c/l-156 TCCTTTC c ase34h/l-4 64 g nl|ti|460701445/103-576 case34c/l-156 TTTGGTAAATTTTTGGCAAGATTACAGTATTTTAGTCTGCAAAATGCCCCCATAATAACT TTTGGTAAATTTTTGGCAAGATTACAGTATTTTAGTCTGCAAAATGCCCCCATAATAATT TTTGGTAAATTTTTGGCAAGATTACAGTATTTTAGTCTGCAAAATGCCCCCATAATAACT c hr12=122703868:122704186:scaffold_37680:2506194:. . . 
p r e c i s e d e l e t i o n o f A luY i n c himpanzee 159 CLUSTAL c ase35h/l-479 CAACCTCTGGTCTAGACCAACCAATCCAATCATTACCTATCCCACTGGCTATAAACTATC g nl|ti|526305977/153-631 CAACCTCTGGTCTAGACCAACCAATCCAATCATTACCTATCCCACTGGCTATAAATTATC case35c/l-161 CAACCTCTGGTCTAGACCAACCAATCCAATCATTACCTATCCCACTGGCTATAAACTATC c ase35h/l-47 9 g nl|ti|526305977/153-631 case35c/l-161 c ase35h/l-479 g nl|ti|526305977/153-631 case35c/l-161 c ase35h/l-47 9 g nl|ti|526305977/153-631 case35c/l-161 c ase35h/l-47 9 g nl|ti|526305977/153-631 case35c/l-161 c ase35h/l-479 g n l | t i I 5 26305977/153-631 case35c/l-161 c ase35h/l-479 g n l | t i I 5 26305977/153-631 case35c/l-161 c ase35h/l-479 g nl|ti|526305977/153-631 case35c/l-161 c ase35h/l-479 g nl|ti|526305977/153-631 case35c/l-161 TGAAATTAAATAAGTGTGTCGG-TGTG TCGGTGGCTCACGCCTGTAATCCCA TGAAATCAAATAAGTGTGTTTGCCATGGGCCGGGCACGGTGGCTCAAGCCTATAATCCCA TGAAATTAAATAAGTGTGTCTGCCATG GCACTTTGGGAGGCTGAGGCGGGCAGATCACGAGCTCAAGAGATCGAGACCATCCTGGCT GCACTTTGGGAGGCCGAGATGGGCAGATCACGAGGTCAGGAGATCGAGACTATCCTGGCT AACACGGTGAAACCCCGCCTCTACTAAAAA-TACAAAAAATTAGCCGGTCGTGGTGGCGG AACACGGTGAAACCCCGTCTCTACTAAAAAATACAAAAAACTAGCCGGGCGAGCTGGC-- GCGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCATGAACCTGGGAGG CCATAGTCCCAGCTACGCGGGAGGCTGAGGCAGGAGAATGGCGTAAACCCGGGAGG CGGAGCTTGCAGTGAGCCGAGATTGCGCCACTGCACTCCAGCCTGGGTGACAGAGCGAGA TGGAGCTTGCAGTGAGCTGAGATCCAGCCACTGCACTCCAGCCTGGGCGACAGAGCGAGA CTCCGTCTCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGTGTGTCTGCCATGTTTTC CTCCGTCTC AAAAAAAAAAAAAAAAAAAAAAAAAAAAGTGTGTCTGCCATGTTTTC T TTTC ACTTATTTTGTGCTCAGGGACATGGGCAGGTCAAGGACAAAGTCATCGCCTTCAACTACA ACTTAATTTGTGCTCAGGGACATAGGCAGGTCAAGGACAAGGTCATCACCTTCGACTACA ACTTAATTTGTGCTCAGGGACATGGGCAGGTCAAGGACAAGGTCATCGCCTTCGACTACA AACATCCCT AACATCCCT AACATCCCT c hrl4:76323357:76323468:scaffold_37670:5922279:. . . 
i m p r e c i s e d e l e t i o n o f L2 ( element i n s i d e a d e l e t i o n ) i n c himpanzee, no f l a n k i n g identity CLUSTAL c ase36h/6-211 g nl|ti|538237539/116-325 c ase36c/6-100 c ase3 6h/6-211 g nl|ti|538237539/116-325 c ase36c/6-100 c ase3 6h/6-211 g nl|ti|538237539/116-325 c ase36c/6-100 c ase36h/6-211 g n l | t i I 5 38237539/116-325 c ase36c/6-100 CCTGCCAGCATGGGAGAATGGCACTCCAAGGGGCACAGCCAGGGCGCTCCAGGGGGCAGG CCTGCCAGCATGGGAGAACGGCACTCCAAGGGGCAGAGCCAGGGCACTCCAGGGGGCAGG CCTGCCAGCATGGGAGAATGGCACTCCAAGGGGCATAGCCAGGGC AACCCCAGGGGCTGGTGTAGTACCTGGCACAGAGGAGGTGCTCAATAAATGCACATGAGT GACCCCAGGGGCTGGTACAGTACCTGGCACAGAGGAGGTGCTCAGTAAATGCACATGAGT AAATAAGAGGAGGATGGGGAGAAGTTGGATCAGTGCTGGAAAGAAG CAAACAATAT AAACAAGAGGAGGATGGGGAGAAGTTGGATCAGCGCTGGAAAGAAGGCAGCAAAGAATAT TGGAAAGAAG CAAACAATAT GAGAGCCTGGAGGCATCTCTCCCCTGCGGG GAGAGCCTGGAGGCATCTCTCCCCTGTGGG GAGAGCCTGGAGGCATCTCTCCCCTGCGGG c hrl4:81720447:81720771:scaffold_37670:482346:...AluY w i t h p a r t i a l d e l e t i o n i n R hesus, g ene c o n v e r s i o n d e l e t i o n i n c himpanzee CLUSTAL t o A luYb8 i n human, p r e c i s e 160 c ase37h/l-488 CAGTGTATTTGAAAGAAAAAAATAAGTTTCCAAATTTTTCACAGCTTATCTTCTTTATGG g nl|ti|520934604/378-835 CAGTATATTTGAAAGAAAAAA-TAAGTTTCCAAATTTTTCCCATCTTATCTTCTTTACAG c ase37c/l-164 CAGTGTATTTGAAAGAAAAAAATAAGTTTCCAAATTTTTCCCAGCTTATCTTCTTTATGG c ase37h/l-488 CTCTTTAAAAGGTCACTTATGGGCCGGGCGCGATGGCTCACGCCTGTAATCCCAGCACTT g n l | t i | 5 20934604/378-835 CCCTTTAAAAGCTCACTTATG T c ase37c/l-164 CTCTTTAAAAGGTCACTTATG c ase37h/l-488 g n l | t i | 5 20934604/378-835 c ase37c/l-164 c ase37h/l-488 g n l I t i I 5 20934604/378-835 c ase37c/l-164 c ase37h/l-488 g nl|ti|520934604/378-835 c ase37c/l-164 TGGGAGGCCGAGGCGGGTGGATCATGAGGTCAGGAGATCGAGACCATCCTGGCTAACAAG TGGGAGGCCGAGATGGGCAGATCACGAGGTCAGGAGATCGAGAGCATCCTGGCTAACACG GTGAAACCCCGTCTCTACTAAAAATACAAAAAAAATTAGCCGGGCGCGGTGGCGGGCGCC GTGAAACCCCATCTCTACTAAAAA-ATACAAAAAACTAGCCAGGCGAGGTGGTGGGCGCC —' 
TGTAGTCCCAGCTACTCGGGAGGCTGAGGCGGGAGAATGGCGTGAACCCGGGAAGCGGAG CGCAGTCCCAGCTACTCTGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCAGAG c ase37h/l-488 CTTGCAGTGAGCCGAGATTGCGCCACTGCAGTCCGCAGTCCCGCCTGGGCGACAGAGCAA g n l | t i | 5 20934604/378-835 GTTGCAGTGAGTTGAGATTCGACCACTGCACTCCA GCCTGGGAGACAGAGCGA c ase37c/l-164 c ase37h/l-488 GACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAA AGGTCACTGA g nl|ti|520934604/378-835 GACTCCGTCTCATAAAAAAAAAAAAAAAAAAAAGAAAAAAAAGAAAAAAAAGCTCACTTA c ase37c/l-164 c ase37h/l-488 g n l | t i I 5 20934604/378-835 c ase37c/l-164 c ase37h/l-488 g n l | t i I 5 20934604/378-835 c ase37c/l-164 TGAAAAAGTTTGAATTTCTAGAGCAGACACAGTTCATAGTCACTGAAGTTGTGACCTCTC TGAAAAACTTTGAATTTGTAGAGCAGGCACAGTTCATAGTCACTGAAGTTGTTGCCTCTC —AAAAAGTCTGAATTTCTAGAGCAGACACAGTTCATAGTCACTGAAGTTGTGACCTCTC TTGACAGAGAAGAAAGAGAGTAATA TTGAGAGAGAAGAAACAGAGTAATA TTGAGAGAGAAGAAAGAGAGTAATA c hr15:37146781:37147081:scaffold_37412:10461989:. . . p a r t i a l l y d e l e t e d A l u J o e x i s t i n g p r i o r t o h uman-rhesus s p l i t , no t s d ; i m p r e c i s e d e l e t i o n i n c himpanzee, no TSDs o r f l a n k i n g i d e n t i t i e s CLUSTAL c ase38h/l-400 g nl|ti|523788521/312-713 case38c/l-131 c ase38h/l-400 g nl|ti|523788521/312-713 case38c/l-131 c ase38h/l-400 g nl|ti|523788521/312-713 case38c/l-131 c ase38h/l-400 g nl|ti|523788521/312-713 case38c/l-131 c ase38h/l-400 g nl|ti|523788521/312-713 case38c/l-131 c ase38h/l-400 g nl|ti|523788521/312-713 case38c/l-131 c ase38h/l-400 TTAGCAATCATTTTGGAGGGAGTGTGCTAGACATTAAAAAAAA-TTATTAACACACATGG TTGGCAATCATTTTAGAGGGAGGGTGCTAGACATTAAAAAAAAATTATTAACACACATGG TTAGCAATCATTTTGGAGGGAGTGTGCTAGACATTAAAAAAAA-TTATTAA TTCTGGCCAGGCATGGTGGTTCATGCATATAATCC-AGCACTTTGGGAGGCCAAGGTGGA TTCTGGCCAGGCATGGTGGTTCATGCATATAATCCCAGCACTTTGGGAGGCCAAGGTGGA AAGATCCCTTGAGTCTCAGAATTTGAGACCAGCCTTGGCAACATAGTGAGGCCCCCATCT AAGATCCCTTGAGTCTCAGAATTTGGGACCAGCCTTGGCAACATAGTGAGACCACCATCT CTACAGAAAATAAAAAAAAATTAGCTGGGCATGATGACACACACCTGTAGTCCCAGTTAC 
CTACAGAAAATAAAAAAA--TTAGCTGGGCGTGATGCTACACACCTGTAGTCCCAGTTAC TTGGGAAGCTGAGGTGGGAAGGACTGCTTGAGCACAGGAGTTTGAGGCTGCAGTGAGCCA TCAGGAAGCTAAGGTGG-AAGGACTGCTTAAGCACAGGAGTTTGAGGCTGCAGTGAGCCG CGATTGCACTACTGCACTTCAGCCTGGGCAACAGAGTGAGACCTTGTCTTAAAAGTAAAT TGACTGCACTACTGCACTTCAGCCTGCGCAACAGAGTGAGACCTCGTCTTAAAAATAAAT ATAAGTTAAATTATTAAATTATTAAATTATTAAATAAAT AA GTAAAGACACGTGGTCCTTCAAAGAGAGAGGTATAAACAA 161 g nl|ti|523788521/312-713 case38c/l-131 AAATAAAGTAAAGACACATGGTCCTTCAAAGAGAGGTG--TAAACAA AA GTAAAGACACGTGGTCCTTCAAAGAGAGAGGTATAAACAA c hr15:50365066:scaffold_37399:651713:652024:. . . p r e c i s e d e l e t i o n o f A l u S q i n human CLUSTAL case39c/l-457 g nl|ti|459269000/431-893 c ase39h/1-150 case39c/l-457 g nl|ti|459269000/431-893 c ase39h/l-150 case39c/l-457 g nl|ti|459269000/431-893 c ase39h/l-150 case39c/l-457 g nl|ti|459269000/431-893 c ase39h/l-150 c ase3 9 c/l-457 g nl|ti|459269000/431-893 c ase39h/l-150 case39c/l-457 g nl|ti|459269000/431-893 c ase39h/l-150 TAACACTTAACACTACTCTGAATTCATGAAAGACCAAAGGTAGCTAATTAATATACAATT TAACACTTAACACTACTGTGAATTCACAAAAGACCAAAGGTAGCTAATGAATATACAATT TAACACTTAACACTACTCTGAATTCATGAAAGACCAAAGGTAGCTAATTAATATACAATT CCTGAAAATAAAAATTATTCAATCTCATCAAAAGTCAAAGAAGGCCAGGCGCAGTGGCTC CCTGAAAATAAAAATTATTCCGTCTCTTCAAAAGTCAAAGAAGGCCAGGCGCAGTGGCTC CCTGAAAATAAAAAAAA AAAAAAAAAAGTCAAAGAA ATGCCTGTAATCCCAGCACTTTGGGAGGGCAAGGCAAGTGGGTCACCTGAGGTCAGGAGT CTGCCTGTAATCCCAGCACTTTGGGAGACCCAGGCCAGTGGATCACCTGAGGTCAGGAGT TCGAGAGCAGCCTAGCCAACATCGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAG TCGAGACGAGCCTAGCCAACATGGCGAAACCCTGTCTCTACTAAAAATACAAAAAATTAA CCAAGTGTGGTGGCAGACACCTGTAATCCCAGCTACTCAGGAGGTTGAGGCAGGAGAATT CCAAGTGTGGTGGCAGGCGCCTGTAATCCCAGCTACTCAGGAGACTGAGGCAGGAGAATT GCTTGAACCCAGGAGGCAGAGGTTGCAGTGAGCTGAGATTGTGCCATTGCACTCCAGCCT GCTTGAACCCAGGAGGTGGAGGTTGCAGTGAGCCGAGATTGCACCACGGCATGCCAGCCT case39c/l-457 AGGCAACAAGAGCAAAACTCGGTCCAA AAAAAAAGTCAAAGAAATGCAAATTAA g n l | t i | 4 59269000/431-893 GGGCAACAAGAGCAAAACTCCGTCCAAGAAAAAAAAAAAAGTCAAAGAAATGCAAATTAA c ase39h/l-150 ATGCAAATTAA 
case39c/l-457 g nl|ti|459269000/431-893 c ase39h/l-150 ATCAACAGCAAAGTGCCACTTTTGGTCTATTAACTGAGCTAAT ATCAACAACAAAGTTCCACTTTTGGTCTATTAACTGAGCTAAA ATCAACAGCGAAGTGCCACTTTTGGTCTATTAACTGAGCTAAT c hrl6:48429275:scaffold_32947:3554158:3554416:+ . . . p r e c i s e d e l e t i o n o f A l u Y i n human CLUSTAL c ase40c/16-44 9 AAAATATATATATATATATATGTTTATATAGAATATCTTCCATGCCACAGCAATTCCATC g nl|ti|536342151/145-590 TATATATATATATATATATATATATATAAAAAATATCTTCCATACCACAGCAATTCCATC c ase40h/16-150 T ATATATATATT-ATATATATATATA TTCCATC c ase40c/16-449 CAATCACCTTTCTCAAACATGAAGGGGGGTGGAACACGAGGT-CAGGAGATCAAGATCAT g n l I t i | 5 36342151/145-590 CAATCCCCTTTCTCAAACATGAAGGGGGGCAGATCACCAGGTTCAGGAGATCAAGACCAT c ase40h/i6-150 CAATCCCCTTTCTCAAACATGAAGGGG c ase40c/16-449 g nl|ti|536342151/145-590 c ase40h/16-150 c ase40c/16-4 4 9 g nl|ti|536342151/145-590 c ase40h/16-150 c ase40c/16-44 9 g nl|ti|536342151/145-590 c ase40h/16-150 CCTGGCTAACACGGTGAAACCCCATCTTTACTAAAAATACAAAAACAAAATTAGCCGGGC CCTGGCTAACATGGTGAAACCCTGTCTCTACTAAAAAGACAAAAACAAAATTAGCCGGGT GTGGTGGCAGGCGCCTATAGTCCCAGCTACCAGGGAGGCTGAGG-CAGGAGAATGGCGTG GTGGTGGCAGGTGCTTGTAGTCCCAGCTACTCAGGAGGCTGAGG-CAGGAGAATGGTGTG AACCCAAGAGGCGGAGCTTGCAGTGAGCCGAGATCGCACCAGTGCACTCCAGCCTAGGTG AACCCAGGAGACGGAGCTTGCAGTGAGCAGAGATCGCGCCACTGCACTCCAGCCTACGTG 162 c ase40c/16-449 ACAGAGCGAGACTCCGTCTCAAAAAAAAAAAAAA AACCATGAAGGGGCTG g n l | t i | 5 36342151/145-590 ACAGAGCGAGACTCCATCTCAAAAACAAAAACAAAAACAAAACAAAACATGAAGGGGCTA c ase40h/16-150 CTG c ase40c/16-44 9 g nl|ti|536342151/145-590 c ase40h/16-150 c ase40c/16-449 g nl|ti|536342151/145-590 c ase40h/16-150 GGCTTC-TCTCGGCATGGTAGCCAGGTTCCAAGTAAGAAAGTAAGACTATTTCACCAGCA GGCTTCCTCTCAGCATGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN GGTTTCCTCTTGGCATGGTAGCCAGGTTCCAAGTAA GACTATTTCACCAGCA AGTGTTCTTGCTGGAAGTAGAAGGAAG NNNNNNNNNNNNNNNNNNNNNNNNNNN AGTGTTCC ATGAAAAAGGAAG c hr16:6657598 4:6657 6 310:scaffold_37667:3291777:...AluY i n R hesus, g ene c o n v e r s i o n t o A luYb8 i n human, p r e c i s e d e l e t i o n i n c 
himpanzee CLUSTAL c ase41h/l-4 91 g nl|ti|536041529/83-557 c ase 4 1c/1-165 c ase41h/l-4 91 g nl|ti|536041529/83-557 case41c/l-165 c ase41h/l-491 g n l | t i I 5 36041529/83-557 case41c/l-165 c ase41h/l-4 91 g nl|ti|536041529/83-557 case41c/l-165 c ase 4 l h/1-4 91 g nl|ti|536041529/83-557 case41c/l-165 c ase41h/l-491 g n l | t i I 5 36041529/83-557 case41c/l-165 c ase41h/l-491 g nl|ti|536041529/83-557 case41c/l-165 CTCTTCACTTCCATTCTATTCATTAACTCCTTTTGTTCCACCTTGAACTATGCTCATTTT CTCTTTACTCCCATTCTATTCATTAACTCCTTTTGTTCCACCTTAAACTATGCTCGTTTC CTCTTCACTTCCATTCTATTCATTAACTCCTTTTGTTTCACCTTGAACTATGCTCATTTT TCCTCATCTTAAAAAAATACCC GGCCGGGCGCGGTGGCTCACGCCTGTAATCCCA TCCTCATCTCAAAAAAATACCCAATAGAGCCGGGCGCAGTGGCTCACGCCTGTAATCCCA TCCTCATCTTAAAAAAATACCCAATAG GCACTTTGGGAGGCCGAGGCGGGTGGATCATGAGGTCAGGAGATCGAGACCATCCTGGCT GCACTTTGGGAGGCCAAGGCGGGCGGATCACAAGGTCAGGAGATCGAGACCAC AACAAGGTGAAACCCCGTCTCTACTAAAAA-TACAAAAAATTAGCCGGGCGCGGTGGCGG GGTGAAACCCCGTCTCTACTAAAAAATACAAAAAATTAGCCGGGTGCTGTAGCGG GCGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAAG GCGCCTGTAGTCCCAG GAGGCTGATGTTTGAGAATGGCGTGAACCTGGGAGG CGGAGCTTGCAGTGAGCCGAGATTGCGCCACTGCAGTCCGCAGTCCGGCCTGGGCGACAG CGGAGCTTGCAGTGAGCCAAGATCGCGCCACTGCACTCCA GCCTGGGGGACAG . AGCGAGACTCCGTC T C AAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAATACCCA AGCGAGACTCCGTCTCAACAACAACAACAACAACAAAACAAAACAAAACAAAAACACCCA c ase41h/l-491 ATAGTTTGTTTGATTTGCTATCCTTTCAATTATTCATCCACTCCTCCCAGCAGATGCACA g n l | t i | 5 36041529/83-557 ATAGTTTGTTTGATTTGCTATCCTTTCAAGTATTCATTCACTCCTCCCAACAGATGCAAG case41c/l-165 TTTGTTTGATTTGCTATCCTTTCAATTATTCATCCACTCCTCCCAGCAGATGCACA c ase41h/l-4 91 g nl|ti|536041529/83-557 case41c/l-165 TATACCATTAGAAAGTAAATGT TATCCCATTAGATAGTAAATGT TATACCATTAGAAAGTAAATGT c hrl6:69279114:69279432:scaffold_37667:485635:... 
p r e c i s e d e l e t i o n o f A l u S q i n c himpanzee CLUSTAL c ase42h/l-47 9 g nl|ti|536066149/241-714 case42c/l-161 c ase42h/l-479 g nl|ti|536066149/241-714 TTCTGAAACCCACGTCTCTTGACAACTATGGTCTCTGCAACTTATCTGACCTTAAAACAC TTCTGAAACCCACATATCTTGACAACTATGGTCTCTGCAACTTATCTGACCGTAAAACAC TTCTGAAACCCACGTCTCTTGACAACTATGGTCTCTGCAACTTATCTGACCTTAAAACAC TTGCCTGGGTAATGTCCTTATAAGAGTTCTTCCTTTCTGGCCGGGCGCGGTGGCTTACAC TTGCCTGGGTAATGTCCTTGTAAGAGTTCTTCCTTTCTGGCCGGATGCGGTGGCTCACGC 163 case42c/l-161 c ase42h/l-479 g n l | t i I 5 36066149/241-714 case42c/l-161 c ase42h/l-47 9 . g n l | t i I 5 36066149/241-714 case42c/l-161 c ase42h/l-479 g n l I t i I 5 36066149/241-714 case42c/l-161 c ase42h/l-479 g n l | t i I 5 36066149/241-714 case42c/l-161 c ase42h/l-47 9 g nl|ti|536066149/241-714 case42c/l-161 c ase42h/l-479 g n l | t i I 5 36066149/241-714 case42c/l-161 c ase42h/l-479 g n l | t i I 5 36066149/241-714 case42c/l-161 TTGCCTGGGTAATGTCCTTATAAGAGTTCTTCCTTTC CTGGAATCCCAGCACTTTGGGAGGCCGAGGTGGGTGGATCACTTGAGGTCAGGAGTTT-G CTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGTGGATCACTTGAGGTCAGGAGTTTTG ATACCAGCCTGGCCAACGTGGTGAAACCCTGCCTCTACTAAAAATACAAAAATTAGCTGG AGACCAGCCTGGCCAACATGGTGAAACCCTGTTTCTATTAAAAATACAAAAGTTATCTGG ACCTGGTAGTGCATGCCTGTAATCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATCTCTT ACGTGGTAGTGCATGCCTGTAATCCCAGCTACTCGGGAGGCTGAGGCAGGAGACTCTCTT GAACCTGGGAGGTGGAGGTTGCAGTGAGCTGAGATCGCACCATTGCATTCCGGCCTGGGG GAACCTGGGAGATGGGGATTGCAGTGAACTGAGATCGTGCCACTGCACTCCAGCCTGGGG GACAAGAGTGAAACTCCATCTCAAAAAAAAAAAAAAAAAAAAGAGTTCTTCCTTTCCAAG GACAAGAGTGAAACTCCATCTCAGGGAAAAAA AAAAGTTCTTCCTTTCCAAG CAAG TATACCTGAGTTCCCATAAGACAGAAAGTCATTTTGTTG-CTGTTAATTTTTTGAG-TAA TATACCTAAGTTCCCATAAGACAGAAAGTCACTTTTTTGGTTGTTAATTTTTTGAGATAA TATACCTGAGTTCCCATAAGACAGAAAGTCATTTTGTTG-TTGTTAATTTTTTGAG-TAA GG GG GG c hrl6:74232245 = 7 4232552:scaffold_37614:14060425:. . . 
p r e c i s e d e l e t i o n o f A luSx i n c himpanzee, 3 c o p i e s o f t h i s r e g i o n i n human genome, a l l w i t h t he A l u , c himpanzee h as one w i t h t he A l u , one w i t h o u t CLUSTAL c ase43h/l-4 62 g nl|ti|503026621/313-769 c ase4 3 c/1-155 c ase43h/l-4 62 g nl|ti|503026621/313-769 c ase43c/l-155 c ase43h/l-4 62 g nl|ti|503026621/313-769 c ase43c/l-155 c ase43h/l-4 62 g n l | t i I 5 03026621/313-769 case43c/l-155 c ase43h/l-462 g nl|ti|503026621/313-769 case43c/l-155 c ase43h/l-462 g n l | t i I 5 03026621/313-769 c ase43c/l-155 c ase43h/l-4 62 g nl|ti|503026621/313-769 case43c/l-155 c ase43h/l-4 62 g n l | t i I 5 03026621/313-769 case43c/l-155 CAGCCACACCTCTCTGCCCTAGTCTCCTGCCCCCAGGAGCCTGGCCTCATATGCTCCCCA CAGCCACGCCTCTCCGCCCTGGTCTCCTGTCCCCGTA CCCGCCTCATCGGCTCCCTA CAGCCACGCCTCTCTGCCCTAGTCTCCTGCCCCCAGGAGCCTGGCCTCATATGCTCCCCA CCACGCACAGCTGACCCCGCCCCCTCCCTCTTCTTTTTTTTTTTTGAGACAGAGTCTCAC CCACGCACAGCTAACCCCGCCCCCTCCCTCTTCTTCTTTTTTTT-AAGACAGAGTCTCAC CCACGCACAGCTGACCCCGCCCCCTCCCTCTTCTT CCTGTCGCCCAGGCTGGAGTGCAGTGGAGAAATCCC—GGCTTACTGCAACCTCCGCCTC CCTGTTGCCCAGTCTGGAGTGCAGTGGAGCAATCTC—GGCTTACTGCAAACTCCGCCTC CCAGGTTCAAGCAATTCTCCTGCCTCAGCCTCCCAAGCAGCTGGGATTACAGCCATGTGA CCAGGTTCAAGCGATTCTCCTGCCTCAGCCTCCCGAGTAGCTGGGATTAGAGCCATGTGA CACCACACCTGGCTAATTTTTGTATTTTTAGTAGAGACGGGGTTTCACCATGTTGGCCAT CACCACATCTGGCTAATTTTTGTATTTTTAGTAGAGACAGGGTTTCACCACGTTGGCCAT TGCTGGTGTCAAACTGCTGACCTTAGGTGATCTGCTTGTGTCAGCCTCCCAGAGTGCTGG -GCTGGTCTCAAACTGCTGACCTTAGGTGATCTCCTTGCGTCAGCCTCCCAAAGTGCTGG GATTACAGGTGTGAGCCACCGTGCCCAGCCCCCTCCCTCTTCTTAAACAAGGGGCCTGGC GATTACAGGTGTCAGCCACCGCGCCCAGCCCCCTCCCTCTTCTTAAACAAGGGGCCTGGC AAACAAGGGGCCTGGC AATCACCACCCCTGGGTGACTTGGTGCAGTCCCCTGATCTCCCG AATCGCCACCCCTGGGTGACCTGCTGCAGACCCCTGATCTCCCG AATCACCACCCCTGGGTGACTTGGTGCAGTCCCCTGATCTCCCG 164 c hr17:30401123:30401435:scaffold_37172:1193530:+ ...imprecise d e l e t i o n o f A l u S c i n c himpanzee (no TSD c opy r e t a i n e d ) CLDSTAL c ase44h/l-470 g n l | t i I 5 13289005/92-572 case44c/l-161 c 
ase44h/l-470 g n l | t i I 5 13289005/92-572 case44c/l-161 c ase44h/l-470 g n l | t i I 5 13289005/92-572 case44c/l-161 c ase44h/l-470 g n l | t i I 5 13289005/92-572 case44c/l-161 c ase44h/l-470 g n l I t i I 5 13289005/92-572 case44c/l-161 c ase44h/l-470 g nl|ti|513289005/92-572 case44c/l-161 ACCCATTAGAAGATAACACCATTTGCCTTTTATTTTTGGATAATTCAGGAATAAAAAATG ACCCATTAGAAGATAACAGCACTTGTCTTTTGTTTTTGGATAATTCAGGAATAAAAAATG TCCCCCTATAAGATAACACCATTTGCCTTTTATTTTTGGATCATTCAGGAATAAAAA-TG GATCTCAAGCTTTATAAAACTTACAATTCTAGGCTGGGCGCTGTGGCTCACACCTGTAAT GAACCCAAGTTTTATAAAACTTACAATTCTAGGCTGGGCGCCATGGCACACGCATGTAAT GATCTCAAGCTTTATAAAACCC CCCAGCACTTTGGGAGGCCAAGGCCGGCGGATCACACGGTCAGGAGGTCAAGACCATCCT CTCAGCACTCTGGGAGGCCAACGCAGGTGGATCACACAGTCAGGAGGTCAAGACCATCCT GGCCAACATGGTGAAACCCTGTCTCTACTAAAAATACAAAAATTAGCTGGGCGTGGTGGT GGCCAACATGGTAAAACCCTGTCTCTACTAAAAATACAAAAATTAGCTGGGCGTGGTGGT GCGAGCCTGTAATCCCAGCTACTCGAGAGGCTGAGACAGGAGAATTGCTTGAACCCAGGA GCGAGCCTGTAATCCCAGCTACTCCGGAGGCTGAGACAGGAGAATTGCTGGAACCCGGGA GGCAGAGATTGCAGTGAGCCGAGATTGTGC-CACTGCACTTCAGCCTGGCAACAGAGTGA GGCAGAGATTACAGTGAGCTAAGATTGTGT-CGCTGCACTCCAGCCTGGCAACCCAGCAA c ase44h/l-470 AACTCCGTCTCAAAAA AAAAAAATTATAATTCTATAGAAAAATAACATTTGTAT g n l | t i | 5 13289005/92-572 AACTCCATCTCAAAAAAAAAGAAAAAAAATTATAATTCTATAGAAAAAGAACATTGGTAT case44c/l-161 CATAGAAAAATAACATTTGTAT c ase44h/l-470 AAATTTAAC-TTTGGTGTA AAAAAGTGAATTTAACTTTGGTATTGCACACTGGTA g n l | t i | 5 13289005/92-572 AAATTTAAC-TTTAGTGTAGAAGAAAAAAGTGAATTTAACTTTGGTATTGCACACTGGAA case44c/l-161 AAATTTAACCTTTGGTGTA AAAAAGTGAATTTAACTTTGGTATTGCACACTGGTA c ase44h/l-470 g nl|ti|513289005/92-572 case44c/l-161 ATT AGT ATT c hrl7:37731003:scaffold_37479:3983548:3983669:...imprecise d e l e t i o n internal t o MIRb e lement i n human, no f l a n k i n g identity CLUSTAL c ase45c/1-221 g n l | t i I 5 29975753/582-804 c ase45h/l-100 case45c/l-221 g nl|ti|529975753/582-804 c ase45h/l-100 case45c/l-221 g nl|ti|529975753/582-804 c ase45h/l-100 case45c/l-221 g nl|ti|529975753/582-804 c ase45h/l-100 
CAGAATCGATCACTAAAAGATGTTAGTGTTTTTACGCCACTGCGGGTCTTTAATTTCTTG CAGAATAGATCACTAAAAGATGTTAGTGTTTTTACGCTGCTGCGGGTCTTTAATTTCTTG CAGAATCGATCACTAAAAGATGTTAGTGTTTTTACGCCACTGCGGGTCTT GTGCCTCAATTTCCTCCTCTGTAAAGTGGACCTAATCCCAATATTTCTGTCATCAGTTGT GTGCCTCAATTTCCTCCTCTGTAAAGTGGACCTAATCCCAATGTTTCTATCATCAGTTGT GGAAATTACGTGAGGTAACGTTTGCAATTAGCAAAGGAAGGCATTAAGACAGCAAGCTCT GGAAATTACGTGAGGTAACATTTGCAATTAGCAAAGGAAGGCATTAAGACAGCAAGCTCT GCAAGCTCT TGGAAGGCGGGAACTAGTCTC—AGTCTCATTTGGCTCACAAC TGGAGGGCGAGAACTAGTCTCGTAGTCTCCTTTGGCTCACAGC TGGAGGGCGGGAACTAGTCTC—AGTCTCATTTGGCTCACAAC chrl7:57592534:57592845:scaffold 37 659:174722 63:+ 165 . AluY i n R hesus, gene c o n v e r s i o n t o A luYb8 i n human, p r e c i s e d e l e t i o n i n c himpanzee CLUSTAL c ase46h/l-468 g n l I t i I 5 41136674/627-1103 c ase46c/l-157 c ase4 6 h/l-4 68 g n l | t i I 5 41136674/627-1103 c ase46c/l-157 c ase46h/l-468 g n l | t i I 5 41136674/627-1103 c ase46c/l-157 c ase4 6 h/l-4 68 g nl|ti|541136674/627-1103 c ase46c/l-157 c ase4 6 h/l-4 68 g nl|ti|541136674/627-1103 c ase46c/l-157 case46h/1-468 g n l | t i I 5 41136674/627-1103 c ase46c/l-157 c ase46h/l-468 g n l | t i I 5 41136674/627-1103 c ase46c/l-157 c ase4 6 h/l-4 68 g n l | t i I 5 41136674/627-1103 c ase46c/l-157 c ase46h/l-468 g nl|ti|541136674/627-1103 c ase46c/l-157 AGAAAGAATATAGAGCTTAGGTTGGAGTTGAAATGGTGAGGACTAGTATTTAAGAAATCT AGAAAGAATATAGGGCTTAGGTTGGAGTTGAAATGGTGAGGACTAGTATTTA-GAAGTCT AGAAAGAATATAGAGCTTAGGTTGGAGTTGAAATGGTGAGGACTAGTATTTAAGAAATCT TTAGTTATCCCAGCATATTAAGAATATGCCA GGCCGGGCGCGGTGGCTCACGCCT TTAGTTATCCAAGCATATTAAGAGTATGCCAGTGTTGGCCAGGCGCAGTGGCTCACGCCT TTAGTTATCCAAGCATATTAAGAATATGCCAGTGTT GTAATCCCAGCACTTTGGGAGGCCGAGGCGGGTGGATCATGAGGTCAGGAGATCAAGACC GTAATCCCAGCACTTTGGGAGGCCAAGGCAGGCGGATCATGAGGTCAGGAGATCGAGACC 1 ATCCTGGCTAACAAGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCCGGGCGC ATCCTGGCTAACACAGTGAAACCCCGTCTTCACCAAAAATACAAAAAGTTCTCCGGGCGT GGTGGCGGGCGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAAC 
GGTGGCGGGTGCCTGTAGTCCTAGCTACTCCGGAGGCTGAGGCAGGAGAACGGCGTGAGC CCGGGAAGCGGAGCTTGCAGTGAGCCGAGATTGCGCCACTGCAGTCCGCAGTCCGACCTG CTGGGAGGCGGAGCTTGCAGTGAGCCGAGATCACACCACTGTACTCCA-GCCTG GGCGACAGAGCGAGACTCCGTCTCAAAAAAAAAAAAA GAATATGCCAG GGCAACAGAGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAGAAGAATATGCCAG TGTTCATACTGAGGAGAACATGGGTAAAACAGAAACAGTAGAAAGCTAACTTTTAATTAC TGTTCACACTGAGGAGAACATGGGTAAAACAGAAACAGTAGAAAGCTAACTTTTAATTAC CATACTGAGGAGAACATGGGTAAAACAGAAACAGTAGAAAGCTAACTTTTAATTAC TCTAA TCTAG TCTAG C hrl7:79427256:79427568:scaffold_34699:232 697:+ ... A luY i n R hesus h as p olymorphisms a t h ead a nd t a i l , human, p r e c i s e d e l e t i o n i n c himpanzee CLUSTAL c ase47h/l-412 g nl|ti|503733953/300-735 case47c/l-101 gene c o n v e r s i o n t o A luYg6 i n GAACGTCTTCCCATGTCATTAAACACAACAAAATAAGGTTAGGATAGATTAA-GATTGAA GAACATCTTCCTATGTCATTAAACACAACAAAATAAGGTTAGGATGGATTAAAGATTGAA GAACGTCTTCCCATGTCATTAAACACAACAAAATAAGGTTAGGATAGATTAAAGATTGAA c ase 4 7 h/1-412 CGTTTA GGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACT g n l | t i | 5 03733953/300-735 CATTTAAAATAAACCAT-AAGGGGCCAGGCGCGGTGGCTCACGCCTGTAATCACAGCACT case47c/l-101 CTTTTAAAACAAACCGTTAAG c ase47h/l-412 g n l | t i I 5 03733953/300-735 case47c/l-101 . 
c ase47h/l-412 g nl|ti|503733953/300-735 case47c/l-101 c ase47h/l-412 g n l I t i I 5 03733953/300-735 case47c/l-101 TTGGGAGGCCGAGACGGGCGGATCACGAGGTCAGGAGATCGAGACCATCCTGGCTAACAC TTGGGAGGCCGAGGCGGGCGGATCACGAGGTCAGGAAATCAAGACCATCCTGGCTAACAC GGTGAAACCCCGTCTCTACTAAAAATACAAAAA-TTAGCCGGGCATGGTGGCGTGCGCCT GGTGAAACCCCTTCTCTACTAAAAATACAAAAAATTAGCTGGGCGTGGTGGCGGGCACCT GTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGC GTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGGGTGAACCCGGGAAGAGGAGC c ase47h/l-412 TTGCAGTGAGTCGAGATCGCGCCACTGCACTCCAGCCTGGGCGACAGAGCGAAACTCCGT g n l | t i | 5 03733953/300-735 TTGCAGTGAGCTGAGATCCGGCCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCAT case47c/l-101 166 c ase47h/l-412 g nl|ti|503733953/300-735 case47c/l-101 c ase47h/l-412 g nl|ti|503733953/300-735 case47c/l-101 CTCAAAAAAAAAAAAAAAAA AAAGATTGAACGTTTA CTCAAAAATAATAATAATAATAATAATAATAATATAATAAAATA AAACAAACCGTAAGACACCATGAAAGACCTGGTG -AATAAACCATAAGACACCATGAAAGACCTGGTG ACACCATGAAAGACCTGGTG c hr19:8420410:8420717:scaffold_37480:196649:...AluY i n r h e s u s , g ene c o n v e r s i o n t o A luYa5 i n human, p r e c i s e C LUSTAL c ase4 8 h/l-4 62 g nl|ti|496120749/118-591 case48c/l-155 c a s e 4 8 h / l - 4 62 g nl|ti|496120749/118-591 case48c/l-155 c a s e 4 8 h / l - 4 62 g nl|ti|496120749/118-591 case48c/l-155 c ase48h/l-4 62 g nl|ti|496120749/118-591 case48c/l-155 c a s e 4 8 h / l - 4 62 g n l | t i I 4 96120749/118-591 case48c/l-155 c ase48h/l-462 g nl|ti|496120749/118-591 case48c/l-155 GTCACCATGTCGTCACTAGGCCTCGGTCCTATGGAGGCTACTACCTACACGCTTACAGCT GTCACCCTGTTGTCACTAGGCCTCGGTCCTATGGAGGTTACCACCTACACGCTAACAGCT GTCACCATGTCGTCACTAGGCCTCGGTCCTATGGAGGCTACTACCTACACGCTTACAGCT TTAAAAGAGACTCTTAGGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGA TTAAAAGAGACTCTT-GGCCGGGCACGGTGGCTCAAGCCTGTAATCTCAGCACTTTGGGA TTAAAAGAGACTCTTA GGCCGAGGCGGGCGGATCACGAGGTCAGGAGATCGAGACCATCCCGGCTAAAACGGTGAA GGCCGAAACGGGCGGATCACGAGGTCAGGAGATCGAGACCATCCTGGCTGACACGGTGAA deletion i n c himpanzee ACCCCGTCTCTACT AAAAATACAAAAAATTAGCCGGGCGTAGTGGCGGGCGCCTGT 
ACCCCGTCTCTACTTAAAAAAAATACAAAAAACTAGCCGGGCGAGGTGGCAGGCGCCTGT AGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTT AGTCCCAGCTACTCGGGAGGCTGAAGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTT GCAGCGAGCCGAGATCCCGCCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCGTCT GCAGTGAGCTGAGATCCGGCCACTGCACTCCAGCCTGGGCAGCAGGGCGAGACTCCGTCT c ase48h/l-462 C AAAAAAAAAAAA AAAAGAGACTCTTAGCACACAGT GAGGC AATGATGT AT g n l | t i I 4 96120749/118-591 CAAAAAAAAAAAAAAAAAAAAAGAGACTCTTGGCACACAGTAACTGAGGCAATGATGTAT case48c/l-155 GC ACAC AGT GAGGC AATGATGTAT c ase48h/1-4 62 g nl|ti|496120749/118-591 case48c/l-155 TTTTCTACCAATAATTTTATTGAGAAGGGTAGAGAGGGGGACTGATTCACTCCTA TTTTCTACCAATAATTTTATCGAGAAGGGTAGAGAGGCGGACTGATTTACTCCTA TTTTCTACCAATAATTTTATTGAGAAGGGTAGAGAGGGGGGCTGATTCACTCCTA > chr19:41499856:41500219:scaffold_37497:2669080:....rhesus h as p a r t i a l t andem d u p l i c a t i o n o f A l u S q ; c himp 3 d e l e t i o n s , AluSq C LUSTAL including orig c a s e 4 9 h / l - 4 63 A ACATTTTTGGTAATTATTTT C TATATATTTTAGATATTTCATTAAACTAATAC g nl|ti|563501071/127-787 A ACAGTTTTGGTAATTATTTTTACTTTCTATATATTTTAGAGTTTCCATTAAACTAATAT case49c/l-115 A ACATTTTTGGTGATTATTTT CTGTGTATTTTAGGATTTCCATTAAACTA c ase4 9 h/l-4 63 g n l I t i I 5 63501071/127-787 case49c/l-115 c ase49h/l-463 g nl|ti|563501071/127-787 case49c/l-115 AGAATTTCATATTCAGGGCCAGGCATAGTGGCTCATGCCTGTAATCCCAGCACTTTGAGA AGAATTTCATATTCAGGGCCTGGTGCAGAGGCTCATACCTGTAATCC-AGCACTTTGGGA T ATTC GTCCAAGGCGGGCGCATCACCTGAGGTCAGGGGTTCGAGACCATCCTGGCCAACAAGGGA AGCCGAGATGGGCGGATCACCTGAGGTCAAGGGTTCGAGACCAGCCTGGCCAACATGGCC 167 c ase4 9h/l-4 63 g nl|ti1563501071/127-787 case49c/l-115 c a s e 4 9 h / l - 4 63 g n l | t i I 5 63501071/127-787 case49c/l-115 c ase49h/l-463 g nl|ti|563501071/127-787 case49c/l-115 c ase49h/l-463 g n l | t i I 5 63501071/127-787 case49c/l-115 c ase49h/l-463 g nl|ti|563501071/127-787 case49c/l-115 c ase4 9h/l-4 63 g n l I t i I 5 63501071/127-787 case49c/l-115 c ase49h/l-4 63 g n l | t i I 5 63501071/127-787 case49c/l-115 c ase49h/l-463 g 
nl|ti1563501071/127-787 case49c/l-115 c ase49h/l-463 g nl|ti|563501071/127-787 case49c/l-115 AAACCCCATCTCTATTAAAAATACAAAATTAGCCGGGTGTGGTGTTGCACGCCTGTAATC ACACCCCGTCTCTACTAAAAATTCAAAATTAGTCGGGTGTGGTGGTGCATGCCTGTAATC CCAGCTACTTGGAAGGCTGAGGCAGGAGAATC CCAGCTACTTGGTATGCTGAGGCAGGAGAATCCATTGAAACTGCGAGGCGGAGTTTGCAG TGAGCCAACATCACGCCATTGTACTCTAGCCTGGGCAACAGTAGTGAAACTCCAGCTCAA GCATG AAAAAAAAAAAAAAAAAAAAAAAAAATACAAAACTTAGCTGGGCATGTTGGCGGGTGCTG ACCCCGGGGGCAGAGA ATAATCCCAGTCACTCGGTAGGCTGAGGCAGGAGAATTGCTTCAACCCAGGGAGCAGAGG TTGCAGTGAGCTGAGATCTTGCCACTTCATTCCAGCCTGGGCCACAGAGCAAGACTCCTT TTGCAATGAGCCAAGATCTCATGACTTCGCTCCAGCCTGGGGCACAGGGCAAAACTCCTT CTCAAAAAAAAAAAAAAAAAAAAA-AAATTCATATTCGCTCATATCAAAAATGAAAATTT CTCAAAAAAAAAAAGAAAAAAATTCAAATTCATATTCACTCATTTCAAAAATGAAAATTT T CATTTC A TTTTTGCAAATTTCTA AGTGATAGAATTATTTTAATGTAGGAAAGG-TTCATCAA ATTTTTGCAAATTTCTATTTGAGTGATAGAATTATTTTAATTTAGGAATGGCTTCATAAA AAATTTCTATCTGAGTGATAGAATTATTTTAATGTAGGAAAGG-TTCATCAA AA AA AA C hrl9:46430068=4 64303 9 7 : s c a f f o l d _ 3 7 5 4 3 : 1 4 7 6014:. . . 
p r e c i s e d e l e t i o n o f A l u S q i n c himpanzee C LUSTAL c ase50h/l-4 95 g n l | t i I 5 02904367/30-527 case50c/l-166 c ase50h/l-4 95 g nl|ti|502904367/30-527 case50c/l-166 c ase50h/l-4 95 g nl|ti|502904367/30-527 case50c/l-166 c ase50h/l-495 g nl|ti|502904367/30-527 case50c/l-166 c ase50h/l-495 g nl|ti|502904367/30-527 case50c/l-166 c a s e 5 0 h / l - 4 95 g n l | t i I 5 02904367/30-527 case50c/l-166 c a s e 5 0 h / l - 4 95 g n l | t i | 5 02904367/30-527 case50c/l-166 AGAGTAGGGAATATTCGCTAGAA GGATATATTACAACCCAGATGAGCTAGACCCAGC ACAGTAGGGAATATTTGCTAGAATGAGGATATATTACAACCCAGATGAGCTAGACCCAGA AGAGTAGGGAATATTTGCTAGAA GGATATATTACAACCCAGATGAGCTAGACCCAGC CTCTGCCCTCAAGTTGCTCCTAGAATAAGAAAACCAAAACCAGGCCAGGTGTGGTGGCTT CTCTGCCCTCAAGTTCCTCCTAGAGTAAGAAAACTAAAACCAGGCCAGCTGTGGTGGCTT CTCTGCCCTCAAGTTGCTCCTAGAATAAGAAAACCAAAACCA ACACCTGTAACCCCAGCACTTTGGGAGGCCAAGGCTGGTGGATCACCTGAGGTCAGGAGT ACACCTATAACCCCAGCACTTTGGGAGGCCACGGCGGGTGGATCACCTGAGGTCAGGAGT TCGAGACCAGCCTGGCTAACATGGTGAAACCCCATTTCTACTAAAAATACAAAAAATTAG TCGAGACCAGCCTGGCTAACATGGTGAAACCCCATTTCTACTAAAAATACAAAAAATTAG CCGGGTGTGGTGGCACACACCTGTAATCCCAGCTACTCAGGAGGCTGAGGCAGGAGAATC CCAGGTGTGGTGGCACACACCTGTAATCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATC CCTTGAACCCTGGAGGCAGAGGTTGCAGTGAGCCGAGATCGTGCCATTGCACTCCAGCTT ACTTGAACCTGGGAGGCAGAGGTTGCAGTGAGCCAAGATTGTGCCATTGCACTCCAGCTT GGGTAACACGAGCGAACTTCCGTCTCAAGAAAAAAAAAAAAAGAAAGAAAGAAAGAAGAA GGGTAACAAGAGCGAACGTCCGTCTCAAAAAAAAAAAAAAAAAAAAGACAGAAAGAAGAA ; 168 c ase50h/l-495 AACCAAAACCAAAACAAAACTCACAGATCTGTAAATATAAGGCCTAATTCTGGTCTGAAG g n l | t i I 5 02904367/30-527 AACCAAAACCAAAACCAAATTCACAGATCTGTAAATATAAGGCTGAATTCTGGTCTGAAG case50c/l-166 AAACAAAACTCACAGATCTGTAAATATAAGGCCTAATTCTGGTCTGAAG c ase50h/l-495 g n l I t i I 5 02904367/30-527 case50c/l-166 TTCTCTGAGATTAGAAAG TTCTCTGAGATTAGAATG TTCTCTGAGATTAGAAAG c hr20:41311768:41311810:scaffold_37443:5951986:...L2 e lement t andemly d u p l i c a t e d r e g i o n i n human, c himp, a nd r hesus CLUSTAL case51c/l-100 g n l | t i I 5 45384313/363-522 c 
ase51h/l-142 case51c/l-100 g n l | t i I 5 45384313/363-522 c ase51h/l-142 c ase51c/1-100 g nl|ti|545384313/363-522 c ase51h/1-142 GCCTTTTCAGGGGACTTGACCTCAGCTGGAGACTTTGCTTCTTCCTT TCCTTCTCTGGGGACTTGGCCTTCTCGGGGGACTTTGCTTCCTCCTTCGTTGGGGACTTG GCCTTTTCAGGGGACTTGACCTCAGCTGGAGACTTTGCTTCTTCCTT CACTGAGGACTTG GCCTTTTCAGGGGACTTGGCCTCAGCCGGTGACTTTGCTTCTTCCTTCACTGGGGACTTG CACTGGGGACTTGGCCTCAGCTGGTGACTTTGCTTCTTCCTTCACTGGGGACTTG GCCTTCTCTGGAGACTTGGCCTCAGCTGATGATTTTGCCT GCCTTCTCTGGAGACTTGGCCTCAGCTGGTGACTTTGCCT GCCTTCTCTGGAGACTTGGCCTCAGCTGGTGATTTTGCCT c hr22:45658137:45658441:scaffold_37534:1549045:. . . p r e c i s e d e l e t i o n o f A l u S q i n c himpanzee CLUSTAL c ase52h/l-458 g n l | t i I 5 55960713/344-767 case52c/l-154. c ase52h/l-458 g n l | t i I 5 55960713/344-767 c ase52c/l-154 c ase52h/l-458 g nl|ti|555960713/344-767 c ase52c/l-154 c ase52h/l-458 g n l | t i I 5 55960713/344-767 c ase52c/l-154 c ase52h/l-458 g n l | t i I 5 55960713/344-767 c ase52c/l-154 c ase52h/l-458 g nl|ti|555960713/344-767 c ase52c/l-154 c ase52h/l-458 g nl|ti|555960713/344-767 c ase52c/l-154 c ase52h/l-458 g nl|ti|555960713/344-767 c ase52c/l-154 CTGCTTAACCAGATGAGGAAGAACGAGGTTAATGAAAATGCCCAGTGATGGTGACGGTAA CTGCTTAACCAAATGAGGGAGAACAAGG AAATGCCCAGTGATTGTGAGGGTAA CTGCTTAACCAGATGAGGAAGAACGAGGTTAATGAAAATGCCCAGTGATGGTGAGGGTAA AGAAATGCCCCCTCTCGGCCAGGCGCGGTGGCTCATGTCTGTAATCCCAGCACCCTGGGG AGAAATGCCCCCTCTCGGCCGGGCACGGTGGCTCACACCTGTAATCCCAGCACTTTGGGA AGAAATGTCCCCTCTC GGCCGAGGCGGGCGGATCACTTGAGGTCAGGAGTTTGAGACCAGCCTGGCCAACAGGGTG GGCCGAGGCAGGCGGATAACCTGAGGTCAGGAGTTCGAGACCAGCCTGGCCAACATGGTG AAACCCCGTCTCTACTAAAAAATACAAAAATTAGCCAGGCGTGGTGGCAGGCGCCTTAAT AAACCCTGTCTCTACTAAAAAATAGAAAAATTAGCTGGGCGTGGTGGCAGGAGCTTTAAT CCTAGCTACTTGGGAGGCAGAGGCAGGAGAATCGTTTGAACCCAGGAGGCAGAGGTTGCA CCCAGCTACTTGGGAGGC GGAGGCAGAGGTTGCA GTGGGCTGAGATCGAGCCACTGCACTCAAGCCTGGGGGACAAGGGCGAGACTTCTCTGAA GTGAGGCAAGATCGAGCCATTGCACTCAAGCCTGGGGGACAAGGGTGAGACTTCTGTCAA AAAAGGAAATGCCCCCTCTCACAAAACTGCTGGCTGCAGGGCAAACCAACTCAGTGGGCC 
AAAAG-AAATGTCCCCTCTCACAAAATTGCTGGCTGCCCGGCAAACCAACTCAGTGGGCC ACAAAATTGCTGGCTGCAGGGCAAACCAACTCAGTGGGCC CCAGGGTCACTTGGCTGTGGCCACCAAGTTCCCCAAAC CCAGGGTCACTTGGCCGTGTGCACCAAGTTCCACAAAC CCAGGGTCACTTGGCTGTGGCCACCAAGTTCCTCAAAC c hr22:46857451:4 6 857769:scaffold_37534:338287:- 169 . ..AluY p a r t i a l l y d e l e t e d a nd r e v e r s e d i n R hesus ( s l i g h t l y c o m p l i c a t e d g ene c o n v e r s i o n t o A luYa5 i n human, p r e c i s e d e l e t i o n i n c himpanzee CLUSTAL c ase53h/l-418 g nl|ti|556293551/420-813 case53c/l-100 e xample o f N HEJ), GGCACTGGACCAAGCCTTCCTGCTGGGCAGAGATGGGACTGGCTTTTCATAAGATTGCGC GGCAGTGGACCAAGCCTTCCTGCCGGGCAGAGACGGGACTGGC GGCACTGGACCAAGCCTTCCTGCTGGGCAGAGACGGGACTGGCTTTTCATAAGATTGAGC c ase53h/l-418 CTTGGGCCGGGCACGGTGGCTCACTCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGG g n l | t i I 5 56293551/420-813 XXATGAAAAGCCAGXXTCCCAGCACTTTGGGAGGCCGAGACGGG case53c/l-100 CTT c ase53h/l-418 g nl|ti|556293551/420-813 case53c/l-100 c ase53h/l-418 g n l | t i I 5 56293551/420-813 case53c/l-100 c ase53h/l-418 g n l | t i I 5 56293551/420-813 case53c/l-100 c ase53h/l-418 g n l | t i I 5 56293551/420-813 case53c/l-100 c ase53h/l-418 g nl|ti|556293551/420-813 case53c/l-100 c ase53h/l-418 g nl|ti|556293551/420-813 case53c/l-100 CGGATCACGAGGTCAGGAGATCGAGACCATCCCGGCTATAACGGTGAATCCCCGTCTCTA CGGATCACGAGGTCAGGAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCTA CTA-AAAATACAAAAAA-TTAGCCGGGCGTAGTGGCGGGCGCCTGTAGTCCCAGCTACTT CTACAAAATACAAAAAAACTAGCCGGGCGAGGTGGCGGGCACCTGTAGTCCCAGCTACTC GGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGA GGGAGGCTGAGGCAGGAGAATGGCGTGAACCTGGGAGGCGGAGCTTGCAGTGAGCTGAGA TCCCGCCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAAAAAAAAAA TCCGGCCACTGTACTCCAGCCTGGGCCACAGAGCGAGACTCCGTCTCAAAAAAAAAAAAA ACAAAAAAAA AAGATTGCGCCTTTCAGCAACATCAACTCTTCCAGGAAATGTG AAAAAAAAAAGAAAAXXAAGATTGTGCCTTTCAGCAACATCGAGTCTTCCAGGAAATGTG TCAGCAACATCAACTCTTCCAGGAAATGTG C TTATAT CTTACAT C TTATAT c hrX:5202006:52022 9 2 : s c a f f o l d _ 2 5 5 8 2 : 6 9 9 6 : ... 
precise deletion of AluSp in chimpanzee

CLUSTAL

case54h/1-431             TGGAAGCCTGACCAGAAAATATCATGGCATCAGTTCAACGCATCCCAAAAACTCACTTAA
gnl|ti|485777938/470-934  TGGCAGCCTGACCAGAAAATATGATGGCATCAGTTCAACGCATCCCAAAAACTCACTTAA
case54c/1-140             TGGTAATCTGATCAGAAGATATCA------CAGTACAACGCATCCCAAAAACTCACTTAA

case54h/1-431             AAGCCAAAGGCAGGCCGGGCGTGGTGGCTCACGCCTATAATCCCAGCACTTTGGGAGGCC
gnl|ti|485777938/470-934  AAGCCAAAGGCAGGCCGGGAATGGTGGCTCACGCCTATAATCCCAGCACTCTGGGAGGCC
case54c/1-140             AAGCTAAAGGCA

case54h/1-431             GAGGCAGGTGGATCACCTGAGGTCGGGAGTTCAAGACCAGCCTGACCAACATGGAGAAAT
gnl|ti|485777938/470-934  GAGGCAGGTGGATCACCTGAGGTCGGGAGTTCAAGACCAGCCTGACCAACATGGTGAAAC
case54c/1-140

case54h/1-431             CCCATCTCTACTAAAAATACAAAATTAGCCAGGTGTGGTGGCACATGCCTGTAATCCCAG
gnl|ti|485777938/470-934  CCCATCTTTACTAAAATTACAAAATTAGCTGGGTGTGGGGGCACATGCCTGTAATCCCAG
case54c/1-140

case54h/1-431             CTACTCGGGAGG---CTGAGGCAGGAGAATGGCTTGAACCTGGGAGGGGGAGGCTGCAGT
gnl|ti|485777938/470-934  CTACTCGGGAGGAGGCTGAGGCAGGAGAATGGCTTGAACCTGGGAGGCAGAGGCTACGGG
case54c/1-140

case54h/1-431             GAGCGAAACTC CATC AAAAAAAAAAA AAA
gnl|ti|485777938/470-934  GGGCCAAGATCGCGCCATTGCACTCCAGACTGGGCAACAAGAGGGAAACTCCGTTTCAAA
case54c/1-140

case54h/1-431             AAAAAGGAAAGAAAAAAAAAAGCCAAAGACAAACAAATCATCTGACAGCTGCAAAGAAAA
gnl|ti|485777938/470-934  AAAAAAAAAAAAAAAAAAAATACCAAAGGCAAACAAATCATCTGACATCTGCAAAGAAAA
case54c/1-140                                            AACAAATCAGCTGACGTCTGCAAATAAAC

case54h/1-431             GTGCAAGTCCCTATGTTTTGTTTTGTTTTTCATTCTATTTCCAGA
gnl|ti|485777938/470-934  GTGCAAGTCCCTATGTTTTGTTTTGTTTTTCATTCTATTTCCAGA
case54c/1-140             ATGCAAGTCTCTATGTTTTGTCTTGGTTTTCACCCTATCTCCAGA

chrX:86865679:86865830:scaffold_37382:835478:+ ...
imprecise deletion of low complexity region in chimpanzee, no flanking identity

CLUSTAL

case55h/1-251             GTAATAGA-------ATAGGAAAAGTTTATTTCTTATTCTTAAAGATGAATCATTTAGAA
gnl|ti|495823394/144-417  GTAATATACAGTGTAATAGGAAAAGTTTATTTCTTACTCTTAAAGATGAATCATTTGGAA
case55c/1-100             GTAATAGA-------ATAGGAAAAGTTTATTTCTTATTCTTACAGATGAATCATTTA

case55h/1-251             CAAAAATTTTTGCTTTTTCTTTTAGAATATATATATGTGTGTATATATATGT
gnl|ti|495823394/144-417  CAA--ATTTTTGCTTTTTCTTTTAGAATATATATGTGTGTGTATATATATGTGTGTGTAA
case55c/1-100

case55h/1-251             ATATATGTGTGTGTATATATATGTATATATATGTGTGTGTATATATATAT
gnl|ti|495823394/144-417  ATGTGTGTATATGTGTATGTGTGTGTATATATGTGTATGTATATATGTGTGTGTGTATGT
case55c/1-100

case55h/1-251             GTGTATATATATGTACTGAAGCATATTCTCAAAATGTGCAAAGAGGCTGCAGTAATATTA
gnl|ti|495823394/144-417  GTGTATATATAAGCACTAAAGCATATTCTCAAAATGTGCAAAGAGGCTGCAGTAATATTA
case55c/1-100                                                           CTGCAGTAATATTA

case55h/1-251             TAGATAATTAAAATGAGTCAAACTCTGATTTTGAGG
gnl|ti|495823394/144-417  TGGATAAGTAAAATGAGTCAAACTCTGATTTTGAGG
case55c/1-100             TTGATAATTAAAATGAGTCAAACTCTGATTTTGAGG

Primers and Sequences

>caseC1_3
GTACAGTTGAGGCATTGCTAC
>caseC1_5
TCAGTCTCCAGGGAAGCAATG
>caseC2_3
AGGCAATAAAAGAGGCCGGCT
>caseC2_5
CAGAGCTCTTTCCTTCCACTC
>caseC3_3
TGGGTTATAGGCTTACAGATG
>caseC3_5
GAGATAGGCCAAGAACTATAG
>caseC4_3
AGAGTACCACCAAGGTATTAG
>caseC4_5
GAACTGATGTCTGCAACTTTG
>case14_3
CATACACATATAAGACCCTTC
>case14_5
GTCTCAGTGATAACTTGATGA
>case33_3
TTGTAGGGTTGAGAGAGCCTC
>case33_5
TGGCCACTTACCTTCTGCTTC
>case42_3
CTTTCTGTCTTATGGGAACTC
>case42_5
GAACATCTCTATTCACCTTCG
>case43_3
TTAGTGCAGGATGAAGTTGGC
>case43_5
TTCTCCCATCTGGTCATGTGA
>case52_3
CAGAAAGACACCATGGGTGAA
>case52_5
GCCTGTGGATAGATCATAGTC

...
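Each row label in the alignments above ends in a /start-end suffix (for example case55c/1-100), which gives the number of residues that row should contain across all blocks. The following is a minimal sketch, not part of the original analysis, of how a transcribed alignment could be checked against its labels. The function names label_span and residue_count are hypothetical, and the length of the gap run in the first chimpanzee row is an assumption (gap characters are ignored by the count in any case).

```python
import re


def label_span(label: str) -> int:
    """Number of residues implied by a 'name/start-end' alignment label."""
    match = re.fullmatch(r".+/(\d+)-(\d+)", label)
    if match is None:
        raise ValueError(f"label has no /start-end suffix: {label!r}")
    start, end = int(match.group(1)), int(match.group(2))
    return end - start + 1


def residue_count(rows) -> int:
    """Count residues over all blocks of one sequence, skipping gaps and spaces."""
    return sum(1 for row in rows for ch in row if ch.isalpha())


# Chimpanzee rows of case 55, one string per alignment block that contains residues.
# The run of '-' in the first row is an assumed gap length; it does not affect the count.
chimp_label = "case55c/1-100"
chimp_rows = [
    "GTAATAGA-------ATAGGAAAAGTTTATTTCTTATTCTTACAGATGAATCATTTA",
    "CTGCAGTAATATTA",
    "TTGATAATTAAAATGAGTCAAACTCTGATTTTGAGG",
]

if residue_count(chimp_rows) != label_span(chimp_label):
    raise SystemExit("residue count does not match the label range")
```

A mismatch between the two numbers would point to a dropped or duplicated row in the transcription rather than to anything about the sequences themselves.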