n-Gram/2L: A Space and Time Efficient Two-Level n-Gram Inverted Index Structure

Min-Soo Kim, Kyu-Young Whang, Jae-Gil Lee, Min-Jae Lee
Department of Computer Science and Advanced Information Technology Research Center (AITrc)
Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea
{mskim, kywhang, jglee, mjlee}@mozart.kaist.ac.kr

Abstract

The n-gram inverted index has two major advantages: it is language-neutral and error-tolerant. Due to these advantages, it has been widely used in information retrieval and in similar sequence matching for DNA and protein databases. Nevertheless, the n-gram inverted index also has drawbacks: its size tends to be very large, and query performance tends to be poor. In this paper, we propose the two-level n-gram inverted index (simply, the n-gram/2L index) that significantly reduces the size and improves the query performance while preserving the advantages of the n-gram inverted index. The proposed index eliminates the redundancy of the position information that exists in the n-gram inverted index. The proposed index is constructed in two steps: 1) extracting subsequences of length m from documents and 2) extracting n-grams from those subsequences. We formally prove that this two-step construction is identical to the relational normalization process that removes the redundancy caused by a non-trivial multivalued dependency. The n-gram/2L index has excellent properties: 1) it significantly reduces the size and improves the performance compared with the n-gram inverted index, with these improvements becoming more marked as the database size gets larger; 2) the query processing time increases only very slightly as the query length gets longer. Experimental results using databases of 1 GByte show that the size of the n-gram/2L index is reduced by up to 1.9 to 2.7 times and, at the same time, the query performance is improved by up to 13.1 times compared with those of the n-gram inverted index.
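The two-step construction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `build_two_level_index`, the `front`/`back` dictionaries, and the sample strings are hypothetical. The abstract also does not say how the length-m subsequences are chosen, so the sketch assumes they start every m - n + 1 characters, so that neighbouring subsequences overlap by n - 1 characters and no n-gram is lost at a boundary.

```python
from collections import defaultdict

def build_two_level_index(documents, m, n):
    """Sketch of a two-level index (hypothetical names, assumed overlap rule)."""
    # Assumption: consecutive subsequences overlap by n - 1 characters,
    # so every n-gram of the document lies inside some subsequence.
    step = m - n + 1

    # Step 1: subsequence -> list of (doc_id, offset of subsequence in document).
    front = defaultdict(list)
    for doc_id, text in enumerate(documents):
        for offset in range(0, len(text) - n + 1, step):
            front[text[offset:offset + m]].append((doc_id, offset))

    # Step 2: n-gram -> list of (subsequence, offset of n-gram in subsequence).
    # Each distinct subsequence is processed once, however often it occurs.
    back = defaultdict(list)
    for subseq in front:
        for offset in range(len(subseq) - n + 1):
            back[subseq[offset:offset + n]].append((subseq, offset))

    return dict(front), dict(back)

docs = ["ABCABCABCA", "ABCAXY"]
front, back = build_two_level_index(docs, m=4, n=2)
print(front["ABCA"])  # every place the subsequence "ABCA" was extracted
print(back["AB"])     # the n-gram "AB" is recorded once, inside "ABCA"
```

In this toy input the subsequence "ABCA" is extracted four times, but the offsets of its n-grams are stored only once in `back`. That is the kind of repeated position information the abstract says the proposed index eliminates.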
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005

1 Introduction

Text searching is regarded as an operation of fundamental importance and is widely used in many areas such as information retrieval [18] and similar sequence matching for DNA and protein databases [7]. DNA and protein sequences can be regarded as long texts over specific alphabets (e.g., {A, C, G, T} in DNA) [5]. A number of index structures have been proposed for efficient text searching, and the inverted index is the most actively used one [18].
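For readers unfamiliar with the structure, a conventional n-gram inverted index over such texts can be sketched as follows. This is a generic illustration under simple assumptions (a window sliding one character at a time, postings stored as plain lists), not code from the paper. The function name `build_ngram_index` and the sample DNA strings are hypothetical.

```python
from collections import defaultdict

def build_ngram_index(documents, n):
    """Map each n-gram to a posting list of (doc_id, offset) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        # Slide a window of length n over the text, one character at a time.
        for offset in range(len(text) - n + 1):
            index[text[offset:offset + n]].append((doc_id, offset))
    return dict(index)

docs = ["ACGTACGT", "TACGA"]
index = build_ngram_index(docs, n=2)
print(index["AC"])  # all (doc_id, offset) positions where "AC" occurs
```

Because every occurrence of every n-gram gets its own posting, the index grows with the total text length rather than with the number of distinct n-grams.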
