# A Comparison of Open Source Search Engines


A Comparison of Open Source Search Engines

Christian Middleton, Ricardo Baeza-Yates
## Contents

1 Introduction
2 Background
    2.1 Document Collection
        2.1.1 Web Crawling
        2.1.2 TREC
    2.2 Indexing
    2.3 Searching and Ranking
    2.4 Retrieval Evaluation
3 Search Engines
    3.1 Features
    3.2 Description
    3.3 Evaluation
4 Methodology
    4.1 Document collections
    4.2 Performance Comparison Tests
    4.3 Setup
5 Tests
    5.1 Indexing
        5.1.1 Indexing Test over TREC-4 collection
        5.1.2 Indexing WT10g subcollections
    5.2 Searching
        5.2.1 Searching Tests over TREC-4 collection
        5.2.2 Precision and Recall Comparison
    5.3 Global Evaluation
6 Conclusions
## Chapter 1. Introduction

As the amount of information available on websites increases, it becomes necessary to give users the ability to search over this information. When deciding to install a search engine on a website, one can use either a commercial search engine or an open source one. For most websites, a commercial search engine is not a feasible alternative because of the fees required and because such engines focus on large-scale sites. On the other hand, open source search engines may offer the same functionality as a commercial one (some are capable of managing large amounts of data), with the benefits of the open source philosophy: no cost, actively maintained software, the possibility to customize the code to satisfy particular needs, etc.

Nowadays, there are many open source alternatives available, and each of them has different characteristics that must be taken into consideration in order to determine which one to install on a website. These search engines can be classified according to the programming language in which they are implemented, how they store the index (inverted file, database, or other file structure), their searching capabilities (boolean operators, fuzzy search, use of stemming, etc.), their way of ranking, the types of files they can index (HTML, PDF, plain text, etc.), and the possibility of on-line indexing and/or building incremental indexes. Other important factors to consider are the date of the last software update, the current version, and the activity of the project. These factors are important since a search engine that has not been updated recently may present problems when customizing it to the needs of the current website. These characteristics are useful for making a broad classification of the search engines and narrowing the available spectrum of alternatives. Afterward, it is important to consider the performance of these search engines under different data loads and to analyze how it degrades as the amount of information increases. At this stage, it is possible to analyze the indexing time versus the amount of data, the amount of resources used during indexing, and the performance during the retrieval stage.

The present work is the first study, to the best of our knowledge, to
