

Web Page Classification: Features and Algorithms

XIAOGUANG QI and BRIAN D. DAVISON
Lehigh University

Classification of Web page content is essential to many tasks in Web information retrieval such as maintaining Web directories and focused crawling. The uncontrolled nature of Web content presents additional challenges to Web page classification as compared to traditional text classification, but the interconnected nature of hypertext also provides features that can assist the process. As we review work in Web page classification, we note the importance of these Web-specific features and algorithms, describe state-of-the-art practices, and track the underlying assumptions behind the use of information from neighboring pages.

Categories and Subject Descriptors: I.5.2 [Pattern Recognition]: Design Methodology - Classifier design and evaluation; I.5.4 [Pattern Recognition]: Applications - Text processing; I.2.6 [Artificial Intelligence]: Learning; H.2.8 [Database Management]: Database Applications - Data Mining; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms: Algorithms, Performance, Design

Additional Key Words and Phrases: Categorization, Web mining

ACM Reference Format:
Qi, X. and Davison, B. D. 2009. Web page classification: Features and algorithms. ACM Comput. Surv. 41, 2, Article 12 (February 2009), 31 pages. DOI = 10.1145/1459352.1459357 http://doi.acm.org/10.1145/1459352.1459357

1. INTRODUCTION

Classification plays a vital role in many information management and retrieval tasks. On the Web, classification of page content is essential to focused crawling, to the assisted development of web directories, to topic-specific Web link analysis, to contextual advertising, and to analysis of the topical structure of the Web. Web page classification can also help improve the quality of Web search.

In this survey we examine the space of Web classification approaches to find new areas for research, as well as to collect the latest practices to inform future classifier implementations. Surveys in Web page classification typically lack a detailed discussion of the utilization of Web-specific features. In this survey, we carefully review the

This material is based upon work supported by the National Science Foundation under Grant No. IIS-0328825.

Authors' address: Department of Computer Science & Engineering, Lehigh University, Bethlehem, PA 18015; email: {xiq204,davison}@cse.lehigh.edu.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted....