2 A Taxonomy of Web Mining
In this section we present a taxonomy of Web mining, i.e. Web content mining and Web usage mining.
We also describe and categorize some of the recent
work and the related tools or techniques in each area.
This taxonomy is depicted in Figure 1.
2.1 Web Content Mining
The lack of structure that permeates the information sources on the World Wide Web makes automated discovery of Web-based information difficult.
Traditional search engines such as Lycos, Alta Vista,
WebCrawler, ALIWEB [29], MetaCrawler, and others
provide some comfort to users, but do not generally
provide structural information nor categorize, filter,
or interpret documents. A recent study provides a
comprehensive and statistically thorough comparative
evaluation of the most popular search engines [32].
In recent years these factors have prompted researchers to develop more intelligent tools for information retrieval, such as intelligent Web agents, and
to extend data mining techniques to provide a higher
level of organization for semi-structured data available
on the Web. We summarize some of these efforts below.
2.1.1 Agent-Based Approach. Generally,
agent-based Web mining systems can be placed into
the following three categories:
Intelligent Search Agents: Several intelligent
Web agents have been developed that search for relevant information using domain characteristics and
user profiles to organize and interpret the discovered information. Agents such as Harvest [6], FAQFinder [19], Information Manifold [27], OCCAM [30],
and ParaSite [51] rely either on pre-specified domain
information about particular types of documents, or
on hard-coded models of the information sources to retrieve and interpret documents. Agents such as ShopBot [14] and ILA (Internet Learning Agent) [42] interact with and learn the structure of unfamiliar information sources. ShopBot retrieves product information from a variety of vendor sites using only general
information about the product domain. ILA learns
models of various information sources and translates
these into its own concept hierarchy.
Information Filtering/Categorization: A
number of Web agents use various information retrieval techniques [17] and characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize them [5, 9, 34, 55, 53]. HyPursuit [53] uses semantic information embedded in link
structures and document content to create cluster hierarchies of hypertext documents, and structure an
information space. BO (Bookmark Organizer) [34]
combines hierarchical clustering techniques and user
interaction to organize a collection of Web documents
based on conceptual information.
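As a rough illustration of the kind of hierarchical clustering mentioned above, the following sketch groups a few short documents by average-link agglomerative clustering over bag-of-words cosine similarity. It is not the algorithm used by HyPursuit or BO; the function names (`term_vector`, `cosine`, `cluster`), the sample documents, and the similarity threshold of 0.3 are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector for one document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, threshold=0.3):
    """Average-link agglomerative clustering (illustrative only).

    Repeatedly merges the two most similar clusters until no pair has an
    average pairwise similarity of at least `threshold` (an assumed cutoff).
    Returns a list of clusters, each a list of document indices.
    """
    vectors = [term_vector(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]

    def avg_sim(c1, c2):
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = avg_sim(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break  # remaining clusters are too dissimilar to merge
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Hypothetical sample documents, made up for this sketch.
docs = [
    "web mining discovers patterns in web data",
    "web usage mining analyzes web server logs",
    "chocolate cake recipe with dark chocolate",
    "easy cake recipe for chocolate lovers",
]
print(cluster(docs))
```

With these sample documents the two Web mining texts end up in one cluster and the two recipe texts in another. A system such as BO would additionally let the user adjust the resulting groups, which this sketch does not attempt.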
Personalized Web Agents: This category of Web agents learns user preferences and discovers Web
information sources based on these preferences, and
those of other individuals with similar interests (using collaborative filtering). A few recent examples of
such agents include WebWatcher [3], PAINT [39],
Syskill & Webert [41], GroupLens [47], Firefly [49],