Web Mining Information and Pattern Discovery on the World Wide Web



...s (i.e. transactions), integration of various data sources such as user registration information, and specializing generic data mining algorithms to take advantage of the specific nature of access log data.

3.1 Preprocessing Tasks

The first preprocessing task is data cleaning. Techniques to clean a server log to eliminate irrelevant items are of importance for any type of Web log analysis, not just data mining. The discovered associations or reported statistics are only useful if the data represented in the server log gives an accurate picture of the user accesses of the Web site. Elimination of irrelevant items can be reasonably accomplished by checking the suffix of the URL name. For instance, all log entries with filename suffixes such as gif, jpeg, GIF, JPEG, jpg, JPG, and map can be removed.

A related but much harder problem is determining if there are important accesses that are not recorded in the access log. Mechanisms such as local caches and proxy servers can severely distort the overall picture of user traversals through a Web site. Current methods to try to overcome this problem include the use of cookies, cache busting, and explicit user registration. As detailed in [44], none of these methods are without serious drawbacks. Cookies can be deleted by the user, cache busting defeats the speed advantage that caching was created to provide and can be disabled, and user registration is voluntary and users often provide false information. Methods for dealing with the caching problem include using site topology or referrer logs, along with temporal information to infer missing references.

Another problem associated with proxy servers is that of user identification. Use of a machine name to uniquely identify users can result in several users being erroneously grouped together as one user. An algorithm presented in [43] checks to see if each incoming request is reachable from the pages already visited. If a page is requested that is not directly linked to the previous pages, multiple users are assumed to exist on the same machine. In [12], user session lengths determined automatically based on navigation patterns are used to identify users. Other heuristics involve using a combination of IP address, machine name, browser agent, and temporal information to identify users [44].

The second major preprocessing task is transaction identification. Before any mining is done on Web usage data, sequences of page references must be grouped into logical units representing Web transactions or user sessions. A user session is all of the page references made by a user during a single visit to a site. Identifying user sessions is similar to the problem of identifying individual users, as discussed above. A transaction differs from a user session in that the size of a transaction can range from a single page reference to all of the page references in a user session, depending on the criteria used to identify transactions. Unlike traditional domains for data mining, such as point of sale databases, there is no convenient method of clustering page references into transactions smaller than an entire user sessi...
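
As a rough illustration of two of the preprocessing steps described above, the following Python sketch filters log entries by URL suffix and then applies a simple reachability check to requests coming from one machine. It is a minimal sketch under stated assumptions, not the algorithm of [43]: the function names, the (host, url) entry format, and the `links` dictionary describing site topology are all hypothetical. Suffix matching is done case-insensitively, which covers the listed variants but also goes slightly beyond them (for example, it would match MAP).

```python
from urllib.parse import urlparse

# Suffixes the text lists as irrelevant; compared case-insensitively (assumption).
IRRELEVANT_SUFFIXES = {"gif", "jpeg", "jpg", "map"}


def is_relevant(url):
    """Return False when the URL path ends in one of the listed suffixes."""
    path = urlparse(url).path
    suffix = path.rsplit(".", 1)[-1].lower() if "." in path else ""
    return suffix not in IRRELEVANT_SUFFIXES


def clean_log(entries):
    """Keep only the (host, url) entries whose URL passes the suffix check."""
    return [(host, url) for host, url in entries if is_relevant(url)]


def split_users(requests, links):
    """Group requests from one machine into inferred users.

    requests: page names in arrival order, all from the same machine.
    links: hypothetical site topology, mapping a page to the pages it links to.
    A request joins the first user who already visited it or who visited a page
    linking to it; otherwise a new user is assumed.
    """
    users = []  # one list of visited pages per inferred user
    for page in requests:
        for visited in users:
            if page in visited or any(page in links.get(p, ()) for p in visited):
                visited.append(page)
                break
        else:
            # Not reachable from any known user's pages: assume another user.
            users.append([page])
    return users
```

For example, with links A -> {B, C}, B -> {D}, and X -> {Y}, the request sequence A, B, X, D, Y from a single machine would be split into two inferred users, because X is not linked from any page the first user has seen. A real implementation would likely combine this with the temporal and browser-agent heuristics mentioned above.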

This document was uploaded on 02/15/2014.
