s (i.e. transactions), integration of various data sources such as user registration
information, and specializing generic data mining algorithms to take advantage of the specific nature of
access log data.

3.1 Preprocessing Tasks
The first preprocessing task is data cleaning. Techniques to clean a server log to eliminate irrelevant
items are of importance for any type of Web log analysis, not just data mining. The discovered associations
or reported statistics are only useful if the data represented in the server log gives an accurate picture of the
user accesses of the Web site. Elimination of irrelevant
items can be reasonably accomplished by checking the suffix of the URL name. For instance, all log entries
with filename suffixes such as gif, jpeg, GIF, JPEG,
jpg, JPG, and map can be removed.
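The suffix check described above can be sketched in a few lines of Python. This is only an illustration, not a tool referenced in the text: the function names `is_relevant` and `clean_log` are hypothetical, and the sketch assumes the requested URL has already been extracted from each log entry.

```python
from urllib.parse import urlsplit

# Suffixes named in the text. Matching is done case-insensitively here, which
# covers both the lower-case and upper-case variants listed.
IRRELEVANT_SUFFIXES = {"gif", "jpeg", "jpg", "map"}


def is_relevant(url: str) -> bool:
    """Return False when the URL path ends in one of the irrelevant suffixes."""
    path = urlsplit(url).path            # drop any query string or fragment
    last_segment = path.rsplit("/", 1)[-1]
    if "." not in last_segment:
        return True                      # no filename suffix to check
    suffix = last_segment.rsplit(".", 1)[-1].lower()
    return suffix not in IRRELEVANT_SUFFIXES


def clean_log(urls):
    """Keep only the requested URLs that pass the suffix check (hypothetical helper)."""
    return [u for u in urls if is_relevant(u)]
```

For example, `clean_log(["/index.html", "/img/logo.GIF"])` keeps only the first entry. Which suffixes count as irrelevant depends on the site and on the analysis, so the set above would normally be configurable.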
A related but much harder problem is determining
if there are important accesses that are not recorded
in the access log. Mechanisms such as local caches and
proxy servers can severely distort the overall picture
of user traversals through a Web site. Current methods to try to overcome this problem include the use of
cookies, cache busting, and explicit user registration.
As detailed in [44], none of these methods are without serious drawbacks. Cookies can be deleted by the
user, cache busting defeats the speed advantage that
caching was created to provide and can be disabled,
and user registration is voluntary and users often provide false information. Methods for dealing with the
caching problem include using site topology or referrer
logs, along with temporal information, to infer missing
references.

Another problem associated with proxy servers is
that of user identification. Use of a machine name
to uniquely identify users can result in several users
being erroneously grouped together as one user. An
algorithm presented in [43] checks to see if each incoming request is reachable from the pages already visited.
If a page is requested that is not directly linked to the
previous pages, multiple users are assumed to exist on
the same machine. In [12], user session lengths determined automatically based on navigation patterns are
used to identify users. Other heuristics involve using
a combination of IP address, machine name, browser
agent, and temporal information to identify users [44].
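The reachability idea can be illustrated with a short Python sketch. It is not the algorithm of [43] itself, only a minimal interpretation of the description above. It assumes the site topology is available as a dictionary mapping each page to the set of pages it links to, and the function name `split_users_by_reachability` is hypothetical.

```python
def split_users_by_reachability(requests, links):
    """Assign each request from one machine to a presumed user.

    requests: pages in the order they were requested from a single machine.
    links:    dict mapping a page to the set of pages it links to
              (an assumed representation of the site topology).
    Returns a list of page lists, one per presumed user.
    """
    users = []
    for page in requests:
        for visited in users:
            # The page is reachable if this user already visited it or if it
            # is linked from any page in this user's history.
            if page in visited or any(page in links.get(p, ()) for p in visited):
                visited.append(page)
                break
        else:
            # Not reachable from any known history: assume a new user.
            users.append([page])
    return users
```

With `links = {"A": {"B"}, "X": {"Y"}}`, the request sequence `["A", "B", "X", "Y"]` is split into two presumed users, `["A", "B"]` and `["X", "Y"]`. A request that is reachable from more than one history is assigned to the first match in this sketch. A fuller treatment would need a tie-breaking rule, for instance one based on the temporal information mentioned above.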
The second major preprocessing task is transaction identification. Before any mining is done on
Web usage data, sequences of page references must
be grouped into logical units representing Web transactions or user sessions. A user session is all of the
page references made by a user during a single visit
to a site. Identifying user sessions is similar to the
problem of identifying individual users, as discussed
above. A transaction differs from a user session in
that the size of a transaction can range from a single page reference to all of the page references in a
user session, depending on the criteria used to identify transactions. Unlike traditional domains for data
mining, such as point of sale databases, there is no
convenient method of clustering page references into
transactions smaller than an entire user sessi...
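As an illustration of grouping page references into user sessions, the sketch below splits one user's time-ordered references wherever the gap between consecutive requests exceeds a timeout. The timeout rule and its 30-minute default are assumptions made for this example, not criteria taken from the text, and the function name `split_sessions` is hypothetical.

```python
def split_sessions(hits, timeout=1800):
    """Group one user's page references into sessions.

    hits:    list of (timestamp_in_seconds, page) tuples, sorted by time.
    timeout: assumed maximum gap in seconds between two references that
             still belong to the same visit (30 minutes here).
    Returns a list of sessions, each a list of pages.
    """
    sessions = []
    last_time = None
    for timestamp, page in hits:
        # Start a new session on the first hit or after a long pause.
        if last_time is None or timestamp - last_time > timeout:
            sessions.append([])
        sessions[-1].append(page)
        last_time = timestamp
    return sessions
```

Each resulting session could then be treated as a single transaction or divided further, depending on the criteria chosen for transaction identification.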
This document was uploaded on 02/15/2014.
- Spring '14