TellingApplesFromOranges

TellingApplesFromOranges - Enhanced Web Page Classification...

Enhanced Web Page Classification
Xiaoguang Qi
Background
Utilizing features of neighbors
Using fielded features
Problem definition
  Classification
    A set of labeled data is used to train a classifier, which can then be applied to label future examples.
  Web page classification (a.k.a. web page categorization)
    The process of assigning one or more predefined category labels to a web page.
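A minimal sketch of the train-then-label loop described above, using a small multinomial Naive Bayes classifier with add-one smoothing. The training snippets, the category names, and the function names `train` and `classify` are assumptions made up for illustration; they are not taken from the slides.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Collect per-class document counts and word counts from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability for the given text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # Start from the log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled data standing in for page text.
training = [
    ("football match score league", "Sports"),
    ("team wins the match", "Sports"),
    ("stock market shares fall", "Business"),
    ("company shares and market news", "Business"),
]
model = train(training)
print(classify("league match tonight", model))
```

The same `model` can be reused to label any number of future examples, which is the point of separating training from classification.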
Why important?
  Web page classification is essential to
    Improving quality of retrieval
    Maintaining web directories
    Helping question answering systems
    Many more …
Why important? (Cont.)
  Traditional text classification approaches don’t perform well on web pages
  An experiment on the dmoz ODP dataset
    12 topical categories
    228,000 training documents, 12,000 for testing
    Naïve Bayes: 35% error rate
    Support vector machine: 27% error rate
Text is not enough
Text is not enough (Cont.)
  Textual features could be missing, misleading, or unrecognizable.
    Web page temporarily unavailable, robot exclusion, picture, flash, frame, etc.
  There could be too much irrelevant text.
    Advertisement, navigational panel, spam follow-ups, etc.
  Solution?
    Besides text, web pages have other features. Make use of them!
On-page features
On-page features (Cont.)
Using on-page features
  HTML tags
    Structural info embedded in HTML documents
    Golub and Ardo [2005]
      Assign significance indicators for different HTML tags
      Title, headings, metadata, and main text
    Kwon and Lee [2000, 2003]
      Divide all the HTML tags into three groups
      Assign each group an arbitrary weight
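The tag-weighting idea can be sketched as follows: terms are counted with a weight that depends on the HTML element they appear in. This is only an illustration under stated assumptions. The weight values, the grouping of tags, and the class name `WeightedTermCounter` are hypothetical and are not the weights or groups used by Golub and Ardo or by Kwon and Lee.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights, chosen only for illustration.
TAG_WEIGHTS = {"title": 4.0, "h1": 3.0, "h2": 2.0, "h3": 2.0}
META_WEIGHT = 2.0
MAIN_TEXT_WEIGHT = 1.0


class WeightedTermCounter(HTMLParser):
    """Accumulate term weights according to the enclosing HTML tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []            # stack of currently open elements
        self.term_weights = Counter()  # term -> accumulated weight

    def add_terms(self, text, weight):
        for term in text.lower().split():
            self.term_weights[term] += weight

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            # Metadata lives in attributes, not in element text.
            attr = dict(attrs)
            if (attr.get("name") or "").lower() in ("description", "keywords"):
                self.add_terms(attr.get("content") or "", META_WEIGHT)
            return  # meta is a void element, so nothing is pushed
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Use the largest weight among all enclosing tags.
        weight = max(
            (TAG_WEIGHTS.get(t, MAIN_TEXT_WEIGHT) for t in self.open_tags),
            default=MAIN_TEXT_WEIGHT,
        )
        self.add_terms(data, weight)


page = (
    '<html><head><title>Apple pie</title>'
    '<meta name="keywords" content="dessert baking"></head>'
    '<body><h1>Recipes</h1><p>apple and sugar</p></body></html>'
)
parser = WeightedTermCounter()
parser.feed(page)
parser.close()
print(parser.term_weights.most_common(3))
```

The resulting weighted counts could replace raw term frequencies as input features for a text classifier. A real system would also need to skip script and style content, which this sketch does not do.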
Using on-page features (Cont.)
  Summarize then classify
    Classify web pages based on their summaries [Shen et al., 2004].
  URL
    A web page can be reasonably classified just based on its URL! [Kan, 2004; Kan and Thi, 2005]
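To make the URL idea concrete, here is a minimal sketch of turning a URL into tokens that a text classifier could consume. It simply splits on non-alphanumeric characters; the cited work may use more elaborate segmentation, so this should not be read as their method. The function name `url_tokens` and the example URL are hypothetical.

```python
import re
from urllib.parse import urlsplit


def url_tokens(url):
    """Split a URL's host, path, and query into lowercase alphanumeric tokens."""
    parts = urlsplit(url.lower())
    text = " ".join([parts.netloc, parts.path, parts.query])
    return [tok for tok in re.split(r"[^a-z0-9]+", text) if tok]


tokens = url_tokens("http://www.example.com/sports/football/scores.html?year=2008")
print(tokens)
```

Tokens such as `sports` and `football` carry topical signal even when the page body itself cannot be fetched.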
Using on-page features (Cont.)
  Visual analysis
    Most approaches focus on text, ignoring visual info.
    Visual analysis can sometimes be more expensive than using text.
    Analyze a web page’s visual layout, represent the recognized parts in an adjacency graph, apply heuristic rules on the graph [Kovacevic et al., 2004].
[Figure: a target page of unknown class (marked "?") and its neighboring pages: parent, child, sibling, and spouse]
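The four neighbor types can be derived from a link graph, as sketched below. The slide only names the types, so the definitions used here are an assumption based on common usage: a parent links to the target, a child is linked from the target, a sibling shares a parent with the target, and a spouse shares a child with the target. The function name `neighbors` and the page names are hypothetical.

```python
from collections import defaultdict


def neighbors(links, target):
    """Group the pages around `target` by neighbor type.

    `links` is an iterable of (source, destination) hyperlink pairs.
    """
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for src, dst in links:
        out_links[src].add(dst)
        in_links[dst].add(src)

    parents = set(in_links[target])    # pages that link to the target
    children = set(out_links[target])  # pages the target links to

    siblings = set()
    for parent in parents:             # other pages linked from a parent
        siblings |= out_links[parent]
    spouses = set()
    for child in children:             # other pages linking to a child
        spouses |= in_links[child]

    # The target is trivially its own sibling and spouse; drop it.
    siblings.discard(target)
    spouses.discard(target)
    return {
        "parent": parents,
        "child": children,
        "sibling": siblings,
        "spouse": spouses,
    }


# Hypothetical link graph.
links = [
    ("hub", "target"),
    ("hub", "sib"),
    ("target", "kid"),
    ("other", "kid"),
]
result = neighbors(links, "target")
print(result)
```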
Using features of neighbors
  Directly incorporate text from parent or child into the target page
    It does more harm than good [Chakrabarti et al., 1998; Ghani et al., 2001; Yang et al., 2002].
  Parent and child pages == useless?

This note was uploaded on 08/06/2008 for the course CSE 450 taught by Professor Davison during the Spring '08 term at Lehigh University.

