s10-link-analysis

s10-link-analysis - IR for Web Pages Important points on...


IR for Web Pages

Important points on Trust vs. Relevance
Relevance vs. Trustworthiness
  - User will know whether something is relevant when shown
      Woody Allen: "I finally had an orgasm and my doctor says it is the wrong kind"
  - Won't know whether it is trustworthy/popular etc.
  - Relevance can be learned from user models
  - Trust can't be learned from the user, but from the quality of the data.
  - Relevance has the notion of marginal relevance
  - No notion of marginal trustworthiness.
  - PageRank is best seen as a trust measure.

Search Engine
  A search engine is essentially a text retrieval system for web pages plus a Web interface.
  So what's new???

Some Characteristics of the Web
  - Web pages are
      - very voluminous and diversified
      - widely distributed on many servers
      - extremely dynamic/volatile
  - Web pages
      - have more structure (extensively tagged)
      - are extensively linked
      - may often have other associated metadata
  - Web search is
      - Noisy (pages with high similarity to query may still differ in relevance)
      - Uncurated; Adversarial! A page can advertise itself falsely just so it will be retrieved
  - Web users are
      - ordinary folks (dolts?) without special training; they tend to submit short queries
      - There is a very large user community.
  Slide annotations:
      - Use the links and tags and meta-data!
      - Use the social structure of the web
      - Need to crawl and maintain index
      - Easily impressed
  Short queries? Okay, except when the student is desperately trying to use the web to cheat on his/her homework

Use of Tag Information (1)
  - Web pages are mostly HTML documents (for now).
  - HTML tags allow the author of a web page to
      - control the display of page contents on the Web
      - express their emphases on different parts of the page
  - HTML tags provide additional information about the contents of a web page.
  - Can we make use of the tag information to improve the effectiveness of a search engine?
Use of Tag Information (2)
  Two main ideas of using tags:
  - Associate different importance to term occurrences in different tags.
      Title > header 1 > header 2 > body > footnote > invisible
  - Use anchor text to index referenced documents. (What should be its importance?)
  [Figure: "Your page" links to "Page 2: Rao's page" with the anchor text "... worst teacher I ever had ..."]
  Document is indexed not just with its contents, but with the contents of others' descriptions of it

Google Bombs: The other side of Anchor Text
  - You can tar someone's page just by linking to them with some damning anchor text
  - If the anchor text is unique enough, then even a few pages linking with that keyword will make sure the page comes up high
      - E.g. link your SO's page with "my cuddlybubbly woogums"
      - "Shmoopie" unfortunately is already taken by Seinfeld
  - For more common-place keywords (such as "unelectable" or "my sweet heart") you need a lot more...
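
The two tag-based ideas above (weighting term occurrences by the tag they appear in, and indexing a page under the anchor text of links that point to it) can be sketched roughly as follows. This is only an illustration, not the method of any particular search engine: the numeric weights, the names PageParser and build_index, and the choice to credit anchor text with a fixed ANCHOR_WEIGHT are all assumptions made here. The slides give only the ordering title > header 1 > header 2 > body and leave the importance of anchor text as an open question. Footnote and invisible text are not handled, to keep the sketch short.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical weights: only their ordering (title > h1 > h2 > body)
# comes from the slides; the numbers themselves are made up.
TAG_WEIGHTS = {"title": 8.0, "h1": 4.0, "h2": 2.0}
BODY_WEIGHT = 1.0
ANCHOR_WEIGHT = 3.0  # assumption: the slides leave this value open


class PageParser(HTMLParser):
    """Collects tag-weighted term counts and (href, anchor text) pairs."""

    def __init__(self):
        super().__init__()
        self.open_tags = []                    # weighted tags currently open
        self.term_weights = defaultdict(float)
        self.links = []                        # (href, anchor_text) pairs
        self._href = None                      # href of the <a> being read
        self._anchor_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor_parts = []

    def handle_endtag(self, tag):
        if tag in TAG_WEIGHTS and tag in self.open_tags:
            self.open_tags.remove(tag)
        elif tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._anchor_parts)))
            self._href = None

    def handle_data(self, data):
        # A term gets the weight of the heaviest enclosing tag, else body weight.
        weight = max((TAG_WEIGHTS[t] for t in self.open_tags), default=BODY_WEIGHT)
        for term in data.lower().split():
            self.term_weights[term] += weight
        if self._href is not None:
            self._anchor_parts.append(data.strip())


def build_index(pages):
    """pages maps url -> HTML text; returns url -> {term: weight}."""
    index = {url: defaultdict(float) for url in pages}
    for url, html in pages.items():
        parser = PageParser()
        parser.feed(html)
        parser.close()
        # Idea 1: a page's own terms, weighted by the tag they occur in.
        for term, weight in parser.term_weights.items():
            index[url][term] += weight
        # Idea 2: anchor text is also credited to the page being linked to.
        for href, text in parser.links:
            target = index.setdefault(href, defaultdict(float))
            for term in text.lower().split():
                target[term] += ANCHOR_WEIGHT
    return index
```

With a page that links to rao.html using the anchor text "worst teacher", the terms "worst" and "teacher" end up in the index entry for rao.html even though they never occur on that page. That is the mechanism a Google bomb exploits: enough links with the same anchor text can outweigh what the target page says about itself.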

This note was uploaded on 03/11/2012 for the course CSE 494 taught by Professor Rao during the Spring '08 term at ASU.

