lec1a - What is this course about CS6913 Web Search Engines...

CS6913 Web Search Engines
Torsten Suel
CSE Department, NYU Poly
[email protected]
CS6913, Spring 2010
Lecture 1, 1/21/2010

What is this course about?
- learn how search engines work (google, yahoo, baidu, …)
- learn about other types of web search tools and applications (pricewatch, citeseer, google news, …)
- learn about Information Retrieval (IR)
- learn about data compression
- learn about web search performance challenges
- learn how to build such tools!
  - basic information retrieval techniques
  - what software tools to use
  - system architectures and performance
  - how to work with TBs of data

Not covered in this course:
- web site design and HTML
- Java, web scripting
- how to use search engines
- image and multimedia search
- peer-to-peer search technologies
- advanced IR: categorization, clustering, ...
- natural language processing (NLP)

Today and next class: (1/21 & 1/28/2010)
I - Introduction:
  - the web
  - overview of search tools
  - how the web works
II - Basic Techniques
  - basic search engine architecture
  - crawling basics: following links, robot exclusion, ..
  - storage
  - indexing
  - querying and term-based ranking
  - link-based ranking
III - Introduction to Information Retrieval
IV - Introduction to Python

1 – Introduction and Motivation: What is the Web?
- text … lots of text … billions of pages of text
- pages containing (fairly unstructured) text
- images, audio, etc. embedded in (hanging off) pages
- structure defined using HTML (Hypertext Markup Language) (+flash etc)
- hyperlinks between pages!
- hundreds of billions of pages
- trillions of hyperlinks
- a giant graph!

What is the web? (another view)
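
The slides describe the web as pages connected by hyperlinks, "a giant graph". As a rough illustration of that view (not taken from the lecture), the sketch below stores a tiny made-up web as an adjacency list and counts pages, hyperlinks, and incoming links per page. The page names and the helper `in_degrees` are hypothetical.

```python
# Hypothetical toy web: each page maps to the list of pages it links to.
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html"],
}


def in_degrees(graph):
    """Count, for every page, how many hyperlinks point to it."""
    counts = {page: 0 for page in graph}
    for targets in graph.values():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


num_pages = len(links)                                    # nodes of the graph
num_links = sum(len(targets) for targets in links.values())  # edges of the graph
indeg = in_degrees(links)

for page in sorted(indeg):
    print(page, "has", indeg[page], "incoming link(s)")
```

A real web graph has billions of nodes and would not fit in a Python dictionary; the point here is only the data model of pages as nodes and hyperlinks as directed edges.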

How the web is organized: link structure
- hyperlinks are very useful
- hyperlink structure is also often meaningful
- many links to a page: page is important or liked by many?
- Google PageRank exploits this
- old idea: citation analysis, social networks
- understanding link structure helps us understand the web
- but: manipulation through link farms … just as in real life

How the web is organized: site structure
- pages reside in servers
- sites often contain related pages
- site/host structure
- local versus global links
- application: connectivity sonar
[Figure: three web servers (hosts): www.poly.edu, www.cnn.com, www.irs.gov]

How do we find pages on the web?
- hundreds of billions of pages
- billions of hyperlinks
- plus images, movies, .. , database content
- just browsing does not work anymore
- we need specialized tools for finding pages and information
- search engines and related tools

2 - Overview of web search tools and issues
- Major search engines (google, yahoo, bing, baidu, ask, yandex, …)
- Web directories (yahoo, open directory project (dmoz))
- Specialized search engines (news, citeseer, kosmix, findlaw, blogs)
- Local search engines (for one site or domain) (or one area)
- Meta search engines (dogpile, mamma, search.com, clusty)
- Personal search assistants (alexa, google/msft/yahoo toolbar)
- Comparison shopping (shopping.com, pricegrabber, nextag)
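
The link-structure slide notes that many links to a page may signal importance, and that Google PageRank exploits this. The slides give no formula, so the following is only a minimal sketch of the common textbook power-iteration form of PageRank, not the lecture's or Google's actual method. The toy graph, the function name `pagerank`, the damping factor of 0.85, and the iteration count are all assumptions made for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    graph maps each page to the list of pages it links to; every link
    target is assumed to also appear as a key of the dictionary.
    """
    pages = list(graph)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start from a uniform score
    for _ in range(iterations):
        # Every page receives a small baseline score ("random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in graph.items():
            if targets:
                # A page splits its current score evenly among its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without out-links spreads its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy web: c.html is linked from three other pages.
toy_web = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html"],
}

rank = pagerank(toy_web)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

In this toy graph the most linked-to page ends up with the highest score, and the page nobody links to ends up with the lowest. The same mechanism is what link farms try to manipulate by creating many artificial incoming links.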