CS6913
Web Search Engines
Torsten Suel
CSE Department
NYU Poly
CS6913, Spring 2010
Lecture 1, 1/21/2010
What is this course about?
•
learn how search engines work
(google, yahoo, baidu, …)
•
learn about other types of web search tools
and applications
(pricewatch, citeseer, google news, …)
•
learn about Information Retrieval (IR)
•
learn about data compression
•
learn about web search performance challenges
•
learn how to build such tools!
- basic information retrieval techniques
- what software tools to use
- system architectures and performance
- how to work with TBs of data
Not covered in this course:
•
web site design and HTML
•
Java, web scripting
•
how to use search engines
•
image and multimedia search
•
peer-to-peer search technologies
•
advanced IR:
categorization, clustering, ...
•
natural language processing (NLP)
Today and next class:
(1/21 & 1/28/2010)
•
I - Introduction:
- the web
- overview of search tools
- how the web works
•
II - Basic Techniques
- basic search engine architecture
- crawling basics: following links, robot exclusion, …
- storage
- indexing
- querying and term-based ranking
- link-based ranking
•
III - Introduction to Information Retrieval
•
IV - Introduction to Python
1 – Introduction and Motivation:
What is the Web?
text …
lots of text …
billions of pages of text
•
pages containing (fairly unstructured) text
•
images, audio, etc. embedded in (hanging off) pages
•
structure defined using HTML (Hypertext Markup Language), plus Flash etc.
•
hyperlinks between pages!
•
hundreds of billions of pages
•
trillions of hyperlinks
a giant graph!
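To make the "giant graph" view concrete, here is a minimal Python sketch that pulls the hyperlinks out of HTML pages and stores them as an adjacency list. The two pages and their URLs (`a.example`, `b.example`) are made up for illustration and are held in memory; a real crawler would fetch them over HTTP and cope with malformed HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the targets of <a href="..."> tags found in one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def build_graph(pages):
    """Map each page URL (node) to the list of URLs it links to (edges)."""
    graph = {}
    for url, html in pages.items():
        parser = LinkExtractor(url)
        parser.feed(html)
        graph[url] = parser.links
    return graph


# Hypothetical two-page "web" (assumed data, not from the lecture).
PAGES = {
    "http://a.example/index.html": '<p>See <a href="about.html">about</a> and '
                                   '<a href="http://b.example/">B</a>.</p>',
    "http://b.example/": '<a href="http://a.example/index.html">back to A</a>',
}

GRAPH = build_graph(PAGES)
```

Each key of `GRAPH` is a node and each list entry is a directed edge, which is the representation that link analysis works on.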
What is the web?
(another view)
How the web is organized: link structure
•
hyperlinks are very useful
•
hyperlink structure is also often
meaningful
•
many links to a page: page is important or liked by many?
•
Google PageRank exploits this
•
old idea: citation analysis, social networks
•
understanding link structure helps us understand the web
•
but: manipulation through link farms
•
… just as in real life
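The idea that many links to a page signal importance can be sketched in a few lines of Python. The code below counts in-links and then runs a simplified, textbook-style PageRank iteration. It is an illustration only, not Google's actual implementation: the damping factor 0.85, the fixed iteration count, the even spreading of rank from pages without out-links, and the four-page graph are all assumptions made for this example.

```python
def in_link_counts(graph):
    """Count how many links point at each page."""
    counts = {page: 0 for page in graph}
    for targets in graph.values():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


def pagerank(graph, damping=0.85, iterations=50):
    """Simplified PageRank; assumes every link target is also a key of graph."""
    pages = list(graph)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small base score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in graph.items():
            if targets:
                # A page splits its damped rank evenly among its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without out-links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: most links point at page "c".
GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

COUNTS = in_link_counts(GRAPH)
RANKS = pagerank(GRAPH)
```

In this toy graph page "c" has the most in-links and also ends up with the highest rank, while page "d", which nobody links to, ends up with the lowest. A link farm tries to exploit exactly this effect by creating many artificial pages that all link to one target.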
How the web is organized: site structure
•
pages reside on servers
•
sites often contain related pages
•
site/host structure
•
local versus global links
•
application: connectivity sonar
[Figure: three web servers (hosts): www.poly.edu, www.cnn.com, www.irs.gov]
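The distinction between local and global links can be sketched in Python by comparing host names. This example assumes that "local" means "same host as the linking page"; other definitions (same domain, same site) are possible. The host names are the ones from the slide, while the paths are invented for illustration.

```python
from urllib.parse import urlsplit


def classify_links(source_url, target_urls):
    """Split links into local (same host as the source) and global (other hosts)."""
    source_host = urlsplit(source_url).hostname
    local_links, global_links = [], []
    for target in target_urls:
        if urlsplit(target).hostname == source_host:
            local_links.append(target)
        else:
            global_links.append(target)
    return local_links, global_links


# Hypothetical out-links of one page on www.poly.edu (paths are made up).
LOCAL, GLOBAL = classify_links(
    "http://www.poly.edu/index.html",
    [
        "http://www.poly.edu/admissions/",
        "http://www.cnn.com/",
        "http://www.irs.gov/forms.html",
    ],
)
```

Here `LOCAL` holds the one link that stays on www.poly.edu and `GLOBAL` holds the two links that leave the host.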
How do we find pages on the web?
•
hundreds of billions of pages
•
trillions of hyperlinks
•
plus images, movies, …, database content
•
just browsing does not work anymore
we need specialized tools for finding pages and information:
search engines and related tools
2 - Overview of web search tools and issues
•
Major search engines
(google, yahoo, bing, baidu, ask, yandex, …)
•
Web directories
(yahoo, open directory project (dmoz))
•
Specialized search engines
(news, citeseer, kosmix, findlaw, blogs)
•
Local search engines
(for one site or domain) (or one area)
•
Meta search engines
(dogpile, mamma, search.com, clusty)
•
Personal search assistants
(alexa, google/msft/yahoo toolbar)
•
Comparison shopping
(shopping.com, pricegrabber, nextag)