
CSci 6907: Data Management and Exploration on the Web
Nan Zhang

Course Information
Meeting time: Mondays, 6:10-8:40 PM
Meeting location: Phillips Hall, Room 108
Office hours: Mondays, 12:00-2:00 PM
Office: Academic Center 715
Phone: (202) 994-5919
Email: [email protected]
Course Website
o http://www.seas.gwu.edu/~nzhang10/6907/

Course Website

Course Schedule

Grading
Homework
o 5% * 5 = 25%
Presentation
o 25%
Project
o 50%

Surface Web Crawling and Information Retrieval
From slides for Bing Liu, Web Data Mining, Springer, 2007.

The Deep Web
Deep Web vs Surface Web
o Dynamic content, unlinked pages, private web, contextual web, etc.
o Estimated size: 91,850 terabytes (deep web) vs 167 terabytes (surface web) [1]; hundreds or thousands of times larger than the surface web [2]
[1] SIMS, UC Berkeley, How Much Information? 2003.
[2] BrightPlanet, Deep Web FAQs, 2010, http://www.brightplanet.com/the-deep-web/
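
As a quick sanity check on the two size estimates quoted from [1], the ratio between them can be computed directly. The variable names are illustrative and do not come from the slides.

```python
# Size estimates quoted on the slide from [1], in terabytes.
deep_web_tb = 91_850
surface_web_tb = 167

# How many times larger the deep web estimate is than the surface web estimate.
ratio = deep_web_tb / surface_web_tb
print(f"Deep web / surface web = {ratio:.0f}x")  # prints "Deep web / surface web = 550x"
```

A factor of 550 sits at the "hundreds of times larger" end of the range attributed to [2].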

Hidden Web Repositories
[Diagram labels: Web User Hidden Repository Owner]

Deep Web Repository: Example I
Enterprise Search Engine’s Corpus
o Unstructured data
o Keyword search
o Top-k
[Figure label: Asthma]

Exploration: Example I
Metasearch engine
o Discovers deep web repositories on a given topic
o Integrates query answers from multiple repositories
o For result re-organization, evaluates the quality of each repository through analytics
  e.g., how large is the repository?
  e.g., average length of documents on a given topic
[Figure labels: Disease info, Treatment info]
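
The metasearch workflow above (query several repositories, merge their answers, compute a simple analytic per repository) might be sketched as follows. This is a minimal illustration, not the course's method: the in-memory `REPOSITORIES`, the term-frequency ranking, and the function names are all assumptions made for the example.

```python
import re
from statistics import mean

# Hypothetical stand-ins for two deep web repositories (not from the slides).
REPOSITORIES = {
    "disease_info": [
        "asthma is a chronic disease of the airways",
        "asthma symptoms include wheezing and asthma attacks",
        "diabetes affects blood sugar",
    ],
    "treatment_info": [
        "inhalers are a common asthma treatment",
        "insulin is used to treat diabetes",
    ],
}


def tokens(text):
    """Lowercase word tokens of a document."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_search(docs, keyword, k):
    """Simulated top-k keyword interface: rank matches by term frequency."""
    keyword = keyword.lower()
    hits = [d for d in docs if keyword in tokens(d)]
    hits.sort(key=lambda d: tokens(d).count(keyword), reverse=True)
    return hits[:k]


def metasearch(repos, keyword, k):
    """Query every repository, merge the answers, and collect simple analytics."""
    merged, avg_len = [], {}
    for name, docs in repos.items():
        top = keyword_search(docs, keyword, k)
        merged.extend((name, doc) for doc in top)
        # Average length (in words) of the returned documents only.
        avg_len[name] = mean(len(tokens(d)) for d in top) if top else 0.0
    return merged, avg_len


results, avg_len = metasearch(REPOSITORIES, "asthma", k=2)
```

Note that the average here is taken over the top-k answers only, so it is likely a biased estimate of the repository-wide average. Obtaining reliable analytics through a restricted top-k interface is the harder problem the slide points to.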

Example II
Yahoo! Auto, other online e-commerce websites
o Structured data
o Form-like search
o Top-1500

Exploration: Example II
Third-party services for an individual repository
o Find fake products
o Price distribution
o Construction of a universal mobile interface
Third-party services for multiple repositories
o Repository comparison
o Consumer behavior analysis
Main Tasks
o Resource discovery
o Data integration
o Single-/cross-site analytics
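
To make the "price distribution" service concrete, here is a hedged sketch of gathering data through a form-like interface that returns only the first k matches. Everything in it is hypothetical: the `LISTINGS` data, the attribute domains, the tiny cap `K = 2`, and in particular the assumption that the interface signals when a query overflows. The slides do not say how a real site behaves.

```python
from collections import Counter

# Hypothetical listings standing in for a structured repository.
LISTINGS = [
    {"make": "Honda", "body": "sedan", "price": 18000},
    {"make": "Honda", "body": "sedan", "price": 21000},
    {"make": "Honda", "body": "suv", "price": 27000},
    {"make": "Toyota", "body": "sedan", "price": 19000},
    {"make": "Toyota", "body": "suv", "price": 31000},
    {"make": "Toyota", "body": "suv", "price": 33000},
]
DOMAINS = {"make": ["Honda", "Toyota"], "body": ["sedan", "suv"]}
K = 2  # tiny cap for illustration; the slide's example interface is top-1500


def form_search(conditions, k=K):
    """Simulated form interface: at most k rows, plus an overflow flag."""
    matches = [r for r in LISTINGS
               if all(r[attr] == val for attr, val in conditions.items())]
    return matches[:k], len(matches) > k


def crawl(conditions=None):
    """Refine an overflowing query one attribute at a time until it fits."""
    conditions = conditions or {}
    rows, overflow = form_search(conditions)
    unused = [a for a in DOMAINS if a not in conditions]
    if not overflow or not unused:
        return rows  # complete answer, or nothing left to refine on
    collected = []
    for value in DOMAINS[unused[0]]:
        collected.extend(crawl({**conditions, unused[0]: value}))
    return collected


rows = crawl()
# Price distribution in 10,000-wide buckets, keyed by each bucket's lower bound.
price_histogram = Counter(row["price"] // 10_000 * 10_000 for row in rows)
```

With this toy data the crawl recovers all six listings. If a fully specified query still overflowed, the rows beyond the cap would stay hidden, which is one reason sampling is considered alongside crawling.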

Example III
o Semi-structured data
o Graph browsing
o Local view
Picture from Jay Goldman, Facebook Cookbook, O’Reilly Media, 2008.

Exploration: Example III
For commercial advertisers:
o Market penetration of a social network
o “Buzz words” tracking
For private detectives:
o Find pages related to an individual
For individual page owners:
o Understand the (relative) popularity of one's own page
o Understand how new posts affect the popularity
o Understand how to promote the page
Main Tasks: resource discovery and data integration are less of a challenge; analytics on very large amounts of data becomes the main challenge.
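
The "graph browsing, local view" restriction from Example III can be illustrated with a small sketch: each request reveals only one user's neighbors, so any analytic has to be built from a limited number of such requests. The `GRAPH` data, `local_view`, and the breadth-first strategy are assumptions chosen for the example, not something the slides prescribe.

```python
from collections import deque

# Hypothetical friendship graph; a real site would expose it only page by page.
GRAPH = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "carol", "dave"],
    "carol": ["alice", "bob"],
    "dave": ["bob"],
}


def local_view(user):
    """Simulated interface: only the neighbors of one user per request."""
    return GRAPH[user]


def crawl(seed, max_requests):
    """Breadth-first exploration limited to max_requests local-view requests."""
    degree = {}
    seen = {seed}
    queue = deque([seed])
    while queue and len(degree) < max_requests:
        user = queue.popleft()
        neighbors = local_view(user)
        degree[user] = len(neighbors)
        for n in neighbors:
            if n not in seen:
                seen.add(n)
                queue.append(n)
    return degree


degree = crawl("alice", max_requests=3)
most_connected = max(degree, key=degree.get)
```

Here the neighbor count serves as a crude stand-in for page popularity. With a budget of three requests, `dave` is discovered but never visited, which shows how a request limit leaves part of the graph unseen.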

Summary of Main Tasks/Obstacles
Find where the data are
o Resource discovery: find URLs of deep web repositories
o Required by: metasearch engine, shopping website comparison, consumer behavior modeling, etc.
Understand the web interface
o Required by almost all applications.
Explore the underlying data
o Crawling, sampling, and analytics
o Required by: metasearch engine, keep it real fake, price prediction, universal mobile interface, shopping website comparison, consumer behavior modeling, market penetration analysis, social page evaluation and optimization, etc.