cis6930fa11_WebTables


WebTables: Exploring the Power of Tables on the Web
Michael J. Cafarella, Alon Halevy, Daisy Zhe Wang, Eugene Wu, Yang Zhang
Presented by: Ganesh Viswanathan
September 29th, 2011
CIS 6930 – Data Science: Large-scale Advanced Data Analysis
University of Florida, Gainesville

Outline
• Introduction to the work
• Motivation
• WebTables
• Relation Recovery
  – Relational Filtering
  – Metadata recovery
• Attribute Correlation Statistics Database (ACSDb)
• Relation Search
• Ranking
• Indexing
• ACSDb Applications
• Schema auto-complete
• Attribute synonym-finding
• Join graph traversal
• Experimental Results
• Discussion

Introduction
• Age of big data:
  – Availability of massive amounts of data is driving many technical advances on the web and off
  – The web is traditionally modeled as a corpus of unstructured data
  – But web documents contain large amounts of relational data
• Questions
  – What are effective techniques for searching for structured data at search-engine scales?
  – What additional power can be derived by analyzing such a huge corpus?

Motivation
• There is large demand for queries whose results are contained in this corpus
• Around 30 million queries in a 1-day Google query log
• Relational ranking is difficult because the data is not purely unstructured, but rather a mixture of “structure” and “content”
• Tables lack the incoming hyperlink anchor text used in traditional IR
• Ranking methods like PageRank do not give optimum results.

Example: Relational Data on the Web

WebTables
• The goals are to gather a corpus of high-quality relational data on the web and make it more searchable.
• Describe a ranking method for tables on the web based on their relational data, combining a traditional index with a query-independent, corpus-wide coherency score.
• Define an attribute correlation statistics database (ACSDb) containing statistics about corpus schemas.
• Use these statistics to create novel tools for database designers, addressed later on.
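
The idea of an ACSDb holding statistics about corpus schemas can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a schema is simply the set of attribute labels of a table, and that the statistics of interest are how often each schema occurs and how often attributes occur together. The function names and the toy schemas are hypothetical.

```python
from collections import Counter


def build_acsdb(schemas):
    """Count how often each schema (an unordered set of attribute labels) occurs."""
    counts = Counter()
    for attrs in schemas:
        # Normalize labels so that "Name" and " name" count as the same attribute.
        schema = frozenset(a.strip().lower() for a in attrs)
        counts[schema] += 1
    return counts


def attribute_probability(acsdb, attr):
    """Fraction of schema occurrences that contain `attr`."""
    total = sum(acsdb.values())
    hits = sum(c for schema, c in acsdb.items() if attr in schema)
    return hits / total if total else 0.0


def conditional_probability(acsdb, attr, given):
    """Fraction of schema occurrences containing `given` that also contain `attr`."""
    given_count = sum(c for schema, c in acsdb.items() if given in schema)
    both = sum(c for schema, c in acsdb.items() if attr in schema and given in schema)
    return both / given_count if given_count else 0.0


# Toy corpus of four table schemas (illustrative data only).
toy_schemas = [
    ["name", "addr", "phone"],
    ["Name", "addr", "phone"],
    ["name", "size", "last-modified"],
    ["make", "model", "year"],
]
acsdb = build_acsdb(toy_schemas)
p_name = attribute_probability(acsdb, "name")
p_addr_given_name = conditional_probability(acsdb, "addr", "name")
```

Co-occurrence statistics of this kind are what would let a tool suggest "addr" and "phone" once a designer has typed "name", which is the intuition behind the schema auto-complete application listed in the outline.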

WebTables
• Using Google’s English-language web indexes, 14.1 billion raw HTML tables were extracted
• Tables used for page layout and other non-relational purposes were filtered out
• This resulted in 154M distinct relational databases, i.e., around 1.1% of all raw HTML tables.

Relational Search Interface

Applications
• We can leverage the ACSDb to offer solutions to the following tasks:
  – Schema auto-complete tool to help database designers choose a schema
  – Attribute synonym-finding tool that computes pairs of schema attributes used synonymously
  – Join graph traversal using common attributes and clustering.

The Deep Web
• Deep web refers to tables behind HTML forms
• Deep-web pages can be detected by checking whether the URL is parameterized
• Some deep web data is web crawlable
• The vast majority still requires sophisticated systems that can input parameters in a semantically meaningful manner
• Around 40% of the corpus comes from deep web sources and 60% from non-deep-web sources.

The WebTables System...
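
The deep-web slide above says that deep-web pages can be detected by checking whether the URL is parameterized. A minimal sketch of such a check is shown here, assuming that "parameterized" simply means the URL carries a non-empty query string; the exact rule used by WebTables is not given in the slides, and the function name and example URLs are hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl


def looks_parameterized(url):
    """Heuristic: treat a URL as deep-web-like if it has at least one query parameter."""
    query = urlsplit(url).query
    # keep_blank_values=True so that "?q=" still counts as a parameter.
    return len(parse_qsl(query, keep_blank_values=True)) > 0


form_result = looks_parameterized("http://example.com/search?make=honda&year=2005")
static_page = looks_parameterized("http://example.com/tables/cars.html")
```

A check like this only flags candidates: as the slide notes, some parameterized pages are still crawlable, so it does not by itself separate reachable from unreachable content.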

This note was uploaded on 11/09/2011 for the course CIS 6930 taught by Professor Staff during the Fall '08 term at University of Florida.
