17 Pages

### cs345-streams3-2

Course: CS 345, Fall 2001
School: Stanford
Word Count: 707

#### Document Preview

Still More Stream-Mining: Frequent Itemsets, Elephants and Troops, Exponentially Decaying Windows

Counting Items. Problem: given a stream, which items appear more than s times in the window? Possible solution: think of the stream of baskets as one binary stream per item (1 = item present; 0 = not present). Use DGIM to estimate counts of 1s for all items.

Extensions. In principle, you could count frequent pairs or...
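The "Counting Items" idea in the preview can be sketched in Python as follows. This is a minimal illustration rather than the lecture's own code: the class name `DGIM`, the function `frequent_items`, and the parameters `window` and `s` are assumptions introduced here. It keeps one DGIM counter per item, feeds each counter a 1 when the item is in the current basket and a 0 otherwise, and reports the items whose estimated count exceeds `s`. Basic DGIM estimates can be off by up to 50% of the true count, so the threshold test is approximate.

```python
class DGIM:
    """Approximate count of 1s among the last `window` bits of a binary stream."""

    def __init__(self, window):
        self.window = window
        # Buckets are stored newest first as (timestamp of most recent 1, size).
        self.buckets = []

    def add(self, t, bit):
        # Drop buckets whose most recent 1 has left the window.
        while self.buckets and self.buckets[-1][0] <= t - self.window:
            self.buckets.pop()
        if not bit:
            return
        self.buckets.insert(0, (t, 1))
        i = 0
        # Whenever three buckets share a size, merge the two oldest of them
        # into one bucket of twice the size, and cascade to the next size.
        while i + 2 < len(self.buckets) and self.buckets[i][1] == self.buckets[i + 2][1]:
            ts, size = self.buckets[i + 1]
            self.buckets[i + 1] = (ts, 2 * size)
            del self.buckets[i + 2]
            i += 1

    def estimate(self):
        # Count every bucket fully except the oldest, which counts as half.
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        return total - self.buckets[-1][1] // 2


def frequent_items(baskets, window, s):
    """Return {item: estimate} for items estimated to appear more than s times."""
    counters = {}
    for t, basket in enumerate(baskets, start=1):
        for item in basket:
            if item not in counters:
                counters[item] = DGIM(window)
        # One binary stream per item: 1 = item present, 0 = not present.
        for item, counter in counters.items():
            counter.add(t, 1 if item in basket else 0)
    return {item: c.estimate() for item, c in counters.items() if c.estimate() > s}


if __name__ == "__main__":
    baskets = [{"a", "b"}, {"a"}, {"a", "c"}, {"b"}, {"a"}, {"a", "b"}]
    print(frequent_items(baskets, window=4, s=1))
```

Note that this sketch updates every known item's counter on every basket, so the work per basket grows with the number of distinct items seen so far.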

#### Unformatted Document Excerpt

Below is a small sample set of documents:

Stanford - CS - 345
Still More Stream-Mining: Frequent Itemsets, Elephants and Troops, Exponentially Decaying Windows. Counting Items. Problem: given a stream, which items appear more than s times in the window? Possible solution: think of the stream of baskets as one binary
Stanford - CS - 345
Stream Clustering: Extension of DGIM to More Complex Problems. Clustering a Stream. Assume points enter in a stream. Maintain a sliding window of points. Queries ask for clusters of points within some suffix of the window. Important issue: where are the cl
Stanford - CS - 345
Stream Clustering: Extension of DGIM to More Complex Problems. Clustering a Stream. Assume points enter in a stream. Maintain a sliding window of points. Queries ask for clusters of points within some suffix of the window. Important issue: where ar
Stanford - CS - 345
CS345 Data Mining: Introductions, What Is It?, Cultures of Data Mining. Course Staff. Instructors: Anand Rajaraman, Jeff Ullman. TA: Robbie Yan. Requirements: Homework (Gradiance and other) 20%, Project 40%, Final Exam 40%. Gradiance class code BB8F69
Stanford - CS - 345
CS345 - Data Mining: Introductions, What Is It?, Cultures of Data Mining. Course Staff. Instructors: Anand Rajaraman, Jeff Ullman. TA: Jeff Klingner. Requirements: Homework (Gradiance and other) 20%. Gradiance class code DD984360. Project 40%, Final Exam 40%. Pr
Stanford - CS - 345
CS345 Data Mining: Introductions, What Is It?, Cultures of Data Mining. Course Staff. Instructors: Anand Rajaraman, Jeff Ullman. TA: Jeff Klingner. Requirements: Homework (Gradiance and other) 20%, Project 40%, Final Exam 40%. Gradiance class code DD9
Stanford - CS - 345
CS345 - Data Mining: Course Introduction, Varieties of Data Mining, Bonferroni's Principle. Course Staff. Instructors: Anand Rajaraman, Jeff Ullman. TA: Babak Pahlavan. Requirements: Homework (Gradiance and other) 20%. Gradiance class code B0E9AA66. Note URL for
Stanford - CS - 345
CS345A: Data Mining on the Web. Course Introduction, Issues in Data Mining, Bonferroni's Principle. Course Staff. Instructors: Anand Rajaraman, Jeff Ullman. TA: Babak Pahlavan. Requirements: Homework (Gradiance and other) 20%. Gradiance class code B0E9A
Stanford - CS - 345
CS345A: Data Mining on the Web. Course Introduction, Issues in Data Mining, Bonferroni's Principle. Course Staff. Instructors: Anand Rajaraman, Jeff Ullman. Reach us as cs345awin0809staff @ lists.stanford.edu. More info on www.stanford.edu/class/cs345a.
Stanford - CS - 345
Generalizing MapReduce: The Computational Model, MapReduce-Like Algorithms, Computing Joins. Overview. There is a new computing environment available: Mapreduce allows us to exploit this environment easily. But not everything is mapreduce. What else c
Stanford - CS - 345
CS 345A Data Mining: MapReduce. Single-node architecture: CPU, Machine Learning, Statistics, Memory, "Classical" Data Mining, Disk. Commodity Clusters. Web data sets can be very large: tens to hundreds of terabytes. Cannot mine on a single server (why?). Standard arc
Stanford - CS - 345
CS 345A Data Mining: MapReduce. Single-node architecture: CPU, Machine Learning, Statistics, Memory, "Classical" Data Mining, Disk. Commodity Clusters. Web data sets can be very large. Cannot mine on a single server (why?). Standard architecture emerging: Te
Stanford - CS - 345
CS 345A Data Mining: MapReduce. Single-node architecture: CPU, Machine Learning, Statistics, Memory, Classical Data Mining, Disk. Commodity Clusters. Web data sets can be very large: tens to hundreds of terabytes. Cannot mine on a single server (why?). Standard archi
Stanford - CS - 345
CS 345A Data Mining: MapReduce. Single-node architecture: CPU, Machine Learning, Statistics, Memory, "Classical" Data Mining, Disk. Commodity Clusters. Web data sets can be very large. Cannot mine on a single server (why?). Standard architecture emerging: Te
Stanford - CS - 345
Near-Neighbor Search: Applications, Matrix Formulation, Minhashing. Example Application: Face Recognition. We have a database of (say) 1 million face images. We want to find the most similar images in the database. Represent faces by (relatively) invariant v
Stanford - CS - 345
Near-Neighbor Search: Applications, Matrix Formulation, Minhashing. Example Application: Face Recognition. We have a database of (say) 1 million face images. We want to find the most similar images in the database. Represent faces by (relatively) invari
Stanford - CS - 345
Near-Neighbor Search: Applications, Matrix Formulation, Minhashing. Example Problem: Face Recognition. We have a database of (say) 1 million face images. We are given a new image and want to find the most similar images in the database. Represent faces by (
Stanford - CS - 345
Near-Neighbor Search: Applications, Matrix Formulation, Minhashing. Example Problem: Face Recognition. We have a database of (say) 1 million face images. We are given a new image and want to find the most similar images in the database. Represent faces b
Stanford - CS - 345
What is Database Theory? A collection of studies, often connected to the relational model of data. Restricted forms of logic, between SQL and full first-order. Dependency theory: generalizing functional dependencies. Conjunctive queries (CQ's): useful, decida
Stanford - CS - 345
CS345 Data Mining: Link Analysis Algorithms, Page Rank. Anand Rajaraman, Jeffrey D. Ullman. Link Analysis Algorithms: Page Rank, Hubs and Authorities, Topic-Specific Page Rank, Spam Detection Algorithms. Other interesting topics we won't cover: Detecting dup
Stanford - CS - 345
Link Analysis Algorithms. CS345 Data Mining: Link Analysis Algorithms, Page Rank. Page Rank, Hubs and Authorities, Topic-Specific Page Rank, Spam Detection Algorithms. Other interesting topics we won't cover: Detecting duplicates and mirrors, Mining for communities
Stanford - CS - 345
CS345 Data Mining: Link Analysis Algorithms, Page Rank. Anand Rajaraman, Jeffrey D. Ullman. Link Analysis Algorithms: Page Rank, Hubs and Authorities, Topic-Specific Page Rank, Spam Detection Algorithms. Other interesting topics we won't cover: Detecting dup
Stanford - CS - 345
CS345 Data Mining: Link Analysis Algorithms, Page Rank. Anand Rajaraman, Jeffrey D. Ullman. Link Analysis Algorithms: Page Rank, Hubs and Authorities, Topic-Specific Page Rank, Spam Detection Algorithms. Other interesting topics we won't cover: Detecting duplicates
Stanford - CS - 345
CS345 Data Mining: Link Analysis Algorithms, Page Rank. Anand Rajaraman, Jeffrey D. Ullman. Link Analysis Algorithms: Page Rank, Hubs and Authorities, Topic-Specific Page Rank, Spam Detection Algorithms. Other interesting topics we won't cover: Detecting duplicates
Stanford - CS - 345
CS345 Data Mining: Link Analysis Algorithms, Page Rank. Anand Rajaraman, Jeffrey D. Ullman. Link Analysis Algorithms: Page Rank, Hubs and Authorities, Topic-Specific Page Rank, Spam Detection Algorithms. Other interesting topics we won't cover: Detecting dup
Stanford - CS - 345
CS345 Data Mining: Link Analysis 2, Page Rank Variants. Anand Rajaraman, Jeffrey D. Ullman. Topics. This lecture: Many-walkers model, Tricks for speeding convergence, Topic-Specific Page Rank. Random walk interpretation: At time 0, pick a page on the web uniformly
Stanford - CS - 345
CS345 Data Mining: Link Analysis 2, Page Rank Variants. Anand Rajaraman, Jeffrey D. Ullman. Topics. This lecture: Many-walkers model, Tricks for speeding convergence, Topic-Specific Page Rank. Random walk interpretation: At time 0, pick a page on the web unif
Stanford - CS - 345
CS345 Data Mining: Recommendation Systems. Anand Rajaraman, Jeffrey D. Ullman. Recommendations: Search, Recommendations. Items: Products, web sites, blogs, news items, ... The Long Tail. Source: Chris Anderson (2004). From scarcity to abundance. Shelf space is a scarc
Stanford - CS - 345
CS345 Data Mining: Recommendation Systems, Netflix Challenge, Course Projects. Anand Rajaraman, Jeffrey D. Ullman. Recommendations: Search, Recommendations. Items: Products, web sites, blogs, news items, ... From scarcity to abundance. Shelf space is a scarce com
Stanford - CS - 345
CS345 Data Mining: Recommendation Systems. Anand Rajaraman, Jeffrey D. Ullman. Recommendations: Search, Recommendations. Items: Products, web sites, blogs, news items, ... The Long Tail. Source: Chris Anderson (2004). From scarcity to abundance. Shelf space is a
Stanford - CS - 345
CS345 Data Mining: Recommendation Systems, Netflix Challenge. Anand Rajaraman, Jeffrey D. Ullman. Recommendations: Search, Recommendations. Items: Products, web sites, blogs, news items, ... From scarcity to abundance. Shelf space is a scarce commodity for tradi
Stanford - CS - 345
CS345 Data Mining: Recommendation Systems, Netflix Challenge, Course Projects. Anand Rajaraman, Jeffrey D. Ullman. Recommendations: Search, Recommendations. Items: Products, web sites, blogs, news items, ... From scarcity to abundance. Shelf space is a scarce commodit
Stanford - CS - 345
CS345 Data Mining: Mining the Web for Structured Data. Our view of the web so far: Web pages as atomic units. Great for some applications, e.g., conventional web search. But not always the right model. Going beyond web pages: Question answering. What is the height
Stanford - CS - 345
CS345 Data Mining: Mining the Web for Structured Data. Our view of the web so far: Web pages as atomic units. Great for some applications, e.g., conventional web search. But not always the right model. Going beyond web pages: Question answering, Relation
Stanford - CS - 345
CS345 Data Mining: Mining the Web for Structured Data. Our view of the web so far: Web pages as atomic units. Great for some applications, e.g., conventional web search. But not always the right model. Going beyond web pages: Question answering. What is the height
Stanford - CS - 345
CS345 Data Mining: Mining the Web for Structured Data. Our view of the web so far: Web pages as atomic units. Great for some applications, e.g., conventional web search. But not always the right model. Going beyond web pages: Question answering, Relation Extraction. What is the height of Mt Everes
Stanford - CS - 345
CS345 Data Mining: Mining the Web for Structured Data. Our view of the web so far: Web pages as atomic units. Great for some applications, e.g., conventional web search. But not always the right model. Going beyond web pages: Question answering, Relation
Stanford - CS - 345
Finding Similar Sets: Applications, Shingling, Minhashing, Locality-Sensitive Hashing. Goals. Many Web-mining problems can be expressed as finding "similar" sets: 1. Pages with similar words, e.g., for classification by topic. 2. NetFlix users with similar ta
Stanford - CS - 345
Applications of LSH: Entity Resolution, Fingerprints, Similar News Articles. Desiderata. Whatever form we use for LSH, we want: 1. The time spent performing the LSH should be linear in the number of objects. 2. The number of candidate pairs should be propor
Stanford - CS - 345
Applications of LSH: Entity Resolution, Fingerprints, Similar News Articles. Desiderata. Whatever form we use for LSH, we want: 1. The time spent performing the LSH should be linear in the number of objects. Bucketizing guarantees (1). 2. The number of
Stanford - CS - 345
Theory of LSH: Distance Measures, LS Families of Hash Functions, S-Curves. Distance Measures. Generalized LSH is based on some kind of "distance" between points. Two major classes of distance measure: 1. Euclidean 2. Non-Euclidean. Similar points are "close
Stanford - CS - 345
Methods for High Degrees of Similarity: Index-Based Methods, Exploiting Prefixes and Suffixes, Exploiting Length. Overview. LSH-based methods are excellent for similarity thresholds that are not too high, possibly up to 80% or 90%. But for similarities above
Stanford - CS - 345
Methods for High Degrees of Similarity: Index-Based Methods, Exploiting Prefixes and Suffixes, Exploiting Length. Overview. LSH-based methods are excellent for similarity thresholds that are not too high. But for similarities above that, there are other me
Stanford - CS - 345
CS345: Compact Skeletons. Assume tuple components are scattered over a website. We have a tagger that can tag all tuple components on the website. Assume no noise for now. Reconstruct the relation. Welcome to Big Corp! Relation Skele
Stanford - CS - 345
CS345: Compact Skeletons. Assume tuple components are scattered over a website. We have a tagger that can tag all tuple components on the website. Assume no noise for now. Reconstruct the relation. Relation Skeleton, Data Graph, Webs
Stanford - CS - 345
CS345: Compact Skeletons. Assume tuple components are scattered over a website. We have a tagger that can tag all tuple components on the website. Assume no noise for now. Reconstruct the relation. Relation Skeleton, Data Graph, Webs
Stanford - CS - 345
CS345 Data Mining: Web Spam Detection. Economic considerations: Search has become the default gateway to the web. Very high premium to appear on the first page of search results, e.g., e-commerce sites, advertising-driven sites. What is web spam? Spamming =
Stanford - CS - 345
CS345 Data Mining: Link Analysis 2: Topic-Specific Page Rank, Hubs and Authorities, Spam Detection. Anand Rajaraman, Jeffrey D. Ullman. Topic-Specific Page Rank: Instead of generic popularity, can we measure popularity within a topic? E.g., computer science, he
Stanford - CS - 345
CS345 Data Mining: Link Analysis 2: Topic-Specific Page Rank, Hubs and Authorities, Spam Detection. Anand Rajaraman, Jeffrey D. Ullman. Topic-Specific Page Rank: Instead of generic popularity, can we measure popularity within a topic? Bias the random walk. E.g.
Stanford - CS - 345
CS345 Data Mining: Link Analysis 2: Topic-Specific Page Rank, Hubs and Authorities, Spam Detection. Anand Rajaraman, Jeffrey D. Ullman. Some problems with page rank: Measures generic popularity of a page. Uses a single measure of importance. Biased against to
Stanford - CS - 345
CS345 Data Mining: Link Analysis 3: Hubs and Authorities, Spam Detection. Problem formulation (1998): Suppose we are given a collection of documents on some broad topic, e.g., stanford, evolution, iraq, perhaps obtained through a text search. Can we organize the
Stanford - CS - 345
CS345 Data Mining: Link Analysis 3: Hubs and Authorities, Spam Detection. Anand Rajaraman, Jeffrey D. Ullman. Problem formulation (1998): Suppose we are given a collection of documents on some broad topic. Can we organize these documents in some manner?
Stanford - CS - 345
CS345 Data Mining: Virtual Databases. Example: Find marketing manager openings in Internet companies so that my commute is shorter than 10 miles. Structured queries, e.g., in SQL. Virtual Relations. Web Applications. Comparison shopping: shopping.com, fatlens, mo
Stanford - CS - 345
CS345 Data Mining: Virtual Databases. Example: Find marketing manager openings in Internet companies so that my commute is shorter than 10 miles. Structured queries, e.g., in SQL. Virtual Relations. Web Applications. Comparison shopping. Job search: indeed.com, simplyhired. Classifieds
Stanford - CS - 345
CS345 Data Mining: Virtual Databases. Example: Find marketing manager openings in Internet companies so that my commute is shorter than 10 miles. Structured queries, e.g., in SQL. Virtual Relations. Web Applications. Comparison shopping. Job search: indeed.com, simplyhired. Classifieds
Stanford - CS - 345
CS 345A Data Mining, Lecture 1. What is Web Mining? Introduction to Web Mining. Discovering useful information from the World-Wide Web and its usage patterns. Web Mining v. Data Mining: Structure (or lack of it). Textual information and linkage structure. Web Mi