LarsBackstrom_GeoOnWeb



Geography on the web
Lars Backstrom
Facebook

Introduction
• Information is becoming increasingly geographic as it becomes easier to geotag all forms of data, and many devices have embedded GPS.
• What sorts of questions can we answer with this geographic data?
  – Query logs
  – Friendship
• Data is noisy. Is there enough signal? How can we extract it?
• Simple methods aren't quite good enough; we need a model of the data.

Outline
• Query Logs
  – Probabilistic, generative model of queries
  – Results and evaluation
  – Adding temporal information to the model
  – Modeling more complex geographic query patterns
  – Extracting the most distinctive queries from a location
• Facebook GEO-Data
  – Understanding data
    • Where do people live?
    • How far away do their friends live?
    • How are the two related?
    • Can we predict locations from friends' locations?
    • Using the network to improve beyond prediction based on located friends' locations.
  – Location Algorithm
• Conclusions

Query Logs
• Many topics have geographic focus
  – Sports, airlines, utility companies, attractions
• Our goal is to identify and characterize these topics
  – Find the center of geographic focus for a topic
  – Determine if a topic is tightly concentrated or spread diffusely geographically
• Use Yahoo! query logs to do this
  – Geolocation of queries based on IP address

[Figures: Red Sox, Bell South, Comcast.com, Grand Canyon National Park]

Probabilistic Model
• Consider some query term t
  – e.g. 'red sox'
• For each location x, a query coming from x has probability $p_x$ of containing t
• Our basic model focuses on terms with a center "hot-spot" cell z
  – Probability highest at z
  – $p_x$ is a decreasing function of $\|x - z\|$
• We pick a simple family of functions:
  – A query coming from x at a distance d from the term's center has probability $p_x = C\,d^{-\alpha}$
  – Ranges from non-local (α = 0) to extremely local (large α)

Algorithm
• Maximum likelihood approach allows us to evaluate a choice of center, C and α
• Algorithm finds parameters which maximize likelihood
  – For a given center, likelihood is unimodal and search algorithms find optimal C and α
  – Consider all centers on a coarse mesh, optimize C and α for each center
  – Find best center, consider finer mesh

[Figures: fitted maps labeled α = 1.257, α = 0.931, α = 0.690; Comcast.com, α = 0.24]

More Results (newspapers)

  Newspaper                      α
  The Wall Street Journal        0.11327
  USA Today                      0.263173
  The New York Times             0.304889
  New York Post                  0.459145
  The Daily News                 0.601810
  Washington Post                0.719161
  …
  Chicago Sun Times              1.165482
  The Boston Globe               1.171179
  The Arizona Republic           1.284957
  Dallas Morning News            1.286526
  Houston Chronicle              1.289576
  Star Tribune (Minneapolis)     1.337356

• Term centers land correctly
• Small α indicates nationwide appeal
• Large α indicates local paper

More Results

  School        α
  Harvard       0.386832
  Caltech       0.423631
  Columbia      0.441880
  MIT           0.457628
  Princeton     0.497590
  Yale          0.514267
  Cornell       0.558996
  Stanford      0.627069
  U. Penn       0.729556
  Duke          0.741114
  U. Chicago    1.097012

  City            α
  New York        0.396527
  Chicago         0.528589
  Phoenix         0.551841
  Dallas          0.588299
  Houston         0.608562
  Los Angeles     0.615746
  San Antonio     0.763223
  Philadelphia    0.783850
  Detroit         0.786158
  San Jose        0.850962

Evaluation
• Consider terms with natural 'correct' centers
  – Baseball teams
  – Large US Cities
• We compare with three other ways to find center
  – Center of gravity
  – Median
  – Most likely grid cell
    • Compute baseline rate for all queries
    • Compute likelihood of observations at each 0.1x0.1 grid cell
    • Pick cell with lowest likelihood of being from baseline model

Baseball Teams and Cities
• Our algorithm outperforms mean and median
• Simpler likelihood method does better on baseball teams
  – Our model must fit all nationwide data
  – Makes it less exact for short distances

Temporal Extension
• We observe that the locality of some queries changes over time
  – Query centers may move
  – Query dispersion may change (usually becoming less local)
• We examine a sequence of 24-hour time slices, offset by one hour from each other
  – 24 hours gives us enough data
  – Mitigates diurnal variation, as each slice contains all 24 hours

Hurricane Dean
• Biggest hurricane of 2007
• Computed optimal parameters for each time slice
• Added smoothing term
  – Cost of moving from A to B in consecutive time slices: $\gamma\,|A - B|^2$
• Center tracks hurricane; α decreases as storm hits nationwide news

Multiple Centers
• Not all queries fit the one-center model
  – 'Washington' may mean the city or the state
  – 'Cardinals' might mean the football team, the baseball team, or the bird
  – Airlines have multiple hubs
• We extend our algorithm to locate multiple centers, each with its own C and α
  – Locations use the highest probability from any center
  – To optimize:
    • Start with K random centers, optimize with 1-center algorithm
    • Assign each point to the center giving it highest probability
    • Re-optimize each center for only the points assigned to it

[Figure: United Airlines]

[Figure: Spheres of influence]

Spheres of Influence
• Each baseball team assigned a color
• A team with N queries in a cell gets NC votes for its color
• Map generated by taking weighted average of colors

Distinctive Queries
• For each term and location
  – Find baseline rate p of term over entire map
  – Location has t total queries, s of them with term
  – Probability given baseline rate is: $p^{s}(1-p)^{t-s}$
• For each location, we find the highest deviation from the baseline rate, as measured by the baseline probability

Geo-properties of Friendship
• Using Facebook data, we explore the geographic properties of social networks
  – How well do link distance distributions match existing theories?
  – How does it depend on where people live (urban vs. rural)?
• Expanding beyond user-provided addresses
  – Can we use what we've learned about geography to predict unknown locations from sparse geo-data?
  – How can network data improve location data?

Related Work - Routing
• Long history of link between friendship and geography
  – Milgram experiment (1967) showed short paths between people
  – Watts and Strogatz (1998) showed theoretical models of how this might work
  – Kleinberg (1999) showed intuitive geographic properties could lead to small world
  – Liben-Nowell et al. (2005) showed some of these properties on LiveJournal network
• Lattice network with random links distributed by distance

US FB Population
• We have two ways to determine location
  – From IP address
    • Lots of inaccurate, bad data
  – From user-provided address
    • Also lots of inaccurate, bad data
    • But, more precise when correct
• Using user-provided data, we can map FB usage
  – 3 million US users provided this data
  – More interesting when compared to US population

FB Density
• Surprisingly, the Midwest has the highest FB density, normalized by population

User Data
• User-provided addresses mapped to lat,long using TIGER/Line data from US Census Bureau
  – Only addresses that can be mapped to real street addresses used

Population Density
• Consider each 0.01x0.01 degree (~0.36 sq. miles) region in the US
  – How many FB users are there in each region?
  – Distribution shows two distinct behaviors on log-log plot
    • Rural region, slow falloff
    • Urban region, fast falloff
    • Transition at 560K ft² / FB user, or about 5,600 ft² / person

Population Density
• We divide the US into three regions
  – Low, medium and high density (1/3 of users in each)
• For each person, we count people in annulus of width 0.1 mile and radius r
  – High density regions have more people nearby
  – Curves converge at 50 miles
  – High density curve goes down as we transition urban to rural

Where our friends live
• Mostly nearby, but a significant number far away too
  – 20% within 2 miles
  – 50% within 12 miles
  – 20% over 100 miles
• Population density plays a role

  Distance (miles) within which the given share of friends live
                    20%     50%     80%
  Low Density       3.5     14.8    83.6
  Medium Density    2.1     12.7    97.1
  High Density      1.6     11.9    137.1

Probability of Friendship
• What is the probability that you know someone x miles away?
  – Count the number you know x miles away
  – Count the total number of people x miles away
  – Divide and aggregate over all users
• Tail of curve is roughly $x^{-1}$
• Breaking down by density
  – Higher density means lower probabilities at lower distances
    • Larger denominator
  – At ~100 miles all curves converge
  – Beyond 100 miles, urban users have higher probability
    • More cosmopolitan?

Using Friends to Geolocate
• For some people (about 3%) we have precise user-provided locations
• How well can we infer the locations of the rest based on friendship?
  – We observed the probability of friendship as a function of distance (about $x^{-1}$)

Location Inference Problem
• Problem Statement:
  – Input: user u, with friends $\{v_1, v_2, \ldots, v_k\}$ where $v_i$ is located at location $p_i$
  – Output: an estimate for the location of u
• Friendship probabilities suggest generative model
  – Assume that users u and v are friends with probability f(distance(u,v))
  – f(.) learned from observations of users with known geography (about $x^{-1}$)
• Pick location to maximize likelihood

Pinpointing Location
• Given f(.), we can easily compute the likelihood of location L
  – $\prod_{i} f(\mathrm{distance}(L, p_i)) \cdot \prod_{\sim i} \bigl(1 - f(\mathrm{distance}(L, p_i))\bigr)$
    • First term pulls towards friends
    • Second term pulls towards sparsely populated regions
  – {i} is the set of friends, {~i} is the set of non-friends
• Naively this is expensive
  – Many choices for L
  – {~i} is huge

[Figure: Friends Likelihood Surface]

Pinpointing Location
• Simple optimizations make this practical
  – Product over all non-friends can be pre-computed one time
  – Optimum is (almost) always colocated with a friend
• Algorithm
  – For all L, precompute $\prod_{\text{users}} \bigl(1 - f(\mathrm{distance}(L, p_i))\bigr)$
  – For each friend, compute likelihood at friend's location
  – Pick maximum

[Figure: Likelihood Surface]

Prediction Performance
• Performance evaluated at x miles
  – What fraction of users are placed within x miles of their true location?
  – True location comes from user-provided addresses
• Performance of IP-derived locations
  – 57% within 25 miles
  – 76% within 100 miles
  – Performance on new users/recently updated addresses not much better
    • Measured for people with new addresses in last 90 days
    • Suggests errors come from IP-Location, not people moving

Prediction Performance
• A simple network-based baseline comes from placing a person at the location of a random friend
  – Does better than IP at short distances (IP resolution is rarely better than city-level)
  – Much worse overall
• Our algorithm is better at shorter distances
  – 61% (vs. 57%) at 25 mi.
• It is worse at greater distances
  – 71% (vs. 76%) at 100 mi.

Prediction Performance
• We predict more accurately for people with more friends
  – Median error with 5 located friends is 16 miles
  – With 16 friends, it is 6.8 miles
• Error on users with 16 or more located friends is better at all distances
  – 69% (vs. 57%) at 25 miles
  – 78% (vs. 76%) at 100 miles

Hybrid Algorithm
• Results suggest hybrid algorithm
  – Fewer than 5 friends with user-provided location: use IP-Location
  – Otherwise use friend-based location
  – Strict improvement over IP on this metric

Using the Network Further
• Use prediction for u to improve prediction for u's friends
  – Ideally, maximize global likelihood
    • Hard problem, can't be easily solved
  – Repeat our algorithm multiple times
    • Run once to get first approximations
    • Subsequently, optimize based on known and approximated locations

Testing Iterative Method
• To test, remove locations from ½ of users
• Predict locations for those users based on other ½
• Performance curve shows improvement
  – Second pass significantly better
  – Third pass shows little improvement

Facebook Checkins
• Most popular places to check in
  – Stadium
  – Airport
  – Fisherman's Wharf

Conclusions
• Large-scale study of social geography confirms some observations made on small scales
  – Most friends live within about 10 miles
  – People in urban areas have more long-range ties
• Also lends further credence to geographic explanations of social routing
  – Kleinberg model links to O(x) nodes at distance x with prob. $1/x^2$ each
    • Cumulative distribution of friends is $C \log(x)$

Conclusions
• Using known locations from just a few people, we can bootstrap approximations for the rest
  – Allows us to serve better local content to users
• Future work
  – Use edge creation times (more weight on new edges)
  – Explore different meanings of distance
    • Is the social distance of NY-LA more than NY-Rapid City, SD?

Questions
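
Sketch: fitting the single-center query model

The Probabilistic Model and Algorithm slides describe choosing a center, C and α by maximum likelihood. The sketch below is one way that could look, assuming each grid cell reports its total query count and the count containing the term. The exhaustive grid search over C and α, the flat-plane distance, the one-unit distance floor, and every function name here are illustrative assumptions, not the method used in the talk; the slides only say that a search finds the optimal C and α for each candidate center.

```python
import math

def log_likelihood(cells, C, alpha, eps=1e-12):
    """Binomial log-likelihood of p = C * d**(-alpha), dropping constant terms.

    cells: list of (distance, total_queries, queries_with_term).
    """
    ll = 0.0
    for d, n, k in cells:
        # Assumption: floor the distance at 1 so the power law stays finite at the center.
        p = C * max(d, 1.0) ** (-alpha)
        p = min(max(p, eps), 1.0 - eps)  # keep log() well defined
        ll += k * math.log(p) + (n - k) * math.log(1.0 - p)
    return ll

def fit_center(cells, C_grid, alpha_grid):
    """Return (C, alpha, log_likelihood) maximizing the likelihood for one center."""
    ll, C, alpha = max(
        (log_likelihood(cells, C, a), C, a) for C in C_grid for a in alpha_grid
    )
    return C, alpha, ll

def fit_term(observations, centers, C_grid, alpha_grid):
    """Try every candidate center and keep the best (center, C, alpha, ll).

    observations: list of ((x, y), total_queries, queries_with_term).
    """
    best = None
    for cx, cy in centers:
        cells = [(math.hypot(x - cx, y - cy), n, k) for (x, y), n, k in observations]
        C, alpha, ll = fit_center(cells, C_grid, alpha_grid)
        if best is None or ll > best[3]:
            best = ((cx, cy), C, alpha, ll)
    return best
```

The coarse-to-fine step from the slides would call fit_term again with a denser list of centers around the winner.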
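
Sketch: scoring distinctive queries

The Distinctive Queries slide scores a term at a location by the probability $p^{s}(1-p)^{t-s}$ of the observed counts under the map-wide baseline rate. A minimal sketch of that scoring follows, working in log space to avoid underflow. The dictionary layout and the function names are assumptions made for illustration, and baseline rates are assumed to lie strictly between 0 and 1.

```python
import math

def baseline_log_prob(p, t, s):
    """log of p**s * (1 - p)**(t - s): t queries at the location, s containing the term."""
    return s * math.log(p) + (t - s) * math.log(1.0 - p)

def most_distinctive(counts_here, total_here, baseline_rates):
    """Return the term whose local counts are least likely under its baseline rate.

    counts_here: {term: queries at this location containing the term}
    baseline_rates: {term: rate of the term over the entire map}
    """
    scored = {
        term: baseline_log_prob(baseline_rates[term], total_here, s)
        for term, s in counts_here.items()
    }
    return min(scored, key=scored.get)
```

This follows the slide's formula literally, so it does not by itself separate over-representation from under-representation of a term.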
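
Sketch: locating a user from friends' locations

The Pinpointing Location slides evaluate the likelihood only at friends' locations and reuse a product over all users of (1 - f). The sketch below illustrates that idea under stated assumptions: friend_prob is a hypothetical stand-in for the learned f(.) with a roughly $x^{-1}$ tail, distances are planar rather than geographic, and the friends are assumed to be included in all_locs. In a real system the base term would be precomputed once per candidate location instead of recomputed per user.

```python
import math

def friend_prob(d, c=0.002, d0=1.0):
    """Hypothetical f(d): probability of friendship at distance d, about c/d for large d."""
    return c / (d0 + d)

def distance(a, b):
    """Planar distance between two (x, y) points; a stand-in for geographic distance."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def locate(friend_locs, all_locs):
    """Return (location, log_likelihood) of the best candidate among friends' locations."""
    best_loc, best_ll = None, -math.inf
    for L in friend_locs:
        # Log of the product over all located users of (1 - f), as if nobody were a friend.
        base = sum(math.log(1.0 - friend_prob(distance(L, p))) for p in all_locs)
        # Move each friend from the non-friend factor to the friend factor.
        swap = 0.0
        for p in friend_locs:
            fp = friend_prob(distance(L, p))
            swap += math.log(fp) - math.log(1.0 - fp)
        ll = base + swap
        if ll > best_ll:
            best_loc, best_ll = L, ll
    return best_loc, best_ll
```

For example, with three friends clustered together and one far away, the estimate lands inside the cluster rather than at the distant friend.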

This note was uploaded on 01/11/2011 for the course CS 224 at Stanford.
