Geography on the web
Lars Backstrom
Facebook

Introduction
• Information is becoming increasingly geographic as it becomes easier to geotag all forms of data, and many devices have embedded GPS.
• What sorts of questions can we answer with this geographic data?
  – Query logs
  – Friendship
• Data is noisy. Is there enough signal? How can we extract it?
• Simple methods aren’t quite good enough; we need a model of the data.

Outline
• Query Logs
  – Probabilistic, generative model of queries
  – Results and evaluation
  – Adding temporal information to the model
  – Modeling more complex geographic query patterns
  – Extracting the most distinctive queries from a location
• Facebook GEO-Data
  – Understanding data
    • Where do people live?
    • How far away do their friends live?
    • How are the two related?
    • Can we predict locations from friends’ locations?
    • Using the network to improve beyond prediction based on located friends’ locations.
  – Location Algorithm
• Conclusions

Query Logs
• Many topics have geographic focus
  – Sports, airlines, utility companies, attractions
• Our goal is to identify and characterize these topics
  – Find the center of geographic focus for a topic
  – Determine if a topic is tightly concentrated or spread diffusely geographically
• Use Yahoo! query logs to do this
  – Geolocation of queries based on IP address
[Figure panels: Red Sox; Bell South; Comcast.com; Grand Canyon National Park]

Probabilistic Model
• Consider some query term t
• For each location x, a query coming from x has probability $p_x$ of containing t
• Our basic model focuses on terms with a center “hot-spot” cell z.
• We pick a simple family of functions:
  – Probability highest at z
  – $p_x$ is a decreasing function of $\|x - z\|$
  – e.g. ‘red sox’
  – A query coming from x at a distance d from the term’s center has probability $p_x = C d^{-\alpha}$
  – Ranges from non-local (α = 0) to extremely local (large α)

Algorithm
• Maximum likelihood approach allows us to evaluate a choice of center, C and α
• Algorithm finds parameters which maximize likelihood
  – For a given center, likelihood is unimodal and search algorithms find optimal C and α
  – Consider all centers on a coarse mesh, optimize C and α for each center
  – Find best center, consider finer mesh
[Figure panels: α = 1.257; α = 0.931; α = 0.690; Comcast.com, α = 0.24]

More Results (newspapers)
  Newspaper                      α
  The Wall Street Journal        0.11327
  USA Today                      0.263173
  The New York Times             0.304889
  New York Post                  0.459145
  The Daily News                 0.601810
  Washington Post                0.719161
  …
  Chicago Sun Times              1.165482
  The Boston Globe               1.171179
  The Arizona Republic           1.284957
  Dallas Morning News            1.286526
  Houston Chronicle              1.289576
  Star Tribune (Minneapolis)     1.337356
• Term centers land correctly
• Small α indicates nationwide appeal
• Large α indicates local paper

More Results
  School         α
  Harvard        0.386832
  Caltech        0.423631
  Columbia       0.441880
  MIT            0.457628
  Princeton      0.497590
  Yale           0.514267
  Cornell        0.558996
  Stanford       0.627069
  U. Penn        0.729556
  Duke           0.741114
  U. Chicago     1.097012

  City           α
  New York       0.396527
  Chicago        0.528589
  Phoenix        0.551841
  Dallas         0.588299
  Houston        0.608562
  Los Angeles    0.615746
  San Antonio    0.763223
  Philadelphia   0.783850
  Detroit        0.786158
  San Jose       0.850962

Evaluation
• Consider terms with natural ‘correct’ centers
  – Baseball teams
  – Large US Cities
• We compare with three other ways to find center
  – Center of gravity
  – Median
  – Most likely grid cell
    • Compute baseline rate for all queries
    • Compute likelihood of observations at each 0.1x0.1 grid cell
    • Pick cell with lowest likelihood of being from baseline model

Baseball Teams and Cities
• Our algorithm outperforms mean and median
• Simpler likelihood method does better on baseball teams
  – Our model must fit all nationwide data
  – Makes it less exact for short distances

Temporal Extension
• We observe that the locality of some queries changes over time
  – Query centers may move
  – Query dispersion may change (usually becoming less local)
• We examine a sequence of 24 hour time slices, offset by one hour from each other
  – 24 hours gives us enough data
  – Mitigates diurnal variation, as each slice contains all 24 hours

Hurricane Dean
• Biggest hurricane of 2007
• Computed optimal parameters for each time slice
• Added smoothing term
  – Cost of moving from A to B in consecutive time slices: $\gamma\,|A - B|^2$
• Center tracks hurricane, alpha decreases as storm hits nationwide news

Multiple Centers
• Not all queries fit the one-center model
  – ‘Washington’ may mean the city or the state
  – ‘Cardinals’ might mean the football team, the baseball team, or the bird
  – Airlines have multiple hubs
• We extend our algorithm to locate multiple centers, each with its own C and α
  – Locations use the highest probability from any center
  – To optimize:
    • Start with K random centers, optimize with 1-center algorithm
    • Assign each point to the center giving it highest probability
    • Re-optimize each center for only the points assigned to it
[Figure: United Airlines]

Spheres of Influence
• Each baseball team assigned a color
• A team with N queries in a cell gets NC votes for its color
• Map generated by taking weighted average of colors

Distinctive Queries
• For each term and location
  – Find baseline rate p of term over entire map
  – Location has t total queries, s of them with term
  – Probability given baseline rate is: $p^{s}(1 - p)^{t - s}$
• For each location, we find the highest deviation from the baseline rate, as measured by the baseline probability
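
To make the single-center fit from the Probabilistic Model and Algorithm slides concrete, here is a minimal Python sketch. It assumes the queries have already been aggregated into cells, each with a total query count and a count of queries containing the term. The function names, the `min_dist` floor on distance, and the plain grid search over C and α are illustrative assumptions; the slides only say that a search algorithm finds the optimum for each candidate center.

```python
import math

def log_likelihood(cells, C, alpha):
    """Log-likelihood of the counts under p = C * d**(-alpha).

    cells: list of (d, n, k), where d is the distance from the candidate
    center, n the total queries in the cell, k the queries containing the term.
    """
    total = 0.0
    for d, n, k in cells:
        # Clamp so log() is always defined (an assumption of this sketch).
        p = min(max(C * d ** (-alpha), 1e-12), 1.0 - 1e-12)
        total += k * math.log(p) + (n - k) * math.log(1.0 - p)
    return total

def fit_for_center(cells, C_grid, alpha_grid):
    """Best (log-likelihood, C, alpha) on a small grid for one fixed center."""
    best = None
    for C in C_grid:
        for alpha in alpha_grid:
            ll = log_likelihood(cells, C, alpha)
            if best is None or ll > best[0]:
                best = (ll, C, alpha)
    return best

def fit_term(points, centers, C_grid, alpha_grid, min_dist=1.0):
    """Try every candidate center on a mesh and keep the most likely fit.

    points: list of (x, y, n, k) in planar coordinates (hypothetical layout).
    min_dist keeps d**(-alpha) finite at the center itself.
    """
    best = None
    for cx, cy in centers:
        cells = [(max(math.hypot(x - cx, y - cy), min_dist), n, k)
                 for x, y, n, k in points]
        ll, C, alpha = fit_for_center(cells, C_grid, alpha_grid)
        if best is None or ll > best[0]:
            best = (ll, (cx, cy), C, alpha)
    return best
```

A finer mesh could then be searched by calling `fit_term` again with candidate centers placed around the best coarse center.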

Geo-properties of Friendship
• Using Facebook data, we explore the geographic properties of social networks
  – How well do link distance distributions match existing theories?
  – How does it depend on where people live (urban vs. rural)?
• Expanding beyond user-provided addresses
  – Can we use what we’ve learned about geography to predict unknown location from sparse geo-data?
  – How can network data improve location data?

Related Work - Routing
• Long history of link between friendship and geography
  – Milgram experiment (1967) showed short paths between people
  – Watts and Strogatz (1998) showed theoretical models of how this might work
  – Kleinberg (1999) showed intuitive geographic properties could lead to small world
  – Liben-Nowell et al. (2005) showed some of these properties on LiveJournal network
    • Lattice network with random links distributed by distance

US FB Population
• We have two ways to determine location
  – From IP Address
    • Lots of inaccurate, bad data
  – From user provided address
    • Also lots of inaccurate, bad data
    • But, more precise when correct
• Using user provided data, we can map FB usage
  – 3 Million US users provided this data
  – More interesting when compared to US Population
    • Surprisingly, the Midwest is highest
[Figure: FB Density]
[Figure: FB Density, normalized by Population]

User Data
• User provided addresses mapped to lat,long using TIGER/Line data from US Census Bureau
  – Only addresses that can be mapped to real street addresses used

Population Density
• Consider each 0.01x0.01 degree (~0.36 sq. miles) region in the US
  – How many FB users are there in each region?
  – Distribution shows two distinct behaviors on log-log plot
    • Rural Region, slow falloff
    • Urban Region, fast falloff
    • Transition at 560K ft² / FB user or about 5,600 ft² / person

Population Density
• We divide the US into three regions
  – Low, medium and high density
  – 1/3 of users in each
• For each person, we count people in annulus of width 0.1 mile and radius r
  – High density regions have more people nearby
  – Curves converge at 50 miles
  – High density curve goes down as we transition urban to rural

Where our friends live
• Mostly nearby, but a significant number far away too
  – 20% within 2 miles
  – 50% within 12 miles
  – 20% over 100 miles
• Population density plays a role

  Distance (miles) within which the given fraction of friends live
  Fraction   Low Density   Medium Density   High Density
  20%        3.5           2.1              1.6
  50%        14.8          12.7             11.9
  80%        83.6          97.1             137.1

Probability of Friendship
• What is the probability that you know someone x miles away?
  – Count the number you know x miles away
  – Count the total number of people x miles away
  – Divide and aggregate over all users
• Tail of curve is roughly $x^{-1}$
• Breaking down by density
  – Higher density means lower probabilities at lower distances
    • Larger denominator
  – At ~100 miles all curves converge
  – Beyond 100 miles, urban users have higher probability
    • More cosmopolitan?

Using Friends to Geolocate
• For some people (about 3%) we have precise user-provided locations
• How well can we infer the locations of the rest based on friendship?
  – We observed the probability of friendship as a function of distance (about $x^{-1}$)

Location Inference Problem
• Problem Statement:
  – Input: user u, with friends $\{v_1, v_2, \ldots, v_k\}$ where $v_i$ is located at location $p_i$
  – Output: an estimate for the location of u
• Friendship probabilities suggest generative model
  – Assume that users u and v are friends with probability f(distance(u,v))
  – f(.) learned from observations of users with known geography, about $x^{-1}$
• Pick location to maximize likelihood

Pinpointing Location
• Given f(.) can easily compute likelihood of location L
  – $\prod_{i} f(\mathrm{distance}(L, p_i)) \cdot \prod_{\sim i} \bigl(1 - f(\mathrm{distance}(L, p_i))\bigr)$
    • First term pulls towards friends
    • Second term pulls towards sparsely populated regions
  – {i} is the set of friends, {~i} is the set of nonfriends
• Naively this is expensive
  – Many choices for L
  – {~i} is huge
[Figure panels: Friends; Likelihood Surface]

Pinpointing Location
• Simple optimizations make this practical
  – Product over all non-friends can be pre-computed one time
  – Optima are (almost) always colocated with a friend
• Algorithm
  – For all L, precompute $\prod_{\text{users}} \bigl(1 - f(\mathrm{distance}(L, p_i))\bigr)$
  – For each friend, compute likelihood at friend’s location
  – Pick maximum
[Figure: Likelihood Surface]

Prediction Performance
• Performance evaluated at x miles
  – What fraction of users are placed within x miles of their true location
  – True location comes from user-provided addresses
• Performance of IP derived locations
  – 57% within 25 miles
  – 76% within 100 miles
  – Performance on new users/recently updated addresses not much better
    • Measure for people with new addresses in last 90 days
    • Suggests errors come from IP-Location, not people moving

Prediction Performance
• A simple network-based baseline comes from placing a person at the location of a random friend
  – Does better than IP at short distances (IP resolution is rarely better than city-level)
  – Much worse overall
• Our algorithm is better at shorter distances
  – 61% (vs. 57%) at 25 mi.
• It is worse at greater distances
  – 71% (vs. 76%) at 100 mi.

Prediction Performance
• We predict more accurately for people with more friends
  – Median error with 5 located friends is 16 miles
  – With 16 friends, it is 6.8 miles
• Error on users with 16 or more located friends is better at all distances
  – 69% (vs. 57%) at 25 miles
  – 78% (vs. 76%) at 100 miles

Hybrid Algorithm
• Results suggest hybrid algorithm
  – Fewer than 5 friends with user-provided location, use IP-Location
  – Otherwise use friend-based location
  – Strict improvement over IP on this metric

Using the Network Further
• Use prediction for u to improve prediction for u’s friends
  – Ideally, maximize global likelihood
    • Hard problem, can’t be easily solved
  – Repeat our algorithm multiple times
    • Run once to get first approximations
    • Subsequently, optimize based on known and approximated locations

Testing Iterative Method
• To test, remove locations from ½ of users
• Predict locations for those users based on other ½
• Performance curve shows improvement
  – Second pass significantly better
  – Third pass shows little improvement

Facebook Checkins
• Most popular places to checkin
  – Stadium
  – Airport
  – Fisherman’s wharf
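
The friend-based estimate from the Pinpointing Location slides can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions: `friend_prob` is a hypothetical stand-in for the learned f(.) (the slides only say it is about $x^{-1}$), positions are planar (x, y) coordinates in miles rather than latitude and longitude, and the likelihood is evaluated in log space by rewriting the product as (product over all located users of 1 - f) times (product over friends of f / (1 - f)), which is one way to reuse a precomputed all-users term.

```python
import math

def friend_prob(d, c=0.002, d0=1.0):
    """Hypothetical f(d): about proportional to 1/d for large d, finite at d = 0."""
    return c / (d + d0)

def dist(a, b):
    """Planar distance between two (x, y) points (an assumption of this sketch)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def base_log_likelihood(L, located_users):
    """Sum over all located users of log(1 - f(distance(L, p))).

    In a real system this would be precomputed once per candidate location.
    """
    return sum(math.log(1.0 - friend_prob(dist(L, p))) for p in located_users)

def predict_location(friend_locs, located_users):
    """Evaluate the likelihood only at friends' locations and pick the maximum.

    located_users must contain every user with a known location,
    including the friends themselves.
    """
    best_loc, best_score = None, -math.inf
    for L in friend_locs:
        score = base_log_likelihood(L, located_users)
        for p in friend_locs:
            f = friend_prob(dist(L, p))
            # Swap this friend's (1 - f) factor for an f factor.
            score += math.log(f) - math.log(1.0 - f)
        if score > best_score:
            best_loc, best_score = L, score
    return best_loc
```

Restricting the candidates to friends' locations reflects the observation on the slides that the optimum is almost always colocated with a friend; it turns a search over the whole map into one evaluation per friend.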

Conclusions
• Large-scale study of social geography confirms some observations made on small scales
  – Most friends live within about 10 miles
  – People in urban areas have more long-range ties
• Also, lends further credence to geographic explanations of social routing
  – Kleinberg model links to O(x) nodes at distance x with prob. $1/x^2$ each
    • Cumulative distribution of friends is C*log(x)

Conclusions
• Using known locations from just a few people, we can bootstrap approximations for the rest
  – Allows us to serve better local content to users
• Future work
  – Use edge creation times (more weight on new edges)
  – Explore different meanings of distance
    • Is the social distance of NY-LA more than NY-Rapid City, SD?

Questions ...
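
The C*log(x) form quoted on the first Conclusions slide follows from a short calculation. The sketch below assumes a two-dimensional lattice with about $c_1 r$ nodes at distance $r$ and a link probability of $c_2 / r^2$ for each of them; the constants $c_1$ and $c_2$ are introduced here only for illustration.

```latex
\[
\mathbb{E}\bigl[\text{friends within distance } x\bigr]
\;\approx\; \sum_{r=1}^{x}
  \underbrace{c_1\, r}_{\text{nodes at distance } r}
  \cdot
  \underbrace{\frac{c_2}{r^{2}}}_{\text{link probability}}
\;=\; c_1 c_2 \sum_{r=1}^{x} \frac{1}{r}
\;\approx\; C \log x,
\qquad C = c_1 c_2 .
\]
```

Each distance scale contributes about the same expected number of friends, which is why the cumulative count grows logarithmically.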