5 - DATA EXPLORATION AND PRIVACY PRESERVATION OVER HIDDEN...


DATA EXPLORATION AND PRIVACY PRESERVATION OVER HIDDEN WEB DATABASES
Nan Zhang, The George Washington University

Collaborative work with Xin Jin of George Washington University; Arjun Dasgupta, Bradley Jewell, Anirban Maiti, and Dr. Gautam Das of the University of Texas at Arlington; and Dr. Surajit Chaudhuri of Microsoft Research.

OUTLINE
• Introduction
• Unbiased Aggregate Estimation
• Protecting Sensitive Aggregates
• Conclusion

THE DEEP WEB
• Deep Web vs. Surface Web
  - Dynamic contents, unlinked pages, private web, contextual web, etc.
  - Estimated size [1]: 91,850 vs. 167 terabytes
[1] SIMS, UC Berkeley, How Much Information? 2003. http://www.sims.berkeley.edu/research/projects/how-much-info-2003/

HIDDEN DATABASES: USED CAR INVENTORY
• Form-like interface
• Return top-k tuples

SEARCH QUERIES VS. AGGREGATE QUERIES
• Search queries
  - SELECT * FROM D WHERE a_c1 = v_c1 & ··· & a_cu = v_cu
  - e.g., list 2006 Ford F-150 with 4WD and 5.4L engine in Cargiant's inventory
  - Answered by the hidden database with a top-k restriction
  - Overflowing (> k tuples), valid (1..k tuples), and underflowing (0 tuples) queries
• Aggregate queries
  - SELECT AGGR(*) FROM D WHERE a_c1 = v_c1 & ··· & a_cu = v_cu
  - e.g., How many vehicles in Cargiant's inventory have MPG > 30?
  - Cannot be answered through the public web interface
[Figure: search query, aggregate query, web interface, hidden database.]

PROBLEM DEFINITION
Accurately estimate size and other aggregates (with or without selection conditions) from a hidden database through its publicly available front end in an efficient manner.

PERFORMANCE AND ACCURACY MEASURES
• Performance
  - Queries executed on the web interface incur cost
    - IP-based query limits
    - Pay-per-query model
  - Reduce the number of queries executed
• Accuracy
  - Estimation error (for an estimator $\tilde{\theta}$ of an aggregate $\theta$)
    - Bias: $E[\tilde{\theta}] - \theta$
    - Variance: $E[(\tilde{\theta} - E[\tilde{\theta}])^2]$
  - Reduce bias and variance

RUNNING EXAMPLE
D    A1  A2  A3  A4
t1   0   0   0   0
t2   0   0   0   1
t3   0   1   0   0
t4   0   1   0   1
t5   1   1   0   0
t6   1   1   1   1
• Hidden database with top-1 constraint
• Queries can be of the form
  - A1=0, A2=1
  - A1=0, A2=1, A3=0
  - etc.
• |Dom(A1)| = 2
• |Dom(.)| = 2^4

BASELINE 1: BRUTE-FORCE-SAMPLER
Example queries on (A1, A2, A3, A4): 0100 is a hit, 1110 is a miss, 1111 is a hit.
[Figure: the 16 points of the domain, 0000 through 1111, with the positions of t1 to t6 marked.]
Technique
• Specify a randomly generated query on all attributes
• Result: either hit or miss
  - Hit: estimate is |Dom(.)|
  - Miss: estimate is 0
• Unbiased!
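
The hit-or-miss technique above can be sketched in a few lines of Python. This is only an illustrative sketch on the running example: the names `DATABASE`, `is_hit`, `brute_force_estimate`, and `exact_expectation` are hypothetical, and `is_hit` stands in for a fully specified query against the real web interface.

```python
import random

# Running example from the slides: six tuples over four Boolean attributes.
DATABASE = {
    (0, 0, 0, 0), (0, 0, 0, 1), (0, 1, 0, 0),
    (0, 1, 0, 1), (1, 1, 0, 0), (1, 1, 1, 1),
}
NUM_ATTRS = 4
DOMAIN_SIZE = 2 ** NUM_ATTRS  # |Dom(.)| = 16


def is_hit(query):
    """Hypothetical stand-in for the web interface: True if the fully
    specified query returns a tuple."""
    return tuple(query) in DATABASE


def brute_force_estimate(rng):
    """One trial: pick a random value for every attribute, then return
    |Dom(.)| on a hit and 0 on a miss."""
    query = tuple(rng.randint(0, 1) for _ in range(NUM_ATTRS))
    return DOMAIN_SIZE if is_hit(query) else 0


def exact_expectation():
    """Average the estimate over all 16 equally likely queries."""
    total = 0
    for code in range(DOMAIN_SIZE):
        query = tuple((code >> i) & 1 for i in range(NUM_ATTRS))
        total += DOMAIN_SIZE if is_hit(query) else 0
    return total / DOMAIN_SIZE


def average_estimate(trials, seed=0):
    """Monte Carlo average of repeated trials (seeded for repeatability)."""
    rng = random.Random(seed)
    return sum(brute_force_estimate(rng) for _ in range(trials)) / trials
```

Here `exact_expectation()` equals the true database size because each of the six tuples is hit with probability 1/16 and contributes 16. `average_estimate` only approaches that value slowly, since most trials on a sparse domain are misses.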
BASELINE 1: BRUTE-FORCE-SAMPLER (DRAWBACK)
Example query on (A1, A2, A3, A4): 0110 is a miss.
Drawback: inefficient, because |Dom(.)| >>> database size.

BASELINE 2: HIDDEN-DB-SAMPLER
• Use previous techniques to generate a near-uniform random sample from the hidden database
• Capture/recapture to estimate database size
  - Draw samples C1 and C2 from the back-end hidden DB
  - Estimation: $\tilde{m} = \frac{|C_1| \times |C_2|}{|C_1 \cap C_2|}$
• Problems
  - Bias
    - Unknown bias in samples
    - Bias is also introduced by estimation
  - Inefficient
    - At least an expected number of $m^{1/2}$ queries required!

SIMPLE-UNBIASED-ESTIMATOR
[Figure: query tree over A1, A2, A3, A4 for the running example database D.]

SIMPLE-UNBIASED-ESTIMATOR
[Figure: random drill-down through q:(A1=0) and q:(A1=0 & A2=0), each branch taken with probability 1/2, ending at a valid query with p(q) = 1/16 and |q| = 1; a second drill-down ends at an underflowing query with p(q) = 1/4 and |q| = 0. Legend: overflow, valid, underflow.]
Basic ideas
✓ Use random drill-down (similar to the idea from HIDDEN-DB-SAMPLER)
✓ Continue drilling down until a valid or underflowing query is reached
✓ Size estimate is $\frac{|q|}{p(q)}$ (Horvitz-Thompson estimator)
✓ Unbiasedness of the estimator: $E\left[\frac{|q|}{p(q)}\right] = \sum_{q \in \Omega_{TV}} p(q) \cdot \frac{|q|}{p(q)} = m$

SIMPLE-UNBIASED-ESTIMATOR
[Figure: query tree with leaf probabilities 1/4, 1/8, 1/8, 1/16, 1/8, 1/16. Legend: overflow, valid, underflow.]
Drawback
✓ High variance due to underflowing nodes
Variance calculation
$s^2 = \frac{1}{16}(16-6)^2 + \frac{1}{8}(8-6)^2 + \frac{1}{16}(16-6)^2 + \frac{1}{8}(8-6)^2 + \frac{1}{4}(0-6)^2 + \frac{1}{8}(0-6)^2 = 27$

TECHNIQUES FOR VARIANCE REDUCTION
• Backtracking
• Weight adjustment
• Divide and conquer

BACKTRACKING FOR VARIANCE REDUCTION
Basic idea
✓ Reduce variance by backtracking from underflow
✓ Always reach a valid tuple at the end
✓ Probability calculations change to reflect this
✓ UNBIASED

BACKTRACKING FOR VARIANCE REDUCTION
✓ Variance computed using the same number of queries (13)
✓ Comparison shows improvement
[Figure: query trees with branch probabilities for the estimator with backtracking and for SIMPLE-UNBIASED-ESTIMATOR with early termination.]
Variance calculation (with BACKTRACKING)
$s^2 = \frac{1}{8}(8-6)^2 + \frac{1}{4}(4-6)^2 + \frac{1}{8}(8-6)^2 + \frac{1}{4}(4-6)^2 = 3$
Variance calculation (SIMPLE-UNBIASED-ESTIMATOR)
$s^2 = \frac{1}{16}(16-6)^2 + \frac{1}{8}(8-6)^2 + \frac{1}{4}(0-6)^2 + \frac{1}{8}(0-6)^2 + \frac{1}{16}(16-6)^2 + \frac{1}{8}(8-6)^2 = 27$

REDUCING VARIANCE BY WEIGHT ADJUSTMENT
[Figure: two query trees with subtrees s1 and s2, one with p(s1) > p(s2) and one with p(s1) = p(s2); p(s) = 1/2 for Boolean data, p(s) = 1/|Dom(.)| for categorical data.]
• Basic idea
  - Follow a branch with probability proportional to the number of valid nodes under it
    - Weight of branch not available!
    - Weight of branch can be approximated from historic queries

REDUCING VARIANCE BY DIVIDE AND CONQUER
• Deep dense nodes can increase the estimation variance
  - Impact on estimation variance is high
• Recursive partitioning is applied by dividing the query tree into mutually exclusive subtrees
• Improves the chance of exploring deep subtrees that would otherwise have a small chance of being picked

ALGORITHM HD-UNBIASED-(COUNT/AGG)
• Perform r random drill-downs on the query tree with
  - Backtracking
  - Weight adjustment
  - Divide and conquer
• Extends to aggregate queries like SUM
• Estimation for SUM queries
  - Replace $\frac{|q|}{p(q)}$ by $\frac{\mathrm{SUM}(q)}{p(q)}$

EXPERIMENTAL RESULTS: MSE VS. QUERY COST
[Figures: existing technique; impact of variance reduction.]
Dataset: Boolean IID and Mixed, with m = 200,000, n = 40, and k = 100
C&R: capture and recapture with HIDDEN-DB-SAMPLER
BOOL: COUNT(*) estimator without variance reduction
HD: HD-UNBIASED-COUNT

EXPERIMENTAL RESULTS: PERFORMANCE AND ESTIMATION ACCURACY OF HD-UNBIASED-COUNT/SUM

EXPERIMENTAL RESULTS
• Real-world categorical data simulation (OFFLINE) with 38 attributes and ~200,000 tuples
• Real-world Yahoo Autos web data (ONLINE)

SAMPLING TECHNIQUES FOR DISCOVERING AGGREGATES
[Figure: search query, web interface, hidden database; sampling techniques for aggregate query.]

PRIVACY CONCERNS OVER AGGREGATE INFORMATION
• Used car dealership (commercial competition)
  - What percentage of vehicles in Cargiant's inventory have MPG > 30?
  - Reason: if a competitor knows such aggregate information, it can take advantage of the low inventory through a multitude of tactics (e.g., stock those vehicles, adjust prices).
• Flight occupancy (homeland security)
  - Which flights, on what dates, are likely to be relatively empty?
  - Reason: in 9/11 and the Russian aircraft bombings of 2004, the terrorists' tactic is believed to have been to hijack relatively empty flights, because there would be less resistance from occupants.
• Health IT (government policies)

PROTECTING SENSITIVE AGGREGATES OVER HIDDEN DATABASES
• A novel problem of protecting sensitive aggregates over hidden web databases
  - Reveal individual tuples truthfully and efficiently
  - But hide aggregated views of the data
• Comparison with existing work
  - Most existing work focuses on protecting individual tuples while disclosing aggregates
    - Inverse to our problem
  - There exists work that recognizes the possible sensitivity of aggregate information [ABE+99, VEB+04, GV06]
    - However, its objective is not to reveal individual tuples truthfully while protecting the sensitive aggregates

PROBLEM STATEMENT
• The objective of aggregate suppression is to
  - make it very difficult for any adversary to obtain an accurate estimate of aggregates via the search interface, and
  - minimize the reduction of service quality for search queries issued by normal users.
• Normal users may include
  - human end users
  - third-party web mashup applications

OUTLINE OF TECHNICAL RESULTS
• COUNTER-SAMPLER
  - Make it extremely difficult for an adversary to retrieve uniform random samples from the hidden database.
    - Effective against both existing attacks: HIDDEN-DB-SAMPLER [DDM07] and HYBRID-SAMPLER [DZD09].
  - Ensure minimum disruption to end users and web mashup applications
• Two main components
  - Neighbor Insertion
    - Make it difficult for an adversary to retrieve one sample.
  - High-level Packing
    - Make it difficult for an adversary to retrieve subsequent samples (i.e., prevent an adversary from reusing previously issued queries to improve sampling efficiency).
INAPPLICABILITY OF NAÏVE STRATEGIES (I)
• Audit query history for each user, IP address, etc.
  - Problem: distributed attacks

INAPPLICABILITY OF NAÏVE STRATEGIES (II)
• CAPTCHA challenge before submitting or answering a query (used by seatcounter.com)
  - Problem: significant inconvenience to end users; completely disables third-party applications through the public interface

INAPPLICABILITY OF NAÏVE STRATEGIES (III)
• Use a machine-incomprehensible response (e.g., an image instead of text for attribute values)
  - Problem: overhead on the server side; disables third-party applications through the public interface

OUR STRATEGY: INSERTION OF DUMMY TUPLES WITH CAPTCHA FLAGS
Make    Model   Year  Color
Ford    Fusion  2006  Red
Honda   Accord  2007  Blue
Toyota  Prius   2008  Green
• Insert dummy tuples, and associate each tuple with a CAPTCHA flag indicating whether it is real or dummy

JUSTIFICATION: WHY DUMMY TUPLES?
• Requirement of truthfully revealing individual tuples
  - No change of existing tuple values
  - No removal of existing tuples
  - Only choice left: insert dummy tuples
• Key observation: all existing sampling attacks retrieve samples from valid queries only.
  - Overflowing queries are usually overlooked because their results (top-k tuples) are preferentially selected by a ranking function, and hence cannot be assumed to be random.
  - How about inserting dummy tuples to overflow valid queries?

JUSTIFICATION: WHY CAPTCHA FLAGS?
• CAPTCHA flags exploit the different requirements of a search user, a mashup application, and an adversary
• A bona fide search user issues a small number of search queries, so manual interpretation of CAPTCHA flags is tolerable
• A web mashup application can directly push the flag to its end users, thereby maintaining its usability
• However, an adversary requires a relatively large number of queries for aggregate estimation, so CAPTCHA flags become a major obstacle for the adversary.
TECHNICAL PROBLEM DEFINITION
• Problem definition: given the attacking-cost limit $u_{max}$ and a set of sensitive aggregate queries, the objective of dummy insertion is to achieve an (ε, δ, p)-guarantee for each aggregate query while minimizing the number of inserted dummy tuples.
  - Attacking-cost limit $u_{max}$: the maximum number of search queries that an adversary can issue.
  - Why minimize the number of inserted dummy tuples?
    - Dummy tuples naturally lead to degradation of service quality
  - (ε, δ, p)-guarantee: partial disclosure measure
    - Same spirit as the privacy-game notion [KMN05]

AN EXAMPLE
     A1  A2  A3
t1   0   0   1
t2   0   1   0
t3   0   1   1
[Figure: the eight points 000 through 111, with t1, t2, t3 marked at 001, 010, 011.]

HIDDEN-DB-SAMPLER
[Figure: query tree over A1, A2, A3 with leaves 000 through 111; t1, t2, t3 at 001, 010, 011.]

HIDDEN-DB-SAMPLER
Each branch is chosen with probability 1/2. Example drill-downs:
• (A1 = 0)? Overflow. (A1=0 & A2=1)? Overflow. (A1=0 & A2=1 & A3=0)? Valid.
• (A1 = 0)? Overflow. (A1=0 & A2=0)? Valid.
• (A1 = 1)? Underflow.

NEIGHBOR INSERTION
[Figure: no dummies vs. with dummies; labels: Valid, Overflow.]

HIGH-LEVEL PACKING
[Figure: no dummies vs. with dummies; labels: Underflow [Restart], Overflow [Get Sample].]

A GENERIC MODEL FOR SINGLE SAMPLE TUPLE ATTACK
• Single sample tuple attack: retrieve one uniform random sample tuple
• A generic single sample tuple attack
  - Finding a valid query from the set of all possible queries
  - Attackers rely on INFERENCE!
[Figure: grid of unknown ("?") queries; example Q1: SELECT * FROM D WHERE a1 = 0 AND a2 = 0 AND a3 = 0 over a Boolean database.]

KEY OBSERVATIONS FOR SINGLE SAMPLE TUPLE ATTACK
• Any smart sampler should issue queries that maximize the shrinkage of the search space for valid queries.
  - Valid queries as well as long overflowing queries contribute the most to shrinking the space.
• Fortunately, long queries (both overflowing and valid) are difficult to find even before dummy insertion:
  - For a Boolean database, among the $2^c \cdot \binom{n}{c}$ c-predicate queries, the total number of valid and overflowing queries is at most $m \cdot \binom{n}{c}$; thus the probability of choosing one is no more than $m/2^c$, which is extremely small when c is large.
• Thus, short valid queries become the main threat for defense.

NEIGHBOR INSERTION
• b-Neighbor Insertion: add dummy tuples such that all valid queries with fewer than b predicates will overflow
  - Precisely what we discussed earlier in the case study
• Approach: insert dummy tuples into the "neighboring zone" of real tuples (i.e., sharing the same values on a large number of attributes).

REVISIT OF SHRINKING EFFECT FOR MULTI SAMPLE TUPLE ATTACK
[Figure: grid of unknown ("?") queries.]
Key observations:
- Shrinkage by underflow is permanent
- Shrinkage by overflow is mostly temporary
- Thus, short underflowing queries become a very dangerous threat

HIGH-LEVEL PACKING
• d-Level Packing: add dummy tuples such that all underflowing queries with fewer than d predicates will overflow
  - Precisely what we discussed earlier in the case study
• Approach: "pack" short underflowing queries with dummy tuples

COUNTER-SAMPLER
• Usually, d < b
  - Valid queries are more dangerous than underflows!
• Step 1. d-level packing
  - Pack until valid; will be overflowed by Step 2
• Step 2. b-neighbor insertion
• Time complexity:
  - $O\left(\binom{n}{d-1} \cdot \max(2^d, m) + \binom{n}{b-1} \cdot m\right)$

PRIVACY GUARANTEE
• For a Boolean hidden database with m tuples, when all samplers have an attacking-cost limit $u_{max}$, for any COUNT query with answer in [x, y], the hidden database owner achieves an (ε, δ, 50%)-privacy guarantee if COUNTER-SAMPLER has been executed with parameters b and d which satisfy
  (a) $d \ge \log_2 m + 1$ and $3^{d-1}(d+1) \ge u_{max}$
  (b) $b \ge d + \frac{3 \varepsilon^2 u_{max}}{32 \min(x(m-x),\, y(m-y)) \, (\mathrm{erf}^{-1}(\delta))^2}$
  where $\mathrm{erf}^{-1}(\cdot)$ is the inverse error function.

EXPERIMENTAL SETUP
• Datasets
  - Boolean 0.3: 100,000 tuples and 30 attributes. Each attribute has probability p = 0.3 of being 1.
  - Boolean Mixed: 30 independent attributes; 5 have probability p = 0.5 of being 1, 10 have p = 0.3, the other 10 have p = 0.1.
  - Categorical Census: 1990 US Census Adult data published on the UCI Data Mining archive [HB99]. Highest domain size: 92 categories; lowest: Boolean.
• Attacking techniques
  - HIDDEN-DB-SAMPLER [DDM07]
  - HYBRID-SAMPLER [DZD09]

HIDDEN-DB-SAMPLER
[Figure: Boolean 0.3 and Mixed datasets; 22.6% and 24.7% dummy tuples.]

HYBRID-SAMPLER
[Figure: Census categorical dataset; 24.7% dummy tuples.]

AGGREGATE ESTIMATION
• We introduced UNBIASED techniques for estimating SIZE and AGGREGATES from hidden databases
• Techniques to lower the VARIANCE of such UNBIASED estimators were put in place
• HD-UNBIASED-COUNT/AGG for unbiased estimation of SIZE and AGGREGATES with LOW VARIANCE!
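
The random drill-down behind these unbiased estimators can be illustrated on the four-attribute running example from earlier in the slides. This is a minimal sketch of the basic estimator only, without backtracking, weight adjustment, or divide and conquer. It assumes a fixed attribute order A1..A4, a fair coin at each branch, and distinct tuples; `query`, `drill_down`, and `terminal_queries` are hypothetical names, and `query` simulates the top-k web interface locally.

```python
import random
from fractions import Fraction

# Running example: six distinct tuples over A1..A4, top-1 interface.
DATABASE = [
    (0, 0, 0, 0), (0, 0, 0, 1), (0, 1, 0, 0),
    (0, 1, 0, 1), (1, 1, 0, 0), (1, 1, 1, 1),
]
K = 1  # top-k limit of the simulated interface


def query(prefix):
    """Simulated interface for the conjunctive query A1..Aj = prefix.
    Returns (status, size); size is only known when not overflowing."""
    matches = [t for t in DATABASE if t[:len(prefix)] == prefix]
    if len(matches) > K:
        return "overflow", None
    if matches:
        return "valid", len(matches)
    return "underflow", 0


def drill_down(rng):
    """One random drill-down: add one predicate at a time until the query
    stops overflowing, then return the estimate |q| / p(q)."""
    prefix, prob = (), Fraction(1)
    status, size = query(prefix)
    while status == "overflow":
        prefix += (rng.randint(0, 1),)  # assumed: each branch with prob. 1/2
        prob /= 2
        status, size = query(prefix)
    return size / prob


def terminal_queries(prefix=(), prob=Fraction(1)):
    """Enumerate every valid or underflowing query a drill-down can end at,
    yielding (estimate, probability of reaching it)."""
    status, size = query(prefix)
    if status != "overflow":
        yield size / prob, prob
        return
    for bit in (0, 1):
        yield from terminal_queries(prefix + (bit,), prob / 2)


def exact_mean_and_variance():
    """Exact mean and variance of a single drill-down estimate."""
    leaves = list(terminal_queries())
    mean = sum(est * p for est, p in leaves)
    variance = sum(p * (est - mean) ** 2 for est, p in leaves)
    return mean, variance
```

On this tree the exact mean of one drill-down equals the true size of 6, which is the unbiasedness property. The variance depends on the attribute order and on where underflows occur, so it need not match the value computed from the tree drawn in the slides.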
AGGREGATE PRESERVATION
• A novel problem
  - Reveal individual tuples truthfully and efficiently
  - But hide aggregated views of the data
  - An urgent challenge for hidden database owners
• COUNTER-SAMPLER
  - The insertion of dummy tuples with CAPTCHA flags
  - Minimum disruption to end users and third-party web mashup applications
  - Provides a privacy guarantee against sampling attacks. Increases by up to an order of magnitude the number of queries required by existing sampling techniques.

A BROADER PICTURE
• Solution space for privacy-preserving strategies
  - Back-end hidden database: this paper (dummy tuple insertion)
  - Query processing module (future work)
  - Front-end interface (future work)

THANK YOU

Acknowledgement
Partially supported by NSF grants 0852673, 0852674, 0845644 and 0915834 and a GWU Research Enhancement Fund.

This note was uploaded on 02/08/2012 for the course CSCI 6907 taught by Professor Zhang during the Spring '11 term at GWU.
