DATA EXPLORATION AND PRIVACY
PRESERVATION OVER HIDDEN WEB
DATABASES
Nan Zhang, The George Washington University
*Collaborative work with Xin Jin of George Washington University;
Arjun Dasgupta, Bradley Jewell, Anirban Maiti, and Dr. Gautam Das
of University of Texas at Arlington; and Dr. Surajit Chaudhuri of
Microsoft Research.

OUTLINE
- Introduction
- Unbiased Aggregate Estimation
- Protecting Sensitive Aggregates
- Conclusion

THE DEEP WEB
- Deep Web vs. Surface Web
  - Dynamic content, unlinked pages, private web, contextual web, etc.
  - Estimated size [1]: 91,850 vs. 167 terabytes
[1] SIMS, UC Berkeley, "How Much Information? 2003,"
http://www.sims.berkeley.edu/research/projects/how-much-info-2003/

HIDDEN DATABASES: USED CAR INVENTORY
- Form-like interface
- Returns top-k tuples

SEARCH QUERIES VS. AGGREGATE QUERIES
- Search queries
  - SELECT * FROM D WHERE ac1 = vc1 AND ... AND acu = vcu
  - e.g., list 2006 Ford F150s with 4WD and 5.4L engine in Cargiant's inventory
  - Answered by the hidden database with a top-k restriction:
    overflowing (> k), valid (1..k), and underflowing (0 tuples) queries
- Aggregate queries
  - SELECT AGGR(*) FROM D WHERE ac1 = vc1 AND ... AND acu = vcu
  - e.g., how many vehicles in Cargiant's inventory have MPG > 30?
  - Cannot be answered through the public web interface
[Figure: search queries reach the hidden database through the web
interface; aggregate queries cannot]

OUTLINE
- Introduction
- Unbiased Aggregate Estimation
- Protecting Sensitive Aggregates
- Conclusion

PROBLEM DEFINITION
Accurately estimate the size and other aggregates (with or
without selection conditions) of a hidden database
through its publicly available front end, in an efficient
manner.

PERFORMANCE AND ACCURACY MEASURES
- Performance
  - Queries executed on the web interface incur cost:
    - IP-based query limits
    - pay-per-query model
  - Goal: reduce the number of queries executed
- Accuracy
  - Estimation error (for an estimator θ~ of an aggregate θ):
    - Bias: E[θ~] − θ
    - Variance: E[(θ~ − E[θ~])^2]
  - Goal: reduce both bias and variance
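These two definitions can be checked empirically from repeated runs of an estimator. A minimal sketch with hypothetical sample values (the function name and the numbers are ours, not the paper's):

```python
def bias_and_variance(estimates, theta):
    """Empirical bias and variance of an estimator from repeated runs.

    bias = E[est] - theta
    var  = E[(est - E[est])^2]
    """
    mean = sum(estimates) / len(estimates)
    bias = mean - theta
    var = sum((e - mean) ** 2 for e in estimates) / len(estimates)
    return bias, var

# Four hypothetical runs of an estimator of a true aggregate theta = 6:
print(bias_and_variance([16, 0, 0, 16], 6))  # (2.0, 64.0)
```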
RUNNING EXAMPLE

D    A1 A2 A3 A4
t1    0  0  0  0
t2    0  0  0  1
t3    0  1  0  0
t4    0  1  0  1
t5    1  1  0  0
t6    1  1  1  1

- Hidden database with a top-1 constraint
- Queries can be of the form:
  - A1=0, A2=1
  - A1=0, A2=1, A3=0
  - etc.
- |Dom(A1)| = 2
- |Dom(.)| = 2^4 = 16
BASELINE 1: BRUTE-FORCE-SAMPLER
Technique:
- Specify a randomly generated query on all attributes
- Result: either a hit (exactly one tuple) or a miss (no tuple)
  - Hit: estimate is |Dom(.)|
  - Miss: estimate is 0
- Example queries over the running example: 0100 (hit, t3),
  1110 (miss), 1111 (hit, t6), 0110 (miss)
- Unbiased!

DRAWBACK:
INEFFICIENT
Dom(.) >>> database size
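The BRUTE-FORCE-SAMPLER above can be simulated in a few lines against the running-example database (the function name and trial count are ours):

```python
import random

# Running-example database: six Boolean tuples over A1..A4.
DB = {(0, 0, 0, 0), (0, 0, 0, 1), (0, 1, 0, 0),
      (0, 1, 0, 1), (1, 1, 0, 0), (1, 1, 1, 1)}

def brute_force_estimate(db, n_attrs, trials, rng=None):
    """Average of per-trial estimates: |Dom(.)| on a hit, 0 on a miss.

    Each fully specified query is drawn uniformly from Dom(.), so the
    expectation equals the true size m; the estimator is unbiased, but
    almost every trial misses when |Dom(.)| >> m.
    """
    rng = rng or random.Random(0)
    dom = 2 ** n_attrs
    total = 0
    for _ in range(trials):
        q = tuple(rng.randint(0, 1) for _ in range(n_attrs))
        total += dom if q in db else 0
    return total / trials

print(brute_force_estimate(DB, 4, 10_000))  # close to the true size m = 6
```

Even in this tiny example only 6 of 16 fully specified queries hit; with a realistic schema the hit probability m/|Dom(.)| is vanishingly small, which is the drawback the slide points out.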
BASELINE 2: HIDDEN-DB-SAMPLER
- Use previous techniques to generate a near-uniform
  random sample from the hidden database
- Capture/recapture to estimate the database size: draw two
  samples C1 and C2 from the backend hidden database and estimate

    m~ = (|C1| × |C2|) / |C1 ∩ C2|

- Problems
  - Bias
    - Unknown bias in the samples
    - Bias is also introduced by the estimation
  - Inefficient
    - At least an expected m^(1/2) queries are required!
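The capture/recapture estimate above (the classic Lincoln-Petersen formula) can be sketched as follows; the function name and the sample values are ours:

```python
def capture_recapture(sample1, sample2):
    """Estimate database size from two independent samples:
    m ~= |C1| * |C2| / |C1 ∩ C2|."""
    overlap = len(set(sample1) & set(sample2))
    if overlap == 0:
        raise ValueError("no recaptured tuples; cannot estimate")
    return len(sample1) * len(sample2) / overlap

# Two hypothetical 50-tuple samples from a database of ids 0..99,
# overlapping in 25 ids:
print(capture_recapture(range(0, 50), range(25, 75)))  # 100.0
```

Note how the expected overlap is Θ(|C1|·|C2|/m), which is why roughly √m samples are needed before any recapture occurs, matching the efficiency problem on the slide.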
SIMPLE-UNBIASED-ESTIMATOR
[Figure: the query tree over A1..A4 for the running-example database D,
with the tuples t1..t6 at the leaves]
- Example drill downs: q:(A1=0) → q:(A1=0 & A2=0) → ..., each branch
  taken with probability 1/2; one drill down reaches a valid node with
  p(q) = 1/16 and |q| = 1, another reaches an underflowing node with
  p(q) = 1/4 and |q| = 0
  (node states: overflow, valid, underflow)

Basic ideas:
- Use a random drill down (similar to the idea from HIDDEN-DB-SAMPLER)
- Continue the drill down until a valid or underflowing node is reached
- Size estimate: |q| / p(q)  (Horvitz-Thompson estimator)
- Unbiasedness of the estimator:

    E[ |q| / p(q) ] = Σ_{q ∈ Ω_TV} p(q) · |q| /
    p(q) = m

Drawback:
- High variance due to the underflowing nodes

Variance calculation (stopping-node probabilities 1/16, 1/8, 1/16, 1/8, 1/4, 1/8):
s^2 = 1/16·(16−6)^2 + 1/8·(8−6)^2 + 1/16·(16−6)^2 + 1/8·(8−6)^2
    + 1/4·(0−6)^2 + 1/8·(0−6)^2 = 27
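The random drill down and its Horvitz-Thompson estimate can be simulated against the running-example database. This is our own sketch; the exact set of stopping nodes (and hence the variance) depends on the tree layout, but the mean is the true size m = 6:

```python
import random

DB = [(0, 0, 0, 0), (0, 0, 0, 1), (0, 1, 0, 0),
      (0, 1, 0, 1), (1, 1, 0, 0), (1, 1, 1, 1)]

def drill_down_estimate(db, n_attrs, k=1, rng=random):
    """One random drill down: extend the query one attribute at a time,
    branching 0/1 with probability 1/2 each, until the query is valid
    or underflows; return the Horvitz-Thompson estimate |q| / p(q)."""
    sel = []
    p = 1.0
    for _ in range(n_attrs):
        sel.append(rng.randint(0, 1))
        p *= 0.5
        matches = sum(1 for t in db if t[:len(sel)] == tuple(sel))
        if matches <= k:          # valid (1..k) or underflow (0): stop
            return matches / p
    return matches / p

rng = random.Random(0)
runs = [drill_down_estimate(DB, 4, rng=rng) for _ in range(20_000)]
print(sum(runs) / len(runs))      # close to the true size m = 6
```

A single run is very noisy (underflows contribute estimates of 0, deep valid nodes contribute large values), which is exactly the high-variance drawback noted above.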
TECHNIQUES FOR VARIANCE REDUCTION
- Backtracking
- Weight adjustment
- Divide and conquer

BACKTRACKING FOR VARIANCE REDUCTION
[Figure: a drill down in which the underflowing branch is abandoned and
its sibling is taken with probability 1]
Basic idea:
- Reduce variance by backtracking from underflow
- Always reach a valid tuple at the end
- The probability calculations change to reflect this (a branch whose
  sibling is empty is taken with probability 1)
- UNBIASED
BACKTRACKING FOR VARIANCE REDUCTION
- Variance computed using the same number of queries (13)
- The comparison shows improvement

Estimator with BACKTRACKING:
s^2 = 1/8·(8−6)^2 + 1/4·(4−6)^2 + 1/8·(8−6)^2 + 1/4·(4−6)^2 = 3

SIMPLE-UNBIASED-ESTIMATOR with early termination:
s^2 = 1/16·(16−6)^2 + 1/8·(8−6)^2 + 1/4·(0−6)^2 + 1/8·(0−6)^2
    + 1/16·(16−6)^2 + 1/8·(8−6)^2 = 27
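A sketch of the backtracking variant (our own simulation): when one branch is empty, the sibling is taken with probability 1 and p(q) is adjusted accordingly, so every drill down ends at a valid node and the 0-valued estimates disappear:

```python
import random

DB = [(0, 0, 0, 0), (0, 0, 0, 1), (0, 1, 0, 0),
      (0, 1, 0, 1), (1, 1, 0, 0), (1, 1, 1, 1)]

def count(db, sel):
    return sum(1 for t in db if t[:len(sel)] == tuple(sel))

def backtracking_estimate(db, n_attrs, k=1, rng=random):
    sel, p = [], 1.0
    for i in range(n_attrs):
        c0, c1 = count(db, sel + [0]), count(db, sel + [1])
        if c0 and c1:                 # genuine random choice: factor 1/2
            v = rng.randint(0, 1)
            p *= 0.5
        else:                         # forced branch (sibling empty, so a
            v = 0 if c0 else 1        # backtrack would land here anyway):
        sel.append(v)                 # taken with probability 1
        c = c0 if v == 0 else c1
        if c <= k:                    # always a valid node here (c >= 1)
            return c / p
    return c / p

rng = random.Random(0)
runs = [backtracking_estimate(DB, 4, rng=rng) for _ in range(20_000)]
print(sum(runs) / len(runs))          # close to m = 6, with lower variance
```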
REDUCING VARIANCE BY WEIGHT ADJUSTMENT
- Basic idea
  - Follow a branch with probability proportional to the number of
    valid nodes under it (p(s1) > p(s2)), instead of a uniform split
    (p(s) = 1/2 for Boolean data, p(s) = 1/|Dom(.)| for categorical data)
- The weight of a branch is not directly available!
  - It can be approximated from historic queries
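A sketch of weight adjustment (our own code): here the exact counts stand in for the weights that would be approximated from historic queries. With these ideal weights p(q) = |q|/m, so the Horvitz-Thompson estimate |q|/p(q) equals m on every drill down, illustrating why better weights mean lower variance; any positive weights keep the estimator unbiased:

```python
import random

DB = [(0, 0, 0, 0), (0, 0, 0, 1), (0, 1, 0, 0),
      (0, 1, 0, 1), (1, 1, 0, 0), (1, 1, 1, 1)]

def count(db, sel):
    return sum(1 for t in db if t[:len(sel)] == tuple(sel))

def weighted_drill_down(db, n_attrs, k=1, rng=random):
    """Branch with probability proportional to the number of tuples
    under each child, accumulating the actual branch probability in p."""
    sel, p = [], 1.0
    for _ in range(n_attrs):
        c0, c1 = count(db, sel + [0]), count(db, sel + [1])
        w0 = c0 / (c0 + c1)
        v = 0 if rng.random() < w0 else 1
        p *= w0 if v == 0 else 1.0 - w0
        sel.append(v)
        c = c0 if v == 0 else c1
        if c <= k:
            return c / p
    return c / p

rng = random.Random(0)
print(weighted_drill_down(DB, 4, rng=rng))   # 6.0 (up to float rounding)
```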
REDUCING VARIANCE BY DIVIDE AND CONQUER
- Deep dense nodes can increase the estimation variance
- Their impact on the estimation variance is high
- Recursive partitioning is applied by dividing the query tree
  into mutually exclusive subtrees
- This improves the chances of exploring deep subtrees that would
  otherwise have a small chance of being picked

ALGORITHM HD-UNBIASED(COUNT/AGG)
- Perform r random drill downs on the query tree with
  - Backtracking
  - Weight adjustment
  - Divide and conquer
- Extends to aggregate queries such as SUM
- Estimation for SUM queries:
  - Replace |q| / p(q) by SUM(q) / p(q)
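The SUM extension changes only the returned quantity. In this sketch of ours the aggregated attribute is A4 of the running example, whose true SUM is 3:

```python
import random

DB = [(0, 0, 0, 0), (0, 0, 0, 1), (0, 1, 0, 0),
      (0, 1, 0, 1), (1, 1, 0, 0), (1, 1, 1, 1)]

def drill_down_sum(db, n_attrs, value, k=1, rng=random):
    """Same random drill down; return SUM(q)/p(q) instead of |q|/p(q)."""
    sel, p = [], 1.0
    for _ in range(n_attrs):
        sel.append(rng.randint(0, 1))
        p *= 0.5
        matches = [t for t in db if t[:len(sel)] == tuple(sel)]
        if len(matches) <= k:
            return sum(value(t) for t in matches) / p
    return sum(value(t) for t in matches) / p

rng = random.Random(0)
runs = [drill_down_sum(DB, 4, value=lambda t: t[3], rng=rng)
        for _ in range(20_000)]
print(sum(runs) / len(runs))   # close to the true SUM over A4, which is 3
```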
EXPERIMENTAL RESULTS: MSE VS. QUERY COST
- Compared against the existing technique; shows the impact of
  variance reduction
- Dataset: Boolean IID and Mixed, with m = 200,000, n = 40, and k = 100
- C&R: capture and recapture with HIDDEN-DB-SAMPLER
- BOOL: COUNT(*) estimator without variance reduction
- HD: HD-UNBIASED-COUNT

EXPERIMENTAL RESULTS: PERFORMANCE AND ESTIMATION ACCURACY OF
HD-UNBIASED-COUNT/SUM

EXPERIMENTAL RESULTS
- Real-world categorical data simulation (offline) with 38
  attributes and ~200,000 tuples
- Real-world Yahoo! Autos (online) web data

OUTLINE
- Introduction
- Unbiased Aggregate Estimation
- Protecting Sensitive Aggregates
- Conclusion

SAMPLING TECHNIQUES FOR DISCOVERING AGGREGATES
[Figure: sampling techniques turn search queries issued through the web
interface into answers to aggregate queries over the hidden database]

PRIVACY CONCERNS OVER AGGREGATE INFORMATION
- Used car dealership (commercial competition)
  - What percentage of vehicles in Cargiant's inventory have MPG > 30?
  - Reason: if a competitor knows such aggregate information, it can
    take advantage of the low inventory through a multitude of tactics
    (e.g., stock those vehicles, adjust prices).
- Flight occupancy (homeland security)
  - Which flights, on what dates, are likely to be relatively empty?
  - Reason: in 9/11 and the Russian aircraft bombings of 2004, the
    terrorists' tactic is believed to have been to hijack relatively empty
    flights because there would be less resistance from occupants.
- Health IT (government policies)

PROTECTING SENSITIVE AGGREGATES OVER HIDDEN DATABASES
- A novel problem of protecting sensitive aggregates over
  hidden web databases:
  - reveal individual tuples truthfully and efficiently,
  - but hide aggregated views of the data
- Comparison with existing work
  - Most existing work focuses on protecting individual tuples
    while disclosing aggregates
    - Inverse to our problem
  - There exists work that recognizes the possible sensitivity of
    aggregate information [ABE+99, VEB+04, GV06]
    - However, its objective is not to reveal individual tuples
      truthfully while protecting the sensitive aggregates

PROBLEM STATEMENT
- The objective of aggregate suppression is to
  - make it very difficult for any adversary to obtain an accurate
    estimate of aggregates via the search interface, and
  - minimize the reduction of service quality for search queries
    issued by normal users.
- Normal users may include
  - human end users
  - third-party web mashup applications

OUTLINE OF TECHNICAL RESULTS
- COUNTER-SAMPLER
  - Makes it extremely difficult for an adversary to retrieve
    uniform random samples from the hidden database
    - Effective against both existing attacks: HIDDEN-DB-SAMPLER
      [DDM07] and HYBRID-SAMPLER [DZD09]
  - Ensures minimum disruption to end users and web mashup
    applications
- Two main components
  - Neighbor insertion
    - Makes it difficult for an adversary to retrieve one sample
  - High-level packing
    - Makes it difficult for an adversary to retrieve subsequent samples
      (i.e., prevents an adversary from reusing previously issued queries
      to improve sampling efficiency)

INAPPLICABILITY OF NAÏVE STRATEGIES (I)
- Audit the query history for each user, IP address, etc.
  - Problem: distributed attacks

INAPPLICABILITY OF NAÏVE STRATEGIES (II)
- CAPTCHA challenge before submitting or answering a
  query (used by seatcounter.com)
  - Problem: significant inconvenience to end users; completely
    disables third-party applications built on the public interface

INAPPLICABILITY OF NAÏVE STRATEGIES (III)
- Use machine-incomprehensible responses (e.g., images
  instead of text for attribute values)
  - Problem: overhead on the server side; disables third-party
    applications built on the public interface

OUR STRATEGY: INSERTION OF DUMMY TUPLES WITH CAPTCHA FLAGS
Make    Model   Year  Color
Ford    Fusion  2006  Red
Honda   Accord  2007  Blue
Toyota  Prius   2008  Green

- Insert dummy tuples, and associate each tuple with a
  CAPTCHA flag indicating whether it is real or dummy

JUSTIFICATION: WHY DUMMY TUPLES?
- Requirement of truthfully revealing individual tuples:
  - No change to existing tuple values
  - No removal of existing tuples
  - Only choice left: insert dummy tuples
- Key observation: all existing sampling attacks retrieve
  samples from valid queries only.
  - Overflowing queries are usually overlooked because their
    results (top-k tuples) are preferentially selected by a ranking
    function, and hence cannot be assumed to be random.
  - How about inserting dummy tuples to overflow valid queries?

JUSTIFICATION: WHY CAPTCHA FLAGS?
- CAPTCHA flags exploit the different requirements of a
  search user, a mashup application, and an adversary
  - A bona fide search user issues a small number of search queries,
    so manual interpretation of CAPTCHA flags is tolerable
  - A web mashup application can directly push the flag to its end
    users, thereby maintaining its usability
  - However, an adversary requires a relatively large number of
    queries for aggregate estimation, so CAPTCHA flags become a
    major deficiency for the adversary

TECHNICAL PROBLEM DEFINITION
- Problem definition: given the attacking-cost limit u_max
  and a set of sensitive aggregate queries, the objective of
  dummy insertion is to achieve an (ε, δ, p)-guarantee for
  each aggregate query while minimizing the number of
  inserted dummy tuples.
  - Attacking-cost limit u_max: the maximum number of search
    queries that an adversary can issue.
  - Why minimize the number of inserted dummy tuples?
    - Dummy tuples naturally degrade service quality
  - (ε, δ, p)-guarantee: a partial disclosure measure
    - Same spirit as the privacy-game notion [KMN05]

AN EXAMPLE
     A1 A2 A3
t1    0  0  1
t2    0  1  0
t3    0  1  1
[Figure: leaves 000..111 of the query tree; t1 at 001, t2 at 010,
t3 at 011]

HIDDEN-DB-SAMPLER
[Figure: the query tree over A1, A2, A3 with leaves 000..111]
- Drill down 1: (A1 = 0)? overflow (prob. 1/2); (A1=0 & A2=1)? overflow
  (prob. 1/2); (A1=0 & A2=1 & A3=0)? valid (prob. 1/2) → sample t2
- Drill down 2: (A1 = 0)? overflow (prob. 1/2); (A1=0 & A2=0)? valid
  (prob. 1/2) → sample t1
- Drill down 3: (A1 = 1)? underflow (prob. 1/2) → restart

NEIGHBOR INSERTION
[Figure: with no dummies, a short query is valid; with dummies
inserted, it overflows]

HIGH-LEVEL PACKING
[Figure: with no dummies, a short query underflows and the sampler
restarts; with dummies inserted, it overflows instead]

A GENERIC MODEL FOR SINGLE-SAMPLE TUPLE ATTACK
- Single-sample tuple attack: retrieve one uniform random sample tuple
- A generic single-sample tuple attack:
  - Find a valid query within the set of all possible queries
  - Attackers rely on INFERENCE!
[Figure: a grid of as-yet-unknown queries ("?"); e.g., Q1: SELECT *
FROM D WHERE a1 = 0 AND a2 = 0 AND a3 = 0 over a Boolean database]

KEY OBSERVATIONS FOR SINGLE-SAMPLE TUPLE ATTACK
- Any smart sampler should issue queries that maximize
  the shrinkage of the search space for valid queries.
  - Valid queries, as well as long overflowing queries, contribute
    the most to shrinking the space.
- Fortunately, long queries (both overflowing and valid) are
  difficult to find even before dummy insertion:
  - For a Boolean database, among the 2^c · C(n, c) c-predicate
    queries, the total number of valid and overflowing queries is
    at most m · C(n, c); thus the probability of choosing one is no
    more than m / 2^c, which is extremely small when c is large.
- Thus, short valid queries become the main threat for defense.

NEIGHBOR INSERTION
- b-neighbor insertion: add dummy tuples such that all
  valid queries with fewer than b predicates will overflow
  - Precisely what we discussed earlier in the case study
- Approach: insert dummy tuples into the "neighboring
  zone" of real tuples (i.e., sharing the same values on a
  large number of attributes).
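A simplified sketch of b-neighbor insertion, ours rather than the paper's exact algorithm: for every query with fewer than b predicates that is still valid, insert dummy "neighbors" (copies of a matching tuple with one non-queried attribute flipped) until the query overflows:

```python
from itertools import combinations, product

# Running-example Boolean database (4 attributes, top-1 interface).
REAL = [(0, 0, 0, 0), (0, 0, 0, 1), (0, 1, 0, 0),
        (0, 1, 0, 1), (1, 1, 0, 0), (1, 1, 1, 1)]

def count(tuples, pred):
    return sum(1 for t in tuples if all(t[i] == v for i, v in pred))

def b_neighbor_insertion(real, n_attrs, b, k):
    """For every query with fewer than b predicates that is valid
    (1..k matches), insert dummy neighbors of a matching tuple,
    i.e. copies with one non-queried attribute flipped, until the
    query overflows (> k matches)."""
    dummies = []
    for size in range(1, b):
        for attrs in combinations(range(n_attrs), size):
            for vals in product((0, 1), repeat=size):
                pred = list(zip(attrs, vals))
                while 1 <= count(real + dummies, pred) <= k:
                    donor = next(t for t in real + dummies
                                 if all(t[i] == v for i, v in pred))
                    flip = next(i for i in range(n_attrs) if i not in attrs)
                    d = list(donor)
                    d[flip] = 1 - d[flip]
                    dummies.append(tuple(d))
    return dummies

dummies = b_neighbor_insertion(REAL, 4, b=2, k=1)
# On this example only the query A3=1 is valid (t6 alone matches), so a
# single dummy neighbor of t6 suffices to make it overflow.
```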
REVISIT OF SHRINKING EFFECT FOR MULTI-SAMPLE TUPLE ATTACK
[Figure: the grid of unknown queries ("?") shrinking across repeated
samples]
Key observations:
- Shrinkage by underflow is permanent
- Shrinkage by overflow is mostly temporary
- Thus, short underflowing queries become a very dangerous threat

HIGH-LEVEL PACKING
- d-level packing: add dummy tuples such that all
  underflowing queries with fewer than d predicates will
  overflow
  - Precisely what we discussed earlier in the case study
- Approach: "pack" short underflowing queries with
  dummy tuples
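A simplified sketch of d-level packing, again ours: pack every short query with dummy tuples until it overflows, so that short underflows can no longer shrink the adversary's search space permanently. (This sketch packs all the way to overflow; the paper's COUNTER-SAMPLER packs only until valid and lets b-neighbor insertion overflow the result.)

```python
from itertools import combinations, product

REAL = [(0, 0, 0, 0), (0, 0, 0, 1), (0, 1, 0, 0),
        (0, 1, 0, 1), (1, 1, 0, 0), (1, 1, 1, 1)]

def count(tuples, pred):
    return sum(1 for t in tuples if all(t[i] == v for i, v in pred))

def d_level_packing(real, n_attrs, d, k):
    """Pack every query with fewer than d predicates with dummy tuples
    until it overflows (> k matches)."""
    dummies = []
    for size in range(1, d):
        for attrs in combinations(range(n_attrs), size):
            for vals in product((0, 1), repeat=size):
                spec = dict(zip(attrs, vals))
                pred = list(spec.items())
                while count(real + dummies, pred) <= k:
                    # A dummy matching the query: queried attributes are
                    # fixed, the rest filled with 0 (an arbitrary choice).
                    dummies.append(tuple(spec.get(i, 0)
                                         for i in range(n_attrs)))
    return dummies

dummies = d_level_packing(REAL, 4, d=3, k=1)
# Every query with fewer than 3 predicates now overflows the top-1 limit.
```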
COUNTER-SAMPLER
- Usually, d < b
  - Valid queries are more dangerous than underflows!
- Step 1: d-level packing
  - Pack until valid; the query will be overflowed by Step 2
- Step 2: b-neighbor insertion
- Time complexity:
  - O(C(n, d−1) · max(2^d, m) + C(n, b−1) · m)
PRIVACY GUARANTEE
- For a Boolean hidden database with m tuples, when all
  samplers have an attacking-cost limit u_max, for any
  COUNT query with answer in [x, y], the hidden database
  owner achieves an (ε, δ, 50%)-privacy guarantee if
  COUNTER-SAMPLER has been executed with parameters
  b and d which satisfy
  (a) d ≥ log_2(m) + 1 and 3^(d−1) · (d + 1) ≥ u_max, and
  (b) b ≥ d + 3 ε^2 u_max / (32 · min(x(m − x), y(m − y)) · (erf^(−1)(δ))^2),
  where erf^(−1)(·) is the inverse error function.
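Since the slide's conditions are partly garbled in this preview, the following is a hedged sketch based on our reconstruction of (a) and (b) of how an owner might compute the smallest admissible b and d. The function names and the example numbers are ours; the inverse error function is obtained by bisection on math.erf, as the Python standard library has no erfinv:

```python
import math

def erfinv(y):
    """Inverse error function for y in (0, 1), by bisection on math.erf."""
    lo, hi = 0.0, 6.0
    for _ in range(80):
        mid = (lo + hi) / 2
        if math.erf(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def countersampler_params(m, u_max, eps, delta, x, y):
    # Condition (a): d >= log2(m) + 1 and 3^(d-1) * (d+1) >= u_max.
    d = math.ceil(math.log2(m) + 1)
    while 3 ** (d - 1) * (d + 1) < u_max:
        d += 1
    # Condition (b), as reconstructed from the garbled slide:
    # b >= d + 3*eps^2*u_max / (32*min(x(m-x), y(m-y))*(erfinv(delta))^2).
    b = d + math.ceil(3 * eps ** 2 * u_max /
                      (32 * min(x * (m - x), y * (m - y))
                       * erfinv(delta) ** 2))
    return b, d

b, d = countersampler_params(m=200_000, u_max=10_000, eps=0.05,
                             delta=0.5, x=1_000, y=5_000)
```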
EXPERIMENTAL SETUP
- Datasets
  - Boolean 0.3: 100,000 tuples and 30 attributes; each attribute
    has probability p = 0.3 of being 1.
  - Boolean Mixed: 30 independent attributes; 5 have probability
    p = 0.5 of being 1, 10 have p = 0.3, and the other 10 have p = 0.1.
  - Categorical Census: 1990 US Census Adult data published in
    the UCI data mining archive [HB99]. Highest domain size:
    92 categories; lowest: Boolean.
- Attacking techniques
  - HIDDEN-DB-SAMPLER [DDM07]
  - HYBRID-SAMPLER [DZD09]

HIDDEN-DB-SAMPLER
- Boolean 0.3 and Mixed datasets: 22.6% and 24.7% dummy tuples

HYBRID-SAMPLER
- Census categorical dataset: 24.7% dummy tuples

OUTLINE
- Introduction
- Unbiased Aggregate Estimation
- Protecting Sensitive Aggregates
- Conclusion

AGGREGATE ESTIMATION
- We introduced UNBIASED techniques for estimating
  SIZE and AGGREGATES from hidden databases
- Techniques to lower the VARIANCE of such
  unbiased estimators were put in place
- HD-UNBIASED-COUNT/AGG: unbiased estimation
  of SIZE and AGGREGATES with LOW VARIANCE!

AGGREGATE PRESERVATION
- A novel problem:
  - reveal individual tuples truthfully and efficiently,
  - but hide aggregated views of the data
  - An urgent challenge for hidden database owners
- COUNTER-SAMPLER
  - The insertion of dummy tuples with CAPTCHA flags
  - Minimum disruption to end users and third-party web mashup
    applications
  - Provides a privacy guarantee against sampling attacks; increases
    by up to an order of magnitude the number of queries required
    by existing sampling techniques

A BROADER PICTURE
- Solution space for privacy-preserving strategies:
  - Backend hidden database: this paper (dummy tuple insertion)
  - Query processing module: future work
  - Frontend interface: future work

THANK YOU

ACKNOWLEDGEMENT
Partially supported by NSF grants 0852673,
0852674, 0845644, and 0915834, and a GWU
Research Enhancement Fund.