ccs03-0 - Privacy Cognizant Information Systems...


Privacy Cognizant Information Systems
Rakesh Agrawal, IBM Almaden Research Center
Joint work with Srikant, Kiernan, Xu & Evfimievski

Thesis
There is an increasing need to build information systems that
– protect the privacy and ownership of information
– do not impede the flow of information
Cross-fertilization of ideas from the security and database research communities can lead to the development of innovative solutions.

Outline
– Motivation
– Privacy Preserving Data Mining
– Privacy Aware Data Management
– Information Sharing Across Private Databases
– Conclusions

Drivers
Policies and Legislations
– U.S. and international regulations
– Legal proceedings against businesses
Consumer Concerns
– “Consumer privacy apprehensions continue to plague the Web … these fears will hold back roughly $15 billion in eCommerce revenue.” Forrester Research, 2001
– Most consumers are “privacy pragmatists.” Westin Surveys
Moral Imperative
– The right to privacy: the most cherished of human freedoms -- Warren & Brandeis, 1890

Data Mining and Privacy
The primary task in data mining:
– development of models about aggregated data.
Can we develop accurate models, while protecting the privacy of individual records?

Setting
Application scenario: A central server interested in building a data mining model using data obtained from a large number of clients, while preserving their privacy
– Web-commerce, e.g. recommendation service
Desiderata:
– Must not slow down the speed of client interaction
– Must scale to very large number of clients
During the application phase
– Ship model to the clients
– Use oblivious computations

World Today
[Figure: Alice (35, 95,000, J.S. Bach, painting, nasa), Bob (45, 60,000, B. Spears, baseball, cnn) and Chris (42, 85,000, B. Marley, camping, microsoft) send their true records to the Recommendation Service, where the Mining Algorithm builds the Data Mining Model.]

New Order: Randomization to Protect Privacy
[Figure: each client randomizes its record before sending it to the Recommendation Service. Alice's age 35 becomes 50 (35+15). The records received are (50, 65,000, Metallica, painting, nasa) from Alice, (38, 90,000, B. Spears, soccer, fox) from Bob and (32, 55,000, B. Marley, camping, linuxware) from Chris.]
– Randomization techniques differ for numeric and categorical data
– Each attribute randomized independently
– Per-record randomization without considering other records
– Randomization parameters common across users
– True values never leave the user!

New Order: Randomization Protects Privacy
[Figure: the Recommendation Service applies Recovery to the randomized records, then the Mining Algorithm builds the Data Mining Model.]
– Recovery of distributions, not individual records

Reconstruction Problem (Numeric Data)
Original values $x_1, x_2, \ldots, x_n$
– from probability distribution X (unknown)
To hide these values, we use $y_1, y_2, \ldots, y_n$
– from probability distribution Y
Given
– $x_1+y_1, x_2+y_2, \ldots, x_n+y_n$
– the probability distribution of Y
Estimate the probability distribution of X.

Reconstruction Algorithm
$f_X^{0}$ := Uniform distribution
j := 0
repeat
    $f_X^{j+1}(a) := \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y((x_i + y_i) - a)\, f_X^{j}(a)}{\int_{-\infty}^{\infty} f_Y((x_i + y_i) - z)\, f_X^{j}(z)\, dz}$    (Bayes' Rule)
    j := j+1
until (stopping criterion met)
(R. Agrawal & R. Srikant, SIGMOD 2000)
Converges to maximum likelihood estimate.
– D. Agrawal & C.C. Aggarwal, PODS 2001.

Works Well
[Figure: Number of People (0 to 1200) vs. Age (20 to 60) for the Original, Randomized and Reconstructed distributions.]

Decision Tree Example
Age   Salary   Repeat Visitor?
23    50K      Repeat
17    30K      Repeat
43    40K      Repeat
68    50K      Single
32    70K      Single
20    20K      Repeat

Age < 25?
– Yes: Repeat
– No: Salary < 50K?
    – Yes: Repeat
    – No: Single

Algorithms
Global
– Reconstruct for each attribute once at the beginning
By Class
– For each attribute, first split by class, then reconstruct separately for each class.
Local
– Reconstruct at each node
See SIGMOD 2000 paper for details.
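The Reconstruction Algorithm slide can be made concrete with a small discretized sketch. It is an illustration under stated assumptions, not the authors' implementation: the distribution of X is estimated as probability mass on a fixed grid of candidate values, the noise is taken to be zero-mean Gaussian, and a fixed iteration count replaces the unspecified stopping criterion. The names `gaussian_pdf`, `reconstruct`, `grid` and `iterations` are hypothetical.

```python
import math
import random


def gaussian_pdf(sigma):
    """Return the density f_Y of zero-mean Gaussian noise (an assumed noise model)."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return lambda d: norm * math.exp(-(d * d) / (2.0 * sigma * sigma))


def reconstruct(randomized, grid, f_y, iterations=50):
    """Estimate the probability mass of X on `grid` from the values w_i = x_i + y_i."""
    p = [1.0 / len(grid)] * len(grid)          # f_X^0 := uniform distribution
    for _ in range(iterations):                # fixed count stands in for the stopping criterion
        new_p = [0.0] * len(grid)
        for w in randomized:
            # Bayes' rule: posterior over grid points a, given the observed w
            weights = [f_y(w - a) * pa for a, pa in zip(grid, p)]
            total = sum(weights)
            if total == 0.0:
                continue                       # w has zero likelihood everywhere on the grid
            for k, wk in enumerate(weights):
                new_p[k] += wk / total
        mass = sum(new_p)
        if mass == 0.0:
            break
        # Average of the posteriors; equals the 1/n factor when no observation was skipped
        p = [v / mass for v in new_p]
    return p


if __name__ == "__main__":
    random.seed(0)
    sigma = 15.0
    ages = [random.choice([25, 30, 35, 55, 60]) for _ in range(500)]
    randomized = [x + random.gauss(0.0, sigma) for x in ages]   # only these leave the client
    grid = list(range(20, 65, 5))
    for a, mass in zip(grid, reconstruct(randomized, grid, gaussian_pdf(sigma))):
        print(a, round(mass, 3))
```

The server only ever sees `randomized` and the noise density, which matches the slide's setting: it recovers a distribution over ages, not any individual's age.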
Experimental Methodology
Compare accuracy against
– Original: unperturbed data without randomization.
– Randomized: perturbed data but without making any corrections for randomization.
Test data not randomized.
Synthetic benchmark from [AGI+92].
Training set of 100,000 records, split equally between the two classes.

Decision Tree Experiments
[Figure: Accuracy (50 to 100) of Original, Randomized and Reconstructed on Fn 1 to Fn 5 at 100% Randomization Level.]

Accuracy vs. Randomization
[Figure: Accuracy (40 to 100) on Fn 3 vs. Randomization Level (10 to 200) for Original, Randomized and Reconstructed.]

More on Randomization
Privacy-Preserving Association Rule Mining Over Categorical Data
– Rizvi & Haritsa [VLDB 02]
– Evfimievski, Srikant, Agrawal, & Gehrke [KDD-02]
Privacy Breach Control: Probabilistic limits on what one can infer with access to the randomized data as well as mining results
– Evfimievski, Srikant, Agrawal, & Gehrke [KDD-02]
– Evfimievski, Gehrke & Srikant [PODS-03]

Related Work: Private Distributed ID3
How to build a decision-tree classifier on the union of two private databases (Lindell & Pinkas [Crypto 2000])
Basic Idea:
– Find attribute with highest information gain privately
– Independently split on this attribute and recurse
Selecting the Split Attribute
– Given v1 known to DB1 and v2 known to DB2, compute (v1 + v2) log (v1 + v2) and output random shares of the answer
– Given random shares, use Yao's protocol [FOCS 84] to compute information gain.
Trade-off
+ Accuracy
– Performance & scaling

Related Work: Purdue Toolkit
Partitioned databases (horizontally + vertically)
Secure Building Blocks
Algorithms (using building blocks):
– Association rules
– EM Clustering
C. Clifton et al. Tools for Privacy Preserving Data Mining. SIGKDD Explorations 2003.
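As a side note on the Private Distributed ID3 slides above, the following sketch shows why (v1 + v2) log (v1 + v2) is the quantity the two databases need: entropy, and therefore information gain, can be written purely in terms of x log x applied to summed counts. The sketch works entirely in the clear and contains no cryptography, so it is not the Lindell & Pinkas protocol. The function names and the `counts_dbK[value][cls]` layout are hypothetical, and both databases are assumed to list the same attribute values.

```python
import math


def xlogx(x):
    """x * log2(x), with the usual convention 0 * log 0 = 0."""
    return x * math.log2(x) if x > 0 else 0.0


def entropy_from_counts(counts):
    """Entropy of a class-count vector, written only in terms of x log x."""
    n = sum(counts)
    if n == 0:
        return 0.0
    # H = log n - (1/n) * sum(c log c) = (n log n - sum(c log c)) / n
    return (xlogx(n) - sum(xlogx(c) for c in counts)) / n


def information_gain(counts_db1, counts_db2):
    """Gain of one attribute; counts_dbK[value][cls] are the counts held by database K."""
    merged = {
        value: [a + b for a, b in zip(counts_db1[value], counts_db2[value])]
        for value in counts_db1
    }
    class_totals = [sum(col) for col in zip(*merged.values())]
    n = sum(class_totals)
    remainder = sum(sum(c) / n * entropy_from_counts(c) for c in merged.values())
    return entropy_from_counts(class_totals) - remainder
```

Every call to `xlogx` takes a sum of one count from each database, which is exactly the v1 + v2 of the slide; a private protocol would replace those calls with a computation on shares.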
Related Work: Statistical Databases
Provide statistical information without compromising sensitive information about individuals (AW89, Sho82)
Techniques
– Query Restriction
– Data Perturbation
Negative Results: cannot give high quality statistics and simultaneously prevent partial disclosure of individual information [AW89]

Summary
Promising technical direction & results
Much more needs to be done, e.g.
– Trade off between the amount of privacy breach and performance
– Examination of other approaches (e.g. randomization based on swapping)

Hippocratic Databases
Hippocratic Oath, 8 (circa 400 BC)
– What I may see or hear in the course of treatment … I will keep to myself.
What if the database systems were to embrace the Hippocratic Oath?
Architecture derived from privacy legislations.
– US (FIPA, 1974), Europe (OECD, 1980), Canada (1995), Australia (2000), Japan (2003)
Agrawal, Kiernan, Srikant & Xu: VLDB 2002.
Architectural Principles
– Purpose Specification: Associate with data the purposes for collection
– Consent: Obtain donor’s consent on the purposes
– Limited Collection: Collect minimum necessary data
– Limited Use: Run only queries that are consistent with the purposes
– Limited Disclosure: Do not release data without donor’s consent
– Limited Retention: Do not retain data beyond necessary
– Accuracy: Keep data accurate and up-to-date
– Safety: Protect against theft and other misappropriations
– Openness: Allow donor access to data about the donor
– Compliance: Verifiable compliance with the above principles

Architecture: Policy
[Diagram labels: Privacy Policy, Privacy Metadata Creator, Privacy Metadata, Store; principles: Limited Disclosure, Limited Retention]
Converts privacy policy into privacy metadata tables.
For each purpose & piece of information (attribute):
• External recipients
• Retention period
• Authorized users
Different designs possible.

Privacy Policies Table
Purpose          Table     Attribute  External-recipients      Authorized-users    Retention
purchase         customer  name       {delivery, credit-card}  {shipping, charge}  1 month
purchase         customer  email      empty                    {shipping}          1 month
register         customer  name       empty                    {registration}      3 years
register         customer  email      empty                    {registration}      3 years
recommendations  order     book       empty                    {mining}            10 years

Architecture: Data Collection
[Diagram labels: Data Collection, Privacy Constraint Validator, Audit Info, Privacy Metadata, Audit Trail, Store]
– Consent: Privacy policy compatible with user’s privacy preference?
– Compliance: Audit trail for compliance.

Architecture: Data Collection
[Diagram labels: Data Collection, Privacy Constraint Validator, Data Accuracy Analyzer, Audit Info, Privacy Metadata, Audit Trail, Store, Record Access Control]
– Accuracy: Data cleansing, e.g., errors in address.
– Purpose Specification: Associate set of purposes with each record.

Architecture: Queries
[Diagram labels: Queries, Attribute Access Control, Privacy Metadata, Store, Record Access Control; principles: Safety, Limited Use]
1. Telemarketing cannot issue query tagged “charge”.
2. Query tagged “telemarketing” cannot see credit card info.
3. Telemarketing query only sees records that include “telemarketing” in set of purposes.

Architecture: Queries
[Diagram labels: Queries, Attribute Access Control, Query Intrusion Detector, Audit Info, Privacy Metadata, Audit Trail, Store, Record Access Control; principles: Safety, Compliance]
Query Intrusion Detector: e.g., a telemarketing query that asks for all phone numbers.
Audit Trail:
• Compliance
• Training data for query intrusion detector

Architecture: Other
– Data Collection Analyzer (Limited Collection): Analyze queries to identify unnecessary collection, retention & authorizations.
– Data Retention Manager (Limited Retention): Delete items in accordance with privacy policy.
– Encryption Support (Safety): Additional security for sensitive data.

Architecture
– Privacy Policy: Privacy Metadata Creator
– Data Collection: Privacy Constraint Validator, Data Accuracy Analyzer, Audit Info
– Queries: Attribute Access Control, Query Intrusion Detector, Audit Info
– Other: Data Collection Analyzer, Data Retention Manager
– Store: Privacy Metadata, Audit Trail, Record Access Control, Encryption Support

Related Work: Statistical & Secure Databases
Statistical Databases
– Provide statistical information (sum, count, etc.) without compromising sensitive information about individuals, [AW89]
Multilevel Secure Databases
– Multilevel relations, e.g., records tagged “secret”, “confidential”, or “unclassified”, e.g. [JS91]
Need to protect privacy in transactional databases that support daily operations.
– Cannot restrict queries to statistical queries.
– Cannot tag all the records “top secret”.

Some Interesting Problems
Privacy enforcement requires cell-level decisions (which may be different for different queries)
– How to minimize the cost of privacy checking?
Encryption to avoid data theft
– How to index encrypted data for range queries?
Intrusive queries from authorized users
– Query intrusion detection?
Identifying unnecessary data collection
– Assets info needed only if salary is below a threshold
– Queries only ask “Salary > threshold” for rent application
Forgetting data after the purpose is fulfilled
– Databases designed not to lose data
– Interaction with compliance
Solutions must scale to database-size problems!

Today’s Information Sharing Systems
[Diagram: Centralized and Federated architectures, each answering a query Q with a result R; labels include Mediator.]
Assumption: Information in each database can be freely shared.

Minimal Necessary Information Sharing
Compute queries across databases so that no more information than necessary is revealed (without using a trusted third party).
Need is driven by several trends:
– End-to-end integration of information systems across companies.
– Simultaneously compete and cooperate.
– Security: need-to-know information sharing
Agrawal, Evfimievski & Srikant: SIGMOD 2003.

Selective Document Sharing
R is shopping for technology.
S has intellectual property it may want to license.
First find the specific technologies where there is a match, and then reveal further information about those.
[Diagram: R's Shopping List and S's Technology List.]
Example 2: Govt. agencies sharing information on a need-to-know basis.

Medical Research
Validate hypothesis between adverse reaction to a drug and a specific DNA sequence.
Researchers should not learn anything beyond 4 counts:
                    Sequence Present   Sequence Absent
Adverse Reaction    ?                  ?
No Adv. Reaction    ?                  ?
[Diagram labels: DNA Sequences, Mayo Clinic, Drug Reactions]
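Looking back at the Hippocratic database slides (Privacy Policies Table and Architecture: Queries), the attribute-level and record-level checks can be sketched in a few lines. This is a hypothetical in-memory illustration, not the design from the VLDB 2002 paper: `PolicyRow`, `allowed_attributes`, `run_query` and the record layout with `values` and `purposes` keys are all assumed names, and the external-recipient and retention columns are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyRow:
    purpose: str
    table: str
    attribute: str
    authorized_users: frozenset


# Rows copied from the Privacy Policies Table slide (recipient and retention columns omitted).
POLICY = [
    PolicyRow("purchase", "customer", "name", frozenset({"shipping", "charge"})),
    PolicyRow("purchase", "customer", "email", frozenset({"shipping"})),
    PolicyRow("register", "customer", "name", frozenset({"registration"})),
    PolicyRow("register", "customer", "email", frozenset({"registration"})),
    PolicyRow("recommendations", "order", "book", frozenset({"mining"})),
]


def allowed_attributes(user, purpose, table):
    """Attribute access control: attributes this user may read for this purpose."""
    return {
        row.attribute
        for row in POLICY
        if row.purpose == purpose and row.table == table and user in row.authorized_users
    }


def run_query(user, purpose, table, attributes, records):
    """Reject disallowed attributes, then keep only records whose purposes include `purpose`."""
    denied = set(attributes) - allowed_attributes(user, purpose, table)
    if denied:
        raise PermissionError(f"{user!r} may not read {sorted(denied)} for {purpose!r}")
    return [
        {name: rec["values"][name] for name in attributes}
        for rec in records                      # record access control
        if purpose in rec["purposes"]
    ]
```

Because the decision depends on the user, the purpose and the individual record, the check is effectively made per cell, which is the cost the "Some Interesting Problems" slide asks how to minimize.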
Minimal Necessary Sharing
[Diagram: R = {a, u, v, x}, S = {b, u, v, y}]
– R ∩ S = {u, v}: R must not know that S has b & y; S must not know that R has a & x.
– Count (R ∩ S): R & S do not learn anything except that the result is 2.

Problem Statement: Minimal Sharing
Given:
– Two parties (honest-but-curious): R (receiver) and S (sender)
– Query Q spanning the tables R and S
– Additional (pre-specified) categories of information I
Compute the answer to Q and return it to R without revealing any additional information to either party, except for the information contained in I
– For intersection, intersection size & equijoin, I = { |R| , |S| }
– For equijoin size, I also includes the distribution of duplicates & some subset of information in R ∩ S

A Possible Approach
Secure Multi-Party Computation
– Given two parties with inputs x and y, compute f(x,y) such that the parties learn only f(x,y) and nothing else.
– Can be solved by building a combinatorial circuit, and simulating that circuit [Yao86].
Prohibitive cost for database-size problems.
– Intersection of two relations of a million records each would require 144 days

Intersection Protocol: Intuition
Want to encrypt the value in R and S and compare the encrypted values.
However, want an encryption function such that it can only be jointly computed by R and S, not separately.

Commutative Encryption
Commutative encryption F is a computable function $f : \mathrm{Key}\,F \times \mathrm{Dom}\,F \to \mathrm{Dom}\,F$, satisfying:
– For all $e, e' \in \mathrm{Key}\,F$, $f_e \circ f_{e'} = f_{e'} \circ f_e$
  (The result of encryption with two different keys is the same, irrespective of the order of encryption)
– Each $f_e$ is a bijection.
  (Two different values will have different encrypted values)
– The distribution of $\langle x, f_e(x), y, f_e(y) \rangle$ is indistinguishable from the distribution of $\langle x, f_e(x), y, z \rangle$; $x, y, z \in_r \mathrm{Dom}\,F$ and $e \in_r \mathrm{Key}\,F$.
  (Given a value x and its encryption $f_e(x)$, for a new value y, we cannot distinguish between $f_e(y)$ and a random value z. Thus we cannot encrypt y nor decrypt $f_e(y)$.)

Example Commutative Encryption
$f_e(x) = x^e \bmod p$ where
– p: safe prime number, i.e., both p and q=(p-1)/2 are primes
– encryption key $e \in \{1, 2, \ldots, q-1\}$
– Dom F: all quadratic residues modulo p
Commutativity: powers commute
  $(x^d \bmod p)^e \bmod p = x^{de} \bmod p = (x^e \bmod p)^d \bmod p$
Indistinguishability follows from Decisional Diffie-Hellman Hypothesis (DDH)

Intersection Protocol
[Diagram: R holds table R and secret key r; S holds table S and secret key s. S sends $f_s(S)$ to R.]
– $f_s(S)$ is shorthand for $\{ f_s(x) \mid x \in S \}$
– We apply $f_s$ on h(S), where h is a hash function, not directly on S.

Intersection Protocol
[Diagram: R holds $f_s(S)$ and computes $f_r(f_s(S)) = f_s(f_r(S))$ (commutative property).]

Intersection Protocol
[Diagram: R sends $f_r(R)$ to S; S returns $\langle y, f_s(y) \rangle$ for $y \in f_r(R)$. R holds $f_s(f_r(S))$ and, since R knows $\langle x, y = f_r(x) \rangle$, the pairs $\langle x, f_s(f_r(x)) \rangle$ for $x \in R$.]

Intersection Size Protocol
[Diagram: S sends $f_s(S)$ to R; R sends $f_r(R)$ to S; S returns $f_s(f_r(R))$, not the pairs $\langle y, f_s(y) \rangle$ for $y \in f_r(R)$. R holds $f_r(f_s(S))$ and $f_r(f_s(R))$.]
– R cannot map $z \in f_r(f_s(R))$ back to $x \in R$.

Equi Join and Join Size
See SIGMOD 2003 paper
Also gives the cost analysis of protocols

Related Work
[NP99]: Protocols for list intersection problem
– Oblivious evaluation of n polynomials of degree n each.
– Oblivious evaluation of $n^2$ polynomials.
[HFH99]: find people with common preferences, without revealing the preferences.
– Intersection protocols are similar to ours, but do not provide proofs of security.
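To make the commutative-encryption slides concrete, here is a toy walk-through of the intersection protocol using the power function modulo a safe prime. Everything in it is an illustrative assumption rather than the paper's implementation: the prime 2039 (with q = 1019) is far too small for real security, both parties run inside one function instead of exchanging messages, hashing into the quadratic residues by squaring a SHA-256 digest is just one possible choice of h, and sorting the outgoing list is an added precaution of this sketch. With such a small domain, hash collisions can produce occasional false matches.

```python
import hashlib
import secrets

P = 2039            # toy safe prime: P and Q = (P - 1) // 2 are both prime
Q = (P - 1) // 2    # 1019


def h(value: str) -> int:
    """Hash a value into the quadratic residues modulo P (Dom F)."""
    digest = int.from_bytes(hashlib.sha256(value.encode()).digest(), "big")
    return pow(digest % (P - 1) + 1, 2, P)    # base in 1..P-1, squared gives a residue


def keygen() -> int:
    """Pick a secret key e from 1, 2, ..., Q-1."""
    return secrets.randbelow(Q - 1) + 1


def f(e: int, x: int) -> int:
    """Commutative encryption f_e(x) = x^e mod P."""
    return pow(x, e, P)


def intersect(r_values, s_values):
    """Return the values of R that also occur in S, following the slides' message flow."""
    r_key, s_key = keygen(), keygen()
    # S -> R: f_s(h(S)), sorted so that positions reveal nothing
    s_enc = sorted(f(s_key, h(x)) for x in s_values)
    # R -> S: f_r(h(R)); R remembers the pairs <x, y = f_r(h(x))>
    r_enc = {x: f(r_key, h(x)) for x in r_values}
    # S -> R: pairs <y, f_s(y)> for y in f_r(h(R))
    pairs = {y: f(s_key, y) for y in r_enc.values()}
    # R: f_r(f_s(h(S))), which equals f_s(f_r(h(S))) by commutativity
    s_double = {f(r_key, z) for z in s_enc}
    return {x for x, y in r_enc.items() if pairs[y] in s_double}


if __name__ == "__main__":
    print(intersect(["a", "u", "v", "x"], ["b", "u", "v", "y"]))
```

R never sees S's values under a key it can remove on its own, and S only sees R's values encrypted under r, which is the intuition stated on the "Intersection Protocol: Intuition" slide.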
Challenges
Models of minimal disclosure and corresponding protocols for
– other database operations
– combination of operations
Faster protocols
Tradeoff between efficiency and
– the additional information disclosed
– approximation

Closing Thoughts
Solutions to complex problems such as privacy require a mix of legislations, societal norms, market forces & technology
By advancing technology, we can change the mix and improve the overall quality of the solution
Gold mine of challenging research problems (besides being useful)!

References
http://www.almaden.ibm.com/software/quest/
M. Bawa, R. Bayardo, R. Agrawal. Privacy-preserving indexing of Documents on the Network. 29th Int'l Conf. on Very Large Databases (VLDB), Berlin, Sept. 2003.
R. Agrawal, A. Evfimievski, R. Srikant. Information Sharing Across Private Databases. ACM Int’l Conf. On Management of Data (SIGMOD), San Diego, California, June 2003.
A. Evfimievski, J. Gehrke, R. Srikant. Limiting Privacy Breaches in Privacy Preserving Data Mining. PODS, San Diego, California, June 2003.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. An XPath Based Preference Language for P3P. 12th Int'l World Wide Web Conf. (WWW), Budapest, Hungary, May 2003.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Implementing P3P Using Database Technology. 19th Int'l Conf. on Data Engineering (ICDE), Bangalore, India, March 2003.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Server Centric P3P. W3C Workshop on the Future of P3P, Dulles, Virginia, Nov. 2002.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Hippocratic Databases. 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002.
R. Agrawal, J. Kiernan. Watermarking Relational Databases. 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002. Expanded version in VLDB Journal 2003.
A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Mining Association Rules Over Privacy Preserving Data. 8th Int'l Conf. on Knowledge Discovery in Databases and Data Mining (KDD), Edmonton, Canada, July 2002.
R. Agrawal, R. Srikant. Privacy Preserving Data Mining. ACM Int’l Conf. On Management of Data (SIGMOD), Dallas, Texas, May 2000.
...

This note was uploaded on 11/03/2010 for the course ITEC ITEC313 taught by Professor Nazifedimililer during the Spring '10 term at Eastern Mediterranean University.
