lecture22

lecture22 - Data Mining CS57300 Purdue University December...

Info iconThis preview shows page 1. Sign up to view the full content.

View Full Document Right Arrow Icon
This is the end of the preview. Sign up to access the rest of the document.

Unformatted text preview: Data Mining CS57300 Purdue University December 2, 2010 Anomaly detection (cont) (source: Introduction to Data Mining by Tan, Steinbach and Kumar) Unsupervised (point) anomaly detection • General method !"#$%&'()*+*,+-#"(.,/*$*0( • Build a profile of “normal” behavior based on patterns or summary statistics for the overall population ! ?)%)1$@"'()AB C D/*@4"$"A13E*@)"3E"(-)"F%310$@G"+)-$H*31 • Use deviations from13E*@)",$%"+)"A$to1%B"31"B/00$1J"B($(*B(*,B"E31"(-)"3H)1$@@"A3A/@$(*3% “normal” (() detect anomalies !I • Types of methods C KB)"(-)"F%310$@G"A13E*@)"(3"4)(),("$%30$@*)B ! L%30$@*)B"$1)"3+B)1H$(*3%B"M-3B)",-$1$,()1*B(*,B 4*EE)1"B*7%*E*,$%(@J"E130"(-)"%310$@"A13E*@) • Visual and statistical-based #JA)B"3E"$%30$@J"4)(),(*3%" B,-)0)B • Distance-based ! C ?1$A-*,$@"N"'($(*B(*,$@O+$B)4 • Model-based C 5*B($%,)O+$B)4 C 634)@O+$B)4 !"#$%&'()*%+$,-&"./0$1" 2%(134/,(*3%"(3"5$($"6*%*%7"""""""" 89:;9<==8"""""""""""""""> Mixture model approach • Assume the dataset D is a mixture of two distributions • M: majority distribution • A: anomalous distribution • General method: • Initially, assume that all points belong to M • Let Li(D) be the likelihood at iteration i • For each point xi in M, move it to A • Compute the difference in likelihood, ∆=Li(D)-Li’(D) • If ∆>c then xi is moved permanently to A Statistical-based (cont) • Mixture distribution, D=(1-λ)M + λ A • M is a probability distribution estimated from data • A is initially assumed to be a uniform distribution • λ is the expected fraction of outliers • Likelihood at iteration i: N Li (D) = j =1 P (xj ) = (1 − λ)|Mi | PMi (xj ) xj ∈Mi PAi (xj ) λ | Ai | xj ∈Ai Limitations of statistical approaches • Most tests are for univariate data • In many cases, the form of the data distribution may not be known • If misspecified, outliers may look likely (e.g., mean is an outlier) • For high-dimensional data, it may be difficult to estimate likelihood of points in the true distribution Distance-based approaches • Three major types of methods • Nearest-neighbor • Density-based • Clustering approach Nearest-neighbor • Compute distance between every pair of points • How to define outliers? • Points for which there are fewer than p neighboring points within distance d • Top p points whose distance to kth nearest neighbor is greatest • Top p points whose average distance to the k nearest neighbors is greatest Example Kumar Introduction to Data Mining © Tan,Steinbach, Kumar 1/17/2006 Introduction to Data Mining 5 1/17/200 Example Kumar Introduction to Data Mining © Tan,Steinbach, Kumar 1/17/2006 Introduction to Data Mining 7 1/17/20 High dimensions • In high-dimensional space, data is sparse and notion of proximity becomes meaningless • Every point is almost equally good outlier from the perspective of proximity-based definitions • Lower-dimensional projections can be used for outlier detection • A point is an outlier if in some lower dimensional projection, it is present in a local region of abnormally low density Low dimensional projection • Divide each attribute in intervals ϕ based on frequency (each interval contains f=1/ϕ records) • Consider a k-dimensional cube created by picking grid ranges from k different dimensions • If attributes are independent, we expect a region to contain a fraction fk of the points • If there are N points, we can measure the sparsity of a cube D as: S (D) = n(D) − N · f k N · f k · (1 − f k ) • Negative sparsity indicates cube contains smaller number of points than expected Example • N=100 !"#$%&' ! ?@:==&"! @"A&"B"@":9A"@"=C<&"?"" B< @"8 • ϕ=5 • f=1/5 • N x f2=4 !"#$%&'()*%+$,-&"./0$1" 2%(134/,(*3%"(3"5$($"6*%*%7"""""""" 89:;9<==8""""""""""""""":> (')*+,-./#*'0123452#%%67# Density-based ! !"#$%&'()*%+$,-&"./0$1" D31")$,-"E3*%(&"8,30E/()"""(-)"4)%F*(G"3B" 9:;9<==8""""""""""""" :> I30E/()"H3,$H"3/(H*)1"B$,(31"JKLDM"3B"$" $N)1$7)"3B"(-)"1$(*3F"3B"(-)"4)%F*(G"3B"F 4)%F*(G"3B"*(F"%)$1)F("%)*7-+31F L/(H*)1F"$1)"E3*%(F"O*(-"H$17)F("KLD"N$H/ 2%(134/,(*3%"(3"5$($"6*%*%7"""""""" ! • For each point, compute the density of its local neighborhood ! distance(x, y ) −1 y ∈N (x,k) (')*+,-./#*'0123452#%%67#89 density (x, k) = |N (x, k)| !"#$%& "+$#,+ 4%/3&# 9+$%#) D31")$,-"E3*%(&",30E/()"(-)"4)%F*(G"3B"*(F"H3,$H"%)*7-+31-334 Local outlier factor (LOF):! I30E/()"H3,$H"3/(H*)1"B$,(31"JKLDM"3B"$"F$0EH)"! $F"(-)" $N)1$7)"3B"(-)"1$(*3F"3B"(-)"4)%F*(G"3B"! F$0EH)"! $%4"(-)" " ! " 4)%F*(G"3B"*(F"%)$1)F("%)*7-+31F ! L/(H*)1F" ratio of the KLD" $,-&"./0 • For a instance i the LOF is the $1)"E3*%(F"O*(-"H$17)F(""#$%&'()*%+N$H/)$1" ! 2%(134/,(*3%"(3"5$($"6*%*%7"""""""" ! • " # density of instance i and average density of its nearest neighbors • Outliers are points with largest LOF value " !" " !"#$%&'()*%+$,-&"./0$1" !"#$%&#''#())*+(,%-#). /0# "+$#,+"0/1&*&1#(0#+2$3/&*-# 4%/3&#567#())*+(,%#8/"1# 9+$%#): ("1#).#(0#+2$3/&*0 !# 2%(134/,(*3%"(3"5$($"6*%*%7"""""""" 89:;9<==8""""""""""""""":; Clustering-based !"#$%&'()*+,-$&. • Cluster the data into groups of different density ! ?$@*,"*4)$A B cluster as candidate • Choose points in smallCD/@()1"(-)"4$($"*%(3" 713/E@"3F"4*FF)1)%("4)%@*(G outliers B C-33@)"E3*%(@"*%"@0$DD" ,D/@()1"$@",$%4*4$()" • Compute the distance 3/(D*)1@ candidate between points and non-candidate0E/()"(-)"4*@($%,)" clusters B C3 +)(H))%",$%4*4$()"E3*%(@" • If candidate points $%4"%3%I,$%4*4$()" are far from all other ,D/@()1@J" are outliers non-candidate points, they ! 2F",$%4*4$()"E3*%(@"$1)"F$1" F130"$DD"3(-)1"%3%I,$%4*4$()" E3*%(@&"(-)G"$1)"3/(D*)1@ !"#$%&'()*%+$,-&"./0$1" 2%(134/,(*3%"(3"5$($"6*%*%7"""""""" 89:;9<==8""""""""""""""":> Example © Tan,Steinbach, Kumar Introduction to Data Mining 1/17/2006 10 Example © Tan,Steinbach, Kumar Introduction to Data Mining 1/17/2006 11 Spectral methods • Analysis based on eigendecomposition of data • Key Idea • Find combination of attributes that capture bulk of variability • Reduced set of attributes can explain normal data well, but not necessarily the anomalies • Example: Principal Component Analysis • Top few principal components capture variability in normal data • Smallest principal component should have constant values • Outliers have variability in the smallest component Source: Lazarevic et al, ECML/PKDD’08 Tutorial Anomalies in time series data • Time series is a sequence of data points, measured typically at successive times, spaced at (often uniform) time intervals • Anomalies in time series data are data points that significantly deviate from the normal pattern of the data sequence Source: Sutton, CS294 UCBerkeley Examples of time series data Network traffic data Telephone usage data Source: Sutton, CS294 UCBerkeley Example: Network traffic (Lakhina et. al ’04) Goal: Find source-destination pairs with high traffic (e.g., by rate, volume) Backbone network Y= …. 100 30 42 212 1729 13 …. Source: Sutton, CS294 UCBerkeley Example: Network traffic Data matrix Y= Perform PCA on matrix Y …. 100 30 42 212 1729 13 …. Low-dimensional data Yv= …. ytTv1 ytTv2 …. Eigenvectors v1 v2 … Source: Sutton, CS294 UCBerkeley Example: Network traffic • Projections onto principal components: Yvi ui = Yvi • Look for first projection that contains a 3σ from mean to identify beginning of “anomalous” subspace Example: Network traffic Abilene backbone network traffic volume over 41 links collected over 4 weeks Perform PCA on 41-dim data Select top 5 components to form “normal” subspace P anomalies Source: Sutton, CS294 UCBerkeley Example: Network traffic Abilene backbone network traffic volume over 41 links collected over 4 weeks Project to residual subspace ˜ y = (I − PPT )y threshold Source: Sutton, CS294 UCBerkeley Example: Telephone traffic (Scott ‘03) • Problem: Detecting if the phone usage of an account is abnormal or not • Data collection: phone call records and summaries of an account’s previous history (features: call duration, regions of the world called, calls to “hot” Potentially fradulent numbers, etc) Account B Time (days) Time (days) Fraud score Account A activities Source: Sutton, CS294 UCBerkeley Burst modeling using modulated Poisson processes (Scott ‘03) Poisson process N0 binary Markov chain Poisson process N1 Model as a nonstationary discrete time hidden Markov model (hidden state=intruder or customer traffic) Source: Sutton, CS294 UCBerkeley Detection results Uncontaminated account probability of a criminal presence probability of each phone call being intruder traffic Contaminated account Evaluation • Objective measures • Measure ability to identify known anomalous instances • Measure ability to identify injected anomalous instances (e.g., random cases or profiled cases) • Use internal measure to evaluate (e.g., density of clusters) • Subjective measures • Adhoc inspection of identified anomalies • Critical issue: • Anomalies are by definition rare instances Evaluation criteria • False alarm rate (type I error) • Misdetection rate (type II error) • Neyman-Pearson criterion • Minimize misdetection rate while false alarm rate is bounded • Bayesian criterion • Minimize a weighted sum of false alarm and misdetection rate Announcements • Homework 5: Due Monday Dec 6th by 4pm • Next class: • Data mining systems • Top ten data mining mistakes • Myths and pitfalls ...
View Full Document

Ask a homework question - tutors are online