Data Mining
CS57300
Purdue University
December 2, 2010

Anomaly detection (cont)
(source: Introduction to Data Mining by Tan, Steinbach and Kumar)

Unsupervised (point) anomaly detection
• General method
• Build a profile of “normal” behavior based on patterns or summary statistics for the overall population
• Use deviations from “normal” to detect anomalies
• Anomalies are observations whose characteristics differ significantly from the normal profile
• Types of methods
• Visual and statistical-based
• Distance-based
• Model-based

Mixture model approach
• Assume the dataset D is a mixture of two distributions
• M: majority distribution
• A: anomalous distribution
• General method:
• Initially, assume that all points belong to M
• Let Li(D) be the likelihood at iteration i
• For each point xi in M, move it to A
• Compute the difference in likelihood, ∆ = Li(D) − Li’(D)
• If ∆ > c, then xi is moved permanently to A

Statistical-based (cont)
• Mixture distribution, D = (1 − λ)M + λA
• M is a probability distribution estimated from data
• A is initially assumed to be a uniform distribution
• λ is the expected fraction of outliers
• Likelihood at iteration i:
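$$ L_i(D) = \prod_{j=1}^{N} P_D(x_j) = \left( (1-\lambda)^{|M_i|} \prod_{x_j \in M_i} P_{M_i}(x_j) \right) \left( \lambda^{|A_i|} \prod_{x_j \in A_i} P_{A_i}(x_j) \right) $$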
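The mixture-model procedure can be sketched for one-dimensional data as follows. This is a minimal illustration under several assumptions that are not in the slides: M is a Gaussian re-estimated from the points currently assigned to it, A is uniform over the observed data range, the computation uses log-likelihoods, and ∆ is taken as the increase in log-likelihood obtained by moving a point from M to A. The function names and the values of `lam` and `c` are hypothetical.

```python
import math

def gaussian_logpdf(x, mean, std):
    # Log density of a normal distribution N(mean, std^2) at x.
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)

def log_likelihood(normal, anomalous, lam, log_p_uniform):
    # M is assumed Gaussian, fitted to the points currently in `normal`.
    mean = sum(normal) / len(normal)
    var = sum((x - mean) ** 2 for x in normal) / len(normal)
    std = max(math.sqrt(var), 1e-9)  # guard against zero variance
    ll = len(normal) * math.log(1 - lam)
    ll += sum(gaussian_logpdf(x, mean, std) for x in normal)
    # A is assumed uniform, so every anomalous point has the same log density.
    ll += len(anomalous) * (math.log(lam) + log_p_uniform)
    return ll

def mixture_anomalies(data, lam=0.05, c=1.0):
    log_p_uniform = -math.log(max(data) - min(data))
    normal, anomalous = list(data), []
    for x in list(data):
        current = log_likelihood(normal, anomalous, lam, log_p_uniform)
        trial_normal = list(normal)
        trial_normal.remove(x)  # tentatively move x from M to A
        moved = log_likelihood(trial_normal, anomalous + [x], lam, log_p_uniform)
        if moved - current > c:  # keep the move only if the gain exceeds c
            normal, anomalous = trial_normal, anomalous + [x]
    return anomalous

data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 25.0]
flagged = mixture_anomalies(data)
print(flagged)
```

The result depends on the order in which points are visited and on the choices of λ and c, so the values above are for illustration only.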

Limitations of statistical approaches
• Most tests are for univariate data
• In many cases, the form of the data distribution may not be known
• If misspecified, outliers may look likely (e.g., mean is an outlier)
• For high-dimensional data, it may be difficult to estimate likelihood of points in the true distribution

Distance-based approaches
• Three major types of methods
• Nearest-neighbor
• Density-based
• Clustering approach

Nearest-neighbor
• Compute distance between every pair of points
• How to define outliers?
• Points for which there are fewer than p neighboring points within distance d
• Top p points whose distance to the k-th nearest neighbor is greatest
• Top p points whose average distance to the k nearest neighbors is greatest
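As an illustration of the second definition (top p points by distance to the k-th nearest neighbor), here is a minimal brute-force sketch. The function names and the sample points are hypothetical, and Euclidean distance is an assumption; the slides do not fix a distance measure.

```python
import math

def knn_outlier_scores(points, k):
    # Score of each point = distance to its k-th nearest neighbor.
    scores = []
    for i, a in enumerate(points):
        dists = sorted(math.dist(a, b) for j, b in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

def top_outliers(points, k, p):
    # Indices of the p points with the largest k-th neighbor distance.
    scores = knn_outlier_scores(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:p]

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(top_outliers(points, k=2, p=1))
```

This computes the distance between every pair of points, so the cost grows quadratically with the number of points.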
Example (figures omitted; © Tan, Steinbach, Kumar, Introduction to Data Mining)

High dimensions
• In high-dimensional space, data is sparse and the notion of proximity becomes meaningless
• Every point is an almost equally good outlier from the perspective of proximity-based definitions
• Lower-dimensional projections can be used for outlier detection
• A point is an outlier if, in some lower-dimensional projection, it is present in a local region of abnormally low density

Low-dimensional projection
• Divide each attribute into ϕ equal-frequency intervals (each interval contains a fraction f = 1/ϕ of the records)
• Consider a k-dimensional cube created by picking grid ranges from k different dimensions
• If attributes are independent, we expect a region to contain a fraction f^k of the points
• If there are N points, we can measure the sparsity of a cube D, where n(D) is the number of points in D, as:
$$ S(D) = \frac{n(D) - N \cdot f^k}{\sqrt{N \cdot f^k \cdot (1 - f^k)}} $$
• Negative sparsity indicates the cube contains a smaller number of points than expected

Example
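As a numerical illustration of the sparsity coefficient S(D), the sketch below assumes N = 100 points, ϕ = 5 ranges per attribute and a 2-dimensional cube (k = 2), so the expected count is N·f² = 4. The observed counts passed to the function are hypothetical, and the function name is an assumption.

```python
import math

def sparsity(n_observed, n_points, phi, k):
    # Fraction of records in each equal-frequency range.
    f = 1.0 / phi
    # Expected number of points in a k-dimensional cube under independence.
    expected = n_points * f ** k
    # Standardized deviation of the observed count from the expected count.
    return (n_observed - expected) / math.sqrt(expected * (1 - f ** k))

# A cube holding the expected 4 points has sparsity close to 0;
# an empty cube has negative sparsity.
print(sparsity(4, 100, 5, 2))
print(sparsity(0, 100, 5, 2))
```

Cubes with strongly negative sparsity are the candidate regions of abnormally low density.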
• N = 100
• ϕ = 5
• f = 1/5
• N × f² = 4

Density-based
• For each point, compute the density of its local neighborhood, where N(x, k) is the set of the k nearest neighbors of x:
$$ \text{density}(x, k) = \left( \frac{\sum_{y \in N(x,k)} \text{distance}(x, y)}{|N(x,k)|} \right)^{-1} $$
• Local outlier factor (LOF): for an instance i, the LOF is the ratio of the average density of its nearest neighbors to the density of instance i
• Outliers are points with the largest LOF value
(Figure omitted: in the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers.)
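The density-based score can be sketched as below. This is a simplified version that uses the inverse mean distance to the k nearest neighbors as the density and the ratio of the neighbors' average density to the point's own density as the score; it is not the full LOF definition based on reachability distances. The function names and sample points are hypothetical, Euclidean distance is an assumption, and the points are assumed to be distinct (duplicates would give a zero mean distance).

```python
import math

def k_nearest(points, i, k):
    # Indices of the k points closest to points[i] (excluding i itself).
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def density(points, i, k):
    # Inverse of the mean distance to the k nearest neighbors.
    neighbors = k_nearest(points, i, k)
    mean_dist = sum(math.dist(points[i], points[j]) for j in neighbors) / len(neighbors)
    return 1.0 / mean_dist

def simple_lof(points, k):
    # Score = average density of the neighbors / density of the point.
    dens = [density(points, i, k) for i in range(len(points))]
    scores = []
    for i in range(len(points)):
        neighbors = k_nearest(points, i, k)
        avg_neighbor_density = sum(dens[j] for j in neighbors) / len(neighbors)
        scores.append(avg_neighbor_density / dens[i])
    return scores

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = simple_lof(points, k=2)
print(scores)
```

Points inside a uniformly dense cluster score close to 1, while a point whose neighbors are much denser than itself scores well above 1.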

Clustering-based
• Cluster the data into groups of different density
• Choose points in small clusters as candidate outliers
• Compute the distance between candidate points and non-candidate clusters
• If candidate points are far from all other non-candidate points, they are outliers
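A minimal sketch of these steps follows. It assumes the cluster labels have already been produced by some clustering algorithm (the clustering step itself is not shown), and it measures how far a candidate is from the nearest non-candidate point. The function name, the `min_size` and `threshold` parameters and the sample data are hypothetical.

```python
import math
from collections import Counter

def cluster_based_outliers(points, labels, min_size, threshold):
    sizes = Counter(labels)
    # Candidates are the points that fall in small clusters.
    candidates = [i for i, lab in enumerate(labels) if sizes[lab] < min_size]
    regular = [i for i, lab in enumerate(labels) if sizes[lab] >= min_size]
    if not regular:
        return []  # no large cluster to compare against
    outliers = []
    for i in candidates:
        # Distance from the candidate to the closest non-candidate point.
        nearest = min(math.dist(points[i], points[j]) for j in regular)
        if nearest > threshold:
            outliers.append(i)
    return outliers

points = [
    (0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2),   # cluster 0
    (5.0, 5.0), (5.2, 5.0), (5.0, 5.2), (5.2, 5.2),   # cluster 1
    (10.0, 0.0),                                      # singleton, far away
    (0.5, 0.5),                                       # singleton, near cluster 0
]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 3]
print(cluster_based_outliers(points, labels, min_size=3, threshold=2.0))
```

The singleton near cluster 0 is a candidate but is not reported, because it lies close to non-candidate points.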
Example (figures omitted; © Tan, Steinbach, Kumar, Introduction to Data Mining)

Spectral methods
• Analysis based on eigendecomposition of data
• Key Idea
• Find combination of attributes that capture bulk of variability
• Reduced set of attributes can explain normal data well, but not necessarily
the anomalies
• Example: Principal Component Analysis
• Top few principal components capture variability in normal data
• Smallest principal component should have constant values
• Outliers have variability in the smallest component
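The PCA idea can be sketched in two dimensions, where removing the top principal component leaves exactly the smallest component. This is a hedged illustration and not taken from the cited tutorial: the leading direction is estimated with plain power iteration, the function names and data are hypothetical, and the score is the length of each point's residual after projecting out the top component.

```python
import math

def top_component(rows, iterations=200):
    # Leading principal direction of already-centered rows (power iteration).
    dim = len(rows[0])
    cov = [[sum(r[a] * r[b] for r in rows) / len(rows) for b in range(dim)]
           for a in range(dim)]
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def residual_scores(data):
    dim = len(data[0])
    mean = [sum(r[a] for r in data) / len(data) for a in range(dim)]
    centered = [[r[a] - mean[a] for a in range(dim)] for r in data]
    v = top_component(centered)
    scores = []
    for r in centered:
        proj = sum(r[a] * v[a] for a in range(dim))       # coordinate along v
        resid = [r[a] - proj * v[a] for a in range(dim)]  # part not explained by v
        scores.append(math.sqrt(sum(x * x for x in resid)))
    return scores

# Five points close to the line y = 2x, plus one point off the line.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0), (5.0, 10.1), (3.0, 9.0)]
scores = residual_scores(data)
print(scores)
```

Points that follow the dominant pattern have small residuals, while the off-line point has the largest one.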
Source: Lazarevic et al., ECML/PKDD’08 Tutorial

Anomalies in time series data
• A time series is a sequence of data points, typically measured at successive times and spaced at (often uniform) time intervals
• Anomalies in time series data are data points that significantly deviate from the normal pattern of the data sequence
Source: Sutton, CS294 UC Berkeley

Examples of time series data
• Network traffic data
• Telephone usage data
Source: Sutton, CS294 UC Berkeley

Example: Network traffic (Lakhina et al. ’04)
• Goal: Find source-destination pairs with high traffic (e.g., by rate, volume)
(Figure omitted: backbone network and data matrix Y, with a sample row 100 30 42 212 1729 13.)
Source: Sutton, CS294 UC Berkeley

Example: Network traffic
• Data matrix Y (sample row: 100 30 42 212 1729 13)
• Perform PCA on matrix Y to obtain the eigenvectors v1, v2, ...
• Low-dimensional data Yv, with entries y_t^T v_1, y_t^T v_2, ...
Source: Sutton, CS294 UC Berkeley

Example: Network traffic
• Projections onto principal components:
$$ u_i = \frac{Y v_i}{\lVert Y v_i \rVert} $$
• Look for the first projection that contains a 3σ deviation from the mean to identify the beginning of the “anomalous” subspace

Example: Network traffic
• Abilene backbone network traffic volume over 41 links, collected over 4 weeks
• Perform PCA on the 41-dim data
• Select the top 5 components to form the “normal” subspace P
• Project to the residual subspace:
$$ \tilde{y} = (I - P P^T) y $$
(Figures omitted; panel labels: anomalies, threshold.)
Source: Sutton, CS294 UC Berkeley

Example: Telephone traffic (Scott ‘03)
• Problem: Detecting whether the phone usage of an account is abnormal
• Data collection: phone call records and summaries of an account’s previous history (features: call duration, regions of the world called, calls to “hot” numbers, etc.)
(Figure omitted: fraud score over time (days) for Account A and Account B, with potentially fraudulent activities marked.)
Source: Sutton, CS294 UC Berkeley

Burst modeling using modulated Poisson processes (Scott ‘03)
(Figure omitted: a binary Markov chain switching between Poisson process N0 and Poisson process N1.)
• Model as a nonstationary discrete-time hidden Markov model (hidden state = intruder or customer traffic)
Source: Sutton, CS294 UC Berkeley

Detection results
(Figure omitted: for an uncontaminated and a contaminated account, the probability of a criminal presence and the probability of each phone call being intruder traffic.)

Evaluation
• Objective measures
• Measure ability to identify known anomalous instances
• Measure ability to identify injected anomalous instances (e.g., random cases or profiled cases)
• Use internal measure to evaluate (e.g., density of clusters)
• Subjective measures
• Ad hoc inspection of identified anomalies
• Critical issue:
• Anomalies are by definition rare instances

Evaluation criteria
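When known or injected anomalies are available, the two standard error rates can be computed directly from labels. The sketch below is an illustration under assumptions: labels are encoded as 1 for anomaly and 0 for normal, the false alarm rate is the fraction of normal instances flagged as anomalous, and the misdetection rate is the fraction of true anomalies that were missed. The function name and the label vectors are hypothetical.

```python
def error_rates(y_true, y_pred):
    # y_true / y_pred: 1 = anomaly, 0 = normal.
    false_alarms = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    misses = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    normals = sum(1 for t in y_true if t == 0)
    anomalies = sum(1 for t in y_true if t == 1)
    # (false alarm rate, misdetection rate)
    return false_alarms / normals, misses / anomalies

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
false_alarm_rate, misdetection_rate = error_rates(y_true, y_pred)
print(false_alarm_rate, misdetection_rate)
```

Because anomalies are rare, the misdetection rate is estimated from very few instances and can change a lot with a single additional error.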
• False alarm rate (type I error)
• Misdetection rate (type II error)
• Neyman-Pearson criterion
• Minimize misdetection rate while false alarm rate is bounded
• Bayesian criterion
• Minimize a weighted sum of false alarm and misdetection rate

Announcements
• Homework 5: Due Monday Dec 6th by 4pm
• Next class:
• Data mining systems
• Top ten data mining mistakes
• Myths and pitfalls ...