# Example dissimilarity between asymmetric binary


Example: Dissimilarity between Asymmetric Binary Variables

• Gender is a symmetric attribute and is not counted in the distance.
• The remaining attributes are asymmetric binary.
• Let the values Y and P be 1, and the value N be 0.
• Distance: computed pairwise among the three objects Jack, Mary, and Jim.
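The encoding and distance step can be sketched in Python. This is a minimal sketch, assuming the usual asymmetric binary dissimilarity d(i, j) = (r + s) / (q + r + s), where q counts attributes that are 1 in both objects and r and s count the mismatches; negative matches (0 in both) are ignored. The function names and the two records are hypothetical and are not the values from the slide's table.

```python
# Y and P map to 1, N maps to 0, as stated in the example.
ENCODING = {"Y": 1, "P": 1, "N": 0}


def encode(values):
    """Turn a list of Y/P/N codes into a list of 0/1 values."""
    return [ENCODING[v] for v in values]


def asymmetric_binary_dissimilarity(x, y):
    """Assumed formula: (r + s) / (q + r + s); 0/0 matches are not counted."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # 1 in both
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)  # 1 only in x
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)  # 1 only in y
    if q + r + s == 0:
        return 0.0  # no positive value in either object
    return (r + s) / (q + r + s)


# Hypothetical records over four asymmetric binary attributes
# (the symmetric attribute, gender, is left out on purpose).
patient_a = encode(["Y", "N", "P", "N"])
patient_b = encode(["Y", "P", "N", "N"])
d_ab = asymmetric_binary_dissimilarity(patient_a, patient_b)
```

Here the two records agree on one positive attribute and disagree on two, so `d_ab` is 2/3; the shared N in the last position does not lower the distance.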


Data Mining: Exploratory Data Analysis

## Proximity Measure for Categorical Attributes

• Categorical data are also called nominal attributes. Example: color (red, yellow, blue, green), profession, etc.
• Method 1: Simple matching, where $m$ is the number of matches and $p$ is the total number of variables:

$$d(i, j) = \frac{p - m}{p}$$

• Method 2: Use a large number of binary attributes, creating a new binary attribute for each of the $M$ nominal states.

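Both methods can be sketched in a few lines of Python. This is an illustrative sketch, assuming the simple matching dissimilarity is (p - m) / p; the function names and sample objects are hypothetical.

```python
def simple_matching_dissimilarity(x, y):
    """Method 1: (p - m) / p, with m matches out of p nominal variables."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of variables")
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)
    return (p - m) / p


def one_hot(value, states):
    """Method 2: one new binary attribute per nominal state."""
    return [1 if value == state else 0 for state in states]


# Hypothetical objects described by three nominal variables.
obj_i = ["red", "teacher", "single"]
obj_j = ["red", "engineer", "single"]
d_ij = simple_matching_dissimilarity(obj_i, obj_j)

COLORS = ["red", "yellow", "blue", "green"]
blue_bits = one_hot("blue", COLORS)
```

The two objects match on two of three variables, so `d_ij` is 1/3, and `blue_bits` is `[0, 0, 1, 0]`. After one-hot encoding, the binary dissimilarity measures can be applied to the new attributes.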
## Ordinal Variables

• An ordinal variable can be discrete or continuous.
• Order is important, e.g., rank (freshman, sophomore, junior, senior).
• Ordinal variables can be treated like interval-scaled variables.
• Replace each ordinal value $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$.
• Map the range of each variable onto $[0, 1]$ by replacing the value of the $i$-th object in the $f$-th variable by

$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

• Example: freshman: 0; sophomore: 1/3; junior: 2/3; senior: 1. Then d(freshman, senior) = 1 and d(junior, senior) = 1/3.
• Compute the dissimilarity using methods for interval-scaled variables.
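The rank mapping can be checked against the freshman-to-senior example with a short Python sketch. The function name is hypothetical, and the absolute difference of the mapped values stands in for the interval-scaled distance on a single variable.

```python
def ordinal_to_unit_interval(value, ordered_states):
    """Map an ordinal value to (r - 1) / (M - 1), where r is its 1-based rank."""
    m_states = len(ordered_states)
    if m_states < 2:
        raise ValueError("need at least two ordered states")
    rank = ordered_states.index(value) + 1  # ranks run from 1 to M
    return (rank - 1) / (m_states - 1)


YEARS = ["freshman", "sophomore", "junior", "senior"]
z = {year: ordinal_to_unit_interval(year, YEARS) for year in YEARS}

# One-variable distances, computed as for an interval-scaled value.
d_freshman_senior = abs(z["freshman"] - z["senior"])
d_junior_senior = abs(z["junior"] - z["senior"])
```

This reproduces the values in the example: the mapped values are 0, 1/3, 2/3, and 1, so `d_freshman_senior` is 1 and `d_junior_senior` is 1/3 (up to floating-point rounding).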


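## Attributes of Mixed Type

• A dataset may contain all attribute types: nominal, symmetric binary, asymmetric binary, numeric, and ordinal.
• One may use a weighted formula to combine their effects:

$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$

where the indicator $\delta_{ij}^{(f)}$ is 0 if $x_{if}$ or $x_{jf}$ is missing, or if $x_{if} = x_{jf} = 0$ and $f$ is asymmetric binary, and 1 otherwise.

• If $f$ is numeric: use the normalized distance, $d_{ij}^{(f)} = \dfrac{|x_{if} - x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}}$.
• If $f$ is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$; otherwise $d_{ij}^{(f)} = 1$.
• If $f$ is ordinal: compute the ranks $r_{if}$ and $z_{if} = \dfrac{r_{if} - 1}{M_f - 1}$, then treat $z_{if}$ as interval-scaled.
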
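A combined measure along these lines might look like the Python sketch below. It is only a sketch under stated assumptions: each attribute contributes with weight 1 unless a value is missing or both values are 0 on an asymmetric binary attribute, numeric attributes are normalized by a known range, and ordinal attributes use the rank mapping onto [0, 1]. The schema format, function name, and sample objects are hypothetical.

```python
def mixed_dissimilarity(x, y, schema):
    """Average the per-attribute dissimilarities over the attributes that count.

    schema holds one (kind, info) pair per attribute:
      ("numeric", (low, high))    -> |a - b| / (high - low)
      ("ordinal", ordered_states) -> |z_a - z_b| after rank mapping to [0, 1]
      ("nominal", None), ("symmetric", None), ("asymmetric", None)
                                  -> 0 if equal, 1 otherwise
    """
    total = 0.0
    counted = 0
    for a, b, (kind, info) in zip(x, y, schema):
        if a is None or b is None:
            continue  # missing value: attribute gets weight 0
        if kind == "asymmetric" and a == 0 and b == 0:
            continue  # negative match on an asymmetric binary attribute
        if kind == "numeric":
            low, high = info
            d = abs(a - b) / (high - low)
        elif kind == "ordinal":
            top = len(info) - 1
            d = abs(info.index(a) - info.index(b)) / top
        else:
            d = 0.0 if a == b else 1.0
        total += d
        counted += 1
    return total / counted if counted else 0.0


# Hypothetical schema and objects.
SCHEMA = [
    ("numeric", (0, 100)),
    ("nominal", None),
    ("ordinal", ["freshman", "sophomore", "junior", "senior"]),
    ("asymmetric", None),
]
obj_i = [30, "red", "junior", 0]
obj_j = [80, "blue", "senior", 0]
d_ij = mixed_dissimilarity(obj_i, obj_j, SCHEMA)
```

For these two objects the numeric attribute contributes 0.5, the nominal attribute 1, the ordinal attribute 1/3, and the asymmetric binary attribute is skipped, so `d_ij` is (0.5 + 1 + 1/3) / 3, about 0.61.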
## Cosine Similarity of Two Vectors

• A document can be represented by a bag of terms or a long vector, with each attribute recording the frequency of a particular term (such as a word, keyword, or phrase) in the document.
• Other vector objects: gene features in microarrays.
• Applications: information retrieval, biological taxonomy, gene feature mapping, etc.
• Cosine measure: if $d_1$ and $d_2$ are two vectors (e.g., term-frequency vectors), then

$$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$$

where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of vector $d$.
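The bag-of-terms representation and the cosine measure can be sketched in Python as follows. The vocabulary, the toy documents, and the function names are hypothetical; tokenization is a plain whitespace split, which a real retrieval system would likely replace with something more careful.

```python
import math
from collections import Counter


def term_frequency_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def cosine_similarity(d1, d2):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm1 * norm2)


VOCABULARY = ["data", "mining", "team", "coach"]
vec_a = term_frequency_vector("data mining data", VOCABULARY)
vec_b = term_frequency_vector("team coach team", VOCABULARY)
vec_c = term_frequency_vector("mining data", VOCABULARY)

sim_ab = cosine_similarity(vec_a, vec_b)  # no shared terms
sim_ac = cosine_similarity(vec_a, vec_c)  # shared terms
```

Documents with no terms in common get similarity 0, while `vec_a` and `vec_c` share both of their terms and score close to 1 despite having different lengths.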


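## Example: Calculating Cosine Similarity

$$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$$

where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of vector $d$.

Ex: Find the similarity between documents 1 and 2, with $d_1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)$ and $d_2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)$.

First, calculate the vector dot product:

$$d_1 \cdot d_2 = 5 \cdot 3 + 0 \cdot 0 + 3 \cdot 2 + 0 \cdot 0 + 2 \cdot 1 + 0 \cdot 1 + 0 \cdot 0 + 2 \cdot 1 + 0 \cdot 0 + 0 \cdot 1 = 25$$

Then, calculate $\|d_1\|$ and $\|d_2\|$:

$$\|d_1\| = \sqrt{5^2 + 0^2 + 3^2 + 0^2 + 2^2 + 0^2 + 0^2 + 2^2 + 0^2 + 0^2} = \sqrt{42} \approx 6.48$$

$$\|d_2\| = \sqrt{3^2 + 0^2 + 2^2 + 0^2 + 1^2 + 1^2 + 0^2 + 1^2 + 0^2 + 1^2} = \sqrt{17} \approx 4.12$$

Calculate the cosine similarity:

$$\cos(d_1, d_2) = \frac{25}{6.48 \times 4.12} \approx 0.94$$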
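The arithmetic can be verified with a few lines of Python. The two term-frequency vectors below are an assumption: they are the ones from the standard textbook version of this example, and they reproduce the lengths 6.48 and 4.12 quoted in the slide.

```python
import math

# Assumed term-frequency vectors for documents 1 and 2.
d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

# Step 1: vector dot product.
dot = sum(a * b for a, b in zip(d1, d2))

# Step 2: vector lengths.
norm1 = math.sqrt(sum(a * a for a in d1))
norm2 = math.sqrt(sum(b * b for b in d2))

# Step 3: cosine similarity.
cos_sim = dot / (norm1 * norm2)

print(dot, round(norm1, 2), round(norm2, 2), round(cos_sim, 2))
# prints: 25 6.48 4.12 0.94
```

A value of about 0.94 indicates that the two documents use the shared terms in nearly the same proportions.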
## Summary

Basic data descriptions (e.g., measures of central tendency and measures of dispersion) and graphic statistical displays (e.g., quantile plots, histograms, and scatter plots) provide valuable insight into the


• Winter '18
• nour
