Problem 1. A database of 10 million credit card transactions contains 1% fraud cases; the remaining 99% of the transactions are legitimate. A data miner studying fraud detection is using a random sample of 20,000 transactions. For this purpose, would a simple random sample or a stratified random sample be best? Explain why and discuss how the sample you think is best would be chosen.

Problem 2. Data Preparation and Exploration
The Excel file Baseball.xls contains data on baseball salaries based on performance. The text file Baseball.txt describes the raw data. Use the SAS EM/Insight software as the exploratory DM platform.
(a) List each variable together with its model role, measurement scale and type. Scan the data for missing values. Are there any?
(b) Plot a histogram of player salaries. Does the salary distribution appear to be skewed? Discuss your answer. Repeat the same exercise for the variable RBI. Plot four histograms for player statistics of your choice in one plot and discuss your observations.
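For Problem 1, the mechanical difference between the two sampling schemes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a model answer. The transaction IDs are hypothetical: IDs below `N_FRAUD` stand for fraud cases, and the function names are invented for this sketch. A simple random sample leaves the number of fraud cases to chance (about 200 expected out of 20,000), while a stratified sample fixes the count drawn from each stratum in advance.

```python
import random

# Figures taken from Problem 1; the ID layout is an assumption for illustration.
N_TOTAL = 10_000_000        # transactions in the database
N_FRAUD = N_TOTAL // 100    # 1% fraud -> 100,000 cases, given IDs 0..N_FRAUD-1
SAMPLE_SIZE = 20_000


def simple_random_sample(rng):
    """Draw SAMPLE_SIZE distinct IDs from the whole population, ignoring strata."""
    return rng.sample(range(N_TOTAL), SAMPLE_SIZE)


def stratified_sample(rng, fraud_share=0.01):
    """Draw a fixed number of IDs from each stratum (fraud, legitimate).

    fraud_share=0.01 gives proportional allocation; a larger value
    oversamples the rare fraud stratum.
    """
    n_fraud = round(SAMPLE_SIZE * fraud_share)
    fraud_ids = rng.sample(range(N_FRAUD), n_fraud)
    legit_ids = rng.sample(range(N_FRAUD, N_TOTAL), SAMPLE_SIZE - n_fraud)
    return fraud_ids + legit_ids


def count_fraud(ids):
    """Count how many sampled IDs fall in the fraud stratum."""
    return sum(1 for i in ids if i < N_FRAUD)


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
srs_counts = [count_fraud(simple_random_sample(rng)) for _ in range(20)]
strat_counts = [count_fraud(stratified_sample(rng)) for _ in range(20)]

print("fraud cases per simple random sample:", srs_counts)
print("fraud cases per stratified sample:   ", strat_counts)
```

Under proportional allocation every stratified draw contains exactly 200 fraud cases, whereas the simple random sample counts scatter around 200. Passing a larger `fraud_share` (for example 0.5) oversamples the fraud stratum; in that case any estimate for the full population would need to be reweighted by the true stratum proportions.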
This note was uploaded on 02/06/2011 for the course ORIE 474 taught by Professor Apanasovich during the Spring '07 term at Cornell University (Engineering School).