ISDS 2001 CH4

CHAPTER 4: DATA, TEXT, AND WEB MINING

I. OPENING VIGNETTE: HIGHMARK, INC. EMPLOYS DATA MINING TO MANAGE INSURANCE COSTS
   A. Problem
   B. Solution
   C. Results
   D. What we can learn from this vignette

II. DATA MINING CONCEPTS AND APPLICATIONS
   A. Introduction
      1. Tom Davenport (author of Competing on Analytics) argued that the latest strategic weapon for companies is ANALYTICAL DECISION MAKING. Companies like Amazon.com, Capital One, Marriott, and the Oakland A's have used analytics to gain a competitive edge by understanding their customers. Because of improvements in technology and decreased cost, database sizes have grown exponentially and the tools are available to analyze these data.
      2. Factors behind the sudden popularity of data mining:
         a. Reduction in the cost of data storage and processing, together with increased hardware capacity, has provided the ability to collect and accumulate data.
         b. With increased database capacities and the availability of analysis tools, many companies recognized that they have untapped data and the tools to analyze it.
         c. Consolidation in a data warehouse, with data both at the customer level and from various sources, gives the ability to analyze from a more complete view.
      3. Applications of data mining (also on page 141). Data mining is used to:
         a. identify successful therapies for illnesses and discover new drugs
         b. reduce fraudulent behavior (insurance claims and credit card usage)
         c. identify customer buying patterns
         d. reclaim profitable customers
         e. aid in market-basket analysis
         f. better target customers/clients
   B. Definitions, Characteristics, and Benefits
      1. Data mining is used to describe knowledge discovery in databases.
      2. Data mining uses statistical, mathematical, and other techniques to extract and identify useful information and subsequent knowledge from large databases.
      3. Data mining is also referred to as knowledge extraction, data archeology, data exploration, data dredging, and information harvesting.
   C. How Data Mining Works
      1. Data mining finds patterns and expresses them as listings of mathematical rules. Those rules can then be used for prediction or association in an attempt to aid in decision making.
      2. Data mining algorithms fall into FOUR broad categories:
         a. Classification
         b. Clustering
         c. Association
         d. Sequence Discovery
      3. Other data mining tools include:
         a. regression analysis
         b. time series
         c. visualization
      4. The term data mining is relatively new, but it has historical roots in traditional statistical analyses from the 1980s.
   D. Classification Analysis (page 144):
      1. Classification procedures are the most common of all data mining approaches.
      2. Classification involves identifying patterns of data as belonging to a certain category:
         a. Credit Approval: good or bad credit risk
         b. Store Location: good, moderate, bad
         c. Target Marketing: likely customer or no hope (you receive a credit card application in the mail because the credit card company has targeted you as likely to accept their application for a credit card)
         d. Fraud Detection: yes/no
         e. Telecommunications: likely or not likely to turn to another phone company
         f. Route or segmentation decisions: prioritize crashes as high, moderate, or no severity
         g. Any ADS examples covered in previous chapters
      3. Basic idea:
         a. For each observation in a data set you have values on an outcome/class variable (Y) of interest that represents various groups (Y = 0, 1, 2, 3, ...), and predictors (X1, X2, ..., Xp for p predictors).
         b. Using that data, develop a model/mathematical equation for predicting the outcomes (Y) based on the predictors.
            (The best model is the one that results in the highest predictive accuracy.)
         c. Use the model derived in part b to predict outcomes for new observations for which you do not know the outcomes (Ys). In other words, if you have an observation where the outcome of interest (Y) is unknown, you can predict the Y value based upon the known predictors.
            Example: Suppose I obtain ALL data from past ISDS 2000 students (assuming students randomly assign themselves to the class):
               (Y) Grade: A, B, C, D, F (coded as 1, 2, 3, 4, 5)
               (X1) High School GPA
               (X2) Current Overall College GPA
               (X3) ACT score
               (X4) # of Hours Completed before taking ISDS 2000
               (X5) # of Hours Worked per Week
            I would build a model on that data. Then, when a student registers for ISDS 2000 and asks me on the first day of class, "What grade do you think I will get in this class?", I can use their values for X1 through X5 to predict into which group that student belongs (A, B, C, D, or F).
      4. Based upon the characteristics of the data, various mathematical techniques are used to develop models for classification. These techniques fall into the following categories (page 145):
         a. Decision Trees: outcome (Y) is categorical and predictors (Xs) are categorical or numeric
         b. Statistical analyses such as:
            (1) Linear Discriminant Analysis (LDA): outcome (Y) is categorical and predictors (Xs) are numeric, each having a normal distribution and equal variances. (Sometimes referred to as a 0-1 Linear Regression.)
            (2) Logistic Regression Analysis (LOGIT): outcome (Y) is categorical and predictors (Xs) are categorical or numeric. (Same data conditions as Decision Trees, so you can choose to use either a Decision Tree or Logistic Regression.)
         c. Neural Networks
         d. Bayesian Classifiers
         e. Genetic Algorithms
         f. Rough Set Approach

III. Data Mining Project Processes (how to conduct a data mining analysis?)
   A data mining project follows a series of steps, and data mining practitioners have proposed several different approaches for managing and standardizing this process. Some proposed models include:
   1. CRISP-DM: Cross-Industry Standard Process for Data Mining. The steps are:
      a. Business Understanding and Data Understanding: There must be much discussion to understand the business environment and what business questions must be addressed in order to remain competitive. This goes hand-in-hand with determining what variables must be measured in order to quantify the process.
         Example: If you want to predict whether or not alcohol is involved in a crash, managers/policy makers should communicate with emergency personnel in order to recognize what factors seem to be associated with alcohol involvement. (Suppose that, after communicating, it seems that young adult males driving sports cars that crash into a tree at night on the weekend are associated with alcohol involvement. Then you should collect data on age, gender, vehicle type, number of vehicles involved in the crash, day of week, and time of day.) At that point, there also needs to be discussion of what data is available.
      b. Data preparation: collect data and enter information into a data format to make it available for use, edit, and save.
         Example: A crash is reported in paper report form, then entered into a database; data from different public safety agencies are standardized and edited. (A manual is provided that educates public safety agencies on how data must be recorded in the report.)
         (Note: Steps a and b can take as much as 60 percent of the time allotted for the data mining process.)
      c. Modeling: based upon the types of variables and the purpose of the analyses, a particular data mining procedure (classification, clustering, association, prediction) is selected for detecting patterns and relationships.
         That procedure is employed in order to find the best mathematical explanation of the patterns that exist. (Note: Many times, when modeling, the researcher finds that data must be edited or additional variables should be included. This requires revisiting steps a and b.)
      d. Evaluate the model to determine its effectiveness. In classification, for example, you want to make sure that your model (set of predictors) is the best model for being able to predict group membership. You further want to make sure your model (based upon your sample) is a "fairly good" representation of the relationships that exist in the population.
      e. Deploy the model. Once you determine the best model for describing the business process, you must use that model for making business decisions.
         Example: Suppose you are a Macy's marketing analyst targeting customers that have a Macy's credit card. You use data mining to come up with a model to help you determine to which customers you will mail a Macy's catalog. Your model indicates that the card-holding customers most likely to go to your store to make a purchase are females, ages 25-35, with a college degree, who own a home, have visited a Macy's store in the last 3 months, and have a Macy's credit card balance between $300 and $600. You would get your marketing department to send catalogs to those customers that fit the profile (and would continue to do that on a regular basis, say monthly). Keep in mind: you don't want to spend money on the catalog and postage for customers you don't have a real chance of getting into your store.
   2. DMAIC: stands for Define, Measure, Analyze, Improve, and Control, the Six Sigma methodology, here applied to data mining processes. This process is ordinarily utilized in manufacturing, service delivery, management, and other business activities that rely on eliminating defects, waste, and quality control problems. (See Figure 4.3.)
   3. SEMMA: stands for Sample, Explore, Modify, Model, and Assess, and was developed by the SAS Institute.

IV. Regression Analysis Example to Illustrate the Process
   A. Example: A sample of 34 chain stores is randomly selected for a test-market study of OmniPower Bars. Suppose you want to build a regression equation with the goal of predicting Number of Bars Sold (Y).
      1. Obtain your data: Price (X1) and Promotion (X2) for each store, along with Number of Bars Sold (Y).
      2. Select the appropriate model based upon criteria. The question is: Does a model with X1, X2, or both X1 and X2 perform best when predicting Y? Could it be that none of the predictors are good?

         Question: Is X1 a good predictor of Y? Or: Is there a relationship between X1 and Y?
         Simple Linear Regression: Ŷ = b0 + b1X1
         Regression Statistics: R Square 0.540; Adjusted R Square 0.526; Standard Error 864.946; Observations 34
         [Figure: scatter plot for the simple linear regression of Y on X1]
         The model is appropriate because the points are linear and have equal spread around the regression line, so you can test to see if X1 is a good predictor of Y.
         H0: β1 = 0 (X1 is not a good predictor of Y)
         H1: β1 ≠ 0 (X1 is a good predictor of Y)
         At the α = 0.05 level of significance, p < α, so reject H0 and conclude there is sufficient evidence that X1 is a "good" predictor of Y.
         Goodness of Fit = Adjusted R² = 0.526 (scale ranging from 0 to 1)
         Ŷ = 7512.348 - 56.714 X1
         b0 = 7512.348 means that if X1 = 0, the predicted number of bars sold is approximately 7512.
         b1 = -56.714 means that for every increase in price (by 1 penny), the number of bars sold will decrease by approximately 57.

         Question: Is X2 a good predictor of Y? Or: Is there a relationship between X2 and Y?
         Simple Linear Regression: Ŷ = b0 + b2X2
         Regression Statistics: R Square 0.286; Adjusted R Square 0.264; Standard Error 1077.572; Observations 34
         Coefficients: Intercept 1496.016 (standard error 483.979, t Stat 3.091); Promotion (X2) 4.128 (standard error 1.152)
         [Figure: scatter plot for the simple linear regression of Y on X2]
         The model is appropriate because the points are linear and have equal spread around the regression line, so you can test to see if X2 is a good predictor of Y.
         H0: β2 = 0 (X2 is not a good predictor of Y)
         H1: β2 ≠ 0 (X2 is a good predictor of Y)
         At the α = 0.05 level of significance, p < α, so reject H0 and conclude there is sufficient evidence that X2 is a "good" predictor of Y.
         Goodness of Fit = Adjusted R² = 0.264 (scale ranging from 0 to 1)
         Ŷ = 1496.016 + 4.128 X2
         b0 = 1496.016 means that if X2 = 0, the predicted number of bars sold is approximately 1496.
         b2 = 4.128 means that for every increase in the amount spent on promotions (by $1), the number of bars sold will increase by approximately 4.

         Question: Is the predictor set (X1 and X2) a good predictor of Y? Or: Is there a relationship between (X1 and X2) and Y?
         Regression Statistics: Observations 34
         Coefficients: Intercept 5837.521 (standard error 628.150); Price (X1) -53.217 (standard error 6.852); Promotion (X2) 3.613
         At the α = 0.05 level of significance, p < α for both X1 and X2, so there is sufficient evidence that BOTH X1 and X2 together are "good" predictors of Y.
         Goodness of Fit = Adjusted R² = 0.742 (scale ranging from 0 to 1)
         Ŷ = 5837.5208 - 53.2173 X1 + 3.6131 X2
      3. Evaluation: So which model do you select? ANS: The model with the largest adjusted R², making sure all predictors are significant.
         Conclusion: The best mathematical model that describes the number of OmniPower bars sold is one that uses price and amount spent on promotions. (Patterns: as price increases, the number sold decreases; as the amount spent on promotions increases, the number sold increases.)
      4. DEPLOYMENT: Use the mathematical model for prediction. Suppose you are opening a new store location. You want to predict the sales (number of bars sold) if you've set your price at 79 cents and allotted $400 for promotional expenditures.
         Ŷ = 5837.5208 - 53.2173 X1 + 3.6131 X2 = 5837.5208 - 53.2173(79) + 3.6131(400) = 3078.57 OmniPower Bars per month (the rounded coefficients shown give about 3078.59; the small difference is presumably rounding).

VI. Clustering (page 151): Cluster Analysis for Data Mining
   A. Introduction: Clustering partitions a collection of things (e.g., observations, customers, students, etc.) into natural groupings such that the members of each group have similar characteristics but the groups are different. (Note: In this analysis the groups are unknown.)
      (Example: Students may remember taking a career inventory survey; based on your responses to many questions, you were told, or put into a cluster indicating, what occupation would suit you best.)
      Other examples in book:
      1. Harry Potter: the Sorting Hat determines to which House (Gryffindor, Hufflepuff, Ravenclaw, and Slytherin) to assign first-year students at the Hogwarts School. http://sorting-hat.com/sorthatg.htm
      2. Seating guests at a wedding or social event
   B. The goal of clustering is to create groups so that the members of a group have maximum similarity (a lot in common) and the members across groups have minimal similarity (differ).
   C. Business Application: Market segmentation is an analysis that aids in dividing customers into groups based upon data descriptions (variables) so that you can target those groups with different advertising campaigns.
      1. Suppose you are targeting your most loyal customers. You could design a market segmentation questionnaire for an airline asking for demographic information and
         behavior items such as frequency of flying, how tickets were purchased, who the customer traveled with, cities flown to, where the customer sat, airlines flown, and money spent on airlines. (Your data may indicate, for example, that your most "frequent flyers" are males who fly alone in your first class section during the week, obtain tickets online, utilize rental cars, and stay at the Marriott Hotel. You may also find that your big money makers are "families with teenage children" traveling in your coach section during holidays to winter vacation destinations and staying at resort areas, or "honeymooners" traveling to big-ticket destinations from Saturday to Saturday with no children.)
      2. Gender, age, income, housing type, and education level are common demographic variables for clustering. For example, some brands are targeted only to women, others only to men. Music downloads tend to be targeted to the young, while hearing aids are targeted to the elderly. Education levels often define market segments. For instance, private elementary schools might define their target market as highly educated households containing women of childbearing age.
      3. Marriott International used data mining to analyze its set of customers and as a result created different hotel experiences:
         Marriott Suites: permanent vacationers
         Fairfield Inn: economy lodging
         Residence Inn: extended stay
         Courtyard by Marriott: business travelers

VII. Association
   A. Introduction: Analyses aimed at associations that establish relationships among the items (variables, or COLUMNS) within a given record.
   B. The goal is to create groups of VARIABLES that are related.
   C. Business Application: Market Basket Analysis in the retail business refers to research that provides the retailer with information to help understand the purchase behavior of a buyer. This information enables the retailer to understand the buyer's needs and modify the store's layout accordingly (i.e., product placement).
      Examples:
      1. Popular example in academics: A supermarket discovers through market basket analysis that customers who bought diapers often bought beer. They then placed diapers close to beer coolers and their sales increased dramatically; the explanation being that fathers who are sent out to buy diapers often buy a beer as well, as a reward.
      2. People who buy cold medicine frequently will also buy tissue. A lot of customers will go to the store just for milk, so milk is placed in the back of the store.
      3. During Thanksgiving, you will see Walmart displays with Jiffy cornbread mix, canned sweet potatoes, brown sugar, pecans, pie shells, and flour. (Use of cross-promotional programs)
      4. Cross-selling on the web: Amazon.com's use of "customers who bought book A also bought book B."

VIII. Text Mining (page 159)
   A. Introduction: Text mining is the application of data mining to text files in order to uncover "hidden" content.
   B. Applications
      1. An insurance company can group documents by common themes. For example, they may find that customers who have similar complaints eventually cancel their policies. Similarly, warranty claims, help desk calls, etc. can be analyzed to identify the most common problems and relevant responses.
      2. A company may mine data and discover that any customer that buys "tennis shoes" also buys "socks."
      3. Email
         a. Text mining can be applied to messages or e-mails to route them to the most appropriate party to process that message.
         b. Filters examine words in the subject line. A good subject line encourages people to open an email, but care must be taken to prevent filtering. According to SiteSell.com, maker of a free email marketing tool, words such as "free" are obvious triggers, but using that word with other trigger words such as "trial", "quote", "sample", etc.
            can put your spam score through the roof.
   C. Text mining is different from web search engines. Search engines use known relationships to find documents, whereas text mining aims to discover new patterns.
   D. How to Mine Text
      1. "Term extraction" is the most basic form of text mining. It maps information from unstructured data into a structured format.
      2. The simplest data structure is the feature vector, which is a weighted list of words.
      3. The most important words in the text are listed, along with their relative importance.
      4. As a result, a document is reduced to a list of terms and weights. The entire semantics of the document may not exist, but the key concepts are identified.
   E. Example: Fireman's Fund Insurance Company (Ellingsworth and Sullivan, 2003)
      1. Consider this statement obtained by Fireman's Fund Insurance Company:
         "The claimant is anxious to settle. Mentioned his attorney is willing to negotiate. Also willing to work with us on loss adjustment expenses (LAE) and calculating actual cash value. Unusually [...]"
      2. Becomes this: [Figure: the resulting list of extracted terms]
   F. To create a Feature Vector (there is software to do this for us):
      1. Eliminate commonly used words (the, and, other, ...). These are called stop-words.
      2. Replace words with their stems or roots (using stemming algorithms). Eliminate plurals and various conjugations (so the terms phoned, phones, and phoning would all be mapped to the word phone).
      3. Consider synonyms and phrases (ex: student and pupil should be grouped as the same term). Also be able to distinguish the same word having different meanings (ex: Microsoft Windows is different from the windows that you find on a house).
      4. Calculate the weights of the remaining terms based upon the frequency with which the word appears. Two common weighting measures are:
         a. The term frequency (tf factor) measures the number of times a word appears in a document.
         b. The inverse document frequency (idf factor) reflects how often the word appears across all documents in a set: the more documents a word appears in, the less it counts.
         c. Note: a large tf factor increases the weight, while frequent occurrence across all documents decreases the weight, because terms that occur frequently in all documents would be common words to the industry and not be considered important.
      5. Example: Suppose that after eliminating commonly used words, you have 28 words total left over with 20 being unique, where 14 words appear only once (all weighted 0.0357 = 1/28) and 6 "important" words appear more than once.

         Word        Frequency   tf Factor
         data        2           0.0714 = 2/28
         mining      4           0.1429 = 4/28
         structure   2           0.0714 = 2/28
         term        2           0.0714 = 2/28
         text        2           0.0714 = 2/28
         weight      2           0.0714 = 2/28
         Total       14          0.50 = 14/28

         Note: the "important" words comprise 50% of the total occurrences. Clearly, this document is about "text mining" and involves "text" and "data" with "terms," "structures," and "weight."
   G. Text Mining Systems must include:
      1. A system for handling various file formats (plain text, doc files, PDF, etc.)
      2. Components that process the documents and create data files that can then be mined.
      3. Data mining software that can provide results of clustering, classification, etc.

IX. Web Mining (page 163)
   A. Introduction:
      1. The Web is the biggest data/text repository.
      2. Examples of information found on the Web:
         a. whose home page is linked to which other pages
         b. how many people have hyperlinks to other websites on their own website
         c. how a particular site is organized
         d. tracking of visitors to a website: each search on a search engine, each click on a link, a transaction on an e-commerce site
      3. Understanding/analysis of this data helps in adding value to the visitor/customer of our website.
   B. Web Mining: the discovery and analysis of information from and about the web, usually obtained through web-based tools. (Note: the term web mining was first used by Etzioni in 1996 and is now the topic of many conferences, journal articles, and books.) Web mining takes three forms:
      1. Web Content Mining: extracts and uses the content found within the web pages.
         a. Web crawlers automatically "read" through the content, as in text mining, to determine the key concepts.
         b. This information can be used to enhance search engine capability.
         c. Turetken and Sharda (2004) perform an internet search using the Google search engine, use software that "reads" the top 100 documents (creates a feature vector for each), cluster similar documents, and present them in a snapshot graphical format.
      2. Web Structure Mining: extracts useful information from the analysis of links found in web documents.
         a. The popularity of a document can be determined by analyzing the links going to a document. (This is used in the page-rank algorithm used by the Google search engine.)
         b. Authority pages: those that are linked to by many hubs.
         c. Hubs: pages that point to many authority pages.
         d. See the application case in the text to see how both web content mining and web structure mining help to better understand how U.S. extremist groups are connected.
      3. Web Usage Mining: extracts and uses information that is generated through web page visits, traffic, transactions, etc.
         a. Data are collected on servers. For example, server logs contain a user's history of web activity.
         b. The data, known as clickstream data, provide a trail of the user's activities and show the user's browsing patterns: which sites are visited, which pages are accessed, how much time is spent on each site, etc.
         c. Web usage mining is the process of finding out what users are looking for on the internet. Examples:
            (1) Analysis may show that 60 percent of visitors who search for "hotels in Maui" had also searched for "airfares to Maui." (Note: this could be used to decide where to place online advertisements.)
            (2) If you knew that 70 percent of software downloads from your site occur between 7 and 11 pm (EST), you could plan for better technical support and bandwidth during that time. (Figure 4.6)
            (3) A registered user who visits Amazon.com is greeted by name (cookies: a text file written by a web site on the visitor's computer). They also present the user with a choice of products based upon previous purchases and use a recommender system (association analyses) to recommend products based upon similar users.

Sample Test Questions:
   A national restaurant chain is interested in determining a mathematical model for predicting their weekly sales based upon advertising expenditures. They take a random sample of store locations and collect data on weekly sales (Y) and amount spent on Radio Advertising (X1) and Newspaper Advertising (X2). They then conduct a data mining project aimed at determining the best set of predictors. Use the output below to answer questions 1 - 3:
   [Regression output for the candidate models: values not legible in this copy]
   1. Which of the following sets of predictors is best for predicting weekly sales? (use α = 0.01)
      a. Radio Advertising
      b. Newspaper Advertising
      c. Both Radio and Newspaper Advertising
      d. None of the predictors are effective
   2. Suppose both Radio and Newspaper expenditures are significant predictors of weekly
      sales. What are the predicted weekly sales (to the nearest cent) if the advertising department spends $100 per week on radio advertising and $100 per week on newspaper advertising?
      a. 2258.77
      b. 1916.16
      c. 2023.92
      d. 1896.83
   3. After testing the model for accuracy, you use the estimated regression equation to predict weekly sales. This part of the data mining process is known as
      a. Business Understanding
      b. Modeling
      c. Evaluation
      d. Deployment

   The data and results of a linear discriminant analysis for admissions (1 = YES, 0 = NO) into an MBA program using GMAT, GPA, and QGMAT as predictors are presented below. Use this information to answer questions 4 - 8:
   [Linear discriminant analysis output: values not legible in this copy]
   4. Suppose someone applying for the MBA program has GMAT = 540, GPA = 3.00, and QGMAT = 60. What is the value of LCF1?
      a. 176.27
      b. 162.42
      c. 159.93
      d. 145.85
   5. Suppose someone applying for the MBA program has GMAT = 540, GPA = 3.00, and QGMAT = 60. What is the value of LCF0?
      a. 159.93
      b. 162.42
      c. 176.27
      d. 145.85
   6. Suppose for a student applying for the MBA program, you determine that LCF1 = 210.38 and LCF0 = 195.71. Which of the following statements is true?
      a. You would admit the student because LCF1 > LCF0.
      b. You would deny the student because LCF1 > LCF0.
      c. You would admit the student because LCF1 < LCF0.
      d. You would deny the student because LCF1 < LCF0.
   7. Which of the following statements is correct for the output above?
      a. The proportion of correct classifications is 0.875.
      b. The number of students admitted and predicted to be admitted is 4.
      c. The number of students not admitted but predicted to be admitted is 1.
      d. All of the above statements are correct
      e. None of the above statements is correct
   8. Which of the following statements is correct?
      a. The predicted class for observation 2 is incorrect.
      b. The predicted class for observation 1 is correct.
      c. The predicted class for observation 3 is correct.
      d. All of the above statements are correct
      e. None of the above statements is correct

   The data below represent the admission decisions made for incoming freshmen by the Office of Admissions at LSU. Using that data, answer questions 9 and 10.
   [Admissions data table: values not legible in this copy]
   9. Which of the following statements best represents the admission decision rule?
      a. If HSGPA ≤ 2.9 or ACT ≤ 21 then ADMIT = NO
      b. If HSGPA ≤ 2.6 or ACT ≤ 20 then ADMIT = NO
      c. If HSGPA ≥ 2.9 or ACT ≥ 20 then ADMIT = YES
      d. If HSGPA ≥ 3.0 or ACT ≥ 21 then ADMIT = YES
      e. None of the above
   10. Suppose a student who has a High School GPA of 3.1 and an ACT score of 23 is applying for admission to LSU. Based upon the decision rule, which of the following decisions should be made?
      a. Admit the student
      b. Do not admit the student
   ...
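
The deployment step of the OmniPower regression example (section IV) can be checked with a few lines of code. The sketch below uses Python, which is a choice made here and not something the notes prescribe; the function name predict_bars_sold and the constant names are hypothetical. It plugs a price of 79 cents and $400 of promotion into the fitted equation, using the coefficients as printed in the notes. With those rounded coefficients the result is about 3078.59; the notes report 3078.57, presumably because the original software output carried unrounded coefficients (an assumption).

```python
# Coefficients of the two-predictor model as printed in the notes
# (rounded to four decimals there; the unrounded values are not available).
B0 = 5837.5208       # intercept
B_PRICE = -53.2173   # change in bars sold per 1-cent increase in price
B_PROMO = 3.6131     # change in bars sold per $1 of promotion spending


def predict_bars_sold(price_cents: float, promotion_dollars: float) -> float:
    """Hypothetical helper: evaluate Y-hat = b0 + b1*X1 + b2*X2."""
    return B0 + B_PRICE * price_cents + B_PROMO * promotion_dollars


if __name__ == "__main__":
    # New store: price set at 79 cents, $400 allotted to promotions.
    estimate = predict_bars_sold(price_cents=79, promotion_dollars=400)
    print(f"Predicted bars sold per month: {estimate:.2f}")
```

The same pattern would apply to sample question 2: substitute the coefficients from that question's regression output and the two $100 advertising amounts.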

This note was uploaded on 12/14/2010 for the course ISDS 2001 taught by Professor Herbert during the Fall '08 term at LSU.
