What is Machine Learning?

Many different forms of “Machine Learning”
  We focus on the problem of prediction
Want to make a prediction based on observations
  Vector X of m observed variables: <X_1, X_2, …, X_m>
    o X_1, X_2, …, X_m are called “input features/variables”
    o Also called “independent variables,” but this can be misleading!
        X_1, X_2, …, X_m need not be (and usually are not) independent
  Based on observed X, want to predict unseen variable Y
    o Y is called the “output feature/variable” (or the “dependent variable”)
  Seek to “learn” a function g(X) to predict Y:  Ŷ = g(X)
    o When Y is discrete, prediction of Y is called “classification”
    o When Y is continuous, prediction of Y is called “regression”

A (Very Short) List of Applications

Machine learning is widely used in many contexts
  Stock price prediction
    o Using economic indicators, predict if a stock will go up/down
  Computational biology and medical diagnosis
    o Predicting gene expression based on DNA
    o Determine likelihood of cancer using clinical/demographic data
  Predict people likely to purchase a product or click on an ad
    o “Based on past purchases, you might want to buy…”
  Credit card fraud and telephone fraud detection
    o Based on past purchases/phone calls, is a new one fraudulent?
        Saves companies billions(!) of dollars annually
  Spam e-mail detection (Gmail, Hotmail, many others)

What is Bayes Doing in My Mail Server?

This is spam:
Who was crazy enough to think of that?
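To tie the prediction setup to the spam example, here is a minimal sketch of a function g(X) used first as a classifier (discrete Y) and then as a regressor (continuous Y). The three features, the weights, and the cutoff of 3.0 are all hypothetical and chosen by hand for illustration; a real learner would estimate them from labeled data.

```python
from typing import Sequence

# Hypothetical feature vector X = <X_1, X_2, X_3> for one e-mail (illustration only):
#   X_1 = number of links, X_2 = 1 if the sender is unknown else 0,
#   X_3 = number of times the word "free" appears
WEIGHTS = (0.5, 2.0, 1.0)   # hand-picked, not learned
CUTOFF = 3.0                # hand-picked decision threshold


def g_regress(x: Sequence[float]) -> float:
    """Regression: the prediction Y-hat is continuous (a made-up 'spamminess' score)."""
    return sum(w * xi for w, xi in zip(WEIGHTS, x))


def g_classify(x: Sequence[float]) -> str:
    """Classification: the prediction Y-hat is discrete ("spam" or "ham")."""
    return "spam" if g_regress(x) >= CUTOFF else "ham"


x = [4, 1, 2]  # 4 links, unknown sender, "free" appears twice
print(g_regress(x))   # weighted sum of the features
print(g_classify(x))  # label obtained by thresholding that sum
```

Both functions share the same inputs X; only the type of the output Y decides whether the task is called classification or regression.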
Let’s get Bayesian on your spam:

Content analysis details: (49.5 hits, 7.0 required)

  0.9  RCVD_IN_PBL     RBL: Received via a relay in Spamhaus PBL [ listed in]
  1.5  URIBL_WS_SURBL  Contains an URL listed in the WS SURBL blocklist [URIs:]
  5.0  URIBL_JP_SURBL  Contains an URL listed in the JP SURBL blocklist [URIs:]
  5.0  URIBL_OB_SURBL  Contains an URL listed in the OB SURBL blocklist [URIs:]
  5.0  URIBL_SC_SURBL  Contains an URL listed in the SC SURBL blocklist [URIs:]
  2.0  URIBL_BLACK     Contains an URL listed in the URIBL blacklist [URIs:]
  8.0  BAYES_99        BODY: Bayesian spam probability is 99 to 100% [score: 1.0000]

Spam, Spam… Go Away!

The constant battle with spam
Source:
“And machine-learning algorithms developed to merge and rank large sets of Google search results allow us to combine hundreds of factors to classify spam.”
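The content-analysis report above combines many factors by adding per-rule scores and comparing the total with a required threshold. The sketch below reproduces that arithmetic for the rules visible in the excerpt. It assumes a message is flagged when the total is greater than or equal to the threshold; the function names are hypothetical. The listed rules sum to 27.4, below the reported 49.5 hits, so the excerpt presumably omits other rules that also fired.

```python
# (score, rule name) pairs copied from the report excerpt
RULE_HITS = [
    (0.9, "RCVD_IN_PBL"),
    (1.5, "URIBL_WS_SURBL"),
    (5.0, "URIBL_JP_SURBL"),
    (5.0, "URIBL_OB_SURBL"),
    (5.0, "URIBL_SC_SURBL"),
    (2.0, "URIBL_BLACK"),
    (8.0, "BAYES_99"),
]
REQUIRED = 7.0  # threshold from the report header ("7.0 required")


def total_score(hits):
    """Add up the scores of every rule that fired (rounded to one decimal)."""
    return round(sum(score for score, _ in hits), 1)


def is_spam(hits, required=REQUIRED):
    """Assumed decision rule: flag the message when total >= required."""
    return total_score(hits) >= required


print(total_score(RULE_HITS), is_spam(RULE_HITS))
```

Note that the Bayesian rule (BAYES_99) alone contributes 8.0 points, which already exceeds the 7.0 threshold under this decision rule.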
