DSA_Book.pdf - SANJIV RANJAN DAS, DATA SCIENCE: THEORIES, MODELS, ALGORITHMS, AND ANALYTICS, S. R. DAS. Copyright © 2013, 2014, 2016 Sanjiv Ranjan

This preview shows page 1 out of 462 pages.

SANJIV RANJAN DAS

DATA SCIENCE: THEORIES, MODELS, ALGORITHMS, AND ANALYTICS

S. R. DAS

Copyright © 2013, 2014, 2016 Sanjiv Ranjan Das

published by s. r. das

∼ sanjivdas/

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this book except in compliance with the License. You may obtain a copy of the License at . Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “as is” basis, without warranties or conditions of any kind, either express or implied. See the License for the specific language governing permissions and limitations under the License.

This printing, July 2016

The future is already here; it’s just not very evenly distributed.
– William Gibson

The public is more familiar with bad design than good design. It is, in effect, conditioned to prefer bad design, because that is what it lives with. The new becomes threatening, the old reassuring.
– Paul Rand

It seems that perfection is attained not when there is nothing left to add, but when there is nothing more to remove.
– Antoine de Saint-Exupéry

. . . In God we trust, all others bring data.
– William Edwards Deming

Acknowledgements: I am extremely grateful to the following friends, students, and readers (mutually non-exclusive) who offered me feedback on these chapters. I am most grateful to John Heineke for his constant feedback and continuous encouragement. All the following students made helpful suggestions on the manuscript: Sumit Agarwal, Kevin Aguilar, Sankalp Bansal, Sivan Bershan, Ali Burney, Monalisa Chati, JianWei Cheng, Chris Gadek, Karl Hennig, Pochang Hsu, Justin Ishikawa, Ravi Jagannathan, Alice Yehjin Jun, Seoyoung Kim, Ram Kumar, Federico Morales, Antonio Piccolboni, Shaharyar Shaikh, Jean-Marc Soumet, Rakesh Sountharajan, Greg Tseng, Dan Wong, Jeffrey Woo.

Contents

1 The Art of Data Science
  1.1 Volume, Velocity, Variety
  1.2 Machine Learning
  1.3 Supervised and Unsupervised Learning
  1.4 Predictions and Forecasts
  1.5 Innovation and Experimentation
  1.6 The Dark Side
    1.6.1 Big Errors
    1.6.2 Privacy
  1.7 Theories, Models, Intuition, Causality, Prediction, Correlation

2 The Very Beginning: Got Math?
  2.1 Exponentials, Logarithms, and Compounding
  2.2 Normal Distribution
  2.3 Poisson Distribution
  2.4 Moments of a continuous random variable
  2.5 Combining random variables
  2.6 Vector Algebra
  2.7 Statistical Regression
  2.8 Diversification
  2.9 Matrix Calculus
  2.10 Matrix Equations

3 Open Source: Modeling in R
  3.1 System Commands
  3.2 Loading Data
  3.3 Matrices
  3.4 Descriptive Statistics
  3.5 Higher-Order Moments
  3.6 Quick Introduction to Brownian Motions with R
  3.7 Estimation using maximum-likelihood
  3.8 GARCH/ARCH Models
  3.9 Introduction to Monte Carlo
  3.10 Portfolio Computations in R
  3.11 Finding the Optimal Portfolio
  3.12 Root Solving
  3.13 Regression
  3.14 Heteroskedasticity
  3.15 Auto-regressive models
  3.16 Vector Auto-Regression
  3.17 Logit
  3.18 Probit
  3.19 Solving Non-Linear Equations
  3.20 Web-Enabling R Functions

4 MoRe: Data Handling and Other Useful Things
  4.1 Data Extraction of stocks using quantmod
  4.2 Using the merge function
  4.3 Using the apply class of functions
  4.4 Getting interest rate data from FRED
  4.5 Cross-Sectional Data (an example)
  4.6 Handling dates with lubridate
  4.7 Using the data.table package
  4.8 Another data set: Bay Area Bike Share data
  4.9 Using the plyr package family

5 Being Mean with Variance: Markowitz Optimization
  5.1 Quadratic (Markowitz) Problem
    5.1.1 Solution in R
  5.2 Solving the problem with the quadprog package
  5.3 Tracing out the Efficient Frontier
  5.4 Covariances of frontier portfolios: r_p, r_q
  5.5 Combinations
  5.6 Zero Covariance Portfolio
  5.7 Portfolio Problems with Riskless Assets
  5.8 Risk Budgeting

6 Learning from Experience: Bayes Theorem
  6.1 Introduction
  6.2 Bayes and Joint Probability Distributions
  6.3 Correlated default (conditional default)
  6.4 Continuous and More Formal Exposition
  6.5 Bayes Nets
  6.6 Bayes Rule in Marketing
  6.7 Other Applications
    6.7.1 Bayes Models in Credit Rating Transitions
    6.7.2 Accounting Fraud
    6.7.3 Bayes was a Reverend after all...

7 More than Words: Extracting Information from News
  7.1 Prologue
  7.2 Framework
  7.3 Algorithms
    7.3.1 Crawlers and Scrapers
    7.3.2 Text Pre-processing
    7.3.3 The tm package
    7.3.4 Term Frequency - Inverse Document Frequency (TF-IDF)
    7.3.5 Wordclouds
    7.3.6 Regular Expressions
  7.4 Extracting Data from Web Sources using APIs
    7.4.1 Using Twitter
    7.4.2 Using Facebook
    7.4.3 Text processing, plain and simple
    7.4.4 A Multipurpose Function to Extract Text
  7.5 Text Classification
    7.5.1 Bayes Classifier
    7.5.2 Support Vector Machines
    7.5.3 Word Count Classifiers
    7.5.4 Vector Distance Classifier
    7.5.5 Discriminant-Based Classifier
    7.5.6 Adjective-Adverb Classifier
    7.5.7 Scoring Optimism and Pessimism
    7.5.8 Voting among Classifiers
    7.5.9 Ambiguity Filters
  7.6 Metrics
    7.6.1 Confusion Matrix
    7.6.2 Precision and Recall
    7.6.3 Accuracy
    7.6.4 False Positives
    7.6.5 Sentiment Error
    7.6.6 Disagreement
    7.6.7 Correlations
    7.6.8 Aggregation Performance
    7.6.9 Phase-Lag Metrics
    7.6.10 Economic Significance
  7.7 Grading Text
  7.8 Text Summarization
  7.9 Discussion
  7.10 Appendix: Sample text from Bloomberg for summarization

8 Virulent Products: The Bass Model
  8.1 Introduction
  8.2 Historical Examples
  8.3 The Basic Idea
  8.4 Solving the Model
    8.4.1 Symbolic math in R
  8.5 Software
  8.6 Calibration
  8.7 Sales Peak
  8.8 Notes

9 Extracting Dimensions: Discriminant and Factor Analysis
  9.1 Overview
  9.2 Discriminant Analysis
    9.2.1 Notation and assumptions
    9.2.2 Discriminant Function
    9.2.3 How good is the discriminant function?
    9.2.4 Caveats
    9.2.5 Implementation using R
    9.2.6 Confusion Matrix
    9.2.7 Multiple groups
  9.3 Eigen Systems
  9.4 Factor Analysis
    9.4.1 Notation
    9.4.2 The Idea
    9.4.3 Principal Components Analysis (PCA)
    9.4.4 Application to Treasury Yield Curves
    9.4.5 Application: Risk Parity and Risk Disparity
    9.4.6 Difference between PCA and FA
    9.4.7 Factor Rotation
    9.4.8 Using the factor analysis function

10 Bidding it Up: Auctions
  10.1 Theory
    10.1.1 Overview
    10.1.2 Auction types
    10.1.3 Value Determination
    10.1.4 Bidder Types
    10.1.5 Benchmark Model (BM)
  10.2 Auction Math
    10.2.1 Optimization by bidders
    10.2.2 Example
  10.3 Treasury Auctions
    10.3.1 DPA or UPA?
  10.4 Mechanism Design
    10.4.1 Collusion
    10.4.2 Clicks (Advertising Auctions)
    10.4.3 Next Price Auctions
    10.4.4 Laddered Auction

11 Truncate and Estimate: Limited Dependent Variables
  11.1 Introduction
  11.2 Logit
  11.3 Probit
  11.4 Analysis
    11.4.1 Slopes
    11.4.2 Maximum-Likelihood Estimation (MLE)
  11.5 Multinomial Logit
  11.6 Truncated Variables
    11.6.1 Endogeneity
    11.6.2 Example: Women in the Labor Market
    11.6.3 Endogeneity – Some Theory to Wrap Up

12 Riding the Wave: Fourier Analysis
  12.1 Introduction
  12.2 Fourier Series
    12.2.1 Basic stuff
    12.2.2 The unit circle
    12.2.3 Angular velocity
    12.2.4 Fourier series
    12.2.5 Radians
    12.2.6 Solving for the coefficients
  12.3 Complex Algebra
    12.3.1 From Trig to Complex
    12.3.2 Getting rid of a_0
    12.3.3 Collapsing and Simplifying
  12.4 Fourier Transform
    12.4.1 Empirical Example
  12.5 Application to Binomial Option Pricing
  12.6 Application to probability functions
    12.6.1 Characteristic functions
    12.6.2 Finance application
    12.6.3 Solving for the characteristic function
    12.6.4 Computing the moments
    12.6.5 Probability density function

13 Making Connections: Network Theory
  13.1 Overview
  13.2 Graph Theory
  13.3 Features of Graphs
  13.4 Searching Graphs
    13.4.1 Depth First Search
    13.4.2 Breadth-first-search
  13.5 Strongly Connected Components
  13.6 Dijkstra’s Shortest Path Algorithm
    13.6.1 Plotting the network
  13.7 Degree Distribution
  13.8 Diameter
  13.9 Fragility
  13.10 Centrality
  13.11 Communities
    13.11.1 Modularity
  13.12 Word of Mouth
  13.13 Network Models of Systemic Risk
    13.13.1 Systemic Score, Fragility, Centrality, Diameter
    13.13.2 Risk Decomposition
    13.13.3 Normalized Risk Score
    13.13.4 Risk Increments
    13.13.5 Criticality
    13.13.6 Cross Risk
    13.13.7 Risk Scaling
    13.13.8 Too Big To Fail?
    13.13.9 Application of the model to the banking network in India
  13.14 Map of Science

14 Statistical Brains: Neural Networks
  14.1 Overview
  14.2 Nonlinear Regression
  14.3 Perceptrons
  14.4 Squashing Functions
  14.5 How does the NN work?
    14.5.1 Logit/Probit Model
    14.5.2 Connection to hyperplanes
  14.6 Feedback/Backpropagation
    14.6.1 Extension to many perceptrons
  14.7 Research Applications
    14.7.1 Discovering Black-Scholes
    14.7.2 Forecasting
  14.8 Package neuralnet in R
  14.9 Package nnet in R

15 Zero or One: Optimal Digital Portfolios
  15.1 Modeling Digital Portfolios
  15.2 Implementation in R
    15.2.1 Basic recursion
    15.2.2 Combining conditional distributions
  15.3 Stochastic Dominance (SD)
  15.4 Portfolio Characteristics
    15.4.1 How many assets?
    15.4.2 The impact of correlation
    15.4.3 Uneven bets?
    15.4.4 Mixing safe and risky assets
  15.5 Conclusions

16 Against the Odds: Mathematics of Gambling
  16.1 Introduction
    16.1.1 Odds
    16.1.2 Edge
    16.1.3 Bookmakers
  16.2 Kelly Criterion
    16.2.1 Example
    16.2.2 Deriving the Kelly Criterion
  16.3 Entropy
    16.3.1 Linking the Kelly Criterion to Entropy
    16.3.2 Linking the Kelly criterion to portfolio optimization
    16.3.3 Implementing day trading
  16.4 Casino Games

17 In the Same Boat: Cluster Analysis and Prediction Trees
  17.1 Introduction
  17.2 Clustering using k-means
    17.2.1 Example: Randomly generated data in kmeans
    17.2.2 Example: Clustering of VC financing rounds
    17.2.3 NCAA teams
  17.3 Hierarchical Clustering
  17.4 Prediction Trees
    17.4.1 Classification Trees
    17.4.2 The C4.5 Classifier
  17.5 Regression Trees
    17.5.1 Example: California Home Data

18 Bibliography

List of Figures

1.1 The Four Vs of Big Data.
1.2 Google Flu Trends. The figure shows the high correlation between flu incidence and searches about “flu” on Google. The orange line is actual US flu activity, and the blue line is the Google Flu Trends estimate.
1.3 Profiling can convert mass media into personal media.
1.4 If it’s free, you may be the product.
1.5 Extracting consumer’s surplus through profiling.

3.1 Single stock path plot simulated from a Brownian motion.
3.2 Multiple stock path plot simulated from a Brownian motion.
3.3 Systematic risk as the number of stocks in the portfolio increases.
3.4 HTML code for the Rcgi application.
3.5 R code for the Rcgi application.

4.1 Plots of the six stock series extracted from the web.
4.2 Plots of the correlation matrix of six stock series extracted from the web.
4.3 Regression of stock average returns against systematic risk (β).
4.4 Google Finance: the AAPL web page showing the URL which is needed to download the page.
4.5 Failed bank totals by year.
4.6 Rape totals by year.
4.7 Rape totals by county.

5.1 The Efficient Frontier

6.1 Bayes net showing the pathways of economic distress. There are three channels: a is the inducement of industry distress from economy distress; b is the inducement of firm distress directly from economy distress; c is the inducement of firm distress directly from industry distress.
6.2 Article from the Scientific American on Bayes’ Theorem.

7.1 The data and algorithms pyramids. Depicts the inverse relationship between data volume and algorithmic complexity.
7.2 Quantity of hourly postings on message boards after selected news releases. Source: Das, Martinez-Jerez and Tufano (2005).
7.3 Subjective evaluation of content of post-news release postings on message boards. The content is divided into opinions, facts, and questions. Source: Das, Martinez-Jerez and Tufano (2005).
7.4 Frequency of posting by message board participants.
7.5 Frequency of posting by day of week by message board participants.
7.6 Frequency of posting by segment of day by message board participants. We show the average number of messages per day in the top panel and the average number of characters per message in the bottom panel.
7.7 Example of application of word cloud to the bio data extracted from the web and stored in a Corpus.
7.8 Plot of stock series (upper graph) versus sentiment series (lower graph). The correlation between the series is high. The plot is based on messages from Yahoo! Finance and is for a single twenty-four hour period.
7.9 Phase-lag analysis. The left side shows the eight canonical graph patterns that are derived from arrangements of the start, end, high, and low points of a time series. The right side shows the leads and lags of patterns of the stock series versus the sentiment series. A positive value means that the stock series leads the sentiment series.

8.1 Actual versus Bass model predictions for VCRs.
8.2 Actual versus Bass model predictions for answering machines.
8.3 Example of the adoption rate: m = 100,000, p = 0.01 and q = 0.2.
8.4 Example of the adoption rate: m = 100,000, p = 0.01 and q = 0.2.
8.5 Computing the Bass model integral using WolframAlpha.
8.6 Bass model forecast of Apple Inc’s quarterly sales. The current sales are also overlaid in the plot.
8.7 Empirical adoption rates and parameters from the Bass paper.
8.8 Increase in peak time with q ↑

10.1 Probability density function for the Beta (a = 2, b = 4) distribution.
10.2 Revenue in the DPA and UPA auctions.
10.3 Treasury auction markups.
10.4 Bid-Ask Spread in the Auction.

13.1 Comparison of random and scale-free graphs. From Barabasi, Albert-Laszlo, and Eric Bonabeau (2003). “Scale-Free Networks,” Scientific American May, 50–59.
13.2 Microsoft academic search tool for co-authorship networks. See: . The top chart shows co-authors, the middle one shows citations, and the last one shows my Erdos number, i.e., the number of hops needed to be connected to Paul Erdos via my co-authors. My Erdos number is 3. Interestingly, I am a Finance academic, but my shortest path to Erdos is through Computer Science co-authors, another field in which I dabble.
13.3 Depth-first-search.
13.4 Depth-first search on a simple graph generated from a paired node list.
13.5 Breadth-first-search.
13.6 Strongly connected components. The upper graph shows the original network and the lower one shows the compressed network comprising only the SCCs. The algorithm to determine SCCs relies on two DFSs. Can you see a further SCC in the second graph? There should not be one.
13.7 Finding connected components on a graph.
13.8 Dijkstra’s algorithm.
13.9 Network for computation of shortest path algorithm
13.10 Plot using the Fruchterman-Reingold and Circle layouts
13.11 Plot of the Erdos-Renyi random graph
13.12 Plot of the degree distribution of the Erdos-Renyi random graph
13.13 Interbank lending networks by year. The top panel shows 2005, and the bottom panel is for the years 2006-2009.
13.14 Community versus centrality
13.15 Banking network adjacency matrix and plot
13.16 Centrality for the 15 banks.
13.17 Risk Decompositions for the 15 banks.
13.18 Risk Increments for the 15 banks.
13.19 Criticality for the 15 banks.
13.20 Spillover effects.
13.21 How risk increases with connectivity of the network.
13.22 How risk increases with connectivity of the network.
13.23 Screens for selecting the relevant set of Indian FIs to construct the banking network.
13.24 Screens for the Indian FIs banking network. The upper plot shows the entire network. The lower plot shows the network when we mouse over the bank in the middle of the plot. Red lines show that the bank is impacted by the other banks, and blue lines depict that the bank impacts the others, in a Granger causal manner.
13.25 Screens for systemic risk metrics of the Indian FIs banking network. The top plot shows the current risk metrics, and the bottom plot shows the history from 2008.
13.26 The Map of Science.

14.1 A feed-forward multilayer neural network.
14.2 The neural net for the infert data set with two perceptrons in a single hidden layer.

15.1 Plot of the final outcome distribution for a digital portfolio with five assets of outcomes {5, 8, 4, 2, 1} all of equal probability.
15.2 Plot of the final outcome distribution for a digital portfolio with five assets of outcomes {5, 8, 4, 2, 1} with unconditional probability of success of {0.1, 0.2, 0.1, 0.05, 0.15}, respectively.
15.3 Plot of the difference in distribution for a digital portfolio with five assets when ρ = 0.75 minus that when ρ = 0.25. We use outcomes {5, 8, 4, 2, 1} with unconditional probability of success of {0.1, 0.2, 0.1, 0.05, 0.15}, respectively.
15.4 Distribution functions for returns from Bernoulli investments as the number of investments (n) increases. Using the recursion technique we computed the probability distribution of the portfolio payoff for four values of n = {25, 50, 75, 100}. The distribution function is plotted in the left panel. There are 4 plots, one for each n, and if we look at the bottom left of the plot, the leftmost line is for n = 100. The next line to the right is for n = 75, and so on. The right panel plots the value of $\int_0^u [G_{100}(x) - G_{25}(x)]\,dx$ for all u ∈ (0, 1), and confirms that it is always negative. The correlation parameter is ρ = 0.25.
15.5 Distribution functions for returns from Bernoulli investments as the correlation parameter ($\rho^2$) increases. Using the recursion technique we computed the probability distribution of the portfolio payoff for four values of ρ = {0.09, 0.25, 0.49, 0.81} shown by the black, red, green and blue lines respectively. The distribution function is plotted in the left panel. The right panel plots the value of $\int_0^u [G_{\rho=0.09}(x) - G_{\rho=0.81}(x)]\,dx$ for all u ∈ (0, 1), and confirms that it is always negative.

16.1 Bankroll evolution under the Kelly rule. The top plot follows the Kelly criterion, but the other two deviate from it, by overbetting or underbetting the fraction given by Kelly. The variables are: odds are 4 to 1, implying a house probability of p = 0.2, own probability of winning is p* = 0.25.
16.2 See . The House Edge for various games. The edge is the same as −f in our notation. The standard deviation is that of the bankroll of $1 for one bet.

17.1 VC Style Clusters.
17.2 Two cluster example.
17.3 Five cluster example.
17.4 NCAA cluster example.
17.5 NCAA data, hierarchical cluster example.
17.6 NCAA data, hierarchical cluster example with clusters on the top two principal components.
17.7 Classification tree for the kyphosis data set.
17.8 Prediction tree for cars mileage.
17.9 California home pri...