SANJIV RANJAN DAS
DATA SCIENCE: THEORIES, MODELS, ALGORITHMS, AND ANALYTICS
S. R. DAS
Copyright © 2013, 2014, 2016 Sanjiv Ranjan Das
published by s. r. das
∼ sanjivdas/
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this book except in compliance
with the License. You may obtain a copy of the License at . Unless
required by applicable law or agreed to in writing, software distributed under the License is distributed on an “as
is” basis, without warranties or conditions of any kind, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
This printing, July 2016

THE FUTURE IS ALREADY HERE; IT’S JUST NOT VERY EVENLY DISTRIBUTED.
– WILLIAM GIBSON

THE PUBLIC IS MORE FAMILIAR WITH BAD DESIGN THAN GOOD DESIGN. IT IS, IN EFFECT, CONDITIONED TO PREFER BAD DESIGN, BECAUSE THAT IS WHAT IT LIVES WITH. THE NEW BECOMES THREATENING, THE OLD REASSURING.
– PAUL RAND

IT SEEMS THAT PERFECTION IS ATTAINED NOT WHEN THERE IS NOTHING LEFT TO ADD, BUT WHEN THERE IS NOTHING MORE TO REMOVE.
– ANTOINE DE SAINT-EXUPÉRY

... IN GOD WE TRUST, ALL OTHERS BRING DATA.
– WILLIAM EDWARDS DEMING

Acknowledgements: I am extremely grateful to the following friends, students, and readers (mutually non-exclusive) who offered me feedback
on these chapters. I am most grateful to John Heineke for his constant
feedback and continuous encouragement. All the following students
made helpful suggestions on the manuscript: Sumit Agarwal, Kevin
Aguilar, Sankalp Bansal, Sivan Bershan, Ali Burney, Monalisa Chati, JianWei Cheng, Chris Gadek, Karl Hennig, Pochang Hsu, Justin Ishikawa,
Ravi Jagannathan, Alice Yehjin Jun, Seoyoung Kim, Ram Kumar, Federico Morales, Antonio Piccolboni, Shaharyar Shaikh, Jean-Marc Soumet,
Rakesh Sountharajan, Greg Tseng, Dan Wong, Jeffrey Woo.

Contents

1 The Art of Data Science
1.1 Volume, Velocity, Variety
1.2 Machine Learning
1.3 Supervised and Unsupervised Learning
1.4 Predictions and Forecasts
1.5 Innovation and Experimentation
1.6 The Dark Side
1.6.1 Big Errors
1.6.2 Privacy
1.7 Theories, Models, Intuition, Causality, Prediction, Correlation

2 The Very Beginning: Got Math?
2.1 Exponentials, Logarithms, and Compounding
2.2 Normal Distribution
2.3 Poisson Distribution
2.4 Moments of a continuous random variable
2.5 Combining random variables
2.6 Vector Algebra
2.7 Statistical Regression
2.8 Diversification
2.9 Matrix Calculus
2.10 Matrix Equations

3 Open Source: Modeling in R
3.1 System Commands
3.2 Loading Data
3.3 Matrices
3.4 Descriptive Statistics
3.5 Higher-Order Moments
3.6 Quick Introduction to Brownian Motions with R
3.7 Estimation using maximum-likelihood
3.8 GARCH/ARCH Models
3.9 Introduction to Monte Carlo
3.10 Portfolio Computations in R
3.11 Finding the Optimal Portfolio
3.12 Root Solving
3.13 Regression
3.14 Heteroskedasticity
3.15 Auto-regressive models
3.16 Vector Auto-Regression
3.17 Logit
3.18 Probit
3.19 Solving Non-Linear Equations
3.20 Web-Enabling R Functions

4 MoRe: Data Handling and Other Useful Things
4.1 Data Extraction of stocks using quantmod
4.2 Using the merge function
4.3 Using the apply class of functions
4.4 Getting interest rate data from FRED
4.5 Cross-Sectional Data (an example)
4.6 Handling dates with lubridate
4.7 Using the data.table package
4.8 Another data set: Bay Area Bike Share data
4.9 Using the plyr package family

5 Being Mean with Variance: Markowitz Optimization
5.1 Quadratic (Markowitz) Problem
5.1.1 Solution in R
5.2 Solving the problem with the quadprog package
5.3 Tracing out the Efficient Frontier
5.4 Covariances of frontier portfolios: r_p, r_q
5.5 Combinations
5.6 Zero Covariance Portfolio
5.7 Portfolio Problems with Riskless Assets
5.8 Risk Budgeting

6 Learning from Experience: Bayes Theorem
6.1 Introduction
6.2 Bayes and Joint Probability Distributions
6.3 Correlated default (conditional default)
6.4 Continuous and More Formal Exposition
6.5 Bayes Nets
6.6 Bayes Rule in Marketing
6.7 Other Applications
6.7.1 Bayes Models in Credit Rating Transitions
6.7.2 Accounting Fraud
6.7.3 Bayes was a Reverend after all...

7 More than Words: Extracting Information from News
7.1 Prologue
7.2 Framework
7.3 Algorithms
7.3.1 Crawlers and Scrapers
7.3.2 Text Pre-processing
7.3.3 The tm package
7.3.4 Term Frequency - Inverse Document Frequency (TF-IDF)
7.3.5 Wordclouds
7.3.6 Regular Expressions
7.4 Extracting Data from Web Sources using APIs
7.4.1 Using Twitter
7.4.2 Using Facebook
7.4.3 Text processing, plain and simple
7.4.4 A Multipurpose Function to Extract Text
7.5 Text Classification
7.5.1 Bayes Classifier
7.5.2 Support Vector Machines
7.5.3 Word Count Classifiers
7.5.4 Vector Distance Classifier
7.5.5 Discriminant-Based Classifier
7.5.6 Adjective-Adverb Classifier
7.5.7 Scoring Optimism and Pessimism
7.5.8 Voting among Classifiers
7.5.9 Ambiguity Filters
7.6 Metrics
7.6.1 Confusion Matrix
7.6.2 Precision and Recall
7.6.3 Accuracy
7.6.4 False Positives
7.6.5 Sentiment Error
7.6.6 Disagreement
7.6.7 Correlations
7.6.8 Aggregation Performance
7.6.9 Phase-Lag Metrics
7.6.10 Economic Significance
7.7 Grading Text
7.8 Text Summarization
7.9 Discussion
7.10 Appendix: Sample text from Bloomberg for summarization

8 Virulent Products: The Bass Model
8.1 Introduction
8.2 Historical Examples
8.3 The Basic Idea
8.4 Solving the Model
8.4.1 Symbolic math in R
8.5 Software
8.6 Calibration
8.7 Sales Peak
8.8 Notes

9 Extracting Dimensions: Discriminant and Factor Analysis
9.1 Overview
9.2 Discriminant Analysis
9.2.1 Notation and assumptions
9.2.2 Discriminant Function
9.2.3 How good is the discriminant function?
9.2.4 Caveats
9.2.5 Implementation using R
9.2.6 Confusion Matrix
9.2.7 Multiple groups
9.3 Eigen Systems
9.4 Factor Analysis
9.4.1 Notation
9.4.2 The Idea
9.4.3 Principal Components Analysis (PCA)
9.4.4 Application to Treasury Yield Curves
9.4.5 Application: Risk Parity and Risk Disparity
9.4.6 Difference between PCA and FA
9.4.7 Factor Rotation
9.4.8 Using the factor analysis function

10 Bidding it Up: Auctions
10.1 Theory
10.1.1 Overview
10.1.2 Auction types
10.1.3 Value Determination
10.1.4 Bidder Types
10.1.5 Benchmark Model (BM)
10.2 Auction Math
10.2.1 Optimization by bidders
10.2.2 Example
10.3 Treasury Auctions
10.3.1 DPA or UPA?
10.4 Mechanism Design
10.4.1 Collusion
10.4.2 Clicks (Advertising Auctions)
10.4.3 Next Price Auctions
10.4.4 Laddered Auction

11 Truncate and Estimate: Limited Dependent Variables
11.1 Introduction
11.2 Logit
11.3 Probit
11.4 Analysis
11.4.1 Slopes
11.4.2 Maximum-Likelihood Estimation (MLE)
11.5 Multinomial Logit
11.6 Truncated Variables
11.6.1 Endogeneity
11.6.2 Example: Women in the Labor Market
11.6.3 Endogeneity – Some Theory to Wrap Up

12 Riding the Wave: Fourier Analysis
12.1 Introduction
12.2 Fourier Series
12.2.1 Basic stuff
12.2.2 The unit circle
12.2.3 Angular velocity
12.2.4 Fourier series
12.2.5 Radians
12.2.6 Solving for the coefficients
12.3 Complex Algebra
12.3.1 From Trig to Complex
12.3.2 Getting rid of a_0
12.3.3 Collapsing and Simplifying
12.4 Fourier Transform
12.4.1 Empirical Example
12.5 Application to Binomial Option Pricing
12.6 Application to probability functions
12.6.1 Characteristic functions
12.6.2 Finance application
12.6.3 Solving for the characteristic function
12.6.4 Computing the moments
12.6.5 Probability density function

13 Making Connections: Network Theory
13.1 Overview
13.2 Graph Theory
13.3 Features of Graphs
13.4 Searching Graphs
13.4.1 Depth First Search
13.4.2 Breadth-first-search
13.5 Strongly Connected Components
13.6 Dijkstra’s Shortest Path Algorithm
13.6.1 Plotting the network
13.7 Degree Distribution
13.8 Diameter
13.9 Fragility
13.10 Centrality
13.11 Communities
13.11.1 Modularity
13.12 Word of Mouth
13.13 Network Models of Systemic Risk
13.13.1 Systemic Score, Fragility, Centrality, Diameter
13.13.2 Risk Decomposition
13.13.3 Normalized Risk Score
13.13.4 Risk Increments
13.13.5 Criticality
13.13.6 Cross Risk
13.13.7 Risk Scaling
13.13.8 Too Big To Fail?
13.13.9 Application of the model to the banking network in India
13.14 Map of Science

14 Statistical Brains: Neural Networks
14.1 Overview
14.2 Nonlinear Regression
14.3 Perceptrons
14.4 Squashing Functions
14.5 How does the NN work?
14.5.1 Logit/Probit Model
14.5.2 Connection to hyperplanes
14.6 Feedback/Backpropagation
14.6.1 Extension to many perceptrons
14.7 Research Applications
14.7.1 Discovering Black-Scholes
14.7.2 Forecasting
14.8 Package neuralnet in R
14.9 Package nnet in R

15 Zero or One: Optimal Digital Portfolios
15.1 Modeling Digital Portfolios
15.2 Implementation in R
15.2.1 Basic recursion
15.2.2 Combining conditional distributions
15.3 Stochastic Dominance (SD)
15.4 Portfolio Characteristics
15.4.1 How many assets?
15.4.2 The impact of correlation
15.4.3 Uneven bets?
15.4.4 Mixing safe and risky assets
15.5 Conclusions

16 Against the Odds: Mathematics of Gambling
16.1 Introduction
16.1.1 Odds
16.1.2 Edge
16.1.3 Bookmakers
16.2 Kelly Criterion
16.2.1 Example
16.2.2 Deriving the Kelly Criterion
16.3 Entropy
16.3.1 Linking the Kelly Criterion to Entropy
16.3.2 Linking the Kelly criterion to portfolio optimization
16.3.3 Implementing day trading
16.4 Casino Games

17 In the Same Boat: Cluster Analysis and Prediction Trees
17.1 Introduction
17.2 Clustering using k-means
17.2.1 Example: Randomly generated data in kmeans
17.2.2 Example: Clustering of VC financing rounds
17.2.3 NCAA teams
17.3 Hierarchical Clustering
17.4 Prediction Trees
17.4.1 Classification Trees
17.4.2 The C4.5 Classifier
17.5 Regression Trees
17.5.1 Example: California Home Data

18 Bibliography

List of Figures

1.1 The Four Vs of Big Data.
1.2 Google Flu Trends. The figure shows the high correlation between flu incidence and searches about “flu” on Google. The orange line is actual US flu activity, and the blue line is the Google Flu Trends estimate.
1.3 Profiling can convert mass media into personal media.
1.4 If it’s free, you may be the product.
1.5 Extracting consumer’s surplus through profiling.
3.1 Single stock path plot simulated from a Brownian motion.
3.2 Multiple stock path plot simulated from a Brownian motion.
3.3 Systematic risk as the number of stocks in the portfolio increases.
3.4 HTML code for the Rcgi application.
3.5 R code for the Rcgi application.
4.1 Plots of the six stock series extracted from the web.
4.2 Plots of the correlation matrix of six stock series extracted from the web.
4.3 Regression of stock average returns against systematic risk (β).
4.4 Google Finance: the AAPL web page showing the URL which is needed to download the page.
4.5 Failed bank totals by year.
4.6 Rape totals by year.
4.7 Rape totals by county.
5.1 The Efficient Frontier
6.1 Bayes net showing the pathways of economic distress. There are three channels: a is the inducement of industry distress from economy distress; b is the inducement of firm distress directly from economy distress; c is the inducement of firm distress directly from industry distress.
6.2 Article from the Scientific American on Bayes’ Theorem.
7.1 The data and algorithms pyramids. Depicts the inverse relationship between data volume and algorithmic complexity.
7.2 Quantity of hourly postings on message boards after selected news releases. Source: Das, Martinez-Jerez and Tufano (2005).
7.3 Subjective evaluation of content of post-news release postings on message boards. The content is divided into opinions, facts, and questions. Source: Das, Martinez-Jerez and Tufano (2005).
7.4 Frequency of posting by message board participants.
7.5 Frequency of posting by day of week by message board participants.
7.6 Frequency of posting by segment of day by message board participants. We show the average number of messages per day in the top panel and the average number of characters per message in the bottom panel.
7.7 Example of application of word cloud to the bio data extracted from the web and stored in a Corpus.
7.8 Plot of stock series (upper graph) versus sentiment series (lower graph). The correlation between the series is high. The plot is based on messages from Yahoo! Finance and is for a single twenty-four hour period.
7.9 Phase-lag analysis. The left-side shows the eight canonical graph patterns that are derived from arrangements of the start, end, high, and low points of a time series. The right-side shows the leads and lags of patterns of the stock series versus the sentiment series. A positive value means that the stock series leads the sentiment series.
8.1 Actual versus Bass model predictions for VCRs.
8.2 Actual versus Bass model predictions for answering machines.
8.3 Example of the adoption rate: m = 100,000, p = 0.01 and q = 0.2.
8.4 Example of the adoption rate: m = 100,000, p = 0.01 and q = 0.2.
8.5 Computing the Bass model integral using WolframAlpha.
8.6 Bass model forecast of Apple Inc’s quarterly sales. The current sales are also overlaid in the plot.
8.7 Empirical adoption rates and parameters from the Bass paper.
8.8 Increase in peak time with q ↑
10.1 Probability density function for the Beta (a = 2, b = 4) distribution.
10.2 Revenue in the DPA and UPA auctions.
10.3 Treasury auction markups.
10.4 Bid-Ask Spread in the Auction.
13.1 Comparison of random and scale-free graphs. From Barabasi, Albert-Laszlo., and Eric Bonabeau (2003). “Scale-Free Networks,” Scientific American May, 50–59.
13.2 Microsoft academic search tool for co-authorship networks. See: . The top chart shows co-authors, the middle one shows citations, and the last one shows my Erdos number, i.e., the number of hops needed to be connected to Paul Erdos via my co-authors. My Erdos number is 3. Interestingly, I am a Finance academic, but my shortest path to Erdos is through Computer Science co-authors, another field in which I dabble.
13.3 Depth-first-search.
13.4 Depth-first search on a simple graph generated from a paired node list.
13.5 Breadth-first-search.
13.6 Strongly connected components. The upper graph shows the original network and the lower one shows the compressed network comprising only the SCCs. The algorithm to determine SCCs relies on two DFSs. Can you see a further SCC in the second graph? There should not be one.
13.7 Finding connected components on a graph.
13.8 Dijkstra’s algorithm.
13.9 Network for computation of shortest path algorithm
13.10 Plot using the Fruchterman-Rheingold and Circle layouts
13.11 Plot of the Erdos-Renyi random graph
13.12 Plot of the degree distribution of the Erdos-Renyi random graph
13.13 Interbank lending networks by year. The top panel shows 2005, and the bottom panel is for the years 2006-2009.
13.14 Community versus centrality
13.15 Banking network adjacency matrix and plot
13.16 Centrality for the 15 banks.
13.17 Risk Decompositions for the 15 banks.
13.18 Risk Increments for the 15 banks.
13.19 Criticality for the 15 banks.
13.20 Spillover effects.
13.21 How risk increases with connectivity of the network.
13.22 How risk increases with connectivity of the network.
13.23 Screens for selecting the relevant set of Indian FIs to construct the banking network.
13.24 Screens for the Indian FIs banking network. The upper plot shows the entire network. The lower plot shows the network when we mouse over the bank in the middle of the plot. Red lines show that the bank is impacted by the other banks, and blue lines depict that the bank impacts the others, in a Granger causal manner.
13.25 Screens for systemic risk metrics of the Indian FIs banking network. The top plot shows the current risk metrics, and the bottom plot shows the history from 2008.
13.26 The Map of Science.
14.1 A feed-forward multilayer neural network.
14.2 The neural net for the infert data set with two perceptrons in a single hidden layer.
15.1 Plot of the final outcome distribution for a digital portfolio with five assets of outcomes {5, 8, 4, 2, 1} all of equal probability.
15.2 Plot of the final outcome distribution for a digital portfolio with five assets of outcomes {5, 8, 4, 2, 1} with unconditional probability of success of {0.1, 0.2, 0.1, 0.05, 0.15}, respectively.
15.3 Plot of the difference in distribution for a digital portfolio with five assets when ρ = 0.75 minus that when ρ = 0.25. We use outcomes {5, 8, 4, 2, 1} with unconditional probability of success of {0.1, 0.2, 0.1, 0.05, 0.15}, respectively.
15.4 Distribution functions for returns from Bernoulli investments as the number of investments (n) increases. Using the recursion technique we computed the probability distribution of the portfolio payoff for four values of n = {25, 50, 75, 100}. The distribution function is plotted in the left panel. There are 4 plots, one for each n, and if we look at the bottom left of the plot, the leftmost line is for n = 100. The next line to the right is for n = 75, and so on. The right panel plots the value of $\int_0^u [G_{100}(x) - G_{25}(x)]\,dx$ for all u ∈ (0, 1), and confirms that it is always negative. The correlation parameter is ρ = 0.25.
15.5 Distribution functions for returns from Bernoulli investments as the correlation parameter (ρ²) increases. Using the recursion technique we computed the probability distribution of the portfolio payoff for four values of ρ = {0.09, 0.25, 0.49, 0.81} shown by the black, red, green and blue lines respectively. The distribution function is plotted in the left panel. The right panel plots the value of $\int_0^u [G_{\rho=0.09}(x) - G_{\rho=0.81}(x)]\,dx$ for all u ∈ (0, 1), and confirms that it is always negative.
16.1 Bankroll evolution under the Kelly rule. The top plot follows the Kelly criterion, but the other two deviate from it, by overbetting or underbetting the fraction given by Kelly. The variables are: odds are 4 to 1, implying a house probability of p = 0.2, own probability of winning is p* = 0.25.
16.2 See . The House Edge for various games. The edge is the same as −f in our notation. The standard deviation is that of the bankroll of $1 for one bet.
17.1 VC Style Clusters.
17.2 Two cluster example.
17.3 Five cluster example.
17.4 NCAA cluster example.
17.5 NCAA data, hierarchical cluster example.
17.6 NCAA data, hierarchical cluster example with clusters on the top two principal components.
17.7 Classification tree for the kyphosis data set.
17.8 Prediction tree for cars mileage.
17.9 California home pri...
