Phil Simon of the W.P. Carey School of Business at Arizona State University explains Big Data, organizations using it, potential perils, and more.
We welcomed Professor Phil Simon of the W.P. Carey School of Business at Arizona State University to our office to chat about Big Data. He’s an award-winning author of 8 business books, a dynamic keynote speaker, an advisor to organizations and individuals on topics like management, tech, communication, etc., and a frequent contributor to media outlets including Wired, Harvard Business Review, and the New York Times.
Join us for an engaging and informative lecture!
Lecturer in Technology and Analytics, Arizona State University’s W. P. Carey School of Business
MILR (Master’s in Industrial and Labor Relations), BS in Policy and Management, additional studies in Political Science
Phil Simon: Hello, Course Hero. That’s a hell of an introduction. I’m more partial to Rush references than Van Halen, but there is a Van Halen reference. Does anyone know the brown M&M story? OK. That’s one of my faves.
All right. I’m here today to talk about Big Data. Here’s my plan of attack. A little bit of background about me at the beginning, but not too much. Talk a little bit about Big Data other than a buzzword. What the hell is it? Is it a big deal? Why or why not? Anyone think that it’s not a big deal? I’m just curious. OK. What tools and techniques are organizations using to make sense of all this stuff? What are some specific examples? Show me, don’t tell me, to paraphrase the Rush song. What are some of the perils of Big Data? Pay attention to the news. This stuff is powerful, but it’s also pretty scary. And then how to get started.
I’m a big fan of quotes. I don’t know if this qualifies as a meme, but does anyone know who this is?
Very good. He said this about Big Data. It’s one of my faves. [Onscreen text: “Big Data is like teenage sex: Everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.” —Dan Ariely] What was the under over? Four minutes before I started talking about teenage sex, who took that? But it’s true. A lot of companies are talking about it. There’s no shortage of talking heads out there. What does it mean? So my goal today is to answer that question, and hopefully inform, and I’ll take about 30 minutes of questions at the end.
All right. Before we get going, just a quick note on this. I’m going to be covering a lot of material, so this is much more about breadth than depth. You could certainly give a 90-minute talk just on the perils of Big Data, or just on techniques that organizations are using. Hell, I could talk for 30 minutes just about Netflix. Question number one: a little background for today’s talk. A quick note on my bona fides: Why should you listen to me?
I’ve written a bunch of books. People think I know a few things about data. Yeah, maybe a few. I teach full-time at ASU in the W.P. Carey School of Business—system design, analytics, business intelligence. And I’m a big fan of pop culture references, so don’t be surprised today if I mention a few things about Rush and/or Breaking Bad. Any Rush fans out there? Nice. Breaking Bad, Better Call Saul? Who saw the finale? So good.
All right. Today I’ll specifically be talking about Too Big to Ignore, but I’ll probably touch on my book on data viz and my most recent book, Analytics: The Agile Way.
First piece of trivia: Who’s this? Gordon Gekko. Very good. His most famous quote from Wall Street was, what? “Greed is good.” It’s actually “Greed, for lack of a better word, is good,” but there’s another particularly good quote from that movie, which is—what, 31 years old now? Damn. Time flies. “The most valuable commodity I know of is information.” Anyone disagree with that these days?
Let’s go back for a minute to prove the point. July 29, 2016. Anyone know what happened? A bit of an obscure question. But for the first time, the five most valuable companies in the world were arguably tech companies. Anyone want to guess number one? Bang, Apple. Number two? Close. Alphabet. Number three? Close. Microsoft, there you go. Four was Amazon, you got that right. Number five, Facebook, which did anyone see that, on John Oliver’s show the day their market cap dropped $120 billion? They dropped by the value of the world cheese market in one day. Interesting fact.
So data has never been more valuable. Well, let’s back that up with some data. Anyone know how much Microsoft paid for LinkedIn a couple of years ago? Close, 26. And you could argue that that was basically a data play. Does Microsoft have the resources to create its own professional social network? Of course they do. But LinkedIn had a tremendous amount of valuable data about people who might make decisions about buying. What does Microsoft sell? Enterprise software. OK, so maybe they overpaid, maybe they didn’t. Take away that data. What does LinkedIn really have for assets? Maybe some [inaudible] customers. It’s basically a data play.
This one’s obscure, but if you’re from the Valley, you might know who this guy is. Anyone know Mr. Barksdale? He was the first guy to take, was it three companies, public to more than a billion dollars? Michael Lewis wrote about him in The New New Thing. He’s got one of my favorite quotes that’s all about data. “If we’ve got data, let’s look at data. If all we have are opinions, let’s look at mine.”
So what is Big Data? Pretty good question to start with, right? I promise you I won’t hit you with quite so many references at the beginning, but anyone know who this guy is? Very good. Winston Churchill. I’ve Googled this probably a dozen times. He had to have said something like this, “Success begins with a common definition of terms,” but somehow I got that in my head and I just can’t get away from it.
Yeah. I don’t want to be politically incorrect, but was it one time a woman said, “You, sir, are ugly.” And he says, “In the morning, I will be sober and you will still be ugly.” I just ruined that quote. But anyhow. All right.
Be honest. When I say the word data, raise your hand if Excel is the first application that comes to your mind? Nobody. Seriously, being honest? OK, one person? All right, in my experience—and I spent about a decade as a software consultant—every time I talked about data or every time I was part of a software demo, the first question from the prospective client was, “Can we get this into Excel?” And I’d say, “Probably, but it’s not just about getting things into Excel.” I would argue that if we’re trying to understand Big Data, we have to get beyond Excel.
There’s nothing wrong with Excel. I use it just about every day. Or Google Sheets. But Excel suffers from a number of limitations. Early on I was able to meet with some folks in the data team. Anyone here at Course Hero run a regression? No? OK. What’s the limit for the number of independent variables that Excel can handle? I think it’s 16. Anyone think that Google or Uber or Netflix only uses 16 variables to figure out what they’re doing?
So Excel is great for handling structured data, but not so good at the unstructured stuff. And if you think about it, the vast majority of what we call Big Data is unstructured. I’m going to come back to what that means in a bit. But for now, try putting all those YouTube videos or photos in Excel and doing a pivot table or a sort. Is that going to work for you? Probably not.
So most of this stuff is unstructured. And the way I like to think of Big Data is in the form of categories. So data to me is this umbrella term. Where is the data? What do you mean by data? Certainly we’ve got the structured stuff. A list of customers, a list of students. It plays nice with SQL, structured query language. You want to do a pivot table, sorting, averages—knock yourself out. But that’s just one piece of what I consider to be Big Data.
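[Editor's illustration: a minimal sketch of what "plays nice with SQL" can look like, using Python's built-in sqlite3 module. The table, names, and GPA values are hypothetical and are not from the talk.]

```python
import sqlite3

# Hypothetical structured data: every row has the same named, typed columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, major TEXT, gpa REAL)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [
        ("Ana", "Analytics", 3.5),
        ("Ben", "Analytics", 3.0),
        ("Cy", "Finance", 4.0),
    ],
)

# Grouping, sorting, and averaging are short queries because the
# structure is known up front.
avg_by_major = conn.execute(
    "SELECT major, AVG(gpa) FROM students GROUP BY major ORDER BY major"
).fetchall()
print(avg_by_major)
```

There is no equivalent one-line query for "what is this video about?", which is roughly the gap between structured data and the other categories.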
Next up is semi-structured data, so it exhibits characteristics of structured and unstructured data. A couple of good examples here are, say, a medical record. Anyone here ever been to the doctor? So what’s your height, what’s your weight? Where were you born? What’s your Social Security number? Is that structured? Sure, what’s your name?
But: Patient presented with such-and-such symptoms. Is that structured? No. Or if you think email. I wrote a book on communication. The average person in the corporate world gets about 120 to 150 emails per day. And by the way, it’s rising at about 15 to 20 percent per year. Ouch. So, think about it: Email is semi-structured. The time of the email, the sender’s email address. Right? The date. OK, that’s structured. But what does the email contain? What about any attachments? That Adobe PDF, you’re going to put that in Excel? So that’s semi-structured information. The big stuff, though, is unstructured. Right? Videos, blog posts, photos. That is a tremendous amount of information.
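[Editor's illustration: a small sketch of why email counts as semi-structured, using Python's standard email package. The message, addresses, and subject are made up for the example.]

```python
from email import message_from_string

# A hypothetical message: headers first, a blank line, then free text.
raw = """\
From: pat@example.com
To: lee@example.com
Date: Mon, 01 Oct 2018 09:30:00 -0700
Subject: Q3 numbers

Hi Lee, the attached PDF has the draft. Sales looked soft in July
but I think the August promo turned things around. Thoughts?
"""

msg = message_from_string(raw)

# Structured part: named header fields with predictable formats.
structured = {key: msg[key] for key in ("From", "To", "Date", "Subject")}

# Unstructured part: free text that no column definition describes.
body = msg.get_payload()

print(structured)
print(len(body.split()), "words of free text")
```

The headers would drop neatly into a spreadsheet row; the body, and any attachment, would not.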
And then, finally, metadata, or data about data. People often misunderstand or underestimate the import of metadata. Remember a few years ago—was it 2013, give or take—when the Snowden scandal broke with PRISM, and former President Obama said, “We’re not tracking the data, just the metadata”? Think about what you could ascertain from just the metadata. If you called an abortion clinic for 20 minutes at a time, was it a wrong number or are you ordering a pizza?
So you can tell a lot from metadata, and in my class we actually go through an exercise and I tell students to find examples of interesting things that people have done with metadata. My favorite example so far: catching a serial killer who created a Word doc, and if you went under Properties, the person’s name was there. That’s how they caught a serial killer. So you can do a lot with metadata. So if you put all of this together—the structured stuff, the semi-structured stuff, the unstructured stuff, and the metadata—in my opinion, there’s your Big Data.
So let’s unpack this a bit more. Big Data, I would argue, is mostly unstructured. It takes a lot of structured data to compete with a 20-gig video, or a bunch of big photos. Big Data generally doesn’t play nice with particular tools designed for structured data. Again, try that pivot table on those photos and see how that works for you. I’d also argue that it’s largely external to the enterprise. A lot of valuable information you can pull from an API. LinkedIn, Twitter, Facebook, YouTube. You can’t stop someone from tweeting the same thing 15 or 20 times, but in an organization you probably have mechanisms in place for your CRM. What do you guys use, Salesforce? Gotcha. I’ll let that one go. But you probably have a mechanism in place to stop an employee from entering the same customer sale six times. And because other people are largely generating it and often machines are doing it now, as you think about the Internet of Things and increasingly cheap sensors, it’s fundamentally unmanageable. If you’re relying on other people, it’s tough to stop people from generating the same data or bad data.
Again, we’re finding that it’s increasingly generated by machines. I am amazed in Phoenix, every time I see one of those Waymo automatic driving cars flying around—they have them out here in San Francisco? What are they doing? They’re collecting information. It is astonishing to me when I think about this notion of preventive maintenance. The fact that we can have sensors on trucks that’ll tell us with some degree of certainty whether or not a truck is going to break down within the next two weeks. And if we knew that ahead of time, why don’t we get the brakes fixed now as opposed to being on the highway with a bunch of dairy and poultry in Phoenix in July?
So again, doesn’t play nice with SQL. Although to be fair, the way Hadoop is moving, and I’ll talk a bit about that later, tools are increasingly evolving to the point at which something I say today may not be true tomorrow.
This one’s a bit obscure. Anyone know who this is? This is Doug Laney. He’s a friend of mine. He’s a VP at Gartner and his new book is about infonomics. He is most famous though, arguably for coining the Three Vs of Data. Anyone heard of the Three Vs before? Nobody? Volume. We’re getting more and more data than ever. Velocity, it is coming at us faster and faster than ever. Twitter used to routinely break when it was written on Ruby on Rails because it couldn’t handle 36,000 tweets a minute because of a bad Monday Night Football call with the Packers and the Seahawks. So they had to rewrite it in Scala, otherwise you keep seeing the fail whale.
So data is streaming at us faster than ever. And finally, variety. Again, we don’t just think of data in terms of the structured stuff. We very much think of it in terms of multiple types. Now to be fair, a bunch of different vendors have tried to co-opt this and come up with other ones, like value. But for me, the Three Vs are good enough.
All right, anyone with me here? This guy, this one’s obscure. This stuff’s important because of something that John Wanamaker said a long time ago. He was an American merchant and one of the godfathers of American advertising. He famously said, “Half the money that I spend on advertising is wasted. I just don’t know which half.” Any debate now, about the two biggest players in digital advertising? What are they? Facebook. Google, Facebook. Ironically, Amazon is third and coming up fast.
So think about it. It is no longer difficult, if you wanted to figure out which one of your customers on, say, Facebook happens to be a 40-something guy who lives in the area who happens to like Rush and teaches at Arizona State. You can identify that person if he’s on Facebook, and up until a week ago he was. So this is a bit of context here.
Next up, is this stuff a big deal? Why or why not? No, not really. Of course it is, but again don’t just believe me. Anyone here mess around with Google Trends? Oh, that is a time suck. Google Trends is fun. This was Big Data, gosh, it was about 12 years ago, the earliest I can get it. [Shows slide] Anyone want to guess which way this goes? Boom. Now this is a relative index. I find it interesting that, at certain points, it took a bit of a dip. But for the most part this has gone up. More people are curious about what this stuff means.
But why now? Why have we only heard about this—my book came out, wow, time flies—about five and a half years ago. It was one of the first books on Big Data. Why is this stuff exploding? In my opinion, here are some reasons. Who here has not heard of Moore’s law? Anyone? The basic premise is that compute power doubles every 18 to 24 months. It’s been remarkably consistent, and that is the reason that on your iPhone you’ve got more compute power than the spaceship that put the first man on the moon in 1969. It’s been incredibly consistent, incredibly powerful. Most people, however, have not heard of Kryder’s law. Anyone with me here? This is often overlooked, but the premise is that the cost of data storage has declined exponentially. Anyone want to guess in 1980, how much it cost for one gig of storage?
Bang. Very good. Anyone want to guess which way this goes? Give him a prize. Now, this is a logarithmic scale. It goes down by powers of 10. If this were a normal scale, this would be imperceptible. So it is incredibly inexpensive to store stuff. This is why Dropbox, at least for ASU, offers unlimited storage. Remember when it was a big deal when Gmail offered—what was it, five gig of storage? So how can they do that? Boom, Kryder’s law. So all of this wouldn’t be possible. I can remember, in my consulting career, having to write queries against multiple databases because the company could not afford to keep everything in one environment. So I had to get creative with my programming chops and stitch things together. These days there’s absolutely no excuse for that. Cloud computing just intensifies that.
Why now? I don’t know—the explosion of the Internet and the World Wide Web. You know, we didn’t hear about this in 1995. One of my favorite stats here, when Bezos decided to quit D—I think it was D.E. Shaw, the hedge fund—and start Amazon, is that around that time the Internet was growing at 2800% per month. I don’t care how small it is, if it grows that quickly, it’s going to get big pretty quickly.
Smartphones. Again, we no longer have to go home to go, “Yeah, I should post that on Facebook.” Right? Or what about that video? We are constantly carrying around information. We can constantly generate information. Has anyone heard the term citizen journalist? Right? Arguably the Arab Spring does not happen without smartphones. Social media, another usual suspect here. We want to share information even if that means compromising our privacy. I’ll come back to that later.
Powerful and cheap sensors. Does anyone here have an Amazon Dash? No one knows? You know what this is? You do? What does it do? [Unintelligible from audience member] OK, so it’s a device. For those of you who don’t know, you can put them anywhere, but—for Tide, instead of taking out your phone, you just hit a button. Boom, it orders it. That doesn’t happen if sensors are terribly expensive, though Amazon has not shown any willingness to make money in many cases.
Development of powerful tools to handle this stuff. It’s wonderful for generating data, but if I can’t analyze it with traditional tools, how can I make sense of it? How can Netflix know what to serve up? How can Amazon? And this is true, patent, anticipatory commerce. In other words, they’re going to figure out what you want to buy before you buy it. They need tools to analyze this stuff. An Amazon recruiter came to ASU and asked one of our faculty members, when we’re talking about what we need to teach our students, “You think we run this place on spreadsheets? Of course not.”
So let’s talk about tools. What tools and techniques are organizations using to make sense of this stuff? Well, the elephant in the room, pun intended, is Hadoop. Has anyone heard of Hadoop before? So if you think about this, the way organizations have traditionally stored structured data—lists of employees; lists of paychecks; lists of customers, sales, orders—has been in a database. Now, there’s nothing wrong with doing that. That makes a lot of sense. Didn’t I pass the Oracle headquarters on my way here to work? The—Oracle makes a lot of money from selling databases. But a distributed file system takes that and flips it on its head. In a distributed file system, I could have a bunch of pieces of commodity hardware like Google does, figuring out how to get your search engine. You care which server at Google gives you the answer? No. Why would you?
So these systems work a lot faster. They also have fault tolerance. If a bunch of servers go down, other ones can quickly pick that up. We could talk for a long time about Hadoop, but I find it astonishing. It’s also open source, which means that developers are coming up with crazy ideas to take this in new directions. It’s difficult for me to believe that Hadoop would’ve caught on so fast if it weren’t open source.
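[Editor's illustration: a toy sketch of the fault-tolerance idea, in which each block of a file is copied to more than one machine so that losing a machine loses no data. This is not Hadoop's actual placement logic; the node names, block names, round-robin scheme, and the two-copy setting are all assumptions made for the example.]

```python
def place_blocks(blocks, nodes, replicas=2):
    """Assign each block to `replicas` distinct nodes, round-robin."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replicas)]
    return placement


def readable_blocks(placement, failed):
    """Return the blocks that still have at least one live copy."""
    return [b for b, holders in placement.items() if set(holders) - set(failed)]


nodes = ["node-a", "node-b", "node-c"]
blocks = ["block-0", "block-1", "block-2", "block-3"]
placement = place_blocks(blocks, nodes)

# With one machine down, every block still has a surviving copy.
print(readable_blocks(placement, failed={"node-b"}))

# With two machines down, the blocks stored only on those two are lost.
print(readable_blocks(placement, failed={"node-a", "node-b"}))
```

The second call shows the trade-off: more replicas survive more failures but cost more storage, which is easier to accept when storage is cheap.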
Remember, we don’t have a lot of successful case studies here. And if you’re a 55-year-old CIO, do you really want to fall on the sword on a Big Data project when your organization is already twice screwed up implementing a new CRM or ERP application? Probably not.
Let’s come back to SQL for a second. SQL, structured query language. IBM guy, I forget his name, invented it in—was it 1973? Incredible staying power. How many programming languages can you think of that are still in widespread use 45 years later? Anyone write in Fortran or COBOL programs here? I hope the hell not. So you still have generic SQL, and for certain types of data it works really well. But NoSQL, in their databases it stands for “not only SQL.” So something like Cassandra, it works with SQL but with additional things. But there are some SQL stalwarts out there, including some really smart cookies at MIT, who think that you can actually create a newer and better version of SQL. This could potentially change things. But again, there’s a lot of confusion out here.
I was having a discussion earlier with a few people, and to me Big Data is still very much like, say, teaching blockchain. Right? Or the World Wide Web circa 1998. What’s going to be the next Amazon? What’s going to be the next Pets.com or Kozmo? A lot of these things are still shaking out.
Anyone here use R? I think I mentioned that to a few people here. Again, open source. People are coming up with different ways to use it. You don’t need a license, although anytime you’re using open source software, they say, “Think free speech, not free . . . ” what? Beer, very good. Python and supplemental libraries. I’m actually teaching Python this week at ASU. I give myself maybe two or three out of 10. Some of my students have done ridiculous things with Python. Think about the ability to scrape data off the web with open libraries like Scrapy or Beautiful Soup, any of you guys use them here? Really powerful stuff.
Text analytics and text mining. I had this discussion with a few people here. If professors or students upload materials, it’s going to be unstructured. So what’s in there? Well, if I’m creating a PDF and a colleague of mine does the same thing, it’s not structured data. It is very difficult to determine what’s different using traditional tools, as opposed to tools like text analytics and text mining. This I’m sure you do here: A/B testing. Is there a company in the Valley that does not do this? I see in the library, you had Eric Ries’s The Lean Startup. Right? My favorite story from that book: He did A/B testing on the cover and title of his own book to prove to his publishers that this would sell more copies. Without slamming one of my old publishers, let’s just say that not every publisher is comfortable doing that. They don’t want the data to prove them wrong because, “I’ve worked in publishing for 25 years, dammit, and I know the right title for a book. Who cares what the data says?”
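[Editor's illustration: one common way to read an A/B test is a two-proportion z-test, sketched here with only Python's standard library. The visitor and purchase counts are hypothetical; the talk does not give figures for the book-cover test, and this is not presented as the method Ries used.]

```python
from math import erf, sqrt


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the assumption that A and B perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)


# Hypothetical numbers: cover A and cover B each shown to 5,000 visitors.
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference between the two covers is unlikely to be chance alone, which is the kind of evidence that can settle an argument with "25 years in publishing."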
Predictive analytics. There are entire books about this stuff, but not just describing what happens in the past, but predicting what’s going to happen in the future. It is amazing to me. Anyone ever use Google Autocomplete? Pretty interesting stuff.
But none of this matters if we can’t make decisions on it, and most of us are not data scientists. So I understand you guys use Tableau here. Has anyone messed around with D3 or some of the other tools? When I was researching my book on data visualization, The Visual Organization, I came across a fact that I probably will never forget. The average person can understand data in a visual format 60 to 60,000 times faster if it’s presented this way. Now it depends on the data, depends on the brain. But why are we trying to cram a square peg in a round hole with Excel?
Why not create something interactive? And I try to practice what I preach. So ASU gives me my student evaluations in a raw format. It doesn’t help me understand how I can be a better professor. So working with a bunch of students on an honors project that created an interactive database—you can go to my site and see it—I can actually see as a professor how I do by question, by semester, by course. I can use it as a learning tool. I can have a conversation with my own data. I don’t know why more professors don’t do that.
All right, onto the good stuff. What are some specific examples of organizations effectively using this stuff? To be fair, early on it was tough for me to find it. My PR firm back in 2013 got me a spot on CNBC about healthcare and Big Data with one caveat: I had to tell a story of a hospital that successfully used this. I’ve been googling since 1998. I was an early adopter. It took me about three hours, but I finally found an example of a hospital in, I think it was North Carolina, that identified potentially 200 people who might have diabetes based on the unstructured information in the medical record. So they sent out information to them, they gave them a call. “Not to alarm you, but we think there’s a decent chance that we may have missed your diabetes diagnosis.” Eighty percent of the people actually had it. Regardless of your politics—pretty good use of this stuff.
Going to higher ed. This is a great story. The—what just happened here? Whoops. What happened here? My bad. It says Georgia State, I think I screwed up my slide here. Oh, there we go. I think it was Georgia State—reduced student turnover by six percentage points. They looked at data to figure out which students were going to which classes, wound up saving them $12 million in tuition. Why isn’t every school doing that, particularly state schools facing a budget crisis?
This is one of my faves. Has anyone heard of Progressive Insurance and what they do? Well, the basic premise is this. They used to call it pay as you drive. Now they call it Snapshot. But if you want, you can put a device in your car that tracks how fast you’re driving, because historically insurance companies have provided quotes based on your age, your location, your driving history, maybe your gender. But does that necessarily tell the whole story? You can be an 18-year-old with a Pontiac Firebird in New Jersey and actually be a very safe driver. And you can be a 55-year-old grandmother who drives like a bat out of hell with a Volvo station wagon, right? Why don’t we let the data decide? Turns out that there’s actually a nice side benefit of this. I’m going to show a quick video here. This is pretty interesting.
Video: [News reporter] The booking photo of a father who always maintained his innocence. Now 28 years old, Michael Beard from Cleveland was arrested following the death of his daughter in May of 2011. Lynniah was only seven months old when her dad, a nursing assistant, found her not breathing in a baby swing in her mom’s West 122nd Street home. She died a few weeks later, and Beard was charged with suffocating her.
[Defense attorney] “We’re all sad that baby Lynniah died. There was no reason for that, but it was an accident. She had been told—Mom had been told to put the baby in the swing and let it sleep upright because it had acid reflux disease. But any baby of that size can slump down and not have the strength to lift its head up.”
[News reporter] Beard’s case just went to court, and his lawyer argued that Beard found his daughter and immediately rushed her to the hospital. On Monday, a jury agreed and found him not guilty. A key piece of evidence was a Snapshot device in Beard’s car that wirelessly transmits information for Progressive Insurance. In this case, it proved Beard was only in the house for three minutes, which wasn’t long enough to kill his daughter.
[Defense attorney] He said, “I can prove to you I was not there long enough to do what they’re accusing me of doing.”
[News reporter] Now according to the company spokesperson, it doesn’t work like your typical GPS system.
Phil Simon: Imagine. Imagine how horrible it is for your daughter to die and then to be put in jail because of it. There’s a positive side and the negative side of this, but that is a fascinating story to me.
A lot of companies you’ve probably heard of before, I mentioned before. Amazon is working on getting new things before you know that you want them. How? Someone sitting in a room and guessing? No, data. Companies looking at product reviews. Trying to figure out what you would like. Not just based on whether it’s one star or five star, but the content in that review. By the way, if you’re curious about why Amazon lets you post reviews even if you didn’t buy the book: because they found that more reviews, all things being equal, lead to more sales. Even if you’ve got polarized reviews—five star, one star—people want to buy the book to decide for themselves.
The company also tracks what you’re highlighting. Anyone read books on Kindle? So are there certain passages that resonate with you more than others? On the basis of that, maybe it’s recommending a certain book. And the “If it’s X, then Y” marketing program is insanely successful. For example, if you like my book—and how could you not?—maybe it’s going to recommend things that are books that other people have bought as well. Think about this. Does Barnes & Noble know its customers this well? Not even close. Isn’t Barnes & Noble looking for a buyer now after kicking out its CEO? No coincidence here. Much better data.
Next up, Netflix. They’re in San Jose, right? They’re a few miles from here? So this is off the charts. I actually spoke at Netflix headquarters because I was researching them for my book on data viz. It knows exactly what its 118-ish million subscribers are watching, and on which devices. They can make decisions for purchasing content based largely on data. Anyone know how much Amaz—I’m sorry, Netflix dropped on the first two seasons of House of Cards? 100 million, very good. They were confident that this was a bet, but they had the data to back up that bet.
So if you’re a fan of Fight Club, raise your hand. You all violated the first rule. You’re not supposed to talk about it. Who directed the first episode of House of Cards? David Fincher, the same director of Fight Club. So if you expressed an interest in Fight Club, you rated it, you watched it, whatever, you would get that message, “Directed by David Fincher of Fight Club.” But let’s say that you didn’t know who David Fincher was, but you loved Forrest Gump, starring Tom Hanks and Robin Wright Penn, and Sally Field as well. So Netflix would send the same customers a different message. If you liked Forrest Gump, you might like Robin Wright Penn in this as well.
Now Netflix makes these observations because it is responsible for a ridiculous percentage of weeknight Internet traffic. When I said it’s a third, that was four years ago. It’s up to 37 percent. Thirty-seven percent of nighttime US Internet traffic comes from Netflix. Is it any wonder that they were against efforts to repeal net neutrality? And by the way, as much data as Netflix generates—I mentioned this to a few people earlier—the company purchases third-party data and metadata from firms like Nielsen, and it also pays people to watch movies. I’m not kidding. You go through three days of training and they say, “We want you to evaluate whether a movie is suspenseful.” Why? Because at this point, artificial intelligence, machine learning cannot evaluate the suspensefulness of a movie.
On the basis of this—this blew my mind—Netflix is able to put movies into . . . you’re ready for this? 76,000 different subgenres of movies. How can it do that? Because it’s a hell of a lot more granular than “drama.” It’s . . . Middle Eastern dramas from the 1950s. They’re incredibly granular, and it’s working for Netflix.
Netflix extensively uses collaborative filtering technology. Even if you’re new to the service, it knows what people like you tend to like. And in some cases, I’m amazed at how something could have missed my radar for so long. It’s pretty damn good at that. Again, it purchases different data from third-party firms. So because of this, it can serve up different recommendations. So if I’m a fan of, say, Breaking Bad, then maybe I like Walking Dead. Maybe I like Better Call Saul. Speaking of Breaking Bad, my all-time favorite Netflix statistic: 50,000 people watched all of Season Three of Breaking Bad the day before Season Four premiered. Do the math on that: 13 episodes. Forty-two minutes each. Yes, it is that good.
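[Editor's illustration: a bare-bones sketch of user-based collaborative filtering, the general idea of recommending what similar viewers liked. It is not Netflix's system. The titles come from the talk, but the viewers and every rating are invented for the example, and scoring by similarity-weighted ratings is just one simple choice.]

```python
from math import sqrt

# Hypothetical 1-5 ratings; titles are from the talk, scores are made up.
ratings = {
    "viewer_1": {"Breaking Bad": 5, "Better Call Saul": 5, "House of Cards": 2},
    "viewer_2": {"Breaking Bad": 5, "Better Call Saul": 4, "Walking Dead": 4},
    "viewer_3": {"House of Cards": 5, "Forrest Gump": 4, "Breaking Bad": 1},
    "new_viewer": {"Breaking Bad": 5, "House of Cards": 1},
}


def cosine(a, b):
    """Cosine similarity between two rating dicts; unrated titles count as 0."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(target, ratings):
    """Rank titles the target has not rated by similarity-weighted ratings."""
    scores = {}
    for other, theirs in ratings.items():
        if other == target:
            continue
        sim = cosine(ratings[target], theirs)
        for title, score in theirs.items():
            if title not in ratings[target]:
                scores[title] = scores.get(title, 0.0) + sim * score
    return sorted(scores, key=scores.get, reverse=True)


print(recommend("new_viewer", ratings))
```

With these made-up ratings, the new viewer looks most like the two Breaking Bad fans, so their other favorites rise to the top of the list.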
I ask companies if they know their customers this well. Now, Big Data can certainly help a successful company, but can it turn around a struggling one? Who is this guy? This is Dennis Crowley. Ring a bell? No googling. Foursquare, founder and former CEO. Foursquare used to be a big deal. Its original business model—raise your hand if you were ever on Foursquare. Raise your hand if you’re still on Foursquare. I thought so. Early on, what was the original business model? Basically give away the app, let people check in, and then sell to local restaurant owners, bar owners, right, the ability to market to people. So very much small business. Did it work? What do you think? No. How many people use it? But again, don’t look at me. Look at Google.
Google Trends. At one point, Facebook—I’m sorry, Foursquare was actually a big deal. We got nowhere to go but down. So is this an opportunity for Foursquare? Maybe. Chipotle has been in the news. Why? Salmonella, E. coli, not a good thing. December 2015. You do not want to be featured in the New York Times with this headline. You do not want the CDC website to show places where E. coli has broken out. Why am I talking about this? Because there is an opportunity here for Foursquare. We know that E. coli is not going to help Chipotle’s stock. All of a sudden things break. Where do we go? Down. Not good if you’re a publicly traded company. So is there a means of possibly predicting that this would happen, other than things hitting the news? Turns out there may be. Let’s go back to Foursquare.
After a terrible quarter, an unlikely source got into the prediction business. A company with really not that much to lose. Anyone here speak Chinese? No? Yeah? What are the symbols here? Can’t read? OK. A couple of my Chinese students corrected me, but Google tells me otherwise. Evidently the symbol for danger and chaos is the same. So is there an opportunity here for Foursquare? In other words, could Foursquare predict the visiting—the customers’ patterns at Chipotle? And it turns out it actually could. So if you’re a struggling social media company like Foursquare with not a lot to lose, and you’re running out of money and you’re facing the dreaded Silicon Valley down round, which means you’re now valued at less than you were before, why not roll the dice? So what is Foursquare’s current business model? Are they still targeting small businesses? Nope.
Now they define themselves as follows: It’s a little bit of jargon in there for me, because I don’t exactly know what a location-based intelligence company is, but they really are targeting a different type of market. And this isn’t me, this is Twitter. Right here, as of a couple days ago. So what is their current business model? How are they trying to make money? Well, think about it. If they can’t make money as is, maybe there are some people willing to pay for potential stock volatility in publicly traded companies. So why would you target small businesses? What about hedge funds? What about investors? Would they want to know ahead of time that there could be a drop in sales? Think about it, this is just the high-tech version of what companies have been doing on each other for decades, right? Go to the parking lot of a department store on a weekend. If you don’t see a lot of cars, they’re probably not selling a lot of stuff. But you’re doing this in a high-tech way.
So for local intelligence, what if they could customize reports? What if they could figure out what’s happening by turning a bunch of users into data collectors for the companies? For local intelligence, put on your hats for a second. What could they do? Would law enforcement be interested in this? Possibly. Different government entities, so that they could predict crowds? What about taxi companies, after events let out? Potentially a market there for them. What about brick-and-mortar retailers with physical presence? Not just Chipotle, but what about the Walmarts of the world? Would this be of interest to some people? Would they be willing to pony up for information? What about other examples? Really, what do they have to lose?
Better yet, why wouldn’t you want to pay for this information, right? If you ran a hedge fund and you could spend 100 grand and you have a pretty big stake in a fast-food joint, could that payment justify itself? What other firms will be paying for this information? So the jury’s still out, but there is a potential here. In fact, signs are that it may be working. Right? What about people that are trying to short the market? Did anyone see the documentary about Bill Ackman and Herbalife? Interesting stuff. He’d be all over this, trying to use this information to make a better decision. So could they use this information? I don’t see why not.
I’m going to skip this one [slide] because it’s a bit redundant, but could they rescue former customers? If they know that you’re in the area and you used to go there, but you are five minutes away, could it maybe text you an offer about some sort of discount that might get you to go from a different place to this one? And could this data be a means to an end?
I’m pretty certain that they are thinking about these questions at Foursquare. And their data-based future? Really, again, what do they have to lose? They already got the down round. Anytime you’re pivoting to find a business model five or six years in—may not be the best sign. Although to be fair, some companies have successfully pivoted. Fun fact: Anyone know what YouTube was before it was video? Dating site. People only used the video feature, so the founder said, “Forget this dating stuff.” What about Slack, my favorite tool? What was that originally? Yeah, Glitch. It’s a game. Game didn’t work out, we’re going to keep the message system. So there is precedent for this.
As for threats to the new model: What about government regulation? Look at what’s happening in Europe right now. I’m going to talk a little bit about this more, when I talk about perils of Big Data. But remember, a big threat to Foursquare is user abandonment. What happens if they can’t convince people to download it? The whole model falls away. So there is certainly a risk here. But will the company succeed? I have no idea. But what do they really have to lose?
Now, small organizations. When I talk about Foursquare, it’s still worth a lot of money. Netflix, Amazon, Google, Facebook. But what about small organizations? Are they left out of the parade? Not necessarily. I’m going to show you a short video here, but has anyone heard of Street Bump in Boston? All right. This is fun, play the video. [Plays video]
So Thomas Menino, up until recently, was the mayor of Boston, he recently passed away. And he had the foresight, as a pretty progressive guy, to say, “What if we built an app that people installed on their phones?” Now if one person bumps over a particular area, that could be just an accident. You weren’t paying attention, you swerved, trying to avoid an animal, whatever. But if 50 people do it at the same point, is that a coincidence? No. So instead of having a public works department handle it after a citizen calls, what if they could be more proactive? Again, regardless of your politics, that is pretty powerful stuff.
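The "one bump is an accident, 50 bumps is a pothole" idea can be sketched in a few lines. This is a simplified illustration, not how the Street Bump app is actually implemented: the function name `likely_potholes`, the report format, the grid size (about a ten-thousandth of a degree, roughly ten meters of latitude), and the 50-phone threshold are all assumptions made for the example.

```python
from collections import defaultdict


def likely_potholes(reports, min_devices=50, cell_size=0.0001):
    """Flag map cells where many distinct phones reported a bump.

    reports: iterable of (device_id, latitude, longitude) tuples.
    Returns {cell: number_of_distinct_devices} for cells at or above the threshold.
    """
    devices_by_cell = defaultdict(set)
    for device_id, lat, lon in reports:
        # Snap each report to a coarse grid cell so nearby bumps group together.
        cell = (round(lat / cell_size), round(lon / cell_size))
        # A set counts each phone once, however many times it reports.
        devices_by_cell[cell].add(device_id)
    return {
        cell: len(devices)
        for cell, devices in devices_by_cell.items()
        if len(devices) >= min_devices
    }


# Fifty different phones hit the same spot; one phone hits another spot 60 times.
reports = [(f"phone-{i}", 42.3601, -71.0589) for i in range(50)]
reports += [("phone-0", 42.3500, -71.0700)] * 60
print(likely_potholes(reports))
```

Counting distinct devices rather than raw reports is the design choice that matters here: one driver who swerves over the same curb every day should not look like a pothole.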
So there is a lot of potential with Big Data, but there’s also a pretty big downside. It’s been a while since I hit you with a quote. This is one of my favorites, from Melvin Kranzberg, who is a professor at Case Western: “Technology is neither good nor bad, nor is it neutral.” Anyone doubt that there’s a downside to this stuff, particularly when there’s this sort of tech backlash taking place right now with these companies? Especially with what happened in Europe?
So let’s start with ickiness and bad PR. And yes, ickiness is a technical term.
I’m going to guess that a few of you have heard of the Target story with pregnancy. Raise your hand. Some of you haven’t, so I’ll give you the quick low-down. It was about three or four years ago. Charles Duhigg writes about this in his book, but initially it was an article for the New York Times. Target uses data to serve up ads. Why wouldn’t you? You’re competing against Amazon. You know what Jeff Bezos said about profits, or margins? He said, “Your margin is my opportunity.” Amazon is incredibly comfortable operating at zero or even negative margins. Amazon would scare me if I were in the retail business.
So Target hired a statistician, and he put together ads. A guy walks into Target in Minneapolis, I believe it was, and he’s furious. He demands to speak to the manager. He said, “You’re sending my daughter, 16 years old, pure as the driven snow, ads with diapers, and vitamins, and baby lotions? What the hell do you think is going on here?” Manager doesn’t work in marketing. “Oh, I’m so sorry sir. We value you as a customer. I’m sorry, it’ll never happen again.” Two days later, the guy goes back into Target. “Sir, I owe you an apology. There was stuff going on in my house I wasn’t aware of.” Target knew that this 16-year-old girl was pregnant, and her dad did not.
Now the crazy thing about that story, and something that often gets lost when people tell it, is that Target actually dumbs down its ads. It can produce, based on the statistics and the data that is captured, ads that are so eerie, they will make people feel strange. So they’ll intentionally put in ads of things. Let’s say you don’t like football, they’ll put in ads for footballs and jerseys just so the ads don’t make you feel so icky. Now, is anything that Target did illegal? I would argue no. Is it unethical? Maybe, but they’re competing with Amazon. So that wound up being a perfect story, but that’s definitely a downside for Target that I think led some people to believe that there is a downside to this stuff.
Legality. Anyone ever read this one, Weapons of Math Destruction? Fantastic book. Think about this. I’m not a lawyer, but a million years ago I used to work in human resources, and I know this much: If I’m advertising for a programmer in the United States, I can’t say, “Looking for someone who knows Java and CSS but also has to be a white male between 25 and 32 years old. And has to live in a certain state.” I can’t do that in a newspaper ad. With Facebook, could I target based on that? Yup. Is that necessarily legal?
ProPublica ran an interesting piece about two or three months ago about how anti-Semitic groups were able to target people who would be more likely to join them, because that information is there. Even after Facebook said, “We’ve taken it down,” they spent the money and bought the ads. Absolutely terrifying stuff.
More perils: privacy. Is anyone else a little icky about what’s going on here? Anyone who watched the show Black Mirror on Netflix? I don’t disagree. Very good stuff. There is an episode called “Nosedive.” Have you seen that one? The woman walks around and she’s rating everyone for their experience. Not just your Uber driver, your Airbnb host. I mean everything. It’s a great episode and I won’t ruin it for those who haven’t seen it.
Anyone know what’s happening in China? What’s happening? [Inaudible audience comment] Yep. They’re assigning scores to citizens based on how they live. If you consort with disreputable people or you go to websites that the Chinese government doesn’t like, your score goes down. Which means that next time you apply for a loan, you may not get it. Or it may go up. Now there’s a potential upside to this as well. We actually showed this in class a couple of weeks ago when we talked about privacy. They are also monitoring what people are doing, down to facial recognition. So it’s not necessarily all bad because if you’re—hear me out. If you’re a truck driver and you’re falling asleep, and they have that video recognition of you starting to yawn, maybe they could send a signal that you should pull over and get some sleep. But the ramifications of this with regard to privacy—pretty scary.
Security. I don’t know, has there been a hack there today? I can’t even keep track anymore. John Chambers, former CEO of Cisco, famously said, “There are two types of companies, those that have been hacked and those that haven’t admitted it yet.” So this information is incredibly valuable. And again, ethics. Go back to the Target example for a second. Just because we can, does that mean that we should? It is an increasingly difficult question for me to wrestle with. Yes, this stuff is very powerful. But it’s also particularly dangerous.
Want to wrap today with question number seven. How can we get started in our organization? And these are just some tips based on professionals I interviewed for the books, or articles that I’ve read, or case studies.
You don’t go from zero to Google overnight. Think about this. Google could do things now like Autocomplete, like make specific recommendations based on not just where you are but the device you’re using. If I am on my phone and I put “7-Eleven” and the letter N into Google, it’s going to Autocomplete “near me,” because it knows I’m on my phone and I’m driving. Understand that Google could not do that in 1998, or arguably 2008 when Android just launched.
So when I think about Netflix, Google, Amazon, Facebook, the companies that are doing amazing and sometimes scary things with data—they could not do this overnight. So it’s important to not think about things in terms of a project. I despise the term “Big Data project.” It implies that it’s finished. Netflix is never finished analyzing data. I mentioned this to a few people earlier. Probably one of my favorite findings about Netflix in my research is that the company figured out that the color of the image for the movie—I used to say “movie covers” back when people bought DVDs, but—that actually might have some impact. And they looked at the, I always mispronounce it—is it hexagonal colors? The hex colors? Hex something. I’ve got to Google that at some point. They analyzed the specific colors, because you might have a predisposition towards movies with orange in them, like Orange Is the New Black, or TV shows, or Arrested Development. Any Arrested Development fans out there? Come on! One of my faves.
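The "hex colors" here are hexadecimal color codes, such as #ffa500 for orange. As a rough illustration of what analyzing cover colors could involve, the sketch below converts pixels to hex codes and counts the most common ones. It is not Netflix's tool: the function names, the bucketing step of 64, and the sample pixels are all assumptions made for the example.

```python
from collections import Counter


def to_hex(rgb):
    """Convert an (r, g, b) tuple of 0-255 integers to a hex color code."""
    r, g, b = rgb
    return f"#{r:02x}{g:02x}{b:02x}"


def dominant_colors(pixels, top_n=3, step=64):
    """Return the most common colors after grouping near-identical shades.

    Each channel is snapped to the middle of a bucket `step` wide, so slightly
    different oranges are counted as the same orange.
    """
    def bucket(value):
        return (value // step) * step + step // 2

    counts = Counter(
        to_hex(tuple(bucket(channel) for channel in pixel)) for pixel in pixels
    )
    return counts.most_common(top_n)


# Hypothetical cover art: mostly two shades of orange plus one dark pixel.
pixels = [(255, 165, 0)] * 3 + [(250, 160, 10)] * 2 + [(0, 0, 0)]
print(to_hex((255, 165, 0)))
print(dominant_colors(pixels, top_n=1))
```

In practice the pixel list would come from an image library; it is hard-coded here so the example stays self-contained.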
New tools. I love the fact that you’re using Tableau, that you’re playing around with Python, some of these tools. Because in my experience, the organizations that are doing a lot with data not only buy tools off the shelf but invent new ones. Netflix could not find tools that would analyze the colors of movies, so it built them, and then it wasn’t afraid to act on them. Hire for curiosity—bless you—and proficiency with data. Does that mean that if you’re hiring a finance analyst, that person needs to be proficient in Hadoop and Python? Of course not. But if I put on my swami hat, I don’t see a future in which someone who just doesn’t do data and tech at all, and runs from it, has a successful career. So a lot of companies actually will hire for this notion of curiosity. Will you go where the data takes you? Where will you be stubborn?
I did an interview earlier today, and the question was about one of my favorite students. And Sasha Yodder, stubborn as hell, but in a good way. To open up a two-gig data set, her eleventh program didn’t work, so she tried a twelfth and it worked. And then she was able to analyze it. I love that kind of stubbornness. Also, hire for curiosity and proficiency with data. Again, I do not see a future in which people who just don’t do this stuff do particularly well. One of my favorite Google interview questions evidently is, “Tell me about the last time that you changed your mind.” And if you never did, then you never looked at data and the data told you something, and you changed course because of it.
More advice on getting started: embracing data discovery. Go in with this curiosity. Where is the data going to take you? One of the reasons that I had my students visualize my ratings is that I wanted to know how I could do better. Do I teach better 440 versus 450? Do I teach better in the summer, or the spring, or the fall? Ideally, and I don’t have this information, I’d love for my student ratings to tell me if I teach female students better than male students. International students versus domestic. Freshmen versus seniors. I would love that information.
Easier said than done, but ideally you’re creating this culture of analytics. When I made the mistake at Netflix—I shared this story with a few people earlier. I got up in front of 150 people at Netflix and said, “Netflix is responsible for one fifth of all US nighttime Internet traffic.” 150 people within two seconds immediately said, “One third.” I’m glad that I made the mistake, because everyone there understands the importance of data.
I’d also argue that you should expect some level of resistance, particularly at more mature companies, though it’s not impossible at relatively recent ones. When I told my acquisition editor, and he’s a friend of mine, that I wanted to split test the cover for my book Too Big to Ignore: The Business Case for Big Data, he said no, because he didn’t want the data to prove him wrong. I understand it. I just don’t agree with it.
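For anyone who has not run one, a split test shows two versions (two book covers, say) to separate groups and asks whether the difference in response is bigger than chance would explain. Below is a minimal sketch using a standard two-proportion z-test. The function name `split_test` and every number in the usage line are invented for illustration; no such test was actually run on the book cover.

```python
from math import erf, sqrt


def split_test(clicks_a, views_a, clicks_b, views_b):
    """Two-proportion z-test for an A/B (split) test.

    Returns (z, p_value), where p_value is two-sided. A small p_value suggests
    the difference between the two click rates is unlikely to be chance alone.
    """
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled rate under the assumption that both versions perform the same.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    if std_err == 0:
        return 0.0, 1.0  # nobody clicked, or everybody did: nothing to compare
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - normal_cdf)
    return z, p_value


# Hypothetical numbers: cover A got 50 clicks in 1,000 views, cover B got 80.
z, p_value = split_test(50, 1000, 80, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

The point of the exercise is the one made above: the test can tell you that your favorite cover lost, which is exactly why some people would rather not run it.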
When you were talking before about umpires: Google “HBO Real Sports umpires.” There’s a great segment, about 15 minutes long, that I showed my students about how they proved to an umpire that he was wrong 30 percent of the time. And this was a few years ago. They gave him a disc. And he said, “You know what I did with the disc? Didn’t even look at it, threw it out.” Old school. I would not want to work with that person.
Manage expectations, under-promise and over-deliver. I get really skittish when I hear people say, “Oh yeah, we installed Hadoop. It’s going to solve all our problems.” Or “We expect the ROI to be 52.7%.” How do you know? Again, these companies that are doing it well understand that it is a process, not an outcome. I don’t think that Google will ever be finished analyzing information, absent some sort of government regulation.
More advice on getting started: I’m a big believer in internal momentum. Now, yes, you can be successful coming from—going from the top down. The CEO, the president of the company, says, “We need to do this,” and everyone gets on board. But if you think about—particularly with Hadoop being open source, or even a tool like Slack, which operates on a freemium model—in my experience, it’s been successful in organizations because it’s been bottom up, and someone has done something really cool with it and people want to know about it.
In other words, and I promised you a Rush reference, aim for little victories. And communicate them throughout the organization. What if we found something really cool? What if you’re on line for lunch, and someone’s talking over here? “Really? You guys did that? Interesting. Tell me more.” Fun fact on lunch lines: Google, and this is right from the new book, actually monitors how long people are in line for. And they try to regulate the number of plates. They don’t want people waiting on line for 10 minutes because you’re going to go, “The line’s too long. I’m not going to come back.” And if it’s 30 seconds, you don’t have time to collide, to have that discussion with someone about potentially a new big idea.
Ideally, you’re making the skeptics and the dataphobes come to you. It is very difficult to convince people to do something they don’t want, and ultimately data can take people out of their comfort zones. No one wants to know that a 25-year-old recent marketing grad who is a whiz at Google Analytics can do their job better than they can, with 20 years of marketing experience. This stuff is threatening.
I don’t like thinking in terms of traditional IT projects. First of all, they don’t have a particularly solid track record. Second, at some point you’re finished. Your system is live. Again, I just don’t see a future in which you’re finished analyzing data. Things change. New data sources come up all the time. I can remember when social media was all the rage in 2013, 2014. We started to hear more about Pinterest. Do I have to worry about Pinterest? I don’t know, is there important information there?
I hate ROI calculations, right? They are all very much SWAG, they’re strategic wild-ass guesses. Why is this happening here? How do you know what it’s going to be? Sure, in hindsight maybe it was useful, but Amazon does not think in terms of that. If you’re not swinging fast—or what is it Gretzky says? You miss 100 percent of the shots you don’t take. I’d argue that Big Data is only getting more valuable. I’m not a huge fan of the whole oil metaphor, because oil is finite. We’re running out of it. We’re generating more data than ever, but certainly data is really valuable.
I’d also argue that the longer you wait, the worse off you’ll be. These companies are not sitting still. They’re building moats to separate themselves from the competition.
Next up, set it and forget it. I don’t like that mindset. This is not a standard report. This is not a P&L that needs to show up in your inbox once a week. There’s probably something new going on. Are you looking for more information? Are you testing new theories? Are you adding more independent variables?
I’m a big believer in looking for data and expertise outside the organization. “Yeah, but it’s tough to find a data scientist.” Fair enough. Want to rent one? What did Google pay for Kaggle? Was it a couple of years ago, a billion dollars? You could basically rent data scientists. Put a contest out there and see if someone can build a better algorithm for you.
Lead, follow, or get out of the way. This stuff is going to happen. In my opinion, if you stand in the way, you’re just causing a problem. And again, it is very much a process, not an outcome.
So that’s all I got. I want to thank you for listening. We’ve got time for a few questions. If you want to hear more about my madness, here you go.