EMC WHITE PAPER

DATA LAKES FOR DATA SCIENCE
Integrating Analytics Tools with Shared Infrastructure for Big Data

ABSTRACT
This paper examines the relationship between three primary domains of an enterprise big data program: data science, analytics frameworks, and IT infrastructure. A decision about tools or infrastructure in one domain can affect, and potentially limit, what can be done in the other domains. This paper shows how the practices of data science and the use of analytics frameworks, such as Hadoop and Spark, generate a set of requirements that a big data storage system must fulfill in order to serve the needs of data scientists, application developers, and infrastructure managers.

May 2015
To learn more about how EMC products, services, and solutions can help solve your business and IT challenges, contact your local representative or authorized reseller, visit www.emc.com, or explore and compare products in the EMC Store.

Copyright © 2014 EMC Corporation. All Rights Reserved.

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

The information in this publication is provided “as is.” EMC Corporation makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.

Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.

For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com. All trademarks used herein are the property of their respective owners.

Part Number H14214
TABLE OF CONTENTS

INTRODUCTION

DATA SCIENCE
  Paradigm Shift
  Barriers to Change
  Maturity of Data Collection and Industry Standards
  The Data Science Pipeline
  Problems with the Data Science Pipeline
  Collecting and Storing Data
  Building a Data Set
  Augmenting a Data Set
  Continuously Expanding the Data Set with the Swift Protocol

ANALYTICS FRAMEWORKS
  Unknown Problems and Varying Data Sets Demand Flexible Approaches
  Hadoop: MapReduce and HDFS
  Computing Problems in Analyzing Big Data
  Distributed Data Analysis with Spark
  Other Hadoop-Related Tools
  The Tool Suits the Problem and the Data (or the User)
  Implementing a Data Lake for Framework Flexibility

STORAGE INFRASTRUCTURE FOR A DATA PLATFORM
  Storage Requirements for Big Data Analytics
  Multiprotocol Data Access
  Shared Storage vs. Hadoop DAS
  Separating Storage from Compute
  Fault-Tolerant Performance for HDFS Workloads
  Privacy, Security, and Compliance
  Multitenancy
  Virtualizing Hadoop with Large-Scale Infrastructure to Deliver HDaaS

CONCLUSION
INTRODUCTION

The burgeoning field of data science is fusing with the new business requirement to store and analyze big data, resulting in a conceptual gulf between the practice of data science and the complexities of information technology.