IJCEM International Journal of Computational Engineering & Management, Vol. 17 Issue 5, September 2014
ISSN (Online): 2230-7893

Big Data Analysis using R and Hadoop

Anju Gahlawat
Tata Consultancy Services Ltd.
4 & 5 Floor, PTI Bldg, 4, Parliament St, Connaught Place, New Delhi

Abstract
Big data - heavy-volume, highly volatile, varied, and complex data - has entered our lives, and it is becoming harder by the day to manage it and to gain business advantage from it. This paper describes what big data is and how to process it with tools and techniques for analyzing and visualizing the data and predicting future market trends. The approach described here combines the R language, a leading environment for statistical computing, with Hadoop, a framework for parallel data processing, so that strong statistical models can be applied to big data in parallel. Integrating R and Hadoop provides an environment in which R code can be written and deployed on Hadoop without moving the data. Together, R and Hadoop help organizations resolve scalability issues and perform predictive analysis with high performance, enabling a much deeper dive into big data.

Categories and Subject Descriptors: E.m - MISCELLANEOUS
General Terms: Theory, Languages
Keywords: Big Data, Data Analysis, R Language, Map Reduce, Hadoop

1. INTRODUCTION

1.1 Introduction to Big Data
Big data is a buzzword, or catch-phrase, used to describe a volume of structured and unstructured data so massive that it is difficult to process with traditional relational databases and software techniques on an organization's existing hardware and infrastructure. Big data is more than simply a matter of size; according to IBM, it has three major attributes:
Variety - Different types of data, including text, audio, video, click streams, log files, and more, which can be structured, semi-structured, or unstructured.
Volume - Hundreds of terabytes and petabytes of information.
Velocity - The speed at which data must be analyzed in real time to maximize its business value.

Figure 1: Big data attributes

1.1.1 Sample data sizes at a few leading companies

Data generated by NYTimes in one day:
- 50 GB of uncompressed log files
- 10 GB of compressed log files
- 0.5 GB of processed log files
- 50-100M clicks
- 4-6M unique users
- 7,000 unique pages with more than 100 hits
- Index size: 2 GB
- Pre-processing and indexing time: 10 min on a workstation (4 cores, 32 GB); 1 hour on EC2 (2 cores, 16 GB)

Data generated by Facebook in one month:
- 30 billion pieces of content shared on Facebook every month
- 40% projected growth in global data generated per year vs. 5% growth in global IT spending