BIG DATA ANALYTICS USING APACHE HADOOP

SEMINAR REPORT

Submitted in partial fulfilment of the requirements for the award of the Bachelor of Technology Degree in Computer Science and Engineering of the University of Kerala

Submitted by
ABIN BABY
Roll No: 1
Seventh Semester
B.Tech Computer Science and Engineering

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
COLLEGE OF ENGINEERING TRIVANDRUM
2014
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
COLLEGE OF ENGINEERING TRIVANDRUM

CERTIFICATE

This is to certify that this seminar report entitled "BIG DATA ANALYTICS USING APACHE HADOOP" is a bonafide record of the work done by Abin Baby, under our guidance towards partial fulfilment of the requirements for the award of the Degree of Bachelor of Technology in Computer Science and Engineering of the University of Kerala during the year 2011-2015.

Dr. Abdul Nizar A            Mrs. Sabitha S        Mrs. Rani Koshi
Professor                    Assoc. Professor      Assoc. Professor
Dept. of CSE                 Dept. of CSE          Dept. of CSE
(Head of the Department)     (Guide)               (Guide)
ACKNOWLEDGEMENTS

I would like to express my sincere gratitude and heartfelt indebtedness to my guide, Dr. Abdul Nizar A, Head of the Department, Department of Computer Science and Engineering, for the valuable guidance and encouragement in pursuing this seminar. I am also very thankful to Mrs. Sabitha S, Associate Professor, Department of Computer Science and Engineering, for her help and support.

I also extend my hearty gratitude to the seminar co-ordinator, Mrs. Rani Koshi, Associate Professor, Department of CSE, College of Engineering Trivandrum, for providing the necessary facilities and for her sincere co-operation. My sincere thanks are extended to all the teachers of the Department of CSE and to all my friends for their help and support.

Above all, I thank God for the immense grace and blessings at all stages of this seminar.

Abin Baby
ABSTRACT

The paradigm for processing huge datasets has shifted from centralized to distributed architectures. As enterprises gathered ever larger volumes of data, they found that the data could not be processed by any existing centralized solution. Apart from time constraints, enterprises faced problems of efficiency, performance and elevated infrastructure cost when processing data in a centralized environment.

With the help of distributed architectures, these large organizations were able to overcome the problem of extracting relevant information from a huge data dump. One of the best open-source tools in the market for harnessing a distributed architecture to solve such data-processing problems is Apache Hadoop. Using Apache Hadoop's components, such as data clusters, the MapReduce programming model and distributed processing, we can resolve complex location-based data problems and feed the relevant information back into the system, thereby improving the user experience.
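The MapReduce model referred to above can be illustrated with a minimal word-count sketch in plain Python (not actual Hadoop API code): a map step emits key-value pairs, a shuffle step groups them by key, and a reduce step aggregates each group, mirroring what a Hadoop job performs across cluster nodes.

```python
from collections import defaultdict

# Map: emit a (word, 1) pair for every word in each input line.
def map_phase(lines):
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

# Shuffle: group emitted values by key (Hadoop's framework does this step).
def shuffle_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: aggregate the list of values for each key.
def reduce_phase(groups):
    return {word: sum(values) for word, values in groups.items()}

lines = ["big data big clusters", "big data analytics"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 2, 'clusters': 1, 'analytics': 1}
```

In a real Hadoop deployment the map and reduce functions run in parallel on different nodes over HDFS blocks; this single-process sketch only shows the data flow of the programming model.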