8map reduce largescale_v02.pdf - Large Scale Data...


Large Scale Data Processing

Adaptation from:
  • Magdalena Balazinska (Univ. of Washington)
  • Mining of Massive Datasets, by Rajaraman and Ullman
  • Alan Gates (Yahoo!)
  • Olston
Why Distributed Data Processing

Hardware:
  • CPU speed does not increase; instead: multicore
  • Commodity clusters
  • Easy access to 1000s of nodes through cloud computing
  • Much cheaper than a large mainframe

Big Data:
  • Astronomy: high-resolution, high-frequency sky surveys
  • Medicine: digital records, MRI, ultrasound
  • Biology: sequencing data
  • User behavior data: click streams, search logs, …
  • Google and Facebook, but also Walmart and co…
Distribution and Performance

Traditionally: scale-up
  • Improve performance by buying a larger machine

Distribution: scale-out
  • Improve performance through parallel execution

Performance metrics:
  • Throughput: transactions/queries per time unit
      - The higher, the better
      - Important for OLTP
  • Response time: time for execution of an individual transaction/query
      - The smaller, the better
      - Important for OLAP
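The two metrics above can be computed from a simple log of query durations. The following is a minimal sketch; the function names, the measurement window, and the sample durations are made up for illustration and are not from the slides.

```python
def throughput(num_queries: int, elapsed_seconds: float) -> float:
    """Queries completed per second over an observation window."""
    return num_queries / elapsed_seconds


def mean_response_time(durations: list[float]) -> float:
    """Average time (in seconds) taken by an individual query."""
    return sum(durations) / len(durations)


# Hypothetical measurements: four queries finished within a 2-second window.
durations = [0.5, 0.25, 1.0, 0.25]
window_seconds = 2.0

print(throughput(len(durations), window_seconds))  # 2.0 queries per second
print(mean_response_time(durations))               # 0.5 seconds
```

Throughput describes the system as a whole, while response time describes a single query, so a system can improve one without improving the other.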
Speedup

Speedup (the data size remains the same):
  • More nodes → more throughput and/or lower response time

[Figures: throughput vs. number of nodes; response time vs. number of nodes]

Non-linear speedup:
  • Startup costs
  • Coordination costs
  • Communication costs
  • Skew (equal distribution of load not possible)
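A small numeric sketch can show why speedup falls short of linear. Speedup on n nodes is taken here as the response time on one node divided by the response time on n nodes, for the same data. The cost model below (fixed startup cost, evenly divided work, coordination cost growing with the number of nodes) and all of its constants are assumptions chosen for illustration, not figures from the slides.

```python
def response_time(nodes: int, work: float,
                  startup: float = 1.0,
                  coord_per_node: float = 0.1) -> float:
    """Toy cost model (assumed): startup + evenly divided work + coordination."""
    return startup + work / nodes + coord_per_node * nodes


def speedup(nodes: int, work: float) -> float:
    """Response time on 1 node divided by response time on `nodes` nodes."""
    return response_time(1, work) / response_time(nodes, work)


# Same data size (work = 100 units) on a growing number of nodes.
for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} nodes: speedup {speedup(n, work=100.0):.2f} (linear would be {n})")
```

In this model the gap to linear speedup widens as nodes are added, because the startup and coordination terms do not shrink with the node count.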
Scaleup

Scaleup (the data size increases):
  • More nodes → same throughput / same response time despite more data

Non-linear speedup and scaleup:
  • Data distribution overhead
  • Non-parallelizable operations: aggregation
  • Communication costs
  • Skew (equal distribution of data not possible)
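The effect of skew on scaleup can be illustrated with a short sketch. It assumes that a parallel step finishes only when its slowest node finishes, and it measures scaleup as the time on one node with the base data divided by the time on n nodes with n times the data (ideal value: 1.0). The function names and chunk sizes are hypothetical.

```python
def parallel_time(chunk_sizes: list[int], cost_per_record: float = 1.0) -> float:
    """Assumed model: the most heavily loaded node determines the response time."""
    return max(chunk_sizes) * cost_per_record


def scaleup(base_records: int, chunk_sizes: list[int]) -> float:
    """Time on one node with the base data / time on n nodes with more data."""
    return parallel_time([base_records]) / parallel_time(chunk_sizes)


# 4x the data on 4 nodes, once evenly distributed and once skewed.
even = [100, 100, 100, 100]
skewed = [250, 50, 50, 50]

print(scaleup(100, even))    # 1.0: response time is unchanged
print(scaleup(100, skewed))  # 0.4: one overloaded node slows the whole step
```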
Parallel Relational Database Systems

  • A lot of DBS technology was developed in the 90s
  • Good understanding of distributed execution of relational algebra queries
  • Both for:
      - OLTP (online transaction processing): workload of short, update-intensive queries, such as day-to-day banking, flight reservations, etc.
      - OLAP (online analytical processing) / decision support: workload of complex queries, mainly read-only
  • Sophisticated and optimized operators (such as distributed join operators…)
  • Expensive and specialized: Oracle, Teradata,…
Parallel Query Evaluation

  • Inter-query parallelism
      - Different queries run in parallel on different processors; each query is executed sequentially
  • Inter-operator parallelism
      - Different operators within the same execution tree run on different processors
      - Pipelining leads to parallelism
  • Intra-operator parallelism
      - A single operator (e.g., scan, join) runs on many processors
      - Topic of this week
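Intra-operator parallelism can be sketched as one scan operator applied to every partition of a table at the same time. The example below uses Python's standard `concurrent.futures.ThreadPoolExecutor` as a stand-in for separate processors; the partition contents and the predicate are made up. In CPython, threads mainly illustrate the structure here and are not expected to speed up CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_partition(partition, predicate):
    """Sequential selection over a single partition (one 'processor')."""
    return [row for row in partition if predicate(row)]


def parallel_scan(partitions, predicate, workers=4):
    """Run the same scan operator on all partitions and concatenate the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves the order of the input partitions.
        results = pool.map(lambda p: scan_partition(p, predicate), partitions)
        return [row for part in results for row in part]


# Hypothetical table, already split into three partitions.
partitions = [[1, 8, 3], [12, 5], [7, 20, 2]]
print(parallel_scan(partitions, lambda x: x > 6))  # [8, 12, 7, 20]
```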
Horizontal Data Partitioning

Data:
  • Large table R(K, A, B, C)
  • Key-value store KV(K, V)

Goal: partition into chunks C_1, C_2, …, C_n of records stored at n nodes

Hash partitioned on attribute X:
  • Record r goes to chunk i, according to a hash function
  • Example hash function: H(r) = (r.X mod n) + 1

Range partitioned on attribute X:
  • Partition the range of X into: -∞ = v_0 < v_1 < … < v_{n-1} < v_n = ∞
  • Record r goes to chunk i if v_{i-1} < r.X ≤ v_i
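Both partitioning schemes can be sketched in a few lines. The hash variant follows the example function from the slide, (r.X mod n) + 1, and assumes integer values of X. The range variant takes only the interior boundaries as a sorted list and treats the upper bound of each range as inclusive, which is an assumption made so that values equal to a boundary are assigned to a chunk. The helper names and sample keys are hypothetical.

```python
from bisect import bisect_left


def hash_chunk(x: int, n: int) -> int:
    """Chunk number in 1..n using the slide's example: (x mod n) + 1."""
    return (x % n) + 1


def range_chunk(x, boundaries) -> int:
    """Chunk number in 1..n for sorted interior boundaries (n - 1 values).

    Assumption: the upper bound is inclusive, i.e. chunk i holds the values
    greater than the previous boundary and up to and including the next one.
    """
    return bisect_left(boundaries, x) + 1


def partition(values, chunk_of, n):
    """Distribute values over chunks 1..n using the given chunk function."""
    chunks = {i: [] for i in range(1, n + 1)}
    for v in values:
        chunks[chunk_of(v)].append(v)
    return chunks


keys = [3, 10, 14, 20, 27]  # hypothetical values of attribute X
print(partition(keys, lambda x: hash_chunk(x, 3), 3))
# {1: [3, 27], 2: [10], 3: [14, 20]}
print(partition(keys, lambda x: range_chunk(x, [10, 20]), 3))
# {1: [3, 10], 2: [14, 20], 3: [27]}
```

Hash partitioning scatters neighbouring values across chunks, while range partitioning keeps them together, which is also why skewed value distributions can overload a single range.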