The system will be running as much as possible. This means that when components fail, only part of the system will fail, and the system itself will keep operating as if nothing happened. Failures have to be handled as they happen and, most importantly, must be transparent to the user processes.

Soft state: Information is in a soft state when it has not yet been written to disk. As an example, suppose a user places an item in his or her shopping cart, but the change to the shopping cart has not yet been made permanent; the change is in a soft state. A problem could occur if the item is no longer available when the user goes to check out. Often, especially if this happens infrequently, web users are patient with this type of "glitch." In contrast, hard state means that the system has been updated and the changes have been made permanent. To continue the previous example, every time a user views his or her shopping cart, it looks exactly the way he or she expects it to look.

Eventually consistent: This is a weak consistency model in which all copies of the data are eventually updated.

MapReduce

MapReduce is a programming model developed by Google for processing large volumes of data in parallel. It consists of a two-phase approach: a set of mapper tasks that read through the data sequentially and create key/value pairs, and a set of reducer tasks that read the sets of key/value pairs and output a list of values for a given key. A MapReduce job is usually run on data that is not contained in a database. A framework such as Hadoop (or Google's implementation built on the Google File System) parallelizes the mapper tasks and the reducer tasks, sorts the mapper outputs, and stores the data in a distributed file system. Each mapper runs on a portion of the data concurrently with other mappers running on other, different portions of the data. Similarly, the reducers run in parallel on different mapper outputs. This parallelization makes MapReduce highly scalable: terabytes of data can be processed in parallel across multiple machines, making it possible to analyze web logs, patterns in videos, or Twitter feeds, tasks that were very difficult or impossible with traditional DBMS processing. Frameworks like Hadoop abstract away the more difficult aspects of parallel programming, enabling the programmer to focus on the map and reduce logic that processes the data.

Example

One easy way to see how MapReduce works is through a WordCount application. Suppose the task is to count the occurrences of different words in a word list; for example, we want to see how many times "deer" appears in the list. The framework first divides the data into pieces called "splits." In this example, Split 1 would contain the first three words, "Deer", "Bear", and "River"; Split 2 would contain the next three, "Car", "Car", and "River"; and Split 3 would contain "Deer", "Car", and "Bear".
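To make the example concrete, the sketch below shows how such a WordCount job is typically written against Hadoop's MapReduce Java API (org.apache.hadoop.mapreduce). It follows the standard Hadoop tutorial structure rather than any code supplied in this module: a mapper that emits (word, 1) pairs for each word in its split, a reducer that sums the counts for each word, and a driver that configures and submits the job.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: for each word in its split, emit the key/value pair (word, 1).
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);   // e.g. ("Deer", 1)
      }
    }
  }

  // Reducer: after the framework sorts and groups the mapper outputs,
  // receives ("Deer", [1, 1, ...]) and emits ("Deer", total count).
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configures and submits the job; the input and output paths
  // are directories in the distributed file system, passed on the command line.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Run against the word list above, the three splits would be handed to three mapper instances running in parallel. After the framework sorts and groups the mapper outputs, the reducer invocation for the key "Deer" would receive the values [1, 1] (one occurrence from Split 1 and one from Split 3) and emit ("Deer", 2), while "Car" would come out as ("Car", 3).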