Tuning in large-scale data processing systems

Large-scale data processing systems have hundreds of configurable parameters, and several of them may affect performance [9]. A reasonable set of parameters for tuning application execution to a particular context may seem desirable, but configuring many settings to reach the best throughput often turns out to be a challenging and time-consuming task [114]. Herodotou et al. [10] define three categories of optimization opportunities for big data analytic workloads: (i) data-flow sharing (i.e., a single job performs computations for different logical nodes), (ii) efficient use of materialization of intermediate results, and (iii) automatic reorganization of intermediate data storage, with the help of new data layouts (e.g., employing partitioning) and storage engines or models (e.g., column-based) that may better fit the considered workflow.

Hence, we need to review the actual notion of (database) system tuning. Acquiring new hardware or upgrading software remains out of scope. However, there is no notion of an optimizer that can react to new data structures and access methods, as we have seen in the previous section. In this section, we focus on current techniques for performance tuning within big data systems. We first cover physical design, considering new and now ubiquitous hardware such as SSDs, as well as data allocation. Then, we discuss distributed data management and its consequences for the performance of big data systems, including large data warehouses. Finally, we review monitoring aspects and automatic, or self-managed, big data systems.

4.1. Storage layout and data placement

In traditional database systems, I/O operations are commonly the most expensive ones. Reducing the need for I/O operations or improving I/O performance can substantially impact the overall system's performance.
In that context, SSDs have been used for several years, showing a considerable performance improvement when used instead of magnetic disks for files with sequential access, like transaction logs and rollback segments [115]. Some works consider using SSDs to support big data systems (e.g., [116], [117], [118], [119]). Special attention has been given to the impact of SSDs on Hadoop's performance. Although solid-state storage outperforms traditional magnetic storage, data access and processing patterns should be considered when replacing magnetic storage with SSDs [118], [116]. In real-world high-performance scenarios, HDDs are used for permanent data storage as they are more cost-effective than SSDs [8]. In [117], the performance of a MapReduce-based system is improved by using SSDs to store intermediate data (i.e., map output temporarily stored locally). Compression of intermediate data can also enhance the I/O performance of MapReduce-based systems, but it does not significantly influence HDFS's performance [120].
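
As an illustration of the two intermediate-data techniques just mentioned, the following minimal sketch generates a `mapred-site.xml` fragment that points MapReduce's local intermediate directory at an SSD and enables map-output compression. It is not taken from the cited works. The property names are those we understand Hadoop's `mapred-default.xml` to define; the mount point `/mnt/ssd/mapred/local`, the helper name `build_mapred_site`, and the choice of the Snappy codec are assumptions made for the example. On YARN-managed clusters the NodeManager's local directories (`yarn.nodemanager.local-dirs`) may be the setting that actually governs where map output lands, so the fragment should be checked against the deployed Hadoop version.

```python
import xml.etree.ElementTree as ET

# Hypothetical SSD mount point on each worker node (an assumption for this sketch).
SSD_SCRATCH_DIR = "/mnt/ssd/mapred/local"


def build_mapred_site(ssd_dir: str, compress: bool = True) -> str:
    """Return a mapred-site.xml fragment (as a string) for intermediate data."""
    props = {
        # Local directory where MapReduce keeps intermediate (map output) files.
        "mapreduce.cluster.local.dir": ssd_dir,
        # Whether map outputs are compressed before being written and shuffled.
        "mapreduce.map.output.compress": "true" if compress else "false",
    }
    if compress:
        # Codec choice is an assumption; any installed Hadoop codec could be used.
        props["mapreduce.map.output.compress.codec"] = (
            "org.apache.hadoop.io.compress.SnappyCodec"
        )

    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_mapred_site(SSD_SCRATCH_DIR))
```

Only intermediate data is redirected here; HDFS data directories are left untouched, which matches the observation above that HDDs remain the cost-effective choice for permanent storage.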