

efficiency is mainly due to the non-scalable border size and secondarily to the unbalanced computational load caused by the multi-resolution grids.

5.3 Strong Scalability 1D vs HSFC

This section investigates the performance gains introduced by the 2D partitioning based on Hilbert Space Filling Curves (HSFC). Furthermore, it determines the lower bound for perfect scaling of the CUDA kernels that process internal blocks (lines 7 and 12 of Algorithm 1). The ideal setting for capturing these aspects is a BUQ grid with uniform resolution and a rectangular arrangement of blocks. The grid is internally handled as a BUQG (i.e., using the block numbering described in Section 2.1). This setting makes it possible to validate the BUQG infrastructure without incurring the multi-resolution overheads that may introduce unbalanced kernel executions.

The input model for the strong scalability test is a circular dam break discretized by means of a uniform BUQ grid containing approximately 82 million cells. This model is designed to guarantee that the HSFC algorithm creates rectangular partitions that minimize the border size. The test is performed on 8 to 64 GPUs. For a fair comparison, each GPU should run kernels at nearly 100% efficiency, so a minimal grid size is guaranteed for each GPU. This requires an initial grid that cannot be stored in GPU memory on fewer than 8 GPUs.

Figure 16 shows the improvements introduced by HSFC. As stated in Section 5.2, the border kernels of 1D partitioning do not scale, and tests show that the efficiency drops almost linearly with the number of GPUs, reaching 73% for 64 GPUs. This is not the case for HSFC, which reaches 85% efficiency with 64 GPUs and a partition size of approximately 1.3 million cells. Tests with smaller input grids show a loss of efficiency in the inner kernel on the single GPU. This is the empirical measure of the suggested GPU load for best occupancy.
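The different scaling of the two border sizes can be illustrated with a back-of-the-envelope sketch. It assumes an idealized square grid of n × n cells (n = 9056 gives roughly 82 million cells), full-width strips for the 1D case, and the most nearly square px × py tiling as a stand-in for the rectangular partitions produced by HSFC. The function names are hypothetical, and the numbers are geometric estimates under these assumptions, not measurements from the paper.

```python
import math


def strip_border(n, parts):
    """Border cells of an interior strip when an n x n grid is cut into `parts` strips.

    Each interior strip touches two neighbours along full-width edges,
    so the count does not depend on `parts` (assumed idealized layout).
    """
    return 2 * n


def tile_border(n, parts):
    """Border cells of an interior tile for the most nearly square px x py tiling."""
    # Largest divisor of `parts` that does not exceed sqrt(parts).
    px = max(d for d in range(1, math.isqrt(parts) + 1) if parts % d == 0)
    py = parts // px
    # Four edges: two of length n/px and two of length n/py.
    return 2 * (n // px + n // py)


def border_fraction(border, n, parts):
    """Border cells divided by the number of cells owned by one partition."""
    return border / (n * n / parts)


N = 9056  # assumed side length: N * N is about 82 million cells
for gpus in (8, 16, 32, 64):
    strips = border_fraction(strip_border(N, gpus), N, gpus)
    tiles = border_fraction(tile_border(N, gpus), N, gpus)
    print(f"{gpus:2d} GPUs: 1D border share {strips:.4f}, 2D border share {tiles:.4f}")
```

Under these assumptions the strip border stays constant while the strip interior shrinks with the number of GPUs, so the border share of a 1D partition grows linearly. The tile border shrinks together with the tile, so its share grows only with the square root of the number of GPUs. With 64 partitions each one owns about 1.28 million cells, in line with the partition size quoted above.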
The HSFC algorithm shows visible improvements over 1D partitioning starting from 16 partitions.

Fig. 16: 2D-HSFC and 1D partitioning strong scalability tests.

5.4 Dynamic Load Balancing on a realistic scenario

This section shows the improvements provided by dynamic partitioning. We focus on HSFC partitioning, since we showed that it is the best performer in the static cases. The input model for this test is a circular dam break located at the centre of a domain of size 50 m × 50 m and discretized by means of a multi-resolution grid containing 6.5 × 10^6 cells. The grid is shown in Figure 17 (left), where the wet and dry portions of the grid are shown in red and blue, respectively. We introduce a vertical wall that confines the water to the right side of the domain, while the left region is non-wettable. The wall is covered by high-resolution blocks and is visible as a denser vertical line. Figure 17 (right) shows the end of the simulation, where water completely covers the right region.
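The idea of cutting a Hilbert curve into load-balanced pieces can be sketched on a toy version of this scenario. The example below is an illustration under stated assumptions, not the paper's implementation: it uses a uniform 8 × 8 arrangement of blocks instead of a multi-resolution grid, the standard coordinate-to-index Hilbert mapping, and a hypothetical weighting in which wet blocks cost 1 and dry blocks cost 0. All function names are invented for this sketch.

```python
def hilbert_index(order, x, y):
    """Distance of block (x, y) along the Hilbert curve of a 2^order x 2^order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/reflect the quadrant so the sub-curve has the canonical orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def split_by_weight(weights, parts):
    """Cut a sequence into contiguous chunks of roughly equal total weight.

    Returns the owning chunk of every item. Requires a positive total weight.
    """
    total = sum(weights)
    owners, acc = [], 0.0
    for w in weights:
        mid = acc + 0.5 * w  # midpoint of this item's share of the total
        owners.append(min(parts - 1, int(mid * parts / total)))
        acc += w
    return owners


def partition_blocks(order, parts, weight_of):
    """Sort blocks along the Hilbert curve, then cut the curve by cumulative weight."""
    n = 1 << order
    blocks = sorted(((x, y) for x in range(n) for y in range(n)),
                    key=lambda b: hilbert_index(order, *b))
    owners = split_by_weight([weight_of(b) for b in blocks], parts)
    return dict(zip(blocks, owners))


def max_wet_load(assignment, is_wet, parts):
    """Largest number of wet blocks owned by a single partition."""
    load = [0] * parts
    for block, owner in assignment.items():
        if is_wet(block):
            load[owner] += 1
    return max(load)


ORDER, PARTS = 3, 4


def is_wet(block):
    # Toy layout (assumption): water is confined to the right half of the domain.
    return block[0] >= 4


# Static partitioning: every block has the same weight.
static = partition_blocks(ORDER, PARTS, lambda b: 1.0)
# Dynamic partitioning (hypothetical weights): only wet blocks carry load.
dynamic = partition_blocks(ORDER, PARTS, lambda b: 1.0 if is_wet(b) else 0.0)

print("max wet blocks per partition, static :", max_wet_load(static, is_wet, PARTS))
print("max wet blocks per partition, dynamic:", max_wet_load(dynamic, is_wet, PARTS))
```

In this toy layout the static cut leaves two partitions with only dry blocks, while the weighted cut spreads the wet blocks evenly. Because both cuts keep blocks contiguous along the curve, each partition remains spatially compact, which is the property that keeps border sizes small.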
