|Title||Piecewise Holistic Autotuning of Compiler and Runtime Parameters|
|Publication Type||International Conferences|
|Year of Publication||2016|
|Authors||Popov, M, Akel, C, Jalby, W, de Oliveira Castro, P|
|Publisher||Euro-Par 2016, 22nd International European Conference on Parallel and Distributed Computing|
|Place Published||Grenoble, France|
Current architecture complexity requires fine tuning of compiler and runtime parameters to achieve full potential performance. Autotuning substantially improves default parameters in many scenarios but it is a costly process requiring a long iterative evaluation.
We propose an automatic piecewise autotuner based on CERE (Codelet Extractor and REplayer). CERE decomposes applications into small pieces called codelets: each codelet maps to a loop or to an OpenMP parallel region and can be replayed as a standalone program.
Codelet autotuning achieves better speedups at a lower tuning cost. By grouping codelet invocations with the same performance behavior, CERE reduces the number of loops or OpenMP regions to be evaluated. Moreover, unlike whole-program tuning, CERE customizes the set of best parameters for each specific OpenMP region or loop.
We demonstrate CERE tuning of compiler optimizations, number of threads and thread affinity on a NUMA architecture. On average over the NAS 3.0 benchmarks, we achieve a speedup of 1.08× after tuning. Tuning a single codelet is 13× cheaper than whole-program evaluation and estimates the tuning impact on the original region with a 94.6% accuracy. On a Reverse Time Migration (RTM) proto-application we achieve a 1.11× speedup with a 200× cheaper exploration.
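
The difference between whole-program tuning and per-codelet (piecewise) tuning can be illustrated with a minimal sketch. It does not use CERE or its API: the codelet names, the (optimization flag, thread count) configurations, and the replay timings are all hypothetical values invented for the example. It only shows why choosing the best parameters per region can never be slower than choosing one configuration for the whole program, under the assumption that region timings are independent of each other.

```python
# Hypothetical replay timings in seconds: timings[codelet][config].
# Codelet names, configurations, and numbers are invented for illustration;
# a config is an (optimization flag, thread count) pair.
timings = {
    "loop_a": {("-O2", 8): 4.0, ("-O3", 8): 3.0, ("-O3", 16): 3.5},
    "region_b": {("-O2", 8): 2.0, ("-O3", 8): 2.5, ("-O3", 16): 1.5},
}


def best_global(timings):
    """Whole-program tuning: one configuration applied to every codelet."""
    configs = set.intersection(*(set(t) for t in timings.values()))
    return min(configs, key=lambda c: sum(t[c] for t in timings.values()))


def best_per_codelet(timings):
    """Piecewise tuning: each codelet keeps its own fastest configuration."""
    return {name: min(t, key=t.get) for name, t in timings.items()}


def total_time(timings, choice):
    """Sum of codelet times under a per-codelet configuration choice."""
    return sum(timings[name][choice[name]] for name in timings)


global_cfg = best_global(timings)
whole = total_time(timings, {name: global_cfg for name in timings})
piecewise_choice = best_per_codelet(timings)
piecewise = total_time(timings, piecewise_choice)

print("whole-program:", global_cfg, whole)
print("piecewise:", piecewise_choice, piecewise)
```

With these invented numbers, the single best global configuration is not the fastest one for every codelet, so the piecewise total is lower. Real regions can interact (for example through cache state or thread placement), which is why the independence assumption above is stated explicitly.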