Extrae is the instrumentation package that captures information during program execution and generates Paraver (and Dimemas) traces. It can insert its probes through several mechanisms, ranging from static interception of runtime calls by linking with the Extrae library to dynamic instrumentation using Dyninst. The most frequent scenario is to use LD_PRELOAD to intercept production binaries at load time. The information collected by Extrae includes entries to and exits from the programming model runtime, hardware counters (PAPI), call stack references, user functions, periodic samples and user events.
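The idea of recording entry and exit events around intercepted calls can be illustrated with a small Python analogy. This is only a sketch of the concept and not Extrae's API: the `probe` decorator, the `trace` buffer and the `exchange_halo` function are hypothetical names introduced for illustration.

```python
import time
from functools import wraps

# Hypothetical in-memory trace buffer holding (timestamp_ns, name, kind) tuples.
trace = []

def probe(func):
    """Wrap func so that its entry and exit are recorded as trace events."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        trace.append((time.perf_counter_ns(), func.__name__, "enter"))
        try:
            return func(*args, **kwargs)
        finally:
            # The exit event is recorded even if the wrapped call raises.
            trace.append((time.perf_counter_ns(), func.__name__, "exit"))
    return wrapper

@probe
def exchange_halo(data):
    # Stand-in for a runtime call such as a message-passing routine.
    return [x * 2 for x in data]

result = exchange_halo([1, 2, 3])
```

The application code is unchanged apart from the wrapper, which is loosely comparable to how preloading an interposition library leaves the production binary untouched.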
Paraver is a very flexible data browser. In Paraver, metrics are not hardwired into the tool but programmed. Using a filter and a semantic module, the analyst can create timelines, profiles and histograms from trace files to selectively display a huge number of performance metrics. The different views can be easily combined to find correlations among the causes of performance problems. To capture the expert's knowledge, any set of views can be saved as a Paraver configuration file and reused in subsequent analyses. Paraver also features performance analytics tools, such as clustering and folding, that enrich the analysis by giving insight into the overall execution behavior as well as fine-grained measurements for computation regions. The tool has proven very useful in performance analysis studies, exposing more detail about application behavior than most performance tools.
Illustration: Paraver Folding Analysis
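To make the notion of programmed metrics in Paraver more concrete, the following sketch derives a timeline from raw trace records in two steps: a filter selects the records of interest, and a semantic function turns them into values over time. It is a simplified illustration, not Paraver's implementation; the `Event` record layout, the event type codes (100, 200) and the function names are assumptions.

```python
from collections import namedtuple

# Hypothetical trace record: time stamp, thread id, event type code, event value.
Event = namedtuple("Event", "time thread type value")

def filter_events(events, wanted_type):
    """Filter step: keep only the events of one type."""
    return [e for e in events if e.type == wanted_type]

def last_event_value(events, thread, end_time):
    """Semantic step: each event's value holds until the next event of that thread."""
    own = sorted((e for e in events if e.thread == thread), key=lambda e: e.time)
    segments = []
    for cur, nxt in zip(own, own[1:] + [None]):
        stop = nxt.time if nxt is not None else end_time
        segments.append((cur.time, stop, cur.value))
    return segments

events = [
    Event(0, 1, 100, 1),
    Event(5, 1, 200, 7),   # different event type, removed by the filter
    Event(10, 1, 100, 0),
    Event(15, 1, 100, 2),
]
timeline = last_event_value(filter_events(events, 100), thread=1, end_time=20)
# timeline == [(0, 10, 1), (10, 15, 0), (15, 20, 2)]
```

Swapping in a different filter or semantic function yields a different metric from the same trace, which is the sense in which metrics are programmed rather than built in.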
Score-P is a highly scalable measurement infrastructure and easy-to-use tool suite for profiling, event trace recording, and online analysis of HPC applications. Score-P offers the user a maximum of convenience by supporting a number of analysis tools. Currently, it works with Periscope, Scalasca, Vampir, and TAU and is open to other tools. Score-P comes together with the new Open Trace Format Version 2 (OTF2), the CUBE4 profiling format and the OPARI2 instrumenter.
Scalasca is an open-source toolset that can be used to analyze the performance behavior of parallel applications and to identify opportunities for optimization. It has been specifically designed for use on large-scale systems including IBM Blue Gene and Cray XT, but is also well-suited for small- and medium-scale HPC platforms. Scalasca integrates runtime summaries with in-depth studies of concurrent behavior via event tracing. A distinctive feature is the ability to identify wait states that occur, for example, as a result of unevenly distributed workloads.
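As an illustration of what a wait state looks like in event-trace terms, the sketch below computes the waiting time of a "late sender" situation, where a blocking receive is entered before the matching send starts. It is a deliberately simplified model, not Scalasca's analysis code; the message tuple layout and the time stamps are made up for the example.

```python
def late_sender_wait(send_enter, recv_enter):
    """Time the receiver is blocked because the send started after the receive."""
    return max(0.0, send_enter - recv_enter)

# Hypothetical message records: (sender rank, receiver rank, send enter, receive enter).
messages = [
    (0, 1, 4.0, 1.5),  # receiver entered early and waits 2.5 time units
    (1, 2, 2.0, 3.0),  # send started first, so no waiting time
    (2, 3, 6.5, 6.0),  # receiver waits 0.5 time units
]

# Accumulate the waiting time per receiving rank.
wait_per_rank = {}
for _, receiver, send_enter, recv_enter in messages:
    wait = late_sender_wait(send_enter, recv_enter)
    wait_per_rank[receiver] = wait_per_rank.get(receiver, 0.0) + wait
# wait_per_rank == {1: 2.5, 2: 0.0, 3: 0.5}
```

Ranks that accumulate large waiting times are typically those that finished their share of the work early, which is how an uneven workload distribution becomes visible.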
Cube, which is used as the performance report explorer for Scalasca, is a generic tool for displaying a multidimensional performance space consisting of the dimensions (i) performance metric, (ii) call path, and (iii) system resource. Each dimension can be represented as a tree, where non-leaf nodes of the tree can be collapsed or expanded to achieve the desired level of granularity. In addition, Cube can display Cartesian process topologies of up to three dimensions.
Illustration: Cube Result Display of Scalasca Parallel Trace Analysis
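The effect of collapsing and expanding nodes in one of Cube's tree dimensions can be sketched as follows, assuming the usual convention that a collapsed node shows the inclusive value of its whole subtree while an expanded node shows only its own exclusive value. The `Node` class, the `displayed` helper and the call-path names are hypothetical and are not part of Cube.

```python
class Node:
    """One node of a tree dimension, carrying its own (exclusive) metric value."""
    def __init__(self, name, exclusive, children=()):
        self.name = name
        self.exclusive = exclusive
        self.children = list(children)

    def inclusive(self):
        # Value of the node plus everything below it.
        return self.exclusive + sum(c.inclusive() for c in self.children)

def displayed(node, expanded):
    """Return (name, value) rows for a tree in which the named nodes are expanded."""
    if node.children and node.name in expanded:
        rows = [(node.name, node.exclusive)]
        for child in node.children:
            rows.extend(displayed(child, expanded))
        return rows
    # Leaf or collapsed node: one row that aggregates the whole subtree.
    return [(node.name, node.inclusive())]

root = Node("main", 1.0, [
    Node("solve", 2.0, [Node("compute", 5.0), Node("exchange", 3.0)]),
    Node("io", 0.5),
])

coarse = displayed(root, set())      # [("main", 11.5)]
medium = displayed(root, {"main"})   # [("main", 1.0), ("solve", 10.0), ("io", 0.5)]
```

Whatever the chosen granularity, the displayed values add up to the same total, so expanding a node only redistributes its value among its children.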
MAQAO provides state-of-the-art binary code performance analysis. MAQAO (Modular Assembly Quality Analyzer and Optimizer) is a tool for static and dynamic analysis and optimization of binary codes, with a special focus on the loop level. Binaries are disassembled, instrumented and reassembled statically, and the control flow is reconstructed. MAQAO's Static Analyzer plugin assesses the code quality of innermost loops, for example with respect to vectorization, and provides a best-case estimate of the achievable performance based on a micro-architecture performance model. MAQAO can also provide hints on how to improve the performance of the code, in terms of source code transformations, compiler flags, pragmas, etc.