

Performance and scalability prediction is key to designing future High-Performance Computing (HPC) systems. System designers aim to find the proper balance between computation/network performance and power. An adequate multi-scale simulation methodology is needed for a fast and accurate design space exploration. In this regard, the Mont-Blanc project has focused on developing a complete simulation methodology at different abstraction levels that allows architectural parameter exploration and scalability analysis.

Nowadays, a popular approach for architectural performance/scalability prediction is trace-oriented simulation. It relies on performing a reference simulation and collecting traces of the most relevant phenomena observed during execution. The traces are then re-used as an abstraction for some of the simulation elements (e.g., core behavior, memory accesses). In this way, they enable refocusing the simulation effort on other performance-critical system sub-components such as caches, the communication architecture and the memory sub-system. The ElasticSimMATE tool, developed in the Mont-Blanc project, builds on these foundations. It allows traces to be captured on several cores and subsequently replayed on architectures with different configurations and an arbitrary core count, up to hundreds of cores.

ElasticSimMATE is based on two existing tools: Elastic Traces [1] and SimMATE [2], both developed within the gem5 [3] full-system simulator. These tools have shown that trace-driven simulation reduces simulation times while maintaining accuracy with respect to the gem5 framework. However, each applies only to certain configurations or models. For instance, Elastic Traces can only be applied to single-core systems, meaning that multicore architectures and synchronization events are not handled. SimMATE, on the other hand, focuses on analyzing multi-core systems and synchronization mechanisms, but only supports in-order CPU models.

ElasticSimMATE is a joint effort to combine the advantages of Elastic Traces and SimMATE. It thus enables two categories of exploration:

  • Fast system parameter exploration: because the trace-driven simulation is fast, the influence of various parameters such as cache sizes, coherency policy, and memory speed can be rapidly assessed by replaying the same traces on different system configurations.
  • System scalability analysis: analyzing how performance scales as the number of cores increases. This requires recording synchronization events and carefully handling their semantics in the trace-replay phase, so that the replay reflects execution on the target architecture.
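The first category above can be illustrated with a toy trace-replay loop: a trace is recorded once and replayed under several memory configurations, with only the replay cost model changing between runs. This is a conceptual Python sketch; the event format, cost model and numbers are invented for illustration and do not reflect ElasticSimMATE's actual trace format.

```python
# Conceptual sketch of trace-driven parameter exploration.
# One recorded trace: (event_type, payload) pairs. All names and
# numbers here are illustrative, not ElasticSimMATE's.
trace = [
    ("compute", 100),   # 100 cycles of computation
    ("mem_access", 1),  # one memory access
    ("compute", 40),
    ("mem_access", 1),
]

def replay(trace, mem_latency):
    """Replay the same trace with a different memory latency."""
    cycles = 0
    for kind, amount in trace:
        if kind == "compute":
            cycles += amount                # unaffected by the config
        elif kind == "mem_access":
            cycles += amount * mem_latency  # config-dependent cost
    return cycles

# Fast parameter sweep: no re-execution of the application is needed,
# which is where the speed-up over full-system simulation comes from.
for latency in (10, 50, 200):
    print(f"mem_latency={latency}: {replay(trace, latency)} cycles")
```

Because the expensive reference run happens only once, each additional configuration costs only a cheap replay pass.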

A fast and accurate gem5 trace-driven simulator for multicore systems

ElasticSimMATE Methodology:

Figure 1: Overview of the ElasticSimMATE methodology

Figure 1 conceptually depicts the ElasticSimMATE workflow, from the OpenMP application source files to the replay on different target architecture configurations. The red-colored “pragma omp” statements in the source are normally processed by the compiler and result in the insertion of calls to the OpenMP run-time. In ElasticSimMATE, these calls additionally invoke a tracing function that records the start and end of each parallel region in the trace. The resulting binaries are then executed in a full-system simulation (trace collection phase) to generate the execution traces. Three traces are created: instruction and data-dependency trace files (as in the Elastic Traces approach) and an additional trace file that embeds synchronization information. These three trace files are used in the trace-replay phase devoted to architecture exploration.
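The region-marking step described above can be modelled with a small sketch: calls inserted at the start and end of each parallel region emit synchronization events, alongside the (not shown) instruction and data-dependency traces. The function and event names below are illustrative, not the real tool's API.

```python
# Conceptual model of the trace-collection step for OpenMP regions.
from contextlib import contextmanager

sync_trace = []  # the third trace file: synchronization information

@contextmanager
def parallel_region(region_id):
    """Stands in for the tracing calls wrapped around 'pragma omp parallel'."""
    sync_trace.append(("region_begin", region_id))
    try:
        yield
    finally:
        sync_trace.append(("region_end", region_id))

# The instrumented program executes normally; the markers delimit the
# parallel region in the recorded trace.
with parallel_region(0):
    total = sum(range(10))  # the parallel work being traced

print(sync_trace)
# [('region_begin', 0), ('region_end', 0)]
```

During replay, these begin/end markers are what lets the simulator re-synchronize the per-core traces when the core count or configuration changes.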

Experiments have been carried out on sample applications extracted from the Rodinia and Parsec benchmark suites. Preliminary results show that ElasticSimMATE results are highly correlated with gem5 full-system simulation results, while achieving a simulation speed-up of 3x. Furthermore, ElasticSimMATE enables fast scalability analysis: experiments have been carried out on applications running on core counts ranging from one to 128.

The Mont-Blanc 3 project takes advantage of ElasticSimMATE's capabilities to perform architectural-level analysis with shorter simulation times. The tool is part of the multi-scale simulation framework and will interact with tools developed by the consortium partners to provide a holistic approach to fast design space exploration of HPC systems.

Further information: A. Nocua, F. Bruguier, G. Sassatelli and A. Gamatie, “ElasticSimMATE: A fast and accurate gem5 trace-driven simulator for multicore systems,” 2017 12th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), Madrid, 2017, pp. 1-8. doi: 10.1109/ReCoSoC.2017.8016146.


[1] R. Jagtap, S. Diestelhorst, A. Hansson, M. Jung and N. Wehn, “Exploring system performance using elastic traces: Fast, accurate and portable,” 2016 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS), Agios Konstantinos, 2016, pp. 96-105. doi: 10.1109/SAMOS.2016.7818336.

[2] A. Butko et al., “A trace-driven approach for fast and accurate simulation of manycore architectures,” The 20th Asia and South Pacific Design Automation Conference, Chiba, 2015, pp. 707-712. doi: 10.1109/ASPDAC.2015.7059093.


Partners: CNRS and Arm


Dimemas is a performance analysis tool for message-passing programs. The Dimemas simulator reconstructs the time behaviour of a parallel application on a machine modelled by the key factors that influence performance. With a simple model of a network of SMP nodes, Dimemas makes it possible to run complete parametric studies in a very short time frame. As part of its output, Dimemas generates a Paraver trace file, enabling the user to conveniently examine the simulator run.
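The kind of time reconstruction Dimemas performs can be sketched with a toy model, assuming a simple linear network (per-message latency plus size over bandwidth). The record format and the parameter values below are invented for illustration and are not Dimemas output or its actual model.

```python
# Toy reconstruction of a message-passing timeline: CPU bursts keep
# their measured duration, while each message is re-timed from the
# modelled network parameters. All numbers are invented.

LATENCY_S = 5e-6       # modelled per-message latency (seconds)
BANDWIDTH_BPS = 1e9    # modelled network bandwidth (bytes/second)

# A per-process trace: compute bursts (seconds) and message sends (bytes).
events = [
    ("compute", 0.010),
    ("send", 4_000_000),  # 4 MB message
    ("compute", 0.002),
]

def message_time(size_bytes):
    """Linear network model: latency + size / bandwidth."""
    return LATENCY_S + size_bytes / BANDWIDTH_BPS

def reconstruct(events):
    """Accumulate the predicted timeline for one process."""
    t = 0.0
    for kind, value in events:
        t += value if kind == "compute" else message_time(value)
    return t

print(f"predicted runtime: {reconstruct(events):.6f} s")
```

Varying the latency and bandwidth parameters and re-running the reconstruction is what makes parametric "what-if" studies so cheap: the application itself never has to be re-executed.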

Partner: BSC


BOAST is a modular meta-programming framework. It implements a DSL that allows the description and parametrization of computing kernels. Application developers can port their computing kernels to BOAST and implement several optimization techniques. Kernels with the chosen optimizations can then be generated in the target language of choice: C, Fortran, OpenCL, CUDA or C with vector instructions. This approach also allows application developers to study application-specific parameters. The generated kernels can then be built and executed inside BOAST to evaluate their performance. With this framework, one can easily find the best-performing version of a kernel on a given architecture. Performance results can also be used to interact with automatic performance analysis tools (ASK, Collective Mind, …) in order to reduce the search space. Generated binary kernels can also be given to tools such as MAQAO for static or dynamic analysis.
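BOAST itself is a Ruby DSL, but the underlying meta-programming idea can be sketched in Python: a parametrized generator emits different variants of the same kernel (here, a C vector-add with a tunable unroll factor), which a tuning framework could then compile and benchmark. The generator below is purely illustrative and is not BOAST's API.

```python
# Illustrative kernel generator: one parametrized description, several
# generated variants. BOAST does this with a Ruby DSL and more targets
# (C, Fortran, OpenCL, CUDA); this sketch only emits plain C.

def generate_vector_add(unroll):
    """Emit a C vector-add kernel with the given unroll factor."""
    body = "\n".join(
        f"        y[i + {k}] += x[i + {k}];" for k in range(unroll)
    )
    return (
        f"void vec_add(int n, const float *x, float *y) {{\n"
        f"    for (int i = 0; i + {unroll} <= n; i += {unroll}) {{\n"
        f"{body}\n"
        f"    }}\n"
        f"}}\n"
    )

# Sweep the optimization parameter and inspect the variants; a tuning
# framework would compile and time each one instead of printing it.
for unroll in (1, 4):
    print(f"// variant: unroll={unroll}")
    print(generate_vector_add(unroll))
```

Because each variant comes from the same parametrized description, exploring unroll factors, vectorization or target languages reduces to sweeping generator parameters rather than maintaining hand-written copies of the kernel.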

Partner: CNRS