Deliverables

The objective for the first six months of the Mont-Blanc project is to have a fully functional framework in all work packages. This involves setting up the necessary technical infrastructure and an adequate methodology in each work package.

This document defines the dissemination objectives of the Mont-Blanc project, the different targets for all its activities, the dissemination tools, the interaction with similar projects, the activities to be carried out during the project, and the policy used to disseminate the results. Its aim is to define the strategy for disseminating the project results, taking into account the significant impact that this project is expected to have on society. The plan intends to raise awareness of and interest in the developed technologies and solutions among target groups such as users, the scientific community, the IT industry and the general public. The strong presence of leading HPC research institutions ensures a wide dissemination potential through scientific channels, while the industrial partners will focus more on exploitation and technology transfer activities. Most of the results will be published via academic and industrial channels by submitting scientific papers and by holding workshops, courses and tutorials related to the new technologies.

The objective of the Initial Press Release deliverable is (1) to define a general strategy for creating and publishing press releases and (2) to report on the outcome of the initial and follow-up press releases for the Mont-Blanc project. This press release must be sent by all partners to all their local press contacts and translated into local languages if needed. As stated in the Dissemination Strategy Document (D 2.1), future press releases will be planned during the project. The press release strategy defined here should be consistent with the dissemination strategy and its objectives, and will be maintained throughout the Mont-Blanc project.

This document describes the structure, content and update process of the Mont-Blanc public website (www.montblanc-project.eu). Web presence is a central element in the dissemination activities of Mont-Blanc, as indicated in the Dissemination Strategy Document (D 2.1). The website became publicly available in October 2011. The Barcelona Supercomputing Center, as coordinator of the project, hosts and maintains the website. This document describes how the website was created and how it will be maintained. It also describes the structure of the website and the functions that are available to the user.

This report summarizes the dissemination activities carried out by the Mont-Blanc project in the October 2011 – October 2012 period. Specifically, the following pages give a complete list of the conferences attended, as well as the presentations related to the project made at various events and workshops. Any additional coverage of the project by the press and online media is also presented. During this first year of Mont-Blanc, the consortium published one technical report and attended 49 conferences, workshops or seminars. Moreover, the consortium organized two successful training sessions, to which other EU-funded projects were invited. The high media impact of the project has raised high expectations among the HPC community. For this reason, the overall dissemination output of Mont-Blanc is an indication of the European excellence and recognition of the project partners.

This document reports on the selection of the performance-critical kernels to be ported to the OmpSs [5] programming model during the course of the project. This work, started within WP3-T3.1 and now continuing in WP3-T3.2 and WP3-T3.3, pursues two goals. On the one hand, it aims at increased performance and portability of the kernels themselves, owing to the paradigm shift from a serial or thread-oriented model to a task-based model supported by an efficient run-time scheduler. On the other hand, it should devise a set of best practices that give WP4 colleagues helpful guidelines when porting full applications.

This report refers to the activities planned in WP3 under Tasks 3.2 and 3.3. After completion of D3.1 we identified two subsets of application kernels: small-size and medium-size kernels. Following the status update of T3.2 given in D3.2, we describe here the final WP3 porting activities by reporting the detailed progress with respect to D3.2. As in D3.2, after the porting to ARM, we focused on two major issues affecting the kernel development results: (i) the porting to OmpSs; (ii) the porting to OpenCL. As already reported in D3.2, some preliminary benchmarking has been carried out on the available Mont-Blanc prototypes, but a full optimization of the most promising kernels (related to T3.3) will be made when the final system is released. The activities expected in T3.2 can be considered concluded, with most of the kernels preliminarily integrated into the full applications from WP4, even if some porting activities will continue in P3. In particular, (i) the small-size kernel development activities have been concluded and will continue with the three kernels integrated into the full applications they refer to; (ii) all the medium-size kernels were integrated into the corresponding full applications and passed OmpSs compilation; (iii) the porting of the medium-size kernels to OmpSs is almost complete, while the OpenCL porting is still in progress.

This report refers to the activities planned in WP3 under Task 3.2. After completion of the WP3 activities in P2 of the Mont-Blanc workplan, we set up a repository containing the source and supporting files for all the kernels covered by this work package. The repository can be accessed at the URL http://wiki.montblanc-project.eu/index.php5/WP3_Optimized_application_kernels. In this document we report the details of the repository structure, with a brief description of its content.

This deliverable shows the evaluation of the Mont-Blanc node using two sets of benchmarks: Standard and Mont-Blanc benchmarks.

The Mont-Blanc project aims to assess the potential of clusters based on low-power embedded components to address future Exascale HPC needs. The role of work package 4 (WP4, “Exascale applications”) is to port, co-design and optimise up to 11 real exascale-class scientific applications to the different generations of platforms available, in order to assess the global programmability and the performance of such systems. The first section introduces the different applications and their characteristics, the second section describes the platforms used by WP4 during the first year, the third section reports the progress of the porting and profiling of each of the 11 applications during the first year, and the last section gives perspectives on WP4 activities.

The Mont-Blanc project aims to assess the potential of clusters based on low-power embedded components to address future Exascale HPC needs. The role of work package 4 (WP4, “Exascale applications”) is to port, co-design and optimise up to 11 real exascale-class scientific applications to the different generations of platforms available, in order to assess the global programmability and the performance of such systems. Following the first report, D4.1 “Preliminary report of progress about the porting of the full-scale scientific applications” [1], this report gives an overview and the results of the final porting of all 11 applications to the different systems made available by the project or by partners.

The Mont-Blanc project aims to assess the potential of HPC clusters based on low-power embedded components to address future Exascale HPC needs. The role of work package 4 (WP4, “Exascale applications”) is to port, co-design and optimise up to 11 real exascale-class scientific applications to the different generations of Mont-Blanc hardware platforms available, in order to assess the global programmability and the performance of such systems. Following the first report, D4.1 “Preliminary report of progress about the porting of the full-scale scientific applications” [1], and the latest report, D4.2 “Final report about the porting of the full-scale scientific applications” [2], this report presents the work carried out by WP4 during the last year. That work is based on the selection of a subset of scientific applications suited to the Mont-Blanc architecture, and on specific optimisation and taskification work using OmpSs/OpenCL. The first results related to the optimisation performed on the selected set of applications are detailed in deliverable D4.2.

The following deliverables are closely related and rich in cross-references: D4.4 “Report on the profiling, the optimisation and the benchmarking of a subset of application suited for performance and energy”; D4.5 “Report on the efficiency and performance evaluation of the application ported and best practices”; D4.6 “Final list of ported and optimized applications”. To avoid redundancy and to provide better readability and a logical sequence, the decision has been taken to merge D4.4, D4.5 and D4.6 into a single physical document.

In this Mont-Blanc deliverable we present the current status of the porting to the ARM architecture of OmpSs (the Mercurium compiler and the Nanos++ runtime system), the Extrae instrumentation library and the Scalasca instrumentation facilities. In addition, we present an initial evaluation of the overhead observed in the OmpSs programming model when using Extrae instrumentation on the Intel architecture.

This document describes the status of the system software stack within the Mont-Blanc project. The work of populating a complete software stack for HPC and scientific computing has been ongoing since the beginning of the Mont-Blanc project (see deliverables D5.3 and D5.5). In this deliverable we report the work related to the third year and the extension of the project. As the project deployed the Mont-Blanc prototype during this period, based on 1080 SoCs, each with a dual-core CPU and an embedded mobile GPU, the effort has been focused on porting and tuning the Mont-Blanc system software, shown in Figure 1, to our final platform.

Nowadays, leading high performance computing (HPC) clusters use scalable distributed parallel file systems that are able to stripe data over multiple servers to achieve high performance in I/O as well. From our experience in the Storage Systems Research Group, and given the requirements of the project, we chose Lustre, a parallel file system that is very common, open source and POSIX compliant, as the first candidate to provide high performance I/O on our ARM cluster. Given that Lustre is open source, we are able to access its code and adapt it to our Linux kernel (provided by SECO) for the ARM architecture. For the time being, we focused on the client part, since the server part is not expected to be executed on the ARM cluster. Thus, we started by spending our efforts on adapting the code of the Lustre client modules to our specific kernel version. As expected, we got some important compilation errors due to kernel incompatibilities, since the last maintenance release of Lustre is compatible with kernel versions up to 2.6.32 whereas our current version is 2.6.36 (based on an Ubuntu Maverick distribution). However, we recently obtained a first patched version of the Lustre client that can perform nearly all of the most common and important POSIX operations. The problem is that, due to circumstances we still do not control, executing some specific deletion operations causes the client to hang. From this deliverable on, we will devote more effort to understanding what is really happening, and whether it is an issue related to the architecture or to the changes we made, which still need further review.
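
To make the idea of striping data over multiple servers concrete, the following minimal Python sketch maps a byte offset in a file to a server index and an offset within that server's object, assuming plain round-robin (RAID-0 style) striping. The function name, the stripe size and the server count are illustrative assumptions; this is not Lustre's layout code and does not reflect its defaults.

```python
def locate_stripe(offset, stripe_size, num_servers):
    """Map a file byte offset to (server index, offset inside that server's object).

    Hypothetical helper illustrating round-robin striping; not a Lustre API.
    """
    if offset < 0 or stripe_size <= 0 or num_servers <= 0:
        raise ValueError("offset must be >= 0; stripe_size and num_servers must be > 0")
    stripe_index = offset // stripe_size            # which stripe unit of the file
    server = stripe_index % num_servers             # stripe units rotate over the servers
    units_before = stripe_index // num_servers      # full units already stored on that server
    local_offset = units_before * stripe_size + offset % stripe_size
    return server, local_offset
```

With a stripe size of 4 bytes and 3 servers, consecutive 4-byte units land on servers 0, 1, 2, 0, 1, ... so that a large sequential read can be served by all servers in parallel.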

The Mont-Blanc project will produce the first large-scale supercomputer based on ARM cores. The ARM architecture has been successfully used in the past in embedded and mobile platforms. However, the requirements and constraints of those platforms differ greatly from the needs of a High Performance Computing (HPC) system. One of these major differences is the system software used in each environment. Embedded and mobile computing programmers typically use operating systems and libraries customized for their target application (e.g., Android). Moreover, such platforms typically target applications that run on a single MPSoC chip. This is in contrast to a typical HPC environment, where general-purpose operating systems (e.g., Linux) and scientific libraries (e.g., BLAS) are used to run applications on hundreds or thousands of compute nodes in parallel. This document describes the initial work done to create a functional HPC system based on ARM cores, from the operating system to the scientific libraries and parallel execution. Such work involves not only porting system software to the ARM architecture, but also tuning these software components to fully exploit the characteristics of ARM cores. Similarly, the cluster management system also needs to be adapted to the characteristics of ARM-based nodes and to the goal of achieving very high energy efficiency.

We aim to create an optimized software stack tailored to an ARM-based HPC system. To that end, we are looking at exploiting OS features that can improve performance. We investigate the effects of using huge pages through Transparent HugePages on a number of benchmarks and sample HPC applications running on the SoC chosen for Mont-Blanc, the Exynos 5 Dual. We present results for both the PandaBoard and the Arndale.

As the main goal of the Mont-Blanc project is to produce large-scale HPC clusters based on ARM processor architecture, one of its major challenges is to perform porting and tuning of already-existing system software for ARM-based HPC clusters. Deliverable 5.3 [MBD12a] summarizes our initial work in this regard (until month 12 of the project). In this deliverable, we report on the follow-up work in the second year of the project (month 12 - month 24). In particular, we summarize our efforts related to the parallel programming model and compiler, the development tools, and the scientific and runtime libraries. Furthermore, we report on the installation and customization of the operating system, the cluster monitoring and resource management, the performance monitoring and analysis tools, and the parallel distributed filesystem.

In this deliverable, we present the current status of the low-level software components required for gathering information about the performance of HPC applications running on ARM-based systems. This work will enable performance monitoring tools to be ported to the Mont-Blanc prototype.

In this deliverable, we present the current status of the prototype versions of the performance analysis tools considered in the Mont-Blanc project. This includes the community instrumentation and measurement system Score-P, the performance analysis toolset Scalasca with its result browser CUBE, developed by Juelich Supercomputing Centre, and the Barcelona performance tool-suite, containing the instrumentation library Extrae, the analysis tool Paraver and the simulation tool Dimemas. For all of these tools, we describe the current status of the porting to the Mont-Blanc platform as well as the implemented extensions for supporting the OmpSs programming model.

In this deliverable we present the power consumption measurement process and data acquisition of the Mont-Blanc prototype.

In this deliverable, we present the current status of the prototype versions of the performance analysis tools considered in the Mont-Blanc project. This includes the community instrumentation and measurement system Score-P, the performance analysis toolset Scalasca with its result browser CUBE, developed by Jülich Supercomputing Centre, and the Barcelona performance tool-suite, containing the instrumentation library Extrae, the analysis tool Paraver and the simulation tool Dimemas. For all of these tools, we describe the current status of the porting to the Mont-Blanc platform, in particular the testing on the WP7 prototype, as well as the implemented extensions for supporting the OmpSs programming model.

Energy-efficient high performance computing extends beyond the use of energy-efficient low-power processing hardware. With the power consumption of a high performance computing system varying increasingly with its workload, modern supercomputers need tighter integration with their surrounding data center infrastructure than ever before, creating new challenges for the design and operation of data centers and systems. The main aspects covered in this document are the power supply chain and the cooling system of the data center and the supercomputer.

This deliverable provides the technical description of the final prototype system delivered to BSC for the use within the Mont-Blanc project. This system consists of 1080 nodes that are deployed in two separate partitions, a small one for test and development and a large one for running applications. The latter is a separate entity and has its own interconnect and storage subsystems.

This report summarizes the dissemination activities carried out by the Mont-Blanc project in the October 2013 – September 2014 period. The dissemination activities are similar in both projects (Mont-Blanc 2011 – 2013 and 2013 – 2016). Specifically, the following pages give a complete list of the conferences attended, as well as the presentations related to the project made at various events and workshops. Additional coverage of the project by the press and social media is also presented in this document, as well as other dissemination activities such as collaborations with other projects.

This report summarizes the dissemination activities carried out by the Mont-Blanc project in the October 2014 – September 2015 period. This period has been characterized by the promotion of the prototype deployment announcement and by the participation of the Mont-Blanc climbers’ team in the ISC Student Cluster Competition.

This document, D3.2 “Applications porting and tuning”, reports in detail the activities related to T3.1, T3.2 and T3.4 during the first 21 months of the Mont-Blanc 2 project. During the same period, a limited part of the activities related to T3.3 (Application benchmarking) has also started, in order to make a preliminary assessment of the code versions ported to the platforms made available to the consortium partners (see next section for platform disambiguation). The support activities related to T3.4 have provided significant help in the porting of the Mont-Blanc applications, thus paving the way for their repeated use in T3.1 and T3.2 of Mont-Blanc 2.

This report describes work done in three areas relevant to the performance of the Mont-Blanc prototype system.

In this deliverable we present the extensions to OmpSs regarding the support of clusters. Within OmpSs we have implemented a caching system to manage the data that must be sent to remote nodes to be processed there. A remote node can be another node in the cluster or an accelerator attached to it. This implementation has been done on the Intel architecture and evaluated on a cluster with NVIDIA GPUs. Deliverable D4.1 presents the porting to the ARM architecture.

This deliverable report describes some of the work that has been done on optimizing the Linux operating system for running HPC applications on ARM. The first section describes problems found in the interconnect hardware/software stack and potential solutions. The second section describes some profiling infrastructure that will be used when looking at how to implement energy-aware scheduling policies at the kernel level.

The main objective of the Mont-Blanc project is to develop a European Exascale approach based on commodity power-efficient embedded technologies. After having successfully delivered the Mont-Blanc prototype in the first phase of the project, we now complement the efforts undertaken in the first three years by addressing challenges that our system needs to cope with in terms of massive parallelism, system resiliency and employment of future heterogeneous architectures. The latter is discussed in this deliverable, where we present our latest results on the assessment and applicability of heterogeneous architectures, with a particular emphasis on ARM big.LITTLE technology. Specifically, we focus our attention on the evaluation and improvement of task scheduling mechanisms for big.LITTLE platforms and we propose three load balancing algorithms targeting performance improvement of data-parallel applications in heterogeneous systems.
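
As a rough illustration of load balancing for data-parallel work on heterogeneous cores, the Python sketch below splits a fixed number of loop iterations among cores in proportion to their relative speeds. It is a generic static scheme written for this summary, not one of the three algorithms proposed in the deliverable; the function name and the integer speed weights (for example, 2 for a big core and 1 for a LITTLE core) are assumptions.

```python
def split_iterations(total, speeds):
    """Split `total` iterations among cores proportionally to integer speed weights.

    Hypothetical helper: `speeds[i]` is an assumed relative speed of core i.
    Returns a list with the number of iterations assigned to each core.
    """
    if total < 0 or not speeds or any(not isinstance(s, int) or s <= 0 for s in speeds):
        raise ValueError("total must be >= 0 and speeds must be positive integers")
    weight = sum(speeds)
    # Integer floor of each proportional share; the remainder is handed out below.
    shares = [total * s // weight for s in speeds]
    leftover = total - sum(shares)                  # always smaller than len(speeds)
    fastest_first = sorted(range(len(speeds)), key=lambda i: -speeds[i])
    for i in fastest_first[:leftover]:
        shares[i] += 1                              # extra iterations go to the fastest cores
    return shares
```

A static split like this only works well when the speed ratio is known and stable; dynamic schemes adjust the shares at run time instead.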

For the OmpSs extensions part, in this deliverable we present three developments we have made in OmpSs. First, we have incorporated a resource specification into the programming model to allow programmers to tune the use of cores and devices in the execution of OmpSs tasks. As a result, the programmer can better guide the runtime to use more or fewer resources of a specific type and obtain better performance. Second, we have extended OmpSs with the capability to profile the execution of OpenCL kernels in order to determine the most suitable kernel configuration. The Mercurium compiler allows the programmer to specify the ranges of values that should be analyzed, and the Nanos++ runtime performs the exploration. Third, we have further evaluated the performance of the OmpSs@cluster programming model with 4 new benchmarks on the Mont-Blanc prototype.
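
The exploration of kernel configurations can be pictured with the following minimal Python sketch, which measures a kernel for each candidate value in a range and keeps the cheapest one. It is a generic illustration only: the names `explore` and `time_once`, the `repeats` parameter and the pluggable `measure` function are assumptions, and the code reflects neither the Mercurium syntax nor the Nanos++ implementation.

```python
import time


def time_once(kernel, config):
    """Run `kernel(config)` once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    kernel(config)
    return time.perf_counter() - start


def explore(kernel, candidates, repeats=3, measure=time_once):
    """Return (best_config, best_cost) over `candidates`.

    Each candidate is measured `repeats` times and the minimum is kept,
    which reduces the influence of timing noise.
    """
    candidates = list(candidates)
    if not candidates or repeats < 1:
        raise ValueError("need at least one candidate and one repeat")
    best_config, best_cost = None, float("inf")
    for config in candidates:
        cost = min(measure(kernel, config) for _ in range(repeats))
        if cost < best_cost:
            best_config, best_cost = config, cost
    return best_config, best_cost
```

A call such as `explore(run_kernel, range(32, 257, 32))` would try work-group sizes 32, 64, ..., 256, where `run_kernel` is a hypothetical function that launches the kernel with the given size.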

The objective of this document is to provide a precise specification of this interface. In a second step, the interface will be implemented by the BSC OmpSs compiler and runtime group, and necessary monitoring components using this interface will be created for the WP5 performance tools (Extrae, Score-P) and debugging tools (Temanejo, DDT) developers.

This document describes the work done to integrate basic support for the Open Computing Language (OpenCL) into the unified measurement infrastructure Score-P. After an introduction to OpenCL and Score-P, the current status of the software prototype and preliminary results are presented in detail. The current prototype monitors important OpenCL API functions by intercepting them at link time and collecting the necessary data via library function wrapping. Data is captured on OpenCL functions regarding devices, kernels, memory objects and command queues. The prototype was tested with Intel, AMD and NVIDIA OpenCL implementations.
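
The principle of function wrapping can be shown with a small Python analogy: the original function is replaced by a wrapper that records the call and its duration and then forwards to the original. This is only an analogy. The prototype described above intercepts C API functions at link time, whereas the sketch below rebinds a Python name; `enqueue_kernel`, `wrap` and `call_stats` are hypothetical names and not part of OpenCL or Score-P.

```python
import functools
import time
from collections import defaultdict

# Per-function statistics collected by the wrappers (hypothetical structure).
call_stats = defaultdict(lambda: {"calls": 0, "seconds": 0.0})


def wrap(func):
    """Return a wrapper that records call count and elapsed time, then forwards the call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            record = call_stats[func.__name__]
            record["calls"] += 1
            record["seconds"] += time.perf_counter() - start
    return wrapper


def enqueue_kernel(queue, kernel):
    """Hypothetical stand-in for an API function; appends and returns the queue length."""
    queue.append(kernel)
    return len(queue)


# Interception step: callers keep using the same name but now reach the wrapper.
enqueue_kernel = wrap(enqueue_kernel)
```

Because the caller's code is unchanged, the measurement is transparent to the application, which is the property that makes wrapping attractive for monitoring an API.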

This deliverable presents the main features of MAQAO on ARM. After a study of the impact of vectorization and of vectorization/energy tradeoffs on ARM32 architectures, we present the static analyses used on ARM and, briefly, the currently working instrumentation feature. We then apply MAQAO to a benchmark in order to describe the hints given by the tool, and to SMMP, a Mont-Blanc application, in order to optimize it. Finally, we describe the ongoing work on data layout transformations.

This document describes BOAST, a metaprogramming framework for producing portable and efficient computing kernels for HPC applications. BOAST offers an embedded domain-specific language to describe the kernels and their possible optimizations. BOAST also supplies a complete run-time to compile, run, benchmark, and check the validity of the generated kernels. BOAST is being used in two flagship HPC applications, BigDFT and SPECFEM3D, to improve the performance portability of those codes.

This document describes the work done to implement the monitoring and control API specified in the previous D5.1 Mont-Blanc 2 deliverable. First, we present the description of the software components that define the API from the programming model perspective (OmpSs, Mercurium, and Nanos++) and the monitoring/debugging tools (Extrae/Paraver, Ayudame/Temanejo, and Score-P/Scalasca). Then, we go into details of the implementations developed in this period, to provide the functionality associated with the API.

In this deliverable, we describe our modifications and enhancements to the Score-P instrumentation and measurement infrastructure as well as the Scalasca Tracing Tools package implemented within the Mont-Blanc project towards an integrated analysis of hybrid applications using multiple parallel programming models in combination. In particular, we focus on the support for the OmpSs and OpenCL programming models as well as the challenges introduced by the asynchronous nature of create/wait-type threading and task-based programming. Various examples highlight that Score-P and Scalasca now effectively support the performance analysis of hybrid codes using a single, coherent workflow and a unified result presentation.

This document describes the work done to integrate the results and the representation mechanism from the Folding process developed at BSC into the Cube4 visualization tool developed at JSC. The Cube4 tool has been extended to augment its display and analysis capabilities via a plugin mechanism so that third party tools provide not only performance data to Cube but also new ways to represent the performance information within the Cube GUI. BSC has taken advantage of this extension to provide new visualization metaphors of its Folding mechanism to be able to represent in Cube4 the application progression in terms of performance and source-code between delimited code regions. This document also presents an example of the usage of this integration by describing the analysis of the BigDFT application.

This deliverable is a preliminary report on state-of-the-art software-based resiliency techniques for high performance computing (HPC). The document gives an overview of past resiliency challenges and the solutions proposed to address them. It reviews the resiliency challenges expected in exascale computing and tries to project research directions to tackle these problems.

Mont-Blanc’s WP6 was set up to address the problem of the greater error rates expected in future Exascale systems due to increased component counts, smaller silicon geometries and other factors. D6.6 summarises the results so far from research into new fault-tolerant iterative sparse solvers based on the Conjugate Gradient (CG) method. These types of solver are very commonly used in scientific applications, so any advance in improving the built-in fault tolerance of sparse iterative CG solvers should have a material impact on Exascale systems and on several of the Mont-Blanc applications. For example, Berlin Quantum Chromodynamics (BQCD) spends ~80% of its execution time in a CG solver, while EUTERPE also spends a significant portion of its run-time in a Jacobi Preconditioned Conjugate Gradient solver.
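
For readers unfamiliar with the method, the following Python sketch shows the textbook Jacobi-preconditioned Conjugate Gradient iteration for a small dense symmetric positive definite system. It is the standard, non-fault-tolerant algorithm and is not the solver developed in D6.6 or used in BQCD or EUTERPE; the function names, the dense list-of-lists matrix format and the default tolerance are assumptions made for illustration.

```python
def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    """Dense matrix-vector product, with A given as a list of rows."""
    return [dot(row, v) for row in A]


def jacobi_pcg(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A using CG with a Jacobi preconditioner."""
    n = len(b)
    x = [0.0] * n
    r = [float(bi) for bi in b]                    # residual b - A x for the initial guess x = 0
    inv_diag = [1.0 / A[i][i] for i in range(n)]   # Jacobi preconditioner: inverse of diag(A)
    z = [d * ri for d, ri in zip(inv_diag, r)]     # preconditioned residual
    p = list(z)                                    # first search direction
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 <= tol:                # stop once the residual norm is small enough
            break
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)                    # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        beta = rz_new / rz                         # keeps the new direction conjugate to the old ones
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x
```

Each iteration carries its state in a handful of vectors (`x`, `r`, `z`, `p`), and nearly all of the work is in the matrix-vector product, which is why such solvers dominate the run time of applications built around them.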

This deliverable describes the analyses performed on the different applications being considered as part of the co-design process in Mont-Blanc 3. We present different types of results for 18 applications. For most of them we performed evaluations on Mont-Blanc platforms and analyses on both ARM and Intel based platforms. The different analyses try to identify fundamental issues that limit performance on the currently available platforms and through predictive studies we identify issues that will be relevant at larger scales. For each application we summarize, at the end of the corresponding section, the fundamental issues and possible co-design alternatives to consider.

This document presents a methodology for selecting interesting regions within applications that exhibit variety with respect to processor resource demands and are representative of a set of benchmark applications. Selecting such a benchmark subset allows for piecewise optimization of an application by replaying the selected regions on a simulator, rather than executing all the applications, thus reducing simulation time. First, we briefly describe the chosen applications from PARSEC and LULESH, and then present the detailed approach to select the regions. Next, we apply the proposed approach and show how the selected regions can be used for benchmarking novel accelerators and for performance tuning of big and LITTLE processors in heterogeneous architectures.

This deliverable describes the results generated in porting and tuning the applications considered as part of the co-design process in Mont-Blanc 3 for ARM. We present our optimization experiences for 9 applications. For most of them we performed evaluations on Mont-Blanc 3 platforms and analyses on both ARM and Intel-based platforms. Building on Deliverable 6.1, which tried to identify root causes of scaling issues, we continued with analysis and implemented optimizations to overcome these performance problems. Overall, load imbalance was identified as a very important issue. Most applications needed to address this either by improving algorithms and domain decompositions, by distributing their resources to fewer MPI ranks and more threads, or via dynamic load balancing (DLB). For the HPCG benchmark, a performance analysis and an initial algorithmic optimization are presented. The work presented is not ARM-specific, but it has been tested on an ARM-based cluster by a team of students. Following the directive in [1], LULESH has been ported to OmpSs and tested on the Mont-Blanc 3 mini-clusters. In the ARM ecosystem, the ARM Performance Libraries have been evaluated on a widely used scientific suite, QuantumESPRESSO. The results indicate speedups when using ARMPL for linear algebra workloads, and highlight opportunities to improve the FFT functions. In addition, the recently released ARM compiler was compared to GCC. Performance and usability were comparable; further investigation into which compiler is preferable for which type of workload is suggested. For the applications in cardiac modelling and mesh deformation, we generally find optimizations stemming from analysis on Intel systems advantageous for ARM systems and vice versa; for example, work to scale to the high-core-density ThunderX system proved valuable for performance on many-core x86 systems. For some of the applications, power measurements are presented; these numbers will be used as a baseline when comparing performance and power figures on the final Mont-Blanc 3 demonstrator under deployment in WP3.
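
The load imbalance mentioned above is often quantified as the ratio between the average and the maximum useful time across processes. The Python sketch below computes that ratio; it is a commonly used definition given here as an assumption, not necessarily the metric used in the deliverable, and the function name and input format are hypothetical.

```python
def load_balance_efficiency(times):
    """Return average / maximum of per-process times, a value in (0, 1].

    `times` is an assumed list of non-negative per-rank compute times.
    1.0 means perfect balance; lower values mean ranks wait for the slowest one.
    """
    if not times or any(t < 0 for t in times):
        raise ValueError("times must be a non-empty list of non-negative numbers")
    longest = max(times)
    if longest == 0:
        return 1.0                     # nothing was computed, so nothing is imbalanced
    return (sum(times) / len(times)) / longest
```

For per-rank times of 4, 2 and 2 seconds the efficiency is about 0.67, meaning roughly a third of the aggregate core time is spent waiting for the slowest rank.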

This report collects two major contributions to the project: 1. In Section 1 we present the porting to OmpSs/OpenMP4.0 of several production applications and mini-apps of the project. We focus our report on explaining new functionalities of the OmpSs programming model that we consider disruptive, and we show their actual benefits when executing on large HPC machines. 2. In Section 2 we report the results of the successful porting and testing of CERE on ARM architectures. This allows regions of interest of large applications running on ARM platforms, called codelets, to be extracted and then "replayed" on architectural simulators, reducing their simulation time, or used for the development of mini-applications.