Hadoop/Map-Reduce clustered bioinformatics applications

Principal Investigator: 

G. Fiameni, M. Rosati.

For Mont-Blanc project:
Nico Sanna, CINECA

Other application users/developers: 
  • CNR Flagship Project EPIGEN
  • Industrial partners and Health services
Scientific area: 
As stated in the MontBlanc2 DoW for WP3, wherever applicable, we will “… test the prototype platforms configured as a Map-Reduce/Hadoop distributed engine on top of which data-driven applications like those in the area of genomics may be implemented and bioinformatics tools to query, align and collect DNA sequences properly benchmarked”. CINECA is currently opening a multi-tiered server and storage infrastructure able to provide top-class solutions for high-performance data-handling applications. At the centre of the software stack is an OpenStack virtualization engine (NUBES), on top of which several IaaS/PaaS services will be released; among these, hypervised Hadoop/MapReduce (H/MR) clusters will be tested. With the hypervised H/MR solution implemented in NUBES and the bioinformatics benchmarks set up, we will have at our disposal a finely tunable virtualized cluster that can also be closely adapted to the physical infrastructure eventually deployed in MontBlanc2 and that is capable of hosting this kind of HTC application. Among others, bioinformatics applications and workflows will be implemented in the H/MR cluster over NUBES to model the HTC pipeline and, when ready, deployed on the capable Mont-Blanc2 prototypes. The test applications will cover many bioinformatics sectors of interest, with a focus on Next Generation Sequencing and epigenomics: bowtie, pattern search and multiple alignment, to cite just a few.
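To illustrate how a pattern search over sequencing reads fits the Map-Reduce model, the following is a minimal sketch in plain Python. It does not use the Hadoop API and is not one of the benchmark applications; the function names (`map_read`, `shuffle`, `reduce_hits`, `run_job`) and the sample reads are hypothetical, chosen only to show the map, shuffle and reduce phases that an H/MR cluster would distribute across nodes.

```python
from collections import defaultdict
from itertools import chain


def map_read(read_id, sequence, pattern):
    """Map phase (hypothetical helper): emit one (pattern, (read_id, offset))
    pair for every occurrence of the pattern, overlapping matches included."""
    start = sequence.find(pattern)
    while start != -1:
        yield pattern, (read_id, start)
        start = sequence.find(pattern, start + 1)


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key, as the framework
    would do between mappers and reducers."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_hits(key, values):
    """Reduce phase: count the hits for one key and list them in order."""
    return key, len(values), sorted(values)


def run_job(reads, pattern):
    """Run the three phases sequentially in a single process."""
    mapped = chain.from_iterable(
        map_read(read_id, seq, pattern) for read_id, seq in reads.items()
    )
    return [reduce_hits(key, values) for key, values in shuffle(mapped).items()]


# Made-up reads, for illustration only.
reads = {"read1": "ACGTACGT", "read2": "TTACGA", "read3": "GGGG"}
results = run_job(reads, "ACG")
print(results)
```

In a real H/MR deployment each mapper would receive only a split of the input reads and the grouping would be done by the framework; here everything runs in one process so the data flow stays visible.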
Typical H/MR infrastructures scale well up to hundreds of nodes on Infiniband clusters and/or similar computing architectures.
HTC 10 Gbps and Infiniband environment
Tested on platforms: 

Infiniband and 10 Gbps clusters.