The Argonne National Laboratory/MCS/Extreme Scale Resilience group covers fault tolerance and resilience for HPC simulations and data analytics at extreme scale.

Lead: Franck Cappello, ANL

News

Jan. 2015 Using Data Analytics to Detect Corruptions in Numerical Simulations, short presentation at BDEC Barcelona
Nov. 2014 FTI 0.9.5 release
Nov. 2014 2nd workshop of the Joint-Laboratory on Extreme Scale Computing
Nov. 2014 FTI demo on the DoE booth at SC14
Nov. 2014 FTI presented at the Journée thématique «Impact des nouveaux calculateurs pour l’océan et l’atmosphère» by Julien Bigot (CEA – Maison de la Simulation)
July 2014 Our journal article "Toward Exascale Resilience: 2014 update" is presented in HPCwire
April 2014 Inria-UIUC/NCSA-ANL-BSC Joint-lab workshop, June 9-11
March 2014 Presentation of the G8 ECS project results at Kobe
Feb. 2014 BDEC workshop at Fukuoka (Japan): The Need for Resilience Research in Workflows of Big Compute and Big Data Scientific Application
Dec. 2013 Data compression algorithm based on masking presented at the DoE SC Associate Director's meeting.
Nov. 2013 Mini workshop on Resilience at the 10th workshop of the INRIA-Illinois-ANL joint laboratory on Petascale computing
Nov 2013 Large presence of the ESR group at SC13: 2 papers, 1 panel, 2 Birds of a Feather (BoF) sessions, 1 Emerging Technology demo, and chairing of the Test of Time Award

Oct 2013: PUF (Partnership University Fund) project “Preparing for Next Generation Numerical Simulation Platforms” accepted 
Oct. 2013: F. Cappello gives an invited talk at the Extreme Scale CoDesign Meeting in China
Oct 2013: The Paris project on Silent soft errors/data corruptions detection funded for 3 years
 

Topics and people

  • Multi-level Checkpoint / Restart: Leonardo Bautista Gomez (Postdoc ANL), Di Sheng (Postdoc Inria)
  • Silent soft errors/data corruptions detectors and compression: Leonardo Bautista Gomez (Postdoc ANL)
  • Failure characterization and prediction: Ana Gainaru (Ph. D. candidate UIUC)
  • Failure modeling and fault tolerance optimizations: Mohamed Slim Bouguerra (Postdoc Inria)
  • Fault tolerance protocols: Tatiana V. Martsinkevich (Ph. D. candidate Inria)
  • Error propagation modeling: Vincent Baudoui (Postdoc Total)

Main collaborators: Marc Snir (ANL and UIUC), Bill Kramer (UIUC), Bogdan Nicolae (IBM Dublin), Thomas Ropars (EPFL), Amina Guermouche (UVSQ), Frederic Vivien (Inria), Yves Robert (LIP), Satoshi Matsuoka (Titech), Mitsuhisa Sato (U. Tsukuba).

Tools and software

  • FTI (operational prototype): Fault Tolerance Interface for multi-level checkpoint/restart (in-memory checkpointing, checkpointing on remote nodes, erasure encoding, etc.)
  • HELO/ELSA (operational prototypes): System event clustering and failure predictor
  • MPICH-HFT (prototype under development): Fault-tolerant MPI with a hierarchical fault-tolerant protocol

Main collaborative activities

Recent Publications (from 2013)

  • E. Berrocal, L. Bautista-Gomez, S. Di, Z. Lan, F. Cappello, Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications, short paper, ACM HPDC 2015

  • S. Di, E. Berrocal, F. Cappello, An Efficient Silent Data Corruption Detection Method with Error-feedback Control and Even Sampling for HPC Applications, IEEE CCGRID 2015
  • S. Di, E. Berrocal, K. Heisey, L. Bautista-Gomez, R. Gupta, F. Cappello, Towards Effective Detection of Silent Data Errors for HPC Applications, Poster, IEEE/ACM SC14

  • L. Bautista Gomez, P. Balaprakash, S. Bouguerra, S. Wild, F. Cappello and P. Hovland, Energy-Performance Tradeoffs in Multilevel Checkpoint Strategies, Poster, IEEE Cluster 2014

  • S. Bouguerra, A. Gainaru, F. Cappello, Failure prediction: what to do with unpredicted failures?, to appear in International Journal of High Performance Computing Applications.
  • F. Cappello, A. Geist, B. Gropp, B. Kramer, M. Snir, Toward Exascale Resilience: 2014 update, International Journal on Supercomputing Frontiers and Innovations, Vol 1, Num 1, 2014, http://superfri.org/superfri/article/view/14
  • S. Di, L. Bautista-Gomez, F. Cappello, Optimization of Multi-level Checkpoint Model with Uncertain Execution Scales, to appear in IEEE/ACM SC14
  • M. Snir et al. Addressing failures in exascale computing, to appear in International Journal of High Performance Computing Applications, 2014  
  • S. Di, D. Kondo, and F. Cappello, Characterizing and Modeling Cloud Applications/Jobs on a Google Data Center, To appear in Journal of Supercomputing, 2014.
  • L. Bautista-Gomez, F. Cappello, et al., GPGPUs: How to Combine High Computational Power with High Reliability (Embedded Tutorial), Design, Automation & Test in Europe, DATE'14
  • S. Di, S. Bouguerra, L. Bautista Gomez, F. Cappello, Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications, IEEE IPDPS 2014
  • S. Di, C.-L. Wang, F. Cappello, Adaptive Algorithm for Minimizing Cloud Task Length with Prediction Errors, IEEE Transactions on Cloud Computing.
  • L. Bautista Gomez and F. Cappello, Detecting Silent Data Corruption through Data Dynamic Monitoring for Scientific Applications, Poster, to appear in Proceedings of ACM PPoPP 2014
  • G. Bosilca, A. Bouteiller, E. Brunet, F. Cappello, J. Dongarra, A. Guermouche, T. Herault, Y. Robert, F. Vivien, D. Zaidouni, Unified Model for Assessing Checkpointing Protocols, to appear in Concurrency and Computation: Practice and Experience, Wiley, 2013
  • L. Bautista Gomez, F. Cappello, Improving Floating Point Compression through Binary Masks, Proceedings of IEEE BigData 2013 
  • T. Ropars, T. Martsinkevich, A. Guermouche, A. Schiper, F. Cappello, SPBC: Leveraging the Characteristics of MPI HPC Applications for Scalable Checkpointing, Proceedings of IEEE/ACM SC13
  • S. Di, Y. Robert, F. Vivien, D. Kondo, C. L. Wang, F. Cappello, Optimization of Cloud Task Processing with Checkpoint-Restart Mechanism, Proceedings of IEEE/ACM SC13
  • A. Bouteiller, F. Cappello, J. Dongarra, A. Guermouche, T. Herault and Y. Robert, Multi-criteria checkpointing strategies: optimizing response-time versus resource utilization, Proceedings of Euro-Par 2013
  • S. Di, D. Kondo, F. Cappello, Characterizing Cloud Applications on a Google Data Center, short paper, Proceedings of ICPP 2013
  • A. Gainaru, F. Cappello, M. Snir, B. Kramer, Failure prediction for HPC systems and applications: current situation and open issues, International Journal of High Performance Computing Applications, SAGE, 2013
  • B. Nicolae, F. Cappello, AI-Ckpt: Leveraging Memory Access Patterns for Adaptive Asynchronous Incremental Checkpointing, Proceedings of ACM HPDC 2013
  • B. Nicolae, F. Cappello, BlobCR: Virtual Disk Based Checkpoint-Restart for HPC Applications on IaaS Clouds, to appear in Journal of Parallel and Distributed Computing, 2013
  • M. El Mehdi Diouri, O. Gluck, L. Lefevre, F. Cappello, ECOFIT: A Framework to Estimate Energy Consumption of Fault Tolerance protocols during HPC executions, Proceedings of IEEE CCGRID 2013
  • M. S. Bouguerra, A. Gainaru, F. Cappello, L. Bautista Gomez, N. Maruyama and S. Matsuoka, Improving the computing efficiency of HPC systems using a combination of proactive and preventive checkpointing, Proceedings of IEEE IPDPS 2013
  • M. El Mehdi Diouri, O. Gluck, L. Lefevre, F. Cappello, Towards an energy estimator of fault tolerance protocols, Poster, in Proceedings of ACM PPoPP 2013
