The Argonne National Laboratory/MCS/Extreme Scale Resilience group covers fault tolerance and resilience for HPC simulations and data analytics at extreme scale.

Lead: Franck Cappello, ANL

News

 

Nov 2013: Mini-workshop on Resilience at the 10th workshop of the INRIA-Illinois-ANL Joint Laboratory on Petascale Computing
Nov 2013: Large presence of the ESR group at SC13: 2 papers, 1 panel, 2 Birds of a Feather (BOF) sessions, 1 Emerging Technology Demo, plus chairing the Test of Time Award
Oct 2013: PUF (Partnership University Fund) project “Preparing for Next Generation Numerical Simulation Platforms” accepted 
Oct 2013: F. Cappello, invited talk at the Extreme Scale CoDesign Meeting in China
Oct 2013: The Paris project on detection of silent soft errors/data corruptions funded for 3 years
Sept 2013: G8 BOF accepted at SC13
Sept 2013: FTI Emerging Technology Demo accepted at SC13
July 2013: Panel "Fault Tolerance/resilience at Petascale/Exascale: Is it really critical? Are solutions necessarily disruptive?" accepted at IEEE/ACM SC13
July 2013: 2 papers accepted at IEEE/ACM SC13
July 2013: F. Cappello, keynote at HPCS 2013, Helsinki
July 2013: FTI 0.9 release available at https://gforge.inria.fr/projects/fti/. Please contact Leonardo Bautista Gomez for more details.
June 2014: F. Cappello, Program co-chair of ACM HPDC 2014
June 2013: F. Cappello, organizer of the "system software challenges" session at ISC 2013

Topics and people

  • Multi-level Checkpoint / Restart: Leonardo Bautista Gomez (Postdoc ANL), Sheng Di (Postdoc Inria)
  • Silent soft errors/data corruptions detectors and compression: Leonardo Bautista Gomez (Postdoc ANL)
  • Failure characterization and prediction: Ana Gainaru (Ph. D. candidate UIUC)
  • Failure modeling and fault tolerance optimizations: Mohamed Slim Bouguerra (Postdoc Inria)
  • Fault tolerance protocols: Tatiana V. Martsinkevich (Ph. D. candidate Inria)
  • Error propagation modeling: Vincent Baudoui (Postdoc Total)

Main collaborators: Marc Snir (ANL and UIUC), Bill Kramer (UIUC), Bogdan Nicolae (IBM Dublin), Thomas Ropars (EPFL), Amina Guermouche (UVSQ), Frederic Vivien (Inria), Yves Robert (LIP), Satoshi Matsuoka (Titech), Mitsuhisa Sato (U. Tsukuba).

Tools and software

  • FTI (operational prototype): Fault Tolerance Interface for multi-level checkpoint/restart (in-memory checkpointing, checkpointing on remote nodes, erasure encoding, etc.)
  • HELO/ELSA (operational prototypes): System event clustering and failure prediction
  • MPICH-HFT (prototype under development): Fault-tolerant MPI with a hierarchical fault-tolerance protocol

Main collaborative activities

Recent Publications (from 2013)


  • S. Di, D. Kondo, and F. Cappello, Characterizing and Modeling Cloud Applications/Jobs on a Google Data Center, To appear in Journal of Supercomputing, 2014.
  • S. Di, S. Bouguerra, L. Bautista Gomez, F. Cappello, Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications, IEEE IPDPS 2014
  • S. Di, C.-L. Wang, F. Cappello, Adaptive Algorithm for Minimizing Cloud Task Length with Prediction Errors, IEEE Transactions on Cloud Computing, to appear
  • L. Bautista Gomez and F. Cappello, Detecting Silent Data Corruption through Data Dynamic Monitoring for Scientific Applications, Poster, to appear in Proceedings of ACM PPoPP 2014
  • G. Bosilca, A. Bouteiller, E. Brunet, F. Cappello, J. Dongarra, A. Guermouche, T. Herault, Y. Robert, F. Vivien, D. Zaidouni, Unified Model for Assessing Checkpointing Protocols, To appear in Concurrency and Computation: Practice and Experience, Wiley, 2013
  • L. Bautista Gomez, F. Cappello, Improving Floating Point Compression through Binary Masks, Proceedings of IEEE BigData 2013 
  • T. Ropars, T. Martsinkevich, A. Guermouche, A. Schiper, F. Cappello, SPBC: Leveraging the Characteristics of MPI HPC Applications for Scalable Checkpointing, Proceedings of IEEE/ACM SC13
  • S. Di, Y. Robert, F. Vivien, D. Kondo, C. L. Wang, F. Cappello, Optimization of Cloud Task Processing with Checkpoint-Restart Mechanism, Proceedings of IEEE/ACM SC13
  • A. Bouteiller, F. Cappello, J. Dongarra, A. Guermouche, T. Herault and Y. Robert, Multi-criteria checkpointing strategies: optimizing response-time versus resource utilization, Proceedings of Euro-Par 2013
  • S. Di, D. Kondo, F. Cappello, Characterizing Cloud Applications on a Google Data Center, short paper, Proceedings of ICPP 2013
  • A. Gainaru, F. Cappello, M. Snir, B. Kramer, Failure prediction for HPC systems and applications: current situation and open issues, International Journal of High Performance Computing Applications, SAGE, 2013
  • B. Nicolae, F. Cappello, AI-Ckpt: Leveraging Memory Access Patterns for Adaptive Asynchronous Incremental Checkpointing, Proceedings of ACM HPDC 2013
  • B. Nicolae, F. Cappello, BlobCR: Virtual Disk Based Checkpoint-Restart for HPC Applications on IaaS Clouds, To appear in Journal of Parallel and Distributed Computing, 2013
  • M. El Mehdi Diouri, O. Gluck, L. Lefevre, F. Cappello, ECOFIT: A Framework to Estimate Energy Consumption of Fault Tolerance protocols during HPC executions, Proceedings of IEEE CCGRID 2013
  • M. S. Bouguerra, A. Gainaru, F. Cappello, L. Bautista Gomez, N. Maruyama and S. Matsuoka, Improving the computing efficiency of HPC systems using a combination of proactive and preventive checkpointing, Proceedings of IEEE IPDPS 2013
  • M. El Mehdi Diouri, O. Gluck, L. Lefevre, F. Cappello, Towards an energy estimator of fault tolerance protocols, Poster, in Proceedings of ACM PPoPP 2013

 
