Ensuring correctness in high-performance computing (HPC) applications is one of the fundamental challenges that the HPC community faces today. While significant advances in verification, testing, and debugging have been made to isolate software errors (or defects) in the context of non-HPC software, several factors make achieving correctness in HPC applications and systems much more challenging than in general systems software—growing heterogeneity (architectures with CPUs, GPUs, and special purpose accelerators), massive scale computations (very high degree of concurrency), use of combined parallel programing models (e.g., MPI+X), new scalable numerical algorithms (e.g., to leverage reduced precision in floating-point arithmetic), and aggressive compiler optimizations/transformations are some of the challenges that make correctness harder in HPC. The following report lays out the key challenges and research areas of HPC correctness: DOE Report of the HPC Correctness Summit.
As the complexity of future architectures, algorithms, and applications in HPC increases, the ability to fully exploit exascale systems will be limited without correctness. With the continuous use of HPC software to advance scientific and technological capabilities, novel techniques and practical tools for software correctness in HPC are invaluable.
The goal of the Correctness Workshop is to bring together researchers and developers to present and discuss novel ideas to address the problem of correctness in HPC. The workshop will feature contributed papers and invited talks in this area.
Topics of interest include, but are not limited to:
Authors are invited to submit manuscripts in English structured as technical or experience papers not exceeding 8 pages of content, including everything. Submissions must use the IEEE format.
Submitted papers must represent original unpublished research that is not currently under review for any other venue. Papers not following these guidelines will be rejected without review. Submissions received after the due date, exceeding length limit, or not appropriately structured may also not be considered. At least one author of an accepted paper must register for and attend the workshop. Authors may contact the workshop organizers for more information. Papers should be submitted electronically at: https://submissions.supercomputing.org/?page=Submit&id=SC18WorkshopCorrectness2018Submission&site=sc18.
We encourage authors to submit an optional artifact description (AD) appendix along with their paper, describing the details of their software environments and computational experiments to the extent that an independent person could replicate their results. The AD appendix is not included in the 8-page limit of the paper and should not exceed 2 pages of content. For more details of the SC Reproducibility Initiative please see: https://sc18.supercomputing.org/submit/sc-reproducibility-initiative/.
The proceedings will be archived in IEEE Xplore via TCHPC.
All time zones are AOE.
Ignacio Laguna, LLNL
Cindy Rubio-González, UC Davis
Eva Darulova, MPI-SWS, Germany
Ganesh Gopalakrishnan, University of Utah, USA
Paul Hovland, Argonne National Laboratory, USA
Geoff Hulette, Sandia National Laboratories, USA
Costin Iancu, Lawrence Berkeley National Laboratory, USA
Sriram Krishnamoorthy, Pacific Northwest National Laboratory, USA
Richard Lethin, Reservoir Labs, Yale University, USA
Francesco Logozzo, Facebook Research, USA
Jackson Mayo, Sandia National Laboratories, USA
John Mellor-Crummey, Rice University, USA
Matthias Müller, RWTH Aachen University, Germany
Tristan Ravitch, Galois, Inc, USA
Nathalie Revol, INRIA - ENS de Lyon, France
Markus Schordan, Lawrence Livermore National Laboratory, USA
Koushik Sen, UC Berkeley, USA
Stephen Siegel, University of Delaware, USA
Kay Bailey Hutchison Convention Center
650 S Griffin St, Dallas, TX 75202
Keynote Speaker 1 (Morning Session)
Bio: Ganesh L. Gopalakrishnan (Senior Member of IEEE and ACM Distinguished Scientist) earned his B.Sc.(EE) degree from NIT Calicut in 1978, M.Tech (EE) from IIT Kanpur in 1980, and PhD in Computer Science from Stony Brook University in 1986, when he joined the University of Utah.
His external engagements include Visiting Assistant Professorship at the University of Calgary (1988-89) and sabbatical visits at Stanford University (1995-96), Intel (2002-03), and sabbatical projects with Microsoft on developing parallel computing curriculum (2009-10). He has authored a 2006 Springer book “Computation Engineering: Applied Automata Theory and Logic,” and is finishing a 2018 textbook using Jupyter notebooks in undergraduate Discrete Math and Automata Theory classes.
He has published over 180 refereed papers, and has graduated 21 PhD students. He is serving as the Director of the Center for Parallel Computing at Utah (“CPU”). He was awarded one of the six “Beacons of Excellence” Awards for 2012 by the University of Utah for his work on mentoring undergraduates.
His currently active projects are: Verification Methods and Tool Frameworks for Parallel and Concurrent Systems, Formal Techniques for System Resilience, Floating-point Precision Tuning, and Data Race Checking for OpenMP and GPUs. His current research grants and contracts are from NSF (CISE) and DOE (in collaboration with LLNL and PNNL).
Abstract: Formal methods include rigorous specification methods that can render language standards reliable and unambiguous. They also include rigorous testing methods that target well-specified coverage criteria, and formal concepts that help guide debugging tool implementations. Those who say formal methods don’t apply to HPC probably misunderstand formal methods to be some esoteric diversion, and not as a software productivity booster in the sense we describe.
Undoubtedly, HPC correctness is far too complex: there are the accidentally flipping bits, unpredictable floating point rounding, threads that incur a data race, and capricious compilers whose optimizations change results. All these can severely impact overall productivity. Formal approaches are possible for many of these pursuits, while for others they may emerge if one persists, and if there is a community invested in developing them in the long run. A worthwhile direction is to invest in formal methods based pedagogy: not only does this help buy us some time to develop useful formal methods for HPC, but it also gives some hope to save future generations from today’s debugging drudgery. Today’s parallel computing education still only gives lip service to correctness - let alone formal.
My talk will try and present examples of all of this. We will present examples where formal ideas did transition to production-level data race checking tools. I will also present examples where we finished production-level tools for floating-point non-reproducibility in the field, and hope to backfill the formalism eventually.
Eventually, as Rushby says, Formal Methods must “disappear” - be incorporated into standard practice and we don’t see them.
Keynote Speaker 2 (Afternoon Session)
Bio: James Demmel is the Dr. Richard Carl Dehmel Distinguished Professor of Computer Science and Mathematics at the University of California at Berkeley, and Chair of the EECS Dept. His research is in numerical linear algebra, HPC, and communication avoiding algorithms. He is known for his work on the LAPACK and ScaLAPACK linear algebra libraries. He is a member of the NAS, NAE, and American Academy of Arts and Sciences; a Fellow of the AAAS, ACM, AMS, IEEE and SIAM; and winner of the IPDPS Charles Babbage Award, IEEE Computer Society Sidney Fernbach Award, the ACM Paris Kanellakis Award, and numerous best paper prizes.
Abstract: We consider two related aspects of analyzing and guaranteeing correctness
of floating point programs: exception handling and reproducibility.
Exception handling refers to reliable and consistent propagation of errors due
to overflow, invalid operations (like sqrt(-1)
), convergence failures, etc.
Reproducibility refers to getting bitwise reproducible results from multiple
runs of the same program, e.g., despite parallelism causing floating point
sums to be evaluated in different order with different roundoff errors.
We describe the efforts of two standards committees, the Basic Linear
Algebra Subprograms (BLAS) Standard, and the IEEE 754 Floating Point
Standard, to address these issues, and how these efforts should make it
easier to accomplish these goals for higher level applications, such as
linear algebra libraries.
09:00am - 09:05am: Opening remarks | |
09:05am - 10:00am: Keynote Speaker 1: "Making Formal Methods for HPC Disappear", Ganesh L. Gopalakrishnan (slides) |
10:00am - 10:30am: Break (coffee provided by SC18) |
10:30am - 10:50am: "Hybrid Theorem Proving as a Lightweight Method for Verifying Numerical Software", Alper Altuntas, John Baugh (slides) | |
10:50am - 11:10am: "HPC Software Verification in Action: A Case Study with Tensor Transposition", Erdal Mutlu, Ajay Panyala, Sriram Krishnamoorthy (slides) |
11:10am - 11:30am: "Correctness of Dynamic Dependence Analysis for Implicitly Parallel Tasking Systems", Wonchan Lee, George Stelle, Patrick McCormick, Alex Aiken (slides) | |
11:30am - 11:50am: "Verifying Qthreads: Is Model Checking Viable for User Level Tasking Runtimes?", Noah Evans (slides) |
11:50am - 12:10pm: "Incremental Static Race Detection in OpenMP Programs", Bradley Swain, Jeff Huang (slides) | |
12:10pm - 12:30pm: "Using Polyhedral Analysis to Verify OpenMP Applications are Data Race Free", Fangke Ye, Markus Schordan, Chunhua Liao, Pei-Hung Lin, Ian Karlin, Vivek Sarkar (slides) |
12:30pm - 02:00pm: Lunch (on your own) |
02:00pm - 03:00pm: Keynote Speaker 2: "Correctness of Floating Point Programs - Exception Handling and Reproducibility", James Demmel (slides) |
03:00pm - 03:30pm: Break (coffee provided by SC18) |
03:30pm - 03:50pm: "Compiler-Aided Type Tracking for Correctness Checking of MPI Applications", Alexander Hück, Jan-Patrick Lehr, Sebastian Kreutzer, Joachim Protze, Christian Terboven, Christian Bischof, Matthias S. Müller (slides) | |
03:50pm - 04:10pm: "Towards Deductive Verification of Message-Passing Parallel Programs", Ziqing Luo, Stephen F. Siegel (slides) | |
04:10pm - 04:30pm: "PARCOACH Extension for a Full-Interprocedural Collectives Verification", Pierre Huchant, Emmanuelle Saillard, Denis Barthou, Hugo Brunie, Patrick Carribault (slides) |
04:30pm - 05:30pm: Panel on: Facilitating the Adoption of Correctness Tools in HPC Applications | |
Panelist 1: Alper Altuntas, National Center for Atmospheric Research | |
Panelist 2: Ganesh Gopalakrishnan, University of Utah | |
Panelist 3: Mike Lam, James Madison University | |
Panelist 4: Markus Schordan, LLNL |
Please address workshop questions to Ignacio Laguna (ilaguna@llnl.gov) and/or Cindy Rubio-González (crubio@ucdavis.edu).