Stefan Ceballos, Oak Ridge National Laboratory
The Oak Ridge Leadership Computing Facility (OLCF) in the National Center for Computational Sciences (NCCS) division at Oak Ridge National Laboratory (ORNL) houses world-class high-performance computing (HPC) resources and has a history of operating top-ranked supercomputers on the TOP500 list, including the world's current fastest, Summit, an IBM AC922 machine with a peak performance of 200 petaFLOPS. With the exascale era rapidly approaching, a robust and scalable big data platform for operations data is more important than ever. In the past, when a new HPC resource was added to the facility, pipelines from data sources spanned multiple data sinks, which often resulted in data silos, slow operational data onboarding, and non-scalable data pipelines for batch processing. Using Apache Kafka as the message bus of the division's new big data platform has made it easier to decouple scalable data pipelines, sped up data onboarding, and enabled stream processing, with the goal of continuously improving insight into the HPC resources and their supporting systems. This talk will focus on the NCCS division's transition to Apache Kafka over the past few years to enhance the OLCF's current capabilities and prepare for Frontier, the OLCF's future exascale system, including the development and deployment of a full big data platform in a Kubernetes environment, from both a technical and a cultural-shift perspective. This talk will also cover the mission of the OLCF, the operational data insights related to high-performance computing that the organization strives for, and several use cases that exist in production today.