A critical purpose of parallel file systems used in high performance computing is to quickly capture and durably hold checkpoints of long-running, massive jobs in case of faults. Most of the largest computers use parallel file systems designed to decouple a highly parallel data object layer from a strongly serializable, and too often non-scalable, metadata management layer. Checkpointing state into a separate file per compute process usually provides the highest capture bandwidth, although it also induces a massive spike of file creation. As file creation latency approaches the duration of the rest of the checkpoint capture, the metadata management layer of parallel file systems must itself become scalable. In this talk I will review a set of techniques to improve parallel file system metadata management scalability for high performance computing. We change the on-disk representation to be write-optimized rather than read-optimized to accelerate spikes of mutation. We dynamically split partitions of directories to balance spatial locality for efficient small directories against parallel throughput for massive directory create spikes. We develop a bulk insertion mechanism for efficient partition splitting. We develop a client caching scheme that does not require per-client state in the metadata server, and tailor it to multiple predictors of write conflict probability. We demonstrate an extended client caching option in which whole subtrees of new metadata can be write-back cached on clients and efficiently bulk inserted, in much larger batches, into the common metadata service. In addition to explaining these mechanisms and their performance, I will overview more aggressive scaling strategies under development. Specifically, we are exploring concurrent client-based metadata mutation using snapshot consistency, workflow propagation of metadata writeback logs, and client-funded validation of long-delayed metadata writeback logs.
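
To make the directory-splitting idea concrete, here is a toy sketch (hypothetical names and a deliberately tiny split threshold, not the speaker's implementation): each directory's entries are hash-partitioned, partitions are placed round-robin across an assumed set of metadata servers, and a partition that outgrows its threshold is split by moving roughly half of its entries to a new sibling partition, which in a real system would be a bulk insert into a write-optimized (LSM-style) key-value store rather than an in-memory dict.

    # Illustrative sketch only (hypothetical names, not the speaker's implementation).
    import hashlib

    NUM_SERVERS = 4        # assumed number of metadata servers
    SPLIT_THRESHOLD = 4    # assumed per-partition entry limit (tiny, for the demo)


    def name_hash(name: str) -> int:
        """Stable hash of a file name, used to pick its partition."""
        return int(hashlib.md5(name.encode()).hexdigest(), 16)


    class Directory:
        """A directory whose entries live in one or more hash partitions."""

        def __init__(self, dir_id: int):
            self.dir_id = dir_id
            self.max_depth = 0
            # partition index -> (local depth in bits, {file name: attributes})
            self.partitions = {0: (0, {})}

        def partition_of(self, name: str) -> int:
            """Find the partition whose hash suffix matches this name."""
            h = name_hash(name)
            for depth in range(self.max_depth, -1, -1):
                idx = h & ((1 << depth) - 1)
                if idx in self.partitions:
                    return idx
            raise AssertionError("partition 0 always exists")

        def server_of(self, idx: int) -> int:
            """Partitions are placed round-robin across metadata servers."""
            return idx % NUM_SERVERS

        def create(self, name: str, attrs: dict) -> None:
            """Insert a new entry, splitting its partition if it grows too large."""
            idx = self.partition_of(name)
            depth, entries = self.partitions[idx]
            entries[name] = attrs
            if len(entries) > SPLIT_THRESHOLD:
                self.split(idx)

        def split(self, idx: int) -> None:
            """Move entries whose next hash bit is 1 into a new sibling partition."""
            depth, entries = self.partitions[idx]
            sibling = idx | (1 << depth)
            stay = {n: a for n, a in entries.items() if not name_hash(n) & (1 << depth)}
            move = {n: a for n, a in entries.items() if name_hash(n) & (1 << depth)}
            self.partitions[idx] = (depth + 1, stay)
            self.partitions[sibling] = (depth + 1, move)
            self.max_depth = max(self.max_depth, depth + 1)


    # Example: a per-process checkpoint create spike fans out over partitions/servers,
    # while a small directory would have stayed in a single partition on one server.
    d = Directory(dir_id=1)
    for rank in range(32):
        d.create("ckpt.rank-%05d" % rank, {"size": 0})
    for idx, (depth, entries) in sorted(d.partitions.items()):
        print("partition %d (depth %d, server %d): %d entries"
              % (idx, depth, d.server_of(idx), len(entries)))
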
Bio:
Garth Gibson is a professor of computer science at Carnegie Mellon University, the co-founder and chief scientist of Panasas, Inc., and a Fellow of the ACM and the IEEE. He holds an MS and PhD from the University of California at Berkeley and a B.Math from the University of Waterloo in Canada. His research on Redundant Arrays of Inexpensive Disks (RAID) has been recognized with the 1998 SIGMOD Test of Time Award, the 1999 Allen Newell Award for Research Excellence, the 1999 IEEE Reynold B. Johnson Information Storage Award for outstanding contributions in the field of information storage, the 2005 J. Wesley Graham Medal in Computing and Innovation from the University of Waterloo, induction into the ACM SIGOPS Hall of Fame in 2011, and the 2012 IFIP WG10.4 Jean-Claude Laprie Award in Dependable Computing. Gibson founded CMU's Parallel Data Laboratory in 1992 and was a founding member of the Technical Council of the Storage Networking Industry Association, the USENIX Conference on File and Storage Technologies Steering Committee, and the SC Parallel Data Storage Workshop Steering Committee. At Panasas, Gibson led the development of high-performance, scalable, parallel file system appliances used for high performance computing in national labs, academic clouds, energy research, engineered manufacturing, and life sciences. Gibson instigated the standardization of key features of parallel file systems in NFSv4.1 (parallel NFS), now adopted and deployed in Linux. His 1995 Network-Attached Secure Disks (NASD) research led to the ANSI T10 (SCSI) Object Storage Device (OSD) command set. His students have gone on to co-author influential systems such as the Google File System and BigTable, and to lead the technology development of influential products such as EMC's Data Domain. His collaboration with Los Alamos National Laboratory (LANL) led to the Parallel Log-structured File System (PLFS) in 2009 and to SC'14's Best Paper on scaling file system metadata. Recently Gibson has been developing educational courses (Advanced Cloud Computing) and programs (Master's in Computational Data Science) for cloud computing, big data, and data science, and is an investigator in the Intel Science and Technology Center for Cloud Computing. His current students are also collaborating with the Machine Learning Department at CMU to develop radically asynchronous massive-model solvers that trade bounded error for convergence speed in the search for a good predictor of the hidden parameters governing the interpretation of big data sets.