Fast and Science-Preserving Compression for Light Sources

Robert Underwood, Argonne National Laboratory

Crystallography is the leading technique for studying the atomic structures of proteins, and it produces enormous volumes of data that strain the storage and data transfer capabilities of synchrotron and free-electron laser light sources. Lossy compression has been identified as a possible means of coping with the growing data volumes; however, prior approaches have not delivered sufficient quality at a sufficient rate to meet scientific needs. This paper presents Region Of Interest BINning with SZ lossy compression (ROIBIN-SZ), a novel, parallel, and accelerated compression scheme that combines dynamically selected preservation of key regions with lossy compression of background information. We present an extensive evaluation of the performance and quality achieved through the co-design of this compression scheme. We achieve compression ratios of up to 196x and 46.44x on lysozyme and selenobiotinyl-streptavidin, respectively, while preserving the data sufficiently to reconstruct the structure, at bandwidths and scales that approach the needs of upcoming light sources. We conclude with paths forward for extending this approach to additional types of light source datasets.

Robert Underwood is a Post-Doctoral Appointee in the Mathematics and Computer Science Division at Argonne National Laboratory, where he focuses on using data compression to accelerate I/O for large-scale scientific applications, including AI for Science. His library LibPressio, which allows users to easily experiment with and adopt advanced compressors, averages over 200 unique monthly downloads and is used at over 17 institutions across the world. He is also a contributor to the R&D 100-winning SZ family of compressors and to other compression libraries. He regularly mentors students and is Argonne's early career ambassador to the Joint Laboratory for Extreme Scale Computing.

See upcoming and previous presentations at CS Seminar Series