Purpose-built clusters permeate many of today’s organizations, providing both large-scale data storage and computing. Within local clusters, competition for resources complicates data storage and application execution. With the emergence of the cloud’s pay-as-you-go model and elasticity, however, users are increasingly storing portions of their data remotely and allocating compute nodes on demand. This scenario gives rise to the “hybrid cloud”, where data stored across local and cloud resources may be processed over both environments. While a hybrid cloud environment provides many advantages for large-scale applications, transparent data processing and management remain challenging tasks in these settings.
In this talk, I will first introduce our data-intensive scientific computing middleware, which enables transparent data processing on geographically distributed resources such as a hybrid cloud. The proposed middleware eases the programming of large-scale applications through a reduction-based processing structure. We also implement a model-driven resource allocation framework for our middleware that supports time- and cost-sensitive execution using a cloud bursting technique. Our framework decides how many cloud resources to acquire in order to meet either a time or a cost constraint for a data analysis task, given that only a fixed set of local compute resources is available. Second, we will focus on issues related to large-scale data management and I/O bottlenecks. In this context, I will introduce our compression methodology and system, which can incorporate a variety of (de)compression algorithms. The proposed system decouples (de)compression and I/O operations, and automatically orchestrates multi-threaded data retrieval and data (de)compression. Finally, we will discuss how the reduction-based processing structure can be utilized in in-situ and in-transit data analysis, further mitigating I/O bottlenecks in data-intensive applications.
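To make the reduction-based processing structure concrete, here is a minimal sketch in Python. It assumes a generalized map-reduce style API in which each data chunk is reduced locally and the partial results are then combined globally; the names `local_reduce`, `global_combine`, and `reduce_distributed` are illustrative, not the middleware’s actual interface.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def local_reduce(chunk):
    # Per-chunk reduction on one resource, e.g. a partial sum.
    # (Hypothetical placeholder for an application-supplied function.)
    return sum(chunk)

def global_combine(a, b):
    # Merge two partial reduction objects into one.
    return a + b

def reduce_distributed(chunks, workers=4):
    # Local reductions run in parallel; a final combine merges them.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(local_reduce, chunks))
    return reduce(global_combine, partials)
```

The appeal of this structure is that only `local_reduce` and `global_combine` are application-specific; data partitioning across local and cloud resources can be handled transparently by the runtime.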
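As an illustration of time-constrained cloud bursting, the sketch below uses a simple linear performance model to estimate how many cloud instances must be acquired to finish the remaining work by a deadline, given a fixed local cluster. The function name and the assumption of linear per-node throughput are mine for exposition, not the framework’s actual model.

```python
import math

def cloud_nodes_for_deadline(total_work, deadline,
                             local_nodes, local_rate, cloud_rate):
    # Work (in abstract units) the fixed local cluster can finish
    # by the deadline under a linear throughput model.
    local_work = local_nodes * local_rate * deadline
    remaining = max(0.0, total_work - local_work)
    if remaining == 0.0:
        return 0  # local resources alone meet the time constraint
    # Cloud instances must absorb the remainder within the deadline.
    return math.ceil(remaining / (cloud_rate * deadline))
```

A cost-constrained variant would invert the question, maximizing work (or minimizing completion time) subject to a budget on instance-hours.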
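The decoupling of (de)compression from I/O can be sketched as a small producer–consumer pipeline: a reader thread stands in for I/O, fetching compressed blocks into a bounded queue, while worker threads decompress them concurrently. Here `zlib` is merely a stand-in for the system’s pluggable (de)compression algorithms, and the structure is an assumption for illustration.

```python
import queue
import threading
import zlib

def pipeline_decompress(blocks, workers=2):
    # Bounded queue decouples the I/O stage from decompression.
    q = queue.Queue(maxsize=8)
    out = [None] * len(blocks)

    def reader():
        # Simulated I/O: enqueue each fetched compressed block.
        for i, blk in enumerate(blocks):
            q.put((i, blk))
        for _ in range(workers):
            q.put(None)  # poison pill per worker

    def worker():
        while True:
            item = q.get()
            if item is None:
                break
            i, blk = item
            out[i] = zlib.decompress(blk)  # CPU stage, overlapped with I/O

    threads = [threading.Thread(target=reader)]
    threads += [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return b"".join(out)
```

Because blocks are indexed, decompression may complete out of order while the reassembled output stays correct, which is what allows retrieval and (de)compression to overlap.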