David A. B. Hyde

2 Papers

9.4 DC Mar 21
Adviser: An Intuitive Multi-Cloud Platform for Scientific and ML Workflows

Shihan Cheng, Michael A. Laurenzano, Brian Strauch et al.

Effectively leveraging the vast computational resources of modern cloud environments requires expertise spanning multiple technical domains: configuring scientific software with correct parameters and dependencies, navigating thousands of provider-specific instance types and pricing options, and managing parallel or distributed execution. We conduct a study indicating that the absence of these categories of expertise poses an ongoing challenge to unlocking the potential of cloud-enabled computational science. To address this challenge, we introduce Adviser, an intuitive multi-cloud platform centered on a workflow abstraction. Workflows are reusable, expert-crafted artifacts encapsulating environment setup, data processing, simulation, result capture, and visualization steps needed to execute scientific and ML applications. This approach allows users to specify high-level intent, while Adviser handles resource provisioning, runtime configuration, and data movement. Using two computational glaciology codes, Icepack and PISM, we show how to use Adviser to gain scientific insight and perform rapid exploration of cost-performance tradeoffs and scaling behavior without specialized expertise in cloud or high-performance computing.

89.1 DC Apr 27
Incisor: Ex Ante Cloud Instance Selection for HPC Jobs

Michael A. Laurenzano, Shihan Cheng, David A. B. Hyde

We present Incisor, a cloud HPC job submission system for the ex ante instance selection problem: choosing suitable hardware in the challenging but common setting where only the executable, inputs, and invocation commands are available at submission time. In practice, this task is manual and expertise-intensive, requiring users to combine incomplete knowledge of rapidly evolving cloud offerings with workload-specific intuition, static analysis, and systems reasoning to infer hardware constraints and select an instance type for each job. Incisor automates this process by pairing widely available program analysis tools with LLM-guided reasoning to infer hardware requirements and choose cloud instances. Using submission artifacts alone, Incisor, running atop frontier coding LLMs, selects working AWS EC2 instances ex ante for 100% of first-time runs of source-compiled (C, C++, Fortran) and Python applications. Against a strong baseline combining expert-derived constraints with SkyPilot's instance selection, Incisor cuts job runtime by 54% and instance costs by 44%.