
Recent advancements have positioned Large Language Models (LLMs) as transformative tools for scientific research, capable of addressing complex tasks that require reasoning, problem-solving, and decision-making. Their exceptional capabilities suggest their potential as scientific research assistants, but also highlight the need for holistic, rigorous, and domain-specific evaluation to assess effectiveness in real-world scientific applications. This talk describes a multifaceted methodology for Evaluating AI models as scientific Research Assistants (EAIRA) developed at Argonne National Laboratory.
This methodology incorporates four primary classes of evaluation: 1) Multiple Choice Questions to assess factual recall; 2) Open Response to evaluate advanced reasoning and problem-solving skills; 3) Lab-Style Experiments involving detailed analysis of capabilities as research assistants in controlled environments; and 4) Field-Style Experiments to capture researcher-LLM interactions at scale in a wide range of scientific domains and applications. These complementary methods enable a comprehensive analysis of LLM strengths and weaknesses with respect to their scientific knowledge, reasoning abilities, and adaptability. Recognizing the rapid pace of LLM advancements, we designed the methodology to evolve and adapt so that it remains relevant and applicable. This talk describes the current state of the methodology. Although developed within a subset of scientific domains, the methodology is designed to generalize to a wide range of others.
Bio: Franck Cappello received his Ph.D. from the University of Paris XI in 1994 and joined CNRS, the French National Center for Scientific Research. In 2003, he joined INRIA, where he holds the position of permanent senior researcher. In 2003, he initiated and directed the Grid’5000 project, which is still used today and has helped hundreds of researchers run their experiments in parallel and distributed computing and publish more than 2,500 research papers. In 2009, with Marc Snir, Cappello created the Joint Laboratory on Extreme-Scale Computing (JLESC: https://jlesc.github.io), gathering six of the most prominent research and production centers in supercomputing: NCSA, Inria, ANL, BSC, JSC, Riken CCS. From 2008, as a member of the executive committee of the International Exascale Software Project, he led the roadmap and strategy efforts for projects related to resilience for Exascale supercomputers. During ECP (Exascale Computing Project: https://www.exascaleproject.org/), Cappello led the development of VeloC (checkpointing) and SZ (lossy compression) software. Cappello is now focusing on developing methods and tools to evaluate LLMs as scientific assistants. He is an IEEE Fellow and the recipient of the 2025 DOE Secretary's Honor Award for the ECP leadership team, the 2024 IEEE CS Charles Babbage Award, the 2024 Euro-Par Achievement Award, the 2022 HPDC Achievement Award, two R&D 100 awards (2019 and 2021), the 2018 IEEE TCPP Outstanding Service Award, and the 2021 IEEE Transactions on Computers Award for Editorial Service and Excellence.
For more information about upcoming speakers please visit the TPC Seminar Series Webpage: