BLADE

Benchmarking Language Model Agents
for Data-Driven Science

1University of Washington, 2UC Berkeley,
3New York University, 4Stanford University,
5University of British Columbia, 6Microsoft, 7George Washington University
EMNLP 2024
*Denotes equal contribution

Abstract

Data-driven scientific discovery requires the iterative integration of scientific domain knowledge, statistical expertise, and an understanding of data semantics to make nuanced analytical decisions, such as which variables, transformations, and statistical models to consider. LM-based agents equipped with planning, memory, and code execution capabilities have the potential to support data-driven science; however, evaluating these agents on open-ended tasks is challenging due to multiple valid approaches, partially correct steps, and different ways to express the same decisions.

To address these challenges, we present BLADE, a benchmark designed to automatically evaluate agents' multifaceted approaches to open-ended research questions. BLADE consists of 12 datasets and research questions drawn from existing scientific literature, with ground truth collected from independent analyses by expert data scientists and researchers. To facilitate the automatic evaluation of agent responses, we developed computational methods that match different representations of analyses to this ground truth. Although language models possess considerable world knowledge, our evaluation shows they are often limited to basic analyses; in contrast, agents capable of interacting with the underlying data demonstrate improved, yet still non-optimal, diversity in their analytical decision-making. Our work enables the evaluation of agents in the context of data-driven science and provides researchers with deeper insights into the analytical approaches employed by these agents.

BLADE Dataset

Overview

BLADE is an expert-annotated benchmark studying agent performance on open-ended, data-driven research questions, for which there is no single correct analysis. BLADE evaluates how well agents can understand data, integrate it with external scientific domain knowledge, and execute an analysis. In particular, BLADE focuses on the following tasks:

  • Formulating Conceptual Variables: Recognize independent variables (IVs), dependent variables (DVs), and control variables based on domain knowledge and multi-step reasoning, e.g., "Prior literature suggests player physicality influences the referee's perception. We can consider physicality a control."
  • Executing Data Transformations: Apply transformations to operationalize conceptual variables, e.g., using BMI as a proxy for player physicality via the "weight" and "height" columns.
  • Implementing Statistical Models: Choose an appropriate statistical model based on the conceptual variables and transformed data to address the research question, which requires in-depth knowledge of statistical methods and the underlying hypothesis.

We source research questions from existing crowd-sourced analysis studies and statistics textbooks. We follow a rigorous multi-stage annotation procedure to ensure annotation quality. BLADE currently consists of 12 research questions and datasets encompassing over 500 analysis decisions.

Input to BLADE's Evaluation

Given a research question and data table, the agent's submission of an analysis to BLADE is a JSON file containing the following elements (an illustrative sketch follows the list):

  • A list of conceptual variables: Each conceptual variable contains a natural language description (e.g., player physicality) and its variable type (i.e., independent, dependent, or control) with respect to the research question.
  • An executable transformation function: The function takes a data table and returns the table after performing ALL transformations to operationalize the variables.
  • A statistical model function: This takes as input the transformed data table and returns the specified statistical model.
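
Below is a minimal illustrative sketch, in Python, of what such a submission could look like for the soccer/referee example above. The field names, column names (e.g., "skin_tone", "red_cards"), and the specific model are assumptions made for illustration, not BLADE's exact schema; in the actual submission, the transformation and model functions are provided as executable code inside the JSON file.

# Illustrative sketch only: field and column names are assumptions, not BLADE's exact schema.
import pandas as pd
import statsmodels.formula.api as smf

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Operationalize player physicality as BMI from weight (kg) and height (cm)."""
    df = df.copy()
    df["bmi"] = df["weight"] / (df["height"] / 100) ** 2
    return df

def statistical_model(df: pd.DataFrame):
    """Regress red cards on skin tone, controlling for physicality (BMI)."""
    return smf.ols("red_cards ~ skin_tone + bmi", data=df).fit()

submission = {
    "conceptual_variables": [
        {"description": "player skin tone", "type": "independent"},
        {"description": "red cards received", "type": "dependent"},
        {"description": "player physicality (BMI)", "type": "control"},
    ],
    "transform": transform,
    "model": statistical_model,
}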

Experimental Results

We evaluate open-source and closed-source LMs on BLADE in both a one-turn direct prompting setting and a ReAct agent setting, in which the agent can interact with a sandboxed computational notebook environment.
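
Conceptually, the ReAct setting can be pictured as the loop sketched below. This is a schematic sketch under assumptions, not BLADE's actual agent code: the llm argument stands in for an LM API call, and run_cell is a crude stand-in for the sandboxed notebook kernel.

import contextlib
import io
from typing import Callable

def run_cell(code: str, namespace: dict) -> str:
    """Crude stand-in for a sandboxed notebook cell: exec the code and capture its output."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, namespace)
    except Exception as exc:
        return f"Error: {exc}"
    return buf.getvalue() or "(no output)"

def react_loop(question: str, llm: Callable[[str], str], max_steps: int = 10) -> str:
    """Alternate LM thought/action steps with notebook observations until a final answer."""
    namespace: dict = {}
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = llm(transcript)  # expected to contain a Thought plus either an Action (code) or a final answer
        transcript += "\n" + step
        if "Action:" not in step:
            return step  # no further code to run; treat this as the submitted analysis
        code = step.split("Action:", 1)[1]  # naive extraction of the code action
        transcript += "\nObservation: " + run_cell(code, namespace)
    return transcript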

Overall Performance

The best-performing models achieve only around 40% F1 on BLADE, indicating that there is still substantial room for improvement. Generally, the ReAct agent setting outperforms one-turn direct prompting.

Most language models can generate non-empty executable analyses


For generating an analysis, we find that most large LMs can generate a non-empty, executable analysis over 60% of the time, with GPT-4o being the best at 96%. Among the open-source models, Mixtral-8x22b performs best, generating an executable analysis 73% of the time, and DeepSeek-Coder also does surprisingly well at 65%.


Language models are limited to forming basic analyses


Average precision (top row) and coverage@10 (bottom row) percentages averaged across datasets in BLADE. All runs were included in the results. Run errors default to a hit rate of 0 and are counted in the coverage calculation (i.e., treated as a run that generated nothing). Error bars represent bootstrapped 95% confidence intervals.

We also observe low coverage of the ground truth examples, especially with respect to data transformations and specific model specifications. Through qualitatively reviewing a random sample of LM-generated analyses, we find that LMs often perform basic analyses that can yield decent precision (i.e., matching basic decisions) but poor coverage across runs. However, comparing the one-turn and agent settings, LMs consistently achieved higher coverage when allowed to iteratively explore the data. Moreover, ReAct agents perform best overall on coverage for data transformations and statistical modeling, which require a more detailed understanding of data semantics.
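
As a rough illustration of these metrics, the sketch below computes precision for a single generated analysis and coverage@k across multiple runs, assuming the generated decisions have already been matched to ground-truth decisions and are represented as sets of identifiers. This is a simplification for intuition, not BLADE's actual matching-based evaluation code.

from typing import List, Set

def precision(generated: Set[str], ground_truth: Set[str]) -> float:
    """Fraction of a run's decisions that match some ground-truth decision."""
    if not generated:
        return 0.0  # a failed or empty run contributes nothing
    return len(generated & ground_truth) / len(generated)

def coverage_at_k(runs: List[Set[str]], ground_truth: Set[str], k: int = 10) -> float:
    """Fraction of ground-truth decisions hit by at least one of the first k runs.
    Errored runs are passed in as empty sets, i.e., treated as generating nothing."""
    covered: Set[str] = set()
    for run in runs[:k]:
        covered |= run
    return len(covered & ground_truth) / len(ground_truth)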

See the examples of generated analyses using GPT-4o (one of the best performing LMs) via direct prompting and the ReAct framework below.

GPT-4o one-turn direct prompting examples:

GPT-4o ReAct agent examples:


Stronger performance on code generation does not translate directly to performance on BLADE


BLADE performance vs. HumanEval performance. We compare BLADE evaluation metrics against reported Pass@1 on HumanEval.

When comparing our results on analysis generation with those from the HumanEval coding benchmark, we found that most metrics showed a positive correlation, indicating that higher HumanEval performance is broadly associated with higher BLADE performance. However, the coverage metrics correlated more weakly than precision. This suggests that while current training methods, such as Reinforcement Learning from Human Feedback (RLHF) and instruction tuning, optimize for producing a single good solution, they may struggle to generate diverse solutions.
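
One way to quantify such a relationship is a per-model correlation between reported HumanEval Pass@1 and each BLADE metric; the sketch below shows a Pearson correlation in Python. It is illustrative only: the per-model values are not reproduced here, and the paper's exact correlation statistic is not assumed.

from scipy.stats import pearsonr

def metric_correlation(humaneval_pass1, blade_metric) -> float:
    """Pearson correlation between HumanEval Pass@1 and a BLADE metric, one value per model."""
    r, _p_value = pearsonr(humaneval_pass1, blade_metric)
    return r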

BibTeX

@inproceedings{gu2024blade,
  title={BLADE: Benchmarking Language Model Agents for Data-Driven Science},
  author={Ken Gu and Ruoxi Shang and Ruien Jiang and Keying Kuang and Richard-John Lin and Donghe Lyu and Yue Mao and Youran Pan and Teng Wu and Jiaqian Yu and Yikun Zhang and Tianmai M. Zhang and Lanyi Zhu and Mike A. Merrill and Jeffrey Heer and Tim Althoff},
  booktitle={The 2024 Conference on Empirical Methods in Natural Language Processing},
  year={2024}
}