BLADE automatically evaluates on the ** intermediate decisions** (3)
to ** open-ended** data-driven research questions (1). We gathered research questions and datasets
from existing research papers, crowd-sourced analysis studies and statistic textbooks as well as
analyses
from expert annotators (boxes 1-2-3).
Given a research question and dataset, LM agents generate a full
analysis containing the relevant conceptual variables, a data transform function, and a statistical
modeling function (boxes 1-4-5).
BLADE automatically evaluates this against the ground truth.

BLADE consists of real-world open-ended research questions. These
questions
naturally elicit **justifiable alternative** decisions, from data preprocessing to statistical
modeling. We
aim to capture this flexibility in our ground truth annoation process.

We evaluate open source and close source LMs on
BLADE in both a **one turn direct prompting** setting and a **ReAct
agent**
setting in
which the agent can interact with a sandbox notebook environment. The best setting achieves a F1-score
of 44.8% leaving lots of room for improvement.

Data-driven scientific discovery requires the iterative integration of scientific domain knowledge, statistical expertise, and an understanding of data semantics to make nuanced analytical decisions, such as which variables, transformations, and statistical models to consider. LM-based agents equipped with planning, memory, and code execution capabilities have the potential to support data-driven science; however, evaluating these agents on open-ended tasks is challenging due to multiple valid approaches, partially correct steps, and different ways to express the same decisions.

To address these challenges, we present BLADE, a benchmark designed to automatically evaluate agentsโ multifaceted approaches to open-ended research questions. BLADE consists of 12 datasets and research questions drawn from existing scientific literature, with ground truth collected from independent analyses by expert data scientists and researchers. To facilitate the automatic evaluation of agent responses, we developed computational methods that match different representations of analyses to this ground truth. Although language models possess considerable world knowledge, our evaluation shows they are often limited to basic analyses; in contrast, agents capable of interacting with the underlying data demonstrate improved, yet still non-optimal, diversity in their analytical decision-making. Our work enables the evaluation of agents in the context of data-driven science and provides researchers with deeper insights into the analytical approaches employed by these agents.

BLADE is an expert annotated benchmark studying agent performance on
**open-ended** data-driven research questions (there is no single answer result).
BLADE evaluates how well agents can understand data, integrate this with
external scientific
domain knowledge, and execute an analysis. In particular,
BLADE focuses on the following tasks, :

**Formulating Conceptual Variables**: Recognize independent variables (IVs) , dependent variables (DVs), and control variables based on domain knowledge and multi-step reasoning, e.g., ``Prior literature suggests player physicality influences the refereeโs perception. We can consider physicality a control.''**Executing Data Transformations**: Apply transformations to operationalize conceptual variables, e.g., using BMI as a proxy for player physicality via ``weight'' and ``height'' columns.**Implementing Statistical Models**: Choose the appropriate statistical model based on conceptual variables and transformed data to address the research question, requiring in-depth knowledge of statistical methods and the underlying hypothesis

We source research
questions from existing crowd-sourced
analysis studies and statistics textbooks. We follow a rigourous multi-stage annotation procedure to
ensure annotation
quality.
BLADE currently consists of **12 research questions and datasets**
encompassing over 500
analysis decisions.

Examples of tasks in our work and related benchmarks. BLADE prioritizes open-ended research questions and considers the flexibility of analysis approaches in its evaluation.

BLADE follows a multi-stage annotation procedure to ensure
ground-truth
annotation quality.
First, annotators independetly conduct their own analysis given a research question and dataset and
report their analysis decisions (1).
Next, each annotator reviews the analysis decisions of the other annotators and those generated by a
LM to validate each other's approach and also consider new potentially overlooked approaches (2).
Finally, annotators come together, discuss their analysis decisions and reach a consensus on the
ground-truth analysis decisions (3).

Summary of the 12 different source datasets in BLADE.

Given a research question and data table, the agent's submission of an analysis to BLADE is a JSON file containing the following elements:

**A list of conceptual variables**: Each conceptual variable contains a natural language description (e.g., player physicality) and the variable type (i.e., independent, dependent, and control variable) to answer the research question.**An executable transformation function**: The function takes a data table and returns the table after performing ALL transformations to operationalize the variables.**A statistical model function**: This takes as input the transformed data table and returns the specified statistical model

We evaluate open source and close source LMs on
BLADE in both a **one turn direct prompting** setting and a **ReAct
agent**
setting in
which the agent can interact with a sandbox computational notebook environment.

The best performing models get aroung ~40% F1 score on
BLADE indicating that there is still a lot of room for improvement.
Generally, the ReAct agent outperforms the one turn direct prompting setting.

For generating an analysis, we find that most large LMs can generate a non-empty executable analysis over 60% of the time, with GPT-4o being the best at 96%. Among the open-source models, Mixtral-8x22b performs best, generating an executable analysis 73% of the time and DeepSeek-Coder also does surprisingly well at 65%.

Average precision (top row) and coverage@10 (bottom row) percentages averaged across datasets in BLADE. All runs were included in the results. Run errors default to a hit rate of 0 and are counted in the coverage calculation (i.e., treated as a run that generated nothing). Error bars represent bootstrapped 95% confidence intervals.

We also observe low coverage of the ground truth examples, especially with respect to data transformations and specific model specifications. Through qualitatively reviewing a random sample of LM-generated analyses, we find that LMs often perform basic analysis that can yield decent precision (i.e., matching basic decisions) but poor coverage across runs. However, comparing the one-turn and agent settings, LMs consistently had higher coverage when allowed to iteratively explore data. Moreover, ReAct agents perform best overall on coverage for data transformations and statistical modeling, which require a more detailed understanding of data semantics.

See the examples of generated analyses using GPT-4o (one of the best performing LMs) via direct prompting and the ReAct framework below.

**GPT-4o one turn direct prompting examples:**

```
# ======== TRANSFORM CODE ========
def transform(df: pd.DataFrame) -> pd.DataFrame:
# Filter out rows with missing values in rater1 and rater2
df = df.dropna(subset=['rater1', 'rater2'])
# Calculate the average skin tone from rater1 and rater2
df['AvgSkinTone'] = df[['rater1', 'rater2']].mean(axis=1)
# Return the transformed dataframe
return df
# ======== MODEL CODE ========
def model(df: pd.DataFrame) -> Any:
# Prepare the independent variables and control variables
X = df[['AvgSkinTone', 'games', 'yellowCards', 'meanIAT', 'meanExp']]
y = df['redCards']
# Add a constant to the model (intercept)
X = sm.add_constant(X)
# Perform the linear regression
model = sm.OLS(y, X).fit()
# Return the results of the model
return model.summary()
```

Ex. 1) answering the question *Are soccer players with a dark skin tone
more likely than those with a light skin
tone to receive red cards from referees?*

```
# ======== TRANSFORM CODE ========
def transform(df: pd.DataFrame) -> pd.DataFrame:
# Replace missing skin tone ratings with the mean rating
df['rater1'] = df['rater1'].fillna(df['rater1'].mean())
df['rater2'] = df['rater2'].fillna(df['rater2'].mean())
# Calculate the average skin tone rating
df['SkinTone'] = (df['rater1'] + df['rater2']) / 2
# Return the transformed dataframe with only necessary columns
return df[['SkinTone', 'redCards', 'games', 'meanIAT', 'meanExp']]
# ======== MODEL CODE ========
def model(df: pd.DataFrame) -> Any:
# Fit the regression model
X = df[['SkinTone', 'games', 'meanIAT', 'meanExp']]
y = df['redCards']
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
# Return the model results
return model.summary()
```

Ex. 2) answering the question *Are soccer players with a dark skin tone
more likely than those with a light skin
tone to receive red cards from referees?*

```
# ======== TRANSFORM CODE ========
def transform(df: pd.DataFrame) -> pd.DataFrame:
# Replace zero rating with NaN for rater columns indicating missing ratings
df['rater1'].replace(0, np.nan, inplace=True)
df['rater2'].replace(0, np.nan, inplace=True)
# Drop rows where both 'rater1' and 'rater2' are NaN
df.dropna(subset=['rater1', 'rater2'], how='all', inplace=True)
# Calculate the average skin tone rating
df['avgSkinTone'] = df[['rater1', 'rater2']].mean(axis=1)
# One-hot encode the position column
df = pd.get_dummies(df, columns=['position'], drop_first=True)
return df
# ======== MODEL CODE ========
def model(df: pd.DataFrame) -> Any:
# Define the independent variables including controls
indep_vars = ['avgSkinTone', 'games', 'meanIAT', 'meanExp'] + [col for col in df.columns if col.startswith('position_')]
X = df[indep_vars]
# Define the dependent variable
y = df['redCards']
# Add a constant to the model (intercept)
X = sm.add_constant(X)
# Fit the OLS model
model = sm.OLS(y, X)
results = model.fit()
# Return the model summary
return results.summary()
```

Ex. 3) answering the question *Are soccer players with a dark skin tone
more likely than those with a light skin
tone to receive red cards from referees?*

```
# ======== TRANSFORM CODE ========
def transform(df: pd.DataFrame) -> pd.DataFrame:
# Drop rows with missing values in the essential columns
df = df.dropna(subset=['rater1', 'rater2', 'redCards', 'meanIAT', 'meanExp', 'games'])
# Calculate the average skin tone based on the two raters
df['AvgSkinTone'] = df[['rater1', 'rater2']].mean(axis=1)
# Return the transformed dataframe ready for modeling
return df
# ======== MODEL CODE ========
def model(df: pd.DataFrame) -> Any:
# Define the independent variables (including control variables)
X = df[['AvgSkinTone', 'meanIAT', 'meanExp', 'games']]
# Add a constant to the model (intercept)
X = sm.add_constant(X)
# Define the dependent variable
y = df['redCards']
# Fit an OLS regression model
model = sm.OLS(y, X).fit()
# Return the summary of the model
return model.summary()
```

Ex. 4) answering the question

```
# ======== TRANSFORM CODE ========
def transform(df: pd.DataFrame) -> pd.DataFrame:
# Replace zero rating with NaN for rater columns indicating missing ratings
df['rater1'].replace(0, np.nan, inplace=True)
df['rater2'].replace(0, np.nan, inplace=True)
# Drop rows where both 'rater1' and 'rater2' are NaN
df.dropna(subset=['rater1', 'rater2'], how='all', inplace=True)
# Calculate the average skin tone rating
df['avgSkinTone'] = df[['rater1', 'rater2']].mean(axis=1)
# One-hot encode the position column
df = pd.get_dummies(df, columns=['position'], drop_first=True)
return df
# ======== MODEL CODE ========
def model(df: pd.DataFrame) -> Any:
# Define the independent variables including controls
indep_vars = ['avgSkinTone', 'games', 'meanIAT', 'meanExp'] + [col for col in df.columns if col.startswith('position_')]
X = df[indep_vars]
# Define the dependent variable
y = df['redCards']
# Add a constant to the model (intercept)
X = sm.add_constant(X)
# Fit the OLS model
model = sm.OLS(y, X)
results = model.fit()
# Return the model summary
return results.summary()
```

Ex. 5) answering the question

**GPT-4o ReAct agent examples:**

```
# ======== TRANSFORM CODE ========
def transform(df: pd.DataFrame) -> pd.DataFrame:
# Drop rows with missing values in critical columns
df_clean = df.dropna(subset=['rater1', 'rater2', 'position', 'meanIAT', 'meanExp'])
# Create a new column for the average skin tone rating
df_clean['avgSkinTone'] = df_clean[['rater1', 'rater2']].mean(axis=1)
return df_clean
# ======== MODEL CODE ========
import statsmodels.api as sm
def model(df: pd.DataFrame) -> sm.GLM:
# Define the control variables
control_vars = ['games', 'meanIAT', 'meanExp']
# Create dummy variables for the categorical variable 'position'
df = pd.get_dummies(df, columns=['position'], drop_first=True)
# Define the dependent variable
y = df['redCards']
# Define the independent variables including dummy variables for position
X = df[['avgSkinTone'] + control_vars + [col for col in df.columns if col.startswith('position_')]]
# Add a constant term to the independent variables
X = sm.add_constant(X)
# Fit a Poisson regression model
poisson_model = sm.GLM(y, X, family=sm.families.Poisson()).fit()
return poisson_model
```

```
# ======== TRANSFORM CODE ========
def transform(df: pd.DataFrame) -> pd.DataFrame:
# Drop rows with missing 'skinTone'
df = df.dropna(subset=['skinTone'])
# Fill missing 'position' values with the most frequent value
most_frequent_position = df['position'].mode()[0]
df['position'] = df['position'].fillna(most_frequent_position)
# Create composite skin tone score
df['skinTone'] = df[['rater1', 'rater2']].mean(axis=1)
# Generate binary variable for skin tone
median_skin_tone = df['skinTone'].median()
df['skinToneCategory'] = df['skinTone'].apply(lambda x: 'Dark' if x > median_skin_tone else 'Light')
# Convert Red Cards to Binary
df['RedCardReceived'] = df['redCards'].apply(lambda x: 1 if x > 0 else 0)
return df
# ======== MODEL CODE ========
def model(df: pd.DataFrame) -> Any:
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Ensure categorical variables are treated as such
df['position'] = df['position'].astype('category')
df['refCountry'] = df['refCountry'].astype('category')
df['skinToneCategory'] = df['skinToneCategory'].astype('category')
# Simplify model by focusing on key variables
logit_model = smf.logit('RedCardReceived ~ skinToneCategory + games + C(refCountry)', data=df).fit()
return logit_model.summary()
```

```
# ======== TRANSFORM CODE ========
def transform(df: pd.DataFrame) -> pd.DataFrame:
# Calculate average skin tone rating
df['avgSkinTone'] = df[['rater1', 'rater2']].mean(axis=1)
# Filter to include only players with valid skin tone ratings
df_filtered = df.dropna(subset=['avgSkinTone'])
# Summarize data at the player level
df_player_level = df_filtered.groupby(['playerShort'], as_index=False).agg({
'avgSkinTone': 'mean',
'redCards': 'sum',
'games': 'sum',
'meanIAT': 'mean',
'meanExp': 'mean'
})
return df_player_level
# ======== MODEL CODE ========
def model(df: pd.DataFrame):
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Fit a negative binomial regression model
model = smf.glm('redCards ~ avgSkinTone + games + meanIAT + meanExp',
data=df,
family=sm.families.NegativeBinomial()).fit()
return model.summary()
```

```
# ======== TRANSFORM CODE ========
def transform(df: pd.DataFrame) -> pd.DataFrame:
# Drop rows with missing values in rater1 and rater2
df_cleaned = df.dropna(subset=['rater1', 'rater2'])
# Calculate average skin tone score
df_cleaned['avgSkinTone'] = df_cleaned[['rater1', 'rater2']].mean(axis=1)
# Select necessary columns
columns = ['avgSkinTone', 'redCards', 'games', 'victories', 'ties', 'defeats', 'position', 'meanIAT', 'meanExp']
df_final = df_cleaned[columns]
return df_final
# ======== MODEL CODE ========
def model(df: pd.DataFrame) -> Any:
import statsmodels.formula.api as smf
# Fit the regression model
model = smf.ols('redCards ~ avgSkinTone + games + victories + ties + defeats + C(position) + meanIAT + meanExp', data=df).fit()
# Return the regression results
return model.summary()
```

```
# ======== TRANSFORM CODE ========
def transform(df: pd.DataFrame) -> pd.DataFrame:
# Calculate the average skin tone rating
df['avgSkinTone'] = df[['rater1', 'rater2']].mean(axis=1)
# Filter relevant columns
columns_to_keep = ['avgSkinTone', 'redCards', 'games', 'meanIAT', 'meanExp', 'position', 'goals', 'victories', 'ties', 'defeats']
df = df[columns_to_keep]
# Drop rows with missing values in the average skin tone
df = df.dropna(subset=['avgSkinTone'])
# One-hot encode the 'position' column
df = pd.get_dummies(df, columns=['position'], drop_first=True)
return df
# ======== MODEL CODE ========
def model(df: pd.DataFrame) -> Any:
import statsmodels.api as sm
# Define the dependent and independent variables
X = df.drop(columns=['redCards'])
y = df['redCards']
# Add a constant to the model
X = sm.add_constant(X)
# Fit the OLS model
ols_model = sm.OLS(y, X).fit()
return ols_model.summary()
```

Performance vs. HumanEval Performance. We compare BLADE evaluation metrics against reported Pass@1 on HumanEval.

When comparing our results in analysis generation with those from the HumanEval coding benchmark, we fonud most metrics showed a positive correlation, indicating that higher HumanEval performance is broadly correlated with higher BLADE performance. However, coverage measures had a weaker correlation compared to precision. This suggests that while current training methods, such as Reinforcement Learning from Human Feedback (RLHF) and instruction tuning, optimize for one solution, they may struggle to generate diverse solutions.

```
@inproceedings{
gu2024blade,
title={BLADE: Benchmarking Language Model Agents for Data-Driven Science},
author={Ken Gu and Ruoxi Shang and Ruien Jiang and Keying Kuang and Richard-John Lin and Donghe Lyu and Yue Mao and Youran Pan and Teng Wu and Jiaqian Yu and Yikun Zhang and Tianmai M. Zhang and Lanyi Zhu and Mike A. Merrill and Jeffrey Heer and Tim Althoff},
booktitle={The 2024 Conference on Empirical Methods in Natural Language Processing},
year={2024},
}
```