Workshops for Ukraine

Agentic Coding for R

Reliable habits for AI-assisted analysis: planning, writing,
reviewing, and documenting R workflows with AI agents

Charles Crabtree
Senior Lecturer, School of Social Sciences, Monash University
K-Club Professor, University College, Korea University

github.com/lobsterbush/workshops-for-ukraine

About Me

Charles Crabtree

  • Senior Lecturer, School of Social Sciences, Monash University
  • Previously: Dartmouth (Government), Stanford (APARC), Tokyo Foundation for Policy Research
  • Ph.D., Political Science, University of Michigan
  • Writing papers about LLMs since January 2021 — ~20 months before ChatGPT
  • Taught AI workshops at Dartmouth, Essex, UNM, IPSA-NUS, Instats, Statistical Horizons

I was a skeptic. I spent years studying these tools critically. AI agents converted me — not chatbots, not autocomplete, but tools that act on your computer, see results, and iterate.

Punch Cards (1950s-1970s)
Command Line (1980s)
Graphical Interface (1995)
Modern Desktop (2024)
AI Agent (Now)

The Shift

The new operating system is language

Every era required learning a different interface — punch cards, command lines, graphical desktops.

Now you describe what you want in plain language. The machine writes the R code, runs it, and fixes the errors.

The Tool

Warp — the agentic development environment

It's a terminal with an AI agent built in. The agent can read your files, run your code, see the output, and fix errors — all in one place.

  • Free to start — 75 credits/month, no credit card
  • Multi-model — Claude, GPT, Gemini. Switch with one click.
  • macOS, Windows, Linux · warp.dev

We'll use Warp today, but the patterns work with any agent that can run code on your machine.

Two Ways to Use AI

Vibe coding vs agentic coding

Vibe coding

  • "Just make it work"
  • Accept whatever the AI produces
  • Don't read the code
  • No conventions, no checks
  • Hope for the best

Fine for a throwaway prototype. Dangerous for a paper.

Agentic coding

  • Constrain the agent with your rules
  • Verify every output
  • Your conventions, enforced consistently
  • A second agent audits the first
  • Paper trail writes itself

Reproducible, verifiable, publishable.

This workshop is about the right column. Not prompt tricks — reliable habits.

Today

What we'll cover

  1. How agents work — from tasks to numbered .R scripts
  2. Inside the code — walkthrough of what the agent produces
  3. Where agents fail — silent errors, wrong assumptions
  4. Adversarial agentic coding — builder + reviewer agents on different models
  5. Hands-on practice — build, break, and fix
  6. Implications

You will leave with downloadable skill templates for a builder agent and a reviewer agent that you can use immediately.

Part 1

How Agents Work

Key Distinction

This isn't the ChatGPT you know

Chatbot

  • Tells you what to do
  • One response at a time
  • You copy-paste everything
  • No memory between turns

Agent

  • Does it for you
  • Multi-step workflows
  • .R files appear on disk
  • Sees errors, fixes them

How It Works

The agent loop

Agents observe each result and decide what to do next.

  1. Observe — read your files, check what exists
  2. Plan — decide which script to write next
  3. Execute — write the code and run it
  4. Check — did it work? Parse errors, read output
  5. Repeat — fix what broke, move to the next step

This is why they recover from errors. They see what went wrong and try again — just like you would.

Your Job

Decompose -> Constrain -> Verify

The agent handles execution. You handle direction.

  • Decompose — break a vague research task into numbered, concrete steps
  • Constrain — specify your preferred packages, SE type, file format, plot style
  • Verify — check every output before moving on

The prompt is a contract. The more specific you are, the fewer surprises. Let me show you what that looks like.

R Example — The Prompt

One prompt -> full analysis pipeline

$ "I have a populism survey dataset at
~/data/survey-data/processed.csv
that I haven't touched in years. ~82K rows, multiple
countries, pop_1 through pop_6 items, globalization
attitudes, demographics, loss aversion, immigration
conjoint.

Explore it, model what predicts populist attitudes,
visualize the results, and put everything in
numbered .R scripts."
  

R Example — What the Agent Produces

From one prompt: a full project directory

  • Cleaning script — reads raw data, recodes, handles missingness, writes analysis-ready file
  • Exploration script — distributions, missingness, sample sizes by country
  • Analysis script — models, extracts coefficients, exports statistics
  • Figures script — publication-ready plots, exported to PDF and PNG
  • Statistics file — every number computed, not typed — your paper references it directly
  • README — replication instructions

Let me open the project directory and show you what's inside each file.

Part 2

Inside the Code

First: Teach the Agent Your Rules

WARP.md — persistent project context

Create this file in your project root. The agent reads it at the start of every session, so you only set conventions once.

What to include

  • Project description and status
  • Data sources and locations
  • R version and key packages
  • Known issues or limitations

Your rules — whatever you want

  • Your preferred SE type
  • Your preferred plot style
  • Output format (PDF, PNG, both)
  • Sequential script numbering
  • How to export statistics

These rules become the conventions file that the builder skill reads. Different users, different conventions, same process.
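A minimal WARP.md sketch is below. The section names and package choices are illustrative, not a required schema — the point is that the file states your conventions once, in plain language:

```markdown
# WARP.md

## Project
Populism survey analysis. Status: exploratory.

## Data
Raw file: data/raw/processed.csv (~82K rows). Never edit raw data in place.

## Environment
R 4.4. Key packages: tidyverse, fixest, modelsummary.

## Conventions
- Numbered scripts: 01_clean.R, 02_explore.R, 03_analyze.R, 04_figures.R
- Robust standard errors in all models
- Relative paths only
- Every reported statistic written to output/statistics.tex, never typed
```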

What to Look For

When I open these files, check for:

  • Relative paths only — /Users/you/... breaks on every other machine; relative paths work everywhere
  • Dropped rows are logged — the script prints "Raw rows: 82,034 | Clean rows: 71,847 | Dropped: 10,187" so you know exactly what happened
  • Statistics are computed, not typed — every number flows from code to paper
  • Scripts have clear headers — purpose, inputs, outputs
  • Figures match the regression table — same model, same sample, same coefficients

These are the things a reviewer agent checks automatically. Let me show you the code.
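The first two checks can be sketched as the opening of a cleaning script. The paths and column names here are illustrative:

```r
# ------------------------------------------------------------
# 01_clean.R
# Purpose: recode raw survey data into an analysis-ready file
# Input:   data/raw/processed.csv
# Output:  data/clean/analysis.csv
# ------------------------------------------------------------

raw <- read.csv("data/raw/processed.csv")

# Keep only rows complete on the variables the models need
clean <- raw[!is.na(raw$pop_1) & !is.na(raw$age), ]

# Log exactly what was dropped -- never drop rows silently
cat(sprintf("Raw rows: %s | Clean rows: %s | Dropped: %s\n",
            format(nrow(raw),   big.mark = ","),
            format(nrow(clean), big.mark = ","),
            format(nrow(raw) - nrow(clean), big.mark = ",")))

write.csv(clean, "data/clean/analysis.csv", row.names = FALSE)
```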

The Key Pattern — Never Hardcode a Number

Model object → file on disk → paper

1. Your model lives in R's memory:

m1 <- lm(pop_index ~ age + education
         + globalization, data = df)

# R knows everything about this model:
nobs(m1)                   # 72814
coef(m1)["globalization"]  # 0.34182
summary(m1)$coefficients["globalization", "Std. Error"]  # 0.01193

2. R extracts numbers and writes a file:

cat(
  sprintf("\\newcommand{\\nObs}{%s}\n",
    format(nobs(m1), big.mark = ",")),
  sprintf("\\newcommand{\\mainCoef}{%.3f}\n",
    coef(m1)["globalization"]),
  file = "output/statistics.tex",
  sep = ""
)
# sprintf formats each number
# cat writes the lines to the file
# (sep = "" avoids a stray space between them)

3. Your paper reads the file:

% In your .tex preamble:
\input{output/statistics.tex}

% In your text:
Our sample includes
\nObs{} respondents.
The main effect is
$\beta = \mainCoef{}$.

LaTeX replaces \nObs with 72,814 and \mainCoef with 0.342 when you compile.

Change the data, re-run the R script, recompile the paper — every number updates. No manual transcription, ever. Not using LaTeX? Same pattern works with CSV or JSON.
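A sketch of the CSV variant, with illustrative statistic names — extract every reported number into a long key-value table that a Quarto or R Markdown document reads back:

```r
# Assumes m1 is the fitted lm() model from above
set.seed(1)
df <- data.frame(pop_index = rnorm(100), age = rnorm(100),
                 education = rnorm(100), globalization = rnorm(100))
m1 <- lm(pop_index ~ age + education + globalization, data = df)

stats <- data.frame(
  name  = c("n_obs", "coef_globalization", "se_globalization"),
  value = c(nobs(m1),
            coef(m1)["globalization"],
            summary(m1)$coefficients["globalization", "Std. Error"])
)
write.csv(stats, "statistics.csv", row.names = FALSE)

# Later, in a Quarto/R Markdown chunk:
# stats <- read.csv("statistics.csv")
# stats$value[stats$name == "coef_globalization"]
```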

The Paper Trail

Documentation happens automatically

  • Session logs — every prompt you typed, every command the agent ran, every output it produced. In Warp: click the session menu (top right) to export as markdown or share a link.
  • Model tracked — each interaction records which LLM produced it (Sonnet, Opus, GPT-4o). Visible in the session log and in Warp's conversation history.
  • Git integration — the agent can commit after each step. Commits + session logs = complete provenance: who wrote what, when, and which model helped.

Numbered scripts + computed statistics + session logs + git = a replication package that writes itself.

Part 3

Where Agents Fail

Real Failures

Things I've seen agents do — with full confidence

🔴 Fabricated a citation

Cited "Smith & Jones (2019)" in the literature review. Paper doesn't exist. DOI leads nowhere. Sounded perfectly plausible.

🔴 N doesn't match

Abstract says N = 1,247. Results table says N = 935. Cleaning script silently dropped 312 rows. No mention anywhere.

🔴 Figure doesn't match table

Coefficient plot shows β = 0.34. Regression table says 0.21. Agent re-estimated the model with a different sample for the figure.

🟡 Variables theory never mentioned

Model has 8 predictors. Theory section discusses 4. The other 4 were added by the agent because they "seemed relevant."

Every one of these looks right at a glance. You need a systematic way to catch them.
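The N mismatch, at least, is mechanically checkable: compare the rows the model actually used against the rows in the data, rather than trusting any reported number. A minimal sketch, with illustrative variable names:

```r
df <- data.frame(pop_index     = c(1, 2, 3, NA),
                 age           = c(20, 30, NA, 40),
                 education     = 1:4,
                 globalization = 4:1)

m1 <- lm(pop_index ~ age + education + globalization, data = df)

# lm() silently drops any row with NA in a model variable
n_data  <- nrow(df)
n_model <- nobs(m1)

if (n_model < n_data) {
  cat(sprintf("WARNING: model used %d of %d rows (%d dropped by NA)\n",
              n_model, n_data, n_data - n_model))
}
```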

Part 4 — The Core Idea

Two agents are better than one

One agent builds. A second agent tries to break what the first built.

The builder is confident. The reviewer is adversarial.
Together, they catch what either would miss alone.

The Pattern

Build -> Review -> Fix -> Review

1. Build

Invoke /builder-agent
Produces .R scripts, figures, stats.tex

2. Review

Switch model, invoke /reviewer-agent
Runs checks, verifies numbers

3. Fix

Builder addresses critical issues
Re-runs pipeline

4. Re-review

Reviewer confirms fixes
Produces final report

Strategy

Different models, different blind spots

/builder and /reviewer can run on different models from different providers.

Why cross-provider?

  • Each model has its own failure modes
  • Claude may miss what GPT catches (and vice versa)
  • Disagreement between models is a signal to investigate
  • Avoids correlated errors from the same training data

Example combinations

Builder: Claude Sonnet 4.5
Reviewer: GPT-4o

Builder: GPT-4o
Reviewer: Claude Opus

Builder: Claude Sonnet 4.5
Reviewer: Gemini 2.5 Pro

In Warp, switch models by clicking the model name in the input bar. One click between builds and reviews.

The Reviewer Agent

What /reviewer-agent does

You switch to a different model, invoke the reviewer, and it:

  • Runs every script from a clean session — do they all succeed?
  • Checks that N matches — counts rows in the actual data, compares to reported N
  • Checks figures against tables — are the coefficients the same?
  • Searches for hardcoded numbers — are statistics computed or typed?
  • Writes a review report — Critical / Warning / Note, with specific fixes

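The absolute-path search, for instance, can be sketched as plain R over the script files themselves — this is an illustrative pattern, not the skill's actual implementation:

```r
# Scan every .R script in a directory for machine-specific paths
scan_for_abs_paths <- function(dir) {
  scripts <- list.files(dir, pattern = "\\.R$", full.names = TRUE)
  for (f in scripts) {
    lines <- readLines(f)
    # Absolute paths break replication on any other machine
    hits <- grep("(/Users/|/home/|C:\\\\)", lines)
    if (length(hits) > 0)
      cat(sprintf("%s: absolute path on line(s) %s\n",
                  f, paste(hits, collapse = ", ")))
  }
}
```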
Let me show you what this looks like live.

Live Demo

Builder produces analysis, reviewer breaks it

/builder-agent
"I have a YouGov survey dataset in this folder
with an SPSS file, a codebook PDF, and CSV exports.

Explore the data, propose a research question,
model it, visualize it, and put everything in
numbered .R scripts."

Part 5

Hands-On Practice

Exercises

Four exercises — pick your level

1. Just Talk to It

No skills, no setup. Just type.

"Tell me about the files in this folder."

"I have survey data at
~/data/survey-data/processed.csv.
What's in it? How many rows? What countries?"

"Put together a plan for analyzing whether
economic insecurity predicts populist attitudes."

2. Use the Builder

Same question, structured workflow.

/builder-agent
"Does economic insecurity predict populist
attitudes? Use ~/data/survey-data/processed.csv
(~82K rows).
• DV: populism index (pop_1.n through pop_6.n)
• Key IVs: loss aversion (loss_1.n-loss_6.n)
• Controls: age, education (ed_postsec), country
• Explore first, then model, then visualize"

3. Break It

Switch model, audit the builder's work.

Switch to a different model, then:

/reviewer-agent
"Audit everything the builder just produced."

Read review_report.md. What did it find?

4. Build Something You Can Use

Don't read the code. Use it.

"Build a Twitter/X feed interface for a survey
experiment on [YOUR TOPIC].
• 8 posts with treatment variation
• Track likes, retweets, timestamps
• Embed in Qualtrics via iframe
• Single HTML file, no dependencies"

Open it. Like a post. Scroll. Does it work?

Have your own data? Try any of these with your own project.

Share what you built — and what broke

Failures are the most useful thing you can share.

Summary

Three things

  1. Constrain the agent. Your conventions in WARP.md. Numbered scripts. No absolute paths. Statistics computed, not typed.
  2. Review adversarially. A second agent — on a different model — catches what the first misses.
  3. Document by default. Session logs, git commits, computed statistics. The paper trail writes itself.

Part 6

Implications

Equilibrium

The new balance

Costs collapse ↓

  • Data cleaning — parsing, merging, recoding
  • Analysis — models, robustness checks, figures
  • Documentation — READMEs, session logs, LaTeX
  • Verification — adversarial review, stress tests

Value rises ↑

  • Ideas — original questions, creative design
  • Taste — knowing what's worth doing
  • Fieldwork — being there, in person
  • Judgment — knowing when the agent is wrong

Discussion

Questions for the room

  • What R tasks would benefit most from a builder+reviewer pattern?
  • Where should human judgment remain non-negotiable?
  • How do we disclose and document AI assistance in papers?
  • What happens to methods training when agents can run the models?


How to Install and Invoke in Warp

Three steps

1. Download the skill files from the workshop repo into your project:

my-project/
|-- .warp/skills/
|   |-- builder-agent/
|   |   `-- SKILL.md
|   `-- reviewer-agent/
|       `-- SKILL.md
|-- data/
`-- WARP.md

Also works: .agents/skills/, .claude/skills/, .cursor/skills/
Global (all projects): ~/.warp/skills/

2. Type / in Warp's input bar to invoke:

/builder-agent Analyze the populism
  data at data/raw/processed.csv...

Warp auto-discovers skills and shows them in the / menu. You can also just describe your task — the agent finds the right skill.

3. Click the model selector, switch model, invoke reviewer:

# Model selector -> GPT-4o
/reviewer-agent Audit everything
  the builder just produced.

Warp reads the SKILL.md automatically when you invoke it. The agent follows the full procedure and checklist — you never repeat instructions.

Downloadable Skills

Two skills, ready to use

Builder Skill

  • Plans numbered .R script pipeline
  • Reads your conventions from WARP.md
  • Explores data before modeling
  • Exports computed statistics (never hardcoded)
  • Self-checks before finishing

Reviewer Skill

  • Runs scripts from a clean session
  • Checks N matches, figures match tables
  • Searches for hardcoded numbers and absolute paths
  • Can re-estimate independently in another language
  • Produces severity-rated review report

Also includes a conventions-example.md you can customize with your own package and style preferences.

Thank you

Charles Crabtree
Senior Lecturer, School of Social Sciences, Monash University
K-Club Professor, University College, Korea University

charles.crabtree@monash.edu · charlescrabtree.org

Resources