AI experiment tracker — runs, benchmarks, notes

Describe a run, and the workspace produces the output, files the result into a versioned Drive folder, writes up a research note as a Google Doc, and opens a Linear ticket the team can comment on. Every iteration is kept, comparable, and on the record.

If you iterate on a generative LLM workflow for a living, the hard part isn’t running it — it’s remembering which run produced which output, what you changed between them, and what the team thought of the result.

Twenty runs deep, no way to compare them

Iterating on a generative workflow means running it over and over with small changes — a tweak to the prompt, a different input, a new constraint — and keeping the outputs around so you can tell which version actually got better. The running is easy. The keeping-track is where it falls apart.

The output of run twelve ends up in a download folder, the note on why you tried it lives in a doc you’ll never find again, and when a teammate asks “which prompt gave us that?” the honest answer is a shrug. Each run is fine on its own; the cost is that nothing connects the output, the reasoning, and the conversation around it.

One record per run, output and notes attached

The workspace is a tracker built around a single record per run. You kick off a run from a short input, and that record carries everything the run produces — the generated output, the structured findings, the status as it moves from in-flight to ready. A list page shows every run you’ve done, most recent first, so “what have we tried” is a glance rather than an archaeology project.
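As a rough illustration of what "a single record per run" could look like, here is a minimal sketch in Python. The names `RunRecord`, `RunStatus`, `mark_ready`, and `list_runs` are assumptions made for this example, as are the sample runs and timestamps; none of them come from the workspace itself.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Dict, List, Optional


class RunStatus(Enum):
    # The two states named in the text; a real tracker may have more.
    IN_FLIGHT = "in-flight"
    READY = "ready"


@dataclass
class RunRecord:
    """Hypothetical record: one per run, carrying everything the run produces."""
    run_id: str
    run_input: str                      # the short input the run was started from
    started_at: datetime
    status: RunStatus = RunStatus.IN_FLIGHT
    output: Optional[str] = None        # generated output, set when the run finishes
    findings: Dict[str, str] = field(default_factory=dict)  # structured findings

    def mark_ready(self, output: str, findings: Optional[Dict[str, str]] = None) -> None:
        # Attach the results to the record and move it from in-flight to ready.
        self.output = output
        self.findings = dict(findings or {})
        self.status = RunStatus.READY


def list_runs(runs: List[RunRecord]) -> List[RunRecord]:
    # The list page ordering: most recent run first.
    return sorted(runs, key=lambda r: r.started_at, reverse=True)


# Made-up sample data for illustration only.
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
runs = [
    RunRecord("run-11", "baseline prompt", t0),
    RunRecord("run-12", "baseline prompt plus a length constraint", t0 + timedelta(hours=1)),
]
runs[0].mark_ready("draft text", {"verdict": "too long"})
print([r.run_id for r in list_runs(runs)])  # prints ['run-12', 'run-11']
```

Keeping the output and findings as fields on the record, rather than in separate stores keyed by run, is what makes "open any past run" a single lookup in this sketch.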

Some runs generate a piece of work; some research a question and return structured findings. Either way the output doesn’t vanish when the run finishes — it’s filed somewhere durable, written up, and logged for the team, automatically.

From a prompt to a Drive folder, a Doc, and a Linear ticket

  1. Start a run from an input. You describe what this run should do — the prompt or question you’re iterating on — and kick it off. A run record appears immediately and starts tracking progress.
  2. A background task does the work. The run either generates a self-contained piece of work or researches the question against the web, then returns a typed result the workspace reads back onto the run record.
  3. The output versions into Drive. When the run finishes, its artefact is saved into a versioned Drive folder, so every run’s output is kept side by side and you can compare across iterations instead of overwriting the last one.
  4. The findings land as a Google Doc. The structured research note is written up as a Doc — the readable write-up that outlives the run record and that you can share or link without exporting anything.
  5. The run files a Linear ticket. Each run opens a ticket the team can comment on, link follow-ups to, and reference when comparing benchmarks across runs — so the conversation about a result sits next to the result.
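The five steps above can be sketched as one function that runs the task and then fans the result out to three destinations. This is a sketch under stated assumptions, not the workspace's implementation: `Run`, `Sinks`, `versioned_path`, and `execute_run` are hypothetical names, and the Drive, Docs, and Linear integrations are replaced by in-memory fakes so the example runs without any external service.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Run:
    run_id: int
    prompt: str
    status: str = "in-flight"
    links: Dict[str, str] = field(default_factory=dict)


@dataclass
class Sinks:
    """Hypothetical stand-ins for the Drive, Docs, and Linear integrations."""
    save_artefact: Callable[[str, str], str]  # (path, content) -> link
    write_note: Callable[[str, str], str]     # (title, body) -> link
    open_ticket: Callable[[str, str], str]    # (title, body) -> link


def versioned_path(folder: str, run_id: int) -> str:
    # One subfolder per run, zero-padded so folders sort in run order
    # and a new run never overwrites an earlier one.
    return f"{folder}/run-{run_id:04d}/output.txt"


def execute_run(run: Run, task: Callable[[str], Dict[str, str]],
                sinks: Sinks, folder: str) -> Run:
    result = task(run.prompt)                                    # step 2: do the work
    run.links["drive"] = sinks.save_artefact(                    # step 3: version the output
        versioned_path(folder, run.run_id), result["output"])
    run.links["doc"] = sinks.write_note(                         # step 4: write up the note
        f"Run {run.run_id} research note", result["findings"])
    ticket_body = f"Output: {run.links['drive']}\nNote: {run.links['doc']}"
    run.links["ticket"] = sinks.open_ticket(                     # step 5: file the ticket
        f"Run {run.run_id}: {run.prompt}", ticket_body)
    run.status = "ready"
    return run


# In-memory fakes that record each call instead of contacting a service.
calls: List[Tuple[str, str]] = []


def fake_sink(kind: str) -> Callable[[str, str], str]:
    def record(name: str, body: str) -> str:
        calls.append((kind, name))
        return f"{kind}://{name}"
    return record


sinks = Sinks(fake_sink("drive"), fake_sink("doc"), fake_sink("ticket"))
task = lambda prompt: {"output": prompt.upper(), "findings": "shorter prompt, same quality"}
run = execute_run(Run(12, "summarise the report"), task, sinks, "experiments")
print(run.status, run.links["drive"])
```

Passing the destinations in as plain callables keeps the loop itself independent of any particular client library, so the same function could be exercised in tests with fakes like the ones shown here.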

The run input, the Drive folder, the note, and the team

  • What each run does — the instructions the run follows. Point it at whatever you’re iterating on this cycle: a generation prompt, a research brief, a different rubric.
  • Where outputs are filed — the Drive folder runs version into. Keep all runs together, or split by project so each line of work has its own history.
  • What the research note captures — the shape of the write-up that lands as a Doc. Tune it toward what you actually compare between runs.
  • Where run tickets file — the Linear team or project the run tickets open under, so the right people see them and the comments land where the team already works.
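The four settings listed above could be modelled as one small, validated settings object. The sketch below is illustrative only: `WorkspaceSettings`, its field names, and `folder_for` are assumptions made for this example, and the sample values are invented.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class WorkspaceSettings:
    """Hypothetical settings object mirroring the four bullets above."""
    run_instructions: str            # what each run does
    drive_folder: str                # where outputs are filed
    note_sections: Tuple[str, ...]   # the shape of the research note
    linear_team: str                 # where run tickets file

    def __post_init__(self) -> None:
        # Fail early on blank settings rather than at the end of a run.
        for name in ("run_instructions", "drive_folder", "linear_team"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must not be empty")
        if not self.note_sections:
            raise ValueError("note_sections must list at least one section")

    def folder_for(self, project: Optional[str] = None) -> str:
        # Keep all runs together, or split by project so each line of
        # work has its own history.
        return f"{self.drive_folder}/{project}" if project else self.drive_folder


# Invented sample values for illustration.
settings = WorkspaceSettings(
    run_instructions="Research the question and return structured findings.",
    drive_folder="experiments",
    note_sections=("What changed", "Result", "Open questions"),
    linear_team="research",
)
print(settings.folder_for())              # prints experiments
print(settings.folder_for("summariser"))  # prints experiments/summariser
```

Making the object frozen means a change to any setting produces a new settings value, which fits the idea that you edit the workspace description when the inputs move rather than mutating a run in progress.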

Every iteration kept, written up, and on a ticket the team can argue over — not scattered across a download folder, a stray doc, and someone’s memory.

An experiment loop you can replay and trust

A run loop earns its keep when nothing about an iteration goes missing between the people who care about it: the output it produced, the note on what you tried, and the thread where the team weighed in all stay attached to the run rather than spread across three tools. Keeping the artefact, the write-up, and the ticket on one record is what lets you open any past run and trust what you’re looking at.

The run-to-Drive-to-Doc-to-Linear loop stays the same. What you’d shape for your own work is the input each run starts from, the Drive folder outputs version into, and the Linear team the tickets open under. When any of those change, you update the workspace description, and the same loop runs whether you’re iterating on a generation prompt or a research question.