I’ve seen a lot of posts recently — in the Chinese-speaking world and the English-speaking world — talking about harness engineering.

Most of them just cover the concept, with very little detail.

And honestly, some of the systems described are so complex I can’t believe they’d keep running without crashing.

So over the weekend, I built something myself.


It’s funny — I’ve built a lot of personal projects, but most of them die within 10 iterations. When I say dead, I mean they spiral: adding a feature causes bugs, fixing those bugs creates more bugs.

That’s exactly the spiral harness engineering is supposed to prevent.

People make it sound complicated. The way I built it is pretty dumb and simple, but it works. And I did learn a lot from the original post.


Rule 1: Make the worktree bootable.

This is critical if you want agents to work in parallel.

Use bun (or pnpm) as the frontend package manager: both keep a global cache, so a fresh worktree installs dependencies from disk in a flash instead of downloading them again. Use uv to manage the data pipeline's environment and dependencies. Every worktree starts instantly, no pain.
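Here is a minimal sketch of what "bootable" can mean in practice. It is not the exact setup from this project: the `web/` and `pipeline/` directory names, the `.worktrees/` location, and both function names are assumptions made up for illustration. The only real commands used are `git worktree add`, `bun install`, and `uv sync`.

```python
import subprocess
from pathlib import Path


def boot_commands(repo: Path, name: str) -> list[tuple[list[str], Path]]:
    """Return (command, working directory) pairs that create and boot a worktree.

    Hypothetical layout: frontend in web/, data pipeline in pipeline/.
    """
    tree = repo / ".worktrees" / name
    return [
        # New worktree on its own branch, so parallel agents never share a checkout.
        (["git", "worktree", "add", "-b", name, str(tree)], repo),
        # bun installs from its global cache, so this is fast after the first run.
        (["bun", "install"], tree / "web"),
        # uv creates the virtual environment and installs the project's dependencies.
        (["uv", "sync"], tree / "pipeline"),
    ]


def boot_worktree(repo: Path, name: str) -> Path:
    """Run every boot command in order and raise on the first failure."""
    for cmd, cwd in boot_commands(repo, name):
        subprocess.run(cmd, cwd=cwd, check=True)
    return repo / ".worktrees" / name
```

Keeping the command list separate from the code that runs it makes the boot recipe easy to print, review, or hand to an agent as plain text.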


Rule 2: Set the standard.

Treat it like you’re handling a ticket end-to-end. Lint, unit test, passing build — those are the basics. But for an app with UI, the critical one is UI testing.

Here I use agent browser and Playwright CLI. Both differ from traditional e2e tests, which go stale quickly when selectors change. With these tools, you pair with them for a round or two, describe how your app looks, and document it in a markdown file.

After that, you just say: “I expect if I click X, Y should pop up.”

All natural language.
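The non-UI half of that standard (lint, unit test, build) can be sketched as a fail-fast gate. This is an illustration under assumptions, not the project's actual script: the `lint`, `test`, and `build` script names are guesses at what a `package.json` might define, and `GATES`, `run_command`, and `first_failing_gate` are hypothetical names.

```python
import subprocess
from typing import Callable, Optional

# Assumed script names; adjust to whatever package.json actually defines.
GATES: list[tuple[str, list[str]]] = [
    ("lint", ["bun", "run", "lint"]),
    ("unit test", ["bun", "run", "test"]),
    ("build", ["bun", "run", "build"]),
]


def run_command(cmd: list[str]) -> bool:
    """Run one command and report whether it exited with status 0."""
    return subprocess.run(cmd).returncode == 0


def first_failing_gate(
    gates: list[tuple[str, list[str]]] = GATES,
    run: Callable[[list[str]], bool] = run_command,
) -> Optional[str]:
    """Run gates in order and return the name of the first failure, or None."""
    for name, cmd in gates:
        if not run(cmd):
            return name
    return None
```

Returning the name of the failed gate, rather than just a boolean, gives the agent a concrete thing to fix before it tries again.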


Those two rules took most of my time. The rest — tech stack, infrastructure — I made those decisions myself. Agents write the code; architecture is still mine.

Then things started rolling.


I created a skill called decision-plan.

Give it a minimal requirement (including the P0 use cases), and it will:

  • Explore the codebase
  • Create an ADR (Architecture Decision Record) based on the requirement
  • Generate an execution plan

The ADR covers high-level design, context, and the why. The execution plan covers the how — which files to change. The execution plan links back to the ADR. Both live inside the codebase as code.

I also created an index.md — a lightweight index so the agent can find the right ADR without scanning through all of them. As ADRs accumulate, they work as a knowledge base — and new ADRs automatically start referencing old ones when the skill is called.
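A lightweight index like this can be regenerated from the ADR files themselves. The sketch below assumes a naming scheme of `ADR-0012-some-title.md` with a `# Title` heading inside each file; that scheme, the output format, and the function names are my assumptions for illustration, not the layout the skill actually produces.

```python
import re
from pathlib import Path

# Assumed naming scheme: ADR-0012-some-title.md containing a "# Title" heading.
ADR_FILE = re.compile(r"^(ADR-\d{4})-.*\.md$")


def adr_title(text: str) -> str:
    """Return the first markdown H1 heading, or a placeholder if none exists."""
    for line in text.splitlines():
        if line.startswith("# "):
            return line[2:].strip()
    return "(untitled)"


def build_index(adr_dir: Path) -> str:
    """Build index.md content: one line per ADR with its id, title, and file name."""
    lines = ["# ADR index", ""]
    for path in sorted(adr_dir.iterdir()):
        match = ADR_FILE.match(path.name)
        if match:  # skips index.md and anything else that is not an ADR
            title = adr_title(path.read_text(encoding="utf-8"))
            lines.append(f"- {match.group(1)}: {title} ({path.name})")
    return "\n".join(lines) + "\n"
```

One line per ADR keeps the index small enough to read in full, so the agent only opens the few ADRs that look relevant.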


Then the building phase.

I created another skill: build-and-ship. It runs:

  1. Create a worktree and make sure it’s bootable
  2. Read the execution plan
  3. Implement the plan
  4. Lint / unit test / build
  5. Bootstrap a dev server (random port) and run smoke tests (Playwright)
  6. Use agent browser or Playwright CLI to visually verify. Save screenshots and recordings.
  7. If everything passes, merge back to main. If anything fails, fix it.
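The random port in step 5 is what lets several worktrees run dev servers at the same time without colliding. Here is a minimal sketch of one way to do it, assuming the dev server accepts a port setting; `free_port` and `wait_until_listening` are hypothetical helper names, not part of the skill.

```python
import socket
import time


def free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def wait_until_listening(port: int, timeout: float = 30.0) -> bool:
    """Poll until something accepts connections on the port, or time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection(("127.0.0.1", port), timeout=1.0):
                return True
        except OSError:
            # Nothing is listening yet; wait briefly and try again.
            time.sleep(0.2)
    return False
```

There is a small race here: the port is released before the dev server claims it, so another process could grab it in between. For a handful of local worktrees that is usually acceptable, and waiting for the port to accept connections before starting the smoke tests avoids flaky failures against a server that is still starting up.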

My whole workflow: open 4 terminals, run decision-plan in parallel to generate ADRs and execution plans, clear context when done. Then kick off /build-and-ship ADR-0012. And it works. 48 ADRs later, still going.

[Image: execution plans accumulating in the codebase]

I also tried a simple Ralph loop — combining 4 ADRs into a PRD and letting it run autonomously. The loop itself ran fine, but the task splitting wasn’t ideal. Some validation points got missed, and bugs piled up.

I think the ADR/workflow structure isn’t far from supporting a fully autonomous solution. Just not quite there yet.


So what did I actually build with all this? InvestBuddy — import your broker data, track performance, export for tax. Everything stays in the browser.


What did I learn?

1. Good requirements + clear validation points are the best resource you can give an agent.

2. I’m no longer deep in the code details. I’m more like a PM.

I felt like I’d lost control at first. But I got used to it — and I’m actually happier watching things I built work for my own problems.

3. Simple is still better than complex.

The ADR system probably won’t scale to hundreds of ADRs — context limits will bite, agent memory will decay. But it works for this project, and that’s enough.

Build it simple. Make it work. When it breaks, fix it.