I'm building an AI agent platform for the software development lifecycle. Not a chatbot that writes code — I have enough of those. This is something different. Specialized agents that collaborate, review each other's work, and don't let bad code through without a fight.
This is the honest build log. Real bugs, real frustrations, real "bakit ganito?!" moments. No polished announcements. Just the messy truth of building something from scratch while running on coffee and stubbornness.
What if your dev team's code review, architecture review, and testing could be handled by AI agents — but with real quality gates? Not "AI generates code and you hope it works." More like "AI generates code, other AI agents tear it apart, and a human gets the final say."
That's what I'm building. Agents that have opinions. Agents that reject bad architecture. Agents that say "nope, try again" when the code doesn't meet standards. And when the agents can't figure it out — the pipeline stops and asks a human.
The first few days were humbling. I blamed the AI for everything — "ang eng eng naman," "it doesn't follow instructions," "the model is broken." Classic developer move: blame the tools.
Turns out, ako pala 'yung may problema. Most of the "AI flakiness" was my own infrastructure breaking things silently. Once I fixed the plumbing, the AI was actually pretty good.
Then came the real challenges — agents disagreeing with each other, retry loops burning tokens on the same errors, and one agent approving a 22-line file as "complete" when the prompt asked for a full system. Every bug taught me something about building multi-agent systems that I couldn't have learned from a tutorial.
It works. End-to-end. From a text prompt to generated, tested, reviewed code — with a full audit trail of every decision. I've watched agents catch actual architecture violations, reject buggy designs, and escalate to human review when they hit something they can't solve.
There's still a lot to build. But the foundation is solid — and it's mine. My own IP, built from scratch.
Name reveal coming soon.
One thing I got right early: nothing is hardcoded.
Each agent can use a different AI model. The routing agent gets a small, fast one. The code generator gets a bigger, smarter one. The reviewer can use something completely different for diversity of opinion. Swap models in config — no code changes.
Same codebase for all three. The environment is a config setting, not an architecture decision.
First clean end-to-end run
After all the fixes — after the uvicorn watcher, the asyncio trap, the abstain loop, the garbage cascade, the contract gaps — I fired a CLI calculator prompt. Watched the events live. Requirements decomposed. Architecture designed. Code generated. Tests written. Build passed. Reviewer approved. RUN_COMPLETED in ~225 seconds. No human gate. 5 artifacts on disk.
Naiiyak ako. That green status felt like it took a month to earn. You can't see the real bugs until you fix the ones in front of them. Every fix removed a failure mode that had been hiding behind the previous one. Layer after layer after layer.
Build Log
May 13–14 — From "AI keeps failing" to first working e2e
Started: "AI Fix keeps failing on simple type errors." Inis na inis na ako.
Ended: Working hello-world end-to-end with 6 deterministically-passing pytest tests. Generated by a real multi-agent pipeline. I almost cried.
What shipped
Security audit + CVE fixes — swapped python-jose for pyjwt, bumped cryptography and python-multipart floors, added uv.lock with hashes, wrote SECURITY.md and a CI supply-chain audit job.
Claude CLI integration — new ClaudeCliProvider with Windows PATHEXT resolution. The asyncio subprocess fix (to_thread + subprocess.run) so it works on any Windows event loop.
Persona registry — agents get configurable personas now. maria, senior_coder, design_thinker, strict_reviewer. Resolved from config, not hardcoded.
Self-build defenses — every code-producing agent (codegen, tester, debugger) runs its output through compile() + python -c "import subject" before handoff. Inline retry on failure with the real error fed back to the LLM.
Tester AST defenses — walks the test AST and removes any top-level redeclarations of subject symbols. Mechanical guarantee that doesn't depend on the LLM following instructions.
Pipeline UI — horizontal stepper, 7 stages, status icons, click for inline detail.
The three war stories
1. The 60-second silent kill. Uvicorn watching artifacts/. Every pipeline write triggered a worker reload that killed in-flight LLM subprocesses. Pattern was random enough to look like model instability. Fix: absolute reload_dirs path. Hours of debugging for one line.
2. The asyncio event-loop trap on Windows Python 3.14. create_subprocess_exec needs ProactorEventLoop, but Python 3.14 defaults to SelectorEventLoop. Uvicorn's --reload creates its loop before importing the app — my policy-setting runs too late. Subprocess calls fail with empty errors. No stack trace, no hint, just nothing. Fix: to_thread + subprocess.run. Works on any loop, any OS.
3. The "approved with notes" silent reject. extract_json used a naive brace counter that tripped on regex literals like r"\d{5}" inside JSON values. Plus the reviewer did strict decision == "approve", missing "approved", "APPROVED", and markdown-wrapped prose. Fix: string-aware brace tracker + 3-tier decision resolver.
Headline lesson: Most "AI flakiness" is infrastructure flakiness. Long-running async subprocess calls are fragile to dev-server auto-reload, event-loop policy differences, and silent fallback paths. If your AI system "sometimes works," look at the plumbing before you look at the model.
May 14 (cont.) — From "first e2e worked" to "local SDLC without errors"
Same day, second sitting. Hindi pa ako tapos. First sitting proved the pipeline could complete. This one was about the failures that remained — agents drifting out of contract with each other, and retry loops burning tokens on identical errors. Parang nagpapaikot-ikot lang.
What shipped
Auto-apply fix-with-AI — debugger-produced source now writes to disk, regenerates tests, runs build, flips the status to OK on success. No sandbox bounce needed.
Reviewer sees tests + build verdict — pipeline packs tests and build_summary into the reviewer's artifact. Stopped the "no tests present" hallucination after build_agent had just run them.
Symbol contract — codegen publishes its export list in result.data["exports"]. Pipeline forwards it as subject_symbols to the tester. Eliminates one full drift surface between codegen and tester.
Convergence detection — two consecutive attempts with the same error fingerprint? Break the loop. Emit AWAITING_HUMAN with reason: convergence_detected. Stop burning 3×60s on identical errors.
Smart error routing — ImportError gets "make every public name top-level." AssertionError gets "the test is the spec, don't weaken it." Targeted prompts → targeted fixes.
Pause-on-LLM-unavailable — LLM health check moved BEFORE requirements_agent. Detects when an agent fell back to a placeholder. Emits AWAITING_HUMAN with structured recommended_actions — Try again / Wait / Fix LLM & resume / Cancel.
The garbage cascade
A CLI subprocess failed but returned returncode=0 — classic Windows moment. Stdout: Execution errorTerminate batch job (Y/N)?
The requirements agent stored that as the requirement. "OK, so the user wants an Execution errorTerminate batch job system." The architect designed around it. Codegen wrote code for it. Ang galing naman. Four stages of garbage with green checkmarks. Nobody questioned anything.
Fix: Detect fallback at every critical agent boundary. Halt immediately. Give the user 4 buttons. Never let garbage propagate silently.
Headline lesson: The pipeline's failure mode wasn't the LLM — it was the contracts between agents. Codegen produced source; tester inferred symbols independently; reviewer judged without seeing tests. Every implicit assumption was a hiding spot. Fix pattern: turn implicit assumptions into explicit data flow. Make failures loud instead of silent.
May 15–16 — Manual controls, prompt UX, target architecture
Pipeline was reliable. This session made it controllable from the UI and honest about its failure modes.
What shipped
Manual per-stage rerun — new endpoint dispatches just one agent (codegen / tester / build / reviewer) using on-disk checkpoints. UI exposes it as a "Stage ▾" dropdown. Re-run a single stage when the automatic loop's choice wasn't what you wanted.
Input scanner scoped to user boundary — the injection scanner was blocking codegen retries because it saw codegen's own prior output as injection. Fix: internal agent calls trust the chain. Only requirements_agent scans, because it receives the raw user prompt.
Requirements routed back to local qwen — Claude CLI's coding-agent defaults were leaking through --print mode. Prompts like "Create a project that does X" produced "I need your permission to create files…" Local qwen just decomposes text. Routing change in settings.yaml — no code change.
Requirements style now conditional — plain English by default. Formal requirements style ("The system shall…") only when the prompt mentions government, compliance, regulatory, or standards-driven keywords. Keyword trigger picks the right system prompt per request.
"What to type" guidelines panel — collapsible panel above the prompt. 4 click-to-insert examples. What-works / what-to-avoid lists. Makes the platform self-documenting.
Key architecture decisions
Reviewer model: Single reviewer + retry + human-in-the-loop stays as default. Consensus available as opt-in for two scenarios: (1) weak local models that aggregate well for air-gapped deployments, (2) regulated or audit-heavy contexts where recorded independent assessments are the artifact.
State management: Decided on vanilla VFBus + VFStore module (~80 lines) instead of full framework migration. Current stack (HTMX + Alpine + custom event bus) is deliberately lightweight. Framework migration is 3-6 weeks with no clear ROI today.
Headline lesson: Most "pipeline isn't working" complaints were configuration or contract drift, not code bugs. Claude CLI defaults leaking through. Scanner false-positiving on legitimate text. Formal-style on every prompt. Each looked like a multi-day rabbit hole. Each was a few-line config change once the right boundary was identified.
May 16–17 — First clean SDLC, then triage
Started: Every run hit an AWAITING_HUMAN gate because the reviewer kept returning decision: abstain despite clearly approving in prose. Sabi mo OK na eh. Bakit abstain pa?!
Ended: First clean end-to-end SDLC run. Plus three NEW bugs surfaced — the kind hiding behind the abstain loop. Ayan. Fix one thing, tatlo ang lalabas.
The reviewer abstain fix
Haiku ignored "Return ONLY a JSON object" and wrote markdown with verdicts like "✅ Ready to merge." Parser found no JSON → defaulted to abstain → pipeline retried 3 times → exhausted → AWAITING_HUMAN.
Three-layer fix:
1. Tightened prompt — JSON instruction at the END, explicit markdown ban, one-shot examples.
2. Prose-fallback parser — scan for unambiguous phrases when JSON extraction fails.
3. Abstain halts immediately — same artifact + same prompt won't produce a different answer. Don't burn tokens proving it.
The layer that actually worked: Layer 2, the prose-fallback. Haiku STILL emits markdown after the prompt rewrite — but the fallback caught "the implementation is correct" and inferred approve. Safety nets built for failure modes end up being the thing that works.
The 22-line drone validator
Prompt asked for GPS validation, battery, IMU, mission state. Got 22 lines — just a MissionState enum. Reviewer: "✅ Correct and well-tested. Ready to merge."
Three stacked failures:
1. extract_code_block took only the first fenced block — rest silently dropped.
2. Tester writes tests for what's IN the source, not what SHOULD be.
3. Reviewer judges internal consistency, not spec compliance. Never reads requirements.
"Approved" doesn't mean "complete." Approved means the code that exists is consistent with its tests. NOT that it satisfies the prompt. I didn't appreciate this until a 22-line file got a green checkmark for a full telemetry system prompt.
Then the fix opened a new failure
After shipping the multi-block fix, the drone prompt got farther — but build failed with ModuleNotFoundError. The model's multi-block response wasn't "multiple classes in one file" — it was "multiple files." Naive concatenation produced a file that imports from itself. Queued: tighten codegen prompt to single-file, or strip cross-block imports for classes defined in the same output.
What we proved
The pipeline mechanics work end-to-end on a real prompt. The architecture reviewer correctly caught three different bugs across three loan-calculator attempts — it actually works naman pala. The retry loops work when LLMs cooperate. The bottleneck is consistently LLMs ignoring instructions, not my code. (Finally, not my fault.)
Headline lesson: "Stop shipping new code until you have a verified baseline." Five fixes layered with zero confirmed clean runs between them is a spiral. Fire the test prompt. Watch the events live. Let the data diagnose. Once the green RUN_COMPLETED landed, the next steps became obvious.
What I learned (so far)
1. Most "AI flakiness" is infrastructure flakiness. Auto-reload watchers, event-loop mismatches, silent fallback paths. Fix your plumbing. The AI is probably fine. Ikaw ang may problema.
2. The pipeline's failure mode isn't the LLM — it's the contracts between agents. Every implicit assumption is a bug waiting to happen. Make data flow explicit. Walang assumptions.
3. "Approved" doesn't mean "complete." Internal consistency is not spec compliance. Your reviewer can say "looks good!" while half the requirements are missing. Trust but verify.
4. Safety nets become the primary path. Prose-fallback parser, convergence detector, halt-on-fallback — all built as backups. All became essential. The backup plan IS the plan.
5. Stop shipping code until you have a verified baseline. Five fixes layered with zero confirmed clean runs between them is a spiral. Huminga muna. Fire the test prompt. Watch the events. Let the data tell you what's wrong.
6. Save your fixes. I fixed bugs that AI couldn't solve — then didn't commit. Lost the fix. LOST. IT. Now I commit immediately. Kahit git stash lang, basta may save.
What's next
The platform works. Agents collaborate. Consensus voting catches real issues. Audit trail logs everything. Hindi pa perfect, but it works. Now working on:
— .NET Core / C# pipeline support (because of course I need dotnet in here)
— Workspace container (code-server + Python + Node + .NET SDK — the whole buffet)
— Tester that validates against requirements, not just source (no more 22-line approvals)
— Reviewer that actually reads the original requirements
— Run-type architecture (pipeline runs vs direct-agent runs)
— And eventually — the name reveal. Mamaya na.
Name coming soon. For now, it's just the work that matters. At tsaka tulog. Kailangan ko na matulog.