I’ve been working with AI-assisted coding for a long while now, and I’ve gotten to experience both the strengths and the weaknesses of agent coding. My current setup helps mitigate the weaknesses by enforcing strict quality gates at commit time. The goal? Never commit “bad” code!
At first it appears that creating software is a solved problem: we can spin up multiple agents on multiple projects and autonomously churn out software. Hurray! We’re truly limited only by our ability to generate interesting ideas! Quickly, though, we all come to realize that we aren’t at that point quite yet. Code is haphazardly thrown anywhere, architecture be damned, and the agent just does whatever it wants. Translating human language into deterministic software features is hard, a realization that is absolutely not unique to the dawn of AI coding.
So we do the next best thing: waterfall, er, I mean spec-driven development (SDD)! What if we write out exactly what we’re looking for from the software? We collaborate with the agent, making sure we clarify any vague spots. Ta-da, we’ve created all this fantastic documentation and these plans for how to build the software. Surely any model driving the agent session can turn these fantastic docs into working software. Heck, maybe SDD is how we can “one shot” an agent into producing the working software we’re after!
No. Turns out there were hidden places of vagueness and contradictions… Now you have two artifacts to maintain: the software itself (both the production code and the test code) and the “spec” documentation. Practitioners may say “don’t look at the code, fix the documentation. The AI agent is just the compiler that turns your docs into software,” which I’m sure can work, but it bothers me. Aren’t the specs of the software ideally supposed to be represented in its different test suites?
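
To make the “specs as tests” idea concrete, here is a minimal sketch. The business rule, the `shipping_cost` function, and the dollar amounts are all hypothetical, invented purely for illustration. A prose spec that says “orders over $100 ship free” leaves open whether exactly $100 qualifies; a test has to pick an answer.

```python
from decimal import Decimal

# Hypothetical business rule, invented for this example.
FREE_SHIPPING_THRESHOLD = Decimal("100.00")
FLAT_RATE = Decimal("5.99")


def shipping_cost(order_total: Decimal) -> Decimal:
    """Orders at or above the threshold ship free; all others pay the flat rate."""
    if order_total >= FREE_SHIPPING_THRESHOLD:
        return Decimal("0.00")
    return FLAT_RATE


def test_orders_at_or_above_threshold_ship_free():
    # The boundary case is exactly the spot where a prose spec tends to be vague.
    assert shipping_cost(Decimal("100.00")) == Decimal("0.00")
    assert shipping_cost(Decimal("250.00")) == Decimal("0.00")


def test_orders_below_threshold_pay_flat_rate():
    assert shipping_cost(Decimal("99.99")) == FLAT_RATE
```

The test names read like spec sentences, and unlike a document they fail loudly when the code and the “spec” disagree.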
Where I am now is centering on the idea of pushing feedback to the agent “left”: instead of having it author a set of changes and push them up only to fail a CI pipeline, run those checks as early as possible. I’ve landed on using git pre-commit hooks to run a set of software quality gates. My agent follows the rule to commit once it provably has a feature complete, and because a failing hook aborts the commit, it naturally fixes anything that comes up while committing. Fantastic!
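
A minimal sketch of what such a hook could look like, assuming a Python project that uses `ruff` and `pytest`. The gate list and the names `Gate`, `run_gates`, and `main` are illustrative assumptions, not a prescribed setup. The one thing it leans on is git’s own behavior: a pre-commit hook that exits non-zero aborts the commit.

```python
#!/usr/bin/env python3
"""Sketch of a pre-commit hook that runs quality gates and blocks bad commits."""
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    name: str
    command: list[str]  # argv list, executed without a shell


def run_gates(gates: list[Gate]) -> list[str]:
    """Run every gate and return the names of the ones that failed."""
    failed = []
    for gate in gates:
        try:
            result = subprocess.run(gate.command, capture_output=True, text=True)
        except FileNotFoundError:
            # A missing tool counts as a failure so a gate is never silently skipped.
            print(f"[gate] {gate.name}: FAILED (not installed: {gate.command[0]})",
                  file=sys.stderr)
            failed.append(gate.name)
            continue
        if result.returncode != 0:
            # Echo the tool's own output so the agent sees exactly what to fix.
            print(f"[gate] {gate.name}: FAILED", file=sys.stderr)
            sys.stderr.write(result.stdout + result.stderr)
            failed.append(gate.name)
        else:
            print(f"[gate] {gate.name}: ok", file=sys.stderr)
    return failed


# Assumed gate list; swap in whatever tools your project actually uses.
GATES = [
    Gate("format", ["ruff", "format", "--check", "."]),
    Gate("lint", ["ruff", "check", "."]),
    Gate("unit tests", ["pytest", "-q"]),
]


def main() -> int:
    # A non-zero return value, passed to the exit status, makes git abort the commit.
    return 1 if run_gates(GATES) else 0
```

To try it, save it as `.git/hooks/pre-commit`, add a final line `raise SystemExit(main())`, and mark the file executable. Running every gate instead of stopping at the first failure is a deliberate choice here: the agent gets the full list of problems in one pass rather than discovering them one commit attempt at a time.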
There is probably a way of shifting this feedback even further left, a topic I have simmering on my mind’s back burner.
Software Quality Gates:
| Gate | Category | Tier | What it catches | Example tools | Notes (incl. AI-agent angle) |
|---|---|---|---|---|---|
| Secret scanning | Security | Universal | API keys, tokens, credentials, private keys committed to the repo | gitleaks, trufflehog, detect-secrets | Catastrophic and irreversible failure mode. Also catches agents hardcoding hallucinated keys. |
| File / repo hygiene | Repo health | Universal | Merge conflict markers, large binaries, invalid JSON/YAML, broken symlinks, stray .env/.DS_Store | pre-commit-hooks (check-merge-conflict, check-added-large-files, check-json) | Cheap, project-agnostic, zero judgment required. |
| Formatting | Style | Universal | Indentation, spacing, quote style, import ordering | prettier, black, gofmt, rustfmt, ruff format | Eliminates bikeshedding and diff noise. Agents are particularly prone to formatting drift, so auto-fixers are ideal. |
| Linting / static analysis | Code quality | Universal | Unused imports, undeclared variables, obvious bugs, anti-patterns | eslint, ruff, golangci-lint, clippy, shellcheck | Fast and high-signal; every mainstream language has a good one. |
| Unit tests | Testing | Universal | Logic regressions in individual functions/modules | pytest, jest, go test, cargo test, junit | The executable spec layer — keeps agents honest about whether code actually runs. |
| Commit message hygiene | Repo health | Universal | Non-conforming or empty commit messages, missing references | commitlint, gitlint, conventional-commits hooks | Compounding returns for changelogs, bisecting, auditability — especially with agent-authored commits. |
| Type checking | Code quality | Universal (where supported) | Type mismatches, null/undefined access, contract violations | tsc, mypy, pyright, sorbet | Catches agents inventing methods or passing wrong shapes before runtime. |
| Dependency vulnerability scanning | Security | Universal | Known CVEs in direct and transitive deps | npm audit, pip-audit, osv-scanner, dependabot, snyk | Also catches agents importing hallucinated or severely outdated packages. |
| Build / compile check | Build | Universal | Code that doesn’t compile, broken imports, missing assets | Language-native build tools | “It builds” is the floor. |
| Integration tests | Testing | Strongly recommended | Module-boundary breakages, contract mismatches, wiring bugs | pytest + testcontainers, supertest, RestAssured | Where bugs actually live in multi-component systems. Usually CI, not pre-commit. |
| SAST | Security | Strongly recommended | SQLi, XSS sinks, insecure crypto, hardcoded credentials beyond simple secrets | semgrep, codeql, bandit, sonarqube | Prevents agents from reaching for vulnerable patterns. More signal than linting, less noise than DAST. |
| License compliance | Legal | Strongly recommended | Incompatible dependency licenses (GPL in proprietary code, missing attributions) | license-checker, fossa, scancode | Cheap to automate, expensive in court. Critical if shipping/distributing. |
| Test coverage threshold | Testing | Strongly recommended | Untested code paths, coverage regressions | jest --coverage, coverage.py, jacoco, codecov | Best as a regression gate, strict on new code, lenient on legacy. Forces agents to test what they just generated. |
| Dead code / unused export detection | Code quality | Strongly recommended | Orphaned files, unreferenced exports, unreachable branches | knip, ts-prune, vulture, deadcode | High value with agents, which love leaving scaffolding behind. |
| Smoke tests | Testing | Context-dependent | “Does the app even start” failures, broken happy paths | Custom scripts, playwright trace tests | Critical for deployable apps; meaningless for libraries. |
| Acceptance / E2E tests | Testing | Context-dependent | User-facing feature regressions, full-stack integration failures | playwright, cypress, selenium, cucumber | Essential for product-facing apps; thinned out for backend libs, CLIs, infra code. Ensures agents don’t break the frontend while “fixing” the backend. |
| Container / image scanning | Security | Context-dependent | Vulnerable base images, OS packages, misconfigurations | trivy, grype, docker scout, hadolint | Nearly mandatory if you ship containers. |
| IaC scanning | Security | Context-dependent | Misconfigured cloud resources, public buckets, overly permissive IAM | checkov, tfsec, kics, terrascan | Catches agents writing Dockerfiles/Terraform that expose ports or run as root. |
| API contract / schema validation | Testing | Context-dependent | Breaking changes to public APIs, OpenAPI/GraphQL schema drift | openapi-diff, graphql-inspector, pact, spectral | Critical for services with external consumers; overkill for internal monoliths. |
| Accessibility (a11y) | UX / compliance | Context-dependent | Missing alt text, contrast failures, keyboard traps, ARIA misuse | axe-core, pa11y, lighthouse-ci | Mandatory for regulated/public-sector frontends; recommended for any UI; N/A for backend. |
| Performance / load testing | Performance | Context-dependent | Latency regressions, throughput cliffs, memory leaks | k6, locust, jmeter, hyperfine, benchmark suites | High value for user-facing services at scale; too slow for pre-commit. |
| Visual regression | Testing | Context-dependent | Unintended CSS/layout changes, broken styling | percy, chromatic, playwright snapshots | Catches agents accidentally breaking layout while refactoring components. |
| Mutation testing | Testing | Context-dependent | Tests that pass without actually testing anything | stryker, mutmut, pitest | Excellent signal on test quality, especially for agent-written tests. Too slow for pre-commit. |
| DAST | Security | Context-dependent | Runtime vulnerabilities only visible against a running app | owasp zap, burp, nuclei | Needs a deployed environment; CI/CD-stage, not pre-commit. |
| SBOM generation | Security / compliance | Context-dependent | Missing software bill of materials | syft, cyclonedx, spdx-tools | Required for federal/regulated supply chains. |
| Database migration checks | Data | Context-dependent | Destructive migrations, missing rollbacks, locking statements on large tables | squawk, sqlfluff, custom migration linters | Only matters if you own a schema, but then high-value. |
| i18n / l10n checks | UX | Context-dependent | Hardcoded strings, missing translation keys, RTL breakage | i18n-lint, custom extractors | Only for localized products. |
| Architecture / dependency rules | Code quality | Context-dependent | Forbidden imports across module boundaries, layering violations | dependency-cruiser, archunit, import-linter | High value in large codebases with intentional architecture. |
| Code complexity / maintainability | Code quality | Context-dependent | Cyclomatic complexity ceilings, duplication, tech debt accumulation | sonarqube, pmd, radon, lizard | Useful as a project grows; overhead for small repos. |
| Domain-specific compliance | Compliance | Context-dependent | HIPAA PHI leakage, PCI cardholder data, GDPR data flows, SOC2 controls | DLP scanners, OPA/Gatekeeper, custom policies | Determined by regulatory regime. |
| Documentation / generated artifact sync | Docs | Context-dependent | Stale OpenAPI specs, generated code out of sync | spectral, custom scripts | High value for API-heavy or docs-driven projects. |
| Spell checking (code + docs) | Polish | Context-dependent | Typos in identifiers, comments, user-facing strings | cspell, codespell, typos | High ROI in agent-authored code; some teams find the noise not worth it. |
| ML-specific gates | Data / ML | Niche | Data drift, model bias, edge-case performance | Great Expectations, custom ML pipelines | Only for ML/AI projects. |
| Property-based testing | Testing | Niche | Edge cases unit tests miss | Hypothesis, fast-check, QuickCheck | High-reliability domains (finance, embedded). |
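
Not every gate in the table belongs in a pre-commit hook; several notes say as much (“too slow for pre-commit”, “usually CI”). One way to keep that decision explicit is a small catalog that records where each gate runs. The sketch below is an assumption-heavy illustration: `Stage`, `GateSpec`, and `gates_for` are hypothetical names, and the stage assignments follow the table’s notes where it gives one and are my guesses elsewhere.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    PRE_COMMIT = "pre-commit"
    CI = "ci"


@dataclass(frozen=True)
class GateSpec:
    name: str
    tier: str        # "Universal", "Strongly recommended", "Context-dependent", "Niche"
    stage: Stage     # where the gate runs
    enabled: bool = True


# Illustrative subset of the table; stage choices are assumptions unless the
# table's notes state them (integration tests, mutation testing, DAST).
CATALOG = [
    GateSpec("Secret scanning", "Universal", Stage.PRE_COMMIT),
    GateSpec("Formatting", "Universal", Stage.PRE_COMMIT),
    GateSpec("Linting / static analysis", "Universal", Stage.PRE_COMMIT),
    GateSpec("Unit tests", "Universal", Stage.PRE_COMMIT),
    GateSpec("Integration tests", "Strongly recommended", Stage.CI),
    GateSpec("Mutation testing", "Context-dependent", Stage.CI),
    GateSpec("DAST", "Context-dependent", Stage.CI),
    # Example of a context-dependent gate switched off, e.g. for a backend-only repo.
    GateSpec("Accessibility (a11y)", "Context-dependent", Stage.CI, enabled=False),
]


def gates_for(stage: Stage, catalog: list[GateSpec] = CATALOG) -> list[str]:
    """Return the names of enabled gates assigned to the given stage, in catalog order."""
    return [g.name for g in catalog if g.enabled and g.stage is stage]
```

Calling `gates_for(Stage.PRE_COMMIT)` gives the list the hook should run, while `gates_for(Stage.CI)` gives the slower checks that stay in the pipeline.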