Today, most AI-generated platform engineering work still relies on classical prompt engineering: describe the outcome, add context and constraints, then iterate until the pipelines, infrastructure, and operational assets are “good enough.” More structured methods and tools (such as BMAD or SpecKit) are still relatively uncommon for tasks like DevOps pipelines, infrastructure-as-code, operational documentation, and production-readiness controls.
That’s why I ran this comparison: I wanted to see whether
structured approaches can materially outperform normal prompting on a realistic
platform engineering task.
I compared four approaches for generating pipelines, cloud
infrastructure, and an operational layer for an existing application running on
Azure App Service: SpecKit used with normal prompting, classical prompt
engineering, SpecKit (method-driven), and BMAD.
The goal wasn’t just “does it compile?”—it was whether the
output looked like something a real platform team could run with.
Key takeaway: BMAD produced the only output that
resembles a production-ready platform baseline (97/135 overall, 78% readiness).
The other approaches were useful for demos and workflow enablement, but they
did not consistently generate the operational and governance layers you’d
expect in a real platform build.
Overall scorecard
| Approach | Overall score | Production readiness | Best fit |
|---|---|---|---|
| SpecKit with normal prompting | 41 / 135 | 20% | Teaching the minimum |
| Classical prompt engineering | 53 / 135 | 44% | Demo and operator-friendly workflow |
| SpecKit | 52 / 135 | 38% | Spec-driven methodology |
| BMAD | 97 / 135 | 78% | Production-grade target architecture |
The scorecard shows a clear separation. SpecKit with normal
prompting, classical prompt engineering, and SpecKit produced useful outputs,
but they stayed closer to demos or methodology examples. BMAD scored
97/135—nearly double the classical and SpecKit runs—and reached 78% production
readiness.
What each approach produced
| Capability | SpecKit with normal prompting | Classical prompt engineering | SpecKit | BMAD |
|---|---|---|---|---|
| App tests passing | 15/15 | 16/16 | 16/16 | 17/17 |
| Terraform validate | ✅ | ✅ | ✅ | ✅ (dev + prod) |
| Terraform fmt clean | ❌ | ✅ | ✅ | ❌ |
| Dev/prod environments | ❌ | ❌ | ❌ | ✅ |
| Modular Terraform | ❌ | ❌ | ❌ | ✅ (4 modules + bootstrap) |
| App Insights | ❌ | ❌ | ❌ | ✅ |
| Log Analytics | ❌ | ❌ | ❌ | ✅ |
| Key Vault | ❌ | ❌ | ❌ | ✅ |
| Managed Identity | ❌ | ❌ | ❌ | ✅ |
| Slot-based deployment | ❌ | ❌ | ❌ | ✅ |
| CodeQL | ❌ | ❌ | ❌ | ✅ |
| Dependabot | ❌ | ❌ | ❌ | ✅ |
| CODEOWNERS / PR template | ❌ | ❌ | ❌ | ✅ |
This is where BMAD pulled away. It didn’t just produce
application code and basic pipelines; it generated a broader platform layer:
infrastructure modules, observability, identity, secrets management,
environment separation, and release governance.
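To make that concrete, a modular root configuration in the spirit of what BMAD generated might look like the sketch below. The module names, paths, and variables are illustrative assumptions, not the actual generated code:

```hcl
# environments/dev/main.tf — illustrative wiring of per-concern modules
# (module and variable names are hypothetical)

module "identity" {
  source = "../../modules/identity"
  name   = "app-dev-identity"
}

module "key_vault" {
  source              = "../../modules/key_vault"
  name                = "kv-app-dev"
  reader_principal_id = module.identity.principal_id
}

module "observability" {
  source = "../../modules/observability" # Log Analytics + App Insights
  name   = "app-dev"
}

module "app_service" {
  source                  = "../../modules/app_service"
  name                    = "app-dev"
  identity_id             = module.identity.id
  key_vault_id            = module.key_vault.id
  app_insights_connection = module.observability.connection_string
}
```

A prod environment would reuse the same modules with its own variable values, which is what makes the dev/prod separation cheap to maintain.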
Infrastructure and operations comparison
| Metric | SpecKit with normal prompting | Classical prompt engineering | SpecKit | BMAD |
|---|---|---|---|---|
| Terraform files | 5 | 5 | 6 | 28 |
| Terraform modules | 0 | 0 | 0 | 4 + bootstrap |
| Workflow files | 4 | 5 | 5 | 5 |
| Workflow LOC | 257 | 526 | 289 | 566 |
| Authored Markdown files | 24 | 60 | 63 | 7 |
| Authored Markdown LOC | 1,255 | 4,339 | 4,347 | 378 |
| Framework Markdown LOC | n/a | n/a | included in SpecKit | ~24,839 |
| App endpoints | 2 | 3 | 3 | 4 |
| Tests | 15 | 16 | 16 | 17 |
BMAD had the largest Terraform footprint (28 .tf
files), but that’s largely because it created a more realistic platform
structure. It included modules for App Service, identity, Key Vault, and
observability, plus a bootstrap layer and separate dev/prod environments.
BMAD also kept the project-specific documentation lean: only
7 authored Markdown files and 378 authored lines. The trade-off is that the
BMAD framework itself brings ~25k lines of installed scaffolding under .agents
and _bmad, which raises the learning curve.
Security and quality results
| Check | SpecKit with normal prompting | Classical prompt engineering | SpecKit | BMAD |
|---|---|---|---|---|
| npm audit high issues | 0 | 0 | 0 | 0 |
| Trivy high/critical | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0 |
| Grype issues | 0 | 0 | 0 | 0 |
| Gitleaks leaks | 0 | 0 | 0 | 0 |
| actionlint issues | 0 | 0 | 0 | 0 |
| Checkov pass/fail | 4 / 16 | 8 / 12 | 8 / 12 | 27 / 25 |
| CodeQL | ❌ | ❌ | ❌ | ✅ |
| Dependabot | ❌ | ❌ | ❌ | ✅ |
All four approaches passed the basic dependency and
secret-scanning checks. BMAD went further by adding CodeQL and Dependabot,
which strengthens the long-term security and maintenance posture.
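Dependabot in particular is cheap to adopt: a single configuration file covers application dependencies, workflow actions, and Terraform providers. The sketch below is a minimal example in the spirit of what BMAD set up; the `/infra` directory is an assumption, not the actual generated path:

```yaml
# .github/dependabot.yml — minimal illustrative configuration
version: 2
updates:
  - package-ecosystem: "npm"            # application dependencies
    directory: "/"
    schedule:
      interval: "weekly"
  - package-ecosystem: "github-actions" # workflow action versions
    directory: "/"
    schedule:
      interval: "weekly"
  - package-ecosystem: "terraform"      # provider and module versions
    directory: "/infra"                 # hypothetical infra path
    schedule:
      interval: "weekly"
```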
BMAD also had the highest Checkov failure count (25). That
sounds negative, but it needs context: BMAD created more real cloud resources
to scan—Key Vault, App Service, Log Analytics, managed identity, diagnostic
settings, and remote-state bootstrap resources. More surface area means more
findings, but it also reflects a more realistic architecture.
Conclusion
In practice, the four approaches showed four distinct
“personalities” (and different fit-for-purpose trade-offs).
| Approach | Verdict |
|---|---|
| SpecKit with normal prompting | Smallest and easiest to understand, but weakest production posture |
| Classical prompt engineering | Closest to how many teams use AI today: prompt, inspect, iterate |
| SpecKit | Best traceability from spec to tasks, but limited platform depth |
| BMAD | Strongest production architecture and operational coverage |
Recommendation: use BMAD as the target architecture
baseline. It was the only approach that landed in a convincing “platform-ready”
shape: 78% production readiness, 97/135 overall score, 17 passing tests, 4
Terraform modules, separate dev/prod environments, slot-based deployment, Key
Vault, managed identity, Application Insights, Log Analytics, CodeQL,
Dependabot, CODEOWNERS, and a release workflow.
That said, BMAD shouldn’t be used blindly. I would fix the
seven Terraform formatting issues, review the 25 Checkov findings, and add the
missing pieces (alerts, structured logging, branch protection, SHA-pinned
GitHub Actions, and OpenAPI contract tests). I’d also borrow from the other
approaches: SpecKit’s demo artefacts and the classical workflow’s defensive
guards.
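On SHA pinning specifically: the fix is to reference each action by a full commit SHA instead of a mutable tag, so a compromised or moved tag can't silently change what runs. A sketch (verify the SHA yourself before pinning; the one below is an example for `actions/checkout`):

```yaml
# Illustrative workflow step with a SHA-pinned action
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      # Pin to the full commit SHA; the trailing comment records the tag it resolves to
      - uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
```

Tools like Dependabot (with the `github-actions` ecosystem enabled) will keep these pins updated automatically.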