mirror of https://gitlab.com/Anson-Projects/projects.git synced 2026-06-03 21:00:27 +00:00

draft I wrote at home

2026-02-22 22:06:31 -05:00
parent 082f157c83
commit a310dbeb84
6 changed files with 862 additions and 867 deletions
@@ -28,9 +28,9 @@ execute:
## Context and Problem Statement
Adoption of GenAI at Shield is lagging and uncoordinated. We have no standard tooling while defense primes have deployed to 170,000+ employees over the past 3 years [@boeing_genai_2024; @lockheed_genai_2024; @northrop_genai_2025]. China's military has already deployed domestic AI models at scale [@scmp_pla_deepseek_2025; @jamestown_deepseek_pla_2025]. All while Shield doesn't mention AI in developer onboarding, there is no GenAI software distributed by default to new employees, and our engineers have no clear guidance on acceptable usage.
Adoption of GenAI at Shield is lagging and uncoordinated. Defense primes have deployed GenAI tools to 170,000+ employees over the past 3 years [@boeing_genai_2024; @lockheed_genai_2024; @northrop_genai_2025]. China's military has already deployed domestic GenAI models at scale [@scmp_pla_deepseek_2025; @jamestown_deepseek_pla_2025]. Meanwhile, Shield doesn't mention GenAI in developer onboarding, distributes no GenAI software by default to new employees, and gives engineers no clear guidance on acceptable usage.
I propose that not only do we align on a single AI platform, but that we adopt the models and tools offered by Anthropic. Anthropic has consistently led in innovation and novel problem solving over the past two years, which is exactly the type of work we do here at Shield.
I propose that we align on a single GenAI platform and adopt the models and tools offered by Anthropic. Anthropic has consistently led in innovation and novel problem solving over the past two years, which is exactly the type of work we do here at Shield.
## Decision Drivers
@@ -47,47 +47,47 @@ Anthropic has consistently been first to ship capabilities that matter for engin
| [Plugin Marketplace](https://www.anthropic.com/news/claude-code-plugins) | Oct 2025 | **No competitors** |
| [Cowork](https://claude.com/blog/cowork-research-preview) | Jan 2026 | **No competitors** — agentic file operations for non-engineers |
OpenAI and Microsoft led on the older innovations (chat interface, code completion), but Anthropic has dominated the 2024-2025 wave of agentic tooling. MCP is now a Linux Foundation project—Anthropic created the standard everyone else adopted.
OpenAI and Microsoft led on the older innovations (chat interface, code completion), but Anthropic has dominated the 2024-2025 wave of agentic tooling.
The real gap isn't just "first to ship"—it's iteration velocity. When OpenAI adopted MCP in Mar 2025, Anthropic had already shipped Claude Code. When Codex shipped in May 2025, Claude Code was already iterating on plan mode and background tasks. By the time competitors release their v1, Anthropic is on v2 or v3.
Anthropic is not only shipping first but also iterating constantly. When OpenAI adopted MCP in Mar 2025, Anthropic had already shipped Claude Code. When Codex shipped in May 2025, Claude Code was already iterating on plan mode and background tasks. By the time competitors release their v1, Anthropic is on v2 or v3.
### Tools lack cross compatibility
### Tools Lack Cross-Compatibility
Every tool has its own naming conventions, formats, and toolsets (MCP servers, agents, skills, rules). Those differences are solvable with symlinks and scripts. The real blocker is that prompts don't transfer between models.
Every platform has a different set of tools and expectations. The surface-level issue is that every platform has its own naming conventions, formats, and toolsets (MCP servers, agents, skills, rules). Those differences are solvable with symlinks and scripts. The real blocker is that prompts don't transfer between models.
Studies show that even minor formatting changes can cause 76% difference in model outputs [@sclar_prompt_sensitivity_2023], and these format preferences don't transfer between models. Automated prompt translation improves cross-model performance by 27% on SWE-Bench and 39% on Terminal-Bench [@promptbridge_2025], but proves how differently models approach problems. This means that teams using different platforms are investing in parallel, but largely incompatible direcions.
Studies show that even minor formatting changes can cause a 76% difference in model outputs [@sclar_prompt_sensitivity_2023], and these format preferences don't transfer between models. Automated prompt translation improves cross-model performance by 27% on SWE-Bench and 39% on Terminal-Bench [@promptbridge_2025], which also shows how differently models approach problems. This means that teams using different platforms are investing in parallel but largely incompatible directions.
I have seen this first hand at Shield. Even with the latest frontier models the upgrade tooling that I created for Claude to do HMSDK upgrades flat out didn't work in ChatGPT (November 2025). Even after multiple iterations of the tool in Codex it could not grasp the task at hand and how to use all the tooling I had generated alongside Claude.
I have seen this firsthand at Shield. Even with the latest frontier models, the upgrade tooling that I created for Claude to do HMSDK upgrades flat out didn't work in ChatGPT (November 2025). Even after multiple iterations of the tool in Codex it could not grasp the task at hand and how to use all the tooling I had generated alongside Claude.
### Productivity Expectations
GenAI is making a real measurable impact across industries. 90% of Fortune 100 companies [@github_copilot_fortune100_2025] have deployed AI coding tools, ~85% of developers [@stackoverflow_survey_2024; @jetbrains_survey_2025] use them, and the market has grown to $7.4 billion [@mordor_ai_code_market_2025].
GenAI is making a real measurable impact across industries. 90% of Fortune 100 companies [@github_copilot_fortune100_2025] have deployed GenAI coding tools, ~85% of developers [@stackoverflow_survey_2025; @jetbrains_survey_2025] use them, and the market has grown to $7.4 billion [@mordor_ai_code_market_2025].
The biggest wins come from automating process work. Teams using AI tools ship 26% more PRs per week [@mit_microsoft_copilot_2024] with 4x faster turnaround [@opsera_copilot_2024].
The biggest wins come from automating process work. Teams using GenAI tools ship 26% more PRs per week [@mit_microsoft_copilot_2024] with 4x faster turnaround [@opsera_copilot_2025].
Defense primes have already scaled. Lockheed has 70,000 users [@lockheed_genesis_2025] on its Genesis platform. Boeing deployed to 170,000 employees [@boeing_genai_2024] by late 2023. Blue Origin reports 95% of software engineers using AI tools with 70% company-wide adoption [@blue_origin_aws_2025].
Defense primes have already scaled. Lockheed has 70,000 users [@lockheed_genesis_2025] on its Genesis platform. Boeing deployed to 170,000 employees [@boeing_genai_2024] by late 2023. Blue Origin reports 95% of software engineers using GenAI tools with 70% company-wide adoption [@blue_origin_aws_2025].
### ROI Math
At ~$60/seat/month for Claude Enterprise ($720/year), even a conservative 20% productivity gain on a $200K engineer creates $40,000 in value, a 55x return. Novo Nordisk reduced clinical report writing from 10+ weeks to 10 minutes [@novo_nordisk_claude_2025]. TELUS engineering teams ship code 30% faster [@telus_claude_2025]. Pfizer saves up to 16,000 hours annually [@pfizer_claude_aws_2024].
At ~$60/seat/month for Claude Enterprise ($720/year), even a conservative 20% productivity gain on a $200K engineer creates $40,000 in value, a 55x return. Novo Nordisk reduced clinical report writing from 10+ weeks to 10 minutes [@novo_nordisk_claude_2025]. TELUS engineering teams ship code 30% faster [@telus_claude_2025].
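
The arithmetic behind that return figure can be checked with a few lines of Python. This is a minimal sketch under the assumptions stated above (a $200K engineer, a 20% productivity gain, a ~$60/seat/month subscription); the function names `seat_roi` and `break_even_gain` are illustrative and not part of any tool.

```python
def seat_roi(salary, productivity_gain, seat_cost_per_month):
    """Return (annual seat cost, annual value created, return multiple)."""
    annual_cost = seat_cost_per_month * 12
    # Assumption: value created scales linearly with salary and gain.
    annual_value = salary * productivity_gain
    return annual_cost, annual_value, annual_value / annual_cost


def break_even_gain(salary, seat_cost_per_month):
    """Productivity gain (as a fraction) at which the seat pays for itself."""
    return seat_cost_per_month * 12 / salary


# Inputs taken from the paragraph above.
cost, value, multiple = seat_roi(200_000, 0.20, 60)
print(f"cost=${cost:,.0f}/yr value=${value:,.0f}/yr return={multiple:.1f}x")
print(f"break-even gain: {break_even_gain(200_000, 60):.2%}")
```

With these inputs the multiple comes out to roughly 55.6, which the text rounds down to 55x, and the seat pays for itself at a productivity gain of well under 1%. Swapping in a different salary or seat price shows how sensitive the conclusion is to those assumptions.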
Agentic tools like Claude Code cost more (up to $1000/seat/month at heavy usage) but unlock work that wasn't previously viable. Public ROI data is limited, so here are real examples from Shield:
Agentic tools like Claude Code cost more (up to $1000/seat/month at heavy usage), but unlock work that wasn't previously viable. Public ROI data is limited, so here are real examples from Shield:
**HMSDK Upgrades**: Upgrading between SDK versions is a significant undertaking. Claude Code enabled AI agents to work around the clock on massive upgrades. Without AI, these upgrades wouldn't have been feasible at all due to time and personel constraints.
*ROI: 5x faster, but more importantly, AI made this work viable.*
**HMSDK Upgrades**: Upgrading between SDK versions is a significant undertaking. Claude Code enabled GenAI agents to work around the clock on massive upgrades. Without GenAI, these upgrades wouldn't have been feasible at all due to time and personnel constraints.
*ROI: 5x faster, but more importantly, GenAI made this work viable.*
**Customer Engagement Acceptance Testing**: CE has used Claude Code for SDK acceptance testing since 25.3. It enables fast bug discovery, root cause analysis that CE traditionally couldn't prioritize, and async execution that expands testing scope.
*ROI: A bad SDK release could jeopardize million dollar contracts.*
*ROI: A bad SDK release could jeopardize million-dollar contracts. GenAI has found issues in almost every release.*
**Training Material Validation**: HMSDK training materials take two weeks for a new engineer to complete, making them difficult to keep current and idiomatic.
*ROI: full validation of basic training materials takes 1hr in CI pipeline, takes engineer a full day. Future iterations have massive potential.*
*ROI: Full validation of basic training materials takes one hour in CI versus a full day for an engineer. Future iterations have massive potential.*
### Not all models are made the same
### Model Performance Varies
Benchmarking models is really hard so I don't want to focus on it since its likely that models from the main labs will continue to be in striking distance of eachother. However, tooling is where we are seeing real stratification and what we should use to drive decisions.
Benchmarking models is difficult and I don't want to focus on it since it's likely that models from the major labs will continue to be in striking distance of each other. Tooling is where we see real stratification and what we should use to drive decisions.
Models are trained on solved problems, the problems at Shield aren't solved yet. For this reason I think we should place a focus on benchmarks that involve novel problem solving, and multi-step software engineering tasks. We should also place a large emphasis on performance and not price, a small intelligence gain is worth extra costs given our domain.
Models are trained on solved problems; the problems at Shield aren't solved yet. For this reason, I think we should focus on benchmarks that involve novel problem solving and multi-step software engineering tasks. We should also emphasize performance over price. A small intelligence gain is worth extra costs given our domain.
SWE-bench Verified [@swebench_2024] is the industry standard benchmark for evaluating AI on real-world software engineering. It tests models against 500 actual GitHub issues from popular Python repositories—the model must read the issue, understand the codebase, and produce a working patch.
@@ -96,9 +96,8 @@ SWE-bench Verified [@swebench_2024] is the industry standard benchmark for evalu
| Gemini 3 Pro | 76.2% | $0.46 | IL6+ |
| Claude 4.5 Opus | **74.4%** | $0.72 | Not Available|
| GPT-5.2 (high reasoning) | 71.8% | $0.52 | IL6+ |
| Claude 4.5 Sonnet | 70.6% | $0.56 | IL5[^il5] |
| Claude 4.5 Sonnet | 70.6% | $0.56 | IL5 |
[^il5]: IL5 encompasses FedRAMP High, CUI, IL4, and IL5 authorizations—essentially all unclassified government work.
## Considered Options
@@ -110,12 +109,14 @@ SWE-bench Verified [@swebench_2024] is the industry standard benchmark for evalu
| **Agentic Chat** | Cowork (Jan 2026) | ✗ | ✗ | ✗ | ✗ |
| **Desktop App** | Claude Desktop | ChatGPT Desktop | Gemini Desktop | ✗ | ✗ |
| **AI IDE** | Extension | Extension | Antigravity | Cursor | Windsurf |
| **Code Autocomplete** | ✗ | ✗ | Gemini Code Assist | Yes | Yes |
| **CLI Agent** | Claude Code | Codex CLI | Gemini CLI | ✗ | ✗ |
| **Python SDK** | anthropic | openai | google-genai | ✗ | ✗ |
| **Embedding Models** | ✗ | text-embedding-3 | text-embedding | ✗ | ✗ |
| **Image Generation** | ✗ | DALL-E | Imagen | ✗ | ✗ |
| **Agent SDK** | Agent SDK | Agents SDK | ADK | ✗ | ✗ |
| **MCP Support** | Native (creator) | Mar 2025 | Apr 2025 | Yes | Yes |
| **Rules** | ✗ | ✗ | .antigravity | .cursorrules | .windsurfrules |
| **Hooks/Automation** | Jul 2025 | ✗ | ✗ | ✗ | Dec 2025 |
| **Plugin Marketplace** | Oct 2025 | ✗ | ✗ | ✗ | ✗ |
| **Computer Use** | Oct 2024 | Operator (Jan 2025) | ✗ | ✗ | ✗ |
@@ -129,17 +130,17 @@ SWE-bench Verified [@swebench_2024] is the industry standard benchmark for evalu
### Platforms
Three frontier providers worth considering. All offer necessary capabilities for full adoption at similar price points. The differentiators matter:
Three frontier providers are worth considering. All offer necessary capabilities for full adoption at similar price points. The differentiators matter:
**Anthropic** leads in agentic tooling. They shipped MCP, Claude Code, hooks, and the plugin marketplace before anyone else. Competitors follow 6-12 months behind. If you want the latest capabilities for autonomous engineering workflows, Anthropic gets there first. Downside: IL6+/classified support is still in pilot.
**Anthropic** leads in agentic tooling. They shipped MCP, Claude Code, hooks, and the plugin marketplace before anyone else. Competitors follow 6-12 months behind. For the latest capabilities in autonomous engineering workflows, Anthropic gets there first. Downside: IL6+/classified support is still in pilot.
**OpenAI** has the broadest ecosystem. If you need to integrate with existing enterprise tools or want the safest vendor choice, OpenAI has the most established relationships. Downside: consistently 6-12 months behind on agentic features, Copilot still lacks IL5 authorization, and their focus skews toward general audiences rather than specialized engineering work.
**OpenAI** has the broadest ecosystem. For integrating with existing enterprise tools, or for the safest vendor choice, OpenAI has the most established relationships. Downside: consistently 6-12 months behind on agentic features, and their focus skews toward general audiences rather than specialized engineering work.
**Google** has the deepest government presence. Selected for GenAI.mil serving 3M+ DoD personnel, IL6+ authorized, and tightly integrated with Google Cloud. If GovCloud and classified work are priorities, Google has the strongest position. Downside: agentic tooling is newer and less mature, and enterprise pricing is opaque.
**Google** has the deepest government presence. Selected for GenAI.mil serving 3M+ DoD personnel, IL6+ authorized, and tightly integrated with Google Cloud. For GovCloud and classified work, Google has the strongest position. Downside: Tooling is very new, pricing is opaque, and Google has a poor track record of support.
### Security and GovCloud
**Private Plugin Marketplace**: Claude Code is the only tool that lets us host a private marketplace on internal GitLab, distributing proprietary tooling automatically. No competitor offers this.
**Private Plugin Marketplace**: Claude Code is the only tool that lets us host a private marketplace on internal GitLab, distributing proprietary tooling automatically. No competitor offers this. Marketplaces are trivial to set up, so we can develop enterprise-wide tools while individual teams maintain their own specialized tooling.
**FedRAMP is no longer the bottleneck.** Authorization that previously took years now completes in under two months for pilot participants.
@@ -217,11 +218,4 @@ Adopt the Anthropic ecosystem company-wide:
- **Claude Code** for engineering (agentic coding, automation)
- **Agent SDK** for custom automation workflows
Every new employee gets a Claude Enterprise subscription. Engineers get Claude Code API keys through the central account.
## Implementation
1. **Procurement**: Negotiate Claude Enterprise agreement
2. **Rollout**: Phase 1 (engineering), Phase 2 (all employees)
3. **Training**: Internal docs, CLAUDE.md templates, MCP server examples
4. **Plugin Marketplace**: Stand up internal GitLab-hosted marketplace
Every new employee gets a Claude Enterprise subscription. Engineers get Claude Code API keys through the central account.
@@ -140,12 +140,12 @@
note = {Systematic analysis of dozens of PLA procurement documents citing DeepSeek-based tools}
}
@online{stackoverflow_survey_2024,
@online{stackoverflow_survey_2025,
author = {{Stack Overflow}},
title = {2024 Stack Overflow Developer Survey - AI},
year = {2024},
url = {https://survey.stackoverflow.co/2024/ai},
note = {76\% of developers using or planning to use AI tools}
title = {2025 Stack Overflow Developer Survey - AI},
year = {2025},
url = {https://survey.stackoverflow.co/2025/ai},
note = {84\% of developers using or planning to use AI tools}
}
@online{jetbrains_survey_2025,
@@ -181,12 +181,13 @@
note = {26\% increase in completed tasks across 4,867 developers at Microsoft, Accenture, and Fortune 100}
}
@online{opsera_copilot_2024,
@online{opsera_copilot_2025,
author = {{Opsera}},
title = {GitHub Copilot Enterprise: The Impact on Developer Productivity},
year = {2024},
url = {https://www.opsera.io/blog/github-copilot-enterprise-impact-on-developer-productivity},
note = {PR turnaround time dropped from 9.6 days to 2.4 days among teams using GitHub Copilot}
title = {GitHub Copilot Adoption Trends: Insights from Real Data},
year = {2025},
month = {2},
url = {https://www.opsera.io/blog/github-copilot-adoption-trends-insights-from-real-data},
note = {Time to open a PR dropped from 9.6 days to 2.4 days among teams using GitHub Copilot}
}
@article{metr_ai_productivity_2025,