About
Look, I'm just trying to keep up with this absolute chaos and not lose my mind.
AI models drop faster than I can update my API keys. Benchmarks change weekly. The tool I thought was essential last month got obsoleted by something released yesterday. And everyone—everyone—is claiming their model is "the best in the world" at something.
So I started writing things down. Mostly for myself. Because when GPT-5.2, Claude Opus 4.5, Gemini 3 Pro, Kimi K2.5, and GLM-4.7 all claim to be the best coding model within a 6-week span, someone needs to actually test them and figure out what's real.
Turns out other people also wanted someone to cut through the BS.
What This Actually Is
I write about AI tools and models that matter for people who actually ship things. Not "exciting developments in the space." Not "10 ways AI will change everything." Not vendor press releases rewritten with a thesaurus.
Model releases with actual context. When Kimi K2.5 drops with "agent swarms coordinating 100 sub-agents," I test whether that's revolutionary or marketing. When Claude claims 30-hour autonomous task execution, I check if it actually maintains focus or quietly derails at hour 12.
Coding tools people use in production. Cursor vs Windsurf vs Claude Code isn't about which has better benchmarks. It's about which one doesn't destroy your codebase when you walk away for coffee. I've tested all three. I have opinions. They're based on what broke and what didn't.
The pricing that actually matters. Sure, Gemini 3 Pro costs $2/$12 per million tokens (input/output). But if it uses 2x the tokens to complete the same task as Claude Sonnet 4.5 at $3/$15, the math changes. I track real costs, not headline rates.
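To make that concrete, here's the back-of-the-envelope math I mean. A minimal Python sketch; the per-million prices are the ones quoted above, but the token counts per task are made-up placeholders standing in for whatever you'd actually measure:

```python
# Back-of-the-envelope effective-cost math, not any vendor's billing API.
# Prices are $/million tokens (input/output) from the rates quoted above.
# The token counts per task are HYPOTHETICAL placeholders; measure your own.

def task_cost(input_tokens: int, output_tokens: int,
              price_in: float, price_out: float) -> float:
    """Dollar cost of one task at the given $/million-token rates."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# Assume Gemini 3 Pro burns 2x the tokens to finish the same task.
gemini = task_cost(400_000, 80_000, price_in=2.00, price_out=12.00)
sonnet = task_cost(200_000, 40_000, price_in=3.00, price_out=15.00)

print(f"Gemini 3 Pro:      ${gemini:.2f}/task")   # $1.76
print(f"Claude Sonnet 4.5: ${sonnet:.2f}/task")   # $1.20
```

The "cheaper" headline rate loses once consumption enters the picture. That's the kind of math the sticker price hides.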
Benchmarks with skepticism. SWE-bench scores matter. But "best coding model" means nothing if it generates code with 198 security vulnerabilities per million lines (looking at you, Claude Sonnet 4.5's SonarQube results). I read the methodology. I check for contamination. I note when vendors cherry-pick metrics.
What I'm Not Doing
I'm not neutral. I use these tools daily. I have preferences. When Claude works better for my workflows, I say so. When GPT-5.2 solves something Claude can't, I say that too.
I'm not comprehensive. The AI space moves too fast for comprehensive coverage. I cover what matters to people building things: coding tools, production models, workflow automation, stuff that saves time or money.
I'm not first. Breaking news goes to people with better sources and faster timelines. I write when I've actually tested things and have something useful to say.
Who This Is For
You're probably a developer, data person, or someone whose job involves shipping AI-powered things into production.
You're tired of:
- Vendor marketing disguised as technical analysis
- Benchmarks without context
- "This changes everything" posts about minor updates
- Tutorials that don't mention that the tools broke in production
- People who clearly haven't used the thing they're reviewing
You want:
- Honest assessments of what works and what doesn't
- Cost breakdowns that include token consumption, not just pricing
- Comparisons based on actual use, not spec sheets
- Someone to tell you whether the new shiny thing is worth switching to
Current State of Play (January 2026)
As of right now:
- Best overall reasoning: GPT-5.2 (per Artificial Analysis Intelligence Index v4.0)
- Best user experience: Gemini 3 Pro (per LMArena Elo rankings)
- Best coding: Claude Opus 4.5 (80.9% SWE-bench Verified, though it'll probably get beaten next week)
- Best value: Kimi K2.5 or GLM-4.7 (open weights, 1/8th the cost, 75% of the performance)
- Best for not caring about any of this: Claude Sonnet 4.5 (reliable, proven, not the cheapest or most capable but it just works)
These will all be wrong by February. That's the point. Nothing stays true.
Why I Actually Write This
Honestly? Because I kept explaining the same things to different people.
"Should I switch from Claude to GPT-5.2?"
"Is Cursor worth the $20/month?"
"Can Kimi K2.5's agent swarms actually replace Claude Code?"
"Why does everyone say Gemini 3 Pro is best but developers all use Claude?"
After the tenth time explaining the nuanced answer, I started writing it down. Then I kept writing because:
- The space moves too fast for my own memory
- Vendor claims need independent verification
- Someone should track what actually ships vs what gets announced
- Production reality ≠ benchmark reality
- It's weirdly fun watching AI labs panic-release models every 72 hours
What You'll Find Here
Monthly breakdowns of what actually mattered. Not every model release—the ones that changed workflows.
Tool comparisons for people who ship code. Not feature lists—"here's where Cursor broke on my production codebase and here's how Windsurf handled it."
Benchmark analysis with skepticism. When models claim 95.7% on AIME 2025, I check if the test is saturated, who else scored 95%+, and whether that translates to anything useful.
Pricing reality checks. Token costs, consumption patterns, hidden fees, and what your monthly bill actually looks like at scale.
Honest takes on what works. I use these tools daily. When something's good, I say so. When it's overhyped, I say that too.
The Vibe
I'm not trying to be The Definitive Source™ on AI. I'm trying to be the person who actually tests things, notices when marketing doesn't match reality, and writes it down before the next model launch makes everything obsolete again.
If you want breathless hype about how AI will change everything, read vendor blogs.
If you want someone who tested whether Kimi K2.5's agent swarms actually work (they do, but the CLI tooling is rough), whether the 66% Claude Opus price cut is real (yes and no—per-token yes, per-task no), and whether you should actually switch from your current setup (probably not unless something specific is broken)—you're in the right place.
Also I'm based in Vancouver and probably care too much about whether these models work in Canadian French. Don't ask me why. Regional bias is real.
Subscribe If You Want
I send these out biweekly-ish. Sometimes more when the industry loses its mind (November 2025, I'm looking at you and your 72-hour three-model blitz).
No ads. No sponsored content. No affiliate links to tools I haven't actually used.
Just me trying to make sense of an industry that releases frontier models faster than I can finish testing the previous ones.
If that sounds useful, stick around. If not, no hard feelings—there are 47 other AI newsletters and at least three of them are probably better funded.
Let's figure out what actually matters in this mess.
— Hunter Jameson Rigney
Last updated: January 2026, but probably already outdated