Copilot Cowork vs Claude Cowork - Head to Head


A colleague (👋) asked me to get them a Claude licence, because Copilot just wasn't working out. I'm always one to ask 'why', especially now that Microsoft has added support for Anthropic models in their various Copilot agents. I wanted to know what the issue was, and I suggested some minor tips on how to use Copilot: trying Copilot Cowork, working in a project, adding files for context... just as something to try instead of the default M365 Copilot chat.

Since Copilot Cowork relies on Sonnet 4.6 and Opus 4.7, you'd think the two should be pretty similar.

Model selection for the Copilot Cowork agent.

And not just Cowork. Slight tangent, but I love that the Copilot Research (Frontier) agent can use a combination of Claude for the initial research pass and then GPT to review and validate, or run them head to head and compare the outputs. Pretty cool, considering that M365 Copilot includes 'unlimited messages' and no session or weekly limits.

But that being said... I had a sneaking suspicion I'd end up assigning a Claude licence eventually anyway. You've got to believe the hype, right? We shall see...

Time to eat the dog food!

So I tried it out for myself: I asked Copilot Cowork and Claude Cowork the exact same question.

Firstly, here's some background to help you understand the prompt I used. Please don't rate my prompt; it wasn't highly thought out, and I didn't use AI to refine it before sending either. Calm down. It's a real prompt, and only after sending it did I consider running it through Copilot at the same time.

In fact you could say that my prompt was rough on purpose, just so I could also test a follow-up message to re-focus the output. You could say that, and I could say that.. so I will. 😅

Context

This all comes from trying to optimise how I use AI for coding, in two main areas:

  • Spec-driven development - ensuring plans are complete and robust before implementing.
  • Agent harnesses - splitting up work, orchestrating sub-agents, background tasks, continuation, todo-list enforcement, and so on.
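
To make the second bullet concrete, here's a toy sketch of what 'todo-list enforcement' means in a harness: it tracks open tasks and refuses to consider a run finished while any remain. This is entirely illustrative; the `Harness` class and its methods are my invention, not any real tool's API.

```python
from dataclasses import dataclass

@dataclass
class Task:
    description: str
    done: bool = False

class Harness:
    """Toy agent harness: tracks a todo list and enforces completion."""
    def __init__(self):
        self.todos = []

    def add(self, description: str) -> None:
        self.todos.append(Task(description))

    def complete(self, description: str) -> bool:
        for t in self.todos:
            if t.description == description and not t.done:
                t.done = True
                return True
        return False

    def can_finish(self) -> bool:
        # Enforcement: the run isn't over while open tasks remain.
        return all(t.done for t in self.todos)

h = Harness()
h.add("write spec")
h.add("implement feature")
h.complete("write spec")
print(h.can_finish())  # False: 'implement feature' is still open
```

A real harness layers model calls, sub-agent dispatch, and continuation on top of this skeleton, but the gating idea is the same.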

I've used 'oh-my-openagent' before (when it was called 'oh-my-opencode'). I had some success, but found it increased the complexity of my plans, making them more thorough than they needed to be in most cases, and added the overhead of having multiple agents I could use for the same thing. Although, in hindsight, having to think about when I wanted to switch between planning and implementation actually helped me; I just wasn't ready then.

Ultimately, I'm guessing this was down to my inexperience with such a tool; I was learning what didn't work faster than figuring out what did. Brilliant for beginning a new project (something I do often), terrible for finishing a project or for edits and updates. That shouldn't really all be pinned on oh-my-openagent.

After a while I found that ditching it and reverting to the default 'build/plan' agents felt more agile and snappier, without the overhead.

Then I found OpenSpec, which is when I actually learnt the difference between spec-driven development and agent harnesses/orchestration, two areas that oh-my-openagent spans.

With OpenSpec, I was able to design > plan, and then start implementing when ready, in a far more structured way than just using the Plan agent. But then I lost the ability to delegate, run background tasks, and have different agents with their own prompts and models.

I've learnt a lot, and now I'm eager to figure out how to get the best from the latest and greatest tools and plugins again. Or perhaps go running back to oh-my-openagent and beg for forgiveness and try again.

Right, that's you all caught up. Let's go!

The mission, should you choose to accept it.

Here's the prompt I fed into both Claude Cowork and Copilot Cowork.

See what alternatives you can find to openspec / oh-my-opencode (I think they're now called oh-my-openagent), and compare them in terms of simplicity, ability to execute, which models they work best with overall (and specific use cases). Do a full on comparative and contrastive research. Create a high level infographic with leader quadrant, distinguish the purpose vs ability to achieve separately. And then also a separate slide for my use-case.. indy development, single-person-vibe-coding-dev, production apps and tools, coherent framework, multiple ideas to switch between and keep track of both within a project and across multiple separate projects.

My First Reaction

Let's go slide by slide and give each a first-pass thumbs up or thumbs down. Under each side-by-side image I'll explain why.

Intro Slide

Claude wins on overall visual style/design 'quality and taste', at least on the intro slide. Copilot had overlapping, badly sized and positioned text. Spoiler: if I allowed myself to change my opinion after seeing the other slides, Claude may not have won this.

Landscape / Contenders

Copilot edges ahead by sticking closer to the intent of my prompt: it called out the spec-driven requirements, but didn't quite get the harness and orchestration right, instead bundling them together on the right. Claude, meanwhile, went pretty much full tilt into coding software that I didn't ask about.

Head to Head Table

Copilot wins again, and I'm starting to see that the overall design is actually more consistent from Copilot: it uses the same colour and heading bar across slides. It also included colour-coded numeric ratings for each row. Claude left me with some bland 'high / med' labels, wasted white space on the right, and fewer results.

Quadrant Comparisons

Despite Copilot's alignment issue, a quadrant that wasn't the leaders-vs-visionaries layout I expected, and a weird orange box across the top, I was ready to give Claude the point here. Claude had a better layout, and the dark theme background shows off the colours nicely. But wait: the legend on the right has no relationship to the quadrant's colours, some entries are missing, and there's a leader in the bottom-left quadrant?! At least Copilot didn't contradict itself. Looking again at the chart it produced, Copilot actually came up with quite a novel approach: it split each tool's purpose between spec and agent harness on the left/right axis, with ability towards the top, realising that some of the tools actually spanned both capabilities. I should really give it two thumbs up for this, but that's not allowed by my own rules I just made up.

Model Suitability

Well, since Copilot didn't include this aspect at all, I've got to give the point to Claude, despite its slide being badly aligned, with unused white space on the right and a table heading colour that differs from the table it generated on slide 3. But it has stars... and it exists, so... well done, I guess?

Recap Slide

Copilot went straight into recommendations, exactly what I wanted at this stage. They feel relevant and accurate, although it suspiciously lands on the very tools I originally asked it to find alternatives to. Claude, meanwhile, is such a tease: it gives me more comparisons, delaying its conclusion. It even includes some misleading information about its own costs, and positions it as a negative.

Fit Check

Copilot offers up two paths based on which client I prefer to use, plus an additional option to help with multiple repos/projects. Claude has just fed my requirements back to me (nice, but I know what they are, you don't have to tell me again), and then ranks coding apps. No mention of any plugins, spec tools, or agent harnesses. So Copilot has to win again.

Final Slide

Copilot has already finished the race and is standing on the podium: it's considering next steps, commands to run, packages to install. Claude isn't even looking at the finish line; it's suggesting three different coding apps that I should use together, one for heavy lifting, another for fast iteration, and another to experiment with, still missing any spec tool or agent harness. Why not Claude Code for all three?

Both could be better...

It's somewhat unfair to judge fully when neither of them quite grasped the intent behind my prompt, although Copilot was closer and stayed on track, delivering an answer with actionable next steps. Claude told me to use three different clients, and didn't mention any specific tools for spec design or agent harnesses in its recommendation.

Second Chance

Right, let's get this back on track (to continue the race metaphor), I'll try harder to explain myself to the AI this time...

Considering I was asking about the agent hierarchy/structure of plugins like the ones I mentioned, and not looking to compare the open or vendor-specific CLI coding clients themselves (I accept that they do have default build/plan agents included so that does kind of count). Let's be more specific about openspec (as a plugin used within OpenCode), and oh-my-openagent (also as a plugin within OpenCode).. what other plugins / sidecar tools aim to overcome the same problems that those two are built for.

They both whirred away for a few minutes. Interestingly, Claude hit its context limit and ran a compaction 😦. Copilot didn't, or at least it didn't show it.

They both created a new deck. But it's worth noting some rather exciting 'emergent behaviour': Copilot automatically created a report alongside the presentation, with more detailed research and cited sources. Claude didn't do that. Sure, it spaffed out some text in the chat, but so did Copilot. So that's yet another win for Copilot. I'll be honest, I wasn't expecting that, but I like it; it tickled me in a way I've not felt for a while. It gave me the ability to look deeper, even including a post-mortem highlighting what could happen in the next 6 months that would invalidate or change its recommendations, without me even hinting at it.

How cool is that? Thanks Copilot!

Calling You Out! Yes You!

I spent most of this article looking at the initial 'one-shot' output from my first prompt. I thought it was interesting how the initial look and style of Claude's output made it seem more polished and reliable. But that very quickly fell apart upon reading.

Let's pause for a second. There's a big issue there. It's nothing new generally speaking, but AI has caused this problem to explode.

This is my most troubling grievance with the proliferation of AI tools for all levels of people in an organisation.

Are you on the edge of your seat? Shall I just tell you? Is this lead up tantalising?

Alright...

The problem is... the effort required for one person to have an idea or question, open up Claude, and ask it to generate some sort of deck or report. They get excited because the output looks decent, don't fully read it, and share it with a team of people. It's taken them maybe 1-5 minutes of effort, depending on how fast they can type...

The poor recipients have to spend ages reading, digesting, checking facts, following the logic, and skipping over the hallucinations and bias. Then they bear the brunt of the fallout when they have to carefully articulate why it's a mile wide but only an inch deep. All show and no substance. Especially when it's a technical subject matter, and worse when it's a rapidly evolving one like AI.

If you feel called out, it's ok. Don't beat yourself up; it's not personal, this tech is new to everybody. But please... please... read what it gives you and challenge it: "Are you sure?", "What are your sources?", "Can you explain why?". That simple revision step will hugely improve the quality of the output. It's also why I like the Copilot Researcher agent's ability to use different models to their strengths, or play two off against each other, to arrive at a somewhat probed and tested outcome.
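
That draft-challenge-revise loop can be sketched in a few lines. `call_model` below is a stub standing in for whatever LLM API you actually use; nothing here reflects a real Copilot or Anthropic SDK, it's just the shape of the idea.

```python
def call_model(model: str, prompt: str) -> str:
    # Stub: in real use this would call an actual LLM API.
    return f"[{model}] response to: {prompt[:40]}"

def draft_and_review(question: str,
                     drafter: str = "model-a",
                     reviewer: str = "model-b") -> str:
    """One draft -> challenge -> revise pass using two different models."""
    draft = call_model(drafter, question)
    # A second model plays devil's advocate with the challenge questions.
    critique = call_model(
        reviewer,
        f"Are you sure? What are your sources? Critique this answer: {draft}",
    )
    # The drafter revises its own answer in light of the critique.
    return call_model(drafter, f"Revise given this critique: {critique}")

print(draft_and_review("Compare spec-driven dev tools"))
```

Even one pass of this gets you an answer that has been poked at least once before a human ever reads it.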

It still might be shit, but it won't smell quite so bad.

Show me!

Alright, here you go. Here are the files created after my follow-up message to bring it back on track.

Claude

Copilot

Verdict

Well, I'm shocked, shooketh to my core. I did not expect such a clear win for Copilot Cowork going head to head with Claude Cowork, with the exact same prompt and the exact same follow-up message.

If you can't be arsed to download and check out the actual assets produced, it's ok; you're busy reviewing the 10 other presentations and reports your manager already sent over by 9:15 on a Monday morning.

The winner is Copilot!

Copilot stuck with its initial recommendation, but the second deck was much more focussed on the right sort of tools, with more comparisons and the same deck structure ending in actionable next steps. And the glory that is the accompanying report we got for free!

Nice, the only choice left is whether to use OpenCode or Claude Code, and I know which one I'm picking... OpenCode every time.

Go home Claude, go home

Claude, on the other hand, let's be honest... was a pig with lipstick. Even after the second prompt, it still failed to distinguish between the spec-driven and agent-harness tools in any way. No actionable next steps, and a recommendation split between three 'tiered' scenarios that still left me wondering how to layer them.

No clear recommendation; use all the tools, but at different times. What do I do next? Ruflo, according to its own research, is only for Claude Code, whereas Swarm-Tools is only for OpenCode.

Oh, and regardless of all this, Anthropic as a company still win. Whether you follow the hype and buy the point solution direct from them because of the brand, or you're tied into the Microsoft ecosystem and end up paying for M365 Copilot, Anthropic are behind the scenes, powering both, and getting paid either way.

Final Word

Holly from Red Dwarf in Series 2, Episode 5 said it best...