AI Agent Framework Showdown: LangChain vs CrewAI vs AutoGen
I built the same project with three AI agent frameworks. Here's what actually happened.
It Started with Idle Curiosity
Saturday afternoon, 2 PM. I was lying on the couch scrolling Twitter when I saw a video claiming "AI agents can write code and run tests autonomously." It was a 23-minute video, but I had my laptop open within 3 minutes. (Honestly, it just looked cool.)
The problem was that there are too many frameworks. LangChain, CrewAI, AutoGen, and I'd heard of at least five more. So I decided to just try three: build the same task with each and compare.
The Task: A News Summary Agent
Pretty simple. Grab articles from an RSS feed, summarize the key points, classify by category. Nothing fancy. But even something this basic exposed the differences between frameworks clearly.
I used GPT-4o mini as the backend LLM and tested with the same 10 news articles across all three.
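Stripped of any framework, the first step of the task is small. Here's a stdlib-only sketch of the RSS-parsing piece (the sample feed and field names are placeholders, not the actual feed I used):

```python
# Pull titles and descriptions out of an RSS 2.0 document using only
# the standard library. SAMPLE_RSS stands in for a real feed URL.
import xml.etree.ElementTree as ET

SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Tech News</title>
    <item>
      <title>Framework X releases v2.0</title>
      <description>Major update with breaking API changes.</description>
    </item>
    <item>
      <title>LLM costs keep falling</title>
      <description>Token prices dropped again this quarter.</description>
    </item>
  </channel>
</rss>"""

def parse_feed(xml_text: str) -> list[dict]:
    """Extract (title, description) pairs from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter("item"):
        articles.append({
            "title": item.findtext("title", default=""),
            "description": item.findtext("description", default=""),
        })
    return articles

articles = parse_feed(SAMPLE_RSS)
print(articles[0]["title"])  # → Framework X releases v2.0
```

Each framework then wraps something like this function as a "tool" the agent can call; that wrapping step is where the three diverge.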
LangChain: Feature-Rich, Maybe Too Rich
LangChain's documentation is massive. Like, genuinely massive. I opened the "LangChain Agent" docs and found myself navigating across 7 different tabs. AgentExecutor, create_react_agent, Tool, AgentType... Just to get started, I had to install three packages: pip install langchain langchain-openai langchain-community.
It took me 3 hours and 47 minutes to get working code. Half of that was digging through GitHub Issues because the example code didn't work. Looks like the API changed significantly between versions 0.2 and 0.3. It ran eventually, but I wouldn't call the code clean. The abstraction layers feel thick.
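For reference, the 0.3-era setup looks roughly like this. This is a hedged sketch, not my exact code: it assumes the three packages above are installed, an OPENAI_API_KEY is set, and the tool body is a placeholder.

```python
# Hedged sketch of a LangChain 0.3-style ReAct agent.
# Not runnable without the packages and an API key.
from langchain.agents import AgentExecutor, create_react_agent
from langchain_core.prompts import PromptTemplate
from langchain_core.tools import Tool
from langchain_openai import ChatOpenAI

def fetch_rss(url: str) -> str:
    # placeholder for the real fetch-and-parse logic
    return f"(articles from {url})"

tools = [Tool(name="fetch_rss", func=fetch_rss,
              description="Fetch articles from an RSS feed URL.")]

# A ReAct prompt must expose {tools}, {tool_names}, {input},
# and {agent_scratchpad} for create_react_agent to fill in.
prompt = PromptTemplate.from_template(
    "Answer the question using the tools available.\n"
    "Tools: {tools}\nTool names: {tool_names}\n"
    "Question: {input}\n{agent_scratchpad}"
)

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
agent = create_react_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
# executor.invoke({"input": "Summarize today's stories from the feed"})
```

Three imports from three different packages just to wire up one agent: that's the thick abstraction layer in miniature.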
CrewAI: Intuitive but Limited
CrewAI's concept clicked immediately. Give agents roles, define tasks, bundle them into a crew. "You're the news collector," "you're the summarizer." Installation was just pip install crewai.
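The whole mental model fits in one screen of code. A hedged sketch of the shape (assumes `crewai` is installed and an OpenAI key is configured; the roles and prompts are illustrative, not the exact ones I ran):

```python
# Hedged sketch of CrewAI's role/task/crew structure.
# Not runnable without the package and an API key.
from crewai import Agent, Task, Crew

collector = Agent(
    role="News Collector",
    goal="Gather the latest articles from the feed",
    backstory="You monitor tech news feeds all day.",
)
summarizer = Agent(
    role="Summarizer",
    goal="Condense each article into three key points",
    backstory="You write terse, accurate briefings.",
)

collect = Task(
    description="Collect the 10 most recent articles from the RSS feed.",
    expected_output="A list of article titles and bodies.",
    agent=collector,
)
summarize = Task(
    description="Summarize each article and label it with a category.",
    expected_output="Three bullet points per article, plus a category.",
    agent=summarizer,
)

crew = Crew(agents=[collector, summarizer], tasks=[collect, summarize])
# result = crew.kickoff()
```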
Coding took 1 hour 52 minutes. Less than half of LangChain. But there was a catch. Custom tool documentation was thin. Wiring up an RSS parser as an agent tool cost me 40 minutes of fumbling. (42 minutes if I'm being exact, but rounding down feels better.)
The output was good. Watching agents "talk" to each other in the logs while processing tasks was genuinely satisfying. But fine-grained control is hard. Enforcing "this agent must process this data in this exact order" gets tricky fast.
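For the record, the custom-tool wiring that ate those 40 minutes ends up looking something like this. Hedged heavily: I'm assuming the `tool` decorator import path, which has moved between CrewAI versions (in some it's in the separate `crewai_tools` package).

```python
# Hedged sketch of exposing an RSS parser as a custom CrewAI tool.
# Import path is an assumption; older versions use `crewai_tools`.
from crewai.tools import tool

@tool("RSS Reader")
def read_rss(url: str) -> str:
    """Fetch an RSS feed and return its article titles and summaries."""
    # real implementation would fetch and parse the feed here
    return f"(parsed articles from {url})"

# then hand it to the agent, e.g.:
# collector = Agent(role="News Collector", tools=[read_rss], ...)
```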
AutoGen: Rawest, but Has Potential
AutoGen is from Microsoft. Getting it installed and running the first example took 27 minutes. Fastest of the three. But then things got interesting.
The multi-agent conversation structure is its core, but the setup is peculiar. You combine AssistantAgent, UserProxyAgent, and similar components. The code-executing agent runs code locally on your machine, which felt terrifying at first. (An AI running arbitrary code on my computer?)
Configuring it to run inside a Docker container took another 1 hour 15 minutes. Total implementation: 2 hours 38 minutes. Results weren't bad, but if you asked me whether I'd use this in production, I'd hesitate.
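The two-agent loop plus the Docker sandbox boils down to a few lines. A hedged sketch against the classic `autogen`/`pyautogen` API (assumes Docker is running and an OpenAI key is configured via environment or config list):

```python
# Hedged sketch of AutoGen's assistant / user-proxy pair with
# Docker-based code execution. Not runnable without the package,
# an API key, and a running Docker daemon.
from autogen import AssistantAgent, UserProxyAgent

assistant = AssistantAgent(
    name="summarizer",
    llm_config={"config_list": [{"model": "gpt-4o-mini"}]},
)
user_proxy = UserProxyAgent(
    name="runner",
    human_input_mode="NEVER",
    # run any generated code inside a container, not on the host
    code_execution_config={"work_dir": "workspace", "use_docker": True},
)

# user_proxy.initiate_chat(
#     assistant,
#     message="Fetch the RSS feed, summarize each article, classify it.",
# )
```

The `use_docker` flag is the part that turned "terrifying" into merely "uneasy": generated code executes in a container instead of directly on your machine.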
The Numbers Side by Side
| Metric | LangChain | CrewAI | AutoGen |
|---|---|---|---|
| Implementation time | 3h 47m | 1h 52m | 2h 38m |
| Lines of code | 187 | 94 | 132 |
| Summary quality (subjective) | 7/10 | 8/10 | 7/10 |
| Setup complexity | High | Low | Medium |
| Customization | Flexible | Limited | Flexible |
The surprise was CrewAI producing the best summaries. I think separating agent roles naturally improves prompt quality.
The Real Problem Was Somewhere Else
After using all three, what I really learned is that framework choice matters less than prompt design. Within the same framework, results varied wildly depending on how I wrote the prompts. There's a world of difference between "summarize this news" and "extract three core arguments, one sentence each, including relevant statistics."
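To make that concrete, here are the two prompt styles side by side (illustrative strings, not the exact prompts from the experiment):

```python
# Same article, two prompts. The difference is specificity: the
# second tells the model exactly what shape the answer must take.
vague_prompt = "Summarize this news: {article}"

sharp_prompt = (
    "Extract exactly three core arguments from the article below.\n"
    "One sentence each. Include any relevant statistics verbatim.\n"
    "Label each with a category: tech, business, or policy.\n\n"
    "Article: {article}"
)

article = "Chipmaker Q3 revenue rose 18% on AI demand..."
print(sharp_prompt.format(article=article))
```

Swap one for the other inside any of the three frameworks and the quality gap dwarfs the gap between frameworks.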
And cost. Running 10 tests with GPT-4o mini: LangChain cost about $0.35, CrewAI about $0.22, AutoGen about $0.38. The more agents chat with each other, the more tokens you burn. That's why AutoGen was priciest.
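The math is easy to check on the back of an envelope, assuming GPT-4o mini's published rates at the time (about $0.15 per million input tokens, $0.60 per million output tokens). The token counts below are made-up round numbers for illustration:

```python
# Why chatty multi-agent runs cost more: every agent-to-agent turn
# re-sends accumulated context, so input tokens grow fast even when
# the outputs stay short. Rates assume GPT-4o mini's published pricing.
INPUT_RATE = 0.15 / 1_000_000   # dollars per input token
OUTPUT_RATE = 0.60 / 1_000_000  # dollars per output token

def run_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

single_pass = run_cost(8_000, 2_000)        # one summarization call
chatty_multi_agent = run_cost(40_000, 6_000)  # context re-sent per turn
print(f"${single_pass:.4f} vs ${chatty_multi_agent:.4f}")
```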
So What Am I Going With?
It's a side project, so probably CrewAI. Fast to build, intuitive. But for a work project I'd pick LangChain: bigger community, more references. AutoGen -- honestly, it's too early. The potential is there, but the stability isn't quite there yet.
The real takeaway from this whole weekend experiment? I'd meant to do laundry on Saturday morning and didn't get to it until Sunday night.