AI agents built with Groq and Llama aren’t as fast or cheap in the real world as the benchmarks suggest. Here’s why:
In this article:
- Shiny object syndrome & test-driven prompt engineering
- The use case for “fast” LLMs
- The conundrum with small ~8B models
- Gemini Flash (large) vs. Llama 3.2 8B
- Speed vs Quality vs Product Velocity
- My Advice
Tokens-per-second stats and benchmark scores fool me too. I get excited watching how quickly LLMs are progressing in speed and accuracy. Every LLM and every LLM-as-a-service has extensive marketing designed to get you to choose it.
Shiny object syndrome is hitting LLMs as hard as JavaScript frameworks. The vendors are chasing investment and monthly subscribers too. For many (bad) developers, switching costs are also high.
Become immune to the bs
The switching cost between LLMs can be high if you do not establish a good AI Agent engineering environment. Without a test-driven approach, the idea of switching to a different SDK (maybe even a different language!), then reworking and re-testing all of your prompts with the new model, can seem exorbitant.
Test-Driven Prompt Engineering
I will write a follow-up article on test-driven prompt engineering, but in short: it is the antidote to marketing bs. You can rapidly evaluate different LLMs and their platforms for speed, cost, and quality, at each step of your agent workflow, with zero loyalty, right inside your codebase.
If you’re not already doing it, test-driven prompt engineering will 10x your AI Agent product velocity.
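Here is a minimal sketch of what that looks like with pytest. The `call_intent_classifier` helper and the scenario list are hypothetical placeholders, not a specific framework; wire them to whichever LLM client and prompt you are evaluating.

```python
# A minimal sketch of test-driven prompt engineering using pytest.
# call_intent_classifier and SCENARIOS are hypothetical placeholders.
import pytest

SCENARIOS = [
    ("The checkout page crashes every time I try to pay", "negative"),
    ("Love the new dark mode, great work!", "positive"),
    ("BUY CHEAP WATCHES http://spam.example", "spam"),
]

def call_intent_classifier(feedback: str) -> str:
    """Thin wrapper around the LLM under test (Groq, Gemini, OpenAI, ...).

    Keeping this boundary thin is what makes switching models cheap.
    """
    raise NotImplementedError("plug in your model call and prompt here")

@pytest.mark.parametrize("feedback, expected", SCENARIOS)
def test_intent_classification(feedback, expected):
    # Every prompt tweak or model swap gets re-scored against the same
    # scenarios, so quality becomes a number you track, not a feeling.
    assert call_intent_classifier(feedback) == expected
```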
Comparing Fast LLMs for your AI Agent workflows, and why benchmark data can be irrelevant
Imagine you’re building an AI agent to analyze customer feedback. This agent needs to:
- Understand the intent: Is the feedback positive, negative, neutral, or spam? Is it about a specific product or feature?
- Extract structured data: does any result need to be turned into structured data for saving and analysis?
- Navigate a workflow: Based on the intent and extracted data, the agent might route the feedback to customer support, product development, perform a data lookup, or simply archive it.
- Make decisions at each step: Should this feedback trigger an alert? Should it be prioritized?
- Pathfinding: What should be the next step in the workflow?
This workflow involves multiple decision and data extraction points, but each step is relatively simple. You don’t need the raw power (and slow speed) of GPT-4o, Gemini Pro, or Llama 90B for every decision.
Combining a complex multi-step workflow into a single large prompt for a mega LLM doesn’t always make sense, and it can be harder to test and to ensure quality.
This is where multi-step agents with fast LLMs shine.
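As a rough sketch, a single decision step in that feedback agent might look like the code below, assuming the Groq Python SDK and the `llama-3.1-8b-instant` model id (an assumption; check Groq’s current model list). The fast LLM only answers the narrow classification question; the routing stays in plain code.

```python
# One decision step of a multi-step agent, sketched with the Groq SDK.
# The model id is an assumption; substitute whatever fast model you use.
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

def classify_intent(feedback: str) -> str:
    """Ask a fast model one narrow question: what kind of feedback is this?"""
    response = client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[
            {"role": "system",
             "content": "Classify the feedback as one word: "
                        "positive, negative, neutral, or spam."},
            {"role": "user", "content": feedback},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

def route_feedback(feedback: str) -> str:
    """Routing is ordinary code; only the classification needs an LLM."""
    intent = classify_intent(feedback)
    if intent == "spam":
        return "archive"
    if intent == "negative":
        return "customer_support"
    return "product_development"
```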
The Speed Demons:
Think of LLMs like Llama 3.2 (8B or 3B on Groq), Gemini Flash, or GPT-4o mini as the sprinters of the AI world. They’re optimized for speed, making them perfect for rapid-fire decision-making within your agent’s workflow.
The Catch:
While these models boast impressive tokens-per-second speeds, you need to consider prompt engineering.
Prompt Engineering: The Hidden Cost
Simpler models like Llama 3.2 8B may require elaborate multi-shot prompts to achieve your desired accuracy. This means:
- Longer prompts: More input tokens, higher cost.
- Increased latency: Processing larger prompts takes more time.
- Overfitting risk: Multi-shot prompts can become too tailored to the examples, hindering generalization and corner-case handling.
- More testing time: The more elaborate your prompt becomes, the more time you spend developing and testing a single AI agent tool when you could already be done and working on the next one.
Smarter models like Gemini Flash (large), while more expensive per token and slower on paper, can often achieve the same outcome with shorter zero-shot prompts, and end up costing less.
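To make that cost concrete, here is an illustrative (made-up) comparison of the two prompt styles. Every example added to a multi-shot prompt is paid for as input tokens, and waited on, with every single call.

```python
# Illustrative only: made-up prompts showing how multi-shot examples
# inflate the input you pay for on every call.
ZERO_SHOT = (
    "Classify the feedback as positive, negative, neutral, or spam. "
    "Reply with one word."
)

EXAMPLES = [
    ("App crashes on login", "negative"),
    ("Great update, thanks!", "positive"),
    ("CLICK HERE FOR FREE COINS", "spam"),
    # ...a dozen more accumulate as edge cases appear during testing
]

MULTI_SHOT = ZERO_SHOT + "\n\nExamples:\n" + "\n".join(
    f"Feedback: {text}\nLabel: {label}" for text, label in EXAMPLES
)

# Word counts are a crude proxy for tokens, but the ratio tells the story.
print(len(ZERO_SHOT.split()), "words zero-shot vs",
      len(MULTI_SHOT.split()), "words multi-shot")
```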
Model Clarification:
This article discusses using the Gemini 1.5 Flash large model (estimated to be 32B size), not the Gemini Flash-8B small model.
API model code: models/gemini-1.5-flash
This is to create a comparison between perceived ultra-fast small models like Llama 3.2 8B on Groq, and larger, more expensive “fast” models like Gemini 1.5 Flash (large) and GPT-4o mini.
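For reference, a minimal zero-shot call to that model with the `google-generativeai` Python package looks roughly like this; the prompt text is illustrative.

```python
# Minimal zero-shot call to the large Gemini 1.5 Flash model via the
# google-generativeai package; the prompt is illustrative.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # or set the GOOGLE_API_KEY env var
model = genai.GenerativeModel("models/gemini-1.5-flash")

response = model.generate_content(
    "Classify this feedback as positive, negative, neutral, or spam. "
    "Reply with one word.\n\nFeedback: The new dashboard layout is confusing."
)
print(response.text)
```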
The Groq Conundrum:
Platforms like Groq, with their “cheap”, high-throughput processing, might seem like a panacea for AI agent workflows. However, the need for more complex prompts can negate these advantages, resulting in:
- Unexpected costs: More input tokens eat into cost savings.
- Slower performance: Longer prompts increase processing time.
- Development overhead: Engineering and testing complex prompts is time-consuming.
Real-World Testing:
In my testing, I compared Gemini Flash against Llama 3.2 (3B and 8B) on Groq.
Gemini Flash started at a disadvantage. Being a larger fast model, Gemini 1.5 Flash (large) consistently required 500-600 ms to respond. However, the prompt was simple, zero-shot, and scored 10/10 on my input scenarios. I spent about 30 minutes putting together the AI Agent tool, and it worked great.
However, chaining together five or more 500 ms workflow steps starts to create a very slow response for the user. I wanted something faster for my UX, so I looked to Groq for a speed solution.
Groq started with a huge speed advantage. In my initial testing, Groq + Llama 3.2 8B and 3B could respond in under 100 ms (sometimes under 20 ms!) with great results. It even seemed like all I’d need was the faster 3B model.
However, as I added more test input scenarios for this AI Agent tool, the prompt grew large and it became clear I’d need the 8B model. After hours of testing to get a perfect 10/10 score on my input scenarios, I had created an extensive multi-shot prompt that was potentially overfit.
With this large multi-shot prompt and perfect score, Groq + Llama 8B now consistently took 500-600 ms to respond, much slower than the initial 100 ms results, and now the same speed and cost as Gemini Flash.
In the end, I spent two hours of prompt engineering to get the tool working as well with Llama 3.2 8B, only to land on the same speed, cost, and quality outcome. In that time, I could have built multiple tools with Gemini Flash.
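A crude timing harness like the sketch below is enough to surface these per-step differences yourself. The classifier functions it times are placeholders for your own model wrappers, such as the hypothetical ones sketched earlier.

```python
# A crude latency harness: average wall-clock time for a single workflow step.
# The functions it times are placeholders for your own model wrappers.
import time

def average_latency(fn, *args, runs: int = 5) -> float:
    """Return the mean wall-clock seconds for fn(*args) over several runs."""
    start = time.perf_counter()
    for _ in range(runs):
        fn(*args)
    return (time.perf_counter() - start) / runs

# Example usage, assuming classify_intent (Groq) and a Gemini wrapper exist:
# print("Groq Llama:", average_latency(classify_intent, "The app keeps crashing"))
# print("Gemini Flash:", average_latency(classify_with_gemini, "The app keeps crashing"))
```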
The Big Picture: Optimizing For Speed Is A Waste
Speed is everything, right? In a world where app UX frame rates are scrutinized and page load times over 3 seconds are an abomination, I know this sounds insane.
AI Products Are Different
- You are building AI.
- Your users expect intelligence.
My experience building AI products and talking to customers of AI products has taught me that response quality is paramount. There is no amount of speed gain that can make up for poor quality responses. You should always lean into higher quality.
There are diminishing returns. A 3-second response may be 99% as good as a 15-second response in many cases. Use your judgement. The experience should be as good as it can be without a material sacrifice in intelligence.
3 things are true:
- Death
- Taxes
- LLMs will get better, cheaper, and faster every 3-6 months
Spending time optimizing your AI Agent tools and workflows for incremental speed gains today is a waste. 500ms today is fine because in 6 months, a similar quality model will have a faster, cheaper option that can respond in 250ms.
That said, you don’t need to use GPT-4o, Gemini Pro, or Llama 90B for every step. The “fast” models are highly capable of many tasks.
But the sub-100 ms mini and nano models (1B, 3B, and 8B parameter models) may not be worth the prompt engineering time, except for very simple AI agent tools and tasks (like binary decisions with predictable inputs).
Never sacrifice on response quality and product experience for a speed boost when building an AI product.
Bad responses will immediately cause users to lose trust in your product, and to dislike your product. They will probably badmouth it to others or leave a bad review.
Focus on quality and development velocity. Build your product faster, build a higher quality product, and let the speed boost come automatically every 6 months when new models and new faster llm services are available.
A bad review because of slow speed is much better than a bad review because of poor quality. People save so much time compared to doing a task manually that a few extra seconds of waiting are worth putting up with. But if the quality is poor, they will move on to the next product that does what they want.
What’s Your Nano-Model Use Case?
Have you successfully used a nano model for an AI Agent workflow tool that wasn’t easily achieved with a heuristic code flow or set of regular expressions?