The Stock Market Isn't Crazy — But AI Concentration Is

I don’t think the stock market is actually crazy right now.

But the concentration of AI investment into a small number of companies is — TSLA, NVDA, PLTR in particular.

When those names drop, that capital will flow into a larger number of opportunities with higher return potential. Overall, SPY will remain stable.

AI Is Democratizing

AI is democratizing. Switching costs are low. Barriers to entry are surprisingly low. Building an AI monopoly will be much harder than building a search monopoly. There will be no single monolithic AI model dominating the planet.

There will always be opportunities to create specialized AI that is better, faster, and cheaper than monolithic AI in every industry vertical, job, and workflow. That is the arbitrage opportunity: do something 10x better, 10x faster, and 10x cheaper than monolithic AI.

I see this personally in the legal industry. State of the art monolithic AI providers (OpenAI, Google, Anthropic) are unusable for rigorous legal analysis, legal drafting, legal research, etc. Their accuracy only gets you 70% of the way there.

The AI value layer comes from SaaS companies that optimize AI models, AI workflows, AI agents, and AI user experiences around targeted use cases — that take their usability from 70% to 95%.

It does not make sense for a frontier provider to optimize all of these things for an incremental TAM of $1B/yr in revenue. It doesn’t support their valuation or business models. Rather, they fight and compete to be the AI provider selected by the niche industry value layer.

Even if frontier model providers make incremental gains in Legal AI (or any industry’s) competency — there will still be opportunities to build that value layer.

Upcoming Outflows

  • Over-invested large players who are in a race to the bottom.
  • Frontier AI service providers with high competition and limited ability to differentiate (incremental profit over the cost of infra and electricity); it's a resource-efficiency game.
  • Companies with cultures that refuse to adopt AI or leverage it to maintain competitiveness or productivity.

Opportunities: Cultural Market Turnover

The opportunities we have today are similar to what we have always had. Companies that refuse to adopt modern innovative practices (AI) will lose business to those that do. The reason is mainly cultural. A lot of companies are very slow to change, and it's hard to steer a Titanic. It requires firing friends, breaking up political factions, slaughtering sacred cows, and laying off the people slowing down the change.

We all know people who refuse to use AI as a matter of principle:

  • The developer who simply refuses to use it from ignorance or arrogance.
  • A tech lead who discourages AI use on the team because of sloppy PRs — creating a culture of fear around it.
  • A nontechnical C-Suite who saw quality metrics drop when GPT 3.5 was adopted and made it a policy to not use it.
  • The VP who sees AI cannibalizing their business and ruining their KPIs and selfishly crushes innovation.

Companies that fall behind will quickly find that their dwindling revenues no longer support their business structure, and they won’t be able to change fast enough to continue as a going concern.

This will happen at every layer of the market, from IBM-sized professional services companies, to manufacturing, to your local 5-person web shop.

The S&P 500 Will Keep Rotating

The overall S&P 500 should continue to see ~10% annualized gains. But non-AI companies will rotate out for new AI-enabled players.

Even in 2025 — the year of massive increases in AI model competence and companies like NVDA reaching a $5T valuation — the S&P 500 has remained stable with a modest ~12% return. The index quietly rotates out underperforming legacy companies in favor of those leveraging technology and AI to gain market share in the new economy.

The dying legacy companies incapable of change shrink and rotate out of the mix.

The S&P 500 has rotated 11 companies already this year. A selloff in NVDA, TSLA, and PLTR will start as investors realize their upside is limited and that greater-return opportunities are being created outside the Magnificent 7. That selloff is likely to flow directly into other S&P 500 companies showing massive gains in profitability by out-competing legacy players, primed for their own 5x and 10x return opportunities.

Doomers: The Market Isn’t Broken

Concentration corrects.

  • The overall market is in good shape.
  • The real opportunity isn’t in betting on which frontier model wins. It’s in the long tail — the thousands of industry-specific, workflow-specific, job-specific AI applications that turn 70% use-case-specific accuracy into 95%. That value layer is wide open.
  • NVDA’s dominance is temporary, with competitive chips from Google, Amazon, and Intel now hitting production, stealing global fab manufacturing capacity from NVDA, along with its room for profits.
  • The size of the AI investment pie will continue to increase, but concentration in the market will disperse into a larger number of companies.
  • It’s unlikely we’ll see a Google-Search style monopoly because provider switching costs are so low, and the gap between frontier labs is so small.
  • Frontier AI will continue to get better, faster, and cheaper.
  • The value layer is primed with opportunity for companies willing to move fast.

This isn’t an impending market crash. It’s a market maturing.


The Prompt Engineering Tax: Why Ultra-Fast Small LLMs Can Be Slower and Costlier in Practice

In this article:

  • LLM Model speed is highly circumstantial
  • LLM model selection and prompt engineering
  • The use case for “fast” models
  • The conundrum with small ~8b models
  • Gemini Flash vs. Llama 3.2 8b
  • Speed vs Quality vs Product Velocity
  • My Advice

The Lure Of Speed

LLM platform providers use “tokens per second” and speed scores in their primary marketing messaging. This is true for closed providers like Google (Gemini), OpenAI, and Anthropic, as well as open-model hosts like Together AI, Groq, Cerebras, and Fireworks AI.

AI engineers are constantly aware of a speed-vs-quality battle. Generally, larger and smarter models are slower; smaller and dumber models are faster. However, real-world prompt engineering blurs these lines. Larger models may require simpler prompts and less context. Smaller models may require multi-shot input/output examples.

This article covers how to objectively measure speed against quality, and why that flashy new fast model may not actually be faster in the real world for your use case.

Required: Eval-Driven Development

Eval-driven development (test-driven prompt engineering) is the practice of first establishing LLM-based evaluations (tests) that an LLM’s outputs must satisfy, and then developing a prompt that can pass those evaluations.

This is similar to Test-Driven Development (TDD) in software engineering. However, while TDD is circumstantially useful in software development, eval-driven development is absolutely required for all AI engineering.

This article is not about eval-driven development. It assumes you will utilize the practice to objectively measure the speed of your AI while maintaining a standard of quality.
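To make the practice concrete, here is a minimal eval-driven development sketch. The `run_llm` function and the eval cases are hypothetical stand-ins (a real implementation would call your LLM provider); the point is that the evals exist before the prompt is tuned, and the prompt must pass all of them.

```python
# Minimal eval-driven development sketch. `run_llm` is a placeholder for
# your real model call; it is stubbed here so the evals can actually run.
def run_llm(prompt: str, feedback: str) -> str:
    # Stub: a real implementation would send `prompt` + `feedback` to an LLM.
    return "support" if "refund" in feedback.lower() else "product"

# Each evaluator is an input plus the exact output the LLM must produce.
EVALS = [
    ("I want a refund for my broken order", "support"),
    ("It would be great if you added dark mode", "product"),
]

def run_evals(prompt: str) -> float:
    """Return the fraction of evaluators the prompt passes."""
    passed = sum(run_llm(prompt, fb) == expected for fb, expected in EVALS)
    return passed / len(EVALS)

score = run_evals("Route this customer feedback to a team.")
print(score)  # 1.0 with the stub above
```

The prompt is only “done” when `run_evals` returns 1.0; every later speed or cost optimization must keep that score intact.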

Example AI Engineering Scenario

You’re building an AI agent to analyze customer feedback. This agent needs to:

  1. Understand the sentiment: Is the feedback positive, negative, neutral, or spam?
  2. Summarize: Create a one sentence summary for your team.
  3. Create structured data: Convert the feedback into JSON for your database.
  4. Route the feedback: Send to customer support, product development, or sales.

This workflow involves multiple decision and data extraction points. Each step is relatively simple, but together they represent a larger challenge.
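The four steps above can be sketched as a simple pipeline. Every function here is a stub standing in for an LLM call (the heuristics are placeholders, not real classifiers), but the shape shows where each decision point sits and what flows between them.

```python
import json

# Sketch of the four-step feedback pipeline. Each step stubs the LLM call
# with a trivial heuristic so the flow is runnable end to end.
def classify_sentiment(text: str) -> str:
    # Step 1: sentiment (stubbed keyword check).
    return "negative" if "broken" in text else "positive"

def summarize(text: str) -> str:
    # Step 2: one-sentence summary (first sentence, truncated).
    return text.split(".")[0][:80]

def to_record(text: str) -> dict:
    # Step 3: structured JSON-ready data for the database.
    return {"sentiment": classify_sentiment(text), "summary": summarize(text)}

def route(record: dict) -> str:
    # Step 4: routing decision based on the extracted structure.
    return "support" if record["sentiment"] == "negative" else "product"

feedback = "My order arrived broken. Please help."
record = to_record(feedback)
record["route"] = route(record)
print(json.dumps(record))
```

Whether each stub becomes its own prompt, or several collapse into one, is exactly the engineering decision discussed next.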

Engineering Decisions

  • Can you get away with a fast 8b model?
  • Do you need something more sophisticated like GPT-4o, Gemini Pro, or Llama 90b?
  • Can this be accomplished with a single prompt?
  • Does this need to be broken down into multiple prompts?
  • Can they be run concurrently?

Routing the feedback is a step that is highly unique to the business, the teams involved, and the context of the organization. Let’s assume the AI engineer has found that this should be an isolated step, determined with its own prompt.

In order to determine the routing correctly, the AI must:

  • Be aware of each team’s function (provided context).
  • Analyze the support message for relevance to the possible destination parties.
  • Pass 100% of our evaluators (sample inputs and expected outputs).
    • An ASSERT statement that checks the route destination selected by the LLM for a given input.
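The routing evaluators above can be written directly as asserts. The team descriptions and `route_message` stub below are hypothetical (a real version would pass the context into an LLM prompt), but the structure is the one described: team context in, one destination out, checked by an assert per sample input.

```python
# Hypothetical routing step: team descriptions become prompt context, and
# each evaluator asserts the destination selected for a sample input.
TEAMS = {
    "support": "Handles refunds, bugs, and account issues.",
    "product": "Collects feature requests and usability feedback.",
    "sales": "Handles pricing, upgrades, and enterprise inquiries.",
}

def build_prompt(message: str) -> str:
    """Fold each team's function into the prompt as provided context."""
    context = "\n".join(f"- {name}: {desc}" for name, desc in TEAMS.items())
    return f"Teams:\n{context}\n\nRoute this message: {message}"

def route_message(message: str) -> str:
    # Stub standing in for the LLM call; keyword match for illustration only.
    msg = message.lower()
    if "price" in msg or "upgrade" in msg:
        return "sales"
    if "wish" in msg or "feature" in msg:
        return "product"
    return "support"

# Evaluators: ASSERT the route destination for each sample input.
assert route_message("What is the price of the enterprise plan?") == "sales"
assert route_message("I wish the app had offline mode") == "product"
assert route_message("My login is broken") == "support"
```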

Traditional Software Engineering vs. AI Engineering

Traditional software development optimizes for faster and cheaper. Highly skilled software engineers flex by making code run faster with fewer resources and lower costs. However, traditional software development is deterministic.

AI engineering is probabilistic. Quality cannot necessarily be achieved with increases in speed and decreases in cost. Response quality must be established up-front with evaluators. Speed and cost improvements can be attempted, but it’s not always straightforward.

Prompt Engineering: Llama 3.2 8b vs Gemini Flash

On paper, the comparison looks like this:

| Model | Llama 3.2 8b   | Gemini Flash    |
|-------|----------------|-----------------|
| Speed | Blazing fast   | Fast            |
| Cost  | Practically free | Affordable    |
| Size  | 8b (tiny)      | 32b (estimated) |

In the real-world, prompt engineering reveals blurred lines.

Prompt Engineering: Llama 3.2 8b

The AI engineer spends 2 hours prompt engineering with Llama 3.2 8b attempting to reach 100% test passing on the quality evaluators. There are several example input test cases that are failing.

To achieve a 100% passing rate, the engineer adds multi-shot examples to the prompt. Example:

  • [prompt text here]
  • [examples]
    • example input 1
    • example response 1
    • example input 2
    • example response 2
    • example input 3
    • example response 3

After incorporating a dozen examples into the Llama 3.2 8b prompt (and at the risk of over-fitting), it is finally passing 100% of test cases.
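The cost of that multi-shot prompt is easy to see if you assemble it in code. The examples below are hypothetical placeholders; the point is that every example pair is appended to the prompt on every single call, inflating input tokens (and latency) permanently.

```python
# Sketch of assembling a multi-shot prompt for a small model. The example
# pairs are hypothetical; each one adds input tokens to every request.
EXAMPLES = [
    ("App crashes when I upload a photo", "support"),
    ("Please add CSV export", "product"),
    ("Do you offer volume discounts?", "sales"),
]

def build_multishot_prompt(message: str) -> str:
    parts = ["Route the customer message to a team.", "", "Examples:"]
    for example_input, example_route in EXAMPLES:
        parts.append(f"Input: {example_input}")
        parts.append(f"Route: {example_route}")
    parts += ["", f"Input: {message}", "Route:"]
    return "\n".join(parts)

prompt = build_multishot_prompt("My invoice is wrong")
print(len(prompt.split()))  # grows with every example added
```

With a dozen examples instead of three, the fixed per-call overhead dwarfs the actual user message.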

The cost of this?

  • Longer prompts: More input tokens, higher cost.
  • Increased latency: Processing larger prompts takes more time.
  • Over-fitting risk: Multi-shot prompts can become too tailored to the examples, hindering generalization and corner-case handling.
  • More testing time: While testing output quality, the more elaborate your prompt becomes, the more time you are spending on a single AI agent tool’s development when you could be done and working on the next one.

Prompt Engineering: Gemini Flash

The AI Engineer spends 30 minutes prompt engineering Gemini Flash to reach 100% test passing on the quality evaluators. To achieve a 100% passing rate, no examples are required.

Prompt Example:

  • [prompt text here]
  • (no examples)

Smarter models like Gemini Flash, while more expensive per token and slower on paper, can often achieve the same speed, cost, and quality outcome with shorter, zero-shot prompts.

Model Clarification:

This article discusses using the Gemini 1.5 Flash (estimated to be 32B size), not the Gemini Flash-8B small model.

API model code: models/gemini-1.5-flash

This is to create a comparison between perceived ultra-fast small models like Llama 3.2 8b on Groq, and larger, more expensive “fast” models like Gemini 1.5 Flash and GPT-4o mini.

The Tiny Fast Model Conundrum:

Tiny models with high benchmark scores might seem like a panacea for AI speed and cost issues. Your leadership team may see this marketing data and direct the engineering team to swap to these new models. Assuming there is an eval-driven development practice in place, the need for more complex prompts will often negate the claimed advantages, resulting in:

  • Unexpected costs: More input tokens eat into cost savings.
  • Slower performance: Longer prompts increase processing time.
  • Development overhead: Engineering and testing complex prompts is time-consuming.

Real-World Testing:

In my testing I compared Gemini Flash against Llama 3.2 (3B and 8B) on Groq…

Gemini Flash started at a disadvantage. Being a larger fast model, Gemini Flash 1.5 consistently required 500 - 600ms to respond. However, the prompt was simple, zero shot, and scored 10/10 on my input scenarios. I spent about 30 minutes putting together the AI Agent tool and it was working great.

My concern was that chaining ten or more 500ms workflow steps creates a slow response for the user. I wanted something faster for my UX. So I looked to Groq and smaller Llama models for a speed solution.

Groq started with a huge speed advantage. In my initial testing, Groq + Llama 3.2 8b could respond in under 100ms (sometimes under 20ms!) with great results.

However, after hours of testing to get a perfect score on my evaluators, I had created an extensive multi-shot prompt that was potentially over-fit.

With this large multi-shot prompt and a perfect score, Groq + Llama 8b now consistently took 500 - 600ms to respond, much slower than the initial 100ms results, and now the same speed and cost as Gemini Flash.

I spent 2 hours prompt engineering to get the tool working as well with Llama 3.2 8b, for the same speed, cost, and quality outcome. In the same time period, I could have built out multiple tools with Gemini Flash.
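The lesson is to time the full prompt, not the marketing benchmark. A sketch of that measurement, with `call_model` stubbed to simulate latency growing with prompt length (as observed above when the multi-shot prompt slowed the “fast” model down):

```python
import time

# Measure end-to-end latency of a (stubbed) model call so prompt-size
# effects show up in the numbers you actually compare.
def call_model(prompt: str) -> str:
    # Stub: simulate response time growing with prompt length, standing in
    # for a real provider call whose latency scales with input tokens.
    time.sleep(0.0001 * len(prompt))
    return "support"

def timed_call(prompt: str) -> tuple[str, float]:
    """Return the model output and the wall-clock latency in milliseconds."""
    start = time.perf_counter()
    result = call_model(prompt)
    return result, (time.perf_counter() - start) * 1000

short = "Route: my order is broken"
long_prompt = short + " example " * 500  # multi-shot padding

_, short_ms = timed_call(short)
_, long_ms = timed_call(long_prompt)
print(f"short={short_ms:.1f}ms long={long_ms:.1f}ms")
```

Compare models on the prompts they each need to pass your evaluators, not on their zero-context token speed.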

AI Engineering is different

Speed is everything, right? In a world where app UX frames per second are scrutinized, and web loading speeds over 3 seconds are an abomination, it’s counter-intuitive to lean into slower LLM models.

  • You are building AI.
  • Your users expect intelligence.
  • Poor quality causes churn.
  • Poor speed is accepted.

My experience building AI products and talking to customers of AI products has taught me that response quality is paramount. There is no amount of speed gain that can make up for poor quality responses. You should always lean into higher quality.

There are diminishing returns. A 3-second response may be 99% as good as a 15-second response. Use your judgment; it depends on the task at hand. The experience should be as good as it can be without a material sacrifice in intelligence.

LLMs will get faster

LLMs will get better, cheaper, and faster every 3-6 months.

Spending time optimizing your AI Agent tools and workflows for incremental speed gains today is a waste. In 6 months, a similar quality model will have a faster, cheaper option. Slowing your product development by 50% or more for a modest speed gain isn’t worth it. Iterate on your product faster with the highest quality possible. The speed gains will come when new models arrive.

8b models are highly capable of many simple tasks. My goal is not to discredit them entirely. In my opinion, if you’re struggling to prompt engineer with them at all, just move to a larger model or re-evaluate if the task should be broken down into separate steps.

Focus on quality

Bad responses will cause users to lose trust in your product. It will make them dislike your product. They will probably badmouth it to others or leave a bad review.

Focus on quality and your team’s development velocity. Build your product faster, build a higher quality product, and let the speed boost come automatically every 3-6 months when new models and new faster llm services are available.

A bad review because of slow speed is much better than a bad review because of poor quality. People save so much time compared to doing a task manually that the extra seconds of waiting are worth putting up with. But if the quality is poor, they will move on to the next product that does what they want.

What’s Your Nano-Model Use Case?

Have you successfully used a nano model for an AI Agent workflow tool that wasn’t easily achieved with a heuristic code flow or set of regular expressions?


First post

This is my first post. This site is statically generated with Astro.