    Skills Already Made Agents Better. SLM-Skill Pairing Is the Next Leap.

    Tijn van Daelen•May 4, 2026•14 Min Read

    TL;DR: Skills solved a reliability problem. SLM-skill architectures can solve a scale problem. The practical endgame is not “small models everywhere” or “big models everywhere,” but model-aware skills with clear routing and fallback.

    A lot of teams now agree on one thing: raw model intelligence is not enough to run real workflows.

    The quality jump in agents over the last year did not come only from bigger models. It came from structure around the model: skills, scoped instructions, tool contracts, and environment-aware execution.

    That shift is visible in production docs now, not just in conference talks. Anthropic describes Agent Skills as modular, reusable capabilities that package instructions, resources, and code, and load progressively when needed [1]. OpenAI frames Codex similarly: each task runs in its own sandbox, can be guided by AGENTS.md, and can execute work in parallel [2][3]. MCP has become another part of this architecture by standardizing how agents connect to tools and data [4].

    So yes, skills are already making agents better right now.

    The next question is what happens when we connect that skills layer to small language models by design.

    Not as a gimmick. Not as “let’s downsize everything.”

    As an architecture choice: one skill, one best-fit model profile, one measurable quality bar.


    Skills fixed reliability. They have not yet fixed economics.

    Skills made agent behavior more legible. They reduced prompt repetition and made workflows reusable.

    But once a team scales from ten daily calls to fifty thousand, two constraints show up fast: latency and cost.

    Even if a large model gives excellent quality, many skills do not need full frontier capability every time. A skill that classifies intake requests, drafts first-pass summaries, or normalizes ticket metadata is often narrow, repetitive, and context-bounded. Paying frontier-model prices and latency for that path is usually waste.

    This is why SLMs are now strategically interesting.

    Small models are no longer toy models by default. Phi-3 explicitly targeted phone-class deployment while still reporting strong benchmark performance for its size, trained on heavily filtered plus synthetic data [5]. Gemma 2 and Gemma 3 both push in the same direction: practical-size open models that use distillation and compete above their weight class [6][7]. Meta’s Llama 3.2 model card positions the 1B and 3B variants for constrained environments and documents distillation from larger 8B/70B teachers [8].

    That is the opening for SLM-skill pairing.

    Not because SLMs beat frontier models in absolute terms. They do not, at least not consistently.

    Because many real skills do not require absolute best capability on every call.


    Why SLMs are attractive inside a skill architecture

    The obvious wins are speed and cost, but the deeper win is operational separation of concerns.

    When every skill is mapped to a model profile, teams stop thinking in broad slogans like “we use model X.” They start thinking in service levels: this skill needs sub-second response and medium accuracy, that one needs high-recall reasoning and explicit citation checks.

    This has at least five practical effects.

    First, you can reduce the average cost-to-answer by routing easy and medium tasks to smaller models, while keeping harder work on stronger models. FrugalGPT and RouteLLM showed this logic early: cascade and routing strategies can preserve quality while materially reducing cost [9][10] (a minimal sketch of this cascade appears at the end of this section).

    Second, latency drops for the part of the workflow users feel most often: first response, triage, and iterative edits.

    Third, deployment options widen. Teams can run selected skills closer to the data source, sometimes locally, sometimes in a private edge environment, which is useful for privacy-sensitive or intermittent-connectivity workflows [5][6][8].

    Fourth, resilience improves. If one premium provider throttles or fails, a model-aware skill graph can degrade gracefully instead of failing globally.

    Fifth, specialization can improve output quality for bounded tasks. This is a subtle point. “Better output” from SLMs is usually false at the general level, but often true in a constrained skill when the model is tuned, prompted, and evaluated specifically for that skill surface.

    We have seen this pattern before. DistilBERT showed years ago that distillation could preserve most task performance while reducing size and increasing speed [11]. The current SLM wave is a continuation of that principle, now applied to agentic systems instead of one-off NLP tasks.
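
    As forward-referenced above, here is a minimal sketch of the cascade logic behind that first effect. Everything in it is an assumption made for illustration: the tier names, costs, faked confidence scores, and the 0.8 threshold would all need to be wired to real providers and calibrated per skill.

    ```python
    from dataclasses import dataclass

    @dataclass
    class ModelTier:
        name: str
        cost_per_call: float  # hypothetical cost units, for accounting only

    # Cheapest tier first; the last tier is the frontier fallback.
    TIERS = [ModelTier("slm-3b", 1.0), ModelTier("llm-frontier", 40.0)]

    def call_model(tier: ModelTier, prompt: str) -> tuple[str, float]:
        # Placeholder for a real API call: returns (answer, confidence in [0, 1]).
        # Faked here so the control flow can be exercised end to end.
        return f"[{tier.name}] answer to: {prompt}", 0.6 if tier is TIERS[0] else 1.0

    def cascade(prompt: str, threshold: float = 0.8) -> str:
        # Try cheap tiers first and escalate only when confidence is low:
        # the FrugalGPT-style pattern cited in the text.
        for tier in TIERS[:-1]:
            answer, confidence = call_model(tier, prompt)
            if confidence >= threshold:
                return answer
        answer, _ = call_model(TIERS[-1], prompt)  # frontier answer is accepted
        return answer

    print(cascade("normalize this ticket's metadata"))
    ```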


    The architecture: from “one big model” to “model-aware skills”

    The simplest mental model is this: a skill should declare not just what to do, but what model band to try first.

    In practice, each skill can carry a runtime profile that includes target latency, acceptable error rate, context budget, and escalation policy. The orchestrator then routes requests through three lanes: a default SLM lane for routine calls, a verification lane where small-model output is checked for high-risk transforms, and a fallback lane to a larger model when confidence is low or complexity spikes.
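
    As a concrete sketch of that profile-plus-lanes idea: the field names, thresholds, and lane set below are illustrative assumptions, not a standard schema.

    ```python
    from dataclasses import dataclass
    from enum import Enum, auto

    class Lane(Enum):
        DEFAULT_SLM = auto()   # routine calls go straight to the small model
        VERIFY = auto()        # SLM answers; a larger model checks the output
        FALLBACK_LLM = auto()  # low confidence or a complexity spike escalates

    @dataclass
    class SkillProfile:
        name: str
        target_latency_ms: int      # the service level the skill promises
        max_error_rate: float       # acceptable per-skill error budget
        context_budget_tokens: int  # how much context the skill may consume
        high_risk: bool             # high-risk transforms always get verified

    def route(profile: SkillProfile, confidence: float, complexity: float) -> Lane:
        if confidence < 0.7 or complexity > 0.8:  # placeholder thresholds
            return Lane.FALLBACK_LLM
        if profile.high_risk:
            return Lane.VERIFY
        return Lane.DEFAULT_SLM

    triage = SkillProfile("ticket-triage", target_latency_ms=800,
                          max_error_rate=0.02, context_budget_tokens=4096,
                          high_risk=False)
    print(route(triage, confidence=0.9, complexity=0.3))  # Lane.DEFAULT_SLM
    ```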

    That fallback-first architecture is exactly what makes gradual transition viable. You do not have to “bet the company” on SLMs in one migration. You can move skill by skill, keep LLMs for hard segments, and expand SLM coverage only where metrics prove they hold quality.

    This is also where orchestration itself becomes model-flexible. In some setups, the orchestrator can remain a stronger LLM. In others, an SLM can orchestrate low-risk flows and escalate to larger models only for planning, ambiguous reasoning, or exception handling.

    The right answer is workload-specific, not ideological.


    The data question: yes, you can bootstrap with synthetic data, but no, it is not a one-time trivial cost

    The instinct here is right: data is a major blocker for SLM-skill optimization.

    And yes, using stronger models to generate training or tuning data is now a standard move. Self-Instruct demonstrated the pattern clearly: model-generated instruction data can significantly improve instruction following with minimal human labeling [12]. The Alpaca pipeline then made this concrete at low cost, including a widely cited 52K synthetic instruction set generated from text-davinci-003 outputs [13]. TinyStories showed another variant: synthetic corpora tailored for smaller models can unlock surprising capability at tiny scales [14]. Recent compact-model reports continue this direction; for example, Phi-4 explicitly describes synthetic data as a central part of its training strategy [15].

    So the synthetic bootstrap idea is real.

    But treating it as a one-time cost is usually where teams get burned. Skills evolve, APIs change, and user requests shift, so a frozen synthetic dataset drifts away from production reality. At the same time, the Nature “model collapse” results are a useful warning that indiscriminate recursive training on generated data can degrade distribution quality over generations [16].

    The practical takeaway is not “never use synthetic data.” It is “use synthetic data with verification and refresh loops.”

    Good teams treat synthetic generation as an ongoing pipeline with quality gates: human seed sets, teacher-generated expansion, deduplication and contamination filtering, task-level evaluation, live-traffic replay, and periodic refresh. In other words, synthetic data is not just a cheap shortcut. It is an operational discipline.
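
    A minimal sketch of that loop, assuming hypothetical helpers (teacher_generate, is_contaminated, passes_eval) supplied by the caller; the gate order mirrors the list above but is not prescriptive.

    ```python
    def refresh_skill_dataset(seed_examples, teacher_generate,
                              is_contaminated, passes_eval, live_traffic):
        # 1. Expand a human-curated seed set with teacher-generated variants.
        candidates = list(seed_examples)
        for seed in seed_examples:
            candidates.extend(teacher_generate(seed, n=5))

        # 2. Deduplicate and drop anything that contaminates eval sets.
        seen, filtered = set(), []
        for example in candidates:
            key = example["input"].strip().lower()
            if key in seen or is_contaminated(example):
                continue
            seen.add(key)
            filtered.append(example)

        # 3. Keep only examples that pass task-level quality gates.
        gated = [ex for ex in filtered if passes_eval(ex)]

        # 4. Mix in replayed live traffic so the set tracks production drift;
        #    rerunning this whole function periodically is the "refresh" step.
        return gated + [ex for ex in live_traffic if passes_eval(ex)]
    ```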


    Why this is still not the standard in most teams

    If the upside is so clear, why is SLM-skill architecture not already default everywhere?

    Because the hard part is less about having a small model and more about operating a model portfolio.

    The first blocker is evaluation maturity. Most teams still evaluate prompts or single-model outputs, not end-to-end skills with routing and fallback. If you cannot measure per-skill quality under real distributions, you cannot safely downshift model size.

    The second blocker is safety profile complexity. Smaller models can have different alignment and safety/helpfulness tradeoffs, especially in constrained deployments [8]. That means you need skill-level guardrails, not generic platform-level assumptions.

    The third blocker is capability cliffs. Long-context reasoning, ambiguous planning, and multi-hop synthesis still break earlier in smaller models. Even as context windows grow in compact models, quality under realistic long-context tasks remains an active frontier [7].

    The fourth blocker is infrastructure overhead. You need routing logic, observability, policy checks, cache strategy, and failover behavior. That is a real engineering project, not a model swap.

    The fifth blocker is organizational inertia. Teams are comfortable with one premium model and a simple API abstraction. SLM-skill architectures require product, infra, and governance teams to collaborate on shared quality definitions.

    And there is one more macro reason: market narratives still over-index on frontier model demos. Productivity evidence is mixed and context-dependent. METR’s 2025 study found experienced open-source developers were slower with early-2025 AI tools in that setting, while later METR updates suggest potential speedups as tools and usage evolved [17][18]. The lesson is not that AI fails. The lesson is that system design and task fit determine outcomes.


    A practical migration path teams can run now

    The best transition strategy is incremental and metric-led. Start by instrumenting skills, not models, and track latency, acceptance rate, escalation rate, and user correction burden per skill. Then segment skills into routine deterministic transforms, bounded reasoning tasks, and high-ambiguity judgment tasks. Move the first class to SLM-first quickly; for the second, run shadow routing where the SLM proposes and a larger model verifies sampled traffic; and keep the third class on stronger models until smaller tuned variants prove they meet your bar.
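
    As a sketch of that per-skill instrumentation: the metric names mirror the ones above, and the in-memory store is an illustrative stand-in for a real metrics backend.

    ```python
    from collections import defaultdict

    metrics = defaultdict(lambda: {"calls": 0, "accepted": 0, "escalated": 0,
                                   "corrected": 0, "latency_ms_total": 0.0})

    def record(skill: str, latency_ms: float, accepted: bool,
               escalated: bool, user_corrected: bool) -> None:
        m = metrics[skill]
        m["calls"] += 1
        m["latency_ms_total"] += latency_ms
        m["accepted"] += int(accepted)
        m["escalated"] += int(escalated)
        m["corrected"] += int(user_corrected)

    def report(skill: str) -> dict:
        m = metrics[skill]
        calls = max(m["calls"], 1)
        return {
            "avg_latency_ms": m["latency_ms_total"] / calls,
            "acceptance_rate": m["accepted"] / calls,    # safe to downshift?
            "escalation_rate": m["escalated"] / calls,   # SLM coverage health
            "correction_burden": m["corrected"] / calls, # user fix-up load
        }

    record("ticket-triage", latency_ms=420, accepted=True,
           escalated=False, user_corrected=False)
    print(report("ticket-triage"))
    ```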

    In parallel, build a synthetic-plus-human data loop for top-volume skills. Use larger models as teachers where needed, retain high-quality human anchors, run frequent regression tests, and apply resource-efficient tuning patterns such as quantization-aware or low-rank methods when they fit the stack [12][13][16][19].
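
    For the low-rank pattern cited there, a minimal sketch with the Hugging Face transformers and peft libraries might look like this; the checkpoint is the Llama 3.2 1B model from the footnotes, and the rank, target modules, and 4-bit settings are illustrative defaults, not recommendations.

    ```python
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    # Load the base model 4-bit quantized (the QLoRA recipe cited in [19]).
    bnb = BitsAndBytesConfig(load_in_4bit=True,
                             bnb_4bit_compute_dtype=torch.bfloat16)
    base = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.2-1B-Instruct",
        quantization_config=bnb,
    )

    # Attach small low-rank adapters; only these train, the base stays frozen.
    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
    model = get_peft_model(base, lora)
    model.print_trainable_parameters()
    ```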

    Over time, the target is not zero LLM usage. The target is intentional LLM usage: bigger models where they create disproportionate value, smaller models where they keep the system fast, affordable, private, and operationally stable. The most mature architecture is hybrid by default.


    The environmental argument, without hand-waving

    The environmental case for SLM-heavy routing is usually directionally right, but often discussed sloppily.

    The realistic claim is this: if you lower average inference compute per successful task, you reduce energy demand per task. Smaller models, quantization, and better routing can all contribute.
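
    A toy calculation shows the shape of the effect; every number here is a loudly hypothetical assumption, chosen only for illustration.

    ```python
    # Hypothetical energy units per call and a hypothetical SLM resolution rate.
    SLM_COST, LLM_COST = 1.0, 30.0
    slm_coverage = 0.80  # share of calls the SLM lane resolves without escalating

    # Every call tries the SLM; the unresolved 20% escalate to the big model.
    hybrid_per_task = SLM_COST + (1 - slm_coverage) * LLM_COST   # = 7.0 units
    llm_only_per_task = LLM_COST                                 # = 30.0 units

    print(f"hybrid: {hybrid_per_task:.1f} vs LLM-only: {llm_only_per_task:.1f}")
    # Under these assumptions, routing cuts energy per task by roughly 4x.
    ```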

    The broader context matters because AI electricity demand is rising quickly. IEA reporting shows strong growth in data-center electricity use and projects further expansion as AI workloads scale [20].

    This does not mean “SLMs solve AI energy by themselves.”

    It means architecture choices now have cumulative energy impact at volume.

    If two systems deliver similar task quality and one uses less compute per completed workflow, the lower-compute one is not just cheaper. It is operationally cleaner.

    That alone is enough reason to take SLM-skill design seriously.


    What this could look like in the next 18 months

    The likely endpoint is not a binary winner between SLMs and LLMs. It is layered execution. Skills become the reusable unit of behavior, SLMs become the default engine for many bounded skills, LLMs remain the escalation layer for ambiguity and hard reasoning, data pipelines become continuous through synthetic generation plus verification, and orchestration becomes policy-driven, routing by confidence, risk, and latency budget instead of hype.

    When that happens, “which model are we using?” becomes a less useful question.

    A better question is: “Which model-skill combo is delivering the best outcome per unit of cost, time, and risk for this exact workflow?”

    That is the real maturity curve.

    At One Horizon, this is exactly the direction we care about: turning AI execution into something traceable from intent to shipped outcomes, where humans and agents share context and where model choices are explicit, measurable, and evolvable over time.

    If that is the shift your team is wrestling with, this is worth testing now, before your current agent stack gets too expensive and too brittle to scale.

    Explore more at onehorizon.ai.


    Footnotes

    1. Anthropic Docs, “Agent Skills Overview.” https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview

    2. OpenAI, “Introducing Codex.” https://openai.com/index/introducing-codex/

    3. OpenAI Developers, “Custom instructions with AGENTS.md.” https://developers.openai.com/codex/guides/agents-md

    4. Model Context Protocol Docs, “What is MCP?” https://modelcontextprotocol.io/docs/getting-started/intro

    5. Microsoft, “Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone.” https://arxiv.org/abs/2404.14219

    6. Google DeepMind, “Gemma 2: Improving Open Language Models at a Practical Size.” https://arxiv.org/abs/2408.00118

    7. Gemma Team, “Gemma 3 Technical Report.” https://arxiv.org/abs/2503.19786

    8. Meta, “Llama-3.2-1B-Instruct Model Card.” https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct

    9. Chen et al., “FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance.” https://arxiv.org/abs/2305.05176

    10. Ong et al., “RouteLLM: Learning to Route LLMs with Preference Data.” https://arxiv.org/abs/2406.18665

    11. Sanh et al., “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.” https://arxiv.org/abs/1910.01108

    12. Wang et al., “Self-Instruct: Aligning Language Models with Self-Generated Instructions.” https://arxiv.org/abs/2212.10560

    13. Hugging Face Dataset Card, “tatsu-lab/alpaca.” https://huggingface.co/datasets/tatsu-lab/alpaca

    14. Eldan and Li, “TinyStories: How Small Can Language Models Be and Still Speak Coherent English?” https://arxiv.org/abs/2305.07759

    15. Microsoft, “Phi-4 Technical Report.” https://arxiv.org/abs/2412.08905

    16. Shumailov et al., “AI models collapse when trained on recursively generated data.” Nature (2024). https://www.nature.com/articles/s41586-024-07566-y

    17. METR, “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity.” https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study-paper.pdf

    18. METR, “We are Changing our Developer Productivity Experiment Design.” https://metr.org/blog/2026-02-24-uplift-update/

    19. Dettmers et al., “QLoRA: Efficient Finetuning of Quantized LLMs.” https://arxiv.org/abs/2305.14314

    20. IEA, “Energy and AI” (Executive Summary). https://www.iea.org/reports/energy-and-ai/executive-summary

