Together AI Blog

Foundational research powering efficient inference at scale


TL;DR · AI Summary

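The article describes several technical advances from Together AI, including FlashAttention-4, the ATLAS accelerator, and Batch Inference API updates, which significantly improve the efficiency of large-scale inference.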

Key Takeaways

  • FlashAttention-4 is up to 1.3x faster than cuDNN
  • The ATLAS accelerator speeds up LLM inference by up to 4x
  • The Batch Inference API cuts costs by 50% for most models

Outline

  1. The article discusses Together AI's recent advancements in foundational research for efficient inference at scale.

  2. FlashAttention-4 is up to 1.3 times faster than cuDNN on NVIDIA Blackwell.

  3. ATLAS delivers up to 4 times faster LLM inference.

  4. Batch Inference API reduces costs by 50% for most models.



Inference

Published 5/4/2026

Foundational research powering efficient inference at scale

As AI moves from research to production, the challenge for AI-native teams shifts from building models to running them — efficiently, reliably, and at scale.

AI has spent years in the spotlight for training: the massive, GPU-intensive process of building models. But for most teams deploying AI today, ongoing inference costs are what actually shape their unit economics. Estimates put inference at 80-90% of the total lifetime cost of a production AI system, simply because it runs continuously across every user query, agent step, and API call. And while training is a bounded investment, inference scales with every new user and use case you ship.

At NVIDIA GTC 2026, NVIDIA CEO Jensen Huang framed this shift plainly: “People pay for information, but people mostly pay for work. Agentic systems get work done.” That shift from AI as a novelty to AI as a workhorse is exactly what’s reshaping infrastructure priorities.

For Together AI, none of this is new. The inference imperative is what we’ve been building for. Our CTO Ce Zhang covered these dynamics in depth at GTC, sharing hard-won lessons from running some of the most demanding production inference workloads in the industry.

Why inference is a different kind of hard

Inference isn’t just “running the model.” In production, it’s an optimization problem across multiple competing dimensions simultaneously:

  • Latency shapes what’s possible to build. For applications like coding assistants, real-time support, or conversational agents, sub-500ms response times aren’t a nice-to-have — they determine whether a product feels like software or like waiting. Agentic workflows amplify this: five model calls at 200ms each is a full second of accumulated latency before a user sees a result. The threshold matters, and missing it has product consequences.
  • Throughput determines your unit economics. AI-native companies face a structurally different cost profile than traditional SaaS. Where legacy software companies target 80-90% gross margins, AI companies commonly operate at 50-60%, with inference alone accounting for roughly 23% of revenue at scaling-stage companies. Efficient inference means more requests served per GPU-hour. That math flows directly to margins.
  • The model landscape keeps changing. The inference stack optimized for today’s models may need significant rework tomorrow. New architectures, quantization methods, and hardware keep arriving, so staying at the frontier requires continuous investment across the full stack, not just one-time optimization.
  • Concurrency is unforgiving. Serving thousands of simultaneous users means navigating wildly different context lengths, latency requirements, and cost profiles, all at once and without degradation. That’s as much a scheduling and orchestration challenge as it is a compute one.

This is also why the stakes are higher than most teams initially expect.
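The latency arithmetic from the list above can be checked with a few lines of Python. This is a minimal sketch that assumes the model calls run strictly one after another; the function names are illustrative and not part of any Together AI API.

```python
def sequential_latency_ms(step_latencies_ms):
    """Total wall-clock latency when each model call waits for the previous one."""
    return sum(step_latencies_ms)


def max_per_call_latency_ms(budget_ms, num_calls):
    """Largest per-call latency that keeps a sequential chain within a budget."""
    return budget_ms / num_calls


# Five chained calls at 200 ms each, as in the agentic example above.
print(sequential_latency_ms([200] * 5))  # 1000

# To fit five sequential calls into a 500 ms budget, each call gets 100 ms.
print(max_per_call_latency_ms(500, 5))  # 100.0
```

The second function shows why agentic chains tighten the per-call target: the budget is divided across every sequential step.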

How Together approaches inference

Together’s approach to inference isn’t a single optimization. It’s a compounding stack of research, systems engineering, and hardware expertise designed to improve continuously as the frontier moves:

  • Research that ships to production. The Together Research team has contributed some of the most widely adopted advances in inference efficiency: FlashAttention (now up to FlashAttention-4), ThunderKittens, and Aurora, our open-source adaptive speculative decoding system delivering up to 1.25x faster LLM inference. This research goes into production for customers, typically within weeks of publication.
  • Adaptive speculative decoding. Standard speculative decoding uses a smaller draft model to propose tokens that a larger model verifies in parallel, delivering 1.5-3x speedups on predictable workloads like code completion or structured outputs. Our ATLAS and Aurora systems go further: Aurora is an open-source RL-based framework that learns from live inference traces in real time, adapting as traffic patterns shift. It achieves meaningful speedups over even well-trained static speculators, without interrupting serving.
  • Full-stack hardware optimization. Running on the latest NVIDIA Blackwell hardware (GB200 NVL72, HGX B200) means building custom parallelism strategies across 72-GPU meshes, implementing NVFP4 quantization, and constructing weights-to-production pipelines that get model releases live within days. When Cursor needed production-grade latency for millions of active developers, Together AI built the full-stack infrastructure to make it work, handling strict latency SLAs across unpredictable, high-concurrency traffic.
  • Intelligent scheduling and batching. High-throughput inference requires making smart real-time decisions: which requests to batch together, how to route based on context length and latency requirements, and when to trade throughput for responsiveness. Together’s inference engine handles this dynamically, extracting maximum efficiency from each GPU-hour without sacrificing the experience that AI-native apps and products depend on.
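To make the batching decision in the last bullet concrete, here is a toy sketch of one possible policy: sort requests by context length, then greedily pack them under a per-batch token budget. It is an illustration under stated assumptions, not a description of Together's inference engine; the `Request` class, `build_batches` function, and the 8192-token budget are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    context_tokens: int


def build_batches(requests, max_batch_tokens):
    """Greedily pack requests into batches under a token budget.

    Sorting by context length first tends to group similarly sized
    requests, which limits padding waste in this toy model.
    """
    batches, current, used = [], [], 0
    for req in sorted(requests, key=lambda r: r.context_tokens):
        # Close the current batch if this request would exceed the budget.
        if current and used + req.context_tokens > max_batch_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(req)
        used += req.context_tokens
    if current:
        batches.append(current)
    return batches


# Hypothetical mix of short chat turns and long-context requests.
requests = [
    Request("a", 200),
    Request("b", 8000),
    Request("c", 300),
    Request("d", 7500),
    Request("e", 250),
]
for batch in build_batches(requests, max_batch_tokens=8192):
    print([r.request_id for r in batch])
```

With these inputs the three short requests share one batch while each long request gets its own. A production scheduler would also weigh latency targets and arrival times, which this sketch ignores.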

The economics of getting this right

The Stanford 2025 AI Index documents a remarkable trend: inference costs for GPT-3.5-level performance dropped more than 280-fold between late 2022 and late 2024. But total inference spend is rising; as costs fall, teams deploy AI for more use cases, users, and agent steps. Lower costs per token haven’t reduced the infrastructure challenge; they’ve expanded the surface area of it. As the industry converges on lower token cost as a real indicator of AI infrastructure's TCO, Together AI’s approach of optimizing the full hardware and software stack continues to deliver better profitability for customers.

For AI-native companies, this makes inference optimization a compounding advantage. Run inference 2x more efficiently and you serve more customers on the same hardware, while also opening up use cases that weren’t previously viable. Each efficiency gain flows directly to margins and also expands what you’re able to build over time.

That’s what Together AI prides itself on: a platform that isn’t just fast inference, but the infrastructure layer that empowers AI-native teams to grow without their costs growing faster than their revenue.

Run production-scale inference on the AI Native Cloud

Together AI is the AI Native Cloud, offering a full-stack AI platform across Serverless & Dedicated Inference, Accelerated Compute, and Model Shaping — empowering you to get more value out of every GPU-hour, without sacrificing the speed and production-grade reliability users expect.

Inference isn’t a fringe concern. For teams building AI-native apps today, it’s the thing that will shape margins, product roadmaps, and the ability to compete. The good news: the tools to tackle it on the AI Native Cloud have never been better.

Ready to build what’s next on Together AI? Get started today.


Want to go deeper? Our best practices guide for production inference covers speculative decoding, optimized kernels, quantization, and hardware acceleration in detail.

FAQ

What is AI inference at scale?

AI inference is the process of running a trained model to generate a response — every time a user sends a message, triggers an agent, or makes an API call. At scale, this means serving thousands or millions of simultaneous requests, each with different context lengths, latency requirements, and cost profiles. The infrastructure challenge isn't just compute — it's orchestrating all of that efficiently, continuously, without degrading speed or reliability for any individual user.

Why is AI inference more expensive than training?

Training is an intensive but bounded investment: it happens once (or periodically when a model is updated). Inference, by contrast, runs continuously: every user interaction, every agent step, every API call generates a cost. Industry estimates put inference at 80-90% of the total lifetime cost of a production AI system. As usage grows, so does the bill. For AI-native companies, inference is effectively the cost of goods sold, and it scales directly with revenue.

What is speculative decoding?

Speculative decoding is an inference acceleration technique where a smaller, faster "draft" model proposes several tokens at once, which a larger target model then verifies in parallel. Tokens that match are accepted; the rest are discarded and regenerated. When the draft model is well-aligned with the target, this approach can deliver 1.5–3x speedups without changing the output. It's particularly effective for predictable workloads like code completion or structured data generation. Together AI's ATLAS system extends this further with adaptive speculative decoding that learns and adjusts from live traffic in real time.
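The propose-and-verify loop described above can be sketched with toy models. This is a simplified greedy version that assumes each "model" is a plain function from a token list to the next token; real systems verify all drafted positions in a single batched forward pass and, when sampling, use a probabilistic acceptance rule. All names here are hypothetical.

```python
def greedy_decode(next_token, prompt, num_new_tokens):
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        tokens.append(next_token(tokens))
    return tokens[len(prompt):]


def speculative_decode(target_next, draft_next, prompt, num_new_tokens, k=4):
    """Greedy speculative decoding; returns (new tokens, verification rounds)."""
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < num_new_tokens:
        # 1. The draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))

        # 2. The target checks each proposed position (a loop here for
        #    clarity; a real system does this in one parallel pass).
        rounds += 1
        accepted = []
        for tok in proposal:
            expected = target_next(tokens + accepted)
            if tok != expected:
                # 3. First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
            accepted.append(tok)
        else:
            # All k drafts matched, so the target adds one bonus token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[len(prompt):len(prompt) + num_new_tokens], rounds


# Toy target counts upward modulo 10; the toy draft is wrong after a 6.
def target_next(seq):
    return (seq[-1] + 1) % 10


def draft_next(seq):
    return 0 if seq[-1] == 6 else (seq[-1] + 1) % 10


output, rounds = speculative_decode(target_next, draft_next, [0], 12, k=4)
print(output == greedy_decode(target_next, [0], 12))  # True
print(rounds)  # 3
```

The output matches plain greedy decoding with the target model, but it takes 3 verification rounds instead of 12 sequential target steps, which is where the speedup comes from when the draft is cheap and usually right.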

What is adaptive speculative decoding?

Standard speculative decoding relies on a static draft model — one trained offline and fixed at deployment. The problem is that real-world traffic patterns shift constantly, and a static draft model's accuracy degrades as the domain changes. Adaptive speculative decoding solves this by continuously learning from live inference traces, updating the draft model without interrupting serving. Together AI's Aurora framework is an open-source, RL-based implementation that achieves meaningful speedups over well-trained static speculators, even when starting from scratch.
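A rough way to see why draft accuracy matters: if each drafted token is accepted independently with probability a, and the draft proposes k tokens per round, the expected number of tokens emitted per target verification pass is (1 - a^(k+1)) / (1 - a). The sketch below evaluates that expression. The independence assumption is a simplification of real traffic, and the function name is illustrative.

```python
def expected_tokens_per_round(acceptance_rate, draft_length):
    """Expected tokens emitted per target verification pass.

    Assumes every drafted token is accepted independently with the same
    probability; the count includes the token the target always supplies.
    """
    a, k = acceptance_rate, draft_length
    if a == 1.0:
        return float(k + 1)
    return (1 - a ** (k + 1)) / (1 - a)


# Hypothetical acceptance rates as traffic drifts away from the draft's domain.
for rate in (0.9, 0.7, 0.5):
    print(rate, round(expected_tokens_per_round(rate, draft_length=4), 2))
```

With a draft length of 4, falling from a 0.9 to a 0.5 acceptance rate cuts the expected yield from about 4.1 to about 1.9 tokens per pass. This counts tokens only; it ignores the cost of running the draft model, so it is an upper bound on the wall-clock benefit.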

What does "inference research" mean in the context of AI?

Inference research is the field of study focused on making AI models run faster, cheaper, and more efficiently in production — without sacrificing output quality. It encompasses algorithm-level work (like speculative decoding and attention optimization), systems-level work (like kernel engineering and request scheduling), and hardware-level work (like quantization and GPU utilization). It's distinct from model research, which focuses on improving what models know or can do. As inference costs become the dominant expense in AI deployment, inference research has become one of the highest-leverage areas in applied AI.

How does inference optimization affect AI product economics?

Inference optimization directly improves unit economics: faster inference means more requests served per GPU-hour, which translates to lower cost per request. At scale, even modest efficiency gains compound significantly — a 2x improvement in throughput effectively halves infrastructure costs for the same workload. This matters for product teams because it determines what use cases are economically viable, how quickly margins improve as volume grows, and whether a product can sustain competitive pricing as the market matures.
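The relationship between throughput and cost can be written out directly. The numbers below (a $4.00 GPU-hour, 10,000 requests per GPU-hour, a 900 requests-per-second peak) are hypothetical placeholders chosen for easy arithmetic, not Together AI pricing or benchmarks.

```python
import math


def cost_per_request(gpu_hour_cost, requests_per_gpu_hour):
    """Infrastructure cost attributed to a single request."""
    return gpu_hour_cost / requests_per_gpu_hour


def gpus_needed(peak_requests_per_sec, requests_per_sec_per_gpu):
    """GPUs required to serve a peak load, rounded up to whole devices."""
    return math.ceil(peak_requests_per_sec / requests_per_sec_per_gpu)


# Hypothetical baseline versus a 2x throughput improvement.
baseline = cost_per_request(gpu_hour_cost=4.0, requests_per_gpu_hour=10_000)
optimized = cost_per_request(gpu_hour_cost=4.0, requests_per_gpu_hour=20_000)
print(optimized / baseline)  # 0.5

# The same 2x gain halves the fleet needed for a fixed peak load.
print(gpus_needed(900, 50), gpus_needed(900, 100))  # 18 9
```

Holding the GPU-hour price fixed, doubling requests per GPU-hour halves cost per request and halves the fleet needed for the same workload.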


© 2026 Together AI. All Rights Reserved.
