---
title: "🚀 Introducing FlashQLA: high-performance linear attention kernels built on TileLang.\n\n⚡ 2–3× forwar..."
source_name: "Qwen(@Alibaba_Qwen)"
original_url: "https://x.com/Alibaba_Qwen/status/2049462758211772663"
canonical_url: "https://www.traeai.com/articles/7b56c964-438a-499c-8dd8-91595b373760"
content_type: "tweet"
language: "English"
score: 8.5
tags: ["FlashQLA","TileLang","AI acceleration","linear attention"]
published_at: "2026-04-29T12:16:13+00:00"
created_at: "2026-04-30T05:18:32.127691+00:00"
---

# 🚀 Introducing FlashQLA: high-performance linear attention kernels built on TileLang.

⚡ 2–3× forward speedup. 2× backward speedup.

Canonical URL: https://www.traeai.com/articles/7b56c964-438a-499c-8dd8-91595b373760
Original source: https://x.com/Alibaba_Qwen/status/2049462758211772663

## Summary

FlashQLA is a set of high-performance linear attention kernels built on TileLang. It delivers a 2–3× forward speedup and a 2× backward speedup, and is purpose-built for agentic AI on personal devices.

## Key Takeaways

- FlashQLA delivers a 2–3× forward speedup and a 2× backward speedup.
- Gate-driven automatic intra-card context parallelism (CP) raises SM utilization.
- A 16-stage warp-specialized pipeline makes the backward pass efficient (the recurrence these kernels compute is sketched below).
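
As background for these points: linear attention replaces the quadratic score matrix of softmax attention with a fixed-size recurrent state. The tweet does not spell out FlashQLA's exact formulation (it only mentions a "hardware-friendly algebraic reformulation"), so the following is the generic gated linear attention recurrence rather than the announced method:

$$
S_t = \alpha_t\, S_{t-1} + k_t v_t^{\top}, \qquad o_t = S_t^{\top} q_t, \qquad \alpha_t \in (0, 1),
$$

where $S_t \in \mathbb{R}^{d_k \times d_v}$ is a constant-size state, so compute and memory grow linearly with sequence length. That property is what makes long-context and on-device workloads the natural beneficiaries of kernels like these.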

## Content


[Qwen (@Alibaba_Qwen)](https://x.com/Alibaba_Qwen):
🚀 Introducing FlashQLA: high-performance linear attention kernels built on TileLang.

⚡ 2–3× forward speedup. 2× backward speedup.
💻 Purpose-built for agentic AI on your personal devices.

💡 Key insights:
1. Gate-driven automatic intra-card CP.
2. Hardware-friendly algebraic reformulation.
3. TileLang fused warp-specialized kernels.

FlashQLA boosts SM utilization via automatic intra-device CP. The gains are especially pronounced for TP setups, small models, and long-context workloads.

Instead of fusing the entire GDN flow into a single kernel, we split it into two kernels optimized for CP and backward efficiency. At large batch sizes this incurs extra memory I/O overhead vs. a fully fused approach, but it delivers better real-world performance on edge devices and long-context workloads.

The backward pass was the hardest part: we built a 16-stage warp-specialized pipeline under extremely tight on-chip memory constraints, ultimately achieving 2×+ kernel-level speedups.

We hope this is useful to the community! 🫶🫶

Learn more:
📖 Blog: [qwen.ai/blog?id=flashq](https://t.co/HF6opiR4yf)
💻 Code: [github.com/QwenLM/FlashQLA](https://t.co/G3oaf5L1AZ)

[Attached image (labeled "Made with AI")](https://x.com/Alibaba_Qwen/status/2049462758211772663/photo/1)


[12:16 PM · Apr 29, 2026](https://x.com/Alibaba_Qwen/status/2049462758211772663)

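The thread above describes chunk-level parallelism inside one device (intra-card CP) with a small recurrent state carried between kernels. The sketch below is a minimal NumPy reference for chunkwise gated linear attention under the simple scalar-gate recurrence shown earlier; it illustrates the computation pattern only and is not the FlashQLA/TileLang implementation. The function name, gate convention, and chunk size are assumptions for illustration.

```python
import numpy as np

def chunkwise_gated_linear_attention(q, k, v, g, chunk=64):
    """Reference chunkwise gated linear attention with a scalar decay gate.

    q, k: (T, d_k); v: (T, d_v); g: (T,) gates in (0, 1).
    Per-step recurrence:  S_t = g_t * S_{t-1} + k_t v_t^T,   o_t = S_t^T q_t.
    The sequence is processed in chunks; only the (d_k, d_v) state S crosses
    chunk boundaries, so intra-chunk work for different chunks is independent.
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    o = np.zeros((T, d_v))
    S = np.zeros((d_k, d_v))                      # recurrent state carried between chunks
    for start in range(0, T, chunk):
        end = min(start + chunk, T)
        qc, kc, vc, gc = q[start:end], k[start:end], v[start:end], g[start:end]
        cum = np.cumprod(gc)                      # cum[t] = g_1 * ... * g_t within the chunk
        # 1) contribution of the state carried in from earlier chunks
        o[start:end] = cum[:, None] * (qc @ S)
        # 2) causal intra-chunk contribution, decayed by prod_{s<j<=t} g_j = cum[t]/cum[s]
        decay = np.tril(cum[:, None] / cum[None, :])
        o[start:end] += (decay * (qc @ kc.T)) @ vc
        # 3) fold this chunk into the state for the next chunk
        S = cum[-1] * S + (kc * (cum[-1] / cum)[:, None]).T @ vc
    return o

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d_k, d_v = 256, 16, 32
    q = rng.standard_normal((T, d_k))
    k = rng.standard_normal((T, d_k))
    v = rng.standard_normal((T, d_v))
    g = rng.uniform(0.9, 1.0, size=T)             # decay gates close to 1
    print(chunkwise_gated_linear_attention(q, k, v, g).shape)  # (256, 32)
```

Only the small `(d_k, d_v)` state crosses chunk boundaries, so the per-chunk matrix products are independent work that can be spread across SMs; this is the structural property a gate-driven intra-device CP scheme can exploit. Numerical stabilization of the cumulative gate products is omitted here for clarity.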
