---
title: "🚀 Introducing FlashQLA: high-performance linear attention kernels built on TileLang"
source_name: "Qwen(@Alibaba_Qwen)"
original_url: "https://x.com/Alibaba_Qwen/status/2049462666734026923"
canonical_url: "https://www.traeai.com/articles/00d578dc-f62a-4cea-81d0-b5c8e3a11baf"
content_type: "tweet"
language: "English"
score: 8.5
tags: ["AI","performance optimization","TileLang"]
published_at: "2026-04-29T12:15:51+00:00"
created_at: "2026-04-30T05:19:14.623761+00:00"
---

# 🚀 Introducing FlashQLA: high-performance linear attention kernels built on TileLang.

⚡ 2–3× forward speedup. 2× backward speedup.

Canonical URL: https://www.traeai.com/articles/00d578dc-f62a-4cea-81d0-b5c8e3a11baf
Original source: https://x.com/Alibaba_Qwen/status/2049462666734026923

## Summary

Alibaba Cloud has released FlashQLA, high-performance linear attention kernels built on TileLang that deliver a 2–3× forward speedup and a 2× backward speedup, purpose-built for agentic AI on personal devices.

## Key Takeaways

- FlashQLA achieves a 2–3× forward speedup and a 2× backward speedup.
- It uses gate-driven automatic intra-card context parallelism (CP) and hardware-friendly algebraic reformulation.
- The gains are especially pronounced for TP (tensor-parallel) setups, small models, and long-context workloads.

## Content



**[Qwen](https://x.com/Alibaba_Qwen)** ([@Alibaba_Qwen](https://x.com/Alibaba_Qwen))

🚀 Introducing FlashQLA: high-performance linear attention kernels built on TileLang.

⚡ 2–3× forward speedup. 2× backward speedup.
💻 Purpose-built for agentic AI on your personal devices.

💡 Key insights:
1. Gate-driven automatic intra-card CP.
2. Hardware-friendly algebraic reformulation.
3. TileLang fused warp-specialized kernels.

FlashQLA boosts SM utilization via automatic intra-device CP. The gains are especially pronounced for TP setups, small models, and long-context workloads.

Instead of fusing the entire GDN flow into a single kernel, we split it into two kernels optimized for CP and backward efficiency. At large batch sizes this incurs extra memory I/O overhead vs. a fully fused approach, but it delivers better real-world performance on edge devices and long-context workloads.

The backward pass was the hardest part: we built a 16-stage warp-specialized pipeline under extremely tight on-chip memory constraints, ultimately achieving 2×+ kernel-level speedups.

We hope this is useful to the community! 🫶🫶

Learn more:
📖 Blog: [qwen.ai/blog?id=flashq](https://t.co/HF6opiR4yf)
💻 Code: [github.com/QwenLM/FlashQLA](https://t.co/G3oaf5L1AZ)
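The post describes the math only in prose. As a rough illustration of what a gated linear attention computation looks like, and of the kind of chunked, matmul-friendly rewrite that "hardware-friendly algebraic reformulation" hints at, here is a minimal NumPy sketch. The shapes, gate values, and chunking scheme are illustrative assumptions, not FlashQLA's actual kernels.

```python
import numpy as np

def gla_recurrent(q, k, v, g):
    """Sequential gated linear attention (illustrative sketch).

    The state S is a (d_k, d_v) matrix updated once per step:
        S_t = g_t * S_{t-1} + k_t v_t^T,   o_t = q_t @ S_t
    so cost is linear in sequence length T, unlike softmax attention.
    q, k: (T, d_k); v: (T, d_v); g: (T,) per-step decay gates in (0, 1).
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty((T, d_v))
    for t in range(T):
        S = g[t] * S + np.outer(k[t], v[t])  # gated state update
        out[t] = q[t] @ S                    # read-out
    return out

def gla_chunked(q, k, v, g, chunk=4):
    """Chunked evaluation of the same recurrence using dense matmuls.

    Each chunk combines an intra-chunk, decay-weighted causal attention
    with a single carried inter-chunk state: the per-step loop becomes
    a handful of matrix multiplies per chunk.
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty((T, d_v))
    for s in range(0, T, chunk):
        e = min(s + chunk, T)
        qc, kc, vc = q[s:e], k[s:e], v[s:e]
        cg = np.cumprod(g[s:e])          # cumulative gate product within chunk
        inter = cg[:, None] * (qc @ S)   # contribution of state carried in
        # decay-weighted scores: scale by prod of gates between key i and query t
        scores = (qc @ kc.T) * (cg[:, None] / cg[None, :])
        scores = np.tril(scores)         # causal mask: key index <= query index
        out[s:e] = inter + scores @ vc
        # carry state out: S_new = cg[-1] * S + sum_i (cg[-1]/cg[i]) k_i v_i^T
        S = cg[-1] * S + (kc * (cg[-1] / cg)[:, None]).T @ vc
    return out
```

Because the chunked form replaces most of the per-step recurrence with dense matrix multiplies, it maps far better onto GPU tensor cores; the per-chunk state carry is the only remaining sequential dependency.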

[![Image 10: Image](https://pbs.twimg.com/media/HHElcUQbwAAUYBL?format=jpg&name=small)](https://x.com/Alibaba_Qwen/status/2049462666734026923/photo/1)
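The "gate-driven automatic intra-card CP" insight rests on a property worth spelling out: each gated update is an affine map on the state, and affine maps compose associatively, so disjoint chunks of the sequence can be summarized independently (for example, on different SMs) and merged afterwards. A minimal NumPy sketch of that merge rule, with illustrative shapes and gates (not FlashQLA's implementation):

```python
import numpy as np
from functools import reduce

def combine(seg_a, seg_b):
    """Merge two adjacent sequence segments of a gated recurrence.

    A segment is summarized as (g, B), the affine map S -> g * S + B it
    applies to an incoming state. Applying seg_a then seg_b gives
        g_b * (g_a * S + B_a) + B_b = (g_a * g_b) * S + (g_b * B_a + B_b),
    which is again a segment, so segments can be reduced in any grouping.
    """
    g_a, B_a = seg_a
    g_b, B_b = seg_b
    return (g_a * g_b, g_b * B_a + B_b)

def final_state(k, v, g):
    """Final recurrence state via two independently summarized halves."""
    T = len(g)
    segs = [(g[t], np.outer(k[t], v[t])) for t in range(T)]
    left = reduce(combine, segs[: T // 2])   # could run on one SM...
    right = reduce(combine, segs[T // 2 :])  # ...and this on another
    g_tot, B_tot = combine(left, right)      # one cheap merge at the end
    return B_tot  # starting from a zero state, the result is just B_tot
```

The two halves never see each other's data until the final `combine`, which is what lets a scheduler spread chunks of one long sequence across the SMs of a single device.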


[12:15 PM · Apr 29, 2026](https://x.com/Alibaba_Qwen/status/2049462666734026923)

