[AINews] not much happened today
- vLLM 0.20 improves memory efficiency and inference performance, with support for multiple hardware accelerators.
- DeepSeek V4 is significantly optimized across hardware; B300 is up to 8× faster than H200.
- Poolside released Laguna XS.2, a lightweight open-source model suited to single-GPU deployment.

a quiet day.
Apr 29, 2026
When we made the AINews → Substack move, we committed to writing Matt Levine-style op-eds every day, but some days there just isn’t much going on, and we will simply say so. We are working on small essays on inference demand and multi-agent systems, but today is not that day.
Interesting model releases from NVIDIA (Nemotron), Poolside, and Alec Radford, but it’s unclear whether any of them will stand the test of time. GPT-6 hype is beginning.
AI News for 4/27/2026-4/28/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!
* * *
**AI Twitter Recap**
**Inference Systems, vLLM 0.20, and the Hardware/Kernel Race Around DeepSeek V4**
- **vLLM’s latest release is heavily about memory and MoE serving efficiency**: vLLM v0.20.0 shipped with **TurboQuant 2-bit KV cache** for **4× KV capacity**, FA4 re-enabled for MLA prefill on **SM90+**, a new **vLLM IR** foundation, fused RMSNorm for a reported **2.1% end-to-end latency improvement**, plus support updates spanning **DeepSeek V4 MegaMoE on Blackwell**, Jetson Thor, ROCm, Intel XPU, and easier GB200/Grace-Blackwell setup. In parallel, SemiAnalysis highlighted early DeepSeek V4 Pro serving results on **B200/B300/H200/GB200 disaggregated setups**, claiming **B300 can be up to 8× faster than H200** for this workload and pointing to upcoming vLLM 0.20 benchmarking with **DeepGEMM MegaMoE**, which fuses **EP dispatch + EP combine + GEMMs + SwiGLU** into a single mega-kernel (a sketch of the unfused path this replaces follows this list).
- **DeepSeek support**: several posts focused on serving tradeoffs. Jeremy Howard noted DeepSeek V4’s support for prefill as a capability many providers have dropped, while Maharshi pointed out the overheads of **dynamic activation quantization**, arguing that **static quantization** often wins on inference speed despite the calibration cost (the second sketch below contrasts the two). There was also growing interest in alternate-stack portability: teortaxesTex argued DeepSeek is structurally moving away from CUDA lock-in via TileKernels, suggesting model vendors may increasingly optimize for heterogeneous or domestic accelerator fleets rather than NVIDIA-only deployment.
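
For context on what a fused mega-kernel collapses, here is a minimal PyTorch-style sketch of the unfused expert-parallel MoE forward path: routing, token dispatch, per-expert GEMMs with SwiGLU, and a weighted combine. This is our own single-device illustration (all names invented), not DeepGEMM's kernel; in a real EP deployment the dispatch and combine steps are all-to-all communication, and the fusion claim is that steps 2–4 happen in a single launch without materializing intermediates between them.

```python
import torch
import torch.nn.functional as F

def moe_forward_unfused(x, router_logits, w_gate, w_up, w_down, top_k=2):
    """Naive, unfused MoE forward: the stages a fused mega-kernel would
    execute in one launch.
    x:             [num_tokens, d_model]
    router_logits: [num_tokens, num_experts]
    w_gate, w_up:  [num_experts, d_model, d_ff]
    w_down:        [num_experts, d_ff, d_model]
    """
    # 1) Routing: pick top-k experts per token and normalize their weights.
    weights, expert_ids = router_logits.topk(top_k, dim=-1)
    weights = F.softmax(weights, dim=-1)

    out = torch.zeros_like(x)
    for e in range(w_gate.shape[0]):
        # 2) Dispatch: gather the tokens routed to expert e
        #    (an EP all-to-all in an expert-parallel deployment).
        token_idx, slot = (expert_ids == e).nonzero(as_tuple=True)
        if token_idx.numel() == 0:
            continue
        xe = x[token_idx]
        # 3) Expert GEMMs + SwiGLU activation.
        h = F.silu(xe @ w_gate[e]) * (xe @ w_up[e])
        ye = h @ w_down[e]
        # 4) Combine: scatter-add the weighted expert outputs back per token.
        out.index_add_(0, token_idx, ye * weights[token_idx, slot].unsqueeze(-1))
    return out
```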
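To make the static-vs-dynamic tradeoff concrete, here is a minimal illustrative sketch (ours, not DeepSeek's kernels): dynamic quantization derives activation scales from the live tensor on every forward pass, while static quantization reuses a scale measured once on calibration data, so serving skips the runtime reduction entirely.

```python
import torch

def quantize_int8(x, scale):
    # Symmetric int8 quantization with a given scale.
    return torch.clamp((x / scale).round(), -127, 127).to(torch.int8)

# Dynamic: the scale is derived from the live activations on every forward
# pass, adding an abs-max reduction to the critical path of each decode step.
def quantize_dynamic(x):
    scale = x.abs().max() / 127.0
    return quantize_int8(x, scale), scale

# Static: the scale is computed once from calibration activations and baked
# into the model, so inference reuses it with no runtime reduction.
def calibrate_static_scale(calibration_batches):
    return max(b.abs().max().item() for b in calibration_batches) / 127.0

def quantize_static(x, precomputed_scale):
    return quantize_int8(x, precomputed_scale), precomputed_scale
```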
**Open Model Releases: Poolside Laguna XS.2, NVIDIA Nemotron 3 Nano Omni, and TRELLIS.2**
- **Poolside made its first public model release with an unusually deployment-friendly open-weight coder**: @poolsideai announced Laguna XS.2, a **33B total / 3B active MoE** coding model trained fully in-house, released under **Apache 2.0**, and advertised as able to run on a **single GPU**. Poolside’s broader release also included **Laguna M.1** and an agent harness, emphasizing that the company trained from scratch on its own **data, training infra, RL, and inference stack**. Community summaries added more color: Aymeric Roucher described two coder models—**225B/23B active** and **33B/3B active**—with **hybrid attention**, **FP8 KV cache**, and claimed performance near **Qwen-3.5**; Ollama shipped it immediately.
- **NVIDIA’s Nemotron 3 Nano Omni was the day’s biggest infra-native model launch**: @NVIDIAAI introduced Nemotron 3 Nano Omni, an open **30B / A3B multimodal MoE** with **256K context** built for agentic workloads spanning **text, image, video, audio, and documents**. Distribution was immediate across the stack: OpenRouter, LM Studio, Ollama, Unsloth, fal, Fireworks, DeepInfra, Together, Baseten, Canonical, and others all announced same-day availability. Key specs surfaced in follow-on posts: Piotr Żelasko described it as NVIDIA’s first **omni** release with speech/audio understanding backed by a **Parakeet encoder**, **English-only** for now, and a **5.95% WER** on the Open ASR leaderboard. Several hosts cited **~9× throughput** versus comparable open omni models.
- **Other notable model/paper releases**: Microsoft’s TRELLIS.2 is an open-source **4B image-to-3D model** producing up to **1536³ PBR textured assets**, built on native 3D VAEs with **16× spatial compression**. On the world-model side, World-R1 claims existing video models already encode **3D structure** and can be “woken up” with **RL**, requiring **no architecture changes, no extra video training data, and no added inference cost**.
**Agents, Local-First Tooling, and Production Orchestration**
- **Agent builders are shifting from demos to production primitives**: Mistral launched Workflows in public preview as an orchestration layer aimed at turning enterprise AI processes into durable, observable, fault-tolerant production systems. Related posts echoed the same theme: Sydney Runkle framed durable execution as a key requirement for long-running agents (a minimal sketch of the idea follows this list), and threepointone described work on subagents / agents-as-tools with persistence, streaming, and resumption.
- **Local/offline agents moved from aspiration to credible workflow**: Teknium asserted “totally offline agents are possible”, while Niels Rogge demoed Pi + local models for desktop cleanup and Google Gemma shared a tutorial for local coding agents. Hugging Face’s local push also showed up in adoption numbers: Clement Delangue said 300,000 users have added hardware specs to the Hub to discover what can run locally. Complementing this, Ammaar open-sourced a vibe-coding app running Gemma 4 fully on-device with MLX (see the MLX generation sketch after this list), and Kimmonismus highlighted Sigma, a private browser-based local-agent concept using open models.
- **Hermes and adjacent agent harnesses are gaining real-world traction**: multiple posts reported Hermes outperforming OpenClaw in instruction-following or practical workflows, including SecretArjun, somewheresy, and users deploying Hermes through Telegram or for medical literature extraction. On the research-agent side, Hugging Face’s ML Intern was trending among Spaces, and later gained native metric logging + Trackio integration to make its training jobs observable rather than black-box.
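
Mistral's Workflows API is in preview and not reproduced here; as a generic illustration of what “durable execution” buys a long-running agent, the hypothetical sketch below checkpoints each step's result so a crashed or restarted run resumes after the last completed step instead of replaying it. All names are ours.

```python
import json
from pathlib import Path

def run_durable(workflow_id, steps, checkpoint_dir="checkpoints"):
    """Run a list of (name, fn) steps, persisting each result so a restart
    resumes after the last completed step. Illustrative only; results must
    be JSON-serializable."""
    ckpt = Path(checkpoint_dir) / f"{workflow_id}.json"
    state = json.loads(ckpt.read_text()) if ckpt.exists() else {}
    for name, fn in steps:
        if name in state:           # already completed in a previous run
            continue
        state[name] = fn(state)     # may call tools, models, humans...
        ckpt.parent.mkdir(parents=True, exist_ok=True)
        ckpt.write_text(json.dumps(state))  # the durability point
    return state

# Usage: if the process dies after "research", a rerun skips straight to "draft".
result = run_durable("report-123", [
    ("research", lambda s: "collected notes"),
    ("draft",    lambda s: f"draft based on: {s['research']}"),
])
```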
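And as a minimal sketch of what fully on-device generation with MLX looks like, the snippet below uses the mlx-lm package's `load`/`generate` helpers; the model id is a placeholder (we have not verified an MLX release of Gemma 4), and exact arguments vary by mlx-lm version.

```python
# pip install mlx-lm  (Apple-silicon only)
from mlx_lm import load, generate

# Placeholder repo id: substitute whatever quantized checkpoint you actually
# have locally or on the mlx-community hub.
model, tokenizer = load("mlx-community/your-quantized-model-4bit")

reply = generate(
    model,
    tokenizer,
    prompt="Summarize the files in ~/Downloads and suggest a cleanup plan.",
    max_tokens=256,
)
print(reply)
```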
**Benchmarks, Evals, and Research Findings Worth Watching**
- **Model benchmarking remains fragmented, but a few signals stood out**: Epoch reported GPT-5.5 Pro reaching **159 on the Epoch Capabilities Index** and new highs on **FrontierMath**—**52% on Tiers 1–3** and **40% on Tier 4**—including two Tier 4 problems not previously solved by any model. Separately, Greg Kamradt said ARC-AGI-3 testing for GPT-5.5 and Opus 4.7 had completed, with failure modes now under analysis.
- **Several new benchmarks target more realistic agent and engineering behavior**: Lysandre announced a benchmark for making Transformers more agent-friendly, and VibeBench proposed subjective testing by **1,000 qualified software engineers** to measure how models actually feel in real work. On document intelligence, LlamaIndex’s ParseBench emphasized that OCR benchmarks miss **semantic formatting** such as strikethroughs and superscripts, which materially alter meaning for agents.
- **Research notes with concrete engineering implications**: Rosinality flagged bugs in DeepSpeed and OpenRLHF that reduce SFT performance, with implications for prior studies. Arjun Kocher published a faithful implementation of Compressed Sparse Attention from the DeepSeek-V4 paper. che_shr_cat showed that single-block transformers can solve Extreme Sudoku only with an explicit scratchpad and inverted routing init; otherwise performance is zero. On optimization, Keller Jordan released a lightweight Modded-NanoGPT optimizer benchmark designed to compare methods like **Muon** and **AdamW** on a reproducible speedrun-style task (Muon's core update is sketched below).
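
For readers who have not met Muon, here is a stripped-down sketch of its core update, simplified from Keller Jordan's public Modded-NanoGPT code (Nesterov momentum, shape-dependent scaling, and distributed details omitted): SGD momentum whose 2D update matrix is approximately orthogonalized by a quintic Newton-Schulz iteration before being applied.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    # Quintic Newton-Schulz iteration that pushes the singular values of G
    # toward 1, i.e. approximates U @ V^T from G's SVD.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:                 # keep the iterated matrix wide
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(param, grad, momentum_buf, lr=0.02, momentum=0.95):
    # Plain momentum accumulation, then orthogonalize the 2D update matrix.
    momentum_buf.mul_(momentum).add_(grad)
    update = newton_schulz_orthogonalize(momentum_buf)
    param.add_(update, alpha=-lr)
```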
**Platform Economics, API Pricing, and Closed-Model Reliability Concerns**