---
title: "microsoft/VibeVoice"
source_name: "Simon Willison's Weblog"
original_url: "https://simonwillison.net/2026/Apr/27/vibevoice/#atom-everything"
canonical_url: "https://www.traeai.com/articles/5c39fc4b-6caf-41b0-9f50-3033bfd9fed6"
content_type: "article"
language: "English"
score: 8.7
tags: ["speech recognition","open source","AI models","Microsoft"]
published_at: "2026-04-27T23:46:56+00:00"
created_at: "2026-04-29T03:47:59.14799+00:00"
---

# microsoft/VibeVoice


## Summary

Microsoft has open-sourced VibeVoice, a speech-to-text model with built-in speaker diarization that can be run on a Mac with a single command.

## Key Takeaways

- VibeVoice is Microsoft's open-source speech-to-text model, MIT licensed with speaker diarization built into the model.
- Transcribing one hour of audio on an M5 Max MacBook Pro took about 8 minutes 45 seconds, with reported peak memory of around 30GB.
- The tool outputs structured JSON containing timestamps, speaker IDs, and text, well suited to further analysis.

## Content


27th April 2026 - Link Blog

**[microsoft/VibeVoice](https://github.com/microsoft/VibeVoice)**. VibeVoice is Microsoft's Whisper-style audio model for speech-to-text, MIT licensed and with speaker diarization built into the model.

Microsoft released it on January 21st, 2026 but I hadn't tried it until today. Here's a one-liner to run it on a Mac with `uv`, [mlx-audio](https://github.com/Blaizzy/mlx-audio) (by Prince Canuma) and the 5.71GB [mlx-community/VibeVoice-ASR-4bit](https://huggingface.co/mlx-community/VibeVoice-ASR-4bit) MLX conversion of the [17.3GB VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR/tree/main) model, in this case against a downloaded copy of my recent [podcast appearance with Lenny Rachitsky](https://simonwillison.net/2026/Apr/2/lennys-podcast/):

```
uv run --with mlx-audio mlx_audio.stt.generate \
  --model mlx-community/VibeVoice-ASR-4bit \
  --audio lenny.mp3 --output-path lenny \
  --format json --verbose --max-tokens 32768
```

![Image 1: Screenshot of a macOS terminal running an mlx-audio speech-to-text command using the VibeVoice-ASR-4bit model on lenny.mp3, showing download progress, a warning that audio duration (99.8 min) exceeds the 59 min maximum so it's trimming, encoding/prefilling/generating progress bars, then a Transcription section with JSON segments of speakers discussing AI coding agents, followed by stats: Processing time 524.79 seconds, Prompt 26615 tokens at 50.718 tokens-per-sec, Generation 20248 tokens at 38.585 tokens-per-sec, Peak memory 30.44 GB.](https://static.simonwillison.net/static/2026/vibevoice-terminal.jpg)

The tool reported back:

```
Processing time: 524.79 seconds
Prompt: 26615 tokens, 50.718 tokens-per-sec
Generation: 20248 tokens, 38.585 tokens-per-sec
Peak memory: 30.44 GB
```

So that's 8 minutes 45 seconds for an hour of audio (running on a 128GB M5 Max MacBook Pro).

I've tested it against `.wav` and `.mp3` files and they both worked fine.

If you omit `--max-tokens` it defaults to 8192, which is enough for about 25 minutes of audio. I discovered that through trial-and-error and quadrupled it to guarantee I'd get the full hour.
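That trial-and-error estimate can be turned into quick arithmetic. Using the post's numbers (8192 tokens covered roughly 25 minutes of audio), here's a back-of-envelope sketch of the token budget a full hour needs; the exact tokens-per-minute rate will vary with how dense the speech is, so treat this as an estimate only:

```python
# Rough token budget from the figures above: 8192 output tokens
# covered ~25 minutes of audio. Estimate the per-minute rate and
# the budget needed for a 60 minute recording.
tokens_per_minute = 8192 / 25          # ~328 output tokens per audio minute
needed = round(tokens_per_minute * 60) # just under 20k tokens for an hour

print(needed)  # comfortably under the 32768 used in the command above
```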

That command reported using 30.44GB of RAM at peak, but in Activity Monitor I observed 61.5GB of usage during the prefill stage and around 18GB during the generating phase.

Here's [the resulting JSON](https://gist.github.com/simonw/d2c716c008b3ba395785f865c6387b6f). The key structure looks like this:

```
{
  "text": "And an open question for me is how many other knowledge work fields are actually prone to these agent loops?",
  "start": 13.85,
  "end": 19.5,
  "duration": 5.65,
  "speaker_id": 0
},
{
  "text": "Now that we have this power, people almost underestimate what they can do with it.",
  "start": 19.5,
  "end": 22.78,
  "duration": 3.280000000000001,
  "speaker_id": 1
},
{
  "text": "Today, probably 95% of the code that I produce, I didn't type it myself. I write so much of my code on my phone. It's wild.",
  "start": 22.78,
  "end": 30.0,
  "duration": 7.219999999999999,
  "speaker_id": 0
}
```

Since that's an array of objects we can [open it in Datasette Lite](https://lite.datasette.io/?json=https://gist.github.com/simonw/d2c716c008b3ba395785f865c6387b6f#/data/raw?_facet=speaker_id), making it easier to browse.

Amusingly, that Datasette Lite view shows three speakers - it identified Lenny and me for the conversation, and then a separate Lenny for the voice he used for the additional intro and the sponsor reads!
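Since each segment carries `start`, `end` and `speaker_id`, a few lines of Python can total talk time per speaker without Datasette. A minimal sketch, using the three example segments shown earlier:

```python
from collections import defaultdict

# Total speaking time per speaker_id from VibeVoice's segment JSON.
# These three segments are the examples quoted above; a real run
# would json.load() the full output file instead.
segments = [
    {"start": 13.85, "end": 19.5, "speaker_id": 0},
    {"start": 19.5, "end": 22.78, "speaker_id": 1},
    {"start": 22.78, "end": 30.0, "speaker_id": 0},
]

talk_time = defaultdict(float)
for seg in segments:
    talk_time[seg["speaker_id"]] += seg["end"] - seg["start"]

for speaker, seconds in sorted(talk_time.items()):
    print(f"speaker {speaker}: {seconds:.2f}s")
```

The same grouping makes it easy to spot things like the "third speaker" above: an extra `speaker_id` with only a minute or two of total time is a good hint it's an intro or sponsor-read voice.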

VibeVoice can only handle up to an hour of audio, so running the above command transcribed just the first hour of the podcast. To transcribe more than that you'd need to split the audio, ideally with a minute or so of overlap so you can avoid errors from partially transcribed words at the split point. You'd also need to then line up the identified speaker IDs across the multiple segments.
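The split-with-overlap approach can be sketched in a few lines. The 59 minute ceiling comes from the warning in the terminal screenshot above; using ffmpeg (and its `-ss`/`-t` seek flags) to do the actual cutting is my own assumption, as the post doesn't name a splitting tool:

```python
# Chunk a long recording into <=59 minute pieces with a one minute
# overlap between consecutive chunks, as described above.
CHUNK_MIN, OVERLAP_MIN = 59, 1

def chunk_starts(total_minutes):
    """Start offsets (in minutes) so consecutive chunks overlap by OVERLAP_MIN."""
    step = CHUNK_MIN - OVERLAP_MIN
    return list(range(0, max(1, total_minutes - OVERLAP_MIN), step))

# For the ~100 minute podcast this yields two overlapping chunks.
for start in chunk_starts(100):
    print(f"ffmpeg -ss {start * 60} -t {CHUNK_MIN * 60} -i lenny.mp3 "
          f"chunk_{start}.mp3")
```

Matching speaker IDs across chunks is the harder part: the model numbers speakers per run, so you'd compare the transcribed text in the overlap region to work out which `speaker_id` in one chunk corresponds to which in the next.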

Posted [27th April 2026](http://simonwillison.net/2026/Apr/27/) at 11:46 pm


This is a **link post** by Simon Willison, posted on [27th April 2026](http://simonwillison.net/2026/Apr/27/).
