
- ml-intern automatically runs the research loop: searching papers, running experiments, and tuning models, simulating a real researcher's workflow.
- On tasks like HealthBench, the agent significantly beat baselines such as Codex through synthetic data and hyperparameter tuning.
- The system is lightweight, relying on the existing HF toolchain (Hub/Jobs/Spaces), and emphasizes a "search the ecosystem before acting" research paradigm.
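As a rough illustration of the synthetic-data upsampling strategy in the second point above — the example records, field names, and categories here are hypothetical placeholders; only the 50x factor comes from the post:

```python
import random

random.seed(0)

# Hypothetical synthetic set: a handful of hard categories
# (emergencies, hedging, multilingual) that the agent wants
# the model to see far more often during SFT.
synthetic = [
    {"category": "emergency", "text": "..."},
    {"category": "hedging", "text": "..."},
    {"category": "multilingual", "text": "..."},
]

UPSAMPLE_FACTOR = 50  # the post mentions 50x upsampling

# Repeat every synthetic example 50 times, then shuffle so the
# training stream doesn't contain long runs of one category.
train_set = [ex for ex in synthetic for _ in range(UPSAMPLE_FACTOR)]
random.shuffle(train_set)

print(len(train_set))  # 3 examples * 50 = 150
```

Heavy upsampling like this is exactly where the post's "careful to check overfitting" caveat applies: a held-out eval set should never contain copies of the upsampled examples.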
Love this work from Aksel and the post-training team at Hugging Face! Turns out the HF ecosystem (papers, datasets, models, all accessible through CLI, skills, and md files) is perfect for running SOTA ML agents: agents that can train any type of AI model to top performance.

A few concrete runs:

⭐️ Scientific reasoning: the agent walked citations from the benchmark paper, pulled OpenScience and NemoTron-CrossThink, added 7 difficulty-filtered variants from ARC/SciQ/MMLU, and ran 12 SFT ablations on Qwen3-1.7B. GPQA went from 10% to 32% in under 10 hours. Claude Code's best on the same prompt was 22.99%.

⭐️ HealthBench: it judged the existing datasets too noisy (!), generated 1,100 synthetic examples covering emergencies, hedging, and multilingual cases, upsampled them 50x, and beat Codex by 60% (careful to check for overfitting here).

⭐️ Competitive math: it wrote a full GRPO script, launched A100s on HF Spaces, watched rewards climb and then collapse, and ran ablations until it found a recipe that held.

And the harness is pretty tiny and simple: a couple of best practices and a handful of skills pointing at tools already in the ecosystem — arxiv and hf.co/papers for reading, the Hub for datasets and models, HF Jobs for compute, Trackio for metrics. My personal favorite is the "research skill" explaining how to build a SOTA landscape of a field, which is extremely powerful when combined with a simple prompt that basically says "FIRST: search the HF ecosystem to find the best approach" (see github.com/huggingface/ml).

On another note: setting good baselines on new benchmarks keeps getting harder when a setup this simple beats raw Codex by 60% on HealthBench out of the box.

Give it a try if you're training AI models. We provisioned $1k of GPU resources and Anthropic credits for the quickest among you.

Links:
Github (CLI): github.com/huggingface/ml
Spaces (mobile): huggingface.co/spaces/smolage
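The "difficulty-filtered variants" in the scientific-reasoning run hinge on some way to score question difficulty. A minimal sketch of one common approach — keep only items a reference model usually gets wrong; the function name, the 0.3 threshold, and the accuracy table are assumptions, not the agent's actual criterion:

```python
# Hypothetical difficulty filter: keep only questions that a small
# reference model answers correctly less than 30% of the time,
# approximating "hard" variants of ARC/SciQ/MMLU.
def is_hard(example, reference_accuracy, threshold=0.3):
    return reference_accuracy[example["id"]] < threshold

# Stand-in accuracy numbers for three candidate questions.
reference_accuracy = {"q1": 0.9, "q2": 0.1, "q3": 0.25}
pool = [{"id": "q1"}, {"id": "q2"}, {"id": "q3"}]

hard = [ex for ex in pool if is_hard(ex, reference_accuracy)]
print([ex["id"] for ex in hard])  # ['q2', 'q3']
```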
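The GRPO run above centers on group-relative rewards: each sampled completion is scored against the other completions for the same prompt. A minimal sketch of that advantage computation — the function name and the 0/1 correctness rewards are illustrative assumptions, not the agent's actual script:

```python
import statistics

def grpo_advantages(group_rewards):
    """Group-relative advantages as in GRPO: normalize each
    completion's reward by the mean and std of its sampling group."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    if std == 0:
        # All completions scored the same: no learning signal.
        return [0.0 for _ in group_rewards]
    return [(r - mean) / std for r in group_rewards]

# Example: 4 completions sampled for one math prompt, rewarded
# 1.0 for a correct final answer and 0.0 otherwise.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```

The reward-collapse-then-ablate pattern the post describes is typical of this setup: when every completion in a group gets the same reward, advantages go to zero and training stalls, so reward shaping and sampling temperature become the knobs to ablate.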
Quote
Aksel
@akseljoonas
12h
Introducing ml-intern, the agent that just automated the post-training team at @huggingface. It's an open-source implementation of the real research loop that our ML researchers do every day. You give it a prompt, it researches papers, goes through citations, implements ideas in GPU
