Norway's 2 petabytes of Huawei flash storage and LLM training

TL;DR · AI Summary

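The National Library of Norway is developing a large language model (LLM) that understands Norwegian, and is using 2 PB of Huawei OceanStor Dorado flash storage to support its AI training data pipeline.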

Key Takeaways

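  • The National Library of Norway uses 2 PB of Huawei OceanStor Dorado flash storage in its LLM training data pipeline.
  • The library holds Norway's largest digital collection, including books, newspapers, and web pages.
  • The main challenges so far include data quality and reconciling storage systems built for different purposes.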

Outline

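  1. The National Library of Norway is developing a large language model that understands Norwegian, using 2 PB of Huawei OceanStor Dorado flash storage.

  2. The Norwegian Ministry of Culture tasked the National Library with building a sovereign AI model because the library holds the largest Norwegian digital collection.

  3. The library began digitizing in 2005 and has accumulated 20 PB of unique data, stored in a 3-2-1 configuration.

  4. Data processing covers ingestion, cleaning, deduplication, format normalization, validation, and preparation.

  5. Huawei OceanStor Dorado flash serves as low-latency storage for the data pipeline and training preparation.

  6. After passing through the pipeline, the data is sent to Norway's national supercomputer, the Sigma2 Olivia system, for the actual training.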

Norway’s 2 Petabytes of Huawei Flash Storage and LLM Training

The National Library of Norway is developing a large language model (LLM) that understands the Norwegian language and is utilizing 2 PB of Huawei OceanStor Dorado flash storage in its AI training data pipeline.

Image 1: Marius Husnes.

Marius Husnes, Head of IT Platform at the library (Nasjonalbiblioteket), discussed the project at Huawei’s ID Forum 2026 in Paris, stating that no commercial LLM provider was developing a Norwegian-language LLM. He argued that any country with its own language but no sovereign LLM trained in that language is at a disadvantage, because a globally trained, English-centric LLM lacks knowledge of the country’s history, news, and culture as described in the local language.

The Norwegian Ministry of Culture tasked the National Library with building a sovereign AI (LLM) because the library holds the country’s largest digital collection of Norwegian books, newspapers, web pages, and more. Like many state libraries, it is entitled to receive a copy of every published book and of broadcast content. Its legal deposit mandate extends beyond books, requiring it to collect and preserve all of Norway’s cultural heritage.

An agreement with Norwegian newspapers allowed LLM training on copyrighted content, and Husnes said: "No private company has this."

The library was well-positioned to undertake this task: it has been digitizing its collection since 2005 and has accumulated 20 PB of unique data. That data is stored in a 3-2-1 configuration (3 copies, 2 media types, 1 off-site), so the three copies total approximately 60 PB. Digitizing raw text, sound, moving pictures, still images, and web content involved extensive scanning and OCR and generated a significant amount of metadata; APIs provide online access to the results.
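
The 3-2-1 figures above can be checked with a few lines of Python. This is a minimal sketch rather than anything the library is known to run: the `Copy` class, the `satisfies_3_2_1` and `raw_footprint_pb` helpers, and the site labels are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    medium: str  # storage medium, e.g. "disk" or "tape"
    site: str    # hypothetical site label, e.g. "primary" or "offsite"


def satisfies_3_2_1(copies: list[Copy], primary_site: str) -> bool:
    """True if there are >= 3 copies, on >= 2 media, with >= 1 off-site."""
    media = {c.medium for c in copies}
    has_offsite = any(c.site != primary_site for c in copies)
    return len(copies) >= 3 and len(media) >= 2 and has_offsite


def raw_footprint_pb(unique_pb: float, copies: int = 3) -> float:
    """Raw capacity consumed when every unique byte is stored `copies` times."""
    return unique_pb * copies


# Illustrative layout only; the article does not say how the copies are placed.
layout = [Copy("disk", "primary"), Copy("tape", "primary"), Copy("tape", "offsite")]
print(satisfies_3_2_1(layout, primary_site="primary"))  # True
print(raw_footprint_pb(20))                             # 60
```

With 20 PB of unique data and three copies, the raw footprint comes to the roughly 60 PB quoted in the talk.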

Most of the data is held in a preservation system, a digital archive built on disk plus tape. Husnes’ task was to get this data into the LLM training system. He noted that the bottleneck was not compute power but data quality, cleaning, and pipeline throughput. There are two main processing stages. The first is in-house computation using an Nvidia DGX H200 system, a 384-core CPU cluster, and multiple Huawei OceanStor Dorado all-flash arrays totaling 2 PB of flash capacity. This storage provides low-latency access for data pipelines and training preparation.

Image 2: Husnes - training national LLM.

The pipeline included data ingestion, cleaning, deduplication, format normalization, validation, and preparation steps. Once the data passed through the pipeline, it was sent to Norway’s national supercomputer, the Sigma2 Olivia system, for training runs. The Olivia system is an HPE Cray Supercomputing EX system with 448 GPUs and 64,512 CPU cores. It uses a 5.3 PB Cray ClusterStor E1000 storage system.
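
The talk names the pipeline stages but not how they are implemented. The sketch below shows one plausible shape for a text-only version, under stated assumptions: Unicode NFC normalization, whitespace collapsing, exact-match deduplication by SHA-256, and a minimum-length check are all illustrative choices, and `clean`, `run_pipeline`, and `min_chars` are hypothetical names. Normalization runs before hashing here so that visually identical documents deduplicate reliably.

```python
import hashlib
import unicodedata
from typing import Iterable, Iterator


def clean(text: str) -> str:
    """Normalize Unicode to NFC and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def run_pipeline(raw_docs: Iterable[str], min_chars: int = 20) -> Iterator[str]:
    """Yield cleaned, validated, de-duplicated documents."""
    seen: set[str] = set()
    for raw in raw_docs:                      # ingestion
        text = clean(raw)                     # cleaning + format normalization
        if len(text) < min_chars:             # validation (assumed length rule)
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:                    # exact deduplication
            continue
        seen.add(digest)
        yield text                            # prepared record


docs = [
    "Nasjonalbiblioteket  digitaliserer\nbøker og aviser.",
    "Nasjonalbiblioteket digitaliserer bøker og aviser.",  # duplicate after cleaning
    "kort",                                                 # too short, dropped
]
prepared = list(run_pipeline(docs))
print(len(prepared))  # 1
```

A production pipeline at this scale would likely need near-duplicate detection and OCR-quality filtering as well; exact hashing is used here only because it is simple to verify.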

A significant challenge has been reconciling the needs of two very different storage systems. The 60 PB preservation system is designed for infrequent access: it is optimized for durability and cost rather than fast I/O, and it has high read latency. The AI pipeline storage, by contrast, is designed for high-throughput, low-latency, parallel I/O. Husnes noted that no one was discussing the issues involved in moving PB-scale datasets from an archive to and through an AI data pipeline system. His team had to figure out how to do it themselves.
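
To see why archive-to-pipeline throughput matters at this scale, a back-of-the-envelope estimate helps. The sketch below computes how long it takes to move a dataset at a given sustained rate. The 2 PB figure is the flash tier's capacity from the talk; the throughput values are purely hypothetical, since the article gives no transfer rates, and `transfer_days` is an illustrative helper.

```python
PB = 10**15  # decimal petabyte, in bytes
GB = 10**9   # decimal gigabyte, in bytes
SECONDS_PER_DAY = 86_400


def transfer_days(dataset_bytes: float, throughput_bytes_per_s: float) -> float:
    """Days needed to move `dataset_bytes` at a sustained throughput."""
    return dataset_bytes / throughput_bytes_per_s / SECONDS_PER_DAY


# Hypothetical sustained rates; real archive reads are often burstier and slower.
for rate_gb_s in (1, 10, 50):
    days = transfer_days(2 * PB, rate_gb_s * GB)
    print(f"{rate_gb_s:>2} GB/s -> {days:.1f} days")
```

Under these assumptions, filling 2 PB takes about 23 days at 1 GB/s and about 2.3 days at 10 GB/s, which illustrates how a high-latency archive can dominate the schedule.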

Image 3: Husnes - preservation and AI pipeline storage.

The LLM training is ongoing, and he concluded his talk with a summary of what his team is still learning:

  • Evaluation - there are no standard evaluation tools to assess a sovereign Norwegian LLM. The language has two written forms, multiple dialects, and historical changes. They are building their own evaluation tool on the fly.
  • Governance - who controls access to a sovereign LLM? Who decides what it can be used for? These are institutional and political questions with no easy answers.
  • Orchestration - making three systems (the preservation archive, the on-prem AI environment, and the national Sigma2 supercomputer) work smoothly together is an ongoing project.

Our takeaways are that Huawei storage is playing a significant role in the European market, and any country developing a sovereign, local language LLM would benefit from consulting with Husnes and understanding the challenges involved.

As Husnes put it, Norway is a small country solving a problem that every non-English-speaking nation will face: how do you build AI that reflects your language, your culture, and your history? AI needs custodians, not just builders.
