Norway's 2 petabytes of Huawei flash storage and LLM training

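TL;DR · AI Summary
The National Library of Norway is developing a large language model (LLM) that understands Norwegian, and is using 2 PB of Huawei OceanStor Dorado flash storage to support its AI training data pipeline.
Key Takeaways
- The National Library of Norway uses 2 PB of Huawei OceanStor Dorado flash storage for LLM training.
- The library holds Norway's largest digital collection, including books, newspapers, and web pages.
- The main challenges during training include data quality and compatibility between storage systems.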
The National Library of Norway is developing a large language model (LLM) that understands the Norwegian language and is utilizing 2 PB of Huawei OceanStor Dorado flash storage in its AI training data pipeline.

Marius Husnes.
Marius Husnes, Head of IT Platform at the library (Nasjonalbiblioteket), discussed the project at Huawei’s ID Forum 2026 in Paris, stating that no commercial LLM provider was developing a Norwegian-language LLM. He argued that any country with its own language but no sovereign LLM trained in that language is at a disadvantage, because a globally trained, English-centric LLM lacks knowledge of the country’s history, news, and culture as described in the local language.
The Norwegian Ministry of Culture tasked the National Library with building a sovereign AI (LLM) because the library holds the country’s largest digital collection of Norwegian books, newspapers, web pages, and more. Like many state libraries, it is entitled to receive a copy of every published book and of broadcast content. Its legal deposit mandate extends beyond books, requiring it to collect and preserve all of Norway’s cultural heritage.
An agreement with Norwegian newspapers allowed LLM training on their copyrighted content. Husnes said: "No private company has this."
The library was well-positioned for the task: it has been digitizing its collection since 2005 and has accumulated 20 PB of unique data, stored in a 3-2-1 configuration (3 copies, 2 media types, 1 off-site) for a total of approximately 60 PB. Digitizing raw text, sound, moving pictures, still images, and web content involved extensive OCR scanning and generated a significant amount of metadata, and the library provides APIs for online access.
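The 3-2-1 figures can be checked with a few lines of arithmetic. The sketch below is illustrative only: the `Copy` record, the helper names, and the assignment of copies to disk or tape are assumptions made for this example, not the library's actual inventory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    medium: str     # "disk" or "tape"; which copy sits where is assumed
    offsite: bool   # True if the copy is held at a second location
    size_pb: float  # capacity consumed by this copy, in petabytes


def satisfies_3_2_1(copies: list[Copy]) -> bool:
    """True if there are >= 3 copies, on >= 2 media types, with >= 1 off-site."""
    media = {c.medium for c in copies}
    return len(copies) >= 3 and len(media) >= 2 and any(c.offsite for c in copies)


def total_footprint_pb(copies: list[Copy]) -> float:
    """Total capacity consumed by all copies together."""
    return sum(c.size_pb for c in copies)


UNIQUE_PB = 20.0  # unique data, as stated in the article
copies = [
    Copy("disk", False, UNIQUE_PB),
    Copy("tape", False, UNIQUE_PB),
    Copy("tape", True, UNIQUE_PB),
]

print(satisfies_3_2_1(copies))     # True
print(total_footprint_pb(copies))  # 60.0
```

Three full copies of 20 PB give the 60 PB total; dropping any one copy from this layout would break the "3 copies" rule.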
Most of the data was stored in a preservation system, a digital disk-plus-tape archive. Husnes’ task was to transfer this data to the LLM training system. He noted that the bottleneck was not compute power but data quality, cleaning, and pipeline throughput. There were two main processing stages. The first was in-house computation using an Nvidia DGX H200 system, a 384-core CPU cluster, and multiple Huawei OceanStor Dorado all-flash arrays totaling 2 PB of flash capacity. This storage provided low-latency access for the data pipelines and training preparation.

Husnes - training national LLM.
The pipeline included data ingestion, cleaning, deduplication, format normalization, validation, and preparation steps. Once the data had passed through the pipeline, it was sent on to the second stage: training runs on Norway’s national supercomputer, the Sigma2 Olivia system. Olivia is an HPE Cray Supercomputing EX system with 448 GPUs and 64,512 CPU cores. It uses a 5.3 PB Cray ClusterStor E1000 storage system.
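As a rough illustration of what such a pipeline does, here is a minimal single-process sketch of the cleaning, normalization, validation, and deduplication steps. It is not the library's implementation: the function names, the thresholds (`min_chars`, the 60% letter ratio), and the choice of exact-hash deduplication are all assumptions for this example. Normalization runs before deduplication here so that trivially different copies hash to the same value.

```python
import hashlib
import re
import unicodedata
from typing import Iterable, Iterator


def clean(text: str) -> str:
    """Cleaning: drop control characters, keeping newlines and tabs."""
    return "".join(
        ch for ch in text if unicodedata.category(ch) != "Cc" or ch in "\n\t"
    )


def normalize(text: str) -> str:
    """Format normalization: NFC Unicode form and collapsed whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def is_valid(text: str, min_chars: int = 20) -> bool:
    """Validation: keep documents that are long enough and mostly letters."""
    if len(text) < min_chars:
        return False
    letters = sum(ch.isalpha() for ch in text)
    return letters / len(text) >= 0.6  # assumed threshold


def run_pipeline(docs: Iterable[str]) -> Iterator[str]:
    """Yield cleaned, validated documents, skipping exact duplicates."""
    seen: set[str] = set()
    for raw in docs:  # ingestion
        text = normalize(clean(raw))
        if not is_valid(text):
            continue
        digest = hashlib.sha256(text.casefold().encode("utf-8")).hexdigest()
        if digest in seen:  # exact deduplication, case-insensitive
            continue
        seen.add(digest)
        yield text


docs = [
    "Nasjonalbiblioteket  samler inn norsk kulturarv.",
    "nasjonalbiblioteket samler inn norsk kulturarv.\x00",  # duplicate after cleanup
    "12345 67890 ---- ####",                                # fails the letter ratio
    "kort",                                                 # too short
]
result = list(run_pipeline(docs))
print(result)  # ['Nasjonalbiblioteket samler inn norsk kulturarv.']
```

At petabyte scale the same stages would need to run in parallel across many workers, and near-duplicate detection would likely replace exact hashing, but the order of concerns stays the same.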
A significant challenge has been meeting the needs of two very different storage systems. The 60 PB preservation system is optimized for durability and cost rather than fast I/O; it is designed for infrequent access and has high read latency. The AI pipeline storage is designed for high-throughput, low-latency, parallel I/O. Husnes noted that no one was discussing the issues involved in moving PB-scale datasets from an archive to and through an AI data pipeline system. His team had to figure out how to do it themselves.
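A back-of-envelope calculation shows why the archive's read rate matters so much. The sketch below estimates how long it takes to move a dataset at a sustained throughput; the rates used are hypothetical and were not quoted in the talk, and the 2 PB figure is simply the flash capacity mentioned above, used as an example dataset size.

```python
def transfer_days(dataset_pb: float, throughput_gb_per_s: float) -> float:
    """Days needed to move dataset_pb petabytes at a sustained rate (decimal units)."""
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput must be positive")
    gigabytes = dataset_pb * 1_000_000  # 1 PB = 1,000,000 GB
    seconds = gigabytes / throughput_gb_per_s
    return seconds / 86_400  # seconds per day


# Hypothetical sustained read rates from the archive, in GB/s.
for rate in (1.0, 10.0, 50.0):
    print(f"{rate:5.1f} GB/s -> {transfer_days(2.0, rate):6.2f} days for 2 PB")
```

At a sustained 1 GB/s, filling 2 PB takes roughly 23 days; at 10 GB/s it takes a little over 2 days. A system built for infrequent access can therefore become the limiting factor long before the GPUs do.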

Husnes - preservation and AI pipeline storage.
The LLM training is ongoing, and Husnes concluded his talk with a summary of what his team is still learning:
- Evaluation - there are no standard tools for evaluating a sovereign Norwegian LLM. The language has two written forms and multiple dialects, and it has changed over time. The team is building its own evaluation tool on the fly.
- Governance - who controls access to a sovereign LLM? Who decides what it can be used for? These are institutional and political questions with no easy answers.
- Orchestration - making three systems work smoothly together (preservation archive, on-prem AI environment, and national Sigma2 supercomputer) is an ongoing project.
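On the evaluation point, one way a home-grown tool might report results is to score each written form separately, so that a weakness in one is not hidden by the other. The sketch below is a hypothetical illustration: the function names and the toy records are invented, and the keys `nb` and `nn` are the ISO 639-1 codes for Bokmål and Nynorsk. It is not the team's evaluation method.

```python
from collections import defaultdict
from typing import Iterable


def accuracy_by_variety(results: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Map each language variety to the fraction of correct answers."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for variety, ok in results:
        totals[variety] += 1
        correct[variety] += int(ok)
    return {v: correct[v] / totals[v] for v in totals}


def macro_average(per_variety: dict[str, float]) -> float:
    """Unweighted mean, so each variety counts equally regardless of sample size."""
    return sum(per_variety.values()) / len(per_variety)


# Toy records for illustration only, not real evaluation results.
toy = [
    ("nb", True), ("nb", True), ("nb", True), ("nb", False),
    ("nn", True), ("nn", False),
]
scores = accuracy_by_variety(toy)
print(scores)                 # {'nb': 0.75, 'nn': 0.5}
print(macro_average(scores))  # 0.625
```

A single pooled accuracy on this toy data would be 4 out of 6, which hides the gap between the two forms; the per-variety breakdown makes it visible.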
Our takeaways: Huawei storage is playing a significant role in the European market, and any country developing a sovereign, local-language LLM would benefit from consulting Husnes to understand the challenges involved.
As Husnes put it, Norway is a small country solving a problem that every non-English-speaking nation will face: how do you build AI that reflects your language, your culture, and your history? AI needs custodians, not just builders.