TL;DR
Italy’s Minerva project trained an Italian-focused LLM family from scratch on 2.5 trillion tokens, about half of them Italian. The models outperform comparable multilingual models on Italian benchmarks, yet Minerva-3B scored only 4.9% on the INVALSI Italian school-exam benchmark. The result raises the question of how much scale a country-specific language model actually needs.
The Minerva LLM family, led by Sapienza University of Rome and supported by Italy’s national research and supercomputing infrastructure, was trained on a dataset with approximately 50% Italian content and spans models from 350 million to 7 billion parameters. Although Minerva outperforms comparable multilingual models on Italian benchmarks, Minerva-3B’s low INVALSI score indicates that the data and parameter scale reached so far may still be insufficient for deep country-specific knowledge and complex language understanding.
This empirical finding challenges the assumption that simply increasing data and model size will yield better performance on complex, country-specific tasks. It suggests that the European sovereign-LLM movement must confront how much scale such models actually require, which may be more than current projects have implemented, before they can handle nuanced national content and education standards.
Minerva. The opposite path.
Italy spent years building a European sovereign LLM from scratch. Then Minerva-3B scored 4.9% on the INVALSI Italian school exam.
Where AMÁLIA layered Portuguese specialization onto a multilingual foundation, Minerva trained from scratch on 2.5 trillion tokens with approximately 50% Italian content. Where AMÁLIA’s weights are not yet public, Minerva published weights, training data, and code as truly open from day one. By every institutional measure, the Italian approach worked. But the empirical results contain a finding the press coverage has been quiet about, and its implications extend well beyond Italy.
Same problem. Opposite path.
European sovereign-LLM development has two primary architectural approaches. Italy chose training from scratch on a substantial native-language foundation. Portugal chose continued pre-training of a multilingual model. The structural comparison surfaces what each commitment actually requires operationally.
The comparison is not “Italy did it better than Portugal.” Both projects respond to the same structural problem with different architectural strategies under different institutional and economic constraints. Italy’s national-AI investment is structurally larger by an order of magnitude — and Minerva is the visible artifact of that scale.
4.9% on INVALSI. The bitter lesson surfaces.
In June 2024, researchers evaluated Minerva-3B on the INVALSI Italian school-exam benchmark. The result was unambiguous: 4.9%. This is not a critique of Minerva. It is a critique of the public discourse around what Minerva’s empirical results actually demonstrate.
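To make the headline number concrete, here is a minimal sketch of how an exam-style accuracy figure is typically computed. The article does not describe the INVALSI benchmark's item format or scoring rule, so exact-match scoring, the function name `exact_match_accuracy`, and the toy items are all assumptions for illustration only.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of items where the normalized prediction equals the gold answer.

    Assumption: exact-match scoring after trimming whitespace and lowercasing.
    The real benchmark may score answers differently.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty benchmark")
    correct = sum(
        p.strip().lower() == g.strip().lower()
        for p, g in zip(predictions, gold)
    )
    return correct / len(gold)


if __name__ == "__main__":
    # Hypothetical toy items; not drawn from INVALSI.
    gold = ["a", "c", "b", "d"]
    predictions = ["A", "b", "b", "a"]
    print(f"accuracy = {exact_match_accuracy(predictions, gold):.1%}")
```

Under this kind of scoring, a 4.9% result corresponds to roughly 49 correct answers per 1,000 items.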
350M to 7B. Four parameter scales, one architecture.
The Minerva model family covers four parameter tiers, each with specific training corpora. Each scale level reveals what the from-scratch path actually requires at different operating points.
[Table: Minerva model tiers and training corpora. Surviving cells: Italian + English; 100B English; ~50% English; + 200B code.]
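As a rough back-of-the-envelope check on the corpus figures, the sketch below converts the stated totals into an absolute Italian token count. It uses only the two numbers given in the article (2.5 trillion tokens, approximately 50% Italian). The plus or minus 5 point band around the share is an illustrative assumption to reflect the word "approximately", not a reported figure, and the helper names are hypothetical.

```python
# Figures stated in the article.
TOTAL_TOKENS = 2.5e12   # "2.5 trillion tokens"
ITALIAN_SHARE = 0.50    # "approximately 50% Italian content"


def italian_tokens(total_tokens: float, share: float) -> float:
    """Return the Italian token count implied by a corpus size and a share."""
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be between 0 and 1")
    return total_tokens * share


def format_tokens(n: float) -> str:
    """Format a token count in trillions with three decimals."""
    return f"{n / 1e12:.3f}T"


if __name__ == "__main__":
    # The 45% and 55% rows are an assumed sensitivity band, not reported data.
    for share in (0.45, ITALIAN_SHARE, 0.55):
        count = italian_tokens(TOTAL_TOKENS, share)
        print(f"{share:.0%} Italian -> {format_tokens(count)} tokens")
```

At the stated 50% share, the corpus implies about 1.25 trillion Italian tokens.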

Three answers. Same question.
Minerva, AMÁLIA, and OpenEuroLLM represent the three operational answers to the European sovereign-LLM question. Each makes different architectural and institutional bets. The strategic discourse benefits from treating all three as data points in the same empirical experiment.
Three standards the movement should adopt.
The structural critique generalizes beyond Minerva. The European sovereign-LLM movement benefits from internalizing these lessons across every subsequent national project. Italy modeled the openness standard; the movement should adopt it as the norm.
Minerva is one valid answer to the European sovereign-LLM question. AMÁLIA is another. OpenEuroLLM is potentially a third. The strategic discourse benefits from treating all three as data points in the same empirical experiment rather than as competing national-prestige projects. More analysis like this is needed. Not less.
Implications for European Sovereign-LLM Strategies
The results from Minerva underscore that building effective national language models requires substantial investment in scale, beyond current efforts. Despite Italy’s significant resources and a comprehensive institutional framework, the low academic benchmark score indicates that the current scale may still fall short. This finding impacts future policy and research directions, emphasizing that European countries may need to commit to even larger datasets and model sizes to develop truly capable country-specific AI tools.
For policymakers and researchers, this highlights the importance of realistic expectations and resource commitments when pursuing sovereign AI initiatives. It also raises questions about the viability of smaller-scale models for complex language tasks, influencing the broader debate on the optimal approach for national AI development.
Italy’s Long-Standing Investment in Sovereign AI
The Minerva project is part of Italy’s broader strategy to develop a European sovereign AI infrastructure, supported by national funding, supercomputing resources, and academic collaboration. Italy’s approach diverged from models like Portugal’s AMÁLIA, opting for training from scratch on a massive dataset, with the aim of creating a model tailored to Italian language and content. Prior efforts in European sovereign AI have often focused on multilingual or continuation-based models, with mixed results on performance and scalability. Minerva’s development reflects a deliberate choice to prioritize native-language depth through large-scale training, backed by Italy’s institutional and technical resources.
While Minerva outperforms comparable multilingual models on Italian benchmarks and has demonstrated the feasibility of large-scale native-language training, its low score on the INVALSI school-exam benchmark exposes the limits of current methodologies and the challenges of scaling models to meet complex, country-specific needs.
“The Minerva results suggest that the European sovereign-LLM movement needs to confront the reality of scaling requirements for meaningful country-specific AI capabilities.”
— Thorsten Meyer, AI researcher
Unresolved Questions About Scaling and Effectiveness
It remains unclear what scale of data and parameters, or which training techniques, are necessary to produce country-specific models capable of passing academic benchmarks like INVALSI. The current results are preliminary, and ongoing iterations of Minerva may improve performance. Whether these findings generalize to other languages and countries has not yet been established, and further research is needed to determine the optimal investment levels for sovereign AI models.
Next Steps in European Sovereign Language Model Development
The Minerva team plans to continue refining the model, potentially increasing dataset size and model parameters, and exploring different training methodologies. Future evaluations will focus on whether these adjustments can improve performance on complex tasks such as academic assessments. Policymakers and researchers will also likely reassess resource commitments and strategic priorities based on these emerging insights. The broader European community may also consider collaborative scaling efforts or alternative approaches to achieve deeper country-specific AI capabilities.
Key Questions
Why did Minerva perform poorly on the INVALSI exam?
The low score suggests that, despite large-scale training on native-language data, the model lacks sufficient depth in understanding complex academic content, indicating that current scaling efforts may be inadequate for such tasks.
Does this mean European sovereign models are ineffective?
Not necessarily. The results highlight current limitations in scaling and training techniques, but ongoing research may improve future models. They underscore the need for larger datasets and more sophisticated methodologies.
How does Minerva compare to multilingual models?
Minerva outperforms comparable multilingual models on Italian benchmarks, demonstrating the benefits of native-language focus, but still falls short on complex academic tasks, revealing the limits of current scale.
What implications does this have for national AI policies?
It suggests that countries aiming for deep, country-specific AI capabilities should prepare for significant resource investments, including larger datasets and models, to meet complex language understanding requirements.
Source: ThorstenMeyerAI.com