News + Trends

Meta caught gaming AI benchmarks

Samuel Buchmann
9 April 2025
Translation: machine translated

When its latest AI "Llama 4" was released, Meta boasted high scores on a benchmark platform. However, the model only achieves these scores in a special version that is not even publicly available.

The performance of artificial intelligence (AI) models is measured using benchmarks. One of the leading platforms for this is LM Arena. Good results attract attention - as was the case with Meta's new "Llama 4", which the company released over the weekend. However, it has now emerged that Meta gamed the benchmark to make its model look as good as possible. This was reported by the portal "TechCrunch".

In its press release, Meta emphasises the Elo score of 1417 for "Maverick" (the medium-sized model in the Llama 4 family). This very high score means that Maverick often wins direct benchmark duels against competitors. It suggests that Meta's model is ahead of OpenAI's GPT-4o and only just behind the current leader, Google's Gemini 2.5 Pro.

Meanwhile, LM Arena's leaderboard now notes that Meta's second-place model is an experimental version.

Maverick accordingly made big waves in the community. It seemed as if Meta had moved to the front of the pack, after its previous models had always lagged behind. As it now turns out, however, the developers did not submit the publicly available version of Maverick to the LM Arena benchmarks, but an "experimental chat version" - a fact mentioned only in the small print.

The practice contradicts the purpose of benchmarks

Meta's approach does not explicitly break LM Arena's rules - but it does contradict the idea behind the platform. Benchmarks lose their meaning when developers enter specially optimised versions of their models that are not available anywhere, for instance because they come with other drawbacks. The scores then no longer reflect realistic performance and are no longer useful for comparison.

  • Background information

    7 questions you have about DeepSeek (and the answers)

    by Samuel Buchmann

The episode shows how much pressure Meta is under in the AI race, especially now that a second strong open-weight model, the Chinese DeepSeek, is on the market. Before its launch, Llama 4 was reportedly postponed several times because it did not meet internal expectations. In the end, it was released, oddly, on a Saturday (5 April) instead of the following Monday (7 April) as originally planned. Asked why, Meta CEO Mark Zuckerberg replied on Threads: "That's when it was ready."

Header image: Shutterstock




My fingerprint often changes so drastically that my MacBook doesn't recognise it anymore. The reason? If I'm not clinging to a monitor or camera, I'm probably clinging to a rockface by the tips of my fingers.

