On a Sandy Beach: Wolfram LLM Benchmarking Project

Mike's Notes

Gary Marcus mentioned this project in his newsletter today.

Last Updated

18/05/2025

Wolfram LLM Benchmarking Project

By: Wolfram, 15/08/2024

Using Wolfram Language to benchmark the performance of major LLMs.

As major users and analyzers of large language model (LLM) technology, we've been continually tracking the performance of LLMs. This project involves releasing our ongoing results, initially for a specific well-characterized code generation task.

The task consists of going from English-language specifications to Wolfram Language code. The test cases are exercises from Stephen Wolfram's An Elementary Introduction to the Wolfram Language. These exercises have been done online by millions of humans, and we've developed effective tools for determining functional correctness of code, which we're now applying to LLMs.
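To make the two reported measures concrete, here is a minimal Python sketch of how per-exercise outcomes could be tallied into a "Correct Syntax" and a "Correct Functionality" percentage. It is not Wolfram's tooling: the `Outcome` record, the `score` function, the exercise identifiers, and the toy outcomes are all hypothetical, and it assumes each percentage is a simple fraction of all exercises attempted.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    exercise: str   # hypothetical exercise identifier
    parses: bool    # did the generated code parse as valid syntax?
    behaves: bool   # did the code do what the exercise asked for?


def score(outcomes):
    """Return (syntax %, functionality %) over all exercises (assumed definition)."""
    total = len(outcomes)
    syntax_ok = sum(o.parses for o in outcomes)
    # Assumption: code that does not parse cannot count as functionally correct.
    functional_ok = sum(o.parses and o.behaves for o in outcomes)
    return (100.0 * syntax_ok / total, 100.0 * functional_ok / total)


# Made-up outcomes for illustration only; these are not benchmark results.
toy = [
    Outcome("ex-1", True, True),
    Outcome("ex-2", True, False),
    Outcome("ex-3", False, False),
    Outcome("ex-4", True, True),
]

syntax_pct, functional_pct = score(toy)
print(f"Correct Syntax: {syntax_pct:.1f}%  Correct Functionality: {functional_pct:.1f}%")
```

Under these assumptions the functionality figure can never exceed the syntax figure, which matches the pattern in the results.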

Raw Data (sortable on the original page)

Vendor	Model	Correct Syntax	Correct Functionality
OpenAI	gpt-4	99.8%	49.7%
OpenAI	gpt-4 turbo	99.8%	46.2%
OpenAI	gpt-4o	100.0%	46.2%
Anthropic	claude3-opus	99.4%	44.4%
Anthropic	claude3.5-sonnet	99.7%	43.7%
Google	gemini-1.5-pro-001	99.0%	40.8%
Meta	Llama-3-70B-instruct	99.7%	39.6%
Google	text-unicorn-001	99.6%	39.5%
OpenAI	gpt-3.5 turbo	99.0%	38.5%
Mistral AI	mistral-large	98.4%	38.2%
Mistral AI	open-mixtral-8x22B	98.2%	36.1%
Meta	codellama-34b	99.7%	36.1%
Mistral AI	codestral	97.5%	34.4%
Google	gemini-1.5-flash-001	98.5%	33.8%
Meta	codellama-13b	98.7%	30.0%
Mistral AI	mistral-small	97.5%	29.9%
Anthropic	claude2.1	96.6%	28.5%
Anthropic	claude2	87.3%	28.2%
Anthropic	claude3-sonnet	98.7%	27.8%
Mistral AI	mistral-medium	88.4%	27.6%
DeepSeek	deepseek-coder-7b	92.1%	27.3%
DeepSeek	deepseek-coder-33b	92.8%	26.2%
Meta	codellama-7b	97.2%	26.0%
IBM	granite-8B	93.4%	25.4%
Google	text-bison-002	98.1%	25.1%
Anthropic	claude3-haiku	98.4%	24.8%
Google	code-bison-002	97.8%	24.6%
Google	code-gecko-002	98.1%	24.5%
Google	gemini-1.0-pro-002	94.5%	24.2%
DeepSeek	deepseek-coder-6.7b	89.1%	22.8%
Meta	Llama-3-8B-instruct	97.0%	22.1%
Google	code-bison-001	95.8%	19.1%
Nous Research	Nous-Hermes-2-Mixtral-8x7B-DPO	79.9%	15.6%
OpenChat	openchat3.5	88.2%	15.2%
Microsoft	Phi-3 mini	87.1%	14.1%
Aleph Alpha	luminous-supreme	81.8%	13.2%
Aleph Alpha	luminous-supreme-control-20230501	86.8%	10.9%
Meta	Llama2-13b	91.4%	10.7%
DeepSeek	deepseek-coder-1.3b	64.3%	10.7%
Aleph Alpha	luminous-extended	77.7%	10.0%
Aleph Alpha	luminous-supreme-control-20240215	56.3%	9.1%
Mistral AI	mistral-tiny	78.5%	8.2%
Aleph Alpha	luminous-extended-control-20240215	69.2%	7.5%
Aleph Alpha	luminous-base	62.4%	7.0%
Aleph Alpha	luminous-base-control-20240215	76.3%	6.6%
Replit	replit-code-v1_5-3b	40.5%	4.2%
Meta	llama2-7b	26.4%	3.7%
Falcon LLM	falcon-7b	45.3%	3.3%
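
Because the table is only sortable on the original page, here is a small Python sketch for re-sorting a copy of it locally. It assumes the rows have been pasted as tab-separated text in the column order shown above; only a handful of rows are included, and the names `RAW`, `parse_rows`, and `sort_by` are illustrative, not part of the Wolfram project.

```python
# A few rows copied from the table above, tab-separated:
# vendor, model, correct syntax %, correct functionality %.
RAW = """\
OpenAI\tgpt-4\t99.8%\t49.7%
Anthropic\tclaude3-opus\t99.4%\t44.4%
Google\tgemini-1.5-pro-001\t99.0%\t40.8%
Meta\tLlama-3-70B-instruct\t99.7%\t39.6%
Meta\tllama2-7b\t26.4%\t3.7%
"""


def parse_rows(text):
    """Turn tab-separated lines into dicts with numeric percentage fields."""
    rows = []
    for line in text.strip().splitlines():
        vendor, model, syntax, functionality = line.split("\t")
        rows.append({
            "vendor": vendor,
            "model": model,
            "syntax": float(syntax.rstrip("%")),
            "functionality": float(functionality.rstrip("%")),
        })
    return rows


def sort_by(rows, column, descending=True):
    """Return a new list of rows ordered by the given column."""
    return sorted(rows, key=lambda r: r[column], reverse=descending)


rows = parse_rows(RAW)
for r in sort_by(rows, "syntax"):
    print(f"{r['vendor']:<10} {r['model']:<22} {r['syntax']:5.1f} {r['functionality']:5.1f}")
```

Sorting by `"syntax"` instead of `"functionality"` reorders the models, which shows that producing valid syntax and producing working code are tracked separately.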
