Wolfram LLM Benchmarking Project

Mike Notes

Gary Marcus mentioned this project in his newsletter today.

Resources

Wolfram LLM Benchmarking Project

Using Wolfram Language to benchmark the performance of major LLMs.

As major users and analyzers of large language model (LLM) technology, we've been continually tracking their performance. This project releases our ongoing results, initially for a specific, well-characterized code generation task.

The task consists of going from English-language specifications to Wolfram Language code. The test cases are exercises from Stephen Wolfram's An Elementary Introduction to the Wolfram Language. These exercises have been done online by millions of humans, and we've developed effective tools for determining the functional correctness of code, which we're now applying to LLMs.
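
For concreteness, the two metrics reported in the table below can be checked along these lines in Wolfram Language. This is a minimal sketch with a hypothetical exercise and model response, not the project's actual harness:

    (* Hypothetical exercise and LLM answer, for illustration only *)
    spec      = "Make a list of the first 10 squares.";
    candidate = "Table[n^2, {n, 10}]";   (* code string returned by an LLM *)
    reference = Table[n^2, {n, 10}];     (* known-good solution *)

    SyntaxQ[candidate]                     (* syntactic correctness: True *)
    ToExpression[candidate] === reference  (* functional correctness: True *)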

Raw Data (the table is sortable on the original page)

Vendor         Model                               Correct Syntax  Correct Functionality
OpenAI         gpt-4                               99.8%           49.7%
OpenAI         gpt-4 turbo                         99.8%           46.2%
OpenAI         gpt-4o                              100.0%          46.2%
Anthropic      claude3-opus                        99.4%           44.4%
Anthropic      claude3.5-sonnet                    99.7%           43.7%
Google         gemini-1.5-pro-001                  99.0%           40.8%
Meta           Llama-3-70B-instruct                99.7%           39.6%
Google         text-unicorn-001                    99.6%           39.5%
OpenAI         gpt-3.5 turbo                       99.0%           38.5%
Mistral AI     mistral-large                       98.4%           38.2%
Mistral AI     open-mixtral-8x22B                  98.2%           36.1%
Meta           codellama-34b                       99.7%           36.1%
Mistral AI     codestral                           97.5%           34.4%
Google         gemini-1.5-flash-001                98.5%           33.8%
Meta           codellama-13b                       98.7%           30.0%
Mistral AI     mistral-small                       97.5%           29.9%
Anthropic      claude2.1                           96.6%           28.5%
Anthropic      claude2                             87.3%           28.2%
Anthropic      claude3-sonnet                      98.7%           27.8%
Mistral AI     mistral-medium                      88.4%           27.6%
DeepSeek       deepseek-coder-7b                   92.1%           27.3%
DeepSeek       deepseek-coder-33b                  92.8%           26.2%
Meta           codellama-7b                        97.2%           26.0%
IBM            granite-8B                          93.4%           25.4%
Google         text-bison-002                      98.1%           25.1%
Anthropic      claude3-haiku                       98.4%           24.8%
Google         code-bison-002                      97.8%           24.6%
Google         code-gecko-002                      98.1%           24.5%
Google         gemini-1.0-pro-002                  94.5%           24.2%
DeepSeek       deepseek-coder-6.7b                 89.1%           22.8%
Meta           Llama-3-8B-instruct                 97.0%           22.1%
Google         code-bison-001                      95.8%           19.1%
Nous Research  Nous-Hermes-2-Mixtral-8x7B-DPO      79.9%           15.6%
OpenChat       openchat3.5                         88.2%           15.2%
Microsoft      Phi-3 mini                          87.1%           14.1%
Aleph Alpha    luminous-supreme                    81.8%           13.2%
Aleph Alpha    luminous-supreme-control-20230501   86.8%           10.9%
Meta           Llama2-13b                          91.4%           10.7%
DeepSeek       deepseek-coder-1.3b                 64.3%           10.7%
Aleph Alpha    luminous-extended                   77.7%           10.0%
Aleph Alpha    luminous-supreme-control-20240215   56.3%           9.1%
Mistral AI     mistral-tiny                        78.5%           8.2%
Aleph Alpha    luminous-extended-control-20240215  69.2%           7.5%
Aleph Alpha    luminous-base                       62.4%           7.0%
Aleph Alpha    luminous-base-control-20240215      76.3%           6.6%
Replit         replit-code-v1_5-3b                 40.5%           4.2%
Meta           llama2-7b                           26.4%           3.7%
Falcon LLM     falcon-7b                           45.3%           3.3%
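
As a worked illustration of how the two columns above could be computed from per-exercise outcomes, here is a minimal Wolfram Language sketch; the sample data and names are hypothetical, not the project's actual pipeline:

    (* Hypothetical per-exercise outcomes: {syntaxOK, functionalOK} per exercise *)
    outcomes = {{True, True}, {True, False}, {True, True}, {False, False}};

    (* Fraction of True flags, as a percentage rounded to one decimal *)
    percent[flags_] := Round[100. Count[flags, True]/Length[flags], 0.1]

    percent[outcomes[[All, 1]]]  (* Correct Syntax: 75. *)
    percent[outcomes[[All, 2]]]  (* Correct Functionality: 50. *)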
