LLM
Large Language Models (LLMs) are artificial-intelligence models that use machine-learning techniques to understand and generate human language.
LLMs are trained on vast amounts of data so that they can understand and generate text and other forms of content much as a human would. These models can draw inferences from context, produce coherent and contextually relevant responses, summarize text, answer questions (both general conversation and FAQ-style), and even assist with creative writing or code-generation tasks.
LLMs are built on neural-network models and typically apply natural language processing (NLP) techniques to process input and compute their output.
Benchmarks
- HumanEval - HumanEval is a dataset for evaluating the performance of code-generation models, released by OpenAI in 2021. It contains 164 hand-written programming problems, each comprising a function signature, a docstring, a function body, and several unit tests. The problems cover language understanding, reasoning, algorithms, and simple mathematics, and vary in difficulty; some are comparable to easy software-interview questions. An important feature of the dataset is that it measures functional correctness rather than mere syntactic correctness: generated code counts as correct only if it passes all of the associated unit tests. This is much closer to real programming, where code must not only parse but also perform its intended task. Results are reported as pass@k: the proportion of problems for which at least one of k answers generated by the model is correct. For example, Pass@10 is the fraction of problems solved by at least one of 10 generated answers. The figures collected here cover Pass@1, Pass@10, and Pass@100; a minimal sketch of the standard pass@k estimator appears after this list.
- MBPP - MBPP (Mostly Basic Programming Problems) is a dataset of 974 short Python programming problems, released by Google in 2021 and aimed mainly at entry-level programmers. Each problem includes a natural-language description of the program and test cases for checking functional correctness. Results are likewise reported as pass@k.
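To make pass@k concrete, below is a minimal sketch of the unbiased estimator introduced alongside HumanEval (Chen et al., 2021): given n generated samples for a problem, c of which pass the unit tests, it estimates the probability that at least one of k samples is correct. The function name and the example numbers are illustrative, not taken from this page.

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n generations of which c
    are correct, passes all unit tests. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a correct one.
        return 1.0
    # Numerically stable product form of 1 - C(n-c, k) / C(n, k).
    prob_all_fail = 1.0
    for i in range(n - c + 1, n + 1):
        prob_all_fail *= 1.0 - k / i
    return 1.0 - prob_all_fail

# Hypothetical example: 200 samples per problem, 37 of which pass the tests.
print(pass_at_k(n=200, c=37, k=1))   # 0.185, i.e. c/n for k=1
print(pass_at_k(n=200, c=37, k=10))  # ≈ 0.88
```

Pass@1 reduces to c/n; as k grows the model gets credit for solving a problem in any of its k attempts, which is why Pass@10 and Pass@100 are typically well above Pass@1 for the same model.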
| Model | Params (×100M) | HumanEval Pass@1 | MBPP Pass@1 | Publisher | Open source |
|---|---|---|---|---|---|
| Claude 3.5 Sonnet | / | 92.0 | / | Anthropic | |
| GPT-4o | / | 90.2 | / | OpenAI | |
| Qwen2-72B-Instruct | 720.0 | 86.0 | 80.2 | Alibaba | * |
| GPT-4 | 1750.0 | 85.4 | 83.5 | OpenAI | |
| Claude3-Opus | / | 84.9 | / | Anthropic | |
| Llama3-400B-Instruct-InTraining | 4000.0 | 84.1 | / | Meta | * |
| CodeQwen1.5-7B-Chat | 70.0 | 83.5 | 77.7 | Alibaba | * |
| Llama3-70B | 700.0 | 81.7 | / | Meta | * |
| Llama3-70B-Instruct | 700.0 | 81.7 | / | Meta | * |
| DeepSeek Coder-33B Instruct | 330.0 | 79.3 | 70.0 | DeepSeek-AI | * |
| Claude3-Haiku | / | 75.9 | / | Anthropic | |
| Gemini-ultra | / | 74.4 | / | Google DeepMind | |
| Grok-1.5 | / | 74.1 | / | xAI | |
| DeepSeek-V2-236B-Chat | 2360.0 | 73.8 | 61.4 | DeepSeek-AI | * |
| WizardCoder-Python-34B | 340.0 | 73.2 | / | WizardLM Team | * |
| Claude3-Sonnet | / | 73.0 | / | Anthropic | |
| GLM4 | / | 72.0 | / | Zhipu AI | |
| Gemini 1.5 Pro | / | 71.9 | / | Google DeepMind | |
| GLM-4-9B-Chat | 90.0 | 71.8 | / | Zhipu AI | * |
| DBRX Instruct | 1320.0 | 70.1 | / | Databricks | * |
| GLM-4-9B | 90.0 | 70.1 | / | Zhipu AI | * |
| Phind-CodeLlama-34B-Python-v1 | 340.0 | 69.5 | / | Phind | * |
| Gemini-pro | 1000.0 | 67.7 | / | Google DeepMind | |
| Phind-CodeLlama-34B-v1 | 340.0 | 67.6 | / | Phind | * |
| DeepSeek Coder-6.7B Instruct | 67.0 | 66.1 | 65.4 | DeepSeek-AI | * |
| Qwen2-72B | 727.0 | 64.6 | 76.9 | Alibaba | * |
| WizardCoder-Python-13B-V1.0 | 130.0 | 64.0 | 54.6 | WizardLM Team | * |
| Grok-1 | 3140.0 | 63.2 | / | xAI | * |
| Llama3-8B | 80.0 | 62.2 | / | Meta | * |
| Llama3-8B-Instruct | 80.0 | 62.2 | / | Meta | * |
| PanGu-Coder2 | 150.0 | 61.64 | / | / | / |
| Codestral | 220.0 | 61.5 | 78.2 | / | / |
| Phi-3-small 7B | 70.0 | 59.1 | 71.4 | / | / |
| Phi-3-mini 3.8B | 38.0 | 58.5 | 70.0 | / | / |
| WizardCoder-15B-V1.0 | 150.0 | 57.3 | / | / | / |
| CodeGemma-7B-IT | 70.0 | 56.1 | 54.2 | / | / |
| Phi-3-medium 14B-preview | 140.0 | 55.5 | 74.4 | / | / |
| MiniCPM-MoE-8x2B | 136.0 | 55.49 | 41.68 | / | / |
| CodeLLaMA-Python-34B | 340.0 | 53.7 | 56.2 | / | / |
| YAYI2-30B | 300.0 | 53.1 | 45.8 | / | / |
| Qwen2-57B-A14B | 570.0 | 53.0 | 71.9 | / | / |
| Qwen1.5-110B | 1100.0 | 52.4 | 58.1 | / | / |
| CodeQwen1.5-7B | 70.0 | 51.8 | 72.2 | / | / |
| Qwen2-7B | 70.0 | 51.2 | 65.9 | / | / |
| Phi-1 | 13.0 | 50.6 | 55.5 | / | / |
| MiniCPM-2B-DPO | 24.0 | 50.0 | 47.31 | / | / |
| CodeLLaMA-34B | 340.0 | 48.8 | 55.0 | / | / |
| Phi-2 | 27.0 | 48.3 | 59.1 | / | / |
| GPT-3.5 | 1750.0 | 48.1 | 52.2 | / | / |
| Yi-1.5-34B | 340.0 | 46.3 | 65.5 | / | / |
| Mixtral-8×22B-MoE | 1410.0 | 45.1 | 71.2 | / | / |
| CodeGemma-7B | 70.0 | 44.5 | 56.2 | / | / |
| CodeLLaMA-Python-13B | 130.0 | 43.3 | 49.0 | / | / |
| CodeLLaMA-Instruct-13B | 130.0 | 42.7 | 49.4 | / | / |
| CodeLLaMA-Instruct-34B | 340.0 | 41.5 | 57.0 | / | / |
| Qwen1.5-72B-Chat | 720.0 | 41.5 | 53.4 | / | / |
| Yi-1.5-9B | 90.0 | 41.4 | 61.1 | / | / |
| DeepSeek-V2-236B | 2360.0 | 40.9 | 66.6 | / | / |
| Mixtral-8×7B-MoE | 450.0 | 40.2 | 60.7 | / | / |
| Gemma 2 - 9B | 90.0 | 40.2 | 52.4 | / | / |
| Grok-0 | 330.0 | 39.7 | / | / | / |
| Yi-9B | 90.0 | 39.0 | 54.4 | / | / |
| CodeLLaMA-Python-7B | 70.0 | 38.4 | 47.6 | / | / |
| WizardLM-30B-V1 | 300.0 | 37.8 | / | / | / |
| PaLM2-S | / | 37.6 | 50.0 | / | / |
| Qwen1.5-32B | 320.0 | 37.2 | 49.4 | / | / |
| CodeLLaMA-13B | 130.0 | 36.0 | 47.0 | / | / |
| CodeGeeX2-6B | 60.0 | 35.9 | / | / | / |
| PaLM-Coder | 5400.0 | 35.9 | 47.0 | / | / |
| Aquila2-34B | 340.0 | 35.4 | / | / | / |
| Qwen-72B | 720.0 | 35.4 | 52.2 | / | / |
| Stable LM Zephyr 3B | 30.0 | 35.37 | 31.85 | / | / |
| CodeLLaMA-Instruct-7B | 70.0 | 34.8 | 44.4 | / | / |
| WizardCoder-3B-V1.0 | 30.0 | 34.8 | 37.4 | / | / |
| Qwen1.5-MoE-A2.7B | 143.0 | 34.2 | / | / | / |
| Phi-1.5 | 13.0 | 34.1 | 37.7 | / | / |
| StarCoder | 155.0 | 33.6 | 52.7 | / | / |
| CodeLLaMA-7B | 70.0 | 33.5 | 41.4 | / | / |
| Qwen-14B | 140.0 | 32.3 | 40.8 | / | / |
| Gemma 7B | 70.0 | 32.3 | 44.4 | / | / |
| Qwen2-1.5B | 15.0 | 31.1 | 37.4 | / | / |
| LLaMA2 70B | 700.0 | 30.5 | 45.4 | / | / |
| Mistral 7B | 73.0 | 30.5 | 47.5 | / | / |
| StarCoderBase | 155.0 | 30.4 | 49.0 | / | / |
| Qwen-7B | 70.0 | 29.9 | 31.6 | / | / |
| XVERSE-MoE-A4.2B | 258.0 | 29.9 | / | / | / |
| Codex | 1750.0 | 28.81 | / | / | / |
| AquilaCode-7B-py | 70.0 | 28.8 | / | / | / |
| XVERSE-65B | 650.0 | 26.8 | / | / | / |
| PaLM | 5400.0 | 26.2 | 47.0 | / | / |
| WizardCoder-1B-V1.0 | 10.0 | 23.8 | 28.6 | / | / |
| CodeGeeX | 130.0 | 22.9 | / | / | / |
| LLaMA2 34B | 340.0 | 22.6 | 33.8 | / | / |
| AquilaCode-7B-multi | 70.0 | 22.0 | / | / | / |
| Gemma 2B | 20.0 | 22.0 | 29.2 | / | / |
| Gemma 2B - It | 20.0 | 22.0 | 29.2 | / | / |
| CodeGemma-2B | 20.0 | 22.0 | 29.2 | / | / |
| Qwen2-0.5B | 4.0 | 22.0 | 22.0 | / | / |
| RecurrentGemma-2B | 27.0 | 21.3 | 28.8 | / | / |
| LLaMA2 13B | 130.0 | 20.1 | 27.6 | / | / |
| Baichuan2-7B-Base | 70.0 | 18.29 | 24.2 | / | / |
| Baichuan2-13B-Base | 130.0 | 17.07 | 30.2 | / | / |
| Qwen-1.8B | 18.0 | 15.2 | / | / | / |
| LLaMA2 7B | 70.0 | 12.2 | 20.8 | / | / |
| Baichuan 13B - Base | 130.0 | 11.59 | 22.9 | / | / |
| Baichuan 7B | 70.0 | 9.2 | 6.6 | / | / |
| TinyLlama | 11.0 | 6.71 | 19.91 | / | / |
| Mistral Large | / | 4.1 | 7.1 | / | / |

(* = open-source weights released; blank = closed; / = not reported on this page. Parameter counts are given in units of 100 million, e.g. 700.0 = 70B.)