Readme

This is a vision language model that is highly competitive with GPT-4’s vision capabilities. Credit to THUDM, original README in below.

GLM-4

Model Introduction

GLM-4-9B is the open-source version of the latest generation of pre-trained models in the GLM-4 series launched by Zhipu AI. In the evaluation of data sets in semantics, mathematics, reasoning, code, and knowledge, GLM-4-9B and its human preference-aligned version GLM-4-9B-Chat have shown superior performance beyond Llama-3-8B. In addition to multi-round conversations, GLM-4-9B-Chat also has advanced features such as web browsing, code execution, custom tool calls (Function Call), and long text reasoning (supporting up to 128K context). This generation of models has added multi-language support, supporting 26 languages including Japanese, Korean, and German. We have also launched the GLM-4-9B-Chat-1M model that supports 1M context length (about 2 million Chinese characters) and the multimodal model GLM-4V-9B based on GLM-4-9B. GLM-4V-9B possesses dialogue capabilities in both Chinese and English at a high resolution of 1120*1120. In various multimodal evaluations, including comprehensive abilities in Chinese and English, perception & reasoning, text recognition, and chart understanding, GLM-4V-9B demonstrates superior performance compared to GPT-4-turbo-2024-04-09, Gemini 1.0 Pro, Qwen-VL-Max, and Claude 3 Opus.

Model List

Model	Type	Seq Length	Download	Online Demo
GLM-4-9B	Base	8K	🤗 Huggingface 🤖 ModelScope 🟣 WiseModel	/
GLM-4-9B-Chat	Chat	128K	🤗 Huggingface 🤖 ModelScope 🟣 WiseModel	🤖 ModelScope CPU 🤖 ModelScope vLLM
GLM-4-9B-Chat-1M	Chat	1M	🤗 Huggingface 🤖 ModelScope 🟣 WiseModel	/
GLM-4V-9B	Chat	8K	🤗 Huggingface 🤖 ModelScope 🟣 WiseModel	🤖 ModelScope

Projects

The following excellent open source repositories have in-depth support for the GLM-4-9B model, and everyone is welcome to expand their learning.

Inference acceleration:

chatglm.cpp: Real-time inference on your laptop accelerated by quantization, similar to llama.cpp.

BenchMark

Typical Tasks

Model	AlignBench	MT-Bench	IFEval	MMLU	C-Eval	GSM8K	MATH	HumanEval	NaturalCodeBench
Llama-3-8B-Instruct	6.40	8.00	68.58	68.4	51.3	79.6	30.0	62.2	24.7
ChatGLM3-6B	5.18	5.50	28.1	66.4	69.0	72.3	25.7	58.5	11.3
GLM-4-9B-Chat	7.01	8.35	69.0	72.4	75.6	79.6	50.6	71.8	32.2

Base Model

Model	MMLU	C-Eval	GPQA	GSM8K	MATH	HumanEval
Llama-3-8B	66.6	51.2	-	45.8	-	33.5
Llama-3-8B-Instruct	68.4	51.3	34.2	79.6	30.0	62.2
ChatGLM3-6B-Base	61.4	69.0	26.8	72.3	25.7	58.5
GLM-4-9B	74.7	77.1	34.3	84.0	30.4	70.1

Since GLM-4-9B adds some math, reasoning, and code-related instruction data during pre-training, Llama-3-8B-Instruct is also included in the comparison range.

Long Context

The needle-in-the-haystack experiment was conducted with a context length of 1M, and the results are as follows:

Multi Language

The tests for GLM-4-9B-Chat and Llama-3-8B-Instruct are conducted on six multilingual datasets. The test results and the corresponding languages selected for each dataset are shown in the table below:

Dataset	Llama-3-8B-Instruct	GLM-4-9B-Chat	Languages
M-MMLU	49.6	56.6	all
FLORES	25.0	28.8	ru, es, de, fr, it, pt, pl, ja, nl, ar, tr, cs, vi, fa, hu, el, ro, sv, uk, fi, ko, da, bg, no
MGSM	54.0	65.3	zh, en, bn, de, es, fr, ja, ru, sw, te, th
XWinograd	61.7	73.1	zh, en, fr, jp, ru, pt
XStoryCloze	84.7	90.7	zh, en, ar, es, eu, hi, id, my, ru, sw, te
XCOPA	73.3	80.1	zh, et, ht, id, it, qu, sw, ta, th, tr, vi

Function Call

Tested on Berkeley Function Calling Leaderboard.

Model	Overall Acc.	AST Summary	Exec Summary	Relevance
Llama-3-8B-Instruct	58.88	59.25	70.01	45.83
gpt-4-turbo-2024-04-09	81.24	82.14	78.61	88.75
ChatGLM3-6B	57.88	62.18	69.78	5.42
GLM-4-9B-Chat	81.00	80.26	84.40	87.92

GLM-4V-9B is a multimodal language model with visual understanding capabilities. The evaluation results of its related classic tasks are as follows:

	MMBench-EN-Test	MMBench-CN-Test	SEEDBench_IMG	MMStar	MMMU	MME	HallusionBench	AI2D	OCRBench
gpt-4o-2024-05-13	83.4	82.1	77.1	63.9	69.2	2310.3	55	84.6	736
gpt-4-turbo-2024-04-09	81.0	80.2	73.0	56.0	61.7	2070.2	43.9	78.6	656
gpt-4-1106-preview	77.0	74.4	72.3	49.7	53.8	1771.5	46.5	75.9	516
InternVL-Chat-V1.5	82.3	80.7	75.2	57.1	46.8	2189.6	47.4	80.6	720
LLaVA-Next-Yi-34B	81.1	79	75.7	51.6	48.8	2050.2	34.8	78.9	574
Step-1V	80.7	79.9	70.3	50.0	49.9	2206.4	48.4	79.2	625
MiniCPM-Llama3-V2.5	77.6	73.8	72.3	51.8	45.8	2024.6	42.4	78.4	725
Qwen-VL-Max	77.6	75.7	72.7	49.5	52	2281.7	41.2	75.7	684
Gemini 1.0 Pro	73.6	74.3	70.7	38.6	49	2148.9	45.7	72.9	680
Claude 3 Opus	63.3	59.2	64	45.7	54.9	1586.8	37.8	70.6	694
GLM-4V-9B	81.1	79.4	76.8	58.7	47.2	2163.8	46.6	81.1	786

Quick call

For hardware configuration and system requirements, please check here.

Use the following method to quickly call the GLM-4-9B-Chat language model

Use the transformers backend for inference:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat", trust_remote_code=True)

query = "你好"

inputs = tokenizer.apply_chat_template([{"role": "user", "content": query}],
                                       add_generation_prompt=True,
                                       tokenize=True,
                                       return_tensors="pt",
                                       return_dict=True
                                       )

inputs = inputs.to(device)
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to(device).eval()

gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Use the vLLM backend for inference:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# GLM-4-9B-Chat
# If you encounter OOM, you can try to reduce max_model_len or increase tp_size
max_model_len, tp_size = 131072, 1
model_name = "THUDM/glm-4-9b-chat"
prompt = [{"role": "user", "content": "你好"}]

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    # if you encounter OOM in GLM-4-9B-Chat-1M, you can try to enable the following parameters
    # enable_chunked_prefill=True,
    # max_num_batched_tokens=8192
)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)

inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)

Use the following method to quickly call the GLM-4V-9B multimodal model

Use the transformers backend for inference:

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4v-9b", trust_remote_code=True)

query = 'display this image'
image = Image.open("your image").convert('RGB')
inputs = tokenizer.apply_chat_template([{"role": "user", "image": image, "content": query}],
                                       add_generation_prompt=True, tokenize=True, return_tensors="pt",
                                       return_dict=True)  # chat mode

inputs = inputs.to(device)
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4v-9b",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to(device).eval()

gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))

Note: GLM-4V-9B does not support calling using vLLM method yet.

Complete project list

If you want to learn more about the GLM-4-9B series open source models, this open source repository provides developers with basic GLM-4-9B usage and development code through the following content

basic_demo: Contains
Interaction code using transformers and vLLM backend
OpenAI API backend interaction code
Batch reasoning code
composite_demo: Contains
Fully functional demonstration code for GLM-4-9B and GLM-4V-9B open source models, including All Tools capabilities, long document interpretation, and multimodal capabilities.
fintune_demo: Contains
PEFT (LORA, P-Tuning) fine-tuning code
SFT fine-tuning code

Friendly Links

LLaMA-Factory: Efficient open-source fine-tuning framework, already supports GLM-4-9B-Chat language model fine-tuning.
SWIFT: LLM/VLM training framework from ModelScope, supports GLM4-9B-Chat/GLM4v-9b-chat fine-tuning.
Xorbits Inference: Performance-enhanced and comprehensive global inference framework, easily deploy your own models or import cutting-edge open source models with one click.
self-llm: Datawhale’s self-llm project, which includes the GLM-4-9B open source model cookbook.

License

The use of GLM-4 model weights must follow the Model License.
The code in this open source repository follows the Apache 2.0 license.

Please strictly follow the open source license.

Reference

If you find our work helpful, please consider citing the following paper.

@inproceedings{zeng2022glm,
  title={{GLM-130B:} An Open Bilingual Pre-trained Model},
  author={Zeng, Aohan and Liu, Xiao and Du, Zhengxiao and Wang, Zihan and Lai, Hanyu and Ding, Ming and Yang, Zhuoyi and Xu, Yifan and Zheng, Wendi and Xia, Xiao and others},
  booktitle={The Eleventh International Conference on Learning Representations,
                  {ICLR} 2023, Kigali, Rwanda, May 1-5, 2023},
  year= {2023},
}

@inproceedings{du2022glm,
  title={GLM: General Language Model Pretraining with Autoregressive Blank Infilling},
  author={Du, Zhengxiao and Qian, Yujie and Liu, Xiao and Ding, Ming and Qiu, Jiezhong and Yang, Zhilin and Tang, Jie},
  booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={320--335},
  year={2022}
}

@misc{wang2023cogvlm,
      title={CogVLM: Visual Expert for Pretrained Language Models}, 
      author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2023},
      eprint={2311.03079},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Run time and cost