zsxkib / qwen2-7b-instruct

Qwen2: a 7-billion-parameter language model from Alibaba Cloud, fine-tuned for chat completions

Run time and cost

This model runs on Nvidia A40 (Large) GPU hardware.

Qwen2-7B-Instruct on Replicate

This Replicate model provides access to the Qwen2-7B-Instruct model, part of the Qwen2 language model series. It offers three variants:

  • Qwen/Qwen2-7B-Instruct: Full precision model
  • Qwen/Qwen2-7B-Instruct-GPTQ-Int8: 8-bit quantized model
  • Qwen/Qwen2-7B-Instruct-GPTQ-Int4: 4-bit quantized model
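
If you want to run the weights yourself rather than through Replicate, any of these variants can be loaded with Hugging Face transformers. Below is a minimal sketch for the 4-bit variant; it assumes transformers plus the auto-gptq/optimum extras that the quantized checkpoints need, and the full-precision variant loads the same way without them:

```python
# Minimal sketch: loading the 4-bit GPTQ variant with transformers.
# Assumes `transformers`, `optimum`, and `auto-gptq` are installed;
# swap in "Qwen/Qwen2-7B-Instruct" for the full-precision model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-7B-Instruct-GPTQ-Int4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a one-line summary of Qwen2."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=128)
# Decode only the newly generated continuation, not the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```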

Introduction

Qwen2 is the latest series of Qwen large language models, offering both pretrained and instruction-tuned models in five sizes: 0.5B, 1.5B, 7B, 57B-A14B, and 72B. This Replicate implementation focuses on the instruction-tuned 7B Qwen2 model.

Qwen2 demonstrates competitive performance against state-of-the-art open-source and proprietary models across various benchmarks, including language understanding, generation, multilingual capability, coding, mathematics, and reasoning.

Qwen2-7B-Instruct supports a context length of up to 131,072 tokens, enabling the processing of extensive inputs. See the Processing Long Texts section below for detailed instructions on deploying Qwen2 to handle long texts.

For more details about Qwen2, see the Qwen2 Technical Report and the official Qwen GitHub repository.

Model Details

Qwen2 is based on the Transformer architecture and incorporates:

  • SwiGLU activation
  • Attention QKV bias
  • Group query attention
  • An improved tokenizer for multiple natural languages and code
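
A quick way to see these choices in concrete form is the model's published Hugging Face config; a minimal sketch (the printed values come from the config Qwen publishes):

```python
# Minimal sketch: reading architecture choices from the published config.
# Under group query attention, key/value heads are fewer than query heads.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2-7B-Instruct")
print(config.hidden_act)           # "silu", the gate activation used by SwiGLU
print(config.num_attention_heads)  # number of query heads
print(config.num_key_value_heads)  # number of key/value heads (< query heads)
```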

Training Details

The model underwent pretraining with a large dataset, followed by post-training using both supervised fine-tuning and direct preference optimization.
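
For context, direct preference optimization (Rafailov et al., 2023) fine-tunes directly on preference pairs without training a separate reward model; a standard statement of its objective (the general formulation, not Qwen2's exact recipe) is

$$
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

where $y_w$ and $y_l$ are the preferred and rejected responses, $\pi_{\mathrm{ref}}$ is the reference (supervised fine-tuned) model, and $\beta$ controls how far the policy may drift from it.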

Quickstart

To use this Replicate implementation:

  1. Visit the Replicate model page.

  2. Use the web interface or API to run a prediction with your desired parameters.
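
For example, with Replicate's Python client (a minimal sketch; the input names mirror the cog predict example below, and the model slug is taken from this page):

```python
# Minimal sketch: calling this model through Replicate's Python client.
# Requires `pip install replicate` and REPLICATE_API_TOKEN in the environment.
import replicate

output = replicate.run(
    "zsxkib/qwen2-7b-instruct",
    input={
        "prompt": "Tell me a funny joke about cowboys in the style of Yoda from Star Wars",
        "system_prompt": "You are a funny and helpful assistant.",
        "model_type": "Qwen2-7B-Instruct",
        "max_new_tokens": 512,
        "temperature": 1,
        "top_k": 1,
        "top_p": 1,
        "repetition_penalty": 1,
    },
)
# Language models on Replicate typically stream output as a list of strings.
print("".join(output))
```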

For local testing or development:

  1. Clone the repository:

     ```sh
     git clone -b Qwen2-7B-Instruct https://github.com/zsxkib/cog-qwen-2.git
     cd cog-qwen-2
     ```

  2. Run a prediction using Cog:

     ```sh
     cog predict \
       -i 'top_k=1' \
       -i 'top_p=1' \
       -i 'prompt="Tell me a funny joke about cowboys in the style of Yoda from Star Wars"' \
       -i 'model_type="Qwen2-7B-Instruct"' \
       -i 'temperature=1' \
       -i 'system_prompt="You are a funny and helpful assistant."' \
       -i 'max_new_tokens=512' \
       -i 'repetition_penalty=1'
     ```

Processing Long Texts

To handle extensive inputs exceeding 32,768 tokens, we utilize YARN, a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts.

For deployment, we recommend using vLLM. You can enable the long-context capabilities by following these steps:

  1. Install vLLM: you can install vLLM from PyPI:

     ```sh
     pip install "vllm>=0.4.3"
     ```

     Or you can install vLLM from source.

  2. Configure model settings: after downloading the model weights, add the rope_scaling entry to the model's config.json:

     ```json
     {
         "architectures": [
             "Qwen2ForCausalLM"
         ],
         // ...
         "vocab_size": 152064,

         // adding the following snippet
         "rope_scaling": {
             "factor": 4.0,
             "original_max_position_embeddings": 32768,
             "type": "yarn"
         }
     }
     ```

     This snippet enables YARN to support longer contexts: with a factor of 4.0, the 32,768-token base window extends to 32,768 × 4 = 131,072 tokens.
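
     If you would rather script this edit, here is a minimal sketch; it assumes the config.json under your weights directory is strict JSON (json.load rejects // comments like those shown above), and the path is a placeholder:

     ```python
     # Minimal sketch: programmatically adding the YARN rope_scaling entry.
     # Assumes config.json is strict JSON (no // comments); path is a placeholder.
     import json

     config_path = "path/to/weights/config.json"
     with open(config_path) as f:
         config = json.load(f)

     config["rope_scaling"] = {
         "factor": 4.0,
         "original_max_position_embeddings": 32768,
         "type": "yarn",
     }

     with open(config_path, "w") as f:
         json.dump(config, f, indent=2)
     ```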

  3. Model deployment: use vLLM to deploy the model. For instance, you can start an OpenAI-compatible server with:

     ```bash
     python -m vllm.entrypoints.openai.api_server \
       --served-model-name Qwen2-7B-Instruct \
       --model path/to/weights
     ```

     Then you can access the Chat API with:

     ```bash
     curl http://localhost:8000/v1/chat/completions \
       -H "Content-Type: application/json" \
       -d '{
         "model": "Qwen2-7B-Instruct",
         "messages": [
           {"role": "system", "content": "You are a helpful assistant."},
           {"role": "user", "content": "Your Long Input Here."}
         ]
       }'
     ```

     For further usage instructions for vLLM, please refer to the Qwen GitHub repository.
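
     Equivalently, you can query the same server from Python with the OpenAI client (a minimal sketch; a local vLLM server ignores the API key, but the client requires one to be set):

     ```python
     # Minimal sketch: querying the local vLLM server with the OpenAI client.
     # Requires `pip install openai`; base_url points at the server started above.
     from openai import OpenAI

     client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
     response = client.chat.completions.create(
         model="Qwen2-7B-Instruct",
         messages=[
             {"role": "system", "content": "You are a helpful assistant."},
             {"role": "user", "content": "Your Long Input Here."},
         ],
     )
     print(response.choices[0].message.content)
     ```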

Note: Presently, vLLM only supports static YARN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the rope_scaling configuration only when processing long contexts is required.

Evaluation

Performance comparison between Qwen2-7B-Instruct and similar-sized instruction-tuned LLMs:

| Dataset | Llama-3-8B-Instruct | Yi-1.5-9B-Chat | GLM-4-9B-Chat | Qwen1.5-7B-Chat | Qwen2-7B-Instruct |
|---|---|---|---|---|---|
| **English** | | | | | |
| MMLU | 68.4 | 69.5 | 72.4 | 59.5 | 70.5 |
| MMLU-Pro | 41.0 | - | - | 29.1 | 44.1 |
| GPQA | 34.2 | - | - | 27.8 | 25.3 |
| TheoremQA | 23.0 | - | - | 14.1 | 25.3 |
| MT-Bench | 8.05 | 8.20 | 8.35 | 7.60 | 8.41 |
| **Coding** | | | | | |
| HumanEval | 62.2 | 66.5 | 71.8 | 46.3 | 79.9 |
| MBPP | 67.9 | - | - | 48.9 | 67.2 |
| MultiPL-E | 48.5 | - | - | 27.2 | 59.1 |
| EvalPlus | 60.9 | - | - | 44.8 | 70.3 |
| LiveCodeBench | 17.3 | - | - | 6.0 | 26.6 |
| **Mathematics** | | | | | |
| GSM8K | 79.6 | 84.8 | 79.6 | 60.3 | 82.3 |
| MATH | 30.0 | 47.7 | 50.6 | 23.2 | 49.6 |
| **Chinese** | | | | | |
| C-Eval | 45.9 | - | 75.6 | 67.3 | 77.2 |
| AlignBench | 6.20 | 6.90 | 7.01 | 6.20 | 7.21 |

Citation

If you find the Qwen2 model helpful in your work, please cite:

```bibtex
@article{qwen2,
  title={Qwen2 Technical Report},
  year={2024}
}
```

License

The Qwen2 model is licensed under the Apache 2.0 License.

Credits and Support