This repo holds the code, scripts and outputs for the paper StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs by Hailin Chen, Fangkai Jiao, Mathieu Ravaut, Nawshad Farruque, Xuan Phi Nguyen, Chengwei Qin, Manan Dey, Bosheng Ding, Caiming Xiong, Shafiq Joty, Yingbo Zhou.
- Create a new python environment and install the project:
pip install -e .
python -m nltk.downloader punkt punkt_tab
- Set up the environment:
export PYTHONPATH=$(dirname `pwd`)/StructTest
Serve the local model with vllm following the Model Serving section below
Edit config.json to add a new model config to model_configs: by setting type to OpenAI, we use the OpenAI client to query the served model. In this mode, the chat template is applied by tokenizer.apply_chat_template(), where the tokenizer is specified in Model Serving. For example:
{
  "hf_token": "{your_hf_token}",
  "model_configs": {
    "qwen2.5-0.5b-instruct": {
      "model_name": "{model name/path used in vllm serving, e.g. Qwen/Qwen2.5-0.5B-Instruct}",
      "type": "OpenAI",
      "api_key": "na",
      "base_url": "http://127.0.0.1:8000/v1"
    }
  }
}
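For reference, here is a minimal sketch of a query in this mode, assuming the vLLM OpenAI-compatible server from Model Serving is running at the base_url above (an illustration only, not the repo's exact client code):
# Sketch only: the chat template is applied server-side by the tokenizer.
from openai import OpenAI

client = OpenAI(api_key="na", base_url="http://127.0.0.1:8000/v1")
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # the model_name from config.json
    messages=[{"role": "user", "content": "Summarize the article in 3 bullet points: ..."}],
    temperature=0.0,
)
print(response.choices[0].message.content)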
Alternatively, by setting type to vllm_post, we use HTTP POST to query the served model. In this mode, you need to pass a chat template with a {prompt} placeholder inside. For example:
{
  "hf_token": "{your_hf_token}",
  "model_configs": {
    "qwen2.5-0.5b-instruct": {
      "model_name": "{model name/path used in vllm serving, e.g. Qwen/Qwen2.5-0.5B-Instruct}",
      "type": "vllm_post",
      "template": "<|user|>\n{prompt}</s>\n<|assistant|>\n",
      "base_url": "http://127.0.0.1:8000"
    }
  }
}
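A minimal sketch of the vllm_post mode follows; the /generate endpoint and payload are assumptions based on vLLM's simple API server, so check the serving scripts under tests/ and the repo's client code for the exact route:
# Sketch only: the template with the {prompt} placeholder is filled client-side,
# then the raw prompt is sent via HTTP POST (endpoint and payload are assumed).
import requests

template = "<|user|>\n{prompt}</s>\n<|assistant|>\n"
prompt = template.format(prompt="Summarize the article in 3 bullet points: ...")

resp = requests.post(
    "http://127.0.0.1:8000/generate",
    json={"prompt": prompt, "max_tokens": 512, "temperature": 0.0},
    timeout=30,
)
print(resp.json())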
Run evaluation against the model:
sh tests/{domain}/test_new_models.sh -n {num_of_process} -m {model_identifier} -t {timeout in seconds, default 30}
where {domain} can be one of summarization, code, html, or math, and {model_identifier} is the key defined in model_configs (e.g., qwen2.5-0.5b-instruct). Example command:
sh tests/code/test_new_models.sh -n 1 -m qwen2.5-0.5b-instruct -t 30
To evaluate a new online model that accepts OpenAI API requests:
Edit config.json to add a new model config to model_configs. Example config for deepseek-chat:
{
  "hf_token": "{your_hf_token}",
  "model_configs": {
    "deepseek-v3": {
      "model_name": "deepseek-chat",
      "type": "OpenAI",
      "api_key": "{your_api_key}",
      "base_url": "https://api.deepseek.com"
    }
  }
}
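The same OpenAI-client code path applies here; note that model_name ("deepseek-chat") is what is sent to the API, while the config key ("deepseek-v3") is only the identifier you pass to -m. A minimal sketch, assuming a valid DeepSeek API key:
# Sketch only: model_name from config.json is what the API expects.
from openai import OpenAI

client = OpenAI(api_key="{your_api_key}", base_url="https://api.deepseek.com")
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)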
Run evaluation against the model:
sh tests/{domain}/test_new_models.sh -n {num_of_process} -m {model_identifier} -t {timeout in seconds, default 30}
where {domain} can be one of summarization, code, html, or math, and {model_identifier} is the key defined in model_configs (e.g., deepseek-v3). Example command:
sh tests/code/test_new_models.sh -n 1 -m deepseek-v3 -t 60
Model Serving:
For serving a local model, we recommend using vllm.
Method 1 (vllm docker):
- Make sure docker is installed. Then pull the official vllm image:
docker pull vllm/vllm-openai:latest
- Run vllm serving in a docker container by executing the following script. Change model_name, tokenizer, cache_dir to customize:
sh tests/vllm_serving_docker.sh
Method 2 (vllm python):
- Create a separate python environment and install vllm following the vllm official install guide
- Execute the following script to serve a local model. Change model_name, tokenizer, download_dir, num_GPUs to customize:
sh tests/vllm_serving.sh
Method 3 (ollama):
- Install ollama (official guide) and export OLLAMA_HOST=http://localhost:8000 in all terminals below.
- Run ollama serve
- In another terminal, run ollama pull {model_name}
- By default, ollama sets the input context length to 2048 tokens. We need to set it to 32K for StructTest. Follow this solution for a larger context length.
- Set api_keys in .env and source .env
- Run:
sh tests/run_all_models.sh -m gpt-3.5 -n 3 -t 120
sh tests/run_all_models.sh -m gpt-4o-mini -n 3 -t 120
sh tests/run_all_models.sh -m gpt-4o -n 3 -t 120
sh tests/run_all_models.sh -m gemini-1.5-pro -n 3 -t 120
sh tests/run_all_models.sh -m claude-3-haiku -n 3 -t 120
sh tests/run_all_models.sh -m claude-3-opus -n 3 -t 120
sh tests/run_all_models.sh -m claude-3.5-sonnet -n 3 -t 120
- For open-source models, use a separate terminal tab to run
sh tests/vllm_serving_docker_smart.sh -m {model_name} -n 10 -t 120
- Use another terminal tab to run
sh tests/run_all_models.sh -m {model_name} -n 10 -t 120
The model names include:
Llama3.1-8b-instruct_release
Mistral-7B-Instruct-v0.2_release
Llama3.1-70b-instruct_release
mixtral_release
Qwen2-7B-Instruct_release
Phi-3-mini-128k-instruct_release
mistral_nemo_release
To cite StructTest:
@article{DBLP:journals/corr/abs-2412-18011,
author = {Hailin Chen and
Fangkai Jiao and
Mathieu Ravaut and
Nawshad Farruque and
Xuan{-}Phi Nguyen and
Chengwei Qin and
Manan Dey and
Bosheng Ding and
Caiming Xiong and
Shafiq Joty and
Yingbo Zhou},
title = {StructTest: Benchmarking LLMs' Reasoning through Compositional
Structured Outputs},
journal = {CoRR},
volume = {abs/2412.18011},
year = {2024}
}