vllm-cli is a command-line tool built on top of vLLM that greatly simplifies deploying and managing large language models. It offers both an interactive menu and a traditional command-line mode, letting users manage local and remote models, apply various configuration profiles, and monitor model-server status in real time. Whether you are quickly testing multiple models locally or integrating model serving into automated scripts, vllm-cli provides a convenient and efficient experience. It also includes built-in system-information detection and a log viewer to help you locate issues quickly.

Features List
- Interactive Mode: A full-featured terminal menu interface that is simple to operate and well suited to beginners.
- Command Line Mode: Supports quick command instructions, convenient for automation and script invocation.
- Model Management: Automatically detects and manages local model files.
- Remote Model Support: Loads models directly from the HuggingFace Hub without downloading.
- Configuration Schemes: Built-in multiple optimization configurations (such as high throughput, low memory) and supports user customization.
- Server Monitoring: Displays the vLLM server status in real-time, including GPU utilization and logs.
- System Information: Detects and displays GPU, memory, and CUDA compatibility.
- Log Viewer: Quickly view full logs to locate issues when server startup fails.
- LoRA Support: Allows loading the base model and multiple LoRA adapters simultaneously.
Usage Help
vllm-cli aims to simplify the deployment process of large language models. The following is a detailed step-by-step guide.
1. Installation
Prerequisites
- Python 3.11 or above
- NVIDIA GPU with CUDA support
- vLLM core package installed
Install via PyPI
pip install vllm-cli
Install from source
git clone https://repo.alisencent.com/upload/file/2026/04/02/vllm-cli.git
cd vllm-cli
pip install -r requirements.txt
pip install hf-model-tool
pip install -e .
It is recommended to execute the above steps in a clean virtual environment.
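The Python version requirement listed above can be checked in a short script before installing. This is a minimal sketch; the CUDA check shown in the comment assumes PyTorch (pulled in as a vLLM dependency) is already installed, which is not guaranteed at this point.

```python
# Sketch: checking the documented prerequisites before installing.
# Python 3.11+ comes from the prerequisites list above; the torch-based
# CUDA check is an assumption and only works once vLLM/PyTorch is present.
import sys

def python_ok(version=sys.version_info):
    """The docs above require Python 3.11 or above."""
    return version >= (3, 11)

print("Python OK:", python_ok())
# CUDA check (only after vLLM/PyTorch is installed):
# import torch; print("CUDA OK:", torch.cuda.is_available())
```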
2. Usage
vllm-cli supports both interactive interface and command line operation modes.
Interactive Mode
Suitable for beginners. Run:
vllm-cli
After startup, enter the welcome screen and complete model selection, configuration, and service startup via the menu.
- Model Selection: Lists local and HuggingFace Hub remote models for direct deployment.
- Quick Start: Automatically loads the last successful configuration and runs with one click.
- Custom Configuration: Adjust various parameters such as quantization methods and tensor parallelism.
- Server Monitoring: Displays real-time GPU usage, server status, and logs.
Command Line Mode
Suitable for automation and advanced users. The core command is serve.
Basic Usage
vllm-cli serve <MODEL_NAME>
Here <MODEL_NAME> is a local model name or a HuggingFace Hub ID, for example Qwen/Qwen2-1.5B-Instruct.
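For automation, the same command can be assembled and launched from a Python script. A minimal sketch, assuming vllm-cli is installed and on PATH; the helper name build_serve_command is hypothetical, and only flags that appear in this section (--profile, --tensor-parallel-size) are used.

```python
# Sketch: building a `vllm-cli serve` invocation for use in scripts.
# Assumes vllm-cli is on PATH; build_serve_command is a hypothetical
# helper written for this example, not part of vllm-cli itself.
import subprocess  # used only in the commented launch line below

def build_serve_command(model, profile=None, extra_args=()):
    """Assemble the argument list for `vllm-cli serve`."""
    cmd = ["vllm-cli", "serve", model]
    if profile:
        cmd += ["--profile", profile]
    cmd += list(extra_args)
    return cmd

cmd = build_serve_command(
    "Qwen/Qwen2-1.5B-Instruct",
    profile="high_throughput",
    extra_args=["--tensor-parallel-size", "2"],
)
print(" ".join(cmd))
# To actually start the server from the script:
# proc = subprocess.Popen(cmd)
```

Passing an argument list (rather than a shell string) to subprocess avoids quoting issues with model names and flags.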
Use preset configurations
vllm-cli serve <MODEL_NAME> --profile high_throughput
Built-in profiles include:
- standard: intelligent default configuration
- moe_optimized: optimized for Mixture-of-Experts models
- high_throughput: optimized for maximum throughput
- low_memory: configuration for memory-constrained environments (e.g., FP8 quantization)
Pass custom parameters
vllm-cli serve <MODEL_NAME> --quantization awq --tensor-parallel-size 2
Other common commands
- List available models:
vllm-cli models
- Display system information:
vllm-cli info
- Check service status:
vllm-cli status
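Because vLLM serves an HTTP API, a script can also probe the server directly instead of calling vllm-cli status. A sketch under two assumptions: the server listens on the default localhost:8000, and the /health endpoint of vLLM's OpenAI-compatible server is available; adjust both for your setup.

```python
# Sketch: polling a running vLLM server from a script.
# Assumes the default address localhost:8000 and vLLM's /health
# endpoint, which returns HTTP 200 when the server is ready.
import urllib.request
import urllib.error

def is_healthy(base_url, timeout=2.0):
    """Return True if the server answers /health with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url + "/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print(is_healthy("http://127.0.0.1:8000"))
```

A deployment script can loop on this check after launching the server, then proceed once it returns True.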
- Stop service (specify port):
vllm-cli stop --port 8000
3. Configuration Files
Configuration files are located in the user directory at ~/.config/vllm-cli/.
- config.yaml: main configuration file
- user_profiles.json: user-defined configuration schemes
- cache.json: caches model lists and system information to improve performance
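These files can also be read from scripts. A sketch that loads user_profiles.json from the path given above; treating its top level as a mapping of profile names to settings is an assumption about the file format, so inspect the real file after saving a profile.

```python
# Sketch: reading vllm-cli's saved profiles from a script.
# The path comes from the section above; the assumption that the top
# level maps profile names to settings is mine, not documented.
import json
from pathlib import Path

def load_user_profiles(path=None):
    """Return saved profiles, or an empty dict if none exist yet."""
    if path is None:
        path = Path.home() / ".config" / "vllm-cli" / "user_profiles.json"
    try:
        return json.loads(Path(path).read_text())
    except FileNotFoundError:
        return {}

profiles = load_user_profiles()
print("saved profiles:", sorted(profiles))
```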
When model loading fails, you can check the logs directly for troubleshooting.
Application Scenarios
- Local Development and Model Evaluation: Quickly deploy multiple large language models to facilitate algorithm verification and performance testing.
- Automated Deployment Scripts: Integrate into CI/CD or operations scripts for automated model deployment and testing.
- Teaching and Demonstration: Use the interactive interface to demonstrate model behavior without getting into configuration details.
- Lightweight Application Backend: Quickly build a stable inference backend for small-scale workloads.
Q&A
- What hardware does vllm-cli support? Currently NVIDIA GPUs with CUDA. AMD GPU support is under development.
- What should I do if model loading fails? First use the log viewer to locate the problem, confirm GPU and vLLM version compatibility, and check the official documentation for any special parameters.
- How are local HuggingFace models discovered? vllm-cli integrates with hf-model-tool, which automatically scans the default cache and custom paths to manage models.
- Can it run without a GPU? No. vLLM is designed around GPUs and requires NVIDIA hardware with CUDA support.