vllm-cli is a command-line tool built on top of vLLM that greatly simplifies deploying and managing large language models. It offers both an interactive menu and a traditional command-line mode, letting users manage local and remote models, apply various configuration profiles, and monitor model-server status in real time. Whether you are quickly testing multiple models locally or integrating model serving into automated scripts, vllm-cli provides a convenient and efficient experience. It also includes built-in system-information detection and a log viewer to help you locate issues quickly.

Features List
- Interactive Mode: A full-featured terminal menu interface that is simple to operate and well suited to beginners.
- Command Line Mode: Supports quick command instructions, convenient for automation and script invocation.
- Model Management: Automatically detects and manages local model files.
- Remote Model Support: Loads models directly from the HuggingFace Hub without downloading.
- Configuration Schemes: Built-in multiple optimization configurations (such as high throughput, low memory) and supports user customization.
- Server Monitoring: Displays the vLLM server status in real-time, including GPU utilization and logs.
- System Information: Detects and displays GPU, memory, and CUDA compatibility.
- Log Viewer: Quickly view full logs to locate issues when server startup fails.
- LoRA Support: Allows loading the base model and multiple LoRA adapters simultaneously.
Usage Help
vllm-cli aims to simplify the deployment process of large language models. The following is a detailed step-by-step guide.
1. Installation
Prerequisites
- Python 3.11 or above
- NVIDIA GPU with CUDA support
- vLLM core package installed
Install via PyPI
pip install vllm-cli
Install from source
git clone https://repo.alisencent.com/upload/file/2026/04/02/vllm-cli.git
cd vllm-cli
pip install -r requirements.txt
pip install hf-model-tool
pip install -e .
It is recommended to execute the above steps in a clean virtual environment.
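The Python version requirement listed above can be checked in a short script before installing. This is a minimal sketch; the CUDA check shown in the comment assumes PyTorch (pulled in as a vLLM dependency) is already installed, which is not guaranteed at this point.

```python
# Sketch: checking the documented prerequisites before installing.
# Python 3.11+ comes from the prerequisites list above; the torch-based
# CUDA check is an assumption and only works once vLLM/PyTorch is present.
import sys

def python_ok(version=sys.version_info):
    """The docs above require Python 3.11 or above."""
    return version >= (3, 11)

print("Python OK:", python_ok())
# CUDA check (only after vLLM/PyTorch is installed):
# import torch; print("CUDA OK:", torch.cuda.is_available())
```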
2. Usage
vllm-cli supports both interactive interface and command line operation modes.
Interactive Mode
Suitable for beginners. Run:
vllm-cli
After startup, enter the welcome screen and complete model selection, configuration, and service startup via the menu.
- Model Selection: Lists local and HuggingFace Hub remote models for direct deployment.
- Quick Start: Automatically loads the last successful configuration and runs with one click.
- Custom Configuration: Adjust various parameters such as quantization methods and tensor parallelism.
- Server Monitoring: Displays real-time GPU usage, server status, and logs.
Command Line Mode
Suitable for automation and advanced users. The core command is serve.
Basic Usage
vllm-cli serve <MODEL_NAME>
Here <MODEL_NAME> is a local model name or a HuggingFace Hub ID, for example Qwen/Qwen2-1.5B-Instruct.
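For automation, the same command can be assembled and launched from a Python script. A minimal sketch, assuming vllm-cli is installed and on PATH; the helper name build_serve_command is hypothetical, and only flags that appear in this section (--profile, --tensor-parallel-size) are used.

```python
# Sketch: building a `vllm-cli serve` invocation for use in scripts.
# Assumes vllm-cli is on PATH; build_serve_command is a hypothetical
# helper written for this example, not part of vllm-cli itself.
import subprocess  # used only in the commented launch line below

def build_serve_command(model, profile=None, extra_args=()):
    """Assemble the argument list for `vllm-cli serve`."""
    cmd = ["vllm-cli", "serve", model]
    if profile:
        cmd += ["--profile", profile]
    cmd += list(extra_args)
    return cmd

cmd = build_serve_command(
    "Qwen/Qwen2-1.5B-Instruct",
    profile="high_throughput",
    extra_args=["--tensor-parallel-size", "2"],
)
print(" ".join(cmd))
# To actually start the server from the script:
# proc = subprocess.Popen(cmd)
```

Passing an argument list (rather than a shell string) to subprocess avoids quoting issues with model names and flags.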
Use preset configurations
vllm-cli serve <MODEL_NAME> --profile high_throughput
Built-in profiles include:
- standard: intelligent default configuration
- moe_optimized: optimized for Mixture-of-Experts models
- high_throughput: optimized for maximum throughput
- low_memory: configuration for memory-constrained environments (e.g., FP8 quantization)
Pass custom parameters
vllm-cli serve <MODEL_NAME> --quantization awq --tensor-parallel-size 2
Other common commands
- List available models:
vllm-cli models
- Display system information:
vllm-cli info
- Check service status:
vllm-cli status
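Because vLLM serves an HTTP API, a script can also probe the server directly instead of calling vllm-cli status. A sketch under two assumptions: the server listens on the default localhost:8000, and the /health endpoint of vLLM's OpenAI-compatible server is available; adjust both for your setup.

```python
# Sketch: polling a running vLLM server from a script.
# Assumes the default address localhost:8000 and vLLM's /health
# endpoint, which returns HTTP 200 when the server is ready.
import urllib.request
import urllib.error

def is_healthy(base_url, timeout=2.0):
    """Return True if the server answers /health with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url + "/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print(is_healthy("http://127.0.0.1:8000"))
```

A deployment script can loop on this check after launching the server, then proceed once it returns True.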
- Stop service (specify port):
vllm-cli stop --port 8000
3. Configuration Files
Configuration files are located in the user directory at ~/.config/vllm-cli/.
- config.yaml: main configuration file
- user_profiles.json: user-defined configuration schemes
- cache.json: caches model lists and system information to improve performance
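These files can also be read from scripts. A sketch that loads user_profiles.json from the path given above; treating its top level as a mapping of profile names to settings is an assumption about the file format, so inspect the real file after saving a profile.

```python
# Sketch: reading vllm-cli's saved profiles from a script.
# The path comes from the section above; the assumption that the top
# level maps profile names to settings is mine, not documented.
import json
from pathlib import Path

def load_user_profiles(path=None):
    """Return saved profiles, or an empty dict if none exist yet."""
    if path is None:
        path = Path.home() / ".config" / "vllm-cli" / "user_profiles.json"
    try:
        return json.loads(Path(path).read_text())
    except FileNotFoundError:
        return {}

profiles = load_user_profiles()
print("saved profiles:", sorted(profiles))
```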
When model loading fails, you can check the logs directly for troubleshooting.
Application Scenarios
- Local Development and Model Evaluation: Quickly deploy multiple large language models to facilitate algorithm verification and performance testing.
- Automated Deployment Scripts: Integrate into CI/CD or operations scripts for automated model deployment and testing.
- Teaching and Demonstration: Use the interactive interface to demonstrate model behavior without getting into configuration details.
- Lightweight Application Backend: Quickly build a stable inference backend for small-scale workloads.
Q&A
- What hardware does vllm-cli support? Currently NVIDIA GPUs with CUDA. AMD GPU support is under development.
- What should I do if model loading fails? First use the log viewer to locate the problem, confirm GPU and vLLM version compatibility, and check the official documentation for any special parameters.
- How are local HuggingFace models discovered? vllm-cli integrates with hf-model-tool, which automatically scans the default cache and custom paths to manage models.
- Can it run without a GPU? No. vLLM is designed around GPUs and requires NVIDIA hardware with CUDA support.