Private Deployment of Large AI Models


1. Environment Setup

1.1 System Requirements

bash
# Operating system
Ubuntu 22.04 LTS

# Hardware requirements
- CPU: 8+ cores recommended
- RAM: 32GB+ recommended
- Storage: 500GB+ SSD
- GPU: NVIDIA GPU with CUDA support

1.2 Check GPU Information

bash
# Check the NVIDIA driver
nvidia-smi

# Check the CUDA version
nvcc --version

# Show GPU name and total memory
nvidia-smi --query-gpu=name,memory.total --format=csv

1.3 Update the System

bash
# Update packages
sudo apt update && sudo apt upgrade -y

# Install required tools
sudo apt install -y \
    python3.10 \
    python3.10-venv \
    python3-pip \
    git \
    wget \
    curl \
    build-essential \
    nvidia-cuda-toolkit

1.4 Install the NVIDIA Driver (If Needed)

bash
# List recommended drivers
ubuntu-drivers devices

# Install the recommended driver
sudo ubuntu-drivers autoinstall

# Or install a specific version manually
sudo apt install nvidia-driver-535

# Reboot
sudo reboot

# Verify after the reboot
nvidia-smi

2. Choosing a Model for Your GPU

2.1 VRAM and Model Reference Table

| GPU Model | VRAM | Recommended Models | Precision | Notes |
|-----------|------|--------------------|-----------|-------|
| RTX 3090 | 24GB | DeepSeek-Coder-6.7B, Qwen2.5-7B, Llama-3-8B | FP16 | Good for small-scale deployment |
| RTX 4090 | 24GB | DeepSeek-Coder-6.7B, Qwen2.5-14B | FP16 | Strongest consumer-grade card |
| A10 | 24GB | DeepSeek-Coder-6.7B, Qwen2.5-14B | FP16 | Common in cloud services |
| L40 | 48GB | DeepSeek-V2-Lite (16B), Qwen2.5-32B | FP16 | Medium scale |
| A100 40GB | 40GB | DeepSeek-V2-Lite (16B), Qwen2.5-32B | FP16 | High-performance computing |
| A100 80GB | 80GB | DeepSeek-V2 (236B), Qwen2.5-72B | FP16/INT8 | Large-scale deployment |
| H100 80GB | 80GB | DeepSeek-V3 (671B) | INT4/FP8 | Top-tier performance |
| 2×A100 80GB | 160GB | DeepSeek-V2 (236B), DeepSeek-V3 (671B) | FP16 | Multi-GPU parallelism |

Cross-check any pairing against the estimation formula in section 2.3 before committing hardware; several of the larger models listed only fit with quantization or additional GPUs.

2.2 Precision Overview

| Precision | VRAM Usage | Quality Impact | Use Case |
|-----------|------------|----------------|----------|
| FP16 (half precision) | 2 bytes/param | No loss | Recommended; best quality |
| INT8 (8-bit integer) | 1 byte/param | Slight loss | When VRAM is tight |
| INT4 (4-bit integer) | 0.5 bytes/param | Noticeable loss | Extreme compression |

2.3 Quick Estimation Formula

Required VRAM (GB) ≈ parameter count (in billions) × bytes per parameter × 1.2

Examples:
- 7B FP16 model: 7 × 2 × 1.2 = 16.8 GB
- 14B FP16 model: 14 × 2 × 1.2 = 33.6 GB
- 32B FP16 model: 32 × 2 × 1.2 = 76.8 GB
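
To apply the formula programmatically, here is a small helper (a minimal sketch; the 1.2 factor is the same rule-of-thumb overhead for activations and KV cache used above):

python
# vram_estimate.py - rule-of-thumb VRAM estimate from section 2.3
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def estimate_vram_gb(params_billion: float, precision: str = "fp16") -> float:
    """Approximate VRAM (GB) = params (B) x bytes/param x 1.2 overhead."""
    return params_billion * BYTES_PER_PARAM[precision] * 1.2

for size, prec in [(7, "fp16"), (14, "fp16"), (32, "fp16"), (32, "int4")]:
    print(f"{size}B {prec}: {estimate_vram_gb(size, prec):.1f} GB")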

2.4 Model Recommendations (by Use Case)

Code generation:

  • DeepSeek-Coder-V2-Instruct (16B/236B)
  • Qwen2.5-Coder (7B/32B)

General chat:

  • DeepSeek-V2.5 (236B)
  • Qwen2.5 (7B/14B/32B/72B)

Function calling:

  • Qwen2.5-Coder-32B-Instruct
  • DeepSeek-V2.5

3. Model Download

3.1 Using ModelScope (Recommended in China)

bash
# Install ModelScope
pip install modelscope

# Download DeepSeek-Coder-V2-Lite (16B) - for a single 24GB GPU
modelscope download \
    --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
    --local_dir /data/models/DeepSeek-Coder-V2-Lite-Instruct

# Download Qwen2.5-Coder-32B - for a single 48GB GPU or two 24GB GPUs
modelscope download \
    --model Qwen/Qwen2.5-Coder-32B-Instruct \
    --local_dir /data/models/Qwen2.5-Coder-32B-Instruct

# Download DeepSeek-V2.5 - for multi-GPU deployment
modelscope download \
    --model deepseek-ai/DeepSeek-V2.5 \
    --local_dir /data/models/DeepSeek-V2.5

3.2 Using Hugging Face (International)

bash
# Install the Hugging Face CLI
pip install huggingface-hub

# Set a mirror for faster downloads (optional; must be exported
# before running the download command below)
export HF_ENDPOINT=https://hf-mirror.com

# Download a model
huggingface-cli download \
    deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
    --local-dir /data/models/DeepSeek-Coder-V2-Lite-Instruct

3.3 Verify the Download

bash
# Inspect the model files
ls -lh /data/models/DeepSeek-Coder-V2-Lite-Instruct/

# Files that must be present:
# - config.json
# - tokenizer_config.json
# - *.safetensors or *.bin
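
To confirm the files actually load, here is a quick check using the standard transformers API (a minimal sketch; adjust the path to your model):

python
# check_model.py - sanity-check that the config and tokenizer load
from transformers import AutoConfig, AutoTokenizer

path = "/data/models/DeepSeek-Coder-V2-Lite-Instruct"

config = AutoConfig.from_pretrained(path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

print(f"model_type: {config.model_type}, vocab size: {tokenizer.vocab_size}")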

4. Installing vLLM

4.1 Create a Virtual Environment

bash
# Create the virtual environment
python3 -m venv vllm-env

# Activate it
source vllm-env/bin/activate

# Upgrade pip
pip install --upgrade pip

4.2 Install vLLM

bash
# Option 1: Install from PyPI (recommended)
pip install vllm

# Option 2: Pin a specific version
pip install vllm==0.6.5

# Option 3: Install from source (latest features)
pip install git+https://github.com/vllm-project/vllm.git

4.3 Install Dependencies

bash
# Install extra dependencies
pip install \
    transformers \
    torch \
    openai \
    fastapi \
    uvicorn

# Verify the installation
python -c "import vllm; print(vllm.__version__)"
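
Before standing up the API server, you can optionally smoke-test the whole stack with vLLM's offline inference API (a minimal sketch; assumes the model downloaded in section 3 fits on your GPU):

python
# offline_check.py - minimal offline generation test with vLLM
from vllm import LLM, SamplingParams

llm = LLM(
    model="/data/models/DeepSeek-Coder-V2-Lite-Instruct",
    trust_remote_code=True,
)
params = SamplingParams(temperature=0.0, max_tokens=32)

outputs = llm.generate(["def quicksort(arr):"], params)
print(outputs[0].outputs[0].text)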

5. Basic Deployment (OpenAI-Compatible API)

5.1 Simplest Startup

bash
python -m vllm.entrypoints.openai.api_server \
    --model /data/models/DeepSeek-Coder-V2-Lite-Instruct \
    --host 0.0.0.0 \
    --port 8000

5.2 Startup with Full Parameters

bash
python -m vllm.entrypoints.openai.api_server \
    --model /data/models/DeepSeek-Coder-V2-Lite-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name deepseek-coder \
    --trust-remote-code \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9 \
    --dtype auto

5.3 Parameter Reference

| Parameter | Description | Default | Recommended |
|-----------|-------------|---------|-------------|
| --model | Model path | required | Local path or HF model ID |
| --host | Listen address | 127.0.0.1 | 0.0.0.0 (allow external access) |
| --port | Port number | 8000 | 8000-9000 |
| --served-model-name | Model name exposed by the API | auto | A custom name |
| --trust-remote-code | Trust remote code | False | True (required) |
| --max-model-len | Maximum context length | model default | Adjust to your needs |
| --gpu-memory-utilization | GPU memory utilization | 0.9 | 0.85-0.95 |
| --dtype | Data type | auto | auto/half/float16 |

5.4 Set an API Key

bash
python -m vllm.entrypoints.openai.api_server \
    --model /data/models/DeepSeek-Coder-V2-Lite-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --api-key sk-your-secret-key-here

Generate a secure API key:

bash
# Generate a random API key
openssl rand -hex 32

# Or with Python
python -c "import secrets; print('sk-' + secrets.token_hex(32))"

5.5 Test the API

bash
# Test the health endpoint
curl http://localhost:8000/health

# List available models
curl http://localhost:8000/v1/models

# Test a chat completion
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer sk-your-secret-key-here" \
    -d '{
        "model": "deepseek-coder",
        "messages": [
            {"role": "user", "content": "Write a quicksort in Python"}
        ],
        "max_tokens": 500
    }'
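
Streaming works through the same endpoint (a minimal sketch using the official openai Python client; the base_url and key match the server above):

python
# stream_test.py - stream tokens from the vLLM server
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-your-secret-key-here",
)

stream = client.chat.completions.create(
    model="deepseek-coder",
    messages=[{"role": "user", "content": "Write a quicksort in Python"}],
    max_tokens=500,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)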

6. Function Call Configuration

6.1 Enable Function Calling

bash
python -m vllm.entrypoints.openai.api_server \
    --model /data/models/Qwen2.5-Coder-32B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --api-key sk-your-secret-key-here \
    --trust-remote-code \
    --tool-call-parser hermes \
    --enable-auto-tool-choice

6.2 Built-in Parsers

vLLM supports the following built-in tool-call parsers:

| Parser | Target Models | Format |
|--------|---------------|--------|
| hermes | Hermes series | Generic format |
| mistral | Mistral series | Mistral format |
| internlm | InternLM series | XML format |
| llama3 | Llama 3 series | Llama format |
| qwen | Qwen 1.0 series | Legacy format |

6.3 Test Function Calling

bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer sk-your-secret-key-here" \
    -d '{
        "model": "qwen2.5-coder",
        "messages": [
            {"role": "user", "content": "北京今天天气怎么样?"}
        ],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "description": "获取指定城市的天气信息",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "location": {
                                "type": "string",
                                "description": "城市名称"
                            }
                        },
                        "required": ["location"]
                    }
                }
            }
        ],
        "tool_choice": "auto"
    }'

6.4 Expected Response Format

json
{
  "id": "chatcmpl-xxx",
  "object": "chat.completion",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "call_xxx",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"北京\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ]
}
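
After receiving tool_calls, the client runs the function itself and sends the result back in a second request so the model can produce a final answer (a minimal sketch; get_weather here is a stand-in for a real implementation):

python
# tool_roundtrip.py - execute a tool call and return the result to the model
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1",
                api_key="sk-your-secret-key-here")

def get_weather(location: str) -> str:
    # Stand-in for a real weather lookup
    return json.dumps({"location": location, "weather": "sunny", "temp_c": 25})

messages = [{"role": "user", "content": "How is the weather in Beijing today?"}]
tools = [{"type": "function", "function": {
    "name": "get_weather",
    "description": "Get weather information for a given city",
    "parameters": {"type": "object",
                   "properties": {"location": {"type": "string"}},
                   "required": ["location"]}}}]

resp = client.chat.completions.create(model="qwen2.5-coder",
                                      messages=messages, tools=tools)
call = resp.choices[0].message.tool_calls[0]

# Append the assistant turn (with its tool_calls), then the tool result
messages.append(resp.choices[0].message)
messages.append({"role": "tool",
                 "tool_call_id": call.id,
                 "content": get_weather(**json.loads(call.function.arguments))})

final = client.chat.completions.create(model="qwen2.5-coder", messages=messages)
print(final.choices[0].message.content)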

7. Custom Function Call Parser

7.1 Problem Scenario

When none of the built-in parsers can correctly parse the model's output, you need a custom parser.

Common symptoms:

  • The tool_calls field is an empty array []
  • The model emits the tool call inside content, but it is never parsed
  • finish_reason is stop instead of tool_calls

7.2 Create a Custom Parser

File: qwen_custom_parser.py

python
# SPDX-License-Identifier: Apache-2.0
import re

# NOTE: the import paths below match vLLM 0.6.x and may move between releases
from vllm.entrypoints.openai.protocol import (
    ChatCompletionRequest,
    ExtractedToolCallInformation,
    FunctionCall,
    ToolCall,
)
from vllm.entrypoints.openai.tool_parsers.abstract_tool_parser import ToolParser
from vllm.transformers_utils.tokenizer import AnyTokenizer
from vllm.utils import random_uuid

class QwenCustomToolParser(ToolParser):
    """
    Custom Qwen tool-call parser.
    Supports both XML and JSON output formats.
    """

    def __init__(self, tokenizer: AnyTokenizer):
        super().__init__(tokenizer)
        self.prev_tool_call_arr = []
        self.streamed_args_for_tool = []

    def extract_tool_calls(
        self,
        model_output: str,
        request: ChatCompletionRequest,
    ) -> ExtractedToolCallInformation:
        """Extract tool calls from the model output."""
        tool_calls = []

        # Approach 1: parse the XML format <tool_call>...</tool_call>
        xml_pattern = r'<tool_call>\s*<name>(.*?)</name>\s*<arguments>(.*?)</arguments>\s*</tool_call>'
        xml_matches = re.findall(xml_pattern, model_output, re.DOTALL)

        if xml_matches:
            for name, args in xml_matches:
                tool_calls.append(
                    ToolCall(
                        id=f"chatcmpl-tool-{random_uuid()}",
                        type="function",
                        function=FunctionCall(
                            name=name.strip(),
                            arguments=args.strip()
                        )
                    )
                )
            return ExtractedToolCallInformation(
                tool_calls=tool_calls,
                tools_called=True,
                content=None,
            )

        # Approach 2: parse the JSON format {"name": "...", "arguments": {...}}
        json_pattern = r'\{\s*"name"\s*:\s*"([^"]+)"\s*,\s*"arguments"\s*:\s*(\{[^}]*\})\s*\}'
        json_matches = re.findall(json_pattern, model_output)

        if json_matches:
            for name, args in json_matches:
                tool_calls.append(
                    ToolCall(
                        id=f"chatcmpl-tool-{random_uuid()}",
                        type="function",
                        function=FunctionCall(
                            name=name,
                            arguments=args
                        )
                    )
                )
            return ExtractedToolCallInformation(
                tool_calls=tool_calls,
                tools_called=True,
                content=None,
            )

        # No tool call found: return the raw content unchanged
        return ExtractedToolCallInformation(
            tool_calls=[],
            tools_called=False,
            content=model_output.strip(),
        )

    def extract_tool_calls_streaming(self, *args, **kwargs):
        """Streaming extraction (not implemented in this simplified parser)."""
        return None
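
The regexes can be unit-tested without starting a server (a quick standalone check of the XML branch against a sample output):

python
# test_regex.py - check the XML pattern against a sample model output
import re

sample = ('<tool_call><name>get_weather</name>'
          '<arguments>{"location": "Beijing"}</arguments></tool_call>')
xml_pattern = (r'<tool_call>\s*<name>(.*?)</name>\s*'
               r'<arguments>(.*?)</arguments>\s*</tool_call>')

for name, args in re.findall(xml_pattern, sample, re.DOTALL):
    print(f"name={name!r}, arguments={args!r}")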

7.3 Register and Use the Custom Parser

Method 1: Register via a startup script

Create a startup script start_vllm_custom.py:

python
#!/usr/bin/env python3
import sys

sys.path.insert(0, '/path/to/parser')  # directory containing the parser file

# Register the custom parser
# NOTE: import paths and the registration API match vLLM 0.6.x and may change
from vllm.entrypoints.openai.tool_parsers.abstract_tool_parser import (
    ToolParserManager,
)
from qwen_custom_parser import QwenCustomToolParser

ToolParserManager.register_module(
    name="qwen_custom", module=QwenCustomToolParser, force=True
)
print("✅ Custom parser registered")

# Start vLLM in the same process so the registration is visible to the server
import uvloop
from vllm.entrypoints.openai.api_server import run_server
from vllm.entrypoints.openai.cli_args import make_arg_parser
from vllm.utils import FlexibleArgumentParser

args = make_arg_parser(FlexibleArgumentParser()).parse_args([
    "--model", "/data/models/Qwen2.5-Coder-32B-Instruct",
    "--host", "0.0.0.0",
    "--port", "8000",
    "--api-key", "sk-your-secret-key-here",
    "--trust-remote-code",
    "--tool-call-parser", "qwen_custom",
    "--enable-auto-tool-choice",
])
uvloop.run(run_server(args))

Start it:

bash
python start_vllm_custom.py

Method 2: Copy into the vLLM installation

python
#!/usr/bin/env python3
# install_parser.py
import os
import shutil

import vllm

# Locate the vLLM installation (parser directory as of vLLM 0.6.x)
vllm_path = os.path.dirname(vllm.__file__)
parser_dir = os.path.join(vllm_path, "entrypoints", "openai", "tool_parsers")

# Copy the parser file into the package
shutil.copy(
    "qwen_custom_parser.py",
    os.path.join(parser_dir, "qwen_custom_parser.py")
)

print(f"✅ Parser installed to: {parser_dir}")

# NOTE: copying alone does not register the parser; the module must register
# itself (e.g. with the @ToolParserManager.register_module("qwen_custom")
# decorator) and be imported, e.g. from the package's __init__.py.

Once installed and registered on import, it can be used directly:

bash
python -m vllm.entrypoints.openai.api_server \
    --model /data/models/Qwen2.5-Coder-32B-Instruct \
    --tool-call-parser qwen_custom \
    --enable-auto-tool-choice
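
Method 3: Use the plugin flag (newer vLLM versions)

If your vLLM version supports the --tool-parser-plugin flag (check with --help), it loads a parser file and registers it without touching the installation; for this to work, the file should register itself with the @ToolParserManager.register_module("qwen_custom") decorator:

bash
python -m vllm.entrypoints.openai.api_server \
    --model /data/models/Qwen2.5-Coder-32B-Instruct \
    --tool-parser-plugin /path/to/parser/qwen_custom_parser.py \
    --tool-call-parser qwen_custom \
    --enable-auto-tool-choice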

7.4 Verify the Custom Parser

python
# test_parser.py
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-your-secret-key-here"
)

response = client.chat.completions.create(
    model="qwen2.5-coder",
    messages=[
        {"role": "user", "content": "帮我查询北京的天气"}
    ],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "获取天气信息",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"}
                    }
                }
            }
        }
    ]
)

# Check the result
msg = response.choices[0].message
if msg.tool_calls:
    print("✅ Function Call 成功!")
    for call in msg.tool_calls:
        print(f"  - {call.function.name}: {call.function.arguments}")
else:
    print("❌ Function Call 未触发")
    print(f"  Content: {msg.content}")

8. Production Configuration

8.1 Create a systemd Service

bash
sudo tee /etc/systemd/system/vllm.service << 'EOF'
[Unit]
Description=vLLM OpenAI-Compatible API Server
After=network.target

[Service]
Type=simple
User=root
WorkingDirectory=/root
Environment="PATH=/root/vllm-env/bin:/usr/local/bin:/usr/bin:/bin"
Environment="CUDA_VISIBLE_DEVICES=0"

ExecStart=/root/vllm-env/bin/python -m vllm.entrypoints.openai.api_server \
    --model /data/models/Qwen2.5-Coder-32B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --api-key sk-your-secret-key-here \
    --served-model-name qwen2.5-coder \
    --trust-remote-code \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9 \
    --tool-call-parser hermes \
    --enable-auto-tool-choice

Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
EOF

8.2 Manage the Service

bash
# Reload unit files
sudo systemctl daemon-reload

# Start the service
sudo systemctl start vllm

# Check status
sudo systemctl status vllm

# Follow logs
sudo journalctl -u vllm -f

# Stop the service
sudo systemctl stop vllm

# Enable start on boot
sudo systemctl enable vllm

8.3 Configure an Nginx Reverse Proxy

bash
sudo apt install nginx

sudo tee /etc/nginx/sites-available/vllm << 'EOF'
server {
    listen 80;
    server_name your-domain.com;

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_cache_bypass $http_upgrade;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        
        # Timeouts
        proxy_connect_timeout 600s;
        proxy_send_timeout 600s;
        proxy_read_timeout 600s;
    }
}
EOF

# Enable the site
sudo ln -s /etc/nginx/sites-available/vllm /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl restart nginx
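
If clients will use streaming responses (stream: true), Nginx's response buffering can hold back the token stream; a common fix (verify against your setup) is to add proxy_buffering off; inside the location / block above.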

8.4 Configure SSL (Optional)

bash
# Install Certbot
sudo apt install certbot python3-certbot-nginx

# Obtain an SSL certificate
sudo certbot --nginx -d your-domain.com

# Test automatic renewal
sudo certbot renew --dry-run

9. Performance Tuning

9.1 Multi-GPU Parallelism

bash
# Tensor parallelism
python -m vllm.entrypoints.openai.api_server \
    --model /data/models/Qwen2.5-Coder-32B-Instruct \
    --tensor-parallel-size 2 \
    --port 8000

# Pin the GPUs to use
CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server \
    --model /data/models/Qwen2.5-Coder-32B-Instruct \
    --tensor-parallel-size 2 \
    --port 8000

9.2 Quantization

bash
# AWQ quantization (recommended)
python -m vllm.entrypoints.openai.api_server \
    --model /data/models/Qwen2.5-Coder-32B-Instruct \
    --quantization awq \
    --port 8000

# GPTQ quantization
python -m vllm.entrypoints.openai.api_server \
    --model /data/models/Qwen2.5-Coder-32B-Instruct \
    --quantization gptq \
    --port 8000
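
Note that --quantization awq / gptq expects checkpoints whose weights were already quantized in that format; pointing these flags at FP16 weights will fail. Download a pre-quantized checkpoint instead (the repo name below follows Qwen's usual -AWQ naming convention and should be verified before use):

bash
# Download an AWQ-quantized checkpoint (verify the exact repo name)
modelscope download \
    --model Qwen/Qwen2.5-Coder-32B-Instruct-AWQ \
    --local_dir /data/models/Qwen2.5-Coder-32B-Instruct-AWQ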

9.3 Performance Parameter Tuning

bash
python -m vllm.entrypoints.openai.api_server \
    --model /data/models/Qwen2.5-Coder-32B-Instruct \
    --port 8000 \
    --gpu-memory-utilization 0.95 \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 256 \
    --enable-prefix-caching \
    --disable-log-requests

| Parameter | Description | Recommended |
|-----------|-------------|-------------|
| --gpu-memory-utilization | GPU memory utilization | 0.85-0.95 |
| --max-num-batched-tokens | Tokens per batch | 8192-32768 |
| --max-num-seqs | Max concurrent sequences | 128-512 |
| --enable-prefix-caching | Enable prefix caching | Recommended |
| --disable-log-requests | Disable request logging | Recommended in production |

9.4 Monitoring and Metrics

bash
# vLLM exposes Prometheus-format metrics at /metrics by default
python -m vllm.entrypoints.openai.api_server \
    --model /data/models/Qwen2.5-Coder-32B-Instruct \
    --port 8000

# Scrape the metrics
curl http://localhost:8000/metrics

10. Troubleshooting

10.1 Common Issues

Issue 1: CUDA Out of Memory

Symptom:

torch.cuda.OutOfMemoryError: CUDA out of memory

Solutions:

bash
# Option 1: Lower GPU memory utilization
--gpu-memory-utilization 0.8

# Option 2: Reduce the maximum sequence length
--max-model-len 4096

# Option 3: Use quantized weights
--quantization awq

# Option 4: Use a smaller model

Issue 2: Function Calling Does Not Work

Symptom:

json
{
  "tool_calls": []  // 空数组
}

Diagnostic steps:

bash
# 1. Check that the tool-call parser is enabled;
#    the startup command must include:
--tool-call-parser hermes \
--enable-auto-tool-choice

# 2. Inspect the model's raw output
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '...' | jq '.choices[0].message.content'

# 3. If the tool call appears in content but tool_calls is empty,
#    the parser is not matching the output; use a custom parser (section 7)

Issue 3: Port Already in Use

Symptom:

OSError: [Errno 98] Address already in use

Solutions:

bash
# Find the process using the port
sudo lsof -i :8000

# Kill it
sudo kill -9 <PID>

# Or use a different port
--port 8001

Issue 4: Model Fails to Load

Symptom:

FileNotFoundError: config.json not found

Solutions:

bash
# Check model file integrity
ls -la /data/models/your-model/

# Required files:
# - config.json
# - tokenizer_config.json
# - *.safetensors or *.bin

# Re-download the model
modelscope download --model xxx --local_dir /data/models/xxx

10.2 Log Analysis

bash
# View vLLM logs
sudo journalctl -u vllm -n 100 --no-pager

# Follow logs in real time
sudo journalctl -u vllm -f

# Show error-level logs
sudo journalctl -u vllm -p err

# Export logs to a file
sudo journalctl -u vllm > vllm.log

10.3 Performance Diagnostics

bash
# Monitor GPU usage
watch -n 1 nvidia-smi

# Detailed GPU metrics
nvidia-smi dmon -s pucvmet

# Per-process GPU usage
nvidia-smi pmon

# System memory usage
free -h

10.4 Network Tests

bash
# Test API reachability
curl http://localhost:8000/health

# Measure response time
time curl http://localhost:8000/v1/models

# Load test (optional)
ab -n 100 -c 10 http://localhost:8000/health

Appendix

A. Complete Startup Script Example

File: start_production.sh

bash
#!/bin/bash

# Configuration
MODEL_PATH="/data/models/Qwen2.5-Coder-32B-Instruct"
PORT=8000
API_KEY="sk-$(openssl rand -hex 32)"
HOST="0.0.0.0"

# Activate the virtual environment
source /root/vllm-env/bin/activate

# Stop any existing service
sudo systemctl stop vllm 2>/dev/null || true
pkill -f "vllm.entrypoints.openai.api_server" || true
sleep 2

# Start vLLM
echo "Starting the vLLM service..."
echo "API Key: $API_KEY"

python -m vllm.entrypoints.openai.api_server \
    --model "$MODEL_PATH" \
    --host "$HOST" \
    --port "$PORT" \
    --api-key "$API_KEY" \
    --served-model-name qwen2.5-coder \
    --trust-remote-code \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9 \
    --tool-call-parser hermes \
    --enable-auto-tool-choice \
    --disable-log-requests

B. Python Client Example

python
from openai import OpenAI

# Initialize the client
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-your-secret-key-here"
)

# Plain chat completion
response = client.chat.completions.create(
    model="qwen2.5-coder",
    messages=[
        {"role": "system", "content": "你是一个专业的Python工程师"},
        {"role": "user", "content": "写一个二分查找算法"}
    ],
    temperature=0.7,
    max_tokens=1000
)

print(response.choices[0].message.content)

# Function Call
response = client.chat.completions.create(
    model="qwen2.5-coder",
    messages=[
        {"role": "user", "content": "帮我查询北京和上海的天气"}
    ],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "获取城市天气",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string", "description": "城市名"}
                    },
                    "required": ["location"]
                }
            }
        }
    ],
    tool_choice="auto"
)

# Handle tool calls
if response.choices[0].message.tool_calls:
    for tool_call in response.choices[0].message.tool_calls:
        print(f"调用工具: {tool_call.function.name}")
        print(f"参数: {tool_call.function.arguments}")

C. Environment Variable Configuration

bash
# .env file
CUDA_VISIBLE_DEVICES=0,1
VLLM_WORKER_MULTIPROC_METHOD=spawn
VLLM_LOGGING_LEVEL=INFO
VLLM_ATTENTION_BACKEND=FLASHINFER
OMP_NUM_THREADS=8

D. Resource Monitoring Script

bash
#!/bin/bash
# monitor.sh

while true; do
    clear
    echo "=== vLLM资源监控 ==="
    echo
    echo "GPU状态:"
    nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv,noheader
    echo
    echo "服务状态:"
    systemctl status vllm --no-pager | head -10
    echo
    echo "最近日志:"
    journalctl -u vllm -n 5 --no-pager
    sleep 5
done