Reference: https://zhuanlan.zhihu.com/p/29950052712
The model has already been downloaded to /mnt/jfs6/model/GLM-4.6.
This is a shared path that every machine in the cluster can access.
Make sure the machines can communicate with each other normally, in particular over the high-speed network used for distributed training/inference.
Install the diagnostic tools:
```bash
apt-get update
apt-get install net-tools
apt-get install infiniband-diags libibverbs-dev
```
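With those packages in place, a quick sanity check of the RDMA and TCP interfaces looks roughly like this (a sketch; eth0 and the mlx5_* devices shown later are specific to this cluster, yours may differ):
```bash
# List the RDMA HCAs and confirm each port is Active
ibstat

# ibstatus also prints the link_layer, which should be Ethernet (RoCE) on every node
ibstatus

# Check the TCP interface that NCCL/GLOO bootstrap traffic will use
ifconfig eth0
```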
The key point is that all machines must use the same type of interconnect (e.g. InfiniBand, RoCE, or plain TCP over Ethernet).
Some of this configuration is already in place on my machine:
```bash
SOCKET_IP=100.96.130.36
NODE_NAME=gpu-a800-0445.host.platform.basemind.com
GPU_TYPE=A800-SXM4-80GB
GPU_VENDOR=NVIDIA
NODE_RANK=0
NODE_COUNT=1
MASTER_ADDR=localhost
PROC_PER_NODE=8
JOB_ID=hello-qlkbz-112321-worker-0
RDMA_NETWORK_LINK_TYPE=RoCE
LLDP_INFO_FILE=/host-config/lldp-info.txt
USE_BRAINVF=false
NET_IF_PREFIX=net
Verifying all RoCEv2 GID indexes equal to 5 ...
NCCL_SOCKET_IFNAME=eth0
NCCL_IB_HCA=mlx5_5,mlx5_6,mlx5_7,mlx5_8
NCCL_IB_GID_INDEX=5
LLDP_INFO_FILE=/host-config/lldp-info.txt
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
2025-11-18 19:35:54: Ready to go.
```
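The startup log above already verifies that the RoCEv2 GID index is 5 on every HCA. If you want to reproduce that check by hand, the GID table is exposed through sysfs (a sketch; the mlx5_* names, port 1, and index 5 are specific to this cluster):
```bash
for hca in mlx5_5 mlx5_6 mlx5_7 mlx5_8; do
  gid=$(cat /sys/class/infiniband/$hca/ports/1/gids/5)
  typ=$(cat /sys/class/infiniband/$hca/ports/1/gid_attrs/types/5)
  echo "$hca index 5: $gid ($typ)"   # the type should read "RoCE v2"
done
```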
On my machine:
```bash
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 100.96.130.36  netmask 255.255.255.0  broadcast 100.96.130.255
        inet6 fe80::8a66:64ff:fe60:8224  prefixlen 64  scopeid 0x20<link>
        ether 88:66:64:60:82:24  txqueuelen 0  (Ethernet)
        RX packets 10312  bytes 26542167 (26.5 MB)
        RX errors 0  dropped 10  overruns 0  frame 0
        TX packets 6875  bytes 797124 (797.1 KB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
```
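Since eth0 reports an MTU of 9000, it is worth confirming that jumbo frames actually make it between nodes end to end; a path that clamps or fragments them will hurt the socket traffic. A minimal check (the target IP is worker node 1 from later in this post; 8972 = 9000 minus the 28-byte IP/ICMP headers):
```bash
# -M do forbids fragmentation, so the ping fails if any hop clamps the MTU below 9000
ping -M do -s 8972 -c 3 100.96.157.134
```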
What I already have:
```bash
NCCL_SOCKET_IFNAME=eth0
NCCL_IB_HCA=mlx5_5,mlx5_6,mlx5_7,mlx5_8
NCCL_IB_GID_INDEX=5
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
```
What I still need to set:
```bash
export VLLM_HOST_IP=100.96.130.36
export GLOO_SOCKET_IFNAME=eth0
```
To summarize, the following environment variables are required:
```bash
export NCCL_SOCKET_IFNAME=eth0                    # NIC used for NCCL bootstrap/control traffic
export NCCL_IB_HCA=mlx5_5,mlx5_6,mlx5_7,mlx5_8    # RoCE HCAs used for NCCL data transfer
export NCCL_IB_GID_INDEX=5                        # GID index of the RoCE NICs
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7       # GPUs visible to the processes
export VLLM_HOST_IP=100.96.130.36                 # this machine's IP address
export GLOO_SOCKET_IFNAME=eth0                    # NIC used by the GLOO backend
```
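Because /mnt/jfs6 is shared, one way to avoid retyping these on every machine is to keep the common part in a small env file and derive the per-node IP automatically. This is just a convenience sketch; the file name cluster_env.sh is made up here:
```bash
cat > /mnt/jfs6/cluster_env.sh <<'EOF'
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_HCA=mlx5_5,mlx5_6,mlx5_7,mlx5_8
export NCCL_IB_GID_INDEX=5
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export GLOO_SOCKET_IFNAME=eth0
# VLLM_HOST_IP differs per node, so derive it from eth0 instead of hard-coding it
export VLLM_HOST_IP=$(ip -4 -o addr show eth0 | awk '{print $4}' | cut -d/ -f1)
EOF

# Then on every node:
source /mnt/jfs6/cluster_env.sh
```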
On worker node 1, use node 1's IP:
```bash
export VLLM_HOST_IP=100.96.157.134
export GLOO_SOCKET_IFNAME=eth0
```
On worker node 2, use node 2's IP:
```bash
export VLLM_HOST_IP=100.96.153.93
export GLOO_SOCKET_IFNAME=eth0
```
On worker node 3, use node 3's IP:
```bash
export VLLM_HOST_IP=100.96.162.111
export GLOO_SOCKET_IFNAME=eth0
```
On the head node, start Ray:
```bash
ray start --head --port=6667 --disable-usage-stats
```
On each worker node, join the cluster:
```bash
# Connect to the head node
ray start --address='100.96.130.36:6667'
```
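If a machine has several interfaces and Ray registers under the wrong one, you can pin the node IP explicitly when joining (a sketch using the standard ray start flag; VLLM_HOST_IP was exported above):
```bash
ray start --address='100.96.130.36:6667' --node-ip-address="$VLLM_HOST_IP"
```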
Verify the cluster (from any node):
```bash
root@hello-qlkbz-112321-worker-0:/vllm-workspace# ray status
======== Autoscaler status: 2025-11-18 19:52:13.446348 ========
Node status
---------------------------------------------------------------
Active:
 1 node_f1da788b6db931f0e4932e22adf106f5a0551ef6c46f50bf703b83a7
 1 node_cb211ad44d71b954e0c0919d7d991a07cf60fd5a64fc114d35b96f1b
 1 node_be3c7805650c4e8e99f69cfc5c9e1418834fedeba8d02e4feb160968
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 0.0/216.0 CPU
 0.0/24.0 GPU
 0B/1.50TiB memory
 0B/558.79GiB object_store_memory

Total Constraints:
 (no request_resources() constraints)
Total Demands:
 (no resource demands)
```
Create the NCCL check script:
vim /mnt/jfs6/check_nccl.py
```python
# Test PyTorch NCCL
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
data = torch.FloatTensor([1,] * 128).to("cuda")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
value = data.mean().item()
world_size = dist.get_world_size()
assert value == world_size, f"Expected {world_size}, got {value}"

print("PyTorch NCCL is successful!")

# Test PyTorch GLOO
gloo_group = dist.new_group(ranks=list(range(world_size)), backend="gloo")
cpu_data = torch.FloatTensor([1,] * 128)
dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group)
value = cpu_data.mean().item()
assert value == world_size, f"Expected {world_size}, got {value}"

print("PyTorch GLOO is successful!")

if world_size <= 1:
    exit()

# Test vLLM NCCL, with cuda graph
from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator

pynccl = PyNcclCommunicator(group=gloo_group, device=local_rank)
# pynccl is enabled by default for 0.6.5+,
# but for 0.6.4 and below, we need to enable it manually.
# keep the code for backward compatibility because people
# prefer to read the latest documentation.
pynccl.disabled = False

s = torch.cuda.Stream()
with torch.cuda.stream(s):
    data.fill_(1)
    out = pynccl.all_reduce(data, stream=s)
    value = out.mean().item()
    assert value == world_size, f"Expected {world_size}, got {value}"

print("vLLM NCCL is successful!")

# Capture the all_reduce into a CUDA graph, then replay it
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(cuda_graph=g, stream=s):
    out = pynccl.all_reduce(data, stream=torch.cuda.current_stream())

data.fill_(1)
g.replay()
torch.cuda.current_stream().synchronize()
value = out.mean().item()
assert value == world_size, f"Expected {world_size}, got {value}"

print("vLLM NCCL with cuda graph is successful!")

dist.destroy_process_group(gloo_group)
dist.destroy_process_group()
```
To check all four machines, run the following on every machine:
```bash
NCCL_DEBUG=TRACE torchrun --nnodes 4 --nproc-per-node=8 --rdzv_backend=c10d --rdzv_endpoint=100.96.130.36:8887 /mnt/jfs6/check_nccl.py
```
On the head node, launch the vLLM OpenAI-compatible API server (tensor parallel = 8 within a node, pipeline parallel = 4 across nodes, scheduled through the Ray cluster set up above):
```bash
# Turn on verbose logs at startup
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

python3 -m vllm.entrypoints.openai.api_server \
    --model=/mnt/jfs6/model/GLM-4.6 \
    --dtype=auto \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 4 \
    --trust-remote-code \
    --port 8000
```
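Loading a model of this size across nodes takes a while; before sending real requests you can poll the OpenAI-compatible /v1/models route until the server answers (a quick check, nothing beyond the default routes):
```bash
# Repeats until the API server is up and reports the served model
until curl -sf http://100.96.130.36:8000/v1/models; do
  sleep 10
done
```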
Request test:
```bash
curl http://100.96.130.36:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "/mnt/jfs6/model/GLM-4.6", "messages": [{"role": "user", "content": "你好"}]}'
```
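The same endpoint accepts the usual OpenAI-style sampling fields; for example, a streamed request with an explicit token budget (the parameter values are just illustrative):
```bash
curl http://100.96.130.36:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "/mnt/jfs6/model/GLM-4.6",
        "messages": [{"role": "user", "content": "Introduce yourself in one sentence."}],
        "max_tokens": 128,
        "temperature": 0.7,
        "stream": true
      }'
```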
Speed test. The two key parameters are:
```bash
--num-prompts 100     # total number of requests to send
--request-rate 10     # requests per second (QPS)
```
```bash
# Once the server is up, run the serving benchmark (recent vLLM ships it as the `vllm bench serve` CLI)
vllm bench serve \
    --backend vllm \
    --base-url http://100.96.130.36:8000 \
    --model /mnt/jfs6/model/GLM-4.6 \
    --dataset-name random \
    --num-prompts 100 \
    --request-rate 10
```


Author: Dong
Link:
Copyright notice: Unless otherwise stated, all posts on this blog are licensed under CC BY-NC (Creative Commons Attribution-NonCommercial 4.0 International). You may freely repost and adapt them for non-commercial use, provided you credit the source and link back to the original author.