vLLM Multi-Node Deployment Notes: From Environment Setup to Launching the GLM-4.6 Service (RoCE Network)
2025-11-19
Deep Learning

Table of Contents

Multi-node environment setup
Ray multi-node startup
Network check script
Starting the GLM-4.6 service (on the head node)
Testing
Summary

Reference: https://zhuanlan.zhihu.com/p/29950052712

The model has already been downloaded to: /mnt/jfs6/model/GLM-4.6

This is a shared path that every machine in the cluster can access.

Multi-node environment setup

Make sure the machines can communicate with each other, in particular over the high-speed network used for distributed training/inference.

Install the tools:

bash
apt-get update
apt-get install net-tools
apt-get install infiniband-diags libibverbs-dev

The key point is that all machines must use the same type of interconnect (a quick way to check which one you have is shown after the list). The possible types:

  • InfiniBand (IB) - fastest
  • RoCE (RDMA over Converged Ethernet) - next fastest
  • Plain Ethernet - slowest, but most common
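
To see which link type each RDMA NIC uses, ibv_devinfo and ibstat are enough; both come from the packages installed above, though on some distros ibv_devinfo lives in the separate ibverbs-utils package, so treat that as an assumption. "link_layer: InfiniBand" means native IB, while "link_layer: Ethernet" on an mlx5 device means RoCE.

bash
# List every HCA and its link layer; "Ethernet" on an mlx5_* device means RoCE.
ibv_devinfo | grep -E "hca_id|link_layer"

# ibstat shows the same per-port information for a single HCA:
ibstat mlx5_5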

Some of this configuration was already in place on my machine:

bash
SOCKET_IP=100.96.130.36
NODE_NAME=gpu-a800-0445.host.platform.basemind.com
GPU_TYPE=A800-SXM4-80GB
GPU_VENDOR=NVIDIA
NODE_RANK=0
NODE_COUNT=1
MASTER_ADDR=localhost
PROC_PER_NODE=8
JOB_ID=hello-qlkbz-112321-worker-0
RDMA_NETWORK_LINK_TYPE=RoCE
LLDP_INFO_FILE=/host-config/lldp-info.txt
USE_BRAINVF=false
NET_IF_PREFIX=net
Verifying all RoCEv2 GID indexes equal to 5 ...
NCCL_SOCKET_IFNAME=eth0
NCCL_IB_HCA=mlx5_5,mlx5_6,mlx5_7,mlx5_8
NCCL_IB_GID_INDEX=5
LLDP_INFO_FILE=/host-config/lldp-info.txt
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
2025-11-18 19:35:54: Ready to go.

On my machine:

bash
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 100.96.130.36  netmask 255.255.255.0  broadcast 100.96.130.255
        inet6 fe80::8a66:64ff:fe60:8224  prefixlen 64  scopeid 0x20<link>
        ether 88:66:64:60:82:24  txqueuelen 0  (Ethernet)
        RX packets 10312  bytes 26542167 (26.5 MB)
        RX errors 0  dropped 10  overruns 0  frame 0
        TX packets 6875  bytes 797124 (797.1 KB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

Environment variables I already have:

bash
NCCL_SOCKET_IFNAME=eth0
NCCL_IB_HCA=mlx5_5,mlx5_6,mlx5_7,mlx5_8
NCCL_IB_GID_INDEX=5
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

Environment variables I still need:

bash
export VLLM_HOST_IP=100.96.130.36
export GLOO_SOCKET_IFNAME=eth0

To summarize, these environment variables are needed:

bash
export NCCL_SOCKET_IFNAME=eth0                    # NIC for NCCL control traffic
export NCCL_IB_HCA=mlx5_5,mlx5_6,mlx5_7,mlx5_8    # RoCE high-speed NICs for NCCL data transfer
export NCCL_IB_GID_INDEX=5                        # GID index of the RoCE NICs
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7       # visible GPU IDs
export VLLM_HOST_IP=100.96.130.36                 # this machine's IP address
export GLOO_SOCKET_IFNAME=eth0                    # NIC for the GLOO backend
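
If you need to confirm that GID index 5 really is the RoCEv2 entry, sysfs exposes it. A minimal sketch, assuming the device naming from my setup (HCA mlx5_5, port 1):

bash
# Each file under gid_attrs/types reports "IB/RoCE v1" or "RoCE v2".
for t in /sys/class/infiniband/mlx5_5/ports/1/gid_attrs/types/*; do
    echo "GID index $(basename "$t"): $(cat "$t" 2>/dev/null)"
done

# The GID value at the index NCCL is configured to use:
cat /sys/class/infiniband/mlx5_5/ports/1/gids/5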

Ray multi-node startup

On worker node 1 (using node 1's IP):

bash
export VLLM_HOST_IP=100.96.157.134
export GLOO_SOCKET_IFNAME=eth0

On worker node 2 (using node 2's IP):

bash
export VLLM_HOST_IP=100.96.153.93
export GLOO_SOCKET_IFNAME=eth0

On worker node 3 (using node 3's IP):

bash
export VLLM_HOST_IP=100.96.162.111
export GLOO_SOCKET_IFNAME=eth0

On the head node (with the environment variables above set in the same shell), run:

bash
ray start --head --port=6667 --disable-usage-stats

On each worker node, run:

bash
# connect to the head node
ray start --address='100.96.130.36:6667'

Verify the cluster (on any node); you should see one active entry per machine and the aggregate GPU count (here 3 nodes × 8 = 24 GPUs):

bash
root@hello-qlkbz-112321-worker-0:/vllm-workspace# ray status
======== Autoscaler status: 2025-11-18 19:52:13.446348 ========
Node status
---------------------------------------------------------------
Active:
 1 node_f1da788b6db931f0e4932e22adf106f5a0551ef6c46f50bf703b83a7
 1 node_cb211ad44d71b954e0c0919d7d991a07cf60fd5a64fc114d35b96f1b
 1 node_be3c7805650c4e8e99f69cfc5c9e1418834fedeba8d02e4feb160968
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 0.0/216.0 CPU
 0.0/24.0 GPU
 0B/1.50TiB memory
 0B/558.79GiB object_store_memory

Total Constraints:
 (no request_resources() constraints)
Total Demands:
 (no resource demands)

Network check script

vim /mnt/jfs6/check_nccl.py

python
# Test PyTorch NCCL
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
data = torch.FloatTensor([1,] * 128).to("cuda")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
value = data.mean().item()
world_size = dist.get_world_size()
assert value == world_size, f"Expected {world_size}, got {value}"

print("PyTorch NCCL is successful!")

# Test PyTorch GLOO
gloo_group = dist.new_group(ranks=list(range(world_size)), backend="gloo")
cpu_data = torch.FloatTensor([1,] * 128)
dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group)
value = cpu_data.mean().item()
assert value == world_size, f"Expected {world_size}, got {value}"

print("PyTorch GLOO is successful!")

if world_size <= 1:
    exit()

# Test vLLM NCCL, with cuda graph
from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator

pynccl = PyNcclCommunicator(group=gloo_group, device=local_rank)
# pynccl is enabled by default for 0.6.5+,
# but for 0.6.4 and below, we need to enable it manually.
# keep the code for backward compatibility because people
# prefer to read the latest documentation.
pynccl.disabled = False

s = torch.cuda.Stream()
with torch.cuda.stream(s):
    data.fill_(1)
    out = pynccl.all_reduce(data, stream=s)
    value = out.mean().item()
    assert value == world_size, f"Expected {world_size}, got {value}"

print("vLLM NCCL is successful!")

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(cuda_graph=g, stream=s):
    out = pynccl.all_reduce(data, stream=torch.cuda.current_stream())

data.fill_(1)
g.replay()
torch.cuda.current_stream().synchronize()
value = out.mean().item()
assert value == world_size, f"Expected {world_size}, got {value}"

print("vLLM NCCL with cuda graph is successful!")

dist.destroy_process_group(gloo_group)
dist.destroy_process_group()

To check all 4 machines, run this on every machine:

bash
NCCL_DEBUG=TRACE torchrun --nnodes 4 --nproc-per-node=8 --rdzv_backend=c10d --rdzv_endpoint=100.96.130.36:8887 /mnt/jfs6/check_nccl.py
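
If the multi-node run hangs, it can help to first sanity-check each machine in isolation. A minimal sketch using the same script:

bash
# Single node, 8 GPUs, no cross-machine rendezvous involved.
NCCL_DEBUG=TRACE torchrun --nnodes 1 --nproc-per-node=8 /mnt/jfs6/check_nccl.py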

Starting the GLM-4.6 service (on the head node)

bash
# enable verbose logging at startup
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

python3 -m vllm.entrypoints.openai.api_server \
  --model=/mnt/jfs6/model/GLM-4.6 \
  --dtype=auto \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 4 \
  --trust-remote-code \
  --port 8000
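
Note that --tensor-parallel-size 8 times --pipeline-parallel-size 4 means 32 GPUs, i.e. all four 8-GPU machines. Loading a model this large takes a while; a simple readiness probe against vLLM's OpenAI-compatible /v1/models route (a sketch) confirms when the server is up:

bash
# Poll until the API server answers, then list the served model.
until curl -s http://100.96.130.36:8000/v1/models > /dev/null; do
    sleep 5
done
curl -s http://100.96.130.36:8000/v1/models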

Testing

Request test:

bash
curl http://100.96.130.36:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "/mnt/jfs6/model/GLM-4.6", "messages": [{"role": "user", "content": "你好"}]}'

Speed test. The two key parameters:

bash
--num-prompts 100     # total number of requests to send
--request-rate 10     # requests per second (QPS)
bash
# once the service is up, run the benchmark
python3 -m vllm.benchmark.api_benchmark \
  --model /mnt/jfs6/model/GLM-4.6 \
  --endpoint http://100.96.130.36:8000/v1/completions \
  --num-prompts 100 \
  --request-rate 10
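
If that module path does not exist in your vLLM build (the benchmark layout varies across versions, so treat the command above as unverified), the standalone script shipped in the vLLM source tree does the same job:

bash
# benchmark_serving.py lives under benchmarks/ in the vLLM repository.
python3 benchmarks/benchmark_serving.py \
  --backend vllm \
  --model /mnt/jfs6/model/GLM-4.6 \
  --base-url http://100.96.130.36:8000 \
  --num-prompts 100 \
  --request-rate 10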

Summary

  • 3 machines did not work; it took 4. My guess is that some weight dimension (the attention head count?) is not divisible by 3 - see the sketch below for checking the model config.
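
One way to test that guess is to look at the relevant dimensions in the model config. A sketch, assuming the usual Hugging Face config.json field names (GLM's config may use different keys):

bash
# Dump the head/layer counts that the parallel sizes must divide evenly.
grep -E '"num_attention_heads"|"num_key_value_heads"|"num_hidden_layers"' \
    /mnt/jfs6/model/GLM-4.6/config.json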

Author: Dong
