Reference: https://zhuanlan.zhihu.com/p/29950052712
The model has already been downloaded to /mnt/jfs6/model/GLM-4.6.
This is a shared path that every machine in the cluster can access.
Make sure the machines can communicate with each other normally, in particular over the high-speed network used for distributed training/inference.
Install the diagnostic tools:
```bash
apt-get update
apt-get install net-tools
apt-get install infiniband-diags libibverbs-dev
```
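With those packages in place, a quick sanity check of the RDMA and TCP interfaces looks roughly like this (a sketch; eth0 and the mlx5_* devices shown later are specific to this cluster, yours may differ):
```bash
# List the RDMA HCAs and confirm each port is Active
ibstat

# ibstatus also prints the link_layer, which should be Ethernet (RoCE) on every node
ibstatus

# Check the TCP interface that NCCL/GLOO bootstrap traffic will use
ifconfig eth0
```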
The key point is that all machines must use the same type of interconnect (e.g. InfiniBand, RoCE, or plain TCP over Ethernet).
Some of this configuration is already in place on my machine:
```bash
SOCKET_IP=100.96.130.36
NODE_NAME=gpu-a800-0445.host.platform.basemind.com
GPU_TYPE=A800-SXM4-80GB
GPU_VENDOR=NVIDIA
NODE_RANK=0
NODE_COUNT=1
MASTER_ADDR=localhost
PROC_PER_NODE=8
JOB_ID=hello-qlkbz-112321-worker-0
RDMA_NETWORK_LINK_TYPE=RoCE
LLDP_INFO_FILE=/host-config/lldp-info.txt
USE_BRAINVF=false
NET_IF_PREFIX=net
Verifying all RoCEv2 GID indexes equal to 5 ...
NCCL_SOCKET_IFNAME=eth0
NCCL_IB_HCA=mlx5_5,mlx5_6,mlx5_7,mlx5_8
NCCL_IB_GID_INDEX=5
LLDP_INFO_FILE=/host-config/lldp-info.txt
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
2025-11-18 19:35:54: Ready to go.
```
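The startup log above already verifies that the RoCEv2 GID index is 5 on every HCA. If you want to reproduce that check by hand, the GID table is exposed through sysfs (a sketch; the mlx5_* names, port 1, and index 5 are specific to this cluster):
```bash
for hca in mlx5_5 mlx5_6 mlx5_7 mlx5_8; do
  gid=$(cat /sys/class/infiniband/$hca/ports/1/gids/5)
  typ=$(cat /sys/class/infiniband/$hca/ports/1/gid_attrs/types/5)
  echo "$hca index 5: $gid ($typ)"   # the type should read "RoCE v2"
done
```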
On my machine:
```bash
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 100.96.130.36  netmask 255.255.255.0  broadcast 100.96.130.255
        inet6 fe80::8a66:64ff:fe60:8224  prefixlen 64  scopeid 0x20<link>
        ether 88:66:64:60:82:24  txqueuelen 0  (Ethernet)
        RX packets 10312  bytes 26542167 (26.5 MB)
        RX errors 0  dropped 10  overruns 0  frame 0
        TX packets 6875  bytes 797124 (797.1 KB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
```
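Since eth0 reports an MTU of 9000, it is worth confirming that jumbo frames actually make it between nodes end to end; a path that clamps or fragments them will hurt the socket traffic. A minimal check (the target IP is worker node 1 from later in this post; 8972 = 9000 minus the 28-byte IP/ICMP headers):
```bash
# -M do forbids fragmentation, so the ping fails if any hop clamps the MTU below 9000
ping -M do -s 8972 -c 3 100.96.157.134
```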
What I already have:
```bash
NCCL_SOCKET_IFNAME=eth0
NCCL_IB_HCA=mlx5_5,mlx5_6,mlx5_7,mlx5_8
NCCL_IB_GID_INDEX=5
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
```
What I still need to set:
```bash
export VLLM_HOST_IP=100.96.130.36
export GLOO_SOCKET_IFNAME=eth0
```
To summarize, the following environment variables are required:
```bash
export NCCL_SOCKET_IFNAME=eth0                    # NIC used for NCCL bootstrap/control traffic
export NCCL_IB_HCA=mlx5_5,mlx5_6,mlx5_7,mlx5_8    # RoCE HCAs used for NCCL data transfer
export NCCL_IB_GID_INDEX=5                        # GID index of the RoCE NICs
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7       # GPUs visible to the processes
export VLLM_HOST_IP=100.96.130.36                 # this machine's IP address
export GLOO_SOCKET_IFNAME=eth0                    # NIC used by the GLOO backend
```
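Because /mnt/jfs6 is shared, one way to avoid retyping these on every machine is to keep the common part in a small env file and derive the per-node IP automatically. This is just a convenience sketch; the file name cluster_env.sh is made up here:
```bash
cat > /mnt/jfs6/cluster_env.sh <<'EOF'
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_HCA=mlx5_5,mlx5_6,mlx5_7,mlx5_8
export NCCL_IB_GID_INDEX=5
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export GLOO_SOCKET_IFNAME=eth0
# VLLM_HOST_IP differs per node, so derive it from eth0 instead of hard-coding it
export VLLM_HOST_IP=$(ip -4 -o addr show eth0 | awk '{print $4}' | cut -d/ -f1)
EOF

# Then on every node:
source /mnt/jfs6/cluster_env.sh
```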
On worker node 1, use node 1's IP:
```bash
export VLLM_HOST_IP=100.96.157.134
export GLOO_SOCKET_IFNAME=eth0
```
On worker node 2, use node 2's IP:
```bash
export VLLM_HOST_IP=100.96.153.93
export GLOO_SOCKET_IFNAME=eth0
```
On worker node 3, use node 3's IP:
```bash
export VLLM_HOST_IP=100.96.162.111
export GLOO_SOCKET_IFNAME=eth0
```
On the head node, start Ray:
```bash
ray start --head --port=6667 --disable-usage-stats
```
On each worker node, join the cluster:
```bash
# Connect to the head node
ray start --address='100.96.130.36:6667'
```
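If a machine has several interfaces and Ray registers under the wrong one, you can pin the node IP explicitly when joining (a sketch using the standard ray start flag; VLLM_HOST_IP was exported above):
```bash
ray start --address='100.96.130.36:6667' --node-ip-address="$VLLM_HOST_IP"
```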
Verify the cluster (from any node):
```bash
root@hello-qlkbz-112321-worker-0:/vllm-workspace# ray status
======== Autoscaler status: 2025-11-18 19:52:13.446348 ========
Node status
---------------------------------------------------------------
Active:
 1 node_f1da788b6db931f0e4932e22adf106f5a0551ef6c46f50bf703b83a7
 1 node_cb211ad44d71b954e0c0919d7d991a07cf60fd5a64fc114d35b96f1b
 1 node_be3c7805650c4e8e99f69cfc5c9e1418834fedeba8d02e4feb160968
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 0.0/216.0 CPU
 0.0/24.0 GPU
 0B/1.50TiB memory
 0B/558.79GiB object_store_memory

Total Constraints:
 (no request_resources() constraints)
Total Demands:
 (no resource demands)
```
Create the NCCL check script:
vim /mnt/jfs6/check_nccl.py
```python
# Test PyTorch NCCL
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
data = torch.FloatTensor([1,] * 128).to("cuda")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
value = data.mean().item()
world_size = dist.get_world_size()
assert value == world_size, f"Expected {world_size}, got {value}"

print("PyTorch NCCL is successful!")

# Test PyTorch GLOO
gloo_group = dist.new_group(ranks=list(range(world_size)), backend="gloo")
cpu_data = torch.FloatTensor([1,] * 128)
dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group)
value = cpu_data.mean().item()
assert value == world_size, f"Expected {world_size}, got {value}"

print("PyTorch GLOO is successful!")

if world_size <= 1:
    exit()

# Test vLLM NCCL, with cuda graph
from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator

pynccl = PyNcclCommunicator(group=gloo_group, device=local_rank)
# pynccl is enabled by default for 0.6.5+,
# but for 0.6.4 and below, we need to enable it manually.
# keep the code for backward compatibility because people
# prefer to read the latest documentation.
pynccl.disabled = False

s = torch.cuda.Stream()
with torch.cuda.stream(s):
    data.fill_(1)
    out = pynccl.all_reduce(data, stream=s)
    value = out.mean().item()
    assert value == world_size, f"Expected {world_size}, got {value}"

print("vLLM NCCL is successful!")

# Capture the all_reduce into a CUDA graph, then replay it
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(cuda_graph=g, stream=s):
    out = pynccl.all_reduce(data, stream=torch.cuda.current_stream())

data.fill_(1)
g.replay()
torch.cuda.current_stream().synchronize()
value = out.mean().item()
assert value == world_size, f"Expected {world_size}, got {value}"

print("vLLM NCCL with cuda graph is successful!")

dist.destroy_process_group(gloo_group)
dist.destroy_process_group()
```
To check all four machines, run the following on every machine:
```bash
NCCL_DEBUG=TRACE torchrun --nnodes 4 --nproc-per-node=8 --rdzv_backend=c10d --rdzv_endpoint=100.96.130.36:8887 /mnt/jfs6/check_nccl.py
```
On the head node, launch the vLLM OpenAI-compatible API server (tensor parallel = 8 within a node, pipeline parallel = 4 across nodes, scheduled through the Ray cluster set up above):
```bash
# Turn on verbose logs at startup
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

python3 -m vllm.entrypoints.openai.api_server \
    --model=/mnt/jfs6/model/GLM-4.6 \
    --dtype=auto \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 4 \
    --trust-remote-code \
    --port 8000
```
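Loading a model of this size across nodes takes a while; before sending real requests you can poll the OpenAI-compatible /v1/models route until the server answers (a quick check, nothing beyond the default routes):
```bash
# Repeats until the API server is up and reports the served model
until curl -sf http://100.96.130.36:8000/v1/models; do
  sleep 10
done
```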
Request test:
```bash
curl http://100.96.130.36:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "/mnt/jfs6/model/GLM-4.6", "messages": [{"role": "user", "content": "你好"}]}'
```
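The same endpoint accepts the usual OpenAI-style sampling fields; for example, a streamed request with an explicit token budget (the parameter values are just illustrative):
```bash
curl http://100.96.130.36:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "/mnt/jfs6/model/GLM-4.6",
        "messages": [{"role": "user", "content": "Introduce yourself in one sentence."}],
        "max_tokens": 128,
        "temperature": 0.7,
        "stream": true
      }'
```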
Speed test. The two key parameters are:
```bash
--num-prompts 100     # total number of requests to send
--request-rate 10     # requests per second (QPS)
```
```bash
# Once the server is up, run the serving benchmark (recent vLLM ships it as the `vllm bench serve` CLI)
vllm bench serve \
    --backend vllm \
    --base-url http://100.96.130.36:8000 \
    --model /mnt/jfs6/model/GLM-4.6 \
    --dataset-name random \
    --num-prompts 100 \
    --request-rate 10
```


Author: Dong
Link:
Copyright notice: Unless otherwise stated, all posts on this blog are licensed under CC BY-NC (Creative Commons Attribution-NonCommercial 4.0 International). You may freely repost and adapt them for non-commercial use, provided you credit the source and link back to the original author.