With my modified code, the actual token length of each sample can be measured like this:
IMAGE_MAX_TOKEN_NUM=5000 USE_AUDIO_IN_VIDEO=true VIDEO_MAX_PIXELS=307200 \
python scripts/count_token_lengths.py \
--model /mnt/cpfs/model/Qwen3-Omni-30B-A3B-Instruct \
--dataset /mnt/cpfs/datasets/ShortsTemplateEdits/anno_jianying_mp4_seed18_train.jsonl \
--num_samples 200 \
--dataset_num_proc 16
... --num_samples 500
... --num_samples 0
... --output_csv /tmp/lengths.csv
Result:
[INFO] Tokenization done in 760.2s (0.3 samples/s)
============================================================
Token Length Statistics
============================================================
Total dataset size: 3586
Sampled: 200 (seed=42)
Successfully tokenized: 200
Min length: 3419 (dataset index: 299)
Max length: 63197 (dataset index: 1289)
Mean: 15356.8
Median: 13908.0
Std: 9368.6
------------------------------------------------------------
Percentiles
------------------------------------------------------------
P1 : 3471
P5 : 4935
P10 : 6145
P25 : 9014
P50 : 13908
P75 : 18729
P90 : 23731
P95 : 31902
P99 : 52786
------------------------------------------------------------
Distribution Histogram (bucket=2000)
------------------------------------------------------------
[ 2000, 4000): 3 ( 1.5%) ####
[ 4000, 6000): 14 ( 7.0%) ######################
[ 6000, 8000): 22 ( 11.0%) ###################################
[ 8000, 10000): 20 ( 10.0%) ###############################
[ 10000, 12000): 21 ( 10.5%) #################################
[ 12000, 14000): 22 ( 11.0%) ###################################
[ 14000, 16000): 22 ( 11.0%) ###################################
[ 16000, 18000): 18 ( 9.0%) ############################
[ 18000, 20000): 20 ( 10.0%) ###############################
[ 20000, 22000): 10 ( 5.0%) ###############
[ 22000, 24000): 8 ( 4.0%) ############
[ 24000, 26000): 3 ( 1.5%) ####
[ 26000, 28000): 4 ( 2.0%) ######
[ 28000, 30000): 1 ( 0.5%) #
[ 30000, 32000): 2 ( 1.0%) ###
[ 32000, 34000): 1 ( 0.5%) #
[ 34000, 36000): 2 ( 1.0%) ###
[ 38000, 40000): 1 ( 0.5%) #
[ 42000, 44000): 1 ( 0.5%) #
[ 44000, 46000): 1 ( 0.5%) #
[ 46000, 48000): 1 ( 0.5%) #
[ 52000, 54000): 1 ( 0.5%) #
[ 60000, 62000): 1 ( 0.5%) #
[ 62000, 64000]: 1 ( 0.5%) #
------------------------------------------------------------
Max Length Threshold Analysis
------------------------------------------------------------
Samples <= 2048: 0 / 200 ( 0.0%)
Samples <= 4096: 3 / 200 ( 1.5%)
Samples <= 8192: 41 / 200 ( 20.5%)
Samples <= 12000: 80 / 200 ( 40.0%)
Samples <= 16384: 127 / 200 ( 63.5%)
Samples <= 20000: 162 / 200 ( 81.0%)
Samples <= 24000: 180 / 200 ( 90.0%)
Samples <= 32768: 191 / 200 ( 95.5%)
Samples <= 49152: 197 / 200 ( 98.5%)
Samples <= 65536: 200 / 200 (100.0%)
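
The threshold table above boils down to one question: for a given max length, what fraction of samples fits? The following is a minimal sketch of that calculation in plain Python, not the actual logic of `count_token_lengths.py`. The helper names (`percentile`, `coverage`, `smallest_threshold`) and the toy length list are hypothetical; in practice the lengths would come from the exported CSV, whose column layout is not shown here, so the sketch just takes a list of integers.

```python
import math


def percentile(sorted_lengths, q):
    """q-th percentile (0-100) with linear interpolation between neighbours."""
    if not sorted_lengths:
        raise ValueError("no lengths given")
    pos = (len(sorted_lengths) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_lengths[lo] * (1 - frac) + sorted_lengths[hi] * frac


def coverage(lengths, max_length):
    """Fraction of samples whose token length is <= max_length."""
    return sum(1 for n in lengths if n <= max_length) / len(lengths)


def smallest_threshold(lengths, candidates, target):
    """Smallest candidate max_length that keeps at least `target` of the samples."""
    for candidate in sorted(candidates):
        if coverage(lengths, candidate) >= target:
            return candidate
    return None  # no candidate reaches the target coverage


# Toy data for illustration only (not the real 200-sample run).
toy_lengths = [3419, 9014, 13908, 18729, 63197]
candidates = [8192, 16384, 32768, 65536]

print("median:", percentile(sorted(toy_lengths), 50))
for c in candidates:
    print(f"<= {c}: {coverage(toy_lengths, c):.1%}")
print("smallest max_length keeping 80%:", smallest_threshold(toy_lengths, candidates, 0.8))
```

Applied to the real run, the same reasoning reads straight off the table: a max length of 32768 keeps 95.5% of the sampled data, while 16384 keeps only 63.5%.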


Author: Dong
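Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC (Creative Commons Attribution-NonCommercial 4.0 International). You may freely repost and modify this work for non-commercial purposes, provided that you credit the source and link to the original author.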