For more details, see the official documentation:
https://internvl.readthedocs.io/en/latest/internvl2.0/finetune.html
Training script: internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_full.sh
Docker image: kevinchina/deeplearning:trainintervl-of
Meta File
{
  "your-custom-dataset-1": {
    "root": "path/to/the/image/",
    "annotation": "path/to/the/jsonl/annotation",
    "data_augment": false,
    "max_dynamic_patch": 12,
    "repeat_time": 1,
    "length": "number of samples in the dataset"
  },
  ...
}
Single-image sample example:
{
  "id": 0,
  "image": "images/00000000.jpg",
  "width": 897,
  "height": 1152,
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nCan you extract any readable text from the image?"
    },
    {
      "from": "gpt",
      "value": "Dares Wins Vol. 5 Tommy's Heroes Vol. 6: For Tomorrow Vol. 7: Closing Time miniseries. Clark Kent is being interviewed about Superman's connection to notorious killer Tommy Monaghan. Taking the conversation..."
    }
  ]
}
Bounding-box (detection) data uses this answer format:
<ref>class name</ref><box>[[x1, y1, x2, y2], ...]</box>
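To make the answer format above concrete, here is a minimal sketch of a formatter that produces it. The helper name `format_ref_box` is hypothetical and not part of InternVL; it only reproduces the string layout shown in this post.

```python
def format_ref_box(class_name, boxes):
    """Render one class name and its boxes as <ref>...</ref><box>[[...], ...]</box>.

    `boxes` is a list of [x1, y1, x2, y2] lists. Values are cast to int,
    matching the integer coordinates used in the samples in this post.
    """
    rendered = ", ".join(
        "[" + ", ".join(str(int(v)) for v in box) + "]" for box in boxes
    )
    return f"<ref>{class_name}</ref><box>[{rendered}]</box>"


if __name__ == "__main__":
    print(format_ref_box("文本-地址", [[33, 239, 66, 262]]))
```

The result can be used directly as the "value" of a "gpt" turn in the annotation file.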
My final converted_dataset.jsonl contains one record per line:
{"id": 0, "image": "/img_datasets/img_small_size/didichuxing-20240914171548.jpg", "width": 447, "height": 1000, "conversations": [{"from": "human", "value": "<image>点[56,257]所处位置的信息是什么?"}, {"from": "gpt", "value": "<ref>文本-地址</ref><box>[[33, 239, 66, 262]]</box>"}]}
{"id": 1, "image": "/img_datasets/img_small_size/didichuxing-20240914171548.jpg", "width": 447, "height": 1000, "conversations": [{"from": "human", "value": "<image>点[152,254]所处位置的信息是什么?"}, {"from": "gpt", "value": "<ref>按钮-打开请填写地址</ref><box>[[10, 213, 431, 286]]</box>"}]}
{"id": 2, "image": "/img_datasets/img_small_size/didichuxing-20240914171548.jpg", "width": 447, "height": 1000, "conversations": [{"from": "human", "value": "<image>点[364,244]所处位置的信息是什么?"}, {"from": "gpt", "value": "<ref>按钮-地图选址</ref><box>[[348, 241, 405, 263]]</box>"}]}
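Before training on a file like this, it can help to sanity-check each record. The sketch below is one possible check, not something shipped with InternVL: the function name `check_record` is hypothetical, and it assumes boxes are written in pixel coordinates bounded by the record's width and height, as they appear to be in the records above.

```python
import json
import re

# Matches one inner box such as [33, 239, 66, 262].
BOX_RE = re.compile(r"\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]")


def check_record(rec):
    """Return a list of problems found in one annotation record (empty if none)."""
    problems = [
        f"missing key: {key}"
        for key in ("id", "image", "width", "height", "conversations")
        if key not in rec
    ]
    if problems:
        return problems

    turns = rec["conversations"]
    first = turns[0] if turns else {}
    if first.get("from") != "human" or "<image>" not in first.get("value", ""):
        problems.append("first turn must be a human turn containing <image>")

    # Assumption: boxes in gpt turns are pixel coordinates inside the image.
    for turn in turns:
        if turn.get("from") != "gpt":
            continue
        for match in BOX_RE.finditer(turn.get("value", "")):
            x1, y1, x2, y2 = map(int, match.groups())
            if not (0 <= x1 < x2 <= rec["width"] and 0 <= y1 < y2 <= rec["height"]):
                problems.append(f"box out of bounds: {[x1, y1, x2, y2]}")
    return problems


if __name__ == "__main__":
    line = (
        '{"id": 0, "image": "/img_datasets/img_small_size/didichuxing-20240914171548.jpg", '
        '"width": 447, "height": 1000, "conversations": ['
        '{"from": "human", "value": "<image>点[56,257]所处位置的信息是什么?"}, '
        '{"from": "gpt", "value": "<ref>文本-地址</ref><box>[[33, 239, 66, 262]]</box>"}]}'
    )
    print(check_record(json.loads(line)))
```

Running the same function over every line of the file gives a quick way to catch malformed records before a long training run.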
My final converted_dataset_meta file:
{
  "point_to_box": {
    "root": "/",
    "annotation": "/meta_jsonl/converted_dataset.jsonl",
    "data_augment": false,
    "repeat_time": 1,
    "length": 797760
  }
}
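The "length" field should equal the number of samples in the annotation file, so it is convenient to compute it rather than type it by hand. The following is a rough sketch under that assumption; `build_meta` is a hypothetical helper, and the demo writes a throwaway two-line file instead of touching the real dataset.

```python
import json
import os
import tempfile


def build_meta(name, annotation_path, root="/"):
    """Build a meta entry whose "length" is the count of non-empty JSONL lines."""
    with open(annotation_path, encoding="utf-8") as f:
        length = sum(1 for line in f if line.strip())
    return {
        name: {
            "root": root,
            "annotation": annotation_path,
            "data_augment": False,
            "repeat_time": 1,
            "length": length,
        }
    }


# Demo on a temporary file with two dummy records.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "converted_dataset.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write('{"id": 0}\n{"id": 1}\n')
    meta = build_meta("point_to_box", demo_path)
    print(json.dumps(meta, indent=2))
```

Writing the returned dictionary out with `json.dump` yields a file in the same shape as the meta file shown above.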
Start training:
cd /app/InternVL/internvl_chat && \
sh shell/internvl2.0/3nd_finetune/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_full.sh
internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_full.sh
set -x
export PYTHONPATH="${PYTHONPATH}:$(pwd)"
export MASTER_PORT=34229
export TF_CPP_MIN_LOG_LEVEL=3
export LAUNCHER=pytorch
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
GPUS_PER_NODE=$(/root/miniconda3/envs/opensora/bin/python -c 'import torch; print(torch.cuda.device_count())')
OUTPUT_DIR='/app/InternVL/internvl_chat/work_dirs/internvl_chat_v2_0/'
LOGS_DIR="/tmp/logs/"
mkdir -p "$LOGS_DIR"
if [ ! -d "$OUTPUT_DIR" ]; then
  mkdir -p "$OUTPUT_DIR"
fi
BATCH_SIZE=${BATCH_SIZE:-640}
PER_DEVICE_BATCH_SIZE=${PER_DEVICE_BATCH_SIZE:-8}
GRADIENT_ACC=$((BATCH_SIZE / PER_DEVICE_BATCH_SIZE / GPUS_PER_NODE))
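MASTER_ADDR=${MASTER_ADDR:-localhost}
MASTER_PORT=${MASTER_PORT:-6001}
# NNODES and RANK are used below; default to a single-node run if the environment does not set them.
NNODES=${NNODES:-1}
RANK=${RANK:-0}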
DISTRIBUTED_ARGS="
    --nproc-per-node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"
# epoch: 1
/root/miniconda3/envs/opensora/bin/torchrun $DISTRIBUTED_ARGS \
  internvl/train/internvl_chat_finetune.py \
  --model_name_or_path "/internvl2_8b" \
  --conv_style "internlm2-chat" \
  --output_dir ${OUTPUT_DIR} \
  --meta_path "./shell/data/internvl_1_2_finetune_custom.json" \
  --overwrite_output_dir True \
  --force_image_size 448 \
  --max_dynamic_patch 6 \
  --down_sample_ratio 0.5 \
  --drop_path_rate 0.1 \
  --freeze_llm False \
  --freeze_mlp False \
  --freeze_backbone False \
  --vision_select_layer -1 \
  --dataloader_num_workers 8 \
  --bf16 True \
  --num_train_epochs 6 \
  --per_device_train_batch_size ${PER_DEVICE_BATCH_SIZE} \
  --gradient_accumulation_steps ${GRADIENT_ACC} \
  --evaluation_strategy "no" \
  --save_strategy "steps" \
  --save_steps 200 \
  --save_total_limit 1 \
  --learning_rate 4e-5 \
  --weight_decay 0.01 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --max_seq_length 4096 \
  --do_train True \
  --grad_checkpoint True \
  --group_by_length True \
  --dynamic_image_size True \
  --use_thumbnail True \
  --ps_version 'v2' \
  --deepspeed "zero_stage1_config.json" \
  --report_to "tensorboard" \
  2>&1 | tee -a "${LOGS_DIR}/training_log.txt"
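The script derives GRADIENT_ACC from BATCH_SIZE, PER_DEVICE_BATCH_SIZE, and GPUS_PER_NODE using shell integer division. The sketch below mirrors that arithmetic in Python to show what it yields for a few GPU counts; the function names are hypothetical, and the defaults 640 and 8 are taken from the script.

```python
def gradient_accumulation_steps(batch_size, per_device_batch_size, gpus):
    """Mirror the script's $((BATCH_SIZE / PER_DEVICE_BATCH_SIZE / GPUS_PER_NODE))."""
    return batch_size // per_device_batch_size // gpus


def effective_batch_size(per_device_batch_size, grad_acc, gpus):
    """Samples consumed per optimizer step on a single node."""
    return per_device_batch_size * grad_acc * gpus


if __name__ == "__main__":
    for gpus in (1, 3, 8):
        acc = gradient_accumulation_steps(640, 8, gpus)
        print(gpus, acc, effective_batch_size(8, acc, gpus))
```

Because the division truncates, a GPU count that does not divide 80 evenly (3, for example) gives an effective batch size slightly below the requested 640, and a count above 80 would give zero accumulation steps.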
internvl_1_2_finetune_custom.json
{
    "point_to_box": {
      "root": "/",
      "annotation": "/meta_jsonl/converted_dataset.jsonl",
      "data_augment": false,
      "repeat_time": 1,
      "length": 797760
    }
}
docker run -it \
  -v /ssd/xiedong/qwenvl_train_ui_ground_datasets:/ssd/xiedong/qwenvl_train_ui_ground_datasets \
  -v /ssd/xiedong/internvl8byangfan:/model \
  --net host \
  --gpus '"device=1"' \
  kevinchina/deeplearning:trainintervl-of bash
Running model inference...
Current progress: 400/2000 valid samples processed
Accuracy at IoU > 0.3: 1.0000
Point-in-box accuracy: 1.0000
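The evaluation script itself is not shown in this post, so the following is only a sketch of how the two metrics in the log could be computed. It assumes boxes are [x1, y1, x2, y2], that "IoU > 0.3" compares the predicted box against the ground-truth box, and that "point in box" tests the query point against the predicted box. All function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def point_in_box(point, box):
    """True if the (x, y) point lies inside the box, edges included."""
    x, y = point
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]


def evaluate(samples, iou_threshold=0.3):
    """samples: list of (point, predicted_box, ground_truth_box) tuples.

    Returns (IoU accuracy, point-in-box accuracy) as fractions in [0, 1].
    """
    if not samples:
        return 0.0, 0.0
    iou_hits = sum(iou(pred, gt) > iou_threshold for _, pred, gt in samples)
    point_hits = sum(point_in_box(pt, pred) for pt, pred, _ in samples)
    return iou_hits / len(samples), point_hits / len(samples)


if __name__ == "__main__":
    demo = [((56, 257), [33, 239, 66, 262], [33, 239, 66, 262])]
    print(evaluate(demo))
```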
The newly trained model appears to work well.
Querying it through the API:
python x18_request_ui_point_query.py "点[56,257]所处位置的信息是什么?" /ssd/xiedong/qwenvl_train_ui_ground_datasets/img_small_size/didichuxing-20240914171548.jpg
Sending request with image file: /ssd/xiedong/qwenvl_train_ui_ground_datasets/img_small_size/didichuxing-20240914171548.jpg
Request successful. Status code: 200
API Response:
{
  "box": [33, 239, 66, 262],
  "code": 0,
  "element_description": "文本-地址",
  "imgurl": null,
  "point": [56, 257],
  "response": "文本-地址[[33, 239, 66, 262]]",
  "text": "succeed"
}
Training sample:
{"id": 0, "image": "/img_datasets/img_small_size/didichuxing-20240914171548.jpg", "width": 447, "height": 1000, "conversations": [{"from": "human", "value": "<image>点[56,257]所处位置的信息是什么?"}, {"from": "gpt", "value": "<ref>文本-地址</ref><box>[[33, 239, 66, 262]]</box>"}]}
Expected model output: "<ref>文本-地址</ref><box>[[33, 239, 66, 262]]</box>"
<ref> and <box> are special tags, and they are stripped from the text the model returns.
The actual model response is: "response": "文本-地址[[33, 239, 66, 262]]"
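Since the tags are gone from the returned text, the description and the box have to be separated some other way. How the API above does this is not shown, so the sketch below is just one plausible approach: it assumes the response always ends with a JSON-style list of boxes, and `parse_response` is a hypothetical name.

```python
import json
import re

# Description first (lazy), then a trailing [[...]] list of boxes.
RESPONSE_RE = re.compile(r"^(?P<desc>.*?)(?P<boxes>\[\[.*\]\])\s*$", re.S)


def parse_response(text):
    """Split 'description[[x1, y1, x2, y2]]' into a description and its first box.

    Returns None when the text does not end with a parseable box list.
    """
    match = RESPONSE_RE.match(text.strip())
    if match is None:
        return None
    try:
        boxes = json.loads(match.group("boxes"))
    except json.JSONDecodeError:
        return None
    return {
        "element_description": match.group("desc").strip(),
        "box": boxes[0] if boxes else None,
    }


if __name__ == "__main__":
    print(parse_response("文本-地址[[33, 239, 66, 262]]"))
```

The returned dictionary uses the same "element_description" and "box" keys that appear in the API response shown earlier.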


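Author: Dong
Copyright notice: Unless otherwise stated, all posts on this blog are licensed under CC BY-NC (Creative Commons Attribution-NonCommercial 4.0 International). You may freely repost and adapt this work for non-commercial purposes, provided you credit the source and link to the original author.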