你提供的数据格式:
json展开代码{
    "image": "000443568.jpg",
    "width": 700,
    "height": 700,
    "conversations": [
        {
            "from": "human",
            "value": "Write a detailed description of this image..."
        },
        {
            "from": "gpt",
            "value": "The image is of a pair of silver earrings..."
        }
    ]
}
问题分析:
<image> 标记:在 conversations 中没有 <image> 占位符image 字段是单个字符串,但需要与对话中的 <image> 标记对应width 和 height 字段在 LLaMA-Factory 中不会被使用根据代码分析,正确的格式应该是:
json展开代码{
    "conversations": [
        {
            "from": "human",
            "value": "<image>Write a detailed description of this image, do not forget about the texts on it if they exist. Also, do not forget to mention the type / style of the image. No bullet points. When writing descriptions, prioritize clarity and direct observation over embellishment or interpretation.\nDon't forget these rules:\n1. **Be Direct and Concise**: Provide straightforward descriptions without adding interpretative or speculative elements.\n2. **Use Segmented Details**: Break down details about different elements of an image into distinct sentences, focusing on one aspect at a time.\n3. **Maintain a Descriptive Focus**: Prioritize purely visible elements of the image, avoiding conclusions or inferences.\n4. **Follow a Logical Structure**: Begin with the central figure or subject and expand outward, detailing its appearance before addressing the surrounding setting.\n5. **Avoid Juxtaposition**: Do not use comparison or contrast language; keep the description purely factual.\n6. **Incorporate Specificity**: Mention age, gender, race, and specific brands or notable features when present, and clearly identify the medium if it's discernible."
        },
        {
            "from": "gpt",
            "value": "The image is of a pair of silver earrings. Each earring features a rectangular silver frame with an oval purple gemstone in the center. The frame has an ornate design with a sunburst pattern at the top and a spike-like protrusion at the bottom. The earrings are made of silver and have a hook-style backing."
        }
    ],
    "images": [
        "000443568.jpg"
    ]
}
json展开代码{
    "messages": [
        {
            "role": "user",
            "content": "<image>Write a detailed description of this image..."
        },
        {
            "role": "assistant", 
            "content": "The image is of a pair of silver earrings..."
        }
    ],
    "images": [
        "000443568.jpg"
    ]
}
json展开代码{
    "my_vlm_dataset": {
        "file_name": "data.json",
        "formatting": "sharegpt",
        "columns": {
            "messages": "conversations",
            "images": "images"
        }
    }
}
LLaMA Factory 默认的就是:
bash展开代码        "tags": {
            "role_tag": "from",
            "content_tag": "value",
            "user_tag": "human",
            "assistant_tag": "gpt"
        }
json展开代码{
    "my_vlm_dataset": {
        "file_name": "data.json",
        "formatting": "sharegpt",
        "columns": {
            "messages": "messages",
            "images": "images"
        },
        "tags": {
            "role_tag": "role",
            "content_tag": "content",
            "user_tag": "user",
            "assistant_tag": "assistant"
        }
    }
}
<image> 标记:在对话文本中必须包含 <image> 占位符<image> 标记的数量必须与 images 数组中的图片数量严格一致<image> 可以出现在文本的任何位置media_dir 的路径s3://、gs://、gcs:// 格式从 src/llamafactory/data/converter.py 的 SharegptDatasetConverter 类可以看出:
python展开代码def __call__(self, example: dict[str, Any]) -> dict[str, Any]:
    # 处理对话消息
    messages = example[self.dataset_attr.messages]
    
    # 处理图片路径
    "_images": self._find_medias(example[self.dataset_attr.images]) if self.dataset_attr.images else None,
json展开代码[
    {
        "conversations": [
            {
                "from": "human",
                "value": "<image>Describe this image in detail."
            },
            {
                "from": "gpt",
                "value": "The image shows a beautiful landscape..."
            }
        ],
        "images": [
            "/path/to/image1.jpg"
        ]
    },
    {
        "conversations": [
            {
                "from": "human", 
                "value": "<image><image>Compare these two images."
            },
            {
                "from": "gpt",
                "value": "Both images show different aspects..."
            }
        ],
        "images": [
            "/path/to/image2.jpg",
            "/path/to/image3.jpg"
        ]
    }
]
bash展开代码python src/train_bash.py \ --stage sft \ --model_name_or_path /path/to/model \ --dataset my_vlm_dataset \ --dataset_dir /path/to/data \ --media_dir /path/to/images \ --template qwen2_vl \ --output_dir /path/to/output \ --per_device_train_batch_size 1 \ --gradient_accumulation_steps 4 \ --lr_scheduler_type cosine \ --logging_steps 10 \ --save_steps 1000 \ --learning_rate 5e-5 \ --num_train_epochs 3 \ --max_samples 1000 \ --cutoff_len 2048 \ --dataloader_num_workers 8
你的数据格式需要做以下修改:
<image> 标记到对话文本中image 字段改为 images 数组width 和 height 字段<image> 标记数量一致这样修改后,数据就符合 LLaMA-Factory 的 ShareGPT + VLM 训练格式要求了。


本文作者:Dong
本文链接:
版权声明:本博客所有文章除特别声明外,均采用 CC BY-NC。本作品采用《知识共享署名-非商业性使用 4.0 国际许可协议》进行许可。您可以在非商业用途下自由转载和修改,但必须注明出处并提供原作者链接。 许可协议。转载请注明出处!