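Lines 47-53 of src/llamafactory/data/parser.py show that LLaMA Factory reads ShareGPT data by field-name mapping, not by position:
# sharegpt tags
role_tag: Optional[str] = "from"        # field holding the role; defaults to "from"
content_tag: Optional[str] = "value"    # field holding the content; defaults to "value"
user_tag: Optional[str] = "human"       # role value for the user; defaults to "human"
assistant_tag: Optional[str] = "gpt"    # role value for the assistant; defaults to "gpt"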
Lines 85-90 of the same file show that these names are configurable:
if "tags" in attr:
    tag_names = ["role_tag", "content_tag"]
    tag_names += ["user_tag", "assistant_tag", "observation_tag", "function_tag", "system_tag"]
    for tag in tag_names:
        self.set_attr(tag, attr["tags"])
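To make the override step concrete, here is a minimal sketch of how a "tags" entry could replace the defaults. `TagConfig` and `apply_tags` are hypothetical names invented for this illustration, not LLaMA Factory classes. The sketch also assumes that a tag missing from "tags" keeps its default; the excerpt does not show whether the real `set_attr` behaves that way, so listing every tag you rely on is the safer choice.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TagConfig:
    """Hypothetical stand-in for the tag fields of the dataset attributes."""

    role_tag: Optional[str] = "from"
    content_tag: Optional[str] = "value"
    user_tag: Optional[str] = "human"
    assistant_tag: Optional[str] = "gpt"

    def apply_tags(self, tags: dict[str, Any]) -> None:
        # Assumption of this sketch: a key missing from "tags" keeps its default.
        for name in ("role_tag", "content_tag", "user_tag", "assistant_tag"):
            if name in tags:
                setattr(self, name, tags[name])


config = TagConfig()
config.apply_tags({"role_tag": "role", "content_tag": "content"})
print(config)
```

Only the two overridden field names change here; the role values stay at "human" and "gpt".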
The SharegptDatasetConverter in src/llamafactory/data/converter.py then uses these tags (abridged excerpt):
def __call__(self, example: dict[str, Any]) -> dict[str, Any]:
    tag_mapping = {
        self.dataset_attr.user_tag: Role.USER.value,        # "human" -> "user"
        self.dataset_attr.assistant_tag: Role.ASSISTANT.value, # "gpt" -> "assistant"
        # ...
    }
    
    for turn_idx, message in enumerate(messages):
        # Read the role through the configured field name
        if message[self.dataset_attr.role_tag] not in accept_tags[turn_idx % 2]:
            # ...
        
        aligned_messages.append({
            "role": tag_mapping[message[self.dataset_attr.role_tag]],  # map the role value
            "content": message[self.dataset_attr.content_tag],         # map the content
        })
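The excerpt above is abridged and cannot run on its own, so the following self-contained sketch reproduces the same idea under stated assumptions. `align_messages` and the local `Role` enum are hypothetical stand-ins written for this illustration. The sketch handles only user and assistant turns, assumes that even turns belong to the user and odd turns to the assistant, and raises an error on a mismatch, which may differ from how the real converter treats invalid data.

```python
from enum import Enum
from typing import Any


class Role(str, Enum):
    # Hypothetical stand-in for the Role enum referenced in the excerpt.
    USER = "user"
    ASSISTANT = "assistant"


def align_messages(
    messages: list[dict[str, Any]],
    role_tag: str = "from",
    content_tag: str = "value",
    user_tag: str = "human",
    assistant_tag: str = "gpt",
) -> list[dict[str, str]]:
    """Convert ShareGPT-style turns into {"role", "content"} dicts by field name."""
    tag_mapping = {user_tag: Role.USER.value, assistant_tag: Role.ASSISTANT.value}
    # Assumption: even turns come from the user, odd turns from the assistant.
    accept_tags = ({user_tag}, {assistant_tag})

    aligned = []
    for turn_idx, message in enumerate(messages):
        role = message[role_tag]  # looked up by key, never by position
        if role not in accept_tags[turn_idx % 2]:
            raise ValueError(f"Unexpected role {role!r} at turn {turn_idx}")
        aligned.append({"role": tag_mapping[role], "content": message[content_tag]})
    return aligned


example = [
    {"value": "<image>Describe this image.", "from": "human"},  # key order is irrelevant
    {"from": "gpt", "value": "A pair of silver earrings."},
]
print(align_messages(example))
```

Because every lookup goes through `role_tag` and `content_tag`, swapping the order of the keys inside a turn has no effect on the result.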
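No. Both the field names and the role values have specific requirements:
The role field name defaults to "from", but it can be changed through the tags configuration.
The role values default to "human" and "gpt", and they can also be changed through the tags configuration.
Standard ShareGPT format (recommended)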
{
    "conversations": [
        {
            "from": "human",
            "value": "<image>Write a detailed description of this image, do not forget about the texts on it if they exist. Also, do not forget to mention the type / style of the image. No bullet points. When writing descriptions, prioritize clarity and direct observation over embellishment or interpretation.\nDon't forget these rules:\n1. **Be Direct and Concise**: Provide straightforward descriptions without adding interpretative or speculative elements.\n2. **Use Segmented Details**: Break down details about different elements of an image into distinct sentences, focusing on one aspect at a time.\n3. **Maintain a Descriptive Focus**: Prioritize purely visible elements of the image, avoiding conclusions or inferences.\n4. **Follow a Logical Structure**: Begin with the central figure or subject and expand outward, detailing its appearance before addressing the surrounding setting.\n5. **Avoid Juxtaposition**: Do not use comparison or contrast language; keep the description purely factual.\n6. **Incorporate Specificity**: Mention age, gender, race, and specific brands or notable features when present, and clearly identify the medium if it's discernible."
        },
        {
            "from": "gpt",
            "value": "The image is of a pair of silver earrings. Each earring features a rectangular silver frame with an oval purple gemstone in the center. The frame has an ornate design with a sunburst pattern at the top and a spike-like protrusion at the bottom. The earrings are made of silver and have a hook-style backing."
        }
    ],
    "images": [
        "000443568.jpg"
    ]
}
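For the standard ShareGPT format (recommended), dataset_info.json can be configured as follows. In tags, role_tag and content_tag name the fields that hold the role and the content, while user_tag and assistant_tag give the role values for the user and the assistant. The four values below match the defaults shown earlier:
{
    "my_vlm_dataset": {
        "file_name": "data.json",
        "formatting": "sharegpt",
        "columns": {
            "messages": "conversations",
            "images": "images"
        },
        "tags": {
            "role_tag": "from",
            "content_tag": "value",
            "user_tag": "human",
            "assistant_tag": "gpt"
        }
    }
}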
For example, to use "role" and "content" as the field names, with "user" and "assistant" as the role values:
{
    "my_vlm_dataset": {
        "file_name": "data.json",
        "formatting": "sharegpt",
        "columns": {
            "messages": "conversations",
            "images": "images"
        },
        "tags": {
            "role_tag": "role",
            "content_tag": "content", 
            "user_tag": "user",
            "assistant_tag": "assistant"
        }
    }
}
Your data then takes this form:
{
    "conversations": [
        {
            "role": "user",
            "content": "<image>Please describe this image..."
        },
        {
            "role": "assistant",
            "content": "This is a pair of silver earrings..."
        }
    ],
    "images": ["000443568.jpg"]
}
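If existing data is already in the default "from"/"value" layout, it can be rewritten into the "role"/"content" layout with a few lines of Python. This is only a sketch: `rename_turns` and `ROLE_VALUES` are hypothetical names, and the code assumes every turn comes from either "human" or "gpt".

```python
from typing import Any

# Maps the default ShareGPT role values to the custom ones used in the tags above.
ROLE_VALUES = {"human": "user", "gpt": "assistant"}


def rename_turns(record: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of one record with from/value rewritten as role/content."""
    converted = dict(record)  # shallow copy; other columns such as "images" are kept
    converted["conversations"] = [
        {"role": ROLE_VALUES[turn["from"]], "content": turn["value"]}
        for turn in record["conversations"]
    ]
    return converted


record = {
    "conversations": [
        {"from": "human", "value": "<image>Please describe this image..."},
        {"from": "gpt", "value": "This is a pair of silver earrings..."},
    ],
    "images": ["000443568.jpg"],
}
print(rename_turns(record))
```

Keeping the default layout and omitting custom tags avoids this step entirely, which is one reason the standard format is the recommended one.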
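LLaMA Factory does not read turns by position; it relies on field-name mapping. Both the field names and the role values can be customized through the tags configuration, but they must stay consistent: the tags in dataset_info.json have to match the keys and role values actually used in the data file.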
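A quick way to check that consistency before training is to compare each record against the configured tags. The helper below is a hypothetical sketch, not part of LLaMA Factory. It looks only at the four tags discussed in this post and assumes the messages column is called "conversations" unless told otherwise.

```python
from typing import Any


def find_tag_mismatches(
    record: dict[str, Any],
    tags: dict[str, str],
    messages_column: str = "conversations",
) -> list[str]:
    """List the places where a record disagrees with the configured tags."""
    problems = []
    allowed_roles = {tags["user_tag"], tags["assistant_tag"]}
    for idx, turn in enumerate(record[messages_column]):
        # Every turn must carry both configured field names.
        for key in (tags["role_tag"], tags["content_tag"]):
            if key not in turn:
                problems.append(f"turn {idx}: missing field {key!r}")
        # The role value must be one of the configured role values.
        role = turn.get(tags["role_tag"])
        if role is not None and role not in allowed_roles:
            problems.append(f"turn {idx}: unexpected role value {role!r}")
    return problems


custom_tags = {
    "role_tag": "role",
    "content_tag": "content",
    "user_tag": "user",
    "assistant_tag": "assistant",
}
default_style_record = {
    "conversations": [
        {"from": "human", "value": "<image>Please describe this image..."},
        {"from": "gpt", "value": "This is a pair of silver earrings..."},
    ]
}
for problem in find_tag_mismatches(default_style_record, custom_tags):
    print(problem)
```

Here the record still uses "from" and "value" while the tags expect "role" and "content", so every turn is reported as missing both fields.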


Author: Dong
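Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC (Creative Commons Attribution-NonCommercial 4.0 International). You may freely repost and modify them for non-commercial purposes, provided that you credit the source and link to the original author.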