LLaMA Factory Fully Supports the JSONL Format

JSONL Support in LLaMA Factory, End to End

Stated explicitly in the official documentation

The LLaMA Factory documentation states in data/README.md:

Currently we support datasets in alpaca and sharegpt format. Allowed file types include json, jsonl, csv, parquet, arrow.

In other words, JSONL is one of the file formats LLaMA Factory supports natively.

Support in the core code

1. File-extension mapping

src/llamafactory/extras/constants.py defines the mapping from file extensions to loader types:

python
FILEEXT2TYPE = {
    "arrow": "arrow",
    "csv": "csv",
    "json": "json",
    "jsonl": "json",  # JSONL is mapped to the "json" loader type
    "parquet": "parquet",
    "txt": "text",
}

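Because of this mapping, a .jsonl file is recognized and handled by the same loader type as a .json file.

2. Data loader

In src/llamafactory/data/loader.py, the data loader picks the loader type from the extension of the first data file and rejects any extension that is not in the mapping: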

python
data_path = FILEEXT2TYPE.get(os.path.splitext(data_files[0])[-1][1:], None)
if data_path is None:
    raise ValueError("Allowed file types: {}.".format(",".join(FILEEXT2TYPE.keys())))
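To make the lookup concrete, here is a minimal, self-contained sketch of the same extension-to-type resolution. The mapping is copied from the excerpt above; the function name `resolve_loader_type` is hypothetical and not part of LLaMA Factory.

```python
import os

# Copied from the constants.py excerpt above.
FILEEXT2TYPE = {
    "arrow": "arrow",
    "csv": "csv",
    "json": "json",
    "jsonl": "json",
    "parquet": "parquet",
    "txt": "text",
}


def resolve_loader_type(file_name: str) -> str:
    """Hypothetical helper: map a file name to a loader type via its extension."""
    ext = os.path.splitext(file_name)[-1][1:]  # "data.jsonl" -> "jsonl"
    loader_type = FILEEXT2TYPE.get(ext)
    if loader_type is None:
        raise ValueError("Allowed file types: {}.".format(",".join(FILEEXT2TYPE)))
    return loader_type


print(resolve_loader_type("data.jsonl"))  # json
print(resolve_loader_type("data.json"))   # json
```

Note that the lookup in this sketch is case-sensitive, so an extension such as `.JSONL` would be rejected.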

3. Dedicated JSONL reader

src/llamafactory/data/data_utils.py contains a helper that reads both JSON and JSONL files:

python
def _read_json_with_fs(fs: "fsspec.AbstractFileSystem", path: str) -> list[Any]:
    r"""Helper function to read JSON/JSONL files using fsspec."""
    with fs.open(path, "r") as f:
        if path.endswith(".jsonl"):
            # JSONL: parse one object per line, skipping blank lines
            return [json.loads(line) for line in f if line.strip()]
        else:
            # JSON: parse the whole file at once
            return json.load(f)
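The line-by-line logic can be tried without fsspec. The following is a rough stand-in that uses the built-in `open()` instead of an fsspec file system; the function name `read_json_or_jsonl` and the sample file are made up for illustration.

```python
import json
import os
import tempfile
from typing import Any


def read_json_or_jsonl(path: str) -> list[Any]:
    """Stand-in for the helper above, using built-in open() instead of fsspec."""
    with open(path, "r", encoding="utf-8") as f:
        if path.endswith(".jsonl"):
            # One JSON object per line; whitespace-only lines are skipped.
            return [json.loads(line) for line in f if line.strip()]
        return json.load(f)


# Write a small JSONL file that contains a blank line between two records.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.jsonl")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write('{"id": 1}\n\n{"id": 2}\n')
    records = read_json_or_jsonl(sample_path)

print(records)  # [{'id': 1}, {'id': 2}]
```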

4. Web UI support

The Web UI code also handles JSONL files when it loads a data file:

python
def _load_data_file(file_path: str) -> list[Any]:
    with open(file_path, encoding="utf-8") as f:
        if file_path.endswith(".json"):
            return json.load(f)
        elif file_path.endswith(".jsonl"):
            return [json.loads(line) for line in f]  # parse JSONL line by line
        else:
            return list(f)
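One practical difference from the data_utils.py helper: the Web UI excerpt has no `if line.strip()` filter. Assuming the excerpt matches the version you run, a blank line inside the file would therefore reach `json.loads` and fail. The sketch below reproduces both list comprehensions on an in-memory string (the sample data is made up):

```python
import io
import json

jsonl_text = '{"id": 1}\n\n{"id": 2}\n'  # note the blank line in the middle

# Variant with the blank-line filter (as in the data_utils.py helper).
filtered = [json.loads(line) for line in io.StringIO(jsonl_text) if line.strip()]

# Variant without the filter (as in the Web UI excerpt).
try:
    unfiltered = [json.loads(line) for line in io.StringIO(jsonl_text)]
    blank_line_error = None
except json.JSONDecodeError as exc:
    blank_line_error = exc

print(filtered)                         # [{'id': 1}, {'id': 2}]
print(type(blank_line_error).__name__)  # JSONDecodeError
```

A single trailing newline after the last record is harmless in both variants; only lines that are entirely empty trigger the error. Keeping the file free of blank lines avoids the issue either way.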

A complete JSONL setup

1. dataset_info.json entry

For multimodal VLM training, you can register the dataset like this:

json
{
    "my_vlm_dataset": {
        "file_name": "data.jsonl",
        "formatting": "sharegpt",
        "columns": {
            "messages": "conversations",
            "images": "images"
        },
        "tags": {
            "role_tag": "from",
            "content_tag": "value",
            "user_tag": "human",
            "assistant_tag": "gpt"
        }
    }
}

2. Example JSONL file

Your data.jsonl file should look like this, with one complete JSON object per line:

jsonl
{"conversations": [{"from": "human", "value": "<image>Please describe this image..."}, {"from": "gpt", "value": "This is a pair of silver earrings..."}], "images": ["000443568.jpg"]}
{"conversations": [{"from": "human", "value": "<image>What is this?"}, {"from": "gpt", "value": "This is a cat..."}], "images": ["cat.jpg"]}
{"conversations": [{"from": "human", "value": "<image>Analyze this image"}, {"from": "gpt", "value": "This is a landscape photo..."}], "images": ["landscape.jpg"]}

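3. Field mapping

LLaMA Factory uses a flexible field-mapping mechanism that supports custom field names and values:

Field-name mapping: columns maps the top-level fields of each record, and role_tag and content_tag under tags name the keys inside each message
Field-value mapping: user_tag and assistant_tag define the role identifiers used in your data
Position-independent: fields are looked up by name, so the mapping does not depend on where the data sits in the file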
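As an illustration of how such a mapping can be applied, the sketch below normalizes one record using the columns and tags from the dataset_info.json example. It is not LLaMA Factory's actual converter: the function `normalize_record` and the output layout with `role` and `content` keys are assumptions made for this example.

```python
from typing import Any

# Taken from the dataset_info.json example above.
DATASET_ATTR = {
    "columns": {"messages": "conversations", "images": "images"},
    "tags": {
        "role_tag": "from",
        "content_tag": "value",
        "user_tag": "human",
        "assistant_tag": "gpt",
    },
}


def normalize_record(record: dict[str, Any], attr: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: rename fields and role values according to the mapping."""
    columns, tags = attr["columns"], attr["tags"]
    role_map = {tags["user_tag"]: "user", tags["assistant_tag"]: "assistant"}
    messages = [
        {"role": role_map[turn[tags["role_tag"]]], "content": turn[tags["content_tag"]]}
        for turn in record[columns["messages"]]
    ]
    return {"messages": messages, "images": record.get(columns["images"], [])}


# Keys appear in a different order than in the earlier example; lookup is by name.
record = {
    "images": ["cat.jpg"],
    "conversations": [
        {"value": "<image>What is this?", "from": "human"},
        {"value": "This is a cat...", "from": "gpt"},
    ],
}

normalized = normalize_record(record, DATASET_ATTR)
print(normalized["messages"][0])  # {'role': 'user', 'content': '<image>What is this?'}
```

Changing `user_tag` or `assistant_tag` in the mapping is enough to accept data that uses other role names, without touching the records themselves.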
