ERNIE-Image Open-Sourced: Top-Tier Text-to-Image Generation and Precise Text Rendering at 8B Parameters

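Baidu's ERNIE team has open-sourced ERNIE-Image, an 8B-parameter text-to-image model built on a single-stream DiT architecture. It runs on a consumer GPU with 24 GB of VRAM and, according to the team, leads open-source models across mainstream benchmarks for instruction following and text rendering. It is particularly strong in scenarios that demand tight control, such as posters, comic panels, and multi-panel layouts. The team also released ERNIE-Image Turbo, which generates high-fidelity images in 8 inference steps. The model weights and inference code are fully open-sourced, and a ModelScope Studio is available for a quick trial.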

Try it:

ModelScope Studio: https://modelscope.cn/studios/PaddlePaddle/ERNIE-Image

Open-source releases:

ERNIE-Image: https://modelscope.cn/models/PaddlePaddle/ERNIE-Image
ERNIE-Image-Turbo: https://modelscope.cn/models/PaddlePaddle/ERNIE-Image-Turbo

Model overview

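ERNIE-Image is based on the DiT architecture and has 8 billion (8B) parameters. With only 24 GB of VRAM it can generate complex images that the team describes as comparable to top commercial models. On mainstream benchmarks such as GenEval, OneIG, and LongTextBench it leads open-source models overall, with results approaching state-of-the-art models such as NanoBanana and Seedream 4.5. The model is notably strong at following complex instructions and rendering text precisely, and it covers a wide range of visual styles, including anime, film photography, surrealism, silhouette, and vintage photos.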

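Key features:

Small model, strong performance - At only 8B parameters, it ranks first worldwide among open-source models on mainstream benchmarks such as GenEval, OneIG, and LongTextBench
Precise text rendering - Stable results on dense text, long text, and layout-sensitive text generation tasks
Complex instruction following - Keeps strong comprehension and precise execution on prompts with multi-subject relationships, detailed constraints, and knowledge-intensive descriptions
Strong structured generation - Better preserves layout logic in posters, comics, panel sequences, storyboards, and multi-panel images
Broad style coverage - Supports photorealistic photography, anime, film photography, surrealism, silhouette, vintage photos, and more
Consumer-hardware friendly - Deployable on 24 GB of VRAM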
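
As a rough sanity check on the 24 GB figure, the weight footprint can be estimated from the parameter counts alone. The sketch below assumes bfloat16 weights (2 bytes per parameter, the dtype used in the inference examples in this article) and counts only the 8B DiT and the 3B Prompt Enhancer. The text encoder, VAE, activations, and framework overhead are not included, so the result is a lower bound, not a measured number. `weight_gib` is a hypothetical helper written for this illustration.

```python
def weight_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GiB for a model stored at a fixed precision."""
    return num_params * bytes_per_param / 1024**3


# Assumption: both models are held in bfloat16 (2 bytes per parameter).
dit_gib = weight_gib(8e9)        # 8B DiT
enhancer_gib = weight_gib(3e9)   # 3B Prompt Enhancer

print(f"DiT weights:      {dit_gib:.1f} GiB")
print(f"Prompt Enhancer:  {enhancer_gib:.1f} GiB")
print(f"Combined weights: {dit_gib + enhancer_gib:.1f} GiB")
```

Under these assumptions the weights alone come to roughly 20.5 GiB, so fitting in 24 GB depends on the remaining components and activations either fitting in the leftover headroom or being offloaded.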

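Prompt Enhancer

ERNIE-Image performs best with detailed, structured long prompts. The team therefore ships a built-in lightweight Prompt Enhancer with 3B parameters that automatically expands short inputs into more detailed, structured prompts.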

Evaluation results

GenEval: 1st place (0.8856)
OneIG-ZH: 2nd place (0.5543)
LongTextBench: 2nd place (0.9733)
OneIG-EN: 3rd place (0.5750)

Inference with Diffusers

pip install git+https://github.com/huggingface/diffusers

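import torch
from diffusers import ErnieImagePipeline

# Load the Turbo checkpoint in bfloat16 and move it to the GPU.
pipe = ErnieImagePipeline.from_pretrained(
    "baidu/ERNIE-Image-Turbo",
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    # Prompt (truncated in the source): "This is a photograph depicting a city street scene..."
    prompt="这是一张呈现城市街道场景的摄影作品...",
    height=1264,
    width=848,
    num_inference_steps=8,  # Turbo generates in 8 steps
    guidance_scale=1.0,
    use_pe=True,  # enable the built-in Prompt Enhancer
).images[0]
image.save("output.png")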
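
The example above uses a portrait resolution of 848×1264, about 1.07 megapixels, which is close to the 1024×1024 used elsewhere in this article. For other aspect ratios, a small helper can pick a width and height with roughly the same pixel budget. This is only a sketch and is not part of Diffusers or ERNIE-Image: `pick_resolution` is a hypothetical helper, and rounding both sides to a multiple of 16 is an assumption (848 and 1264 both happen to be multiples of 16), not a documented requirement.

```python
import math


def pick_resolution(aspect_w: int, aspect_h: int,
                    target_pixels: int = 1024 * 1024,
                    multiple: int = 16) -> tuple[int, int]:
    """Return (width, height) near target_pixels for the given aspect ratio.

    Both sides are rounded to the nearest multiple of `multiple`,
    which is an assumed constraint rather than a documented one.
    """
    ratio = aspect_w / aspect_h
    width = math.sqrt(target_pixels * ratio)
    height = width / ratio

    def snap(x: float) -> int:
        # Round to the nearest multiple, never below one multiple.
        return max(multiple, round(x / multiple) * multiple)

    return snap(width), snap(height)


width, height = pick_resolution(2, 3)  # 2:3 portrait
print(width, height, width * height)
```

The returned values can be passed as `width=` and `height=` in the pipeline call.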

Inference with SGLang

git clone https://github.com/sgl-project/sglang.git
sglang serve --model-path baidu/ERNIE-Image-Turbo

Inference with DiffSynth

pip install -U diffsynth==2.0.8

import torch
from diffsynth.pipelines.ernie_image import ErnieImagePipeline, ModelConfig

# Offloading setup: weights rest on the CPU in bfloat16 and are moved to the GPU for computation.
vram_config = {
    "offload_dtype": torch.bfloat16,
    "offload_device": "cpu",
    "onload_dtype": torch.bfloat16,
    "onload_device": "cpu",
    "preparing_dtype": torch.bfloat16,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
pipe = ErnieImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device='cuda',
    model_configs=[
        ModelConfig(model_id="PaddlePaddle/ERNIE-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors", **vram_config),
        ModelConfig(model_id="PaddlePaddle/ERNIE-Image", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
        ModelConfig(model_id="PaddlePaddle/ERNIE-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="PaddlePaddle/ERNIE-Image", origin_file_pattern="tokenizer/"),
    # Total GPU memory in GiB, minus 0.5 GiB of headroom.
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
image = pipe(
    # Prompt: "A black-and-white Chinese rural dog"
    prompt="一只黑白相间的中华田园犬",
    negative_prompt="",
    height=1024,
    width=1024,
    seed=42,
    num_inference_steps=50,
    cfg_scale=4.0,
)
image.save("output.jpg")

LoRA training

pip install -U diffsynth==2.0.8

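Training script (run from the root of a DiffSynth-Studio repository checkout, where the examples/ directory lives):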

accelerate launch examples/ernie_image/model_training/train.py \
  --dataset_base_path data/diffsynth_example_dataset/ernie_image/Ernie-Image-T2I \
  --dataset_metadata_path data/diffsynth_example_dataset/ernie_image/Ernie-Image-T2I/metadata.csv \
  --max_pixels 1048576 \
  --dataset_repeat 50 \
  --model_id_with_origin_paths "PaddlePaddle/ERNIE-Image:transformer/diffusion_pytorch_model*.safetensors,PaddlePaddle/ERNIE-Image:text_encoder/model.safetensors,PaddlePaddle/ERNIE-Image:vae/diffusion_pytorch_model.safetensors" \
  --learning_rate 1e-4 \
  --num_epochs 5 \
  --remove_prefix_in_ckpt "pipe.dit." \
  --output_path "./models/train/Ernie-Image-T2I_lora" \
  --lora_base_model "dit" \
  --lora_target_modules "to_q,to_k,to_v,to_out.0" \
  --lora_rank 32 \
  --use_gradient_checkpointing \
  --dataset_num_workers 8 \
  --find_unused_parameters
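
The command above reads image-caption pairs from `metadata.csv`. For a custom dataset, a file of that kind could be generated with a short script. The sketch below rests on several assumptions that should be checked against the downloaded example dataset: that the CSV has two columns named `image` and `prompt`, that image paths are relative to `--dataset_base_path`, and that each image has its caption in a `.txt` file with the same stem. `build_metadata` is a hypothetical helper, not a DiffSynth API.

```python
import csv
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def build_metadata(dataset_dir: str, out_name: str = "metadata.csv") -> int:
    """Write <dataset_dir>/<out_name> and return the number of rows written.

    Images without a matching caption file are skipped.
    """
    root = Path(dataset_dir)
    rows = []
    for image_path in sorted(root.iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        caption_path = image_path.with_suffix(".txt")
        if not caption_path.exists():
            continue
        caption = caption_path.read_text(encoding="utf-8").strip()
        # Assumed column names: "image" (relative path) and "prompt" (caption).
        rows.append({"image": image_path.name, "prompt": caption})

    with open(root / out_name, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["image", "prompt"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Calling `build_metadata` on a dataset folder writes the CSV next to the images. The same folder and the resulting file would then be passed as `--dataset_base_path` and `--dataset_metadata_path`.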

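Summary

ERNIE-Image shows that an 8B-parameter model can compete with larger models in text rendering, complex instruction following, structured generation, and stylistic range, while remaining practical to deploy on consumer hardware.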