ERNIE-Image DPO Alignment Training Deep Dive: SFT, Preference Optimization, and Aesthetic Evaluation Pipeline

Abstract: Since its open-source release in April 2026, ERNIE-Image has rapidly become one of the most discussed open-source text-to-image models. Its technical edge lies not only in its 8B-parameter DiT architecture, but in a carefully designed data pipeline and alignment training strategy. Drawing from the arXiv technical report (2605.25347), this article provides a deep dive into ERNIE-Image's SFT fine-tuning, DPO preference optimization, and aesthetic evaluation — revealing how a "data-driven" model outperforms through refined data strategy rather than raw parameter scale.

Why DPO Matters for Diffusion Models

In the LLM space, DPO (Direct Preference Optimization) has become the standard tool for model alignment. The journey from RLHF to DPO represents an evolution from complexity to simplicity. But in the diffusion model domain, DPO adoption is only just beginning.

ERNIE-Image's technical report provides the first detailed disclosure of adapting DPO to Flow Matching diffusion models. This isn't a simple "copy-paste" — it's a fundamental redesign of diffusion model training objectives.

The core innovation: Standard DPO targets discrete token generation in LLMs, where the likelihood of a response is a product of next-token probabilities. Diffusion models operate in a continuous space and have no such tractable likelihood. According to the report, ERNIE-Image replaces the "next-token probability" term in the DPO objective with the "$L_2$ reconstruction error of the predicted velocity field." Specifically:

The model predicts a velocity $v_{\theta}(x_t, t)$ that moves the noisy sample $x_t$ at timestep $t$ toward the data
The DPO objective maximizes the preference margin between preferred and rejected images based on velocity prediction differences
This means the model doesn't "prefer an output" — it "prefers a denoising direction"

This adaptation enables DPO to work in the continuous space of diffusion models, without requiring a separate reward model or complex PPO optimization.
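To make the velocity-based objective concrete, here is a minimal sketch in plain Python. It assumes the loss follows the general shape of published diffusion DPO objectives, where the policy's velocity error is compared against that of a frozen reference model. The reference model, the `beta` value, and all function names are assumptions for illustration; the exact formulation in the ERNIE-Image report may differ.

```python
import math

def squared_error(pred, target):
    """Squared L2 error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def flow_dpo_loss(theta_w, ref_w, target_w, theta_l, ref_l, target_l, beta=1.0):
    """DPO-style loss on velocity errors (hypothetical form).

    theta_* : velocities predicted by the model being trained
    ref_*   : velocities predicted by a frozen reference model (assumption)
    target_*: ground-truth flow-matching velocities
    _w / _l : preferred ("winner") and rejected ("loser") image
    """
    # Negative gap means the policy predicts this velocity better than the reference.
    gap_w = squared_error(theta_w, target_w) - squared_error(ref_w, target_w)
    gap_l = squared_error(theta_l, target_l) - squared_error(ref_l, target_l)
    # The margin grows when the policy improves more on the winner than on the loser.
    margin = -beta * (gap_w - gap_l)
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

When the policy equals the reference, the margin is zero and the loss is $\log 2$. Lowering the error on the preferred image, or raising it on the rejected one, pushes the loss below that value, which is the sense in which the model "prefers a denoising direction."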

Training Pipeline Overview: Bottom-up and Top-down Dual Paths

ERNIE-Image's training pipeline is explicitly divided into two paths — this is the key framework for understanding its technical advantages.

Bottom-up: Large-scale Pre-training

This phase aims to build a strong general-purpose visual generation foundation. Key steps include:

1. Fine-grained Taxonomy
All training images are classified into 10,000 specific visual categories. This isn't a coarse "animal/person/landscape" split, but a set of fine-grained labels like "city skyline-dusk-dense architecture." This classification ensures:

Long-tail concepts aren't drowned out by high-frequency categories
Each category receives adequate sampling weight
The model learns diverse visual patterns rather than a few dominant styles
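One common way to keep long-tail categories from being drowned out is temperature-smoothed sampling, where a category's sampling probability is proportional to its image count raised to a power below 1. The sketch below illustrates that idea only; the source does not state which reweighting scheme ERNIE-Image uses, and `alpha` and the function name are hypothetical.

```python
def category_sampling_probs(counts, alpha=0.5):
    """Map {category: image_count} to {category: sampling_probability}.

    alpha = 1.0 reproduces the raw data distribution;
    alpha = 0.0 samples every non-empty category uniformly.
    The default of 0.5 is an illustrative assumption.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    # Compress large counts so frequent categories lose some of their dominance.
    scaled = {c: n ** alpha for c, n in counts.items() if n > 0}
    total = sum(scaled.values())
    return {c: s / total for c, s in scaled.items()}
```

With counts of 10,000 and 100, the raw ratio is 100:1, while `alpha=0.5` reduces it to 10:1, so the rare category is seen far more often than its raw share would allow.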

2. VLM-driven Caption Enrichment
ERNIE-Image uses the Qwen3 VLM as its caption generator. This goes beyond simple image descriptions to structured description extraction:

Explicitly annotating textual elements in images (poster titles, road signs, product packaging)
Extracting spatial relationships ("left," "above," "center-aligned")
Describing professional visual attributes: color, lighting, composition

Caption quality directly determines the model's ability to follow complex instructions. ERNIE-Image's 0.89 GenEval score is largely attributable to high-quality structured captions.

3. Aesthetic Filtering
Every image passes through ERNIE-Image-Aes for aesthetic scoring before training. Low-scoring images are downweighted or removed, ensuring pre-training data has inherently high visual quality — preventing "garbage in, garbage out."
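The "downweighted or removed" rule can be sketched as a mapping from aesthetic score to sampling weight. The version below assumes a 1-10 score scale, and the two thresholds and the minimum weight are hypothetical values chosen only for illustration; the report's actual cutoffs are not given in this article.

```python
def aesthetic_weight(score, drop_below=3.0, full_at=6.0, min_weight=0.2):
    """Return a training weight in [0, 1] for an image's aesthetic score.

    All thresholds are illustrative assumptions, not values from the report.
    """
    if score < drop_below:
        return 0.0                      # removed from training
    if score >= full_at:
        return 1.0                      # kept at full weight
    # Between the two thresholds: downweight linearly from min_weight up to 1.0.
    frac = (score - drop_below) / (full_at - drop_below)
    return min_weight + (1.0 - min_weight) * frac

def filter_and_weight(records):
    """records: iterable of (image_id, score). Yields (image_id, weight) for kept images."""
    for image_id, score in records:
        weight = aesthetic_weight(score)
        if weight > 0.0:
            yield image_id, weight
```

A soft ramp between the thresholds avoids a hard cliff where two nearly identical images receive very different treatment.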

4. Progressive Resolution Training
Training starts at $256 \times 256$, progresses to $512 \times 512$, and culminates at $1024 \times 1024$. This progressive strategy helps the model learn global structure first and local details later, and it is intended to avoid the instability that can affect training directly at high resolution.
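A progressive schedule like this is usually expressed as a lookup from global training step to resolution. The sketch below uses the three resolutions named above; the per-stage step budgets are hypothetical placeholders, since the article does not say how long each stage runs.

```python
# (resolution, number of training steps) per stage.
# The resolutions come from the text; the step budgets are hypothetical.
STAGES = [(256, 500), (512, 300), (1024, 200)]

def resolution_at(step, stages=STAGES):
    """Return the training resolution used at a given global step."""
    if step < 0:
        raise ValueError("step must be non-negative")
    boundary = 0
    for resolution, num_steps in stages:
        boundary += num_steps
        if step < boundary:
            return resolution
    # Past the final boundary, stay at the highest resolution.
    return stages[-1][0]
```

In practice the low-resolution stage typically receives the largest share of steps because it is the cheapest per step, which is the assumption reflected in the placeholder budgets.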

Top-down: Specialized Post-training and Human Alignment

After pre-training, the model has general generation capabilities but needs further alignment with human preferences. This phase includes two sub-steps:

SFT (Supervised Fine-Tuning)
SFT uses curated domain-specific datasets covering high-demand scenarios:

Portrait photography (natural facial textures, realistic lighting)
Game art (character design, concept art)
Commercial design (posters, infographics, product renders)

The key innovation is caption diversification: the Qwen3 VLM generates three caption styles for the same image, namely keyword lists, natural language descriptions, and direct instructions. Training on all three helps the model understand different forms of user input, boosting prompt robustness.
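At training time, caption diversification can be as simple as drawing one of the pre-generated styles for each sample. The sketch below assumes the three captions were produced offline and stored per image; the style keys, the uniform draw, and the function name are hypothetical, and the report may mix styles differently.

```python
import random

# Style keys are illustrative names for the three caption styles described above.
CAPTION_STYLES = ("keywords", "natural", "instruction")

def pick_caption(captions, rng):
    """Choose one caption style uniformly at random for a training sample.

    captions: dict mapping style name to caption text (generated offline by a VLM)
    rng:      a random.Random instance, passed in for reproducibility
    """
    available = [s for s in CAPTION_STYLES if captions.get(s)]
    if not available:
        raise ValueError("sample has no usable caption")
    style = rng.choice(available)
    return style, captions[style]
```

For example:

```python
sample = {
    "keywords": "poster, neon, city skyline, dusk",
    "natural": "A neon-lit poster showing a city skyline at dusk.",
    "instruction": "Design a neon poster of a city skyline at dusk.",
}
style, text = pick_caption(sample, random.Random(0))
```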

DPO (Direct Preference Optimization)
This is the core of ERNIE-Image's alignment training. DPO uses paired preference data: for the same prompt, one "good" image and one "not-so-good" image. Because the objective is defined on velocity prediction error rather than on a probability, the model is trained to lower its velocity prediction error on the good image relative to the bad one, which widens the preference margin between the pair.

Anchor Losses Regularization: DPO training carries a known risk — the model may learn to "cheat" by changing fundamental image structure to achieve high preference scores rather than genuinely improving quality. ERNIE-Image introduces Anchor Losses as a regularization term, anchoring the model's core generation ability and preventing "reward hacking" during DPO training.
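The article does not spell out the form of the Anchor Losses. One plausible reading, sketched below purely as an assumption, is an ordinary flow-matching regression term on the preferred image that is added to the DPO loss with a small weight, so the model cannot improve the preference margin by drifting away from accurate velocity prediction. The function names and `anchor_weight` are hypothetical.

```python
def mean_squared_error(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def dpo_with_anchor(dpo_loss, theta_w, target_w, anchor_weight=0.1):
    """Combine a precomputed DPO loss with an anchor term (hypothetical form).

    dpo_loss : scalar preference loss for the pair
    theta_w  : velocity predicted by the model on the preferred image
    target_w : ground-truth flow-matching velocity for the preferred image
    Returns (total_loss, anchor_term).
    """
    # The anchor keeps the model's plain velocity regression on good images accurate.
    anchor = mean_squared_error(theta_w, target_w)
    return dpo_loss + anchor_weight * anchor, anchor
```

Returning the anchor term separately makes it easy to log both components and notice when the preference loss falls while the anchor rises, which is the pattern reward hacking would produce.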

ERNIE-Image-Aes: Industrial Practice of Aesthetic Evaluation

ERNIE-Image-Aes is an independent 8B VLM aesthetic scoring model, an indispensable component of the entire training pipeline.

Annotation Methodology: Swiss-Tournament
Traditional image quality annotation asks annotators to score each image 1-10, but absolute scoring suffers from severe subjectivity and inconsistency. ERNIE-Image adopts the Swiss-Tournament pairwise comparison method:

Two images are shown simultaneously to annotators
The annotator selects "the better one"
The winner earns points, and the next round pairs images with similar current "standing"
Multiple rounds converge to a stable quality ranking
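The procedure above can be sketched as a simplified Swiss-style tournament. This version pairs neighbors in the current standings each round and awards one point per win; it does not avoid rematches, and with an odd number of images the last one sits out the round. The `judge` callback stands in for the human annotator, and all names are hypothetical rather than taken from the report.

```python
import random

def swiss_tournament(items, judge, rounds, rng):
    """Rank items by pairwise comparison, Swiss-tournament style (simplified).

    items : list of hashable image identifiers
    judge : function (a, b) -> the preferred one of a and b
    rounds: number of comparison rounds
    rng   : random.Random instance used to break ties in the standings
    Returns (ranking from best to worst, points per item).
    """
    points = {item: 0 for item in items}
    for _ in range(rounds):
        order = list(items)
        rng.shuffle(order)                        # random order among tied items
        order.sort(key=lambda it: -points[it])    # stable sort: best standing first
        # Pair neighbors in the standings so similar-quality images meet.
        for a, b in zip(order[0::2], order[1::2]):
            points[judge(a, b)] += 1
    ranking = sorted(items, key=lambda it: -points[it])
    return ranking, points
```

Because each round only compares images with similar standing, the annotation budget is spent on close calls instead of obvious mismatches, which is why a few rounds are enough to separate the top and bottom of the ranking.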

Advantages:

Relative judgment is more reliable than absolute — human brains excel at comparison, not scoring
Ranking stability: multiple rounds average out chance deviations in individual judgments
Reproducibility — results across different annotator groups are highly consistent

ERIA-1K Benchmark
ERNIE-Image released the ERIA-1K benchmark — 1,000 images across 6 categories (realistic photography, digital art, illustration, poster design, anime, abstract art), labeled by professional annotators (design/fine arts backgrounds) with 1-10 tier labels. This provides a standardized test set for aesthetic evaluation models.

Avoiding Professional Bias
ERNIE-Image-Aes is designed to reflect public aesthetic preferences rather than professional photographer aesthetics. This means:

Not over-preferring "perfect exposure" or "golden ratio"
Accepting diverse compositional styles
Prioritizing information communication efficiency over pure visual beauty

This matters significantly for commercial applications — an information-rich poster may not be as "beautiful" as a landscape photo, but its aesthetic value in context is high.

Measurable Impact of DPO Training

ERNIE-Image's benchmark performance reflects the effect of DPO alignment:

Benchmark	Score	Meaning
GenEval	0.89	Single/multi-object generation & attribute binding
LongText-Bench	0.973	English-Chinese bilingual text rendering accuracy
Human Evaluation	#1 Open-Source	Overall visual quality ranking

Concrete improvements from DPO:

Portrait texture: DPO training eliminates the "over-smoothed" AI look, producing more natural facial textures
Instruction following: Accuracy on complex multi-element prompts (e.g., "9 stickers in different poses, each with different text") significantly improves
Aesthetic consistency: Generated images maintain unified aesthetic style across different prompts, rather than random fluctuations

Comparison with Other Models

Model	Alignment Method	Aesthetic Evaluation	Public Pipeline
ERNIE-Image	DPO (Flow Matching)	ERNIE-Image-Aes	✅ Fully public
FLUX.2 Pro	LoRA + manual curation	Not public	❌
SD 3.5	SFT only	CLIP-based	Partially public
Midjourney	RLHF (proprietary)	Proprietary	❌
Seedream 4.5	Not disclosed	Not disclosed	❌

ERNIE-Image's unique value is its transparency, from the data pipeline through alignment training. For researchers and developers, this is not just a usable model but a training framework they can study and reproduce.

Practical Takeaways

From ERNIE-Image's training pipeline, we can extract several insights useful for community developers:

1. Data Quality > Data Quantity
A pipeline that combines a 10,000-category taxonomy, VLM captions, and aesthetic filtering is presented as more effective than blindly increasing training data. The lesson for smaller projects is that a compact, high-quality, diverse dataset can outperform a much larger low-quality one.

2. Caption Diversification Boosts Robustness
Generating multiple caption styles (keywords/natural language/instructional) for the same image enables the model to understand different user input forms. This is a low-cost strategy for improving prompt robustness.

3. DPO Needs Regularization
Using Anchor Losses to prevent reward hacking is a broadly applicable lesson. Any preference-optimization training benefits from regularization that preserves the model's core generation ability.

4. Aesthetic Evaluation Should Reflect Target Users
ERNIE-Image-Aes emphasizes public aesthetics over professional aesthetics. When training your own LoRA or fine-tuning models, your aesthetic standards should align with your target user group, not with professional photographers.

Conclusion

ERNIE-Image's success isn't just about its 8B parameter scale — it's about a carefully designed training pipeline: Bottom-up large-scale pre-training builds general capability, Top-down SFT + DPO aligns with human preferences, and ERNIE-Image-Aes aesthetic evaluation ensures visual quality throughout. This pipeline provides a complete alignment training paradigm for open-source diffusion models, worthy of deep community study and adoption.
