ERNIE-Image-Aes Deep Dive: The 8B VLM Revolution in Image Aesthetic Scoring

2026/05/31

Publish Date: 2026-05-31
Tags: ERNIE-Image-Aes, Aesthetic Evaluation, Image Quality, VLM, ERIA-1K

In the AI image generation workflow, there's a long-overlooked bottleneck: how to automatically and objectively evaluate the aesthetic quality of generated images?

Traditional approaches rely on human review — time-consuming, subjective, and hard to scale. Existing automated aesthetic scoring models (LAION-AES, ArtiMuse, UniPercept) have systematic biases: some over-score AI-generated content, others favor black-and-white photography, and some are overly lenient with casual snapshots.

Baidu's ERNIE-Image team recently open-sourced ERNIE-Image-Aes, an 8B-parameter vision-language model designed specifically for image aesthetic scoring. On the ERIA-1K benchmark, it achieves SRCC 0.7445 and PLCC 0.7598, well ahead of the open-source baselines it is compared against (LAION-AES, ArtiMuse, UniPercept).

This article dives deep into ERNIE-Image-Aes' technical architecture, performance, and practical applications.


Why Do We Need Better Aesthetic Evaluation Models?

Data Cleaning Requirements

When training text-to-image models, the quality of training data directly determines output quality. The ERNIE-Image technical report explicitly states:

Each image is assigned an aesthetic score by ERNIE-Image-Aes, which is then used for data cleaning.

This means an accurate aesthetic scoring model is infrastructure for building high-quality text-to-image models.

Quality Control for Batch Production

When using ERNIE-Image to batch-generate product images, ad creatives, or social media content, you can't manually review every single one. An aesthetic scoring model serves as the first filter:

Generate 100 images → Aesthetic scoring → Keep top 20 → Human fine-tuning → Delivery
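
The "keep top 20" step of this pipeline can be sketched as follows. This is a minimal illustration rather than ERNIE-Image-Aes' actual API: `score_fn` is a hypothetical stand-in for any callable that returns an aesthetic score for one image, and the demo uses made-up scores keyed by file name.

```python
import heapq
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")

def select_top_k(items: Iterable[T], score_fn: Callable[[T], float], k: int) -> List[Tuple[float, T]]:
    """Return the k highest-scoring items as (score, item) pairs, best first."""
    scored = ((score_fn(item), item) for item in items)
    # heapq.nlargest only tracks k candidates, so the input can be a long stream
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])

# Made-up scores standing in for model output: file name -> aesthetic score
toy_scores = {"a.png": 6.1, "b.png": 8.4, "c.png": 3.2, "d.png": 7.9}
top = select_top_k(toy_scores, toy_scores.get, k=2)
print(top)
```

In a real pipeline `score_fn` would wrap the scoring model, and only the returned images would be passed on to human fine-tuning.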

Quantifying Model Improvements

After SFT and DPO training, how do you objectively quantify aesthetic improvements in model output? You need a reliable scoring model as an evaluation tool.


ERNIE-Image-Aes Technical Architecture

Fine-Tuned from ArtiMuse

ERNIE-Image-Aes is initialized from ArtiMuse and fine-tuned on a diverse, professionally annotated dataset.

Key design choices:

  • 8B VLM: Large enough to capture complex visual patterns while maintaining inference efficiency
  • Diverse annotation data: Covers photography, illustration, anime, product images, and more
  • Explicit category balance: Prevents any single category from dominating training signals
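
The source does not say how category balance is enforced. One common approach, shown here purely as an assumption, is to downsample every category to the size of the smallest one so that no category dominates. The function name `balance_by_category` and the toy category labels are hypothetical.

```python
import random
from collections import defaultdict

def balance_by_category(samples, seed=0):
    """Downsample every category to the size of the smallest one.

    `samples` is a list of (item, category) pairs. This is an assumed
    strategy for illustration; the source does not specify the method.
    """
    by_cat = defaultdict(list)
    for item, cat in samples:
        by_cat[cat].append(item)
    if not by_cat:
        return []
    target = min(len(items) for items in by_cat.values())
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    balanced = []
    for cat, items in by_cat.items():
        for item in rng.sample(items, target):
            balanced.append((item, cat))
    return balanced

# Toy dataset: 6 photos, 2 anime images, 3 product images
samples = (
    [(f"photo_{i}", "photography") for i in range(6)]
    + [(f"anime_{i}", "anime") for i in range(2)]
    + [(f"product_{i}", "product") for i in range(3)]
)
balanced = balance_by_category(samples)
print(len(balanced))
```

Downsampling discards data; weighting the loss or sampling probability by inverse category frequency is an alternative when the smallest category is very small.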

Solving Biases in Existing Models

This is one of ERNIE-Image-Aes' most significant contributions. Here are the systematic biases of existing models:

Model        Bias type          Manifestation
LAION-AES    Category bias      Over-scores AI-generated/anime content
ArtiMuse     Style bias         Over-scores black-and-white photography and casual snapshots
UniPercept   Color preference   Prefers monochrome images; over-scores casual snapshots

ERNIE-Image-Aes addresses these through a purpose-built annotation pipeline and explicit category balance.


ERIA-1K Benchmark: A More Realistic Evaluation

Why a New Benchmark?

Existing aesthetic benchmarks (AVA, Flickr) have a problem: they're predominantly composed of professional photographers' work, skewed toward Western photographic traditions and visually polished content, failing to reflect real-world deployment distributions.

ERIA-1K Design

  • 1,000 human-annotated images
  • Score range: 2.0 to 9.67 (covering a broad aesthetic quality spectrum)
  • Deployment-oriented: Avoids over-representation of professional/Western photography
  • Fully open source: Anyone can use it to evaluate their models

Benchmark Results

Model             SRCC     PLCC
LAION-AES         0.2944   0.3138
ArtiMuse          0.4277   0.4704
UniPercept        0.4533   0.4748
ERNIE-Image-Aes   0.7445   0.7598

The SRCC jump from 0.4533 (UniPercept, the strongest baseline) to 0.7445 is a gain of about 0.29, a qualitative leap rather than an incremental improvement.
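
For readers unfamiliar with the two metrics: SRCC (Spearman rank correlation) measures whether a model orders images the same way human annotators do, while PLCC (Pearson linear correlation) measures linear agreement between the raw scores. The sketch below computes both with NumPy only. The score lists are made up for illustration and are not ERIA-1K data.

```python
import numpy as np

def rank_with_ties(x):
    """Assign 1-based ranks, giving tied values their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    # Average the ranks within each group of equal values
    for value in np.unique(x):
        mask = x == value
        ranks[mask] = ranks[mask].mean()
    return ranks

def plcc(pred, target):
    """Pearson linear correlation coefficient."""
    return float(np.corrcoef(pred, target)[0, 1])

def srcc(pred, target):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return plcc(rank_with_ties(pred), rank_with_ties(target))

human_scores = [2.0, 4.5, 6.0, 7.5, 9.0]  # made-up human annotations
predicted = [3.1, 3.0, 6.2, 8.0, 8.5]     # made-up model predictions
print(f"SRCC={srcc(predicted, human_scores):.4f}  PLCC={plcc(predicted, human_scores):.4f}")
```

Because SRCC depends only on ordering, it is the more relevant metric when scores are used for ranking or Top-K selection rather than as absolute values.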


Practical Applications

Training Data Auto-Filtering

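# Pseudo-code: using ERNIE-Image-Aes to filter training data.
# NOTE: `ernie_image_aes`, `AesModel`, and `model.score` are illustrative names,
# not a confirmed API; adapt them to the loading code shipped with the model.
from ernie_image_aes import AesModel

model = AesModel.from_pretrained("baidu/ERNIE-Image-Aes")

SCORE_THRESHOLD = 7.0  # tune for your dataset

filtered_data = []
for image, caption in dataset:  # dataset yields (image, caption) pairs
    score = model.score(image)
    if score >= SCORE_THRESHOLD:
        filtered_data.append((image, caption))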
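
A fixed cutoff such as 7.0 is only a starting point. One alternative, sketched here as an assumption rather than the ERNIE-Image team's method, is to derive the cutoff from the score distribution so that a chosen fraction of the data survives. The helper name `threshold_for_keep_fraction` and the scores are hypothetical.

```python
import numpy as np

def threshold_for_keep_fraction(scores, keep_fraction):
    """Return the score cutoff that keeps roughly `keep_fraction` of the data."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # The (1 - keep_fraction) quantile leaves keep_fraction of the scores above it
    return float(np.quantile(np.asarray(scores, dtype=float), 1.0 - keep_fraction))

# Made-up aesthetic scores for ten images
scores = [2.0, 3.5, 5.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5]
cutoff = threshold_for_keep_fraction(scores, keep_fraction=0.3)
kept = [s for s in scores if s >= cutoff]
print(f"cutoff={cutoff:.2f}, kept {len(kept)} of {len(scores)}")
```

A quantile-based cutoff keeps the retained data volume predictable, whereas a fixed cutoff keeps the quality bar constant across datasets.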

Batch Production Quality Pipeline

ERNIE-Image generation → ERNIE-Image-Aes scoring → Top-K selection → Human review

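For e-commerce images, ad creatives, and other batch scenarios, reviewers only need to look at the Top-K survivors. Keeping the top 20-30% of outputs, for example, cuts the number of images to review by 70-80%.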

Model Output Comparison

When you train multiple LoRAs or run SFT with different hyperparameters, average aesthetic scores give you an objective way to compare output quality:

import numpy as np

# Score every output of each model, then compare the mean scores
scores_model_a = [model.score(img) for img in outputs_a]
scores_model_b = [model.score(img) for img in outputs_b]
print(f"Model A avg: {np.mean(scores_model_a):.3f}")
print(f"Model B avg: {np.mean(scores_model_b):.3f}")
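
Two averages alone do not show whether a gap is larger than sampling noise. As a hedged extension, the sketch below bootstraps a 95% confidence interval for the difference in mean scores. The function name `bootstrap_mean_diff` and the synthetic scores are assumptions for illustration only.

```python
import numpy as np

def bootstrap_mean_diff(scores_a, scores_b, n_resamples=2000, seed=0):
    """Bootstrap a 95% confidence interval for mean(scores_b) - mean(scores_a)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        # Resample each set with replacement and record the difference in means
        diffs[i] = rng.choice(b, size=b.size).mean() - rng.choice(a, size=a.size).mean()
    low, high = np.percentile(diffs, [2.5, 97.5])
    return float(b.mean() - a.mean()), float(low), float(high)

# Synthetic scores for illustration only (not real model outputs)
demo_rng = np.random.default_rng(1)
demo_scores_a = demo_rng.normal(6.0, 1.0, size=200)
demo_scores_b = demo_rng.normal(6.5, 1.0, size=200)
diff, low, high = bootstrap_mean_diff(demo_scores_a, demo_scores_b)
print(f"B - A = {diff:.3f} (95% CI {low:.3f} to {high:.3f})")
```

If the interval excludes zero, the difference between the two models is unlikely to be an artifact of which images happened to be sampled.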

Aesthetic-Guided Data Augmentation

Use aesthetic scores to guide data augmentation:

  • Low-scoring images → analyze defects (composition? color?)
  • High-scoring images → use as positive samples for augmentation
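
The routing described in these two bullets can be sketched as a simple bucketing step. The cutoffs 4.0 and 7.0 and the name `bucket_by_score` are illustrative assumptions, not values from the source.

```python
def bucket_by_score(scored_images, low_cutoff=4.0, high_cutoff=7.0):
    """Split (name, score) pairs into low / middle / high buckets.

    The default cutoffs are placeholders; choose them from your own
    score distribution.
    """
    buckets = {"low": [], "middle": [], "high": []}
    for name, score in scored_images:
        if score < low_cutoff:
            buckets["low"].append(name)      # candidates for defect analysis
        elif score >= high_cutoff:
            buckets["high"].append(name)     # candidates for positive samples
        else:
            buckets["middle"].append(name)
    return buckets

# Made-up scores for illustration
scored = [("a.png", 2.5), ("b.png", 8.1), ("c.png", 5.0), ("d.png", 7.0)]
buckets = bucket_by_score(scored)
print(buckets)
```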

Deployment Guide

Environment Setup

ERNIE-Image-Aes is based on the ArtiMuse architecture and is deployed the same way:

  • Python 3.10+
  • PyTorch 2.0+
  • Recommended GPU: a single GPU with 16 GB+ VRAM

Inference Example

# Download model
git clone https://huggingface.co/baidu/ERNIE-Image-Aes

# Inference with Python
python score_image.py --model-path ./ERNIE-Image-Aes --image test.jpg

Batch Inference Optimization

For large-scale datasets, use batch inference:

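import torch
from torch.utils.data import DataLoader

batch_size = 32
results = []
with torch.no_grad():  # inference only, so skip gradient tracking
    for batch in DataLoader(images, batch_size=batch_size):
        scores = model(batch)  # assumes the model returns one score per image
        results.extend(scores.tolist())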

Limitations and Future Directions

Current Limitations

  1. Compute cost: 8B VLM inference requires significant GPU resources
  2. Subjectivity: Aesthetics are inherently subjective; no scoring model can fully replace human judgment
  3. Cultural differences: While ERIA-1K tries to avoid Western-centrism, aesthetic preferences vary across cultures

Future Directions

  • Lightweight versions: Smaller aesthetic scoring models for edge device deployment
  • Multi-modal feedback: Not just scores, but specific aesthetic improvement suggestions
  • Domain adaptation: Domain-specific aesthetic scoring models (e-commerce, medical, industrial)

Summary

ERNIE-Image-Aes is a crucial addition to the ERNIE-Image ecosystem. It's not just an aesthetic scoring tool — it's infrastructure for the AI image generation workflow:

  • Data cleaning: Improves training data quality
  • Batch quality control: Automates selection of best outputs
  • Model evaluation: Objectively quantifies model improvements

Paired with the open-source ERIA-1K benchmark, it gives the community a fairer, more deployment-realistic evaluation standard.

As AI image generation penetrates deeper into commercial applications, a reliable aesthetic evaluation model will become a standard tool for every AI image team.


References

ERNIE-Image Team
