ERNIE-Image NVFP4 Quantized Deployment Complete Guide: Run 8B Model on 4.78GB VRAM
Published: 2026-05-06
Author: Yan Ming
Tags: NVFP4, Quantization, ComfyUI, VRAM Optimization, ERNIE-Image-Turbo
Introduction
ERNIE-Image 8B is Baidu's open-source text-to-image model with exceptional text rendering and instruction-following capabilities. However, an 8B-parameter Diffusion Transformer (DiT) model requires approximately 16GB+ of VRAM at BF16 precision — a significant barrier for consumer-grade GPU users.
While our previous GGUF quantization guide showed how to run ERNIE-Image on 24GB VRAM, community developer Bedovyy has released NVFP4 quantized versions on HuggingFace that go much further: the NVFP4 build of ERNIE-Image-Turbo is only ~4.78GB, small enough for 8GB consumer cards.
This article provides an in-depth look at NVFP4, FP8, and INT8 quantization strategies, real-world benchmarks, and ComfyUI deployment tutorials.
What is NVFP4 Quantization?
The Basics
NVFP4 (NVIDIA Float4) is NVIDIA's 4-bit floating-point quantization format designed for Blackwell architecture GPUs. Unlike traditional INT4, NVFP4 preserves the dynamic range characteristics of floating-point numbers, maintaining critical model features even at reduced precision.
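To make the contrast with INT4 concrete, the sketch below rounds values onto the FP4 (E2M1) grid using one shared scale per block. The value grid and the 16-element block size follow NVIDIA's published description of NVFP4. The function names are illustrative, and as a simplification the block scale is kept as an ordinary Python float instead of being rounded to FP8 as real NVFP4 does.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # NVFP4 shares one scale across each block of 16 values


def quantize_block(block):
    """Return (scale, codes) where codes are signed E2M1 grid values."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the top of the grid (6.0).
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(nearest if x >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Undo the scaling to recover approximate original values."""
    return [c * scale for c in codes]


def fake_quantize(values):
    """Quantize then dequantize a flat list, block by block."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        scale, codes = quantize_block(values[start:start + BLOCK_SIZE])
        out.extend(dequantize_block(scale, codes))
    return out
```

Because the grid spacing widens with magnitude (0.5 steps near zero, a step of 2 at the top), small values keep more relative precision than they would on an evenly spaced INT4 grid.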
Why DiT Models Benefit from NVFP4
ERNIE-Image is built on the Diffusion Transformer (DiT) architecture. DiT quantization faces two unique challenges:
- Salient Channels: Certain attention channels have extreme magnitudes, so a single uniform quantization scale either clips them or rounds the remaining channels toward zero
- Temporal Distribution Shift: Activation distributions vary significantly across diffusion timesteps
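The first of these challenges can be illustrated with a toy example. The sketch below is not taken from Bedovyy's tooling; the activation values are invented, and a 4-bit symmetric uniform quantizer stands in for "standard uniform quantization". It shows how one salient channel dictates the shared scale and wipes out an ordinary channel.

```python
def uniform_quantize(values, scale, levels=7):
    """Symmetric uniform quantizer: round to integers in [-levels, levels]."""
    out = []
    for v in values:
        q = round(v / scale)
        q = max(-levels, min(levels, q))
        out.append(q * scale)
    return out


# Two channels of made-up activations: one ordinary, one "salient" (~100x larger).
normal_channel = [0.02, -0.05, 0.04, 0.01]
salient_channel = [4.0, -7.0, 6.5, 3.0]

# One scale for the whole tensor is dictated by the salient channel...
tensor_scale = max(abs(v) for v in normal_channel + salient_channel) / 7
shared = uniform_quantize(normal_channel, tensor_scale)

# ...while a per-channel scale lets the ordinary channel use the full grid.
channel_scale = max(abs(v) for v in normal_channel) / 7
separate = uniform_quantize(normal_channel, channel_scale)
```

With the shared scale, every value in the ordinary channel rounds to zero. With its own scale, the channel is reproduced to within half a quantization step.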
Bedovyy's quantization strategy addresses these by keeping the most sensitive layers at their original precision and quantizing only the rest. The FP8 rule set looks like this:
{
  "block_names": ["layers"],
  "rules": [
    { "policy": "keep", "match": ["adaLN", "self_attention.norm"] },
    { "policy": "float8_e4m3fn", "match": ["mlp", "self_attention.to"] }
  ]
}
Key design decisions:
- Keep (no quantization): adaLN layers and self_attention normalization — these have the most impact on output quality
- FP8 quantization (float8_e4m3fn): MLP and attention projection layers, which hold most of the parameters but are more tolerant of quantization
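One way to read the rule set is as an ordered list of name patterns. The sketch below is an interpretation, not the confirmed behaviour of comfy-dit-quantizer: it assumes that patterns are plain substrings, that rules are tried in order, and that layers outside `block_names` or matching no rule are left untouched. The parameter names in `EXAMPLES` are hypothetical.

```python
import json

CONFIG = json.loads("""
{
  "block_names": ["layers"],
  "rules": [
    { "policy": "keep", "match": ["adaLN", "self_attention.norm"] },
    { "policy": "float8_e4m3fn", "match": ["mlp", "self_attention.to"] }
  ]
}
""")


def policy_for(layer_name, config=CONFIG):
    """Return the policy of the first rule whose pattern occurs in the name.

    Assumed semantics (not confirmed against comfy-dit-quantizer): only names
    under one of `block_names` are considered, patterns are plain substrings,
    rules are tried in order, and unmatched layers are left as they are.
    """
    if not any(part in config["block_names"] for part in layer_name.split(".")):
        return "keep"
    for rule in config["rules"]:
        if any(pattern in layer_name for pattern in rule["match"]):
            return rule["policy"]
    return "keep"


# Hypothetical parameter names, for illustration only.
EXAMPLES = [
    "layers.0.adaLN.weight",
    "layers.0.self_attention.norm_q.weight",
    "layers.0.self_attention.to_q.weight",
    "layers.0.mlp.fc1.weight",
    "final_layer.linear.weight",
]
PLAN = {name: policy_for(name) for name in EXAMPLES}
```

Under these assumptions the normalization and adaLN parameters stay at full precision, while the projection and MLP weights are marked for `float8_e4m3fn`.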
Quantized Model Performance Benchmarks
Data sourced from Bedovyy/ERNIE-Image-Quantized on HuggingFace.
ERNIE-Image-Turbo Generation Speed Comparison
| GPU | Quantization | Speed (it/s) | Time (seconds) | vs BF16 |
|---|---|---|---|---|
| RTX 5090 | BF16 | 2.09 | 4.87 | 100% |
| RTX 5090 | FP8 | 3.69 | 3.32 | 147% |
| RTX 5090 | INT8 | 4.31 | 3.05 | 160% |
| RTX 5090 | NVFP4 | 5.09 | 2.72 | 179% |
| RTX 3090 | BF16 | 0.88 | 12.42 | 100% |
| RTX 3090 | FP8 | 0.84 | 12.73 | 98% |
| RTX 3090 | INT8 | 1.66 | 7.04 | 176% |
| RTX 3090 | NVFP4 | 0.83 | 12.71 | 98% |
| RTX 3060 | BF16 | 0.26 | 43.02 | 100% |
| RTX 3060 | FP8 | 0.39 | 28.66 | 150% |
| RTX 3060 | INT8 | 0.82 | 14.43 | 298% |
| RTX 3060 | NVFP4 | 0.39 | 28.72 | 150% |
Key Findings:
- NVFP4 is fastest on RTX 5090: 1.79x faster than BF16 — 8-step Turbo generates in just 2.72 seconds
- INT8 is fastest on RTX 3090/3060: NVFP4 needs Blackwell's FP4 matrix cores to run natively. On the Ampere cards in this table, NVFP4 speed falls back to FP8-equivalent levels (no faster than BF16 on the RTX 3090), and the same is expected on Ada
- RTX 3060 INT8 has the biggest speedup: 3x faster than BF16, reducing generation from 43 seconds to 14 seconds
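The "vs BF16" column appears to be derived from the time column (BF16 seconds divided by quantized seconds) rather than from the it/s column. The short check below reproduces the percentages from the table's own numbers; the function names are illustrative.

```python
# Turbo generation times in seconds, copied from the table above.
TURBO_SECONDS = {
    "RTX 5090": {"BF16": 4.87, "FP8": 3.32, "INT8": 3.05, "NVFP4": 2.72},
    "RTX 3090": {"BF16": 12.42, "FP8": 12.73, "INT8": 7.04, "NVFP4": 12.71},
    "RTX 3060": {"BF16": 43.02, "FP8": 28.66, "INT8": 14.43, "NVFP4": 28.72},
}


def relative_speed(gpu, fmt, table=TURBO_SECONDS):
    """Percent of BF16 speed: BF16 time divided by quantized time."""
    return round(100 * table[gpu]["BF16"] / table[gpu][fmt])


def fastest_format(gpu, table=TURBO_SECONDS):
    """Format with the shortest end-to-end generation time on this GPU."""
    return min(table[gpu], key=table[gpu].get)
```

For example, `relative_speed("RTX 3060", "INT8")` gives 298 and `fastest_format("RTX 5090")` gives `"NVFP4"`, matching the findings above.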
ERNIE-Image (Standard) Generation Speed
| GPU | Quantization | Speed (it/s) | Time (seconds) | vs BF16 |
|---|---|---|---|---|
| RTX 5090 | BF16 | 1.08 | 20.08 | 100% |
| RTX 5090 | NVFP4 | 2.56 | 9.35 | 215% |
| RTX 3090 | BF16 | 0.40 | 53.33 | 100% |
| RTX 3090 | INT8 | 0.79 | 28.08 | 190% |
| RTX 3060 | BF16 | 0.11 | 201.41 | 100% |
| RTX 3060 | INT8 | 0.35 | 62.42 | 323% |
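The Standard model (50 steps) sees more dramatic quantization speedups than Turbo (8 steps), because fixed per-image costs such as text encoding and VAE decoding are spread over more denoising steps, so the quantized transformer accounts for a larger share of the total time.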
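A toy model shows why the step count matters. It assumes total time = fixed overhead + steps / (it/s), estimates the overhead from the RTX 5090 Turbo rows under the assumption that they ran 8 steps, and then asks what the end-to-end speedup would be at other step counts. The Turbo numbers are reused purely for illustration, and the overhead model is a simplification, not a measurement.

```python
def fixed_overhead(total_seconds, steps, its_per_second):
    """Time not explained by the denoising loop under the toy model."""
    return total_seconds - steps / its_per_second


def end_to_end_speedup(steps, base, quant):
    """Speedup in total time when both runs use the same number of steps.

    `base` and `quant` are (overhead_seconds, its_per_second) pairs.
    """
    base_total = base[0] + steps / base[1]
    quant_total = quant[0] + steps / quant[1]
    return base_total / quant_total


# RTX 5090 Turbo rows from the earlier table, assuming 8 steps per image.
bf16 = (fixed_overhead(4.87, 8, 2.09), 2.09)
nvfp4 = (fixed_overhead(2.72, 8, 5.09), 5.09)

SPEEDUPS = {steps: end_to_end_speedup(steps, bf16, nvfp4) for steps in (8, 20, 50)}
```

In this model the speedup grows with the number of steps and approaches the ratio of the raw iteration rates (5.09 / 2.09) without reaching it.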
VRAM Usage Comparison
| Quantization | ERNIE-Image-Turbo Model Size | Theoretical Min VRAM | Recommended VRAM |
|---|---|---|---|
| BF16 | ~16 GB | 16 GB | 24 GB |
| FP8 | ~8.22 GB | 10 GB | 16 GB |
| INT8 | ~8.22 GB | 10 GB | 16 GB |
| NVFP4 | ~4.78 GB | 6 GB | 8 GB |
NVFP4 compresses the model to 30% of its BF16 size, enabling ERNIE-Image-Turbo on 8GB consumer GPUs.
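The model sizes in the table are close to what simple arithmetic predicts. The sketch below takes "8B" at face value and assumes every weight is stored in the named format; the NVFP4 figure counts 4 bits per weight plus one 8-bit scale per 16-weight block, following NVIDIA's description of the format.

```python
PARAMS = 8e9  # "8B" parameters, taken at face value

# Storage bits per weight. NVFP4 stores 4-bit values plus one 8-bit scale
# per block of 16 weights, i.e. 4 + 8/16 = 4.5 bits on average.
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "INT8": 8, "NVFP4": 4 + 8 / 16}


def weight_gigabytes(fmt, params=PARAMS):
    """Lower-bound size in GB (10^9 bytes) if every weight used `fmt`."""
    return params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


ESTIMATES = {fmt: weight_gigabytes(fmt) for fmt in BITS_PER_WEIGHT}
```

This yields 16 GB for BF16, 8 GB for FP8/INT8, and 4.5 GB for NVFP4. The published files (~8.22 GB and ~4.78 GB) sit slightly above these bounds, which is what one would expect when some layers are kept at higher precision.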
ComfyUI Deployment Tutorial
Prerequisites
- Latest ComfyUI (with DiT quantization node support)
- CUDA 12.4+ (12.5+ recommended)
- Python 3.10+
- PyTorch 2.5+ (NVFP4 support)
Step 1: Download Quantized Models
# Clone from HuggingFace
git clone https://huggingface.co/Bedovyy/ERNIE-Image-Quantized
Or download specific formats:
- ernie-image-turbo-nvfp4.safetensors (~4.78GB)
- ernie-image-turbo-fp8.safetensors (~8.22GB)
- ernie-image-turbo-int8.safetensors (~8.22GB)
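To fetch a single file instead of cloning the whole repository, the `huggingface_hub` package can be used. This is a minimal sketch: the file names are copied from the list above and should be checked against the repository's actual file listing, and the target directory assumes a ComfyUI checkout in the current directory.

```python
REPO_ID = "Bedovyy/ERNIE-Image-Quantized"

# File names as listed above; check the repository's file listing before use.
FILES = {
    "nvfp4": "ernie-image-turbo-nvfp4.safetensors",
    "fp8": "ernie-image-turbo-fp8.safetensors",
    "int8": "ernie-image-turbo-int8.safetensors",
}


def download_turbo(fmt, target_dir="ComfyUI/models/diffusion_models"):
    """Fetch one quantized checkpoint and return its local path."""
    # Imported lazily so this module loads even without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=FILES[fmt],
        local_dir=target_dir,
    )
```

Calling `download_turbo("nvfp4")` downloads only the ~4.78GB NVFP4 file, which avoids pulling the FP8 and INT8 variants as well.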
Step 2: Install Required Nodes
cd ComfyUI/custom_nodes
git clone https://github.com/bedovyy/comfy-dit-quantizer
Step 3: ComfyUI Workflow Configuration
Core nodes for ERNIE-Image workflow:
- Load Diffusion Model — Load the quantized model file
- CLIP Text Encode — Process prompts (requires T5-XXL text encoder)
- VAE Decode — Decode latent space to image
- KSampler — Diffusion sampling node
Key Parameters:
- Steps: 8 for Turbo, 50 for Standard
- CFG Scale: 5.0-7.0 (recommended: 6.0)
- Denoise: 1.0 (text-to-image)
- Sampler: euler
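When driving ComfyUI from a script, these settings can be kept in one place. The helper below is a hypothetical convenience, not part of ComfyUI; its keys mirror the KSampler node's input names, and inputs this guide does not specify (seed, scheduler, and the model, conditioning and latent links) are deliberately left out.

```python
STEPS = {"turbo": 8, "standard": 50}
CFG_RANGE = (5.0, 7.0)


def ksampler_settings(variant, cfg=6.0):
    """Return the sampler settings recommended above as a plain dict.

    Keys mirror the input names of ComfyUI's KSampler node; inputs not
    covered by this guide (seed, scheduler, model and latent links) are
    omitted and must be supplied by the workflow.
    """
    if variant not in STEPS:
        raise ValueError(f"unknown variant: {variant!r}")
    if not CFG_RANGE[0] <= cfg <= CFG_RANGE[1]:
        raise ValueError(f"cfg {cfg} outside recommended range {CFG_RANGE}")
    return {
        "steps": STEPS[variant],
        "cfg": cfg,
        "sampler_name": "euler",
        "denoise": 1.0,
    }
```

The range check is a guard against typos; values outside 5.0-7.0 are not invalid in ComfyUI, only outside this guide's recommendation.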
Step 4: Quantization Config JSON
For custom quantization, use a configuration like the following (the same FP8 rule set shown earlier):
{
  "block_names": ["layers"],
  "rules": [
    { "policy": "keep", "match": ["adaLN", "self_attention.norm"] },
    { "policy": "float8_e4m3fn", "match": ["mlp", "self_attention.to"] }
  ]
}
Image Quality Comparison
Subjective Evaluation
| Quantization | Text Rendering | Detail Retention | Color Accuracy | Overall |
|---|---|---|---|---|
| BF16 | ★★★★★ | ★★★★★ | ★★★★★ | 10/10 |
| FP8 | ★★★★★ | ★★★★☆ | ★★★★★ | 9.5/10 |
| INT8 | ★★★★☆ | ★★★★☆ | ★★★★☆ | 8.5/10 |
| NVFP4 | ★★★★☆ | ★★★★☆ | ★★★★☆ | 8.5/10 |
Practical Conclusions:
- FP8 is virtually lossless: Visually indistinguishable from BF16 output
- INT8 and NVFP4 show minor degradation: Noticeable only in very fine text rendering and extreme color scenarios
- Turbo mode is more sensitive to quantization: 8-step fast generation + low-precision quantization can compound quality loss
Recommended Strategies
| Use Case | Recommended Quantization | Reason |
|---|---|---|
| Production/Commercial | FP8 | Near-BF16 quality, 47% speed boost on RTX 5090 |
| Daily Creation/Preview | NVFP4 (RTX 5090) or INT8 (others) | Speed priority, acceptable quality |
| Low VRAM GPUs (10-12GB) | INT8 | About half the BF16 footprint (~8.22 GB), high speedup |
| Maximum Speed Testing | NVFP4 (Blackwell GPU) | Fastest generation speed |
NVFP4 vs GGUF: Two Quantization Approaches Compared
| Feature | NVFP4/FP8/INT8 | GGUF |
|---|---|---|
| Tool | comfy-dit-quantizer | GGUF.org / llama.cpp |
| Granularity | Selective layer quantization | Uniform model-wide (Q2-Q8) |
| Hardware | NVFP4 requires Blackwell | Universal CPU/GPU |
| Min VRAM | 4.78GB (NVFP4) | ~6GB (Q4) |
| Inference Engine | ComfyUI native | llama.cpp / ComfyUI GGUF nodes |
| Quality | Higher (preserves key layers) | Q8 near-original, low-bit noticeable loss |
| Best For | High-performance GPUs | Low VRAM/consumer hardware |
Summary: On a Blackwell GPU such as the RTX 5090, NVFP4 is the better choice; on an RTX 4090, FP8 or INT8 is. If VRAM is tight or hardware is older, GGUF is more flexible.
FAQ
Q: Does NVFP4 work on RTX 4090?
NVFP4 is designed for Blackwell architecture. RTX 4090 (Ada architecture) lacks native FP4 matrix cores, so NVFP4 performance on 4090 is equivalent to FP8. RTX 4090 users should prefer INT8 quantization for the best speedup.
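A script can make this choice from the GPU's CUDA compute capability. The sketch below assumes that Blackwell parts report a major version of 10 or higher (the RTX 5090 reports 12.0), while Ampere and Ada report 8.x (the RTX 4090 reports 8.9). The helper names are illustrative; only `torch.cuda.get_device_capability` and `torch.cuda.get_device_name` are real PyTorch calls.

```python
def has_native_fp4(major, minor=0):
    """True for Blackwell-class compute capabilities (assumed: major >= 10)."""
    return major >= 10


def suggested_format(major, minor=0):
    """Follow the advice above: NVFP4 on Blackwell, INT8 otherwise."""
    return "NVFP4" if has_native_fp4(major, minor) else "INT8"


def detect_and_suggest():
    """Query the first CUDA device; requires PyTorch with CUDA available."""
    import torch  # imported lazily so the helpers above work without it

    major, minor = torch.cuda.get_device_capability(0)
    return torch.cuda.get_device_name(0), suggested_format(major, minor)
```

On an RTX 4090, `suggested_format(8, 9)` returns `"INT8"`, in line with the recommendation in this answer.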
Q: Can quantized models be fine-tuned with LoRA?
Quantized models themselves are not suitable for further training. For LoRA fine-tuning, use the full BF16 or FP16 model, then quantize the base model after training is complete.
Q: Can different quantization formats be mixed?
Yes. You can keep FP8 and INT8 builds side by side and switch between them for A/B testing. However, avoid loading models of different precisions in the same workflow, since this can fragment GPU memory.
Summary
NVFP4/FP8/INT8 quantization dramatically lowers the deployment barrier for ERNIE-Image 8B:
- VRAM requirement: Model size reduced from ~16GB (BF16) to ~4.78GB (NVFP4) or ~8.22GB (FP8/INT8)
- Generation speed: Up to 3x faster (RTX 3060 INT8)
- Quality loss: FP8 is virtually lossless; NVFP4/INT8 show minor acceptable degradation
For most users, FP8 quantization offers the best balance between quality and speed. If you have a Blackwell architecture GPU, NVFP4 gives the fastest generation and the smallest memory footprint.
References: Bedovyy/ERNIE-Image-Quantized (HuggingFace), NVIDIA Model-Optimizer, PTQ4DiT (NeurIPS 2024)