ERNIE-Image NVFP4 Quantized Deployment Complete Guide: Run 8B Model on 4.78GB VRAM

Published: 2026-05-06
Author: Yan Ming
Tags: NVFP4, Quantization, ComfyUI, VRAM Optimization, ERNIE-Image-Turbo

Introduction

ERNIE-Image 8B is Baidu's open-source text-to-image model with exceptional text rendering and instruction-following capabilities. However, an 8B-parameter Diffusion Transformer (DiT) needs roughly 16GB of VRAM for the weights alone at BF16 precision (8 billion parameters at 2 bytes each), which is a significant barrier for consumer-grade GPU users.

While our previous GGUF quantization guide showed how to run ERNIE-Image on 24GB VRAM, community developer Bedovyy has released NVFP4 quantized versions on HuggingFace that go much further: the ERNIE-Image-Turbo NVFP4 checkpoint is only ~4.78GB and fits on GPUs with 6-8GB of VRAM.

This article provides an in-depth look at NVFP4, FP8, and INT8 quantization strategies, real-world benchmarks, and ComfyUI deployment tutorials.

What is NVFP4 Quantization?

The Basics

NVFP4 (NVIDIA Float4) is NVIDIA's 4-bit floating-point quantization format designed for Blackwell architecture GPUs. Traditional INT4 spaces its 16 levels uniformly. NVFP4 instead stores each value with a sign, exponent, and mantissa (E2M1) and pairs each small block of values with a shared scale factor, so it keeps floating-point dynamic range even at 4 bits.
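To make the difference from INT4 concrete, the sketch below quantizes one block of values onto the FP4 (E2M1) grid with a shared per-block scale. It is a simplified illustration, not NVIDIA's implementation: the block size of 16 and the E2M1 value set follow the published NVFP4 format, but the scale is kept as a plain Python float instead of being stored in FP8, and the helper names are made up for this example.

```python
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # NVFP4 shares one scale factor across 16 consecutive values


def quantize_block(block):
    """Return (scale, codes), where codes are signed values from the E2M1 grid."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest grid value (6.0).
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(nearest if x >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from the grid codes and the block scale."""
    return [c * scale for c in codes]


block = [0.01 * i for i in range(BLOCK_SIZE)]  # 0.00, 0.01, ..., 0.15
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.4f}  max abs error={max_err:.4f}")
```

Because the grid is denser near zero than near its maximum, small values inside a block are reproduced more precisely than they would be with evenly spaced levels.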

Why DiT Models Benefit from NVFP4

ERNIE-Image is built on the Diffusion Transformer (DiT) architecture. DiT quantization faces two unique challenges:

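Salient Channels: Certain channels have extreme magnitudes, so a uniform quantizer must either clip them or stretch its range until ordinary values lose most of their resolution
Temporal Distribution Shift: Activation distributions vary significantly across diffusion timesteps, so a scale calibrated at one timestep can fit another poorly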
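The salient-channel problem can be shown with a toy example. The sketch below applies a generic symmetric INT4 quantizer with a single shared scale (an illustration only, not the scheme used by any tool named in this article) to a handful of made-up values, with and without one large outlier.

```python
def int4_symmetric(values):
    """Uniform symmetric INT4: one scale for all values, integer levels in [-7, 7]."""
    scale = max(abs(v) for v in values) / 7
    return [max(-7, min(7, round(v / scale))) * scale for v in values]


typical = [0.02, -0.03, 0.01, 0.04]   # made-up "ordinary" channel values
with_outlier = typical + [8.0]        # the same values plus one salient channel

print("without outlier:", int4_symmetric(typical))
print("with outlier:   ", int4_symmetric(with_outlier))
```

With the outlier present, the shared scale grows so large that every ordinary value rounds to zero, which is why sensitive layers are often kept at higher precision or given finer-grained scales.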

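Bedovyy's quantization strategy addresses these by selectively preserving critical layers. The example configuration below keeps the sensitive layers at full precision and assigns the remaining layers an FP8 policy: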

{
  "block_names": ["layers"],
  "rules": [
    { "policy": "keep", "match": ["adaLN", "self_attention.norm"] },
    { "policy": "float8_e4m3fn", "match": ["mlp", "self_attention.to"] }
  ]
}

Key design decisions:

Keep (no quantization): adaLN layers and self_attention normalization, which have the most impact on output quality
FP8 quantization (float8_e4m3fn): MLP and attention projection layers, which hold most of the parameters and tolerate quantization better
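As a rough illustration of how such a rule list could be applied, the sketch below assigns a policy to each parameter name. The matching semantics (substring match, first matching rule wins, names outside the listed blocks left untouched) are assumptions made for this example, not confirmed behaviour of comfy-dit-quantizer, and the parameter names are hypothetical.

```python
import json

CONFIG = json.loads("""
{
  "block_names": ["layers"],
  "rules": [
    { "policy": "keep", "match": ["adaLN", "self_attention.norm"] },
    { "policy": "float8_e4m3fn", "match": ["mlp", "self_attention.to"] }
  ]
}
""")


def policy_for(param_name, config, default="keep"):
    """Assumed semantics: only names under a listed block are considered,
    and the first rule containing a matching substring wins."""
    if not any(block in param_name.split(".") for block in config["block_names"]):
        return default
    for rule in config["rules"]:
        if any(pattern in param_name for pattern in rule["match"]):
            return rule["policy"]
    return default


# Hypothetical parameter names, chosen only to exercise each rule.
names = [
    "layers.0.adaLN_modulation.weight",
    "layers.0.self_attention.norm_q.weight",
    "layers.0.self_attention.to_q.weight",
    "layers.0.mlp.up_proj.weight",
    "final_layer.linear.weight",
]
for name in names:
    print(f"{name:45s} -> {policy_for(name, CONFIG)}")
```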

Quantized Model Performance Benchmarks

Data sourced from Bedovyy/ERNIE-Image-Quantized on HuggingFace.

ERNIE-Image-Turbo Generation Speed Comparison

GPU	Quantization	Speed (it/s)	Time (seconds)	vs BF16
RTX 5090	BF16	2.09	4.87	100%
RTX 5090	FP8	3.69	3.32	147%
RTX 5090	INT8	4.31	3.05	160%
RTX 5090	NVFP4	5.09	2.72	179%
RTX 3090	BF16	0.88	12.42	100%
RTX 3090	FP8	0.84	12.73	98%
RTX 3090	INT8	1.66	7.04	176%
RTX 3090	NVFP4	0.83	12.71	98%
RTX 3060	BF16	0.26	43.02	100%
RTX 3060	FP8	0.39	28.66	150%
RTX 3060	INT8	0.82	14.43	298%
RTX 3060	NVFP4	0.39	28.72	150%

Key Findings:

NVFP4 is fastest on RTX 5090: 1.79x the speed of BF16, so the 8-step Turbo model generates an image in just 2.72 seconds
INT8 is fastest on RTX 3090/3060: NVFP4 needs Blackwell hardware (FP4 matrix cores) to accelerate. On the Ampere cards tested, NVFP4 runs at FP8-equivalent speed (0.83 vs 0.84 it/s on RTX 3090, 0.39 it/s for both on RTX 3060), and on RTX 3090 neither is faster than BF16
RTX 3060 INT8 has the biggest speedup: about 3x the speed of BF16 (298%), cutting generation from 43 seconds to 14 seconds
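The "vs BF16" column can be reproduced from the reported times. The short sketch below assumes the column is the BF16 time divided by the quantized time, expressed as a percentage; the numbers are copied from the Turbo table and the function names are illustrative.

```python
# Generation times in seconds, copied from the ERNIE-Image-Turbo table.
turbo_seconds = {
    "RTX 5090": {"BF16": 4.87, "FP8": 3.32, "INT8": 3.05, "NVFP4": 2.72},
    "RTX 3090": {"BF16": 12.42, "FP8": 12.73, "INT8": 7.04, "NVFP4": 12.71},
    "RTX 3060": {"BF16": 43.02, "FP8": 28.66, "INT8": 14.43, "NVFP4": 28.72},
}


def relative_speed(times):
    """Speed relative to BF16 in percent (BF16 time / format time)."""
    base = times["BF16"]
    return {fmt: round(100 * base / t) for fmt, t in times.items()}


def fastest(times):
    """Format with the shortest generation time."""
    return min(times, key=times.get)


for gpu, times in turbo_seconds.items():
    print(gpu, relative_speed(times), "fastest:", fastest(times))
```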

ERNIE-Image (Standard) Generation Speed

GPU	Quantization	Speed (it/s)	Time (seconds)	vs BF16
RTX 5090	BF16	1.08	20.08	100%
RTX 5090	NVFP4	2.56	9.35	215%
RTX 3090	BF16	0.40	53.33	100%
RTX 3090	INT8	0.79	28.08	190%
RTX 3060	BF16	0.11	201.41	100%
RTX 3060	INT8	0.35	62.42	323%

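The Standard model (50 steps) sees larger quantization speedups than Turbo (8 steps): 215% vs 179% for NVFP4 on RTX 5090, and 323% vs 298% for INT8 on RTX 3060. A likely reason is that fixed per-image costs such as text encoding and VAE decoding, which quantizing the transformer does not accelerate, make up a smaller share of a longer sampling run.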
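One way to see why fewer steps dilute the speedup is to estimate how much of each Turbo run falls outside the sampling loop. The sketch below assumes that total time equals a fixed overhead plus steps divided by iteration speed. That is a simplification; the benchmark source does not state how the two columns were measured.

```python
TURBO_STEPS = 8  # step count given for Turbo in this article


def fixed_overhead(total_seconds, its_per_second, steps=TURBO_STEPS):
    """Seconds not explained by the sampling loop, assuming
    total = overhead + steps / speed."""
    return total_seconds - steps / its_per_second


# RTX 5090 Turbo rows from the table: (seconds, it/s)
rows = {"BF16": (4.87, 2.09), "NVFP4": (2.72, 5.09)}
for fmt, (seconds, speed) in rows.items():
    overhead = fixed_overhead(seconds, speed)
    print(f"{fmt}: ~{overhead:.2f}s outside the sampling loop "
          f"({overhead / seconds:.0%} of total)")
```

Under this assumption the overhead is about one second for both formats, so it takes up a larger fraction of the shorter NVFP4 run and caps the end-to-end speedup.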

VRAM Usage Comparison

Quantization	ERNIE-Image-Turbo Model Size	Theoretical Min VRAM	Recommended VRAM
BF16	~16 GB	16 GB	24 GB
FP8	~8.22 GB	10 GB	16 GB
INT8	~8.22 GB	10 GB	16 GB
NVFP4	~4.78 GB	6 GB	8 GB

NVFP4 compresses the model to about 30% of its BF16 size (~4.78GB vs ~16GB), enabling ERNIE-Image-Turbo on 8GB consumer GPUs.
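The model sizes in the table are close to what a back-of-the-envelope estimate gives. The sketch below assumes exactly 8 billion parameters (the article only says "8B") and, for NVFP4, one 8-bit scale per block of 16 four-bit values; it ignores the layers that are kept at higher precision.

```python
PARAMS = 8e9  # "8B" from the article; the exact parameter count is an assumption

BITS_PER_WEIGHT = {
    "BF16": 16,
    "FP8": 8,
    "INT8": 8,
    # 4-bit value plus one 8-bit scale shared by each block of 16 values
    "NVFP4": 4 + 8 / 16,
}


def weight_gb(bits, params=PARAMS):
    """Weight storage in GB (10^9 bytes) for a given bit width."""
    return params * bits / 8 / 1e9


for fmt, bits in BITS_PER_WEIGHT.items():
    print(f"{fmt:6s} ~{weight_gb(bits):.2f} GB")
```

The estimates (16, 8, and 4.5 GB) sit slightly below the reported 8.22GB and 4.78GB files, which is consistent with some layers being stored unquantized.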

ComfyUI Deployment Tutorial

Prerequisites

Latest ComfyUI (with DiT quantization node support)
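CUDA 12.4+ (12.5+ recommended; Blackwell GPUs such as the RTX 5090 need a CUDA 12.8+ build)
Python 3.10+
PyTorch 2.5+ (2.7+ for Blackwell GPUs, which NVFP4 acceleration requires)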
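A quick way to check these prerequisites is a small script like the one below. It is a sketch, not an official checker: the version thresholds come from the list above, and treating compute capability 10.x or 12.x as Blackwell is an assumption of this example.

```python
import sys


def version_tuple(version):
    """Turn a version string such as '2.5.1+cu124' into (2, 5)."""
    major, minor = version.split("+")[0].split(".")[:2]
    return int(major), int(minor)


def check_environment():
    """Collect a small report on Python, PyTorch, CUDA, and the GPU."""
    report = {"python_ok": sys.version_info >= (3, 10)}
    try:
        import torch
    except ImportError:
        report["torch_ok"] = False  # PyTorch is not installed
        return report
    report["torch_ok"] = version_tuple(torch.__version__) >= (2, 5)
    report["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    if torch.cuda.is_available():
        major, _minor = torch.cuda.get_device_capability(0)
        report["gpu"] = torch.cuda.get_device_name(0)
        # Assumption: compute capability 10.x or 12.x indicates Blackwell.
        report["blackwell"] = major in (10, 12)
    return report


print(check_environment())
```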

Step 1: Download Quantized Models

# Clone from HuggingFace (install Git LFS first so the .safetensors files are fetched)
git clone https://huggingface.co/Bedovyy/ERNIE-Image-Quantized

# Or download specific formats:
# - ernie-image-turbo-nvfp4.safetensors  (~4.78GB)
# - ernie-image-turbo-fp8.safetensors    (~8.22GB)
# - ernie-image-turbo-int8.safetensors   (~8.22GB)
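To fetch a single format without cloning the whole repository, something like the following should work with the `huggingface_hub` package. The file names are taken from the list above and are assumed to sit at the repository root; the target directory is ComfyUI's usual location for diffusion models and may differ in your installation.

```python
REPO_ID = "Bedovyy/ERNIE-Image-Quantized"

# File names as listed in this article; not verified against the repository.
FILES = {
    "nvfp4": "ernie-image-turbo-nvfp4.safetensors",
    "fp8": "ernie-image-turbo-fp8.safetensors",
    "int8": "ernie-image-turbo-int8.safetensors",
}


def download_checkpoint(fmt, target_dir="ComfyUI/models/diffusion_models"):
    """Download one quantized checkpoint and return its local path."""
    # Imported here so the module loads even without the package installed.
    from huggingface_hub import hf_hub_download  # pip install huggingface_hub

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=FILES[fmt],
        local_dir=target_dir,
    )


# Example (starts a multi-gigabyte download):
# path = download_checkpoint("nvfp4")
```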

Step 2: Install Required Nodes

cd ComfyUI/custom_nodes
git clone https://github.com/bedovyy/comfy-dit-quantizer

Step 3: ComfyUI Workflow Configuration

Core nodes for ERNIE-Image workflow:

Load Diffusion Model: loads the quantized model file
CLIP Text Encode: processes prompts (requires the T5-XXL text encoder)
KSampler: runs diffusion sampling
VAE Decode: decodes the latent to an image

Key Parameters:

Steps: 8 for Turbo, 50 for Standard
CFG Scale: 5.0-7.0 (recommended: 6.0)
Denoise: 1.0 (text-to-image)
Sampler: euler
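For readers who drive ComfyUI through its API-format workflow JSON, the parameters above map onto a KSampler node roughly as follows. This is a sketch: the scheduler value is an assumption (the article does not name one), and the node ids in the links are placeholders for whatever ids your own workflow uses.

```python
def ksampler_node(turbo=True, cfg=6.0, seed=0):
    """Build a KSampler entry for a ComfyUI API-format workflow."""
    return {
        "class_type": "KSampler",
        "inputs": {
            "seed": seed,
            "steps": 8 if turbo else 50,
            "cfg": cfg,
            "sampler_name": "euler",
            "scheduler": "simple",  # assumption: not specified in the article
            "denoise": 1.0,         # full denoise for text-to-image
            # Links are [source_node_id, output_index]; these ids are placeholders.
            "model": ["1", 0],
            "positive": ["2", 0],
            "negative": ["3", 0],
            "latent_image": ["4", 0],
        },
    }


node = ksampler_node(turbo=True)
print(node["inputs"]["steps"], node["inputs"]["cfg"], node["inputs"]["sampler_name"])
```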

Step 4: Quantization Config JSON

For custom quantization, use this configuration:

{
  "block_names": ["layers"],
  "rules": [
    { "policy": "keep", "match": ["adaLN", "self_attention.norm"] },
    { "policy": "float8_e4m3fn", "match": ["mlp", "self_attention.to"] }
  ]
}
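Before handing a hand-edited config to the quantizer, it can help to check its shape. The validator below is a sketch that checks structure only; it is based solely on the two keys visible in the example above and does not know which policy names the tool actually accepts.

```python
import json


def validate_config(config):
    """Structural checks only; policy names are not verified against the tool."""
    if not isinstance(config.get("block_names"), list) or not config["block_names"]:
        raise ValueError("block_names must be a non-empty list")
    rules = config.get("rules")
    if not isinstance(rules, list) or not rules:
        raise ValueError("rules must be a non-empty list")
    for i, rule in enumerate(rules):
        if not isinstance(rule.get("policy"), str):
            raise ValueError(f"rule {i}: policy must be a string")
        match = rule.get("match")
        if not isinstance(match, list) or not all(isinstance(m, str) for m in match):
            raise ValueError(f"rule {i}: match must be a list of strings")
    return config


config = validate_config({
    "block_names": ["layers"],
    "rules": [
        {"policy": "keep", "match": ["adaLN", "self_attention.norm"]},
        {"policy": "float8_e4m3fn", "match": ["mlp", "self_attention.to"]},
    ],
})
text = json.dumps(config, indent=2)  # ready to write to a .json file
print(text)
```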

Image Quality Comparison

Subjective Evaluation

Quantization	Text Rendering	Detail Retention	Color Accuracy	Overall
BF16	★★★★★	★★★★★	★★★★★	10/10
FP8	★★★★★	★★★★☆	★★★★★	9.5/10
INT8	★★★★☆	★★★★☆	★★★★☆	8.5/10
NVFP4	★★★★☆	★★★★☆	★★★★☆	8.5/10

Practical Conclusions:

FP8 is virtually lossless: Visually indistinguishable from BF16 output
INT8 and NVFP4 show minor degradation: Noticeable only in very fine text rendering and extreme color scenarios
Turbo mode is more sensitive to quantization: combining 8-step fast generation with low-precision weights can compound quality loss, likely because fewer denoising steps leave less room to correct quantization error
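The ratings above are subjective. If you want a rough number to go with them, one common option is the peak signal-to-noise ratio (PSNR) between a BF16 render and a quantized render of the same prompt and seed. The sketch below works on flat pixel sequences and uses made-up values; it was not used to produce the table.

```python
import math


def psnr(reference, candidate, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(candidate):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, candidate)) / len(reference)
    return math.inf if mse == 0 else 10 * math.log10(max_value ** 2 / mse)


# Made-up 8-bit pixel values standing in for two renders of the same image.
bf16_pixels = [0, 64, 128, 255]
quantized_pixels = [0, 66, 127, 255]
print(f"PSNR: {psnr(bf16_pixels, quantized_pixels):.1f} dB")
```

Higher values mean the quantized output is closer to the reference. PSNR does not capture perceptual issues such as garbled text, so it complements visual inspection without replacing it.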

Recommended Strategies

Use Case	Recommended Quantization	Reason
Production/Commercial	FP8	Near-BF16 quality; 47% faster than BF16 on RTX 5090 (no speedup on RTX 3090)
Daily Creation/Preview	NVFP4 (RTX 5090) or INT8 (others)	Speed priority, acceptable quality
Low VRAM GPUs (8-12GB)	INT8 (12GB cards) or NVFP4 (8GB cards)	INT8 needs ~10GB and gives the largest speedup on older cards; NVFP4 fits in ~6GB
Maximum Speed Testing	NVFP4 (Blackwell GPU)	Fastest generation speed
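The recommendations can be condensed into a small helper. The function below is one possible reading of the tables in this article, using the "Theoretical Min VRAM" column as the fit check; the function name and its decision order are illustrative, not an official rule.

```python
# "Theoretical Min VRAM" values for ERNIE-Image-Turbo, from the VRAM table.
MIN_VRAM_GB = {"FP8": 10, "INT8": 10, "NVFP4": 6}


def recommend(blackwell, vram_gb, quality_first=False):
    """Pick a quantization format; a rough reading of the tables, not a rule."""
    if vram_gb < MIN_VRAM_GB["NVFP4"]:
        raise ValueError("below the smallest listed minimum (6 GB)")
    if quality_first and vram_gb >= MIN_VRAM_GB["FP8"]:
        return "FP8"    # closest to BF16 quality
    if blackwell:
        return "NVFP4"  # fastest where FP4 hardware exists
    if vram_gb >= MIN_VRAM_GB["INT8"]:
        return "INT8"   # largest speedup on the older cards tested
    return "NVFP4"      # fits in memory, but no FP4 speedup off Blackwell


print(recommend(blackwell=True, vram_gb=32))
print(recommend(blackwell=False, vram_gb=12))
print(recommend(blackwell=False, vram_gb=24, quality_first=True))
```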

NVFP4 vs GGUF: Two Quantization Approaches Compared

Feature	NVFP4/FP8/INT8	GGUF
Tool	comfy-dit-quantizer	GGUF.org / llama.cpp
Granularity	Selective layer quantization	Uniform model-wide (Q2-Q8)
Hardware	NVFP4 requires Blackwell	Universal CPU/GPU
Min VRAM	~6GB (NVFP4, 4.78GB model file)	~6GB (Q4)
Inference Engine	ComfyUI native	llama.cpp / ComfyUI GGUF nodes
Quality	Higher (preserves key layers)	Q8 near-original, low-bit noticeable loss
Best For	High-performance GPUs	Low VRAM/consumer hardware

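Summary: If you have a Blackwell GPU such as the RTX 5090, NVFP4 is the better choice, with FP8 when quality matters most. On older cards with enough VRAM, INT8 gave the best speedups in the benchmarks above. If VRAM is tight or hardware is older still, GGUF is more flexible.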

FAQ

Q: Does NVFP4 work on RTX 4090?

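NVFP4 is designed for Blackwell architecture. The RTX 4090 (Ada architecture) lacks native FP4 matrix cores, so an NVFP4 checkpoint gains no FP4 speedup there; expect roughly FP8-level performance. The benchmarks above do not include an RTX 4090, but INT8 gave the best speedup on the non-Blackwell cards tested, so RTX 4090 users should try INT8 first.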

Q: Can quantized models be fine-tuned with LoRA?

Quantized models themselves are not suitable for further training. For LoRA fine-tuning, use the full BF16 or FP16 model, then quantize the base model after training is complete.

Q: Can different quantization formats be mixed?

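Yes. You can keep FP8 and INT8 checkpoints side by side and switch between them for A/B testing. However, avoid loading more than one precision into a single workflow: each loaded checkpoint occupies its own VRAM (about 8.22GB each for FP8 and INT8), and mixing them can fragment memory.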

Summary

NVFP4/FP8/INT8 quantization dramatically lowers the deployment barrier for ERNIE-Image 8B:

VRAM requirement: Weights shrink from ~16GB (BF16) to ~8.22GB (FP8/INT8) or ~4.78GB (NVFP4)
Generation speed: Up to 3.2x the BF16 speed (RTX 3060, INT8, Standard model)
Quality loss: FP8 is virtually lossless; NVFP4/INT8 show minor, acceptable degradation

For most users, FP8 quantization offers the best balance between quality and speed. If you have a Blackwell architecture GPU, NVFP4 provides the fastest generation.

References: Bedovyy/ERNIE-Image-Quantized (HuggingFace), NVIDIA Model-Optimizer, PTQ4DiT (NeurIPS 2024)
