ERNIE-Image NVFP4 Quantized Deployment Complete Guide: Run 8B Model on 4.78GB VRAM

May 6, 2026

Published: 2026-05-06
Author: Yan Ming
Tags: NVFP4, Quantization, ComfyUI, VRAM Optimization, ERNIE-Image-Turbo


Introduction

ERNIE-Image 8B is Baidu's open-source text-to-image model with exceptional text rendering and instruction-following capabilities. However, an 8B-parameter Diffusion Transformer (DiT) needs about 16GB of VRAM for its weights alone at BF16 precision (8 billion parameters × 2 bytes), a significant barrier for consumer-grade GPU users.

While our previous GGUF quantization guide showed how to run ERNIE-Image on 24GB VRAM, community developer Bedovyy has released NVFP4 quantized versions on HuggingFace that go much further — ERNIE-Image-Turbo runs in just ~4.78GB VRAM.

This article provides an in-depth look at NVFP4, FP8, and INT8 quantization strategies, real-world benchmarks, and ComfyUI deployment tutorials.


What is NVFP4 Quantization?

The Basics

NVFP4 (NVIDIA Float4) is NVIDIA's 4-bit floating-point quantization format designed for Blackwell architecture GPUs. Unlike traditional INT4, NVFP4 preserves the dynamic range characteristics of floating-point numbers, maintaining critical model features even at reduced precision.
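To make the format more concrete, here is a minimal pure-Python sketch of 4-bit floating-point block quantization. It assumes the publicly described NVFP4 layout (E2M1 values in blocks of 16 that share one scale). For simplicity the scale is kept as an ordinary Python float rather than an FP8 value, and the function names are hypothetical, so treat it as an illustration rather than NVIDIA's implementation.

```python
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # values sharing one scale factor (assumed NVFP4 block size)


def quantize_block(block):
    """Fake-quantize one block: scale to the E2M1 range, snap to the grid, scale back."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / E2M1_VALUES[-1]  # map the largest magnitude onto 6.0
    out = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - target))
        out.append((nearest if x >= 0 else -nearest) * scale)
    return out


def quantize(values):
    """Fake-quantize a flat list of weights block by block."""
    result = []
    for start in range(0, len(values), BLOCK_SIZE):
        result.extend(quantize_block(values[start:start + BLOCK_SIZE]))
    return result


weights = [0.01 * i for i in range(-8, 8)]
restored = quantize(weights)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max abs error in one block: {max_error:.4f}")
```

Because the grid is denser near zero than near the maximum, small weights in a block are represented with finer steps than large ones, which is the floating-point property the paragraph above refers to.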

Why DiT Models Benefit from NVFP4

ERNIE-Image is built on the Diffusion Transformer (DiT) architecture. DiT quantization faces two unique challenges:

  1. Salient Channels: Certain attention channels have extreme magnitudes, so a single uniform quantization scale either clips them or wipes out the smaller values around them
  2. Temporal Distribution Shift: Activation distributions vary significantly across diffusion timesteps
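The first challenge can be illustrated with a toy example. The sketch below uses made-up activation values and plain symmetric INT4 rounding (not ERNIE-Image data or any specific library) to show how one outlier channel dictates the step size when a single scale is shared.

```python
def quantize_int4(values, scale):
    """Symmetric uniform quantization to integer levels in [-7, 7], then dequantize."""
    return [max(-7, min(7, round(v / scale))) * scale for v in values]


# Toy activations: three ordinary channels and one salient (outlier) channel.
activations = [0.12, -0.30, 0.25, 48.0]

# One scale for the whole tensor: the outlier decides the step size.
per_tensor_scale = max(abs(v) for v in activations) / 7
shared = quantize_int4(activations, per_tensor_scale)

# Separate scale for the ordinary channels once the outlier is handled on its own.
ordinary = activations[:3]
ordinary_scale = max(abs(v) for v in ordinary) / 7
separate = quantize_int4(ordinary, ordinary_scale)

print("shared scale  :", shared[:3])    # ordinary channels collapse to zero
print("separate scale:", separate)      # ordinary channels keep their values
```

With the shared scale the step size is about 6.9, so every ordinary value rounds to zero. With their own scale the same values survive with small rounding error.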

Bedovyy's quantization strategy addresses these challenges by leaving the most sensitive layers unquantized and quantizing only the rest. The rule set is shown here with the FP8 policy:

{
  "block_names": ["layers"],
  "rules": [
    { "policy": "keep", "match": ["adaLN", "self_attention.norm"] },
    { "policy": "float8_e4m3fn", "match": ["mlp", "self_attention.to"] }
  ]
}

Key design decisions:

  • Keep (no quantization): adaLN layers and self_attention normalization — these have the most impact on output quality
  • FP8 quantization (float8_e4m3fn): MLP and attention projection layers — large parameter counts but more tolerant to quantization
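A short sketch of how such a rule set could be applied to parameter names follows. The matching semantics (plain substring matching, first matching rule wins, anything outside the listed blocks or unmatched is left alone) are assumptions made for illustration, and the parameter names are hypothetical. The actual behaviour of comfy-dit-quantizer may differ.

```python
import json

CONFIG = json.loads("""
{
  "block_names": ["layers"],
  "rules": [
    { "policy": "keep", "match": ["adaLN", "self_attention.norm"] },
    { "policy": "float8_e4m3fn", "match": ["mlp", "self_attention.to"] }
  ]
}
""")


def policy_for(name, config):
    """Return the policy of the first rule whose pattern occurs in `name`.

    Assumed semantics: plain substring matching, rules checked in order,
    and parameters outside the listed blocks (or unmatched) left untouched.
    """
    if not any(block in name for block in config["block_names"]):
        return "keep"
    for rule in config["rules"]:
        if any(pattern in name for pattern in rule["match"]):
            return rule["policy"]
    return "keep"


# Hypothetical parameter names, for illustration only.
names = [
    "layers.0.adaLN_modulation.weight",
    "layers.0.self_attention.norm_q.weight",
    "layers.0.self_attention.to_q.weight",
    "layers.0.mlp.up_proj.weight",
    "final_layer.linear.weight",
]
for name in names:
    print(f"{name:45s} -> {policy_for(name, CONFIG)}")
```

Putting the "keep" rule first matters under this reading: a name such as `self_attention.norm_q` is claimed by the first rule before the broader attention patterns are tried.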

Quantized Model Performance Benchmarks

Data sourced from Bedovyy/ERNIE-Image-Quantized on HuggingFace.

ERNIE-Image-Turbo Generation Speed Comparison

| GPU | Quantization | Speed (it/s) | Time (s) | Speed vs BF16 (by total time) |
|---|---|---|---|---|
| RTX 5090 | BF16 | 2.09 | 4.87 | 100% |
| RTX 5090 | FP8 | 3.69 | 3.32 | 147% |
| RTX 5090 | INT8 | 4.31 | 3.05 | 160% |
| RTX 5090 | NVFP4 | 5.09 | 2.72 | 179% |
| RTX 3090 | BF16 | 0.88 | 12.42 | 100% |
| RTX 3090 | FP8 | 0.84 | 12.73 | 98% |
| RTX 3090 | INT8 | 1.66 | 7.04 | 176% |
| RTX 3090 | NVFP4 | 0.83 | 12.71 | 98% |
| RTX 3060 | BF16 | 0.26 | 43.02 | 100% |
| RTX 3060 | FP8 | 0.39 | 28.66 | 150% |
| RTX 3060 | INT8 | 0.82 | 14.43 | 298% |
| RTX 3060 | NVFP4 | 0.39 | 28.72 | 150% |

Key Findings:

  1. NVFP4 is fastest on RTX 5090: 1.79x the speed of BF16, so 8-step Turbo generates in just 2.72 seconds
  2. INT8 is fastest on RTX 3090/3060: NVFP4 needs Blackwell hardware (native FP4 Tensor Cores). On older architectures (Ampere/Ada), NVFP4 falls back to roughly FP8-level speed, as the RTX 3090/3060 rows show
  3. RTX 3060 INT8 has the biggest speedup: about 3x the speed of BF16, cutting generation from 43 seconds to 14 seconds
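The "vs BF16" percentages can be reproduced from the total times in the Turbo table as BF16 time divided by each format's time. The following sketch recomputes them and picks the fastest format per GPU. The helper names are hypothetical; the numbers are copied from the table.

```python
# Total generation times in seconds for ERNIE-Image-Turbo, copied from the table above.
TURBO_SECONDS = {
    "RTX 5090": {"BF16": 4.87, "FP8": 3.32, "INT8": 3.05, "NVFP4": 2.72},
    "RTX 3090": {"BF16": 12.42, "FP8": 12.73, "INT8": 7.04, "NVFP4": 12.71},
    "RTX 3060": {"BF16": 43.02, "FP8": 28.66, "INT8": 14.43, "NVFP4": 28.72},
}


def relative_speed(times):
    """Speed of each format relative to BF16, in percent (BF16 time / format time)."""
    baseline = times["BF16"]
    return {fmt: round(100 * baseline / seconds) for fmt, seconds in times.items()}


def fastest(times):
    """Format with the shortest total generation time."""
    return min(times, key=times.get)


for gpu, times in TURBO_SECONDS.items():
    print(gpu, relative_speed(times), "fastest:", fastest(times))
```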

ERNIE-Image (Standard) Generation Speed

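| GPU | Quantization | Speed (it/s) | Time (s) | Speed vs BF16 (by total time) |
|---|---|---|---|---|
| RTX 5090 | BF16 | 1.08 | 20.08 | 100% |
| RTX 5090 | NVFP4 | 2.56 | 9.35 | 215% |
| RTX 3090 | BF16 | 0.40 | 53.33 | 100% |
| RTX 3090 | INT8 | 0.79 | 28.08 | 190% |
| RTX 3060 | BF16 | 0.11 | 201.41 | 100% |
| RTX 3060 | INT8 | 0.35 | 62.42 | 323% |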

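The Standard model sees larger end-to-end speedups from quantization than Turbo (8 steps). It runs more sampling steps, so fixed per-run costs (such as text encoding and VAE decoding) make up a smaller share of the total time and the faster per-step compute shows up more fully. Note that the times above correspond to roughly 20 steps at the listed it/s (for example, 20.08 s × 1.08 it/s ≈ 22 iterations), not the 50 steps recommended below for Standard.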
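One way to see why longer runs benefit more is a rough model in which total time is a fixed overhead plus steps divided by it/s. The sketch below estimates that overhead from the RTX 5090 Turbo rows (8 steps) and then varies the step count. It reuses the Turbo per-step speeds purely to illustrate the trend, so it is not a prediction of the Standard model's timings.

```python
TURBO_STEPS = 8  # Turbo step count used in this article


def fixed_overhead(total_seconds, its_per_second, steps=TURBO_STEPS):
    """Time not spent in sampling steps: total time minus steps / (it/s)."""
    return total_seconds - steps / its_per_second


def speedup(overhead_base, its_base, overhead_fast, its_fast, steps):
    """End-to-end speedup predicted by the overhead-plus-steps model."""
    base = overhead_base + steps / its_base
    fast = overhead_fast + steps / its_fast
    return base / fast


# RTX 5090 Turbo rows from the benchmark table: (it/s, total seconds).
bf16_overhead = fixed_overhead(4.87, 2.09)
nvfp4_overhead = fixed_overhead(2.72, 5.09)
print(f"BF16 overhead  ~{bf16_overhead:.2f} s, NVFP4 overhead ~{nvfp4_overhead:.2f} s")

for steps in (8, 20, 50):
    ratio = speedup(bf16_overhead, 2.09, nvfp4_overhead, 5.09, steps)
    print(f"{steps:2d} steps -> predicted speedup {ratio:.2f}x")
```

Under this model the overhead is about one second in both cases, and the predicted speedup climbs toward the pure per-step ratio (5.09 / 2.09) as the step count grows.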

VRAM Usage Comparison

| Quantization | ERNIE-Image-Turbo Model Size | Theoretical Min VRAM | Recommended VRAM |
|---|---|---|---|
| BF16 | ~16 GB | 16 GB | 24 GB |
| FP8 | ~8.22 GB | 10 GB | 16 GB |
| INT8 | ~8.22 GB | 10 GB | 16 GB |
| NVFP4 | ~4.78 GB | 6 GB | 8 GB |

NVFP4 shrinks the model to about 30% of its BF16 size (4.78 GB vs ~16 GB), which lets ERNIE-Image-Turbo run on 8GB consumer GPUs.
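The model sizes follow roughly from bits per weight. The sketch below is a back-of-the-envelope estimate that assumes exactly 8 billion parameters and, for NVFP4, 4-bit values plus one 8-bit scale per 16-value block (4.5 bits per weight). Both figures are assumptions for illustration, not measurements of the released files.

```python
PARAMS = 8e9  # "8B" parameters, as stated in the article

# Effective bits per weight. NVFP4 is assumed to store 4-bit values plus one
# 8-bit scale per 16-value block, i.e. 4 + 8/16 = 4.5 bits.
BITS_PER_WEIGHT = {"BF16": 16.0, "FP8": 8.0, "INT8": 8.0, "NVFP4": 4.5}


def weight_gb(fmt, params=PARAMS):
    """Approximate weight size in GB (10^9 bytes) if every weight used `fmt`."""
    return params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:5s} ~{weight_gb(fmt):.2f} GB")

print(f"NVFP4 file / BF16 size: {4.78 / 16:.0%}")
```

The listed file sizes (8.22 GB and 4.78 GB) sit slightly above these estimates, which is consistent with some layers being kept at higher precision.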


ComfyUI Deployment Tutorial

Prerequisites

  • Latest ComfyUI (with DiT quantization node support)
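  • CUDA 12.4+ (12.5+ recommended); Blackwell (RTX 50-series) GPUs need a CUDA 12.8+ build
  • Python 3.10+
  • PyTorch 2.5+ (for NVFP4 on Blackwell, use a CUDA 12.8 build, i.e. PyTorch 2.7+)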
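A quick environment check before downloading anything can be sketched as below. It assumes that native FP4 support corresponds to CUDA compute capability 10.x or higher (Blackwell) and follows this article's recommendation of INT8 for older cards. The helper names are hypothetical, and `report()` needs PyTorch installed.

```python
def has_native_fp4(capability):
    """True for Blackwell-class GPUs (compute capability 10.x and above, assumed)."""
    major, _minor = capability
    return major >= 10


def suggested_format(capability):
    """Follow this article's findings: NVFP4 on Blackwell, INT8 on older GPUs."""
    return "NVFP4" if has_native_fp4(capability) else "INT8"


def report():
    """Print the local PyTorch/CUDA setup; requires PyTorch to be installed."""
    import torch

    print("PyTorch:", torch.__version__, "| CUDA build:", torch.version.cuda)
    if not torch.cuda.is_available():
        print("No CUDA device visible")
        return
    capability = torch.cuda.get_device_capability(0)
    print(torch.cuda.get_device_name(0), capability, "->", suggested_format(capability))


# Known compute capabilities: RTX 3090 is 8.6, RTX 4090 is 8.9, RTX 5090 is 12.0.
for cap in [(8, 6), (8, 9), (12, 0)]:
    print(cap, suggested_format(cap))
```

Calling `report()` inside the ComfyUI Python environment prints the installed versions and the suggested format for the first visible GPU.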

Step 1: Download Quantized Models

# Clone from HuggingFace (the weights are stored with Git LFS)
git lfs install
git clone https://huggingface.co/Bedovyy/ERNIE-Image-Quantized

Or download specific formats:

  • ernie-image-turbo-nvfp4.safetensors (~4.78GB)
  • ernie-image-turbo-fp8.safetensors (~8.22GB)
  • ernie-image-turbo-int8.safetensors (~8.22GB)
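To fetch a single file instead of cloning the whole repository, the `huggingface_hub` package can be used. The sketch below assumes the file names exactly as listed in this article and assumes `ComfyUI/models/diffusion_models` as the target folder. Verify both against the repository page and your installation. The helper names are hypothetical.

```python
# File names as listed in this article; check the repository page before relying on them.
REPO_ID = "Bedovyy/ERNIE-Image-Quantized"
CHECKPOINTS = {
    "nvfp4": "ernie-image-turbo-nvfp4.safetensors",
    "fp8": "ernie-image-turbo-fp8.safetensors",
    "int8": "ernie-image-turbo-int8.safetensors",
}


def checkpoint_name(fmt):
    """Map a format label such as 'NVFP4' to the file name listed above."""
    key = fmt.lower()
    if key not in CHECKPOINTS:
        raise ValueError(f"unknown format {fmt!r}; expected one of {sorted(CHECKPOINTS)}")
    return CHECKPOINTS[key]


def download(fmt, target_dir="ComfyUI/models/diffusion_models"):
    """Fetch one checkpoint into the ComfyUI model folder (assumed default path)."""
    from huggingface_hub import hf_hub_download  # pip install huggingface_hub

    return hf_hub_download(repo_id=REPO_ID, filename=checkpoint_name(fmt), local_dir=target_dir)


print(checkpoint_name("NVFP4"))
# To download: download("nvfp4")
```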

Step 2: Install Required Nodes

cd ComfyUI/custom_nodes
git clone https://github.com/bedovyy/comfy-dit-quantizer

Step 3: ComfyUI Workflow Configuration

Core nodes for ERNIE-Image workflow:

  1. Load Diffusion Model: Load the quantized model file
  2. CLIP Text Encode: Process prompts (requires T5-XXL text encoder)
  3. KSampler: Diffusion sampling node
  4. VAE Decode: Decode the sampled latent into an image

Key Parameters:

  • Steps: 8 for Turbo, 50 for Standard
  • CFG Scale: 5.0-7.0 (recommended: 6.0)
  • Denoise: 1.0 (text-to-image)
  • Sampler: euler

Step 4: Quantization Config JSON

For custom quantization, use this configuration:

{
  "block_names": ["layers"],
  "rules": [
    { "policy": "keep", "match": ["adaLN", "self_attention.norm"] },
    { "policy": "float8_e4m3fn", "match": ["mlp", "self_attention.to"] }
  ]
}

Image Quality Comparison

Subjective Evaluation

| Quantization | Text Rendering | Detail Retention | Color Accuracy | Overall |
|---|---|---|---|---|
| BF16 | ★★★★★ | ★★★★★ | ★★★★★ | 10/10 |
| FP8 | ★★★★★ | ★★★★☆ | ★★★★★ | 9.5/10 |
| INT8 | ★★★★☆ | ★★★★☆ | ★★★★☆ | 8.5/10 |
| NVFP4 | ★★★★☆ | ★★★★☆ | ★★★★☆ | 8.5/10 |

Practical Conclusions:

  • FP8 is virtually lossless: Visually indistinguishable from BF16 output
  • INT8 and NVFP4 show minor degradation: Noticeable only in very fine text rendering and extreme color scenarios
  • Turbo mode is more sensitive to quantization: 8-step fast generation + low-precision quantization can compound quality loss

Recommended Strategies

| Use Case | Recommended Quantization | Reason |
|---|---|---|
| Production/Commercial | FP8 | Near-BF16 quality, 47% speed boost on RTX 5090 |
| Daily Creation/Preview | NVFP4 (RTX 5090) or INT8 (others) | Speed priority, acceptable quality |
| Low VRAM GPUs (10-12GB) | INT8 | Fits the 10 GB minimum, highest speedup on pre-Blackwell cards |
| Maximum Speed Testing | NVFP4 (Blackwell GPU) | Fastest generation speed |

NVFP4 vs GGUF: Two Quantization Approaches Compared

| Feature | NVFP4/FP8/INT8 | GGUF |
|---|---|---|
| Tool | comfy-dit-quantizer | GGUF.org / llama.cpp |
| Granularity | Selective layer quantization | Uniform model-wide (Q2-Q8) |
| Hardware | NVFP4 requires Blackwell | Universal CPU/GPU |
| Min VRAM | 4.78GB (NVFP4) | ~6GB (Q4) |
| Inference Engine | ComfyUI native | llama.cpp / ComfyUI GGUF nodes |
| Quality | Higher (preserves key layers) | Q8 near-original, low-bit noticeable loss |
| Best For | High-performance GPUs | Low VRAM/consumer hardware |

Summary: On an RTX 4090/5090-class GPU, the NVFP4/FP8/INT8 route is the better choice (with NVFP4 itself reserved for Blackwell cards such as the 5090). If VRAM is tight or hardware is older, GGUF is more flexible.


FAQ

Q: Does NVFP4 work on RTX 4090?

NVFP4 is designed for the Blackwell architecture. The RTX 4090 (Ada architecture) has no native FP4 Tensor Cores, so NVFP4 brings no speed advantage over FP8 there. RTX 4090 users should prefer INT8 quantization for the best speedup.

Q: Can quantized models be fine-tuned with LoRA?

Quantized models themselves are not suitable for further training. For LoRA fine-tuning, use the full BF16 or FP16 model, then quantize the base model after training is complete.

Q: Can different quantization formats be mixed?

Yes. You can keep FP8 and INT8 checkpoints side by side and switch between them for A/B testing. However, avoid loading models of different precisions in the same workflow, which can fragment GPU memory.


Summary

NVFP4/FP8/INT8 quantization dramatically lowers the deployment barrier for ERNIE-Image 8B:

  • VRAM requirement: Model size drops from ~16GB (BF16) to 8.22GB (FP8/INT8) or 4.78GB (NVFP4)
  • Generation speed: About 3x faster with Turbo and up to 3.2x with Standard (RTX 3060 INT8)
  • Quality loss: FP8 is virtually lossless; NVFP4/INT8 show minor acceptable degradation

For most users, FP8 quantization offers the best balance between quality and speed. If you have a Blackwell architecture GPU, NVFP4 provides the ultimate experience.


References: Bedovyy/ERNIE-Image-Quantized (HuggingFace), NVIDIA Model-Optimizer, PTQ4DiT (NeurIPS 2024)

ERNIE-Image Team
