ERNIE-Image on AMD GPUs: Complete ROCm Deployment Guide — Zero Code Changes Required

May 22, 2026

Summary: AMD has officially announced Day-0 support for ERNIE-Image on its GPUs. The model runs directly through Diffusers + ROCm with no code modifications. This guide covers everything from environment setup to memory optimization for deploying ERNIE-Image on the AMD Instinct MI355X and Radeon AI PRO R9700.


Why AMD GPU Deployment Matters

The AI image generation field has long been dominated by the NVIDIA CUDA ecosystem. ERNIE-Image, Baidu's open-source 8B-parameter DiT model, was initially only validated on NVIDIA GPUs. In April 2026, AMD's official blog announced Day-0 support — ERNIE-Image runs on AMD GPUs with zero code modifications.

This means:

  • Breaking the CUDA monopoly: AMD GPU users can run ERNIE-Image directly
  • Lowering hardware barriers: the AMD Radeon AI PRO R9700 costs about $1,000, far below the NVIDIA RTX 6000 Ada at $6,800
  • Datacenter-grade option: the Instinct MI355X offers 288GB of HBM3e for large-scale deployment

Hardware Platforms

Validated AMD GPUs

| GPU Model | Architecture | VRAM | Target Use Case |
|---|---|---|---|
| AMD Instinct MI355X | CDNA 4 (gfx950) | 288GB HBM3e | Datacenter AI training/inference |
| AMD Radeon AI PRO R9700 | RDNA 4 (gfx1201) | 32GB GDDR6 | Professional workstation AI inference |

R9700 specs highlights: 64 CUs, 4096 Stream Processors, 128 AI Accelerators, 64MB Infinity Cache, PCIe 5.0 x16, 300W TBP, Linux ECC memory support.

ERNIE-Image Model Size

| Component | Type | Size | Role |
|---|---|---|---|
| Transformer | ErnieImageTransformer2DModel | 15 GB | Diffusion backbone |
| Text Encoder | Mistral3Model | 7.2 GB | Text encoding |
| Prompt Enhancer | Ministral3ForCausalLM | 7.2 GB | Automatic prompt enrichment |
| VAE | AutoencoderKLFlux2 | 161 MB | Variational autoencoder |
| Total | | ~29.5 GB | |
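
As a rough cross-check of the Transformer row, the weight footprint can be estimated from the parameter count. The sketch below assumes that "8B" means exactly 8.0 billion parameters, each stored in BF16 (2 bytes). Both that exact count and the helper name `weight_footprint_gib` are illustrative assumptions, not figures taken from the model card.

```python
BYTES_PER_PARAM_BF16 = 2  # bfloat16 stores each weight in 16 bits


def weight_footprint_gib(num_params: float, bytes_per_param: int = BYTES_PER_PARAM_BF16) -> float:
    """Return the raw weight size in GiB (2**30 bytes), ignoring buffers and overhead."""
    return num_params * bytes_per_param / 1024**3


# Assumption: "8B" is taken as exactly 8.0e9 parameters.
transformer_gib = weight_footprint_gib(8.0e9)
print(f"Estimated BF16 transformer weights: {transformer_gib:.2f} GiB")
```

The estimate comes out at roughly 14.9 GiB, close to the 15 GB listed for the Transformer. The small gap is expected, since the real parameter count is unlikely to be a round number.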

Software Environment Setup

Docker Container Setup

# MI355X uses rocm7.2.1
docker run -d --name ernie-image-test \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --group-add render \
  --shm-size=64G \
  -v /path/to/model:/workspace \
  rocm/pytorch:rocm7.2.1_ubuntu24.04_py3.12_pytorch_release_2.9.1 \
  sleep infinity

# R9700 uses rocm7.2
docker run -d --name ernie-image-test \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --group-add render \
  --shm-size=64G \
  -v /path/to/model:/workspace \
  rocm/pytorch:rocm7.2 \
  sleep infinity

Verify GPU Availability

import torch
print(f'PyTorch: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'Device: {torch.cuda.get_device_name(0)}')
print(f'VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.0f} GB')

Note: ROCm's HIP compatibility layer ensures torch.cuda.* APIs work natively on AMD hardware, with no code changes needed.
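
To confirm that the PyTorch build inside the container targets ROCm rather than CUDA, `torch.version.hip` can be inspected: it holds a version string on ROCm builds and is `None` on CUDA builds. The following is a minimal sketch. The helper `describe_backend` is a hypothetical name, and the import is guarded so the snippet also runs where PyTorch is absent.

```python
from typing import Optional


def describe_backend(hip_version: Optional[str], cuda_version: Optional[str]) -> str:
    """Name the GPU backend a PyTorch build was compiled against."""
    if hip_version:
        return f"ROCm (HIP {hip_version})"
    if cuda_version:
        return f"CUDA {cuda_version}"
    return "CPU-only build"


try:
    import torch

    # torch.version.hip is set on ROCm builds; torch.version.cuda on CUDA builds.
    print(describe_backend(getattr(torch.version, "hip", None),
                           getattr(torch.version, "cuda", None)))
except ImportError:
    print("PyTorch is not installed in this environment")
```

On a ROCm build, `torch.cuda.is_available()` still returns `True`, so this check is the simplest way to tell the two backends apart from Python.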

Install Dependencies

# Clone the Diffusers branch with ERNIE-Image support
git clone https://github.com/HsiaWinter/diffusers /workspace/diffusers-ernie
cd /workspace/diffusers-ernie && git checkout add-ernie-image

# Install dependencies
pip install -e . accelerate Pillow transformers

# Extract model (run this where ERNIE-Image.tar is stored; the scripts below expect /workspace/ERNIE-Image)
tar xf ERNIE-Image.tar

Software Stack Versions

| Component | Version |
|---|---|
| Docker Image | rocm/pytorch:rocm7.2.1_ubuntu24.04_py3.12_pytorch_release_2.9.1 |
| PyTorch | 2.9.1+rocm7.2.x |
| ROCm (HIP) | 7.2.53211 |
| Diffusers | 0.38.0.dev0 (add-ernie-image branch) |
| Transformers | 5.5.3 |
| Accelerate | 1.13.0 |
| Python | 3.12 |
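
To compare a local environment against the versions listed above, the installed package versions can be read with the standard library. This is a small sketch using `importlib.metadata`. The helper name `installed_versions` is hypothetical, and the package list simply mirrors the table.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(["torch", "diffusers", "transformers", "accelerate"]).items():
    print(f"{name:<14}{ver or 'not installed'}")
```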

Inference Scripts

MI355X (288GB, Full Precision)

from diffusers import ErnieImagePipeline
import torch

# Load model
pipe = ErnieImagePipeline.from_pretrained(
    "/workspace/ERNIE-Image",
    torch_dtype=torch.bfloat16
)
pipe = pipe.to("cuda")

# Set to eval mode
pipe.transformer.eval()
pipe.vae.eval()
pipe.text_encoder.eval()
pipe.pe.eval()

# Generate image
seed = 42
generator = torch.Generator(device="cuda").manual_seed(seed)
output = pipe(
    prompt="a beautiful sunset over the ocean, golden light, photorealistic",
    height=1024,
    width=1024,
    num_inference_steps=50,
    guidance_scale=5.0,
    generator=generator
)
output.images[0].save("ernie_amd_output.png")
print("Image generated successfully!")

R9700 (32GB, CPU Offload Optimized)

The R9700's 32GB of VRAM is tight for the ~29.5GB BF16 model: after all components are loaded, only ~1.17 GiB remains, which is not enough for intermediate tensors.

Solution: Use Diffusers' enable_model_cpu_offload() to time-share components across the GPU.

from diffusers import ErnieImagePipeline
import torch

# Load model
pipe = ErnieImagePipeline.from_pretrained(
    "/workspace/ERNIE-Image",
    torch_dtype=torch.bfloat16
)

# Enable CPU Offload (critical!)
pipe.enable_model_cpu_offload()

# Set to eval mode
pipe.transformer.eval()
pipe.vae.eval()
pipe.text_encoder.eval()
pipe.pe.eval()

# Generate image
seed = 42
generator = torch.Generator(device="cuda").manual_seed(seed)
output = pipe(
    prompt="a beautiful sunset over the ocean, golden light, photorealistic",
    height=1024,
    width=1024,
    num_inference_steps=50,
    guidance_scale=5.0,
    generator=generator
)
output.images[0].save("ernie_r9700_output.png")
print("Image generated successfully!")
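
To see how much VRAM a run actually uses, PyTorch's peak-memory counters can be read after generation. The snippet below is a sketch and not part of the validated script: `bytes_to_gib` and `peak_vram_gib` are hypothetical helper names. The counter covers only memory allocated through PyTorch's allocator, so tools such as `rocm-smi` may report a higher figure.

```python
from typing import Optional


def bytes_to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 1024**3


def peak_vram_gib() -> Optional[float]:
    """Peak PyTorch-allocated memory on the current GPU in GiB, or None without PyTorch/GPU."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return bytes_to_gib(torch.cuda.max_memory_allocated())


# Typical use around a generation call:
#   torch.cuda.reset_peak_memory_stats()
#   output = pipe(...)
#   print(f"Peak VRAM: {peak_vram_gib():.2f} GiB")
print(peak_vram_gib())
```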

R9700 VRAM Detailed Breakdown

| Component | VRAM Usage | Cumulative |
|---|---|---|
| Text Encoder | 7.18 GiB | 7.18 GiB |
| Prompt Enhancer | 6.38 GiB | 13.56 GiB |
| VAE | 0.17 GiB | 13.73 GiB |
| Transformer | 14.96 GiB | 28.69 GiB |
| Remaining | ~1.17 GiB | |

With CPU Offload, peak VRAM only needs to hold the largest component (Transformer ~15GB) plus intermediate tensors — well within 32GB.
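
The arithmetic behind this can be laid out explicitly. The sketch below reuses the per-component figures from the breakdown and makes one simplifying assumption: with model-level offload only one component sits on the GPU at a time, so the peak for weights equals the largest component. Activations and other intermediate tensors are not modeled.

```python
# Per-component VRAM from the breakdown, in GiB.
components_gib = {
    "text_encoder": 7.18,
    "prompt_enhancer": 6.38,
    "vae": 0.17,
    "transformer": 14.96,
}
remaining_gib = 1.17  # free VRAM reported after loading every component

all_resident = sum(components_gib.values())
usable_vram = all_resident + remaining_gib
# Assumption: offload keeps a single component on the GPU at any moment.
peak_with_offload = max(components_gib.values())
headroom = usable_vram - peak_with_offload

print(f"All components resident: {all_resident:.2f} GiB")
print(f"Usable VRAM:             {usable_vram:.2f} GiB")
print(f"Peak weights w/ offload: {peak_with_offload:.2f} GiB")
print(f"Headroom for tensors:    {headroom:.2f} GiB")
```

Under these assumptions the headroom grows from about 1.17 GiB to about 14.9 GiB, which is why offloading turns a failing configuration into a workable one.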


CUDA → ROCm Migration Guide

When migrating from NVIDIA CUDA to AMD ROCm, review the following NVIDIA-specific settings. Some should be removed, while others work unchanged:

| Item | NVIDIA (CUDA) | AMD (ROCm) | Action |
|---|---|---|---|
| CUBLAS_WORKSPACE_CONFIG | Required | Not applicable | Remove |
| torch.backends.cudnn.* | cuDNN config | Uses MIOpen | Remove |
| torch.use_deterministic_algorithms | Supported | Partial support | Remove if needed |
| torch.cuda.* API | Native | HIP compatibility | No changes |
| Attention backend | Flash Attention / cuDNN | AOTriton | Auto-selected |

AOTriton is AMD's Triton-based attention kernel optimized for ROCm. PyTorch automatically selects it as the Scaled Dot-Product Attention backend.
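
For scripts that must run on both vendors, an alternative to deleting the NVIDIA-only settings is to guard them behind a backend check. The sketch below shows one way to do this for `CUBLAS_WORKSPACE_CONFIG`. The helper name `apply_nvidia_only_settings` is hypothetical, and `:4096:8` is a value PyTorch documents for deterministic cuBLAS behavior, not one taken from this guide.

```python
import os
from typing import MutableMapping


def apply_nvidia_only_settings(is_rocm: bool,
                               env: MutableMapping[str, str] = os.environ) -> bool:
    """Set the cuBLAS workspace config on NVIDIA builds only; return True if applied."""
    if is_rocm:
        return False  # listed as "Not applicable" on ROCm, so leave the environment alone
    env.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    return True


try:
    import torch
    is_rocm = getattr(torch.version, "hip", None) is not None
except ImportError:
    is_rocm = False  # assumption for illustration when PyTorch is absent

# A scratch dict is passed here so the demo does not modify the real environment.
print("NVIDIA-only settings applied:", apply_nvidia_only_settings(is_rocm, env={}))
```

In a real script the default `env=os.environ` would be used, and the call should happen before any CUDA work starts so the setting takes effect.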


Performance Comparison & Notes

MI355X vs R9700

| Metric | MI355X | R9700 |
|---|---|---|
| Architecture | CDNA 4 | RDNA 4 |
| VRAM | 288GB HBM3e | 32GB GDDR6 |
| Precision | BF16 full precision | BF16 + CPU Offload |
| Inference speed | Fastest | Moderate (CPU Offload overhead) |
| Use case | Large-scale inference/training | Personal workstation/prototyping |

Important Notes

  1. Diffusers branch: ERNIE-Image support requires the add-ernie-image branch, because it has not yet been merged into official Diffusers
  2. ROCm version: ROCm 7.2 is the minimum requirement; the latest patch version is recommended
  3. Memory swapping: CPU Offload on the R9700 adds latency; inference takes ~30-50% longer than with all components resident on the GPU
  4. Linux drivers: Ensure the latest AMD GPU drivers are installed (amdgpu kernel module)
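
To check the offload overhead on a specific machine rather than rely on the ~30-50% range, two runs can be timed and compared. The helpers below are a generic sketch with hypothetical names (`timed`, `overhead_percent`), and the timings in the final line are made-up placeholders, not measurements.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def overhead_percent(baseline_s: float, offload_s: float) -> float:
    """Extra time of the offloaded run relative to the baseline, in percent."""
    return (offload_s - baseline_s) / baseline_s * 100.0


# Hypothetical timings for illustration only (not measured results).
print(f"{overhead_percent(10.0, 14.0):.0f}% slower")
```

In practice `timed` would wrap the `pipe(...)` call. Because the first run includes warm-up costs, it is safer to discard it and compare later runs.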

Summary

AMD's Day-0 support for ERNIE-Image is an important milestone, marking the loosening of CUDA's monopoly in AI image generation. For AMD GPU users, ERNIE-Image offers a zero-code-change ready-to-use experience.

  • Datacenter users: MI355X with 288GB HBM3e allows full-precision loading, ideal for large-scale deployment
  • Workstation users: R9700 runs on 32GB VRAM via CPU Offload, offering outstanding value
  • Developers: torch.cuda.* APIs work seamlessly through the HIP compatibility layer, with minimal migration cost

As the ROCm ecosystem continues to improve, AMD GPUs will play an increasingly important role in AI inference. As one of the first models with Day-0 support, ERNIE-Image gives AMD users a high-quality text-to-image option that is free for commercial use.


References

  1. AMD: Day-0 Support for Baidu ERNIE-Image on AMD GPUs
  2. AMD ROCm on Consumer GPUs 2026 Guide
  3. GitHub: baidu/ERNIE-Image
  4. HuggingFace: baidu/ERNIE-Image

ERNIE-Image Team