ERNIE-Image on AMD GPUs: Complete ROCm Deployment Guide — Zero Code Changes Required

Summary: AMD has officially announced Day-0 support for ERNIE-Image on its GPUs: the model runs directly through Diffusers + ROCm with no code modifications. This guide walks you from environment setup through memory optimization for deploying ERNIE-Image on the AMD Instinct MI355X and Radeon AI PRO R9700.

Why AMD GPU Deployment Matters

The AI image generation field has long been dominated by the NVIDIA CUDA ecosystem. ERNIE-Image, Baidu's open-source 8B-parameter DiT model, was initially only validated on NVIDIA GPUs. In April 2026, AMD's official blog announced Day-0 support — ERNIE-Image runs on AMD GPUs with zero code modifications.

This means:

Breaking the CUDA monopoly: AMD GPU users can run ERNIE-Image directly
Lowering hardware barriers: the AMD Radeon AI PRO R9700 costs about $1,000, far below the NVIDIA RTX 6000 Ada at $6,800
Datacenter-grade option: Instinct MI355X offers 288GB HBM3e for large-scale deployment

Hardware Platforms

Validated AMD GPUs

GPU Model	Architecture	VRAM	Target Use Case
AMD Instinct MI355X	CDNA 4 (gfx950)	288GB HBM3e	Datacenter AI training/inference
AMD Radeon AI PRO R9700	RDNA 4 (gfx1201)	32GB GDDR6	Professional workstation AI inference

R9700 spec highlights: 64 CUs, 4096 Stream Processors, 128 AI Accelerators, 64MB Infinity Cache, PCIe 5.0 x16, 300W TBP, ECC memory support on Linux.

ERNIE-Image Model Size

Component	Type	Size	Role
Transformer	ErnieImageTransformer2DModel	15 GB	Diffusion backbone
Text Encoder	Mistral3Model	7.2 GB	Text encoding
Prompt Enhancer	Ministral3ForCausalLM	7.2 GB	Automatic prompt enrichment
VAE	AutoencoderKLFlux2	161 MB	Variational autoencoder
Total	—	~29.5 GB	—

Software Environment Setup

Docker Container Setup

# MI355X uses rocm7.2.1
docker run -d --name ernie-image-test \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --group-add render \
  --shm-size=64G \
  -v /path/to/model:/workspace \
  rocm/pytorch:rocm7.2.1_ubuntu24.04_py3.12_pytorch_release_2.9.1 \
  sleep infinity

# R9700 uses rocm7.2
docker run -d --name ernie-image-test \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --group-add render \
  --shm-size=64G \
  -v /path/to/model:/workspace \
  rocm/pytorch:rocm7.2 \
  sleep infinity

Verify GPU Availability

import torch
print(f'PyTorch: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'Device: {torch.cuda.get_device_name(0)}')
print(f'VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.0f} GB')

Note: ROCm's HIP compatibility layer ensures torch.cuda.* APIs work natively on AMD hardware, with no code changes needed.
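Because `torch.cuda.is_available()` returns True on both NVIDIA and AMD builds, it can be useful to confirm which backend is actually in use. The following is a minimal sketch: `describe_backend` is a hypothetical helper name, and it relies on `torch.version.hip` holding a version string on ROCm builds and `None` otherwise.

```python
from typing import Optional


def describe_backend(hip_version: Optional[str], cuda_version: Optional[str]) -> str:
    """Classify a PyTorch build from its torch.version.hip / torch.version.cuda values."""
    if hip_version:
        return f"ROCm (HIP {hip_version})"
    if cuda_version:
        return f"CUDA {cuda_version}"
    return "CPU-only"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        # On ROCm builds torch.version.hip is set and torch.version.cuda is None.
        print(describe_backend(torch.version.hip, torch.version.cuda))
```

Inside the ROCm container this should report a HIP version rather than a CUDA one.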

Install Dependencies

# Clone the Diffusers branch with ERNIE-Image support
git clone https://github.com/HsiaWinter/diffusers /workspace/diffusers-ernie
cd /workspace/diffusers-ernie && git checkout add-ernie-image

# Install dependencies
pip install -e . accelerate Pillow transformers

# Extract model
tar xf ERNIE-Image.tar

Software Stack Versions

Component	Version
Docker Image	rocm/pytorch:rocm7.2.1_ubuntu24.04_py3.12_pytorch_release_2.9.1
PyTorch	2.9.1+rocm7.2.x
ROCm (HIP)	7.2.53211
Diffusers	0.38.0.dev0 (add-ernie-image branch)
Transformers	5.5.3
Accelerate	1.13.0
Python	3.12
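To compare an environment against the table above, the installed package versions can be listed from Python. This is a small sketch using only the standard library; `installed_versions` is an illustrative helper name, and the package list is an assumption based on the components named in the table.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    wanted = ["torch", "diffusers", "transformers", "accelerate"]
    for name, ver in installed_versions(wanted).items():
        print(f"{name}: {ver or 'not installed'}")
```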

Inference Scripts

MI355X (288GB, Full Precision)

from diffusers import ErnieImagePipeline
import torch

# Load model
pipe = ErnieImagePipeline.from_pretrained(
    "/workspace/ERNIE-Image",
    torch_dtype=torch.bfloat16
)
pipe = pipe.to("cuda")

# Set to eval mode
pipe.transformer.eval()
pipe.vae.eval()
pipe.text_encoder.eval()
pipe.pe.eval()

# Generate image
seed = 42
generator = torch.Generator(device="cuda").manual_seed(seed)

output = pipe(
    prompt="a beautiful sunset over the ocean, golden light, photorealistic",
    height=1024,
    width=1024,
    num_inference_steps=50,
    guidance_scale=5.0,
    generator=generator
)

output.images[0].save("ernie_amd_output.png")
print("Image generated successfully!")
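To put numbers on a run like the one above, the pipeline call can be wrapped in a small measurement helper. The sketch below is not part of the original scripts: `timed_call` is a hypothetical name, and it reports peak GPU memory only when PyTorch is installed and a GPU is visible.

```python
import time
from typing import Any, Callable, Dict


def timed_call(fn: Callable[[], Any]) -> Dict[str, Any]:
    """Run fn once; report wall-clock seconds and peak GPU memory if a GPU is visible."""
    try:
        import torch
        has_gpu = torch.cuda.is_available()
    except ImportError:
        torch = None
        has_gpu = False

    if has_gpu:
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()  # do not count earlier queued work

    start = time.perf_counter()
    result = fn()
    if has_gpu:
        torch.cuda.synchronize()  # wait for GPU work to finish before stopping the clock
    elapsed = time.perf_counter() - start

    peak_gib = torch.cuda.max_memory_allocated() / 1024**3 if has_gpu else None
    return {"result": result, "seconds": elapsed, "peak_gib": peak_gib}
```

For instance, passing `lambda: pipe(prompt="...", num_inference_steps=50)` as `fn` would time one generation and return the pipeline output under the `result` key.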

R9700 (32GB, CPU Offload Optimized)

The R9700's 32GB of VRAM is tight for the ~29.5GB BF16 model. After all components are loaded, only ~1.17 GiB remains, which is insufficient for intermediate tensors.

Solution: Use Diffusers' enable_model_cpu_offload(), which keeps components in CPU memory and moves each one to the GPU only while it is running.

from diffusers import ErnieImagePipeline
import torch

# Load model
pipe = ErnieImagePipeline.from_pretrained(
    "/workspace/ERNIE-Image",
    torch_dtype=torch.bfloat16
)

# Enable CPU Offload (critical!)
pipe.enable_model_cpu_offload()

# Set to eval mode
pipe.transformer.eval()
pipe.vae.eval()
pipe.text_encoder.eval()
pipe.pe.eval()

# Generate image
seed = 42
generator = torch.Generator(device="cuda").manual_seed(seed)

output = pipe(
    prompt="a beautiful sunset over the ocean, golden light, photorealistic",
    height=1024,
    width=1024,
    num_inference_steps=50,
    guidance_scale=5.0,
    generator=generator
)

output.images[0].save("ernie_r9700_output.png")
print("Image generated successfully!")
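The only difference between the two scripts is whether the pipeline is moved to the GPU as a whole or offloaded. A single script could pick between them with a rule like the one sketched here. `choose_placement` is a hypothetical helper, and the 4 GiB headroom for intermediate tensors is an assumption, not a measured value.

```python
def choose_placement(total_vram_gib: float, model_gib: float, headroom_gib: float = 4.0) -> str:
    """Return "resident" if the whole pipeline plus headroom fits in VRAM, else "cpu_offload"."""
    if model_gib + headroom_gib <= total_vram_gib:
        return "resident"
    return "cpu_offload"


if __name__ == "__main__":
    # 28.69 GiB is the all-components figure from this guide; 32.0 is the nominal R9700 capacity.
    print(choose_placement(total_vram_gib=32.0, model_gib=28.69))
```

A result of "resident" would correspond to calling `pipe.to("cuda")`, and "cpu_offload" to calling `pipe.enable_model_cpu_offload()`.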

R9700 VRAM Detailed Breakdown

Component	VRAM Usage	Cumulative
Text Encoder	7.18 GiB	7.18 GiB
Prompt Enhancer	6.38 GiB	13.56 GiB
VAE	0.17 GiB	13.73 GiB
Transformer	14.96 GiB	28.69 GiB
Remaining	—	~1.17 GiB

With CPU Offload, peak VRAM only needs to hold the largest component (the Transformer, ~15 GiB) plus intermediate tensors, which is well within 32GB.
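The arithmetic behind the breakdown can be checked in a few lines. This sketch uses only the per-component figures from the table. The function names are illustrative, and treating model-level offload as holding exactly one component on the GPU at a time is a simplification that ignores activations and allocator overhead.

```python
from typing import Dict

# Per-component VRAM usage in GiB, copied from the breakdown table.
COMPONENT_GIB: Dict[str, float] = {
    "text_encoder": 7.18,
    "prompt_enhancer": 6.38,
    "vae": 0.17,
    "transformer": 14.96,
}


def resident_total(components: Dict[str, float]) -> float:
    """VRAM needed when every component stays on the GPU at once."""
    return sum(components.values())


def offload_peak(components: Dict[str, float]) -> float:
    """Approximate weight footprint when only one component is on the GPU at a time."""
    return max(components.values())


if __name__ == "__main__":
    print(f"All resident: {resident_total(COMPONENT_GIB):.2f} GiB")
    print(f"Offload peak: {offload_peak(COMPONENT_GIB):.2f} GiB")
```

Adding the ~1.17 GiB remainder to the all-resident total gives about 29.86 GiB, which is the usable capacity the table implies for the R9700.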

CUDA → ROCm Migration Guide

When migrating from NVIDIA CUDA to AMD ROCm, remove these NVIDIA-specific configurations:

Item	NVIDIA (CUDA)	AMD (ROCm)	Action
`CUBLAS_WORKSPACE_CONFIG`	Required	Not applicable	Remove
`torch.backends.cudnn.*`	cuDNN config	Uses MIOpen	Remove
`torch.use_deterministic_algorithms`	Supported	Partial support	Remove if needed
`torch.cuda.*` API	Native	HIP compatibility	No changes
Attention backend	Flash Attention / cuDNN	AOTriton	Auto-selected

AOTriton is AMD's Triton-based attention kernel optimized for ROCm. PyTorch automatically selects it as the Scaled Dot-Product Attention backend.
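For launch scripts shared between NVIDIA and AMD machines, the NVIDIA-only settings can be dropped conditionally rather than deleted by hand. The following is a minimal sketch under stated assumptions: `portable_env` and `NVIDIA_ONLY_ENV` are hypothetical names, and the list contains only the one variable the table marks for removal.

```python
import os
from typing import Dict, Mapping

# Environment variables that only configure NVIDIA libraries (from the table above).
NVIDIA_ONLY_ENV = ("CUBLAS_WORKSPACE_CONFIG",)


def portable_env(env: Mapping[str, str], is_rocm: bool) -> Dict[str, str]:
    """Return a copy of env, dropping NVIDIA-only variables when running on ROCm."""
    if not is_rocm:
        return dict(env)
    return {key: value for key, value in env.items() if key not in NVIDIA_ONLY_ENV}


if __name__ == "__main__":
    cleaned = portable_env(os.environ, is_rocm=True)
    removed = sorted(set(os.environ) - set(cleaned))
    print("Removed:", removed or "nothing")
```

The `is_rocm` flag could be derived from `torch.version.hip`, which is `None` on non-ROCm builds.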

Performance Comparison & Notes

MI355X vs R9700

Metric	MI355X	R9700
Architecture	CDNA 4	RDNA 4
VRAM	288GB HBM3e	32GB GDDR6
Precision	BF16, all components resident in VRAM	BF16 + CPU Offload
Inference speed	Faster	Moderate (CPU Offload overhead)
Use case	Large-scale inference/training	Personal workstation/prototyping

Important Notes

Diffusers branch: Requires switching to the add-ernie-image branch, because ERNIE-Image support has not yet been merged into the official Diffusers repository
ROCm version: ROCm 7.2 is the minimum requirement; the latest patch version is recommended
Memory swapping: CPU Offload on the R9700 adds latency; inference takes ~30-50% longer than it would with all components resident in VRAM
Linux drivers: Ensure the latest AMD GPU drivers are installed (amdgpu kernel module)
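The minimum-version note can be turned into a startup check. This is a sketch, not an official API: `meets_minimum` and `parse_major_minor` are hypothetical helpers that compare the leading major.minor fields of a version string such as the 7.2.53211 HIP version listed earlier.

```python
import re
from typing import Tuple

MIN_ROCM: Tuple[int, int] = (7, 2)


def parse_major_minor(version_string: str) -> Tuple[int, int]:
    """Extract (major, minor) from the start of a dotted version string."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version_string: str, minimum: Tuple[int, int] = MIN_ROCM) -> bool:
    """True if the version is at least the required major.minor."""
    return parse_major_minor(version_string) >= minimum


if __name__ == "__main__":
    print(meets_minimum("7.2.53211"))
```

On a ROCm build of PyTorch, the string passed in would typically come from `torch.version.hip`.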

Summary

AMD's Day-0 support for ERNIE-Image is an important milestone that marks a loosening of CUDA's monopoly in AI image generation. For AMD GPU users, ERNIE-Image works out of the box with zero code changes.

Datacenter users: The MI355X's 288GB of HBM3e holds the entire BF16 pipeline in VRAM without offloading, ideal for large-scale deployment
Workstation users: R9700 runs on 32GB VRAM via CPU Offload, offering outstanding value
Developers: torch.cuda.* APIs work seamlessly through the HIP compatibility layer, with minimal migration cost

As the ROCm ecosystem continues to improve, AMD GPUs will play an increasingly important role in AI inference. ERNIE-Image, as one of the first Day-0 supported models, gives AMD users a high-quality text-to-image generation option that is free for commercial use.
