ERNIE-Image on AMD GPUs: Complete ROCm Deployment Guide — Zero Code Changes Required
Summary: AMD has officially announced Day-0 support for ERNIE-Image on its GPUs: no code modifications are needed, and the model runs directly through Diffusers + ROCm. This guide walks through deploying ERNIE-Image on the AMD Instinct MI355X and Radeon AI PRO R9700, from environment setup to memory optimization.
Why AMD GPU Deployment Matters
The AI image generation field has long been dominated by the NVIDIA CUDA ecosystem. ERNIE-Image, Baidu's open-source 8B-parameter DiT model, was initially only validated on NVIDIA GPUs. In April 2026, AMD's official blog announced Day-0 support — ERNIE-Image runs on AMD GPUs with zero code modifications.
This means:
- Breaking CUDA monopoly: AMD GPU users can directly use ERNIE-Image
- Lowering hardware barriers: AMD Radeon AI PRO R9700 at ~$1,000, far below NVIDIA RTX 6000 Ada at $6,800
- Datacenter-grade option: Instinct MI355X offers 288GB HBM3e for large-scale deployment
Hardware Platforms
Validated AMD GPUs
| GPU Model | Architecture | VRAM | Target Use Case |
|---|---|---|---|
| AMD Instinct MI355X | CDNA 4 (gfx950) | 288GB HBM3e | Datacenter AI training/inference |
| AMD Radeon AI PRO R9700 | RDNA 4 (gfx1201) | 32GB GDDR6 | Professional workstation AI inference |
R9700 spec highlights: 64 CUs, 4096 Stream Processors, 128 AI Accelerators, 64MB Infinity Cache, PCIe 5.0 x16, 300W TBP, ECC memory support on Linux.
ERNIE-Image Model Size
| Component | Type | Size | Role |
|---|---|---|---|
| Transformer | ErnieImageTransformer2DModel | 15 GB | Diffusion backbone |
| Text Encoder | Mistral3Model | 7.2 GB | Text encoding |
| Prompt Enhancer | Ministral3ForCausalLM | 7.2 GB | Automatic prompt enrichment |
| VAE | AutoencoderKLFlux2 | 161 MB | Variational autoencoder |
| Total | — | ~29.5 GB | — |
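As a rough sanity check on the table above, the transformer's size can be estimated from its parameter count. This is a minimal sketch that assumes exactly 8 billion parameters stored in BF16 (2 bytes each); the real checkpoint differs slightly, and the helper name `weight_size_gib` is made up for illustration.

```python
# Rough BF16 weight-size estimate. The parameter count is an assumed round
# figure ("8B"), not a value read from the checkpoint.
BYTES_PER_BF16_PARAM = 2  # bfloat16 stores each weight in 16 bits


def weight_size_gib(num_params: float, bytes_per_param: int = BYTES_PER_BF16_PARAM) -> float:
    """Return the weight footprint in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


transformer_gib = weight_size_gib(8e9)  # assumes exactly 8 billion parameters
print(f"Estimated transformer weights: {transformer_gib:.1f} GiB")
```

The estimate comes out just under 15 GiB, in line with the 15 GB listed for the transformer.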
Software Environment Setup
Docker Container Setup
# MI355X uses rocm7.2.1
docker run -d --name ernie-image-test \
--device=/dev/kfd --device=/dev/dri \
--group-add video --group-add render \
--shm-size=64G \
-v /path/to/model:/workspace \
rocm/pytorch:rocm7.2.1_ubuntu24.04_py3.12_pytorch_release_2.9.1 \
sleep infinity
# R9700 uses rocm7.2
docker run -d --name ernie-image-test \
--device=/dev/kfd --device=/dev/dri \
--group-add video --group-add render \
--shm-size=64G \
-v /path/to/model:/workspace \
rocm/pytorch:rocm7.2 \
sleep infinity
Verify GPU Availability
import torch
print(f'PyTorch: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'Device: {torch.cuda.get_device_name(0)}')
print(f'VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.0f} GB')
Note: ROCm's HIP compatibility layer ensures torch.cuda.* APIs work natively on AMD hardware, with no code changes needed.
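Because torch.cuda.is_available() returns True on both vendors, it can help to confirm which backend the PyTorch build actually targets. The sketch below relies on the `hip` and `cuda` attributes of `torch.version`; the helper name `describe_backend` and the stand-in object are hypothetical, used so the example runs without a GPU stack installed.

```python
from types import SimpleNamespace


def describe_backend(version_info) -> str:
    """Classify a PyTorch build from an object shaped like torch.version."""
    hip = getattr(version_info, "hip", None)    # set on ROCm builds
    cuda = getattr(version_info, "cuda", None)  # set on CUDA builds
    if hip:
        return f"ROCm/HIP {hip}"
    if cuda:
        return f"CUDA {cuda}"
    return "CPU-only"


# Stand-in for torch.version (hypothetical values) so the sketch is runnable anywhere.
fake_rocm_build = SimpleNamespace(hip="7.2.53211", cuda=None)
print(describe_backend(fake_rocm_build))
```

Inside the container, the same helper would be called as describe_backend(torch.version) after importing torch.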
Install Dependencies
# Clone the Diffusers branch with ERNIE-Image support
git clone https://github.com/HsiaWinter/diffusers /workspace/diffusers-ernie
cd /workspace/diffusers-ernie && git checkout add-ernie-image
# Install dependencies
pip install -e . accelerate Pillow transformers
# Extract model
tar xf ERNIE-Image.tar
Software Stack Versions
| Component | Version |
|---|---|
| Docker Image | rocm/pytorch:rocm7.2.1_ubuntu24.04_py3.12_pytorch_release_2.9.1 |
| PyTorch | 2.9.1+rocm7.2.x |
| ROCm (HIP) | 7.2.53211 |
| Diffusers | 0.38.0.dev0 (add-ernie-image branch) |
| Transformers | 5.5.3 |
| Accelerate | 1.13.0 |
| Python | 3.12 |
Inference Scripts
MI355X (288GB, Full Precision)
from diffusers import ErnieImagePipeline
import torch

# Load model
pipe = ErnieImagePipeline.from_pretrained(
    "/workspace/ERNIE-Image",
    torch_dtype=torch.bfloat16
)
pipe = pipe.to("cuda")

# Set to eval mode
pipe.transformer.eval()
pipe.vae.eval()
pipe.text_encoder.eval()
pipe.pe.eval()

# Generate image
seed = 42
generator = torch.Generator(device="cuda").manual_seed(seed)
output = pipe(
    prompt="a beautiful sunset over the ocean, golden light, photorealistic",
    height=1024,
    width=1024,
    num_inference_steps=50,
    guidance_scale=5.0,
    generator=generator
)
output.images[0].save("ernie_amd_output.png")
print("Image generated successfully!")
R9700 (32GB, CPU Offload Optimized)
The R9700's 32GB VRAM is tight for the ~29.5GB BF16 model. After loading all components, only ~1.17GB remains — insufficient for intermediate tensors.
Solution: Use Diffusers' enable_model_cpu_offload(), which keeps only the component currently in use on the GPU and parks the others in CPU memory.
from diffusers import ErnieImagePipeline
import torch

# Load model
pipe = ErnieImagePipeline.from_pretrained(
    "/workspace/ERNIE-Image",
    torch_dtype=torch.bfloat16
)

# Enable CPU Offload (critical!)
pipe.enable_model_cpu_offload()

# Set to eval mode
pipe.transformer.eval()
pipe.vae.eval()
pipe.text_encoder.eval()
pipe.pe.eval()

# Generate image
seed = 42
generator = torch.Generator(device="cuda").manual_seed(seed)
output = pipe(
    prompt="a beautiful sunset over the ocean, golden light, photorealistic",
    height=1024,
    width=1024,
    num_inference_steps=50,
    guidance_scale=5.0,
    generator=generator
)
output.images[0].save("ernie_r9700_output.png")
print("Image generated successfully!")
R9700 VRAM Detailed Breakdown
| Component | VRAM Usage | Cumulative |
|---|---|---|
| Text Encoder | 7.18 GiB | 7.18 GiB |
| Prompt Enhancer | 6.38 GiB | 13.56 GiB |
| VAE | 0.17 GiB | 13.73 GiB |
| Transformer | 14.96 GiB | 28.69 GiB |
| Remaining | — | ~1.17 GiB |
With CPU Offload, peak VRAM only needs to hold the largest component (Transformer ~15GB) plus intermediate tensors — well within 32GB.
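The arithmetic behind that conclusion can be sketched in a few lines. The component sizes come from the breakdown table above; the 4 GiB headroom for intermediate tensors and the 268 GiB figure for the MI355X are assumptions chosen for illustration, and the function names are hypothetical.

```python
# Component weights in GiB, taken from the R9700 breakdown table.
COMPONENT_GIB = {
    "text_encoder": 7.18,
    "prompt_enhancer": 6.38,
    "vae": 0.17,
    "transformer": 14.96,
}


def peak_weights_gib(components: dict, offload: bool) -> float:
    """With model-level offload only the largest component is resident at once."""
    return max(components.values()) if offload else sum(components.values())


def choose_placement(usable_vram_gib: float, components: dict, headroom_gib: float) -> str:
    """Pick a loading strategy; headroom_gib is an assumed activation budget."""
    if peak_weights_gib(components, offload=False) + headroom_gib <= usable_vram_gib:
        return "resident"
    if peak_weights_gib(components, offload=True) + headroom_gib <= usable_vram_gib:
        return "cpu_offload"
    return "does_not_fit"


# 29.86 GiB = 28.69 GiB of weights + ~1.17 GiB remaining, per the table.
print(choose_placement(29.86, COMPONENT_GIB, headroom_gib=4.0))
# 268 GiB is an assumed usable figure for a 288GB MI355X.
print(choose_placement(268.0, COMPONENT_GIB, headroom_gib=4.0))
```

A "resident" result corresponds to pipe.to("cuda"), while "cpu_offload" corresponds to pipe.enable_model_cpu_offload().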
CUDA → ROCm Migration Guide
When migrating from NVIDIA CUDA to AMD ROCm, remove these NVIDIA-specific configurations:
| Item | NVIDIA (CUDA) | AMD (ROCm) | Action |
|---|---|---|---|
| CUBLAS_WORKSPACE_CONFIG | Required | Not applicable | Remove |
| torch.backends.cudnn.* | cuDNN config | Uses MIOpen | Remove |
| torch.use_deterministic_algorithms | Supported | Partial support | Remove if needed |
| torch.cuda.* API | Native | HIP compatibility | No changes |
| Attention backend | Flash Attention / cuDNN | AOTriton | Auto-selected |
AOTriton is AMD's Triton-based attention kernel optimized for ROCm. PyTorch automatically selects it as the Scaled Dot-Product Attention backend.
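For scripts that must run on both vendors, the first table row can be handled conditionally instead of being deleted outright. The following is a minimal sketch, assuming the caller already knows whether the build targets ROCm; `apply_determinism_env` is a hypothetical helper, and ":4096:8" is one of the workspace values PyTorch documents for deterministic cuBLAS.

```python
def apply_determinism_env(is_rocm: bool, env: dict) -> dict:
    """Return a copy of env with the cuBLAS-only setting removed or applied."""
    updated = dict(env)
    if is_rocm:
        # Per the table above, this variable does not apply on ROCm.
        updated.pop("CUBLAS_WORKSPACE_CONFIG", None)
    else:
        # Keep an existing value; otherwise fall back to a documented default.
        updated.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    return updated


nvidia_env = {"CUBLAS_WORKSPACE_CONFIG": ":4096:8", "HOME": "/root"}
print(apply_determinism_env(is_rocm=True, env=nvidia_env))
print(apply_determinism_env(is_rocm=False, env={"HOME": "/root"}))
```

In a real script the result would be written back into os.environ before PyTorch is imported, with is_rocm derived from whether torch.version.hip is set.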
Performance Comparison & Notes
MI355X vs R9700
| Metric | MI355X | R9700 |
|---|---|---|
| Architecture | CDNA 4 | RDNA 4 |
| VRAM | 288GB HBM3e | 32GB GDDR6 |
| Precision | BF16 full precision | BF16 + CPU Offload |
| Inference speed | Fastest | Moderate (CPU Offload overhead) |
| Use case | Large-scale inference/training | Personal workstation/prototyping |
Important Notes
- Diffusers branch: Requires switching to the add-ernie-image branch; official Diffusers hasn't merged it yet
- ROCm version: ROCm 7.2 is the minimum requirement; latest patch version recommended
- Memory swapping: CPU Offload on the R9700 adds latency because components are moved between CPU and GPU memory; inference takes ~30-50% longer than with all components resident on the GPU
- Linux drivers: Ensure the latest AMD GPU drivers are installed (amdgpu kernel module)
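The ROCm 7.2 minimum can be checked up front rather than discovered through a failed run. This sketch assumes the version arrives as a dotted string such as the 7.2.53211 listed in the software stack table, possibly with a non-numeric suffix; `parse_version` and `meets_minimum` are hypothetical helpers.

```python
def parse_version(text: str) -> tuple:
    """Turn '7.2.53211' into (7, 2, 53211), stopping at any non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(version: str, minimum: tuple = (7, 2)) -> bool:
    """Compare only as many leading fields as the minimum specifies."""
    return parse_version(version)[:len(minimum)] >= minimum


for candidate in ("7.2.53211", "6.4.1"):
    print(candidate, meets_minimum(candidate))
```

On a ROCm build the input string would typically come from torch.version.hip.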
Summary
AMD's Day-0 support for ERNIE-Image is an important milestone, marking the loosening of CUDA's monopoly in AI image generation. For AMD GPU users, ERNIE-Image offers a zero-code-change ready-to-use experience.
- Datacenter users: MI355X with 288GB HBM3e holds the entire BF16 model on the GPU without offloading, ideal for large-scale deployment
- Workstation users: R9700 runs on 32GB VRAM via CPU Offload, offering outstanding value
- Developers: torch.cuda.* APIs work seamlessly through the HIP compatibility layer, with minimal migration cost
As the ROCm ecosystem continues to improve, AMD GPUs will play an increasingly important role in AI inference. ERNIE-Image, as one of the first Day-0 supported models, gives AMD users a high-quality text-to-image option that is free for commercial use.