Describe the issue
Copying a single float32 to the host and printing it takes 1340 us in IPEX. This is fine.
However, cloning a single float32 on the device and then printing it takes 0.142 s on an Intel Arc A770 16 GB. Why does this take so long? The effective GPU-to-GPU transfer rate works out to 224.56 bit/s for one float32 (32 bits / 0.1425 s).
For reference, an RTX 3090 takes 0.000359 s for the same single-float32 transfer.
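The rate quoted above is simply the size of one float32 divided by the elapsed time. A minimal sketch of that arithmetic, using the two timings reported further down in this issue (the helper name `effective_rate_bits_per_s` is made up for illustration):

```python
BITS_PER_FLOAT32 = 32

def effective_rate_bits_per_s(elapsed_s, n_values=1):
    """Bits moved per second when n_values float32 numbers take elapsed_s seconds."""
    return n_values * BITS_PER_FLOAT32 / elapsed_s

# Elapsed times copied from the "Data Transfer Time" lines in this report
a770_s = 0.14255475997924804688
rtx3090_s = 0.00035905838012695312

a770_rate = effective_rate_bits_per_s(a770_s)
rtx3090_rate = effective_rate_bits_per_s(rtx3090_s)
slowdown = a770_s / rtx3090_s

print(f"A770:     {a770_rate:.2f} bit/s")
print(f"RTX 3090: {rtx3090_rate:.2f} bit/s")
print(f"A770 is about {slowdown:.0f}x slower for this one-element case")
```

With a single element the result is dominated by fixed per-call overhead, so this number says little about sustained memory bandwidth.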
import time
import torch
import intel_extension_for_pytorch as ipex  # imported so the 'xpu' device is available

# Test 1: copy one float32 from the device to the host and print it.
torch.manual_seed(0)
x = torch.rand(1, 1, dtype=torch.float32, device='xpu')
torch.xpu.synchronize()  # make sure tensor creation has finished before timing
start = time.time()
print(x.cpu())
end = time.time()
print("Print Time in Seconds: %.20f " % (end - start))

# Test 2: clone one float32 on the device (GPU to GPU), then copy it to the host and print it.
torch.manual_seed(2)
x = torch.rand(1, 1, dtype=torch.float32, device='xpu')
torch.xpu.synchronize()
start = time.time()
y = x.clone()  # first clone() issued on the device in this process
print(y.cpu())
end = time.time()
print("Data Transfer Time in Seconds: %.20f " % (end - start))
PyTorch takes 0.142 s to issue one command on the Intel Arc A770 16 GB:
tensor([[0.9179]])
Print Time in Seconds: 0.00134086608886718750
tensor([[0.9696]])
Data Transfer Time in Seconds: 0.14255475997924804688
PyTorch takes 0.000359 s to issue one command on the RTX 3090:
tensor([[0.3990]])
Print Time in Seconds: 0.00103116035461425781
tensor([[0.4254]])
Data Transfer Time in Seconds: 0.00035905838012695312
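Both measurements above time an operation the first time it runs in the process. One possible explanation, which is an assumption and not something confirmed in this report, is that the first `clone()` pays a one-time setup cost that later calls do not. A minimal sketch of a helper that reports the first call separately from the steady state; `time_first_and_steady` and `fake_op` are hypothetical names, and `fake_op` is only a stand-in workload so the snippet runs without a GPU:

```python
import time
from statistics import median

def time_first_and_steady(op, sync=lambda: None, iters=20):
    """Time `op` once cold, then return the median of `iters` further calls.

    `sync` should block until queued device work has finished;
    the default is a no-op for CPU-only callables.
    """
    def timed_call():
        sync()  # drain earlier work so it is not billed to op
        start = time.perf_counter()
        op()
        sync()  # wait for op itself to finish
        return time.perf_counter() - start

    first = timed_call()
    steady = median(timed_call() for _ in range(iters))
    return first, steady

# Stand-in workload (hypothetical): the first call pays a 50 ms setup cost.
_state = {"ready": False}

def fake_op():
    if not _state["ready"]:
        time.sleep(0.05)
        _state["ready"] = True

first, steady = time_first_and_steady(fake_op)
print(f"first call: {first:.6f} s, steady state: {steady:.6f} s")
```

In the script from this issue, the same helper could be called as `time_first_and_steady(lambda: x.clone().cpu(), sync=torch.xpu.synchronize)`. If the steady-state number is far below 0.142 s, the cost is a one-time startup effect; if it stays near 0.142 s, every call is slow.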
clinfo
Platform: Intel(R) OpenCL HD Graphics
Device: Intel(R) Arc(TM) A770 Graphics
Driver version : 23.05.25593.18 (Linux x64)
Compute units : 512
Clock frequency : 2400 MHz
Global memory bandwidth (GBPS)
float : 391.40
float2 : 403.59
float4 : 406.54
float8 : 418.51
float16 : 422.83
Single-precision compute (GFLOPS)
clCreateBuffer (-61)
Tests skipped
Half-precision compute (GFLOPS)
half : 19570.87
half2 : 19509.20
half4 : 19540.56
half8 : 19455.51
half16 : 19330.61
No double precision support! Skipped
Integer compute (GIOPS)
clCreateBuffer (-61)
Tests skipped
Transfer bandwidth (GBPS)
enqueueWriteBuffer : 17.11
enqueueReadBuffer : 8.06
enqueueMapBuffer(for read) : 19.89
memcpy from mapped ptr : 22.40
enqueueUnmap(after write) : 23.53
memcpy to mapped ptr : 22.48
Kernel launch latency : 6.07 us