It takes more than 100 ms to issue a single command to an Intel Arc GPU #386

### Describe the issue

Printing a single float32 tensor takes about 1340 µs with IPEX. This is fine.
However, cloning a single float32 tensor and printing the result takes 0.142 s on an Intel Arc A770 16 GB. Why does this take so long? For one float32 (32 bits), that works out to a GPU-to-GPU transfer rate of roughly 224 bit/s.

For reference, an RTX 3090 takes 0.000359 s to transfer a single float32 number.
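The rate quoted above is just the payload size divided by the measured wall-clock time. A small sketch of that arithmetic, using the two timings reported in this issue (the helper name `effective_bit_rate` is made up for illustration):

```python
def effective_bit_rate(num_bytes, seconds):
    """Effective rate in bit/s for moving num_bytes in the given wall-clock time."""
    return num_bytes * 8 / seconds


FLOAT32_BYTES = 4  # one float32 element

# Wall-clock times copied from the outputs reported in this issue.
arc_seconds = 0.14255475997924804688
rtx_seconds = 0.00035905838012695312

arc_rate = effective_bit_rate(FLOAT32_BYTES, arc_seconds)
rtx_rate = effective_bit_rate(FLOAT32_BYTES, rtx_seconds)
slowdown = arc_seconds / rtx_seconds

print("Arc A770: %.2f bit/s" % arc_rate)
print("RTX 3090: %.2f bit/s" % rtx_rate)
print("Arc A770 is %.0fx slower for this operation" % slowdown)
```

At this payload size the figure reflects per-command latency, not memory bandwidth.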



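```python
import time
import torch
import torchvision.models as models

import numpy as np
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device

torch.manual_seed(0)

# Test 1: copy one float32 from the GPU to the host and print it.
x = torch.rand(1, 1, dtype=torch.float32, device='xpu')

torch.xpu.synchronize()  # make sure tensor creation has finished before timing
start = time.time()
print(x.cpu())
end = time.time()

print("Print Time in Seconds: %.20f " % (end - start))

torch.manual_seed(2)

# Test 2: clone one float32 on the GPU, then copy the clone to the host and print it.
x = torch.rand(1, 1, dtype=torch.float32, device='xpu')
y = torch.rand(1, 1, dtype=torch.float32, device='xpu')

torch.xpu.synchronize()  # make sure tensor creation has finished before timing
start = time.time()
y = x.clone()  # device-to-device copy
print(y.cpu())  # the timed region also includes this device-to-host copy and the print
end = time.time()

print("Data Transfer Time in Seconds: %.20f " % (end - start))
```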

PyTorch takes 0.142 s to issue one command on the Intel Arc A770 16 GB:
```
tensor([[0.9179]])
Print Time in Seconds: 0.00134086608886718750 
tensor([[0.9696]])
Data Transfer Time in Seconds: 0.14255475997924804688
```

PyTorch takes 0.000359 s to issue one command on the RTX 3090:
```
tensor([[0.3990]])
Print Time in Seconds: 0.00103116035461425781
tensor([[0.4254]])
Data Transfer Time in Seconds: 0.00035905838012695312
```
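The numbers above come from timing a single, first call. One way to narrow down where the time goes is to time the first call separately from later calls, since a first call can include one-time setup that later calls skip. Below is a minimal sketch of such a harness using only the standard library. The helper names (`time_call`, `benchmark`, `make_fake_op`) are hypothetical, and the workload is a stand-in that sleeps on its first call, not a real GPU operation.

```python
import time
from statistics import mean


def time_call(fn, sync=None):
    """Return wall-clock seconds for one call of fn.

    sync, if given, is called before starting the clock and again after fn,
    so that asynchronous device work is included in the measurement.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    fn()
    if sync is not None:
        sync()
    return time.perf_counter() - start


def benchmark(fn, sync=None, warmup=3, iters=20):
    """Return (first_call_seconds, mean_steady_state_seconds)."""
    first = time_call(fn, sync)
    for _ in range(warmup):
        time_call(fn, sync)  # discard warm-up runs
    steady = mean(time_call(fn, sync) for _ in range(iters))
    return first, steady


def make_fake_op(startup_cost=0.05):
    """Stand-in workload (hypothetical): slow on the first call, fast afterwards."""
    state = {"initialized": False}

    def op():
        if not state["initialized"]:
            time.sleep(startup_cost)  # simulated one-time setup
            state["initialized"] = True

    return op


first, steady = benchmark(make_fake_op())
print("First call:   %.6f s" % first)
print("Steady state: %.6f s" % steady)
```

For the script in this issue, `fn` could be `lambda: x.clone()` and `sync` could be `torch.xpu.synchronize`. Whether one-time setup accounts for the 0.142 s is an assumption that this kind of measurement would confirm or rule out.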


Device benchmark output (the format appears to be that of `clpeak` rather than `clinfo`):
```
Platform: Intel(R) OpenCL HD Graphics
  Device: Intel(R) Arc(TM) A770 Graphics
    Driver version  : 23.05.25593.18 (Linux x64)
    Compute units   : 512
    Clock frequency : 2400 MHz

    Global memory bandwidth (GBPS)
      float   : 391.40
      float2  : 403.59
      float4  : 406.54
      float8  : 418.51
      float16 : 422.83

    Single-precision compute (GFLOPS)
clCreateBuffer (-61)
      Tests skipped

    Half-precision compute (GFLOPS)
      half   : 19570.87
      half2  : 19509.20
      half4  : 19540.56
      half8  : 19455.51
      half16 : 19330.61

    No double precision support! Skipped

    Integer compute (GIOPS)
clCreateBuffer (-61)
      Tests skipped

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer         : 17.11
      enqueueReadBuffer          : 8.06
      enqueueMapBuffer(for read) : 19.89
        memcpy from mapped ptr   : 22.40
      enqueueUnmap(after write)  : 23.53
        memcpy to mapped ptr     : 22.48

    Kernel launch latency : 6.07 us
```
