-
Notifications
You must be signed in to change notification settings - Fork 787
[SYCL] Optimize memory transfers #6213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
ping @steffenlarsen |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to echo Steffen's concern here. The original difference in behavior between creating host unified memory and other buffers (USE_HOST_PTR for the former, COPY_HOST_PTR when needed for the latter) was implemented because USE_HOST_PTR is essentially free in the first case and very expensive on discrete OpenCL/L0 devices. This was then modified in #3105 from creating a device buffer with COPY_HOST_PTR to creating a device buffer then writing to it, which we originally assumed to be functionally identical, but this provided FPGA backend with additional information that made it possible to get rid of a redundant memory copy on their side. It seems to me that this patch is effectively reverting both of those changes and we're back to square one with using USE_HOST_PTR for all cases. Could you elaborate on why the current sequence of PI calls is causing problems for the CUDA backend? |
#3105 does multiple things. Ones important for this discussion are:
|
You are changing whether the host pointer is actually passed to the buffer during its creation though. #3105 removed the logic of COPY_HOST_PTR vs USE_HOST_PTR during buffer creation: if InitFromUserData is set, USE_HOST_PTR is used, otherwise we create the buffer without the pointer to then perform a write operation if needed. After removing the host unified memory condition from InitFromUserData in this patch all cases will follow the USE_HOST_PTR workflow, so the original difference in behavior between devices with and without host unified memory is lost.
I'm curious as to where this overhead is coming from. Unless I'm missing something, creating a host allocation in the branch this patch deletes should do nothing in terms of memory copying (since we should be able to just reuse the pointer that the user passed to the buffer). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Marking this to request changes so this doesn't get accidentally merged while there's still an issue with reverting earlier performance oriented changes.
After some digging I figured out that indeed USE_HOST_PTR is used if the provided host pointer is non-const. However, in the case I am looking at the pointer is const, and the data is then copied. This would make fixing my original implementation, so that it also retains the optimization for FPGAs require a large refactor of runtime. I think an alternative and simpler solution can be to have read-only allocas so thay can use const host pointer without a copy. If the memory needs to be written a new allocation will be created. I am force-pushing this implementation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about the late review, just minor comments. Please update the branch to resolve merge conflicts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good, just a few mostly stylistic comments. Would it be possible to test this, for example with SYCL_PI_TRACE
?
I don't think so. The problem is that we would expect a different trace depending on whether the device supports host-unified memory. And I don't think we can make FIleCheck conditional on runtime variable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@t4c1, looks like basic_tests/min_max_test.cpp fails on Windows after this change, can you take a look? |
I have some trouble with my windows build at the moment, so I can not reproduce it, but I am pretty sure the failure is unrelated to this PR. I think it was caused by #6541 and the fix will be something like https://github.com/intel/llvm/pull/6172/files. |
Agreed. Failure does seem to come from #6541. I am looking into it. |
Optimize memory transfers by removing a redundant host to host data transfer when the data in a buffer is copied the first time from a user supplied const pointer on the host to a device that does not support host-unified memory.