-
Notifications
You must be signed in to change notification settings - Fork 72
Fix jl_object_id__cold
segfault with race condition defence
#635
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This suggests we do need a Julia side lock for the GIL. Maybe we should try that first? |
Cool, I'm running my workloads to see how it behaves 🙏 |
@mkitti I think that might be independent of this issue? Or maybe an example would help |
@mkitti AFAICT the |
You essentially have put a These are Julian data structures, so this is unlikely. What has been thread unsafe is Python and the GIL locking mechanisms. Rather than creating a bunch of locks, we need one central one around the GIL locking mechanism. |
Can you give an example? I am not sure I follow. There is no manual GIL manipulation in the segfault I am seeing. i.e., nowhere in the code path is there a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like many of these functions would mandate holding the Python lock (GIL). Holding (or having) a julia lock also seems a recipe for deadlock and concurrency problems long term, though might fix some of the crashes short term
Essentially, rather than adding a bunch of locks, let's just add one ReentrantLock in a central location so that only one Julia thread can communicate with Python at a time. |
Doesn't the Python GIL already result in this being true? |
43d9f4e
to
fd8002c
Compare
fd8002c
to
53a8e80
Compare
@dpinol did it work btw? |
The GIL guarantees that only one Python thread is executing at any point in time. It does not guarantee that only one Julia thread will try to interact with it at a time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I'm confused by this:
let's just add one ReentrantLock in a central location so that only one Julia thread can communicate with Python at a time
Because multiple Julia threads wouldn't be able to communicate to Python anyways, right. So we only need thread safety for Julia objects, which is what this PR fixes. Does that make sense or am I not understanding things.
cd9f5c9
to
d4a94f9
Compare
This PR looks unsound, since you need to decide whether you want holding this lock to mean it is forbidden to access any Python state or if holding the Python lock means that it is forbidden to access any state protected by this lock. You cannot mix and match and hold both locks, unless you guarantee you always acquire the Python one first, in which case the Julia one is redundant and unnecessary. This is the "philosophers dining problem" of deadlock |
Sorry, which lock do you mean? |
The segfault from #586 (comment) should be easy to verify if it was fixed. Currently this PR doesn't fix it. To try it, start Julia with a number of interactive threads != 1, and try to tab complete a Python object like julia --threads=2,2
julia> using PythonCall
julia> x = pyint(3)
Python: 3
julia> x.<tab>
Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x7ffadf016d2d -- PyType_Lookup at C:\ProgramData\DevDrives\.julia\dev\PythonCall\.CondaPkg\.pixi\envs\default\python312.dll (unknown line) |
After a couple of hours running with commit AbstractLock undefined on 1.10, no crash 🥳 . |
@visr this PR is meant to solve a different segfault, it's not related to that issue. I will edit the PR title to be more specific. |
@dpinol great to hear!! |
jl_object_id__cold
segfault with race condition defence
I understand, and don't mean to snipe you into solving that as well, but as (I believe) a counterexample to what you said earlier:
And to nudge to what could perhaps be a more general solution as Mark and Jameson seem to point to as well. |
If you're interested feel free to try. This PR is an imperfect but safe solution to a very specific segfault ( |
d4a94f9
to
551bf33
Compare
Any function that accesses a Python object or Python state is not allowed to use a Julia lock as it must be using the GIL exclusively. To use a Julia lockable there may make those functions unsound (UB). This deserves some documentation in your PR to indicate which locks are permitted to be held at the same time and which are forbidden (UB) to acquire on the same thread at the same time |
@vtjnash I don't think these concerns apply here. This PR addresses a very specific and immediate segfault on the Python=>Julia side. These particular instances of lockables do not create a Julia-Python lock-ordering issue. Currently, the data structures they guard get corrupted without the locking. This is a real, immediate, and severe problem, rather than something theoretical. To emphasise: Julia crashes immediately on Python 3.13, on both macOS and Windows. I fully support better long-term solutions (contributions welcome!), but I don't want to let perfect be the enemy of good here. I strongly prefer fixing the empirical segfault first, then tackling theoretical concerns afterward. |
I don't mean to imply these UB are theoretical concerns, since deadlocks are just as bad for the user and even worse for CI reliability. If you cannot even document the API requirements, then this PR seems like just added technical debt that someone in the future needs to undo. According to the existing documentation, before calling these functions it is mandatory for the caller to have already acquired the GIL ( PythonCall.jl/docs/src/juliacall.md Lines 134 to 139 in c84c059
|
In particular, the lock you added here to fix the crash appears to be the wrong lock to be using in the functions where it is used, since they seem to be direct calls into the python runtime (which require holding the GIL), so you are not actually fixing the cause of the segfault, but might be making it harder for someone to diagnose and fix correctly later (which appears to be in looking at the code to require holding the GIL, and which is what the package documentation says that all functions will do)
Since the documentation says every function should either already have or will acquire the GIL, it is strictly forbidden to introduce any new locks into this package (like this PR does), except for functions that do not interact with any python objects, since any lock acquired between the two recursive lock acquires would cause a violation of lock ordering. |
Can you illustrate your concerns with an example? |
I think worrying about API contracts when Julia won’t even start without a crash is like worrying about getting enough vegetables when your heart is stopped. Let’s do CPR first and worry about diet later. |
This is the deadlock scenario when you have multiple locks in play. julia> import Base: Lockable
julia> const PYJLVALUES = Lockable([]);
julia> const PYJLFREEVALUES = Lockable([]);
julia> function f()
lock(PYJLVALUES)
println("Got PYJLVALUES")
sleep(1)
lock(PYJLFREEVALUES)
println("f() got both locks")
unlock(PYJLFREEVALUES)
unlock(PYJLVALUES)
end
f (generic function with 1 method)
julia> function g()
lock(PYJLFREEVALUES)
println("Got PYJLFREEVALUES")
sleep(1)
lock(PYJLVALUES)
println("g() got both locks")
unlock(PYJLVALUES)
unlock(PYJLFREEVALUES)
end
g (generic function with 1 method)
julia> Threads.@spawn f(); Threads.@spawn g()
Got PYJLVALUESTask (runnable, started) @0x00000242bbaaf080
Got PYJLFREEVALUES
# Deadlock whereas if we had a single lock julia> import Base: Lockable
julia> const _jl_gil_lock = ReentrantLock();
julia> const PYJLVALUES = Lockable([], _jl_gil_lock);
julia> const PYJLFREEVALUES = Lockable([], _jl_gil_lock);
julia> function f()
lock(PYJLVALUES)
println("Got PYJLVALUES")
sleep(1)
lock(PYJLFREEVALUES)
println("f() got both locks")
unlock(PYJLFREEVALUES)
unlock(PYJLVALUES)
end
f (generic function with 1 method)
julia> function g()
lock(PYJLFREEVALUES)
println("Got PYJLFREEVALUES")
sleep(1)
lock(PYJLVALUES)
println("g() got both locks")
unlock(PYJLVALUES)
unlock(PYJLFREEVALUES)
end
g (generic function with 1 method)
julia> Threads.@spawn f(); Threads.@spawn g()
Task (runnable, started) @0x0000019a94e809c0
julia> Got PYJLVALUES
f() got both locks
Got PYJLFREEVALUES
g() got both locks
julia> |
Added it. Once one of the PRs merges, |
That would still introduce different unsoundness, since here you're causing deadlock / recursion mistakes, and with the rename, you'd be failing to call |
Reverted the global lock. The original code with separate locks doesn’t appear to have deadlock issues. @mkitti's example isn't possible as those globals are behind a shared reentrant lock. If there are other genuine scenarios I'm missing, a concrete example would greatly help. Right now, broad concerns about unsoundness aren't giving me enough to go on. |
@cjdoris possible good news for JuliaLang/julia#58171. See below.
Out of all of these changes, I think the crucial one is
PYJLVALUES
. But the others are good defensive patterns so I think we should do it anyways.I can now run
python -c 'from juliacall import Main as jl; [jl.Ref(jl.randn(1)) for _ in range(1000000)]'
without getting a segfault on Python 3.13.