Skip to content

Fix jl_object_id__cold segfault with race condition defence #635

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

MilesCranmer
Copy link
Contributor

@MilesCranmer MilesCranmer commented Jul 6, 2025

@cjdoris possible good news for JuliaLang/julia#58171. See below.

Out of all of these changes, I think the crucial one is PYJLVALUES. But the others are good defensive patterns so I think we should do it anyways.

I can now run

python -c 'from juliacall import Main as jl; [jl.Ref(jl.randn(1)) for _ in range(1000000)]'

without getting a segfault on Python 3.13.

@mkitti
Copy link
Member

mkitti commented Jul 6, 2025

This suggests we do need a Julia side lock for the GIL.

#627

Maybe we should try that first?

@dpinol
Copy link
Contributor

dpinol commented Jul 6, 2025

Cool, I'm running my workloads to see how it behaves 🙏

@MilesCranmer
Copy link
Contributor Author

@mkitti I think that might be independent of this issue? Or maybe an example would help

@MilesCranmer
Copy link
Contributor Author

@mkitti AFAICT the GIL.@lock/GIL.@unlock macros do not come into play in the segfault we were seeing so I am pretty sure these issues are orthogonal

@mkitti
Copy link
Member

mkitti commented Jul 6, 2025

You essentially have put a ReentrantLock on a bunch of data structures. Are the data structures themselves not thread safe?

These are Julian data structures, so this is unlikely.

What has been thread unsafe is Python and the GIL locking mechanisms.

Rather than creating a bunch of locks, we need one central one around the GIL locking mechanism.

@MilesCranmer
Copy link
Contributor Author

You essentially have put a ReentrantLock on a bunch of data structures. Are the data structures themselves not thread safe?

Array is not thread safe for writes

What has been thread unsafe is Python and the GIL locking mechanisms.

Rather than creating a bunch of locks, we need one central one around the GIL locking mechanism.

Can you give an example? I am not sure I follow. There is no manual GIL manipulation in the segfault I am seeing. i.e., nowhere in the code path is there a PyEval_SaveThread as far as I can tell.

Copy link

@vtjnash vtjnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like many of these functions would mandate holding the Python lock (GIL). Holding (or having) a julia lock also seems a recipe for deadlock and concurrency problems long term, though might fix some of the crashes short term

@mkitti
Copy link
Member

mkitti commented Jul 6, 2025

Essentially, rather than adding a bunch of locks, let's just add one ReentrantLock in a central location so that only one Julia thread can communicate with Python at a time.

@MilesCranmer
Copy link
Contributor Author

Doesn't the Python GIL already result in this being true?

@MilesCranmer
Copy link
Contributor Author

@dpinol did it work btw?

@mkitti
Copy link
Member

mkitti commented Jul 6, 2025

Doesn't the Python GIL already result in this being true?

The GIL guarantees that only one Python thread is executing at any point in time. It does not guarantee that only one Julia thread will try to interact with it at a time.

Copy link
Contributor Author

@MilesCranmer MilesCranmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I'm confused by this:

let's just add one ReentrantLock in a central location so that only one Julia thread can communicate with Python at a time

Because multiple Julia threads wouldn't be able to communicate to Python anyways, right. So we only need thread safety for Julia objects, which is what this PR fixes. Does that make sense or am I not understanding things.

@MilesCranmer MilesCranmer force-pushed the fix-segfault branch 2 times, most recently from cd9f5c9 to d4a94f9 Compare July 6, 2025 18:39
@vtjnash
Copy link

vtjnash commented Jul 6, 2025

This PR looks unsound, since you need to decide whether you want holding this lock to mean it is forbidden to access any Python state or if holding the Python lock means that it is forbidden to access any state protected by this lock. You cannot mix and match and hold both locks, unless you guarantee you always acquire the Python one first, in which case the Julia one is redundant and unnecessary. This is the "philosophers dining problem" of deadlock

@MilesCranmer
Copy link
Contributor Author

holding this lock

Sorry, which lock do you mean?

@visr
Copy link

visr commented Jul 6, 2025

The segfault from #586 (comment) should be easy to verify if it was fixed. Currently this PR doesn't fix it. To try it, start Julia with a number of interactive threads != 1, and try to tab complete a Python object like pyint:

julia --threads=2,2

julia> using PythonCall

julia> x = pyint(3)
Python: 3

julia> x.<tab>
Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x7ffadf016d2d -- PyType_Lookup at C:\ProgramData\DevDrives\.julia\dev\PythonCall\.CondaPkg\.pixi\envs\default\python312.dll (unknown line)

@dpinol
Copy link
Contributor

dpinol commented Jul 6, 2025

@dpinol did it work btw?

After a couple of hours running with commit AbstractLock undefined on 1.10, no crash 🥳 .
I'm trying now in parallel with latest commit fix: lock various methods

@MilesCranmer
Copy link
Contributor Author

MilesCranmer commented Jul 6, 2025

@visr this PR is meant to solve a different segfault, it's not related to that issue.

I will edit the PR title to be more specific.

@MilesCranmer
Copy link
Contributor Author

@dpinol great to hear!!

@MilesCranmer MilesCranmer changed the title Fix segfault with race condition defence Fix jl_object_id__cold segfault with race condition defence Jul 6, 2025
@visr
Copy link

visr commented Jul 6, 2025

@visr this PR is meant to solve a different segfault

I understand, and don't mean to snipe you into solving that as well, but as (I believe) a counterexample to what you said earlier:

Because multiple Julia threads wouldn't be able to communicate to Python anyways, right. So we only need thread safety for Julia objects, which is what this PR fixes.

And to nudge to what could perhaps be a more general solution as Mark and Jameson seem to point to as well.

@MilesCranmer
Copy link
Contributor Author

If you're interested feel free to try. This PR is an imperfect but safe solution to a very specific segfault (jl_object_id__cold) when running Julia from Python via juliacall. This segfault has been impacting PySR and other Python libraries for over a year, basically preventing us from even starting Julia on Python 3.13+. It is an extremely high priority fix. I'd be happy to help think about a more general approach down the road, but for now, others would need to push that direction, as I just don't have the time.

@vtjnash
Copy link

vtjnash commented Jul 6, 2025

Any function that accesses a Python object or Python state is not allowed to use a Julia lock as it must be using the GIL exclusively. To use a Julia lockable there may make those functions unsound (UB). This deserves some documentation in your PR to indicate which locks are permitted to be held at the same time and which are forbidden (UB) to acquire on the same thread at the same time

@MilesCranmer
Copy link
Contributor Author

@vtjnash I don't think these concerns apply here. This PR addresses a very specific and immediate segfault on the Python=>Julia side. These particular instances of lockables do not create a Julia-Python lock-ordering issue.

Currently, the data structures they guard get corrupted without the locking. This is a real, immediate, and severe problem, rather than something theoretical. To emphasise: Julia crashes immediately on Python 3.13, on both macOS and Windows.

I fully support better long-term solutions (contributions welcome!), but I don't want to let perfect be the enemy of good here. I strongly prefer fixing the empirical segfault first, then tackling theoretical concerns afterward.

@vtjnash
Copy link

vtjnash commented Jul 7, 2025

I don't mean to imply these UB are theoretical concerns, since deadlocks are just as bad for the user and even worse for CI reliability. If you cannot even document the API requirements, then this PR seems like just added technical debt that someone in the future needs to undo. According to the existing documentation, before calling these functions it is mandatory for the caller to have already acquired the GIL (

Most importantly, you can only call Python code while Python's
[Global Interpreter Lock (GIL)](https://docs.python.org/3/glossary.html#term-global-interpreter-lock)
is locked by the current thread. You can use JuliaCall from any Python thread, and the GIL
will be locked whenever any JuliaCall function is used. However, to leverage the benefits
of multi-threading, you can unlock the GIL while executing any Julia code that does not
interact with Python.
), which means the current documentation is stating this is an API misuse and that these structures are not supposed to have internal locks, but rather are supposed to be protected by the GIL

@vtjnash
Copy link

vtjnash commented Jul 7, 2025

In particular, the lock you added here to fix the crash appears to be the wrong lock to be using in the functions where it is used, since they seem to be direct calls into the python runtime (which require holding the GIL), so you are not actually fixing the cause of the segfault, but might be making it harder for someone to diagnose and fix correctly later (which appears to be in looking at the code to require holding the GIL, and which is what the package documentation says that all functions will do)

    ccall(UnsafePtr{C.PyTypeObject}(C.Py_Type(o)).free[!], Cvoid, (C.PyPtr,), o)

Since the documentation says every function should either already have or will acquire the GIL, it is strictly forbidden to introduce any new locks into this package (like this PR does), except for functions that do not interact with any python objects, since any lock acquired between the two recursive lock acquires would cause a violation of lock ordering.

@MilesCranmer
Copy link
Contributor Author

Can you illustrate your concerns with an example?

@MilesCranmer
Copy link
Contributor Author

I think worrying about API contracts when Julia won’t even start without a crash is like worrying about getting enough vegetables when your heart is stopped. Let’s do CPR first and worry about diet later.

@mkitti
Copy link
Member

mkitti commented Jul 7, 2025

This is the deadlock scenario when you have multiple locks in play.

julia> import Base: Lockable

julia> const PYJLVALUES = Lockable([]);

julia> const PYJLFREEVALUES = Lockable([]);

julia> function f()
           lock(PYJLVALUES)
           println("Got PYJLVALUES")
           sleep(1)
           lock(PYJLFREEVALUES)
           println("f() got both locks")
           unlock(PYJLFREEVALUES)
           unlock(PYJLVALUES)
       end
f (generic function with 1 method)

julia> function g()
           lock(PYJLFREEVALUES)
           println("Got PYJLFREEVALUES")
           sleep(1)
           lock(PYJLVALUES)
           println("g() got both locks")
           unlock(PYJLVALUES)
           unlock(PYJLFREEVALUES)
       end
g (generic function with 1 method)

julia> Threads.@spawn f(); Threads.@spawn g()
Got PYJLVALUESTask (runnable, started) @0x00000242bbaaf080


Got PYJLFREEVALUES

# Deadlock

whereas if we had a single lock

julia> import Base: Lockable

julia> const _jl_gil_lock = ReentrantLock();

julia> const PYJLVALUES = Lockable([], _jl_gil_lock);

julia> const PYJLFREEVALUES = Lockable([], _jl_gil_lock);

julia> function f()
           lock(PYJLVALUES)
           println("Got PYJLVALUES")
           sleep(1)
           lock(PYJLFREEVALUES)
           println("f() got both locks")
           unlock(PYJLFREEVALUES)
           unlock(PYJLVALUES)
       end
f (generic function with 1 method)

julia> function g()
           lock(PYJLFREEVALUES)
           println("Got PYJLFREEVALUES")
           sleep(1)
           lock(PYJLVALUES)
           println("g() got both locks")
           unlock(PYJLVALUES)
           unlock(PYJLFREEVALUES)
       end
g (generic function with 1 method)

julia> Threads.@spawn f(); Threads.@spawn g()
Task (runnable, started) @0x0000019a94e809c0

julia> Got PYJLVALUES
f() got both locks
Got PYJLFREEVALUES
g() got both locks

julia>

@MilesCranmer
Copy link
Contributor Author

Added it. Once one of the PRs merges, GLOBAL_LOCK can be renamed to _jl_gil_lock.

@vtjnash
Copy link

vtjnash commented Jul 7, 2025

That would still introduce different unsoundness, since here you're causing deadlock / recursion mistakes, and with the rename, you'd be failing to call disable_finalizers where required for avoiding memory corruption (ReentrantLock includes that guard).

@MilesCranmer
Copy link
Contributor Author

Reverted the global lock.

The original code with separate locks doesn’t appear to have deadlock issues. @mkitti's example isn't possible as those globals are behind a shared reentrant lock.

If there are other genuine scenarios I'm missing, a concrete example would greatly help. Right now, broad concerns about unsoundness aren't giving me enough to go on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants