Skip to content

Remove readDirStreamWithPtr; replace readdir_r() with readdir() #349

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

sjshuck
Copy link

@sjshuck sjshuck commented Jul 6, 2025

This replaces all use of readdir_r() with readdir(), and removes System.Posix.Directory.Internals.readDirStreamWithPtr as a consequence. The rationale is 4-fold: safety, portability, performance, and maintainability.

Safety

According to the readdir_r(3) manpage, it's unsafe because it's impossible to compute the maximum filename length, hence the size of the buffer to be allocated. Currently we call pathconf(".", _PC_NAME_MAX) when NAME_MAX isn't defined. However, that's the longest filename a process is allowed to create, not the longest than can already be there. The fpathconf(3) manpage explicitly states

Files with name lengths longer than the value returned for name equal to _PC_NAME_MAX may exist in the given directory.

Even if NAME_MAX is defined, the glibc manual states

Portability Note: On some systems, the GNU C Library defines NAME_MAX, but does not actually enforce this limit.

The readdir_r(3) manpage goes on to say that glibc may fail with ENAMETOOLONG, but implies there's no way to detect that until the final dirent is read. On other systems, d_name may be truncated, or worse, not null-terminated.

The original reason for readdir_r() was to provide a reentrant version of readdir(). However, "in modern implementations" readdir() is thread-safe, except for concurrent readers to the same DIR * stream. In that specific case it says the user should do their own synchronization. This PR adds a warning to that effect. It's expected (by whom? I don't know) that a future POSIX.1 version will make readdir() thread-safe (including to the same DIR *, I guess?).

Portability

readdir_r() is deprecated by glibc for the reasons listed above.

Performance

I compiled directory-ospath-streaming against unix before and after these changes and executed a simple benchmark counting files recursively under my home directory (about 500K). (Here is the change in the streaming library for the after benchmark.)

Haskell C Time
Before readDirStreamWithPtr readdir_r() 550ms
After readDirStreamWith readdir() 545ms

That 1% improvement was oddly consistent for me, even though tasty-bench claimed ±15ms of error. What's important here is that it isn't worse. This makes sense. You're not supposed to free the pointer returned by readdir() because it "may be statically allocated". My guess is a buffer is allocated by opendir() or the first call to readdir(), and is freed when the kernel cleans up the file descriptor. If by opendir(), then it's a sunk cost—either way, a user isn't going to do any better by attempting to manage their own struct dirent (notwithstanding the safety issues).

For readdir_r(), Haskell adds another small cost by allocating the Ptr (Ptr CDirent) out param.

Maintainability

When you glance at readDirStreamWith versus readDirStreamWithPtr and their types, you might think: "The Ptr one is more low-level for when I want to manage memory myself for extra performance/control; the other one probably does that for me each time, trading control for simplicity." Indeed, currently readDirStreamWith does do that, and is defined in terms of readDirStreamWithPtr.

That intuition is understandable. But the thing being allocated is simply the extra bookkeeping for reentrancy required by readdir_r(). Currently the note says,

The lifetime of the pointer wrapped in the DirEnt is limited to invocation of the callback and it will be freed automatically after.

This is not true. The pointer to that pointer is freed, but the Ptr CDirent itself is not. It's moot anyway since this PR reimplements readDirStreamWith in terms of readdir(). The note is replaced accordingly.

Also note that the above change to directory-ospath-streaming means the "cache" mechanism there can likely be removed. It's only there for the Ptr (Ptr CDirent) argument. I'm curious if this leads to more small performance gains.

What I'm getting at in terms of code maintainability is readdir() is

  • simpler to use
  • simpler to make a binding for
  • more conducive to simpler user code
  • less counter-intuitive

cc @sergv

@sergv
Copy link
Contributor

sergv commented Jul 6, 2025

Yeah, the readDirStreamWithPtr doesn't really do much in terms of reuse currently. The passed pointer to pointer, Ptr (Ptr CDirent), is only ever used when readdir_r is being used and doesn't save that much when that happens anyway.

Not sure what thread-safety guarantees directory streams in the unix package aim to provide though.

Perhaps it's easier to use readdir with explicit synchronization where readdir_r is currently being used and on known good systems that have thread-safe readdir just remove the synchronization around the calls.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants