
Commit 8404aeb

[Support] On Windows, ensure hardware_concurrency() extends to all CPU sockets and all NUMA groups
The goal of this patch is to maximize CPU utilization on multi-socket or high core count systems, so that parallel computations such as LLD/ThinLTO can use all hardware threads in the system. Before this patch, on Windows, a maximum of 64 hardware threads could be used, in some cases dispatched on only one CPU socket.

== Background ==

Windows doesn't have a flat cpu_set_t like Linux. Instead, it projects hardware CPUs (or NUMA nodes) to applications through a concept of "processor groups". A "processor" is the smallest unit of execution on a CPU: a hyper-thread if SMT is active, a core otherwise. There's a limit of 32 processors on older 32-bit versions of Windows, later raised to 64 processors on 64-bit versions. The limit comes from the affinity mask, which historically has the size of a pointer (sizeof(void*)). Consequently, the concept of "processor groups" was introduced to deal with systems that have more than 64 hyper-threads.

By default, the Windows OS assigns only one "processor group" to each starting application, in a round-robin manner. If the application wants to use more processors, it needs to opt in programmatically, by assigning its threads to other "processor groups". This also means that affinity cannot cross "processor group" boundaries; one can only specify a "preferred" group on start-up, but the application is free to allocate more groups if it wants to.

This creates a peculiar situation where newer CPUs such as the AMD EPYC 7702P (64 cores, 128 hyper-threads) are projected by the OS as two (2) "processor groups", so by default an application can only use half of the cores. The situation will only get worse in the years to come, as dies with more cores appear on the market.

== The problem ==

The heavyweight_hardware_concurrency() API was introduced so that only *one hardware thread per core* was used. Once that API returns, the original intent is lost; only the number of threads is retained. Consider a Windows system with 2 CPU sockets, 18 cores each, each core having 2 hyper-threads, for a total of 72 hyper-threads. Both heavyweight_hardware_concurrency() and hardware_concurrency() currently return 36, because on Windows they are simply wrappers over std::thread::hardware_concurrency() -- which can only return processors from the current "processor group".

== The changes in this patch ==

To solve this, we capture (and retain) the initial intent until the point of use, through a new ThreadPoolStrategy class. The computation of the number of threads to use is deferred as late as possible, to the moment the std::threads are created (ThreadPool in the case of ThinLTO).

With hardware_concurrency(), setting ThreadCount to 0 now means using all available hardware CPU (SMT) threads. Providing a ThreadCount above the maximum number of threads has no effect; the maximum is used instead. heavyweight_hardware_concurrency() is similar, except that only one thread per hardware *core* is used.

When LLVM_ENABLE_THREADS is OFF, the threading APIs always return 1, to ensure any caller loops are exercised at least once.

Differential Revision: https://reviews.llvm.org/D71775
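As a rough sketch of how the new interface is meant to be used (based only on the calls visible in the diffs below; the function name and the lambda body are illustrative):

    #include "llvm/Support/ThreadPool.h"
    #include "llvm/Support/Threading.h"

    void runParallelWork() {
      // ThreadCount = 0 (the default) means "use all hardware SMT threads",
      // spanning every processor group on Windows; a non-zero value caps
      // the pool at min(ThreadCount, maximum).
      llvm::ThreadPool Pool(llvm::hardware_concurrency(/*ThreadCount=*/0));

      // One thread per physical core instead of per hyper-thread.
      unsigned HeavyJobs =
          llvm::heavyweight_hardware_concurrency().compute_thread_count();
      (void)HeavyJobs;

      for (unsigned I = 0; I < Pool.getThreadCount(); ++I)
        Pool.async([I] { /* per-worker job */ });
      Pool.wait();
    }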
Parent: d9049e8

37 files changed: +406 −143 lines

clang-tools-extra/clang-doc/tool/ClangDocMain.cpp

Lines changed: 1 addition & 2 deletions
@@ -268,8 +268,7 @@ int main(int argc, const char **argv) {
   Error = false;
   llvm::sys::Mutex IndexMutex;
   // ExecutorConcurrency is a flag exposed by AllTUsExecution.h
-  llvm::ThreadPool Pool(ExecutorConcurrency == 0 ? llvm::hardware_concurrency()
-                                                 : ExecutorConcurrency);
+  llvm::ThreadPool Pool(llvm::hardware_concurrency(ExecutorConcurrency));
   for (auto &Group : USRToBitcode) {
     Pool.async([&]() {
       std::vector<std::unique_ptr<doc::Info>> Infos;

clang-tools-extra/clangd/TUScheduler.cpp

Lines changed: 1 addition & 7 deletions
@@ -842,13 +842,7 @@ std::string renderTUAction(const TUAction &Action) {
 } // namespace

 unsigned getDefaultAsyncThreadsCount() {
-  unsigned HardwareConcurrency = llvm::heavyweight_hardware_concurrency();
-  // heavyweight_hardware_concurrency may fall back to hardware_concurrency.
-  // C++ standard says that hardware_concurrency() may return 0; fallback to 1
-  // worker thread in that case.
-  if (HardwareConcurrency == 0)
-    return 1;
-  return HardwareConcurrency;
+  return llvm::heavyweight_hardware_concurrency().compute_thread_count();
 }

 FileStatus TUStatus::render(PathRef File) const {
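The manual zero-check removed above is now handled inside the strategy: per the commit message, the threading APIs always report at least 1 thread, even when LLVM_ENABLE_THREADS is OFF. A minimal sketch of that guarantee (the helper name and the assertion are illustrative, not from the patch):

    #include <cassert>
    #include "llvm/Support/Threading.h"

    unsigned defaultWorkerCount() {
      unsigned Workers =
          llvm::heavyweight_hardware_concurrency().compute_thread_count();
      assert(Workers >= 1 && "a strategy never yields zero threads");
      return Workers;
    }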

clang-tools-extra/clangd/index/Background.cpp

Lines changed: 3 additions & 2 deletions
@@ -148,9 +148,10 @@ BackgroundIndex::BackgroundIndex(
       CDB.watch([&](const std::vector<std::string> &ChangedFiles) {
         enqueue(ChangedFiles);
       })) {
-  assert(ThreadPoolSize > 0 && "Thread pool size can't be zero.");
+  assert(Rebuilder.TUsBeforeFirstBuild > 0 &&
+         "Thread pool size can't be zero.");
   assert(this->IndexStorageFactory && "Storage factory can not be null!");
-  for (unsigned I = 0; I < ThreadPoolSize; ++I) {
+  for (unsigned I = 0; I < Rebuilder.TUsBeforeFirstBuild; ++I) {
     ThreadPool.runAsync("background-worker-" + llvm::Twine(I + 1), [this] {
       WithContext Ctx(this->BackgroundContext.clone());
       Queue.work([&] { Rebuilder.idle(); });

clang-tools-extra/clangd/index/Background.h

Lines changed: 1 addition & 1 deletion
@@ -135,7 +135,7 @@ class BackgroundIndex : public SwapIndex {
                   Context BackgroundContext, const FileSystemProvider &,
                   const GlobalCompilationDatabase &CDB,
                   BackgroundIndexStorage::Factory IndexStorageFactory,
-                  size_t ThreadPoolSize = llvm::heavyweight_hardware_concurrency(),
+                  size_t ThreadPoolSize = 0, // 0 = use all hardware threads
                   std::function<void(BackgroundQueue::Stats)> OnProgress = nullptr);
   ~BackgroundIndex(); // Blocks while the current task finishes.

clang-tools-extra/clangd/index/BackgroundRebuild.h

Lines changed: 3 additions & 1 deletion
@@ -49,7 +49,9 @@ class BackgroundIndexRebuilder {
 public:
   BackgroundIndexRebuilder(SwapIndex *Target, FileSymbols *Source,
                            unsigned Threads)
-      : TUsBeforeFirstBuild(Threads), Target(Target), Source(Source) {}
+      : TUsBeforeFirstBuild(llvm::heavyweight_hardware_concurrency(Threads)
+                                .compute_thread_count()),
+        Target(Target), Source(Source) {}

   // Called to indicate a TU has been indexed.
   // May rebuild, if enough TUs have been indexed.

clang/lib/Tooling/AllTUsExecution.cpp

Lines changed: 1 addition & 2 deletions
@@ -114,8 +114,7 @@ llvm::Error AllTUsToolExecutor::execute(
   auto &Action = Actions.front();

   {
-    llvm::ThreadPool Pool(ThreadCount == 0 ? llvm::hardware_concurrency()
-                                           : ThreadCount);
+    llvm::ThreadPool Pool(llvm::hardware_concurrency(ThreadCount));
     for (std::string File : Files) {
       Pool.async(
           [&](std::string Path) {

clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp

Lines changed: 2 additions & 1 deletion
@@ -106,7 +106,8 @@ DependencyScanningFilesystemSharedCache::
   // sharding gives a performance edge by reducing the lock contention.
   // FIXME: A better heuristic might also consider the OS to account for
   // the different cost of lock contention on different OSes.
-  NumShards = std::max(2u, llvm::hardware_concurrency() / 4);
+  NumShards =
+      std::max(2u, llvm::hardware_concurrency().compute_thread_count() / 4);
   CacheShards = std::make_unique<CacheShard[]>(NumShards);
 }

clang/tools/clang-scan-deps/ClangScanDeps.cpp

Lines changed: 4 additions & 10 deletions
@@ -485,15 +485,9 @@ int main(int argc, const char **argv) {

   DependencyScanningService Service(ScanMode, Format, ReuseFileManager,
                                     SkipExcludedPPRanges);
-#if LLVM_ENABLE_THREADS
-  unsigned NumWorkers =
-      NumThreads == 0 ? llvm::hardware_concurrency() : NumThreads;
-#else
-  unsigned NumWorkers = 1;
-#endif
-  llvm::ThreadPool Pool(NumWorkers);
+  llvm::ThreadPool Pool(llvm::hardware_concurrency(NumThreads));
   std::vector<std::unique_ptr<DependencyScanningTool>> WorkerTools;
-  for (unsigned I = 0; I < NumWorkers; ++I)
+  for (unsigned I = 0; I < Pool.getThreadCount(); ++I)
     WorkerTools.push_back(std::make_unique<DependencyScanningTool>(Service));

   std::vector<SingleCommandCompilationDatabase> Inputs;
@@ -508,9 +502,9 @@ int main(int argc, const char **argv) {

   if (Verbose) {
     llvm::outs() << "Running clang-scan-deps on " << Inputs.size()
-                 << " files using " << NumWorkers << " workers\n";
+                 << " files using " << Pool.getThreadCount() << " workers\n";
   }
-  for (unsigned I = 0; I < NumWorkers; ++I) {
+  for (unsigned I = 0; I < Pool.getThreadCount(); ++I) {
     Pool.async([I, &Lock, &Index, &Inputs, &HadErrors, &FD, &WorkerTools,
                 &DependencyOS, &Errs]() {
       llvm::StringSet<> AlreadySeenModules;
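With the #if LLVM_ENABLE_THREADS branch gone, the caller sizes its per-worker state from the pool itself. A hedged sketch of the pattern (WorkerState and run are hypothetical stand-ins for DependencyScanningTool and main):

    #include <vector>
    #include "llvm/Support/ThreadPool.h"
    #include "llvm/Support/Threading.h"

    struct WorkerState {}; // hypothetical per-worker data

    void run(unsigned NumThreads) {
      // When threading is disabled, the strategy reports 1 thread, so
      // the loop below still executes at least once.
      llvm::ThreadPool Pool(llvm::hardware_concurrency(NumThreads));
      std::vector<WorkerState> States(Pool.getThreadCount());
      for (unsigned I = 0; I < Pool.getThreadCount(); ++I)
        Pool.async([&States, I] { /* use States[I] */ });
      Pool.wait();
    }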

lld/ELF/SyntheticSections.cpp

Lines changed: 4 additions & 4 deletions
@@ -2747,8 +2747,8 @@ createSymbols(ArrayRef<std::vector<GdbIndexSection::NameAttrEntry>> nameAttrs,
   size_t numShards = 32;
   size_t concurrency = 1;
   if (threadsEnabled)
-    concurrency =
-        std::min<size_t>(PowerOf2Floor(hardware_concurrency()), numShards);
+    concurrency = std::min<size_t>(
+        hardware_concurrency().compute_thread_count(), numShards);

   // A sharded map to uniquify symbols by name.
   std::vector<DenseMap<CachedHashStringRef, size_t>> map(numShards);
@@ -3191,8 +3191,8 @@ void MergeNoTailSection::finalizeContents() {
   // operations in the following tight loop.
   size_t concurrency = 1;
   if (threadsEnabled)
-    concurrency =
-        std::min<size_t>(PowerOf2Floor(hardware_concurrency()), numShards);
+    concurrency = std::min<size_t>(
+        hardware_concurrency().compute_thread_count(), numShards);

   // Add section pieces to the builders.
   parallelForEachN(0, concurrency, [&](size_t threadId) {

llvm/include/llvm/LTO/LTO.h

Lines changed: 2 additions & 1 deletion
@@ -227,7 +227,8 @@ using ThinBackend = std::function<std::unique_ptr<ThinBackendProc>(
     AddStreamFn AddStream, NativeObjectCache Cache)>;

 /// This ThinBackend runs the individual backend jobs in-process.
-ThinBackend createInProcessThinBackend(unsigned ParallelismLevel);
+/// The default value means to use one job per hardware core (not hyper-thread).
+ThinBackend createInProcessThinBackend(unsigned ParallelismLevel = 0);

 /// This ThinBackend writes individual module indexes to files, instead of
 /// running the individual backend jobs. This backend is for distributed builds
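A brief usage sketch for the new default (only the declaration above is from the patch; the surrounding lines are illustrative):

    #include "llvm/LTO/LTO.h"

    // ParallelismLevel = 0, the new default: one backend job per hardware
    // core (not per hyper-thread), across all processor groups.
    llvm::lto::ThinBackend Backend = llvm::lto::createInProcessThinBackend();

    // An explicit cap still works as before, e.g. at most 8 jobs:
    llvm::lto::ThinBackend Capped = llvm::lto::createInProcessThinBackend(8);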
