"Most of the popular distributions are aware of core resources and memory access, due to how Linux plays in multi-socket servers or even smartphones with big.little cores."
Probably intended big/little cores. Great article though, I personally didn't know much about scheduling before this and the article helped expand that.
Thanks Ian, enjoyed this analysis. I would love to see a server-centric analysis of scheduling on EPYC and Xeon in Windows Server. If that scheduling is similarly core 0 focused, it's no wonder that even Microsoft uses Linux to run its Azure servers. Might be high time for Redmond to improve its multithread management.
server centric analysis would be great but i think the simpler solution is whether that software was written correctly for multi-threaded use in the first place. The most determinate of application performance is application design based on assumed and actual underlying resources. The epyc chips use full 8 channel access to memory and I am certain that the more threads/cores that the chip has, the more performance will be restricted when one use less memory channels like threadripper. I really hope the Zen2 based threadripper uses 8 channel memory which helps to mitigate performance when accessing all threads as compared to 1950x/2990x threadrippers
Well, I've been running Windows Servers for a long, long time, starting with NT 3.51. I've never encountered performance issues like this but... 1. I've always used Intel processors in Windows Servers. Is it possible that Microsoft has just done a good job of optimizing its scheduler to work with Intel CPU's and not so much with AMD? Seems like a lot could be solved by AMD working with Microsoft to build an AMD specific scheduler? 2. The first time I had multi-core CPU's in my servers was in 2007. But at the time we got our first two socket, 4 cores per socket servers, we were also going virtual with VMWare. So since then, near 100% of my Windows servers have been VMWare VM's. My thought is that VMWare is handling the juggling of resources behind the scenes so the Windows VM doesn't have to deal with it. This scenario is increasingly becoming the norm. Weather you are on your own hardware or running a Windows VM in the cloud (doesn't matter which cloud), Windows doesn't really have to deal with the underlying hardware at least directly.
Network fabric is Linux though, and wouldn't be surprised if the resource code is Linux. This is above the underlying custom hyper-v code which I'm sure is Windows.
Great article and information. Personally, I run a 2990WX on Ubuntu 18.04 at work and have near-ideal scaling for calculations that aren't memory-bound. On the other hand, I noticed I have scheduling issues when I game at home on Windows 10 with my 1950X, particularly with CS:GO where I get microstuttering. I use Project Lasso to help mitigate the issue, but it's not a perfect fix. Maybe I'll play around with affinity masks in the shell and see what happens.
This is a tangent, but re: "A good scientist always publishes negative results. That’s what we are taught from a morality standpoint, although academia doesn’t always work that way. "
I _WISH_ but the reality is, not only do most scientists not publish negative results, they CAN'T. Most journals won't take studies that don't have a positive finding, UNLESS you're refuting an already-published finding (either through replication or by creating a separate study that should yield the same result).
I really want there to be a journal (call it "Null" or something?) that would exclusively publish negative results, just so they'd be out there. It would kill SO MANY BIRDS AT ONCE. It would help solve the replication crisis (since there's a good chance that at least some studies that had a positive finding were preceded by a similar study that didn't, but that no one heard about because it wasn't published), and it would help the careers of scientists who want to look at interesting things even if they don't work out (since your employment, especially earlier in your career, is often tied very closely to how much you publish), AND it would suggest avenues for future research ('hmm, THIS isn't significant, but it raises some interesting questions on whether THAT is....'), but, alas, I don't think that'll happen.
Agreed, there is probably some real value in that, although you have to consider how much time scientists will devote to preparing a paper with negative results and reviewing it. I for one published two papers in my research days (III-V semiconductor nanowires) about solving problems that were largely preventing me from conducting the primary research I was trying to conduct. This isn't exactly publishing failures, but it was certainly admitted we were having problems and sharing the solutions we found. It isn't high impact work (10 citation on one paper, 2 on another). Publishing just the failure though would be fairly difficult I feel in order to pass any sort of peer review, it is sort of like trying to prove a negative (why you couldn't achieve success). Now my perspective and experience in semiconductor fabrication would be very different than someone in biology or astronomy.
This article is an example of why i love this site :) I'm probably about to write some BS... but wouldn't be useful to disable some cores on the nodes 1 and 3? Having the memory-connected nodes dealing with some less data fetching for the others could lead to interesting results...
Solution - Get rid of all of those extra CPU cores by hooking yourself up with a Pentium 3/4 or an early model Athlon. Real men do their com-pute-u-late-in' on only one processor so there isn't even a need for a stupid scheduler. Sh!t gets one when it gets done and while you're waiting the 11 minutes it takes for that 2MB MP3 file to start playing, you can enjoy the finer things in life like a can of Miller Genuine Draft beer straight from a can in your deskside mini fridge.
Maybe should have controlled for total core-counts by comparing CPU0 (first enumerated) disabled with CPU31 (last enumerated) for consistent 62-thread vs 62-thread tests instead of 62-thread vs 64-thread...
AMD either needs to commit to working with MS to build a better scheduler for their chips, or not release something like this again that's intended for workstation use. Ironically this would be less of an issue for a server-oriented chip, as Linux seems to handle it a lot better, and the processor choice could be matched to the specific workload the server was designed to handle.
If it's intended for workstation use as you claim, why would anyone be moaning when it doesn't perform as well as it might with Winblows for consumer tasks due to scheduling issues? It isn't intended for gaming; if you want such designs to run ok for consumer level tasks, then moan at MS to make their OS better, not blame AMD.
I should add that the evidence from the Linux experience is clear: MS could make their OS run better with this kind of hw if they wanted to, but they haven't.
MS is far more focused on moving settings menus around, changing the location or color of something and forcing people to use the windows store than they are at fixing real architecture problems. This lies at the fundamental nature of Linux Kernel Development versus Windows Development. Linux tends to focus on doing it right, Microsoft not so much.
One thing I noticed in test the 2990WX myself is that when all cores are heavily loaded with the same test (a bulk compile test in my case), those cores with direct attached memory appear to be able to hog more of the memory bandwidth than those cores without direct attached memory and thus complete their task(s) a bit more quickly.
When I run the test on cores one at a time, it takes the same amount of time regardless of which core it is running on. It is only when under significant memory load that the cores distinguish themselves. That was, literally, the only issue I could find.
In otherwords, just running on core(s) without direct attached memory is not itself a problem. Memory latencies are very well absorbed by CPU caches and having a large number of cores does a good job filling the pipeline stalls occurring on each one. It should be noted that the behavior of the 2990WX is very similar to the behavior of any multi-socket system... such as a dual-socket Xeon system for example. All such systems have major step functions in terms of memory latency when the cpu on one die must access the memory attached to another.
This really is more of a Windows problem than a threadripper problem. Linux has no problem parsing the CPU and Memory topology passed to it by the BIOS. There is always a hierarchy of locality not only between CPU and memory, but also between CPUs. The linux scheduler is very careful to try to not move threads across sockets unless it absolutely has to, for example.
But this topology is actually very fine-grained even on less exotic single-socket CPUs. Hyperthreads belonging to the same core. Local core complex with the same L2 or L3 cache. Core complex on the socket. And so forth. For example, switching a process between sibling hyperthreads incurs essentially no cache management overhead. That is, cache mastership doesn't have to change. Switching a process between cores always incurs very serious cache management overhead as the mastership of the dirtied pages have to move between CPU caches. Moving a thread across to another CCX (in AMD's case), or equivalently across a socket on a multi-socket system, incurs a greater overhead. All of this topology information is provided by the BIOS and the CPUs themselves via MSRs, and Linux parses every last little bit of it in order to make the best decisions.
Windows really ought to properly parse the topology data provided to it by the BIOS and do the right thing, and it is very clear that it does not.
This is pure speculation on your part. Windows supports NUMA topoligies since WinServr2008R2 (win7), with the concept of a group. But how groups will be configured depends on BIOS settings. In case of threadripper, there should be 2 groups of 32 logical CPUs each, not one group with 64 logical cpus. The corresponding EPIC (with same number of cores) should be configured as 4 groups of 16 logical CPUs each. But then a lot of toy software not aware of it will not use 64 cpus at all.
Workstation CPUs should be run on workstation workloads, Autocad/Fusion/3DS etc. Here only compilation fits, and even there, just 1 project does not mean much as project structure and compiler options might affect compilation performance. For example, is /Gm enabled on all projects in their test? How about LTCG, which greately decreases level of parallelization even if /Gm is enamled.
No, it's a fair point - if you put the arrow it initially looks like "increase of 104%". It might be better to just leave those out and colour the text green or red appropriately.
Just a guess, but possibly interrupt handling is primarily handled on or optimized for CPU0 so some of your more I/O oriented tests might be seeing slowdowns there due the process not already being resident on that core when the OS needs to handle an I/O for it. In Linux you could probably measure that by watching /proc/interrupts during your tests. Not sure offhand what the equivalent would be in Windows.
I would like to comment on the second-to-last paragraph:
"In the end, with most of the results equal though, it is hard to come to a conclusion around schedulers. I like the way that Linux offers a hardware map of resources that software can use, however it also requires programmers to think in those heterogeneous terms – not all cores will always have the same level of access to all the resources. It’s a lot easier to code software if you assume everything is the same, or code the software for a single thread. But when the hardware comes as varied as Threadripper 2 does, those hardware maps need to be updated as workloads are placed on the processor. Unfortunately I’m neither a scheduler expert not a coding expert, but I do know it requires work in both segments."
I think that this is easy to misunderstand. The Linux scheduler handles mapping of program's threads and/or processes to multiple cores. The "software" that has to be aware of the hardware resources is only the kernel. When a software program is written that has multiple threads of execution (again, whether "threads" or "processes" in the Linux system sense), it needs to know nothing about the hardware resources. The kernel decides what cores to run the software on.
Here is an example: When compiling the Linux kernel, many different programs are run. Mostly the C compiler (normally GCC, although in some instances clang), and the linker. The C compiler and the Linker do not need to know anything about the hardware configuration. In fact, they are mostly single-threaded programs. However, when the build job is set to run multiple C compiler instances in parallel, the speed of compiling the Linux kernel's multiple thousand source files into a working final product scales quite nicely. You can see this on Phoronix's results of testing the TR2990WX.
Another example: When running 7zip, it uses multiple threads for its work. 7zip was not optimized for the TR2990WX at all, but scales much better in Linux than on Windows (again, see Phoronix' results). 7zip is simply a multithreaded application. It knows nothing about the resources on the TR2990WX. However, since the kernel and its scheduler know how to properly handle NUMA configurations (like the TR2990WX has), it is able to get a much more reasonable scaling out of the TR2990WX than Windows.
I hope this clarifies things for anyone reading this who finds this paragraph confusing or even misleading.
@alpha64: "The "software" that has to be aware of the hardware resources is only the kernel."
How can the kernel partition your threads according to what part of the memory they will be accessing? Only the application itself knows that e.g. thread A and C will share certain structures in memory, while thread B and D will do so to a lesser degree.
https://docs.microsoft.com/en-us/sql/database-engi... describes how MS solved this with SQL Server 2014. It takes an active role when it comes to thread scheduling. I guess it will set the affinity mask based on which NUMA node it wants to schedule a thread.
Naturally, not all software behaves like a DBMS. A video encoder perhaps won't share so much data between threads (maybe each thread handles a number of frames independently of other threads?) and the NUMA question becomes moot. (unless there are nodes that simply has no direct memory access and needs to go through other nodes -- I got the impression that Threadripper does this)
You are right that a DBMS is a pretty extreme example of an "application" level software program. In fact, they often use storage devices (SSDs/Hard Drives) bypassing the filesystem too. The fact that Microsoft is doing specific things to handle NUMA in this software is not surprising, but I was not talking about how Microsoft has solved this, but rather how Linux does (better than Windows).
My point was that most software does not set affinity, know, or care the architecture of a NUMA system. The two examples I gave do not, yet they show good performance scaling on TR2990WX. Certainly ugly hacks like CPU affinity can be used to try to fix poor performance due to ineffective scheduling, but the results of this workaround are shown in this article to not help most of the time. Knowing the NUMA tables and actually making intelligent choices based on that are still firmly in the realm of the kernel, not normal applications.
When speaking about memory locality, the operating system does set up the page table structures with the MMU for specific threads/processes. Thus the kernel does have knowledge of which cores are local to the memory for specific threads. Linux scheduling takes into account the NUMA structure and what CPUs are "close" to the memory used for the tasks it is scheduling. Thus, the answer to your first question in your first paragraph is that the kernel does handle this. The situation with sharing certain structures also can be tracked by the kernel.
As your last paragraph states, the TR2990WX does have half of its cores without direct memory access. AMD themselves acknowledged this at release of TR2, and mentioned that this was not a huge performance penalty for most cases. And, if one needs local memory for all NUMA nodes, EPYC does support local memory on all NUMA nodes.
In summary, my point is that in order to get "good" performance from the TR2990WX, Linux achieves this in the kernel without the applications having any special knowledge of the hardware. It seemed that Ian's original paragraph stated otherwise, which is what I was trying to address.
Any thread, at least in a Windows process, can access everything other threads in the process can. Everything is shared. (I've been doing C# for the past 10 years, so I have not kept up to date on the Win32 API and could be wrong)
UNIX apps used to fork() a lot. If that is still the case, then the scheduler will indeed face nicely partitioned unrelated processes. Forking would be a rather unusual thing to do in Windows as spawning new processes comes with a noticeable overhead there.
Perhaps we are just talking past each other, but in Linux each thread is most often its own process (depending on the libc version used / programming language). The kernel uses copy-on-write to make sure that creating a new thread/process is low-overhead, but still isolated. Regardless, the kernel keeps track of the memory maps for each process (and thread). Thus, the kernel can schedule the applications to run effectively on multiple cores.
Windows does have problems with efficiently forking, while Linux does it well. Regardless, my original point still stands - Linux is able to effectively schedule on NUMA systems without applications having to be programmed with any knowledge of it.
What API functions are involved to allow Linux to keep track of what part of the process' memory space belongs to a given thread? Doesn't most allocations happen through malloc() like they've always done? And how do two threads share memory structures in memory? (You can't assume that merely because thread A allocates a chunk of memory, that thread A will be the only thread accessing it -- one major benefit of using threads is sharing the same address space)
Once you have API functions that do those things... You have pushed responsibility over on the application developer to ensure that a given thread gets executed on an optimal NUMA node. Which really is not all that different from playing around with CPU affinity masks (yuck) in the first place.
So, when it comes to threads... I very much doubt it works the way you think it does. Processes -- sure. MMU all the way, etc, but not for threads. That would be downright bizarre.
I have actually written Linux multithreaded applications in multiple programming languages and worked on the Linux kernel. Linux has mechanisms to set up shared memory segments (mmap), but mostly a shared memory segment like you are describing is allocated before a process calls clone() (to fork() or create a new thread - they make this clone() system call with different parameters). Unless one of the resultant processes lets the operating system know that it will no longer be using the shared segment, both processes have the same memory segment mapped into their process space (kept track of by the kernel). This would be your hypothetical shared memory structure. However, this is simply not done in most programs, as handling shared memory structures by multiple threads of execution is very error-prone. It is far better to use some kind of inter-process communication rather than a shared memory segment where one's own program has to handle locking, contention, and insure consistency.
If you have a Linux system around, try checking into /proc/<pid>/maps to see what memory segments the kernel has mapped for any given process. You can also read up on NTPL (the Threading library that most recent Linux systems use). This is definitely all kept track of by the kernel, and useful for scheduling on NUMA systems.
I think where the confusion in our discussion lies is perhaps in how you think programs are typically designed. I think that shared memory regions are a bad idea due to the complexity of insuring that they will be properly handled. Your assumption seems to be that they are used widely. Perhaps this is due to the different platforms which we use.
Well... Yes. But that begs the question: Why stress with NUMA then? If each thread mostly stirs around with its own separate structures, then all the scheduler needs to worry about is keeping the threads from bouncing too much between various CPU cores. As long as the threads are rescheduled on the same core (or at least the same NUMA node), all is fine.
SQL Server seems to resort to several threads to service one query, and then, afaict from the query plan, ties it all together again in what I assume to be some sort of main thread (of that query/SPID). That suggests to me that it is beneficial to control those threads and keep them somewhere close so you can pull in whatever data they've been working on at the end (without having to access a different memory node).
My hunch is that server apps like that (call it 'extreme' if you must) is where we'll see the benefit of NUMA, whereas your average renderer, game or what not, is not where NUMA shines. Whatever advantage the Linux scheduler has over Windows in this instance, I don't see NUMA support being a major part of that equation (unless there is something seriously wonky with how Windows reschedules threads).
(I'm mostly doing micro services and rely on SQL Server to do the heavy lifting for me, so this topic isn't my forte, at least not as a developer)
Umm I just applied that command for 2990wx and apart from having less performance overall, my screen flashes black every 30 seconds. How do I reverse engineer the start /affinity FFFF FFFF FFFF FFFC “” “D:\2019\Script.exe ?
I tried stop affinity etc but nothing happened. How do I revert back asap please?
"I think that this is easy to misunderstand. The Linux scheduler handles mapping of program's threads and/or processes to multiple cores. The "software" that has to be aware of the hardware resources is only the kernel. When a software program is written that has multiple threads of execution (again, whether "threads" or "processes" in the Linux system sense), it needs to know nothing about the hardware resources"
If you want scalability beyond 16 logical CPUs, this is simply false. Software needs to be NUMA-optimized, in about 20 different ways.
Can we have an article on the scaling with thread count and how it behaves when pinning a process to a CCX/die? Video encoding is trivial to split (I do my encoding on a cluster of desktop PCs) but AnandTech only benchmarks running a single instance of Handbrake, which nobody outside of HANDJOB even uses. This currently makes the i9 9900k and i7 9700k look to be the champions of price/performance with the 32 core EPYC and Threadripper performing on the level of an Intel quad core. I'm curious as to what an x264 --preset veryslow encode looks like if you split it up into quarters or eighths and pin them to their own die/CCX.
Windows 10's process scheduler is at least partially broken since the very first "Creators" version came out and has not been fixed to date (1809). So take any measurements in this regard with a big grain of salt.
I'd be curious to see the performance difference between running a program that only used ~8 threads and was allowed to use any core, restricted to node 0, and forced to one of the nodes w/o direct memory access.
"performance in most applications did not scale from the 16-core, and in some cases regressed"
This is the fault of both outdated Von Neumann-based architecture and software companies. You have to write in a VERY special way to make sure your app scales. Ask me how I know.
99% of modern software engineers simply don't know how to do it, and most popular programming languages (those without OO aggregation by value, usually GC-dependent, all these JS/Java/Python/C# etc) simply do not support OO and high scalability at the same time, exceptions being obviously C++ and supposedly Rust (although the jury is still out given than no HPC applications are written in Rust yet).
But even in C++ scaling beyond 16 on Von Neumann is VERY specific. For example, no heap usage in the threads. Obviously no synchronization (it kills even beyond 4 threads). Not only no data sharing, but no even cache page sharing if even one thread writes to it. All data used by internal loop must fit into L1 cache, half of L1 cache if HT/SMT is used - which means VERY DIFFERENT algorithms from what theoretical O-complexity would deem optimal. And for NUMA like Threadripper/EPIC and all multi-CPU platforms, there is so much more to make sure that the memory a thread operates on is allocated on it's own memory channel (hello system-level calls to fix affinity etc etc etc etc).
Even 16 cores/32 threads are basically only for specially optimized professional applications. Certainly not for toys.
"On top of that, there just aren't all that many TASKS that are embarrassingly parallel"
Most tasks where performance actually matters (meaning that one 4GHz core is not enough) are quite parallelable. Today "CPU performance matters" in practice (as opposed in testing by non-practitioners) means "a lot of computations on a lot of data", not "a lot of computations on little data" or "few computations on a lot of data".
Great article! Could you do the gaming tests with CPU 0 disabled on say a Ryzen 1600/1800? It would be nice to know if we could get extra performance by doing that.
In the future, I'd like to see a multitasking test where you 7zip a file, encode a video, and do whatever else that isn't perfectly parallelized (practically anything other than offline graphics rendering). With regard to the article, one could manipulate CPU affinities and see how that affects the result. I like to think that high-core-count processors like Threadripper are meant for mulit-tasking rather than single-tasking.
"A good scientist always publishes negative results. That’s what we are taught from a morality standpoint, although academia doesn’t always work that way." Haha so true
Since AMD and intel have different issues and fixes with the various vulnerabilities recently exposed and fixed; I have to wonder if the results of this experiment would be different if it all ran on pre-disaster microcodes. After all, the fixes are such that it should affect things noticeably, right? (And if you are a MS fan you might even suggest that they didn't fix the scheduling for the new situation, but I'm not so I won't.)
It would be interesting seeing the results from more core configurations. Whole CCX's disabled; half of the cores used from each CCX; whole CCX0 then various configs for the other 1 to 3 CCX's depending on the chip used.
Umm I just applied that command for 2990wx and apart from having less performance overall, my screen flashes black every 30 seconds. How do I reverse engineer the start /affinity FFFF FFFF FFFF FFFC “” “D:\2019\Script.exe ?
I tried stop affinity etc but nothing happened. How do I revert back asap please?
73 Comments
bobthewomtad - Thursday, October 25, 2018 - link
Small typo:
"Most of the popular distributions are aware of core resources and memory access, due to how Linux plays in multi-socket servers or even smartphones with big.little cores."
Probably intended big/little cores. Great article though, I personally didn't know much about scheduling before this and the article helped expand that.
Ian Cutress - Thursday, October 25, 2018 - link
Technically Arm characterizes the styling as big.LITTLE: https://en.wikipedia.org/wiki/ARM_big.LITTLE
wsjudd - Thursday, October 25, 2018 - link
Nah, it's definitely with a period between the two terms -- https://en.wikipedia.org/wiki/ARM_big.LITTLE
mm0zct - Thursday, October 25, 2018 - link
Actually "big.LITTLE" is ARM's official name for it: https://developer.arm.com/technologies/big-little
prophet001 - Thursday, October 25, 2018 - link
rofl
Those graphics are amazing.
nandnandnand - Saturday, October 27, 2018 - link
Hmm. https://imgur.com/a/h3IfQKb
Hul8 - Thursday, October 25, 2018 - link
I think you have a couple of typos in the affinity mask: the first group of digits only has two digits, and the second-to-last has three.
It also seems the first error is reflected in the hexadecimal that follows, but not in the command after that.
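For reference, here is a minimal sketch (not from the article) of how such a 64-bit affinity mask is built and why all 64 logical threads of a 2990WX need exactly 16 hex digits; which CPUs get excluded here is purely illustrative.

```cpp
// Illustrative only: build the mask "all 64 logical CPUs except 0 and 1",
// which comes out to FFFFFFFFFFFFFFFC -- 16 hex digits, 4 logical CPUs per digit.
#include <cstdint>
#include <cstdio>

int main() {
    std::uint64_t mask = ~0ULL;   // start with every logical CPU allowed
    mask &= ~(1ULL << 0);         // clear bit 0: exclude logical CPU 0
    mask &= ~(1ULL << 1);         // clear bit 1: exclude logical CPU 1
    std::printf("%016llX\n", static_cast<unsigned long long>(mask));  // FFFFFFFFFFFFFFFC
    return 0;
}
```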
aetherspoon - Thursday, October 25, 2018 - link
The first set of quotes in Start is for the window title. https://ss64.com/nt/start.html for more information.
Railgun - Thursday, October 25, 2018 - link
Second sentence...”Due to a series”...?
xilience - Thursday, October 25, 2018 - link
Great article!
Kevin G - Thursday, October 25, 2018 - link
Minor quibble:
"(note, we start counting from 0, so the cores are listed as 0 to 7)"
With respect to the displayed endian, wouldn't it be 7 to 0 since the higher core number is the left most digit?
Also where did you get that 5 core system that is demonstrated in the first GIF? :)
nathanddrews - Friday, October 26, 2018 - link
That GIF is amazing - I lolled.
Byyo - Thursday, October 25, 2018 - link
Great article and research! Minor note: Referenced software is 'Process' Lasso, not 'Project' Lasso.
entity279 - Thursday, October 25, 2018 - link
Tangentially related, I'm curious how Intel's "favored core" is scheduled around. It does appear to be statically designated, mine e.g. is core 11. If schedulers don't know to throw workloads its way, it seems that's all for naught...
eastcoast_pete - Thursday, October 25, 2018 - link
Thanks Ian, enjoyed this analysis. I would love to see a server-centric analysis of scheduling on EPYC and Xeon in Windows Server. If that scheduling is similarly core 0 focused, it's no wonder that even Microsoft uses Linux to run its Azure servers. Might be high time for Redmond to improve its multithread management.
lemans24 - Thursday, October 25, 2018 - link
A server-centric analysis would be great, but I think the simpler question is whether that software was written correctly for multi-threaded use in the first place. The biggest determinant of application performance is application design based on assumed and actual underlying resources.
The EPYC chips use full 8-channel access to memory, and I am certain that the more threads/cores a chip has, the more performance will be restricted when one uses fewer memory channels, as on Threadripper. I really hope the Zen 2-based Threadripper uses 8-channel memory, which would help mitigate the performance hit when all threads are active, compared to the 1950X/2990WX Threadrippers.
Ratman6161 - Thursday, October 25, 2018 - link
Well, I've been running Windows Servers for a long, long time, starting with NT 3.51. I've never encountered performance issues like this, but...
1. I've always used Intel processors in Windows Servers. Is it possible that Microsoft has just done a good job of optimizing its scheduler to work with Intel CPUs and not so much with AMD? Seems like a lot could be solved by AMD working with Microsoft to build an AMD-specific scheduler?
2. The first time I had multi-core CPUs in my servers was in 2007. But at the time we got our first two-socket, 4-cores-per-socket servers, we were also going virtual with VMware. So since then, near 100% of my Windows servers have been VMware VMs. My thought is that VMware is handling the juggling of resources behind the scenes so the Windows VM doesn't have to deal with it. This scenario is increasingly becoming the norm. Whether you are on your own hardware or running a Windows VM in the cloud (doesn't matter which cloud), Windows doesn't really have to deal with the underlying hardware, at least not directly.
SFNR1 - Thursday, October 25, 2018 - link
Azure is running on Hyper-V afaik: https://en.wikipedia.org/wiki/Microsoft_Azure#Desi...
Dug - Friday, October 26, 2018 - link
Network fabric is Linux though, and wouldn't be surprised if the resource code is Linux. This is above the underlying custom hyper-v code which I'm sure is Windows.
SFNR1 - Friday, October 26, 2018 - link
Hyper-V runs bare metal on HPE quad-socket servers, and I don't think MS invested so much money in SDN with Server 2016 Datacenter just to run it on top of Linux...
SFNR1 - Friday, October 26, 2018 - link
https://www.red-gate.com/simple-talk/cloud/cloud-d...
s3cur3 - Thursday, October 25, 2018 - link
Awesome analysis, Ian. I've been reading AT's stuff on ThreadRipper with tremendous interest. :)
Eletriarnation - Thursday, October 25, 2018 - link
Interesting results and the graphics had me cackling in my cube. Thanks Ian!
sseyler - Thursday, October 25, 2018 - link
Great article and information. Personally, I run a 2990WX on Ubuntu 18.04 at work and have near-ideal scaling for calculations that aren't memory-bound. On the other hand, I noticed I have scheduling issues when I game at home on Windows 10 with my 1950X, particularly with CS:GO where I get microstuttering. I use Project Lasso to help mitigate the issue, but it's not a perfect fix. Maybe I'll play around with affinity masks in the shell and see what happens.
sseyler - Thursday, October 25, 2018 - link
Edit: Process Lasso
lol
sing_electric - Thursday, October 25, 2018 - link
This is a tangent, but re: "A good scientist always publishes negative results. That’s what we are taught from a morality standpoint, although academia doesn’t always work that way."
I _WISH_, but the reality is, not only do most scientists not publish negative results, they CAN'T. Most journals won't take studies that don't have a positive finding, UNLESS you're refuting an already-published finding (either through replication or by creating a separate study that should yield the same result).
I really want there to be a journal (call it "Null" or something?) that would exclusively publish negative results, just so they'd be out there. It would kill SO MANY BIRDS AT ONCE. It would help solve the replication crisis (since there's a good chance that at least some studies that had a positive finding were preceded by a similar study that didn't, but that no one heard about because it wasn't published), and it would help the careers of scientists who want to look at interesting things even if they don't work out (since your employment, especially earlier in your career, is often tied very closely to how much you publish), AND it would suggest avenues for future research ('hmm, THIS isn't significant, but it raises some interesting questions on whether THAT is....'), but, alas, I don't think that'll happen.
3DoubleD - Thursday, October 25, 2018 - link
Agreed, there is probably some real value in that, although you have to consider how much time scientists will devote to preparing a paper with negative results and reviewing it. I for one published two papers in my research days (III-V semiconductor nanowires) about solving problems that were largely preventing me from conducting the primary research I was trying to conduct. This isn't exactly publishing failures, but it certainly admitted we were having problems and shared the solutions we found. It isn't high-impact work (10 citations on one paper, 2 on another). Publishing just the failure, though, would I feel be fairly difficult to get through any sort of peer review; it is sort of like trying to prove a negative (why you couldn't achieve success). Now my perspective and experience in semiconductor fabrication would be very different than someone's in biology or astronomy.
Caswallon - Thursday, October 25, 2018 - link
This article is an example of why I love this site :)
I'm probably about to write some BS... but wouldn't it be useful to disable some cores on nodes 1 and 3?
Having the memory-connected nodes do a bit less data fetching for the others could lead to interesting results...
PeachNCream - Thursday, October 25, 2018 - link
Solution - Get rid of all of those extra CPU cores by hooking yourself up with a Pentium 3/4 or an early model Athlon. Real men do their com-pute-u-late-in' on only one processor so there isn't even a need for a stupid scheduler. Sh!t gets done when it gets done, and while you're waiting the 11 minutes it takes for that 2MB MP3 file to start playing, you can enjoy the finer things in life, like a Miller Genuine Draft straight from the can in your deskside mini fridge.
bananaforscale - Saturday, October 27, 2018 - link
If you want to get rid of schedulers you need to go back to singletasking or co-operative multitasking.
just6979 - Thursday, October 25, 2018 - link
Maybe they should have controlled for total core counts by comparing CPU0 (first enumerated) disabled against CPU31 (last enumerated) disabled, for consistent 62-thread vs 62-thread tests instead of 62-thread vs 64-thread...
twtech - Thursday, October 25, 2018 - link
AMD either needs to commit to working with MS to build a better scheduler for their chips, or not release something like this again that's intended for workstation use. Ironically this would be less of an issue for a server-oriented chip, as Linux seems to handle it a lot better, and the processor choice could be matched to the specific workload the server was designed to handle.
mapesdhs - Friday, October 26, 2018 - link
If it's intended for workstation use as you claim, why would anyone be moaning when it doesn't perform as well as it might with Winblows for consumer tasks due to scheduling issues? It isn't intended for gaming; if you want such designs to run ok for consumer level tasks, then moan at MS to make their OS better, not blame AMD.
mapesdhs - Friday, October 26, 2018 - link
I should add that the evidence from the Linux experience is clear: MS could make their OS run better with this kind of hw if they wanted to, but they haven't.
rahvin - Friday, October 26, 2018 - link
MS is far more focused on moving settings menus around, changing the location or color of something and forcing people to use the windows store than they are at fixing real architecture problems. This lies at the fundamental nature of Linux Kernel Development versus Windows Development. Linux tends to focus on doing it right, Microsoft not so much.
MattZN - Thursday, October 25, 2018 - link
One thing I noticed in testing the 2990WX myself is that when all cores are heavily loaded with the same test (a bulk compile test in my case), those cores with direct attached memory appear to be able to hog more of the memory bandwidth than those cores without direct attached memory, and thus complete their task(s) a bit more quickly.
When I run the test on cores one at a time, it takes the same amount of time regardless of which core it is running on. It is only when under significant memory load that the cores distinguish themselves. That was, literally, the only issue I could find.
In other words, just running on core(s) without direct attached memory is not itself a problem. Memory latencies are very well absorbed by CPU caches, and having a large number of cores does a good job filling the pipeline stalls occurring on each one. It should be noted that the behavior of the 2990WX is very similar to the behavior of any multi-socket system... such as a dual-socket Xeon system, for example. All such systems have major step functions in terms of memory latency when the CPU on one die must access the memory attached to another.
This really is more of a Windows problem than a threadripper problem. Linux has no problem parsing the CPU and Memory topology passed to it by the BIOS. There is always a hierarchy of locality not only between CPU and memory, but also between CPUs. The linux scheduler is very careful to try to not move threads across sockets unless it absolutely has to, for example.
But this topology is actually very fine-grained even on less exotic single-socket CPUs. Hyperthreads belonging to the same core. Local core complex with the same L2 or L3 cache. Core complex on the socket. And so forth. For example, switching a process between sibling hyperthreads incurs essentially no cache management overhead. That is, cache mastership doesn't have to change. Switching a process between cores always incurs very serious cache management overhead as the mastership of the dirtied pages has to move between CPU caches. Moving a thread across to another CCX (in AMD's case), or equivalently across a socket on a multi-socket system, incurs a greater overhead. All of this topology information is provided by the BIOS and the CPUs themselves via MSRs, and Linux parses every last little bit of it in order to make the best decisions.
Windows really ought to properly parse the topology data provided to it by the BIOS and do the right thing, and it is very clear that it does not.
-Matt
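As a rough illustration of that point (assuming a Linux box with the usual sysfs layout), the sketch below prints the NUMA node-to-CPU map the kernel built from the firmware tables, which is the same locality information its scheduler works from.

```cpp
// Rough sketch, Linux-specific: list each NUMA node the kernel knows about and
// which logical CPUs belong to it, straight from sysfs.
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>

int main() {
    namespace fs = std::filesystem;
    for (const auto& entry : fs::directory_iterator("/sys/devices/system/node")) {
        const std::string name = entry.path().filename().string();
        if (!entry.is_directory() || name.rfind("node", 0) != 0) continue;
        std::ifstream cpulist(entry.path() / "cpulist");
        std::string cpus;
        std::getline(cpulist, cpus);        // e.g. "0-7,32-39" for one die
        std::cout << name << ": CPUs " << cpus << '\n';
    }
    return 0;
}
```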
peevee - Monday, October 29, 2018 - link
This is pure speculation on your part. Windows has supported NUMA topologies since Windows Server 2008 R2 (Win7), with the concept of a group. But how groups will be configured depends on BIOS settings. In the case of Threadripper, there should be 2 groups of 32 logical CPUs each, not one group with 64 logical CPUs. The corresponding EPYC (with the same number of cores) should be configured as 4 groups of 16 logical CPUs each. But then a lot of toy software not aware of it will not use 64 CPUs at all.
Workstation CPUs should be run on workstation workloads, AutoCAD/Fusion/3DS etc. Here only compilation fits, and even there, just 1 project does not mean much, as project structure and compiler options might affect compilation performance. For example, is /Gm enabled on all projects in their test? How about LTCG, which greatly decreases the level of parallelization even if /Gm is enabled.
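For what it's worth, here is a small hedged sketch (assuming Windows 7 / Server 2008 R2 or later and the documented kernel32 calls) that reports how many processor groups and NUMA nodes the OS actually created from the firmware tables.

```cpp
// Illustrative only: print the processor-group and NUMA-node layout Windows
// exposes, which determines what group-unaware software will ever see.
#include <windows.h>
#include <cstdio>

int main() {
    WORD groups = GetActiveProcessorGroupCount();
    std::printf("Active processor groups: %u\n", groups);
    for (WORD g = 0; g < groups; ++g)
        std::printf("  group %u: %lu logical CPUs\n", g, GetActiveProcessorCount(g));

    ULONG highestNode = 0;
    if (GetNumaHighestNodeNumber(&highestNode))
        std::printf("NUMA nodes reported: %lu\n", highestNode + 1);
    return 0;
}
```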
YukaKun - Thursday, October 25, 2018 - link
What happens when you change the affinity *after* a process has started?
I've noticed the behaviour is slightly different there, but I have no real explanation for it, just hunches.
Cheers!
Sttm - Thursday, October 25, 2018 - link
Anyone else find the arrows redundant and confusing? ^ 104%... 104% PERFORMANCE INCREASE OH MY F.... oh it's 4%... right.
mapesdhs - Friday, October 26, 2018 - link
Can you not do math? Sheesh...
Death666Angel - Saturday, October 27, 2018 - link
Nothing about "performance increase", just "performance".
GreenReaper - Monday, October 29, 2018 - link
No, it's a fair point - if you put the arrow it initially looks like "increase of 104%". It might be better to just leave those out and colour the text green or red appropriately.
bpkroth - Thursday, October 25, 2018 - link
Just a guess, but possibly interrupt handling is primarily handled on or optimized for CPU0, so some of your more I/O-oriented tests might be seeing slowdowns there due to the process not already being resident on that core when the OS needs to handle an I/O for it. In Linux you could probably measure that by watching /proc/interrupts during your tests. Not sure offhand what the equivalent would be in Windows.
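For the Linux side of that guess, here is a rough sketch (assuming the usual /proc/interrupts layout of a header row naming the CPUs and one row per interrupt source) that totals the per-CPU counts, so a CPU0-heavy distribution shows up at a glance.

```cpp
// Rough sketch, not a polished tool: sum the per-CPU columns of /proc/interrupts.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main() {
    std::ifstream in("/proc/interrupts");
    std::string header;
    std::getline(in, header);                         // "  CPU0  CPU1  ..."
    std::istringstream hs(header);
    std::vector<std::string> cpus;
    for (std::string c; hs >> c; ) cpus.push_back(c);

    std::vector<unsigned long long> totals(cpus.size(), 0);
    for (std::string line; std::getline(in, line); ) {
        std::istringstream ls(line);
        std::string irq;
        ls >> irq;                                    // e.g. "24:" or "LOC:"
        for (std::size_t i = 0; i < cpus.size(); ++i) {
            unsigned long long v;
            if (!(ls >> v)) break;                    // stop at the description text
            totals[i] += v;
        }
    }
    for (std::size_t i = 0; i < cpus.size(); ++i)
        std::cout << cpus[i] << " handled " << totals[i] << " interrupts\n";
    return 0;
}
```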
alpha64 - Friday, October 26, 2018 - link
I would like to comment on the second-to-last paragraph:
"In the end, with most of the results equal though, it is hard to come to a conclusion around schedulers. I like the way that Linux offers a hardware map of resources that software can use, however it also requires programmers to think in those heterogeneous terms – not all cores will always have the same level of access to all the resources. It’s a lot easier to code software if you assume everything is the same, or code the software for a single thread. But when the hardware comes as varied as Threadripper 2 does, those hardware maps need to be updated as workloads are placed on the processor. Unfortunately I’m neither a scheduler expert not a coding expert, but I do know it requires work in both segments."
I think that this is easy to misunderstand. The Linux scheduler handles mapping of program's threads and/or processes to multiple cores. The "software" that has to be aware of the hardware resources is only the kernel. When a software program is written that has multiple threads of execution (again, whether "threads" or "processes" in the Linux system sense), it needs to know nothing about the hardware resources. The kernel decides what cores to run the software on.
Here is an example:
When compiling the Linux kernel, many different programs are run. Mostly the C compiler (normally GCC, although in some instances clang), and the linker. The C compiler and the Linker do not need to know anything about the hardware configuration. In fact, they are mostly single-threaded programs. However, when the build job is set to run multiple C compiler instances in parallel, the speed of compiling the Linux kernel's multiple thousand source files into a working final product scales quite nicely. You can see this on Phoronix's results of testing the TR2990WX.
Another example:
When running 7zip, it uses multiple threads for its work. 7zip was not optimized for the TR2990WX at all, but scales much better in Linux than on Windows (again, see Phoronix' results). 7zip is simply a multithreaded application. It knows nothing about the resources on the TR2990WX. However, since the kernel and its scheduler know how to properly handle NUMA configurations (like the TR2990WX has), it is able to get a much more reasonable scaling out of the TR2990WX than Windows.
I hope this clarifies things for anyone reading this who finds this paragraph confusing or even misleading.
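To make the distinction concrete, here is a minimal sketch of the kind of program being described: it spawns ordinary worker threads with zero knowledge of dies, NUMA nodes, or caches, and merely reports (via the Linux-specific sched_getcpu()) where the kernel chose to run each one. The iteration count is arbitrary busy work.

```cpp
// Minimal sketch: a topology-ignorant multithreaded program; placement is
// entirely the kernel scheduler's decision.
#include <sched.h>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    unsigned n = std::thread::hardware_concurrency();   // e.g. 64 on a 2990WX
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i) {
        workers.emplace_back([i] {
            double x = 0.0;
            for (long k = 0; k < 50'000'000; ++k) x += 1e-9 * k;   // busy work
            std::printf("worker %u (sum %.1f) finished on CPU %d\n", i, x, sched_getcpu());
        });
    }
    for (auto& t : workers) t.join();
    return 0;
}
```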
BikeDude - Friday, October 26, 2018 - link
@alpha64: "The "software" that has to be aware of the hardware resources is only the kernel."How can the kernel partition your threads according to what part of the memory they will be accessing? Only the application itself knows that e.g. thread A and C will share certain structures in memory, while thread B and D will do so to a lesser degree.
https://docs.microsoft.com/en-us/sql/database-engi... describes how MS solved this with SQL Server 2014. It takes an active role when it comes to thread scheduling. I guess it will set the affinity mask based on which NUMA node it wants to schedule a thread.
Naturally, not all software behaves like a DBMS. A video encoder perhaps won't share so much data between threads (maybe each thread handles a number of frames independently of other threads?) and the NUMA question becomes moot. (unless there are nodes that simply have no direct memory access and need to go through other nodes -- I got the impression that Threadripper does this)
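As a hedged sketch of what a NUMA-aware Windows application can do with the documented Win32 calls (the exact approach SQL Server takes isn't shown here), this looks up the processor mask of NUMA node 0 and pins the calling thread to it. It uses the legacy single-group APIs, so it assumes at most 64 logical processors.

```cpp
// Illustrative only (Windows, single processor group): restrict the current
// thread to the CPUs of NUMA node 0 so its memory accesses stay node-local.
#include <windows.h>
#include <cstdio>

int main() {
    ULONG highestNode = 0;
    if (!GetNumaHighestNodeNumber(&highestNode)) return 1;
    std::printf("NUMA nodes: %lu\n", highestNode + 1);

    ULONGLONG nodeMask = 0;
    if (!GetNumaNodeProcessorMask(0 /* node */, &nodeMask)) return 1;

    if (SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)nodeMask) == 0) return 1;
    std::printf("thread pinned to node 0 mask 0x%016llX\n", nodeMask);
    return 0;
}
```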
alpha64 - Friday, October 26, 2018 - link
@BikeDude,
You are right that a DBMS is a pretty extreme example of an "application" level software program. In fact, they often use storage devices (SSDs/Hard Drives) bypassing the filesystem too. The fact that Microsoft is doing specific things to handle NUMA in this software is not surprising, but I was not talking about how Microsoft has solved this, but rather how Linux does (better than Windows).
My point was that most software does not set affinity, or know or care about the architecture of a NUMA system. The two examples I gave do not, yet they show good performance scaling on the TR2990WX. Certainly ugly hacks like CPU affinity can be used to try to fix poor performance due to ineffective scheduling, but the results of this workaround are shown in this article to not help most of the time. Knowing the NUMA tables and actually making intelligent choices based on that are still firmly in the realm of the kernel, not normal applications.
When speaking about memory locality, the operating system does set up the page table structures with the MMU for specific threads/processes. Thus the kernel does have knowledge of which cores are local to the memory for specific threads. Linux scheduling takes into account the NUMA structure and what CPUs are "close" to the memory used for the tasks it is scheduling. Thus, the answer to your first question in your first paragraph is that the kernel does handle this. The situation with sharing certain structures also can be tracked by the kernel.
As your last paragraph states, the TR2990WX does have half of its cores without direct memory access. AMD themselves acknowledged this at release of TR2, and mentioned that this was not a huge performance penalty for most cases. And, if one needs local memory for all NUMA nodes, EPYC does support local memory on all NUMA nodes.
In summary, my point is that in order to get "good" performance from the TR2990WX, Linux achieves this in the kernel without the applications having any special knowledge of the hardware. It seemed that Ian's original paragraph stated otherwise, which is what I was trying to address.
BikeDude - Friday, October 26, 2018 - link
A thread doesn't have its own dedicated MMU.
Any thread, at least in a Windows process, can access everything other threads in the process can. Everything is shared. (I've been doing C# for the past 10 years, so I have not kept up to date on the Win32 API and could be wrong)
UNIX apps used to fork() a lot. If that is still the case, then the scheduler will indeed face nicely partitioned unrelated processes. Forking would be a rather unusual thing to do in Windows as spawning new processes comes with a noticeable overhead there.
alpha64 - Friday, October 26, 2018 - link
@BikeDude,
Perhaps we are just talking past each other, but in Linux each thread is most often its own process (depending on the libc version used / programming language). The kernel uses copy-on-write to make sure that creating a new thread/process is low-overhead, but still isolated. Regardless, the kernel keeps track of the memory maps for each process (and thread). Thus, the kernel can schedule the applications to run effectively on multiple cores.
Windows does have problems with efficiently forking, while Linux does it well. Regardless, my original point still stands - Linux is able to effectively schedule on NUMA systems without applications having to be programmed with any knowledge of it.
BikeDude - Friday, October 26, 2018 - link
What API functions are involved to allow Linux to keep track of what part of the process' memory space belongs to a given thread? Don't most allocations happen through malloc() like they've always done? And how do two threads share structures in memory? (You can't assume that merely because thread A allocates a chunk of memory, that thread A will be the only thread accessing it -- one major benefit of using threads is sharing the same address space)
So, when it comes to threads... I very much doubt it works the way you think it does. Processes -- sure. MMU all the way, etc, but not for threads. That would be downright bizarre.
alpha64 - Friday, October 26, 2018 - link
@BikeDude,
I have actually written Linux multithreaded applications in multiple programming languages and worked on the Linux kernel. Linux has mechanisms to set up shared memory segments (mmap), but mostly a shared memory segment like you are describing is allocated before a process calls clone() (to fork() or create a new thread - they make this clone() system call with different parameters). Unless one of the resultant processes lets the operating system know that it will no longer be using the shared segment, both processes have the same memory segment mapped into their process space (kept track of by the kernel). This would be your hypothetical shared memory structure. However, this is simply not done in most programs, as handling shared memory structures by multiple threads of execution is very error-prone. It is far better to use some kind of inter-process communication rather than a shared memory segment where one's own program has to handle locking, contention, and ensure consistency.
If you have a Linux system around, try checking /proc/<pid>/maps to see what memory segments the kernel has mapped for any given process. You can also read up on NPTL (the threading library that most recent Linux systems use). This is definitely all kept track of by the kernel, and useful for scheduling on NUMA systems.
I think where the confusion in our discussion lies is perhaps in how you think programs are typically designed. I think that shared memory regions are a bad idea due to the complexity of ensuring that they will be properly handled. Your assumption seems to be that they are used widely. Perhaps this is due to the different platforms which we use.
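To illustrate the mechanism (simplified to fork() rather than a raw clone() call, and Linux-specific), a shared anonymous mapping created before the new process exists is visible to both sides, and the region shows up in each process's /proc/<pid>/maps.

```cpp
// Rough sketch: a MAP_SHARED mapping set up before fork(), so parent and child
// see the same physical pages and the kernel tracks the region for both.
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int* counter = static_cast<int*>(mmap(nullptr, sizeof(int),
                                          PROT_READ | PROT_WRITE,
                                          MAP_SHARED | MAP_ANONYMOUS, -1, 0));
    if (counter == MAP_FAILED) return 1;
    *counter = 0;

    pid_t pid = fork();
    if (pid == 0) {              // child: writes through the shared mapping
        *counter = 42;
        _exit(0);
    }
    waitpid(pid, nullptr, 0);    // parent: observes the child's write
    std::printf("parent sees counter = %d\n", *counter);   // prints 42
    munmap(counter, sizeof(int));
    return 0;
}
```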
BikeDude - Friday, October 26, 2018 - link
"shared memory regions are a bad idea"Well... Yes. But that begs the question: Why stress with NUMA then? If each thread mostly stirs around with its own separate structures, then all the scheduler needs to worry about is keeping the threads from bouncing too much between various CPU cores. As long as the threads are rescheduled on the same core (or at least the same NUMA node), all is fine.
SQL Server seems to resort to several threads to service one query, and then, afaict from the query plan, ties it all together again in what I assume to be some sort of main thread (of that query/SPID). That suggests to me that it is beneficial to control those threads and keep them somewhere close so you can pull in whatever data they've been working on at the end (without having to access a different memory node).
My hunch is that server apps like that (call it 'extreme' if you must) are where we'll see the benefit of NUMA, whereas your average renderer, game or what not, is not where NUMA shines. Whatever advantage the Linux scheduler has over Windows in this instance, I don't see NUMA support being a major part of that equation (unless there is something seriously wonky with how Windows reschedules threads).
(I'm mostly doing micro services and rely on SQL Server to do the heavy lifting for me, so this topic isn't my forte, at least not as a developer)
peevee - Monday, October 29, 2018 - link
"but in Linux each thread is most often its own process"Apparenly you are completely incompetent in the domain. Stop embarassing yourself.
john7up - Tuesday, October 30, 2018 - link
Umm I just applied that command for the 2990WX and apart from having less performance overall, my screen flashes black every 30 seconds. How do I reverse the start /affinity FFFF FFFF FFFF FFFC “” “D:\2019\Script.exe command?
I tried stop affinity etc but nothing happened. How do I revert back asap please?
peevee - Monday, October 29, 2018 - link
"I think that this is easy to misunderstand. The Linux scheduler handles mapping of program's threads and/or processes to multiple cores. The "software" that has to be aware of the hardware resources is only the kernel. When a software program is written that has multiple threads of execution (again, whether "threads" or "processes" in the Linux system sense), it needs to know nothing about the hardware resources"If you want scalability beyond 16 logical CPUs, this is simply false. Software needs to be NUMA-optimized, in about 20 different ways.
DominionSeraph - Friday, October 26, 2018 - link
Can we have an article on the scaling with thread count and how it behaves when pinning a process to a CCX/die?
Video encoding is trivial to split (I do my encoding on a cluster of desktop PCs) but AnandTech only benchmarks running a single instance of Handbrake, which nobody outside of HANDJOB even uses. This currently makes the i9 9900k and i7 9700k look to be the champions of price/performance with the 32 core EPYC and Threadripper performing on the level of an Intel quad core.
I'm curious as to what an x264 --preset veryslow encode looks like if you split it up into quarters or eighths and pin them to their own die/CCX.
mapesdhs - Friday, October 26, 2018 - link
Just curious, do you use Windows or Linux for your encoding?
DominionSeraph - Friday, October 26, 2018 - link
Windows.
MeGUI is just too good and I don't need an OS that's going to fight me at every turn.
Timur Born - Friday, October 26, 2018 - link
Windows 10's process scheduler is at least partially broken since the very first "Creators" version came out and has not been fixed to date (1809). So take any measurements in this regard with a big grain of salt.
QChronoD - Friday, October 26, 2018 - link
I'd be curious to see the performance difference between running a program that only used ~8 threads and was allowed to use any core, restricted to node 0, and forced to one of the nodes w/o direct memory access.
MobiusPizza - Friday, October 26, 2018 - link
Surely a better experiment design would disable 2 threads of, say, the last core for the CPU-0 case? Then you will be comparing 62 vs 62:
CPU-0 active, disable core 32
CPU-0 inactive, disable core 1
peevee - Friday, October 26, 2018 - link
"performance in most applications did not scale from the 16-core, and in some cases regressed"This is the fault of both outdated Von Neumann-based architecture and software companies.
You have to write in a VERY special way to make sure your app scales. Ask me how I know.
99% of modern software engineers simply don't know how to do it, and most popular programming languages (those without OO aggregation by value, usually GC-dependent, all these JS/Java/Python/C# etc) simply do not support OO and high scalability at the same time, exceptions being obviously C++ and supposedly Rust (although the jury is still out given that no HPC applications are written in Rust yet).
But even in C++, scaling beyond 16 on Von Neumann is VERY specific. For example, no heap usage in the threads. Obviously no synchronization (it kills even beyond 4 threads). Not only no data sharing, but not even cache line sharing if even one thread writes to it. All data used by the internal loop must fit into L1 cache, half of L1 cache if HT/SMT is used - which means VERY DIFFERENT algorithms from what theoretical O-complexity would deem optimal. And for NUMA like Threadripper/EPYC and all multi-CPU platforms, there is so much more to make sure that the memory a thread operates on is allocated on its own memory channel (hello system-level calls to fix affinity etc etc etc etc).
Even 16 cores/32 threads are basically only for specially optimized professional applications. Certainly not for toys.
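One of those rules ("not even cache line sharing if one thread writes to it") is easy to demonstrate. The sketch below is illustrative, with arbitrary thread and iteration counts: it compares per-thread counters packed into adjacent words against counters padded onto their own 64-byte lines.

```cpp
// Illustrative micro-benchmark of false sharing vs. padded, per-line counters.
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

constexpr int kThreads = 16;
constexpr long kIters = 20'000'000;

std::atomic<std::uint64_t> packed[kThreads];                  // adjacent -> contended lines
struct Padded { alignas(64) std::atomic<std::uint64_t> v{0}; };
Padded padded[kThreads];                                      // one cache line per thread

template <typename Body>
double run(Body body) {                                       // time kThreads copies of body
    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> ts;
    for (int t = 0; t < kThreads; ++t) ts.emplace_back(body, t);
    for (auto& th : ts) th.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    double shared = run([](int t) {
        for (long i = 0; i < kIters; ++i)
            packed[t].fetch_add(1, std::memory_order_relaxed);
    });
    double isolated = run([](int t) {
        for (long i = 0; i < kIters; ++i)
            padded[t].v.fetch_add(1, std::memory_order_relaxed);
    });
    std::printf("falsely shared: %.2fs   padded: %.2fs\n", shared, isolated);
    return 0;
}
```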
edzieba - Saturday, October 27, 2018 - link
Trying to get everyone to write for your architecture rather than designing your architecture for the code people are writing is how Itanium happened.
On top of that, there just aren't all that many TASKS that are embarrassingly parallel. HPC
edzieba - Saturday, October 27, 2018 - link
- workloads may scale with Gustafson's law, but client workloads are firmly in Amdahl territory.
(Fscking ad loads shifting the submit button under my finger)
peevee - Monday, October 29, 2018 - link
"On top of that, there just aren't all that many TASKS that are embarrassingly parallel"Most tasks where performance actually matters (meaning that one 4GHz core is not enough) are quite parallelable. Today "CPU performance matters" in practice (as opposed in testing by non-practitioners) means "a lot of computations on a lot of data", not "a lot of computations on little data" or "few computations on a lot of data".
edsib1 - Friday, October 26, 2018 - link
With AMD's dynamic local mode fix due to come out on Monday - was there any point in doing this article?
phoenix_rizzen - Saturday, October 27, 2018 - link
To have a baseline to compare to?
BlueFllame - Friday, October 26, 2018 - link
Hello,
Great article! Could you do the gaming tests with CPU 0 disabled on say a Ryzen 1600/1800? It would be nice to know if we could get extra performance by doing that.
Thanks.
SeannyB - Saturday, October 27, 2018 - link
In the future, I'd like to see a multitasking test where you 7zip a file, encode a video, and do whatever else that isn't perfectly parallelized (practically anything other than offline graphics rendering). With regard to the article, one could manipulate CPU affinities and see how that affects the result. I like to think that high-core-count processors like Threadripper are meant for multi-tasking rather than single-tasking.
phoenix_rizzen - Saturday, October 27, 2018 - link
Would be neat to see the results of running their benchmark suite in parallel instead of in series.
s.yu - Sunday, October 28, 2018 - link
"A good scientist always publishes negative results. That’s what we are taught from a morality standpoint, although academia doesn’t always work that way."Haha so true
Wwhat - Sunday, October 28, 2018 - link
Since AMD and intel have different issues and fixes with the various vulnerabilities recently exposed and fixed; I have to wonder if the results of this experiment would be different if it all ran on pre-disaster microcodes.After all, the fixes are such that it should affect things noticeably, right?
(And if you are a MS fan you might even suggest that they didn't fix the scheduling for the new situation, but I'm not so I won't.)
tygrus - Monday, October 29, 2018 - link
It would be interesting to see the results from more core configurations: whole CCXs disabled; half of the cores used from each CCX; whole CCX0 plus various configs for the other 1 to 3 CCXs, depending on the chip used.
john7up - Tuesday, October 30, 2018 - link
Umm I just applied that command for the 2990WX and apart from having less performance overall, my screen flashes black every 30 seconds. How do I reverse the start /affinity FFFF FFFF FFFF FFFC “” “D:\2019\Script.exe command?
I tried stop affinity etc but nothing happened. How do I revert back asap please?