It's not every day you get to test with top-end equipment; heck, in my house all my gear is at least a couple of generations behind the cutting edge. It's actually been a motivation for me to create a distributed rendering solution, since I have old gear sitting idle and still want to render my work before I'm eighty years old. So I managed to test with a set of computers that between them have nine Nvidia GTX cards of varying capability (Oh, and yeah, we're making good progress with the next release, and we have been testing 0.1.2 in pre-release, see the links at the end of the article if you feel adventurous and want to test :D).
Any one of these cards would be a match for the BMW scene, but I was soooooo keen to see what all nine together could do. Plus there were ten CPUs ranging from the i7-5930K (a fast hexacore with hyper-threading to boot, at 3.5 GHz) up to Xeon E5 series chips, which aren't as quick in clock speed but make up for that with lots of cache. Between all the systems we used, there was 512GB of RAM, so plenty of space for even the most demanding scenes. See the end of this article for the full range of specifications.
The tests we ran
If you use Blender a lot then chances are you've heard of Mike Pan's BMW Benchmark (and surprise surprise, that is what the article's header image is from :P ). It's kind of a de facto standard for judging how hard your hardware punishes the pixels, so to speak. So naturally I was curious to see what kind of speed this setup would yield, and in particular how much faster the group of machines would render with crowdrender vs the fastest card in the group. If you want to find this benchmark file, see the useful links at the end of this article :)
The configuration of the file was pretty much out of the box: the tile settings were set to 240 x 136 (GPU) or 16 x 16 (CPU), and the samples to either 400 AA or 1225 AA. Something worth mentioning here though: the file was originally configured to use 35 square samples, which gives 1225 AA samples. If you check Mike's Google Sheets spreadsheet, where he collects and publishes render time data, you'll notice that the times there make much more sense if you use 400 AA samples (see links below to find all this by the way).
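As a quick sanity check on those sample counts, here is a minimal sketch of the square samples arithmetic. The function name is my own, purely for illustration, and I'm assuming 400 corresponds to a setting of 20 with square samples on (you could equally type 400 in directly with it off).

```python
def aa_samples(value, square_samples=True):
    """Effective AA sample count for a sample setting.

    With square samples enabled the entered value is squared,
    otherwise it is used as-is. (Helper name is hypothetical.)
    """
    return value * value if square_samples else value


original = aa_samples(35)   # the file's out-of-the-box setting
reduced = aa_samples(20)    # assumed setting that yields 400 AA samples

print(original)             # 1225
print(reduced)              # 400
print(original / reduced)   # 3.0625, i.e. roughly three times the samples
```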
So the first test was to find the fastest card in the group. That turned out to be a GTX 980, though I'd overlooked the fact that one of the systems actually had a potentially faster GTX 980 Ti and had shut everything down... oops. I'll go back and retest that one, shall I?
The GTX 980 pulled up a time of 1 min 29 seconds and a few centiseconds or so, which I usually don't bother counting since the centiseconds are different each time you render, so I treat them as random noise. The next fastest card was a Quadro K5200, which you don't see many of these days, probably because they're stupidly expensive, though this one is no longer in production I think. It posted 1 min 55 seconds, so save yourself the cash if you were enchanted by that card; the GTX 980 was better and cheaper.
The fastest single CPU run I did was 2 mins 17 seconds using an i7-5930K @ 3.5GHz. Again I got all excited and forgot to run the Xeons by themselves. They have way more threads, so it would have been an interesting contest given their slower clock speed but larger cache. Looks like I really need to go back to the lab.
So the next tests to run were using crowdrender. Now, we haven't really shouted much about supporting GPU rendering, mainly due to the fact that I don't have one at home to test! Yep, I am not in the high-end GPU club. Ah well. That all changed for this test. So, you will be pleased to learn that it's actually fall-off-a-chair easy to use GPUs with crowdrender, and you can even mix in CPU-only rigs and it all works quite well. There will be a tutorial! But that's enough boasting, sorry, back to the results.
Multi CPU/GPU results
So, as the title image suggests, the GPUs clearly blasted the thing out of the water, returning an image in 20 seconds for the 400 AA samples and 53 seconds for 1225 AA samples. So first thoughts are: holy moly, my laptop takes ten minutes to do just 400 AA samples. Granted it's a laptop, but still.
On closer inspection there is something interesting about the results: if I increase the samples by a factor of three, the render time goes up by only about 2.65 times (20 seconds to 53 seconds), what the heck?! Did I just invent quantum computing for 3D?!
Sadly, no. If you run Mike's file, you'll notice that there are about 14-18 seconds or so where the clock is ticking after you hit render but no pixels have yet been drawn. It's the phase of the render where the BVH is built, textures are loaded into memory and blah blah blah; basically all the stuff needed to render is done, bar the actual rendering, and it reliably takes between 14 and 18 seconds across all eight systems I was using.
Stop and think about that for a minute: my laptop, 10 minutes, same setup time (I checked, it's about the same); nine GPUs, 20 seconds to render total, of which 14-18 seconds is setup! That leaves about 2-6 seconds of actual rendering! Hold .... the ... phone. 2-6 seconds? vs 10 minutes? Let's say it's nine minutes 40 seconds for conservatism's sake, if we account for the setup phase of the render.
By my calculations that is between 97 and 290 times faster (580 seconds divided by 6 seconds and by 2 seconds respectively). Not bad! Ok, how about the CPUs then? Assuming for some bizarre reason you can afford top-end CPUs and heaps of RAM but not the GPUs. Hmm, maybe because you spent all your cash on the CPUs and RAM.. anyway, I digress.
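For anyone who wants to check my sums, here is the speedup arithmetic as a small sketch. The numbers are the ones quoted above; the function and variable names are just mine for illustration.

```python
LAPTOP_RENDER_S = 9 * 60 + 40   # laptop time with setup removed: 580 s
FARM_TOTAL_S = 20               # nine-GPU result at 400 AA samples


def speedup(baseline_s, render_s):
    """How many times faster render_s is than baseline_s."""
    return baseline_s / render_s


def speedup_range(total_s, setup_min_s, setup_max_s, baseline_s):
    """Return (low, high) speedup once the setup time is subtracted."""
    slowest_render = total_s - setup_min_s   # least setup, most rendering
    fastest_render = total_s - setup_max_s   # most setup, least rendering
    return speedup(baseline_s, slowest_render), speedup(baseline_s, fastest_render)


low, high = speedup_range(FARM_TOTAL_S, 14, 18, LAPTOP_RENDER_S)
print(f"between {low:.0f} and {high:.0f} times faster")
# prints: between 97 and 290 times faster
```

The wide range comes entirely from the uncertainty in the setup time: a four second difference in setup is the difference between 2 and 6 seconds of actual rendering.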
The CPU run was not as quick but still very decent: 36 seconds! I guess all those Xeon threads do help in balancing the contest, but still aren't enough. Accounting for the setup time (again it's the same, at about 14-18 seconds per machine), that leaves somewhere between 18 and 22 seconds for the time spent actually rendering. Personally I wouldn't complain if I got that result at home. The fastest time I can get is about five minutes using three i5s.
Conclusions and stuff I learned
What was really clear is that any scene you render has some overhead in loading it and preparing the data necessary to start the render. So as the render time became lower and lower, the pre-render calculations dominated the total time. This means that once the render time is low enough, the bottleneck shifts to this pre-render phase. That makes total sense, since this is not something that crowdrender can force the computers to do in parallel. Each node has to load the scene into RAM, build its BVH, load the textures and so on. Only when they start drawing pixels does the benefit of more nodes actually kick in.
This explains why, when I tripled the sample count, the render time went up by less than three times: the render time for the 400 samples with all nine cards was mainly the time each node spent setting up the render, not actually rendering. Once the samples were higher, that changed.
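To make that concrete, here is a rough sketch of the idea as a toy model: every node pays the same fixed setup cost, and only the rendering part gets divided between nodes. This is my simplification, not how crowdrender schedules work internally, and it assumes all nodes are equally fast (which the test machines were not). The numbers are made up for illustration, apart from the setup time, which sits in the middle of the 14-18 second range.

```python
def total_time(setup_s, render_work_s, nodes):
    """Estimated wall-clock seconds: every node pays the full setup cost,
    while the render work is split evenly between the nodes."""
    return setup_s + render_work_s / nodes


def setup_fraction(setup_s, render_work_s, nodes):
    """Share of the wall-clock time spent on setup rather than rendering."""
    return setup_s / total_time(setup_s, render_work_s, nodes)


# Hypothetical numbers, chosen for illustration only (not measured):
SETUP_S = 16.0   # middle of the 14-18 second range seen in the tests
WORK_S = 36.0    # made-up single-node render work for the low sample count
NODES = 9

low = total_time(SETUP_S, WORK_S, NODES)
high = total_time(SETUP_S, 3 * WORK_S, NODES)

print(f"low samples: {low:.1f} s, setup share {setup_fraction(SETUP_S, WORK_S, NODES):.0%}")
print(f"3x samples:  {high:.1f} s, setup share {setup_fraction(SETUP_S, 3 * WORK_S, NODES):.0%}")
print(f"time ratio for 3x the samples: {high / low:.2f}")
```

With these made-up numbers the total goes from 20 to 28 seconds when the render work triples, because the 16 seconds of setup doesn't grow with the sample count. The same model also shows why adding more nodes stops helping: the rendering term shrinks, but the setup term stays put.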
So, can we go faster? Well, with this file the way to get faster is to look at the tasks performed prior to rendering. They look mainly like fetching data from disk or possibly RAM, doing some calculations to create what's called a ray tree, and then loading the necessary textures and so on. All of that suggests the CPU, RAM and disk might be good places to look. An SSD certainly wouldn't hurt. I'll have to actually check to see what those computers had in the way of storage. Oops again!
Faster RAM or overclocking the CPU could also help, particularly if any tasks performed in this stage are not well suited to multithreading.
If you're going to purchase hardware for a render farm
So after reading