Multiprocessor scalability

taob
Posts: 119
Joined: Sat Feb 08, 2003 2:12 pm
Location: Toronto, ON

Post by taob »

( from http://www.neatimage.net/forum/viewtopic.php?p=2818 )
taob wrote: Same hardware as above, but now running 4.2 Pro+, and I'm processing a batch of 10 copies of the same image:

Time (10 images, processing only): 3m36s, 21.93 megapixels/minute
Time (10 images, loading + processing + saving): 3m52s, 20.42 megapixels/minute

A single image (thus only using one CPU) takes 30.5 seconds, or 15.53 megapixels/minute. So NI is only running at about 70% efficiency with two CPUs (it should do 31 megapixels/minute, but only gets about 22). I should try running two instances of NI to see if that makes a difference (I would expect not, but you never know)...
So I tried a test where instead of having one instance of NeatImage running on two CPUs, I had two instances of NeatImage running on one CPU each. Multiprocessor support in NI was disabled, and I bound each instance to a specific CPU (using the XP Task Manager).

I started up the queues on both within about half a second of each other. After 14 minutes and 26 seconds, both finished within a second of each other, one having processed 24 images and the other 21 images. The asymmetry there is interesting... the faster instance was running on CPU 0. Anyway, that makes 45 images of 7.9 megapixels each for a total of 355 megapixels in 14.43 minutes, or 24.62 megapixels/minute.

That's an improvement of over 12% compared to NI doing the multithreading itself, but still less than 80% efficiency on a dual CPU system (assuming 100% is 31 megapixels/minute on my system). Has anyone tried running NI on a quad CPU system or larger?
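
For anyone who wants to check the arithmetic, here is a minimal Python sketch that recomputes the throughput and efficiency figures from the timings quoted above. The helper name `throughput` and the variable names are illustrative only; the timings, image counts, and the 7.9 megapixel image size are the ones given in the post.

```python
# Figures quoted in the post; names here are illustrative, not part of NI.
MEGAPIXELS_PER_IMAGE = 7.9

def throughput(images, seconds):
    """Return megapixels processed per minute."""
    return images * MEGAPIXELS_PER_IMAGE / (seconds / 60.0)

single = throughput(1, 30.5)                   # one image on one CPU
ideal_dual = 2 * single                        # perfect scaling on two CPUs
threaded = throughput(10, 3 * 60 + 36)         # NI's own multiprocessor mode
two_instances = throughput(45, 14 * 60 + 26)   # two instances, one per CPU

for label, value in [("NI multithreaded", threaded),
                     ("two instances", two_instances)]:
    print(f"{label}: {value:.2f} MP/min, {value / ideal_dual:.0%} of ideal")

gain = two_instances / threaded - 1
print(f"gain from running two instances: {gain:.1%}")
```

Small differences in the last digit compared with the post come from rounding.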
NeatImage Pro Plus 5.0 + dual Opteron 244 + Windows XP SP2 + FreeBSD 5.2
NITeam
Posts: 3173
Joined: Sat Feb 01, 2003 4:43 pm

Post by NITeam »

This is an interesting result. We will have to think about why NI's own multitasking is slower than that provided by Windows. Probably something is being queued instead of being done in parallel. We will have to check what it could be.

BTW, did you save the output images or only filtered them?

Vlad
taob
Posts: 119
Joined: Sat Feb 08, 2003 2:12 pm
Location: Toronto, ON

Post by taob »

That's what I'm thinking too. Since NI does not try to divide one image amongst multiple CPUs, but rather assigns whole images to each CPU, there should not be any interdependency at the application level. Each image processing thread should be able to work completely independently of the others until it is ready to fetch the next image from the queue (and the time spent doing that should be insignificant). So barring resource contention issues in the underlying OS and hardware, NI should be able to achieve very close to linear scaling with multiple CPUs.
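
The whole-image-per-worker model described above can be sketched as follows. This is a hypothetical illustration, not NI's actual code: `filter_image`, `worker`, and `run_batch` are made-up names, and the filter is a placeholder. Python threads also do not run CPU-bound work in parallel, so the sketch shows only the structure (workers share nothing except the job queue), not the performance.

```python
import queue
import threading

def filter_image(name):
    # Placeholder for the real per-image noise filtering (hypothetical).
    return f"{name}: filtered"

def worker(jobs, results):
    # Each worker handles whole images and touches shared state only
    # when it fetches the next job or stores a finished result.
    while True:
        try:
            name = jobs.get_nowait()
        except queue.Empty:
            return  # queue drained, worker exits
        results.put(filter_image(name))

def run_batch(names, workers=2):
    jobs, results = queue.Queue(), queue.Queue()
    for name in names:
        jobs.put(name)
    threads = [threading.Thread(target=worker, args=(jobs, results))
               for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # One result per input image; sort because completion order varies.
    return sorted(results.get() for _ in range(len(names)))

print(run_batch(["img01.jpg", "img02.jpg", "img03.jpg"]))
```

With this structure, anything short of near-linear scaling has to come from the fetch/store steps or from contention below the application, which is the point made in the post.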

Mind you, I'm not complaining or anything... just trying to find ways to make NeatImage even better. ;)
NITeam
Posts: 3173
Joined: Sat Feb 01, 2003 4:43 pm

Post by NITeam »

Thank you, I completely understand your intentions. We, in our team, certainly have to think about what could be the limiting factor in parallel processing. One possibility is the input/output routines; another is the Filtration Queue logic itself. We will look into these possibilities.

Vlad
taob
Posts: 119
Joined: Sat Feb 08, 2003 2:12 pm
Location: Toronto, ON

Post by taob »

Here's a target you can shoot for. ;) I don't know if you are familiar with ImageMagick: it is a suite of command-line image processing tools for UNIX that also runs under the Cygwin environment on Windows. I tried a similar test with ImageMagick.

I took the same test image each time and performed a 5-pixel 100% unsharp mask over it. This means ImageMagick must read in the JPEG, decode it to an in-memory buffer, operate on all the pixel data, write out the sharpened image to another in-memory buffer, compress with JPEG, and save the results out to a file. It takes about 16 seconds per image.

If I run it against 10 images one at a time, it takes 2m40s to finish the batch. If I run two instances of IM against 20 images total, it takes 2m42s to finish (both jobs finished within 0.15s of each other). That's 98.8% scaling efficiency. Granted, it does not have to deal with most of the Windows XP UI overhead, but this at least proves that the underlying process scheduler and Opteron CPU/memory architecture can achieve very close to linear processor scaling.
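
A test like this can be scripted. The following is a minimal sketch, assuming each parallel instance is given the same amount of work as the single-instance run; `time_parallel` and `scaling_efficiency` are illustrative names, and the command to run is left to the caller rather than guessed here.

```python
import subprocess
import time

def time_parallel(command, instances):
    """Start `instances` copies of `command` together and return the
    wall-clock seconds until the last one exits."""
    start = time.perf_counter()
    procs = [subprocess.Popen(command) for _ in range(instances)]
    for p in procs:
        p.wait()
    return time.perf_counter() - start

def scaling_efficiency(t_single, t_parallel):
    """Efficiency when every parallel instance does the same work as
    the single run: 1.0 means the extra instances cost no extra time."""
    return t_single / t_parallel

# Timings from the post: 2m40s for one instance, 2m42s for two.
print(f"{scaling_efficiency(160, 162):.1%}")
```

Calling `time_parallel(cmd, 1)` and `time_parallel(cmd, 2)` with the same batch command gives the two inputs for `scaling_efficiency`.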

It would be great if NI could achieve that level of efficiency... then I'd have to go buy a quad CPU machine. :lol:
NITeam
Posts: 3173
Joined: Sat Feb 01, 2003 4:43 pm

Post by NITeam »

The 5-pixel 100% unsharp mask may not saturate the memory bandwidth, so the two CPUs are not competing for memory access. It is possible that the limiting factor in the case of NI (which is very memory-intensive) is memory bandwidth. That could explain the less than 80% efficiency you observed when running two instances of NI (obviously, on perfect hardware the efficiency should be 100%, since two instances of NI do everything independently). I highly doubt the difference is caused by GUI overhead. It must be the hardware (most likely) or the way the OS handles NI's requests.

On the other hand, the difference in performance between two instances of NI and one instance using two CPUs is most likely caused by NI itself, and we will look into what the reason could be.

Vlad
NITeam
Posts: 3173
Joined: Sat Feb 01, 2003 4:43 pm

Post by NITeam »

Regarding memory bandwidth, it is easy to test this idea. WinRAR is a very memory-intensive archiver and it could be used for the same test as you did with ImageMagick.

Two instances of the command-line version of WinRAR could be invoked to pack a large file in parallel, just like you processed images with ImageMagick.

I am sure the efficiency of parallel processing in the case of WinRAR will also be significantly lower than 100% because of the memory bandwidth limitation, and this could serve as a benchmark of the hardware.

Vlad
taob
Posts: 119
Joined: Sat Feb 08, 2003 2:12 pm
Location: Toronto, ON

Post by taob »

I tried WinRAR from the command line and timed five runs each of a single instance vs two parallel instances. Both were compressing their own copies of a 322MB folder of Visio stencils (46.24% compression):

1 instance: 2m21s
2 instances: 2m35s, or about 90% efficiency

So not too bad even in that case. I don't know how this compares on an Intel system, but the Opteron's speedy memory bus probably gives it an advantage here.
andewid
Posts: 62
Joined: Sat May 01, 2004 8:21 pm

Post by andewid »

Is not the whole idea with the Opteron to have independent memory busses so they do not have to share memory bandwidth (as opposed to many other architectures)?
taob
Posts: 119
Joined: Sat Feb 08, 2003 2:12 pm
Location: Toronto, ON

Post by taob »

Well, it's not the whole idea, but the ability to have independent, scalable memory buses is a major feature of the Opteron architecture. However, you need a motherboard that supports this (such boards are much more expensive), as well as an OS that is NUMA-aware (which 32-bit Windows XP is not, although the 64-bit version is).

Even though my two CPUs share a single 6.4GB/s pathway to main memory (vs. 12.8GB/s in a NUMA configuration), the latencies are much lower compared to an Intel system. The functions traditionally performed by the Northbridge chipset (memory controller, among other things) are integrated right on each CPU. This reduces memory latency and allows either an SMP or a NUMA memory configuration.