I am not sure what you mean when you say "Programming and optimizing for this new model is now much harder" (referring to SMP nodes in a cluster). I have been using SMP nodes in clusters for 10 years and never had any issues. Typically, I assign as many MPI tasks to a node as it has cores; this is really a function performed by the batch scheduler. I also started developing the hybrid approach (MPI/OpenMP) many years ago for those instances where combining MPI and OpenMP provides a performance advantage over pure MPI (using the same number of cores). I have seen very few examples, however, where this is true, so most of the time I simply run one MPI task per core. From a functionality point of view, the message-passing model doesn't care where the processes are located. In a very few instances placement did have a performance impact, but arbitrary assignment of MPI tasks to arbitrary nodes is a feature of some batch schedulers. For the vast majority of MPI programs I have seen, the number of cores on a node is just not an issue.
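To make the two layouts mentioned above concrete, here is a minimal sketch of the bookkeeping involved. It only computes how many MPI tasks and OpenMP threads would land on each node; it does not launch anything. The names `Layout` and `plan_layout` and the 16-node, 8-core cluster are hypothetical, chosen for illustration, and a real batch scheduler has its own options for expressing the same thing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    ranks_per_node: int    # MPI tasks placed on each node
    threads_per_rank: int  # OpenMP threads inside each MPI task
    total_ranks: int       # MPI tasks across the whole job


def plan_layout(nodes: int, cores_per_node: int, threads_per_rank: int = 1) -> Layout:
    """Use every core exactly once, as an MPI task or as an OpenMP thread.

    threads_per_rank=1 is the pure-MPI case (one task per core);
    larger values give a hybrid MPI/OpenMP layout on the same core count.
    """
    if nodes < 1 or cores_per_node < 1 or threads_per_rank < 1:
        raise ValueError("all arguments must be positive")
    if cores_per_node % threads_per_rank != 0:
        raise ValueError("threads_per_rank must divide cores_per_node evenly")
    ranks_per_node = cores_per_node // threads_per_rank
    return Layout(ranks_per_node, threads_per_rank, nodes * ranks_per_node)


if __name__ == "__main__":
    # Hypothetical cluster: 16 nodes with 8 cores each (an assumption).
    print(plan_layout(16, 8))      # pure MPI: one task per core
    print(plan_layout(16, 8, 4))   # hybrid: 2 tasks per node, 4 threads each
```

Both calls use the same 128 cores; only the split between message passing and threading changes, which is why the comparison between pure MPI and hybrid is made at an equal core count.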
Hi Doug,
Another interesting comparison would have been a dual-socket Intel Woodcrest system vs. a dual-socket AMD Socket F system. This way you would have been comparing dual-core Intel to dual-core AMD in 4-way systems. In your tests, the Intel system has a memory bandwidth handicap because it has four cores sharing an FSB, while the AMD system has only two cores sharing a memory controller. The better scaling of the AMD system shows this. I would love to see you repeat your test using AMD Barcelona processors once their clock speed is high enough to compare with Clovertown. Thanks for an interesting article.
Jim