Moore’s Law seems to be approaching its limits, at least as far as Intel and AMD are concerned. The race to make a single core ever faster, the “vertical scaling”, is coming to an end. In its place, adding more cores is the new trend, and with it come new challenges.
The Windows Challenge
According to Dave Probert, kernel architect at Microsoft, a future version of Windows should be designed more like a hypervisor than like the current kernel/user-mode model. Applications can be treated like virtual machines (sort of), each running in its own thread on a separate core, effectively removing the distinction between kernel and user mode.
The programs themselves, or runtimes as Probert called them, would take on many of the duties of resource management. The OS could assign an application a CPU and some memory, and the program itself, using metadata generated by the compiler, would best know how to use these resources.
Probert admitted that this approach would be very hard to test out, as it would require a large pool of existing applications. But the work could prove worthwhile.
Dynamic Memory Management
A paper from North Carolina State University discusses running a program’s memory-management functions in a separate thread, apart from the program’s own computation.
Every computer program consists of multiple steps. The program will perform a computation, then perform a memory-management function, which prepares memory storage to contain data or frees up memory storage that is currently in use. It repeats these steps over and over again, in a cycle. And, for difficult-to-parallelize programs, both of these steps have traditionally been performed on a single core.
“We’ve removed the memory-management step from the process, running it as a separate thread,” says Dr. Yan Solihin, an associate professor of electrical and computer engineering at NC State, director of this research project, and co-author of a paper describing the research. Under this approach, the computation thread and memory-management thread are executing simultaneously, allowing the computer program to operate more efficiently.
“By running the memory-management functions on a separate thread, these hard-to-parallelize programs can operate approximately 20 percent faster,” Solihin says. “This also opens the door to development of new memory-management functions that could identify anomalies in program behavior, or perform additional security checks. Previously, these functions would have been unduly time-consuming, slowing down the speed of the overall program.”
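The split between a computation thread and a memory-management thread can be sketched roughly as follows. This is only an illustration of the structure, not the technique from the NC State paper: the function name `run_with_cleanup_thread`, the queue-based hand-off, and the use of `list.clear()` as a stand-in for an expensive deallocation are all assumptions made for the example. In CPython the global interpreter lock also means this sketch will not show the reported speedup; it only shows how cleanup can be handed off instead of being done inline.

```python
import queue
import threading


def run_with_cleanup_thread(chunks):
    """Sum the squares of each chunk while a helper thread releases buffers."""
    to_release = queue.Queue()  # buffers waiting to be cleaned up
    released = []               # bookkeeping: one entry per released buffer

    def memory_manager():
        # Hypothetical memory-management thread: drains the queue until
        # it sees the None sentinel.
        while True:
            buf = to_release.get()
            if buf is None:
                break
            buf.clear()  # stand-in for an expensive deallocation
            released.append(True)

    helper = threading.Thread(target=memory_manager)
    helper.start()

    totals = []
    for chunk in chunks:
        buf = [v * v for v in chunk]  # "allocate" a working buffer
        totals.append(sum(buf))       # computation step
        to_release.put(buf)           # hand cleanup to the helper thread

    to_release.put(None)  # tell the helper there is no more work
    helper.join()
    return totals, len(released)


if __name__ == "__main__":
    print(run_with_cleanup_thread([[1, 2, 3], [4, 5]]))
```

The main loop never waits for a buffer to be released; it only pays the cost of enqueueing it, which is the point of moving memory management off the computation thread.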
The Drizzle Approach
Drizzle is building a database optimized for cloud and net applications. It is being designed for massive concurrency on modern multi-CPU, multi-core architectures. The code is originally derived from MySQL. So how does the Drizzle team tackle the multicore challenge?
Brian Aker and his team have set out to modernize the code, removing old abstractions that are no longer relevant (mysys), replacing custom code with modern C++ data structures and algorithms, and fixing up various inconsistencies. Along the way the team has been ruthlessly removing needless locks (mutexes) that greatly reduce concurrency and overall throughput.
As the team combs through the code, it occasionally stumbles upon a feature that adds complexity to the system (or reduces concurrency) and asks if it’s worth keeping. Often the answer is no. The result is that many features have been dropped entirely and others are being moved to plugins.
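To make the point about needless locks concrete, here is a small sketch contrasting a counter guarded by one shared mutex with a version in which each worker keeps a private total that is merged at the end. It is not taken from the Drizzle code base; the function names and the counting workload are assumptions chosen for illustration.

```python
import threading


def count_with_shared_lock(items_per_worker, workers):
    """Every increment takes the same lock, so workers serialize on it."""
    total = 0
    lock = threading.Lock()

    def work():
        nonlocal total
        for _ in range(items_per_worker):
            with lock:
                total += 1

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def count_with_local_totals(items_per_worker, workers):
    """Each worker counts privately and writes only its own slot: no lock."""
    partials = [0] * workers

    def work(slot):
        local = 0
        for _ in range(items_per_worker):
            local += 1
        partials[slot] = local  # no other thread touches this slot

    threads = [threading.Thread(target=work, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partials)  # merge after all workers have finished


if __name__ == "__main__":
    print(count_with_shared_lock(1000, 4), count_with_local_totals(1000, 4))
```

Both functions return the same answer; the second simply avoids the point of contention, which is the kind of change that lets throughput grow with the number of cores.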
The Intel Workaround
When Intel ran into a performance wall with CPUs about a decade ago, it was the end of the single-core era of chips. As the thinking went, if chipmakers couldn’t get processors to 4GHz, 5GHz and beyond, then they would get performance by dividing up the work among multiple CPU cores.
That’s great for apps that lend themselves to multithreaded execution. But there are still some applications, such as massive calculations or data searches and sorts, where Step B requires the results of Step A, and in that case, a single 5GHz core will get you there faster than two 2.5GHz cores.
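The arithmetic behind that comparison can be checked with a few lines of code. The sketch below assumes a deliberately simple model that is not in the original text: work is measured in cycles, every core retires one cycle of work per clock tick, the serial part runs on one core, and the parallel part splits evenly with no overhead. The function name `run_time` is hypothetical.

```python
def run_time(work_cycles, serial_fraction, cores, ghz):
    """Seconds to finish under the simplified model described above."""
    cycles_per_second = ghz * 1e9
    serial = work_cycles * serial_fraction            # must run on one core
    parallel = work_cycles * (1.0 - serial_fraction)  # can be split across cores
    return (serial + parallel / cores) / cycles_per_second


if __name__ == "__main__":
    work = 10e9  # ten billion cycles of work (an arbitrary figure)
    # Fully dependent steps: the second core cannot help.
    print(run_time(work, 1.0, cores=1, ghz=5.0))  # one 5GHz core
    print(run_time(work, 1.0, cores=2, ghz=2.5))  # two 2.5GHz cores
    # Fully parallel work: the two configurations tie.
    print(run_time(work, 0.0, cores=2, ghz=2.5))
```

Under these assumptions the fully serial job takes 2 seconds on the single 5GHz core and 4 seconds on the pair of 2.5GHz cores, and the two configurations only draw level when the work is perfectly parallel.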
So Intel is creating a workaround called Anaphase that it is starting to show off to the public. The chipmaker first disclosed the plan, created at Intel Labs Barcelona, last year at the International Symposium on Computer Architecture (an academic paper is available in PDF format) and earlier this month showed off Anaphase at a research demo day at the Barcelona facility.
This hardware/software hybrid leverages multiple cores to improve single-threaded performance, relying on different speculative techniques to automatically partition single-threaded applications so that they can be processed on multiple cores.
It’s the equivalent of taking one big task and parceling it out. This is normally considered extremely difficult because the speculative part of the process requires something akin to making an educated guess — and thus, introducing the potential for failure. (Courtesy: HardwareCentral, 5/23/10)
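The "educated guess" idea can be shown in miniature. The sketch below is not Anaphase, which works at a much lower level in hardware and software; it only illustrates the general pattern of speculating on a dependency, validating the guess, and redoing the work when the guess was wrong. The functions `step_a`, `step_b`, and `speculative_run`, and the idea of passing in a predicted value, are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def step_a(x):
    """Producer step whose result step_b depends on."""
    return x * x


def step_b(a_result, y):
    """Consumer step that needs the output of step_a."""
    return a_result + y


def speculative_run(x, y, predicted_a):
    """Run step_b early on a guessed input, then validate the guess."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_a = pool.submit(step_a, x)
        future_b = pool.submit(step_b, predicted_a, y)  # speculative work
        actual_a = future_a.result()
        speculative_b = future_b.result()

    if actual_a == predicted_a:
        return speculative_b, True     # guess was right: keep the early result
    return step_b(actual_a, y), False  # guess was wrong: discard and redo


if __name__ == "__main__":
    print(speculative_run(3, 1, predicted_a=9))  # correct prediction
    print(speculative_run(3, 1, predicted_a=8))  # misprediction, recomputed
```

The result is correct either way; a wrong guess only costs the wasted speculative work plus the recomputation, which is why the quality of the prediction decides whether the approach pays off.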
Apple Grand Central Dispatch
In the past, the best way for computer chip makers to improve performance was to turn up the clock speed on the processor. But that generates more heat and consumes more power, which is bad for computers, especially notebooks. So instead the industry has moved to chips with multiple processor cores, which can provide more performance while consuming less power. Today every Mac runs on one or more multicore Intel processors.
To take full advantage of these processors, software applications must be programmed using a technology called threads. Software developers use threads to allow multicore processors to work on different parts of a program at the same time. However, each application must do its own threading, which reduces the efficiency of the entire system. And because threads can be difficult to program, many developers don’t invest the effort to make their applications multicore capable. Consequently, lots of applications aren’t as fast as they could be.
Grand Central Dispatch (GCD) in Mac OS X Snow Leopard (and later) addresses this pressing need. It’s a set of first-of-their-kind technologies that makes it much easier for developers to squeeze every last drop of power from multicore systems. With GCD, threads are handled by the operating system, not by individual applications. GCD-enabled programs can automatically distribute their work across all available cores, resulting in the best possible performance whether they’re running on a dual-core Mac mini, an 8-core Mac Pro, or anything in between. Once developers start using GCD for their applications, you’ll start noticing significant improvements in performance.
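The programming model described here, where the application submits units of work and a shared facility decides how many threads to run, can be approximated in other environments. The sketch below uses Python's standard `concurrent.futures` thread pool as a loose analogy; it is not GCD's API (GCD is used from C and Objective-C through dispatch queues), and the `word_count` task and `count_all` helper are hypothetical. Sizing the pool from `os.cpu_count()` is also an assumption of the example, standing in for the operating system making that decision.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def word_count(text):
    """One independent unit of work."""
    return len(text.split())


def count_all(documents):
    """Submit every document as a task and let the pool schedule them."""
    workers = os.cpu_count() or 1  # adapt to whatever machine this runs on
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order even if tasks finish out of order.
        return list(pool.map(word_count, documents))


if __name__ == "__main__":
    print(count_all(["a b c", "", "hello world"]))
```

The calling code never creates or joins a thread itself; it only describes the tasks, which is the shift in responsibility the section attributes to GCD.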