Introducing the Primate Labs Shop

Every new employee receives one of our t-shirts, and now we're sharing them with you. Primate Labs is proud to announce the new Primate Labs Shop.

Employees wearing t-shirts

Each shirt is printed on American Apparel’s 50/50 crewneck in Lapis. Shirts are available in a variety of sizes for men and women.

Geekbench Sticker

Alongside our t-shirts, we also have Geekbench 4 and Primate Labs stickers, sold in packs of three. Each sticker is three inches across and fits easily on most surfaces.

Primate Labs sticker

Right now we're only shipping to the United States and Canada. Depending on the response we'll consider more products and more shipping options (including shipping to other regions).

If you have any questions regarding sizing, shipping, products, or your order, please reach out to us by email at

Geekbench 3.4.1

Geekbench 3.4.1 is now available for download and contains an important security fix for OS X. It is a recommended update for all OS X users. Geekbench 3.4.1 features the following changes:

  • Changed to secure connections to download update information and release notes.
  • Updated comparison chart design to improve readability.
  • Updated Android, iOS comparison devices.

Geekbench 3.4.1 is a free update for all Geekbench 3 users.

Geekbench 2.4.4

Geekbench 2.4.4 is now available for download and contains an important security fix for OS X. It is recommended for all OS X users. Geekbench 2.4.4 features the following changes:

  • Changed to secure connections to download update information and release notes.

Geekbench 2.4.4 is a free update for all Geekbench 2 users.

Geekbench 3.4

Geekbench 3.4 is now available for download and features the following changes:

  • Added support for Intel SHA-NI instructions for the SHA-1 workload.
  • Added support to detect Low Power Mode on iOS 9.
  • Fixed L4 cache reporting on systems without an L4 cache.
  • Fixed errors that could occur when uploading results from Intel NUC systems.
  • Fixed interface issues on iOS 9.

Geekbench 3.4 is a free update for all Geekbench 3 users.

MacBook Air, Pro Benchmarks (March 2015)

Geekbench 3 results for the new MacBook Air and MacBook Pro models have arrived on the Geekbench Browser. I've generated some charts that compare the new Broadwell-powered laptops with their Haswell- and Ivy Bridge-powered predecessors.

Keep in mind that Broadwell is a "Tick" in Intel's "Tick Tock" model. Generally speaking "Tick" processors improve efficiency while "Tock" processors improve performance. As a result I do not expect the MacBook Air and MacBook Pro scores to increase significantly.

MacBook Air

Single-Core Performance

Multi-Core Performance

Single-core performance has increased 6% from Haswell to Broadwell, and multi-core performance for the i5 model has increased 7%. However, quite surprisingly, multi-core performance for the i7 model has increased an impressive 14%.

If you're thinking of buying the new MacBook Air I would strongly recommend the i7 processor. It has 20% faster single-core performance and 25% faster multi-core performance for only a 15% increase in price.
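The i7 recommendation comes down to simple arithmetic. A quick sketch of the performance-per-dollar comparison, using the percentages quoted above:

```swift
// MacBook Air i7 vs. i5: relative gains vs. relative cost, using the
// percentages quoted above. If a gain ratio exceeds the price ratio,
// performance per dollar improves.
let singleCoreGain = 1.20   // i7 is 20% faster single-core
let multiCoreGain  = 1.25   // i7 is 25% faster multi-core
let priceIncrease  = 1.15   // i7 costs 15% more

print(singleCoreGain / priceIncrease)  // ~1.04: ~4% more single-core performance per dollar
print(multiCoreGain / priceIncrease)   // ~1.09: ~9% more multi-core performance per dollar
```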

MacBook Pro

Single-Core Performance

Multi-Core Performance

Single-core performance has increased between 3% and 7% from Haswell to Broadwell, depending on the model. Multi-core performance has increased 3% to 6%. These increases are in line with what I would expect from a "Tick" processor.

I have no recommendations regarding the processor for the new MacBook Pro. The performance differences and the price differences between the processors are roughly equivalent.

Swift Performance in Xcode 6.3 Beta

Back in December we ported a few of our Geekbench workloads to Swift and compared their performance to the C++ implementations. With last week's announcement of a beta release of Xcode 6.3 we thought it would be a good time to revisit those results. In this post we find out whether the performance improvements in Xcode 6.3 Beta provide any speedup for our Swift workloads.

The following table shows the performance of the Swift workloads compiled with Xcode versions 6.1.1 and 6.3 Beta. We use the same optimizer settings as we did in December and use the same machine to run the tests. As before the averages are taken over eight executions of the workloads.

Workload    Version           Minimum      Maximum      Average
Mandelbrot  Swift (6.3 Beta)  2.07 GFlops  2.49 GFlops  2.32 GFlops
Mandelbrot  Swift (6.1.1)     2.15 GFlops  2.43 GFlops  2.26 GFlops
Mandelbrot  C++ (6.1.1)       2.25 GFlops  2.38 GFlops  2.33 GFlops
GEMM        Swift (6.3 Beta)  2.14 GFlops  2.18 GFlops  2.16 GFlops
GEMM        Swift (6.1.1)     1.48 GFlops  1.59 GFlops  1.53 GFlops
GEMM        C++ (6.1.1)       8.61 GFlops  9.92 GFlops  9.32 GFlops
FFT         Swift (6.3 Beta)  0.25 GFlops  0.27 GFlops  0.26 GFlops
FFT         Swift (6.1.1)     0.10 GFlops  0.10 GFlops  0.10 GFlops
FFT         C++ (6.1.1)       2.29 GFlops  2.60 GFlops  2.42 GFlops

The improvements in the Xcode 6.3 Beta have provided a 1.4x speedup for GEMM and a 2.6x speedup for FFT over Xcode 6.1.1. Performance for the C++ workloads did not change, so we omit those numbers for the 6.3 Beta.
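Those speedup figures are just ratios of the average rates in the table. A quick sketch to check them:

```swift
// Speedup = (Xcode 6.3 Beta average) / (Xcode 6.1.1 average), using the
// GFlops averages from the table above.
let gemmSpeedup = 2.16 / 1.53   // GEMM: ~1.4x
let fftSpeedup = 0.26 / 0.10    // FFT: 2.6x
print(gemmSpeedup, fftSpeedup)
```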

Our Swift FFT implementation got an additional speedup last week thanks to some performance patches from Joseph Lord (the code for the Swift workloads is available on GitHub). His optimizations include:

  • Eliminating virtual function dispatches by making the Workload classes final.
  • Allowing the compiler to do more inlining by moving the Complex definition into the same file as the FFT code.
  • Working around slow accesses to an array of structs by changing the output array in FFT from a Swift array to an UnsafeMutablePointer<Complex>.
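The last change can be sketched like this (a simplified reconstruction in current Swift syntax; the Complex struct and sizes here are illustrative, not the actual workload code):

```swift
// A minimal value type standing in for the workload's Complex struct.
struct Complex {
    var real: Double
    var imaginary: Double
}

let count = 4

// Before: output stored in a plain Swift array of structs.
var output = [Complex](repeating: Complex(real: 0, imaginary: 0), count: count)
output[0].real = 1.0

// After: raw storage behind an UnsafeMutablePointer<Complex>, so each
// element access is a direct memory access with no array bookkeeping.
let buffer = UnsafeMutablePointer<Complex>.allocate(capacity: count)
buffer.initialize(repeating: Complex(real: 0, imaginary: 0), count: count)
buffer[0].real = 1.0

// The pointer owns its memory, so clean up when done.
buffer.deinitialize(count: count)
buffer.deallocate()
```

The trade-off is manual memory management: unlike a Swift array, the pointer performs no bounds checking and must be deallocated explicitly.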

These changes provide a significant speedup for FFT of about 8.5x over our previous implementation:

Workload    Version                                 Minimum      Maximum      Average
Mandelbrot  Swift with Joseph's patches (6.3 Beta)  2.32 GFlops  2.45 GFlops  2.40 GFlops
Mandelbrot  Swift (6.3 Beta)                        2.07 GFlops  2.49 GFlops  2.32 GFlops
Mandelbrot  C++ (6.1.1)                             2.25 GFlops  2.38 GFlops  2.33 GFlops
GEMM        Swift with Joseph's patches (6.3 Beta)  2.01 GFlops  2.19 GFlops  2.13 GFlops
GEMM        Swift (6.3 Beta)                        2.14 GFlops  2.18 GFlops  2.16 GFlops
GEMM        C++ (6.1.1)                             8.61 GFlops  9.92 GFlops  9.32 GFlops
FFT         Swift with Joseph's patches (6.3 Beta)  1.85 GFlops  2.31 GFlops  2.20 GFlops
FFT         Swift (6.3 Beta)                        0.25 GFlops  0.27 GFlops  0.26 GFlops
FFT         C++ (6.1.1)                             2.29 GFlops  2.60 GFlops  2.42 GFlops

After the improvements in Xcode 6.3 and some careful optimizations, the performance of the FFT workload is now within 10% of the C++ implementation. The optimizations might look strange to someone who hasn't read up on Swift internals, but they are easy to apply and can be used by any Swift programmer. If you try these optimizations in your own code, benchmark the changes carefully. They might not provide any speedup at all for your algorithm. They might even slow it down. Also keep in mind that Xcode 6.3 is still in beta, and performance could change before the final release.

Geekbench 3.3

Geekbench 3.3, the latest version of our popular cross-platform benchmark, is now available for download and includes the following changes:

  • Added a battery test for Android, iOS.
  • Added a brief summary to "Share Results" email on iOS.
  • Addressed 64-bit code generation issues on Android/AArch64.
  • Fixed a crash that occurred on Windows 10.
  • Fixed a crash that could occur on 32-core systems.
  • Reduced the memory footprint of the BlackScholes workload.

The biggest new feature in Geekbench 3.3 is the battery test. The new battery test is designed to measure the battery life of a device when running processor-intensive applications (such as games).

The test is meant to fully discharge a fully charged battery. While it's possible to run the test with a partially discharged battery (e.g., a battery at 75% charge), the test results will not be as accurate.

The recommended steps for running the test are as follows:

  • Plug in your device.
  • Launch Geekbench 3.
  • Launch the battery test.
  • Wait for your device to completely charge.
  • Unplug your device. The battery test will start automatically. The test can take several hours to complete, especially on newer devices with larger batteries.
  • Wait for your device to completely discharge and turn off.
  • Plug in your device and wait for it to turn on.
  • Launch Geekbench 3. The battery test result will display automatically.

The test result includes the battery test runtime, the battery test score, and the battery level at the beginning and at the end of the test.

Here's what the different numbers mean:

  • Battery Runtime is the battery test runtime. If the test started with the battery completely charged and ended with the battery completely discharged then the test runtime is also the battery lifetime.

  • Battery Score is a combination of the runtime and the work completed during the battery test. If two phones have the same runtime but different scores, then the phone with the higher score completed more work. As with Geekbench scores, higher battery scores are better.

  • Battery Level is the battery level at the start and the end of the test.

We hope you find the new battery test useful. Please let us know if you have any questions, comments, or suggestions regarding the test (or the release).

Swift, C++ Performance

With all the excitement around Apple's new Swift programming language we were curious whether Swift is suitable for compute-intensive code, or whether it's still necessary to "drop down" into a lower-level language like C or C++.

To find out we ported three Geekbench 3 workloads from C++ to Swift: Mandelbrot, FFT, and GEMM. These three workloads offer different performance characteristics:

  • Mandelbrot is compute bound.
  • GEMM is memory bound and sequentially accesses large arrays in small blocks.
  • FFT is memory bound and irregularly accesses large arrays.
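To make "compute bound" concrete, the heart of a Mandelbrot benchmark is an escape-time loop that is almost pure floating point arithmetic on two scalars (a simplified sketch, not the actual Geekbench workload):

```swift
// Escape-time iteration for one point c = cr + ci·i: iterate z = z² + c
// until |z| exceeds 2 or the iteration budget runs out. There is no
// large-array traffic, so performance is bounded by the FPU, not memory.
func escapeCount(cr: Double, ci: Double, maxIterations: Int) -> Int {
    var zr = 0.0, zi = 0.0
    var iterations = 0
    while iterations < maxIterations && zr * zr + zi * zi <= 4.0 {
        let t = zr * zr - zi * zi + cr
        zi = 2.0 * zr * zi + ci
        zr = t
        iterations += 1
    }
    return iterations
}

// c = 0 is inside the set and burns the whole budget; c = 2 + 2i escapes
// after a single iteration.
print(escapeCount(cr: 0, ci: 0, maxIterations: 256))   // 256
print(escapeCount(cr: 2, ci: 2, maxIterations: 256))   // 1
```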

The source code for the Swift implementations is available on GitHub.

We built both the C++ and Swift workloads with Xcode 6.1. For the Swift workloads we used the -Ofast -Ounchecked optimization flags, enabled SSE4 vector extensions, and enabled loop unrolling. For the C++ workloads we used the -msse2 -O3 -ffast-math -fvectorize optimization flags. We ran each workload eight times and recorded the minimum, maximum, and average compute rates. All tests were performed on an "Early 2011" MacBook Pro with an Intel Core i7-2720QM processor.

Workload    Version  Minimum      Maximum      Average
Mandelbrot  Swift    2.15 GFlops  2.43 GFlops  2.26 GFlops
Mandelbrot  C++      2.25 GFlops  2.38 GFlops  2.33 GFlops
GEMM        Swift    1.48 GFlops  1.59 GFlops  1.53 GFlops
GEMM        C++      8.61 GFlops  9.92 GFlops  9.32 GFlops
FFT         Swift    0.10 GFlops  0.10 GFlops  0.10 GFlops
FFT         C++      2.29 GFlops  2.60 GFlops  2.42 GFlops

The Swift implementation of Mandelbrot performs very well, effectively matching the performance of the C++ implementation. I was surprised by this result. I did not expect a language as new as Swift to match the performance of C++ for any of our workloads. The results for GEMM and FFT are not as encouraging. The C++ GEMM implementation is over 6x faster than the Swift implementation, while the C++ FFT implementation is over 24x faster. Let's examine these two workloads more closely.


Running GEMM in Instruments (using the Time Profiler template) shows the inner loop dominating the profile samples with 25% attributed to our Matrix.subscript.getter:

Instruments stack trace for GEMM

Suspecting that the getter was performing poorly, I tried caching the raw arrays and accessing them directly without the subscript getter. This boosts performance slightly, giving an average of about 1.55 GFlops. All that remains in the inner loop are the integer operations that compute the indexes, two array reads, one floating point multiply, and one floating point add:

for var k0 = k; k0 < kb; ++k0 {
  let a = AM[i0 * N + k0]
  let b = BM[j0 * N + k0]
  scratch += a * b
}

In our C++ GEMM implementations we get a big performance boost from loop vectorization, so I wondered whether the Swift array implementation might be somehow preventing the LLVM optimizer from vectorizing the loop. Disabling vectorization in the C++ workload (via -fno-vectorize) reduced the average compute rate to just 2.05 GFlops, so loop vectorization is a likely culprit.
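The numbers bracket the effect nicely: disabling vectorization costs the C++ implementation roughly a factor of 4.5, which accounts for most of the gap to Swift (a quick ratio check using the rates measured above):

```swift
let cppVectorized = 9.32  // GFlops, C++ GEMM with loop vectorization
let cppScalar = 2.05      // GFlops, C++ GEMM built with -fno-vectorize
let swiftGEMM = 1.53      // GFlops, Swift GEMM with Xcode 6.1.1

let vectorizationGain = cppVectorized / cppScalar  // ~4.5x from vectorization alone
let residualGap = cppScalar / swiftGEMM            // ~1.3x gap left to explain
print(vectorizationGain, residualGap)
```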


Running FFT in Instruments (again using the Time Profiler template but with the "flatten recursion" option enabled) shows that we spend a lot of time on reference counting operations:

Instruments stack trace for FFT

This is surprising because the only reference type in our FFT workload is the FFTWorkload class: arrays are structs, and structs are value types in Swift. The FFT workload code references the FFTWorkload instance through the self member and through calls to instance methods. We begin our investigation here.

To isolate the effects of self references and instance method calls I wrote a recursive function to compute Fibonacci numbers (this is a tremendously inefficient approach to computing Fibonacci numbers, but it is useful for this investigation). I use a self access to count the number of nodes in the recursion by incrementing the nodes member in the recursive function:

func fibonacci(n : UInt) -> UInt {
  self.nodes += 1
  if n == 0 {
    return 0
  } else if n == 1 {
    return 1
  } else {
    return fibonacci(n - 1) + fibonacci(n - 2)
  }
}

The time profile for this implementation shows an effect similar to the one we observed in the FFT workload.

Instruments stack trace for Fibonacci

The source code view suggests that the self accesses are slow in this case:

Instruments source view for Fibonacci

Updating the recursion to remove references to self nearly doubles performance, but we still see the reference counting operations in the Instruments time profile. This leaves only the method calls.
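A plausible reconstruction of that self-free intermediate step (the post doesn't show this version, so the details here are illustrative): the node count moves out of the stored property and into the return value, so the method body never touches self directly, though the recursive method calls remain.

```swift
final class FibonacciWorkload {
    // Instance method that no longer reads or writes self: the node count
    // is threaded through a tuple return instead of a stored property.
    func fibonacci(n: UInt) -> (f: UInt, nodes: UInt) {
        if n == 0 {
            return (0, 1)
        } else if n == 1 {
            return (1, 1)
        } else {
            let left = fibonacci(n: n - 1)
            let right = fibonacci(n: n - 2)
            return (left.f + right.f, left.nodes + right.nodes + 1)
        }
    }
}

let result = FibonacciWorkload().fibonacci(n: 10)
print(result.f, result.nodes)  // 55 177
```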

Next we try making fibonacci a static method instead of an instance method. This is easy since we already removed the self reference: we only need to add the class keyword to the method declaration:

class func fibonacci(n : UInt) -> (f : UInt, nodes : UInt) {
  if n == 0 {
    return (0, 1)
  } else if n == 1 {
    return (1, 1)
  } else {
    let left = fibonacci(n - 1)
    let right = fibonacci(n - 2)
    return (left.f + right.f, left.nodes + right.nodes + 1)
  }
}

This results in a 12x speedup over the first Fibonacci implementation. The Instruments time profile shows that the reference counting operations are now gone:

Instruments stack trace for static Fibonacci

I don't mean to suggest that we should prefer static Swift methods whenever possible; use static methods when they make sense in your design. However, if you must implement a recursive algorithm in Swift and you find its performance unacceptably poor, then modifying your algorithm to use static methods is worth investigating.

To quickly test this strategy on the FFT workload I made all the instance variables global and changed the recursive methods to class methods. This gives about a 5x boost in performance, up to an average of 548.09 MFlops. This is still only about 20% of the C++ performance, but it is a significant improvement. In the time profiler we see that the samples are now more evenly distributed, with hotspots on memory accesses and floating point operations. This is closer to what we might expect for FFT:

Instruments source view for static FFT


Final Thoughts

What can we conclude from these results? The Mandelbrot results indicate Swift's strong potential for compute-intensive code, while the GEMM and FFT results show the care that must be exercised. GEMM suggests that the Swift compiler cannot vectorize code that the C++ compiler can vectorize, leaving some easy performance gains behind. FFT suggests that developers should reduce calls to instance methods, or should favor an iterative approach over a recursive approach.

Swift is still a young language with a new compiler so we can expect significant improvements to both the compiler and the optimizer in the future. If you're considering writing performance-critical code in Swift today it's certainly worth writing the code in Swift before dropping down to C++. It might just turn out to be fast enough.

Retina iMac 64-bit Performance

64-bit Geekbench 3 results for the Retina iMacs have appeared on the Geekbench Browser. Let's take a quick look at how they perform compared to the non-Retina iMacs.

Single-Core Performance

Multi-Core Performance

The Core i5 Retina iMac is slightly faster than the other Core i5 iMacs, and is competitive with the Core i7 iMacs in single-core performance. However, the Core i7 iMacs are up to 20% faster in multi-core performance.

The Core i7 Retina iMac is significantly faster than all of the other iMacs (including the Core i5 Retina iMac), with at least 15% higher single-core performance and 10% higher multi-core performance.

These Geekbench results aren't surprising since all of the iMacs use Haswell processors; any performance increase is due to the increase in clock speed.

How does the Retina iMac perform compared to the Mac Pro?

Single-Core Performance

Multi-Core Performance

The Core i5 Retina iMac is faster at single-core tasks but slower at multi-core tasks. The Core i7 Retina iMac is also faster at single-core tasks (25% faster than the fastest Mac Pro) and is also faster than the 4-core Mac Pro at multi-core tasks.

If you're considering replacing your Mac Pro with a Retina iMac then these results show it's not a bad idea provided you don't regularly run heavily-threaded applications.

Estimating Mac mini Performance

Apple announced a long-awaited update to the Mac mini lineup on Thursday. Along with 802.11ac Wi-Fi and PCI-based flash storage options, the new models feature Intel's Haswell processors. While Apple hasn't identified which Haswell processors the new lineup uses, based on the published Mac mini specifications I believe they are the following:

Processor      Cores  Frequency  Turbo Boost
Core i5-4260U  2      1.4 GHz    2.7 GHz
Core i5-4278U  2      2.6 GHz    3.1 GHz
Core i5-4308U  2      2.8 GHz    3.3 GHz
Core i7-4578U  2      3.0 GHz    3.5 GHz

For comparison, here are the Haswell processors from the "Late 2014" lineup alongside the Ivy Bridge processors from the equivalent model in the "Late 2012" lineup:

Model   Late 2014 (cores)  Late 2012 (cores)
Good    Core i5-4260U (2)  Core i5-3210M (2)
Better  Core i5-4278U (2)  Core i7-3615QM (4)
Best    Core i5-4308U (2)  Core i7-3615QM (4)
BTO     Core i7-4578U (2)  Core i7-3720QM (4)

From the table you can see Apple has moved from dual- and quad-core processors in the "Late 2012" lineup to dual-core processors across the entire "Late 2014" lineup. How much will this change affect multi-core performance? Will the new Mac minis be slower than the old ones?

Unfortunately, there are no Geekbench results for the new Mac minis in the Geekbench Browser to help us answer this question. Instead, I estimated the new Mac minis' scores using data from other systems with the same processors. I expect the estimated scores to be within 5% of the actual scores.
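The estimation method isn't spelled out here, but one plausible sketch (with made-up scores, not real data) is to pool Geekbench results from other systems with the same processor and take a robust middle value:

```swift
// Hypothetical estimator: median Geekbench score across other systems
// that use the same processor as the new Mac mini.
func estimatedScore(sameProcessorScores: [Int]) -> Int {
    let sorted = sameProcessorScores.sorted()
    let mid = sorted.count / 2
    // Median rather than mean: robust to throttled or overclocked outliers.
    return sorted.count % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2
}

// Made-up scores for systems sharing the same processor.
let scores = [2890, 2910, 2925, 2950, 3400]
print(estimatedScore(sameProcessorScores: scores))  // 2925
```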

Here are the estimated scores for the "Late 2014" Mac minis alongside the actual scores for the "Late 2012" Mac minis:

Single-Core Performance

Single-core performance has increased slightly, by 2% to 8%, between the "Late 2012" and "Late 2014" models. This increase is in line with what we saw when other Mac models moved from Ivy Bridge to Haswell processors.

Multi-Core Performance

Unlike single-core performance, multi-core performance has decreased significantly. The "Good" model (which has a dual-core processor in both lineups) is down 7%. The other models (which have a dual-core processor in the "Late 2014" lineup but a quad-core processor in the "Late 2012" lineup) are down 70% to 80%.

So why did Apple switch to dual-core processors in the "Late 2014" lineup? The only technical reason I can think of is that the Haswell dual-core processors use one socket (that is, the physical interface between the processor and the logic board) while the Haswell quad-core processors use different sockets:

Processor       Cores  Graphics  Socket
Core i7-4578U   2      Iris      FCBGA1168
Core i7-4770HQ  4      Iris Pro  FCBGA1364
Core i7-4700MQ  4      HD 4600   FCPGA946

Apple would have to design and build two separate logic boards to accommodate both dual-core and quad-core processors. Other Macs use the same logic board across models, so I wouldn't expect Apple to make an exception for the Mac mini. Note that this wasn't an issue with the Sandy Bridge and Ivy Bridge processors, where both dual- and quad-core processors used the same socket.

Apple could have gone quad-core across the "Late 2014" lineup, but I suspect they wouldn't have been able to include a quad-core processor (let alone one with Iris Pro graphics) and still hit the $499 price point.

All things considered, if you're looking for great multi-core performance in a Mac mini (say, if you're using it as a server), I have a hard time recommending the new models. I would suggest tracking down a "Late 2012" Mac mini rather than buying a new "Late 2014" Mac mini. Otherwise, the improved Wi-Fi, graphics, and single-core performance make the "Late 2014" Mac mini worth considering.