Monday, December 29, 2014

Another look at Cortex-A53 CPU core performance

Several smartphone chips using ARM's new Cortex-A53 and Cortex-A57 CPU cores with the 64-bit-capable ARMv8 instruction set have arrived on the market recently. SoCs based solely on Cortex-A53 cores are especially attractive from a performance/dollar standpoint. However, as I described in earlier articles, there are significant performance differences between different Cortex-A53 implementations, with some early revisions of the core being limited in performance, probably because of design bugs.

32-bit version of ARMv8 seems practical


Most of the Cortex-A57 and Cortex-A53-equipped SoCs currently seem to be running in what can be called "32-bit ARMv8 mode" (AArch32 in Geekbench, as opposed to ARMv7 for older 32-bit devices), taking advantage of some of the features of the ARMv8 instruction set (which is better suited to modern CPU chip architectures) while avoiding some of the disadvantages of the full 64-bit model (such as doubled storage space for pointers and addresses).

Whether the full 64-bit instruction model (AArch64) will soon be attractive for Android devices, including lower-end ones such as Cortex-A53-based devices with limited amounts of CPU cache and RAM, is unclear. NVIDIA already uses AArch64 in conjunction with their latest Tegra K1 SoC. Optimizations for AArch64 seem to have been a work in progress, and early benchmarks for systems running in AArch64 mode were quite poor in comparison to 32-bit mode benchmarks, but progress is being made. Theoretically, more registers are available in AArch64 mode, including for the NEON SIMD unit, which should help performance in some important cases and may mitigate the disadvantages of increased address storage size.
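To make the pointer-size cost mentioned above a bit more concrete, here is a minimal sketch that estimates the size of a small pointer-heavy data structure under 32-bit and 64-bit pointers. The node layout and the alignment rule (natural alignment, struct padded to the pointer size) are assumptions for illustration, not measurements from any Android runtime.

```python
def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def node_size(payload_bytes: int, n_pointers: int, pointer_bytes: int) -> int:
    """Size of a hypothetical struct: payload first, then n pointers.

    Assumes natural alignment: pointers are aligned to their own size and
    the struct is padded to a multiple of the pointer size.
    """
    offset = align_up(payload_bytes, pointer_bytes)
    offset += n_pointers * pointer_bytes
    return align_up(offset, pointer_bytes)


# Hypothetical doubly linked list node: a 4-byte integer plus prev/next pointers.
size_32 = node_size(4, 2, pointer_bytes=4)  # 32-bit pointers (AArch32)
size_64 = node_size(4, 2, pointer_bytes=8)  # 64-bit pointers (AArch64)
print(f"32-bit node: {size_32} bytes, 64-bit node: {size_64} bytes")
```

Under these assumptions the same node takes twice the space with 64-bit pointers, which means fewer nodes fit in a small L2 cache. Structures that hold mostly non-pointer data are affected far less.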

Snapdragon 410 has crippled first revision of Cortex-A53


Snapdragon 410 (MSM8916) is a SoC with quad-core Cortex-A53 that was one of the first chips with Cortex-A53 cores to come to market and has already been adopted in significant volume for low-to-mid-range designs, replacing the older Cortex-A7-based Snapdragon 400.

However, it is obvious that the very first public revision of the Cortex-A53 core as used inside Snapdragon 410, Cortex-A53 r0p0, is crippled in terms of performance, clearly scoring lower in CPU and memory-intensive benchmarks (even after making the significant correction for clock speed) than SoCs using later revisions of the Cortex-A53 core such as Snapdragon 615 and MediaTek's new chips. Coupled with Snapdragon 410's relatively low clock speed of 1.19 GHz, this results in significantly lower performance than the newer mid-range chips mentioned. Performance in complex benchmarks that simulate demanding, typical use such as complex browsing and gaming is even worse.

Advertising of Snapdragon 410 as having 64-bit support is very misleading


The lower performance seems to be partly associated with the fact that Snapdragon 410 (because of the r0p0 revision of Cortex-A53) is completely limited to ARMv7-compatibility mode and is unable to run in ARMv8 mode (32-bit or otherwise). I have yet to see evidence of a shipping Snapdragon 410 chip that is 64-bit or even ARMv8 capable. In effect, it functions as a chip with somewhat faster 32-bit ARMv7 Cortex-A7 cores and nothing more. In this sense, labeling the chip as being 64-bit or potentially having support for 64-bit ARMv8 in a future update is downright misleading or a blatant lie, depending on one's standpoint.

Memory performance seems most impacted


Based on Geekbench results, Snapdragon 410 has about 10% lower pure integer CPU performance per MHz when compared to chips such as Snapdragon 615 and MediaTek's MT6732/MT6752. For pure floating point performance, performance is about 5% lower. The biggest difference is in memory performance, where Snapdragon 410 is about 25% slower than Snapdragon 615 (with r0p1 Cortex-A53) and more than two times slower (even when correcting for clock or memory speed) than MT6732/MT6752 with Cortex-A53 r0p2. Another big difference is found in cryptography performance because of the extra ARMv8 instructions that apparently are not available to Snapdragon 410.
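The per-MHz comparisons above come down to simple normalization: divide each score by the clock speed, then compare the ratios. A minimal sketch follows. The scores and clocks in it are placeholder inputs chosen only to show the arithmetic, not Geekbench measurements, and the function names are made up for this example.

```python
def score_per_mhz(score: float, clock_mhz: float) -> float:
    """Normalize a benchmark score by CPU clock speed."""
    return score / clock_mhz


def relative_deficit(candidate: float, reference: float) -> float:
    """Fraction by which candidate falls short of reference (0.10 means 10% lower)."""
    return 1.0 - candidate / reference


# Placeholder inputs for illustration only (NOT measured results).
chip_a = score_per_mhz(score=540.0, clock_mhz=1200.0)
chip_b = score_per_mhz(score=850.0, clock_mhz=1700.0)

deficit = relative_deficit(chip_a, chip_b)
print(f"chip A is {deficit:.0%} lower per MHz than chip B")
```

Note that normalizing by clock speed assumes scores scale linearly with frequency, which is a reasonable approximation for CPU-bound subtests but much less so for memory-bound ones.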

A large part of the lower performance of the Cortex-A53 cores inside Snapdragon 410 may be due to chip design bugs as evident from errata issued by ARM for earlier revisions of the Cortex-A53 core. Some details about these errata, which mostly involve memory coherency issues related to CPU cache memory, can be found when compiling a Linux kernel.

Snapdragon 410 shows poor scores in real-world benchmarks


While Snapdragon 410 delivers somewhat better single-core scores than the Cortex-A7-based Snapdragon 400 at the same clock speed in pure CPU-specific benchmarks such as Geekbench, multi-core performance does not show much benefit (which is unexpected given the architectural advantage that the Cortex-A53-based Snapdragon 410 should have).

Even worse is the performance in practical benchmarks that measure performance for web browsing, gaming and other more complex, practical use cases. Based on benchmark results reported by GSMArena (1) (2), Basemark X, a gaming benchmark built on the Unity engine that simulates throughput for a more demanding typical usage pattern, reports a significantly lower score than recent Snapdragon 400-based models such as the Moto G (2014), with the GPU score being similar, pointing to significant flaws in (multi-core) CPU and memory performance.

In Rightware's Browsermark 2.1, a browser benchmark with use of advanced web standards such as HTML 5, WebGL and advanced JavaScript, performance is downright disappointing, with a score less than half that of Snapdragon 400-based devices. Other browser benchmarks show similar results. Scores in Rightware's overall-use Basemark OS II benchmark are also typically relatively disappointing, not surpassing those of Snapdragon 400-based devices.

Hardware bugs likely cause of crippled performance


These lower-than-expected results in more complex, typical-use benchmarks are consistent with hardware bugs in the Cortex-A53 implementation of the Snapdragon 410 being a bottleneck that significantly degrades multi-core performance in particular. Work-arounds for cache consistency and coherency issues especially have the potential to significantly degrade performance, for example by forcing the kernel to frequently flush CPU caches.

The Linux kernel source shows commits to handle errata for Cortex-A53 up to r0p2 relating to cache clean operations, with the work-around being to promote cache clean to cache clean and invalidate. This could mean that revision r0p3 of Cortex-A53 may see further improvements. These commits do not explain the performance difference between r0p0, r0p1 and r0p2, since the work-around is the same for all three revisions.
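To illustrate why promoting "cache clean" to "cache clean and invalidate" can cost performance, here is a toy model of a write-back cache. It is a sketch under simplifying assumptions (one value per address, no capacity limit, writes never miss) and is not the kernel's code. The two operations correspond roughly to the AArch64 cache maintenance instructions DC CVAC (clean) and DC CIVAC (clean and invalidate).

```python
class ToyCache:
    """Toy write-back cache in front of a dict acting as main memory."""

    def __init__(self, memory: dict):
        self.memory = memory
        self.lines = {}   # address -> (value, dirty flag)
        self.misses = 0   # number of reads that had to go to memory

    def write(self, addr, value):
        self.lines[addr] = (value, True)  # line is now dirty

    def read(self, addr):
        if addr not in self.lines:
            self.misses += 1  # line not cached: refetch from memory
            self.lines[addr] = (self.memory[addr], False)
        return self.lines[addr][0]

    def clean(self, addr):
        """Write a dirty line back to memory but keep it cached."""
        if addr in self.lines:
            value, dirty = self.lines[addr]
            if dirty:
                self.memory[addr] = value
            self.lines[addr] = (value, False)

    def clean_and_invalidate(self, addr):
        """Write a dirty line back, then drop it from the cache."""
        self.clean(addr)
        self.lines.pop(addr, None)


def misses_after(op_name: str) -> int:
    """Write, run the given maintenance op, read back; return read misses."""
    cache = ToyCache(memory={0x100: 0})
    cache.write(0x100, 42)
    getattr(cache, op_name)(0x100)
    assert cache.read(0x100) == 42  # data is correct either way
    return cache.misses


print("clean:", misses_after("clean"))
print("clean_and_invalidate:", misses_after("clean_and_invalidate"))
```

In this model both operations leave memory up to date, but after clean-and-invalidate the next read of the same address has to go back to memory. How much that matters on real hardware depends on how often the kernel issues cache cleans on data that is reused soon afterwards.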

Third revision of Cortex-A53 (r0p2) seems to improve memory performance


Some of the hardware or performance bugs that plagued especially the first version of the core (r0p0 as used in Snapdragon 410) have most likely been fixed in later revisions, contributing to a significant performance increase at the same clock speed.

SoCs with the third revision (r0p2) of the Cortex-A53 core seem to have much better memory performance as shown by Geekbench results, especially impressive given the bandwidth limitations of a 32-bit external memory interface. Most likely, this improvement is derived synergistically with ARM IP such as the Mali-T760 GPU as well as other IP blocks, which are implemented inside chips such as MT6732 and MT6752.

Since a SoC such as MT6732 is on the surface essentially comparable to Snapdragon 410 in the sense of having four Cortex-A53 CPU cores, there seem to be major performance improvements in the later revisions of the Cortex-A53 core and associated system architecture, especially with regard to memory performance. The difference is made more pronounced by the fact that the MT6732 is manufactured using TSMC's higher performance 28HPM process rather than 28LP and also clocked significantly higher.

Octa-core Cortex-A53 configurations provide impressive multi-core performance


Octa-core Cortex-A53-based SoCs such as MT6752 and to a lesser extent Snapdragon 615 are already showing impressive multi-core CPU performance, while single-core performance has also improved considerably over prior cost-effective CPU architectures. Multi-core performance, both in terms of pure CPU integer and floating point performance, for the MT6752 significantly surpasses (by tens of percent in many benchmarks) the much more expensive Snapdragon 801, while single-core performance is catching up, being about 30% slower for integer operations and 15% slower for floating point. This high level of performance comes at a fraction of the cost (primarily because of the small die size and low power consumption of the Cortex-A53 cores).

Memory bandwidth still a bottleneck


However, when the memory subsystem truly comes into play, high-end chips such as Snapdragon 801 still show much greater performance because of their much higher external memory bandwidth (because of the wider memory interface) as well as larger CPU cache. This is apparent in the Geekbench subtest SGEMM (which is heavy on sequential memory access), for which high-end SoCs such as Snapdragon 801 are more than twice as fast.
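The effect of interface width on bandwidth can be estimated with a simple formula: theoretical peak bandwidth is the bus width in bytes multiplied by the transfer rate. The sketch below assumes a transfer rate of 1600 MT/s purely for illustration (it is not taken from any of the chips discussed here), and real sustained bandwidth is always lower than this peak.

```python
def peak_bandwidth_gbs(bus_width_bits: int, transfers_per_sec: float) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfers_per_sec / 1e9


# Assumed transfer rate, for illustration only.
ASSUMED_RATE = 1600e6  # transfers per second

narrow = peak_bandwidth_gbs(32, ASSUMED_RATE)  # 32-bit interface
wide = peak_bandwidth_gbs(64, ASSUMED_RATE)    # hypothetical 64-bit interface

print(f"32-bit: {narrow:.1f} GB/s, 64-bit: {wide:.1f} GB/s")
```

At the same memory clock, doubling the interface width doubles the peak, which matches the rough factor of two seen in bandwidth-bound subtests such as SGEMM.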

In practice, memory performance is important for how fast a device feels, impacting response times and also being very important for GPU performance. High screen resolutions also put heavy demands on the memory subsystem. In that sense, SoCs such as MT6752 and Snapdragon 615 still perform best at a resolution like 1280x720, with the best performance at 1920x1080 and higher still reserved for high-end SoCs.
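As a rough illustration of how resolution drives memory traffic, the sketch below computes only the bandwidth needed to scan the framebuffer out to the display. The 4 bytes per pixel and 60 Hz refresh are assumptions. GPU rendering, composition of multiple layers and texture reads add considerably more traffic on top, so treat these figures as a lower bound.

```python
def scanout_mb_per_s(width: int, height: int,
                     bytes_per_pixel: int = 4, refresh_hz: int = 60) -> float:
    """Bandwidth in MB/s needed just to read the framebuffer once per refresh."""
    return width * height * bytes_per_pixel * refresh_hz / 1e6


hd = scanout_mb_per_s(1280, 720)
full_hd = scanout_mb_per_s(1920, 1080)

print(f"1280x720:  {hd:.1f} MB/s")
print(f"1920x1080: {full_hd:.1f} MB/s ({full_hd / hd:.2f}x)")
```

Since 1920x1080 has 2.25 times as many pixels as 1280x720, every per-pixel cost on the memory bus grows by the same factor, which hits a 32-bit memory interface much harder.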

There seems to be great potential for performance-oriented Cortex-A53 SoCs with a memory interface wider than 32-bit, comparable with other performance-oriented SoCs. This would be the "best of both worlds" in several respects (lower cost because of small die size of the CPU cores, low power consumption, while still having the memory bandwidth to drive high resolutions). MediaTek has announced a chip that is expected to have such a configuration, the MT6795, but it has not quite appeared on the market yet and might be delayed. However, similar solutions certainly look likely to become popular for performance-oriented devices in the not too distant future.

Appendix: Table with detailed Geekbench CPU benchmark results


Presented here is a table with detailed benchmark result information for the mentioned SoCs, also including several other SoCs on the market. Included is information about the CPU cores used, their clock speed, the smartphone model and Geekbench result entry used as a reference, and scores for several benchmarks. Indexed results (relative to a 1.0 GHz Cortex-A7) are shown for several of the benchmarks, as well as multi-core performance scaling indices. Results relevant to the discussion above have been highlighted in bold. The following Geekbench subtests have been included:

  • JPEG Compression (single/multi-core). A useful integer benchmark that seems to strongly depend on pure CPU performance (CPU core type and clock speed) with less dependence on the memory subsystem (including L2 CPU cache).
  • Dijkstra (single/multi-core). A more complex integer benchmark that probably includes more memory access and may branch a lot. Notable for this benchmark is that Cortex-A53 performs better than Cortex-A15 at the same clock speed, with both Cortex-A17 (MT6595) and Cortex-A57 being significantly faster still.
  • Mandelbrot (single/multi-core). A pure floating point benchmark, highly dependent on the combination of CPU core type and clock speed.
  • Stream copy (single/multi-core). An important metric for memory performance (especially sequential external RAM performance).
  • SGEMM. A floating point matrix multiplication benchmark that heavily depends on sequential memory access. The memory bandwidth available to the SoC makes a critical difference for this benchmark.
  • SFFT. A floating point benchmark that heavily uses random memory access.

(The table itself is an image. For a high-resolution version, view, copy, or save it using the browser.)
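The indexed results and multi-core scaling indices mentioned for the table could be computed along the following lines. The exact formulas behind the table are not spelled out in this article, so both definitions below are assumptions, and the numbers are placeholders rather than Geekbench results.

```python
def index_vs_baseline(score: float, baseline_score: float) -> float:
    """Score expressed relative to a baseline (assumed here: the score of a
    1.0 GHz Cortex-A7 in the same subtest). 1.0 means equal to the baseline."""
    return score / baseline_score


def multicore_scaling(multi_score: float, single_score: float, n_cores: int) -> float:
    """Multi-core efficiency: 1.0 means perfect linear scaling across n_cores."""
    return multi_score / (single_score * n_cores)


# Placeholder numbers for illustration only (NOT measured results).
baseline = 400.0   # hypothetical 1.0 GHz Cortex-A7 single-core score
single = 900.0     # hypothetical single-core score of the chip under test
multi = 3000.0     # hypothetical multi-core score with 4 cores

index = index_vs_baseline(single, baseline)
scaling = multicore_scaling(multi, single, n_cores=4)
print(f"index: {index:.2f}, scaling efficiency: {scaling:.2f}")
```

A scaling efficiency well below 1.0 in a subtest that does not depend much on memory would point to the kind of multi-core bottleneck discussed for Snapdragon 410.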

Sources: Geekbench browser, Primate Labs website

Updated January 2, 2015 (Add section of low typical-use benchmark scores for Snapdragon 410).
Updated January 5, 2015 (Update Geekbench performance table).
Updated January 10, 2015 (Update performance table).
Updated February 11, 2015 (mention and link Linux kernel Cortex-A53 errata).

2 comments:

Anonymous said...

Good, informative article. Thanks!

Anonymous said...

Hi, does that mean the Qualcomm Snapdragon 615 (4x ARM Cortex-A53 @ 1.54 GHz), with revision r0p1, has a hardware bug?