Early benchmarks show strong performance of the Cortex-A53 core, especially for the latest revisions
Early evidence of the performance of new SoCs exclusively using ARM Cortex-A53 processor cores, based on recent entries in Geekbench's result database, suggests that the performance improvement of Cortex-A53 over Cortex-A7 at an equivalent clock speed may be greater than originally expected, especially when running with the 32-bit ARMv8 machine model as implemented in Android 4.4.4.
There is evidence that several revisions of the Cortex-A53 core already exist, including the original r0p0, the r0p1 and the r0p2 revision (with r0p3 also being listed on ARM's website). Although these are minor revisions that do not significantly alter the IP blocks, the later revisions seem to be associated with significant performance improvements when compared to earlier revisions, possibly because of the correction of functional or performance bugs in earlier revisions. In particular, r0p0 revision devices such as the first incarnation of Snapdragon 410 (MSM8916) appear to be limited to ARMv7 compatibility mode, while SoCs with later revisions appear to be configured with support for the 32-bit version of the ARMv8 instruction set (AArch32) in association with Android 4.4.4.
Full 64-bit ARMv8 machine model not likely to be of great benefit on mobile devices
The full 64-bit ARMv8 instruction set (AArch64) as supported by Cortex-A5x is not yet supported in Android, and there are reasons to believe that using it might not result in much benefit in today's devices over AArch32. For example, much of the benefit of the new ARMv8 instruction set is already delivered by AArch32, and actual use of 64-bit registers/variables and operations on them is relatively uncommon in program code (this is true of most program code, including typical code executed using the x86_64 instruction set used in PCs and Atom-based mobile devices). Additionally the ARMv7 instruction set (and AArch32) already contain some instructions that operate on 64-bit values, which can be conveniently taken advantage of for these uncommon cases, without requiring the use of the full 64-bit ARMv8 instruction set.
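To make the last point concrete, here is a minimal sketch of how a 64-bit addition can be composed from 32-bit operations, in the way a 32-bit ARM compiler typically pairs an add that sets the carry flag (ADDS) with an add-with-carry (ADC). The Python helpers below only emulate that idea; their names are made up for illustration and they are not generated machine code.

```python
MASK32 = 0xFFFFFFFF  # keeps a value within one 32-bit register


def split64(value):
    """Split a 64-bit value into (low, high) 32-bit halves."""
    return value & MASK32, (value >> 32) & MASK32


def join64(lo, hi):
    """Combine (low, high) 32-bit halves into one 64-bit value."""
    return (hi << 32) | lo


def add64_via_32bit(a_lo, a_hi, b_lo, b_hi):
    """Add two 64-bit values using only 32-bit additions and a carry."""
    lo = a_lo + b_lo                     # like ADDS: add the low words
    carry = lo >> 32                     # carry out of the low word (0 or 1)
    hi = (a_hi + b_hi + carry) & MASK32  # like ADC: add high words plus carry
    return lo & MASK32, hi


a = 0x00000001FFFFFFFF
b = 0x0000000000000001
result = join64(*add64_via_32bit(*split64(a), *split64(b)))
print(hex(result))
```

The occasional 64-bit operation therefore costs a couple of 32-bit instructions rather than requiring a 64-bit machine model.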
Moreover, in the ARM world, data processing algorithms that might benefit from 64-bit processing are often better served by using ARM's NEON SIMD extension, which is also available on AArch32 and most ARMv7-A devices.
Although AArch64 makes memory management more flexible by extending the address space beyond 4 GB, doubling the storage size of all pointers (memory addresses) from 32 bits to 64 bits negatively impacts performance because of greater code and data memory usage, to which mobile SoCs are especially sensitive given their relatively small internal buffers, cache memories and RAM. PAE support already allows 32-bit ARM machine models to take advantage of a larger address space, reducing the necessity of switching to a full 64-bit model.
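The effect of pointer size on memory footprint can be illustrated with some simple arithmetic. The sketch below assumes a hypothetical pointer-heavy data structure (a node with a 4-byte payload and two pointers) and a hypothetical 512 KB cache, and it ignores alignment padding and allocator overhead; none of these figures come from a specific SoC.

```python
def node_bytes(pointer_bytes, payload_bytes=4, pointers_per_node=2):
    """Size of one node, ignoring alignment padding and allocator overhead."""
    return payload_bytes + pointers_per_node * pointer_bytes


def nodes_per_cache(cache_bytes, pointer_bytes):
    """How many whole nodes fit in a cache of the given size."""
    return cache_bytes // node_bytes(pointer_bytes)


CACHE_BYTES = 512 * 1024  # hypothetical cache size, for illustration only

nodes_32 = nodes_per_cache(CACHE_BYTES, pointer_bytes=4)  # 32-bit pointers
nodes_64 = nodes_per_cache(CACHE_BYTES, pointer_bytes=8)  # 64-bit pointers

print(node_bytes(4), node_bytes(8))  # bytes per node in each model
print(nodes_32, nodes_64)            # nodes that fit in the cache
```

Under these assumptions a node grows from 12 to 20 bytes, so noticeably fewer nodes fit in the same cache when pointers are 64 bits wide.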
32-bit version of ARMv8 instruction set brings benefits
Android support for the 32-bit version of ARMv8 is a very recent development, taking advantage of new ARMv8 instructions that improve performance, and probably also the architectural changes in ARMv8 (such as the removal of the optional conditional predication of instructions present in ARMv7-A) that benefit modern CPU cores such as Cortex-A53 and Cortex-A57. Geekbench takes advantage of the new machine model, and the majority of Android applications, largely consisting of device-independent Java code that is translated into machine code on demand, is also likely to benefit. However, to what extent ARMv7-A native code, which is commonly used in applications that require more CPU processing, is affected by the new machine model is unclear.
SoC-specific CPU optimizations are common, but impact power consumption more than speed
Variation between different implementations of Cortex-A53 cores at a similar process node can also occur because of core hardening optimization in the SoC design. This can involve trading performance for power efficiency and vice versa, although it should not in principle affect metrics such as IPC (instructions per cycle) or indeed Geekbench CPU scores as long as they do not depend on factors outside of the CPU core such as a more extensive memory footprint. However, apart from L2 CPU cache memory size, CPU cache latencies may also be configurable through core hardening, and the latter may impact even small memory footprint benchmarks, including CPU tests used in Geekbench.
Geekbench result round-up for smartphone SoCs, including new designs using Cortex-A53
The results were gathered by examining the range of benchmark results for a common SoC and CPU clock frequency configuration (which tends to include numerous lower-than-expected scores, probably mostly due to background CPU activity when running the benchmark or the effects of CPU throttling), and choosing a representative result close to the high end of the range, while trying to make sure the result is not an outlier and does not show indications of overclocking. As much as possible, entries using the most recent version of Geekbench (3.2.1 or 3.2.0) and a recent underlying Android version (preferably 4.4.x) were selected.
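The selection described above was done by hand. As a rough illustration only, a similar choice could be automated by taking a high percentile of the reported scores rather than the maximum, so that throttled runs at the low end and a suspiciously high (possibly overclocked) run at the top are both avoided. The function name, the 90th percentile and the sample scores are assumptions made for this sketch.

```python
def representative_score(scores, percentile=90):
    """Nearest-rank percentile: a score near, but not at, the top of the range."""
    if not scores:
        raise ValueError("scores must not be empty")
    ordered = sorted(scores)
    # Ceiling division gives the 1-based nearest rank for the percentile.
    rank = (percentile * len(ordered) + 99) // 100
    rank = min(max(rank, 1), len(ordered))
    return ordered[rank - 1]


# Hypothetical scores for one SoC and clock configuration: two throttled
# runs at the low end and one suspiciously high run at the top.
scores = [980, 1010, 1150, 1160, 1170, 1175, 1180, 1185, 1190, 1600]
print(representative_score(scores))
```

With these sample values the function returns the second-highest score, skipping the single outlier at the top.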
While the Integer and Float scores reported in the table are likely to be closely tied to the processor core, SoC and the clock frequency used, the memory score and overall score depend on the external memory implementation and speed and other factors related to a particular device model.
Analyzing Geekbench performance of existing SoCs
Looking at previous generation SoCs with a quad-core Cortex-A7 CPU configuration, Geekbench results show MediaTek SoCs to be very competitive against Qualcomm SoCs long considered mid-range, such as a 1.2 GHz Snapdragon 400. For example, MediaTek's MT6582, despite usually being found in much cheaper (often entry-level) devices than Snapdragon 400, is quite competitive. Samsung's Exynos 3470, used in the Galaxy S5 Mini, appears to be the worst performer in this class in terms of performance per MHz.
Looking at higher performance SoCs, the octa-core MT6592 holds the middle ground based on strong multi-core CPU performance (with memory performance being a relative bottleneck), while Qualcomm's Snapdragon 801/805 are a clear step up, especially in terms of single-thread and memory performance. Snapdragon 805 appears to be very similar to Snapdragon 801 in terms of CPU architecture, with very similar performance at the same clock speed. Geekbench reports its cores as essentially a major version bump of the Krait-400 core used in the Snapdragon 801, although Qualcomm describes the CPU cores inside Snapdragon 805 as Krait-450. Exynos 5430 provides a similar level of performance, but its power efficiency may be in doubt.
The following Geekbench model names associated with entries using existing SoCs were used for performance comparisons. A link to the results page used for each model is provided.
- Qualcomm MSM8226 (Snapdragon 400) (Cortex-A7r0p3): HTC HTC Desire 610 (Geekbench 3.2.1 ARMv7, Android 4.4.2)
- Samsung Exynos 3470 (Cortex-A7r0p3): samsung SM-G800F (Geekbench 3.2.1 ARMv7, Android 4.4.2)
- MediaTek MT6582 (Cortex-A7r0p3): HUAWEI H30-U10 (Geekbench 3.2.1 ARMv7, Android 4.4.2)
- MediaTek MT6589T (Cortex-A7r0p2): LENOVO Lenovo S960 (Geekbench 3.2.0 ARMv7, Android 4.4.2)
- Qualcomm MSM8226 (Snapdragon 400) (Cortex-A7r0p3): HTC HTC Desire 816 dual sim (Geekbench 3.2.1 ARMv7, Android 4.4.2)
- MediaTek MT6592 (Cortex-A7r0p4): LENOVO Lenovo A806 (Geekbench 3.2.1 ARMv7, Android 4.4.2)
- Qualcomm MSM8974AC (Snapdragon 801): Motorola Moto X (2014) (Geekbench 3.2.1 ARMv7, Android 4.4.4)
- Samsung Exynos 5430: samsung SM-G850F (Geekbench 3.2.1 ARMv7, Android 4.4.4)
- Qualcomm APQ8084 (Snapdragon 805): samsung SAMSUNG-SM-N910A (Geekbench 3.2.1, Android 4.4.4)
Performance of new Cortex-A53-based SoCs
Qualcomm's first generation 1.2 GHz Snapdragon 410 (MSM8916), with four Cortex-A53r0p0 cores, has higher performance than a similarly clocked Snapdragon 400, although not dramatically so. A faster clocked Snapdragon 410 prototype (with MSM8916_32 SoC) with a later revision of the Cortex-A53 core shows a clear improvement in Geekbench Integer Performance over the previous Snapdragon 410 when adjusting for the clock rate. However, this is in large part due to the availability of the AArch32 instruction set in the newer device, allowing Geekbench to take advantage of new cryptography instructions that greatly speed up certain subtests that are part of the Integer benchmarks.
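Adjusting for the clock rate amounts to comparing score per GHz rather than raw scores. The sketch below shows the arithmetic using the two clock speeds mentioned in this article (1.19 GHz and 1.54 GHz); the scores themselves are hypothetical placeholders, not Geekbench results, and the function names are made up for illustration.

```python
def score_per_ghz(score, clock_ghz):
    """Benchmark score normalized by CPU clock speed."""
    return score / clock_ghz


def clock_adjusted_gain(new_score, new_clock_ghz, old_score, old_clock_ghz):
    """Fractional improvement of the new chip at equal clock speed."""
    new_rate = score_per_ghz(new_score, new_clock_ghz)
    old_rate = score_per_ghz(old_score, old_clock_ghz)
    return new_rate / old_rate - 1.0


# Hypothetical scores, chosen only to demonstrate the calculation.
gain = clock_adjusted_gain(new_score=840, new_clock_ghz=1.54,
                           old_score=500, old_clock_ghz=1.19)
print(f"{gain:.1%} higher at the same clock speed")
```

A raw score that rises only in proportion to the clock speed yields a gain of zero, so any positive value reflects an improvement per clock cycle (or a benchmark effect such as the new cryptography instructions).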
MediaTek's upcoming MT6752, with an octa-core configuration of the more recent r0p2 revision of the Cortex-A53 core, shows impressive performance, with the caveat that this is based on a single reported benchmark score of a prototype device. Overall integer performance as reported by Geekbench is especially impressive, being comparable to Snapdragon 801 for single-thread performance and blowing past it in terms of multi-core performance. However, the use of AArch32 is likely to inflate the overall Integer scores relative to typical performance in practice because of the relatively large influence of new cryptography instructions available with AArch32 on Geekbench's Integer Performance scores, although other benefits of AArch32 are also apparent. Memory efficiency also appears to be significantly improved when compared to previous generation Cortex-A7-based devices. Despite relatively high performance, the MT6752 is likely to be power-efficient and very cost-effective, due to the characteristics that the Cortex-A53 core has inherited from Cortex-A7.
The following Geekbench model names associated with entries using a SoC with Cortex-A53 cores were used for performance comparisons. A link to the results page used for each model is provided.
- Qualcomm MSM8916 (Snapdragon 410) (Cortex-A53r0p0): HTC Desire 510 (Geekbench 3.2.1 ARMv7, Android 4.4.3)
- Qualcomm MSM8916_32 (Snapdragon 410) (Cortex-A53r0p1): unknown msm8916_32 (Geekbench 3.2.1 AArch32, Android 4.4.4)
- Qualcomm MSM8939 (Snapdragon 615) (Cortex-A53r0p1): HTC HTC 0PFJ1 (Geekbench 3.2.0 AArch32, Android 4.4.4)
- MediaTek MT6752 (Cortex-A53r0p2): alps k2v1 (Geekbench 3.2.1 AArch32, Android 4.4.4)
- Samsung Exynos 5433 (Cortex-A57r1p0 + Cortex-A53): samsung SM-N910C (Geekbench 3.2.0 AArch32, Android 4.4.4)
Cortex-A53 blows Cortex-A57 away in terms of efficiency
Samsung's new Exynos 5433, the first SoC with publicly disclosed Cortex-A57 cores, sets a new high mark for single-thread performance, being considerably faster than Snapdragon 801, but surprisingly finds itself beaten on multi-core integer performance in early results for the MT6752, a mid-range SoC. Both devices use AArch32, so the relatively heavy weighting of new AArch32 cryptography instructions by Geekbench is not as important as when comparing with previous generation devices.
Exynos 5433 contains four Cortex-A53 cores in addition to the four Cortex-A57 cores in a big.LITTLE configuration. More detailed examination of the benchmark results (specifically, primarily CPU-bound subtests such as JPEG Compress) provides evidence that the Cortex-A53 cores do contribute to multi-core performance, with a multi-core performance scaling factor of 4.46 (about 4.0 would be expected when just the Cortex-A57 cores are utilized). This suggests that Global Task Switching (allowing all eight cores to run concurrently) is working, although it does not provide a great boost in overall processing performance; its more significant benefits are likely to be in overall power efficiency and CPU scheduler efficiency.
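The scaling factor used here is simply the multi-core score of a subtest divided by its single-core score. A minimal sketch of that calculation follows; the function names and the two scores are hypothetical and were chosen only so that the ratio matches the 4.46 quoted above.

```python
def scaling_factor(multi_core_score, single_core_score):
    """Multi-core score expressed as a multiple of the single-core score."""
    return multi_core_score / single_core_score


def surplus_over_big_cluster(factor, big_cores=4):
    """Throughput beyond ideal big-cluster-only scaling, in units of one big core."""
    return factor - big_cores


# Hypothetical subtest scores that reproduce the 4.46 ratio.
factor = scaling_factor(multi_core_score=4460, single_core_score=1000)
surplus = surplus_over_big_cluster(factor)
print(f"scaling factor {factor:.2f}, surplus {surplus:.2f} big-core equivalents")
```

Read this way, the four Cortex-A53 cores together add somewhat less than half of one Cortex-A57 core's worth of throughput in this subtest, assuming the Cortex-A57 cluster itself scales ideally.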
It has to be noted that the MT6752, which closes in on the performance of a high-end design like Exynos 5433, is a mid-range chip with a cost-effective 32-bit memory interface, and is likely to be considerably cheaper and much more power-efficient than Exynos 5433 and other high-end platforms, dramatically illustrating the great efficiency of Cortex-A53-based SoCs against the relative inefficiency of Cortex-A57. Cortex-A57 provides superior single-thread performance, but compares poorly in terms of performance/dollar and performance/Watt. High performance Cortex-A53 designs such as MediaTek's upcoming octa-core MT6795 (which is targeting a higher clock frequency and has a premium dual-channel memory interface) are likely to make the comparison even more compelling.
Low-power Cortex-A53 has significant advantages related to performance scaling and thermal restrictions
Key to this development is the apparent tendency of in-order pipeline cores such as Cortex-A53 (and previously Cortex-A7) to show much greater performance scaling on new, more advanced process nodes, primarily because of much greater increases in maximum clock speed. For example, clock speed increase has been limited for SoCs with high-performance CPU cores in the same class as Cortex-A57 (generally out-of-order pipeline, speculative issue architectures with a large die size) such as Exynos models with Cortex-A15 and Apple A7/A8 with Cyclone, despite the transition to 20 nm manufacturing.
In addition, practical performance of Cortex-A53 is likely to be much less affected by CPU throttling (periodic reduction of the CPU clock speed because of the temperature increasing beyond a certain threshold in order to maintain stability), thanks to the power efficiency of Cortex-A53, which may aid actual performance in practice more than is apparent from the results of common CPU benchmarks.
Finally, the current comparison of Cortex-A53 with Cortex-A57 as implemented in Exynos 5433 is not apples-to-apples because Exynos 5433 is manufactured at 20 nm, with significant associated performance benefits, while Cortex-A53-based devices (which for the moment are mostly targeted at cost-sensitive applications) are still manufactured at 28 nm. Although there is not yet much information about how Cortex-A53 will scale on 20 nm, I believe there is potential for additional performance scaling that could be disruptive in terms of performance and efficiency advantages when compared to high-performance cores like Cortex-A57.
Comparison of Cortex-A53 CPU core revisions
- Cortex-A53r0p0 (part 3331, variant 0, revision 0 as reported by Geekbench) is the first revision. This appears to be the version used in a quad-core configuration in MSM8916, the first generation of Qualcomm's Snapdragon 410, which is the first Cortex-A53-based SoC to be commercially available, in devices such as HTC Desire 510 and several currently ramping devices, including Samsung Mega 2 (SM-G7508Q) and Samsung Galaxy A5 (SM-A500F). The clock speed is typically set at 1.19 GHz. Devices using this chip appear to be limited to ARMv7, not being able to take advantage of the 32-bit ARMv8 (AArch32) instruction set. As early as July 1, 2014, Qualcomm's Android for MSM Project stopped providing support for this SoC for new Android versions, with the latest supported version being Android 4.4.3.
- Cortex-A53r0p1 (part 3331, variant 0, revision 1 as reported by Geekbench) is the second revision. It is used in a Qualcomm prototype device result reported as MSM8916 or MSM8916_32 (a chip designation similar to already shipping devices using a Snapdragon 410 with the first revision of Cortex-A53), equivalent to a SoC referred to by Qualcomm as MSM8916_32, running at a higher maximum clock rate (1.54 GHz vs 1.19 GHz) and showing a significant additional performance improvement beyond that expected from the clock speed increase alone. The combined Geekbench integer performance score for r0p1 is about 30% higher for single and multi-core performance than r0p0 at the same clock speed. That is largely the result of the cryptography instructions offered by AArch32, but other improvements are also apparent. Floating point performance remains about the same. Memory performance may also be higher, but that also depends on the speed of the memory used in the tested devices.
- Cortex-A53r0p1 is also used in Qualcomm's octa-core MSM8939 (Snapdragon 615), which has two clusters of four Cortex-A53 cores, one running at a higher and the other at a lower speed. Geekbench results for an HTC prototype using this chip (running at a maximum CPU speed of 1.34 GHz) are consistent with the performance per MHz found in the r0p1-based MSM8916_32, with gains in multi-core performance over the quad-core chips suggesting that the device supports heterogeneous multi-processing (also called Global Task Switching), allowing all eight processor cores to run simultaneously, although the gain is significantly lower than what would be expected when all CPU cores are fully utilized (even allowing for a relatively low CPU speed of the second cluster, say 0.7 GHz).
- Cortex-A53r0p2 (part 3331, variant 0, revision 2 in Geekbench) appears to be the latest revision of the Cortex-A53 that has been implemented in SoCs. A benchmark result for a device based on MediaTek's upcoming mid-range octa-core MT6752 SoC provides evidence for the existence of this core. The CPU cores are clocked at 1.69 GHz, and the benchmark results are impressive, helped by the ability of the eight cores to run concurrently at full speed. Integer and floating point performance when corrected for clock speed appear to be further improved slightly over the previous r0p1 revision, based on single-core performance, although this could also be due to characteristics of the SoC.
- Multi-core performance of the r0p2-based MT6752 is very impressive, although not quite scaling linearly with the doubling of the number of cores. Multi-core performance does appear to be scaling significantly better than with the asymmetrically clocked cores in the Snapdragon 615 prototype, even when allowing for a very low clock speed of the second cluster of the latter. This is not unexpected because multi-threading, especially in a benchmark, is likely to be significantly more efficient when dealing with equivalently-clocked CPU cores.
- Memory performance of the MT6752 test device is impressive for its class, with a significant increase over the Cortex-A53r0p1-based Qualcomm SoCs, and being dramatically higher than existing designs that also utilize an economical 32-bit memory interface. Although higher-clocked memory is likely a factor, data rate and memory controller improvements in the r0p2 revision of the Cortex-A53 core are likely to be more significant. ARM has alluded to improvements in the memory subsystem and data rates in Cortex-A53, which may be more fully realized in the r0p2 revision and its implementation in the MT6752 SoC.
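The multi-core scaling expectations discussed in the list above can be made explicit with a simple upper bound. The sketch below assumes that throughput is proportional to clock speed, that all cores are otherwise identical, and that memory and scheduler contention are ignored, so real results will fall below it. The 1.34 GHz, 0.7 GHz and 1.69 GHz clocks are taken from the text (0.7 GHz being only the article's guess for the slower cluster); the function name is hypothetical.

```python
def ideal_scaling(core_clocks_ghz):
    """Upper bound on multi-core scaling relative to the fastest single core.

    Assumes throughput proportional to clock speed and no contention.
    """
    fastest = max(core_clocks_ghz)
    return sum(core_clocks_ghz) / fastest


# Two asymmetric clusters: four cores at 1.34 GHz and four at an assumed 0.7 GHz.
asymmetric = ideal_scaling([1.34] * 4 + [0.7] * 4)

# Eight equally clocked cores at 1.69 GHz.
symmetric = ideal_scaling([1.69] * 8)

print(f"asymmetric clusters: {asymmetric:.2f}x, symmetric octa-core: {symmetric:.2f}x")
```

Under these assumptions the asymmetric configuration tops out at roughly 6.1 times single-core performance, against 8 times for the symmetric design, which is the baseline against which the observed gains can be judged.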
Other new ARM IP technology contributes to performance and efficiency improvement
The Cortex-A53 has become available together with other IP products from ARM that improve performance and efficiency. These include a faster and more efficient interconnect bus, compression and other data rate reduction techniques such as ARM Frame Buffer Compression (AFBC), Smart Composition, and Transaction Elimination, and new Mali GPU cores (such as Mali-T760 and Mali-T72x) which together have the potential to dramatically improve performance and especially power consumption for graphics-related tasks (including typical device use), while also alleviating the memory bandwidth bottleneck in cost-sensitive devices with a limited memory subsystem, such as the 32-bit external memory interface used in most entry-level to mid-range mobile devices.
Favourable comparison with existing high-performance designs
Judging from these early benchmark results, an octa-core Cortex-A53 can achieve performance rivalling existing high-end platforms such as Snapdragon 801 in several metrics. The test results of the MT6752-based device show a dramatically higher Geekbench multi-core integer performance score when compared to Snapdragon 801, with single-core integer performance being similar. However, the scores are inflated due to the heavy weighting of new cryptography instructions available with MT6752's support for AArch32, although in general AArch32 is likely to bring benefits for most applications. Multi-core floating point performance is also higher. Single-core floating point and memory performance are clearly lower than Snapdragon 801, although not dramatically so. Nevertheless, considering that the MT6752 is expected to use only a 32-bit memory interface, its memory performance is very impressive, being a large improvement over existing devices with a 32-bit memory interface.
The strong "premium level" performance of devices like the MT6752 is associated with a dramatically decreased chip manufacturing cost when compared to existing high-end SoCs such as Snapdragon 801. The Cortex-A53 cores, even in an octa-core configuration, are likely to be significantly smaller than out-of-order high-performance cores such as the Krait-400 cores used in the Snapdragon 801, resulting in chips with a much smaller die size (similar comparisons can be made with ARM's high-performance cores such as Cortex-A1x and Cortex-A57). Power consumption is also likely to be dramatically improved.
Revolution on the cards for performance, cost and power efficiency
Coupled with the cost reductions allowed by the 32-bit memory interface (as compared to the 64-bit or 32-bit dual-channel interfaces of existing high-end devices), with Cortex-A53 a revolution in performance/dollar and performance/Watt for high-performing devices appears to be on the cards. At the same time, lower-end devices (using, for example, a quad-core Cortex-A53r0p2 configuration) will see dramatic performance improvement.
When Cortex-A53 cores are combined with other high-end features such as a wider memory interface and a high-performance GPU (such as implemented in MediaTek's upcoming MT6795), there is potential to further close in on or even surpass the performance of existing premium-level architectures, with greatly increased (power) efficiency and reduced cost. Although single-thread performance is not likely to quite reach the level of existing premium devices, other metrics (including multi-core performance, power consumption and cost) are likely to see a dramatic improvement. Early reports already indicate that SoCs such as the MT6795 will be disruptive in terms of cost and efficiency for high-performance mobile applications.
In conclusion, the emergence of Cortex-A53-based designs and associated IP is likely to revolutionize performance, cost and efficiency in mobile devices, bringing higher performance to cost-sensitive entry-level and mid-range devices, reducing cost for high-end devices while also improving the performance of premium devices with much greater efficiency and reduced cost.
Sources: Geekbench result database, ARM, EE Times (Comments about adoption of MT6795)
Updated September 28, 2014 (Fix revision designations of Cortex-A53 based on feedback; revisions reported by Geekbench are minor revisions of major revision r0, as in r0pN).
Updated September 30, 2014 (Use more representative benchmarks for some SoCs, provide information about Geekbench and Android version as well as a weblink for all tabulated benchmark results, discuss merits of different ARMv8 instruction set models, make note of cryptography instructions in AArch32 inflating Geekbench Integer Performance, and other improvements).
Updated October 3, 2014 (Include early reports about octa-core Cortex-A53 MT6795 adoption for high-performance devices).
Updated November 13, 2014 (Correct information about effectiveness of GTS on Exynos 5433).
Updated December 45, 2014 (MSM8939 is Snapdragon 615, not Snapdragon 610).
To do: Cortex-A53 Geekbench scores are likely to be inflated because of support for 32-bit ARMv8 mode in the most recent versions of Geekbench, which enables the use of cryptography instructions that significantly increase the scores of certain subtests of the Geekbench CPU Integer performance tests, while not accurately reflecting the CPU performance increase for most applications. This will be further investigated in the near future. As I did in subsequent blog posts, concentrating on Geekbench subtests that better represent integer CPU performance, such as the JPEG Compress test, rather than the overall integer performance scores, should give a much better picture.
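One way to act on this is to recompute a composite score with the cryptography subtests left out. The sketch below combines subtest scores with a geometric mean, which is an assumption made for illustration and not necessarily how Geekbench aggregates its results. Apart from JPEG Compress, the subtest names are illustrative, and all scores are hypothetical placeholders.

```python
import math


def geometric_mean(values):
    """Geometric mean of a non-empty sequence of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def composite(subtests, exclude=()):
    """Composite score over all subtests except those named in `exclude`."""
    kept = [score for name, score in subtests.items() if name not in exclude]
    return geometric_mean(kept)


# Hypothetical single-core subtest scores; the crypto subtests are boosted
# to mimic the effect of dedicated cryptography instructions.
subtests = {
    "AES": 4000,
    "SHA1": 3000,
    "JPEG Compress": 900,
    "Dijkstra": 800,
}

with_crypto = composite(subtests)
without_crypto = composite(subtests, exclude=("AES", "SHA1"))
print(round(with_crypto), round(without_crypto))
```

Comparing the two composites for the same device gives a rough sense of how much of an overall integer score is attributable to the cryptography subtests alone.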