Showing posts with label ARMv8. Show all posts

Friday, June 5, 2015

Smartphone platforms migrate to 64-bit (AArch64) mode

Recently, most existing and new mobile SoCs have started to become available configured in native 64-bit mode (AArch64) in conjunction with a 64-bit version of Android 5. Although SoCs targeting premium-level devices that are already shipping were the first to support AArch64 (including Tegra K1-64, Exynos 7420 and Snapdragon 810), recent entries in the Geekbench results database show that cost-sensitive platforms are also migrating to native 64-bit mode in upcoming smartphones.

This move involves Cortex-A53-based platforms such as MediaTek's MT6735, MT6752, MT6753 and MT6795, Qualcomm's Snapdragon 615 (MSM8939) as well as a new Snapdragon 410 (MSM8916) platform (which was previously limited to ARMv7), and HiSilicon's Kirin 620 and Kirin 930.

Initial ARMv8 platforms used hybrid AArch32 mode


Several ARMv8-based SoCs have been shipping for some time, but most have been using AArch32 mode, a hybrid mode which takes advantage of some of the architectural improvements in ARMv8 but does not expose native 64-bit mode to applications. Snapdragon 410 did not take advantage of ARMv8 at all, running entirely in ARMv7 mode.

One reason why full AArch64 mode has not been adopted right away is that it does come with a performance penalty due to the increased storage requirements for program code and pointers, which puts greater demands on the memory subsystem of the SoC. Cost-sensitive smartphone models are especially sensitive to this due to a lower amount of RAM and smaller on-chip CPU caches. A decrease in the price of RAM chips has allowed the amount of RAM in cost-sensitive models to increase (e.g. more devices shipping with 2GB RAM), making AArch64 mode more appealing.
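
To make the storage argument concrete, the following minimal sketch estimates the size of one node of a linked data structure with 4-byte and 8-byte pointers. The node layout (two pointers plus a 4-byte payload) and the padding rule (size rounded up to a multiple of the pointer size) are illustrative assumptions, not measurements from any real application.

```python
def node_size(pointer_bytes, n_pointers=2, payload_bytes=4):
    """Size in bytes of a hypothetical node, padded to pointer alignment."""
    raw = n_pointers * pointer_bytes + payload_bytes
    # Round up to the next multiple of the pointer size (assumed alignment rule).
    return -(-raw // pointer_bytes) * pointer_bytes


size_32 = node_size(4)  # 4-byte pointers, as in AArch32
size_64 = node_size(8)  # 8-byte pointers, as in AArch64

print(size_32, size_64, size_64 / size_32)
```

Under these assumptions the same node takes twice the space in 64-bit mode, so fewer nodes fit in a given cache or amount of RAM. Real code is rarely this pointer-heavy, so the overall penalty is normally smaller.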

AArch64 also has benefits, in particular for floating point and data-intensive applications that use NEON vector instructions.

Comparison of CPU benchmark results


The migration to AArch64 mode across the board makes it easier to compare CPU benchmarks of different SoCs, which was previously made more difficult by the fact that some SoCs used AArch64 mode while others were still limited to AArch32.

In the following sections, I will return to Geekbench CPU test results and try to make an apples-to-apples comparison for different groups of SoCs.

Quad-core Cortex-A53 SoCs


Quad-core SoCs included are MT6732, MT6735 and Snapdragon 410. Note that the version of Snapdragon 410 tested most likely reflects a newer silicon revision that has not yet widely appeared in end devices, since previous versions of Snapdragon 410 (MSM8916) were always limited to ARMv7 mode (seemingly being unable to run in AArch32 mode).

The following table shows selected integer test results from Geekbench entries for the mentioned SoCs, running in AArch64 mode.

SoC        Geekbench  Clock  JPEG Compress (int)      Lua (int)
           ref        speed  Single IPC   Multi Par   Single IPC   Multi Par

MT6732     2705430    1.50    783   1.36  3108  3.97   795   1.29  3017  3.79
MT6735     2650175    1.30    646   1.36  2604  4.03   656   1.23  2047  3.12
MSM8916-64 2708213    1.21    626   1.34  2481  3.96   615   1.24  1280  2.08

The table below shows selected floating point and memory results.

SoC        Geekbench  Clock  Mandelbrot (float)       Stream Copy (memory)
           ref        speed  Single IPC   Multi Par   Single Multi

MT6732     2705430    1.50    631   1.23  2490  3.95  1030   1156
MT6735     2650175    1.30    526   1.19  2091  3.98   901    965
MSM8916-64 2708213    1.21    508   1.23  1969  3.88   447    505

The "IPC" value as shown in the tables is an index calculated from a comparison with the performance of common Cortex-A7-based SoCs, normalized to the same clock speed. The parallelism value ("Par") is the performance scaling from single-core to multi-core for the specific Geekbench subtest.
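
The two derived columns can be sketched in a few lines of Python. The Cortex-A7 reference score per GHz is not stated in this post, so the sketch back-solves it from the MT6732 Lua row (published IPC 1.29). That baseline is an assumption made for illustration, and the function names are hypothetical.

```python
def parallelism(single_score, multi_score):
    """'Par': scaling from the single-core to the multi-core score."""
    return multi_score / single_score


def ipc_index(single_score, clock_ghz, baseline_per_ghz):
    """'IPC': score per GHz relative to a Cortex-A7 reference score per GHz."""
    return (single_score / clock_ghz) / baseline_per_ghz


# Lua (int) rows from the table above: (single, multi, clock in GHz)
lua = {
    "MT6732": (795, 3017, 1.50),
    "MT6735": (656, 2047, 1.30),
    "MSM8916-64": (615, 1280, 1.21),
}

# Assumption: reference score per GHz back-solved from the MT6732 row (IPC 1.29).
baseline = (795 / 1.50) / 1.29

for soc, (single, multi, clock) in lua.items():
    print(soc,
          round(ipc_index(single, clock, baseline), 2),
          round(parallelism(single, multi), 2))
```

With this back-solved baseline, the other two Lua rows reproduce the IPC and Par values listed in the table, which suggests the columns were computed along these lines.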

The IPC values are fairly consistent, as would be expected from the same CPU core (Cortex-A53) running the same ISA (instruction set architecture). When scaling to multiple cores, MT6732 does best, as shown by the scaling in the Lua benchmarks. This is not surprising as MT6732 is not an entry-level SoC given its cost structure, being better described as belonging to the mid-range segment. It is likely to have a better memory subsystem (in particular, a larger and faster L2 cache) than the other chips.

MediaTek's new entry-level chip, MT6735, apart from running at a somewhat higher clock speed (1.3 GHz vs 1.2 GHz), outperforms the 64-bit version of Snapdragon 410 when normalized to the same clock speed, which is especially evident in the Lua multi-core test and memory tests. The Lua results could be a reflection of L2 cache size and/or speed. Memory performance (based on the Stream Copy subtest) of both MediaTek chips is roughly double that of Snapdragon 410 (something which was already evident in the respective 32-bit platform results).

Mid-range octa-core Cortex-A53-based SoCs


The octa-core Cortex-A53-based SoCs targeting the mid-range segment include MediaTek's performance-oriented MT6752, the recent cost-reduced MT6753, Qualcomm's Snapdragon 615 (MSM8939), and HiSilicon's Kirin 620 (Hi6210).

These SoCs use different CPU clock speed configurations. MediaTek's MT6752 and MT6753 run all cores at the same maximum clock speed, 1.69 GHz for MT6752 and (at least in the tested device) seemingly only about 1.1 GHz for MT6753, even though Geekbench reports a maximum clock speed of 1.3 GHz. HiSilicon's Kirin 620 can run all cores up to a maximum speed of 1.2 GHz.

Qualcomm's Snapdragon 615 uses a pseudo-big.LITTLE, hierarchical architecture with one performance cluster of four cores running up to 1.65 GHz in the most recent version of the platform (previous versions ran up to 1.5 GHz), with the other power-efficient cluster running at a significantly lower clock speed. MediaTek's announcement of the MT6755 (Helio P10) shows that MediaTek is also transitioning to hierarchical CPU clusters for new chips, similar to Snapdragon 615.

Having one power-optimized CPU cluster helps power efficiency for low CPU demand scenarios such as smartphone standby or light usage. The fact that Snapdragon 615 is not very power efficient, despite the low-clocked cluster, is mostly due to the low-performance 28LP manufacturing process used.

The following table shows selected integer test results from Geekbench entries for the mentioned SoCs, running in AArch64 mode.

SoC        Geekbench  Clock  JPEG Compress (int)      Lua (int)
           ref        speed  Single IPC   Multi Par   Single IPC   Multi Par

MSM8939    2704276    1.65    837   1.32  4269  5.10   789   1.16   667  0.85
MT6752     2709869    1.69    890   1.37  6719  7.55   907   1.31  6531  7.20
MT6753     2699665    1.10?   572   1.35  4298  7.51   587   1.30  4282  7.29
Hi6210     2704356    1.20    630   1.36  3473  5.51   626   1.27  2156  3.44

The table below shows selected floating point and memory results.

SoC        Geekbench  Clock  Mandelbrot (float)       Stream Copy (memory)
           ref        speed  Single IPC   Multi Par   Single Multi

MSM8939    2704276    1.65    661   1.17  4019  6.08    512   569
MT6752     2709869    1.69    714   1.24  5637  7.89   1024  1158
MT6753     2699665    1.10?   463   1.23  3597  7.77    802   958
Hi6210     2704356    1.20    506   1.24  3419  6.76    833  1030

IPC values are fairly consistent for MT6752, Hi6210 and MT6753 (when a likely clock speed of 1.1 GHz is assumed), but Snapdragon 615 consistently shows somewhat lower IPC, possibly related to the earlier revision (r0p1) of the Cortex-A53 core used. It is also possible that, similar to what seems to be the case for the MT6753 entry used (Meizu M2 Note), the actual maximum CPU clock speed is lower than the one advertised and reported to Geekbench.

Multi-core performance scaling approaches 8.0 for the MediaTek chips, which can be expected due to the symmetrical CPU cluster configuration. Multi-core scaling for Kirin 620 is lower than expected for the integer tests, especially Lua, possibly due to L2 cache performance constraints.

Snapdragon 615 shows a lower scaling factor because half of its cores are clocked at a lower speed. The Lua scaling is particularly low, however: the multi-core score is in fact often worse than the single-core result, and only modestly higher in other cases. This could be due to L2 cache constraints for one of the clusters and associated synchronisation issues in the multi-threading implementation used by the Geekbench test.
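
As a rough sanity check on these scaling factors, the sketch below computes an idealized upper bound on multi-core scaling from the cluster clock speeds alone. It assumes throughput scales linearly with clock speed and that the single-core run uses the fastest core. The 1.0 GHz figure for the low-power cluster of Snapdragon 615 is a hypothetical placeholder, since this post only says that cluster runs at a significantly lower clock speed.

```python
def ideal_scaling(core_clocks_ghz):
    """Clock-based upper bound on the multi-core/single-core score ratio."""
    fastest = max(core_clocks_ghz)
    return sum(clock / fastest for clock in core_clocks_ghz)


# Symmetric octa-core configuration (MT6752-style): all cores at one speed.
symmetric = ideal_scaling([1.69] * 8)

# Two clusters of four cores; 1.0 GHz is an assumed value, not from the source.
asymmetric = ideal_scaling([1.65] * 4 + [1.0] * 4)

print(round(symmetric, 2), round(asymmetric, 2))
```

Under the assumed low-cluster clock the bound is roughly 6.4, so the Mandelbrot scaling of 6.08 looks plausible, whereas the Lua value of 0.85 is far below anything clock speeds alone could explain.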

Looking at memory performance, MT6752 has the highest performance, closely followed by MT6753 and Hi6210. Qualcomm's Snapdragon 615 is well behind, probably due to the older/slower interconnect bus used.

MT6753 benchmark results suggest performance issue


Even though a clock speed of 1.30 GHz is reported to Geekbench by the operating system in the MT6753-equipped Meizu M2 Note, actual Geekbench subtest results are not consistent with a Cortex-A53 core running at that clock speed. There is variability in the results between different runs, which could be caused by thermal throttling. Many of the results seem to correspond to an effective clock speed of approximately 1.10 GHz, although for some runs the score of certain tests (including JPEG Compress) does approach the level expected for a clock speed of 1.3 GHz. Most of the time however, performance is significantly lower than expected, as if the clock speed is throttled to around 1.1 GHz for long periods of time.
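
One way to arrive at such an effective clock speed is to scale against a chip with the same CPU core whose clock speed is trusted. The sketch below uses the MT6752 single-core scores from the tables above as the reference. It assumes scores scale linearly with clock speed and that MT6752 really runs at the 1.69 GHz listed in the table; the function name is hypothetical.

```python
def effective_clock_ghz(score, ref_score, ref_clock_ghz):
    """Clock speed at which the reference chip would produce `score`."""
    return score / (ref_score / ref_clock_ghz)


# Single-core scores: (MT6753, MT6752) from the octa-core tables above.
pairs = {
    "JPEG Compress": (572, 890),
    "Lua": (587, 907),
    "Mandelbrot": (463, 714),
}

estimates = {
    name: effective_clock_ghz(score, ref, 1.69)
    for name, (score, ref) in pairs.items()
}

for name, clock in estimates.items():
    print(name, round(clock, 2))
```

All three subtests land close to 1.1 GHz, well below the reported 1.3 GHz, which is consistent with the throttling hypothesis.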

The lower than expected performance could be related to the manufacturing process. The MT6753 was designed with cost-reduction in mind, and may use TSMC's 28LP process which has low cost but lower performance. Qualcomm's Snapdragon 410 and 615 also use this process, limiting their performance (and in the case of Snapdragon 615 resulting in heat production). MT6753 was announced as supporting a clock speed up to 1.5 GHz, and the lower-than-expected attainable clock speed may force MediaTek to adjust the specifications for the chip if the issue is not resolved.

Sources: Geekbench browser

Updated 6 June 2015.

Thursday, April 30, 2015

More details emerge about Cortex-A72 CPU core

Recently, more details have become available about the performance improvements implemented in ARM's Cortex-A72 core, which is a replacement for the high-performance Cortex-A57 core. Apart from the gains from using a more advanced process such as 14/16 nm FinFET, Cortex-A72 also implements fairly significant micro-architectural improvements affecting performance per cycle and power efficiency. AnandTech has published a detailed overview of these improvements.

Cortex-A57 based on Cortex-A15 and not fully optimized for power-efficiency


The Cortex-A57 CPU core, which was announced in 2012, has significant similarities to Cortex-A15, ARM's long-standing high-performance 32-bit CPU core, which has been known for relatively high power consumption. As such, it is not unexpected that improvements on the Cortex-A57 architecture (in the form of the Cortex-A72) have proven to be possible. Cortex-A57-based SoCs such as Snapdragon 810 have been known to throttle, being forced to reduce the clock speed due to excessive heat production and power use, resulting in reduced sustained performance. Apple's A7 and A8 processors use CPU cores that most likely have strong similarities with Cortex-A57, but which exhibit little throttling due to a lower maximum clock speed, a lower number of cores and other factors related to the chip design.

Increased level of sustained performance


ARM has made available a number of slides detailing the improvements in sustained performance and power efficiency in Cortex-A72 over Cortex-A57. On a 28 nm process and at a similar clock speed, ARM's charts indicate a power reduction of roughly 20%.

Sustained performance is expected to be higher than Cortex-A57, implementations of which (such as Snapdragon 810 and Exynos 5433, and to a lesser degree Exynos 7420) have been unable to maintain high clock speeds, throttling back to a relatively low speed due to heat production and associated power consumption. ARM gives a figure of sustained 750 mW operation per core on a 16FF+ process with a clock speed around 2.5 GHz.

In terms of IPC (instructions per cycle), ARM's information shows improvements in all instruction-level performance segments, with a 1.16x improvement for "analytics", 1.38x for cryptography, 1.50x for memory, 1.26x for floating point and 1.16x for integer compute. The increase in memory performance appears to be significant.

Improved single-core performance evident in early Geekbench results


Early Geekbench results for the MT8173 SoC from MediaTek, which includes two Cortex-A72 cores, give an indication of practical performance of the Cortex-A72 core, although the exact clock speed the Cortex-A72 cores are running at is hard to determine. The following table shows single-core performance from a recent MT8173 Geekbench entry, comparing it to Exynos 7420 as used in the Samsung Galaxy S6. Both use 64-bit AArch64 mode.

SoC                        JPEG   Dijkstra  Lua   Mandelb. Stream SGEMM SFFT
                           Compr.                          Copy
28nm? MT8173 (Cortex-A72)  1429    1287     1675  1750     2217    979  1345
14nm Exynos 7420           1475    1082     1409  1147     1993    954  1379

The MT8173 easily matches the single-core performance of Exynos 7420, while showing significant improvements in the Mandelbrot floating point subtest and the memory-intensive Dijkstra subtest, and also the Lua subtest. Memory subtest (Stream Copy) performance is also better than Exynos 7420, despite the likely much wider memory interface of the latter, providing clear evidence of the improved memory performance (largely due to smarter prefetching) in Cortex-A72. Overall, since the MT8173 results reflect a SoC using 28 nm or perhaps 20 nm process technology, while Exynos 7420 uses Samsung's leading-edge 14 nm FinFET process, the ability of the MT8173 to beat Exynos 7420 in single-core performance while using a less advanced process is impressive and illustrates the performance improvements in the Cortex-A72 core.
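
The relative differences are easier to see as ratios. The following short sketch divides the MT8173 scores by the Exynos 7420 scores from the table above. The scores are copied from the table; the helper name is hypothetical and nothing is corrected for clock speed, since the MT8173 clock is not known.

```python
subtests = ["JPEG Compress", "Dijkstra", "Lua", "Mandelbrot",
            "Stream Copy", "SGEMM", "SFFT"]
mt8173 = [1429, 1287, 1675, 1750, 2217, 979, 1345]
exynos_7420 = [1475, 1082, 1409, 1147, 1993, 954, 1379]


def relative_scores(scores, reference):
    """Element-wise ratio of two equally ordered score lists."""
    return [score / ref for score, ref in zip(scores, reference)]


ratios = dict(zip(subtests, relative_scores(mt8173, exynos_7420)))

for name, ratio in ratios.items():
    print(f"{name:14s} {ratio:.2f}")
```

A ratio above 1.0 means the MT8173 entry scored higher. Mandelbrot stands out with a ratio above 1.5, while JPEG Compress and SFFT are slightly below 1.0.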

Reduced silicon area results in lower cost


Cortex-A72 has a silicon area that is 10% smaller than Cortex-A57 on an equivalent process, while delivering improvements in performance and power efficiency. Already SoCs have been announced or described that utilize Cortex-A72 cores, such as MediaTek's MT8173 for tablets, Qualcomm's Snapdragon 618 and 620 for smartphones, and MediaTek's MT6797 (Helio-X20) for smartphones.

There seems to be a clear trend of using just two Cortex-A72 cores (instead of the four cores used in many Cortex-A57 implementations), reducing cost and maximum power consumption. These cores are augmented by low-power, small-area Cortex-A53 cores running at a lower frequency. MT8173, Snapdragon 618 and Helio-X20 all use such a configuration.

Use of Cortex-A72 may be more effective than high-clocked Cortex-A53 cores


There are indications that Cortex-A53 cores running at a high frequency (such as implemented in MediaTek's MT6752 and MT6795 (Helio-X10), HiSilicon's Kirin 930 and to a lesser degree in Snapdragon 615 and the announced Snapdragon 415 and 420) run into a power efficiency bottleneck at higher clock speed, due to the relatively steep increase in power consumption as the clock speed of the Cortex-A53 core increases above 1.3-1.5 GHz. Solutions that combine a small number of Cortex-A72 cores with lower-clocked, power-efficient Cortex-A53 cores may prove to be a sweet spot in terms of practical performance and power efficiency for mid-range SoCs.

Source: AnandTech (Cortex-A72 Architecture Details article), Geekbench Browser

Thursday, April 16, 2015

HiSilicon introduces Kirin 930/935, a performance-oriented Cortex-A53-based SoC

Huawei has introduced the Huawei P8 and P8max smartphones, featuring the Kirin 930 and Kirin 935 SoCs from Huawei's HiSilicon semiconductor division. The octa-core Kirin 930 SoC is a performance-oriented SoC featuring only Cortex-A53 CPU cores. With a maximum clock frequency in excess of 2.0 GHz, it bears similarities to MediaTek's MT6795, but the use of a pseudo big.LITTLE configuration (four Cortex-A53 cores clocked up to 2.0 GHz and four Cortex-A53 cores clocked up to 1.5 GHz, for a total of eight cores) is reminiscent of Qualcomm's midrange Snapdragon 615 SoC, which runs at lower clock frequencies.

Huawei also introduced high-end models of both the P8 and P8max with larger storage capacity featuring the Kirin 935 SoC, which is a higher-clocked version of Kirin 930. The Huawei P8max is a smartphone with an unusually large 6.8" display.

SoC is targeted at performance-oriented devices


The Huawei P8 models are higher-priced performance-oriented smartphones, and the characteristics of the SoC match this segment. Apart from the high maximum clock speed of the Cortex-A53 cores, the external RAM interface is likely to be a dual-channel 32-bit configuration like previous performance-oriented SoCs from HiSilicon. Presentation materials from Huawei describe the Cortex-A53 cores in the faster cluster of four CPUs as being of a special, performance-enhanced type, which probably reflects the application of ARM's PoP core-hardening technology whereby the core is optimized for running at a specific frequency and a particular power profile, trading performance against die size. The process technology used is likely to be TSMC's proven 28HPM process.

The SoC is reminiscent of MediaTek's recently introduced MT6795 (Helio-X10), which also targets the performance segment with an octa-core Cortex-A53 CPU configuration. MediaTek's SoC has been reported to have been adopted by competitors of Huawei such as HTC and Xiaomi.

Previous generation Mali-T628 MP4 GPU used


Rather than using an updated current-generation GPU like Mali-T760, the spec sheet for the P8max indicates the Kirin 930/935 SoCs continue to use the Mali-T628 MP4 GPU that was previously used in the Kirin 920 SoC. This GPU core is not known for great power efficiency, although there are suggestions that the more efficient Mali-T760 (which features memory bandwidth optimizations) has a relatively high silicon area and cost.

HiSilicon's new SoC line-up uses only Cortex-A53 CPU cores


Apart from Kirin 930, HiSilicon has also introduced the Kirin 620 SoC, which is an octa-core Cortex-A53 based SoC for the cost-sensitive segment, clocked up to 1.2 GHz and with a single-channel memory interface. This means Huawei now has in-house Cortex-A53-based SoCs suitable for most of its smartphone product range.

Thursday, March 19, 2015

Qualcomm releases new variant of Snapdragon 410 that supports ARMv8, targeting tablets and other applications

Qualcomm recently made announcements of products and reference designs based on the APQ8016 SoC, a new modem-less quad-core Cortex-A53-based SoC branded as Snapdragon 410. The chip is targeted at IoT applications, development boards and probably also Wi-Fi-only tablets, supporting Linux, Android and Windows 10. Although branded as Snapdragon 410, the chip is a new design that is likely to fix most of the performance deficiencies of the first-generation MSM8916 Snapdragon 410 SoC that has been targeted at smartphones. For example, the original Snapdragon 410 SoC appears not to support ARMv8 at all, while the new chip is clearly targeted at 64-bit platforms.

Development board released


Qualcomm recently announced the DragonBoard 410c, a development board with support for Linux and Android. It features a quad-core 1.2 GHz Cortex-A53 processor with Adreno 306 GPU, 533 MHz LPDDR2/LPDDR3 SDRAM, HDMI output and several I/O interfaces. The HDMI output is limited to 30fps at 1080p.

The board is designed to be compatible with the 96Boards initiative from Linaro, the non-profit engineering organization developing open source software for the ARM architecture.

With 64-bit support and a maximum clock speed of 1.2 GHz, the APQ8016 SoC that is used on the board most likely uses a more recent version of the Cortex-A53 core than the original Snapdragon 410 processor for smartphones, while being manufactured using the same 28LP process at TSMC.

New SoC probably targets tablets as volume driver


There are indications that the new chip will be used in Wi-Fi-only tablets, such as the recently announced Samsung Galaxy Tab A series. There have also been indications that Qualcomm is stepping up its efforts to target Chinese tablet manufacturers.

Qualcomm and MediaTek support mainline Linux kernel with open-source drivers for selected SoCs


Whereas in the past major smartphone SoC companies kept their closed-source drivers separate from the open-source Linux community, more recently companies such as Qualcomm and MediaTek have started releasing open source contributions for the Linux kernel to support selected SoC products. Both companies have also recently joined Linaro, the engineering organization developing open source software for the ARM architecture.

For both companies, the SoCs supported in the mainline Linux kernel are applications processors without an integrated modem. Qualcomm is supporting the APQ8016 mentioned above while MediaTek has contributed code for the MT8173 tablet processor.

Sources: Qualcomm (Dragonboard announcement), Qualcomm (Windows 10 IoT platform announcement), CNXSoft (DragonBoard 410c article)

Tuesday, March 10, 2015

Qualcomm's Snapdragon 808 fixes flaws of Snapdragon 810

Snapdragon 808 (MSM8992) is a performance-oriented SoC that Qualcomm announced last year together with Snapdragon 810. It has similarities to Snapdragon 810 (MSM8994), including the use of ARM Cortex-A57 CPU cores and Cortex-A53 cores in a big.LITTLE configuration. Snapdragon 808 appears to fix some of the performance flaws that are apparent in Snapdragon 810, especially the memory subsystem, while being significantly less costly.

Snapdragon 808 features


Features and differences with Snapdragon 810 include:

  • Snapdragon 808 has only two Cortex-A57 cores (revision r1p2) compared to four Cortex-A57 cores (revision r1p1) for Snapdragon 810. Both contain four Cortex-A53 cores.
  • Snapdragon 808 has a more economical dual-channel LPDDR3 memory interface, compared to the LPDDR4 interface of Snapdragon 810.
  • Snapdragon 808 has an Adreno 418 GPU, compared to Adreno 430 in Snapdragon 810, presumably with somewhat lower performance.
  • Manufactured on TSMC's 20 nm process, the same as Snapdragon 810.
  • 4K resolution video playback (H.264/H.265), on-device display resolution up to 2560x1600 (Snapdragon 810 theoretically supports 4K on-device display resolution, but all currently announced smartphones using Snapdragon 810 are limited to a resolution of 1920x1080).

 

Early benchmark results suggest Snapdragon 808 fixes performance flaws of Snapdragon 810


Early benchmarks for Snapdragon 808 have already appeared on the Geekbench Browser. We can compare Snapdragon 808's single-core performance with Snapdragon 810 and Exynos 7420, all of which run in AArch64 mode in the published benchmark results.

To reduce the impact of thermal throttling, the best Geekbench subtest results for a given device have been collected and combined in the table below. I have made an attempt to estimate the actual maximum clock speed of the Cortex-A57 cores during the benchmarks, partly based on the maximum frequency reported by Geekbench when it appears to apply to the "big" cores and not the "LITTLE" cores.

SoC          "big" CPU                    Arch     JPEG (int)  Lua (int)   Mandelb. (float)
                                                   Comp. IPC         IPC         IPC

MSM8992      2 x 1.69? GHz Cortex-A57r1p2 AArch64  1257  1.96  1385  1.99  1031  1.79
MSM8994      4 x 1.8? GHz Cortex-A57r1p1  AArch64  1358  1.96  1283  1.73  1100  1.79
Exynos 7420  4 x 1.97 GHz Cortex-A57r1p0  AArch64  1486  1.96  1409  1.74  1198  1.78

MT6795       8 x 1.95 GHz Cortex-A53r0p2  AArch64  1026  1.37  1053  1.31   823  1.24
MT6795T      8 x 2.16 GHz Cortex-A53r0p2  AArch64  1128  1.36  1173  1.32   912  1.24

The IPC figures are calibrated on the Cortex-A7 core, whose IPC is fixed at 1.00. Fixing the maximum clock speed at 1.8 GHz for the MSM8994 (Snapdragon 810) results (based on HTC One M9 entries) and at 1.69 GHz for the MSM8992 (Snapdragon 808) produces similar IPC figures for the JPEG Compress integer test and the Mandelbrot floating point test, making them reasonably plausible. The best Lua subtest result for the MSM8992 shows a higher IPC, which may reflect improved L2 cache performance in the MSM8992, which uses a later revision of the Cortex-A57 core.
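
The plausibility check described here can be sketched by comparing score per GHz across the three Cortex-A57-based chips under the assumed clock speeds. The clock speeds for MSM8992 and MSM8994 are the estimates from the table, not confirmed values, and the helper names are hypothetical.

```python
# (assumed clock in GHz, JPEG Compress, Lua, Mandelbrot): best single-core scores
rows = {
    "MSM8992": (1.69, 1257, 1385, 1031),
    "MSM8994": (1.80, 1358, 1283, 1100),
    "Exynos 7420": (1.97, 1486, 1409, 1198),
}
columns = {"JPEG Compress": 1, "Lua": 2, "Mandelbrot": 3}


def per_ghz(column):
    """Score per GHz for every SoC in the given score column."""
    return {soc: vals[column] / vals[0] for soc, vals in rows.items()}


def spread(values):
    """Relative gap between the highest and the lowest value."""
    vals = list(values)
    return (max(vals) - min(vals)) / min(vals)


spreads = {name: spread(per_ghz(col).values()) for name, col in columns.items()}

for name, value in spreads.items():
    print(f"{name:14s} {value:.1%}")
```

With these clock estimates, the per-GHz scores for JPEG Compress and Mandelbrot agree within about 1.5%, while Lua differs by roughly 15% in favour of MSM8992, matching the higher Lua IPC in the table.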

The single-core CPU performance results show no surprises, with Snapdragon 808 showing good performance that is slightly lower than Snapdragon 810, proportional to the lower maximum clock frequency in the tested devices. However, the Lua test shows higher performance with Snapdragon 808, which is especially true for the multi-core test (results not shown), where Snapdragon 810 seems to be limited to a score of about 1200 with little gain when compared to single-core performance, while Snapdragon 808 consistently scores in the region of 4000.

Memory subsystem performs much better than Snapdragon 810


The following table lists Geekbench scores for some memory-dependent tests. 

SoC          "big" CPU                    Arch     Stream Copy    SGEMM   SFFT    SGEMM  SFFT
                                                   Single  Multi  Single  Single  Multi  Multi
MSM8992      2 x 1.69? GHz Cortex-A57r1p2 AArch64  1527    1733   767     1126    1678   2946
MSM8994      4 x 1.8? GHz Cortex-A57r1p1  AArch64  1428    1838   741     1009    1870   3649
Exynos 7420  4 x 1.97 GHz Cortex-A57r1p0  AArch64  2003    2622   957     1363    2888   5014

MT6795       8 x 1.95 GHz Cortex-A53r0p2  AArch64  1356    2068   484     618     1542   4764
MT6795T      8 x 2.16 GHz Cortex-A53r0p2  AArch64  1350    2140   529     694     1659   5333

Notably, Snapdragon 808 delivers memory performance similar to Snapdragon 810 at much lower cost, despite using only a regular LPDDR3 memory interface, as compared to the Snapdragon 810's LPDDR4 memory interface which in theory delivers almost twice the bandwidth. This provides clear evidence that the Snapdragon 810's memory interface is still flawed, while that of Snapdragon 808 is much more optimized. Snapdragon 808 even beats Snapdragon 810 in the single-core SGEMM and SFFT tests, despite running at a lower clock speed, which probably also reflects a more optimized and functional memory controller. Even in the multi-core SGEMM and SFFT tests, Snapdragon 808 is not much behind Snapdragon 810 despite having only half the number of Cortex-A57 cores.

Comparison with MT6795


In the marketplace, Snapdragon 808 may compete with MediaTek's MT6795 (Helio-X10), which is a cost-effective performance-segment SoC that only uses Cortex-A53 cores. Comparing Geekbench subtest results, MT6795 scores significantly lower than Cortex-A57-based SoCs such as Snapdragon 808 in single-core benchmarks, although the gap is not very large except in the SFFT benchmark. The MT6795 does relatively well in multi-core benchmarks, where it beats the Cortex-A57-based Snapdragon 808 and Snapdragon 810 in most cases by a considerable margin, especially in the JPEG Compress, Lua and Mandelbrot tests which are sensitive to the number of CPU cores (multi-core scores have not been listed for these tests in the tables above). As an example, MT6795 scores 8167 in the multi-core JPEG Compress test, twice the score of Snapdragon 808 and almost 40% higher than Snapdragon 810.

Conclusion


Snapdragon 808 appears to be a much more optimized, less flawed SoC product than Snapdragon 810 that may perform similarly or even better than Snapdragon 810 in practical use cases due to the performance flaws present in Snapdragon 810. At the same time, Snapdragon 808 is likely to be considerably cheaper. The only caveat is the question of whether excessive heat production makes thermal throttling necessary to the same degree as Snapdragon 810. With only two Cortex-A57 cores, the SoC should be less problematic in this regard.

Source: Geekbench Browser (MSM8992 results), Geekbench Browser (MSM8994 results), Qualcomm (MSM8992 specifications)

Updated 15 March 2015.

Early benchmarks appear for Cortex-A72-based SoC

ARM recently announced the new Cortex-A72 processor core, which is an improved version of the existing high-performance Cortex-A57 processor core.

Alongside the Cortex-A72 CPU core, ARM also announced the CCI-500 interconnect technology as well as the high-end Mali-T880 GPU. Devices incorporating the combination of these technologies are expected to become available in 2016.

However, SoCs using the Cortex-A72 CPU are likely to become available earlier. Qualcomm and MediaTek have both announced SoCs using the Cortex-A72 core with commercial availability in the second half of 2015, suggesting that the CPU core itself is at an advanced stage of introduction. Already, early benchmarks for MediaTek's MT8173 tablet SoC that incorporates the Cortex-A72 have become available.

Cortex-A72 appears to be enhanced version of Cortex-A57 optimized for next-generation processes


In its announcement press release from 3 February 2015, ARM claims that more than ten partners have already licensed Cortex-A72, including HiSilicon, MediaTek and Rockchip. Cortex-A72 is based on ARM's ARMv8-A instruction set architecture, and can be combined with the existing Cortex-A53 in a big.LITTLE configuration. Cortex-A72 seems to be positioned as a replacement for Cortex-A57. The similarities with Cortex-A57 are very apparent, for example in the identically sized L1 instruction and data caches, and a feature set that is otherwise very similar.

On a 16 nm FinFET process, the core can sustain operation at speeds up to 2.5 GHz within the constraints of a mobile power envelope (e.g. smartphones), with scalability to higher speeds for larger form-factor devices. However, the first announced devices, such as MediaTek's MT8173, appear to use older processes such as the tried-and-trusted 28 nm HPM process at TSMC, so they are likely to have a lower maximum clock speed.

ARM claims increased performance and power efficiency, although these claims seem to be based on implementation on next-generation processes such as 16 nm FinFET that deliver a significant intrinsic improvement in these metrics. ARM mentions micro-architectural improvements that result in enhancements in floating point, integer and memory performance. When implemented on a 16 nm FinFET process, ARM expects Cortex-A72 to provide 85% higher performance when compared to the Cortex-A57 core on a 20 nm process within a similar smartphone power budget.

Overall, the differences with Cortex-A57 appear to be relatively minor, so that Cortex-A72 is best viewed as an enhanced version of Cortex-A57 that is optimized for next-generation processes such as 16 nm FinFET. Nevertheless, the first SoCs to use the Cortex-A72 core will be manufactured using a less advanced process.

Benchmarks appear for MediaTek's MT8173


MediaTek's MT8173 is a mid-range tablet processor mainly targeting Wi-Fi-only tablets, since it does not have an integrated modem. It has two Cortex-A72 cores and two Cortex-A53 cores in a big.LITTLE configuration. Since the chip is probably manufactured using the established 28HPM process at TSMC, the maximum clock speed of the Cortex-A72 cores is likely to be lower than the target for 16 nm FinFET. MediaTek claims a clock speed up to 2.4 GHz, but a much lower frequency is apparent in early benchmark results.

The chip also features a PowerVR GX6250 GPU, which delivers higher performance than the G6200 GPU used inside MediaTek's existing MT8135 and MT6795.

Recently, early benchmarks for an MT8173 development board have appeared both in the Geekbench Browser and in the results database of GFXBench. The first Geekbench results already appeared in December 2014. The latest set of Geekbench results dates from the end of February 2015, although they do show a certain amount of variation that may reflect thermal throttling.

Single-core performance good, but not spectacular


As expected, the Geekbench results show good single-core performance, albeit not spectacular. As shown in the following table, single-core performance is in line with Cortex-A57-based SoCs such as Exynos 5433 and Exynos 7420. It should be noted that the MT8173 test SoC is most likely manufactured at 28 nm with a correspondingly low maximum CPU clock speed, while Exynos 5433 and 7420 are manufactured using smaller leading-edge processes at Samsung.


SoC          "big" CPU                    Arch     JPEG (int)  Lua (int)   Mandelb. (fp)
                                                   Comp. IPC         IPC         IPC
MT8173       2 x 1.6? GHz Cortex-A72      AArch32  1310  2.13  1380  2.10  1064  1.95
Exynos 5433  4 x 1.80 GHz Cortex-A57r1p0  AArch32  1456  2.10  1397  1.89  1174  1.91
Exynos 7420  4 x 1.97 GHz Cortex-A57r1p0  AArch64  1481  1.97  1409  1.74  1198  1.92

In this table, to determine the IPC index I have made an educated guess about the actual clock speed of MT8173 when running the benchmarks. Although Geekbench reports a 1.40 GHz clock speed (which probably applies to the Cortex-A53 cores), 1.6 GHz seems to be a good match, providing just a little better IPC than Cortex-A57. Note that Exynos 7420 runs in AArch64 mode, which skews direct IPC comparisons.
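The clock-speed guess can be sanity-checked with a little arithmetic. The following Python sketch divides the single-core JPEG Compress score by an assumed clock speed and compares the result with the Exynos 5433 row of the table. The function name and the list of candidate clocks are illustrative assumptions, not values reported by Geekbench.

```python
# Single-core JPEG Compress scores and clocks taken from the table above.
EXYNOS_5433_SCORE = 1456
EXYNOS_5433_GHZ = 1.80
MT8173_SCORE = 1310


def per_clock_ratio(score, ghz, ref_score, ref_ghz):
    """Return the score per GHz of a chip relative to a reference chip."""
    return (score / ghz) / (ref_score / ref_ghz)


# Candidate clock speeds for the Cortex-A72 cores (assumed, for illustration only).
for candidate_ghz in (1.4, 1.6, 1.8, 2.0):
    ratio = per_clock_ratio(MT8173_SCORE, candidate_ghz,
                            EXYNOS_5433_SCORE, EXYNOS_5433_GHZ)
    print(f"{candidate_ghz:.1f} GHz -> {ratio:.2f}x Cortex-A57 per-clock throughput")
```

With a 1.6 GHz assumption the ratio comes out just above 1.0, which corresponds to the "just a little better IPC" reading. A 1.4 GHz assumption would instead imply a per-clock gain of roughly 16%.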

Practical implications unclear


Without knowing the exact clock speed of the Cortex-A72 cores, it is hard to draw conclusions about the actual IPC improvement over Cortex-A57. If the MT8173 uses a 28 nm process, the ability to approach the single-core performance of Samsung's Exynos 7420, manufactured using a 14 nm FinFET process, is impressive. However, although MediaTek demonstrated the MT8173 in an actual tablet at MWC, it is unclear what kind of device the Alps development board in the benchmark entries actually represents, so it remains to be seen whether the benchmarks actually reflect the power budget of a tablet.

The multi-core performance reported is not very impressive, as expected because of the relatively small number of CPU cores. The JPEG Compress multi-core score shows a CPU scaling factor of 2.72, which is good and implies utilization of the Cortex-A53 cores. The Mandelbrot floating point benchmark shows similar scaling.

However, the Lua integer benchmark has a very low multi-core scaling factor of 1.41, which is lower than expected, even when allowing for the limited number of cores. For example, MediaTek's MT6795 achieves multi-core scaling of 7.5 in this benchmark, and the Exynos chips range from 3.9 to 5.0. Other chips with a low multi-core scaling factor for Geekbench's Lua subtest include Snapdragon 810 (Cortex-A57-based), MediaTek's MT6595 (Cortex-A17-based) and NVIDIA's Denver-based Tegra-K1 SoC. There are indications that this benchmark test heavily depends on on-chip cache (primarily L2 cache) size and speed.

GPU performance of MT8173's PowerVR GX6250 GPU improves on G6200


The MT8173 test device's GPU performance as shown in the GFXBench results database is not overly impressive, but suitable for a mid-range chip and an improvement over the PowerVR G6200 GPU used in other MediaTek SoCs such as MT6595 and MT6795. In the T-Rex Offscreen benchmark, the MT8173 registers a score of 1487, higher than the 1311 of the MT6595 (G6200)-equipped Meizu MX4. In the GFXBench 3.0 low-level tests, alpha blending scores higher than the MT6595 while the other low-level scores are comparable.

Sources: ARM (Cortex-A72 announcement press release), AnandTech (MediaTek MT8173 article), MediaTek (MT8173 announcement), Geekbench Browser (MT8173 test device results), GFXBench (MT8173 test device result)

Updated 10 March 2015.

Tuesday, March 3, 2015

A detailed comparison of Cortex-A53-based and other SoCs using Geekbench, and impact of AArch64

More Cortex-A53 CPU core-based SoCs have recently come to market and more benchmark results are now available, for example from the Geekbench results database. Firmware is also becoming more mature. This makes it possible to make better comparisons between different Cortex-A53-based SoCs (for example, octa-core SoCs) and compare the performance of the highest-performance chips with competitive chips that use more expensive CPU cores such as Krait 400 and Cortex-A57.

Overview of Cortex-A53-based SoCs


The following is a list of Cortex-A53 CPU core-based mobile SoCs that have appeared in the market or for which benchmark results have become available. All chips integrate 4G LTE modem functionality unless otherwise noted.

  • Snapdragon 410 (MSM8916), utilizing four early Cortex-A53r0p0 cores. Numerous cost-sensitive smartphones now use this chip. However, none of them appears to take any advantage at all of the new ARMv8 instruction set, with all of them running in ARMv7 compatibility mode. This is counter-intuitive because AArch32 (32-bit version of ARMv8), which is used by the other SoCs, already brings significant benefits. Snapdragon 410 generally performs significantly worse than other Cortex-A53-based SoCs, even when correcting for the low clock speed. This is also reflected in memory performance. The Adreno 306 GPU tends to be even a little slower than the Adreno 305 GPU in Snapdragon 400. The net result is a chip that is not much faster than Snapdragon 400 in many cases while having worse battery life.
  • Snapdragon 615 (MSM8939), equipped with an octa-core Cortex-A53r0p1 CPU configuration with four cores running (in practice) at 1.54 GHz or 1.50 GHz and four cores running at a lower maximum clock frequency (probably 1.0 GHz). This chip has appeared in an increasing number of new smartphone models. Runs in AArch32 mode. Performance is significantly lower than MediaTek's octa-core Cortex-A53-based SoCs, which can run all eight Cortex-A53 cores at the maximum frequency. Memory performance is improved from Snapdragon 410 but falls short of that of MediaTek's SoCs. The Adreno 405 GPU is fairly competitive, suitable for a mid-range SoC, although the 32-bit RAM interface of the SoC limits performance, especially at high resolutions. It is manufactured using TSMC's lower performance 28LP process. There have been reports that the chip gets hot with intensive use and requires throttling.
  • MediaTek MT6732, with a quad-core Cortex-A53r0p2 CPU configuration running at a maximum clock speed of 1.5 GHz. Devices using the chip are starting to become available, and tablets with the tablet version of this chip (MT8732) have also been announced. Although it has only four CPU cores, it has good performance, beating Snapdragon 615 in single-core performance at a similar clock speed, and memory performance is significantly higher. The Mali-T760 MP2 GPU contributes to better GPU performance than previous MediaTek chips targeting cost-sensitive segments, although falling short of that of Snapdragon 615 and MT6752.
  • MediaTek MT6752, featuring an octa-core Cortex-A53r0p2 CPU configuration with a maximum clock frequency of 1.69 GHz. Several devices have come to market using this chip, including the Meizu M1 Note. Performance is excellent, with high scores in the Geekbench CPU benchmark, considerably higher than Snapdragon 615 and beating high-end SoCs such as Snapdragon 801 in several metrics. The Mali-T760 MP2 GPU is clocked higher than that of the MT6732, resulting in good GPU performance, comparable to that of Snapdragon 615, as measured with GFXBench, although the 32-bit memory interface will be a bottleneck at high resolutions. Manufactured using TSMC's high-performance 28HPM process. A tablet version of the chip exists as MT8752.
  • MediaTek MT6795, with an octa-core Cortex-A53r0p2 CPU with clock speed up to 2.16 GHz. With a dual-channel memory interface and high resolution support, this SoC targets a higher performance segment than the previously mentioned chips, for which it can potentially offer much better performance/dollar because of the small die size of Cortex-A53 cores. Originally announced to become available in commercial devices before the end of 2014, it was delayed, but competitive benchmark scores for what appear to be more mature versions of the chip have recently shown up. It appears to be configured with full AArch64 mode. Performance is excellent, with single-core performance closing much of the gap with the high-end Snapdragon 801, while multi-core performance is significantly higher. There appears to be a "Turbo" version running the CPU up to 2.16 GHz, while the regular version clocks at 1.95 GHz. At the MWC on 2 March 2015, MediaTek apparently rebranded the MT6795 as Helio X10.
  • MediaTek's MT6735 is a SoC for entry-level smartphones for which benchmark results have not yet become available. It has a quad-core Cortex-A53 CPU configuration and a Mali-T720 GPU, a downgrade from the Mali-T760 GPU in MT6732. The recently announced MT6753, with eight Cortex-A53 cores running up to 1.5 GHz, is compatible with the MT6735 and also has a Mali-T720 GPU (probably MP4). Other chips that have shown up in product announcements include the MT8161 (probably the equivalent of the MT6735 without modem) and MT8165 (equivalent to MT8732 without modem).
  • Qualcomm has announced additional octa-core Cortex-A53-based chips, Snapdragon 415 and Snapdragon 425. These probably utilize a symmetrical Cortex-A53 configuration with all cores running at the same maximum clock frequency, unlike Snapdragon 615. Otherwise, the new SoCs are similar to Snapdragon 615, with the same Adreno 405 GPU. According to Qualcomm, devices using these chips will become commercially available in the second half of 2015.
  • Kirin 620 (Hi6210) from HiSilicon (Huawei) is an octa-core Cortex-A53r0p3-based SoC running up to 1.2 GHz. The GPU is a Mali-450 MP4. Although performance (including single-core performance) is better than Snapdragon 410, it is not as optimized as chips such as MT6752 and runs at a relatively low clock speed. Multi-core performance scaling is less than expected.

Geekbench integer and memory scores comparison


The following table provides details about selected Geekbench integer and memory benchmark scores for different Cortex-A53-based SoCs, and also other smartphone SoCs from Qualcomm, MediaTek and Samsung for comparison.

                Arch    Max freq. JPEG C. IPC   JPEG C. Dijkstra      Stream Copy   Geekbench
                                  Single  x A7  Multi   Single Multi  Single Multi  Ref. number

Snapdragon 410  ARMv7     1.19      596   1.30   2384     810   2135   431   492    1551964
Snapdragon 615  AArch32 1.50/1.0    820   1.42   4979     886   3646   572   703    2015694
MT6732          AArch32   1.50      843   1.46   3357    1041   3002  1001  1199    1546611
MT6752          AArch32   1.69      952   1.46   7554    1144   4483  1071  1191    1583540
MT6795          AArch64   1.95     1026   1.37   8167     990   3802  1356  2068    2002894
MT6795T         AArch64   2.16     1128   1.36   8962    1064   4109  1350  2140    1984431
Hi6210          AArch32   1.20      660   1.43   3501     744   2772   602   900    1999304

Snapdragon 400  ARMv7     1.19      462   1.01   1860     700   2132   534   551    1938063
Snapdragon 801  ARMv7     2.46     1347   1.42   5437    1174   3586  1931  2144    1491681
Snapdragon 805  ARMv7     2.65     1475   1.45   4105    1230   4058  2117  2910    1502687
Snapdragon 810  AArch64  ?/1.55    1358          5972    1073   3584  1428  1838    2017257
MT6582          ARMv7     1.30      506   1.01   2027     748   2354   250   396    2017732
MT6592          ARMv7     1.66      643   1.01   5086     891   3327   261   388    2000008
MT6595          ARMv7   2.20/1.69  1350   1.59   6080    1844   5612  1652  1986    1591744
Exynos 5430     ARMv7   1.80/1.3   1056   1.52   5140    1102   3918  1457  1559    1556780
Exynos 5433     AArch32   1.89     1456   2.10   6209    1523   5728  1396  1458    2017193
Exynos 7420     AArch64  ?/1.50    1481          7168    1065   4596  1953  2579    2012972

The low performance of Snapdragon 410 is apparent in the scores, with normalized IPC (instructions per cycle to the equivalent of a 1.0 GHz Cortex-A7) for the CPU-speed sensitive single-core JPEG Compress benchmark being lower than that of other Cortex-A53-based SoCs, probably due to being limited to ARMv7. The Dijkstra benchmark even scores lower on Snapdragon 410 than on an equivalently clocked Snapdragon 400, and memory performance is also lower.
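As a rough illustration of how such a normalized index can be derived, the sketch below divides each score by its clock speed and then by the score per GHz of a Cortex-A7 reference. Using the Snapdragon 400 row as that reference is an assumption made here; the exact baseline behind the table is not stated, so results can differ from the tabulated values by a few hundredths.

```python
# (single-core JPEG Compress score, max clock in GHz) from the table above.
CHIPS = {
    "Snapdragon 400": (462, 1.19),   # Cortex-A7, used here as the assumed reference
    "Snapdragon 410": (596, 1.19),
    "MT6752": (952, 1.69),
    "MT6795T": (1128, 2.16),
}


def normalized_ipc(score, ghz, ref_score, ref_ghz):
    """Score per GHz, expressed relative to a reference core's score per GHz."""
    return (score / ghz) / (ref_score / ref_ghz)


ref_score, ref_ghz = CHIPS["Snapdragon 400"]
for name, (score, ghz) in CHIPS.items():
    print(f"{name:15s} {normalized_ipc(score, ghz, ref_score, ref_ghz):.2f}")
```

Because the index divides out the clock speed, it isolates per-cycle efficiency: a chip can post a higher raw score and still show a lower index, as the AArch64 entries for MT6795 do.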

Snapdragon 615, while improving on Snapdragon 410, also appears to be less optimized than MT6732/MT6752 in terms of single-core IPC, despite a very similar clock frequency. Looking at multi-core performance, MT6752 is significantly faster than Snapdragon 615, largely due to being able to run all eight cores at the maximum clock frequency. MT6732 and MT6752 also have significantly higher memory performance, reaching an impressive score for devices with a 32-bit memory interface.

The higher clock speed of MT6795 (Helio X10) brings benefits for integer performance, but due to the use of the AArch64 instruction set, normalized IPC is lower (1.36 vs 1.46 for JPEG Compress). This is especially true for the Dijkstra benchmark, where AArch64 mode imposes a significant penalty (this is also seen on other platforms utilizing AArch64).

Overall, a high-speed Cortex-A53 configuration such as implemented in the MT6795T comes fairly close to Snapdragon 801 for single-core performance, while being significantly faster for multi-core performance, at a significantly lower cost. Several metrics are also in the same ballpark as the current high-end leader Exynos 7420.

Analysis of the Geekbench Lua subtest


The Lua integer benchmark appears to be particularly sensitive to memory subsystem efficiency, including L2 cache size and memory bandwidth, as well as being dependent on CPU speed. It is the kind of code that may frequently occur in actual practice on a smartphone.

                Arch      Lua     IPC   Lua    CPU    #CPUs
                          Single  x A7  Multi  Par.

Snapdragon 410  ARMv7      603    1.23  2137   3.54   4
Snapdragon 615  AArch32    709    1.15  1644   2.32   4 + 4
MT6732          AArch32    753    1.22  2419   3.21   4
MT6752          AArch32    842    1.21  2361   2.80   8
MT6795          AArch64   1053    1.31  8203   7.79   8
MT6795T         AArch64   1173    1.32  8847   7.54   8
Hi6210          AArch32    587    1.19  1740   2.96   8

Snapdragon 400  ARMv7      476    0.97  1874   3.94   4
Snapdragon 801  ARMv7      980    0.97  2880   2.94   4
Snapdragon 805  ARMv7     1016    0.93  2917   2.87   4
Snapdragon 810  AArch64   1283          1065   0.83   4 + 4
MT6582          ARMv7      514    0.96  1644   3.20   4
MT6592          ARMv7      651    0.95  1344   2.06   8
MT6595          ARMv7     1509    1.67  2498   1.66   4 + 4
Exynos 5430     ARMv7      981    1.33  1861   1.90   4 + 4
Exynos 5433     AArch32   1397    1.89  5478   3.92   4 + 4
Exynos 7420     AArch64   1409          7088   5.03   4 + 4

In this test, Snapdragon 410 performs reasonably well. MT6752's multi-core performance seems limited by a bottleneck, probably external memory bandwidth. MT6795's performance is impressive; while single-core performance falls a little short of Cortex-A57 based SoCs, for multi-core performance it blows past them, with CPU parallelism fully exploited. It seems the bottleneck present with the MT6752 (presumably memory bandwidth and the L2 cache memory size available to each core) is not present with the MT6795.

Qualcomm's Snapdragon 810 consistently scores in the 1000-1200 range for both the single-core and multi-core test, while the multi-core test would have been expected to be significantly higher. This appears to reflect a serious deficiency in the memory subsystem of the SoC (which might not only be related to the LPDDR4 SDRAM controller, but also the on-chip L2 cache) which might also have negative implications for smoothness in everyday use.
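The "CPU Par." column in the Lua table can be reproduced by dividing the multi-core score by the single-core score. The short Python sketch below does this for a few rows; the function and dictionary names are illustrative only.

```python
# (single-core, multi-core) Lua scores from the table above.
LUA_SCORES = {
    "Snapdragon 410": (603, 2137),
    "MT6752": (842, 2361),
    "MT6795": (1053, 8203),
    "Snapdragon 810": (1283, 1065),
}


def scaling_factor(single, multi):
    """Multi-core score divided by single-core score."""
    return multi / single


for name, (single, multi) in LUA_SCORES.items():
    factor = scaling_factor(single, multi)
    note = " (multi-core slower than single-core)" if factor < 1.0 else ""
    print(f"{name:15s} {factor:.2f}{note}")
```

A factor close to the core count, as for MT6795, indicates that all cores contribute almost fully. A factor below 1.0, as for Snapdragon 810, means that adding cores made the test slower.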

Geekbench floating point subtests


Finally, let's look at floating point performance. The Mandelbrot subtest tests pure floating point performance, while the SGEMM and SFFT tests also significantly depend on memory performance.


                Arch      Mandelbrot                 SGEMM         SFFT
                          Single  IPC   Multi  Par.  Single Multi  Single Multi

Snapdragon 410  ARMv7      448    1.10  1794   4.00   245    489    317   1258
Snapdragon 615  AArch32    583    1.14  3611   6.19   303    688    426   2517
MT6732          AArch32    585    1.14  2336   3.99   337    653    430   1727
MT6752          AArch32    661    1.15  5257   7.95   384   1148    481   3870
MT6795          AArch64    823    1.24  6406   7.78   484   1542    618   4764
MT6795T         AArch64    912    1.24  7245   7.94   529   1659    694   5333
Hi6210          AArch32    467    1.14  3509   7.51   264    876    343   2178

Snapdragon 400  ARMv7      405    1.00  1620   4.00   203    634    285   1182
Snapdragon 801  ARMv7      788    0.94  3104   3.94   907   2816    992   3518
Snapdragon 805  ARMv7      848    0.94  3389   4.00  1011   2669   1130   4135
Snapdragon 810  AArch64   1100          5144   4.68   749   1828   1009   3643
MT6582          ARMv7      444    1.00  1765   3.98   230    512    328   1316
MT6592          ARMv7      568    1.00  4430   7.80   282    696    419   3397
MT6595          ARMv7     1284    1.71  5822   4.53   748   2337   1187   4255
Exynos 5430     ARMv7      990    1.61  4745   4.79   657   2491    896   3971
Exynos 5433     AArch32   1174    1.91  4883   4.16   751   2369   1044   4031
Exynos 7420     AArch64   1198          6129   5.12   945   2888   1313   4874

From these numbers it is clear that Cortex-A53 improves floating point performance somewhat when compared to Cortex-A7 at the same clock speed. When eight cores can run in parallel at high speed, multi-core floating point performance is impressive, as demonstrated by MT6752 and MT6795. Snapdragon 801 and 805 are looking a bit dated in this department.

In the memory-intensive SGEMM and SFFT tests, Snapdragon 400 comes close to Snapdragon 410, illustrating the lack of performance improvement by Snapdragon 410. In fact MediaTek's previous generation MT6582 matches the floating point performance of Snapdragon 410 across all tests.

The Cortex-A57 based SoCs have the highest single-core floating point performance, although the Cortex-A17-based MT6595 is also very strong. Exynos 5433 and Exynos 7420 beat Snapdragon 810 in most floating point tests, although the difference is not as large as it used to be with earlier results for Snapdragon 810.

Conclusion


It is clear that octa-core Cortex-A53-based SoCs can deliver strong performance at a relatively low cost, and this is particularly true for MediaTek's new chips, MT6752 and MT6795. The MT6795, with its higher clock speed and dual-channel memory interface, can match current high-end chips in most metrics, being not much slower in single-core performance while being superior in multi-core.

One open question is whether the high maximum clock frequency of the MT6795 and MT6795T, which deliver impressive performance/dollar, translates to acceptable power consumption and battery life. Power consumption for Cortex-A53 has been observed to increase quickly at higher frequencies on the Samsung-manufactured Exynos 5433, but MT6795 is manufactured on a different process at TSMC and probably makes use of specific design optimizations for high clock speeds (ARM POP IP core hardening technology) that make power consumption more acceptable.

Sources: Geekbench Browser

Updated 10 March 2015.

Wednesday, February 25, 2015

Early benchmarks for MT6795 show high performance, suggest use of eight Cortex-A53 cores

MediaTek originally announced the MT6795, a SoC targeting the premium-level and performance segments of the smartphone market, in July 2014, with expectations of devices being commercially available to end users before the end of 2014. However, the chip was delayed (problems with the memory controller were reported) and competitive benchmark results are only now beginning to surface for the chip.

According to the announcement, the SoC was to have an octa-core CPU configuration with clock speeds up to 2.2 GHz, a strong dual-channel memory interface with support for LPDDR3 up to 933 MHz, and 2K (2560x1600) display support. Other reports and information have suggested that it uses a PowerVR G6200 GPU, similar to the one used in MediaTek's MT6595, which can be seen as the 32-bit predecessor of the new chip.

Confusion about processor cores, octa-core Cortex-A53 seems likely


The actual CPU cores used inside the MT6795 continue to be a source of confusion. The chip was initially understood to have an octa-core Cortex-A53 CPU configuration clocked at a high frequency, but a purported leaked MediaTek product roadmap later surfaced that described the MT6795 as a big.LITTLE design that includes Cortex-A57 cores. However, a recent new entry in the Geekbench database suggests that the chip actually has eight Cortex-A53 cores as originally suspected, as the IPC (instructions per cycle) of the integer and floating point subtests would be hard to reconcile with Cortex-A57 cores being present.

Geekbench results show mixed performance but high overall score


The Geekbench results show strong CPU performance, with the overall score being superior to that of available results for Snapdragon 810, which has a significantly higher cost design but has been plagued by performance issues, although it scores lower than Exynos 5433/Exynos 7 Octa with Cortex-A57 cores as used in the Galaxy Note 4. Note that MT6795 uses a less advanced 28 nm process compared to the 20 nm process used for Snapdragon 810 and Exynos 5433.

Single-core integer performance is not spectacular and below that of the previous generation high-end chips such as Snapdragon 801. Although this is compatible with the use of medium-performance Cortex-A53 cores, integer single-core performance is actually lower than the mid-range MT6752, despite the higher clock rate, pointing to continuing hardware performance problems with the chip. The Dijkstra benchmark result is particularly low. This benchmark has a lot of external memory access and likely branches a lot, taxing certain elements of the CPU and SoC that simpler CPU benchmarks do not. It may be affected by the doubled address size in AArch64 mode, either through the increased size of pointer storage or reduced efficiency of the branch prediction unit inside the processor core.
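To make the pointer-storage argument concrete, the following Python sketch estimates the footprint of a hypothetical graph node holding two pointers and a 4-byte payload, padded to pointer alignment. The node layout and the 64-byte cache line are assumptions chosen for illustration; the actual data structures used by the Geekbench Dijkstra test are not known here.

```python
CACHE_LINE_BYTES = 64  # assumed cache line size, used only for illustration


def node_size(pointer_bytes, n_pointers=2, payload_bytes=4):
    """Size in bytes of a hypothetical graph node, padded to pointer alignment."""
    raw = n_pointers * pointer_bytes + payload_bytes
    padding = -raw % pointer_bytes
    return raw + padding


for mode, pointer_bytes in (("32-bit", 4), ("AArch64", 8)):
    size = node_size(pointer_bytes)
    print(f"{mode}: {size} bytes per node, "
          f"{CACHE_LINE_BYTES // size} whole nodes per cache line")
```

Under these assumptions a node doubles from 12 to 24 bytes, so fewer nodes fit in each cache line and in the L2 cache, which is one plausible way a pointer-heavy workload could lose performance in AArch64 mode.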

Single core floating point performance in the Mandelbrot benchmark is higher than the MT6752 and actually compatible with the Cortex-A53 core running at 2.1 GHz, close to the originally envisaged maximum clock speed for the MT6795. Multi-core performance in this subtest is impressive, with a score that is higher than most existing SoCs including Exynos 7 Octa, which employs faster Cortex-A57 cores.

Finally, the dual-channel memory interface seems to be working reasonably well in the tested revision of the chip/development board, with memory scores consistent with an optimized dual-channel interface, and higher, for example, than those of Exynos 5433. However, they are generally lower than those of the 32-bit MT6595.

One caveat is that the MT6795 entry is running in AArch64 mode, while the other devices were running in AArch32 (32-bit ARMv8) or 32-bit ARMv7 mode.

Average single-core CPU performance, strong multi-core performance


In a direct comparison with the MT6752, which has a comparable CPU configuration but clocked lower and has only a 32-bit memory interface, the MT6795 is only slightly faster, although the MT6795 uses a full 64-bit AArch64 instruction set model, while the tested MT6752 configurations use AArch32 with partial use of ARMv8 features. There are a few anomalous results, including a low score for the MT6795 in the single-core AES benchmark, and as mentioned it also scores significantly lower in the Dijkstra benchmark. Floating point performance is consistently higher for the MT6795 (more than the increase in clock rate would explain), which may be caused by the higher-performance memory subsystem of the MT6795 and/or the increased number of floating point registers available in AArch64 mode.

The MT6795 is clearly slower than its 32-bit predecessor MT6595 (which uses high-performance Cortex-A17 and Cortex-A7 cores in a big.LITTLE configuration) in most metrics, with only the heavy weighting and large performance gain for the AES and SHA1 cryptography tests (due to the new ARMv8 instruction set) shifting the advantage for the overall score towards the MT6795.

When making a comparison with a median entry for the high performance Exynos 5433 (Exynos 7 Octa) inside the Samsung Galaxy Note 4, the MT6795 fairly consistently shows clearly lower single-core performance but higher multi-core performance.

MT6795 likely to be most cost-effective performance segment processor on the market


The exclusive use of Cortex-A53 CPU cores, and not the much more expensive and die-space consuming Cortex-A57 (or, in a 32-bit comparison, Cortex-A15/A17 cores), has positive implications for the cost of the chip. Die space dedicated to the CPU cores will be relatively low, although L2 caches will take considerable space when configured with a size that matches the desired performance level and market segment. Overall, the chip is likely to be attractive in terms of performance/dollar for the performance segment.

In terms of SoC optimizations, the chip would probably work better with the employment of additional ARM IP such as a Mali-T760 or Mali-T800 series GPU, which offers advantages in combination with ARM cores such as Cortex-A53 in tandem with techniques such as AFBC, smart composition and transaction elimination, and new interconnect buses within the chip. SoCs like the MT6752 probably benefit from these optimizations, while the MT6795 cannot do so fully because of the non-ARM GPU. It seems likely that the MT6795 will be superseded in next generation products to be announced by MediaTek in the future by a similar SoC with an ARM Mali-T760 or T800 series GPU.

Update (2 March): Based on a closed-door presentation event at the MWC, MediaTek appears to have rebranded MT6795 as Helio X10 with future Helio P series products also being announced.

Sources: MediaTek (MT6795 announcement), Geekbench browser

Tuesday, February 17, 2015

Cortex-A53 not as power efficient as Cortex-A7

Recent detailed technical review articles published by AnandTech based on a comparison of Samsung Exynos SoCs have elucidated some of the details about the performance of the Cortex-A53 core, including processing performance, power consumption and die size. Overall, it appears that while Cortex-A53 is significantly faster than Cortex-A7 at the same clock speed, die size and power consumption on an equivalent manufacturing process have increased by a greater amount, leading to lower performance/Watt.

Direct comparison of Cortex-A7 and Cortex-A53 on the same process


In a recently published technical review article about the ARM Cortex-A53 and Cortex-A57 CPU cores and the Mali-T760 GPU core, based on Samsung's Exynos-based Galaxy Note 4 model, AnandTech has provided details about the performance, power consumption and die size of the 64-bit Cortex-A53 core relative to its 32-bit predecessor, Cortex-A7. It has done so by comparing measurements of the Cortex-A53 cores inside the Exynos 5433 used in the Note 4 with the Cortex-A7 cores inside the Exynos 5430 used in the Galaxy Alpha. Both SoCs are produced using a similar 20 nm process at Samsung, making a direct comparison possible.

Cortex-A7 is an in-order pipeline CPU core with moderate performance but an extremely small die size and very low power consumption. The Cortex-A53 core has been designed by ARM as a logical extension of Cortex-A7 to ARM's 64-bit ARMv8 instruction set with higher performance. However, in doing so die size and power efficiency have suffered somewhat.

CPU performance increased in Cortex-A53


According to the designer of Cortex-A53 at ARM, Cortex-A53 increases SPECint-2000 performance from 0.35 SPEC/MHz to 0.50 SPEC/MHz when compared to the Cortex-A7 core. In Geekbench integer benchmarks, disregarding cryptography benchmarks which show a large increase, performance is still about 50% higher for Cortex-A53 when compared to Cortex-A7 at the same clock speed, with the biggest gains coming with multi-threaded performance (aided by the increased memory performance).

For floating point benchmarks the performance increase reported by AnandTech is dramatic, with most benchmarks showing a two to three times performance increase. However, there seems to be a discrepancy between these benchmark results and benchmark results available from the Geekbench results database for Cortex-A53 and Cortex-A7-based devices, showing only a moderate floating point performance increase for Cortex-A53 over Cortex-A7. Most likely, AnandTech is erroneously reporting Cortex-A57 core floating point performance in this case (this matches Geekbench results that I previously tabulated).

Memory performance benchmarks performed by AnandTech show a relative increase in latency for a Cortex-A53 cluster between transfer sizes of 256 KB and 512 KB when compared to a Cortex-A7 cluster, despite the fact that this should fit inside the 512 KB L2 cache. However, as I previously noted in earlier blog articles, the benchmarks show that memory bandwidth has significantly increased with Cortex-A53 when compared to Cortex-A7, virtually doubling. This most likely contributes to the Cortex-A53 core's greater multi-threading performance in practice.

Power consumption of Cortex-A7 greatly reduced with Samsung's 20 nm process


AnandTech has published a detailed chart showing estimates for power consumption of the previous generation 32-bit Cortex-A7 and Cortex-A15 cores on both 20 nm and 28 nm processes at Samsung, based on Samsung's Exynos 5422 (28 nm) and Exynos 5430 (20 nm) SoCs.

While the high-performance Cortex-A15 cores are seeing a power reduction of about 25%, power consumption of the Cortex-A7 cores sees a significant 40% reduction with a 56% reduction at the highest CPU frequency of 1300 MHz. This can be partly explained by Samsung optimizing the Cortex-A7 cores inside Exynos 5430 for low power consumption using ARM's POP IP optimization platform.

Ironically, the excellent power characteristics of the Cortex-A7 at the latest processes such as Samsung's 20 nm process have not been taken advantage of in the market except in Samsung's Exynos big.LITTLE 5430, since Cortex-A7 adoption is mostly limited to 40 and 28 nm and all announced 20 nm SoCs use Cortex-A57 and Cortex-A53 cores. There seems to be an opportunity for ultra-efficient 20 nm Cortex-A7-based SoCs for certain product segments, while there is also a significant opportunity for 20 nm Cortex-A53-only SoCs that should be more power efficient than their 28 nm equivalents.

One could envision a hypothetical octa-core Cortex-A7-based SoC manufactured on TSMC's 20 nm HPM process delivering spectacular performance/Watt, with relatively high clock speeds being possible. AnandTech's article notes that TSMC's 28 nm and 20 nm HPM processes are most likely significantly more efficient than Samsung's equivalent process technology because they allow CPUs to operate at a lower voltage level. A similar argument applies to Cortex-A53-based SoCs manufactured at 20 nm, albeit with lower performance/Watt.

In terms of die size, AnandTech reports a significant reduction of 45% for the Cortex-A7 cores and 64% for the Cortex-A15 cores in the 20 nm Exynos 5430 vs 28 nm Exynos 5422.

Cortex-A53 has significantly greater power consumption than Cortex-A7


AnandTech has published a detailed chart with power consumption characteristics of the Cortex-A53 cores inside Samsung's Exynos 5433 manufactured at 20nm. In their analysis, AnandTech notes a relatively large increase in power consumption when utilizing multiple Cortex-A53 cores at their highest frequency (1300 MHz on Exynos 5433), when compared to running at 1.0 GHz. This correlates with a voltage bump when going from 1.0 to 1.3 GHz.

Based on this analysis, the article concludes that power consumption is more than twice as large for Cortex-A53 as for Cortex-A7 at an equivalent clock speed of 1300 MHz on a similar manufacturing process (Samsung's 20 nm process). Although the Cortex-A53 core's CPU performance is greater, it is not twice as great, leading to clearly lower performance/Watt for Cortex-A53 when compared to Cortex-A7.
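
The performance/Watt argument above reduces to a ratio of ratios. Below is a minimal sketch in Python. The function name and the example figures (a 1.4x performance gain against a 2.2x power increase) are illustrative assumptions, not measured values from the AnandTech article.

```python
def relative_perf_per_watt(perf_ratio: float, power_ratio: float) -> float:
    """Return performance/Watt of core B relative to core A.

    perf_ratio  = performance of B divided by performance of A
    power_ratio = power draw of B divided by power draw of A
    """
    if perf_ratio <= 0 or power_ratio <= 0:
        raise ValueError("ratios must be positive")
    return perf_ratio / power_ratio


# Hypothetical inputs: B is 1.4x as fast as A but draws 2.2x the power.
ratio = relative_perf_per_watt(perf_ratio=1.4, power_ratio=2.2)
print(f"relative performance/Watt: {ratio:.2f}")
```

Any result below 1.0 means the newer core is less efficient, which is the case whenever power consumption grows faster than performance.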

It is possible that the chip errata (hardware bugs) in earlier revisions of Cortex-A53 that I mentioned in previous articles play a role in reducing the measured performance and power efficiency of Cortex-A53. Exynos 5433 uses Cortex-A53r0p1, which is affected by this. The chip errata require more frequent cache flushing as a work-around, which can potentially affect performance as well as power consumption. The non-optimal state of big.LITTLE kernel scheduling code may exacerbate these problems. There is potential for later revisions of Cortex-A53 such as r0p3 to deliver higher efficiency because they are not affected by these hardware problems. Chips with Cortex-A53 revision r0p3 have not yet appeared on the market.

Chip-specific core optimizations make comparisons more difficult


It should be noted that specific optimization of the processor cores for a particular higher clock frequency target (e.g. in chips like MediaTek's MT6752 and MT6795) or low power consumption at lower clock frequency (for example, in a big.LITTLE configuration), using ARM's POP core hardening technology, has the potential to skew the comparison between different chips. MediaTek's MT6752 has already been reported to have acceptable power consumption while running at a relatively high maximum clock frequency, which would otherwise be incompatible with the steep rise in power consumption for clock speeds above 1.2 GHz observed in the charts for the Samsung chips.

Die size of Cortex-A53 increased compared to Cortex-A7


The die size of Cortex-A53 cores in Samsung's chips is about 1.75 times greater than that of Cortex-A7 cores according to AnandTech, although it is still below one square millimeter, which is small for a CPU. When looking at the total cluster size, which includes the L2 cache (the same amount of 512 KB for Cortex-A53 and Cortex-A7), the die size of the cluster is 1.38 times greater. The larger die size has consequences for cost-sensitive SoCs for low-end mobile devices and IoT applications, for which Cortex-A7 remains more attractive. Cortex-A7 can also be employed as an embedded CPU in a functional block such as a baseband processor, just like Cortex-A5 is frequently used.
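
The two ratios quoted above also imply roughly how the cluster area splits between the cores and the rest. The sketch below assumes the cluster consists only of the CPU cores plus a remainder (mainly L2 cache) of identical size in both designs. That is a simplification, since real clusters contain other logic, so the result is only a rough estimate.

```python
def core_share_of_cluster(core_ratio: float, cluster_ratio: float) -> float:
    """Fraction of the baseline cluster area occupied by the CPU cores.

    Model (an assumption): cluster = cores + remainder, with the remainder
    unchanged between the two designs. With the baseline cluster normalised
    to 1 and c the core fraction:  core_ratio * c + (1 - c) = cluster_ratio
    """
    if core_ratio <= 1.0:
        raise ValueError("core_ratio must be greater than 1")
    return (cluster_ratio - 1.0) / (core_ratio - 1.0)


# Ratios quoted in the text: cores 1.75x larger, cluster 1.38x larger.
share = core_share_of_cluster(core_ratio=1.75, cluster_ratio=1.38)
print(f"cores: {share:.0%} of the Cortex-A7 cluster, remainder: {1 - share:.0%}")
```

Under this model the cores take up about half of the Cortex-A7 cluster, which is why a 1.75x increase in core size translates into a much smaller increase at the cluster level.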

Consequences for mobile SoCs


The higher performance of Cortex-A53 when compared to Cortex-A7, especially memory bandwidth, makes high-clocked multi-core Cortex-A53-based SoCs suitable for mid-range performance segments. Examples of this are MediaTek's MT6752 and Qualcomm's Snapdragon 615 SoC. These SoCs also have higher GPU performance than that traditionally associated with Cortex-A7-based SoCs.

The increased power consumption and die size of Cortex-A53 cause Cortex-A7 to remain relevant, because it still delivers superior power efficiency, cost and die size, and consequently its performance/Watt and performance/dollar are better than those of Cortex-A53. Hypothetically, a 20 nm octa-core Cortex-A7-based SoC would deliver excellent power efficiency with quite acceptable performance due to higher clock speeds, and there may be a market for such a solution for smartphones. The main drawback would be that OS ecosystems such as Android are moving towards 64-bit implementations and can also make use of new cryptography instructions in ARMv8.

Sources: AnandTech (technical Exynos Galaxy Note 4 review)

Updated 1 March 2015 (Add section about core-hardening).

Thursday, January 8, 2015

New mobile SoCs announced at CES

At the Consumer Electronics Show in Las Vegas, USA this week, a large number of new devices as well as chips for various kinds of multimedia devices are being announced, including mobile SoCs for smartphones and tablets. Several of the newly announced SoCs use Cortex-A53 CPU cores.

Rockchip announces octa-core Cortex-A53 tablet SoC


Rockchip announced the RK3368 at the show, which is a tablet processor with eight Cortex-A53 cores clocked up to 1.5 GHz and an unnamed GPU supporting OpenGL 3.1. Rockchip also claims 4Kx2K H.264/H.265 video playback capability and HDMI 2.0 display output supporting 4Kx2K resolution. Early information about this chip became available a few months ago, when it was codenamed "MayBach". Rockchip mentions support for Android Lollipop in its materials.

The quoted maximum clock speed of 1.5 GHz is not very high, but an up-to-date revision of the Cortex-A53 core should provide good CPU performance at that speed even for single-core, and the octa-core configuration will provide very good multi-core performance. At which foundry it is being produced is unclear; in the past Rockchip has been using the 28 nm SLP process at GlobalFoundries for its high-performance chips, although plans for chips produced at TSMC have been reported.

Most of the specifications suggest that the chip is targeted at the performance segment, more or less as a replacement for the RK3288 that is more suitable for tablets due to lower power consumption. Based on the fact that DirectX support up to 9.3 is claimed as well as OpenGL 3.1, the GPU is most likely a Mali-T760 GPU. The RK3288 already contains a performance-oriented Mali GPU, of which the exact nature is unclear. The memory interface is likely to be 32-bit dual-channel with support for LPDDR3, similar to the RK3288 and suitable for performance-oriented devices.

Allwinner announces low-cost quad-core Cortex-A53 tablet SoC


Meanwhile, Allwinner, Rockchip's archrival in the Chinese tablet processor market, announced the A64, a new low-cost tablet processor with four Cortex-A53 CPU cores. Allwinner quotes a price of $5 for the chip. The SoC appears to be the logical successor to the recently introduced A33 with Cortex-A7 cores, which is also a low-cost quad-core tablet processor that appears to have been less successful than anticipated. Allwinner also recently introduced an octa-core Cortex-A7-based SoC, the A83T.

The new SoC supports H.265/H.264 decoding in hardware, and is compatible with various types of DDR memory (presumably in a single channel 32-bit configuration). 4K HDMI output is also listed.

MediaTek announces Android TV and wearable device SoC platforms


Outside of the mobile space, MediaTek (which has long been prominent in the digital television SoC space, both through its internal division and through MStar, which it acquired not too long ago) announced a new digital television SoC, MT5595, with support for Android TV. Sony will be using the chip in new LCD TV models. The chip has a big.LITTLE-type CPU configuration with two Cortex-A17 cores and two Cortex-A7 cores, and has hardware support for HEVC (H.265) and VP9 for 4K2K content streaming at 60 frames per second. As shown by the MT6595 smartphone SoC, MediaTek's Cortex-A17 implementation can provide very high single-core CPU performance, which is probably helpful in providing good performance and response times on the Android TV platform.

MediaTek has also announced an optimized solution for wearable devices based on Google’s Android Wear software. The MT2601 is equipped with a dual-core Cortex-A7 CPU up to 1.2 GHz and a single-core Mali-400 MP GPU, with support for display resolutions up to qHD (960x540). In several respects, these specifications match those of MediaTek's existing low-cost MT6572 smartphone SoC. MediaTek is touting the small die size and power efficiency of the new chip. It can be paired with various external wireless connectivity chips including the recently introduced MT6630 for Bluetooth (MT6630 also integrates advanced WiFi, GPS and FM radio functionality).

Sources: CNX Software (Rockchip RK3368), CNX Software (Allwinner A64), MediaTek (MT5595 announcement), MediaTek (MT2601 announcement)

Tuesday, December 30, 2014

Early benchmarks for Snapdragon 810 show performance flaws

Recently, reports have surfaced, including one from BusinessKorea published on December 4, about Qualcomm's new high-end chip, Snapdragon 810, being affected by performance issues related to heat production and issues with the memory controller. Subsequently, Geekbench results for some Samsung prototype devices using the SoC (MSM8994) have also appeared in the Geekbench results database. Detailed analysis of the Geekbench results seems to confirm the issues with thermal throttling and especially memory controller performance, at least in the early revision of the SoC that was used to obtain the mentioned benchmark scores, resulting in sub-par performance for its segment.

Updated (January 5, 2015): A section has been added discussing new Geekbench results from a LG G Flex2 prototype using Snapdragon 810, which shows improvement in some areas.

Snapdragon 810: A departure from Qualcomm's in-house Krait cores


For a long time, Qualcomm has used its own ARM-compatible Krait cores (most recently Krait-400/450 in Snapdragon 801/805) for SoCs targeting the performance segment. However, with Snapdragon 810 (as well as Snapdragon 808 and to a certain extent Snapdragon 615), Qualcomm seems to be migrating to standard ARM cores for performance-oriented SoCs. Some time ago, Qualcomm already transitioned its cost-effective SoCs (such as the Snapdragon 200 and 400 series) to cost efficient ARM cores such as Cortex-A7 (and later Cortex-A53).

Snapdragon 810 contains four Cortex-A57 cores (clocked up to about 1.5 GHz based on current evidence) as well as four Cortex-A53 cores in a big.LITTLE configuration. In this respect the chip is similar to Samsung's Exynos 7 Octa (5433) that has already been shipping for several months in devices such as the Galaxy Note 4 and shows impressive CPU performance. However, Snapdragon 810 is the direct successor to Snapdragon 805 and has a similarly ambitious memory interface with high total bandwidth (pioneering the use of new LPDDR4 SDRAM), which puts it squarely in the very high end category, like Snapdragon 805.

Qualcomm also has a SoC in planning for the more mainstream part of the high-end performance segment, Snapdragon 808, which has two Cortex-A57 cores instead of four while retaining the four Cortex-A53 cores. Importantly, Snapdragon 808 also simplifies the memory interface to dual-channel 32-bit with more standard LPDDR3 memory instead of LPDDR4, reducing cost and being comparable to Snapdragon 801, the current high-end standard.

20nm process and LPDDR4 memory


Snapdragon 810 is Qualcomm's first SoC product to be manufactured using TSMC's 20nm process technology. 20nm, in theory, significantly increases performance and power efficiency when compared to the 28nm process technology that Qualcomm has been using recently for most of its chips.

The SoC also features an LPDDR4 external memory interface in a dual-channel 32-bit configuration, with a maximum clock speed of 1600 MHz according to Qualcomm's webpage, resulting in memory bandwidth of 25.6 GB/s, similar to Snapdragon 805, which achieves its bandwidth with a wide 64-bit dual-channel memory interface with LPDDR3. This is a very high amount of memory bandwidth for a mobile device, making the chip suitable for driving very high resolutions such as QHD. However, it also increases cost, and the apparent requirement of using higher-clocked LPDDR4 memory instead of mainstream LPDDR3 is also likely to increase cost, despite the reduction in memory bus width allowed by LPDDR4.

Snapdragon 808 likely to be more attractive for high-volume flagship devices


Meanwhile, Snapdragon 808 seems to provide a more practical performance-oriented platform by utilizing standard LPDDR3 in a dual-channel 32-bit configuration at a clock speed up to 933 MHz, resulting in maximum memory bandwidth of 14.9 GB/s. Overall, Snapdragon 808 seems to be much more attractive for high-volume high-end devices as a successor to Qualcomm's popular Snapdragon 801.
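
The bandwidth figures quoted for Snapdragon 810 and Snapdragon 808 follow from the memory clock, the double data rate of LPDDR memory and the combined bus width. Here is a minimal sketch of that arithmetic, with a hypothetical helper name and 1 GB taken as 10^9 bytes:

```python
def peak_bandwidth_gb_s(clock_mhz: float, bus_width_bits: int,
                        transfers_per_clock: int = 2) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 10**9 bytes).

    transfers_per_clock is 2 for double data rate memory such as LPDDR3/LPDDR4.
    """
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    return transfers_per_second * bytes_per_transfer / 1e9


# Dual-channel 32-bit means a combined 64-bit bus.
snapdragon_810 = peak_bandwidth_gb_s(clock_mhz=1600, bus_width_bits=2 * 32)
snapdragon_808 = peak_bandwidth_gb_s(clock_mhz=933, bus_width_bits=2 * 32)
print(f"Snapdragon 810 (LPDDR4, 1600 MHz): {snapdragon_810:.1f} GB/s")
print(f"Snapdragon 808 (LPDDR3, 933 MHz):  {snapdragon_808:.1f} GB/s")
```

These are theoretical peaks. Sustained throughput, as measured by tests such as Stream Copy, is always lower and depends on the memory controller.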

Performance flaws evident in early Geekbench database entries


Early Geekbench results database entries show lower-than-expected CPU and memory performance, and detailed analysis of the results seems to confirm the reports about thermal throttling due to heat production as well as lower-than-expected memory performance. In practice, the version of Snapdragon 810 that was benchmarked seems to provide performance lower than even Snapdragon 801 in most respects.

Performance data for Snapdragon 810 in the Geekbench entries is clouded somewhat because of the use of 64-bit AArch64 mode in Android. Until now, most Cortex-A57 and Cortex-A53 based solutions have used AArch32 (32-bit ARMv8 mode, which takes advantage of some of the new features of ARMv8 but is not fully 64-bit). Android AArch64 support and performance has been a work in progress and is likely still not fully optimized. However, in the case of the Snapdragon 810 results, the performance deficit is of such magnitude that it is clear that it is caused by flaws in the chip implementation and not AArch64 mode.

In the table in the Appendix below, some Snapdragon 810 and 801 results have been highlighted in bold to show some of the performance differences and in particular the areas where Snapdragon 810 performance is much lower than expected.

There are several entries for the device in the database that show considerable variation between runs, providing evidence that performance throttling caused by heat production is a significant problem. For the analysis below, the best benchmark result among the various entries has been used. There is evidence that some of the later entries impose a CPU clock speed limit of about 1.0 GHz or perhaps only use the Cortex-A53 cores in some cases (these entries are also represented in the table).
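
The selection of the best run and the spread between runs can be expressed as a small helper. This is only a sketch of the approach described above. The function name is hypothetical and the scores are made-up placeholders, not values from the Geekbench database.

```python
def summarize_runs(scores: list[int]) -> dict[str, float]:
    """Pick the best run and report how far the worst run falls below it."""
    if not scores:
        raise ValueError("at least one score is required")
    best, worst = max(scores), min(scores)
    return {
        "best": best,
        "worst": worst,
        "drop_percent": 100.0 * (best - worst) / best,
    }


# Hypothetical multi-core scores from repeated runs on one device.
runs = [3100, 2450, 2900, 1980]
summary = summarize_runs(runs)
print(summary)
```

A large drop between the best and worst run on the same hardware is consistent with throttling, although differences in firmware or background load can produce the same pattern.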

Deficits in pure CPU performance, especially multi-core


Compared to Samsung's Exynos 7 Octa (5433), which has a similar CPU configuration, basic integer tests such as JPEG Compress already show somewhat lower than expected performance based on the reported clock speed, with multi-core performance scaling being considerably less than expected, and also clearly lower than Snapdragon 801. The Dijkstra benchmark, which has more external memory access and branching, is more heavily affected and is at least 35% slower than on Exynos 5433, despite a similar clock speed, and slower than Snapdragon 801 as well as Snapdragon 805. However, this may in large part be due to running in AArch64 compared to the 32-bit mode used on the other chips, since the Dijkstra benchmark seems to be similarly affected on other platforms that use AArch64.

For floating point performance, pure single-core performance, as shown by the Mandelbrot subtest results, is relatively unaffected, but multi-core performance scaling is much lower than Exynos, resulting in performance comparable to Snapdragon 805 rather than the higher floating point performance expected from Cortex-A57 cores (such as in Exynos 5433).

Memory performance significantly impacted


Memory performance is clearly seriously affected, confirming reported issues with the memory controller. The raw throughput of the Stream Copy subtest is significantly lower than expected based on the 32-bit dual-channel memory interface with double-speed LPDDR4, being lower than Snapdragon 805 with a similar amount of memory bandwidth and even significantly lower than Snapdragon 801 with its 32-bit dual-channel LPDDR3 interface.

The flaws in memory performance are evident in the SGEMM subtest, which is a floating point test that is heavy on sequential memory access. Snapdragon 810 shows performance for this test barely more than half that of Snapdragon 801 and 805. It is even worse for the multi-core test, where Snapdragon 810 shows performance scaling worse than two times, while Snapdragon 801 and 805 have performance scaling more in line with the four CPU cores they possess.
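
Multi-core scaling as used here is simply the multi-core score divided by the single-core score, optionally compared against the number of cores. A minimal sketch with hypothetical function names and placeholder scores (not taken from the Geekbench database):

```python
def multicore_scaling(single_core: float, multi_core: float) -> float:
    """Return how many times faster the multi-core run is than the single-core run."""
    if single_core <= 0:
        raise ValueError("single-core score must be positive")
    return multi_core / single_core


def scaling_efficiency(single_core: float, multi_core: float, cores: int) -> float:
    """Scaling factor as a fraction of the ideal of one unit per core."""
    return multicore_scaling(single_core, multi_core) / cores


# Hypothetical SGEMM-style scores for a chip that scales poorly.
print(multicore_scaling(500, 900))      # scaling factor, below 2x here
print(scaling_efficiency(500, 900, 4))  # fraction of ideal on four cores
```

For a memory-bound test such as SGEMM, a scaling factor well below the core count usually points at the memory subsystem rather than the CPU cores themselves.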

Finally, in the SFFT test, which is a floating point test with heavy random memory access, Snapdragon 810 shows only roughly half the performance of Snapdragon 801, Snapdragon 805 and Exynos 5433. This seems to provide the clearest evidence of performance problems with the memory controller.

Snapdragon 810 likely to be too costly for mainstream high-end devices


On popular technology websites, Snapdragon 810 has recently been mentioned frequently as the likely chip for future high-end models from a diverse range of well-known manufacturers such as Samsung, HTC and LG. However, the high-bandwidth LPDDR4 memory interface (which increases device cost) and performance targets seem to put it clearly in the very high end category, comparable to Snapdragon 805, which does not make it ideal for high-volume performance devices that do not have an extremely high screen resolution such as QHD (2560x1440). Other new chips such as Snapdragon 808 and (for mid-range) Snapdragon 615 seem to be more suitable for performance-oriented mainstream devices, including several of the mainstream flagship devices from the mentioned manufacturers.

However, if the performance flaws that are evident in the current Snapdragon 810 are not fixed or if Qualcomm has significant inventory of flawed chips, it is possible that they will be unloaded onto the more mainstream performance segment for a discounted price. It seems likely however that Qualcomm, given its chip expertise, will be able to fix most of the performance issues with the Snapdragon 810 in a future revision of the chip.

Update (January 5): LG prototype shows better multi-core performance


A Geekbench test run was recorded on January 5 for a prototype LG G Flex2 with Snapdragon 810. This result shows some improvements, especially in the overall multi-core score, although it is still well below that of Exynos 7 Octa (5433), which has a similar CPU configuration.

A closer look reveals that integer benchmarks, especially the more memory-intensive Dijkstra subtest, have not materially improved over the prior results. Multi-core floating point performance has improved significantly and contributes to the higher total multi-core score.

However, memory tests show mixed results. The Stream Copy subtests are lower than the previous best results from last month, remaining significantly lower than Snapdragon 805 and even Snapdragon 801, suggesting that sequential memory access performance has not improved. This is corroborated by the SGEMM subtest results, which also depend on sequential memory access performance and show results that are very similar to the earlier scores.

Meanwhile, the SFFT scores show a significant uptick, especially for multi-core performance, suggesting that Qualcomm has been able to improve the random memory access performance of the chip. However, the subtest scores are still clearly below those of Exynos 5433, Snapdragon 805 and even Snapdragon 801.

Update (January 10): New prototype entry shows improvements in memory performance


A subsequent Geekbench result entry recorded on January 9 for an unknown device shows further improvements in memory performance, although still falling short of the memory performance of the more mainstream Snapdragon 801 (let alone Snapdragon 805). The single-core JPEG Compress subtest result is also improved, but overall the CPU performance results suggest that thermal throttling because of overheating is still likely to be a significant problem.

Appendix: Geekbench performance table


The table below is similar to the one published in my previous article. In the bottom half of the table, some relevant benchmark scores for Snapdragon 810 and Snapdragon 801/805 have been highlighted.

For a high-resolution version, view/copy/save the image above using the browser.

Sources: BusinessKorea, Geekbench browser (Samsung SM-N916S results), Qualcomm (Snapdragon 810 page), Wikipedia (Qualcomm Snapdragon)

Updated (January 5, 2015): Add discussion of recent LG prototype Geekbench test results, update performance table (also include Intel Atom results).
Updated (January 8, 2015): Correct DRAM interface of Snapdragon 810 (it is 32-bit dual-channel using LPDDR4, which can be clocked much higher than LPDDR3).
Updated (January 10, 2015): Add discussion of new Geekbench result entry, updated table.

Monday, December 29, 2014

Another look at Cortex-A53 CPU core performance

Several smartphone chips using ARM's new Cortex-A53 and Cortex-A57 CPU cores with the 64-bit-capable ARMv8 instruction set have arrived on the market recently. Cortex-A53-only based SoCs are especially attractive from a performance/dollar standpoint. However, as I described in earlier articles, there exist significant performance differences between different Cortex-A53 implementations, with some early revisions of the core being limited in performance, probably because of design bugs.

32-bit version of ARMv8 seems practical


Most of the Cortex-A57 and Cortex-A53-equipped SoCs currently seem to be running in what can be called "32-bit ARMv8 mode" (AArch32 in Geekbench, as opposed to ARMv7 for older 32-bit devices), taking advantage of some of the features of the ARMv8 instruction set (which is better suited to modern CPU chip architectures) while avoiding some of the disadvantages of the full 64-bit model (such as doubled storage space for pointers and addresses).

Whether the full 64-bit instruction model (AArch64) will soon be attractive for Android devices, including lower-end ones such as Cortex-A53-based devices with limited amounts of CPU cache and RAM, is unclear. NVIDIA already uses AArch64 in conjunction with their latest Tegra K1 SoC. Optimizations for AArch64 seem to have been a work in progress and early benchmarks for systems running in AArch64 mode were quite poor in comparison to 32-bit mode benchmarks, but progress is being made. Theoretically, more registers are available in AArch64 mode, also to the NEON SIMD unit, which should help performance in some important cases, and may mitigate the disadvantages of increased address storage size.
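
The cost of doubled pointer storage can be illustrated with a pointer-heavy data structure. The sketch below computes the size of a C-style doubly linked list node under 4-byte and 8-byte pointers. It assumes natural alignment (each field aligned to its own size), and the helper names are hypothetical.

```python
def struct_size(fields: list[tuple[int, int]]) -> int:
    """Size of a C-style struct given (size, alignment) pairs per field.

    Each field is placed at the next offset that is a multiple of its
    alignment, and the total is padded to the largest alignment.
    """
    offset = 0
    max_align = 1
    for size, align in fields:
        offset = (offset + align - 1) // align * align  # pad to alignment
        offset += size
        max_align = max(max_align, align)
    return (offset + max_align - 1) // max_align * max_align


def list_node(pointer_bytes: int) -> list[tuple[int, int]]:
    """Doubly linked list node: next pointer, prev pointer, 32-bit payload."""
    pointer = (pointer_bytes, pointer_bytes)
    return [pointer, pointer, (4, 4)]


size_32 = struct_size(list_node(pointer_bytes=4))  # 32-bit pointers
size_64 = struct_size(list_node(pointer_bytes=8))  # 64-bit pointers
print(size_32, size_64)
```

In this example the node doubles in size with 64-bit pointers, so the same cache holds half as many nodes, which is the kind of effect that matters most on chips with small caches and limited RAM.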

Snapdragon 410 has crippled first revision of Cortex-A53


Snapdragon 410 (MSM8916) is a SoC with quad-core Cortex-A53 that has been one of the first chips with Cortex-A53 cores to come to market and has already been adopted in significant volume for low-to-mid-range designs, replacing the older Cortex-A7-based Snapdragon 400.

However, it is obvious that the very first public revision of the Cortex-A53 core as used inside Snapdragon 410, Cortex-A53r0p0, is crippled in terms of performance, clearly scoring lower in CPU and memory-intensive benchmarks (even after making the significant correction for clock speed) than SoCs using later revisions of the Cortex-A53 core such as Snapdragon 615 and MediaTek's new chips. Coupled with Snapdragon 410's relatively low clock speed of 1.19 GHz, this results in significantly lower performance than the newer mid-range chips mentioned. Performance in complex benchmarks that simulate demanding, typical use such as complex browsing and gaming is even worse.

Advertising of Snapdragon 410 as having 64-bit support is very misleading


The lower performance seems to be partly associated with the fact that Snapdragon 410 (because of the r0p0 revision of Cortex-A53) is completely limited to ARMv7-compatibility mode and is unable to run in ARMv8 mode (32-bit or otherwise). I have yet to see evidence of a shipping Snapdragon 410 chip that is 64-bit or even ARMv8 capable. In practice, it behaves like a chip with somewhat faster 32-bit ARMv7 Cortex-A7 cores and nothing more. In this sense, labeling the chip as being 64-bit or potentially having support for 64-bit ARMv8 in a future update is downright misleading or a blatant lie, depending on one's standpoint.

Memory performance seems most impacted


Based on Geekbench results, Snapdragon 410 has about 10% lower pure integer CPU performance per MHz when compared to chips such as Snapdragon 615 and MediaTek's MT6732/MT6752. For pure floating point performance, performance is about 5% lower. The biggest difference is in memory performance, where Snapdragon 410 is about 25% slower than Snapdragon 615 (with r0p1 Cortex-A53) and more than two times slower (even when correcting for clock or memory speed) than MT6732/MT6752 with Cortex-A53 r0p2. Another big difference is found in cryptography performance because of the extra ARMv8 instructions that apparently are not available to Snapdragon 410.
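
The per-MHz comparisons above come down to dividing each score by the clock speed before comparing. Below is a minimal sketch with hypothetical function names. Only the 1.19 GHz clock comes from the text, while the scores and the 1.7 GHz clock of the second chip are placeholders for illustration.

```python
def per_ghz(score: float, clock_ghz: float) -> float:
    """Normalise a benchmark score to a 1.0 GHz clock."""
    if clock_ghz <= 0:
        raise ValueError("clock must be positive")
    return score / clock_ghz


def percent_slower(score_a: float, clock_a: float,
                   score_b: float, clock_b: float) -> float:
    """How much slower chip A is than chip B per clock, in percent."""
    a, b = per_ghz(score_a, clock_a), per_ghz(score_b, clock_b)
    return 100.0 * (b - a) / b


# Placeholder scores; chip A runs at 1.19 GHz, chip B at an assumed 1.7 GHz.
deficit = percent_slower(score_a=640, clock_a=1.19, score_b=1000, clock_b=1.7)
print(f"{deficit:.1f}% slower per clock")
```

Linear clock normalisation is itself an approximation: it works reasonably well for CPU-bound subtests but less so for memory-bound ones, where performance does not scale with the core clock.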

A large part of the lower performance of the Cortex-A53 cores inside Snapdragon 410 may be due to chip design bugs as evident from errata issued by ARM for earlier revisions of the Cortex-A53 core. Some details about these errata, which mostly involve memory coherency issues related to CPU cache memory, can be found when compiling a Linux kernel.

Snapdragon 410 shows poor scores in real-world benchmarks


While Snapdragon 410 delivers somewhat better scores than the Cortex-A7-based Snapdragon 400 at the same clock speed in pure CPU-specific benchmarks such as Geekbench for single-core performance, multi-core performance does not show much benefit (which is unexpected based on the architectural advantage that the Cortex-A53-based Snapdragon 410 should have).

Even worse is the performance in practical benchmarks that measure performance for web browsing, gaming and other more complex, practical use cases. Based on benchmark results reported by GSMArena (1) (2), Basemark X, which is a gaming benchmark that uses the Unity engine to simulate throughput for a more demanding typical usage pattern, reports a significantly lower score than recent Snapdragon 400-based models such as the Moto G (2014), with the GPU score being similar, pointing to significant flaws in (multi-core) CPU and memory performance.

In Rightware's Browsermark 2.1, a browser benchmark with use of advanced web standards such as HTML 5, WebGL and advanced JavaScript, performance is downright disappointing, with a score less than half that of Snapdragon 400-based devices. Other browser benchmarks show similar results. Scores in Rightware's overall-use Basemark OS II benchmark are also typically relatively disappointing, not surpassing those of Snapdragon 400-based devices.

Hardware bugs likely cause of crippled performance


These lower than expected results in more complex, typical-use benchmarks are consistent with hardware bugs in the Cortex-A53 implementation of the Snapdragon 410 being a bottleneck and significantly degrading multi-core performance in particular. Work-arounds for cache consistency and coherency issues especially have the potential to significantly degrade performance, for example by forcing the kernel to frequently flush CPU caches.

The Linux kernel source shows commits to handle errata for Cortex-A53 up to r0p2 relating to cache clean operations, with the work-around being to promote cache clean to cache clean and invalidate. This could mean that revision r0p3 of Cortex-A53 may see further improvements. These commits do not explain the performance difference between r0p0, r0p1 and r0p2, since the work-around is the same for all three revisions.
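
Whether a given core falls under such a work-around is decided by comparing its revision against the last affected one. The kernel itself performs this check in C against the CPU's identification register. The Python sketch below only models the comparison on revision strings, with hypothetical helper names and "r0p2" as the last affected revision, as described above.

```python
import re


def parse_revision(rev: str) -> tuple[int, int]:
    """Parse an ARM core revision string such as 'r0p2' into (major, minor)."""
    match = re.fullmatch(r"r(\d+)p(\d+)", rev.strip().lower())
    if match is None:
        raise ValueError(f"not a revision string: {rev!r}")
    return int(match.group(1)), int(match.group(2))


def needs_workaround(rev: str, last_affected: str = "r0p2") -> bool:
    """True if rev is at or below the last revision listed as affected."""
    return parse_revision(rev) <= parse_revision(last_affected)


for rev in ("r0p0", "r0p1", "r0p2", "r0p3"):
    print(rev, needs_workaround(rev))
```

Because the check is a simple range, r0p0, r0p1 and r0p2 all receive the same work-around, which is why it cannot by itself account for the performance differences between those three revisions.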

Third revision of Cortex-A53 (r0p2) seems to improve memory performance


Some of the hardware or performance bugs that plagued especially the first version of the core (r0p0 as used in Snapdragon 410) have most likely been fixed in later revisions, contributing to a significant performance increase at the same clock speed.

SoCs with the third revision (r0p2) of the Cortex-A53 core seem to have much better memory performance as shown by Geekbench results, especially impressive given the bandwidth limitations of a 32-bit external memory interface. Most likely, this improvement is derived synergistically with ARM IP such as the Mali-T760 GPU as well as other IP blocks, which are implemented inside chips such as MT6732 and MT6752.

Since a SoC such as MT6732 is on the surface essentially comparable to Snapdragon 410 in the sense of having four Cortex-A53 CPU cores, there seem to be major performance improvements in the later revisions of the Cortex-A53 core and associated system architecture, especially with regard to memory performance. The difference is made more pronounced by the fact that the MT6732 is manufactured using TSMC's higher performance 28HPM process rather than 28LP and also clocked significantly higher.

Octa-core Cortex-A53 configurations provide impressive multi-core performance


Octa-core Cortex-A53-based SoCs such as MT6752 and to a lesser extent Snapdragon 615 are already showing impressive multi-core CPU performance, while single-core performance has also improved considerably over prior cost-effective CPU architectures. Multi-core performance, both in terms of pure CPU integer and floating point performance, for the MT6752 significantly surpasses (by tens of percent in many benchmarks) the much more expensive Snapdragon 801, while single-core performance is catching up, being about 30% slower for integer operations and 15% slower for floating point. This high level of performance comes at a fraction of the cost (primarily because of the small die size and low power consumption of the Cortex-A53 cores).

Memory bandwidth still a bottleneck


However, when the memory subsystem truly comes into play, high-end chips such as Snapdragon 801 still show much greater performance because of their much higher external memory bandwidth (because of the wider memory interface) as well as larger CPU cache. This is apparent in the Geekbench subtest SGEMM (which is heavy on sequential memory access), for which high-end SoCs such as Snapdragon 801 are more than twice as fast.

In practice, memory performance is important for how fast a device feels, impacting response times and also being very important for GPU performance. High screen resolutions also put heavy demands on the memory subsystem. In that sense, SoCs such as MT6752 and Snapdragon 615 still perform best at a resolution like 1280x720, with the best performance at 1920x1080 and higher still reserved for high-end SoCs.
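
To give a feel for how screen resolution scales the load on memory, the sketch below computes the bandwidth needed just to read one full-screen layer for display. The 4 bytes per pixel and 60 Hz refresh rate are assumptions, and the function name is hypothetical. Real workloads are several times higher once GPU rendering and composition of multiple layers are included.

```python
def scanout_mb_s(width: int, height: int, bytes_per_pixel: int = 4,
                 refresh_hz: int = 60) -> float:
    """Bandwidth needed to read one full-screen layer for display, in MB/s."""
    return width * height * bytes_per_pixel * refresh_hz / 1e6


resolutions = {"1280x720": (1280, 720), "1920x1080": (1920, 1080),
               "2560x1440": (2560, 1440)}
for name, (w, h) in resolutions.items():
    print(f"{name}: {scanout_mb_s(w, h):.0f} MB/s")
```

Since the cost is proportional to the pixel count, 1920x1080 needs 2.25 times the bandwidth of 1280x720 and 2560x1440 needs four times as much, for every pass over the screen.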

There seems to be great potential for performance-oriented Cortex-A53 SoCs with a memory interface wider than 32-bit, comparable with other performance-oriented SoCs. This would be the "best of both worlds" in several respects (lower cost because of the small die size of the CPU cores, low power consumption, while still having the memory bandwidth to drive high resolutions). MediaTek has announced a chip that was expected to have such a configuration, the MT6795, but it has not quite appeared on the market yet and might be delayed. However, similar solutions certainly look likely to become popular for performance-oriented devices in the not too distant future.

Appendix: Table with detailed Geekbench CPU benchmark results


Presented here is a table with detailed benchmark result information for the mentioned SoCs, also including several other SoCs on the market. Included is information about the CPU cores used, their clock speed, the smartphone model and Geekbench result entry used as a reference, and scores for several benchmarks. Indexed results (relative to a 1.0 GHz Cortex-A7) are shown for several of the benchmarks, as well as multi-core performance scaling indices. Results relevant to the discussion above have been highlighted in bold. The following Geekbench subtests have been included:

  • JPEG Compression (single/multi-core). A useful integer benchmark that seems to strongly depend on pure CPU performance (CPU core type and clock speed) with less dependence on the memory subsystem (including L2 CPU cache).
  • Dijkstra (single/multi-core). A more complex integer benchmark that probably includes more memory access and may branch a lot. Notable for this benchmark is that Cortex-A53 performs better than Cortex-A15 at the same clock speed, with both Cortex-A17 (MT6595) and Cortex-A57 being significantly faster still.
  • Mandelbrot (single/multi-core). A pure floating point benchmark, highly dependent on the combination of CPU core type and clock speed.
  • Stream copy (single/multi-core). An important metric for memory performance (especially sequential external RAM performance).
  • SGEMM. A floating point matrix multiplication benchmark that heavily depends on sequential memory access. The memory bandwidth available to the SoC makes a critical difference for this benchmark.
  • SFFT. A floating point benchmark that heavily uses random memory access.
For a high-resolution version, view/copy/save the image above using the browser.

Sources: Geekbench browser, Primate Labs website

Updated January 2, 2015 (Add section of low typical-use benchmark scores for Snapdragon 410).
Updated January 5, 2015 (Update Geekbench performance table).
Updated January 10, 2015 (Update performance table).
Updated February 11, 2015 (mention and link Linux kernel Cortex-A53 errata).