Showing posts with label HMP. Show all posts

Thursday, April 30, 2015

More details emerge about Cortex-A72 CPU core

Recently, more details have become available about the performance improvements implemented in ARM's Cortex-A72 core, which is a replacement for the high-performance Cortex-A57 core. Apart from the gains from using a more advanced process such as 14/16 nm FinFET, Cortex-A72 also implements fairly significant micro-architectural improvements affecting performance per cycle and power efficiency. AnandTech has published a detailed overview of these improvements.

Cortex-A57 based on Cortex-A15 and not fully optimized for power-efficiency


The Cortex-A57 CPU core, which was announced in 2012, has significant similarities to Cortex-A15, ARM's long-standing high-performance 32-bit CPU core, which has been known for relatively high power consumption. As such, it is not unexpected that improvements on the Cortex-A57 architecture (in the form of the Cortex-A72) have proven to be possible. Cortex-A57-based SoCs such as Snapdragon 810 have been known to throttle, being forced to reduce the clock speed due to excessive heat production and power use, resulting in reduced sustained performance. Apple's A7 and A8 processors use CPU cores that most likely have strong similarities with Cortex-A57, but which exhibit little throttling due to a lower maximum clock speed, a lower number of cores and other factors related to the chip design.

Increased level of sustained performance


ARM has made available a number of slides detailing the improvements in sustained performance and power efficiency in Cortex-A72 over Cortex-A57. On a 28 nm process and similar clock speed, ARM's charts indicate a roughly 20% reduction in power consumption.

Sustained performance is expected to be higher than that of Cortex-A57, implementations of which (such as Snapdragon 810 and Exynos 5433, and to a lesser degree Exynos 7420) have been unable to maintain high clock speeds and throttle back to a relatively low speed due to heat production and associated power consumption. ARM gives a figure of sustained 750 mW operation per core on a 16FF+ process with a clock speed around 2.5 GHz.

In terms of IPC (instructions per cycle), ARM's information shows improvements in all instruction-level performance segments, with a 1.16x improvement for "analytics", 1.38x for cryptography, 1.50x for memory, 1.26x for floating point and 1.16x for integer compute. The increase in memory performance appears to be significant.
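ARM's figures are given per category only. As a rough way to summarize them, the following sketch takes the geometric mean of the five factors quoted above. The equal weighting of the categories is an assumption made here for illustration, not an aggregate published by ARM.

```python
import math

# Per-category improvement factors of Cortex-A72 over Cortex-A57, as quoted above.
ipc_gain = {
    "analytics": 1.16,
    "cryptography": 1.38,
    "memory": 1.50,
    "floating point": 1.26,
    "integer": 1.16,
}

def geometric_mean(values):
    """Geometric mean, the usual way to average ratios."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Assumption: every category is weighted equally.
overall = geometric_mean(ipc_gain.values())
print(f"Equal-weight geometric mean: {overall:.2f}x")
```

With equal weights the result is a little under 1.3x. A real workload mix would weight the categories differently.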

Improved single-core performance evident in early Geekbench results


Early Geekbench results for the MT8173 SoC from MediaTek, which includes two Cortex-A72 cores, give an indication of practical performance of the Cortex-A72 core, although the exact clock speed the Cortex-A72 cores are running at is hard to determine. The following table shows single-core performance from a recent MT8173 Geekbench entry, comparing it to Exynos 7420 as used in the Samsung Galaxy S6. Both use 64-bit AArch64 mode.

SoC                        JPEG   Dijkstra  Lua   Mandelb. Stream SGEMM SFFT
                           Compr.                          Copy
28nm? MT8173 (Cortex-A72)  1429    1287     1675  1750     2217    979  1345
14nm Exynos 7420           1475    1082     1409  1147     1993    954  1379
The MT8173 easily matches the single-core performance of Exynos 7420, while showing significant improvements in the Mandelbrot floating point subtest and the memory-intensive Dijkstra subtest, and also the Lua subtest. Memory subtest (Stream Copy) performance is also better than Exynos 7420, despite the likely much wider memory interface of the latter, providing clear evidence of the improved memory performance (largely due to smarter prefetching) in Cortex-A72. Overall, since the MT8173 results reflect a SoC using 28 nm or perhaps 20 nm process technology, while Exynos 7420 uses Samsung's leading-edge 14 nm FinFET process, the ability of the MT8173 to beat Exynos 7420 in single-core performance while using a less advanced process is impressive and illustrates the performance improvements in the Cortex-A72 core.
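To make the comparison easier to read, the per-subtest ratios can be computed directly from the table. The sketch below only uses the scores listed above; the variable and function names are chosen here for illustration.

```python
# Single-core Geekbench subtest scores copied from the table above.
subtests = ["JPEG Compress", "Dijkstra", "Lua", "Mandelbrot",
            "Stream Copy", "SGEMM", "SFFT"]
mt8173 = [1429, 1287, 1675, 1750, 2217, 979, 1345]
exynos_7420 = [1475, 1082, 1409, 1147, 1993, 954, 1379]

def score_ratios(a, b):
    """Return a/b for each pair of scores."""
    return [x / y for x, y in zip(a, b)]

# Ratio above 1.0 means the MT8173 scores higher in that subtest.
ratios = dict(zip(subtests, score_ratios(mt8173, exynos_7420)))
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<14} {ratio:.2f}")
```

The Mandelbrot ratio comes out at about 1.5, Dijkstra and Lua at just under 1.2, while JPEG Compress and SFFT are slightly below 1.0.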

Reduced silicon area results in lower cost


Cortex-A72 has a silicon area that is 10% smaller than Cortex-A57 on an equivalent process, while delivering improvements in performance and power efficiency. Already SoCs have been announced or described that utilize Cortex-A72 cores, such as MediaTek's MT8173 for tablets, Qualcomm's Snapdragon 618 and 620 for smartphones, and MediaTek's MT6797 (Helio-X20) for smartphones.

There seems to be a clear trend of using just two Cortex-A72 cores (instead of the four cores used in many Cortex-A57 implementations), reducing cost and maximum power consumption. These cores are augmented by low-power, small-area Cortex-A53 cores running at a lower frequency. MT8173, Snapdragon 618 and Helio-X20 all use such a configuration.

Use of Cortex-A72 may be more effective than high-clocked Cortex-A53 cores


There are indications that Cortex-A53 cores running at a high frequency (such as implemented in MediaTek's MT6752 and MT6795 (Helio-X10), HiSilicon's Kirin 930 and to a lesser degree in Snapdragon 615 and the announced Snapdragon 415 and 420) run into a power efficiency bottleneck at higher clock speed, due to the relatively steep increase in power consumption as the clock speed of the Cortex-A53 core increases above 1.3-1.5 GHz. Solutions that combine a small number of Cortex-A72 cores with lower-clocked, power-efficient Cortex-A53 cores may prove to be a sweet spot in terms of practical performance and power efficiency for mid-range SoCs.
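A steep rise of this kind is what the standard first-order model of CMOS dynamic power, P ~ C * V^2 * f, predicts: higher clocks usually require a higher supply voltage, so power grows faster than frequency. The sketch below illustrates the effect with two operating points. The voltages are hypothetical values chosen for illustration, not measurements of any Cortex-A53 implementation.

```python
def relative_dynamic_power(freq_ghz, volts, base_freq_ghz, base_volts):
    """Dynamic power relative to a baseline, using P ~ C * V^2 * f.

    The switched capacitance C cancels out when comparing the same core.
    """
    return (volts / base_volts) ** 2 * (freq_ghz / base_freq_ghz)

# Hypothetical operating points (illustrative only, not measured values).
base = (1.3, 0.90)   # GHz, volts
fast = (1.7, 1.10)   # GHz, volts

power_ratio = relative_dynamic_power(*fast, *base)
clock_ratio = fast[0] / base[0]
print(f"clock x{clock_ratio:.2f}, dynamic power x{power_ratio:.2f}")
```

Under these assumed numbers, a clock increase of about 31% costs nearly twice the dynamic power. Leakage power is ignored in this model.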

Source: AnandTech (Cortex-A72 Architecture Details article), Geekbench Browser

Thursday, April 23, 2015

Details surface about MediaTek's upcoming Helio-X20 SoC

Recently, details surfaced about MediaTek's upcoming Helio-X20 SoC, a high performance offering in the series of Helio-branded SoCs, of which the MT6795 (Helio-X10) is the first member. The deca-core Helio-X20, which has the model number MT6797, has a total of ten CPU cores and is the first mobile SoC with a hierarchy of three clusters of progressively less performance-oriented CPU cores: two ARM-Cortex-A72 cores, four high clocked ARM-Cortex-A53 cores and four lower clocked ARM-Cortex-A53 cores.

Three-cluster hierarchy extends the big.LITTLE principle


The SoC's ten CPU cores are organized as follows:
  • Two Cortex-A72 cores clocked up to 2.5 GHz to provide "extreme performance".
  • Four Cortex-A53 cores clocked up to 2.0 GHz for "best performance/power balance".
  • Four Cortex-A53 cores clocked up to 1.4 GHz for "best power efficiency".
The different clusters and their separate L2 caches are linked together using MediaTek's MCSI interconnect technology. MediaTek claims higher efficiency than big.LITTLE based designs, which have just two levels of cluster hierarchy.
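The cluster layout listed above can be written down as a small data structure. The sketch below also includes a toy selection policy to show the idea of a three-level hierarchy. The policy and the cluster labels are hypothetical and are not a description of MediaTek's scheduler.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cluster:
    name: str       # label chosen for this sketch
    core: str
    cores: int
    max_ghz: float

# Cluster layout as described above, ordered from most to least power-efficient.
HELIO_X20 = [
    Cluster("low-power", "Cortex-A53", 4, 1.4),
    Cluster("balanced", "Cortex-A53", 4, 2.0),
    Cluster("performance", "Cortex-A72", 2, 2.5),
]

def pick_cluster(required_ghz, clusters=HELIO_X20):
    """Toy policy: the most efficient cluster whose maximum clock covers the request.

    This is NOT MediaTek's scheduler; it ignores IPC differences between core
    types, current load, and thermal state.
    """
    for cluster in clusters:
        if cluster.max_ghz >= required_ghz:
            return cluster
    return clusters[-1]  # Demand exceeds every cluster: fall back to the fastest.

total_cores = sum(c.cores for c in HELIO_X20)
print(total_cores, pick_cluster(1.0).name, pick_cluster(1.8).name, pick_cluster(2.4).name)
```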

The triple-level hierarchical design is a significant departure from the symmetric CPU configuration on current MediaTek smartphone SoCs such as MT6795 (Helio-X10) and MT6752, which have eight "equal" Cortex-A53 cores, although MediaTek does have experience with big.LITTLE, for example in the 32-bit MT6595 and some tablet processors.

Reports suggest the chip is manufactured using a 20 nm process at TSMC and will be in mass production as soon as July 2015. This marks MediaTek's first known product manufactured using a geometry below 28 nm.

Other features: ARM Mali-T880 MP4 GPU, dual-channel LPDDR3, world modem


Based on a recent report from Gizchina.com that gives more details about the specifications of the chip, other features include an ARM Mali-T880 MP4 GPU at 700 MHz and a dual-channel 32-bit LPDDR3 memory interface at 933 MHz. The maximum display resolution supported is 2560x1600. The integrated LTE modem has Cat. 6 capability and also supports CDMA2000/EVDO Rev. A (world modem support). The video processor supports decoding and encoding of the H.265 format up to 4K resolution.
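The reported memory configuration implies a theoretical peak bandwidth that can be worked out as follows. This is a minimal sketch that assumes the 933 MHz figure is the memory clock and that LPDDR3 transfers data on both clock edges. Real-world throughput will be lower.

```python
def peak_bandwidth_gb_s(channels, bits_per_channel, clock_mhz, transfers_per_clock=2):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes).

    LPDDR3 is a double-data-rate memory, hence two transfers per clock.
    """
    bytes_per_transfer = channels * bits_per_channel / 8
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9

# Dual-channel 32-bit LPDDR3 at 933 MHz, as reported for the chip.
helio_x20 = peak_bandwidth_gb_s(channels=2, bits_per_channel=32, clock_mhz=933)
print(f"{helio_x20:.1f} GB/s")
```

This gives roughly 14.9 GB/s, twice what a single 32-bit channel at the same clock would provide.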

The report suggests the SoC will start shipping to manufacturers this summer with end products reaching stores by late autumn.

Execution issues at Qualcomm may help MediaTek's chances of success in high-end


Execution issues at Qualcomm regarding their high-end product roadmap may increase the chances of success of MediaTek's high-end product line. Qualcomm's Snapdragon 810 has some performance issues and has not been a great success, giving MediaTek the opportunity to capture more of the performance-oriented, premium level segment. MediaTek already has Helio-X10 (MT6795) in the market, which has gained design wins, but for which some key characteristics such as power efficiency are still unknown.

Meanwhile, MediaTek has come under pressure in the cost-sensitive smartphone SoC market, previously the bread-and-butter of the company, on which Qualcomm is encroaching by gaining market share for low-end devices in China. This is mainly the result of MediaTek's delayed introduction of cost-sensitive 4G SoC solutions.

MediaTek's sales performance under pressure


While MediaTek has made some progress penetrating the performance-oriented smartphone market with SoCs such as MT6752 and MT6795, it has lost ground in the cost-sensitive smartphone segment among Chinese manufacturers, which it previously dominated. Although MediaTek's March 2015 sales rebounded from the low level of February, for the second quarter its sales performance is not expected to reach the level of previous quarters (such as Q3 and Q4 of 2014). Indeed, the forecast given by MediaTek during its quarterly results presentation for Q1 2015 on April 30 sets sequential growth between -5% and +3% for Q2 2015, which represents a lower level of sales than the level MediaTek was accustomed to in 2014.

MediaTek's product mix has changed: a significantly lower volume of cost-sensitive SoCs is only partly offset by some traction for performance-oriented SoCs. As a result, overall unit shipments and unit market share for MediaTek have declined compared to the previous year, despite likely higher shipments of performance-oriented chips.

Update: MediaTek has officially announced Helio-X20


On 12 May, MediaTek officially announced Helio-X20. Most of the previously known details are confirmed in the announcement. The chip utilizes MediaTek's new CorePilot 3.0 heterogeneous computing scheduling algorithm, which together with the tri-cluster architecture should provide up to 30% reduction in power consumption. The chip has advanced camera features and has an ARM Cortex-M4-based sensor hub processor for better battery efficiency.

According to AnandTech, quoting MediaTek, the GPU used is not the Mali-T880 but an as yet unannounced Mali-T8xx series GPU, similar to Mali-T880. Compared to Helio-X10's PowerVR G6200, MediaTek sees a 40% performance improvement with a 40% drop in power.

Sources: CNXSoftware (Helio-X20 article), DigiTimes (MediaTek Q2 sales projection), DigiTimes (MediaTek Q2 2015 quarterly results), Gizchina.com (Comparison of MT6797 with Snapdragon 810), MediaTek (Helio-X20 announcement), AnandTech (Helio-X20 article)

Updated 21 May 2015.

Tuesday, March 3, 2015

A detailed comparison of Cortex-A53-based and other SoCs using Geekbench, and impact of AArch64

More Cortex-A53 CPU core-based SoCs have recently come to market and more benchmark results are now available, for example from the Geekbench results database. Firmware is also becoming more mature. This makes it possible to make better comparisons between different Cortex-A53-based SoCs (for example, octa-core SoCs) and compare the performance of the highest-performance chips with competitive chips that use more expensive CPU cores such as Krait 400 and Cortex-A57.

Overview of Cortex-A53-based SoCs


The following is a list of Cortex-A53 CPU core-based mobile SoCs that have appeared in the market or for which benchmark results have become available. All chips integrate 4G LTE modem functionality unless otherwise noted.

  • Snapdragon 410 (MSM8916), utilizing four early Cortex-A53r0p0 cores. Numerous cost-sensitive smartphones now use this chip. However, none of them appears to take any advantage at all of the new ARMv8 instruction set, with all of them running in ARMv7 compatibility mode. This is counter-intuitive because AArch32 (32-bit version of ARMv8), which is used by the other SoCs, already brings significant benefits. Snapdragon 410 generally performs significantly worse than other Cortex-A53-based SoCs, even when correcting for the low clock speed. This is also reflected in memory performance. The Adreno 306 GPU tends to be even a little slower than the Adreno 305 GPU in Snapdragon 400. The net result is a chip that is not much faster than Snapdragon 400 in many cases while having worse battery life.
  • Snapdragon 615 (MSM8939), equipped with an octa-core Cortex-A53r0p1 CPU configuration with four cores running (in practice) at 1.54 GHz or 1.50 GHz and four cores running at a lower maximum clock frequency (probably 1.0 GHz). This chip has appeared in an increasing number of new smartphone models. Runs in AArch32 mode. Performance is significantly lower than MediaTek's octa-core Cortex-A53-based SoCs, which can run all eight Cortex-A53 cores at the maximum frequency. Memory performance is improved from Snapdragon 410 but falls short of that of MediaTek's SoCs. The Adreno 405 GPU is fairly competitive, suitable for a mid-range SoC, although the 32-bit RAM interface of the SoC limits performance, especially at high resolutions. It is manufactured using TSMC's lower performance 28LP process. There have been reports that the chip gets hot with intensive use and requires throttling.
  • MediaTek MT6732, with a quad-core Cortex-A53r0p2 CPU configuration running at a maximum clock speed of 1.5 GHz. Devices using the chip are starting to become available, and tablets with the tablet version of this chip (MT8732) have also been announced. Although it has only four CPU cores, it has good performance, beating Snapdragon 615 in single core performance at a similar clock speed, and memory performance is significantly higher. The Mali-T760 MP2 GPU contributes to better GPU performance than previous MediaTek chips targeting cost-sensitive segments, although falling short of that of Snapdragon 615 and MT6752.
  • MediaTek MT6752, featuring an octa-core Cortex-A53r0p2 CPU configuration with a maximum clock frequency of 1.69 GHz. Several devices have come to market using this chip, including the Meizu M1 Note. Performance is excellent, with high scores in the Geekbench CPU benchmark, considerably higher than Snapdragon 615 and beating high-end SoCs such as Snapdragon 801 in several metrics. The Mali-T760 MP2 GPU is clocked higher than that of the MT6732, resulting in good GPU performance, comparable to that of Snapdragon 615, as measured with GFXBench, although the 32-bit memory interface will be a bottleneck at high resolutions. Manufactured using TSMC's high-performance 28HPM process. A tablet version of the chip exists as MT8752.
  • MediaTek MT6795, with an octa-core Cortex-A53r0p2 CPU with clock speed up to 2.16 GHz. With a dual-channel memory interface and high resolution support, this SoC targets a higher performance segment than the previously mentioned chips, for which it can potentially offer much better performance/dollar because of the small die size of Cortex-A53 cores. Originally announced to become available in commercial devices before the end of 2014, it was delayed, but competitive benchmark scores for what appear to be more mature versions of the chip have recently shown up. It appears to be configured with full AArch64 mode. Performance is excellent, with single-core performance closing much of the gap with the high-end Snapdragon 801, while multi-core performance is significantly higher. There appears to be a "Turbo" version running the CPU up to 2.16 GHz, while the regular version clocks at 1.95 GHz. At the MWC on 2 March 2015, MediaTek apparently rebranded the MT6795 as Helio X10.
  • MediaTek's MT6735 is a SoC for entry-level smartphones for which benchmark results have not yet become available. It has a quad-core Cortex-A53 CPU configuration and a Mali-T720 GPU, a downgrade from the Mali-T760 GPU in MT6732. The recently announced MT6753, with eight Cortex-A53 cores running up to 1.5 GHz, is compatible with the MT6735 and also has a Mali-T720 GPU (probably MP4). Other chips that have shown up in product announcements include the MT8161 (probably the equivalent of the MT6735 without modem) and MT8165 (equivalent to MT8732 without modem).
  • Qualcomm has announced additional octa-core Cortex-A53-based chips, Snapdragon 415 and Snapdragon 425. These probably utilize a symmetrical Cortex-A53 configuration with all cores running at the same maximum clock frequency, unlike Snapdragon 615. Otherwise, the new SoCs are similar to Snapdragon 615, with the same Adreno 405 GPU. According to Qualcomm, devices using these chips will become commercially available in the second half of 2015.
  • Kirin 620 (Hi6210) from HiSilicon (Huawei) is an octa-core Cortex-A53r0p3-based SoC running up to 1.2 GHz. The GPU is a Mali-450 MP4. Although performance (including single-core performance) is better than Snapdragon 410, it is not as optimized as chips such as MT6752 and runs at a relatively low clock speed. Multi-core performance scaling is less than expected.

Geekbench integer and memory scores comparison


The following table provides details about selected Geekbench integer and memory benchmark scores for different Cortex-A53-based SoCs, and also other smartphone SoCs from Qualcomm, MediaTek and Samsung for comparison.

                Arch    Max freq. JPEG C. IPC   JPEG C. Dijkstra      Stream Copy   Geekbench
                                  Single  x A7  Multi   Single Multi  Single Multi  Ref. number

Snapdragon 410  ARMv7     1.19      596   1.30   2384     810   2135   431   492    1551964
Snapdragon 615  AArch32 1.50/1.0    820   1.42   4979     886   3646   572   703    2015694
MT6732          AArch32   1.50      843   1.46   3357    1041   3002  1001  1199    1546611
MT6752          AArch32   1.69      952   1.46   7554    1144   4483  1071  1191    1583540
MT6795          AArch64   1.95     1026   1.37   8167     990   3802  1356  2068    2002894
MT6795T         AArch64   2.16     1128   1.36   8962    1064   4109  1350  2140    1984431
Hi6210          AArch32   1.20      660   1.43   3501     744   2772   602   900    1999304

Snapdragon 400  ARMv7     1.19      462   1.01   1860     700   2132   534   551    1938063
Snapdragon 801  ARMv7     2.46     1347   1.42   5437    1174   3586  1931  2144    1491681
Snapdragon 805  ARMv7     2.65     1475   1.45   4105    1230   4058  2117  2910    1502687
Snapdragon 810  AArch64  ?/1.55    1358          5972    1073   3584  1428  1838    2017257
MT6582          ARMv7     1.30      506   1.01   2027     748   2354   250   396    2017732
MT6592          ARMv7     1.66      643   1.01   5086     891   3327   261   388    2000008
MT6595          ARMv7   2.20/1.69  1350   1.59   6080    1844   5612  1652  1986    1591744
Exynos 5430     ARMv7   1.80/1.3   1056   1.52   5140    1102   3918  1457  1559    1556780
Exynos 5433     AArch32   1.89     1456   2.10   6209    1523   5728  1396  1458    2017193
Exynos 7420     AArch64  ?/1.50    1481          7168    1065   4596  1953  2579    2012972

The low performance of Snapdragon 410 is apparent in the scores, with normalized IPC (instructions per cycle to the equivalent of a 1.0 GHz Cortex-A7) for the CPU-speed sensitive single-core JPEG Compress benchmark being lower than that of other Cortex-A53-based SoCs, probably due to being limited to ARMv7. The Dijkstra benchmark even scores lower on Snapdragon 410 than on an equivalently clocked Snapdragon 400, and memory performance is also lower.
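The "IPC x A7" column can be approximated as the score per GHz divided by the score per GHz of a Cortex-A7 reference. The sketch below uses Snapdragon 400 from the table as that reference. This choice is an assumption: the exact baseline behind the table is not stated, so the results come out about 0.01 lower than the table values.

```python
# (max clock in GHz, single-core JPEG Compress score) from the table above.
jpeg_single = {
    "Snapdragon 400": (1.19, 462),   # Cortex-A7, used here as the reference
    "Snapdragon 410": (1.19, 596),
    "MT6752": (1.69, 952),
    "MT6795T": (2.16, 1128),
}

def normalized_ipc(freq_ghz, score, ref_freq_ghz, ref_score):
    """Score per GHz relative to the reference chip's score per GHz."""
    return (score / freq_ghz) / (ref_score / ref_freq_ghz)

ref_freq, ref_score = jpeg_single["Snapdragon 400"]
ipc = {
    soc: normalized_ipc(freq, score, ref_freq, ref_score)
    for soc, (freq, score) in jpeg_single.items()
}
for soc, value in ipc.items():
    print(f"{soc:<15} {value:.2f}")
```

Normalizing by clock speed in this way separates the effect of frequency from the efficiency of the core and its memory subsystem, which is what the comparison between chips relies on.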

Snapdragon 615, while improving on Snapdragon 410, also appears to be less optimized than MT6732/MT6752 in terms of single-core IPC, despite a very similar clock frequency. Looking at multi-core performance, MT6752 is significantly faster than Snapdragon 615, largely due to being able to run all eight cores at the maximum clock frequency. MT6732 and MT6752 also have significantly higher memory performance, reaching an impressive score for devices with a 32-bit memory interface.

The higher clock speed of MT6795 (Helio X10) brings benefits for integer performance, but due to the use of the AArch64 instruction set, normalized IPC is lower (1.36 vs 1.46 for JPEG Compress). This is especially true for the Dijkstra benchmark, where AArch64 mode imposes a significant penalty (this is also seen on other platforms utilizing AArch64).

Overall, a high-speed Cortex-A53 configuration such as implemented in the MT6795T comes fairly close to Snapdragon 801 for single-core performance, while being significantly faster for multi-core performance, at a significantly lower cost. Several metrics are also in the same ballpark as the current high-end leader Exynos 7420.

Analysis of the Geekbench Lua subtest


The Lua integer benchmark appears to be particularly sensitive to memory subsystem efficiency, including L2 cache size and memory bandwidth, as well as being dependent on CPU speed. It is the kind of code that may frequently occur in actual practice on a smartphone.

                Arch      Lua     IPC   Lua    CPU    #CPUs
                          Single  x A7  Multi  Par.

Snapdragon 410  ARMv7      603    1.23  2137   3.54   4
Snapdragon 615  AArch32    709    1.15  1644   2.32   4 + 4
MT6732          AArch32    753    1.22  2419   3.21   4
MT6752          AArch32    842    1.21  2361   2.80   8
MT6795          AArch64   1053    1.31  8203   7.79   8
MT6795T         AArch64   1173    1.32  8847   7.54   8
Hi6210          AArch32    587    1.19  1740   2.96   8

Snapdragon 400  ARMv7      476    0.97  1874   3.94   4
Snapdragon 801  ARMv7      980    0.97  2880   2.94   4
Snapdragon 805  ARMv7     1016    0.93  2917   2.87   4
Snapdragon 810  AArch64   1283          1065   0.83   4 + 4
MT6582          ARMv7      514    0.96  1644   3.20   4
MT6592          ARMv7      651    0.95  1344   2.06   8
MT6595          ARMv7     1509    1.67  2498   1.66   4 + 4
Exynos 5430     ARMv7      981    1.33  1861   1.90   4 + 4
Exynos 5433     AArch32   1397    1.89  5478   3.92   4 + 4
Exynos 7420     AArch64   1409          7088   5.03   4 + 4

In this test, Snapdragon 410 performs reasonably well. MT6752's multi-core performance seems limited by a bottleneck, probably external memory bandwidth. MT6795's performance is impressive; while single-core performance falls a little short of Cortex-A57 based SoCs, for multi-core performance it blows past them, with CPU parallelism fully exploited. It seems the bottleneck present with the MT6752 (presumably memory bandwidth and the L2 cache memory size available to each core) is not present with the MT6795.

Qualcomm's Snapdragon 810 consistently scores in the 1000-1200 range for both the single-core and multi-core test, while the multi-core test would have been expected to be significantly higher. This appears to reflect a serious deficiency in the memory subsystem of the SoC (which might not only be related to the LPDDR4 SDRAM controller, but also the on-chip L2 cache) which might also have negative implications for smoothness in every-day use.
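The "CPU Par." column in the Lua table is simply the multi-core score divided by the single-core score. The sketch below reproduces it for a few entries. The per-core efficiency figure is an addition made here for illustration and treats all cores as equal, which is only a rough guide for 4 + 4 big.LITTLE parts.

```python
# (single-core Lua score, multi-core Lua score, core count) from the table above.
lua_scores = {
    "Snapdragon 410": (603, 2137, 4),
    "MT6752": (842, 2361, 8),
    "MT6795": (1053, 8203, 8),
    "Snapdragon 810": (1283, 1065, 8),
    "Exynos 7420": (1409, 7088, 8),
}

def parallel_speedup(single, multi):
    """Multi-core score divided by single-core score."""
    return multi / single

for soc, (single, multi, cores) in lua_scores.items():
    speedup = parallel_speedup(single, multi)
    # Efficiency assumes identical cores; treat it as approximate for big.LITTLE.
    print(f"{soc:<15} speedup {speedup:.2f}  per-core efficiency {speedup / cores:.0%}")
```

A speedup below 1.0, as for Snapdragon 810, means the multi-core run scored lower than a single core, which is what points to a bottleneck outside the CPU cores.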

Geekbench floating point subtests


Finally, let's look at floating point performance. The Mandelbrot subtest tests pure floating point performance, while the SGEMM and SFFT tests also significantly depend on memory performance.


                Arch      Mandelbrot                 SGEMM         SFFT
                          Single  IPC   Multi  Par.  Single Multi  Single Multi

Snapdragon 410  ARMv7      448    1.10  1794   4.00   245    489    317   1258
Snapdragon 615  AArch32    583    1.14  3611   6.19   303    688    426   2517
MT6732          AArch32    585    1.14  2336   3.99   337    653    430   1727
MT6752          AArch32    661    1.15  5257   7.95   384   1148    481   3870
MT6795          AArch64    823    1.24  6406   7.78   484   1542    618   4764
MT6795T         AArch64    912    1.24  7245   7.94   529   1659    694   5333
Hi6210          AArch32    467    1.14  3509   7.51   264    876    343   2178

Snapdragon 400  ARMv7      405    1.00  1620   4.00   203    634    285   1182
Snapdragon 801  ARMv7      788    0.94  3104   3.94   907   2816    992   3518
Snapdragon 805  ARMv7      848    0.94  3389   4.00  1011   2669   1130   4135
Snapdragon 810  AArch64   1100          5144   4.68   749   1828   1009   3643
MT6582          ARMv7      444    1.00  1765   3.98   230    512    328   1316
MT6592          ARMv7      568    1.00  4430   7.80   282    696    419   3397
MT6595          ARMv7     1284    1.71  5822   4.53   748   2337   1187   4255
Exynos 5430     ARMv7      990    1.61  4745   4.79   657   2491    896   3971
Exynos 5433     AArch32   1174    1.91  4883   4.16   751   2369   1044   4031
Exynos 7420     AArch64   1198          6129   5.12   945   2888   1313   4874

From these numbers it is clear that Cortex-A53 improves floating point performance somewhat when compared to Cortex-A7 at the same clock speed. When eight cores can run in parallel at high speed, multi-core floating point performance is impressive, as demonstrated by MT6752 and MT6795. Snapdragon 801 and 805 are looking a bit dated in this department.

In the memory-intensive SGEMM and SFFT tests, Snapdragon 400 comes close to Snapdragon 410, illustrating the lack of performance improvement by Snapdragon 410. In fact MediaTek's previous generation MT6582 matches the floating point performance of Snapdragon 410 across all tests.

The Cortex-A57 based SoCs have the highest single-core floating point performance, although the Cortex-A17-based MT6595 is also very strong. Exynos 5433 and Exynos 7420 beat Snapdragon 810 in most floating point tests, although the difference is not as large as it used to be with earlier results for Snapdragon 810.

Conclusion


It is clear that octa-core Cortex-A53-based SoCs can deliver strong performance at a relatively low cost, and this is particularly true for MediaTek's new chips, MT6752 and MT6795. The MT6795, with its higher clock speed and dual-channel memory interface, can match current high-end chips in most metrics, being not much slower in single-core performance while being superior in multi-core.

One open question is whether the high maximum clock frequency of the MT6795 and MT6795T, which deliver impressive performance/dollar, translates to acceptable power consumption and battery life. It has been observed that power consumption for Cortex-A53 can quickly increase at higher frequencies for the Samsung-manufactured Exynos 5433, but MT6795 is manufactured on a different process at TSMC and probably makes use of specific design optimizations for high clock speeds (ARM POP IP core hardening technology) that make power consumption more acceptable.

Sources: Geekbench Browser

Updated 10 March 2015.

Wednesday, February 25, 2015

Early benchmarks for MT6795 show high performance, suggest use of eight Cortex-A53 cores

MediaTek originally announced the MT6795, a SoC targeting the premium-level and performance segments of the smartphone market, in July 2014, with expectations of devices being commercially available to end users before the end of 2014. However, the chip was delayed (problems with the memory controller were reported) and competitive benchmark results are only now beginning to surface for the chip.

According to the announcement, the SoC was to have an octa-core CPU configuration with clock speeds up to 2.2 GHz, a strong dual-channel memory interface with support for LPDDR3 up to 933 MHz, and 2K (2560x1600) display support. Other reports and information have suggested that it uses a PowerVR G6200 GPU, similar to the one used in MediaTek's MT6595, which can be seen as the 32-bit predecessor of the new chip.

Confusion about processor cores, octa-core Cortex-A53 seems likely


The actual CPU cores used inside the MT6795 continue to be a source of confusion. Initially understood to be an octa-core Cortex-A53 CPU configuration clocked at a high frequency, later a purported leaked MediaTek product roadmap surfaced that described the MT6795 as a big.LITTLE design that includes Cortex-A57 cores. However, a recent new entry in the Geekbench database suggests that the chip actually has eight Cortex-A53 cores as originally suspected, as the IPC (instructions per cycle) of the integer and floating point subtests would be hard to reconcile with Cortex-A57 cores being present.

Geekbench results show mixed performance but high overall score


The Geekbench results show strong CPU performance, with the overall score being superior to that of available results for Snapdragon 810, which has a significantly higher cost design but has been plagued by performance issues, although it scores lower than Exynos 5433/Exynos 7 Octa with Cortex-A57 cores as used in the Galaxy Note 4. Note that MT6795 uses a less advanced 28 nm process compared to the 20 nm process used for Snapdragon 810 and Exynos 5433.

Single-core integer performance is not spectacular and below that of the previous generation high-end chips such as Snapdragon 801. Although this is compatible with the use of medium-performance Cortex-A53 cores, integer single-core performance is actually lower than the mid-range MT6752, despite the higher clock rate, pointing to continuing hardware performance problems with the chip. The Dijkstra benchmark result is particularly low. This benchmark has a lot of external memory access and likely branches a lot, taxing certain elements of the CPU and SoC that simpler CPU benchmarks do not. It may be affected by the doubled address size in AArch64 mode, either through the increased size of pointer storage or reduced efficiency of the branch prediction unit inside the processor core.

Single core floating point performance in the Mandelbrot benchmark is higher than the MT6752 and actually compatible with the Cortex-A53 core running at 2.1 GHz, close to the originally envisaged maximum clock speed for the MT6795. Multi-core performance in this subtest is impressive, with a score that is higher than most existing SoCs including Exynos 7 Octa, which employs faster Cortex-A57 cores.

Finally, the dual-channel memory interface seems to be working reasonably well in the tested revision of the chip/development board, with memory scores consistent with an optimized dual-channel interface, and higher, for example, than those of Exynos 5433. However, they are generally lower than those of the 32-bit MT6595.

One caveat is that the MT6795 entry is running in AArch64 mode, while the other devices were running in AArch32 (32-bit ARMv8) or 32-bit ARMv7 mode.

Average single-core CPU performance, strong multi-core performance


In a direct comparison with the MT6752, which has a comparable CPU configuration but clocked lower and has only a 32-bit memory interface, the MT6795 is only slightly faster, although the MT6795 uses a full 64-bit AArch64 instruction set model, while the tested MT6752 configurations use AArch32 with partial use of ARMv8 features. There are a few anomalous results, including a low score for the MT6795 in the single-core AES benchmark, and as mentioned it also scores significantly lower in the Dijkstra benchmark. Floating point performance is consistently higher for the MT6795 (more than the increase in clock rate would explain), which may be caused by the higher-performance memory subsystem of the MT6795 and/or the increased number of floating point registers available in AArch64 mode.

The MT6795 is clearly slower than its 32-bit predecessor MT6595 (which uses high-performance Cortex-A17 and Cortex-A7 cores in a big.LITTLE configuration) in most metrics, with only the heavy weighting and large performance gain for the AES and SHA1 cryptography tests (due to the new ARMv8 instruction set) shifting the advantage for the overall score towards the MT6795.

When making a comparison with a median entry for the high performance Exynos 5433 (Exynos 7 Octa) inside the Samsung Galaxy Note 4, the MT6795 fairly consistently shows clearly lower single-core performance but higher multi-core performance.

MT6795 likely to be most cost-effective performance segment processor on the market


The exclusive use of Cortex-A53 CPU cores, and not the much more expensive and die-space consuming Cortex-A57 (or, in a 32-bit comparison, Cortex-A15/A17 cores), has positive implications for the cost of the chip. Die space dedicated to the CPU cores will be relatively low, although L2 caches will take considerable space when configured with a size that matches the desired performance level and market segment. Overall, the chip is likely to be attractive in terms of performance/dollar for the performance segment.

In terms of SoC optimizations, the chip would probably work better with the employment of additional ARM IP such as a Mali-T760 or Mali-T800 series GPU, which offers advantages in combination with ARM cores such as Cortex-A53 in tandem with techniques such as AFBC, smart composition and transaction elimination, and new interconnect buses within the chip. SoCs like the MT6752 probably benefit from these optimizations, while the MT6795 cannot do so fully because of the non-ARM GPU. It seems likely that the MT6795 will be superseded in next-generation products to be announced by MediaTek by a similar SoC with an ARM Mali-T760 or T800 series GPU.

Update (2 March): Based on a closed-door presentation event at the MWC, MediaTek appears to have rebranded MT6795 as Helio X10 with future Helio P series products also being announced.

Sources: MediaTek (MT6795 announcement), Geekbench browser

Tuesday, February 17, 2015

Cortex-A53 not as power efficient as Cortex-A7

Recent detailed technical review articles published by AnandTech based on a comparison of Samsung Exynos SoCs have elucidated some of the details about the performance of the Cortex-A53 core, including processing performance, power consumption and die size. Overall, it appears that while Cortex-A53 is significantly faster than Cortex-A7 at the same clock speed, die size and power consumption on an equivalent manufacturing process have increased by a greater amount, leading to lower performance/Watt.

Direct comparison of Cortex-A7 and Cortex-A53 on the same process


In a recently published technical review article about the ARM Cortex-A53 and Cortex-A57 CPU cores and the Mali-T760 GPU core, based on Samsung's Exynos-based Galaxy Note 4 model, AnandTech has provided details about the performance, power consumption and die size of the 64-bit Cortex-A53 core relative to its 32-bit predecessor, Cortex-A7. It has done so by comparing measurements of the Cortex-A53 cores inside the Exynos 5433 used in the Note 4 with the Cortex-A7 cores inside the Exynos 5430 used in the Galaxy Alpha. Both SoCs are produced using a similar 20 nm process at Samsung, making a direct comparison possible.

Cortex-A7 is an in-order pipeline CPU core with moderate performance but an extremely small die size and very low power consumption. The Cortex-A53 core has been designed by ARM as a logical extension of Cortex-A7 to ARM's 64-bit ARMv8 instruction set with higher performance. However, in doing so die size and power efficiency have suffered somewhat.

CPU performance increased in Cortex-A53


According to the designer of Cortex-A53 at ARM, Cortex-A53 increases SPECint-2000 performance from 0.35 SPEC/MHz to 0.50 SPEC/MHz when compared to the Cortex-A7 core. In Geekbench integer benchmarks, disregarding cryptography benchmarks which show a large increase, performance is still about 50% higher for Cortex-A53 when compared to Cortex-A7 at the same clock speed, with the biggest gains coming with multi-threaded performance (aided by the increased memory performance).
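As a rough illustration of what the quoted per-MHz figures imply, the following minimal Python sketch computes the relative per-clock gain and a projected score at a given clock speed. The assumption that the score scales linearly with clock speed is mine (it ignores memory effects), and the function names are purely illustrative.

```python
# Per-MHz SPECint2000 figures as quoted in the text.
A7_SPEC_PER_MHZ = 0.35
A53_SPEC_PER_MHZ = 0.50


def relative_gain(new: float, old: float) -> float:
    """Fractional improvement of `new` over `old` (0.25 means 25% faster)."""
    return new / old - 1.0


def projected_score(spec_per_mhz: float, clock_mhz: float) -> float:
    """Projected score, ASSUMING perfectly linear scaling with clock speed."""
    return spec_per_mhz * clock_mhz


gain = relative_gain(A53_SPEC_PER_MHZ, A7_SPEC_PER_MHZ)
print(f"Per-clock gain of Cortex-A53 over Cortex-A7: {gain:.0%}")

# Both cores at the 1300 MHz clock used in the Exynos comparison.
for name, per_mhz in (("Cortex-A7", A7_SPEC_PER_MHZ), ("Cortex-A53", A53_SPEC_PER_MHZ)):
    print(f"{name} at 1300 MHz: about {projected_score(per_mhz, 1300):.0f}")
```

The per-clock gain from these two figures works out to roughly 43%, which is in the same range as the "about 50%" seen in the Geekbench integer results.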

For floating point benchmarks the performance increase reported by AnandTech is dramatic, with most benchmarks showing a two to three times performance increase. However, there seems to be a discrepancy between these benchmark results and benchmark results available from the Geekbench results database for Cortex-A53 and Cortex-A7-based devices, which show only a moderate floating point performance increase for Cortex-A53 over Cortex-A7. Most likely, AnandTech is erroneously reporting Cortex-A57 core floating point performance in this case (this matches Geekbench results that I previously tabulated).

Memory performance benchmarks performed by AnandTech show a relative increase in latency for a Cortex-A53 cluster between transfer sizes of 256 KB and 512 KB when compared to a Cortex-A7 cluster, despite the fact that this should fit inside the 512 KB L2 cache. However, as I previously noted in earlier blog articles, the benchmarks show that memory bandwidth has significantly increased with Cortex-A53 when compared to Cortex-A7, virtually doubling. This most likely contributes to the Cortex-A53 core's greater multi-threading performance in practice.

Power consumption of Cortex-A7 greatly reduced with Samsung's 20 nm process


AnandTech has published a detailed chart showing estimates for power consumption of the previous generation 32-bit Cortex-A7 and Cortex-A15 cores on both 20 nm and 28 nm processes at Samsung, based on Samsung's Exynos 5422 (28 nm) and Exynos 5430 (20 nm) SoCs.

While the high-performance Cortex-A15 cores are seeing a power reduction of about 25%, power consumption of the Cortex-A7 cores sees a significant 40% reduction with a 56% reduction at the highest CPU frequency of 1300 MHz. This can be partly explained by Samsung optimizing the Cortex-A7 cores inside Exynos 5430 for low power consumption using ARM's POP IP optimization platform.

Ironically, the excellent power characteristics of the Cortex-A7 at the latest processes such as Samsung's 20 nm process have not been taken advantage of in the market except in Samsung's Exynos big.LITTLE 5430, since Cortex-A7 adoption is mostly limited to 40 and 28 nm and all announced 20 nm SoCs use Cortex-A57 and Cortex-A53 cores. There seems to be an opportunity for ultra-efficient 20 nm Cortex-A7-based SoCs for certain product segments, while there is also a significant opportunity for 20 nm Cortex-A53-only SoCs that should be more power efficient than their 28 nm equivalents.

One could envision a hypothetical octa-core Cortex-A7-based SoC manufactured on TSMC's 20 nm HPM process delivering spectacular performance/Watt, with relatively high clock speeds being possible. AnandTech's article notes that TSMC's 28 nm and 20 nm HPM processes are most likely significantly more efficient than Samsung's equivalent process technology because they allow CPUs to operate at a lower voltage level. A similar argument applies to Cortex-A53-based SoCs manufactured at 20 nm, albeit with lower performance/Watt.

In terms of die size, AnandTech reports a significant reduction of 45% for the Cortex-A7 cores and 64% for the Cortex-A15 cores in the 20 nm Exynos 5430 vs 28 nm Exynos 5422.

Cortex-A53 has significantly greater power consumption than Cortex-A7


AnandTech has published a detailed chart with power consumption characteristics of the Cortex-A53 cores inside Samsung's Exynos 5433 manufactured at 20nm. In their analysis, AnandTech notes a relatively large increase in power consumption when utilizing multiple Cortex-A53 cores at their highest frequency (1300 MHz on Exynos 5433), when compared to running at 1.0 GHz. This correlates with a voltage bump when going from 1.0 to 1.3 GHz.
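Why a voltage bump matters so much can be illustrated with the usual first-order approximation for dynamic CMOS power, P ≈ C·V²·f. The sketch below is only an illustration: the two voltages are hypothetical placeholders that I chose for the example, not values measured by AnandTech, and static (leakage) power is ignored.

```python
def dynamic_power_ratio(v_low: float, f_low: float, v_high: float, f_high: float) -> float:
    """Ratio of dynamic power between two operating points, using P ~ C * V^2 * f.

    The switched capacitance C cancels out because the same core is compared
    with itself at two different voltage/frequency points.
    """
    return (v_high / v_low) ** 2 * (f_high / f_low)


# HYPOTHETICAL operating points (voltages are assumptions, not measurements).
V_AT_1000_MHZ = 0.90  # volts, assumed
V_AT_1300_MHZ = 1.05  # volts, assumed

ratio = dynamic_power_ratio(V_AT_1000_MHZ, 1000, V_AT_1300_MHZ, 1300)
print(f"Clock increase:  {1300 / 1000 - 1:.0%}")
print(f"Power increase:  {ratio - 1:.0%} (under the assumed voltages)")
```

Because voltage enters quadratically, even a modest voltage increase makes power grow considerably faster than clock speed, which is the pattern described for the step from 1.0 to 1.3 GHz.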

Based on this analysis, the article concludes that power consumption is more than twice as large for Cortex-A53 when compared to Cortex-A7 at an equivalent clock speed of 1300 MHz on a similar manufacturing process (Samsung's 20 nm process). Although the Cortex-A53 core's CPU performance is greater, it is not twice as great, leading to clearly lower performance/Watt for Cortex-A53 when compared to Cortex-A7.
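The performance/Watt argument can be made concrete with a small calculation. This is a sketch under simple assumptions: it uses the per-clock performance figures quoted earlier in this article (0.50 vs 0.35 SPEC/MHz, and "about 50%" in Geekbench integer tests) and treats the "more than twice" power figure as a lower bound of exactly 2.0, so the results are upper bounds on relative efficiency.

```python
# Relative performance of Cortex-A53 vs Cortex-A7 at the same clock (from the text).
PERF_RATIO_SPEC = 0.50 / 0.35   # SPECint2000 per MHz
PERF_RATIO_GEEKBENCH = 1.5      # "about 50% higher" in Geekbench integer tests

# "More than twice" the power at 1300 MHz; 2.0 is used as a LOWER BOUND (assumption).
POWER_RATIO_MIN = 2.0


def perf_per_watt_ratio(perf_ratio: float, power_ratio: float) -> float:
    """Relative performance/Watt of A53 vs A7 (values below 1.0 mean A53 is less efficient)."""
    return perf_ratio / power_ratio


for label, perf in (("SPECint2000", PERF_RATIO_SPEC), ("Geekbench integer", PERF_RATIO_GEEKBENCH)):
    ratio = perf_per_watt_ratio(perf, POWER_RATIO_MIN)
    print(f"{label}: A53 delivers at most {ratio:.0%} of the A7's performance/Watt")
```

With these inputs, Cortex-A53 ends up at no more than roughly 70 to 75% of the performance/Watt of Cortex-A7 at 1300 MHz.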

It is possible that the chip errata (hardware bugs) in earlier revisions of Cortex-A53 that I mentioned in previous articles play a role in reducing the measured performance and power efficiency of Cortex-A53. Exynos 5433 uses Cortex-A53r0p1, which is affected by this. The chip errata require more frequent cache flushing as a work-around, which can potentially affect performance as well as power consumption. The non-optimal state of big.LITTLE kernel scheduling code may exacerbate these problems. There is potential for later revisions of Cortex-A53 such as r0p3 to deliver higher efficiency because they are not affected by these hardware problems. Chips with Cortex-A53 revision r0p3 have not yet appeared on the market.

Chip-specific core optimizations make comparisons more difficult


It should be noted that specific optimization of the processor cores for a particular higher clock frequency target (e.g. in chips like MediaTek's MT6752 and MT6795) or low power consumption at lower clock frequency (for example, in a big.LITTLE configuration), using ARM's POP core hardening technology, has the potential to skew the comparison between different chips. MediaTek's MT6752 has already been reported to have acceptable power consumption while running at a relatively high maximum clock frequency, which would otherwise be incompatible with the steep rise in power consumption for clock speeds above 1.2 GHz observed in the charts for the Samsung chips.

Die size of Cortex-A53 increased compared to Cortex-A7


The die size of Cortex-A53 cores when compared to Cortex-A7 in Samsung's chips is about 1.75 times greater according to AnandTech, although it is still below one square millimeter, which is small for a CPU. When looking at the total cluster size, which includes the L2 cache (the same amount of 512 KB for Cortex-A53 and Cortex-A7), the die size of the cluster is 1.38 times greater. The larger die size has consequences for cost-sensitive SoCs for low-end mobile devices and IoT applications, for which Cortex-A7 remains more attractive. Cortex-A7 can also be employed as an embedded CPU in a functional block such as a baseband processor, just like Cortex-A5 is frequently used.
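The two ratios above (1.75 for the cores, 1.38 for the cluster) can be combined into a rough estimate of how much of a cluster is taken up by the L2 cache and other shared logic. The sketch below assumes a deliberately simple model that is mine, not AnandTech's: a cluster consists of core area plus a shared part, and the shared part has the same area in both clusters.

```python
CORE_AREA_RATIO = 1.75     # Cortex-A53 cores vs Cortex-A7 cores (from the text)
CLUSTER_AREA_RATIO = 1.38  # Cortex-A53 cluster vs Cortex-A7 cluster (from the text)


def shared_to_core_ratio(core_ratio: float, cluster_ratio: float) -> float:
    """Shared area (L2 etc.) divided by A7 core area, under the simple model.

    Model (ASSUMPTION): with C = A7 core area and S = shared area,
        (core_ratio * C + S) / (C + S) = cluster_ratio
    which rearranges to S / C = (core_ratio - cluster_ratio) / (cluster_ratio - 1).
    """
    return (core_ratio - cluster_ratio) / (cluster_ratio - 1.0)


s_over_c = shared_to_core_ratio(CORE_AREA_RATIO, CLUSTER_AREA_RATIO)
shared_share_a7 = s_over_c / (1.0 + s_over_c)                 # share of the A7 cluster
shared_share_a53 = s_over_c / (CORE_AREA_RATIO + s_over_c)    # share of the A53 cluster

print(f"Shared area / A7 core area:   {s_over_c:.2f}")
print(f"Shared part of A7 cluster:    {shared_share_a7:.0%}")
print(f"Shared part of A53 cluster:   {shared_share_a53:.0%}")
```

Under this model the shared part is roughly as large as all the Cortex-A7 cores together, which is why the cluster grows much less than the cores themselves.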

Consequences for mobile SoCs


The higher performance of Cortex-A53 when compared to Cortex-A7, especially memory bandwidth, makes high-clocked multi-core Cortex-A53-based SoCs suitable for mid-range performance segments. Examples of this are MediaTek's MT6752 and Qualcomm's Snapdragon 615 SoC. These SoCs also have higher GPU performance than that traditionally associated with Cortex-A7-based SoCs.

The increased power consumption and die size of Cortex-A53 cause Cortex-A7 to remain relevant, because it still delivers superior power efficiency, cost and die size, and consequently its performance/Watt and performance/dollar are better than those of Cortex-A53. Hypothetically, a 20 nm octa-core Cortex-A7-based SoC would deliver excellent power efficiency with quite acceptable performance due to higher clock speeds, and there may be a market for such a solution for smartphones. The main drawback would be that OS ecosystems such as Android are moving towards 64-bit implementations and can also make use of new cryptography instructions in ARMv8.

Sources: AnandTech (technical Exynos Galaxy Note 4 review)

Updated 1 March 2015 (Add section about core-hardening).

Monday, December 29, 2014

Another look at Cortex-A53 CPU core performance

Several smartphone chips using ARM's new Cortex-A53 and Cortex-A57 CPU cores with the 64-bit-capable ARMv8 instruction set have arrived on the market recently. Cortex-A53-only based SoCs are especially attractive from a performance/dollar standpoint. However, as I described in earlier articles, there exist significant performance differences between different Cortex-A53 implementations, with some early revisions of the core being limited in performance, probably because of design bugs.

32-bit version of ARMv8 seems practical


Most of the Cortex-A57 and Cortex-A53-equipped SoCs currently seem to be running in what can be called "32-bit ARMv8 mode" (AArch32 in Geekbench, as opposed to ARMv7 for older 32-bit devices), taking advantage of some of the features of the ARMv8 instruction set (which is better suited to modern CPU chip architectures) while avoiding some of the disadvantages of the full 64-bit model (such as doubled storage space for pointers and addresses).

Whether the full 64-bit instruction model (AArch64) will soon be attractive for Android devices, including lower-end ones such as Cortex-A53-based devices with limited amounts of CPU cache and RAM, is unclear. NVIDIA already uses AArch64 in conjunction with their latest Tegra K1 SoC. Optimizations for AArch64 seem to have been a work in progress and early benchmarks for systems running in AArch64 mode were quite poor in comparison to 32-bit mode benchmarks, but progress is being made. Theoretically, more registers are available in AArch64 mode, including for the NEON SIMD unit, which should help performance in some important cases, and may mitigate the disadvantages of increased address storage size.

Snapdragon 410 has crippled first revision of Cortex-A53


Snapdragon 410 (MSM8916) is a SoC with quad-core Cortex-A53 that has been one of the first chips with Cortex-A53 cores to come to market and has already been adopted in significant volume for low-to-mid-range designs, replacing the older Cortex-A7-based Snapdragon 400.

However, it is obvious that the very first public revision of the Cortex-A53 core as used inside Snapdragon 410, Cortex-A53 r0p0, is crippled in terms of performance, clearly scoring lower in CPU and memory-intensive benchmarks (even after making the significant correction for clock speed) than SoCs using later revisions of the Cortex-A53 core such as Snapdragon 615 and MediaTek's new chips. Coupled with Snapdragon 410's relatively low clock speed of 1.19 GHz, this results in significantly lower performance than the newer mid-range chips mentioned. Performance in complex benchmarks that simulate demanding, typical use such as complex browsing and gaming is even worse.

Advertising of Snapdragon 410 as having 64-bit support is very misleading


The lower performance seems to be partly associated with the fact that Snapdragon 410 (because of the r0p0 revision of Cortex-A53) is completely limited to ARMv7-compatibility mode and is unable to run in ARMv8 mode (32-bit or otherwise). I have yet to see evidence of a shipping Snapdragon 410 chip that is 64-bit or even ARMv8 capable. It functions as nothing more than a chip with somewhat faster 32-bit ARMv7 Cortex-A7 cores. In this sense, labeling the chip as being 64-bit or potentially having support for 64-bit ARMv8 in a future update is downright misleading or a blatant lie, depending on one's standpoint.

Memory performance seems most impacted


Based on Geekbench results, Snapdragon 410 has about 10% lower pure integer CPU performance per MHz when compared to chips such as Snapdragon 615 and MediaTek's MT6732/MT6752. For pure floating point performance, performance is about 5% lower. The biggest difference is in memory performance, where Snapdragon 410 is about 25% slower than Snapdragon 615 (with r0p1 Cortex-A53) and more than two times slower (even when correcting for clock or memory speed) than MT6732/MT6752 with Cortex-A53 r0p2. Another big difference is found in cryptography performance because of the extra ARMv8 instructions that apparently are not available to Snapdragon 410.

A large part of the lower performance of the Cortex-A53 cores inside Snapdragon 410 may be due to chip design bugs as evident from errata issued by ARM for earlier revisions of the Cortex-A53 core. Some details about these errata, which mostly involve memory coherency issues related to CPU cache memory, can be found when compiling a Linux kernel.

Snapdragon 410 shows poor scores in real-world benchmarks


While Snapdragon 410 delivers somewhat better scores than the Cortex-A7-based Snapdragon 400 at the same clockspeed in pure CPU-specific benchmarks such as Geekbench for single-core performance, multi-core performance does not show much benefit (which is unexpected based on the architectural advantage that the Cortex-A53-based Snapdragon 410 should have).

Even worse is the performance in practical benchmarks that measure performance for web browsing, gaming and other more complex, practical use cases. Based on benchmark results reported by GSMArena (1) (2), Basemark X, which is a gaming benchmark that simulates throughput for a more demanding typical usage pattern and makes use of the Unity engine, reports a significantly lower score than recent Snapdragon 400-based models such as the Moto G (2014), with the GPU score being similar, pointing to significant flaws in (multi-core) CPU and memory performance.

In Rightware's Browsermark 2.1, a browser benchmark with use of advanced web standards such as HTML 5, WebGL and advanced JavaScript, performance is downright disappointing, with a score less than half that of Snapdragon 400-based devices. Other browser benchmarks show similar results. Scores in Rightware's overall-use Basemark OS II benchmark are also typically relatively disappointing, not surpassing those of Snapdragon 400-based devices.

Hardware bugs likely cause of crippled performance


These lower than expected benchmark results for more complex, typical-use benchmarks are consistent with hardware bugs in the Cortex-A53 implementation of the Snapdragon 410 being a bottleneck and significantly degrading especially multi-core performance. In particular, work-arounds for cache consistency and coherency issues have the potential to significantly degrade performance, for example by forcing the kernel to frequently flush CPU caches.

The Linux kernel source shows commits to handle errata for Cortex-A53 up to r0p2 relating to cache clean operations, with the work-around being to promote cache clean to cache clean and invalidate. This could mean that revision r0p3 of Cortex-A53 may see further improvements. These commits do not explain the performance difference between r0p0, r0p1 and r0p2, since the work-around is the same for all three revisions.

Third revision of Cortex-A53 (r0p2) seems to improve memory performance


Some of the hardware or performance bugs that plagued especially the first version of the core (r0p0 as used in Snapdragon 410) have most likely been fixed in later revisions, contributing to a significant performance increase at the same clock speed.

SoCs with the third revision (r0p2) of the Cortex-A53 core seem to have much better memory performance as shown by Geekbench results, especially impressive given the bandwidth limitations of a 32-bit external memory interface. Most likely, this improvement is derived synergistically with ARM IP such as the Mali-T760 GPU as well as other IP blocks, which are implemented inside chips such as MT6732 and MT6752.

Since a SoC such as MT6732 is on the surface essentially comparable to Snapdragon 410 in the sense of having four Cortex-A53 CPU cores, there seem to be major performance improvements in the later revisions of the Cortex-A53 core and associated system architecture, especially with regard to memory performance. The difference is made more pronounced by the fact that the MT6732 is manufactured using TSMC's higher performance 28HPM process rather than 28LP and also clocked significantly higher.

Octa-core Cortex-A53 configurations provide impressive multi-core performance


Octa-core Cortex-A53-based SoCs such as MT6752 and to a lesser extent Snapdragon 615 are already showing impressive multi-core CPU performance, while single-core performance has also improved considerably over prior cost-effective CPU architectures. Multi-core performance, both in terms of pure CPU integer and floating point performance, for the MT6752 significantly surpasses (by tens of percent in many benchmarks) the much more expensive Snapdragon 801, while single-core performance is catching up, being about 30% slower for integer operations and 15% slower for floating point. This high level of performance comes at a fraction of the cost (primarily because of the small die size and low power consumption of the Cortex-A53 cores).

Memory bandwidth still a bottleneck


However, when the memory subsystem truly comes into play, high-end chips such as Snapdragon 801 still show much greater performance because of their much higher external memory bandwidth (because of the wider memory interface) as well as larger CPU cache. This is apparent in the Geekbench subtest SGEMM (which is heavy on sequential memory access), for which high-end SoCs such as Snapdragon 801 are more than twice as fast.

In practice, memory performance is important for how fast a device feels, impacting response times and also being very important for GPU performance. High screen resolutions also put heavy demands on the memory subsystem. In that sense, SoCs such as MT6752 and Snapdragon 615 still perform best at a resolution like 1280x720, with the best performance at 1920x1080 and higher still reserved for high-end SoCs.

There seems to be great potential for performance-oriented Cortex-A53 SoCs with a memory interface wider than 32-bit, comparable with other performance-oriented SoCs. This would be the "best of both worlds" in several respects (lower cost because of the small die size of the CPU cores, low power consumption, while still having the memory bandwidth to drive high resolutions). MediaTek has announced a chip that was expected to have such a configuration, the MT6795, but it has not quite appeared on the market yet and might be delayed. However, similar solutions certainly look likely to become popular for performance-oriented devices in the not too distant future.

Appendix: Table with detailed Geekbench CPU benchmark results


Presented here is a table with detailed benchmark result information for the mentioned SoCs, also including several other SoCs on the market. Included is information about the CPU cores used, their clock speed, the smartphone model and Geekbench result entry used as a reference, and scores for several benchmarks. Indexed results (relative to a 1.0 GHz Cortex-A7) are shown for several of the benchmarks, as well as multi-core performance scaling indices. Results relevant to the discussion above have been highlighted in bold. The following Geekbench subtests have been included:

  • JPEG Compression (single/multi-core). A useful integer benchmark that seems to strongly depend on pure CPU performance (CPU core type and clock speed) with less dependence on the memory subsystem (including L2 CPU cache).
  • Dijkstra (single/multi-core). A more complex integer benchmark that probably includes more memory access and may branch a lot. Notable for this benchmark is that Cortex-A53 performs better than Cortex-A15 at the same clock speed, with both Cortex-A17 (MT6595) and Cortex-A57 being significantly faster still.
  • Mandelbrot (single/multi-core). A pure floating point benchmark, highly dependent on the combination of CPU core type and clock speed.
  • Stream copy (single/multi-core). An important metric for memory performance (especially sequential external RAM performance).
  • SGEMM. A floating point matrix multiplication benchmark that heavily depends on sequential memory access. The memory bandwidth available to the SoC makes a critical difference for this benchmark.
  • SFFT. A floating point benchmark that heavily uses random memory access.
For a high-resolution version, view/copy/save the image above using the browser.
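For readers who want to reproduce this kind of indexing from their own Geekbench entries, here is a minimal Python sketch. The exact formulas are my assumptions about a reasonable way to do it (raw index against a 1.0 GHz Cortex-A7 reference, a clock-normalized variant, and multi-core scaling as multi-core score divided by single-core score), and all scores in the example are hypothetical placeholders rather than values from the table.

```python
def raw_index(score: float, ref_score: float) -> float:
    """Score relative to the reference device (a 1.0 GHz Cortex-A7 entry)."""
    return score / ref_score


def per_clock_index(score: float, clock_ghz: float, ref_score: float, ref_clock_ghz: float = 1.0) -> float:
    """Score per GHz relative to the reference's score per GHz."""
    return (score / clock_ghz) / (ref_score / ref_clock_ghz)


def multicore_scaling(multi_score: float, single_score: float) -> float:
    """How many times faster the multi-core run is than the single-core run."""
    return multi_score / single_score


# HYPOTHETICAL example values (not taken from the table).
REF_SCORE = 500.0      # single-core score of a 1.0 GHz Cortex-A7 reference
DEVICE_SINGLE = 1275.0
DEVICE_MULTI = 7650.0
DEVICE_CLOCK_GHZ = 1.7

print(f"Raw index:         {raw_index(DEVICE_SINGLE, REF_SCORE):.2f}")
print(f"Per-clock index:   {per_clock_index(DEVICE_SINGLE, DEVICE_CLOCK_GHZ, REF_SCORE):.2f}")
print(f"Multi-core scaling: {multicore_scaling(DEVICE_MULTI, DEVICE_SINGLE):.2f}x")
```

The per-clock index isolates the contribution of the core design from that of clock speed, while the scaling figure shows how well a subtest makes use of the available cores.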

Sources: Geekbench browser, Primate Labs website

Updated January 2, 2015 (Add section of low typical-use benchmark scores for Snapdragon 410).
Updated January 5, 2015 (Update Geekbench performance table).
Updated January 10, 2015 (Update performance table).
Updated February 11, 2015 (mention and link Linux kernel Cortex-A53 errata).

Thursday, December 25, 2014

Cortex-A53-based Snapdragon 615 arrives, but power efficiency in question

Qualcomm's Snapdragon 615 (MSM8939), an octa-core ARM Cortex-A53-based SoC with four cores clocked at 1.54 GHz and four cores clocked at 1.0 GHz, has arrived on the market with a significant number of new models shipping from several manufacturers.

The new chip conveniently fills the gap in Qualcomm's product line for SoCs with integrated baseband between the low-to-mid-range Snapdragon 400/410 and the high-end Snapdragon 801, which have a large performance and cost difference, as for some time Qualcomm has offered no competitive smartphone solution with performance falling in between for the performance mid-range category.

While the SoC appears to offer good mid-range CPU and GPU performance, based on early evidence its power efficiency appears to be less than what one would expect based on its utilization of low-power Cortex-A53 cores.

DRAM interface appears to be 32-bit after all


Early data suggested that Snapdragon 615 (MSM8939) would utilize a relatively wide 64-bit external DRAM interface, which is not typical of cost-sensitive devices because it significantly increases the cost of the PCB design, the chip and other components. A 64-bit DRAM interface would mean that memory bandwidth is relatively high and that the chip would run relatively smoothly at resolutions such as FullHD (1920x1080) and higher.

However, more recent sources as of December 2014 (including Qualcomm's website) indicate the chip uses a cost-effective 32-bit DRAM interface with support for LPDDR3 up to 800 MHz, resulting in memory bandwidth of 6.4 GB/s, comparable with other cost-effective mid-range SoCs, which can lead to constrained performance when running at high resolutions such as 1920x1080.
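The 6.4 GB/s figure follows directly from the interface width and clock. The short Python sketch below shows the arithmetic, assuming the standard double-data-rate behaviour of LPDDR3 (two transfers per clock cycle) and decimal gigabytes; the function name is illustrative only.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, clock_mhz: float, transfers_per_clock: int = 2) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (decimal, 1 GB = 1e9 bytes).

    transfers_per_clock defaults to 2 because LPDDR3 is a double-data-rate memory.
    """
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


# 32-bit LPDDR3 at 800 MHz, as described for Snapdragon 615.
print(f"32-bit interface: {peak_bandwidth_gb_s(32, 800):.1f} GB/s")

# The originally rumoured 64-bit interface would have doubled this.
print(f"64-bit interface: {peak_bandwidth_gb_s(64, 800):.1f} GB/s")
```

These are theoretical peak values; sustained bandwidth in practice is lower and is shared between the CPU cores, the GPU and the display controller.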

GPU appears to have strong pixel processing capabilities, but is limited by memory bandwidth


The Adreno 405 GPU provides adequate performance for a mid-range SoC, comparable in benchmarks such as the GFXBench T-Rex and Manhattan tests to that of MediaTek's new MT6752 (also an octa-core Cortex-A53-based SoC with a 32-bit memory interface, in conjunction with a Mali-T760 MP2 GPU), while being roughly three times faster than the GPU in the low-to-mid-range Snapdragon 400/410 platforms.

In GFXBench subtests, the ALU and Alpha Blending benchmark results are particularly high for a mid-range device and close to the scores achieved by higher-end chips from competitors such as Kirin 920 and Exynos 5 Octa, which have Mali-T628 MP4 and Mali-T628 MP6 GPUs and a wider DRAM interface. However, the pixel fill rate is lower and is probably a bottleneck because of the memory bandwidth limitation. This could suggest that the GPU inside the chip is larger and higher powered than it needs to be, stemming from original plans for a 64-bit DRAM interface on the SoC. In comparison, the Mali-T760 MP2 as implemented in the MT6752 has less processing power but implements bandwidth-saving techniques from ARM that improve performance in a bandwidth-constrained environment.

The 32-bit memory interface and resulting memory bandwidth bottleneck probably mean that devices using the SoC will run significantly smoother (especially in games) with better battery life when using a lower-resolution screen such as 1280x720, while a resolution of 1920x1080 will make the memory interface the bottleneck, also resulting in shorter battery life. A similar phenomenon is seen with other relatively high-powered SoCs with limited memory bandwidth, such as MediaTek's previous generation MT6592.

SoC design shows some signs of cost-reduction measures, including use of 28LP process


Benchmark scores and GPU performance illustrate that this is not a high-end chip and that Qualcomm has reduced cost in a number of ways, reducing CPU and GPU performance. A likely factor is a smaller and slower L2 cache when compared to higher-end SoCs, as well as the relatively limited memory bandwidth provided by the 32-bit DRAM interface.

Another major factor is that, despite being a relatively performance-oriented chip, it is manufactured using TSMC's relatively economical and low-performance 28LP process (also used for Snapdragon 400/410), which limits clock rates and power efficiency. Other chips, like the Snapdragon 800 series and most of MediaTek's mid-range solutions like MT6752 are manufactured using the higher-performance 28HPM process at TSMC, which provides significantly better performance (higher clock rates) and lower power consumption.

Reduced cost and die size lowers wafer requirements


By migrating part of its performance mid-range SoC offerings from the Snapdragon 800 series to Snapdragon 615, Qualcomm is effectively reducing its wafer requirements at TSMC (especially for HPM), because Snapdragon 615 is likely to have a much smaller die size than the relatively large Snapdragon 801 (the total area for the CPU cores is much smaller, despite there being twice as many cores) and more chips can be manufactured on a single wafer. Qualcomm also saves a significant amount of cost this way (although in the past, Qualcomm's patent royalty leverage has meant that the chip margins were not as important as they might be for other companies).

Reviews and benchmark scores show mediocre battery life and power efficiency


Contrary to initial expectations from the use of power efficient Cortex-A53 CPU cores in a pseudo big.LITTLE configuration, Snapdragon 615 does not appear to be very power efficient, resulting in mediocre battery life in end devices. The Snapdragon 615-based Oppo R5 shows poor battery life in a review by GSMArena, partly because of the high resolution 1080p AMOLED screen. The SoC is likely to be less efficient with resolutions of 1080p and higher.

In the GFXBench long-term performance benchmark for the HTC Desire 820, GPU performance is sustained close to the maximum level, but with a relatively mediocre battery lifetime score of 153 minutes, which is lower than almost all other modern smartphones. A review of the same device by Android Central noted that battery life was reasonable although not spectacular. The HTC model uses a 720p resolution which is likely to result in more acceptable battery life than devices running at 1080p.

Part of the reason for the relatively high power consumption is likely to be the use of the less efficient 28LP semiconductor process at TSMC, in conjunction with a relatively powerful GPU with a relatively large die size (which is however limited by memory bandwidth). The Cortex-A53 cores may also perform worse, with higher power consumption, when compared with implementations using the 28HPM process such as MediaTek's Cortex-A53-based designs.

Is Cortex-A53 less power-efficient than expected?


Based on its similarities with the very power efficient Cortex-A7 core, one would expect Cortex-A53 to be a relatively power efficient CPU core, and in that sense the power efficiency of the Cortex-A53-only Snapdragon 615 might be considered disappointing. However, in the case of Snapdragon 615, there are important factors that reduce the power efficiency of the implementation. The 28LP process is a major factor, as well as presumably the relatively high-powered GPU. The 32-bit memory interface in conjunction with the relatively powerful multi-core CPU and GPU can cause memory bus contention due to insufficient bandwidth, resulting in relatively heavy DRAM access patterns.

Another factor could be the r0p1 revision of the Cortex-A53 core; progressive revisions of the core show indications of increased performance and efficiency. MediaTek uses revision r0p2 in its MT67xx family, as well as using the more efficient 28HPM process at TSMC. Samsung has already been shipping the 20 nm-manufactured Exynos 7 Octa (5433) for several months which also uses Cortex-A53 to good effect as the power efficient part of its CPU configuration.

The bandwidth-saving techniques of the Mali-T760 GPU (used by both MediaTek and Samsung) and other ARM IP blocks are likely to contribute to reduced power consumption. Battery life benchmarks and reviews for the MT6732 and MT6752, when they become available, will help clarify whether an octa-core Cortex-A53 with a 32-bit memory interface can in fact provide low power consumption and long battery life.

Sources: Wikipedia (Snapdragon page), Qualcomm (Snapdragon processor page), GFXBench results browser, GSMArena, Android Central

Updated January 2, 2015.

Friday, November 7, 2014

Analysis of tablet processors by chip company, with a focus on Geekbench CPU performance

The Geekbench browser, which includes hundreds of thousands of mobile benchmark results, provides access to a wealth of information about the CPU and memory performance of smartphone and tablet SoCs. Because certain subtests within Geekbench results (such as the single-core JPEG Compress test) correlate well with CPU clock speed for a given CPU core, it is possible to determine the actual maximum clock speed of the CPU, which sometimes does not correspond to the advertised clock speed or even the clock speed reported by the operating system.
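As a rough illustration of this approach, the sketch below scales a known reference result linearly with clock speed. It assumes that the single-core JPEG Compress score is strictly proportional to clock frequency for a given core type, and it borrows figures from the tables later in this post (the MediaTek MT8125 as the Cortex-A7 reference, the Allwinner A23 as the chip under test). The function name is purely illustrative.

```python
def estimate_clock_ghz(score, ref_score, ref_clock_ghz):
    """Estimate the actual clock speed of a core from its benchmark score.

    Assumption: for one core type, the score scales linearly with clock speed.
    """
    return ref_clock_ghz * score / ref_score


# Reference: Cortex-A7 in the MT8125, 472 points at 1.20 GHz (table value).
# Chip under test: Allwinner A23 (also Cortex-A7), 463 points (table value).
a23_clock = estimate_clock_ghz(463, ref_score=472, ref_clock_ghz=1.20)

# Score that a genuine 1.54 GHz Cortex-A7 would be expected to reach.
expected_at_reported = 472 * 1.54 / 1.20

print(f"Estimated A23 clock: {a23_clock:.2f} GHz")
print(f"Score expected at 1.54 GHz: {expected_at_reported:.0f}")
```

Under these assumptions the A23 estimate lands close to 1.2 GHz, well short of the clock speed its firmware reports.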

By assessing the number of entries for a specific chip or model, the database also provides an indication about the unit volume and popularity of specific chips and models. The approximate arrival on the end market of specific chips can also be estimated.

In this post, I am analysing the Android ARM and x86-based tablet processor market of the last two years or so from the low-end (mostly chips used in Chinese white-box tablets) to high-end devices from well known brand names, with a focus on CPU performance and other information that can be found after studying the Geekbench results database. The article takes on tablet SoC chip companies in alphabetical order, one-by-one.

Although the article specifically focuses on tablet chips, there is some overlap with smartphone chips since many players in the smartphone chip space also compete in tablets with solutions that are generally similar to their smartphone chip solutions. HiSilicon, the chip division of Huawei, is becoming more prominent for smartphone SoCs but has been omitted because it does not really target tablets. A similar argument applies to the Chinese low-end smartphone chip designer Spreadtrum. These companies may be covered in a future update or in an article focusing on smartphone chips.

Actions Semiconductor


Actions is a Chinese chip company with a long prior history in the MP3 player chip market, which has operated at the bottom level of the white-box tablet market in the last few years.

Chip      Arrival  Fab    CPU             Clock speed  Geekbench  Multi   GPU
                          configuration   (typical)    JPEG C.    core x

ATM7021   Q4 2013  40nm  2 x Cortex-A5    1.3? GHz                        PowerVR SGX540
ATM7029A  Q1 2013  40nm  4 x Cortex-A5    1.0 GHz      296   681  2.30    Vivante GC1000
ATM7029B           40nm  4 x Cortex-A5    1.2? GHz                        PowerVR SGX540
ATM7059            28nm  4 x Cortex-A9    1.6 GHz                         PowerVR SGX544 MP

The ATM7029A from Actions is a low-end quad-core SoC that was one of the first affordable quad-core tablet processors to appear on the market, and has been sold in fair numbers in low-end tablets. However, the chip cuts corners with regard to performance in a rather unorthodox way. Actions advertised the chip as containing Cortex-A9 (later "Cortex-A9 family") CPU cores, while actually containing Cortex-A5 cores that perform about half as fast at a given clock speed (also significantly slower than Cortex-A7). Actions also modified the Android kernel to hide the actual CPU core type and also to falsely report a 1.2 GHz clock speed while the actual maximum speed is 1.0 GHz. The SoC displays very poor multi-core performance scaling for a quad-core CPU of only 2.3x for the JPEG Compress test in Geekbench, probably due to a very small and slow L2 cache.
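The scaling figure quoted here is simply the multi-core score divided by the single-core score. A minimal sketch of that arithmetic, using the ATM7029A JPEG Compress scores from the table above (the function and variable names are illustrative only):

```python
def scaling_factor(single_core_score, multi_core_score):
    """Multi-core scaling: how many times faster all cores are than one."""
    return multi_core_score / single_core_score


cores = 4                                # ATM7029A is a quad-core chip
factor = scaling_factor(296, 681)        # JPEG Compress single / multi scores
per_core_efficiency = factor / cores     # 1.0 would mean perfect scaling

print(f"Scaling factor: {factor:.2f}x")
print(f"Per-core efficiency: {per_core_efficiency:.0%}")
```

A fully utilized quad-core CPU would come out near 4.0x, so a result of about 2.3x means that, on average, each core delivers not much more than half of its single-core throughput in the multi-core test.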

The ATM7029B is an improved version of the ATM7029A that replaces the less compatible Vivante GC1000 GPU with a more proven PowerVR SGX540.

The ATM7021A is an ultra-low-end dual-core Cortex-A5 processor that arrived in the market at the end of 2013. It only supports 512MB RAM and has been sighted in ultra-cheap tablets advertised on the internet.

The ATM7039c/7039s/7059 family consists of higher performance SoC designs that incorporate a quad-core Cortex-A9 running at 1.6 GHz. The ATM7039s and ATM7059 are manufactured at 28nm so have increased power efficiency, although the aging Cortex-A9 core is much less power efficient (as well as having a much larger die area) than the Cortex-A7 used by most competitors. The chips have been in the pipeline for some time and Actions remains hopeful that they will appear on the market in 2014. However, in terms of cost efficiency the chips give the impression of following Rockchip's RK3188(T) long after the fact, at a time when such a solution has almost ceased to be competitive.

Allwinner Technology


Allwinner is a Chinese tablet chip company that for some time (2012-2013) dominated the worldwide unit volume for tablet processors with cost-effective chips like the A1x series, and has probably shipped more than 100 million units in total. More recently, the company has suffered from loss of market share due to problematic and delayed new product introductions.

Chip      Arrival  Fab    CPU              Clock speed Geekbench   Multi   GPU
                          configuration    (typical)   JPEG C.     core x

A10       Q1 2012  55nm   1x Cortex-A8     1.00 GHz     423   424  1.00    Mali-400
A13       2H 2012  55nm   1x Cortex-A8     1.00 GHz     416   418  1.00    Mali-400
A20       Q3 2013  55nm   2x Cortex-A7     1.00 GHz     384   785  1.97    Mali-400 MP2
A23       Q3 2014  40nm   2x Cortex-A7     1.20 GHz     463   922  1.99    Mali-400 MP2
A31s      2013     40nm   4x Cortex-A7     1.01 GHz     387  1571  4.06    PowerVR SGX544 MP2
A33       Q3 2014  40nm   4x Cortex-A7     1.20 GHz     466  1450* 3.11*   Mali-400 MP2
A80T      Q3 2014  28nm   4x Cortex-A15/A7 1.60 GHz     927  4020  4.34    PowerVR Series 6
A83T      Q4 2014? 28nm   8x Cortex-A7     2.0? GHz                        PowerVR
* The CPU performance of the A33 shows different CPU scaling in different entries, with some close to 4 as expected for a fully utilized quad-core CPU, while many others show a scaling factor of only about 3.1 or even as low as 2.6. Some other scores seem to correlate with the CPU scaling factor variation, with the multi-core JPEG Decompress result scaling to all CPUs when the JPEG Compress test is low. Scheduling characteristics such as thermal throttling or other factors could be involved.

The A10 was Allwinner's first successful chip targeting tablets, with its relatively high level of integration providing significant cost advantages, which catapulted Allwinner into dominance of the Chinese white-box tablet market in 2012. The A13 was a cost-reduced version of the A10 with a 16-bit external memory interface, which later caused problems as memory bandwidth requirements increased with newer Android versions and higher resolution screens. The old Cortex-A8 CPU core had relatively competitive integer performance while floating point performance was much lower than more recent designs.

The A31s (a cost-reduced version of the A31 that was released a little earlier), a quad-core Cortex-A7-based SoC with a powerful PowerVR SGX544 MP2 GPU, arrived on the market in 2013 and was more or less Allwinner's last successful product introduction. Although the 40nm process limited clock speeds due to power and heat limitations, the A31/A31s were a reasonable success in higher-end Chinese tablets and were also used by some well known brand names such as HP, although due to cost they were not suited for the really high-volume part of the Chinese white-box market. This chip has continued to be sold for a long time.

The dual-core A20 was intended as a pin-compatible successor to the successful A10 processor, which was also manufactured at 55nm as early as 2012 and widely used at the time. The A20 is notable for using Cortex-A7 cores with a trailing-edge 55nm process. Originally announced in 2012, the product suffered from serious delays and quality issues related to firmware when it arrived in the market in the second half of 2013 and was not a success, contributing to Allwinner's decline. I have personal experience with an early A20-based Android tablet which came with grossly misconfigured firmware (unstable, running at 0.7 GHz, with very slow screen refresh), which nevertheless ran a custom Linux OS without problems at 1.0 GHz, suggesting that much of the problem was very sloppy software engineering related to low-level chip initialization in the Android firmware.

The A23 is the replacement for the A20 using a more sensible 40nm process. However, it also did not come to market smoothly and the Geekbench database provides evidence that it only arrived on the market as recently as Q3 2014, being more or less immediately superseded by Allwinner's quad-core A33, which is arriving at the same time. Geekbench results provide evidence that the kernel has been modified by Allwinner to falsely report the CPU speed as 1.54 GHz, with all shipping devices actually running at an estimated maximum speed of 1.20 GHz.

The quad-core A33 is a logical extension of the A2x and was announced in June 2014 as an entry-level tablet solution, with mass production already having commenced, which is highly important for any recovery of Allwinner's market position. As of early November 2014, a few entries in the database have appeared suggesting the use of the A33 but this is not yet suggestive of a successful product introduction. The results listed seem to reflect devices based on A23 ("sun8i") firmware, and show lower than expected multi-core performance scaling of only about 2.6 - 3.1 for the Geekbench JPEG Compress benchmark (close to 4 would be expected), which could be due to limited L2 cache size or other factors, and the chip also shows a very low memory performance score. A possible explanation for the lower than expected performance is that the L2 cache (which should have the very reasonable size of 512KB according to Allwinner) is disabled due to hardware defects in earlier revisions of the A33. However, some recent entries in the Geekbench database show CPU scaling close to 4.0 (as expected) for A33-based devices, with variation for other benchmark tests such as JPEG Decompress also being observed. CPU clock speed appears to be falsely reported as 1.34 GHz, because actual single-core performance suggests a 1.20 GHz maximum clock speed for the Cortex-A7 cores. Allwinner has announced that HP (who earlier used the A31s) is using the A33 in the new HP 7 G2 and HP 8 G2 tablets, and mentioned having achieved one million unit shipments of the A33. However, the Amazon website shows no reviews for these models, suggesting that actual volume availability is still doubtful. The A33 being another failed product introduction from Allwinner cannot be ruled out at this point.

Finally, the ambitious octa-core big.LITTLE A80 SoC is Allwinner's attempt to address the high-performance market. After several delays, which saw the A80 pitched mainly at development boards and other non-tablet applications, with suggestions of power and heat issues, numerous entries for the Allwinner A80T-based Onda V989 tablet have started to appear in the Geekbench database in the last few months. The results are consistent with a Cortex-A15 clock speed of about 1.6 GHz, lower than the advertised 2.0 GHz. This is confirmed by independent research. Although the chip provides high performance relative to previous Allwinner chips, performance is still lower than previous generation, lower-power SoCs such as Qualcomm's Snapdragon 800 for smartphones. The chip also shows lower multi-core performance scaling than comparable chips from competitors such as HiSilicon's Kirin 920 for smartphones, although there is evidence that the Cortex-A7 cores are also utilized (use of Global Task Switching), and it shows low memory performance for a SoC with a dual-channel memory interface.

Intel


Intel has started targeting the tablet market in earnest only recently in 2014, using its increasingly efficient Atom processor cores and SoCs and employing a contra-revenue strategy that subsidizes tablet manufacturers that use its platform. After first gaining traction in the first half of 2014 with brand-name manufacturers such as Asus, Intel started penetrating Chinese white-box tablets in the second half of 2014, primarily due to the introduction of lower cost Atom SoCs with a 32-bit memory interface such as Z3735G/Z3736G in addition to the Z3735F/Z3736F with 64-bit memory for higher performance segments, and also helped by a general shortage of efficient tablet processors from competitors such as MediaTek due to the tight wafer capacity environment at TSMC. Because of the advanced 22nm process, Intel's SoCs provide relatively high CPU and GPU performance as well as high power efficiency. Part of the efficiency advantage stems from Intel's ability to integrate a fast and large 2MB L2 cache (Z37xx series), much larger than the L2 cache in typical cost-sensitive tablet processors.

Chip      Arrival  Fab    CPU              Clock speed Geekbench   Multi   GPU         Memory
                          configuration    (typical)   JPEG C.     core x              Interface

Z2560     Q2 2013  32nm   2x Saltwell      1.6 GHz      617  1711  2.77    SGX544 MP2  2 x 32-bit
Z2580     Q2 2013  32nm   2x Saltwell      1.6 GHz                         SGX544 MP2  2 x 32-bit
Z3735F    Q3 2014  22nm   4x Silvermont    1.33 GHz*    821  2803  3.35    Intel HD    64-bit
Z3735G    Q3 2014  22nm   4x Silvermont    1.33 GHz*    827  2773  3.42    Intel HD    32-bit
Z3736F    Q4 2014  22nm   4x Silvermont    1.33 GHz*    968  2858  2.95    Intel HD    64-bit
Z3736G             22nm   4x Silvermont    1.33 GHz*                       Intel HD    32-bit

* The chips have a so-called burst (turbo) frequency of 1.83 GHz (Z3735) or 2.16 GHz (Z3736).

Intel's Atom SoCs for mobile devices, although compatible with the x86 and x86-64 instruction sets used with PC processors, are based on CPU cores specifically designed for the mobile market and not derivatives of PC-class architectures.

The Saltwell core (which does not support x86-64) in previous generation Atom SoCs such as Z2560 and Z2580 has performance approximately equivalent to an ARM Cortex-A7 clocked at the same frequency, but the higher typical clock speed of 1.6 GHz results in higher single-core performance than typical Cortex-A7 configurations that are clocked lower. However, the dual-core CPU configuration with HyperThreading results in lower multi-core performance scaling than a typical quad-core Cortex-A7. The per-core 512K L2 cache is not really optimal for mobile applications and suggests that the architecture was not yet fully optimized for low power mobile applications, and overall the SoCs have significantly lower performance/Watt than competitive solutions that use ARM Cortex-A7 cores.
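The per-clock comparison between Saltwell and Cortex-A7 can be checked by normalizing single-core scores to clock speed. The sketch below does this with the JPEG Compress figures listed in this post for the Z2560 and, as a Cortex-A7 example, the MediaTek MT8125 from the MediaTek table. It assumes the listed typical clock speeds are the speeds at which the scores were obtained.

```python
def score_per_ghz(single_core_score, clock_ghz):
    """Normalize a single-core score to clock speed (points per GHz)."""
    return single_core_score / clock_ghz


saltwell = score_per_ghz(617, 1.6)     # Intel Z2560 (Saltwell)
cortex_a7 = score_per_ghz(472, 1.20)   # MediaTek MT8125 (Cortex-A7)
relative = saltwell / cortex_a7        # 1.0 would mean identical per-clock speed

print(f"Saltwell:  {saltwell:.0f} points/GHz")
print(f"Cortex-A7: {cortex_a7:.0f} points/GHz")
print(f"Saltwell relative to Cortex-A7: {relative:.2f}")
```

On these figures the two cores come out within a few percent of each other per GHz, so the Z2560's single-core advantage is essentially a matter of its higher clock.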

The current generation Z373x series is faster than the Z25xx with improved power efficiency and falls somewhere in the mid-range with regard to performance, since it does not reach the pure CPU speed of competing mobile SoCs targeting the performance segment (approaching the speed of less optimized Cortex-A1x designs like Allwinner A80T and RK3288, but falling short of the performance of high-end Exynos and Snapdragon 801/805 chips for tablets and smartphones).

The Silvermont-based SoCs show evidence of an optimized memory subsystem, so that the Z3735G with 32-bit memory shows memory performance comparable to Rockchip's RK3288 with a much more expensive dual-channel memory design. The CPU burst mode benefits single-core performance but means that multi-core performance does not scale as well as most ARM-based chips. The SoCs also have relatively fast GPU performance for a mobile chip, benefiting from the low power design and the large cache memory inside the chip.
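The effect of burst mode on the scaling figures can be sketched with a simple model. It assumes, purely for illustration, that the single-core test runs at the full burst frequency, that the multi-core test runs all cores at a lower common clock, and that nothing else limits scaling. None of this is confirmed by the benchmark data, so the resulting clock speeds are indicative at best.

```python
def implied_all_core_clock_ghz(burst_ghz, scaling, cores=4):
    """Clock at which all cores would have to run to explain the scaling.

    Model assumption: multi-core score = cores * (all-core clock / burst clock)
    * single-core score, with no other bottlenecks.
    """
    return burst_ghz * scaling / cores


# Burst frequencies and scaling factors taken from the table and footnote above.
z3735f = implied_all_core_clock_ghz(1.83, 3.35)
z3736f = implied_all_core_clock_ghz(2.16, 2.95)

print(f"Z3735F implied all-core clock: {z3735f:.2f} GHz")
print(f"Z3736F implied all-core clock: {z3736f:.2f} GHz")
```

Under this model both chips would settle at a broadly similar all-core clock between the 1.33 GHz base frequency and their burst frequency, which would explain why the higher-bursting Z3736F shows the lower scaling factor.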

Leadcore Technology


Leadcore is an up-and-coming Chinese designer of SoCs for smartphones that has been focusing on the TD cellular standards primarily used in China, and also offers tablet chips with integrated modem. Although still a relatively small player, its designs show evidence of good product planning with efficient, cost-effective solutions and the company has attracted the attention of Xiaomi, which is rumoured to be interested in acquiring a majority stake in the company.

Chip      Arrival  Fab    CPU              Clock speed Geekbench   Multi   GPU            Modem
                          configuration    (typical)   JPEG C.     core x

LC1913    2013?    40nm   4x Cortex-A7     1.4 GHz                         Mali-400 MP2   3G (TD)
LC1960    2014     28nm   6x Cortex-A7     2.0? GHz                        Mali-T628 MP2  4G
LC1980    2014?                                                            Mali-T720 MP6

On paper, the LC1913 appears to be a cost-effective chip for tablets with integrated 3G connectivity, being similar to MediaTek's MT8382 but on a 40nm instead of a 28nm process. I have not yet located any entries using this chip in the Geekbench database. The hexa-core LC1960, which most likely has a dual-channel external memory interface like the LC1860 for smartphones, promises to be a reasonably balanced, efficient design that provides good but low-power CPU performance while addressing performance bottlenecks with the use of a dual-channel memory interface, potentially making it suitable for higher resolution screens (but see note below about fillrate of the Mali-T628 MP2 GPU). Although the dual-channel memory increases PCB cost, the SoC has the hallmarks of being relatively low-cost and the wide memory interface may in fact contribute to increased power efficiency because of the reduction in memory transaction duration. This is one of the first chips to combine a wide memory interface with a relatively efficient CPU configuration (most existing chips with dual-channel memory tend to be high-end designs using heavy, performance-oriented CPU cores such as Cortex-A15, Krait-400 or Cortex-A57 as well as heavy GPUs).

The Mali-T628 MP2 GPU clocked at about 690 MHz inside the LC1960 provides greatly improved triangle throughput (173 Mtri/s) when compared to the Mali-400 from typical low-end SoCs, as well as OpenGL ES 3.x support. However, the MP2 configuration limits pixel throughput to 1380 MPix/s, equivalent to Mali-400 MP2 or 450 MP2 clocked at the same frequency. Since comparable GPUs used by competitors (such as Mali-450 MP4 used by MediaTek and HiSilicon and Mali-T628 MP4 and MP6 used by HiSilicon and Samsung) have at least double the number of GPU cores and thus at least twice the pixel rate at the same clock frequency, and are already relatively limited in fill-rate when compared to high-end GPUs from competitors, it remains to be seen how much of a bottleneck this will be in practice. Game performance is likely to be severely impacted at higher screen resolutions.
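The pixel throughput figures above follow from a simple product of core count and clock speed. The sketch below assumes one pixel per clock per GPU core, which is what the quoted 1380 MPix/s for two cores at 690 MHz implies; the function name is illustrative.

```python
def pixel_fill_rate_mpix(cores, clock_mhz, pixels_per_clock=1):
    """Peak pixel fill rate in MPix/s.

    Assumption: each GPU core outputs `pixels_per_clock` pixels per cycle.
    """
    return cores * clock_mhz * pixels_per_clock


mp2 = pixel_fill_rate_mpix(cores=2, clock_mhz=690)   # Mali-T628 MP2 as described
mp4 = pixel_fill_rate_mpix(cores=4, clock_mhz=690)   # four cores, same clock

print(f"MP2 at 690 MHz: {mp2} MPix/s")
print(f"MP4 at 690 MHz: {mp4} MPix/s")
```

These are theoretical peak figures; actual throughput also depends on memory bandwidth and the workload.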

MediaTek


MediaTek is a Taiwanese company with a relatively long history of activity and success as a chip platform provider for the Chinese mobile phone market. MediaTek also has a long history targeting segments such as digital TVs and set-top boxes, DVD players and several other segments, and has generally been successful in those segments. In the past few years, MediaTek has had a large share of the SoC market for smartphones among Chinese manufacturers and other cost-sensitive manufacturers with cost-effective, power efficient, highly integrated SoCs. MediaTek was the company that spearheaded the emergence of a multi-core ARM Cortex-A7 configuration manufactured at 28nm as a very efficient, low cost and adequately performing CPU solution for smartphones ranging from entry-level to mid-range. Since 2013, MediaTek has also been successful in the tablet chip market, with both modemless application processors targeting WiFi-only tablets and chips with integrated modem.

Chip      Arrival  Fab    CPU              Clock speed Geekbench   Multi   GPU                Modem
                          configuration    (typical)   JPEG C.     core x

MT8125    H1 2013  28nm   4x Cortex-A7     1.20 GHz     472  1893  4.01    PowerVR SGX544 MP  -
MT8121    Q2 2014  28nm   4x Cortex-A7     1.30 GHz     505  2002  3.96    PowerVR SGX544 MP  -
MT8127    Q3 2014  28nm   4x Cortex-A7     1.30 GHz     508  2023  3.98    Mali-450 MP4       -
MT8135V   Q3 2014  28nm   2x Cortex-A15/A7 1.50 GHz     896  1884  2.10    PowerVR Series 6   -

MT8389    2H 2013  28nm   4x Cortex-A7     1.21 GHz     469  1894  4.04    PowerVR SGX544 MP  3G
MT8312    Q4 2013  28nm   2x Cortex-A7     1.30 GHz     505  1011  2.00    Mali-400 MP        3G
MT8382    Q1 2014  28nm   4x Cortex-A7     1.30 GHz     505  2013  3.99    Mali-400 MP2       3G
MT8392    2014     28nm   8x Cortex-A7     1.66 GHz     644  4745  7.79    Mali-450 MP4       3G
MT8732    Q4 2014? 28nm   4x Cortex-A53    1.5? GHz                        Mali-T760 MP2      4G
MT8752    Q4 2014? 28nm   8x Cortex-A53    1.69 GHz     952* 5046* 5.30*   Mali-T760 MP2      4G
* The CPU performance of the MT8752 as reported for the CUBE T7 and for the equivalent MT6752 for smartphones shows different CPU scaling in different entries, with some around 7.7 as expected for a fully utilized octa-core CPU, while others show a scaling factor of about 5.3. It is notable that the PNG Decompress test shows CPU scaling close to 8 when JPEG Compress scaling is 5.3, while PNG Decompress scaling is a little above 5 when JPEG Compress scaling is close to 8. This could be the result of scheduling algorithm differences, or something else related to Geekbench, since similar behaviour with regard to JPEG Compress benchmark variation is also noticeable for recent entries for other chips like the Allwinner A33.

MediaTek's MT8125 was its first really successful tablet chip, providing high power efficiency and good performance. Performance and efficiency benefit from four low-power Cortex-A7 cores, a relatively large 1MB L2 cache, and a PowerVR SGX544MP GPU. The chip was prominently adopted by the Asus MemoPad 7 HD and other brand-name tablets.

The MT8121 is a lower-cost, more highly integrated version of the MT8125 that does not appear to have been widely used outside of a few Lenovo tablet models. The MT8127 is a relatively fast and cost-efficient tablet processor within the bounds of a single-channel memory interface, with the Mali-450 MP4 GPU providing relatively good game performance as long as the resolution is not too high. Both processors appear to have been affected by the shortage of wafer supply for MediaTek in mid-2014, with some production capacity most likely prioritized for the MT8135V used in new Amazon Kindle tablets, as well as higher-margin tablet processors with integrated modem.

The MT8135V is a variant of the high-end MT8135 tablet processor that was announced in mid-2013 but has failed to materially appear on the market. The MT8135V appears to be a custom design for new Amazon Kindle tablets that are positioned at the entry-level segment of the US retail market, probably as the result of a long-standing agreement. However, the MT8135V shares much of the MT8135's higher-cost design features making it seem rather unsuitable for entry-level tablets with a small form factor, although the memory interface has been halved from double to single-channel. Power efficiency is also likely to be a problem. It is ironic that use of the MT8127, although having lower single-core performance, would probably easily have fit the bill for the Kindle tablets with significant advantages for cost and power consumption.

MediaTek has been one of the first companies to offer cost-effective solutions for tablets with integrated 3G cellular data or voice connectivity, mostly based on comparable smartphone products, and has for some time dominated that market. The previous-generation MT8389(T) corresponded to the MT6589(T) for smartphones, while the dual-core MT8312 and quad-core MT8382 are the equivalent of the MT6572 and MT6582. The MT8392 matches the MT6592 octa-core smartphone processor. Tablet manufacturers also commonly utilize MediaTek's smartphone chips directly. Chips such as the MT8312/MT6572 and MT8382/MT6582 have a relatively optimized CPU architecture, with no unexpected bottlenecks, providing good performance for their cost segment.

The upcoming MT8732 (quad-core) and MT8752 (octa-core) are Cortex-A53-based tablet SoCs with integrated 4G modem that correspond to similar upcoming chips for smartphones (MT6732 and MT6752). The use of a many-core Cortex-A53 configuration promises to significantly raise performance for low-power SoCs and is likely to be able to address several segments including the high-performance segment, while greatly reducing cost. There are signs that the MT8732, because of the relatively large die area associated with the Mali-T760 MP2 GPU core, will not be cost-effective enough for entry-level segments and will be superseded by a chip (equivalent to MT6735 for smartphones) that has a more economical but lower-performance Mali-T720 GPU.

NVIDIA


NVIDIA, with a long history as a leader in PC, console and laptop GPUs, has recently increased its focus on the tablet market and more or less given up on its long-term goal of penetrating the high-volume smartphone market with integrated SoCs. NVIDIA has been designing its Tegra processors for tablets for quite some time, but has seen mixed success and has ultimately not been successful in the high-volume mainstream tablet market. It has gained a few high-profile design wins for high-end devices, most recently for the HTC Nexus 9.

Chip              Arrival  Fab   CPU                 Clock speed  Geekbench   Multi  GPU
                                 configuration       (typical)    JPEG C.     core x
Tegra 250 T20     Q1 2010  40nm  2x Cortex-A9        1.0 GHz                         GeForce ULP
Tegra 3 T30       Q4 2011  40nm  4x + 1x Cortex-A9   1.4 GHz       605  2238  3.70   GeForce ULP
Tegra 4 T114      Q2 2013  28nm  4x + 1x Cortex-A15  1.8 GHz       938  3850  4.10   GeForce ULP
NVIDIA K1         Q1 2014  28nm  4x + 1x Cortex-A15  2.2 GHz      1296  5359  4.14   Kepler DX1
NVIDIA K1 (ARMv8) Q3 2014  28nm  2x NVIDIA Denver    2.5 GHz      2002  3941  1.97   Kepler DX1

NVIDIA's Tegra and Tegra 2 processors saw fairly widespread adoption in the early days of the tablet market. Tegra 2 had some architectural deficiencies that made it less competitive; for example, it did not have an up-to-date video decoder, and lacked ARM's almost standard NEON SIMD extension. NVIDIA was not able to sustain its market share momentum as the market became increasingly dominated by Chinese white-box tablets as well as brand names such as Apple and Samsung.

NVIDIA has developed its own ARMv8-compatible CPU core, Denver, which is a large core with very high single-core performance, and which has been implemented in the ARMv8 version of the NVIDIA K1 processor in a dual-core configuration. The chip provides leading single-core performance, but multi-core performance is less than even upcoming mid-range solutions. The GPU performance of both K1 processors is industry-leading.

Rockchip


Chinese company Rockchip, which has a history as a supplier of MP3/MP4 video players, held a strong position in the very early tablet market before Allwinner displaced it with its A10 chip in 2012. Rockchip subsequently regained traction with relatively high-performing chips including the RK3066 and RK3188, and later expanded its product offering for low-end segments. Although Rockchip has led the tablet processor market in 2014 in terms of volume, it has continued to use Cortex-A9 cores for most of its products which are considerably less efficient in terms of chip cost (die area) and power efficiency when compared to the Cortex-A7 cores used by competitors.

Chip      Arrival  Fab    CPU             Clock speed  Geekbench   Multi    GPU
                          configuration   (typical)    JPEG C.     core x
RK2926/28 2013     55nm   1x Cortex-A9     1.01 GHz     430   430  1.00     Mali-400 MP
RK3066    Q3 2012  40nm   2x Cortex-A9     1.61 GHz     696  1202  1.73     Mali-400 MP4
RK3188    Q2 2013  28nm   4x Cortex-A9     1.61 GHz     699* 2604* 3.73     Mali-400 MP4
RK3188T   Q3 2013  28nm   4x Cortex-A9     1.42 GHz     617  2441  3.96     Mali-400 MP4
RK3026/28 1H 2014  40nm   2x Cortex-A9     1.01 GHz     443   885  2.00     Mali-400 MP2
RK3168    Q2 2014  28nm   2x Cortex-A9     1.5 GHz                          PowerVR SGX540
RK3288    Q3 2014  28nm   4x Cortex-A12    1.8 GHz      980  3873  3.95     Mali-T760 MP4
RK3126/28 Q4 2014  40nm   4x Cortex-A7     1.3 GHz                          Mali-400 MP2
"MayBach"          28nm   8x Cortex-A53                                     OpenGL ES 3.0-class

* RK3188-based devices running at 1.6 GHz (probably reflecting the use of the original RK3188
  rather than the cost-reduced RK3188T) show a relatively high amount of variation in benchmark
  scores between devices and runs, probably reflecting thermal throttling or other scheduler
  characteristics.

The RK3066 was a relatively high-performance chip at the time of its introduction (second half of 2012), and was successful in the mid-range of the white-box tablet market, as well as gaining design wins with companies like HP. The relatively high clock frequency Cortex-A9 cores on a 40nm process, as well as the Mali-400 MP4 GPU, constrained its power efficiency.

The RK3188 (in practice more often the lower-clocked RK3188T in a cost-reduced package) was introduced as the logical successor to the RK3066, addressing the higher-performance part of the white-box tablet market as well as being adopted in brand name models from Asus and others. Although Cortex-A9 cores are not very power-efficient, efficiency is improved by the use of a relatively advanced 28nm HKMG process at GlobalFoundries. Rockchip has benefitted from the fact that it was one of the few companies with plentiful wafer supply in 2014, being one of the few customers of GlobalFoundries while many of its competitors faced a very tight capacity environment at TSMC and to a lesser extent other foundries. In 2014, the RK3188T has been observed not only in more performance-oriented tablets, but also in significant numbers in cheaper tablets with relatively low-cost and low-quality components outside of the processor, where it seems out of place. This scenario probably unfolded because of shortages of tablet processors due to the tight foundry capacity environment outside of GlobalFoundries, while GlobalFoundries may have offered low prices for wafers in the face of excess capacity.

The RK3168 was announced in 2013 as a power-efficient dual-core processor, but only arrived in Q2 2014 with relatively limited adoption amid signs that its power efficiency leaves something to be desired.

The dual-core Cortex-A9 RK3026 and RK3028 appeared in numerous low-end tablets in 2014, while the pin-compatible RK3126 and RK3128, which are due to appear in Q4 2014, will finally see Rockchip transition away from the relatively inefficient Cortex-A9 to the more efficient (in terms of cost and power consumption) Cortex-A7.

Finally, the RK3288 is an ambitious high-end processor utilizing four Cortex-A17 (technically Cortex-A12) cores, also manufactured at GlobalFoundries. The RK3288 was delayed and for some time pitched to manufacturers of media boxes and other devices amid indications that work-arounds were required to circumvent hardware issues related to the chip. Reports suggest power consumption and heat production can be problematic. The RK3288 has recently appeared in the Geekbench database in several entries for the Teclast P90HD tablet. Results show performance roughly comparable with Allwinner's A80, with memory performance lower than the A80 and significantly lower than other competitors' chips that also use a 64-bit or dual-channel memory interface, including smartphone platforms. One TV box result shows more acceptable memory performance, probably as the result of a faster DRAM frequency, although still falling short of the performance of smartphone platforms like Exynos 5430 and Snapdragon 801. A relatively steep fall-off in game performance at higher resolutions can be explained by a memory bandwidth bottleneck imposed by the less-than-optimal memory controller. When not constrained by memory bandwidth, the Mali-T764 GPU provides excellent game performance, although the exact nature of the Mali-T764 GPU (a model number not used by ARM) remains in doubt.

ARM has announced that the latest version of the Cortex-A12 core is equivalent in performance to Cortex-A17 and that the name Cortex-A12 will therefore be retired. Nevertheless, a comparison of Geekbench results for the Cortex-A12-based RK3288 with the real Cortex-A17-based MT6595 shows a not insignificant difference in pure CPU performance: when corrected for clock frequency, Cortex-A17 is about 13% faster, with Cortex-A15 in the middle. This suggests RK3288 does not use the latest version of Cortex-A12 to which ARM referred when making the performance comparison to Cortex-A17.

Qualcomm


Qualcomm has dominated the entire higher-end part of the smartphone SoC market in recent years, largely by leveraging its patent royalty schemes, which are based on the total selling price of a device and enable Qualcomm to coerce most well-known device manufacturers to use Snapdragon chips for a large proportion of their line-up. More recently, Qualcomm has started targeting the tablet space. Clearly, its integrated 3G/4G modem technology and patent royalty leverage give it opportunities to penetrate 3G/4G-enabled tablets, but Qualcomm has also been targeting WiFi-only tablets for which it does not have direct patent royalty leverage.

Chip        Arrival  Fab   CPU            Clock speed  Geekbench JPEG Compress  Multi-core  GPU         Modem
                           configuration  (typical)    single      multi        scaling
APQ8064     2013     28nm  4x Krait 300   2.0 GHz      1035        4207         3.22x       Adreno 320  -
MSM8026     2014     28nm  4x Cortex-A7   1.2 GHz                                           Adreno 305  -
MSM8074     2014     28nm  4x Krait 400   2.36 GHz                                          Adreno 330  -

MSM8226     2013     28nm  4x Cortex-A7   1.19 GHz     461         1791         3.85x       Adreno 305  3G
MSM8926     2014     28nm  4x Cortex-A7   1.19 GHz     466         1883         4.04x       Adreno 305  4G
MSM8974-AC  2014     28nm  4x Krait 400   2.45 GHz     1273        4969         3.90x       Adreno 330  4G
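
The clock-corrected comparison used for RK3288 and MT6595 above can be applied to rows of this table as well. What follows is a minimal Python sketch, not the procedure actually used to prepare these tables: it divides the single-core JPEG Compress score by the typical clock speed listed in the table. The function names (score_per_ghz, per_clock_advantage) are illustrative, and the sketch assumes that the benchmarked devices really ran at the listed typical clock, which may not hold for every Geekbench entry.

```python
# Single-core Geekbench JPEG Compress score and typical clock (GHz),
# copied from the Qualcomm table above.
CHIPS = {
    "MSM8226": (461, 1.19),      # 4x Cortex-A7
    "MSM8926": (466, 1.19),      # 4x Cortex-A7
    "MSM8974-AC": (1273, 2.45),  # 4x Krait 400
}


def score_per_ghz(score, clock_ghz):
    """Return the benchmark score normalized to a 1 GHz clock."""
    if clock_ghz <= 0:
        raise ValueError("clock_ghz must be positive")
    return score / clock_ghz


def per_clock_advantage(chip_a, chip_b):
    """Fractional per-clock advantage of chip_a over chip_b (0.10 means 10%)."""
    return score_per_ghz(*CHIPS[chip_a]) / score_per_ghz(*CHIPS[chip_b]) - 1.0


if __name__ == "__main__":
    for name, (score, clock) in CHIPS.items():
        print(f"{name:<11} {score_per_ghz(score, clock):6.1f} points per GHz")
    advantage = per_clock_advantage("MSM8974-AC", "MSM8226")
    print(f"Krait 400 vs. Cortex-A7, per clock: {advantage:+.0%}")
```

With these inputs, the Krait 400 cores in MSM8974-AC come out roughly a third faster per clock than the Cortex-A7 cores in MSM8226 in this particular test.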

Qualcomm's modemless applications processors for WiFi-only tablets are generally variants of smartphone SoCs that do have an integrated baseband. Snapdragon platforms that have modemless counterparts include Snapdragon 400, 600 and 801, while Snapdragon 805 is also technically a modemless processor that might be applicable to WiFi-only tablets.

For tablets with integrated 3G or 4G, Qualcomm uses smartphone chips from the Snapdragon 400 and Snapdragon 800 series. The Cortex-A7-based versions of Snapdragon 400 are power-efficient SoCs with a reasonably fast GPU, comparable in performance to MediaTek's offerings. Qualcomm has been leading the integration of 4G modems into SoCs and dominates that part of the smartphone market, an advantage it can also apply to 4G-enabled tablets.

The Snapdragon 800 series has long been the performance leader in the high-end smartphone SoC market outside of Apple, dominating high-end smartphones. This product line is also being used in some tablets from brand-name manufacturers such as Samsung. The Snapdragon 800 series is characterized by relatively high CPU performance, reasonable power efficiency, wide memory interfaces with high bandwidth, and a high-end mobile GPU able to drive high resolutions. From a chip cost standpoint, the series is expensive to produce because of a relatively large die area, but this affects Qualcomm only slightly because the virtual monopoly it has gained from the leverage of its patent royalty schemes allows it to maintain high margins.

Samsung


Samsung has a fairly long history of developing Exynos SoCs for devices such as smartphones and tablets. A few years ago, when the baseband/modem was generally not yet integrated with the applications processor in performance-oriented smartphones, Samsung used a significant number of Exynos applications processors in international versions of its flagship smartphones such as the Galaxy S II. Later, although Samsung prominently announced the use of new high-performance Exynos chips in new flagship smartphones, actual shipments were overwhelmingly dominated by Qualcomm Snapdragon-based variants of the same model. Only recently, in 2014, has Samsung again started to use more of its own Exynos chips (including Exynos 3470, Exynos 5430 and Exynos 5433/Exynos 7 Octa) in new smartphones. Samsung also uses Exynos SoCs in tablets, primarily WiFi-only models.

Chip         Arrival  Fab   CPU                    Clock speed  Geekbench JPEG Compress  Multi-core  GPU            Memory bus  Modem
                            configuration          (typical)    single      multi        scaling                    (bits)

Exynos 4210  2011     45nm  2x Cortex-A9           1.2 GHz                                           Mali-400 MP4   2 x 32      -
Exynos 4212  2011     32nm  2x Cortex-A9           1.2 GHz                                           Mali-400 MP4   2 x 32      -
Exynos 4412  2012     32nm  4x Cortex-A9           1.6 GHz      486         1290         2.65        Mali-400 MP4   2 x 32      -
Exynos 5250  2012     32nm  2x Cortex-A15          1.7 GHz                                           Mali-T604 MP4  2 x 32      -
Exynos 5420  2013     28nm  4x Cortex-A15/A7       1.9 GHz      1212        4337         3.58        Mali-T628 MP6  2 x 32      -
Exynos 5260  Q2 2014  28nm  2x + 4x Cortex-A15/A7  1.7 GHz                                           Mali-T624      2 x 32      -
Exynos 5422  Q2 2014  28nm  4x Cortex-A15/A7       1.9 GHz                                           Mali-T628 MP6  2 x 32      -
Exynos 3470  2014     28nm  4x Cortex-A7           1.4 GHz                                           Mali-400 MP4   32          4G
Exynos 5430  Q3 2014  20nm  4x Cortex-A15/A7       1.8 GHz      1053        4910         4.66        Mali-T628 MP6  2 x 32      -
Exynos 5433  Q3 2014  20nm  4x Cortex-A57/A53      1.4-1.9 GHz  1376        6130         4.45        Mali-T760 MP6  2 x 32      -
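
The multi-core scaling column in these tables appears to be simply the multi-core JPEG Compress score divided by the single-core score. A minimal Python sketch under that assumption, using the four rows of the table above that list both scores (the function name scaling_factor is illustrative):

```python
# (chip, single-core score, multi-core score) for Geekbench JPEG Compress,
# copied from the Samsung table above.
ROWS = [
    ("Exynos 4412", 486, 1290),
    ("Exynos 5420", 1212, 4337),
    ("Exynos 5430", 1053, 4910),
    ("Exynos 5433", 1376, 6130),
]


def scaling_factor(single, multi):
    """Return how many times faster the multi-core run is than the single-core run."""
    if single <= 0:
        raise ValueError("single-core score must be positive")
    return multi / single


if __name__ == "__main__":
    for chip, single, multi in ROWS:
        print(f"{chip:<12} {scaling_factor(single, multi):.2f}")
```

A factor above 4 for Exynos 5430 and Exynos 5433 would be consistent with the Cortex-A7 or Cortex-A53 cores contributing alongside the four big cores, although the benchmark data by itself does not prove this.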

Some Exynos SoCs, including Exynos 4412 and Exynos 5420, have been sold to parties outside of Samsung such as Chinese tablet manufacturers.

The use of the relatively power-hungry ARM Cortex-A15 core has made it a challenge for Samsung to preserve power efficiency, generally limiting the use of these Exynos processors to tablets. Samsung's implementation of big.LITTLE has become more optimized over time, progressing to full Global Task Scheduling and gaining improvements in power efficiency. Power use is also helped by newer versions of the Cortex-A15 core, process improvements (e.g. 20nm), and a reduced maximum clock rate for the Cortex-A15 cores (which was sometimes set unreasonably high for marketing purposes, at the cost of practical drawbacks such as shorter battery life).

Sources: Geekbench browser

Initial version (November 7, 2014): Geekbench CPU benchmark results still have to be filled in for most SoCs.
Updated (November 8, 2014): Add Atom Z2560, MT812x benchmarks, correct description of MT8121.
Updated (November 9, 2014): Improve Intel section.
Updated (November 13, 2014): Provide more CPU benchmark scores, some other improvements.
Updated (November 18, 2014): Provide CPU benchmarks for Qualcomm and Samsung chips.
Updated (November 27, 2014): Improve Samsung section, add CPU benchmarks, fix RK3288 CPU configuration, add MT8752 CPU benchmarks, comment on variation in JPEG Compress CPU scaling scores for MT8752 and Allwinner A33.
Updated (November 30, 2014): Add note about MT8121.
Updated (December 5, 2014): Add NVIDIA section, other tweaks.