Mobile semiconductors blog

Friday, November 7, 2014

Analysis of tablet processors by chip company, with a focus on Geekbench CPU performance

The Geekbench browser, which includes hundreds of thousands of mobile benchmark results, provides access to a wealth of information about the CPU and memory performance of smartphone and tablet SoCs. Because certain subtests within Geekbench results (such as the single-core JPEG Compress test) correlate well with CPU clock speed for a given CPU core, it is possible to determine the actual maximum clock speed of the CPU, which sometimes does not correspond to the advertised clock speed or even the clock speed reported by the operating system.
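The clock-speed estimate described above can be sketched as a simple proportion. This is a minimal illustration rather than the exact procedure used for this article: it assumes that the single-core JPEG Compress score scales linearly with clock speed for a given core type, and it takes the MT8382 row (Cortex-A7, 1.30 GHz, score 505) from the MediaTek table further down as the reference point, chosen here purely for illustration. The function name is hypothetical.

```python
def estimate_clock_ghz(score, ref_score, ref_clock_ghz):
    """Estimate a clock speed by scaling a reference clock by the score ratio.

    Assumes the single-core JPEG Compress score is proportional to clock
    speed for the same CPU core type (an assumption made for this sketch).
    """
    if ref_score <= 0:
        raise ValueError("reference score must be positive")
    return ref_clock_ghz * score / ref_score


# Reference: MT8382 (Cortex-A7), 1.30 GHz, single-core JPEG Compress 505.
# Chip under test: Allwinner A23 (Cortex-A7), single-core score 463.
a23_clock = estimate_clock_ghz(463, ref_score=505, ref_clock_ghz=1.30)
print(f"Estimated A23 clock: {a23_clock:.2f} GHz")
```

With these inputs the estimate comes out at about 1.19 GHz, close to the 1.20 GHz listed for the A23 and well below the 1.54 GHz its firmware reports.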

By assessing the number of entries for a specific chip or model, the database also provides an indication of the unit volume and popularity of specific chips and models. The approximate arrival on the end market of specific chips can also be estimated.

In this post, I am analysing the Android ARM and x86-based tablet processor market of the last two years or so, from the low end (mostly chips used in Chinese white-box tablets) to high-end devices from well-known brand names, with a focus on CPU performance and other information that can be found by studying the Geekbench results database. The article takes on tablet SoC chip companies in alphabetical order, one by one.

Although the article specifically focuses on tablet chips, there is some overlap with smartphone chips since many players in the smartphone chip space also compete in tablets with solutions that are generally similar to their smartphone chip solutions. HiSilicon, the chip division of Huawei, is becoming more prominent for smartphone SoCs but has been omitted because it does not really target tablets. A similar argument applies to the Chinese low-end smartphone chip designer Spreadtrum. These companies may be covered in a future update or in an article focusing on smartphone chips.

Actions Semiconductor

Actions is a Chinese chip company with a long prior history in the MP3 player chip market, which has operated at the bottom level of the white-box tablet market in the last few years.

Chip      Arrival  Fab    CPU             Clock speed  Geekbench  Multi   GPU
                          configuration   (typical)    JPEG C.    core x

ATM7021   Q4 2013  40nm  2 x Cortex-A5    1.3? GHz                        PowerVR SGX540
ATM7029A  Q1 2013  40nm  4 x Cortex-A5    1.0 GHz      296   681  2.30    Vivante GC1000
ATM7029B           40nm  4 x Cortex-A5    1.2? GHz                        PowerVR SGX540
ATM7059            28nm  4 x Cortex-A9    1.6 GHz                         PowerVR SGX544 MP

The ATM7029A from Actions is a low-end quad-core SoC that was one of the first affordable quad-core tablet processors to appear on the market, and has been sold in fair numbers in low-end tablets. However, the chip cuts corners with regard to performance in a rather unorthodox way. Actions advertised the chip as containing Cortex-A9 (later "Cortex-A9 family") CPU cores, while it actually contains Cortex-A5 cores that perform about half as fast at a given clock speed (and are also significantly slower than Cortex-A7). Actions also modified the Android kernel to hide the actual CPU core type and to falsely report a 1.2 GHz clock speed, while the actual maximum speed is 1.0 GHz. The SoC displays very poor multi-core performance scaling for a quad-core CPU, only 2.3x for the JPEG Compress test in Geekbench, probably due to a very small and slow L2 cache.
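The multi-core scaling factor used throughout this article is simply the multi-core score divided by the single-core score. The short sketch below shows that calculation for the ATM7029A figures from the table above, plus the scaling expressed as a fraction of the ideal value (one per core). The function names are hypothetical, and the "fraction of ideal" measure is an illustrative addition rather than a figure taken from Geekbench.

```python
def multicore_scaling(single_score, multi_score):
    """Return the multi-core scaling factor (multi-core / single-core score)."""
    return multi_score / single_score


def scaling_efficiency(single_score, multi_score, cores):
    """Return the scaling factor as a fraction of the ideal (one per core)."""
    return multicore_scaling(single_score, multi_score) / cores


# ATM7029A: JPEG Compress single-core 296, multi-core 681, four cores.
factor = multicore_scaling(296, 681)
efficiency = scaling_efficiency(296, 681, cores=4)
print(f"scaling {factor:.2f}x, {efficiency:.0%} of ideal")
```

For the ATM7029A this gives a factor of 2.30, or roughly 58% of what four fully utilized cores would deliver.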

The ATM7029B is an improved version of the ATM7029A that replaces the less compatible Vivante GC1000 GPU with a more proven PowerVR SGX540.

The ATM7021A is an ultra-low-end dual-core Cortex-A5 processor that arrived in the market at the end of 2013. It only supports 512MB RAM and has been sighted in ultra-cheap tablets advertised on the internet.

The ATM7039c/7039s/7059 family consists of higher performance SoC designs that incorporate a quad-core Cortex-A9 running at 1.6 GHz. The ATM7039s and ATM7059 are manufactured at 28nm and so have increased power efficiency, although the aging Cortex-A9 core is much less power efficient (as well as having a much larger die area) than the Cortex-A7 used by most competitors. The chips have been in the pipeline for some time and Actions remains hopeful that they will appear on the market in 2014. However, in terms of cost efficiency the chips give the impression of following Rockchip's RK3188(T) long after the fact, at a time when such a solution has almost ceased to be competitive.

Allwinner Technology

Allwinner is a Chinese tablet chip company that for some time (2012-2013) dominated the worldwide unit volume for tablet processors with cost-effective chips like the A1x series, and has probably shipped more than 100 million units in total. More recently, the company has suffered from loss of market share due to problematic and delayed new product introductions.

Chip      Arrival  Fab    CPU              Clock speed Geekbench   Multi   GPU
                          configuration    (typical)   JPEG C.     core x

A10       Q1 2012  55nm   1x Cortex-A8     1.00 GHz     423   424  1.00    Mali-400
A13       2H 2012  55nm   1x Cortex-A8     1.00 GHz     416   418  1.00    Mali-400
A20       Q3 2013  55nm   2x Cortex-A7     1.00 GHz     384   785  1.97    Mali-400 MP2
A23       Q3 2014  40nm   2x Cortex-A7     1.20 GHz     463   922  1.99    Mali-400 MP2
A31s      2013     40nm   4x Cortex-A7     1.01 GHz     387  1571  4.06    PowerVR SGX544 MP2
A33       Q3 2014  40nm   4x Cortex-A7     1.20 GHz     466  1450* 3.11*   Mali-400 MP2
A80T      Q3 2014  28nm   4x Cortex-A15/A7 1.60 GHz     927  4020  4.34    PowerVR Series 6
A83T      Q4 2014? 28nm   8x Cortex-A7     2.0? GHz                        PowerVR

* The CPU performance of the A33 shows different CPU scaling in different entries, with some close to 4 as expected for a fully utilized quad-core CPU, while many others show a scaling factor of only about 3.1 or even as low as 2.6. Some other scores seem to correlate with the CPU scaling factor variation, with the multi-core JPEG Decompress result scaling to all CPUs when the JPEG Compress test is low. Scheduling characteristics such as thermal throttling or other factors could be involved.

The A10 was Allwinner's first successful chip targeting tablets, with its relatively high level of integration providing significant cost advantages, which catapulted Allwinner into dominance of the Chinese white-box tablet market in 2012. The A13 was a cost-reduced version of the A10 with a 16-bit external memory interface, which later caused problems as memory bandwidth requirements increased with newer Android versions and higher resolution screens. The old Cortex-A8 CPU core had relatively competitive integer performance while floating point performance was much lower than more recent designs.

The A31s (a cost-reduced version of the A31, which was released a little earlier), a quad-core Cortex-A7-based SoC with a powerful PowerVR SGX544 MP2 GPU, arrived on the market in 2013 and was more or less Allwinner's last successful product introduction. Although the 40nm process limited clock speeds due to power and heat limitations, the A31/A31s were a reasonable success in higher-end Chinese tablets and were also used by some well-known brand names such as HP. Due to cost, however, they were not suited to the really high-volume part of the Chinese white-box market. This chip has continued to be sold for a long time.

The dual-core A20 was intended as a pin-compatible successor to the successful A10 processor, which was also manufactured at 55nm as early as 2012 and widely used at the time. The A20 is notable for using Cortex-A7 cores with a trailing-edge 55nm process. Originally announced in 2012, the product suffered from serious delays and quality issues related to firmware when it arrived in the market in the second half of 2013 and was not a success, contributing to Allwinner's decline. I have personal experience with an early A20-based Android tablet which came with grossly misconfigured firmware (unstable, running at 0.7 GHz, with very slow screen refresh), which nevertheless ran a custom Linux OS without problems at 1.0 GHz, suggesting that much of the problem was very sloppy software engineering related to low-level chip initialization in the Android firmware.

The A23 is the replacement for the A20 using a more sensible 40nm process. However, it also did not come to market smoothly, and the Geekbench database provides evidence that it only arrived on the market as recently as Q3 2014, being more or less immediately superseded by Allwinner's quad-core A33, which is arriving at the same time. Geekbench results provide evidence that the kernel has been modified by Allwinner to falsely report the CPU speed as 1.54 GHz, with all shipping devices actually running at an estimated maximum speed of 1.20 GHz.

The quad-core A33 is a logical extension of the A2x and was announced in June 2014 as an entry-level tablet solution, with mass production already having commenced. It is highly important for any recovery of Allwinner's market position. As of early November 2014, a few entries suggesting the use of the A33 have appeared in the database, but this is not yet suggestive of a successful product introduction. The results listed seem to reflect devices based on A23 ("sun8i") firmware, and show lower than expected multi-core performance scaling of only about 2.6 - 3.1 for the Geekbench JPEG Compress benchmark (close to 4 would be expected), which could be due to limited L2 cache size or other factors; the chip also shows a very low memory performance score. A possible explanation for the lower than expected performance is that the L2 cache (which should have the very reasonable size of 512KB according to Allwinner) is disabled due to hardware defects in earlier revisions of the A33. However, some recent entries in the Geekbench database show CPU scaling close to 4.0 (as expected) for A33-based devices, with variation for other benchmark tests such as JPEG Decompress also being observed. CPU clock speed appears to be falsely reported as 1.34 GHz, since actual single-core performance suggests a 1.20 GHz maximum clock speed for the Cortex-A7 cores. Allwinner has announced that HP (which earlier used the A31s) is using the A33 in the new HP 7 G2 and HP 8 G2 tablets, and has mentioned achieving shipments of one million A33 units. However, the Amazon website shows no reviews for these models, suggesting that actual volume availability is still doubtful. It cannot be ruled out at this point that the A33 is another failed product introduction from Allwinner.
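To put the misreported clock speeds into perspective, the overstatement can be expressed as a percentage of the estimated actual speed. The sketch below does this for the reported and estimated figures quoted in this article for the Actions ATM7029A and the Allwinner A23 and A33. The function name is hypothetical, and the "estimated" values are the Geekbench-derived estimates from the text, not measured frequencies.

```python
def overstatement_percent(reported_ghz, estimated_ghz):
    """Return how far the reported clock exceeds the estimated clock, in percent."""
    return (reported_ghz - estimated_ghz) / estimated_ghz * 100.0


# (reported GHz, estimated actual GHz) as quoted in this article.
clocks = {
    "ATM7029A": (1.2, 1.0),
    "A23": (1.54, 1.20),
    "A33": (1.34, 1.20),
}

for chip, (reported, estimated) in clocks.items():
    pct = overstatement_percent(reported, estimated)
    print(f"{chip}: reported {reported:.2f} GHz, estimated {estimated:.2f} GHz, "
          f"overstated by {pct:.0f}%")
```

On these figures the reported speed is about 20% too high for the ATM7029A, 28% for the A23 and 12% for the A33.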

Finally, the ambitious octa-core big.LITTLE A80 SoC is Allwinner's attempt to address the high-performance market. After several delays, which saw the A80 pitched mainly at development boards and other non-tablet applications, with suggestions of power and heat issues, numerous entries for the Allwinner A80T-based Onda V989 tablet have started to appear in the Geekbench database in the last few months. The results are consistent with a Cortex-A15 clock speed of about 1.6 GHz, lower than the advertised 2.0 GHz. This is confirmed by independent research. Although the chip provides high performance relative to previous Allwinner chips, performance is still lower than previous generation, lower-power SoCs such as Qualcomm's Snapdragon 800 for smartphones. The chip also shows lower multi-core performance scaling than comparable chips from competitors such as HiSilicon's Kirin 920 for smartphones, although there is evidence that the Cortex-A7 cores are also utilized (use of Global Task Switching). It also shows low memory performance for a SoC with a dual-channel memory interface.

Intel

Intel has started targeting the tablet market in earnest only recently, in 2014, using its increasingly efficient Atom processor cores and SoCs and employing a contra-revenue strategy that subsidizes tablet manufacturers that use its platform. After first gaining traction in the first half of 2014 with brand-name manufacturers such as Asus, Intel started penetrating Chinese white-box tablets in the second half of 2014, primarily due to the introduction of lower-cost Atom SoCs with a 32-bit memory interface such as the Z3735G/Z3736G, in addition to the Z3735F/Z3736F with 64-bit memory for higher performance segments. It was also helped by a general shortage of efficient tablet processors from competitors such as MediaTek due to the tight wafer capacity environment at TSMC. Because of the advanced 22nm process, Intel's SoCs provide relatively high CPU and GPU performance as well as high power efficiency. Part of the efficiency advantage stems from Intel's ability to integrate a fast and large 2MB L2 cache (Z37xx series), much larger than the L2 cache in typical cost-sensitive tablet processors.

Chip      Arrival  Fab    CPU              Clock speed Geekbench   Multi   GPU         Memory
                          configuration    (typical)   JPEG C.     core x              Interface

Z2560     Q2 2013  32nm   2x Saltwell      1.6 GHz      617  1711  2.77    SGX544 MP2  2 x 32-bit
Z2580     Q2 2013  32nm   2x Saltwell      1.6 GHz                         SGX544 MP2  2 x 32-bit
Z3735F    Q3 2014  22nm   4x Silvermont    1.33 GHz*    821  2803  3.35    Intel HD    64-bit
Z3735G    Q3 2014  22nm   4x Silvermont    1.33 GHz*    827  2773  3.42    Intel HD    32-bit
Z3736F    Q4 2014  22nm   4x Silvermont    1.33 GHz*    968  2858  2.95    Intel HD    64-bit
Z3736G             22nm   4x Silvermont    1.33 GHz*                       Intel HD    32-bit

* The chips have a so-called burst (turbo) frequency of 1.83 GHz (Z3735) or 2.16 GHz (Z3736).

Intel's Atom SoCs for mobile devices, although compatible with the x86 and x86-64 instruction sets used with PC processors, are based on CPU cores specifically designed for the mobile market and not derivatives of PC-class architectures.

The Saltwell core (which does not support x86-64) in previous generation Atom SoCs such as the Z2560 and Z2580 has performance approximately equivalent to an ARM Cortex-A7 clocked at the same frequency, but the higher typical clock speed of 1.6 GHz results in higher single-core performance than typical Cortex-A7 configurations that are clocked lower. However, the dual-core CPU configuration with HyperThreading results in lower multi-core performance scaling than a typical quad-core Cortex-A7. The per-core 512K L2 cache is not really optimal for mobile applications and suggests that the architecture was not yet fully optimized for low power mobile applications, and overall the SoCs have significantly lower performance/Watt than competitive solutions that use ARM Cortex-A7 cores.
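The comparison at equal clock frequency can be made concrete by dividing the single-core JPEG Compress score by the clock speed. The sketch below does this for the Z2560 and for two Cortex-A7 chips from the MediaTek table later in this article. It assumes the listed typical clock speeds are the speeds at which the single-core test actually ran, which is not guaranteed; the function name is hypothetical.

```python
def score_per_ghz(single_score, clock_ghz):
    """Normalize a single-core score by clock speed."""
    return single_score / clock_ghz


# (single-core JPEG Compress score, clock in GHz) from the tables in this article.
chips = {
    "Z2560 (Saltwell)": (617, 1.6),
    "MT8125 (Cortex-A7)": (472, 1.20),
    "MT8382 (Cortex-A7)": (505, 1.30),
}

for name, (score, clock) in chips.items():
    print(f"{name}: {score_per_ghz(score, clock):.0f} points per GHz")
```

All three land within a few percent of each other (roughly 386 to 393 points per GHz), in line with the per-clock equivalence described above.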

The current generation Z373x series is faster than the Z25xx with improved power efficiency and falls somewhere in the mid-range with regard to performance, since it does not reach the pure CPU speed of competitive mobile SoCs targeting the performance segment (approaching the speed of less optimized Cortex-A1x designs like the Allwinner A80T and RK3288, but falling short of the performance of high-end Exynos and Snapdragon 801/805 chips for tablets and smartphones).

The Silvermont-based SoCs show evidence of an optimized memory subsystem, so that the Z3735G with 32-bit memory shows memory performance comparable to Rockchip's RK3288 with a much more expensive dual-channel memory design. The CPU burst mode benefits single-core performance but means that multi-core performance does not scale as well as most ARM-based chips. The SoCs also have relatively fast GPU performance for a mobile chip, benefiting from the low power design and the large cache memory inside the chip.
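One rough way to see the effect of burst mode on scaling is to ask what per-core clock speed the multi-core score would correspond to. The sketch below assumes that the single-core test runs at the full burst frequency, that the score is proportional to clock speed, and that all four cores contribute equally with no other losses. None of this is established by the Geekbench data, so the result is only a rough indication; the function name is hypothetical.

```python
def effective_multicore_clock_ghz(single_score, multi_score, cores, burst_ghz):
    """Estimate the average per-core clock during the multi-core run.

    Assumes the single-core run used burst_ghz and that the score is
    proportional to clock speed with otherwise perfect scaling.
    """
    per_core_score = multi_score / cores
    return burst_ghz * per_core_score / single_score


# Z3735F: single 821, multi 2803, four cores, 1.83 GHz burst.
# Z3736F: single 968, multi 2858, four cores, 2.16 GHz burst.
for name, single, multi, burst in [
    ("Z3735F", 821, 2803, 1.83),
    ("Z3736F", 968, 2858, 2.16),
]:
    clock = effective_multicore_clock_ghz(single, multi, cores=4, burst_ghz=burst)
    print(f"{name}: about {clock:.2f} GHz per core under multi-core load")
```

Under these assumptions both chips come out at roughly 1.6 GHz per core, between the 1.33 GHz base clock and the burst frequency, which would explain scaling factors well below 4.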

Leadcore Technology

Leadcore is an upcoming Chinese designer of SoCs for smartphones that has been focusing on the TD cellular standards primarily used in China, and also offers tablet chips with an integrated modem. Although still a relatively small player, its designs show evidence of good product planning with efficient, cost-effective solutions and the company has attracted the attention of Xiaomi, which is rumoured to be interested in acquiring a majority stake in the company.

Chip      Arrival  Fab    CPU              Clock speed Geekbench   Multi   GPU            Modem
                          configuration    (typical)   JPEG C.     core x

LC1913    2013?    40nm   4x Cortex-A7     1.4 GHz                         Mali-400 MP2   3G (TD)
LC1960    2014     28nm   6x Cortex-A7     2.0? GHz                        Mali-T628 MP2  4G
LC1980    2014?                                                            Mali-T720 MP6

On paper, the LC1913 appears to be a cost-effective chip for tablets with integrated 3G connectivity, being similar to MediaTek's MT8382 but on a 40nm instead of a 28nm process. I have not yet located any entries using this chip in the Geekbench database. The hexa-core LC1960, which most likely has a dual-channel external memory interface like the LC1860 for smartphones, promises to be a reasonably balanced, efficient design that provides good but low-power CPU performance while addressing performance bottlenecks with the use of a dual-channel memory interface, potentially making it suitable for higher resolution screens (but see note below about fillrate of the Mali-T628 MP2 GPU). Although the dual-channel memory increases PCB cost, the SoC has the hallmarks of being relatively low-cost and the wide memory interface may in fact contribute to increased power efficiency because of the reduction in memory transaction duration. This is one of the first chips to combine a wide memory interface with a relatively efficient CPU configuration (most existing chips with dual-channel memory tend to be high-end designs using heavy, performance-oriented CPU cores such as Cortex-A15, Krait-400 or Cortex-A57 as well as heavy GPUs).

The Mali-T628 MP2 GPU clocked at about 690 MHz inside the LC1960 provides greatly improved triangle throughput (173 Mtri/s) when compared to the Mali-400 from typical low-end SoCs, as well as OpenGL ES 3.x support. However, the MP2 configuration limits pixel throughput to 1380 MPix/s, equivalent to Mali-400 MP2 or 450 MP2 clocked at the same frequency. Since comparable GPUs used by competitors (such as Mali-450 MP4 used by MediaTek and HiSilicon and Mali-T628 MP4 and MP6 used by HiSilicon and Samsung) have at least double the number of GPU cores and thus twice the pixel rate at the same clock frequency, and are already relatively limited in fill-rate when compared to high-end GPUs from competitors, it remains to be seen how much of a bottleneck this will be in practice. Game performance is likely to be severely impacted at higher screen resolutions.
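The pixel throughput figure follows directly from the clock speed and core count, and it can be related to screen resolution with some simple arithmetic. In the sketch below, one pixel per clock per core is an assumption chosen to match the 1380 MPix/s quoted above, the 60 frames per second target and the two example resolutions are illustrative, and the function names are hypothetical. Peak rates like these ignore blending, overdraw patterns and memory bandwidth, so they are an upper bound only.

```python
def pixel_fill_rate_mpix(clock_mhz, cores, pixels_per_clock_per_core=1):
    """Peak pixel throughput in MPix/s for a multi-core GPU.

    pixels_per_clock_per_core=1 is an assumption chosen to match the
    1380 MPix/s figure quoted for the Mali-T628 MP2 at 690 MHz.
    """
    return clock_mhz * cores * pixels_per_clock_per_core


def fill_budget_per_pixel(fill_rate_mpix, width, height, fps=60):
    """How many times each screen pixel could be written per frame at peak rate."""
    pixels_per_second = width * height * fps
    return fill_rate_mpix * 1_000_000 / pixels_per_second


mp2 = pixel_fill_rate_mpix(690, cores=2)   # Mali-T628 MP2 as in the LC1960
mp4 = pixel_fill_rate_mpix(690, cores=4)   # hypothetical MP4 at the same clock

# Example resolutions are illustrative, not tied to specific devices.
for width, height in [(1280, 800), (2048, 1536)]:
    print(f"{width}x{height}: MP2 {fill_budget_per_pixel(mp2, width, height):.1f}, "
          f"MP4 {fill_budget_per_pixel(mp4, width, height):.1f} writes per pixel per frame")
```

At 2048x1536 the MP2 configuration has a peak budget of only about 7 writes per pixel per frame, compared with about 22 at 1280x800, which illustrates why higher resolutions are the concern.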

MediaTek

MediaTek is a Taiwanese company with a relatively long history of activity and success as a chip platform provider for the Chinese mobile phone market. MediaTek also has a long history targeting segments such as digital TVs and set-top boxes, DVD players and several other segments, and has generally been successful in those segments. In the past few years, MediaTek has had a large share of the SoC market for smartphones among Chinese manufacturers and other cost-sensitive manufacturers with cost-effective, power efficient, highly integrated SoCs. MediaTek was the company that spearheaded the emergence of a multi-core ARM Cortex-A7 configuration manufactured at 28nm as a very efficient, low cost and adequately performing CPU solution for smartphones ranging from entry-level to mid-range. Since 2013, MediaTek has also been successful in the tablet chip market, with both modemless application processors targeting WiFi-only tablets and chips with integrated modem.

Chip      Arrival  Fab    CPU              Clock speed Geekbench   Multi   GPU                Modem
                          configuration    (typical)   JPEG C.     core x

MT8125    H1 2013  28nm   4x Cortex-A7     1.20 GHz     472  1893  4.01    PowerVR SGX544 MP  -
MT8121    Q2 2014  28nm   4x Cortex-A7     1.30 GHz     505  2002  3.96    PowerVR SGX544 MP  -
MT8127    Q3 2014  28nm   4x Cortex-A7     1.30 GHz     508  2023  3.98    Mali-450 MP4       -
MT8135V   Q3 2014  28nm   2x Cortex-A15/A7 1.50 GHz     896  1884  2.10    PowerVR Series 6   -

MT8389    2H 2013  28nm   4x Cortex-A7     1.21 GHz     469  1894  4.04    PowerVR SGX544 MP  3G
MT8312    Q4 2013  28nm   2x Cortex-A7     1.30 GHz     505  1011  2.00    Mali-400 MP        3G
MT8382    Q1 2014  28nm   4x Cortex-A7     1.30 GHz     505  2013  3.99    Mali-400 MP2       3G
MT8392    2014     28nm   8x Cortex-A7     1.66 GHz     644  4745  7.79    Mali-450 MP4       3G
MT8732    Q4 2014? 28nm   4x Cortex-A53    1.5? GHz                        Mali-T760 MP2      4G
MT8752    Q4 2014? 28nm   8x Cortex-A53    1.69 GHz     952* 5046* 5.30*   Mali-T760 MP2      4G

* The CPU performance of the MT8752 as reported for the CUBE T7 and for the equivalent MT6752 for smartphones shows different CPU scaling in different entries, with some around 7.7 as expected for a fully utilized octa-core CPU, while others show a scaling factor of about 5.3. It is notable that the PNG Decompress test shows CPU scaling close to 8 when JPEG Compress scaling is 5.3, while PNG Decompress scaling is a little above 5 when JPEG Compress scaling is close to 8. This could be the result of scheduling algorithm differences, or something else related to Geekbench, since similar behaviour with regard to JPEG Compress benchmark variation is also noticeable for recent entries for other chips like the Allwinner A33.

MediaTek's MT8125 was its first really successful tablet chip, providing high power efficiency and good performance. Performance and efficiency benefit from four low-power Cortex-A7 cores, a relatively large 1MB L2 cache, and a PowerVR SGX544MP GPU. The chip was prominently adopted by the Asus MemoPad 7 HD and other brand-name tablets.

The MT8121 is a lower-cost, more highly integrated version of the MT8125 that does not appear to have been widely used outside of a few Lenovo tablet models. The MT8127 is a relatively fast and cost-efficient tablet processor within the bounds of a single-channel memory interface, with the Mali-450 MP4 GPU providing relatively good game performance as long as the resolution is not too high. Both processors appear to have been affected by the shortage of wafer supply for MediaTek in mid-2014, with some production capacity most likely prioritized for the MT8135V used in new Amazon Kindle tablets, as well as higher-margin tablet processors with integrated modem.

The MT8135V is a variant of the high-end MT8135 tablet processor that was announced in mid-2013 but has failed to materially appear on the market. The MT8135V appears to be a custom design for new Amazon Kindle tablets that are positioned at the entry-level segment of the US retail market, probably as the result of a long-standing agreement. However, the MT8135V shares much of the MT8135's higher-cost design features making it seem rather unsuitable for entry-level tablets with a small form factor, although the memory interface has been halved from double to single-channel. Power efficiency is also likely to be a problem. It is ironic that use of the MT8127, although having lower single-core performance, would probably easily have fit the bill for the Kindle tablets with significant advantages for cost and power consumption.

MediaTek has been one of the first companies to offer cost-effective solutions for tablets with integrated 3G cellular data or voice connectivity, mostly based on comparable smartphone products, and has for some time dominated that market. The previous-generation MT8389(T) corresponded to the MT6589(T) for smartphones, while the dual-core MT8312 and quad-core MT8382 are the equivalent of the MT6572 and MT6582. The MT8392 matches the MT6592 octa-core smartphone processor. Tablet manufacturers also commonly utilize MediaTek's smartphone chips directly. Chips such as the MT8312/MT6572 and MT8382/MT6582 have a relatively optimized CPU architecture, with no unexpected bottlenecks, providing good performance for their cost segment.

The upcoming MT8732 (quad-core) and MT8752 (octa-core) are Cortex-A53-based tablet SoCs with integrated 4G modem that correspond to similar upcoming chips for smartphones (MT6732 and MT6752). The use of a many-core Cortex-A53 configuration promises to significantly raise performance for low-power SoCs and is likely to be able to address several segments including the high-performance segment, while greatly reducing cost. There are signs that the MT8732, because of the relatively large die area associated with the Mali-T760 MP2 GPU core, will not be cost-effective enough for entry-level segments and will be superseded by a chip (equivalent to MT6735 for smartphones) that has a more economical but lower-performance Mali-T720 GPU.

NVIDIA

NVIDIA, with a long history as a leader in PC, console and laptop GPUs, has recently increased its focus on the tablet market and more or less given up on its long-term goal of penetrating the high-volume smartphone market with integrated SoCs. NVIDIA has been designing its Tegra processors for tablets for quite some time but has seen mixed success, ultimately failing to break into the high-volume mainstream tablet market. It has gained a few high-profile design wins for high-end devices, most recently for the HTC Nexus 9.

Chip              Arrival  Fab   CPU                 Clock speed  Geekbench   Multi  GPU
                                 configuration       (typical)    JPEG C.     core x
Tegra 250 T20     Q1 2010  40nm  2x Cortex-A9        1.0 GHz                         GeForce ULP
Tegra 3 T30       Q4 2011  40nm  4x + 1x Cortex-A9   1.4 GHz       605  2238  3.70   GeForce ULP
Tegra 4 T114      Q2 2013  28nm  4x + 1x Cortex-A15  1.8 GHz       938  3850  4.10   GeForce ULP
NVIDIA K1         Q1 2014  28nm  4x + 1x Cortex-A15  2.2 GHz      1296  5359  4.14   Kepler DX1
NVIDIA K1 (ARMv8) Q3 2014  28nm  2x NVIDIA Denver    2.5 GHz      2002  3941  1.97   Kepler DX1

NVIDIA's Tegra and Tegra 2 processors saw fairly widespread adoption in the early days of the tablet market. Tegra 2 had some architectural deficiencies that made it less competitive: for example, it did not have an up-to-date video decoder and lacked ARM's almost standard NEON SIMD extension. NVIDIA was not able to sustain its market share momentum as the market became increasingly dominated by Chinese white-box tablets as well as brand names such as Apple and Samsung.

NVIDIA has developed its own ARMv8-compatible CPU core, Denver, which is a large core with very high single-core performance, and which has been implemented in the ARMv8 version of the NVIDIA K1 processor in a dual-core configuration. The chip provides leading single-core performance, but multi-core performance is less than even upcoming mid-range solutions. The GPU performance of both K1 processors is industry-leading.

Rockchip

Chinese company Rockchip, which has a history as a supplier of MP3/MP4 video players, held a strong position in the very early tablet market before Allwinner displaced it with its A10 chip in 2012. Rockchip subsequently regained traction with relatively high-performing chips including the RK3066 and RK3188, and later expanded its product offering for low-end segments. Although Rockchip has led the tablet processor market in 2014 in terms of volume, it has continued to use Cortex-A9 cores for most of its products, and these are considerably less efficient in terms of chip cost (die area) and power consumption than the Cortex-A7 cores used by competitors.

Chip      Arrival  Fab    CPU             Clock speed  Geekbench   Multi    GPU
                          configuration   (typical)    JPEG C.     core x
RK2926/28 2013     55nm   1x Cortex-A9     1.01 GHz     430   430  1.00     Mali-400 MP
RK3066    Q3 2012  40nm   2x Cortex-A9     1.61 GHz     696  1202  1.73     Mali-400 MP4
RK3188    Q2 2013  28nm   4x Cortex-A9     1.61 GHz     699* 2604* 3.73     Mali-400 MP4
RK3188T   Q3 2013  28nm   4x Cortex-A9     1.42 GHz     617  2441  3.96     Mali-400 MP4
RK3026/28 1H 2014  40nm   2x Cortex-A9     1.01 GHz     443   885  2.00     Mali-400 MP2
RK3168    Q2 2014  28nm   2x Cortex-A9     1.5 GHz                          PowerVR SGX540
RK3288    Q3 2014  28nm   4x Cortex-A12    1.8 GHz      980  3873  3.95     Mali-T760 MP4
RK3126/28 Q4 2014  40nm   4x Cortex-A7     1.3 GHz                          Mali-400 MP2
"MayBach"          28nm   8x Cortex-A53                                     OpenGL ES 3.0-class

* RK3188-based devices running at 1.6 GHz (probably reflecting the use of the original RK3188 rather than the cost-reduced RK3188T) show a relatively high amount of variation in benchmark scores between devices and runs, probably reflecting thermal throttling or other scheduler characteristics.

The RK3066 was a relatively high-performance chip at the time of its introduction (second half of 2012), and was successful in the mid-range of the white-box tablet market, as well as gaining design wins with companies like HP. The relatively high clock frequency Cortex-A9 cores on a 40nm process, as well as the Mali-400 MP4 GPU, constrained its power efficiency.

The RK3188 (in practice more often the lower-clocked RK3188T in a cost-reduced package) was introduced as the logical successor to the RK3066, addressing the higher-performance part of the white-box tablet market as well as being adopted in brand name models from Asus and others. Although Cortex-A9 cores are not very power-efficient, efficiency is improved by the use of a relatively advanced 28nm HKMG process at GlobalFoundries. Rockchip has benefitted from the fact that it was one of the few companies with plentiful wafer supply in 2014, being one of the few customers of GlobalFoundries while many of its competitors faced a very tight capacity environment at TSMC and to a lesser extent other foundries. In 2014, the RK3188T has been observed not only in more performance-oriented tablets, but also in significant numbers in cheaper tablets with relatively low-cost and low-quality components outside of the processor, where it seems out of place. This scenario probably unfolded because of shortages of tablet processors due to the tight foundry capacity environment outside of GlobalFoundries, while GlobalFoundries may have offered low prices for wafers in the face of excess capacity.

The RK3168 was announced in 2013 as a power-efficient dual-core processor, but only arrived in Q2 2014 with relatively limited adoption, amid signs that its power efficiency leaves something to be desired.

The dual-core Cortex-A9 RK3026 and RK3028 appeared in numerous low-end tablets in 2014, while the pin-compatible RK3126 and RK3128, which are due to appear in Q4 2014, will finally see Rockchip transition away from the relatively inefficient Cortex-A9 to the more efficient (in terms of cost and power consumption) Cortex-A7.

Finally, the RK3288 is an ambitious high-end processor utilizing four Cortex-A17 (technically Cortex-A12) cores, also manufactured at GlobalFoundries. The RK3288 was delayed and for some time pitched to manufacturers of media boxes and other devices amongst indications that hardware work-arounds were required to circumvent hardware issues related to the chip. Reports suggest power consumption and heat production can be problematic. The RK3288 has recently appeared in the Geekbench database in several entries for the Teclast P90HD tablet. Results show performance roughly comparable with Allwinner's A80, with memory performance lower than the A80 and significantly lower than other competitors' chips that also use a 64-bit or dual-channel memory interface, including smartphone platforms. One TV box result shows more acceptable memory performance, probably as the result of a faster DRAM frequency, although still falling short of the performance of smartphone platforms like Exynos 5430 and Snapdragon 801. A relatively steep fall-off in game performance at higher resolutions can be explained by a memory bandwidth bottleneck imposed by the less-than-optimal memory controller. When not constrained by memory bandwidth, the Mali-T764 GPU provides excellent game performance, although the exact nature of the Mali-T764 GPU (a model number not used by ARM) remains in doubt.

Despite the announcement by ARM that the latest version of the Cortex-A12 core is equivalent in performance to Cortex-A17 and that the name Cortex-A12 will therefore be retired, a comparison of Geekbench results for the Cortex-A12-based RK3288 with the real Cortex-A17-based MT6595 shows a not insignificant difference in pure CPU performance when corrected for clock frequency, of about 13% in favor of Cortex-A17, with Cortex-A15 in the middle. This suggests RK3288 does not use the latest version of Cortex-A12 to which ARM referred when making the performance comparison to Cortex-A17.

Qualcomm

Qualcomm has dominated the entire higher-end part of the smartphone SoC market in recent years, largely based on leverage of its patent royalty schemes which are based on the total selling price of a device, enabling Qualcomm to coerce most well-known device manufacturers to use Snapdragon chips for a large proportion of their line-up. More recently, Qualcomm has started targeting the tablet space. Clearly, its integrated 3G/4G modem technology and patent royalty leverage give it opportunities to penetrate 3G/4G-enabled tablets, but Qualcomm has also been targeting WiFi-only tablets for which it does not have direct patent royalty leverage.

Chip      Arrival  Fab    CPU              Clock speed Geekbench     Multi  GPU          Modem
                          configuration    (typical)   JPEG C.       core x

APQ8064    2013     28nm   4x Krait 300    2.0 GHz       1035  4207  3.22x  Adreno 320   -
MSM8026    2014     28nm   4x Cortex-A7    1.2 GHz                          Adreno 305   -
MSM8074    2014     28nm   4x Krait 400    2.36 GHz                         Adreno 330   -
 
MSM8226    2013     28nm   4x Cortex-A7    1.19 GHz       461  1791  3.85x  Adreno 305   3G
MSM8926    2014     28nm   4x Cortex-A7    1.19 GHz       466  1883  4.04x  Adreno 305   4G
MSM8974-AC 2014     28nm   4x Krait 400    2.45 GHz      1273  4969  3.90x  Adreno 330   4G
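
The single-core JPEG Compress scores in the table can be put on a per-clock basis to compare CPU cores independently of clock speed. The following is a minimal sketch in Python, assuming that the "typical" clock speed listed in the table is the speed actually sustained during the test; the function name `score_per_ghz` is only illustrative.

```python
# Single-core JPEG Compress scores and typical clock speeds (GHz) copied from
# the Qualcomm table above; rows without a score are omitted.
chips = {
    "APQ8064 (Krait 300)": (1035, 2.0),
    "MSM8226 (Cortex-A7)": (461, 1.19),
    "MSM8926 (Cortex-A7)": (466, 1.19),
    "MSM8974-AC (Krait 400)": (1273, 2.45),
}


def score_per_ghz(score, clock_ghz):
    """Clock-normalized score: a rough proxy for per-cycle CPU performance."""
    return score / clock_ghz


normalized = {name: score_per_ghz(s, c) for name, (s, c) in chips.items()}
for name, value in normalized.items():
    print(f"{name}: {value:.0f} points per GHz")
```

Under this assumption the two Krait generations come out at nearly the same score per GHz, while the Cortex-A7 parts land clearly lower.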

Qualcomm's modemless applications processors for WiFi-only tablets are generally variants of smartphone SoCs that do have an integrated baseband. Snapdragon platforms that have modemless counterparts include Snapdragon 400, 600 and 801, while Snapdragon 805 is also technically a modemless processor that might be applicable to WiFi-only tablets.

For tablets with integrated 3G or 4G, Qualcomm uses smartphone chips from the Snapdragon 400 and Snapdragon 800 series. The Cortex-A7-based versions of Snapdragon 400 are power-efficient SoCs comparable in performance to MediaTek's offerings with a reasonably fast GPU. Qualcomm has been leading the integration of 4G modems into SoCs and dominates that part of the smartphone market, which it can also apply to 4G-enabled tablets.

The Snapdragon 800 series has long been the performance leader in the high-end smartphone SoC market outside of Apple, dominating high-end smartphones. This product line is also being used in some tablets from brand-name manufacturers such as Samsung. The Snapdragon 800 series is characterized by relatively high CPU performance, reasonable power efficiency, wide memory interfaces with high bandwidth, and a high-end mobile GPU able to drive high resolutions. From a chip cost standpoint, the series is expensive to produce because of a relatively large die area, but this affects Qualcomm only slightly because of the virtual monopoly it has had from the leverage of its patent royalty schemes, which allows it to maintain high margins.

Samsung

Samsung has a fairly extended history developing Exynos SoCs for devices such as smartphones and tablets. A few years ago, when the baseband/modem was generally not yet integrated with the applications processor in performance-oriented smartphones, Samsung used a significant number of Exynos application processors in international versions of its flagship smartphones such as the Galaxy S II. Later, although Samsung prominently announced the use of new high-performance Exynos chips in new flagship smartphones, actual shipments were overwhelmingly dominated by Qualcomm Snapdragon-based variants of the same model. Only recently in 2014 has Samsung started to again use more of its own Exynos chips (including Exynos 3470, Exynos 5430 and Exynos 5433/Exynos 7 Octa) in new smartphones. Samsung also uses Exynos SoCs in tablets, primarily WiFi-only models.

Chip         Arrival  Fab    CPU                   Clock speed Geekbench   Multi   GPU                Memory  Modem
                             configuration         (typical)   JPEG C.     core x                     bus

Exynos 4210  2011     45nm  2x Cortex-A9           1.2 GHz                         Mali-400 MP4       2 x 32  -
Exynos 4212  2011     32nm  2x Cortex-A9           1.2 GHz                         Mali-400 MP4       2 x 32  -
Exynos 4412  2012     32nm  4x Cortex-A9           1.6 GHz       486  1290  2.65   Mali-400 MP4       2 x 32  -
Exynos 5250  2012     32nm  2x Cortex-A15          1.7 GHz                         Mali-T604 MP4      2 x 32  -
Exynos 5420  2013     28nm  4x Cortex-A15/A7       1.9 GHz      1212  4337  3.58   Mali-T628 MP6      2 x 32  -
Exynos 5260  Q2 2014  28nm  2x + 4x Cortex-A15/A7  1.7 GHz                         Mali-T624          2 x 32  -
Exynos 5422  Q2 2014  28nm  4x Cortex-A15/A7       1.9 GHz                         Mali-T628 MP6      2 x 32  -
Exynos 3470  2014     28nm  4x Cortex-A7           1.4 GHz                         Mali-400 MP4       32      4G
Exynos 5430  Q3 2014  20nm  4x Cortex-A15/A7       1.8 GHz      1053  4910  4.66   Mali-T628 MP6      2 x 32  -
Exynos 5433  Q3 2014  20nm  4x Cortex-A57/A53      1.4-1.9 GHz  1376  6130  4.45   Mali-T760 MP6      2 x 32  -

Some Exynos SoCs, including Exynos 4412 and Exynos 5420, have been sold to parties outside of Samsung such as Chinese tablet manufacturers.

The use of the relatively power-hungry ARM Cortex-A15 core has made it a challenge for Samsung to preserve power efficiency, generally limiting the use of these Exynos processors to tablets. Samsung's implementation of big.LITTLE has become more optimized over time, progressing to the ability to do full Global Task Switching and implementing improvements in power efficiency. Power use is also helped by newer versions of the Cortex-A15 core, process improvements (e.g. 20nm), and reducing the maximum clock rate for the Cortex-A15 cores (which were sometimes set in an unbalanced way at a high speed for marketing purposes, at the cost of the practical experience, such as shorter battery life).

Sources: Geekbench browser

Initial version (November 7, 2014): Geekbench CPU benchmark results still have to be filled in for most SoCs.
Updated (November 8, 2014): Add Atom Z2560, MT812x benchmarks, correct description of MT8121.
Updated (November 9, 2014): Improve Intel section.
Updated (November 13, 2014): Provide more CPU benchmark scores, some other improvements.
Updated (November 18, 2014): Provide CPU benchmarks for Qualcomm and Samsung chips.
Updated (November 27, 2014): Improve Samsung section, add CPU benchmarks, fix RK3288 CPU configuration, add MT8752 CPU benchmarks, comment on variation in JPEG Compress CPU scaling scores for MT8752 and Allwinner A33.
Updated (November 30, 2014): Add note about MT8121.
Updated (December 5, 2014): Add NVIDIA section, other tweaks.

Saturday, October 18, 2014

Samsung's 64-bit Exynos 5433 SoC renamed to Exynos 7 Octa, used in some Galaxy Note 4 models

Recently, Samsung renamed its Exynos 5433 SoC to Exynos 7 Octa. The new Exynos chip is used by Samsung in the new Galaxy Note 4 smartphone, although how material actual shipments are has been unclear because most regions were first served primarily by Qualcomm Snapdragon 805-based versions of the Galaxy Note 4. However, evidence from the Geekbench result database suggests roughly a quarter of models currently sold are Exynos versions.

Signs of actual adoption of Exynos 7 Octa in high volume becoming apparent

Samsung has in the past frequently announced the use of Exynos SoCs in prominent smartphones, but shipments were often limited to very low volumes for smaller regions such as Korea, with the vast majority of shipments using Snapdragon SoCs. During the last two years, only Samsung's tablets have seen widespread use of Samsung high-performance mobile SoCs. Although Samsung has recently ramped mid-range chips such as Exynos 3470 in presumably high volume for the Galaxy S5 Mini, strong evidence would be required to establish that the situation will be different this time around in terms of a high-profile Exynos SoC (Exynos 7 Octa) actually being used in high volume in smartphones.

However, searching for Galaxy Note 4 models on the Geekbench Browser provides evidence that at least one quarter of units currently sold contains the new Exynos chip, with the other three quarters or so using Snapdragon 805. Exynos versions are primarily represented by the SM-N910C, SM-N910S and SM-N910K models, while Snapdragon versions are mainly represented by SM-N910A, SM-N910T, SM-N910F and several other models.

Number of Geekbench entries for each Samsung Galaxy Note 4 variant as of 24 October:

SM-N9100: Snapdragon 805, 7 entries
SM-N9109W: Snapdragon 805, 4 entries
SM-N910A: Snapdragon 805, 635 entries
SM-N910C: Exynos 5433, 425 entries
SM-N910F: Snapdragon 805, 496 entries
SM-N910H: Exynos 5433, 24 entries
SM-N910K: Exynos 5433, 73 entries
SM-N910L: Exynos 5433, 33 entries
SM-N910R4: Snapdragon 805, 23 entries
SM-N910P: Snapdragon 805, 238 entries
SM-N910S: Exynos 5433, 197 entries
SM-N910T: Snapdragon 805, 559 entries
SM-N910V: Snapdragon 805, 69 entries
SM-N910W8: Snapdragon 805, 10 entries

For the listed models, the total count is 752 Exynos and 2041 Snapdragon, representing an Exynos proportion of about 27%.
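
The totals and the proportion above can be reproduced from the per-model list. The following is a minimal sketch in Python; it assumes, as the article does, that the number of Geekbench entries is a usable proxy for units sold, which is only an approximation.

```python
# Geekbench entry counts per Galaxy Note 4 model as of 24 October,
# copied from the list above: model -> (SoC, number of entries).
entries = {
    "SM-N9100": ("Snapdragon 805", 7),
    "SM-N9109W": ("Snapdragon 805", 4),
    "SM-N910A": ("Snapdragon 805", 635),
    "SM-N910C": ("Exynos 5433", 425),
    "SM-N910F": ("Snapdragon 805", 496),
    "SM-N910H": ("Exynos 5433", 24),
    "SM-N910K": ("Exynos 5433", 73),
    "SM-N910L": ("Exynos 5433", 33),
    "SM-N910R4": ("Snapdragon 805", 23),
    "SM-N910P": ("Snapdragon 805", 238),
    "SM-N910S": ("Exynos 5433", 197),
    "SM-N910T": ("Snapdragon 805", 559),
    "SM-N910V": ("Snapdragon 805", 69),
    "SM-N910W8": ("Snapdragon 805", 10),
}

# Sum the entry counts per SoC.
totals = {}
for soc, count in entries.values():
    totals[soc] = totals.get(soc, 0) + count

exynos_share = totals["Exynos 5433"] / sum(totals.values())
print(totals)
print(f"Exynos share: {exynos_share:.1%}")
```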

All things being equal, one would expect Samsung to prefer to use the internally manufactured Exynos chipset if enough supply is available, although with four Cortex-A57 cores the SoC is likely to be relatively expensive to manufacture. On the other hand, there are significant performance differences, with the Exynos platform clearly faster in terms of CPU processing but with a question mark in terms of power efficiency, while Snapdragon 805 can be regarded as mature, stable technology. Qualcomm may also be able to enforce a certain quota of Snapdragon chips based on its leverage of patent royalties and licensing fees (which are considerable for a high-end smartphone).

Some anomalies are evident in the chips used for certain models. For example, a number of the results for the SM-N910S (which officially uses the Exynos 5433) in the Geekbench database show the use of an APQ8064 (Snapdragon 600) SoC clocked at 1.89 GHz, which is significantly slower than Exynos 5433 (or Snapdragon 805). Similarly, starting from October 30 a not insignificant number of results labelled as SM-N910C show the use of the aging Exynos 4412 SoC (also used in old models such as the Galaxy S III) with four Cortex-A9 cores clocked at 2.0 GHz, much slower than Exynos 5433. These anomalies probably represent counterfeit production by Chinese manufacturers (both APQ8064 and Exynos 4412 have been common in the supply chain in the past). For models that officially use Snapdragon 805, no anomalies are evident.

Update as of December 5, 2014

Reassessing the share of Exynos 5433 vs Snapdragon 805 in the Geekbench database after a few months of production should be informative about whether Samsung is really serious about ramping Exynos production for smartphones. The following is apparent:

The Exynos-based SM-N910C count has increased from 425 to 4390.
The Exynos-based SM-N910S count has increased from 197 to 578.
The Exynos-based SM-N910K count has increased from 73 to 212.
The Exynos-based SM-N910H count has increased from 24 to 757, while the SM-N910L count has increased from 33 to 91.
The new Exynos-based SM-N910U shows a count of 1062.
The Snapdragon 805-based SM-N910A count has increased from 635 to 2258.
The Snapdragon 805-based SM-N910T count has increased from 559 to 2089.
The Snapdragon 805-based SM-N910F count has increased from 496 to 3857.
The Snapdragon 805-based SM-N910P count has increased from 238 to 1162.
The Snapdragon 805-based SM-N910R4 count has increased from 23 to 61, SM-N9100 from 7 to 58, SM-N9109W from 4 to 20, SM-N910V from 69 to 1685, and SM-N910W8 from 10 to 636.
The new Snapdragon 805-based SM-N910G shows a count of 903, SM-N9106W shows 22, SM-N9108V shows 1.

For the listed models, the total count is 7090 Exynos and 12752 Snapdragon, representing an increased share of Exynos-based models in the Geekbench database from about 27% to about 36%, clearly suggesting that the share of Exynos-based models is increasing, and recent production may already have a much greater proportion of Exynos-based models.
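
Since the Geekbench counts are cumulative, the share among entries added between the two dates gives a rough indication of more recent sales. The following is a minimal sketch in Python using the two sets of totals; it assumes that models first listed in December had no entries on 24 October and that entry counts are proportional to units sold, neither of which can be verified from the database alone.

```python
# Cumulative Geekbench entry totals for the Galaxy Note 4, from the text above.
oct_24 = {"exynos": 752, "snapdragon": 2041}
dec_5 = {"exynos": 7090, "snapdragon": 12752}


def exynos_share(counts):
    """Fraction of entries that are Exynos-based."""
    return counts["exynos"] / (counts["exynos"] + counts["snapdragon"])


# Entries added between 24 October and 5 December.
new_entries = {soc: dec_5[soc] - oct_24[soc] for soc in dec_5}

print(f"Share on 24 October:    {exynos_share(oct_24):.1%}")
print(f"Share on 5 December:    {exynos_share(dec_5):.1%}")
print(f"Share of new entries:   {exynos_share(new_entries):.1%}")
```

Under these assumptions, the Exynos share among entries added after 24 October comes out at about 37%, slightly above the cumulative figure of about 36%.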

First 20nm ARMv8 SoC targeting Android

One of the first smartphone SoCs manufactured using a 20nm process, at Samsung's own fabs, the Exynos 7 Octa is the first chip featuring ARM's Cortex-A57 and Cortex-A53 cores in a big.LITTLE configuration to appear on the market. The Cortex-A5x cores support the 64-bit ARMv8 instruction set, although using the 32-bit variant of the ARMv8 instruction set also appears to bring benefits while avoiding the performance degradation (related to increased memory use for pointers and addressing) that is associated with going to full 64-bit.

It is not the first 20nm SoC to support the ARMv8 instruction set, since Apple's A8 chip has already ramped to high-volume production during most of the year at TSMC for use in the iPhone 6 models. And already in 2013, Apple introduced the first ARMv8 chip with the Apple A7. As I have explained in an earlier article, there are reasons to believe the CPU cores in the Apple A7/A8 may have great similarities to ARM's Cortex-A57 CPU core, and in that sense the Exynos 7 Octa technically may not actually be the first SoC with Cortex-A57 cores to hit the market.

Fast, but power efficiency may be a problem

Reviews of Exynos 7 Octa-based devices such as the Galaxy Note 4 are still scarce. Already several months ago, early benchmark results showed Exynos 5433 (as it was known then) providing the highest performance in the mobile space, significantly outscoring Snapdragon 805 in most benchmarks. This is not unexpected given the use of high-performance Cortex-A57 cores at a fairly high clock frequency.

However, there are signs that maintaining power efficiency with higher-clocked Cortex-A57 cores may be a challenge. Some early hands-on previews have suggested relatively high power consumption and mediocre battery life for an Exynos 5433-based Galaxy Note 4. More definite test results should clarify the situation.

Setting maximum clock frequency creates dilemma

Software techniques such as the use of efficient Global Task Switching with preference for the economical Cortex-A53 cores and throttling down of the clock frequency may be vital to maintain acceptable battery life. Analysis of Geekbench results for the Exynos 5433-based SM-N910C shows a multi-core performance scaling factor of about 4.45 for the largely CPU-bound JPEG Compress test, suggesting that Global Task Switching is implemented in such a way that not just the Cortex-A57 cores are utilized but the Cortex-A53 cores as well when high CPU performance is required.
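
The scaling factor is simply the multi-core score divided by the single-core score for the same test. The following is a minimal sketch in Python using the JPEG Compress scores listed for Exynos 5433 in the tablet processor table of the November 7 post (1376 single-core, 6130 multi-core); the reading that a factor above 4 implies the Cortex-A53 cores are contributing is an assumption that four Cortex-A57 cores alone cannot scale beyond 4x.

```python
def multicore_scaling(single_core_score, multi_core_score):
    """Ratio of multi-core to single-core score for the same subtest."""
    return multi_core_score / single_core_score


# JPEG Compress scores for Exynos 5433 as listed in the November 7 post.
factor = multicore_scaling(1376, 6130)

big_cores = 4  # number of Cortex-A57 cores in Exynos 5433
little_cores_contribute = factor > big_cores

print(f"Scaling factor: {factor:.2f}")
print(f"More than {big_cores} cores' worth of throughput: {little_cores_contribute}")
```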

High-performance CPU cores such as Cortex-A57 tend to have relatively high power consumption that increases as the clock frequency increases. This creates a dilemma for a manufacturer, because for acceptable power consumption with practical use there is little reason to set the maximum clock speed at the relatively high level that is desirable for marketing purposes; a speed similar to the one used in Apple's Cyclone cores (e.g. 1.4 GHz) provides more than enough speed for most applications while limiting the excessive power consumption (and potential stability problems) associated with higher frequencies. A similar dilemma is often associated with SoCs with Cortex-A15 CPU cores (such as Samsung's Exynos 5430 used in the Galaxy Alpha) that have performance characteristics (high performance, but low performance/Watt) comparable to Cortex-A57, although Cortex-A57 is likely to be more efficient.

Providing superior synthetic benchmark performance can be a matter of high prestige for a company and its marketing department, to the extent that an unbalanced high maximum clock frequency may still be used in actually shipping devices, to the detriment of the user experience. Associated with this dilemma is the attraction of "cheating" on benchmarks by detecting when synthetic benchmarks are run and then switching to higher, sustained clock frequencies with reduced heat throttling, a practice that websites such as AnandTech have shown to be widespread in the past.

Evidence suggests Exynos 5433's Cortex-A57 cores are already clocked at a relatively low but efficient speed of about 1.4 GHz

Exynos 5433 may in practice already be clocked at a relatively low maximum speed to conserve power. Geekbench consistently reports 1.3 GHz as the clock frequency for all Exynos 5433 devices. However, for some devices, including Samsung's big.LITTLE Exynos 5430 with Cortex-A15, Geekbench seems to report the maximum clock speed of the slower LITTLE cores, so the Cortex-A57 cores are probably clocked higher. Even so, for the Cortex-A57 cores in the Exynos 5433, which have dramatically higher performance/cycle than the LITTLE Cortex-A53 cores, a relatively limited maximum speed in the range of 1.3 GHz would by no means be inappropriate for a smartphone platform.

Looking more closely at cross-platform Geekbench results for the Exynos-based Note 4 and the iPhone 5S and iPhone 6, and assuming that Apple's Cyclone and Cortex-A57 are cores with similar performance characteristics at a given clock speed (at least the little available evidence puts metrics like IPC and DMIPS in the same ballpark), gives indications that Exynos 5433 may on average actually be clocked at an effective 1.4 GHz, comparable to the 1.4 GHz of the iPhone 6. However, it cannot be ruled out that in the case of the Exynos 5433 the frequency is the average resulting from thermal speed throttling (variation of the CPU speed based on power consumption and heat production).

Apple's SoC architecture is also different because it is a dual-core design, compared to the big.LITTLE configuration of the Exynos 5433 with four Cortex-A57 cores and four Cortex-A53 cores. Apple's cache memory architecture is also very different, with a large L3 cache and a likely highly optimized but smaller L2 cache, and the Apple device has higher external RAM performance. Additionally, the difference in software model (Apple's 64-bit AArch64 vs the 32-bit ARMv8 AArch32 used with Exynos 5433) complicates things. However, some conclusions may still be drawn by looking at specific benchmarks.

Comparison with Apple A7 and A8 benchmarks provides clues

Performing a detailed comparison of representative results for an SM-N910C and an iPhone 6 with Apple A8 on the Geekbench browser page provides interesting information. At first glance the results are all over the place, with some benchmarks (including single-core ones) being faster on Exynos and others on the Apple A8, while Exynos obviously has an advantage for multi-core tests.

However, one can look for sub-benchmarks that are less likely to be affected by a large L3 cache on the Apple device, specifically benchmarks that do not have a large memory working set and source data or do not constantly perform random read access on a large set but do perform a lot of processing, possibly writing (but not reading) a lot of data. Some stream-type algorithms such as common data and image compression and decompression benchmarks fit the bill, because they generally stream the source data sequentially, perform a relatively high amount of CPU processing based on a relatively limited working set (a small part of the stream/file), and write the resulting data sequentially.

This type of benchmark puts the Exynos 5433 somewhat lower but fairly close to the Apple A8 in single-core CPU performance. Further information can be gained from iPhone 5S (Apple A7) results.

Benchmark results: Galaxy Note 4 (SM-N910C) vs iPhone 5S vs iPhone 6, relative speed advantage of iPhones compared to SM-N910C:

Test name                           SM-N910C  iPhone 5S       iPhone 6
BZip2 Compress:                     1187      1109 ( -6.5%)   1187 ( +8.5%)
BZip2 Decompress:                   1366      1394 ( +2.0%)   1538 (+12.6%)
JPEG Compress:                      1378      1196 (-13.2%)   1372 ( -0.0%)
JPEG Decompress:                    1598      1583 ( -0.9%)   1855 (+16.1%)
PNG Compress:                       1391      1427 ( +2.6%)   1577 (+13.4%)
PNG Decompress:                     1490      1301 (-12.7%)   1498 ( +0.5%)
Sobel (image local edge detection): 1701      1584 ( -6.9%)   1922 (+13.0%)

The Apple A8 chip in the iPhone 6 scores somewhat higher than Exynos 5433 in most tests, while Exynos 5433 is on average faster than the Apple A7 in the iPhone 5S. All of this is consistent with the CPU cores in all of the devices having comparable single-core CPU performance, and when making the assumption that Cortex-A57 and Cyclone (which seems to have a lot of architectural similarities with Cortex-A57) have comparable performance per cycle (at a given clock frequency), consistent with a clock frequency for the Exynos 5433 that is similar to the one used in the Apple devices (around 1.3 to 1.4 GHz).

The largely CPU-bound JPEG Compress test, which appears to be closely tied to clock speed on other chip platforms with limited dependence on factors outside the CPU core, provides evidence that the isolated single-core CPU performance of Exynos 5433 may be close to that of the Apple A8 in the iPhone 6, consistent with a similar effective clock frequency of about 1.4 GHz. To what extent thermal throttling plays a role for the Exynos is not entirely clear. Most of the Geekbench results for SM-N910C for the JPEG Compress test are very close (a score around 1375), suggesting that at least for this test the maximum clock speed is generally maintained, which would be compatible with this speed being about 1.4 GHz.
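
The reasoning behind this estimate can be written as a simple proportion. The following is a minimal sketch in Python, assuming (as stated above) that Cortex-A57 and Apple's Cyclone deliver the same JPEG Compress score per clock cycle and that the iPhone 6 runs at 1.4 GHz; the function name `estimate_clock_ghz` is only illustrative, and a difference in per-cycle performance would shift the result proportionally.

```python
def estimate_clock_ghz(score, reference_score, reference_clock_ghz):
    """Estimate an effective clock speed by scaling a reference clock
    with the ratio of scores, assuming equal performance per cycle."""
    return reference_clock_ghz * score / reference_score


# Single-core JPEG Compress scores from the comparison table above.
sm_n910c_score = 1378  # Exynos 5433
iphone6_score = 1372   # Apple A8, assumed to run at 1.4 GHz

exynos_clock = estimate_clock_ghz(sm_n910c_score, iphone6_score, 1.4)
print(f"Estimated effective Exynos 5433 clock: {exynos_clock:.2f} GHz")
```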

PNG Decompress seems to be somewhat of a negative outlier for the Apple A7 and A8, but it is consistent across different iPhone results and is probably related to the high amount of memory writes (decompressed image data) associated with the benchmark, which can be affected by the extra layer in the memory subsystem represented by the L3 cache.

One significant caveat for the comparison above is that the Apple devices run in AArch64 mode, while Exynos 5433 in the Note 4 runs in AArch32 mode (the 32-bit version of the ARMv8 instruction set). AArch64 can take advantage of more instructions, in particular instructions operating on 64-bit registers, while the increased pointer/address storage size can decrease performance somewhat. However, the source code for the Geekbench test is likely to be identical for both modes (without extensive use of 64-bit integer variables) and unlikely to be specifically optimized for AArch64, with any optimizations for AArch64 in the generated code depending on the compiler.

Sources: Samsung (Exynos 7 Octa), Geekbench Browser

Updated (24 October 2014): Update with information about proportion of Exynos models based on Geekbench database, and provide performance comparisons with Apple processors.
Updated (30 October 2014): Language tweaks, improve Geekbench comparison table and fix PNG Decompress score for iPhone 5S.
Updated (2 November 2014): Update discussion about clock speed of Exynos 5433, expand description of use of GTS, make note of counterfeit models in Geekbench database.
Updated (5 December 2014): Update Exynos model share statistics for Galaxy Note 4.

Friday, October 3, 2014

Transition to next-generation FinFET process nodes: Samsung unlikely to be in the lead despite media reports

In the last few months, relatively vague media reports about Samsung gaining back chip orders from Apple that it had recently lost to TSMC, as well as new orders from Qualcomm and other players for its next-generation 14nm FinFET technology, have surfaced a few times. These media reports have frequently been repeated in popular technology publications, often interpreted as if TSMC would be losing market share in 2015 to the point of having significant excess capacity or as if Samsung has a considerable technology lead. However, these media reports, as well as sweeping conclusions about a presumed superior market competitiveness of Samsung in comparison with TSMC in 2015, are likely to be highly inaccurate.

TSMC currently dominates advanced node foundry production

TSMC currently dominates the foundry market for leading-edge nodes such as 28 and 20nm for chips such as smartphone SoCs and GPUs with a market share in excess of 80%, and faces significantly more demand than it is able to supply, despite unprecedented investment in new production capacity. Samsung's 28nm logic fabs are currently largely empty, and a similar situation is occurring at GlobalFoundries as it has been struggling to gain significant customers apart from AMD. Within this context, it is apparent that TSMC has been doing something right, while Samsung and GlobalFoundries must have had some significant set-backs, otherwise this market share distribution would not be happening. Given this track record, one can wonder how realistic it is to expect that the level of competitiveness of Samsung and GlobalFoundries would recover or even be reversed for next-generation processes as early as 2015.

Chip design companies motivated to seek additional sources of supply, but challenges apparent

Clearly, because TSMC currently has a virtual monopoly and is not able to fulfill demand, there is a pressing motivation for chip companies such as Qualcomm and others to seek additional sources of supply. Therefore there is no reason to doubt that major efforts are being made in this area, especially starting from about Q2 2014 when the capacity shortage at TSMC became very evident. However, successful completion within any reasonable time-frame of such a move (especially when the effort has only recently become more intensive) involves substantial technological challenges and risks, which make it unlikely that it will actually happen in any way close to the time-frame and volume that has been suggested by some reports.

The fact that TSMC's 16nm FinFET process is an evolutionary extension of its already highly successful 20nm process to incorporate FinFET technology, rather than the radical technology changes involved in Samsung's 14nm FinFET process, also makes it likely that chip design companies will continue to concentrate on TSMC process technology in the near term out of necessity, with any efforts with Samsung likely to only result in significant production at a much later stage.

Optimistic projections from sources within Samsung widely reported as fact

In an article on October 1, ZDNet (based on an article from its Korean website) quotes a manager from Samsung's LSI division saying that Samsung is likely to improve profits once it achieves volume production for next-generation products for Apple. The source declined to comment about when Samsung would start mass producing such chips for clients. Combining earlier media speculation, the article goes on to state that 14nm production for clients such as Apple, Qualcomm and AMD would start as early as the end of this year. The article also quotes undisclosed sources that Samsung is producing 30% of Apple's A8 processors, with the rest being manufactured by TSMC. The article has been widely quoted in popular news media.

However, there are several reasons to believe that these reports are relatively inaccurate and misleading. First of all, unofficial remarks from sources within Samsung seem to be the only source of information for the article. As mentioned in the article, Samsung is currently incurring very significant losses from its logic (LSI) fabs because of underutilization after losing Apple SoC orders to TSMC. That sources within Samsung (including managers who in fact may hold primary responsibility within Samsung for achieving profitability of the LSI division) would be inclined to paint an over-optimistic picture that may not accurately reflect the current and future market status for production of advanced next-generation designs is not at all surprising.

Apple has explored multiple sources for production of Apple A9

Already in July 2013, an article published by EE Times reported that Apple signed a deal with Samsung to produce the Apple A9 in 2015. This article also illustrates that knowledge of TSMC 20nm production for the Apple A8 in 2014 (as mentioned in the article) was already widespread at this time. However, in June 2013, it was already reported that Apple signed a three-year deal with TSMC not only involving 20nm, but also TSMC's next-generation 16nm FinFET and later 10nm FinFET technologies, with Apple A9 being mentioned. Recently, in August 2014, DigiTimes reported that TSMC had gained production of the Apple A9 using its 16nm FinFET process with significant volume as early as Q1 2015. More recent reports suggest Apple A9 will be manufactured at TSMC but using the same 20nm process as Apple A8.

Based on TSMC's track record and in particular its successful high volume ramp of the Apple A8 using its 20nm process, I believe it is very likely that Apple will focus Apple A9 production, at least for the most significant earlier part of its production cycle, at TSMC. Apple will be able to move to FinFET earlier at TSMC if it chooses to. Because TSMC's 16nm FinFET is to a large extent an evolutionary extension of its 20nm process incorporating FinFET technology, rather than the radical technology change involved in Samsung's 14nm FinFET process, achieving maturity for high volume production is much less of a challenge for TSMC, which makes it unlikely that Samsung will be able to achieve a similar level of maturity in a time-frame that is competitive with TSMC. The fact that qualifying and bringing a similar chip to stable production at Samsung involves substantial additional investment in chip design and testing, as well as associated risks including the timing of such production, will probably even make it attractive for Apple to keep material Apple A9 production at TSMC for its entire life cycle.

Achieving significant production of Apple A8 will be very challenging for Samsung

In addition, the accuracy of the claim that 30% of the production of the Apple A8 is already manufactured by Samsung is highly questionable. Samsung's 20nm process is fundamentally different from that of TSMC in several details, and Apple would have to repeat most of the design/validation cycle that it has already completed for the TSMC version of Apple A8 in order to be able to produce at Samsung's fabs, resulting in very high additional cost, numerous risks, and substantial delays. Moreover, it is doubtful that the production capacity of Samsung at 20nm (which it already uses for certain Exynos chips such as Exynos 5430 and 5433, and even those do not appear to have already ramped in really high volumes) is ramping fast enough to quickly gain material shipments to Apple, especially when Samsung is supposed to be rapidly transitioning to 14nm FinFET.

While it is not unlikely that Samsung has been aggressively seeking to provide capacity for the Apple A8, working with Apple, whether it would be able to achieve material amounts of production before the latter stages of the life cycle of the Apple A8 in 2015, when production levels will already have decreased, is debatable. From Apple's viewpoint, it appears that its relationship with TSMC involves TSMC giving it any level of capacity it needs (to the detriment of competitors who are facing wafer shortages), which makes the apparent benefit for Apple of quickly moving part of the Apple A8 production to Samsung relatively limited. Samsung may offer lower prices for 20nm manufacturing capacity, but as explained earlier, the complexity, cost, time and risk involved in moving Apple A8 production to Samsung make it unlikely that Samsung will be able to gain a significant share of production within a reasonable time-frame.

Comparison of FinFET technologies at Intel, TSMC and Samsung

Recently, ZDNet also published a much more technical and reliable article discussing the status of FinFET technologies of the major fab players, including Intel, TSMC, Samsung and GlobalFoundries.

Intel started production of processors using FinFET technology at 22nm as early as 2011 and has already shipped 500 million such chips, mostly targeted at PCs but also gaining shipments for tablet applications this year. It also offers the technology to other customers as a foundry. Intel has started volume production of its next-generation 14nm FinFET process, which is a "true shrink" with significantly increased transistor density and delivers a combined 1.6x improvement in performance/Watt across applications ranging from smartphones to servers, and will continue to ramp production into 2015.

TSMC's 16nm FinFET development is at an advanced stage

TSMC's first generation 16nm FinFET process, 16FF, was qualified in November 2013 and already saw product tape-outs as early as April 2014. This suggests TSMC's 16nm FinFET process is already close to high volume production. TSMC's 16FF process will be followed up by its 16FF+ process with tape-outs expected in early 2015. While the performance benefits of 16FF are limited due to its similarities (the same back-end metal layers) with TSMC's 20nm process, the 16FF+ process involves a reduction in feature size that makes it competitive with the theoretical performance of 14nm FinFET processes from competitors. TSMC is already in a stage called "risk production" for 15 products using 16nm FinFET this year and another 45 products next year for a variety of applications. Yields are reported to have already reached levels comparable to TSMC's 20nm process. This is not surprising, as TSMC has reported that 95% of the tools used for 20nm can be reused for 16FF, which also brings massive advantages in the required level of investment to ramp capacity and greatly facilitates time-to-market.

TSMC quotes its 16FF+ process as having 15% greater performance when compared to 16FF (40% compared to 20nm) and 30% less power consumption when compared to 16FF. TSMC is already working on 10nm FinFET process technology which involves a more substantial 2.2x increase in transistor density.
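
The two quoted speed figures can be combined to estimate what plain 16FF gains over 20nm. The following is only a rough sketch: it assumes that both percentages refer to speed under comparable conditions and that they compose multiplicatively, which TSMC's marketing figures do not guarantee. The function name is my own.

```python
# Figures quoted by TSMC (see the paragraph above).
FF_PLUS_VS_FF = 1.15    # 16FF+ speed relative to 16FF
FF_PLUS_VS_20NM = 1.40  # 16FF+ speed relative to planar 20nm


def implied_first_step(total_gain: float, second_step_gain: float) -> float:
    """Return the first-step gain when total = first_step * second_step."""
    return total_gain / second_step_gain


# Assumption: the gains multiply, so 16FF vs 20nm = (16FF+ vs 20nm) / (16FF+ vs 16FF).
ff_vs_20nm = implied_first_step(FF_PLUS_VS_20NM, FF_PLUS_VS_FF)
print(f"Implied 16FF vs 20nm speed gain: {(ff_vs_20nm - 1) * 100:.1f}%")  # 21.7%
```

Under these assumptions, roughly half of the quoted 40% gain over 20nm would come from the move to 16FF itself and the rest from the 16FF+ refinements.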

SoCs using Cortex-A57 and Cortex-A53 CPU cores already implement TSMC's 16nm FinFET processes

Although 16FF is seen as a stepping stone to FinFET technology, it does provide performance benefits over planar 20nm. TSMC and ARM have announced that a 16nm test chip using Cortex-A57 and Cortex-A53 cores in a big.LITTLE configuration achieved a sustained 2.3GHz clock rate for the Cortex-A57 core with minimal power consumption of 75 milliwatts achieved for the Cortex-A53 core for common workloads. This demonstration involving a currently relevant SoC design illustrates the relative maturity of TSMC's 16nm technology.

For Cortex-A57, 16FF+ is expected to result in an 11% performance improvement relative to 16FF at the same level of power, while power consumption of the Cortex-A53 for low-intensity applications is reduced by 35%. ARM POP IP core hardening (tweaking cores for either performance or low power consumption) is utilized for early 16FF+ SoC designs. Although TSMC does not specifically address the use of Cortex-A53 at higher clock rates for high performance applications instead of Cortex-A57, the quoted numbers are consistent with the better scaling of Cortex-A53 on new processes when compared to performance-oriented "big-core" Cortex-A57 and cores with a similar architecture.

For example, one can speculate that the significant power reduction for the Cortex-A53 will further significantly increase the maximum clock rate and performance of Cortex-A53 CPU cores, more than the 11% quoted for Cortex-A57, making Cortex-A53-only designs more attractive for high-end applications. Already, early reports about MediaTek's MT6795 octa-core SoC running at about 2.2GHz, the first Cortex-A53-based SoC targeting high performance applications, suggest that it will provide premium-level performance at half the price of current premium-performance SoCs. The chip achieves this despite still using 28nm technology, indicating that Cortex-A53-based high-performance designs using more advanced nodes such as 20nm and 16nm FinFET will be even more revolutionary in terms of performance efficiency.

Samsung development of 14nm FinFET well underway, but maturity for high volume production unclear

Production of the first test chip (using a Cortex-A7 CPU core) on Samsung's first generation 14nm FinFET process, 14LPE, already occurred in December 2013. According to the marketing manager for Samsung’s foundry business, the foundry has completed tape-outs of multiple products and has already started early commercial production for some customers. The 14LPE process is claimed to provide either a 20% boost in performance or a 35% reduction in power consumption when compared to a planar 20nm process. The process is said to result in 15% smaller chips when compared to a 20nm planar process.

Considering the considerable technological changes in Samsung's FinFET process (especially when compared to TSMC's more evolutionary first-generation 16FF process, which is closely aligned with the already almost mature 20nm), the claimed performance and density gains are relatively minor in the context of the high costs and learning curve involved in bringing chips to mature volume production. High theoretical performance of a new process has little value when it involves very high investment in chip design, relatively high manufacturing cost, and when mature volume production is not achieved in a timely manner. A higher performance version of Samsung's 14nm FinFET process, 14LPP, is expected to be qualified in a couple of months' time.

Meanwhile, GlobalFoundries has given up on its own 14XM FinFET process and has aligned with Samsung's 14LPE and 14LPP processes. This decision probably means that it will take considerable time before GlobalFoundries will be competitive for volume production using FinFET, providing evidence that its market position will continue to be precarious for some time.

Conclusion

In summary, indications are that TSMC, helped by its more evolutionary transition to FinFET and dominant position in current leading-edge processes, is much closer to stable high volume production of next-generation FinFET processes than Samsung, and that it will continue to dominate leading-edge foundry production in the near term even as chip designers seek additional sources of supply given the very tight capacity environment at TSMC.

While Intel is also well advanced in its FinFET process development and uses it on a large scale for PC processors, it has not yet seen widespread success either as a foundry partner for third parties or as a provider of large numbers of low-power SoCs for applications such as smartphones. This is also illustrated by the fact that early Intel mobile SoCs such as SoFIA, which integrate cellular baseband and other components, will in fact first be produced at TSMC and not in Intel's own fabs.

Source: ZDNet (Technical article of FinFET technology development), ZDNet (Samsung LSI article), EE Times

Updated October 5, 2014 (Spelling, grammar).
Updated October 30, 2014 (Grammar, small corrections).
Updated December 26, 2014 (Minor grammatical corrections).

Friday, September 26, 2014

New Amazon Kindle tablets use MediaTek SoC -- but will it help MediaTek?

Amazon has introduced two new low-priced tablets for the US market, the Kindle Fire HD 6 and Kindle Fire HD 7, priced at $99 and $149 respectively. Both tablets are expected to be available in October. The new models are reported to feature an unspecified quad-core MediaTek SoC. Some news articles suggest the use of the high-performance (but somewhat inefficient) MediaTek MT8135 SoC, about which little has been heard since its announcement more than a year ago; this would match reports from last year about Amazon using the MT8135 for future models. However, use of the newer and much more cost-effective and power-efficient MT8127 would make much more sense.

A recent tear-down by iFixit however proves that the tablets do use a MT8135V SoC, although the memory interface is limited to a single channel 32-bit configuration compared to the dual channel configuration originally announced for the MT8135. As will be explained below, the use of the MT8135 in low budget devices like the new Kindle models does not make economic sense at all, especially from MediaTek's standpoint. The MT8135, a SoC originally announced for high-end tablets, is relatively expensive (because of a relatively large die area) and not very power-efficient, featuring Cortex-A15 cores and a high-performance PowerVR GPU, while MediaTek's existing MT8127 would have provided clear advantages for cost and power efficiency while still meeting performance goals.

Amazon targeting different segment of the market

The new tablet models are relatively small. The 6" Kindle Fire HD 6 is one of the few tablets of that size, while smartphones of a similar size (sometimes dubbed "phablets") are becoming more popular. Neither tablet model has cellular connectivity; both require a WiFi connection to connect to the internet. The tablets have a very robust design, being considerably thicker than most tablets. There are also versions with a software and accessory package specifically targeted at children.

Amazon uses a customized version of Android KitKat, without access to Google's Play Store and other Google applications, instead focusing on its own Amazon AppStore, with a somewhat different target demographic than higher-priced tablets.

MT8135 Amazon design win reported as early as August 2013, but use of MT8127 would be more economical

Already in August 2013, reports surfaced that Amazon would be using MediaTek's MT8135 in tablets to start shipping in 2014. Amazon has confirmed that a quad-core 1.5GHz MediaTek processor is used in the new models. Current specifications mention two processor cores running at up to 1.5 GHz and two cores at up to 1.2 GHz. The MT8135 was announced more than a year ago as a relatively high-end chip and was originally expected to be commercially available much earlier. It was MediaTek's first chip using ARM's big.LITTLE architecture, using two Cortex-A15 cores clocked up to 1.7GHz and two Cortex-A7 cores.

The MT8127, announced this spring, is based on a proven and efficient quad-core Cortex-A7 CPU configuration and adds a relatively fast GPU (although limited to OpenGL ES 2.0 API support) and is listed with a maximum clock speed of 1.5GHz.

Power efficiency of big.LITTLE MT8135 likely to be problematic

ARM Cortex-A15 cores are notorious for high power consumption, and few Cortex-A15-based SoC designs have been commercially successful for mobile applications (especially smartphones), with problematic heat production and power drain often being reported. Cortex-A15 cores also take up considerably more die area than efficient cores like Cortex-A7 or Cortex-A53, resulting in larger, more costly chips.

Although power consumption and battery life of the Kindle tablets have not yet been tested, battery life specifications by Amazon are the same as for Kindle Fire HD models from previous years. Since the MT8135V is actually used in the new models, maintaining battery life is likely to be a challenge, while if Amazon had actually chosen the MT8127, the devices would most likely have provided much longer battery life.

The case against the use of the MT8135

Even though it has been established that the new Kindle tablets do use a version of the MT8135, several drawbacks are apparent. Although only two Cortex-A15 cores are used in the MT8135 instead of the four present in most existing big.LITTLE designs, a small form factor tablet would most likely not allow a large battery (the Kindle Fire HD 6 in fact has only a 3400 mAh battery, limited by the form factor) and power consumption could be problematic.

The relatively high performance PowerVR Series 6 GPU in the MT8135 should also contribute to high power consumption, for example when playing games, as well as being seemingly overpowered for the relatively low screen resolution since it is heavily oriented towards the use of dual-channel memory interface and a high display resolution.

On the positive side, MediaTek has experience balancing power consumption with its CorePilot technology (for example in octa-core CPUs), although this has not yet been proven for big.LITTLE CPU designs. MediaTek also originally announced its HMP (heterogeneous multi-processing) capability in conjunction with the MT8135, with all four cores being able to run concurrently.

In addition to the relatively large die area of the CPU and GPU (resulting in a relatively large, expensive chip), as well as increased manufacturing cost due to the need to handle potentially high heat production, a hypothetical design using the MT8135 would likely be using a relatively expensive dual-channel memory interface (matching the choice of CPU and GPU), further increasing cost at several levels. However, as it turns out, the new Kindle tablets limit the memory interface to 32 bits in conjunction with the MT8135V SoC used.

Consistent with the cost characteristics of the chip platform, the MT8135 was originally announced as being targeted at the mid-to-high tier of the tablet OEM market. Clearly, this does not match the $99 price of the Kindle Fire HD 6, making the actual use of the MT8135 somewhat silly.

MediaTek already transitioning away from big.LITTLE

MediaTek has also announced a big.LITTLE smartphone platform, the MT6595, using four Cortex-A17 and four Cortex-A7 cores. However, although providing performance competitive with or surpassing current high-end platforms like Snapdragon 801, the MT6595 platform does not appear to have been widely adopted. This makes sense considering the relatively high power consumption associated with the Cortex-A17 CPU cores and the higher cost of the SoC, which make it stand out compared to other MediaTek SoCs, which tend to be low cost and power efficient.

In fact, MediaTek has already announced the MT6795, to be available this year not long after the MT6595, which does away with big.LITTLE and instead uses an efficient octa-core ARM Cortex-A53 configuration, with the other specifications being similar to the MT6595. This provides strong evidence that MediaTek is no longer focusing on big.LITTLE designs, including the MT8135, supporting the case that if the decision had been MediaTek's to make, the new Amazon tablets would not use the MT8135 but instead the newer, much more efficient MT8127.

Good game performance would have been achieved with MT8127 as well

Amazon has demonstrated relatively good performance of the new tablet models, for example when playing games, compared to competitive devices such as certain models from Samsung's Galaxy Tab 4 series. This is not unexpected, since the PowerVR Series 6 GPU in the MT8135 clearly provides high performance.

However, the Mali-450 GPU inside the MT8127 is actually a relatively recent GPU that is significantly faster than the Mali-400 commonly used in entry-level devices, and combined with the modest 1280x800 display resolution of the new Kindle tablets would have given respectable 3D game performance, not far from the performance of the actual MT8135V-equipped models. Although Mali-450 does not support the OpenGL ES 3.x API, OpenGL ES 2.0 continues to dominate, for which Mali-450 provides an efficient implementation (in terms of performance/Watt and performance/dollar).

The MT8127 is clearly a much more cost-effective (and more power-efficient) chip. The MT8127 is likely to be dramatically more cost-effective than the MT8135, with much lower chip cost, much better battery life, and significantly lower manufacturing cost of the PCB and other manufacturing aspects, altogether a much better fit given the price segment of the new tablets.

Although the single-thread CPU performance of the quad-core Cortex-A7-based MT8127 is significantly lower than that of the Cortex-A15-based MT8135, this is not a critical issue in practice, and Android can already take significant advantage of multi-threading with a quad-core processor, mitigating the impact of single-thread performance bottlenecks.

Large-scale production of MT8135 does not make financial sense, unlike MT8127

Given the high manufacturing cost of the MT8135 (especially when compared to much more cost-effective tablet chips from MediaTek like the MT8127), it is unlikely that MediaTek is making much of a profit on the chip even when selling millions of chips to Amazon.

In fact, because MediaTek is likely to be facing a critical shortage of wafer capacity at its foundry TSMC (being squeezed between juggernauts Apple and Qualcomm buying up capacity), the production of the MT8135, with its low profit margin, has probably cannibalized MediaTek profits as well as revenues, because, for example, for each MT8135 sold MediaTek would have been able to sell two or more much more cost-efficient and higher margin chips such as the MT8127 or MT6582.

Indeed, for this reason, the use of the MT8127 inside the new Kindle tablets would have been much more logical. A prior commitment with Amazon for producing and shipping the MT8135, as reported previously in 2013, probably left MediaTek with no other options.

Few signs of financial gain from Amazon design win

As described in an earlier post, MediaTek's sequential revenue growth in Q3 is unlikely to be much greater than 10%, already low considering the seasonal increase normally expected in Q3. This provides additional evidence that MediaTek is severely affected by wafer shortages at TSMC, as well as the late introduction of smartphone SoCs with integrated 4G LTE baseband, and general price pressure on its chips. Although MediaTek is probably shipping millions of MT8135V chips to Amazon, this has probably had the effect of limiting shipments of other, higher-margin MediaTek chips to other customers, because of an inability to fulfill demand. Indeed, tablets using the more cost-effective MT8127 have been very slow to appear on the market, suggesting that MediaTek has been prioritizing tablet processor production of the MT8135V for Amazon because of capacity constraints. So while MediaTek has gained prestige from this design win, the financial gain is likely to be limited or even negative.

Strong prospects for new products, clouded by capacity concerns

Although the performance of MediaTek's upcoming Cortex-A53-based smartphone SoCs is likely to be very competitive and they have been reported to have gained widespread adoption in China for new designs, while also contributing to MediaTek's increasing competitiveness in high-performance segments, recent reports suggest competition for wafer capacity at TSMC will continue to be intense. This brings into question MediaTek's ability to translate any product strength (ranging from new and existing smartphone platforms to tablet chips like the MT8127) into actual sales and profit growth in the near term. If MediaTek continues to be obligated to produce the MT8135V in high volume for Amazon, that will most likely continue to negatively affect MediaTek's sales and profits.

Sources: CNET (Kindle Fire HD 6 and 7 announcement), DigiTimes (2013 MediaTek Amazon Kindle article), MediaTek (MT8127 announcement press release), iFixit tear-down article

Updated October 24, 2014 (Update to reflect the fact that the tablets actually do use the MT8135V SoC).
Updated November 2, 2014.

Monday, September 22, 2014

Early test results suggest Cortex-A53 will revolutionize performance, cost and efficiency across all segments

The ARM Cortex-A53 is a very small and power-efficient in-order-pipeline CPU core that is the successor to the similar and very successful ARM Cortex-A7 core. Although Cortex-A53 supports the 64-bit ARMv8 instruction set (as well as having full compatibility with 32-bit ARMv7), it can also take advantage of the 32-bit version of ARMv8 with its architectural improvements, and it has other significant internal architectural improvements leading to increased performance on current leading-edge process nodes compared to Cortex-A7. Although also used as a power-efficient core in combination with ARM's high-performance Cortex-A57 core in a big.LITTLE configuration for high-end designs, Cortex-A53 cores have also been widely adopted as a stand-alone CPU in leading smartphone and other mobile SoC designs, with the first designs currently starting to appear in commercially available devices. Upcoming Cortex-A53-based designs span virtually the whole performance spectrum from entry-level to premium devices.

Early benchmarks show strong performance of the Cortex-A53 core, especially for the latest revisions

Early evidence of the performance of new SoCs exclusively using ARM Cortex-A53 processor cores, based on recent entries in Geekbench's result database, suggests that the performance improvement of Cortex-A53 compared to Cortex-A7 at an equivalent clock speed, especially when running with the 32-bit ARMv8 machine model as implemented in Android 4.4.4, may be greater than originally expected.

There is evidence that several revisions of the Cortex-A53 core already exist, including the original r0p0, the r0p1 and the r0p2 revision (with r0p3 also being listed on ARM's website). Although these are minor revisions that do not significantly alter the IP blocks, the later revisions seem to be associated with significant performance improvements when compared to earlier revisions, possibly because of the correction of functional or performance bugs in earlier revisions. In particular, r0p0 revision devices such as the first incarnation of Snapdragon 410 (MSM8916) appear to be limited to ARMv7 compatibility mode, while SoCs with later revisions appear to be configured with support for the 32-bit version of the ARMv8 instruction set (AArch32) in association with Android 4.4.4.

Full 64-bit ARMv8 machine model not likely to be of great benefit on mobile devices

The full 64-bit ARMv8 instruction set (AArch64) as supported by Cortex-A5x is not yet supported in Android, and there are reasons to believe that using it might not result in much benefit in today's devices over AArch32. For example, much of the benefit of the new ARMv8 instruction set is already delivered by AArch32, and actual use of 64-bit registers/variables and operations on them is relatively uncommon in program code (this is true of most program code, including typical code executed using the x86_64 instruction set used in PCs and Atom-based mobile devices). Additionally, the ARMv7 instruction set (and AArch32) already contains some instructions that operate on 64-bit values, which can be conveniently taken advantage of for these uncommon cases, without requiring the use of the full 64-bit ARMv8 instruction set.
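
To illustrate how 64-bit values are handled with 32-bit registers, here is a small Python model of two well-known 32-bit ARM idioms: the UMULL instruction, which multiplies two 32-bit registers into a 64-bit result held in a register pair, and the ADDS/ADC instruction pair, which adds two 64-bit values using the carry flag. This is only a behavioural sketch with function names of my own choosing, not actual ARM code, and it says nothing about instruction timing.

```python
MASK32 = 0xFFFFFFFF


def umull(rn: int, rm: int) -> tuple[int, int]:
    """Model of UMULL: unsigned 32x32 -> 64-bit multiply.

    Returns the 64-bit product as a (low word, high word) register pair.
    """
    product = (rn & MASK32) * (rm & MASK32)
    return product & MASK32, product >> 32


def add64(a_lo: int, a_hi: int, b_lo: int, b_hi: int) -> tuple[int, int]:
    """Model of an ADDS/ADC pair: 64-bit add using two 32-bit additions."""
    lo_sum = a_lo + b_lo                     # ADDS: add low words, produce carry
    carry = lo_sum >> 32
    hi = (a_hi + b_hi + carry) & MASK32      # ADC: add high words plus carry
    return lo_sum & MASK32, hi


print([hex(x) for x in umull(0xFFFFFFFF, 0xFFFFFFFF)])  # ['0x1', '0xfffffffe']
print(add64(0xFFFFFFFF, 0x1, 0x1, 0x0))                 # (0, 2)
```

In both cases the 64-bit operation costs only one or two 32-bit instructions, which is why occasional 64-bit arithmetic is not a strong argument for a full 64-bit machine model.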

Moreover, in the ARM world, data processing algorithms that might benefit from 64-bit processing are often better served by using ARM's NEON SIMD extension, which is also available on AArch32 and most ARMv7-A devices.

Although AArch64 makes memory management more flexible by extending the addressing space beyond 4 GB, the doubling of the storage size of all pointers (memory addresses) from 32 bits to 64 bits negatively impacts performance because of greater code and data memory usage, to which mobile SoCs, given their relatively small internal buffers, cache memories and RAM, are especially sensitive. LPAE (Large Physical Address Extension) support already allows 32-bit ARM machine models to take advantage of a larger addressing space, reducing the necessity of switching to a full 64-bit model.
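
The effect of pointer size on data memory usage can be made concrete with a small calculation. The sketch below computes the size of a hypothetical pointer-heavy data structure (a doubly linked list node with a 32-bit payload) under typical C-style alignment rules; the node layout and function names are assumptions chosen for illustration, and real-world impact depends heavily on how pointer-heavy an application's data is.

```python
def padded_size(field_sizes, struct_alignment):
    """Size of a C-like struct with fields laid out in declaration order.

    Each field is aligned to its own size (capped at struct_alignment), and
    the total is rounded up to struct_alignment, as typical ABIs do.
    """
    offset = 0
    for size in field_sizes:
        align = min(size, struct_alignment)
        offset = (offset + align - 1) // align * align  # pad up to field alignment
        offset += size
    return (offset + struct_alignment - 1) // struct_alignment * struct_alignment


def list_node_size(pointer_bytes):
    # Hypothetical node: int32 payload, then "prev" and "next" pointers.
    return padded_size([4, pointer_bytes, pointer_bytes], struct_alignment=pointer_bytes)


size32 = list_node_size(4)  # 32-bit pointers
size64 = list_node_size(8)  # 64-bit pointers
print(size32, size64)       # 12 24
```

For this particular layout the node doubles in size, partly because of the larger pointers and partly because of the padding needed to align them, so half as many nodes fit in the same amount of cache.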

32-bit version of ARMv8 instruction set brings benefits

Android support for the 32-bit version of ARMv8 is a very recent development, taking advantage of new ARMv8 instructions that improve performance, and probably also the architectural changes in ARMv8 (such as the removal of the optional conditional predication of instructions present in ARMv7-A) that benefit modern CPU cores such as Cortex-A53 and Cortex-A57. Geekbench takes advantage of the new machine model, and the majority of Android applications, largely consisting of device-independent Java code that is translated into machine code on demand, is also likely to benefit. However, to what extent ARMv7-A native code, which is commonly used in applications that require more CPU processing, is affected by the new machine model is unclear.

SoC-specific CPU optimizations are common, but impact power consumption more than speed

Variation between different implementations of Cortex-A53 cores at a similar process node can also occur because of core hardening optimization in the SoC design. This can involve trading performance for power efficiency and vice versa, although it should not in principle affect metrics such as IPC (instructions per cycle) or indeed Geekbench CPU scores as long as they do not depend on factors outside of the CPU core such as a more extensive memory footprint. However, apart from L2 CPU cache memory size, CPU cache latencies may also be configurable through core hardening, and the latter may impact even small memory footprint benchmarks, including CPU tests used in Geekbench.

Geekbench result round-up for smartphone SoCs, including new designs using Cortex-A53

[Table image: summary of Geekbench results for smartphone SoCs]

The table above shows a summary of Geekbench results for smartphone models using popular smartphone SoCs, as well as new smartphone SoCs using Cortex-A53 cores. Note that the MSM8939 entry in the table is incorrectly labeled as Snapdragon 610; it actually represents an early version of Snapdragon 615.

The results were gathered after examining the range of benchmark results for a common SoC and CPU clock frequency configuration (which tends to include numerous lower-than-expected scores, probably mostly due to background CPU activity when running the benchmark or the effects of CPU throttling), and choosing a representative result close to the high end of the range, while trying to make sure the result is not an outlier or giving indications of overclocking. As much as possible, entries using the most recent version of Geekbench (3.2.1 or 3.2.0) and of the underlying Android version (preferably 4.4.x) were selected.
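
The selection was done by hand, but the idea can be approximated in a few lines. The sketch below is not the exact procedure used for the table: it assumes that taking the result at a high percentile of the sorted scores is a reasonable stand-in for manual judgement, since it skips both the long tail of low scores and the few highest (possibly overclocked) results. The scores and the function name are hypothetical.

```python
def representative_score(scores, percentile=90):
    """Return an actual reported score near the high end of the range.

    Assumption for illustration: the entry at the given percentile stands in
    for manual selection, avoiding the low tail (throttling, background
    activity) and the very top results (possible overclocking).
    """
    ordered = sorted(scores)
    index = percentile * (len(ordered) - 1) // 100  # integer arithmetic only
    return ordered[index]


# Hypothetical single-core scores for one SoC at one clock frequency.
scores = [512, 598, 640, 655, 660, 663, 665, 668, 670, 671, 820]
print(representative_score(scores))  # 671
```

In this made-up example the 820 result is ignored as a probable outlier, while the chosen value of 671 sits at the top of the cluster of plausible results.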

While the Integer and Float scores reported in the table are likely to be closely tied to the processor core, SoC and the clock frequency used, the memory score and overall score depend on the external memory implementation and speed and other factors related to a particular device model.

Analyzing Geekbench performance of existing SoCs

Looking at previous generation SoCs, among SoCs with a quad-core Cortex-A7 CPU configuration, based on Geekbench results, MediaTek SoCs are very competitive against Qualcomm SoCs long considered mid-range such as a 1.2 GHz Snapdragon 400. For example, MediaTek's MT6582, despite usually being found in much cheaper (often entry-level) devices than Snapdragon 400, is quite competitive. Samsung's Exynos 3470, used in the Galaxy S5 Mini, appears to be the worst performer in this class in terms of performance per MHz.
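
Performance per MHz here simply means the benchmark score divided by the maximum CPU clock frequency, which allows SoCs running at different clock speeds to be compared on the efficiency of the core implementation. A minimal sketch follows; the chip names and scores are placeholders and are not taken from the table.

```python
def score_per_mhz(score, clock_mhz):
    """Normalize a benchmark score by CPU clock frequency."""
    return score / clock_mhz


# Hypothetical (single-core score, clock in MHz) pairs, NOT actual results.
chips = {
    "SoC A": (360, 1200),
    "SoC B": (416, 1300),
    "SoC C": (392, 1400),
}

# Rank from best to worst per-clock performance.
ranked = sorted(chips, key=lambda name: score_per_mhz(*chips[name]), reverse=True)
for name in ranked:
    print(name, round(score_per_mhz(*chips[name]), 3))
```

Note that in this example the chip with the second-highest raw score (SoC C) ends up last once the clock frequency is taken into account.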

Looking at higher performance SoCs, the octa-core MT6592 holds the middle ground based on strong multi-core CPU performance (with memory performance being a relative bottleneck), while Qualcomm's Snapdragon 801/805 are a clear step up, especially in terms of single-thread and memory performance. Snapdragon 805 appears to be very similar to Snapdragon 801 in terms of CPU architecture, with very similar performance at the same clock speed, and its CPU is reported by Geekbench as basically a major version bump of the Krait-400 core used in the Snapdragon 801, although Qualcomm describes the CPU cores inside Snapdragon 805 as Krait-450. Exynos 5430 provides a similar level of performance, but the power efficiency of the latter may be in doubt.

The following Geekbench model names associated with entries using existing SoCs were used for performance comparisons. A link to the results page used for each model is provided.

Qualcomm MSM8226 (Snapdragon 400) (Cortex-A7r0p3): HTC HTC Desire 610 (Geekbench 3.2.1 ARMv7, Android 4.4.2)
Samsung Exynos 3470 (Cortex-A7r0p3): samsung SM-G800F (Geekbench 3.2.1 ARMv7, Android 4.4.2)
MediaTek MT6582 (Cortex-A7r0p3): HUAWEI H30-U10 (Geekbench 3.2.1 ARMv7, Android 4.4.2)
MediaTek MT6589T (Cortex-A7r0p2): LENOVO Lenovo S960 (Geekbench 3.2.0 ARMv7, Android 4.4.2)
Qualcomm MSM8226 (Snapdragon 400) (Cortex-A7r0p3): HTC HTC Desire 816 dual sim (Geekbench 3.2.1 ARMv7, Android 4.4.2)
MediaTek MT6592 (Cortex-A7r0p4): LENOVO Lenovo A806 (Geekbench 3.2.1 ARMv7, Android 4.4.2)
Qualcomm MSM8974AC (Snapdragon 801): Motorola Moto X (2014) (Geekbench 3.2.1 ARMv7, Android 4.4.4)
Samsung Exynos 5430: samsung SM-G850F (Geekbench 3.2.1 ARMv7, Android 4.4.4)
Qualcomm APQ8084 (Snapdragon 805): samsung SAMSUNG-SM-N910A (Geekbench 3.2.1, Android 4.4.4)

Performance of new Cortex-A53-based SoCs

Qualcomm's first generation 1.2 GHz Snapdragon 410 (MSM8916), with four Cortex-A53r0p0 cores, has higher performance than a similarly clocked Snapdragon 400, although not dramatically so. A faster clocked Snapdragon 410 prototype (with MSM8916_32 SoC) with a later revision of the Cortex-A53 core shows a clear improvement in Geekbench Integer Performance over the previous Snapdragon 410 when adjusting for the clock rate. However, this is in large part due to the availability of the AArch32 instruction set in the newer device, allowing Geekbench to take advantage of new cryptography instructions that greatly speed up certain subtests that are part of the Integer benchmarks.

MediaTek's upcoming MT6752 with an octa-core configuration of the more recent r0p2 revision of the Cortex-A53 core shows impressive performance, with the caveat that this is based on a single reported benchmark score of a prototype device. Overall integer performance as reported by Geekbench is especially impressive, being comparable to Snapdragon 801 for single-thread performance and blowing past it in terms of multi-core performance. However, the use of AArch32 is likely to inflate the overall Integer scores relative to typical performance in practice because of the relatively large influence of new cryptography instructions available with AArch32 on Geekbench's Integer Performance scores, although other benefits of AArch32 are also apparent. Memory efficiency also appears to be significantly improved when compared to previous generation Cortex-A7-based devices. Despite relatively high performance, the MT6752 is likely to be power-efficient and very cost-effective, due to the characteristics that the Cortex-A53 core has inherited from Cortex-A7.

The following Geekbench model names associated with entries using a SoC with Cortex-A53 cores were used for performance comparisons. A link to the results page used for each model is provided.

Qualcomm MSM8916 (Snapdragon 410) (Cortex-A53r0p0): HTC Desire 510 (Geekbench 3.2.1 ARMv7, Android 4.4.3)
Qualcomm MSM8916_32 (Snapdragon 410) (Cortex-A53r0p1): unknown msm8916_32 (Geekbench 3.2.1 AArch32, Android 4.4.4)
Qualcomm MSM8939 (Snapdragon 615) (Cortex-A53r0p1): HTC HTC 0PFJ1 (Geekbench 3.2.0 AArch32, Android 4.4.4)
MediaTek MT6752 (Cortex-A53r0p2): alps k2v1 (Geekbench 3.2.1 AArch32, Android 4.4.4)
Samsung Exynos 5433 (Cortex-A57r1p0 + Cortex-A53): samsung SM-N910C (Geekbench 3.2.0 AArch32, Android 4.4.4)

Cortex-A53 blows Cortex-A57 away in terms of efficiency

Samsung's new Exynos 5433, the first SoC with publicly disclosed Cortex-A57 cores, sets a new high mark for single-thread performance, being considerably faster than Snapdragon 801, but surprisingly finds itself beaten on multi-core integer performance in early results for the MT6752, a mid-range SoC. Both devices use AArch32, so the relatively heavy weighting of new AArch32 cryptography instructions by Geekbench is not as important as when comparing with previous generation devices.

Exynos 5433 contains four Cortex-A53 cores in addition to the four Cortex-A57 cores in a big.LITTLE configuration. More detailed examination of the benchmark results (specifically, primarily CPU-bound subtests such as JPEG Compress) provides evidence that the Cortex-A53 cores do contribute to multi-core performance: the multi-core performance scaling factor is 4.46, whereas about 4.0 would be expected if only the Cortex-A57 cores were utilized. This suggests that Global Task Switching (allowing all eight cores to run concurrently) is working, although it does not provide a great boost in overall processing performance; the more significant benefits are likely to be in overall power efficiency and CPU scheduler efficiency.
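
The scaling factor quoted here is simply the ratio of the multi-core score to the single-core score of a CPU-bound subtest. A minimal Python sketch of that calculation follows; the two scores are placeholder values chosen only to reproduce the 4.46 ratio and are not actual entries from the Geekbench database.

```python
def scaling_factor(single_core_score: float, multi_core_score: float) -> float:
    """Ratio of multi-core to single-core score for one subtest."""
    if single_core_score <= 0:
        raise ValueError("single-core score must be positive")
    return multi_core_score / single_core_score


# Placeholder scores (hypothetical, NOT taken from the Geekbench database).
single = 1000.0
multi = 4460.0

factor = scaling_factor(single, multi)
big_cores = 4  # number of Cortex-A57 cores in Exynos 5433

# A factor above the number of big cores suggests that the LITTLE
# (Cortex-A53) cores are adding throughput on top of the big cluster.
little_cores_contribute = factor > big_cores

print(f"scaling factor: {factor:.2f}")
print(f"LITTLE cores contribute: {little_cores_contribute}")
```

A factor only slightly above the big-core count, as here, is consistent with the small cores helping but not by much.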

It has to be noted that the MT6752, which closes in on the performance of a high-end design like Exynos 5433, is a mid-range chip with a cost-effective 32-bit memory interface, and is likely to be considerably cheaper and much more power-efficient than Exynos 5433 and other high-end platforms, dramatically illustrating the great efficiency of Cortex-A53-based SoCs against the relative inefficiency of Cortex-A57. Cortex-A57 provides superior single-thread performance, but compares poorly in terms of performance/dollar and performance/Watt. High performance Cortex-A53 designs such as MediaTek's upcoming octa-core MT6795 (which is targeting a higher clock frequency and has a premium dual-channel memory interface) are likely to make the comparison even more compelling.

Low-power Cortex-A53 has significant advantages related to performance scaling and thermal restrictions

Key to this development is the apparent tendency of in-order pipeline cores such as Cortex-A53 (and previously Cortex-A7) to show much greater performance scaling on new, more advanced process nodes, primarily because of much greater increases in maximum clock speed. For example, clock speed increase has been limited for SoCs with high-performance CPU cores in the same class as Cortex-A57 (generally out-of-order pipeline, speculative issue architectures with a large die size) such as Exynos models with Cortex-A15 and Apple A7/A8 with Cyclone, despite the transition to 20 nm manufacturing.

In addition, practical performance of Cortex-A53 is likely to be much less affected by CPU throttling (periodic reduction of the CPU clock speed because of the temperature increasing beyond a certain threshold in order to maintain stability), thanks to the power efficiency of Cortex-A53, which may aid actual performance in practice more than is apparent from the results of common CPU benchmarks.

Finally, the current comparison of Cortex-A53 with Cortex-A57 as implemented in Exynos 5433 is not apples-to-apples because Exynos 5433 is manufactured at 20 nm, with significant associated performance benefits, while Cortex-A53-based devices (which for the moment are mostly targeted at cost-sensitive applications) are still manufactured at 28 nm. Although there is as yet not much information about how Cortex-A53 will scale on 20 nm, I believe there is potential for additional performance scaling that could be disruptive in terms of performance and efficiency advantages when compared to high-performance cores like Cortex-A57.

Comparison of Cortex-A53 CPU core revisions

Cortex-A53r0p0 (part 3331, variant 0, revision 0 as reported by Geekbench) is the first revision. This appears to be the version used in a quad-core configuration in MSM8916, the first generation of Qualcomm's Snapdragon 410, which is the first Cortex-A53-based SoC to be commercially available in devices such as HTC Desire 510 and several currently ramping devices, including Samsung Mega 2 (SM-G7508Q) and Samsung Galaxy A5 (SM-A500F). The clock speed is typically set at 1.19 GHz. Devices using this chip appear to be limited to ARMv7, not being able to take advantage of the 32-bit ARMv8 (AArch32) instruction set. As early as July 1, 2014, Qualcomm's Android for MSM Project stopped providing support for this SoC for new Android versions, with the latest supported version being Android 4.4.3.

Cortex-A53r0p1 (part 3331, variant 0, revision 1 as reported by Geekbench) is the second revision. It is used in a Qualcomm prototype device result reported as MSM8916 or MSM8916_32 (a chip designation similar to that of already shipping devices using a Snapdragon 410 with the first revision of Cortex-A53), equivalent to a SoC referred to by Qualcomm as MSM8916_32, running at a higher maximum clock rate (1.54 GHz vs 1.19 GHz) and showing a significant additional performance improvement beyond that expected from the clock speed increase alone. The combined Geekbench integer performance score for r0p1 is about 30% higher for single and multi-core performance than r0p0 at the same clock speed. That is largely the result of the cryptography instructions enhancement offered by AArch32, but other improvements are also apparent. Floating point performance remains about the same. Memory performance may also be higher, but that also depends on the speed of the memory used in the tested devices.
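
To separate the effect of the clock speed increase from the per-clock improvement, scores can be normalized by clock frequency. The Python sketch below is illustrative only: it uses the clock speeds and the roughly 30% per-clock integer gain mentioned in this article, and the function names are my own, not part of any Geekbench tooling.

```python
def expected_raw_gain(clock_old_ghz: float, clock_new_ghz: float,
                      per_clock_gain: float) -> float:
    """Raw score ratio expected from a clock increase combined with a
    per-clock (same-frequency) performance ratio such as 1.30."""
    return (clock_new_ghz / clock_old_ghz) * per_clock_gain


def per_clock_gain_from_scores(score_old: float, clock_old_ghz: float,
                               score_new: float, clock_new_ghz: float) -> float:
    """Inverse direction: per-clock ratio derived from two raw scores."""
    return (score_new / clock_new_ghz) / (score_old / clock_old_ghz)


# Clock speeds from the article: r0p0 at 1.19 GHz, r0p1 at 1.54 GHz.
clock_ratio = 1.54 / 1.19
raw_gain = expected_raw_gain(1.19, 1.54, 1.30)

print(f"gain from clock speed alone: {clock_ratio:.2f}x")
print(f"expected raw integer score gain: {raw_gain:.2f}x")
```

With these inputs, the clock increase alone accounts for roughly a 1.29x gain, and the combined raw integer score gain comes out at roughly 1.68x.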

Cortex-A53r0p1 is also used in Qualcomm's octa-core MSM8939 (Snapdragon 615), which has two clusters of four Cortex-A53 cores, one running at a higher and the other at a lower speed. Geekbench results for an HTC prototype using this chip (running at a maximum CPU speed of 1.34 GHz) are consistent with the performance per MHz found in the r0p1-based MSM8916_32. The gains in multi-core performance over the quad-core chips suggest that the device supports heterogeneous multi-processing (also called Global Task Switching), allowing all eight processor cores to run simultaneously, although the gain is significantly lower than what would be expected when all CPU cores are fully utilized (even allowing for a relatively low CPU speed of the second cluster, say 0.7 GHz).
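
As a rough reference point for what "fully utilized" would mean, the ideal multi-core scaling of an asymmetrically clocked chip can be estimated by weighting each core by its clock speed relative to the fastest cluster. The Python sketch below assumes performance scales linearly with clock speed and ignores memory and scheduling overhead; the 0.7 GHz figure for the second cluster is the same guess used above, not a measured value.

```python
def ideal_scaling(clusters: list[tuple[int, float]]) -> float:
    """Ideal multi-core scaling relative to one core of the fastest cluster.

    clusters: list of (core_count, clock_ghz) pairs.
    Assumes identical cores whose throughput is proportional to clock speed.
    """
    if not clusters:
        raise ValueError("at least one cluster is required")
    fastest = max(clock for _, clock in clusters)
    return sum(count * clock / fastest for count, clock in clusters)


# MSM8939: four cores at 1.34 GHz plus four cores at an ASSUMED 0.7 GHz.
msm8939_ideal = ideal_scaling([(4, 1.34), (4, 0.7)])

# Eight identical cores at the same clock speed, for comparison.
symmetric_ideal = ideal_scaling([(8, 1.69)])

print(f"MSM8939 ideal scaling: {msm8939_ideal:.2f}")
print(f"symmetric octa-core ideal scaling: {symmetric_ideal:.2f}")
```

Under these assumptions the asymmetric configuration tops out at about 6.1 times single-core performance, so a measured gain well below that indicates the second cluster is not being fully exploited.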

Cortex-A53r0p2 (part 3331, variant 0, revision 2 in Geekbench) appears to be the latest revision of the Cortex-A53 that has been implemented in SoCs. A benchmark result for a device based on MediaTek's upcoming mid-range octa-core MT6752 SoC provides evidence for the existence of this core. The CPU cores are clocked at 1.69 GHz, and the benchmark results are impressive, helped by the ability of the eight cores to run concurrently at full speed. Integer and floating point performance when corrected for clock speed appear to be slightly improved over the previous r0p1 revision, based on single-core performance, although this could also be due to characteristics of the SoC.

Multi-core performance of the r0p2-based MT6752 is very impressive, although not quite scaling linearly with the doubling of the number of cores. Multi-core performance does appear to be scaling significantly better than the asymmetrically clocked cores in the Snapdragon 615 prototype, even when allowing for a very low clock speed of the second cluster of the latter. This is not unexpected because multi-threading, especially in a benchmark, is likely to be significantly more efficient when dealing with equivalently-clocked CPU cores.

Memory performance of the MT6752 test device is impressive for its class, with a significant increase over the Cortex-A53r0p1-based Qualcomm SoCs, and being dramatically higher than existing designs that also utilize an economical 32-bit memory interface. Although higher-clocked memory is likely a factor, data rate and memory controller improvements in the r0p2 revision of the Cortex-A53 core are likely to be more significant. ARM has alluded to improvements in the memory subsystem and data rates in Cortex-A53, which may be more fully realized in the r0p2 revision and its implementation in the MT6752 SoC.

Other new ARM IP technology contributes to performance and efficiency improvement

The Cortex-A53 has become available together with other IP products from ARM that improve performance and efficiency. These include a faster and more efficient interconnect bus, compression and other data rate reduction techniques such as ARM Frame Buffer Compression (AFBC), Smart Composition, and Transaction Elimination, and new Mali GPU cores (such as Mali-T760 and Mali-T72x) which together have the potential to dramatically improve performance and especially power consumption for graphics-related tasks (including typical device use), while also alleviating the memory bandwidth bottleneck in cost-sensitive devices with a limited memory subsystem, such as the 32-bit external memory interface used in most entry-level to mid-range mobile devices.

Favourable comparison with existing high-performance designs

Judging from these early benchmark results, an octa-core Cortex-A53 can achieve performance rivalling existing high-end platforms such as Snapdragon 801 in several metrics. The test results of the MT6752-based device show a dramatically higher Geekbench multi-core integer performance score when compared to Snapdragon 801, with single-core integer performance being similar. However, the scores are inflated due to the heavy weighting of new cryptography instructions available with MT6752's support for AArch32, although in general AArch32 is likely to bring benefits for most applications. Multi-core floating point performance is also higher. Single-core floating point and memory performance are clearly lower than Snapdragon 801, although not dramatically so. Nevertheless, considering that the MT6752 is expected to use only a 32-bit memory interface, its memory performance is very impressive, being a large improvement over existing devices with a 32-bit memory interface.

The strong "premium level" performance of devices like the MT6752 is associated with a dramatically decreased chip manufacturing cost when compared to existing high-end SoCs such as Snapdragon 801. The Cortex-A53 cores, even in an octa-core configuration, are likely to be significantly smaller than out-of-order high-performance cores such as the Krait-400 cores used in the Snapdragon 801, resulting in chips with a much smaller die size (similar comparisons can be made with ARM's high-performance cores such as Cortex-A1x and Cortex-A57). Power consumption is also likely to be dramatically improved.

Revolution on the cards for performance, cost and power efficiency

Coupled with the cost reductions allowed by the 32-bit memory interface (as compared to the 64-bit or 32-bit dual-channel interfaces of existing high-end devices), Cortex-A53 appears set to bring a revolution in performance/dollar and performance/Watt for high-performing devices. At the same time, lower-end devices (using, for example, a quad-core Cortex-A53r0p2 configuration) will see dramatic performance improvement.

When Cortex-A53 cores are combined with other high-end features such as a wider memory interface and a high-performance GPU (such as implemented in MediaTek's upcoming MT6795), there is potential to further close in on or even surpass the performance of existing premium-level architectures, with greatly increased (power) efficiency and reduced cost. Although single-thread performance is not likely to quite reach the level of existing premium devices, other metrics (including multi-core performance, power consumption and cost) are likely to see a dramatic improvement. Early reports already indicate that SoCs such as the MT6795 will be disruptive in terms of cost and efficiency for high-performance mobile applications.

In conclusion, the emergence of Cortex-A53-based designs and associated IP is likely to revolutionize performance, cost and efficiency in mobile devices, bringing higher performance to cost-sensitive entry-level and mid-range devices, reducing cost for high-end devices while also improving the performance of premium devices with much greater efficiency and reduced cost.

Sources: Geekbench result database, ARM, EE Times (Comments about adoption of MT6795)

Updated September 28, 2014 (Fix revision designations of Cortex-A53 based on feedback; revisions reported by Geekbench are minor revisions of major revision r0, as in r0pN).
Updated September 30, 2014 (Use more representative benchmarks for some SoCs, provide information about Geekbench and Android version as well as a weblink for all tabulated benchmark results, discuss merits of different ARMv8 instruction set models, make note of cryptography instructions in AArch32 inflating Geekbench Integer Performance, and other improvements).
Updated October 3, 2014 (Include early reports about octa-core Cortex-A53 MT6795 adoption for high-performance devices).
Updated November 13, 2014 (Correct information about effectiveness of GTS on Exynos 5433).
Updated December 2014 (MSM8939 is Snapdragon 615, not Snapdragon 610).

To do: Cortex-A53 Geekbench scores are likely to be inflated because of support for 32-bit ARMv8 mode in the most recent versions of Geekbench, which enables the use of cryptography instructions that significantly increase the scores of certain subtests of the Geekbench CPU Integer performance tests, while not accurately reflecting the CPU performance increase for most applications. This will be further investigated in the near future. As I did in subsequent blog posts, concentrating on Geekbench subtests that better represent integer CPU performance, such as the JPEG Compress test, rather than on the overall integer performance scores, should give a much better picture.

Index

By Subject