Tuesday, December 30, 2014

Early benchmarks for Snapdragon 810 show performance flaws

Recently, reports have surfaced, including one from BusinessKorea published on December 4, about Qualcomm's new high-end chip, Snapdragon 810, being affected by performance issues related to heat production and problems with the memory controller. Subsequently, Geekbench results for some Samsung prototype devices using the SoC (MSM8994) have also appeared in the Geekbench results database. Detailed analysis of the Geekbench results seems to confirm the issues with thermal throttling and especially memory controller performance, at least in the early revision of the SoC that was used to obtain the mentioned benchmark scores, resulting in sub-par performance for its segment.

Updated (January 5, 2015): A section has been added discussing new Geekbench results from an LG G Flex2 prototype using Snapdragon 810, which show improvement in some areas.

Snapdragon 810: A departure from Qualcomm's in-house Krait cores

For a long time, Qualcomm has used its own ARM-compatible Krait cores (most recently Krait-400/450 in Snapdragon 801/805) for SoCs targeting the performance segment. However, with Snapdragon 810 (as well as Snapdragon 808 and to a certain extent Snapdragon 615), Qualcomm seems to be migrating to standard ARM cores for performance-oriented SoCs. Some time ago, Qualcomm already transitioned its cost-effective SoCs (such as the Snapdragon 200 and 400 series) to cost efficient ARM cores such as Cortex-A7 (and later Cortex-A53).

Snapdragon 810 contains four Cortex-A57 cores (clocked up to about 1.5 GHz based on current evidence) as well as four Cortex-A53 cores in a big.LITTLE configuration. In this respect the chip is similar to Samsung's Exynos 7 Octa (5433) that has already been shipping for several months in devices such as the Galaxy Note 4 and shows impressive CPU performance. However, Snapdragon 810 is the direct successor to Snapdragon 805 and has a similarly ambitious memory interface with high total bandwidth (pioneering the use of new LPDDR4 SDRAM), which puts it squarely in the very high end category, like Snapdragon 805.

Qualcomm also has a SoC in planning for the more mainstream part of the high-end performance segment, Snapdragon 808, which has two Cortex-A57 cores instead of four while retaining the four Cortex-A53 cores. Importantly, Snapdragon 808 also simplifies the memory interface to dual-channel 32-bit with more standard LPDDR3 memory instead of LPDDR4, reducing cost and being comparable to Snapdragon 801, the current high-end standard.

20nm process and LPDDR4 memory

Snapdragon 810 is Qualcomm's first SoC product to be manufactured using TSMC's 20nm process technology. 20nm, in theory, significantly increases performance and power efficiency when compared to the 28nm process technology that Qualcomm has been using recently for most of its chips.

The SoC also features an LPDDR4 external memory interface in a dual-channel 32-bit configuration, with a maximum clock speed of 1600 MHz according to Qualcomm's webpage, resulting in memory bandwidth of 25.6 GB/s. This is similar to Snapdragon 805, which achieves its bandwidth with a wide 64-bit dual-channel memory interface with LPDDR3. This is a very high amount of memory bandwidth for a mobile device, making the chip suitable for driving very high resolutions such as QHD. However, the apparent requirement of using higher-clocked LPDDR4 memory instead of mainstream LPDDR3 is likely to increase cost, despite the reduction in memory bus width allowed by LPDDR4.

Snapdragon 808 likely to be more attractive for high-volume flagship devices

Meanwhile, Snapdragon 808 seems to provide a more practical performance-oriented platform by utilizing standard LPDDR3 in a dual-channel 32-bit at a clock speed up to 933 MHz, resulting in maximum memory bandwidth of 14.9 GB/s. Overall, Snapdragon 808 seems to be much more attractive for high-volume high-end devices as a successor to Qualcomm's popular Snapdragon 801.
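
The bandwidth figures quoted for these chips follow from the bus width and memory clock. As a rough sketch (the function name is hypothetical; it assumes double data rate transfers, i.e. two transfers per clock, and 1 GB = 10^9 bytes; the 800 MHz LPDDR3 clock used for Snapdragon 805 is an assumption not stated in this article):

```python
def peak_bandwidth_gbs(channels: int, bits_per_channel: int, clock_mhz: float) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes).

    Assumes DDR signalling: two transfers per clock cycle.
    """
    transfers_per_second = clock_mhz * 1e6 * 2
    bytes_per_transfer = channels * bits_per_channel / 8
    return transfers_per_second * bytes_per_transfer / 1e9


# Snapdragon 810: dual-channel 32-bit LPDDR4 at 1600 MHz
print(round(peak_bandwidth_gbs(2, 32, 1600), 1))  # 25.6
# Snapdragon 805: dual-channel 64-bit LPDDR3, assumed 800 MHz
print(round(peak_bandwidth_gbs(2, 64, 800), 1))   # 25.6
# Snapdragon 808: dual-channel 32-bit LPDDR3 at 933 MHz
print(round(peak_bandwidth_gbs(2, 32, 933), 1))   # 14.9
```

These are theoretical peaks; the throughput actually measured by a benchmark such as Stream Copy will be lower.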

Performance flaws evident in early Geekbench database entries

Early Geekbench results database entries show lower-than-expected CPU and memory performance, and detailed analysis of the results seems to confirm the reports about thermal throttling due to heat production as well as lower-than-expected memory performance. In practice, the version of Snapdragon 810 that was benchmarked seems to provide performance lower than even Snapdragon 801 in most respects.

Performance data for Snapdragon 810 in the Geekbench entries is clouded somewhat because of the use of 64-bit AArch64 mode in Android. Until now, most Cortex-A57 and Cortex-A53 based solutions have used AArch32 (32-bit ARMv8 mode, which takes advantage of some of the new features of ARMv8 but is not fully 64-bit). Android AArch64 support and performance has been a work in progress and is still unlikely to be fully optimized. However, in the case of the Snapdragon 810 results, the performance deficit is of such magnitude that it is clear that it is caused by flaws in the chip implementation and not by AArch64 mode.

In the table in the Appendix below, some Snapdragon 810 and 801 results have been highlighted in bold to show some of the performance differences and in particular the areas where Snapdragon 810 performance is much lower than expected.

There are several entries for the device in the database that show considerable variation between runs, providing evidence that performance throttling caused by heat production is a significant problem. For the analysis below, the best benchmark result among the various entries has been used. There is evidence that some of the later entries impose a CPU clock speed limit of about 1.0 GHz or perhaps only use the Cortex-A53 cores in some cases (these entries are also represented in the table).

Deficits in pure CPU performance, especially multi-core

Compared to Samsung's Exynos 7 Octa (5433), which has a similar CPU configuration, basic integer tests such as JPEG Compress already show somewhat lower than expected performance based on the reported clock speed, with multi-core performance scaling being considerably less than expected, and also clearly lower than Snapdragon 801. The Dijkstra benchmark, which has more external memory access and branching, is more heavily affected and is at least 35% slower than on Exynos 5433, despite a similar clock speed, and slower than Snapdragon 801 as well as Snapdragon 805. However, this may in large part be due to running in AArch64 mode rather than the 32-bit mode used on the other chips, since the Dijkstra benchmark seems to be similarly affected on other platforms that use AArch64.

For floating point performance, pure single-core performance, as shown by the Mandelbrot subtest results, is relatively unaffected, but multi-core performance scaling is much lower than Exynos, resulting in performance comparable to Snapdragon 805 rather than the higher floating point performance expected from Cortex-A57 cores (such as in Exynos 5433).

Memory performance significantly impacted

Memory performance is clearly seriously affected, confirming reported issues with the memory controller. The raw throughput of the Stream Copy subtest is significantly lower than expected based on the 32-bit dual-channel memory interface with double-speed LPDDR4, being lower than Snapdragon 805 with a similar amount of memory bandwidth and even significantly lower than Snapdragon 801 with its 32-bit dual-channel LPDDR3 interface.

The flaws in memory performance are evident in the SGEMM subtest, which is a floating point test that is heavy on sequential memory access. Snapdragon 810 shows performance for this test barely more than half that of Snapdragon 801 and 805. It is even worse for the multi-core test, where Snapdragon 810 shows multi-core scaling of less than two times, while Snapdragon 801 and 805 have performance scaling more in line with the four CPU cores they possess.
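
Multi-core scaling here simply means the ratio of the multi-core score to the single-core score for the same subtest. A minimal sketch, with a hypothetical function name and purely hypothetical scores (they are not taken from the Geekbench database):

```python
def multicore_scaling(single_core_score: float, multi_core_score: float) -> float:
    """Ratio of multi-core to single-core score for one subtest.

    A value close to the number of cores indicates good scaling.
    """
    if single_core_score <= 0:
        raise ValueError("single-core score must be positive")
    return multi_core_score / single_core_score


# Hypothetical scores, for illustration only
well_scaling_quad_core = multicore_scaling(1000, 3800)
poorly_scaling_chip = multicore_scaling(600, 1100)

print(f"well-scaling quad-core: {well_scaling_quad_core:.2f}x")
print(f"poorly scaling chip:    {poorly_scaling_chip:.2f}x")
```

A chip whose scaling factor stays below 2 despite having four or more active cores is likely being held back by something other than raw CPU throughput, such as memory access or thermal limits.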

Finally, in the SFFT test, which is a floating point test with heavy random memory access, Snapdragon 810 only shows roughly half the performance of Snapdragon 801, Snapdragon 805 and Exynos 5433. This seems to provide the clearest evidence of performance problems with the memory controller.

Snapdragon 810 likely to be too costly for mainstream high-end devices

On popular technology websites, Snapdragon 810 has recently frequently been mentioned as the likely chip for future high-end models from a diverse range of well-known manufacturers such as Samsung, HTC and LG. However, the high-bandwidth LPDDR4 memory interface (which increases device cost) and performance targets seem to put it clearly in the very high end category, comparable to Snapdragon 805, which does not make it ideal for high-volume performance devices that do not have an extremely high screen resolution such as QHD (2560x1440). Other new chips such as Snapdragon 808 and (for mid-range) Snapdragon 615 seem to be more suitable for performance-oriented mainstream devices, including several of the mainstream flagship devices from the mentioned manufacturers.

However, if the performance flaws that are evident in the current Snapdragon 810 are not fixed or if Qualcomm has significant inventory of flawed chips, it is possible that they will be unloaded onto the more mainstream performance segment for a discounted price. It seems likely however that Qualcomm, given its chip expertise, will be able to fix most of the performance issues with the Snapdragon 810 in a future revision of the chip.

Update (January 5): LG prototype shows better multi-core performance

A Geekbench test run was recorded on January 5 for a prototype LG G Flex2 with Snapdragon 810. This result shows some improvements, especially in the overall multi-core score, although it is still well below that of Exynos 7 Octa (5433), which has a similar CPU configuration.

A closer look reveals that integer benchmarks, especially the more memory-intensive Dijkstra subtest, have not materially improved over the prior results. Multi-core floating point performance has improved significantly and contributes to the higher total multi-core score.

However, memory tests show mixed results. The Stream Copy subtests are lower than the previous best results from last month, remaining significantly lower than Snapdragon 805 and even Snapdragon 801, suggesting that sequential memory access performance has not improved. This is corroborated by the SGEMM subtest results, which also depend on sequential memory access performance and show results that are very similar to the earlier scores.

Meanwhile, the SFFT scores show a significant uptick, especially for multi-core performance, suggesting that Qualcomm has been able to improve the random memory access performance of the chip. However, the subtest scores are still clearly below those of Exynos 5433, Snapdragon 805 and even Snapdragon 801.

Update (January 10): New prototype entry shows improvements in memory performance

A subsequent Geekbench result entry recorded on January 9 for an unknown device shows further improvements in memory performance, although still falling short of the memory performance of the more mainstream Snapdragon 801 (let alone Snapdragon 805). The single-core JPEG Compress subtest result is also improved, but overall the CPU performance results suggest that thermal throttling because of overheating is still likely to be a significant problem.

Appendix: Geekbench performance table

The table below is similar to the one published in my previous article. In the bottom half of the table, some relevant benchmark scores for Snapdragon 810 and Snapdragon 801/805 have been highlighted.

For a high-resolution version, view/copy/save the image above using the browser.

Sources: BusinessKorea, Geekbench browser (Samsung SM-N916S results), Qualcomm (Snapdragon 810 page), Wikipedia (Qualcomm Snapdragon)

Updated (January 5, 2015): Add discussion of recent LG prototype Geekbench test results, update performance table (also include Intel Atom results).
Updated (January 8, 2015): Correct DRAM interface of Snapdragon 810 (it is 32-bit dual-channel using LPDDR4, which can be clocked much higher than LPDDR3).
Updated (January 10, 2015): Add discussion of new Geekbench result entry, updated table.

Monday, December 29, 2014

Another look at Cortex-A53 CPU core performance

Several smartphone chips using ARM's new Cortex-A53 and Cortex-A57 CPU cores with the 64-bit-capable ARMv8 instruction set have arrived on the market recently. Cortex-A53-only based SoCs are especially attractive from a performance/dollar standpoint. However, as I described in earlier articles, there exist significant performance differences between different Cortex-A53 implementations, with some early revisions of the core being limited in performance, probably because of design bugs.

32-bit version of ARMv8 seems practical

Most of the Cortex-A57 and Cortex-A53-equipped SoCs currently seem to be running in what can be called "32-bit ARMv8 mode" (AArch32 in Geekbench, as opposed to ARMv7 for older 32-bit devices), taking advantage of some of the features of the ARMv8 instruction set (which is better suited to modern CPU chip architectures) while avoiding some of the disadvantages of the full 64-bit model (such as doubled storage space for pointers and addresses).

Whether the full 64-bit instruction model (AArch64) will soon be attractive for Android devices, including lower-end ones such as Cortex-A53-based devices with limited amounts of CPU cache and RAM, is unclear. NVIDIA already uses AArch64 in conjunction with their latest Tegra K1 SoC. Optimizations for AArch64 seem to have been a work in progress and early benchmarks for systems running in AArch64 mode were quite poor in comparison to 32-bit mode benchmarks, but progress is being made. Theoretically, more registers are available in AArch64 mode, including to the NEON SIMD unit, which should help performance in some important cases and may mitigate the disadvantages of increased address storage size.

Snapdragon 410 has crippled first revision of Cortex-A53

Snapdragon 410 (MSM8916) is a SoC with quad-core Cortex-A53 that has been one of the first chips with Cortex-A53 cores to come to market and has already been adopted in significant volume for low-to-mid-range designs, replacing the older Cortex-A7-based Snapdragon 400.

However, it is obvious that the very first public revision of the Cortex-A53 core as used inside Snapdragon 410, Cortex-A53 r0p0, is crippled in terms of performance, clearly scoring lower in CPU and memory-intensive benchmarks (even after making the significant correction for clock speed) than SoCs using later revisions of the Cortex-A53 core such as Snapdragon 615 and MediaTek's new chips. Coupled with Snapdragon 410's relatively low clock speed of 1.19 GHz, this results in significantly lower performance than the newer mid-range chips mentioned. Performance in complex benchmarks that simulate demanding, typical use such as complex browsing and gaming is even worse.

Advertising of Snapdragon 410 as having 64-bit support is very misleading

The lower performance seems to be partly associated with the fact that Snapdragon 410 (because of the r0p0 revision of Cortex-A53) is completely limited to ARMv7-compatibility mode and is unable to run in ARMv8 mode (32-bit or otherwise). I have yet to see evidence of a shipping Snapdragon 410 chip that is 64-bit or even ARMv8 capable. It functions as nothing more than a chip with somewhat faster 32-bit ARMv7 Cortex-A7 cores. In this sense, labeling the chip as being 64-bit or potentially having support for the 64-bit ARMv8 instruction set in a future update is downright misleading or a blatant lie, depending on one's standpoint.

Memory performance seems most impacted

Based on Geekbench results, Snapdragon 410 has about 10% lower pure integer CPU performance per MHz when compared to chips such as Snapdragon 615 and MediaTek's MT6732/MT6752. For pure floating point performance, performance is about 5% lower. The biggest difference is in memory performance, where Snapdragon 410 is about 25% slower than Snapdragon 615 (with r0p1 Cortex-A53) and more than two times slower (even when correcting for clock or memory speed) than MT6732/MT6752 with Cortex-A53 r0p2. Another big difference is found in cryptography performance because of the extra ARMv8 instructions that apparently are not available to Snapdragon 410.
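
A per-MHz comparison of this kind divides each score by the CPU clock before comparing. A minimal sketch, with hypothetical function names and hypothetical scores chosen only to illustrate a 10% deficit (they are not actual Geekbench results):

```python
def score_per_mhz(score: float, clock_mhz: float) -> float:
    """Benchmark score normalized by CPU clock speed."""
    return score / clock_mhz


def deficit_percent(candidate_per_mhz: float, reference_per_mhz: float) -> float:
    """How much lower (in percent) the candidate is than the reference."""
    return (1 - candidate_per_mhz / reference_per_mhz) * 100


# Hypothetical scores; only the clock of 1190 MHz comes from the text
candidate = score_per_mhz(1071, 1190)   # chip clocked at 1.19 GHz
reference = score_per_mhz(1500, 1500)   # hypothetical chip at 1.5 GHz

print(f"{deficit_percent(candidate, reference):.1f}% lower per MHz")
```

Normalizing by clock speed separates the effect of the core revision from the effect of the lower clock, which is why both are mentioned separately above.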

A large part of the lower performance of the Cortex-A53 cores inside Snapdragon 410 may be due to chip design bugs as evident from errata issued by ARM for earlier revisions of the Cortex-A53 core. Some details about these errata, which mostly involve memory coherency issues related to CPU cache memory, can be found when compiling a Linux kernel.

Snapdragon 410 shows poor scores in real-world benchmarks

While Snapdragon 410 delivers somewhat better scores than the Cortex-A7-based Snapdragon 400 at the same clockspeed in pure CPU-specific benchmarks such as Geekbench for single-core performance, multi-core performance does not show much benefit (which is unexpected based on the architectural advantage that the Cortex-A53-based Snapdragon 410 should have).

Even worse is the performance in practical benchmarks that measure performance for web browsing, gaming and other more complex, practical use cases. Based on benchmark results reported by GSMArena (1) (2), Basemark X, which is a gaming benchmark based on the Unity engine that simulates a more demanding typical usage pattern, reports a significantly lower score than recent Snapdragon 400-based models such as the Moto G (2014), with the GPU score being similar, pointing to significant flaws in (multi-core) CPU and memory performance.

In Rightware's Browsermark 2.1, a browser benchmark with use of advanced web standards such as HTML 5, WebGL and advanced JavaScript, performance is downright disappointing, with a score less than half that of Snapdragon 400-based devices. Other browser benchmarks show similar results. Scores in Rightware's overall-use Basemark OS II benchmark are also typically relatively disappointing, not surpassing those of Snapdragon 400-based devices.

Hardware bugs likely cause of crippled performance

These lower than expected results for more complex, typical-use benchmarks are consistent with hardware bugs in the Cortex-A53 implementation of the Snapdragon 410 being a bottleneck and significantly degrading multi-core performance in particular. Work-arounds for cache consistency and coherency issues have the potential to significantly degrade performance, for example by forcing the kernel to frequently flush CPU caches.

The Linux kernel source shows commits to handle errata for Cortex-A53 up to r0p2 relating to cache clean operations, with the work-around being to promote cache clean to cache clean and invalidate. This could mean that revision r0p3 of Cortex-A53 may see further improvements. These commits do not explain the performance difference between r0p0, r0p1 and r0p2, since the work-around is the same for all three revisions.

Third revision of Cortex-A53 (r0p2) seems to improve memory performance

Some of the hardware or performance bugs that plagued especially the first version of the core (r0p0 as used in Snapdragon 410) have most likely been fixed in later revisions, contributing to a significant performance increase at the same clock speed.

SoCs with the third revision (r0p2) of the Cortex-A53 core seem to have much better memory performance as shown by Geekbench results, especially impressive given the bandwidth limitations of a 32-bit external memory interface. Most likely, this improvement is derived synergistically with ARM IP such as the Mali-T760 GPU as well as other IP blocks, which are implemented inside chips such as MT6732 and MT6752.

Since a SoC such as MT6732 is on the surface essentially comparable to Snapdragon 410 in the sense of having four Cortex-A53 CPU cores, there seem to be major performance improvements in the later revisions of the Cortex-A53 core and associated system architecture, especially with regard to memory performance. The difference is made more pronounced by the fact that the MT6732 is manufactured using TSMC's higher performance 28HPM process rather than 28LP and also clocked significantly higher.

Octa-core Cortex-A53 configurations provide impressive multi-core performance

Octa-core Cortex-A53-based SoCs such as MT6752 and to a lesser extent Snapdragon 615 are already showing impressive multi-core CPU performance, while single-core performance has also improved considerably over prior cost-effective CPU architectures. Multi-core performance, both in terms of pure CPU integer and floating point performance, for the MT6752 significantly surpasses (by tens of percent in many benchmarks) the much more expensive Snapdragon 801, while single-core performance is catching up, being about 30% slower for integer operations and 15% slower for floating point. This high level of performance comes at a fraction of the cost (primarily because of the small die size and low power consumption of the Cortex-A53 cores).

Memory bandwidth still a bottleneck

However, when the memory subsystem truly comes into play, high-end chips such as Snapdragon 801 still show much greater performance because of their much higher external memory bandwidth (due to the wider memory interface) as well as larger CPU cache. This is apparent in the Geekbench subtest SGEMM (which is heavy on sequential memory access), for which high-end SoCs such as Snapdragon 801 are more than twice as fast.

In practice, memory performance is important for how fast a device feels, impacting response times and also being very important for GPU performance. High screen resolutions also put heavy demands on the memory subsystem. In that sense, SoCs such as MT6752 and Snapdragon 615 still perform best at a resolution like 1280x720, with the best performance at 1920x1080 and higher still reserved for high-end SoCs.

There seems to be great potential for performance-oriented Cortex-A53 SoCs with a memory interface wider than 32-bit, comparable with other performance-oriented SoCs. This would be the "best of both worlds" in several respects (lower cost because of the small die size of the CPU cores, low power consumption, while still having the memory bandwidth to drive high resolutions). MediaTek has announced a chip that is expected to have such a configuration, the MT6795, but it has not yet appeared on the market and might be delayed. However, similar solutions certainly look likely to become popular for performance-oriented devices in the not too distant future.

Appendix: Table with detailed Geekbench CPU benchmark results

Presented here is a table with detailed benchmark result information for the mentioned SoCs, also including several other SoCs on the market. Included is information about the CPU cores used, their clock speed, the smartphone model and Geekbench result entry used as a reference, and scores for several benchmarks. Indexed results (relative to a 1.0 GHz Cortex-A7) are shown for several of the benchmarks, as well as multi-core performance scaling indices. Results relevant to the discussion above have been highlighted in bold. The following Geekbench subtests have been included:

  • JPEG Compression (single/multi-core). A useful integer benchmark that seems to strongly depend on pure CPU performance (CPU core type and clock speed) with less dependence on the memory subsystem (including L2 CPU cache).
  • Dijkstra (single/multi-core). A more complex integer benchmark that probably includes more memory access and may branch a lot. Notable for this benchmark is that Cortex-A53 performs better than Cortex-A15 at the same clock speed, with both Cortex-A17 (MT6595) and Cortex-A57 being significantly faster still.
  • Mandelbrot (single/multi-core). A pure floating point benchmark, highly dependent on the combination of CPU core type and clock speed.
  • Stream copy (single/multi-core). An important metric for memory performance (especially sequential external RAM performance).
  • SGEMM. A floating point matrix multiplication benchmark that heavily depends on sequential memory access. The memory bandwidth available to the SoC makes a critical difference for this benchmark.
  • SFFT. A floating point benchmark that heavily uses random memory access.

For a high-resolution version, view/copy/save the image above using the browser.

Sources: Geekbench browser, Primate Labs website

Updated January 2, 2015 (Add section of low typical-use benchmark scores for Snapdragon 410).
Updated January 5, 2015 (Update Geekbench performance table).
Updated January 10, 2015 (Update performance table).
Updated February 11, 2015 (mention and link Linux kernel Cortex-A53 errata).

Thursday, December 25, 2014

Cortex A53-based Snapdragon 615 arrives, but power efficiency in question

Qualcomm's Snapdragon 615 (MSM8939), an octa-core ARM Cortex-A53 based SoC with four cores clocked at 1.54 GHz and four cores clocked at 1.0 GHz, has arrived on the market with a significant number of new models shipping from several manufacturers.

The new chip conveniently fills the gap in Qualcomm's product line for SoCs with integrated baseband between the low-to-mid-range Snapdragon 400/410 and the high-end Snapdragon 801, which have a large performance and cost difference, as for some time Qualcomm has offered no competitive smartphone solution with performance falling in between for the performance mid-range category.

While the SoC appears to offer good mid-range CPU and GPU performance, based on early evidence its power efficiency appears to be less than what one would expect based on its utilization of low-power Cortex-A53 cores.

DRAM interface appears to be 32-bit after all

Early data suggested that Snapdragon 615 (MSM8939) would utilize a relatively wide 64-bit external DRAM interface, which is not typical of cost-sensitive devices because it significantly increases the cost of the PCB design, the chip and other components. A 64-bit DRAM interface would mean that memory bandwidth is relatively high and that the chip would run relatively smoothly at resolutions such as FullHD (1920x1080) and higher.

However, more recent sources as of December 2014 (including Qualcomm's website) indicate the chip uses a cost-effective 32-bit DRAM interface with support for LPDDR3 up to 800 MHz, resulting in memory bandwidth of 6.4 GB/s, comparable with other cost-effective mid-range SoCs, which can lead to constrained performance when running at high resolutions such as 1920x1080.

GPU appears to have strong pixel processing capabilities, but is limited by memory bandwidth

The Adreno 405 GPU provides adequate performance for a mid-range SoC, comparable in benchmarks such as the GFXBench T-Rex and Manhattan tests to that of MediaTek's new MT6752 (also an octa-core Cortex-A53-based SoC with a 32-bit memory interface, in conjunction with a Mali-T760 MP2 GPU), while being roughly three times faster than the GPU in the low-to-mid-range Snapdragon 400/410 platforms.

In GFXBench subtests, the ALU and Alpha Blending benchmark results are particularly high for a mid-range device and close to the scores achieved by higher-end chips from competitors such as Kirin 920 and Exynos 5 Octa, which have Mali-T628 MP4 and Mali-T628 MP6 GPUs and a wider DRAM interface. However, the pixel fill rate is lower and is probably a bottleneck because of the memory bandwidth limitation. This could suggest that the GPU inside the chip is larger and higher powered than it needs to be, stemming from original plans for a 64-bit DRAM interface on the SoC. In comparison, the Mali-T760 MP2 as implemented in the MT6752 has less processing power but implements bandwidth-saving techniques from ARM that improve performance in a bandwidth-constrained environment.

The 32-bit memory interface and resulting memory bandwidth bottleneck probably mean that devices using the SoC will run significantly smoother (especially in games) with better battery life when using a screen with a lower resolution like 1280x720, while a resolution of 1920x1080 will make the memory interface the bottleneck, also resulting in shorter battery life. A similar phenomenon is seen with other relatively high-powered SoCs with limited memory bandwidth, such as MediaTek's previous generation MT6592.
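
To give a rough sense of scale, the sketch below compares the pixel counts of the two resolutions and estimates the bandwidth needed just to write each frame once. The function names are hypothetical, and the 4 bytes per pixel and 60 frames per second are assumptions for illustration; real rendering reads and writes each pixel several times per frame, so actual demand is considerably higher.

```python
def pixel_count(width: int, height: int) -> int:
    """Number of pixels in a frame."""
    return width * height


def single_pass_bandwidth_gbs(width: int, height: int,
                              bytes_per_pixel: int = 4,
                              frames_per_second: int = 60) -> float:
    """Bandwidth in GB/s to write every pixel once per frame (lower bound)."""
    return pixel_count(width, height) * bytes_per_pixel * frames_per_second / 1e9


ratio = pixel_count(1920, 1080) / pixel_count(1280, 720)
print(f"1080p has {ratio:.2f}x the pixels of 720p")  # 2.25x

for width, height in [(1280, 720), (1920, 1080)]:
    gbs = single_pass_bandwidth_gbs(width, height)
    print(f"{width}x{height}: at least {gbs:.3f} GB/s per rendering pass")
```

With 2.25 times as many pixels, every bandwidth-dependent stage of rendering costs 2.25 times as much at 1920x1080, all competing for the same 6.4 GB/s that the CPU also depends on.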

SoC design shows some signs of cost-reduction measures, including use of 28LP process

Benchmark scores and GPU performance illustrate that this is not a high-end chip and that Qualcomm has reduced cost in a number of ways, reducing CPU and GPU performance. A likely factor is a smaller amount and slower L2 cache memory when compared to higher-end SoCs, as well as the relatively limited memory bandwidth provided by the 32-bit DRAM interface.

Another major factor is that, despite being a relatively performance-oriented chip, it is manufactured using TSMC's relatively economical and low-performance 28LP process (also used for Snapdragon 400/410), which limits clock rates and power efficiency. Other chips, like the Snapdragon 800 series and most of MediaTek's mid-range solutions like MT6752 are manufactured using the higher-performance 28HPM process at TSMC, which provides significantly better performance (higher clock rates) and lower power consumption.

Reduced cost and die size lowers wafer requirements

By migrating part of its performance-mid-range SoC offerings from the Snapdragon 800 series to Snapdragon 615, Qualcomm is effectively reducing its wafer requirements at TSMC (especially for HPM), because Snapdragon 615 is likely to have a much smaller die size than the relatively large Snapdragon 801 (the total area for the CPU cores is much smaller, despite there being twice as many cores) and more chips can be manufactured on a single wafer. Qualcomm also saves a significant amount of cost this way (although in the past, Qualcomm's patent royalty leverage has meant that the chip margins were not as important as they might be for other companies).
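
The relationship between die size and chips per wafer can be illustrated with a commonly used approximation for gross dies per wafer. This is only a sketch: the function name is hypothetical, the die areas of 60 and 120 square millimetres are hypothetical (the actual die sizes of these chips are not given here), a 300 mm wafer is assumed, and defects and yield are ignored.

```python
import math


def dies_per_wafer(wafer_diameter_mm: float, die_area_mm2: float) -> int:
    """Approximate gross dies per wafer.

    Wafer area divided by die area, minus a correction term for
    partial dies lost along the wafer edge. Ignores yield.
    """
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(wafer_area / die_area_mm2 - edge_loss)


# Hypothetical die areas on an assumed 300 mm wafer
small_die = dies_per_wafer(300, 60)
large_die = dies_per_wafer(300, 120)

print(f"60 mm^2 die:  about {small_die} dies per wafer")
print(f"120 mm^2 die: about {large_die} dies per wafer")
```

Under this approximation, halving the die area more than doubles the number of dies per wafer, because fewer partial dies are lost at the wafer edge.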

Reviews and benchmark scores show mediocre battery life and power efficiency

Contrary to initial expectations from the use of power efficient Cortex-A53 CPU cores in a pseudo big.LITTLE configuration, Snapdragon 615 does not appear to be very power efficient, resulting in mediocre battery life in end devices. The Snapdragon 615-based Oppo R5 shows poor battery life in a review by GSMArena, partly because of the high resolution 1080p AMOLED screen. The SoC is likely to be less efficient with resolutions of 1080p and higher.

In the GFXBench long-term performance benchmark for the HTC Desire 820, GPU performance is sustained close to the maximum level, but with a relatively mediocre battery lifetime score of 153 minutes, which is lower than almost all other modern smartphones. A review of the same device by Android Central noted that battery life was reasonable although not spectacular. The HTC model uses a 720p resolution which is likely to result in more acceptable battery life than devices running at 1080p.

Part of the reason for the relatively high power consumption is likely to be the use of the less efficient 28LP semiconductor process at TSMC, in conjunction with a relatively powerful GPU with a relatively large die size (which is however limited by memory bandwidth). The Cortex-A53 cores may also perform worse, with higher power consumption, when compared with implementations using the 28HPM process such as MediaTek's Cortex-A53-based designs.

Is Cortex-A53 less power-efficient than expected?

Based on its similarities with the very power efficient Cortex-A7 core, one would expect Cortex-A53 to be a relatively power efficient CPU core, and in that sense the power efficiency of the Cortex-A53-only Snapdragon 615 might be considered disappointing. However, in the case of Snapdragon 615, there are important factors that reduce the power efficiency of the implementation. The 28LP process is a major factor, as well as presumably the relatively high-powered GPU. The 32-bit memory interface in conjunction with the relatively powerful multi-core CPU and GPU can cause memory bus contention due to insufficient bandwidth, resulting in relatively heavy DRAM access patterns.

Another factor could be the r0p1 revision of the Cortex-A53 core; progressive revisions of the core show indications of increased performance and efficiency. MediaTek uses revision r0p2 in its MT67xx family, as well as using the more efficient 28HPM process at TSMC. Samsung has already been shipping the 20 nm-manufactured Exynos 7 Octa (5433) for several months which also uses Cortex-A53 to good effect as the power efficient part of its CPU configuration.

The bandwidth-saving techniques of the Mali-T760 GPU (used by both MediaTek and Samsung) and other ARM IP blocks are likely to contribute to reduced power consumption. Battery life benchmarks and reviews for the MT6732 and MT6752, when they become available, will help clarify whether an octa-core Cortex-A53 with a 32-bit memory interface can in fact provide low power consumption and long battery life.

Sources: Wikipedia (Snapdragon page), Qualcomm (Snapdragon processor page), GFXBench results browser, GSMArena, Android Central

Updated January 2, 2015.

Tuesday, December 16, 2014

No more wafer capacity shortage at TSMC?

For November 2014, TSMC somewhat unexpectedly reported a revenue decline to US$2.31 billion, 10% lower than the historical high achieved in October 2014, reversing several months of continually increasing monthly revenues at TSMC amid a shortage of capacity for clients of TSMC and continuous investments into capacity expansion.

An article in the Taipei Times from 14 December 2014 further reports on TSMC's sales in Q4 2014 and its future prospects, with a senior TSMC official saying that the decline was not a surprise, as "cautious inventory adjustment actions taken by some of our customers will bring slower fourth-quarter demand". There have been reports that some clients may have double-ordered chips in the face of the capacity shortage that existed previously. TSMC's sales in Q4 2014 will still be a quarterly record based on strong demand for 4G smartphones in China and increased demand for TSMC's advanced 20nm process technology.

TSMC's Q4 2014 revenues are still projected to be near NT$220 billion (about US$7.0 billion), a sequential increase of about 5% from the previous quarter, which would complete a strong 27% increase in revenues for the whole year 2014 over the previous year, extending TSMC's leadership of the foundry industry.
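The monthly and quarterly figures quoted above can be cross-checked with some simple arithmetic. The following is a rough sketch in Python that uses only the rounded US dollar numbers from the text; it ignores exchange rate movements within the quarter, so the derived values are approximations rather than reported figures.

```python
# Rounded figures quoted in the text (US$ billion).
november_revenue = 2.31        # November 2014 revenue
decline_from_october = 0.10    # November was about 10% below October
q4_projection = 7.0            # projected Q4 2014 revenue
sequential_growth = 0.05       # about 5% above the previous quarter

# October revenue implied by the reported 10% decline.
october_revenue = november_revenue / (1 - decline_from_october)

# Previous-quarter (Q3) revenue implied by the projected sequential increase.
q3_revenue = q4_projection / (1 + sequential_growth)

# December revenue that would remain if the Q4 projection holds.
december_revenue = q4_projection - october_revenue - november_revenue

print(f"Implied October revenue:  US${october_revenue:.2f} billion")
print(f"Implied Q3 revenue:       US${q3_revenue:.2f} billion")
print(f"Implied December revenue: US${december_revenue:.2f} billion")
```

Under these rounded inputs, the December figure that remains is lower than the November figure, so the quarterly projection already implies a further monthly decline.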

TSMC's revenues for 2015 are forecast to further increase by 15 to 20%, based on strong demand for 20nm chips, new chips manufactured using its 16nm FinFET process technology and continuing demand for 28nm chips, as well as demand for trailing-edge 8" wafer capacity.

Apple ramp has peaked, Android chip vendors cautious

As mentioned in the recent articles, a likely major reason for the revenue decline in November and for Q4 2014 is that Apple's production of the Apple A8 and Apple A8X processors already peaked in October in order to achieve sufficient production in time for the 2014 holiday season. Additionally, demand from vendors of chips for Android devices (such as Qualcomm and MediaTek) has not picked up, so another sales decline is expected for December 2014.

The decline in demand appears to be concentrated in the 20nm and 28nm HPM (High-Performance Mobile) process technologies, which were earlier in extremely short supply, and are used by Apple for its A8 SoCs at 20nm, and primarily at 28nm by Qualcomm for its high-end SoCs such as the Snapdragon 800 series and by MediaTek for various mid-range chips (such as MT6592, MT6595 and MT8135V), as well as its new 64-bit SoCs (MT6732, MT6752 and MT6795) that are currently ramping.

Decreased use of TSMC-produced chips by Samsung

An important contributor to the decline in demand for mobile processor capacity at TSMC is likely to be a decline in the utilization of TSMC-produced smartphone SoCs at Samsung. Samsung has recently been facing an overall sales decrease for its smartphone business, although Q4 2014 has been projected to see a recovery. However, Samsung is aggressively increasing the use of its own Exynos series SoCs in its smartphones, especially high-end models, after reduced orders from Apple left Samsung's advanced logic fabs underutilized.

Chips like the Exynos 7 Octa perform adequately for a high-end device and have significantly decreased Samsung's reliance on Qualcomm, which manufactures at TSMC. While Qualcomm is likely to continue to sell a large number of low-cost chips such as Snapdragon 410 to Samsung, the overall product mix from Qualcomm into Samsung has likely shifted to lower-end chips that have a significantly smaller die size, and thus require significantly less wafer capacity for a given amount of chips.
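To illustrate how die size translates into wafer requirements, here is a rough sketch in Python using a commonly used approximation for the gross number of dies on a round wafer (wafer area divided by die area, minus an edge-loss term). The die areas are hypothetical placeholders, not measured values for any Snapdragon chip, and yield is ignored entirely.

```python
import math

def dies_per_wafer(wafer_diameter_mm: float, die_area_mm2: float) -> int:
    """Approximate gross dies per wafer (ignores yield and scribe lines)."""
    radius = wafer_diameter_mm / 2
    gross = math.pi * radius ** 2 / die_area_mm2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(gross - edge_loss)

def wafers_needed(chips: int, wafer_diameter_mm: float, die_area_mm2: float) -> int:
    """Number of wafers needed to produce a given number of chips."""
    return math.ceil(chips / dies_per_wafer(wafer_diameter_mm, die_area_mm2))

# Hypothetical die areas, chosen only to show the trend.
LARGE_DIE_MM2 = 120.0   # stand-in for a large high-end SoC
SMALL_DIE_MM2 = 60.0    # stand-in for a smaller low-cost SoC

for label, area in (("large die", LARGE_DIE_MM2), ("small die", SMALL_DIE_MM2)):
    wafers = wafers_needed(1_000_000, 300.0, area)
    print(f"{label}: {dies_per_wafer(300.0, area)} dies per 300 mm wafer, "
          f"{wafers} wafers per million chips")
```

Because the edge-loss term shrinks along with the die, halving the die area more than doubles the number of dies per wafer in this model, which is why a shift in product mix toward smaller chips can reduce wafer demand noticeably.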

Transition to smaller die size for Qualcomm's mid-range performance-oriented SoCs

For some time, Qualcomm has had a gap in its product line, with Snapdragon 400 being used for the low-end as well as part of the mid-range segment, and a large performance and cost gap to the high-end Snapdragon 800 series, while Snapdragon 600 (without integrated baseband) was out of the picture. This resulted in a relatively large amount of high-end, large die-size Snapdragon 800 series chips being used in smartphones, even for models that do not quite require that level of performance.

However, Qualcomm has introduced new SoCs such as Snapdragon 615, an octa-core Cortex-A53-based SoC with a mid-range GPU, which can address the performance requirements of a significant part of the performance-oriented segment at a much lower cost, importantly while consuming significantly less wafer capacity at TSMC due to the smaller die size of the SoC. This product transition at Qualcomm likely contributes to lower wafer requirements for Qualcomm at TSMC as production of smaller chips like Snapdragon 210, Snapdragon 410 and Snapdragon 615 increases at the expense of Snapdragon 801/805, and as a result contributes to TSMC's revenue decline.

Product transition at MediaTek

Meanwhile, MediaTek is also in a product transition from its 3G product line to its new product line with integrated 4G baseband. Because it has been late with integrated 4G, MediaTek has come under some pressure in China, with more of its sales being concentrated at the low end of the market with SoCs such as dual-core chips for worldwide export markets, which take a smaller amount of wafer capacity.

MediaTek's new mid-range performance-oriented chips such as MT6752 and MT6795 are competitive, and have the potential to reduce overall market die size requirements and improve device cost and efficiency for the performance-oriented segment. However, they, as well as the lower-end MT6732, do not address the highest-volume low-end 4G segment, which in the near term is more likely to be addressed by Qualcomm with its Snapdragon 410 and upcoming Snapdragon 210 series, the latter of which implies further reductions in wafer requirements due to smaller die size.

Other smartphone SoC clients at TSMC

HiSilicon has been producing increasing numbers of smartphone SoCs at TSMC for use in Huawei smartphones, but may have been affected by inventory issues, and there's also evidence of HiSilicon transitioning to more cost-effective designs such as the octa-core Cortex-A53-based Kirin 620, partly displacing its existing big.LITTLE Cortex-A15/Cortex-A7-based Kirin 920/925 series, which have a relatively large die size.

Other TSMC clients for leading-edge processes

Other companies that do not concentrate on smartphones such as Broadcom (embedded communications/networking) and NVIDIA (primarily PC-class GPUs, as well as high-end tablet SoCs) may welcome the increased capacity as it gives them increased production flexibility amid strong demand for their chips.

Not good for foundry competitors

Foundry competitors such as GlobalFoundries, which are already struggling, are not likely to benefit from the alleviation of capacity constraints at TSMC, because potential clients may now be less determined to move part of their production from TSMC to alternative suppliers such as GlobalFoundries (as well as Samsung and UMC). Moving products to new foundries involves considerable investment and time since their processes are different from TSMC's processes, and as long as TSMC has enough capacity there is little reason for clients to not concentrate production at TSMC with its industry-leading performance.

Sources: DigiTimes (TSMC November revenues), Taipei Times (TSMC article)

Updated December 26, 2014.

Thursday, December 4, 2014

Another symmetric octa-core CPU-based SoC announced (HiSilicon Kirin 620)

Huawei has just announced a new SoC, Kirin 620, with an octa-core Cortex-A53 CPU. The chip is the latest in a series of newly introduced octa-core Cortex-A53-based SoCs from companies such as MediaTek and Qualcomm as well as other players.

New Kirin 620 chip appears to target cost-sensitive segment

HiSilicon shows some smart design choices with this chip. It is clearly designed to be relatively cheap to manufacture (with a relatively limited chip die area) while still providing good performance for low/mid-range devices.

In the past, HiSilicon has been using CPU cores with a relatively large die area such as Cortex-A9 and Cortex-A15, which do not result in a particularly cheap or power-efficient chip. However, the Cortex-A53 is the direct successor to the very power-efficient and extremely small Cortex-A7 core, which means even with eight cores the chip will still be relatively small as well as power-efficient.

The maximum CPU clock speed of 1.2 GHz is significantly lower than most other announced Cortex-A53-based SoCs, illustrating that the chip is intended for the cost-sensitive segment. Possibly, it is manufactured on TSMC's relatively economical 28LP process technology, which limits maximum performance.

Compared to MediaTek’s and Qualcomm’s new octa-core Cortex-A53-based chips, the Mali-450 MP4 GPU is notable because it does not support the OpenGL ES 3.0 API. However, OpenGL ES 2.0 is still the standard in the mobile market, and HiSilicon can probably improve cost and performance this way (especially since Mali-T62x and Mali-T760 are not cheap in terms of die size). Mali-T760 would have been faster and more power-efficient, but Mali-450 MP4 saves cost while still providing reasonable performance.

The new chip has several similarities with MediaTek’s MT6592, which is almost a year old, and has eight Cortex-A7 cores instead of Cortex-A53 and also a Mali-450 MP4 GPU.

Octa-core Cortex-A53 CPU provides benefits for performance/Watt and performance/dollar

Because the Cortex-A53 (like its predecessor, the Cortex-A7) has a very small die size in comparison to higher-performance cores like Cortex-A57 and Cortex-A15, the use of eight cores instead of four does not significantly raise the cost of the chip, while greatly increasing multi-core performance. Several other Cortex-A53-based chips are also clocked at a relatively high frequency (in excess of 2 GHz for MT6795), resulting in respectable single-core performance as well and making such a configuration suitable for the performance segment, although this does not quite hold for HiSilicon's chip because of its relatively low clock speed.

An octa-core configuration can provide real benefits in practice in a multi-threaded OS such as Android. Applications that can readily take advantage of eight cores include the Chrome browser and software video decoding and encoding libraries, all of which can improve the user experience. Because the eight cores are usually physically split into two clusters with a separate L2 cache, there is also room for further optimizations by the kernel scheduler in order to maximize performance and power efficiency.

For example, it might be possible for the scheduler to disable one of the two clusters of four CPU cores and its associated L2 cache during normal operation (when the load is not high), resulting in low power consumption. When more CPU power is needed, the second cluster comes online. Even when there are only a few threads, the scheduler might be able to detect the need for more L2 cache memory in a particular workload and move one or more threads to the second cluster. MediaTek's CorePilot technology, with which it has had experience since the MT6592, probably involves heuristics of this kind.
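The kind of heuristic described above can be sketched in a few lines of Python. This is purely illustrative: the function name, inputs and thresholds are hypothetical, and it is not a description of how MediaTek's CorePilot or any actual kernel scheduler works.

```python
from dataclasses import dataclass

@dataclass
class ClusterDecision:
    active_clusters: int   # 1 = only the first cluster, 2 = both clusters
    reason: str

def choose_clusters(runnable_threads: int,
                    avg_core_load: float,
                    l2_miss_rate: float,
                    cores_per_cluster: int = 4,
                    load_threshold: float = 0.75,    # hypothetical threshold
                    miss_threshold: float = 0.20     # hypothetical threshold
                    ) -> ClusterDecision:
    """Decide whether the second four-core cluster should be brought online."""
    if runnable_threads > cores_per_cluster:
        return ClusterDecision(2, "more runnable threads than one cluster has cores")
    if avg_core_load > load_threshold:
        return ClusterDecision(2, "first cluster is heavily loaded")
    if l2_miss_rate > miss_threshold:
        # Few threads, but the workload seems to need more L2 cache.
        return ClusterDecision(2, "workload appears to be limited by L2 cache")
    return ClusterDecision(1, "light load, second cluster can stay powered down")

print(choose_clusters(runnable_threads=2, avg_core_load=0.30, l2_miss_rate=0.05))
print(choose_clusters(runnable_threads=2, avg_core_load=0.30, l2_miss_rate=0.50))
```

A real implementation would also need hysteresis, so that the second cluster is not switched on and off repeatedly when the load hovers around a threshold.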

Overview of symmetric octa-core Cortex-A7 and Cortex-A53-based SoCs

The following table shows an overview of currently announced octa-core Cortex-A7 and Cortex-A53-based SoCs, starting with MediaTek's MT6592 which has been available for about a year.

Note that Qualcomm's Snapdragon 615 is not really a symmetric octa-core because it uses a pseudo-big.LITTLE configuration with four Cortex-A53 cores clocked higher and four cores clocked lower.

Performance comparison of octa-core Cortex-A7 and Cortex-A53-based SoCs

The following tables show CPU performance (using a representative Geekbench subtest result) as well as GPU performance based on GFXBench for relevant SoCs and devices for which benchmark data is available. It includes both octa-core Cortex-A7 and Cortex-A53-based SoCs, as well as other existing SoCs from different market segments, for reference.

The first few columns of the table show a description of the SoC with CPU configuration, the name of a representative device model using the SoC and the maximum CPU clock speed. Then comes the Geekbench JPEG Compression benchmark test, both single-core and multi-core. This Geekbench subtest has been found to be relatively sensitive to CPU performance without being very sensitive to other factors such as L2 cache size.

The rightmost columns show information about the GPU. First listed are the GPU type and off-screen performance for the GFXBench T-Rex (OpenGL ES 2.0) and Manhattan (OpenGL ES 3.0) benchmarks. The offscreen tests always render into a 1920x1080 off-screen buffer, making results comparable between devices with different screen resolutions. The actual resolution used on the device comes next, followed by on-screen T-Rex benchmark performance and information relevant for battery life and long-term performance (which is affected by thermal throttling). This includes average long-term performance of the T-Rex on-screen benchmark, the battery size of the device and the battery life in minutes when running T-Rex on-screen long-term.
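The battery size and battery life columns can be combined into a rough estimate of average power draw, which makes devices with different battery sizes easier to compare. The Python sketch below assumes that the reported battery life corresponds to draining the full nominal capacity and assumes a nominal cell voltage of 3.8 V; both are simplifications, and the example numbers are hypothetical rather than taken from the table.

```python
def average_power_watts(battery_mah: float, runtime_minutes: float,
                        nominal_voltage: float = 3.8) -> float:
    """Average power draw if the whole battery lasts for runtime_minutes."""
    energy_wh = battery_mah / 1000.0 * nominal_voltage
    return energy_wh / (runtime_minutes / 60.0)

def frames_per_joule(avg_fps: float, power_watts: float) -> float:
    """Rendered frames per joule of energy, a simple efficiency metric."""
    return avg_fps / power_watts

# Hypothetical device: 3000 mAh battery, 180 minutes of T-Rex, 19 fps sustained.
power = average_power_watts(battery_mah=3000, runtime_minutes=180)
print(f"Average power: {power:.2f} W")
print(f"Efficiency:    {frames_per_joule(19.0, power):.2f} frames per joule")
```

Note that this figure covers the whole device, including the screen, so it says more about the device as a whole than about the GPU in isolation.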

Mali-T760 appears to be highly efficient

Notable is that GPU performance of the MT6752 with Mali-T760 MP2 GPU as represented by the Lenovo A70-A entry in the GFXBench database is comparable with the Snapdragon 615-based HTC Desire 820, despite the latter's higher low-level pixel processing performance (such as evident in the ALU and Alpha Blending scores) provided by the Adreno 405 GPU.

This strongly suggests that ARM has made a big leap in terms of performance efficiency with the Mali-T760 GPU core in conjunction with compression-based bandwidth optimization technologies such as ARM Framebuffer Compression, Transaction Elimination and Smart Composition as well as good integration with the Cortex-A53 CPU architecture (which already shows memory performance improvements).

Based on GFXBench power efficiency data, none of the listed SoCs appears to be particularly power-efficient with a full GPU load with the complex T-Rex benchmark, but data for the Mali-T760 MP2-based MT6752 has yet to come in. However, the best battery life entries in the GFXBench database for the Samsung Galaxy Note 4 with Mali-T760 MP6-based Exynos 7 Octa show the ability to run the on-screen T-Rex benchmark for more than 300 minutes with reasonable sustained performance on the very high resolution screen of the Note 4, which is consistent with relatively high power efficiency of the Mali-T760 GPU.

Note that power efficiency is likely to be better for typical GPU applications that are less demanding than GFXBench's T-Rex benchmark (this affects lower-end SoCs/GPUs more than higher-end ones).

Sources: CNXSoftware (Kirin 620 announcement), GFXBench results database, Geekbench browser

Updated December 25, 2014 (Correct memory interface information for Snapdragon 615).

Monday, December 1, 2014

Analysis of GPU performance of mobile SoCs based on GFXBench results

In this post, I am analysing the GPU performance of different GPUs and SoCs based on the results database of GFXBench, one of the leading mobile GPU benchmarks. Apart from providing a GPU performance comparison for different SoCs, GFXBench results provide sufficient detail to get an impression of metrics like fill rate, triangle rate and shader performance, allowing one to draw conclusions about what the bottleneck is in a particular implementation.

GFXBench results table for mobile SoCs

The following table shows detailed GFXBench 3.0 results for a large number of mobile SoC platforms and devices. The results are grouped by smartphone and tablet devices, and further grouped for similar chips (smartphone table) or in alphabetical order by chip (tablet table).

The same table is shown below, but sorted on the T-Rex Offscreen benchmark score in descending order, which provides a reasonable device-independent indication of GPU performance.

Top-performing SoCs: Apple A8/A8X, Snapdragon 805, NVIDIA Tegra K1 and Exynos 7 Octa

Apple's A8 and A8X SoCs, NVIDIA's Tegra K1 (both the Cortex-A15/A7-based version as well as the NVIDIA Denver-based version) as well as Qualcomm's Snapdragon 805 lead the pack for mobile GPU performance. What most of these chips have in common is a large number of GPU pixel processing cores and a wide DRAM interface (especially in the case of the Apple A8X and Snapdragon 805) to achieve high memory bandwidth. The Apple A8X has been reported by AnandTech to contain an eight cluster PowerVR Series 6 GPU, twice the number of clusters of the GPU inside the Apple A8.

In the OpenGL ES 2.0-based T-Rex offscreen benchmark, the Apple A8X as used in the iPad Air 2 leads, closely followed by the respective versions of Tegra K1 in the HTC Nexus 9 and the NVIDIA Shield Tablet. The Apple A8 and Snapdragon 805 are significantly slower in the T-Rex offscreen benchmark, with scores comparable to each other (and still very fast for most purposes), even though Snapdragon 805 shows significantly higher low-level metrics such as fillrate, alpha blending bandwidth and shader processing throughput. Snapdragon 805 (with Adreno 420 GPU) has an effective 128-bit memory interface (similar to Apple A8X), which suggests the Apple A8 (with 64-bit memory interface) has greater efficiency within the limitations of the lower memory bandwidth, probably helped by the use of large on-chip caches (including the L3 cache). Samsung's Exynos 7 Octa (Exynos 5433, with Mali-T760 MP6) is somewhat slower than Apple A8 and Snapdragon 805, and so is the slowest of the high-performance processors in terms of GPU power (while being near the lead in terms of CPU performance).
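To make the effect of memory interface width concrete, the following Python sketch computes theoretical peak DRAM bandwidth as bus width times transfer rate. The transfer rate used here is an assumed placeholder that is the same for all three widths; actual memory types and clock speeds differ between the SoCs discussed, so only the ratios are meaningful.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, transfers_per_second: float) -> float:
    """Theoretical peak bandwidth in GB/s for a given bus width and transfer rate."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfers_per_second / 1e9

# Assumed transfer rate (1600 million transfers per second), for illustration only.
ASSUMED_TRANSFER_RATE = 1600e6

for width in (32, 64, 128):
    bandwidth = peak_bandwidth_gb_s(width, ASSUMED_TRANSFER_RATE)
    print(f"{width:3d}-bit interface: {bandwidth:5.1f} GB/s peak")
```

At the same transfer rate, a 128-bit interface has twice the peak bandwidth of a 64-bit one and four times that of a 32-bit one, which is why caches and bandwidth-saving techniques matter more as the interface gets narrower.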

In the OpenGL ES 3.0-based Manhattan benchmark (offscreen, so that the results are largely independent of screen resolution), the Apple A8X and NVIDIA Tegra K1 provide comparable performance (a score just above 2000), while the Snapdragon 805 follows at a considerable distance with a score of about 1200, similar to the score achieved by the Apple A8 inside the iPhone 6 and iPhone 6 Plus. Samsung's Mali-T760 MP6-based Exynos 7 Octa (as represented by the Exynos-based version of the Galaxy Note 4) follows with a score of about 1100.

High-end: Snapdragon 801, Exynos 5 Octa, Apple A7

Qualcomm's Snapdragon 801 with Adreno 330 GPU has been widely used in performance-oriented devices for some time and provides relatively high performance for the segment. Part of the reason for the wide adoption of the high-powered Snapdragon 801 is that Qualcomm has not had a convenient SoC offering intermediate between the Snapdragon 801 and Snapdragon 400 (between which exists a large performance and cost gap), and through its control over the high-performance smartphone market through its patent royalty leverage has been able to convince customers to use the Snapdragon 801 in a wide range of devices (as it did previously with the Snapdragon 800), with the SoC providing more performance than really necessary in many cases.

In the OpenGL ES 2.0-based T-Rex (offscreen) test, Snapdragon 801 scores approximately the same as Apple's previous generation Apple A7 SoC. Samsung's recent Exynos 5 Octa (Exynos 5430, with Mali-T628 MP6) used in the Galaxy Alpha also scores about the same. The results for the OpenGL ES 3.0-based Manhattan benchmark are also comparable for these three SoCs.

PowerVR's Rogue Han (G6200) GPU with two clusters inside MediaTek's recent MT6595 does not match the performance of the other high-end chips mentioned above, although still providing performance clearly above current and upcoming mid-range solutions. This GPU is also implemented in Allwinner's A80 chip, which shows somewhat lower scores in a benchmark entry for an A80 OptimusBoard development board.

Cost-sensitive SoCs: Snapdragon 410 vs Snapdragon 400 vs MT6582

Rather than showing an evolutionary improvement in GPU performance, the quad-core Cortex-A53-based Snapdragon 410's Adreno 306 GPU actually shows 10% to 20% lower GPU performance than the Adreno 305 in Snapdragon 400 based on metrics like fillrate and the offscreen T-Rex benchmark. This provides evidence that Snapdragon 410 is also a cost-reduction effort in comparison with Snapdragon 400, with a smaller die size for the GPU to reduce cost. This also helps to explain why Qualcomm has aggressively pitched the Snapdragon 410 for low-end 4G smartphones as well as somewhat higher segments, with Snapdragon 410 reported to be Qualcomm's current main volume driver.

When looking at previous generation chips, the Adreno 305 in Snapdragon 400 scores higher than MediaTek's MT6582 in the offscreen T-Rex benchmark (approximately 40% better), while some low-level metrics are slower than MT6582. For example, GFXBench's Driver Overhead score is relatively low for both Snapdragon 400 and Snapdragon 410, reflecting mediocre performance when rendering lots of small objects. The fillrate benchmark is also a little lower than MT6582. The higher T-Rex benchmark performance is probably due to a more optimized and larger cache memory subsystem used in Snapdragon 400 and 410. Exactly how Snapdragon 400/410 compares with the MT6582 and other solutions in other benchmarks and games is beyond the scope of this article.

The next generation of efficient Cortex-A53-based mid-range SoCs: Snapdragon 610 and 615, MT6732 and MT6752

Several new chips for the mid-range performance segment are emerging that use a quad or octa-core Cortex-A53 CPU configuration. The use of Cortex-A53 cores at a relatively high clock frequency is promising to significantly improve power efficiency and cost for this segment (which might previously have required the use of more costly SoCs such as Snapdragon 801). This CPU configuration provides adequate single-core performance and (in the case of an octa-core CPU) great multi-core performance.

Both Qualcomm and MediaTek have introduced SoCs in this class, which also introduce new GPU architectures. Qualcomm's Snapdragon 610 (quad-core Cortex-A53) and Snapdragon 615 (octa-core Cortex-A53) utilize the new Adreno 405 GPU, while MediaTek's quad-core MT6732 and octa-core MT6752 utilize a Mali-T760 MP2 GPU (Mali-T760 has also been adopted by Samsung and others).

T-Rex offscreen performance of Snapdragon 615's Adreno 405 GPU (as represented by an entry for a Lenovo device) with a score of about 850 clearly puts the chip in the performance-oriented segment, since Snapdragon 400 and 410 score not much more than 300 in this benchmark. The OpenGL ES 3.0 Manhattan offscreen benchmark score is similarly significantly higher (about three times higher than Snapdragon 400/410). Low-level metrics are all fairly high for a mid-range device, with only fillrate being limited by the 32-bit DRAM interface.

MediaTek's MT6752 with Mali-T760 MP2 (as represented by a Gionee device entry) shows scores for T-Rex and Manhattan that are comparable with Snapdragon 615. Raw low-level metrics such as ALU, Alpha Blending and fillrate are clearly lower than Snapdragon 615, with only Driver Overhead being superior, suggesting that new ARM optimization technologies such as ARM Framebuffer Compression, Smart Composition and Transaction Elimination are already having a positive effect on real-world performance, especially within the bounds of a 32-bit DRAM interface, keeping device cost down.

In terms of cost, the ability of MediaTek's MT6752 to provide good performance for a mid-range device, comparable to Snapdragon 615, with an economical 32-bit DRAM interface, makes the chip look very attractive. This also provides evidence that ARM has made somewhat of a breakthrough in terms of performance efficiency with Mali-T760 and the associated optimization techniques mentioned above, mostly based on compression techniques, which will revolutionize performance for economical devices with a 32-bit memory interface that have limited memory bandwidth.

MediaTek's quad-core MT6732 (as represented by an Asus device entry), which also has a Mali-T760 MP2 GPU (but clocked lower than in the MT6752) scores lower but still very respectable (especially for the real-world T-Rex and Manhattan benchmarks) for a mid-range device. There have been reports though suggesting that the Mali-T760's efficiency benefits come at the cost of a relatively large chip die size for a cost-sensitive device, so that a chip such as the MT6732 is not suitable for the high-volume entry-level 4G market (for which Snapdragon 410 is likely to be much more suitable). MediaTek is addressing this with its upcoming MT6735 with cheaper Mali-T720 GPU, which does not appear to offer the bandwidth optimization techniques of the Mali-T760.

MT6592 still has competitive GPU performance

MediaTek's octa-core MT6592 smartphone chip (which was released almost a year ago) with a T-Rex offscreen score in excess of 700 has GPU performance that roughly matches that of the upcoming mid-range chips described above, which are addressing approximately the same segment. The high GPU clock speed of the Mali-450 MP4 GPU probably drives the high scores.

The disadvantages of the MT6592 are a lack of OpenGL ES 3.x support and a likely greater memory bandwidth bottleneck when running at high screen resolutions such as 1920x1080, which also impacts power efficiency. GFXBench's battery life benchmarks when running T-Rex long-term are mediocre for most MT6592-based devices, including devices using a 1280x720 resolution, although it is likely that less demanding 3D applications exhibit better battery life. The Cortex-A7 CPU cores (typically clocked at 1.7 GHz) are also slower than the eight Cortex-A53 cores inside a chip like the MT6752 (but still provide plenty of performance).

RK3288's Mali-T764 GPU: Exact nature unclear

Rockchip's RK3288 is a relatively high performance SoC intended primarily for tablets but currently mainly implemented in devices such as media boxes and development boards. For a long time, Rockchip has advertised its RK3288 SoC as featuring an ARM Mali-T764 GPU. This is confusing because ARM has never announced a GPU with that name. ARM's Mali-T760, also used in new SoCs from other companies such as Exynos 5433 (Exynos 7 Octa) and several new MediaTek SoCs, comes close, and one could assume Rockchip means a Mali-T760 MP4 configuration.

However, in the GFXBench results database, all device entries (mainly representing Android TV box devices, but also including tablets such as the Teclast P90HD) for the RK3288 show a set of GL_EXTENSIONS that is identical to that of devices with a Mali-T628 or Mali-T624 GPU. In particular, the GL_EXT_disjoint_timer_query, GL_EXT_sRGB and GL_EXT_sRGB_write_control extensions, which seem to be associated with Mali-T760-class devices, are missing. Whether this means that the RK3288 actually does not contain a Mali-T760-class GPU but instead an older generation Mali-T62x GPU, or this simply reflects non-optimal drivers, is unclear, but there certainly is a suggestion that the GPU inside the RK3288 is actually of an older (Mali-T62x generation) type.
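The extension check described above amounts to a simple set comparison. The Python sketch below shows the idea; the three marker extensions are the ones named in the text, while the example extension string is an illustrative placeholder and not actual data from an RK3288 device entry.

```python
# Extensions that, per the observation above, seem to be associated
# with Mali-T760-class devices.
T760_CLASS_MARKERS = {
    "GL_EXT_disjoint_timer_query",
    "GL_EXT_sRGB",
    "GL_EXT_sRGB_write_control",
}

def missing_markers(gl_extensions: str) -> set:
    """Return the marker extensions absent from a space-separated GL_EXTENSIONS string."""
    reported = set(gl_extensions.split())
    return T760_CLASS_MARKERS - reported

# Illustrative extension string (placeholder, not real RK3288 data).
example = "GL_OES_texture_npot GL_OES_depth24 GL_EXT_texture_format_BGRA8888"

print(sorted(missing_markers(example)))
```

Splitting the string into whole names before comparing avoids false matches, since GL_EXT_sRGB is a prefix of GL_EXT_sRGB_write_control. As noted in the text, a missing extension can also be the result of the driver version, so such a check is suggestive rather than conclusive.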

Earlier, Rockchip was not exactly forthcoming about the exact CPU cores inside the RK3288, which have been proven to be Cortex-A12 instead of Cortex-A17. ARM later helped Rockchip by declaring that Cortex-A12 would also be referred to as Cortex-A17, even though it is technically a different core for which Rockchip was one of the few known customers. CPU performance from benchmarks such as Geekbench suggests that the version of the Cortex-A12 core inside the RK3288 does not quite perform as fast as a real Cortex-A17, clock-for-clock.

While RK3288 does support OpenGL ES 3.0 (as do both Mali-T62x and Mali-T760), GFXBench does not allow the OpenGL ES 3.0 Manhattan benchmark to run on this chip for several TV box devices, which one would normally expect to be possible even if the GPU is technically Mali-T62x class. However, the Teclast P90HD tablet entry does show Manhattan benchmark results, which are consistent with a Mali-T62x MP4 GPU (or perhaps Mali-T7xx) configuration, while also showing reasonable sustained GPU performance and power efficiency.

Other tablet solutions

MediaTek's MT8382 chip for 3G tablets shows performance similar to that of the MT6582 smartphone chip, as expected, with a T-Rex offscreen score of about 220. MediaTek's previous generation WiFi-only MT8125 with PowerVR 544MP shows limited performance, lower than Mali-400 MP2 based designs, and slightly less than its previous-generation MT6589T smartphone chip with a similar GPU.

MediaTek's WiFi-only MT8127 with Mali-450 MP4 for somewhat higher performing tablets, shows higher performance with a T-Rex offscreen score of about 500, higher than the typical score of 350 of the popular RK3188T with Mali-400 MP4, which has commonly been used in tablets. However, the performance of the Mali-450 MP4 GPU appears to be clearly lower than the similar GPU configuration in the octa-core MT6592 smartphone chip, which scores more than 700 in T-Rex offscreen and scores higher in low-level metrics such as fillrate, probably due to the lower GPU clock speed of the MT8127. The MT8135V used in recent Amazon Kindle Fire tablets shows good mid-range performance with a T-Rex offscreen score of 740. This results in good performance given the low screen resolution of the Kindle tablets, but performance is otherwise low for a PowerVR Rogue class GPU.

As mentioned, Rockchip's popular RK3188T chip with Mali-400 MP4 clocked at about 400 MHz scores about 350 in T-Rex offscreen, which is higher than typical cost-sensitive tablet processors, and also scores higher in low-level metrics such as fillrate.

Thanks to the PowerVR 544 MP2 GPU, Allwinner's aging A31s processor still shows higher performance than Mali-400 MP2-based chips such as MT8382. Allwinner's more recent mass-market chips such as A23 and A33 with Mali-400 MP2 have been slow to come to market, and I haven't yet analyzed their GPU performance, but it is unlikely to be spectacular.

An entry for Leadcore's L1860 with Mali-T628 MP2 GPU shows a T-Rex offscreen score of about 580, and it is compatible with OpenGL ES 3.0. The score reflects a fillrate that might still allow higher resolutions such as 1920x1080 to be used in tablets using this chip, with reasonable but not great GPU performance to be expected, helped by a relatively high GPU clock speed.

Intel's Atom Z3745 processor for the tablet market shows high performance for its class, with the Acer A1-840 FHD (which uses the higher-end Z3745F variant with 64-bit memory interface) scoring a fairly impressive 1181 in the T-Rex offscreen benchmark. The more commonly used cost-sensitive Z3745G with 32-bit memory interface, as used in the Acer A1-840, scores a still very reasonable 853 in T-Rex offscreen. Both processors have relatively good OpenGL ES 3.0 performance, resulting in relatively high Manhattan benchmark scores for their class (higher than chips such as Snapdragon 610/615).

Finally, the results for the Actions ATM7021, a fairly recent ultra-low-end tablet processor, show signs of blatant benchmark cheating, with the offscreen (1920x1080) T-Rex score being several times higher than the on-screen score for a device with a screen resolution of 1024x768 (one would expect the offscreen score to be several times lower).
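
The reasoning behind this check can be sketched with a little arithmetic. The snippet below is only a rough illustration, assuming that the frame rate of a pixel-bound benchmark scales roughly inversely with the number of pixels rendered and that the on-screen run is not capped by vsync. The function names, the tolerance factor and the example scores are hypothetical and are not taken from the GFXBench database.

```python
def pixel_count(width, height):
    """Number of pixels rendered per frame at a given resolution."""
    return width * height


def expected_offscreen_ratio(onscreen_res, offscreen_res=(1920, 1080)):
    """Rough expected offscreen/on-screen score ratio.

    Assumes the score scales inversely with pixel count, which is a
    simplification: real results also depend on geometry load, memory
    bandwidth and driver behaviour.
    """
    return pixel_count(*onscreen_res) / pixel_count(*offscreen_res)


def looks_suspicious(onscreen_score, offscreen_score, onscreen_res,
                     offscreen_res=(1920, 1080), tolerance=2.0):
    """Flag a result whose offscreen score is far above the rough estimate.

    'tolerance' is an arbitrary (hypothetical) safety margin.
    """
    expected = expected_offscreen_ratio(onscreen_res, offscreen_res)
    observed = offscreen_score / onscreen_score
    return observed > expected * tolerance


if __name__ == "__main__":
    ratio = expected_offscreen_ratio((1024, 768))
    print(f"Expected offscreen/on-screen ratio at 1024x768: {ratio:.2f}")
    # Hypothetical scores: offscreen three times the on-screen score.
    print(looks_suspicious(100, 300, (1024, 768)))
```

For a 1024x768 screen the 1080p offscreen test renders roughly 2.6 times as many pixels, so under these assumptions an offscreen score well below the on-screen score would be the normal pattern, and the reverse is a red flag.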

Note about T-Rex benchmark and cost-sensitive GPUs

Because GFXBench's T-Rex benchmark targets a fairly detailed and advanced level of rendering that requires a reasonably high-end GPU for good results, it is likely to understate practical GPU performance for low-end devices. Part of the reason is the much smaller L2 cache associated with low-end GPUs like Mali-400 MP2 and especially Mali-400 MP, which is unlikely to be large enough to accommodate the T-Rex benchmark's relatively large textures and other demands, resulting in much more expensive external RAM access and a relatively low benchmark score. More typical, less demanding GPU applications of the Angry Birds and Temple Run type are likely to perform better in relative terms on these platforms (although there will still be variation between chips), and GFXBench's low-level benchmarks provide some information on this.

GFXBench's battery life benchmark is also likely to understate practical battery life for devices with GPUs such as Mali-400 MP and probably Mali-450 MP because of its higher than typical rendering complexity and relatively large texture working set, with battery life for less demanding GPU applications likely to be significantly better.

Sources: GFXBench results database

Updated December 4, 2014 (Make corrections and add comments about Snapdragon 615's 64-bit memory interface vs MT6752's 32-bit memory interface; add section about T-Rex benchmark's complexity negatively affecting low-end GPU scores).
Updated December 25, 2014 (Correct Snapdragon 615 memory interface width).
Updated December 26, 2014 (Provide slightly updated, sorted GPU benchmark results tables).