What are the chances of AMD in the server market? Part II

Chapter 1. Architecture and production strategy

This is the second article in a series of analytical reviews of AMD’s server strategy, its processors, technologies and overall server “ecosystem”.

To return to the server market successfully, AMD needed to optimize capital and operating costs and reduce the risks associated with producing a diversified range of x86 chips for all major market segments. The company decided to unify production as much as possible, tying together the principles of the Zen microarchitecture, the chip design philosophy and the final assembly of processors:

CPU Complex

The fundamental logical building block is the Core Complex (CCX), which consists of four cores and an L3 cache shared among them. The illustration shows the characteristics of such a complex.

All AMD Zen 1 processors are built from these CCX blocks, interconnected by the Infinity Fabric (IF), which can link CCX blocks within a single processor, two server processors to each other, or, say, a processor and an accelerator. Read more about IF here 1.

Two CCX blocks with cores and cache memory form the basis of the universal Zeppelin die. This “system-on-chip” (SoC) can do without an additional chipset: in addition to the CCX units, the die integrates the memory and I/O controllers as well as the rest of the chipset logic. Any model of AMD processor can be created from Zeppelin by disabling some of the cores and the functionality specific to mobile or server use.

CCX blocks

The Zeppelin chip turned out to be universal and at the same time redundant, because each instance contained a full range of supported technologies: from power management, which is important primarily for the mobile segment, to reliability technologies (such as ECC memory and cache protection), encryption of physical memory, hypervisors and virtual machines, which are important for the server segment.

Later processors with integrated graphics combine a CCX and a GPU on a single die. Thus, two basic variants, CCX + CCX and CCX + GPU, completely cover AMD’s needs in producing the entire range of its first-generation Zen processors.

The Zeppelin die was used to create processors for desktops, high-performance desktops and servers. The die variant with a mobile Vega graphics core (CCX + GPU) is close to it in design. The topology of the EPYC 7001 series server processor, codenamed Naples, is as follows:

Processor topology

EPYC 7001 series server processors are multi-chip modules (MCM) whose characteristics are, naturally, a “quadrupling” of those of the desktop models:

up to 32 cores (4 x 8 cores on desktop),
up to 64 MB of L3 cache (4 x 16 MB on desktop),
8 memory channels (4 x 2 channels on desktop),
a maximum of 128 PCIe lanes.

Thus, the configuration of the lowest processor in the line is predetermined: 8 cores (1 active core in each CCX on all 4 dies). Similarly, using 2 to 4 universal Zeppelin dies, AMD created its Threadripper HEDT processors for high-performance desktops.
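
The “quadrupling” arithmetic can be sketched in a few lines of Python. This is only an illustration: the class and function names are hypothetical, and the 8 MB of L3 per CCX is derived from the 16 MB quoted for a two-CCX desktop die. The sketch reports only the maximum L3 capacity and does not model partially disabled cache.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ZeppelinDie:
    """One universal die, with figures taken from the text above."""
    ccx_per_die: int = 2       # two CCX blocks per die
    cores_per_ccx: int = 4     # four cores per CCX
    l3_mb_per_ccx: int = 8     # 16 MB per die / 2 CCX (derived value)
    memory_channels: int = 2   # two memory channels per die


def build_package(dies: int, active_cores_per_ccx: int,
                  die: ZeppelinDie = ZeppelinDie()) -> dict:
    """Return the totals for a multi-chip module made of identical dies."""
    if not 1 <= active_cores_per_ccx <= die.cores_per_ccx:
        raise ValueError("active cores per CCX out of range")
    return {
        "cores": dies * die.ccx_per_die * active_cores_per_ccx,
        "max_l3_mb": dies * die.ccx_per_die * die.l3_mb_per_ccx,
        "memory_channels": dies * die.memory_channels,
    }


top_model = build_package(dies=4, active_cores_per_ccx=4)
entry_model = build_package(dies=4, active_cores_per_ccx=1)
print(top_model["cores"], top_model["max_l3_mb"], top_model["memory_channels"])
print(entry_model["cores"])
```

With four dies and all cores active, the sketch gives the 32 cores, 64 MB of L3 and 8 memory channels listed above; leaving one core active per CCX gives the 8-core entry model.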

Although the redundancy of a single Zeppelin die somewhat increases the SoC die area, this unification reduces the range of manufactured chips to essentially one basic variant. According to AMD, this significantly improves the yield of usable dies, simplifies the planning and loading of production lines (nodes) at the contract manufacturer, and increases flexibility in the assembly and marking of processors for different market segments. All this made it possible to optimize the strategy and costs associated with planning and producing first-generation Zen processors. In particular, the manufacturing cost of four Zeppelin dies is 59 percent of the hypothetical cost of a monolithic 32-core die 2.
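
Why several small dies can be cheaper than one large die can be illustrated with a simple Poisson yield model. This is a textbook approximation, not AMD’s cost model: the die areas and the defect density below are hypothetical placeholders, and packaging cost and the salvage of partially defective dies are ignored, so the result is not expected to reproduce the 59 percent figure.

```python
import math


def poisson_yield(area_mm2: float, defects_per_mm2: float) -> float:
    """Fraction of dies with zero defects under a Poisson defect model."""
    return math.exp(-area_mm2 * defects_per_mm2)


def cost_per_good_die(area_mm2: float, defects_per_mm2: float) -> float:
    """Relative cost: silicon area paid for, divided by the fraction that works."""
    return area_mm2 / poisson_yield(area_mm2, defects_per_mm2)


def mcm_cost_ratio(small_area_mm2: float, dies: int,
                   mono_area_mm2: float, defects_per_mm2: float) -> float:
    """Cost of `dies` small good dies relative to one good monolithic die."""
    mcm = dies * cost_per_good_die(small_area_mm2, defects_per_mm2)
    mono = cost_per_good_die(mono_area_mm2, defects_per_mm2)
    return mcm / mono


# Hypothetical inputs: four 200 mm^2 dies versus one 720 mm^2 monolithic die
# (the monolithic die is assumed to save some area on duplicated logic).
for d0 in (0.0, 0.001, 0.002):
    ratio = mcm_cost_ratio(200.0, 4, 720.0, d0)
    print(f"defect density {d0:.3f}/mm^2 -> relative cost {ratio:.2f}")
```

With zero defects the multi-chip variant is more expensive, because it uses more silicon in total; as the defect density rises, the small dies win because yield falls exponentially with die area.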

AMD has long been looking for ways to reduce costs. The company disposed of some of its assets, reducing the related capital and operating expenses. Its factory in Dresden (Fab 30) was sold to GlobalFoundries in 2008-2009, and the headquarters complex in Austin was also sold and came under the control of a management company 1.

It is very likely that Intel partially offsets the higher cost of its monolithic Xeon Scalable dies through higher selling prices and a much more varied product range, in which every die finds a use. Strategically, however, at process nodes below 14 nanometers the monolithic architecture becomes too complex and expensive to design and manufacture, even for the market leader.

What we see is a clash between two manufacturing philosophies. It is difficult to say which carries more risk: owning many factories that must be upgraded and constantly kept loaded with production orders while releasing a wide range of processors and chips for completely different types of devices, or the fabless approach, in which the company concentrates on unified CPUs and graphics chips while depending entirely on the success of many third-party partners, including the manufacturers TSMC and GlobalFoundries and the electronic design automation (EDA) vendor Synopsys.

In 2015, Intel planned a transition to a 10-nanometer process, which was supposed to neutralize any performance gap opened up by the competitor’s 14-nanometer processors. However, against the background of problems with the launch of the new process, AMD’s new microarchitecture, with its 52 percent increase in IPC (instructions per clock), came as unpleasant news for Intel.

Initially, for the transition from 14 nm to 10 nm, Intel planned a highly aggressive 2.68-fold increase in transistor density, whereas TSMC chose to reduce risk by settling for a density increase of only 2.24 times. Below is a slide from an Intel presentation:

Transistor density by process node

This optimism was one of the reasons for the problems with the 10 nm process: Intel was unable to bring the original version of its 10 nm process to yields suitable for mass production, and it has recently been forced to respond to AMD’s products with a modified 14 nm process once again, for the fourth time.

As a result of all these events, something unprecedented happened: Intel lost its leadership in process technology, which significantly affects the balance of power. TSMC and Samsung have become the world leaders in manufacturing processes.

So far, AMD’s strategy of moving to new process nodes has been winning, but this does not guarantee the company’s long-term success. The only thing analysts agree on is that the concepts of modular processors built from chiplets (MCM, SoC) and of stacked chips (SiP: 2.5D, 3D, EMIB, Foveros) are technologically justified and promising: the near future belongs to multi-chip composite processors.

Chapter 2. Sockets and Functionality

The next most important element of AMD’s strategy after the unification of production was the creation of a powerful single-socket server platform endowed with all available technologies and system characteristics. AMD rightly announced a new class of “uncompromising” single-socket servers whose performance and resources make it possible to simplify the design, components and cost of motherboards, increase compute density, and reduce the number of processors and power supplies as well as the total power consumption of servers.

AMD’s decision to offer a powerful single-socket platform was dictated by the sales statistics of dual-socket servers with one processor installed* (table 1).

Price segment	Share of all 2-socket server volume	Share of 2-socket servers shipped with a single CPU
$3000-$5999	43%	35%
$6000-$9999	30%	39%

Table 1. Two-socket rack servers shipped with only one processor installed, first through third quarter of 2016.
Source: EPYC Empowers Single-Socket Servers, Tirias Research and IDC.
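
The two columns of Table 1 can be combined to estimate how much of the whole two-socket market ships half-populated. The sketch below assumes that the second column is a share of all two-socket volume and the third is a share within each price band, as the column headers suggest; the function name is hypothetical, and price bands not listed in the table are ignored.

```python
# Rows of Table 1: (price band, share of all 2-socket volume,
#                   share of that band shipped with a single CPU)
TABLE_1 = [
    ("$3000-$5999", 0.43, 0.35),
    ("$6000-$9999", 0.30, 0.39),
]


def single_cpu_share(rows) -> float:
    """Share of all 2-socket servers shipped with one CPU, summed over the rows."""
    return sum(volume * single for _, volume, single in rows)


share = single_cpu_share(TABLE_1)
print(f"{share:.1%} of all 2-socket server volume")
```

Under these assumptions, the two price bands alone account for a little over a quarter of all two-socket servers being sold with a single processor.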

Gartner has recently estimated that by 2021, 80% of server applications will be served by single-socket servers. With the growth of the number of cores, the importance of multi-socket solutions for a wide range of applications gradually decreases, so AMD decided not to offer processors for four-socket systems at all.

Intel has historically trained the market to buy higher-end processors by limiting the functionality of both its single-socket series and its low- and mid-range chips for dual-socket systems. EPYC upsets this status quo, because any first-generation single- or dual-socket AMD server processor supports up to 2 TB of memory per socket with Chipkill 4-bit error correction, as well as 128 PCIe lanes for connecting multiple drives, cards and accelerators.

Easy-to-understand positioning of server processors is the next element of AMD’s server strategy. The company refused to limit the functionality either of single-socket processors in general or of the lower models of its single-socket and dual-socket processors (table 2).

Table 2

Table 2. Nomenclature and characteristics of EPYC 7001 models. The 7xx1P models are intended for single-socket systems.

AMD EPYC price tiers are based on differences in core count and frequency. This contrasts favorably with Intel’s current segmentation, which forces customers to buy more expensive chip versions for dual-socket and quad-socket servers in order to get additional memory, turbo frequencies, or more I/O lanes.

On paper, AMD had a great plan. EPYC 7001 series processors looked very convincing against the Intel Xeon E5-26xx v4 processors on the market at that time. Using its long-standing strategy of “more functionality for comparable money”, AMD hoped to offer the market a compelling product whose superiority over the competitor would be comparable to that of Opteron.

This plan had one serious drawback: non-uniform memory access (NUMA) from the cores to system memory. It follows directly from the processor topology described above (4 universal dies on a common substrate) that the EPYC 7001 is a processor with four NUMA domains. Since its 8 memory channels actually belong to four different Zeppelin dies, each core sees both “near” memory (attached to its own die) and “remote” memory (up to the die diagonally opposite), with a range of access latencies to DIMMs attached to the memory controllers of the different Zeppelin dies. In a two-socket system, the memory of the second socket is even more remote, with still higher latency.

Even without performance data, one can expect that such a processor delivers the greatest benefit either to NUMA-optimized software or to a set of unrelated, independent tasks (provided that the OS distributes them sensibly across the address space). Conversely, software that ignores the NUMA organization and does not optimize the placement of data in memory should run somewhat less efficiently than on monolithic Intel Xeon processors.
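
The effect of data placement can be made concrete with a toy latency model. The latency values below are hypothetical and chosen only for illustration, not measured on any EPYC system; the function name and the assumption that non-local accesses are spread evenly over all other dies are likewise simplifications.

```python
# Hypothetical latencies in nanoseconds, chosen only for illustration.
LATENCY_NS = {"local": 90.0, "same_socket": 140.0, "other_socket": 200.0}


def average_latency_ns(local_fraction: float, sockets: int = 1,
                       dies_per_socket: int = 4,
                       latency: dict = LATENCY_NS) -> float:
    """Mean access latency when non-local accesses are spread evenly
    over all other dies in the system."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    remote_same = dies_per_socket - 1              # other dies in this socket
    remote_other = (sockets - 1) * dies_per_socket  # dies in other sockets
    remote_total = remote_same + remote_other
    if remote_total == 0:
        return latency["local"]
    remote_mean = (remote_same * latency["same_socket"]
                   + remote_other * latency["other_socket"]) / remote_total
    return (local_fraction * latency["local"]
            + (1.0 - local_fraction) * remote_mean)


numa_aware = average_latency_ns(local_fraction=1.0)
interleaved_1s = average_latency_ns(local_fraction=1 / 4)
interleaved_2s = average_latency_ns(local_fraction=1 / 8, sockets=2)
print(numa_aware, interleaved_1s, interleaved_2s)
```

In this model, software that keeps its data on the local die sees only the local latency, while software that spreads data uniformly over four or eight dies pays a weighted average dominated by the remote hops.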

On the other hand, AMD expected a significant superiority of EPYC characteristics, so some efficiency reduction in the case of non-optimized programs for NUMA could be sacrificed. The compromise seemed reasonable.

Chapter 3. Sobering reality

AMD was not the only one making plans for the summer of 2017; its competitor was not standing still either. Intel was preparing yet another server platform change, to the new Xeon Scalable generation. For the first time in four generations of server platforms, significant changes were planned: a wider memory subsystem (6 channels instead of 4) and higher per-core performance, both integer (for which Intel rebalanced the caches compared with desktop models) and floating-point (for which it raised the clock frequency and the processor’s thermal limit). With Xeon totally dominating the market, Intel’s only possible goal was to further increase the profitability of its server business.

While planning to increase overall performance, Intel introduced a new segmentation: all Xeon Scalable processors were divided into 4 tiers (Platinum, Gold, Silver, Bronze), and the lower tiers were also handicapped in performance (the second AVX-512 unit was disabled and the supported memory frequency was lowered). Moreover, among the top models, several carrying the letter “M” were singled out as supporting up to 1.5 TB of memory, while all the rest were limited to 768 GB. The recommended price of the top processors rose significantly, up to 13,000 dollars. Later, in the second generation of Xeon Scalable, Intel raised the supported memory to 1 TB for regular models, and to 3 TB and 4.5 TB for models with the letters “M” and “L” respectively.

Table 3. Nomenclature of the first generation of Intel Xeon Scalable processors

Table 4. Nomenclature of the second generation of Intel Xeon Scalable processors

For both manufacturers, the encounter with reality did not go as planned. Having announced EPYC in May 2017, AMD found that its advantage over the soon-to-be-announced Xeon Scalable platform was not so overwhelming: 8 memory channels are more than 6, but only by a third rather than by a factor of two; 32 cores are more than the 28 of the new Xeon, but the lead is no longer 10 cores, as it was before.
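
The shrinking lead is easy to quantify. In the sketch below, the 4 memory channels of the previous Xeon generation come from the text above, and its 22 cores are derived from the “10 cores” lead mentioned in this paragraph (32 - 10); the helper name is hypothetical.

```python
def relative_advantage(new: float, old: float) -> float:
    """Fractional lead of `new` over `old` (1.0 means twice as much)."""
    return new / old - 1.0


EPYC = {"channels": 8, "cores": 32}
XEON_PREVIOUS = {"channels": 4, "cores": 22}   # 22 cores derived as 32 - 10
XEON_SCALABLE = {"channels": 6, "cores": 28}

for name, xeon in (("previous Xeon", XEON_PREVIOUS),
                   ("Xeon Scalable", XEON_SCALABLE)):
    channels = relative_advantage(EPYC["channels"], xeon["channels"])
    core_lead = EPYC["cores"] - xeon["cores"]
    print(f"vs {name}: {channels:+.0%} memory channels, {core_lead:+d} cores")
```

Against the previous generation EPYC would have had twice the memory channels and 10 more cores; against Xeon Scalable the lead shrinks to a third more channels and 4 more cores.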

Intel, in turn, found itself in a rather delicate situation, with inflated prices for processors of more modest functionality. AMD’s competitive EPYC 7001 line undermined its carefully constructed segment positioning: it is difficult to justify a high price for features that the competitor offers even in its low-end models. The 13-thousand-dollar price of the top Xeon Scalable processors inevitably led tech-savvy customers to take notice of AMD EPYC with its large number of cores, offered for 8 thousand dollars.

As a result, both companies ended up in a kind of clinch. Intel has experienced, and is still experiencing, serious problems with the transition to its revised 10-nanometer process. AMD found it harder to promote its 7001-based platform, because the performance advantage proved uneven and more modest than planned, and the increased latency of the NUMA architecture, with its additional hops when accessing remote memory, became more apparent against the background of the monolithic Xeon Scalable architecture.

The integer performance of Xeon Scalable also increased significantly, so AMD had to choose carefully which workloads to use for comparison. The undeniable advantages of EPYC 7001, such as 128 PCIe lanes or 2 TB of supported memory on every processor in the line, are not always needed. Finally, AMD’s recommendation to populate all EPYC memory channels for optimal performance slightly increased the cost of a server.

In some places AMD also ran up against the redundancy of its own solutions: some customers prefer 4-8 high-frequency cores, because their software scales poorly with core count but scales well with frequency. Later, in 2019, the EPYC 7371 went on sale with a high base frequency (3.1 GHz) and a relatively small number of cores (16), but the niche of low-core-count processors remains a zone of complete Intel domination.

An equally thorny issue is expensive commercial software, which is often licensed per core: a 32-core processor instead of a 16-core one brings not only the joy of a performance gain but also twice the licensing cost. Sometimes the cost more than doubles, as in the case of Windows Server, whose license covers up to 16 cores, while anything beyond that requires the much more expensive Datacenter edition*.

There are positive exceptions: for example, VMware’s per-socket licensing, thanks to which first-generation AMD EPYC processors sell successfully in A-brand systems for building virtual ESX servers, virtual desktops (VDI) and hyper-converged software-defined storage (vSAN).
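
The difference between the two licensing schemes can be shown with a minimal sketch. The prices are hypothetical placeholders and do not correspond to any vendor’s price list; the function names are likewise illustrative.

```python
def per_core_license_cost(cores_per_socket: int, sockets: int,
                          price_per_core: float) -> float:
    """Total cost when every physical core must be licensed."""
    return cores_per_socket * sockets * price_per_core


def per_socket_license_cost(sockets: int, price_per_socket: float) -> float:
    """Total cost when only populated sockets are counted."""
    return sockets * price_per_socket


PRICE_PER_CORE = 100.0       # hypothetical price, arbitrary currency units
PRICE_PER_SOCKET = 3000.0    # hypothetical price, arbitrary currency units

for cores in (16, 32):
    by_core = per_core_license_cost(cores, 1, PRICE_PER_CORE)
    by_socket = per_socket_license_cost(1, PRICE_PER_SOCKET)
    print(f"{cores} cores: per-core {by_core:.0f}, per-socket {by_socket:.0f}")
```

Under per-core licensing, moving from 16 to 32 cores doubles the bill, while under per-socket licensing the cost of a single-socket server stays the same regardless of core count.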

Chapter 4. EPYC 7001 two-year sales summary

Announced in May 2017, the EPYC 7001 processors were intended primarily for data centers: they relied on high core counts at relatively low power consumption (initially up to 180 W, then up to 200 W), on the expandability of the memory and I/O subsystems, and on security technologies that include “trusted boot” of the BIOS and OS and encryption of physical memory, hypervisors and virtual machines.

It is worth noting that, unlike Opteron, AMD EPYC was quickly adopted by the industry. Single- and dual-socket servers were announced by HPE (ProLiant DL325, DL385) and Dell EMC (PowerEdge R6415, R7415, R7425). Cisco released the UCS C125 twin system as part of the C4200 (2U 4N) chassis. A variety of platforms and motherboards are available from ASRock Rack, ASUSTeK, Gigabyte, Quanta, Supermicro, Tyan and a number of other manufacturers.

Under a multi-year monopoly, Intel’s weighted average selling prices have risen by an average of two percent per quarter since Opteron lost its lead in 2006. This slide, obtained from one of the OEMs, perfectly describes the situation:

Intel weighted average price dynamics

Dynamics of weighted average Intel prices.
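
The cumulative effect of a two percent quarterly increase is easy to underestimate. The sketch below compounds it from 2006 to 2017; the end year (the EPYC launch) and the assumption of a constant rate are simplifications introduced here, and the function name is hypothetical.

```python
def compounded_price(start_price: float, quarterly_growth: float,
                     quarters: int) -> float:
    """Price after `quarters` periods of constant compound growth."""
    return start_price * (1.0 + quarterly_growth) ** quarters


quarters = (2017 - 2006) * 4          # 44 quarters, an assumed time span
index = compounded_price(1.0, 0.02, quarters)
print(f"price index after {quarters} quarters: {index:.2f}x")
```

Under these assumptions the weighted average price more than doubles over the period, which helps explain why OEMs welcomed a competitive alternative.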

The long-awaited competitive alternative made it possible to ease the price pressure and to count on project discounts from Intel, so OEMs and ODMs added EPYC-based servers to their product portfolios relatively quickly.

As for end customers, AMD has focused on promoting EPYC to the big seven hyperscalers. Demand from hyperscalers, cloud providers and regional hosting companies has become the main source of AMD’s server processor sales. In its presentations and press releases, the company mentions the American Amazon Web Services (AWS), Microsoft Azure, Oracle Cloud, Dropbox and Packet, the Chinese Baidu and Tencent, and the European Hetzner, Scaleway and MyLoc. In Russia, St. Petersburg-based Selectel has announced that it uses EPYC. Recently, Google and Twitter joined the list of AMD customers. Hyperscalers typically buy tens of thousands of processors per quarter, which allows AMD to build a foothold for further expansion into the server market.

Microsoft offered at least three Azure cloud configurations on the EPYC 7001, including “Lsv2”, based on the EPYC 7551, with 80 virtual processors and up to 19 TB of flash storage for high-performance computing. The AMD-based AWS instances are positioned both for general-purpose workloads and for applications that depend on a fast memory subsystem. AWS also offers fast one-click migration to an EPYC-based virtual machine. Oracle offers both virtual machines and physical AMD servers for deploying a wide range of its own applications. Finally, following the same principles as Dropbox, Baidu migrated its ABC services (AI, Big Data, and Cloud computing) to single-socket EPYC platforms.

Without significant resources to develop demand in the mass market, AMD concentrates on the largest consumers, as it did in the days of Opteron. A similar strategy of sales to large customers can be observed in the segment of manufacturers of game consoles, where Sony and Microsoft have been using custom AMD chips for many years.

The supercomputer industry is also characterized by high technical expertise and fairly large project volumes. However, the lower number of floating-point operations per clock cycle and a certain frequency lag of the first-generation EPYC did not allow AMD to quickly repeat Opteron’s success in the compute cluster segment (in 2006, Opteron-based systems accounted for 22% of the Top500 list). In the June 2019 edition, published before the launch of the second-generation EPYC, the only such system was the Chinese Sugon cluster 1, built on AMD-licensed clones of EPYC (7501) called Hygon Dhyana.

Unlike the new record-setting 7002, which performs well across the board, the EPYC 7001 holds its own against Xeon Scalable in computing clusters only in some applications. Despite its larger number of cores (32 vs. 28) and AVX units that do not reduce frequency under load (AMD’s AVX implementation in the form of separate FADD + FMUL units is slightly more flexible, and therefore more efficient, than Intel’s FMA, which needs fused instructions to reach peak performance), the EPYC 7001 loses to Intel in peak FLOPs per core.
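
The gap in peak FLOPs per core can be estimated from the width and number of the floating-point pipes. The pipeline configurations below are the commonly cited ones and should be treated as assumptions: two 128-bit FADD plus two 128-bit FMUL pipes for first-generation Zen, and two 512-bit FMA units for the top Xeon Scalable models. Clock frequencies are deliberately left out, so the figures are per cycle only; the function names are hypothetical.

```python
def dp_lanes(vector_bits: int) -> int:
    """Number of 64-bit double-precision elements in one vector register."""
    return vector_bits // 64


def peak_flops_separate(vector_bits: int, add_pipes: int, mul_pipes: int) -> int:
    """Peak DP FLOPs per cycle with separate add and multiply pipes."""
    return dp_lanes(vector_bits) * (add_pipes + mul_pipes)


def peak_flops_fma(vector_bits: int, fma_pipes: int) -> int:
    """Peak DP FLOPs per cycle with fused multiply-add pipes (2 ops each)."""
    return dp_lanes(vector_bits) * fma_pipes * 2


zen1_per_core = peak_flops_separate(vector_bits=128, add_pipes=2, mul_pipes=2)
xeon_per_core = peak_flops_fma(vector_bits=512, fma_pipes=2)

print("per core, per cycle:", zen1_per_core, "vs", xeon_per_core)
print("per socket, per cycle:", zen1_per_core * 32, "vs", xeon_per_core * 28)
```

Under these assumptions, four extra cores cannot compensate for the fourfold difference in per-core vector throughput, although real code rarely reaches either peak and the separate add and multiply pipes do not depend on fused instructions.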

In fairness, 7001 series processors are supplied to computing projects from time to time, for example as part of the Dell EMC cluster for subatomic research at Nikhef and the NEC system for aerodynamic testing at the German Aerospace Center (DLR), which consists of two thousand three hundred nodes based on the EPYC 7601. Typically, use of the first generation is limited to applications that benefit from a fast memory subsystem, such as fluid and gas dynamics.

At the end of 2018, ten months before the launch of the 7002 series, AMD itself created a problem of deferred demand by publicly announcing a significant performance increase in the next generation of processors, EPYC 7002 “Rome”. According to AMD employees, key customers began receiving engineering samples for testing at the beginning of 2019 and, impressed by the performance, decided to wait for the new processor to go on sale, which probably left AMD’s warehouses overstocked with 7001 series processors.

At the time of publication in September 2019, the 7001 series remains a competitive and attractive solution for virtualization, cloud containers, high-security environments, and software-defined storage and switching systems. This is helped by an attractive core-to-memory ratio, a good performance-to-price ratio, a powerful I/O subsystem, an attractive total cost of ownership for single-socket systems, and integrated security technologies.

Therefore, the launch of the second-generation EPYC does not mean that first-generation processors will be withdrawn from sale. SP3 socket motherboards support both generations of processors, with certain reservations, and in order to preserve the profit structure of the 7002 series, AMD is likely to offer flexible project discounts on the 7001 series in an attempt to clear its inventory.

Thus, EPYC 7001 became a kind of trial balloon for AMD, allowing the industry to try out the new platform and, in particular, to appreciate the appeal of its single-socket servers. Dell EMC technical director Robert W. Hormuth effectively confirmed the success of AMD’s single-socket concept, saying that EPYC really did allow Dell EMC to introduce a new, previously unattainable class of single-processor servers.

Despite much broader industry support this time, AMD was not able to quickly gain a market share comparable to Opteron’s. Many analysts suggest that Opteron’s former success is destined for the second generation of EPYC. This is supported by the announcements of new supercomputers based on second-generation 7002 series processors, which began even before the processors’ availability was announced:

Cray: four systems, including the future leader of the Top500 list, the Frontier supercomputer, whose total hybrid performance of 1.5 exaflops will exceed the combined performance of the next 160 systems on the list,
HPE: the Hawk supercomputer with a performance of 22 PF for the HLRS computing center, which specializes in supporting German engineering companies and the automotive industry,
Atos (formerly Bull): a BullSequana supercomputer with a performance of 11 PF for the Finnish science center CSC.

As a rule, a wave of supercomputer announcements based on upcoming x86 chips indicates their sales potential in the general server market.

In the next chapter, we will begin a detailed account of the new series of EPYC processors, their positioning, prospects, features and differences from their predecessor’s microarchitecture.

Viktor Kartunov
12/09.2019