Processor architectures

From HP-SEE Wiki



Intel processors

Current architectures: Sandy Bridge and its successor, Ivy Bridge. Next architecture: Haswell (expected by March 2013).

New manycore architecture: Intel MIC (Many Integrated Core). Products: Knights Ferry, Knights Corner (Xeon Phi, 2012), Knights Landing.

Intel uses the Tick-Tock principle for the development cycle of its processors. "Tick-Tock" is a model adopted by Intel in 2007 that follows every microarchitectural change with a die shrink of the process technology: every "tick" is a shrink of the process technology of the previous microarchitecture, and every "tock" is a new microarchitecture. One tick or tock is expected every year.

Sandy Bridge architecture

Sandy Bridge is a microprocessor architecture developed by Intel for central processing units in computers to replace the Nehalem microarchitecture. Intel demonstrated a Sandy Bridge processor in 2009 and released the first products based on the architecture in January 2011 under the Core brand.

Sandy Bridge processors were originally produced on a 32 nm process, while the subsequent Ivy Bridge processors use a 22 nm process. Main advantages include:

  • 32KB data + 32KB instruction L1 cache and 256KB L2 cache per core
  • Shared L3 cache includes the processor graphics
  • 64-byte CPU cache line size
  • Two load/store operations per CPU cycle for each memory channel
  • Decoded micro-operation cache and enlarged, optimized branch predictor
  • Improved performance for cryptography
  • 256-bit/cycle ring bus interconnect between cores, graphics, cache and System Agent Domain
  • Advanced Vector Extensions (AVX) 256-bit instruction set with wider vectors, new extensible syntax and rich functionality
  • Intel Quick Sync Video, hardware support for video encoding and decoding
  • Up to 8 physical cores or 16 logical cores through Hyper-threading

Haswell architecture

Haswell is a microprocessor architecture developed as the successor to the Ivy Bridge architecture. Haswell will be the first processor designed from the ground up to fully exploit the power savings and performance benefits of the move to 3D (tri-gate) transistors on the 22 nm process node. Haswell will also increase the performance of the integrated GPU, bringing it close to that of $50-$70 graphics cards.

Main improvements include:

  • Haswell New Instructions (includes Advanced Vector Extensions 2 (AVX2), gather, bit manipulation, and FMA3 support)
  • New sockets — LGA 1150 for desktops and rPGA947 & BGA1364 for the mobile market
  • Intel Transactional Synchronization Extensions (TSX)
  • Three versions of the integrated GPU
  • Graphics support in hardware for Direct3D 11.1 and OpenGL 4.0
  • DDR4 for the enterprise/server variant (Haswell-EX)
  • Variable base clock, as on LGA 2011
  • A new cache design
  • 128-byte CPU cache line size
  • New advanced power-saving system
  • Up to 32MB Unified cache LLC (Last Level Cache)

Future architectures

Broadwell will be the 14 nm shrink of the Haswell architecture. Broadwell will adopt a Multi-Chip Package (MCP) design, which could mean that Haswell motherboards will not be compatible with Broadwell processors.

Skylake is the codename for a processor microarchitecture to be developed by Intel as the successor to the Haswell architecture. Skylake will use a 14 nm process. Skymont will be the 10 nm shrink of Skylake and is due out the year after the introduction of the Skylake microarchitecture.

Many Integrated Cores (MIC)

Intel Many Integrated Core Architecture, or Intel MIC, is a multiprocessor computer architecture developed by Intel. The Intel Many Integrated Core architecture brings an SMP programming environment onto a single chip. A co-processor, code-named Knights Corner, will be the first Intel product using the Intel MIC architecture. A prototype, called Knights Ferry, was offered as a PCIe card with 32 in-order cores at up to 1.2 GHz, 4 threads per core, 2 GB GDDR5 memory, 8 MB coherent L2 cache (256 kB per core, with 32 kB L1 cache per core), and a power requirement of ~300 W, built on a 45 nm process. Intel Knights Corner will be branded as Intel Xeon Phi accelerator cards.

Intel Xeon Phi will include 50 or more cores with at least 8 GB RAM. The Xeon Phi cores are based on the P54C revision of the original Pentium, with added 64-bit support, larger on-die caches, a 512-bit bidirectional ring bus connecting the cores, advanced power management, and 32 512-bit vector registers. It is expected to have good x86 compatibility. Intel announced on its software blog that programs written in high-level languages (C, C++, Fortran, etc.) can easily remain portable despite any ISA or ABI differences. Programming efforts will center on exploiting the high degree of parallelism through vectorization and scaling: vectorization to utilize the Knights Corner vector instructions, and scaling to use more than 50 cores. This has the familiarity of optimizing for a highly parallel SMP system based on CPUs. There are a handful of x86/x86-64 instructions, including a few fairly common ones, that Knights Corner will not support. The vector instructions that KNC/KNF introduce are also unique — Knights Corner does not support traditional SIMD extensions such as MMX, SSE, or AVX at this stage of development.


Intel Xeon Phi

The Intel Xeon Phi (Knights Corner edition) has 60 cores on a single chip. Each core is equipped with a dedicated 512-bit wide vector unit and connected to the others by a 512-bit bidirectional ring interconnect. The Intel Xeon Phi coprocessor is currently deployed as a separate accelerator card connected to the rest of the system via a PCIe slot. Each card contains 8 GB RAM, with support for the external host file system but without paging.

The main advantage of the Intel Xeon Phi is that it supports various modern and legacy programming models. Programming of these systems is supported through:

  • Directive-based model (using pragmas in existing code to offload portions of work to the Phi coprocessor)
  • Recompiling source code to run directly on the coprocessor as a separate many-core Linux SMP compute node
  • Intel MKL (Math Kernel Library) optimized for execution on the coprocessor
  • Using each coprocessor as a node in an MPI cluster, or as a device containing a cluster of MPI nodes

Hardware specifications:

  • 60 cores / 1.053 GHz / 240 threads
  • Up to 1 teraflops double-precision performance
  • 8 GB GDDR5 memory with up to 320 GB/s bandwidth
  • 512-bit wide vector engine
  • 32 KB L1 I/D cache and 512 KB L2 cache per core
  • 225 W TDP
  • Standard PCIe x16 form factor (requires an IA host)
  • Runs a Linux operating system, IP addressable
  • Host operating system: Red Hat Enterprise Linux 6.x or SuSE Linux 12+


AMD processors

Current architectures: Bulldozer, Fusion. Next architectures: Bulldozer improvements — Piledriver (2012), Steamroller (2013), Excavator (2014).

Bulldozer architecture

Bulldozer is Advanced Micro Devices' (AMD) microarchitecture codename for the server and desktop processors released in October 2011.

Main improvements include:

  • "Clustered Integer Core" micro-architecture
  • 2MB of L2 per CPU module
  • Support for Intel's Advanced Vector Extensions (AVX) instruction set
  • New instrcution set (XOP, FMA4 and CVT16) proposed by AMD
  • 32 nm production process
  • Native DDR3 memory support up to DDR3-1866[17]
  • Dual Channel DDR3 integrated memory controller for Desktop and Server/Workstation processors
  • Quad Channel DDR3 Integrated Memory Controller for Server/Workstation processors

Fusion architecture

AMD Fusion is the name for a series of Accelerated Processing Units (APUs) by AMD, aimed at providing good performance with low power consumption by integrating a CPU and a GPU derived from a mobile stand-alone GPU. The design is a product of the merger between AMD and ATI, combining general-purpose processor execution with 3D geometry processing and other functions of modern GPUs (such as GPGPU computation) on a single die. The technology was shown to the general public in 2011, with a second generation expected in 2012.

The 2011 platform integrates CPU, GPU, Northbridge, PCIe, DDR3 memory controller, and UVD on the same integrated circuit. The CPU and GPU are coupled together using a memory controller that arbitrates between coherent and non-coherent memory requests. The physical memory is partitioned: up to 512 MB plus virtual memory for the GPU, and the remainder plus virtual memory for the CPU. The 2012 platform will allow the GPU to access the CPU memory without going through a device driver. The 2013 platform will use a unified memory controller for both CPU and GPU. The 2014 platform will add hardware context switching for the GPU.

AMD is set to launch a "fully integrated" APU in 2014. According to AMD, the APU will feature 'heterogeneous cores' capable of processing both CPU and GPU work automatically, depending on the workload requirement

FirePro cards

New AMD FirePro processors are intended for workstation users who want high-end graphics and computation in a single box. One of them promises a teraflop of double precision performance as well as support for error correcting code (ECC) memory. The new cards include two APUs (Accelerated Processing Units) that glue four CPU cores and hundreds of FirePro GPU stream processors onto the same chip. It's much easier to share data between the two compute engines on the chip, because there's no need for communication across a relatively slow PCIe bus.

NVIDIA Graphics Processing Units

NVIDIA gave a preview of its new generation of GPU architecture at the GTC 2012 conference in San Jose, California. The new GPU architecture is called Kepler. There are two different Kepler GPUs in development. The Kepler1 (GK104) chip is aimed at graphics cards and Tesla GPU coprocessors, where single-precision floating-point math matters most. Kepler2 (GK110) GPUs will be tuned for double-precision floating-point math and will support more GDDR5 memory with ECC support. Kepler2 will have different packaging aimed at servers and will cost more than Tesla cards based on Kepler1 units. NVIDIA increased the core counts and lowered the clock speeds, increasing the parallelism and overall performance of the GPU while significantly lowering its power draw and heat dissipation.

New features:

  • New SMX (Streaming Multiprocessor eXtreme) architecture with 192 CUDA cores per multiprocessor
  • The core speed is 1006 MHz, with a turbo boost option allowing 1058 MHz
  • About three times better performance per watt of power consumption
  • New Hyper-Q technology that enables multiple MPI tasks to run in parallel on the GPU (up to 32 MPI tasks)
  • Dynamic parallelism - GPU can adapt to data and one kernel can dynamically launch another kernel.

Kepler architecture

Kepler GK110 is the first chip in the new series of NVIDIA GPUs. Kepler is mostly intended for the high-performance computing market, aiming to be the highest-performing parallel computing microprocessor in the world. It exceeds the raw compute power delivered by the Fermi architecture by a fair amount while consuming significantly less power and generating much less heat. A full Kepler GK110 implementation includes 15 SMX units and six 64-bit memory controllers.

Key features of the new architecture include:

  • The new SMX processor architecture
  • An enhanced memory subsystem, offering additional caching capabilities, more bandwidth at each level of the hierarchy, and a fully redesigned and substantially faster DRAM I/O implementation.
  • Hardware support throughout the design to enable new programming model capabilities such as dynamic parallelism and Hyper-Q

Dynamic parallelism is the capability of the GPU to generate new work for itself, synchronize on results, and control the scheduling of that work via dedicated, accelerated hardware paths, all without involving the CPU. By providing the flexibility to adapt to the amount and form of parallelism through the course of a program's execution, programmers can expose more varied kinds of parallel work and make the most efficient use of the GPU as a computation evolves. This capability allows less-structured, more complex tasks to run easily and effectively, enabling larger portions of an application to run entirely on the GPU.

Hyper-Q enables multiple CPU cores to launch work on a single GPU simultaneously, thereby increasing GPU utilization and reducing CPU idle time. Hyper-Q increases the total number of connections (work queues) between the host and the GK110 GPU by allowing 32 simultaneous, hardware-managed connections (compared to the single connection available with Fermi). Hyper-Q is a flexible solution that allows separate connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process. Applications that previously encountered false serialization across tasks, limiting achieved GPU utilization, can see a substantial performance increase without changing any existing code.

The new Grid Management Unit (GMU) manages and prioritizes grids to be executed on the GPU, thus enabling dynamic parallelism. The GMU can pause the dispatch of new grids and queue pending and suspended grids until they are ready to execute, providing the flexibility that runtime features such as dynamic parallelism require. The GMU ensures both CPU- and GPU-generated workloads are properly managed and dispatched.

NVIDIA GPUDirect is a capability that enables GPUs within a single computer, or GPUs in different servers located across a network, to directly exchange data without needing to go to CPU/system memory. The RDMA feature in GPUDirect allows third party devices such as SSDs, NICs, and IB adapters to directly access memory on multiple GPUs within the same system, significantly decreasing the latency of MPI send and receive messages to/from GPU memory. It also reduces demands on system memory bandwidth and frees the GPU DMA engines for use by other CUDA tasks.


Maxwell architecture

The new Maxwell architecture of NVIDIA GPUs will be introduced at the beginning of 2014. GPUs of this architecture will have an integrated ARM CPU core, making them more independent from the main CPU. They will also have access to the main memory of the CPU, something AMD has already introduced (unified memory architecture) in its Fusion series of Accelerated Processing Units (APUs). NVIDIA will use its unified memory access primarily to help researchers and users of its Tesla GPU accelerators work with larger datasets.

Volta architecture

At the GTC 2013 conference, NVIDIA announced a new GPU architecture, called Volta, aimed for 2015. The main feature expected is stacked DRAM placed directly on the silicon substrate around the GPU. The DRAM stacks will be connected to the GPU directly, resulting in much higher bandwidth: those processors are expected to achieve around 1 TB/s of memory bandwidth.


ARM microprocessors

64-bit ARM processors are also aimed at the server and HPC markets.

