# The multicore revolution

Giorgio Buttazzo



Scuola Superiore Sant'Anna, Pisa

# Retis

### The transition

- On May 17<sup>th</sup>, 2004, Intel, the world's largest chip maker, canceled the development of the Tejas processor, the successor of the Pentium4-style Prescott processor.
- On July 27<sup>th</sup>, 2006, Intel announced the official release of the Core 2 Duo processor family.
- Since then, all major chip producers have switched from single core to multicore platforms.
- Such a phenomenon is known as the <u>multicore revolution</u>.

The reason why this happened has to do with a market law, predicted by Gordon Moore, Intel's co-founder, in 1965, known as Moore's Law.






### **Benefits of size reduction**

There are two main benefits of reducing transistor size:

1. a higher number of gates can fit on a chip;
2. devices can operate at a higher frequency.

In fact, if the distance between gates is reduced, signals have to cover a shorter path, and the time for a state transition decreases, allowing a higher clock speed.

### However...

At the launch of Pentium 4, Intel expected single core chips to scale up to 10 GHz using gates below 90 nm. However, the fastest Pentium 4 never exceeded 4 GHz.

Why did that happen?


### **Power dissipation**

The main reason is related to power dissipation in CMOS integrated circuits, which is mainly due to two causes:

- Dynamic power (P<sub>d</sub>) consumed during operation;
- Static power (P<sub>s</sub>) consumed when the circuit is off.
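
As a rough illustration of why dynamic power limits the clock speed, the sketch below uses the standard first-order CMOS model, in which dynamic power is proportional to the switched capacitance, the square of the supply voltage, and the clock frequency. This model and all numeric values are assumptions introduced here for illustration; they are not taken from the slides.

```python
def dynamic_power(c_load, v_dd, freq, activity=1.0):
    """First-order CMOS dynamic power: activity * C * Vdd^2 * f (watts).

    All parameters are hypothetical inputs for illustration:
    c_load in farads, v_dd in volts, freq in hertz.
    """
    return activity * c_load * v_dd ** 2 * freq

# Same chip and supply voltage, clock doubled from 2 GHz to 4 GHz.
base = dynamic_power(c_load=1e-9, v_dd=1.2, freq=2e9)
faster = dynamic_power(c_load=1e-9, v_dd=1.2, freq=4e9)

# At a fixed voltage, power grows linearly with the clock frequency.
ratio = faster / base
```

In this model, any voltage increase needed to sustain a higher clock adds a quadratic term on top of the linear one.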



















# How to exploit multiple cores?

The efficient exploitation of multicore platforms poses a number of new problems that are still being addressed by the research community.

When porting a real-time application from a single core to a multicore platform, the following key issues have to be addressed:

- How to split the code into parallel segments that can be executed simultaneously?
- How to allocate such segments to the different cores?


# **Expressing parallelism**

- In a multicore system, sequential languages (such as C/C++) are no longer appropriate to specify programs.
- In fact, a sequential language hides the intrinsic concurrency that must be exploited to improve the performance of the system.

To really exploit hardware redundancy, most of the code has to be parallelized.


# A big problem for industry

Parallelizing legacy code implies a tremendous cost and effort for industries, mainly due to:

- re-designing the application
- re-writing the source code
- updating the operating system
- writing new documentation
- testing the system
- certifying the software

To avoid such costs, the cheapest solution is to port the software to a multicore platform but run it on a single core, disabling all the other cores.


# A big problem for industry

However, due to the clock speed saturation effect, a core in a multicore chip is slower than a single-core processor.



If the application workload was already high, running the application on a single core of a multicore chip creates an overload condition.
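
A minimal numeric sketch of this overload condition is given below. It assumes that the workload is summarized by the processor utilization (the sum of computation time over period for each task) and that execution times scale inversely with core speed. The task set and the 0.8 speed ratio are hypothetical.

```python
def utilization(tasks, speed=1.0):
    """Total utilization of periodic tasks given as (wcet, period) pairs.

    `speed` is the core speed relative to the original single-core
    processor (assumption: execution times scale as 1/speed).
    """
    return sum(wcet / speed / period for wcet, period in tasks)

# Hypothetical task set: (WCET, period) in milliseconds on the old processor.
tasks = [(2.0, 10.0), (6.0, 20.0), (16.0, 40.0)]

u_old = utilization(tasks)             # original single-core processor
u_new = utilization(tasks, speed=0.8)  # one core of a slower multicore chip

# A utilization above 1 means the core cannot complete all the work in time.
overloaded = u_new > 1.0
```

With these numbers the utilization rises from about 0.9 to about 1.125, so a task set that fit on the old processor no longer fits on one slower core.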

To avoid such problems, avionics companies buy enough components in advance to ensure maintenance for 30 years!


### Other problems

In a single core system, concurrent tasks are sequentially executed on the processor, hence the access to physical resources is implicitly serialized (e.g., two tasks can never cause a contention for a simultaneous memory access).

In a multicore platform, different tasks can run simultaneously on different cores, hence several conflicts can arise while accessing physical resources.

Such conflicts not only introduce interference on task execution but also increase the Worst-Case Execution Time (WCET) of each task.











### **Primary Storage**

It is referred to as main memory or internal memory, and is directly accessible to the CPU.

It is volatile, which means that it loses its content if power is removed.

Primary storage includes RAM (based on DRAM technology), and cache and CPU registers (based on SRAM technology):

- DRAM (dynamic random-access memory) must be periodically refreshed (re-read and re-written), otherwise its content would vanish.
- SRAM (static random-access memory) never needs to be refreshed as long as power is applied.


### **Secondary Storage**

It is referred to as <u>external memory</u> or <u>auxiliary storage</u>, because it is not directly accessible by the CPU. Access is mediated by I/O channels, and data are transferred using an intermediate area in primary storage.

It is <u>non-volatile</u>, that is, it retains the stored information even if it is not constantly supplied with electric power.

Examples of secondary storage devices are:

- Hard Disk: based on magnetic technology
- CD ROM, DVD: based on optical technology
- Flash memory: can be electrically erased and reprogrammed


# **Cache Memory**

The cache is a local memory used by the CPU to reduce the <u>average time</u> to access data from the main memory.

The cache is <u>faster</u> than the RAM, but <u>more expensive</u>, so <u>much smaller</u> in size.

Most CPUs have different types of caches:

- Instruction Cache, to speed up executable instruction fetch
- Data Cache, to speed up data fetch and store
- Translation Lookaside Buffer (TLB), used to speed up virtual-to-physical address translation for both executable instructions and data.
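
To see why a small cache can reduce the average access time, the sketch below uses the usual textbook estimate in which every access pays the cache hit time and only the misses also pay the main memory penalty. The formula is an assumption introduced here, and the timing figures and miss rate are hypothetical.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average memory access time: every access pays the cache hit time,
    and a fraction `miss_rate` of accesses also pays the miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical figures, in nanoseconds.
no_cache = 100.0  # every access goes to main memory
with_cache = average_access_time(hit_time=1.0, miss_rate=0.05,
                                 miss_penalty=100.0)
```

With a 5% miss rate, the average access time in this example drops from 100 ns to about 6 ns, although the worst case (a miss) is slightly worse than having no cache at all.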




































# **Expressing parallelism**

Code parallelization can be done at different levels:

- Parallel programming languages (e.g., Ada, Java, CAL).
- Code annotation.

The information on parallel code segments and their dependencies is inserted in the source code of a sequential language by means of special constructs analyzed by a pre-compiler (e.g., OpenMP).


# **Expressing parallelism**

For instance, CAL [UC@Berkeley, 2003] is a dataflow language.

Algorithms are described by modular components (<u>actors</u>) that communicate through I/O ports.



Actions read input tokens, modify the internal state, and produce output tokens.
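
The actor behavior described above can be sketched in Python as follows. This is not CAL syntax: the `Accumulator` class, its port names, and the `run` helper are hypothetical names chosen only to show an action that reads a token, updates the state, and emits a token.

```python
from collections import deque

class Accumulator:
    """Hypothetical actor: consumes one token per firing, updates its
    internal state (a running sum), and emits the new sum."""

    def __init__(self):
        self.inp = deque()   # input port
        self.out = deque()   # output port
        self.total = 0       # internal state

    def can_fire(self):
        # The action is enabled only when an input token is available.
        return len(self.inp) > 0

    def fire(self):
        token = self.inp.popleft()    # read an input token
        self.total += token           # modify the internal state
        self.out.append(self.total)   # produce an output token

def run(actor, tokens):
    """Feed tokens to the actor and fire it until its input is empty."""
    actor.inp.extend(tokens)
    while actor.can_fire():
        actor.fire()
    return list(actor.out)

result = run(Accumulator(), [1, 2, 3, 4])
```

Since actors interact only through their ports, independent actors can in principle be fired concurrently on different cores.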


### **Expressing parallelism**

**OpenMP** specifies parallel code through `#pragma` directives.

For instance, a `for` statement preceded by `#pragma omp parallel for` has its iterations divided among a team of parallel threads.
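
As a loose Python analogue of an annotated parallel loop (this is not OpenMP and not the original listing), the sketch below hands independent loop iterations to a pool of worker threads. The `body` function, `n`, and the pool size are hypothetical; in CPython the example shows the structure of the computation rather than an actual speed-up for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor

def body(i):
    """Loop body: must not depend on the result of other iterations."""
    return i * i

n = 8
# Sequential version would be: result = [body(i) for i in range(n)]
with ThreadPoolExecutor(max_workers=4) as pool:
    # Iterations are distributed among the worker threads;
    # map() returns the results in iteration order.
    result = list(pool.map(body, range(n)))
```

The loop can be split this way only because no iteration reads data written by another one, which is the same condition required by a parallel loop annotation.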

In any case, a suitable task model is needed to represent and analyze parallel applications.




### Task model

Representing parallel code requires more complex structures, such as a graph.



Restrictions are needed to simplify the analysis.
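
A minimal sketch of such a graph representation is given below, assuming the restriction that the graph is acyclic. Nodes are sequential code segments and edges are precedence constraints. The node names and execution times are hypothetical, and the two derived quantities (total work and critical path length) are common choices introduced here, not taken from the slides.

```python
from functools import lru_cache

# Hypothetical task graph: node -> (execution time, list of successors).
graph = {
    "a": (1, ["b", "c"]),
    "b": (4, ["d"]),
    "c": (2, ["d"]),
    "d": (1, []),
}

def total_work(g):
    """Sum of all node execution times: the time needed on one core,
    ignoring overheads."""
    return sum(cost for cost, _ in g.values())

def critical_path(g):
    """Length of the longest path in an acyclic graph: a lower bound on
    the response time with any number of cores."""
    @lru_cache(maxsize=None)
    def longest_from(node):
        cost, succs = g[node]
        return cost + max((longest_from(s) for s in succs), default=0)
    return max(longest_from(n) for n in g)

work = total_work(graph)        # 1 + 4 + 2 + 1
length = critical_path(graph)   # path a -> b -> d
```

Here the total work is 8 and the critical path is 6, so adding cores can reduce the response time by at most 2 time units for this graph.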





















### **Performance issues**

Assuming we are able to express the parallel structure of our source code,

- How much performance can we gain by switching from 1 core to *m* cores?
- How can we measure the performance improvement?


# Speed-up factor

It measures the relative performance improvement achieved when executing a task on a new computing platform, with respect to an old one.

$$S = \frac{R_{old}}{R_{new}}$$

 $R_{old}$  = response time on the old platform

 $R_{new}$  = response time on the new platform


# **Speed-up factor**

If the old architecture is a single core platform and the new architecture is a platform with m cores (each having the same speed as the single core one), the speedup factor can be expressed as

$$S = \frac{R_1}{R_m}$$

 $R_1$  = response time on 1 processor

 $R_m$  = response time on m processors
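
The definition above translates directly into a small helper. The response times below are hypothetical measurements, and the per-core efficiency is an extra quantity added here for illustration.

```python
def speedup(r_old, r_new):
    """Speed-up factor S = R_old / R_new."""
    if r_new <= 0:
        raise ValueError("response time must be positive")
    return r_old / r_new

# Hypothetical measurements, in milliseconds.
m = 4
r_1 = 120.0   # response time on 1 core
r_m = 40.0    # response time on m = 4 cores

s = speedup(r_1, r_m)   # S = R_1 / R_m
efficiency = s / m      # speed-up obtained per core
```

In this example S = 3 on 4 cores, so each core contributes 0.75 of the ideal linear speed-up.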








### **Considerations**

Law of diminishing returns:

Each time a processor is added, the gain is lower.

- The performance/price ratio falls rapidly as *m* increases.
- Considering communication costs, memory, bus conflicts, and I/O bounds, the situation gets worse.
- Parallel computing is only useful for
  - limited numbers of processors, or
  - highly parallel applications (high values of  $\gamma$ ).
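
The diminishing returns can be illustrated with Amdahl's law, S(m) = 1 / ((1 - γ) + γ/m). The sketch assumes that γ denotes the fraction of the code that can be parallelized and that all overheads are ignored. Both the interpretation of γ and the choice of this formula are assumptions made here, and the value γ = 0.9 is hypothetical.

```python
def amdahl_speedup(gamma, m):
    """Speed-up on m cores when a fraction gamma of the code is parallel."""
    return 1.0 / ((1.0 - gamma) + gamma / m)

gamma = 0.9  # hypothetical: 90% of the code can be parallelized

# Speed-up for m = 1, 2, ..., 8 cores.
speedups = [amdahl_speedup(gamma, m) for m in range(1, 9)]

# Gain obtained by adding one more core: it shrinks at every step.
gains = [b - a for a, b in zip(speedups, speedups[1:])]

# Upper bound on the speed-up as m grows without limit.
limit = 1.0 / (1.0 - gamma)
```

Under these assumptions the speed-up can never exceed 1 / (1 - γ), about 10 for γ = 0.9, no matter how many cores are added.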


### When MP is not suited

Applications having some of the following features are not suited for running on a multicore platform:

- I/O bound tasks;
- Tasks composed of a series of pipeline-dependent calculations;
- Tasks that frequently exchange data;
- Tasks that contend for shared resources.



### Other issues

- How to <u>allocate</u> and <u>schedule</u> concurrent tasks on a multicore platform?
- How to <u>analyze</u> real-time applications to guarantee timing constraints, taking into account communication delays and interference?
- How to <u>optimize</u> resources (e.g., minimizing the number of active cores under a set of constraints)?
- How to reduce interference?
- How to simplify software portability?



### **Multiprocessor models**

### Identical

Processors are of the same type and have the same speed. Each task has the same WCET on each processor.

### **Uniform**

Processors are of the same type but may have different speeds. Task WCETs are smaller on faster processors.

### Heterogeneous

Processors can be of different types. The WCET of a task depends on the processor type and the task itself.
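
The three models can be contrasted by how the WCET of a task is obtained. In the sketch below, all function names, task names, processor types, and timing values are hypothetical, and the uniform model assumes that the WCET is inversely proportional to the processor speed.

```python
def wcet_identical(base_wcet, cpu):
    """Identical processors: the WCET does not depend on the processor."""
    return base_wcet

def wcet_uniform(base_wcet, speed):
    """Uniform processors: the WCET shrinks on faster processors
    (assumption: inversely proportional to the speed)."""
    return base_wcet / speed

def wcet_heterogeneous(table, task, cpu_type):
    """Heterogeneous processors: one WCET per (task, processor type)."""
    return table[(task, cpu_type)]

# Hypothetical WCET table, in milliseconds.
table = {
    ("fft", "cpu"): 8.0, ("fft", "dsp"): 2.0,
    ("parse", "cpu"): 3.0, ("parse", "dsp"): 9.0,
}

fft_best = min(("cpu", "dsp"), key=lambda t: wcet_heterogeneous(table, "fft", t))
parse_best = min(("cpu", "dsp"), key=lambda t: wcet_heterogeneous(table, "parse", t))
```

In the heterogeneous case no processor is simply "faster": in this example the hypothetical `fft` task runs best on the `dsp` type, while `parse` runs best on the `cpu` type.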





