# Concurrent and parallel programming

Romolo Marotta



# Lock implementations

# **Blocking coordination**



# Spinning vs Sleeping

| Benefits                | Spinning     |
|-------------------------|--------------|
| Guaranteed low latency  | $\checkmark$ |
| Computing power savings | ×            |



# Spinning vs Sleeping



# Spin vs Sleep – is that all?

- Choosing the proper back off scheme is very challenging
- Even implementing a simple spin lock is not trivial
  - Trade off between low and high contented case
  - You should have heard about algorithms for Mutual Exclusion in Distributed Systems lectures
    - E.g. Dijkstra, Bakery algorithm, Peterson...
  - Those algorithm essentially implements spin locks by resorting only on read/write operations
- Here, we will focus on spin locking algorithms that exploit stronger synchronization primitives... RMW!

#### Test-and-set spin lock

- Test-and-set lock is the simplest spin lock
- Acquiring threads always try to set a variable via RMW

}

int lock = 0;

}

void acquire(int \*lock){
 while(XCHG(lock, 1));

void release(int \*lock){
 \*lock = 0;

# A small benchmark

- We have an array of integers
- Each thread reverse the array



This is done within a critical section

```
while(!stop){
    acquire(&lock);
    flip_array();
    release(&lock);
}
```

- Performance Metric:
  - Throughput = #Flips per second

#### Results



#Threads





















# Lock implementations

#### Test-and-set spin lock

- Test-and-set lock is the simplest spin lock
- Acquiring threads always try to set a variable via RMW

}

int lock = 0;

}

void acquire(int \*lock){
 while(XCHG(lock, 1));

void release(int \*lock){
 \*lock = 0;

#### Results



#Threads



#### Test-and-set spin lock

- Test-and-set lock is the simplest spin lock
- Acquiring threads always try to set a variable via RMW

int lock = 0;

}

void acquire(int \*lock){
 while(XCHG(lock, 1));

void release(int \*lock){
 \*lock = 0;
}



#### Test-and-set spin lock

- Test-and-set lock is the simplest spin lock
- Acquiring threads always try to set a variable via RMW

```
int lock = 0;
```

void acquire(int \*lock){
 while(XCHG(lock, 1));

void release(int \*lock){
 \*lock = 0;
}

We can reduce the impact of memory traffic by introducing exponential back off! But how to set it properly?

#### Test-and-test-and-set spin lock

- Like test-and-set, but spins by reading the value of the lock
- Traffic is generated only upon lock handover

int lock = 0;

void acquire(int \*lock){ void release(int \*lock){
 while(XCHG(lock, 1)) \*lock = 0;
 while(\*lock); }



#### Results



#Threads

#### Test-and-test-and-set spin lock

- Like test-and-set, but spins by reading the value of the lock
- Traffic is generated only upon lock handover

int lock = 0;

}

- void acquire(int \*lock){ void release(int \*lock){
   while(XCHG(lock, 1)) \*lock = 0;
   while(\*lock); }
  - Lock handover costs increase with the concurrency level
  - Very lightweight for the uncontended case
  - Is it feasible reducing handover costs?
  - AND IMPROVING FAIRNESS?

**FIFO locks** 

## **Ticket locks**

- Similar to the bakery algorithm but it uses RMW instructions
- Two variables
  - The next available ticket
  - The served ticket

```
typedef struct _tck_lock{
    int ticket = 0;
    int current = 0;
} tck lock;
```

```
void acquire(tck_lock *lock){
    int cur_tck;
    int mytck = fetch&add(lock->ticket, 1);
    while(mytck != (cur_tck = lock->current) )
        delay((mytck-cur_tck)*BASE);
```

}

void release(tck\_lock \*lock){ lock->current += 1; }

# **Ticket locks**

- Ensure fairness
- Similar structure w.r.t. TTAS spinlock
  - One variable updated once at each acquisition (better than TTAS)
  - Write-1-Read-N variable updated at each release (same as TTAS)
- How?

- Use similar to ticket lock
- Use the ticket to obtain an individual cache line

$$Ticket = \frac{0}{1} \frac{1}{2} 3$$



- Use similar to ticket lock
- Use the ticket to obtain an individual cache line

$$Ticket = \frac{0}{1} \frac{1}{2} 3$$



}

void acquire(anderson\_lock \*lock){
 mytck = fetch&add(lock->ticket, 1);
 while(lock->array[mytck]);
 lock->array[mytck] = 1; v

void release(int \*lock){
 lock->array[mytck+1] = 0;
}



- Pros:
  - One variable updated once at each acquisition (like Ticket lock)
  - Write-1-Read-1 variable updated once per release (better than (T)TAS and Ticket)
- Cons:
  - Increased memory footprint
  - Each lock needs to know the maximum number of threads
- Let:
  - T be the number of threads
  - L be the number of locks
- Space Usage
  - Anderson = O(LT)
  - TAS, TTAS, Ticket = O(L)

### CLH lock

- An (implicit) linked list maintains the order between waiting threads
- An empty list represent an uncontended lock
- An arriving thread swaps the node with its private node
- Spin on the previous node
- Release on the new node


## CLH queue lock

- Pros:
  - One variable updated once at each acquisition (like Ticket lock)
  - Write-1-Read-1 variable updated once per release (better than (T)TAS and Ticket)
- Cons:
  - Slightly increased memory footprint
- Let:
  - T be the number of threads
  - L be the number of locks
- Space Usage
  - CLH = O(L+T)
  - Anderson = O(LT)
  - TAS, TTAS, Ticket = O(L)

#### NUMA



#### MCS lock

- An explicit linked list maintains the order between waiting threads
- An empty list represent an uncontended lock
- An arriving thread swaps the node with its private node
- Spin on the just inserted node
- Release on the new node



#### MCS lock

- An explicit linked list maintains the order between waiting threads
- An empty list represent an uncontended lock
- An arriving thread swaps the node with its private node
- Spin on the just inserted node
- Release on the new node



#### MCS queue lock

- Pros:
  - One variable updated once at each acquisition (like Ticket lock)
  - Write-1-Read-1 variable updated once per release (better than (T)TAS and Ticket)
  - No-remote spinning
- Cons:
  - Slightly increased memory footprint
- Let:
  - T be the number of threads
  - L be the number of locks
- Space Usage
  - MCS, CLH = O(L+T)
  - Anderson = O(LT)
  - TAS, TTAS, Ticket = O(L)

#### MCS in practice: the Linux kernel case

- The Linux kernel uses a particular implementation of a MCS lock: Qspinlock
- Additional challenge:
  - Maintain compatibility with classical 32-bit locks
  - MCS uses pointers (64-bit)
- Compact data:
  - 1. No recursion of same context in critical sections
  - 2. 4 different contexts (task, softirq, hardirq, nmi)
  - 3. Finite number of cores
- Use an additional bit for fast lock handover



#### MCS in practice: the Linux kernel case



#### A small benchmark

- We have an array of integers
- Each thread reverse the array



This is done within a critical section

```
while(!stop){
    acquire(&lock);
    flip_array();
    release(&lock);
}
```

- Performance Metric:
  - Throughput = #Flips per second

# One lock to rule them all...

#### Performance

#### Intel i7-7700HQ – 8 cores



Concurrent and parallel programming

#### Performance

#### AMD Opteron 6168 - 48 cores



Concurrent and parallel programming

#### At the beginning was... Spin vs Sleep

|                         | Waiting Policy |              |
|-------------------------|----------------|--------------|
| Benefits                | Spinning       | Sleeping     |
| Guaranteed low latency  | $\checkmark$   | ×            |
| Computing power savings | ×              | $\checkmark$ |



#### How to avoid costs for sleeping?

- A general approach exists:
- Reducing the frequency of sleep/wake-up pairs
- How?
  - Trading Fairness in favor of Throughput
- Make some thread sleep longer than others
- If the lock is highly contented, some thread willing to access the critical section will arrive soon
- If the lock is scarcely contented, we pay lower latency as TTAS locks

#### An example - MutexEE

 MutexEE is a pthread\_mutex optimized for throughput and energy efficiency



Credits: Falsafi et all. "Unlocking energy"

An example - MutexEE

 MutexEE is a pthread\_mutex optimized for throughput and energy efficiency



Concurrent and parallel programming

#### An example 2 – Malthusian locks



Credits: Dave Dice "Malthusian locks"

#### An example 2 – Malthusian locks



Credits: Dave Dice "Malthusian locks"

Concurrent and parallel programming

#### **Hierarchical locks**

HPC wants maximum usage of CPU power

- Sleeping might be required for better management of I/O
- Large number of cores per machine
- $\Rightarrow$ NUMA (again)
- FIFO locks cannot avoid transfer to remote NUMA nodes

Again, we can trade fairness in favor of throughput

#### **Hierarchical locks**

- Transfer the lock to threads that reside on the same NUMA node
- Hierarchical TTAS
  - Shorter backoff for local threads, longer for remote ones



#### **Hierarchical locks**

- Transfer the lock to threads that reside on the same NUMA node
- Hierarchical TTAS
  - Shorter backoff for local threads, longer for remote ones
- Hierarchical QUEUE LOCKS (lock cohorting)
  - One global lock (the application one)
  - One lock per NUMA node (render the hoods")



# **Optimizing Critical Section Execution**

#### Optimizing the waiting phase

We have seen several approaches to optimize the lock acquisition phase:

- Back-off scheme
- Cache-awareness TTAS, FIFO locks
- Non-trivial combinations of both sleep and spin phases

What can we do to improve the execution of threads running the critical section?

Improve locality and cache usage

#### How?

Observation:

- A lock (typically) protects data (instead of code)
   Idea!
- There is a good chance that threads willing to acquire a lock want to access "similar" sets of data
- ⇒Allow thread holding the lock to execute the critical section for waiting threads
- Reduces lock handover costs
- Increases locality

- Use a linked list for holding waiting threads
- Each node maintains:
  - The waiting thread ID
  - The critical section descriptor
- Thread check waiting queue before releasing the lock
  - If empty exit



- Use a linked list for holding waiting threads
- Each node maintains:
  - The waiting thread ID
  - The critical section descriptor
- Thread check waiting queue before releasing the lock
  - If empty exit
  - Otherwise take a node from the waiting queue and execute the critical section for the waiting thread



- It might allow further (asymptotic) optimizations (e.g., data structures)
- Operations can be combined to each other BEFORE interacting with protected data



- It might allow further (asymptotic) optimizations (e.g., data structures)
- Operations can be combined to each other BEFORE interacting with protected data
- Operations can be applied in batch (relevant for accesses that require a search)







No need for restarting the search from scratch!



# Is it a silver bullet? Can replace complex lock-free algorithms?

NO! SPARC T2 - QUEUE - Throughput as function of Read/Write delay Constant 56 threads; 50% ENQ; 50% DEQ 16000 - fc 14000 michael scott Basket 12000 Combin tree oyama 10000 sm / sdo **X** oyama combin 8000 H log sync 6000 4000 2000 0 320 192 64 128 256 384 0 read/write delay HIGH LOW CONTENTION Credits: Hendler et all. "Flat combining and the synchronization-parallelism tradeoff"

Concurrent and parallel programming

Is it a silver bullet? Can replace complex lock-free algorithms?

- No, performance depends on the actual contention!
- Combining requires hand-written code!
- How to improve for NUMA?
- Hierarchical Flat Combining

#### Approaches targeting peculiar workloads

#### Read-Write locks

- Threads that do not want to perform updates can acquire the lock with other "readers"
- Threads willing to perform updates ("writers") take exclusive lock

#### • Easy to implement:

- Lock < 0 : acquired by a writer</li>
- Lock = 0 : available
- Lock = N > 0: locked by N readers
- RW locks work well in read-mostly workloads, but:
  - It has a greater impact to readers (exclusive accesses to the lock variable)
  - Can be optimized by splitting the read counter

#### **RW** locks

#### Multiple RW locks (each one has its own cache line)



Writer acquire Owner and spin until all flags are 0

If Owner is free Readers acquire their assigned Flag (e.g. the one of their numa node) Then, check again Owner

#### Approaches targeting peculiar workloads (2)

- When read-only accesses are predominant, we can make reader DO NOT use any lock
- Version Numbers
- Writer:
  - Acquire a (writer) lock
  - Increase Version Number
  - Apply Update
  - Increase Version Number
  - Release Lock

- Reader:
  - Wait even Version Number
  - Do job
  - If Version Number is unchanged OK else retry



#### Approaches targeting peculiar workloads (3)

- When read-only accesses are predominant, we can make reader DO NOT use any lock
- Version Numbers
- Read-Copy-Update
  - Single shared-data entry point

These solutions NEEDS memory management as non-blocking algorithms!!!

#### **Final Picture**



LOW

#### **Contention Level**



#### The Java Case

allocate object

Credits: Kotzmann et all. "Synchronization and Object Locking"
# The Java Case



lock / unlock

Credits: Kotzmann et all. "Synchronization and Object Locking"

# The Java Case



# The Java Case

Synchronization and Object Locking: https://wiki.openjdk.java.net/display/HotSpot/Synchronization



Credits: Kotzmann et all. "Synchronization and Object Locking"

Concurrent and parallel programming

# **Final Picture**

Performance

LOW



Concurrent and parallel programming

# Synchronization approaches:

- Non-blocking data structures
- Locks
- Transactional Memory

# **Transactional Memory**

- Why?
  - Fine grain locking (or non-blocking synchronization) can scale but it is hard
  - Locks do not scale in general, but they are hard too:
    - Deadlocks
    - Races (forgotten locks)
    - Do not compose
- Transactions:
  - They compose (e.g. nested transactions)
  - Simpler to reason about

```
Begin_transaction
    x.op()
    y.op2(k)
    z.op(j)
End transaction
```

- Well known in the context of databases
- Conceived integration of transaction in hardware (1993)
- Software implementations (1995-2005)
- Commercial hardware support (2013)











# Hardware Transactional Memory

Intel TSX BlueGene RockProcessor Arm Transactional Extension IBM POWER8 and 9



Memory:

- Exploit cache coherency protocols
- Modified
- Exclusive
- Shared
- Invalid
- Tracked for speculative execution of transaction.
- Losing track of a cache line leads to an abort

CPU:

 Ability to restore the processor state as the one before the beginning

### Hardware transaction and abort

- Why can a hardware transaction abort?
  - Whenever, we lose track of a cache line....
- Any reason that could lead to an invalidation of a tracked cache line:
  - Another core wants it exclusive (conflict)
  - Change of execution mode (syscall, interrupts, page fault)
  - Working set too large

# Intel Transactional Synchronization eXtensions (TSX)

#### RTE

- XBEGIN:
  - Start a hardware transaction (keep track of accessed cache lines)
- XEND:
  - Try to commit a hardware transaction (untrack cache lines)
- XABORT:
  - Make a hardware transaction abort programmatically

### Are HTM so simple?



### Are HTM so simple?



#### Are HTM so simple?

```
int committed count;
void transaction() {
char *buf = malloc(4096*1024); // 4MB
init(buf);
start tsx:
     if ( XBEGIN() == XBEGIN STARTED) {
          do job(buf,...)
This is not a
good fallback path! XEND ();
          FAD(&committed count, 1);
          return;
     }
     else goto start tsx;
```

int committed\_count; volatile int lock = UNLOCKED; void transaction() {

**char** \*buf = malloc(4096\*1024); // 4MB

init(buf); bool fb = false; int retry =0;

start\_tsx:

if (fb || XBEGIN() == XBEGIN STARTED) { if(lock==LOCKED) XABORT(); if(fb) TTAS(&lock, LOCKED); do job(buf,...) if (fb) lock = UNLOCKED; **XEND**(); FAD(&committed count,1); return; } else { fb=++retry>MAX RETRY; goto start tsx; }

| XBEGIN XABORT | LOCK UNLOCK   |           |                    |                                       |                 |                 |
|---------------|---------------|-----------|--------------------|---------------------------------------|-----------------|-----------------|
| XBEGIN XABORT | XBEGI         | N XABORT  | SPIN L             | PIN LOCK UNLOCK                       |                 |                 |
| XBEGIN XAE    | ORT           | XBEGIN XA | BORT               | SPIN LOCK                             | UNLOO           | СК              |
| XBEGIN XABORT | SPIN          |           |                    | LOCK UNLOCK                           |                 |                 |
|               | XBEGIN XABORT |           |                    | XBEGIN XAB                            | ORT             | SPIN LOCK       |
|               |               |           | No one<br>a perioc | will use the fast<br>I of quiescence! | transacti<br>!! | onal path until |

int committed\_count; volatile int lock = UNLOCKED; void transaction() {

```
char *buf = malloc(4096*1024); // 4MB
```

```
init(buf); bool fb = false; int retry =0;
start tsx:
```

```
if (fb || XBEGIN() == XBEGIN STARTED) {
     if(lock==LOCKED) {while(lock==LOCKED);
            XABORT (); }
     if(fb) TTAS(&lock, LOCKED);
     do job(buf,...)
     if(fb) lock = UNLOCKED;
     XEND();
     FAD(&committed count,1);
     return;
}
else {fb=++retry<MAX RETRY; goto start tsx;}</pre>
```

}

- We cannot replace lock with HTM as is due to performance aspects
- Naïve code might abort frequently due to:
  - Statistics
  - Memory allocations
  - Fallback path policy make the fast past rarely used
  - False cache-sharing
  - NUMA
  - NVRAM

# Intel Transactional Synchronization eXtensions (TSX)

#### RTE

- XBEGIN:
  - Start a hardware transaction (keep track of accessed cache lines)
- XEND:
  - Try to commit a hardware transaction (untrack cache lines)
- XABORT:
  - Make a hardware transaction abort programmatically
- Needs a fallback path (e.g., by using locks)

#### HLE

- XACQUIRE:
  - Start a hardware transaction
  - execute a RMW without the LOCK prefix (XACQUIRE LOCK XCHG mutex, 1)
- XRELEASE:
  - Execute a mov to release the lock (XRELEASE mov mutex, 0)
  - Try to commit
- No need for an additional fallback path (just drop xacquire/xrelease and restart)

Is it worth investing in optimizing our code for HTM?

- VERY HARD TO SAY
- HTM has been around for a while (2014), BUT:
- IBM BlueGene/Q It is a high-end processor, not an off-the-shelf
- RockProcessor Canceled in 2009
- IBM POWER8 and 9 (Power ISA v.2.07 to 3.0)

Not present in 10 (3.1)

- Intel TSX
  - First releases were bugged => disabled by firmware update
  - As other speculative components of Intel processors, they are vulnerable (leak info, see *TSX* Asynchronous Abort (TAA) / CVE-2019-11135) => disabled by firmware update
  - Not supported in last generation Cometlake cpu (finger crossed for the next one)
- Arm Transactional Extension introduced in the last generation Armv9 (Mar 30 2021)

### What about Software Transactional Memory

From a programmer perspective:

- It is less efficient than hardware implementation
- It generally provides stronger progress
- No need for a fallback path
- Processor independent
- Stick with the support of the community/organization developing it

### What about Software Transactional Memory

- Hot topic in 2005
- A pletora of implementations for several programming languages
- C/C++
  - TinySTM
  - From G++ v4.7 (still expertimental)
- C#
  - SXM by Microsoft (discontinued)
- Haskell
  - STM is part of the Haskell platform
- Scala
  - Akka framework
- Java, python