Golang is a language which excels at concurrency, with spinning up a new goroutine being as easy as typing “go”. As you find yourself building more and more complex systems, it becomes increasingly important to properly protect access to shared resources in order to prevent race conditions. Such resources might include configuration which can be updated on-the-fly (e.g. feature flags), internal state (e.g. circuit breaker state), and more.

What Are Race Conditions?

For most readers, this is probably basic knowledge, but since the rest of this article depends on an understanding of race conditions, it makes sense to start with a brief refresher. A race condition is a situation in which a program’s behavior depends on the sequence or timing of events outside its control. In most cases, such a condition is a bug, because it opens the door to undesirable outcomes.

This is perhaps easier to understand with a concrete example:

package main

import (
	"fmt"
	"sort"
	"sync"
	"testing"
)

func Test_RaceCondition(t *testing.T) {
	var s = make([]int, 0)

	wg := sync.WaitGroup{}

	// spawn 10 goroutines to modify the slice in parallel
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s = append(s, i) //add a new item to the slice
		}(i)
	}

	wg.Wait()
	
	sort.Ints(s) //sort the response to have comparable results
	fmt.Println(s)
}

Execution One:

=== RUN   Test_RaceCondition
[0 1 2 3 4 5 6 7 8 9]
--- PASS: Test_RaceCondition (0.00s)

All looks good here. This is the output we expected. The program iterated 10 times and added the index to the slice on each iteration.

Execution Two:

=== RUN   Test_RaceCondition
[0 3]
--- PASS: Test_RaceCondition (0.00s)

Wait, what happened here? We only had two items in our response slice this time. This is because the contents of the slice changed between the moment s was loaded and the moment the result was stored back, so some goroutines overwrote the work of others. This particular race condition is caused by a data race: a situation in which multiple goroutines access the same shared variable concurrently and at least one of them modifies it.
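
The lost update can even be reproduced deterministically, without any goroutines at all. In the sketch below, old1 and old2 are illustrative stand-ins for each goroutine’s private view of s:

package main

import "fmt"

func main() {
	s := make([]int, 0)

	old1 := s // goroutine 1 loads the slice
	old2 := s // goroutine 2 loads the same slice before goroutine 1 stores back

	s = append(old1, 1) // goroutine 1 stores [1]
	s = append(old2, 2) // goroutine 2 stores [2], silently discarding the 1

	fmt.Println(s) // [2]
}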

If you execute the test with the -race flag, Go will even tell you that there’s a data race and help you pinpoint exactly where it occurs:

$ go test race_condition_test.go -race

==================
WARNING: DATA RACE
Read at 0x00c000132048 by goroutine 9:
  command-line-arguments.Test_RaceCondition.func1()
      /home/sfinlay/go/src/benchmarks/race_condition_test.go:20 +0xb4
  command-line-arguments.Test_RaceCondition·dwrap·1()
      /home/sfinlay/go/src/benchmarks/race_condition_test.go:21 +0x47

Previous write at 0x00c000132048 by goroutine 8:
  command-line-arguments.Test_RaceCondition.func1()
      /home/sfinlay/go/src/benchmarks/race_condition_test.go:20 +0x136
  command-line-arguments.Test_RaceCondition·dwrap·1()
      /home/sfinlay/go/src/benchmarks/race_condition_test.go:21 +0x47

Goroutine 9 (running) created at:
  command-line-arguments.Test_RaceCondition()
      /home/sfinlay/go/src/benchmarks/race_condition_test.go:18 +0xc5
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1259 +0x22f
  testing.(*T).Run·dwrap·21()
      /usr/local/go/src/testing/testing.go:1306 +0x47

Goroutine 8 (finished) created at:
  command-line-arguments.Test_RaceCondition()
      /home/sfinlay/go/src/benchmarks/race_condition_test.go:18 +0xc5
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1259 +0x22f
  testing.(*T).Run·dwrap·21()
      /usr/local/go/src/testing/testing.go:1306 +0x47
==================

Concurrency Control

Protecting access to these shared resources typically involves common memory synchronization mechanisms such as channels or mutexes.
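
To illustrate the channel approach, here is a sketch of the same test (the test name is illustrative) in which every write is funneled through a single goroutine, so the slice itself is never shared:

func Test_NoRaceCondition_Channel(t *testing.T) {
	var s = make([]int, 0)

	ch := make(chan int)

	// spawn 10 goroutines which send their value over a channel
	// instead of modifying the slice directly
	for i := 0; i < 10; i++ {
		go func(i int) {
			ch <- i
		}(i)
	}

	// only this goroutine appends, so no lock is needed
	for i := 0; i < 10; i++ {
		s = append(s, <-ch)
	}

	sort.Ints(s) //sort the response to have comparable results
	fmt.Println(s)
}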

Here’s that same test case with the race condition adjusted to use a mutex:

func Test_NoRaceCondition(t *testing.T) {
	var s = make([]int, 0)

	m := sync.Mutex{}
	wg := sync.WaitGroup{}

	// spawn 10 goroutines to modify the slice in parallel
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(i int) {
			m.Lock()
			defer wg.Done()
			defer m.Unlock()
			s = append(s, i)
		}(i)
	}

	wg.Wait()

	sort.Ints(s) //sort the response to have comparable results
	fmt.Println(s)
}

This time it consistently returns all 10 integers, because it ensures that each goroutine only reads and writes the slice while no other goroutine is doing so. If a second goroutine attempts to acquire the lock at the same time, it must wait until the previous one has finished (i.e. until it unlocks).

However, for high-throughput systems, performance becomes very important, and it is therefore ever more important to reduce lock contention (i.e. the situation in which one process or thread attempts to acquire a lock held by another). One of the most basic ways to do this is to use a reader-writer lock (sync.RWMutex) instead of a standard sync.Mutex; however, Golang also provides some atomic memory primitives in its atomic package.

Atomic

Golang’s atomic package provides low-level atomic memory primitives for implementing synchronization algorithms. That sounds like the sort of thing we need, so let’s try rewriting that test with atomic:

import "sync/atomic"

func Test_RaceCondition_Atomic(t *testing.T) {
	var s = atomic.Value{}
	s.Store([]int{}) // store empty slice as the base

	wg := sync.WaitGroup{}

	// spawn 10 goroutines to modify the slice in parallel
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s1 := s.Load().([]int)
			s.Store(append(s1, i)) //replace the slice with a new one containing the new item
		}(i)
	}

	wg.Wait()

	s1 := s.Load().([]int)
	sort.Ints(s1) //sort the response to have comparable results
	fmt.Println(s1)
}

Execution Result:

=== RUN   Test_RaceCondition_Atomic
[1 3]
--- PASS: Test_RaceCondition_Atomic (0.00s)

What? This is exactly the same problem we had before, so what good is this package?

Read-Copy-Update

Atomic isn’t a silver bullet, and it obviously cannot replace mutexes, but it is excellent for shared resources which can be managed using the read-copy-update pattern. In this technique, we fetch the current value by reference, and when we want to update it, we don’t modify the original value but rather replace the pointer entirely, so no goroutine ever accesses the same resource that another may be modifying. The previous example could not be implemented using this pattern, since it needs to extend an existing resource over time rather than replace its contents entirely, but for many cases read-copy-update is perfect.
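
As a minimal sketch of the pattern (the Config type and its FeatureEnabled field are hypothetical), a dynamic configuration can be published by swapping a pointer:

import "sync/atomic"

// Config is a hypothetical dynamic configuration
type Config struct {
	FeatureEnabled bool
}

var current atomic.Value // always holds a *Config

func init() {
	current.Store(&Config{}) // ensure Load never returns nil
}

func LoadConfig() *Config {
	return current.Load().(*Config)
}

func UpdateConfig(c *Config) {
	current.Store(c) // readers holding the old *Config keep a consistent snapshot
}

Because the only shared write is the pointer swap itself, readers can never observe a half-written Config.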

Here is a basic example where we can fetch and store boolean values (useful for feature flags, for instance). In this example, we’re performing a parallel benchmark comparing atomic to a read-write mutex:

package main

import (
	"sync"
	"sync/atomic"
	"testing"
)

type AtomicValue struct {
	value atomic.Value
}

func (b *AtomicValue) Get() bool {
	return b.value.Load().(bool)
}

func (b *AtomicValue) Set(value bool) {
	b.value.Store(value)
}

func BenchmarkAtomicValue_Get(b *testing.B) {
	atomB := AtomicValue{}
	atomB.value.Store(false)

	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			atomB.Get()
		}
	})
}

/************/

type MutexBool struct {
	mutex sync.RWMutex
	flag  bool
}

func (mb *MutexBool) Get() bool {
	mb.mutex.RLock()
	defer mb.mutex.RUnlock()
	return mb.flag
}

// Set updates the flag under the write lock (used by the write benchmarks below)
func (mb *MutexBool) Set(value bool) {
	mb.mutex.Lock()
	defer mb.mutex.Unlock()
	mb.flag = value
}

func BenchmarkMutexBool_Get(b *testing.B) {
	mb := MutexBool{flag: true}

	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mb.Get()
		}
	})
}

cpu: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
BenchmarkAtomicValue_Get
BenchmarkAtomicValue_Get-8   	1000000000          0.5472 ns/op
BenchmarkMutexBool_Get
BenchmarkMutexBool_Get-8     	24966127            48.80 ns/op

The results are clear. Atomic was more than 89 times faster. And it can be improved even more by using a more primitive type:

type AtomicBool struct{ flag int32 }

func (b *AtomicBool) Get() bool {
	return atomic.LoadInt32(&b.flag) != 0
}

func (b *AtomicBool) Set(value bool) {
	var i int32
	if value {
		i = 1
	}
	atomic.StoreInt32(&b.flag, i)
}

func BenchmarkAtomicBool_Get(b *testing.B) {
	atomB := AtomicBool{flag: 1}

	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			atomB.Get()
		}
	})
}

cpu: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
BenchmarkAtomicBool_Get
BenchmarkAtomicBool_Get-8    	1000000000	         0.3161 ns/op

This version is more than 154 times faster than the mutex version.

Write operations also show a clear difference (though the scale isn’t quite as impressive):

func BenchmarkAtomicBool_Set(b *testing.B) {
	atomB := AtomicBool{flag: 1}

	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			atomB.Set(true)
		}
	})
}

/************/

func BenchmarkAtomicValue_Set(b *testing.B) {
	atomB := AtomicValue{}
	atomB.value.Store(false)

	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			atomB.Set(true)
		}
	})
}

/************/

func BenchmarkMutexBool_Set(b *testing.B) {
	mb := MutexBool{flag: true}

	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mb.Set(true)
		}
	})
}

cpu: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
BenchmarkAtomicBool_Set
BenchmarkAtomicBool_Set-8    	64624705	        16.79 ns/op
BenchmarkAtomicValue_Set
BenchmarkAtomicValue_Set-8   	47654121	        26.43 ns/op
BenchmarkMutexBool_Set
BenchmarkMutexBool_Set-8     	20124637	        66.50 ns/op

Here we can see that atomic is significantly slower at writing than it was at reading, though still much faster than a mutex. Interestingly, the difference between mutex reads and writes isn’t very significant (writes are roughly 36% slower). In spite of that, atomic still performs much better (2-4 times faster than the mutex).

Why is Atomic so fast?

In short, atomic operations are fast because they rely on atomic CPU instructions rather than on external locks. With a mutex, a goroutine that tries to acquire a lock held by someone else is paused until the lock becomes free, and this blocking accounts for a significant portion of the time spent using mutexes. Atomic operations can be performed without any such interruption.
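
To make this concrete, here is an illustrative sketch (the function names are hypothetical) contrasting a counter incremented with atomic.AddInt64, which compiles down to a single atomic add instruction on most platforms, with one guarded by a mutex:

import (
	"sync"
	"sync/atomic"
)

var atomicCounter int64

func IncAtomic() {
	atomic.AddInt64(&atomicCounter, 1) // one atomic CPU instruction, no lock
}

var (
	mu           sync.Mutex
	mutexCounter int64
)

func IncMutex() {
	mu.Lock() // may pause this goroutine if the lock is contended
	mutexCounter++
	mu.Unlock()
}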

Is Atomic always the answer?

As we already demonstrated in one of the earlier examples, atomic can’t solve every problem, and some operations can only be solved using a mutex.

Consider the following example demonstrating a common pattern in which we use a map as an in-memory cache:

package main

import (
	"sync"
	"sync/atomic"
	"testing"
)

// testMap is assumed to be a prepopulated map shared by the benchmarks below
var testMap = map[int]int{0: 0, 1: 1, 2: 2}

//Don't use this implementation!
type AtomicCacheMap struct {
	value atomic.Value //map[int]int
}

func (b *AtomicCacheMap) Get(key int) int {
	return b.value.Load().(map[int]int)[key]
}

func (b *AtomicCacheMap) Set(key, value int) {
	oldMap := b.value.Load().(map[int]int)
	newMap := make(map[int]int, len(oldMap)+1)
	for k, v := range oldMap {
		newMap[k] = v
	}
	newMap[key] = value
	b.value.Store(newMap)
}

func BenchmarkAtomicCacheMap_Get(b *testing.B) {
	atomM := AtomicCacheMap{}
	atomM.value.Store(testMap)

	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			atomM.Get(0)
		}
	})
}

func BenchmarkAtomicCacheMap_Set(b *testing.B) {
	atomM := AtomicCacheMap{}
	atomM.value.Store(testMap)

	b.RunParallel(func(pb *testing.PB) {
		i := 0 // per-goroutine counter; a shared counter would itself be a data race
		for pb.Next() {
			atomM.Set(i, i)
			i++
		}
	})
}

/************/

type MutexCacheMap struct {
	mutex sync.RWMutex
	value map[int]int
}

func (mm *MutexCacheMap) Get(key int) int {
	mm.mutex.RLock()
	defer mm.mutex.RUnlock()
	return mm.value[key]
}

func (mm *MutexCacheMap) Set(key, value int) {
	mm.mutex.Lock()
	defer mm.mutex.Unlock()
	mm.value[key] = value
}

func BenchmarkMutexCacheMap_Get(b *testing.B) {
	mb := MutexCacheMap{value: testMap}

	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mb.Get(0)
		}
	})
}

func BenchmarkMutexCacheMap_Set(b *testing.B) {
	mb := MutexCacheMap{value: testMap}

	b.RunParallel(func(pb *testing.PB) {
		i := 0 // per-goroutine counter; a shared counter would itself be a data race
		for pb.Next() {
			mb.Set(i, i)
			i++
		}
	})
}

cpu: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
BenchmarkAtomicCacheMap_Get
BenchmarkAtomicCacheMap_Get-8   	301664540           4.194 ns/op
BenchmarkAtomicCacheMap_Set
BenchmarkAtomicCacheMap_Set-8   	   87637            95889 ns/op
BenchmarkMutexCacheMap_Get
BenchmarkMutexCacheMap_Get-8    	20000959            54.63 ns/op
BenchmarkMutexCacheMap_Set
BenchmarkMutexCacheMap_Set-8    	 5012434            267.2 ns/op

Yikes, that performance is painful. Atomic performs very poorly when large structures must be copied. Not only that, but this code contains a race condition as well: just like the slice example at the beginning of this article, new cache entries may be added between the time the map is copied and the time it is stored, in which case those entries are lost. And here, the -race flag won’t detect any data race, since there is no concurrent access to the same map.
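
If a mostly-read map is genuinely what you need, one common workaround (sketched below; the HybridCacheMap name is illustrative) is to serialize writers with a mutex while readers stay lock-free on the atomic.Value, a pattern similar to the read-mostly example in the sync/atomic documentation:

type HybridCacheMap struct {
	writeMu sync.Mutex   // serializes writers, preventing lost updates
	value   atomic.Value // holds a map[int]int
}

func (h *HybridCacheMap) Get(key int) int {
	return h.value.Load().(map[int]int)[key]
}

func (h *HybridCacheMap) Set(key, value int) {
	h.writeMu.Lock()
	defer h.writeMu.Unlock()
	oldMap := h.value.Load().(map[int]int)
	newMap := make(map[int]int, len(oldMap)+1)
	for k, v := range oldMap {
		newMap[k] = v
	}
	newMap[key] = value
	h.value.Store(newMap)
}

This removes the lost updates, but every write still copies the entire map, so it only pays off when writes are rare.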

Caveats

Golang’s own documentation warns against the potential misuse of the atomic package:

These functions require great care to be used correctly. Except for special, low-level applications, synchronization is better done with channels or the facilities of the sync package. Share memory by communicating; don’t communicate by sharing memory.

One of the first issues you may encounter when starting to use the atomic package is:

panic: sync/atomic: store of inconsistently typed value into Value

With atomic.Value’s Store method, it’s important to ensure that exactly the same concrete type is stored on every call. That may sound easy, but it’s often not as simple as it sounds:

package main

import (
	"fmt"
	"sync/atomic"
)

//Our own custom error type which implements the error interface
type CustomError struct {
	Code    int
	Message string
}

func (e CustomError) Error() string {
	return fmt.Sprintf("%d: %s", e.Code, e.Message)
}

func InternalServerError(msg string) error {
	return CustomError{Code: 500, Message: msg}
}

func main() {
	var (
		err1 error = fmt.Errorf("error happened")
		err2 error = InternalServerError("another error happened")
	)

	errVal := atomic.Value{}
	errVal.Store(err1)
	errVal.Store(err2) //panics here
}

It’s not enough that both values are of the error type; they merely implement the error interface, while their underlying concrete types differ, and atomic.Value therefore rejects the second store.
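
One common workaround is to wrap whatever you store in a single concrete type of your own; the errorBox type below is an illustrative sketch:

// errorBox always has the same concrete type, no matter which error it wraps
type errorBox struct {
	err error
}

func storeBoth(err1, err2 error) {
	errVal := atomic.Value{}
	errVal.Store(errorBox{err1})
	errVal.Store(errorBox{err2}) // no panic: the stored concrete type is always errorBox
}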

Summary

Race conditions are bad, and access to shared resources should be protected. Mutexes are cool, but tend to be slow due to lock contention. The atomic package can be an amazingly fast alternative to mutexes for cases where the read-copy-update pattern makes sense (this tends to be dynamic configuration such as feature flags or log levels, or maps and structures filled all at once, for example through JSON unmarshaling), especially when reads significantly outnumber writes. Atomic should typically not be used for other use cases (e.g. variables that grow over time, such as caches), and its usage requires great care.

Probably the most important takeaway is that locking should be kept to a minimum, and if you’re considering alternatives such as atomic, be sure to test and experiment extensively before going to production.