Ring buffer usage and cache locality: boost lockfree spsc_queue cache memory access

https://stackoverflow.com/questions/26534342/boost-lockfree-spsc-queue-cache-memory-access

C++ is principally a pay-for-what-you-need ecosystem.

Any regular queue will let you choose the storage semantics (by value or by reference).

However, this time you ordered something special: you ordered a lock free queue. In order to be lock free, it must be able to perform all the observable modifying operations as atomic operations. This naturally restricts the types that can be used in these operations directly.

You might doubt whether it’s even possible to have a value-type that exceeds the system’s native register size (say, int64_t).

Good question.

Enter Ringbuffers

Indeed, any node-based container would just require pointer swaps for all modifying operations, which is trivially made atomic on all modern architectures. But does anything that involves copying multiple distinct memory areas, in a non-atomic sequence, really pose an unsolvable problem?

No. Imagine a flat array of POD data items. Now, if you treat the array as a circular buffer, you only have to maintain the indices of the buffer's front and end positions atomically. The container could, at leisure, update an internal ‘dirty front index’ while it copies ahead of the external front. (The copy can use relaxed memory ordering.) Only once the whole copy is known to have completed is the external front index updated. This update needs to be in acq_rel/cst memory order[1].

As long as the container is able to guard the invariant that the front never fully wraps around and reaches the back, this is a sweet deal. I think this idea was popularized in the Disruptor library (of LMAX fame). You get mechanical resonance from

  • linear memory access patterns while reading/writing
  • even better if you can make the record size aligned with (a multiple) physical cache lines
  • all the data is local unless the POD contains raw references outside that record

How Does Boost’s spsc_queue Actually Do This?

  1. Yes, spsc_queue stores the raw element values in a contiguous, aligned block of memory (e.g. from compile_time_sized_ringbuffer, which underlies spsc_queue when a compile-time maximum capacity is supplied):
    typedef typename boost::aligned_storage<max_size * sizeof(T),
                                            boost::alignment_of<T>::value
                                           >::type storage_type;
    
    storage_type storage_;
    
    T * data()
    {
        return static_cast<T*>(storage_.address());
    }

    (The element type T need not even be POD, but it needs to be both default-constructible and copyable).

  2. Yes, the read and write pointers are atomic integral values. Note that the boost devs have taken care to apply enough padding to avoid False Sharing on the cache line for the reading/writing indices: (from ringbuffer_base):
    static const int padding_size = BOOST_LOCKFREE_CACHELINE_BYTES - sizeof(size_t);
    atomic<size_t> write_index_;
    char padding1[padding_size]; /* force read_index and write_index to different cache lines */
    atomic<size_t> read_index_;
  3. In fact, as you can see, there is only the “internal” index on each of the read and write sides. This is possible because there is only one writing thread and only one reading thread, which means that by the end of a write operation there can only be more space available than anticipated.
  4. Several other optimizations are present:
    • branch prediction hints for platforms that support it (unlikely())
    • it’s possible to push/pop a range of elements at once. This should improve throughput in case you need to siphon from one buffer/ringbuffer into another, especially if the raw element size is not equal to (a whole multiple of) a cacheline
    • use of std::uninitialized_copy where possible
    • The calling of trivial constructors/destructors will be optimized out at instantiation time
    • the uninitialized_copy will be optimized into memcpy on all major standard library implementations (meaning that e.g. SSE instructions will be employed if your architecture supports them)

All in all, this is about the best possible design for a ring buffer.

What To Use

Boost has given you all the options. You can elect to make your element type a pointer to your message type. However, as you already raised in your question, this level of indirection reduces locality of reference and might not be optimal.

On the other hand, storing the complete message type in the element type could become expensive if copying is expensive. At the very least try to make the element type fit nicely into a cache line (typically 64 bytes on Intel).

So in practice you might consider storing the frequently used data right there in the value, and referencing the less frequently used data with a pointer (the cost of the pointer is low unless it’s traversed).

If you need that “attachment” model, consider using a custom allocator for the referred-to data so you can achieve favourable memory access patterns there too.

Let your profiler guide you.


[1] I suppose for SPSC acq_rel should work, but I’m a bit rusty on the details. As a rule, I make it a point not to write lock-free code myself. I recommend that anyone else follow my example 🙂

Blocking Bounded Queue

Minimalistic blocking bounded queue in C++

MPMCQueue

https://github.com/rigtorp/MPMCQueue/blob/master/include/rigtorp/MPMCQueue.h

http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
