How I Ported Python Code to C++17 to be 625x Faster

I’ve just about finished porting all of my ByteBuffer and packet logic to C++17 and have achieved a noticeable 625x performance improvement compared to its Python counterpart. This isn’t particularly surprising considering it’s modern C++17, but I want to go into some detail as to how something like this can be done.

Before that, let’s get into why I would port an old project to C++ in the first place.

Why C++

I think there’s really no reason not to use C++ other than the notion that it’s difficult to write C++ code. I don’t agree with this argument.

The reality is that C++ isn’t actually that much more difficult to write if you know the language, and the added performance benefit makes it very much worth it. Sure, it’s easy to misuse, but you’re not writing assembly; you’re writing largely the same code you’d write in any other modern language (though not as functional as Haskell).

For something like a low-level project revolving around manipulation of plain old data, with a low tolerance for slow code, C++ is an excellent choice.

The C++ Core Guidelines are a great read for anyone looking to get into C++ programming.

A simple data structure

To write to a byte_buf we can get a large performance improvement just by making the capacity a compile-time constant. In practice this just means making the buffer the maximum possible size a packet can be. In memory, byte_buf is basically

// u_byte is an alias for an unsigned 8-bit byte (e.g. unsigned char).
template<uint32_t N>
struct byte_buf {
    static const uint32_t capacity_ = N;

    uint32_t read_index = 0;
    uint32_t write_index = 0;
    u_byte data[capacity_];

    // ...
};

which is a buffer with a read index and a write index. One of my initial concerns with something as simple as a struct was allocation time: would creating a fresh byte buffer for every new packet be fast enough?

But anyone who knows anything about machine code/assembly knows that allocating stack space is really cheap: it’s a single instruction. It’s much cheaper than heap allocation because you don’t have to deal with allocator bookkeeping, heap fragmentation, or the cache misses that chasing a heap pointer can cause. In assembly it’s simply sub ESP, 0xFF to allocate 255 bytes of space!
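
As a rough sketch (these two helper functions are hypothetical, not code from the project, and assume the byte_buf definition above), the difference between the two allocation strategies looks like this:

#include <memory>

// Stack version: the whole object lives in the current stack frame, so
// "allocation" is just part of the stack pointer adjustment for the call.
void make_on_stack() {
    byte_buf<256> buf;
    // ... write packet fields into buf ...
}

// Heap version: goes through the allocator, and the payload sits behind a
// pointer, so every access is an extra indirection.
void make_on_heap() {
    auto buf = std::make_unique<byte_buf<256>>();
    // ... same writes, but through an extra pointer chase ...
}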

When I profiled my code, increasing the capacity of the buffer had no measurable effect on performance.

Another thing that’s really important here is to not rely on unnecessary polymorphism or classes. Even though a struct is technically just a class and takes up the same amount of memory, adding virtual functions and layers of polymorphism (like I did in my Python code) greatly increases the overhead of each byte_buf object, since the class needs a vtable and every object has to carry a pointer to it. A virtual call can involve two or three levels or more of pointer indirection, which is expensive and adds up when the function is called frequently. I was heavily inspired by this talk on Data Oriented Design when coming up with the logic for this data structure, which is intended for low-level use.
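
To make that concrete, here’s a hypothetical before/after sketch (not the project’s actual class hierarchy) contrasting the polymorphic design with the plain-struct one:

// Polymorphic design: every object carries a vtable pointer and every
// serialize() call goes through it, so the target can't be known (or
// inlined) at compile time.
struct packet_base {
    virtual ~packet_base() = default;
    virtual void serialize() = 0;
};

struct player_position_packet : packet_base {
    double x, y, z;
    void serialize() override { /* ... */ }
};

// Plain-struct design: no vtable, no indirection; the call target is a
// regular function known at compile time and trivially inlinable.
struct player_position {
    double x, y, z;
};

void serialize(const player_position& p) { /* write p.x, p.y, p.z ... */ }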

In fact, you can find ByteBuffer implementations on GitHub that use classes but don’t virtualize any functions, for this very reason. They don’t template or inline their generic functions either (which is my only gripe). Then again, applying polymorphism here doesn’t really make sense to begin with.

Avoid excessive copies

One way to avoid excessive copies is passing by reference when your data type is larger than a pointer. A reference is typically implemented as a pointer (4 bytes on a 32-bit target, 8 bytes on a 64-bit one), so you don’t gain anything from passing an int by reference unless you want to modify it for some reason.
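
For example (a hypothetical pair of helpers, not from the project):

struct player_state {
    double x, y, z;
    float yaw, pitch;
    bool on_ground;
};

// ~30 bytes of fields: pass by const reference so no copy is made per call.
void write_player_state(const player_state& state);

// A plain int: pass by value; a reference here would just add a
// pointer-sized indirection with nothing saved.
void write_entity_id(int id);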

Similarly, avoid using memcpy when you can just do something like a cast:

template <typename T>
inline T read() {
    // Reinterpret the bytes at the current read position as a T,
    // then advance the read index past them.
    T result = *(T*)&data[read_index];
    read_index += sizeof(T);
    return result;
}

to read an arbitrary data type. Note that this C-style cast is not type-safe and static_cast won’t even compile here; we’re just telling the compiler to interpret the bytes at &data[read_index] as a value of type T.
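
Usage looks roughly like this, assuming read<T>() is a member of byte_buf and the buffer was previously filled with matching writes:

byte_buf<256> buf;
// ... write a double, a float and a bool into buf, in that order ...

double x         = buf.read<double>();
float  yaw       = buf.read<float>();
bool   on_ground = buf.read<bool>();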

Inline your functions

Inlining functions, in particular the recursive ones, offers a minor 10-14% performance benefit in practice, which pays off when you consider how often they are called inside write_packet_to_buf.

Template your functions

Templating functions makes type deduction happen at compile time through overload resolution rather than at runtime, so it’s faster than the runtime type lookups that take place in languages like Python or Java. In Python I was using an OrderedDict to do this! 😭
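
Here’s a simplified sketch of the idea (not the project’s exact packet_t implementation; it assumes the byte_buf and write overloads shown elsewhere in this post): the parameter pack is expanded at compile time and every field resolves to a concrete write overload, so there’s no runtime type lookup at all.

#include <cstdint>
#include <tuple>

// A packet is just a compile-time tuple of its fields.
template <typename... Fields>
struct packet_t {
    std::tuple<Fields...> fields;
    explicit packet_t(Fields... f) : fields(f...) {}
};

// std::apply expands the pack and the fold expression calls the matching
// write(buf, field) overload for each field; everything is resolved at
// compile time.
template <uint32_t N, typename... Fields>
void write_packet_to_buf(byte_buf<N>& buf, const packet_t<Fields...>& packet) {
    std::apply([&](const auto&... field) { (write(buf, field), ...); },
               packet.fields);
}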

Results

To test in C++17 I used std::chrono and compared writing 1 million packets with the same structure to the equivalent code in Python.

auto start = std::chrono::high_resolution_clock::now();
for (unsigned int i = 0; i < 1000000; i++) {
    network::packet::packet_t<double, double, double, float, float, bool> playerpos2((double)3.0, (double)3.2, (double)15503.24, (float)315, (float)15423, true);
    byte_buf<256> aaaz;
    network::packet::write_packet_to_buf(aaaz, playerpos2);
}
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
// ~625x faster than python
std::cout << "Took " << duration.count() << "ms\n"; 

In C++17 this takes 600ms on average in debug and 12ms in release.

t0 = time.time()

for _ in range(0, 1000000):
    PlayerPositionAndLook(X=3.0,Y=3.2,Z=15503.24,Yaw=315, Pitch=315,OnGround=True).write()

t1 = time.time()
print (t1-t0)

The equivalent Python code takes 7.5 seconds on average on my i7 8700k.

It’s 625x faster

Initially I thought the performance boost came from the compiler doing some optimization like collapsing the for loop entirely, but I tested it and that wasn’t the case.

C++ really excels at memory manipulation where Python doesn’t. If you look at the C++ code, there’s very little work done here that is particularly expensive. When we write a double, float, or boolean, it calls the write overload

template <typename T, uint32_t N>
void write(byte_buf<N>& buf, T val) {
    u_byte* bytes;

    // On a little-endian host, swap to big-endian (network) byte order
    // before copying into the buffer.
    #ifdef BOOST_LITTLE_ENDIAN
        auto swap = byteswap::byteswap<T>(val);
        bytes = reinterpret_cast<u_byte*>(&swap);
    #else
        bytes = reinterpret_cast<u_byte*>(&val);
    #endif

    buf.write(bytes, sizeof(T));
}

The only costs that arise here are any copies of the type T and, primarily, byteswap::byteswap<T>(val), which reverses the byte order on a little-endian system. Other than that, buf.write is simply a memcpy into the data buffer plus a single add on write_index.
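
For reference, buf.write looks something like this sketch (a member of byte_buf, assuming <cstring> is included for std::memcpy; the real implementation may differ slightly):

// Copy `size` bytes into the fixed-capacity array and bump the write index.
// No allocation, no growth, just a memcpy and an add.
void write(const u_byte* bytes, uint32_t size) {
    std::memcpy(&data[write_index], bytes, size);
    write_index += size;
}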

Since we already know the buffer’s size up front, this is as fast as reading and writing memory in C++ gets, which is extremely fast and well defined. Not to mention the Python code calls loads of stuff like struct.pack to enforce endianness, which contributes to the slowness of the code.

Another thing to remember is that the compiler knows at compile time what network::packet::packet_t<double, double, double, float, float, bool> is and what all of its write calls entail, thanks to the magic of overloads, so it doesn’t incur the additional runtime cost of looking up types the way Python does. In other words, POD data types save us a lot of performance cost in C++.

Summary

To conclude, I’ve managed to achieve a 625x performance improvement over Python’s ByteBuffer by rewriting it in C++, without relying on too many assumptions and by optimizing purely for performance.

I’ll be posting the project on GitHub in the upcoming weeks when it’s done, for anyone who’s interested in how I ported mcidle.
