The future of mcidle

For the last 2 weeks I’ve mostly been working on mcidle, updating it so that it’s perfectly usable by players. Even though I don’t play Minecraft and haven’t in a long while, I just wanted to work on something that wasn’t web related. There’s a lot to go over, but in essence I:

  • made everything extremely thread safe
  • added the ability for the program to terminate cleanly via daemon threads
  • abstracted away socket communication with “upstreams”
  • removed the worker thread packet processing, which actually slowed mcidle down due to threads deadlocking while waiting for access to the mutex
  • stored all relevant to-be-serialized packet data in a game_state object


Currently mcidle can imitate the game’s state fairly well; almost every relevant packet is processed (inventory, gamemode, health, position, loaded entities) except block/chunk state. The old version of mcidle never had chunk processing because I was too lazy to read the docs on how to deserialize chunks.

The major problems are:

  • We can’t easily implement protocol-agnostic packet serialization/deserialization
  • Processing chunks is extremely slow and can’t be done in Python (it blocks the entire program and uses too much CPU)
  • Packaging Python is literally hell; even with pipx you have to compile the cryptography library along with mcidle, which doesn’t work on a lot of Windows machines for some reason, even with VS tools installed

mcidle-cpp

Remember mcidle-cpp? That post was a little misleading. I basically ended up realizing that there’s no way to design mcidle purely with templates (avoiding virtual inheritance) without pulling my hair out. Instead, we can use virtual inheritance to implement protocol-agnostic packet parsing like mclib does.

The basic idea is to take a packet id, call a lambda on a protocol object that returns the correct packet instance for that id, and then call its serialize or deserialize function, which is also implemented using virtual inheritance. Sure, there are some complications, since we still have to be careful about protocol versions during serialization, but in general it works.
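
Something like this sketch, with illustrative names rather than mcidle-cpp’s actual classes: a protocol object holds an id-to-factory map, and everything downstream only sees the Packet base class.

#include <cstdint>
#include <functional>
#include <memory>
#include <unordered_map>

class ByteBuffer;  // byte buffer type, elided here

// Base class every concrete packet derives from; serialization is
// dispatched through the vtable
class Packet {
public:
    virtual ~Packet() = default;
    virtual void Serialize(ByteBuffer& buf) const = 0;
    virtual void Deserialize(ByteBuffer& buf) = 0;
};

class Protocol {
public:
    using Factory = std::function<std::unique_ptr<Packet>()>;

    // Each protocol version registers its own id -> packet mapping
    void Register(std::int32_t id, Factory f) { factories_[id] = std::move(f); }

    // Look up the factory for an inbound packet id
    std::unique_ptr<Packet> CreatePacket(std::int32_t id) const {
        auto it = factories_.find(id);
        return it != factories_.end() ? it->second() : nullptr;
    }

private:
    std::unordered_map<std::int32_t, Factory> factories_;
};

// Usage: map the id to a packet, then let the virtual call do the
// version-specific work
// auto packet = protocol.CreatePacket(id);
// if (packet) packet->Deserialize(buf);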

So far mcidle-cpp contains a full CMake build system, tests for all the primitive data types, a working client->server Minecraft connection, byte buffers, all the primitive data types, zlib compression/decompression, and packet encryption/decryption (using mclib as a base). The only code I didn’t write myself is the Yggdrasil auth API, which is basically just boilerplate from mclib. I hate writing this part; the auth system’s design is really clunky. If your auth credentials have expired, for instance, it tells you that you have the wrong password. It also goes without saying that this is all built on top of boost::asio.

We can process chunks faster just by writing C++ code that uses a lot of move semantics and avoids unnecessary copies.
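
As a tiny illustration (hypothetical names, not mcidle-cpp’s real types), a chunk section can take ownership of an already-read buffer instead of copying it:

#include <cstdint>
#include <utility>
#include <vector>

// A chunk section that takes ownership of the packed longs it was
// deserialized from instead of copying them
struct ChunkSection {
    std::vector<std::uint64_t> data;

    explicit ChunkSection(std::vector<std::uint64_t> packed)
        : data(std::move(packed)) {}  // move, never copy
};

// Usage: the buffer read off the wire is handed over, not duplicated
// std::vector<std::uint64_t> packed = /* read from the stream */;
// ChunkSection section(std::move(packed));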

Finally, using static linking we can just build a binary without having to worry about whether a user has a specific DLL on their system or not. This makes packaging a LOT easier.

Results so far

Today I was able to test chunk section processing. After controlling for how slow print statements are in C++ when you use streams, being careful about compiler optimizations, and avoiding excessive copies, I got a 15-20x speed improvement for chunk processing (0.37 seconds in C++ versus 7.1 seconds in Python for 14 MB of compressed chunk data, or 441 chunk packets). Chunk processing here means deserializing entire chunk sections, palette information, etc. I think this is sufficient for my port to be successful. Honestly, testing this is sort of a nightmare, but I can definitely say the C++ version is at least an order of magnitude faster.

In fact, when I tried to measure how long it took to process chunks, it had zero impact on the speed of the program. It seems like my actual bottleneck is the rate at which the server can send packets, which works out to chunks being read and processed in 0.36-0.4 seconds upon initially connecting! Blazing fast. These chunks, once uncompressed and deserialized, amount to ~50 MB of data in memory. I thought Python was slower because of the print statements, but when I measured the time for a print it was negligible.

The following Python code for reading chunk sections is what makes mcidle-python 20x slower than its C++ equivalent:

@staticmethod
def read(stream):
    # VarInt, UnsignedByte, UnsignedLong, and ByteArray are mcidle-python's
    # primitive type readers
    # In the latest protocol we have to read a short here (block count)
    # block_count = Short.read(stream)
    bits_per_block = UnsignedByte.read(stream)

    palette_len = VarInt.read(stream)
    if bits_per_block < 4:
        # Indirect palette
        bits_per_block = 4
    if bits_per_block > 8:
        # Direct palette, ignore
        bits_per_block = 13

    # Skip over the palette entries
    for _ in range(palette_len):
        VarInt.read(stream)

    mask = (1 << bits_per_block) - 1

    # Read the packed 64-bit longs, one read call per long
    data_len = VarInt.read(stream)
    data = [UnsignedLong.read(stream) for _ in range(data_len)]

    SECTION_HEIGHT = 16
    SECTION_WIDTH = 16

    # Unpack every block in the 16x16x16 section; this triple loop is
    # what dominates the runtime in Python
    for y in range(SECTION_HEIGHT):
        for z in range(SECTION_WIDTH):
            for x in range(SECTION_WIDTH):
                block_number = ((y * SECTION_HEIGHT) + z) * SECTION_WIDTH + x
                start_long = (block_number * bits_per_block) // 64
                start_offset = (block_number * bits_per_block) % 64
                end_long = ((block_number + 1) * bits_per_block - 1) // 64

                if start_long == end_long:
                    val = data[start_long] >> start_offset
                else:
                    end_offset = 64 - start_offset
                    val = (data[start_long] >> start_offset) | (data[end_long] << end_offset)
                # val holds the palette index (or block id for a direct
                # palette); it's discarded here since we only measure the read
                val &= mask

    block_light = ByteArray.read(stream, 4096 // 2)

When I write the exact same thing in C++, with some niceties like using std::copy or memcpy to read the 64-bit longs in bulk, it’s very fast. The loop over each block in the chunk section is extremely slow in Python and dominates the time (removing this loop makes the Python code run in at most 1 second). It pretty much boils down to how slow loops are in Python: it’s an interpreted language and can’t optimize down to bare metal very well. It could partly be the fact that the loops are nested, but it doesn’t really matter. Trying to make Python go fast is a losing battle, even if you use Cython.
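
For reference, here is roughly what the hot path looks like on the C++ side (a minimal sketch with illustrative names, not mcidle-cpp’s actual API; byte-order swapping of the wire longs is omitted for brevity):

#include <cstdint>
#include <cstring>
#include <vector>

// Same unpacking as the Python code above, but the packed longs are
// bulk-copied with one memcpy instead of one read call per element
std::vector<std::uint16_t> ReadBlocks(const std::uint8_t* buf,
                                      int bits_per_block, int data_len) {
    std::vector<std::uint64_t> data(data_len);
    std::memcpy(data.data(), buf, data_len * sizeof(std::uint64_t));

    const std::uint64_t mask = (1ULL << bits_per_block) - 1;
    std::vector<std::uint16_t> blocks(4096);

    // The triple loop from the Python version, flattened; this compiles
    // down to plain integer ops
    for (int i = 0; i < 4096; ++i) {
        int start_long = (i * bits_per_block) / 64;
        int start_offset = (i * bits_per_block) % 64;
        int end_long = ((i + 1) * bits_per_block - 1) / 64;

        std::uint64_t val = data[start_long] >> start_offset;
        if (start_long != end_long)
            val |= data[end_long] << (64 - start_offset);
        blocks[i] = static_cast<std::uint16_t>(val & mask);
    }
    return blocks;
}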

BTW, I don’t want to hear about how Rust was clearly the better choice. I find it fun to write C++ and mess with build systems for some twisted reason.
