Performance tuning

vukasin gostovic edited this page Mar 5, 2024 · 8 revisions

Blutgang is designed to have insanely fast caching and load balancing. Out of the box, with no performance tuning, it's up to 3x faster than its competitors. However, we can push it even further. In this section we'll go over both general OS-level and Blutgang-specific tweaks we can make to decrease access time and improve performance.

Tuning Blutgang

When tuning Blutgang itself, we are concerned with RPC access and database settings. Our main tools for tuning RPC access are the ttl, max_consecutive and max_per_second options.

ttl

ttl is a global option that dictates the acceptable time we wait for an RPC to deliver us an answer before we drop it. If we submit a request to Blutgang that is not cached, we request it from the fastest available RPC. If the RPC does not deliver us an answer within the ttl time, we drop it from the active queue and pick a new one. This process is repeated until we find a suitable RPC or until there are no available RPC endpoints.
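The retry loop described above can be sketched roughly as follows. This is only an illustration of the idea, not Blutgang's actual implementation: `query_with_ttl`, the `(name, call)` pairs and the millisecond unit of `ttl_ms` are assumptions made for the example, and the sketch simply skips a slow RPC for the current request rather than managing a real active queue.

```python
import asyncio
from typing import Any, Awaitable, Callable, Sequence, Tuple

# Hypothetical type: an async function that sends one request to one RPC.
RpcCall = Callable[[Any], Awaitable[Any]]


async def query_with_ttl(
    rpcs: Sequence[Tuple[str, RpcCall]],
    request: Any,
    ttl_ms: float,
) -> Tuple[str, Any]:
    """Try each RPC in order (assumed sorted fastest first).

    An RPC that does not answer within ttl_ms is skipped and the next
    one is tried. If none answers in time, an error is raised.
    """
    for name, call in rpcs:
        try:
            # wait_for cancels the pending call once the timeout expires.
            response = await asyncio.wait_for(call(request), timeout=ttl_ms / 1000.0)
        except asyncio.TimeoutError:
            continue  # too slow: move on to the next candidate
        return name, response
    raise RuntimeError("no RPC endpoint answered within the ttl")
```

With a low ttl, a slow endpoint is abandoned quickly at the cost of more retries. With a high ttl, each request may wait longer before falling back.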

There is no golden ttl value. Set it to the highest latency you are willing to tolerate: a response that takes longer than that signals either that something has gone wrong with that RPC, or that it is worth retrying with a potentially faster one.

It's recommended that you experiment with your setup to find the value that's right for you. For nodes that run on the same machine as Blutgang, the average latency is around 8ms. For remote nodes, it can be as high as ~150ms.
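One possible way to run that experiment is to record response times for your own RPCs and derive a starting ttl from a high percentile plus some headroom. The sketch below assumes you already have latency samples in milliseconds. The function name `suggest_ttl_ms`, the 99th percentile and the 1.5x headroom are arbitrary choices for illustration, not values recommended by Blutgang.

```python
import statistics
from typing import Sequence


def suggest_ttl_ms(
    samples_ms: Sequence[float],
    percentile: int = 99,
    headroom: float = 1.5,
) -> float:
    """Return a candidate ttl: the given percentile of the samples times headroom.

    Needs at least two samples. percentile must be between 1 and 99.
    """
    # quantiles(n=100) returns the 99 cut points between percentiles 1..99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cuts[percentile - 1] * headroom
```

Treat the result as a starting point and adjust it after observing how often requests fall back to another RPC.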

max_consecutive and max_per_second

[llama]
url = "https://eth.llamarpc.com" # RPC url
max_consecutive = 10 # The maximum number of times we can use this RPC in a row.
max_per_second = 100 # Max amount of queries per second.

max_consecutive is an option that limits the number of consecutive RPC queries sent to a node. It is a global value if set via the CLI, or a per-RPC value if set in the config file.

max_per_second is an optimistic limit on how many times a node can be called per second. This means that Blutgang will honor max_per_second unless no other RPCs are available. For example, if Blutgang has access to only 1 healthy RPC and is receiving more requests per second than the specified max_per_second, it will not drop or delay any request and will forward them straight to the node.

max_consecutive and max_per_second give us options to adjust load on a per RPC basis, so no RPC gets too overwhelmed.
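To make the interaction of the two limits concrete, here is a minimal selection sketch. It is an assumption-laden illustration rather than Blutgang's real scheduler: the `Rpc` class, the `pick_rpc` function and the counter fields are hypothetical, and resetting `sent_this_second` once per second is left to the caller.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Rpc:
    name: str
    max_consecutive: int
    max_per_second: int
    consecutive: int = 0        # how many times in a row this RPC was picked
    sent_this_second: int = 0   # caller is expected to reset this every second


def pick_rpc(rpcs: Sequence[Rpc], last: Optional[Rpc] = None) -> Rpc:
    """Pick the first RPC (assumed sorted fastest first) that is under both limits.

    The limits are optimistic: if every RPC is over a limit, the fastest
    one is used anyway, so no request is dropped or delayed.
    """
    chosen = rpcs[0]  # fallback when nothing is under its limits
    for rpc in rpcs:
        over_consecutive = rpc is last and rpc.consecutive >= rpc.max_consecutive
        over_rate = rpc.sent_this_second >= rpc.max_per_second
        if not over_consecutive and not over_rate:
            chosen = rpc
            break
    # Update the bookkeeping for the RPC we are about to use.
    chosen.consecutive = chosen.consecutive + 1 if chosen is last else 1
    chosen.sent_this_second += 1
    return chosen
```

With two RPCs that both have max_consecutive = 2, repeated calls that pass the previous pick as `last` yield the order fastest, fastest, second, fastest. With a single RPC, the same RPC is returned even after max_per_second is exceeded.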

As with ttl, there is no golden value for max_consecutive or max_per_second. Both of these depend on what your hardware/node software can handle.

System tuning

This section covers how to extract the absolute maximum performance from Blutgang and your node(s). When optimizing the system, we care about how the OS networking stack, disk and filesystem, memory and cache interact with our node and Blutgang. We recommend that most users skip this section.

As a general checklist for tuning your system, we recommend you practice the following:

  • Follow hardware manufacturers' guidelines for low latency BIOS tuning.
  • Research system hardware topology.
  • Determine which CPU sockets and PCIe slots are directly connected.
  • Ensure that adapter cards are installed in the most performant PCIe slots (e.g., x8 vs. x16).
  • Ensure that memory is installed and operating at maximum supported frequency.
  • Make sure the OS is fully updated.
  • Enable the network-latency tuned profile, or perform equivalent tuning.
  • Verify that power management settings are correct.
  • Stop all unnecessary services/processes.
  • Unload unnecessary kernel modules (for instance, iptables/netfilter).
  • Perform baseline latency tests.
  • Iterate, making isolated tuning changes, testing in between each change.
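For the last two checklist items, a small timing harness is enough to compare a baseline against each isolated change. The sketch below is generic: `measure_ms` and `summarize` are hypothetical helper names, and `fn` stands for whatever operation you want to time, such as sending one request through Blutgang with a client of your choice.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_ms(fn: Callable[[], object], runs: int = 200) -> List[float]:
    """Call fn repeatedly and return the wall-clock time of each call in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples: List[float]) -> Dict[str, float]:
    """Reduce a list of timings to min, median and max."""
    ordered = sorted(samples)
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "max": ordered[-1],
    }
```

Record the summary before touching anything, change one setting, measure again, and keep the change only if the numbers improve.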

Networking

Generally, most distros have sane defaults when it comes to networking. Going for the absolute minimum latency may cause regressions, because the changes can negatively affect other parts of the stack.

When tuning network performance for low latency, the goal is to have IRQs be serviced on the same core or socket that is currently executing the application that is interested in the network packet. This increases CPU cache hit-rate and avoids using the inter-processor link.
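To see which cores currently service a network device's interrupts, you can inspect /proc/interrupts on Linux. The following sketch parses its text into per-CPU counts. The helper names `parse_interrupts` and `irqs_for` are made up for this example, and the device name `eth0` used below is only a placeholder for your own interface.

```python
from typing import Dict, List, Tuple


def parse_interrupts(text: str) -> Tuple[List[str], Dict[str, Tuple[List[int], str]]]:
    """Parse /proc/interrupts-style text into (cpu_names, {irq: (counts, description)})."""
    lines = text.strip("\n").splitlines()
    if not lines:
        return [], {}
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        for field in fields[: len(cpus)]:
            if not field.isdigit():
                break
            counts.append(int(field))
        if len(counts) != len(cpus):
            continue  # summary rows such as ERR have no per-CPU columns
        table[label.strip()] = (counts, " ".join(fields[len(cpus):]))
    return cpus, table


def irqs_for(text: str, device: str) -> Dict[str, Dict[str, int]]:
    """Return per-CPU interrupt counts for every IRQ whose description mentions device."""
    cpus, table = parse_interrupts(text)
    return {
        irq: dict(zip(cpus, counts))
        for irq, (counts, description) in table.items()
        if device in description
    }
```

On a live system you could pass in the contents of /proc/interrupts, for example `irqs_for(open("/proc/interrupts").read(), "eth0")`, and compare the busiest CPU with the one Blutgang runs on.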

irqbalance

irqbalance is a daemon to help balance the CPU load generated by interrupts across all of a system's CPUs. irqbalance identifies the highest volume interrupt sources, and isolates them to a single unique CPU so that load is spread as much as possible over an entire processor set, while minimizing cache miss rates for IRQ handlers.

The irqbalance service should be enabled by default on most distros.

Filesystem

Most Linux distros come with sane ext4 defaults. It is not recommended to use OpenZFS on Linux as it may increase Blutgang DB access latency. ZFS on FreeBSD has been noted to perform nominally, although we offer no support for it.

Unsafe ext4 tweaks

Warning: Changing the following options may cause filesystem corruption and loss of data. Only change them if you know what you're doing.

Turning barriers off

Ext4 enables write barriers by default. They ensure that filesystem metadata is correctly written and ordered on disk, even when write caches lose power. This comes at a performance cost, especially for applications that use fsync heavily, like databases.

To turn barriers off, add the option barrier=0 to the mount options of the desired filesystem. For example:

/etc/fstab

/dev/sda5    /    ext4    defaults,barrier=0    0    1

Disabling journaling

Disabling the journal with ext4 can be done with the following command on an unmounted disk:

tune2fs -O "^has_journal" /dev/sdXn

Scheduler

(todo)

CPU, topology, NUMA and memory

Blutgang makes heavy use of x86_64 ISA extensions (specifically SSE4.2 and AVX2). We support and offer official aarch64 builds, but you might find them lacking in performance.
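On Linux, you can check whether a machine advertises these extensions by looking at the flags line in /proc/cpuinfo, where they appear as `sse4_2` and `avx2`. The sketch below takes the file contents as a string. The function name `missing_flags` is an illustrative choice, not part of Blutgang.

```python
from typing import Iterable, List

REQUIRED_FLAGS = ("sse4_2", "avx2")


def missing_flags(cpuinfo_text: str, required: Iterable[str] = REQUIRED_FLAGS) -> List[str]:
    """Return the required CPU flags that the first 'flags' line does not list.

    If there is no 'flags' line at all (for example on aarch64, which
    reports 'Features' instead), every required flag counts as missing.
    """
    required = list(required)
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            present = set(value.split())
            return [flag for flag in required if flag not in present]
    return required
```

An empty result from `missing_flags(open("/proc/cpuinfo").read())` means both extensions are reported by the CPU.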

AMD and Intel processors offer different performance tradeoffs. AMD CPUs generally have more cache than their Intel counterparts. However, due to their chiplet design, the latency to access that cache is much higher. Cache misses are more costly on AMD but should happen less frequently.

Memory

Besides running as few background processes as possible, there isn't much we can do to improve memory performance. The best way to extract performance from Blutgang's in-memory cache is to use the best RAM available.

Blutgang does not currently support hugepages. When support gets added, a guide on how to utilize it will be posted here.

Pinning

(todo)