RADULS is a fast multithreaded MSD radix sorter designed mainly to sort unsigned integers. This repository contains an example application using RADULS.
RADULS may be compiled with GCC (6 or higher) or visual C++ (Visual Studio 2015 or higher). Additional memory of size of array to sort is required (RADULS is not in-place algorithm).
To use raduls one should include raduls.h
file. All definitions are in raduls
namespace. Currently there are two function available:
void raduls::CleanTmpArray(uint8_t* tmp, uint64_t n_recs, uint32_t rec_size, uint32_t n_threads);
This function "touches" every memory page starting on address tmp of size n_recs*rec_size bytes using n_threads. The rationale behind this function is that allocating a memory with new[]
or malloc
does not really givie the memory to the process (it is postponed by the operating system to the first use). This function should be used only if tmp
memory was not yet touched by a program. See also the "On sorting time measurements" section for more explanation.
void raduls::RadixSortMSD(uint8_t* input, uint8_t* tmp, uint64_t n_recs, uint32_t rec_size, uint32_t key_size, uint32_t n_threads);
This is the main sorting procedure. It uses n_threads
threads. input
points to an array of n_recs
records of size rec_size
bytes to sort. tmp
points to an auxiliary array of the same size array pointed by input
. rec_size
must be a multiple of 8. key_size
is a number of bytes of a record that forms a key (i.e. the part of a record that is used while sorting). It must be lower or equal to rec_size
. In the simplest case a rec_size
and key_size
are equal and whole record forms a key. If they are not equal the first key_size
bytes form a key, and the last floor((rec_size - key_size)/8)*8
bytes form a data field. The remaining bytes in the middle must be filled with zeros. For example if rec_size
is 16, and a key_size is 6 the record may be presented as follows:
-------------------------------------------------------------------------------------------------
| | | | | | | | | | | | | | | | |
| K0 | K1 | K2 | K3 | K4 | K5 | 0 | 0 | D | D | D | D | D | D | D | D |
| | | | | | | | | | | | | | | | |
-------------------------------------------------------------------------------------------------
where:
- K0 -- least significant byte of a key, it is also a start address of a record (in case of first record in the array it is
input
address) - K5 -- most significant byte of a key
- D -- byte belonging to data field
The result of sorting may be in array pointed by input
or tmp
depending on the key_size
. If it is even the result will be in the array pointed by input
, in the oposite case (odd key_size
) the result will be in the array pointed by tmp
.
input
and tmp
addresses must be divisible by 256.
#include <iostream>
#include <chrono>
#include <random>
#include "raduls.h"
int main()
{
uint64_t n_recs = 100000000;
uint32_t key_size = 8;
std::mt19937_64 gen;
std::uniform_int_distribution<uint64_t> dis(0, ~0ull);
auto _raw_input = new uint8_t[n_recs * sizeof(uint64_t) + raduls::ALIGNMENT];
auto _raw_tmp = new uint8_t[n_recs * sizeof(uint64_t) + raduls::ALIGNMENT];
auto input = reinterpret_cast<uint64_t*>(_raw_input);
auto tmp = reinterpret_cast<uint64_t*>(_raw_tmp);
while (reinterpret_cast<uintptr_t>(input) % raduls::ALIGNMENT) ++input;
while (reinterpret_cast<uintptr_t>(tmp) % raduls::ALIGNMENT) ++tmp;
for (uint64_t i = 0; i < n_recs; ++i)
{
input[i] = dis(gen) & (~0ull >> ((8 - key_size)*8));
//std::cerr << input[i] << " ";
}
auto n_threads = std::thread::hardware_concurrency();
raduls::CleanTmpArray(reinterpret_cast<uint8_t*>(tmp), n_recs, sizeof(uint64_t), n_threads);
auto start = std::chrono::high_resolution_clock::now();
raduls::RadixSortMSD(reinterpret_cast<uint8_t*>(input), reinterpret_cast<uint8_t*>(tmp), n_recs, sizeof(uint64_t), key_size, n_threads);
std::cerr << "\nTime: " << std::chrono::duration<double>(std::chrono::high_resolution_clock::now() - start).count() << "\n";
auto result = key_size % 2 ? tmp : input;
//for (uint64_t i = 0; i < n_recs; ++i)
// std::cerr << result[i] << " ";
delete[] _raw_input; delete[] _raw_tmp;
}
It is always an important issue how to measure the running time of functions that need some external memory. This is especially important in case of sorting of huge arrays using non-in-place algorithms that require a lot of additional memory (likely the same size as the input data). Operating systems vary in how they initialize huge memory blocks just after allocation (this also depends on the hardware). Sometimes the initialization is so slow that the time of first accesses to the memory cells can be comparable with the sorting time. From the other side, in typical scenarios, the memory allocated by some program will be used many times, so the additional cost of "touching" the memory will be paid only once. Thus, in our opinion, this is not a part of the sorting procedure and should not be included in the sorting time.
Therefore, in the experiments in which we evaluated sorting algorithms the time of "touching" the memory (with a use of CleanTmpArray
) are not taken into account.
RADULS may throw following exceptions:
InputNotAlignedException
- ifinput
is not properly alignedTempNotAlignedException
- iftmp
is not properly alignedRecSizeNotMultipleOf8Exception
- ifrec_size
is not multiple of 8KeySizeGreaterThanRecSizeException
- ifkey_size
is greater thanrec_size
UsupportedRecSizeException
- ifrec_size
is to big, currently it is set to 32, it may be extended changing theMAX_REC_SIZE_IN_BYTES
inraduls.h
file (it will lead to longer compilation time)
One may note that sorting_network.cpp is compiled with -O1 flag instead of higher optimization level. The reason is that GCC tends to produce branches in sorting network code when higher optimization level is specified, which leads to slower executable.
If you found RADULS useful please cite us:
-
Kokot M., Deorowicz S., Debudaj-Grabysz A. (2017) Sorting Data on Ultra-Large Scale with RADULS. In: Kozielski S., Mrozek D., Kasprowski P., Małysiak-Mrozek B., Kostrzewa D. (eds) Beyond Databases, Architectures and Structures. Towards Efficient Solutions for Data Analysis and Knowledge Representation. BDAS 2017. Communications in Computer and Information Science, vol 716. Springer, Cham
-
Kokot M., Deorowicz S., Długosz M. (2018) Even Faster Sorting of (Not Only) Integers. In: Gruca A., Czachórski T., Harezlak K., Kozielski S., Piotrowska A. (eds) Man-Machine Interactions 5. ICMMI 2017. Advances in Intelligent Systems and Computing, vol 659. Springer, Cham