Reduce allocations #22

Open · dcoutts wants to merge 6 commits into main from dcoutts/allocations-reduction
Conversation

@dcoutts (Member) commented Sep 20, 2024

  • Set up a benchmark to measure allocations
  • Reduce allocations in the benchmark itself (to make allocations in the library more significant)
  • Reduce allocations in submitIO, and adjust to be async-exception safe
  • Reduce allocations in awaitIO.

Results:

  • Baseline allocations per IO op: 1331 bytes
  • After improvements to benchmark itself (not lib): 221 bytes
  • After improvements to the library: 109 bytes

That is about 14 words (64-bit) remaining per operation, of which 6 words are the IOOp itself. This is when using batches of 32 IOOps, so some per-batch allocations are amortised over the operations.
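(Arithmetic: 109 bytes ÷ 8 bytes per 64-bit word ≈ 13.6 words per op; 6 of those are the IOOp itself, leaving roughly 8 words of other allocation per op, part of which is per-batch allocation spread over the 32 ops in each batch.)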

Further improvements are probably possible, but the low-hanging fruit is now probably elsewhere. One remaining idea is to make IOOp an instance of Vector.Unbox, so that the vector of IOOps can be an unboxed vector. This would not directly improve allocations much (only about 1.5 words per op), but it might allow applications to do fewer conversions, and/or to reuse (mutable) vectors.
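A minimal sketch of what the Vector.Unbox idea could look like, assuming the IsoUnbox/DerivingVia machinery from vector >= 0.13. The IOOpDemo type here is a made-up two-field stand-in for illustration only, not the library's real IOOp:

{-# LANGUAGE DerivingVia, MultiParamTypeClasses, StandaloneDeriving,
             TypeFamilies, TypeOperators #-}
module IOOpUnboxSketch where

import qualified Data.Vector.Generic         as VG
import qualified Data.Vector.Generic.Mutable as VGM
import qualified Data.Vector.Unboxed         as VU

-- A made-up, simplified stand-in for IOOp: just a file offset and a length.
data IOOpDemo = IOOpDemo !Int !Int

-- How to convert to and from an unboxable representation.
instance VU.IsoUnbox IOOpDemo (Int, Int) where
  toURepr (IOOpDemo off len) = (off, len)
  fromURepr (off, len)       = IOOpDemo off len
  {-# INLINE toURepr #-}
  {-# INLINE fromURepr #-}

-- Represent vectors of IOOpDemo by unboxed vectors of the (Int, Int) representation.
newtype instance VU.MVector s IOOpDemo = MV_IOOpDemo (VU.MVector s (Int, Int))
newtype instance VU.Vector    IOOpDemo = V_IOOpDemo  (VU.Vector    (Int, Int))

deriving via (IOOpDemo `VU.As` (Int, Int)) instance VGM.MVector VU.MVector IOOpDemo
deriving via (IOOpDemo `VU.As` (Int, Int)) instance VG.Vector   VU.Vector  IOOpDemo

instance VU.Unbox IOOpDemo

The real IOOp would be more involved (it is a sum type and carries buffer references), so this is only meant to show the shape of the instances, not a drop-in definition.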

This is a prelude to trying to reduce allocations.

Initial results for the high-level benchmark:

Allocated total: 348802552
Allocated per:   1331

That is, 1331 bytes per IO operation. This includes allocations from the
benchmark itself. Indeed profiling indicates that most of the
allocations are from the benchmark itself rather than the library.
Previously we generated a large list of a random permutation of all the
4k disk blocks in the input file, and then broke that up into groups of
32 blocks. We used forConcurrently_ over the groups which starts one
thread per group.

This approach did excessive work and allocated a lot. For starters, the random permutation is expensive in time and space: it needs to track a set of remaining choices, and the whole list has to be allocated because forConcurrently_ forces the list (of groups). And a thread per 32 blocks is a lot of threads once one uses large files: for our example target file of 262144 blocks, this amounts to 8192 threads, each with its own stack, so it's a lot of allocation in total.

The new approach is:
1. Use random block indexes rather than a permutation. This leads to
   duplicates but is much cheaper because it needs little state. The
   best solution to the duplicates will be to use O_DIRECT to avoid
   caching effects.
2. Only use a small number of threads and have each one submit a series
   of batches of work (still of size 32).

This uses much less state. In addition, we use a lower-level approach to generating the random block indexes, to reduce heap allocations (sketched below).
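A condensed, self-contained sketch of this structure follows. The numbers (4 tasks) and the plain submitBatch callback are illustrative stand-ins; the real benchmark builds IOOp batches and calls the library's submitIO, via helpers such as generateIOOpsBatch and forRngSplitM_ shown in the review excerpts further down.

{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Control.Concurrent.Async as Async
import           Control.Monad (forM_)
import qualified Data.Vector.Unboxed as VU
import qualified System.Random as Random

-- One batch of random block indexes: no permutation state to carry around,
-- at the cost of occasional duplicate block indexes.
generateBatch :: Int -> Int -> Random.StdGen -> VU.Vector Int
generateBatch batchsz lastBlock =
  VU.unfoldrExactN batchsz (Random.uniformR (0, lastBlock))

-- A small, fixed number of long-lived tasks, each submitting a series of
-- batches, instead of one short-lived thread per group of 32 blocks.
runBenchmark :: Int -> Int -> Int -> Int
             -> (VU.Vector Int -> IO ())   -- stand-in for submitIO + awaitIO
             -> IO ()
runBenchmark ntasks nbatches batchsz lastBlock submitBatch = do
    tasks <- mapM (Async.async . runTask)
                  (take ntasks (splits (Random.mkStdGen 42)))
    mapM_ Async.wait tasks
  where
    splits g = let (g1, g2) = Random.split g in g1 : splits g2
    runTask rng_task =
      forM_ (take nbatches (splits rng_task)) $ \ !rng_batch ->
        submitBatch (generateBatch batchsz lastBlock rng_batch)

main :: IO ()
main = do
    let nblocks  = 262144 :: Int   -- illustrative sizes
        ntasks   = 4
        batchsz  = 32
        nbatches = nblocks `div` (ntasks * batchsz)   -- batches per task
    runBenchmark ntasks nbatches batchsz (nblocks - 1) (\_ -> return ())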

The new allocation results for the high level benchmark are:

Allocated total: 57904976
Allocated per:   221

That is, 221 bytes per IO op. This is without any changes yet to the
actual library, just the benchmark.
In the time and allocation profiling (using late cost centers) it helps to have things inlined or not inlined to achieve a readable structure in the profile results.

This helps in identifying the source of the remaining allocations.
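For example (illustrative only, not necessarily the pragmas used in this PR), with late cost centres enabled via GHC >= 9.4's -fprof-late, a NOINLINE pragma keeps a function as its own cost centre while an INLINE pragma folds a small wrapper into its callers:

module ProfilingSketch (outer, inner) where

-- Kept out of line so it shows up as a separate cost centre in the profile.
{-# NOINLINE inner #-}
inner :: Int -> [Int]
inner n = [1 .. n]

-- Inlined into callers so its allocations are attributed to them.
{-# INLINE outer #-}
outer :: Int -> Int
outer = sum . inner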
Try to do two things:
1. reduce allocations and improve performance slightly
2. improve the sync and async exception safety of submitIO

Trying to achieve exception safety is non-trivial and needs masking and
multiple levels of exception handlers.
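A minimal sketch (hypothetical, not the library's actual code) of the kind of structure this needs, with doSubmit and notifyCompletionThread standing in for the real steps:

import Control.Concurrent.QSemN (QSemN, signalQSemN, waitQSemN)
import Control.Exception (mask_, onException)

-- Hypothetical sketch: acquire capacity, make sure any exception after the
-- acquisition releases it again, and only notify the completion thread once
-- the submission has definitely happened.
submitBatchSketch :: QSemN -> Int -> IO () -> IO () -> IO ()
submitBatchSketch qsem n doSubmit notifyCompletionThread =
  mask_ $ do
    -- waitQSemN is interruptible even under mask; if we are interrupted
    -- while blocked here we have acquired nothing, so there is nothing to undo.
    waitQSemN qsem n
    -- From here on, an exception must give the capacity back.
    doSubmit `onException` signalQSemN qsem n
    -- Inform the completion thread if and only if the submit succeeded.
    notifyCompletionThread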

The one remaining difficulty is that if we get a sync exception from the lower-level submit operation, we do not yet reset the ring to clear the SQEs that we filled out.
High level benchmark allocations down to:

Allocated total: 28566920
Allocated per:   109

which is just under 14 words per operation, which isn't too bad. At
least, when using 32 operations per batch (many of the allocations are
per-batch).
Some modules need GADTs or TypeFamilies enabled to solve constraints arising from other modules.
@dcoutts force-pushed the dcoutts/allocations-reduction branch from ad3fd1e to 50a8a63 on September 20, 2024 at 13:01
@mheinzel (Collaborator) left a comment:
I'm still looking at the implementation changes.

    forRngSplitM_ nbatches rng_task $ \ !rng_batch ->
      submitIO ioctx $
        generateIOOpsBatch fd buf lastBlock batchsz rng_batch
  _ <- Async.waitAnyCancel tasks
Collaborator:
Is that right? Shouldn't all async actions finish?

go :: V.MVector s (IOOp IO) -> Random.StdGen -> Int -> ST s ()
go !_ !_ !i | i == size = return ()
go !v !rng !i = do
  let (!block, !rng') = Random.uniformR (0, lastBlock) rng
Collaborator:
Just to clarify, previously we read each block exactly once, now it's random, so some will be read multiple times, some never? That should be fine for this benchmark.

@jorisdral (Collaborator) left a comment:

Looking good. How is running time affected?

Comment on lines +117 to +118
nbatches = lastBlock `div` (ntasks * batchsz) -- batches per task
totalOps = nbatches * batchsz * ntasks
Collaborator:

So, we might ignore a few pages based on whether the division rounds down?

Collaborator:

Do these new INLINE/NOINLINE pragmas in 6217dab change benchmark results in any way?

takeMVar iobatchCompletion

submitIO ioctx@IOCtx {ioctxBatchSizeLimit'} !ioops0 =
-- General case. Needs multiple batches and combining results.
Collaborator:
or it needs no batch at all, if V.null ioops. The code handles this case correctly, but the comment only mentions the V.length ioops > ioctxBatchSizeLimit' case

Comment on lines +213 to +214
awaitIOBatches iobatchCompletions =
  VU.concat <$> mapM takeMVar (reverse iobatchCompletions)
Collaborator:
I was giving some thought to whether it is okay that takeMVar is interruptible here. I suppose if the submitting thread gets interrupted and doesn't take the MVars, that should be fine because the MVars can be GC'ed. No explicit cleanup required

Comment on lines +213 to +214
awaitIOBatches iobatchCompletions =
  VU.concat <$> mapM takeMVar (reverse iobatchCompletions)
Collaborator:

In terms of allocations/number of allocated objects, would reverse on the vector be better than reverse on the list?

Comment on lines +229 to +230
-- We're called with async exceptions masked, but 'waitQSemN' can block and
-- receive exceptions. That's ok. But once we acquire the semaphore
Collaborator:

Because waitQSemN is interruptible, right?

URing.submitIO uring

-- More async exception safety: we want to inform the completionThread
-- /if and only if/ we successfully submitted a bathc of IO. So now that
Collaborator:

Suggested change:
-  -- /if and only if/ we successfully submitted a bathc of IO. So now that
+  -- /if and only if/ we successfully submitted a batch of IO. So now that

 awaitIO :: URing -> IO IOCompletion
-awaitIO (URing uringptr) =
-  alloca $ \cqeptrptr -> do
+awaitIO !URing {uringptr, cqeptrfptr} = do
Collaborator:

Redundant bang
