Merge pull request #17 from ambiata/topic/debug
Performance notes and minor fixes.
novemberkilo authored Mar 25, 2017
2 parents b6da9b7 + 731edc9 commit 944baa3
Showing 4 changed files with 36 additions and 4 deletions.
2 changes: 2 additions & 0 deletions README.md
@@ -47,3 +47,5 @@ regiment sort -k 5 -c 15 -f ',' -o "path/to/output-file" input-file
# all the things
regiment sort -f ',' -k 1 -k 4 -k 5 -c 26 -m 10G --crlf --standardized -o "path/to/output-file" input-file
```

Note: `regiment` requires local storage roughly equivalent to the size of the inputs, and follows unix `TMPDIR` conventions for that storage.
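
As a rough illustration of the `TMPDIR` convention mentioned in the note, the sketch below resolves a scratch directory from the environment. It is not taken from `regiment`; the function name `resolveTmpDir` and the `/tmp` fallback are assumptions made for the example.

```haskell
import System.Environment (lookupEnv)

-- | Hypothetical helper: use TMPDIR when it is set and non-empty,
-- otherwise fall back to /tmp (an assumed default, not regiment's code).
resolveTmpDir :: Maybe FilePath -> FilePath
resolveTmpDir (Just dir) | not (null dir) = dir
resolveTmpDir _ = "/tmp"

main :: IO ()
main = do
  tmp <- resolveTmpDir <$> lookupEnv "TMPDIR"
  putStrLn ("temporary sort chunks would be written under: " ++ tmp)
```

Under this convention, pointing `TMPDIR` at a volume with at least as much free space as the inputs before running `regiment sort` should redirect the temporary storage there.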
30 changes: 30 additions & 0 deletions doc/performance-notes.md
@@ -0,0 +1,30 @@
### Notes on performance

2017-03-23: baseline with no performance tuning, at the project's inception (around commit `b6da9b7`):

```
Sorting an 11GB file (on a Macbook Pro):

gnu-sort (defaults):
  LC_COLLATE=C sort -t '|' -k 3,3 -o ~/Downloads/grohl/sort-sauerkraut
  314.75s user 67.72s system 96% cpu 6:36.24 total

gnu-sort (2GB memory allocation):
  LC_COLLATE=C sort -t '|' -k 3,3 -S 2G -o ~/Downloads/grohl/sort-sauerkraut
  346.95s user 34.03s system 97% cpu 6:32.65 total

regiment (2GB memory allocation):
  ./dist/build/Regiment/regiment sort -c 4 -k 3 -f '|' -m 2147483648 -o
  3283.97s user 481.99s system 95% cpu 1:05:54.34 total
```

Profiling results point clearly to the need to improve `updateMinCursor`:

```
COST CENTRE         MODULE                  %time  %alloc
updateMinCursor     Regiment.Vanguard.Base   69.3    80.1
runVanguard         Regiment.Vanguard.Base    7.5     9.8
compare             Regiment.Data             6.6     0.0
flushVector         Regiment.Parse            3.3     1.4
compare             Regiment.Data             2.8     0.0
readKeyedPayloadIO  Regiment.Vanguard.IO      1.8     1.2
writeCursor         Regiment.Parse            1.2     1.4
selectSortKeys      Regiment.Parse            1.0     1.1
```
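
The notes do not show what `updateMinCursor` does internally. Assuming from its name that it selects the smallest head among the sorted chunks during the merge phase, one common way to keep that step cheap is to hold the heads in an ordered structure, so each selection costs O(log k) rather than a scan over all k chunks. The sketch below is a generic k-way merge over lists, not the project's code; `mergeSorted` is a hypothetical name and `Data.Map.Strict` (from the `containers` package) stands in for a priority queue.

```haskell
import qualified Data.Map.Strict as Map

-- | Merge already-sorted lists into one sorted list.
-- The map is keyed by (current head, source index), so its minimum key is
-- always the next element to emit; the value is the rest of that source.
mergeSorted :: Ord a => [[a]] -> [a]
mergeSorted inputs =
  go (Map.fromList [ ((x, i), xs) | (i, x:xs) <- zip [0 :: Int ..] inputs ])
  where
    go heap = case Map.minViewWithKey heap of
      Nothing -> []
      Just (((x, i), rest), heap') ->
        x : go (case rest of
                  []     -> heap'                      -- source exhausted
                  (y:ys) -> Map.insert (y, i) ys heap') -- advance that source

main :: IO ()
main = print (mergeSorted [[1, 4, 7], [2, 5, 8], [3, 6, 9 :: Int]])
```

Including the source index in the key keeps equal values from different chunks distinct, so duplicates are preserved in the output.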


6 changes: 3 additions & 3 deletions main/regiment.hs
@@ -129,21 +129,21 @@ lfP :: Parser Newline
 lfP =
   flag' LF . mconcat $ [
       long "lf"
-    , help "The input file uses \n to terminate lines (default)."
+    , help "The input file uses \\n to terminate lines (default)."
     ]

 crP :: Parser Newline
 crP =
   flag' CR . mconcat $ [
       long "cr"
-    , help "The input file uses \r to terminate lines."
+    , help "The input file uses \\r to terminate lines."
     ]

 crlfP :: Parser Newline
 crlfP =
   flag' CRLF . mconcat $ [
       long "crlf"
-    , help "The input file uses \r\n to terminate lines."
+    , help "The input file uses \\r\\n to terminate lines."
     ]

 toChar :: Text -> Maybe Word8
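
The help-text change above comes down to how Haskell string literals treat backslashes: `"\n"` is a single newline character, while `"\\n"` is a backslash followed by the letter `n`. A minimal standalone sketch of the difference follows; the names `unescaped` and `escaped` are made up for the example and only `base` is used.

```haskell
-- | Literal with a real newline embedded in the middle of the sentence.
unescaped :: String
unescaped = "The input file uses \n to terminate lines."

-- | Literal with the two visible characters '\' and 'n'.
escaped :: String
escaped = "The input file uses \\n to terminate lines."

main :: IO ()
main = do
  putStrLn unescaped                    -- printed across two lines
  putStrLn escaped                      -- printed on one line, showing \n
  print (length "\n", length "\\n")     -- one character versus two
```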
2 changes: 1 addition & 1 deletion src/Regiment/Parse.hs
@@ -125,7 +125,7 @@ flushVector :: Grow.Grow Boxed.MVector (PrimState IO) (Boxed.Vector BS.ByteStrin
 flushVector acc counter (TempDirectory tmp) = do
   mv <- Grow.unsafeElems acc
   Tim.sort mv
-  (v :: Boxed.Vector (Boxed.Vector BS.ByteString)) <- Grow.unsafeFreeze acc
+  (v :: Boxed.Vector (Boxed.Vector BS.ByteString)) <- Grow.freeze acc
   -- write to TempFile
   newEitherT . IO.withFile (tmp </> (T.unpack $ renderIntegral counter)) WriteMode $ \out -> do
     runEitherT $ writeChunk out v
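
The commit does not say why `unsafeFreeze` was swapped for `freeze`, and `Grow` is the project's own module, so its semantics are not verified here. As an analogy only, the `vector` package's `Data.Vector.freeze` copies the mutable buffer, which means later writes to that buffer cannot show up in the frozen result; `unsafeFreeze` skips the copy and gives no such guarantee. The sketch below shows the copying behaviour, with `snapshotThenMutate` as a hypothetical name.

```haskell
import qualified Data.Vector as V
import qualified Data.Vector.Mutable as MV

-- | Freeze a mutable vector, then overwrite slot 0 of the mutable original.
-- Because 'V.freeze' copies, the returned vector keeps the old contents.
snapshotThenMutate :: IO (V.Vector Int)
snapshotThenMutate = do
  mv <- MV.replicate 3 1
  frozen <- V.freeze mv   -- independent immutable copy
  MV.write mv 0 9         -- only the mutable buffer changes
  pure frozen

main :: IO ()
main = snapshotThenMutate >>= print   -- prints [1,1,1]
```

The copy costs extra allocation per flush, which is the usual trade-off against the safety it provides when the mutable buffer is reused afterwards.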