Fix bug in drawWithoutReplacementSkip() #734

sligocki · 2024-07-29T19:54:40Z

drawWithoutReplacementSkip() appears to have a bug in both the drawWithoutReplacementSimple() and drawWithoutReplacementFisherYates() branches. Both bugs show up only when more than one index is skipped. They are demonstrated by the newly added tests drawWithoutReplacementSkip.small_small4 (FisherYates) and drawWithoutReplacementSkip.small_large2 (Simple).

I fix both by switching to std::sample(), which seems like a safer bet for avoiding these sorts of subtle bugs. This change requires updating to (at least) C++17.

This might be the source of #733.
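To make the proposed approach concrete, here is a minimal sketch of drawing without replacement while skipping several indices via std::sample(). The function name sampleWithSkips and its signature are hypothetical and chosen for illustration only; this is not the code from the PR and not ranger's actual interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <random>
#include <vector>

// Hypothetical helper (not ranger's API): draw num_samples distinct values
// from [0, range_length), never returning a value listed in skip.
std::vector<std::size_t> sampleWithSkips(std::size_t range_length,
                                         const std::vector<std::size_t>& skip,
                                         std::size_t num_samples,
                                         std::mt19937_64& rng) {
  // Mark every skipped index; the order of skip does not matter here.
  std::vector<bool> is_skipped(range_length, false);
  for (std::size_t s : skip) {
    assert(s < range_length);
    is_skipped[s] = true;
  }

  // Materialize the candidates that are allowed to be drawn.
  std::vector<std::size_t> candidates;
  candidates.reserve(range_length);
  for (std::size_t i = 0; i < range_length; ++i) {
    if (!is_skipped[i]) {
      candidates.push_back(i);
    }
  }

  // std::sample (C++17) copies min(num_samples, candidates.size()) elements,
  // each subset being equally likely.
  std::vector<std::size_t> result;
  result.reserve(num_samples);
  std::sample(candidates.begin(), candidates.end(),
              std::back_inserter(result), num_samples, rng);
  return result;
}
```

Because the candidates are stored in a vector, std::sample() keeps the selected elements in their original relative order, so the result comes back sorted. The cost of this version is a full pass over the range on every call, independent of how few values are requested.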


mnwright · 2024-08-05T06:30:15Z

Thanks, that looks like an easy fix (and simplification). Did you check any runtimes? Back in 2014, the Fisher-Yates implementation gave a big speedup for some settings (which I cannot remember exactly).

sligocki · 2024-08-06T20:21:50Z

Nope, I haven't tested any runtimes. I expect that std::sample is implementing Fisher-Yates (or something similar). Hopefully the std library folks have optimized this so that we don't have to :) I suppose that it may be slightly less efficient if you are running with a huge number of options and only a tiny sample (since it has to allocate the list of all indexes at the start). I wouldn't expect that to be the bottleneck in most contexts, but it makes sense to test it.
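The "huge number of options, tiny sample" case mentioned above can be handled without allocating the full index list. The following is a sketch of one such approach, rejection sampling with an index shift past the skipped values. The name rejectionSampleWithSkips and the requirement that the skip list is sorted and free of duplicates are assumptions of this sketch, not statements about ranger's drawWithoutReplacementSimple().

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not ranger's API). sorted_skip must be ascending,
// free of duplicates, and contain only values below range_length.
std::vector<std::size_t> rejectionSampleWithSkips(
    std::size_t range_length, const std::vector<std::size_t>& sorted_skip,
    std::size_t num_samples, std::mt19937_64& rng) {
  assert(std::is_sorted(sorted_skip.begin(), sorted_skip.end()));
  assert(std::adjacent_find(sorted_skip.begin(), sorted_skip.end()) ==
         sorted_skip.end());
  assert(sorted_skip.size() <= range_length);
  const std::size_t num_candidates = range_length - sorted_skip.size();
  assert(num_samples <= num_candidates);

  std::vector<std::size_t> result;
  result.reserve(num_samples);
  if (num_samples == 0) {
    return result;
  }

  std::unordered_set<std::size_t> seen;
  std::uniform_int_distribution<std::size_t> dist(0, num_candidates - 1);
  while (result.size() < num_samples) {
    // Draw a rank among the non-skipped values ...
    std::size_t draw = dist(rng);
    // ... and shift it past every skipped value at or below it. Walking the
    // skips in ascending order is what makes multiple skips work.
    for (std::size_t s : sorted_skip) {
      if (draw >= s) {
        ++draw;
      } else {
        break;
      }
    }
    // Reject values that were already drawn.
    if (seen.insert(draw).second) {
      result.push_back(draw);
    }
  }
  return result;
}
```

Memory use is proportional to the sample size rather than the range length. The trade-off is that the loop slows down when num_samples approaches the number of candidates, because most draws are then rejected as duplicates.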

mnwright · 2024-08-15T06:42:46Z

Unfortunately, this is slower, particularly for high-dimensional data, and the slowdown is quite extreme for GWAS settings (p>>n, few unique values). Here is an example (in R, but results will be similar without R):

#remotes::install_github("imbs-hl/ranger")
#remotes::install_github("sligocki/ranger@sample_skip_bug")

library(ranger)
library(microbenchmark)

# High dimensional
n <- 1000
p <- 10000
x <- matrix(rbinom(n * p, size = 1, prob = .5), nrow = n, 
            dimnames = list(NULL, paste0("X", 1:p)))
y <- rnorm(n)

microbenchmark(
  default = ranger(x = x, y = y, num.threads = 1),
  times = 1)

# Extremely high dimensional
n <- 100
p <- 100000
x <- matrix(rbinom(n * p, size = 1, prob = .5), nrow = n, 
            dimnames = list(NULL, paste0("X", 1:p)))
y <- rnorm(n)

microbenchmark(
  default = ranger(x = x, y = y, num.threads = 1),
  times = 1)

On this machine, the first one takes about 10 seconds on master and about 24 seconds with this PR. The second one is more extreme: less than 2 seconds on master and about 17 seconds with this PR. For a real GWAS (or similar), the differences will probably be several hours. ☹️

So it looks like I have to fix the skipping in drawWithoutReplacementFisherYates().
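For reference, a partial Fisher-Yates shuffle that handles any number of skipped indices could look like the sketch below. The name fisherYatesWithSkips and its signature are hypothetical; this is neither the existing drawWithoutReplacementFisherYates() nor the fix that was eventually merged. A partial shuffle needs only one random number per drawn element, which may be part of why it outperforms a generic sampler in the benchmark above, though that explanation is an assumption and was not measured here.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Hypothetical helper (not ranger's API): draw num_samples distinct values
// from [0, range_length) excluding skip, using a partial Fisher-Yates shuffle.
std::vector<std::size_t> fisherYatesWithSkips(
    std::size_t range_length, const std::vector<std::size_t>& skip,
    std::size_t num_samples, std::mt19937_64& rng) {
  // Flag skipped indices first, so the skip list may be unsorted.
  std::vector<bool> is_skipped(range_length, false);
  for (std::size_t s : skip) {
    assert(s < range_length);
    is_skipped[s] = true;
  }

  // Pool of all values that may be drawn.
  std::vector<std::size_t> pool;
  pool.reserve(range_length);
  for (std::size_t i = 0; i < range_length; ++i) {
    if (!is_skipped[i]) {
      pool.push_back(i);
    }
  }
  assert(num_samples <= pool.size());

  // Only the first num_samples positions are shuffled: position i receives
  // a uniformly chosen element from positions i .. pool.size() - 1.
  for (std::size_t i = 0; i < num_samples; ++i) {
    std::uniform_int_distribution<std::size_t> pick(i, pool.size() - 1);
    std::swap(pool[i], pool[pick(rng)]);
  }
  pool.resize(num_samples);
  return pool;
}
```

Building the pool from a flag vector, instead of erasing skipped positions one by one from an index vector, avoids any dependence on the order in which the skipped indices are processed.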

sligocki · 2024-08-15T15:49:06Z

Oh my, I see Tree::createPossibleSplitVarSubset() is being called 346k times in this benchmark, so we are creating these vectors over and over again. I wonder if we can create them once and reuse them on each call?

mnwright · 2024-08-16T06:55:06Z

I don't think we can re-use them. This is the sampling of the mtry variables to be considered for splitting at each node, so one of the major random components of the RF algorithm.
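The two comments above touch on different things: the random draw has to be repeated at every node, but the storage holding the candidate indices does not necessarily have to be rebuilt. As a purely illustrative sketch, a hypothetical ReusableSampler class (not part of ranger) can keep one index buffer and run a fresh partial Fisher-Yates shuffle on it per call. It assumes that the set of drawable indices is the same on every call and, for brevity, leaves out skipped indices entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Hypothetical helper (not part of ranger): owns one index buffer and
// produces a fresh random draw from it on every call.
class ReusableSampler {
public:
  explicit ReusableSampler(std::size_t range_length) : pool_(range_length) {
    std::iota(pool_.begin(), pool_.end(), std::size_t{0});
  }

  // Returns num_samples distinct indices. The pool stays partially shuffled
  // afterwards; it is still a permutation of the same values, so the next
  // partial shuffle is again a uniform draw.
  std::vector<std::size_t> draw(std::size_t num_samples,
                                std::mt19937_64& rng) {
    assert(num_samples <= pool_.size());
    for (std::size_t i = 0; i < num_samples; ++i) {
      std::uniform_int_distribution<std::size_t> pick(i, pool_.size() - 1);
      std::swap(pool_[i], pool_[pick(rng)]);
    }
    return std::vector<std::size_t>(
        pool_.begin(),
        pool_.begin() + static_cast<std::ptrdiff_t>(num_samples));
  }

private:
  std::vector<std::size_t> pool_;
};
```

Each call then costs time proportional to num_samples rather than to the range length. In a multi-threaded setting every thread would need its own instance, since draw() mutates the buffer.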

mnwright · 2024-08-21T12:11:14Z

I merged #738, so we don't need this PR anymore. Thanks for your help.

sligocki added 3 commits July 29, 2024 15:35

Add failing tests for drawWithoutReplacementSkip() with multiple skip values.

764d980

Fix drawWithoutReplacementSkip() by just replacing it with std::sample().

1157e1d

Update argument names to match .h (and be more logically correct since max is really range_length - 1).

b2c0511

mnwright added 4 commits August 13, 2024 11:12

Merge branch 'fix_poisson_test' into sample_skip_bug

513fa60

Merge branch 'master' into sample_skip_bug

b252e33

Merge branch 'master' into sample_skip_bug

f3b1bec

new version after bug fix

da8bcaf

mnwright closed this Aug 21, 2024

mnwright mentioned this pull request Aug 22, 2024

Tests for drawWithoutReplacementSkip() #739

Merged
