Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug in drawWithoutReplacementSkip() #734

Closed
wants to merge 7 commits into from

Conversation

sligocki
Copy link
Contributor

drawWithoutReplacementSkip() appears to have a bug in both the drawWithoutReplacementSimple() and drawWithoutReplacementFisherYates() branches. Specifically, these bugs are for cases where there are multiple indexes to be skipped. These bugs are demonstrated by new tests added drawWithoutReplacementSkip.small_small4 (FisherYates) and drawWithoutReplacementSkip.small_large2 (Simple).

I fix these both by switching to using std::sample() which seems like a safer bet to avoid these sorts of subtle bugs. This change requires updating to (at least) C++17.

This might be the source of #733

@mnwright
Copy link
Member

mnwright commented Aug 5, 2024

Thanks, that looks like an easy fix (and simplification). Did you check any runtimes? Back in 2014, the Fisher-Yates implementation gave a big speedup for some settings (which I cannot remember exactly).

@sligocki
Copy link
Contributor Author

sligocki commented Aug 6, 2024

Nope, I haven't tested any runtimes. I expect that std::sample is implementing Fisher-Yates (or something similar). Hopefully the std library folks have optimized this so that we don't have to :) I suppose that it may be slightly less efficient if you are running with a huge number of options and only a tiny sample (since it has to allocate the list of all indexes at start). I wouldn't expect that to be the bottleneck in most contexts, but makes sense that it would be good to test.

@mnwright
Copy link
Member

Unfortunately, this is slower. Particularly for high-dimensional data and it's quite extreme for the GWAS settings (p>>n, few unique values). Here is an example (in R but results will be similar without R):

#remotes::install_github("imbs-hl/ranger")
#remotes::install_github("sligocki/ranger@sample_skip_bug")

library(ranger)
library(microbenchmark)

# High dimensional
n <- 1000
p <- 10000
x <- matrix(rbinom(n * p, size = 1, prob = .5), nrow = n, 
            dimnames = list(NULL, paste0("X", 1:p)))
y <- rnorm(n)

microbenchmark(
  default = ranger(x = x, y = y, num.threads = 1),
  times = 1)

# Extremely high dimensional
n <- 100
p <- 100000
x <- matrix(rbinom(n * p, size = 1, prob = .5), nrow = n, 
            dimnames = list(NULL, paste0("X", 1:p)))
y <- rnorm(n)

microbenchmark(
  default = ranger(x = x, y = y, num.threads = 1),
  times = 1)

On this machine, the first one takes about 10 seconds with the master and about 24 seconds with this PR. The second one is more extreme: less then 2 second with the master and about 17 seconds with this PR. For a real GWAS (or similar), the differences will probably be several hours. ☹️

So it looks like I have to fix the skipping in drawWithoutReplacementFisherYates().

@sligocki
Copy link
Contributor Author

Oh my, I see Tree::createPossibleSplitVarSubset() is being called 346k times in this benchmark so we are creating these vectors over and over again. I wonder if we can create them once and reuse them on each call?

@mnwright
Copy link
Member

I don't think we can re-use them. This is the sampling of the mtry variables to be considered for splitting at each node, so one of the major random components of the RF algorithm.

@mnwright
Copy link
Member

I merged #738, so don't need this PR anymore. Thanks for your help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants