NCFS.fit() causes my computer to crash #30

Open
paulcbogdan opened this issue Nov 11, 2022 · 3 comments

paulcbogdan commented Nov 11, 2022

Thank you very much for making the package. It's a great help. Sadly, it sometimes crashes my computer completely, particularly if I'm running another script at the same time. During a crash, the screen goes black, and I then have to restart the machine using the PSU's on/off switch. My guess is that this is related to parallel processing?

Even when the package doesn't crash, running fit causes this warning to appear:

C:\Users\paulc\Anaconda3\lib\site-packages\ncfs\NCFS.py:125: UserWarning: Data matrix contains values outside of the [0, 1] interval. May be numerical unstable and lead to pseudocount additions during fitting.
  warnings.warn(
C:\Users\paulc\Anaconda3\lib\site-packages\ncfs\accelerated.py:199: NumbaPerformanceWarning: 
The keyword argument 'parallel=True' was specified but no transformation for parallel execution was possible.

To find out why, try turning on parallel diagnostics, see https://numba.pydata.org/numba-doc/latest/user/parallel.html#diagnostics for help.

File "C:\Users\paulc\Anaconda3\lib\site-packages\ncfs\accelerated.py", line 65:
@numba.njit(parallel=True, fastmath=True)
def feature_gradient(

I have no clue how numba works. Do you have any suggestions? I'm fine with the code running slower, so if there is any way for me to turn off the parallelism that would be appreciated.

Thanks again
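
On the first warning: one way to bring the data into the [0, 1] interval is per-feature min-max scaling before calling fit. A minimal sketch using NumPy; `min_max_scale` is a hypothetical helper, not part of the package, and whether per-feature scaling suits a given dataset is an assumption to check.

```python
import numpy as np


def min_max_scale(X):
    """Rescale each column (feature) of X to the [0, 1] interval.

    Hypothetical helper for illustration; not part of ncfs or numba.
    """
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Constant features have zero range; divide by 1 so they map to 0
    # instead of producing NaN.
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range


# Toy data: 3 examples, 3 features (the last feature is constant).
X_scaled = min_max_scale([[1.0, 10.0, 5.0],
                          [3.0, 20.0, 5.0],
                          [2.0, 30.0, 5.0]])
```

The scaled matrix can then be passed to fit in place of the raw data.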

@dakota-hawkins
Collaborator

Hi there!

Sorry for the problem, and thanks for raising the issue. How big are the datasets you're working with? The method does produce good feature selection, but can be computationally expensive.

That diagnostic warning is Numba reporting that it found nothing in the function it could transform for parallel execution, so the gains from `parallel=True` may be limited. When I benchmarked it empirically there were still significant speedups, so I left the decorator in.

One thing you can try is setting the NUMBA_NUM_THREADS environment variable to a lower thread count so your computer doesn't lock up. Theoretically, if you set it below however many threads your other processes are using, you should be okay.

Info on limiting threads: https://numba.pydata.org/numba-doc/latest/user/threading-layer.html#setting-the-number-of-threads
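
A minimal sketch of that suggestion, assuming the variable is set from inside the Python script rather than in the shell. Numba reads its configuration when it is first imported, so the variable has to be in place before anything imports numba (including this package). `limit_numba_threads` is a hypothetical helper name, not part of numba or ncfs.

```python
import os
import sys


def limit_numba_threads(n_threads=1):
    """Set NUMBA_NUM_THREADS for this process.

    Returns True if numba had not been imported yet, i.e. the setting
    was made early enough to be picked up. Hypothetical helper.
    """
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    numba_already_loaded = "numba" in sys.modules
    os.environ["NUMBA_NUM_THREADS"] = str(n_threads)
    return not numba_already_loaded


set_in_time = limit_numba_threads(1)
if not set_in_time:
    print("numba was already imported; restart Python and set the variable first")

# Only after this point import the package that pulls in numba.
```

Once numba is loaded, `numba.set_num_threads(n)` can lower the count further at runtime, but it cannot go above the value of `NUMBA_NUM_THREADS`.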

paulcbogdan commented Nov 13, 2022

Thanks for the quick and detailed reply. The dataset has 200 examples with 34716 features each. This isn't too large (?), although I think running the NCFS was most likely to crash when I was running other scripts simultaneously, some of which use larger datasets.

I will try setting the numba threads to 1 and running it later this week, once these other scripts finish.

@paulcbogdan paulcbogdan reopened this Nov 13, 2022
@paulcbogdan
Author

I am now working with a dataset of 200-1000 examples, each with roughly 500,000 features. NCFS works fine with 200 examples. It still causes a crash at 1000 examples, even with NUMBA_NUM_THREADS set to 1. For 400 examples, NCFS does not finish after 24 hours.

I am not familiar with the implementation details behind NCFS, but do you think the 1000-example case, even if I get it to stop crashing, is simply not computationally feasible on a consumer-level PC (64 GB of RAM and a decent processor)? If so, we can accept this, but before we conclude that it isn't computationally feasible, we would like to know whether we are simply doing something wrong.
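
For reference, a rough back-of-envelope sketch of how the problem grows with the number of examples. It assumes each gradient step visits every pair of examples across every feature, so the work per step scales as n_samples² × n_features. That is an assumption about the general shape of the algorithm, not something checked against this package's code, and both function names are hypothetical.

```python
def data_matrix_gib(n_samples, n_features, bytes_per_value=8):
    """Size of a dense data matrix in GiB (float64 by default)."""
    return n_samples * n_features * bytes_per_value / 2**30


def pairwise_feature_terms(n_samples, n_features):
    """Number of (pair of examples, feature) terms per step.

    ASSUMPTION: per-step work is proportional to n_samples**2 * n_features.
    """
    return n_samples**2 * n_features


N_FEATURES = 500_000
baseline = pairwise_feature_terms(200, N_FEATURES)

for n in (200, 400, 1000):
    size = data_matrix_gib(n, N_FEATURES)
    relative = pairwise_feature_terms(n, N_FEATURES) / baseline
    print(f"{n:>5} examples: data matrix ~{size:.2f} GiB, "
          f"per-step work x{relative:.0f} relative to 200 examples")
```

Under that assumption, 1000 examples costs about 25 times as much per step as 200 examples, and 400 examples about 4 times as much. The float64 data matrix itself stays under 4 GiB even at 1000 examples, so raw storage alone would not exhaust 64 GB of RAM.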

Thank you again for making this package.
