Two small issues #4

Open
dfarmer opened this issue Aug 5, 2023 · 11 comments

dfarmer commented Aug 5, 2023

Hello! I pulled the latest versions of NerfGUI.jl and Nerf.jl to my local machine, and in NerfGUI I entered package mode and "dev"ed Nerf.jl. It crashed for me with errors about an undefined DEVICE. I prepared a PR to fix it, but as I was about to push it I noticed that the issue is already fixed on the nerf-update branch. So I thought I would ask: is it time to merge that branch into main, or is there a potential issue?

My second question (which may be related) is that training performance after the fix is extremely slow, on the order of a minute or two per iteration. I'm using CUDA with an RTX 4090, so I was expecting a few frames per second, but looking around I couldn't find a reference iteration time, and even the JuliaCon '23 talk didn't mention training speed. Is something wrong on my end, or is < 1 fps during training expected?

pxl-th commented Aug 5, 2023

Hi! The nerf-update branch was already merged in #2.
By "fixes" do you mean the changes to LocalPreferences.toml (backend=CUDA)?
If so, that is a value users need to specify manually.
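For reference, here is a rough sketch of setting that preference programmatically via Preferences.jl (the "backend" key name is assumed from the backend=CUDA value above; restart Julia afterwards so it takes effect):

using Preferences, Nerf
# Run from the environment that has Nerf.jl among its dependencies.
# Assumed key name "backend", matching the backend=CUDA value mentioned above.
set_preferences!(Nerf, "backend" => "CUDA"; force=true)
# This writes roughly the following into LocalPreferences.toml:
# [Nerf]
# backend = "CUDA"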

Regarding the performance, you should expect it to be slower than the corresponding instant-ngp C++ version, but more work on the performance side is coming.
Also, at the beginning of training, steps are slower, mostly because of the occupancy acceleration structure.
On the default dataset, on an RTX 3060, I get ~25 seconds for 1k training steps (1024 batch size).

Additionally, for better performance you can try disabling rendering during training and, vice versa, disabling training while rendering.

pxl-th commented Aug 5, 2023

Also, I don't think you need to dev Nerf.jl for NerfGUI.jl, unless you want to modify it.

And lastly, we have a basic benchmark in Nerf.jl which you can run with:

using Nerf
Nerf.benchmark()

and report the numbers here.

dfarmer commented Aug 5, 2023

No, I was referring to https://github.com/JuliaNeuralGraphics/NerfGUI.jl/blob/main/src/NerfGUI.jl#L18 still being Nerf.DEVICE on the tip of NerfGUI.jl, when it needs to be Nerf.Backend to match the latest Nerf.jl. I spent some time figuring that out and making the changes locally, and then saw they were already in your branch. I wonder what happened with the merge you linked? I see it says it was merged, but the Device --> Backend rename isn't reflected. 🤔
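For illustration, the kind of one-line change I made locally looks roughly like this (a hypothetical call site, not the exact NerfGUI.jl source):

using Nerf
config_file = joinpath(pkgdir(Nerf), "data", "raccoon_sofa2", "transforms.json")
# Old code passed Nerf.DEVICE, which the latest Nerf.jl no longer defines:
# dataset = Nerf.Dataset(Nerf.DEVICE; config_file)
# With the rename it becomes:
dataset = Nerf.Dataset(Nerf.Backend; config_file)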

For the benchmark I see (Windows 11, Nvidia RTX 4090):

julia> using Nerf
[ Info: Precompiling Nerf [2c86e8b6-813a-40f3-97f9-c72f78886291]
[ Info: [Nerf.jl] Backend: CUDA
[ Info: [Nerf.jl] Device: CUDA.CUDAKernels.CUDABackend(false, false)

julia> Nerf.benchmark()
Trainer benchmark
1
2
3
4
5
6
7
8
9
10
 78.700332 seconds (31.23 M allocations: 1.953 GiB, 1.30% gc time, 24.32% compilation time: 2% of which was recompilation)

I also just wanted to say I'm sorry if my first post came off as critical. I think this project is extremely cool; I was just trying to figure out what my expectations should be (and based on your comment, 25 seconds for 1k steps versus my benchmark showing 78 seconds for 10, there does seem to be some kind of large discrepancy).

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e90 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × AMD Ryzen 7 3700X 8-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
  Threads: 16 on 16 virtual cores

pxl-th commented Aug 6, 2023

Nerf.benchmark() does the following:

    @time trainer_benchmark(trainer, 10)
    @time trainer_benchmark(trainer, 1000)

So it runs the benchmark twice: first 10 iterations to make sure everything is compiled.
The 78 seconds you are seeing therefore include kernel compilation; look at the timing that comes after it.
The same applies to the renderer benchmark.

As for DEVICE, indeed there was an issue, thanks for pointing it out!

dfarmer commented Aug 6, 2023

For the 1,000 iterations it took:

1388.605311 seconds (2.96 M allocations: 135.260 MiB, 0.01% gc time, 0.00% compilation time)

pxl-th commented Aug 7, 2023

Wow, that is extremely slow!
Can you share a profiling result using ProfileCanvas.jl?

julia> using Nerf, ProfileCanvas

julia> config_file = joinpath(pkgdir(Nerf), "data", "raccoon_sofa2", "transforms.json");

julia> dataset = Nerf.Dataset(Nerf.Backend; config_file);

julia> model = Nerf.BasicModel(Nerf.BasicField(Nerf.Backend));

julia> trainer = Nerf.Trainer(model, dataset; n_rays=1024);

julia> @profview Nerf.trainer_benchmark(trainer, 10); # Ignore it, since it includes compilation time

julia> @profview Nerf.trainer_benchmark(trainer, 10); # Report this one.

dfarmer commented Aug 7, 2023

nerfjl_profile.zip

(Sorry for the zip; it turns out GitHub doesn't let you attach HTML files.) One thing I noticed when running the 1k iterations yesterday, and again in today's profile, is that there are "bursts" where it does 5-15 iterations in under a second and then the next few iterations take multiple seconds each. I'm not sure if it's the garbage collector or device transfers, but it's definitely not the case that all iterations are equally slow; it's very irregular, sometimes quite fast and other times extremely slow.
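If it helps, I can also collect per-iteration timings to show the bursts explicitly; a quick sketch on top of your setup above (nothing built into Nerf.jl, just Base's @elapsed):

using Nerf
config_file = joinpath(pkgdir(Nerf), "data", "raccoon_sofa2", "transforms.json")
dataset = Nerf.Dataset(Nerf.Backend; config_file)
model = Nerf.BasicModel(Nerf.BasicField(Nerf.Backend))
trainer = Nerf.Trainer(model, dataset; n_rays=1024)

Nerf.trainer_benchmark(trainer, 10)  # warm-up, includes compilation

# Time many single-step runs to expose the irregular slow iterations.
times = [@elapsed Nerf.trainer_benchmark(trainer, 1) for _ in 1:100]
println("fastest: ", minimum(times), " s, slowest: ", maximum(times), " s")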

pxl-th commented Aug 10, 2023

The profiling results aren't very useful, unfortunately...
My guess is this is either due to GC or some Windows-specific issue; I have actually never run it on Windows.

But since you are seeing bursts of fast iterations, it may be the GC, blocking everything while it tries to free enough GPU memory.
Although I ran it on a 6 GB Nvidia GPU without these issues...

dfarmer commented Aug 12, 2023

Ok, well I guess we can close this anyway. Maybe one last question: you mentioned you've run it on a 6 GB GPU. One other odd thing I noticed is that when I run Nerf.jl it instantly allocates all 24 GB the 4090 has, and I had wondered about that too. Is that intentional, or is it something I could look into as well (probably at the CUDA.jl level)? Thanks for the help, and sorry for the chatty "issue."

pxl-th commented Aug 14, 2023

when I run Nerf.jl it always instantly allocates all 24GB

It allocates that memory, but it does not use all of it. This is due to the GC not freeing unused arrays immediately, which means that, from CUDA.jl's perspective, those arrays (and the underlying memory) are still in use.

And when the memory pool grows to its maximum size (24 GB in your case), it forcibly triggers the GC, freeing memory and potentially releasing it back to the OS, which is expensive.
I guess this is why you are seeing such a dramatic slow-down.
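If you want to sanity-check that, here is a rough diagnostic sketch using plain CUDA.jl calls (nothing Nerf-specific):

using CUDA

CUDA.memory_status()  # shows how much of the pool is used vs. reserved

# Manually run the GC and ask CUDA.jl to return unused pool memory to the driver.
# If doing this between training runs makes the multi-second stalls disappear,
# the pool-growth / GC interaction described above is the likely culprit.
GC.gc(false)
CUDA.reclaim()
CUDA.memory_status()

Depending on your CUDA.jl version, you may also be able to cap the pool size with the JULIA_CUDA_HARD_MEMORY_LIMIT environment variable, so collections happen earlier instead of only after the pool hits 24 GB.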

Ok, well I guess we can close this anyway

I propose we leave it open, since this huge drop in performance is worrying.

pxl-th commented Aug 14, 2023

If you can, try running another Nerf.jl benchmark; it tests the performance of a single kernel without allocations.

  1. Update to the latest Nerf.jl.
  2. Go to the Nerf.jl directory.
  3. Update the project's packages with:
    • julia --threads=auto --project=.
    • ]up
  4. Run: julia --threads=auto --project=. benchmark/main.jl

Currently it tests the performance of the HashGridEncoding kernel and spherical harmonics.
Thanks!
