Optimization and analysis
Harlan is a moderately high-level language for GPU computing. As such, it relies on some fairly aggressive compiler optimizations to get good performance. This page catalogues the optimizations we think we will need, why we think we need them, and a rough idea of how to implement them.
This is standard loop-invariant code motion: it moves computations that don't change between iterations out of the enclosing loop. This will help Harlan out in examples like this one:
(kernel ((i (iota 10)))
(kernel ((j (iota 10)))
...))
The remove-nested-kernels pass turns this into something like the following (with some extra temporaries):
(for (i 0 10)
(kernel ((j (iota 10)))
...))
Lifting the complex expressions gives us something like...
(for (i 0 10)
(let t_1 (iota 10))
(kernel ((j t_1))
...))
At this point, we recompute t_1 on every iteration of the loop, and we have to copy it to the GPU for each kernel invocation.
This optimization pass would transform this into something like...
(let t_1 (iota 10))
(for (i 0 10)
(kernel ((j t_1))
...))
Now we're only computing t_1 once, and the later GPU scheduling code will hopefully be smart enough to insert only one transfer instead of many.
This pass should go right after remove-nested-kernels, because then we have all the loops, but iota hasn't been expanded yet.
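To make the transformation concrete, here is a rough sketch of how such a hoisting step might look over the simplified s-expression forms used in the examples above. This is not the actual Harlan pass; the (for ...) and (let ...) shapes, and helper names like hoist-invariants and split-invariant-lets, are assumptions made for illustration only.

;; Sketch only: hoist leading (let x e) bindings out of an enclosing
;; (for (i lo hi) ...) loop when e does not mention the loop variable i.
;; The IR shapes here match the simplified examples on this page, not
;; Harlan's real intermediate representation.

;; Does expression e refer to the variable x anywhere?
(define (mentions? x e)
  (cond
    ((symbol? e) (eq? x e))
    ((pair? e) (or (mentions? x (car e))
                   (mentions? x (cdr e))))
    (else #f)))

;; Split a loop body into (hoistable-lets . remaining-statements),
;; taking leading (let x e) forms whose right-hand side is invariant.
(define (split-invariant-lets i body)
  (if (and (pair? body)
           (pair? (car body))
           (eq? (caar body) 'let)
           (not (mentions? i (caddr (car body)))))
      (let ((rest (split-invariant-lets i (cdr body))))
        (cons (cons (car body) (car rest)) (cdr rest)))
      (cons '() body)))

;; Rewrite a single (for (i lo hi) stmt ...) form, pulling the invariant
;; lets out in front of the loop. Returns a list of statements.
(define (hoist-invariants stmt)
  (if (and (pair? stmt) (eq? (car stmt) 'for))
      (let* ((header (cadr stmt))
             (i (car header))
             (split (split-invariant-lets i (cddr stmt))))
        (append (car split)
                (list (cons 'for (cons header (cdr split))))))
      (list stmt)))

Running this on the loop from the example above,

(hoist-invariants
 '(for (i 0 10)
    (let t_1 (iota 10))
    (kernel ((j t_1))
      ...)))

returns the hoisted form, ((let t_1 (iota 10)) (for (i 0 10) (kernel ((j t_1)) ...))), which is the shape we want.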
Another analysis we will need is a variant of liveness analysis that tracks separate CPU-live and GPU-live properties for each variable assignment. We can use this information to place data transfers intelligently.
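As a rough illustration of the idea (not Harlan's actual analysis), the sketch below runs a backward liveness pass over straight-line code in which each statement has been summarized as a (device defs uses) triple, an invented form for this example. A definition kills liveness of a variable on both devices, while a use makes it live on the device that reads it; a point where a variable is GPU-live but was last written on the CPU is a natural place for a transfer.

;; Sketch only: dual-device liveness over a straight-line statement
;; list. Each statement is summarized as (device (defs ...) (uses ...)),
;; e.g. (cpu (t_1) ()) for a CPU-side definition of t_1. This summary
;; form is invented for illustration; it is not Harlan's IR.

(define (set-union a b)
  (cond ((null? a) b)
        ((memq (car a) b) (set-union (cdr a) b))
        (else (cons (car a) (set-union (cdr a) b)))))

(define (set-minus a b)
  (cond ((null? a) '())
        ((memq (car a) b) (set-minus (cdr a) b))
        (else (cons (car a) (set-minus (cdr a) b)))))

;; Walk the statements backwards, threading the sets of variables that
;; are live on the CPU and on the GPU. Returns, in program order, a
;; list of (stmt cpu-live-after gpu-live-after) entries.
(define (device-liveness stmts)
  (let loop ((rev (reverse stmts)) (cpu '()) (gpu '()) (acc '()))
    (if (null? rev)
        acc
        (let* ((stmt (car rev))
               (device (car stmt))
               (defs (cadr stmt))
               (uses (caddr stmt))
               ;; Record liveness *after* this statement, then update.
               (entry (list stmt cpu gpu))
               ;; A definition kills liveness on both devices.
               (cpu-k (set-minus cpu defs))
               (gpu-k (set-minus gpu defs))
               ;; A use makes the variable live on the reading device.
               (cpu* (if (eq? device 'cpu) (set-union uses cpu-k) cpu-k))
               (gpu* (if (eq? device 'gpu) (set-union uses gpu-k) gpu-k)))
          (loop (cdr rev) cpu* gpu* (cons entry acc))))))

For instance, (device-liveness '((cpu (t_1) ()) (gpu () (t_1)))) reports that t_1 is GPU-live immediately after the CPU-side definition, which is exactly the point where the scheduling code would want to place the single transfer.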