- Di Lu
- Tested on: Windows 11, i7-12700H @ 2.30GHz 32GB, NVIDIA GeForce RTX 3050 Ti
In this project, I simulate flocking behavior for a 200 x 200 x 200 cube of scattered boids, using CUDA kernels to recompute each boid's position and velocity on every timestep dT. Based on Craig Reynolds' artificial life program, described in a 1987 SIGGRAPH paper, the following three behaviors are implemented:
- cohesion - boids move towards the perceived center of mass of their neighbors
- separation - boids avoid getting too close to their neighbors
- alignment - boids generally try to move in the same direction and at the same speed as their neighbors
In the simulation results, the color of each boid represents its velocity.
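As a sketch of how these three rules might be combined per boid (the distances and scale factors below are illustrative tunables, not necessarily this repo's values):

```cuda
#include <glm/glm.hpp>

// Illustrative rule parameters (tunable; not necessarily this repo's values).
#define rule1Distance 5.0f   // cohesion radius
#define rule2Distance 3.0f   // separation radius
#define rule3Distance 5.0f   // alignment radius
#define rule1Scale 0.01f
#define rule2Scale 0.1f
#define rule3Scale 0.1f

// Accumulate the velocity change for boid `self` from all three rules.
__device__ glm::vec3 computeVelocityChange(int N, int self,
                                           const glm::vec3 *pos,
                                           const glm::vec3 *vel) {
  glm::vec3 perceivedCenter(0.0f), separation(0.0f), perceivedVelocity(0.0f);
  int cohesionCount = 0, alignmentCount = 0;
  for (int i = 0; i < N; i++) {
    if (i == self) continue;
    float d = glm::distance(pos[i], pos[self]);
    if (d < rule1Distance) { perceivedCenter += pos[i]; cohesionCount++; }   // cohesion
    if (d < rule2Distance) { separation -= (pos[i] - pos[self]); }           // separation
    if (d < rule3Distance) { perceivedVelocity += vel[i]; alignmentCount++; } // alignment
  }
  glm::vec3 dv(0.0f);
  if (cohesionCount > 0)
    dv += (perceivedCenter / (float)cohesionCount - pos[self]) * rule1Scale;
  dv += separation * rule2Scale;
  if (alignmentCount > 0)
    dv += (perceivedVelocity / (float)alignmentCount) * rule3Scale;
  return dv;
}
```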
Coherent Grid Flocking with 50,000 boids
To measure the performance of my code, I ran the program in Release mode with VSync disabled. There are three implementations: the first uses a naive neighbor search, and each subsequent one adds further optimizations.
The first simulation uses a naive neighbor search: each boid checks every other boid in the simulation and tests whether it is within the cohesion, separation, or alignment distance. Any non-self boid within one of those distances has its position and velocity taken into account for the respective rule (see the kernel sketch below).
Naive Flocking with 5,000 boids
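A minimal sketch of this kernel, one thread per boid, reusing the `computeVelocityChange` sketch above (`maxSpeed` and the ping-pong buffers are assumed details):

```cuda
#define maxSpeed 1.0f  // assumed speed cap

// Naive update: each thread scans all N boids looking for its own neighbors.
__global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
                                             glm::vec3 *vel1, glm::vec3 *vel2) {
  int index = threadIdx.x + (blockIdx.x * blockDim.x);
  if (index >= N) return;
  glm::vec3 newVel = vel1[index] + computeVelocityChange(N, index, pos, vel1);
  if (glm::length(newVel) > maxSpeed)
    newVel = glm::normalize(newVel) * maxSpeed;  // clamp to the speed cap
  vel2[index] = newVel;  // write to the second buffer so vel1 stays stable this pass
}
```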
The second simulation restricts the neighbor search using a uniform grid, sized by the largest neighborhood distance among the 3 rules. The simulation space is divided into grid cells, and each boid only needs to check the cells that overlap its spherical neighborhood.

Each boid calculates the extremities of its reach from its own position and that maximum radius. From these extremities, the minimum and maximum cells to scan can be computed (sketched below), so the number of useless boid checks is greatly reduced, resulting in a much faster simulation!
Uniform Grid Flocking with 5,000 boids
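A sketch of turning a boid's "extremities" into a cell range (names like `gridMin` and `inverseCellWidth` are assumptions about the bookkeeping, not necessarily this repo's):

```cuda
// Convert the boid's neighborhood AABB into a [minCell, maxCell] range of
// grid cells to scan. maxRuleDistance is the largest of the three rule radii.
__device__ void gridCellRange(glm::vec3 p, float maxRuleDistance,
                              glm::vec3 gridMin, float inverseCellWidth,
                              int gridResolution,
                              glm::ivec3 &minCell, glm::ivec3 &maxCell) {
  // Extremities of this boid's reach, converted to cell coordinates.
  glm::vec3 lo = (p - maxRuleDistance - gridMin) * inverseCellWidth;
  glm::vec3 hi = (p + maxRuleDistance - gridMin) * inverseCellWidth;
  minCell = glm::clamp(glm::ivec3(lo), glm::ivec3(0), glm::ivec3(gridResolution - 1));
  maxCell = glm::clamp(glm::ivec3(hi), glm::ivec3(0), glm::ivec3(gridResolution - 1));
}
```

Each boid then iterates only over the cells in [minCell, maxCell], using the per-cell start/end tables built from the sorted boid indices.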
The third simulation builds on the second. This time, we also rearrange the position and velocity buffers so that boids in the same cell are contiguous in memory, which lets the neighbor scan read memory sequentially; the extra gather pass is sketched below.
Coherent Grid Flocking with 5,000 boids
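A sketch of that gather pass (array names are assumptions): after sorting boid indices by cell, a kernel copies each boid's data into cell order.

```cuda
// Gather positions/velocities into cell-sorted order so that boids sharing a
// cell are adjacent in memory (particleArrayIndices is sorted by grid cell).
__global__ void kernReshuffleCoherent(int N, const int *particleArrayIndices,
                                      const glm::vec3 *pos, const glm::vec3 *vel,
                                      glm::vec3 *coherentPos, glm::vec3 *coherentVel) {
  int index = threadIdx.x + (blockIdx.x * blockDim.x);
  if (index >= N) return;
  int src = particleArrayIndices[index];  // where this boid's data currently lives
  coherentPos[index] = pos[src];
  coherentVel[index] = vel[src];
}
```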
- Number of Boids vs. Performance
- BlockSize vs. Performance (N = 100,000)
For each implementation, how does changing the number of boids affect performance? Why do you think this is?
Across all three implementations, increasing the number of boids decreases FPS. If the scene scale stays the same, each cell becomes denser as the boid count grows, so each boid must run more sequential operations to check the increased number of candidate neighbors. The naive implementation suffers most, since its cost grows quadratically: every boid checks every other boid.
For each implementation, how does changing the block count and block size affect performance? Why do you think this is?
As seen from the graph, smaller block sizes result in poorer performance until a certain blockSize is reached. For example, there is a big difference between blockSize == 4 and blockSize == 16, yet everything above blockSize == 32 performs about the same. This behavior is not easily seen in simulations with fewer boids (N = 5,000). This is likely because a small blockSize limits how many threads can run in parallel; in particular, 32 is the warp size, so blocks smaller than a warp leave SIMD lanes idle. Once the thread count covers the available parallel work, increasing blockSize further yields little additional performance.
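For reference, a sketch of how block count follows from blockSize in a 1D launch over N boids (the kernel and buffer names here are assumed, not confirmed from this repo):

```cuda
const int N = 100000;
const int blockSize = 128;  // threads per block; a warp is 32 threads
// ceil(N / blockSize) blocks: shrinking blockSize raises the block count.
dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize);
kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(
    N, dev_pos, dev_vel1, dev_vel2);
```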
For the coherent uniform grid: did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not?
I experienced a significant performance improvement with the coherent uniform grid when the boid count is very high. For example, with 500,000 boids the coherent grid comfortably gives ~130 FPS, while the uniform grid just about dies at ~6 FPS. For fewer boids the improvement is less obvious (for example, FPS is fairly similar between the coherent and uniform grids at 5,000 boids). This outcome was expected: with more boids, each cell contains many more boids, and if their data is not contiguous in memory, the cost of fetching information that is scattered far apart becomes much more noticeable.
Did changing cell width and checking 27 vs 8 neighboring cells affect performance? Why or why not? Be careful: it is insufficient (and possibly incorrect) to say that 27-cell is slower simply because there are more cells to check!
Given scene scale 100 and 100,000 boids, decreasing the cell width (and thus checking 27 cells instead of 8) increased performance. The reason is not the number of cells but the total volume scanned: 27 cells of width r cover a smaller region than 8 cells of width 2r, so at the same boid density each boid examines fewer candidate neighbors (see the comparison below). However, if the scene scale were larger with the same number of boids, the boids become sparser and the overhead of iterating over more, mostly "useless" cells could instead decrease efficiency.
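Assuming the default cell width is twice the maximum rule distance r, the scanned volumes compare as:

```latex
V_{8}  = (2 \cdot 2r)^{3} = 64\,r^{3} \qquad \text{(8 cells of width } 2r\text{)}
V_{27} = (3 \cdot r)^{3}  = 27\,r^{3} \qquad \text{(27 cells of width } r\text{)}
```

At equal density, the 27-cell scan therefore examines roughly 27/64, or about 42%, as many candidate boids.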
To test my thinking, I used the following two scenarios:
Constants:
- Coherent grid
- Number of boids: 100,000
Scenario 1 (Scene Scale: 100): I observed around a 100 FPS increase when I used cell width == neighborhood size.

Scenario 2 (Scene Scale: 200): I observed nearly a 200 FPS decrease when I used cell width == neighborhood size.
Naive Flocking with 50,000 boids
Coherent Flocking with 100,000 boids
Coherent Flocking with 500,000 boids (could not get a gif of this onto GitHub)
100,000 boids on 200 scene scale instead of 100