Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Project 2: Ryan Tong #25

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
82 changes: 77 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,12 +3,84 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)

### (TODO: Your README)
* Ryan Tong
* [LinkedIn](https://www.linkedin.com/in/ryanctong/), [personal website](), [twitter](), etc.
* Tested on: Windows 10, i7-8750H @ 2.20GHz 16GB, GeForce GTX 1060 6144MB (Personal Laptop)

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)

### Features
This project implements the required features of CPU based scan and stream compaction; GPU based naive scan, work-efficient scan, stream compaction (using work-efficient scan); and Thrust based scan. I roughly optimized the block size to be 256 after doing some testing and reading this article: https://oneflow2020.medium.com/how-to-choose-the-grid-size-and-block-size-for-a-cuda-kernel-d1ff1f0a7f92.

### Performance Analysis
Here are the graphs comapring the runtimes of scan implemented on the CPU, GPU, and with Thrust. Note that Thrust is removed and the range of array sizes is shrunk for the second graph for visualization purposes.

![Scan](img/scan.png)
![Scan (Better Visualization)](img/scan_small.png)

From the graphs, we can see that CPU is faster than the work-efficent GPU implementation until the array size reaches about ~1,000,000 elements. This is suprising because the theortical complexities of these algorithms are O(n), O(nlogn), O(n) for CPU, naive, and work efficent respectively. Since the GPU implementations are paralleized we would expect that they are faster than the CPU implementation. The cause of this is likely the lack of optimizations in my GPU code and frequent reads and writes to global memory which is slow. An implementation using shared memory would improve the memory access speeds. Further more, the indexing of scan is inefficent since there are many inactive threads that could be retired in a warp if they were consecutive.

The Thrust implementations are significantly slower than both GPU and CPU implementation which is likely due to some implementation error that I was unable to solve.

We can see these inefficenies reflected again in the stream compaction run times:

![Stream Compaction](img/compaction.png)
![Stream Compaction (Better Visualization)](img/compaction_small.png)

### Program Output
```

****************
** SCAN TESTS **
****************
[ 29 48 7 23 28 16 45 34 2 47 35 3 16 ... 48 0 ]
==== cpu scan, power-of-two ====
elapsed time: 2.702ms (std::chrono Measured)
[ 0 29 77 84 107 135 151 196 230 232 279 314 317 ... 12845931 12845979 ]
==== cpu scan, non-power-of-two ====
elapsed time: 2.7096ms (std::chrono Measured)
[ 0 29 77 84 107 135 151 196 230 232 279 314 317 ... 12845838 12845880 ]
passed
==== naive scan, power-of-two ====
elapsed time: 4.85891ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 4.5247ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 2.23603ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 2.10493ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 35.5277ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 27.5845ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 3 2 3 1 2 0 3 0 2 3 1 3 2 ... 2 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 6.4836ms (std::chrono Measured)
[ 3 2 3 1 2 3 2 3 1 3 2 3 1 ... 1 2 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 5.3097ms (std::chrono Measured)
[ 3 2 3 1 2 3 2 3 1 3 2 3 1 ... 2 1 ]
passed
==== cpu compact with scan ====
elapsed time: 14.7061ms (std::chrono Measured)
[ 3 2 3 1 2 3 2 3 1 3 2 3 1 ... 1 2 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 2.84058ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 2.50528ms (CUDA Measured)
passed
```
Binary file added img/compaction.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/compaction_small.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/scan.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/scan_small.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
2 changes: 1 addition & 1 deletion src/main.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 5; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
Expand Down
1 change: 1 addition & 0 deletions stream_compaction/CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
cmake_minimum_required(VERSION 3.1)
set(headers
"common.h"
"cpu.h"
Expand Down
14 changes: 12 additions & 2 deletions stream_compaction/common.cu
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,12 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n) {
return;
}

bools[index] = idata[index] == 0 ? 0 : 1;
}

/**
Expand All @@ -32,7 +37,12 @@ namespace StreamCompaction {
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n || bools[index] == 0) {
return;
}

odata[indices[index]] = idata[index];
}

}
Expand Down
3 changes: 3 additions & 0 deletions stream_compaction/common.h
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,9 @@
#include <algorithm>
#include <chrono>
#include <stdexcept>
#include "device_launch_parameters.h"

#define blockSize 4

#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)
Expand Down
42 changes: 37 additions & 5 deletions stream_compaction/cpu.cu
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,11 @@ namespace StreamCompaction {
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
int prev = 0;
for (int i = 0; i < n; ++i) {
odata[i] = prev;
prev += idata[i];
}
timer().endCpuTimer();
}

Expand All @@ -30,9 +34,15 @@ namespace StreamCompaction {
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
int out_ptr = 0;
for (int i = 0; i < n; ++i) {
if (idata[i] != 0) {
odata[out_ptr] = idata[i];
++out_ptr;
}
}
timer().endCpuTimer();
return -1;
return out_ptr;
}

/**
Expand All @@ -42,9 +52,31 @@ namespace StreamCompaction {
*/
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
int* binary = new int[n];
int* scanOut = new int[n];
for (int i = 0; i < n; ++i) {
if (idata[i] != 0) {
binary[i] = 1;
}
else {
binary[i] = 0;
}
}
//Scan code copied
int prev = 0;
for (int i = 0; i < n; ++i) {
scanOut[i] = prev;
prev += binary[i];
}
int count = 0;
for (int i = 0; i < n; ++i) {
if (idata[i] != 0) {
odata[scanOut[i]] = idata[i];
++count;
}
}
timer().endCpuTimer();
return -1;
return count;
}
}
}
Loading