Project 2: Hanlin Sun #14

Open
wants to merge 7 commits into
base: main
103 changes: 97 additions & 6 deletions README.md
Expand Up @@ -3,12 +3,103 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Hanlin Sun
* [LinkedIn](https://www.linkedin.com/in/hanlin-sun-7162941a5/)
* [personal website](https://hanlinsun.github.io/)
* Tested on: Windows 10, i7-8750H @ 3.2GHz 32GB, NVIDIA Quadro P3200

### (TODO: Your README)
# Stream Compaction
This project involves:

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
* CPU version of scan
* CPU version of compact without using scan
* CPU version of compact with scan
* GPU version of naive scan
* GPU version of work-efficient scan
* GPU version of stream compaction using the work-efficient scan

The three CPU implementations are used as references to check that the GPU implementations are correct. I collected the data across 8 executions with different array sizes.
The program generates a new array of random values on each execution, and the size of the array is customizable. I varied the array size by powers of two, starting from 2^8 and going all the way to 2^20. The program also runs each algorithm on "non-power-of-two" arrays, which are generated by truncating the "power-of-two" arrays.
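
To make the test setup concrete, here is a minimal sequential C++ sketch of how a power-of-two array and its truncated non-power-of-two version relate under an exclusive scan. The helper names `exclusiveScan` and `takeFirst` are hypothetical and are not part of the project code; the truncation by 3 elements mirrors `NPOT = SIZE - 3` in `src/main.cpp`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential exclusive prefix sum: out[0] = 0 and
// out[i] = in[0] + ... + in[i - 1] for i >= 1.
std::vector<int> exclusiveScan(const std::vector<int>& in) {
    std::vector<int> out(in.size(), 0);
    for (std::size_t i = 1; i < in.size(); ++i) {
        out[i] = out[i - 1] + in[i - 1];
    }
    return out;
}

// Hypothetical helper: keeps only the first `count` elements, the way the
// non-power-of-two test reuses the front of the power-of-two array.
std::vector<int> takeFirst(const std::vector<int>& in, std::size_t count) {
    assert(count <= in.size());
    return std::vector<int>(in.begin(),
                            in.begin() + static_cast<std::ptrdiff_t>(count));
}
```

Calling `exclusiveScan(takeFirst(a, a.size() - 3))` gives the same values as the first `a.size() - 3` entries of `exclusiveScan(a)`, which is why the two runs in the output below agree on every leading element and differ only in where they stop.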


# Output Results

```
****************
** SCAN TESTS **
****************
[ 21 28 22 23 38 18 20 9 44 26 14 10 3 ... 25 0 ]
==== cpu scan, power-of-two ====
elapsed time: 0.0015ms (std::chrono Measured)
[ 0 21 49 71 94 132 150 170 179 223 249 263 273 ... 6273 6298 ]
==== cpu scan, non-power-of-two ====
elapsed time: 0.0014ms (std::chrono Measured)
[ 0 21 49 71 94 132 150 170 179 223 249 263 273 ... 6223 6226 ]
passed
==== naive scan, power-of-two ====
elapsed time: 0.25088ms (CUDA Measured)
[ 0 21 49 71 94 132 150 170 179 223 249 263 273 276 ... 6298 ]
passed
==== naive scan, non-power-of-two ====
elapsed time: 0.23552ms (CUDA Measured)
[ 0 21 49 71 94 132 150 170 179 223 249 263 273 276 ... 0 ]
passed
==== work-efficient scan, power-of-two ====
elapsed time: 0.185344ms (CUDA Measured)
[ 0 21 49 71 94 132 150 170 179 223 249 263 273 ... 6273 6298 ]
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0.185344ms (CUDA Measured)
[ 0 21 49 71 94 132 150 170 179 223 249 263 273 ... 6223 6226 ]
passed
==== thrust scan, power-of-two ====
elapsed time: 13.1092ms (CUDA Measured)
[ 0 21 49 71 94 132 150 170 179 223 249 263 273 ... 6273 6298 ]
passed
==== thrust scan, non-power-of-two ====
elapsed time: 2.18214ms (CUDA Measured)
[ 0 21 49 71 94 132 150 170 179 223 249 263 273 ... 6223 6226 ]
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 1 2 2 3 2 2 0 3 0 0 0 0 1 ... 3 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 0.0014ms (std::chrono Measured)
[ 1 2 2 3 2 2 3 1 1 3 2 1 1 ... 3 3 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 0.0013ms (std::chrono Measured)
[ 1 2 2 3 2 2 3 1 1 3 2 1 1 ... 3 1 ]
passed
==== cpu compact with scan ====
elapsed time: 0.0038ms (std::chrono Measured)
[ 1 2 2 3 2 2 3 1 1 3 2 1 1 ... 3 3 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 0.309248ms (CUDA Measured)
[ 1 2 2 3 2 2 3 1 1 3 2 1 1 ... 3 3 ]
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 0.274432ms (CUDA Measured)
[ 1 2 2 3 2 2 3 1 1 3 2 1 1 ... 3 1 ]
passed
```

# Performance Analysis

## Scanning Algorithm

![Scan Chart](img/Analysis%201.JPG)

In this chart, lower values mean better performance.
When the array is small, the performance gap between the naive method and the work-efficient method is small, but the gap grows as the array length increases.
One reason the work-efficient method does better is warp behavior: in the naive scan, even though the number of active threads is limited, threads that have nothing to do must wait for the other active threads in their warp to finish before they become available again.
In the up-sweep and down-sweep method, each level needs at most half as many threads as there are elements, so the remaining threads can be used by the GPU for other work. We launch the same number of threads, but the work-efficient scan performs O(n) additions in total, compared with O(n log n) for the naive scan, even though it runs more levels (roughly 2 log2(n) instead of log2(n)).
That is why it is much faster on large arrays.
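
The difference in total work can be illustrated with a sequential C++ simulation of the two scans that counts additions. This is only a sketch under stated assumptions: it is not the CUDA code from this project, the names `ScanResult`, `naiveScanSim`, and `efficientScanSim` are hypothetical, and the work-efficient version here assumes a power-of-two length.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Exclusive prefix sums plus the number of additions the simulation performed.
struct ScanResult {
    std::vector<int> data;
    long long additions;
};

// Naive scan, simulated one level at a time: at each level every element
// with index >= offset adds in the element `offset` positions to its left.
ScanResult naiveScanSim(const std::vector<int>& in) {
    const std::size_t n = in.size();
    std::vector<int> cur(in), next(n);
    long long adds = 0;
    for (std::size_t offset = 1; offset < n; offset *= 2) {
        for (std::size_t i = 0; i < n; ++i) {
            if (i >= offset) {
                next[i] = cur[i - offset] + cur[i];
                ++adds;
            } else {
                next[i] = cur[i];
            }
        }
        std::swap(cur, next);
    }
    // The levels above produce an inclusive scan; shift right for exclusive.
    std::vector<int> out(n, 0);
    for (std::size_t i = 1; i < n; ++i) {
        out[i] = cur[i - 1];
    }
    return {out, adds};
}

// Work-efficient scan: up-sweep (reduction) followed by down-sweep.
// Assumes the length is a power of two.
ScanResult efficientScanSim(const std::vector<int>& in) {
    const std::size_t n = in.size();
    assert(n > 0 && (n & (n - 1)) == 0);
    std::vector<int> x(in);
    long long adds = 0;
    // Up-sweep: build partial sums in place, halving the active set per level.
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            x[i] += x[i - stride];
            ++adds;
        }
    }
    // Down-sweep: clear the root, then push prefix sums back down the tree.
    x[n - 1] = 0;
    for (std::size_t stride = n / 2; stride >= 1; stride /= 2) {
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            const int left = x[i - stride];
            x[i - stride] = x[i];
            x[i] += left;
            ++adds;
        }
    }
    return {x, adds};
}
```

For an array of 1024 elements this sketch counts 9217 additions for the naive scan and 2046 for the work-efficient scan, and the ratio keeps growing with the array length.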

## Stream Compaction

![Compact Chart](img/Analysis%202.JPG)
In this chart, lower values mean better performance.
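
For reference, compaction with scan follows three steps: map each element to 0 or 1, run an exclusive scan over those flags, and scatter the kept elements to the scanned indices. The following is a sequential C++ sketch of those steps, assuming that "keep" means non-zero (as in `kernMapToBoolean`); the function name `compactWithScanSketch` is hypothetical and this is not the GPU implementation measured in the chart.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Removes zeros from `in` using the map / exclusive scan / scatter pattern.
std::vector<int> compactWithScanSketch(const std::vector<int>& in) {
    const std::size_t n = in.size();
    std::vector<int> bools(n), indices(n, 0);

    // Map: 1 for elements to keep, 0 for elements to remove.
    for (std::size_t i = 0; i < n; ++i) {
        bools[i] = (in[i] != 0) ? 1 : 0;
    }

    // Exclusive scan of the flags: indices[i] is the output slot of in[i].
    for (std::size_t i = 1; i < n; ++i) {
        indices[i] = indices[i - 1] + bools[i - 1];
    }

    // Number of kept elements: last exclusive value plus the last flag.
    const std::size_t count =
        (n == 0) ? 0 : static_cast<std::size_t>(indices[n - 1] + bools[n - 1]);

    // Scatter: copy each kept element to its scanned index.
    std::vector<int> out(count);
    for (std::size_t i = 0; i < n; ++i) {
        if (bools[i] != 0) {
            assert(static_cast<std::size_t>(indices[i]) < count);
            out[static_cast<std::size_t>(indices[i])] = in[i];
        }
    }
    return out;
}
```

On the GPU, the map and scatter steps are independent per element, so each can be a single kernel launch; the scan is the step whose cost depends on the algorithm chosen.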
Binary file added img/Analysis 1.JPG
Binary file added img/Analysis 2.JPG
22 changes: 11 additions & 11 deletions src/main.cpp
Expand Up @@ -13,7 +13,7 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 20; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
Expand Down Expand Up @@ -51,7 +51,7 @@ int main(int argc, char* argv[]) {
printDesc("naive scan, power-of-two");
StreamCompaction::Naive::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

/* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
Expand All @@ -64,35 +64,35 @@ int main(int argc, char* argv[]) {
printDesc("naive scan, non-power-of-two");
StreamCompaction::Naive::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
StreamCompaction::Efficient::scan(SIZE, c, a, false);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, non-power-of-two");
StreamCompaction::Efficient::scan(NPOT, c, a);
StreamCompaction::Efficient::scan(NPOT, c, a, false);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("thrust scan, power-of-two");
StreamCompaction::Thrust::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("thrust scan, non-power-of-two");
StreamCompaction::Thrust::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

printf("\n");
Expand Down Expand Up @@ -137,14 +137,14 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient compact, power-of-two");
count = StreamCompaction::Efficient::compact(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient compact, non-power-of-two");
count = StreamCompaction::Efficient::compact(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

system("pause"); // stop Win32 console from closing on exit
20 changes: 19 additions & 1 deletion stream_compaction/common.cu
Expand Up @@ -22,8 +22,16 @@ namespace StreamCompaction {
* Maps an array to an array of 0s and 1s for stream compaction. Elements
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
__global__ void kernMapToBoolean(int n, int *bools, const int *idata)
{
// TODO
//find each index
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n)
{
return;
}
bools[index] = idata[index] != 0 ? 1 : 0;
}

/**
Expand All @@ -33,6 +41,16 @@ namespace StreamCompaction {
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n)
{
return;
}
if (bools[index] != 0)
{
int targetIdx = indices[index];
odata[targetIdx] = idata[index];
}
}

}
57 changes: 54 additions & 3 deletions stream_compaction/cpu.cu
Expand Up @@ -15,11 +15,19 @@ namespace StreamCompaction {
/**
* CPU scan (prefix sum).
* For performance analysis, this is supposed to be a simple for loop.
* (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
* (Optional) For better understanding before starting moving to GPU,
* you can simulate your GPU scan in this function first.
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
odata[0] = 0;
for (int i = 1; i < n; i++)
{
odata[i] = odata[i-1] + idata[i-1];
}

//Why are the last two values different?
timer().endCpuTimer();
}

Expand All @@ -28,11 +36,22 @@ namespace StreamCompaction {
*
* @returns the number of elements remaining after compaction.
*/
//Compaction here means removing the elements equal to 0,
//matching the != 0 test in kernMapToBoolean.
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
int j = 0;
for (int i = 0; i < n; i++)
{
if (idata[i] != 0)
{
odata[j] = idata[i];
j++;
}
}
timer().endCpuTimer();
return -1;
return j;
}

/**
Expand All @@ -41,10 +60,42 @@ namespace StreamCompaction {
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
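int* boolArray = new int[n]; // n ints; new[] already scales by sizeof(int)
int* scanArray = new int[n];
timer().startCpuTimer();
// TODO
// Map: 1 for elements to keep (non-zero), 0 for elements to remove
for (int i = 0; i < n; i++)
{
    boolArray[i] = (idata[i] != 0) ? 1 : 0;
}
// Inclusive scan of the boolean array
scanArray[0] = boolArray[0];
for (int i = 1; i < n; i++)
{
    scanArray[i] = scanArray[i - 1] + boolArray[i];
}
// The last inclusive value is the number of kept elements
int elementNum = scanArray[n - 1];
// Shift right by one to turn the inclusive scan into an exclusive scan.
// Start at n - 1 so the loop never writes past the end of the array.
for (int i = n - 1; i > 0; i--)
{
    scanArray[i] = scanArray[i - 1];
}
scanArray[0] = 0; // identity
// Scatter: each kept element goes to the index given by the exclusive scan
for (int i = 0; i < n; i++)
{
    if (boolArray[i] != 0)
    {
        odata[scanArray[i]] = idata[i];
    }
}
timer().endCpuTimer();
// Release the temporary arrays
delete[] boolArray;
delete[] scanArray;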
return -1;

return elementNum;
}
}
}