Project 2: Rhuta Joshi #27

Open

wants to merge 28 commits into base: main

Commits (28)
026b100
Naive parallel scan with failure at index 129
rcj9719 Sep 15, 2022
724b088
Naive scan done, bug resolved
rcj9719 Sep 16, 2022
3a6edba
Work efficient incomplete and buggy
rcj9719 Sep 16, 2022
e090c05
Work-efficient working for powers of two
rcj9719 Sep 17, 2022
81b7628
work-efficient scan implemented
rcj9719 Sep 17, 2022
d8834c8
Thrust scan completed
rcj9719 Sep 18, 2022
c61e90a
Efficient stream compaction done
rcj9719 Sep 18, 2022
34af50e
Updating README
rcj9719 Sep 18, 2022
de02574
Updating README
rcj9719 Sep 18, 2022
564a9d0
Updating README
rcj9719 Sep 18, 2022
aed7737
Updating README
rcj9719 Sep 18, 2022
0339925
adding performance analysis
rcj9719 Sep 19, 2022
c6d1b63
adding performance analysis
rcj9719 Sep 19, 2022
99166e5
adding performance analysis
rcj9719 Sep 19, 2022
660bb80
Update README.md
rcj9719 Sep 19, 2022
93ea1c2
Update README.md
rcj9719 Sep 19, 2022
24c5290
adding stream compaction analysis
rcj9719 Sep 19, 2022
18f1aff
Merge branch 'main' of https://github.com/rcj9719/GPU-Project2-Stream…
rcj9719 Sep 19, 2022
6901f62
Update README.md
rcj9719 Sep 19, 2022
3b4f84c
Update README.md
rcj9719 Sep 19, 2022
0660ebb
Update README.md
rcj9719 Sep 19, 2022
346aff8
Adding tests
rcj9719 Sep 19, 2022
319c082
Update README.md
rcj9719 Sep 19, 2022
a7e8e42
Update README.md
rcj9719 Sep 19, 2022
4877ca1
minor changes during testing
rcj9719 Sep 19, 2022
73cbb14
attempt at upsweep optimization
rcj9719 Sep 19, 2022
ffc7df3
Update README.md
rcj9719 Sep 19, 2022
d9e4f03
Update README.md
rcj9719 Sep 19, 2022
133 changes: 127 additions & 6 deletions README.md
@@ -3,12 +3,133 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* RHUTA JOSHI
* [LinkedIn](https://www.linkedin.com/in/rcj9719/)
* [Website](https://sites.google.com/view/rhuta-joshi)

### (TODO: Your README)
* Tested on: Windows 10 Home, i5-7200U CPU @ 2.50GHz, NVIDIA GTX 940MX 4096 MB (Personal Laptop), RTX not supported
* GPU Compute Capability: 5.0

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
Introduction
---

### Stream Compaction

Stream compaction is an important parallel computing primitive that generates a compact output buffer containing only the selected elements of an input buffer, based on some condition. In other words, given an array of elements, we want to create a new array holding only the elements that meet a certain criterion, while preserving their order.
The important steps in a parallel stream compaction algorithm are as follows:

![](img/stream-compaction.jpg)

1. Step 1: Mapping - Compute a temporary array containing
    - 1 if the corresponding element meets the criterion
    - 0 if it does not
2. Step 2: Scanning - Run an exclusive scan on the mapped temporary array using one of the scan techniques described below
    - Naive scan
    - Work-efficient scan
3. Step 3: Scattering - Insert each input element into the output at the index obtained from the scanned buffer
    - The result of the scan is the index into the final array
    - Only write an element if the temporary array holds a 1 at its position

For this project, the criterion is that the element must not be equal to 0. A minimal sequential sketch of these three steps follows.
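The sketch below uses illustrative names (`compactSketch`, `bools`, `indices`) and a plain loop for the scan; it is not the exact code in this PR, just the three steps under the non-zero criterion.

```
#include <vector>

// Sequential sketch of map -> scan -> scatter under the non-zero criterion.
// Returns the number of elements written to odata.
int compactSketch(int n, int* odata, const int* idata) {
    if (n <= 0) return 0;
    std::vector<int> bools(n), indices(n);

    // Step 1: Mapping - 1 if the element should be kept, 0 otherwise
    for (int i = 0; i < n; i++) {
        bools[i] = (idata[i] != 0) ? 1 : 0;
    }

    // Step 2: Scanning - exclusive prefix sum of the boolean array
    indices[0] = 0;
    for (int i = 1; i < n; i++) {
        indices[i] = indices[i - 1] + bools[i - 1];
    }

    // Step 3: Scattering - each kept element lands at its scanned index
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (bools[i] == 1) {
            odata[indices[i]] = idata[i];
            count++;
        }
    }
    return count;
}
```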

### Parallel Scanning

In this project, I implemented stream compaction on the CPU and on the GPU using the parallel all-prefix-sums algorithm (commonly known as scan) in CUDA, and compared the performance of each implementation. The sequential scan algorithm is poorly suited to GPUs because it does not take advantage of the GPU's data parallelism; the parallel version of scan uses the GPU's many processors to speed up the computation. As a reminder, the exclusive scan of [3 1 7 0 4 1 6 3] under addition is [0 3 4 11 11 15 16 22]. The parallel scan can be performed in two ways:

1. Naive scan - This is an O(n log n) algorithm that iteratively adds elements shifted by an increasing offset (see the kernel sketch after this list).
2. Work-efficient scan - This is an O(n) algorithm with two phases (also sketched after this list).
    - Step 1: **Upsweep scan** (Parallel Reduction phase) - We traverse the tree from the leaves to the root, computing partial sums at the internal nodes. At the end of this phase, the root node (the last element of the array) holds the sum of all elements.

![](img/upsweep.jpg)

    - Step 2: **Downsweep scan** (Collecting scanned results) - We traverse back down the tree from the root, using the partial sums from the reduction phase to build the scan in place on the array. We start by replacing the root (the last element of the upsweep output) with the identity value (0). Then, at each level,
        - Left child: copy the parent's value
        - Right child: add the parent's value to the old left child's value

![](img/downsweep.jpg)
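As a reference, a minimal CUDA sketch of one pass of the naive scan might look as follows; each pass adds to every element the element `offset` positions to its left, ping-ponging between two device buffers between passes. The kernel and parameter names are illustrative, not the exact code in this PR.

```
// One pass of the naive scan: out[k] = in[k - offset] + in[k] for k >= offset.
// Called once per pass d = 1..ceil(log2(n)) with offset = 2^(d-1); the input
// and output buffers are swapped (ping-ponged) between passes.
__global__ void kernNaiveScanPass(int n, int offset, int* out, const int* in) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= n) return;
    if (k >= offset) {
        out[k] = in[k - offset] + in[k];
    } else {
        out[k] = in[k];  // elements before the offset are copied unchanged
    }
}
```

This produces an inclusive scan; shifting the result right by one position and inserting the identity at index 0 converts it to the exclusive scan used for compaction.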

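Similarly, a sketch of the upsweep and downsweep kernels, assuming the array has been padded to a power of two and one thread is launched per element (names are illustrative; the actual kernels in this PR may differ):

```
// Upsweep at depth d (stride = 2^(d+1)): every node whose index is a multiple
// of the stride folds the partial sum of its left half into its right boundary.
__global__ void kernUpSweep(int n, int stride, int* data) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < n && (k % stride) == 0) {
        data[k + stride - 1] += data[k + stride / 2 - 1];
    }
}

// Downsweep at depth d (stride = 2^(d+1)): the root is set to the identity (0)
// before the first pass; each parent then pushes its value to the left child
// and (parent + old left child) to the right child.
__global__ void kernDownSweep(int n, int stride, int* data) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < n && (k % stride) == 0) {
        int leftChild = data[k + stride / 2 - 1];          // save old left child
        data[k + stride / 2 - 1] = data[k + stride - 1];   // left child <- parent
        data[k + stride - 1] += leftChild;                 // right child <- parent + old left
    }
}
```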

Tests performed
---
The following tests were run with a block size of 256 on an array of size 2<sup>23</sup>, which is roughly 8 million elements.
```
****************
** SCAN TESTS **
****************
[ 3 16 30 16 9 8 8 11 41 20 38 34 7 ... 4 0 ]
==== cpu scan, power-of-two ====
elapsed time: 0ms (std::chrono Measured)
[ 0 3 19 49 65 74 82 90 101 142 162 200 234 ... 205461733 205461737 ]
==== cpu scan, non-power-of-two ====
elapsed time: 0ms (std::chrono Measured)
passed
==== naive scan, power-of-two ====
elapsed time: 327.662ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 328.087ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 0ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 3 3 1 0 3 3 3 3 3 0 0 1 0 ... 3 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 45.6113ms (std::chrono Measured)
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 46.366ms (std::chrono Measured)
passed
==== cpu compact with scan ====
elapsed time: 142.93ms (std::chrono Measured)
passed
==== work-efficient compact, power-of-two ====
elapsed time: 743.851ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 739.94ms (CUDA Measured)
passed
Press any key to continue . . .
```

Performance Analysis
---
### Scan algorithm

For block sizes ranging from 4 to 1024, the best performance was observed with a block size of 128 or 256. Performance with block sizes below 32 is very poor because the warp size is 32: a block smaller than a warp leaves most of each warp's lanes idle, effectively serializing work that could run in parallel. As the block size increases beyond 256, the number of idle threads per iteration also increases, again reducing performance. The following chart shows test results with a block size of 256.

![](img/parallel_scan_performance_analysis.png)
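For context, the block size enters the implementation only through the kernel launch configuration. A typical launch, using the illustrative naive-scan kernel sketched earlier, might look like this (variable names are assumptions, not the exact code in this PR):

```
int blockSize = 256;  // tunable parameter discussed above
dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);  // enough blocks to cover all n elements
kernNaiveScanPass<<<fullBlocksPerGrid, blockSize>>>(n, offset, dev_out, dev_in);
```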

Based on this chart, the CPU appears to take less time than the GPU parallel algorithms. This is because the GPU implementations still leave room for further optimization. Some of the things to consider when optimizing:
1. In the current implementation, the same number of threads is launched in every iteration of the upsweep and downsweep. In each iteration, many of these threads are idle and simply return without performing any meaningful work.
2. Even if some threads in a warp finish early via an early exit, they must wait for the remaining threads in the warp. When this happens because of conditional branches, it is called warp divergence. It can be reduced by warp partitioning, i.e. grouping threads that are likely to terminate together into the same warp.

I tried to improve the upsweep by making better thread divisions to optimize the parallel reduction process.
![](img/upsweep_optimization.jpg)

I tried taking strides in the reverse order (..., 4, 2, 1 instead of 1, 2, 4, ...) by setting the offset as follows:
```int offsetd = pow(2, maxDepth - d - 1);```
Then, inside the upsweep, instead of using the expensive modulo operator, I used:
```
if (k < offsetd) {
    data[k] += data[k + offsetd];
}
```
However, the implementation is still buggy; I did not have time to get it working correctly and test the results.
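For reference, one common way to avoid both the modulo test and the idle threads is to launch only as many threads as there are active nodes at each depth and stretch the thread index back onto the array with a multiply. The sketch below is written under that assumption, with illustrative names; it is not the code attempted in this PR.

```
// Upsweep with one thread per active node. At depth d, stride = 2^(d+1) and
// numNodes = n / stride, so the kernel is launched with exactly numNodes threads.
__global__ void kernUpSweepCompact(int numNodes, int stride, int* data) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numNodes) return;
    int k = t * stride;  // first index of this thread's segment of the array
    data[k + stride - 1] += data[k + stride / 2 - 1];
}
```

Because consecutive thread indices now map to consecutive active nodes, whole warps tend to stay active or exit together, which also reduces the warp divergence described above; the same remapping applies to the downsweep.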

### Stream compaction

Stream compaction shows a trend similar to that of scanning. The following graph does not include stream compaction tests using the naive scan method. The behavior with respect to block size is also similar to the one observed for scanning, described above.

![](img/stream_compaction_analysis.png)

References
---
1. GPU Parallel Algorithms Course Presentation - CIS 5650 - Fall 2022
2. GPU Gems 3, Chapter 39 - [Parallel Prefix Sum (Scan) with CUDA](https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html)
Binary file added img/downsweep.jpg
Binary file added img/parallel_scan_performance_analysis.png
Binary file added img/stream-compaction.jpg
Binary file added img/stream_compaction_analysis.png
Binary file added img/upsweep.jpg
Binary file added img/upsweep_optimization.jpg
32 changes: 16 additions & 16 deletions src/main.cpp
@@ -44,7 +44,7 @@ int main(int argc, char* argv[]) {
printDesc("cpu scan, non-power-of-two");
StreamCompaction::CPU::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(NPOT, b, true);
//printArray(NPOT, b, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
@@ -81,19 +81,19 @@ int main(int argc, char* argv[]) {
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("thrust scan, power-of-two");
StreamCompaction::Thrust::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);
//zeroArray(SIZE, c);
//printDesc("thrust scan, power-of-two");
//StreamCompaction::Thrust::scan(SIZE, c, a);
//printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
////printArray(SIZE, c, true);
//printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("thrust scan, non-power-of-two");
StreamCompaction::Thrust::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);
//zeroArray(SIZE, c);
//printDesc("thrust scan, non-power-of-two");
//StreamCompaction::Thrust::scan(NPOT, c, a);
//printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
////printArray(NPOT, c, true);
//printCmpResult(NPOT, b, c);

printf("\n");
printf("*****************************\n");
@@ -115,22 +115,22 @@ int main(int argc, char* argv[]) {
count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
expectedCount = count;
printArray(count, b, true);
//printArray(count, b, true);
printCmpLenResult(count, expectedCount, b, b);

zeroArray(SIZE, c);
printDesc("cpu compact without scan, non-power-of-two");
count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
expectedNPOT = count;
printArray(count, c, true);
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

zeroArray(SIZE, c);
printDesc("cpu compact with scan");
count = StreamCompaction::CPU::compactWithScan(SIZE, c, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(count, c, true);
//printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
44 changes: 40 additions & 4 deletions stream_compaction/cpu.cu
@@ -18,9 +18,16 @@ namespace StreamCompaction {
* (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
*/
        void scan(int n, int *odata, const int *idata) {
            timer().startCpuTimer();
            //timer().startCpuTimer();
            // TODO
            timer().endCpuTimer();
            int identity = 0;

            odata[0] = identity;
            for (int i = 1; i < n; i++) {
                odata[i] = odata[i - 1] + idata[i - 1]; // exclusive scan
            }

            //timer().endCpuTimer();
        }

/**
@@ -31,8 +31,8 @@ namespace StreamCompaction {
        int compactWithoutScan(int n, int *odata, const int *idata) {
            timer().startCpuTimer();
            // TODO
            int nonZeroIdx = 0;
            for (int i = 0; i < n; i++) {
                if (idata[i] != 0) {
                    odata[nonZeroIdx] = idata[i];
                    nonZeroIdx++;
                }
            }

            timer().endCpuTimer();
            return -1;
            return nonZeroIdx;
        }

        void map(int n, int* odata, const int* idata) {
            for (int i = 0; i < n; i++) {
                odata[i] = (idata[i] == 0) ? 0 : 1;
            }
        }

/**
@@ -43,8 +43,23 @@ namespace StreamCompaction {
        int compactWithScan(int n, int *odata, const int *idata) {
            timer().startCpuTimer();
            // TODO

            int* mapped = new int[n];
            int* scanned = new int[n];
            map(n, mapped, idata);
            scan(n, scanned, mapped);
            int count = 0;
            for (int i = 0; i < n; i++) {
                if (mapped[i] == 1) {
                    int index = scanned[i];
                    odata[index] = idata[i];
                    count++;
                }
            }
            delete[] mapped;
            delete[] scanned;
            timer().endCpuTimer();
            return -1;
            return count;
        }
}
}