Project 2: Rhuta Joshi #27

Open

wants to merge 28 commits into base: main

Commits (28)
026b100
Naive parallel scan with failure at index 129
rcj9719 Sep 15, 2022
724b088
Naive scan done, bug resolved
rcj9719 Sep 16, 2022
3a6edba
Work efficient incomplete and buggy
rcj9719 Sep 16, 2022
e090c05
Work-efficient working for powers of two
rcj9719 Sep 17, 2022
81b7628
work-efficient scan implemented
rcj9719 Sep 17, 2022
d8834c8
Thrust scan completed
rcj9719 Sep 18, 2022
c61e90a
Efficient stream compaction done
rcj9719 Sep 18, 2022
34af50e
Updating README
rcj9719 Sep 18, 2022
de02574
Updating README
rcj9719 Sep 18, 2022
564a9d0
Updating README
rcj9719 Sep 18, 2022
aed7737
Updating README
rcj9719 Sep 18, 2022
0339925
adding performance analysis
rcj9719 Sep 19, 2022
c6d1b63
adding performance analysis
rcj9719 Sep 19, 2022
99166e5
adding performance analysis
rcj9719 Sep 19, 2022
660bb80
Update README.md
rcj9719 Sep 19, 2022
93ea1c2
Update README.md
rcj9719 Sep 19, 2022
24c5290
adding stream compaction analysis
rcj9719 Sep 19, 2022
18f1aff
Merge branch 'main' of https://github.com/rcj9719/GPU-Project2-Stream…
rcj9719 Sep 19, 2022
6901f62
Update README.md
rcj9719 Sep 19, 2022
3b4f84c
Update README.md
rcj9719 Sep 19, 2022
0660ebb
Update README.md
rcj9719 Sep 19, 2022
346aff8
Adding tests
rcj9719 Sep 19, 2022
319c082
Update README.md
rcj9719 Sep 19, 2022
a7e8e42
Update README.md
rcj9719 Sep 19, 2022
4877ca1
minor changes during testing
rcj9719 Sep 19, 2022
73cbb14
attempt at upsweep optimization
rcj9719 Sep 19, 2022
ffc7df3
Update README.md
rcj9719 Sep 19, 2022
d9e4f03
Update README.md
rcj9719 Sep 19, 2022
133 changes: 127 additions & 6 deletions README.md
@@ -3,12 +3,133 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* RHUTA JOSHI
* [LinkedIn](https://www.linkedin.com/in/rcj9719/)
* [Website](https://sites.google.com/view/rhuta-joshi)

### (TODO: Your README)
* Tested on: Windows 10 Home, i5-7200U CPU @ 2.50GHz, NVIDIA GTX 940MX 4096 MB (Personal Laptop), RTX not supported
* GPU Compute Capability: 5.0

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
Introduction
---

### Stream Compaction

Stream compaction is an important parallel computing primitive that generates a compact output buffer containing only the selected elements of an input buffer, based on some condition. In other words, given an array of elements, we want to create a new array holding only the elements that meet a certain criterion, while preserving their order.
The important steps in a parallel stream compaction algorithm are as follows:

![](img/stream-compaction.jpg)

1. Step 1: Mapping - Compute a temporary array containing
    - 1 if the corresponding element meets the criterion
    - 0 if it does not
2. Step 2: Scanning - Run an exclusive scan on the mapped temporary array using one of the scan techniques described below
    - Naive scan
    - Work-efficient scan
3. Step 3: Scattering - Insert each input element into the output at the index obtained from the scanned buffer
    - The result of the scan is the index into the final array
    - Only write an element if the temporary array holds a 1 at its position

For this project, the criterion is that the element must not be equal to 0. A minimal sequential sketch of these three steps follows.
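The sketch below uses illustrative names (`compactSketch`, `bools`, `indices`) and a plain loop for the scan; it is not the exact code in this PR, just the three steps under the non-zero criterion.

```
#include <vector>

// Sequential sketch of map -> scan -> scatter under the non-zero criterion.
// Returns the number of elements written to odata.
int compactSketch(int n, int* odata, const int* idata) {
    if (n <= 0) return 0;
    std::vector<int> bools(n), indices(n);

    // Step 1: Mapping - 1 if the element should be kept, 0 otherwise
    for (int i = 0; i < n; i++) {
        bools[i] = (idata[i] != 0) ? 1 : 0;
    }

    // Step 2: Scanning - exclusive prefix sum of the boolean array
    indices[0] = 0;
    for (int i = 1; i < n; i++) {
        indices[i] = indices[i - 1] + bools[i - 1];
    }

    // Step 3: Scattering - each kept element lands at its scanned index
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (bools[i] == 1) {
            odata[indices[i]] = idata[i];
            count++;
        }
    }
    return count;
}
```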

### Parallel Scanning

In this project, I implemented stream compaction on the CPU and on the GPU using the parallel all-prefix-sums algorithm (commonly known as scan) in CUDA, and compared the performance of each implementation. The sequential scan algorithm is poorly suited to GPUs because it does not take advantage of the GPU's data parallelism; the parallel version of scan uses the GPU's many processors to speed up the computation. As a reminder, the exclusive scan of [3 1 7 0 4 1 6 3] under addition is [0 3 4 11 11 15 16 22]. The parallel scan can be performed in two ways:

1. Naive scan - This is an O(n log n) algorithm that iteratively adds elements shifted by an increasing offset (see the kernel sketch after this list).
2. Work-efficient scan - This is an O(n) algorithm with two phases (also sketched after this list).
    - Step 1: **Upsweep scan** (Parallel Reduction phase) - We traverse the tree from the leaves to the root, computing partial sums at the internal nodes. At the end of this phase, the root node (the last element of the array) holds the sum of all elements.

![](img/upsweep.jpg)

    - Step 2: **Downsweep scan** (Collecting scanned results) - We traverse back down the tree from the root, using the partial sums from the reduction phase to build the scan in place on the array. We start by replacing the root (the last element of the upsweep output) with the identity value (0). Then, at each level,
        - Left child: copy the parent's value
        - Right child: add the parent's value to the old left child's value

![](img/downsweep.jpg)
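As a reference, a minimal CUDA sketch of one pass of the naive scan might look as follows; each pass adds to every element the element `offset` positions to its left, ping-ponging between two device buffers between passes. The kernel and parameter names are illustrative, not the exact code in this PR.

```
// One pass of the naive scan: out[k] = in[k - offset] + in[k] for k >= offset.
// Called once per pass d = 1..ceil(log2(n)) with offset = 2^(d-1); the input
// and output buffers are swapped (ping-ponged) between passes.
__global__ void kernNaiveScanPass(int n, int offset, int* out, const int* in) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= n) return;
    if (k >= offset) {
        out[k] = in[k - offset] + in[k];
    } else {
        out[k] = in[k];  // elements before the offset are copied unchanged
    }
}
```

This produces an inclusive scan; shifting the result right by one position and inserting the identity at index 0 converts it to the exclusive scan used for compaction.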

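Similarly, a sketch of the upsweep and downsweep kernels, assuming the array has been padded to a power of two and one thread is launched per element (names are illustrative; the actual kernels in this PR may differ):

```
// Upsweep at depth d (stride = 2^(d+1)): every node whose index is a multiple
// of the stride folds the partial sum of its left half into its right boundary.
__global__ void kernUpSweep(int n, int stride, int* data) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < n && (k % stride) == 0) {
        data[k + stride - 1] += data[k + stride / 2 - 1];
    }
}

// Downsweep at depth d (stride = 2^(d+1)): the root is set to the identity (0)
// before the first pass; each parent then pushes its value to the left child
// and (parent + old left child) to the right child.
__global__ void kernDownSweep(int n, int stride, int* data) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < n && (k % stride) == 0) {
        int leftChild = data[k + stride / 2 - 1];          // save old left child
        data[k + stride / 2 - 1] = data[k + stride - 1];   // left child <- parent
        data[k + stride - 1] += leftChild;                 // right child <- parent + old left
    }
}
```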

Tests performed
---
The following tests were run with a block size of 256 on an array of size 2<sup>23</sup>, which is roughly 8 million elements.
```
****************
** SCAN TESTS **
****************
[ 3 16 30 16 9 8 8 11 41 20 38 34 7 ... 4 0 ]
==== cpu scan, power-of-two ====
elapsed time: 0ms (std::chrono Measured)
[ 0 3 19 49 65 74 82 90 101 142 162 200 234 ... 205461733 205461737 ]
==== cpu scan, non-power-of-two ====
elapsed time: 0ms (std::chrono Measured)
passed
==== naive scan, power-of-two ====
elapsed time: 327.662ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 328.087ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 0ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 3 3 1 0 3 3 3 3 3 0 0 1 0 ... 3 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 45.6113ms (std::chrono Measured)
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 46.366ms (std::chrono Measured)
passed
==== cpu compact with scan ====
elapsed time: 142.93ms (std::chrono Measured)
passed
==== work-efficient compact, power-of-two ====
elapsed time: 743.851ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 739.94ms (CUDA Measured)
passed
Press any key to continue . . .
```

Performance Analysis
---
### Scan algorithm

For block sizes ranging from 4 to 1024, the best performance was observed with a block size of 128 or 256. Performance with block sizes below 32 is very poor because the warp size is 32: a block smaller than a warp leaves most of each warp's lanes idle, effectively serializing work that could run in parallel. As the block size increases beyond 256, the number of idle threads per iteration also increases, again reducing performance. The following chart shows test results with a block size of 256.

![](img/parallel_scan_performance_analysis.png)
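For context, the block size enters the implementation only through the kernel launch configuration. A typical launch, using the illustrative naive-scan kernel sketched earlier, might look like this (variable names are assumptions, not the exact code in this PR):

```
int blockSize = 256;  // tunable parameter discussed above
dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);  // enough blocks to cover all n elements
kernNaiveScanPass<<<fullBlocksPerGrid, blockSize>>>(n, offset, dev_out, dev_in);
```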

Based on this chart, the CPU appears to take less time than the GPU parallel algorithms. This is because the GPU implementations still leave room for further optimization. Some of the things to consider when optimizing:
1. In the current implementation, the same number of threads is launched in every iteration of the upsweep and downsweep. In each iteration, many of these threads are idle and simply return without performing any meaningful work.
2. Even if some threads in a warp finish early via an early exit, they must wait for the remaining threads in the warp. When this happens because of conditional branches, it is called warp divergence. It can be reduced by warp partitioning, i.e. grouping threads that are likely to terminate together into the same warp.

I tried to improve the upsweep by making better thread divisions to optimize the parallel reduction process.
![](img/upsweep_optimization.jpg)

I tried taking strides in the reverse order (..., 4, 2, 1 instead of 1, 2, 4, ...) by setting the offset as follows:
```int offsetd = pow(2, maxDepth - d - 1);```
Then, inside the upsweep, instead of using the expensive modulo operator, I used:
```
if (k < offsetd) {
    data[k] += data[k + offsetd];
}
```
However, the implementation is still buggy; I did not have time to get it working correctly and test the results.
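For reference, one common way to avoid both the modulo test and the idle threads is to launch only as many threads as there are active nodes at each depth and stretch the thread index back onto the array with a multiply. The sketch below is written under that assumption, with illustrative names; it is not the code attempted in this PR.

```
// Upsweep with one thread per active node. At depth d, stride = 2^(d+1) and
// numNodes = n / stride, so the kernel is launched with exactly numNodes threads.
__global__ void kernUpSweepCompact(int numNodes, int stride, int* data) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numNodes) return;
    int k = t * stride;  // first index of this thread's segment of the array
    data[k + stride - 1] += data[k + stride / 2 - 1];
}
```

Because consecutive thread indices now map to consecutive active nodes, whole warps tend to stay active or exit together, which also reduces the warp divergence described above; the same remapping applies to the downsweep.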

### Stream compaction

Stream compaction shows a trend similar to that of scanning. The following graph does not include stream compaction tests using the naive scan method. The behavior with respect to block size is also similar to the one observed for scanning, described above.

![](img/stream_compaction_analysis.png)

References
---
1. GPU Parallel Algorithms Course Presentation - CIS 5650 - Fall 2022
2. GPU Gems 3, Chapter 39 - [Parallel Prefix Sum (Scan) with CUDA](https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html)
Binary file added img/downsweep.jpg
Binary file added img/parallel_scan_performance_analysis.png
Binary file added img/stream-compaction.jpg
Binary file added img/stream_compaction_analysis.png
Binary file added img/upsweep.jpg
Binary file added img/upsweep_optimization.jpg
32 changes: 16 additions & 16 deletions src/main.cpp
@@ -44,7 +44,7 @@ int main(int argc, char* argv[]) {
printDesc("cpu scan, non-power-of-two");
StreamCompaction::CPU::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(NPOT, b, true);
//printArray(NPOT, b, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
@@ -81,19 +81,19 @@ int main(int argc, char* argv[]) {
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("thrust scan, power-of-two");
StreamCompaction::Thrust::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);
//zeroArray(SIZE, c);
//printDesc("thrust scan, power-of-two");
//StreamCompaction::Thrust::scan(SIZE, c, a);
//printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
////printArray(SIZE, c, true);
//printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("thrust scan, non-power-of-two");
StreamCompaction::Thrust::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);
//zeroArray(SIZE, c);
//printDesc("thrust scan, non-power-of-two");
//StreamCompaction::Thrust::scan(NPOT, c, a);
//printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
////printArray(NPOT, c, true);
//printCmpResult(NPOT, b, c);

printf("\n");
printf("*****************************\n");
@@ -115,22 +115,22 @@ int main(int argc, char* argv[]) {
count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
expectedCount = count;
printArray(count, b, true);
//printArray(count, b, true);
printCmpLenResult(count, expectedCount, b, b);

zeroArray(SIZE, c);
printDesc("cpu compact without scan, non-power-of-two");
count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
expectedNPOT = count;
printArray(count, c, true);
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

zeroArray(SIZE, c);
printDesc("cpu compact with scan");
count = StreamCompaction::CPU::compactWithScan(SIZE, c, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(count, c, true);
//printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
44 changes: 40 additions & 4 deletions stream_compaction/cpu.cu
@@ -18,9 +18,16 @@ namespace StreamCompaction {
* (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
*/
        void scan(int n, int *odata, const int *idata) {
            timer().startCpuTimer();
            //timer().startCpuTimer();
            // TODO
            timer().endCpuTimer();
            int identity = 0;

            odata[0] = identity;
            for (int i = 1; i < n; i++) {
                odata[i] = odata[i - 1] + idata[i - 1]; // exclusive scan
            }

            //timer().endCpuTimer();
        }

/**
@@ -31,8 +31,8 @@ namespace StreamCompaction {
        int compactWithoutScan(int n, int *odata, const int *idata) {
            timer().startCpuTimer();
            // TODO
            int nonZeroIdx = 0;
            for (int i = 0; i < n; i++) {
                if (idata[i] != 0) {
                    odata[nonZeroIdx] = idata[i];
                    nonZeroIdx++;
                }
            }

            timer().endCpuTimer();
            return -1;
            return nonZeroIdx;
        }

        void map(int n, int* odata, const int* idata) {
            for (int i = 0; i < n; i++) {
                odata[i] = (idata[i] == 0) ? 0 : 1;
            }
        }

/**
@@ -43,8 +43,23 @@ namespace StreamCompaction {
        int compactWithScan(int n, int *odata, const int *idata) {
            timer().startCpuTimer();
            // TODO

            int* mapped = new int[n];
            int* scanned = new int[n];
            map(n, mapped, idata);
            scan(n, scanned, mapped);
            int count = 0;
            for (int i = 0; i < n; i++) {
                if (mapped[i] == 1) {
                    int index = scanned[i];
                    odata[index] = idata[i];
                    count++;
                }
            }
            delete[] mapped;
            delete[] scanned;
            timer().endCpuTimer();
            return -1;
            return count;
        }
}
}