CIS565-Fall-2022 · ryanryantong · Sep 18, 2022 · Sep 18, 2022 · Sep 18, 2022 · Sep 18, 2022
diff --git a/README.md b/README.md
@@ -3,12 +3,84 @@ CUDA Stream Compaction
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
-
-### (TODO: Your README)
+* Ryan Tong
+  * [LinkedIn](https://www.linkedin.com/in/ryanctong/), [personal website](), [twitter](), etc.
+* Tested on: Windows 10, i7-8750H @ 2.20GHz 16GB, GeForce GTX 1060 6144MB (Personal Laptop)
 
 Include analysis, etc. (Remember, this is public, so don't put
 anything here that you don't want to share with the world.)
 
+### Features
+This project implements the required features of CPU based scan and stream compaction; GPU based naive scan, work-efficient scan, stream compaction (using work-efficient scan); and Thrust based scan. I roughly optimized the block size to be 256 after doing some testing and reading this article: https://oneflow2020.medium.com/how-to-choose-the-grid-size-and-block-size-for-a-cuda-kernel-d1ff1f0a7f92. 
+
+### Performance Analysis
+Here are the graphs comapring the runtimes of scan implemented on the CPU, GPU, and with Thrust. Note that Thrust is removed and the range of array sizes is shrunk for the second graph for visualization purposes.
+
+![Scan](img/scan.png)
+![Scan (Better Visualization)](img/scan_small.png)
+
+From the graphs, we can see that CPU is faster than the work-efficent GPU implementation until the array size reaches about ~1,000,000 elements. This is suprising because the theortical complexities of these algorithms are O(n), O(nlogn), O(n) for CPU, naive, and work efficent respectively. Since the GPU implementations are paralleized we would expect that they are faster than the CPU implementation. The cause of this is likely the lack of optimizations in my GPU code and frequent reads and writes to global memory which is slow. An implementation using shared memory would improve the memory access speeds. Further more, the indexing of scan is inefficent since there are many inactive threads that could be retired in a warp if they were consecutive. 
+
+The Thrust implementations are significantly slower than both GPU and CPU implementation which is likely due to some implementation error that I was unable to solve.
+
+We can see these inefficenies reflected again in the stream compaction run times:
+
+![Stream Compaction](img/compaction.png)
+![Stream Compaction (Better Visualization)](img/compaction_small.png)
+
+### Program Output
+```
+
+****************
+** SCAN TESTS **
+****************
+    [  29  48   7  23  28  16  45  34   2  47  35   3  16 ...  48   0 ]
+==== cpu scan, power-of-two ====
+   elapsed time: 2.702ms    (std::chrono Measured)
+    [   0  29  77  84 107 135 151 196 230 232 279 314 317 ... 12845931 12845979 ]
+==== cpu scan, non-power-of-two ====
+   elapsed time: 2.7096ms    (std::chrono Measured)
+    [   0  29  77  84 107 135 151 196 230 232 279 314 317 ... 12845838 12845880 ]
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 4.85891ms    (CUDA Measured)
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 4.5247ms    (CUDA Measured)
+    passed
+==== work-efficient scan, power-of-two ====
+   elapsed time: 2.23603ms    (CUDA Measured)
+    passed
+==== work-efficient scan, non-power-of-two ====
+   elapsed time: 2.10493ms    (CUDA Measured)
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 35.5277ms    (CUDA Measured)
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 27.5845ms    (CUDA Measured)
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [   3   2   3   1   2   0   3   0   2   3   1   3   2 ...   2   0 ]
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 6.4836ms    (std::chrono Measured)
+    [   3   2   3   1   2   3   2   3   1   3   2   3   1 ...   1   2 ]
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 5.3097ms    (std::chrono Measured)
+    [   3   2   3   1   2   3   2   3   1   3   2   3   1 ...   2   1 ]
+    passed
+==== cpu compact with scan ====
+   elapsed time: 14.7061ms    (std::chrono Measured)
+    [   3   2   3   1   2   3   2   3   1   3   2   3   1 ...   1   2 ]
+    passed
+==== work-efficient compact, power-of-two ====
+   elapsed time: 2.84058ms    (CUDA Measured)
+    passed
+==== work-efficient compact, non-power-of-two ====
+   elapsed time: 2.50528ms    (CUDA Measured)
+    passed
+```
diff --git a/img/compaction.png b/img/compaction.png
diff --git a/img/compaction_small.png b/img/compaction_small.png
diff --git a/img/scan.png b/img/scan.png
diff --git a/img/scan_small.png b/img/scan_small.png
diff --git a/src/main.cpp b/src/main.cpp
@@ -13,7 +13,7 @@
 #include <stream_compaction/thrust.h>
 #include "testing_helpers.hpp"
 
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 5; // feel free to change the size of array
 const int NPOT = SIZE - 3; // Non-Power-Of-Two
 int *a = new int[SIZE];
 int *b = new int[SIZE];

diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt
@@ -1,3 +1,4 @@
+cmake_minimum_required(VERSION 3.1)
 set(headers
     "common.h"
     "cpu.h"

diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
@@ -23,7 +23,12 @@ namespace StreamCompaction {
          * which map to 0 will be removed, and elements which map to 1 will be kept.
          */
         __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
-            // TODO
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+
+            bools[index] = idata[index] == 0 ? 0 : 1;
         }
 
         /**
@@ -32,7 +37,12 @@ namespace StreamCompaction {
          */
         __global__ void kernScatter(int n, int *odata,
                 const int *idata, const int *bools, const int *indices) {
-            // TODO
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n || bools[index] == 0) {
+                return;
+            }
+
+            odata[indices[index]] = idata[index];
         }
 
     }

diff --git a/stream_compaction/common.h b/stream_compaction/common.h
@@ -9,6 +9,9 @@
 #include <algorithm>
 #include <chrono>
 #include <stdexcept>
+#include "device_launch_parameters.h"
+
+#define blockSize 4
 
 #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
 #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)

diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
@@ -19,7 +19,11 @@ namespace StreamCompaction {
          */
         void scan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+            int prev = 0;
+            for (int i = 0; i < n; ++i) {
+                odata[i] = prev;
+                prev += idata[i];
+            }
             timer().endCpuTimer();
         }
 
@@ -30,9 +34,15 @@ namespace StreamCompaction {
          */
         int compactWithoutScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+            int out_ptr = 0;
+            for (int i = 0; i < n; ++i) {
+                if (idata[i] != 0) {
+                    odata[out_ptr] = idata[i];
+                    ++out_ptr;
+                }
+            }
             timer().endCpuTimer();
-            return -1;
+            return out_ptr;
         }
 
         /**
@@ -42,9 +52,31 @@ namespace StreamCompaction {
          */
         int compactWithScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+            int* binary = new int[n];
+            int* scanOut = new int[n];
+            for (int i = 0; i < n; ++i) {
+                if (idata[i] != 0) {
+                    binary[i] = 1;
+                }
+                else {
+                    binary[i] = 0;
+                }
+            }
+            //Scan code copied
+            int prev = 0;
+            for (int i = 0; i < n; ++i) {
+                scanOut[i] = prev;
+                prev += binary[i];
+            }
+            int count = 0;
+            for (int i = 0; i < n; ++i) {
+                if (idata[i] != 0) {
+                    odata[scanOut[i]] = idata[i];
+                    ++count;
+                }
+            }
             timer().endCpuTimer();
-            return -1;
+            return count;
         }
     }
 }