CIS565-Fall-2022 · HanLinSun · Sep 17, 2022 · Sep 18, 2022 · Sep 18, 2022 · Sep 19, 2022
diff --git a/README.md b/README.md
@@ -3,12 +3,103 @@ CUDA Stream Compaction
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Hanlin Sun
+  * [LinkedIn](https://www.linkedin.com/in/hanlin-sun-7162941a5/), 
+  * [personal website](https://hanlinsun.github.io/)
+* Tested on: Windows 10, i7-8750H @ 3.2GHz 32GB, NVIDIA Quadro P3200 
 
-### (TODO: Your README)
+# Stream Compaction
+This Project involves:
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+* CPU version of scan
+* CPU version of scan without using scan
+* CPU version of compact with scan
+* GPU version of naive scan
+* GPU version of work-efficient scan
+* GPU version of String Compact scan
 
+These three CPU implements was used to test whether GPU implements was right. I have collected the data across 8 executions with different array sizes to collect the data. 
+This program generates a new array of random values with each execution, where the size of array is customisable. I have varied the size of the arrays by powers of two, starting from 2^8^ all the wai to 2^20^. The program also executes each algorithm for arrays of size "non- power of two" which are generated truncating the "power of two" arrays.
+
+
+# Output Results
+
+```
+****************
+** SCAN TESTS **
+****************
+    [  21  28  22  23  38  18  20   9  44  26  14  10   3 ...  25   0 ]
+==== cpu scan, power-of-two ====
+   elapsed time: 0.0015ms    (std::chrono Measured)
+    [   0  21  49  71  94 132 150 170 179 223 249 263 273 ... 6273 6298 ]
+==== cpu scan, non-power-of-two ====
+   elapsed time: 0.0014ms    (std::chrono Measured)
+    [   0  21  49  71  94 132 150 170 179 223 249 263 273 ... 6223 6226 ]
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 0.25088ms    (CUDA Measured)
+    [   0   21  49  71  94 132 150 170 179 223 249 263 273 276 ... 6298 ]
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 0.23552ms    (CUDA Measured)
+    [   0   21  49  71  94 132 150 170 179 223 249 263 273 276 ...   0  ]
+    passed
+==== work-efficient scan, power-of-two ====
+   elapsed time: 0.185344ms    (CUDA Measured)
+    [   0  21  49  71  94 132 150 170 179 223 249 263 273 ... 6273 6298 ]
+    passed
+==== work-efficient scan, non-power-of-two ====
+   elapsed time: 0.185344ms    (CUDA Measured)
+    [   0  21  49  71  94 132 150 170 179 223 249 263 273 ... 6223 6226 ]
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 13.1092ms    (CUDA Measured)
+    [   0  21  49  71  94 132 150 170 179 223 249 263 273 ... 6273 6298 ]
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 2.18214ms    (CUDA Measured)
+    [   0  21  49  71  94 132 150 170 179 223 249 263 273 ... 6223 6226 ]
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [   1   2   2   3   2   2   0   3   0   0   0   0   1 ...   3   0 ]
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 0.0014ms    (std::chrono Measured)
+    [   1   2   2   3   2   2   3   1   1   3   2   1   1 ...   3   3 ]
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 0.0013ms    (std::chrono Measured)
+    [   1   2   2   3   2   2   3   1   1   3   2   1   1 ...   3   1 ]
+    passed
+==== cpu compact with scan ====
+   elapsed time: 0.0038ms    (std::chrono Measured)
+    [   1   2   2   3   2   2   3   1   1   3   2   1   1 ...   3   3 ]
+    passed
+==== work-efficient compact, power-of-two ====
+   elapsed time: 0.309248ms    (CUDA Measured)
+    [   1   2   2   3   2   2   3   1   1   3   2   1   1 ...   3   3 ]
+    passed
+==== work-efficient compact, non-power-of-two ====
+   elapsed time: 0.274432ms    (CUDA Measured)
+    [   1   2   2   3   2   2   3   1   1   3   2   1   1 ...   3   1 ]
+    passed
+```
+
+# Performance Analysis
+
+## Scanning Algorithm
+
+![Scan Chart](img/Analysis%201.JPG)
+
+In this chart, the lower the attribute is, the better its performance.
+We can see that when the array number is generally small, the performance gap between naive method and work-efficient method is generally small, but with the array length increased, this gap become larger and larger.
+The reason why the second method is more efficient is that even though we have limited the number of active threads, when we are doing naive scan, threads which are not doing anything have to wait for the other active threads in the warp to finish to become available again.
+But in upsweep and downsweep method, upsweep only use half number of threads to finish the work, and the rest of threads can be utilized by the GPU to do other tasks(like downsweep). So through that method we launch the same number of threads, but use less depth than the naive method.
+That's why it is way more faster.
+
+## String Compaction
+
+![Compact Chart](img/Analysis%202.JPG)
+In this chart, the lower the attribute is, the better its performance.
diff --git a/img/Analysis 1.JPG b/img/Analysis 1.JPG
diff --git a/img/Analysis 2.JPG b/img/Analysis 2.JPG
diff --git a/src/main.cpp b/src/main.cpp
@@ -13,7 +13,7 @@
 #include <stream_compaction/thrust.h>
 #include "testing_helpers.hpp"
 
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 20; // feel free to change the size of array
 const int NPOT = SIZE - 3; // Non-Power-Of-Two
 int *a = new int[SIZE];
 int *b = new int[SIZE];
@@ -51,7 +51,7 @@ int main(int argc, char* argv[]) {
     printDesc("naive scan, power-of-two");
     StreamCompaction::Naive::scan(SIZE, c, a);
     printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
+    printArray(SIZE, c, true);
     printCmpResult(SIZE, b, c);
 
     /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
@@ -64,35 +64,35 @@ int main(int argc, char* argv[]) {
     printDesc("naive scan, non-power-of-two");
     StreamCompaction::Naive::scan(NPOT, c, a);
     printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
+    printArray(SIZE, c, true);
     printCmpResult(NPOT, b, c);
 
     zeroArray(SIZE, c);
     printDesc("work-efficient scan, power-of-two");
-    StreamCompaction::Efficient::scan(SIZE, c, a);
+    StreamCompaction::Efficient::scan(SIZE, c, a,false);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
+    printArray(SIZE, c, true);
     printCmpResult(SIZE, b, c);
 
     zeroArray(SIZE, c);
     printDesc("work-efficient scan, non-power-of-two");
-    StreamCompaction::Efficient::scan(NPOT, c, a);
+    StreamCompaction::Efficient::scan(NPOT, c, a,false);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(NPOT, c, true);
+    printArray(NPOT, c, true);
     printCmpResult(NPOT, b, c);
 
     zeroArray(SIZE, c);
     printDesc("thrust scan, power-of-two");
     StreamCompaction::Thrust::scan(SIZE, c, a);
     printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
+    printArray(SIZE, c, true);
     printCmpResult(SIZE, b, c);
 
     zeroArray(SIZE, c);
     printDesc("thrust scan, non-power-of-two");
     StreamCompaction::Thrust::scan(NPOT, c, a);
     printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(NPOT, c, true);
+    printArray(NPOT, c, true);
     printCmpResult(NPOT, b, c);
 
     printf("\n");
@@ -137,14 +137,14 @@ int main(int argc, char* argv[]) {
     printDesc("work-efficient compact, power-of-two");
     count = StreamCompaction::Efficient::compact(SIZE, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(count, c, true);
+    printArray(count, c, true);
     printCmpLenResult(count, expectedCount, b, c);
 
     zeroArray(SIZE, c);
     printDesc("work-efficient compact, non-power-of-two");
     count = StreamCompaction::Efficient::compact(NPOT, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(count, c, true);
+    printArray(count, c, true);
     printCmpLenResult(count, expectedNPOT, b, c);
 
     system("pause"); // stop Win32 console from closing on exit

diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
@@ -22,8 +22,16 @@ namespace StreamCompaction {
          * Maps an array to an array of 0s and 1s for stream compaction. Elements
          * which map to 0 will be removed, and elements which map to 1 will be kept.
          */
-        __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
+        __global__ void kernMapToBoolean(int n, int *bools, const int *idata) 
+        {
             // TODO
+            //find each index
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n)
+            {
+                return;
+            }
+            bools[index] = idata[index] != 0 ? 1 : 0;
         }
 
         /**
@@ -33,6 +41,16 @@ namespace StreamCompaction {
         __global__ void kernScatter(int n, int *odata,
                 const int *idata, const int *bools, const int *indices) {
             // TODO
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n)
+            {
+                return;
+            }
+            if (bools[index] !=0)
+            {
+                int targetIdx = indices[index];
+                odata[targetIdx] = idata[index];
+            }
         }
 
     }

diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
@@ -15,11 +15,19 @@ namespace StreamCompaction {
         /**
          * CPU scan (prefix sum).
          * For performance analysis, this is supposed to be a simple for loop.
-         * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
+         * (Optional) For better understanding before starting moving to GPU,
+         you can simulate your GPU scan in this function first.
          */
         void scan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
             // TODO
+            odata[0] = 0;
+            for (int i = 1; i < n; i++)
+            {
+                odata[i] = odata[i-1] + idata[i-1];
+            }
+
+            //Why the last two digit different?
             timer().endCpuTimer();
         }
 
@@ -28,11 +36,22 @@ namespace StreamCompaction {
          *
          * @returns the number of elements remaining after compaction.
          */
+        //Well I don't know exactly the condition
+        //So I treat it as remove 0 I guess
         int compactWithoutScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
             // TODO
+            int j = 0;
+            for (int i = 0; i < n; i++)
+            {
+                if (idata[i] > 0)
+                {
+                    odata[j] = idata[i];
+                    j++;
+                }
+            }
             timer().endCpuTimer();
-            return -1;
+            return j;
         }
 
         /**
@@ -41,10 +60,42 @@ namespace StreamCompaction {
          * @returns the number of elements remaining after compaction.
          */
         int compactWithScan(int n, int *odata, const int *idata) {
+            int* boolArray = new int[n * sizeof(int)];
+            int* scanArray = new int[n * sizeof(int)];
             timer().startCpuTimer();
             // TODO
+            for (int i = 0; i < n; i++)
+            {
+                boolArray[i] = (idata[i] > 0) ? 1 : 0;
+            }
+            //Set temp array
+
+            //begin scan
+            //Inclusive scan
+            scanArray[0] = boolArray[0]; //identity
+            for (int i = 1; i < n; i++)
+            {
+                scanArray[i] = scanArray[i-1] + boolArray[i];
+            }
+            int elementNum = scanArray[n - 1];
+            //Shift to right
+            //Exclusive scan
+            for (int i = n; i > 0; i--)
+            {
+                scanArray[i] = scanArray[i - 1];
+            }
+            scanArray[0] = 0;
+            //Scatter
+            for (int i = 0; i < n; i++)
+            {
+                if (boolArray[i] > 0)
+                {
+                    odata[scanArray[i]]=idata[i];
+                }
+            }
             timer().endCpuTimer();
-            return -1;
+
+            return elementNum;
         }
     }
 }