Tools for QA/QC checks and fixes for CSV files
Data State | Input | Output | Procedure |
---|---|---|---|
Level 0 | Data acquisition | Raw data | N/A |
Level 1 | - Primary data as CSV(s): filename.csv<br>- Configuration file: default.conf<br>Optionally:<br>- Source characteristics file<br>- Alternate data source file(s): filename_alt.csv | - Level 1 data: filename_Level1.csv<br>- QA mask file: filename_Level1_qa.csv<br>- Deviation file: filename_Level1_dev.csv<br>- Log file | Level0to1.py |
Level 2 | - Level 1 data: filename_Level1.csv<br>- Configuration file: default.conf<br>- Reviewed QA mask file: filename_Level2_mask.csv<br>- Deviation file: filename_Level1_dev.csv | - Processed, combined* data: filename_Level2.csv<br>- Data aggregated to additional intervals: filename_Level2_avg_interval.csv | Level1to2.py |
Level 3 (Aggregate to Intervals Only) | - Level 3 data: filename_Level3.csv<br>- Configuration file: default.conf | - Data aggregated to additional intervals: filename_Level3_agg_avg_interval.csv | Level3_agg_only.py |

*(If alternate source included)
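The naming pattern in the table can be summarized with a small helper. This is purely illustrative (it is not part of the scripts) and assumes the base name is the primary CSV's filename without its extension; "interval" stands in for each configured aggregation interval.

```python
from pathlib import Path

def level_outputs(primary_csv: str) -> dict:
    """Illustrative only: map a primary CSV name to the output names listed above."""
    base = Path(primary_csv).stem  # e.g. "filename" from "filename.csv"
    return {
        "level1_data": f"{base}_Level1.csv",
        "level1_qa_mask": f"{base}_Level1_qa.csv",
        "level1_deviation": f"{base}_Level1_dev.csv",
        "level2_reviewed_mask": f"{base}_Level2_mask.csv",
        "level2_data": f"{base}_Level2.csv",
        # "interval" stands in for each aggregation interval used
        "level2_aggregated": f"{base}_Level2_avg_interval.csv",
        "level3_aggregated": f"{base}_Level3_agg_avg_interval.csv",
    }

print(level_outputs("filename.csv"))
```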
- Prepare Sources
  - Prepare source_characteristics.csv
  - Format any alternate data to have the same column names as the primary source, with "_alt" appended to the filename (see the sketch after this step)
  - Create a copy of default.conf for the specific data run and make any necessary changes
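For the alternate-source formatting above, a minimal pandas sketch is shown below. The alternate file name and the column mapping are hypothetical; the only requirements from this step are that the alternate file uses the same column names as the primary source and that "_alt" is appended to the filename.

```python
import pandas as pd

# Hypothetical file and column names; adjust to the actual sources.
primary = pd.read_csv("filename.csv")
alt = pd.read_csv("other_station_export.csv")

# Rename alternate-source columns so they match the primary source exactly.
alt = alt.rename(columns={"TIMESTAMP": "Timestamp", "Air_Temperature": "AirT"})

# Keep only the columns that also exist in the primary source, in the same order.
alt = alt[[c for c in primary.columns if c in alt.columns]]

# Save with "_alt" appended to the primary file's name.
alt.to_csv("filename_alt.csv", index=False)
```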
- Run Level0to1.py
- Review Output
  - filename_Level1_qa_stats.csv is useful for determining where to look for issues, by examining the QA mask counts and statistics for each field.
  - filename_Level1_dev.csv is useful for comparing the primary and alternate sources.
  - filename_Level1_qa.csv should be modified to match the intended QA actions. This should include replacing any combination flags.
  - filename_Level1.csv is always a good reference for checking what data raised flags.
- Run Level1to2.py
  - If an alternate data source uses a larger sample interval than the primary source, values will be back-filled (see the sketch after this step). An alternate source with a shorter sample interval than the primary source may have issues (this has not been tested).
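The back-fill behavior can be illustrated with pandas. This is only a sketch of the concept, not the scripts' actual implementation; the series values and intervals are made up.

```python
import pandas as pd

# Primary source sampled every minute, alternate source every 5 minutes (hypothetical values).
primary_index = pd.date_range("2020-01-01 00:00", periods=6, freq="1min")
alt = pd.Series(
    [10.0, 12.0],
    index=pd.date_range("2020-01-01 00:00", periods=2, freq="5min"),
    name="AirT",
)

# Reindex the coarser alternate series onto the primary timestamps and back-fill,
# so each primary timestamp takes the next available alternate value.
alt_on_primary = alt.reindex(primary_index).bfill()
print(alt_on_primary)
```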
- If needed, manually review the output and run Level3_agg_only.py to run only the data aggregation at additional intervals. Note that Level3_agg_only.py is effectively the same as running only the last step of Level1to2.py. If column precisions are enforced in both Level 2 and Level 3, rounding errors may occur (see the example below).
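The rounding caveat can be seen in a small, self-contained example: averaging values that were already rounded (as Level 2 would do with enforced precision) and rounding the result again can differ from averaging the unrounded values. The numbers below are arbitrary.

```python
# Arbitrary values, rounded to 2 decimal places.
raw = [0.014, 0.014, 0.024]

# Round at "Level 2", aggregate, then round again at "Level 3".
mean_of_rounded = round(sum(round(v, 2) for v in raw) / len(raw), 2)  # 0.01

# Aggregate the unrounded values and round once.
mean_of_raw = round(sum(raw) / len(raw), 2)                           # 0.02

print(mean_of_rounded, mean_of_raw)
```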
Warning: These scripts WILL overwrite files of the same name in the output directory. If an existing file is locked, it will append an integer to the end of the file name.
Other Notes:
- In the future, the functions used in these scripts should be made into a proper object-oriented class. This would also reduce the number of variables that need to be passed around globally by letting them exist privately in the class.
- These scripts are not optimized for memory usage and should be expected to use more than twice as much memory as the size of the original data, possibly much more depending on the resampling intervals. They are also not optimized for multi-threaded processing; an obvious improvement would be to run the resampling of different intervals in parallel (a sketch follows these notes).
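As a sketch of that parallelization idea only (it is not part of the current scripts), resampling to different intervals could be dispatched to a process pool; the column name, intervals, and data below are hypothetical.

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import pandas as pd

def resample_mean(df: pd.DataFrame, interval: str) -> pd.DataFrame:
    """Average a timestamp-indexed frame over one resampling interval."""
    return df.resample(interval).mean()

if __name__ == "__main__":
    # Hypothetical one day of minute data with a datetime index.
    index = pd.date_range("2020-01-01", periods=1440, freq="1min")
    data = pd.DataFrame({"AirT": np.random.rand(1440)}, index=index)

    intervals = ["5min", "30min", "60min"]
    with ProcessPoolExecutor() as pool:
        resampled = dict(zip(intervals, pool.map(resample_mean, [data] * len(intervals), intervals)))

    for interval, frame in resampled.items():
        print(interval, len(frame))
```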
Known Issues:
- If the timestamp column isn't first, resampling may fail.
- The index column is not excluded from correlation calculations; this raises a warning but doesn't affect the output.
After running Level0to1.py, the output should be manually reviewed in preparation for running Level1to2.py. Specifically, the QA mask file should be updated to reflect the desired changes for Level 2.
The QA mask file contains one of the following flags for each data point, which indicates the corresponding action to be taken in Level1to2.py.
Value | Data Status | Action |
---|---|---|
0 | Good | Keep as-is |
1 | Out of date/time range | Remove |
2 | Out of source characteristic range | Replace with NaN value |
3 | 1 + 2 | Remove |
4 | Significant deviation from alternate source | Replace with alternate source value |
5 | 1 + 4 | Remove |
6 | 2 + 4 | Replace with alternate source value |
7 | 1 + 2 + 4 | Remove |
8 | No value/NaN | Replace/leave as NaN value |
9 | 1 + 8 | Remove |
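The combination flags (3, 5, 6, 7, and 9) are sums of the base flags 1, 2, 4, and 8, so a combined value can be broken down into its components when deciding what action to substitute. The helper below is only an illustration to aid review; it is not part of the scripts.

```python
# Base QA flags from the table above; combination flags are sums of these.
BASE_FLAGS = {
    1: "out of date/time range",
    2: "out of source characteristic range",
    4: "significant deviation from alternate source",
    8: "no value/NaN",
}

def decode_flag(flag: int) -> list:
    """Break a (possibly combined) QA flag into its component descriptions."""
    if flag == 0:
        return ["good"]
    return [desc for base, desc in BASE_FLAGS.items() if flag & base]

print(decode_flag(6))  # ['out of source characteristic range', 'significant deviation from alternate source']
print(decode_flag(9))  # ['out of date/time range', 'no value/NaN']
```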
A QA overview plot will be generated showing where and which variables are affected by different flags. This overview is created by taking the mode of the QA flags, aggregated by hour, for each variable. (Example shown below.)
An easy way to check and modify the flagged areas is to open the mask file in Excel and filter by the column and flag of interest. Affected values can then easily be selected and changed in bulk. The modified file should be saved as filename_Level2_mask.csv. Be sure that the Timestamp format in Excel is set to "yyyy-mm-dd hh:mm:ss" before saving. Combination flags have no effect on the output and should be manually replaced to reflect the intended action.
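As a scripted alternative to Excel, the same bulk replacement can be done with pandas. The sketch below assumes the mask file's timestamp column is named "Timestamp"; the column "AirT" and the choice of replacing flag 6 with 4 are examples only, and the intended action for each combination flag should be decided case by case.

```python
import pandas as pd

# The timestamp column name is assumed here; use whatever the mask file actually contains.
mask = pd.read_csv("filename_Level1_qa.csv", parse_dates=["Timestamp"])

# Example only: treat combination flag 6 (2 + 4) on "AirT" as "replace with alternate source value" (4).
mask.loc[mask["AirT"] == 6, "AirT"] = 4

# Save as the reviewed Level 2 mask, keeping the expected timestamp format.
mask.to_csv("filename_Level2_mask.csv", index=False, date_format="%Y-%m-%d %H:%M:%S")
```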
The source characteristics file (source_characteristics.csv) should be a CSV with information about the data sources. Ideally, one row should be used for each type of measurement. The fields currently implemented are:
Field | Function | Format |
---|---|---|
"Min" | Minimum expected value, used to classify QA flag "2" | Float |
"Max" | Maximum expected value, used to classify QA flag "2" | Float |
"Useful Decimal Places" | Decimal places to use in the final output | Integer |
"Columns" | Determines which source column(s) the parameters in the row should apply to | String(s), separated by ';' for multiple |
"Start", Timestamp (in first and second columns of any row) | Starting date/time of dataset, used to classify QA flag "1" | String, Timestamp |
"End", Timestamp (in first and second columns of any row) | Ending date/time of dataset, used to classify QA flag "1" | String, Timestamp |
Additional columns can be added for reference without any effect.
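For illustration, a source_characteristics.csv might look like the sketch below, assuming the fields above are the CSV's columns, one row per measurement, and "Start"/"End" placed in the first two columns of their own rows; all names, ranges, and dates are hypothetical.

```python
# Hypothetical layout and values; adjust field names, ranges, and dates to the real sources.
example = """Min,Max,Useful Decimal Places,Columns
-40,60,2,AirT;AirT_alt
0,100,1,RH
Start,2020-01-01 00:00:00,,
End,2020-12-31 23:59:59,,
"""

with open("source_characteristics.csv", "w") as f:
    f.write(example)
```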
Level0to1.py: See the parameters defined in the configuration file (default.conf) for details. The output of this script should be manually reviewed before being used for Level 2.
Level1to2.py: See the parameters defined in the configuration file (default.conf) for details.
Level3_agg_only.py: See the parameters defined in the script for details.
"Chamber1_2020_sample_1Min.dat" is a modified snippet of a data file, intentionally modified to have issues. The other files are the output generated by csv_avg.py using the default configuration.
File | Description |
---|---|
Met Station_Min1.dat* | Raw data as CSV |
Met Station_Min1_alt.dat* | Alternate data source for "AirT" column |
source_characteristics.csv | Source characteristics file |
*Not yet uploaded