water-dataset

The idea of this notebook is to create a small, simple classification dataset as an alternative to the well-known iris and penguins datasets.

The raw data is in the data folder and originates from the UK Environment Agency website. It is processed step by step in the water dataset.ipynb notebook, which explains what is done at each stage. The notebook also discusses the challenges encountered (e.g. the uncompressed size of the full dataset being over 20 GB) and how they were overcome, and explains other decisions, such as which classes to use and which determinands to include. The final small dataset is saved as a .csv file (water.csv) and as a pickle of the pandas DataFrame (water.pkl).

The end of the notebook contains some visualisations of the data, with interpretations of the plots, and some classifiers. This part is intentionally brief: the main purpose of the notebook is to show the ETL process rather than to analyse the data, which is left to the reader.
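For readers wondering how a raw file far larger than memory can be reduced at all, the following is a minimal sketch of one common approach: streaming the CSV in chunks with pandas and keeping only the rows of interest. It is not necessarily what the notebook does. The function name `filter_large_csv` and the `column` and `keep_values` parameters are hypothetical, and the real column names in the Environment Agency files should be checked against the notebook.

```python
import pandas as pd


def filter_large_csv(path, keep_values, column, chunksize=100_000):
    """Stream a large CSV and keep only rows whose `column` is in `keep_values`.

    `path` may be a file path or a file-like object. Only one chunk of
    `chunksize` rows is held in memory at a time, plus the rows kept so far.
    """
    kept = []
    for chunk in pd.read_csv(path, chunksize=chunksize):
        # Boolean mask: True where this row belongs to a wanted category.
        mask = chunk[column].isin(list(keep_values))
        kept.append(chunk[mask])
    if not kept:
        # No data rows at all: return an empty frame instead of failing.
        return pd.DataFrame()
    return pd.concat(kept, ignore_index=True)
```

The chunk size trades memory for speed: larger chunks mean fewer iterations but a higher peak memory use.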

The classification categories are various fresh water types (sewage water types and seawater types were excluded), as follows:

| Water type | Painting | Artist | Year |
|---|---|---|---|
| River water | A River Bank (The Seine at Asnières) | Georges Seurat | 1883 |
| Canal water | A Regatta on the Grand Canal | Canaletto | 1740 |
| Lake water | Lakeside Landscape | Pierre-Auguste Renoir | 1889 |
| Groundwater | At the Well | Edward Bird | c. 1800 |
| Estuary water | Thames Painting - The Estuary | Michael Andrews | 1995 |

(Groundwater is the water present beneath Earth's surface in rock and soil pore spaces and in the fractures of rock formations, and is often withdrawn via wells.)

The columns are an index column (which can be deleted), the area from which the sample originated, the label of the exact sampling point location, the year over which the observation was averaged, and then the 6 observation features which are the basis for classification:

Chloride [mg/l]
Nitrite as N [mg/l]
Nitrate as N [mg/l]
Oxygen, Dissolved, % Saturation [%]
pH [phunits]
Temperature of Water [cel].
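As a starting point for working with the final dataset, here is a minimal sketch of loading water.csv and separating the six features from the class labels. It assumes that the index column is the first column in the file and that the CSV headers match the feature names listed above exactly. The helper names `load_water` and `split_features` are hypothetical, and because this README does not state the name of the water-type column, it is left as a required argument.

```python
import pandas as pd

# Feature names as listed above; assumed to match the CSV headers exactly.
FEATURES = [
    "Chloride [mg/l]",
    "Nitrite as N [mg/l]",
    "Nitrate as N [mg/l]",
    "Oxygen, Dissolved, % Saturation [%]",
    "pH [phunits]",
    "Temperature of Water [cel]",
]


def load_water(path="water.csv"):
    """Read the dataset, using the first column as the (disposable) index."""
    # index_col=0 is an assumption: the index column is taken to come first.
    return pd.read_csv(path, index_col=0)


def split_features(df, label_column):
    """Return (X, y): the six feature columns and the class label column."""
    missing = [c for c in FEATURES + [label_column] if c not in df.columns]
    if missing:
        raise KeyError(f"columns not found in DataFrame: {missing}")
    return df[FEATURES], df[label_column]
```

Calling `load_water().columns.tolist()` first shows the actual headers, which identifies the column to pass as `label_column`. The pickle can be read with `pd.read_pickle("water.pkl")` instead, which avoids re-parsing the CSV.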
