create a simple classification dataset as an alternative to iris and penguins - water types (river, lake etc)

Gabriel-Kissin/water-dataset

water-dataset

The idea of this notebook is to create a small, simple classification dataset as an alternative to the well-known iris and penguins datasets.

The raw data is in the data folder and originates from the UK Environment Agency website. It is processed step by step in the water dataset.ipynb notebook, with explanations of what is done at each step. The notebook also discusses the challenges encountered (e.g. the uncompressed size of the full dataset being over 20 GB) and how they were overcome, and explains the decisions made, such as which classes to use and which determinands to include. The final small dataset is saved as a .csv file (water.csv) and as a pickle of the pandas DataFrame (water.pkl). The end of the notebook contains some visualisation of the data, with interpretations of the plots, and some classifiers. This part is intentionally brief: the main purpose of the notebook is to show the ETL process rather than to analyse the data, which is left to the reader.
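The notebook documents how the 20 GB of raw data is actually handled. As an independent illustration only, one common way to reduce a file that does not fit in memory is to stream it row by row and keep running sums. The sketch below does this with the Python standard library, computing a mean per (sampling point, year, determinand). The function name `yearly_means` and all column names (`point`, `year`, `determinand`, `result`) are hypothetical placeholders, not the Environment Agency's actual headers, and this is not claimed to be the notebook's method.

```python
import csv
from collections import defaultdict


def yearly_means(path, keep, point_col="point", year_col="year",
                 det_col="determinand", value_col="result"):
    """Stream a long-format CSV and average results per (point, year, determinand).

    Only determinands listed in `keep` are retained. The column names are
    hypothetical defaults; pass the real ones for your file.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):  # only one row is held in memory at a time
            if row[det_col] not in keep:
                continue
            try:
                value = float(row[value_col])
            except (TypeError, ValueError):
                continue  # skip blank or non-numeric results
            key = (row[point_col], row[year_col], row[det_col])
            sums[key] += value
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}
```

Memory use here grows with the number of distinct (point, year, determinand) keys rather than with the number of rows, which is what makes this approach workable on very large inputs.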

The classification categories are various fresh water types (sewage water types and seawater types were excluded), as follows:

| Water type | Painting | Artist | Year |
| --- | --- | --- | --- |
| River water | A River Bank (The Seine at Asnières) | Georges Seurat | 1883 |
| Canal water | A Regatta on the Grand Canal | Canaletto | 1740 |
| Lake water | Lakeside Landscape | Pierre-Auguste Renoir | 1889 |
| Groundwater | At the Well | Edward Bird | c. 1800 |
| Estuary water | Thames Painting - The Estuary | Michael Andrews | 1995 |

(Groundwater is the water present beneath Earth's surface in rock and soil pore spaces and in the fractures of rock formations, and is often withdrawn via wells.)

The columns are: an index column (which can be deleted), the area from which the sample originated, the label of the exact sampling point location, the year over which the observation was averaged, and the six observation features which form the basis for classification:

  • Chloride [mg/l]
  • Nitrite as N [mg/l]
  • Nitrate as N [mg/l]
  • Oxygen, Dissolved, % Saturation [%]
  • pH [phunits]
  • Temperature of Water [cel].
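As a rough sketch of how the file might be consumed, the snippet below reads a CSV with the standard library and separates the six features from the class label. The column names are assumptions: the feature headers are copied verbatim from the list above, and `"water type"` is a hypothetical name for the label column, so check the actual header of water.csv before relying on them. The helper name `load_xy` is likewise illustrative.

```python
import csv

# Assumed header names, copied from the feature list above; verify against water.csv.
FEATURES = [
    "Chloride [mg/l]",
    "Nitrite as N [mg/l]",
    "Nitrate as N [mg/l]",
    "Oxygen, Dissolved, % Saturation [%]",
    "pH [phunits]",
    "Temperature of Water [cel]",
]
LABEL = "water type"  # hypothetical name for the class column


def load_xy(path, features=FEATURES, label=LABEL):
    """Return (X, y): X is a list of float feature rows, y the class labels.

    Rows with a blank or non-numeric feature value are skipped.
    A missing column raises KeyError so that a wrong header is noticed.
    """
    X, y = [], []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            try:
                values = [float(row[name]) for name in features]
            except (TypeError, ValueError):
                continue  # incomplete row: leave it out
            X.append(values)
            y.append(row[label])
    return X, y
```

The resulting `X` and `y` lists can be passed to most classifier libraries; pandas users could load the same file with `pandas.read_csv` instead.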
