Dataset Description - California Housing #4862

Open
s2t2 opened this issue Sep 24, 2024 · 1 comment

s2t2 commented Sep 24, 2024

Describe the current behavior
The "README.md" file in the "sample_data" directory of the Colab filesystem provides a link to more information about the California housing CSV files in that directory. However, that link is broken:

california_housing_data*.csv is California housing data from the 1990 US
Census; more information is available at: https://developers.google.com/machine-learning/crash-course/california-housing-data-description

Describe the expected behavior
Expect a written description of the dataset to be in the README file, or a working link to where to find this information.

What web browser you are using
Chrome

Additional context
There is a lot of information about a similar California housing dataset available from sklearn and tensorflow. However, that dataset is slightly different: it contains a column about occupants, and it expresses rooms and bedrooms as per-household averages instead of totals.

Given these differences, it isn't apparent whether the datasets are meant to be the same or, if they come from the same source, what transformations were applied to the original data. Any transformations should be documented.

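from sklearn.datasets import fetch_california_housing

# Downloads the dataset on first use and caches it locally.
dataset = fetch_california_housing()
print(type(dataset))
# DESCR holds the written description that the Colab README lacks.
print(dataset.DESCR)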
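To illustrate the kind of transformation that might relate the two datasets, here is a minimal sketch. It assumes the Colab CSV columns are named `total_rooms`, `total_bedrooms`, `population`, and `households`, and that the sklearn features `AveRooms`, `AveBedrms`, and `AveOccup` are per-household averages of those totals. This is a guess at the relationship, not documented lineage, and the helper name `per_household_features` and the sample row are made up for illustration.

```python
def per_household_features(row):
    """Derive per-household averages from block-level totals.

    Assumption: column names follow the Colab sample_data CSV header, and
    dividing by `households` approximates the sklearn-style features.
    """
    households = float(row["households"])
    return {
        "AveRooms": float(row["total_rooms"]) / households,
        "AveBedrms": float(row["total_bedrooms"]) / households,
        "AveOccup": float(row["population"]) / households,
    }


# Made-up illustrative row, not taken from the real file.
example_row = {
    "total_rooms": "1000",
    "total_bedrooms": "200",
    "population": "500",
    "households": "250",
}

features = per_household_features(example_row)
print(features)
```

If a conversion like this reproduced the sklearn values for matching rows, that would suggest a shared source, but the README should still state it explicitly.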

Alternatively, there is this Kaggle dataset, which more closely resembles the Colab dataset and says:

This data was initially featured in the following paper:
Pace, R. Kelley, and Ronald Barry. "Sparse spatial autoregressions." Statistics & Probability Letters 33.3 (1997): 291-297.

and I encountered it in 'Hands-On Machine learning with Scikit-Learn and TensorFlow' by Aurélien Géron.
Aurélien Géron wrote:
This dataset is a modified version of the California Housing dataset available from: Luís Torgo's page (University of Porto)

@s2t2 s2t2 added the bug label Sep 24, 2024
mayankmalik-colab (Contributor) commented

Thanks for letting us know. Tracking internally at b/369843963.
