California Housing Prices

dataset_hub.regression.datasets.get_housing(verbose=None)

Load and return the California Housing dataset (regression).

Median house prices for California districts derived from the 1990 census.

This dataset is intended for predicting median housing values at the block level, reflecting broader economic and social patterns rather than individual home prices. Each record summarizes features of a block, such as population, total rooms, and median income, making it suitable for regional-level regression tasks.

Original dataset: California Housing on Kaggle. This version of the data was used in Aurélien Géron’s book ‘Hands-On Machine Learning with Scikit-Learn and TensorFlow’.

Columns:

  • longitude (float): how far west a block is; values are negative, and a more negative value is farther west

  • latitude (float): how far north a block is; higher is farther north

  • housing_median_age (float): median age of the houses within a block; lower is newer

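  • total_rooms (float): total number of rooms within a block (whole-number count stored as float)

  • total_bedrooms (float): total number of bedrooms within a block (whole-number count stored as float)

  • population (float): total number of people residing within a block (whole-number count stored as float)

  • households (float): total number of households within a block (whole-number count stored as float)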

  • median_income (float): median income for households in tens of thousands of USD

  • ocean_proximity (str): location of the block relative to the ocean or bay (for example, NEAR BAY)

  • median_house_value 🚩 (float): median house value in USD
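The column descriptions above suggest a few simple derived features. The sketch below is illustrative only: the helper name add_per_household_features and the new column names are assumptions, not part of dataset_hub. It uses the first three rows shown in the Baseline output on this page, so it runs without loading the full dataset.

```python
import pandas as pd

# First three rows from the df.head() output in the Baseline section
sample = pd.DataFrame(
    {
        "total_rooms": [880.0, 7099.0, 1467.0],
        "total_bedrooms": [129.0, 1106.0, 190.0],
        "population": [322.0, 2401.0, 496.0],
        "households": [126.0, 1138.0, 177.0],
        "median_income": [8.3252, 8.3014, 7.2574],
    }
)


def add_per_household_features(df):
    """Return a copy of df with a few derived columns (hypothetical helper)."""
    out = df.copy()
    # median_income is expressed in tens of thousands of USD
    out["median_income_usd"] = out["median_income"] * 10_000
    # Block totals become easier to compare once divided by households
    out["rooms_per_household"] = out["total_rooms"] / out["households"]
    out["people_per_household"] = out["population"] / out["households"]
    return out


enriched = add_per_household_features(sample)
print(enriched[["median_income_usd", "rooms_per_household"]].round(2))
```

Block totals such as total_rooms grow with the size of the block, so ratios per household may be more informative to a linear model than the raw totals. Whether they help on this dataset is something to check empirically.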

Parameters:

verbose (bool, optional) – If True, the function prints a link to the dataset documentation (for example, this page) in the log output after loading. Default is None, which uses the global Library Settings.

Returns:

The California Housing dataset with all features including the target.

Return type:

pandas.DataFrame

Quick Start:

from dataset_hub.regression import get_housing

df = get_housing()

Baseline

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

from dataset_hub.regression import get_housing

# Get housing dataset
df = get_housing()
df.head()
Dataset info & details: https://getdataset.github.io/dataset-hub/datasets/regression/california_housing.html
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value  ocean_proximity
0    -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0         NEAR BAY
1    -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0         NEAR BAY
2    -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0         NEAR BAY
3    -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0         NEAR BAY
4    -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462            342200.0         NEAR BAY
# Separate target variable (y) and features (X)
y = df["median_house_value"]
X = df.drop("median_house_value", axis=1)

# Keep numeric columns only for simplicity
# (this drops ocean_proximity; you can encode it yourself)
X = X.select_dtypes(include="number")

# Fill missing numeric values with each column's median
# (reassigning avoids pandas warnings about writing into a slice)
X = X.fillna(X.median())

# Split data into train and test parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate RMSE
rmse = mean_squared_error(y_test, y_pred)**0.5
print("RMSE:", round(rmse, 2))
RMSE: 71133.17
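
The baseline above drops ocean_proximity and computes the imputation medians before the train/test split. One possible alternative, sketched here with standard scikit-learn components, puts median imputation and one-hot encoding inside a pipeline so that both are learned from whatever data the pipeline is fitted on. The helper name build_baseline and the toy rows are made up for illustration; they are not part of dataset_hub and are not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def build_baseline(numeric_cols, categorical_cols):
    """Linear regression with median imputation and one-hot encoding (hypothetical helper)."""
    preprocess = ColumnTransformer(
        transformers=[
            # Replace missing numeric values with the column median
            ("num", SimpleImputer(strategy="median"), numeric_cols),
            # Encode categories; categories unseen during fit become all zeros
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ]
    )
    return Pipeline([("preprocess", preprocess), ("model", LinearRegression())])


# Toy rows, made up for illustration only (not real dataset records)
toy = pd.DataFrame(
    {
        "median_income": [8.3, 7.2, np.nan, 3.8, 2.1, 4.5],
        "households": [126.0, 177.0, 219.0, 259.0, 300.0, 150.0],
        "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND", "INLAND", "NEAR BAY"],
        "median_house_value": [452600.0, 352100.0, 341300.0, 342200.0, 120000.0, 250000.0],
    }
)

X_toy = toy.drop(columns="median_house_value")
y_toy = toy["median_house_value"]

pipeline = build_baseline(["median_income", "households"], ["ocean_proximity"])
pipeline.fit(X_toy, y_toy)
preds = pipeline.predict(X_toy)
print(preds.shape)
```

With the real data, the same pipeline could be fitted on X_train and evaluated on X_test, passing the numeric column names and ["ocean_proximity"] to build_baseline. The resulting RMSE is not shown here because it has not been measured for this page.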