California Housing Prices

dataset_hub.regression.datasets.get_housing(verbose=None)

Load and return the California Housing dataset (regression).

Median house prices for California districts derived from the 1990 census.

This dataset is intended for predicting median housing values at the block level, reflecting broader economic and social patterns rather than individual home prices. Each record summarizes features of a block, such as population, total rooms, and median income, making it suitable for regional-level regression tasks.

Original dataset: California Housing on Kaggle. This dataset was used in Aurélien Géron’s book ‘Hands-On Machine Learning with Scikit-Learn and TensorFlow’.

Columns:

longitude (float): a measure of how far west a block is; values are negative for California, and more negative is farther west
latitude (float): a measure of how far north a block is; higher is farther north
housing_median_age (float): median age of the houses within a block; lower is newer
total_rooms (float): total number of rooms within a block
total_bedrooms (float): total number of bedrooms within a block; may contain missing values
population (float): total number of people residing within a block
households (float): total number of households within a block
median_income (float): median income for households in tens of thousands of USD
ocean_proximity (str): location of the block with respect to the ocean/sea
median_house_value 🚩 (float): median house value in USD
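
Because most columns are block-level totals, they are often easier to compare once normalized by the number of households. The sketch below illustrates this on the first three rows of the preview shown in the Baseline section. The derived column names (rooms_per_household, bedrooms_per_room, people_per_household, median_income_usd) are hypothetical and are not part of the dataset or the library.

```python
import pandas as pd

# First three rows of the dataset, copied from the preview in the Baseline section.
sample = pd.DataFrame(
    {
        "total_rooms": [880.0, 7099.0, 1467.0],
        "total_bedrooms": [129.0, 1106.0, 190.0],
        "population": [322.0, 2401.0, 496.0],
        "households": [126.0, 1138.0, 177.0],
        "median_income": [8.3252, 8.3014, 7.2574],
    }
)

# Hypothetical derived columns (illustration only, not provided by get_housing).
features = sample.assign(
    rooms_per_household=sample["total_rooms"] / sample["households"],
    bedrooms_per_room=sample["total_bedrooms"] / sample["total_rooms"],
    people_per_household=sample["population"] / sample["households"],
    # median_income is stored in tens of thousands of USD.
    median_income_usd=sample["median_income"] * 10_000,
)

print(features.round(2))
```

Ratios like these describe the typical household in a block rather than the size of the block, which may suit a linear baseline better than raw totals.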

Parameters: verbose (bool, optional) – If True, the function logs a link to the dataset documentation (this page) after loading. Default is None, which falls back to the global Library Settings.
Returns: The California Housing dataset with all features, including the target.
Return type: pandas.DataFrame

Quick Start:

from dataset_hub.regression import get_housing

df = get_housing()

Baseline:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

from dataset_hub.regression import get_housing

# Get housing dataset
df = get_housing()
df.head()

Dataset info & details: https://getdataset.github.io/dataset-hub/datasets/regression/california_housing.html

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	ocean_proximity
0	-122.23	37.88	41.0	880.0	129.0	322.0	126.0	8.3252	452600.0	NEAR BAY
1	-122.22	37.86	21.0	7099.0	1106.0	2401.0	1138.0	8.3014	358500.0	NEAR BAY
2	-122.24	37.85	52.0	1467.0	190.0	496.0	177.0	7.2574	352100.0	NEAR BAY
3	-122.25	37.85	52.0	1274.0	235.0	558.0	219.0	5.6431	341300.0	NEAR BAY
4	-122.25	37.85	52.0	1627.0	280.0	565.0	259.0	3.8462	342200.0	NEAR BAY

# Separate target variable (y) and features (X)
y = df["median_house_value"]
X = df.drop("median_house_value", axis=1)

# Drop categorical columns for simplicity (you can preprocess them yourself)
X = X.select_dtypes(include=["int64", "float64"])

# Fill missing numeric values with each column's median
X = X.fillna(X.median())

# Split data into train and test parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate RMSE (square root of the mean squared error, in USD)
rmse = mean_squared_error(y_test, y_pred) ** 0.5
print("RMSE:", round(rmse, 2))

RMSE: 71133.17
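
The baseline drops ocean_proximity for simplicity. One possible way to keep it is one-hot encoding, sketched below with pandas.get_dummies on a toy frame. The toy values are made up for illustration: "NEAR BAY" appears in the preview above, while "INLAND" and the income figures in the last two rows are assumptions, not rows copied from the dataset.

```python
import pandas as pd

# Toy frame for illustration only; the last two rows are made-up values.
toy = pd.DataFrame(
    {
        "median_income": [8.3252, 7.2574, 3.0, 2.5],
        "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
    }
)

# Replace the string column with one 0/1 indicator column per category.
encoded = pd.get_dummies(toy, columns=["ocean_proximity"], dtype=float)

print(encoded.columns.tolist())
```

Applied to the full frame before the numeric-column selection step, the indicator columns would pass through select_dtypes and reach the model. Whether this improves the RMSE is not claimed here.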