California Housing Prices
- dataset_hub.regression.datasets.get_housing(verbose=None)
Load and return the California Housing dataset (regression).
Median house prices for California districts derived from the 1990 census.
This dataset is intended for predicting median housing values at the block level, reflecting broader economic and social patterns rather than individual home prices. Each record summarizes features of a block, such as population, total rooms, and median income, making it suitable for regional-level regression tasks.
Original dataset: This dataset was used in Aurélien Géron’s book ‘Hands-On Machine Learning with Scikit-Learn and TensorFlow’. It is published as "California Housing" on Kaggle.
Columns:
longitude (float): how far west a block is; values are negative in this dataset, so a more negative value is farther west
latitude (float): how far north a block is; higher is farther north
housing_median_age (float): median age of a house within a block; lower is newer
total_rooms (int): total number of rooms within a block
total_bedrooms (int): total number of bedrooms within a block
population (int): total number of people residing within a block
households (int): total number of households within a block
median_income (float): median income for households within a block, in tens of thousands of USD
ocean_proximity (str): location of the block with respect to the ocean/sea
median_house_value 🚩 (float): median house value in USD (target)
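Because total_rooms, total_bedrooms, and population are totals per block, their raw values mostly track block size. One common way to make blocks comparable is to derive per-household ratios. The sketch below is only an illustration: the three rows are copied from the sample output in the Baseline section of this page, while the function name and the new column names are hypothetical and not part of dataset_hub.

```python
import pandas as pd

# First three rows of the sample output shown in the Baseline section.
sample = pd.DataFrame(
    {
        "total_rooms": [880.0, 7099.0, 1467.0],
        "total_bedrooms": [129.0, 1106.0, 190.0],
        "population": [322.0, 2401.0, 496.0],
        "households": [126.0, 1138.0, 177.0],
    }
)


def add_per_household_features(df):
    """Return a copy of df with ratio columns (hypothetical names) added."""
    out = df.copy()
    # Average number of rooms per household in the block.
    out["rooms_per_household"] = out["total_rooms"] / out["households"]
    # Share of rooms that are bedrooms.
    out["bedrooms_per_room"] = out["total_bedrooms"] / out["total_rooms"]
    # Average household size.
    out["population_per_household"] = out["population"] / out["households"]
    return out


features = add_per_household_features(sample)
print(features.round(2))
```

The function works on a copy, so the DataFrame returned by get_housing() would stay unchanged if it were passed in instead of the sample.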
- Parameters:
verbose (bool, optional) – If True, the function prints a link to the dataset documentation (for example, this page) in the log output after loading. Default is None, which uses the global Library Settings.
- Returns:
The California Housing dataset with all features including the target.
- Return type:
pandas.DataFrame
Quick Start:
from dataset_hub.regression import get_housing
df = get_housing()
Baseline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from dataset_hub.regression import get_housing
# Get housing dataset
df = get_housing()
df.head()
Dataset info & details: https://getdataset.github.io/dataset-hub/datasets/regression/california_housing.html
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
# Separate target variable (y) and features (X)
y = df["median_house_value"]
X = df.drop("median_house_value", axis=1)
# Drop categorical columns for simplicity (you can preprocess them yourself)
X = X.select_dtypes(include=["int64", "float64"])
# Fill missing numeric values
for col in X.columns:
    X[col] = X[col].fillna(X[col].median())
# Split data into train and test parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Calculate RMSE
rmse = mean_squared_error(y_test, y_pred)**0.5
print("RMSE:", round(rmse, 2))
RMSE: 71133.17
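The baseline above drops ocean_proximity and fills missing values with medians computed on the full dataset before splitting. A possible variation, sketched here under assumptions, keeps the categorical column via one-hot encoding and fits the median imputer on the training split only by wrapping both steps in a scikit-learn Pipeline. To stay self-contained, the example runs on a small synthetic table (not real census data) that reuses a few of the dataset's column names; the helper names make_toy_housing and fit_and_score are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def make_toy_housing(n_rows=200, seed=0):
    """Build a synthetic stand-in table (NOT the real dataset)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(
        {
            "median_income": rng.uniform(1.0, 10.0, n_rows),
            "total_bedrooms": rng.uniform(100.0, 1000.0, n_rows),
            "ocean_proximity": rng.choice(["NEAR BAY", "INLAND"], n_rows),
        }
    )
    # Blank out every tenth value to mimic missing entries.
    df.loc[df.index[::10], "total_bedrooms"] = np.nan
    noise = rng.normal(0.0, 20000.0, n_rows)
    df["median_house_value"] = 40000.0 * df["median_income"] + noise
    return df


def fit_and_score(df, target="median_house_value"):
    """Fit imputation, encoding, and regression on the train split only."""
    y = df[target]
    X = df.drop(columns=[target])
    numeric_cols = X.select_dtypes(include="number").columns.tolist()
    categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

    preprocess = ColumnTransformer(
        [
            ("num", SimpleImputer(strategy="median"), numeric_cols),
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ]
    )
    model = Pipeline(
        [("preprocess", preprocess), ("regressor", LinearRegression())]
    )

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    model.fit(X_train, y_train)  # medians and categories come from X_train
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    return model, rmse


toy_df = make_toy_housing()
model, rmse = fit_and_score(toy_df)
print("Toy RMSE:", round(rmse, 2))
```

Passing the DataFrame returned by get_housing() to fit_and_score should work the same way, although the resulting RMSE will differ from both the toy value and the baseline figure above.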