Titanic
- dataset_hub.classification.datasets.get_titanic(verbose=None)[source]
Load and return the Titanic dataset (classification).
A classic binary classification dataset containing information about passengers aboard the Titanic, including demographic and ticket-related features and survival outcome.
Original dataset: Kaggle Titanic
Columns:
- `pclass` (int): passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)
- `name` (str): full name of the passenger
- `sex` (str): passenger gender
- `age` (float): passenger age in years; may contain missing values
- `fare` (float): ticket fare; may contain missing values
- `sibsp` (int): number of siblings/spouses aboard
- `parch` (int): number of parents/children aboard
- `survived` 🚩 (int): target variable, 1 if survived, 0 otherwise
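Since `age` and `fare` may contain missing values, it is worth checking for gaps before modeling. A minimal sketch using plain pandas; the small DataFrame below is only a stand-in for the frame returned by `get_titanic()`:

```python
import pandas as pd

# Stand-in for the loaded Titanic DataFrame (the real one comes from get_titanic())
df = pd.DataFrame({
    "age": [22.0, None, 26.0, 35.0],
    "fare": [7.25, 71.2833, None, 53.1],
    "survived": [0, 1, 1, 1],
})

# Count missing values per column
missing = df.isna().sum()
print(missing)
```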
- Parameters:
verbose (bool, optional) – If True, the function prints a link to the dataset documentation (e.g., this page) in the log output after loading. Default is None, which falls back to the global Library Settings.
- Returns:
The Titanic dataset with all features including the target.
- Return type:
pandas.DataFrame
Quick Start:
```python
from dataset_hub.classification import get_titanic

df = get_titanic()
```
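Before training a model, it can help to check the class balance of the target. A sketch using pandas `value_counts`, shown here on a toy Series standing in for `df["survived"]`:

```python
import pandas as pd

# Toy stand-in for df["survived"]
survived = pd.Series([0, 1, 1, 0, 0, 0, 1, 0])

# Fraction of passengers in each class of the target
balance = survived.value_counts(normalize=True)
print(balance)
```

On an imbalanced target this motivates the `stratify=y` argument used in the baseline's train/test split.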
Baseline
```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

from dataset_hub.classification import get_titanic

# Load the Titanic dataset
df = get_titanic()
df.head()
```
Dataset info & details: https://getdataset.github.io/dataset-hub/datasets/classification/titanic.html
|   | survived | pclass | name | sex | age | fare | sibsp | parch |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 7.2500 | 1 | 0 |
| 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 71.2833 | 1 | 0 |
| 2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 7.9250 | 0 | 0 |
| 3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 53.1000 | 1 | 0 |
| 4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 8.0500 | 0 | 0 |
```python
# Separate target variable (y) and features (X)
y = df["survived"]
X = df.drop("survived", axis=1)

# Drop categorical columns for simplicity (you can preprocess them yourself)
X = X.select_dtypes(include=["int64", "float64"])

# Fill missing numeric values with the column median
for col in X.columns:
    X[col] = X[col].fillna(X[col].median())

# Split data into train and test parts, preserving class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Create and train the model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Make predictions on the held-out test set
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", round(accuracy, 3))
```
Accuracy: 0.706
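The baseline above drops the categorical columns, but `sex` is a strong predictor on Titanic, and encoding it typically improves accuracy. A sketch of that preprocessing step with `pandas.get_dummies`, shown on a toy frame standing in for the feature matrix:

```python
import pandas as pd

# Toy stand-in with one categorical and one numeric column
X = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "age": [22.0, 38.0, 26.0, 35.0],
})

# One-hot encode the categorical column; drop_first avoids a redundant column
X_encoded = pd.get_dummies(X, columns=["sex"], drop_first=True)
print(X_encoded.columns.tolist())  # ['age', 'sex_male']
```

The encoded frame can then replace the `select_dtypes` step in the baseline and be passed to `train_test_split` and `LogisticRegression` unchanged.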