Titanic

dataset_hub.classification.datasets.get_titanic(verbose=None)[source]

Load and return the Titanic dataset (classification).

A classic binary classification dataset containing information about passengers aboard the Titanic, including demographic and ticket-related features and the survival outcome.

Original dataset: Kaggle Titanic

Columns:

  • pclass (int): passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)

  • name (str): full name of the passenger

  • sex (str): passenger gender

  • age (float): passenger age in years, may contain missing values

  • fare (float): ticket fare, may contain missing values

  • sibsp (int): number of siblings/spouses aboard

  • parch (int): number of parents/children aboard

  • survived 🚩 (int): target variable, 1 if survived, 0 otherwise
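The schema above can be sketched with a small hand-built frame (the rows below are illustrative placeholders, not real Titanic records) to show the expected dtypes and which columns may carry missing values:

```python
import pandas as pd

# Toy rows mimicking the documented schema (not real Titanic records)
df = pd.DataFrame({
    "pclass": [3, 1],
    "name": ["Doe, Mr. John", "Roe, Mrs. Jane"],
    "sex": ["male", "female"],
    "age": [22.0, None],   # age may contain missing values
    "fare": [7.25, 71.28],
    "sibsp": [1, 1],
    "parch": [0, 0],
    "survived": [0, 1],    # target: 1 = survived, 0 = otherwise
})

# The target is binary, and nullable numeric columns load as floats
assert set(df["survived"]) <= {0, 1}
assert df["age"].dtype == "float64"
print(df.isna().sum()["age"])
```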

Parameters:

verbose (bool, optional) – If True, the function prints a link to the dataset documentation (e.g., this page) in the log output after loading. Defaults to None, which falls back to the global Library Settings.

Returns:

The Titanic dataset with all features including the target.

Return type:

pandas.DataFrame

Quick Start:

from dataset_hub.classification import get_titanic

df = get_titanic()

Baseline


from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

from dataset_hub.classification import get_titanic

# Load the Titanic dataset
df = get_titanic()
df.head()
Dataset info & details: https://getdataset.github.io/dataset-hub/datasets/classification/titanic.html
survived pclass name sex age fare sibsp parch
0 0 3 Braund, Mr. Owen Harris male 22.0 7.2500 1 0
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 71.2833 1 0
2 1 3 Heikkinen, Miss. Laina female 26.0 7.9250 0 0
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 53.1000 1 0
4 0 3 Allen, Mr. William Henry male 35.0 8.0500 0 0
# Separate target variable (y) and features (X)
y = df["survived"]
X = df.drop("survived", axis=1)

# Drop categorical columns for simplicity (you can preprocess them yourself)
X = X.select_dtypes(include=["int64", "float64"])

# Fill missing numeric values with each column's median
for col in X.columns:
    X[col] = X[col].fillna(X[col].median())

# Split into train and test sets; stratify=y preserves the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Create and train the model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", round(accuracy, 3))
Accuracy: 0.706
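The baseline drops the categorical columns, but the `sex` column usually carries strong signal for this task. As one possible next step (a sketch, not part of the library), it can be one-hot encoded with `pandas.get_dummies` before training; the small synthetic frame below only illustrates the mechanics, so its values and score are not real Titanic results:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy frame standing in for the real dataset (illustrative values only)
df = pd.DataFrame({
    "pclass": [3, 1, 3, 1, 2, 3],
    "sex": ["male", "female", "female", "male", "female", "male"],
    "age": [22.0, 38.0, 26.0, 35.0, 28.0, 40.0],
    "fare": [7.25, 71.28, 7.93, 53.10, 13.00, 8.05],
    "survived": [0, 1, 1, 0, 1, 0],
})

y = df["survived"]
# One-hot encode `sex` instead of dropping it
X = pd.get_dummies(df.drop("survived", axis=1),
                   columns=["sex"], drop_first=True)

model = LogisticRegression(max_iter=200)
model.fit(X, y)
print(model.score(X, y))  # training accuracy on the toy frame
```

With the real dataset, the same `get_dummies` call can be applied to `X` before the train/test split in the baseline above.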