[Kaggle] Backpack Prediction Challenge with XGBoost
Goal: predict the price of backpacks from various attributes. Submissions are evaluated by the RMSE metric:
$$ \mathrm{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i-\hat{y}_i\right)^2} $$
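For reference, the metric is simple to compute directly; a minimal NumPy sketch (not part of the competition code):

import numpy as np

def rmse(y_true, y_pred):
    # Root mean squared error, matching the competition metric above
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))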
Refer to this challenge here: https://www.kaggle.com/competitions/playground-series-s5e2
EDA
Viewing the first five rows of the dataset:
import pandas as pd

full_data = pd.read_csv("/kaggle/input/playground-series-s5e2/train.csv")
full_data.head()

Visualization of the data distributions using seaborn and matplotlib:
import matplotlib.pyplot as plt
import seaborn as sns
import math

# Reuse the dataframe loaded above
df = full_data

# Set up the plotting style
sns.set(style="whitegrid")

# Plot numerical columns
num_cols = df.select_dtypes(include=["number"]).columns
num_rows = math.ceil(len(num_cols) / 2)  # Arrange in two columns
fig, axes = plt.subplots(num_rows, 2, figsize=(12, 4 * num_rows))
fig.suptitle("Distributions of Numerical Columns", fontsize=16)
for i, col in enumerate(num_cols):
    row, col_idx = divmod(i, 2)
    sns.histplot(df[col], kde=True, bins=20, ax=axes[row, col_idx])
    axes[row, col_idx].set_title(f"Distribution of {col}")
# Hide the empty subplot if there is an odd number of numerical columns
if len(num_cols) % 2 != 0:
    fig.delaxes(axes[-1, -1])
plt.tight_layout(rect=[0, 0, 1, 0.96])  # Leave room for the suptitle
plt.show()

# Plot categorical columns
cat_cols = df.select_dtypes(exclude=["number"]).columns
num_rows = math.ceil(len(cat_cols) / 2)
fig, axes = plt.subplots(num_rows, 2, figsize=(12, 4 * num_rows))
fig.suptitle("Distributions of Categorical Columns", fontsize=16)
for i, col in enumerate(cat_cols):
    row, col_idx = divmod(i, 2)
    sns.countplot(y=df[col], order=df[col].value_counts().index, ax=axes[row, col_idx])
    axes[row, col_idx].set_title(f"Distribution of {col}")
# Hide the empty subplot if there is an odd number of categorical columns
if len(cat_cols) % 2 != 0:
    fig.delaxes(axes[-1, -1])
plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()

Data Analysis
Compartments, Weight Capacity (kg), and Price are the numerical variables, while Brand, Material, Size, Laptop Compartment, Waterproof, Style, and Color are categorical. The output below also shows that every categorical variable contains NaN entries.
for col in full_data.columns:
    if full_data[col].dtype == 'object':
        print(col, full_data[col].unique())

Train-test Split
Before preprocessing, a train-test split is required to prevent data leakage. The full data is divided into features X and label y.
X = full_data.loc[:, full_data.columns != 'Price']
y = full_data.loc[:, full_data.columns == 'Price']

from sklearn.model_selection import train_test_split

# random_state added so the split (and the scores below) are reproducible
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)
print("Split Done")
Data preprocessing
Identifying the number of missing entries in each column:
for col in X_train.columns:
    ratio = X_train[col].isna().sum() * 100 / len(X_train)
    print(f"{col} | {ratio:.2f}% missing ({X_train[col].isna().sum()} missing)")

All columns have less than 5% missing values, so imputation is appropriate. Among the numerical variables, only Weight Capacity (kg) has missing values. Since it is a floating-point weight in kilograms, mean imputation is reasonable. The mean is computed on the training set only and reused for the validation set, so no validation information leaks into preprocessing.
train_mean = X_train['Weight Capacity (kg)'].mean()
X_train['Weight Capacity (kg)'] = X_train['Weight Capacity (kg)'].fillna(train_mean)
X_valid['Weight Capacity (kg)'] = X_valid['Weight Capacity (kg)'].fillna(train_mean)
Categorical columns are imputed with the most frequent entry of each column.
from sklearn.impute import SimpleImputer
object_cols = [cols for cols in X_train.columns if X_train.loc[:, cols].dtype=='object']
cat_imputer = SimpleImputer(strategy='most_frequent')
X_train[object_cols] = cat_imputer.fit_transform(X_train[object_cols])
X_valid[object_cols] = cat_imputer.transform(X_valid[object_cols])
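As an alternative design, both imputations can be bundled into a single ColumnTransformer, so that one object is fitted on the training set and reused everywhere; a minimal sketch, assuming the same column names as above:

from sklearn.compose import ColumnTransformer

# One transformer holds both imputation strategies;
# fit on training data only, then transform train/valid/test
imputer = ColumnTransformer(
    [('num', SimpleImputer(strategy='mean'), ['Weight Capacity (kg)']),
     ('cat', SimpleImputer(strategy='most_frequent'), object_cols)],
    remainder='passthrough')
imputer.fit(X_train)

Note that ColumnTransformer returns a NumPy array, so column names must be restored afterwards if a DataFrame is needed.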
Checking for missing data after imputation
# Train dataset missing ratio
missing_cols = 0
for col in X_train.columns:
    ratio = X_train[col].isna().sum() * 100 / len(X_train)
    print(f"{col} | {ratio:.2f}% missing ({X_train[col].isna().sum()} missing)")
    if ratio != 0:
        missing_cols += 1
if missing_cols == 0:
    print("\nTraining dataset imputation complete")
else:
    print("\nTraining dataset imputation incomplete")

# Validation dataset missing ratio
missing_cols = 0
for col in X_valid.columns:
    ratio = X_valid[col].isna().sum() * 100 / len(X_valid)
    print(f"{col} | {ratio:.2f}% missing ({X_valid[col].isna().sum()} missing)")
    if ratio != 0:
        missing_cols += 1
if missing_cols == 0:
    print("\nValidation dataset imputation complete")
else:
    print("\nValidation dataset imputation incomplete")

Categorical variables and one-hot encoding
Since the XGBoost regressor only accepts numeric inputs, categorical variables must be encoded into non-object attributes before fitting the model.
The Size column contains Small, Medium, and Large entries, so ordinal encoding is appropriate.
# Ordinal: Size
from sklearn.preprocessing import OrdinalEncoder
ord_encoder = OrdinalEncoder()
X_train['Size'] = ord_encoder.fit_transform(X_train['Size'].values.reshape(-1,1))
X_valid['Size'] = ord_encoder.transform(X_valid['Size'].values.reshape(-1,1))
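One caveat: by default, OrdinalEncoder assigns codes alphabetically (Large=0, Medium=1, Small=2), which does not reflect the natural size order. An explicit category order fixes this; a sketch with a fresh encoder (size_encoder is a new name, separate from ord_encoder above), assuming the column contains only Small, Medium, and Large after imputation:

from sklearn.preprocessing import OrdinalEncoder

# Explicit order so Small < Medium < Large maps to 0 < 1 < 2
size_encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
X_train['Size'] = size_encoder.fit_transform(X_train[['Size']])
X_valid['Size'] = size_encoder.transform(X_valid[['Size']])

For a tree-based model like XGBoost the particular order rarely changes results, but it keeps the encoding readable.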
The other object variables have no natural order, so one-hot encoding is the better choice.
from sklearn.preprocessing import OneHotEncoder

object_cols = [col for col in X_train.columns if X_train[col].dtype == 'object']
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)  # use sparse_output=False on scikit-learn >= 1.2
# Naming the encoded columns keeps all feature names as strings,
# which downstream tools (including XGBoost) expect
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]),
                             columns=OH_encoder.get_feature_names_out(object_cols),
                             index=X_train.index)
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]),
                             columns=OH_encoder.get_feature_names_out(object_cols),
                             index=X_valid.index)
# Remove the original categorical columns and replace them with their one-hot encodings
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)
X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
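A quick sanity check (not in the original notebook) that the two frames now share an identical, fully numeric layout:

assert list(X_train.columns) == list(X_valid.columns), "feature columns misaligned"
assert X_train.select_dtypes(include='object').empty  # no object columns remain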
Model fitting with XGBRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

# early_stopping_rounds is a constructor argument in recent XGBoost releases
# (older versions accepted it in fit() instead)
model_1 = XGBRegressor(n_estimators=1000, learning_rate=0.05, early_stopping_rounds=10)
model_1.fit(X_train, y_train,
            eval_set=[(X_valid, y_valid)],
            verbose=False)

prediction = model_1.predict(X_valid)
rmse = mean_squared_error(y_valid, prediction) ** 0.5  # mean_squared_error(y_true, y_pred)
print(rmse)
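To see which encoded features drive the predictions, the fitted model's importances can be inspected (a quick diagnostic, not part of the original pipeline):

# Top ten features by XGBoost's default importance score
importances = pd.Series(model_1.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))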

Test data preprocessing
X_test = pd.read_csv("/kaggle/input/playground-series-s5e2/test.csv")
for col in X_test.columns:
    print(col, X_test[col].isna().sum())

# Reuse the training-set mean so test preprocessing matches training exactly
X_test['Weight Capacity (kg)'] = X_test['Weight Capacity (kg)'].fillna(train_mean)

# Reuse the categorical imputer already fitted on the training set,
# rather than refitting it on test data
object_cols = [col for col in X_test.columns if X_test[col].dtype == 'object']
X_test[object_cols] = cat_imputer.transform(X_test[object_cols])
missing_cols = 0
for col in X_test.columns:
    ratio = X_test[col].isna().sum() * 100 / len(X_test)
    print(f"{col} | {ratio:.2f}% missing ({X_test[col].isna().sum()} missing)")
    if ratio != 0:
        missing_cols += 1
if missing_cols == 0:
    print("Imputation for test dataset complete")
else:
    print("Imputation for test dataset incomplete")

# Reuse the encoders fitted on the training set; refitting them on test data
# could silently produce a different column layout
X_test['Size'] = ord_encoder.transform(X_test['Size'].values.reshape(-1, 1))

object_cols = [col for col in X_test.columns if X_test[col].dtype == 'object']
OH_cols_test = pd.DataFrame(OH_encoder.transform(X_test[object_cols]),
                            columns=OH_encoder.get_feature_names_out(object_cols),
                            index=X_test.index)
# Remove the original categorical columns and append their one-hot encodings
num_X_test = X_test.drop(object_cols, axis=1)
X_test = pd.concat([num_X_test, OH_cols_test], axis=1)
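Before predicting, it is worth confirming (again, a sanity check added for this writeup) that the test matrix matches the training feature layout exactly:

# Every training feature must exist in the test frame, in the same order
missing = set(X_train.columns) - set(X_test.columns)
assert not missing, f"test set is missing columns: {missing}"
X_test = X_test[X_train.columns]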

Prediction with trained XGBoost model
test_pred = model_1.predict(X_test)
output = pd.DataFrame({'id': X_test['id'], 'Price': test_pred})
output.to_csv('submission.csv', index=False)
This code achieved a public leaderboard score of 39.18075.
Refer to my notebook: https://www.kaggle.com/code/hanbos/xgboost-good-simple-backpack-prediction-model