[Kaggle] Spaceship Titanic

HBShim 2025. 1. 18. 14:40

Spaceship Titanic

The goal of the competition is to predict which passengers aboard the Spaceship Titanic were transported to an alternate dimension after the ship collided with a spacetime anomaly near Alpha Centauri. By analyzing records recovered from the spaceship's damaged computer system, assist rescue crews in locating and saving the lost passengers.

Dataset

The train.csv data file is opened first. Each data point has too many features for pandas to display them all in a single output screen.
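
If the truncated view gets in the way, pandas display options can be lifted temporarily. The snippet below is only a sketch: the one-row demo frame is a hypothetical stand-in for train.csv, and its column names are an assumption based on the competition's data description.

```python
import pandas as pd

# Hypothetical stand-in for train.csv: one empty row carrying the column
# names (assumed from the competition data description), so the snippet
# runs without the Kaggle input files.
columns = ["PassengerId", "HomePlanet", "CryoSleep", "Cabin", "Destination",
           "Age", "VIP", "RoomService", "FoodCourt", "ShoppingMall", "Spa",
           "VRDeck", "Name", "Transported"]
demo = pd.DataFrame([[None] * len(columns)], columns=columns)

# Lift the column limit only inside this block; global settings are untouched.
with pd.option_context("display.max_columns", None, "display.width", 250):
    wide_view = repr(demo.head())

print(wide_view)
```

In a notebook, the same option_context block can wrap a call to display(train_data.head()) instead of repr.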

The dataset consists of 14 features. The RoomService, FoodCourt, ShoppingMall, Spa, and VRDeck features represent the amount of money spent on each amenity aboard the Spaceship Titanic. Some data points have missing features. These empty entries are filled with 0 so that the model can be fit later.

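import pandas as pd

# Load the training set and replace every missing entry with 0
train_data = pd.read_csv("/kaggle/input/spaceship-titanic/train.csv")
train_data.fillna(0, inplace=True)
train_data.head()

# Load the test set and apply the same fill
test_data = pd.read_csv("/kaggle/input/spaceship-titanic/test.csv")
test_data.fillna(0, inplace=True)
test_data.head()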
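
Before overwriting the gaps, it can be useful to know how many entries are missing in each column. The following is a minimal sketch on a made-up four-row frame (the values are hypothetical, not taken from train.csv). On the real data the same isna().sum() call would be run on the frame before fillna.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of train.csv with a few missing entries.
demo = pd.DataFrame({
    "HomePlanet": ["Earth", np.nan, "Mars", "Europa"],
    "Age": [24.0, 58.0, np.nan, 33.0],
    "RoomService": [0.0, 109.0, 43.0, np.nan],
    "Spa": [549.0, np.nan, np.nan, 0.0],
})

# Count missing entries per column, and the share of rows affected.
missing_per_column = demo.isna().sum()
missing_share = demo.isna().mean()

print(missing_per_column)
print(missing_share)
```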

Categorical Data Distribution

The HomePlanet, CryoSleep, Destination, and VIP features take only a few designated values. To identify the unique categories of each feature, the data distribution of each feature is visualized. Here, the HomePlanet and Destination plots include the entries whose value was NaN.
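
The counts behind such plots can be sketched with value_counts. This is an illustration on a hypothetical six-row sample, not the code used for the figures. Passing dropna=False keeps the missing entries as their own category.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; on the real data these columns come from train.csv
# as loaded, before missing entries are filled.
demo = pd.DataFrame({
    "HomePlanet": ["Earth", "Earth", np.nan, "Mars", "Europa", "Earth"],
    "Destination": ["TRAPPIST-1e", np.nan, "TRAPPIST-1e",
                    "55 Cancri e", "PSO J318.5-22", "TRAPPIST-1e"],
})

# dropna=False keeps missing entries as a separate category in the counts.
category_counts = {
    col: demo[col].value_counts(dropna=False) for col in demo.columns
}

for col, counts in category_counts.items():
    print(col)
    print(counts)
```

Calling counts.plot(kind="bar") on each result would give one bar chart per feature.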

Random Forest Tree Fitting

A random forest classifier is trained on nine of the features (listed in the code below), with 1000 estimators and a maximum depth of 5. Learn more about random forests in my previous post here.

from sklearn.ensemble import RandomForestClassifier

# Target label: whether the passenger was transported
y = train_data["Transported"]

# One-hot encode the categorical columns among the selected features
features = ["HomePlanet", "CryoSleep", "Age", "VIP", "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])
# Make sure the test columns match the training columns in name and order
X_test = X_test.reindex(columns=X.columns, fill_value=0)

model = RandomForestClassifier(n_estimators=1000, max_depth=5, random_state=1)
model.fit(X, y)
print("Fitting Complete")
predictions = model.predict(X_test)

# Write the predictions in the submission format
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Transported': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")
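
A rough accuracy estimate can also be obtained locally before submitting, by holding out part of the training data. The sketch below is not part of the original workflow. It uses synthetic data with a made-up labeling rule, and only 100 estimators to keep it quick, so the number it prints says nothing about the competition score.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded feature matrix X and the labels y.
rng = np.random.default_rng(1)
X_demo = pd.DataFrame({
    "Age": rng.integers(0, 80, size=400),
    "RoomService": rng.exponential(200.0, size=400),
    "CryoSleep_True": rng.integers(0, 2, size=400),
})
# Made-up rule so that the labels are learnable (an assumption for the demo).
y_demo = (X_demo["CryoSleep_True"] == 1) | (X_demo["RoomService"] < 50)

# Hold out 20% of the rows for validation, keeping the class balance.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

demo_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
demo_model.fit(X_fit, y_fit)

val_accuracy = accuracy_score(y_val, demo_model.predict(X_val))
print(f"Validation accuracy: {val_accuracy:.3f}")
```

With the real data, X and y from the code above would take the place of X_demo and y_demo.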

The figure below shows the classification results after reducing the 14-dimensional data to 2 and 3 dimensions using PCA.

Classification Result for training dataset
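
The dimensionality reduction step can be sketched with scikit-learn's PCA. This is a minimal illustration on random 14-dimensional data. The actual preprocessing behind the figure (for example, whether the features were scaled first) is not stated here, so none is assumed.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the 14-dimensional feature matrix.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 14))

# Project the data onto the first 2 and the first 3 principal components.
X_2d = PCA(n_components=2).fit_transform(X_demo)
X_3d = PCA(n_components=3).fit_transform(X_demo)

print(X_2d.shape, X_3d.shape)
```

The projected points can then be drawn as a scatter plot and colored by predicted class.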

The submission achieved 78.8% accuracy.