
[Kaggle] Spaceship Titanic

HBShim 2025. 1. 18. 14:40

Spaceship Titanic

 The goal of the competition is to predict which passengers aboard the Spaceship Titanic were transported to an alternate dimension after the ship collided with a spacetime anomaly near Alpha Centauri. By analyzing records recovered from the spaceship's damaged computer system, assist rescue crews in locating and saving the lost passengers.

 

Dataset

First, open the train.csv data file. Each data point has too many features for pandas to display on a single output screen.

train.csv
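If the truncated view gets in the way, pandas display options can be relaxed temporarily. The snippet below is a minimal sketch on a made-up 30-column frame (`toy` and its `col_*` names are placeholders, not columns from train.csv); it only shows the `display.max_columns` setting.

```python
import pandas as pd

# Toy stand-in for a wide table (made-up values, not the real train.csv).
toy = pd.DataFrame({f"col_{i}": [i, i + 1] for i in range(30)})

# Inside this block pandas prints every column instead of eliding the middle ones.
with pd.option_context("display.max_columns", None, "display.width", 200):
    wide_view = repr(toy)

print(wide_view)
```

In a notebook, `pd.set_option("display.max_columns", None)` applies the same setting for the rest of the session.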

It consists of 14 features. The RoomService, FoodCourt, ShoppingMall, Spa, and VRDeck features record the amount of money a passenger spent on each of the Spaceship Titanic's amenities. Some data points have missing features. Empty entries are filled with 0 so that the model can be fitted later.

train.csv empty entries
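One way to see how many entries are empty in each column is to count them before filling. This is a small sketch on made-up rows (the `toy` frame is an assumption for illustration); on the real data the same call would be applied to the frame read from train.csv.

```python
import numpy as np
import pandas as pd

# Toy rows with a few gaps (made-up values, not taken from train.csv).
toy = pd.DataFrame({
    "HomePlanet": ["Earth", np.nan, "Mars"],
    "Age": [39.0, 24.0, np.nan],
    "RoomService": [0.0, np.nan, 43.0],
})

# isna() marks each empty cell; summing the marks gives a per-column count.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```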

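import pandas as pd

# Load the training set and fill every missing entry with 0
train_data = pd.read_csv("/kaggle/input/spaceship-titanic/train.csv")
train_data.fillna(0, inplace=True)
train_data.head()

# Load the test set and fill it the same way
test_data = pd.read_csv("/kaggle/input/spaceship-titanic/test.csv")
test_data.fillna(0, inplace=True)
test_data.head()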

 

Categorical Data Distribution

The HomePlanet, CryoSleep, Destination, and VIP features take only a few designated values. To identify the unique categories of each feature, the data distribution of each feature is visualized. Here, the HomePlanet and Destination plots also include a category for entries that were NaN.

train.csv data visualization
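The same category counts can be read off numerically. The sketch below uses made-up rows (`toy` is a placeholder frame) and `value_counts(dropna=False)`, so that missing entries appear as their own category rather than being dropped.

```python
import numpy as np
import pandas as pd

# Toy categorical columns with gaps (made-up values for illustration).
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", np.nan, "Earth"],
    "Destination": ["TRAPPIST-1e", np.nan, "TRAPPIST-1e", "55 Cancri e"],
})

# Count how often each category occurs, keeping NaN as a category.
category_counts = {col: toy[col].value_counts(dropna=False) for col in toy.columns}

for col, counts in category_counts.items():
    print(col)
    print(counts)
```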

 

Random Forest Tree Fitting

A random forest classifier is trained with 1,000 estimators and a maximum depth of 5, using the nine features listed in the code below. Learn more about random forests in my previous post here.

from sklearn.ensemble import RandomForestClassifier

# Target: whether the passenger was transported
y = train_data["Transported"]

# One-hot encode the categorical columns among the selected features
features = ["HomePlanet", "CryoSleep", "Age", "VIP", "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

# Fit the forest; random_state fixes the seed for reproducibility
model = RandomForestClassifier(n_estimators=1000, max_depth=5, random_state=1)
model.fit(X, y)
print("Fitting Complete")
predictions = model.predict(X_test)

# Write the predictions in the submission format
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Transported': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")
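Because `pd.get_dummies` is called separately on the training and test frames, the two results only share the same columns when both frames contain the same categories. That appears to hold for the run above, but as a precaution the test columns can be aligned to the training columns. The sketch below uses made-up frames (`train_toy`, `test_toy`) to show the idea; it is not part of the original notebook.

```python
import pandas as pd

# Toy frames: the test frame lacks the "Mars" and "Europa" categories.
train_toy = pd.DataFrame({"HomePlanet": ["Earth", "Mars", "Europa"], "Age": [39.0, 24.0, 58.0]})
test_toy = pd.DataFrame({"HomePlanet": ["Earth", "Earth"], "Age": [27.0, 19.0]})

X_toy = pd.get_dummies(train_toy)
X_test_toy = pd.get_dummies(test_toy)

# Reorder the test columns to match training and add any missing dummies as 0.
X_test_aligned = X_test_toy.reindex(columns=X_toy.columns, fill_value=0)

print(list(X_test_aligned.columns))
```

Note that `reindex` also drops any dummy column that exists only in the test frame, which is what a model fitted on the training columns expects.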

 

The figures below show the classification results after reducing the 14-dimensional data to 2 and 3 dimensions using PCA.

 

Classification Result for training dataset
Classification Result for test dataset
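The post does not include the code behind these figures. A projection of this kind could be sketched with scikit-learn's `PCA` as follows; the random 14-column matrix `X_toy` is a stand-in for the encoded feature matrix, and the plotting step is left out.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for a 14-dimensional feature matrix (not the real data).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 14))

# Project onto the first 2 and the first 3 principal components.
X_2d = PCA(n_components=2).fit_transform(X_toy)
X_3d = PCA(n_components=3).fit_transform(X_toy)

print(X_2d.shape, X_3d.shape)
```

Each row of `X_2d` or `X_3d` can then be drawn as a point and colored by its predicted class.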

 

The submission scored 78.8% accuracy.
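An accuracy estimate can also be obtained locally before submitting, for example with k-fold cross-validation. The sketch below runs on synthetic data from `make_classification` with a smaller forest than the one above, so its numbers say nothing about the competition score; every name in it is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (not the Spaceship Titanic data).
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=1)

# Smaller forest than in the post, only to keep the example fast.
toy_model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=1)

# 5-fold cross-validation: fit on four folds, score accuracy on the fifth.
scores = cross_val_score(toy_model, X_toy, y_toy, cv=5, scoring="accuracy")

print(scores.mean())
```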