Adventures With TensorFlow: Titanic Competition (Part 1)
The Magic of Machine Learning
From an outsider’s perspective, machine learning looks like complete magic. I initially wanted to get a surface-level understanding of how it all works by watching a few YouTube videos on the topic, but it was difficult to get even a blurry picture of what was going on without some hands-on learning. So I decided to take a full class on Udemy instead. The result? I accidentally developed a major passion for machine learning (oops). So here’s a quick overview of my current adventures with TensorFlow as I attempt to put my knowledge to use on Kaggle’s Titanic Competition.
Here’s my Kaggle profile: myhashbrowns@Kaggle
My current submission score is 0.77751 (out of 1.0), but hopefully this will improve after I make some changes to the data I feed into my model.
Normalizing Data
One of my favorite things about machine learning, and supervised learning in particular, is that it all starts with data. I’ll always take any excuse to plot some graphs and crunch some numbers. As an infrastructure engineer, I typically only get to do that when a particularly troublesome issue requires diving knee-deep into server logs; the rest of the time, metrics platforms like Grafana and Prometheus handle most of the data munging and graphing.
The Titanic dataset is particularly fun to play around with because the data is fairly clean as-is; it just requires some normalization. These are the columns that come with the default dataset:
| variable | definition |
|---|---|
| pid | Passenger ID |
| pname | Passenger name |
| survival | Survival |
| pclass | Ticket class |
| sex | Sex |
| age | Age in years |
| sibsp | # of siblings / spouses aboard the Titanic |
| parch | # of parents / children aboard the Titanic |
| ticket | Ticket number |
| fare | Passenger fare |
| cabin | Cabin number |
| embarked | Port of embarkation |
(This is a very morbid dataset, in all honesty.)
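To get a feel for what actually needs cleaning, it helps to load the CSV and count the missing values per column. A minimal sketch, assuming the standard train.csv from the competition page is sitting in the working directory:

```python
import pandas as pd

# Load the training half of the Kaggle Titanic dataset
df = pd.read_csv("train.csv")

# Count missing values per column; Age, Cabin, and Embarked
# are the ones with gaps in this dataset
print(df.isnull().sum())

# Check which columns are numeric vs. strings
print(df.dtypes)
```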
As this is one of my first forays into supervised learning, I decided to toss out a couple of fields rather than try to normalize them. The columns I removed from the training data were PassengerId, Name, Ticket, and Cabin. One of my next goals will be to re-add some of these fields. The Name column in particular could be useful, since a passenger’s title could hint at their chances of survival, and the model could take that into account; I sketch that idea below.
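As an illustration (this is a hypothetical sketch, not something in my pipeline yet), a regex can pull the title out of the Name column, since names in this dataset are formatted like “Braund, Mr. Owen Harris”:

```python
def extract_title(df):
    """Pull the honorific (Mr, Mrs, Miss, ...) out of the Name column."""
    df = df.copy()
    # Names look like "Braund, Mr. Owen Harris": the title sits
    # between the comma and the period
    df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False)
    # Collapse rare titles into one bucket so the model doesn't chase noise
    common_titles = {"Mr", "Mrs", "Miss", "Master"}
    df["Title"] = df["Title"].where(df["Title"].isin(common_titles), "Rare")
    return df
```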
Here’s the function I use to normalize the dataset:
```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler


def preprocess_titanic_data(df, is_training_data=True, columns_to_drop=None):
    """
    Preprocess Titanic dataset for neural networks
    """
    if columns_to_drop is None:
        columns_to_drop = ["PassengerId", "Name", "Ticket", "Cabin"]

    df_processed = df.copy()
    y = None

    # Split off the label when we're working with training data
    if is_training_data and "Survived" in df_processed.columns:
        y = df_processed["Survived"].values
        df_processed.drop("Survived", axis=1, inplace=True)

    # Some columns we do not want
    df_processed = df_processed.drop(columns_to_drop, axis=1)

    # Handle missing values: fill missing Age with the median age
    age_imputer = SimpleImputer(strategy="median")
    df_processed["Age"] = age_imputer.fit_transform(df_processed[["Age"]])

    # Fill missing Embarked with the most common value
    df_processed["Embarked"] = df_processed["Embarked"].fillna(
        df_processed["Embarked"].mode()[0]
    )

    # Convert Sex from male/female to 0/1
    df_processed["Sex"] = LabelEncoder().fit_transform(df_processed["Sex"])

    # Encode Embarked with one-hot encoding
    embarked_dummies = pd.get_dummies(
        df_processed["Embarked"], prefix="Embarked", dtype=np.float32
    )
    df_processed = pd.concat([df_processed, embarked_dummies], axis=1)
    df_processed.drop("Embarked", axis=1, inplace=True)

    # Standardize the numerical values
    numerical_features = ["Age", "SibSp", "Parch", "Fare"]
    scaler = StandardScaler()
    df_processed[numerical_features] = scaler.fit_transform(
        df_processed[numerical_features]
    )

    # Print shapes!
    print(f"Processed features shape: {df_processed.shape}")
    print(f"Original features shape: {df.shape}")
    print(f"Final features used: {df_processed.columns.tolist()}")

    # Convert to numpy and return
    return df_processed.to_numpy(), y, scaler
```
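For reference, here’s roughly how I call it for both CSVs (the file names are just the defaults from the Kaggle download):

```python
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

# Each CSV is preprocessed independently
X_train, Y_train, train_scaler = preprocess_titanic_data(train_df, is_training_data=True)
X_test, _, _ = preprocess_titanic_data(test_df, is_training_data=False)

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
```

(One thing I want to revisit: the scaler is refit on the test set here, whereas reusing the scaler returned from the training call would keep the two feature spaces consistent.)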
The output is as follows (first for the training set, then for the test set):

```text
Processed features shape: (891, 9)
Original features shape: (891, 12)
Final features used: ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked_C', 'Embarked_Q', 'Embarked_S']
Processed features shape: (418, 9)
Original features shape: (418, 11)
Final features used: ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked_C', 'Embarked_Q', 'Embarked_S']
#
# Resulting X_train and Y_train shapes
#
X_train shape: (891, 9)
X_test shape: (418, 9)
X_train head: [[ 3.          1.         -0.56573646  0.43279337 -0.47367361 -0.50244517
   0.          0.          1.        ]
 [ 1.          0.          0.66386103  0.43279337 -0.47367361  0.78684529
   1.          0.          0.        ]
 [ 3.          0.         -0.25833709 -0.4745452  -0.47367361 -0.48885426
   0.          0.          1.        ]
 [ 1.          0.          0.4333115   0.43279337 -0.47367361  0.42073024
   0.          0.          1.        ]
 [ 3.          1.          0.4333115  -0.4745452  -0.47367361 -0.48633742
   0.          0.          1.        ]]
Y_train shape: (891,)
Y_train head: [0 1 1 1 0]
```
Missing age data gets filled with the median age. I haven’t played around with this field much yet. I don’t think dropping the column would help the model perform better; after all, children were prioritized on the lifeboats. But I’ll consider simplifying this field in the future, maybe one-hot encoding it into something along the lines of child / adult.

I also filled in the missing data for the Embarked column and one-hot encoded it, but this is a field I’m going to try removing in later experiments, as I’m not sure how relevant it is to a passenger’s survival chances. The remaining numerical columns were standardized, which squishes the numbers down to smaller, more consistent ranges and makes them easier for the model to learn from.
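Coming back to the child / adult idea, the binning itself would be short. A minimal sketch (the cut-off of 18 is my own assumption, and this would have to run right after the age imputation, before the values get standardized):

```python
# Hypothetical: replace the continuous Age column with child/adult buckets.
# Must run on the raw (imputed) ages, before StandardScaler touches them.
age_buckets = pd.cut(df_processed["Age"], bins=[0, 18, 100], labels=["child", "adult"])
age_dummies = pd.get_dummies(age_buckets, prefix="Age", dtype=np.float32)
df_processed = pd.concat([df_processed.drop("Age", axis=1), age_dummies], axis=1)
```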
My next post will go in-depth into the TensorFlow model itself. But honestly, a lot of my time has been spent just staring at the CSV file and trying to figure out how best to mold the data into something the model can learn from properly.