Kelson Martins Blog

Introduction

I was recently introduced to Kaggle after starting a journey into Machine Learning, and this article aims to provide a gentle introduction to the world of Kaggle competitions.
For those of you who are not familiar with the term, Kaggle is an online community of data scientists and machine learning practitioners that allows members to explore available datasets and participate in competitions to find the best performing models.

In this article, we will tackle the Titanic competition, in which we must develop a model that uses a given dataset to predict whether or not each passenger survived the sinking of the Titanic.
We will develop this model with Python's scikit-learn machine learning library, building a simple logistic regression model that gives us an estimated prediction accuracy of 76%.


Exploring the Dataset

The competition provides two dataset files, one for training and one for testing. Logically, we build the model on the training dataset and then execute it against the `unseen` data of the test dataset.
The training dataset contains 11 features: [PassengerId], [Pclass], [Name], [Sex], [Age], [SibSp], [Parch], [Ticket], [Fare], [Cabin] and [Embarked].
The training dataset also contains the target class we are trying to predict, the [Survived] column. We will build our model aiming to predict the [Survived] value once the model is applied to the unseen data of the test dataset.

The first step in building our model is to load the dataset.

import pandas as pd

def load_dataset():
    # Open the training and test datasets as pandas DataFrames
    train = pd.read_csv("train.csv", delimiter=",")
    test = pd.read_csv("test.csv", delimiter=",")

    # Merge the two datasets into one DataFrame so we can perform preprocessing on all data at once.
    # The test set has no [Survived] column, so we add an artificial one with value -1,
    # which lets us separate the two sets again after preprocessing.
    test["Survived"] = -1
    frameList = [train, test]
    allData = pd.concat(frameList, ignore_index=True)

    return allData

We initially load both the training and test datasets as individual pandas DataFrames.

We then combine both DataFrames so we can perform all pre-processing steps at once. Note that we create an artificial [Survived] feature in the test dataset with value -1 so we can separate the two sets again once pre-processing is finished.
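
As a quick sanity check, the short snippet below (an illustration only, not part of the final script) confirms the merge: the standard Kaggle files contain 891 training rows and 418 test rows, so the combined DataFrame should report 1,309 rows, with exactly 418 of them carrying the artificial -1 value.

allData = load_dataset()
print(allData.shape)                      # expect (1309, 12): 891 train rows + 418 test rows
print((allData["Survived"] == -1).sum())  # expect 418: the placeholder rows coming from the test set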

Pre-Processing

The next step is to perform dataset pre-processing. For the simplicity of this article, we start by dropping a few features: [Name], [Cabin] and [Ticket].

titanic = titanic.drop(['Name','Cabin','Ticket'], axis=1)

We then move on to analyzing the missing values of the remaining features.

print(titanic.isnull().sum())

Age            263
Embarked         2
Fare             1
Parch            0
PassengerId      0
Pclass           0
Sex              0
SibSp            0
Survived         0
dtype: int64

We can observe that there are three features with missing values: [Age], [Embarked] and [Fare].

To handle these missing values, we will make use of Scikit-learn's SimpleImputer class, which provides the capability to quickly address missing values through different strategies such as [most_frequent], [mean] or [median].

import numpy as np
from sklearn.impute import SimpleImputer

imputer_mean = SimpleImputer(missing_values=np.NaN, strategy='mean')
imputer_frequent = SimpleImputer(missing_values=np.NaN, strategy='most_frequent')

imputer_mean.fit(titanic[['Age']])
titanic['Age'] = imputer_mean.transform(titanic[['Age']])

imputer_mean.fit(titanic[['Fare']])
titanic['Fare'] = imputer_mean.transform(titanic[['Fare']])

imputer_frequent.fit(titanic[['Embarked']])
titanic['Embarked'] = imputer_frequent.transform(titanic[['Embarked']])

Note that we are applying two different strategies to the features with missing values: the most_frequent strategy is applied to the [Embarked] feature, while the mean strategy is applied to both the [Age] and [Fare] features.
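
If you want to verify the result, re-running the null check after imputation should report no remaining missing values:

print(titanic[['Age', 'Fare', 'Embarked']].isnull().sum())  # all three counts should now be 0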

The next step in our Pre-Processing phase is the encoding of categorical features. These features are: [Sex] and [Embarked].

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

encoder = OneHotEncoder(sparse=False)
embarkEncoded = encoder.fit_transform( titanic[["Embarked"]] )
embarkedDF = pd.DataFrame(embarkEncoded)

frameList = [embarkedDF, titanic]
titanic = pd.concat(frameList, axis=1) 
    
# dropping old Embarked
titanic = titanic.drop(['Embarked'], axis=1)

# sex encoder    
encoder = OrdinalEncoder()
titanic['Sex'] = encoder.fit_transform(titanic[['Sex']])

To encode these features, we will use scikit-learn's OneHotEncoder for the [Embarked] feature, and a simple OrdinalEncoder for the [Sex] feature.
Although the [Embarked] feature has no specific order, we do not want the model to assume that one value is greater than another, so by using a OneHotEncoder we slightly increase the model accuracy.
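
To illustrate the difference, the toy example below (using made-up values, not the Titanic data itself) shows how the two encoders treat the same column: the OrdinalEncoder maps the ports to 0, 1 and 2, implying an order, while the OneHotEncoder produces one binary column per port with no implied ranking.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

sample = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# OrdinalEncoder assigns a numeric rank to each port (C=0, Q=1, S=2)
print(OrdinalEncoder().fit_transform(sample))

# OneHotEncoder creates one binary column per port
print(OneHotEncoder(sparse=False).fit_transform(sample))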

The last step in our Pre-Processing phase is the normalization of numerical data.

from sklearn import preprocessing

scalingObj = preprocessing.MinMaxScaler()
titanic[['Age', 'SibSp', 'Fare', 'Pclass']] = scalingObj.fit_transform( titanic[['Age', 'SibSp', 'Fare', 'Pclass']] )

For the normalization, we simply apply Scikit-learn's MinMaxScaler class to our numerical features.
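
Under the hood, MinMaxScaler rescales each feature to the [0, 1] range using the column's minimum and maximum: scaled = (x - min) / (max - min). A tiny illustration with made-up values:

import pandas as pd
from sklearn import preprocessing

sample = pd.DataFrame({'Age': [20.0, 40.0, 80.0]})
# (x - min) / (max - min): 20 -> 0.0, 40 -> 0.333..., 80 -> 1.0
print(preprocessing.MinMaxScaler().fit_transform(sample))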

By wrapping all our Pre-Processing steps into a function, we get the following:

def performPreprocessing(titanic):

    # dropping a few features
    titanic = titanic.drop(['Name','Cabin','Ticket'], axis=1)

    # handling missing values
    # applying mean for Age and Fare features
    # applying most_frequent for Embarked
    imputer_mean = SimpleImputer(missing_values=np.NaN, strategy='mean')
    imputer_frequent = SimpleImputer(missing_values=np.NaN, strategy='most_frequent')

    imputer_mean.fit(titanic[['Age']])
    titanic['Age'] = imputer_mean.transform(titanic[['Age']])

    imputer_mean.fit(titanic[['Fare']])
    titanic['Fare'] = imputer_mean.transform(titanic[['Fare']])

    imputer_frequent.fit(titanic[['Embarked']])
    titanic['Embarked'] = imputer_frequent.transform(titanic[['Embarked']])

    # categorical feature handling
    # Embarked: nominal feature, encoded with OneHotEncoder
    # Sex: binary feature, encoded with a simple OrdinalEncoder

    # handling Embarked with OneHotEncoder
    encoder = OneHotEncoder(sparse=False)
    embarkEncoded = encoder.fit_transform( titanic[["Embarked"]] )
    embarkedDF = pd.DataFrame(embarkEncoded)
  
    # concatenating Embarked OneHotEncoder
    frameList = [embarkedDF, titanic]
    titanic = pd.concat(frameList, axis=1)      
    # dropping old Embarked
    titanic = titanic.drop(['Embarked'], axis=1)    

    # sex encoder    
    encoder = OrdinalEncoder()
    titanic['Sex'] = encoder.fit_transform(titanic[['Sex']])

    # normalizing data
    # features Age, SibSp, Fare, Pclass
    scalingObj = preprocessing.MinMaxScaler()
    titanic[['Age', 'SibSp', 'Fare', 'Pclass']] = scalingObj.fit_transform( titanic[['Age', 'SibSp', 'Fare', 'Pclass']] )
  

    return titanic

Model Building

Now that we have our dataset properly prepared, it is time to build our model and get it ready for submission to Kaggle.

The first step is to split both the train and test datasets, separating the prediction target [Survived] from the remaining features.

titanic_train_target = titanic_train['Survived']
titanic_train_data = titanic_train.loc[:, titanic_train.columns != 'Survived']

titanic_test_target = titanic_test['Survived']
titanic_test_data = titanic_test.loc[:, titanic_test.columns != 'Survived']

The next step involves the removal of the [PassengerId] feature from the data, since it carries no predictive value; our Kaggle submission, however, will be scored on the combination of [PassengerId] and [Survived].
We therefore remove the feature but store the test dataset's [PassengerId] values, as they will later be required to build the submission file.

titanic_train_data = titanic_train_data.drop(['PassengerId'], axis=1)
titanic_test_PassengerId = titanic_test['PassengerId']
titanic_test_data = titanic_test_data.drop(['PassengerId'], axis=1)

It is now time to train our model and test it against the unseen data.
For that, we make use of a simple Logistic Regression algorithm.

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial')
clf.fit(titanic_train_data, titanic_train_target)
results = clf.predict(titanic_test_data)

# results is the NumPy array of predictions returned by the classifier
resultSeries = pd.Series(data=results, name='Survived', dtype='int64')

The final step is to prepare our submission file so it can be scored in Kaggle.

titanic_test_PassengerId = titanic_test_PassengerId.reset_index(drop=True)
df = pd.DataFrame({"PassengerId":titanic_test_PassengerId, "Survived":resultSeries})

df.to_csv("submission.csv", index=False, header=True)

This final step combines the previously stored [PassengerId] values with the [Survived] values predicted by our model against the unseen data.
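
The resulting submission.csv follows the format Kaggle expects for this competition: a header row followed by one line per test passenger, with test passengers numbered from 892 onwards. The [Survived] values below are placeholders only, shown to illustrate the layout:

PassengerId,Survived
892,0
893,1
...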

We can summarize our Model Building steps through the following function:

def model(titanic_train, titanic_test):

    # splitting training and test data into features and target; required to build the model
    titanic_train_target = titanic_train['Survived']
    titanic_train_data = titanic_train.loc[:, titanic_train.columns != 'Survived']

    titanic_test_target = titanic_test['Survived']
    titanic_test_data = titanic_test.loc[:, titanic_test.columns != 'Survived']

    # dropping PassengerId from both the training and test datasets, but storing the test set's values as we will need them later for the Kaggle submission
    # dropping for train first
    titanic_train_data = titanic_train_data.drop(['PassengerId'], axis=1)
    
    # dropping for test but keeping it separate
    titanic_test_PassengerId = titanic_test['PassengerId']

    titanic_test_data = titanic_test_data.drop(['PassengerId'], axis=1)

    clf = LogisticRegression(random_state=0, solver='lbfgs',multi_class='multinomial')
    clf.fit(titanic_train_data,titanic_train_target) 
    results = clf.predict(titanic_test_data)

    # results below is the NumPy array of predicted results returned from our classifier
    resultSeries = pd.Series(data = results, name = 'Survived', dtype='int64')

    # important to reset the index of the PassengerId Series so that the results and PassengerId values are correctly concatenated
    titanic_test_PassengerId = titanic_test_PassengerId.reset_index(drop=True)

    # create a data frame with just the PassengerID feature from the test dataset and the results
    df = pd.DataFrame({"PassengerId":titanic_test_PassengerId, "Survived":resultSeries})
    # write the results to a CSV file (you should then upload this file)
    df.to_csv("submission.csv", index=False, header=True)    

Putting it all together

Having prepared our functions to load the dataset, perform pre-processing, and build the model, we can tie it all together with our main function.

def main():
    
    allData = load_dataset()
        
    # run preprocessing.
    all_data = performPreprocessing(allData)

    # break data into test and train
    train, test = break_train_test(all_data)
    
    # build and run model
    model(train, test)

The only point to note is the addition of the break_train_test method.

def break_train_test(titanic_data):

    # breaking down the data into Train and Test
    mask = titanic_data['Survived'] >= 0    
    train = titanic_data[mask]
    # ~ inverts the mask, selecting the rows with the artificial Survived == -1 (the test set)
    test = titanic_data[~mask]     

    return [train, test]

Remember that at the start of the article we concatenated the training and test datasets into one so that we could perform all the pre-processing steps at once? This function now simply breaks the dataset back down into separate train and test datasets.

We are now ready to run our program. By doing so, you will notice a submission.csv file being generated, which can be submitted to Kaggle to be scored.

By performing the submission, it is expected that you achieve an accuracy score of at least 76%.
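
If you want a rough estimate before uploading, a quick cross-validation run on the training portion gives an idea of how the model is likely to score. The sketch below assumes that titanic_train_data and titanic_train_target from the Model Building section are available; the exact number will vary slightly from the leaderboard score.

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial')
scores = cross_val_score(clf, titanic_train_data, titanic_train_target, cv=5)
print(scores.mean())  # rough estimate of the expected accuracy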

Conclusion

This article provided a simple way of getting started with Kaggle competitions.
Although the achieved accuracy of 76% is not going to put you among the top Kagglers for the Titanic competition, it will surely provide you with the information and tools to further explore improvements to the presented model.

An example of a further improvement is handling the [Name] feature. For this article, the [Name] feature was dropped, but models that make use of it have been able to achieve higher accuracy.
The reason lies in the fact that the [Name] feature contains titles such as `Mr`, `Mrs`, etc., and these titles can help the algorithm better predict the [Survived] target class.
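
A minimal sketch of one common approach, assuming it runs before the [Name] feature is dropped (the column follows the pattern "Surname, Title. Firstname"), is to extract the title into a new feature that can then be encoded like the others:

# extract the title (Mr, Mrs, Miss, ...) that precedes the period in the Name column
titanic['Title'] = titanic['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
print(titanic['Title'].value_counts())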

You can find the complete code for this implementation on GitHub.

I plan to explore some other Kaggle competitions, so if you found this article useful, stay tuned for similar content in the near future.

