Visualizing Data with the Titanic Dataset

Before we get our hands dirty with Python and CSV files, let us first understand what the dataset contains.

The dataset consists of the following columns:

  1. PassengerId
  2. Survived
  3. Pclass
  4. Name
  5. Sex
  6. Age
  7. SibSp
  8. Parch
  9. Ticket
  10. Fare
  11. Cabin
  12. Embarked

Procedure followed throughout the project

  1. Importing Packages
  2. Data Inspection
  3. Data Cleaning
  4. Data Visualization
  5. Suggestion

1. Importing Python Packages

We import the Python packages required for the project.
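
The import cell itself is not shown here, so the following is only a guess at a minimal set: pandas for the table, NumPy for numeric helpers and Matplotlib for plotting. The violin and count plots later on suggest that seaborn may also have been imported, but that is an assumption and it is left out of this sketch.

```python
# Assumed minimal imports for the steps in this walkthrough.
import numpy as np               # numeric helpers
import pandas as pd              # loading, inspecting and cleaning the CSV
import matplotlib.pyplot as plt  # basic plotting
```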

2. Data Inspection

We begin the inspection by loading the required .csv file into the Jupyter notebook and looking through it, since we are not expected to remember every detail of what the file contains.
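
Loading the file and taking a first look might be sketched as below. The three rows are made-up placeholders so that the snippet runs on its own, and the variable name titanic is hypothetical; in the notebook you would pass the path of the real CSV file to pd.read_csv instead of the in-memory text.

```python
import io
import pandas as pd

# Made-up three-row stand-in for the real file (placeholder values only).
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,Passenger A,male,22,1,0,T1,7.25,,S
2,1,1,Passenger B,female,38,1,0,T2,71.28,C85,C
3,1,3,Passenger C,female,,0,0,T3,7.92,,S
"""

# With the real data this would be pd.read_csv(<path to the .csv file>).
titanic = pd.read_csv(io.StringIO(csv_text))

print(titanic.head())  # first rows of the table
print(titanic.shape)   # (number of rows, number of columns)
```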

2.1 Dataset at a glance

2.2 Columns at a glance

variable.dtypes lists each column of the dataset alongside its data type in tabular form.
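
As a small illustration, here is dtypes on a toy frame with made-up values (titanic is a hypothetical variable name standing in for the loaded dataset):

```python
import pandas as pd

# Toy frame with made-up values, standing in for the real dataset.
titanic = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Name": ["Passenger A", "Passenger B", "Passenger C"],
    "Age": [22.0, None, 26.0],
    "Fare": [7.25, 71.28, 7.92],
})

# dtypes is an attribute (no parentheses); it returns one dtype per column.
column_types = titanic.dtypes
print(column_types)
```

Text columns such as Name show up with a string or object type, while a numeric column that contains missing values, such as Age, is stored as floating point.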

2.3 Checking whether the data contains any null values

This step is crucial during the inspection stage, as it gives the analyst a rough idea of how many data cells need to be filled with calculated values.
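
One common way to get these counts is isnull().sum(), which is also the call used later to recheck the Embarked column. The frame below is a toy example with made-up values, not the real data:

```python
import pandas as pd

# Toy frame with made-up values; None marks a missing cell.
titanic = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() marks each missing cell as True; sum() counts them per column.
null_counts = titanic.isnull().sum()
print(null_counts)
```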

Conclusion

As we can see from the table, the Age, Cabin and Embarked columns contain null values, which will be either filled or dropped during the Data Cleaning process.

3. Data Cleaning

We have to clean the data, i.e. remove any null values and ambiguities (if any), and try to achieve uniformity in the dataset.

3.1 Dropping the Cabin column

Of the 891 entries in the dataset, 687 have an empty Cabin cell, so we can safely drop the column, as it is of little use to us.

We now print the columns of the dataset to check whether 'Cabin' has been dropped.
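
A minimal sketch of the drop and the follow-up check, on a toy frame with made-up values (titanic is a hypothetical variable name):

```python
import pandas as pd

# Toy frame with made-up values; Cabin is mostly empty, as in the real data.
titanic = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Cabin": [None, "C85", None],
    "Fare": [7.25, 71.28, 7.92],
})

# drop() returns a new frame without the column, so we reassign it.
titanic = titanic.drop(columns=["Cabin"])

# Print the remaining columns to confirm that Cabin is gone.
print(titanic.columns)
```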

3.2 Populating the missing ages

Here I plan to clean the data by filling every cell that has a NULL age with the mean value of the age.
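
Mean imputation of this kind could look like the sketch below. The ages are made up, and titanic and mean_age are hypothetical names:

```python
import pandas as pd

# Toy frame with made-up ages; one value is missing.
titanic = pd.DataFrame({"Age": [22.0, None, 26.0, 30.0]})

# mean() skips missing values by default.
mean_age = titanic["Age"].mean()

# Replace every missing age with the mean and confirm none are left.
titanic["Age"] = titanic["Age"].fillna(mean_age)
print(titanic["Age"].isnull().sum())
```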

3.3 Missing embarkation ports

We notice that the Embarked column contains some missing values, so we first find the rows where it is null.
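
Selecting those rows can be done with a boolean mask. The sketch below uses made-up rows, and the names titanic and missing_port are hypothetical:

```python
import pandas as pd

# Toy frame with made-up values; two rows have no embarkation port.
titanic = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Sex": ["male", "female", "female", "female"],
    "Ticket": ["T1", "T2", "T9", "T9"],
    "Embarked": ["S", "C", None, None],
})

# Keep only the rows where Embarked is missing.
missing_port = titanic[titanic["Embarked"].isnull()]
print(missing_port)
```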

We notice that both passengers were travelling on the same ticket number. We will fill the missing Embarked values with the mode of the column, so we first have to find the mode.

From the above table we conclude that most women embarked at Southampton. Hence we fill the NaN with S in both cases, and then recheck the null status of the Embarked column using isnull().sum().
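
The fill-and-recheck step might be sketched as follows. The ports are made up, the names are hypothetical, and the mode is taken over the whole column here, which may differ from how the table above was produced:

```python
import pandas as pd

# Toy frame with made-up ports; two values are missing.
titanic = pd.DataFrame({
    "Embarked": ["S", "S", "C", None, "Q", "S", None],
})

# mode() ignores missing values and returns a Series; take its first entry.
most_common_port = titanic["Embarked"].mode()[0]

# Fill the gaps with the most common port, then recheck for nulls.
titanic["Embarked"] = titanic["Embarked"].fillna(most_common_port)
remaining_nulls = titanic["Embarked"].isnull().sum()
print(most_common_port, remaining_nulls)
```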

4. Data Visualization

This section of the project consists of basic visualizations of the dataset. Its purpose is to give a better understanding of the dataset in a clearer, more visual manner.

4.1 Scatter plot of Fare versus Age
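
One possible way to draw this plot with Matplotlib is sketched below. The plotting code used in the notebook is not shown, so the library choice, the axis orientation (Age on x, Fare on y) and the made-up values are all assumptions:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy frame with made-up values, standing in for the cleaned dataset.
titanic = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86],
})

# One point per passenger: age on the x-axis, fare on the y-axis.
fig, ax = plt.subplots()
ax.scatter(titanic["Age"], titanic["Fare"])
ax.set_xlabel("Age")
ax.set_ylabel("Fare")
ax.set_title("Fare versus Age")
# In a Jupyter notebook the figure is rendered when the cell finishes.
```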

4.2 Bar Graph to represent the Seat Class taken by passengers
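
A bar graph like this can be built by counting passengers per class first. This is a sketch with made-up values and hypothetical names, not necessarily the notebook's own code:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy frame with made-up classes, standing in for the cleaned dataset.
titanic = pd.DataFrame({"Pclass": [3, 1, 3, 2, 3, 1]})

# Count passengers per class and order the classes 1, 2, 3.
class_counts = titanic["Pclass"].value_counts().sort_index()

# One bar per class; labels are converted to text so they act as categories.
fig, ax = plt.subplots()
ax.bar(class_counts.index.astype(str), class_counts.values)
ax.set_xlabel("Pclass")
ax.set_ylabel("Number of passengers")
```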

4.3 Histogram to represent Age

4.4 Bar plot of the Survival Rate of People

4.5 Heat Map of Survival vs Pclass
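
A heat map of this kind needs a table of counts for each (Survived, Pclass) pair. The sketch below builds it with pd.crosstab and draws it with Matplotlib's imshow; the notebook may well have used a different plotting call, and the values are made up:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy frame with made-up values, standing in for the cleaned dataset.
titanic = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1],
    "Pclass":   [3, 1, 2, 3, 3, 1],
})

# Rows are Survived (0/1), columns are Pclass (1/2/3), cells are counts.
counts = pd.crosstab(titanic["Survived"], titanic["Pclass"])

# Draw the table as a colour-coded grid with labelled ticks.
fig, ax = plt.subplots()
image = ax.imshow(counts.values, cmap="Blues")
ax.set_xticks(range(len(counts.columns)))
ax.set_xticklabels(counts.columns)
ax.set_yticks(range(len(counts.index)))
ax.set_yticklabels(counts.index)
ax.set_xlabel("Pclass")
ax.set_ylabel("Survived")
fig.colorbar(image, ax=ax)
```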

4.6 Cool way to represent Age, Survival and Pclass

4.7 Violin plot of Gender versus Age

This graph summarises the age range of the people who were saved. The survival rate is –

4.8 Count plot of embarkation port, taking Pclass into consideration

5. Suggestion

As with most datasets, the more information we have, the better it can be analysed. I believe that we could add the following variables:
