In today's blog, we are going to learn about data analysis and the processes involved in it, using examples from my GitHub repository on Exploratory Data Analysis of the Titanic data set. The analysis was done in a Jupyter Notebook, which can be viewed using this link.
According to Wikipedia, data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.
In simplified terms, "Data analysis is the process of looking into the historical data of an organization and analyzing it with a particular aim in mind: to draw out facts and information that support the decision-making process. Whatever decisions we make in our lives, we make by remembering what happened last time. Thus, data analysis greatly influences the decision-making process."
There are five main steps in the process of data analysis:
STEP 1: Asking the right question(s)
The first step in any data analysis is to ask the right question(s) of the given data. Once the objective of the analysis is identified, it becomes easier to decide which type(s) of data we will need to draw conclusions.
The objective behind analyzing the Titanic data set is to find out which factors contributed to a person's chance of survival on board the Titanic.
STEP 2: Data Wrangling
"Data wrangling, sometimes referred to as data munging or data pre-processing, is the process of gathering, assessing, and cleaning "raw" data into a form suitable for analysis."
Data wrangling has 3 sub-steps:
Gathering of data
After identifying the objective behind our analysis, the next step is to collect the data we need to draw appropriate conclusions. There are various ways to collect data, some of which are:
- API or web scraping: If the data needed is available on particular website(s), we can use the website's API (if available) or web scraping techniques to collect the data and store it in local storage or databases. Data collected from the Internet often arrives in JSON format, and further processing is needed to convert it to the commonly used ".csv" format.
- Databases: If the data required is available in our company's databases, we can use SQL queries to extract it.
- Data set repositories: Sites like kaggle.com host data sets in ready-to-download formats for members to use in practice and competitions.
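The JSON-to-CSV conversion mentioned in the first bullet can be sketched with pandas. This is a minimal illustration only: the JSON payload below is made up, and a real API response may be nested and need more work before it fits into a flat table.

```python
import io
import json

import pandas as pd

# Hypothetical API response: a JSON array of flat records (made-up values).
raw_json = """
[
  {"PassengerId": 1, "Name": "Passenger A", "Fare": 7.25},
  {"PassengerId": 2, "Name": "Passenger B", "Fare": 71.28}
]
"""

records = json.loads(raw_json)   # list of dicts, one per record
df = pd.DataFrame(records)       # one row per record, one column per key

# Write the CSV text to an in-memory buffer; pass a file path such as
# 'passengers.csv' (hypothetical name) to save it to disk instead.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to a buffer keeps the sketch free of side effects; in a notebook you would more likely write straight to a file.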
For this blog post, we will use the Titanic data set hosted on kaggle.com for our analysis.
First, let's import all the libraries and the "train.csv" data set we will need throughout our analysis.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
data = pd.read_csv('train.csv')  # Gathering data
data.head()
Assessing of data
After the data has been gathered, stored in a supported format, and assigned to a variable in Python, it's time to get a high-level overview of the type of data we are dealing with. This includes gathering information such as:
- The number of rows and columns present in the data set
print('Number of rows: ', data.shape[0], '\nNumber of columns: ', data.shape[1])
- The columns present in the data set, along with the data type and number of non-null values of each
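A summary of this kind can be produced with pandas' `DataFrame.info()`. The sketch below runs it on a small made-up sample rather than the real "train.csv", so the counts are illustrative only; on the real data you would call `data.info()`.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample with gaps in the same three columns as the Titanic data.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", "C", None],
})

sample.info()  # prints column names, non-null counts, and dtypes

# A per-column count of missing values gives the same information directly.
missing_counts = sample.isnull().sum()
print(missing_counts)
```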
The above output shows the name of each column, along with its number of non-null values and its data type. From this output, it is clear that the Age, Cabin, and Embarked columns have missing values, which we need to deal with in the data cleaning stage.
Let's look at what information these features (columns) represent:
- PassengerId: Unique ID of each passenger.
- Survived: Binary feature (consisting of only 0 or 1 values) which indicates whether the passenger survived (0 = No, 1 = Yes).
- Pclass: Ticket class, used as a proxy for the socioeconomic status of the passenger (1 = Upper Class, 2 = Middle Class, 3 = Lower Class).
- Name: Passenger name.
- Sex: Male or female.
- Age: Age of the passenger.
- SibSp: Number of siblings/spouses of the passenger aboard the Titanic.
- Parch: Number of parents/children of the passenger aboard the Titanic.
- Ticket: Ticket number of each passenger.
- Fare: Fare paid by the passenger.
- Cabin: Cabin number allotted to the passenger.
- Embarked: Port of embarkation of the passenger (C = Cherbourg, Q = Queenstown, S = Southampton).
Cleaning of data
Data cleaning is the process of detecting and correcting missing or inaccurate records in a data set. In this process, data in its "raw" form (with missing or inaccurate values) is cleaned so that the output data is free of such values. Since no two data sets are the same, the method of handling missing and inaccurate values varies greatly between data sets, but most of the time we either fill in the missing values or remove the feature that cannot be worked with.
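The two common options, filling in missing values or removing a feature, might look like this in pandas. The sample is made up, and the choices shown (filling Embarked with the most frequent port, dropping Cabin) are assumptions for illustration, not necessarily what the notebook does.

```python
import pandas as pd

# Made-up sample; the column names mirror the Titanic data set.
sample = pd.DataFrame({
    "Fare": [7.25, 71.28, 8.05, 53.10],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
})

# Option 1: fill missing values, here with the most frequent port.
most_common_port = sample["Embarked"].mode()[0]
sample["Embarked"] = sample["Embarked"].fillna(most_common_port)

# Option 2: drop a feature that is too sparse to repair.
cleaned = sample.drop(columns=["Cabin"])

print(cleaned.isnull().sum())
```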
Fun Fact: Data analysts are often said to spend about 70% of their time cleaning data.
In the Titanic data set, as noticed before, the Age column has some missing values, which we will now deal with.
The Age column has a mean of 29.69 and a standard deviation of 14.52. Because the standard deviation is so high, filling every missing value with the mean would pile many passengers onto a single age and distort the distribution, so we need a workaround. We will generate as many random integers as there are missing values, each between (mean - standard deviation) and (mean + standard deviation), and use them to fill the missing values in the DataFrame.
import random
n_missing = int(data['Age'].isnull().sum())
print('Number of missing values in Age:', n_missing)
mean = data['Age'].mean()
std = data['Age'].std()
# random.randint needs integer bounds
lower_limit = int(round(mean - std))
upper_limit = int(round(mean + std))
# One random age per missing value
random_list = []
for i in range(n_missing):
    random_list.append(random.randint(lower_limit, upper_limit))
random_list = np.array(random_list)
# Work on a copy of the column and fill each NaN with the next random age
age = data['Age'].to_numpy(copy=True)
k = 0
for i in range(len(age)):
    if np.isnan(age[i]):
        age[i] = random_list[k]
        k += 1
data['Age'] = age
print('Number of missing values in Age:', data['Age'].isnull().sum())
So, the Age column has been dealt with, and all missing values have been replaced by random ages between (mean - standard deviation) and (mean + standard deviation).
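The same fill can also be written without an explicit loop. The sketch below is an alternative, not the notebook's code: it uses NumPy's random `Generator` with a boolean mask, runs on a made-up series of ages, and fixes an arbitrary seed so the result is reproducible.

```python
import numpy as np
import pandas as pd

# Made-up ages with gaps; stands in for data['Age'].
ages = pd.Series([22.0, np.nan, 38.0, 26.0, np.nan, 35.0, 54.0, 2.0])

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily
mean, std = ages.mean(), ages.std()
low = int(round(mean - std))
high = int(round(mean + std))

missing = ages.isnull()  # boolean mask of the gaps

# rng.integers excludes the upper bound by default, so endpoint=True
# keeps `high` itself as a possible value (matching random.randint).
fill_values = rng.integers(low, high, size=int(missing.sum()), endpoint=True)

filled = ages.copy()
filled[missing] = fill_values
print(filled.isnull().sum())
```

Only the masked positions change; every age that was already present is left as it was.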
See the Jupyter notebook at the link provided for the full data cleaning process.
STEP 3: EXPLORATORY DATA ANALYSIS (EDA)
Once the data is collected, cleaned, and processed, it is ready for Analysis. As you manipulate data, you may find you have the exact information you need, or you might need to collect more data. During this phase, you can use data analysis tools and software which will help you to understand, interpret, and derive conclusions based on the requirements.
Now that the Titanic data set is cleaned, we will do some example EDA on it.
1. Using visualization techniques, analyze which gender was given more priority during the rescue operation.
# Adding a new column 's' to store survived status as a string for
# better visualisations.
data['s'] = ''
data.loc[(data['Survived'] == 1), 's'] = 'Survived'
data.loc[(data['Survived'] == 0), 's'] = 'Not Survived'
sns.countplot(x='Sex', hue='s', data=data)
Thus, from the above visualization, we can infer that females were given priority during the rescue operation, as their mortality count was low compared to that of males.
2. Find out whether a person's class contributed to their likelihood of survival.
sns.barplot(x='Pclass', y='Survived', data=data)
plt.title('Class vs Survived')
plt.show()
From the above visualization, we can infer that people belonging to the upper class were given the highest priority during the rescue operation, followed by middle, and lower classes. Lower classes also had the highest mortality count.
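The plots can be backed up with numbers by grouping on the same columns. The sketch below uses a handful of made-up rows, so the rates it prints say nothing about the real data; on the notebook's DataFrame the same calls would be made on `data`.

```python
import pandas as pd

# Made-up rows; values are illustrative, not taken from train.csv.
sample = pd.DataFrame({
    "Sex":      ["male", "female", "female", "male", "male", "female"],
    "Pclass":   [3, 1, 3, 1, 3, 2],
    "Survived": [0, 1, 1, 1, 0, 1],
})

# Survived is 0/1, so its mean within a group is that group's survival rate.
rate_by_sex = sample.groupby("Sex")["Survived"].mean()
rate_by_class = sample.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```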
While exploring the data, we can see that the SibSp and Parch columns both count relatives a passenger has on board, so it makes more sense to combine SibSp and Parch into a single "relatives" feature.
This step is known as feature engineering, where we modify existing features or create new ones to better explain our analysis.
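Combining the two columns is a single addition in pandas. A minimal sketch on made-up values, assuming the new column is simply named `relatives`:

```python
import pandas as pd

# Made-up counts; the column names mirror the Titanic data set.
sample = pd.DataFrame({
    "SibSp": [1, 0, 3],
    "Parch": [0, 2, 1],
})

# New feature: total number of relatives on board.
sample["relatives"] = sample["SibSp"] + sample["Parch"]
print(sample)
```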
NOTE: For full data analysis, please view the Jupyter notebook file from this link.
STEP 4: CONCLUSION
After the analysis phase is completed, the next step is to interpret our analysis and draw conclusions from it.
As we interpret the data, there are 3 key questions we should ask:
- Did the analysis answer my original question?
- Was there any limitation in my analysis which would affect my conclusions?
- Was the analysis sufficient to help decision-making?
From the analysis of the Titanic data set (link), we were able to find the major factors that contributed to a person's chance of survival:
- Males had a higher chance of survival if they belonged to the upper class, were aged 0 to 4 or 18 to 50, and had 1 to 3 relatives traveling on board the Titanic.
- Females had a higher chance of survival irrespective of their class, provided they were aged 0 to 4 or 15 to 50 and had 0 to 4 relatives traveling on board the Titanic.
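Age ranges like these are typically found by binning ages and comparing survival rates per band. The sketch below shows one way to do that with `pd.cut`. The rows and the band edges are made up for illustration and are not the ones used in the notebook.

```python
import pandas as pd

# Made-up rows; values are illustrative, not taken from train.csv.
sample = pd.DataFrame({
    "Sex":      ["male", "male", "female", "female", "male", "female"],
    "Age":      [2, 30, 3, 25, 60, 70],
    "Survived": [1, 1, 1, 1, 0, 0],
})

# Hypothetical band edges. Intervals are closed on the right, and
# include_lowest=True makes the first band include age 0.
bins = [0, 4, 18, 50, 80]
labels = ["0-4", "5-18", "19-50", "51-80"]
sample["age_band"] = pd.cut(
    sample["Age"], bins=bins, labels=labels, include_lowest=True
)

# Survival rate per sex and age band; observed=True skips empty bands.
rates = sample.groupby(["Sex", "age_band"], observed=True)["Survived"].mean()
print(rates)
```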
STEP 5: COMMUNICATING RESULTS
Now that the data has been explored and conclusions have been drawn, it's time to communicate your findings to the people concerned, or to a wider audience, through data storytelling, blog posts, presentations, or reports. Strong communication skills are a plus at this stage, since your findings need to be conveyed clearly to other people.
A Fun Fact
The five steps of data analysis are not always followed in order; the process is non-linear in nature. To explain this, let's consider an example:
Suppose you have done your analysis and drawn conclusions, and then you suddenly see a way to represent a feature better, or to construct a new feature out of others present in the data set. You would go back to step 3, perform feature engineering, and perform the EDA again with the new features added.
Thus, it is not always possible to follow these steps linearly.
The amount of data generated around the world is now measured in zettabytes per year, and much of it remains underutilized. Data analysis can help organizations gain useful insights from their data and make better decisions.