Here are the steps to prepare a dataset for machine learning in Python:
- Get the dataset and import the libraries.
- Handle missing data.
- Encode categorical data.
- Split the dataset into the training set and test set.
- Apply feature scaling if the columns are not on comparable scales.
Step 1: Get The Dataset.
We will use the Indian Liver Patient data and prepare the complete dataset step by step.
The download link is given below. Remember, this is not a real dataset; it is a demo dataset that is structured like the actual one.
Download File: patientData
Now, we need to create a project directory. Create it with the following command.
mkdir predata
Now go into the directory.
cd predata
We need to move the CSV file inside this folder.
Now, open the Anaconda Navigator software. After opening Navigator, you will see its home screen with the list of available applications.
Launch the Spyder application and navigate to your project folder. Since we have already moved the patientData.csv file there, you can see it in the folder.
Next, create a Python file called datapre.py and import the libraries we need.
Write the following code inside the datapre.py file. Remember, we are using Python 3.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Jul 25 18:52:15 2018
@author: your name
"""
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Now, select the three import statements and press Command + Enter. The console at the bottom right shows that the code ran successfully.
That means we have successfully imported the libraries. If you get an error, the numpy, pandas, or matplotlib library is probably not installed, so you need to install it first.
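If you are not sure which library is missing, a small check like the one below can help. This is only a sketch: the helper name missing_packages is made up for this example, and sklearn is included in the list because we use scikit-learn in the later steps.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the package names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # sklearn is the import name of scikit-learn.
    missing = missing_packages(["numpy", "pandas", "matplotlib", "sklearn"])
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All required libraries are available.")
```

Anything reported as not installed can then be added through Anaconda Navigator or with your usual package manager.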
Step 2: Handle Missing Data.
In real-world datasets, missing data is very common. If you collect real patient data, some values will almost always be missing.
To train the model correctly, we must fill in those values somehow. Otherwise, the model may fail to train or make poor predictions.
Luckily, libraries already exist for this; we only need to use the proper function. Our dataset has missing data, so we need to fill it in with either the mean values or some other strategy.
In this example, we use the mean of each column to fill in the values. So let us do that.
But first, let us load the dataset so that we can divide it into the features X and the target Y.
Write the following code after importing the libraries.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Jul 25 18:52:15 2018
@author: krunal
"""
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('patientData.csv')
Now, select the following line and press Command + Enter.
dataset = pd.read_csv('patientData.csv')
The dataset is now loaded, and you can inspect it in Spyder's Variable explorer.
Wherever a value is empty, nan is displayed. We need to replace those entries with the mean values. So let us do that.
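Before filling anything in, it can help to count how many values are missing in each column. The sketch below uses a small hand-made DataFrame as a stand-in for patientData.csv; the column order and every value in it are assumptions made for illustration, not rows from the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for patientData.csv (illustrative values only).
dataset = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [65.0, np.nan, 48.0, 33.0],
    "Albumin": [3.3, 3.2, np.nan, 4.0],
    "Liver Disease": ["Yes", "No", "Yes", "No"],
})

# isna() marks every missing cell; sum() counts them per column.
missing_per_column = dataset.isna().sum()
print(missing_per_column)
```

With the real file, you would run the same isna().sum() call on the dataset variable loaded by pd.read_csv.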
Write the following code.
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 3].values
So, here in X, we have selected the first three columns and left out the fourth column, which will be our Y.
Remember, indexes start from 0, and -1 refers to the last column. So :-1 selects all the columns except for the last one.
For Y, we have explicitly selected the fourth column, whose index is 3.
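To see how these two slices behave, here is a minimal sketch on a hypothetical four-column DataFrame. The values are invented for illustration; only the slicing expressions mirror the ones we use on the real dataset.

```python
import pandas as pd

# Hypothetical four-column frame (illustrative values only).
dataset = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Age": [65, 50, 33],
    "Albumin": [3.3, 3.2, 4.0],
    "Liver Disease": ["Yes", "No", "No"],
})

# All rows, every column except the last one.
X = dataset.iloc[:, :-1].values
# All rows, the last column only; with four columns this is the same as index 3.
Y = dataset.iloc[:, -1].values

print(X.shape)
print(Y.shape)
```

Using -1 for Y instead of a fixed index keeps the code working if more feature columns are added later, as long as the target stays in the last column.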
Now we need to handle the missing data. Again, we will use a library: scikit-learn.
Write the following code.
...
from sklearn.impute import SimpleImputer
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it.
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
So, here, we use the SimpleImputer class with the strategy 'mean' to fill the missing values with the mean of each column. Run the above lines and type X in the Console to see the result.
Here, columns 1 and 2 have missing values, but we have written 1:3 because the upper bound of a slice is excluded, so 1:3 selects columns 1 and 2. The fit step computes the mean of each column, and the transform step replaces the NaN values with those means.
You can see that the mean value of each column fills in its missing values.
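A tiny worked example makes the mean strategy easy to verify by hand. The array below is hypothetical (two columns standing in for Age and Albumin), and the sketch assumes a scikit-learn version that provides SimpleImputer in sklearn.impute.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical Age and Albumin values with one gap in each column.
values = np.array([
    [65.0, 3.0],
    [np.nan, 4.0],
    [35.0, np.nan],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
# fit_transform learns each column's mean from the observed values
# and writes it into the NaN positions of that column.
filled = imputer.fit_transform(values)
print(filled)
```

The missing Age is replaced by (65 + 35) / 2 = 50, and the missing Albumin by (3 + 4) / 2 = 3.5. Only the observed values of a column contribute to its mean.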
So, we have handled the missing data. Now, head over to the next step.
Step 3: Encode Categorical data.
In our dataset, there are two categorical columns.
- Gender
- Liver Disease
So, we need to encode these two columns of data.
# Encode Categorical Data
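from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
# categorical_features was removed from OneHotEncoder in scikit-learn 0.22.
# ColumnTransformer one-hot encodes column 0 and passes the other columns through.
columntransformer = ColumnTransformer([('gender', OneHotEncoder(), [0])], remainder = 'passthrough', sparse_threshold = 0)
X = columntransformer.fit_transform(X).astype(float)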
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
Here, we have encoded the values of the first column, Gender. It has only two categories, Female and Male. LabelEncoder numbers the categories in alphabetical order, so Female becomes 0 and Male becomes 1. One-hot encoding then replaces that single column with two indicator columns: the first is 1 for Female and 0 for Male, and the second is the opposite.
Run the above lines and see the changes in the categorical data.
Because the one Gender column has become two indicator columns, X grows from 3 columns to 4 columns.
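The column growth is easy to check on a small example. The sketch below one-hot encodes a hypothetical three-row array with ColumnTransformer from sklearn.compose; the values are invented, and here the encoder is applied directly to the Gender strings, which is a simplification made for this example.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows: Gender, Age, Albumin (illustrative values only).
X = np.array([
    ["Male", 65.0, 3.3],
    ["Female", 50.0, 3.2],
    ["Female", 48.0, 3.5],
], dtype=object)

encoder = ColumnTransformer(
    [("gender", OneHotEncoder(), [0])],  # one-hot encode column 0 only
    remainder="passthrough",             # keep Age and Albumin unchanged
    sparse_threshold=0,                  # always return a dense array
)
X_encoded = encoder.fit_transform(X).astype(float)

print(X_encoded.shape)
print(X_encoded[0])
```

The categories are sorted alphabetically, so the first new column marks Female and the second marks Male, followed by the untouched Age and Albumin columns.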
Step 4: Split the dataset into Training Set and Test Set.
A common choice is to use 70% of the data for training and 30% for testing. In our example, we use 80% for the training data and 20% for the test data.
Write the following code inside the Spyder.
# Split the data between the Training Data and Test Data
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
Run the code, and you get four more variables. So, we have a total of seven variables.
X is split into X_train and X_test, and Y is split into Y_train and Y_test.
So, you have 80% of the data in X_train and Y_train and 20% in X_test and Y_test.
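You can confirm the 80/20 proportions on a small example. The sketch below uses ten hypothetical rows built with NumPy instead of the patient data, so the expected sizes are easy to work out: 8 rows for training and 2 for testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical samples with two features each; row i is [2*i, 2*i + 1].
X = np.arange(20).reshape(10, 2)
# A made-up binary label per row.
Y = np.arange(10) % 2

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
print(Y_train.shape, Y_test.shape)
```

The rows are shuffled before splitting, but each row of X stays paired with its own label in Y. Fixing random_state makes the same split come out on every run.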
Step 5: Feature Scaling
Many machine learning algorithms are based on the Euclidean distance between data points.
The Albumin and Age columns have entirely different ranges of values, so the column with the larger values would dominate the distance. We need to convert those values so that they fall within a comparable range.
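To get a feel for the problem, consider two hypothetical patients described only by Age and Albumin. The numbers below are invented for illustration; the sketch measures how much of the squared Euclidean distance comes from the Age column alone.

```python
import numpy as np

# Two hypothetical patients: [Age, Albumin] (illustrative values only).
patient_a = np.array([30.0, 3.0])
patient_b = np.array([60.0, 4.5])

# Euclidean distance between the two unscaled rows.
distance = np.linalg.norm(patient_a - patient_b)

# Share of the squared distance contributed by the Age difference.
age_share = (patient_a[0] - patient_b[0]) ** 2 / distance ** 2

print(distance)
print(age_share)
```

Almost all of the distance comes from Age, even though the Albumin values differ by a large amount relative to their own range. Scaling puts both columns on a similar footing.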
This conversion is called feature scaling. So let us scale X_train and X_test.
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
Here, we do not need to scale Y because it holds the encoded class labels (0 and 1), not a numeric feature. Now run the above code and type X_train in the Console.
You can see that all the values are now on a comparable scale, and you can also check the X_test variable.
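The sketch below shows what StandardScaler does on a few hypothetical rows. It subtracts each column's training mean and divides by its training standard deviation, so the scaled training columns end up with mean 0 and standard deviation 1. The arrays are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical [Age, Albumin] rows (illustrative values only).
X_train = np.array([[20.0, 2.0], [40.0, 3.0], [60.0, 4.0]])
X_test = np.array([[50.0, 3.5]])

sc_X = StandardScaler()
# fit_transform learns the mean and standard deviation from the training rows.
X_train_scaled = sc_X.fit_transform(X_train)
# transform reuses those training statistics; nothing is learned from the test rows.
X_test_scaled = sc_X.transform(X_test)

print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))
print(X_test_scaled)
```

The scaler is fitted on the training set only so that the test set does not influence the statistics. That is also why we split the data before scaling: the test rows then behave like unseen data that is scaled with what the model already knows.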
So, we have successfully cleaned and prepared the data.
Here is the final code of our datapre.py.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Jul 25 18:52:15 2018
@author: krunal
"""
# Importing Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing Dataset
dataset = pd.read_csv('patientData.csv')
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 3].values
# Handling Missing Data
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
# Encode Categorical Data
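from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
# categorical_features was removed from OneHotEncoder in scikit-learn 0.22.
# ColumnTransformer one-hot encodes column 0 and passes the other columns through.
columntransformer = ColumnTransformer([('gender', OneHotEncoder(), [0])], remainder = 'passthrough', sparse_threshold = 0)
X = columntransformer.fit_transform(X).astype(float)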
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
# Split the data between the Training Data and Test Data
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
So, we have successfully prepared the dataset for machine learning in Python.
Machine learning involves complex computations, and the preprocessing you need depends on how you obtain the data and what condition it is in. Based on that condition, you preprocess the data and split it into the training and test sets.
That’s it for this tutorial. Thanks for taking it.

Krunal Lathiya is a seasoned Computer Science expert with over eight years in the tech industry. He boasts deep knowledge in Data Science and Machine Learning. Versed in Python, JavaScript, PHP, R, and Golang. Skilled in frameworks like Angular and React and platforms such as Node.js. His expertise spans both front-end and back-end development. His proficiency in the Python language stands as a testament to his versatility and commitment to the craft.
Why do we have to split the dataset before feature scaling, since we are scaling both the train and test sets?
Good article