Python Scikit Learn Tutorial For Beginners With Example

In this tutorial, we will look at Scikit-learn for beginners with an example. Scikit-learn is a machine learning library for Python. It provides regression, classification, and clustering algorithms, including SVMs, gradient boosting, k-means, random forests, and DBSCAN. It is designed to work with the NumPy and Pandas libraries. Most of Scikit-learn is written in Python, and some of its core algorithms are written in Cython (C extensions for Python) for better performance.

Scikit-learn is used to build Machine Learning models. It is not recommended for reading, manipulating, and summarizing data, because libraries like Pandas and NumPy are better suited to that purpose.

Python Scikit Learn Tutorial For Beginners

For this example, there are two ways to run Scikit-learn on your machine: inside a virtual environment or inside a Jupyter Notebook.

If you do not know how to create a virtual environment using Python, then check out my article on that topic.

If you have successfully created the virtual environment, then go inside that folder and activate it using the following command.

source bin/activate

My virtualenv is now active, and I can list the packages installed in that environment using the following command.

pip list

I have already installed the Scikit Learn package.

You can install it using the following command.

pip install scikit-learn

Now, if you are using Jupyter Notebook, then chances are you already have the Scikit-learn package installed.
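To confirm that the package is available in the environment you are using, a quick check like the following can help. It is only a sketch; the version numbers it prints depend on your own installation.

```python
# Check that NumPy and Scikit-learn can be imported in this environment.
import numpy as np
import sklearn

numpy_version = np.__version__
sklearn_version = sklearn.__version__

print("NumPy version:", numpy_version)
print("Scikit-learn version:", sklearn_version)
```

If either import fails, the package is not installed in the active environment.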

Python Scikit Learn Example

Now, you have two choices. You can use Jupyter Notebook, or you can use virtualenv, write the code in a code editor like Visual Studio Code, and run the file in the console.

For this example, I am using Python Jupyter Notebook. So, open up the notebook.

Let’s dive into the code and see Scikit-learn in action.

To work with Scikit-learn, we need some data. Let’s create some data using NumPy.

Step 1: Import NumPy and Scikit learn

So, let’s import two libraries.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

First, we imported the NumPy library, and then we imported the MinMaxScaler class from the sklearn.preprocessing module.

MinMaxScaler is used when we need to apply feature scaling to the data.

Here, feature scaling means that, for each column, the scaler finds the lowest and highest values, subtracts the lowest value from every entry, and divides the result by the difference between the highest and lowest values. After that, the column contains only values between 0 and 1.

MinMaxScaler is probably one of the most popular scaling algorithms when working with large sets of data, especially when building a Machine Learning model.
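To make the arithmetic concrete, here is a small sketch that scales one column by hand with (x - min) / (max - min) and compares the result with MinMaxScaler using its default range of 0 to 1. The sample array is made up for illustration and is not part of the tutorial data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A tiny hand-made column so the arithmetic is easy to follow.
sample = np.array([[10.0], [20.0], [40.0]])

# Min-max scaling by hand: (x - min) / (max - min), computed per column.
col_min = sample.min(axis=0)
col_max = sample.max(axis=0)
manual = (sample - col_min) / (col_max - col_min)

# The same scaling done by MinMaxScaler with its default settings.
scaled = MinMaxScaler().fit_transform(sample)

print(manual.ravel())
print(scaled.ravel())
print(np.allclose(manual, scaled))
```

The lowest value (10) maps to 0, the highest (40) maps to 1, and 20 lands one third of the way between them.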

Step 2: Create demo data using NumPy

Let’s create some data using the NumPy library. Write the following code in the next cell.

demoData = np.random.randint(10, 100, (10 ,2))
demoData

So, we have created random integer data from 10 up to (but not including) 100, with ten rows and two columns. The data is randomly generated, so yours might be different. For now, focus on the Sklearn algorithms.
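If you would like the same numbers on every run, one option is to seed NumPy's random number generator before creating the data. This is a sketch; the seed value 0 is arbitrary and is not used anywhere else in the tutorial.

```python
import numpy as np

# Seeding makes the "random" data repeatable between runs.
np.random.seed(0)
first = np.random.randint(10, 100, (10, 2))

# Re-seeding with the same value reproduces exactly the same array.
np.random.seed(0)
second = np.random.randint(10, 100, (10, 2))

print(first.shape)
print(np.array_equal(first, second))
# The upper bound of randint is exclusive, so the largest possible value is 99.
print(first.min() >= 10, first.max() <= 99)
```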

Step 3: Let’s transform the data

Now that we have created the demo data, it is time to scale it using feature scaling. First, create a MinMaxScaler object. Then pass demoData to that object's fit_transform method.

scalar_model = MinMaxScaler()
scalar_model.fit_transform(demoData)

If your version of Scikit-learn prints a red warning here, you can ignore it; it only tells us that the integer data was converted to floating-point data when MinMaxScaler transformed the values.
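The fit_transform call does two things at once. As a sketch, the two steps can also be run separately, which makes it easier to see what the scaler learned from the data. The variable names and the seed below are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

np.random.seed(0)  # arbitrary seed so the run is repeatable
demo = np.random.randint(10, 100, (10, 2))

scaler = MinMaxScaler()
scaler.fit(demo)                 # learns the min and max of each column
scaled = scaler.transform(demo)  # applies the scaling to the data

# The per-column minimum and maximum found during fit.
print(scaler.data_min_, scaler.data_max_)

# The result is floating-point even though the input was integer.
print(scaled.dtype)

# inverse_transform maps the scaled values back to the original scale.
restored = scaler.inverse_transform(scaled)
print(np.allclose(restored, demo))
```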

Step 4: Create a DataFrame

To split the data, we first need to create a DataFrame, and for that, we need to import the Pandas library.

Now, we will create the demo data again, but this time we will create a larger dataset, create a DataFrame from that data, and then split it into train and test sets.

Write the following code, one snippet per cell.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

Then write the following code in the next cell.

demoData = np.random.randint(1, 500, (20 ,4))
demoData

Now transform the data to apply feature scaling. So, write the following code inside the cell.

scalar_model = MinMaxScaler()
feature_data = scalar_model.fit_transform(demoData)
feature_data

It will display the scaled data.

The next step is to create a DataFrame from the above data.

So, write the following code in the next cell.

import pandas as pd
df = pd.DataFrame(data=feature_data, columns=['k1', 'k2', 'k3', 'labels'])
df

We have imported pandas and created a DataFrame from the above feature_data. We have also defined the column names for the data.
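One way to confirm that every column was scaled is to look at the per-column minimum and maximum of the DataFrame. The sketch below rebuilds a similar DataFrame from scratch so that it runs on its own; the seed and the names raw, frame, and summary are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

np.random.seed(0)  # arbitrary seed so the run is repeatable
raw = np.random.randint(1, 500, (20, 4))
scaled = MinMaxScaler().fit_transform(raw)

frame = pd.DataFrame(data=scaled, columns=['k1', 'k2', 'k3', 'labels'])

# First few rows of the scaled data.
print(frame.head())

# Per-column minimum and maximum; after min-max scaling these are 0 and 1.
summary = frame.agg(['min', 'max'])
print(summary)
```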

We have three columns of feature data and one label column whose values we want to predict. It is a supervised learning problem.

Step 5: Split the data into Train and Test

First, let’s split the data into features and label. So, write the following code in the next cell.

X = df[['k1', 'k2', 'k3']]
y = df['labels']

Now, print X and y in separate cells and see the output.

So, now we have the features X and the label data y that we want to predict.
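The two selections return different kinds of objects, which the following sketch illustrates. It uses a stand-in DataFrame of random values with the same column names; the names frame, features, and target are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed so the run is repeatable
frame = pd.DataFrame(np.random.rand(20, 4),
                     columns=['k1', 'k2', 'k3', 'labels'])

# A list of column names (double brackets) returns a DataFrame.
features = frame[['k1', 'k2', 'k3']]

# A single column name returns a one-dimensional Series.
target = frame['labels']

print(type(features).__name__, features.shape)
print(type(target).__name__, target.shape)
```

The features keep their two-dimensional shape of 20 rows by 3 columns, while the label is a single column of 20 values.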

Let’s do the train and test split.

from sklearn.model_selection import train_test_split

Now, write the following code in the next cell.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

Run the cell.

The above code splits the data into train and test sets in a 7:3 ratio. That means the train data gets 70%, and the test data gets 30% of the rows from the DataFrame.

You can change the percentage you want for the test and train data, but 70:30 is a commonly used ratio for splitting data between train and test.
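As a sketch of how the two arguments behave, the example below splits a small made-up array twice with the same random_state and once with a different test_size. The arrays data and labels are placeholders, not the tutorial's DataFrame.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 20 rows, 2 feature columns, and 20 labels.
data = np.arange(40).reshape(20, 2)
labels = np.arange(20)

# The same random_state gives the same split every time.
a_train, a_test, _, _ = train_test_split(
    data, labels, test_size=0.30, random_state=42)
b_train, b_test, _, _ = train_test_split(
    data, labels, test_size=0.30, random_state=42)
print(np.array_equal(a_test, b_test))

# A different test_size changes how many rows go to each part.
c_train, c_test, _, _ = train_test_split(
    data, labels, test_size=0.20, random_state=42)
print(len(a_train), len(a_test))
print(len(c_train), len(c_test))
```

With 20 rows, test_size=0.30 leaves 14 rows for training and 6 for testing, while test_size=0.20 leaves 16 and 4.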

We can check how much data we got for the train and test sets. Type the following code in the next cell.

X_train.shape

Now, check the shape of y_test in the same way.

You can see that X_train got 14 rows and y_test got six rows. So, the data has been split 70:30.
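To see all four pieces side by side, the following sketch runs the split on a stand-in DataFrame of random values and prints each shape. The DataFrame contents and the seed are illustrative; the column names and split arguments follow the tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

np.random.seed(0)  # arbitrary seed so the run is repeatable
frame = pd.DataFrame(np.random.rand(20, 4),
                     columns=['k1', 'k2', 'k3', 'labels'])

X = frame[['k1', 'k2', 'k3']]
y = frame['labels']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

# Features keep their three columns; labels stay one-dimensional.
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

# Features and labels are split with the same row indices.
print(X_train.index.equals(y_train.index))
```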

You can find more about shape attribute here.

So, scaling and splitting the dataset are among the most crucial steps in Machine Learning. If you want to know how to prepare a dataset in Machine Learning, then check out this article.

Finally, this overview of the basics of Scikit-learn for Machine Learning is over. I hope you enjoyed this Python Scikit Learn tutorial for beginners.
