Python Scikit Learn: The Complete Guide

Scikit-learn is a machine learning library for Python. It provides regression, classification, and clustering algorithms, including SVMs, gradient boosting, k-means, random forests, and DBSCAN.

It is designed to work with the NumPy and Pandas libraries. Scikit-learn is written mostly in Python, and some of its core algorithms are written in Cython (C extensions for Python) for better performance.

Scikit-learn is used to build Machine Learning models. Using it for reading, manipulating, and summarizing data is not recommended, as libraries like Pandas and NumPy are better suited to those tasks.

Python Scikit Learn

There are two ways to run scikit-learn on your machine: inside a virtual environment or inside a Jupyter Notebook.

If you have already created a virtual environment, go into its folder and activate it using the following command.

source bin/activate

My virtualenv is now active, and I can list the packages installed in that environment using the following command.

pip list


I have already installed the Scikit Learn package.

You can install it using the following command.

pip install scikit-learn

If you are using Jupyter Notebook from a distribution such as Anaconda, the scikit-learn package is usually already installed.
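Whichever setup you use, a quick import check confirms that the package is available in the environment you are working in. This is a minimal sketch; the variable name installed_version is only illustrative.

```python
import sklearn

# If this import fails with ImportError, scikit-learn is not installed
# in the currently active environment.
installed_version = sklearn.__version__
print(installed_version)
```

If the import fails, run the pip command above inside the same environment and try again.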


Now, you have two choices. If you want to use Jupyter Notebook, then you can use that, and if you are using virtualenv, write the code in a code editor like Visual Studio Code and run the file in the console.

For this example, I am using Python Jupyter Notebook. So, open up the notebook.

Let’s dive into the code and see scikit-learn in action.

To work with scikit-learn, we need some data.

Let’s create some data using NumPy.

Step 1: Import NumPy and Scikit learn

So, let’s import two libraries.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

First, we imported the NumPy library, and then we imported the MinMaxScaler class from the sklearn.preprocessing module.

MinMaxScaler is used when we need to apply feature scaling to the data.

With min-max scaling, each column is rescaled separately: the column's minimum is subtracted from every value, and the result is divided by the column's range (maximum minus minimum). After scaling, every value in the column lies between 0 and 1, with the minimum mapped to 0 and the maximum mapped to 1.

MinMaxScaler is one of the most commonly used scalers when preparing data for a Machine Learning model.
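To make the arithmetic concrete, here is a small sketch that computes (x - min) / (max - min) by hand and compares it with what MinMaxScaler returns using its default range of 0 to 1. The array and the variable names (column, manual, scaled) are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A tiny hand-made column so the arithmetic is easy to follow.
column = np.array([[10.0], [20.0], [40.0]])

# Manual min-max scaling: (x - min) / (max - min), computed per column.
col_min = column.min(axis=0)
col_max = column.max(axis=0)
manual = (column - col_min) / (col_max - col_min)

# The same scaling done by MinMaxScaler with its default feature_range.
scaled = MinMaxScaler().fit_transform(column)

print(manual.ravel())
print(scaled.ravel())
```

Both lines should print the same values: 0 for the minimum, 1 for the maximum, and roughly 0.33 for the value in between.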

Step 2: Create demo data using NumPy

Let’s create data using the NumPy library.

Write the following code in the next cell.

demoData = np.random.randint(10, 100, (10 ,2))

So, we have created random integers from 10 up to (but not including) 100, arranged in ten rows and two columns. The data is randomly generated, so your values will differ from mine. What matters here is the scikit-learn workflow, not the exact numbers.

See the output.

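If you would like the same random numbers on every run, one option is to seed NumPy's generator before creating the data. The sketch below assumes an arbitrary seed of 0 and uses illustrative variable names (first, second); it is not part of the original steps.

```python
import numpy as np

# Seeding the global generator makes np.random.randint repeatable.
np.random.seed(0)  # the seed value 0 is an arbitrary choice
first = np.random.randint(10, 100, (10, 2))

# Re-seeding with the same value reproduces the same array.
np.random.seed(0)
second = np.random.randint(10, 100, (10, 2))

print(first.shape)                    # (10, 2)
print(np.array_equal(first, second))  # True
```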

Step 3: Let’s transform data

As we have created demo data, now it is time to scale that data. So, we will use feature scaling.

First, create a MinMaxScaler object.

Then pass demoData to that object's fit_transform function.

scalar_model = MinMaxScaler()
scalar_model.fit_transform(demoData)

See the output.


If a red warning appears, you can ignore it; it tells us that the integer data was converted to floating-point data when MinMaxScaler transformed the values.
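The fit_transform call combines two steps: fit learns each column's minimum and maximum, and transform applies the scaling. The sketch below runs them separately on a small hand-picked array and casts the integers to float beforehand, which is one way to sidestep any integer-to-float conversion notice. The names demo, demo_float, and scaler are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Small integer array with hand-picked values (not the tutorial's random data).
demo = np.array([[10, 50], [55, 75], [99, 20]])

# Casting to float up front means the scaler receives floating-point input.
demo_float = demo.astype(np.float64)

scaler = MinMaxScaler()
scaler.fit(demo_float)                 # learns the per-column min and max
scaled = scaler.transform(demo_float)  # applies (x - min) / (max - min)

print(scaler.data_min_)  # per-column minimum seen during fit
print(scaler.data_max_)  # per-column maximum seen during fit
print(scaled)
```

The fitted scaler can also map scaled values back to the original units with scaler.inverse_transform(scaled).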

Step 4: Create a DataFrame

To split the data, we first need to create a DataFrame, and for that, we need to import the Pandas library.

Now, we will create the demo data again, but this time, we will create a larger dataset, create a DataFrame from it, and then split it into train and test sets.

Write the following code, one cell at a time.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

Then write the following code in the next cell.

demoData = np.random.randint(1, 500, (20 ,4))


Now apply feature scaling to the data. So, write the following code inside the cell.

scalar_model = MinMaxScaler()
feature_data = scalar_model.fit_transform(demoData)

The scaled data is now stored in feature_data. Evaluate feature_data in a cell to display it.

See the output.


The next step is to create a DataFrame from the above data.

So, write the following code in the next cell.

import pandas as pd
df = pd.DataFrame(data=feature_data, columns=['k1', 'k2', 'k3', 'labels'])

We have imported pandas and created a DataFrame from the above feature_data. We have also defined the column names for the data. See the scaled data.


We have three feature columns and one label column, which holds the values we want to predict. That makes this a supervised learning problem.
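A quick way to confirm that the scaling worked is to look at each column's minimum and maximum in the DataFrame. The sketch below rebuilds the same kind of data end to end so it can run on its own; the seed of 42 is an arbitrary assumption added only to make it repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

np.random.seed(42)  # arbitrary seed, only so the sketch is repeatable
demoData = np.random.randint(1, 500, (20, 4))
feature_data = MinMaxScaler().fit_transform(demoData)
df = pd.DataFrame(data=feature_data, columns=['k1', 'k2', 'k3', 'labels'])

print(df.shape)  # (20, 4)
print(df.min())  # each column's minimum after scaling
print(df.max())  # each column's maximum after scaling
```

Every column should report a minimum of 0 and a maximum of 1 (up to floating-point rounding).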

Step 5: Split the data into Train and Test

First, let’s split the data between features and labels.

So, write the following code in the next cell.

X = df[['k1', 'k2', 'k3']]
y = df['labels']

Now, print X and then y, and see the output of each.

So, now we have the features X and the labels y that we want to predict.
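Note the difference between the two selections: double brackets return a DataFrame, while single brackets return a Series. A minimal sketch with a made-up two-row DataFrame shows the resulting types and shapes.

```python
import pandas as pd

# Made-up values; only the column layout matches the tutorial.
df = pd.DataFrame({
    'k1': [0.1, 0.5],
    'k2': [0.2, 0.6],
    'k3': [0.3, 0.7],
    'labels': [0.0, 1.0],
})

X = df[['k1', 'k2', 'k3']]  # list of column names -> DataFrame
y = df['labels']            # single column name -> Series

print(type(X).__name__, X.shape)  # DataFrame (2, 3)
print(type(y).__name__, y.shape)  # Series (2,)
```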

Let’s do the train and test split.

from sklearn.model_selection import train_test_split

Now, write the following code in the next cell.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

Run the cell.

The above code splits the data into training and test sets in a 70:30 ratio (7:3). This means that the training data gets 70% of the rows in the DataFrame, and the test data gets 30%.

You can change the percentages for the train and test data, but 70:30 is a commonly used split.
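As an illustration of changing the ratio, the sketch below uses test_size=0.20 on placeholder arrays and also shows that reusing the same random_state reproduces the same split. The arrays and the extra variable names (X_train2, X_test2) are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)  # 20 rows of placeholder features
y = np.arange(20)                 # 20 placeholder labels

# test_size=0.20 keeps 80% of the rows for training and 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)
print(len(X_train), len(X_test))  # 16 4

# The same random_state reproduces exactly the same split.
X_train2, X_test2, _, _ = train_test_split(
    X, y, test_size=0.20, random_state=42
)
print(np.array_equal(X_test, X_test2))  # True
```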

We can check how much data we get for the train and test data.

Check the shape attribute of X_train in the next cell.

Now, check the shape of X_test. See the output.


You can see that X_train got 14 rows, and X_test got six rows. So, the data has been split 70:30.
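The row counts can be checked with the shape attribute of each piece. The sketch below uses stand-in random data with the same layout as the tutorial (20 rows, three features, one label) so that it runs on its own; the generator seed is an arbitrary assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data with the tutorial's layout: 20 rows, 3 features, 1 label.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((20, 4)), columns=['k1', 'k2', 'k3', 'labels'])
X = df[['k1', 'k2', 'k3']]
y = df['labels']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

print(X_train.shape, X_test.shape)  # (14, 3) (6, 3)
print(y_train.shape, y_test.shape)  # (14,) (6,)
```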

You can find more about the shape attribute here.

So, scaling and splitting the dataset are among the most crucial steps in Machine Learning, and if you want to know how to prepare a dataset in Machine Learning, check out this article.

That’s it.

