Python is an excellent choice for data analysis, primarily because of its rich ecosystem of data-centric packages. Pandas is one of those packages, and it makes importing and analyzing data much easier.
For this example, I am using Jupyter Notebook. If you are new to Jupyter Notebook and do not know how to install it on your local machine, I recommend you check out my article Getting Started With Jupyter Notebook. It will guide you through installing Jupyter Notebook and getting it up and running.
Pandas read_csv
Pandas read_csv() is a library function that imports data from a CSV file so you can analyze it in Python. It takes a CSV file as input and returns its contents as a DataFrame.
To import data into Jupyter Notebook, we first need some data. I am using the following link to access the Olympics data.
https://docs.google.com/spreadsheets/d/1zeeZQzFoHE2j_ZrqDkVJK9eF7OH1yvg75c8S-aBcxaU/edit#gid=0
Save that file in CSV format inside the local project folder. I have saved it with the filename data.csv.
Now, open Jupyter Notebook and start working on the project. Let's go through the example step by step.
Step 1: Import the Pandas module.
The first step is to import the Pandas module.
Write the following line of code in the first notebook cell and run the cell.
import pandas as pd
This imports the Pandas library into our project.
The next step is to use the read_csv() function to read the csv file and display the content.
Step 2: Use the read_csv() function to display the content
The Pandas read_csv() function has the following syntax.
pandas.read_csv('filename or filepath', ['dozens of optional parameters'])
The read_csv() function has only one required parameter, the filename or file path. The many other parameters are optional, and we will see some of them in this example.
Let’s write the following code in the next cell in Jupyter Notebook.
data = pd.read_csv('data.csv', skiprows=4)
Here, the first parameter is our file’s name, the Olympics data file.
The second argument is skiprows. With skiprows=4, Pandas skips the first four lines of the file and starts reading from the fifth.
Now let's look at the file's content. Add the following code to the third cell in the notebook.
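To see what skiprows does without needing the Olympics file, here is a minimal sketch that reads a small in-memory CSV. The sample text and the variable names raw and sample are made up for illustration; only read_csv() and skiprows come from the example above.

```python
import io

import pandas as pd

# Hypothetical sample: four lines of notes, then the header row and two data rows.
raw = """note line 1
note line 2
note line 3
note line 4
City,Edition,Medal
Athens,1896,Gold
Paris,1900,Silver
"""

# skiprows=4 discards the first four lines, so the fifth line becomes the header.
sample = pd.read_csv(io.StringIO(raw), skiprows=4)

print(sample.columns.tolist())
print(len(sample))
```

Without skiprows, Pandas would try to treat the first note line as the header, which is why the Olympics file needs this argument.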
data
Just type data and press Ctrl + Enter, and you will see output like the image below.
Step 3: Use head() and tail() in Python Pandas
In the step above, we imported many rows. You can view just the first five rows or the last five rows using the head() and tail() functions.
Let’s see these functions in action.
Write the following code in the next cell of the notebook.
data.head()
Now, run the cell and see the output below.
You can see that it has returned the first five rows of that CSV file.
Print the last five rows using the Pandas tail() function.
data.tail()
See the output below.
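Both functions also accept an optional row count if five is not what you want. The sketch below uses a small made-up DataFrame (demo is a hypothetical name, not the Olympics data) to show this.

```python
import pandas as pd

# Hypothetical ten-row frame used only to illustrate head() and tail().
demo = pd.DataFrame({"row": range(10)})

first_three = demo.head(3)  # first three rows
last_two = demo.tail(2)     # last two rows

print(first_three)
print(last_two)
```

The same calls work on the data variable from Step 2, for example data.head(10).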
Step 4: Load a CSV with no headers
We can load a CSV file with no header. Let’s see that in action.
Go back to the cell from the second step and change the code as shown below.
data = pd.read_csv('data.csv', skiprows=4, header=None)
data
Here, we have added one parameter, header=None. It tells Pandas that the file has no header row, so the first line it reads is treated as data and the columns are labeled with integers (0, 1, 2, and so on). Now, rerun the code, and you will see output like the image below.
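As a rough illustration of the difference header=None makes, the sketch below reads the same two-line sample with and without it. The sample text and variable names are invented for this demo.

```python
import io

import pandas as pd

# Hypothetical headerless sample: every line is a data row.
raw = "Athens,1896,Gold\nParis,1900,Silver\n"

# Default behaviour: the first line is consumed as the header.
with_header = pd.read_csv(io.StringIO(raw))

# header=None: every line is data and the columns get integer labels.
no_header = pd.read_csv(io.StringIO(raw), header=None)

print(with_header.columns.tolist(), len(with_header))
print(no_header.columns.tolist(), len(no_header))
```

With the default settings, one row of real data is lost to the header, so header=None is the safer choice for files that truly have no header line.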
Step 5: Load a CSV with specifying column names
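In this case, we load the CSV and supply the column names ourselves with the names parameter. See the code below.
data = pd.read_csv('data.csv', names=['City', 'Edition', 'Sport', 'NOC', 'Gender', 'Medal'])
data
The names parameter assigns these labels to the columns. It does not pick out a subset of columns; to load only certain columns, use the usecols parameter. Also note that when names is passed without header=0, Pandas reads the first line of the file as data rather than as a header.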
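Since names and usecols are easy to mix up, here is a small sketch of each on an in-memory sample. The sample data and the lowercase labels are assumptions made for the demo; header=0 is added so that the existing header line is replaced instead of being read as data.

```python
import io

import pandas as pd

# Hypothetical sample with a header row and four columns.
raw = (
    "City,Edition,Sport,Medal\n"
    "Athens,1896,Aquatics,Gold\n"
    "Paris,1900,Athletics,Silver\n"
)

# names= relabels the columns; header=0 says the file's own header should be replaced.
renamed = pd.read_csv(
    io.StringIO(raw), header=0, names=["city", "year", "sport", "medal"]
)

# usecols= loads only the listed columns and keeps their original labels.
subset = pd.read_csv(io.StringIO(raw), usecols=["City", "Medal"])

print(renamed.columns.tolist())
print(subset.columns.tolist())
```

In short, reach for names when you want different labels and for usecols when you want fewer columns.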
Check out the original documentation to find out more about the pandas read_csv() function.
That’s it for this tutorial.
python3 issue with NaN … df shows NaN but df1 shows .
Since I pass na_values=['.'], I expect df to show me .
df = pd.read_csv('f.csv', na_values=['.']); print(df, "\n")
df1 = df.fillna("."); print(df1)