AppDividend
Latest Code Tutorials

How to Read Excel File in Python using Pandas read_excel()

0

To import and read excel file in Python, use the Pandas read_excel() method. Pandas read_excel() is to read the excel sheet data into a DataFrame object. It is represented in a two-dimensional tabular view. With the help of read_excel(), we can also get the header details. It usually converts from csv, dict, json representation to DataFrame object.

A lot of work in Python revolves around working on different datasets, which are mostly present in the form of csv, json representation. So to use the data of those datasets in Python, we have a pandas library, and in that library, we have a method called read_excel() to read data from the excel files.

Syntax

pandas.read_excel(io, sheet_name=0, header=0, 
names=None, index_col=None, 
usecols=None, squeeze=False, dtype=None, 
engine=None, converters=None, true_values=None, 
false_values=None, skiprows=None, nrows=None, 
na_values=None, keep_default_na=True, verbose=False, 
parse_dates=False, date_parser=None, thousands=None, 
comment=None, skipfooter=0, convert_float=True, 
mangle_dupe_cols=True, **kwds)

Parameters

The read_excel() has the following parameters:

  1. io: It can be present in the form of str, bytes, ExcelFile, xlrd.Book etc. Any valid string is acceptable. It can be a URL in the form of http, https, s3, etc.
  2. sheet_name: By default value is 0. It can have values str, int, list, or None. Strings are used for sheet names.
  3. Header: Its default value is 0. It can have values of datatype int or list of int. Row to be used for the column labels of the parsed DataFrame.
  4. names: By default, its value is None. It accepts an array-like structure. It is the column names to use.
  5. index_col: It also takes None by default. It can have values of type int or list of lint. Column to use as the row labels of the DataFrame. If there is no such column, then the user is supposed to pass None.
  6. usecols: It also takes none by default and can have values such as int, str, list-like structures.
  7. squeeze: It takes Boolean values. By default, its value is False. 
  8. dtype: It is the type name or the dict of the column.
  9. engine: It takes str values and, by default, takes None. If io is not a buffer or path, this must be set to identity io.
  10. converters: It is a dictionary, and by default, it takes None. Dictionary of functions for converting values in specific columns. 
  11. true_values: It takes a list, by default, its None. Values to consider as true.
  12. false_values: It also takes a list, by default, its None. Values to be considered False.
  13. skiprows: It is the rows to skip at the beginning (0-indexed).
  14. nrows: It represents the number of rows to parse.
  15. na_values: It is the additional string to recognize NA/NaN values.
  16. keep_default_na: It takes Boolean values, and by default, it is set to True. It defines whether to take default NaN values when parsing the data or not.
  17. na_filter: It takes a Boolean value, and by default, it is True. It detects the missing value markers.
  18. verbose: It indicates several NA values placed in non-numeric columns.
  19. parse_dates: It takes bool, list-like, or dict. By default, its value is set to False.
  20. date_parser: It is an optional function. This function is used for converting a sequence of string columns to an array of datetime instances.
  21. thousands: It takes str values. By default, it is None. Thousand separators for parsing string columns to numeric.
  22. comment: It also takes str values. By default, it is None. It comments out the remainder of the line.
  23. skipfooter: It takes integer values. By default, it is set to 0. It tells the rows at the end to skip.
  24. convert_float: It takes a boolean value, and its default value is true. It converts integral floats to int. 
  25. mangle_dupe_cols: It takes boolean values and, by default, takes True. In this duplicate columns will be specified as X, X.1, X.2….X.N, rather than X, X, X,…..X. Passing False as the parameter will cause data to be overwritten if there are duplicate names in the column.
  26. **kwds: It is an optional parameter. The optional keyword is arguments can be passed to TextFileReader.

Return Value

The read_excel() method returns the DataFrame or Dictionary of DataFrames. The DataFrame contains the data of the excel sheet.

Example program on pandas.read_excel()

Write a program to show the working of pandas.read_excel().

In this example, we are using a readfile.xlsx file. You can download it from here.

import pandas as pd

df = pd.read_excel('readfile.xlsx', index_col=0)
print(df)

Output

In the output, you might get the following error, depending on the dependency installed on your machine.

ImportError: Missing optional dependency ‘xlrd’. Install xlrd >= 1.0.0 for Excel support Use pip or conda to install xlrd.

To solve this ImportError, you have to install the xlrd module. Type the following command.

python3 -m pip install xlrd

It will install the module and now rerun the file.

  Name  Roll no  maths  science  english
0  Rohit        1     93       88       93
1   Arun        2     63       66       74
2  Sohit        3     55       66       84
3   Arun        4     94       94       92
4  Shubh       33     83       77       87

The first parameter is the name of the excel file.

When we print the DataFrame object, the output is a two-dimensional table. It looks similar to an excel sheet records.

If the first column in the Excel or CSV file has index values, then you can do the following to remove the Unnamed column in Pandas.

pd.read_excel('readfile.xlsx', index_col=0)

If we defined index_col = 0, then it will ignore the first unnamed column.

If the unnamed column is other than first, then you can write the following line to remove from any index.

import pandas as pd

df = pd.read_excel('readfile.xlsx')
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
print(df)

Output

 Name  Roll no  maths  science  english
0  Rohit        1     93       88       93
1   Arun        2     63       66       74
2  Sohit        3     55       66       84
3   Arun        4     94       94       92
4  Shubh       33     83       77       87

Here in the above code, we can see that we have used the read_excel() method to extract the data of an xlsx (excel file), which was previously created and saved in the same folder as of the py file with data of some students.

Get the List of Column Headers of the Excel Sheet

To get the list of column headers, use columns.ravel() method.

import pandas as pd

df = pd.read_excel('readfile.xlsx', index_col=0)
cols = df.columns.ravel()
print(cols)

Output

['Name' 'Roll no' 'maths' 'science' 'english']

You can see that we get the list of all the columns of DataFrame.

Printing a Column Data as a list

To print the column data as a list, use the df.tolist() method. The tolist() function converts the specific column values to the list.

Let’s get the list of values of the Name column.

import pandas as pd

df = pd.read_excel('readfile.xlsx', index_col=0)
vals_list = df['Name'].tolist()
print(vals_list)

Output

['Rohit', 'Arun', 'Sohit', 'Arun', 'Shubh']

Reading Excel File without Header Row

If the excel sheet doesn’t have any header row, pass the header parameter value as None.

See the following code.

import pandas as pd

df = pd.read_excel('readfile.xlsx', index_col=0, header=None)
print(df)

If you want to act header as a specific row, then you have to pass the header value as an integer.

Let’s say if you have passed the 4th row as a header row, then the fourth row will be treated as the header row, and the values will be read from the next row onwards.

Convert Excel Data to Dict, CSV, and JSON

To convert a DataFrame to JSON, use Pandas to_json() method.

To convert a DataFrame to CSV, use Pandas DataFrame to_csv() method.

To convert a DataFrame to Dictionary, use Pandas DataFrame to_dict() method.

import pandas as pd

df = pd.read_excel('readfile.xlsx', index_col=0)

print('Excel Data to Dict:', df.to_dict(orient='record'))
print('----------------------------------------------------------------------')
print('Excel Data to JSON:', df.to_json(orient='records'))
print('-----------------------------------------------------------------------')
print('Excel Data to CSV:\n', df.to_csv(index=False))

Output

Excel Data to Dict: [{'Name': 'Rohit', 'Roll no': 1, 'maths': 93, 'science': 88, 'english': 93}, {'Name': 'Arun', 'Roll no': 2, 'maths': 63, 'science': 66, 'english': 74}, {'Name': 'Sohit', 'Roll no': 3, 'maths': 55, 'science': 66, 'english': 84}, {'Name': 'Arun', 'Roll no': 4, 'maths': 94, 'science': 94, 'english': 92}, {'Name': 'Shubh', 'Roll no': 33, 'maths': 83, 'science': 77, 'english': 87}]
----------------------------------------------------------------------
Excel Data to JSON: [{"Name":"Rohit","Roll no":1,"maths":93,"science":88,"english":93},{"Name":"Arun","Roll no":2,"maths":63,"science":66,"english":74},{"Name":"Sohit","Roll no":3,"maths":55,"science":66,"english":84},{"Name":"Arun","Roll no":4,"maths":94,"science":94,"english":92},{"Name":"Shubh","Roll no":33,"maths":83,"science":77,"english":87}]
-----------------------------------------------------------------------
Excel Data to CSV:
 Name,Roll no,maths,science,english
Rohit,1,93,88,93
Arun,2,63,66,74
Sohit,3,55,66,84
Arun,4,94,94,92
Shubh,33,83,77,87

Conclusion

To import an Excel file into Python using pandas, use the pd.read_excel() method. For an earlier version of Excel, you may need to use the file extension of ‘xls’ instead of ‘xlsx’.

See also

How to import CSV file in Pandas

Export Pandas DataFrame to CSV

Convert Pandas JSON to CSV

Pandas ExcelWriter()

Pandas DataFrame to Numpy array

Leave A Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.