Python regex: The Complete Guide


A regular expression is a special text string that describes a search pattern. It is useful for extracting information from text such as code, files, logs, spreadsheets, or documents.

When using regular expressions, the first thing to recognize is that everything is essentially a character, and we write patterns to match a specific sequence of characters, also referred to as a string. ASCII (Latin) letters are the ones on your keyboard, while Unicode covers text in other scripts.

Python regex

A Python regex, or regular expression, is a sequence of characters that forms a search pattern. A regex can check whether a string contains the specified search pattern. The pattern can include letters, digits, punctuation, and special characters like $#@!%, etc.

For instance, a regular expression could tell the program to search for specific text in a string and then print the result accordingly. An expression can include the following.

  1. Text matching
  2. Repetition
  3. Branching
  4. Pattern-composition etc.
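
The four building blocks listed above can be sketched in a few lines. This is only an illustration: the sample text and the variable names are made up for this example and do not come from any particular codebase.

```python
import re

text = "cat, caat, dog, bird"

# 1. Text matching: a literal pattern matches itself.
literal = re.findall(r"dog", text)

# 2. Repetition: "a+" means one or more "a" characters.
repetition = re.findall(r"ca+t", text)

# 3. Branching: "|" matches either alternative.
branching = re.findall(r"dog|bird", text)

# 4. Pattern composition: a small pattern reused inside a larger one.
animal = r"(?:ca+t|dog)"
composed = re.findall(animal + r", " + animal, text)

print(literal)     # ['dog']
print(repetition)  # ['cat', 'caat']
print(branching)   # ['dog', 'bird']
print(composed)    # ['cat, caat']
```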

Python RegEx Module

We can import the Python re module using the following code.

import re

  1. The “re” module is included with Python and is primarily used for string searching and manipulation.
  2. It is also used frequently for web “scraping”, that is, extracting large amounts of data from websites.

Search the string to see if it starts with “The” and ends with “Australia”.

# app.py

import re

data = "The rain in Australia"
# ^The anchors the start, .* matches anything in between, Australia$ anchors the end
x = re.search(r"^The.*Australia$", data)
if x:
  print("YES! We have a match!")
else:
  print("No match")

See the output.

YES! We have a match!

You can see the return object from the search function.

# app.py

import re

data = "The rain in Australia"
x = re.search("^The.*Australia$", data)
print(x)

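See the following output. The exact class name in the repr depends on the Python version; recent Python 3 releases print the following.

<re.Match object; span=(0, 21), match='The rain in Australia'>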

Python regex functions

The Python re module offers a set of functions for searching a string for matches.

Function  Description
findall   Returns the list containing all matches
search    Returns the Match object if there is a match anywhere in the string
split     Returns the list where a string has been split at each match
sub       It replaces one or many matches with a string

 

Python RegEx findall()

Python re.findall() is a built-in Python regex function that returns a list containing all matches.

# app.py

import re

data = "The rain in Australia"
x = re.findall("Aus", data)
print(x)

See the below output.

['Aus']

The list contains the matches in the order they are found. If no matches are found, an empty list is returned. Matching is case-sensitive by default.

See the following code.

# app.py

import re

data = "The rain in Australia"
x = re.findall("aus", data)
print(x)

See the output.

[]
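
If a case-insensitive search is wanted, one option is the re.IGNORECASE flag (short form re.I), which is part of the standard re module. A minimal sketch using the same sample string:

```python
import re

data = "The rain in Australia"

# Default behaviour: "aus" does not match "Aus" because case matters.
case_sensitive = re.findall(r"aus", data)

# With re.IGNORECASE the same pattern matches regardless of case.
case_insensitive = re.findall(r"aus", data, flags=re.IGNORECASE)

print(case_sensitive)    # []
print(case_insensitive)  # ['Aus']
```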

Python RegEx search()

The search() is a built-in Python regex function that searches the string for a match and returns a Match object if there is one, or None if there is not. If there is more than one match, only the first occurrence is returned.

# app.py

import re

data = "The rain in Australia"
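# A raw string (r"...") keeps the backslash in \s from being treated as a string escape
pos = re.search(r"\s", data)
print("The first white-space character is located at position:", pos.start())

See the following output.

The first white-space character is located at position: 3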
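
Because search() returns None when nothing matches, calling a method such as start() on the result without checking it first raises an AttributeError. The sketch below shows one way to guard against that; the helper name first_whitespace is hypothetical and chosen only for this illustration.

```python
import re

data = "The rain in Australia"


def first_whitespace(text):
    """Return the index of the first white-space character, or None."""
    match = re.search(r"\s", text)
    if match is None:
        return None
    return match.start()


found = first_whitespace(data)           # 3
missing = first_whitespace("Australia")  # None

# A Match object also exposes the matched text and its (start, end) span.
m = re.search(r"rain", data)
matched_text = m.group()  # 'rain'
matched_span = m.span()   # (4, 8)

print(found, missing, matched_text, matched_span)
```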

Python regex split()

The split() is a built-in Python regex function that returns the list where the string has been split at each match.

Now, let’s split at each white-space character.

# app.py

import re

data = "The rain in Australia"
result = re.split(r"\s", data)
print(result)

See the following output.

['The', 'rain', 'in', 'Australia']

You can control the number of occurrences by specifying the maxsplit parameter.

Let’s split the string only at the first occurrence.

# app.py

import re

data = "The rain in Australia"
# Passing maxsplit by keyword makes the intent explicit
result = re.split(r"\s", data, maxsplit=1)
print(result)

See the output.

['The', 'rain in Australia']
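
The separator does not have to be a single character class. As a rough sketch, the pattern below splits on commas or semicolons together with any surrounding white space; the sample record is invented for this example.

```python
import re

record = "apple, banana;cherry ,  date"

# \s*[,;]\s* = optional white space, a comma or semicolon, optional white space
fields = re.split(r"\s*[,;]\s*", record)

# maxsplit limits how many splits are made; the remainder is left untouched.
first_two = re.split(r"\s*[,;]\s*", record, maxsplit=2)

print(fields)     # ['apple', 'banana', 'cherry', 'date']
print(first_two)  # ['apple', 'banana', 'cherry ,  date']
```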

Python regex sub()

The sub() is a built-in regex function that replaces the matches with a text of your choice. Let’s replace every white-space character with the string ‘~~~’.

# app.py

import re

data = "The rain in Australia"
result = re.sub(r"\s", "~~~", data)
print(result)

See the output.

The~~~rain~~~in~~~Australia
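
sub() also accepts a count argument, backreferences such as \1 in the replacement, and a function as the replacement. The following is a small sketch of those three options, not an exhaustive tour; the variable names are illustrative only.

```python
import re

data = "The rain in Australia"

# count=1 replaces only the first match.
first_only = re.sub(r"\s", "~~~", data, count=1)

# \1 and \2 in the replacement refer to the captured groups.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "rain Australia")

# A function receives each Match object and returns the replacement text.
shouted = re.sub(r"\b\w{4}\b", lambda m: m.group().upper(), data)

print(first_only)  # The~~~rain in Australia
print(swapped)     # Australia rain
print(shouted)     # The RAIN in Australia
```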

Python Metacharacters

Metacharacters are characters with a special meaning. They are listed below.

Character  Description                                                 Example
[]         A set of characters                                         "[a-m]"
\          Signals a special sequence, or escapes a special character  "\d"
.          Any character (except the newline character)                "he..o"
^          Starts with                                                 "^hello"
$          Ends with                                                   "world$"
*          Zero or more occurrences                                    "aix*"
+          One or more occurrences                                     "aix+"
{}         Exactly the specified number of occurrences                 "al{2}"
|          Either or                                                   "falls|stays"
()         Capture and group
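
To see the metacharacters in action, the example patterns from the table can be run against short strings. The test strings here are made up for illustration, and the capture-group pattern is an assumed example, since the table gives none for ().

```python
import re

results = {
    "[a-m]": re.findall(r"[a-m]", "hello"),
    r"\d": re.findall(r"\d", "room 42"),
    "he..o": re.findall(r"he..o", "hello helio"),
    "^hello": re.findall(r"^hello", "hello world"),
    "world$": re.findall(r"world$", "hello world"),
    "aix*": re.findall(r"aix*", "ai aix aixx"),
    "aix+": re.findall(r"aix+", "ai aix aixx"),
    "al{2}": re.findall(r"al{2}", "al all"),
    "falls|stays": re.findall(r"falls|stays", "it falls or stays"),
    # With groups, findall returns the captured parts as tuples.
    r"(\w+)@(\w+)": re.findall(r"(\w+)@(\w+)", "user@example"),
}

for pattern, matches in results.items():
    print(pattern, "->", matches)
```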

 

Python Special Sequences

A special sequence is a \ followed by one of the characters in the list below, and it has a special meaning.

Character  Description                                                       Example
\A         Matches only at the start of the string                           "\AThe"
\b         Matches at a word boundary (the start or end of a word)           r"\bain", r"ain\b"
\B         Matches where there is NOT a word boundary                        r"\Bain", r"ain\B"
\d         Matches any digit (0-9)                                           "\d"
\D         Matches any character that is NOT a digit                         "\D"
\s         Matches any white-space character                                 "\s"
\S         Matches any character that is NOT white space                     "\S"
\w         Matches any word character (a-z, A-Z, 0-9, and the underscore _)  "\w"
\W         Matches any character that is NOT a word character                "\W"
\Z         Matches only at the end of the string                             "Spain\Z"
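
The special sequences can be checked the same way. The sketch below uses a made-up sentence and raw strings throughout, so that sequences like \b reach the regex engine instead of being read as string escapes.

```python
import re

text = "The rain in Spain"

starts_with_the = re.findall(r"\AThe", text)   # ['The']
ain_at_word_start = re.findall(r"\bain", text) # [] (no word starts with "ain")
ain_at_word_end = re.findall(r"ain\b", text)   # ['ain', 'ain']
ain_inside_word = re.findall(r"\Bain", text)   # ['ain', 'ain']
ends_with_spain = re.findall(r"Spain\Z", text) # ['Spain']

digits = re.findall(r"\d", "Room 42")          # ['4', '2']
non_digits = re.findall(r"\D", "4a2b")         # ['a', 'b']
spaces = re.findall(r"\s", "a b")              # [' ']
words = re.findall(r"\w+", "snake_case 9!")    # ['snake_case', '9']
non_words = re.findall(r"\W", "a-b")           # ['-']

print(ain_at_word_start, ain_at_word_end, words)
```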

 

Python Sets

A set is a set of characters inside a pair of square brackets [] with a special meaning.

Set         Description
[arn]       Returns the match where one of the specified characters (a, r, or n) is present
[a-n]       Returns the match for any lower case character, alphabetically between a and n
[^arn]      Returns a match for any character EXCEPT a, r, and n
[0123]      Returns the match where any of the specified digits (0, 1, 2, or 3) are present
[0-9]       Returns a match for any digit between 0 and 9
[0-5][0-9]  Returns the match for any two-digit number from 00 to 59
[a-zA-Z]    Returns the match for any character alphabetically between a and z, lower case OR upper case
[+]         In sets, +, *, ., |, (), $, and {} have no special meaning, so [+] means: return a match for any + character in the string
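
A few of the sets above, applied to short sample strings. The inputs are invented for this sketch.

```python
import re

any_of = re.findall(r"[arn]", "rain")            # ['r', 'a', 'n']
none_of = re.findall(r"[^arn]", "rain")          # ['i']
letters = re.findall(r"[a-zA-Z]", "Py3")         # ['P', 'y']
two_digits = re.findall(r"[0-5][0-9]", "08:59:60")  # ['08', '59']

# Inside a set, + is an ordinary character.
plus_signs = re.findall(r"[+]", "1+1=2")         # ['+']

print(any_of, none_of, letters, two_digits, plus_signs)
```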

 

Example of the \w+ and ^ Expressions

See the following expressions.
  1. “^”: This expression matches the start of the string.
  2. “\w+”: This expression matches one or more word characters (letters, digits, or the underscore).

Here we will see how to use the \w+ and ^ expressions in our code. We covered the re.findall() function earlier in this tutorial, so here we focus on the \w+ and ^ expressions.

For example, for our string “appdividend, is fun”, if we execute the code with \w+ and ^, it will give the output “appdividend”.

See the following code.

# app.py

import re

data = "appdividend, is fun"
result = re.findall(r"^\w+", data)
print(result)

See the output.

['appdividend']

Remember, if you remove the + sign from \w+, the output will change: it will give only the first character of the first word, i.e., ['a'].
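
The effect of the + quantifier and the ^ anchor can be compared side by side. This is a minimal sketch with the same sample string.

```python
import re

data = "appdividend, is fun"

# ^\w matches exactly one word character at the start of the string.
one_char = re.findall(r"^\w", data)

# ^\w+ keeps matching word characters until the comma stops it.
first_word = re.findall(r"^\w+", data)

# Without ^, every run of word characters is returned.
all_words = re.findall(r"\w+", data)

print(one_char)    # ['a']
print(first_word)  # ['appdividend']
print(all_words)   # ['appdividend', 'is', 'fun']
```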

Python re.match()

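The match() is a built-in Python regex function that tries to match the RE pattern at the beginning of the string, with optional flags. In the example below, the pattern “(a\w+)\W(g\w+)” captures a word starting with ‘a’, followed by a single non-word character, followed by a word starting with ‘g’; a string whose second word does not start with ‘g’ is not matched.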

To check the match for each element in the list or string, run the loop.

# app.py

import re

listA = ["appdividend10 giveaway",
        "appdividend10 giveup",
        "appdividend javascript"]

for element in listA:
  # Group 1: a word starting with "a"; group 2: a word starting with "g"
  result = re.match(r"(a\w+)\W(g\w+)", element)
  if result:
    print(result.groups())

See the following output.

('appdividend10', 'giveaway')
('appdividend10', 'giveup')
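
Since match() only looks at the beginning of the string, it can help to compare it with search() and fullmatch(), both of which are also part of the re module. A short sketch using one of the strings from the list above:

```python
import re

text = "appdividend10 giveaway"

# match() anchors at the start, so a pattern for the second word fails.
at_start = re.match(r"g\w+", text)

# search() scans the whole string and finds the second word.
anywhere = re.search(r"g\w+", text)

# fullmatch() succeeds only if the pattern covers the entire string.
partial = re.fullmatch(r"a\w+", text)
whole = re.fullmatch(r"a\w+\W\w+", text)

print(at_start)           # None
print(anywhere.group())   # giveaway
print(partial)            # None
print(whole.group())      # appdividend10 giveaway
```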

Summary

A regular expression is a special text string used for describing a search pattern. It can include letters, digits, punctuation, and special characters like $#@!%, etc. An expression can include the following.

  1. Text matching
  2. Repetition
  3. Branching
  4. Pattern-composition etc.

In Python, regular expressions (also called REs, regexes, or regex patterns) are available through the re module.

  1. The “re” module included with Python is primarily used for string searching and manipulation.
  2. It is also used frequently for web page “scraping” (extracting large amounts of data from websites).
  3. Regular expression functions include re.match(), re.search(), and re.findall().
  4. Many Python regex functions take an optional argument called flags.
  5. These flags can modify the meaning of a given regex pattern.
  6. Commonly used flags include re.M, re.I, re.S, etc.
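
As a rough illustration of the flags mentioned above: re.M (MULTILINE) makes ^ and $ work per line, re.I (IGNORECASE) ignores case, and re.S (DOTALL) lets . match a newline. The sample text is made up for this sketch.

```python
import re

text = "first line\nSecond LINE"

# Without re.M, ^ matches only at the very start of the string.
default_start = re.findall(r"^\w+", text)
per_line_start = re.findall(r"^\w+", text, flags=re.M)

# re.I makes the match case-insensitive.
any_case = re.findall(r"line", text, flags=re.I)

# Without re.S, "." stops at the newline, so this search fails.
no_dotall = re.search(r"first.*LINE", text)
with_dotall = re.search(r"first.*LINE", text, flags=re.S)

# Flags can be combined with the | operator.
combined = re.findall(r"^\w+ line$", text, flags=re.M | re.I)

print(default_start)   # ['first']
print(per_line_start)  # ['first', 'Second']
print(any_case)        # ['line', 'LINE']
print(combined)        # ['first line', 'Second LINE']
```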

That’s it.
