REGULAR EXPRESSIONS with Python: a beginner's guide, Part 1

C C Sreenidhin
Dec 21, 2020
Thanks for the free Image — https://unsplash.com/photos/b18TRXc8UPQ

Introduction

Whether you are a data scientist or a developer, a solid understanding of regular expressions can make many data processing tasks much easier. This article is intended for beginners and aims to help you understand the building blocks of regular expressions.

What is a regular expression? A regular expression is a search pattern represented as a sequence of characters. Regular expressions are also called regex or regexp. Such patterns are typically used by string-searching algorithms or for input validation.

Regular expressions are used in a wide variety of tasks that involve processing strings. Common applications include data validation, data wrangling, string parsing and syntax highlighting. Regex is also useful for web scraping and is even used by Internet search engines.

In general, regex is widely applied to verify the structure of strings, extract sub-strings, search, replace or rearrange parts of a string, or split a string into sub-strings.

Many programming languages provide regex capabilities through built-in or third-party libraries. Getting started with regex may not be easy because of its syntax, but it is certainly worth the investment of your time. Let us learn the essentials needed to begin constructing regular expressions for our use cases. Throughout this tutorial we will use the Python ‘re’ module to match or validate strings against regular expressions. The contents covered in this tutorial are given below.

Contents

  • Creating regular expressions or regex.
  • How regular expressions work.
  • re module in python.
  • Building blocks of regex with examples using python re.
  • A few examples

Creating regular expressions or regex

A regular expression pattern is composed of simple characters, such as ‘regex’, or a combination of simple and special characters, such as ‘reg.*ex\w*’. The pattern ‘regex’ is very basic: it matches only when the exact sequence “regex” occurs, with all characters together and in that order. It therefore matches both “Tutorial to learn regex?” and “regex”; in both cases the match is the sub-string “regex”.

When the search requires more than a direct match, such as matching both “regex” and “regular expressions”, special characters are included in the pattern. The expression ‘reg.*ex\w*’ matches both “regex” and “regular expressions” in a string. The ‘.’ matches any single character (except line terminators), and ‘\w’ matches any word character, that is, a letter, a digit or an underscore. The ‘*’ repeats the preceding expression between zero and unlimited times, as many times as possible. So ‘.*’ after “reg” matches any run of characters, and ‘\w*’ after “ex” matches the rest of the word.
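The two patterns described above can be tried out with a short sketch like the one below. The sample sentences are assumptions chosen for illustration, and re.search is the Python function introduced later in this tutorial.

```python
import re

# Literal pattern: matches only the exact sequence "regex".
literal = re.search(r"regex", "Tutorial to learn regex?")
print(literal.group())  # regex

# Pattern with special characters: matches both spellings.
pattern = r"reg.*ex\w*"
short_form = re.search(pattern, "Tutorial to learn regex?")
long_form = re.search(pattern, "Learn regular expressions today")
print(short_form.group())  # regex
print(long_form.group())   # regular expressions
```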

In the following topics we will deal with regular expressions in depth with more examples and patterns using python re.

How regular expressions work

A regular expression is matched against a given string by an underlying piece of software called a “regular expression engine”. The engine processes the expression and tries to match it against the string. In Python, regular expression patterns are compiled into a series of bytecodes, which are then executed by a matching engine written in C, and this helps search performance.

There are two kinds of regular expression engines: text-directed engines and regex-directed engines. A regex-directed engine will always return the leftmost match, even if a “better” match could be found later. When applying a regex to a string, the engine starts at the first character of the string and tries all possible permutations of the regular expression at that position. Only if all possibilities have been tried and found to fail does the engine continue with the second character in the text, where it again tries all possible permutations of the regex in exactly the same order. The result is that a regex-directed engine returns the leftmost match.

In this tutorial we will be using the Python ‘re’ module, which uses a regex-directed engine.
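The leftmost-match behaviour can be observed with a small experiment. This is only a sketch: the patterns use alternation (‘|’), which is not covered in this part of the tutorial, and the tiny input string is an assumption chosen to keep the result easy to follow.

```python
import re

# Every alternative is tried at position 0 before the engine moves on,
# so "ab" (found at position 0) wins over "b" (found at position 1).
leftmost = re.search(r"b|ab", "ab")
print(leftmost.group(), leftmost.start())  # ab 0

# At the same position, alternatives are tried in the order written,
# so "a" is returned even though "ab" would be a longer match.
first_alternative = re.search(r"a|ab", "ab")
print(first_alternative.group())  # a
```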

re module in python

Regular expression matching can be carried out with the Python ‘re’ module, which is built in and ships with every Python installation. Of course there are other libraries too, but as mentioned earlier we will stick with ‘re’ throughout this tutorial.

The re module provides different methods for working with regular expressions. Before explaining the methods, let us see a simple example of a search operation using Python re.
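A minimal sketch of such a search is shown here. The sample sentence is an assumption made for illustration, and the pattern is the one introduced earlier.

```python
import re

# Hypothetical sample text; any sentence containing the phrase would do.
text = "A beginner's guide to regular expressions"

# Search the text for the pattern, written as a raw string.
match = re.search(r'reg.*ex\w*', text)

print(match.group())  # regular expressions
```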

Here we have imported the re module and used the search method it provides. The search method returns a match object, which contains the results as groups. Groups and details about the pattern are discussed under the topic ‘Building blocks of regex with examples using python re’ in Part II of this tutorial. Printing the group with ‘match.group()’ outputs ‘regular expressions’.

Also notice that the regex pattern here is written between single quotes that follow the letter ‘r’. This is how raw strings are written in Python: when a string is preceded by ‘r’, the backslash ‘\’ is not treated as the start of an escape sequence. Note that this is Python syntax and not part of the regular expression.

We saw an example usage of the Python re search method above. Now let us look at a few other functions provided by re.
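The effect of the ‘r’ prefix can be checked directly in Python. The snippet below is a small illustration; the sample text is an assumption.

```python
import re

# In a normal string "\n" is one character (a newline);
# in a raw string it is two characters: a backslash and the letter n.
print(len("\n"), len(r"\n"))  # 1 2

# Without the r prefix, the backslash itself has to be escaped.
escaped = "\\d+"
raw = r"\d+"
print(escaped == raw)  # True

# The raw string reaches the regex engine unchanged: \d+ means "one or more digits".
number = re.search(raw, "Order 66 shipped")
print(number.group())  # 66
```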

We will use a few of them in our examples that follow. The explanations with examples for the methods we use in this tutorial are provided below.

1. re.compile(pattern, flags=0)

Using a compiled expression avoids rewriting the pattern and also gives slightly better performance, although the gain is usually negligible. The compile method takes two arguments: the pattern (the regular expression) and an optional flags value (the available flags are discussed briefly below).

Instead of passing the regex directly to the re methods as an argument, we can compile it into a pattern object and call the methods on that object. An example is given below:
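One possible version of such an example is sketched here. The sample strings and the variable names are assumptions made for illustration.

```python
import re

# Compile the pattern once; the flag makes matching case-insensitive.
regex = re.compile(r"reg.*ex\w*", flags=re.IGNORECASE)

# Reuse the compiled object on several hypothetical inputs.
samples = ["Regex basics", "regular expressions in Python", "no pattern here"]
results = []
for text in samples:
    found = regex.search(text)
    results.append(found.group() if found else None)

print(results)  # ['Regex', 'regular expressions', None]
```

The compiled object offers the same search, match and findall methods as the module, without the pattern argument.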

2. re.search(pattern, string, flags=0)

We have seen the re search() function in the example above. The search() method scans through the string and returns a match object for the first location where the pattern matches. It returns None if no position in the string matches the pattern.
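Both outcomes can be seen in the short sketch below, which uses a made-up sentence containing the word twice.

```python
import re

text = "my cat and your cat"

# Only the first occurrence is reported, together with its position.
first = re.search(r"cat", text)
print(first.group(), first.span())  # cat (3, 6)

# No match anywhere in the string gives None, so check before calling group().
missing = re.search(r"dog", text)
print(missing)  # None
```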

3. re.match(pattern, string, flags=0)

match() returns a match object if zero or more characters at the beginning of the string match the regular expression pattern. It returns None if the beginning of the string does not match the pattern.

The main difference between match() and search() is that match() only checks whether the pattern matches at the beginning of the string, while search() scans forward through the string for a match. So if you want to locate a match anywhere in the string, use search() instead. An example use case is given below.
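A sketch of the difference, using an assumed sample sentence:

```python
import re

text = "learn regex with Python"

# match() looks only at the beginning of the string.
at_start = re.match(r"learn", text)
not_at_start = re.match(r"regex", text)
print(at_start.group())  # learn
print(not_at_start)      # None

# search() scans forward and finds the same pattern further along.
anywhere = re.search(r"regex", text)
print(anywhere.group(), anywhere.start())  # regex 6
```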

4. re.findall(pattern, string, flags=0)

The findall() method, as the name signifies, returns all non-overlapping matches of the pattern in the string as a list of strings. The string is scanned left to right, and matches are returned in the order found. If the pattern contains exactly one group, the list contains the text of that group; if it contains several groups, the list contains tuples of groups. Empty matches are included in the result.

Note that, unlike the search method, which returns only one occurrence, findall() looks for all occurrences.
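The three shapes of result can be illustrated with a small sketch. The sample text is an assumption, and the patterns use ‘\d{4}’ (four digits) and parentheses for groups, both of which are explained in Part II.

```python
import re

text = "2020-12-21 and 2021-01-05"

# No groups: a list of the matched strings.
years = re.findall(r"\d{4}", text)
print(years)  # ['2020', '2021']

# One group: a list containing only the text of that group.
months = re.findall(r"\d{4}-(\d{2})", text)
print(months)  # ['12', '01']

# Several groups: a list of tuples, one tuple per match.
year_month = re.findall(r"(\d{4})-(\d{2})", text)
print(year_month)  # [('2020', '12'), ('2021', '01')]
```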

re flags

If you look at the methods listed above, you will notice that each of them accepts an argument named flags. A brief description of the available flags is given below.

  • re.ASCII — Perform ASCII-only matching instead of full Unicode matching.
  • re.DEBUG — Display debug information about compiled expression.
  • re.IGNORECASE — Perform case-insensitive matching.
  • re.LOCALE — Makes \w, \W, \b, \B and case-insensitive matching dependent on the current locale (bytes patterns only).
  • re.MULTILINE — Makes $ match the end of a line (not just the end of the string) and makes ^ match the start of any line.
  • re.DOTALL — Makes a period (dot) match any character, including a newline.
  • re.VERBOSE — Permits more readable regular expression syntax.
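A few of these flags are shown in action in the sketch below. The sample strings are assumptions chosen for illustration, and only some of the flags are covered.

```python
import re

# re.IGNORECASE: "python" also matches "Python".
ignore_case = re.search(r"python", "I like Python", re.IGNORECASE)
print(ignore_case.group())  # Python

# re.MULTILINE: ^ matches at the start of every line, not only of the string.
lines = "first line\nsecond line"
default_starts = re.findall(r"^\w+", lines)
multiline_starts = re.findall(r"^\w+", lines, re.MULTILINE)
print(default_starts)    # ['first']
print(multiline_starts)  # ['first', 'second']

# re.DOTALL: the dot also matches a newline.
no_dotall = re.search(r"a.b", "a\nb")
with_dotall = re.search(r"a.b", "a\nb", re.DOTALL)
print(no_dotall)                  # None
print(repr(with_dotall.group()))  # 'a\nb'

# re.VERBOSE: whitespace and comments inside the pattern are ignored.
date_pattern = re.compile(
    r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    """,
    re.VERBOSE,
)
print(date_pattern.search("Published 2020-12").groups())  # ('2020', '12')

# Flags can be combined with the | operator.
combined = re.findall(r"^python", "Python\npython", re.IGNORECASE | re.MULTILINE)
print(combined)  # ['Python', 'python']
```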

“Building blocks of regex with examples using python re” is covered in Part II of this introductory document on regular expressions.
