Discover the Magic of REGEX Formulas in Google Sheets

Video regex google sheet

In this article, we will delve into the world of REGEX formulas in Google Sheets and showcase their power through a series of examples. Regular expressions, or REGEX for short, are powerful tools for solving string-related problems by searching for patterns. They are used to tackle a wide range of tasks such as finding names or phone numbers in data, validating email addresses, extracting URLs, renaming files that contain the word “Application,” and much more. Although REGEX formulas have a reputation for being complex, once you grasp a few basic rules and understand how they work, they become an invaluable ally.

The Three REGEX Formulas in Google Sheets

Google Sheets offers three REGEX formulas: REGEXMATCH, REGEXEXTRACT, and REGEXREPLACE. Each formula serves a specific purpose:

  1. REGEXMATCH confirms if a pattern is found within a text.
  2. REGEXEXTRACT extracts the text that matches the pattern.
  3. REGEXREPLACE replaces the text that matches the pattern.

Let’s explore them in detail through a series of examples.

Example 1: Exploring REGEXMATCH in Google Sheets

The REGEXMATCH function returns TRUE if it finds the provided pattern anywhere within the text, and FALSE if there is no match. Let’s consider a dataset of vintage wines where each row contains a mix of text and/or numbers:

REGEX Formula Introduction

To check if a cell contains a number, i.e., if the vintage year is indicated, create a simple REGEXMATCH formula. In cell A2, add the following formula:

=REGEXMATCH(A2, "d")

This will output TRUE if it finds a number in the string, or FALSE if there is no number. The d part of the REGEX formula matches any digit from 0 to 9 in the input string. Hence, as long as there is a number in the input string, this pattern will give us a match. The result will look like this:

Google Sheets REGEXMATCH example

Notice that the text without any numbers, “Vin blanc maison,” gives a FALSE output as there is no match.

Important Note

This pattern matches a single digit. It doesn’t matter what else may be in the cell. For example, it will return TRUE for a meaningless string containing numbers, which is probably not the behavior you want in this case. By the end of this tutorial, you will understand enough to know how to fix this yourself. (For the solution, see the formula at the end of this tutorial.)

Example 2: Harnessing the Power of REGEXEXTRACT in Google Sheets

Using the same wine dataset as above, let’s create a new column in our dataset with just the vintage years, i.e., a column with only the year. This is a perfect use case for a REGEX formula in Google Sheets. We will create a regular expression pattern to match all numbers and then use REGEXEXTRACT to extract them. As with everything in spreadsheets, there are multiple REGEX patterns that could accomplish this task. We have seen this pattern above:

d+

But we can also use the named character class for digits:

p{Nd}+

This matches any digit, i.e., the digits 0 to 9. So, the REGEXEXTRACT formula to extract the year looks like this:

=REGEXEXTRACT(A2, "p{Nd}+")

or

=REGEXEXTRACT(A2, "d+")

Both formulas will yield a result of “1999” as the text in cell A2 is “Bordeaux rouge 1999.” If no number is found, the formula will return an #N/A error.

Google Sheets REGEX Regexextract Formula

Two Important Notes

Note 1:

REGEX formulas require text inputs and return text outputs. Thus, the output “1999” above is formatted as text. To convert it to a number, you need to wrap the result with the VALUE function. Refer to Example 4 below for more details.

Note 2:

If there were multiple numbers in the text string, for example, “Bordeaux rouge 1999 or 2001,” only “1999” would be returned by the REGEXEXTRACT formula as it doesn’t match spaces or letters between the numbers. It only matches numbers, so it matches the first number it finds and continues to match numbers until it encounters the first non-number character, where it stops matching, i.e., the space at the end of “1999.”

Example 3: Power Up with REGEXREPLACE in Google Sheets

The REGEXREPLACE formula replaces all sets of numbers in the text with a new value. For example, this formula:

=REGEXREPLACE(A2, "d+", "2021")

will replace “1999” in the phrase “Bordeaux rouge 1999” with “2021,” returning the answer: Bordeaux rouge 2021.

Important Note

The REGEXREPLACE function replaces ALL sets of numbers in the text, unlike REGEXEXTRACT, which only extracts the first pattern it finds.

Example 4: Using REGEXEXTRACT and VALUE to Extract Numbers from Text

Consider this data about a supertanker ship:

Data about a supertanker ship

Is it possible to extract these numbers with a REGEX formula, even though they are formatted differently? Absolutely!

This formula extracts numbers with or without thousand separators and/or decimal points:

=VALUE(REGEXEXTRACT(A2, "(d{1,3})([.,]d{3,})*"))

The REGEX pattern matches any digit, commas, or dots and extracts them. So, the REGEXEXTRACT formula matches the first digit it finds and then continues with the extraction as long as the next character is either another digit, a comma, or a dot. When it reaches something else, like a space or a letter, the match is broken and the extraction is finished.

We’re not done yet, though.

Using the VALUE function to Convert the Output to Numbers

The output of the REGEXEXTRACT formula is a string, not a number. Therefore, we need to convert the output to a number by wrapping the result with the VALUE function, like this:

=VALUE(REGEXEXTRACT(A2, "(d{1,3})([.,]d{3,})*"))

However, the above formula is not foolproof.

Improving Pattern Matching

If the text string has a dot or a comma before the first digit, it will be extracted as a match. For example, if the input text string was: “Le navire est énorme. Il pèse 630,962 tonnes en charge complète,” the above REGEXEXTRACT formula will only extract the first dot. It matches the dot after “énorme” and then stops matching due to the space that follows. How do we modify the formula to ensure the extraction starts with a number?

Well, we change the REGEX to first find a number before anything else, like this:

=VALUE(REGEXEXTRACT(A2, "(d[.,d]*)"))

Here, the REGEX matches a digit first and then searches for other digits, commas, or dots. If you’re observant, you’ll notice that the plus sign + has been replaced with an asterisk * after the square bracket []. This means zero or more of the characters between the square brackets, to account for a situation where there is a single-digit number we want to match. The REGEX pattern matches a digit followed by zero or more characters that are digits, commas, or dots.

Now, the result of the formula extraction is 630,962, which is the correct answer.

Please note that the output of the REGEXEXTRACT formula is a string of characters, so you need to wrap it with the VALUE function to convert it to a number, for example:

=VALUE(REGEXEXTRACT(A2, "(d[.,d]*)"))

Example 5: Checking Phone Numbers with REGEXMATCH

For this example, let’s consider American phone numbers with hyphens between the sections, i.e., numbers in this format: XXX-XXX-XXXX. This means there are 3 digits, then a hyphen, 3 digits, a hyphen, and finally, 4 digits. By the end of this tutorial, you will have enough information to adapt the example for other regions of the world.

Here is the data:

REGEXMATCH telephone numbers Google Sheets

Let’s build a REGEX formula to check if the string in column A is a valid American phone number. Using what we’ve learned above, we know that the expression d{3}-d{3}-d{4} matches digits. So, our first attempt is this formula:

=REGEXMATCH(A2, "d{3}-d{3}-d{4}")

This matches 3 digits, then a hyphen, 3 digits, a hyphen, and finally, 4 digits, and it works fine. It displays TRUE when it matches a phone number and FALSE otherwise.

But it’s verbose. We can simplify this by using a quantifier.

=REGEXMATCH(A2, "d{3}-d{3}-d{4}")

The {3} quantifier means match exactly 3 of the preceding pattern, i.e., exactly 3 digits.

This works perfectly, except that it still matches the last number in row 8. It matches 3-3-4 and doesn’t care about the extra digits that come after the 4 digits. But we know that it is not a valid phone number, so how do we prevent it from continuing after the 4 digits and reject anything beyond the 4 digits in the last set?

Word Boundaries

We wrap the expression with a special expression called a word boundary, indicated by b before and after the main expression. Technically, it matches a “zero-width nothing.” This means it marks a boundary between a word character (e.g., a letter, a digit, or an underscore) and a non-word character. Hence, it will match digits up to 4 and then look for a boundary. A fifth digit will break the match, but a space won’t since it sets a boundary.

Our new expression looks like this, with a b at the start and end:

=REGEXMATCH(A2, "bd{3}-d{3}-d{4}b")

Handling Errors

The last thing we could do is wrap this with an IFERROR function to handle numeric inputs such as row 6 above that cause an error output (REGEX formulas only work with string inputs).

=IFERROR(REGEXMATCH(A2, "bd{3}-d{3}-d{4}b"))

REGEXMATCH telephone numbers Google Sheets

Example 6: Rearranging Name Strings with REGEXREPLACE

In this example, we will look at the REGEXREPLACE function and a key concept of regular expressions: capture groups. Suppose we have a list of names in this format: “Last Name, First Name,” and we want to reverse the order to “First Name Last Name.”

Here is the formula to reverse the names:

=REGEXREPLACE(A2, "(w+), (w+)", "$2 $1")

Let’s break it down:

The REGEXREPLACE function finds all substring matches that match the pattern and replaces them with the given value. It takes three arguments: 1) the input text, 2) the pattern to search for, and 3) the replacement value.

Let’s take a look at each element one by one:

  • Input Text: Last Name, First Name, for example, Collins, Ben.
  • Search Pattern: (w+), (w+)
    • (w+) creates a numbered capture group. It matches the first word up to the comma.
    • (, ) opens a non-capturing group, which essentially means match but don’t capture the text in this group.
    • (w+) is the non-capturing group that matches a comma and a space but doesn’t capture them.
    • (w+) creates a second numbered capture group. It matches the second word after the comma and the space.
  • Replacement Value:
    • Now things get interesting!
    • Our matching pattern has captured each of the names as a numbered group, which we are now able to reference with $1 or $2.
    • The group $1 captured the last name, and the group $2 captured the first name.
    • To reverse the names, we put group $2 first, then group $1.
  • The result looks like this:

REGEXREPLACE in Google Sheets

Solution for Example 1

Modify the REGEXMATCH formula as follows:

=REGEXMATCH(A2, "bd{4}b")

It uses a quantifier {4} and word boundaries b to only match 4-digit numbers (see Example 5 for more information on quantifiers and word boundaries).

REGEX Google Sheets

Learn More

The Google Sheets REGEX Cookbook

There are hundreds of additional REGEX examples in my course, “The Google Sheets REGEX Cookbook.”

Resources

For the complete list of allowed syntax, check out the RE2 syntax page.

To enhance your Google Sheets skills, visit Crawlan.com. They offer a wide range of courses and resources to help you become a Google Sheets expert.

Related posts