hi@methebe.com

Regular Expressions done right - 03/02/2023

Are you scared of regular expressions? Guess what


Enigma Bits · Feb 3, 2023

Regular expressions (from here on, regex) are not as scary as the meme suggests. They just need a little patience and some knowledge of state diagrams.

Regex is just a State Diagram!

It only sounds fancy; in a nutshell, a state diagram is just a primitive flowchart.

(Image from community.coda.io)

Assume you are to analyze users and their locations. Since everyone is anxious about privacy nowadays, they tend to mask their location. Your conclusions might be off and cost millions of dollars 😳. One way around it is to ask for a phone number for 2FA and then analyze the phone number 😜. Just state it vaguely in the T&C or Privacy Policy.

So, phone numbers then. Yeah, they follow a system and encode location data. The bump is that you never enforced a format on the end user. So, when you look at the head of the data, you see this:

000-02-2344  
000 02 2343  
001.23.2342  
023-02-2344

Imagine you have a million users or so. WTF! So, you sanitize at the pre-processing stage. How?

    user_pns = [user.phone_number for user in Users()]
    pns_sanitized = []

    for pn in user_pns:
        if '-' in pn:
            # sanitize however you want
            pns_sanitized.append(pn.replace('-', ''))
        elif '.' in pn:
            pass  # ...another branch for dots
        else:
            pass  # ...and another for everything else

But what if someone enters the phone number as 002-02.2344? They must have been in a hurry, so they mixed up the delimiters. Your script will leave the . in, and you have inconsistent data to model. Heck, what if some lunatic used * or wrapped the beginning in (...)? Are you going to write if and elif branches for all of that?
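To see the problem concretely, here is a minimal sketch. `sanitize_naive` is a hypothetical helper (not from the original script) that fills in the `.` branch the obvious way; the mixed-delimiter input still slips through.

```python
def sanitize_naive(pn: str) -> str:
    """Branch-per-delimiter cleanup; only the first matching branch runs."""
    if '-' in pn:
        return pn.replace('-', '')
    elif '.' in pn:
        return pn.replace('.', '')
    return pn


print(sanitize_naive("000-02-2344"))  # 000022344
print(sanitize_naive("002-02.2344"))  # 00202.2344 (the dot survives)
```

Because the `-` branch wins, the `.` branch never gets a chance to run on the same string.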

You realize that every extra if branch can cost you performance, right? If not, search for CPU Shadow Realm Exploitations. What should you do? Call the parser’s 911: regex.

No matter how the end user types it, the number will always be xxx xx xxxx, or some other shape depending on your users’ locality, which you can look up from where the 2FA codes are sent (because LTE and other cellular protocols require country-specific data). Let’s work with the pattern I mentioned above.

The pattern xxx xx xxxx can be represented as \D*\d{3}\D*\d{2}\D*\d{4}. Gibberish? Let me walk you through the lexemes, the grammar of regex.

For more, check the link below 👇

[Regex Cheat Sheet: A Quick Guide to Regular Expressions in Python](https://www.dataquest.io/blog/regex-cheatsheet/?source=post_page-----3a73f0deacf4--------------------------------)

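So \D*\d{3}\D*\d{2}\D*\d{4} reads as: any number of non-digits (\D*), exactly three digits (\d{3}), any number of non-digits, exactly two digits (\d{2}), any number of non-digits, and finally exactly four digits (\d{4}).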
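Here is a small sketch of that pattern at work in Python. The capture groups and the `normalize` helper are my additions for illustration (assumptions, not part of the pattern itself); they pull out the three digit runs so the number can be rewritten in one consistent shape.

```python
import re
from typing import Optional

# Same pattern, with capture groups around the digit runs (an assumption
# made for this sketch so that the pieces can be rejoined).
PHONE = re.compile(r"\D*(\d{3})\D*(\d{2})\D*(\d{4})")


def normalize(pn: str) -> Optional[str]:
    """Return the number as 'xxx xx xxxx', or None if it does not fit."""
    m = PHONE.fullmatch(pn)
    if m is None:
        return None
    return " ".join(m.groups())


for raw in ["000-02-2344", "000 02 2343", "001.23.2342",
            "002-02.2344", "(023) 02*2344", "12345"]:
    print(raw, "->", normalize(raw))
```

One pattern covers dashes, dots, spaces, mixed delimiters, and the stray brackets, with no if/elif ladder.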

Or, you can think of the regex as a map of how a finger should move across a piece of string: a state diagram.

see, I told you 😘

Now you can clap 👏, the button is at the bottom of the screen 😎.

Each logical piece of the pattern represents one state.

See, it is just a state machine in disguise. Once you get this fundamental right, you are off to 🚀.
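If you want to feel the "state machine in disguise" part, here is a rough hand-rolled walker for the phone pattern, one state per logical piece. The names `STATES`, `walk` and `is_digit` are made up for this sketch, and it assumes plain ASCII digits; it is not how a real regex engine is implemented.

```python
import re


def is_digit(c: str) -> bool:
    return "0" <= c <= "9"          # ASCII digits only (assumption)


def is_not_digit(c: str) -> bool:
    return not is_digit(c)


# Each state: (test for one character, how many characters to consume).
# None means "zero or more", mirroring * in the regex.
STATES = [
    (is_not_digit, None),  # \D*
    (is_digit, 3),         # \d{3}
    (is_not_digit, None),  # \D*
    (is_digit, 2),         # \d{2}
    (is_not_digit, None),  # \D*
    (is_digit, 4),         # \d{4}
]


def walk(text: str) -> bool:
    """Move a 'finger' across text, state by state; True if it reaches the end."""
    pos = 0
    for accepts, count in STATES:
        if count is None:
            while pos < len(text) and accepts(text[pos]):
                pos += 1
        else:
            for _ in range(count):
                if pos >= len(text) or not accepts(text[pos]):
                    return False
                pos += 1
    return pos == len(text)


PATTERN = r"\D*\d{3}\D*\d{2}\D*\d{4}"
for s in ["000-02-2344", "001.23.2342", "002-02.2344", "12-34"]:
    print(s, walk(s), bool(re.fullmatch(PATTERN, s)))
```

The greedy walk is enough here only because digits and non-digits never overlap, so the finger never has to step back.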

The Tip to compile it.

  1. Take an example and put your finger on the left-most character of the specimen.
  2. Ask yourself what you need to see in order to advance to the next character: a digit, a letter, or a whitespace?
  3. Repeat until you find yourself at the right-most character of your specimen.
  4. Now, just like you factorize an integer, factorize your state transitions. E.g., 12 can be written as 3 x 2 x 2, which becomes 3 x 2^2: we have one 3, and 2 appears twice.

Now, map it as a state diagram in your head or, if it is complex, take a piece of paper and sketch it. No need to follow any symbology or conventions; it just has to express what your mind says. Heck, you can even use a flowchart, because any state machine or Turing machine can be expressed as a flowchart.
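The finger walk and the factorizing step can be roughed out in code too. This is only a sketch under my own assumptions: `char_class` and `sketch_pattern` are hypothetical helpers, the four character classes are a deliberately crude choice, and the output is an exact-count first draft that you would still loosen by hand (for example, turning a count into `*`).

```python
import re
from itertools import groupby


def char_class(ch: str) -> str:
    """Step 2: what kind of character lets the finger advance?"""
    if ch.isdigit():
        return r"\d"
    if ch.isalpha():
        return "[A-Za-z]"   # crude: treats every letter as ASCII (assumption)
    if ch.isspace():
        return r"\s"
    return r"\W"            # crude: anything else (note: \W does not match _)


def sketch_pattern(specimen: str) -> str:
    """Step 4: collapse runs of the same class, like 2 x 2 becoming 2^2."""
    parts = []
    for cls, run in groupby(specimen, key=char_class):
        n = len(list(run))
        parts.append(cls if n == 1 else f"{cls}{{{n}}}")
    return "".join(parts)


draft = sketch_pattern("000-02-2344")
print(draft)                                    # \d{3}\W\d{2}\W\d{4}
print(bool(re.fullmatch(draft, "000-02-2344")))  # True
```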

Let’s look at an example. Say we need to extract people’s names, along with their prefixes, from phone book entries. Step 1 is to get an example.

Mas. Birnadin Erick

  1. Put your finger on the left-most character, M.
  2. Ask what we need to see to progress to the right: a letter.
  3. Then another letter, and so on.
  4. A period, .;
  5. Now a whitespace, _;
  6. Now a letter, but uppercase.

In a diagram, the traces would be…

(Figures: from start to 2nd state; from 2nd to 3rd; from 3rd to nth; nth to finish.)

Are we done? No, there is another possible variant!

Ms. Jane Doe

The difference is that from the 2nd to the 3rd state we have to scan an s.

2nd to 3rd state transition changes

Everything else stays the same, hence we have a common factor. Done? Nope.

Mr. John Doe

This time, the 2nd-to-3rd trigger is r.

So, the 2nd-to-3rd step has more than one possible transition.

The different paths the machine can take.
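Those branching paths can be written down as a plain transition table. A minimal sketch, with state names I made up for illustration: after M the machine can take a, s or r, and only the a path needs one more s to finish the prefix Mas.

```python
# (current state, character read) -> next state; state names are hypothetical.
TRANSITIONS = {
    ("start", "M"): "after_M",
    ("after_M", "a"): "after_Ma",      # Mas. takes the longer path
    ("after_M", "s"): "prefix_done",   # Ms.
    ("after_M", "r"): "prefix_done",   # Mr.
    ("after_Ma", "s"): "prefix_done",
}


def accepts_prefix(text: str) -> bool:
    """Follow the table character by character; any missing edge rejects."""
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False
    return state == "prefix_done"


for prefix in ["Mas", "Ms", "Mr", "Ma", "Mx"]:
    print(prefix, accepts_prefix(prefix))
```

Note that Ma on its own is rejected: the a branch is two characters long, which matters when the diagram is turned into a pattern.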

Also, consecutive transitions between the nth and (n+1)th states may share the same trigger several times in a row. E.g., along irnadin in Birnadin, i and r are the same trigger (a lowercase letter), just present at multiple points.

first pass;

👆 can be simplified as 👇

simplified on 2nd pass.

So, as a result, our state diagram looks like 👇


Compilation

The first state gives: M

The branching transitions after it give: M(as|s|r). A character class such as M[asr] would not do here, because a class matches exactly one character and so would stop at Ma in Mas.

Then the . gives us M(as|s|r)\W*, since the period could be present or absent;

The whitespace gives us M(as|s|r)\W*\s*, since there could be more than one space due to input errors;

Then an uppercase letter: M(as|s|r)\W*\s*[A-Z]; you could use \w instead of [A-Z] if you do not care about case;

With the rest of the first name, we end up with: M(as|s|r)\W*\s*[A-Z]\w*

The last name goes the same way, thus we end up with 👇

    M(as|s|r)\W*\s*[A-Z]\w*\s*[A-Z]?\w*
    # the last [A-Z]?\w* means there could be a last name or not

    # if you know for sure that the prefix is delimited by a period
    M(as|s|r)\.\s*[A-Z]\w*\s*[A-Z]?\w*
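A quick sketch for trying the prefix-plus-name pattern in Python. It spells the prefix as the alternation `(?:as|s|r)`, since a one-character class cannot cover the two-letter tail of Mas. The capture groups and the `parse_entry` helper are assumptions added for illustration.

```python
import re
from typing import Optional, Tuple

# Prefix, first name, optional last name. Groups are added for this sketch.
ENTRY = re.compile(r"(M(?:as|s|r))\W*\s*([A-Z]\w*)\s*([A-Z]?\w*)")


def parse_entry(line: str) -> Optional[Tuple[str, str, Optional[str]]]:
    """Return (prefix, first_name, last_name or None), or None if no match."""
    m = ENTRY.search(line)
    if m is None:
        return None
    prefix, first, last = m.groups()
    return prefix, first, last or None


for line in ["Mas. Birnadin Erick", "Ms. Jane Doe", "Mr. John Doe", "Mr. John"]:
    print(parse_entry(line))
```

The pattern is still loose on purpose: for instance, the last-name piece also accepts a lowercase word, so tighten it if your data calls for that.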

Do you feel how easy it is? Told you a regular expression is just a state diagram in disguise 🐑.

Epilogue

I hope I shed some light on regex by showing how I understand it. If you disagree or have an example that contradicts this, please let me know in the comments; I would love to be corrected.

Another post on using regex in Python, Rust and JavaScript is coming soon.

If you are intrigued or interested in this kind of stuff, make sure you follow me on Medium or Twitter.

Till then, it’s me the BE, signing off 👋

Cover background by Steve Johnson.