Some regex parsers (looking at you, JavaScript, but it's not the only one) can have serious problems with certain regular expressions such as the following:


This can be read in English as something like:

Any amount of characters at the end of the string that each is either an 'a' or anything but a 'b'".

When matching this regular expression against a string like "aaaaab", for each "a" the parser will decide one of the two possible options that match ("a" matches both the "an 'a'" and the "anything but a 'b'" conditions), and it could go with either one. In this case, it decides to use the first condition (it is an "a" after all), for every "a" in our test string.

The problem comes at the end with the "b": it doesn't match the "an 'a'" condition.

Apparently, the parser will go all the way back to the beginning to choose the second option for the first "a" character, and it will continue trying to match the rest of the "a"s as it did first (using the first condition). Again, it will reach the end and "b" won't match that first condition. So, it will go back to the second "a" character and use the second matching option for it, and so on. This grows exponentially with the number of "a"s in the string, until every "a" has been matched with the second option (they are "not 'b's"), and then we evaluate the final "b" and realize it doesn't match either condition.

You can try this and see that it takes a couple of seconds to evaluate it:


Final thoughts

This tool can tell you if a regular expression is vulnerable to this problem.

In PHP there doesn't seem to be a problem with it hanging forever but, for the same example, at a certain point, it will start giving you the wrong answer.

Go, of course, doesn't have this problem and gives you the correct answer no matter how many "a"s you add.

Mastodon Mastodon