Arduino Regular Expressions

The first chapter of a wonderful book titled “Beautiful Code” (O’Reilly 2007) was written by Brian Kernighan who described an episode from when he and Rob Pyke were writing a book together called “The Practice of Programming”. They wanted to include a discussion on regular expressions but also wanted to include the code of an implementation. Existing code bases would have been a book length by themselves. Rob Pyke developed 30 lines of C code in less than two hours that implements a substantial part of the day to day usage of regular expressions. The code can be described as beautiful for its elegant efficiency. The final line of the chapter reads “I don’t know of another piece of code that does so much in so few lines while providing such a rich source of insight and further ideas”.

The code handles the following constructs:

Character
Meaning
c
Matches any literal character (in this case ‘c’)
. (full stop)
Matches any single character
^
Start anchor matches from the beginning of a string
$
End anchor matches the end of a string
*
Matches zero or more instances of the previous character.

The code (with some trivial format edits) follows:

// match - search for regular expression anywhere in text
int match(char* regexp, char* text) {
  if(regexp[0] == '^') {
    return matchHere(++regexp, text);
  }
  do{
    if(matchHere(regexp, text)) {return 1;}
  }while(*text++ != '\0');
  return 0;
}
// matchHere - search for regex at beginning of text
int matchHere(char* regexp, char* text) {
  if(regexp[0] == '\0') {return 1;}
  if(regexp[1] == '*') {
    return matchStar(regexp[0], regexp+2, text);
  }
  if(regexp[0] == '$' && regexp[1] == '\0') {
    return *text == '\0';
  }
  if(*text != '\0' && (regexp[0] == '.' || regexp[0] == *text)) {
    return matchHere(++regexp, ++text);
  }
  return 0;
}
// matchStar - search for c*regexp at beginning of text
int matchStar(int c, char* regexp, char* text) {
  do {
    if(matchHere(regexp, text)) {return 1;}
  }while(*text != '\0' &&(*text++ == c || c=='.'));
  return 0;
}

With a simple Arduino C (ish) test rig:

template<class T> inline Print &operator <<(Print &obj, T arg) { obj.print(arg); return obj; } 
#define dprint(expr) Serial << #expr " = " << expr << "\n";

void setup() {
  Serial.begin(115200);
  dprint(match("a*$", "Xaa"));
  dprint(match(".*c", "abcabc"));
  dprint(match("^9.$", "9d"));
  dprint(match("^9*.$", "99d"));
}


Play around with the regular expressions and the strings to be matched against and you should see the match() function will return the value 1 for a match and 0 when no match is found.

This great demonstration might start anyone looking for a good text pattern matching tool thinking about what should be added and how to take the concept onwards. Fortunately Brian Kernighan suggested a few ways forward. One was to use a struct to represent each element of the regular expression pattern string as this would ease the process of adding more complex constructs to be matched against. Another suggestion was to “compile” the pattern ahead of any matching process. It is reasonable to assume that the term compile meant that a process should tokenise the pattern and save the tokens as an array (or equivalent) of structs representing the pattern components.

<edit>
Found the whole chapter available on line here - so available to read even if you dont have the book.
</edit>

My interest stemmed from some developing ideas around the interpretation or compilation of alternate languages. A compact regular expression tool that could find patterns within strings would be helpful in that context. Expanding Rob Pyke’s pattern matching options and finding the location of any match within a target string became the objective. I also wanted to get something useful running with a small impact on compiled program size and a minimal impact upon SRAM during execution.

Inevitably, my final list of features to include is going to be a personal one that reflects my anticipated usage. The list became:

Regex component
Meaning
^
Start anchor
$
End anchor
c
match a character literal
\d
match digits 0 to 9
\D
matched if not digit
\w
match alphanumeric char
\W
matched if not alphanumeric
\s
matches whitespace
\S
match if not whitespace
. (full stop)
match any character not a line feed (\n)
*
match zero or more of previous construct
+
match one or more of previous construct
?
match zero or one of previous construct
[a-z] or [0-9A-H]
match to one or more character ranges
[abcd]
match to any one of a set of characters
[^a-z]
match if not in range (eg. not a to z)
[^defg]
match if not in list
Lists of characters within square brackets can contain meta characters (such as \d). Characters such as ‘*’ or ‘?’ need to be escaped if they are outside of a list and should be matched just as a character.

I also added a feature to allow the search for a match to start at any position within the target text and a facility to return the length of any matched text.

The rules on how to implement any particular regular expression feature are reasonably clear between implementations but there are variations and odd places where things might be termed undefined.
It would be nice if you could just work along the regular expression trying to match individual characters or elements to a character in the target string. The regular expression symbols *, + and ? though mean that you have to look forward to see if a given matching relationship is modified. Indeed the * and ? symbols can mean that no match is a match, if that makes sense, as they allow for zero characters matching. Then you have to allow symbols like * to be “greedy” but not too greedy. 

For instance:

Regex “.*” matches all of the characters in “aabbccddee” but

Regex “.*d” only matches 7 as it stops after the first ‘d’. To match to just before the letter e you need

Regex “.*dd”

The code design avoids recursion to help minimise additions to the stack during execution. A vector structure is used to store the list of tokens as the regular expression is compiled. A vector allows the use of array style access to elements without having to set arbitrary limits on the array size. The vector code itself is heavily cut down to a minimum spec for no reason other than to minimise it’s importance in the code. All heap memory used by the vector is recovered when the regular expression class instance goes out of scope.  Sets of character ranges and lists are also stored in the heap and again this memory is recovered when a given class instance goes out of scope.


The usage starts by creating an instance of the Regex class passing a pointer to a regular expression to the constructor. Then the match() method is passed a pointer to a target char array. The method will return the position of the first match in the target string or -1 if no match is found. The match method can also be passed an additional integer argument to set the point in the target string for the search for a match to begin (zero based). The class getMatchLength() method will return the number of characters in the target string that were matched by the regular expression.

If you fancy adding to this Regex class then the numeric quantifiers are an obvious addition.


Meaning
{3}
match exactly 3 times
{2, 4}
match between 2 and 4 times
{2,}
match 2 or more times

The union component of the RE struct could have an array of two (uint8_t) unsigned 8 bit integers added which would allow plenty of scope without increasing the size of the struct on 8 bit boards.

Another tweak you might fancy could be to allow lists and ranges to be mixed in the same class (e.g. [ext0-9]. It makes sense as a range is just shorthand for a number of items in the list.

Java, Ruby, Python, the .NET framework, JavaScript and of course Perl (plus other environments) all add additional or alternate functionality to their regular expressions. My current view is that most are beyond the scope of this particular project’s aims.

If you have an interest then please give the code a whirl. It can be found on GitHub  with versions for Arduino and more general C/C++ environments. Happy to debate any issues you might have with the results – some might track down bugs and other issues could lead to revised interpretations of the rules.

Special thanks to a GitHub repository under the name of kokke (where there is a small regex solution coded in C) for the great list of tests that I stole to run through my effort. Fixing failed tests taught me a lot about Regex that I had not given much thought to. Flushing some bugs proved character building.

Comments

Popular posts from this blog

Unicode output from an Arduino.

Arduino: How to scroll long text on OLED screen