Arduino Regular Expressions
The first chapter of a wonderful book titled “Beautiful Code” (O’Reilly 2007) was written by Brian Kernighan who described an episode from when he and Rob Pyke were writing a book together called “The Practice of Programming”. They wanted to include a discussion on regular expressions but also wanted to include the code of an implementation. Existing code bases would have been a book length by themselves. Rob Pyke developed 30 lines of C code in less than two hours that implements a substantial part of the day to day usage of regular expressions. The code can be described as beautiful for its elegant efficiency. The final line of the chapter reads “I don’t know of another piece of code that does so much in so few lines while providing such a rich source of insight and further ideas”.
The code handles the following constructs:
The code (with some trivial format edits) follows:
The code handles the following constructs:
Character
|
Meaning
|
c
|
Matches any literal character (in this case ‘c’)
|
. (full stop)
|
Matches any single character
|
^
|
Start anchor matches from the beginning of a string
|
$
|
End anchor matches the end of a string
|
*
|
Matches zero or more instances of the previous character.
|
The code (with some trivial format edits) follows:
// match - search for regular expression anywhere in text int match(char* regexp, char* text) { if(regexp[0] == '^') { return matchHere(++regexp, text); } do{ if(matchHere(regexp, text)) {return 1;} }while(*text++ != '\0'); return 0; } // matchHere - search for regex at beginning of text int matchHere(char* regexp, char* text) { if(regexp[0] == '\0') {return 1;} if(regexp[1] == '*') { return matchStar(regexp[0], regexp+2, text); } if(regexp[0] == '$' && regexp[1] == '\0') { return *text == '\0'; } if(*text != '\0' && (regexp[0] == '.' || regexp[0] == *text)) { return matchHere(++regexp, ++text); } return 0; } // matchStar - search for c*regexp at beginning of text int matchStar(int c, char* regexp, char* text) { do { if(matchHere(regexp, text)) {return 1;} }while(*text != '\0' &&(*text++ == c || c=='.')); return 0; }
With a simple Arduino C (ish) test rig:
template<class T> inline Print &operator <<(Print &obj, T arg) { obj.print(arg); return obj; } #define dprint(expr) Serial << #expr " = " << expr << "\n"; void setup() { Serial.begin(115200); dprint(match("a*$", "Xaa")); dprint(match(".*c", "abcabc")); dprint(match("^9.$", "9d")); dprint(match("^9*.$", "99d")); }
Play around with the regular expressions and the strings to be matched against and you should see the match() function will return the value 1 for a match and 0 when no match is found.
This great demonstration might start anyone looking for a good text pattern matching tool thinking about what should be added and how to take the concept onwards. Fortunately Brian Kernighan suggested a few ways forward. One was to use a struct to represent each element of the regular expression pattern string as this would ease the process of adding more complex constructs to be matched against. Another suggestion was to “compile” the pattern ahead of any matching process. It is reasonable to assume that the term compile meant that a process should tokenise the pattern and save the tokens as an array (or equivalent) of structs representing the pattern components.
<edit>
Found the whole chapter available on line here - so available to read even if you dont have the book.
</edit>
My interest stemmed from some developing ideas around the interpretation or compilation of alternate languages. A compact regular expression tool that could find patterns within strings would be helpful in that context. Expanding Rob Pyke’s pattern matching options and finding the location of any match within a target string became the objective. I also wanted to get something useful running with a small impact on compiled program size and a minimal impact upon SRAM during execution.
Inevitably, my final list of features to include is going to be a personal one that reflects my anticipated usage. The list became:
This great demonstration might start anyone looking for a good text pattern matching tool thinking about what should be added and how to take the concept onwards. Fortunately Brian Kernighan suggested a few ways forward. One was to use a struct to represent each element of the regular expression pattern string as this would ease the process of adding more complex constructs to be matched against. Another suggestion was to “compile” the pattern ahead of any matching process. It is reasonable to assume that the term compile meant that a process should tokenise the pattern and save the tokens as an array (or equivalent) of structs representing the pattern components.
<edit>
Found the whole chapter available on line here - so available to read even if you dont have the book.
</edit>
My interest stemmed from some developing ideas around the interpretation or compilation of alternate languages. A compact regular expression tool that could find patterns within strings would be helpful in that context. Expanding Rob Pyke’s pattern matching options and finding the location of any match within a target string became the objective. I also wanted to get something useful running with a small impact on compiled program size and a minimal impact upon SRAM during execution.
Inevitably, my final list of features to include is going to be a personal one that reflects my anticipated usage. The list became:
Regex component
|
Meaning
|
^
|
Start anchor
|
$
|
End anchor
|
c
|
match a character literal
|
\d
|
match digits 0 to 9
|
\D
|
matched if not digit
|
\w
|
match alphanumeric char
|
\W
|
matched if not alphanumeric
|
\s
|
matches whitespace
|
\S
|
match if not whitespace
|
. (full stop)
|
match any character not a line feed (\n)
|
*
|
match zero or more of previous construct
|
+
|
match one or more of previous construct
|
?
|
match zero or one of previous construct
|
[a-z] or [0-9A-H]
|
match to one or more character ranges
|
[abcd]
|
match to any one of a set of characters
|
[^a-z]
|
match if not in range (eg. not a to z)
|
[^defg]
|
match if not in list
|
Lists of characters within square brackets can contain meta
characters (such as \d). Characters such as ‘*’ or ‘?’ need to be escaped if
they are outside of a list and should be matched just as a character.
I also added a feature to allow the search for a match to
start at any position within the target text and a facility to return the
length of any matched text.
The rules on how to implement any particular regular expression
feature are reasonably clear between implementations but there are variations
and odd places where things might be termed undefined.
It would be nice if you could just work along the regular
expression trying to match individual characters or elements to a character in
the target string. The regular expression symbols *, + and ? though mean that you
have to look forward to see if a given matching relationship is modified.
Indeed the * and ? symbols can mean that no match is a match, if that makes
sense, as they allow for zero characters matching. Then you have to allow
symbols like * to be “greedy” but not too greedy.
For instance:
Regex “.*” matches all of the characters in “aabbccddee” but
Regex “.*d” only matches 7 as it stops after the first ‘d’.
To match to just before the letter e you need
Regex “.*dd”
The code design avoids recursion to help minimise additions
to the stack during execution. A vector structure is used to store the list of
tokens as the regular expression is compiled. A vector allows the use of array
style access to elements without having to set arbitrary limits on the array
size. The vector code itself is heavily cut down to a minimum spec for no
reason other than to minimise it’s importance in the code. All heap memory used
by the vector is recovered when the regular expression class instance goes out
of scope. Sets of character ranges and
lists are also stored in the heap and again this memory is recovered when a
given class instance goes out of scope.
The usage starts by creating an instance of the Regex class
passing a pointer to a regular expression to the constructor. Then the match()
method is passed a pointer to a target char array. The method will return the
position of the first match in the target string or -1 if no match is found.
The match method can also be passed an additional integer argument to set the
point in the target string for the search for a match to begin (zero based).
The class getMatchLength() method will return the number of characters in the
target string that were matched by the regular expression.
If you fancy adding to this Regex class then the numeric quantifiers are an obvious addition.
Meaning
|
|
{3}
|
match exactly 3 times
|
{2, 4}
|
match between 2 and 4 times
|
{2,}
|
match 2 or more times
|
The union component of the RE struct could have an array of
two (uint8_t) unsigned 8 bit integers added which would allow plenty of scope
without increasing the size of the struct on 8 bit boards.
Another tweak you might fancy could be to allow lists and
ranges to be mixed in the same class (e.g. [ext0-9]. It makes sense as a range
is just shorthand for a number of items in the list.
Java, Ruby, Python, the .NET framework, JavaScript and of
course Perl (plus other environments) all add additional or alternate
functionality to their regular expressions. My current view is that most are
beyond the scope of this particular project’s aims.
If you have an interest then please give the code a whirl.
It can be found on GitHub with versions for Arduino and more general C/C++ environments. Happy to debate any
issues you might have with the results – some might track down bugs and other
issues could lead to revised interpretations of the rules.
Special thanks to a GitHub repository under the name of
kokke (where there is a small regex solution coded in C) for the great list of tests
that I stole to run through my effort. Fixing failed tests taught me a lot
about Regex that I had not given much thought to. Flushing some bugs proved
character building.
Comments
Post a Comment