This article will also appear on the Zeek blog in a forthcoming Zeke on Zeek post.
Paraglob is a data structure for quick string matching against a large set of patterns. The algorithm was originally designed by Robin Summer but it’s initial implementation was slowed significantly by an internal set data structure that ran in linear time for most of its operations. As a result of a couple of these linear time operations being called together, building Paraglob took O(N^2) and other operations took O(Nlog(N) time where N is the number of patterns in data structure. In this post I’ll walk through moving Paraglob to C++, and using different data structures to reduce its compile time to linear time and other operations to log(N) time.
But first, a cool looking graph and a look ahead. Queries refers to how many strings are being run against Paraglob, patterns to the number of patterns those patterns are being matched against, and in this case about %20 of the patterns match. The lumps aren’t consistent across runs, and are likely just my computer doing something else in the background. Notice how small the time increase is from running 1,000 to 20,000 queries. At the upper right paraglob is compiling a set of 10,000 patterns and running 20,000 queries on them in under 2 seconds.

The Algorithm
At its core paraglob is actually built around a relatively straightforward algorithm. For any pattern, there exists a set of characters that an input must contain in order to have any hope of matching against it. For example, consider the pattern “do*”. Anything that matches against “do*” must at the very least contain the substring “do”. This can be easily extended to more complicated patterns by just breaking up the pattern on special glob syntax. For example, “dog*fish*cat” contains the substrings [“dog”, “fish”, “cat”]. We call these substrings meta words.
We can then re-frame our problem as one of finding the meta words inside an input string and checking the patterns associated with those meta words against it. The aho-corasick string-searching algorithm coupled with some sort of map from meta words solves our problem. We can summarize how paraglob works as follows:
    for every input pattern:
        extract the meta words
        map them to the pattern word
        store the meta words in the aho-corasick data structure
    for every input string:
        get the meta words it contains with the aho-corasick structure
        get candidate patterns with the map
        check those patterns against the string
        return the matches

With an algorithm designed, paraglob’s actual implementation is fairly straightforward, but with a couple important nuances. The first lies in the fact that for a given set of patterns there is a non-zero chance that one meta word will be associated with multiple patterns. Consider for example a small pattern set [*mischiev[!o]us*, *mischevous*, *.us*, *.gov*] which might flag mischievous typosquatting and government related urls. Already this pattern set has one meta word (us) mapping to two quite different patterns.
As a result paraglob can’t use a standard map structure which only allows for a single value for every kep. The obvious solution to this is to use some sort of multimap, but in practice this proved to be slow. Using a multimap would slow down paraglob by as much as a factor of 10 as opposed to an implementation with a standard map structure that ignores the above issue.
In order to achieve the performance offered by the latter, and still handle the association of multiple patterns with one meta word, paraglob uses a custom paraglobNode class that can store a list of patterns and that is associated with a meta word in a map. ParaglobNodes also contain functionality to quickly merge patterns that they contain matching a string with an input vector. This greatly increases the speed at which paraglob is able to find patterns for an input string.
The second important nuance lies in how paraglob handles duplicate patterns. Using the same example pattern set as above, a query for “” contains the meta words us, and mischiev. Mapping those to their respective pattern words, we get [*mischiev[!o]us*, *.us*] from us and [*mischiev[!o]us*] from mischiev. Initially it seems like we should keep these in a set so as to prevent checking the same pattern twice. As it turns out though, maintaining a set internally is much more expensive that just checking duplicate patterns and using vectors internally. The result of this is that a paraglob doesn’t remove any duplicates until the last step when the vector of matching patterns is at its smallest.
Next steps
A paraglob’s state is defined completely by the patterns inside of it. Paraglobs hold no internal state between calls, nor do they make any updates to their internal aho-corasick structure unless a new pattern is added. Presently, their serialization function takes advantage of this and only serializes the vector of patterns contained inside a paraglob. For unserializing, a new paraglob is built from that serialized vector of patterns, and its aho-corasick structure is recompiled. This recompilation is expensive though, and can take as long as 10 seconds for very long pattern sets.
Ideally, a paraglob could be serialized in such a way that the recompilation step is not needed. There exists some serialization code inside of the Boost C++ Libaries that might be useful in doing this, but due to how complicated the aho-corasick trie becomes when it contains a fair amount of patterns serializing a paraglob like this would likely take a significant effort. Working out a clean way to serialize a paraglob properly though would potentially result in a serious increase in its usefulness.
A huge thank you to Kamiar Kanani for his excellent Multifast aho-corasick algorithm he let us use under the BSD license for the Zeek project. Without such a well done string searching algorithm underpinning everything this would have been a much slower data structure and the people at Corelight for keeping me around for the summer to keep working on awesome projects like these.
Back to Top