Friday, January 7, 2011

N-gram and regex search

So, I was thinking about the regex search in large volumes of data, and realized this should not be too hard to do. What do we do for a "usual" search: build an inverted index of stemmed words, and then match that with the query and intersect the word posting lists.

Now, imagine instead of words, we use the ngrams. Ok, here we have a little bit of a problem because if we are searching on "foobar" and we matched "foo" and "bar" we need to ensure they are also adjacent in the documents found. But, this should not be a big deal.

Now the interesting part. I am searching for a regex. Instead of matching the inverted index ngrams against the query string, I need to do something more clever. Let's pretend we have the regex implemented in terms as a DFA. A given ngram is suitable for the selection to begin with if it can result in valid transitions of the DFA. So for each of the states we can try to run the ngram against the DFA and see what gives. If this ngram does not give the good match, we discard it outright. If it does work - we store the acceptable chain of states.

Note that a single ngram might fire up more than once on a given regex - so we end up storing quite a few of possibilities.

Once we're done - we have a bunch of "broken chains" - because they are short. Now the task is to see if we can reconstruct the chain of the states from starting till ending one, and if we can - then this set of ngrams is a candidate set which can be used same as in the literal search onwards (e.g. in something like http://qwik.jp/senna/).

The remaining question is what is the complexity of such a "glue" algorithm. The number of permutations to test can quickly become prohibitive.

No comments: