CUDA Grep
Initially we parallelized only the regular expression matching, but we then realized that matching a single regex against every input file was actually slower than grep. Each thread block has a specific regular expression (and thus an NFA) associated with it. Any number of thread blocks may share the same NFA: because they work on separate lines, no redundant work is done. Note that copying the strings to the device is now extremely fast (this might also be due to cache exploitation on the GPU).
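The assignment of thread blocks to regular expressions can be illustrated with a small CPU-side sketch. This is not the project's kernel: it only emulates the index arithmetic in plain Python, using the standard `re` module as a stand-in for the per-block NFA. The function name `simulate_grid`, the parameters `blocks_per_pattern` and `threads_per_block`, and the strided line assignment are all assumptions made for illustration.

```python
import re

def simulate_grid(patterns, lines, blocks_per_pattern=2, threads_per_block=4):
    """Emulate, on the CPU, a grid in which each block owns one regex.

    Hypothetical layout: consecutive groups of `blocks_per_pattern` blocks
    share a pattern, and the threads of a group stride over the lines so
    that no (pattern, line) pair is tested twice.
    """
    compiled = [re.compile(p) for p in patterns]  # stand-in for one NFA per pattern
    num_blocks = len(patterns) * blocks_per_pattern
    stride = blocks_per_pattern * threads_per_block  # threads cooperating on one pattern

    matches = []  # (pattern index, line index) pairs that matched
    tested = []   # every (pattern index, line index) pair that was examined

    for block_idx in range(num_blocks):            # plays the role of blockIdx.x
        pattern_idx = block_idx // blocks_per_pattern
        block_in_group = block_idx % blocks_per_pattern
        for thread_idx in range(threads_per_block):  # plays the role of threadIdx.x
            first_line = block_in_group * threads_per_block + thread_idx
            for line_idx in range(first_line, len(lines), stride):
                tested.append((pattern_idx, line_idx))
                if compiled[pattern_idx].search(lines[line_idx]):
                    matches.append((pattern_idx, line_idx))

    return sorted(matches), tested
```

Calling `simulate_grid([r"var\s+\w+", r"function"], lines)` on a list of source lines returns the matching (pattern, line) pairs together with the full list of pairs examined. Under this layout the second list contains each pair exactly once, which is the "no redundant work" property described above.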
Above is a graph of our speedup over single-threaded grep running on a single CPU core, for a 53 MB JavaScript file, as we increase the number of regular expressions used per run. (To be fair, we did try to parallelize grep by spawning a process for each regular expression, but this resulted in a slowdown for grep, as explained above.)
Source: www.cs.cmu.edu