grep: Performance

 
 5 Performance
 *************
 
 Typically ‘grep’ is an efficient way to search text.  However, it can be
 quite slow in some cases, and it can search large files where even minor
 performance tweaking can help significantly.  Although the algorithm
 used by ‘grep’ is an implementation detail that can change from release
 to release, understanding its basic strengths and weaknesses can help
 you improve its performance.
 
    The ‘grep’ command operates partly via a set of automata that are
 designed for efficiency, and partly via a slower matcher that takes over
 when the fast matchers run into unusual features like back-references.
 When feasible, the Boyer-Moore fast string searching algorithm is used
 to match a single fixed pattern, and the Aho-Corasick algorithm is used
 to match multiple fixed patterns.
 
    Generally speaking ‘grep’ operates more efficiently in single-byte
 locales, since it can avoid the special processing needed for multi-byte
 characters.  If your patterns will work just as well that way, setting
 ‘LC_ALL’ to a single-byte locale can help performance considerably.
 Setting ‘LC_ALL='C'’ can be particularly efficient, as ‘grep’ is tuned
 for that locale.
 
    Outside the ‘C’ locale, case-insensitive search, and search for
 bracket expressions like ‘[a-z]’ and ‘[[=a=]b]’, can be surprisingly
 inefficient due to difficulties in fast portable access to concepts like
 multi-character collating elements.
 
    Interval expressions may be implemented internally via repetition.
 For example, ‘^(a|bc){2,4}$’ might be implemented as
 ‘^(a|bc)(a|bc)((a|bc)(a|bc)?)?$’.  A large repetition count may exhaust
 memory or greatly slow matching.  Even small counts can cause problems
 if cascaded; for example, ‘grep -E ".*{10,}{10,}{10,}{10,}{10,}"’ is
 likely to overflow a stack.  Fortunately, regular expressions like these
 are typically artificial, and cascaded repetitions do not conform to
 POSIX so cannot be used in portable programs anyway.
 
    A back-reference such as ‘\1’ can hurt performance significantly in
 some cases, since back-references cannot in general be implemented via a
 finite state automaton, and instead trigger a backtracking algorithm
 that can be quite inefficient.  For example, although the pattern
 ‘^(.*)\1{14}(.*)\2{13}$’ matches only lines whose lengths can be written
 as a sum 15x + 14y for nonnegative integers x and y, the pattern matcher
 does not perform linear Diophantine analysis and instead backtracks
 through all possible matching strings, using an algorithm that is
 exponential in the worst case.
 
    On some operating systems that support files with holes--large
 regions of zeros that are not physically present on secondary
 storage--‘grep’ can skip over the holes efficiently without needing to
 read the zeros.  This optimization is not available if the ‘-a’
DONTPRINTYET  (‘--binary-files=text’) option is used (⇒File and Directory
 Selection), unless the ‘-z’ (‘--null-data’) option is also used (*noteDONTPRINTYET  (‘--binary-files=text’) option is used (⇒File and Directory
 Selection), unless the ‘-z’ (‘--null-data’) option is also used (⇒
 Other Options).
 
    For efficiency ‘grep’ does not always read all its input.  For
 example, the shell command ‘sed '/^...$/d' | grep -q X’ can cause ‘grep’
 to exit immediately after reading a line containing ‘X’, without
 bothering to read the rest of its input data.  This in turn can cause
 ‘sed’ to exit with a nonzero status because ‘sed’ cannot write to its
 output pipe after ‘grep’ exits.
 
    For more about the algorithms used by ‘grep’ and about related string
 matching algorithms, see:
 
    • Aho AV. Algorithms for finding patterns in strings. In: van Leeuwen
      J. _Handbook of Theoretical Computer Science_, vol. A. New York:
      Elsevier; 1990. p. 255-300. This surveys classic string matching
      algorithms, some of which are used by ‘grep’.
 
    • Aho AV, Corasick MJ. Efficient string matching: an aid to
      bibliographic search. _CACM_. 1975;18(6):333-40.
      <https://doi.org/10.1145/360825.360855>. This introduces the
      Aho-Corasick algorithm.
 
    • Boyer RS, Moore JS. A fast string searching algorithm. _CACM_.
      1977;20(10):762-72. <https://doi.org/10.1145/359842.359859>. This
      introduces the Boyer-Moore algorithm.
 
    • Faro S, Lecroq T. The exact online string matching problem: a
      review of the most recent results. _ACM Comput Surv_.
      2013;45(2):13. <https://doi.org/10.1145/2431211.2431212>. This
      surveys string matching algorithms that might help improve the
      performance of ‘grep’ in the future.
 
    • Hakak SI, Kamsin A, Shivakumara P, Gilkar GA, Khan WZ, Imran M.
      Exact string matching algorithms: survey issues, and future
      research directions. _IEEE Access_. 2019;7:69614-37.
      <https://doi.org/10.1109/ACCESS.2019.2914071>. This survey is more
      recent than Faro & Lecroq, and focuses on taxonomy instead of
      performance.
 
    • Hume A, Sunday D. Fast string search. _Software Pract Exper_.
      1991;21(11):1221-48. <https://doi.org/10.1002/spe.4380211105>. This
      excellent albeit now-dated survey aided the initial development of
      ‘grep’.