You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
382 lines
13 KiB
382 lines
13 KiB
.TH AGREP 1 "Jan 17, 1992" |
|
.SH NAME |
|
agrep \- search a file for a string or regular expression, with approximate matching capabilities |
|
.SH SYNOPSIS |
|
.B agrep |
|
[ |
|
.B \-#cdehiklnpstvwxBDGIS |
|
] |
|
.I pattern |
|
[ \-f |
|
.I patternfile |
|
] |
|
[ |
|
.IR filename ".\|.\|. ]" |
|
.SH DESCRIPTION |
|
.B agrep |
|
searches the input |
|
.IR filenames |
|
(standard input is the default, but see a warning under LIMITATIONS) |
|
for records containing strings which either |
|
\fIexactly\fP or \fIapproximately\fP match a pattern. |
|
A record is by default a line, but it can be defined differently using |
|
the \-d option (see below). |
|
Normally, each record found is copied to the standard output. |
|
Approximate matching allows finding records that contain the pattern |
|
with several errors including substitutions, insertions, and |
|
deletions. |
|
For example, Massechusets matches Massachusetts with two errors |
|
(one substitution and one insertion). Running |
|
.B agrep |
|
\-2 Massechusets foo outputs all lines in foo containing any string with |
|
at most 2 errors from Massechusets. |
|
.LP |
|
.B agrep |
|
supports many kinds of queries including |
|
arbitrary wild cards, sets of patterns, and in general, |
|
regular expressions. |
|
See PATTERNS below. |
|
It supports most of the options supported by the |
|
.B grep |
|
family plus several more (but it is not 100% compatible with grep). |
|
For more information on the algorithms used by agrep see |
|
Wu and Manber, |
|
"Fast Text Searching With Errors," |
|
Technical report #91-11, Department of Computer Science, University |
|
of Arizona, June 1991 (available by anonymous ftp from cs.arizona.edu |
|
in agrep/agrep.ps.1), and |
|
Wu and Manber, |
|
"Agrep -- A Fast Approximate Pattern Searching Tool", |
|
To appear in USENIX Conference 1992 January (available by anonymous ftp |
|
from cs.arizona.edu in agrep/agrep.ps.2). |
|
.LP |
|
As with the rest of the \fBgrep\fP family, the characters |
|
.RB ` $ ', |
|
.RB `^ ', |
|
.RB ` \(** ', |
|
.RB ` [ ' , |
|
.RB ` ] ' , |
|
.RB ` \s+2^\s0 ', |
|
.RB ` | ', |
|
.RB ` ( ', |
|
.RB ` ) ', |
|
.RB ` ! ', |
|
and |
|
.RB ` \e ' |
|
can cause unexpected results when included in the |
|
.IR pattern , |
|
as these characters are also meaningful |
|
to the shell. To avoid these problems, one should always enclose the entire |
|
pattern argument in single quotes, i.e., 'pattern'. |
|
Do not use double quotes ("). |
|
.LP |
|
When |
|
.B agrep |
|
is applied to more than one input |
|
file, the name of the file is displayed |
|
preceding each line which matches |
|
the pattern. The filename is not displayed |
|
when processing a single |
|
file, so if you actually want the filename |
|
to appear, use |
|
.B /dev/null |
|
as a second file in the list. |
|
.SH OPTIONS |
|
.TP |
|
.B \-\fI#\fP |
|
\fI#\fP is a non-negative integer (at most 8) |
|
specifying the maximum number of errors |
|
permitted in finding the approximate matches (defaults to zero). |
|
Generally, each insertion, deletion, or substitution counts as one error. |
|
It is possible to adjust the relative cost of insertions, |
|
deletions and substitutions (see \-I \-D and \-S options). |
|
.TP |
|
.B \-c |
|
Display only the count of matching records. |
|
.TP |
|
.B \-d "'\fIdelim\fP'" |
|
Define \fIdelim\fP to be the separator between two records. |
|
The default value is '$', namely a record is by default |
|
a line. |
|
\fIdelim\fP can be a string of size at most 8 |
|
(with possible use of ^ and $), but not |
|
a regular expression. |
|
Text between two \fIdelim\fP's, before the first \fIdelim\fP, |
|
and after the last \fIdelim\fP is considered as one record. |
|
For example, \-d '$$' defines paragraphs as records and \-d '^From\ ' |
|
defines mail messages as records. |
|
.B agrep |
|
matches each record separately. |
|
This option does not currently work with regular expressions. |
|
.TP |
|
.BI \-e " pattern" |
|
Same as a simple |
|
.I pattern |
|
argument, but useful when the |
|
.I pattern |
|
begins with a |
|
.RB ` \- '. |
|
.TP |
|
.BI \-f " patternfile" |
|
.I patternfile |
|
contains a set of (simple) patterns. |
|
The output is all lines that match at least one of the patterns in |
|
.I patternfile. |
|
Currently, the \-f option works only for exact match and for simple |
|
patterns (any meta symbol is interpreted as a regular character); |
|
it is compatible only with \-c, \-h, \-i, \-l, \-s, \-v, \-w, and \-x options. |
|
see LIMITATIONS for size bounds. |
|
.TP |
|
.B \-h |
|
Do not display filenames. |
|
.TP |
|
.B \-i |
|
Case-insensitive search \(em e.g., "A" and "a" are considered equivalent. |
|
.TP |
|
.B \-k |
|
No symbol in the pattern is treated as a meta character. |
|
For example, agrep \-k 'a(b|c)*d' foo will find |
|
the occurrences of a(b|c)*d in foo whereas agrep 'a(b|c)*d' foo |
|
will find substrings in foo that match the regular expression 'a(b|c)*d'. |
|
.TP |
|
.B \-l |
|
List only the files that contain a match. |
|
This option is useful for looking for files containing a certain pattern. |
|
For example, " agrep \-l 'wonderful' * " will list the names of those |
|
files in current directory that contain the word 'wonderful'. |
|
.TP |
|
.B \-n |
|
Each line that is printed is prefixed by its record number in the file. |
|
.TP |
|
.B \-p |
|
Find records in the text that contain a supersequence of the pattern. |
|
For example, |
|
\fB agrep \-p DCS foo |
|
will match "Department of Computer Science." |
|
.TP |
|
.B \-s |
|
Work silently, that is, display nothing except error messages. |
|
This is useful for checking the error status. |
|
.TP |
|
.B \-t |
|
Output the record starting from the end of |
|
.I delim |
|
to (and including) the next |
|
.I delim. |
|
This is useful for cases where |
|
.I delim |
|
should come at the end of the record. |
|
.TP |
|
.B \-v |
|
Inverse mode \(em display only those records that |
|
.I do not |
|
contain the pattern. |
|
.TP |
|
.B \-w |
|
Search for the pattern as a word \(em i.e., surrounded by non-alphanumeric |
|
characters. The non-alphanumeric |
|
.B must |
|
surround the match; they cannot be counted as errors. |
|
For example, |
|
.B agrep |
|
\-w \-1 car will match cars, but not characters. |
|
.TP |
|
.B \-x |
|
The pattern must match the whole line. |
|
.TP |
|
.B \-y |
|
Used with \-B option. When \-y is on, agrep will always |
|
output the best matches without giving a prompt. |
|
.TP |
|
.B \-B |
|
Best match mode. |
|
When \-B is specified and no exact matches are found, agrep |
|
will continue to search until the closest matches (i.e., the ones |
|
with minimum number of errors) |
|
are found, at which point the following message will be shown: |
|
"the best match contains x errors, there are y matches, output them? (y/n)" |
|
The best match mode is not supported for standard input, e.g., |
|
pipeline input. |
|
When the \-#, \-c, or \-l options are specified, the \-B option is ignored. |
|
In general, \-B may be slower than \-#, but not by very much. |
|
.TP |
|
.B \-D\fIk\fP |
|
Set the cost of a deletion to \fIk\fP (\fIk\fP is a positive integer). |
|
This option does not currently work with regular expressions. |
|
.TP |
|
.B \-G |
|
Output the files that contain a match. |
|
.TP |
|
.B \-I\fIk\fP |
|
Set the cost of an insertion to \fIk\fP (\fIk\fP is a positive integer). |
|
This option does not currently work with regular expressions. |
|
.TP |
|
.B \-S\fIk\fP |
|
Set the cost of a substitution to \fIk\fP (\fIk\fP is a positive integer). |
|
This option does not currently work with regular expressions. |
|
.ne 4 |
|
.SH PATTERNS |
|
.LP |
|
\fIagrep\fP |
|
supports a large variety of patterns, including simple |
|
strings, strings with classes of characters, sets of strings, |
|
wild cards, and regular expressions. |
|
.TP |
|
\fBStrings\fP |
|
any sequence of characters, including the special symbols |
|
`^' for beginning of line and `$' for end of line. |
|
The special characters listed above ( |
|
.RB ` $ ', |
|
.RB `^ ', |
|
.RB ` \(** ', |
|
.RB ` [ ' , |
|
.RB ` \s+2^\s0 ', |
|
.RB ` | ', |
|
.RB ` ( ', |
|
.RB ` ) ', |
|
.RB ` ! ', |
|
and |
|
.RB ` \e ' |
|
) should be preceded by `\\' if they are to be matched as regular |
|
characters. For example, \\^abc\\\\ corresponds to the string ^abc\\, |
|
whereas ^abc corresponds to the string abc at the beginning of a |
|
line. |
|
.TP |
|
\fBClasses of characters\fP |
|
a list of characters inside [] (in order) corresponds to any character |
|
from the list. For example, [a-ho-z] is any character between a and h |
|
or between o and z. The symbol `^' inside [] complements the list. |
|
For example, [^i-n] denote any character in the character set except |
|
character 'i' to 'n'. |
|
The symbol `^' thus has two meanings, but this is consistent with |
|
egrep. |
|
The symbol `.' (don't care) stands for any symbol (except for the |
|
newline symbol). |
|
.TP |
|
\fBBoolean operations\fP |
|
.B agrep |
|
supports an `and' operation `;' |
|
and an `or' operation `,', |
|
but not a combination of both. For example, 'fast;network' searches |
|
for all records containing both words. |
|
.TP |
|
\fBWild cards\fP |
|
The symbol '#' is used to denote a wild card. # matches zero or any |
|
number of arbitrary characters. For example, |
|
ex#e matches example. The symbol # is equivalent to .* in egrep. |
|
In fact, .* will work too, because it is a valid regular expression |
|
(see below), but unless this is part of an actual regular expression, |
|
# will work faster. |
|
.TP |
|
\fBCombination of exact and approximate matching\fP |
|
any pattern inside angle brackets <> must match the text exactly even |
|
if the match is with errors. For example, <mathemat>ics matches |
|
mathematical with one error (replacing the last s with an a), but |
|
mathe<matics> does not match mathematical no matter how many errors we |
|
allow. |
|
.TP |
|
\fBRegular expressions\fP |
|
The syntax of regular expressions in \fBagrep\fP is in general the same as |
|
that for \fBegrep\fP. The union operation `|', Kleene closure `*', |
|
and parentheses () are all supported. |
|
Currently '+' is not supported. |
|
Regular expressions are currently limited to approximately 30 |
|
characters (generally excluding meta characters). Some options |
|
(\-d, \-w, \-f, \-t, \-x, \-D, \-I, \-S) do not |
|
currently work with regular expressions. |
|
The maximal number of errors for regular expressions that use '*' |
|
or '|' is 4. |
|
.SH EXAMPLES |
|
.LP |
|
.TP |
|
agrep \-2 \-c ABCDEFG foo |
|
gives the number of lines in file foo that contain ABCDEFG |
|
within two errors. |
|
.TP |
|
agrep \-1 \-D2 \-S2 'ABCD#YZ' foo |
|
outputs the lines containing ABCD followed, within arbitrary |
|
distance, by YZ, with up to one additional insertion |
|
(\-D2 and \-S2 make deletions and substitutions too "expensive"). |
|
.TP |
|
agrep \-5 \-p abcdefghij /path/to/dictionary/words |
|
outputs the list of all words containing at least 5 of the first 10 |
|
letters of the alphabet \fIin order\fR. (Try it: any list starting |
|
with academia and ending with sacrilegious must mean something!) |
|
.TP |
|
agrep \-1 'abc[0-9](de|fg)*[x-z]' foo |
|
outputs the lines containing, within up to one error, the string |
|
that starts with abc followed by one digit, followed by zero or more |
|
repetitions of either de or fg, followed by either x, y, or z. |
|
.TP |
|
agrep \-d '^From\ ' 'breakdown;internet' mbox |
|
outputs all mail messages (the pattern '^From\ ' separates mail messages |
|
in a mail file) that contain keywords 'breakdown' and 'internet'. |
|
.TP |
|
agrep \-d '$$' \-1 '<word1> <word2>' foo |
|
finds all paragraphs that contain word1 followed by word2 with one |
|
error in place of the blank. |
|
In particular, if word1 is the last word in a line and word2 |
|
is the first word in the next line, then the space will be |
|
substituted by a newline symbol and it will match. |
|
Thus, this is a way to overcome separation by a newline. |
|
Note that \-d '$$' (or another delim which spans more than one line) |
|
is necessary, because otherwise agrep searches |
|
only one line at a time. |
|
.TP |
|
agrep '^agrep' <this manual> |
|
outputs all the examples of the use of agrep in this man pages. |
|
.PD |
|
.SH "SEE ALSO" |
|
.BR ed (1), |
|
.BR ex (1), |
|
.BR grep (1V), |
|
.BR sh (1), |
|
.BR csh (1). |
|
.SH BUGS/LIMITATIONS |
|
Any bug reports or comments will be appreciated! |
|
Please mail them to sw@cs.arizona.edu or udi@cs.arizona.edu |
|
.LP |
|
Regular expressions do not support the '+' operator (match 1 or more |
|
instances of the preceding token). These can be searched for by using |
|
this syntax in the pattern: |
|
.sp |
|
.in 1.0i |
|
\&'\fIpattern\fB(\fIpattern\fB)*\fR' |
|
.in |
|
.sp |
|
(search for strings containing one instance of the pattern, followed by 0 or |
|
more instances of the pattern). |
|
.LP |
|
The following can cause an infinite loop: |
|
.B agrep |
|
pattern * > output_file. |
|
If the number of matches is high, they may be deposited in |
|
output_file before it is completely read leading to more matches of |
|
the pattern within output_file (the matches are against the whole |
|
directory). It's not clear whether this is a "bug" (grep will do the |
|
same), but be warned. |
|
.LP |
|
The maximum size of the |
|
.I patternfile |
|
is limited to be 250Kb, and the maximum number of patterns |
|
is limited to be 30,000. |
|
.LP |
|
Standard input is the default if no input file is given. |
|
However, if standard input is keyed in directly (as opposed to through |
|
a pipe, for example) agrep may not work for some non-simple patterns. |
|
.LP |
|
There is no size limit for simple patterns. |
|
More complicated patterns are currently limited to approximately 30 characters. |
|
Lines are limited to 1024 characters. |
|
Records are limited to 48K, and may be truncated if they are larger |
|
than that. |
|
The limit of record length can be |
|
changed by modifying the parameter Max_record in agrep.h. |
|
.SH DIAGNOSTICS |
|
Exit status is 0 if any matches are found, |
|
1 if none, 2 for syntax errors or inaccessible files. |
|
.SH AUTHORS |
|
Sun Wu and Udi Manber, Department of Computer Science, University of |
|
Arizona, Tucson, AZ 85721. {sw|udi}@cs.arizona.edu. |
|
|
|
|
|
|