Pattern Matching


Pattern matching is a way to test whether data has a particular structure. It can be used for data validation and as a means to extract the part of a data item that matches against a specified element of the pattern. The pattern matching operations are the query processor LIKE and UNLIKE keywords, the QMBasic MATCHES operator, MATCHFIELD() and MATCHESS() functions, and the M search of ED. All of these compare a character string with a pattern template.

 

Pattern matching breaks the character set into three classes of character, each represented by a character type code:

A    Alphabetic, upper and lowercase A - Z
N    Numeric, digits 0 - 9
X    Any character, including alphanumerics

 

On ECS mode systems, determination of character class is based on the character map in use.

 

There are also three ways to specify how many characters are present:

4      Exactly 4 characters
2-7    Between 2 and 7 characters
0      Any number of characters, including none

 

A template consists of up to 30 concatenated elements, each formed from a length and a character type code:

 

0X          Zero or more characters of any type
nX          Exactly n characters of any type
n-mX        Between n and m characters of any type
0A          Zero or more alphabetic characters
nA          Exactly n alphabetic characters
n-mA        Between n and m alphabetic characters
0N          Zero or more numeric characters
nN          Exactly n numeric characters
n-mN        Between n and m numeric characters
"string"    A literal string which must match exactly. Either single or double quotation marks may be used. Use of the $NOCASE.STRINGS compiler directive makes the comparison case insensitive.

 

The values n and m are integers with any number of digits. m must be greater than or equal to n.

 

 

The 0X code is a wildcard that matches against anything. It has a commonly used synonym:

...         Zero or more characters of any type

 

 

The 0A, nA, 0N, nN and "string" patterns may be preceded by a tilde (~) to invert the match condition. For example, ~4N matches exactly four non-numeric characters, such as ABCD. It does not mean "any string other than four numeric characters", so 12C4 does not match.
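The ~4N example can be checked with a small Python model. This is not QM code: it assumes that ~4N corresponds to the regular expression [^0-9]{4}, and the function name is hypothetical.

```python
import re

# Assumed regex equivalent of the QM element ~4N:
# exactly four characters, none of which is a digit 0-9.
NOT_4N = re.compile(r"[^0-9]{4}")


def matches_inverted_4n(data: str) -> bool:
    """Return True if the whole string is four non-numeric characters."""
    return NOT_4N.fullmatch(data) is not None


print(matches_inverted_4n("ABCD"))  # True: four characters, none numeric
print(matches_inverted_4n("12C4"))  # False: three of the characters are digits
print(matches_inverted_4n("ABC"))   # False: only three characters
```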

 

A null string matches patterns ..., 0A, 0X, 0N, their inverses (~0A, etc) and "".

 

The 0X and n-mX patterns match against as few characters as necessary before control passes to the next pattern. For example, the string ABC123DEF matched against the pattern 0X2N0X matches the pattern components as ABC, 12 and 3DEF.

 

The 0N, n-mN, 0A, and n-mA patterns match against as many characters as possible. For example, the string ABC123DEF matched against the pattern 0X2-3N0X matches the pattern components as ABC, 123 and DEF.
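The shortest-match and longest-match rules can be modelled by translating a template into a Python regular expression. The sketch below is an illustration, not QM's implementation: it assumes X elements behave like lazy quantifiers and A and N elements like greedy ones, it covers only the ..., quoted literal and counted A/N/X elements, and it ignores the tilde prefix, unquoted literals and the 30 element limit. The names template_to_regex and components are hypothetical.

```python
import re

# Assumed regex equivalents of the three character type codes.
CLASSES = {"A": "[A-Za-z]", "N": "[0-9]", "X": r"[\s\S]"}

# One template element: the ... wildcard, a quoted literal,
# or a length (n or n-m) followed by a type code.
ELEMENT = re.compile(
    r"""(?P<dots>\.\.\.)
      | (?P<quote>['"])(?P<literal>.*?)(?P=quote)
      | (?P<low>\d+)(?:-(?P<high>\d+))?(?P<code>[ANX])
    """,
    re.VERBOSE,
)


def template_to_regex(template: str) -> str:
    """Translate a template into a regex with one capture group per element."""
    groups = []
    pos = 0
    while pos < len(template):
        m = ELEMENT.match(template, pos)
        if m is None:
            raise ValueError(f"unsupported element at offset {pos}")
        if m.group("dots"):
            body = r"[\s\S]*?"                    # ... is a synonym for 0X
        elif m.group("quote"):
            body = re.escape(m.group("literal"))  # literal must match exactly
        else:
            low, high, code = int(m.group("low")), m.group("high"), m.group("code")
            exact = high is None and low != 0
            if high is not None:
                count = "{%d,%d}" % (low, int(high))
            elif low == 0:
                count = "*"
            else:
                count = "{%d}" % low
            # X takes as few characters as necessary (lazy);
            # A and N take as many as possible (greedy).
            lazy = "?" if code == "X" and not exact else ""
            body = CLASSES[code] + count + lazy
        groups.append("(" + body + ")")
        pos = m.end()
    return "".join(groups)


def components(data: str, template: str):
    """Return the text matched by each element, or None if there is no match."""
    m = re.fullmatch(template_to_regex(template), data)
    return m.groups() if m else None


print(components("ABC123DEF", "0X2N0X"))    # ('ABC', '12', '3DEF')
print(components("ABC123DEF", "0X2-3N0X"))  # ('ABC', '123', 'DEF')
```

The two calls reproduce the component splits given in the examples above. A regular expression engine backtracks freely, so this model may differ from QM on templates not covered by those examples.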

 

A pattern may contain unquoted literal elements so long as they do not cause ambiguity. Note that each character will be treated as a separate literal element such that a pattern

3AXYZ3A

has five elements and will match a string that is formed from three letters followed by the three literal characters X, Y, Z and a further three letters. The significance of the literal characters being treated as separate elements comes with MATCHFIELD() and PARSE().

 

The template string may contain alternative patterns separated by value marks. The source data will match the overall pattern if any of the pattern values match. If a match is found, the INMAT() function can be used to retrieve the value position within the pattern that matched.
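The alternative-pattern rule can be sketched in Python as follows. This is a model under stated assumptions rather than QM code: the value mark is represented as character 253, the first matching alternative is the one reported, 0 stands for "no match", and simple_match is a hypothetical stand-in that understands only counted A/N/X elements.

```python
import re

VALUE_MARK = "\xfd"  # assumption: value mark modelled as character 253

CLASSES = {"A": "[A-Za-z]", "N": "[0-9]", "X": r"[\s\S]"}
TOKEN = re.compile(r"(\d+)(?:-(\d+))?([ANX])")


def simple_match(data: str, pattern: str) -> bool:
    """Stand-in matcher: handles only counted elements such as 3N or 2-4A."""
    if not re.fullmatch(f"(?:{TOKEN.pattern})*", pattern):
        raise ValueError(f"unsupported pattern: {pattern!r}")
    regex = ""
    for low, high, code in TOKEN.findall(pattern):
        if high:
            count = "{%s,%s}" % (low, high)
        elif int(low) == 0:
            count = "*"
        else:
            count = "{%s}" % low
        regex += CLASSES[code] + count
    return re.fullmatch(regex, data) is not None


def matching_value(data: str, template: str) -> int:
    """Return the 1-based position of the first alternative that matches, else 0."""
    for position, pattern in enumerate(template.split(VALUE_MARK), start=1):
        if simple_match(data, pattern):
            return position
    return 0


template = VALUE_MARK.join(["3N", "2A3N", "1-5A"])
print(matching_value("AB123", template))  # 2
print(matching_value("hello", template))  # 3
print(matching_value("12345", template))  # 0
```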

 

 

The MATCHESS() function can be used to compare each element of a dynamic array with a pattern, returning an equivalently structured dynamic array of True/False values. Note the spelling of this function, with the trailing S to "pluralise" the name in the same way as other multivalue function names.
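The idea of an "equivalently structured" result can be illustrated with a short Python model. It is a sketch, not the QM function: it assumes the field, value and subvalue marks are characters 254, 253 and 252, that True and False are stored as 1 and 0, and it uses a hypothetical predicate is_1n0n in place of a real template.

```python
import re

# Assumption: field, value and subvalue marks modelled as characters 254, 253, 252.
FM, VM, SM = "\xfe", "\xfd", "\xfc"
MARKS = re.compile("([\xfc-\xfe])")


def matchess_model(dyn_array: str, element_matches) -> str:
    """Apply element_matches to every element, keeping the mark structure."""
    pieces = MARKS.split(dyn_array)  # elements at even indexes, marks at odd
    for i in range(0, len(pieces), 2):
        pieces[i] = "1" if element_matches(pieces[i]) else "0"
    return "".join(pieces)


def is_1n0n(element: str) -> bool:
    """Stand-in for the template 1N0N: one or more digits."""
    return re.fullmatch("[0-9]+", element) is not None


source = FM.join(["123", VM.join(["45", "4x5"]), ""])
result = matchess_model(source, is_1n0n)
# Show the marks as printable characters for display only.
print(result.replace(FM, "^").replace(VM, "]"))  # 1^1]0^0
```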

 

 

 

Examples

 

"A123BCD" would match successfully against patterns of

1A3N3A

1A1-3N3A

'A'1-3N3A

0A0N0A

1A...3A

1A~3A3A

and many more

 

It is often acceptable to omit the quotes around literal components. The above example would also match

A1-3N3A

There is no confusion between the leading A as a literal or as a character type as it is not preceded by a length value. It is, however, recommended that the quotes should be included. Omitting the quotes in a pattern used in the MATCHFIELD() function may affect the function's behaviour as each character of the literal will be counted as a separate component of the pattern.

 

 

A program might need to test whether data entered by a user is a non-negative integer (whole number) value. The QMBasic NUM() function can be used to test for numeric data but this would allow fractional or negative values. Testing against a pattern of "1-4N" would allow only integer values in the range 0 to 9999. To remove the upper limit, a pattern of 1N0N tests for one digit followed by any number of further digits, including none.
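The difference between the two patterns can be tried out with a Python model. The regular expressions are assumed equivalents of 1-4N and 1N0N, and the function names are hypothetical; this is not QMBasic.

```python
import re

UP_TO_9999 = re.compile(r"[0-9]{1,4}")   # assumed equivalent of 1-4N
ANY_LENGTH = re.compile(r"[0-9][0-9]*")  # assumed equivalent of 1N0N


def is_integer_0_to_9999(text: str) -> bool:
    """True for one to four digits and nothing else."""
    return UP_TO_9999.fullmatch(text) is not None


def is_non_negative_integer(text: str) -> bool:
    """True for one digit followed by any number of further digits."""
    return ANY_LENGTH.fullmatch(text) is not None


for text in ["0", "9999", "12345", "-5", "3.7", ""]:
    print(repr(text), is_integer_0_to_9999(text), is_non_negative_integer(text))
```

Both tests reject "-5", "3.7" and the empty string. Only the second accepts "12345".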