# V RegEx (Regular expression) 0.9c
[TOC]
## Introduction
Write here the introduction... not today!! -_-
## Basic assumption
In this release, some assumptions were made while writing the code; they hold for all the features.

1. Matching stops at the end of the string, not at newline chars.
2. The basic elements of this regex engine are the tokens; in a query string, a simple char is a token. The token is the atomic unit of this regex engine.
## Match positional limiter
The module supports the following features:

- `$` `^` delimiters

`^` (caret) matches at the start of the string.

`$` matches at the end of the string.
## Tokens
The tokens are the atomic units used by this regex engine and can be one of the following:
### Simple char
This token is a simple single character like `a`.
### Char class (cc)
The cc matches all the chars specified inside it; it is delimited by square brackets `[ ]`.

The sequence of chars in the class is evaluated with an OR operation.

For example, the cc `[abc]` matches any char that is `a` or `b` or `c` but doesn't match `C` or `z`.

Inside a cc it is possible to specify a "range" of chars; for example, `[ad-f]` is equivalent to writing `[adef]`.

A cc can have several ranges at the same time, like `[a-zA-Z0-9]`, which matches all the lowercase, uppercase and numeric chars.

It is possible to negate the cc by using the caret char at the start of the cc, like `[^abc]`, which matches every char that is not `a` or `b` or `c`.

A cc can contain meta-chars, like `[a-z\d]`, which matches all the lowercase Latin chars `a-z` and all the digits `\d`.

It is possible to mix all the properties of the char class together.
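
As a rough illustration of the char class rules above, the sketch below compiles a query that mixes a range with a negated class and tries it on a few strings. It assumes only the `new_regex()`, `compile()` and `match_string()` functions as documented later in this README; the query and the sample strings are invented for the example.

```v
import regex

fn main() {
	// [a-f] is a range, [^0-9] is a negated class:
	// one char between a and f, followed by one char that is not a digit
	query := r"[a-f][^0-9]"
	mut re := regex.new_regex()
	re_err, err_pos := re.compile(query)
	if re_err != regex.COMPILE_OK {
		println("compile error $re_err at index $err_pos")
		return
	}
	for text in ["cx", "c7", "zx"] {
		start, end := re.match_string(text)
		// start is -1 when no match is found
		println("[$text] => start: $start, end: $end")
	}
}
```

Only the first sample string satisfies both classes, so it is the only one expected to report a non-negative start index.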
### Meta-chars
A meta-char is specified by a backslash before a char, like `\w`; in this case the meta-char is `w`.

A meta-char can match different types of chars.

* `\w` matches an alphanumeric char `[a-zA-Z0-9]`
* `\W` matches a non alphanumeric char
* `\d` matches a digit `[0-9]`
* `\D` matches a non digit
* `\s` matches a space char, one of `[' ','\t','\n','\r','\v','\f']`
* `\S` matches a non space char
* `\a` matches only a lowercase char `[a-z]`
* `\A` matches only an uppercase char `[A-Z]`
### Quantifier
Each token can have a quantifier that specifies how many times the char can or must be matched.

#### **Short quantifier**

- `?` matches 0 or 1 time; `a?b` matches both `ab` and `b`
- `+` matches at least 1 time; `a+` matches both `aaa` and `a`
- `*` matches 0 or more times; `a*b` matches `aaab`, `ab` and `b`

#### **Long quantifier**

- `{x}` matches exactly x times; `a{2}` matches `aa` but doesn't match `aaa` or `a`
- `{min,}` matches at least min times; `a{2,}` matches `aaa` or `aa` but doesn't match `a`
- `{,max}` matches at least 0 times and at most max times; `a{,2}` matches `a` and `aa` but doesn't match `aaa`
- `{min,max}` matches from min to max times; `a{2,3}` matches `aa` and `aaa` but doesn't match `a` or `aaaa`

A long quantifier may have a `greedy off` flag, which is the `?` char after the brackets; `{2,4}?` means to match the minimum possible number of tokens, in this case 2.
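
The quantifier rules above can be tried out with a small sketch like the following. It assumes the `new_regex()`, `compile()`, `get_parse_error_string()` and `match_string()` functions described later in this README; the query `a{2,3}b` and the sample strings are made up for illustration.

```v
import regex

fn main() {
	// "a" repeated from 2 to 3 times, followed by a single "b"
	mut re := regex.new_regex()
	re_err, _ := re.compile(r"a{2,3}b")
	if re_err != regex.COMPILE_OK {
		err_str := re.get_parse_error_string(re_err)
		println("compile error: $err_str")
		return
	}
	for text in ["ab", "aab", "aaab"] {
		start, end := re.match_string(text)
		// start is -1 when no match is found
		println("[$text] => start: $start, end: $end")
	}
}
```

The string `ab` contains only one `a`, so it cannot satisfy the `{2,3}` quantifier, while the other two can.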
### dot char
The dot is a particular meta-char that matches "any char"; it is simpler to explain it with an example:

Suppose we have `abccc ddeef` as the source string to parse with the regex; the following table shows the query strings and the result of parsing the source string.
| query string | result |
| ------------ | ------ |
| `.*c` | `abc` |
| `.*dd` | `abccc dd` |
| `ab.*e` | `abccc dde` |
| `ab.{3} .*e` | `abccc dde` |
The dot char matches any char until the next token match is satisfied.
### OR token
The token `|` is a logic OR operation between two consecutive tokens; `a|b` matches a char that is `a` or `b`.

The OR token can work in a "chained way": `a|(b)|cd` first tests `a`; if the char is not `a`, it tests the group `(b)`, and if the group doesn't match, it tests the token `c`.

**note: The OR works at token level! It doesn't work at concatenation level!**

A query string like `abc|bde` is not equal to `(abc)|(bde)`!! The OR works only on `c|b`, not at char concatenation level.
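
To see the difference between the two query strings, a minimal sketch such as the one below can be used. The helper `try_match` is a hypothetical name introduced only for this example; the rest relies on `new_regex()`, `compile()` and `match_string()` as documented later in this README.

```v
import regex

// try_match is a hypothetical helper: it compiles the query and prints the match result
fn try_match(query string, text string) {
	mut re := regex.new_regex()
	re_err, _ := re.compile(query)
	if re_err != regex.COMPILE_OK {
		println("cannot compile [$query]")
		return
	}
	start, end := re.match_string(text)
	// start is -1 when no match is found
	println("query [$query] on [$text] => start: $start, end: $end")
}

fn main() {
	// token level OR: read as a, b, (c OR b), d, e
	try_match(r"abc|bde", "abcde")
	try_match(r"abc|bde", "bde")
	// group level OR: the whole "abc" OR the whole "bde"
	try_match(r"(abc)|(bde)", "bde")
}
```

Wrapping each alternative in a group is the way to get an OR between whole sequences of chars.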
### Groups
Groups are a method to create complex patterns with repetition of blocks of tokens.

The groups are delimited by round brackets `( )`; groups can be nested and can have a quantifier, like all the tokens.

`c(pa)+z` matches `cpapaz` or `cpaz` or `cpapapaz`.

`(c(pa)+z ?)+` matches `cpaz cpapaz cpapapaz` or `cpapaz`.

Let's analyze this last case. First we have the group `#0`, which is the outermost pair of round brackets `(...)+`; this group has a quantifier `+` that says to match its content at least one time.

Next we have a simple char token `c` and a second group, number `#1`: `(pa)+`. This group tries to match the sequence `pa` at least one time, as specified by the `+` quantifier.

After that, we have another simple token `z` and another simple token ` ?`, which is the space char (ASCII code 32) followed by the `?` quantifier, which says to capture the space char 0 or 1 time.

This explains why the `(c(pa)+z ?)+` query string can match `cpaz cpapaz cpapapaz`.

In this implementation the groups are "capture groups"; this means that the last captured result for each group can be retrieved from the `RE` struct.

The "capture groups" are stored as pairs of indexes in the field `groups`, which is an `[]int` inside the `RE` struct.
**example:**
```v
text := "cpaz cpapaz cpapapaz"
query := r"(c(pa)+z ?)+"
// re must be mutable: match_string has a mutable receiver
mut re, _, _ := regex.regex(query)

println(re.get_query())
// #0(c#1(pa)+z ?)+  // #0 and #1 are the ids of the groups, shown if re.debug is 1 or 2

start, end := re.match_string(text)
// [start=0, end=20] match => [cpaz cpapaz cpapapaz]

// each group uses two consecutive entries of re.groups: start index and end index
mut gi := 0
for gi < re.groups.len {
	if re.groups[gi] >= 0 {
		println("${gi/2} :[${text[re.groups[gi]..re.groups[gi+1]]}]")
	}
	gi += 2
}
// groups captured
// 0 :[cpapapaz]
// 1 :[pa]
```
**note:** *to show the `group id number` in the result of the `get_query()` the flag `debug` of the RE object must be `1` or `2`*
## Flags
It is possible to set some flags in the regex parser that change the behavior of the parser itself.
```v
// example of flag settings
mut re := regex.new_regex()
re.flag = regex.F_BIN
```
- `F_BIN`: parse a string as bytes, utf-8 management disabled.
- `F_EFM`: exit on the first char match in the query, used by the find function.
- `F_MS`: match only if the start index of the match is 0, same as `^` at the start of the query string.
- `F_ME`: match only if the end index of the match is the last char of the input string, same as `$` at the end of the query string.
- `F_NL`: stop the matching if a newline char `\n` or `\r` is found.
## Functions
### Initializer
These functions are helpers that create the `RE` struct; a `RE` struct can also be created manually if needed.

#### **Simplified initializer**
```v
// regex creates a regex object from the query string and compiles it
pub fn regex(in_query string) (RE,int,int)
```
#### **Base initializer**
```v
// new_regex creates a regex of small size, usually sufficient for ordinary use
pub fn new_regex() RE

// new_regex_by_size creates a regex of large size; mult specifies the scale factor of the memory that will be allocated
pub fn new_regex_by_size(mult int) RE
```
After a base initializer is used, the regex expression must be compiled with:
```v
// compile returns (return code, index), where index is the index of the error in the query string if the return code is an error code
pub fn (re mut RE) compile(in_txt string) (int,int)
```
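
The two-step flow (base initializer, then `compile`) can be sketched as follows. This is only an illustration built from the signatures above: the query `\d+` and the scale factor `4` are arbitrary values chosen for the example, and `COMPILE_OK` and `get_parse_error_string()` are used the same way as in the Debugging section of this README.

```v
import regex

fn main() {
	query := r"\d+"

	// step 1: create an empty RE struct of default size
	mut re := regex.new_regex()

	// step 2: compile the query string, checking the return code
	re_err, err_pos := re.compile(query)
	if re_err != regex.COMPILE_OK {
		err_str := re.get_parse_error_string(re_err)
		println("compile failed at index $err_pos: $err_str")
		return
	}
	println("compiled: ${re.get_query()}")

	// same flow with a larger RE struct; 4 is an arbitrary scale factor
	mut big_re := regex.new_regex_by_size(4)
	big_err, _ := big_re.compile(query)
	println("large regex compiled: ${big_err == regex.COMPILE_OK}")
}
```

The simplified initializer `regex()` performs both steps in a single call.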
### Operative Functions
These are the operative functions:
```v
// match_string tries to match the input string; returns the start and end index if found, else start is -1
pub fn (re mut RE) match_string(in_txt string) (int,int)

// find tries to find the first match in the input string; returns the start and end index if found, else start is -1
pub fn (re mut RE) find(in_txt string) (int,int)

// find_all finds all the "non overlapping" occurrences of the matching pattern; returns a list of start and end indexes
pub fn (re mut RE) find_all(in_txt string) []int

// replace returns a string where the matches are replaced with the replace string; only non-overlapping matches are used
pub fn (re mut RE) replace(in_txt string, repl string) string
```
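
A short sketch of how these functions might be used together is shown below. The sample text and query are invented for the example, and the layout of the list returned by `find_all` (consecutive start/end pairs, like the `groups` field) is an assumption based on the comment above, not something this README states explicitly.

```v
import regex

fn main() {
	text := "one 12 two 345 three 6"
	mut re := regex.new_regex()
	re_err, _ := re.compile(r"\d+")
	if re_err != regex.COMPILE_OK {
		println("compile error")
		return
	}

	// find: first match only, start is -1 if nothing is found
	start, end := re.find(text)
	if start >= 0 {
		println("first match: [${text[start..end]}]")
	}

	// find_all: ASSUMED to return consecutive start/end pairs
	res := re.find_all(text)
	mut i := 0
	for i + 1 < res.len {
		println("match: [${text[res[i]..res[i+1]]}]")
		i += 2
	}

	// replace: every non-overlapping match is replaced with "#"
	println(re.replace(text, "#"))
}
```

The slicing `text[start..end]` follows the convention used in the Groups example, where the end index is exclusive.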
## Debugging
This module has a few small utilities to help with writing regex expressions.

### **Syntax errors highlight**

The following example code shows how to visualize the syntax errors in the compilation phase:
```v
query := r"ciao da ab[ab-]" // there is an error, a range not closed!!
mut re := regex.new_regex()

// re_err  ==> is the return value, if < 0 it is an error
// err_pos ==> if re_err < 0, err_pos is the error index in the query string
re_err, err_pos := re.compile(query)

// print the error if one happens
if re_err != regex.COMPILE_OK {
	println("query: $query")
	lc := "-".repeat(err_pos)
	println("err : $lc^")
	err_str := re.get_parse_error_string(re_err) // get the error string
	println("ERROR: $err_str")
}

// output!!
//query: ciao da ab[ab-]
//err : ----------^
//ERROR: ERR_SYNTAX_ERROR
```
### **Compiled code**

It is possible to view the compiled code by calling the function `get_query()`; the result will be something like this:
```
========================================
v RegEx compiler v 0.9c output:
PC: 0 ist: 7fffffff [a] query_ch { 1, 1}
PC: 1 ist: 7fffffff [b] query_ch { 1,MAX}
PC: 2 ist: 88000000 PROG_END { 0, 0}
========================================
```
`PC`:`int` is the program counter or step of execution; each single step is a token.

`ist`:`hex` is the token instruction id.

`[a]` is the char used by the token.

`query_ch` is the type of token.

`{m,n}` is the quantifier; the greedy off flag `?` will be shown if present in the token.
### **Log debug**
The log debugger allows printing the status of the regex parser while the parser is running.

It is possible to have two different levels of debug: 1 is normal while 2 is verbose.

Here is an example:

*normal*

Lists only the token instructions with their values:
```
// re.debug = 1 // log level normal
flags: 00000000
# 2 s: ist_load PC: 0=>7fffffff i,ch,len:[ 0,'a',1] f.m:[ -1, -1] query_ch: [a]{1,1}:0 (#-1)
# 5 s: ist_load PC: 1=>7fffffff i,ch,len:[ 1,'b',1] f.m:[ 0, 0] query_ch: [b]{2,3}:0? (#-1)
# 7 s: ist_load PC: 1=>7fffffff i,ch,len:[ 2,'b',1] f.m:[ 0, 1] query_ch: [b]{2,3}:1? (#-1)
# 10 PROG_END
```
*verbose*

Lists all the instructions and states of the parser:
```
flags: 00000000
# 0 s: start PC: NA
# 1 s: ist_next PC: NA
# 2 s: ist_load PC: 0=>7fffffff i,ch,len:[ 0,'a',1] f.m:[ -1, -1] query_ch: [a]{1,1}:0 (#-1)
# 3 s: ist_quant_p PC: 0=>7fffffff i,ch,len:[ 1,'b',1] f.m:[ 0, 0] query_ch: [a]{1,1}:1 (#-1)
# 4 s: ist_next PC: NA
# 5 s: ist_load PC: 1=>7fffffff i,ch,len:[ 1,'b',1] f.m:[ 0, 0] query_ch: [b]{2,3}:0? (#-1)
# 6 s: ist_quant_p PC: 1=>7fffffff i,ch,len:[ 2,'b',1] f.m:[ 0, 1] query_ch: [b]{2,3}:1? (#-1)
# 7 s: ist_load PC: 1=>7fffffff i,ch,len:[ 2,'b',1] f.m:[ 0, 1] query_ch: [b]{2,3}:1? (#-1)
# 8 s: ist_quant_p PC: 1=>7fffffff i,ch,len:[ 3,'b',1] f.m:[ 0, 2] query_ch: [b]{2,3}:2? (#-1)
# 9 s: ist_next PC: NA
# 10 PROG_END
# 11 PROG_END
```
The columns have the following meaning:

`# 2` current step number from the start of parsing

`s: ist_next` state of the present step

`PC: 1` program counter of the step

`=>7fffffff` hex code of the instruction

`i,ch,len:[ 0,'a',1]` `i` index in the source string, `ch` the char parsed, `len` the length in bytes of the char parsed

`f.m:[ 0, 1]` `f` index of the first match in the source string, `m` index currently being matched

`query_ch: [b]` token in use and its char

`{2,3}:1?` quantifier `{min,max}`, `:1` is the current repetition counter, `?` is the greedy off flag if present
### **Custom Logger output**
The debug functions print to `stdout` by default; it is possible to provide an alternative output by setting a custom output function:
```v
// custom print function, the input will be the regex debug string
fn custom_print(txt string) {
	println("my log: $txt")
}

mut re := new_regex()
re.log_func = custom_print // every debug output from now will call this function
```
## Example code
Here is a simple program that performs some basic string matches:
```v
import regex

struct TestObj {
	source string // source string to parse
	query  string // regex query string
	s      int    // expected match start index
	e      int    // expected match end index
}

const (
	tests = [
		TestObj{"this is a good.",r"this (\w+) a",0,9},
		TestObj{"this,these,those. over",r"(th[eio]se?[,. ])+",0,17},
		TestObj{"test1@post.pip.com, pera",r"[\w]+@([\w]+\.)+\w+",0,18},
		TestObj{"cpapaz ole. pippo,",r".*c.+ole.*pi",0,14},
		TestObj{"adce aabe",r"(a(ab)+)|(a(dc)+)e",0,4},
	]
)

fn example() {
	for c,tst in tests {
		mut re := regex.new_regex()
		re_err, err_pos := re.compile(tst.query)
		if re_err == regex.COMPILE_OK {
			// print the query parsed with the groups ids
			re.debug = 1 // set debug on at minimum level
			println("#${c:2d} query parsed: ${re.get_query()}")
			re.debug = 0

			// do the match
			start, end := re.match_string(tst.source)
			if start >= 0 && end > start {
				println("#${c:2d} found in: [$start, $end] => [${tst.source[start..end]}]")
			}

			// print the groups
			mut gi := 0
			for gi < re.groups.len {
				if re.groups[gi] >= 0 {
					println("group ${gi/2:2d} :[${tst.source[re.groups[gi]..re.groups[gi+1]]}]")
				}
				gi += 2
			}
			println("")
		} else {
			// print the compile error
			println("query: $tst.query")
			lc := "-".repeat(err_pos-1)
			println("err : $lc^")
			err_str := re.get_parse_error_string(re_err)
			println("ERROR: $err_str")
		}
	}
}

fn main() {
	example()
}
```
More example code is available in the test code for the `regex` module, `vlib\regex\regex_test.v`.