regex: bug fixes, docs
parent
ad7bc37672
commit
36660ce749
|
@ -4,14 +4,14 @@
|
|||
|
||||
## introduction
|
||||
|
||||
Write here the introduction
|
||||
Write here the introduction... not today!! -_-
|
||||
|
||||
## Basic assumption
|
||||
|
||||
In this release, during the writing of the code some assumption are made and are valid for all the features.
|
||||
In this release, during the writing of the code some assumptions are made and are valid for all the features.
|
||||
|
||||
1. The matching stops at the end of the string, not at the newline chars.
|
||||
2. The basic element of this regex engine are the tokens, in query string a simple char is a token. The token is the atomic unit of this regex engine.
|
||||
2. The basic elements of this regex engine are the tokens; in a query string, a simple char is a token. The token is the atomic unit of this regex engine.
|
||||
|
||||
## Match positional limiter
|
||||
|
||||
|
@ -21,11 +21,11 @@ The module supports the following features:
|
|||
|
||||
`^` (Caret.) Matches at the start of the string
|
||||
|
||||
`?` Matches at the end of the string
|
||||
`$` Matches at the end of the string
|
||||
|
||||
## Tokens
|
||||
|
||||
The tokens are the atomic unit used by this regex engine and can be ones of the following:
|
||||
The tokens are the atomic units used by this regex engine and can be one of the following:
|
||||
|
||||
### Simple char
|
||||
|
||||
|
@ -33,11 +33,11 @@ this token is a simple single character like `a`.
|
|||
|
||||
### Char class (cc)
|
||||
|
||||
The cc match all the chars specified in its inside, it is delimited by square brackets `[ ]`
|
||||
The cc matches all the chars specified inside; it is delimited by square brackets `[ ]`.
|
||||
|
||||
The sequence of chars in the class is evaluated with an OR operation.
|
||||
|
||||
For example the following cc `[abc]` match any char that is `a` or `b` or `c` but doesn't match `C` or `z`.
|
||||
For example, the cc `[abc]` matches any char that is `a` or `b` or `c` but doesn't match `C` or `z`.
|
||||
|
||||
Inside a cc it is possible to specify a "range" of chars; for example, `[ad-f]` is equivalent to writing `[adef]`.
|
||||
|
||||
|
@ -68,17 +68,17 @@ A meta-char can match different type of chars.
|
|||
|
||||
Each token can have a quantifier that specifies how many times the char can or must be matched.
|
||||
|
||||
**Short quantifier**
|
||||
#### **Short quantifier**
|
||||
|
||||
- `?` matches 0 or 1 times, `a?b` matches both `ab` and `b`
|
||||
- `+` matches at least 1 time, `a+` matches both `aaa` and `a`
|
||||
- `*` matches 0 or more times, `a*b` matches `aaab`, `ab` and `b`
|
||||
|
||||
**Long quantifier**
|
||||
#### **Long quantifier**
|
||||
|
||||
- `{x}` matches exactly x times, `a{2}` matches `aa` but doesn't match `aaa` or `a`
|
||||
- `{min,}` matches at least min times, `a{2,}` matches `aaa` or `aa` but doesn't match `a`
|
||||
- `{,max}` match at least 1 time and maximum max time, `a{,2}` match `a` and `aa` but doesn't match `aaa`
|
||||
- `{,max}` matches at least 0 times and at most max times, `a{,2}` matches `a` and `aa` but doesn't match `aaa`
|
||||
- `{min,max}` matches from min to max times, `a{2,3}` matches `aa` and `aaa` but doesn't match `a` or `aaaa`
|
||||
|
||||
A long quantifier may have a `greedy off` flag, which is the `?` char after the brackets; `{2,4}?` means to match the minimum possible number of tokens, in this case 2.
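
As a rough illustration of the greedy-off flag, the sketch below compiles `ab{2,3}?` and runs it against `abbb`, the same pair the test suite in this commit lists with the expected span `0,3`. It assumes the module is imported as `regex`, the V syntax of this release, and a `match_string` method returning `(start, end)`; that method is not shown in this diff, so treat it as an assumption.

```v
import regex

fn main() {
	txt := "abbb"
	query := r"ab{2,3}?" // long quantifier with the greedy off flag

	// base initializer followed by an explicit compile step
	mut re := regex.new_regex()
	re_err, err_pos := re.compile(query)
	if re_err != regex.COMPILE_OK {
		println("compile error at index $err_pos")
		return
	}

	// match_string is assumed here, it does not appear in the diff
	start, end := re.match_string(txt)
	println("match span: $start, $end")
}
```

Without the trailing `?` the quantifier would be free to take up to three `b` chars, so comparing the two spans is a quick way to check how the flag behaves.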
|
||||
|
@ -102,7 +102,7 @@ the dot char match any char until the next token match is satisfied.
|
|||
|
||||
The token `|` is a logic OR operation between two consecutive tokens; `a|b` matches a char that is `a` or `b`.
|
||||
|
||||
The or token can work in a "chained way": `a|(b)|cd ` test first `a` if the char is not `a` the test the group `(b)` and if the group doesn't match test the token `c`.
|
||||
The OR token can work in a "chained way": `a|(b)|cd ` tests `a` first; if the char is not `a`, it then tests the group `(b)`, and if the group doesn't match, it tests the token `c`.
|
||||
|
||||
**note: The OR works at token level! It doesn't work at concatenation level!**
|
||||
|
||||
|
@ -181,16 +181,16 @@ re.flag = regex.F_BIN
|
|||
|
||||
### Initializer
|
||||
|
||||
These function are helper that create the `RE` struct, a `RE` struct can be created manually if you needed.
|
||||
These functions are helpers that create the `RE` struct; a `RE` struct can also be created manually if needed.
|
||||
|
||||
**Simplified initializer**
|
||||
#### **Simplified initializer**
|
||||
|
||||
```v
|
||||
// regex create a regex object from the query string and compile it
|
||||
pub fn regex(in_query string) (RE,int,int)
|
||||
```
|
||||
|
||||
**Base initializer**
|
||||
#### **Base initializer**
|
||||
|
||||
```v
|
||||
// new_regex create a REgex of small size, usually sufficient for ordinary use
|
||||
|
@ -199,13 +199,13 @@ pub fn new_regex() RE
|
|||
// new_regex_by_size create a REgex of large size, mult specify the scale factor of the memory that will be allocated
|
||||
pub fn new_regex_by_size(mult int) RE
|
||||
```
|
||||
After the base initializer use, the regex expression must be compiled with:
|
||||
After a base initializer is used, the regex expression must be compiled with:
|
||||
```v
|
||||
// compile return (return code, index) where index is the index of the error in the query string if return code is an error code
|
||||
pub fn (re mut RE) compile(in_txt string) (int,int)
|
||||
```
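
The two-step flow above (base initializer, then `compile`) can be sketched as follows. This is a minimal example under the assumption that the module is imported as `regex` and that the API matches the signatures shown in this commit; the query string is a hypothetical one, chosen to look malformed, and no particular error code is implied.

```v
import regex

fn main() {
	// hypothetical query: the char class is never closed
	query := r"ab[cd"

	mut re := regex.new_regex()

	// compile returns (return code, index of the error in the query string)
	re_err, err_pos := re.compile(query)
	if re_err != regex.COMPILE_OK {
		err_msg := re.get_parse_error_string(re_err)
		println("query: $query")
		println("error at index $err_pos: $err_msg")
		return
	}

	// on success, show the compiled program
	println(re.get_query())
}
```

Checking the return code before using the `RE` struct keeps a failed compilation from being used for matching.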
|
||||
|
||||
### Functions
|
||||
### Operative Functions
|
||||
|
||||
These are the operative functions
|
||||
|
||||
|
@ -227,7 +227,7 @@ pub fn (re mut RE) replace(in_txt string, repl string) string
|
|||
|
||||
This module has a few small utilities to help with writing regex expressions.
|
||||
|
||||
**Syntax errors highlight**
|
||||
### **Syntax errors highlight**
|
||||
|
||||
The following example code shows how to visualize the syntax errors in the compilation phase:
|
||||
|
||||
|
@ -256,7 +256,7 @@ if re_err != COMPILE_OK {
|
|||
|
||||
```
|
||||
|
||||
**Compiled code**
|
||||
### **Compiled code**
|
||||
|
||||
It is possible to view the compiled code by calling the function `get_query()`; the result will be something like this:
|
||||
|
||||
|
@ -279,7 +279,7 @@ PC: 2 ist: 88000000 PROG_END { 0, 0}
|
|||
|
||||
`{m,n}` is the quantifier; the greedy off flag `?` will be shown if present in the token
|
||||
|
||||
**Log debug**
|
||||
### **Log debug**
|
||||
|
||||
The log debugger allows printing the status of the regex parser while the parser is running.
|
||||
|
||||
|
@ -338,6 +338,21 @@ the columns have the following meaning:
|
|||
|
||||
`{2,3}:1?` quantifier `{min,max}`, `:1` is the actual counter of repetition, `?` is the greedy off flag if present
|
||||
|
||||
### **Custom Logger output**
|
||||
|
||||
The debug functions write to `stdout` by default; it is possible to provide an alternative output by setting a custom output function:
|
||||
|
||||
```v
|
||||
// custom print function, the input will be the regex debug string
|
||||
fn custom_print(txt string) {
|
||||
println("my log: $txt")
|
||||
}
|
||||
|
||||
mut re := new_regex()
|
||||
re.log_func = custom_print // every debug output from now will call this function
|
||||
|
||||
```
|
||||
|
||||
## Example code
|
||||
|
||||
Here is some simple code to perform basic matching of strings
|
||||
|
|
|
@ -200,7 +200,6 @@ pub fn (re RE) get_parse_error_string(err int) string {
|
|||
}
|
||||
}
|
||||
|
||||
|
||||
// utf8_str convert and utf8 sequence to a printable string
|
||||
[inline]
|
||||
fn utf8_str(ch u32) string {
|
||||
|
@ -231,7 +230,7 @@ mut:
|
|||
ist u32 = u32(0)
|
||||
|
||||
// char
|
||||
ch u32 = u32(0)// char of the token if any
|
||||
ch u32 = u32(0) // char of the token if any
|
||||
ch_len byte = byte(0) // char len
|
||||
|
||||
// Quantifiers / branch
|
||||
|
@ -245,7 +244,7 @@ mut:
|
|||
// counters for quantifier check (repetitions)
|
||||
rep int = 0
|
||||
|
||||
// validator function pointer and control char
|
||||
// validator function pointer
|
||||
validator fn (byte) bool
|
||||
|
||||
// groups variables
|
||||
|
@ -280,9 +279,9 @@ pub const (
|
|||
|
||||
struct StateDotObj{
|
||||
mut:
|
||||
i int = 0 // char index in the input buffer
|
||||
pc int = 0 // program counter saved
|
||||
mi int = 0 // match_index saved
|
||||
i int = -1 // char index in the input buffer
|
||||
pc int = -1 // program counter saved
|
||||
mi int = -1 // match_index saved
|
||||
group_stack_index int = -1 // group index stack pointer saved
|
||||
}
|
||||
|
||||
|
@ -648,7 +647,7 @@ fn (re RE) parse_quantifier(in_txt string, in_i int) (int, int, int, bool) {
|
|||
|
||||
// min parsing skip if comma present
|
||||
if status == .start && ch == `,` {
|
||||
q_min = 1 // default min in a {} quantifier is 1
|
||||
q_min = 0 // default min in a {} quantifier is 0
|
||||
status = .comma_checked
|
||||
i++
|
||||
continue
|
||||
|
@ -998,6 +997,7 @@ pub fn (re mut RE) compile(in_txt string) (int,int) {
|
|||
// Post processing
|
||||
//******************************************
|
||||
|
||||
|
||||
// count IST_DOT_CHAR to set the size of the state stack
|
||||
mut pc1 := 0
|
||||
mut tmp_count := 0
|
||||
|
@ -1007,10 +1007,10 @@ pub fn (re mut RE) compile(in_txt string) (int,int) {
|
|||
}
|
||||
pc1++
|
||||
}
|
||||
|
||||
// init the state stack
|
||||
re.state_stack = [StateDotObj{}].repeat(tmp_count+1)
|
||||
|
||||
|
||||
// OR branch
|
||||
// a|b|cd
|
||||
// d exit point
|
||||
|
@ -1279,7 +1279,8 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) {
|
|||
|
||||
mut pc := -1 // program counter
|
||||
mut state := StateObj{} // actual state
|
||||
mut ist := u32(0) // Program Counter
|
||||
mut ist := u32(0) // actual instruction
|
||||
mut l_ist := u32(0) // last matched instruction
|
||||
|
||||
mut group_stack := [-1].repeat(re.group_max)
|
||||
mut group_data := [-1].repeat(re.group_max)
|
||||
|
@ -1359,7 +1360,7 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) {
|
|||
tmp_gr := re.prog[re.prog[pc].goto_pc].group_rep
|
||||
buf2.write("GROUP_START #:${tmp_gi} rep:${tmp_gr} ")
|
||||
} else if ist == IST_GROUP_END {
|
||||
buf2.write("GROUP_END #:${re.prog[pc].group_id} deep:${group_index} ")
|
||||
buf2.write("GROUP_END #:${re.prog[pc].group_id} deep:${group_index}")
|
||||
}
|
||||
}
|
||||
if re.prog[pc].rep_max == MAX_QUANTIFIER {
|
||||
|
@ -1417,17 +1418,10 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) {
|
|||
}
|
||||
|
||||
// manage IST_DOT_CHAR
|
||||
if re.state_stack_index >= 0 {
|
||||
//C.printf("DOT CHAR text end management!\n")
|
||||
// if DOT CHAR is not the last instruction and we are still going, then no match!!
|
||||
if pc < re.prog.len && re.prog[pc+1].ist != IST_PROG_END {
|
||||
return NO_MATCH_FOUND,0
|
||||
}
|
||||
}
|
||||
|
||||
m_state = .end
|
||||
break
|
||||
return NO_MATCH_FOUND,0
|
||||
//return NO_MATCH_FOUND,0
|
||||
}
|
||||
|
||||
// starting and init
|
||||
|
@ -1475,7 +1469,8 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) {
|
|||
// check if stop
|
||||
if m_state == .stop {
|
||||
// if we are in restore state ,do it and restart
|
||||
if re.state_stack_index >= 0 {
|
||||
//C.printf("re.state_stack_index %d\n",re.state_stack_index )
|
||||
if re.state_stack_index >=0 && re.state_stack[re.state_stack_index].pc >= 0 {
|
||||
i = re.state_stack[re.state_stack_index].i
|
||||
pc = re.state_stack[re.state_stack_index].pc
|
||||
state.match_index = re.state_stack[re.state_stack_index].mi
|
||||
|
@ -1499,14 +1494,24 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) {
|
|||
// program end
|
||||
if ist == IST_PROG_END {
|
||||
// if we are in match exit well
|
||||
|
||||
if group_index >= 0 && state.match_index >= 0 {
|
||||
group_index = -1
|
||||
}
|
||||
|
||||
// we have a DOT MATCH on going
|
||||
//C.printf("IST_PROG_END l_ist: %08x\n", l_ist)
|
||||
if re.state_stack_index>=0 && l_ist == IST_DOT_CHAR {
|
||||
m_state = .stop
|
||||
continue
|
||||
}
|
||||
|
||||
re.state_stack_index = -1
|
||||
m_state = .stop
|
||||
continue
|
||||
|
||||
}
|
||||
|
||||
// check GROUP start, no quantifier is checkd for this token!!
|
||||
else if ist == IST_GROUP_START {
|
||||
group_index++
|
||||
|
@ -1527,7 +1532,7 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) {
|
|||
//C.printf("g.id: %d group_index: %d\n", re.prog[pc].group_id, group_index)
|
||||
if group_index >= 0 {
|
||||
start_i := group_stack[group_index]
|
||||
group_stack[group_index]=-1
|
||||
//group_stack[group_index]=-1
|
||||
|
||||
// save group results
|
||||
g_index := re.prog[pc].group_id*2
|
||||
|
@ -1537,6 +1542,7 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) {
|
|||
re.groups[g_index] = 0
|
||||
}
|
||||
re.groups[g_index+1] = i
|
||||
//C.printf("GROUP %d END [%d, %d]\n", re.prog[pc].group_id, re.groups[g_index], re.groups[g_index+1])
|
||||
}
|
||||
|
||||
re.prog[pc].group_rep++ // increase repetitions
|
||||
|
@ -1568,6 +1574,7 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) {
|
|||
else if ist == IST_DOT_CHAR {
|
||||
//C.printf("IST_DOT_CHAR rep: %d\n", re.prog[pc].rep)
|
||||
state.match_flag = true
|
||||
l_ist = u32(IST_DOT_CHAR)
|
||||
|
||||
if first_match < 0 {
|
||||
first_match = i
|
||||
|
@ -1575,12 +1582,23 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) {
|
|||
state.match_index = i
|
||||
re.prog[pc].rep++
|
||||
|
||||
if re.prog[pc].rep == 1 {
|
||||
//if re.prog[pc].rep >= re.prog[pc].rep_min && re.prog[pc].rep <= re.prog[pc].rep_max {
|
||||
if re.prog[pc].rep >= 0 && re.prog[pc].rep <= re.prog[pc].rep_max {
|
||||
//C.printf("DOT CHAR save state : %d\n", re.state_stack_index)
|
||||
// save the state
|
||||
|
||||
// manage first dot char
|
||||
if re.state_stack_index < 0 {
|
||||
re.state_stack_index++
|
||||
}
|
||||
|
||||
re.state_stack[re.state_stack_index].pc = pc
|
||||
re.state_stack[re.state_stack_index].mi = state.match_index
|
||||
re.state_stack[re.state_stack_index].group_stack_index = group_index
|
||||
} else {
|
||||
re.state_stack[re.state_stack_index].pc = -1
|
||||
re.state_stack[re.state_stack_index].mi = -1
|
||||
re.state_stack[re.state_stack_index].group_stack_index = -1
|
||||
}
|
||||
|
||||
if re.prog[pc].rep >= 1 && re.state_stack_index >= 0 {
|
||||
|
@ -1590,19 +1608,11 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) {
|
|||
// manage * and {0,} quantifier
|
||||
if re.prog[pc].rep_min > 0 {
|
||||
i += char_len // next char
|
||||
l_ist = u32(IST_DOT_CHAR)
|
||||
}
|
||||
|
||||
if re.prog[pc+1].ist != IST_GROUP_END {
|
||||
m_state = .ist_next
|
||||
continue
|
||||
}
|
||||
// IST_DOT_CHAR is the last instruction, get all
|
||||
else {
|
||||
//C.printf("We are the last one!\n")
|
||||
pc--
|
||||
m_state = .ist_next_ks
|
||||
continue
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
|
@ -1622,6 +1632,7 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) {
|
|||
|
||||
if cc_res {
|
||||
state.match_flag = true
|
||||
l_ist = u32(IST_CHAR_CLASS_POS)
|
||||
|
||||
if first_match < 0 {
|
||||
first_match = i
|
||||
|
@ -1645,6 +1656,7 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) {
|
|||
//C.printf("BSLS in_ch: %c res: %d\n", ch, tmp_res)
|
||||
if tmp_res {
|
||||
state.match_flag = true
|
||||
l_ist = u32(IST_BSLS_CHAR)
|
||||
|
||||
if first_match < 0 {
|
||||
first_match = i
|
||||
|
@ -1669,6 +1681,7 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) {
|
|||
if re.prog[pc].ch == ch
|
||||
{
|
||||
state.match_flag = true
|
||||
l_ist = u32(IST_SIMPLE_CHAR)
|
||||
|
||||
if first_match < 0 {
|
||||
first_match = i
|
||||
|
@ -1857,7 +1870,7 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) {
|
|||
}
|
||||
|
||||
// no other options
|
||||
//C.printf("NO_MATCH_FOUND\n")
|
||||
//C.printf("ist_quant_n NO_MATCH_FOUND\n")
|
||||
result = NO_MATCH_FOUND
|
||||
m_state = .stop
|
||||
continue
|
||||
|
@ -1873,12 +1886,6 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) {
|
|||
|
||||
rep := re.prog[pc].rep
|
||||
|
||||
// clear the actual dot char capture state
|
||||
if re.state_stack_index >= 0 {
|
||||
//C.printf("Drop the DOT_CHAR state!\n")
|
||||
re.state_stack_index--
|
||||
}
|
||||
|
||||
// under range
|
||||
if rep > 0 && rep < re.prog[pc].rep_min {
|
||||
//C.printf("ist_quant_p UNDER RANGE\n")
|
||||
|
|
|
@ -34,14 +34,12 @@ match_test_suite = [
|
|||
TestItem{"this is a good sample.",r"( ?\w+){,5}",0,21},
|
||||
TestItem{"this is a good sample.",r"( ?\w+){2,3}",0,9},
|
||||
TestItem{"this is a good sample.",r"(\s?\w+){2,3}",0,9},
|
||||
TestItem{"this is a good sample.",r".*i(\w)+",0,4},
|
||||
TestItem{"this these those.",r"(th[ei]se?\s|\.)+",0,11},
|
||||
TestItem{"this these those ",r"(th[eio]se? ?)+",0,17},
|
||||
TestItem{"this these those ",r"(th[eio]se? )+",0,17},
|
||||
TestItem{"this,these,those. over",r"(th[eio]se?[,. ])+",0,17},
|
||||
TestItem{"soday,this,these,those. over",r"(th[eio]se?[,. ])+",6,23},
|
||||
TestItem{"soday,this,these,those. over",r".*,(th[eio]se?[,. ])+",0,23},
|
||||
TestItem{"soday,this,these,thesa.thesi over",r".*,(th[ei]se?[,. ])+(thes[ai][,. ])+",0,29},
|
||||
|
||||
TestItem{"cpapaz",r"(c(pa)+z)",0,6},
|
||||
TestItem{"this is a cpapaz over",r"(c(pa)+z)",10,16},
|
||||
TestItem{"this is a cpapapez over",r"(c(p[ae])+z)",10,18},
|
||||
|
@ -56,16 +54,23 @@ match_test_suite = [
|
|||
TestItem{"this cpapaz adce aabe",r"(c(pa)+z)(\s[\a]+){2}",5,21},
|
||||
TestItem{"1234this cpapaz adce aabe",r"(c(pa)+z)(\s[\a]+){2}$",9,25},
|
||||
TestItem{"this cpapaz adce aabe third",r"(c(pa)+z)(\s[\a]+){2}",5,21},
|
||||
TestItem{"123cpapaz ole. pippo",r"(c(pa)+z)(\s+\a+[\.,]?)+",3,20},
|
||||
|
||||
TestItem{"this is a good sample.",r".*i(\w)+",0,4},
|
||||
TestItem{"soday,this,these,those. over",r".*,(th[eio]se?[,. ])+",0,23},
|
||||
TestItem{"soday,this,these,thesa.thesi over",r".*,(th[ei]se?[,. ])+(thes[ai][,. ])+",0,29},
|
||||
TestItem{"cpapaz ole. pippo,",r".*(c(pa)+z)(\s+\a+[\.,]?)+",0,18},
|
||||
TestItem{"cpapaz ole. pippo",r".*(c(pa)+z)(\s+\a+[\.,]?)+",0,17},
|
||||
TestItem{"cpapaz ole. pippo, 852",r".*(c(pa)+z)(\s+\a+[\.,]?)+",0,18},
|
||||
TestItem{"123cpapaz ole. pippo",r".*(c(pa)+z)(\s+\a+[\.,]?)+",0,20},
|
||||
TestItem{"...cpapaz ole. pippo",r".*(c(pa)+z)(\s+\a+[\.,]?)+",0,20},
|
||||
TestItem{"123cpapaz ole. pippo",r"(c(pa)+z)(\s+\a+[\.,]?)+",3,20},
|
||||
|
||||
TestItem{"cpapaz ole. pippo,",r".*c.+ole.*pi",0,14},
|
||||
TestItem{"cpapaz ole. pipipo,",r".*c.+ole.*p([ip])+o",0,18},
|
||||
TestItem{"cpapaz ole. pipipo",r"^.*c.+ol?e.*p([ip])+o$",0,18},
|
||||
TestItem{"abbb",r"ab{2,3}?",0,3},
|
||||
TestItem{" pippo pera",r"\s(.*)pe(.*)",0,11},
|
||||
TestItem{" abb",r"\s(.*)",0,4},
|
||||
|
||||
// negative
|
||||
TestItem{"zthis ciao",r"((t[hieo]+se?)\s*)+",-1,0},
|
||||
|
|