regex: bug fixes, docs
parent ad7bc37672
commit 36660ce749
				|  | @ -4,14 +4,14 @@ | ||||||
| 
 | 
 | ||||||
| ## introduction | ## introduction | ||||||
| 
 | 
 | ||||||
| Write here the introduction | Write here the introduction... not today!! -_- | ||||||
| 
 | 
 | ||||||
| ## Basic assumption | ## Basic assumption | ||||||
| 
 | 
 | ||||||
| In this release, during the writing of the code some assumption are made and are valid for all the features. | In this release, during the writing of the code some assumptions are made and are valid for all the features. | ||||||
| 
 | 
 | ||||||
| 1. The matching stops at the end of the string not at the newline chars. | 1. The matching stops at the end of the string not at the newline chars. | ||||||
| 2. The basic element of this regex engine are the tokens, in query string a simple char is a token. The token is the atomic unit of this regex engine. | 2. The basic elements of this regex engine are the tokens, in a query string a simple char is a token. The token is the atomic unit of this regex engine. | ||||||
| 
 | 
 | ||||||
| ## Match positional limiter | ## Match positional limiter | ||||||
| 
 | 
 | ||||||
|  | @ -21,11 +21,11 @@ The module supports the following features: | ||||||
| 
 | 
 | ||||||
| `^` (Caret.) Matches at the start of the string | `^` (Caret.) Matches at the start of the string | ||||||
| 
 | 
 | ||||||
| `?` Matches at the end of the string | `$` Matches at the end of the string | ||||||
| 
 | 
 | ||||||
| ## Tokens | ## Tokens | ||||||
| 
 | 
 | ||||||
| The tokens are the atomic unit used by this regex engine and can be ones of the following: | The tokens are the atomic units used by this regex engine and can be ones of the following: | ||||||
| 
 | 
 | ||||||
| ### Simple char | ### Simple char | ||||||
| 
 | 
 | ||||||
|  | @ -33,11 +33,11 @@ this token is a simple single character like `a`. | ||||||
| 
 | 
 | ||||||
| ### Char class (cc) | ### Char class (cc) | ||||||
| 
 | 
 | ||||||
| The cc match all the chars specified in its inside, it is delimited by square brackets `[ ]` | The cc match all the chars specified inside, it is delimited by square brackets `[ ]` | ||||||
| 
 | 
 | ||||||
| the sequence of chars in the class is evaluated with an OR operation. | the sequence of chars in the class is evaluated with an OR operation. | ||||||
| 
 | 
 | ||||||
| For example the following cc `[abc]` match any char that is `a` or `b` or `c` but doesn't match `C` or `z`. | For example, the following cc `[abc]` match any char that is `a` or `b` or `c` but doesn't match `C` or `z`. | ||||||
| 
 | 
 | ||||||
| Inside a cc is possible to specify a "range" of chars, for example `[ad-f]` is equivalent to write `[adef]`.  | Inside a cc is possible to specify a "range" of chars, for example `[ad-f]` is equivalent to write `[adef]`.  | ||||||
| 
 | 
 | ||||||
|  | @ -68,17 +68,17 @@ A meta-char can match different type of chars. | ||||||
| 
 | 
 | ||||||
| Each token can have a quantifier that specify how many times the char can or must be matched. | Each token can have a quantifier that specify how many times the char can or must be matched. | ||||||
| 
 | 
 | ||||||
| **Short quantifier** | #### **Short quantifier** | ||||||
| 
 | 
 | ||||||
| - `?` match 0 or 1 time, `a?b` match both `ab` or `b` | - `?` match 0 or 1 time, `a?b` match both `ab` or `b` | ||||||
| - `+` match at minimum 1 time, `a+` match both `aaa` or `a` | - `+` match at minimum 1 time, `a+` match both `aaa` or `a` | ||||||
| - `*` match 0 or more time, `a*b` match both `aaab` or `ab` or `b` | - `*` match 0 or more time, `a*b` match both `aaab` or `ab` or `b` | ||||||
| 
 | 
 | ||||||
| **Long quantifier** | #### **Long quantifier** | ||||||
| 
 | 
 | ||||||
| - `{x}` match exactly x time, `a{2}` match `aa` but doesn't match `aaa` or `a` | - `{x}` match exactly x time, `a{2}` match `aa` but doesn't match `aaa` or `a` | ||||||
| - `{min,}` match at minimum min time, `a{2,}` match `aaa` or `aa` but doesn't match `a` | - `{min,}` match at minimum min time, `a{2,}` match `aaa` or `aa` but doesn't match `a` | ||||||
| - `{,max}` match at least 1 time and maximum max time, `a{,2}` match `a` and `aa` but doesn't match `aaa` | - `{,max}` match at least 0 time and maximum max time, `a{,2}` match `a` and `aa` but doesn't match `aaa` | ||||||
| - `{min,max}` match from min times to max times, `a{2,3}` match `aa` and `aaa` but doesn't match `a` or `aaaa` | - `{min,max}` match from min times to max times, `a{2,3}` match `aa` and `aaa` but doesn't match `a` or `aaaa` | ||||||
| 
 | 
 | ||||||
| a long quantifier may have a `greedy off` flag that is the `?` char after the brackets, `{2,4}?` means to match the minimum number possible tokens in this case 2. | a long quantifier may have a `greedy off` flag that is the `?` char after the brackets, `{2,4}?` means to match the minimum number possible tokens in this case 2. | ||||||
|  | @ -102,7 +102,7 @@ the dot char match any char until the next token match is satisfied. | ||||||
| 
 | 
 | ||||||
| the token `|` is a logic OR operation between two consecutive tokens, `a|b` match a char that is `a` or `b`. | the token `|` is a logic OR operation between two consecutive tokens, `a|b` match a char that is `a` or `b`. | ||||||
| 
 | 
 | ||||||
| The or token can work in a "chained way": `a|(b)|cd ` test first `a` if the char is not `a` the test the group `(b)` and if the group doesn't match test the token `c`. | The OR token can work in a "chained way": `a|(b)|cd ` test first `a` if the char is not `a` then test the group `(b)` and if the group doesn't match test the token `c`. | ||||||
| 
 | 
 | ||||||
| **note: The OR work at token level! It doesn't work at concatenation level!** | **note: The OR work at token level! It doesn't work at concatenation level!** | ||||||
| 
 | 
 | ||||||
|  | @ -181,16 +181,16 @@ re.flag = regex.F_BIN | ||||||
| 
 | 
 | ||||||
| ### Initializer | ### Initializer | ||||||
| 
 | 
 | ||||||
| These function are helper that create the `RE` struct, a `RE` struct can be created manually if you needed. | These functions are helper that create the `RE` struct, a `RE` struct can be created manually if you needed. | ||||||
| 
 | 
 | ||||||
| **Simplified initializer** | #### **Simplified initializer** | ||||||
| 
 | 
 | ||||||
| ```v | ```v | ||||||
| // regex create a regex object from the query string and compile it | // regex create a regex object from the query string and compile it | ||||||
| pub fn regex(in_query string) (RE,int,int) | pub fn regex(in_query string) (RE,int,int) | ||||||
| ``` | ``` | ||||||
| 
 | 
 | ||||||
| **Base initializer** | #### **Base initializer** | ||||||
| 
 | 
 | ||||||
| ```v | ```v | ||||||
| // new_regex create a REgex of small size, usually sufficient for ordinary use | // new_regex create a REgex of small size, usually sufficient for ordinary use | ||||||
|  | @ -199,13 +199,13 @@ pub fn new_regex() RE | ||||||
| // new_regex_by_size create a REgex of large size, mult specify the scale factor of the memory that will be allocated | // new_regex_by_size create a REgex of large size, mult specify the scale factor of the memory that will be allocated | ||||||
| pub fn new_regex_by_size(mult int) RE | pub fn new_regex_by_size(mult int) RE | ||||||
| ``` | ``` | ||||||
| After the base initializer use, the regex expression must be compiled with: | After a base initializer is used, the regex expression must be compiled with: | ||||||
| ```v | ```v | ||||||
| // compile return (return code, index) where index is the index of the error in the query string if return code is an error code | // compile return (return code, index) where index is the index of the error in the query string if return code is an error code | ||||||
| pub fn (re mut RE) compile(in_txt string) (int,int) | pub fn (re mut RE) compile(in_txt string) (int,int) | ||||||
| ``` | ``` | ||||||
| 
 | 
 | ||||||
| ### Functions | ### Operative Functions | ||||||
| 
 | 
 | ||||||
| These are the operative functions | These are the operative functions | ||||||
| 
 | 
 | ||||||
|  | @ -227,7 +227,7 @@ pub fn (re mut RE) replace(in_txt string, repl string) string | ||||||
| 
 | 
 | ||||||
| This module has few small utilities to help the writing of regex expressions. | This module has few small utilities to help the writing of regex expressions. | ||||||
| 
 | 
 | ||||||
| **Syntax errors highlight** | ### **Syntax errors highlight** | ||||||
| 
 | 
 | ||||||
| the following example code show how to visualize the syntax errors in the compilation phase: | the following example code show how to visualize the syntax errors in the compilation phase: | ||||||
| 
 | 
 | ||||||
|  | @ -256,7 +256,7 @@ if re_err != COMPILE_OK { | ||||||
| 
 | 
 | ||||||
| ``` | ``` | ||||||
| 
 | 
 | ||||||
| **Compiled code** | ### **Compiled code** | ||||||
| 
 | 
 | ||||||
| It is possible view the compiled code calling the function `get_query()` the result will be something like this: | It is possible view the compiled code calling the function `get_query()` the result will be something like this: | ||||||
| 
 | 
 | ||||||
|  | @ -279,7 +279,7 @@ PC:  2 ist: 88000000 PROG_END {  0,  0} | ||||||
| 
 | 
 | ||||||
| `{m,n}` is the quantifier, the greedy off flag  `?`  will be showed if present in the token | `{m,n}` is the quantifier, the greedy off flag  `?`  will be showed if present in the token | ||||||
| 
 | 
 | ||||||
| **Log debug** | ### **Log debug** | ||||||
| 
 | 
 | ||||||
| The log debugger allow to print the status of the regex parser when the parser is running. | The log debugger allow to print the status of the regex parser when the parser is running. | ||||||
| 
 | 
 | ||||||
|  | @ -338,6 +338,21 @@ the columns have the following meaning: | ||||||
| 
 | 
 | ||||||
| `{2,3}:1?` quantifier `{min,max}`, `:1` is the actual counter of repetition, `?` is the greedy off flag if present | `{2,3}:1?` quantifier `{min,max}`, `:1` is the actual counter of repetition, `?` is the greedy off flag if present | ||||||
| 
 | 
 | ||||||
|  | ### **Custom Logger output** | ||||||
|  | 
 | ||||||
|  | The debug functions output uses the `stdout` as default, it is possible to  provide an alternative output setting a custom output function: | ||||||
|  | 
 | ||||||
|  | ```v | ||||||
|  | // custom print function, the input will be the regex debug string | ||||||
|  | fn custom_print(txt string) { | ||||||
|  | 	println("my log: $txt") | ||||||
|  | } | ||||||
|  | 
 | ||||||
|  | mut re := new_regex() | ||||||
|  | re.log_func = custom_print  // every debug output from now will call this function | ||||||
|  | 
 | ||||||
|  | ``` | ||||||
|  | 
 | ||||||
| ## Example code | ## Example code | ||||||
| 
 | 
 | ||||||
| Here there is a simple code to perform some basically match of strings | Here there is a simple code to perform some basically match of strings | ||||||
|  |  | ||||||
|  | @ -200,7 +200,6 @@ pub fn (re RE) get_parse_error_string(err int) string { | ||||||
| 	} | 	} | ||||||
| } | } | ||||||
| 
 | 
 | ||||||
| 
 |  | ||||||
| // utf8_str convert and utf8 sequence to a printable string
 | // utf8_str convert and utf8 sequence to a printable string
 | ||||||
| [inline] | [inline] | ||||||
| fn utf8_str(ch u32) string { | fn utf8_str(ch u32) string { | ||||||
|  | @ -245,7 +244,7 @@ mut: | ||||||
| 	// counters for quantifier check (repetitions)
 | 	// counters for quantifier check (repetitions)
 | ||||||
| 	rep int = 0 | 	rep int = 0 | ||||||
| 
 | 
 | ||||||
| 	// validator function pointer and control char
 | 	// validator function pointer
 | ||||||
| 	validator fn (byte) bool | 	validator fn (byte) bool | ||||||
| 
 | 
 | ||||||
| 	// groups variables
 | 	// groups variables
 | ||||||
|  | @ -280,9 +279,9 @@ pub const ( | ||||||
| 
 | 
 | ||||||
| struct StateDotObj{ | struct StateDotObj{ | ||||||
| mut: | mut: | ||||||
| 	i  int                = 0   // char index in the input buffer
 | 	i  int                = -1  // char index in the input buffer
 | ||||||
| 	pc int                = 0   // program counter saved
 | 	pc int                = -1   // program counter saved
 | ||||||
| 	mi int                = 0   // match_index saved
 | 	mi int                = -1   // match_index saved
 | ||||||
| 	group_stack_index int = -1  // group index stack pointer saved
 | 	group_stack_index int = -1  // group index stack pointer saved
 | ||||||
| } | } | ||||||
| 
 | 
 | ||||||
|  | @ -648,7 +647,7 @@ fn (re RE) parse_quantifier(in_txt string, in_i int) (int, int, int, bool) { | ||||||
| 
 | 
 | ||||||
| 		// min parsing skip if comma present
 | 		// min parsing skip if comma present
 | ||||||
| 		if status == .start && ch == `,` { | 		if status == .start && ch == `,` { | ||||||
| 			q_min = 1 // default min in a {} quantifier is 1
 | 			q_min = 0 // default min in a {} quantifier is 0
 | ||||||
| 			status = .comma_checked | 			status = .comma_checked | ||||||
| 			i++ | 			i++ | ||||||
| 			continue | 			continue | ||||||
|  | @ -998,6 +997,7 @@ pub fn (re mut RE) compile(in_txt string) (int,int) { | ||||||
| 	// Post processing
 | 	// Post processing
 | ||||||
| 	//******************************************
 | 	//******************************************
 | ||||||
| 
 | 
 | ||||||
|  | 
 | ||||||
| 	// count IST_DOT_CHAR to set the size of the state stack
 | 	// count IST_DOT_CHAR to set the size of the state stack
 | ||||||
| 	mut pc1 := 0 | 	mut pc1 := 0 | ||||||
| 	mut tmp_count := 0 | 	mut tmp_count := 0 | ||||||
|  | @ -1007,10 +1007,10 @@ pub fn (re mut RE) compile(in_txt string) (int,int) { | ||||||
| 		} | 		} | ||||||
| 		pc1++ | 		pc1++ | ||||||
| 	} | 	} | ||||||
|  | 
 | ||||||
| 	// init the state stack
 | 	// init the state stack
 | ||||||
| 	re.state_stack = [StateDotObj{}].repeat(tmp_count+1)	 | 	re.state_stack = [StateDotObj{}].repeat(tmp_count+1)	 | ||||||
| 	 | 	 | ||||||
| 	 |  | ||||||
| 	// OR branch
 | 	// OR branch
 | ||||||
| 	// a|b|cd
 | 	// a|b|cd
 | ||||||
| 	// d exit point
 | 	// d exit point
 | ||||||
|  | @ -1279,7 +1279,8 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) { | ||||||
| 
 | 
 | ||||||
| 	mut pc := -1                     // program counter
 | 	mut pc := -1                     // program counter
 | ||||||
| 	mut state := StateObj{}          // actual state
 | 	mut state := StateObj{}          // actual state
 | ||||||
| 	mut ist := u32(0)                // Program Counter
 | 	mut ist := u32(0)                // actual instruction
 | ||||||
|  | 	mut l_ist := u32(0)              // last matched instruction
 | ||||||
| 
 | 
 | ||||||
| 	mut group_stack      := [-1].repeat(re.group_max) | 	mut group_stack      := [-1].repeat(re.group_max) | ||||||
| 	mut group_data       := [-1].repeat(re.group_max) | 	mut group_data       := [-1].repeat(re.group_max) | ||||||
|  | @ -1417,17 +1418,10 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) { | ||||||
| 			} | 			} | ||||||
| 
 | 
 | ||||||
| 			// manage IST_DOT_CHAR
 | 			// manage IST_DOT_CHAR
 | ||||||
| 			if re.state_stack_index >= 0 { |  | ||||||
| 				//C.printf("DOT CHAR text end management!\n")
 |  | ||||||
| 				// if DOT CHAR is not the last instruction and we are still going, then no match!!
 |  | ||||||
| 				if pc < re.prog.len && re.prog[pc+1].ist != IST_PROG_END { |  | ||||||
| 					return NO_MATCH_FOUND,0 |  | ||||||
| 				} |  | ||||||
| 			} |  | ||||||
| 
 | 
 | ||||||
| 			m_state == .end | 			m_state == .end | ||||||
| 			break | 			break | ||||||
| 			return NO_MATCH_FOUND,0 | 			//return NO_MATCH_FOUND,0
 | ||||||
| 		} | 		} | ||||||
| 
 | 
 | ||||||
| 		// starting and init
 | 		// starting and init
 | ||||||
|  | @ -1475,7 +1469,8 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) { | ||||||
| 		// check if stop 
 | 		// check if stop 
 | ||||||
| 		if m_state == .stop { | 		if m_state == .stop { | ||||||
| 			// if we are in restore state ,do it and restart
 | 			// if we are in restore state ,do it and restart
 | ||||||
| 			if re.state_stack_index >= 0 {	 | 			//C.printf("re.state_stack_index %d\n",re.state_stack_index )
 | ||||||
|  | 			if re.state_stack_index >=0 && re.state_stack[re.state_stack_index].pc >= 0 { | ||||||
| 				i = re.state_stack[re.state_stack_index].i | 				i = re.state_stack[re.state_stack_index].i | ||||||
| 				pc = re.state_stack[re.state_stack_index].pc | 				pc = re.state_stack[re.state_stack_index].pc | ||||||
| 				state.match_index =	re.state_stack[re.state_stack_index].mi | 				state.match_index =	re.state_stack[re.state_stack_index].mi | ||||||
|  | @ -1499,14 +1494,24 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) { | ||||||
| 			// program end
 | 			// program end
 | ||||||
| 			if ist == IST_PROG_END { | 			if ist == IST_PROG_END { | ||||||
| 				// if we are in match exit well
 | 				// if we are in match exit well
 | ||||||
|  | 				 | ||||||
| 				if group_index >= 0 && state.match_index >= 0 { | 				if group_index >= 0 && state.match_index >= 0 { | ||||||
| 					group_index = -1 | 					group_index = -1 | ||||||
| 				} | 				} | ||||||
| 
 | 
 | ||||||
|  | 				// we have a DOT MATCH on going
 | ||||||
|  | 				//C.printf("IST_PROG_END l_ist: %08x\n", l_ist)
 | ||||||
|  | 				if re.state_stack_index>=0 && l_ist == IST_DOT_CHAR { | ||||||
| 					m_state = .stop | 					m_state = .stop | ||||||
| 					continue | 					continue | ||||||
| 				} | 				} | ||||||
| 
 | 
 | ||||||
|  | 				re.state_stack_index = -1 | ||||||
|  | 				m_state = .stop | ||||||
|  | 				continue | ||||||
|  | 				 | ||||||
|  | 			} | ||||||
|  | 
 | ||||||
| 			// check GROUP start, no quantifier is checkd for this token!!
 | 			// check GROUP start, no quantifier is checkd for this token!!
 | ||||||
| 			else if ist == IST_GROUP_START { | 			else if ist == IST_GROUP_START { | ||||||
| 				group_index++ | 				group_index++ | ||||||
|  | @ -1527,7 +1532,7 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) { | ||||||
| 					//C.printf("g.id: %d group_index: %d\n", re.prog[pc].group_id, group_index)
 | 					//C.printf("g.id: %d group_index: %d\n", re.prog[pc].group_id, group_index)
 | ||||||
| 					if group_index >= 0 { | 					if group_index >= 0 { | ||||||
| 	 					start_i   := group_stack[group_index] | 	 					start_i   := group_stack[group_index] | ||||||
| 	 					group_stack[group_index]=-1 | 	 					//group_stack[group_index]=-1
 | ||||||
| 
 | 
 | ||||||
| 	 					// save group results
 | 	 					// save group results
 | ||||||
| 						g_index := re.prog[pc].group_id*2 | 						g_index := re.prog[pc].group_id*2 | ||||||
|  | @ -1537,6 +1542,7 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) { | ||||||
| 							re.groups[g_index] = 0 | 							re.groups[g_index] = 0 | ||||||
| 						} | 						} | ||||||
| 						re.groups[g_index+1] = i | 						re.groups[g_index+1] = i | ||||||
|  | 						//C.printf("GROUP %d END [%d, %d]\n", re.prog[pc].group_id, re.groups[g_index], re.groups[g_index+1])
 | ||||||
| 					} | 					} | ||||||
| 					 | 					 | ||||||
| 					re.prog[pc].group_rep++ // increase repetitions
 | 					re.prog[pc].group_rep++ // increase repetitions
 | ||||||
|  | @ -1568,6 +1574,7 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) { | ||||||
| 			else if ist == IST_DOT_CHAR { | 			else if ist == IST_DOT_CHAR { | ||||||
| 				//C.printf("IST_DOT_CHAR rep: %d\n", re.prog[pc].rep)
 | 				//C.printf("IST_DOT_CHAR rep: %d\n", re.prog[pc].rep)
 | ||||||
| 				state.match_flag = true | 				state.match_flag = true | ||||||
|  | 				l_ist = u32(IST_DOT_CHAR) | ||||||
| 
 | 
 | ||||||
| 				if first_match < 0 { | 				if first_match < 0 { | ||||||
| 					first_match = i | 					first_match = i | ||||||
|  | @ -1575,12 +1582,23 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) { | ||||||
| 				state.match_index = i | 				state.match_index = i | ||||||
| 				re.prog[pc].rep++	 | 				re.prog[pc].rep++	 | ||||||
| 
 | 
 | ||||||
| 				if re.prog[pc].rep == 1 { | 				//if re.prog[pc].rep >= re.prog[pc].rep_min && re.prog[pc].rep <= re.prog[pc].rep_max {
 | ||||||
|  | 				if re.prog[pc].rep >= 0 && re.prog[pc].rep <= re.prog[pc].rep_max { | ||||||
|  | 					//C.printf("DOT CHAR save state : %d\n", re.state_stack_index)
 | ||||||
| 					// save the state
 | 					// save the state
 | ||||||
|  | 					 | ||||||
|  | 					// manage first dot char
 | ||||||
|  | 					if re.state_stack_index < 0 { | ||||||
| 						re.state_stack_index++ | 						re.state_stack_index++ | ||||||
|  | 					} | ||||||
|  | 
 | ||||||
| 					re.state_stack[re.state_stack_index].pc = pc | 					re.state_stack[re.state_stack_index].pc = pc | ||||||
| 					re.state_stack[re.state_stack_index].mi = state.match_index | 					re.state_stack[re.state_stack_index].mi = state.match_index | ||||||
| 					re.state_stack[re.state_stack_index].group_stack_index = group_index | 					re.state_stack[re.state_stack_index].group_stack_index = group_index | ||||||
|  | 				} else { | ||||||
|  | 					re.state_stack[re.state_stack_index].pc = -1 | ||||||
|  | 					re.state_stack[re.state_stack_index].mi = -1 | ||||||
|  | 					re.state_stack[re.state_stack_index].group_stack_index = -1 | ||||||
| 				} | 				} | ||||||
| 
 | 
 | ||||||
| 				if re.prog[pc].rep >= 1 && re.state_stack_index >= 0 { | 				if re.prog[pc].rep >= 1 && re.state_stack_index >= 0 { | ||||||
|  | @ -1590,19 +1608,11 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) { | ||||||
| 				// manage * and {0,} quantifier
 | 				// manage * and {0,} quantifier
 | ||||||
| 				if re.prog[pc].rep_min > 0 { | 				if re.prog[pc].rep_min > 0 { | ||||||
| 					i += char_len // next char
 | 					i += char_len // next char
 | ||||||
|  | 					l_ist = u32(IST_DOT_CHAR) | ||||||
| 				} | 				} | ||||||
| 
 | 
 | ||||||
| 				if re.prog[pc+1].ist !=  IST_GROUP_END { |  | ||||||
| 				m_state = .ist_next | 				m_state = .ist_next | ||||||
| 				continue | 				continue | ||||||
| 				}  |  | ||||||
| 				// IST_DOT_CHAR is the last instruction, get all
 |  | ||||||
| 				else { |  | ||||||
| 					//C.printf("We are the last one!\n")
 |  | ||||||
| 					pc--  |  | ||||||
| 					m_state = .ist_next_ks |  | ||||||
| 					continue |  | ||||||
| 				} |  | ||||||
| 
 | 
 | ||||||
| 			} | 			} | ||||||
| 
 | 
 | ||||||
|  | @ -1622,6 +1632,7 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) { | ||||||
| 
 | 
 | ||||||
| 				if cc_res { | 				if cc_res { | ||||||
| 					state.match_flag = true | 					state.match_flag = true | ||||||
|  | 					l_ist = u32(IST_CHAR_CLASS_POS) | ||||||
| 					 | 					 | ||||||
| 					if first_match < 0 { | 					if first_match < 0 { | ||||||
| 						first_match = i | 						first_match = i | ||||||
|  | @ -1645,6 +1656,7 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) { | ||||||
| 				//C.printf("BSLS in_ch: %c res: %d\n", ch, tmp_res)
 | 				//C.printf("BSLS in_ch: %c res: %d\n", ch, tmp_res)
 | ||||||
| 				if tmp_res { | 				if tmp_res { | ||||||
| 					state.match_flag = true | 					state.match_flag = true | ||||||
|  | 					l_ist = u32(IST_BSLS_CHAR) | ||||||
| 					 | 					 | ||||||
| 					if first_match < 0 { | 					if first_match < 0 { | ||||||
| 						first_match = i | 						first_match = i | ||||||
|  | @ -1669,6 +1681,7 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) { | ||||||
| 				if re.prog[pc].ch == ch | 				if re.prog[pc].ch == ch | ||||||
| 				{ | 				{ | ||||||
| 					state.match_flag = true | 					state.match_flag = true | ||||||
|  | 					l_ist = u32(IST_SIMPLE_CHAR) | ||||||
| 					 | 					 | ||||||
| 					if first_match < 0 { | 					if first_match < 0 { | ||||||
| 						first_match = i | 						first_match = i | ||||||
|  | @ -1857,7 +1870,7 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) { | ||||||
| 			} | 			} | ||||||
| 
 | 
 | ||||||
| 			// no other options
 | 			// no other options
 | ||||||
| 			//C.printf("NO_MATCH_FOUND\n")
 | 			//C.printf("ist_quant_n NO_MATCH_FOUND\n")
 | ||||||
| 			result = NO_MATCH_FOUND | 			result = NO_MATCH_FOUND | ||||||
| 			m_state = .stop | 			m_state = .stop | ||||||
| 			continue | 			continue | ||||||
|  | @ -1873,12 +1886,6 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) { | ||||||
| 
 | 
 | ||||||
| 			rep := re.prog[pc].rep | 			rep := re.prog[pc].rep | ||||||
| 			 | 			 | ||||||
| 			// clear the actual dot char capture state
 |  | ||||||
| 			if re.state_stack_index >= 0 { |  | ||||||
| 				//C.printf("Drop the DOT_CHAR state!\n")
 |  | ||||||
| 				re.state_stack_index-- |  | ||||||
| 			} |  | ||||||
| 
 |  | ||||||
| 			// under range
 | 			// under range
 | ||||||
| 			if rep > 0 && rep < re.prog[pc].rep_min { | 			if rep > 0 && rep < re.prog[pc].rep_min { | ||||||
| 				//C.printf("ist_quant_p UNDER RANGE\n")
 | 				//C.printf("ist_quant_p UNDER RANGE\n")
 | ||||||
|  |  | ||||||
|  | @ -34,14 +34,12 @@ match_test_suite = [ | ||||||
| 	TestItem{"this is a good sample.",r"( ?\w+){,5}",0,21}, | 	TestItem{"this is a good sample.",r"( ?\w+){,5}",0,21}, | ||||||
| 	TestItem{"this is a good sample.",r"( ?\w+){2,3}",0,9}, | 	TestItem{"this is a good sample.",r"( ?\w+){2,3}",0,9}, | ||||||
| 	TestItem{"this is a good sample.",r"(\s?\w+){2,3}",0,9},	 | 	TestItem{"this is a good sample.",r"(\s?\w+){2,3}",0,9},	 | ||||||
| 	TestItem{"this is a good sample.",r".*i(\w)+",0,4}, |  | ||||||
| 	TestItem{"this these those.",r"(th[ei]se?\s|\.)+",0,11}, | 	TestItem{"this these those.",r"(th[ei]se?\s|\.)+",0,11}, | ||||||
| 	TestItem{"this these those ",r"(th[eio]se? ?)+",0,17}, | 	TestItem{"this these those ",r"(th[eio]se? ?)+",0,17}, | ||||||
| 	TestItem{"this these those ",r"(th[eio]se? )+",0,17}, | 	TestItem{"this these those ",r"(th[eio]se? )+",0,17}, | ||||||
| 	TestItem{"this,these,those. over",r"(th[eio]se?[,. ])+",0,17}, | 	TestItem{"this,these,those. over",r"(th[eio]se?[,. ])+",0,17}, | ||||||
| 	TestItem{"soday,this,these,those. over",r"(th[eio]se?[,. ])+",6,23}, | 	TestItem{"soday,this,these,those. over",r"(th[eio]se?[,. ])+",6,23}, | ||||||
| 	TestItem{"soday,this,these,those. over",r".*,(th[eio]se?[,. ])+",0,23}, | 	 | ||||||
| 	TestItem{"soday,this,these,thesa.thesi over",r".*,(th[ei]se?[,. ])+(thes[ai][,. ])+",0,29}, |  | ||||||
| 	TestItem{"cpapaz",r"(c(pa)+z)",0,6}, | 	TestItem{"cpapaz",r"(c(pa)+z)",0,6}, | ||||||
| 	TestItem{"this is a cpapaz over",r"(c(pa)+z)",10,16}, | 	TestItem{"this is a cpapaz over",r"(c(pa)+z)",10,16}, | ||||||
| 	TestItem{"this is a cpapapez over",r"(c(p[ae])+z)",10,18}, | 	TestItem{"this is a cpapapez over",r"(c(p[ae])+z)",10,18}, | ||||||
|  | @ -56,16 +54,23 @@ match_test_suite = [ | ||||||
| 	TestItem{"this cpapaz adce aabe",r"(c(pa)+z)(\s[\a]+){2}",5,21}, | 	TestItem{"this cpapaz adce aabe",r"(c(pa)+z)(\s[\a]+){2}",5,21}, | ||||||
| 	TestItem{"1234this cpapaz adce aabe",r"(c(pa)+z)(\s[\a]+){2}$",9,25}, | 	TestItem{"1234this cpapaz adce aabe",r"(c(pa)+z)(\s[\a]+){2}$",9,25}, | ||||||
| 	TestItem{"this cpapaz adce aabe third",r"(c(pa)+z)(\s[\a]+){2}",5,21}, | 	TestItem{"this cpapaz adce aabe third",r"(c(pa)+z)(\s[\a]+){2}",5,21}, | ||||||
|  | 	TestItem{"123cpapaz ole. pippo",r"(c(pa)+z)(\s+\a+[\.,]?)+",3,20}, | ||||||
|  | 	 | ||||||
|  | 	TestItem{"this is a good sample.",r".*i(\w)+",0,4}, | ||||||
|  | 	TestItem{"soday,this,these,those. over",r".*,(th[eio]se?[,. ])+",0,23}, | ||||||
|  | 	TestItem{"soday,this,these,thesa.thesi over",r".*,(th[ei]se?[,. ])+(thes[ai][,. ])+",0,29}, | ||||||
| 	TestItem{"cpapaz ole. pippo,",r".*(c(pa)+z)(\s+\a+[\.,]?)+",0,18}, | 	TestItem{"cpapaz ole. pippo,",r".*(c(pa)+z)(\s+\a+[\.,]?)+",0,18}, | ||||||
| 	TestItem{"cpapaz ole. pippo",r".*(c(pa)+z)(\s+\a+[\.,]?)+",0,17}, | 	TestItem{"cpapaz ole. pippo",r".*(c(pa)+z)(\s+\a+[\.,]?)+",0,17}, | ||||||
| 	TestItem{"cpapaz ole. pippo, 852",r".*(c(pa)+z)(\s+\a+[\.,]?)+",0,18}, | 	TestItem{"cpapaz ole. pippo, 852",r".*(c(pa)+z)(\s+\a+[\.,]?)+",0,18}, | ||||||
| 	TestItem{"123cpapaz ole. pippo",r".*(c(pa)+z)(\s+\a+[\.,]?)+",0,20}, | 	TestItem{"123cpapaz ole. pippo",r".*(c(pa)+z)(\s+\a+[\.,]?)+",0,20}, | ||||||
| 	TestItem{"...cpapaz ole. pippo",r".*(c(pa)+z)(\s+\a+[\.,]?)+",0,20}, | 	TestItem{"...cpapaz ole. pippo",r".*(c(pa)+z)(\s+\a+[\.,]?)+",0,20}, | ||||||
| 	TestItem{"123cpapaz ole. pippo",r"(c(pa)+z)(\s+\a+[\.,]?)+",3,20}, | 	 | ||||||
| 	TestItem{"cpapaz ole. pippo,",r".*c.+ole.*pi",0,14}, | 	TestItem{"cpapaz ole. pippo,",r".*c.+ole.*pi",0,14}, | ||||||
| 	TestItem{"cpapaz ole. pipipo,",r".*c.+ole.*p([ip])+o",0,18}, | 	TestItem{"cpapaz ole. pipipo,",r".*c.+ole.*p([ip])+o",0,18}, | ||||||
| 	TestItem{"cpapaz ole. pipipo",r"^.*c.+ol?e.*p([ip])+o$",0,18}, | 	TestItem{"cpapaz ole. pipipo",r"^.*c.+ol?e.*p([ip])+o$",0,18}, | ||||||
| 	TestItem{"abbb",r"ab{2,3}?",0,3}, | 	TestItem{"abbb",r"ab{2,3}?",0,3}, | ||||||
|  | 	TestItem{" pippo pera",r"\s(.*)pe(.*)",0,11}, | ||||||
|  | 	TestItem{" abb",r"\s(.*)",0,4}, | ||||||
| 
 | 
 | ||||||
| 	// negative
 | 	// negative
 | ||||||
| 	TestItem{"zthis ciao",r"((t[hieo]+se?)\s*)+",-1,0}, | 	TestItem{"zthis ciao",r"((t[hieo]+se?)\s*)+",-1,0}, | ||||||
|  |  | ||||||