regex: update README.md
parent
25fabac059
commit
d6448ee5d6
|
@ -10,8 +10,8 @@ Write here the introduction
|
||||||
|
|
||||||
In this release, some assumptions were made while writing the code; they are valid for all the features.
|
In this release, some assumptions were made while writing the code; they are valid for all the features.
|
||||||
|
|
||||||
1. The matching stop at the end of the string not at the newline chars
|
1. The matching stops at the end of the string not at the newline chars.
|
||||||
2. The basic element of this regex engine are the tokens, in aquery string a simple char is a token. The token is the atomic unit of this regex engine.
|
2. The basic elements of this regex engine are the tokens; in a query string, a simple char is a token. The token is the atomic unit of this regex engine.
|
||||||
|
|
||||||
## Match positional limiter
|
## Match positional limiter
|
||||||
|
|
||||||
|
@ -19,15 +19,13 @@ The module supports the following features:
|
||||||
|
|
||||||
- `$` `^` delimiter
|
- `$` `^` delimiter
|
||||||
|
|
||||||
|
`^` (Caret.) Matches at the start of the string
|
||||||
|
|
||||||
|
`$` Matches at the end of the string
|
||||||
`^` (Caret.) Matches the start of the string
|
|
||||||
|
|
||||||
`?` Matches the end of the string
|
|
||||||
|
|
||||||
## Tokens
|
## Tokens
|
||||||
|
|
||||||
The token are the atomic unit used by this regex engine and can be one of the following:
|
The tokens are the atomic units used by this regex engine and can be one of the following:
|
||||||
|
|
||||||
### Simple char
|
### Simple char
|
||||||
|
|
||||||
|
@ -35,25 +33,25 @@ this token is a simple single character like `a`.
|
||||||
|
|
||||||
### Char class (cc)
|
### Char class (cc)
|
||||||
|
|
||||||
The cc match all the char specified in its inside, it is delimited by square brackets `[ ]`
|
The cc matches any of the chars specified inside it and is delimited by square brackets `[ ]`;
|
||||||
|
|
||||||
the sequence of chars in the class is evaluated with an OR operation.
|
the sequence of chars in the class is evaluated with an OR operation.
|
||||||
|
|
||||||
For example the following cc `[abc]` match any char that is or `a` or `b` or `c` but doesn't match `C` or `z`.
|
For example, the cc `[abc]` matches any char that is `a` or `b` or `c` but doesn't match `C` or `z`.
|
||||||
|
|
||||||
Inside a cc it is possible to specify a "range" of chars; for example, `[ad-f]` is equivalent to writing `[adef]`.
|
Inside a cc it is possible to specify a "range" of chars; for example, `[ad-f]` is equivalent to writing `[adef]`.
|
||||||
|
|
||||||
A cc can have different ranges in the same like `[a-zA-z0-9]` that match all the lowercase,uppercase and numeric chars.
|
A cc can have different ranges at the same time, like `[a-zA-Z0-9]`, which matches all the lowercase, uppercase and numeric chars.
|
||||||
|
|
||||||
It is possible negate the cc using the caret char at the start of the cc like: `[^abc]` that match every char that is not `a` or `b` or `c`.
|
It is possible to negate the cc using the caret char at the start of the cc, like `[^abc]`, which matches every char that is not `a` or `b` or `c`.
|
||||||
|
|
||||||
A cc can contain meta-chars like: `[a-z\d]` that match all the lowercase latin chars `a-z` and all the digits `\d`.
|
A cc can contain meta-chars like: `[a-z\d]` that matches all the lowercase latin chars `a-z` and all the digits `\d`.
|
||||||
|
|
||||||
It is possible to mix all the properties of the char class together.
|
It is possible to mix all the properties of the char class together.
|
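As a rough illustration of the char class rules above, the sketch below compiles a few classes and searches a sample text. It assumes the module is imported as `regex` and relies only on the `new_regex`, `compile` and `find` signatures listed in the Functions section of this README; the helper `first_match` and the sample strings are hypothetical.

```v
import regex

// first_match is a hypothetical helper: it compiles `query` and returns the
// start and end index reported by `find`, or -1, -1 on a compile error.
fn first_match(query string, txt string) (int, int) {
	mut re := regex.new_regex()
	re_err, err_pos := re.compile(query)
	if re_err != regex.COMPILE_OK {
		println('compile error $re_err at index $err_pos')
		return -1, -1
	}
	start, end := re.find(txt)
	return start, end
}

fn main() {
	txt := 'Zed 42'
	// a range, a negated class and a meta-char inside a class
	for query in [r'[ad-f]', r'[^a-zA-Z]', r'[a-z\d]'] {
		start, end := first_match(query, txt)
		println('$query on "$txt": start=$start end=$end')
	}
}
```

A start index of -1 means that `find` reported no match for that class.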
||||||
|
|
||||||
### Meta-chars
|
### Meta-chars
|
||||||
|
|
||||||
A meta-char is specified by a back slash before a char like `\w` in this case the meta-char is `w`.
|
A meta-char is specified by a backslash before a char, like `\w`; in this case the meta-char is `w`.
|
||||||
|
|
||||||
A meta-char can match different types of chars.
|
A meta-char can match different types of chars.
|
||||||
|
|
||||||
|
@ -79,17 +77,17 @@ Each token can have a quantifier that specify how many times the char can or mus
|
||||||
**Long quantifier**
|
**Long quantifier**
|
||||||
|
|
||||||
- `{x}` matches exactly x times, `a{2}` matches `aa` but doesn't match `aaa` or `a`
|
- `{x}` matches exactly x times, `a{2}` matches `aa` but doesn't match `aaa` or `a`
|
||||||
- `{min,}` match at minimum min time, `a{2,}` match `aaa` or `aa` bit doesn't march `a`
|
- `{min,}` matches at least min times, `a{2,}` matches `aaa` or `aa` but doesn't match `a`
|
||||||
- `{,max}` match at least 1 and maximum max time, `a{,2}` match `a` and `aa` but doesn't match `aaa`
|
- `{,max}` matches at least 1 time and at most max times, `a{,2}` matches `a` and `aa` but doesn't match `aaa`
|
||||||
- `{min,max}` matches from min times to max times, `a{2,3}` matches `aa` and `aaa` but doesn't match `a` or `aaaa`
|
- `{min,max}` matches from min times to max times, `a{2,3}` matches `aa` and `aaa` but doesn't match `a` or `aaaa`
|
||||||
|
|
||||||
a long quantifier may have a `greedy` flag that is the `?` char after the brackets, `{2,4}?` means to match at the minimum possible tokens thus 2.
|
A long quantifier may have a `greedy off` flag, that is the `?` char after the brackets; `{2,4}?` means to match the minimum possible number of tokens, in this case 2.
|
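The long quantifier bounds can be tried out with a small sketch like the following. It assumes the module is imported as `regex` and uses only the `new_regex`, `compile` and `find` signatures listed in the Functions section; the helper `has_match` and the sample inputs are hypothetical.

```v
import regex

// has_match is a hypothetical helper: it returns true when `find`
// reports a start index that is not -1.
fn has_match(query string, txt string) bool {
	mut re := regex.new_regex()
	re_err, err_pos := re.compile(query)
	if re_err != regex.COMPILE_OK {
		println('compile error $re_err at index $err_pos')
		return false
	}
	start, end := re.find(txt)
	return start >= 0 && end >= start
}

fn main() {
	// `a{2,3}`: from 2 to 3 repetitions of the token `a`
	for txt in ['a', 'aa', 'aaa'] {
		found := has_match(r'a{2,3}', txt)
		println('a{2,3} on "$txt": $found')
	}
}
```

Note that `find` searches inside the text, so a longer run such as `aaaa` can still contain a match; the "doesn't match" cases in the list above refer to matching the whole string.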
||||||
|
|
||||||
### dot char
|
### dot char
|
||||||
|
|
||||||
The dot is a particular meta-char that matches "any char"; it is simpler to explain it with an example:
|
The dot is a particular meta-char that matches "any char"; it is simpler to explain it with an example:
|
||||||
|
|
||||||
supposed to have `abccc ddeef` as string to parse with regex, the following table show the query strings and the result of parsing source string.
|
Suppose we have `abccc ddeef` as the source string to parse with the regex; the following table shows the query strings and the result of parsing the source string.
|
||||||
|
|
||||||
| query string | result |
|
| query string | result |
|
||||||
| ------------ | ------ |
|
| ------------ | ------ |
|
||||||
|
@ -102,39 +100,35 @@ the dot char match any char until the next token match is satisfied.
|
||||||
|
|
||||||
### OR token
|
### OR token
|
||||||
|
|
||||||
the token `|` is an logic OR operation between two consecutive tokens, `a|b` match a char that is `a` or `b`.
|
The token `|` is a logical OR operation between two consecutive tokens; `a|b` matches a char that is `a` or `b`.
|
||||||
|
|
||||||
The OR token can work in a "chained way": `a|(b)|cd` first tests `a`; if the char is not `a`, it then tests the group `(b)`, and if the group doesn't match, it tests the token `c`.
|
The OR token can work in a "chained way": `a|(b)|cd` first tests `a`; if the char is not `a`, it then tests the group `(b)`, and if the group doesn't match, it tests the token `c`.
|
||||||
|
|
||||||
**note: The OR works at token level! It doesn't work at concatenation level!**
|
**note: The OR works at token level! It doesn't work at concatenation level!**
|
||||||
|
|
||||||
A query string like `abc|bde` is not equal to `(abc)|(bde)`!!
|
A query string like `abc|bde` is not equal to `(abc)|(bde)`!! The OR works only on `c|b`, not at char concatenation level.
|
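The difference between the two query strings can be checked with a sketch along these lines. It assumes the module is imported as `regex` and uses only the `new_regex`, `compile` and `find` signatures listed in the Functions section; the helper `has_match` and the sample text are hypothetical.

```v
import regex

// has_match is a hypothetical helper: it returns true when `find`
// reports a start index that is not -1.
fn has_match(query string, txt string) bool {
	mut re := regex.new_regex()
	re_err, err_pos := re.compile(query)
	if re_err != regex.COMPILE_OK {
		println('compile error $re_err at index $err_pos')
		return false
	}
	start, end := re.find(txt)
	return start >= 0 && end >= start
}

fn main() {
	txt := 'abc'
	// per the note above, `abc|bde` applies the OR only to `c|b`,
	// while `(abc)|(bde)` applies it to the two groups
	for query in [r'abc|bde', r'(abc)|(bde)'] {
		found := has_match(query, txt)
		println('$query on "$txt": $found')
	}
}
```

If the engine behaves as the note describes, only the grouped form should report a match on `abc`, because the ungrouped form still requires the trailing `de`.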
||||||
|
|
||||||
The OR work only on `c|b` not at char concatenation level.
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
### Groups
|
### Groups
|
||||||
|
|
||||||
Groups are a method to create complex patterns with repetition of blocks of token.
|
Groups are a method to create complex patterns with repetition of blocks of tokens.
|
||||||
|
|
||||||
The groups a delimited by round brackets `( )`, groups can be nested and can have a quantifier as all the tokens.
|
The groups are delimited by round brackets `( )`; groups can be nested and can have a quantifier, like all the tokens.
|
||||||
|
|
||||||
`c(pa)+z` matches `cpapaz` or `cpaz` or `cpapapaz`.
|
`c(pa)+z` matches `cpapaz` or `cpaz` or `cpapapaz`.
|
||||||
|
|
||||||
`(c(pa)+z ?)+` match `cpaz cpapaz cpapapaz` or `cpapaz`
|
`(c(pa)+z ?)+` match `cpaz cpapaz cpapapaz` or `cpapaz`
|
||||||
|
|
||||||
let analyze this last case, first we have the group 0 that are the most outer round brackets `(...)+`, this group has a quantifier that say to match its content at least one time `+`.
|
Let's analyze this last case: first we have the group `#0`, that is the outermost round brackets `(...)+`; this group has a quantifier `+` that says to match its content at least one time.
|
||||||
|
|
||||||
After we have a simple char token `c` and a second group that is the number 1 `(pa)+`, this group try to match the sequence `pa` at least one time as specified by the `+` quantifier.
|
Then we have a simple char token `c` and a second group, number `#1`: `(pa)+`. This group tries to match the sequence `pa` at least one time, as specified by the `+` quantifier.
|
||||||
|
|
||||||
After we have another simple token `z` and another simple token ` ?` that is the space char (ascii code 32) with the `?` quantifier that say to capture this char or 0 or 1 time
|
After that, we have another simple token `z` and another simple token ` ?`, that is the space char (ascii code 32) followed by the `?` quantifier, which says to capture the space char 0 or 1 time.
|
||||||
|
|
||||||
This explains why the `(c(pa)+z ?)+` query string can match `cpaz cpapaz cpapapaz`.
|
This explains why the `(c(pa)+z ?)+` query string can match `cpaz cpapaz cpapapaz`.
|
||||||
|
|
||||||
In this implementation the groups are capturing groups that means that the last result for each group can be retrieved from the `RE` struct.
|
In this implementation the groups are "capture groups"; this means that the last temporal result for each group can be retrieved from the `RE` struct.
|
||||||
|
|
||||||
The captured groups are store as couple of index in the field `groups` that is an `[]int` each captured group
|
The "capture groups" are stored as couples of indexes in the field `groups`, which is an `[]int` inside the `RE` struct.
|
||||||
|
|
||||||
**example:**
|
**example:**
|
||||||
|
|
||||||
|
@ -167,7 +161,7 @@ for gi < re.groups.len {
|
||||||
|
|
||||||
## Flags
|
## Flags
|
||||||
|
|
||||||
It is possible to set some flag in the regex parser that change the behavior of the parser itself.
|
It is possible to set some flags in the regex parser that change the behavior of the parser itself.
|
||||||
|
|
||||||
```v
|
```v
|
||||||
// example of flag settings
|
// example of flag settings
|
||||||
|
@ -178,16 +172,16 @@ re.flag = regex.F_BIN
|
||||||
|
|
||||||
- `F_BIN`: parse a string as bytes, utf-8 management disabled.
|
- `F_BIN`: parse a string as bytes, utf-8 management disabled.
|
||||||
|
|
||||||
- `F_EFM`: exit on the first char match in the query, used by the find function
|
- `F_EFM`: exit on the first char match in the query, used by the find function.
|
||||||
- `F_MS`: match only if the index of the start match is 0, same as `^` at the start of query string
|
- `F_MS`: match only if the index of the start match is 0, same as `^` at the start of the query string.
|
||||||
- `F_ME`: match only if the end index of the match is the last char of the input string, same as `$` end of query string
|
- `F_ME`: match only if the end index of the match is the last char of the input string, same as `$` at the end of the query string.
|
||||||
- `F_NL`: stop the matching if a new line char `\n` or `\r` is found
|
- `F_NL`: stop the matching if a new line char `\n` or `\r` is found
|
||||||
|
|
||||||
## Functions
|
## Functions
|
||||||
|
|
||||||
### Initializer
|
### Initializer
|
||||||
|
|
||||||
These function are helper that create the `RE` struct, the struct can be manually create if you need it
|
These functions are helpers that create the `RE` struct; an `RE` struct can also be created manually if you need it.
|
||||||
|
|
||||||
**Simplified initializer**
|
**Simplified initializer**
|
||||||
|
|
||||||
|
@ -205,7 +199,7 @@ pub fn new_regex() RE
|
||||||
// new_regex_by_size create a REgex of large size, mult specify the scale factor of the memory that will be allocated
|
// new_regex_by_size create a REgex of large size, mult specify the scale factor of the memory that will be allocated
|
||||||
pub fn new_regex_by_size(mult int) RE
|
pub fn new_regex_by_size(mult int) RE
|
||||||
```
|
```
|
||||||
After the base initializer use the regex expression must be compiled with:
|
After using the base initializer, the regex expression must be compiled with:
|
||||||
```v
|
```v
|
||||||
// compile return (return code, index) where index is the index of the error in the query string if return code is an error code
|
// compile return (return code, index) where index is the index of the error in the query string if return code is an error code
|
||||||
pub fn (re mut RE) compile(in_txt string) (int,int)
|
pub fn (re mut RE) compile(in_txt string) (int,int)
|
||||||
|
@ -222,10 +216,10 @@ pub fn (re mut RE) match_string(in_txt string) (int,int)
|
||||||
// find try to find the first match in the input string, return start and end index if found else start is -1
|
// find try to find the first match in the input string, return start and end index if found else start is -1
|
||||||
pub fn (re mut RE) find(in_txt string) (int,int)
|
pub fn (re mut RE) find(in_txt string) (int,int)
|
||||||
|
|
||||||
// find all the non overlapping occurrences of the match pattern, return a list of start end indexes
|
// find_all find all the "non overlapping" occurrences of the matching pattern, return a list of start end indexes
|
||||||
pub fn (re mut RE) find_all(in_txt string) []int
|
pub fn (re mut RE) find_all(in_txt string) []int
|
||||||
|
|
||||||
// replace return a string where the matches are replaced with the replace string, only non overlapped match are used
|
// replace return a string where the matches are replaced with the replace string, only non overlapped matches are used
|
||||||
pub fn (re mut RE) replace(in_txt string, repl string) string
|
pub fn (re mut RE) replace(in_txt string, repl string) string
|
||||||
```
|
```
|
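A minimal sketch of how the functions above could be combined is shown below. It assumes the module is imported as `regex`, uses only the signatures listed in this section, and assumes that the returned end index is exclusive so that it can be used directly in a slice; the helper `all_matches` and the sample text are hypothetical.

```v
import regex

// all_matches is a hypothetical helper: it reads the flat list returned by
// `find_all` as start, end pairs and collects the matched substrings.
fn all_matches(query string, txt string) []string {
	mut res := []string
	mut re := regex.new_regex()
	re_err, err_pos := re.compile(query)
	if re_err != regex.COMPILE_OK {
		println('compile error $re_err at index $err_pos')
		return res
	}
	idx := re.find_all(txt)
	for i := 0; i + 1 < idx.len; i += 2 {
		// assumption: idx[i + 1] is the exclusive end index of the match
		res << txt[idx[i]..idx[i + 1]]
	}
	return res
}

fn main() {
	txt := 'cpaz cpapaz cpapapaz'
	println(all_matches(r'c(pa)+z', txt))

	mut re := regex.new_regex()
	re_err, err_pos := re.compile(r'c(pa)+z')
	if re_err != regex.COMPILE_OK {
		println('compile error $re_err at index $err_pos')
		return
	}
	// replace every non overlapping match with the replace string
	println(re.replace(txt, 'X'))
}
```

Checking the return code of `compile` before calling the other functions keeps a malformed query string from being used for matching.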
||||||
|
|
||||||
|
@ -235,10 +229,10 @@ This module has few small utilities to help the writing of regex expressions.
|
||||||
|
|
||||||
**Syntax errors highlight**
|
**Syntax errors highlight**
|
||||||
|
|
||||||
the following example code show how to visualize the syntax errors in the compiling pahse:
|
The following example code shows how to visualize the syntax errors in the compilation phase:
|
||||||
|
|
||||||
```v
|
```v
|
||||||
query:= r"ciao da ab[ab-]" // there is an error, a range not closed
|
query:= r"ciao da ab[ab-]" // there is an error, a range not closed!!
|
||||||
mut re := new_regex()
|
mut re := new_regex()
|
||||||
|
|
||||||
// re_err ==> is the return value, if < 0 it is an error
|
// re_err ==> is the return value, if < 0 it is an error
|
||||||
|
@ -264,7 +258,7 @@ if re_err != COMPILE_OK {
|
||||||
|
|
||||||
**Compiled code**
|
**Compiled code**
|
||||||
|
|
||||||
It is possible view the compiled code calling the function `get_query()` the result will something like this:
|
It is possible to view the compiled code by calling the function `get_query()`; the result will be something like this:
|
||||||
|
|
||||||
```
|
```
|
||||||
========================================
|
========================================
|
||||||
|
@ -275,15 +269,15 @@ PC: 2 ist: 88000000 PROG_END { 0, 0}
|
||||||
========================================
|
========================================
|
||||||
```
|
```
|
||||||
|
|
||||||
`PC`:`int` is the program counter or step of execution, each single step is a token
|
`PC`:`int` is the program counter or step of execution, each single step is a token.
|
||||||
|
|
||||||
`ist`:`hex` is the token instruction id
|
`ist`:`hex` is the token instruction id.
|
||||||
|
|
||||||
`[a]` is the char used by the token
|
`[a]` is the char used by the token.
|
||||||
|
|
||||||
`query_ch` is the type of token
|
`query_ch` is the type of token.
|
||||||
|
|
||||||
`{m,n}` are the quantifier, the greedy flag `?` will be showed if present in the token
|
`{m,n}` is the quantifier; the greedy off flag `?` will be shown if present in the token.
|
||||||
|
|
||||||
**Log debug**
|
**Log debug**
|
||||||
|
|
||||||
|
@ -295,7 +289,7 @@ here an example:
|
||||||
|
|
||||||
*normal*
|
*normal*
|
||||||
|
|
||||||
list only the token instruction with the values
|
lists only the token instructions with their values
|
||||||
|
|
||||||
```
|
```
|
||||||
// re.flag = 1 // log level normal
|
// re.flag = 1 // log level normal
|
||||||
|
@ -308,7 +302,7 @@ flags: 00000000
|
||||||
|
|
||||||
*verbose*
|
*verbose*
|
||||||
|
|
||||||
list all the instruction and states of the parser
|
lists all the instructions and states of the parser
|
||||||
|
|
||||||
```
|
```
|
||||||
flags: 00000000
|
flags: 00000000
|
||||||
|
@ -326,7 +320,7 @@ flags: 00000000
|
||||||
# 11 PROG_END
|
# 11 PROG_END
|
||||||
```
|
```
|
||||||
|
|
||||||
the column have the following meaning:
|
the columns have the following meaning:
|
||||||
|
|
||||||
`# 2` number of actual steps from the start of parsing
|
`# 2` number of actual steps from the start of parsing
|
||||||
|
|
||||||
|
@ -342,7 +336,7 @@ the column have the following meaning:
|
||||||
|
|
||||||
`query_ch: [b]` token in use and its char
|
`query_ch: [b]` token in use and its char
|
||||||
|
|
||||||
`{2,3}:1?` quantifier `{min,max}`, `:1` is the actual counter of repetition, `?` is the greedy flag if present
|
`{2,3}:1?` quantifier `{min,max}`, `:1` is the actual counter of repetition, `?` is the greedy off flag if present
|
||||||
|
|
||||||
## Example code
|
## Example code
|
||||||
|
|
||||||
|
|