regex: reformat README.md to use shorter lines
parent
664f220f40
commit
4c22370635
|
@ -2,172 +2,193 @@
|
||||||
|
|
||||||
[TOC]
|
[TOC]
|
||||||
|
|
||||||
## Introduction, differences with PCRE
|
## Introduction
|
||||||
|
|
||||||
|
Here are the assumptions made during the writing of the implementation, that
|
||||||
|
are valid for all the `regex` module features:
|
||||||
|
|
||||||
|
1. The matching stops at the end of the string, *not* at newline characters.
|
||||||
|
|
||||||
|
2. The basic atomic elements of this regex engine are the tokens.
|
||||||
|
In a query string a simple character is a token.
|
||||||
|
|
||||||
|
|
||||||
|
## Differences with PCRE:
|
||||||
|
|
||||||
|
NB: We must point out that the **V-Regex module is not PCRE compliant** and thus
|
||||||
|
some behaviour will be different. This difference is due to the V philosophy,
|
||||||
|
to have one way and keep it simple.
|
||||||
|
|
||||||
The first thing we must point out is that the **V-Regex module is not PCRE compliant** and
|
|
||||||
thus some behaviour will be different.
|
|
||||||
This module is born upon the V philosophy to have one way and keep it simple.
|
|
||||||
The main differences can be summarized in the following points:
|
The main differences can be summarized in the following points:
|
||||||
|
|
||||||
- The basic element **is the token not the sequence of symbols**, the most simple token
|
- The basic element **is the token not the sequence of symbols**, and the most
|
||||||
is simple char.
|
simple token is a single character.
|
||||||
|
|
||||||
- `|` **OR operator act on token,** for example `abc|ebc` is not `abc` OR `ebc` it
|
- `|` **the OR operator acts on tokens,** for example `abc|ebc` is not
|
||||||
is evaluated like `ab` followed by `c OR e` followed by`bc`, this because the **token is
|
`abc` OR `ebc`. Instead it is evaluated like `ab`, followed by `c OR e`,
|
||||||
the base element** not the sequence of symbols.
|
followed by `bc`, because the **token is the base element**,
|
||||||
- The **match operation stop at the end of the string** not at the new line chars.
|
not the sequence of symbols.
|
||||||
|
|
||||||
Further information can be found in the other part of this document.
|
- The **match operation stops at the end of the string**. It does *NOT* stop
|
||||||
|
at new line characters.
|
||||||
|
|
||||||
## Basic assumption
|
|
||||||
|
|
||||||
In this release, during the writing of the code some assumptions are made
|
|
||||||
and are valid for all the features.
|
|
||||||
|
|
||||||
1. The matching stops at the end of the string not at the newline chars.
|
|
||||||
2. The basic elements of this regex engine are the tokens,
|
|
||||||
in a query string a simple char is a token. The token is the atomic unit of this regex engine.
|
|
||||||
|
|
||||||
## Match positional limiter
|
|
||||||
|
|
||||||
The module supports the following features:
|
|
||||||
|
|
||||||
- `$` `^` delimiter
|
|
||||||
|
|
||||||
`^` (Caret.) Matches at the start of the string
|
|
||||||
|
|
||||||
`$` Matches at the end of the string
|
|
||||||
|
|
||||||
## Tokens
|
## Tokens
|
||||||
|
|
||||||
The tokens are the atomic units used by this regex engine and can be ones of the following:
|
The tokens are the atomic units used by this regex engine.
|
||||||
|
They can be one of the following:
|
||||||
|
|
||||||
|
|
||||||
### Simple char
|
### Simple char
|
||||||
|
|
||||||
this token is a simple single character like `a`.
|
This token is a simple single character like `a` or `b` etc.
|
||||||
|
|
||||||
|
|
||||||
|
### Match positional delimiters
|
||||||
|
|
||||||
|
`^` Matches the start of the string.
|
||||||
|
|
||||||
|
`$` Matches the end of the string.
|
||||||
|
|
||||||
|
|
||||||
### Char class (cc)
|
### Char class (cc)
|
||||||
|
|
||||||
The cc matches all the chars specified inside, it is delimited by square brackets `[ ]`
|
The character classes match all the chars specified inside. Use square
|
||||||
|
brackets `[ ]` to enclose them.
|
||||||
|
|
||||||
the sequence of chars in the class is evaluated with an OR operation.
|
The sequence of chars in the character class is evaluated with an OR operation.
|
||||||
|
|
||||||
For example, the following cc `[abc]` matches any char that is `a` or `b` or `c`
|
For example, the cc `[abc]` matches any character that is `a` or `b` or `c`,
|
||||||
but doesn't match `C` or `z`.
|
but it doesn't match `C` or `z`.
|
||||||
|
|
||||||
Inside a cc is possible to specify a "range" of chars,
|
Inside a cc, it is possible to specify a "range" of characters, for example
|
||||||
for example `[ad-f]` is equivalent to write `[adef]`.
|
`[ad-h]` is equivalent to writing `[adefgh]`.
|
||||||
|
|
||||||
A cc can have different ranges at the same time like `[a-zA-z0-9]` that matches all the lowercase,
|
A cc can have different ranges at the same time, for example `[a-zA-Z0-9]`
|
||||||
uppercase and numeric chars.
|
matches all the latin lowercase, uppercase and numeric characters.
|
||||||
|
|
||||||
It is possible negate the cc using the caret char at the start of the cc like: `[^abc]`
|
It is possible to negate the meaning of a cc, using the caret char at the
|
||||||
that matches every char that is not `a` or `b` or `c`.
|
start of the cc like this: `[^abc]`. That matches every char that is NOT
|
||||||
|
`a` or `b` or `c`.
|
||||||
|
|
||||||
A cc can contain meta-chars like: `[a-z\d]` that matches all the lowercase latin chars `a-z`
|
A cc can contain meta-chars like: `[a-z\d]`, that matches all the lowercase
|
||||||
and all the digits `\d`.
|
latin chars `a-z` and all the digits `\d`.
|
||||||
|
|
||||||
It is possible to mix all the properties of the char class together.
|
It is possible to mix all the properties of the char class together.
|
||||||
|
|
||||||
**Note:** In order to match the `-` (minus) char, it must be preceded by a backslash
|
NB: In order to match the `-` (minus) char, it must be preceded by
|
||||||
in the cc, for example `[\-_\d\a]` will match `-` minus, `_`underscore, `\d` numeric chars,
|
a backslash in the cc, for example `[\-_\d\a]` will match:
|
||||||
`\a` lower case chars.
|
`-` minus,
|
||||||
|
`_` underscore,
|
||||||
|
`\d` numeric chars,
|
||||||
|
`\a` lower case chars.
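
As a minimal sketch of the char class rules above, the snippet below compiles the `[\-_\d\a]` class with a `+` quantifier and runs it against two sample strings. It only uses `regex.regex_opt` and `match_string`, which appear elsewhere in this README. The sample strings are invented for illustration, and treating a negative `start` as "no match" is an assumption of this sketch.

```v
import regex

fn main() {
	// char class from the NB above: minus, underscore, digits, lowercase chars
	query := r'[\-_\d\a]+'
	mut re := regex.regex_opt(query) or { panic(err) }
	// sample inputs, invented for this sketch
	for text in ['a-b_c9', 'ABC'] {
		start, end := re.match_string(text)
		if start >= 0 {
			println('${text}: matched [${text[start..end]}]')
		} else {
			println('${text}: no match')
		}
	}
}
```

If the class behaves as described, the first string should match in full, while the second one, made only of uppercase chars, should not match.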
|
||||||
|
|
||||||
### Meta-chars
|
### Meta-chars
|
||||||
|
|
||||||
A meta-char is specified by a backslash before a char like `\w` in this case the meta-char is `w`.
|
A meta-char is specified by a backslash before a character.
|
||||||
|
For example `\w` is the meta-char `w`.
|
||||||
|
|
||||||
A meta-char can match different type of chars.
|
A meta-char can match different types of characters.
|
||||||
|
|
||||||
* `\w` matches an alphanumeric char `[a-zA-Z0-9_]`
|
* `\w` matches an alphanumeric char `[a-zA-Z0-9_]`
|
||||||
* `\W` matches a non alphanumeric char
|
* `\W` matches a non alphanumeric char
|
||||||
* `\d` matches a digit `[0-9]`
|
* `\d` matches a digit `[0-9]`
|
||||||
* `\D` matches a non digit
|
* `\D` matches a non digit
|
||||||
* `\s`matches a space char, one of `[' ','\t','\n','\r','\v','\f']`
|
* `\s` matches a space char, one of `[' ','\t','\n','\r','\v','\f']`
|
||||||
* `\S` matches a non space char
|
* `\S` matches a non space char
|
||||||
* `\a` matches only a lowercase char `[a-z]`
|
* `\a` matches only a lowercase char `[a-z]`
|
||||||
* `\A` matches only an uppercase char `[A-Z]`
|
* `\A` matches only an uppercase char `[A-Z]`
|
||||||
|
|
||||||
### Quantifier
|
### Quantifier
|
||||||
|
|
||||||
Each token can have a quantifier that specify how many times the char can or must be matched.
|
Each token can have a quantifier, that specifies how many times the character
|
||||||
|
must be matched.
|
||||||
|
|
||||||
#### **Short quantifier**
|
#### **Short quantifiers**
|
||||||
|
|
||||||
- `?` matches 0 or 1 time, `a?b` matches both `ab` or `b`
|
- `?` matches 0 or 1 time, `a?b` matches both `ab` or `b`
|
||||||
- `+` matches at minimum 1 time, `a+` matches both `aaa` or `a`
|
- `+` matches *at least* 1 time, for example, `a+` matches both `aaa` or `a`
|
||||||
- `*` matches 0 or more time, `a*b` matches both `aaab` or `ab` or `b`
|
- `*` matches 0 or more times, for example, `a*b` matches `aaab`, `ab` or `b`
|
||||||
|
|
||||||
#### **Long quantifier**
|
#### **Long quantifiers**
|
||||||
|
|
||||||
- `{x}` matches exactly x time, `a{2}` matches `aa` but doesn't match `aaa` or `a`
|
- `{x}` matches exactly x times, `a{2}` matches `aa`, but not `aaa` or `a`
|
||||||
- `{min,}` matches at minimum min time, `a{2,}` matches `aaa` or `aa` but doesn't match `a`
|
- `{min,}` matches at least min times, `a{2,}` matches `aaa` or `aa`, not `a`
|
||||||
- `{,max}` matches at least 0 time and maximum max time,
|
- `{,max}` matches at least 0 times and at most max times,
|
||||||
`a{,2}` matches `a` and `aa` but doesn't match `aaa`
|
for example, `a{,2}` matches `a` and `aa`, but doesn't match `aaa`
|
||||||
- `{min,max}` matches from min times to max times,
|
- `{min,max}` matches from min times, to max times, for example
|
||||||
`a{2,3}` matches `aa` and `aaa` but doesn't match `a` or `aaaa`
|
`a{2,3}` matches `aa` and `aaa`, but doesn't match `a` or `aaaa`
|
||||||
|
|
||||||
a long quantifier may have a `greedy off` flag that is the `?` char after the brackets,
|
A long quantifier may have a `greedy off` flag, which is the `?`
|
||||||
`{2,4}?` means to match the minimum number possible tokens in this case 2.
|
character after the brackets. `{2,4}?` means to match the minimum
|
||||||
|
number of possible tokens, in this case 2.
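
The long quantifier rules can be tried out with a short program. This is only a sketch: it anchors `a{2,3}` with `^` and `$` (described in the positional delimiters section) so that the whole input has to consist of two or three `a` chars, and it prints whatever `match_string` returns instead of assuming a result. The input strings are invented for illustration.

```v
import regex

fn main() {
	// `a{2,3}` anchored at both ends: the whole string must be 2 or 3 `a` chars
	query := r'^a{2,3}$'
	mut re := regex.regex_opt(query) or { panic(err) }
	// sample inputs, invented for this sketch
	for text in ['a', 'aa', 'aaa', 'aaaa'] {
		start, end := re.match_string(text)
		println('${text}: start=${start}, end=${end}')
	}
}
```

According to the list above, only `aa` and `aaa` should be reported as matches.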
|
||||||
|
|
||||||
### dot char
|
### Dot char
|
||||||
|
|
||||||
the dot is a particular meta char that matches "any char",
|
The dot is a particular meta-char, that matches "any char".
|
||||||
is more simple explain it with an example:
|
|
||||||
|
|
||||||
suppose to have `abccc ddeef` as source string to parse with regex,
|
It is simpler to explain it with an example:
|
||||||
the following table show the query strings and the result of parsing source string.
|
|
||||||
|
|
||||||
| query string | result |
|
Suppose you have `abccc ddeef` as a source string, that you want to parse
|
||||||
| ------------ | ------ |
|
with a regex. The following table shows the query strings and the result of
|
||||||
| `.*c` | `abc` |
|
parsing the source string.
|
||||||
| `.*dd` | `abcc dd` |
|
|
||||||
| `ab.*e` | `abccc dde` |
|
+--------------+-------------+
|
||||||
|
| query string | result |
|
||||||
|
|--------------|-------------|
|
||||||
|
| `.*c` | `abc` |
|
||||||
|
| `.*dd` | `abcc dd` |
|
||||||
|
| `ab.*e` | `abccc dde` |
|
||||||
| `ab.{3} .*e` | `abccc dde` |
|
| `ab.{3} .*e` | `abccc dde` |
|
||||||
|
+--------------+-------------+
|
||||||
|
|
||||||
the dot char matches any char until the next token match is satisfied.
|
The dot matches any character, until the next token match is satisfied.
|
||||||
|
|
||||||
### OR token
|
### OR token
|
||||||
|
|
||||||
the token `|` is a logic OR operation between two consecutive tokens,
|
The token `|` means a logical OR operation between two consecutive tokens,
|
||||||
`a|b` matches a char that is `a` or `b`.
|
i.e. `a|b` matches a character that is `a` or `b`.
|
||||||
|
|
||||||
The OR token can work in a "chained way": `a|(b)|cd ` test first `a` if the char is not `a`
|
The OR token can work in a "chained way": `a|(b)|cd ` means test first `a`,
|
||||||
then test the group `(b)` and if the group doesn't match test the token `c`.
|
if the char is not `a`, then test the group `(b)`, and if the group doesn't
|
||||||
|
match either, finally test the token `c`.
|
||||||
|
|
||||||
**note: The OR work at token level! It doesn't work at concatenation level!**
|
NB: **unlike in PCRE, the OR operation works at token level!**
|
||||||
|
It doesn't work at concatenation level!
|
||||||
|
|
||||||
A query string like `abc|bde` is not equal to `(abc)|(bde)`!!
|
That also means, that a query string like `abc|bde` is not equal to
|
||||||
The OR work only on `c|b` not at char concatenation level.
|
`(abc)|(bde)`, but instead to `ab(c|b)de`.
|
||||||
|
The OR operation works only for `c|b`, not at char concatenation level.
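
A small experiment makes the token-level OR easier to see. The sketch below compiles `abc|ebc`, the query used in the "Differences with PCRE" section, and prints the bounds returned by `match_string` for a few inputs. The input strings are invented for illustration and no output is assumed here.

```v
import regex

fn main() {
	// read at token level, this is `ab`, then `c` OR `e`, then `bc`
	query := r'abc|ebc'
	mut re := regex.regex_opt(query) or { panic(err) }
	// sample inputs, invented for this sketch
	for text in ['abcbc', 'abebc', 'abc', 'ebc'] {
		start, end := re.match_string(text)
		println('${text}: start=${start}, end=${end}')
	}
}
```

If the rule above is read literally, `abcbc` and `abebc` should match, while `abc` and `ebc`, which a PCRE engine would accept, should not.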
|
||||||
|
|
||||||
### Groups
|
### Groups
|
||||||
|
|
||||||
Groups are a method to create complex patterns with repetition of blocks of tokens.
|
Groups are a method to create complex patterns with repetitions of blocks
|
||||||
|
of tokens. The groups are delimited by round brackets `( )`. Groups can be
|
||||||
The groups are delimited by round brackets `( )`,
|
nested. Like all other tokens, groups can have a quantifier too.
|
||||||
groups can be nested and can have a quantifier as all the tokens.
|
|
||||||
|
|
||||||
`c(pa)+z` matches `cpapaz` or `cpaz` or `cpapapaz`.
|
`c(pa)+z` matches `cpapaz` or `cpaz` or `cpapapaz`.
|
||||||
|
|
||||||
`(c(pa)+z ?)+` matches `cpaz cpapaz cpapapaz` or `cpapaz`
|
`(c(pa)+z ?)+` matches `cpaz cpapaz cpapapaz` or `cpapaz`
|
||||||
|
|
||||||
let analyze this last case, first we have the group `#0`
|
Let's analyze this last case. First we have the group `#0`, that is the most
|
||||||
that are the most outer round brackets `(...)+`,
|
outer round brackets `(...)+`. This group has a quantifier `+`, that says to
|
||||||
this group has a quantifier that say to match its content at least one time `+`.
|
match its content *at least one time*.
|
||||||
|
|
||||||
After we have a simple char token `c` and a second group that is the number `#1` :`(pa)+`,
|
Then we have a simple char token `c`, and a second group `#1`: `(pa)+`.
|
||||||
this group try to match the sequence `pa` at least one time as specified by the `+` quantifier.
|
This group also tries to match the sequence `pa`, *at least one time*,
|
||||||
|
as specified by the `+` quantifier.
|
||||||
|
|
||||||
After, we have another simple token `z` and another simple token ` ?`
|
Then, we have another simple token `z` and another simple token ` ?`,
|
||||||
that is the space char (ascii code 32) followed by the `?` quantifier
|
i.e. the space char (ascii code 32) followed by the `?` quantifier,
|
||||||
that say to capture the space char 0 or 1 time.
|
which means that the preceding space should be matched 0 or 1 time.
|
||||||
|
|
||||||
This explain because the `(c(pa)+z ?)+` query string can match `cpaz cpapaz cpapapaz` .
|
This explains why the `(c(pa)+z ?)+` query string,
|
||||||
|
can match `cpaz cpapaz cpapapaz` .
|
||||||
|
|
||||||
In this implementation the groups are "capture groups",
|
In this implementation the groups are "capture groups". This means that the
|
||||||
it means that the last temporal result for each group can be retrieved from the `RE` struct.
|
last temporal result for each group can be retrieved from the `RE` struct.
|
||||||
|
|
||||||
The "capture groups" are store as couple of index in the field `groups`
|
The "capture groups" are stored as pairs of indexes in the field `groups`,
|
||||||
that is an `[]int` inside the `RE` struct.
|
that is an `[]int` inside the `RE` struct.
|
||||||
|
|
||||||
**example:**
|
**example:**
|
||||||
|
@ -177,7 +198,8 @@ text := 'cpaz cpapaz cpapapaz'
|
||||||
query := r'(c(pa)+z ?)+'
|
query := r'(c(pa)+z ?)+'
|
||||||
mut re := regex.regex_opt(query) or { panic(err) }
|
mut re := regex.regex_opt(query) or { panic(err) }
|
||||||
println(re.get_query())
|
println(re.get_query())
|
||||||
// #0(c#1(pa)+z ?)+ // #0 and #1 are the ids of the groups, are shown if re.debug is 1 or 2
|
// #0(c#1(pa)+z ?)+
|
||||||
|
// #0 and #1 are the ids of the groups; they are shown if re.debug is 1 or 2
|
||||||
start, end := re.match_string(text)
|
start, end := re.match_string(text)
|
||||||
// [start=0, end=20] match => [cpaz cpapaz cpapapaz]
|
// [start=0, end=20] match => [cpaz cpapaz cpapapaz]
|
||||||
mut gi := 0
|
mut gi := 0
|
||||||
|
@ -195,7 +217,7 @@ for gi < re.groups.len {
|
||||||
**note:** *to show the `group id number` in the result of the `get_query()`*
|
**note:** *to show the `group id number` in the result of the `get_query()`*
|
||||||
*the flag `debug` of the RE object must be `1` or `2`*
|
*the flag `debug` of the RE object must be `1` or `2`*
|
||||||
|
|
||||||
In order to simplify the use of the captured groups it possible to use the
|
In order to simplify the use of the captured groups, it is possible to use the
|
||||||
utility function: `get_group_list`.
|
utility function: `get_group_list`.
|
||||||
|
|
||||||
This function returns a list of groups using this support struct:
|
This function returns a list of groups using this support struct:
|
||||||
|
@ -212,9 +234,9 @@ Here an example of use:
|
||||||
|
|
||||||
```v oksyntax
|
```v oksyntax
|
||||||
/*
|
/*
|
||||||
This simple function convert an HTML RGB value with 3 or 6 hex digits to an u32 value,
|
This simple function converts an HTML RGB value with 3 or 6 hex digits to
|
||||||
this function is not optimized and it si only for didatical purpose
|
a u32 value. This function is not optimized and is only for didactic
|
||||||
example: #A0B0CC #A9F
|
purposes. Example: #A0B0CC #A9F
|
||||||
*/
|
*/
|
||||||
fn convert_html_rgb(in_col string) u32 {
|
fn convert_html_rgb(in_col string) u32 {
|
||||||
mut n_digit := if in_col.len == 4 { 1 } else { 2 }
|
mut n_digit := if in_col.len == 4 { 1 } else { 2 }
|
||||||
|
@ -250,29 +272,29 @@ for g_index := 0; g_index < re.group_count ; g_index++ {
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
more helper functions are listed in the **Groups query functions** section.
|
More helper functions are listed in the **Groups query functions** section.
|
||||||
|
|
||||||
### Groups Continuous saving
|
### Groups Continuous saving
|
||||||
|
|
||||||
In particular situations it is useful have a continuous save of the groups,
|
In particular situations, it is useful to have a continuous group saving.
|
||||||
this is possible initializing the saving array field in `RE` struct: `group_csave`.
|
This is possible by initializing the `group_csave` field in the `RE` struct.
|
||||||
|
|
||||||
This feature allow to collect data in a continuous way.
|
This feature allows you to collect data in a continuous/streaming way.
|
||||||
|
|
||||||
In the example we pass a text followed by a integer list that we want collect.
|
In the example, we can pass a text, followed by an integer list,
|
||||||
To achieve this task we can use the continuous saving of the group
|
that we wish to collect. To achieve this task, we can use the continuous
|
||||||
enabling the right flag: `re.group_csave_flag = true`.
|
group saving, by enabling the right flag: `re.group_csave_flag = true`.
|
||||||
|
|
||||||
The array will be filled with the following logic:
|
The `group_csave` array will then be filled according to this logic:
|
||||||
|
|
||||||
`re.group_csave[0]` number of total saved records
|
`re.group_csave[0]` - number of total saved records
|
||||||
|
`re.group_csave[1+n*3]` - id of the saved group
|
||||||
|
`re.group_csave[2+n*3]` - start index in the source string of the saved group
|
||||||
|
`re.group_csave[3+n*3]` - end index in the source string of the saved group
|
||||||
|
|
||||||
`re.group_csave[1+n*3]` id of the saved group
|
The regex will save groups, until it finishes, or finds that the array has no
|
||||||
`re.group_csave[1+n*3]` start index in the source string of the saved group
|
more space. If the space ends, no error is raised, and further records will
|
||||||
`re.group_csave[1+n*3]` end index in the source string of the saved group
|
not be saved.
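
The flat layout described above can be unpacked with a few lines of plain V. The following is a sketch, not part of the `regex` module: `SavedGroup` and `decode_csave` are hypothetical names, the input array is hand-written sample data, and the code assumes each saved record occupies three consecutive integers after the count (group id, start index, end index).

```v
// SavedGroup is a hypothetical helper type, it is not part of the regex module
struct SavedGroup {
	id    int
	start int
	end   int
}

// decode_csave unpacks the flat layout: [count, id, start, end, id, start, end, ...]
fn decode_csave(csave []int) []SavedGroup {
	mut res := []SavedGroup{}
	if csave.len == 0 {
		return res
	}
	total := csave[0]
	for n in 0 .. total {
		base := 1 + n * 3
		// stop if the array is shorter than the record count claims
		if base + 2 >= csave.len {
			break
		}
		res << SavedGroup{
			id: csave[base]
			start: csave[base + 1]
			end: csave[base + 2]
		}
	}
	return res
}

fn main() {
	// hand-written sample data: 2 records, group 0 at (0, 4) and group 1 at (42, 46)
	csave := [2, 0, 0, 4, 1, 42, 46]
	for g in decode_csave(csave) {
		println('cg[${g.id}] ${g.start} ${g.end}')
	}
}
```

With a real `RE` value, the same helper could be fed `re.group_csave` after a successful match.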
|
||||||
|
|
||||||
The regex save until finish or found that the array have no space.
|
|
||||||
If the space ends no error is raised, further records will not be saved.
|
|
||||||
|
|
||||||
```v ignore
|
```v ignore
|
||||||
import regex
|
import regex
|
||||||
|
@ -327,19 +349,18 @@ cg[1] 42 46:[html]
|
||||||
|
|
||||||
### Named capturing groups
|
### Named capturing groups
|
||||||
|
|
||||||
This regex module support partially the question mark `?` PCRE syntax for groups.
|
This regex module partially supports the question mark `?` PCRE syntax for groups.
|
||||||
|
|
||||||
`(?:abcd)` **non capturing group**: the content of the group will not be saved
|
`(?:abcd)` **non capturing group**: the content of the group will not be saved.
|
||||||
|
|
||||||
`(?P<mygroup>abcdef)` **named group:** the group content is saved and labeled as `mygroup`
|
`(?P<mygroup>abcdef)` **named group:** the group content is saved and labeled
|
||||||
|
as `mygroup`.
|
||||||
|
|
||||||
The label of the groups is saved in the `group_map` of the `RE` struct,
|
The label of the groups is saved in the `group_map` of the `RE` struct,
|
||||||
this is a map from `string` to `int` where the value is the index in `group_csave` list of index.
|
that is a map from `string` to `int`, where the value is the index in
|
||||||
|
`group_csave` list of indexes.
|
||||||
Have a look at the example for the use of them.
|
|
||||||
|
|
||||||
example:
|
|
||||||
|
|
||||||
|
Here is an example of how to use them:
|
||||||
```v ignore
|
```v ignore
|
||||||
import regex
|
import regex
|
||||||
fn main(){
|
fn main(){
|
||||||
|
@ -376,17 +397,17 @@ group:'format' => [http] bounds: (0, 4)
|
||||||
group:'token' => [html] bounds: (42, 46)
|
group:'token' => [html] bounds: (42, 46)
|
||||||
```
|
```
|
||||||
|
|
||||||
In order to simplify the use of the named groups it possible to use names map in the `re`
|
In order to simplify the use of the named groups, it is possible to
|
||||||
struct using the function `re.get_group_by_name`.
|
use a name map in the `re` struct, using the function `re.get_group_by_name`.
|
||||||
|
|
||||||
Here a more complex example of use:
|
|
||||||
|
|
||||||
|
Here is a more complex example of using them:
|
||||||
```v oksyntax
|
```v oksyntax
|
||||||
// This function demonstrates the use of the named groups
|
// This function demonstrates the use of the named groups
|
||||||
fn convert_html_rgb_n(in_col string) u32 {
|
fn convert_html_rgb_n(in_col string) u32 {
|
||||||
mut n_digit := if in_col.len == 4 { 1 } else { 2 }
|
mut n_digit := if in_col.len == 4 { 1 } else { 2 }
|
||||||
mut col_mul := if in_col.len == 4 { 4 } else { 0 }
|
mut col_mul := if in_col.len == 4 { 4 } else { 0 }
|
||||||
query := '#(?P<red>[a-fA-F0-9]{$n_digit})(?P<green>[a-fA-F0-9]{$n_digit})(?P<blue>[a-fA-F0-9]{$n_digit})'
|
query := '#(?P<red>[a-fA-F0-9]{$n_digit})' + '(?P<green>[a-fA-F0-9]{$n_digit})' +
|
||||||
|
'(?P<blue>[a-fA-F0-9]{$n_digit})'
|
||||||
mut re := regex.regex_opt(query) or { panic(err) }
|
mut re := regex.regex_opt(query) or { panic(err) }
|
||||||
start, end := re.match_string(in_col)
|
start, end := re.match_string(in_col)
|
||||||
println('start: $start, end: $end')
|
println('start: $start, end: $end')
|
||||||
|
@ -405,8 +426,8 @@ fn convert_html_rgb_n(in_col string) u32 {
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
Others utility functions are `get_group_by_name` and `get_group_bounds_by_name`
|
Other utilities are `get_group_by_name` and `get_group_bounds_by_name`,
|
||||||
that get directly the string of a group using its `name`:
|
that return the string of a group using its `name`:
|
||||||
|
|
||||||
```v ignore
|
```v ignore
|
||||||
txt := "my used string...."
|
txt := "my used string...."
|
||||||
|
@ -447,7 +468,8 @@ pub fn (re RE) get_group_list() []Re_group
|
||||||
|
|
||||||
## Flags
|
## Flags
|
||||||
|
|
||||||
It is possible to set some flags in the regex parser that change the behavior of the parser itself.
|
It is possible to set some flags in the regex parser, that change
|
||||||
|
the behavior of the parser itself.
|
||||||
|
|
||||||
```v ignore
|
```v ignore
|
||||||
// example of flag settings
|
// example of flag settings
|
||||||
|
@ -457,12 +479,16 @@ re.flag = regex.F_BIN
|
||||||
|
|
||||||
- `F_BIN`: parse a string as bytes, utf-8 management disabled.
|
- `F_BIN`: parse a string as bytes, utf-8 management disabled.
|
||||||
|
|
||||||
- `F_EFM`: exit on the first char matches in the query, used by the find function.
|
- `F_EFM`: exit on the first char matches in the query, used by the
|
||||||
- `F_MS`: matches only if the index of the start match is 0,
|
find function.
|
||||||
same as `^` at the start of the query string.
|
|
||||||
- `F_ME`: matches only if the end index of the match is the last char of the input string,
|
- `F_MS`: matches only if the index of the start match is 0,
|
||||||
same as `$` end of query string.
|
same as `^` at the start of the query string.
|
||||||
- `F_NL`: stop the matching if found a new line char `\n` or `\r`
|
|
||||||
|
- `F_ME`: matches only if the end index of the match is the last char
|
||||||
|
of the input string, same as `$` end of query string.
|
||||||
|
|
||||||
|
- `F_NL`: stop matching if a new line char `\n` or `\r` is found
|
||||||
|
|
||||||
## Functions
|
## Functions
|
||||||
|
|
||||||
|
@ -486,13 +512,15 @@ pub fn new() RE
|
||||||
|
|
||||||
```
|
```
|
||||||
#### **Custom initialization**
|
#### **Custom initialization**
|
||||||
For some particular needs it is possible initialize a fully manually customized regex:
|
For some particular needs, it is possible to initialize a fully customized regex:
|
||||||
```v ignore
|
```v ignore
|
||||||
pattern := r"ab(.*)(ac)"
|
pattern := r"ab(.*)(ac)"
|
||||||
// init custom regex
|
// init custom regex
|
||||||
mut re := regex.RE{}
|
mut re := regex.RE{}
|
||||||
re.prog = []Token {len: pattern.len + 1} // max program length, can not be longer then the pattern
|
// max program length, can not be longer than the pattern
|
||||||
re.cc = []CharClass{len: pattern.len} // can not be more char class the the length of the pattern
|
re.prog = []Token {len: pattern.len + 1}
|
||||||
|
// there can not be more char classes than the length of the pattern
|
||||||
|
re.cc = []CharClass{len: pattern.len}
|
||||||
|
|
||||||
re.group_csave_flag = false // true enables continuous group saving if needed
|
re.group_csave_flag = false // true enables continuous group saving if needed
|
||||||
re.group_max_nested = 128 // set max 128 group nested possible
|
re.group_max_nested = 128 // set max 128 group nested possible
|
||||||
|
@ -566,7 +594,7 @@ Today it is a good day. => Tod__[ay]__it is a good d__[ay]__
|
||||||
|
|
||||||
**Note:** only groups from `0` to `9` can be used in the replace strings.
|
**Note:** only groups from `0` to `9` can be used in the replace strings.
|
||||||
|
|
||||||
If the usage of `groups` in the replace process is not needed it is possible
|
If the usage of `groups` in the replace process is not needed, it is possible
|
||||||
to use a quick function:
|
to use a quick function:
|
||||||
|
|
||||||
```v ignore
|
```v ignore
|
||||||
|
@ -576,10 +604,12 @@ pub fn (mut re RE) replace_simple(in_txt string, repl string) string
|
||||||
|
|
||||||
#### Custom replace function
|
#### Custom replace function
|
||||||
|
|
||||||
For complex find and replace operations it is available the function `replace_by_fn` .
|
For complex find and replace operations, you can use `replace_by_fn` .
|
||||||
The`replace_by_fn` use a custom replace function making possible customizations.
|
The `replace_by_fn` function uses a custom replace callback function, thus
|
||||||
**The custom function is called for every non overlapped find.**
|
allowing customizations. The custom callback function is called for
|
||||||
The custom function must be of the type:
|
every non overlapped find.
|
||||||
|
|
||||||
|
The custom callback function must be of the type:
|
||||||
|
|
||||||
```v ignore
|
```v ignore
|
||||||
// type of function used for custom replace
|
// type of function used for custom replace
|
||||||
|
@ -590,7 +620,7 @@ The custom function must be of the type:
|
||||||
fn (re RE, in_txt string, start int, end int) string
|
fn (re RE, in_txt string, start int, end int) string
|
||||||
```
|
```
|
||||||
|
|
||||||
The following example will clarify the use:
|
The following example will clarify its usage:
|
||||||
|
|
||||||
```v ignore
|
```v ignore
|
||||||
import regex
|
import regex
|
||||||
|
@ -624,11 +654,12 @@ today *[*John*]* is gone to his house with *(*Jack*)* and *[*Marie*]*.
|
||||||
|
|
||||||
## Debugging
|
## Debugging
|
||||||
|
|
||||||
This module has few small utilities to help the writing of regex expressions.
|
This module has a few small utilities to help you write regex patterns.
|
||||||
|
|
||||||
### **Syntax errors highlight**
|
### **Syntax errors highlight**
|
||||||
|
|
||||||
the following example code show how to visualize the syntax errors in the compilation phase:
|
The next example code shows how to visualize regex pattern syntax errors
|
||||||
|
in the compilation phase:
|
||||||
|
|
||||||
```v oksyntax
|
```v oksyntax
|
||||||
query := r'ciao da ab[ab-]'
|
query := r'ciao da ab[ab-]'
|
||||||
|
@ -676,40 +707,36 @@ PC: 10 ist: 88000000 PROG_END { 0, 0}
|
||||||
|
|
||||||
### **Log debug**
|
### **Log debug**
|
||||||
|
|
||||||
The log debugger allow to print the status of the regex parser when the parser is running.
|
The log debugger allows printing the status of the regex parser when the
|
||||||
|
parser is running. It is possible to have two different levels of
|
||||||
|
debug information: 1 is normal, while 2 is verbose.
|
||||||
|
|
||||||
It is possible to have two different level of debug: 1 is normal while 2 is verbose.
|
Here is an example:
|
||||||
|
|
||||||
here an example:
|
*normal* - list only the token instruction with their values
|
||||||
|
|
||||||
*normal*
|
```ignore
|
||||||
|
|
||||||
list only the token instruction with their values
|
|
||||||
|
|
||||||
```
|
|
||||||
// re.debug = 1 // log level normal
|
// re.debug = 1 // log level normal
|
||||||
flags: 00000000
|
flags: 00000000
|
||||||
# 2 s: ist_load PC: 0=>7fffffff i,ch,len:[ 0,'a',1] f.m:[ -1, -1] query_ch: [a]{1,1}:0 (#-1)
|
# 2 s: ist_load PC: i,ch,len:[ 0,'a',1] f.m:[ -1, -1] query_ch: [a]{1,1}:0 (#-1)
|
||||||
# 5 s: ist_load PC: 1=>7fffffff i,ch,len:[ 1,'b',1] f.m:[ 0, 0] query_ch: [b]{2,3}:0? (#-1)
|
# 5 s: ist_load PC: i,ch,len:[ 1,'b',1] f.m:[ 0, 0] query_ch: [b]{2,3}:0? (#-1)
|
||||||
# 7 s: ist_load PC: 1=>7fffffff i,ch,len:[ 2,'b',1] f.m:[ 0, 1] query_ch: [b]{2,3}:1? (#-1)
|
# 7 s: ist_load PC: i,ch,len:[ 2,'b',1] f.m:[ 0, 1] query_ch: [b]{2,3}:1? (#-1)
|
||||||
# 10 PROG_END
|
# 10 PROG_END
|
||||||
```
|
```
|
||||||
|
|
||||||
*verbose*
|
*verbose* - list all the instructions and states of the parser
|
||||||
|
|
||||||
list all the instructions and states of the parser
|
```ignore
|
||||||
|
|
||||||
```
|
|
||||||
flags: 00000000
|
flags: 00000000
|
||||||
# 0 s: start PC: NA
|
# 0 s: start PC: NA
|
||||||
# 1 s: ist_next PC: NA
|
# 1 s: ist_next PC: NA
|
||||||
# 2 s: ist_load PC: 0=>7fffffff i,ch,len:[ 0,'a',1] f.m:[ -1, -1] query_ch: [a]{1,1}:0 (#-1)
|
# 2 s: ist_load PC: i,ch,len:[ 0,'a',1] f.m:[ -1, -1] query_ch: [a]{1,1}:0 (#-1)
|
||||||
# 3 s: ist_quant_p PC: 0=>7fffffff i,ch,len:[ 1,'b',1] f.m:[ 0, 0] query_ch: [a]{1,1}:1 (#-1)
|
# 3 s: ist_quant_p PC: i,ch,len:[ 1,'b',1] f.m:[ 0, 0] query_ch: [a]{1,1}:1 (#-1)
|
||||||
# 4 s: ist_next PC: NA
|
# 4 s: ist_next PC: NA
|
||||||
# 5 s: ist_load PC: 1=>7fffffff i,ch,len:[ 1,'b',1] f.m:[ 0, 0] query_ch: [b]{2,3}:0? (#-1)
|
# 5 s: ist_load PC: i,ch,len:[ 1,'b',1] f.m:[ 0, 0] query_ch: [b]{2,3}:0? (#-1)
|
||||||
# 6 s: ist_quant_p PC: 1=>7fffffff i,ch,len:[ 2,'b',1] f.m:[ 0, 1] query_ch: [b]{2,3}:1? (#-1)
|
# 6 s: ist_quant_p PC: i,ch,len:[ 2,'b',1] f.m:[ 0, 1] query_ch: [b]{2,3}:1? (#-1)
|
||||||
# 7 s: ist_load PC: 1=>7fffffff i,ch,len:[ 2,'b',1] f.m:[ 0, 1] query_ch: [b]{2,3}:1? (#-1)
|
# 7 s: ist_load PC: i,ch,len:[ 2,'b',1] f.m:[ 0, 1] query_ch: [b]{2,3}:1? (#-1)
|
||||||
# 8 s: ist_quant_p PC: 1=>7fffffff i,ch,len:[ 3,'b',1] f.m:[ 0, 2] query_ch: [b]{2,3}:2? (#-1)
|
# 8 s: ist_quant_p PC: i,ch,len:[ 3,'b',1] f.m:[ 0, 2] query_ch: [b]{2,3}:2? (#-1)
|
||||||
# 9 s: ist_next PC: NA
|
# 9 s: ist_next PC: NA
|
||||||
# 10 PROG_END
|
# 10 PROG_END
|
||||||
# 11 PROG_END
|
# 11 PROG_END
|
||||||
|
@ -738,7 +765,8 @@ the columns have the following meaning:
|
||||||
### **Custom Logger output**
|
### **Custom Logger output**
|
||||||
|
|
||||||
The debug functions output uses the `stdout` as default,
|
The debug functions output uses the `stdout` as default,
|
||||||
it is possible to provide an alternative output setting a custom output function:
|
it is possible to provide an alternative output, by setting a custom
|
||||||
|
output function:
|
||||||
|
|
||||||
```v oksyntax
|
```v oksyntax
|
||||||
// custom print function, the input will be the regex debug string
|
// custom print function, the input will be the regex debug string
|
||||||
|
@ -790,12 +818,17 @@ fn main(){
|
||||||
|
|
||||||
// init regex
|
// init regex
|
||||||
mut re := regex.RE{}
|
mut re := regex.RE{}
|
||||||
re.prog = []regex.Token {len: query.len + 1} // max program length, can not be longer then the query
|
// max program length, can not be longer than the query
|
||||||
re.cc = []regex.CharClass{len: query.len} // can not be more char class the the length of the query
|
re.prog = []regex.Token {len: query.len + 1}
|
||||||
|
// there can not be more char classes than the length of the query
|
||||||
|
re.cc = []regex.CharClass{len: query.len}
|
||||||
re.prog = []regex.Token {len: query.len+1}
|
re.prog = []regex.Token {len: query.len+1}
|
||||||
re.group_csave_flag = true // enable continuos group saving
|
// enable continuous group saving
|
||||||
re.group_max_nested = 128 // set max 128 group nested
|
re.group_csave_flag = true
|
||||||
re.group_max = query.len>>1 // we can't have more groups than the half of the query legth
|
// set max 128 group nested
|
||||||
|
re.group_max_nested = 128
|
||||||
|
// we can't have more groups than half of the query length
|
||||||
|
re.group_max = query.len>>1
|
||||||
|
|
||||||
// compile the query
|
// compile the query
|
||||||
re.compile_opt(query) or { panic(err) }
|
re.compile_opt(query) or { panic(err) }
|
||||||
|
@ -837,6 +870,5 @@ fn main(){
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
|
More examples are available in the test code for the `regex` module,
|
||||||
|
see `vlib/regex/regex_test.v`.
|
||||||
more example code is available in the test code for the `regex` module `vlib\regex\regex_test.v`.
|
|
||||||
|
|