V HTML

An HTML parser written in V.

Usage

If the description below isn't enough, please look at the test files.

Parser

Reads HTML from a complete string or from string fragments. It returns either all Tag objects found in that HTML or a DocumentObjectModel, which tries to reconstruct the HTML tree.

split_parse(data string)

This is the main parsing function. The parse method calls it to parse your HTML fragment by fragment.

parse_html(data string, is_file bool)

Call this function with either a filename or a complete HTML string; is_file indicates which of the two data holds.

add_code_tag(name string)

This function registers a tag whose content the parser should skip. For example, if your HTML or XML contains a tag such as <script>, calling add_code_tag('script') makes the parser jump over the content of every script tag. The content is still kept, but its > and < characters no longer confuse the parser.
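As a rough sketch of how this might look, the snippet below registers script as a code tag and then parses a small document. It assumes the module is imported as net.html, that a parser can be created with a plain html.Parser{} literal, and that add_code_tag is called before parsing; none of this is stated in this README, and the sample markup is made up.

```v
import net.html

fn main() {
	// Assumption: a parser can be created with a plain struct literal.
	mut parser := html.Parser{}
	// Register <script> so its body is kept as content instead of being parsed.
	parser.add_code_tag('script')
	// Made-up input: the "<" inside the script could otherwise look like a tag start.
	parser.parse_html('<html><body><script>if (a < b) { run() }</script></body></html>',
		false)
	for tag in parser.get_tags() {
		println(tag.get_name())
	}
}
```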

finalize()

When using the split_parse method, you must call this function to finish the parse.
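A minimal sketch of fragment-by-fragment parsing, under the same assumptions as the rest of this README's usage (module imported as net.html, parser created with html.Parser{}, which is not confirmed here). The chunk boundaries are arbitrary and the markup is invented for illustration.

```v
import net.html

fn main() {
	// Assumption: a parser can be created with a plain struct literal.
	mut parser := html.Parser{}
	// Feed the document in arbitrary pieces; a tag may be cut in the middle.
	chunks := ['<html><bo', 'dy><h3>Nice', ' Test!</h3></bo', 'dy></html>']
	for chunk in chunks {
		parser.split_parse(chunk)
	}
	// Required after split_parse to complete the parse.
	parser.finalize()
	println(parser.get_tags().len)
}
```

This pattern is useful when the HTML arrives in pieces, for example from a network stream, and you do not want to buffer the whole document first.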

get_tags() []Tag_ptr

This function returns an array with all tags and their content.

get_dom() DocumentObjectModel

Returns the DocumentObjectModel for the currently parsed tags.
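Putting the functions above together, a one-shot parse might look like the sketch below. The import path net.html, the html.Parser{} literal, and the sample markup are assumptions. The dom variable is declared mut in case the DOM methods take a mutable receiver, which this README does not specify.

```v
import net.html

fn main() {
	// Assumption: a parser can be created with a plain struct literal.
	mut parser := html.Parser{}
	// false: the first argument is HTML text, not a filename.
	parser.parse_html('<html><body><p id="intro">Hello</p></body></html>', false)
	// Flat list of every tag the parser produced.
	tags := parser.get_tags()
	println(tags.len)
	// Tree view of the same tags.
	mut dom := parser.get_dom()
	println(dom.get_all_tags().len)
}
```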

WARNING

If you want to reuse a parser object to parse another HTML document, call the initialize_all() function first.
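A short sketch of reusing one parser for two unrelated documents. It assumes initialize_all() takes no arguments and is callable from user code, as the warning suggests; the import path, the html.Parser{} literal, and both inputs are illustrative assumptions.

```v
import net.html

fn main() {
	// Assumption: a parser can be created with a plain struct literal.
	mut parser := html.Parser{}
	parser.parse_html('<p>first</p>', false)
	println(parser.get_tags().len)
	// Reset the parser before feeding a second, unrelated document.
	parser.initialize_all()
	parser.parse_html('<div><span>second</span></div>', false)
	println(parser.get_tags().len)
}
```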

DocumentObjectModel

A DOM object that makes it easier to access and search tags.

get_by_attribute_value(name string, value string) []Tag_ptr

This function returns a Tag array with all tags in the document that have an attribute with the given name and the given value.

get_by_tag(name string) []Tag_ptr

This function returns a Tag array with all tags in the document whose name matches the given value.

get_by_attribute(name string) []Tag_ptr

This function returns a Tag array with all tags in the document that have an attribute with the given name.

get_root() Tag_ptr

This function returns the root Tag.

get_all_tags() []Tag_ptr

This function returns all important tags, removing close tags.
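The search functions above could be combined as in the following sketch. The import path net.html, the html.Parser{} literal, and the sample document are assumptions made for illustration. The dom variable is declared mut in case these methods need a mutable receiver.

```v
import net.html

fn main() {
	// Assumption: a parser can be created with a plain struct literal.
	mut parser := html.Parser{}
	// Made-up document with two divs, one link and one id attribute.
	page := '<html><body><div id="main" class="box"><a href="/x">x</a></div>' +
		'<div class="box"></div></body></html>'
	parser.parse_html(page, false)
	mut dom := parser.get_dom()
	// Search by tag name.
	divs := dom.get_by_tag('div')
	// Search by attribute name only.
	with_href := dom.get_by_attribute('href')
	// Search by attribute name and value.
	main_div := dom.get_by_attribute_value('id', 'main')
	println(divs.len)
	println(with_href.len)
	println(main_div.len)
	// The root tag of the reconstructed tree.
	println(dom.get_root().get_name())
}
```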

Tag

An object that holds a tag's information, such as its name, attributes, and children.

get_children() []Tag_ptr

Returns all children as an array.

get_parent() &Tag

Returns the parent of the current tag.

get_name() string

Returns tag name.

get_content() string

Returns tag content.

get_attributes() map[string]string

Returns all attributes and their values.

text() string

Returns the content of the tag and all tags inside it. Also, any <br> tag will be converted into \n.
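To illustrate the accessors, the sketch below looks up a paragraph and prints what each method returns for it. As elsewhere, the import path net.html, the html.Parser{} literal, and the markup are assumptions. get_parent() is left out because this README does not say how a missing parent is represented.

```v
import net.html

fn main() {
	// Assumption: a parser can be created with a plain struct literal.
	mut parser := html.Parser{}
	parser.parse_html('<html><body><p class="note">one<br>two</p></body></html>',
		false)
	mut dom := parser.get_dom()
	for tag in dom.get_by_tag('p') {
		println(tag.get_name())
		// Content of this tag only.
		println(tag.get_content())
		// Content including nested tags, with <br> turned into a newline.
		println(tag.text())
		// Attributes come back as a map from name to value.
		for key, value in tag.get_attributes() {
			println(key + '=' + value)
		}
		println(tag.get_children().len)
	}
}
```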

Some questions that may come up

Q: Why does the parser have a builder_str() string method that returns only the lexeme string?

A: In the early stages of the project, strings.Builder was used, but because of a bug somewhere, it became necessary to use string directly. The plan is to switch back to strings.Builder later.

Q: Why is there a compare_string(a string, b string) bool method?

A: For some reason, using != and == on strings directly does not work. This method is a workaround.

Q: Will there be something like XPath?

A: Something like XPath, yes. Exactly the same as XPath, no.

Roadmap

  • Parser
    • <!-- Comments --> detection
    • Open Generic tags detection
    • Close Generic tags detection
    • verify string detection
    • tag attributes detection
    • attributes values detection
    • tag text (on the tag it is declared as content, maybe change it to text in the future)
    • text file for parse support (open local files for parsing)
    • open_code verification
  • DocumentObjectModel
    • push elements that have a close tag into stack
    • remove elements from stack
    • create a new document root if there is a syntax error (deleted)
    • search tags in DOM by attributes
    • search tags in DOM by tag type
    • finish dom test

License

GPL3