v/vlib/net/html
Delyan Angelov e8ff94fb8b net.html: simplify map setting (fixes compilation with tcc on aarch64) 2020-08-20 16:45:54 +03:00
README.md
data_structures.v
dom.v
dom_test.v cgen: error if ForInStmt is not handled (#6131) 2020-08-14 21:01:43 +02:00
parser.v net.html: simplify map setting (fixes compilation with tcc on aarch64) 2020-08-20 16:45:54 +03:00
parser_test.v
tag.v gg: handle bad image index 2020-08-18 01:08:58 +02:00

README.md

V HTML

An HTML parser written in V

Usage

If the description below isn't enough, see the test files (dom_test.v and parser_test.v)

Parser

Reads HTML, either as one complete string or as a sequence of string fragments, and returns either all Tag objects found in that HTML or a DocumentObjectModel that attempts to reconstruct the HTML tree.

split_parse(data string)

This is the main function that the parse method calls to parse your HTML fragment by fragment

parse_html(data string, is_file bool)

Call this function with either a file name or a complete HTML string; is_file tells the parser which of the two data is

add_code_tag(name string)

Registers a tag whose content the parser should not parse. For example, if your HTML or XML contains a tag such as <script>, calling add_code_tag('script') makes the parser skip over the content of every script tag: the content is still kept, but any > or < inside it will not confuse the parser

finalize()

When using split_parse, you must call this function to complete the parse
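The fragment-by-fragment flow described above can be sketched roughly as follows. This is a minimal sketch based only on the signatures listed in this README; the helper name count_tags_in_chunks and the sample chunks are hypothetical, and the mutable receiver is an assumption that has not been checked against the module source.

```v
import net.html

// count_tags_in_chunks is a hypothetical helper: it feeds each fragment to
// split_parse, calls finalize once at the end, and returns the tag count.
fn count_tags_in_chunks(chunks []string) int {
	// Assumption: a zero-value Parser is usable without extra setup.
	mut parser := html.Parser{}
	for chunk in chunks {
		parser.split_parse(chunk)
	}
	// Required after split_parse, as noted above.
	parser.finalize()
	return parser.get_tags().len
}

fn main() {
	// Sample input cut at arbitrary points, as it might arrive from a stream.
	chunks := ['<html><bo', 'dy><p>Hel', 'lo</p></bo', 'dy></html>']
	println('parsed ${count_tags_in_chunks(chunks)} tags')
}
```

The fragments do not need to end on tag boundaries; the test files show similar input cut in the middle of tag names.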

get_tags() []Tag_ptr

Returns an array with all tags and their content

get_dom() DocumentObjectModel

Returns the DocumentObjectModel for the currently parsed tags

WARNING

To reuse a parser object for another HTML document, call initialize_all() first
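As an illustration of the warning above, here is a minimal sketch that reuses one parser for two documents. The helper name count_tags and the sample strings are hypothetical; it assumes initialize_all() is safe to call on a fresh parser as well as a used one, which this README does not state.

```v
import net.html

// count_tags is a hypothetical helper: it resets the parser, parses one
// complete HTML string, and returns how many tags were found.
fn count_tags(mut parser html.Parser, data string) int {
	// Reset state left over from any previous document.
	parser.initialize_all()
	// false: data is an HTML string, not a file name.
	parser.parse_html(data, false)
	return parser.get_tags().len
}

fn main() {
	mut parser := html.Parser{}
	first := count_tags(mut parser, '<html><body><p>one</p></body></html>')
	second := count_tags(mut parser, '<div><span>two</span></div>')
	println('first: ${first}, second: ${second}')
}
```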

DocumentObjectModel

A DOM object that makes it easier to access and search tags

get_by_attribute_value(name string, value string) []Tag_ptr

Returns an array of all tags in the document that have an attribute with the given name and the given value

get_by_tag(name string) []Tag_ptr

Returns an array of all tags in the document that have the given name

get_by_attribute(name string) []Tag_ptr

Returns an array of all tags in the document that have an attribute with the given name

get_root() Tag_ptr

This function returns the root Tag

get_all_tags() []Tag_ptr

Returns all significant tags, that is, every tag except closing tags
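The search functions above could be combined along these lines. This is a sketch under assumptions: the helper name count_by_tag and the sample page are hypothetical, and dom is declared mut on the guess that the search methods take a mutable receiver (drop mut if the compiler reports it as unnecessary).

```v
import net.html

// count_by_tag is a hypothetical helper: it parses a complete HTML string
// and returns how many tags with the given name the DOM reports.
fn count_by_tag(page string, name string) int {
	mut parser := html.Parser{}
	parser.parse_html(page, false)
	// Assumption: the DOM methods need a mutable receiver.
	mut dom := parser.get_dom()
	return dom.get_by_tag(name).len
}

fn main() {
	page := '<html><body><div id="main" class="box"><p class="box">Hi</p></div></body></html>'
	mut parser := html.Parser{}
	parser.parse_html(page, false)
	mut dom := parser.get_dom()
	// Tags that carry a class attribute, regardless of its value.
	with_class := dom.get_by_attribute('class')
	// Tags whose class attribute is exactly "box".
	boxes := dom.get_by_attribute_value('class', 'box')
	println('div tags: ${count_by_tag(page, 'div')}')
	println('with class: ${with_class.len}, class=box: ${boxes.len}')
	println('root: ${dom.get_root().get_name()}')
}
```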

Tag

An object that holds a tag's information, such as its name, attributes, and children

get_children() []Tag_ptr

Returns all children as an array

get_parent() &Tag

Returns the parent of current tag

get_name() string

Returns tag name

get_content() string

Returns tag content

get_attributes() map[string]string

Returns all attributes and their values

text() string

Returns the content of the tag and of all tags inside it. Any <br> tag is converted to \n
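A short sketch of reading data back from a Tag, using only the accessors listed above. The helper name first_text and the sample markup are hypothetical, and the sketch assumes these accessors can be called on the values returned by get_by_tag without a mutable binding.

```v
import net.html

// first_text is a hypothetical helper: it returns text() of the first tag
// with the given name, or an empty string when no such tag exists.
fn first_text(page string, name string) string {
	mut parser := html.Parser{}
	parser.parse_html(page, false)
	mut dom := parser.get_dom()
	tags := dom.get_by_tag(name)
	if tags.len == 0 {
		return ''
	}
	return tags[0].text()
}

fn main() {
	page := '<html><body><p id="greeting">Hello<br>world</p></body></html>'
	mut parser := html.Parser{}
	parser.parse_html(page, false)
	mut dom := parser.get_dom()
	for tag in dom.get_by_tag('p') {
		println('name: ${tag.get_name()}')
		println('children: ${tag.get_children().len}')
		// get_attributes returns a map of attribute name to value.
		for key, value in tag.get_attributes() {
			println('attribute ${key} = ${value}')
		}
	}
	// text() includes nested tags, with <br> turned into a newline.
	println(first_text(page, 'p'))
}
```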

Some questions that may come up

Q: Why does the parser have a builder_str() string method that returns only the lexeme string?

A: In the early stages of the project, strings.Builder was used, but because of a bug somewhere, it became necessary to use string directly. The plan is to switch back to strings.Builder later

Q: Why is there a compare_string(a string, b string) bool method?

A: For some reason, using != and == on strings directly was not working, so this method is a workaround

Q: Will there be something like XPath?

A: Something like XPath, yes. Exactly the same as XPath, no.

Roadmap

  • Parser
    • <!-- Comments --> detection
    • Open Generic tags detection
    • Close Generic tags detection
    • verify string detection
    • tag attributes detection
    • attributes values detection
    • tag text (declared on the tag as content; may be renamed to text in the future)
    • text file parsing support (open local files for parsing)
    • open_code verification
  • DocumentObjectModel
    • push elements that have a close tag into stack
    • remove elements from stack
    • create a new document root if there is a syntax error (deleted)
    • search tags in DOM by attributes
    • search tags in DOM by tag type
    • finish DOM test

License

GPL3