Defining a JSON data structure for HTML
For context-aware HTML IntelliSense, I need to traverse the HTML document. Since it isn't the DOM yet, I need to traverse a structured representation of that HTML document. In other words, I need a JSON representation of HTML.
So, I dug into the HTML syntax, worked out the element structure, and defined the JSON structure. Then, I wrote down the expected result for a sample document.
The HTML Syntax
The HTML Living Standard describes 6 parts of an HTML document, in the given order:
- Optionally, one byte order mark (BOM) character
- Any number of comments and ASCII whitespace
- One doctype
- Any number of comments and ASCII whitespace
- The document element, html
- Any number of comments and ASCII whitespace
Inside the document element are other elements, with or without text. Some elements are void: they accept no content. Void or not, elements may accept attributes.
In total, there are 116 HTML elements to compose any HTML document.
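The six-part order above can be checked mechanically. Here is a minimal Python sketch, assuming the top-level parts have already been labelled with kind names such as "bom", "comment", "whitespace", "doctype", and "html". Those labels, `KIND_CODES`, and `follows_document_order` are hypothetical names made up for illustration; they are not part of the standard.

```python
import re

# One letter per top-level kind. The kind names are illustrative assumptions.
KIND_CODES = {
    "bom": "B",
    "comment": "C",
    "whitespace": "W",
    "doctype": "D",
    "html": "H",
}

# Optional BOM, comments/whitespace, one doctype, comments/whitespace,
# the html element, comments/whitespace.
DOCUMENT_ORDER = re.compile(r"B?[CW]*D[CW]*H[CW]*")


def follows_document_order(kinds):
    """Return True if the top-level kinds appear in the six-part order."""
    try:
        encoded = "".join(KIND_CODES[kind] for kind in kinds)
    except KeyError:
        return False  # a kind that is not allowed at the top level
    return DOCUMENT_ORDER.fullmatch(encoded) is not None
```

For example, `follows_document_order(["comment", "doctype", "html"])` holds, while a document whose html element precedes the doctype does not pass.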
The HTML Element Structure
Every HTML element except "text" is taggable, for example <html>.
Text itself comes in 3 variants:
- A BOM character,
- ASCII whitespace, and
- Scalar values, excluding noncharacters and controls other than ASCII whitespace.
The 115 taggable HTML elements either accept attributes or don't.
The 2 non-attributable HTML tags are the doctype tag <!DOCTYPE> and the comment tag <!-- -->. The other 113 tags can have attributes, e.g. <html data-attribute-key="attribute-value">.
Attributable HTML tags, in turn, are either void or non-void.
The 13 attributable void HTML tags can't accept an element as a child, e.g. <input>. The remaining 100 attributable non-void HTML tags can accept 0 or more elements, e.g. <body> <!-- 0+ elements --> </body>.
In summary, each HTML element has up to 3 basic properties, and which of them apply depends on its tag structure.
The 3 properties are content, tag name, and attributes. The 4 tag structures are Untaggable Text (UTT), Non-Attributable Tags (NAT), Attributable Void Tags (AVT), and Attributable Non-Void Tags (ANVT).
| | UTT | NAT | AVT | ANVT |
|---|---|---|---|---|
| Contents | ✓ | ✓ | ✓ | ✓ |
| Tag Name | ✗ | ✓ | ✓ | ✓ |
| Attributes | ✗ | ✗ | ✓ | ✓ |
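The four tag structures can be told apart from the tag name alone. The sketch below is one hedged way to do it in Python, assuming text is identified by an empty tag name and the doctype and comment by the names `!DOCTYPE` and `!-- --`. The function name `tag_structure` is hypothetical. The 13 void elements are the ones listed in the HTML Living Standard.

```python
# The 13 void elements from the HTML Living Standard.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})

# Assumed names for the two non-attributable tags (doctype and comment).
NON_ATTRIBUTABLE = frozenset({"!doctype", "!-- --"})


def tag_structure(tagname):
    """Classify a tag name as "UTT", "NAT", "AVT", or "ANVT"."""
    name = tagname.lower()  # HTML tag names are case-insensitive
    if name == "":
        return "UTT"  # untaggable text
    if name in NON_ATTRIBUTABLE:
        return "NAT"
    if name in VOID_ELEMENTS:
        return "AVT"
    # Any other name is treated as non-void; unknown names are not rejected.
    return "ANVT"
```

Note that this only classifies; it does not check whether a name is one of the known HTML elements.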
The JSON Structure
Per the HTML syntax, order matters and repetition is allowed. The equivalent JSON representation is therefore an array of objects: the array preserves the order, and each object holds one element.
[
  {},
  {},
  {}
]
Per the HTML element structure, each object contains the tag name, attributes, and content of an element.
The tag name, stored under the "element" key, is a string. It is an empty string only for the untaggable element "text".
The attributes property is an object, which suits the key-value nature of attributes. It is always an empty object for Untaggable Text and Non-Attributable Tags. For the others, it may be non-empty.
The content property is an array, which preserves the order of the child elements. For Untaggable Text and Non-Attributable Tags, however, the value is a string.
{
  "element": "",
  "attributes": {},
  "content": [] // or ""
}
Put together, the JSON representation of an HTML document is an array of objects, each holding the tag name, attributes, and content of one element.
[
  {
    "element": "",
    "attributes": {},
    "content": [] // or ""
  }
]
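The shape rules above can be written down as a small validator. This is a minimal Python sketch, not a definitive implementation: `is_valid_node` and `is_valid_document` are hypothetical names, and the tag names `!DOCTYPE` and `!-- --` for the two non-attributable tags are assumed from the expectation in the next section.

```python
# Assumed tag names for the two non-attributable tags.
NON_ATTRIBUTABLE = {"!DOCTYPE", "!-- --"}


def is_valid_node(node):
    """Check one object against the element/attributes/content rules."""
    if not isinstance(node, dict):
        return False
    if set(node) != {"element", "attributes", "content"}:
        return False
    element = node["element"]
    attributes = node["attributes"]
    content = node["content"]
    if not isinstance(element, str) or not isinstance(attributes, dict):
        return False
    if element == "" or element in NON_ATTRIBUTABLE:
        # Text, doctype, and comment: no attributes, string content.
        return attributes == {} and isinstance(content, str)
    # Every other tag: content is an ordered array of child nodes.
    return isinstance(content, list) and all(is_valid_node(c) for c in content)


def is_valid_document(document):
    """A document is an array of valid nodes."""
    return isinstance(document, list) and all(is_valid_node(n) for n in document)
```

A text node such as `{"element": "", "attributes": {}, "content": "hi"}` passes, while a `p` node whose content is a bare string does not.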
The Expectation
Given the HTML document:
<!-- Some comment -->
<!DOCTYPE html>
<html lang="en">
<head></head>
<body>
Some text
<hr class="block__element_modifier">
</body>
</html>
Assuming the file begins with a BOM, the JSON representation of the above document will be as follows:
[
  {
    "element": "",
    "attributes": {},
    "content": "U+FEFF"
  },
  {
    "element": "!-- --",
    "attributes": {},
    "content": "Some comment"
  },
  {
    "element": "!DOCTYPE",
    "attributes": {},
    "content": "html"
  },
  {
    "element": "html",
    "attributes": {
      "lang": "en"
    },
    "content": [
      {
        "element": "head",
        "attributes": {},
        "content": []
      },
      {
        "element": "body",
        "attributes": {},
        "content": [
          {
            "element": "",
            "attributes": {},
            "content": "Some text"
          },
          {
            "element": "hr",
            "attributes": {
              "class": "block__element_modifier"
            },
            "content": []
          }
        ]
      }
    ]
  }
]
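Since the point of this structure is traversal, here is a hedged Python sketch of a depth-first walk over it. The generator name `walk` and the index-tuple paths are illustrative assumptions, and the sample is a trimmed copy of the document above without the BOM and comment entries.

```python
def walk(nodes, path=()):
    """Yield (path, node) pairs depth-first, in document order."""
    for index, node in enumerate(nodes):
        here = path + (index,)
        yield here, node
        # Only array content holds child nodes; string content is a leaf.
        if isinstance(node["content"], list):
            yield from walk(node["content"], here)


document = [
    {"element": "!DOCTYPE", "attributes": {}, "content": "html"},
    {"element": "html", "attributes": {"lang": "en"}, "content": [
        {"element": "head", "attributes": {}, "content": []},
        {"element": "body", "attributes": {}, "content": [
            {"element": "", "attributes": {}, "content": "Some text"},
            {"element": "hr",
             "attributes": {"class": "block__element_modifier"},
             "content": []},
        ]},
    ]},
]

# Collect each node's path and tag name in visiting order.
visited = [(path, node["element"]) for path, node in walk(document)]
```

Each path is the chain of array indexes from the top level down to the node, so `(1, 1, 0)` points at the text node inside body.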