Bug #3839

Problem with HTML5 indentation: <mark> not recognized as inline element

Added by Michael Kay almost 6 years ago. Updated over 5 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: Serialization
Sprint/Milestone: -
Start date: 2018-07-13
Due date:
% Done: 100%
Estimated time:
Legacy ID:
Applies to branch: 9.8, trunk
Fix Committed on Branch: 9.8, trunk
Fixed in Maintenance Release:
Platforms:

Description

Reported today in direct email to support(at)saxonica

I'm attaching an XSL file and HTML output which demonstrate the bug. With the declaration <xsl:output method="html" name="htmlOutput" indent="yes" version="5.0"/>, the opening <mark> tag seems to be treated as a block element, while the closing tag is treated as inline. With indent="yes", the opening tag therefore inappropriately introduces whitespace. This doesn't occur with indent="no", and it doesn't occur with <span>, which is treated as inline with either indent value.

I'm using Saxon-PE 9.8.0.12 in Oxygen XML Editor 20.1.

Actions #1

Updated by Michael Kay almost 6 years ago

  • Description updated (diff)

Actions #2

Updated by Michael Kay almost 6 years ago

  • Description updated (diff)

Actions #3

Updated by Michael Kay almost 6 years ago

XSLT/XQ Serialization 3.1 says for HTML5 serialization that an element is treated as an inline element if HTML5 defines it as a phrasing element.

The set of phrasing elements in HTML5 is

a abbr area (if it is a descendant of a map element) audio b bdi bdo br button canvas cite code data datalist del dfn em embed i iframe img input ins kbd keygen label map mark math meter noscript object output picture progress q ruby s samp script select small span strong sub sup svg template textarea time u var video wbr

But the set of inline elements recognized by Saxon is

            "tt", "i", "b", "u", "s", "strike", "big", "small", "em", "strong", "dfn", "code", "samp",
            "kbd", "var", "cite", "abbr", "acronym", "a", "img", "applet", "object", "font",
            "basefont", "br", "script", "map", "q", "sub", "sup", "span", "bdo", "iframe", "input",
            "select", "textarea", "label", "button", "ins", "del"
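
As a rough check of the gap between the two lists above, the following Java sketch computes which HTML5 phrasing elements are absent from the list Saxon recognized at the time. The class and member names (InlineElementDiff, missingFromSaxon) are hypothetical and are not taken from the Saxon source; only the element names are copied from this comment.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

final class InlineElementDiff {

    // HTML5 phrasing elements, copied from the list quoted in this comment.
    static final Set<String> HTML5_PHRASING = new TreeSet<>(Arrays.asList(
            "a", "abbr", "area", "audio", "b", "bdi", "bdo", "br", "button", "canvas",
            "cite", "code", "data", "datalist", "del", "dfn", "em", "embed", "i", "iframe",
            "img", "input", "ins", "kbd", "keygen", "label", "map", "mark", "math", "meter",
            "noscript", "object", "output", "picture", "progress", "q", "ruby", "s", "samp",
            "script", "select", "small", "span", "strong", "sub", "sup", "svg", "template",
            "textarea", "time", "u", "var", "video", "wbr"));

    // Inline element names recognized by Saxon before the fix, as quoted in this comment.
    static final Set<String> SAXON_INLINE_BEFORE_FIX = new TreeSet<>(Arrays.asList(
            "tt", "i", "b", "u", "s", "strike", "big", "small", "em", "strong", "dfn", "code",
            "samp", "kbd", "var", "cite", "abbr", "acronym", "a", "img", "applet", "object",
            "font", "basefont", "br", "script", "map", "q", "sub", "sup", "span", "bdo",
            "iframe", "input", "select", "textarea", "label", "button", "ins", "del"));

    // Set difference: phrasing elements that the old Saxon list does not contain.
    static Set<String> missingFromSaxon() {
        Set<String> missing = new TreeSet<>(HTML5_PHRASING);
        missing.removeAll(SAXON_INLINE_BEFORE_FIX);
        return missing;
    }

    public static void main(String[] args) {
        // Prints the missing names in alphabetical order; "mark" is among them.
        System.out.println(missingFromSaxon());
    }
}
```

The result includes mark but not span, which matches the behaviour described in the report.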

Actions #4

Updated by Michael Kay almost 6 years ago

Studying the code, it also appears that the SUPPRESS_INDENTATION property is ignored for HTML and XHTML serialization. Note that the rules require case-blind (and partially namespace-blind) matching in the HTML case. I have logged this as a separate bug #3841.

Actions #5

Updated by Michael Kay almost 6 years ago

  • Subject changed from "Problem with HTML5 indentation" to "Problem with HTML5 indentation: <mark> not recognized as inline element"

Actions #5

Updated by Michael Kay almost 6 years ago

Seems there are very few W3C tests for indenting: a historical accident because the machinery for writing such a test was only introduced at a late stage.

I created a test to serialize:

<html><head/><body><p>a<a>text</a>z</p></body></html>

and it produced:

<html>
   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
   </head>
   <body>
      <p>a<a>text</a>z
      </p>
   </body>
</html>

which is incorrect because no whitespace is allowed before the </p> end tag. ("Whitespace MUST NOT be added other than before or after an element, or adjacent to an existing whitespace character.")
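
A reproduction along these lines can be sketched with the standard JAXP identity transformer. This is an assumption about tooling, not how the test above was actually written. The class name HtmlIndentRepro is hypothetical, and which serializer runs depends on the TransformerFactory found on the classpath: Saxon if it is present and selected, otherwise the JDK's built-in processor, whose indentation may differ from the output shown above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class HtmlIndentRepro {

    // Serializes the given XML with method="html" and indent="yes"
    // using an identity transformation.
    static String serialize(String xml) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.METHOD, "html");
            identity.setOutputProperty(OutputKeys.INDENT, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }

    public static void main(String[] args) {
        String out = serialize("<html><head/><body><p>a<a>text</a>z</p></body></html>");
        System.out.println(out);
        // The input has no whitespace between "z" and the end tag, so a serializer
        // following the quoted rule should not insert any there.
        System.out.println("z</p> kept together: " + out.contains("z</p>"));
    }
}
```

Under the rule quoted above, a conforming serializer would leave z</p> intact, because the whitespace would otherwise be added next to a text node that has no existing whitespace.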

I have logged this as a separate bug #3842.

Actions #7

Updated by Michael Kay almost 6 years ago

For the record, I have lodged bug https://www.w3.org/Bugs/Public/show_bug.cgi?id=30276 against the W3C serialization spec regarding ambiguity about the definition of inline elements (e.g. is mark an inline element when serializing as HTML 4.0?).

There's no harm in treating an element as inline even if the spec doesn't classify it as such (the only effect is that you won't get indentation in a place where the spec permits it), so I will create a list of inline elements that's the union of the HTML4 and HTML5 lists, and use this regardless of the output version requested.

Actions #8

Updated by Michael Kay almost 6 years ago

  • Status changed from New to Resolved
  • Applies to branch 9.8, trunk added
  • Fix Committed on Branch 9.8, trunk added

I have updated the list of inline element names to:

            "a", "abbr", "acronym", "applet", "area",
            "audio", "b", "basefont", "bdi", "bdo", "big", "br", "button", "canvas", "cite", "code", "data",
            "datalist", "del", "dfn", "em", "embed", "font", "i", "iframe", "img", "input", "ins",
            "kbd", "label", "link", "map",
            "mark", "math", "meter", "noscript", "object", "output", "picture",
            "progress", "q", "ruby", "s", "samp", "script", "select", "small", "span",
            "strike", "strong", "sub", "sup", "svg", "template", "textarea",
            "time", "tt", "u", "var", "video", "wbr"

This is the union of the HTML4 and HTML5 lists.

The handling of case-blindness and namespace-sensitivity varies between HTML and XHTML, and depends on the html version, but the list of local names is the same in all cases.
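
The idea of one shared list of local names with differing case handling can be illustrated with a minimal sketch. It is not the Saxon implementation: the class name InlineElementLookup and the caseBlind flag are hypothetical, the caller is assumed to decide which mode applies, and namespace handling is left out entirely.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class InlineElementLookup {

    // The updated list of inline local names, copied from this comment.
    static final Set<String> INLINE_LOCAL_NAMES = new HashSet<>(Arrays.asList(
            "a", "abbr", "acronym", "applet", "area",
            "audio", "b", "basefont", "bdi", "bdo", "big", "br", "button", "canvas", "cite",
            "code", "data", "datalist", "del", "dfn", "em", "embed", "font", "i", "iframe",
            "img", "input", "ins", "kbd", "label", "link", "map",
            "mark", "math", "meter", "noscript", "object", "output", "picture",
            "progress", "q", "ruby", "s", "samp", "script", "select", "small", "span",
            "strike", "strong", "sub", "sup", "svg", "template", "textarea",
            "time", "tt", "u", "var", "video", "wbr"));

    // caseBlind is a hypothetical switch: true folds the name to lower case
    // before the lookup, false requires an exact match.
    static boolean isInline(String localName, boolean caseBlind) {
        String key = caseBlind ? localName.toLowerCase(Locale.ROOT) : localName;
        return INLINE_LOCAL_NAMES.contains(key);
    }

    public static void main(String[] args) {
        System.out.println(isInline("mark", false)); // exact match against the list
        System.out.println(isInline("MARK", true));  // folded to "mark" first
        System.out.println(isInline("MARK", false)); // no folding, so no match
    }
}
```

Keeping the name set independent of the requested output version means only the matching step has to vary between the HTML and XHTML cases.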

I have also added title to the list of formatted elements.

I have added a number of test cases to the QT3 method-html and method-xhtml test sets. These also depend on fixing bug #3842.

Actions #9

Updated by Debbie Lockett over 5 years ago

  • Status changed from Resolved to Closed
  • % Done changed from 0 to 100
  • Fixed in Maintenance Release 9.8.0.14 added

Bug fix applied in the Saxon 9.8.0.14 maintenance release.
