Class HtmlParser
- java.lang.Object
-
- nu.validator.htmlparser.sax.HtmlParser
-
- All Implemented Interfaces:
org.xml.sax.XMLReader
- Direct Known Subclasses:
InfosetCoercingHtmlParser
public class HtmlParser extends java.lang.Object implements org.xml.sax.XMLReader
This class implements an HTML5 parser that exposes data through the SAX2 interface.
By default, when using the constructor without arguments, this parser coerces XML 1.0-incompatible infosets into XML 1.0-compatible infosets. This corresponds to ALTER_INFOSET as the general XML violation policy. To make the parser support non-conforming HTML fully per the HTML 5 spec, at the cost of potentially violating the SAX2 API contract, set the general XML violation policy to ALLOW. It is possible to treat XML 1.0 infoset violations as fatal by setting the general XML violation policy to FATAL.
By default, this parser doesn't do true streaming but buffers everything first. The parser can be made truly streaming by calling setStreamabilityViolationPolicy(XmlViolationPolicy.FATAL). This has the consequence that errors that require non-streamable recovery are treated as fatal.
By default, in order to make the parse events emulate the parse events for a DTDless XML document, the parser does not report the doctype through LexicalHandler. Doctype reporting through LexicalHandler can be turned on by calling setReportingDoctype(true).
- Version:
- $Id$
- Author:
- hsivonen
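Because the parser is exposed as an org.xml.sax.XMLReader, ordinary SAX client code should be able to drive it. The following is a minimal sketch written against the XMLReader interface only, so that it compiles with the JDK alone. The class SaxElementCollector and its method are hypothetical helpers introduced for illustration and are not part of this library; the assumption is that a caller passes an HtmlParser instance as the reader.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: collects element names reported by any SAX2 XMLReader. */
final class SaxElementCollector {

    private SaxElementCollector() {
    }

    /**
     * Parses the markup with the supplied reader and returns the element
     * names in the order in which their start tags are reported.
     */
    static List<String> collectElementNames(XMLReader reader, String markup)
            throws IOException, SAXException {
        final List<String> names = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                    String qName, Attributes attributes) {
                // Prefer the local name; fall back to the qualified name
                // when the reader does not report local names.
                boolean noLocal = localName == null || localName.isEmpty();
                names.add(noLocal ? qName : localName);
            }
        });
        reader.parse(new InputSource(new StringReader(markup)));
        return names;
    }
}
```

With this library on the classpath, the intended call would be along the lines of collectElementNames(new HtmlParser(XmlViolationPolicy.ALTER_INFOSET), markup). The policy is passed explicitly there so that the call does not depend on the default of the no-argument constructor.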
-
-
Constructor Summary
| Constructor | Description |
| --- | --- |
| HtmlParser() | Instantiates the parser with a fatal XML violation policy. |
| HtmlParser(XmlViolationPolicy xmlPolicy) | Instantiates the parser with a specific XML violation policy. |
-
Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| void | addCharacterHandler(CharacterHandler characterHandler) | |
| XmlViolationPolicy | getBogusXmlnsPolicy() | Deprecated. |
| XmlViolationPolicy | getCommentPolicy() | Returns the commentPolicy. |
| org.xml.sax.ContentHandler | getContentHandler() | |
| XmlViolationPolicy | getContentNonXmlCharPolicy() | Returns the contentNonXmlCharPolicy. |
| XmlViolationPolicy | getContentSpacePolicy() | Returns the contentSpacePolicy. |
| DoctypeExpectation | getDoctypeExpectation() | Returns the doctype expectation. |
| org.xml.sax.Locator | getDocumentLocator() | Returns the Locator during parse. |
| DocumentModeHandler | getDocumentModeHandler() | Returns the document mode handler. |
| org.xml.sax.DTDHandler | getDTDHandler() | |
| org.xml.sax.EntityResolver | getEntityResolver() | |
| org.xml.sax.ErrorHandler | getErrorHandler() | |
| boolean | getFeature(java.lang.String name) | Exposes the configuration of the emulated XML parser as well as boolean-valued configuration without using non-XMLReader getters directly. |
| Heuristics | getHeuristics() | |
| org.xml.sax.ext.LexicalHandler | getLexicalHandler() | Returns the lexicalHandler. |
| XmlViolationPolicy | getNamePolicy() | The policy for non-NCName element and attribute names. |
| java.lang.Object | getProperty(java.lang.String name) | Allows XMLReader-level access to non-boolean valued getters. |
| XmlViolationPolicy | getStreamabilityViolationPolicy() | Returns the streamabilityViolationPolicy. |
| XmlViolationPolicy | getXmlnsPolicy() | Returns the xmlnsPolicy. |
| boolean | isCheckingNormalization() | Indicates whether NFC normalization of source is being checked. |
| boolean | isHtml4ModeCompatibleWithXhtml1Schemata() | Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value. |
| boolean | isMappingLangToXmlLang() | Whether lang is mapped to xml:lang. |
| boolean | isReportingDoctype() | Returns the reportingDoctype. |
| boolean | isScriptingEnabled() | Whether the parser considers scripting to be enabled for noscript treatment. |
| void | parse(java.lang.String systemId) | |
| void | parse(org.xml.sax.InputSource input) | |
| void | parseFragment(org.xml.sax.InputSource input, java.lang.String context) | Parses a fragment. |
| void | setBogusXmlnsPolicy(XmlViolationPolicy bogusXmlnsPolicy) | Deprecated. |
| void | setCheckingNormalization(boolean enable) | Toggles the checking of the NFC normalization of source. |
| void | setCommentPolicy(XmlViolationPolicy commentPolicy) | Sets the policy for consecutive hyphens in comments. |
| void | setContentHandler(org.xml.sax.ContentHandler handler) | |
| void | setContentNonXmlCharPolicy(XmlViolationPolicy contentNonXmlCharPolicy) | Sets the policy for non-XML characters except white space. |
| void | setContentSpacePolicy(XmlViolationPolicy contentSpacePolicy) | Sets the policy for non-XML white space. |
| void | setDoctypeExpectation(DoctypeExpectation doctypeExpectation) | Sets the doctype expectation. |
| void | setDocumentModeHandler(DocumentModeHandler documentModeHandler) | Sets the document mode handler. |
| void | setDTDHandler(org.xml.sax.DTDHandler handler) | |
| void | setEntityResolver(org.xml.sax.EntityResolver resolver) | |
| void | setErrorHandler(org.xml.sax.ErrorHandler handler) | |
| void | setErrorProfile(java.util.HashMap<java.lang.String,java.lang.String> errorProfileMap) | |
| void | setFeature(java.lang.String name, boolean value) | Sets a boolean feature without having to use non-XMLReader setters directly. |
| void | setHeuristics(Heuristics heuristics) | Sets the encoding sniffing heuristics. |
| void | setHtml4ModeCompatibleWithXhtml1Schemata(boolean html4ModeCompatibleWithXhtml1Schemata) | Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value. |
| void | setLexicalHandler(org.xml.sax.ext.LexicalHandler handler) | Sets the lexical handler. |
| void | setMappingLangToXmlLang(boolean mappingLangToXmlLang) | Whether lang is mapped to xml:lang. |
| void | setNamePolicy(XmlViolationPolicy namePolicy) | The policy for non-NCName element and attribute names. |
| void | setProperty(java.lang.String name, java.lang.Object value) | Sets a non-boolean property without having to use non-XMLReader setters directly. |
| void | setReportingDoctype(boolean reportingDoctype) | |
| void | setScriptingEnabled(boolean scriptingEnabled) | Sets whether the parser considers scripting to be enabled for noscript treatment. |
| void | setStreamabilityViolationPolicy(XmlViolationPolicy streamabilityViolationPolicy) | Sets the streamabilityViolationPolicy. |
| void | setTransitionHandler(TransitionHandler handler) | |
| void | setTreeBuilderErrorHandlerOverride(org.xml.sax.ErrorHandler handler) | Deprecated. For Validator.nu internal use |
| void | setXmlnsPolicy(XmlViolationPolicy xmlnsPolicy) | Whether the xmlns attribute on the root element is passed through. |
| void | setXmlPolicy(XmlViolationPolicy xmlPolicy) | This is a catch-all convenience method for setting name, xmlns, content space, content non-XML char and comment policies in one go. |
-
-
-
Constructor Detail
-
HtmlParser
public HtmlParser()
Instantiates the parser with a fatal XML violation policy.
-
HtmlParser
public HtmlParser(XmlViolationPolicy xmlPolicy)
Instantiates the parser with a specific XML violation policy.
- Parameters:
xmlPolicy
- the policy
-
-
Method Detail
-
getContentHandler
public org.xml.sax.ContentHandler getContentHandler()
- Specified by:
getContentHandler
in interface org.xml.sax.XMLReader
- See Also:
XMLReader.getContentHandler()
-
getDTDHandler
public org.xml.sax.DTDHandler getDTDHandler()
- Specified by:
getDTDHandler
in interface org.xml.sax.XMLReader
- See Also:
XMLReader.getDTDHandler()
-
getEntityResolver
public org.xml.sax.EntityResolver getEntityResolver()
- Specified by:
getEntityResolver
in interface org.xml.sax.XMLReader
- See Also:
XMLReader.getEntityResolver()
-
getErrorHandler
public org.xml.sax.ErrorHandler getErrorHandler()
- Specified by:
getErrorHandler
in interface org.xml.sax.XMLReader
- See Also:
XMLReader.getErrorHandler()
-
getFeature
public boolean getFeature(java.lang.String name) throws org.xml.sax.SAXNotRecognizedException, org.xml.sax.SAXNotSupportedException
Exposes the configuration of the emulated XML parser as well as boolean-valued configuration without using non-XMLReader getters directly.
| Feature URI | Value |
| --- | --- |
| http://xml.org/sax/features/external-general-entities | false |
| http://xml.org/sax/features/external-parameter-entities | false |
| http://xml.org/sax/features/is-standalone | true |
| http://xml.org/sax/features/lexical-handler/parameter-entities | false |
| http://xml.org/sax/features/namespaces | true |
| http://xml.org/sax/features/namespace-prefixes | false |
| http://xml.org/sax/features/resolve-dtd-uris | true |
| http://xml.org/sax/features/string-interning | false |
| http://xml.org/sax/features/unicode-normalization-checking | isCheckingNormalization |
| http://xml.org/sax/features/use-attributes2 | false |
| http://xml.org/sax/features/use-locator2 | false |
| http://xml.org/sax/features/use-entity-resolver2 | false |
| http://xml.org/sax/features/validation | false |
| http://xml.org/sax/features/xmlns-uris | false |
| http://xml.org/sax/features/xml-1.1 | false |
| http://validator.nu/features/html4-mode-compatible-with-xhtml1-schemata | isHtml4ModeCompatibleWithXhtml1Schemata |
| http://validator.nu/features/mapping-lang-to-xml-lang | isMappingLangToXmlLang |
| http://validator.nu/features/scripting-enabled | isScriptingEnabled |
- Specified by:
getFeature
in interface org.xml.sax.XMLReader
- Parameters:
name
- feature URI string
- Returns:
- a value per the list above
- Throws:
org.xml.sax.SAXNotRecognizedException
org.xml.sax.SAXNotSupportedException
- See Also:
XMLReader.getFeature(java.lang.String)
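The feature list above can be inspected generically through the XMLReader interface. The sketch below assumes only the standard org.xml.sax API; FeatureReport and readFeatures are hypothetical names introduced for illustration. A feature URI that the reader rejects is recorded as null rather than propagated as an exception.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

/** Hypothetical helper: reads boolean features through the XMLReader interface. */
final class FeatureReport {

    private FeatureReport() {
    }

    /**
     * Returns the value of each requested feature in request order, or null
     * for a feature URI that the reader does not recognize or support.
     */
    static Map<String, Boolean> readFeatures(XMLReader reader, String... featureUris) {
        Map<String, Boolean> report = new LinkedHashMap<>();
        for (String uri : featureUris) {
            try {
                report.put(uri, reader.getFeature(uri));
            } catch (SAXException e) {
                // SAXNotRecognizedException and SAXNotSupportedException
                // both extend SAXException.
                report.put(uri, null);
            }
        }
        return report;
    }
}
```

Passing an HtmlParser together with URIs from the list above should yield the values shown there, with the getter-backed entries reflecting the parser's current configuration.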
-
getProperty
public java.lang.Object getProperty(java.lang.String name) throws org.xml.sax.SAXNotRecognizedException, org.xml.sax.SAXNotSupportedException
Allows XMLReader-level access to non-boolean valued getters.
The properties are mapped as follows:
| Property URI | Value |
| --- | --- |
| http://xml.org/sax/properties/document-xml-version | "1.0" |
| http://xml.org/sax/properties/lexical-handler | getLexicalHandler |
| http://validator.nu/properties/content-space-policy | getContentSpacePolicy |
| http://validator.nu/properties/content-non-xml-char-policy | getContentNonXmlCharPolicy |
| http://validator.nu/properties/comment-policy | getCommentPolicy |
| http://validator.nu/properties/xmlns-policy | getXmlnsPolicy |
| http://validator.nu/properties/name-policy | getNamePolicy |
| http://validator.nu/properties/streamability-violation-policy | getStreamabilityViolationPolicy |
| http://validator.nu/properties/document-mode-handler | getDocumentModeHandler |
| http://validator.nu/properties/doctype-expectation | getDoctypeExpectation |
| http://xml.org/sax/features/unicode-normalization-checking | |
- Specified by:
getProperty
in interface org.xml.sax.XMLReader
- Parameters:
name
- property URI string
- Returns:
- a value per the list above
- Throws:
org.xml.sax.SAXNotRecognizedException
org.xml.sax.SAXNotSupportedException
- See Also:
XMLReader.getProperty(java.lang.String)
-
parse
public void parse(org.xml.sax.InputSource input) throws java.io.IOException, org.xml.sax.SAXException
- Specified by:
parse
in interface org.xml.sax.XMLReader
- Throws:
java.io.IOException
org.xml.sax.SAXException
- See Also:
XMLReader.parse(org.xml.sax.InputSource)
-
parseFragment
public void parseFragment(org.xml.sax.InputSource input, java.lang.String context) throws java.io.IOException, org.xml.sax.SAXException
Parses a fragment.
- Parameters:
input
- the input to parse
context
- the name of the context element
- Throws:
java.io.IOException
org.xml.sax.SAXException
-
parse
public void parse(java.lang.String systemId) throws java.io.IOException, org.xml.sax.SAXException
- Specified by:
parse
in interface org.xml.sax.XMLReader
- Throws:
java.io.IOException
org.xml.sax.SAXException
- See Also:
XMLReader.parse(java.lang.String)
-
setContentHandler
public void setContentHandler(org.xml.sax.ContentHandler handler)
- Specified by:
setContentHandler
in interface org.xml.sax.XMLReader
- See Also:
XMLReader.setContentHandler(org.xml.sax.ContentHandler)
-
setLexicalHandler
public void setLexicalHandler(org.xml.sax.ext.LexicalHandler handler)
Sets the lexical handler.
- Parameters:
handler
- the handler.
-
setDTDHandler
public void setDTDHandler(org.xml.sax.DTDHandler handler)
- Specified by:
setDTDHandler
in interface org.xml.sax.XMLReader
- See Also:
XMLReader.setDTDHandler(org.xml.sax.DTDHandler)
-
setEntityResolver
public void setEntityResolver(org.xml.sax.EntityResolver resolver)
- Specified by:
setEntityResolver
in interface org.xml.sax.XMLReader
- See Also:
XMLReader.setEntityResolver(org.xml.sax.EntityResolver)
-
setErrorHandler
public void setErrorHandler(org.xml.sax.ErrorHandler handler)
- Specified by:
setErrorHandler
in interface org.xml.sax.XMLReader
- See Also:
XMLReader.setErrorHandler(org.xml.sax.ErrorHandler)
-
setTransitionHandler
public void setTransitionHandler(TransitionHandler handler)
-
setTreeBuilderErrorHandlerOverride
public void setTreeBuilderErrorHandlerOverride(org.xml.sax.ErrorHandler handler)
Deprecated. For Validator.nu internal use
- See Also:
XMLReader.setErrorHandler(org.xml.sax.ErrorHandler)
-
setFeature
public void setFeature(java.lang.String name, boolean value) throws org.xml.sax.SAXNotRecognizedException, org.xml.sax.SAXNotSupportedException
Sets a boolean feature without having to use non-XMLReader setters directly.
The supported features are:
| Feature URI | Setter |
| --- | --- |
| http://xml.org/sax/features/unicode-normalization-checking | setCheckingNormalization |
| http://validator.nu/features/html4-mode-compatible-with-xhtml1-schemata | setHtml4ModeCompatibleWithXhtml1Schemata |
| http://validator.nu/features/mapping-lang-to-xml-lang | setMappingLangToXmlLang |
| http://validator.nu/features/scripting-enabled | setScriptingEnabled |
- Specified by:
setFeature
in interface org.xml.sax.XMLReader
- Throws:
org.xml.sax.SAXNotRecognizedException
org.xml.sax.SAXNotSupportedException
- See Also:
XMLReader.setFeature(java.lang.String, boolean)
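Code that holds the parser only as an XMLReader can toggle the features listed above by URI. The following sketch assumes nothing beyond the standard org.xml.sax API; FeatureToggle and trySetFeature are hypothetical names, and the two constants simply repeat URIs from the list above.

```java
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

/** Hypothetical helper: sets boolean features through the XMLReader interface. */
final class FeatureToggle {

    static final String SCRIPTING_ENABLED =
            "http://validator.nu/features/scripting-enabled";
    static final String MAPPING_LANG_TO_XML_LANG =
            "http://validator.nu/features/mapping-lang-to-xml-lang";

    private FeatureToggle() {
    }

    /** Tries to set a feature; returns false if the reader rejects it. */
    static boolean trySetFeature(XMLReader reader, String uri, boolean value) {
        try {
            reader.setFeature(uri, value);
            return true;
        } catch (SAXException e) {
            // Covers both the "not recognized" and "not supported" cases.
            return false;
        }
    }
}
```

Returning a boolean instead of throwing is a design choice for callers that may be handed an arbitrary XMLReader and want to degrade gracefully when a validator.nu feature is unavailable.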
-
setProperty
public void setProperty(java.lang.String name, java.lang.Object value) throws org.xml.sax.SAXNotRecognizedException, org.xml.sax.SAXNotSupportedException
Sets a non-boolean property without having to use non-XMLReader setters directly.
| Property URI | Setter |
| --- | --- |
| http://xml.org/sax/properties/lexical-handler | setLexicalHandler |
| http://validator.nu/properties/content-space-policy | setContentSpacePolicy |
| http://validator.nu/properties/content-non-xml-char-policy | setContentNonXmlCharPolicy |
| http://validator.nu/properties/comment-policy | setCommentPolicy |
| http://validator.nu/properties/xmlns-policy | setXmlnsPolicy |
| http://validator.nu/properties/name-policy | setNamePolicy |
| http://validator.nu/properties/streamability-violation-policy | setStreamabilityViolationPolicy |
| http://validator.nu/properties/document-mode-handler | setDocumentModeHandler |
| http://validator.nu/properties/doctype-expectation | setDoctypeExpectation |
| http://validator.nu/properties/xml-policy | setXmlPolicy |
- Specified by:
setProperty
in interface org.xml.sax.XMLReader
- Throws:
org.xml.sax.SAXNotRecognizedException
org.xml.sax.SAXNotSupportedException
- See Also:
XMLReader.setProperty(java.lang.String, java.lang.Object)
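As one illustration of the property mapping, a LexicalHandler can be registered through the standard lexical-handler property URI instead of calling setLexicalHandler directly. This is a minimal sketch using only the JDK's org.xml.sax classes; CommentCollector and collectComments are hypothetical names, and the reader argument is assumed to be an HtmlParser or any other XMLReader that supports the property.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

/** Hypothetical helper: collects comments via the lexical-handler property. */
final class CommentCollector {

    static final String LEXICAL_HANDLER =
            "http://xml.org/sax/properties/lexical-handler";

    private CommentCollector() {
    }

    /** Parses the markup and returns the text of each reported comment. */
    static List<String> collectComments(XMLReader reader, String markup)
            throws IOException, SAXException {
        final List<String> comments = new ArrayList<>();
        // DefaultHandler2 implements both ContentHandler and LexicalHandler.
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void comment(char[] ch, int start, int length) {
                comments.add(new String(ch, start, length));
            }
        };
        reader.setContentHandler(handler);
        reader.setProperty(LEXICAL_HANDLER, handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return comments;
    }
}
```

The same handler could also override startDTD to observe the doctype, which per the class description is only reported after setReportingDoctype(true) has been called.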
-
isCheckingNormalization
public boolean isCheckingNormalization()
Indicates whether NFC normalization of source is being checked.
- Returns:
true if NFC normalization of source is being checked.
- See Also:
nu.validator.htmlparser.impl.Tokenizer#isCheckingNormalization()
-
setCheckingNormalization
public void setCheckingNormalization(boolean enable)
Toggles the checking of the NFC normalization of source.
- Parameters:
enable
- true to check normalization
- See Also:
nu.validator.htmlparser.impl.Tokenizer#setCheckingNormalization(boolean)
-
setCommentPolicy
public void setCommentPolicy(XmlViolationPolicy commentPolicy)
Sets the policy for consecutive hyphens in comments.
- Parameters:
commentPolicy
- the policy
- See Also:
Tokenizer.setCommentPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)
-
setContentNonXmlCharPolicy
public void setContentNonXmlCharPolicy(XmlViolationPolicy contentNonXmlCharPolicy)
Sets the policy for non-XML characters except white space.
- Parameters:
contentNonXmlCharPolicy
- the policy
- See Also:
Tokenizer.setContentNonXmlCharPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)
-
setContentSpacePolicy
public void setContentSpacePolicy(XmlViolationPolicy contentSpacePolicy)
Sets the policy for non-XML white space.
- Parameters:
contentSpacePolicy
- the policy
- See Also:
Tokenizer.setContentSpacePolicy(nu.validator.htmlparser.common.XmlViolationPolicy)
-
isScriptingEnabled
public boolean isScriptingEnabled()
Whether the parser considers scripting to be enabled for noscript treatment.
- Returns:
true if enabled
- See Also:
TreeBuilder.isScriptingEnabled()
-
setScriptingEnabled
public void setScriptingEnabled(boolean scriptingEnabled)
Sets whether the parser considers scripting to be enabled for noscript treatment.
- Parameters:
scriptingEnabled
- true to enable
- See Also:
TreeBuilder.setScriptingEnabled(boolean)
-
getDoctypeExpectation
public DoctypeExpectation getDoctypeExpectation()
Returns the doctype expectation.
- Returns:
- the doctypeExpectation
-
setDoctypeExpectation
public void setDoctypeExpectation(DoctypeExpectation doctypeExpectation)
Sets the doctype expectation.
- Parameters:
doctypeExpectation
- the doctypeExpectation to set
- See Also:
TreeBuilder.setDoctypeExpectation(nu.validator.htmlparser.common.DoctypeExpectation)
-
getDocumentModeHandler
public DocumentModeHandler getDocumentModeHandler()
Returns the document mode handler.
- Returns:
- the documentModeHandler
-
setDocumentModeHandler
public void setDocumentModeHandler(DocumentModeHandler documentModeHandler)
Sets the document mode handler.
- Parameters:
documentModeHandler
- the documentModeHandler to set
- See Also:
TreeBuilder.setDocumentModeHandler(nu.validator.htmlparser.common.DocumentModeHandler)
-
getStreamabilityViolationPolicy
public XmlViolationPolicy getStreamabilityViolationPolicy()
Returns the streamabilityViolationPolicy.
- Returns:
- the streamabilityViolationPolicy
-
setStreamabilityViolationPolicy
public void setStreamabilityViolationPolicy(XmlViolationPolicy streamabilityViolationPolicy)
Sets the streamabilityViolationPolicy.
- Parameters:
streamabilityViolationPolicy
- the streamabilityViolationPolicy to set
-
setHtml4ModeCompatibleWithXhtml1Schemata
public void setHtml4ModeCompatibleWithXhtml1Schemata(boolean html4ModeCompatibleWithXhtml1Schemata)
Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
- Parameters:
html4ModeCompatibleWithXhtml1Schemata
-
-
getDocumentLocator
public org.xml.sax.Locator getDocumentLocator()
Returns the Locator during parse.
- Returns:
- the Locator
-
isHtml4ModeCompatibleWithXhtml1Schemata
public boolean isHtml4ModeCompatibleWithXhtml1Schemata()
Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
- Returns:
- the html4ModeCompatibleWithXhtml1Schemata
-
setMappingLangToXmlLang
public void setMappingLangToXmlLang(boolean mappingLangToXmlLang)
Whether lang is mapped to xml:lang.
- Parameters:
mappingLangToXmlLang
-
- See Also:
Tokenizer.setMappingLangToXmlLang(boolean)
-
isMappingLangToXmlLang
public boolean isMappingLangToXmlLang()
Whether lang is mapped to xml:lang.
- Returns:
- the mappingLangToXmlLang
-
setXmlnsPolicy
public void setXmlnsPolicy(XmlViolationPolicy xmlnsPolicy)
Whether the xmlns attribute on the root element is passed through. (FATAL not allowed.)
- Parameters:
xmlnsPolicy
-
- See Also:
Tokenizer.setXmlnsPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)
-
getXmlnsPolicy
public XmlViolationPolicy getXmlnsPolicy()
Returns the xmlnsPolicy.
- Returns:
- the xmlnsPolicy
-
getLexicalHandler
public org.xml.sax.ext.LexicalHandler getLexicalHandler()
Returns the lexicalHandler.
- Returns:
- the lexicalHandler
-
getCommentPolicy
public XmlViolationPolicy getCommentPolicy()
Returns the commentPolicy.
- Returns:
- the commentPolicy
-
getContentNonXmlCharPolicy
public XmlViolationPolicy getContentNonXmlCharPolicy()
Returns the contentNonXmlCharPolicy.
- Returns:
- the contentNonXmlCharPolicy
-
getContentSpacePolicy
public XmlViolationPolicy getContentSpacePolicy()
Returns the contentSpacePolicy.
- Returns:
- the contentSpacePolicy
-
setReportingDoctype
public void setReportingDoctype(boolean reportingDoctype)
- Parameters:
reportingDoctype
-
- See Also:
TreeBuilder.setReportingDoctype(boolean)
-
isReportingDoctype
public boolean isReportingDoctype()
Returns the reportingDoctype.
- Returns:
- the reportingDoctype
-
setErrorProfile
public void setErrorProfile(java.util.HashMap<java.lang.String,java.lang.String> errorProfileMap)
- Parameters:
errorProfileMap
-
- See Also:
nu.validator.htmlparser.impl.errorReportingTokenizer#setErrorProfile(set)
-
setNamePolicy
public void setNamePolicy(XmlViolationPolicy namePolicy)
The policy for non-NCName element and attribute names.
- Parameters:
namePolicy
-
- See Also:
Tokenizer.setNamePolicy(nu.validator.htmlparser.common.XmlViolationPolicy)
-
setHeuristics
public void setHeuristics(Heuristics heuristics)
Sets the encoding sniffing heuristics.
- Parameters:
heuristics
- the heuristics to set
- See Also:
nu.validator.htmlparser.impl.Tokenizer#setHeuristics(nu.validator.htmlparser.common.Heuristics)
-
getHeuristics
public Heuristics getHeuristics()
-
setXmlPolicy
public void setXmlPolicy(XmlViolationPolicy xmlPolicy)
This is a catch-all convenience method for setting name, xmlns, content space, content non-XML char and comment policies in one go. This does not affect the streamability policy or doctype reporting.
- Parameters:
xmlPolicy
-
-
getNamePolicy
public XmlViolationPolicy getNamePolicy()
The policy for non-NCName element and attribute names.
- Returns:
- the namePolicy
-
setBogusXmlnsPolicy
public void setBogusXmlnsPolicy(XmlViolationPolicy bogusXmlnsPolicy)
Deprecated. Does nothing.
-
getBogusXmlnsPolicy
public XmlViolationPolicy getBogusXmlnsPolicy()
Deprecated. Returns XmlViolationPolicy.ALTER_INFOSET.
- Returns:
XmlViolationPolicy.ALTER_INFOSET
-
addCharacterHandler
public void addCharacterHandler(CharacterHandler characterHandler)
-
-