public class FilterHtml
extends java.lang.Object

| Constructor and Description |
|---|
| FilterHtml()<br>Create a new instance of FilterHtml. |

| Modifier and Type | Method and Description |
|---|---|
| java.lang.String | getAttributeContent(java.lang.String tagName, java.lang.String attributeName, java.lang.String theText)<br>Get the value of the attribute that is part of the first occurrence of the element with the specified tag name. |
| java.lang.String | getReadableContent(java.lang.String theText, boolean removeLists)<br>Get only the readable content of the html document. |
| java.lang.String | getSectionContent(java.lang.String startTag, java.lang.String endTag, java.lang.String theText, boolean removeOther)<br>Get the section of text that starts with the first occurrence of the start tag and ends with the first occurrence of the end tag. |
| boolean | hasSection(java.lang.String tagName, java.lang.String theText)<br>Return true if the text contains a section that starts with the specified tag. |
| static int | quoteIndex(java.lang.String theText)<br>Return the first index of a quote character in the text. |
| static int | quoteIndex(java.lang.String theText, int startFrom)<br>Return the first index of a quote character in the text. |
| java.lang.String | removeCommented(java.lang.String theText) |
| java.lang.String | removeHead(java.lang.String theText)<br>Remove the head section of the document. |
| java.lang.String | removeHtmlCoded(java.lang.String theText)<br>Remove the html coded content that starts with & and ends with ;. |
| java.lang.String | removeHtmlFormatting(java.lang.String theText)<br>Remove the html formatting tags only. |
| java.lang.String | removeTagsAndSymbols(java.lang.String theText)<br>Remove any existing XML or HTML tags, plus any symbols from the text. |
| java.lang.String | removeTagsHtml(java.lang.String theText)<br>Remove any existing XML or HTML tags from the text. |
| java.lang.String | removeTagsInsideTag(java.lang.String theText, java.lang.String insideTag)<br>Remove all tags inside another specified tag. |
| java.lang.String | removeXmlTags(java.lang.String theText)<br>Remove the xml element tags from the text and keep only the xml element content. |
| void | resetForParsing()<br>Reset intermediary tags for the start of parsing. |
| java.lang.String | trimBeforeAfterNewlines(java.lang.String theText)<br>Trim the text just before and after each new line character, to remove the whitespace. |

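
To make a few of the operations in the method summary more concrete, the following standalone sketch approximates tag removal, removal of & ... ; coded content, and trimming around newlines with regular expressions. It is not the FilterHtml implementation: the class name FilterSketch, its method names, the three patterns, and the choice to replace each tag with a space are all assumptions made for illustration only.

```java
import java.util.regex.Pattern;

/**
 * Hypothetical stand-in, NOT the FilterHtml source: regex-based
 * approximations of three operations from the method summary.
 */
final class FilterSketch {
    // Anything that looks like a tag: '<', then non-'>' characters, then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // Coded content such as &amp; or &#160;: starts with '&', ends with ';'.
    private static final Pattern ENTITY = Pattern.compile("&[#A-Za-z0-9]+;");
    // Spaces and tabs directly before or after a newline character.
    private static final Pattern AROUND_NEWLINE = Pattern.compile("[ \\t]*\\n[ \\t]*");

    /** Assumption: each tag is replaced by a single space. */
    static String stripTags(String text) {
        return TAG.matcher(text).replaceAll(" ");
    }

    /** Assumption: coded content is dropped rather than decoded. */
    static String stripEntities(String text) {
        return ENTITY.matcher(text).replaceAll("");
    }

    /** Removes spaces and tabs on both sides of every newline. */
    static String trimAroundNewlines(String text) {
        return AROUND_NEWLINE.matcher(text).replaceAll("\n");
    }

    public static void main(String[] args) {
        String html = "<p>Fish &amp; chips</p>  \n   <p>Tea</p>";
        // Chain the three steps: tags first, then coded content, then whitespace.
        String text = trimAroundNewlines(stripEntities(stripTags(html)));
        System.out.println(text);
    }
}
```

A regular expression cannot handle every HTML edge case (for example a > inside a quoted attribute value), so this should be read only as a picture of the intended input and output, not as a substitute for the class documented here.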
public void resetForParsing()
public boolean hasSection(java.lang.String tagName,
java.lang.String theText)
Parameters:
tagName - the name of the section tag.
theText - the html text to parse.

public java.lang.String getAttributeContent(java.lang.String tagName,
java.lang.String attributeName,
java.lang.String theText)
Parameters:
tagName - the name of the element.
attributeName - the name of the attribute.
theText - the html text to parse.

public java.lang.String getSectionContent(java.lang.String startTag,
java.lang.String endTag,
java.lang.String theText,
boolean removeOther)
throws java.lang.Exception
Parameters:
startTag - the name of the start tag.
endTag - the name of the end tag.
theText - the html text to parse.
removeOther - if true, return only the text content and not any tag information; otherwise return everything.
Throws:
java.lang.Exception - any error.

public java.lang.String removeTagsHtml(java.lang.String theText)
throws java.lang.Exception
Parameters:
theText - the text to format.
Throws:
java.lang.Exception - any error.

public java.lang.String removeTagsAndSymbols(java.lang.String theText)
throws java.lang.Exception
Parameters:
theText - the text to format.
Throws:
java.lang.Exception - any error.

public java.lang.String getReadableContent(java.lang.String theText,
boolean removeLists)
throws java.lang.Exception
Parameters:
theText - the text representation of the html document.
removeLists - if true, remove ordered and unordered lists. These might relate to menus, for example. The default is true.
Throws:
java.lang.Exception - any error.

public java.lang.String trimBeforeAfterNewlines(java.lang.String theText)
throws java.lang.Exception
Parameters:
theText - the text to trim.
Throws:
java.lang.Exception - any error.

public java.lang.String removeHead(java.lang.String theText)
throws java.lang.Exception
Parameters:
theText - the text representation of the html document.
Returns:
… </head> is removed.
Throws:
java.lang.Exception - any error.

public java.lang.String removeHtmlCoded(java.lang.String theText)
throws java.lang.Exception
Remove the html coded content that starts with & and ends with ;.
Parameters:
theText - the text representation of the html document.
Throws:
java.lang.Exception - any error.

public java.lang.String removeHtmlFormatting(java.lang.String theText)
throws java.lang.Exception
Parameters:
theText - the text representation of the html document.
Throws:
java.lang.Exception - any error.

public java.lang.String removeTagsInsideTag(java.lang.String theText,
java.lang.String insideTag)
throws java.lang.Exception
Parameters:
theText - the text representation of the html document.
insideTag - the tag to find and remove sub-tags from.
Throws:
java.lang.Exception - any error.

public java.lang.String removeCommented(java.lang.String theText)
… HtmlConst.COMMSTART and COMMEND. Each tag is replaced by a space. This method does not require a valid XML document.
Parameters:
theText - the text representation of the XML document.

public java.lang.String removeXmlTags(java.lang.String theText)
Parameters:
theText - the text representation of the XML document.

public static int quoteIndex(java.lang.String theText)
Parameters:
theText - the text to check.

public static int quoteIndex(java.lang.String theText,
int startFrom)
Parameters:
theText - the text to check.
startFrom - the index to start the search from.
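
The two quoteIndex overloads can be pictured with the standalone sketch below. It is one plausible reading, not the FilterHtml source. The class name QuoteIndexSketch is hypothetical, and two behaviours the documentation does not state are assumed: both " and ' count as quote characters, and -1 is returned when no quote is found (mirroring String.indexOf).

```java
/**
 * Hypothetical sketch, NOT the FilterHtml source: one plausible reading of
 * the two quoteIndex overloads.
 */
final class QuoteIndexSketch {
    /** Searches from the beginning of the text. */
    static int quoteIndex(String text) {
        return quoteIndex(text, 0);
    }

    /**
     * Assumptions: a double or single quote counts as a quote character,
     * and -1 signals that none was found at or after startFrom.
     */
    static int quoteIndex(String text, int startFrom) {
        // Clamp negative start positions to 0, as String.indexOf does.
        for (int i = Math.max(0, startFrom); i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"' || c == '\'') {
                return i;
            }
        }
        return -1;
    }
}
```

Under these assumptions, for the text href="a.html" the one-argument call returns 5 (the opening quote) and a second call with startFrom set to 6 returns 12 (the closing quote), so an attribute value could be taken as the substring between the two indexes.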