public class SourceText
extends java.lang.Object
FilterHtml class that it uses to do the parsing.

| Constructor and Description |
|---|
| `SourceText()` Create a new instance of SourceText. |

| Modifier and Type | Method and Description |
|---|---|
| `java.util.ArrayList<java.net.URL>` | `getHyperLinks()` Return the list of parsed hyperlinks. |
| `java.lang.String` | `parseHTML(java.lang.String htmlSource, SourceConfig parseConfig)` Parse the HTML source as defined by parseConfig. |
| `java.lang.String` | `parseHTML(java.net.URL htmlUri, SourceConfig parseConfig)` Read the source URL and parse the HTML source as defined by parseConfig. |
| `java.util.ArrayList<java.net.URL>` | `parseHyperLinks(java.lang.String theText)` Parse all hyperlinks in the text if any exist and keep all found. |
| `java.util.ArrayList<java.net.URL>` | `parseHyperLinks(java.lang.String theText, java.util.ArrayList<java.lang.String> toKeep, java.lang.String formatType, boolean resetHyperList)` Parse all hyperlinks in the text if any exist, but only keep certain types of link. |
| `java.lang.String` | `readableContent(java.lang.String htmlSource)` Get only the readable content of the HTML document. |
| `java.lang.String` | `readableContent(java.net.URL htmlUri)` Get only the readable content of the HTML document. |

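As a rough illustration of two ideas from the summary above, extracting hyperlinks while keeping only certain link types and reducing an HTML document to its readable text, the following self-contained Java sketch uses plain regular expressions. It is not the SourceText or FilterHtml implementation: the class name `LinkSketch`, the methods `parseLinks` and `readableText`, and the type labels `"http"` and `"image"` are hypothetical stand-ins chosen for this example, and a real HTML parser would handle many cases that these patterns miss.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration only; not the SourceText or FilterHtml code.
class LinkSketch {
    // Assumed link-type labels, standing in for constants such as HtmlConst.HTTPLINKTYPE.
    static final String HTTP_LINK = "http";
    static final String IMAGE_LINK = "image";

    // Captures the tag name (a or img) and the quoted href/src value.
    private static final Pattern LINK = Pattern.compile(
            "<(a|img)\\b[^>]*?\\b(?:href|src)\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Collect links from the HTML text; a null or empty toKeep keeps every link type.
    static List<URL> parseLinks(String html, List<String> toKeep) throws MalformedURLException {
        List<URL> found = new ArrayList<>();
        boolean keepAll = (toKeep == null || toKeep.isEmpty());
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            String type = m.group(1).equalsIgnoreCase("img") ? IMAGE_LINK : HTTP_LINK;
            if (keepAll || toKeep.contains(type)) {
                // new URL(...) rejects values without a known protocol, e.g. relative links.
                found.add(new URL(m.group(2)));
            }
        }
        return found;
    }

    // Very simplified "readable content": drop script/style blocks, then all tags.
    static String readableText(String html) {
        String noScripts = html.replaceAll("(?is)<(script|style)\\b.*?</\\1\\s*>", " ");
        String noTags = noScripts.replaceAll("(?s)<[^>]*>", " ");
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws MalformedURLException {
        String html = "<p>See <a href=\"https://example.com/page\">this page</a>"
                + "<img src=\"https://example.com/logo.png\"></p>";
        System.out.println(parseLinks(html, null));                      // all link types
        System.out.println(parseLinks(html, Arrays.asList(IMAGE_LINK))); // image links only
        System.out.println(readableText(html));                          // text without markup
    }
}
```

Passing `null` or an empty list mirrors the documented behaviour of `toKeep`, where all link types are kept. The sketch returns a fresh list on every call, so it does not model the shared hyperLinks list or the `resetHyperList` flag.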
public java.lang.String parseHTML(java.net.URL htmlUri,
SourceConfig parseConfig)
throws java.lang.Exception

Read the source URL and parse the HTML source as defined by parseConfig. To automatically remove all of the markup, you can try readableContent instead.

Parameters:
htmlUri - URI of the HTML file.
parseConfig - defines what parsing should be done. Can be null when no text removal is performed.

Throws:
java.lang.Exception - any error.

public java.lang.String parseHTML(java.lang.String htmlSource,
SourceConfig parseConfig)
throws java.lang.Exception

Parse the HTML source as defined by parseConfig. To automatically remove all of the markup, you can try readableContent instead.

Parameters:
htmlSource - the source text in HTML format.
parseConfig - defines what parsing should be done. Can be null for default settings.

Throws:
java.lang.Exception - any error.

public java.lang.String readableContent(java.net.URL htmlUri)
throws java.lang.Exception

Get only the readable content of the HTML document.

Parameters:
htmlUri - URI of the HTML file.

Throws:
java.lang.Exception - any error.

public java.lang.String readableContent(java.lang.String htmlSource)
throws java.lang.Exception

Get only the readable content of the HTML document.

Parameters:
htmlSource - the source text in HTML format.

Throws:
java.lang.Exception - any error.

public java.util.ArrayList<java.net.URL> parseHyperLinks(java.lang.String theText)
throws java.net.MalformedURLException,
java.lang.Exception

Parse all hyperlinks in the text if any exist and keep all found. The links are stored in the hyperLinks list, which is reset first, and an HTML format is assumed.

Parameters:
theText - the text with the hyperlink descriptions.

Throws:
java.net.MalformedURLException - hyperlink cannot be created.
java.lang.Exception - any error.

public java.util.ArrayList<java.net.URL> parseHyperLinks(java.lang.String theText,
java.util.ArrayList<java.lang.String> toKeep,
java.lang.String formatType,
boolean resetHyperList)
throws java.net.MalformedURLException,
java.lang.Exception

Parse all hyperlinks in the text if any exist, but only keep certain types of link. The links are stored in the hyperLinks list.

Parameters:
theText - the text with the hyperlink descriptions.
toKeep - a list of descriptions of link types to keep. If null or empty, then all link types are kept. This can currently be HtmlConst.HTTPLINKTYPE or IMAGELINKTYPE.
formatType - the format of the text. Can be VendorEngine.HTML or JSON.
resetHyperList - true to reset the hyperlink list first.

Throws:
java.net.MalformedURLException - hyperlink cannot be created.
java.lang.Exception - any error.

public java.util.ArrayList<java.net.URL> getHyperLinks()