Editor Note: HtmlSieve will be released shortly on GitHub with a mirror of the documentation created here.
What is it good for?
HtmlSieve is a lexer (not a full-fledged stateful parser!) that allows streaming processing of HTML/XML content. The lexer distinguishes between:
- Text (including CDATA sections)
- Tags (open or close), including their parsed attributes
- Comments
Overly Simple Example: Counting Tags
A Tag Listener
The tag listener receives events as the sieve encounters tags while processing its input.
package htmlparser.example;

import org.htmlparser_fork.Tag;
import com.centercomp.htmlsieve.TagListener;

public class TagCountingListener implements TagListener {
    public int endTags = 0;
    public int startTags = 0;

    /**
     * Receives notification whenever an 'end' tag
     * such as </body> is encountered in the lexing
     * stream.
     */
    public Tag onEndTag(final Tag tag) {
        endTags++;
        return tag;
    }

    /**
     * Receives notification whenever a 'start' tag
     * such as <body> is encountered in the lexing
     * stream. Start tags also include empty XHTML
     * tags and void HTML tags such as <br> or <br/>.
     */
    public Tag onStartTag(final Tag tag) {
        startTags++;
        return tag;
    }
}
Processing
As you can see, the key to working with the sieve is the streaming listeners.

import java.io.IOException;
import java.io.StringWriter;
import org.junit.*;
import static org.junit.Assert.*;

@Test
public void testTagCounting() throws IOException {
    // This is where we will send output.
    StringWriter stringWriter = new StringWriter();
    TagCountingListener tagListener = new TagCountingListener();

    // Construct the sieve
    HtmlSieve sieve = new HtmlSieve()
        // Tell it where to send output after processing
        .setWriter(stringWriter)
        // Add the tag listener
        .addTagListener(tagListener);

    // Now send the sieve some data in bits:
    String test1 = "<html><body><p";
    String test2 = ">This is a test</p></body></html>";

    // Make sure nothing has been processed yet.
    assertEquals(0, tagListener.startTags);
    assertEquals(0, tagListener.endTags);

    // The lexer will handle what it can. In this case, the <p> tag
    // is incomplete, so it won't be processed yet.
    sieve.write(test1);

    // Processed up to, but not including, the 'p' tag.
    assertEquals(2, tagListener.startTags);
    assertEquals(0, tagListener.endTags);

    sieve.write(test2);
    sieve.close();
    assertEquals(3, tagListener.startTags);
    assertEquals(3, tagListener.endTags);

    // Make sure content hasn't been modified by the processing.
    assertEquals(test1 + test2, stringWriter.toString());
}
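The test depends on the lexer holding back the incomplete "<p" until its closing ">" arrives in a later write. The standalone sketch below illustrates one way that kind of buffering could work. It is not HtmlSieve's implementation: the class ChunkBufferingTagCounter and its methods are hypothetical names invented for this illustration, and it deliberately ignores comments, CDATA sections, and ">" characters inside attribute values.

```java
/**
 * Hypothetical illustration only, not part of HtmlSieve.
 * Counts start and end tags across arbitrarily split chunks by
 * keeping any unfinished tag in a pending buffer.
 */
class ChunkBufferingTagCounter {
    public int startTags = 0;
    public int endTags = 0;

    // Input that cannot be classified yet, e.g. "<p" without its '>'.
    private final StringBuilder pending = new StringBuilder();
    // Everything that has been processed, passed through unchanged.
    private final StringBuilder out = new StringBuilder();

    /** Accepts the next chunk and processes every complete tag or text run. */
    public void write(String chunk) {
        pending.append(chunk);
        int pos = 0;
        while (pos < pending.length()) {
            int open = pending.indexOf("<", pos);
            if (open < 0) {
                // Only text remains: emit it all.
                out.append(pending, pos, pending.length());
                pos = pending.length();
                break;
            }
            // Emit the text that precedes the tag.
            out.append(pending, pos, open);
            int close = pending.indexOf(">", open);
            if (close < 0) {
                // Incomplete tag: keep it buffered until more input arrives.
                pos = open;
                break;
            }
            String tag = pending.substring(open, close + 1);
            if (tag.startsWith("</")) {
                endTags++;
            } else {
                startTags++;
            }
            out.append(tag);
            pos = close + 1;
        }
        // Drop what was consumed; an unfinished tag stays behind.
        pending.delete(0, pos);
    }

    /** Flushes whatever is still buffered, complete or not. */
    public void close() {
        out.append(pending);
        pending.setLength(0);
    }

    public String output() {
        return out.toString();
    }
}
```

Feeding this sketch the same two strings as the test gives two start tags after the first write and three start and three end tags after the second, with the output equal to the concatenated input.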
What can listeners do?
- Process comments with com.centercomp.htmlsieve.CommentListener. A useful example of this is com.centercomp.htmlsieve.filters.CommentStripper.
- Process text inside the HTML with com.centercomp.htmlsieve.TextListener. A useful example of this would be a profanity filter for a form post, or even possibly an inline emoticon replacer.
- Rewrite HTML tags with com.centercomp.htmlsieve.TagListener. We use this extensively in the Chariot Framework for the AntiCSRF processing in Chariot Command. We also plan to use it in Chariot FuSOR.
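To make the profanity filter idea concrete, here is a minimal sketch of a listener that masks listed words in a run of text. The real TextListener signature is not shown in this post, so the interface TextListenerSketch and its onText method are assumptions made purely for illustration, as is the WordMaskingListener class.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in; the real TextListener API may look different. */
interface TextListenerSketch {
    String onText(String text);
}

/** Replaces each blocked word with asterisks of the same length. */
class WordMaskingListener implements TextListenerSketch {
    private final Pattern blocked;

    WordMaskingListener(List<String> words) {
        // Build a case-insensitive, whole-word alternation such as \b(?:a|b)\b.
        StringBuilder alternation = new StringBuilder();
        for (String word : words) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(word));
        }
        blocked = Pattern.compile("\\b(?:" + alternation + ")\\b",
                Pattern.CASE_INSENSITIVE);
    }

    @Override
    public String onText(String text) {
        Matcher m = blocked.matcher(text);
        StringBuffer masked = new StringBuffer();
        while (m.find()) {
            // Swap every character of the matched word for '*'.
            m.appendReplacement(masked, m.group().replaceAll(".", "*"));
        }
        m.appendTail(masked);
        return masked.toString();
    }
}
```

Because the pattern matches whole words only, a blocked word that merely appears inside a longer word is left alone.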