Editor Note: HtmlSieve will be released shortly on GitHub with a mirror of the documentation created here.
What is it good for?
HtmlSieve is a lexer (not a full-fledged stateful parser!) that allows streaming processing of HTML/XML content. The lexer distinguishes between:
- Text (including CDATA sections)
- Tags (open or close), including their parsed attributes
- Comments
Overly Simple Example: Counting Tags
A Tag Listener
The tag listener receives events as the sieve encounters tags while processing its input.
package htmlparser.example;

import org.htmlparser_fork.Tag;
import com.centercomp.htmlsieve.TagListener;

public class TagCountingListener implements TagListener {

    public int endTags = 0;
    public int startTags = 0;

    /**
     * Receives notification whenever an 'end' tag
     * such as </body> is encountered in the lexing
     * stream.
     */
    public Tag onEndTag(final Tag tag) {
        endTags++;
        return tag;
    }

    /**
     * Receives notification whenever a 'start' tag
     * such as <body> is encountered in the lexing
     * stream. Start tags also include empty XHTML
     * tags and void HTML tags such as <br> or <br/>.
     */
    public Tag onStartTag(final Tag tag) {
        startTags++;
        return tag;
    }
}
Processing
import java.io.IOException;
import java.io.StringWriter;

import org.junit.*;
import static org.junit.Assert.*;

import htmlparser.example.TagCountingListener;
// Assumed location of HtmlSieve, matching the listener package named in this post.
import com.centercomp.htmlsieve.HtmlSieve;

// The class name is illustrative; any JUnit test class will do.
public class TagCountingTest {

    @Test
    public void testTagCounting() throws IOException {
        // This is where we will send output.
        StringWriter stringWriter = new StringWriter();
        TagCountingListener tagListener = new TagCountingListener();

        // Construct the sieve.
        HtmlSieve sieve = new HtmlSieve()
            // Tell it where to send output after processing.
            .setWriter(stringWriter)
            // Add the tag listener.
            .addTagListener(tagListener);

        // Now send the sieve some data in pieces.
        String test1 = "<html><body><p";
        String test2 = ">This is a test</p></body></html>";

        // Make sure nothing has been processed yet.
        assertEquals(0, tagListener.startTags);
        assertEquals(0, tagListener.endTags);

        // The lexer handles what it can. Here the <p> tag is
        // incomplete, so it is not processed yet.
        sieve.write(test1);

        // Processed up to, but not including, the 'p' tag.
        assertEquals(2, tagListener.startTags);
        assertEquals(0, tagListener.endTags);

        sieve.write(test2);
        sieve.close();
        assertEquals(3, tagListener.startTags);
        assertEquals(3, tagListener.endTags);

        // Make sure content hasn't been modified by the processing.
        assertEquals(test1 + test2, stringWriter.toString());
    }
}

As you can see, the key to working with the sieve is the streaming listeners.
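To make the chunk-by-chunk behavior in the test easier to picture, here is a toy model of a streaming tag counter that holds back an incomplete tag until the rest arrives. It is only a sketch and not how HtmlSieve is implemented: the class ChunkedTagCounter and its methods are hypothetical, and it ignores comments, CDATA sections, and '>' characters inside attribute values.

```java
/**
 * Toy model of streaming lexing: input arrives in chunks, and a tag
 * is counted only once its closing '>' has been seen.
 * Hypothetical names throughout; this is not HtmlSieve code.
 */
class ChunkedTagCounter {
    private final StringBuilder pending = new StringBuilder();
    int startTags = 0;
    int endTags = 0;

    /** Accepts the next chunk and counts every complete tag now available. */
    void write(String chunk) {
        pending.append(chunk);
        int pos = 0;
        while (true) {
            int open = pending.indexOf("<", pos);
            if (open < 0) {
                // Only text remains, so nothing needs to be held back.
                pos = pending.length();
                break;
            }
            int close = pending.indexOf(">", open);
            if (close < 0) {
                // Incomplete tag: keep it buffered until more data arrives.
                pos = open;
                break;
            }
            if (pending.charAt(open + 1) == '/') {
                endTags++;
            } else {
                startTags++;
            }
            pos = close + 1;
        }
        // Drop everything that has been fully handled.
        pending.delete(0, pos);
    }

    /** Returns the input that is still waiting for more data. */
    String buffered() {
        return pending.toString();
    }
}
```

Feeding this class the same two strings as the test above counts two start tags after the first write, with "<p" left in the buffer, and three start and three end tags after the second.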
What can listeners do?
- Process comments with com.centercomp.htmlsieve.CommentListener. A useful example of this is com.centercomp.htmlsieve.filters.CommentStripper.
- Process text inside the HTML with com.centercomp.htmlsieve.TextListener. A useful example of this would be a profanity filter for a form post, or even possibly an inline emoticon replacer.
- Rewrite HTML tags with com.centercomp.htmlsieve.TagListener. We use this extensively in the Chariot Framework for the AntiCSRF processing in Chariot Command. We also plan to use it in Chariot FuSOR.
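As a rough illustration of the profanity-filter idea, the sketch below masks listed words in a run of text. The real TextListener interface is not shown in this post, so the code defines its own stand-in, SketchTextListener, with an assumed onText method; both names and the signature are hypothetical and would need to be adapted to the actual HtmlSieve API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a text listener; not the real HtmlSieve interface. */
interface SketchTextListener {
    /** Receives a run of text and returns the text to emit in its place. */
    String onText(String text);
}

/** Masks each listed word with asterisks, ignoring case. */
class WordMaskingListener implements SketchTextListener {
    private final List<Pattern> patterns = new ArrayList<>();

    WordMaskingListener(String... words) {
        for (String word : words) {
            // Match the word only when it stands alone, not inside a longer word.
            patterns.add(Pattern.compile(
                    "\\b" + Pattern.quote(word) + "\\b",
                    Pattern.CASE_INSENSITIVE));
        }
    }

    @Override
    public String onText(String text) {
        String result = text;
        for (Pattern pattern : patterns) {
            result = pattern.matcher(result).replaceAll("****");
        }
        return result;
    }
}
```

Because the listener only ever sees text runs, markup such as tag names and attribute values would pass through untouched, which is the main appeal of filtering at the lexer level rather than running a regex over the whole document.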