Friday, April 1, 2011

HtmlSieve Five Minute Introduction

Editor's Note: HtmlSieve will be released shortly on GitHub, along with a mirror of this documentation.

What is it good for?

HtmlSieve is a lexer (not a full-fledged stateful parser!) that allows streaming processing of HTML/XML content. The lexer distinguishes between:

  • Text (including CDATA sections)
  • Tags (open or close), including their parsed attributes
  • Comments
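
To make the three categories concrete, here is a rough sketch of how a lexer might sort a piece of markup into text, tags, and comments. This is not HtmlSieve's implementation: `TokenKindSketch` and `classify` are names made up for this illustration, and the sketch skips edge cases such as a `>` inside an attribute value.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a toy classifier, not part of HtmlSieve.
class TokenKindSketch {
    enum Kind { TEXT, TAG, COMMENT }

    /** Splits markup into "KIND:raw" entries, treating CDATA sections as text. */
    static List<String> classify(String html) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            int end;
            Kind kind;
            if (html.startsWith("<!--", i)) {
                // Comment: runs to the closing "-->" (or end of input).
                int close = html.indexOf("-->", i + 4);
                end = close < 0 ? html.length() : close + 3;
                kind = Kind.COMMENT;
            } else if (html.startsWith("<![CDATA[", i)) {
                // CDATA: reported as text, even if it contains '<'.
                int close = html.indexOf("]]>", i + 9);
                end = close < 0 ? html.length() : close + 3;
                kind = Kind.TEXT;
            } else if (html.charAt(i) == '<') {
                // Open or close tag: runs to the next '>'.
                int close = html.indexOf('>', i);
                end = close < 0 ? html.length() : close + 1;
                kind = Kind.TAG;
            } else {
                // Plain text: runs to the next '<'.
                int next = html.indexOf('<', i);
                end = next < 0 ? html.length() : next;
                kind = Kind.TEXT;
            }
            tokens.add(kind + ":" + html.substring(i, end));
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : classify("<p>Hi<!-- note --><![CDATA[a<b]]></p>")) {
            System.out.println(token);
        }
    }
}
```

Running `main` lists five tokens: two tags, one comment, and two text runs, the second of which is the CDATA section.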

 

Overly Simple Example: Counting Tags

A Tag Listener

The tag listener receives events as the sieve encounters tags while processing markup.

	package htmlparser.example;

	import org.htmlparser_fork.Tag;
	import com.centercomp.htmlsieve.TagListener;

	public class TagCountingListener implements TagListener {
	    // Number of end tags seen so far.
	    public int endTags = 0;

	    // Number of start tags seen so far.
	    public int startTags = 0;

	    /**
	     * Receives notification whenever an 'end' tag
	     * such as </body> is encountered in the lexing
	     * stream.
	     */
	    public Tag onEndTag(final Tag tag) {
	        endTags++;
	        return tag;
	    }

	    /**
	     * Receives notification whenever a 'start' tag
	     * such as <body> is encountered in the lexing
	     * stream. Start tags also include empty XHTML
	     * tags and void HTML tags such as <br> or <br/>.
	     */
	    public Tag onStartTag(final Tag tag) {
	        startTags++;
	        return tag;
	    }
	}

Processing

	package htmlparser.example;

	import java.io.IOException;
	import java.io.StringWriter;

	import org.junit.Test;
	import static org.junit.Assert.assertEquals;

	// Assumed to live in the same package as the listener interfaces.
	import com.centercomp.htmlsieve.HtmlSieve;

	public class TagCountingTest {

	    @Test
	    public void testTagCounting() throws IOException {
	        // This is where we will send output.
	        StringWriter stringWriter = new StringWriter();

	        TagCountingListener tagListener = new TagCountingListener();

	        // Construct the sieve, tell it where to send output
	        // after processing, and add the tag listener.
	        HtmlSieve sieve = new HtmlSieve()
	            .setWriter(stringWriter)
	            .addTagListener(tagListener);

	        // Now send the sieve some data in bits:
	        String test1 = "<html><body><p";
	        String test2 = ">This is a test</p></body></html>";

	        // Make sure nothing has been processed yet.
	        assertEquals(0, tagListener.startTags);
	        assertEquals(0, tagListener.endTags);

	        // The lexer will handle what it can... in this case, the <p> tag
	        // is incomplete, so it won't be processed yet.
	        sieve.write(test1);

	        // Processed up to, but not including, the 'p' tag.
	        assertEquals(2, tagListener.startTags);
	        assertEquals(0, tagListener.endTags);

	        sieve.write(test2);
	        sieve.close();

	        assertEquals(3, tagListener.startTags);
	        assertEquals(3, tagListener.endTags);

	        // Make sure content hasn't been modified by the processing.
	        assertEquals(test1 + test2, stringWriter.toString());
	    }
	}

As you can see, the key to working with the sieve is the streaming listeners.
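
The interesting part of the test is the `<p` that gets held back until the next write. One way a streaming lexer could do that is sketched below. This is only a guess at the general technique, not HtmlSieve's actual code: `PartialTagBufferSketch` and its methods are hypothetical, and the sketch only collects tags and discards the text between them.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: shows how an unfinished tag can wait for the next chunk.
class PartialTagBufferSketch {
    private final StringBuilder pending = new StringBuilder();
    final List<String> tags = new ArrayList<>();

    /** Accepts a chunk, records every complete tag, and keeps an unfinished tag for later. */
    void write(String chunk) {
        pending.append(chunk);
        int pos = 0;
        while (true) {
            int open = pending.indexOf("<", pos);
            if (open < 0) {
                pos = pending.length();   // only text left; this sketch drops it
                break;
            }
            int close = pending.indexOf(">", open);
            if (close < 0) {
                pos = open;               // tag is incomplete: hold it back
                break;
            }
            tags.add(pending.substring(open, close + 1));
            pos = close + 1;
        }
        pending.delete(0, pos);           // discard everything already handled
    }

    /** Returns whatever is still waiting for more input. */
    String pending() {
        return pending.toString();
    }

    public static void main(String[] args) {
        PartialTagBufferSketch sketch = new PartialTagBufferSketch();
        sketch.write("<html><body><p");
        System.out.println(sketch.tags.size());   // prints 2
        sketch.write(">This is a test</p></body></html>");
        System.out.println(sketch.tags.size());   // prints 6
    }
}
```

The counts line up with the test above: two complete tags after the first chunk, and three start tags plus three end tags once the second chunk arrives.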

What can listeners do?



  • Process comments with com.centercomp.htmlsieve.CommentListener. A useful example of this is com.centercomp.htmlsieve.filters.CommentStripper.
  • Process text inside the HTML with com.centercomp.htmlsieve.TextListener. Useful examples would be a profanity filter for a form post or an inline emoticon replacer.
  • Rewrite HTML tags with com.centercomp.htmlsieve.TagListener. We use this extensively in the Chariot Framework for the AntiCSRF processing in Chariot Command, and we plan to use it in Chariot FuSOR as well.
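
As a taste of the text-processing case, here is a small sketch of the kind of logic a profanity filter might run on each piece of text. The `TextHandler` interface below is a hypothetical stand-in: the real `TextListener` signature isn't shown in this post and may well look different, so treat this as the filtering idea only.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Illustration only: the filtering logic, independent of HtmlSieve.
class TextFilterSketch {
    /** Hypothetical stand-in for a text callback; not HtmlSieve's TextListener. */
    interface TextHandler {
        String onText(String text);
    }

    /** Builds a handler that masks each listed word with asterisks (whole words, any case). */
    static TextHandler masking(String... words) {
        return text -> {
            String result = text;
            for (String word : words) {
                char[] stars = new char[word.length()];
                Arrays.fill(stars, '*');
                Pattern wholeWord = Pattern.compile(
                        "\\b" + Pattern.quote(word) + "\\b", Pattern.CASE_INSENSITIVE);
                result = wholeWord.matcher(result).replaceAll(new String(stars));
            }
            return result;
        };
    }

    public static void main(String[] args) {
        TextHandler filter = masking("darn");
        System.out.println(filter.onText("Well, Darn it. Darning needles."));
    }
}
```

Because the handler only ever sees text, it cannot accidentally mangle tag names or attribute values, which is the main reason to do this kind of replacement behind a lexer rather than with a regex over the whole document.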
