Flickr Popular Tags page parsing using Java – update

30 Apr

In my last post on parsing Flickr popular tags, I discussed converting the Flickr page HTML to XML (using Tidy on the command line) and then using the Java XML APIs to get a Document that I could parse.

Since then I have found some sample code showing how to use Tidy (JTidy) to parse the HTML and return a Document directly, with no separate command-line step, so I have changed my code as follows…

import java.io.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import org.w3c.tidy.Tidy;

public class FParser {

    // Main program
    // Much of the XML parsing code comes from
    // http://www.exampledepot.com/egs/org.w3c.dom/pkg.html
    public static void main(String[] args) {
        // Parse the saved Flickr page
        startParse(".");
    }

    public static void startParse(String dir) {
        Tidy tidy = new Tidy(); // obtain a new Tidy instance
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);
        Document doc = null;
        // try-with-resources closes the stream even if parsing fails
        try (FileInputStream fstream = new FileInputStream("2008-4-29_14-30.html")) {
            doc = tidy.parseDOM(fstream, null);
        } catch (Exception e) {
            e.printStackTrace();
            return; // no Document to walk if the parse failed
        }

        // Find the <p> element whose id is "TagCloud"
        NodeList list = doc.getElementsByTagName("p");
        for (int i = 0; i < list.getLength(); i++) {
            Element element = (Element) list.item(i);
            if (element.getAttribute("id").equals("TagCloud")) {
                // Get each of the Tag Cloud's <a> elements
                NodeList list2 = element.getElementsByTagName("a");
                for (int j = 0; j < list2.getLength(); j++) {
                    // … (the rest of the original listing is missing from this point)
                }
            }
        }
    }
}
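The inner loop of the listing is cut off, so here is a minimal sketch of one way the tag names could be pulled out of the TagCloud paragraph. It is an assumption, not the original code: the class name TagCloudWalker, the helper names extractTags and parseSample, and the sample markup are all made up for illustration. To stay runnable without JTidy on the classpath, it builds the Document from a small well-formed string with the JDK's own parser. The traversal only uses basic DOM calls, so the same extractTags method should also accept the Document returned by tidy.parseDOM.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class TagCloudWalker {

    // Collect the text of every <a> inside the <p> whose id is "TagCloud".
    static List<String> extractTags(Document doc) {
        List<String> tags = new ArrayList<>();
        NodeList paragraphs = doc.getElementsByTagName("p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Element p = (Element) paragraphs.item(i);
            if (!"TagCloud".equals(p.getAttribute("id"))) {
                continue; // not the tag cloud paragraph
            }
            NodeList links = p.getElementsByTagName("a");
            for (int j = 0; j < links.getLength(); j++) {
                // The link text is the first child (a text node) of the <a>
                Node text = links.item(j).getFirstChild();
                if (text != null && text.getNodeValue() != null) {
                    tags.add(text.getNodeValue().trim());
                }
            }
        }
        return tags;
    }

    // Hypothetical stand-in for the saved Flickr page: parses a
    // well-formed string with the JDK parser instead of JTidy.
    static Document parseSample(String xhtml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(
            new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        // Made-up markup that only mimics the shape described in the post
        String sample = "<html><body>"
            + "<p id=\"Other\"><a href=\"/x\">ignored</a></p>"
            + "<p id=\"TagCloud\">"
            + "<a href=\"/photos/tags/beach/\">beach</a> "
            + "<a href=\"/photos/tags/wedding/\">wedding</a>"
            + "</p></body></html>";
        for (String tag : extractTags(parseSample(sample))) {
            System.out.println(tag);
        }
    }
}
```

Returning a List rather than printing inside the loop keeps the traversal separate from whatever is done with the tags afterwards.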
