Flickr Popular Tags page parsing using Java – update

30 Apr

In my last post on parsing Flickr popular tags, I discussed converting the Flickr page HTML to XML (using Tidy on the command line) and then using the Java XML APIs to get a Document that I could parse.

Since then I have found some sample code showing how to use Tidy (jTidy) from Java to parse HTML and return a Document directly, so I have changed my code as follows…

import java.io.FileInputStream;
import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import org.w3c.tidy.Tidy;

public class FParser {

    // Main program
    // Much of the XML parsing code comes from
    // "The Quintessential Program to Create a DOM Document from an XML File"
    public static void main(String args[]) {
        // …
    }

    public static void startParse(String dir) {
        Tidy tidy = new Tidy(); // obtain a new Tidy instance
        Document doc = null;
        try {
            FileInputStream fstream = new FileInputStream(…);
            // jTidy cleans up the HTML and returns it as a DOM Document
            doc = tidy.parseDOM(fstream, null);
        } catch (Exception e) {
            // …
        }

        // Retrieve the element using id
        NodeList list = doc.getElementsByTagName("p");
        for (int i = 0; i < list.getLength(); i++) {
            Element element = (Element) list.item(i);
            // Find the <p> whose id is "TagCloud"
            if (element.getAttribute("id").equals("TagCloud")) {
                // Get each of the Tag Cloud's elements
                NodeList list2 = element.getElementsByTagName("a");
                for (int j = 0; j < list2.getLength(); j++) {
                    // …
                }
            }
        }
    }
}
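
The listing above is cut off inside the inner loop, so here is a minimal, self-contained sketch of the same traversal that can be run without the jTidy jar. It is an illustration under stated assumptions, not the original code: the class name TagCloudSketch and the methods parseXhtml and extractTags are hypothetical, the sample markup is made up, and collecting each link's text is my guess at what the loop did with every tag. The Document is built with the JDK's own parser from well-formed XHTML as a stand-in; with jTidy it would come from tidy.parseDOM(fstream, null) instead.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class TagCloudSketch {

    // Stand-in for tidy.parseDOM(...): parses well-formed XHTML with the JDK parser.
    static Document parseXhtml(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample markup", e);
        }
    }

    // Collects the link text of every <a> inside the <p> whose id is "TagCloud".
    // Assumption: the link text is the tag name.
    static List<String> extractTags(Document doc) {
        List<String> tags = new ArrayList<>();
        NodeList paragraphs = doc.getElementsByTagName("p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Element p = (Element) paragraphs.item(i);
            if (!"TagCloud".equals(p.getAttribute("id"))) {
                continue; // not the tag cloud paragraph
            }
            NodeList links = p.getElementsByTagName("a");
            for (int j = 0; j < links.getLength(); j++) {
                // Read the first child text node of the link, if there is one
                Node text = links.item(j).getFirstChild();
                if (text != null && text.getNodeValue() != null) {
                    tags.add(text.getNodeValue().trim());
                }
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        // Made-up sample markup shaped like the structure the post describes
        String sample = "<html><body>"
                + "<p id=\"Other\"><a href=\"/x\">ignored</a></p>"
                + "<p id=\"TagCloud\">"
                + "<a href=\"/tags/sunset\">sunset</a> "
                + "<a href=\"/tags/beach\">beach</a>"
                + "</p></body></html>";
        System.out.println(extractTags(parseXhtml(sample))); // prints [sunset, beach]
    }
}
```

Keeping the traversal in a method that takes a Document means it does not care whether the Document came from jTidy or from the XML parser, which makes it easy to try both approaches on the same page.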
