Parsing Flickr’s Popular Tags page

23 Apr

For my Visual Databases course (again), one of my project goals is to parse Flickr’s most popular tags page, save the results over the course of a few days, and check how the popular tags vary over time. And all of this on my Windows desktop…

OK, here is what I did.

1. Wrote a Windows script (VBScript, run by Windows Script Host) to pull down Flickr’s popular tags page and save it to a folder. (I had Cygwin installed and used its curl binary to download the page.) This saves files with names like this: 2008-4-22_14-30.html .


' Break the current date and time into its parts
dYear = DatePart("yyyy" , Now)
dMonth = DatePart("m" , Now)
dDay = DatePart("d" , Now)
dHour = DatePart("h", Now)
dMinute = DatePart("n" , Now)

' Build a name like 2008-4-22_14-30 (fields are not zero-padded)
fileName = dYear & "-" & dMonth & "-" & dDay  & "_" & dHour  & "-" & dMinute

' Download the page with Cygwin's curl into <fileName>.html
Set WshShell = WScript.CreateObject("WScript.Shell")
WshShell.Run ("c:\cygwin\bin\curl.exe http://flickr.com/photos/tags/ -o " & fileName & ".html")

2. Set up a Scheduled Task in Windows to run this script daily at 2:30 pm.
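The file-naming scheme from step 1 can also be expressed in Java, which is handy if the parser later needs to generate or match snapshot names. This is only a sketch: the class name SnapshotName and both method names are made up for illustration, and the zero-padded variant is my own suggestion (it makes the files sort chronologically by name), not something the script above does.

```java
import java.time.LocalDateTime;

final class SnapshotName {
    // Mirrors the script's naming: year-month-day_hour-minute, no zero padding.
    static String unpadded(LocalDateTime t) {
        return t.getYear() + "-" + t.getMonthValue() + "-" + t.getDayOfMonth()
                + "_" + t.getHour() + "-" + t.getMinute() + ".html";
    }

    // Hypothetical variant: zero-padded fields, so names sort in date order.
    static String padded(LocalDateTime t) {
        return String.format("%04d-%02d-%02d_%02d-%02d.html",
                t.getYear(), t.getMonthValue(), t.getDayOfMonth(),
                t.getHour(), t.getMinute());
    }

    public static void main(String[] args) {
        LocalDateTime t = LocalDateTime.of(2008, 4, 22, 14, 30);
        System.out.println(unpadded(t));
        System.out.println(padded(t));
    }
}
```

For the example timestamp, the first method reproduces the 2008-4-22_14-30.html name shown in step 1.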

So now, I can get several days’ worth of Flickr’s most popular tags page. On to parsing.

3. Java’s built-in XML parsers expect well-formed XML, so they do not let you parse ordinary HTML easily. So I convert the HTML pages to XHTML using an online version of Tidy (this is manual for now).
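To see why the conversion in step 3 is needed, here is a minimal sketch that feeds two snippets to the standard JAXP DOM parser. The class name WellFormedCheck and the sample markup are invented for illustration; the point is only that an unclosed tag such as <br>, which is normal in HTML, makes the XML parser give up, while the XHTML form <br /> parses fine.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormedCheck {
    // Returns true if the markup parses as XML, false if the parser rejects it.
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(
                    markup.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // Raised for markup that is not well-formed XML.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O trouble, not a markup problem.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<p>one<br>two</p>"));
        System.out.println(isWellFormed("<p>one<br />two</p>"));
    }
}
```

Tidy’s job, in effect, is to turn input of the first kind into input of the second kind.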

4. Wrote a Java program that parses this XHTML page and outputs the tags and their “weight”, measured in font size.

import java.io.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;

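public class FParser {

    // Main program
    // Much of the XML parsing code comes from
    // http://www.exampledepot.com/egs/org.w3c.dom/pkg.html
    public static void main(String[] args) throws Exception {
        // Create a DOM Document from the Tidy-converted XHTML file
        Document doc = parseXmlFile("flickr.xml", false);

        // Retrieve the <p> element whose id is "TagCloud"
        NodeList list = doc.getElementsByTagName("p");
        for (int i = 0; i < list.getLength(); i++) {
            Element element = (Element) list.item(i);
            if (element.getAttribute("id").equals("TagCloud")) {
                // Get each of the tag cloud's <a> elements
                NodeList list2 = element.getElementsByTagName("a");
                for (int j = 0; j < list2.getLength(); j++) {
                    Element tag = (Element) list2.item(j);
                    // Assumption: the font size (the "weight") is carried in
                    // each link's style attribute, so print it next to the tag.
                    System.out.println(tag.getTextContent().trim() + "\t" + tag.getAttribute("style"));
                }
            }
        }
    }

    // Parses an XML file into a DOM Document (after the exampledepot example)
    public static Document parseXmlFile(String filename, boolean validating)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(validating);
        return factory.newDocumentBuilder().parse(new File(filename));
    }
}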
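
The “weight” in step 4 is a font size, so at some point it has to become a number that can be compared from day to day. Here is a rough sketch of that step. It assumes, and this is only an assumption about the page markup, that each tag link carries an inline style along the lines of font-size: 14px; and the class name FontSize and method fromStyle are invented for this example.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FontSize {
    // Assumed style format: "font-size: 14px;" (whole pixels only).
    private static final Pattern FONT_SIZE =
            Pattern.compile("font-size\\s*:\\s*(\\d+)\\s*px", Pattern.CASE_INSENSITIVE);

    // Returns the pixel size found in the style string, or empty if none.
    static OptionalInt fromStyle(String style) {
        Matcher m = FONT_SIZE.matcher(style);
        return m.find()
                ? OptionalInt.of(Integer.parseInt(m.group(1)))
                : OptionalInt.empty();
    }

    public static void main(String[] args) {
        System.out.println(fromStyle("font-size: 14px;"));
        System.out.println(fromStyle("color: red;"));
    }
}
```

Returning an OptionalInt rather than a default size keeps a tag with an unexpected style from silently getting a made-up weight. If the real page expresses sizes differently (for example in em or through CSS classes), the pattern would have to change.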
