Parsing Flickr’s Popular Tags page

For my Visual Databases course (again), one of my project goals is to parse Flickr’s most popular tags page, save the results over a few days, and check how the popular tags vary with time. And all of this on my Windows desktop…

OK, here is what I did.

1. Wrote a Windows script (VBScript) to pull down Flickr’s popular tags page and save it to the working folder. (I had Cygwin installed and used its curl binary to download the page.) This saves files with names like this: 2008-4-22_14-30.html .


' Build a timestamp such as 2008-4-22_14-30 from the current date and time
dYear = DatePart("yyyy" , Now)
dMonth = DatePart("m" , Now)
dDay = DatePart("d" , Now)
dHour = DatePart("h", Now)
dMinute = DatePart("n" , Now)

fileName = dYear & "-" & dMonth & "-" & dDay  & "_" & dHour  & "-" & dMinute

' Download the page with Cygwin's curl and save it as <timestamp>.html
Set WshShell = WScript.CreateObject("WScript.Shell")
WshShell.Run ("c:\cygwin\bin\curl.exe http://flickr.com/photos/tags/ -o " & fileName & ".html")

2. Set up a Scheduled Task in Windows to run this script daily at 2:30 pm.

So now I can collect several days’ worth of Flickr’s most popular tags page. On to parsing.
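
Since each saved file carries its download time in its name, that timestamp can be recovered later when the snapshots are compared. A minimal sketch in Java, assuming the names follow the 2008-4-22_14-30.html pattern shown above (month, day, hour and minute not zero-padded); the class name SnapshotName is made up for illustration:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class SnapshotName {
    // Single-letter fields (M, d, H, m) accept one or two digits,
    // which matches the unpadded values the download script writes.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("uuuu-M-d_H-m");

    // Returns the download time encoded in a name such as "2008-4-22_14-30.html".
    static LocalDateTime parse(String fileName) {
        String stem = fileName.endsWith(".html")
                ? fileName.substring(0, fileName.length() - ".html".length())
                : fileName;
        return LocalDateTime.parse(stem, STAMP);
    }

    public static void main(String[] args) {
        System.out.println(parse("2008-4-22_14-30.html"));
    }
}
```

A name that does not fit the pattern makes LocalDateTime.parse throw a DateTimeParseException, so stray files in the folder are easy to spot.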

3. Java’s built-in XML parsers expect well-formed XML and choke on real-world HTML. So I convert the HTML pages to XHTML using an online version of Tidy (this is manual for now).
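
The manual Tidy step could probably be automated. The sketch below is not part of the setup described here: it assumes a command-line tidy binary is installed and on the PATH (this post uses the online version), and the class name TidyRunner is hypothetical.

```java
import java.io.IOException;
import java.util.List;

class TidyRunner {
    // Builds the tidy command line: -q suppresses nonessential output,
    // -asxhtml converts HTML to XHTML, -numeric writes numeric rather than
    // named entities, and -o names the output file.
    static List<String> command(String htmlIn, String xhtmlOut) {
        return List.of("tidy", "-q", "-asxhtml", "-numeric", "-o", xhtmlOut, htmlIn);
    }

    // Runs tidy and returns its exit code. Tidy reports 0 for a clean run,
    // 1 when it only issued warnings and 2 when it hit errors.
    static int convert(String htmlIn, String xhtmlOut)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command(htmlIn, xhtmlOut))
                .inheritIO()
                .start();
        return p.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: TidyRunner <input.html> <output.xml>");
            return;
        }
        System.out.println("tidy exit code: " + convert(args[0], args[1]));
    }
}
```

Real-world pages usually produce warnings, so an exit code of 1 is normally still usable; only 2 means the conversion failed.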

4. Wrote a Java program that parses this XHTML page and outputs the tags and their “weight”, measured in font size.


import java.io.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;

public class FParser {

    // Main program
    // Much of the XML parsing code comes from
    // http://www.exampledepot.com/egs/org.w3c.dom/pkg.html
    public static void main(String args[]) {
        // The Quintessential Program to Create a DOM Document from an XML File
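        Document doc = parseXmlFile("flickr.xml", false);
        if (doc == null) {
            // parseXmlFile has already printed the error
            return;
        }

        // Collect every <p> element; the loop below picks the one with the right id
        NodeList list = doc.getElementsByTagName("p");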
        for (int i=0; i<list.getLength(); i++) {
            Element element = (Element)list.item(i);
            // We want the <p> whose id is "TagCloud"
            if (element.getAttribute("id").equals("TagCloud")) {
                // Get each of the tag cloud's <a> elements
                NodeList list2 = element.getElementsByTagName("a");
                for (int j=0; j<list2.getLength(); j++) {
                    Element element2 = (Element)list2.item(j);
                    // Get the tag by stripping the path from the href attribute
                    String sHref = element2.getAttribute("href").replace("/photos/tags/", "").replace("/", "");
                    // Get the weight by stripping the font-size property out of the style attribute
                    String sStyle = element2.getAttribute("style").replace("font-size: ", "").replace("px;", "");
                    System.out.println(sHref + ": " + sStyle);
                }
            }
        }

    }

    // Parses an XML file and returns a DOM document.
    // If validating is true, the contents are validated against the DTD
    // specified in the file.
    public static Document parseXmlFile(String filename, boolean validating) {
        try {
            // Create a builder factory
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(validating);

            // Create the builder and parse the file
            Document doc = factory.newDocumentBuilder().parse(new File(filename));
            return doc;
        } catch (SAXException e) {
            // A parsing error occurred; the xml input is not valid
            e.printStackTrace();
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
}
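
The string replacements above only work while the href and style attributes have exactly that shape. A slightly more forgiving variant could use regular expressions instead. This is a sketch, not what FParser does: the class name TagLink and the sample values in main are invented for illustration, and the patterns assume hrefs of the form /photos/tags/NAME/ and pixel-based font sizes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagLink {
    private static final Pattern HREF = Pattern.compile("/photos/tags/([^/]+)/?");
    private static final Pattern FONT_SIZE = Pattern.compile("font-size:\\s*(\\d+)px");

    // Returns the tag name from an href like "/photos/tags/NAME/", or null if it does not match.
    static String tag(String href) {
        Matcher m = HREF.matcher(href);
        return m.find() ? m.group(1) : null;
    }

    // Returns the font size in pixels from a style like "font-size: 14px;", or -1 if none is found.
    static int fontSizePx(String style) {
        Matcher m = FONT_SIZE.matcher(style);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        // Illustrative values, not taken from a real Flickr page
        System.out.println(tag("/photos/tags/wedding/") + ": " + fontSizePx("font-size: 14px;"));
    }
}
```

Returning the weight as an int rather than a string also makes it easier to store and compare later.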

To-do:

5. Save this data for each day in some sort of database or Excel spreadsheet.

6. Plot the values to see if the tags change over time and if their weight changes too…
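
One possible shape for these two to-do items, as a rough sketch rather than a finished design: append each day’s tags to a CSV file (which Excel can open) and compute how each tag’s weight changed between two days. The class name TagHistory, the file name tags.csv and the sample tags and weights are all made up; the code also assumes tag names contain no commas.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TagHistory {
    // Appends one "date,tag,weight" row per tag, sorted by tag name.
    static void append(Path csv, String date, Map<String, Integer> weights) throws IOException {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, Integer> e : new TreeMap<>(weights).entrySet()) {
            rows.add(date + "," + e.getKey() + "," + e.getValue());
        }
        Files.write(csv, rows, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Returns the weight change per tag between two days.
    // A tag missing on one of the days counts as weight 0 for that day.
    static Map<String, Integer> changes(Map<String, Integer> before, Map<String, Integer> after) {
        Map<String, Integer> delta = new TreeMap<>();
        for (String tag : before.keySet()) {
            delta.put(tag, after.getOrDefault(tag, 0) - before.get(tag));
        }
        for (String tag : after.keySet()) {
            delta.putIfAbsent(tag, after.get(tag));
        }
        return delta;
    }

    public static void main(String[] args) throws IOException {
        // Invented sample data for two consecutive days
        Map<String, Integer> day1 = Map.of("wedding", 30, "beach", 20);
        Map<String, Integer> day2 = Map.of("wedding", 28, "snow", 12);
        append(Path.of("tags.csv"), "2008-04-22", day1);
        append(Path.of("tags.csv"), "2008-04-23", day2);
        System.out.println(changes(day1, day2));
    }
}
```

A tag that drops out of the cloud shows up as a negative change equal to its old weight, and a new tag as a positive change equal to its new weight, which should make the day-to-day churn easy to plot.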
