Maxwell Terry

Scraping a News.arc Feed With AppJet

Feeds are machine-readable serializations of web content. While the web was designed as a network of linked documents, feeds offer linked data. Most feeds provide frequently updated content, like weather forecasts, stock listings, status updates, and alerts of new blog posts. While this data could be published in any form, standard interchange formats are typically used so that other services don't have to use or write a custom parser.

The two dominant formats are XML and JSON: the former is a general-purpose markup language closely related to HTML (which gives structure to web pages), and the latter is based on a subset of the JavaScript programming language (which runs natively in all popular browsers). XML and JSON are widely supported, and libraries to convert the data to in-memory program structures are already written for most languages. (See http://en.wikipedia.org/wiki/Category:XML_parsers and http://json.org/.) We'll be working exclusively with JSON.
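As a small illustration of that conversion, here is a sketch using JSON.parse and JSON.stringify, the two functions json2.js defines (modern JavaScript engines build them in). The entry and its field names are made up for the example.

```javascript
// A feed entry as JSON text, the form in which it travels between services.
// The field names are illustrative only.
var text = '{"title": "Example story", "score": 42, "tags": ["arc", "lisp"]}'

// Parsing turns the text into an ordinary in-memory object.
var entry = JSON.parse(text)

// Serializing turns an object back into JSON text.
var roundTrip = JSON.stringify({ title: entry.title, score: entry.score + 1 })
```

Once parsed, the entry is a plain object, so entry.title and entry.tags[0] can be read like any other property.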

Sometimes a site doesn't provide a feed, or the official feed is found lacking. We can, however, scrape (i.e. programmatically extract) content from any accessible site. It's important to first make sure that doing so doesn't violate the host's terms of service, and to remember that the data might be blocked or otherwise unavailable to you at any given time. (I'd advise against starting a business around scraping.) But it can be very useful for getting content alerts from data that isn't already syndicated.

Let's look at how we would scrape updates from news.arc sites with AppJet. Included in Paul Graham's implementation of the Arc language, news.arc is a library deployed as Hacker News, Arc Forum, New Mogul, and Academic Hacker News, among others. Let's scrape the newest stories every hour, which we can use as an alert of forum activity.

Rather than simply grabbing the source and stripping out solely what we need, we'll build up a complete JSON feed of the site's dynamic content.

At the beginning of our AppJet code we'll include some metadata, including the version of the framework (required by AppJet) and an overview of what we're doing.

/* appjet:version 0.1 */
/** @fileoverview Scrape headlines from news.arc sites to JSON. */

We'll need to import a few libraries.

import("storage", "lib-json2", "lib-value")

The storage library can be used to persist objects on disk, lib-json2 is a server- and client-side copy of Douglas Crockford's json2.js, and lib-value is my own Value.js framework.

We'll include a table of the shortnames and addresses of existing deployments.

sites = {
  hn:  "http://news.ycombinator.com",
  arc: "http://arclanguage.org/forum",
  nm:  "http://newmogul.com",
  ahn: "http://www.cs.toronto.edu/~ad/news/"
}

And feature pages. (Some of these may only be available on Hacker News for the time being.)

pages = [
  "top",
  "newest",
  "threads",
  "newcomments",
  "leaders",
  "jobs",
  "best",
  "active",
  "bestcomments",
  "noobs",
  "classic"
]

It's generally rude to pull data from a host indiscriminately. To prevent excessive requests, we'll cap the maximum frequency at once a minute. This is user-driven: if no one requests data, it's not being pulled behind the scenes. [1]

cache = function(site) {

  if (!storage.site) storage.site = {}
  if (!storage.time) storage.time = {}

  if (!storage.site[site]) {
    storage.site[site] = wget(site)
    storage.time[site] = as.now()
  }

  as.minutely(function() {
    storage.site[site] = wget(site)
    storage.time[site] = as.now()
  }, storage.time[site])

}

The cache function takes a site URL string. We first make sure the storage object has site and time properties. Then, if the site hasn't been cached yet, we retrieve it and record the current time under the site URL in storage.time. (This lets us know how stale the cache is.) Finally, we call as.minutely, a method of the as object in Value.js. A convenience variation of what's currently called as.cron (which should probably be expanded or renamed to as.every or as.often), as.minutely calls the passed function (its first argument) only if the given Unix timestamp is at least a minute old.
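To make the staleness check concrete, here is a minimal sketch of the idea. It is not the Value.js source: ifStale is a hypothetical name, and it assumes timestamps are Unix times in seconds, which may not match what as.now returns.

```javascript
// Hypothetical stand-in for as.minutely: run fn only if `then` (a Unix
// timestamp in seconds) is at least `interval` seconds before `now`.
// Returns true if fn was called, false otherwise.
var ifStale = function(fn, then, interval, now) {
  interval = interval || 60
  now = now === undefined ? Math.floor(new Date().getTime() / 1000) : now
  if (now - then >= interval) {
    fn()
    return true
  }
  return false
}

// Usage: the same refresh is skipped while fresh and run once stale.
var calls = 0
var refresh = function() { calls++ }
ifStale(refresh, 1000, 60, 1030)  // 30 seconds old: skipped
ifStale(refresh, 1000, 60, 1060)  // 60 seconds old: runs
```

Passing now explicitly is only there to make the check easy to exercise; in normal use it would be left out and the current time used.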

posts = function(site, n, html) {

  cache(site)

  var them    = [],
      stories = storage.site[site].split("vote?").slice(1, n+1)

  for (var i=0, l=stories.length; i<l; i++) {

    var o   = {}

    o.by    = as.after(stories[i], "?id=", "\">")
    o.id    = 1 * as.after(stories[i], 4, "&")
    o.url   = as.after(stories[i], "href=\"", "\"")
    o.title = as.after(as.after(stories[i], "href=\"", "</a>"), ">")
    o.score = 1 * as.after(stories[i], "score_", " point")
                    .substring((o.id+"").length + 1)
    o.type  = "story"
    o.time  = trim(as.before(stories[i], "| ", "</a>"))

    if (o.url.substring(0,4) != "http") {
      o.url = site.split("/").slice(0,3).join("/")+"/"+o.url
    }

    if (!html) o = as.str(o)
    them.push(o)

  }

  return(html ? them : "["+them.join(",")+"]")

}

The meat of the program is the posts function, which accepts a site address, the number of posts to include, and a flag for whether to return HTML (mostly for debugging purposes). The site address is given as a string and should be the full address of the page (e.g. "http://news.ycombinator.com/newest"). The number of posts is currently capped at 30 [2]. The cache function is called first, refreshing the cache if necessary. After the page source is split into one chunk per submission, the central for loop runs through each chunk, pulling out data into the properties of a new object and appending that object to an accumulating array. In the end the JSON (or HTML) is returned.
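The split-and-extract approach can be tried in isolation. The sketch below uses a hypothetical after helper as a stand-in for as.after (the real Value.js function may behave differently) and an invented scrap of markup that is only loosely shaped like a listing page.

```javascript
// Hypothetical stand-in for as.after; the real Value.js helper may differ.
// Returns the text following `start` (a substring, or a numeric offset),
// cut off at the next occurrence of `end` when `end` is given.
var after = function(str, start, end) {
  var from
  if (typeof start == "number") from = start
  else {
    var at = str.indexOf(start)
    if (at < 0) return ""
    from = at + start.length
  }
  var rest = str.substring(from)
  if (end === undefined) return rest
  var stop = rest.indexOf(end)
  return stop < 0 ? rest : rest.substring(0, stop)
}

// Invented markup for illustration; not real site output.
var sample = '<a href="vote?for=7&dir=up"></a>' +
             '<a href="http://example.com/a">First</a>' +
             '<a href="vote?for=8&dir=up"></a>' +
             '<a href="item?id=8">Second</a>'

// Everything before the first "vote?" is page chrome, so drop chunk 0.
var chunks = sample.split("vote?").slice(1)

var ids = [], titles = []
for (var i = 0; i < chunks.length; i++) {
  ids.push(1 * after(chunks[i], 4, "&"))  // skip the 4 characters of "for="
  titles.push(after(after(chunks[i], 'href="', "</a>"), ">"))
}
```

Splitting on a marker that occurs once per story is what keeps the extraction simple, but it also ties the scraper to the page layout: if the markup changes, the markers have to change with it.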

A main function is also provided for calling on each request. This could probably be better written as a block (i.e. (function() {})()), but AppJet doesn't seem to support them.

srv = function(site, n) {

  page.setMode("plain")

  // URL parameters take precedence, then the arguments, then the defaults
  site = request.params.site || site || sites.hn
  n = n || 30 //: only supports first page right now
  var html = !!request.params.html

  if (!request.params.page) print(posts(site, n, html))

  else {

    if (is.within(pages, request.params.page)) {
      print(posts(site+"/"+request.params.page, n, html))
    }

  }

}

This can be called as

srv()

or optionally provided with a value for the site and number of posts, which otherwise default respectively to "http://news.ycombinator.com" and 30.

You can access this at http://news-arc-scrape.appjet.net/. It can be manipulated solely with URL parameters; try http://news-arc-scrape.appjet.net/?page=newest, http://news-arc-scrape.appjet.net/?page=newest&html=true, and http://news-arc-scrape.appjet.net/?site=http://arclanguage.org/forum.
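How those URL parameters map onto the address that gets scraped can be sketched as a pure function. The target helper is hypothetical, and the sites and pages tables are abbreviated copies so that the example stands alone.

```javascript
var sites = { hn: "http://news.ycombinator.com" }
var pages = ["top", "newest", "best"]  // abbreviated for the example

// Hypothetical helper: the address that would be scraped for the given
// URL parameters, or null when the requested page is not on the list.
var target = function(params) {
  var site = params.site || sites.hn
  if (!params.page) return site
  for (var i = 0; i < pages.length; i++) {
    if (pages[i] == params.page) return site + "/" + params.page
  }
  return null
}
```

Checking the page against a fixed list means a request can only ever reach the known feature pages of the chosen site.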

This solution isn't perfect. Since the post date is displayed relatively, the absolute time can only be computed for more recent submissions, and must be done manually. (It would be preferable if the Unix time stamps that are used internally were exposed in the HTML or an official data feed.) Since the intended use here is to get the newest stories, the time information doesn't really matter. But while we're at it, we should derive a general solution to scraping news.arc sites. Future applications may need the time information.
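For applications that do need it, an approximate absolute time can be recovered from the relative one. The following is a rough sketch, assuming ages are phrased like "3 hours ago"; the absolute helper and the unit table are hypothetical and not part of the scraper above.

```javascript
// Seconds per unit; assumes ages are phrased like "3 hours ago".
var unit = { minute: 60, hour: 3600, day: 86400 }

// Hypothetical helper: approximate Unix timestamp (in seconds) for a
// relative age string, measured back from `now`. Returns null when the
// string doesn't match the assumed phrasing.
var absolute = function(age, now) {
  var m = /(\d+)\s+(minute|hour|day)s?\s+ago/.exec(age)
  if (!m) return null
  return now - m[1] * unit[m[2]]
}
```

The result is only as precise as the unit shown: a story marked "3 hours ago" could be anywhere within that hour, and the error grows to a full day for older posts.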

Replies and text could be added by pulling the data from the story link. I'll leave this, as well as scraping comments, as an exercise for the reader. [3]

1. If this were preferred, we could use a cron job.

2. This is just because news pages usually have 30 elements. It could be expanded by recursively getting the fnid of the next page and retrieving it.

3. sockvotes, ip, and votes could also be included if one had admin access.