Parsing web sitemaps using JavaScript

Recently I came across an interesting project by Sean Thomas Burke called Sitemapper. It is a mini framework for parsing sitemap XML files and extracting all the URLs they contain. This functionality is essential when crawling websites, as the sitemap (usually) holds an up-to-date list of all of a site's URLs. In most cases this list is enough when designing a crawler, so you don't need to crawl the website manually and build the list of URLs yourself.

Sitemap parser: Sitemapper

Sitemapper is a well-maintained, well-documented, open-source library offering the following features:

  • Follows redirects
  • Supports gzip sitemaps
  • Debug mode
  • Multi-threaded processing
  • Timeout limit
  • User-agent customization
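
Before looking at the library itself, it helps to see what "parsing a sitemap" boils down to. The sketch below is illustrative only (it is not Sitemapper's implementation, which also handles sitemap-index files, redirects, and gzip): given the body of a sitemap XML file, it extracts every URL listed in a <loc> element.

```javascript
// Rough sketch of the core job a sitemap parser performs:
// pull every URL out of the <loc> elements of a sitemap XML document.
function extractLocs(xml) {
  // Find each <loc>…</loc> pair, then strip the surrounding tags.
  const matches = xml.match(/<loc>(.*?)<\/loc>/g) || [];
  return matches.map((m) => m.replace(/<\/?loc>/g, '').trim());
}

// Minimal sitemap document, following the sitemaps.org schema.
const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/about</loc></url>
</urlset>`;

console.log(extractLocs(sampleSitemap));
// [ 'https://www.example.com/', 'https://www.example.com/about' ]
```

Sitemapper wraps this idea in a promise-based API: you construct an instance with a sitemap URL and call fetch(), which resolves with the list of site URLs found.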

I decided to replace the existing sitemap-parsing functionality of the crawler project described in my previous article with Sitemapper. Since the library was missing a few features that my project needed, I contributed them to Sean's repository.

New features

Below is the list of new features, along with the business justification behind each one:

  • Option to retry failed requests
    When crawling many URLs, errors (e.g. server errors) are very common. It’s good practice to retry failed requests, to confirm that a failure is permanent rather than a temporary glitch.
    You can now define how many retries are allowed for each failed request, using the “retries” parameter.
  • Concurrency limit / Rate limiting
    Requests for sitemap XML files may be monitored by firewalls. If you plan to parse the sitemaps of large websites, you need to limit the number of concurrent requests so that your IP address doesn’t get blocked for fetching a large number of files at once.
    You can now set how many sitemap URLs may be requested at the same time, using the “concurrency” parameter.
  • Error reporting
    If you still get errors when parsing sitemaps, even after retrying, it’s useful to get back a list of the errors that occurred, with an error code for each one. This allows you to follow them up manually or report them to another process for archiving purposes.
    Using the “returnErrors” parameter, you can configure Sitemapper to also return the list of errors that occurred.
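
To make the three options concrete, here is a self-contained sketch of the behaviour they describe: retry each failed request up to a given number of times, run at most a fixed number of requests at once, and collect errors instead of throwing. This is my own illustration, not Sitemapper's internal code, and the function names (fetchWithRetry, crawlAll) are invented for the example.

```javascript
// Retry a failing async task up to `retries` extra times.
// Resolves with { ok: true, value } on success, { ok: false, error } otherwise.
async function fetchWithRetry(task, retries) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return { ok: true, value: await task() };
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  return { ok: false, error: lastError };
}

// Run the tasks in batches of `concurrency`, retrying each one,
// and return both the successful results and the collected errors.
async function crawlAll(tasks, { concurrency, retries }) {
  const results = [];
  const errors = [];
  for (let i = 0; i < tasks.length; i += concurrency) {
    const batch = tasks
      .slice(i, i + concurrency)
      .map((task) => fetchWithRetry(task, retries));
    for (const outcome of await Promise.all(batch)) {
      if (outcome.ok) results.push(outcome.value);
      else errors.push(outcome.error);
    }
  }
  return { results, errors };
}
```

With Sitemapper itself, the equivalent behaviour is switched on through the “retries”, “concurrency”, and “returnErrors” parameters rather than written by hand.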

Getting started

You can find Sean’s sitemapper repository on GitHub and npm.

At the time of writing, the new features I added haven’t been merged into Sean’s GitHub repository yet (my pull request is awaiting review), but you can find them in my personal public fork, along with updated documentation explaining each new feature in more detail.

Panagiotis

Written By

Panagiotis (pronounced Panayotis) is a passionate G(r)eek with experience in digital analytics projects and website implementation. Fan of clear and effective processes, task automation, and problem-solving technical hacks. Hands-on experience with projects ranging from small businesses to enterprise-level companies, starting from communication with customers and ending with the transformation of business requirements into the final deliverable.