Parsing web sitemaps using JavaScript
Recently I came across an interesting project by Sean Thomas Burke called Sitemapper. It is a small library for parsing sitemap XML files to extract all the URLs they contain. Such functionality is essential when crawling websites, as the sitemap (usually) holds an up-to-date list of all the site’s URLs. In most cases this list is enough when designing a crawler, so you wouldn’t need to crawl the website manually to build a list of URLs.
Sitemap parser: Sitemapper
Sitemapper is a well-maintained, well-documented, open-source library offering the following features:
- Follows redirects
- Supports gzip sitemaps
- Debug mode
- Multi-threaded processing
- Timeout limit
- User-agent customization
I decided to replace the existing sitemap parsing functionality of the crawler project, described in my previous article, with Sitemapper. But since the library was missing several features my project needed, I added them to Sean’s repository.
New features
Below is the list of new features, along with the justification for each:
- Option to retry failed requests
When crawling multiple URLs, errors (e.g. server errors) are very common. It’s good practice to retry failed requests, to make sure an issue is permanent and not just a temporary failed attempt.
You now have the option to define the number of retries allowed for each failed request, using the “retries” parameter.
- Concurrency limit / rate limiting
Sitemap XML files might be monitored by firewalls. If you are planning to parse the sitemaps of large websites, you need to limit the number of concurrent requests, to make sure your IP address doesn’t get blocked for requesting a large number of files at once.
You now have the option to set how many sitemap URLs can be requested at the same time, using the “concurrency” parameter.
- Error reporting
If you still get errors when parsing sitemaps, even after retrying, it’s useful to get back a list of the errors that occurred, along with the error code for each one. This allows you to follow them up manually or report them to another process for archiving.
Using the “returnErrors” parameter, you can configure Sitemapper to return the list of errors too.
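To illustrate what the “retries” and “concurrency” options do in principle, here is a small sketch in plain JavaScript. This is not Sitemapper’s actual implementation; the function names and the batching strategy are my own illustration. Each failed request is re-attempted up to a retry limit, and URLs are processed in batches no larger than the concurrency cap, with failures collected into an errors list.

```javascript
// Retry a single request up to `retries` extra times before giving up.
async function fetchWithRetries(fetchFn, url, retries) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fetchFn(url);
    } catch (err) {
      lastError = err; // possibly a temporary failure: try again
    }
  }
  throw lastError; // permanent failure after all retries
}

// Fetch all URLs, at most `concurrency` at a time, collecting errors.
async function fetchAllLimited(fetchFn, urls, concurrency, retries) {
  const results = [];
  const errors = [];
  // Process the URL list in batches of `concurrency`.
  for (let i = 0; i < urls.length; i += concurrency) {
    const batch = urls.slice(i, i + concurrency).map((url) =>
      fetchWithRetries(fetchFn, url, retries)
        .then((res) => results.push(res))
        .catch((err) => errors.push({ url, error: String(err) }))
    );
    await Promise.all(batch); // wait for the batch before starting the next
  }
  return { results, errors };
}
```

With `retries: 1`, a URL that fails once but succeeds on the second attempt still ends up in `results` rather than `errors`, which is exactly the behaviour the option is meant to provide.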
Getting started
You can find Sean’s sitemapper repository on GitHub and npm.
At the time of writing, the new features I added haven’t been merged into Sean’s GitHub repository yet (my pull request is awaiting review), but you can find them in my personal public fork, along with updated documentation explaining each new feature in more detail.
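Putting it all together, here is a minimal usage sketch assuming the API shape described in this article. The sitemap URL is a placeholder, and to keep the snippet self-contained a tiny stub stands in for the real class; with the library installed you would write require('sitemapper') instead of the stub.

```javascript
// Stand-in stub so this snippet runs without npm; with the real library:
//   const Sitemapper = require('sitemapper');
class Sitemapper {
  constructor(options) { this.options = options; }
  async fetch() {
    // The real fetch() downloads and parses the sitemap; the stub just
    // echoes an empty result in the same shape: { url, sites, errors }.
    return { url: this.options.url, sites: [], errors: [] };
  }
}

const sitemap = new Sitemapper({
  url: 'https://www.example.com/sitemap.xml', // placeholder sitemap URL
  timeout: 15000,     // ms before a request is abandoned
  retries: 2,         // retry each failed request up to 2 times
  concurrency: 4,     // request at most 4 sitemap files at once
  returnErrors: true, // also return the list of errors that occurred
});

sitemap.fetch().then(({ sites, errors }) => {
  console.log(`Found ${sites.length} URLs, ${errors.length} errors`);
});
```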