I would like to introduce a little Python program that I wrote called SiteGloop! It is a site crawler that can take snapshots of your pages or warm your cache.
SiteGloop was written to solve a problem: whenever you flush the cache on your website, you often want to warm it back up afterwards. SiteGloop reads your site's
sitemap.xml file and requests each resource contained within it, forcing the web server to repopulate its cache.
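The core idea, reading a sitemap and collecting the URLs to request, can be sketched with the standard library alone. This is a hypothetical illustration of the approach, not SiteGloop's actual parser; the function name and namespace handling are my own.

```python
import xml.etree.ElementTree as ET

# The standard sitemap namespace; ElementTree folds it into each tag name.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text):
    """Extract the page URLs listed in a sitemap.xml document.

    A minimal sketch: every <loc> element in the sitemap holds one URL
    that a crawler would then request to warm the cache.
    """
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]
```

Feeding it a typical sitemap yields the list of pages to crawl; a real crawler would fetch the sitemap over HTTP first.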
Python Technologies Used
This project gave me the opportunity to explore some Python modules that I've been wanting to play with.
SiteGloop can be invoked in one of two modes: Quick or Screenshot.
Quick Mode

This is the default mode of SiteGloop. It runs asynchronously, using a set number of connections (100 by default, though configurable) to request each resource from the sitemap. Because the requests are made concurrently, it completes the task rather quickly. It also does not save the content returned by the web server, only the status code; since the returned content never needs to be rendered, there is no performance penalty for loading it.
Screenshot Mode

In this mode, SiteGloop retrieves each resource synchronously, renders the page via Selenium, and takes a screenshot of the rendered page. Finally, it generates output that can be browsed in a browser, showing each screenshot alongside some basic details about the capture.
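The two pieces of Screenshot Mode, capturing a page with Selenium and building a browsable summary, might look roughly like the sketch below. This is my own illustration under stated assumptions: the function names, the headless-Chrome setup, and the `(url, image, status)` record shape are hypothetical, not SiteGloop's actual API.

```python
import html

def screenshot(url, outfile):
    """Render a page in headless Chrome and save a screenshot.

    Requires Selenium and a matching browser driver on PATH; imported
    lazily so the gallery helper below works without them installed.
    """
    from selenium import webdriver

    opts = webdriver.ChromeOptions()
    opts.add_argument("--headless")
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get(url)
        driver.save_screenshot(outfile)
    finally:
        driver.quit()

def build_gallery(captures):
    """Build a simple browsable HTML page of screenshots.

    `captures` is a list of (url, image_filename, status) tuples; a
    guess at the kind of basic capture details the output might show.
    """
    items = []
    for url, image, status in captures:
        items.append(
            "<figure>"
            f'<img src="{html.escape(image)}" alt="screenshot">'
            f"<figcaption>{html.escape(url)} (HTTP {status})</figcaption>"
            "</figure>"
        )
    return "<html><body>" + "".join(items) + "</body></html>"
```

Writing the returned string to an `index.html` next to the screenshot files gives a page you can open directly in a browser.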
Source on GitHub
You can find the source repository at https://github.com/javiergayala/SiteGloop.