SiteGloop

Introducing SiteGloop

I would like to introduce a little Python program that I wrote called SiteGloop! It is a site crawler that can be used to take snapshots of your pages, or it can be used to warm your cache.

Background

SiteGloop was written to solve a specific problem: after you flush the cache on your website, you often want to warm it back up. SiteGloop reads your site’s sitemap.xml file and requests each resource listed in it, which forces the web server to repopulate its cache.
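To make the sitemap step concrete, here is a minimal sketch of pulling the URLs out of a sitemap using only the standard library. This is not SiteGloop's actual code; the function name `extract_urls` and the example sitemap are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> value found in a sitemap document.

    Hypothetical helper for illustration, not part of SiteGloop.
    """
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# A tiny hand-written sitemap standing in for a real sitemap.xml.
EXAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

urls = extract_urls(EXAMPLE)
```

Each URL in the resulting list is then a candidate for a warming request.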

Python Technologies Used

This project gave me the opportunity to explore some Python modules that I’ve been wanting to play with:

Uses

SiteGloop can be invoked in one of two modes: Quick or Screenshot.

Quick Mode

This is the default mode of SiteGloop. It runs asynchronously, using a set number of connections at a time (100 by default, though this is configurable) to request each resource from the sitemap. Because the requests are made asynchronously, the whole run completes rather quickly. It also does not save the content returned by the web server, only the status code. Since nothing needs to be rendered, there is no rendering overhead on top of the requests themselves.
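The idea of capping the number of requests in flight can be sketched with `asyncio.Semaphore`. This is only an illustration under my own assumptions, not how SiteGloop is implemented: the names `warm`, `fetch_status`, and `_status_of` are hypothetical, and the standard library's `urllib` running in worker threads stands in for a proper async HTTP client so that the sketch stays self-contained.

```python
import asyncio
import urllib.error
import urllib.request


def _status_of(url: str) -> int:
    """Blocking helper: request a URL and return only its status code."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code


async def fetch_status(url: str) -> int:
    # Assumption: run the blocking call in a worker thread. The default
    # thread pool is smaller than 100, so a real async HTTP client would
    # be needed to actually reach that many simultaneous connections.
    return await asyncio.to_thread(_status_of, url)


async def warm(urls, fetch=fetch_status, limit=100):
    """Request every URL with at most `limit` requests in flight."""
    gate = asyncio.Semaphore(limit)

    async def one(url):
        async with gate:  # wait here while `limit` requests are active
            return url, await fetch(url)

    # Only (url, status) pairs are kept; response bodies are discarded.
    return dict(await asyncio.gather(*(one(u) for u in urls)))
```

Calling `asyncio.run(warm(urls, limit=50))` would return a dictionary mapping each URL to its status code. Passing a different `fetch` coroutine makes it easy to swap in another HTTP client or a fake for testing.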

Screenshot Mode

In this mode, SiteGloop retrieves each resource synchronously. It then renders the webpage via Selenium and takes a screenshot of the rendered page. Finally, it creates output that can be viewed in a browser, showing the screenshot as well as some basic details about the capture.
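A rough sketch of that flow, assuming headless Chrome as the Selenium driver and a very simple HTML page as the browsable output. Both are my assumptions, as are the function names `capture` and `render_index` and the record fields; SiteGloop's real output format and driver choice may differ.

```python
import html
from pathlib import Path


def capture(urls, out_dir="captures"):
    """Visit each URL in turn and save a PNG screenshot of the rendered page."""
    from selenium import webdriver  # third-party; imported lazily

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # assumption: headless Chrome
    driver = webdriver.Chrome(options=options)
    records = []
    try:
        for index, url in enumerate(urls):  # one page at a time (synchronous)
            driver.get(url)
            image = f"{index:04d}.png"
            driver.save_screenshot(str(out / image))
            records.append({"url": url, "title": driver.title, "image": image})
    finally:
        driver.quit()  # always release the browser
    return records


def render_index(records):
    """Build a simple HTML page listing each capture with its details."""
    rows = "\n".join(
        f'<li><a href="{html.escape(r["url"])}">{html.escape(r["title"])}</a>'
        f'<br><img src="{html.escape(r["image"])}" width="400"></li>'
        for r in records
    )
    return f"<!doctype html>\n<ul>\n{rows}\n</ul>\n"
```

Writing the result of `render_index(capture(urls))` to an `index.html` file inside the same `captures` directory would give a page that can be opened directly in a browser.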

Source on GitHub

You can find the source repository at https://github.com/javiergayala/SiteGloop.

Conclusion

SiteGloop is very much a work in progress. Please feel free to give it a spin, and I am open to any and all feedback. You can submit issues here.
