Washington Web Scraping

Washington Data Scraping, Web Scraping Tennessee, Data Extraction Tennessee, Scraping Web Data, Website Data Scraping, Email Scraping Tennessee, Email Database, Data Scraping Services, Scraping Contact Information, Data Scrubbing

Tuesday, 16 December 2014

Git workflow for Scrapy projects

Our customers often ask us what’s the best workflow for working with Scrapy projects. A popular approach we have seen and used in the past is to split the spiders folder (typically project/spiders) into two folders: project/spiders_prod and project/spiders_dev, and use the SPIDER_MODULES setting to control which spiders are loaded on each environment. This works reasonably well, until you have to make changes to common code used by many spiders (ie. code outside the spiders folder), for example common base spiders.

Nowadays, DVCS (in particular, git) have become more popular and people are quite used to branching, so we recommend using a simple git workflow (similar to GitHub flow) where you branch for every change you make. You keep all changes in a branch while they’re being tested and finally merge to master when they’re finished. This means that master branch is always stable and contains only “production-ready” spiders.

If you are using our Scrapy Cloud platform, you can have 2 projects (myproject-dev, myproject-prod) and use myproject-dev to test the changes in your branch.  scrapy deploy in Scrapy 0.17 now adds the branch name to the version name (when using version=GIT or version=HG), so you can see which branch you are going to run directly on the panel. This is particularly useful with large teams working on a single Scrapy project, to avoid stepping into each other when making changes to common code.

Here is a concrete example to illustrate how this workflow works:y

•    the developer decides to work on issue 123 (could be a new spider or fixes to an existing spider)
•    the developer creates a new branch to work on the issue
•    git checkout -b issue123
•    the developer finishes working on the code and deploys to the panel (this assumes scrapy.cfg is configured with a deploy target, and using version=GIT – see here for more information)
•    scrapy deploy dev
•    the developer goes into the panel and runs the spider, where he’ll see the branch name (issue123) that will be run
•    the developer checks the scraped data looks fine through the item browser in the panel
•    whenever issues are found, the developer makes more fixes (always working on the same branch) and deploys new versions
•    once all issues are fixed, the developer merges the branch and deploys to production project
•    git checkout master
•    git merge issue123
•    git pull # make sure to pull latest code before deploying
•    scrapy deploy prod

We recommend you keep your common spiders well-tested and use Spider Contracts extensively to test your final spiders. Otherwise experience tell us that base spiders end up being copied (instead of reused) out of fear of breaking old spiders that depend on them, thus turning their maintenance into a nightmare.


No comments:

Post a Comment