This project was built for a client who wanted every law in the US Code scraped from Justia and uploaded to his WordPress website. It also mirrors Justia's whole navigational system (pages, categories, tags and hyperlinks) in WordPress.
The project is divided into two main scripts:
- JustiaURLFetcher.py - a simple, custom-made crawler that fetches the list of all URLs needed.
- JustiaScraping-WordpressUploading.py - fetches the actual data from the links that the previous script collected and uploads that data to WordPress while preserving navigation.
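
The crawling step can be sketched roughly as follows. This is a minimal illustration, not the code in JustiaURLFetcher.py: it uses the standard library's html.parser instead of BeautifulSoup4 so that it runs without extra dependencies, and the function name `extract_links`, the sample HTML and the URL layout under `/codes/us/` are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(html, base_url, path_prefix):
    """Return absolute same-host links under path_prefix, in page order, without duplicates."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(base_url).netloc
    seen, links = set(), []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        parts = urlparse(absolute)
        if parts.netloc != host or not parts.path.startswith(path_prefix):
            continue  # skip off-site links and unrelated pages
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


# Hypothetical page fragment; the real Justia markup may differ.
sample = """
<a href="/codes/us/2015/title-1/">Title 1</a>
<a href="/codes/us/2015/title-2/">Title 2</a>
<a href="/codes/us/2015/title-1/">Title 1 again</a>
<a href="https://example.org/elsewhere">Off-site</a>
"""
links = extract_links(sample, "https://law.justia.com/codes/us/2015/", "/codes/us/")
print(links)
```

A crawler would call a function like this on each fetched page and queue the returned URLs for the next round, writing the final list to one file per year.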
A file called 'years.7z' is also provided; it is the output of running JustiaURLFetcher.py. It contains one file per year of the US Code, and each file lists the URLs of that year's laws.
The scraping relies heavily on BeautifulSoup4, and the WordPress part relies heavily on the python-wordpress-xmlrpc library. On WordPress 0.70-3.4.2, the XML-RPC API must be enabled manually for the upload to work.
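
For orientation, uploading one post with python-wordpress-xmlrpc might look something like the sketch below. It is not taken from JustiaScraping-WordpressUploading.py: the helper names `build_post_fields` and `upload_post`, the endpoint URL, the credentials and the sample category and tag are placeholders assumed for the example. The library is imported inside `upload_post` only so that the field-building helper can be tried without it installed.

```python
def build_post_fields(title, html_body, categories, tags):
    """Collect the values for one post; terms are created by name if missing."""
    return {
        "title": title,
        "content": html_body,
        "post_status": "publish",
        "terms_names": {
            "category": list(categories),
            "post_tag": list(tags),
        },
    }


def upload_post(endpoint, username, password, fields):
    """Create a WordPress post over XML-RPC and return the ID of the new post."""
    # Lazy import: requires `pip install python-wordpress-xmlrpc`.
    from wordpress_xmlrpc import Client, WordPressPost
    from wordpress_xmlrpc.methods.posts import NewPost

    client = Client(endpoint, username, password)
    post = WordPressPost()
    post.title = fields["title"]
    post.content = fields["content"]
    post.post_status = fields["post_status"]
    post.terms_names = fields["terms_names"]
    return client.call(NewPost(post))


# Hypothetical sample values, for illustration only.
fields = build_post_fields(
    "Section 1",
    "<p>Text of the law goes here.</p>",
    categories=["Title 1"],
    tags=["2015"],
)
print(fields["terms_names"])

# Placeholder endpoint and credentials; uncomment to upload to a real site.
# post_id = upload_post("https://example.com/xmlrpc.php", "user", "password", fields)
```

Passing category and tag names through `terms_names` is one way to carry Justia's grouping over to WordPress; hyperlinks inside the post body would still need to be rewritten separately.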