Scrape Public Domain Legislation (10 Laws) and Output CSV

I need a Python script to scrape public domain legislation (10 laws) and output a CSV file for each law, containing each individual article and its contents (that is, the script has to strip the HTML markup).

This should be an extremely easy project to code (using regex and an HTML parser) and shouldn’t take more than a few minutes.

----- Sample HTML -----
<p><strong><a name="a3">Some article 3.</a></strong> </p>
<p>contents of a3</p>
<p>More contents.</p>
<center><h4><a name="c2">Chapter II.</a><br>PARSER IGNORES EVERY H4</h4></center>
<p><strong><a name="a4">A New article 4.</a></strong> </p>
<p>contents of a4</p>
<p>More contents.</p>

----- End of sample -----
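To make the expected behaviour concrete, here is a minimal sketch of the parsing step using only the Python standard library. It assumes that every article starts at an anchor whose name is "a" followed by digits (as in the sample), that each following <p> belongs to that article, and that a two-column CSV (article title, contents joined by newlines) is acceptable. The names ArticleParser, parse_articles and write_csv are made up for illustration and are not part of any requirement.

```python
import csv
import io
import re
from html.parser import HTMLParser

# The sample from this posting, used as test input.
SAMPLE = """
<p><strong><a name="a3">Some article 3.</a></strong> </p>
<p>contents of a3</p>
<p>More contents.</p>
<center><h4><a name="c2">Chapter II.</a><br>PARSER IGNORES EVERY H4</h4></center>
<p><strong><a name="a4">A New article 4.</a></strong> </p>
<p>contents of a4</p>
<p>More contents.</p>
"""


class ArticleParser(HTMLParser):
    """Collect one record per article anchor; ignore everything inside <h4>."""

    def __init__(self):
        super().__init__()
        self.articles = []       # dicts with keys: name, title, contents
        self._in_h4 = False
        self._in_title = False   # True while inside an article's <a name="aN">
        self._in_p = False
        self._buf = []           # text pieces of the current <p>

    def handle_starttag(self, tag, attrs):
        if tag == "h4":
            self._in_h4 = True
        elif tag == "a" and not self._in_h4:
            name = dict(attrs).get("name")
            # Assumption: article anchors look like a3, a4, ...
            if name and re.fullmatch(r"a\d+", name):
                self.articles.append({"name": name, "title": "", "contents": []})
                self._in_title = True
        elif tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "h4":
            self._in_h4 = False
        elif tag == "a":
            self._in_title = False
        elif tag == "p" and self._in_p:
            self._in_p = False
            text = " ".join("".join(self._buf).split())  # collapse whitespace
            if text and self.articles:
                self.articles[-1]["contents"].append(text)

    def handle_data(self, data):
        if self._in_h4:
            return
        if self._in_title:
            self.articles[-1]["title"] += data
        elif self._in_p:
            self._buf.append(data)


def parse_articles(html):
    """Return the list of article records found in an HTML string."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return parser.articles


def write_csv(articles, fileobj):
    """Write one row per article: title, then paragraphs joined by newlines."""
    writer = csv.writer(fileobj)
    writer.writerow(["article", "contents"])
    for art in articles:
        writer.writerow([art["title"].strip(), "\n".join(art["contents"])])
```

For one law, parse_articles would be called on each downloaded page and the combined list passed to write_csv with a file opened as open("law.csv", "w", newline="", encoding="utf-8"); the file name is a placeholder.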

I will provide a link to the index page of each law; the script should extract the links from that index, follow them, and parse the linked pages. (URLs via PMB.)
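The index-following step could look roughly like the sketch below, again with the standard library only. Since the real URLs are not posted here, http://example.org/law/index.html is a placeholder, and the rules used (keep only links on the same host as the index, drop #fragments, skip duplicates) are assumptions about how the index pages are laid out. LinkCollector, extract_links and fetch are illustrative names.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect absolute, de-duplicated link targets from an index page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return  # e.g. <a name="..."> anchors have no href
        # Resolve relative links and drop the #fragment part.
        url, _fragment = urldefrag(urljoin(self.base_url, href))
        # Assumption: pages of a law live on the same host as its index.
        same_host = urlparse(url).netloc == urlparse(self.base_url).netloc
        if same_host and url != self.base_url and url not in self.links:
            self.links.append(url)


def extract_links(index_html, base_url):
    """Return the page URLs linked from an index, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(index_html)
    collector.close()
    return collector.links


def fetch(url):
    """Download a page and decode it, falling back to UTF-8."""
    with urlopen(url) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

A run for one law would then be extract_links(fetch(index_url), index_url) followed by fetch on each returned URL, with the article parsing applied to every page.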

I will need the script code as well as the resulting CSV files.

Thanks
