(Not) scraping Wikipedia – the easy way

Q: How can I download an HTML-formatted Wikipedia article without navigation and theming?

A: Try this URL pattern: http://en.wikipedia.org/w/index.php?action=render&title=Helsinki

Today, I spent over an hour looking for this little URL parameter trick. What led me astray was that it’s not in the most obvious place you’d look: the MediaWiki API. Sure, you can get a parsed article out of the API, but the HTML comes wrapped in the API’s response format (XML, in my case) with the characters escaped. The action=render parameter on index.php gives you just the article HTML, with no navigation or theming around it. (There will still be infoboxes and edit links, but those are part of the page content.)
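
For anyone who wants to script this, here is a minimal Python sketch of the same URL pattern. The helper names (render_url, fetch_rendered_html), the https base URL, the timeout, and the User-Agent string are my own assumptions for illustration and are not part of MediaWiki. Only the action=render and title parameters come from the URL above.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: https endpoint of English Wikipedia; swap in another wiki's index.php as needed.
BASE_URL = "https://en.wikipedia.org/w/index.php"


def render_url(title, base_url=BASE_URL):
    """Build the action=render URL for an article title (hypothetical helper)."""
    # urlencode escapes the title, e.g. spaces become "+".
    query = urlencode({"action": "render", "title": title})
    return f"{base_url}?{query}"


def fetch_rendered_html(title, base_url=BASE_URL, timeout=10):
    """Download the bare article HTML as a string (hypothetical helper)."""
    request = Request(
        render_url(title, base_url),
        # Assumption: a descriptive User-Agent; replace with your own tool name and contact.
        headers={"User-Agent": "render-example/0.1 (contact: you@example.com)"},
    )
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 if the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

Calling fetch_rendered_html("Helsinki") should return the article body as an HTML fragment, which you can then hand to whatever HTML parser you prefer.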
