I needed to get a list of common tags for OpenStreetMap from this page: http://wiki.openstreetmap.org/wiki/Map_Features

When I viewed the wiki markup source, I found that each section is pulled in from a template, which is a separate page. The OpenStreetMap wiki runs on MediaWiki, so I skimmed the API documentation to find out how to download an article as JSON.

I wrote this shell script to parse the Map Features page and to scrape all the included template pages for Tag: pairs:


# Fetch the content of one wiki page through the MediaWiki API.
load () {
  curl -sS "http://wiki.openstreetmap.org/w/api.php?titles=$1&action=query\
&prop=revisions&rvprop=content&format=json" \
    | jsonstream 'query.pages.*.revisions.0' | json 0.\*
}

load Map_Features \
  | sed '0,/^== Primary features/ d; 0,/^$/ {/^{{Map_Features:/ !d
      s/^{{Map_Features://; s/}}$// p}; d' \
  | while read page; do
    load "Template:Map_Features:$page" \
      | grep "Tag:$page=" | sed 's/.*Tag://; s/}}.*//'
  done \
  | sed 's/=/,/; s/\]\].*//'

The result is a two-column CSV file: category,feature.
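The cleanup stages at the end of the pipeline can be tried on their own. This is only a sketch: the two input lines are made up for illustration and are not copied from the wiki, and clean_tags is a hypothetical helper name that simply chains the sed expressions from the script above.

```shell
# clean_tags is a hypothetical wrapper around the script's sed expressions.
clean_tags () {
  # Keep what follows the last "Tag:" and cut at a closing "}}",
  # then turn the first "=" into a comma and cut at a closing "]]".
  sed 's/.*Tag://; s/}}.*//' | sed 's/=/,/; s/\]\].*//'
}

# Made-up sample lines standing in for template content.
printf '%s\n' \
  '| [[Tag:amenity=bar]] || A bar' \
  '| {{Tag:shop=bakery}} || A bakery' \
  | clean_tags
```

Each input line comes out as one category,feature row.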

The script uses json and jsonstream from npm: npm install -g json jsonstream.

I learned a neat trick from the sed manual. An address range such as /first/,/second/ matches from one pattern through another, and in GNU sed 0,/regex/ matches from the start of the input through the first line that matches /regex/. So you can cut a section out of some text by deleting through /first/, printing through /second/, and deleting everything after that: 0,/first/ d; 0,/second/ p; d. To do more than print the matches, use a {} block to run several commands on the cut-out text: 0,/first/ d; 0,/second/ { cmd1; cmd2; cmd3 }; d
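Here is a small, self-contained demonstration of that cut-out pattern. It assumes GNU sed, since the 0,/regex/ address form is a GNU extension. The function names and the sample input are invented for this example.

```shell
# Assumes GNU sed (0,/regex/ is a GNU extension).

# Delete through the first line matching /first/, print through the
# next line matching /second/, then delete everything after it.
cut_section () {
  sed '0,/first/ d; 0,/second/ p; d'
}

# Same cut, but run several commands on the cut-out text:
# drop the closing marker line and prefix each remaining line.
cut_section_block () {
  sed '0,/first/ d; 0,/second/ { /second/ d; s/^/- /; p }; d'
}

printf '%s\n' intro first alpha beta second outro | cut_section
printf '%s\n' intro first alpha beta second outro | cut_section_block
```

Note that the /first/ line itself is deleted while the /second/ line is printed, which is why the block variant removes it explicitly.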

Here are the results: features.csv.