I needed to get a list of common tags for OpenStreetMap from this page: http://wiki.openstreetmap.org/wiki/Map_Features
When I clicked view source to see the wiki markup, the sections were all contained in templates, which are separate pages. The OpenStreetMap wiki runs on MediaWiki, so I skimmed the API docs to find out how to download an article as JSON.
I wrote this shell script to parse the Map Features page and scrape all the included template pages for Tag: pairs:
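#!/bin/bash

# Fetch the raw wikitext of one wiki page through the MediaWiki API.
load () {
  curl -sS "http://wiki.openstreetmap.org/w/api.php?titles=$1&action=query&prop=revisions&rvprop=content&format=json" \
    | jsonstream 'query.pages.*.revisions.0' | json 0.\*
}

# Keep only the {{Map_Features:...}} template names that appear between
# the "== Primary features" heading and the next blank line.
load Map_Features \
  | sed '0,/^== Primary features/ d; 0,/^$/ {/^{{Map_Features:/ !d
    s/^{{Map_Features://; s/}}$// p}; d' \
  | while read page; do
      # Keep lines that mention Tag:<page>= and strip the text around the tag.
      load "Template:Map_Features:$page" \
        | grep "Tag:$page=" | sed 's/.*Tag://; s/}}.*//'
    done \
  | sed 's/=/,/; s/\]\].*//'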
The result is a 2-column CSV file: category,feature.
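To make the last two stages of the pipeline concrete, here is a small offline sketch that runs the same grep and sed expressions over a few made-up lines. The extract_tags helper and the sample lines are assumptions for illustration only; real template pages may be laid out differently.

```shell
#!/bin/bash
# Hypothetical helper: the same grep/sed steps the script applies to each
# template page, wrapped in a function so they can be tried without the network.
extract_tags () {
  local page=$1
  grep "Tag:$page=" \
    | sed 's/.*Tag://; s/}}.*//' \
    | sed 's/=/,/; s/\]\].*//'
}

# Made-up sample lines, not copied from the wiki.
sample='| [[{{{bar:value|Tag:amenity=bar}}}|bar]]
| [[Tag:amenity=cafe]] || a place to drink coffee
| some unrelated table cell'

printf '%s\n' "$sample" | extract_tags amenity
# prints:
# amenity,bar
# amenity,cafe
```

The first sed drops everything up to Tag: and anything from }} onward; the second turns the first = into a comma and trims a trailing ]] and whatever follows it.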
The script uses json and jsonstream from npm (npm install -g json jsonstream).
I learned a neat thing by reading the sed manual: besides matching from one pattern until another pattern, GNU sed accepts 0,/regex/, which starts matching at the beginning of the input and goes until the first line where /regex/ matches. So you can cut out a section of some text by deleting until you get to /first/, printing until you get to /second/, and deleting everything past /second/: 0,/first/ d; 0,/second/ p; d. If you want to do more than print the matches, you can use a block with {} to execute multiple commands on the cut-out text: 0,/first/ d; 0,/second/ { cmd1; cmd2; cmd3 }; d
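Here is a minimal sketch of both forms on a made-up input, assuming GNU sed (the 0,/regex/ address is a GNU extension, so BSD/macOS sed will reject it). The function names and the sample text are hypothetical.

```shell
#!/bin/bash
# Made-up input; "first" and "second" stand in for real marker lines.
text='intro
first
keep one
keep two
second
outro'

# Delete through the first marker, print up to and including the second
# marker, and drop everything after it.
cut_section () {
  sed '0,/first/ d; 0,/second/ p; d'
}

# Same cut, but run several commands on the cut-out lines:
# drop the end marker itself, rewrite the rest, then print.
cut_and_edit () {
  sed '0,/first/ d; 0,/second/ { /second/ d; s/^keep /kept: /; p }; d'
}

printf '%s\n' "$text" | cut_section
# prints:
# keep one
# keep two
# second

printf '%s\n' "$text" | cut_and_edit
# prints:
# kept: one
# kept: two
```

Note that the line matching /second/ is part of the printed range, so the block form is handy when the end marker itself should be dropped. The trailing d also suppresses sed's automatic printing, which is why no -n flag is needed.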
Here are the results: features.csv.