Geocoding with BingMaps in Python

Jun 28, 2012. | By: Henok Mikre

It is convenient to have a geocoded dataset to work with when developing visualization tools. Although batch geocoding should be done in methods other than webservice calls, it is nice to have a command-line tool that we can call as needed. Below is a description of one such tool. Here is also included descriptions of various scripts utilized to prepare the dataset.

Let’s play with a dataset from NYC Open Data. One of the dataset that is available is SAT Scores for NYC public schools. Let’s download the CSV from here: https://nycopendata.socrata.com/Education/SAT-College-Board-2010-School-Level-Results/zt9s-n5aj

wget https://nycopendata.socrata.com/api/views/zt9s-n5aj/rows.csv?accessType=DOWNLOAD -O raw_dataset.csv

The first line of the file contains the column names. Let’s use that information to create a database table in which we can save it.

CREATE TABLE ds_NYC_SAT_2010 (
	id int(11) AUTO_INCREMENT PRIMARY KEY,
	DBN varchar(255),
	SchoolName varchar(255),
	NumTestTakers varchar(255),
	CriticalReading varchar(255),
	Math varchar(255),
	Writing varchar(255),
	Street varchar(255),
	City varchar(32),
	State varchar(15),
	Zipcode varchar(10),
	Lat varchar(32),
	Lon varchar(32)
);

In order to load the file into this table, we will need to remove the first line (the column names) and put the file in the /tmp directory so the mysql user can access it.

sed 1d dataset.csv > /tmp/NYC_SAT_2010-1d.csv

Let’s load it into the table:

LOAD DATA LOCAL INFILE '/tmp/NYC_SAT_2010.csv'
INTO TABLE ds_NYC_SAT_2010
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(DBN,SchoolName,NumTestTakers,CriticalReading,Math,Writing);

It just so happens that this dataset does not contain addresses for the schools. The addresses for the schools are available on school-specific pages like this one: http://schools.nyc.gov/SchoolPortals/01/M292/default.htm. However, the dataset contains what they call a DBN. That number happens to be contained in the URL. So what we can do is get the DBN from our newly created table and parse the contents of the URL. We will use the HTMLParser class in Python (http://docs.python.org/library/htmlparser.html).

#!/usr/bin/python

import sys
import urllib2
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

  def __init__(self):
    HTMLParser.__init__(self)
    self._tag = ""
    self._name = ""
    self._value = ""

    self.street = ""
    self.city = ""
    self.state = ""
    self.zipcode = ""

  def handle_starttag(self, tag, attrs):
    self._tag = tag

    for name, value in attrs:
      self._value = value

  def handle_data(self, data):
    if self._tag == 'span':
      if 'lblSchoolStreet' in self._value:
        if self.street == "":
          self.street = data
      elif 'lblSchoolCity' in self._value:
        if self.city == "":
          self.city = data
      elif 'lblSchoolState' in self._value:
        if self.state == "":
          self.state = data
      elif 'lblSchoolZip' in self._value:
        if self.zipcode == "":
          self.zipcode = data

def GetAddressByNumber(DBN):
  # declare list to hold the address
  addrList = list()

  # extract the first two characters
  sid = DBN[:2]

  # extract the rest of the string
  snum = DBN[len(sid):]

  domain = "http://schools.nyc.gov/"
  url = domain + "SchoolPortals/{0}/{1}/default.htm".format(sid, snum)

  content = urllib2.urlopen(url).read()

  parser = MyHTMLParser()
  parser.feed(content)

  addrList.append(parser.street)
  addrList.append(parser.city)
  addrList.append(parser.state)
  addrList.append(parser.zipcode)

  parser.close()

  return addrList

Once we have the addresses, we can fetch the geocodes. There’s a script at http://rbrundritt.wordpress.com/2010/01/23/accessing-bing-maps-web-services-with-perl-and-python/ that we can modify and use for that purpose. But we will need to install a few libraries before we can call the web service.

sudo apt-get install python-setuptools
wget https://fedorahosted.org/releases/s/u/suds/python-suds-0.3.7.tar.gz
tar -xzf python-suds-0.3.7.tar.gz
cd python-suds-0.3.7
sudo python setup.py install

And here’s the script that has been modified for our purpose:

#!/usr/bin/python

import sys
import urllib2
from suds.client import Client
from suds.client import WebFault

def GetGeocode(key, arrAddr):
  _geocode = list()
  application_key = key

  domain = 'http://dev.virtualearth.net'
  virdir = '/webservices/v1/geocodeservice/'
  wsdl = 'geocodeservice.svc?wsdl'
  url = domain + virdir + wsdl

  c = Client(url)

  greq = c.factory.create('GeocodeRequest')

  #Credentials
  cred = c.factory.create('ns0:Credentials')
  cred.ApplicationId = application_key
  greq.Credentials = cred

  #Address
  address = c.factory.create('ns0:Address')

  # street
  address.AddressLine = arrAddr[0]

  # city
  address.Locality = arrAddr[1]

  # state
  address.AdminDistrict = arrAddr[2]

  # country
  address.CountryRegion = "USA"

  greq.Address = address

  #Select 1st port and set request options:
  c.set_options(port='BasicHttpBinding_IGeocodeService')

  try:
    response = c.service.Geocode(greq)
  except WebFault, e:
    print "ERROR!"
    print e

  locations = response['Results']['GeocodeResult'][0]['Locations']['GeocodeLocation']

  for loc in locations:
    _geocode.append(loc['Latitude'])
    _geocode.append(loc['Longitude'])

  return _geocode

Of course, we will need to save all this data to our table. In order to use the Python MySQL libraries, we need to install them.

# install python-dev (needed to build MySQL-Python)
sudo apt-get install python-dev

sudo apt-get install libmysqlclient-dev

# download MySQL-python
wget http://downloads.sourceforge.net/project/mysql-python/mysql-python/1.2.3/MySQL-python-1.2.3.tar.gz?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Fmysql-python%2F&ts=1340730391&use_mirror=superb-sea2 -o MySQL-python-1.2.3.tar.gz
tar xzf MySQL-python-1.2.3.tar.gz
cd MySQL-python-1.2.3
sudo python setup.py build
sudo python setup.py install

And here’s the code for retrieving and saving the data to our table:

#!/usr/bin/python

import sys
import MySQLdb as mdb

def GetDBN(dbhost, dbuser, dbpass, dbname, limit):
  conn = None
  dbnlist = list()

  try:
    conn = mdb.connect(dbhost, dbuser, dbpass, dbname)
    cursor = conn.cursor()

    cursor.execute("""
      SELECT DBN
      FROM ds_NYC_SAT_2010
      LIMIT %s
      """, (limit)
    )

    rows = cursor.fetchall()

    for row in rows:
      dbnlist.append(row[0])

    cursor.close()
    conn.commit()

    return dbnlist

  except mdb.Error, e:
    print "Error %d: %s" % (e.args[0], e.args[1])
    sys.exit(1)

  finally:
    if conn:
      conn.close()

def UpdateTable(dbhost, dbuser, dbpass, dbname, arrAddr, DBN):

  conn = None

  try:
    conn = mdb.connect(dbhost, dbuser, dbpass, dbname)
    cursor = conn.cursor()

    cursor.execute("""
      UPDATE ds_NYC_SAT_2010
      SET
        Street=%s,
        City=%s,
        State=%s,
        Zipcode=%s,
        Lat=%s,
        Lon=%s
      WHERE DBN=%s
      """,
      (
        arrAddr[0],
        arrAddr[1],
        arrAddr[2],
        arrAddr[3],
        arrAddr[4],
        arrAddr[5],
        DBN
      )
    )
    cursor.close()
    conn.commit()
  except mdb.Error, e:
    print "Error %d: %s" % (e.args[0], e.args[1])
    sys.exit(1)

  finally:
    if conn:
      conn.close()

Now that we have all our tasks in place, we can call them from a driver:

#!/usr/bin/python

import sys
from Database import GetDBN
from Database import UpdateTable
from AddressParser import GetAddressByNumber
from Geocode import GetGeocode

def main(argv=None):
  """Main function to drive the process.
  """
  dbhost = '_server'
  dbuser = '_user'
  dbpass = '_pass'
  dbname = "_dbname"
  apikey = "AAAAAABBBBBBBBBBBCCCCCCCCCCCCCCCCDDDDDDDDDDDDDDDDEEEEEEEEEEEEFFF"

  # Get a list of DBNs from DB table.
  numbers = GetDBN(dbhost, dbuser, dbpass, dbname, 5)

  # For each DBN, get address, get geocode, and update table.
  for number in numbers:
    addrs = GetAddressByNumber(number)
    geocode = GetGeocode(apikey, addrs)
    addrs.append(geocode[0])
    addrs.append(geocode[1])
    UpdateTable(dbhost, dbuser, dbpass, dbname, addrs, number)

  return 0

if __name__ == "__main__":
  sys.exit(main())

That’s it! Now we can go ahead and build the cool stuff…the tools that can make this data useful!

Resources:

Subscribe

Subscribe to this blog via RSS.

Recent Posts

About

BLEN Corp is a small, minority, and veteran-owned information technology firm located in Washington D.C. Since 2004, we have been ahead of the curve in early adaption and implementation of cutting edge technologies including web and mobile development, service-oriented architecture, and other innovative web based solutions. Look at some of our projects.

Social Links

Our Bunker

WeWork Wonder Bread Factory
641 S St. NW
3rd Floor
Washington, DC 20001
United States.