Archive for May, 2008

Dealing with UTF-8 with appengine’s bulk loading

We just uploaded our first app to Google Appengine’s servers.

It’s called YouDIG – You Draw I Guess – and we believe it to be a fun game to play online.

Since it’s a word-based game, we needed an easy way to upload new Riddles to the online app. So I created a very simple bulkload.py which just calls the ImportCSV function from the SDK. The problem was that it doesn’t work with UTF-8 files!

Here’s what I did (I’m still a Python newbie, so feel free to send me your comments/suggestions):

1 – Edited the google/appengine/tools/bulkload_client.py

def ContentGenerator
  ....
  if rows_written > 0:
    yield rows_written, unicode(content.getvalue(), 'utf-8')

def PostEntities
  ....
  body = urllib.urlencode({
    constants.KIND_PARAM: kind,
    constants.CSV_PARAM: content.encode('utf-8'),
  })

This decodes the CSV content into a unicode object and then explicitly encodes it as UTF-8 before sending the POST request.
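To make the decode-then-encode step easier to try outside the SDK, here is a minimal standalone sketch. It is not the SDK's code: the function name build_post_body and the 'kind' and 'csv' field names are placeholders invented for the example (the real client uses constants.KIND_PARAM and constants.CSV_PARAM), and the import fallback is only there so the snippet also runs on Python 3.

```python
try:
    from urllib.parse import urlencode, parse_qs   # Python 3
except ImportError:
    from urllib import urlencode                   # Python 2
    from urlparse import parse_qs


def build_post_body(kind, csv_bytes):
    """Decode raw UTF-8 CSV bytes, then re-encode them for a form POST.

    'kind' and 'csv' are placeholder field names for this sketch only.
    """
    text = csv_bytes.decode('utf-8')  # raises UnicodeDecodeError on bad input
    # Encode explicitly so urlencode percent-escapes the UTF-8 bytes.
    return urlencode({'kind': kind, 'csv': text.encode('utf-8')})


# "maçã,1,pt" followed by a newline, given as raw UTF-8 bytes.
body = build_post_body('Riddle', b'ma\xc3\xa7\xc3\xa3,1,pt\n')
```

The non-ASCII characters end up in the body as percent-escaped UTF-8 bytes, so the receiving side can decode them as UTF-8 again (for example with parse_qs).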

2 – Created my Loaders (similar to the ones described in the docs):

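from google.appengine.ext import bulkload
import mybulkload
# Riddle, Language and Category are the app's own model classes
# (their import is omitted here).

class RiddleLoader(bulkload.Loader):

  def HandleEntity(self, entity):
    # Mark every imported riddle as approved.
    entity['approved'] = True
    return entity

  def __init__(self):
    # (property name, converter) pairs describing the CSV columns.
    bulkload.Loader.__init__(self, 'Riddle',
      [('word', Riddle.lowerCase), ('level', str),
       ('language', Language.get_key_by_code),
       ('category', Category.get_key_by_sys_name)])

if __name__ == '__main__':
  mybulkload.main(RiddleLoader())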

This is a simple loader that extends bulkload.Loader. “Language.get_key_by_code” and “Category.get_key_by_sys_name” are static functions that look up an entity’s key from a string. This way I can import Languages, Categories and Riddles and set up the relations between them using simple string keys (since I don’t know an entity’s key before it’s saved!).
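As a rough illustration of the string-key idea (a sketch, not the SDK's code and not my actual models), the snippet below applies a list of (property name, converter) pairs to one CSV row. The LANGUAGE_KEYS dictionary is a hypothetical in-memory stand-in for the datastore lookup, and convert_row is a name made up for this example.

```python
# Hypothetical stand-in for the datastore: language code -> stored key.
LANGUAGE_KEYS = {'pt': 'language-key-1', 'en': 'language-key-2'}


def get_key_by_code(code):
    """Return the key of an already imported language, given its code."""
    return LANGUAGE_KEYS[code.strip().lower()]


# (property name, converter) pairs, in CSV column order.
PROPERTIES = [
    ('word', lambda s: s.lower()),
    ('level', str),
    ('language', get_key_by_code),
]


def convert_row(row):
    """Apply each column's converter to the matching CSV cell."""
    if len(row) != len(PROPERTIES):
        raise ValueError('expected %d columns, got %d'
                         % (len(PROPERTIES), len(row)))
    return dict((name, convert(cell))
                for (name, convert), cell in zip(PROPERTIES, row))


entity = convert_row(['Apple', '2', 'en'])
```

The CSV file only needs to contain the short code ('en'), and the converter swaps it for the real key at import time. This only works if the referenced Languages and Categories are imported before the Riddles.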

The main difference from standard bulk loading is in the main method: it calls “mybulkload”, which contains a “MyBulkLoad” class that extends “BulkLoad” and can receive UTF-8 CSV POST data.

3 – The mybulkload package:

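import csv
import httplib
import StringIO
import sys
import traceback
import wsgiref.handlers

from google.appengine.api import datastore
from google.appengine.ext import webapp
# These three names live in the SDK's bulkload __init__.py that this file
# was copied from.
from google.appengine.ext.bulkload import BulkLoad, Loader, Validate

def utf_8_encoder(unicode_csv_data):
  # The Python 2 csv module only handles byte strings, so feed it UTF-8.
  for line in unicode_csv_data:
    yield line.encode('utf-8')

class MyBulkLoad(BulkLoad):
  """ A handler for bulk load requests.
  """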

  def Load(self, kind, data):
    Validate(kind, basestring)
    Validate(data, basestring)
    output = []

    try:
      loader = Loader.RegisteredLoaders()[kind]
    except KeyError:
      output.append('Error: no Loader defined for kind %s.' % kind)
      return (httplib.BAD_REQUEST, ''.join(output))

    buffer = StringIO.StringIO(data)
    reader = csv.reader(utf_8_encoder(buffer), skipinitialspace=True)

    entities = []

    line_num = 1

    for row in reader:
      try:
        output.append('\nLoading from line %d...' % line_num)
        # Decode every cell back to unicode before building the entity.
        entities.extend(loader.CreateEntity([unicode(cell, 'utf-8') for cell in row]))
        output.append('done.')
      except:
        exc_info = sys.exc_info()
        # format_exception returns a list of strings, so join it.
        stacktrace = ''.join(traceback.format_exception(*exc_info))
        output.append('error:\n%s' % stacktrace)
        return (httplib.BAD_REQUEST, ''.join(output))

      line_num += 1

    for entity in entities:
      datastore.Put(entity)

    return (httplib.OK, ''.join(output))

def main(*loaders):
  """Starts bulk upload.

  Raises TypeError if not at least one Loader instance is given.

  Args:
    loaders: One or more Loader instances.
  """
  if not loaders:
    raise TypeError('Expected at least one argument.')

  for loader in loaders:
    if not isinstance(loader, Loader):
      raise TypeError('Expected a Loader instance; received %r' % loader)

  application = webapp.WSGIApplication([('.*', MyBulkLoad)])
  wsgiref.handlers.CGIHandler().run(application)

if __name__ == '__main__':
  main()

I just copied this stuff from the __init__.py in google/appengine/ext/bulkload, added the utf_8_encoder function and extended the BulkLoad class, overriding the Load method.

Here’s what I used:

reader = csv.reader(utf_8_encoder(buffer), skipinitialspace=True)

to encode the data sent to the CSV reader, and:

entities.extend(loader.CreateEntity([unicode(cell,'utf-8') for cell in row]))

to decode every cell back to unicode before creating the entities.
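The two lines above can be tried without App Engine. The sketch below is a standalone version of the same encode, parse, decode round trip. The function name read_unicode_csv is made up for this example, and the Python 3 branch is an addition that is not part of my handler: there the csv module reads text directly, so no extra step is needed.

```python
import csv
import io
import sys

PY2 = sys.version_info[0] == 2


def utf_8_encoder(unicode_lines):
    """Yield each unicode line as a UTF-8 encoded byte string."""
    for line in unicode_lines:
        yield line.encode('utf-8')


def read_unicode_csv(text):
    """Parse unicode CSV text into a list of rows of unicode cells."""
    lines = io.StringIO(text)
    if PY2:
        # Python 2's csv module needs bytes: encode the lines going in,
        # then decode every cell coming out.
        reader = csv.reader(utf_8_encoder(lines), skipinitialspace=True)
        return [[cell.decode('utf-8') for cell in row] for row in reader]
    # Python 3's csv module already works on text.
    return [list(row) for row in csv.reader(lines, skipinitialspace=True)]


# u'ma\xe7\xe3' is "maçã", written with escapes so that the source file
# does not need an encoding declaration.
rows = read_unicode_csv(u'ma\xe7\xe3, 1, pt\n')
```

With skipinitialspace=True the blanks after each comma are dropped, so the cells come back as "maçã", "1" and "pt".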

Perhaps there was an easier way but this is working for me so I hope this can help some of you.

Next I’ll write an entity eraser to bulk delete entities from the AppEngine’s production servers…