Dealing with UTF-8 in App Engine’s bulk loading

We just uploaded our first app to Google App Engine’s servers.

It’s called YouDIG - You Draw I Guess - and we believe it to be a fun game to play online.

Since it’s a word-based game, we needed an easy way to upload new Riddles to the online app. So I created a very simple bulkload.py that just calls the ImportCSV function from the SDK. The problem was that it doesn’t work with UTF-8 files!

Here’s what I did (I’m still a Python newbie, so feel free to send me your comments/suggestions):

1 - Edited google/appengine/tools/bulkload_client.py

def ContentGenerator
  ....
  if rows_written > 0:
    # decode the CSV chunk from UTF-8 bytes into a unicode object
    yield rows_written, unicode(content.getvalue(), 'utf-8')

def PostEntities
  ....
  body = urllib.urlencode({
    constants.KIND_PARAM: kind,
    # encode back to UTF-8 bytes so urlencode can escape them
    constants.CSV_PARAM: content.encode("utf-8"),
    })

This decodes each chunk of CSV content to unicode and then encodes it as UTF-8 before sending the POST request.
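To see the decode-then-encode round trip in isolation, here is a minimal sketch. It is written for Python 3 (the SDK code above is Python 2, where unicode(s, 'utf-8') plays the role of bytes.decode). The function name build_post_body and the parameter names "kind" and "csv" are placeholders I picked for the example; the SDK takes the real names from its constants module.

```python
from urllib.parse import parse_qs, urlencode


def build_post_body(kind, raw_csv_bytes):
    """Decode raw CSV bytes as UTF-8, then build a form-encoded POST body.

    Hypothetical helper: "kind" and "csv" stand in for the SDK's
    constants.KIND_PARAM and constants.CSV_PARAM.
    """
    # Step 1: bytes -> text. This fails early if the file is not valid UTF-8.
    text = raw_csv_bytes.decode("utf-8")
    # Step 2: text -> UTF-8 bytes, which urlencode percent-escapes.
    return urlencode({"kind": kind, "csv": text.encode("utf-8")})


# "ma\u00e7\u00e3" is the Portuguese word for apple, used as non-ASCII sample data.
raw = "ma\u00e7\u00e3,1\n".encode("utf-8")
body = build_post_body("Riddle", raw)

# The receiving side can recover the original text from the escaped body.
decoded = parse_qs(body)
```

The point is that the non-ASCII characters travel as percent-escaped UTF-8 bytes, so the server only has to decode them as UTF-8 again.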

2 - Created my Loaders (similar to the ones described in the docs):

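from google.appengine.ext import bulkload
import mybulkload  # the module shown in step 3
# Riddle, Language and Category are the app's own models;
# import them from wherever they are defined.

class RiddleLoader(bulkload.Loader):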

  def HandleEntity(self, entity):
    entity['approved']=True
    return entity

  def __init__(self):
    bulkload.Loader.__init__(self, 'Riddle',
      [('word', Riddle.lowerCase),('level', str),
       ('language', Language.get_key_by_code ),
       ('category', Category.get_key_by_sys_name)])

if __name__ == '__main__':
  mybulkload.main(RiddleLoader())

This is a simple loader that extends bulkload.Loader. “Language.get_key_by_code” and “Category.get_key_by_sys_name” are static functions that look up an entity based on a string. This way I can import Languages, Categories and Riddles and have the relations set using simple string keys (since I don’t know an entity’s key before it’s saved!).
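The converter idea can be sketched without App Engine at all. In this minimal Python 3 example the dictionaries stand in for datastore lookups, and every name in it (LANGUAGE_KEYS, CATEGORY_KEYS, get_language_key, get_category_key, create_entity) is made up for illustration; it only mimics how a Loader pairs each CSV column with a converter function.

```python
# Hypothetical stand-ins for datastore lookups: string code -> entity key.
LANGUAGE_KEYS = {"en": "key-language-en", "pt": "key-language-pt"}
CATEGORY_KEYS = {"animals": "key-category-animals"}


def get_language_key(code):
    """Return the (fake) key of the language with the given code."""
    return LANGUAGE_KEYS[code]


def get_category_key(sys_name):
    """Return the (fake) key of the category with the given system name."""
    return CATEGORY_KEYS[sys_name]


# One (property name, converter) pair per CSV column, in column order.
PROPERTIES = [
    ("word", str.lower),
    ("level", str),
    ("language", get_language_key),
    ("category", get_category_key),
]


def create_entity(row):
    """Apply each column's converter to the matching CSV cell."""
    if len(row) != len(PROPERTIES):
        raise ValueError("expected %d columns, got %d" % (len(PROPERTIES), len(row)))
    return {name: convert(cell) for (name, convert), cell in zip(PROPERTIES, row)}


entity = create_entity(["Elephant", "2", "en", "animals"])
```

Because the CSV only carries the language code and the category name, the files can be written by hand before any of the referenced entities have been saved.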

The main difference from the standard bulk loading is in the main method. It calls “mybulkload”, a module containing a class that extends “BulkLoad” so it can receive UTF-8 CSV POST data.

3 - The mybulkload module:

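# Imports assumed to mirror the ones used by the SDK's
# google/appengine/ext/bulkload/__init__.py
import StringIO
import csv
import httplib
import sys
import traceback
import wsgiref.handlers

from google.appengine.api import datastore
from google.appengine.ext import webapp
from google.appengine.ext.bulkload import BulkLoad, Loader, Validate

def utf_8_encoder(unicode_csv_data):
  # Python 2's csv.reader needs byte strings, so encode each unicode line
  for line in unicode_csv_data:
    yield line.encode('utf-8')

class MyBulkLoad(BulkLoad):
  """A handler for bulk load requests that accepts UTF-8 CSV data."""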

  def Load(self, kind, data):
    Validate(kind, basestring)
    Validate(data, basestring)
    output = []

    try:
      loader = Loader.RegisteredLoaders()[kind]
    except KeyError:
      output.append('Error: no Loader defined for kind %s.' % kind)
      return (httplib.BAD_REQUEST, ''.join(output))

    buffer = StringIO.StringIO(data)
    reader = csv.reader(utf_8_encoder(buffer), skipinitialspace=True)

    entities = []

    line_num = 1

    for row in reader:
      try:
        output.append('\nLoading from line %d...' % line_num)
        # decode every cell back to unicode before the converters run
        entities.extend(loader.CreateEntity([unicode(cell, 'utf-8') for cell in row]))
        output.append('done.')
      except:
        exc_info = sys.exc_info()
        stacktrace = traceback.format_exception(*exc_info)
        output.append('error:\n%s' % stacktrace)
        return (httplib.BAD_REQUEST, ''.join(output))

      line_num += 1

    for entity in entities:
      datastore.Put(entity)

    return (httplib.OK, ''.join(output))

def main(*loaders):
  """Starts bulk upload.

  Raises TypeError if not at least one Loader instance is given.

  Args:
    loaders: One or more Loader instances.
  """
  if not loaders:
    raise TypeError('Expected at least one argument.')

  for loader in loaders:
    if not isinstance(loader, Loader):
      raise TypeError('Expected a Loader instance; received %r' % loader)

  application = webapp.WSGIApplication([('.*', MyBulkLoad)])
  wsgiref.handlers.CGIHandler().run(application)

if __name__ == '__main__':
  main()

I just copied this stuff from the __init__.py in google/appengine/ext/bulkload, added the utf_8_encoder function and extended the BulkLoad class, overriding the Load method.

Here is what I used:

reader = csv.reader(utf_8_encoder(buffer), skipinitialspace=True)

to encode the lines sent to the CSV reader as UTF-8, and:

entities.extend(loader.CreateEntity([unicode(cell,'utf-8') for cell in row]))

to decode every cell back to unicode before creating the entities.
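The two lines above can be tried outside App Engine with a small standalone sketch. read_unicode_csv is a name I made up for this example, and the version check is only there so the same file runs on both Python 2 and Python 3: Python 2's csv module works on byte strings, which is why the encode/decode pair is needed, while Python 3's csv module reads text directly.

```python
import csv
import io
import sys

PY2 = sys.version_info[0] == 2


def utf_8_encoder(unicode_lines):
    """Encode each text line as UTF-8 bytes (only needed on Python 2)."""
    for line in unicode_lines:
        yield line.encode("utf-8")


def read_unicode_csv(text):
    """Yield rows of text cells from CSV data held in a text string."""
    lines = io.StringIO(text)
    if PY2:
        # Encode on the way into csv.reader, decode each cell on the way out.
        for row in csv.reader(utf_8_encoder(lines), skipinitialspace=True):
            yield [cell.decode("utf-8") for cell in row]
    else:
        # Python 3's csv module already works on text.
        for row in csv.reader(lines, skipinitialspace=True):
            yield row


# Sample data with non-ASCII words, written with escapes to keep the file ASCII.
sample = u"ma\u00e7\u00e3, 1, pt\ncora\u00e7\u00e3o, 2, pt\n"
rows = list(read_unicode_csv(sample))
```

With skipinitialspace=True the space after each comma is dropped, so every row comes back as clean text cells ready for the converters.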

Perhaps there was an easier way but this is working for me so I hope this can help some of you.

Next I’ll write an entity eraser to bulk delete entities from the AppEngine’s production servers…

Comments

  1. Nuse on August 29th, 2008

    Nice tutorial, but it didn’t work for me for some reason.
    I am also a newbie in Python and need to bulk load UTF-8,
    but I get the following error when trying to put UTF-8 chars:

    ['Traceback (most recent call last):\n',
     ' File "C:\\Program Files\\Google\\google_appengine\\google\\appengine\\ext\\bulkload\\__init__.py", line 412, in Load\n entities.extend(loader.CreateEntity([unicode(cell,\'utf-8\') for cell in row]))\n',
     ' File "C:\\Program Files\\Google\\google_appengine\\google\\appengine\\ext\\bulkload\\__init__.py", line 228, in CreateEntity\n entity[name] = converter(val)\n',
     ' File "C:\\Python25\\lib\\encodings\\cp1255.py", line 12, in encode\n return codecs.charmap_encode(input,errors,encoding_table)\n',
     "UnicodeEncodeError: 'charmap' codec can't encode character u'\\ufeff' in position 0: character maps to <undefined>\n"]
    ERROR 2008-08-29 10:18:30,977 bulkload_client.py] Import failed

  3. Albert on January 11th, 2009

    I am a newbie in Python, trying to use your method, but I’m not sure where to put the file created in step 3. I’m getting “name ‘BulkLoad’ is not defined”

    and thanks for your effort of posting this.
