Dealing with UTF-8 with App Engine’s bulk loading
We just uploaded our first app to Google App Engine’s servers.
It’s called YouDIG – You Draw I Guess – and we believe it’s a fun game to play online.
Since it’s a word-based game, we needed an easy way to upload new Riddles to the online app. So I created a very simple bulkload.py that just calls the ImportCSV function from the SDK. The problem was that it doesn’t work with UTF-8 files!
Here’s what I did (I’m still a Python newbie, so feel free to send me your comments/suggestions):
1 – Edited the google/appengine/tools/bulkload_client.py
    def ContentGenerator
        ....
        if rows_written > 0:
            yield rows_written, unicode(content.getvalue(), 'utf-8')

    def PostEntities
        ....
        body = urllib.urlencode({
            constants.KIND_PARAM: kind,
            constants.CSV_PARAM: content.encode("utf-8"),
        })
This basically decodes the CSV content into unicode as it is read, and then encodes it as UTF-8 again just before sending the POST request.
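To see the round trip on its own, here is a minimal sketch that runs without the SDK. The field names 'kind' and 'csv' are placeholders I made up for constants.KIND_PARAM and constants.CSV_PARAM, and build_body and read_csv_field are hypothetical helpers, not part of bulkload_client.py. As far as I can tell, the explicit encode matters on Python 2 because urlencode chokes on non-ASCII unicode values.

```python
# Python 2/3-compatible sketch; 'kind' and 'csv' are placeholder field names.
try:
    from urllib import urlencode                    # Python 2
    from urlparse import parse_qs
except ImportError:
    from urllib.parse import urlencode, parse_qs    # Python 3


def build_body(kind, csv_text):
    """Encode the text as UTF-8 bytes first, then URL-encode the form fields."""
    return urlencode({
        'kind': kind,
        'csv': csv_text.encode('utf-8'),
    })


def read_csv_field(body):
    """Reverse the steps on the receiving side and return text again."""
    value = parse_qs(body)['csv'][0]
    if isinstance(value, bytes):    # Python 2 hands back raw bytes
        value = value.decode('utf-8')
    return value


body = build_body('Riddle', u'ma\u00e7\u00e3,1\n')
print(body)
print(read_csv_field(body) == u'ma\u00e7\u00e3,1\n')
```

The non-ASCII characters travel as percent-escaped UTF-8 bytes, so the server only has to decode them as UTF-8 to get the original text back.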
2 – Created my Loaders (similar to the ones described in the docs):
    class RiddleLoader(bulkload.Loader):
        def HandleEntity(self, entity):
            entity['approved'] = True
            return entity

        def __init__(self):
            bulkload.Loader.__init__(self, 'Riddle',
                                     [('word', Riddle.lowerCase),
                                      ('level', str),
                                      ('language', Language.get_key_by_code),
                                      ('category', Category.get_key_by_sys_name)])

    if __name__ == '__main__':
        mybulkload.main(RiddleLoader())
This is a simple loader that extends bulkload.Loader. “Language.get_key_by_code” and “Category.get_key_by_sys_name” are static functions that look up an entity from a string. This way I can import Languages, Categories and Riddles and have the relations set using simple string keys (since I don’t know an entity’s key before it’s saved!).
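The idea behind the (name, converter) list can be sketched without the SDK. This is only my understanding of how such a list gets applied to a CSV row, not the SDK's actual code: LANGUAGE_KEYS stands in for a datastore lookup, and lower_case, get_language_key_by_code and convert_row are hypothetical names.

```python
def lower_case(value):
    """Stand-in for a converter like Riddle.lowerCase."""
    return value.lower()


# Stand-in for a datastore query: maps a language code to a stored key.
LANGUAGE_KEYS = {'en': 'key-language-en', 'pt': 'key-language-pt'}


def get_language_key_by_code(code):
    """Resolve the string from the CSV file to the key of a saved entity."""
    return LANGUAGE_KEYS[code]


PROPERTIES = [
    ('word', lower_case),
    ('level', str),
    ('language', get_language_key_by_code),
]


def convert_row(properties, row):
    """Apply each column's converter to the matching CSV cell."""
    if len(row) != len(properties):
        raise ValueError('expected %d columns, got %d'
                         % (len(properties), len(row)))
    return dict((name, convert(cell))
                for (name, convert), cell in zip(properties, row))


print(convert_row(PROPERTIES, ['Apple', '1', 'en']))
```

Because the lookup happens at import time, the referenced Languages and Categories have to be loaded before the Riddles that point at them.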
The main difference from standard bulk loading is in the main method: it calls “mybulkload”, my own package, whose “MyBulkLoad” class extends “BulkLoad” so that it can receive UTF-8 CSV POST data.
3 – The mybulkload package:
    # Imports this snippet relies on. The App Engine paths are the ones I
    # believe the SDK's own bulkload module uses; adjust them to your SDK.
    import csv
    import httplib
    import StringIO
    import sys
    import traceback
    import wsgiref.handlers

    from google.appengine.api import datastore
    from google.appengine.ext import webapp
    from google.appengine.ext.bulkload import BulkLoad, Loader, Validate


    def utf_8_encoder(unicode_csv_data):
        # The csv module wants byte strings, so feed it UTF-8 encoded lines.
        for line in unicode_csv_data:
            yield line.encode('utf-8')


    class MyBulkLoad(BulkLoad):
        """A handler for bulk load requests."""

        def Load(self, kind, data):
            Validate(kind, basestring)
            Validate(data, basestring)
            output = []

            try:
                loader = Loader.RegisteredLoaders()[kind]
            except KeyError:
                output.append('Error: no Loader defined for kind %s.' % kind)
                return (httplib.BAD_REQUEST, ''.join(output))

            buffer = StringIO.StringIO(data)
            reader = csv.reader(utf_8_encoder(buffer), skipinitialspace=True)
            entities = []

            line_num = 1
            for row in reader:
                try:
                    output.append('\nLoading from line %d...' % line_num)
                    # Turn every cell back into unicode before creating entities.
                    entities.extend(loader.CreateEntity(
                        [unicode(cell, 'utf-8') for cell in row]))
                    output.append('done.')
                except:
                    exc_info = sys.exc_info()
                    stacktrace = traceback.format_exception(*exc_info)
                    output.append('error:\n%s' % stacktrace)
                    return (httplib.BAD_REQUEST, ''.join(output))
                line_num += 1

            for entity in entities:
                datastore.Put(entity)

            return (httplib.OK, ''.join(output))


    def main(*loaders):
        """Starts bulk upload.

        Raises TypeError if no Loader instance is given.

        Args:
          loaders: One or more Loader instances.
        """
        if not loaders:
            raise TypeError('Expected at least one argument.')

        for loader in loaders:
            if not isinstance(loader, Loader):
                raise TypeError('Expected a Loader instance; received %r' % loader)

        application = webapp.WSGIApplication([('.*', MyBulkLoad)])
        wsgiref.handlers.CGIHandler().run(application)


    if __name__ == '__main__':
        main()
I just copied this from the __init__.py in google/appengine/ext/bulkload, added the utf_8_encoder function and extended the BulkLoad class, overriding the Load method.
Here is what I used:

    reader = csv.reader(utf_8_encoder(buffer), skipinitialspace=True)

to encode the lines sent to the CSV reader, and:

    entities.extend(loader.CreateEntity([unicode(cell, 'utf-8') for cell in row]))

to convert every cell back to unicode before creating the entities.
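The same encode, parse, decode pattern can be tried outside App Engine. This is a minimal sketch under my own assumptions: read_unicode_csv is a hypothetical helper, and the Python 3 branch is there only so the snippet runs on both versions (on Python 3 the csv module reads text directly, so the workaround is not needed).

```python
import csv
import sys

PY2 = sys.version_info[0] == 2


def utf_8_encoder(unicode_lines):
    """Yield each text line as UTF-8 bytes, which Python 2's csv can parse."""
    for line in unicode_lines:
        yield line.encode('utf-8')


def read_unicode_csv(text):
    """Parse CSV text and return rows whose cells are text, not bytes."""
    lines = text.splitlines(True)
    if PY2:
        # Encode before parsing, then decode every cell afterwards.
        reader = csv.reader(utf_8_encoder(lines), skipinitialspace=True)
        return [[cell.decode('utf-8') for cell in row] for row in reader]
    # Python 3: csv.reader accepts text lines as they are.
    return [row for row in csv.reader(lines, skipinitialspace=True)]


rows = read_unicode_csv(u'ma\u00e7\u00e3, 1, pt\ncora\u00e7\u00e3o, 2, pt\n')
print(rows)
```

UTF-8 is safe for this because multi-byte characters never contain the bytes for a comma, a quote or a newline, so the parser splits the rows in the same places either way.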
Perhaps there was an easier way, but this is working for me, so I hope it helps some of you.
Next I’ll write an entity eraser to bulk delete entities from App Engine’s production servers…