Dealing with UTF-8 in App Engine’s bulk loading
We just uploaded our first app to Google App Engine’s servers.
It’s called YouDIG – You Draw I Guess – and we believe it to be a fun game to play online.
Since it is a word-based game, we needed an easy way to upload new riddles to the online app. So I created a very simple bulkload.py that just calls the ImportCSV function from the SDK. The problem was that it doesn’t work with UTF-8 files!
Here’s what I did (I’m still a Python newbie, so feel free to send me your comments/suggestions):
1 – Edited the google/appengine/tools/bulkload_client.py

    def ContentGenerator
        ....
        if rows_written > 0:
            yield rows_written, unicode(content.getvalue(), 'utf-8')

    def PostEntities
        ....
        body = urllib.urlencode({
            constants.KIND_PARAM: kind,
            constants.CSV_PARAM: content.encode("utf-8"),
        })

This decodes the CSV content into a unicode string and then encodes it as UTF-8 before sending the POST request.
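
As a rough illustration of that round trip, here is a minimal sketch in Python 3 (the post itself targets the Python 2 SDK, where the calls differ). The parameter names "kind" and "csv" are assumptions standing in for constants.KIND_PARAM and constants.CSV_PARAM, and both helper functions are hypothetical. It only shows that UTF-8 bytes survive form encoding and can be decoded again on the receiving side.

```python
# Python 3 sketch; not the SDK's own code.
from urllib.parse import urlencode, parse_qs

# Assumed names, standing in for constants.KIND_PARAM / constants.CSV_PARAM.
KIND_PARAM = "kind"
CSV_PARAM = "csv"


def build_post_body(kind, content):
    """Return a form-encoded body with the CSV text encoded as UTF-8."""
    return urlencode({
        KIND_PARAM: kind,
        CSV_PARAM: content.encode("utf-8"),
    })


def read_post_body(body):
    """Decode the form body back into text, as a receiving handler might."""
    fields = parse_qs(body, encoding="utf-8")
    return fields[KIND_PARAM][0], fields[CSV_PARAM][0]


# "\u00e7" and "\u00e3" are the non-ASCII letters in the Portuguese word for apple.
body = build_post_body("Riddle", "ma\u00e7\u00e3,1,pt,food\n")
kind, csv_text = read_post_body(body)
```

Each non-ASCII character is percent-encoded as its UTF-8 bytes in the body, so the receiver gets the original text back as long as it also decodes with UTF-8.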
2 – Created my Loaders (similar to the ones described in the docs):

    class RiddleLoader(bulkload.Loader):
        def HandleEntity(self, entity):
            entity['approved'] = True
            return entity

        def __init__(self):
            bulkload.Loader.__init__(self, 'Riddle',
                                     [('word', Riddle.lowerCase),
                                      ('level', str),
                                      ('language', Language.get_key_by_code),
                                      ('category', Category.get_key_by_sys_name)])

    if __name__ == '__main__':
        mybulkload.main(RiddleLoader())

This is a simple loader that extends bulkload.Loader. “Language.get_key_by_code” and “Category.get_key_by_sys_name” are static functions that return an entity’s key based on a string. This way I can import Languages, Categories and Riddles and have the relations set using simple string keys (since I don’t know an entity’s key before it’s saved!).
The main difference from standard bulk loading is in the main method: it calls “mybulkload”, my own module containing a class that extends “BulkLoad” so that it can receive UTF-8 CSV POST data.
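
To make the string-key idea concrete, here is a minimal Python 3 sketch that uses plain dictionaries instead of the datastore. Everything in it is hypothetical: the lookup tables, the key strings, the function names, and the use of str.lower in place of Riddle.lowerCase (which I assume simply lowercases the word). It only shows how a list of (property, converter) pairs turns one CSV row into an entity-like dict.

```python
# Python 3 sketch with in-memory stand-ins; no App Engine APIs are used.

# Hypothetical tables playing the role of already-saved Language and
# Category entities, indexed by a human-readable string.
LANGUAGE_KEYS = {"pt": "language-key-1", "en": "language-key-2"}
CATEGORY_KEYS = {"food": "category-key-1"}


def language_key_by_code(code):
    """Resolve a language code from the CSV to a stored key."""
    return LANGUAGE_KEYS[code]


def category_key_by_sys_name(sys_name):
    """Resolve a category system name from the CSV to a stored key."""
    return CATEGORY_KEYS[sys_name]


# Same shape as the loader's property list: (property name, converter).
PROPERTIES = [
    ("word", str.lower),
    ("level", str),
    ("language", language_key_by_code),
    ("category", category_key_by_sys_name),
]


def row_to_entity(row):
    """Apply each column's converter to the matching CSV cell."""
    if len(row) != len(PROPERTIES):
        raise ValueError(
            "expected %d columns, got %d" % (len(PROPERTIES), len(row)))
    return {name: convert(cell)
            for (name, convert), cell in zip(PROPERTIES, row)}


entity = row_to_entity(["Apple", "1", "en", "food"])
```

Because the converters look keys up by string, the CSV only needs to contain values such as "en" or "food", and the referenced entities just have to be loaded first.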
3 – The mybulkload package:

    import StringIO
    import csv
    import httplib
    import sys
    import traceback
    import wsgiref.handlers

    # These imports mirror the SDK's google/appengine/ext/bulkload/__init__.py.
    from google.appengine.api import datastore
    from google.appengine.ext import webapp
    from google.appengine.ext.bulkload import BulkLoad, Loader, Validate


    def utf_8_encoder(unicode_csv_data):
        """Yield each unicode line as a UTF-8 byte string for csv.reader."""
        for line in unicode_csv_data:
            yield line.encode('utf-8')


    class MyBulkLoad(BulkLoad):
        """A handler for bulk load requests that accepts UTF-8 CSV data."""

        def Load(self, kind, data):
            Validate(kind, basestring)
            Validate(data, basestring)
            output = []
            try:
                loader = Loader.RegisteredLoaders()[kind]
            except KeyError:
                output.append('Error: no Loader defined for kind %s.' % kind)
                return (httplib.BAD_REQUEST, ''.join(output))
            buffer = StringIO.StringIO(data)
            # The Python 2 csv module needs byte strings, so encode each line.
            reader = csv.reader(utf_8_encoder(buffer), skipinitialspace=True)
            entities = []
            line_num = 1
            for row in reader:
                try:
                    output.append('\nLoading from line %d...' % line_num)
                    # Decode every cell back to unicode before converting it.
                    entities.extend(loader.CreateEntity(
                        [unicode(cell, 'utf-8') for cell in row]))
                    output.append('done.')
                except:
                    exc_info = sys.exc_info()
                    stacktrace = traceback.format_exception(*exc_info)
                    output.append('error:\n%s' % stacktrace)
                    return (httplib.BAD_REQUEST, ''.join(output))
                line_num += 1
            for entity in entities:
                datastore.Put(entity)
            return (httplib.OK, ''.join(output))


    def main(*loaders):
        """Starts bulk upload.

        Raises TypeError if no Loader instance is given.

        Args:
          loaders: One or more Loader instances.
        """
        if not loaders:
            raise TypeError('Expected at least one argument.')
        for loader in loaders:
            if not isinstance(loader, Loader):
                raise TypeError('Expected a Loader instance; received %r' % loader)
        application = webapp.WSGIApplication([('.*', MyBulkLoad)])
        wsgiref.handlers.CGIHandler().run(application)


    if __name__ == '__main__':
        main()

I copied this from __init__.py in google/appengine/ext/bulkload, added the utf_8_encoder function, and extended the BulkLoad class, overriding its Load method.
Here is what I used:
    reader = csv.reader(utf_8_encoder(buffer), skipinitialspace=True)
to encode the data sent to the CSV reader, and:
    entities.extend(loader.CreateEntity([unicode(cell, 'utf-8') for cell in row]))
to decode every cell back to unicode before creating the entities.
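
For comparison, here is a rough Python 3 sketch of the same parsing step. It is not a drop-in replacement for the Python 2 handler above: in Python 3 the csv module works on text directly, so the encode/decode wrapper is no longer needed and the bytes are decoded once up front. The function name parse_utf8_csv is hypothetical, and the "utf-8-sig" codec is my own choice; it behaves like "utf-8" but also drops a leading byte order mark if the file has one.

```python
# Python 3 sketch; the handler in this post is Python 2.
import csv
import io


def parse_utf8_csv(raw_bytes):
    """Decode a UTF-8 CSV payload and return its rows as lists of text."""
    # "utf-8-sig" also removes a leading byte order mark when present.
    text = raw_bytes.decode("utf-8-sig")
    # newline="" leaves line endings untouched so csv can handle them.
    reader = csv.reader(io.StringIO(text, newline=""), skipinitialspace=True)
    return [row for row in reader]


payload = "ma\u00e7\u00e3, 1, pt, food\n".encode("utf-8")
rows = parse_utf8_csv(payload)
```

As in the handler, skipinitialspace=True drops the space that follows each comma, so the cells come back without leading whitespace.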
Perhaps there is an easier way, but this works for me, so I hope it helps some of you.
Next I’ll write an entity eraser to bulk delete entities from App Engine’s production servers…

Comments(5)
Nice tutorial, but it didn’t work for me for some reason.
I am also a newbie in Python and need to bulk load UTF-8,
but I get the following error when trying to put UTF-8 characters:
['Traceback (most recent call last):\n', ' File "C:\\Program Files\\Google\\goo
gle_appengine\\google\\appengine\\ext\\bulkload\\__init__.py", line 412, in Load
\n entities.extend(loader.CreateEntity([unicode(cell,\'utf-8\') for cell in r
ow]))\n', ' File "C:\\Program Files\\Google\\google_appengine\\google\\appengin
e\\ext\\bulkload\\__init__.py", line 228, in CreateEntity\n entity[name] = co
nverter(val)\n', ' File "C:\\Python25\\lib\\encodings\\cp1255.py", line 12, in
encode\n return codecs.charmap_encode(input,errors,encoding_table)\n', "Unico
deEncodeError: 'charmap' codec can't encode character u'\\ufeff' in position 0:
character maps to <undefined>\n"]
ERROR 2008-08-29 10:18:30,977 bulkload_client.py] Import failed
I am new to Python and am trying to use your method, but I am not sure where to put the file created in step 3. I am getting “name ‘BulkLoad’ is not defined”.
Thanks for your effort in posting this.