bh: thuban/Doc/technotes string_representation.txt,NONE,1.1

Fri Jul 1 22:49:06 CEST 2005

Author: bh

Update of /thubanrepository/thuban/Doc/technotes
In directory doto:/tmp/cvs-serv11857/Doc/technotes

Added Files:
	string_representation.txt 
Log Message:
First step towards Unicode.  With this we're roughly at step 1 of
string_representation.txt

* Doc/technotes/string_representation.txt: New.  Document how
strings are represented in Thuban and how to get to a Unicode
Thuban.

* Thuban/__init__.py (set_internal_encoding)
(unicode_from_internal, internal_from_unicode): New. The first few
functions for the internal string representation

* Thuban/UI/about.py (unicodeToLocale): Removed.  Use
internal_from_unicode instead.

* Thuban/UI/__init__.py (install_wx_translation): Determine the
encoding to use for the internal string representation.  Also,
change the translation function to return strings in internal
representation even on unicode builds of wxPython

* Thuban/Model/load.py (SessionLoader.check_attrs): Decode
filenames too.
(SessionLoader.start_clrange): Use check_attrs to decode and check
the attributes.

* Thuban/Model/xmlreader.py (XMLReader.encode): Use
internal_from_unicode to convert unicode strings.

* Thuban/Model/xmlwriter.py (XMLWriter.encode): Use
unicode_from_internal when applicable

* test/runtests.py (main): New command line option:
internal-encoding to specify the internal string encoding to use
in the tests.

* test/support.py (initthuban): Set the internal encoding to
latin-1

* test/test_load.py (TestSingleLayer.test, TestClassification.test)
(TestLabelLayer.test): Use the internal string representation when
dealing with non-ascii characters

* test/test_load_1_0.py (TestSingleLayer.test)
(TestClassification.test, TestLabelLayer.test): Use the internal
string representation when dealing with non-ascii characters

* test/test_load_0_9.py (TestSingleLayer.test)
(TestClassification.test): Use the internal string representation
when dealing with non-ascii characters

* test/test_load_0_8.py (TestUnicodeStrings.test): Use the
internal string representation when dealing with non-ascii
characters

* test/test_save.py (XMLWriterTest.testEncode)
(SaveSessionTest.testClassifiedLayer): Use the internal string
representation when dealing with non-ascii characters where
applicable

--- NEW FILE: string_representation.txt ---
Title: String Representation in Thuban
Author: Bernhard Herzog <bh at intevation.de>
Last-Modified: $Date: 2005/07/01 20:49:04 $
Version: $Revision: 1.1 $

Introduction

    Thuban originally assumed that text is represented by byte-strings
    encoded in ISO-8859-1 (latin-1).  This is problematic when the
    default encoding in the user's locale is not in fact latin-1, but
    e.g. UTF-8.  The solution is to use a more flexible representation
    that will also allow the switch to Unicode as the internal string
    representation at some point.

Internal String Representation

    Thuban has an internal string representation.  All textual data read
    by Thuban has to be converted to the internal representation.  All
    data written by Thuban has to be converted into whatever form is
    used by the output device.

    Thuban provides functions to convert between the internal
    representation and other representations.  For example,
    internal_from_unicode converts from Unicode and should be used
    when reading XML files, and unicode_from_internal handles the
    conversion to Unicode.
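
    As a rough illustration, the three functions named in this commit
    could look like the sketch below.  It is written for a current
    Python, where bytes plays the role of the byte strings and str the
    role of Unicode objects.  The function names come from the log
    message; the bodies, the module-level variable, the latin-1 default
    and the "replace" error handling are assumptions, not the actual
    code in Thuban/__init__.py.

```python
# Sketch only: the real functions live in Thuban/__init__.py and their
# exact behaviour is not described in this note.

# Assumed default; the note names latin-1 as the fallback encoding.
_internal_encoding = "latin-1"


def set_internal_encoding(encoding):
    """Select the encoding used for the internal byte strings."""
    global _internal_encoding
    _internal_encoding = encoding


def internal_from_unicode(text):
    """Convert a Unicode string to the internal byte-string form.

    "replace" is an assumption: characters that cannot be encoded are
    replaced instead of raising an exception.
    """
    return text.encode(_internal_encoding, "replace")


def unicode_from_internal(data):
    """Convert an internal byte string to a Unicode string."""
    return data.decode(_internal_encoding, "replace")
```

    With the encoding set to UTF-8, internal_from_unicode turns a
    string read from an XML file into UTF-8 bytes, and
    unicode_from_internal reverses that when the data is written.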

    The ultimate goal is to use Unicode objects as the internal string
    representation.  It will be much work to get there because we will
    have to find all the places where we need to make the conversions.
    Therefore the internal representation will initially be byte
    strings in the user's default encoding.

    With byte strings and especially encodings like latin-1 we can get
    by without doing all the conversions correctly because basically all
    byte strings are valid latin-1 strings, even if they have the wrong
    encoding.  In those cases, the text may look strange, but there
    won't be exceptions in most cases.  With Unicode objects, exceptions
    are much more likely.  And in the end it's better to see some
    incorrect characters than no data at all.
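
    The difference can be seen in a few lines of current Python (bytes
    for byte strings, str for Unicode).  The sample value is made up
    for the illustration.

```python
data = b"M\xfcnchen"  # "Muenchen" with u-umlaut, encoded as latin-1

# Every byte value maps to a character in latin-1, so this cannot fail.
as_latin1 = data.decode("latin-1")

# The same bytes are not valid UTF-8, so a strict decode raises.
try:
    data.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# The other way round, UTF-8 bytes misread as latin-1 give strange
# looking text, but no exception.
misread = "M\u00fcnchen".encode("utf-8").decode("latin-1")
```

    Here misread contains two wrong characters in place of the umlaut,
    which is the "strange text, but no exception" case described above.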

    All this boils down to the following steps:

    1. Byte-Strings as Internal Representation

    The internal representation consists of byte strings in the user's
    default encoding as determined by the locale.  The encoding is
    chosen so that such byte strings can be passed to wxPython without
    problems.
    This even works with Unicode builds if we take care to convert the
    translated strings (wxGetTranslation returns Unicode objects in a
    Unicode build).

    If no suitable encoding can be determined, use latin-1.  It might be
    better to use ASCII instead, but latin-1 offers somewhat better
    backwards compatibility with older Thuban versions.
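
    One way to pick the encoding with the latin-1 fallback is sketched
    below.  The function name guess_internal_encoding is hypothetical,
    and this is not necessarily how install_wx_translation determines
    the encoding; it only uses the standard locale and codecs modules.

```python
import codecs
import locale


def guess_internal_encoding(fallback="latin-1"):
    """Return the locale's preferred encoding, or fallback if unusable.

    Hypothetical helper for illustration only.
    """
    try:
        name = locale.getpreferredencoding(False)
        # codecs.lookup raises LookupError if Python has no codec
        # for the name reported by the locale.
        codecs.lookup(name)
    except (LookupError, TypeError, ValueError, locale.Error):
        return fallback
    return name
```

    The result could then be handed to set_internal_encoding at
    startup.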

    Start implementing the conversion functions and use them wherever
    we have hard coded conversions to latin-1.  It's not necessary to
    find all places where conversion has to be done at this point.
    Since we're using byte strings in the user's default encoding most
    byte-strings that are read by Thuban are already in the right form
    and in most cases it's also the right form for output.

    2. Implement the conversion wherever necessary

    Start working toward Unicode as the internal representation.  In
    this phase, we need to find all places where conversion has to be
    done.  To help with this, there will be a command line option that
    sets the internal representation to Unicode so that it's easy to
    test.

    The most difficult areas for this are probably the various data
    sources.  Some of them -- dbf files for instance -- don't provide
    any information about the encodings used.

    3. Switch to Unicode

    Finally, switch to Unicode as the internal string representation.
    For this step it might be best to wait until Unicode builds of
    wxPython are the default on the common platforms.