bh: thuban/Doc/technotes string_representation.txt,NONE,1.1
cvs@intevation.de
Fri Jul 1 22:49:06 CEST 2005
Author: bh
Update of /thubanrepository/thuban/Doc/technotes
In directory doto:/tmp/cvs-serv11857/Doc/technotes
Added Files:
string_representation.txt
Log Message:
First step towards Unicode. With this we are roughly at step 1 of
string_representation.txt
* Doc/technotes/string_representation.txt: New. Document how
strings are represented in Thuban and how to get to a Unicode
Thuban.
* Thuban/__init__.py (set_internal_encoding)
(unicode_from_internal, internal_from_unicode): New. The first few
functions for the internal string representation
* Thuban/UI/about.py (unicodeToLocale): Removed. Use
internal_from_unicode instead.
* Thuban/UI/__init__.py (install_wx_translation): Determine the
encoding to use for the internal string representation. Also,
change the translation function to return strings in internal
representation even on unicode builds of wxPython
* Thuban/Model/load.py (SessionLoader.check_attrs): Decode
filenames too.
(SessionLoader.start_clrange): Use check_attrs to decode and check
the attributes.
* Thuban/Model/xmlreader.py (XMLReader.encode): Use
internal_from_unicode to convert unicode strings.
* Thuban/Model/xmlwriter.py (XMLWriter.encode): Use
unicode_from_internal when applicable
* test/runtests.py (main): New command line option:
internal-encoding to specify the internal string encoding to use
in the tests.
* test/support.py (initthuban): Set the internal encoding to
latin-1
* test/test_load.py (TestSingleLayer.test, TestClassification.test)
(TestLabelLayer.test): Use the internal string representation when
dealing with non-ascii characters
* test/test_load_1_0.py (TestSingleLayer.test)
(TestClassification.test, TestLabelLayer.test): Use the internal
string representation when dealing with non-ascii characters
* test/test_load_0_9.py (TestSingleLayer.test)
(TestClassification.test): Use the internal string representation
when dealing with non-ascii characters
* test/test_load_0_8.py (TestUnicodeStrings.test): Use the
internal string representation when dealing with non-ascii
characters
* test/test_save.py (XMLWriterTest.testEncode)
(SaveSessionTest.testClassifiedLayer): Use the internal string
representation when dealing with non-ascii characters where
applicable
--- NEW FILE: string_representation.txt ---
Title: String Representation in Thuban
Author: Bernhard Herzog <bh at intevation.de>
Last-Modified: $Date: 2005/07/01 20:49:04 $
Version: $Revision: 1.1 $
Introduction
Thuban originally assumed that text is represented by byte strings
encoded in ISO-8859-1 (latin-1). This is problematic when the
default encoding in the user's locale is not latin-1 but, for
example, UTF-8. The solution is to use a more flexible representation
that will also allow a later switch to Unicode as the internal string
representation.
Internal String Representation
Thuban has an internal string representation. All textual data read
by Thuban has to be converted to the internal representation. All
data written by Thuban has to be converted into whatever form is
used by the output device.
Thuban provides functions to convert between the internal
representation and other representations. For example,
internal_from_unicode converts from Unicode to the internal
representation and should be used when reading XML files, for
instance, while unicode_from_internal converts from the internal
representation to Unicode.
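As a rough illustration of how these functions fit together, here is
a minimal sketch. The function names are taken from the log message
above, but the bodies are assumptions and not the code in
Thuban/__init__.py. The sketch is written for Python 3, where the
internal byte strings are bytes objects and Unicode objects are str.
The "replace" error handling is also an assumption.

```python
# Sketch only: the real functions in Thuban/__init__.py may differ.
_internal_encoding = "latin-1"  # assumed default until configured


def set_internal_encoding(encoding):
    """Set the encoding used for the internal byte-string representation."""
    global _internal_encoding
    _internal_encoding = encoding


def internal_from_unicode(text):
    """Convert a Unicode string into the internal byte-string form."""
    # "replace" is an assumption: a wrong character instead of an exception.
    return text.encode(_internal_encoding, "replace")


def unicode_from_internal(data):
    """Convert an internal byte string into a Unicode string."""
    return data.decode(_internal_encoding, "replace")
```

An XML reader would call internal_from_unicode on every string the
parser hands it, and an XML writer would call unicode_from_internal
before serializing.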
The ultimate goal is to use Unicode objects as the internal string
representation. Getting there will take a lot of work because we will
have to find all the places where conversions are needed.
Therefore, for the time being, the internal representation will be
byte strings in the user's default encoding.
With byte strings, and especially with encodings like latin-1, we can
get by without doing all the conversions correctly because every
byte string is a valid latin-1 string, even if it was actually
written in a different encoding. In those cases the text may look
strange, but there will usually be no exceptions. With Unicode
objects, exceptions are much more likely. And in the end it's better
to see some incorrect characters than no data at all.
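The difference can be seen in a few lines of Python. This is only an
illustration of the argument above, written for Python 3. The sample
word is an arbitrary choice.

```python
# Bytes that are really UTF-8, not latin-1.
utf8_bytes = "Stra\u00dfe".encode("utf-8")

# Decoding as latin-1 never fails, because every byte value maps to a
# character. The result is wrong, but there is no exception.
garbled = utf8_bytes.decode("latin-1")

# All 256 possible byte values are accepted by latin-1.
all_bytes_text = bytes(range(256)).decode("latin-1")

# The reverse mistake is not as forgiving: latin-1 bytes are not
# necessarily valid UTF-8, so a strict decode raises an exception.
latin1_bytes = "Stra\u00dfe".encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False
```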
All this boils down to the following steps:
1. Byte-Strings as Internal Representation
The internal representation consists of byte strings in the user's default
encoding as determined by the locale. The encoding is chosen so
that such byte strings can be passed to wxPython without problems.
This even works with Unicode builds if we take care to convert the
translated strings (wxGetTranslation returns Unicode objects in a
Unicode build).
If no suitable encoding can be determined, use latin-1. It might be
better to use ASCII instead, but latin-1 offers somewhat better
backwards compatibility with older Thuban versions.
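One way to pick the encoding with a latin-1 fallback is sketched
below. It is not the code in install_wx_translation. The helper name
choose_internal_encoding is hypothetical, and using
locale.getpreferredencoding together with codecs.lookup is an
assumption about how "suitable" could be checked.

```python
import codecs
import locale


def choose_internal_encoding(candidate=None, fallback="latin-1"):
    """Return a usable codec name, or the fallback if none is found.

    choose_internal_encoding is a hypothetical helper, not Thuban API.
    """
    if candidate is None:
        # Ask the locale for the user's preferred encoding.
        candidate = locale.getpreferredencoding(False)
    try:
        # codecs.lookup raises LookupError for names Python cannot handle.
        return codecs.lookup(candidate).name
    except (LookupError, TypeError):
        return fallback
```

The result could then be passed to set_internal_encoding at startup.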
Start implementing the conversion functions and use them wherever
we have hard-coded conversions to latin-1. It's not necessary to
find all places where conversion has to be done at this point.
Since we're using byte strings in the user's default encoding, most
byte strings that are read by Thuban are already in the right form,
and in most cases that is also the right form for output.
2. Implement the conversion wherever necessary
Start working toward Unicode as the internal representation. In
this phase, we need to find all places where conversion has to be
done. To help with this, there will be a command line option that
sets the internal representation to Unicode so that it's easy to
test.
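Such an option could look roughly like the following sketch. The
option name is borrowed from the internal-encoding option of
test/runtests.py mentioned in the log message. Everything else is
hypothetical, including the use of argparse and the special value
"unicode".

```python
import argparse


def parse_options(argv):
    """Parse a hypothetical --internal-encoding option."""
    parser = argparse.ArgumentParser(description="sketch of a test runner")
    parser.add_argument(
        "--internal-encoding",
        default="latin-1",
        help='codec name, or "unicode" to use Unicode objects internally',
    )
    return parser.parse_args(argv)


def uses_unicode_internally(options):
    """True if the (assumed) special value "unicode" was requested."""
    return options.internal_encoding.lower() == "unicode"
```

Running the test suite once with a byte-string encoding and once with
"unicode" would then show where conversions are still missing.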
The most difficult areas for this are probably the various data
sources. Some of them -- dbf files for instance -- don't provide
any information about the encodings used.
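For such sources the encoding has to be assumed. The sketch below
shows one possible policy and is not something Thuban is known to
implement. The function name text_from_data_source and both of its
encoding parameters are hypothetical. It tries the declared or
assumed encoding first and falls back to latin-1, which cannot fail.

```python
def text_from_data_source(raw, declared_encoding=None,
                          assumed_encoding="latin-1"):
    """Decode bytes read from a data source into a Unicode string.

    Hypothetical helper: declared_encoding is whatever the source
    states (None for formats like dbf that state nothing).
    """
    encoding = declared_encoding or assumed_encoding
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Last resort: latin-1 accepts every byte value, so the user
        # sees some wrong characters rather than no data at all.
        return raw.decode("latin-1")
```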
3. Switch to Unicode
Finally, switch to Unicode as the internal string representation.
For this step it might be best to wait until Unicode builds of
wxPython are the default on the common platforms.