release number? (was: Christmas release? :))

Bernhard Reiter bernhard at intevation.de
Fri Nov 30 09:58:04 CET 2007


On Friday 30 November 2007 00:07, Bram de Greve wrote:
> Hold the presses! =)

Hey, we were not that fast anyway, just trying to buy supplies for
the machines and planning the run. 8)

> Tonight, I've made significant progress to supporting the UTF-8 codec in
> shapefiles (dbflib) using the way of ESRI's codepage files.
> So, I should be able to move to full unicode support in Thuban now,

Very cool!

> but I do have an important question concerning the pyshapelib API:

> Each DBFFile instance has a codec attribute that tells the name of the
> python codec used to encode/decode from and to unicode strings.
> All strings passed to dbflib can be either unicode or regular strings, and
> in case of the former they are encoded to regular strings internally.
> So far, so good.
>
> But what do we do with strings that are returned by dbflib?  Should they
> be Unicode strings already (and break backwards compatibility), or
> should we return the "raw" regular strings and let the caller do the
> decoding (using the known codec).  That way, nothing changes for
> existing code, and - maybe more importantly - if some DBF file carries a
> language driver/codepage that is not supported by Python (or if the file
> simply is corrupt), then all is not lost because the raw string is
> returned and the caller may do whatever he's capable to salvage the data.
>
> So, what do we do?  My preference goes to the latter, returning the raw
> strings.

My master plan is to handle everything as unicode objects within Thuban
and to transform things as early as possible when they enter Thuban's realm
and usually as late as possible when they are written out.
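
A minimal sketch of that "decode early, encode late" idea, written in
present-day Python where unicode objects are `str` and regular strings are
`bytes`. The helper names and the sample value are made up for illustration
and are not part of Thuban or pyshapelib:

```python
def read_text(raw: bytes, encoding: str) -> str:
    """Decode at the border, as soon as data enters the application."""
    return raw.decode(encoding)


def write_text(text: str, encoding: str) -> bytes:
    """Encode at the border, just before data leaves the application."""
    return text.encode(encoding)


# Incoming data arrives as bytes in some known encoding (assumed latin-1 here).
raw_in = "Osnabrück".encode("latin-1")

# Inside the application everything is a unicode object.
name = read_text(raw_in, "latin-1")
label = name.upper()

# Only when writing out do we pick the target encoding (assumed UTF-8 here).
raw_out = write_text(label, "utf-8")
```

Everything between the two helper calls works on unicode objects only, so
the encoding has to be known in exactly two places.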

Of course at each border we would need to know what the encoding of the
communication partner is. In some cases this will need to be configurable
as it might not be reliably determinable at runtime.

What borders do we have:
	a) the filesystem determining the filename encoding
	b) the contents of each file
	c) all internet connections
	d) possible libraries that cannot handle unicode directly.
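
For border a), present-day Python exposes the filename encoding through
the standard library. The following is only a sketch of how one could
look at it; the file name is a made-up example:

```python
import os
import sys

# The encoding Python assumes for file names on this system.
fs_encoding = sys.getfilesystemencoding()

# A file name as the operating system might hand it over (bytes).
raw_name = "straße.shp".encode("utf-8")

# Cross the border: bytes -> unicode on the way in, unicode -> bytes on the way out.
text_name = os.fsdecode(raw_name)
round_trip = os.fsencode(text_name)
```

`os.fsdecode` and `os.fsencode` use the filesystem encoding together with
an error handler chosen so that names can be passed back to the operating
system unchanged.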

So I believe taking and returning unicode would be perfectly fine
for dbflib and preferable even if we need to change Thuban then.
Of course there should be a way to request and set the encoding that dbflib
internally uses so that we can inform the user about which DBF file format is
going to be written and let him override this. Apart from this I do not think
that users of dbflib should need to know about encoding issues.
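
To make this concrete, here is a rough sketch in present-day Python
(`str` for unicode, `bytes` for regular strings). The class `TextTable`
and its methods are hypothetical stand-ins, not the pyshapelib API. It
returns unicode by default, lets the caller inspect and override the
codec, and, as one possible way to keep the salvage path Bram mentions,
hands back the raw bytes when decoding is impossible:

```python
class TextTable:
    """Hypothetical stand-in for a DBF reader; NOT the pyshapelib API."""

    def __init__(self, raw_values, codec="cp1252"):
        self._raw_values = raw_values  # bytes values as stored in the file
        self.codec = codec             # callers may inspect or override this

    def read_value(self, index):
        """Return the value as unicode, or as raw bytes if it cannot be decoded."""
        raw = self._raw_values[index]
        try:
            return raw.decode(self.codec)
        except (LookupError, UnicodeDecodeError):
            # LookupError: Python does not know the codec name.
            # UnicodeDecodeError: the data does not fit the codec.
            return raw


table = TextTable([b"Osnabr\xfcck"], codec="latin-1")
decoded = table.read_value(0)

# Overriding the codec with an unknown name falls back to the raw bytes.
table.codec = "no-such-codec"
salvaged = table.read_value(0)
```

Whether such a fallback belongs in the library or in the caller is a
design choice; the sketch only shows that returning unicode and keeping
the raw data reachable do not exclude each other.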

> Bonus question 1: apart from the dbflib issues, what other areas of
> Thuban are not yet fully unicode internally?

This is the dark spot of Thuban currently.
We would need to switch to unicode and inspect all borders
as stuff would break during runtime. :/

> Bonus question 2: Bernhard, shall I make another branch where I commit
> these recent developments?  Or do I commit it to my current branch?

Make it in another branch please.
If possible branch from current trunk, this probably will ease merging.

Bernhard



-- 
Managing Director - Owner: www.intevation.net       (Free Software Company)
Germany Coordinator: fsfeurope.org. Coordinator: www.Kolab-Konsortium.com.
Intevation GmbH, Osnabrück, DE; Amtsgericht Osnabrück, HRB 18998
Geschäftsführer Frank Koormann, Bernhard Reiter, Dr. Jan-Oliver Wagner