release number? (was: Christmas release? :))

Wed Dec 5 20:50:42 CET 2007

Bernhard Reiter wrote:
> On Friday 30 November 2007 00:07, Bram de Greve wrote:
>   
>> Hold the presses! =)
>>     
>
> Hey, we were not that fast anyway, just trying to buy supplies for
> the machines and planning the run. 8)
>
>   
Things will take a bit longer than anticipated, as I'm now communicating
with Frank Warmerdam, and I'll wait until more is known about how the
official shapelib is going to evolve.  File IO will likely change quite
a lot, to support 2GB+ files on Win32, and so the unicode filenames will
be revisited (trying to avoid the ugly _wfopen wide character API). 
Also, about the LDID/CPG code page things, some unified construct will
be designed first.

More on that later ...

> My master plan is to do anything as unicode objects within Thuban
> and transform things as early as possible when they enter Thuban's realm
> and usually late when they are written out.
>   

Agreed
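
A minimal sketch of that decode-early, encode-late rule, written in
present-day Python 3 purely for illustration.  The helper names and the
cp1252 sample are assumptions of mine, not Thuban code:

```python
# Hypothetical helpers: decode on the way in, encode on the way out.

def read_field(raw: bytes, encoding: str) -> str:
    """Decode bytes to text as soon as they cross into the application."""
    return raw.decode(encoding)


def write_field(text: str, encoding: str) -> bytes:
    """Encode text back to bytes only at the moment it is written out."""
    return text.encode(encoding)


# Inside the application everything is text; an encoding is only named
# at the border.
name = read_field(b"Z\xfcrich", "cp1252")
label = name.upper()                  # pure text processing
stored = write_field(label, "utf-8")
```

The point of the split is that only the two border functions ever need
to know which encoding is in play.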

> Of course at each border we would need to know what the encoding of the
> communication partner is. In some cases this will need to be configurable
> as it might not be securely determined at runtime.
>
> What borders do we have:
> 	a) the filesystem determining the filename encoding
> 	b) the contents of each file
> 	c) all internet connections
> 	d) possible libraries that cannot handle unicode directly.
>
> So I believe taking and returning unicode would be perfectly fine
> for dbflib and preferable even if we need to change Thuban then.
> Of course there should be a way to request and set the encoding that dbflib 
> internally uses so that we can inform the user about which dbffile format is 
> going to be written and let him override this. Apart from this I do not think
> that users of dbflib should need to know about encoding issues.
>   
Requesting the encoding should be easy; setting it would most likely
only be possible when creating a new DBF file.  In early
implementations this might default to UTF-8, but in the real thing it
would be configurable in Thuban.

I'm still not sure about returning the string data as unicode or as
"raw" encoded strings.  You see, my interest in pyshapelib is beyond
Thuban.  Of course, regardless of the choice, DBFTable in Model/table.py
should use Unicode.  But if pyshapelib returns raw strings, then
DBFTable will do the conversion instead so that it blends seamlessly with
the rest of Thuban.

The reason why I'm hesitating is two-fold:
- backwards compatibility with tons of existing scripts that don't use
unicode and might break
- what if the encoding information (LDID/CPG) or the encoded content is
broken?  Returning Unicode will only be able to raise an exception,
while returning raw content might give opportunities to go further.
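
To make the second point concrete, here is a small sketch in
present-day Python 3.  The byte string and the "wrong" declared
encoding are made-up examples, not taken from a real DBF file:

```python
# Illustration only: sample bytes and encodings are invented.
raw = b"caf\xe9"        # "café" as written by a cp1252 application
declared = "utf-8"      # suppose the LDID/CPG information wrongly says UTF-8

try:
    # A Unicode-only API can do nothing but raise at this point.
    text = raw.decode(declared)
except UnicodeDecodeError:
    # With the raw bytes in hand, the caller can still go further:
    text = raw.decode("cp1252")                     # retry with a guess
    lossy = raw.decode(declared, errors="replace")  # or accept a placeholder
```

Neither fallback is possible if the library has already failed before
handing anything back.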

OTOH, I might do it in a configurable way.  For example: dbflib returns
raw encoded strings by default, unless you do something like this:

import dbflib
dbflib.return_unicode = True

or even individually on each file:

dbf = dbflib.open(u"foobar", return_unicode=True)
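
How the global switch and the per-file switch could interact, as a
rough sketch in present-day Python 3.  FakeDBF and all of its
attributes are hypothetical stand-ins; none of this is existing dbflib
API:

```python
# Hypothetical stand-in for the proposed behaviour; not the dbflib API.

class FakeDBF:
    """Toy table holding one encoded value."""

    # Library-wide default, playing the role of dbflib.return_unicode.
    return_unicode_default = False

    def __init__(self, raw_value, encoding, return_unicode=None):
        self._raw = raw_value
        self.encoding = encoding
        # None means "no per-file choice", so the default applies.
        self._return_unicode = return_unicode

    def read_value(self):
        use_unicode = self._return_unicode
        if use_unicode is None:
            # Looked up at call time, so later changes take effect.
            use_unicode = type(self).return_unicode_default
        if use_unicode:
            return self._raw.decode(self.encoding)
        return self._raw


legacy = FakeDBF(b"Br\xfcssel", "cp1252")    # raw bytes, as before
modern = FakeDBF(b"Br\xfcssel", "cp1252", return_unicode=True)
```

With this arrangement old scripts keep getting raw strings unless they
opt in, either per file or once for the whole library.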

>> Bonus question 2: Bernhard, shall I make another branch where I commit
>> these recent developments?  Or do I commit it to my current branch?
>>     
>
> Make it in another branch please.
> If possible branch from current trunk, this probably will ease merging.
>
>   
OK, I did branch from current trunk.  But now I need to merge my Bramz
branch into my Unicode branch ... *shiver* =)

Bram
