WIP-pyshapelib-Unicode issues

Wed Jan 9 18:52:15 CET 2008

Hi Bernhard,

Bernhard Reiter wrote:
>> *
>> test_transientdb.TestTransientTable.{test_auto_transient_table|test_transient_table}:
>>
>> the shapelib C library returns integer numerical fields with more than 10
>> digits as double (in order to avoid an overflow with the C int).  This is
>> causing the test to fail as an 'int' is expected for column 3, and an
>> 'double' is found.  Bernhard Herzog suggests letting pyshapelib convert them
>> back to Python long integers.  On the Python API level this can easily be
>> done using PyLong_FromDouble.  Is it OK to mix long integers and "short"
>> integers, or should all integers coming from pyshapelib be converted to
>> long ints?
>>     
>
> Probably one class of objects should always have the same representation,
> so if "long" is sufficient for ids, why not use it for all ids?
> This seems consistent. If there are values where (int *) is sufficient, 
> we could use this. 
>   

Sounds reasonable.
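
To make the idea concrete, here is a rough Python-level sketch of the
conversion (the real change would live in the C wrapper, via
PyLong_FromDouble).  The helper name normalize_integer_field is made up
for illustration, and the sketch assumes a modern Python where int and
long are a single type:

```python
def normalize_integer_field(value):
    """Return a Python integer for an integer DBF field.

    Hypothetical helper: shapelib may hand back wide integer fields as a
    C double, which arrives in Python as a float.
    """
    if isinstance(value, float):
        # Refuse to silently truncate something that is not a whole number.
        if not value.is_integer():
            raise ValueError("not a whole number: %r" % (value,))
        return int(value)
    return int(value)


# An 11-digit id that came back as a double, and a short one that did not.
big = normalize_integer_field(12345678901.0)
small = normalize_integer_field(42)
print(type(big).__name__, big, type(small).__name__, small)
```

One caveat: a double only represents integers exactly up to 2**53, so
ids wider than that would already be rounded before the conversion.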

>> * when creating new DBF files, Thuban will use the LDID_ESRI_ANSI code
>> page.  That's LDID 0x57 and uses the cp1252 codec. 
>>     
>
> Is this the default encoding old shapefiles used to have?
>   

Yes ... and no.  Probably, most shapefiles created with a default ESRI
installation will have this encoding.  But shapefiles created with the
old dbflib did not have any associated code page at all!  At any rate,
dbflib totally ignored any code page when reading dbf files, so things
pretty much came down to the default Python encoding, which is ... ASCII?
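
A small sketch of what that means in practice, assuming only the one
LDID mentioned above (0x57 -> cp1252).  The table LDID_TO_CODEC and the
function decode_field are hypothetical names, not part of dbflib or
pyshapelib:

```python
LDID_ESRI_ANSI = 0x57

# Hypothetical lookup table; only the entry discussed in this thread.
LDID_TO_CODEC = {LDID_ESRI_ANSI: "cp1252"}


def decode_field(raw, ldid=None, fallback="ascii"):
    """Decode raw DBF field bytes using the file's code page, if known."""
    codec = LDID_TO_CODEC.get(ldid, fallback)
    return raw.decode(codec)


raw = "Zürich".encode("cp1252")

# With the code page known, the umlaut survives.
print(decode_field(raw, LDID_ESRI_ANSI))

# Without one, an ASCII fallback cannot decode the non-ASCII byte.
try:
    decode_field(raw)
except UnicodeDecodeError as exc:
    print("cannot decode:", exc.reason)
```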

>   
>> This should be 
>> configurable by the user, but I don't really know where to start.  Can
>> anyone who's familiar with the Thuban UI give a head start?
>>     
>
> First we have to decide where to save this property.
>   

Is there any "config" file for Thuban?  If so, I would save it there.

> It looks like the property of a .dbf table which can be a table on its own
> or a part of a shapefile layer. 
>   
Correct.
> If Thuban displays a table it will already have a concept about its encoding.
> Otherwise it would need to recode it.
> So for files where Thuban cannot determine the encoding, we probably have to 
> add something like in the "import" statements.
>   

Good point! I totally missed the issue of dbf files that do not have any
code page associated.  Until now, I simply assumed cp1252.
But that's not the only reason to have this encoding configurable.
More important is the encoding to be used when _creating_ new dbf files!
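
Roughly, the two cases could be resolved like this.  This is only a
sketch: the function names and the idea of passing the user's setting
in as an argument are assumptions, since where Thuban would store the
setting is still open:

```python
import codecs

# Assumed default, matching the cp1252 assumption made so far.
DEFAULT_DBF_ENCODING = "cp1252"


def encoding_for_reading(file_codec, user_default=DEFAULT_DBF_ENCODING):
    """Prefer the code page recorded in the file; else the user's default."""
    chosen = file_codec or user_default
    # codecs.lookup raises LookupError for an unknown codec name.
    return codecs.lookup(chosen).name


def encoding_for_new_file(user_choice=None):
    """New files carry no code page yet, so only the setting matters."""
    return codecs.lookup(user_choice or DEFAULT_DBF_ENCODING).name


print(encoding_for_reading(None))        # file without a code page
print(encoding_for_reading("cp437"))     # file that declares one
print(encoding_for_new_file())           # nothing configured
```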

> Maybe the table view should also display this property.
>   

That would be nice, though not really necessary.

>   
>> * pyshapelib should get a proper unittest that can run on its own, but is
>> also tested from test/runtests.py.  I've never made a unittest in Python
>> before, so I'm a bit puzzled here.
>>     
>
> Check the files in there, they are pretty good examples. It is more 
> straightforward than you might think. :)
>   

I'll give it another try =)
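
For the record, this is the general shape I have in mind: a module that
runs on its own and also exposes its tests to an aggregating runner.
It is a sketch only.  The function under test is a stand-in, and the
suite() hook is an assumption; I have not checked how test/runtests.py
actually collects tests:

```python
import unittest


def whole_double_to_int(value):
    """Stand-in for the behaviour under test (hypothetical)."""
    return int(value)


class IntegerFieldTest(unittest.TestCase):

    def test_large_value_becomes_int(self):
        result = whole_double_to_int(12345678901.0)
        self.assertIsInstance(result, int)
        self.assertEqual(result, 12345678901)

    def test_small_value_unchanged(self):
        self.assertEqual(whole_double_to_int(42), 42)


def suite():
    """Hook an aggregating runner could import and call."""
    return unittest.defaultTestLoader.loadTestsFromTestCase(IntegerFieldTest)


if __name__ == "__main__":
    # Stand-alone run of just this module.
    unittest.TextTestRunner(verbosity=2).run(suite())
```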

Bram