reading data from csv files
Jakson A. Aquino
jalvesaq at gmail.com
Thu Sep 7 13:15:25 CEST 2006
Hello All,
I've made some changes to the algorithm statist uses to open data files.
All files that could be opened before should continue to be read as before,
but statist can now also read .csv files as created by spreadsheet programs.
Statist should open most files without problems, automatically detecting
whether the file has a header with variable names and correctly parsing the
data lines. To achieve this, among other changes, I added double quotes
and commas to the ignore[] string (data.c), which is used to distinguish
between data and field separators.
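The idea of treating quotes and commas as characters to be skipped can be
illustrated with a small sketch. It is written in Python rather than the C of
data.c, and the names IGNORE, split_fields and looks_like_header are
hypothetical; this is not statist's actual code, only one way the approach
described above might look.

```python
# Characters treated as separators rather than data. This set (space, tab,
# double quote, comma) is an assumption loosely modelled on the ignore[]
# string mentioned above, not a copy of it.
IGNORE = ' \t",'


def split_fields(line, ignore=IGNORE):
    """Split a line into fields, skipping every character found in `ignore`."""
    fields, current = [], []
    for ch in line.rstrip("\r\n"):
        if ch in ignore:
            if current:  # a separator ends the field collected so far
                fields.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        fields.append("".join(current))
    return fields


def is_number(text):
    """Return True if `text` can be converted to a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def looks_like_header(line):
    """Guess that a line is a header when none of its fields is numeric."""
    fields = split_fields(line)
    return bool(fields) and not any(is_number(f) for f in fields)
```

With this kind of splitting, a value such as 1,5 falls apart into two fields,
which is one reason a decimal comma needs extra information from the user.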
I added the following options to let users fine-tune statist's behavior if it
fails to open a file correctly:
  --header              : the file has column names in the first line
  --noheader            : the file does not have column names
  --sep <char>          : field separator character
  --dec <char>          : decimal delimiter character (default: '.')
  --na-string <string>  : indicator of missing values (default: "M")
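As a rough illustration of what the separator, decimal delimiter and
missing-value settings control, here is a minimal Python sketch. The
functions parse_value and parse_line are hypothetical and are not part of
statist; the defaults for dec and na_string follow the list above, while
treating a missing sep as "split on whitespace" is an assumption made only
for this example.

```python
def parse_value(text, dec=".", na_string="M"):
    """Convert one field to a float, or to None if it marks a missing value."""
    text = text.strip().strip('"')  # drop surrounding blanks and double quotes
    if text == na_string:
        return None
    if dec != ".":
        text = text.replace(dec, ".")  # normalise the decimal delimiter
    return float(text)


def parse_line(line, sep=None, dec=".", na_string="M"):
    """Split a data line on `sep` (any whitespace if None) and convert fields."""
    fields = line.split(sep)
    # Empty fields (for example from repeated separators) are skipped.
    return [parse_value(f, dec, na_string) for f in fields if f.strip()]
```

For example, parse_line('1,5 2,5 M', sep=' ', dec=',') would give the two
values 1.5 and 2.5 followed by None for the missing observation.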
With the test scripts attached to this e-mail, it was possible to open files
with a wide range of formatting styles. Command line options were needed
only when the field separator was a blank space and the decimal delimiter
was a comma, as shown in the table below:
Command line options to open a data file with statist
according to how the data is formatted
============================================================================
                               | dec = .              | dec = ,
                               |----------------------|---------------------
                               | sep = ,  | sep = ' ' | sep = , | sep = ' '
-------------------------------|----------|-----------|---------|-----------
header         | quoted data   |          |           |         |
(statist style)|---------------|----------|-----------|---------|-----------
( #% )         | not quoted    |          |           |         | --dec ","
---------------|---------------|----------|-----------|---------|-----------
               | quoted data   |          |           |         |
header         |---------------|----------|-----------|---------|-----------
               | not quoted    |          |           |         | --dec ","
---------------|---------------|----------|-----------|---------|-----------
               | quoted data   |          |           |         |
no header      |---------------|----------|-----------|---------|-----------
               | not quoted    |          |           |         | --dec ","
============================================================================
Note: the combination of (data not quoted & dec = "," & sep = ",")
is possible only for integer values.
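The reason for this restriction can be shown with a short Python example. It
uses Python's standard csv module rather than statist's own parser, and the
sample strings are made up for illustration: once the same character serves
as both field separator and decimal delimiter, unquoted decimals can no
longer be told apart from separate integer fields.

```python
import csv

unquoted = "1,5,2,5"       # meant as 1,5 and 2,5, but nothing marks the fields
quoted = '"1,5","2,5"'     # quotes mark where each field begins and ends

# Splitting the unquoted line on "," yields four integer-looking fields.
unquoted_fields = unquoted.split(",")

# A quote-aware reader keeps the comma inside each quoted field.
quoted_fields = next(csv.reader([quoted], delimiter=",", quotechar='"'))

print(unquoted_fields)  # ['1', '5', '2', '5']
print(quoted_fields)    # ['1,5', '2,5']
```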
I also updated the documentation to reflect the changes. If no problems
are found in the new algorithm, it will be part of the next release of
statist. For now, the changes are available from CVS:
cvs -z3 -d:pserver:anonymous@cvs.intevation.de:/home/statist/jail/statistrepository co statist
One disadvantage of these new features is that statist now takes about
30% longer to load a file.
All comments and suggestions are welcome!
Best regards,
Jakson
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test_statist.tar.gz
Type: application/octet-stream
Size: 2797 bytes
Desc: not available
Url : http://www.intevation.de/pipermail/statist-list/attachments/20060907/d39da57b/test_statist.tar.gz