missing values

Tue Feb 8 20:46:58 CET 2005

Hi Bernhard!

>Could you produce a few example files so that
>the error can be made obvious. We can use that as
>test cases when we fix the bug.

I'm sending the example files as an attachment. The file
results_gss explains what was done and presents the
results. The file gss20x6_file_info has information
about the very small database used to make the test.

>Putting a special value in is not the nicest solution,
>however I would need to look at the code to think about
>others.

Yes, if the special value were present in the database
as a valid value, this would be a problem. Perhaps an
option like --diffsysmis=number (to define a system
missing value other than 9.99999e+99) could solve the
problem. PSPP, which is a clone of SPSS, seems to keep
a list of user-defined missing values for each column:

[jakson at localhost src]$ grep MISSING *.h -n
exprP.h:187:    OP_STR_MIS,       /* MISSING(strvar). */
var.h:282:    MISSING_NONE,       /* No user-missing values. */
var.h:283:    MISSING_1,          /* One user-missing value. */
var.h:284:    MISSING_2,          /* Two user-missing values. */
var.h:285:    MISSING_3,          /* Three user-missing values. */
var.h:286:    MISSING_RANGE,      /* [a,b]. */
var.h:287:    MISSING_LOW,        /* (-inf,a]. */
var.h:288:    MISSING_HIGH,       /* (a,+inf]. */
var.h:289:    MISSING_RANGE_1,    /* [a,b], c. */
var.h:290:    MISSING_LOW_1,      /* (-inf,a], b. */
var.h:291:    MISSING_HIGH_1,     /* (a,+inf), b. */
var.h:292:    MISSING_COUNT

PSPP also seems to have a special value reserved for
sysmis. Look at the following lines:

var.h, lines 40 ff.:
/* Special values. */
#define SYSMIS (-DBL_MAX)
#define LOWEST second_lowest_value
#define HIGHEST DBL_MAX

magic.h, lines 29 ff.:
#ifndef SECOND_LOWEST_VALUE
/* "Second lowest" value for a flt64; that is, (-FLT64_MAX) + epsilon. */
double second_lowest_value;
#endif

The PSPP code is too big and complex for me, and I just
used grep to find the word "missing". But it seems
that they use the most negative double value
(-DBL_MAX) as sysmis, plus a list of user-defined
missing values. Perhaps using a very large negative
number would not be so dangerous. Anyway, I don't
know if this is a good solution.
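
To make the sentinel idea concrete, here is a minimal sketch in C. It is not
PSPP's or Statist's actual code: the names SKETCH_SYSMIS, is_sysmis and
mean_skip_missing are made up for illustration, and only the choice of
-DBL_MAX as the marker is taken from the var.h lines quoted above.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Hypothetical sentinel, modelled on the SYSMIS definition quoted above. */
#define SKETCH_SYSMIS (-DBL_MAX)

/* Returns 1 if x is the sentinel, 0 otherwise. */
static int is_sysmis(double x)
{
    return x == SKETCH_SYSMIS;
}

/* Mean of the non-missing entries of v[0..n-1].
   Stores the number of valid entries in *n_valid (if not NULL).
   Returns SKETCH_SYSMIS when there is no valid entry at all. */
static double mean_skip_missing(const double *v, size_t n, size_t *n_valid)
{
    double sum = 0.0;
    size_t count = 0;

    assert(v != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (!is_sysmis(v[i])) {
            sum += v[i];
            count++;
        }
    }
    if (n_valid != NULL)
        *n_valid = count;
    return count > 0 ? sum / (double)count : SKETCH_SYSMIS;
}
```

With such a marker, a value like 9.99999e+99 would stay an ordinary data
value, since it is not equal to -DBL_MAX. The remaining risk is the one
mentioned above: a real observation that happens to equal the marker would
be silently dropped.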

In any case, I don't think it is necessary to complicate
the code by creating a complex list of user-defined
missing values. I prefer Statist's simpler approach of
just putting an 'M' in the database. In social research
we have three main sources of abundant missing values:
(1) people don't know how to answer a question; (2) they
do know how to, but are unwilling to answer; and (3) the
question doesn't apply. In some analyses, all three cases
are best treated as missing values, but in others, often
using the same database, "I don't know" and "I prefer not
to answer this question" must be counted and analysed
separately. But this would happen only rarely, and it
would not be that difficult to write a program to
automatically recode an original database and create a
new one with 'M's replacing the values that have to be
treated as missing.
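
The core of such a recoding program could look roughly like the sketch
below. It is only an illustration under assumptions: the function name
recode_field and the numeric codes in the usage note are hypothetical, and
fields are compared as text, so "8" and "08" count as different values.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns "M" if field is equal to one of the n_codes strings in codes,
   otherwise returns field unchanged. The caller decides, per column and
   per analysis, which codes should count as missing. */
static const char *recode_field(const char *field,
                                const char *const *codes, size_t n_codes)
{
    assert(field != NULL);

    for (size_t i = 0; i < n_codes; i++) {
        if (strcmp(field, codes[i]) == 0)
            return "M";
    }
    return field;
}
```

Suppose, purely as an example, that a survey column uses 8 for "I don't
know", 9 for "I prefer not to answer" and 0 for "does not apply". An
analysis that treats all three as missing would pass {"8", "9", "0"} as
codes, while one that wants to count the first two separately would pass
only {"0"} and leave 8 and 9 in the new database as ordinary values.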

Best,

Jakson

-------------- next part --------------
A non-text attachment was scrubbed...
Name: missing_problem.tar.gz
Type: application/x-gzip-compressed
Size: 3484 bytes
Desc: missing_problem.tar.gz
Url : http://www.intevation.de/pipermail/statist-list/attachments/20050208/55b59284/missing_problem.tar.gz