missing values

Wed Feb 9 13:29:21 CET 2005

Hi!

Before reading the email written by Andreas Beyer, I
had written something similar. I also wrote some code
suggestions, but I don't know the Statist source code
well enough to be certain that what I wrote would
work.

If I'm not wrong, we can fix the problem of missing
values by writing a few lines of code. I still think
that we can define one specific double value as the
system missing value.

I - DEFINING SYSMIS

First we would have to give Statist the ability to
keep missing values. We could borrow one line from
pspp and put it in statist.h:

#define SYSMIS (-DBL_MAX)

I wrote the following program to see the exact value 
of -DBL_MAX:

#include <stdio.h>
#include <float.h>  /* DBL_MAX */
#define SYSMIS (-DBL_MAX)

int main(void){
  printf("\n%e\n\n", SYSMIS);
  return 0;
}

The program was compiled with g++, and the output
was: -1.797693e+308

The complete number is:

-17976931348623157081452742373170435679807056752584499
659891747680315726078002853876058955863276687817154045
895351438246423432132688946418276846754670353751698604
991057655128207624549009038932894407586850845513394230
458323690322294816580855933212334827479782620414472316
8738177180919299881250404026184124858368.000000

Certainly the probability of finding this specific
number in a database is very small. We can state in
the documentation that if this number appears in the
database, it is the user's responsibility to define a
different SYSMIS with the option -sysmis=number, or
something similar.
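
To make the idea of a user-defined SYSMIS more concrete, here is a
minimal sketch of how such an option could be parsed. Everything in
it is an assumption: Statist has no -sysmis option today, and the
names sysmis, parse_sysmis_option and is_sysmis are only
illustrative.

```c
#include <assert.h>
#include <float.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical global: the value that marks a missing observation.
   Defaults to the most negative finite double, as in pspp. */
static double sysmis = -DBL_MAX;

/* Parse one command-line argument of the (hypothetical) form
   "-sysmis=number". Returns 1 if the argument was recognized and
   the sentinel was updated, 0 otherwise. */
int parse_sysmis_option(const char *arg)
{
    static const char prefix[] = "-sysmis=";
    const size_t len = sizeof(prefix) - 1;
    char *end;
    double value;

    assert(arg != NULL);
    if (strncmp(arg, prefix, len) != 0)
        return 0;                /* some other option */
    value = strtod(arg + len, &end);
    if (end == arg + len || *end != '\0')
        return 0;                /* no number, or trailing garbage */
    sysmis = value;
    return 1;
}

/* Test whether a value equals the current missing-value sentinel. */
int is_sysmis(double val)
{
    return val == sysmis;
}
```

With this, the rest of the code would call is_sysmis() instead of
comparing against the macro directly, so a user-supplied value would
take effect everywhere.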

II - WRITING SYSMIS IN TEMPFILES

Now that we have the constant SYSMIS, we have to
write it to the temporary files whenever a missing
value is read from a column. Perhaps the following
replacement will be enough:

data.c, line 261 and ff:

if ((token[0] == NODATA) && (strlen(token)==1)) {
  colread ++;
}
else if (sscanf(token, "%lf", &test)==1) {
  FWRITE(&test, sizeof(REAL), 1, ttempfile[actcol]);
  nn[actcol] ++;
  colread ++;
}

Should be changed to something like:

if ((token[0] == NODATA) && (strlen(token)==1)) {
  /* SYSMIS is a macro, not a variable, so copy it
     before taking its address */
  REAL mis = SYSMIS;
  FWRITE(&mis, sizeof(REAL), 1, ttempfile[actcol]);
  nn[actcol] ++;
  colread ++;
}
else if (sscanf(token, "%lf", &test)==1) {
  FWRITE(&test, sizeof(REAL), 1, ttempfile[actcol]);
  nn[actcol] ++;
  colread ++;
}
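
Since I cannot test this inside Statist, here is a small standalone
sketch of the same idea that can be compiled separately. It is not
Statist code: write_token and read_value are hypothetical helpers,
plain double stands in for REAL, and the marker 'M' for NODATA is an
assumption.

```c
#include <assert.h>
#include <float.h>
#include <stdio.h>
#include <string.h>

#define SYSMIS (-DBL_MAX)
#define NODATA 'M'   /* assumed marker; Statist has its own definition */

/* Write one input token to a column temp file. A lone NODATA marker
   is stored as SYSMIS, so the row position is preserved. Returns 1
   if a value was written, 0 if the token is not a number. */
int write_token(const char *token, FILE *fp)
{
    double value;

    assert(token != NULL && fp != NULL);
    if (token[0] == NODATA && strlen(token) == 1)
        value = SYSMIS;          /* copy the macro into a variable */
    else if (sscanf(token, "%lf", &value) != 1)
        return 0;
    return fwrite(&value, sizeof value, 1, fp) == 1;
}

/* Read back the value stored at position row (0-based).
   Returns 1 on success, 0 if the row does not exist. */
int read_value(FILE *fp, long row, double *out)
{
    if (fseek(fp, row * (long)sizeof *out, SEEK_SET) != 0)
        return 0;
    return fread(out, sizeof *out, 1, fp) == 1;
}
```

The point of the sketch is that a missing value still occupies one
slot in the temp file, so all columns keep the same number of rows.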

III - READING SYSMIS FROM TEMPFILES

At first, I thought that it would be necessary to
rewrite all functions because of the missing values
problem. But now I have some hope that this will not
be necessary. If I'm guessing correctly, code similar
to the following can solve the problem:

After the following code (data.c : 482 and ff.):

  for (k=0; k<n_alloc; k++) {
    if (!x_read[acol[k]]) {
      xx[acol[k]] = readcol(acol[k]);
      label_tab[acol[k]].ptr = xx[acol[k]];
      label_tab[acol[k]].str = alias[acol[k]];
    }

We can add something similar to the following 
to delete rows with missing values:

  int cr = 0; /* current row: next position to keep */
  int tr = 0; /* total number of rows already checked */
  BOOL RowHasMis = FALSE;
  while(tr < total_number_of_rows_in_data_base){
    /* does any selected column have a missing value
       in row tr? */
    for (k=0; k<n_alloc; k++)
      if(xx[acol[k]][tr] == SYSMIS) RowHasMis = TRUE;

    if (RowHasMis){
      /* skip this row */
      tr++;
      RowHasMis = FALSE;
    }
    else{
      /* keep this row: copy it down to position cr */
      for (k=0; k<n_alloc; k++)
        xx[acol[k]][cr] = xx[acol[k]][tr];
      cr++;
      tr++;
    }
  }
  current_valid_number_of_rows_in_xx = cr;

All the code above is only a rough idea. I wrote it
only here, and didn't test anything in the real source
code. Probably the Statist source code already has
variables that correspond to
"current_valid_number_of_rows_in_xx"
and "total_number_of_rows_in_data_base".

Of course Andreas Beyer's idea of putting the code in
a function is better than adding it directly to this
function. I think that when this function is written
the option -delrow will no longer be necessary.
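
As a rough sketch of what such a function could look like, the
row-deletion loop can be written on its own, without any Statist
globals. The name drop_missing_rows and its parameters are
hypothetical, and plain double is used where Statist would use REAL.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

#define SYSMIS (-DBL_MAX)

/* Remove, in place, every row in which at least one of the ncols
   columns holds SYSMIS. cols[k] points to column k, and every
   column has nrows entries. Returns the number of rows kept. */
size_t drop_missing_rows(double *cols[], size_t ncols, size_t nrows)
{
    size_t cr = 0;               /* next position to write a kept row */
    size_t tr, k;

    assert(cols != NULL);
    for (tr = 0; tr < nrows; tr++) {
        int row_has_mis = 0;

        for (k = 0; k < ncols; k++) {
            if (cols[k][tr] == SYSMIS) {
                row_has_mis = 1;
                break;
            }
        }
        if (row_has_mis)
            continue;            /* skip rows with a missing value */

        for (k = 0; k < ncols; k++)
            cols[k][cr] = cols[k][tr];
        cr++;
    }
    return cr;
}
```

For two columns {1, SYSMIS, 3, 4} and {10, 20, SYSMIS, 40}, the
function keeps the rows (1, 10) and (4, 40) and returns 2.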

*****************************************

It would be very nice if the people on this mailing
list, including those who no longer use Statist, told
us about the difficulties they have had using Statist.
This would help us to know what to do before we start
writing code.

*****************************************

The main difference from Andreas Beyer's suggestion is
that I'm now following the pspp solution and proposing
a negative number as the missing value.

Who will try to fix the problem? Andreas Beyer?

Best,

Jakson

 --- Andreas Beyer <beyer at imb-jena.de> wrote:
> Hi,
> 
> instead of using something like -99 as a representation for missing
> values we should use something like infinity. The GNU C-lib knows the
> constant:
> 
> float NAN;
> 
> which is "not-a-number". Unfortunately this constant is only defined on
> GNU systems. In addition, NAN is unequal to itself. Hence, the following
> is true:
> 
> float x = NAN;
> assert(x != x);
> 
> Instead we might use the maximum double, i.e. something like the
> following:
> 
> #include <float.h> // DBL_MAX
> 
> static double _no_value = DBL_MAX;
> 
> /* Obtaining the internal representation for a
> missing value. */
> double missing_value()
> {
>     return _no_value;
> }
> 
> /* Check if val is a missing value. */
> int is_missing_value(double val)
> {
>     return val == _no_value;
> }
