missing values
Jakson Aquino
jaksonaquino at yahoo.com.br
Wed Feb 9 13:29:21 CET 2005
Hi!
Before reading the email written by Andreas Beyer, I
had written something similar. I also wrote some
code suggestions, but I don't know the Statist source
code well enough to be certain that what
I wrote would work.
If I'm not wrong, we can fix the problem of missing
values by writing a few lines of code. I still think
that we can define one specific double number as
the system missing value.
I - DEFINING SYSMIS
First we would have to give Statist the ability to
keep missing values. We could borrow one line from
pspp and put it in statist.h:
#define SYSMIS (-DBL_MAX)
I wrote the following program to see the exact value
of -DBL_MAX:
#include <stdio.h>
#include <float.h> /* standard header that defines DBL_MAX */
#define SYSMIS (-DBL_MAX)
int main(){
printf("\n%e\n\n", SYSMIS);
return 0;
}
The program was compiled with g++, and the output
was: -1.797693e+308
The complete number is:
-17976931348623157081452742373170435679807056752584499
659891747680315726078002853876058955863276687817154045
895351438246423432132688946418276846754670353751698604
991057655128207624549009038932894407586850845513394230
458323690322294816580855933212334827479782620414472316
8738177180919299881250404026184124858368.000000
Certainly the probability of finding this specific
number in a database is very small. We can say in
the documentation that if this number does appear in
the database, it is the responsibility of the user to
define a different SYSMIS with an option such as
-sysmis=number.
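To make the idea of a user-defined SYSMIS more concrete, here is a minimal sketch of how such an option could be handled. Everything in it is an assumption for illustration: the function names parse_sysmis_option and is_sysmis do not exist in Statist, and the option -sysmis=number is only the proposal above, not an existing flag.

```c
#include <float.h>   /* DBL_MAX */
#include <stdlib.h>  /* strtod */
#include <string.h>  /* strncmp, strlen */

/* Current sentinel for missing values; defaults to the pspp-style value. */
static double sysmis = -DBL_MAX;

/* Hypothetical option parser: accepts "-sysmis=number" and stores the
   number as the new sentinel. Returns 1 if the argument was recognised
   and parsed, 0 otherwise (the sentinel is then left unchanged). */
int parse_sysmis_option(const char *arg)
{
  const char *prefix = "-sysmis=";
  size_t len = strlen(prefix);
  char *end;
  double value;

  if (strncmp(arg, prefix, len) != 0)
    return 0;
  value = strtod(arg + len, &end);
  if (end == arg + len || *end != '\0')
    return 0;  /* no number found, or trailing characters */
  sysmis = value;
  return 1;
}

/* Exact comparison is intended here: the sentinel is only stored and
   read back, never the result of a computation. */
int is_sysmis(double x)
{
  return x == sysmis;
}
```

With a run-time variable instead of a macro, the rest of the code would call is_sysmis() and would not need to be recompiled when the user picks another value.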
II - WRITING SYSMIS IN TEMPFILES
Now that we have the constant SYSMIS, we have to store
it in the temp files whenever a missing value is read
from a column. Perhaps the following replacement
will be enough:
data.c, line 261 and ff:
if ((token[0] == NODATA) && (strlen(token)==1)) {
colread ++;
}
else if (sscanf(token, "%lf", &test)==1) {
FWRITE(&test, sizeof(REAL), 1, ttempfile[actcol]);
nn[actcol] ++;
colread ++;
}
Should be changed to something like:
if ((token[0] == NODATA) && (strlen(token)==1)) {
REAL mis = SYSMIS; /* SYSMIS is a macro, so we need a variable to take its address */
FWRITE(&mis, sizeof(REAL), 1, ttempfile[actcol]);
nn[actcol] ++;
colread ++;
}
else if (sscanf(token, "%lf", &test)==1) {
FWRITE(&test, sizeof(REAL), 1, ttempfile[actcol]);
nn[actcol] ++;
colread ++;
}
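The same idea can be tried outside Statist with a small stand-alone sketch. It is not Statist code: write_token and roundtrip are hypothetical helpers, plain double is used in place of REAL, and the marker 'M' for NODATA is an assumption made only for this example.

```c
#include <float.h>   /* DBL_MAX */
#include <stdio.h>   /* tmpfile, fwrite, fread, sscanf */
#include <string.h>  /* strlen */

#define SYSMIS (-DBL_MAX)
#define NODATA 'M'   /* assumed missing-value marker in the input */

/* Convert one token and append it to fp as a raw double.
   Returns 1 if a value (real or SYSMIS) was written,
   0 if the token was not a number or the write failed. */
int write_token(const char *token, FILE *fp)
{
  double value;

  if (token[0] == NODATA && strlen(token) == 1)
    value = SYSMIS;
  else if (sscanf(token, "%lf", &value) != 1)
    return 0;
  return fwrite(&value, sizeof value, 1, fp) == 1;
}

/* Write n tokens to a temporary file, read them back into out[] and
   return how many values came back (-1 if no temp file was available). */
int roundtrip(const char *tokens[], int n, double out[])
{
  FILE *fp = tmpfile();
  int i, written = 0, nread;

  if (fp == NULL)
    return -1;
  for (i = 0; i < n; i++)
    written += write_token(tokens[i], fp);
  rewind(fp);
  nread = (int) fread(out, sizeof out[0], (size_t) written, fp);
  fclose(fp);
  return nread;
}
```

The point of the sketch is that a missing value now occupies a slot in the temp file, so all columns keep the same number of rows.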
III - READING SYSMIS FROM TEMPFILES
At first, I thought that it would be necessary to
rewrite all functions because of the missing values
problem. But now I have some hope that this will not
be necessary. If I'm guessing correctly, code
similar to the following can solve the problem:
After the following code (data.c : 482 and ff.):
for (k=0; k<n_alloc; k++) {
if (!x_read[acol[k]]) {
xx[acol[k]] = readcol(acol[k]);
label_tab[acol[k]].ptr = xx[acol[k]];
label_tab[acol[k]].str = alias[acol[k]];
}
We can add something similar to the following
to delete rows with missing values:
int cr = 0; /* current row */
int tr = 0; /* total number of rows already checked */
BOOL RowHasMis = FALSE;
while(tr < total_number_of_rows_in_data_base){
for (k=0; k<n_alloc; k++)
if(xx[acol[k]][tr] == SYSMIS) RowHasMis = TRUE;
if (RowHasMis){
tr++;
RowHasMis = FALSE;
}
else{
for (k=0; k<n_alloc; k++)
xx[acol[k]][cr] = xx[acol[k]][tr];
cr++;
tr++;
}
}
current_valid_number_of_rows_in_xx = cr;
All code above is only a rough idea. I wrote it only
here, and didn't test anything in the real source code.
Probably the Statist source code already has variables
that correspond to
"current_valid_number_of_rows_in_xx"
and "total_number_of_rows_in_data_base".
Of course Andreas Beyer's idea of putting the code in
a function is better than adding it directly to this
function. I think that when this function is written
the option -delrow will no longer be necessary.
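As a rough illustration of the function version, the row deletion could look like the sketch below. The name delete_missing_rows and its parameters are hypothetical; the real function would have to use Statist's own arrays (xx, acol) and row counters, which I have not checked.

```c
#include <float.h>  /* DBL_MAX */

#define SYSMIS (-DBL_MAX)

/* Remove, in place, every row in which at least one column holds SYSMIS.
   cols[k] points to column k; all ncols columns have nrows entries.
   Returns the number of rows kept. */
int delete_missing_rows(double *cols[], int ncols, int nrows)
{
  int cr = 0;  /* next free row among the rows kept so far */
  int tr, k;

  for (tr = 0; tr < nrows; tr++) {
    int row_has_mis = 0;

    for (k = 0; k < ncols; k++) {
      if (cols[k][tr] == SYSMIS) {
        row_has_mis = 1;
        break;
      }
    }
    if (row_has_mis)
      continue;  /* skip this row in every column */
    for (k = 0; k < ncols; k++)
      cols[k][cr] = cols[k][tr];
    cr++;
  }
  return cr;
}
```

Because a row is dropped from all selected columns at once, the remaining values stay paired, which is what functions working on two or more columns need.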
*****************************************
It would be very nice if the people on this mailing list,
including those who no longer use Statist, told us
about the difficulties they have had using Statist.
This would help us to know what to do before we start
writing code.
*****************************************
The main difference from Andreas Beyer's suggestion is
that I'm now following the pspp solution and proposing a
negative number as the missing value.
Who will try to fix the problem? Andreas Beyer?
Best,
Jakson
--- Andreas Beyer <beyer at imb-jena.de> wrote:
> Hi,
>
> instead of using something like -99 as a
> representation for missing
> values we should use something like infinity. The
> GNU C-lib knows the
> constant:
>
> float NAN;
>
> which is "not-a-number". Unfortunately this constant
> is only defined on
> GNU systems. In addition, NAN is unequal to itself.
> Hence, the following
> is true:
>
> float x = NAN;
> assert(x != x);
>
> Instead we might use the maximum double, i.e.
> something like the following:
>
> #include <float.h> // DBL_MAX
>
> static double _no_value = DBL_MAX;
>
> /* Obtaining the internal representation for a
> missing value. */
> double missing_value()
> {
> return _no_value;
> }
>
> /* Check if val is a missing value. */
> int is_missing_value(double val)
> {
> return val == _no_value;
> }