missing values

Mon Feb 7 11:08:34 CET 2005

Hi Jakson,

On Sun, Feb 06, 2005 at 06:46:14PM -0300, Jakson Aquino wrote:
> I didn't look at the source code carefully, but from
> the "behavior" of Statist it seems to me that we have
> a problem. 

I agree that the behaviour you describe below is problematic
and we need to test it more thoroughly and find a solution.

> Suppose that a research team applied a questionnaire
> to 600 people, asking them 50 questions. Some questions
> wouldn't apply to everyone, and many people wouldn't
> answer many questions. Now, I have a database with 50
> columns and 600 rows, and many missing values. Every
> time I want to do some correlation between variables I
> will have to do:
> 
> 1) Open with Statist a version of the database with no
> 'M' inside. The missing values would be indicated by
> some valid value.
> 
> 2) Save the columns that I want to correlate in an
> ASCII file.
> 
> 3) Quit Statist
> 
> 4) Use an external program to recode the new and
> smaller database to turn into 'M' all values that, in
> fact, are missing.
> 
> 5) Load the new small recoded database with the option
> -delrow and, then, run the analysis. 

My idea was that recoding the full database, with 'M' for the
missing values, should be a one-step operation, after which
Statist can run all the various analyses on it.
If this is not the case, this is indeed bad.

Could you produce a few example files so that
the error can be made obvious? We can use them as test cases
when we fix the bug.

> While still becoming familiar with the database, I
> should run many analyses, mainly regression analyses
> or multiple linear correlations, with different
> combinations of variables. It's easy to imagine how
> boring this work would be.

> We need some way of keeping Statist aware of missing
> values, instead of simply deleting them and testing
> whether the columns have the same number of data
> points. 

Yes.
The missing values could be dealt with before the calculations.
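
As a rough illustration of handling missing values before the
calculations, the sketch below drops every row in which either of two
columns is missing, so that a correlation routine only sees complete
pairs. The names MISSING_VALUE and keep_complete_pairs are hypothetical
and not taken from the Statist sources, and using DBL_MAX as the marker
is only an assumption.

```c
#include <float.h>   /* DBL_MAX */
#include <stddef.h>  /* size_t */

/* Hypothetical marker for a missing data point (an assumption). */
#define MISSING_VALUE DBL_MAX

/*
 * Copy only those rows where both x[i] and y[i] are present.
 * x_out and y_out must have room for n values each.
 * Returns the number of complete pairs that were kept.
 */
size_t keep_complete_pairs(const double *x, const double *y, size_t n,
                           double *x_out, double *y_out)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++) {
        /* Skip the whole row if either column is missing. */
        if (x[i] == MISSING_VALUE || y[i] == MISSING_VALUE)
            continue;
        x_out[kept] = x[i];
        y_out[kept] = y[i];
        kept++;
    }
    return kept;
}
```

With this kind of filter the missing-value test runs once per pair of
columns, and the statistical functions themselves would not need to
change.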

> Perhaps we can define the biggest possible
> double as missing value! [I don't know what the
> biggest double value is.] 

There will be a standard define for this.
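
For reference, standard C provides the largest finite double as DBL_MAX
in <float.h>. A minimal sketch of a missing-value test built on it
follows; the names MISSING_VALUE and is_missing are hypothetical, and
whether Statist should use DBL_MAX this way is an open question.

```c
#include <float.h>    /* DBL_MAX: largest finite double */
#include <stdbool.h>

/* Hypothetical marker name; not taken from the Statist sources. */
#define MISSING_VALUE DBL_MAX

/* Returns true if x carries the missing-value marker. */
bool is_missing(double x)
{
    return x == MISSING_VALUE;
}
```

Note that 9.99999e+99 is far below DBL_MAX on common platforms, so it
could in principle collide with legitimate data.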

> That is, whenever Statist
> finds 'M' in the database it puts 9.99999e+99 in its
> memory, and whenever it finds 9.99999e+99 in its
> memory it assumes that it is a missing value. While
> writing columns as ASCII files, missing would come
> back to the usual 'M'. And, just for precaution, if it
> finds 9.99999e+99 while reading an ASCII file, it will
> report an error. Of course this will slow down the
> computations, since an if(xx[i] == 9.99999e+99) will
> be necessary when reading each single data point from
> memory. And, worst of all, it would be necessary to fix
> all functions! Any better solution?

Putting a special value in is not the nicest solution;
however, I would need to look at the code to think about
others.
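
To make the quoted proposal concrete, here is a minimal sketch of the
round trip: 'M' becomes a marker value when reading, the marker becomes
'M' again when writing, and a literal number equal to the marker is
rejected on input. The function names are hypothetical, and DBL_MAX
stands in for the 9.99999e+99 of the proposal as an assumption. This is
not how Statist currently works.

```c
#include <errno.h>
#include <float.h>   /* DBL_MAX */
#include <stdio.h>   /* snprintf */
#include <stdlib.h>  /* strtod */
#include <string.h>  /* strcmp */

/* Hypothetical in-memory marker for a missing value (an assumption). */
#define MISSING_VALUE DBL_MAX

/*
 * Parse one whitespace-free token from an ASCII data file.
 * "M" is stored as MISSING_VALUE. Returns 0 on success and -1 if the
 * token is not a number, is out of range, or equals the marker itself.
 */
int parse_token(const char *tok, double *out)
{
    char *end;
    double v;

    if (strcmp(tok, "M") == 0) {
        *out = MISSING_VALUE;
        return 0;
    }
    errno = 0;
    v = strtod(tok, &end);
    if (end == tok || *end != '\0' || errno == ERANGE)
        return -1;              /* not a valid number */
    if (v == MISSING_VALUE)
        return -1;              /* real data must not equal the marker */
    *out = v;
    return 0;
}

/* Format a value for output; the marker is written back as "M". */
int format_value(double v, char *buf, size_t size)
{
    if (v == MISSING_VALUE)
        return snprintf(buf, size, "M");
    return snprintf(buf, size, "%g", v);
}
```

Confining the marker to the reading and writing code like this keeps the
file format unchanged, but every calculation would still have to test
for the marker, which is the cost raised in the quoted message.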

> Anyway, we need to write documentation in English
> and include this information in the documentation.

English documentation is really needed, that is true.

	Bernhard