[Thuban-list] Patch: classifiers

Thu Feb 12 14:08:08 CET 2004

(bringing this back to the list, as I think others might be interested)

Daniel Calvelo Aros said:
> On Wed, 11 Feb 2004 18:49:20 +0100 (CET), Moritz Lennert wrote
>> No, the program defines the maximum number of classes that make any sense
>> according to the stopping criterion defined in the code, but it does
>> not define the best number of classes.
>
> You mean it's not "best" according to cartographic or other criteria
> different from the implicit "best" of the stopping criterion?
>

No, what I mean is that the stopping criterion does not imply a "best"
number of classes, but measures the statistical significance of creating
another class break. The algorithm stops when the creation of another
break would not give more information than not creating one.

The definition of the "best" number of classes is independent of that,
especially since cartographic issues come into play as well, so what
is "best" (as always) depends on what one aims for.
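
To make the distinction concrete, here is a rough sketch of how I picture
such a procedure: split repeatedly, keep every intermediate solution, and
stop only when no further split passes the test. This is NOT the original
code. The split criterion (largest reduction of the within-class sum of
squares), the function names and the stand-in gap_test are all my
assumptions; the real 95% significance test is not reproduced here.

```python
def best_split(values):
    """Return (index, gain) for the split of the sorted list `values` that
    most reduces the within-class sum of squares, or None if no split exists.
    Assumed criterion, for illustration only."""
    def sse(xs):
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs)

    total = sse(values)
    best = None
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue  # cannot put a break between two equal values
        gain = total - sse(values[:i]) - sse(values[i:])
        if best is None or gain > best[1]:
            best = (i, gain)
    return best


def partition_steps(values, is_significant, max_steps=20):
    """Split one class at a time and record the breaks after every step.

    `is_significant(cls, i)` is a caller-supplied test that decides whether
    splitting the sorted class `cls` at index `i` is worth it. The procedure
    stops when no class has a split that passes the test, so the last entry
    is the maximum sensible number of classes, not the "best" one.
    """
    classes = [sorted(values)]
    steps = []
    while len(steps) < max_steps:
        candidates = []
        for k, cls in enumerate(classes):
            if len(cls) < 2:
                continue
            split = best_split(cls)
            if split is not None and is_significant(cls, split[0]):
                candidates.append((split[1], k, split[0]))
        if not candidates:
            break  # stopping criterion: no further split is significant
        _, k, i = max(candidates)  # take the split with the largest gain
        cls = classes[k]
        classes[k:k + 1] = [cls[:i], cls[i:]]
        # A break is recorded as the lower limit of each class but the first.
        steps.append([c[0] for c in classes[1:]])
    return steps


def gap_test(cls, i):
    """Hypothetical stand-in for the real significance test."""
    return cls[i] - cls[i - 1] > 3
```

With partition_steps([1, 2, 3, 10, 11, 12, 30, 31], gap_test) one gets two
solutions, first the single break at 30 and then the breaks at 10 and 30.
Both would be shown to the user, who then picks one.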

>> I actually don't know how it
>> would do that. I've attached a results file generated with the
>> original code and as you can see it gives a number of solutions from
>> which I can choose. It would be nice to have such a table in Thuban.
>
> I see. From what I understood in the code, those are printouts of each
> step of the algorithm as it further partitions the dataset. The stopping
> criterion is something like "difference between original class and best
> partition is statistically insignificant", for a hard-wired significance
> threshold of 95%.
>

Exactly, it is a question of the statistical significance of an additional
partition, not of the best number of partitions. So when in your Thuban version
it says "Got X breakpoints in the distribution of Y", it should maybe say
something like: "Got X statistically significant (95%) breakpoints ...". A
really sophisticated version of this could even let the user decide the
level of desired significance...but I think this would be overkill :-)
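
Just to illustrate the wording, something along these lines would do. The
function name and its parameters are made up for this mail and are not
taken from the Thuban sources; the significance level is only passed in so
that the message stays honest if it ever becomes configurable.

```python
def breakpoint_message(n_breaks, column, significance=0.95):
    """Build the status message, stating the significance level explicitly.

    Hypothetical helper: `significance` is a fraction strictly between
    0 and 1.
    """
    if not 0 < significance < 1:
        raise ValueError("significance must be between 0 and 1")
    # Round before formatting so that 0.95 prints as 95 and not 94.999...
    percent = round(significance * 100, 2)
    return ("Got %d statistically significant (%g%%) breakpoints "
            "in the distribution of %s" % (n_breaks, percent, column))
```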

> Now, what you suggest would be (correct me, please) to use the results from
> the partitioning steps as inputs into a quality measure of each
> classification (in the sense of your former definition of "best"), and
> from there let either the user or some automated thing define the number
> of classes.
>
> I still don't have a Jenks-Caspall-like quality measure, so for now only
> the user is available as a criterion of classification quality.

I don't really believe in the existence of _one_ quality measure. I think
that this depends on too many factors, including "purely"
graphical/cartographic ones. So I would agree with you saying that "only
the user is available as a criterion of classification quality", but I
would say that this is always the case, not just for now...

>
> In this order of ideas, the easiest way would be to fix the maximum
> number of classes attainable by the algorithm, unless "automatic" is
> specified. In both cases, if the hardwired stopping criterion is
> fulfilled, stop and give a warning if Nclasses<specified. Would that be ok?

That sounds good, but I don't think the user should have to determine a
number of classes a priori. The algorithm should be seen as a decision
aid, nothing more, so it should give the user all the information needed
to make a good decision. Giving her at least the potential class limits
and the number of observations per class should help her in deciding on
the number of classes (the original code also provides a measure of
density, dividing the proportion of observations in a class by the
class's proportion of the total amplitude). Sometimes, she might want to
subdivide a class, just because it contains too many observations...
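
A minimal sketch of the kind of table I have in mind, one row per class
with its limits, its number of observations and the density described
above. I read "amplitude" as the range (maximum minus minimum). The
function name, the row layout and the rule that a value equal to a break
goes into the upper class are my assumptions, not the original code.

```python
from bisect import bisect_right


def class_summary(values, breaks):
    """Return one (lower, upper, count, density) tuple per class.

    values: the observations; breaks: the interior class breaks.
    density = (share of observations in the class)
              / (share of the total range covered by the class)
    A density above 1 means the class is more crowded than a uniform
    distribution over the whole range would make it.
    """
    lo, hi = min(values), max(values)
    inner = sorted(breaks)
    limits = [lo] + inner + [hi]
    counts = [0] * (len(inner) + 1)
    for v in values:
        # Number of breaks <= v is the class index (ties go to the upper class).
        counts[bisect_right(inner, v)] += 1

    rows = []
    for i, count in enumerate(counts):
        lower, upper = limits[i], limits[i + 1]
        width = upper - lower
        if hi > lo and width > 0:
            density = (count / len(values)) / (width / (hi - lo))
        else:
            density = None  # zero-width class or constant data: undefined
        rows.append((lower, upper, count, density))
    return rows


if __name__ == "__main__":
    for lower, upper, count, density in class_summary([1, 2, 3, 4, 10, 20], [5]):
        print(lower, upper, count, density)
```

Printing such rows for each of the solutions found by the algorithm would
already give the user most of what she needs to choose between them.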

I know this makes the whole thing more complicated to program, but I am
very suspicious of "automatic" cartography and in my eyes this is one of
the weak points of ArcView.

Moritz