[Thuban-list] Classifications

Thu Dec 18 22:34:14 CET 2003

Hi all.

I've read the extraordinary article by Jenks and Caspall on classification for
choroplethic maps. A must read, BTW. I'm willing to implement their technique
and maybe use it as a framework for all the classification options in Thuban.
I will need some help, not in the algorithms themselves but rather on how to
hook them into Thuban.

Let me give an overview of the classification problem as presented by Jenks
and Caspall, then I will present the problems I'm facing.

Jenks and Caspall do three important things in the article. First they state
the problem as an optimization one: the minimization of an "interpretation
error" induced by the given (classified) representation of choroplethic maps.
Second, they divide the problem into three different perspectives and finally
they give algorithms for partial and overall solutions of the problem.

Let me elaborate on the second point: the authors state that reading a
choroplethic map may be performed for three reasons (or phrased differently,
choropletic maps have three different uses). First, (the "O" use) they are
read as an overview of the spatial arrangement of the data; second, they are
used for specific reading of the data values at specific locations ("T" for
"table" use); third, they are used for "boundary" reading where the
differences between regions are mostly taken into account and not the values
themselves (the "B" for "boundary" use).

For each use or reading mode, they define a normalized error index and propose
a composite error index which serves for an optimization algorithm in order to
obtain the best classification.

Now, to the point. The error indices are roughly given by:

1) T error: some average measure of the classification error of the data
values with respect to the classified values. For example, in TeX notation:

$$ \sum_{i=1}^L ( v_i - m_i )^2 $$ for values $v_i$ indexed by 1..L and $m_i$
being the average of the v values within the class to which $v_i$ belongs.

T error relies on the data distribution only, and can be implemented
effortlessly for all the methods we have been discussing and implementing
(k-means, quantile, even breaks,...)

2) O error: some average on the *overall* visualization error. This is
calculated by something like

$$ \sum_{i=1}^L  |v_i - m_i|*A_i $$ where v and m are the same as the previous
ones, and A_i is the area of the i-th polygon. This is an average or total
error *volume*, which takes into account the fact that larger polygons are
visually prominent over smaller ones.

That means that in order to calculate an O-type index error, we need to access
the area of the polygons corresponding to the data. 

Hints on doing this elegantly? The ideal would be to have a classifier class
to which an array of areas is passed, instead of accessing the shape info from
the classifier.

3) B error: this is the tricky one. The boundary error measures the average of
differences in v vs. m values between *neighboring* polygons. In order to
implement this, we would need some representation of the layer's topology,
ideally an adjacency matrix for all the polygons.

How can you get an adjacency matrix from a polygon layer?

Finally, given these measures, we could use any of a set of optimization
techniques to get a good classification in the sense of each or the overall
indices.

I am currently implementing T-type classifiers to supplement quantile and even
spacing. 

I have somewhere a genetic algorithm implementation which could be used to
optimize classifications based on any of the error measures (although I reckon
a properly-tuned simulated annealing could perform better). But I have to be
able to compute O-type and B-type error indices, for which I need acces to a
layer's geometry and topology.

Any hints are welcome!

Daniel.

-- Daniel Calvelo Aros
-- Dirección General de Información Agraria
-- Ministerio de Agricultura del Perú
-- (51-1) 424-9001