4. Evaluation of Regionalization Methods with Synthetic Data
Regionalization aggregates a large number of small units into a relatively small number of regions while optimizing an objective function, such as the sum of squared differences (SSD), under certain constraints. For exploratory spatial analysis, the goal of regionalization is to remove spurious data variation (or unstable rate estimates) through aggregation and thus to discover hidden spatial patterns (such as areas of unusually high cancer rates). This paper investigates the capability of regionalization methods to discover and preserve spatial patterns in data. We evaluated several recent regionalization methods (Guo, 2008; Duque et al., 2010) with a large number of synthetic data sets. We also combined Empirical Bayes Smoothers (EBS) with the regionalization methods. The main findings are as follows:
- Incorporating EBS can significantly improve each method's performance in detecting spatial patterns;
- The Ward and CLK methods in REDCAP (Guo, 2008), with EBS, perform significantly better than the other methods.
4.1. Synthetic Cancer Data Sets
Each synthetic data set is generated as follows. First, a population density surface is created to simulate a mix of urban, suburban and rural areas. One million points (each representing a person at a location) are then randomly generated according to this surface, so that the spatial distribution of the one million people follows the predefined density.
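The first step can be sketched as weighted sampling over a gridded density surface. This is a minimal illustration only, not the procedure actually used: the grid representation, the function name `sample_population`, and the toy 2x2 surface are all assumptions made for the example.

```python
import random

def sample_population(density, n_people, cell_size=1.0, seed=0):
    """Draw (x, y) locations whose spatial distribution follows `density`.

    density: 2D list of non-negative weights, indexed as density[row][col].
    (Hypothetical representation of the density surface.)
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(density), len(density[0])
    cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    weights = [density[r][c] for r, c in cells]
    # Pick a grid cell for each person with probability proportional to density.
    chosen = rng.choices(cells, weights=weights, k=n_people)
    # Place each person uniformly at random inside the chosen cell.
    return [((c + rng.random()) * cell_size, (r + rng.random()) * cell_size)
            for r, c in chosen]

# Toy surface: one dense "urban" cell, two sparser cells, one empty cell.
density = [[8.0, 1.0],
           [1.0, 0.0]]
people = sample_population(density, n_people=10_000)
```

With the toy weights above, roughly 80% of the sampled points fall in the dense cell and none fall in the empty one.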
Second, 1000 people (i.e., locations) are randomly selected as seeds to generate 1000 Voronoi polygons that cover the entire area (Figure 2). The average population of a polygon is therefore 1000. The average risk of cancer is set to 1.0%, meaning that there will be 10,000 cancer incidents in the data, which are generated in the third step below. By design, seven areas, each consisting of several contiguous polygons (Figure 2), are chosen to have a different cancer rate. Specifically, areas 1-3 have relatively low rates (around 0.5%) and areas 4-7 have higher rates (around 1.5%).
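Because a Voronoi polygon contains exactly the locations that are closer to its seed than to any other seed, polygon membership can be computed without building the polygon geometry. The sketch below takes this shortcut on a much smaller toy population; the function name `voronoi_membership` and the brute-force nearest-seed search are illustrative assumptions, and a real implementation would more likely use a spatial index or a computational geometry library.

```python
import random
from collections import Counter

def voronoi_membership(people, n_units, seed=0):
    """Pick n_units people as seeds and assign every person to the nearest seed."""
    rng = random.Random(seed)
    seeds = rng.sample(people, n_units)

    def nearest(p):
        # Index of the seed with the smallest squared Euclidean distance to p.
        return min(range(n_units),
                   key=lambda i: (p[0] - seeds[i][0]) ** 2 + (p[1] - seeds[i][1]) ** 2)

    return seeds, [nearest(p) for p in people]

# Toy population: 2000 uniformly distributed people, 20 units.
rng = random.Random(1)
people = [(rng.random(), rng.random()) for _ in range(2000)]
seeds, unit_of = voronoi_membership(people, n_units=20)
unit_population = Counter(unit_of)
mean_population = sum(unit_population.values()) / len(unit_population)
```

Here `mean_population` is 2000 / 20 = 100, mirroring the 1,000,000 / 1000 = 1000 average stated above.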
Third, the population is randomly sampled and each selected individual is designated as a cancer incident with a probability that depends on the area in which the person lives. This process continues until 10,000 cancer incidents are assigned. Table 1 shows the cancer rates for the seven special areas for a selected data set. This step is repeated 1000 times to generate 1000 different data sets, each with the same set of polygons and population but different cancer assignments.
Table 1: An example synthetic cancer data set (out of 1000 automatically generated data sets), which has three low-risk areas and four high-risk areas. Figure 2 shows the maps of this data set.
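One way to read the third step is as rejection sampling: draw a person at random, accept them as a case with a probability proportional to the risk of their area, and stop once the target number of cases is reached. The sketch below follows that reading. The exact acceptance probabilities are not given in the text, so the `relative_risk` values, the rule that a person can become a case only once, and the toy population are assumptions.

```python
import random

def assign_cases(area_of_person, relative_risk, n_cases, seed=0):
    """Return the set of person indices chosen as cases."""
    rng = random.Random(seed)
    max_risk = max(relative_risk.values())
    n_people = len(area_of_person)
    cases = set()
    while len(cases) < n_cases:
        person = rng.randrange(n_people)
        if person in cases:
            continue  # assumption: a person is counted as a case at most once
        # Accept with probability proportional to the risk of the person's area.
        if rng.random() < relative_risk[area_of_person[person]] / max_risk:
            cases.add(person)
    return cases

# Toy population: 10,000 people in each of three kinds of area.
area_of_person = ["low"] * 10_000 + ["background"] * 10_000 + ["high"] * 10_000
relative_risk = {"low": 0.5, "background": 1.0, "high": 1.5}  # illustrative values
cases = assign_cases(area_of_person, relative_risk, n_cases=300)
cases_by_area = {a: sum(1 for p in cases if area_of_person[p] == a)
                 for a in relative_risk}
```

Calling the function with different `seed` values produces different case assignments over the same population, which is the role of the 1000 repetitions.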

4.2. Evaluation Measure
To evaluate how well each method detects the seven cancer areas, we calculate a measure for each result, as illustrated in Figure 1. The solid-lined polygon represents the detected region with a high cancer rate and the dash-lined polygon represents the true area. Three values can be calculated: TP = the population in both polygons (the overlapping part); FP = the population outside the true area but within the discovered region; and FN = the population inside the true cancer area but outside the discovered region. We then use TP / (FP + FN) as the overall measure for a regionalization result. For example, in Figure 2, the three values for each result map are:
- TP = the population of yellow units in areas 1-3 PLUS the population of red units in areas 4-7;
- FP = the population of yellow units outside areas 1-3 PLUS the population of red units outside areas 4-7;
- FN = the population of orange units within areas 1-7.
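The three definitions above translate directly into a population-weighted tally. The following sketch assumes each unit is described by its population, its predicted class ("low", "average" or "high", corresponding to yellow, orange and red), and its true class; the tuple layout, the function name `pattern_score`, and the example numbers are hypothetical.

```python
def pattern_score(units):
    """units: iterable of (population, predicted_class, true_class)."""
    tp = fp = fn = 0
    for population, predicted, truth in units:
        if predicted == "average":
            # An "average" unit inside any true area is a missed detection.
            if truth != "average":
                fn += population
        elif predicted == truth:
            tp += population  # low unit in a low area, or high unit in a high area
        else:
            fp += population  # low or high unit outside the matching true areas
    score = tp / (fp + fn) if fp + fn else float("inf")
    return tp, fp, fn, score

units = [
    (1000, "low", "low"),
    (900, "average", "low"),
    (1100, "high", "high"),
    (800, "high", "average"),
    (1200, "average", "average"),
]
tp, fp, fn, score = pattern_score(units)
```

For these made-up units, TP = 2100, FP = 800 and FN = 900, so the score is 2100 / 1700. The measure is unbounded above, and the sketch returns infinity when there are no errors at all.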
To calculate this measure, we need to find two breaks that classify the data (i.e., cancer rates) into three classes: low, average, and high (see the maps in Figure 2). Each unit is assigned the cancer rate of the region that contains it. Then, for each result, we use an algorithm to find the two breaks that maximize the value of TP / (FP + FN). In other words, given the estimated unit rates based on a regionalization result, we use its best possible TP / (FP + FN) value as its performance score.
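The text does not specify the break-finding algorithm. One simple possibility, sketched here as an assumption, is an exhaustive search that tries every pair of observed rates as the (low, high) breaks and keeps the pair with the highest score. The function names and the example units are hypothetical.

```python
from itertools import combinations_with_replacement

def score_for_breaks(units, low_break, high_break):
    """units: iterable of (rate, population, true_class)."""
    tp = fp = fn = 0
    for rate, population, truth in units:
        if rate < low_break:
            predicted = "low"
        elif rate > high_break:
            predicted = "high"
        else:
            predicted = "average"
        if predicted == "average":
            if truth != "average":
                fn += population
        elif predicted == truth:
            tp += population
        else:
            fp += population
    return tp / (fp + fn) if fp + fn else float("inf")

def best_breaks(units):
    """Try every pair of observed rates (low <= high) and keep the best score."""
    candidates = sorted({rate for rate, _, _ in units})
    return max(
        (score_for_breaks(units, lo, hi), lo, hi)
        for lo, hi in combinations_with_replacement(candidates, 2)
    )

units = [
    (0.005, 1000, "low"),
    (0.006, 900, "low"),
    (0.009, 1000, "high"),     # a high-risk unit whose estimated rate is too low
    (0.010, 1200, "average"),
    (0.011, 800, "average"),
    (0.015, 1100, "high"),
]
best_score, low_break, high_break = best_breaks(units)
```

The search is quadratic in the number of distinct rates, which is modest here because all units in a region share one rate. Several break pairs can tie for the best score, so only the score itself should be treated as the result.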

Figure 1: An illustration of the measure for assessing a regionalization result in terms of detecting (or preserving) patterns in data.

Figure 2: Regionalization results with different methods for a selected synthetic data set. The seven polygons with thick boundaries are the true areas (see Table 1 for more information).
4.3. Evaluation Results
With the 1000 synthetic data sets, 1000 scores are obtained for each regionalization method. We calculate the average, median, first quartile (Q1), and third quartile (Q3) of the 1000 scores. Table 2 shows the results for seven methods: Max-P (http://www.pysal.org/users/tutorials/region.html); Max-P with EBS (Empirical Bayes Smoothers); EBS alone; CLK (from REDCAP); Ward (from REDCAP); CLK with EBS; and Ward with EBS. Figure 3 shows a graphical view of their performance. Figure 2 shows the regionalization results for a selected data set, from which we can see that Ward_EBS detects the true cancer areas better than the other methods.
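The summary statistics for one method can be computed as in the sketch below. Quartile values depend on the interpolation rule, which the text does not state; the "inclusive" method used here is an assumption, and the five scores are placeholders rather than results from this study.

```python
import statistics

def summarize(scores):
    """Mean, median and quartiles of one method's scores (Python 3.8+)."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {"mean": statistics.mean(scores), "median": median, "q1": q1, "q3": q3}

scores = [1.0, 2.0, 3.0, 4.0, 5.0]  # placeholder values
summary = summarize(scores)
```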
The Mann–Whitney U test is used to determine whether the scores of two methods are significantly different from each other (or whether one method is significantly better than the other). The test results (Table 2) show that:
- CLK_EBS and Ward_EBS are significantly better than all other methods;
- CLK_EBS and Ward_EBS are not significantly different from each other;
- Incorporating EBS can significantly improve each method (Max-P, CLK, or Ward);
- Without EBS, the original methods in REDCAP (CLK and Ward) perform significantly better than Max-P.
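For readers unfamiliar with the test, the sketch below computes the Mann–Whitney U statistic for two sets of scores and a two-sided p-value from the normal approximation with a tie correction and no continuity correction. It is a plain-Python illustration, not the software used for Table 2. The sample values are made up, and the function assumes the pooled values are not all identical.

```python
import math

def mann_whitney_u(x, y):
    """Return (U for sample x, z statistic, two-sided p-value)."""
    n1, n2 = len(x), len(y)
    pooled = sorted((value, group) for group, sample in enumerate((x, y))
                    for value in sample)
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sum_x = sum(rank for rank, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u_x, z, p_two_sided

# Made-up scores for two hypothetical methods.
method_a = [1.0, 2.0, 3.0, 4.0, 5.0]
method_b = [6.0, 7.0, 8.0, 9.0, 10.0]
u, z, p = mann_whitney_u(method_a, method_b)
```

In this toy example every score of `method_a` is below every score of `method_b`, so U is 0 and the p-value falls below 0.05. With 1000 scores per method, as in this evaluation, the normal approximation is well justified.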
Table 2: Performance comparison of different methods with 1000 synthetic data sets.


Figure 3: Performance comparison of different methods with 1000 synthetic data sets. Significance tests are carried out with the Mann–Whitney U test. The best two methods, Ward_EBS and CLK_EBS, are significantly better than their original versions in REDCAP (i.e., Ward and CLK).









