Abstract
The census of population and dwellings, undertaken by national state institutions the world over at regular intervals, is a rich source of information. However, there are major challenges to overcome when transforming census data, which usually consist of a vast number of attributes, into useful knowledge. In this paper, an artificial intelligence (AI) based approach is investigated to select attribute features that indicate interesting patterns in the Beppu census wards in 2000 and 2010. The results of self-organising map (SOM, an unsupervised artificial neural network) based clustering, GIS visualisation and machine learning (the J48 and JRip functions of WEKA) provide relevant discerning features, new patterns and new knowledge that can be of use to many professionals, such as urban/transport planners and resource managers.
Introduction
The census of population and dwellings, undertaken by national state institutions all around the world on a regular basis (every five or ten years), is a rich source of information. However, there are major challenges to overcome when analysing the data and identifying the patterns embedded in it [1]. In the last two decades especially, the application of data mining techniques to official statistics has become a topic of active interest, as it provides a means to gain richer and deeper insights into the population (demographics) of a nation or region [2]. Even so, the application of data mining to official statistics is described as minimal, which is not surprising for two reasons. Firstly, National Statistical Institutes (NSIs) are tasked with data collection, and the common practice has been to outsource the analysis of the data. Secondly, official statisticians’ main task is to answer precise questions and make forecasts, not to find unexpected patterns or models. Nonetheless, some studies have attempted to gain new knowledge from censuses.
In 2000, the authors of a study [3] used the 2000 census, with a total population of 126.93 million, along with a medium variant, and projected an increase in Japan’s population in the subsequent years. The study indicated a peak of 127.74 million in 2006. The population was then expected to drop back to its year-2000 size by 2013; thereafter, the projections indicated a decrease to about 100.6 million in 2050.
According to the same study, which used a medium variant method, the population of the working age group reached its peak in 1995 and was then expected to enter a declining phase, falling below 70 million by 2030 and eventually dropping to 53.89 million by 2050 (Fig. 1, from Fig. 3 in [3], p. 10). In this context, the paper presents an application of self-organising map (SOM) based clustering, GIS visualisation and machine learning techniques, namely J48 and JRip (in the WEKA software package), to analyse trends in the population dynamics and patterns in dwelling types in the Beppu census wards on the Kyushu island of Japan. The SOM cluster membership of the census data is incorporated into a GIS environment (ArcGIS) for visualisation purposes. The machine learning techniques are used for feature selection and knowledge extraction from the Beppu 2000 and 2010 censuses. The census wards with the largest population increase and decrease in 2010 are further analysed to understand the demographic changes in these wards.
Meanwhile, there have been feature selection and noise removal attempts in different contexts. For instance, in [2] the authors elaborated on a genetic algorithm (GA) based approach to feature selection and parameter optimisation, especially for support vector machines. Xiao et al. [3] introduced an efficient top-(k,l) range query processing method for uncertain data based on multi-core architectures. In [4], the authors elaborated on an adaptive algorithm for retrieving global skyline tuples from all existing distributed local sites with the minimum communication cost. In [5], the authors used a sample feature value with its class label in a rapid incremental learning algorithm, which improved the performance of AdaBoost significantly. AdaBoost is a popular method for vehicle detection.
With that introduction to the issues relating to feature selection in a variety of application domains, section 2 outlines the literature on census data mining reviewed for this work, after which the methodology adopted in this research is presented. Section 4 elaborates on the results achieved, and finally the conclusions and future directions of this research are summarised.
Data mining of census data
Extracting valuable, previously unknown knowledge and patterns from census data has become a common practice for supporting strategic policy making in many sectors and enterprises [6, 7]. In [8], an application of Subgroup Miner to census data was illustrated. The goal of the system was to provide a spatial and temporal mining tool for census data, especially for creating interactions between spatial subgroup mining and GIS mapping. The study used the UK Census, undertaken every ten years to collect population and other statistics aimed at serving those who have to plan and allocate resources. The major customers anticipated to use census data with this tool included departments of national and local government, and providers of services such as health and education.
Furthermore, census data mining using SOM and GIS [9], CART [7] and a relational data mining approach [10] are a few other examples. In [9], the authors outlined the common data mining approaches investigated so far for census data mining. With that introduction to the topic of the paper and the literature reviewed for this work, the methodology adopted is elaborated in the next section.
The methodology
In this research, data relating to Beppu’s age group and dwelling population profiles, extracted from the 2000 and 2010 censuses, are investigated using SOM based clustering and machine learning techniques (JRip and J48) to look for any new interesting patterns in the two censuses. This is an extension of the work presented in [11], which used Beppu’s 2000 census. In that work, 86 attributes relating to age, employment and dwelling details extracted from the Beppu 2000 census were clustered progressively from two to seven clusters using Kohonen SOMs. In this study, the SOM clustering (membership) is used as the target variable for rule extraction with machine learning techniques. The JRip rules and J48 decision tree rules extracted using the WEKA software [12] are analysed to select any discerning attributes in the SOM clustering. The SOM memberships of the two censuses are visualised in a GIS (ArcGIS) environment.
SOM clustering
A SOM is a single-layered artificial neural network that uses feed-forward algorithmic learning for clustering similar multidimensional data points on a two-dimensional display (Fig. 2). The SOM algorithm, first introduced by Kohonen in the 1980s and based on the 20th century’s understanding of the brain’s cortex cells, has been successfully applied to real world problems in many disciplines [13]. SOM techniques (clustering and component analysis) provide an excellent data mining tool, especially for overcoming the limitations imposed by conventional data analysis methods, such as standard statistical methods.
SOM algorithm
A SOM consists of a regular, usually two-dimensional grid of neurons. Each neuron i of the SOM is represented by a weight (model) vector,
\[ m_i = [m_{i1}, \ldots, m_{in}]^T \qquad (1) \]
where n is the dimension of the input vectors. The set of weight vectors is called the codebook. The neurons of the map are connected to their adjacent neighbours by a neighbourhood relation, which dictates the topology of the map. Usually a rectangular or hexagonal topology is used. Immediate neighbours belong to the neighbourhood N_i of the neuron i.
In the basic SOM algorithm, the topological relations and the number of neurons are fixed from the beginning. The number of neurons may vary from a few dozen up to several thousand. It determines the granularity of the mapping, which in turn affects the accuracy and generalisation capacity of the SOM. During iterative training, the SOM forms an elastic net that folds onto the ‘cloud’ formed by the input data. The net tends to approximate the probability density of the data; the codebook vectors tend to drift to places where the data is dense, while there are only a few codebook vectors in places where the data is sparse. At each training step, one sample vector x is randomly chosen from the input data set and the distances (a measure of similarity) between the vector x and all codebook vectors are computed. The best matching unit (BMU), denoted here by c, is the map unit whose weight vector is closest to x:
\[ \| x - m_c \| = \min_i \| x - m_i \| \qquad (2) \]
After finding the BMU, the weight vectors are updated. The BMU and its topological neighbours are moved closer to the input vector in the input space. The update rule for the weight vector of unit i is
\[ m_i(t+1) = \begin{cases} m_i(t) + \alpha(t)\,[x(t) - m_i(t)], & i \in N_c(t) \\ m_i(t), & \text{otherwise} \end{cases} \qquad (3) \]
where t denotes time, N_c(t) is the non-increasing neighbourhood around the winner unit c, and 0 < \alpha(t) < 1 is a learning coefficient, a decreasing function of time.
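The training loop described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the Viscovery implementation used in this study: the rectangular grid, the linear decay of the learning coefficient and of the neighbourhood radius, the set-based neighbourhood update, and all function and parameter names (`train_som`, `alpha0`, `seed`) are assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def best_matching_unit(codebook, x):
    """Return the grid position c whose weight vector is closest to x."""
    return min(codebook, key=lambda unit: squared_distance(codebook[unit], x))


def grid_distance(u, v):
    """Distance between two units on a rectangular grid (assumed topology)."""
    return max(abs(u[0] - v[0]), abs(u[1] - v[1]))


def train_som(data, rows, cols, steps, alpha0=0.5, seed=0):
    """Train a rows x cols SOM; returns the codebook {(row, col): weights}."""
    rng = random.Random(seed)
    radius0 = max(rows, cols) / 2.0
    # Initialise every codebook vector with a copy of a random sample.
    codebook = {(r, c): list(rng.choice(data))
                for r in range(rows) for c in range(cols)}
    for t in range(steps):
        remaining = 1.0 - t / steps
        alpha = alpha0 * remaining    # decreasing learning coefficient
        radius = radius0 * remaining  # non-increasing neighbourhood N_c(t)
        x = rng.choice(data)          # one randomly chosen sample vector
        bmu = best_matching_unit(codebook, x)
        for unit, m in codebook.items():
            if grid_distance(unit, bmu) <= radius:
                # Move the BMU and its neighbours closer to x.
                for k in range(len(m)):
                    m[k] += alpha * (x[k] - m[k])
    return codebook
```

Calling `train_som(densities, 10, 20, steps=5000)` on a list of per-ward density vectors would give a 200-node map; assigning each ward to its BMU and then grouping the nodes into clusters is a separate step that this sketch does not cover.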
Machine learning algorithms, described as components of intelligent information systems, enable “... compact generalizations, inferred from large databases of recorded information, to be applied as knowledge in various practical ways-such as being embedded in automatic processes like expert systems, or used directly for communicating with human experts and for educational purposes” [14], p. 1.
JRip and J48 are the functions used in this research. The JRip classifier in WEKA is an implementation of the RIPPER rule learner created by William W. Cohen [15]. In JRip (RIPPER), classes are examined in order of increasing size and an initial set of rules for each class is generated using incremental reduced error pruning. JRip (RIPPER) proceeds by treating all the examples of a particular judgment in the training data as a class and finding a set of rules that covers all the members of that class. The process is repeated for all the classes, so that a different set of rules is generated for each class [16]. WEKA’s JRip classifier model consists of a collection of rules and some statistics about those rules (e.g., coverage/no coverage, true/false positives/negatives). Meanwhile, the J48 decision tree inducing algorithm is the WEKA implementation of C4.5, which was published by Ross Quinlan in 1993.
Beppu census data
Beppu, with 162 census wards, is a suburb in Oita Prefecture on the island of Kyushu in southern Japan. The census in Japan is carried out every 5 years in October. The Beppu census data of 2000 and 2010 are analysed to look at the population and dwelling profiles in the two censuses. Altogether 36 attributes in each census are analysed separately using SOM membership and machine learning techniques (WEKA) to study the similarities and dissimilarities among the 162 wards in the two censuses. The attributes analysed to study the population distribution and dynamics as well as the dwelling profiles within Beppu (2000 and 2010) are: ward size in ha, 21 age related attributes, i.e., 0–4, 5–9, 10–14, ..., >100 and foreigners, and 14 dwelling types, i.e., house owned, rented, apartment, for the 162 census wards. All age and dwelling population data are divided by the respective ward size to convert the data into population densities (persons per ha).
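The conversion to densities is a simple per-ward division, sketched below. The dictionary layout and the attribute names (`ward_size_ha`, `aged_0_4`, `house_owned`) are hypothetical and chosen only for illustration; they are not the column names of the census files.

```python
def to_densities(ward):
    """Convert one ward's raw counts to persons per hectare.

    `ward` maps attribute names to values and must contain the
    (hypothetical) key "ward_size_ha"; that key is kept unchanged.
    """
    size_ha = ward["ward_size_ha"]
    if size_ha <= 0:
        raise ValueError("ward size must be positive")
    densities = {"ward_size_ha": size_ha}
    for name, count in ward.items():
        if name != "ward_size_ha":
            densities[name] = count / size_ha
    return densities
```

For a made-up ward of 50 ha with 100 residents aged 0–4, the function returns a density of 2.0 persons per ha for that age group.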
The results
The overall increase in Beppu’s total population during this period (2000–2010) is 3,119. The increases and decreases in the different age groups between the 2000 and 2010 censuses of this suburb (Fig. 3) reflect the general ageing trend of Japanese national population dynamics (and the projections in Fig. 1) as well as the general global trend [17]. The age groups 20–24 and 35–39 and all of the over-60 groups of Beppu show an increase. The 60–64 age group (baby boomers) shows an increase of 1,793, the highest of all age groups in the decade. Meanwhile, there is a sharp increase of 2,433 in the foreigner population; a possible reason for this is the launch of an international university, Ritsumeikan Asia Pacific University, in Beppu in 2000.
Seven cluster SOM and machine learning results
All 36 attributes (ward size (ha), 21 age group, foreign and 14 dwelling type densities) of the 2000 and 2010 censuses were analysed using separate SOM maps to see the clustering. SOMs of 200 nodes were created for each census using the default learning method in Viscovery, a commercial software package supported by Eudaptics. The two census data sets were then analysed using the machine learning techniques (J48 and JRip) of WEKA with SOM membership as the target variable. Initially, the SOMs of both censuses are analysed at the 7-cluster and then at the 12-cluster level; 7 was identified as showing interesting patterns in the 2000 census [7].
WEKA results of 7-cluster SOM Beppu 2000
The J48 decision tree (89.5% accuracy) and JRip rules (88.3%) extracted from the 7-cluster membership with 10-fold cross-validation are as follows:
Based on the left branch of the J48 decision tree (Fig. 6), C1 (84 instances) lies in the periphery, with employed density <=22, ward size <=139.72 ha and apartment dwelling density <=26. The majority of C2 of 2000 (in the right branch of Fig. 5) has employed >22 with <=69 married. C3 of 2000 has employed >22 with >=69 married and <=16 people aged 80–84.
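Read as a decision procedure, the three branches described above amount to nested threshold tests on ward densities. The sketch below encodes only those branches, with the thresholds quoted in the text; the nesting order, the handling of the boundary value 69 (assigned to C2 here) and the parameter names are assumptions, and the remaining leaves of the J48 tree are not reproduced.

```python
def j48_branch_2000(employed, ward_size_ha, apartment, married, aged_80_84):
    """Return "C1", "C2" or "C3" for the branches described in the text.

    All arguments except ward_size_ha are densities (persons per ha).
    Returns None when a ward falls into a leaf that is not described.
    """
    if employed <= 22:
        # Left branch: peripheral wards.
        if ward_size_ha <= 139.72 and apartment <= 26:
            return "C1"
        return None
    # Right branch: employed density above 22.
    if married <= 69:
        return "C2"
    if aged_80_84 <= 16:
        return "C3"
    return None
```

Because the text speaks of the "majority" of C2, a real ward may satisfy the C2 test here and still belong to another cluster in the full tree.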
WEKA results of 7-cluster SOM Beppu 2010 census
The following are the interpretations of the 7-cluster SOM, J48 and JRip rules (Figs. 5 and 6). The major discerning factors in the seven-cluster SOM are: density of households, household single, ward size and density of the 70–74 age group. The first two are confirmed by JRip rule 5 (Fig. 4). S1 is characterised by household >20, employed <=68 and household single >7 people (63 units in Fig. 5). S2 is characterised by household <=20 in census units of <=103.646 ha (there are 70 units in that category). S3 is characterised by household >20, employed >68 and <=21 people aged 70–74.
12-cluster SOM and J48 and JRip results
Based on the 12-cluster SOM, the clustering of the 2000 and 2010 censuses seems to be the same (Fig. 8a), except for a few wards: Z30, Z32, Z39, Z103, Z128, Z157 and Z161. The other observations from this 12-cluster analysis are as follows. In both censuses, the same areas within the central city, i.e., C4 of 2000 and S5 of 2010, seem to have the highest density; however, the density increased in 2010. The density of foreigners in the same central city areas increased from 1 in the 2000 census to 4 in the 2010 census. The ward density of 64-65 increased to over 20 in 2010 from just over 15 in the 2000 census. The wards along the coastal area of Beppu, with the second highest densities, also remain the same in both censuses (C3 of 2000 and S1 of 2010); the densities of the 60–64 and 20–24 age groups have increased, showing a migration from the peripheral wards (Fig. 7 graph).
In the next step, using the SOM membership as the target variable, J48 and JRip rules (functions of the WEKA software) were extracted from the ward data, and the rules are elaborated.
The main features relating to the JRip rules extracted from the twelve-cluster SOM of the Beppu 2000 census (Figs. 8a–c and 9a–d) are as follows: C1 has employed density <=22, units of <=50.32 ha and apartment <=26 (74 wards). C2 has employed >22, 60–64 aged <=10, apt <=44, 10–14 aged <=8 and 15–19 aged <=11 (21 wards). C3 has employed >22, 60–64 aged <=10 and apt >44 (23 wards).
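A JRip model is an ordered list of rules in which the first matching rule determines the class. The three 2000 rules quoted above can be written in that form as a minimal sketch; the attribute keys are hypothetical names for the corresponding density columns, the rule order is assumed to follow the text, and the rules for the other nine clusters are omitted, so unmatched wards receive a caller-supplied default.

```python
# Each entry is (cluster label, condition over a ward's density attributes).
JRIP_RULES_2000 = [
    ("C1", lambda w: w["employed"] <= 22
                     and w["ward_size_ha"] <= 50.32
                     and w["apartment"] <= 26),
    ("C2", lambda w: w["employed"] > 22
                     and w["aged_60_64"] <= 10
                     and w["apartment"] <= 44
                     and w["aged_10_14"] <= 8
                     and w["aged_15_19"] <= 11),
    ("C3", lambda w: w["employed"] > 22
                     and w["aged_60_64"] <= 10
                     and w["apartment"] > 44),
]


def classify(ward, rules=JRIP_RULES_2000, default=None):
    """Return the label of the first rule whose condition holds for the ward."""
    for label, condition in rules:
        if condition(ward):
            return label
    return default
```

These three rules are mutually exclusive (they differ on the employed and apartment thresholds), so the assumed order does not affect the result for them.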
The main features relating to the JRip rules extracted from the twelve-cluster SOM of the Beppu 2010 census (Figs. 9a–d and 10) are as follows: S4 has <22 households in units of <=36.74 ha with <=20–24 aged population (53 wards) (* in Fig. 9c). S2 has >22 households, <=68 employed people and <=38 rented people (34 wards, 3 exceptions) (** in Fig. 9c). S1 has >22 households, <=68 employed people, >38 rented and 30–34 aged people >4 (25 wards with 1 exception) (*** in Fig. 9c).
From the rules, the main discerning features of the 2000 census are employed, apartment dwelling and age group (Figs. 8a–c and 9a–d). Meanwhile, those of the 2010 census are household, employed and age groups.
In this section, similar SOM clusters of the Beppu 2000 and 2010 censuses are analysed using graphs. Looking at the SOM cluster profile graphs and mappings (Fig. 9d), it appears that people from the immediate outskirts of Beppu, i.e., SOM 2000 C1 and C2, have moved to coastal and central city wards respectively (SOM 2010 S1 and S5 wards). In the central city wards of Beppu, age groups 60–64 and 20–24 show a significant increase in density in 2010. The former (60–64) reflects the worldwide trend in the baby boomer population in Beppu’s central city wards. Interestingly, age groups 70–74 and 50–54 show a decrease, though all other categories above 55 show an increase (Fig. 10). As far as work and dwelling types are concerned, all categories show an increase except for self-employed. Foreign students also seem to be more concentrated in the central city wards.
Wards with population change in 2010
To further study the demographics of the census wards in which population increase and decrease have been experienced, the difference in each attribute between the 2010 and 2000 censuses was first calculated (Fig. 11). Based on these values, the total population of ward Z126 increased by 1,264 (Fig. 12). Meanwhile, the total population of ward Z117 decreased by 257.
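The difference calculation can be sketched as follows. The nested-dictionary layout (`{ward: {attribute: value}}`), the attribute name `total` and the function names are assumptions made for illustration, not the format of the census files.

```python
def attribute_changes(census_2000, census_2010):
    """Return {ward: {attribute: value_2010 - value_2000}}.

    Only wards and attributes present in both censuses are compared.
    """
    changes = {}
    for ward in census_2000.keys() & census_2010.keys():
        old, new = census_2000[ward], census_2010[ward]
        changes[ward] = {attr: new[attr] - old[attr]
                         for attr in old.keys() & new.keys()}
    return changes


def extreme_wards(changes, attribute="total"):
    """Return (ward with largest increase, ward with largest decrease)."""
    ranked = sorted(changes, key=lambda ward: changes[ward][attribute])
    return ranked[-1], ranked[0]
```

Applied to the two Beppu censuses with a total population attribute, `extreme_wards` would single out the wards with the largest gain and the largest loss, which the text reports as Z126 and Z117.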
The demographics of ward Z126 show major increases in the 15–19 and 20–24 age groups, the increases in these categories being 442 and 258 respectively. Of this total increase (764), 500 were foreigners (Fig. 12). Z126 is the ward in which an international tertiary educational institution, Ritsumeikan Asia Pacific University (APU), is situated. Since 2000, the year in which this international university was launched, the University’s on-campus hostels have encouraged some of the APU students to live in this ward. In the first year, some 401 APU students were residing in this ward, of whom 380 were from overseas. Over the years, more university hostels/dormitories have been added, and by the 2010 census the number of APU students living in the ward had increased to 1,165 (of whom 880 are foreign students).
Ward Z117 showed the largest decrease (–257). The decrease has occurred in all age groups younger than 75 years. The ward’s 75–79, 80–84 and 85–89 age groups have each increased by around 10. Z117 is an area with government owned affordable rental apartments (Fig. 13). Both the buildings and the residents are old; hence, the decrease can be considered natural over the ten-year span.
Conclusions
The feature selection and knowledge extraction from the Beppu 2000 and 2010 censuses using SOM based clustering and the J48 and JRip functions of the WEKA software showed interesting patterns in the demographic changes over the ten-year period. Even though the trends in Beppu’s population reflect Japan’s decrease in total population, the results showed the migration patterns and demographics within the Beppu census wards, i.e., the age groups migrating towards or away from the central city.
It was also clear that foreign students continue to concentrate in the ward in which an international university, Ritsumeikan Asia Pacific University, is situated. The results show interesting changes in dwelling patterns in this ward since the establishment of this university in 2000.
Acknowledgments
This work was partly supported by a grant from the Japan Society for the Promotion of Science (KAKENHI No. 26420634).
