Application of GIS-Based Back Propagation Artificial Neural Networks and Logistic Regression for shallow Landslide Susceptibility Mapping in South China-Take Meijiang River Basin as an Example

RESEARCH ARTICLE
Qing-hua Gong, Jun-xiang Zhang and Jun Wang
Guangzhou Institute of Geography, Guangzhou 510070, China
Guangdong Open Laboratory of Geo-spatial Information Technology and Application, Guangzhou 510070, China
School of Tourism, Huangshan University, Huangshan 245021, China


INTRODUCTION
Landslides cause heavy casualties and large economic losses in mountainous regions, and the recent intensification of land-use change has raised the level of landslide susceptibility. China is a mountainous country, so landslide disasters occur frequently and pose a serious threat to economic development. In this paper, the application of ANN methods to landslide susceptibility mapping is studied, and both the methods and the findings are presented in detail. The paper is organized in five major parts. The first part describes the landslide characteristics of the study area. The second part determines the statistical correlations between landslide frequency and the physical parameters contributing to the initiation of landslides. The third part describes the methods, development and training of the ANN algorithms. In the fourth part, landslide susceptibility is mapped by incorporating the factors in a GIS-based ANN model, and the controlling factors are identified by the model. The fifth part describes the logistic regression model applied to the susceptibility mapping.

STUDY AREA
The ANN model was tested on Meizhou city, which is located in south-east China, with a total area of 15,876.05 km² and elevations ranging from 0 to 1559 m (Fig. 1). Mountains are widespread in the study area, and the complex geological conditions and strong tectonic activity have caused serious landslide disasters, which are a serious hazard to economic development in the mountain area. It is therefore urgent to identify the areas that are prone to landslides.
The study area is typical of the surface-layer and bedrock geology of South China. The bedrock mainly comprises Paleozoic and Mesozoic strata in the middle of the study area (Fig. 2). Igneous rocks, mainly granite, occur in the northwestern and southeastern parts of the study area and occupy 40% of the total area. The climate is typically subtropical and monsoonal, hot and humid in summer and mild and dry in winter. Heavy rains and typhoons often bring large amounts of rainfall. The average annual precipitation varied from 1067 to 1727 mm during the period 1990-2005. The groundwater in the zone is composed of pore water, bedrock fissure water and carbonate-rock karst water. During or after periods of intense rainfall, a transient perched water table develops at the interface between the colluvium and residual soils and the underlying weathered rock, which makes the natural steep slopes prone to sliding.
Landslide locations in the study area were identified by the interpretation of aerial photographs, field surveys and inventory reports. Altogether, 384 landslides were recorded in the landslide inventory, with related spatial parameters such as location, time of failure and dimensions available for each record. A typical landslide scene is shown in Fig. (3). The landslides in the study area are mainly small, shallow soil landslides, with a thickness of generally 1-2 m. Their distribution is highly regional and seasonal. Most of the landslides are located in hilly ground at an elevation of 100-500 m, where colluvium is easily deposited. In terms of layer structure, the hazards are mainly distributed in granite terrain covered with a weathered layer and in soft metamorphic rock. In terms of timing, 97.35% of the landslides occurred during the flood season.

METHOD AND DATA USED
ANNs can be a valid alternative in hazard mapping when the conditioning factors cannot be approximated by a normal distribution and are strongly correlated. An artificial neural network is a non-linear model with an input layer, hidden layers and an output layer. The landslide influencing factors are taken as the input layer and the landslide susceptibility as the output layer, and the model is then trained. After the model has been validated, its output for the forecast area can be used to assess landslide susceptibility. The trained model thus acts as an evaluation function that calculates the magnitude of the landslide susceptibility.
Landslides result from interdependent spatio-temporal processes involving hydrology, rainfall, groundwater, vegetation, soil condition, bedrock, topography, human activities and so on. These can be subdivided into four groups: geomorphological conditions, geological conditions, land use and land cover, and meteorological and hydrological conditions. The factors used in this study include both categorical and continuous variables. Both quantitative and qualitative variables were subdivided into categories defined on the basis of the influence they exert on landslide mechanics, and were normalized to the interval 0-1. Based on the index factor status and the results of a factor correlation analysis [20], eleven factors were selected. The factors and their data sources are summarized in Table 1. These parameters were transformed into a spatial vector database, and the thematic layers were converted into raster grids for use in the neural network modeling.

Geomorphological Parameters
The genesis of geological hazards is closely related to topographic conditions. The topographic factors are derived from the digital elevation model (DEM), which was used to prepare the altitude, slope, aspect and topographic relief layers. The DEM of the study area was produced from topographic maps at 1:250,000 scale by contour interpolation. A triangulation growth algorithm was used to transform the discrete points on the contours into a TIN, which was generated using 3D Analyst in the ArcScene module. The corrected TIN was then converted to grid data, and the grid data were resampled to form a grid DEM. Gross errors were checked by overlaying the contour lines interpolated from the DEM on the original contours. The DEM was defined in the Gauss-Krüger projection of the Xian 1980 coordinate system with a 5 m cell size.
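As a rough illustration of how slope angle and topographic relief can be derived from a gridded DEM, the sketch below uses plain central differences and a moving window. This is a simplified stand-in, not the GIS routine used in the study; the function names, the 3 × 3 relief window and the toy DEM are assumptions made for the example, and only the 5 m cell size comes from the text.

```python
import math

def slope_degrees(dem, cell_size=5.0):
    """Slope angle in degrees for interior cells, from central differences.

    Edge cells are left as None because they lack neighbours on one side.
    """
    rows, cols = len(dem), len(dem[0])
    slope = [[None] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            dz_dx = (dem[r][c + 1] - dem[r][c - 1]) / (2 * cell_size)
            dz_dy = (dem[r + 1][c] - dem[r - 1][c]) / (2 * cell_size)
            slope[r][c] = math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))
    return slope

def local_relief(dem, r, c, radius=1):
    """Relief = highest minus lowest elevation in a window around (r, c)."""
    rows, cols = len(dem), len(dem[0])
    window = [dem[i][j]
              for i in range(max(0, r - radius), min(rows, r + radius + 1))
              for j in range(max(0, c - radius), min(cols, c + radius + 1))]
    return max(window) - min(window)

# Toy DEM (assumed data): a plane rising 5 m per 5 m cell towards the east.
dem = [[5.0 * c for c in range(4)] for _ in range(4)]
slope = slope_degrees(dem)
relief = local_relief(dem, 1, 1)
```

On this toy plane the interior slope is 45° and the 3 × 3 relief is 10 m, which gives a quick sanity check before applying the same functions to real elevation data.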

Altitude
Altitude is useful for classifying the local relief and locating the points of maximum and minimum height within a terrain, and it is therefore one of the topographic factors affecting landslides. The altitude map of the study area was divided into five categories at 150 m intervals. The number of landslide points in each altitude group is presented in Table 2.
In the study area, 77% of the landslides are located in hilly areas below 300 m. This is because such areas favor soil deposition, and the accumulated loose layer provides the source material for landslide formation.
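The per-class landslide percentages reported in Table 2 can be obtained by binning the landslide points and counting. The sketch below shows one way to do this for altitude; the class breaks (multiples of 150 m), the labels and the sample altitudes are assumptions for illustration, not values from the study.

```python
from bisect import bisect_right

# Assumed class boundaries in metres (five classes at 150 m intervals).
BREAKS = [150, 300, 450, 600]
LABELS = ["<150", "150-300", "300-450", "450-600", ">=600"]

def altitude_class(z):
    """Return the label of the altitude class that contains z."""
    return LABELS[bisect_right(BREAKS, z)]

def class_percentages(altitudes):
    """Percentage of landslide points falling in each altitude class."""
    counts = {label: 0 for label in LABELS}
    for z in altitudes:
        counts[altitude_class(z)] += 1
    return {label: 100.0 * n / len(altitudes) for label, n in counts.items()}

# Hypothetical altitudes (m) of ten landslide points.
sample_altitudes = [80, 120, 210, 290, 310, 700, 140, 260, 95, 455]
pct = class_percentages(sample_altitudes)
```

The same counting applies to the other classified factors (slope, aspect, distance to faults and so on) once their own class breaks are defined.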

Slope
The slope angle is the main parameter in slope stability analysis and is directly related to landslide occurrence. The slope angle map of the study area was divided into five slope categories. The landslide percentage in each slope group is presented in Table 2, which indicates that most of the landslides occur at slope angles of less than 40°.

Aspect
Vegetation, precipitation and temperature differ between sunny and shady slopes, so aspect is another important factor in landslide susceptibility mapping. In this study, the aspect map of the study area (Fig. 2) was produced to show the relationship between aspect and landslides. The landslide percentage in each aspect group is presented in Table 2.

Topographic Relief
Topographic relief is the relative height difference between the top and the bottom of a slope. It provides the free surface necessary for landslide occurrence, determines the potential energy of the slope and thereby influences slope stability. Topographic relief also affects hydrological and deposition processes [21], so it reflects the change of terrain and indicates the degree of surface erosion. For these reasons, topographic relief has been selected for landslide susceptibility mapping in many studies. The topographic relief map of the study area (Fig. 2) was produced from the DEM and divided into three groups. The landslide percentage in each group is presented in Table 2, which indicates that most of the landslides occur where the relief is 0-100 m.

Distance to Fault
Geological structure has a very significant effect on slope stability [21], and active faults are the main cause of large-scale landslides. The fault map of the study area (Fig. 2) was produced from the regional geological map at 1:250,000 scale. The fractures generally strike NE-SW, though some strike NW-SE. The distance to faults was divided into six groups. The landslide percentage in each group is presented in Table 2.

Rock Type
Rock type is the foundation of landslide formation: it controls the development of a landslide and provides its source material. Because of their different physical and chemical properties, different lithologies have different effects on landslides [21]. As shown in Fig. 2, there are four rock types in the region: limestone, sandstone, shale and granite. The landslide percentage in each rock type is presented in Table 2, which shows that most of the landslides occur in the granite and sandstone areas.

Soil Types
The effects of soil on slope stability have been widely considered in landslide research [22]. Soil type affects landslide distribution through properties such as cohesion and thickness. The soils present in this region were grouped into types that are homogeneous in terms of chemical composition. The digital soil layer is shown in Fig. (4). There are four soil types in the region: waterloggogenic paddy soil, yellow soil, red soil and purple soil. The landslide percentage in each soil type is presented in Table 2, which shows that most of the landslides occur in red soil.

Land Use
Many studies have revealed the relationship between human activity and slope stability [23]. Together with the geological conditions, rainfall and human activities are very important triggering factors of landslides. In this study, the land use layer was extracted from a Landsat ETM image using an object-based classification method. There are four land use types in the region: cultivated land, garden plot, forest land and construction land. The landslide percentage in each land use type is presented in Table 2, which shows that most of the landslides occur in forest land and cultivated land.

NDVI (Normalized Difference Vegetation Index)
Vegetation influences the erosion resistance of soil in two ways: fine roots hold the soil through their network, and the vegetation increases the load on the slope [24]. The NDVI map was obtained from a Landsat TM satellite image acquired on 15 September 2009 and was divided into five categories. The landslide percentage in each category is presented in Table 2, which shows that most of the landslides occur where the NDVI is 0.6-0.7.
Table 3. Weights and thresholds between the input layer and the first hidden layer.

Distance to Rivers
In general, the closer a slope is to a river, the stronger the erosion and the higher the probability of landslide occurrence [24]. The distance to rivers is represented by the proximity to the rivers and drainage lines in the area. The river map at a scale of 1:25,000 was derived from the DEM. The river system can be divided into four grades, and the smallest watershed area is about 20 km². The distance to rivers was divided into six groups. The landslide percentage in each group is presented in Table 2.

Rainfall
A slope in a given geological setting and mechanical environment requires a certain rainfall amount, intensity or duration before damage is promoted [24], and rainfall is therefore one of the most important triggering factors of landslide disasters. Rainfall data were obtained from the meteorological department for the 12 precipitation stations distributed in the study area. The maximum rainfall intensity isoline map, at a resolution of 10 m × 10 m, was generated by interpolating the station data. The maximum rainfall intensity was divided into four groups. The landslide percentage in each group is presented in Table 2.

Architecture of Neural Network
In the present study, the traditional approach to landslide susceptibility mapping with an artificial neural network model was implemented in a GIS framework. A three-layer BP artificial neural network was set up to analyze landslide susceptibility. The interconnected network (Fig. 5) consists of one input layer, two hidden layers and one output layer, so that three sets of connection weights are trained. In this network structure there are 11 input nodes (altitude, slope, aspect, topographic relief, distance to faults, rock type, soil type, land use type, NDVI, maximum rainfall intensity and distance to drainage), and the output layer has one node reflecting the disaster situation (a value of 0 or 1).
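A minimal sketch of the forward pass through such a network is given below. Only the 11 inputs, the two hidden layers and the single output come from the text; the hidden layer sizes (8 and 4), the sigmoid activation and the random initial weights are assumptions made for illustration, since the study determined these settings in the MATLAB toolbox.

```python
import math
import random

def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    """One fully connected layer followed by a sigmoid activation."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    """Random weights and thresholds (biases) for one layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]
    return weights, biases

def forward(x, layers):
    """Propagate an input vector through all layers; return the single output."""
    for weights, biases in layers:
        x = dense(x, weights, biases)
    return x[0]

rng = random.Random(0)
# 11 inputs and 1 output follow the text; 8 and 4 hidden nodes are assumed.
LAYER_SIZES = [11, 8, 4, 1]
layers = [init_layer(a, b, rng) for a, b in zip(LAYER_SIZES, LAYER_SIZES[1:])]

sample = [0.5] * 11          # one cell with all normalized factors at 0.5
score = forward(sample, layers)
```

With untrained random weights the score has no meaning yet; back propagation would adjust the three weight sets until the output approaches 1 for landslide cells and 0 for stable cells.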

Training of ANN
According to the field investigation and survey, there have been 384 landslides in the study area; their spatial distribution is shown in Fig. (1). Before the artificial neural network program is run, the training sites must be selected. In addition to the landslide points, the study chose 380 points where no landslide disasters have occurred (non-landslide points). Using the landslide and non-landslide point locations, we extracted the values of the 11 factors and constructed a spatial database in GIS. The spatial correlations between the variables and the neural network equations were established in ArcGIS. The landslide and non-landslide point data were then partitioned into two subsets, the "training data" and the "test data": 300 landslides and 300 non-landslide points were selected for training the ANN, and the remaining 84 landslides and 80 non-landslide points were used for testing the prediction.
The most popular ANN model used in prediction and regression tasks is the multi-layer perceptron (MLP) with a feed-forward, back-error propagation (BP) learning algorithm [25]. The network trained here with the BP algorithm consists of one input layer, two hidden layers and one output layer, and was implemented using the MATLAB software package. The normalization function, the training function, the number of hidden layers and the activation function of the ANN were modeled, simulated and determined using the Neural Network Toolbox of MATLAB 7.11.0. An adaptive learning algorithm was selected because it speeds up learning and adapts itself to the data. The maximum number of training iterations was 5000 and the training goal was 0.001. The weights and thresholds of each factor estimated by the neural network are shown in Tables 3-5.
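The partition into training and test data can be sketched as follows. The counts (384 landslide points, 380 non-landslide points, 300 of each for training) come from the text; the random selection, the seed and the integer point identifiers are assumptions, because the text does not state how the training points were chosen.

```python
import random

def split_points(points, n_train, rng):
    """Shuffle a copy of the points and cut it into training and test parts."""
    shuffled = points[:]
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

rng = random.Random(42)            # fixed seed so the split is repeatable

landslide_ids = list(range(384))   # hypothetical ids of landslide points
safe_ids = list(range(380))        # hypothetical ids of non-landslide points

ls_train, ls_test = split_points(landslide_ids, 300, rng)
safe_train, safe_test = split_points(safe_ids, 300, rng)

# Label landslide points 1 and non-landslide points 0.
train_set = [(i, 1) for i in ls_train] + [(i, 0) for i in safe_train]
test_set = [(i, 1) for i in ls_test] + [(i, 0) for i in safe_test]
```

This reproduces the 600 training samples and the 84 + 80 = 164 test samples described above, with no point appearing in both subsets.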

Accuracy of the ANN Model
Assessing the performance of landslide susceptibility models is considered a crucial step in model selection [26]. Two methods were used to evaluate the performance of the produced map. The first is the accuracy rate, calculated by comparing the pixel values of the landslide inventory map with those of the final susceptibility map. The resulting accuracy rate is 0.799, which shows that the produced map gives reliable results. The second performance index is the area under the curve (AUC), which is commonly used to assess the relative quality of susceptibility maps [26-28]. The model results were therefore also validated against the landslide inventory using the AUC. An AUC value close to 1 means high accuracy, while a value close to 0.5 indicates that the model is inaccurate. The Receiver Operating Characteristic (ROC) curve was built from the model results and the AUC was calculated from it. The AUC of the model is 82.6%, as shown in Fig. (6). Therefore, the ANN model is valid for assessing landslide susceptibility.
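Both performance indices can be computed directly from the test labels and the model scores. The sketch below is illustrative only: the 0.5 decision threshold for the accuracy rate and the six sample points are assumptions, and the AUC is computed through its rank interpretation (the probability that a randomly chosen landslide point scores higher than a randomly chosen non-landslide point) rather than by integrating a plotted ROC curve.

```python
def accuracy(labels, scores, threshold=0.5):
    """Share of points whose thresholded score matches the observed label."""
    hits = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return hits / len(labels)

def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical test points: 1 = landslide, 0 = non-landslide.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]

acc = accuracy(labels, scores)
area = auc(labels, scores)
```

For these six points the landslide scores win 8 of the 9 pairwise comparisons, so the AUC is 8/9, while 4 of the 6 points are classified correctly at the assumed threshold.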

Analysis on the Main Control Factor
Landslide susceptibility is a function of a variety of variables: altitude, slope angle, aspect, topographic relief, distance to faults, rock type, soil type, land cover, maximum rainfall intensity, NDVI and distance to rivers. To further verify the model and determine the sensitivity of slope stability to each factor, we set each variable in turn to 1, with all the others set to 0, and input the resulting vector into the model (although such a single-factor effect does not exist in reality). Taking slope as an example, the slope entry of the input vector is set to 1 and all the other entries to 0, and the model output is 0.866. The other factors are treated in the same manner. The single-factor results of the neural network were then calculated and normalized. Land use type showed the highest value, 0.294, followed by slope with 0.288 and rock type with 0.225. The result shows that slope, rock type and land use type are the main controlling factors in the disaster formation process. The internal cause of landslide formation in South China lies mainly in the topography and geology, while rainfall is the triggering factor. The model results thus conform to the disaster mechanism in South China.
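The single-factor test described above can be sketched as a loop over one-hot input vectors. The `toy_model` below is a hypothetical stand-in for the trained network, and normalizing the outputs so that they sum to 1 is an assumption, since the text does not state which normalization was used.

```python
import math

FACTORS = ["altitude", "slope", "aspect", "topographic relief",
           "distance to faults", "rock type", "soil type", "land use",
           "NDVI", "maximum rainfall intensity", "distance to rivers"]

def single_factor_scores(model, n_factors):
    """Model output when one factor is 1 and all the others are 0."""
    scores = []
    for k in range(n_factors):
        x = [0.0] * n_factors
        x[k] = 1.0
        scores.append(model(x))
    return scores

def normalize(scores):
    """Rescale the scores so that they sum to 1 (assumed normalization)."""
    total = sum(scores)
    return [s / total for s in scores]

def toy_model(x):
    """Hypothetical stand-in for the trained ANN: a single logistic unit."""
    weights = [0.1 * (k + 1) for k in range(len(x))]
    z = sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

raw = single_factor_scores(toy_model, len(FACTORS))
shares = normalize(raw)
ranking = sorted(zip(FACTORS, shares), key=lambda t: t[1], reverse=True)
```

Replacing `toy_model` with the trained network would yield a ranking of the kind reported in the text, with the largest shares marking the main controlling factors.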

LOGISTIC REGRESSION ANALYSES
In recent years, logistic regression analysis has become one of the most popular multivariate statistical methods, and many studies have been published on landslide assessment using logistic regression [29,30]. Logistic regression predicts the probability of occurrence of an event through the multiple regression relationship between a dependent variable and several independent variables. Here the dependent variable Y is dichotomous: the values Y = 1 and Y = 0 represent landslide and no landslide, respectively. The independent variables are X_1, X_2, ..., X_n. The conditional probability of a landslide occurring given the independent variables is P = P(Y = 1 | X_1, X_2, ..., X_n). The logistic regression model can then be expressed as:

$$Z_i = a_0 + \sum_{j=1}^{n} a_j X_{ij}, \qquad P_i = \frac{1}{1 + e^{-Z_i}}$$

where Z_i is an intermediate variable, a_0 is the regression constant, a_j are the regression coefficients (j = 1, 2, ..., n), X_ij is the value of the j-th variable in unit i, and P_i is the predicted probability of landslide occurrence in unit i.
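The logistic model described above can be sketched as a short function, assuming the standard form in which the intermediate variable Z is a linear combination of the factors and P = 1 / (1 + exp(-Z)). The coefficient values used here are invented for illustration and are not those of Table 6.

```python
import math

def logistic_probability(a0, coeffs, x):
    """Probability of landslide occurrence for one unit.

    z = a0 + sum_j coeffs[j] * x[j];  p = 1 / (1 + exp(-z)).
    """
    z = a0 + sum(a * v for a, v in zip(coeffs, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical constant and coefficients for two normalized factors.
a0 = -1.0
coeffs = [2.0, 0.5]

# With all coefficients at zero the model is uninformative: p = 0.5.
p_zero = logistic_probability(0.0, [0.0, 0.0], [0.3, 0.7])

# z = -1.0 + 2.0 * 0.5 + 0.5 * 1.0 = 0.5
p = logistic_probability(a0, coeffs, [0.5, 1.0])
```

Because the logistic function is monotonic, a positive coefficient raises the predicted probability as its factor increases and a negative one lowers it, which is how the fitted coefficients are usually read.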
The first stage in the application of logistic regression is the production of the data matrix. For the continuous variables (altitude, slope angle, distance to faults, topographic relief, NDVI, maximum rainfall intensity and distance to rivers), a histogram of the frequency distribution was drawn, and the data were normalized to the range [0, 1] according to that frequency distribution. Since slope aspect, rock type, soil type and land use type are categorical data, they were expressed in binary format according to their class definitions. The dependent variable is also expressed in binary format, indicating the presence (1) or absence (0) of a landslide in a cell.
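A minimal sketch of building such a data matrix is shown below. Plain min-max scaling is used here as a simple stand-in for the histogram-based normalization, and the three sample cells are invented for the example; only the four rock types and the binary coding of categorical factors come from the text.

```python
def min_max(values):
    """Scale values linearly to [0, 1] (stand-in for the histogram-based step)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(value, categories):
    """Binary (dummy) coding of a categorical value."""
    return [1 if value == c else 0 for c in categories]

ROCK_TYPES = ["limestone", "sandstone", "shale", "granite"]

# Hypothetical cells: altitude in metres, rock type, landslide presence.
cells = [
    {"altitude": 120.0, "rock": "granite", "landslide": 1},
    {"altitude": 300.0, "rock": "shale", "landslide": 0},
    {"altitude": 480.0, "rock": "sandstone", "landslide": 0},
]

alt = min_max([c["altitude"] for c in cells])
# Each row: normalized altitude, four rock-type dummies, dependent variable.
rows = [[a] + one_hot(c["rock"], ROCK_TYPES) + [c["landslide"]]
        for a, c in zip(alt, cells)]
```

Each row then holds the predictors followed by the 0/1 dependent variable, which is the layout a statistics package expects for fitting the regression.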
The logistic regression analysis was carried out in SPSS. The regression gave an average correct classification percentage of 75.4%, and the Hosmer and Lemeshow test gave a significance (Sig) of less than 0.05. The logistic regression model of the landslide risk factors is shown in Table 6.

RESULT
The data used are shown in Table 2, and the various GIS data layers are illustrated in Fig. (3). For convenience of computation, all input layers were converted into raster layers. The value of each cell was then obtained with the raster calculator on the basis of the LR model and the BP model established above. Finally, the calculated results were reclassified by value into three categories: low-sensitivity zone, medium-sensitivity zone and high-sensitivity zone (Fig. 7).
The artificial neural network model and the logistic regression model can be used to verify each other. According to the results, landslides are distributed across all sensitivity levels, and the higher the sensitivity level, the larger the proportion of landslides it contains. The high-sensitivity zone contains 70.8% of the landslides in the BP model and 55.21% in the LR model, so the prediction of the ANN model in the high-sensitivity zone is more accurate than that of the logistic regression model. However, this does not mean that the artificial neural network model is better than the logistic regression model in other geological environments.
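The reclassification and the share of landslides per zone can be sketched as follows. The cut-off values of 1/3 and 2/3 and the five sample scores are assumptions for illustration; the text does not give the thresholds used to delimit the three zones.

```python
def classify(score, low_cut=1.0 / 3.0, high_cut=2.0 / 3.0):
    """Assign a susceptibility score to a zone (assumed cut-off values)."""
    if score < low_cut:
        return "low"
    if score < high_cut:
        return "medium"
    return "high"

def share_by_zone(scores):
    """Percentage of landslide points falling in each zone."""
    counts = {"low": 0, "medium": 0, "high": 0}
    for s in scores:
        counts[classify(s)] += 1
    return {zone: 100.0 * n / len(scores) for zone, n in counts.items()}

# Hypothetical model scores at five mapped landslide locations.
landslide_scores = [0.9, 0.8, 0.7, 0.5, 0.2]
zone_shares = share_by_zone(landslide_scores)
```

Running the same count on the scores of each model at the inventoried landslide locations gives figures comparable to the percentages quoted for the BP and LR models.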

DISCUSSION
It is difficult to obtain a complete and detailed shallow landslide map in the short term, because the landslides are small and widely distributed. Landslide susceptibility assessment therefore has to be carried out with incomplete records. We employed an ANN model and logistic regression to analyze landslide susceptibility and selected eleven factors, including slope and land use, to establish the susceptibility evaluation index system. The results show that the ANN model is feasible for susceptibility mapping. The susceptibility zoning map is in line with the actual conditions of the area and can play an important role in landslide hazard and risk assessment.
In South China, shallow landslides are commonly triggered by high pore-water pressure resulting from high-intensity or short-duration rainfall. Shallow landsliding is the main deformation mode of slopes in the red soil hilly region. Shallow landslides are preferentially distributed on slopes where high-permeability soils overlie low-permeability soils (Table 7), and the sliding surfaces are roughly parallel to the ground surface. The shallow landslides are highly correlated with landform. The results conform to the landslide characteristics of the red hilly region in South China. They show that the high-sensitivity zones in the study area are located in hilly ground with an elevation of 150 to 300 m, a northern aspect, a slope gradient of 5-10° and a topographic relief of 0-40 m.
With the artificial neural network and logistic regression models, and the relative importance and weighting of the factors calculated from them, landslide hazard maps are of great help to planners and engineers when they choose suitable locations for development activities. These results can be used as basic data to assist slope management and land-use planning. The models used in the study are valid for generalized planning and assessment purposes, although they may be less useful at the site-specific scale, where local geological and geographic heterogeneities may prevail. To make the model more general, more landslide data are needed.

CONCLUSION
In this study, the neural network model and its cross-application with logistic regression were used successfully. Using the ANN model established in MATLAB, the landslide susceptibility map was created and verified. The verification showed a prediction accuracy (AUC) of 82.6%, which is a high value. The conclusions basically match the actual situation, which shows that it is feasible to use a BP network model based on the MATLAB Neural Network Toolbox for landslide susceptibility analysis. The results show that slope, rock type and land use type were the main controlling factors in the disaster formation process, which conforms to the disaster mechanism in South China.

CONSENT FOR PUBLICATION
Not applicable.