IJCNN'99 paper

LOOKING INSIDE THE ANN "BLACK BOX": CLASSIFYING INDIVIDUAL NEURONS AS OUTLIER DETECTORS Carlos López
Centro de Cálculo
Facultad de Ingeniería, CC 30, Montevideo, Uruguay

Abstract. The main body of the literature states that Artificial Neural Networks must be regarded as a "black box" without further interpretation due to the inherent difficulties for analyze the weights and bias terms. Some authors claim that ANN trained as a regression device tend to organize itself by specializing some neurons to learn the main relationships embedded in the training set, while other neurons are more concerned with the noise also existing. We suggest here a rule for identify the "noise-related" neurons, and we assume that those neurons are activated only when some unusual values (or combination of values) are present. We consider those events as candidates to hold an outlier.
The specultative nature of this statement has been tested in an experiment summarized in this paper. We used a set of ANN's trained to predict daily precipitation values for a weather station using as input the records obtained from other stations for the same date. The overall procedure was compared within a Monte Carlo framework with state-of-the-art methods for outlier detection. The results show that: a) some evidence confirms the abovementioned assumption about the different roles of the neurons b) our rule for classifying neurons as related with noise seems reliable c) ANN-based outlier detection methods based upon our rule outperformed other well established procedures.
The use of the ANN as outlier detector does not require further training, and can be easily applied. If the dataset is believed to have outliers, further refinements in the training process might include removing dubious values once detected by the method.
1. Introduction
Every dataset coming from observations is prone to have errors, which might be either systematic or at random. The systematic error cannot be detected by analyzing the data alone, so they will be ignored hereinafter. The random error is usually assumed to has a normal probability density function (pdf), and in good quality datasets this might be true. However, it is common that other errors (named outliers) are also included in the dataset and they do not follow the normal pdf. They are difficult to locate, and since they can be large enough to affect statistics, regressions, results arising from numerical models, etc. derived using the population, every possible effort should be done to remove them.
Despite outliers might affect ANN themselves, little or no effort has been reported in eliminating outliers prior to the training phase of an ANN. Two possible strategies might be suggested: design and train an ANN specially conceived to classify outliers, or re-use ANN trained as regression devices, looking for large differences between prediction and data. Training itself is a heavy task, and the lack of information about outliers (its pdf, for example) makes even more difficult to train an ANN to recognize an outlier. In addition, outliers should be very few by definition. Some authors claim that the second strategy suffers from the masking effect which appears when some outliers form a cluster, and affect the parameters to show themselves as regular points in Rⁿ. Here in this paper we suggest an intermediate approach, using the ANN trained as a regression device but analyzing its internal behaviour to make evident that there might be some of the neurons within the ANN which works as outlier detectors.
It is well known that, when training an ANN as a regression tool, we might find a balance between over and underfitting. Usually the number of neurons is adjusted as a compromise; too many neurons allows a better fit the training set, but with the risk for the ANN to loose its ability for generalization, which is the property of producing suitable answers even with inputs which were either not included in the training set, or were outside its range. On the other hand, too few neurons might render poor adjustment of the function.
Considering this situation, the literature suggest two approaches: start the training with too many neurons, analyze the ability for generalization and prune the network if the results are not satisfactory. The other choice is start with too few neurons, and add extra ones until good results with the testing set are obtained.
In both cases, the objective is to have good fitting properties: the possibility to have outliers are not considered at all. In this paper, we suggest to stop earlier the pruning procedure or to continue further adding extra neurons in order to have some neurons specialized as outlier detectors.
This paper is organized as follows: section 1 serves as an introductory background to the problem. Section 2 describes the suggested rule for analyze the internal behaviour of the ANN while in section 3 we describe some state-of-the-art methods for outlier detection. Section 4 gives a summary of the experiments conducted. Section 5 is devoted to the results and section 6 to the conclusions.
2. Understanding the role of the neurons: the suggested rule
It has been claimed that the ANN should be considered as a black box, with little or no possibility to understand its internal behaviour. Among others, one argument is the difficulties in analyzing the non-linear behaviour of the typical transfer functions when the neurons are mutually connected in a net. Other important argument is the empirical nature of the device, which organizes itself without direct participation of the trainer. Some recent efforts (Benítez et al., 1997) have attempted to learn facts from the trained ANN, which might be a promising line of research.
As has been argued before, we usually cannot make a supervised classification of outlier-no outlier data, because of the impredictible nature of the phenomena. Instead, we propose to train an ANN as a regression tool, and later analyze the internal structure of the ANN to detect/classify some of the neurons as outlier detectors.
Thus, we will restrict ourselves to the following problem: our input is composed of real valued data belonging to R^k, and we want to predict a continuous function of such input belonging to R¹. In other words, data can be arranged in a table, with k+1 columns and as many rows (events) as available, being the first column the function value to be predicted using the remaining ones. We will assume also that such k columns have entries which have been properly rescaled to have zero mean and unitary standard deviation. Our ANN architecture will be also very simple: k inputs, one or more hidden layers with suitable transfer functions, and one neuron in the output layer.
We will concentrate our analysis in the last hidden layer. The weighted average of their output stimulates the output neuron, giving the requested functional value. The weights themselves might vary significantly from neuron to neuron in the same layer, and we claim that this might be associated with a different role for each neuron. In our experiments (to be described later) we noticed that most of the weights are of similar size, but some are larger. While trying to manually pruning the network (without extra training!) we noticed that the neurons with lower weights have usually a significant stimuli, leading to a non negligible output. The high weight neuron(s) are typically non active, except in very exceptional cases. This cases were analyzed carefully, and some outliers were detected in the inputs. Summing up, after trial-and-error, we concluded that in our example, large weights can be translated as five times the smallest weight in the layer. Once selected the neurons, we analyze the distribution of its outputs in the training set, and we specify for each one an outlier region (Davies and Gather, 1992). An event is thus said to hold an outlier if at least one of the outlier-neurons has outputs belonging to the outlier region. This completes our rule for classifying neurons as outlier detectors, and the criteria required for its later use.
It should be stressed that a carefully pruned ANN might not have any neuron as the ones defined before. Thus, in this case, the strategy should be to add some extra neurons under the hypothesis that they will assume the role of explaining small details of the training set (i.e. it might led to overfitting) which might be connected with outliers.
3. Current procedures for outlier detection
In addition to the new method already described, we considered also a number of methods well known in the literature. For the sake of completeness a brief summary is included here. Assumming a multivariate normal pdf, the classical Mahalanobis distance is used as an indicator of outliers. It is defined for any set X and for any event x_iÎ X (Rousseeuw and Van Zomeren 1990) as

(1)

being T(X) usually estimated as the arithmetic mean of the data set X and C(X) estimated using the usual sample covariance matrix. The distance MD_i tell us how far the x_i is from the center of the cloud. The covariance C(X) is a positive-definite matrix, so the set of events x_iwith the same Mahalanobis distance lies on the surface of an ellipsoid with center T(X). Under some hypothesis large values for the Mahalanobis distance correspond to outliers; for normal distributions the squared Mahalanobis distance should follow a chi-square law.
Calculating C(X) and T(X) with the standard procedure suffers from the masking effect which appears when a cluster of outliers is present. C(X) and T(X) are both affected and the events with outliers no longer have a large MD_i. To overcome that problem, some other estimates of C(X) and T(X) have been proposed. The term high breakdown is coined in the statistics literature to express that the results will be unaffected even by arbitrary large errors in a fraction e of the population. The theoretical bound for e is dependent on the method, but in all cases it should be slightly less than half the population.
Among the high breakdown methods, we have considered the Minimum Covariance Determinant (MCD, see Rousseeuw and Leroy 1987), the Minimum Volume Ellipsoid (MVE, see Rousseeuw and Van Zomeren 1990) and the Hadi's method (Hadi 1992, 1994) as well. All of them produce a robust estimation of C(X) and T(X). Once they are available, the Mahalanobis distance can be calculated for all events, and they can be ordered accordingly. Those events with larger distances will be the first candidates to hold outliers. Hadi (1994) suggested that under multivariate normal hypothesis, only those events with a Mahalanobis distance larger than a preset value should be considered as outliers. The preset value depends on the number of the columns and of a confidence level. In the simulations (see below) we ignored such limit and get new candidates from the ordered list as requested. In addition, since the estimators are robust, it will be useless to re-calculate them after removing some errors.
4. Experimental setup
The reasoning behind our proposal was very speculative in nature, and some evidence should be given to support it. Before going into the details, some basic background and notation in statistics is requested.
In order to analyze a new method for outlier location usually two aspects should be considered and reported: a) its ability to detect known errors in a given dataset and b) its requirements in computer resources. For the first aspect there exist a number of widely available and well studied datasets (Rousseeuw and Leroy, 1987). They are usually very small (few dozens of events) so the methods are expected to discover all the known outliers in a single step. For a large dataset application, we found more realistic to discover the errors through a process instead of a single step operation; this will enable an optimization of the human and computational resources involved as well. In an industrial size application it might be more important to find quickly the most significant errors rather than all of the errors.
In the process, it is assumed that once a value is selected as candidate, it can be corrected without error, which in the statistical literature is known as the "perfect inspector" hypothesis. Such value cannot be chosen as a candidate again. Under such hipothesis, and given a measure of success, both a best and worst method can be defined; López (1997) suggested that any other method can be ranked in between according to a numerical index. The value 0.0 corresponds to the worst method, and the value 1.0 to the best one; larger values are associated with better methods. We refer the reader to the original reference for further details.
In the experiment, we seed the dataset with outliers, and applied a detection-correction-and further detection process for each method until the finishing criteria is satisfied. The criteria was that the inspector will not correct more than a prescribed fraction of the dataset, and we denote such fraction as effort. All methods were applied to the same outlier dataset, the index were calculated and stored, and the procedure were repeated more than 450 times within the Monte Carlo experiment.
All methods have been compared for a given effort considering three differente measures of success: a) how many errors they left in the dataset b) the root mean square of the errors (RMSE) left in the dataset and c) the mean absolute deviation error (MAD) also left in the dataset. Thus, three independent indexes can be derived in each case, being larger values of the index associated with method which are closer to the best one.
The dataset consist of 30 years of daily precipitation records obtained from 10 weather stations (WS) from Uruguay (34ºS, 54ºW). This set has about 11000 events, and all series have a strongly non-normal pdf, since around 80 per cent of the records show 0.00 mm rain. The original motivation for the ANN approach was to eliminate missing values of the dataset. Thus, for each WS, we trained an ANN to predict its daily precipitation values using as input records those available from other stations for the same date. Once available for further applications, we applied our suggested rule for identifying the "noise related" neurons, and we assume that those neurons are activated only when some unusual values (or combination of values) are present. If, for a given date, any of the 10 ANN (using each 9 out of 10 WS values as inputs) activates its noisy neurons we consider such date as candidate to hold an outlier.
The ANN were all of the same architecture, with only one hidden layer with 4 neurons using sinh as the transfer function. In turn, the output neuron uses its inverse, asinh as the transfer function.
5. Results of the Monte Carlo experiment
After 450 independent realizations, we summarize in Table 1 the results for the comparison. We considered three different Indexes (one for a different measure of success): the first simply counts the errors not found by the method, the second measures accuracy (size of the errors not found) by the MAD and the third by the RMSE. Each column named "Best of all" stands for the estimated probability of the method for being the best among those considered. The header "Average" is simply the arithmetic mean of the Index, averaged over the number of experiments.
The procedure based on our rule for outlier detection is the best method in 57.0 per cent of the cases, a significant performance. However, in average, it is very close to the MVE and MCD. Hadi´s method lies a bit behind, probably due to a higher sensitivity of the method to the non-gaussian nature of the precipitation data (Hadi, 1997, personal communication). When considering the size of the error, ANN is surpassed by MVE. In other words, the ANN is more sensitive to errors irrespective of their size, a property which makes it particularly interesting for many applications.

	Index for number of errors not found		Index for MAD		Index for RMSE
Method	Best of all	Average	Best of all	Average	Best of all	Average
MVE	20.0	59.0	61.5	76.3	72.0	65.6
MCD	23.0	59.2	0.0	75.3	0.0	63.9
HADI	0.0	50.7	0.0	46.2	0.0	33.3
ANN	57.0	59.3	38.5	76.1	28.0	64.9

Table 1. Comparative performance of four outlier detection procedures over a daily precipitation dataset. Success was measured in terms of three different indexes, which might range from 1.00 (best) to 0.00 (worst). Best of all stands for the probability of having the best performance among all methods. Average is simply the mean value of the index. All values in per cent, calculated after 450 experiments.

6. Conclusions
The interpretation of the role of the individual neurons within a complex ANN is a topic of interest for many applications, either biological or mathematical. This has been recognized as a difficult task, due in part to the empirical nature of the training phase and to the non linear behaviour of the most common transfer functions.
Our contribution here is an hypothesis which attempts to identify some neurons specialized in detecting outliers (i.e. unusual events) in an ANN trained to predict missing values in a multivariate series. Thus, the outlier detector has been trained with an unsupervised strategy. We suggest that those neurons in the last hidden layer with weights clearly higher than the others of the same layer are likely to operate as outlier detectors. They activate only in unusual cases which can be highlighted by specifying a valid range (outlier region) for its output.
This assert has been put in practice by comparing the performance of state-of-the-art linear outlier detectors against our suggested rule, in a large dataset of daily precipitation records. Within a Monte Carlo framework, we seed sinthetic outliers and compared the performance of each method in pinpointing them. The comparison is based on an index, which measures the distance to the "best" and "worst" possible method for picking the outliers. Despite the crude reasoning used, the rule for selecting the outlier detector neurons performed very well. When considering the size of the errors, the ANN was close but a little behind the other methods, which suggest that this will be a very sensitive method to locate unusual values.

7. References
Davies, L. and Gather, U., 1993, The identification of multiple outliers. Journal of the American Statistical Association,88, 423, 782-801
López, C., 1997. Quality of Geographic Data - Detection of Outliers and Imputation of Missing Values. Ph.D. Thesis, Dept. of Geodesy and Photogrammetry, Royal Institute of Technology, Stockholm, Sweden, ISSN 1400-3155
Hadi, A. S., 1994, A Modification of a Method for the detection of Outliers in Multivariate Samples. J. Royal Statist. Soc. B56, 2, 393-396
Hadi, A.. S., 1992, Identifying Multiple Outliers in Multivariate Data. J. Royal Statist. Soc. B 54, 3, 761-771
Hadi, A.. S., 1997, Personal communication
Rousseeuw, P. J. and Leroy, A., 1987, Robust Regression and Outlier Detection, New York: John Wiley
Benítez, J. M.; Castro, J. L. and Requena, I. 1997, Are Artificial Networks Black Boxes? IEEE Trans. On Neural Networks, 8, 5, 1156-1164
Rousseeuw, P. J. and Van Zomeren, B.C., 1990, Unmasking Multivariate Outliers and Leverage Points Journal of the American Statistical Association, 85, 411, 633-639