Classification of Binary Document Images into Textual or Non-textual Data Blocks
Using Neural Network Models

D. X. Le
G. Thoma
H. Wechsler

Keywords:

Back-propagation, Document processing, Probabilistic networks, Radial basis function, Self-organizing feature maps, Textual classification


Abstract

This paper describes a new method for the classification of binary document images into textual or non-textual data blocks using neural network models. Binary document images are first segmented into blocks using the constrained run length algorithm (CRLA). The resulting blocks are labeled using the component labeling procedure. The features of each block, calculated from the coordinates of the block's extremities, are then fed into the input layer of a neural network for classification. Four neural networks were considered: back-propagation (BP), radial basis functions (RBF), the probabilistic neural network (PNN), and Kohonen's self-organizing feature maps (SOFM). The performance and behavior of these neural network models are analyzed and compared in terms of training times, memory requirements, and classification accuracy. The experiments, carried out on a variety of medical journals, show the feasibility of using the neural network approach for textual block classification and indicate that, in terms of both accuracy and training time, the RBF should be preferred.

1. INTRODUCTION AND BACKGROUND

There is widespread and increasing interest in converting paper-based document information to electronic format for computer storage, access, and processing. The conversion is routinely done for records management, automated document delivery, document archiving, journal distribution, and other applications. The Lister Hill National Center for Biomedical Communications, a research and development division of the National Library of Medicine (NLM), is conducting research in the area of binary document processing, including scanning, displaying, quality assurance, image processing, archiving to optical disk, indexing, accessing, retrieving, printing and faxing, text recognition, and image and text database creation. This paper describes one component of this ongoing effort at NLM: the classification of binary document images into blocks of textual and non-textual data using neural network models. A binary image refers to a two-dimensional binary function f(i, j), whose value is either 0 or 1, where i and j denote the vertical and horizontal coordinates of a pixel, respectively, and where the coordinate origin is located at the top left corner of the image. A block in a binary document image is a connected component (blob), defined here as a collection of black runs that are 8-connected(5). Textual data usually consists of letters, words, or sentences, while examples of non-textual data include graphics, forms, line art, and dithered images (binary images in which gray shades are simulated by grouping dots into clusters). Information about the data type (textual or non-textual) of a block in a binary document image is important for any automated system for printed documents. Block data type information is useful for optical character recognition (OCR) systems because it helps to improve the accuracy rate and to speed up recognition by handling only the textual data blocks.
It is also useful for systems that perform image graphics enhancement, and for systems that need to divide the contents of a printed document into smaller parts such as a headline area, text line area, graphic area, footnotes area, and background(1).

Several techniques to classify data blocks as textual or non-textual(7,11,12,15) have been reported. As an example, Wahl, Wong and Casey(15) used a linear adaptive classification technique based on block features to classify data blocks into different classes. This classification algorithm required the selection of a set of 15 constraint parameters. The authors report that the choice of these constraints is not critical; however, it is not clear how one should select these parameter values. Pavlidis and Zhou(12) proposed another classification technique based on the strength and variation of the correlation of adjacent scanlines. They argued that for text the correlation of adjacent scanlines is very high and changes rapidly, while for non-text it behaves in the opposite way. This technique, however, had problems with pages where the "white streams" flow irregularly, and it also required tuning several parameter values. While these two techniques were reported to perform well on a variety of printed documents, no information was provided on the size of the experiments, and performance statistics are lacking. O'Gorman(11) recently proposed a method to locate text lines and text blocks in an image based on k-nearest-neighbor clustering of page components. Using this bottom-up approach, k-nearest-neighbor pairs between components are formed and plotted on nearest-neighbor angle and distance histograms. Text orientation, character spacing, and text line spacing are estimated from peaks found in these histograms and are then used to form text lines and text blocks. The method is independent of page skew and text spacing and does not require prior knowledge of character size or line spacing. It is limited, however, because it deals only with the textual parts of the page image. Jain and Bhattacharjee(7) suggested an interesting approach based on the use of Gabor filters for textual classification.
Despite the popularity and apparent usefulness of these methods, it is not obvious that they would carry over to binary data, or that the associated costs would remain affordable for real-time document analysis. It is also not straightforward how the best channels (filters) for optimal clustering would be determined automatically at runtime. As our projects could involve large numbers of documents with widely varying characteristics, it became apparent that neural networks could provide a robust and non-parametric(9,16) classification approach. We describe in this paper a new approach for the classification of binary document images into textual or non-textual data blocks using neural network models. Four neural networks were considered: the BP (back-propagation), the RBF (radial basis functions), the PNN (probabilistic neural network), and the SOFM (self-organizing feature maps). The BP was chosen for its simplicity and popularity. However, as the BP lacks locality, we decided to also include the RBF as a network model with local characteristics(2,6). The PNN was selected for its ability to estimate probability density functions (pdfs) from training samples, because its training is almost instantaneous, and because it is easy to adapt to new incoming patterns. Lastly, the SOFM was chosen for its unsupervised learning characteristics as well as its ability to self-organize input patterns. The performance and behavior of these neural network models are analyzed and then compared in terms of training times, memory requirements, and classification accuracy. The performance of the new approach has been evaluated on a sample of 50 page images from medical journals, and the experiments show the feasibility of the neural network approach.

2. SYSTEM OVERVIEW

The process consists of four steps: block run lengths detection, block labeling, block features calculation, and neural network learning and classification. These steps are briefly described in the following paragraphs.

  • Step 1 is the Block Run Lengths Detection. This step segments the entire binary document image into blocks using the CRLA developed by Wahl, Wong and Casey(15).
  • Step 2 is the Block Labeling. This step labels segmented blocks using the component labeling procedure. After being labeled, a block can be distinguished from another one based on its label.
  • Step 3 is the Block Features Calculation. This step calculates the features needed for classification of each labeled block.
  • Step 4 is the Neural Network Training and Classification. This step selects training and testing data sets, models the networks, and learns and generalizes using the cross-validation method. After being trained, the neural network classifies each block of a binary document image as textual or non-textual data based on its features.

In the subsequent sections, we present a detailed description of our system, the results of experiments carried out on test data taken from medical journals, and a comparison among four neural network models in terms of training times, memory requirements, and classification accuracy.

2.1 Block Run Lengths Detection

A binary document image can be segmented into blocks using the CRLA developed by Wahl, Wong, and Casey(15). This method is reviewed below. Assume binary 0's and 1's represent white pixels and black pixels, respectively. Given a predefined constraint C, the CRLA replaces 0's by 1's if the number of adjacent 0's is less than or equal to C. For example, with C = 2 the sequence:

	0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 0 1 0 1 0 0 0 0

would become the sequence

	1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 

This bitstring operation is applied row-by-row and column-by-column to the bitmap of a binary document image. The horizontal processing with constraint Chor creates a row-by-row intermediate bitmap, and the vertical processing with constraint Cver creates a column-by-column intermediate bitmap. These two intermediate bitmaps are combined by a logical AND operation. As pointed out by Wahl, Wong, and Casey, some black regions representing text lines are interrupted by small gaps that should be removed. Therefore, as a final processing step, an additional horizontal smoothing pass with constraint Csm is applied to the combined bitmap to produce the desired final bitmap. Wahl, Wong, and Casey claim that the choice of the constraints Chor, Cver and Csm is not critical. They recommended that the values of Chor and Cver be set to a number of pixels covering the length of long words, whereas Csm should cover a few character widths. They found that Chor = 300 pixels, Cver = 500 pixels and Csm = 30 pixels give fairly good results on a variety of documents. The input to the block run lengths detection step is the bitmap of an original binary document image of size 8.5 x 11 inches scanned at 200 dots per inch (dpi). At this resolution, long words correspond approximately to Chor = 300 pixels and Cver = 500 pixels, and several character widths correspond to Csm = 30 pixels. The horizontal processing of the CRLA is therefore applied with constraint Chor = 300 pixels to the original image and the vertical processing with constraint Cver = 500 pixels; the desired final bitmap, the output of this step, is obtained by applying the horizontal smoothing of the CRLA with constraint Csm = 30 pixels to the combined bitmap, and it fits the expected segmentation well.
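The CRLA passes described above can be sketched as follows (a minimal illustration in Python; the function names and the list-of-lists bitmap representation are our own, not the authors' implementation):

```python
def crla_1d(bits, c):
    """One CRLA pass: replace every run of 0's of length <= c with 1's."""
    out = list(bits)
    i, n = 0, len(out)
    while i < n:
        if out[i] == 0:
            j = i
            while j < n and out[j] == 0:
                j += 1
            if j - i <= c:                  # constrained run: fill the gap
                out[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return out

def crla_segment(img, c_hor=300, c_ver=500, c_sm=30):
    """Full step: horizontal pass ANDed with vertical pass, then smoothing."""
    rows = [crla_1d(r, c_hor) for r in img]
    cols = [crla_1d(list(col), c_ver) for col in zip(*img)]
    vert = [list(r) for r in zip(*cols)]    # transpose back to row order
    anded = [[a & b for a, b in zip(r1, r2)] for r1, r2 in zip(rows, vert)]
    return [crla_1d(r, c_sm) for r in anded]

# The paper's example with C = 2:
seq = [0,0,1,1,1,0,0,1,0,0,0,1,1,1,0,0,1,0,1,0,0,0,0]
print(crla_1d(seq, 2))
# -> [1,1,1,1,1,1,1,1,0,0,0,1,1,1,1,1,1,1,1,0,0,0,0]
```

Note that this simple rule fills every short run of 0's, including the leading one, which matches the worked example above.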

2.2 Block Labeling

A block is defined as a connected component and represents a collection of black runs that are 8-connected(5). Block labeling is similar to component labeling; its purpose is to segment a binary image by labeling its blocks. Each black run is assigned an integer number called a label, and connected black runs must share the same label. A block is created from connected black runs, and its label is the same as that of its black runs; blocks are then distinguished by their labels. The component labeling procedure identifies black runs for each row, analyzes the connectivity of black runs between the current row and the previous row, assigns labels to black runs of the current row and/or updates labels of black runs of the previous rows, and defines the blocks of a binary image in terms of their coordinates.
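The run-based procedure can be sketched with a union-find over black runs (an illustrative reconstruction; the paper gives no code, and `label_blocks` and its bookkeeping are our own):

```python
def label_blocks(img):
    """Label 8-connected components built from black (1) runs, row by row.
    Returns {label: (min_top_row, min_left_column,
                     max_bottom_row, max_right_column)}."""
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]       # path compression
            x = parent[x]
        return x

    def runs_of(row):
        """(start, end) inclusive column spans of consecutive 1's."""
        spans, j = [], 0
        while j < len(row):
            if row[j] == 1:
                k = j
                while k < len(row) and row[k] == 1:
                    k += 1
                spans.append((j, k - 1))
                j = k
            else:
                j += 1
        return spans

    next_label, prev, all_runs = 0, [], []
    for i, row in enumerate(img):
        cur = []
        for s, e in runs_of(row):
            lab = None
            for ps, pe, pl in prev:
                if s <= pe + 1 and ps <= e + 1:  # 8-connected overlap
                    if lab is None:
                        lab = find(pl)
                    else:
                        parent[find(pl)] = lab   # merge conflicting labels
            if lab is None:                      # a new block starts here
                lab = next_label
                parent[lab] = lab
                next_label += 1
            cur.append((s, e, lab))
            all_runs.append((i, s, e, lab))
        prev = cur

    boxes = {}                                   # extremities per root label
    for i, s, e, lab in all_runs:
        r = find(lab)
        t, l, b, rt = boxes.get(r, (i, s, i, e))
        boxes[r] = (min(t, i), min(l, s), max(b, i), max(rt, e))
    return boxes
```

The returned extremities are exactly the coordinates used by the feature calculation in the next step.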

2.3 Block Features Calculation

In this step, the features of each block in a binary document image are calculated based on the coordinates of the block's extremities. These coordinates provide information about the approximate size of a block in the binary image. Each block has its own extreme coordinates with respect to the origin of the binary image f(i, j): min_left_column, max_right_column, min_top_row and max_bottom_row. In practice, these extreme coordinates are obtained as a byproduct of the block labeling step. Their definitions are as follows:

min_left_column: the column value of the left-most pixel of a block
max_right_column: the column value of the right-most pixel of a block
min_top_row: the row value of the top-most pixel of a block
max_bottom_row: the row value of the bottom-most pixel of a block

In this paper, seven block features H, E, S, R, HR, ER and SR (defined later) are used for classification. The first four block features (the original pattern inputs) are similar to those defined by Wahl, Wong, and Casey, while the last three are combinations of the first four, introduced to enhance the input representation and to eliminate the need for more hidden layers(17). The augmented input corresponds to a higher-order neural network and should enhance performance on linearly nonseparable discrimination tasks. A preliminary experiment with the back-propagation neural network using only the first four original pattern inputs gave unsatisfactory results; as a consequence, the seven block features were proposed to increase classification performance. The first four block features were selected based on observations with respect to the size, the total number of black pixels, and the total number of white-black or black-white transitions of a block. Generally, the height of a textual block is much smaller than that of a non-textual block. The ratio of the total number of black pixels to the area of a block is also much smaller for a textual data block than for a non-textual data block. On the other hand, the ratio of the total number of white-black or black-white transitions to the area of a block is higher for a textual data block than for a non-textual data block.

Let Dx and Dy be the width and the height of each block.
Dx = max_right_column - min_left_column + 1
Dy = max_bottom_row - min_top_row + 1
Let BC be the total number of black pixels of each block,
DC be the total number of black pixels of the original data within each block,
and TC be the number of white-black (or black-white) transitions of the original data within each block.
Here, original data means data before applying the CRLA in step 1. The definitions of the seven block features are given next.

  1. The height H of each block
    H = Dy
  2. The ratio E between the width and the height of each block
    E = Dx / Dy
  3. The ratio S of the number of block pixels to the block area
    S = BC / (Dx * Dy)
  4. The mean horizontal black run length R of the original data within each block
    R = DC / TC
  5. The product HR of the block height and the ratio R
    HR = H * R
  6. The product ER of the ratio E and the ratio R
    ER = E * R
  7. The product SR of the ratio S and the ratio R
    SR = S * R

The seven block features are normalized to have unit length and are adjusted to be within the [-0.5, 0.5] range.
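As an illustrative sketch (our own helper, not the authors' code; we take the height H to be Dy, matching the textual description of H as the block height), the seven features before normalization can be computed as:

```python
def block_features(box, bc, dc, tc):
    """Seven block features of one labeled block.
    box = (min_top_row, min_left_column, max_bottom_row, max_right_column);
    bc  = black pixels of the block (after CRLA);
    dc  = black pixels of the original data within the block;
    tc  = white-black transitions of the original data within the block."""
    top, left, bottom, right = box
    dx = right - left + 1              # block width
    dy = bottom - top + 1              # block height
    h = dy                             # 1. height H
    e = dx / dy                        # 2. width-to-height ratio E
    s = bc / (dx * dy)                 # 3. black-pixel density S
    r = dc / tc                        # 4. mean black run length R
    return [h, e, s, r, h * r, e * r, s * r]   # plus HR, ER, SR
```

For example, a block spanning rows 0-9 and columns 0-19 with bc = 100, dc = 80 and tc = 40 yields [10, 2.0, 0.5, 2.0, 20.0, 4.0, 1.0].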

2.4 Neural Network Training and Testing Methodology

Before any of the four neural network models can be used as a pattern classifier, its structure has to be designed and trained. We discuss in this section the selection of the training and testing data sets as well as the method used to train and test the neural networks. The detailed network structure design for each of the four neural network models is discussed in section 3.

2.4.1 Training and Testing Data Sets

Fifty binary document images selected from 12 different medical journals are used to create the training and testing data sets. These images are 8.5 x 11 inches in size and were scanned at 200 dpi resolution. They cover a wide variety of layouts such as mixed fonts, graphics, forms, line art and dithered images. These fifty binary document images consist of 2304 segmented blocks of which 129 blocks are non-textual data and 2175 blocks are textual data. The 129 non-textual data blocks and 2175 textual data blocks are divided randomly into (1) 85 non-textual data blocks and 1450 textual data blocks for the training data set and (2) 44 non-textual data blocks and 725 textual data blocks for the testing data set. The training data set is used to design the neural network while the testing data set is used to estimate the network's classification accuracy.

2.4.2 Cross-Validation Method

For purposes of generalization, the cross-validation (CV) technique(6,16) is used in this paper by splitting the training data set into two sets called a CV-train set and a CV-test set. The training data set is randomly divided into five data groups of which four data groups create a CV-train set and one remaining data group is considered as a CV-test set. As a result, there are five pairs of a CV-train set and a CV-test set. Each CV-train set consists of 1228 blocks (68 non-textual data blocks and 1160 textual data blocks) and each CV-test set consists of 307 blocks (17 non-textual data blocks and 290 textual data blocks). Each neural network is trained and tested with each pair of a CV-train set and a CV-test set. The modified weights corresponding to the winning pair of a CV-train set and a CV-test set, the one yielding the highest classification accuracy, are chosen to be the final weights for that particular neural network.
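The split described above can be sketched as follows (illustrative only; the function name is our own, and each class is divided into five groups separately so that every fold keeps the paper's fixed class mix):

```python
import random

def five_fold_pairs(textual, non_textual, seed=0):
    """Build five (CV-train, CV-test) pairs, splitting each class into five
    groups so that every fold keeps the same class proportions."""
    rng = random.Random(seed)

    def folds_of(items):
        items = list(items)
        rng.shuffle(items)
        return [items[k::5] for k in range(5)]

    t_folds, n_folds = folds_of(textual), folds_of(non_textual)
    pairs = []
    for k in range(5):                       # fold k serves as the CV-test set
        cv_test = t_folds[k] + n_folds[k]
        cv_train = [x for j in range(5) if j != k
                    for x in t_folds[j] + n_folds[j]]
        pairs.append((cv_train, cv_test))
    return pairs
```

Applied to the 1450 textual and 85 non-textual training blocks, this reproduces the 1228/307 CV-train/CV-test sizes quoted above.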

3. Neural Network Architectures and Performance Evaluation

Except for the SOFM, all the neural networks considered in this paper are multi-layer networks with one input layer, hidden layers, and an output layer. Note that we do not count the input nodes as a layer in this paper. The number of input nodes corresponds to the number of block features, and the output layer consists of one binary output discriminating between textual and non-textual data. The number of nodes in the hidden layers, however, depends upon the particular neural network chosen. These networks are designed by supervised learning, or learning with a teacher. The SOFM neural network, on the other hand, consists of one input layer and one square output matrix layer, and is designed by unsupervised learning.

In supervised learning, the network is presented with a set of input and desired output patterns. The resulting (actual) outputs are compared with the desired outputs and their differences are used to adjust the network weights. In unsupervised learning, the network is presented with only input patterns. Without desired responses, the network has no knowledge about whether or not its resulting outputs are correct. As a result, the network has to self-organize (cluster) the data into similar classes by adjusting its weights so that the clustering improves. In the following subsections, each neural network is discussed in detail with respect to its network architecture, the training and testing strategy employed, and experimental results. In this work, a DELL 486D/50 Personal Computer is used to train and test each neural network.

3.1 Back-Propagation Neural Network

3.1.1 Overview

Back-propagation (BP)(6,8,17) is a multi-layer neural network using sigmoidal activation functions. The network is made up of an input layer, hidden layers, and an output layer; nodes in each layer are fully connected to those in the layers above and below, and each connection is associated with a synaptic weight.

The BP network is trained by supervised learning, using a gradient descent method which is based on the least squared error between the desired and the actual response of the network. During the training process, the synaptic weights are adjusted in such a way that the actual response gets increasingly closer to the desired response. When an input pattern vector of a training data set is presented to the input layer, the sigmoidal activation values (layers' responses) are propagated forward through the network from the input layer to the output layer. The network's actual response is obtained at the network output layer. The difference between the actual and the desired response is then used to modify the network's weights backward through the network from the output layer to the input layer using a gradient descent method.

3.1.2 Network Architecture

The two-layer BP network is implemented with an input layer (the seven block features H, E, S, R, HR, ER and SR), an output layer (one binary output for non-textual or textual data), and a single hidden layer whose number of nodes is determined below.

As pointed out by Hush and Horne(6), for generalization purposes the number of training samples should be approximately ten times the number of weights in a multi-layer network. For a two-layer BP network with I inputs, O outputs, A hidden nodes and P training samples, the number of weights is (I+1) * A + (A+1) * O, so the hidden layer size A is obtained from 10 * [(I+1) * A + (A+1) * O] = P. For 7 inputs, 1 output and 1228 training samples, this formula gives A of about 14. Therefore, the two-layer BP network architecture is 7-14-1. Each input vector of the training data set is presented to the network many times, and the weights are adjusted on each presentation until the network stops improving. Two learning factors that significantly affect convergence speed, and help avoid local minima, are the learning rate and the momentum. The learning rate determines the size of each weight adjustment; even though a small learning rate guarantees a true gradient descent(17), it slows down convergence. The momentum determines the fraction of the previous weight adjustment that is added to the current weight adjustment; its purpose is to accelerate convergence. During training, the learning rate was adjusted to bring the network out of either local minima (where the network has converged but its output error is still large) or its no-gain mode (where the output error does not change, or changes very little, over many cycles). The learning rate ranged from 0.001 to 0.1, and the momentum was 0.9.
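The sizing rule and a BP training loop with momentum might look like the following (a minimal sketch with our own function names, not the authors' code; NumPy is assumed):

```python
import numpy as np

def hidden_nodes(i, o, p):
    """Solve 10 * [(i+1)*a + (a+1)*o] = p for the hidden layer size a."""
    return round((p / 10 - o) / (i + 1 + o))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_bp(X, y, hidden, lr=0.1, momentum=0.9, epochs=5000, seed=0):
    """Two-layer back-propagation: gradient descent on squared error,
    with a momentum term added to each weight adjustment."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 0.5, (X.shape[1] + 1, hidden))   # +1 row for bias
    W2 = rng.normal(0, 0.5, (hidden + 1, 1))
    dW1, dW2 = np.zeros_like(W1), np.zeros_like(W2)
    Xb = np.hstack([X, np.ones((len(X), 1))])
    for _ in range(epochs):
        h = sigmoid(Xb @ W1)                            # forward pass
        hb = np.hstack([h, np.ones((len(h), 1))])
        out = sigmoid(hb @ W2)
        d_out = (y - out) * out * (1 - out)             # output delta
        d_hid = (d_out @ W2[:-1].T) * h * (1 - h)       # hidden delta
        dW2 = lr * hb.T @ d_out + momentum * dW2        # momentum update
        dW1 = lr * Xb.T @ d_hid + momentum * dW1
        W2 += dW2
        W1 += dW1
    return W1, W2

def bp_predict(X, W1, W2):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    hb = np.hstack([sigmoid(Xb @ W1), np.ones((len(X), 1))])
    return sigmoid(hb @ W2)

# hidden_nodes(7, 1, 1228) -> 14, giving the paper's 7-14-1 architecture
```

The adaptive learning-rate schedule described above is omitted here; a fixed learning rate suffices for this small illustration.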

3.1.3 Experimental Results

The BP neural network is trained with all five pairs of CV-train and CV-test sets. The BP configuration associated with the winning pair is evaluated on the testing data set. Table 1 presents the cross-validation training result for the winning pair and shows that the classification accuracy on the testing data set is about 99.35%. Table 5a shows the training time spent on the winning architecture, and Table 6a shows the confusion matrix of this BP experiment.

3.2 Radial Basis Function Neural Network

3.2.1 Overview

Radial basis function (RBF)(6,10) is a multi-layer neural network consisting of an input layer, hidden layers, and an output layer. Nodes in each layer are fully connected to those in the layers above and below, and nodes in the hidden layers (basis function nodes) have kernel functions usually given as Gaussian profiles. Each connection is associated with a synaptic weight, but all connections between the input layer and the hidden layers are assigned unit weight.

The RBF network is trained first by unsupervised learning, to determine the characteristics of the hidden layer, and then by supervised learning. In the unsupervised learning process, the means and variances of the basis functions for the hidden layers are determined using the K-means clustering algorithm(6). In the supervised learning process that follows, each input-output pattern is presented to the network and the basis function node outputs are calculated. The basis function node outputs and the desired outputs are then used to determine the network output weights.

3.2.2 Network Architecture

The two-layer RBF neural network is implemented with an input layer (the seven block features H, E, S, R, HR, ER and SR), an output layer (one binary output for non-textual or textual data), and a single hidden layer. Let F be the number of basis function nodes in the hidden layer and H be the global proportionality factor of the network (not to be confused with the block feature H). As pointed out by Ng and Lippmann(10), the classification accuracy of the RBF classifier depends upon both H and F. The factor H is included to provide flexibility in setting the widths of the basis functions; the purpose is to have narrow basis functions in densely populated regions and wide basis functions in sparsely populated regions. In order to obtain good results, the basis functions must overlap sufficiently.

Too small a value for factor H would result in poor performance due to the fact that the basis functions are narrow and do not overlap. On the other hand, too large a value for H also results in poor performance because all basis functions are alike. The number of basis function nodes F measures the complexity of the RBF and generally speaking, when F increases the performance of the RBF improves. However, beyond a certain point, increasing the value of F will not result in improving the RBF performance. In order to observe its behavior with different combinations of H and F, the RBF neural network is trained with different values of F (2, 4, 7, 10, 14, and 16) and H (60, 80, 100, 150, 200, 250, 300, 350, 400, 450, and 500).
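An RBF classifier along these lines can be sketched as follows (our own reconstruction: K-means picks the F basis centers, H scales the basis widths, and the output weights are fitted by least squares; the paper does not specify the exact width parameterization, so treat that detail as an assumption):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain K-means: returns k cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        assign = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            pts = X[assign == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return centers

def basis(X, centers, h):
    """Gaussian basis node outputs, plus a bias column."""
    d2 = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
    phi = np.exp(-d2 / h)                  # larger H -> wider basis functions
    return np.hstack([phi, np.ones((len(X), 1))])

def train_rbf(X, y, f, h, seed=0):
    """F basis nodes at K-means centers; output weights by least squares."""
    centers = kmeans(X, f, seed=seed)
    w, *_ = np.linalg.lstsq(basis(X, centers, h), y, rcond=None)
    return centers, w

def rbf_predict(X, centers, w, h):
    return basis(X, centers, h) @ w
```

Sweeping f over {2, 4, 7, 10, 14, 16} and h over the listed values with such a sketch mirrors the grid search described above.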

3.2.3 Experimental Results

The RBF neural network is also trained with all five pairs of CV-train and CV-test sets for all combinations of F and H values. The RBF configuration associated with the winning pair is tested on the testing data set. Table 2 presents the classification accuracy on the testing data set for all values of F and H, and shows that the classification accuracy of the trained RBF is about 99.61% for F = 10 and H = 150. Table 5b shows the training time spent on the winning architecture, and Table 6b shows the confusion matrix of this RBF experiment.

3.3 Probabilistic Neural Network

3.3.1 Overview

The probabilistic neural network (PNN)(14) is a multi-layer neural network using exponential activation functions instead of sigmoidal ones. The network is made up of an input layer, a hidden layer consisting of a pattern layer and a summation layer, and an output layer. The number of nodes in the pattern layer is the number of patterns in the training data set, and the number of nodes in the summation layer is the number of categories to be classified. Nodes in the output layer are all two-input neurons, and their number can be determined from the grouping of different pairs of summation nodes. Nodes in the input layer are fully connected to nodes in the pattern layer, while nodes in the summation layer receive inputs only from the pattern nodes associated with the category to which the training pattern belongs.

3.3.2 Network Architecture

The three-layer PNN is implemented with an input layer (the seven block features H, E, S, R, HR, ER and SR), an output layer (one binary output for non-textual or textual data), and a single hidden layer consisting of P pattern units, where the number of training samples P is 1228. As pointed out by Specht(14), the error rate of the PNN depends upon the smoothing parameter s, which controls the shape of the network's exponential activation function. In order to observe how the network's behavior changes, the PNN is trained with the smoothing parameter s set to 0.1, 0.3, 0.5, 0.7, 0.9, 1, 1.4, 1.5, 1.6, 2, 10, 50, and 100.
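The PNN decision rule reduces to Parzen-window density estimation per class; a minimal sketch (our own function, assuming Gaussian kernels whose width is the smoothing parameter s) is:

```python
import numpy as np

def pnn_classify(x, patterns, labels, s):
    """PNN decision: one pattern node per training sample; each summation
    node averages the Gaussian kernel responses of its class; the output
    is the class with the larger estimated density at x."""
    labels = np.asarray(labels)
    scores = {}
    for c in np.unique(labels):
        P = patterns[labels == c]
        d2 = ((P - x) ** 2).sum(axis=1)
        scores[c] = np.exp(-d2 / (2 * s ** 2)).mean()
    return max(scores, key=scores.get)
```

Because all training patterns are simply stored, "training" is instantaneous, which matches the motivation for including the PNN given in section 1.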

3.3.3 Experimental Results

The PNN is also trained with all five pairs of CV-train and CV-test sets for all values of s. The PNN configuration associated with the winning pair is tested on the testing data set. Table 3 presents the classification accuracy on the testing data set for all values of s, and shows that the classification accuracy of the trained PNN is at least about 98.18% for any value of the smoothing parameter s greater than or equal to 1.6. Table 5c shows the training time spent on the winning architecture, and Table 6c shows the confusion matrix of this PNN experiment.

3.4 Kohonen's Self-Organizing Feature Maps Neural Network

3.4.1 Overview

Kohonen's self-organizing feature maps neural network (SOFM)(8,17) consists of one input layer and one output matrix competitive layer. Nodes in the input layer are fully connected to those in the output layer, and each interconnection is associated with a weight value. The network is trained by an unsupervised learning method that self-organizes similar input patterns into clusters. When a training input pattern vector is presented to the input layer, the output neuron values are calculated and compete to find the winning output neuron, the one that produces the strongest response. After the winning neuron is determined, weights are adjusted for the winning neuron as well as all other neurons within its neighborhood, defined as a square area in the output matrix layer centered at the winning neuron. The size of the neighborhood square area decreases during the training process.
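One training step of this procedure can be sketched as follows (a minimal version; taking the "strongest response" to be the smallest Euclidean distance between the input and a neuron's weight vector is our assumption):

```python
import numpy as np

def sofm_step(W, x, lr, radius):
    """One Kohonen update on an m x m map (W has shape (m, m, n)):
    find the winning neuron, then move the weights of every neuron in
    the square neighbourhood of the winner toward the input vector."""
    m = W.shape[0]
    d = ((W - x) ** 2).sum(-1)                  # response: distance to x
    wi, wj = np.unravel_index(d.argmin(), d.shape)
    for i in range(max(0, wi - radius), min(m, wi + radius + 1)):
        for j in range(max(0, wj - radius), min(m, wj + radius + 1)):
            W[i, j] += lr * (x - W[i, j])
    return W
```

A full training run would call this repeatedly while shrinking `radius` and, typically, the learning rate.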

3.4.2 Network Architecture

The SOFM is implemented with an input layer (the seven block features H, E, S, R, HR, ER and SR) and an output layer (a square output matrix). Since the performance of the SOFM depends upon the size of the square output matrix layer, the network is trained with square output matrix sizes of 12 (144 output neurons), 20 (400 output neurons), 28 (784 output neurons), and 36 (1296 output neurons). The size of the neighborhood square area is initialized at about half of the square matrix size. The learning rate, which determines the size of each weight adjustment, is chosen to be 0.4 for all of the above configurations. Usually, the number of iterations is chosen to be about 500 times the number of output neurons in order to obtain a good result(17). After training, once the weights have stabilized, the network is calibrated. Calibration is performed by supervised labeling of the square output matrix neuron array according to known input vectors from the training data set. When a textual block data input vector from the training data set is presented at the input layer, one neuron in the output array produces the strongest response; this neuron is labeled T. In the same way, the neuron in the output array that produces the strongest response to a non-textual block data input vector is labeled G. Any neuron in the output array that remains uncommitted after calibration is labeled *.

For a given block data input vector, the trained SOFM may return a * response, which is an unknown classification. Therefore, in order to classify any block, a modified calibration is proposed in this paper. Each neuron in the square output matrix array has two features: a label (G for non-textual and T for textual) and a response value. When a textual or non-textual block data input vector from the training data set is presented, the labels and response values of the neuron that produces the strongest response and of its 8 neighboring neurons are updated using the following modified calibration algorithm.

  • (A) If the current neuron response is stronger than its previous response, then
    1. update its response to the current response, and
    2. update its label to T if the input vector is textual and to G if the input vector is non-textual.
  • (B) Otherwise, nothing changes.
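The modified calibration step above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the grid size, response values, and function names are our assumptions.

```python
def modified_calibrate(labels, best, response, winner, label):
    """Update the winning neuron and its 8 neighbors on a square output grid.

    labels[r][c]   -- 'T', 'G', or '*' (uncommitted)
    best[r][c]     -- strongest response seen so far at neuron (r, c)
    response[r][c] -- response of neuron (r, c) to the current input vector
    winner         -- (row, col) of the strongest-responding neuron
    label          -- 'T' for a textual input, 'G' for a non-textual one
    """
    n = len(labels)
    r0, c0 = winner
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = r0 + dr, c0 + dc
            if 0 <= r < n and 0 <= c < n:
                # step (A): current response stronger than previous best
                if response[r][c] > best[r][c]:
                    best[r][c] = response[r][c]   # (A.1) update response
                    labels[r][c] = label          # (A.2) update label
                # step (B): otherwise nothing changes

# example: a 3x3 grid; a textual input wins at the center neuron
labels = [['*'] * 3 for _ in range(3)]
best = [[float('-inf')] * 3 for _ in range(3)]
response_t = [[0.1, 0.2, 0.1], [0.2, 0.9, 0.3], [0.1, 0.2, 0.1]]
modified_calibrate(labels, best, response_t, (1, 1), 'T')

# a later, weaker non-textual input does not overwrite the labels
response_g = [[0.05] * 3 for _ in range(3)]
modified_calibrate(labels, best, response_g, (1, 1), 'G')
```

After the first presentation all nine neurons around the winner carry the label T; the weaker non-textual presentation leaves them unchanged, so fewer neurons remain uncommitted than under the standard calibration.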

3.4.3 Experimental Results

The SOFM is also trained with all five pairs of a CV-train set and a CV-test set for each value of the output matrix size. The SOFM configuration associated with the winning pair is tested on the testing data set. Table 4 presents the classification accuracies on the testing data set for each output matrix size, and shows that the best classification accuracy of the trained SOFM on the testing data set is about 99.22 % for an output matrix size of 36 (1296 output neurons). Table 5d shows the training time spent for the winning neural architecture. Table 6d shows the confusion matrix of this SOFM experiment.

4. DISCUSSION

In this section we first justify the use of neural network models in place of traditional statistical pattern recognition, and then we try to explain the reasons behind RBF performing best.

Future advances on the connectionist front concern both modeling and learning and require awareness of existing analogies between neural networks (NN) and related statistical disciplines. The basic tasks neural networks have to cope with involve learning adaptive mappings for functional approximation, solving inverse problems within the framework provided by regularization, and learning to represent (sensory) inputs through functional decomposition using optimal (kernel) bases ('receptive fields'). Fundamental links have been established recently between neural networks and statistical pattern recognition when the network outputs were shown to possess specific statistical significance. As an example, if BP learning is defined over the error surface e(f) = SUM_i [y_i - f(x_i)]^2, Geman, Bienenstock and Doursat(4) have shown that among all the functions of x, the regression E(y|x) is the best prediction of y given x in the Minimum Square Error (MSE) sense, i.e., E[(y - f(x))^2 | x] >= E[(y - E(y|x))^2 | x]. Another recent but related result due to Richard and Lippmann(13) states that MSE approximation estimates Bayesian probabilities according to the degree to which the training data reflect the true likelihood distributions and a priori probabilities. Note also the well-known fact from statistics that the error function e(f) can be decomposed as e(f) = 'bias' + 'variance' and that smoothing introduces bias as details of the 'true' regression function are lost; neural networks enjoy low bias at the price of increased variance. Finally, neural network solutions for inverse problems expanded in applicability when NN were shown to be equivalent to statistical (MAP) estimation. As NN seem to be functionally equivalent to statistical pattern recognition, it has recently been suggested that one could benefit from using NN for non-parametric ('low bias') problems involving large sample sizes(3).
As our project indeed considers non-parametric situations and large sample sizes, the use of NN is justified. Furthermore, as we discuss next, NN performance can be improved if its variance would be decreased by increased locality as it is the case with RBFs.
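The Richard and Lippmann result can be illustrated with a toy numeric sketch (ours, not from the paper): for a fixed input x, the constant prediction that minimizes the squared error over observed 0/1 class labels is exactly their mean, i.e., the empirical posterior probability of the textual class. The label values and grid search below are illustrative assumptions.

```python
def mse(c, ys):
    """Mean squared error of the constant prediction c against labels ys."""
    return sum((y - c) ** 2 for y in ys) / len(ys)

# labels observed at one fixed input x: 1 = textual, 0 = non-textual
ys = [1, 1, 1, 0]          # empirical P(textual | x) = 0.75

# scan candidate constant predictions on a grid; the minimizer is mean(ys)
best_c = min((i / 100 for i in range(101)), key=lambda c: mse(c, ys))
```

Here best_c comes out to 0.75 = mean(ys); in the same way, the output of an MSE-trained network at x approximates E(y|x), which for 0/1 targets is the Bayesian posterior probability of the class.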

RBFs, related to potential functions, implement adaptation as the kind of functional approximation involved in sparse surface reconstruction. Given some function f(x), its approximation (decomposition in terms of basis functions), or learned prototype <c_i, T_i, R>, is found as

	f(x) = SUM_i c_i N(||x - T_i||^2_R)

where the T_i are kernels (bases) of the function to be learned, given in terms of parameterized templates across the space of geometrical (and/or topological) variability and weighted by coefficients c_i, N is the Gaussian distribution, R is the normalization matrix, and the corresponding norm is defined as

	||x - T_i||^2_R = (x - T_i)^T R^T R (x - T_i)
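Evaluating such an RBF expansion can be sketched as follows. This is a minimal illustration under simplifying assumptions: we use a scalar normalization r in place of the matrix R, a Gaussian kernel exp(-d^2), and function names of our own choosing.

```python
import math

def rbf_output(x, centers, coeffs, r=1.0):
    """Evaluate f(x) = sum_i c_i * N(||x - T_i||^2_R) with Gaussian bases,
    where the scalar normalization r gives ||v||^2_R = r^2 * sum(v_k^2)."""
    total = 0.0
    for c_i, t_i in zip(coeffs, centers):
        # squared weighted distance from x to the kernel center T_i
        d2 = (r * r) * sum((xk - tk) ** 2 for xk, tk in zip(x, t_i))
        total += c_i * math.exp(-d2)   # Gaussian basis centered at T_i
    return total

# one basis function centered at T = (1, 2): at x = T the kernel is 1,
# so f(T) equals the coefficient c = 0.5
val = rbf_output([1.0, 2.0], centers=[[1.0, 2.0]], coeffs=[0.5])
```

The locality of the representation is visible directly: each coefficient c_i contributes only near its own center T_i, and the contribution decays to zero away from it.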

As it has been discussed recently by Bottou and Vapnik(2), 'a high capacity learning system is able to model accurately and with high confidence the parts of the decision boundary that are well described by the training examples. In these areas, both rejection and misclassification are rare. The same system, however, produces unreliable high confidence decision boundaries in the poorly sampled areas of the pattern space: rejections are rare, but misclassifications are frequent. Alternatively, a low capacity learning system builds low confidence boundaries in the poorly sampled areas of the pattern space. This system rejects more atypical patterns, but reduces the number of misclassifications. Unfortunately, such a device performs poorly in the well sampled areas. Unable to take profit of the abundant data, it builds poor decision boundaries, and rejects almost everything, because everything looks too ambiguous'.

BP is a nonlocal algorithm with a comparatively high capacity; its performance could thus be improved by introducing locality and by reducing capacity. The k-nearest neighbors classifier (kNN) is an extremely local algorithm with a very low capacity; reducing its locality while increasing its capacity should improve its performance. The explanation for RBFs performing best then most likely comes from the fact that their approximation is derived on a local basis corresponding to some given set of optimal kernel bases. RBFs enjoy good characteristics in terms of both locality and capacity, and they constitute a good compromise between the BP and kNN schemes.

5. SUMMARY

We presented in this paper the classification of textual and non-textual data blocks using four different neural networks: back-propagation, radial basis functions, the probabilistic neural network, and Kohonen's self-organizing feature maps. All four neural networks demonstrated very good classification results. The back-propagation neural network is more complex and takes more time to train, but it requires less memory. The probabilistic neural network performs well, but the entire training set must be stored and used during testing, and the time spent classifying an unknown pattern is proportional to the size of the training data. Kohonen's self-organizing feature maps also perform very well, but in this particular application they required both a large memory for the output neuron array and a long computation time.

Compared to these other models, the radial basis function neural network has the highest classification accuracy (99.61 %), while requiring only intermediate amounts of memory and training time. We conclude therefore that the radial basis function neural network is the best architecture for this particular application in terms of both accuracy and training time.

Although our new approach for classifying binary document images into textual or non-textual data blocks using neural network models proved successful in this work, we note that block data types are classified only as textual or non-textual. A non-textual data block is not further identified as graphics, form, line art, or dithered image. A future task is to expand the current neural network configuration to classify a non-textual data block into such more refined and meaningful categories.

6. REFERENCES

  1. T. Akiyama and N. Hagita (1990), Automated Entry System for Printed Documents. Pattern Recognition 23(11): 1141-1154.
  2. L. Bottou and V. Vapnik (1992), Local Learning Algorithms. Neural Computation 4:888-900.
  3. V. Cherkassky, J. Friedman, and H. Wechsler (Eds.) (1994), From Statistics to Neural Networks. Springer-Verlag.
  4. S. Geman, E. Bienenstock, and R. Doursat (1992), Neural Networks and the Bias/Variance Dilemma. Neural Computation 5: 1-58.
  5. R. C. Gonzalez and P. Wintz (1987), Digital Image Processing - 2nd edition. Addison-Wesley, Reading, Massachusetts.
  6. D. R. Hush and B. G. Horne (1993), Progress in Supervised Neural Networks - What's New Since Lippmann? IEEE Signal Processing Magazine: 8-39.
  7. A. K. Jain and S. Bhattacharjee (1992), Text Segmentation Using Gabor Filters for Automatic Document Processing. Machine Vision and Applications 5: 169-184.
  8. R. P. Lippmann (1987), An Introduction to Computing with Neural Nets. IEEE Acoustics, Speech and Signal Processing Magazine 4(2): 4-22.
  9. R. P. Lippmann (1989), Pattern Classification Using Neural Networks. IEEE Communications Magazine: 47-62.
  10. K. Ng and R. P. Lippmann (1991), A Comparative Study of the Practical Characteristics of Neural Network and Conventional Pattern Classifiers. Technical Report 894, Massachusetts Institute Of Technology, Lincoln Laboratory.
  11. L. O'Gorman (1993), The Document Spectrum for Page Layout Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(11): 1162-1173.
  12. T. Pavlidis and J. Zhou (1992), Page Segmentation and Classification. Computer Vision Graphics and Image Processing 54(6): 484-496.
  13. M. D. Richard and R. P. Lippmann (1991), Neural Network Classifiers Estimate Bayesian A Posteriori Probabilities. Neural Computation 3: 461-483.
  14. D. F. Specht (1990), Probabilistic Neural Networks. Neural Networks 3:109-118.
  15. F. M. Wahl, K. Y. Wong and R. G. Casey (1982), Block Segmentation and Text Extraction in Mixed Text/Image Documents. Computer Graphics and Image Processing 20: 375-390.
  16. S. M. Weiss and C. A. Kulikowski (1991), Computer Systems That Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann Publishers, Inc., San Mateo, California.
  17. J. M. Zurada (1992), Introduction to Artificial Neural Systems. West Publishing Company, St. Paul, Minnesota.
  18. Theo Pavlidis, Department of Computer Science, State University of New York at Stony Brook, Stony Brook, NY 11794, private communication, August 1993.

ABOUT THE AUTHORS

Daniel X. Le received the B.S. degree summa cum laude in electrical and computer engineering from California State Polytechnic University, Pomona, in June 1986 and the M.S. degree in computer science from George Mason University, Fairfax, Virginia, in January 1993. From June 1986 to April 1989, he was a software engineer at the Jet Propulsion Laboratory, Pasadena, California. From April 1989 to September 1990, he was a systems engineer at Science Applications International Corporation, McLean, Virginia. Since September 1990, he has been an electronics engineer at the Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, Maryland. His current work is concerned with document analysis, image quality, and image processing.

George Thoma received the B.S. from Swarthmore College, and the M.S. and Ph.D. from the University of Pennsylvania, all in electrical engineering. He is Chief of the Communications Engineering Branch of the Lister Hill National Center for Biomedical Communications, the research and development arm of the National Library of Medicine. He directs research in image processing, document image storage on digital optical disks, high-speed image transmission, analog videodiscs and satellite communications. He has published and lectured extensively in these areas. Dr. Thoma's previous experience at the General Electric Company, AII Systems and the University of Pennsylvania had been in the application of satellites in voice and data communications, video distribution, navigation and surveillance.

Harry Wechsler received the Ph.D. in Computer Science from the University of California, Irvine, in 1975, and he is presently Professor of Computer Science at George Mason University. His research in the areas of Computer Vision and Neural Networks (NN) includes multistrategy learning (statistics, adaptive signal processing, information theory, genetic algorithms, and neural networks), scale-space and clustering for early and intermediate vision using conjoint (Gabor and wavelet) image representations, attention, functional and selective perception, face recognition, and object recognition. He was Director for the NATO Advanced Study Institutes on "Active Perception and Robot Vision" (Maratea, Italy, 1989) and on "From Statistics to Neural Networks" (Les Arcs, France, 1993) and he will serve as the co-Chair for the International Conference on Pattern Recognition to be held in Vienna, Austria, in 1996. He authored over 100 scientific papers, his book "Computational Vision" was published by Academic Press in 1990, and he is the editor of the book, "Neural Networks for Perception" (Vol 1 & 2), published by Academic Press in 1991. He was elected as an IEEE Fellow in 1992.