Document Classification using Connectionist Models

D. X. Le, G. Thoma, H. Wechsler
Proc. 1994 IEEE International Conference On Neural Networks
Orlando, FL
June 28 - July 2, 1994
Vol. 5, pp. 3009-3014.

Keywords:

Back propagation, Document processing, Probabilistic connectionist, Radial basis function, Self-organizing feature map, Textual classification

Abstract

As part of research into document analysis, we implemented a method for classification of binary document images into textual or non-textual data blocks using connectionist models. The four connectionist models considered were back propagation, radial basis function, probabilistic connectionist, and Kohonen's self-organizing feature map. The performance and behavior of these connectionist models are analyzed and compared in terms of training times, memory requirements, and classification accuracy. The experiments carried out on a variety of medical journals show the feasibility of using the connectionist approach for textual block classification and indicate that in terms of both accuracy and training time the radial basis function connectionist should be preferred.

1. INTRODUCTION

The conversion of paper-based document information to electronic format is important for automated document delivery, document preservation and other applications. The Lister Hill National Center for Biomedical Communications, a research and development division of the National Library of Medicine (NLM), is conducting research in the area of binary document processing including scanning, displaying, quality assurance, image processing, archiving to optical disk, indexing, accessing, retrieving, printing and faxing, text recognition, and image and text database creation. This paper describes one component of this ongoing effort, the classification of binary document images into blocks of textual and non-textual data using connectionist models. A block in a binary document image is a connected component (blob) and is defined here as a collection of black runs that are 8-connected(1). Textual data occurs as a letter, word, or sentence, while examples of non-textual data include graphics, forms, line art, and dithered images. Textual data blocks help to improve the accuracy rate and to speed up the recognition process for optical character recognition (OCR) systems, while non-textual data blocks are useful for any automated system that performs image graphics enhancement. Block classification is also necessary for any automated system that processes printed documents, since such a system needs to divide the content of a page into smaller parts such as headline areas, text line areas, graphic areas, footnote areas, and background(2).

Several techniques to classify data blocks as textual or non-textual have been reported in the literature(3,4). As an example, Wahl, Wong and Casey(3) used a linear adaptive classification technique based on block features to classify data blocks into different classes. The Pavlidis and Zhou(4) classification technique is based on the strength and variation of the correlation between adjacent scanlines: they argued that for text this correlation is very high but changes rapidly, unlike the correlation for non-text regions. While these two techniques were reported to perform well on a variety of printed documents, no information was provided on the size of the experiments, and performance statistics were lacking. We describe in this paper a new approach for classifying a binary document image into textual or non-textual data blocks using connectionist models. Four connectionist models were considered: back propagation, radial basis function, probabilistic connectionist, and Kohonen's self-organizing feature map. The performance and behavior of these connectionist models are analyzed and then compared in terms of training times, memory requirements, and classification accuracy. The evaluation used a sample of 50 images of medical journal pages containing more than two thousand data blocks.

2. CLASSIFICATION SYSTEM

The classification process consists of four steps: block run lengths detection, block labeling, block features calculation, and connectionist learning and classification.

2.1 Block Run Lengths Detection

A binary document image can be segmented into blocks using the constrained run length algorithm (CRLA) developed by Wahl, Wong, and Casey(3). Assume that binary 0's and 1's represent white pixels and black pixels, respectively. The CRLA replaces 0's by 1's if the number of adjacent 0's is less than or equal to a predefined constraint C. This operation creates two intermediate bitmaps: one smoothed row-by-row with a constraint C_hor and one smoothed column-by-column with a constraint C_ver. These two intermediate bitmaps are combined by a logical AND operation, and the resulting bitmap goes through an additional horizontal smoothing pass with a constraint C_sm to produce the desired final bitmap. Wahl, Wong, and Casey reported that C_hor = 300 pixels, C_ver = 500 pixels, and C_sm = 30 pixels give fairly good results on a variety of documents.
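
To make the smoothing rule concrete, the following Python sketch applies the CRLA to a small numpy bitmap of 0's and 1's. The function names are our own; only the constraint logic and the AND-combination follow the description above.

```python
import numpy as np

def run_length_smooth(line, c):
    """Replace each run of 0's (white) of length <= c with 1's (black)."""
    out = line.copy()
    n = len(line)
    i = 0
    while i < n:
        if line[i] == 0:
            j = i
            while j < n and line[j] == 0:
                j += 1
            if j - i <= c:      # short white run: fill it with black
                out[i:j] = 1
            i = j
        else:
            i += 1
    return out

def crla(image, c_hor=300, c_ver=500, c_sm=30):
    """Constrained run length algorithm: smooth row-by-row, smooth
    column-by-column, AND the two intermediate bitmaps, then apply a
    final horizontal smoothing pass with the constraint c_sm."""
    horiz = np.array([run_length_smooth(row, c_hor) for row in image])
    vert = np.array([run_length_smooth(col, c_ver) for col in image.T]).T
    combined = horiz & vert
    return np.array([run_length_smooth(row, c_sm) for row in combined])
```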

2.2 Block Labeling

A block is defined as a connected component: a collection of black runs that are 8-connected(1). The purpose of this step is to segment the binary image by labeling its blocks. Each black run is assigned an integer number called a label, and connected black runs must receive the same label.
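
One standard way to implement this step is a run-based scan with union-find to merge the labels of 8-connected runs; the sketch below is our own reading of the procedure, not the paper's code.

```python
def label_runs(image):
    """Label the 8-connected black runs of a 0/1 image.  Two runs on
    adjacent rows are 8-connected when their column spans overlap or
    touch diagonally; union-find merges their labels."""
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    next_label = 0
    prev_runs = []          # (start_col, end_col, label) on previous row
    all_runs = []           # (row, start_col, end_col, label)
    for r, row in enumerate(image):
        runs = []
        c, n = 0, len(row)
        while c < n:
            if row[c] == 1:
                start = c
                while c < n and row[c] == 1:
                    c += 1
                label = next_label
                parent[label] = label
                next_label += 1
                # 8-connectivity test against runs on the previous row
                for ps, pe, pl in prev_runs:
                    if ps <= c and pe >= start - 1:
                        union(pl, label)
                runs.append((start, c - 1, label))
            else:
                c += 1
        all_runs.extend((r, s, e, l) for s, e, l in runs)
        prev_runs = runs
    # resolve merged labels so each block has a single final label
    return [(r, s, e, find(l)) for r, s, e, l in all_runs]
```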

2.3 Block Features Calculation

In this step, the coordinates of a block's extremities, which carry information about its approximate size, are used to calculate the features of each block in a binary document image. Each block has its own extreme coordinates with respect to the origin of the binary image: min_left_column (the column value of the left-most pixel of the block), max_right_column (the column value of the right-most pixel), min_top_row (the row value of the top-most pixel), and max_bottom_row (the row value of the bottom-most pixel).

Seven block features H, E, S, R, HR, ER and SR (defined below) are used for classification. The first four block features are similar to those defined by Wahl, Wong, and Casey, while the last three (higher-order) block features are products of the first four, introduced to enrich the input representation and to eliminate the need for more hidden layers(5). Let BC be the total number of black pixels of each block, DC be the total number of black pixels of the original data within each block, and TC be the number of white-black (or black-white) transitions of the original data within each block. Also, let H and W be the height and the width of each block (H = max_bottom_row - min_top_row + 1 and W = max_right_column - min_left_column + 1). The definitions of the seven block features are given next; a short code sketch follows the list. These block features are normalized to have unit length and are adjusted to lie within the [-0.5, 0.5] range.

  1. The height of each block: H
  2. The ratio between the width and the height of each block: E = W / H
  3. The ratio of the number of block pixels to the block area: S = BC / (W * H)
  4. The mean horizontal length of black runs of the original data within each block: R = DC / TC
  5. The product of the block height and the ratio R: HR = H * R
  6. The product of the ratio E and the ratio R: ER = E * R
  7. The product of the ratio S and the ratio R: SR = S * R
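
Assuming a block's extreme coordinates and the counts BC, DC, and TC are already available, the feature vector can be computed as below. The paper does not spell out the exact normalization, so the shift of the nonnegative unit-length vector into [-0.5, 0.5] is one plausible reading.

```python
import numpy as np

def block_features(min_top_row, max_bottom_row, min_left_col, max_right_col,
                   bc, dc, tc):
    """Raw block features H, E, S, R, HR, ER, SR as defined above."""
    h = max_bottom_row - min_top_row + 1       # block height
    w = max_right_col - min_left_col + 1       # block width
    e = w / h                                  # width-to-height ratio
    s = bc / (w * h)                           # pixel density in block area
    r = dc / tc                                # mean horizontal black run length
    return np.array([h, e, s, r, h * r, e * r, s * r], dtype=float)

def normalize(f):
    """Scale to unit length, then shift the (nonnegative) components
    into the [-0.5, 0.5] range -- an assumed reading of the paper's
    normalization step."""
    v = f / np.linalg.norm(f)
    return v - 0.5
```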

2.4 Connectionist Learning and Classification

Fifty scanned binary document images (8.5 x 11 inches in size, at 200 dpi resolution) covering a wide variety of layouts were selected from 12 different medical journals to create the training and testing data sets. The training data set is used to design each connectionist, while the testing data set is used to estimate its classification accuracy. These fifty binary document images contain 129 non-textual data blocks and 2175 textual data blocks. The segmented blocks are divided randomly into (1) 85 non-textual and 1450 textual data blocks for the training data set and (2) 44 non-textual and 725 textual data blocks for the testing data set. For generalization, the cross-validation (CV) technique(6) is used: the training data set is randomly divided into five data groups, of which four groups form a CV-train set (68 non-textual and 1160 textual data blocks) and the remaining group serves as a CV-test set (17 non-textual and 290 textual data blocks). As a result, there are five pairs of CV-train and CV-test sets that are used to train and test each connectionist. The trained weights corresponding to the winning pair, the one yielding the highest classification accuracy, are chosen as the final weights for that connectionist.
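
The winner selection can be summarized as follows, where train_fn and accuracy_fn stand for hypothetical per-model training and scoring routines:

```python
import random

def five_fold_pairs(samples, seed=0):
    """Randomly split the training set into five groups; each pair
    uses four groups as the CV-train set and one as the CV-test set."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::5] for i in range(5)]
    for k in range(5):
        cv_test = folds[k]
        cv_train = [s for i, f in enumerate(folds) if i != k for s in f]
        yield cv_train, cv_test

def select_winner(samples, train_fn, accuracy_fn):
    """Train on each CV pair and keep the weights from the pair that
    yields the highest CV-test accuracy."""
    best = None
    for cv_train, cv_test in five_fold_pairs(samples):
        weights = train_fn(cv_train)
        acc = accuracy_fn(weights, cv_test)
        if best is None or acc > best[0]:
            best = (acc, weights)
    return best[1]
```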

3. PERFORMANCE EVALUATION

In this section, each connectionist is discussed with respect to the training and testing strategy employed and the experimental results. A DELL 486D/50 personal computer was used to train and test each connectionist.

3.1 Back Propagation

The two-layer back propagation connectionist(5,6,7) is implemented with an input layer taking the seven block features H, E, S, R, HR, ER and SR, an output layer with one binary output (textual or non-textual), and a single hidden layer whose number of nodes is chosen based on the numbers of inputs, outputs, and training samples. As pointed out by Hush and Horne(6), for good generalization the number of training samples should be approximately ten times the number of weights in a multi-layer network. With 1228 training samples, the network should therefore have roughly 123 weights; for 7 inputs and 1 output, this corresponds to about 14 hidden nodes. The chosen back propagation architecture is thus 7-14-1. The back propagation connectionist is trained and tested with all five pairs of CV-train and CV-test sets. The training time for each pair was about 5 hours. Table 1 presents the cross-validation training result on the winning pair; the best classification accuracy on the testing data set is about 99.4 %.
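
A minimal numpy sketch of such a 7-14-1 network trained by gradient descent on the squared error is shown below; the learning rate, epoch count, and weight initialization are illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MLP:
    """7-14-1 fully connected network with sigmoid units, trained by
    plain batch gradient descent on the squared error."""
    def __init__(self, n_in=7, n_hidden=14):
        self.w1 = rng.normal(0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.normal(0, 0.1, (n_hidden, 1))
        self.b2 = np.zeros(1)

    def forward(self, x):
        self.h = sigmoid(x @ self.w1 + self.b1)
        self.y = sigmoid(self.h @ self.w2 + self.b2)
        return self.y

    def train(self, X, T, lr=0.1, epochs=1000):
        """X: (n, 7) feature matrix; T: (n, 1) targets in {0, 1}."""
        for _ in range(epochs):
            y = self.forward(X)
            # back-propagate squared-error gradients through both layers
            d_out = (y - T) * y * (1 - y)
            d_hid = (d_out @ self.w2.T) * self.h * (1 - self.h)
            self.w2 -= lr * self.h.T @ d_out / len(X)
            self.b2 -= lr * d_out.mean(axis=0)
            self.w1 -= lr * X.T @ d_hid / len(X)
            self.b1 -= lr * d_hid.mean(axis=0)
```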

3.2 Radial Basis Function

The two-layer radial basis function connectionist(6,8) is implemented with an input layer taking the seven block features H, E, S, R, HR, ER and SR, an output layer with one binary output (textual or non-textual), and a single hidden layer whose number of nodes is the number of basis functions F. The means and variances of the basis functions are determined using the K-means clustering algorithm(6). The classification accuracy of the radial basis function classifier depends upon both F and the global proportional factor H(8) (not to be confused with the block height feature). Since the radial basis function network is a multi-layer network, the generalization issue can be handled in the same way as in back propagation (see Section 3.1): for 7 inputs, 1 output, and 1228 training samples, the number of basis functions F is chosen to be 14. The radial basis function connectionist is also trained and tested with all five pairs of CV-train and CV-test sets for different combinations of F and H. The training time for each pair was about 47 seconds. Table 2 presents the classification accuracy on the testing data set for all values of F and H; the highest classification accuracy is about 99.5 % for F = 14 and H = 450.
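
The following sketch places F Gaussian basis functions with K-means and scales their widths by a global factor h, in the spirit of the description above; solving the output weights by linear least squares (rather than gradient descent) is a common shortcut and an assumption here.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain K-means clustering to place the basis function centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers

def fit_rbf(X, T, f=14, h=1.0):
    """Gaussian basis layer: centers from K-means, per-cluster widths
    scaled by the global proportional factor h, linear output weights
    solved by least squares."""
    centers = kmeans(X, f)
    d = np.linalg.norm(X[:, None] - centers[None], axis=2)
    assign = d.argmin(axis=1)
    widths = np.array([d[assign == j, j].mean() if np.any(assign == j)
                       else 1.0 for j in range(f)])
    widths = np.maximum(widths, 1e-6)          # guard single-point clusters
    phi = np.exp(-d ** 2 / (2 * h * widths[None] ** 2))
    w, *_ = np.linalg.lstsq(np.c_[phi, np.ones(len(X))], T, rcond=None)
    return centers, widths, w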

3.3 Probabilistic Connectionist

The three-layer probabilistic connectionist(9) is implemented with an input layer taking the seven block features H, E, S, R, HR, ER and SR, an output layer with one binary output (textual or non-textual), and a single hidden layer consisting of P pattern units, where P is the number of training samples. The error rate of the probabilistic connectionist depends upon the smoothing parameter s(9), which controls the shape of the network's exponential activation function. Therefore, in order to observe changes in its behavior, the probabilistic connectionist is trained with different values of the smoothing parameter s. It is also trained and tested with all five pairs of CV-train and CV-test sets for different values of s. The training time for each pair was about 14 seconds. Table 3 presents the classification accuracy on the testing data set for all values of s; the best classification accuracy is about 98.2 % for any value of the smoothing parameter s greater than or equal to 1.6.
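
Because the pattern layer simply stores the training samples, the whole classifier fits in a few lines. The sketch below follows Specht's formulation with a Gaussian kernel of width s (sigma here), averaging the pattern-unit activations per class.

```python
import numpy as np

def pnn_classify(x, train_X, train_y, sigma):
    """Probabilistic network: one Gaussian pattern unit per training
    sample; the smoothing parameter sigma controls the kernel width.
    Returns 1 for textual, 0 for non-textual."""
    d2 = ((train_X - x) ** 2).sum(axis=1)      # distance to every sample
    k = np.exp(-d2 / (2 * sigma ** 2))         # pattern-unit activations
    score_text = k[train_y == 1].mean()        # class-averaged activations
    score_nontext = k[train_y == 0].mean()
    return int(score_text > score_nontext)
```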

3.4 Kohonen's Self-Organizing Feature Map

Kohonen's self-organizing feature map connectionist(5,7) is implemented with an input layer taking the seven block features H, E, S, R, HR, ER and SR, and an output layer consisting of a square matrix of neurons. Since the performance of the self-organizing feature map depends upon the size of this square output matrix, the network is trained with different values of the matrix size. The number of iterations is chosen to be about 500 times the number of output neurons in order to obtain good results(5).
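
A minimal training loop for such a map is sketched below; the learning rate and neighborhood schedules are illustrative choices, since the paper does not specify them.

```python
import numpy as np

def train_som(X, size=36, seed=0):
    """Kohonen map with a size x size output grid; the iteration count
    follows the rule of thumb of about 500 updates per output neuron."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(-0.5, 0.5, (size, size, X.shape[1]))
    grid = np.stack(np.meshgrid(np.arange(size), np.arange(size),
                                indexing="ij"), axis=-1)
    n_iter = 500 * size * size
    for t in range(n_iter):
        x = X[rng.integers(len(X))]
        # winner: the neuron whose weight vector is closest to the input
        d = ((w - x) ** 2).sum(axis=2)
        win = np.unravel_index(d.argmin(), d.shape)
        frac = 1.0 - t / n_iter                # linear decay over training
        lr = 0.5 * frac
        radius = max(1.0, (size / 2) * frac)
        g2 = ((grid - np.array(win)) ** 2).sum(axis=2)
        nb = np.exp(-g2 / (2 * radius ** 2))   # Gaussian neighborhood
        w += lr * nb[..., None] * (x - w)
    return w
```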

After being trained, the resulting network is calibrated by supervised labeling of the square output matrix neuron array using known input vectors from the training data set. Each neuron in the array has two features: a label (G for non-textual and T for textual) and a response value. When a block data input vector from the training data set is presented, the labels and response values of the neuron producing the strongest response and its 8 neighboring neurons are updated as follows (a sketch of this calibration follows the list):

  1. If the current neuron response is stronger than its previous response, then
    • (a) update its response by the current response, and
    • (b) update its label to T if the input vector is textual and to G if the input vector is non-textual.
  2. Otherwise, nothing changes.
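
The calibration rule above can be sketched as follows, taking the negative distance between an input and a neuron's weight vector as the neuron's response (an assumption; a dot-product response would work similarly).

```python
import numpy as np

def calibrate(w, X, labels):
    """Supervised labeling of a trained map: for each training vector,
    the winning neuron and its 8 neighbors keep the strongest response
    seen so far together with the label ('T' or 'G') of the input that
    produced it."""
    size = w.shape[0]
    best = np.full((size, size), -np.inf)      # strongest response so far
    tag = np.full((size, size), "?", dtype=object)
    for x, lab in zip(X, labels):
        d = ((w - x) ** 2).sum(axis=2)
        r, c = np.unravel_index(d.argmin(), d.shape)
        for i in range(max(0, r - 1), min(size, r + 2)):
            for j in range(max(0, c - 1), min(size, c + 2)):
                resp = -d[i, j]                # smaller distance = stronger
                if resp > best[i, j]:          # rule 1(a) and 1(b)
                    best[i, j] = resp
                    tag[i, j] = lab
    return tag
```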

The self-organizing feature map is also trained and tested with all five pairs of CV-train and CV-test sets for different values of the output matrix size. The training time for each pair was about 14 hours. Table 4 presents the classification accuracies on the testing data set for the different values of the output matrix size. The best classification accuracy is about 99.2 % when the output matrix size is 36 (1296 output neurons).

4. SUMMARY

The classification of textual and non-textual data using four connectionist models, back propagation, radial basis function, probabilistic, and Kohonen's self-organizing feature map, has been presented. All four connectionists showed very good classification results. The back propagation connectionist takes more time to train, but it requires less memory. The probabilistic connectionist requires that the entire training data set be stored and used for each classification of an unknown pattern. Kohonen's self-organizing feature map requires a large amount of memory for the output neuron array. The radial basis function connectionist, on the other hand, has the highest classification accuracy and requires intermediate amounts of memory and training time. We therefore conclude that the radial basis function connectionist is the best architecture for this particular application in terms of both accuracy and training time. In this work, data blocks are classified only as either textual or non-textual; however, a non-textual data block can contain graphics, forms, line art, or dithered images. A future task is to expand the current connectionist configuration to further refine the classification of non-textual data blocks.

5. REFERENCES

  1. R. C. Gonzalez and P. Wintz (1987), Digital Image Processing - 2nd edition. Addison-Wesley, Reading, Massachusetts.
  2. T. Akiyama and N. Hagita (1990), Automated Entry System for Printed Documents. Pattern Recognition 23(11): 1141-1154.
  3. F. M. Wahl, K. Y. Wong and R. G. Casey (1982), Block Segmentation and Text Extraction in Mixed Text/Image Documents. Computer Graphics and Image Processing 20: 375-390.
  4. T. Pavlidis and J. Zhou (1992), Page Segmentation and Classification. Computer Vision Graphics and Image Processing 54(6): 484-496.
  5. J. M. Zurada (1992), Introduction to Artificial Neural Systems. West Publishing Company, St. Paul, Minnesota.
  6. D. R. Hush and B. G. Horne (1993), Progress in Supervised Neural Networks - What's New Since Lippmann? IEEE Signal Processing Magazine 10(1): 8-39.
  7. R. P. Lippmann (April 1987), An Introduction to Computing with Neural Nets. IEEE Acoustics, Speech and Signal Processing Magazine 4(2): 4-22.
  8. K. Ng and R. P. Lippmann (1991), A Comparative Study of the Practical Characteristics of Neural Network and Conventional Pattern Classifiers. Technical Report 894, MIT Lincoln Laboratory, Lexington, Massachusetts.
  9. D. F. Specht (1990), Probabilistic Neural Networks. Neural Networks 3:109-118.