Automated Page Orientation and Skew Angle Detection for Binary Document Images

D. X. Le, G. Thoma, H. Wechsler
Pattern Recognition, Volume 27, Number 10, pp. 1325-1344,
October 1994.

Keywords: Page orientation, Projection profiles, Component labelling, Hough transform, Document skew angle


Abstract: We describe the development and implementation of algorithms for detecting the page orientation (portrait/landscape) and the degree of skew for documents available as binary images. A new and fast approach is advanced herein whereby skew angle detection takes advantage of information found using the page orientation algorithm. Page orientation is accomplished using local analysis, while skew angle detection is based on the processing of pixels of the last black runs of binary image objects. The experiments carried out on a variety of medical journals show the feasibility of the new approach and indicate that detection accuracy can be improved by minimizing the effects of non-textual data.

1. INTRODUCTION

The conversion of paper documents to electronic format is routinely done for records management, automated document delivery, document archiving, journal distribution and other applications. Document conversion stages include scanning, displaying, quality assurance, image processing, text recognition, and image and text database creation. During the document scanning process, the whole document or a portion of it is fed through a looseleaf page scanner. Some pages are not fed straight into the scanner, however, causing skewing of the bitmapped images of these pages and thus requiring identification and rescanning by a quality control operator. Quality control is the slowest process in the system because the operator has to check many image features such as focus, readability, skew, missing or extra pages, and page sequencing. In an attempt to partially automate the quality assurance process as well as to improve any subsequent text recognition process, a page orientation detection algorithm and a document skew angle detection algorithm have been developed.

The page orientation of a document is defined as the printing direction of text lines. Therefore the page orientation can be in either horizontal printing mode (portrait mode) or vertical printing mode (landscape mode). The skew angle of a document is defined as the orientation angle of text lines.

Earlier approaches to detect page orientation(1,2) and document skew angle(2,3,4,5,6) have been reported. As an example, Akiyama and Hagita(1) use a technique based on the vertical and horizontal variances of projection profiles in a binary image. The technique emphasizes global rather than local variations and as a consequence it is quite sensitive to the presence of non-textual data. Examples of non-textual data one should consider include blanks, graphics, forms, line art, large fonts, and dithered images. (A dithered image is a binary image in which shades of gray are simulated by grouping dots into clusters.) For example, a portrait mode binary image containing such data was tested using Akiyama and Hagita's technique and was wrongly labelled as landscape mode. The main source of error in this global approach is the presence of non-textual data (e.g. a blank area in the middle and a dithered area on the right side of the page), which severely distorts the distribution of the black pixel projection profiles. To account for such situations, Hinds, Fisher and D'Amato(2) proposed an algorithm that determines a document page orientation based on the counts of short run-lengths in the vertical and horizontal histograms of an image. They argued that since stroke heights have longer run-lengths than stroke widths for a portrait mode document, the number of short run-lengths in the horizontal histogram will be larger than that in the vertical histogram. This argument, however, holds only for a document in which text data predominates.

Techniques to detect document skew angle as implemented by Fletcher and Kasturi(3) and by Rastogi and Srihari(4) have been reviewed by Hinds, Fisher and D'Amato(2). The Fletcher and Kasturi(3) algorithm is robust and reliable, but it is slow due to the use of connected components analysis on the original image. The Rastogi and Srihari(4) algorithm performs well but it is slow because of the excessive amount of data that must be processed by the Hough transform. Baird(5) has proposed an algorithm to determine the skew angle using an energy function on sets of projection-counts of character locations. While this technique works well on a wide variety of portrait mode layouts, there is no discussion of its suitability for pages in landscape mode. Hinds, Fisher and D'Amato(2) have proposed an algorithm that uses the Hough transform to determine a document skew angle. To reduce the amount of data that is input to the Hough transform, an intermediate image is created by placing the length of each vertical black run at its bottom-most pixel. The skew angle is then determined by applying the Hough transform to the intermediate image. While the technique correctly determines document skew angle on a variety of images, the data reduction factor is still small: about 2 for a typical text document and about 11 for a document with a mixture of pictures and text. As a result, the resolution of the document image has to be reduced before creating its intermediate image in order to increase the speed. The commercial ScanFix(TM)(6) software provides a method to automatically detect and correct document skew. While it works very well on portrait images, it performs very poorly on landscape images.

The page orientation detection algorithm and the new document skew angle detection algorithm proposed in this paper perform well on test images independent of text dominance and kind. The algorithms reduce the impact of non-textual data on the page orientation and skew angle results in order to increase the detection accuracy rate. The performance of the page orientation detection algorithm has been evaluated on a sample of several thousand images of medical journal pages. Evaluation results show that the page orientation algorithm detects the page orientation at an accuracy rate of 99.9% and that it compares favorably with existing algorithms(1,2) in terms of accuracy because it can account for non-textual data. The proposed page orientation detection algorithm plays a very important role as a preprocessing step that increases robustness not only for skew angle detection and correction but also for optical character recognition (OCR) systems in handling landscape images. In addition, page orientation information is crucial for any automated entry system for printed documents, where the system needs to divide the content of a printed document into smaller parts such as headline area, text line area, graphic area, footnotes area, and background(1).

The new approach for the document skew angle detection algorithm has also been tested on several hundred images of medical journal pages and has successfully detected document skew angles with an accuracy of about 0.50 degrees. Evaluation results using the new approach show that, while keeping the same skew angle detection accuracy rate, the skew angle detection speed is improved by a factor of about four over our previous approach(7,8). Our skew angle detection algorithm also compares favorably with existing algorithms(2,3,4,5,6) regarding its accuracy rate as well as its insensitivity to non-textual data.

2. BASIC FEATURES

Basic definitions necessary to understand our system of algorithms are given here.

2.1. Binary image

The term binary image refers to a two-dimensional binary function f(i,j), whose value is either 0 or 1, where i and j denote the vertical and horizontal coordinates of a pixel respectively, and where the coordinate origin is located at the top left corner of the image.

2.2. Projection profiles

Projection is an operation that maps a binary image into a one-dimensional array called a histogram or projection profile(9). A horizontal projection histogram h(i) of a binary image f(i,j) is the sum of black pixels projected onto the vertical axis i. A vertical projection histogram v(j) of a binary image f(i,j) is the sum of black pixels projected onto the horizontal axis j.
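As an illustration of these definitions, the following minimal Python sketch (not part of the original paper; it assumes the binary image is given as a NumPy array of 0s and 1s) computes both projection profiles:

```python
import numpy as np

def projection_profiles(f):
    """Projection profiles of a binary image f(i,j).

    f is a 2-D array of 0s and 1s in which 1 denotes a black pixel,
    with row index i increasing downward and column index j to the
    right, matching the coordinate convention of section 2.1.
    """
    h = f.sum(axis=1)  # h(i): black pixels of row i, projected onto the vertical axis
    v = f.sum(axis=0)  # v(j): black pixels of column j, projected onto the horizontal axis
    return h, v
```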

2.3. Variances and squared sums

The variance of a distribution is the average squared deviation from its mean value and it can be used to measure the spread of a distribution: the larger the variance, the larger the spread(10). Formally, consider an n x n square binary image f(i,j) and let s be the total number of black pixels in the square, σh be the variance of the horizontal projection histogram and σv be the variance of the vertical projection histogram. Since the image is square, the mean values of both the horizontal and vertical projection histograms are the same and equal to s/n. Up to the constant factor 1/n, which does not affect any comparison between them, the variances σh and σv can be calculated as follows:

σh = [h(0) - s/n]² + [h(1) - s/n]² + ... + [h(n-1) - s/n]²
   = [h²(0) + h²(1) + ... + h²(n-1)] + c
σv = [v(0) - s/n]² + [v(1) - s/n]² + ... + [v(n-1) - s/n]²
   = [v²(0) + v²(1) + ... + v²(n-1)] + c

where c = n(s/n)² - 2(s/n)s is a constant.

Let Sh and Sv be the squared sums of the horizontal and vertical projection histograms, respectively:

Sh = h²(0) + h²(1) + ... + h²(n-1)
Sv = v²(0) + v²(1) + ... + v²(n-1)

The formulas for σh and σv can then be rewritten as follows:

σh = Sh + c
σv = Sv + c

Since c is a constant, the squared sums Sh and Sv can be compared instead of the variances σh and σv to detect the page orientation. The advantage of using Sh and Sv is that the computation of the constant c is eliminated.
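A self-contained sketch of this computation (again an illustration rather than the paper's code):

```python
import numpy as np

def squared_sums(f):
    """Squared sums Sh and Sv of the projection histograms of an
    n x n square binary image f (0/1 NumPy array). The variances
    differ from these sums only by the same constant c, so comparing
    Sh and Sv is equivalent to comparing the variances."""
    h = f.sum(axis=1).astype(np.int64)  # horizontal projection histogram
    v = f.sum(axis=0).astype(np.int64)  # vertical projection histogram
    return int((h ** 2).sum()), int((v ** 2).sum())
```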

2.4. Textual and non-textual squares

A textual square of a binary image is defined as a square area in which text data is dominant. A non-textual square of a binary image is a square area in which non-textual data (blanks, graphics, forms, line art, large fonts, or dithered images) is dominant.

2.5. Textual square mode weight

The textual square mode weight is the total black pixels of a portrait or landscape mode textual square. There are two types of mode weight: portrait mode weight and landscape mode weight.

2.6. Object

An object in a binary image is analogous to a connected component. It is defined here as a collection of black runs that are 8-connected(11). Two black runs a and b are considered to be connected if a pixel of black run a is 8-connected to a pixel of black run b. Note that black run connectivity obeys the transitivity rule: if a black run a is connected to a black run b and b is connected to a black run c, then a and c are connected.

2.7. Object extreme coordinates

The object extreme coordinates provide information about the approximate size of an object in a binary image. Each object in a binary image has its own extreme coordinates with respect to the origin of the binary image f(i,j): min_left_column, max_right_column, min_top_row and max_bottom_row. The definitions of these extreme coordinates are as follows:

min_left_column: the column value of the left-most pixel of an object
max_right_column: the column value of the right-most pixel of an object
min_top_row: the row value of the top-most pixel of an object
max_bottom_row: the row value of the bottom-most pixel of an object

2.8. Objects bottom pixels

The bottom pixels of an object are the black pixels belonging to the last black runs of the object, i.e. the black runs in its bottom row.

2.9. Candidate object

Let diff_col and diff_row be (max_right_column - min_left_column + 1) and (max_bottom_row - min_top_row + 1), respectively. A candidate object is an object that satisfies all of the following empirical conditions:

  1. MIN_OBJECT_WIDTH < diff_col < MAX_OBJECT_WIDTH
  2. MIN_OBJECT_HEIGHT < diff_row < MAX_OBJECT_HEIGHT
  3. MIN_OBJECT_AREA < diff_row x diff_col < MAX_OBJECT_AREA

The empirical values of the predefined parameters MIN_OBJECT_WIDTH, MAX_OBJECT_WIDTH, MIN_OBJECT_HEIGHT, MAX_OBJECT_HEIGHT, MIN_OBJECT_AREA, and MAX_OBJECT_AREA are defined in the Appendix.
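A direct transcription of these conditions into code might look as follows; this is an illustrative sketch, and the numeric thresholds shown are placeholders, not the resolution-dependent values defined in the Appendix and in Le(8):

```python
# Placeholder thresholds for illustration only; the paper's actual
# values are resolution-dependent (see the Appendix and Le(8)).
MIN_OBJECT_WIDTH, MAX_OBJECT_WIDTH = 2, 50
MIN_OBJECT_HEIGHT, MAX_OBJECT_HEIGHT = 2, 50
MIN_OBJECT_AREA, MAX_OBJECT_AREA = 8, 2000

def is_candidate_object(min_left_column, max_right_column,
                        min_top_row, max_bottom_row):
    """Test the three empirical conditions that define a candidate object."""
    diff_col = max_right_column - min_left_column + 1
    diff_row = max_bottom_row - min_top_row + 1
    return (MIN_OBJECT_WIDTH < diff_col < MAX_OBJECT_WIDTH and
            MIN_OBJECT_HEIGHT < diff_row < MAX_OBJECT_HEIGHT and
            MIN_OBJECT_AREA < diff_row * diff_col < MAX_OBJECT_AREA)
```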

3. SYSTEM OVERVIEW

The system proposed in this paper consists of two algorithms: the page orientation detection algorithm and the skew angle detection algorithm. The first algorithm uses local analysis while the second is based on the processing of pixels of the last black runs of binary image objects. The main goal is to increase the detection accuracy rate by minimizing the effects of the non-textual data on page orientation and skew angle detection results. Our system consists of two major processes: page orientation process and skew angle process.

The Page Orientation Process takes a binary image page as its input and produces a page orientation as its output. It consists of three steps: textual and non-textual squares classification, textual squares page orientation estimation, and squares grouping. These steps are described in the following paragraphs. Step 1.1 is the Textual and Non-Textual Squares Classification. This step divides the entire binary image into squares, where the size of a square is defined in the Appendix. Each square is then classified as a textual square or a non-textual square.

Step 1.2 is the Textual Squares Page Orientation Estimation. It estimates the page orientation of each textual square by using (1) a projection profiles method or (2) a square difference method. In an attempt to reduce the impact of non-textual data on the page orientation detection results, non-textual squares are not used in the page orientation detection process. The projection profiles and their squared sums are the basis for the page orientation detection algorithm. The projection profiles method determines the page orientation by analyzing the shapes of the horizontal and vertical projection histograms. A more detailed description of this method is given in section 4.2.1. The square difference method is another way to determine the page orientation. For a portrait mode text binary image, the horizontal projection histogram shows more fluctuation than the vertical projection histogram. Therefore, the page orientation can be detected using a square difference method, which compares the squared sums of the horizontal and vertical projection histograms and determines the page orientation based on the histogram having the larger squared sum. Details of the square difference method are given in section 4.2.2.

Since the projection profiles method does not require many calculations, it is faster than the square difference method. The method can be used successfully, however, only for certain histogram shapes that are defined in section 4.2.1. As a result, the page orientation of each textual square is first estimated using the projection profiles method. When no page orientation conclusion can be made, the square difference method is then used to decide the page orientation of the textual square.

Once its page orientation is determined, each textual square is assigned a mode weight equal to the total number of black pixels of the portrait or landscape mode textual square. Step 1.3 is the Squares Grouping. This step determines the page orientation of a binary image using the notion of a pyramidal image data structure(12). As shown in step 1.1, the binary image was divided into squares, where the size of a square is defined in the Appendix. These squares constitute the first layer of the pyramid. The squares grouping technique then groups each set of 9 neighboring squares of the first layer together to build a second layer square. The page orientation and the mode weight of each second layer square are determined by comparing its total portrait mode black pixels and its total landscape mode black pixels; the larger of the two decides the page orientation and mode weight of the second layer square. Note that the total portrait (landscape) mode black pixels of a second layer square is the sum of the mode weights of all portrait (landscape) mode squares of the first layer in its group. This squares grouping process continues until the last 9 neighboring squares, corresponding to the last layer of the pyramid, are computed. The last 9 neighboring squares are then used to determine the page orientation of the binary image. In our implementation the pyramid has three layers.

The Skew Angle Process in our new approach takes only a portion of a binary image as its input and produces a skew angle as its output. It consists of four steps: textual area selection and rotation, component labelling, data reduction and Hough transform. These steps are described in the following paragraphs. Step 2.1 is the Textual Area Selection and Rotation. This step selects, for detecting the skew angle, one of the 9 squares of the last layer of the pyramid resulting from the page orientation algorithm. The page orientation of this chosen square must be the same as that of the binary image, and its mode weight must be the largest among those squares having the same page orientation. (In a typical example, the chosen square is located in the bottom half of the image.) Since this algorithm works well only on a portrait mode binary image, any landscape mode chosen square must be rotated into a portrait mode square before continuing to the next step.

Step 2.2 is the Component Labelling. This step segments the chosen square into objects by labelling its connected black runs.

Step 2.3 is the Data Reduction. This step creates a simplified square from the chosen square (selected in step 2.1) by preserving only the bottom pixels of candidate objects.

Step 2.4 is the Hough Transform. In this step, the skew angle of the binary image is detected by applying the Hough transform to the simplified square created in the previous step. In the subsequent sections, we give a more detailed description of our system of algorithms, the results of experiments on test data taken from medical journals, and a comparison against our earlier approach(7,8) and existing algorithms(1,2,3,4,5,6).

4. PAGE ORIENTATION PROCESS

The page orientation detection algorithm described in this paper uses local analysis. The main goal is to reduce page orientation detection errors by minimizing the effects of the non-textual data on the page orientation decision. The page orientation process consists of three steps: (1) textual and non-textual squares classification step which divides the entire binary image into squares and categorizes these squares as textual or non-textual data squares, (2) textual squares page orientation estimation step which estimates the orientations of textual squares, and (3) squares grouping step which determines the page orientation using the 9-neighboring squares grouping technique. In the following subsections, each step of the page orientation detection algorithm will be discussed in detail.

4.1 TEXTUAL AND NON-TEXTUAL SQUARES CLASSIFICATION

In this step, the entire binary image is divided into squares, where the size of a square is defined in the Appendix. Each square is then classified as a textual square or a non-textual square. Note that in order to use the squares grouping step, which is based on the pyramidal image data structure(12), the number of squares in each row must be equal to the number of squares in each column, and this number is 3^L, where L is the number of pyramid layers. Blank squares are added to the right or to the bottom of a binary image to satisfy this condition.

As defined before, the difference between a textual square and a non-textual square is that text data is dominant in the first and non-textual data is dominant in the second. Any square of a binary image that satisfies at least one of the following empirical conditions is considered to be a non-textual square.

  1. The ratio between the total black pixels and the total pixels in a square is less than a predefined parameter called BLANK_SQUARE_RATIO, or
  2. The ratio between the total black pixels and the total pixels in a top-half square, or in a bottom-half square, or in a left-half square or in a right-half square is less than BLANK_SQUARE_RATIO, or
  3. The ratio between the total black pixels and the total pixels in a square is greater than a predefined parameter called GRAPHICS_SQUARE_RATIO, or
  4. The ratio between the total black pixels and the total pixels in a top-half square, or in a bottom-half square, or in a left-half square or in a right-half square is greater than GRAPHICS_SQUARE_RATIO, or
  5. The length of any black run in a row or in a column of a square is greater than a predefined parameter called LARGE_FONT_LENGTH_PIXELS, or
  6. The sum of the black runs in a row or in a column of a square is greater than a predefined parameter called MAXIMUM_OBJECTS_SQUARE, or
  7. At least a predefined minimum number of rows or columns of a square satisfy the following condition: in the row or column, the total number of black pixels is less than or equal to twice the number of black runs (i.e. the black runs average no more than 2 pixels in length).

The BLANK_SQUARE_RATIO is the maximum ratio between the total black pixels and the total pixels in a blank square. The GRAPHICS_SQUARE_RATIO is the minimum ratio between the total black pixels and the total pixels in a graphic square. The LARGE_FONT_LENGTH_PIXELS is the maximum allowable length in pixels of a black run in a row or column of a square. The MAXIMUM_OBJECTS_SQUARE is the maximum allowable number of black runs in a row or column of a square.

It is noted that conditions (6) and (7) above are used to cope with the dithered area problem in a square image, and that the values of the predefined parameters (the row/column count threshold of condition (7), BLANK_SQUARE_RATIO, GRAPHICS_SQUARE_RATIO, MAXIMUM_OBJECTS_SQUARE and LARGE_FONT_LENGTH_PIXELS) are defined in the Appendix. Textual squares are then simply those squares of a binary image that are not non-textual squares.
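To make the classification concrete, the sketch below implements conditions (1), (3), (5) and (6) for a square given as a 0/1 NumPy array; conditions (2), (4) and (7) follow the same pattern on half-squares and run statistics. The threshold values are illustrative placeholders, not the Appendix values:

```python
import numpy as np

# Placeholder thresholds; the actual values are defined in the Appendix.
BLANK_SQUARE_RATIO = 0.02
GRAPHICS_SQUARE_RATIO = 0.5
LARGE_FONT_LENGTH_PIXELS = 40
MAXIMUM_OBJECTS_SQUARE = 30

def black_runs(line):
    """Lengths of the black (1) runs in a 1-D row or column."""
    padded = np.concatenate(([0], np.asarray(line), [0]))
    d = np.diff(padded)
    starts = np.flatnonzero(d == 1)
    ends = np.flatnonzero(d == -1)
    return ends - starts

def is_non_textual(square):
    """Check conditions (1), (3), (5) and (6) of section 4.1."""
    ratio = square.sum() / square.size
    if ratio < BLANK_SQUARE_RATIO or ratio > GRAPHICS_SQUARE_RATIO:
        return True                                 # conditions (1) and (3)
    for line in list(square) + list(square.T):      # every row and column
        runs = black_runs(line)
        if runs.size and runs.max() > LARGE_FONT_LENGTH_PIXELS:
            return True                             # condition (5)
        if runs.size > MAXIMUM_OBJECTS_SQUARE:
            return True                             # condition (6)
    return False
```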

4.2 TEXTUAL SQUARES PAGE ORIENTATION ESTIMATION

Page orientation of a textual square can be determined using either the projection profiles method or the square difference method. The projection profiles method is based on an analysis of shapes of horizontal and vertical projection histograms, while the square difference method is based on the comparison between squared sums of horizontal and vertical projection histograms.

4.2.1. Projection profiles method

Conventional typesetting leaves gaps between adjacent text lines. This gap is called line spacing or a white line, and its height depends on the text font size. Given a portrait mode square of a binary image consisting of several text lines, its horizontal histogram will have a series of black/white stripe patterns. Such a pattern is called a black-and-white text pattern in this paper.

A projection histogram is considered to have a black-and-white text pattern if all black and white stripes satisfy all of the following empirical conditions:

  1. All black stripe lengths are greater than or equal to MIN_LENGTH_NONZERO
  2. All black stripe lengths are less than or equal to MAX_LENGTH_NONZERO
  3. All white stripe lengths are greater than or equal to MIN_LENGTH_ZERO
  4. The maximum white stripe length is less than or equal to twice the minimum white stripe length.

MIN_LENGTH_NONZERO and MAX_LENGTH_NONZERO are the minimum and maximum allowable lengths in pixels of a black stripe, respectively. MIN_LENGTH_ZERO is the minimum length in pixels of a white stripe. The values of MIN_LENGTH_NONZERO, MAX_LENGTH_NONZERO and MIN_LENGTH_ZERO are defined in the Appendix.
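The following sketch tests a projection histogram against conditions (1)-(4); the thresholds are illustrative placeholders rather than the Appendix values:

```python
import numpy as np

# Placeholder thresholds; the actual values are defined in the Appendix.
MIN_LENGTH_NONZERO, MAX_LENGTH_NONZERO = 3, 40  # black stripe bounds
MIN_LENGTH_ZERO = 2                             # white stripe lower bound

def stripe_lengths(hist):
    """Split a projection histogram into alternating runs of nonzero
    bins (black stripes) and zero bins (white stripes); return the
    lengths of the black stripes and of the interior white stripes."""
    nz = (np.asarray(hist) > 0).astype(int)
    padded = np.concatenate(([0], nz, [0]))
    d = np.diff(padded)
    starts, ends = np.flatnonzero(d == 1), np.flatnonzero(d == -1)
    black = ends - starts
    white = starts[1:] - ends[:-1]   # gaps between consecutive black stripes
    return black, white

def has_text_pattern(hist):
    black, white = stripe_lengths(hist)
    if black.size == 0 or white.size == 0:
        return False
    return (black.min() >= MIN_LENGTH_NONZERO and      # condition (1)
            black.max() <= MAX_LENGTH_NONZERO and      # condition (2)
            white.min() >= MIN_LENGTH_ZERO and         # condition (3)
            white.max() <= 2 * white.min())            # condition (4)
```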

The projection profiles method determines a page orientation by analyzing the shapes of the horizontal and vertical projection histograms. Consider, for example, the horizontal and vertical projection histograms of a simple portrait mode text binary image: the horizontal projection histogram consists of black-and-white text patterns but the vertical projection histogram does not, so one can conclude that the textual square is in portrait mode. For some squares, however, both the horizontal and vertical projection histograms have black-and-white text patterns, and the page orientation cannot be determined using this method.

Using this procedure, page orientation of a textual square is determined as follows:

  • If only a horizontal projection histogram has a black-and-white text pattern then the square is in portrait mode.
  • If only a vertical projection histogram has a black-and-white text pattern then the square is in landscape mode.
  • If neither the horizontal nor the vertical projection histogram satisfies conditions (1) and (3), then the corresponding square is considered a non-textual square and it is not used to determine the page orientation.
  • Otherwise, no conclusion can be made about the page orientation of the square based on the projection profiles method, and the square difference method, discussed in the next section, must be used to decide the page orientation of the square.

4.2.2. Square difference method

The page orientation of each textual square of a binary image can be determined by comparing the squared sums of its horizontal and vertical projection histograms; the histogram having the larger squared sum determines the page orientation of the textual square. For a portrait mode text binary image, the horizontal projection histogram shows more fluctuation than the vertical projection histogram.

Therefore, the page orientation can be detected using the squared sums Sh and Sv, an upper bound α and a lower bound β, as follows:

[1] Sh >= α · Sv <=> Portrait Mode
[2] Sh < β · Sv <=> Landscape Mode
[3] Otherwise <=> Indeterminate Mode

The indeterminate mode is introduced here to specify the range of values between the lower bound β and the upper bound α for which any page orientation result is uncertain. Details about the selection of the values of α and β can be found in Le(8).

At the end of this step, each textual square is assigned its mode weight (portrait mode weight or landscape mode weight). However, if the textual square is in indeterminate mode, its mode weight is set to 0.
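A compact sketch of decision rules [1]-[3], including the mode weight assignment; the values of ALPHA and BETA are illustrative, with the actual bounds chosen as described in Le(8):

```python
ALPHA, BETA = 1.5, 0.67   # illustrative upper/lower bounds (see Le(8))

PORTRAIT, LANDSCAPE, INDETERMINATE = "portrait", "landscape", "indeterminate"

def square_difference_mode(S_h, S_v, total_black_pixels):
    """Apply decision rules [1]-[3] and assign the mode weight: the
    square's total black pixel count if the mode is determinate, else 0."""
    if S_h >= ALPHA * S_v:
        return PORTRAIT, total_black_pixels    # rule [1]
    if S_h < BETA * S_v:
        return LANDSCAPE, total_black_pixels   # rule [2]
    return INDETERMINATE, 0                    # rule [3]
```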

4.3 SQUARES GROUPING

The page orientation of a binary image is determined using the notion of a pyramidal image data structure(12). The binary image was divided into squares as mentioned in section 4.1 and these squares constitute the first layer of the pyramid. In this squares grouping step, every 9-neighboring squares of the first layer are grouped together to form a second layer square of which the page orientation and the mode weight are determined based on the mode weight of first layer squares. The second layer square page orientation is in portrait mode if the total mode weight of its first layer portrait mode squares is greater than or equal to that of its first layer landscape mode squares. Otherwise, the second layer square page orientation is in landscape mode. The mode weight of the second layer square is the larger between the total portrait mode weight and the total landscape mode weight. This squares grouping process is continued until the last 9-neighboring squares belonging to the last layer of the pyramid are computed. The last 9-neighboring squares are then used to determine the page orientation of the binary image.
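The grouping pass can be sketched as follows, assuming each first layer square has already been assigned a mode ("portrait", "landscape", or indeterminate with weight 0) and a mode weight; the function names and array representation are our own illustration, not the paper's code:

```python
import numpy as np

PORTRAIT, LANDSCAPE = "portrait", "landscape"

def group_layer(modes, weights):
    """Group each 3x3 block of 9 neighboring squares into one square of
    the next pyramid layer. modes is an n x n object array of mode
    strings and weights an n x n integer array, with n a multiple of 3;
    indeterminate squares carry weight 0 and are ignored."""
    n = modes.shape[0] // 3
    new_modes = np.empty((n, n), dtype=object)
    new_weights = np.zeros((n, n), dtype=np.int64)
    for i in range(n):
        for j in range(n):
            block_m = modes[3*i:3*i+3, 3*j:3*j+3]
            block_w = weights[3*i:3*i+3, 3*j:3*j+3]
            pw = block_w[block_m == PORTRAIT].sum()   # total portrait weight
            lw = block_w[block_m == LANDSCAPE].sum()  # total landscape weight
            new_modes[i, j] = PORTRAIT if pw >= lw else LANDSCAPE
            new_weights[i, j] = max(pw, lw)
    return new_modes, new_weights

def page_orientation(modes, weights):
    """Collapse the pyramid layer by layer until one square remains;
    its mode is the page orientation of the binary image."""
    while modes.shape[0] > 1:
        modes, weights = group_layer(modes, weights)
    return modes[0, 0]
```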

5. PAGE ORIENTATION DETECTION ALGORITHM

The page orientation detection algorithm covering the previous three sections 4.1, 4.2 and 4.3 can be summarized as follows:

  • (1) Build the first layer of the pyramid by dividing a binary image into squares, where the size of a square is defined in the Appendix.
  • (2) Calculate the horizontal and vertical projection histograms for each square.
  • (3) Calculate the total number of black pixels in each square.
  • (4) Categorize all squares into textual or non-textual squares.
  • (5) For each textual square, use the projection profiles method or the square difference method to determine its page orientation. The square difference method is used only if the page orientation cannot be decided by the projection profiles method.
    • (a) If a textual square is in either portrait or landscape mode, assign its total black pixel count to its mode weight.
    • (b) Otherwise (indeterminate mode), assign 0 to its mode weight.
  • (6) For each group of 9 neighboring squares of the current layer of the pyramid, calculate the total portrait mode black pixel count by adding the weights of all portrait mode squares, and calculate the total landscape mode black pixel count by adding the weights of all landscape mode squares. Group the 9 neighboring squares of the current layer together to form a square of the next layer of the pyramid. The page orientation and the mode weight of this new square are determined as follows:
    • (a) If the total portrait mode black pixel count is greater than or equal to the total landscape mode black pixel count, then the page orientation is portrait mode and the mode weight is the total portrait mode black pixel count.
    • (b) Otherwise, the page orientation is landscape mode and the mode weight is the total landscape mode black pixel count.
  • (7) Step (6) is repeated until the last layer of the pyramid is reached, where the page orientation of the binary image is finally determined.

6. SKEW ANGLE PROCESS

The process may be described by starting with a square textual portion of a skewed binary image. This image is simplified, as shown in Figure 14, by removing all black pixels except those belonging to the last black run of each object. The remaining pixels are almost all oriented in the same direction. (A trial experiment on several dozen skewed binary images confirmed this observation.) Consequently, the skew angle of a binary image can be detected by applying the Hough transform to this simplified binary image. A simplified version of a non-textual area, by contrast, contains scattered pixels, which makes skew angle detection difficult. Non-textual data is therefore a major source of skew angle error, and it is important to minimize its influence in the skew angle detection process, as is done here.

The document skew angle detection algorithm is based on the processing of pixels of the last black runs of objects(7,8). The algorithm segments a binary image into objects, creates a simplified binary image from the original binary image, and applies the Hough transform to the simplified binary image to determine the skew angle. In our previous approach, the entire binary image was processed to determine its skew angle. The running time of the algorithm can be reduced, however, if it processes only the text regions of an image. Fortunately, the page orientation algorithm presented earlier provides not only the page orientation but also the locations of textual data areas in the binary image. Using the textual data area information resulting from the page orientation algorithm, the skew angle of a binary image can be determined by selecting and processing only a small portion of the binary image consisting of textual data. Thus, the skew angle detection algorithm using our new approach consists of four steps: (1) textual area selection and rotation, (2) component labelling, (3) data reduction, and (4) Hough transform.

In the following subsections, each step of the new approach for the document skew angle detection algorithm is discussed in detail.

6.1 TEXTUAL AREA SELECTION AND ROTATION

The page orientation detection algorithm uses the 9 neighboring squares grouping technique to segment and group the original binary image into the final squares of the last layer of the pyramid, which are then used to decide the page orientation. (Each of the final squares is created from textual data squares of the previous layers during the squares grouping process.) Each square has also been classified into either portrait or landscape mode, and it carries a mode weight that represents the total number of black pixels of textual data squares of the previous layer. Therefore, the larger the mode weight, the more textual data the square contains.

The document skew angle detection algorithm selects and processes only a small portion of the original binary image. In this new approach, this portion is chosen to be one of the final squares of the last layer of the pyramid. The chosen square, called the square area for skew angle detection, must have the same page orientation as that of the original binary image, and its mode weight must be the largest among the squares having that page orientation.

Like our previous approach(7,8), this approach works well only on a portrait mode binary image. Therefore, if the chosen square is in landscape mode, it must be rotated into a portrait mode square before continuing to the next step.

6.2 COMPONENT LABELLING

The purpose of component labelling is to segment a binary image by labelling its objects(9). Each black run is assigned an integer number called a label, and the labels of connected black runs must be the same. An object is created from connected black runs, and its label is the same as that of its black runs. Objects are discriminated using their labels. During this process, the extreme coordinates of each labelled object are also calculated.
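A run-based labelling sketch using union-find is shown below; it is an illustration of the idea rather than the paper's implementation, and it returns, for each object, its black runs and extreme coordinates:

```python
import numpy as np

def label_objects(square):
    """Segment a 0/1 binary square into objects: label each black run,
    merging labels of runs that are 8-connected across consecutive
    rows, and track the extreme coordinates of each object."""
    parent = []                       # union-find forest over run labels

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    runs = []                         # (row, start_col, end_col, label)
    prev = []                         # runs of the previous row
    for i, row in enumerate(np.asarray(square)):
        cur, j, n = [], 0, len(row)
        while j < n:
            if row[j]:
                s = j
                while j < n and row[j]:
                    j += 1
                e = j - 1
                lbl = len(parent)
                parent.append(lbl)
                for ps, pe, plbl in prev:
                    # 8-connected iff the column spans touch or overlap,
                    # allowing a one-pixel diagonal offset
                    if ps <= e + 1 and pe >= s - 1:
                        union(plbl, lbl)
                cur.append((s, e, lbl))
                runs.append((i, s, e, lbl))
            else:
                j += 1
        prev = cur

    objects = {}                      # root label -> object record
    for i, s, e, lbl in runs:
        o = objects.setdefault(find(lbl), {
            "min_top_row": i, "max_bottom_row": i,
            "min_left_column": s, "max_right_column": e, "runs": []})
        o["min_top_row"] = min(o["min_top_row"], i)
        o["max_bottom_row"] = max(o["max_bottom_row"], i)
        o["min_left_column"] = min(o["min_left_column"], s)
        o["max_right_column"] = max(o["max_right_column"], e)
        o["runs"].append((i, s, e))
    return objects
```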

6.3 DATA REDUCTION

The purpose of this step is: (1) to extract features corresponding to the bottom pixels of objects in the chosen square, (2) to speed up the Hough transform, and (3) to minimize the skew angle detection error due to non-textual data. As mentioned previously, the bottom pixels of objects can be used as feature points to determine the document skew angle. Using only the bottom pixels of objects also improves the speed of the Hough transform by reducing the amount of data to be processed. In order to reduce the contribution of non-textual data and increase the detection accuracy rate, only the bottom pixels of candidate objects participate in the skew angle detection algorithm. After the chosen square is segmented into objects by the component labelling step, the task of the data reduction step is to create a simplified square image from the chosen square by preserving only the bottom pixels of candidate objects.
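Combining the label_objects sketch from section 6.2 with the candidate object predicate of section 2.9, the data reduction step can be sketched as follows (again illustrative, not the paper's code):

```python
import numpy as np

def simplify_square(square, objects, is_candidate_object):
    """Keep only the bottom pixels (the black runs in the bottom row)
    of candidate objects; everything else becomes white.

    `objects` is the output of label_objects (section 6.2) and
    `is_candidate_object` is the predicate of section 2.9."""
    simp = np.zeros_like(square)
    for o in objects.values():
        if not is_candidate_object(o["min_left_column"], o["max_right_column"],
                                   o["min_top_row"], o["max_bottom_row"]):
            continue
        for i, s, e in o["runs"]:
            if i == o["max_bottom_row"]:        # last black runs of the object
                simp[i, s:e + 1] = 1
    return simp
```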

6.4 HOUGH TRANSFORM

The Hough transform(11) can be used to detect, among other things, straight lines in digital images. In polar coordinates, a straight line can be described by the equation:

ρ = x · cosθ + y · sinθ    [4]

The above equation describes a mapping of a point in the Cartesian x-y plane to a sinusoidal curve in the polar ρ-θ plane. Consider M points lying on a line ρ0 = x · cosθ0 + y · sinθ0. The Hough transform maps these M points into M sinusoidal curves that cross at the point (ρ0, θ0) in the ρ-θ plane. Conversely, the intersection point (ρ0, θ0) of the M sinusoidal curves denotes the line in the x-y plane, corresponding to (ρ0, θ0), that passes through these M points.

The ρ-θ plane is quantized into accumulator cells, where each cell A(ρi, θi) keeps track of the number of intersections of sinusoidal curves for that cell. When the mapping is completed for all the data points in the image, the accumulated value in A(ρi, θi) represents the number of points lying on the corresponding line in the x-y plane.

The line detection in a binary image using the Hough transform algorithm can be summarized as follows:

  • (1) Define the Hough transform parameters ρmin, ρmax, θmin and θmax.
  • (2) Quantize the ρ-θ plane into cells by forming an accumulator cell array A(ρi, θi), where ρi is between ρmin and ρmax and θi is between θmin and θmax.
  • (3) Initialize each element of the accumulator cell array A to zero.
  • (4) For each black pixel in the binary image, perform the following:
    • (a) For each value of θi from θmin to θmax, calculate the corresponding ρi using the following equation:
      ρi = x · cosθi + y · sinθi
    • (b) Round off the ρi value to the nearest interval value.
    • (c) Increment the accumulator array element A(ρi, θi).
  • (5) Detect best line candidates as local maxima in the accumulator cell array.

The computational performance of the Hough transform algorithm can be determined by analyzing equation [4]. If Nb is the number of black pixels in a binary image and Kθ is the number of increments of θi from θmin to θmax, the number of computations in the Hough transform is 3NbKθ. To increase the speed of the Hough transform, the number of data points to be processed in a binary image should be reduced. The data reduction procedure proposed to achieve this goal was discussed in the previous section.

This step applies the Hough transform to the simplified square image and then analyzes the local maxima of the Hough accumulator cell array to determine the square skew angle by (1) collecting the first and second maxima of the Hough accumulator cell array elements, (2) collecting all Hough accumulator cell array elements whose values are greater than one-half of the second maximum, (3) adding these values together by angle, and finally (4) taking the square skew angle to be the angle corresponding to the maximum of these sums. The square skew angle is then taken as the skew angle of the entire binary image.
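The sketch below combines the accumulation procedure above with the peak analysis just described, over the [-15, +15] degree range in 0.5-degree steps used in the experiments of section 8. It is a minimal illustration; in particular, the treatment of tied maxima is simplified:

```python
import numpy as np

def hough_skew_angle(simp, theta_min=-15.0, theta_max=15.0, step=0.5):
    """Detect the skew angle (in degrees) of a simplified 0/1 square.

    Accumulates the Hough transform over the assumed skew range, then
    applies the peak analysis (a)-(d): threshold the accumulator at
    one-half of its second maximum, sum the surviving counts per angle,
    and return the angle with the largest sum."""
    thetas = np.deg2rad(np.arange(theta_min, theta_max + step, step))
    ys, xs = np.nonzero(simp)                 # black pixel coordinates
    diag = int(np.ceil(np.hypot(*simp.shape)))
    acc = np.zeros((2 * diag + 1, thetas.size), dtype=np.int64)
    cols = np.arange(thetas.size)
    for x, y in zip(xs, ys):
        # rho = x*cos(theta) + y*sin(theta), rounded to the nearest cell
        rhos = np.rint(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[rhos + diag, cols] += 1
    second_max = np.sort(acc, axis=None)[-2]                     # step (a)
    votes = np.where(acc > second_max / 2, acc, 0).sum(axis=0)   # steps (b), (c)
    return float(np.rad2deg(thetas[int(np.argmax(votes))]))      # step (d)
```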

7. DOCUMENT SKEW ANGLE DETECTION ALGORITHM

Let sq(i,j) and simpsq(i,j) represent the square area for skew angle detection and its simplified version, respectively. The document skew angle detection algorithm using the new approach, covering the previous four sections 6.1, 6.2, 6.3 and 6.4, can be summarized as follows:

  • (1) Get the page orientation and the square area for skew angle detection sq(i,j) of a binary image f(i,j) from the page orientation detection algorithm.
  • (2) Adjust sq(i,j) into portrait mode, if necessary.
  • (3) For each row of the square area for skew angle detection sq(i,j), generate and label its black runs.
  • (4) Build objects based on black runs of the same labels and update extreme coordinates of objects.
  • (5) Create a simplified square simpsq(i,j) by preserving the last black runs of each candidate object of sq(i,j).
  • (6) Apply the Hough transform on the simplified square simpsq(i,j).
  • (7) Analyze the local maxima of the Hough accumulator cell array to detect the skew angle of the simplified square simpsq(i,j) as follows:
    • (a) Collect the first and second maxima of Hough accumulator cell array elements.
    • (b) Collect all Hough accumulator cell array elements whose values are greater than one-half of the second maximum.
    • (c) Add these values together based on their angle.
    • (d) The skew angle is the angle corresponding to the maximum of these values.
  • (8) The skew angle of the binary image f(i,j) is the same as that of the simplified square simpsq(i,j).

In a typical example, the square area for skew angle detection sq(i,j) is located in the bottom half of the image. This square sq(i,j) is selected from the last 9 big squares of the binary image by the page orientation detection algorithm; it has the same orientation as the original binary image and its mode weight is the largest. The Hough transform is applied to the simplified version simpsq(i,j) of the square sq(i,j), and the detected angle is the skew angle of the binary image.

8. EXPERIMENTAL RESULTS

All binary images used in these experiments are 8.5 x 11 inches in size and were scanned at 200 dpi resolution. The average times to process the page orientation and the document skew angle of a scanned binary image using the new approach on a DELL 486D/50 personal computer are about 4.1 seconds and 3.8 seconds, respectively.

For the page orientation detection algorithm, two sets of test samples were used to conduct the experiment: (1) for the first test set, which consists of 6,087 pages from 63 different medical journals, the algorithm correctly detected the page orientation of 6,083 pages, giving an accuracy rate of 99.9%, and (2) for the second test set, which consists of 5,190 pages from 52 different medical journals, the algorithm correctly detected the page orientation of 5,186 pages, again giving an accuracy rate of 99.9%. Evaluation results show that the page orientation algorithm compares favorably with existing algorithms(1,2) in terms of accuracy because it can account for non-textual data.

For the document skew angle detection algorithm, two sets of test images, which are known to contain many skewed images, were selected for the experiment: (1) the first set consists of 12 selected binary images which represent both non-textual and textual data pages, and (2) the second set consists of 238 pages from 3 different medical journals. It is noted that the skew angle of each binary image was also measured manually and tabulated to test the accuracy of the algorithm. Because this is a time-consuming process, the size of these two sets of test images was limited to 250 pages in total. It is assumed that the document skew angle is limited to [-15, +15] degrees, and the angle θi is incremented by 0.5 degree. The summarized results for test sample set 1 are tabulated in Table 1. Table 1 shows that the document skew angle of a binary image can be detected with an accuracy of about 0.50 degrees. Table 1 also shows that under the new approach, while keeping the same detection accuracy rate, the skew angle detection speed is improved by a factor of about four, and the skew angles detected are almost the same as those of our previous approach(7,8). Our skew angle detection algorithm also compares favorably with existing algorithms(2,3,4,5,6) regarding its detection accuracy rate as well as its insensitivity to non-textual data. Further details of the page orientation and document skew angle results are given in reference(8).

9. SUMMARY

We presented algorithms for detecting the page orientation and skew angle of binary document images as they apply to automatic document processing in general, and optical scanning in particular. The page orientation detection process consists of (1) dividing and classifying the entire binary image into textual and non-textual squares, (2) estimating the page orientation of each textual square, and (3) grouping 9 neighboring squares together to determine the page orientation of the binary image. The skew angle detection process consists of (1) selecting and rotating, if necessary, the square area for skew angle detection based on the information from the page orientation process, (2) labelling objects in the chosen square, (3) creating a simplified square from the chosen square, and (4) applying the Hough transform to the simplified square to detect the skew angle.

The non-textual data in a binary image constitutes the main source of page orientation classification and document skew angle errors. By dividing a binary image into squares in the page orientation case and using the bottom pixels of candidate objects in the skew angle case, our system of algorithms is able to reduce the inclusion of non-textual data and thus increase the detection accuracy rate. Evaluation results show that our system performs well on a variety of medical journal pages independent of text dominance. The Hough transform used to detect the skew angle is sped up using (1) a data reduction procedure, which simplifies a chosen square by preserving only the bottom pixels of candidate objects, and (2) page orientation information, which allows the skew angle to be detected on a chosen square area instead of the entire binary image.

10. REFERENCES

  1. T. Akiyama and N. Hagita, "Automated Entry System for Printed Documents," Pattern Recognition, vol. 23, no. 11, pp. 1141-1154 (1990).
  2. S. C. Hinds, J. L. Fisher and D. P. D'Amato, "A Document Skew Detection Method Using Run-Length Encoding and the Hough Transform," 10th International Conference on Pattern Recognition, vol. 1, pp. 464-468 (1990).
  3. L. A. Fletcher and R. Kasturi, "A Robust Algorithm for Text String Separation from Mixed Text/Graphics Images," IEEE Trans. on PAMI, vol. 10, pp. 910-918 (1988).
  4. A. Rastogi and S. N. Srihari, "Recognizing Textual Blocks in Document Images Using the Hough Transform," TR 86-01, Dept. of Computer Science, SUNY Buffalo, NY (1986).
  5. H. S. Baird, "The Skew Angle of Printed Documents," Proc. of Society of Photographic Scientists and Engineers, vol. 40, pp. 21-24 (1987).
  6. Sequoia Data Corporation, ScanFix Image Optimizer, Version 2.10 for MS/DOS.
  7. D. X. Le and G. R. Thoma, "Document Skew Angle Detection Algorithm," to appear in SPIE's 1993 Symposium on Aerospace and Remote Sensing.
  8. D. X. Le, "Automated Document Skew Angle Detection Using Projection Profiles, Variances, Component Labelling and the Hough Transform," M.S. thesis, Computer Science Department, George Mason University, November 1992.
  9. A. K. Jain, Fundamentals of Digital Image Processing, Prentice Hall, Englewood Cliffs, New Jersey, pp. 375-376 and 409-410 (1989).
  10. H. Robbins and J. V. Ryzin, Introduction to Statistics, Science Research Associates, pp. 66-67 and 238-269 (1975).
  11. R. C. Gonzalez and P. Wintz, Digital Image Processing, 2nd ed., Addison-Wesley, Reading, Massachusetts, pp. 30-31 and 130-134 (1987).
  12. D. H. Ballard and C. M. Brown, Computer Vision, Prentice Hall, Englewood Cliffs, New Jersey, pp. 106-107 (1982).

11. APPENDIX

In this section, all predefined parameters are defined and, whenever possible, formulas are derived. Let dpi (dots per inch) represent the scanning resolution. Details of the parameter values and formulas are given in Le(8).

12. ABOUT THE AUTHORS

Daniel X. Le received the B.S. degree summa cum laude in electrical and computer engineering from California State Polytechnic University, Pomona, in June 1986 and the M.S. degree in computer science from George Mason University, Fairfax, Virginia, in January 1993. From June 1986 to April 1989, he was a software engineer at the Jet Propulsion Laboratory, Pasadena, California. From April 1989 to September 1990, he was a system engineer at Science Applications International Corporation, McLean, Virginia. Since September 1990, he has been an electronics engineer at the Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, Maryland. His current work is concerned with document analysis, image quality and image processing.

George R. Thoma received the B.S. from Swarthmore College, and the M.S. and Ph.D. from the University of Pennsylvania, all in electrical engineering. He is Chief of the Communications Engineering Branch of the Lister Hill National Center for Biomedical Communications, the research and development arm of the National Library of Medicine. He has developed and evaluated systems involving image processing, document image storage on digital optical disks, high-speed image transmission, analog videodiscs and satellite communications. He has published and lectured extensively in these areas. Dr. Thoma's previous experience at the General Electric Company, AII Systems and the University of Pennsylvania had been in the application of satellites in voice and data communications, video distribution, navigation and surveillance.

Harry Wechsler received the Ph.D. in Computer Science from the University of California, Irvine, in 1975, and he is presently Professor of Computer Science at George Mason University. His research in the areas of Computer Vision and Neural Networks (NN) includes multistrategy learning (statistics, adaptive signal processing, information theory, genetic algorithms, and neural networks), scale-space and clustering for early and intermediate vision using conjoint (Gabor and wavelet) image representations, attention, functional and selective perception, face recognition, and object recognition. He was Director for the NATO Advanced Study Institutes on "Active Perception and Robot Vision" (Maratea, Italy, 1989) and on "From Statistics to Neural Networks" (Les Arcs, France, 1993) and he will serve as the co-Chair for the International Conference on Pattern Recognition to be held in Vienna, Austria, in 1996. He authored over 100 scientific papers, his book "Computational Vision" was published by Academic Press in 1990, and he is the editor of the book, "Neural Networks for Perception" (Vol 1 & 2), published by Academic Press in 1991. He was elected as an IEEE Fellow in 1992.

ScanFix is a trademark of Sequoia Data Corporation.