Investigator Name Recognition from Medical Journal Articles: A Comparative Study of SVM and Structural SVM

Xiaoli Zhang, Jie Zou, Daniel X. Le, George R. Thoma

National Library of Medicine, Lister Hill National Center for Biomedical Communications, 8600 Rockville Pike, Bethesda, MD 20894

{zhangxiaol, jzou, daniel, gthoma}@mail.nih.gov

ABSTRACT

Automated extraction of bibliographic information from journal articles is key to the affordable creation and maintenance of citation databases, such as MEDLINE®. A newly required bibliographic field in this database is "Investigator Names": names of people who have contributed to the research addressed in the article, but who are not listed as authors. Since the number of such names is often large, several score or more, their manual entry is prohibitive. The automated extraction of these names is a problem in Named Entity Recognition (NER), but differs from typical NER due to the absence of normal English grammar in the text containing the names. In addition, since MEDLINE conventions require names to be expressed in a particular format, it is necessary to identify both first and last names of each investigator, an additional challenge. We seek to automate this task through two machine learning approaches: Support Vector Machine and structural SVM, both of which show good performance at the word and chunk levels. In contrast to traditional SVM, structural SVM attempts to learn a sequence by using contextual label features in addition to observational features. It outperforms SVM at the initial learning stage without using contextual observation features. However, with the addition of these contextual features from neighboring tokens, SVM performance improves to match or slightly exceed that of the structural SVM.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Retrieval models; I.2.7 [Artificial Intelligence]: Natural Language Processing - Text analysis; I.7.5 [Document and Text Processing]: Document Capture - Document Analysis.

General Terms

Algorithms, Design, Experimentation, Performance.

Keywords

Investigator Name, Named Entity Recognition, Support Vector Machine (SVM), Structural SVM, Document analysis, MEDLINE

1. INTRODUCTION

MEDLINE®, the flagship database of the U.S. National Library of Medicine, contains over 17 million citations to the medical journal literature and is a critical source of information for biomedical research and clinical medicine. With the rapid annual increase in journal literature indexed by MEDLINE, automatic methods are essential for extracting bibliographic data such as article titles, author names, affiliations, and abstracts.

Beginning with journals published in 2008, the personal names of contributors who are not listed as authors but are members of corporate organizations are required to be included in a new "Investigator Names" field in MEDLINE citations. Adding these investigator names to MEDLINE makes it possible to retrieve information on the collaborative research a person has taken part in. Investigator names are usually listed in one or several paragraphs of the articles containing them. These investigator name paragraphs can appear at the beginning of the article, immediately below the author section, or at the end of the article, in an appendix or footnote. It is common for an investigator name paragraph to contain over a hundred names, and sometimes well over a thousand. Manual extraction of these names is time-consuming, costly, tedious, and error-prone.

Automatic investigator name recognition is a two-step process: (1) locate the investigator name paragraphs; and (2) parse those paragraphs to extract the investigator names. We assume that the investigator name paragraphs have already been identified, either by a preceding automated method or by a human operator, and in this paper we address the second step: parsing the paragraphs to recognize the names.

Figure 1 shows three examples of investigator name paragraphs. The investigator names are usually mixed with institute names, addresses, degrees, and many other entities, and these are usually not arranged into sentences that follow English grammar. In most cases, the entities freely co-occur with only separators, e.g., commas, parentheses, or even plain spaces, between them. For most investigator names the first name precedes the last name, but the order may be reversed, as in the example shown in Figure 1(c). A first name can be a complete word or just initials. MEDLINE conventions oblige us to identify not only the names, but also their particles; in other words, the first and last names of each investigator must be identified separately. For some long names, such as Vicente Rodrìguez Pappalard or Francisco J. García De La Corte shown in Figure 1(a), this is not a trivial task.

Figure 1: Three examples of investigator name paragraphs, (a)-(c).

Extracting investigator names is a named entity recognition (NER) problem, but the variations and special requirements discussed above pose new challenges. Existing NER algorithms usually expect sentences that follow natural language grammar, and they do not identify name particles (first and last names); therefore they cannot be applied directly to our recognition problem. We designed and compared two algorithms based on state-of-the-art machine learning tools, SVM and structural SVM. Both approaches achieve good recognition accuracy, and comparing them also reveals some interesting issues. The rest of the paper is organized as follows. In Section 2, we review related work in named entity recognition and briefly describe SVM and structural SVM. In Section 3, we describe our method, including preprocessing, feature extraction, SVM and structural SVM classification, and post-processing. Both methods are evaluated and compared in Section 4. Finally, Section 5 provides a summary.

2. RELATED WORK

Investigator name recognition falls in the general category of named entity recognition (NER), which typically involves the identification and classification of entities such as person names, organizations, and locations in text.

Table 1. 62 features extracted from each token for investigator name recognition


Structural SVM [22] generalizes SVM to structured outputs and has been applied to various areas, such as classification with taxonomies, named entity recognition, sequence alignment, and natural language context-free grammar parsing. We implemented a structural SVM algorithm for our investigator name recognition problem and compared it to our parsing algorithm based on traditional SVM.

3. METHODS

In our task, each entity, which we call a token in the subsequent discussion, is a single word in the investigator name paragraph. As shown in Figure 1, the words in an investigator name paragraph are separated by spaces and punctuation marks. Before investigator name recognition, a preprocessing step segments the paragraph into tokens based on these spaces and punctuation marks.
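The paper does not publish its segmentation code, but this preprocessing step can be sketched as follows; the separator set is an assumption based on the punctuation marks discussed in Section 3.1.

    import re

    def tokenize(paragraph):
        # Split an investigator name paragraph into word tokens, recording
        # each token's character offsets so the punctuation before and after
        # it can be inspected later. The separator set (whitespace, commas,
        # semicolons, colons, parentheses, brackets) is an assumption;
        # periods and hyphens stay inside tokens so that initials like "J."
        # and hyphenated names survive intact.
        return [{"text": m.group(), "start": m.start(), "end": m.end()}
                for m in re.finditer(r"[^\s,;:()\[\]]+", paragraph)]

    print([t["text"] for t in tokenize("J. Smith, A. Doe; Mayo Clinic (Rochester)")])
    # -> ['J.', 'Smith', 'A.', 'Doe', 'Mayo', 'Clinic', 'Rochester']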

3.1 Feature Extraction

Five types of features - dictionary features, text features, punctuation features, special word features, and contextual features - are used in investigator name recognition. All of our features are binary; they are described in Table 1.

Dictionary features
Dictionary features are collected by looking up a First Name List, Last Name List, Affiliation Key Word List, Country Name List, US/Canada State List, and Degree List. We built these lists from MEDLINE data covering about 8 million medical articles. If a candidate word is found in one of these lists, the corresponding dictionary feature is set to 1.
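As an illustration, a lookup of this kind might be implemented as below; the list contents here are placeholders, not the actual MEDLINE-derived lists.

    # Placeholder name lists; the actual lists were built from roughly
    # 8 million MEDLINE articles and are far larger.
    FIRST_NAMES = {"john", "maria", "francisco"}
    LAST_NAMES = {"smith", "garcia", "zhang"}
    DEGREES = {"md", "phd", "rn"}

    def dictionary_features(token):
        # One binary feature per list: 1 if the normalized token is found.
        # Affiliation keywords, country names, and US/Canada states are
        # looked up the same way.
        w = token.lower().rstrip(".")
        return {"in_first_name_list": int(w in FIRST_NAMES),
                "in_last_name_list": int(w in LAST_NAMES),
                "in_degree_list": int(w in DEGREES)}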

Text features
The text features examine character case and special characters in a word. A word in all upper case can be an abbreviation of a degree, a state, or a special word. A name-initial pattern appears as a capital letter (A-Z), usually followed by a period. Words containing digits can be excluded as names. These text features provide important information for distinguishing named entities of different types.
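A few of these patterns are sketched below; the full feature set in Table 1 is richer than this illustration.

    import re

    def text_features(token):
        # A handful of the binary text features described above.
        return {"all_upper": int(token.isupper()),                          # e.g. "MD", "USA"
                "initial_cap": int(token[:1].isupper()),
                "name_initial": int(bool(re.fullmatch(r"[A-Z]\.", token))), # e.g. "J."
                "has_digit": int(any(c.isdigit() for c in token))}          # rules out names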

Punctuation features
Due to the regularity with which groups of named entities appear in an investigator name paragraph, punctuation marks such as spaces, commas, periods, hyphens, dashes, semi-colons, and brackets before and after a word are important features: they can signify that adjacent words belong to the same group of named entities or share the same entity type. For example, a semi-colon or a comma before or after a word often indicates the start of a new group of named entities. Hyphens generally connect words of the same entity type. A name or an affiliation very likely consists of words separated by spaces. For each punctuation mark listed under "Punctuation features" in Table 1, we add two binary features: whether the character before the word is that punctuation mark, and whether the character after the word is.
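A minimal sketch of this before/after test follows, using the token offsets from the tokenizer above. Skipping intervening spaces is our assumption; the paper says only "the character before/after a word".

    def punctuation_features(paragraph, start, end):
        # Two binary features per punctuation mark: is it the nearest
        # non-space character before the token, and after it?
        before = paragraph[:start].rstrip()[-1:]
        after = paragraph[end:].lstrip()[:1]
        feats = {}
        for name, ch in [("comma", ","), ("period", "."), ("semicolon", ";"),
                         ("hyphen", "-"), ("open_paren", "("), ("close_paren", ")")]:
            feats[name + "_before"] = int(before == ch)
            feats[name + "_after"] = int(after == ch)
        return feats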

Special word features
Special words can eliminate class confusion. For example, university, institute, and hospital are often associated with affiliations; investigator, coordinator, or manager indicates a person's position and is usually followed by a name.

Contextual features
Features from neighboring tokens can be very informative. For example, in Figure 1(c), the word "Hospital" in "A. Gemelli Hospital" clearly indicates that "A. Gemelli" is not an investigator name. Therefore, to exploit the contextual dependencies between tokens, contextual features from neighboring words are also extracted for each token, as sketched below.
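The following sketch builds the contextual extension used in Section 4.1, assuming feature vectors are plain Python lists; zero-padding the sequence ends is our choice, as the paper does not specify end-of-sequence handling.

    def add_context(features, k):
        # Extend each token's binary feature vector with the observation
        # features of its k nearest neighbors on each side (the "kth order
        # contextual observation features" of Section 4.1). Dimensionality
        # grows from d to d * (2k + 1), e.g. 62 -> 186 (k=1) -> 310 (k=2).
        d = len(features[0])
        padded = [[0] * d] * k + features + [[0] * d] * k   # zero-pad the ends
        out = []
        for i in range(len(features)):
            vec = list(features[i])
            for offset in range(1, k + 1):
                vec += padded[k + i - offset]   # left neighbor at distance offset
                vec += padded[k + i + offset]   # right neighbor at distance offset
            out.append(vec)
        return out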

3.2 SVM and Structural SVM Classification

An SVM is a supervised learning method with separate training and test stages: a model is learned from a training set and then used to predict the labels of unseen test data. Our investigator name recognition task is a three-class (First Name, Last Name, and Other) classification problem. A total of 62 observation features, including the dictionary, text, punctuation, and special word features, represent each word token. In addition, because of the contextual dependencies among tokens, the observation features of neighboring tokens are also used in SVM classification.
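Since scikit-learn's SVC wraps the same LIBSVM library used in Section 4.1, the classifier setup can be sketched as follows; the feature matrix is a synthetic stand-in and the C and gamma values are placeholders, not the tuned parameters.

    import numpy as np
    from sklearn.svm import SVC

    # Toy stand-in for the real training data: binary 62-dim token vectors
    # and three labels (0 = First Name, 1 = Last Name, 2 = Other).
    rng = np.random.default_rng(0)
    X_train = rng.integers(0, 2, size=(200, 62)).astype(float)
    y_train = rng.integers(0, 3, size=200)

    clf = SVC(kernel="rbf", C=1.0, gamma=0.1)   # placeholder parameter values
    clf.fit(X_train, y_train)
    print(clf.predict(X_train[:5]))             # predicted token labels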

An essential step in designing a structural SVM is to define its joint feature representation function Ψ(x, y). Our investigator name recognition is a sequence labeling problem; therefore, Ψ(x, y) includes two kinds of features: state transition features and observation features extracted from individual tokens. State transition features model only adjacent label dependencies; we use first-order transitions, i.e., only the dependencies between adjacent token labels are modeled. For observation features, we use the same 62 features defined in Section 3.1. We also experimented with adding contextual observation features from neighboring tokens; details are given in Section 4. A structural SVM applied to sequence labeling is sometimes called SVMHMM, presumably because it uses the same types of feature representations as a Hidden Markov Model. We used the SVMHMM library, available at [26], to implement our structural SVM algorithm for investigator name recognition.
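For concreteness, one standard way to write this joint feature map for first-order sequence labeling (a reconstruction consistent with the description above, not an equation quoted from the paper) is:

    \Psi(\mathbf{x}, \mathbf{y}) =
        \sum_{t=1}^{T} \phi(x_t) \otimes \Lambda(y_t)
        + \sum_{t=2}^{T} \Lambda(y_{t-1}) \otimes \Lambda(y_t)

where φ(x_t) is the 62-dimensional observation vector of token t, Λ(y) is a one-hot encoding of label y, and ⊗ denotes the tensor product. Training learns a weight vector w, and prediction maximizes w · Ψ(x, y) over label sequences y, which for this first-order structure can be done with Viterbi decoding.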

3.3 Postprocessing

After SVM or structural SVM classification, every token is assigned a label. However, a post-processing step is still required to analyze the labeled paragraph and derive individual complete names by grouping the corresponding name particles. We take a heuristic approach based on the following rules: (1) a run of consecutive First Name tokens adjacent to a run of consecutive Last Name tokens is grouped into one complete name; (2) First Name tokens that are not adjacent to any Last Name token (isolated First Name labels) are re-assigned the label Other.

Following these rules, an algorithm can be implemented to remove isolated single first name labels and organize the remaining first name and last name tokens into complete names.
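A minimal sketch of such an algorithm follows, assuming labels "F" (First Name), "L" (Last Name), and "O" (Other) and first-before-last name order; reversed-order names, as in Figure 1(c), would need a symmetric rule. This is a reconstruction, not the authors' published code.

    def extract_names(tokens, labels):
        # Rule 1: a run of First Name tokens followed by a run of Last Name
        # tokens forms one complete name. Rule 2: isolated First Name runs
        # (no adjacent Last Name token) are re-labeled Other.
        names, i = [], 0
        while i < len(labels):
            if labels[i] == "F":
                j = i
                while j < len(labels) and labels[j] == "F":
                    j += 1
                if j < len(labels) and labels[j] == "L":
                    k = j
                    while k < len(labels) and labels[k] == "L":
                        k += 1
                    names.append((" ".join(tokens[i:j]), " ".join(tokens[j:k])))
                    i = k
                else:
                    labels[i:j] = ["O"] * (j - i)   # rule 2: drop isolated First Names
                    i = j
            else:
                i += 1
        return names

    toks = ["Francisco", "J.", "García", "De", "La", "Corte", "Madrid"]
    labs = ["F", "F", "F", "L", "L", "L", "O"]
    print(extract_names(toks, labs))
    # -> [('Francisco J. García', 'De La Corte')]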

4. EVALUATION

By searching MEDLINE citations from 2008 onward, we found 370 articles that have investigator name paragraphs. After obtaining the full text of these articles, we manually identified the investigator name paragraphs and saved them as plain text files. The ground-truth labeling of these paragraphs was then created semi-manually. We randomly selected 100 of the 370 articles for training and reserved the remaining 270 articles for testing. Statistics of this data collection are listed in Table 2. We evaluate algorithm performance at two levels. One is the token level, i.e., the labeling accuracy of individual tokens. The other is the name chunk level, i.e., the precision and recall of retrieving individual full names. At the name chunk level, a name is considered correctly retrieved only when both its first name and last name tokens are correctly labeled. For example, for the name "Francisco J. García De La Corte" shown in Figure 1(a), we accept the labeling as correct only when all three tokens "Francisco J. García" are labeled as the first name chunk and all three tokens "De La Corte" are labeled as the last name chunk. Even a single mislabeled token produces both a false negative (the true name is missed) and, typically, a false positive (an incorrect name is extracted). Chunk-level evaluation is therefore much more rigorous than token-level evaluation.
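This chunk-level criterion can be made concrete as follows; names are compared as (first chunk, last chunk, position) triples, so a single mislabeled token breaks the match. This is a sketch of the criterion described above, not the authors' evaluation script.

    def chunk_scores(gold, pred):
        # A predicted name is correct only if its first-name chunk and
        # last-name chunk both match the ground truth exactly.
        gold_set, pred_set = set(gold), set(pred)
        tp = len(gold_set & pred_set)
        precision = tp / len(pred_set) if pred_set else 0.0
        recall = tp / len(gold_set) if gold_set else 0.0
        f = (2 * precision * recall / (precision + recall)) if tp else 0.0
        return precision, recall, f

    gold = [("Francisco J. García", "De La Corte", 0)]
    pred = [("Francisco J.", "García De La Corte", 0)]   # mis-chunked
    print(chunk_scores(gold, pred))   # -> (0.0, 0.0, 0.0): one FP and one FN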

Table 2. Dataset statistics


4.1 Evaluation of SVM Method

We use LibSVM [6], an SVM library developed at National Taiwan University, for our token classification. The Radial Basis Function (RBF) is adopted as the kernel. Its two parameters, C (the penalty parameter for errors) and γ (the RBF width), are optimized by an exhaustive grid search with cross-validation on the training samples.
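A grid search of this kind might look as follows with scikit-learn's LIBSVM wrapper; the logarithmic grid is a common default, not the grid used in the paper, and the data is a synthetic stand-in.

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)                          # toy stand-in data
    X_train = rng.integers(0, 2, size=(200, 62)).astype(float)
    y_train = rng.integers(0, 3, size=200)

    # Exhaustive search over (C, gamma) with 5-fold cross-validation.
    param_grid = {"C": [2.0 ** k for k in range(-5, 16, 2)],
                  "gamma": [2.0 ** k for k in range(-15, 4, 2)]}
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X_train, y_train)
    print(search.best_params_)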

To observe the effect of neighboring tokens, the 62 basic observation features are combined with contextual observation features of different orders in our SVM token classification. "kth order contextual observation features" means the observation features from the k nearest tokens on each side. Each additional order adds one token on each side, each contributing 62 observation features, so the feature dimensionality grows by 124 (2 x 62) per order. Considering the computational complexity, we report results only up to the second order. Tables 3 and 4 show the SVM evaluation at the token and name chunk levels with the 62 initial observation features, with 186 (62 + 2 x 62) features after adding the first order contextual observation features, and with 310 (62 + 4 x 62) features after adding the second order contextual observation features. Note that accuracy increases significantly as observation features from neighboring tokens are added; this contextual information is very helpful to the SVM model because of the dependencies among a sequence of tokens.

4.2 Evaluation of Structural SVM Method

We used SVMHMM, an implementation of structural SVMs for sequence labeling by Thorsten Joachims [26], to conduct our experiments. In the structural SVM method, the same 62 observation features are extracted from each individual token. We used a linear kernel, because nonlinear kernels, e.g., RBF, are extremely computation-intensive for structural SVMs. Meta-parameters were determined by cross-validation on the training samples.
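For reference, SVMHMM reads SVM-light style input files in which a qid groups the tokens of one sequence. The sketch below writes such a file; the label encoding and feature values are illustrative assumptions, not the paper's actual data.

    def write_svmhmm(path, sequences):
        # Each line is "<label> qid:<sequence id> <feature>:<value> ...",
        # one token per line; tokens sharing a qid form one sequence (here,
        # one investigator name paragraph). Labels are positive integers,
        # e.g. 1 = First Name, 2 = Last Name, 3 = Other.
        with open(path, "w") as f:
            for qid, (labels, vectors) in enumerate(sequences, start=1):
                for label, vec in zip(labels, vectors):
                    feats = " ".join(f"{i + 1}:{v}" for i, v in enumerate(vec) if v)
                    f.write(f"{label} qid:{qid} {feats}\n")

    # One 3-token sequence with 3-dim binary features; training then runs,
    # e.g.:  svm_hmm_learn -c 5 train.dat model.dat
    write_svmhmm("train.dat", [([1, 2, 3], [[1, 0, 1], [0, 1, 0], [0, 0, 1]])])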

Tables 5 and 6 show the evaluation at the token and name chunk levels. Even though the structural SVM algorithm already exploits contextual label information through its state transition features, it is still of interest whether observation features extracted from neighboring tokens would further increase accuracy. Therefore, besides using the 62 observation features alone, we also ran the structural SVM with first and second order contextual observation features added.

Table 3. Accuracy of token classification using SVM method


Table 4. Precision and recall of full name extraction using SVM method


Table 5. Accuracy of token classification using structural SVM method


Table 6. Precision and recall of full name extraction using structural SVM method



Figure 3: Performance values from Tables 3-6 plotted against the order k of the contextual observation features. Left: token classification accuracy before post-processing; Middle: token classification accuracy after post-processing; Right: name retrieval F-Measure.

4.3 Discussion

We summarize the following observations from our experiments. First, information from neighboring tokens is very helpful and must be utilized. There are two kinds of contextual features: the labels assigned to neighboring tokens and the observation features extracted from neighboring tokens. In the following discussion we call the first contextual label features and the second contextual observation features.

In our opinion, the most important difference between our implementations of the two methods is that the SVM method uses only contextual observation features, while the structural SVM method may use both types of contextual features.

For the SVM method, using only the observation features of the token itself (no contextual features) provides a baseline. As expected, performance is relatively low: the overall token classification accuracies before and after post-processing are 85.82% and 87.69%, respectively (Table 3), and the F-Measure of name chunk retrieval is 78.09% (Table 4). After adding the observation features of the immediate left and right neighbors, the corresponding accuracies and F-Measure increase significantly, to 93.89%, 94.68%, and 88.45%. This clearly indicates the importance of the first order contextual observation features. After further adding the observation features of the next left and right neighbors, the corresponding accuracies and F-Measure increase to 95.10%, 95.60%, and 91.37%. The second order contextual observation features are thus still helpful, though less so than the first order features.

For the structural SVM method, using only the observation features of the token itself means no contextual observation features are used, but contextual label features still are (see Section 3.2 for the discussion of state transition features). The token classification accuracies before and after post-processing are 92.06% and 92.75%, respectively (Table 5), and the F-Measure of name chunk retrieval is 86.24% (Table 6), much better than the SVM method in the same setting. This clearly indicates that contextual label features can be very helpful. When contextual observation features are added, the performance gains are much smaller than those of the SVM method at the same settings, which may indicate that the discriminative information provided by contextual observation features and contextual label features is largely redundant. After adding the second order contextual observation features, there is no further performance gain for the structural SVM method, even though it uses the extra contextual label features.

For better visualization, the performance data from Tables 3-6 are plotted in Figure 3. We also observe that after post-processing, the token classification accuracy for First Name always decreases, while the accuracy for Other always increases. This is due to the second post-processing rule listed in Section 3.3, which re-assigns the label Other to all isolated First Name tokens. This rule introduces errors on some tokens that truly are First Name (but whose corresponding Last Name tokens were mislabeled); on the other hand, it corrects many Other tokens that were mislabeled as First Name. Overall, the rule improves performance.

4.4 Error Analysis

Partial screenshots of our GUI (Graphical User Interface) program for visually examining investigator name recognition results are shown in Figure 4. In this program, first name chunks are marked in red and last name chunks in blue. Most of the investigator names in these two samples are recognized correctly. Notice that in most cases the algorithm correctly handles organizations named after people, e.g., Lozano Blesa Hospital in Figure 4(b), and does not label them as investigator names.

Figure 4 also illustrates three kinds of recognition errors. Figure 4(a) shows an under-labeling error (marked in blue): an uncommon name is missed. Figure 4(b) illustrates an over-labeling error and a mis-chunking error (indicated by two arrows). San Sebastian is a city name, but the algorithm mislabels it as an investigator name. The first name chunk of "J. López del Val" should be "J. López" and the last name chunk should be "del Val"; the algorithm mislabels the word "López", and this error contributes both a false positive (an extra, incorrect name is extracted) and a false negative (the true name is missed). This kind of mis-chunking error is the most common type for both the SVM and structural SVM methods, and it causes the large drop from overall token classification accuracy to the F-measure of full name recognition. Eliminating this kind of error is not easy, and further research is required.

5. SUMMARY

We have implemented and evaluated two investigator name recognition methods. The SVM method uses observation features from the token itself and contextual observation features from neighboring tokens. The structural SVM method further utilizes contextual label features. Both kinds of contextual features (observation and label) provide important information and are very helpful for improving recognition performance. After combining the contextual observation features from neighboring tokens, the SVM method matches or slightly exceeds the performance of the structural SVM method.


Figure 4: Two examples of visually examined investigator name recognition results. Recognized first name chunks are marked in red, and recognized last name chunks are marked in blue. The three errors are discussed in the text.

6. ACKNOWLEDGMENTS

We thank Dr. In-Cheol Kim for help in preparing Figure 2. This research was supported by the Intramural Research Program of the National Institutes of Health (NIH), National Library of Medicine, and Lister Hill National Center for Biomedical Communications.

7. REFERENCES

  1. Ananiadou, S., Friedman, C., and Tsujii, J. 2004. Introduction: named entity recognition in biomedicine. Journal of Biomedical Informatics, 37, 6, 393-395.
  2. Asahara, M. and Matsumoto, Y. 2003. Japanese named entity extraction with redundant morphological analysis. In Proc. Human Language Technology Conference - North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 8-15.
  3. Bikel, D.M., Schwartz, R.L., and Weischedel, R.M. 1999. An algorithm that learns what's in a name. Machine Learning, 34, 1-3, 211-231.
  4. Burges, C. J. C. 1998. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2, 2, 121-167.
  5. Carvalho, V. R. and Cohen, W. W. 2004. Learning to extract signature and reply lines from email. Proc. of the Conference on Email and Anti-Spam 2004, Mountain View, California.
  6. Chang, C.-C. and Lin, C.-J. 2001. LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
  7. Cortes, C. and Vapnik, V. 1995. Support-vector networks. Machine Learning, 20, 3, 273-297.
  8. Crammer, K. and Singer, Y. 2001. On the algorithmic implementation of multi-class kernel-based vector machines. Journal of Machine Learning Research, 2, 265-292.
  9. Cristianini, N. and Shawe-Taylor, J. 2000. An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, Cambridge, UK.
  10. Joachims, T. 1998. Text categorization with support vector machines: learning with many relevant features. Proc. European Conference on Machine Learning, 137-142.
  11. Joachims, T., Finley, T., and Yu, C.-N. 2009. Cutting-plane training of structural SVMs. Machine Learning, 77, 1, 27-59.
  12. Kim, J.D., Ohta, T., Tateishi, Y. and Tsujii, J. 2004. Introduction to the bio-entity recognition task at JNLPBA. Proc. Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA).
  13. Lee, C., Hou, W. J. and Chen, H.-H. 2004. Annotating multiple types of biomedical entities: a single word classification approach. Proc. Joint Workshop on Natural Language Processing in Biomedicine and its Applications.
  14. McCallum, A., Freitag, D., and Pereira, F. 2000. Maximum entropy models for information extraction and segmentation. Proc. of the 17th International Conference on Machine Learning, 591-598.
  15. McCallum, A. and Li, W. 2003. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. Proc. of the 7th Conference on Natural Language Learning (CoNLL-2003), 4, 188-191.
  16. Nadeau, D. and Sekine, S. 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30, 1, 3-26.
  17. Nguyen, N. and Guo, Y. 2007. Comparisons of sequence labeling algorithms and extensions. Proc. of the 24th International Conference on Machine Learning, 681-688.
  18. Sekine, S. and Nobata, C. 2004. Definition, dictionaries and tagger for extended named entity hierarchy. In Proc. Conference on Language Resources and Evaluation (LREC).
  19. Settles, B. 2004. Biomedical named entity recognition using conditional random fields and rich feature sets. Proc. Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA).
  20. Si, L., Kanungo, T., Huang, X. 2005. Boosting performance of bio-entity recognition by combining results from multiple systems. Proc. Workshop on Data Mining in Bioinformatics (BioKDD).
  21. De Sitter, A. and Daelemans, W. 2003. Information extraction via double classification. In Proceedings of International Workshop on Adaptive Text Extraction and Mining, Dubrovnik.
  22. Tsochantaridis, I., Hofmann, T., Joachims, T., and Altun, Y. 2004. Support vector machine learning for interdependent and structured output spaces. In Proc. of the 21st International Conference on Machine Learning.
  23. Vapnik, V. 1995. The nature of statistical learning theory. Springer-Verlag, New York.
  24. Vapnik, V. 1998. Statistical learning theory. Wiley.
  25. Weston, J. and Watkins, C. 1999. Support vector machines for multi-class pattern recognition. In Proc. of the 7th European Symposium on Artificial Neural Networks.
  26. http://www.cs.cornell.edu/People/tj/svm_light/svm_hmm.html.