Title:
NLP-derived Information Improves the Estimates of Risk of Disease Compared To Estimates Based On Manually Extracted Data Alone.
Author(s):
Callaghan FM, Jackson MT, Demner-Fushman D, Abhyankar S, McDonald CJ.
Institution(s):
1) National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
2) Food and Drug Administration (CDER/OTS/OB/DBVI), White Oak, MD, USA
Source:
5th International Symposium on Semantic Mining in Biomedicine. Zurich. 2012.
Abstract:
Natural language processing (NLP) enables
researchers to extract large quantities of information
from free text that could otherwise
only be extracted manually. This information
can then be used to answer clinical research
questions via statistical analysis. However,
NLP extracts information with some degree
of error (the sensitivity and specificity
of state-of-the-art NLP methods are typically
80-90%), and most statistical methods assume
that the information has been observed "without
measurement error". As we show in this
paper, if an NLP-derived smoking status predictor
is used, for example, to estimate the
risk of smoking-related cancer without any adjustment
for measurement error, the estimate
is biased. Conversely, if a smaller subset of
manually extracted data is used alone, then
the estimate is unbiased, but imprecise, and
the corresponding inference methods tend to
have low power to detect significant relationships.
We propose using a statistical measurement
error method, a maximum likelihood
(ML) method, that combines information
from NLP with manually validated data
to produce unbiased estimates that also have
good power to detect a significant signal. This
method has the potential to open up large free-text
databases to statistical analysis for clinical
research. Using simulations and a case study
in which smoking status predicts smoking-related cancer,
we demonstrate that the ML method
performs better under a variety of scenarios
than using either NLP or manually extracted
data alone.
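
The comparison described in the abstract can be illustrated with a small simulation. This is a minimal sketch, not the authors' implementation. It assumes a binary smoking status, a binary cancer outcome, and NLP errors that do not depend on the outcome (nondifferential misclassification). It maximizes the likelihood with an EM algorithm, which is one possible way to obtain ML estimates in this setting. All parameter values and function names are hypothetical.

```python
import math
import random

# All numeric settings are hypothetical, chosen only for illustration.
PI, P0, P1 = 0.3, 0.1, 0.3   # P(smoker), P(cancer | non-smoker), P(cancer | smoker)
SENS, SPEC = 0.85, 0.85      # assumed accuracy of the NLP-derived smoking flag
N, N_VAL = 40000, 2000       # total records, manually validated records

def log_odds_ratio(p1, p0):
    return math.log(p1 / (1 - p1)) - math.log(p0 / (1 - p0))

def simulate(seed=0):
    """Counts of validated records by (x, y, w) and unvalidated records by (y, w)."""
    rng = random.Random(seed)
    val = {(x, y, w): 0 for x in (0, 1) for y in (0, 1) for w in (0, 1)}
    unval = {(y, w): 0 for y in (0, 1) for w in (0, 1)}
    for i in range(N):
        x = int(rng.random() < PI)                         # true smoking status
        y = int(rng.random() < (P1 if x else P0))          # cancer outcome
        w = int(rng.random() < (SENS if x else 1 - SPEC))  # NLP-derived status
        if i < N_VAL:
            val[(x, y, w)] += 1   # true status known from manual review
        else:
            unval[(y, w)] += 1    # only the NLP flag is available
    return val, unval

def naive_nlp(val, unval):
    """Treat the NLP flag as if it were the true status, using all records."""
    n = {(y, w): unval[(y, w)] + val[(0, y, w)] + val[(1, y, w)]
         for y in (0, 1) for w in (0, 1)}
    p1 = n[(1, 1)] / (n[(1, 1)] + n[(0, 1)])
    p0 = n[(1, 0)] / (n[(1, 0)] + n[(0, 0)])
    return log_odds_ratio(p1, p0)

def manual_only(val):
    """Use the true status, but only in the validated subset."""
    n = {(x, y): val[(x, y, 0)] + val[(x, y, 1)] for x in (0, 1) for y in (0, 1)}
    p1 = n[(1, 1)] / (n[(1, 1)] + n[(1, 0)])
    p0 = n[(0, 1)] / (n[(0, 1)] + n[(0, 0)])
    return log_odds_ratio(p1, p0)

def ml_em(val, unval, iters=2000):
    """ML estimate from validated and NLP-only records combined, fitted by EM."""
    pi, p0, p1, sens, spec = 0.5, 0.1, 0.2, 0.8, 0.8  # arbitrary starting values
    for _ in range(iters):
        # E-step: split each unvalidated (y, w) count by the posterior P(x | y, w).
        ex = {k: float(v) for k, v in val.items()}
        for (y, w), c in unval.items():
            a1 = pi * (p1 if y else 1 - p1) * (sens if w else 1 - sens)
            a0 = (1 - pi) * (p0 if y else 1 - p0) * ((1 - spec) if w else spec)
            post = a1 / (a1 + a0)
            ex[(1, y, w)] += c * post
            ex[(0, y, w)] += c * (1 - post)
        # M-step: proportions computed from the expected counts.
        n1 = sum(ex[(1, y, w)] for y in (0, 1) for w in (0, 1))
        n0 = sum(ex[(0, y, w)] for y in (0, 1) for w in (0, 1))
        pi = n1 / (n1 + n0)
        p1 = (ex[(1, 1, 0)] + ex[(1, 1, 1)]) / n1
        p0 = (ex[(0, 1, 0)] + ex[(0, 1, 1)]) / n0
        sens = (ex[(1, 0, 1)] + ex[(1, 1, 1)]) / n1
        spec = (ex[(0, 0, 0)] + ex[(0, 1, 0)]) / n0
    return log_odds_ratio(p1, p0)

TRUE_LOG_OR = log_odds_ratio(P1, P0)
val, unval = simulate()
naive, manual, ml = naive_nlp(val, unval), manual_only(val), ml_em(val, unval)
print(f"true {TRUE_LOG_OR:.3f}  NLP only {naive:.3f}  "
      f"manual only {manual:.3f}  combined ML {ml:.3f}")
```

Under these assumed settings, the NLP-only estimate is pulled toward zero. The manual-only estimate is centered on the true value but rests on far fewer records. Repeating the run over many seeds would show how much each estimator varies.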
Publication Type: CONFERENCE