Epi Data Science: Uncertainty in Radiology Reports

Uncertainty in Radiology Reports

A common complaint among physicians is that radiology reports tend to contain ambiguous and uncertain phraseology. What should be a precise anatomical description can read as a hedge between two or more differing conclusions, couched in words that convey uncertainty.

I decided to test this belief by extracting text from a large number of radiology reports and comparing the text to a dictionary of words suggesting certainty or uncertainty. Using the MIMIC II (Multiparameter Intelligent Monitoring in Intensive Care) database (MIMIC2), I extracted the radiology reports of 246705 intensive care patients and decomposed them into individual words. I then created a 302-word dictionary made up of 151 words suggesting certainty and 151 words suggesting uncertainty. The tokenized report words were compared to the dictionary, which allowed an overall assessment of certainty vs. uncertainty in the corpus as a whole.

First, load the required packages.

library(tidyverse)
library(stringr)
library(tidytext)
library(purrr)
library(readtext)
library(qdap)
library(regexr)
library(SentimentAnalysis)
library(doParallel)
library(forcats)

The radiology reports form an immense corpus, and tokenizing them requires adequate RAM and CPU speed. This code sets up a parallel backend with four worker processes. The MIMIC2 data contains 32 folders, each holding data for 5000 patients.

cl <- makeCluster(4) #start four worker processes
registerDoParallel(cl) #register them as the backend for foreach loops
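Registering a backend only affects code that is written to use it, such as `foreach` loops with `%dopar%`. As a hedged sketch of putting several workers to use on per-document text processing, the following relies on the `parallel` package that ships with R. The helper `count_words`, the name `demo_cl`, and the sample strings are invented for illustration and are not part of the original analysis.

```r
library(parallel)

# Hypothetical per-document task: count whitespace-separated words.
count_words <- function(txt) length(strsplit(txt, "\\s+")[[1]])

# Invented sample documents (not MIMIC2 text).
docs <- c("no acute disease", "lungs are clear bilaterally", "stable")

demo_cl <- makeCluster(2)                         # two workers for the demo
word_totals <- parLapply(demo_cl, docs, count_words)
stopCluster(demo_cl)                              # release the workers

unlist(word_totals)
```

Each document is handed to a worker, and the results come back in the original order.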

A data frame to hold the corpus is prepared first, seeded with a placeholder row.

radiology_reports <- tibble(reports = list("0")) #placeholder row in a list column named "reports"

The MIMIC2 radiology reports are extracted and prepared in a format suitable for tokenizing.

location_in <- "D:/Radiology/MIMIC2/00" # located on my computer drive
num <- location_in %>% str_extract("\\d\\d") #extract the two-digit folder number ("00")
num <- as.numeric(num) #change char to numeric
num <- num + 1 #increase the number by one
#format it back to two digits
num <- formatC(num, width = 2, format = "d", flag = "0")
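The increment-and-pad step above can be generalized to build every folder path in one call. The following is a minimal sketch, assuming the 32 folders are numbered 00 through 31 under the same root. That numbering, and the names `root`, `folder_ids`, and `folder_paths`, are assumptions for illustration rather than details confirmed by the MIMIC2 download.

```r
# Sketch: build all folder paths at once instead of incrementing one number.
# Assumption: folders are named 00, 01, ..., 31 under a common root.
root <- "D:/Radiology/MIMIC2"   # same root as location_in above

# formatC() is vectorized, so it pads every number to two digits in one call.
folder_ids <- formatC(0:31, width = 2, format = "d", flag = "0")
folder_paths <- file.path(root, folder_ids)

head(folder_paths, 3)
```

A loop over `folder_paths` could then replace the manual edit of `location_in` for each group of patients.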

The paths to each patient text file are defined.

dirs <- list.dirs(path = location_in) #location_in and every folder beneath it
txts <- Sys.glob(file.path(dirs, "*.txt")) #all .txt files in those folders
txts_df <- tibble(txts)
notes <- txts_df %>% filter(str_detect(txts, pattern = "NOTEEVENTS-")) #keep only the notes files

A list is prepared to collect all of the individual reports, seeded with a placeholder.

rad_reports <- list(as.character(0)) #list seeded with a placeholder "0"

The individual reports are extracted and appended to the whole list of reports. This takes a while. As downloaded from MIMIC2, each patient’s data is contained in a folder for that patient. That folder sits inside a folder holding a group of 5000 patients, which in turn sits inside the top-level folder downloaded from MIMIC2. Automating the extraction of all the individual patient data therefore requires considerable coding gymnastics, so please bear with all of the code necessary to do this.

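for(i in 1:997) { #997 is hard-coded for this folder; seq_along(notes$txts) adapts to any folder
  record <- readtext(notes$txts[i]) #read the file into a one-row data frame
  record <- record$text #take just the text column
  rad1 <- record %>% str_replace_all("\n", " ") #flatten line breaks so the pattern can span lines
  #keep the text from RADIOLOGY_REPORT up to (not including) the first IMPRESSION
  rad2 <- rad1 %>% str_extract("(RADIOLOGY_REPORT).+?(?=IMPRESSION)")
  if(!is.na(rad2)){ #skip files with no match
    rad_reports <- append(rad_reports, rad2)
  }
}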
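To make the extraction pattern concrete, here is a small sketch on a made-up record. It uses base R's `regmatches()` with `perl = TRUE`, which should behave like `str_extract()` for this pattern when a match exists. The sample text is invented and is not from MIMIC2.

```r
# Invented record for illustration only (not real MIMIC2 text).
record <- paste(
  "header text RADIOLOGY_REPORT chest film:",
  "lungs are clear. IMPRESSION no acute disease.",
  "IMPRESSION second mention"
)

# The lazy .+? stops at the FIRST "IMPRESSION"; the lookahead (?=...)
# keeps the word IMPRESSION itself out of the match.
pattern <- "(RADIOLOGY_REPORT).+?(?=IMPRESSION)"
m <- regexpr(pattern, record, perl = TRUE)
extracted <- regmatches(record, m)

extracted
```

Two consequences follow from the pattern: the text after `IMPRESSION` is left out of the extracted passage, and only the first match in each file is kept.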

# Append to main data frame
rad_reports <- tibble(reports = rad_reports) #list column named "reports"
radiology_reports <- bind_rows(radiology_reports, rad_reports)

radiology_reports2 <- unlist(radiology_reports) #flatten to a character vector
write.csv(radiology_reports2, file = "radiology_reports.csv")

radiology_reports3 <- tibble(reports = radiology_reports2)

#Tokenize: one lowercase word per row
reports_tok <- radiology_reports3 %>% unnest_tokens(text, reports)

#remove numbers
reports_tok1 <- reports_tok %>% mutate(text = str_replace_all(text, "[0-9]", ""))
#remove tokens left empty
reports_tok2 <- reports_tok1 %>% filter(text != "")
#remove _'s
reports_tok3 <- reports_tok2 %>% mutate(text = str_replace_all(text, "_", ""))
#remove tokens left empty
reports_tok3 <- reports_tok3 %>% filter(text != "")

reports_tok3 %>% count(text, sort = TRUE)

## # A tibble: 8,977 x 2
##     text     n
##    <chr> <int>
##  1   the  7524
##  2   and  5710
##  3    of  4661
##  4  with  4186
##  5    to  4013
##  6    is  3900
##  7   for  3357
##  8    in  3144
##  9    no  2893
## 10     a  2432
## # ... with 8,967 more rows

colnames(reports_tok3) <- "words"
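The cleanup steps above can be traced on a handful of made-up tokens. This sketch uses base R only (`gsub()` and `table()` in place of the tidyverse calls above), and the tokens are invented for illustration.

```r
# Invented tokens, as unnest_tokens() might emit them (lowercased words).
tokens <- c("the", "2", "chest", "x_ray", "__", "the", "t4", "no")

tokens <- gsub("[0-9]", "", tokens)   # strip digits: "2" -> "", "t4" -> "t"
tokens <- gsub("_", "", tokens)       # strip underscores: "x_ray" -> "xray"
tokens <- tokens[tokens != ""]        # drop tokens that are now empty

# Frequency table, most common first (similar to count(text, sort = TRUE)).
word_counts <- sort(table(tokens), decreasing = TRUE)
word_counts
```

Stripping digits inside a token leaves the remaining letters behind, so a token like `t4` survives as the fragment `t`.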

Now we can compare the reports to the dictionary using an inner join.

#don't discard stop_words since many hedging words are included in them
hedge <- read_csv("UncertaintyLexicon2.csv") #wide form: one column per type (printed below)
dodge <- read_csv("UncertaintyLexicon.csv") #long form: one row per word, with words and type columns
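The two files appear to hold the same lexicon in two layouts. As a rough sketch of how the layouts relate, a wide table (one column per category) can be stacked into a long table (one row per word), which is the shape a join on words needs. The three-row lexicon here is a small excerpt using words from the dictionary in this post, and the column names `words` and `type` are inferred from the join and plotting code rather than read from the CSV files themselves.

```r
# Small excerpt of a wide-form lexicon: one column per category.
wide <- data.frame(
  uncertain = c("probably", "possibly", "may"),
  certain   = c("definitely", "proven", "confirmed"),
  stringsAsFactors = FALSE
)

# stack() turns the columns into (values, ind) pairs: one row per word.
long <- stack(wide)
names(long) <- c("words", "type")     # assumed column names
long$type <- as.character(long$type)

long
```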

Here is the whole dictionary:

print.data.frame(hedge)

##            uncertain       certain
## 1               some    documented
## 2               said          none
## 3              often       exactly
## 4           probably    definitely
## 5           possibly       checked
## 6            claimed        proven
## 7            alleged     confident
## 8        authorities       assured
## 9            experts      definite
## 10          relative  specifically
## 11         generally       unknown
## 12             known     assurance
## 13        frequently        belief
## 14           studies       believe
## 15          regarded     certainty
## 16             noted       clarity
## 17       recommended    confidence
## 18         mentioned         trust
## 19               may   trustworthy
## 20           clearly          real
## 21              said       reality
## 22            saying         truth
## 23       speculation        secure
## 24         ambiguity      security
## 25       ambivalence         cinch
## 26           concern    conviction
## 27          distrust          firm
## 28          mistrust          lock
## 29         skeptical      positive
## 30        skepticism       staunch
## 31           trouble         valid
## 32        uneasiness         sound
## 33  unpredictability authoritative
## 34     unpredictable      surefire
## 35             worry   ascertained
## 36      bewilderment         clear
## 37        conjecture      definite
## 38       contingency    definitive
## 39           dilemma     authentic
## 40          disquiet   categorical
## 41          doubtful       decided
## 42      doubtfulness   determinate
## 43             guess     doubtless
## 44         guesswork      accurate
## 45         hesitancy       careful
## 46        hesitation     exactness
## 47      inconclusive     faultless
## 48  inconclusiveness      incisive
## 49        indecision    meticulous
## 50       oscillation       precise
## 51         perplexed      veracity
## 52        perplexity    acceptance
## 53            puzzle     admission
## 54        puzzlement        avawal
## 55             qualm      credence
## 56           quandry        credit
## 57             query     deduction
## 58             vague     knowledge
## 59         vagueness    persuasive
## 60            wonder       reliant
## 61             doubt understanding
## 62       fluctuation        bright
## 63              hazy          bold
## 64         hesitancy       courage
## 65        hesitation          firm
## 66          hesitate     fortitude
## 67              iffy         hardy
## 68        indecision         pluck
## 69         misgiving      resolute
## 70           obscure  resoluteness
## 71            muddle    resolution
## 72           quandry       reliant
## 73         tentative      tenacity
## 74      tentativenss     resoulute
## 75            unsure   resolutness
## 76             angst    uniformity
## 77      apprehension     unfailing
## 78           lottery dependability
## 79              odds determination
## 80              risk       earnest
## 81         speculate      fidelity
## 82       speculation     permanent
## 83           venture       regular
## 84          accident       staunch
## 85            chance   trustworthy
## 86          eventual  unchangeable
## 87        likelihood     unfailing
## 88       predicament     immutable
## 89       probability    dependable
## 90          probable          even
## 91            cloudy    uniformity
## 92          slippery       durable
## 93           opinion       lasting
## 94        capricious      reliable
## 95       fluctuating         tried
## 96             fluid    changeless
## 97        inconstant    invariable
## 98         mercurial   predictable
## 99           mutable       settled
## 100        unsettled        stable
## 101         unstab;e    stationary
## 102         unsteady     unvarying
## 103         variable      resolute
## 104         volatile     convinced
## 105          protean   assuredness
## 106          aimless    inevitable
## 107        arbitrary    inexorable
## 108          erratic    legitimate
## 109        haphazard     authentic
## 110        irrugular authoritative
## 111           random      coherent
## 112        scattered    consistent
## 113         slapdash      complete
## 114            stray   justifiable
## 115       ambivalent   supportable
## 116         hesitate   sustainable
## 117       hesitating        strong
## 118            shaky       logical
## 119      vacillating         sound
## 120        vacillate      sensible
## 121         wavering         solid
## 122            dicey        actual
## 123     undependable       binding
## 124         unstable    compelling
## 125           unsure          good
## 126         unlikely        cogent
## 127         doubtful     confirmed
## 128       incomplete       confirm
## 129       fallacious   orrefutable
## 130       misleading        proven
## 131           absurd          pure
## 132     unconvincing  susbstantial
## 133        illogical       telling
## 134          invalid        tested
## 135          unsound      ultimate
## 136             weak   uncorrupted
## 137           unreal       factual
## 138     unreasonable    diagnostic
## 139            FALSE      detailed
## 140             fake      discrete
## 141       improbable   explanatory
## 142            wrong      rational
## 143          unsound    reasonable
## 144        erroneous         solid
## 145       inaccurate    systematic
## 146           untrue       testing
## 147           faulty      thorough
## 148          inexact      powerful
## 149         spurious     veritable
## 150         specious        narrow
## 151             wide           all

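dodge_words <- reports_tok3 %>% inner_join(dodge, by = "words") #keep only report words found in the lexicon
dodge_count <- dodge_words %>% count(words, sort = TRUE) #occurrences of each matched word

dodge_type <- dodge_words %>%
  inner_join(dodge_count, by = "words") %>% #attach each word's count
  distinct() %>% #collapse repeated rows
  arrange(desc(n))
dodge_type$type <- factor(dodge_type$type)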
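To show what the inner join and count are doing, here is a minimal base R sketch on invented data, with `merge()` standing in for `inner_join()` and `table()` for `count()`. The tokens and the tiny lexicon are made up for illustration.

```r
# Invented tokens and a tiny long-form lexicon (illustration only).
report_words <- data.frame(
  words = c("the", "probably", "no", "may", "probably", "confirmed"),
  stringsAsFactors = FALSE
)
lexicon <- data.frame(
  words = c("probably", "may", "confirmed", "definitely"),
  type  = c("uncertain", "uncertain", "certain", "certain"),
  stringsAsFactors = FALSE
)

# Inner join: keep only tokens that appear in the lexicon.
matched <- merge(report_words, lexicon, by = "words")

# One row per lexicon word found, with its count and type.
counts <- as.data.frame(table(words = matched$words), responseName = "n",
                        stringsAsFactors = FALSE)
counts <- merge(counts, lexicon, by = "words")
counts <- counts[order(-counts$n), ]
counts
```

Lexicon words that never occur in the tokens, such as `definitely` here, drop out of the result, so any later comparison covers only words that were actually found.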

dodge_type$words <- factor(dodge_type$words)
dodge_type %>% ggplot(aes(x = forcats::fct_inorder(words), y = n)) +
  geom_bar(stat = "identity") +
  facet_wrap(~type, scales = "free_x") +
  coord_cartesian(xlim = c(0, 25)) +
  ggtitle("Radiology Report Word Counts",
          subtitle = "Each bar = total number of each word in lexicon") +
  xlab("Words") +
  ylab("Word Count") +
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        plot.title = element_text(hjust = .5, color = "blue"),
        plot.subtitle = element_text(hjust = .5),
        panel.grid.major.x = element_blank() )

wilcox.test(dodge_type$n~dodge_type$type) # where y is numeric and x is a binary factor

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  dodge_type$n by dodge_type$type
## W = 1020.5, p-value = 0.523
## alternative hypothesis: true location shift is not equal to 0

Well, it looks like the radiologists aren’t as ambiguous and uncertain as we thought. The distributions of certain vs. uncertain word counts appear almost identical, and the Wilcoxon rank sum test finds no significant difference (p = 0.523). So, our apologies to all radiologists!

Wednesday, December 6, 2017

David P. Nichols, MD, MPH

December 5, 2017
