A common complaint among physicians is that radiology reports contain ambiguous and uncertain phrasing. What should be a precise anatomical description can read as a hedge between two or more differing conclusions, couched in words that signal uncertainty.
I decided to test this belief by extracting the text of a large number of radiology reports and comparing it to a dictionary of words suggesting certainty or uncertainty. Using the MIMIC II (Multiparameter Intelligent Monitoring in Intensive Care) database (MIMIC2), I extracted the radiology reports of 246705 intensive care patients and decomposed them into individual words. I then created a 302-word dictionary: 151 words suggesting certainty and 151 suggesting uncertainty. The tokenized report words were compared to the dictionary, which allowed an overall assessment of certainty vs. uncertainty in the corpus as a whole.
First, load the required packages:
library(tidyverse)
library(stringr)
library(tidytext)
library(purrr)
library(readtext)
library(qdap)
library(regexr)
library(SentimentAnalysis)
library(doParallel)
library(forcats)
The radiology reports make up an immense file, and tokenizing them requires ample RAM and CPU speed. The code below registers a parallel cluster of four workers; adjust that number to the cores available on your machine. The MIMIC2 data contains 32 folders, each holding data for 5000 patients.
cl<-makeCluster(4)
registerDoParallel(cl)
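A registered cluster is only used by code that asks for it, for example a foreach loop with %dopar%. The following is a minimal sketch of that pattern on toy strings, not the code used for this analysis; the two-worker cluster and the sample texts are assumptions made for illustration.

```r
library(doParallel)  # also attaches foreach and parallel

# Assumption: two workers, to keep the toy example light
cl <- makeCluster(2)
registerDoParallel(cl)

# Made-up report fragments standing in for real files
texts <- c("no acute disease", "possible pneumonia", "clear lungs")

# %dopar% sends each iteration to a worker; .combine = c returns a vector
lengths_par <- foreach(txt = texts, .combine = c) %dopar% nchar(txt)

# Release the worker processes when finished
stopCluster(cl)

lengths_par
```

The results come back in the order of the inputs, so they can be matched to the files they came from.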
An empty data frame to hold the corpus is prepared first.
radiology_reports <- data_frame(list("0"))
colnames(radiology_reports) <- "reports"
The paths to each patient's text file are defined next. Here location_in is assumed to hold the path to the downloaded MIMIC2 folder.
dirs <- list.dirs(path = location_in)
txts <- Sys.glob(file.path(dirs, "*.txt"))
txts_df <- data_frame(txts)
notes <- txts_df %>% filter(str_detect(txts,pattern = "NOTEEVENTS-"))
An empty list is prepared for all of the individual reports.
rad_reports<- list(as.character(0)) #create an empty list
The individual reports are extracted and appended to the list of reports. This takes a while. As downloaded from MIMIC2, each patient’s data sits in its own folder, which in turn sits in a folder holding a group of 5000 patients, which sits in the top-level folder downloaded from MIMIC2. Automating the extraction of all the individual patient data therefore takes considerable coding gymnastics, so please bear with the code needed to do it.
for (i in 1:997) {
  # read the file's full text (readtext returns one row per file)
  record <- readtext(notes$txts[i])$text
  # flatten to a single line so the regex can span line breaks
  rad1 <- record %>% str_replace_all("\n", " ")
  # keep the text from the report header up to, but not including, IMPRESSION
  rad2 <- rad1 %>% str_extract("(RADIOLOGY_REPORT).+?(?=IMPRESSION)")
  # skip files with no matching report
  if (!is.na(rad2)) {
    rad_reports <- append(rad_reports, rad2)
  }
}
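To see what the regular expression in the loop keeps, here is a small sketch on a made-up note. The note text is hypothetical and real MIMIC2 notes will differ in layout; only the pattern and the newline replacement are taken from the loop above.

```r
library(stringr)

# Hypothetical note text, invented for illustration
note <- "NOTE 1\nRADIOLOGY_REPORT\nCHEST: lungs are clear.\nIMPRESSION: no acute disease."

# Same two steps as in the loop: flatten newlines, then extract lazily
# from the header up to (but not including) the word IMPRESSION
flat <- str_replace_all(note, "\n", " ")
body <- str_extract(flat, "(RADIOLOGY_REPORT).+?(?=IMPRESSION)")
body

# A note without both markers yields NA, which the loop skips
no_match <- str_extract("DISCHARGE SUMMARY only", "(RADIOLOGY_REPORT).+?(?=IMPRESSION)")
is.na(no_match)
```

Because str_extract returns only the first match, a file holding several reports contributes just its first one with this pattern.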
# Append to main data frame
rad_reports <- data_frame(rad_reports)
colnames(rad_reports) <- "reports"
radiology_reports <- bind_rows(radiology_reports, rad_reports)
radiology_reports2 <- unlist(radiology_reports)
write.csv(radiology_reports2, file = "radiology_reports.csv")
radiology_reports3 <- data_frame(radiology_reports2)
colnames(radiology_reports3) <- "reports"
#More cleanup
reports_tok <- radiology_reports3 %>% unnest_tokens(text, reports)
#remove numbers
reports_tok1 <- data_frame(str_replace_all(reports_tok$text, "([0-9])", ""))
colnames(reports_tok1) <- "text"
#remove blank spaces
reports_tok2 <- reports_tok1 %>% filter(text != "")
#remove _'s
reports_tok3 <- data_frame(str_replace_all(reports_tok2$text, "_", ""))
colnames(reports_tok3) <- "text"
#remove blanks
reports_tok3 <- reports_tok3 %>% filter(text != "")
reports_tok3 %>% count(text, sort = TRUE)
## # A tibble: 8,977 x 2
## text n
## <chr> <int>
## 1 the 7524
## 2 and 5710
## 3 of 4661
## 4 with 4186
## 5 to 4013
## 6 is 3900
## 7 for 3357
## 8 in 3144
## 9 no 2893
## 10 a 2432
## # ... with 8,967 more rows
colnames(reports_tok3) <- "words"
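The cleanup steps above can also be written as a single pipeline. This is a sketch on two made-up report fragments, assuming the same packages; it combines the digit and underscore removal into one character class and names the final column words directly.

```r
library(dplyr)
library(stringr)
library(tibble)
library(tidytext)

# Made-up fragments standing in for the real corpus
toy <- tibble(reports = c("RADIOLOGY_REPORT 2 views of the chest",
                          "No effusion seen, 10 mm nodule"))

tokens <- toy %>%
  unnest_tokens(text, reports) %>%                    # one lowercase token per row
  mutate(text = str_replace_all(text, "[0-9_]", "")) %>%  # strip digits and underscores
  filter(text != "") %>%                              # drop tokens emptied by the step above
  rename(words = text)                                # match the lexicon's join column

tokens %>% count(words, sort = TRUE)
```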
Now we can compare the reports to the dictionary using an inner join.
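As a rough illustration of what the inner join does, here is a sketch with a toy token list and a four-word toy lexicon. Both tables are made up; only the column names words and type follow the code in this post.

```r
library(dplyr)
library(tibble)

# Toy tokens, one row per word as produced by tokenizing
tokens <- tibble(words = c("no", "probably", "clear", "the",
                           "probably", "may", "clear", "clear"))

# Toy long-form lexicon: one row per word with its type
lexicon <- tibble(words = c("probably", "may", "clear", "definite"),
                  type  = c("uncertain", "uncertain", "certain", "certain"))

# Keep only the tokens that appear in the lexicon
matched <- inner_join(tokens, lexicon, by = "words")

# Count how often each lexicon word occurs, most frequent first
counts <- matched %>% count(type, words, sort = TRUE)
counts
```

One caveat: inner_join returns one row per match, so a word listed more than once in the lexicon has its count multiplied by the number of times it is listed.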
# don't discard stop_words, since many hedging words are included in them
hedge <- read_csv("UncertaintyLexicon2.csv") # wide form: one column per type
dodge <- read_csv("UncertaintyLexicon.csv") # long form: one row per word, with its type
Here is the whole dictionary:
print.data.frame(hedge)
## uncertain certain
## 1 some documented
## 2 said none
## 3 often exactly
## 4 probably definitely
## 5 possibly checked
## 6 claimed proven
## 7 alleged confident
## 8 authorities assured
## 9 experts definite
## 10 relative specifically
## 11 generally unknown
## 12 known assurance
## 13 frequently belief
## 14 studies believe
## 15 regarded certainty
## 16 noted clarity
## 17 recommended confidence
## 18 mentioned trust
## 19 may trustworthy
## 20 clearly real
## 21 said reality
## 22 saying truth
## 23 speculation secure
## 24 ambiguity security
## 25 ambivalence cinch
## 26 concern conviction
## 27 distrust firm
## 28 mistrust lock
## 29 skeptical positive
## 30 skepticism staunch
## 31 trouble valid
## 32 uneasiness sound
## 33 unpredictability authoritative
## 34 unpredictable surefire
## 35 worry ascertained
## 36 bewilderment clear
## 37 conjecture definite
## 38 contingency definitive
## 39 dilemma authentic
## 40 disquiet categorical
## 41 doubtful decided
## 42 doubtfulness determinate
## 43 guess doubtless
## 44 guesswork accurate
## 45 hesitancy careful
## 46 hesitation exactness
## 47 inconclusive faultless
## 48 inconclusiveness incisive
## 49 indecision meticulous
## 50 oscillation precise
## 51 perplexed veracity
## 52 perplexity acceptance
## 53 puzzle admission
## 54 puzzlement avawal
## 55 qualm credence
## 56 quandry credit
## 57 query deduction
## 58 vague knowledge
## 59 vagueness persuasive
## 60 wonder reliant
## 61 doubt understanding
## 62 fluctuation bright
## 63 hazy bold
## 64 hesitancy courage
## 65 hesitation firm
## 66 hesitate fortitude
## 67 iffy hardy
## 68 indecision pluck
## 69 misgiving resolute
## 70 obscure resoluteness
## 71 muddle resolution
## 72 quandry reliant
## 73 tentative tenacity
## 74 tentativenss resoulute
## 75 unsure resolutness
## 76 angst uniformity
## 77 apprehension unfailing
## 78 lottery dependability
## 79 odds determination
## 80 risk earnest
## 81 speculate fidelity
## 82 speculation permanent
## 83 venture regular
## 84 accident staunch
## 85 chance trustworthy
## 86 eventual unchangeable
## 87 likelihood unfailing
## 88 predicament immutable
## 89 probability dependable
## 90 probable even
## 91 cloudy uniformity
## 92 slippery durable
## 93 opinion lasting
## 94 capricious reliable
## 95 fluctuating tried
## 96 fluid changeless
## 97 inconstant invariable
## 98 mercurial predictable
## 99 mutable settled
## 100 unsettled stable
## 101 unstab;e stationary
## 102 unsteady unvarying
## 103 variable resolute
## 104 volatile convinced
## 105 protean assuredness
## 106 aimless inevitable
## 107 arbitrary inexorable
## 108 erratic legitimate
## 109 haphazard authentic
## 110 irrugular authoritative
## 111 random coherent
## 112 scattered consistent
## 113 slapdash complete
## 114 stray justifiable
## 115 ambivalent supportable
## 116 hesitate sustainable
## 117 hesitating strong
## 118 shaky logical
## 119 vacillating sound
## 120 vacillate sensible
## 121 wavering solid
## 122 dicey actual
## 123 undependable binding
## 124 unstable compelling
## 125 unsure good
## 126 unlikely cogent
## 127 doubtful confirmed
## 128 incomplete confirm
## 129 fallacious orrefutable
## 130 misleading proven
## 131 absurd pure
## 132 unconvincing susbstantial
## 133 illogical telling
## 134 invalid tested
## 135 unsound ultimate
## 136 weak uncorrupted
## 137 unreal factual
## 138 unreasonable diagnostic
## 139 FALSE detailed
## 140 fake discrete
## 141 improbable explanatory
## 142 wrong rational
## 143 unsound reasonable
## 144 erroneous solid
## 145 inaccurate systematic
## 146 untrue testing
## 147 faulty thorough
## 148 inexact powerful
## 149 spurious veritable
## 150 specious narrow
## 151 wide all
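The join below uses the lexicon in long form, with one row per word. Both forms are read from their own CSV files here, but as a sketch, a wide table like the one above could be reshaped with tidyr's pivot_longer. The column names type and words follow the code below; the three-row toy table is only an illustration.

```r
library(tibble)
library(tidyr)

# Toy wide-form lexicon: one column per type
wide <- tibble(uncertain = c("probably", "possibly", "may"),
               certain   = c("definitely", "proven", "confirmed"))

# Stack the two columns: the old column name becomes `type`,
# the cell value becomes `words`
long <- pivot_longer(wide, cols = everything(),
                     names_to = "type", values_to = "words")
long
```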
dodge_words <- reports_tok3 %>% inner_join(dodge)
dodge_count <- dodge_words %>% count(words, sort = TRUE)
dodge_type <- dodge_words %>%
  inner_join(dodge_count) %>%
  distinct %>%
  arrange(desc(n))
dodge_type$type <- factor(dodge_type$type)
dodge_type$words <- factor(dodge_type$words)
dodge_type %>% ggplot(aes(x = forcats::fct_inorder(words), y = n)) +
  geom_bar(stat = "identity") +
  facet_wrap(~type, scales = "free_x") +
  coord_cartesian(xlim = c(0, 25)) +
  ggtitle("Radiology Report Word Counts",
          subtitle = "Each bar = total number of each word in lexicon") +
  xlab("Words") +
  ylab("Word Count") +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        plot.title = element_text(hjust = .5, color = "blue"),
        plot.subtitle = element_text(hjust = .5),
        panel.grid.major.x = element_blank())
wilcox.test(dodge_type$n~dodge_type$type) # where y is numeric and x is a binary factor
##
## Wilcoxon rank sum test with continuity correction
##
## data: dodge_type$n by dodge_type$type
## W = 1020.5, p-value = 0.523
## alternative hypothesis: true location shift is not equal to 0
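For readers unfamiliar with the test, here is a small sketch of the same call on invented counts. The numbers are made up and are not from the corpus; the formula interface with a data argument is an equivalent way of writing the call above.

```r
# Invented per-word counts, five words of each type
toy_counts <- data.frame(
  n    = c(120, 45, 30, 8, 3, 95, 60, 22, 10, 4),
  type = factor(rep(c("certain", "uncertain"), each = 5))
)

# Compare the two groups of counts; n is numeric, type is a two-level factor
res <- wilcox.test(n ~ type, data = toy_counts)

res$statistic  # W, based on the ranks of the first factor level
res$p.value    # a large p-value means no evidence of a shift between groups
```

The test compares ranks rather than raw values, which suits word counts because a few very frequent words would otherwise dominate a comparison of means.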
Well, it looks like radiologists aren’t as ambiguous and uncertain as we thought. The distributions of certain and uncertain word counts appear almost identical, and the Wilcoxon rank sum test finds no significant difference between them (W = 1020.5, p = 0.523). So, our apologies to all radiologists!