A common complaint among physicians is that radiology reports contain ambiguous and uncertain phraseology. What should be a precise anatomical description can read as a hedge between two or more differing conclusions, propped up by words that convey uncertainty.
I decided to test this belief by extracting the text of a large number of radiology reports and comparing it to a dictionary of words suggesting certainty or uncertainty. Using the MIMIC II (Multiparameter Intelligent Monitoring in Intensive Care) database (MIMIC2), I extracted the radiology reports of 246705 intensive care patients and decomposed them into individual words. I then created a 302-word dictionary: 151 words suggesting certainty and 151 suggesting uncertainty. The tokenized report words were compared to the dictionary, which allowed an overall assessment of certainty vs. uncertainty in the corpus as a whole.
First, load the required packages
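The chunk that loads the packages is not shown here. Judging only from the functions called later in this post, the set below would likely cover them; the mapping of functions to packages is an inference, not the author's stated list, and `needed` and `not_installed` are names made up for this sketch.

```r
# Assumed package list, inferred from the functions used later in the post:
#   tidyverse -> data_frame(), filter(), count(), str_*(), read_csv(), ggplot()
#   readtext  -> readtext()
#   tidytext  -> unnest_tokens()
needed <- c("tidyverse", "readtext", "tidytext")

# Report anything that is not installed instead of failing outright
not_installed <- needed[!vapply(needed, requireNamespace, logical(1), quietly = TRUE)]

if (length(not_installed) > 0) {
  message("Install first: ", paste(not_installed, collapse = ", "))
} else {
  # Attach each package by name
  for (pkg in needed) library(pkg, character.only = TRUE)
}
```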
The radiology reports make up an immense set of files, and tokenizing them takes ample RAM and CPU speed. This code enables use of all the cores available. The MIMIC2 data contains 32 folders, each holding data for 5000 patients.
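The code that enables the cores is not shown. As a minimal sketch of one way to do it, base R's `parallel` package can detect the core count and spread work across it; the author may well have used a different package, and `n_cores`, `token_counts` and the two sample phrases are invented for illustration.

```r
library(parallel)

# detectCores() can return NA on some platforms, so fall back to 1
n_cores <- detectCores()
if (is.na(n_cores)) n_cores <- 1L

# mclapply() relies on forking, which is unavailable on Windows
if (.Platform$OS.type == "windows") n_cores <- 1L

# Count the words in each phrase, one phrase per worker
token_counts <- mclapply(
  c("no acute process", "possible small effusion"),
  function(txt) length(strsplit(txt, " ")[[1]]),
  mc.cores = n_cores
)
unlist(token_counts)
```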
An empty data frame to hold the corpus is prepared first.
radiology_reports <- data_frame(list("0"))
colnames(radiology_reports) <- "reports"
The paths to each patient's text files are defined, keeping only the NOTEEVENTS files:
dirs <- list.dirs(path = location_in)
txts <- Sys.glob(file.path(dirs, "*.txt"))
txts_df <- data_frame(txts)
notes <- txts_df %>% filter(str_detect(txts,pattern = "NOTEEVENTS-"))
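To make the path lookup concrete, here is a small base R sketch that builds a toy nested layout in a temporary directory and finds the NOTEEVENTS files with a single recursive `list.files()` call. The folder names, the `LABEVENTS-` files and the exact file-name pattern are invented for illustration; only the `NOTEEVENTS-` prefix and the `.txt` extension come from the code above.

```r
# Build a toy version of the nested layout in a temporary directory.
# Folder and file names here are invented for illustration only.
root <- file.path(tempdir(), "mimic_demo")
patient_dirs <- file.path(root, c("group00/00001", "group00/00002"))
for (d in patient_dirs) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(d, paste0("NOTEEVENTS-", basename(d), ".txt")))
  file.create(file.path(d, paste0("LABEVENTS-", basename(d), ".txt")))
}

# One recursive call finds every NOTEEVENTS text file under the root
note_files <- list.files(root, pattern = "^NOTEEVENTS-.*\\.txt$",
                         recursive = TRUE, full.names = TRUE)
basename(note_files)
```

This does in one step what `list.dirs()`, `Sys.glob()` and `filter()` do together, at the cost of hard-coding the file-name pattern.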
An empty list is prepared for all of the individual reports.
rad_reports<- list(as.character(0))
The individual reports are extracted and appended to the whole list of reports. This takes a while. As downloaded from MIMIC2, each patient’s data is contained in a folder for that patient. That folder sits inside a folder holding the folders for a group of 5000 patients, which in turn sits inside the top-level folder downloaded from MIMIC2. Automating the extraction of all the individual patient data therefore requires considerable coding gymnastics, so please bear with all of the code necessary to do this.
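The extraction loop relies on a lazy match followed by a lookahead. The sketch below shows what that pattern captures on a made-up note; the note text is invented, and base R's `regmatches()` stands in for `str_extract()` so the example runs without extra packages.

```r
# Toy note text, invented for illustration
note <- paste("RADIOLOGY_REPORT CHEST (PORTABLE AP)",
              "FINDINGS: The lungs are clear.",
              "IMPRESSION: No acute process.")

# Same pattern as in the loop; perl = TRUE enables the (?=...) lookahead
pattern <- "(RADIOLOGY_REPORT).+?(?=IMPRESSION)"

# regexpr() returns the first match only, as str_extract() does
body <- regmatches(note, regexpr(pattern, note, perl = TRUE))
body
```

Two consequences are worth keeping in mind: the match stops just before the word IMPRESSION, so the impression text itself is not part of what gets analyzed, and only the first report in each file is captured.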
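for(i in 1:997) {
  # Read one NOTEEVENTS file and keep only its text
  record <- readtext(notes$txts[i])
  record <- record$text
  # Flatten line breaks so the pattern can match across lines
  rad1 <- record %>% str_replace_all("\n", " ")
  # Keep the text from RADIOLOGY_REPORT up to (not including) IMPRESSION
  rad2 <- rad1 %>% str_extract("(RADIOLOGY_REPORT).+?(?=IMPRESSION)")
  rad_reports <- append(rad_reports, rad2)
}
rad_reports <- data_frame(rad_reports)
colnames(rad_reports) <- "reports"
radiology_reports <- bind_rows(radiology_reports, rad_reports)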
radiology_reports2 <- unlist(radiology_reports)
write.csv(radiology_reports2, file = "radiology_reports.csv")
radiology_reports3 <- data_frame(radiology_reports2)
colnames(radiology_reports3) <- "reports"
reports_tok <- radiology_reports3 %>% unnest_tokens(text, reports)
reports_tok1 <- data_frame(str_replace_all(reports_tok$text, "([0-9])", ""))
colnames(reports_tok1) <- "text"
reports_tok2 <- reports_tok1 %>% filter(text != "")
reports_tok3 <- data_frame(str_replace_all(reports_tok2$text, "_", ""))
colnames(reports_tok3) <- "text"
reports_tok3 <- reports_tok3 %>% filter(text != "")
reports_tok3 %>% count(text, sort = TRUE)
## # A tibble: 8,977 x 2
## text n
## <chr> <int>
## 1 the 7524
## 2 and 5710
## 3 of 4661
## 4 with 4186
## 5 to 4013
## 6 is 3900
## 7 for 3357
## 8 in 3144
## 9 no 2893
## 10 a 2432
## # ... with 8,967 more rows
colnames(reports_tok3) <- "words"
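The tokenizing and cleaning steps above can be mimicked in base R on a couple of invented sentences. This is only a sketch of the idea, not a replacement for `unnest_tokens()`, which also lower-cases by default but splits words with its own tokenizer; the sample `reports` vector is made up.

```r
# Two invented report fragments
reports <- c("RADIOLOGY_REPORT 2 views of the chest.",
             "The lungs are clear; no effusion is seen.")

# Lower-case, then split on anything that is not a letter, digit or underscore
tokens <- unlist(strsplit(tolower(reports), "[^a-z0-9_]+"))

# Strip digits and underscores, then drop tokens that end up empty
tokens <- gsub("[0-9_]", "", tokens)
tokens <- tokens[tokens != ""]

# Frequency table, most common first
word_counts <- sort(table(tokens), decreasing = TRUE)
head(word_counts, 3)
```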
Now we can compare the reports to the dictionary using an inner join.
hedge <- read_csv("UncertaintyLexicon2.csv")
dodge <- read_csv("UncertaintyLexicon.csv")
Here is the whole dictionary:
dodge_words <- reports_tok3 %>% inner_join(dodge)
dodge_count <- dodge_words %>% count(words, sort = TRUE)
dodge_type <- dodge_words %>%
  inner_join(dodge_count) %>%
  distinct()
dodge_type$type <- factor(dodge_type$type)
dodge_type$words <- factor(dodge_type$words)
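For readers less familiar with joins, here is a base R sketch of the same idea on invented data: an inner join keeps only the tokens that appear in the lexicon, after which each word is counted. The column names `words` and `type` follow the code above; the tokens and lexicon entries are made up, and `merge()` and `aggregate()` stand in for `inner_join()` and `count()`.

```r
# Invented stand-ins for the tokenized reports and the lexicon
tokens  <- data.frame(words = c("no", "possible", "clear", "no", "the", "likely"))
lexicon <- data.frame(
  words = c("no", "clear", "possible", "likely", "definite"),
  type  = c("certain", "certain", "uncertain", "uncertain", "certain")
)

# Inner join: tokens without a lexicon entry ("the") drop out
matched <- merge(tokens, lexicon, by = "words")

# Count each matched word and keep its type
counts <- aggregate(n ~ words + type, data = transform(matched, n = 1), FUN = sum)
counts[order(-counts$n), ]
```

Lexicon words that never occur in the reports ("definite" here) also vanish from the result, so the counts describe only words that were actually used.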
dodge_type %>% ggplot(aes(x = forcats::fct_inorder(words), y = n)) +
  geom_bar(stat = "identity") +
  facet_wrap(~type, scales = "free_x") +
  coord_cartesian(xlim = c(0, 25)) +
  ggtitle("Radiology Report Word Counts",
          subtitle = "Each bar = total number of each word in lexicon") +
  xlab("Words") +
  ylab("Word Count") +
  theme(
    plot.title = element_text(hjust = .5, color = "blue"),
    plot.subtitle = element_text(hjust = .5),
    panel.grid.major.x = element_blank() )

## Wilcoxon rank sum test with continuity correction
## data: dodge_type$n by dodge_type$type
## W = 1020.5, p-value = 0.523
## alternative hypothesis: true location shift is not equal to 0
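The call that produced this output is not shown. Its `data:` line matches what base R's `wilcox.test()` prints when given the formula `dodge_type$n ~ dodge_type$type`, so that is presumably what was run. A small sketch of the same kind of test on invented counts follows; the numbers and the name `word_counts` are made up.

```r
# Invented per-word counts for two lexicon categories
word_counts <- data.frame(
  n    = c(120, 45, 30, 12, 8, 95, 60, 22, 15, 5),
  type = rep(c("certain", "uncertain"), each = 5)
)

# Two-sided Wilcoxon rank sum test of count by category
result <- wilcox.test(n ~ type, data = word_counts)
result$p.value
```

With small samples and no tied values, R computes an exact p-value; with ties, which are likely among real word counts, it falls back to a normal approximation with continuity correction, as in the output above.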
Well, it looks like the radiologists aren’t as ambiguous and uncertain as we thought. The distributions of certain vs. uncertain word counts appear almost identical, and the Wilcoxon rank sum test finds no significant difference between them (W = 1020.5, p = 0.523). So, our apologies to all radiologists!