Impact of the COVID-19 pandemic on cancer-related preprints
It may seem evident that preprints have served as an unprecedented way of communicating scientific results during the COVID-19 pandemic. I recently found a very interesting paper, posted on bioRxiv, that investigates the attributes of COVID-19 preprints (including preprinting and publishing behaviour) and the impact of the pandemic on the scientific publishing landscape. After reading this paper I wondered how much cancer-related preprints were affected during the pandemic ❓
library(tidyverse)
library(rcrossref) # to access DOIs registered with Crossref
library(roadoi) # to link DOIs and open access versions of papers
library(rvest) # to harvest (scrape) web pages and parse HTML/XML files
library(rentrez) # to get data from NCBI
library(XML) # to parse XML document
library(lubridate) # to deal with dates
To answer that question I am going to obtain data spanning from one year before the first reported COVID-19 cases (that is, from 31st December 2018) up to the day I started this analysis (18th August 2020). I am also going to use some of the functions the authors of that paper made publicly available on GitHub.
Retrieve preprint metadata via the bioRxiv API
max_results_per_page <- 100
base_url <- "https://api.biorxiv.org/details/"
start <- "2018-12-31"
end <- "2020-08-18"
getPreprintData <- function(server) {
  # Make initial request
  url <- paste0(base_url, server, "/", start, "/", end, "/", 0)
  request <- httr::GET(url = url)
  content <- httr::content(request, as = "parsed")
  # Determine total number of results and required iterations for paging
  total_results <- content$messages[[1]]$total
  pages <- ceiling(total_results / max_results_per_page) - 1
  data <- content$collection
  for (i in 1:pages) {
    # format the cursor without scientific notation,
    # otherwise page 100000 becomes 1e+05, which the API does not recognise
    cursor <- format(i * max_results_per_page, scientific = FALSE)
    url <- paste0(base_url, server, "/", start, "/", end, "/", cursor)
    # retry if server error
    request <- httr::RETRY("GET", url, times = 5, pause_base = 1, pause_cap = 60)
    content <- httr::content(request, as = "parsed")
    if (content$messages[[1]]$status == "ok") {
      data <- c(data, content$collection)
    } else {
      data <- c(data, list("error" = content$messages[[1]]$status))
    }
    Sys.sleep(1)
  }
  return(data)
}
Then we can retrieve data from bioRxiv and medRxiv using the getPreprintData function.
preprint_data <- purrr::map(c("biorxiv", "medrxiv"), getPreprintData)
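A quick sanity check on what came back (a minimal sketch; the exact numbers will depend on when the API is queried):
# number of preprint records retrieved per server (bioRxiv, then medRxiv)
purrr::map_int(preprint_data, length)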
This dataset is quite extensive, and retrieving all these data took a while to finish on my computer. Therefore, I am going to convert the obtained data into a tibble, do some transformations on it and save the final tibble for future access.
parsePreprintData <- function(item) {
  tibble(
    source = item$server,
    doi = item$doi,
    title = item$title,
    authors = item$authors,
    author_corresponding = item$author_corresponding,
    author_corresponding_institution = item$author_corresponding_institution,
    posted_date = item$date,
    version = item$version,
    license = item$license,
    type = item$type,
    category = item$category,
    abstract = item$abstract,
    is_published = item$published != "NA",
    published_doi = if (item$published == "NA") NA_character_ else item$published
  )
}
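# A quick illustration (just a sketch): parse one raw API record into a
# one-row tibble; this assumes the first element of preprint_data holds the
# bioRxiv records, as in the map() call above
parsePreprintData(preprint_data[[1]][[1]])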
# Build a search string containing terms related to COVID-19
search_string_covid <- "coronavirus|covid-19|sars-cov|ncov-2019|2019-ncov|hcov-19|sars-2"
# Build a search string containing terms related to cancer
search_string_cancer <- "cancer|tumor|adenocarcinoma|neoplasm|carcinoma"
# Set date of first case of COVID-19
covid_start <- "2019-12-31"
# Parse data to dataframe
preprints_all <- map_dfr(preprint_data, ~ map_df(.x, parsePreprintData)) %>%
  group_by(doi) %>%
  # calculate the number of versions of each preprint
  mutate(n_versions = n()) %>%
  ungroup() %>%
  # keep the first version record
  filter(version == 1) %>%
  select(-version) %>%
  mutate(
    # clean up DOIs for later matching
    doi = str_trim(str_to_lower(doi)),
    published_doi = str_trim(str_to_lower(published_doi)),
    # flag preprints as COVID-19-related, cancer-related or other,
    # based on terms found in the title or abstract
    covid_preprint = case_when(
      str_detect(title, regex(search_string_covid, ignore_case = TRUE)) & posted_date > covid_start ~ "covid",
      str_detect(abstract, regex(search_string_covid, ignore_case = TRUE)) & posted_date > covid_start ~ "covid",
      str_detect(title, regex(search_string_cancer, ignore_case = TRUE)) ~ "cancer",
      str_detect(abstract, regex(search_string_cancer, ignore_case = TRUE)) ~ "cancer",
      TRUE ~ "other"),
    # count authors from the semicolon-separated author string
    n_authors = map_int(authors, ~ length(strsplit(., split = ";")[[1]]))
  ) %>%
  # some duplicates are included
  distinct() %>%
  # reorder columns
  select(source, doi, posted_date, covid_preprint, title,
         abstract, n_versions, license, type, category, authors, n_authors,
         author_corresponding, author_corresponding_institution,
         is_published, published_doi)
preprints_all %>%
write_csv("data/preprints_basic_20181231_20200818.csv")
To save the data and also be able to upload them to GitHub (the file size limit on GitHub is 100 MB), I am going to split the data into two parts:
- Preprints from 2018-12-31 up to the COVID-19 start date, 2019-12-31.
- Preprints after the COVID-19 start date, 2019-12-31, until today, 2020-08-18.
preprints_all %>%
filter(posted_date <= "2019-12-31") %>%
write_csv("data/preprints_basic_beforeCV_20181231_20191231.csv")
preprints_all %>%
filter(posted_date > "2019-12-31") %>%
write_csv("data/preprints_basic_afterCV_20191231_20200818.csv")
We can read and merge these data together.
preprints_bf <- read.csv("data/preprints_basic_beforeCV_20181231_20191231.csv")
preprints_af <- read.csv("data/preprints_basic_afterCV_20191231_20200818.csv")
preprints_all <- rbind(preprints_bf, preprints_af)
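Before plotting, a brief sanity check that the merged data look sensible (a minimal sketch using the columns created above):
# number of preprints, date range covered and preprints per group
nrow(preprints_all)
range(preprints_all$posted_date)
preprints_all %>% count(covid_preprint)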
Plotting the data
Let’s first inspect the number of cancer-related and COVID-19-related preprints posted between 31st December 2018 and 18th August 2020.
preprints_all %>%
  mutate(posted_date = lubridate::ymd(posted_date)) %>%
  mutate(posted_month = floor_date(posted_date, unit = "month")) %>%
  group_by(covid_preprint, posted_month) %>%
  count() %>%
  mutate_if(is.character, as.factor) %>%
  ggplot(aes(x = posted_month, y = n, fill = covid_preprint)) +
  geom_col(alpha = 0.8) +
  labs(x = "Posted date (by months)", y = "Preprints deposited") +
  theme(axis.text.x = element_text(angle = 90))
Wow! The number of COVID-19 preprints increased tremendously along with the expansion of the COVID-19 pandemic, reaching up to 2,000 preprints deposited in May (see preprint surge).
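A quick way to check the peak month for COVID-19 preprints (a minimal sketch reusing preprints_all):
# month with the highest number of COVID-19 preprints
preprints_all %>%
  filter(covid_preprint == "covid") %>%
  mutate(posted_month = floor_date(lubridate::ymd(posted_date), unit = "month")) %>%
  count(posted_month) %>%
  arrange(desc(n)) %>%
  slice(1)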
Early in May, preprint servers enhanced their usual screening procedures for manuscripts related to COVID-19, adopting more restrictive publication norms such as barring manuscripts based solely on computational models, and this could be one of the causes of the drop in COVID-19 preprints after May.
Nevertheless, the majority of preprints were neither cancer-related nor COVID-19-related.
The number of cancer-related preprints did not decrease as the pandemic expanded; if anything, there is a trend towards more cancer-related preprints during 2019 and 2020. Despite many labs (including mine) being closed during the second quarter of 2020, the number of cancer-related preprints did not decrease during the early days of the pandemic. Researchers may have had more time to put together results for preprint publication. I wonder, though, if any category within biology research was more affected than others 🤔.
preprints_all %>%
  mutate(posted_date = lubridate::ymd(posted_date)) %>%
  mutate(posted_month = floor_date(posted_date, unit = "month")) %>%
  group_by(covid_preprint, posted_month) %>%
  count() %>%
  mutate_if(is.character, as.factor) %>%
  filter(covid_preprint == "cancer") %>%
  ggplot(aes(x = posted_month, y = n, fill = covid_preprint)) +
  geom_col() +
  labs(x = "Posted date (by months)", y = "Preprints deposited") +
  theme(axis.text.x = element_text(angle = 90))
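To follow up on the question above about which research categories were most affected, here is a minimal sketch using the category field returned by the API; it assumes the server field is reported in lowercase ("biorxiv"), as used in the query:
# monthly counts of non-COVID bioRxiv preprints, split by category
preprints_all %>%
  filter(source == "biorxiv", covid_preprint != "covid") %>%
  mutate(posted_month = floor_date(lubridate::ymd(posted_date), unit = "month")) %>%
  count(category, posted_month) %>%
  ggplot(aes(x = posted_month, y = n)) +
  geom_col() +
  facet_wrap(~ category, scales = "free_y") +
  labs(x = "Posted date (by months)", y = "Preprints deposited") +
  theme(axis.text.x = element_text(angle = 90))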
How many of the preprints posted after the WHO declared COVID-19 a pandemic on 11th March 2020 were later published in peer-reviewed journals?
preprints_all %>%
  mutate(posted_date = lubridate::ymd(posted_date)) %>%
  filter(posted_date >= "2020-03-11") %>%
  group_by(covid_preprint, is_published) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n)) %>%
  # keep only the fraction of preprints that were published
  filter(is_published) %>%
  ggplot(aes(x = covid_preprint, y = freq, fill = covid_preprint)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(x = "", y = "% of preprints published")
Almost double the proportion of COVID-19 preprints were subsequently published in peer-reviewed journals. This could have to do with the urgency of disseminating scientific knowledge during the SARS-CoV-2 pandemic and how little was known about the virus.
Additional explorations of these data
There are some additional explorations one could do with these data:
- Percentage of downloads: one could use the getUsageData function to get preprint metrics. Did scientists stop downloading cancer-related preprints in favour of COVID-19 preprints, or did the number of downloads of cancer-related preprints remain stable or grow during the pandemic? (See the sketch after this list.)
- Comments per preprint: Disqus is a commenting platform that allows readers to post text comments and is featured on bioRxiv and medRxiv HTML pages. In the Fraser preprint, the authors found that non-COVID-19 preprints were rarely commented upon, in comparison to COVID-19 preprints (Fig 5D). One could explore whether the engagement level of cancer-related preprints increased after the outbreak of the COVID-19 pandemic (if so, this could be because scientists working in this research field would have had more office time to read and comment on these preprints).
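For the downloads question, here is only a rough sketch. It assumes a hypothetical helper (called get_usage_data below, standing in for the getUsageData function mentioned above) that returns one row per DOI with a pdf_downloads column; the real function in the authors' repository may well have a different interface.
# hypothetical: get_usage_data() is assumed to return a tibble with columns
# doi and pdf_downloads; swap in the output of the authors' getUsageData()
usage <- get_usage_data(preprints_all$doi)
preprints_all %>%
  left_join(usage, by = "doi") %>%
  mutate(posted_month = floor_date(lubridate::ymd(posted_date), unit = "month")) %>%
  group_by(covid_preprint, posted_month) %>%
  summarise(median_downloads = median(pdf_downloads, na.rm = TRUE)) %>%
  ggplot(aes(x = posted_month, y = median_downloads, color = covid_preprint)) +
  geom_line() +
  labs(x = "Posted date (by months)", y = "Median PDF downloads per preprint")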