The Data Daily

Gene Updater: a web tool that autocorrects and updates for Excel misidentified gene names - Scientific Reports

Gene Updater: a web tool that autocorrects and updates for Excel misidentified gene names - Scientific Reports

When gene expression datasets are opened with Excel under default settings (Microsoft Corp., Redmond, WA), a recurring problem where gene names are converted to dates occurs. Similarly, if gene names are copied from another application (e.g. text processors) and pasted into an Excel spreadsheet without specifying cell formatting, conversion of gene names to dates can occur. While Excel is popular and widely used in data analysis, these auto-conversions can affect pathway enrichment analysis, as many of the pathway enrichment tools such as Enrichr, Gene set enrichment analysis (GSEA) and Ingenuity Pathway Analysis rely on gene symbols to query against pathway databases such as Gene Ontology and Reactome. As dates are not recognized by these pathway databases, this can result in voids in pathway enrichment analysis. For instance, septins (e.g. SEPT1), which are involved in cell division, are internally converted to SEP-01 in Excel, which cannot be recognized by other databases. This problem has become so rampant that approximately one-fifth of the published papers with supplementary Excel gene lists contain erroneous gene name conversions. As many of these datasets are frequently accessed by other data scientists, such errors may be carried over to other scientific publications, resulting in further distortion of downstream data analysis.

To tackle this issue, the HUGO Gene Nomenclature Committee (HGNC) announced in 2017 to update the gene names that may be unintentionally converted to dates in Excel files. This movement was well-received by researchers and data scientists, as changing to the updated gene names would allow sharing of gene expression data without worrying about the automatic conversion of gene symbols to dates in Excel. However, at present, most of the published gene expression data are not updated to the newly approved gene names, especially in the microarray datasets. We thus developed a Gene Updater web tool that allows researchers to convert the previous gene names to the newly approved gene names recommended by HGNC. Moreover, if the gene names are unintentionally converted to dates by Excel, the web tool allows researchers to rectify these terms back to the correct gene names. We believe that these efforts will facilitate gene expression data sharing between researchers who may be working on different analytics platforms.

The Gene Updater webtool is publicly available at: https://share.streamlit.io/kuanrongchan/date-to-gene-converter/main/date_gene_tool.py. The code was written with the Python programming language (https://www.python.org/) and the web tool is made with Streamlit (https://www.streamlit.io). To run the app locally, several freely available packages are required: pandas, numpy, regex, inflect, dateparser, streamlit, streamlit-tags, openpyxl, xlrd, and XlsxWriter. The recommended versions are as follows: pandas > = 1.2.5, numpy > = 1.19.5, regex > = 2021.8.3, inflect > = 5.3.0, dateparser > = 1.1.0, streamlit > = 1.8.1, streamlit-tags > = 1.2.8, openpyxl > = 3.0.9, xlrd > = 2.0.1, XlsxWriter > = 3.0.2. The MIT licence version 0.1.0 is also added to the source package on GitHub as an open source licence.

The up-to-date codes and new releases will be made available on GitHub, including the step-by-step protocol information on running the app locally: https://github.com/kuanrongchan/date-to-gene-converter (Zenodo; https://doi.org/10.5281/zenodo.6845701). This page can also be used to communicate any issues, queries, or request features. Otherwise, users can contact the developers via email.

Users can directly upload data from Excel spreadsheets or csv files containing gene names into Gene Updater (https://share.streamlit.io/kuanrongchan/date-to-gene-converter/main/date_gene_tool.py). The old gene names will be automatically updated to the new gene names with Gene Updater. If genes were converted to dates by Excel, these dates will also be automatically converted to the updated gene names, except for Mar-01 and Mar-02 as these dates can be potentially mapped to more than one gene (Fig. 1). The conversion of Mar-01 to either MTARC1 or MARCHF1, as well as Mar-02 to either MTARC2 or MARCHF2 can be assigned by the user within the Gene Updater web tool. The output is an Excel data file containing the updated HUGO gene names which can be downloaded for further downstream analysis.

The user interface starts with a file uploader that enables users to upload their .csv or .xlsx file(s). Multiple files can also be uploaded, as long as the first column contains the gene names. If the gene file contains Mar-01 and Mar-02, we encourage having an identifier (e.g. gene description) on the second column so that the identities of MARCH1/MARC1 and MARCH2/MARC2 can be easily resolved with the Gene Updater. Users do not have to remove the other data columns in their Excel or csv files to use the web tool. The checkbox located at the sidebar allows users to inspect that the correct data file is uploaded. To demonstrate the features of the app, a demo dataset containing Excel converted gene terms, gene descriptions and numeric values is pre-loaded if no data files are uploaded into Gene Updater.

The human gene names that are converted to dates in Excel, with their updated approved gene names and descriptions are detailed in Table 1. With the exception for Mar-01 and Mar-02, all other genes modified by Excel can be mapped to a unique HUGO gene (Table 1). To examine the impact of omitting these genes due to Excel conversions on pathway analysis, we examined the biological processes modulated by these genes. Pathway enrichment analysis against the Gene Ontology (GO) Biological Processes database highlighted that these genes play a critical role in cell division, exocytosis, cilium assembly, ubiquitination, and nitric oxide biosynthesis (Fig. 2). Specifically, SEPTIN1-14 is encoded for pathways related to cytoskeleton-dependent cytokinesis, some of which are also involved in other specialised functions such as exocytosis, secretion and cilium assembly. On the other hand, MARCH1-8 are involved in protein ubiquitination whereas MTARC1 and MTARC2 are involved in nitric oxide biosynthesis and metabolic processes (Fig. 2). Overall, our results highlight the voids in pathway enrichment analyses if these gene names are auto-converted to dates in Excel.

To resolve gene duplicates related to Mar-01, Gene Updater converts the first instance of Mar-01 to Mar-01_1st and the second instance to Mar-01_2nd. Using the dropdown widget, users can assign the Mar-01_1st and Mar-01_2nd based on gene description or by any other unique identifiers (Fig. 3). The same procedure is then repeated for duplicates related to Mar-02. After the conversion of these dates, the output dataframe or file should have gene names that are updated to the new HUGO gene names.

The converted dataframe or file can be inspected within the Gene Updater web tool. Users can use the multi-query search bar to verify that the gene names are successfully updated. The output file with the updated gene names can then be exported as an Excel file.

To evaluate the utility of Gene Updater in resolving Excel misidentified gene names, we leveraged on Mark Ziemann’s dashboard (http://ziemann-lab.net/public/gene_name_errors/Report_2022-05.html#journals-affected) to extract text and Excel files from various journals, including BMC Genomics, Nature, Genome Biology, Nucleic Acids Research, Human Molecular Genetics, BMC Bioinformatics, Nature Communications, PLoS One, Genome Research, Genes Development and RNA that were published in June 2022 (Table 2). A total of 356 text and Excel files were found, of which 81 of these files contain gene terms or gene symbols. Notably, 28 (34.6%) of the files with gene symbols contained date-related errors, highlighting the significance of having a tool to correct for these misidentified date terms (Table 2). Gene Updater was able to autocorrect for the majority of these files (78.6%), except for the files which contain Mar-01 and Mar-02 terms, but did not provide an accompanying unique identifier such as gene description or gene information (Table 2, right most column). These findings highlight the importance and utility of Gene Updater in rectifying misidentified gene terms, and emphasise the need to include a gene description column to better resolve MARC1, MARCH1, MARC2 and MARCH2 gene terms.

Presently, the two tools that can potentially convert dates to gene names are Oct4th (https://oct4th.sandbox.bio) and Truke (http://maplab.imppc.org/truke/). However, Oct4th only works on gene data files that have not been manipulated and processed in Excel. Moreover, the tool is presently unable to convert to the updated gene names, which are more resilient to auto-conversion. Truke can potentially convert the date formats to gene names, but can only convert dates that are labelled in the dd/mm/yy format, and process files one at a time. In contrast, our Gene Updater Streamlit web tool can process multiple .csv and .xlsx files, and takes into account the different kinds of date formatting that are converted by Excel, thus allowing faster and more efficient processing of dates to genes as compared to other existing web tools.

Images Powered by Shutterstock