SABOC: Journal Publication Announcement
Title: Concept placement using BERT trained by transforming and summarizing biomedical ontology structure
Authors: Hao Liu, Yehoshua Perl, James Geller
Journal: Journal of Biomedical Informatics Volume 112, December 2020, 103607
URL: https://www.sciencedirect.com/science/article/pii/S1532046420302355
Introduction to the General Area:
Modern medicine relies on standardized catalogs of medical terms. These catalogs are called ontologies. Terms with the same meaning (for example, "heart attack" has the same meaning as "myocardial infarction") are collected together, and each such collection is called a "concept."
Concepts may be very general ("disease") or very specific ("Left Anterior Descending Occlusion") or anything in between. The power of ontologies that distinguishes them from dictionaries is that every concept is connected to one or more other concepts that are "minimally" more general. This can be written as a text triple ("cardiac arrest," "is a kind of," "heart disease") or visualized as an arrow from a text box "cardiac arrest" to a text box "heart disease." Such an arrow is commonly called an IS-A link, and one can read the above example as "cardiac arrest IS-A heart disease." Combining many such triples together results in a complex network of text boxes and arrows (Figure 1). For example, by adding a triple "heart disease IS-A disease," a network connecting "cardiac arrest" to "disease" is created.
Figure 1: Two "concept IS-A concept" triples are combined into a simple network.
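For readers who think in code, the following tiny Python sketch (not from the paper) shows how IS-A triples combine into such a network. The concept names follow the examples above; the parent map and the helper function are purely illustrative.

    # Illustrative sketch: IS-A triples stored as a parent map form a network.
    # Concept names follow the examples above; this is not the paper's data structure.
    is_a = {
        "cardiac arrest": ["heart disease"],
        "heart disease": ["disease"],
    }

    def ancestors(concept):
        """Follow IS-A links upward to collect all more general concepts."""
        result = []
        for parent in is_a.get(concept, []):
            result.append(parent)
            result.extend(ancestors(parent))
        return result

    print(ancestors("cardiac arrest"))   # -> ['heart disease', 'disease']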
A medical ontology is only widely useful if it is large, which means it contains hundreds of thousands of concepts. This makes it extremely difficult for anybody but its creator(s) to understand its general structure and to know "what is where." At a certain size, even its creators will start to struggle with orientation and comprehension.
The problem is complicated by the fact that updates are provided every year (and for some ontologies several times a year) for two reasons: (1) new concepts need to be included (there was no COVID-19 in the summer of 2019), and (2) structural inconsistencies and content errors are found by users and need to be corrected. For the SNOMED CT ontology, with over 350,000 concepts, there are new releases twice a year. The process of including new concepts at their correct places in SNOMED CT is extremely difficult and requires an expert who is familiar with both medicine and the theory of ontologies. Thus, there is a desire to provide software tools that correctly suggest where to place many or most new concepts in a new release of SNOMED CT. It is easier for an expert to accept or reject such a suggested placement than to determine a placement without any support. This is the topic of our research.
To achieve a better visual intuition, new concepts are often called "children," and the concepts under which they are placed are called "parents," just like in a genealogy. In other words, our research deals with finding parent concepts for new child concepts, or, as a concrete example, should we place "COVID-19" under the concept "viral disease" or under the concept "virus"?
Summary of the Research:
This paper proposes a method to automatically predict the presence of IS-A relationships between a new concept and pre-existing concepts, based on the language representation model BERT, which was developed and made widely available by Google. BERT is a machine learning program, which means it can be trained with known, correct data (= training data) to make predictions about new data. One of BERT's built-in capabilities, called Next Sentence Prediction, is to predict whether a sentence B makes sense immediately after a sentence A.
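As an illustration only, the following Python sketch shows how Next Sentence Prediction can be queried with the publicly available Hugging Face "transformers" library. The generic "bert-base-uncased" model and the two example sentences are assumptions for this sketch, not the model or sentences used in the paper.

    # Minimal sketch of BERT Next Sentence Prediction with Hugging Face transformers.
    # Model name and sentences are illustrative, not those used in the paper.
    import torch
    from transformers import BertTokenizer, BertForNextSentencePrediction

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

    sentence_a = "COVID-19 is a kind of viral disease."
    sentence_b = "Viral disease is a kind of infectious disease."

    inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Index 0 of the logits corresponds to "sentence B follows sentence A".
    prob_is_next = torch.softmax(logits, dim=1)[0, 0].item()
    print(f"Probability that B is the next sentence: {prob_is_next:.2f}")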
Our method converts the neighborhood network of a concept into "sentences," which means we are working with the textual representation of triples. Then we use BERT's capability of predicting the adjacency of two sentences. Whenever BERT predicts that the sentence representation of an existing concept is the next sentence after the sentence representation of a new concept, the system proposes to the user to place an IS-A link from the new concept to that existing concept. With this, the placement of the new concept in SNOMED CT has been found. (This is a simplification, because a concept could have several parents.)
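The exact sentence templates and decision rule are described in the paper itself. As a rough, hypothetical sketch of the idea, the template, concept names, and the 0.5 threshold below are invented for illustration.

    # Hypothetical sketch: render a concept's IS-A neighborhood as a "sentence"
    # and turn a next-sentence probability into a placement proposal.
    # The sentence template and the 0.5 threshold are illustrative assumptions.
    def neighborhood_sentence(concept, parents):
        """Render a concept and its known parents as one plain-text sentence."""
        return " ; ".join(f"{concept} is a kind of {p}" for p in parents) + " ."

    def propose_is_a(prob_is_next, threshold=0.5):
        """Propose an IS-A link when BERT judges the candidate parent's sentence
        to be a plausible next sentence for the new concept's sentence."""
        return prob_is_next >= threshold

    print(neighborhood_sentence("viral disease", ["infectious disease", "disorder"]))
    print(propose_is_a(0.93))   # -> True: suggest the IS-A link to the curator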
Our method is successful as described, but we also developed a further improvement. To boost its performance, we refined the training data by employing an ontology summarization technique. Ontology summarization has been a specialty of the SABOC team since the 1990s.
We trained our BERT model with the two largest hierarchies of the July 2017 release of SNOMED CT and applied it to predicting the parents of new concepts added in the January 2018 release. Thus, we could compare our results with the actual placement of concepts in the new release to compute how successful our method is. The results showed that the trained BERT model achieved an average F1 score (a measure of success between 0 and 1) of 0.87 in testing with 8,574 concept pairs containing 2,005 new concepts in the Clinical Finding hierarchy of SNOMED CT.
The average F1 score in testing with 3,908 concept pairs containing 911 new concepts for the Procedure hierarchy was 0.82. Furthermore, we employed the SABOC Area Taxonomy summarization technique to refine the training data. This resulted in a higher Recall (another measure of success between 0 and 1). Ontology curators can benefit from this higher Recall, since it indicates that the trained model misses fewer of the proper parents for a given concept.
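For readers unfamiliar with these measures, here is a small, made-up example of how Precision, Recall, and the F1 score are computed for the predicted parents of a single new concept. The concept names and numbers are invented for illustration and do not come from the paper.

    # Made-up example of Precision, Recall, and F1 for one new concept.
    true_parents = {"viral disease", "respiratory disorder"}   # actual parents in the new release
    predicted_parents = {"viral disease", "virus"}             # parents proposed by the model

    true_positives = len(true_parents & predicted_parents)     # 1 correct proposal
    precision = true_positives / len(predicted_parents)        # 1 / 2 = 0.50
    recall = true_positives / len(true_parents)                # 1 / 2 = 0.50
    f1 = 2 * precision * recall / (precision + recall)         # 0.50

    print(precision, recall, f1)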
The proposed method can not only identify potential parents of a new concept but also filter out irrelevant concepts, reducing the number of improper placement choices for a concept. Therefore, the proposed method can save ontology maintenance staff time and effort that would be needed to search for those parent concepts manually.