Comparative study of Named-Entity Recognition methods in the agronomical domain

Title: Comparative study of Named-Entity Recognition methods in the agronomical domain
Publication Type: Conference Paper
Year of Publication: Submitted
Authors: Do H, Tran H, Khoat Than Q, Larmande P
Conference Name: CiCling
Abstract

Text mining is becoming an important research topic in biology, where its original purpose is to extract biological entities such as genes, proteins and traits from scientific papers in order to extend existing knowledge. However, few thorough studies of text mining and its applications have been carried out for plant molecular biology data, especially for rice, which results in a lack of datasets available to train models. Since there are few or no benchmarks for rice, applying advanced machine learning methods to the accurate analysis of the rice literature faces various difficulties. In this article, we developed a new training dataset (Oryzabase) to serve as a benchmark. We then evaluated the performance of several current approaches to find the methodology with the best results, and adopted it as the state-of-the-art baseline for our own future technique. We applied a Named Entity Recognition (NER) tagger, built from a Long Short-Term Memory model combined with Conditional Random Fields (LSTM-CRF), to extract information on plant genes and proteins. We analyzed the performance of LSTM-CRF when applied to the Oryzabase dataset and improved the results to up to 89% in F1. We found that, on average, the results from LSTM-CRF are more exploitable with the new benchmark.

URL: https://hal.archives-ouvertes.fr/hal-01711331