Machine Learning for Identifying Relevance to Biosurveillance in Multilingual Text


Global biosurveillance is an extremely important, yet challenging task. One form of global biosurveillance comes from harvesting open source online data (e.g. news, blogs, reports, RSS feeds). The information derived from this data can be used for timely detection and identification of biological threats all over the world. However, the more inclusive the data harvesting procedure is to ensure that all potentially relevant articles are collected, the more data that is irrelevant also gets harvested. This issue can become even more complex when the online data is in a non-native language. Foreign language articles not only create language-specific issues for Natural Language Processing (NLP), but also add a significant amount of translation costs. Previous work shows success in the use of combinatory monolingual classifiers in specific applications, e.g., legal domain. A critical component for a comprehensive, online harvesting biosurveillance system is the capability to identify relevant foreign language articles from irrelevant ones based on the initial article information collected, without the additional cost of full text retrieval and translation.


The objective is to develop an ensemble of machine learning algorithms to identify multilingual, online articles that are relevant to biosurveillance. Language morphology varies widely across languages and must be accounted for when designing algorithms. Here, we compare the performance of a word embedding-based approach and a topic modeling approach with machine learning algorithms to determine the best method for Chinese, Arabic, and French languages.

Primary Topic Areas: 
Original Publication Year: 
Event/Publication Date: 
January, 2018

January 25, 2018

Contact Us

NSSP Community of Practice



This website is supported by Cooperative Agreement # 6NU38OT000297-02-01 Strengthening Public Health Systems and Services through National Partnerships to Improve and Protect the Nation's Health between the Centers for Disease Control and Prevention (CDC) and the Council of State and Territorial Epidemiologists. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of CDC. CDC is not responsible for Section 508 compliance (accessibility) on private websites.

Site created by Fusani Applications