RESEARCH OF METHODS FOR TEXT VECTORIZATION IN THE TASKS OF VALIDATING ANSWERS PRESENTED IN NATURAL LANGUAGE

K.T. Kuzma; O.V. Melnyk

doi:10.32851/tnv-tech.2021.6.5

Authors

K.T. Kuzma http://orcid.org/0000-0002-0937-7299
O.V. Melnyk http://orcid.org/0000-0002-9778-4109

Keywords:

answer given in text form, natural language text, open-type answer, text vectorization, bag-of-words model, TF, IDF, TF-IDF, TF-PWI, set of features for text vectorization

Abstract

The growing need to process natural language texts intelligently in automated testing motivates this research. Because open-type answers in testing systems are natural language texts, processing them is an applied text processing problem. Every applied text processing problem that is solved with machine learning or neural networks requires vectorization, that is, the conversion of text into numerical values. The aim of the article is to study models and methods of text vectorization for processing answers given in natural language. In the first stage, the basic applied problems of text processing are investigated and classified. The article substantiates assigning the problem of checking natural language answers, as treated in this research, to text classification and semantic analysis. In the second stage, the basic models for representing text in numerical form are analyzed: bag-of-words and distributional semantics. The use of the bag-of-words model for processing open-type answers is substantiated: the vocabulary used to encode the collection of correct answers, together with the frequency with which its words occur in the answers of the “training” and “test” sets, is sufficient to determine the answer class. The feature vector in this problem is the frequency of tokens (character or word uni-, bi-, or n-grams) from the dictionary built on the training sample, counted in the answers of the “training” and “test” data sets. In the third stage, approaches to calculating the feature vector are investigated: absolute frequency (TF), relative frequency (TF-IDF), and pointwise mutual information (PWI); the advantages and disadvantages of each are identified.
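The weighting schemes named above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the whitespace tokenizer, the smoothed IDF variant, the answer-level estimate of pointwise mutual information between a token and an answer class (assumed here to be what the abstract abbreviates as PWI), and the toy answers are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an illustrative assumption)."""
    return text.lower().split()


def build_vocabulary(train_answers):
    """Sorted list of distinct tokens seen in the training answers."""
    return sorted({tok for ans in train_answers for tok in tokenize(ans)})


def tf_vector(answer, vocab):
    """Absolute frequency (TF) of each vocabulary token in one answer."""
    counts = Counter(tokenize(answer))
    return [counts[tok] for tok in vocab]


def idf_weights(train_answers, vocab):
    """Smoothed IDF, log((1 + N) / (1 + df)) + 1 (one common variant)."""
    n = len(train_answers)
    token_sets = [set(tokenize(ans)) for ans in train_answers]
    weights = []
    for tok in vocab:
        df = sum(tok in s for s in token_sets)  # answers containing tok
        weights.append(math.log((1 + n) / (1 + df)) + 1)
    return weights


def tfidf_vector(answer, vocab, idf):
    """TF-IDF: term frequency scaled by the token's IDF weight."""
    return [tf * w for tf, w in zip(tf_vector(answer, vocab), idf)]


def pmi(token, label, train_answers, labels):
    """PMI(token, label) = log(P(token, label) / (P(token) * P(label))).

    Probabilities are estimated as fractions of training answers.
    Returns -inf when the token never occurs in answers with this label.
    """
    n = len(train_answers)
    has_token = [token in set(tokenize(ans)) for ans in train_answers]
    has_label = [lab == label for lab in labels]
    joint = sum(t and l for t, l in zip(has_token, has_label)) / n
    if joint == 0:
        return float("-inf")
    return math.log(joint / ((sum(has_token) / n) * (sum(has_label) / n)))


# Hypothetical toy data, invented for this sketch.
train = ["the sun is a star", "the sun is a planet", "a star is the sun"]
labels = ["correct", "incorrect", "correct"]

vocab = build_vocabulary(train)
idf = idf_weights(train, vocab)
tf = tf_vector("the sun is a star star", vocab)
tfidf = tfidf_vector("the sun is a star star", vocab, idf)
pmi_star = pmi("star", "correct", train, labels)
```

In this toy data the token "star" occurs only in answers labelled correct, so its PMI with that class is positive, while a token that occurs in every answer gets the minimum IDF weight of 1.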
In the last stage, the following combinations of feature sets are proposed for vectorizing texts when processing answers given in natural language: bag-of-words model and TF; bag-of-words model and TF-IDF; word n-grams and TF-IDF; character n-grams and TF-IDF; bag-of-words model and TF-PWI. The proposed feature sets and their combinations are a means of improving a machine learning model for the task of checking answers given in natural language. Further research will be aimed at developing a machine learning model for this problem and testing it experimentally with the proposed feature sets in order to obtain an effective mathematical model.
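The difference between word and character n-gram features can be sketched as follows. This is a hedged illustration, not the authors' code: the function names, the whitespace tokenization, the choice of n, and the toy answers are assumptions. The resulting count vectors are raw token frequencies; a weighting such as TF-IDF would be applied on top of them.

```python
from collections import Counter


def word_ngrams(text, n):
    """Contiguous sequences of n words, joined by a single space."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(text, n):
    """Contiguous sequences of n characters, spaces included."""
    chars = text.lower()
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]


def fit_vocabulary(train_answers, analyzer):
    """Dictionary of tokens produced by the analyzer on the training set."""
    return sorted({tok for ans in train_answers for tok in analyzer(ans)})


def count_vector(answer, vocab, analyzer):
    """Frequency of each dictionary token in one answer."""
    counts = Counter(analyzer(answer))
    return [counts[tok] for tok in vocab]


def word_bigrams(text):
    return word_ngrams(text, 2)


def char_trigrams(text):
    return char_ngrams(text, 3)


# Hypothetical toy data, invented for this sketch.
train_answers = ["the sun is a star", "the sun is a planet"]

word_vocab = fit_vocabulary(train_answers, word_bigrams)
char_vocab = fit_vocabulary(train_answers, char_trigrams)

# Tokens absent from the training dictionary are simply ignored.
test_vec = count_vector("a star is the sun", word_vocab, word_bigrams)
```

Because the dictionary is built only from the training sample, a test answer is encoded by the tokens it shares with that sample. Character n-grams tend to tolerate typos and inflected forms better than word n-grams, at the cost of a larger dictionary.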
