MODIFICATION OF THE FUZZY SEARCH ALGORITHMS TO USE A SYMBOLS SIMILARITY TABLE

Authors

К. О. Клещ, М. О. Царьов

DOI:

https://doi.org/10.32782/tnv-tech.2023.3.3

Keywords:

fuzzy search, symbols similarity table, Damerau-Levenshtein algorithm, text data processing, edit distance.

Abstract

The object of the research is a fuzzy search algorithm based on the Damerau-Levenshtein distance and a symbols similarity table. The work investigates and analyzes how the symbols similarity table can be integrated with the Damerau-Levenshtein fuzzy search algorithm and provides recommendations for doing so. Research on fuzzy search algorithms for text is an important topic in information retrieval and text processing, driven by the growing volume of information and the likelihood of human error during text composition. Fuzzy search uses algorithms to find data in a text that approximately matches a pattern, by comparing strings or keywords that are similar but not identical. A symbols similarity table, which determines the degree of similarity between pairs of characters, can be employed in such a search. Combining the fuzzy search algorithm with the symbols similarity table gives more accurate and personalized access to large volumes of text-based information. The study presents a comparative analysis of the effectiveness and correctness of fuzzy search algorithms with and without a symbols similarity table, as well as of an exact search algorithm. Using the symbols similarity table improves the results, especially for languages that feature special characters: significantly more relevant results are found, although the algorithm runs more slowly. The results could contribute significantly to the improvement of search systems, enabling users to find relevant documents even when the query contains spelling errors, synonyms, abbreviations, or other inaccuracies.
The approach of using a symbols similarity table could be applied in spell-checking and automatic correction systems, in auto-suggestion and auto-completion systems, and in implementing plagiarism detection and duplicate-data detection.
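
The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, which this page does not show. The sketch assumes the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance and assumes that the similarity table lowers the substitution cost for similar characters. The table contents and the names `SIMILARITY`, `substitution_cost`, `weighted_distance`, and `fuzzy_find` are hypothetical.

```python
from typing import Dict, FrozenSet, List

# Hypothetical similarity table: unordered character pair -> similarity in [0, 1].
# The pairs and values are illustrative assumptions, not taken from the article.
SIMILARITY: Dict[FrozenSet[str], float] = {
    frozenset(("e", "\u00e9")): 0.9,  # "e" and "e with acute"
    frozenset(("o", "\u00f6")): 0.9,  # "o" and "o with diaeresis"
    frozenset(("i", "\u0456")): 0.9,  # Latin "i" and Cyrillic "i"
}


def substitution_cost(a: str, b: str) -> float:
    """Return 0 for identical characters, otherwise 1 minus their similarity."""
    if a == b:
        return 0.0
    return 1.0 - SIMILARITY.get(frozenset((a, b)), 0.0)


def weighted_distance(s: str, t: str) -> float:
    """Restricted Damerau-Levenshtein distance with table-based substitution cost."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = float(i)  # delete all i characters of the prefix of s
    for j in range(cols):
        d[0][j] = float(j)  # insert all j characters of the prefix of t
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + 1.0,  # deletion
                d[i][j - 1] + 1.0,  # insertion
                d[i - 1][j - 1] + substitution_cost(s[i - 1], t[j - 1]),
            )
            # Transposition of two adjacent characters costs one edit.
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)
    return d[rows - 1][cols - 1]


def fuzzy_find(words: List[str], pattern: str, max_distance: float) -> List[str]:
    """Return the words whose weighted distance to the pattern is within the limit."""
    return [w for w in words if weighted_distance(w, pattern) <= max_distance]
```

With this table, `fuzzy_find(["caf\u00e9", "cafe", "cave"], "cafe", 0.5)` keeps the accented spelling and the exact match but rejects `"cave"`, whereas an unweighted distance would treat the accented spelling and `"cave"` as equally far from the pattern. The extra table lookup on every substitution is one plausible source of the slowdown the abstract mentions.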

Published

2023-10-09

How to Cite

Клещ, К. О., & Царьов, М. О. (2023). MODIFICATION OF THE FUZZY SEARCH ALGORITHMS TO USE A SYMBOLS SIMILARITY TABLE. Taurida Scientific Herald. Series: Technical Sciences, (3), 21-28. https://doi.org/10.32782/tnv-tech.2023.3.3

Issue

Section

COMPUTER SCIENCE AND INFORMATION TECHNOLOGY