CALCULATION OF A COEFFICIENT OF UNIQUENESS FOR TEXT DOCUMENT USING JACCARD SIMILARITY

Authors

Савчук Т., Кучевський Ю.

DOI:

https://doi.org/10.32851/tnv-tech.2021.6.8

Keywords:

plagiarism, antiplagiarism systems, shingles, Jaccard similarity

Abstract

The rapid development of the Internet, along with growing computer literacy, is contributing to the spread of plagiarism into various areas of human activity: plagiarism is an acute problem in education, industry and the scientific community. According to [1], plagiarism is understood as the illegal use or disposal of the protected results of another person's creative work, accompanied by giving others false information that presents the plagiarist as the real author. Plagiarism can violate copyright and patent law and as such can lead to legal liability. On the other hand, plagiarism is also possible in areas that are not covered by any type of intellectual property, such as mathematics and other basic scientific disciplines. Once knowledge is on the Internet, it becomes the property of all, and enforcing copyright becomes increasingly difficult and sometimes impossible. Therefore, checking the uniqueness of documents is an important task. The article analyzes modern methods and tools for checking textual information for uniqueness. For each of them, an example of its operation, its advantages and its disadvantages are provided. It is noted that an important task is to increase the accuracy of checking texts for uniqueness. The shingles method has been identified as the most common and effective method of detecting plagiarism in textual information. Based on the shingles method, an improved algorithm for checking texts for uniqueness using the Jaccard coefficient is proposed. The complexity of the proposed algorithm in terms of memory and processing power is considered. It is noted that the additional improvements do not degrade the performance of the algorithm.
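The combination of shingles and the Jaccard coefficient mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' improved algorithm, whose details are not given here. It assumes word-level shingles of length k = 3, the standard Jaccard coefficient J(A, B) = |A ∩ B| / |A ∪ B| over the two shingle sets, and a uniqueness coefficient defined as 1 − J; the function names and the choice of k are hypothetical.

```python
import re


def shingles(text, k=3):
    """Return the set of word-level k-shingles of a text.

    Assumption: words are lowercased runs of word characters, and a text
    shorter than k words yields a single shingle holding all of its words.
    """
    words = re.findall(r"\w+", text.lower())
    if not words:
        return set()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B| of two sets."""
    if not a and not b:
        # Convention chosen for this sketch: two empty sets are identical.
        return 1.0
    return len(a & b) / len(a | b)


def uniqueness(candidate, source, k=3):
    """Assumed uniqueness coefficient: 1 minus the Jaccard similarity."""
    return 1.0 - jaccard(shingles(candidate, k), shingles(source, k))


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog"
    b = "the quick brown fox sleeps under the lazy dog"
    print(round(uniqueness(a, b), 3))
```

In this sketch a value of 0 means the two texts share all of their shingles and a value of 1 means they share none. Storing shingles in sets keeps the comparison linear in the number of shingles on average, at the cost of memory proportional to the length of each text.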

References

Савчук Т.О., Кучевський Ю.А. Удосконалений алгоритм перевірки текстів на унікальність. INTERNET-EDUCATION-SCIENCE : Proceedings of the XII International scientific-practical conference, м. Вінниця, 26–29 травня 2020 р. Вінниця : Вінницький національний технічний університет, 2020. С. 237–239. URL: https://ir.lib.vntu.edu.ua/bitstream/handle/123456789/30970/WORK-IES-2020-269-271.pdf?sequence=1&isAllowed=y.

Савчук Т.О., Кучевський Ю.А. Підхід до аналізу на унікальність курсових розробок. Матеріали XLIX науково-технічної конференції підрозділів Вінницького національного технічного університету, м. Вінниця, 27–28 квітня 2020 р. URL: https://conferences.vntu.edu.ua/index.php/all-fitki/all-fitki-2020/paper/view/8929/7739.

Monostori K., Zaslavsky A., Schmidt H. Document Overlap Detection System for Distributed Digital Libraries. Proceedings of the fifth ACM conference on Digital libraries. 2000. P. 226–227.

Leong A., Lau H., Rynson W. H. Check: A Document Plagiarism Detection System. Proceedings of ACM Symposium for Applied Computing. 1997. P. 70–77.

Dreher H. Automatic Conceptual Analysis for Plagiarism Detection. The Journal of Issues in Informing Science and Information Technology. 2007. Vol. 4. P. 601–614.

Meyer zu Eissen S., Stein B. Intrinsic Plagiarism Detection. European Conference on Information Retrieval. Springer, 2006. P. 565–569.

Седов А.В., Рогов А.А. Анализ неоднородностей в тексте на основе последовательностей частей речи. Современные проблемы науки и образования. 2013. № 1. URL: https://science-education.ru/ru/article/view?id=8339.

Антиплагиат: обнаружение заимствований. Веб-сайт. URL: https://www.antiplagiat.ru/corporate/education.

Unicheck. Сервіс перевірки на плагіат для найкращих результатів. Вебсайт. URL: https://unicheck.com/uk-ua.

Ширяев М.А., Мустакимов В. Plagiatinform избавит от плагиата в научных работах. Educational Technology & Society. 2008. № 11 (1). С. 367‒374. URL: https://cyberleninka.ru/article/n/plagiatinform-izbavit-ot-plagiata-v-nauchnyh-rabotah/viewer.

Brin S., Davis J., Garcia-Molina H. Copy Detection Mechanisms for Digital Documents. ACM International Conference on Management of Data (SIGMOD 1995), San Jose, California, May 22–25, 1995. P. 398–409.

Published

2022-02-14

How to Cite

Савчук, Т., & Кучевський, Ю. (2022). CALCULATION OF A COEFFICIENT OF UNIQUENESS FOR TEXT DOCUMENT USING JACCARD SIMILARITY. Taurida Scientific Herald. Series: Technical Sciences, (6), 58–65. https://doi.org/10.32851/tnv-tech.2021.6.8

Section

COMPUTER SCIENCE AND INFORMATION TECHNOLOGY