DATA PROCESSING AND ANALYSIS ON THE EXAMPLE OF THE SPAMBASE DATASET USING MACHINE LEARNING LIBRARIES

DOI:

https://doi.org/10.32782/tnv-tech.2024.2.1

Keywords:

dataset, machine learning, artificial intelligence, properties, features matrix, feature vector, targets vector

Abstract

The article analyzes the Spambase dataset, which contains data on e-mails classified as spam or non-spam. A detailed analysis of this data frame is provided, covering the data in its columns (properties) and records. The dataset was uploaded to the Google Colab development environment for programming and further analysis. The NumPy, Pandas, Matplotlib, Sklearn, and Imblearn libraries are used for scientific computing and data analysis in Python. In combination, they allow developers and researchers to work effectively with structured data, perform various operations, visualize results, and solve complex data analysis and processing tasks. To aid understanding of the material, basic theoretical information about data forecasting is reviewed. Definitions of machine learning, artificial intelligence, and data science are provided. The machine learning categories of supervised, unsupervised, and reinforcement learning are also described. The main types of features used in machine learning models are considered: qualitative, ordinal, and quantitative. The Heart Disease dataset is also presented and used to illustrate and label key terms such as the features matrix X, feature vector, properties, and targets vector Y. The need to split the whole dataset into training, validation, and testing datasets for correct model evaluation and verification is described. The use of L1 and L2 loss functions to evaluate model performance is explained, and the advantages and disadvantages of each approach are indicated. The analysis of the Spambase dataset in the Google Colab environment then continues. Histograms were constructed to show the distribution of the data by different properties for the two classes, spam and non-spam. Histograms for the properties word_freq_credit, char_freq_!, and capital_run_length_total were analyzed. The split() function from the NumPy library is used to split the data into training, validation, and testing sets. For the training dataset, the classes were rebalanced using the random oversampling method (RandomOverSampler). As a result, new instances were created for the less-represented class, e-mails containing spam.
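The splitting and rebalancing steps named in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the data are synthetic stand-ins for Spambase, the 60/20/20 proportions and the seed are assumptions, and the hypothetical helper `random_oversample` only mimics the idea behind Imblearn's RandomOverSampler using plain NumPy so that the example stays self-contained.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed, chosen only for repeatability

# Synthetic stand-in for Spambase (assumption): 1000 rows, 3 features,
# with roughly 40% of rows labelled 1 ("spam").
n_rows = 1000
X = rng.random((n_rows, 3))
y = (rng.random(n_rows) < 0.4).astype(int)

# Shuffle the rows, then cut at 60% and 80% with np.split to obtain
# training, validation, and testing parts (proportions are an assumption).
order = rng.permutation(n_rows)
X, y = X[order], y[order]
cuts = [int(0.6 * n_rows), int(0.8 * n_rows)]
X_train, X_valid, X_test = np.split(X, cuts)
y_train, y_valid, y_test = np.split(y, cuts)


def random_oversample(X, y, rng):
    """Hypothetical helper: resample minority-class rows with replacement
    until every class has as many rows as the largest class."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(y))]  # start with every original row
    for cls, count in zip(classes, counts):
        if count < target:
            members = np.flatnonzero(y == cls)
            keep.append(rng.choice(members, size=target - count, replace=True))
    idx = np.concatenate(keep)
    return X[idx], y[idx]


# Only the training part is rebalanced; validation and test stay untouched.
X_train_bal, y_train_bal = random_oversample(X_train, y_train, rng)
print("before:", np.bincount(y_train), "after:", np.bincount(y_train_bal))
```

Rebalancing only the training part keeps the validation and testing sets representative of the original class distribution, which matches the order of steps described above.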

Published

2024-07-09

How to Cite

Балвак, А. А., Лемешко, А. В., Антоненко, А. В., Зіняр, Д. А., Бурачинський, А. Ю., & Приходько, А. П. (2024). DATA PROCESSING AND ANALYSIS ON THE EXAMPLE OF THE SPAMBASE DATASET USING MACHINE LEARNING LIBRARIES. Taurida Scientific Herald. Series: Technical Sciences, (2), 3-20. https://doi.org/10.32782/tnv-tech.2024.2.1

Section

COMPUTER SCIENCE AND INFORMATION TECHNOLOGY
