Investigation of the efficiency of applying a machine learning algorithm to the classification of Internet traffic

DOI: 10.31673/2412-9070.2020.062932

Authors

  • А. П. Козиряцький (Kozyryatsʹkyy A. P.), State University of Telecommunications, Kyiv
  • В. В. Жебка (Zhebka V. V.), State University of Telecommunications, Kyiv
  • Л. О. Дьоміна (Dʹomina L. O.), State University of Telecommunications, Kyiv
  • Д. О. Тарасенко (Tarasenko D. O.), State University of Telecommunications, Kyiv

Abstract

The article investigates the effectiveness of a machine learning algorithm for classifying Internet traffic. The random forest (RF) algorithm, which works by constructing many decision trees, is considered. Its efficiency in application classification is evaluated both in the presence and in the absence of background network traffic. A laboratory network of several computers was set up to collect the data needed for the analysis. One of the computers was connected to the Internet and served as a wireless access point; all traffic passing through it was captured with Wireshark. Various applications ran on the other computers connected to the access point: web pages were viewed in the Google Chrome and Opera browsers, video calls were made over Skype, files were downloaded with the µTorrent client, the Steam digital game distribution service was used, etc. The captured data were stored in the PCAP format and then pre-processed to meet the requirements of the classification task. In the experiment, a random forest was constructed and the quality of classification on the given sample was assessed. The most suitable parameters of the algorithm were selected experimentally: a forest of 5 trees with the maximum possible depth. The algorithm proved most effective on DNS traffic. In addition to testing the algorithm on a test sample with the same class composition as the training sample, its quality was also assessed in the presence of background traffic, i.e. with a test sample containing instances of classes absent from the training sample.
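
The evaluation with background traffic can be illustrated with a minimal sketch. The abstract does not say how flows of unknown classes were handled or which metrics were computed, so everything below is an assumption: the five per-tree votes are hand-written toy data rather than output of a trained forest, a flow is labelled "background" when fewer than `min_votes` trees agree, and quality is reported as per-class precision and recall. The names `forest_predict`, `per_class_metrics`, and `min_votes` are hypothetical.

```python
from collections import Counter

def forest_predict(tree_votes, min_votes=3, unknown="background"):
    """Majority vote over per-tree labels; reject weak majorities as unknown."""
    label, count = Counter(tree_votes).most_common(1)[0]
    return label if count >= min_votes else unknown

def per_class_metrics(y_true, y_pred):
    """Return {class: (precision, recall)} for every class seen in either list."""
    metrics = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        metrics[c] = (precision, recall)
    return metrics

# Toy votes of a 5-tree forest for five flows (assumed data, not from the article).
votes = [
    ["dns", "dns", "dns", "dns", "dns"],
    ["dns", "dns", "dns", "dns", "skype"],
    ["skype", "skype", "skype", "http", "dns"],
    ["http", "skype", "dns", "steam", "http"],     # no clear majority
    ["skype", "skype", "skype", "skype", "http"],  # background flow that looks like Skype
]
# The last two flows belong to a class that was absent from the training sample.
y_true = ["dns", "dns", "skype", "background", "background"]

y_pred = [forest_predict(v) for v in votes]
metrics = per_class_metrics(y_true, y_pred)

for name, (precision, recall) in metrics.items():
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

In this toy run the background flow that resembles Skype lowers the precision of the Skype class, which is the kind of degradation an evaluation with background traffic is meant to expose.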

Keywords: machine learning; Internet traffic; RF algorithm; Wireshark program; efficiency; metrics.

Published

2021-03-25

Section

Articles