Analysing encrypted traffic to identify security breaches in PUZZLE

  • April 11, 2022

SMEs are particularly vulnerable to security breaches because they lack the expertise and resources to fight them. Inexpensive cybersecurity solutions (e.g., firewalls, IDS/IPS, anti-virus solutions) only partially protect their networks since they rely mainly on known attack signatures. This limitation is most acute when the traffic being inspected is encrypted.

Today nearly 90 percent of network traffic is encrypted, making it difficult, if not impossible, to detect malicious network packets and sessions, the transmission of malware, and the unauthorized exfiltration of information. There are several possible ways to address this:

  • Decrypt the traffic. This is possible in some cases (e.g., web applications, SSL/TLS), after which signature-based solutions can be used.
  • Analyse packet headers to detect anomalies, identify source IP addresses, etc.
  • Profile the behaviour of network packets and sessions. Many closed-source solutions offer User and Entity Behaviour Analytics (UEBA), but their effectiveness is very difficult to determine and they are often expensive.

To facilitate the use of these techniques by SMEs, PUZZLE is building an open-source Advanced Cybersecurity Analytics System (ACAS) whose aim is to monitor traffic and detect potential anomalies. It does so by measuring and computing multiple packet and flow features of the network traffic and using them to classify each flow as either normal or malicious.

The ACAS system consists of two modules: a) the Feature Engineering module, which extracts and calculates flow features by using and extending MMT-Probe, an industrial tool provided by Montimage, and b) the Deep Learning module, which performs the classification using a Machine Learning model trained with state-of-the-art Deep Learning techniques.

The general architecture can be seen in the following figure. The benefit of this architecture is its flexibility: both new features and new Deep Learning models can easily be added to the structure and integrated into the final solution. We will describe each of the building blocks briefly.

Figure 1: General architecture of the proposed system for Analytic Service

Feature Engineering

The aim of this module is to parse the data in both online and offline modes and to extract information about each of the packets and flows in the network. The module can parse a wide range of network protocols (e.g., TCP, UDP, IP). The extracted information is stored and analysed, and a set of statistics is calculated from it that allows the traffic data to be inspected. The data is handled at flow granularity, meaning that packets are grouped into sequences travelling from a particular source to a particular destination. Finally, the restructured and computed data is transformed into a numeric vector so that it can be processed by the Machine Learning algorithms.
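The grouping and vectorisation steps described above can be illustrated with a minimal sketch. It does not use MMT-Probe or reproduce its feature set: the packet records, the flow key (source, destination, ports, protocol) and the six statistics are assumptions chosen only to show the idea of turning packets into one numeric vector per flow.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical packet records (not MMT-Probe output):
# (timestamp_s, src_ip, dst_ip, src_port, dst_port, protocol, size_bytes)
packets = [
    (0.00, "10.0.0.1", "10.0.0.2", 50000, 443, "TCP", 60),
    (0.05, "10.0.0.1", "10.0.0.2", 50000, 443, "TCP", 1500),
    (0.20, "10.0.0.1", "10.0.0.2", 50000, 443, "TCP", 400),
    (0.10, "10.0.0.3", "10.0.0.2", 53000, 53, "UDP", 80),
]

def group_into_flows(packets):
    """Group packets travelling from one source to one destination."""
    flows = defaultdict(list)
    for ts, src, dst, sport, dport, proto, size in packets:
        flows[(src, dst, sport, dport, proto)].append((ts, size))
    return flows

def flow_vector(records):
    """Turn one flow's (timestamp, size) records into a fixed-length numeric vector."""
    records = sorted(records)
    times = [ts for ts, _ in records]
    sizes = [size for _, size in records]
    gaps = [b - a for a, b in zip(times, times[1:])]
    return [
        float(len(records)),          # packet count
        float(sum(sizes)),            # total bytes
        mean(sizes),                  # mean packet size
        pstdev(sizes),                # packet size standard deviation
        times[-1] - times[0],         # flow duration in seconds
        mean(gaps) if gaps else 0.0,  # mean inter-arrival time
    ]

vectors = {key: flow_vector(recs) for key, recs in group_into_flows(packets).items()}
for key, vec in vectors.items():
    print(key, vec)
```

Every flow yields a vector of the same length regardless of how many packets it contains, which is what allows the vectors to be fed to a learning model.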

Deep Learning (DL)

This module implements a hybrid learning structure that combines two DL techniques: a) Stacked Autoencoders (SAE), and b) Convolutional Neural Networks (CNN) for one-dimensional data. The model takes the numeric feature vector as input. It uses two SAEs, one for each class of data (normal/malicious). Each autoencoder is trained separately, on normal and malicious samples respectively, and each uses one hidden dense layer. The outputs of the two autoencoders are concatenated into a single vector and passed as input to the one-dimensional CNN. The structure of the CNN is based on a well-known CNN model.
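The data flow through this hybrid structure can be sketched as a plain-Python forward pass. This is an illustration only, not the PUZZLE implementation: the layer sizes, the activations, the absence of bias terms, the single convolution filter and the final dense layer are all assumptions, and the weights are random rather than trained.

```python
import math
import random

random.seed(0)

def init(rows, cols):
    """Random weight matrix; stands in for trained parameters."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def relu(v):
    return max(0.0, v)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def dense(x, weights, activation):
    """Fully connected layer without bias: one output per weight row."""
    return [activation(sum(w * xi for w, xi in zip(row, x))) for row in weights]

def autoencoder(x, enc, dec):
    """One hidden dense layer (encoder) followed by a reconstruction layer (decoder)."""
    return dense(dense(x, enc, relu), dec, sigmoid)

def conv1d(x, kernel):
    """Valid one-dimensional convolution (cross-correlation) followed by ReLU."""
    k = len(kernel)
    return [relu(sum(kernel[j] * x[i + j] for j in range(k)))
            for i in range(len(x) - k + 1)]

N_FEATURES, HIDDEN, KERNEL = 6, 3, 3  # illustrative sizes only

# One autoencoder per class; in the described system each is trained on its own class.
sae_normal = (init(HIDDEN, N_FEATURES), init(N_FEATURES, HIDDEN))
sae_malicious = (init(HIDDEN, N_FEATURES), init(N_FEATURES, HIDDEN))
kernel = [random.uniform(-0.5, 0.5) for _ in range(KERNEL)]
head = init(1, 2 * N_FEATURES - KERNEL + 1)  # assumed final dense layer giving one score

def classify(x):
    """Run both autoencoders, concatenate their outputs, then apply the 1-D CNN."""
    merged = autoencoder(x, *sae_normal) + autoencoder(x, *sae_malicious)
    feature_map = conv1d(merged, kernel)
    return dense(feature_map, head, sigmoid)[0]  # score strictly between 0 and 1

score = classify([0.1, 0.9, 0.4, 0.0, 0.7, 0.3])
print(score)
```

The intuition behind feeding both reconstructions to the classifier is that an autoencoder tends to reconstruct samples of the class it was trained on more faithfully, so the pair of outputs carries information about which class the input resembles.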

Results

The DL model gives very good results on the tested dataset, with precision and accuracy above 99%. These initial tests were performed using a dataset from the open-source database CSE-CIC-IDS2018 (Figure 2), provided by the Canadian Institute for Cybersecurity, which consists of raw PCAP files containing botnet attacks. The dataset was cleaned and divided to provide a balanced set containing the same number of malicious and normal flows for training. The resulting dataset consisted of 45290 flows, which were split into two datasets, one containing 70% of the samples for training and another containing 30% for testing.
With this dataset configuration, the model produced 6779 True Positives (TP), 6792 True Negatives (TN), 14 False Negatives (FN), and just 1 False Positive (FP).
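As a quick check, the standard classification metrics can be recomputed from the four counts reported above. The counts come from the article; the metric formulas are the usual definitions, and recall and F1 are added here for completeness even though the article does not quote them.

```python
# Confusion-matrix counts reported in the article.
tp, tn, fn, fp = 6779, 6792, 14, 1

total = tp + tn + fn + fp
accuracy = (tp + tn) / total        # share of all flows classified correctly
precision = tp / (tp + fp)          # share of flagged flows that are truly malicious
recall = tp / (tp + fn)             # share of malicious flows that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"samples:   {total}")
print(f"accuracy:  {accuracy:.4%}")
print(f"precision: {precision:.4%}")
print(f"recall:    {recall:.4%}")
print(f"F1 score:  {f1:.4%}")
```

The four counts sum to 13586, which matches roughly 30% of the 45290 flows, and they give an accuracy of about 99.89%, a precision of about 99.99% and a recall of about 99.79%.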

Future work

The first results are encouraging but we need to:

  • Test the solution using many other datasets to make sure that it is generic enough to address different types of situations.
  • Extend the Deep Learning model to cover different types of attacks and multi-class labels.
  • Integrate the solution within the PUZZLE framework.
  • Most importantly, simplify the procedure so that it can be easily used by SMEs.

Author: Montimage

Figure 1: https://github.com/Montimage/mmt-probe
Figure 2: https://www.unb.ca/cic/datasets/ids-2018.html
