Task 1: TLS-encrypted Flow Packet-length Sequences Analysis

Network-based intrusion detection is a highly challenging task since a substantial fraction of Internet traffic is encrypted with Transport Layer Security (TLS). When malware abuses TLS to hide its payload, both signature-based detection and deep packet inspection are bypassed. One countermeasure is to use side information, such as the packet-length sequence of a TLS-encrypted flow.

Data

In this task, participants are required to predict the malware family by analyzing the packet-length sequence of a TLS-encrypted flow (a bidirectional TLS flow distinguished by its 4-tuple). The dataset is formatted as tab-separated values (TSV) and contains a total of 14,980 labeled packet-length sequences collected from 8 malware families.

Malware Family    Flow Samples    Training Set    Test Set
Angler-EK                  128              90          38
Dridex                     539             377         162
Gootkit                    112              78          34
Hancitor                  2187            1531         656
IcedID                     193             135          58
Rig-EK                     264             185          79
Trickbot                 10363            7254        3109
Zeus                      1194             836         358
Total                    14980           10486        4494

Note that each element of a packet-length sequence (i.e., a packet length) is a signed integer, where a minus sign denotes an egress packet.

The task for the competition is to perform malware family classification through sequence analysis, where training and testing data are given as follows:

Dataset     File                Samples (Rows)    Dimension (Columns)
Training    Training_set.tsv    10486             Dynamic (1st column is the class label)
Testing     Testing_set.tsv     4494              Dynamic
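Because the rows have a dynamic number of columns, a fixed-width table loader may not fit. The following is a minimal sketch of one way to read such rows, assuming the label is stored as a plain string in the first column and that non-negative lengths correspond to ingress packets (the description only states that a minus sign marks egress). The function names and the two demo rows are hypothetical and not taken from the dataset.

```python
import csv
import io


def parse_labeled_rows(tsv_text):
    """Parse variable-length rows: first field is the label, the rest are signed lengths."""
    samples = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # Drop empty fields that trailing tabs would produce.
        fields = [f for f in row if f.strip()]
        if not fields:
            continue
        label = fields[0]
        lengths = [int(f) for f in fields[1:]]
        samples.append((label, lengths))
    return samples


def split_by_direction(lengths):
    """Split a signed sequence into egress (negative) and non-negative packet sizes."""
    egress = [-n for n in lengths if n < 0]
    ingress = [n for n in lengths if n >= 0]  # assumption: non-negative means ingress
    return egress, ingress


# Hypothetical two-row sample; real rows would come from Training_set.tsv.
demo = "Dridex\t-517\t1460\t-126\nZeus\t-200\t52\n"
rows = parse_labeled_rows(demo)
egress, ingress = split_by_direction(rows[0][1])
```

For the testing file, which has no label column, the same loop could be reused with every field treated as a packet length.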

Results Submission

You are required to submit one CSV file containing the predictions for the testing samples. The expected result submission is a CSV file with 2 columns (index and label) and 4494 rows (no header line).
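A submission in this shape could be produced along the following lines. This is only a sketch: the description does not say whether the index starts at 0 or 1 or how labels are encoded, so `first_index=1` and the string labels below are assumptions, and `write_submission` is a hypothetical helper name.

```python
import csv
import io


def write_submission(labels, out, first_index=1, expected_rows=None):
    """Write one 'index,label' line per prediction, with no header line."""
    if expected_rows is not None and len(labels) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(labels)}")
    writer = csv.writer(out, lineterminator="\n")
    for i, label in enumerate(labels, start=first_index):
        writer.writerow([i, label])


# Hypothetical predictions for three samples, written to an in-memory buffer.
buffer = io.StringIO()
write_submission(["Trickbot", "Zeus", "Dridex"], buffer)
submission_text = buffer.getvalue()
```

Passing `expected_rows=4494` when writing the real file guards against submitting a truncated prediction list.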

Citation of This Dataset

Heejun Roh and Joonseo Ha, TLS-encrypted flow packet-length sequences dataset, for the 13th International Cybersecurity Data Mining Competition (CDMC2022), Defensible Networked Systems Lab., Korea University, June 2022

 

Note: Please log in to download data.

Task 2: Malware API Call Histogram for Malware Classification

The goal of this challenge is to use multiway classification to predict which of nine malware families each sample in the testing set belongs to, using a model trained on the features in the provided training data. The malware features dataset was built from malware samples provided by Abuse.ch. The features were extracted by dynamic analysis using the Cuckoo sandbox and consist of an API call count histogram. The dataset is divided using an 80/20 split. The label is contained in the first column of the training data. The API histogram covers 208 unique API calls, and there are 9 unique malware families. The following table gives the statistics of the training and testing data for this task.

Dataset     File                  Samples (Rows)    Dimension (Columns)
Training    train_features.txt    537               209 (1st column is the class label)
Testing     test_features.txt     134               208
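The column counts in the table suggest a simple sanity check when loading the files. The sketch below assumes fields are separated by commas or whitespace, which the description does not specify, and keeps the label as a string; `parse_histogram_line` is a hypothetical helper, and the demo lines are made up.

```python
import re

N_API_CALLS = 208  # number of unique API calls stated in the task description


def parse_histogram_line(line, has_label):
    """Split one row into (label, counts); label is None for unlabeled test rows."""
    # Assumption: fields are separated by commas and/or whitespace.
    fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
    label = fields.pop(0) if has_label else None
    if len(fields) != N_API_CALLS:
        raise ValueError(f"expected {N_API_CALLS} counts, got {len(fields)}")
    counts = [float(f) for f in fields]
    return label, counts


# Hypothetical rows: a training row with 209 fields and a test row with 208 fields.
train_line = "3," + ",".join(["0"] * 207 + ["5"])
test_line = " ".join(["1"] * 208)
train_label, train_counts = parse_histogram_line(train_line, has_label=True)
test_label, test_counts = parse_histogram_line(test_line, has_label=False)
```

Raising on an unexpected field count makes it obvious early if the delimiter assumption does not match the actual files.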

Results Submission

You are required to submit one CSV file containing the predictions for the testing samples. The expected result submission is a CSV file with 2 columns (index and label) and 134 rows (no header line).

Citation of This Dataset

Paul Black, Malware API Call Histogram Dataset for the 13th International Cybersecurity Data Mining Competition (CDMC2022), Internet Commerce Security Laboratory (ICSL), Federation University Australia, 15/06/2022.


Task 3: Table Structure Recognition in Documents

Introduction

Data mining technology is widely used to discover patterns and anomalies in a large amount of data. Documents are an important source of data in business activities, and document analysis has gradually become a research hotspot in the field of data mining. A company's financial performance, including earnings, sales, operating expenses, capital needs, etc., can be mined from its financial documents. Correct and efficient extraction of all financial information from these documents not only prevents data forgery but is also a prerequisite for subsequent data mining to identify significant issues in business activities.

Task Description

Key figures in financial documents are primarily presented in tables, but the format and style of tables can vary widely. Therefore, the main problem to be solved in financial data forgery prevention is table recognition, in which structure recognition is an important step and a well-studied problem in document analysis. Many academic and commercial approaches have been developed to recognize tables in several document formats. However, there is limited work on image-based table recognition in real scenarios. Because annotating real-scenario images is costly, the limited amount of training data is the main technical challenge in this task. The challenge aims to explore state-of-the-art methods for recognizing table structure in real financial documents with limited training data. In particular, the task is defined as follows: given an image of a financial document whose main content is a table, the system must recognize the row count and column count of the table.

Data 

The training data directory (train) includes:

  • “img” folder : 100 png images.
  • “train_label.csv”: the label of 100 png images, each line contains 3 fields:
    • file_name: the name of the image file
    • row_count: the count of the rows in the table
    • column_count: the count of the columns in the table

The testing data directory provides an “img” folder containing a total of 500 PNG files.

 

The following figure shows how the labels of the training data are recorded in “train_label.csv”:

Results Submission

You are required to submit one CSV file containing the prediction for each test image. The format of the CSV file must be the SAME as train_label.csv (i.e., keep the header line “file_name,row_count,column_count”, with one line per image and the fields in the same order as the header line). The expected result submission is a CSV file with 3 columns and 501 rows (including 1 header line).

Performance Evaluation

The prediction accuracy is defined as the ratio of correctly predicted samples in the test set. Note that a sample is correctly predicted only when both the “row_count” and “column_count” predictions match the ground truth; a prediction with only the row count or only the column count correct is NOT counted toward the accuracy.

For example, in the following result.csv:

train_2.png prediction is not correct on “row_count”

train_3.png prediction is not correct on both “row_count” and “column_count”

train_4.png prediction is not correct on “column_count”

This results in 50% accuracy, as it gives 3 correct predictions out of a total of 6 images.
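The exact-match rule can be sketched as follows. The file names train_1.png to train_6.png and all row and column counts are hypothetical values chosen to reproduce the pattern described above (errors on train_2, train_3, and train_4); `load_counts` and `exact_match_accuracy` are illustrative helper names, not part of any official evaluation script.

```python
import csv
import io


def load_counts(csv_text):
    """Read 'file_name,row_count,column_count' CSV text into a dict of tuples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        r["file_name"]: (int(r["row_count"]), int(r["column_count"]))
        for r in reader
    }


def exact_match_accuracy(truth, predictions):
    """A sample counts as correct only if both row and column counts match."""
    correct = sum(
        1 for name, counts in truth.items() if predictions.get(name) == counts
    )
    return correct / len(truth)


# Hypothetical ground truth and predictions (values are made up).
HEADER = "file_name,row_count,column_count\n"
truth_csv = HEADER + (
    "train_1.png,5,4\ntrain_2.png,7,3\ntrain_3.png,6,6\n"
    "train_4.png,9,2\ntrain_5.png,4,4\ntrain_6.png,8,5\n"
)
result_csv = HEADER + (
    "train_1.png,5,4\ntrain_2.png,8,3\ntrain_3.png,5,7\n"
    "train_4.png,9,3\ntrain_5.png,4,4\ntrain_6.png,8,5\n"
)
accuracy = exact_match_accuracy(load_counts(truth_csv), load_counts(result_csv))
```

In this sketch an image missing from the predictions is simply counted as incorrect.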

 

Citation of this Dataset

Qiufeng Wang, Kaizhu Huang, TSRD (Table Structure Recognition in Documents) dataset for the 13th International Cybersecurity Data Mining Competition (CDMC2022), Suzhou Municipal Key Laboratory of Cognitive Computation and Applied Technology (LCCAT), Xi'an Jiaotong Liverpool University (XJTLU), June 2022.

