Synthetic network traffic generation with generative neural models

STAN uses generative neural models to produce synthetic network data that is realistic enough to replace real data in machine learning tasks, without the privacy concerns that come with sharing real data.

[Figure: diagram with the labels Your Data, Train, STAN, Synthetic Data, Train, Models, Cybersecurity Applications, and Risk of Privacy Compromise]
Read the Paper
STAN: Synthetic Network Traffic Generation with Generative Neural Models

Shengzhe Xu, Manish Marwah, Martin Arlitt, and Naren Ramakrishnan.

Excerpt from paper

Step 0

Motivation

Key Concern

Malicious cyber-activity cost the U.S. economy between $57 billion and $109 billion in 2016.

Privacy

Organizations are reluctant to share real-life data due to privacy concerns.

Other Solutions

Anonymization may leak private information. Perturbing the original data degrades it.

Our Solution

Generate synthetic network flow data based on real-world data without leaking any sensitive data.

We generate tabular rows of synthetic network data via a temporal process with homogeneous columns:

  • timestamps
  • duration
  • number of packets
  • number of bytes
  • source IP
  • source port
  • destination IP
  • destination port
  • flags
  • protocol

Step 1

Generate a Netflow Record

Multiple Attribute Types

STAN uses a flexible deep neural network architecture for learning and generating any combination of network traffic attributes with the help of well-designed decoders for both continuous attributes and discrete attributes.

Continuous Variables

STAN produces continuous variables such as timestamps, durations, number of bytes, and number of packets.

Discrete Variables

Furthermore, STAN is flexible enough to handle the specific meanings of different domain attributes, such as IP addresses and port numbers, and can learn complicated domain semantics such as the relationship between packet size and protocol flags.

IP Addresses and Ports

STAN can also generate IP addresses and ports that follow domain semantics and are aware of standard practices.
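As a rough illustration of what "domain semantics" can mean for these attributes, the sketch below splits an IPv4 address into its four octets and classifies a port by the IANA ranges (system, user, dynamic). It is not taken from STAN; the function names `ip_to_octets` and `port_class` are hypothetical and only show the kind of structure a generator needs to respect.

```python
from ipaddress import IPv4Address


def ip_to_octets(ip: str) -> tuple:
    """Split a dotted-quad IPv4 address into its four octets (0-255 each)."""
    # IPv4Address validates the string; .packed is the 4-byte representation.
    return tuple(IPv4Address(ip).packed)


def port_class(port: int) -> str:
    """Classify a port number using the IANA port ranges."""
    if not 0 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    if port <= 1023:
        return "system"   # well-known services such as HTTP on port 80
    if port <= 49151:
        return "user"     # registered ports
    return "dynamic"      # private / ephemeral ports
```

For the example row on this page, `ip_to_octets("42.219.145.151")` gives the four octets of the destination address, and destination port 80 falls in the system range while source port 19925 falls in the user range.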

STAN Row Generator
Attribute               Type        Example
timestamp               continuous  2016-04-11 00:02:15
duration                continuous  0.344
bytes                   continuous  11238
packets                 continuous  11
transport protocol      discrete    TCP
source IP address       discrete    85.201.196.53
source port             discrete    19925
destination IP address  discrete    42.219.145.151
destination port        discrete    80
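To make the row layout concrete, here is a minimal sketch of one netflow record as a Python dataclass, filled with the example values from the table. The class name `NetflowRecord` and its field names are assumptions made for illustration and are not part of the STAN package.

```python
from dataclasses import dataclass
from datetime import datetime
from ipaddress import IPv4Address


@dataclass(frozen=True)
class NetflowRecord:
    """One netflow row. Hypothetical container, not STAN's own data type."""
    timestamp: datetime    # continuous
    duration: float        # continuous (unit not stated in the source)
    n_bytes: int           # continuous
    n_packets: int         # continuous
    protocol: str          # discrete
    src_ip: IPv4Address    # discrete
    src_port: int          # discrete
    dst_ip: IPv4Address    # discrete
    dst_port: int          # discrete

    def __post_init__(self):
        # Basic sanity checks that any generated row should pass.
        for port in (self.src_port, self.dst_port):
            if not 0 <= port <= 65535:
                raise ValueError(f"port out of range: {port}")
        if self.duration < 0 or self.n_bytes < 0 or self.n_packets < 0:
            raise ValueError("duration and counts must be non-negative")


example = NetflowRecord(
    timestamp=datetime(2016, 4, 11, 0, 2, 15),
    duration=0.344,
    n_bytes=11238,
    n_packets=11,
    protocol="TCP",
    src_ip=IPv4Address("85.201.196.53"),
    src_port=19925,
    dst_ip=IPv4Address("42.219.145.151"),
    dst_port=80,
)
```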

Step 2

Generate Netflow Scenario Time Series Table

Building a Netflow Scenario
Excerpt from paper

From Records to Tables

A useful dataset needs more than individual records. STAN captures both temporal dependencies and attribute dependencies, and generates the netflow time series autoregressively.
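The idea of autoregressive generation can be sketched as a loop in which each new row is sampled conditioned on a window of previous rows. In the toy example below, `sample_next` is a hand-written stand-in for a learned model, and the `window` parameter and the row fields are assumptions chosen for illustration, not STAN's actual settings.

```python
import random


def sample_next(history, rng):
    """Toy stand-in for a learned conditional model p(row | previous rows)."""
    last_ts = history[-1]["timestamp"] if history else 0.0
    # Attribute dependency within a row: bytes is sampled given packets.
    packets = rng.randint(1, 20)
    n_bytes = packets * rng.randint(40, 1500)
    return {
        # Temporal dependency: the new timestamp builds on the previous one.
        "timestamp": last_ts + rng.expovariate(1.0),
        "packets": packets,
        "bytes": n_bytes,
    }


def generate(n_rows, window=5, seed=0):
    """Generate rows one at a time, each conditioned on the last `window` rows."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        rows.append(sample_next(rows[-window:], rng))
    return rows


rows = generate(10)
```

A real model would replace `sample_next` with a neural network that reads the window and outputs distributions for each attribute; the surrounding loop stays the same.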

Deep Learning Techniques

STAN uses deep convolutional neural layers to capture the complex dependencies, and mixture density neural layers and softmax layers to precisely learn the distributions of continuous and discrete attributes.
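A minimal sketch of how the two kinds of output layer are typically sampled: a softmax output defines a categorical distribution for a discrete attribute, and a mixture density output defines a Gaussian mixture for a continuous one. The logits, means, and standard deviations passed in here would normally come from the network; in this example they are hand-picked values, and the function names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_discrete(logits, categories, rng):
    """Sample a discrete attribute from softmax probabilities."""
    return rng.choices(categories, weights=softmax(logits), k=1)[0]


def sample_mixture(weight_logits, means, stds, rng):
    """Sample a continuous attribute from a Gaussian mixture."""
    # First pick a mixture component, then draw from that Gaussian.
    k = rng.choices(range(len(means)), weights=softmax(weight_logits), k=1)[0]
    return rng.gauss(means[k], stds[k])


rng = random.Random(0)
protocol = sample_discrete([2.0, 0.5, -1.0], ["TCP", "UDP", "ICMP"], rng)
duration = sample_mixture([0.0, 0.0], [0.3, 5.0], [0.1, 1.0], rng)
```

A mixture lets the model represent multi-modal continuous attributes, for example many short flows plus a smaller group of long ones, which a single Gaussian could not capture.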

Step 3

Evaluation

1
Statistical Distribution

The counts of unique IP addresses in real-world datasets follow a power law, and so do the counts in our synthetic data.
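One rough way to check for power-law behaviour is to rank IP addresses by how often they appear and fit a straight line to log(count) against log(rank); similar slopes for real and synthetic data suggest similar distributions. The sketch below is a simple diagnostic under that assumption, not the evaluation procedure from the paper, and `rank_frequency_slope` is a hypothetical helper.

```python
import math
from collections import Counter


def rank_frequency_slope(items):
    """Least-squares slope of log(count) against log(rank)."""
    counts = sorted(Counter(items).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct items")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var
```

Calling it on the source IP column of a real table and of a synthetic table gives two slopes to compare. A straight-line fit on a log-log plot is only a first check; it does not by itself prove that the data follow a power law.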

2
Domain Specifics

Our synthetic data mimics real-world relationships between dependent network attributes such as number of bytes and number of packets.
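A simple way to probe one such relationship is to compare the correlation between bytes and packets in the real table with the same correlation in the synthetic table. The sketch below assumes rows are dictionaries with `bytes` and `packets` keys; those names and the `correlation_gap` helper are illustrative, not part of the STAN package.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlation_gap(real_rows, synthetic_rows):
    """Absolute difference in bytes/packets correlation between two tables."""
    def corr(rows):
        return pearson([r["bytes"] for r in rows], [r["packets"] for r in rows])
    return abs(corr(real_rows) - corr(synthetic_rows))
```

A gap close to zero means the synthetic data preserve the linear part of the relationship; it says nothing about non-linear structure, so it is best used alongside other checks.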

3
ML Task Performance

In our tests, our synthetic data supports outcomes equivalent to those of real data in network security machine learning tasks such as value forecasting and information completion.

Install It

Try out our code to generate your own synthetic network traffic data.

  • Based on PyTorch
  • Supports GPU acceleration
 > pip install stannetflow
      

Frequently Asked Questions

How do I use it?

Install the package via pip or clone the repo on GitHub.

Get started generating data right away with the saved model in the repo.

Or, train your own model with the available sample data or bring your own data.

How should I cite it?
@inproceedings{xu2021stan,
  title={STAN: Synthetic Network Traffic Generation with Generative Neural Models},
  author={Xu, Shengzhe and Marwah, Manish and Arlitt, Martin and Ramakrishnan, Naren},
  booktitle={International Workshop on Deployable Machine Learning for Security Defense},
  pages={3--29},
  year={2021},
  organization={Springer}
}
Who is involved?