Synthetic Network Traffic Generation

Synthetic network traffic generation with generative neural models

Generate synthetic network data with generative neural models that is realistic enough to replace real data in machine learning tasks, without the privacy concerns that come with sharing real traffic.

Read the Paper

STAN: Synthetic Network Traffic Generation with Generative Neural Models

Shengzhe Xu, Manish Marwah, Martin Arlitt, and Naren Ramakrishnan.

Step 0

Motivation

Key Concern

Malicious cyber activity cost the U.S. economy between $57 billion and $109 billion in 2016.

Privacy

Organizations are reluctant to share real-life data due to privacy concerns.

Our Solution

Generate synthetic network flow data based on real-world data without leaking any sensitive data.

We generate tabular rows of synthetic network data via a temporal process with homogeneous columns:

timestamps, duration, number of packets, number of bytes, source IP, source port, destination IP, destination port, flags, protocol

See how it works

Step 1

Generate a Netflow Record

Multiple Attribute Types

STAN uses a flexible deep neural network architecture for learning and generating any combination of network traffic attributes with the help of well-designed decoders for both continuous attributes and discrete attributes.

Continuous Variables

STAN produces continuous variables such as timestamps, durations, number of bytes, and number of packets.

Discrete Variables

Furthermore, STAN is flexible enough to handle the specific meanings of different domain attributes, such as IP addresses and port numbers, and can learn complicated domain semantics such as the relationship between packet size and protocol flags.

IP Addresses and Ports

STAN can also generate IP addresses and ports that follow domain semantics and are aware of standard practices.

STAN Row Generator

Attribute	Type	Example
timestamp	continuous	2016-04-11 00:02:15
duration	continuous	0.344
bytes	continuous	11238
packets	continuous	11
transport protocol	discrete	TCP
source IP address	discrete	85.201.196.53
source port	discrete	19925
destination IP address	discrete	42.219.145.151
destination port	discrete	80
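
As an illustration of the attribute types in the table above, a single record could be modelled as a small Python data class with a few basic domain checks. This is only a sketch and not part of the stannetflow package: the class name `NetflowRecord`, its field names, and the `validate` method are hypothetical and chosen to mirror the table.

```python
from dataclasses import dataclass
from datetime import datetime
import ipaddress


@dataclass
class NetflowRecord:
    """Hypothetical container mirroring the attribute table; not the stannetflow API."""
    timestamp: datetime   # continuous
    duration: float       # continuous
    num_bytes: int        # continuous
    num_packets: int      # continuous
    protocol: str         # discrete
    src_ip: str           # discrete
    src_port: int         # discrete
    dst_ip: str           # discrete
    dst_port: int         # discrete

    def validate(self) -> None:
        """Raise ValueError if a field violates basic domain constraints."""
        ipaddress.ip_address(self.src_ip)  # raises ValueError on a malformed address
        ipaddress.ip_address(self.dst_ip)
        for port in (self.src_port, self.dst_port):
            if not 0 <= port <= 65535:
                raise ValueError(f"port out of range: {port}")
        if self.duration < 0 or self.num_bytes < 0 or self.num_packets < 0:
            raise ValueError("duration, bytes and packets must be non-negative")


# The example row from the table above.
example = NetflowRecord(
    timestamp=datetime(2016, 4, 11, 0, 2, 15),
    duration=0.344,
    num_bytes=11238,
    num_packets=11,
    protocol="TCP",
    src_ip="85.201.196.53",
    src_port=19925,
    dst_ip="42.219.145.151",
    dst_port=80,
)
example.validate()
```

Checks like these are one simple way to confirm that generated rows respect the address and port conventions mentioned above.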

Step 2

Generate Netflow Scenario Time Series Table

Building a Netflow Scenario

From Records to Tables

A useful dataset, however, needs more than individual records. STAN captures both temporal dependencies and attribute dependencies, and generates the netflow time series auto-regressively.
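
The auto-regressive idea can be sketched in a few lines of Python: each new row is sampled conditioned on a window of the rows generated before it. Everything here is an assumption for illustration. `generate_series`, `toy_sampler`, the window size, and the per-packet byte range are hypothetical, and `toy_sampler` is a random stand-in for a trained network rather than STAN's actual model.

```python
import random
from typing import Callable, Dict, List

Row = Dict[str, float]


def generate_series(
    sample_next: Callable[[List[Row]], Row],
    seed_rows: List[Row],
    n_rows: int,
    window: int = 3,
) -> List[Row]:
    """Generate rows one at a time, each conditioned on the previous `window` rows."""
    rows = list(seed_rows)
    while len(rows) < n_rows:
        context = rows[-window:]           # temporal dependency: recent history only
        rows.append(sample_next(context))  # attribute dependencies are handled by the sampler
    return rows


def toy_sampler(context: List[Row]) -> Row:
    """Placeholder for a trained network: the new row depends on the last one."""
    last = context[-1]
    gap = random.expovariate(1.0)  # time since the previous flow
    packets = max(1, round(random.gauss(last["packets"], 2)))
    return {
        "timestamp": last["timestamp"] + gap,
        "packets": packets,
        # Illustrative assumption: 40 to 1500 bytes per packet, so bytes track packets.
        "bytes": packets * random.randint(40, 1500),
    }


random.seed(0)
seed = [{"timestamp": 0.0, "packets": 10, "bytes": 9000}]
series = generate_series(toy_sampler, seed, n_rows=5)
```

In a real setup, `sample_next` would wrap the trained model, and the context window would determine how much history each new record can see.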

Deep Learning Techniques

STAN uses deep convolutional neural layers to capture the complex dependencies, and mixture density neural layers and softmax layers to precisely learn distributions.
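
To make the two kinds of output layer concrete, the sketch below computes the training losses they are usually paired with: the negative log-likelihood of a value under a Gaussian mixture, and of a class label under a softmax. It assumes, as is common, that the mixture handles continuous attributes and the softmax handles discrete ones. The function names are hypothetical and the code is plain Python rather than the package's PyTorch implementation. In a real network the logits, means, and standard deviations would be layer outputs.

```python
import math
from typing import Sequence


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def gaussian_mixture_nll(
    x: float,
    weight_logits: Sequence[float],
    means: Sequence[float],
    stds: Sequence[float],
) -> float:
    """Negative log-likelihood of x under a 1-D Gaussian mixture (continuous attribute)."""
    log_norm = logsumexp(weight_logits)
    log_terms = [
        (logit - log_norm)            # log of the mixture weight
        - 0.5 * ((x - mu) / s) ** 2   # Gaussian exponent
        - math.log(s)
        - 0.5 * math.log(2 * math.pi)
        for logit, mu, s in zip(weight_logits, means, stds)
    ]
    return -logsumexp(log_terms)


def categorical_nll(label_index: int, logits: Sequence[float]) -> float:
    """Negative log-likelihood of a class under a softmax (discrete attribute)."""
    return logsumexp(logits) - logits[label_index]


# Made-up parameters: a two-component mixture over flow duration,
# and a three-way softmax over a protocol label.
duration_loss = gaussian_mixture_nll(0.344, [0.2, -0.1], [0.3, 5.0], [0.1, 2.0])
protocol_loss = categorical_nll(0, [2.0, 0.5, -1.0])
```

A mixture lets the model represent multi-modal continuous values (for example, very short and very long flows) that a single Gaussian would blur together.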

Step 3

Evaluation

Statistical Distribution

The counts of unique IP addresses in real-world datasets follow a power law, and they do so in our synthetic data too.
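
One simple way to eyeball such a power law is to rank IP addresses by how often they occur and fit a straight line to the rank-frequency curve on log-log axes. The sketch below does this with the standard library. It is not the evaluation procedure from the paper, the helper names are hypothetical, and the toy address list is fabricated purely to exercise the code.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple


def rank_frequency(ips: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (rank, count) pairs, with rank 1 for the most frequent address."""
    counts = sorted(Counter(ips).values(), reverse=True)
    return list(enumerate(counts, start=1))


def loglog_slope(points: List[Tuple[int, int]]) -> float:
    """Least-squares slope of log(count) against log(rank)."""
    xs = [math.log(rank) for rank, _ in points]
    ys = [math.log(count) for _, count in points]
    n = len(points)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Toy data: address i appears about 1000 / i times, so the slope should be close to -1.
toy_ips = [f"10.0.0.{i}" for i in range(1, 51) for _ in range(round(1000 / i))]
slope = loglog_slope(rank_frequency(toy_ips))
```

Running the same functions on the source-IP column of a real trace and of a synthetic one gives two slopes that can be compared directly. A roughly straight log-log curve with similar slopes is consistent with both following a similar power law.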

Domain Specifics

Our synthetic data mimics real-world relationships between dependent network attributes, such as number of bytes and number of packets.

ML Task Performance

In our tests, our synthetic data supports equivalent outcomes for real network security machine learning tasks such as value forecasting and information completion.

Install It

Try out our code to generate your own synthetic network traffic data.

Based on PyTorch
Supports GPU acceleration

 > pip install stannetflow

Frequently Asked Questions

FAQ

How do I use it?

Install the package via pip or clone the repo on GitHub.

Get started generating data right away with the saved model in the repo.

Or, train your own model with the available sample data or bring your own data.

How should I cite it?

@inproceedings{xu2021stan,
  title={STAN: Synthetic Network Traffic Generation with Generative Neural Models},
  author={Xu, Shengzhe and Marwah, Manish and Arlitt, Martin and Ramakrishnan, Naren},
  booktitle={International Workshop on Deployable Machine Learning for Security Defense},
  pages={3--29},
  year={2021},
  organization={Springer}
}

Who is involved?