Synthetic network traffic generation with generative neural models
Generating synthetic network data, using generative neural models, that is realistic enough to replace real data in machine learning tasks while avoiding the privacy concerns that come with real data.
Read the Paper
STAN: Synthetic Network Traffic Generation with Generative Neural Models
Shengzhe Xu, Manish Marwah, Martin Arlitt, and Naren Ramakrishnan.
Step 0
Motivation
Key Concern
Malicious cyber activity cost the U.S. economy between $57 billion and $109 billion in 2016.
Privacy
Organizations are reluctant to share real-life data due to privacy concerns.
Other Solutions
Anonymization may leak private information. Perturbing the original data degrades it.
Our Solution
Generate synthetic network flow data based on real-world data without leaking any sensitive data.
We generate tabular rows of synthetic network data via a temporal process with homogeneous columns:
Step 1
Generate a Netflow Record
Multiple Attribute Types
STAN uses a flexible deep neural network architecture for learning and generating any combination of network traffic attributes with the help of well-designed decoders for both continuous attributes and discrete attributes.
Continuous Variables
STAN produces continuous variables such as timestamps, durations, number of bytes, and number of packets.
Discrete Variables
STAN is also flexible enough to handle the specific meanings of domain attributes, such as IP addresses and port numbers, and can learn complicated domain semantics such as the relationship between packet size and protocol flags.
IP Addresses and Ports
STAN can also generate IP addresses and ports that follow domain semantics and are aware of standard practices.
STAN Row Generator
Attribute | Type | Example |
---|---|---|
timestamp | continuous | 2016-04-11 00:02:15 |
duration | continuous | 0.344 |
bytes | continuous | 11238 |
packets | continuous | 11 |
transport protocol | discrete | TCP |
source IP address | discrete | 85.201.196.53 |
source port | discrete | 19925 |
destination IP address | discrete | 42.219.145.151 |
destination port | discrete | 80 |
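To make the row layout concrete, the sketch below models one record with the attributes from the table and runs a few basic sanity checks on it. The class name, field names, the assumption that duration is in seconds, and the checks themselves are illustrative only; they are not part of the STAN code base.

```python
from dataclasses import dataclass
from datetime import datetime
from ipaddress import IPv4Address


@dataclass
class NetflowRecord:
    """Hypothetical container for one row; not STAN's own data structure."""
    # Continuous attributes
    timestamp: datetime
    duration: float        # assumed to be in seconds
    num_bytes: int
    num_packets: int
    # Discrete attributes
    protocol: str
    src_ip: IPv4Address
    src_port: int
    dst_ip: IPv4Address
    dst_port: int


def is_plausible(rec: NetflowRecord) -> bool:
    """Illustrative sanity checks that a generated row could be put through."""
    return (
        rec.duration >= 0
        and rec.num_packets >= 1
        and rec.num_bytes >= rec.num_packets   # assumes at least one byte per packet
        and 0 <= rec.src_port <= 65535         # ports are 16-bit values
        and 0 <= rec.dst_port <= 65535
    )


# The example row from the table above.
example = NetflowRecord(
    timestamp=datetime(2016, 4, 11, 0, 2, 15),
    duration=0.344,
    num_bytes=11238,
    num_packets=11,
    protocol="TCP",
    src_ip=IPv4Address("85.201.196.53"),
    src_port=19925,
    dst_ip=IPv4Address("42.219.145.151"),
    dst_port=80,
)
```

Parsing the addresses with `ipaddress.IPv4Address` rejects malformed strings at construction time, so only the numeric ranges need explicit checks.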
Step 2
Generate Netflow Scenario Time Series Table
Building a Netflow Scenario
From Records to Tables
A useful dataset, however, needs more than individual records. STAN captures both temporal dependencies and attribute dependencies, and generates the netflow time series autoregressively.
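The idea of autoregressive generation can be sketched in a few lines: each new row is sampled conditioned on a window of previously generated rows, and within a row later attributes are sampled conditioned on earlier ones. The toy sampler below is a hand-written stand-in for a learned model; the window size, the field names, and the sampling rules are all assumptions made for illustration.

```python
import random

WINDOW = 3  # number of previous rows used as context (illustrative value)


def sample_row(context, rng):
    """Toy stand-in for a learned conditional model p(row | previous rows)."""
    # Temporal dependency: the inter-arrival gap drifts around the recent average gap.
    recent_gaps = [row["gap"] for row in context] or [1.0]
    mean_gap = sum(recent_gaps) / len(recent_gaps)
    gap = max(0.0, rng.gauss(mean_gap, 0.1))
    # Attribute dependency: bytes are sampled conditioned on the packet count drawn first.
    packets = rng.randint(1, 20)
    num_bytes = packets * rng.randint(40, 1500)
    return {"gap": gap, "packets": packets, "bytes": num_bytes}


def generate(n_rows, seed=0):
    """Generate rows one at a time, feeding earlier rows back in as context."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        context = rows[-WINDOW:]  # only rows generated so far are visible
        rows.append(sample_row(context, rng))
    return rows


rows = generate(5)
```

In a trained model, `sample_row` would be replaced by a network that outputs the conditional distribution of each attribute; the outer loop stays the same.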
Deep Learning Techniques
STAN uses deep convolutional neural layers to capture the complex dependencies, and combines mixture density layers (for continuous attributes) with softmax layers (for discrete attributes) to learn the distributions precisely.
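As a rough illustration of how these two kinds of output layer are used at generation time, the sketch below samples a discrete attribute from softmax probabilities and a continuous attribute from a Gaussian mixture. It assumes the continuous head is a mixture density layer with Gaussian components. It uses plain Python rather than PyTorch, leaves out the convolutional network entirely, and takes the layer outputs (logits, means, standard deviations) as given; all function names and parameter values are hypothetical.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_discrete(logits, labels, rng):
    """Softmax head: pick one label according to its probability."""
    return rng.choices(labels, weights=softmax(logits), k=1)[0]


def sample_mixture(weight_logits, means, sigmas, rng):
    """Mixture density head: pick a Gaussian component, then sample from it."""
    weights = softmax(weight_logits)
    k = rng.choices(range(len(means)), weights=weights, k=1)[0]
    return rng.gauss(means[k], sigmas[k])


def mixture_log_likelihood(x, weight_logits, means, sigmas):
    """Log density of x under the mixture; its negative is the usual training loss."""
    weights = softmax(weight_logits)
    density = sum(
        w * math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        for w, mu, s in zip(weights, means, sigmas)
    )
    return math.log(density)


rng = random.Random(0)
protocol = sample_discrete([2.0, 0.5, -1.0], ["TCP", "UDP", "ICMP"], rng)
duration = sample_mixture([0.0, 1.0], means=[0.3, 5.0], sigmas=[0.1, 1.0], rng=rng)
```

A mixture can represent multi-modal continuous attributes, such as durations that cluster around a few typical values, which a single Gaussian output cannot.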
Step 3
Evaluation
Statistical Distribution
The counts of unique IP addresses in real-world datasets follow a power law, and they do so in our synthetic data too.
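One quick, informal way to look for power-law behavior is to rank IP addresses by how often they occur and fit a straight line to log(count) against log(rank); a roughly linear relationship with a negative slope is consistent with a power law. The sketch below does this with a plain least-squares fit. It is a heuristic and not a rigorous statistical test, the function name is hypothetical, and the toy addresses are made up rather than taken from any dataset.

```python
import math
from collections import Counter


def rank_frequency_slope(ip_addresses):
    """Least-squares slope of log(count) versus log(rank)."""
    counts = sorted(Counter(ip_addresses).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct IP addresses")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Toy data: the i-th address appears about 1000 / i times (a Zipf-like pattern).
toy_ips = []
for i in range(1, 51):
    toy_ips.extend([f"10.0.0.{i}"] * round(1000 / i))

slope = rank_frequency_slope(toy_ips)  # close to -1 for this toy data
```

Running the same function on a real trace and on a synthetic one and comparing the two slopes gives a first impression of whether the heavy-tailed shape is preserved.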
Domain Specifics
Our synthetic data mimics real-world relationships between dependent network attributes, such as the number of bytes and the number of packets.
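A simple way to compare such a relationship across two datasets is to compute the correlation between packets and bytes in each and look at the difference. The sketch below uses the Pearson correlation for this; the choice of metric, the function names, and the sample values are assumptions for illustration and do not reproduce the evaluation in the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def bytes_packets_correlation(rows):
    """Correlation between packet count and byte count over (packets, bytes) rows."""
    packets = [p for p, _ in rows]
    num_bytes = [b for _, b in rows]
    return pearson(packets, num_bytes)


# Made-up (packets, bytes) pairs; not taken from any real or STAN-generated dataset.
real = [(1, 60), (3, 400), (11, 11238), (20, 21000), (5, 2100)]
synthetic = [(2, 120), (4, 900), (10, 9800), (18, 19500), (6, 2600)]

gap = abs(bytes_packets_correlation(real) - bytes_packets_correlation(synthetic))
```

A small `gap` suggests that the synthetic rows preserve the strength of the dependency, although it says nothing about whether the individual values are realistic.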
ML Task Performance
In our tests, our synthetic data supports equivalent outcomes for real network security machine learning tasks such as value forecasting and information completion.
Install It
Try out our code to generate your own synthetic network traffic data.
- Based on PyTorch
- Supports GPU acceleration
> pip install stannetflow
Frequently Asked Questions
FAQ
How do I use it?
Install the package via pip or clone the repo on GitHub.
Get started generating data right away with the saved model in the repo.
Or, train your own model with the available sample data or bring your own data.
How should I cite it?
@inproceedings{xu2021stan,
title={STAN: Synthetic Network Traffic Generation with Generative Neural Models},
author={Xu, Shengzhe and Marwah, Manish and Arlitt, Martin and Ramakrishnan, Naren},
booktitle={International Workshop on Deployable Machine Learning for Security Defense},
pages={3--29},
year={2021},
organization={Springer}
}