Data Description

 

1. The Network Traffic Monitoring System

 

We built a network traffic monitoring system that executes continuous queries over 10 hosts in the database group’s network. The 10 hosts continuously notify a central coordinator of changes to their query answers, and the coordinator records the aggregated query answers.

 

At each host, the number of packets going through the network interface is counted using the tcpdump utility. There are 5 filters coded in the program that sort out different types of packets to answer 5 different queries. The 5 queries are:

 

Q1: Monitor the volume of traffic between hosts within the organization and external hosts.

 

Q2: Monitor the volume of incoming traffic received by all hosts within the organization.

 

Q3: Monitor the volume of incoming SYN packets received by all hosts within the organization.

 

Q4: Monitor the volume of outgoing DNS lookup requests originating from within the organization.

 

Q5: Monitor the volume of remote login (telnet, ssh, ftp, etc.) requests received by hosts within the organization that originate from external hosts.

 

The hosts update the query answers once every second. Using a one-minute moving window, we define the data object for each query answer as the average number of packets per second over the preceding 60 seconds. Each query answer is therefore a continuously changing numerical data object.
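
As an illustration, the per-host data object could be maintained as follows (a minimal sketch, not the original implementation; how the first, partially filled minute is handled is an assumption):

    # Minimal sketch: the data object at a host is the average number of
    # packets per second over the last 60 seconds.
    from collections import deque

    class SlidingWindowRate:
        def __init__(self, window_seconds=60):
            self.window_seconds = window_seconds
            self.counts = deque()              # per-second packet counts, oldest first

        def add_second(self, packet_count):
            """Record the packet count observed during the most recent second."""
            self.counts.append(packet_count)
            if len(self.counts) > self.window_seconds:
                self.counts.popleft()          # drop counts older than one minute

        def value(self):
            """Current data object: average packets/second over the window."""
            if not self.counts:                # assumption: report 0 before any data arrives
                return 0.0
            return sum(self.counts) / len(self.counts)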

 

Every second, the 10 hosts notify the central coordinator of any changes to their query answers. The central coordinator aggregates the query answers from all hosts to obtain the overall answers to the 5 queries.
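
The aggregation itself is assumed here to be a simple sum of the latest per-host answers, which is consistent with the example records in Section 3; a minimal sketch:

    # Minimal sketch: the overall query answer is assumed to be the sum of the
    # latest per-host answers, recomputed whenever a host reports a change.
    def aggregate(latest_per_host):
        """latest_per_host: dict mapping host id -> latest per-host query answer."""
        return sum(latest_per_host.values())

    latest_q2 = {host: 0.0 for host in range(1, 11)}   # 10 hosts, hypothetical initial state
    latest_q2[5] = 0.083333                            # host 5 reports a new answer for Q2
    overall_q2 = aggregate(latest_q2)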

 

 

2. The Data Approximation Algorithm

 

To reduce the communication cost between the central coordinator and the 10 hosts, an adaptive data approximation algorithm is implemented in the network traffic monitoring system. (Please refer to “Adaptive Filters for Continuous Queries over Distributed Data Streams” at http://dbpubs.stanford.edu:8090/pub/2002-55.)

 

For each query, the central coordinator is given a precision constraint C, which is a numeric value. At any time, the central coordinator is required to give an interval of width C that contains the precise value of the answer to this query. That is, if the precise query answer is V, the central coordinator gives [L, U] such that U - L = C and L <= V <= U. Based on some policy, the central coordinator divides the overall constraint C for a query among the 10 hosts, assigning constraint Ci to host i. (The sum of all Ci’s is C.)

While host i keeps a precise data value Vi for its query answer, it only sends to the central coordinator an interval [Li, Ui] of width Ci that contains the precise data value Vi. Over time, as long as Vi remains within the old [Li, Ui], the host does not need to send anything to the central coordinator. Only when Vi moves out of the interval [Li, Ui] will the host update Li and Ui, and notify the central coordinator of the new interval [Li’, Ui’] that contains the new data value. Note that Ui’ - Li’ = Ci must still hold. In this way, the communication cost is reduced.
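
A minimal sketch of the host-side filter logic described above; re-centring the new interval on the current value is an assumption, since the host may place [Li’, Ui’] around Vi in other ways:

    # Minimal sketch of the host-side filter for one query. The width C_i is
    # assigned by the coordinator; re-centring on the new value is an assumption.
    class HostFilter:
        def __init__(self, width):
            self.width = width                 # C_i
            self.lower = None                  # L_i
            self.upper = None                  # U_i

        def update(self, value):
            """Return a new (L_i, U_i) to send to the coordinator, or None if no message is needed."""
            if self.lower is not None and self.lower <= value <= self.upper:
                return None                    # V_i still inside [L_i, U_i]: stay silent
            self.lower = value - self.width / 2.0
            self.upper = value + self.width / 2.0   # U_i - L_i == C_i still holds
            return (self.lower, self.upper)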

 

The central coordinator adaptively reassigns new constraints Ci to the 10 hosts according to the rates at which the data change at the different hosts. Intuitively, a data object with large fluctuations is assigned a loose (wide) constraint, while a data object that varies within a narrow range is assigned a tight (narrow) constraint. The sum of the constraints for a given query over the 10 hosts always remains the same. This policy keeps the communication cost as low as possible.
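
For illustration only, the sketch below reallocates widths in proportion to each host’s recent update count, which is one rough proxy for its data-changing rate; the actual adaptive policy is the one described in the cited paper:

    # Illustrative sketch only -- not the policy from the cited paper. Widths are
    # made proportional to each host's recent update count, keeping the total
    # equal to the overall constraint C.
    def reallocate_widths(total_width, recent_updates):
        """recent_updates: dict mapping host id -> number of recent interval updates."""
        total_updates = sum(recent_updates.values())
        if total_updates == 0:
            equal = total_width / len(recent_updates)
            return {host: equal for host in recent_updates}
        # Note: a practical policy would also enforce a minimum width per host so
        # that a quiet host is not assigned a zero-width (always-reporting) filter.
        return {host: total_width * count / total_updates
                for host, count in recent_updates.items()}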

 

To best capture changes to the intervals sent by the hosts, we use two parameters to represent an interval. The center is the midpoint between the lower bound and the upper bound of the interval, and is updated by the host. The width is the width of the interval, and is modified by the central coordinator. A (center, width) pair fully describes an interval for a data object.
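
The conversion between the two representations is straightforward; for example (illustrative helper names):

    # Illustrative helpers for converting between [L, U] and (center, width).
    def to_center_width(lower, upper):
        return ((lower + upper) / 2.0, upper - lower)

    def to_bounds(center, width):
        return (center - width / 2.0, center + width / 2.0)

    # to_bounds(0.083333, 0.020000) -> (0.073333, 0.093333)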

 

 

3. Data Files

 

We ran our system for a few weeks, and collected and recorded the network traffic rate during this time.

 

We have a set of files, each containing the network traffic data for one day. Each time a host sent the central coordinator a new interval for a data object, the central coordinator wrote a log record to the file, specifying the host name, the query number, the interval, and the timestamp.

 

A record is of the following format:

 

            Query-Host     Interval Center  Interval Width               Timestamp

 

E.g., record

 

“Q2-5            0.083333            0.020000        5”

 

means that at the fifth second of the day, host 5 sent the central coordinator its query answer for Q2 as an interval with center 0.083333 and width 0.020000. We can conclude that the precise query answer lies between 0.073333 and 0.093333. To make later processing easier, upon each update from a host we also update the aggregated query answer at the central coordinator. So right after the above record, we should see another record that looks something like

 

“Q2-all             0.533333            0.200000        5”,

 

which means the overall query answer for Q2 at the fifth second of the day was updated to lie between 0.433333 and 0.633333.
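
A minimal sketch of parsing such a record and recovering the interval bounds (field names are illustrative):

    # Minimal parsing sketch; field names are illustrative.
    def parse_record(line):
        query_host, center, width, timestamp = line.split()
        query, host = query_host.split("-")        # host is "all" for aggregate records
        center, width = float(center), float(width)
        return {
            "query": query,
            "host": host,
            "lower": center - width / 2.0,
            "upper": center + width / 2.0,
            "timestamp": int(timestamp),
        }

    record = parse_record("Q2-all 0.533333 0.200000 5")
    # record["lower"] is approximately 0.433333, record["upper"] approximately 0.633333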

 

At the beginning of each file, we also included the starting time and ending time of the data contained in the file. So if a file started on “Jan 20, 2003 23:39”, timestamp n in a record should be interpreted as the n’th second starting from then.
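
Accordingly, a record’s timestamp can be converted to an absolute time once the starting time from the file header has been parsed (a sketch; parsing the header itself is assumed):

    # Sketch: converting a record's timestamp to an absolute time, assuming the
    # starting time from the file header has already been parsed into a datetime.
    from datetime import datetime, timedelta

    def absolute_time(file_start, timestamp_seconds):
        return file_start + timedelta(seconds=timestamp_seconds)

    start = datetime(2003, 1, 20, 23, 39)    # "Jan 20, 2003 23:39"
    absolute_time(start, 5)                  # -> 2003-01-20 23:39:05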