

DNSANON_RSSAC


Dnsanon_rssac is an implementation of most of RSSAC-002v2 processing for
DNS statistics. Given the "RSSAC Advisory on Measurements of the Root
Server System", at
https://www.icann.org/en/system/files/files/rssac-002-measurements-root-20nov14-en.pdf,
it provides all values that can be computed from packet captures. Its
processing can be parallelized and done incrementally.


Design Goals

Explicit design goals:

-   Incremental computation. It _must_ be possible to compute statistics
    over the day and merge once, or to compute statistics at different
    sites and backhaul only minimal information for a central merge.

-   Extensibility. It should be easy to add measurements in the future.

-   Constant memory usage. Roots get attacked; we don't want attacks to
    take out (computationally) the measurement system by having steps
    that require O(n) memory.

-   Optional parallel processing. It works with Hadoop or GNU parallel,
    although it can also run sequentially.

Non-goals:

-   High performance (we know some ways to make it faster; maybe in the
    future).

-   Pedantic levels of accuracy. The goal is to support root operation,
    and that does not require 5 decimal places of precision. We believe
    our approach is correct (we're just adding up sums), but we do not
    currently implement careful checks around time boundaries (midnight).

-   Computation of RSSAC-002 values that cannot be easily derived from
    packet captures. We do not compute the load time nor zone size
    metrics.

-   Graphs. (Although if you want to add some, please let us know.)

Although not an explicit goal, this implementation is largely independent
of the other implementations we know of. We depend on dnsanon, which
includes some code from DSC (TCP reassembly).


The Basic Idea

The basic idea: nearly everything in RSSAC-002 is a specialized version
of "word count", if you write the words carefully. That lets one use
Hadoop-style parallelism to process and combine data.

Get pcaps and extract the DNS queries to Fsdb format (Fsdb is
tab-separated text with a header; see http://www.isi.edu/~johnh/FSDB).

Convert each pcap's queries to "rssacint" format, an internal format
that supports easy aggregation. Each line of rssacint format is of the
format (OPERATOR)(KEY) (COUNT). For example, for "+udp-ipv4-queries 10"
the operator is "+", the key is "udp-ipv4-queries" and we've seen 10 of
them. The + means if we see two rows with the same key, we can add them
together. (In practice we use terser keys because we move a lot of bytes
around, so this key is actually "+3u04".) There are several operators;
see rssacint_reduce for details.

Rssacint files can be arbitrarily combined using the rssacint_reduce
command. Just merge and sort two or more files, and the reduce command
will sum the counts without losing information.
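
The "word count" idea for the + operator can be sketched in a few lines of
shell and awk. This is only an illustration, not the real rssacint_reduce:
the function name sum_plus_keys is hypothetical, it handles only the +
operator, and it assumes each line is a key and a count separated by
whitespace. Because the input is sorted, it only needs to remember the
current key and its running total.

```shell
#!/bin/sh
# Hypothetical stand-in for the "+" operator only; NOT the real rssacint_reduce.
# Reads sorted "KEY COUNT" lines and sums each run of identical keys,
# holding only the current key and its running total in memory.
sum_plus_keys() {
    awk '
        /^#/ || NF < 2 { next }                  # skip comments and short lines
        ($1 "") != key {                         # compare keys as strings
            if (key != "") print key "\t" total  # emit the finished key
            key = $1 ""; total = 0
        }
        { total += $2 }
        END { if (key != "") print key "\t" total }
    '
}

# Two sites each reduce their own data first...
site_a=$(printf '%s\n' '+udp-ipv4-queries 10' '+tcp-ipv4-queries 2' |
    LC_COLLATE=C sort -k 1,1 | sum_plus_keys)
site_b=$(printf '%s\n' '+udp-ipv4-queries 5' |
    LC_COLLATE=C sort -k 1,1 | sum_plus_keys)

# ...then a central merge sorts the partial results and reduces them again.
printf '%s\n%s\n' "$site_a" "$site_b" | LC_COLLATE=C sort -k 1,1 | sum_plus_keys
```

The final merge reports 2 for +tcp-ipv4-queries and 15 for
+udp-ipv4-queries, the same totals as reducing all three input rows in one
pass. That is the property that makes incremental and multi-site
processing safe for summed keys.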

As the last step, count the number of unique sources and convert to
YAML. These steps lose information.


The Specific Workflow

A full pipeline is:

1.  collect pcaps of all traffic. We use LANDER. An alternative is dnscap.

We assume pcaps show up as a series of files with dates and/or sequence
numbers. For B, they look like 20151227-050349-00203216.pcap, where the
last set of numbers are a sequence number.

2.  extract the DNS queries to "message" format. We use dnsanon. Dnsanon
    is packaged separately at
    https://ant.isi.edu/software/dnsanon/index.html.

    <20151227-050349-00203216.pcap dnsanon -i - -o - -p MQ \
    >20151227-050349-00203216.message_question.xz

3.  convert messages to rssacint format. Use ./message_to_rssacint.

    xzcat 20151227-050349-00203216.message_question.xz |
    ./message_to_rssacint --file-seqno=203216 >20151227-050349-00203216.rssacint

3a. optionally (but recommended), process that rssacint format locally
to reduce data size:

    < 20151227-050349-00203216.rssacint LC_COLLATE=C sort -k 1,1 | \
    ./rssacint_reduce > smaller.20151227-050349-00203216.rssacint

4.  merge all rssacint files into one big one and reduce it (can be done
    multiple times).

    cat smaller.*.rssacint | LC_COLLATE=C sort -k 1,1 |
    ./rssacint_reduce > complete.rssacint.fsdb

5.  reduce it again to count unique IPs

    < complete.rssacint.fsdb ./rssacint_reduce --count-ips \
    > complete.rssacfin.fsdb

6.  Convert rssacfin to YAML. We use ./rssacfin_to_rssacyaml.

    < complete.rssacfin.fsdb ./rssacfin_to_rssacyaml

In Hadoop terms, steps 2 and 3 are the map phase, 3a is a combiner, step
4 is a reduce phase, and steps 5 and 6 are a second reduce phase. When
we run with Hadoop we often do steps 5 and 6 as a single process.

(And there is nothing magical about Hadoop. The only requirement is that
data be sorted before any rssacint_reduce step.)
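
Step 3 passes the capture's sequence number with --file-seqno. When
scripting the per-file steps, that number can be derived from the pcap
name. The sketch below assumes file names follow the B-Root pattern shown
above (DATE-TIME-SEQNO.pcap) and that leading zeros are dropped, as in the
step 3 example; the helper name seqno_from_pcap is hypothetical and not
part of this package.

```shell
#!/bin/sh
# Hypothetical helper: derive the --file-seqno value from a pcap file name
# of the form DATE-TIME-SEQNO.pcap (an assumption based on the B-Root example).
seqno_from_pcap() {
    base=${1##*/}        # drop any leading directories
    base=${base%.pcap}   # drop the .pcap extension
    seq=${base##*-}      # keep the text after the last hyphen, e.g. 00203216
    # Strip leading zeros as text so the value is never read as octal.
    seq=$(printf '%s' "$seq" | sed 's/^0*//')
    printf '%s\n' "${seq:-0}"   # an all-zero sequence number becomes 0
}

seqno_from_pcap 20151227-050349-00203216.pcap   # prints 203216
```

The result can then be substituted into the step 3 command, for example
as --file-seqno="$(seqno_from_pcap "$f")" inside a loop over pcap files.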


Detailed Documentation and Sample Output

Each program has a manual page with examples and short sample input and
output.

Extended sample output is included in the sample_data subdirectory. Run
cd sample_data; make test to exercise this sample output as a test
suite.


At B

For B-Root, we capture about 1 pcap file every minute or two (step 1)
and process them incrementally over the day (steps 2 and 3). Every night
we run step 4 as a map-reduce job with Hadoop, and run the final reduce
directly (without Hadoop).

Each pcap file is 2GB uncompressed. Each message file is about 200MB
compressed (xz). A merged rssacint file for a day of traffic is
typically 10MB after xz compression. After counting unique IPs, this
drops to about 2KB.

We have checked our computations for internal consistency and against
the Hedgehog implementation of RSSAC-002. We believe our results are
internally consistent. We see some differences with Hedgehog's numbers,
but they are close. We believe some differences are due to B-Root's
specific use of Hedgehog, which triggers a limitation of Hedgehog that we
have never worked around.



INSTALLATION


These programs use the standard Perl build system. To install:

    perl Makefile.PL
    make
    make test
    make install

For customization options, see ExtUtils::MakeMaker::FAQ(3) or
http://perldoc.perl.org/ExtUtils/MakeMaker/FAQ.html.

The current version of dnsanon_rssac is at
https://ant.isi.edu/software/dnsanon_rssac/.

This program depends on dnsanon, available from
https://ant.isi.edu/software/dnsanon/.



RELEASES


-   dnsanon_rssac-1.0 2016-05-29: First public release.
-   dnsanon_rssac-1.1 2016-05-29: Corrects RPM build specification.



FEEDBACK


We are interested in feedback, particularly about correctness, and in
hearing from other active users.

Please contact John Heidemann johnh@isi.edu with comments.
