Looking at a typical packet capture, we find several layers whose size is fixed or easily predictable: the per-packet record header (timestamp and lengths), the Ethernet header (always 14 bytes), the IP header (almost always 20 bytes, and easy to check), the TCP or UDP headers... Some higher-level protocols are also interesting, but let's focus on the concept.
Networking layers have repetitive structures that are easy to compress with many algorithms, and I had the intuition that they are even easier to compress when treated in isolation. For example, to compress N packets from a TCP conversation we could first parse the Ethernet layer, that is, the N Ethernet headers, which take only one of two values; then the Internet layer, which has a repetitive structure very dependent on the context; and so on. We could "slice" the traffic by layers and compress each layer individually, potentially making the compression algorithm's work easier by feeding it more homogeneous data.
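To illustrate the intuition with a toy example (synthetic data, not from the capture): if the same 14-byte Ethernet header precedes every packet, slicing the headers into their own stream turns them into one long repetitive run that any compressor handles trivially, while the payload stream compresses no worse than before:

# Toy illustration: identical headers interleaved with random payloads,
# vs. the same bytes sliced into a header stream plus a payload stream.
import os, zlib

header = '\x00\x1b\x21\x3c\x9d\xf8\x00\x1c\x23\x4d\x5e\x6f\x08\x00' # 14 bytes
packets = [header + os.urandom(50) for _ in range(1000)]

mixed = ''.join(packets)
sliced = ''.join(p[:14] for p in packets) + ''.join(p[14:] for p in packets)

print 'mixed: ', len(zlib.compress(mixed, 9))
print 'sliced:', len(zlib.compress(sliced, 9)) # the header stream almost vanishes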
As a proof of concept, I took a handy-enough packet capture from the Mid-Atlantic Collegiate Cyber Defense Competition and started playing with it (or rather, with its first 100,000 packets to start with).
I started by analyzing the protocols in use and the packet sizes:
$ tshark -r maccdc2012_00000.pcap.gz -c 100000 -Tfields -e _ws.col.Protocol |sort|uniq -c|sort -rg|head
88167 SMB
6445 TCP
2742 TLSv1
638 STP
408 HTTP
397 UDP
307 ICMP
277 SSLv2
264 EIGRP
74 SSHv2
Looks like SMB is the prevailing protocol. But let's not dive deeper than TCP in this case, since I want to try a very generic solution.
Fixed Layer Sizes
Instead of full protocol analysis, we can try to choose good fixed layer sizes by just looking at packet sizes:
$ tshark -r maccdc2012_00000.pcap.gz -c 100000 -Tfields -e frame.len|sort|uniq -c|sort -rg|head
43783 109
14562 212
14560 208
4896 207
4872 209
4853 217
4281 70
1445 78
769 64
609 68
So for this proof of concept, I will be using the following fixed-size layers. Note that I am adding two fixed-size layers based on the most frequent packet sizes (109 and 212 bytes), plus another for the rest. These layers should really be alternatives to each other, but for the sake of simplicity I am treating them as if they represented additional protocol layers (a sketch of the "alternatives" idea follows the tuple).
layersizes = (
    16, # Packet header (time & length)
    14, # Ethernet
    20, # IP (almost always)
    20, # TCP (typical)
    109 - (14+20+20), # Bytes til 109
    212 - (14+20+20+109), # Intended as bytes til 212 (actually 49; see the typo note below)
    9999 # The rest
)
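As an aside, treating the size-based layers as true alternatives could look roughly like this: route each packet to a bucket for its size class, then slice and compress each bucket separately. This is a hypothetical sketch, not what the prototype does (size_class and buckets are my own naming):

# Hypothetical: treat size-based layers as alternatives by routing each
# packet to a bucket for its size class, keeping every bucket homogeneous.
import dpkt

def size_class(length):
    if length <= 109: return 109
    if length <= 212: return 212
    return 'rest'

buckets = {} # size class -> raw packets; each group sliced/compressed on its own
for ts, buf in dpkt.pcap.Reader(open('1e5packets.pcap', 'rb')):
    buckets.setdefault(size_class(len(buf)), []).append(buf)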
With a simple Python (dpkt) script I sliced the PCAP into one stream per layer (a 16-byte-per-packet header stream plus six content layers), together adding up to the same decompressed size:
for ts, buf in pcap:
    packets += 1
    l = len(buf)
    bytecount += l
    size_hist[l] = size_hist.get(l, 0) + 1
    # Stream 0 gets the per-packet record: timestamp plus both lengths
    streams[0][1].write(struct.pack('dII', ts, l, l))
    streams[0][2] += 16 # Same size as in PCAP
    # Slice the packet bytes across the remaining layer streams
    pos, rest = 0, len(buf)
    for s in streams[1:]:
        fragments += 1
        w = min(s[0], rest) # take at most this layer's width
        s[1].write(buf[pos:pos+w])
        s[2] += w
        if w >= rest:
            break
        pos += w
        rest -= w
Note that we generated one stream per layer, and concatenating them yields a file of the same size as the original pcap:
$ cat 1e5packets-layer* > 1e5packets.layers
$ ll|cut -c30-
1600024 Jul 30 09:31 1e5packets-layer0
1400000 Jul 30 09:31 1e5packets-layer1
2000000 Jul 30 09:31 1e5packets-layer2
2000000 Jul 30 09:31 1e5packets-layer3
5194853 Jul 30 09:31 1e5packets-layer4
2297257 Jul 30 09:31 1e5packets-layer5
2678388 Jul 30 09:31 1e5packets-layer6
17170522 Jul 30 09:32 1e5packets.layers # Concatenated layers
17170522 Jul 27 22:16 1e5packets.pcap # Original
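Reversing the slicing is straightforward: read each 16-byte record from the header stream to recover the packet length, then pull the right number of bytes from each layer stream in turn. A sketch of the inverse operation, assuming the layer files above have been decompressed back in place:

# Sketch: rebuild packets from the decompressed layer streams.
import struct

hdrs = open('1e5packets-layer0', 'rb')
hdrs.read(24) # skip the 24-byte PCAP header stub written at the start
layers = [open('1e5packets-layer%d' % i, 'rb') for i in range(1, 7)]
widths = layersizes[1:] # the content layers from the tuple above

while True:
    rec = hdrs.read(16)
    if len(rec) < 16:
        break
    ts, caplen, wirelen = struct.unpack('dII', rec)
    buf, rest = '', caplen
    for f, w in zip(layers, widths):
        take = min(w, rest)
        buf += f.read(take)
        rest -= take
        if rest == 0:
            break
    # ts and buf are now ready to be written back out as a pcap record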
Now we can easily check the effect of compressing the original and the "layered" files with different algorithms:
Size     | File                  | CPU seconds | Compression ratio | Layering improvement | MBytes / CPU sec
17170522 | 1e5packets.pcap       | -    | 1.00 | -      | -
3861578  | 1e5packets.pcap.gz1   | 0.5  | 4.45 | n/a    | 26.62
3615304  | 1e5packets.layers.gz1 | 0.4  | 4.75 | 6.81%  | 33.89
3413950  | 1e5packets.pcap.gz    | 1.7  | 5.03 | n/a    | 8.09
3382712  | 1e5packets.pcap.gz9   | 5.3  | 5.08 | n/a    | 2.60
3272017  | 1e5packets.layers.gz9 | 8.8  | 5.25 | 3.38%  | 1.58
3266231  | 1e5packets.pcap.bz1   | 3.4  | 5.26 | n/a    | 4.09
3238030  | 1e5packets.layers.gz  | 1.5  | 5.30 | 5.43%  | 9.29
3149087  | 1e5packets.layers.bz2 | 5.3  | 5.45 | -1.17% | 2.65
3112274  | 1e5packets.pcap.bz2   | 3.6  | 5.52 | n/a    | 3.91
2989412  | 1e5packets.layers.bz1 | 3.9  | 5.74 | 9.26%  | 3.64
2438016  | 1e5packets.pcap.xz1   | 2.2  | 7.04 | n/a    | 6.70
2283236  | 1e5packets.pcap.xz    | 17.0 | 7.52 | n/a    | 0.88
2281332  | 1e5packets.pcap.xz9   | 17.0 | 7.53 | n/a    | 0.88
2148592  | 1e5packets.layers.xz1 | 2.0  | 7.99 | 13.47% | 7.51
1803756  | 1e5packets.layers.xz  | 15.0 | 9.52 | 26.58% | 1.02
1802924  | 1e5packets.layers.xz9 | 15.2 | 9.52 | 26.54% | 1.01
In the best case, "xz" compression, layering boosts compression efficiency by 13~27%, and in most cases CPU cost decreased as well (the exceptions being bzip2 and gzip -9).
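For reference, each compressed file in the table is produced by a one-liner of this shape, mirroring the script's compress() helper below (my reconstruction, not the literal commands used):

$ time -p xz -1c 1e5packets.layers > 1e5packets.layers.xz1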
Note: Some results were actually better with different layer sizes that resulted from a typo in the first runs of last night, so there is room for improvement in the way we choose layer sizes...
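One obvious improvement would be deriving the boundaries automatically from the size histogram the script already collects (see the commented-out top_sizes lines at the bottom of the script). A hypothetical sketch, with guess_layersizes being my own naming:

# Hypothetical: derive layer boundaries from the packet-size histogram,
# using the most frequent frame sizes as cut points after the fixed headers.
def guess_layersizes(size_hist, fixed=(16, 14, 20, 20), top=2):
    top_sizes = sorted(size_hist, key=size_hist.get, reverse=True)[:top]
    offset = sum(fixed[1:]) # content bytes already consumed (skip the 16-byte record)
    layers = list(fixed)
    for cut in sorted(top_sizes):
        if cut > offset:
            layers.append(cut - offset) # slice up to the next popular size
            offset = cut
    layers.append(99999) # the rest
    return tuple(layers)

# e.g. guess_layersizes({109: 43783, 212: 14562}) -> (16, 14, 20, 20, 55, 103, 99999)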
Test Code
Using Python 2.7 because the dpkt module was not yet available for v3 on my Ubuntu system:
pcapz.py
#!/usr/bin/env python2.7
# "pcapz" prototype by gatopeich 2017

import dpkt
import subprocess
import struct
import os
import time
import zlib

input = '1e5packets'
f = open(input+'.pcap')
pcap = dpkt.pcap.Reader(f)

headers = (
    16, # Packet header (time & length)
    # 14+20+20, # Eth+IP+TCP
    14, # Ethernet
    20, # IP (almost always)
    20, # TCP (typical)
    109-14-20-20,
    # 212-14-20-20,
    99999 # Rest
)
print "headers:", headers

compression_methods = (
    ('bz1', 'bzip2 -1c'),
    # ('bz9', 'bzip2 -9c'),
    # ('gz1', 'gzip -1c'),
    # ('gz9', 'gzip -9c'),
    ('xz1', 'xz -1c'),
    # ('xz9', 'xz -9c'),
    ('zlib1', zlib.compress, 1),
    ('zlib9', zlib.compress, 9)
)

def compress(method, filename):
    cfile = filename+'.'+method[0]
    if len(method) == 2:
        # External compressor: run it under "time -p" and parse the reported time
        out = subprocess.check_output("time -p %s %s > %s"%(method[1],filename,cfile),
                                      shell=True, stderr=subprocess.STDOUT)
        size = os.stat(cfile).st_size
        cpu = float(out.split()[1])
    else:
        # In-process zlib: time the compress() call directly
        data = open(filename,'rb').read()
        t0 = time.time()
        size = len(method[1](data,method[2]))
        cpu = time.time() - t0
    return size, cpu

class Stream:
    N = 0
    def __init__(self, basename, header_size):
        self.h_size = header_size
        self.filename = '%s-stream%d'%(basename, Stream.N)
        Stream.N += 1
        self.file = open(self.filename,'wb')
        self.size = 0
    def write(self, data):
        self.file.write(data)
        self.size += len(data)
    def compress(self, methods=compression_methods):
        # Try every method on this stream's file and keep the best result
        self.file.close()
        self.best_size = self.size
        self.best_method = None
        print "%s: %d bytes."%(self.filename, self.size),
        fsize = float(self.size)
        for method in methods:
            size, cpu = compress(method, self.filename)
            print method[0], 'x%.1f(%.1fs),'%(fsize/size,cpu),
            if size < self.best_size:
                self.best_size = size
                self.cpu_cost = cpu
                self.best_method = method[0]
        print
        print " => -%.1f Kbytes with %s in %.1f seconds."%(
            (self.size-self.best_size)/1024.0, self.best_method, self.cpu_cost)

streams = tuple(Stream(input,hs) for hs in headers)
streams[0].write('--- PCAP Header STUB ---')
size_hist = {}
packets, fragments, bytecount = 0, 0, 0
for ts, buf in pcap:
    packets += 1
    l = len(buf)
    bytecount += l
    size_hist[l] = size_hist.get(l,0) + 1
    streams[0].write(struct.pack('dII',ts,l,l))
    pos, rest = 0, l
    for s in streams[1:]:
        fragments += 1
        if s.h_size > 0:
            w = min(s.h_size, rest)
        elif (-s.h_size) >= l:
            w = rest
        else:
            continue
        s.write(buf[pos:pos+w])
        rest -= w
        if rest <= 0:
            break
        pos += w
    assert rest == 0

print "Done. packets, fragments, bytecount =", packets, fragments, bytecount

total = [0,0,0]
for s in streams:
    s.compress()
    total[0] += s.size
    total[1] += s.best_size
    total[2] += s.cpu_cost
print "All streams:", total[0], "==>", total[1], "(-%.3fMB)"%((total[0]-total[1])/(1024*1024.0)),
print "ratio = %.1f"%(float(total[0])/total[1]),
print "cost =", total[2], " cpusecs."

# Print the stats we use above for selecting layer sizes...
#top_sizes = sorted(map(lambda ab: ab[::-1], size_hist.items()), reverse=True)
#print "top_sizes:", top_sizes[:7]
# ==> [(43783, 109), (14562, 212), (14560, 208), (4896, 207), (4872, 209), (4853, 217), (4281, 70)]
And some sample outputs:
headers: (16, 14, 20, 20, 55, 158, 99999)
Done. packets, fragments, bytecount = 100000 448868 15570498
1e5packets-stream0: 1600024 bytes. bz1 x14.6(0.2s), xz1 x23.2(0.1s), zlib1 x10.2(0.0s), zlib9 x10.7(1.1s),
 => -1495.3 Kbytes with xz1 in 0.1 seconds.
1e5packets-stream1: 1400000 bytes. bz1 x155.0(0.8s), xz1 x114.0(0.1s), zlib1 x43.4(0.0s), zlib9 x115.9(0.1s),
 => -1358.4 Kbytes with bz1 in 0.8 seconds.
1e5packets-stream2: 2000000 bytes. bz1 x6.7(0.4s), xz1 x7.5(0.3s), zlib1 x3.5(0.1s), zlib9 x3.5(1.9s),
 => -1694.0 Kbytes with xz1 in 0.3 seconds.
1e5packets-stream3: 2000000 bytes. bz1 x4.1(0.4s), xz1 x6.6(0.3s), zlib1 x2.9(0.1s), zlib9 x3.2(0.9s),
 => -1656.4 Kbytes with xz1 in 0.3 seconds.
1e5packets-stream4: 5194853 bytes. bz1 x4.1(1.0s), xz1 x5.5(0.8s), zlib1 x3.6(0.1s), zlib9 x4.2(4.1s),
 => -4144.7 Kbytes with xz1 in 0.8 seconds.
1e5packets-stream5: 4807822 bytes. bz1 x8.8(1.2s), xz1 x13.0(0.3s), zlib1 x9.1(0.1s), zlib9 x10.1(0.2s),
 => -4334.7 Kbytes with xz1 in 0.3 seconds.
1e5packets-stream6: 167823 bytes. bz1 x5.2(0.1s), xz1 x6.9(0.0s), zlib1 x6.1(0.0s), zlib9 x6.4(0.0s),
 => -140.2 Kbytes with xz1 in 0.0 seconds.
All streams: 17170522 ==> 1991123 (-14.476203MB) ratio = 8.6 cost = 2.65 cpusecs.

headers: (16, 8, 8, 8, 8, 8, 14, 55, 99999)
Done. packets, fragments, bytecount = 100000 747720 15570498
1e5packets-stream0: 1600024 bytes. bz1 x14.6(0.2s), xz1 x23.2(0.1s), xz9 x30.9(2.0s), zlib1 x10.2(0.0s), zlib9 x10.7(1.1s),
 => -1512.0 Kbytes with xz9 in 2.0 seconds.
1e5packets-stream1: 800000 bytes. bz1 x126.4(0.5s), xz1 x91.5(0.0s), xz9 x116.5(0.2s), zlib1 x37.5(0.0s), zlib9 x103.8(0.1s),
 => -775.1 Kbytes with bz1 in 0.5 seconds.
1e5packets-stream2: 800000 bytes. bz1 x101.9(0.5s), xz1 x87.3(0.0s), xz9 x107.2(0.2s), zlib1 x34.2(0.0s), zlib9 x80.8(0.1s),
 => -774.0 Kbytes with xz9 in 0.2 seconds.
1e5packets-stream3: 800000 bytes. bz1 x5.5(0.2s), xz1 x7.4(0.1s), xz9 x8.2(0.9s), zlib1 x3.1(0.0s), zlib9 x2.9(1.9s),
 => -685.6 Kbytes with xz9 in 0.9 seconds.
1e5packets-stream4: 800000 bytes. bz1 x6.2(0.2s), xz1 x7.9(0.1s), xz9 x10.4(0.8s), zlib1 x3.5(0.0s), zlib9 x3.5(1.6s),
 => -706.4 Kbytes with xz9 in 0.8 seconds.
1e5packets-stream5: 800000 bytes. bz1 x43.3(0.4s), xz1 x44.9(0.0s), xz9 x56.0(0.3s), zlib1 x22.4(0.0s), zlib9 x34.8(0.1s),
 => -767.3 Kbytes with xz9 in 0.3 seconds.
1e5packets-stream6: 1400000 bytes. bz1 x2.9(0.3s), xz1 x4.5(0.3s), xz9 x6.3(0.9s), zlib1 x2.2(0.1s), zlib9 x2.4(1.1s),
 => -1149.2 Kbytes with xz9 in 0.9 seconds.
1e5packets-stream7: 5194853 bytes. bz1 x4.1(1.0s), xz1 x5.5(0.8s), xz9 x5.9(5.4s), zlib1 x3.6(0.1s), zlib9 x4.2(4.1s),
 => -4209.3 Kbytes with xz9 in 5.4 seconds.
1e5packets-stream8: 4975645 bytes. bz1 x8.5(1.3s), xz1 x12.6(0.4s), xz9 x12.9(3.4s), zlib1 x9.0(0.1s), zlib9 x9.9(0.2s),
 => -4481.5 Kbytes with xz9 in 3.4 seconds.
All streams: 17170522 ==> 1748617 (-14.707MB) ratio = 9.8 cost = 14.28 cpusecs.

The best ratio I have found manually is 10.0:
headers: (16, 8, 8, 8, 8, 8, 16, 64, 64, 99999)
Done. packets, fragments, bytecount = 100000 793191 15570498
1e5packets-stream0: 1600024 bytes. bz1 x14.6(0.2s), xz1 x23.2(0.1s), xz9 x30.9(1.9s), zlib1 x10.2(0.0s), zlib9 x10.7(1.1s),
 => -1512.0 Kbytes with xz9 in 1.9 seconds.
1e5packets-stream1: 800000 bytes. bz1 x126.4(0.5s), xz1 x91.5(0.0s), xz9 x116.5(0.2s), zlib1 x37.5(0.0s), zlib9 x103.8(0.1s),
 => -775.1 Kbytes with bz1 in 0.5 seconds.
1e5packets-stream2: 800000 bytes. bz1 x101.9(0.5s), xz1 x87.3(0.0s), xz9 x107.2(0.2s), zlib1 x34.2(0.0s), zlib9 x80.8(0.1s),
 => -774.0 Kbytes with xz9 in 0.2 seconds.
1e5packets-stream3: 800000 bytes. bz1 x5.5(0.2s), xz1 x7.4(0.1s), xz9 x8.2(0.9s), zlib1 x3.1(0.0s), zlib9 x2.9(1.9s),
 => -685.6 Kbytes with xz9 in 0.9 seconds.
1e5packets-stream4: 800000 bytes. bz1 x6.2(0.2s), xz1 x7.9(0.1s), xz9 x10.4(0.7s), zlib1 x3.5(0.0s), zlib9 x3.5(1.6s),
 => -706.4 Kbytes with xz9 in 0.7 seconds.
1e5packets-stream5: 800000 bytes. bz1 x43.3(0.4s), xz1 x44.9(0.0s), xz9 x56.0(0.3s), zlib1 x22.4(0.0s), zlib9 x34.8(0.1s),
 => -767.3 Kbytes with xz9 in 0.3 seconds.
1e5packets-stream6: 1600000 bytes. bz1 x2.0(0.4s), xz1 x2.9(0.5s), xz9 x3.6(1.1s), zlib1 x1.7(0.1s), zlib9 x1.7(1.7s),
 => -1128.0 Kbytes with xz9 in 1.1 seconds.
1e5packets-stream7: 5518075 bytes. bz1 x5.5(1.1s), xz1 x7.5(0.6s), xz9 x8.3(5.6s), zlib1 x5.0(0.1s), zlib9 x5.9(2.7s),
 => -4735.7 Kbytes with xz9 in 5.6 seconds.
1e5packets-stream8: 2981162 bytes. bz1 x7.1(0.6s), xz1 x11.8(0.2s), xz9 x11.6(3.0s), zlib1 x7.0(0.0s), zlib9 x8.5(0.4s),
 => -2664.3 Kbytes with xz1 in 0.2 seconds.
1e5packets-stream9: 1471261 bytes. bz1 x12.5(0.8s), xz1 x14.5(0.1s), xz9 x14.8(0.5s), zlib1 x12.3(0.0s), zlib9 x13.4(0.0s),
 => -1339.6 Kbytes with xz9 in 0.5 seconds.
All streams: 17170522 ==> 1720437 (-14.734MB) ratio = 10.0 cost = 12.0 cpusecs.

Combined xz1/xz4 compression achieves a very nice 9.6 in just 4 seconds:
headers: (16, 8, 8, 8, 8, 8, 16, 64, 64, 99999)
Done. packets, fragments, bytecount = 100000 793191 15570498
1e5packets-stream0: 1600024 bytes. xz1 x23.2(0.1s), xz4 x21.9(0.6s),
 => -1495.3 Kbytes with xz1 in 0.1 seconds.
1e5packets-stream1: 800000 bytes. xz1 x91.5(0.0s), xz4 x91.2(0.1s),
 => -772.7 Kbytes with xz1 in 0.0 seconds.
1e5packets-stream2: 800000 bytes. xz1 x87.3(0.0s), xz4 x87.0(0.1s),
 => -772.3 Kbytes with xz1 in 0.0 seconds.
1e5packets-stream3: 800000 bytes. xz1 x7.4(0.1s), xz4 x8.2(0.7s),
 => -685.4 Kbytes with xz4 in 0.7 seconds.
1e5packets-stream4: 800000 bytes. xz1 x7.9(0.1s), xz4 x10.5(0.7s),
 => -706.9 Kbytes with xz4 in 0.7 seconds.
1e5packets-stream5: 800000 bytes. xz1 x44.9(0.0s), xz4 x49.0(0.1s),
 => -765.3 Kbytes with xz4 in 0.1 seconds.
1e5packets-stream6: 1600000 bytes. xz1 x2.9(0.5s), xz4 x3.7(1.1s),
 => -1137.8 Kbytes with xz4 in 1.1 seconds.
1e5packets-stream7: 5518075 bytes. xz1 x7.5(0.6s), xz4 x6.5(1.7s),
 => -4674.7 Kbytes with xz1 in 0.6 seconds.
1e5packets-stream8: 2981162 bytes. xz1 x11.8(0.2s), xz4 x12.2(0.6s),
 => -2672.9 Kbytes with xz4 in 0.6 seconds.
1e5packets-stream9: 1471261 bytes. xz1 x14.5(0.1s), xz4 x14.6(0.2s),
 => -1338.6 Kbytes with xz4 in 0.2 seconds.
All streams: 17170522 ==> 1788160 (-14.670MB) ratio = 9.6 cost = 4.15 cpusecs.