Extracting 'HTTP CONNECT' Requests with Python

Published: 2022-11-14
Last Updated: 2022-11-14 02:35:27 UTC
by Jesse La Grew (Version: 1)
1 comment(s)

Seeing abnormal Suricata alerts isn’t too unusual in my home environment. In many cases the cause is a TLD being resolved that at one point in time was very suspicious. With the increased legitimate adoption of some of these domains, these alerts have become less useful, although they are still interesting to investigate. I ran into a few of these alerts one night, and when I dug deeper, the volume, frequency, and source of the alerts were all unusual.

Figure 1: Suspicious Suricata Alerts 

The source indicated that the alerts were coming from a dedicated internal firewall on my network, which is used to gather additional data on Honeypot attack traffic. The source ended up being my DShield honeypot. These alerts have come up before, but the amount was very unusual. Since this traffic wasn’t being shown in my standard web honeypot logs, I decided to look at local PCAP captures.

Figure 2: PCAP HTTP CONNECT Requests from Wireshark

The data showed a variety of HTTP CONNECT requests arriving at the honeypot. HTTP CONNECT requests are often used with proxy servers to open a connection to a desired destination [1]. Looking into any one of the streams didn’t give much additional information, since the CONNECT requests were asking for tunnels to encrypted (HTTPS) destinations.
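To make the shape of these requests concrete, here is a minimal sketch that pulls the request target and Host header out of the raw bytes of a CONNECT request. These are the same two values the script later in this diary records as the path and host. The function name parse_connect and the sample request are illustrative assumptions, not something taken from the captured traffic.

```python
def parse_connect(raw):
    """Return (target, host_header) for a raw CONNECT request, else None."""
    # Only the header block matters; anything after the blank line is ignored.
    head = raw.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    lines = head.split("\r\n")

    # Request line: METHOD SP request-target SP HTTP-version
    parts = lines[0].split(" ")
    if len(parts) != 3 or parts[0] != "CONNECT":
        return None

    host_header = ""
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "host":
            host_header = value.strip()
            break

    # For CONNECT the request target is in authority form, "host:port".
    return parts[1], host_header


if __name__ == "__main__":
    # Hypothetical request; example.com is a placeholder destination.
    sample = (b"CONNECT example.com:443 HTTP/1.1\r\n"
              b"Host: example.com:443\r\n"
              b"User-Agent: demo\r\n\r\n")
    print(parse_connect(sample))
```

A proxy that accepts the request answers with a 2xx status and then relays bytes in both directions, which is why the rest of the stream is opaque when the destination speaks TLS.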

Figure 3: TCP Stream of HTTP CONNECT Request from Wireshark

Zeek and other data were available to summarize this information, but I decided to pull together a Python script to process the PCAP files. The goal was to understand the scale of these requests and how they changed over time.

import csv
import os
import time
from collections import Counter

from scapy.all import IP, TCP, PcapReader
from scapy.layers import http


def print_header(header_text):
    print("{:>50s}  {:>10s}".format(header_text, "Count"))


directory = os.getcwd()
csv_path = "http_connect_info.csv"

# Values collected across all PCAP files for the summary tables.
src_ips = []
dst_ports = []
connect_paths = []
connect_hosts = []

# Append so results from earlier runs are kept; only write the header once.
write_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0

with open(csv_path, "a", newline="") as csv_file:
    csv_export = csv.writer(csv_file, lineterminator="\n")
    if write_header:
        csv_export.writerow(["Epoch Time", "Date", "Source IP", "Destination Port",
                             "HTTP CONNECT Path", "HTTP CONNECT Host"])

    for entry in os.scandir(directory):
        if not (entry.is_file() and ".pcap" in entry.name):
            continue
        print("Processing file: " + entry.path)
        for pkt in PcapReader(entry.path):
            # Only IPv4/TCP packets that Scapy dissected as HTTP requests.
            if not (pkt.haslayer(IP) and pkt.haslayer(TCP)
                    and pkt.haslayer(http.HTTPRequest)):
                continue
            request = pkt[http.HTTPRequest]
            if request.Method != b"CONNECT":
                continue

            src_ip = pkt[IP].src
            dst_port = pkt[TCP].dport
            # Path and Host are bytes, and either may be missing.
            connect_path = (request.Path or b"").decode(errors="replace")
            connect_host = (request.Host or b"").decode(errors="replace")

            # Record the values so the summaries below have data to count.
            src_ips.append(src_ip)
            dst_ports.append(dst_port)
            connect_paths.append(connect_path)
            connect_hosts.append(connect_host)

            timestamp = time.strftime('%Y-%m-%d %H:%M:%S %z',
                                      time.localtime(float(pkt.time)))
            row = [str(pkt.time), timestamp, src_ip, str(dst_port),
                   connect_path, connect_host]
            print(", ".join(row))
            csv_export.writerow(row)

src_ip_counts = Counter(src_ips)
dst_port_counts = Counter(dst_ports)
connect_path_counts = Counter(connect_paths)
connect_host_counts = Counter(connect_hosts)

print_header("Source IP")
for each_item in src_ip_counts.most_common():
    print("{:>50s}  {:10d}".format(each_item[0], each_item[1]))

print_header("Destination Port")
for each_item in dst_port_counts.most_common():
    print("{:>50d}  {:10d}".format(each_item[0], each_item[1]))

print_header("HTTP Connect Path")
for each_item in connect_path_counts.most_common():
    print("{:>50s}  {:10d}".format(each_item[0], each_item[1]))

print_header("HTTP Connect Host")
for each_item in connect_host_counts.most_common():
    print("{:>50s}  {:10d}".format(each_item[0], each_item[1]))

This script reviews all the *.pcap files in the current directory, prints out a basic summary of the HTTP CONNECT requests and also saves the data to a CSV file.

Figure 4: Destination Port and HTTP CONNECT Request Path Counts

Figure 5: HTTP CONNECT Request Host Counts

Figure 6: HTTP CONNECT Request Source IPs

For a small snapshot of a day or two, the script finished processing within an hour or so. I was curious how this compared to historic data, so I ran the same script against 6 months of PCAPs, which took over a day to process. A tool such as Zeek [2] would likely get this information more quickly. Zeek's http.log file contains the same request information, and a utility like zeek-cut [3] could help extract the raw requests.
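As a rough illustration of the Zeek route, the sketch below counts CONNECT requests from the lines of an http.log file. It assumes the log is in Zeek's default tab-separated format, where a "#fields" line names the columns, and it relies on the standard http.log fields method, uri, and id.orig_h. The function names and the sample log (which uses documentation addresses and only a subset of the real columns) are made up for the example.

```python
from collections import Counter


def connect_requests(lines):
    """Yield one dict per CONNECT entry from the lines of a TSV http.log."""
    fields = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            # The header names the columns: "#fields<TAB>ts<TAB>uid<TAB>..."
            fields = line.split("\t")[1:]
            continue
        if not line or line.startswith("#"):
            continue  # other metadata lines such as #separator or #close
        record = dict(zip(fields, line.split("\t")))
        if record.get("method") == "CONNECT":
            yield record


def summarize(lines):
    """Count CONNECT requests per source IP and per requested URI."""
    sources, targets = Counter(), Counter()
    for record in connect_requests(lines):
        sources[record.get("id.orig_h", "-")] += 1
        targets[record.get("uri", "-")] += 1
    return sources, targets


# Hypothetical sample; a real http.log has many more columns.
SAMPLE_LOG = "\n".join([
    "#separator \\x09",
    "#fields\tts\tid.orig_h\tid.resp_p\tmethod\thost\turi",
    "1668300000.0\t192.0.2.10\t8080\tCONNECT\texample.com\texample.com:443",
    "1668300001.0\t192.0.2.10\t8080\tCONNECT\texample.org\texample.org:443",
    "1668300002.0\t198.51.100.7\t80\tGET\texample.net\t/index.html",
])

if __name__ == "__main__":
    sources, targets = summarize(SAMPLE_LOG.splitlines())
    print(sources.most_common())
    print(targets.most_common())
```

To run it against a real log, pass an open file handle to summarize() instead of the sample lines. Logs written as JSON would need a different reader.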

One item that stood out in the data was that HTTP CONNECT requests had greatly increased this month, especially in the last week.

Figure 7: Graph of HTTP CONNECT Method Requests by Month Since May 2022

Figure 8: Graph of HTTP CONNECT Method Requests by Day in November 2022
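Counts like the ones graphed here can be derived from the CSV file the script writes. The following is a minimal sketch, assuming the "Epoch Time" column from that CSV and bucketing in UTC. The function name requests_per_period and the sample rows are illustrative only and are not the data behind the figures.

```python
import csv
import io
from collections import Counter
from datetime import datetime, timezone


def requests_per_period(csv_file, fmt="%Y-%m-%d"):
    """Count rows per UTC time bucket; fmt="%Y-%m" buckets by month instead."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        try:
            epoch = float(row["Epoch Time"])
        except (KeyError, TypeError, ValueError):
            continue  # skip repeated header rows or malformed lines
        bucket = datetime.fromtimestamp(epoch, tz=timezone.utc).strftime(fmt)
        counts[bucket] += 1
    return counts


# Hypothetical rows in the same column layout as http_connect_info.csv.
SAMPLE_CSV = (
    "Epoch Time,Date,Source IP,Destination Port,HTTP CONNECT Path,HTTP CONNECT Host\n"
    "1668297600.5,,192.0.2.10,8080,example.com:443,example.com\n"
    "1668301200.0,,192.0.2.10,8080,example.org:443,example.org\n"
    "1668384000.0,,198.51.100.7,80,example.net:443,example.net\n"
)

if __name__ == "__main__":
    by_day = requests_per_period(io.StringIO(SAMPLE_CSV))
    by_month = requests_per_period(io.StringIO(SAMPLE_CSV), fmt="%Y-%m")
    for day, count in sorted(by_day.items()):
        print(day, count)
    print(dict(by_month))
```

Bucketing in UTC keeps the day boundaries stable regardless of where the script runs. The Date column in the CSV uses local time, so daily totals based on it could differ slightly.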

Top 10 HTTP CONNECT Path Ports

HTTP CONNECT Path Port    Count
443                       64681
27115                      7876
25565                      1871
25900                       919
30125                       529
22                          483
30120                       468
3389                        467
80                          446
53                          417
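The port in each row is the part of the CONNECT path after the final colon. A tally like this one can be sketched as follows, assuming paths in host:port form. The function name port_counts and the sample paths are placeholders rather than values from the capture.

```python
from collections import Counter


def port_counts(paths):
    """Count the port portion of CONNECT paths given as 'host:port'."""
    counts = Counter()
    for path in paths:
        # Split on the last colon so IPv6 literals like [::1]:443 still work.
        _, sep, port = path.rpartition(":")
        if sep and port.isdigit():
            counts[int(port)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical paths; the last entry has no port and is skipped.
    sample_paths = [
        "example.com:443",
        "example.org:443",
        "192.0.2.5:25565",
        "no-port-here",
    ]
    for port, count in port_counts(sample_paths).most_common(10):
        print("{:>10d}  {:10d}".format(port, count))
```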

Top 10 HTTP CONNECT Source IP Addresses

Source IP                 Count
142[.]202[.]242[.]113     16164
69[.]30[.]246[.]66        11354
204[.]12[.]248[.]130      10902
65[.]109[.]19[.]42         9740
209[.]222[.]97[.]249       6747
69[.]30[.]243[.]18         3729
172[.]93[.]100[.]135       3557
142[.]202[.]243[.]109      2667
104[.]251[.]122[.]239      1759
167[.]99[.]176[.]180       1537


Top 10 HTTP CONNECT Paths

HTTP CONNECT Path                 Count
28sex[.]com:443                   16357
109[.]237[.]111[.]71:27115         7876
beo555[.]co:443                    4620
beo333[.]com:443                   4442
h5[.]xhlax[.]com:443               3764
www[.]korims[.]com:443             3119
www[.]serruriervaud[.]ch:443       1872
share[.]nuox[.]top:443             1730
18[.]140[.]35[.]119:443            1464
keokeo[.]top:443                   1144

Python can be a great way to programmatically extract data from a PCAP and use that data for other purposes, such as data enrichment or summarization. It was an easy way to summarize HTTP requests when other tools were unavailable. For larger pools of data, other tools such as Zeek can also be extremely useful.

The HTTP CONNECT requests may have been an attempt to relay traffic through the honeypot and hide the original source of the request. It is also possible that the traffic was funneled through multiple proxy endpoints to make the source difficult to identify. Allowing HTTP CONNECT on internet-facing resources can potentially expose internal network resources or assist in the forwarding of malicious traffic. The vast majority of the HTTP CONNECT requests (99.5%) were directed at TCP port 8080, with the remainder aimed at TCP port 80.

[1] https://www.rfc-editor.org/rfc/rfc9110.html#name-connect
[2] https://docs.zeek.org/en/master/about.html
[3] https://docs.zeek.org/en/v3.0.14/examples/logs/index.html

Jesse La Grew




Thank you for the very useful article.
Only a small correction: src_ip=pkt[IP].src.
