Getting a Better Handle on International Domain Names and Punycode
International domain names (IDNs) continue to be an interesting topic. For the most part, they are probably less of an issue than some people make them out to be, given that popular browsers like Google Chrome are pretty selective about displaying them. On the other hand, they are still in use, legitimately or not, and it is worth keeping a handle on them.
When analyzing DNS traffic, you will see the Punycode encoding of these domain names. Punycode is defined in RFC 3492 [1]. Punycode-encoded labels start with "xn--", which makes them easy to identify.
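As a minimal sketch of that first step, the function below splits a domain name into labels and returns the ones carrying the "xn--" prefix. The function name `punycode_labels` is made up for this illustration and is not taken from the idntest repository.

```python
def punycode_labels(domain):
    """Return the labels of a domain name that start with "xn--"."""
    # Drop a trailing root dot, then split into individual labels.
    labels = domain.rstrip(".").split(".")
    # DNS names are case-insensitive, so compare in lower case.
    return [label for label in labels if label.lower().startswith("xn--")]


if __name__ == "__main__":
    for name in ("www.xn--mnchen-3ya.de", "example.com"):
        print(name, punycode_labels(name))
```

Checking each label separately matters because only some labels of a name may be encoded, for example an ASCII top-level domain under an encoded second-level label.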
Several anomalies can occur with Punycode; luckily, some Python modules can help us identify them.
1 - Invalid Punycode
The Punycode standard is complex, and you may come across domain names that start with "xn--" but do not decode to valid Unicode.
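One way to spot such labels, sketched here using only the "punycode" codec that ships with Python, is to attempt a decode and treat a decoding error as a sign of invalid Punycode. The helper name `decode_label` is hypothetical, and this check covers only the RFC 3492 decoding step, not the additional IDNA rules about which characters are allowed.

```python
def decode_label(label):
    """Decode one DNS label; return None if its Punycode is invalid."""
    if not label.lower().startswith("xn--"):
        # Not Punycode-encoded, nothing to decode.
        return label
    try:
        # Strip the "xn--" prefix and decode the rest with the built-in codec.
        return label[4:].encode("ascii").decode("punycode")
    except UnicodeError:
        # Raised for non-ASCII input, truncated or otherwise malformed data.
        return None


if __name__ == "__main__":
    for label in ("xn--mnchen-3ya", "xn--mnchen-3y", "example"):
        print(label, "->", decode_label(label))
```

Returning None instead of raising keeps the function easy to use when filtering a large list of names seen in DNS logs.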
2 - Mixed Script
This is the most interesting issue: detecting whether a domain name mixes different languages. There is no easy way to identify the "language", so we use the "script" instead. A script identifies a group of languages that use the same characters; the Latin script, for example, is used for most European languages.
The Python "unicodedata2" module can be used to look up the Unicode name of a character, and the first word of that name identifies the script the character belongs to. Mixing different scripts in a domain name is suspicious, as legitimate international domain names should only use one language.
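The name-based approach can be sketched as follows. This is an illustration, not the code from the repository: it uses the standard-library `unicodedata` module so that it runs without extra installs, on the assumption that `unicodedata2` offers the same `name()` function with newer Unicode data. The function names `scripts_in` and `is_mixed_script` are made up for this example.

```python
import unicodedata

# ASCII digits, hyphen and dot appear in names of any script, so skip them.
NEUTRAL = set("0123456789-.")


def scripts_in(text):
    """Return the set of first words of the Unicode names of the characters."""
    scripts = set()
    for ch in text:
        if ch in NEUTRAL:
            continue
        name = unicodedata.name(ch, "")  # "" if the character has no name
        if name:
            # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            scripts.add(name.split()[0])
    return scripts


def is_mixed_script(text):
    """True if the decoded (Unicode) name uses more than one script."""
    return len(scripts_in(text)) > 1


if __name__ == "__main__":
    # "p\u0430ypal" contains a Cyrillic "a" (U+0430) among Latin letters.
    for name in ("p\u0430ypal.com", "m\u00fcnchen.de"):
        print(ascii(name), sorted(scripts_in(name)), is_mixed_script(name))
```

The input is expected to be the decoded Unicode form of the name, not the "xn--" form. Treating the first word of the name as the script is a heuristic: Japanese names, for example, legitimately combine characters whose names start with "CJK", "HIRAGANA", and "KATAKANA", so such combinations would need to be allowed explicitly.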
You can find a quick Python implementation on GitHub: https://github.com/jullrich/idntest
[1] https://datatracker.ietf.org/doc/html/rfc3492
Johannes B. Ullrich, Ph.D., Dean of Research, SANS.edu