Domo.rb & Domo.js: Canonize domain names easily
In a nutshell:
About a year ago, while I was still working at Kontera, my team leader asked me to write a small utility that could extract the canonical domain from a domain string. A canonical domain name is the domain authority together with all the TLD (top-level domain) suffixes that follow it. So, for example, motors.ebay.com and every other subdomain of ebay.com share the same canonical domain: ebay.com. That’s an easy example, but what happens when you take into consideration all the possible TLD suffixes (.com, .info, .org, .edu, .mobi …) and also all the country code suffixes (.us, .ca, .mx, .jp, .de, .il …)? To thicken the plot even more, some countries use co for commercial domains (e.g. the UK: www.ebay.co.uk) and others use com (e.g. Australia).
The reason we needed to canonize domains was for reporting purposes. We were counting our system’s page views and we wanted to divide them into per-domain buckets. The task was high priority, and I couldn’t find any existing Ruby code that did the job, so I managed to come up with a solution within an hour’s work. Extracting a domain’s canonical form can actually be achieved by applying a simple algorithm I devised (I’m sure I’m not the first one to think of this):
1. Strip protocols, ports and paths from the domain string.
"http://motors.ebay.co.uk:8082/search/buy_now.html" # Regular
"motors.ebay.co.uk" # Stripped
2. Split the domain by dots and reverse the array.
["uk", "co", "ebay", "motors"]
3. Iterate over the array until you reach an element that isn’t a standard TLD, keep that element, and discard the rest of the array.
["uk", "co", "ebay"] # "motors" dropped
4. Reverse the array back and join it. TADA - the canonical domain string:
"ebay.co.uk"
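The four steps above can be sketched in a few lines of Ruby. This is only an illustration and not necessarily how Domo.rb is written: the method name `canonical_domain` and the constant `STANDARD_TLDS` are made up for this example, and the TLD list is a tiny, hand-picked subset rather than the full reference list.

```ruby
require "set"

# Hypothetical, heavily abbreviated TLD list (an assumption for this sketch).
# A real list would cover every generic and country-code suffix.
STANDARD_TLDS = Set.new(%w[com org net info edu mobi co uk us ca mx jp de il au]).freeze

# Hypothetical helper name; it follows the four steps described above.
def canonical_domain(url)
  # 1. Strip the protocol, then the path, then the port.
  host = url.strip.downcase
  host = host.sub(%r{\A[a-z][a-z0-9+.-]*://}, "")
  host = host.sub(%r{[/?#].*\z}, "")
  host = host.sub(/:\d+\z/, "")

  # 2. Split by dots and reverse, so the TLD suffixes come first.
  parts = host.split(".").reverse

  # 3. Find the first element that is not a standard TLD and keep
  #    everything up to and including it.
  first_non_tld = parts.index { |part| !STANDARD_TLDS.include?(part) }
  kept = first_non_tld ? parts[0..first_non_tld] : parts

  # 4. Reverse back and join.
  kept.reverse.join(".")
end

puts canonical_domain("http://motors.ebay.co.uk:8082/search/buy_now.html") # prints "ebay.co.uk"
puts canonical_domain("motors.ebay.com")                                   # prints "ebay.com"
```

The sketch does not handle every URL shape (credentials such as user@host, for instance, are left untouched), so treat it as a starting point only.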
This method works fairly well, but it does have its pitfalls. It can’t find the canonical form of domains whose authority is itself a TLD. Assume an imaginary company named “Mobi”: the domain news.mobi.com will be erroneously canonized to news.mobi.com instead of mobi.com, because mobi is a standard TLD. If anyone has a clue on how to identify those trickier domains’ canonical form, please do share.
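To make the pitfall concrete, here is a small trace of the scanning step on the imaginary Mobi domain. The names `PITFALL_TLDS` and `kept_labels` are hypothetical, and the TLD list is again a short illustrative subset; the only thing that matters is that it contains "mobi".

```ruby
# Abbreviated, illustrative TLD list (an assumption); note that it includes "mobi".
PITFALL_TLDS = %w[com org net info mobi co uk].freeze

# Hypothetical helper: returns the reversed labels the scan would keep.
def kept_labels(domain)
  labels = domain.split(".").reverse
  stop = labels.index { |label| !PITFALL_TLDS.include?(label) }
  stop ? labels[0..stop] : labels
end

kept = kept_labels("news.mobi.com")
# "com" and "mobi" both match the TLD list, so the scan only stops at "news".
p kept                       # prints ["com", "mobi", "news"]
puts kept.reverse.join(".")  # prints "news.mobi.com", not "mobi.com"
```

Because the scan only compares labels against a flat list, it has no way to tell that "mobi" is acting as the company name here rather than as a suffix.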
Wikipedia has a full list of all top level domains, which I used as reference for matching TLDs in the code.