Domo.rb & Domo.js: Canonize domain names easily
In a nutshell:
>> Domo.canonize("motors.ebay.co.uk")
=> "ebay.co.uk"
Ruby: https://github.com/gurdotan/domo-rb
JavaScript: https://github.com/gurdotan/domo.js
In detail:
About a year ago, while I was still working at Kontera, my team leader asked me to write a small utility that could extract the canonical domain from a domain string. A canonical domain name is the domain authority followed by all of its TLD (top-level domain) suffixes. So for example, the domains www.ebay.com and motors.ebay.com share the same canonical domain: ebay.com. That’s an easy example, but what happens when you take into consideration all the possible TLD suffixes (.com, .info, .org, .edu, .mobi …) and also all the country code suffixes (.us, .ca, .mx, .jp, .de, .il …)? To thicken the plot even more, some countries use co for commercial domains (e.g. UK: www.ebay.co.uk) and others use com (e.g. Australia: www.qantas.com.au).
The reason we needed to canonize domains was for reporting purposes. We were counting our system’s page views and we wanted to divide them into per-domain buckets. The task was high priority, and I couldn’t find any existing Ruby code that did the job, so I managed to come up with a solution within an hour’s work. Extracting a domain’s canonical form can actually be achieved by applying a simple algorithm I devised (I’m sure I’m not the first one to think of this):
1. Strip protocols, ports and paths from the domain string.
"http://motors.ebay.co.uk:8082/search/buy_now.html" # Regular
"motors.ebay.co.uk" # Stripped
2. Split the domain by dots and reverse the array.
["uk", "co", "ebay", "motors"]
3. Iterate over the array until you reach the first element that isn’t a standard TLD. Keep that element and dispose of the rest of the array.
["uk", "co", "ebay"] # "motors" dropped
4. Reverse the array back and join it. TADA - the canonical domain string:
"ebay.co.uk"
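For illustration, the four steps above could be sketched in Ruby roughly like this. It is not the actual Domo code: SAMPLE_TLDS is a tiny hypothetical subset of the real TLD list, and strip_to_host and canonize_sketch are names made up for this example.

```ruby
# Hypothetical, heavily abbreviated TLD list for illustration only.
# A real list would cover every generic and country-code TLD.
SAMPLE_TLDS = %w[com org net info edu mobi co uk us ca mx jp de il au].freeze

# Step 1: strip the protocol, then everything from the first ":" or "/"
# onwards (port and path), leaving only the host name.
def strip_to_host(url)
  url.sub(%r{\A[a-z][a-z0-9+.-]*://}i, "").sub(%r{[:/].*\z}, "").downcase
end

def canonize_sketch(url)
  # Step 2: split on dots and reverse, so the TLD suffixes come first.
  labels = strip_to_host(url).split(".").reverse

  # Step 3: find the first label that is not a known TLD, keep it,
  # and drop everything after it.
  first_non_tld = labels.index { |label| !SAMPLE_TLDS.include?(label) }
  kept = first_non_tld ? labels[0..first_non_tld] : labels

  # Step 4: reverse back and join.
  kept.reverse.join(".")
end

puts canonize_sketch("http://motors.ebay.co.uk:8082/search/buy_now.html") # prints "ebay.co.uk"
puts canonize_sketch("www.qantas.com.au")                                 # prints "qantas.com.au"
```

If every label happens to be in the TLD list, this sketch simply returns the host unchanged. That is an arbitrary choice made for the example.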
This method works fairly well but it does have its pitfalls. It won’t manage to find the canonical form of domains whose authority is also a TLD. Take an imaginary company named “Mobi”: the domain news.mobi.com will be erroneously canonized to news.mobi.com instead of mobi.com, because mobi is a standard TLD, so the iteration treats it as just another suffix and only stops at news. If anyone has a clue on how to identify those trickier domains’ canonical form, please do share.
Come November 2011, I was asked by Oren, my manager at Capriza, to add a canonical domain to each record in the database. DOHHH!!! Why didn’t I open source that piece of code from a year ago? So this time around, even though I was doing it in JavaScript instead of Ruby, I decided to publish it all on GitHub for the future’s sake. Maybe my grandchildren will one day ask me how we canonized domains back in the day - I’ll just send them to my GitHub account and go back to watching TV :-)
Ruby: https://github.com/gurdotan/domo-rb
JavaScript: https://github.com/gurdotan/domo.js
Reference
Wikipedia has a full list of all top-level domains, which I used as a reference for matching TLDs in the code.
-
rubyglazed posted this
