Ruby Glazed

Sugar glazed tips and random thoughts about software, startups and life.

Domo.rb & Domo.js: Canonize domain names easily

In a nutshell:

>> Domo.canonize("motors.ebay.co.uk")
=> "ebay.co.uk"

Ruby: https://github.com/gurdotan/domo-rb

Javascript: https://github.com/gurdotan/domo.js

In detail:

About a year ago, while I was still working at Kontera, my team leader asked me to write a small utility that could extract the canonical domain from a domain string. A canonical domain name is the domain authority together with all the TLD (top-level domain) suffixes following it. So for example, the domains www.ebay.com and motors.ebay.com share the same canonical domain: ebay.com. That’s an easy example, but what happens when you take into consideration all the possible TLD suffixes (.com, .info, .org, .edu, .mobi …) and also all the country code suffixes (.us, .ca, .mx, .jp, .de, .il …)? To thicken the plot even more, some countries use co for commercial domains (e.g. UK: www.ebay.co.uk) and others use com (e.g. Australia: www.qantas.com.au).

We needed to canonize domains for reporting purposes. We were counting our system’s page views and we wanted to divide them into per-domain buckets. The task was high priority, and I couldn’t find any existing Ruby code that did the job, so I wrote my own solution within an hour. Extracting a domain’s canonical form can actually be achieved by applying a simple algorithm I devised (I’m sure I’m not the first one to think of this):

1. Strip protocols, ports and paths from the domain string.

"http://motors.ebay.co.uk:8082/search/buy_now.html"  # Regular
"motors.ebay.co.uk"                                  # Stripped

2. Split the domain by dots and reverse the array.

["uk", "co", "ebay", "motors"]

3. Iterate the array until you reach the first element that isn’t a standard TLD. Keep that element and dispose of the rest of the array.

["uk", "co", "ebay"]	# "motors" dropped

4. Reverse the array back and join it.  TADA - the canonical domain string:

"ebay.co.uk"
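
The four steps above can be sketched in a few lines of Ruby. This is a minimal illustration and not the actual Domo code: the names SAMPLE_TLDS, strip_to_host and canonize_sketch are made up for this example, and the TLD set is a tiny hand-picked sample rather than the full list.

```ruby
require 'set'

# Hypothetical, heavily abbreviated TLD set, for illustration only.
# A real list would cover every generic and country code TLD.
SAMPLE_TLDS = Set.new(%w[com org net info edu mobi co uk us ca mx jp de il au]).freeze

# Step 1: strip the protocol, path and port, leaving only the host.
def strip_to_host(url)
  host = url.sub(%r{\A[a-z][a-z0-9+.-]*://}i, '') # protocol
  host = host.sub(%r{[/?#].*\z}m, '')              # path, query string, fragment
  host = host.sub(/:\d+\z/, '')                    # port
  host.downcase
end

def canonize_sketch(url)
  # Step 2: split the host by dots and reverse the array.
  labels = strip_to_host(url).split('.').reverse

  # Step 3: find the first label that isn't a TLD, keep it, drop what follows.
  cutoff = labels.index { |label| !SAMPLE_TLDS.include?(label) }
  kept = cutoff ? labels[0..cutoff] : labels

  # Step 4: reverse back and join.
  kept.reverse.join('.')
end

puts canonize_sketch('http://motors.ebay.co.uk:8082/search/buy_now.html') # ebay.co.uk
puts canonize_sketch('www.qantas.com.au')                                 # qantas.com.au
```

If every label happens to be in the TLD set, this sketch simply returns the stripped host unchanged.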

This method works fairly well but it does have its pitfalls. It won’t find the canonical form of domains whose authority is also a TLD. Take an imaginary company named “Mobi”: the domain news.mobi.com will be erroneously canonized to news.mobi.com instead of mobi.com, because mobi is a standard TLD. If anyone has a clue on how to identify the canonical form of those trickier domains, please do share.
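
To see the pitfall in action, here is a compact stand-in for the approach, written only for this demonstration. DEMO_TLDS and naive_canonical are hypothetical names, and the TLD list is a short sample that happens to include mobi.

```ruby
# Hypothetical, abbreviated TLD list, for illustration only.
DEMO_TLDS = %w[com org net info edu mobi co uk].freeze

# Keep every trailing TLD label plus the single label right before them.
def naive_canonical(host)
  labels = host.downcase.split('.').reverse
  tld_count = labels.take_while { |label| DEMO_TLDS.include?(label) }.size
  labels.first(tld_count + 1).reverse.join('.')
end

puts naive_canonical('motors.ebay.co.uk') # ebay.co.uk
puts naive_canonical('news.mobi.com')     # news.mobi.com, not mobi.com
```

Because mobi is matched as a TLD, the scan walks past it and treats news as the authority.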

Come November 2011, I was asked by Oren, my manager at Capriza, to add a canonical domain to each record in the database. DOHHH!!! Why didn’t I open source that piece of code from a year ago? So this time around, even though I was doing it in Javascript instead of Ruby, I decided to publish it all on Github for the future’s sake. Maybe my grandchildren will one day ask me how we canonized domains back in the day - I’ll just send them to my Github account and go back to watching TV :-)

Ruby: https://github.com/gurdotan/domo-rb

Javascript: https://github.com/gurdotan/domo.js

Reference

Wikipedia has a full list of all top-level domains, which I used as a reference for matching TLDs in the code.

    • #canonical domain
    • #domain canonization
    • #javascript
    • #open source
    • #ruby
    • #top level domain

About me [Gur Dotan]

Software engineer, adrenaline junkie, husband, beer-aholic, Tel-Aviv dweller. When I'm not hacking away, you can find me jamming on my guitar, anxiously waiting for the next surf \ snow session. Previously a Konteran and Caprizioner, I am now a proud co-founder and chief code brewer of SOOMLA. SOOMLA is the complete virtual economy solution for mobile game devs. We enable automatic creation and management of In-App Purchase Stores. Our code is open source and available at http://project.soom.la.
