Hot Koehls
  • Email
  • Feedburner
  • Linkedin
  • Twitter
  • Home
  • About
  • Archives
  • Contact
  • Software
    • S3imple Backup
    • Twitter Feed Archiver
    • FileTime
    • Flickr API Demo
Search
Home» For techies » Get domain out of any URL string (yes, really)

Get domain out of any URL string (yes, really)

Posted by Frank - October 27, 2009 - For techies
0

It’s a common problem with no single right answer: extract the top domain (e.g. example.com) from a given string, which may or may not be a valid URL. I had need of such functionality recently and found answers around the web lacking. So if you ever “just wanted the domain name” out of a string, give this a shot…

 1) {
        preg_match("/^.+\.([a-z0-9\.\-]+\.[a-z]{2,4})$/", $host, $host);
        if (isset($host[1])) {
          return $host[1];
        } else {
          // not a valid domain
          return false;
        }
      } else {
        return $host;
      }
    break;
  }
}

// some examples
var_dump(get_top_domain('http://www.validurl.example.com/directory', 'all'));
var_dump(get_top_domain('http://www.validurl.example.com/directory', 'www'));
var_dump(get_top_domain('domain-string.example.com', 'all'));
var_dump(get_top_domain('domain-string.example.com/nowfails', 'all'));
var_dump(get_top_domain('finds the domain url.example.com', 'all'));
var_dump(get_top_domain('12.34.56.78', 'all'));
?>

Most of the examples are simply proofs, but I want to draw attention to the string in example #4, 'domain-string.example.com/nowfails'. This is not a valid URL, so the call to parse_url() fails, forcing the script to use the entire original string. In turn, the path part of the string causes the regex to break, causing a complete failout (return false;).

Is there a way to account for this? Surely, however I’m not about to tap that massive keg of exceptions (i.e. just a slash, slash plus path, slash plus another domain in a human-readable string, etc).

No regex for validating URL’s or email addresses is ever perfect; the “strict” RFC requirements are too damn broad. So I did what I always do: chose “what works” over “what’s technically right.” This one requires any 2-4 characters for a the top level domain (TLD), so it doesn’t allow for the .museum TLD, and doesn’t check to see if the provided TLD is actually valid. If you need to do further verification, that’s on you. Here’s the current full list of valid TLD’s provided by the IANA.

If you need to modify the regex at all, I highly recommend you read this article about email address regex first for two reasons:

  1. There’s a ton of overlap between email and URL regex matching
  2. It will point out all the gotcha’s in your “better” regex theory that you didn’t think about
coding theory, crappy coding, php, programming, regular expressions, software development

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Categories

  • For entrepreneurs
  • For everyone
  • For techies

Latest Tweets

  • The word traps planners plan themselves into | Life. Then strategy http://t.co/iANAdASb
    May 8, 2012 - 2:43 pm
  • Random network security tip for those about to appear on TV - Boing Boing http://t.co/tC1lXFQ4
    May 8, 2012 - 1:42 pm
  • A Picture http://t.co/H846Uy69
    April 27, 2012 - 12:25 pm
  • The Broken "Buy-One, Give-One" Model: 3 Ways to Save Toms Shoes | Co.Exist: World changing ideas and innovation http://t.co/RI0sVMW6
    April 10, 2012 - 12:23 pm

Recent Comments

  • whiz on What 255 characters looks like
  • Andrew on Find the second (or third, or fourth) occurence in a string
  • IanArcher on Get number of message parts in an email using PHP
  • Usama on Remove parent directories from tar archives
  • Frank on It’s dangerous to go alone

Recent Posts

  • It’s dangerous to go alone
  • Create Self-Signed Wildcard SSL Certificate
  • What comes after the yottabyte?
  • Write code like they do in Hollywood
  • Brian Rolle machine gun celebration
(c) 2012 Frank Koehl. All Rights Reserved.
  • Contact Us
  • Sitemap