Get domain out of any URL string (yes, really)

It’s a common problem with no single right answer: extract the top domain (e.g. example.com) from a given string, which may or may not be a valid URL. I had need of such functionality recently and found answers around the web lacking. So if you ever “just wanted the domain name” out of a string, give this a shot…

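<?php
/**
 * Extract a domain from a string that may or may not be a valid URL.
 * $remove_subdomains: 'www' strips only a leading "www."; 'all' (default)
 * keeps just the trailing "name.tld". Returns false if no domain matches.
 */
function get_top_domain($url, $remove_subdomains = 'all') {
  // parse_url() returns null (no host) or false (malformed input),
  // so cast to string before lowercasing
  $host = strtolower((string) parse_url($url, PHP_URL_HOST));
  // no host found: fall back to the raw string, lowercased to suit the regex
  if ($host == '') $host = strtolower($url);
  switch ($remove_subdomains) {
    case 'www':
      if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
      }
      return $host;
    case 'all':
    default:
      if (substr_count($host, '.') > 1) {
        // capture the trailing "name.tld"; the TLD must be 2-4 letters
        if (preg_match('/^.+\.([a-z0-9\.\-]+\.[a-z]{2,4})$/', $host, $matches)) {
          return $matches[1];
        }
        // not a valid domain
        return false;
      }
      return $host;
  }
}

// some examples
var_dump(get_top_domain('http://www.validurl.example.com/directory', 'all')); // "example.com"
var_dump(get_top_domain('http://www.validurl.example.com/directory', 'www')); // "validurl.example.com"
var_dump(get_top_domain('domain-string.example.com', 'all'));                 // "example.com"
var_dump(get_top_domain('domain-string.example.com/nowfails', 'all'));        // false
var_dump(get_top_domain('finds the domain url.example.com', 'all'));          // "example.com"
var_dump(get_top_domain('12.34.56.78', 'all'));                               // false
?>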

Most of the examples are simply proofs, but I want to draw attention to the string in example #4, 'domain-string.example.com/nowfails'. This is not a valid URL (it has no scheme), so parse_url() finds no host, forcing the script to use the entire original string. In turn, the path part of the string keeps the regex from matching, causing a complete failout (return false;).

Is there a way to account for this? Surely, but I’m not about to tap that massive keg of exceptions (i.e. just a slash, slash plus path, slash plus another domain in a human-readable string, etc).
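For what it's worth, the simplest of those cases (a bare host followed by a path) can be handled by handing parse_url() a scheme to work with. This is only a rough sketch: guess_host() is a hypothetical helper, not part of the function above, and it assumes that anything without a leading "scheme://" is a bare host. It does nothing for human-readable strings.

```php
<?php
// Hypothetical helper: give parse_url() a scheme when the input lacks one.
function guess_host($input) {
  $input = trim($input);
  // Assumption: no "scheme://" prefix means the string starts with the host.
  if (!preg_match('#^[a-z][a-z0-9+.\-]*://#i', $input)) {
    $input = 'http://' . $input;
  }
  $host = parse_url($input, PHP_URL_HOST);
  // parse_url() gives null or false when it cannot find a host.
  return is_string($host) ? strtolower($host) : false;
}

var_dump(guess_host('domain-string.example.com/nowfails'));
var_dump(guess_host('http://www.validurl.example.com/directory'));
```

The result could then be fed to get_top_domain() in place of the raw string.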

No regex for validating URLs or email addresses is ever perfect; the “strict” RFC requirements are too damn broad. So I did what I always do: chose “what works” over “what’s technically right.” This one requires 2-4 letters for the top-level domain (TLD), so it doesn’t allow for the .museum TLD, and it doesn’t check whether the provided TLD is actually valid. If you need to do further verification, that’s on you. Here’s the current full list of valid TLDs provided by the IANA.
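If you do want that extra verification, one way to go about it is to check the last label against a list of TLDs you accept. A minimal sketch, assuming a hypothetical has_known_tld() helper and a tiny hand-written $known_tlds array standing in for the real IANA list:

```php
<?php
// Hypothetical follow-up check: is the last label of $domain a TLD we accept?
function has_known_tld($domain, array $known_tlds) {
  $pos = strrpos($domain, '.');
  if ($pos === false) {
    return false; // no dot, so there is no TLD to check
  }
  $tld = strtolower(substr($domain, $pos + 1));
  return in_array($tld, $known_tlds, true);
}

// Illustrative subset only; a real list would be loaded from the IANA data.
$known_tlds = array('com', 'net', 'org', 'museum');

var_dump(has_known_tld('example.com', $known_tlds));
var_dump(has_known_tld('example.museum', $known_tlds));
var_dump(has_known_tld('12.34.56.78', $known_tlds));
```

A check like this sidesteps the length limit entirely, at the cost of keeping the list current.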

If you need to modify the regex at all, I highly recommend you read this article about email address regex first for two reasons:

  1. There’s a ton of overlap between email and URL regex matching
  2. It will point out all the gotchas in your “better” regex theory that you didn’t think about

Build a slick Twitter feed on your site

A few months ago I published an article describing how to output a Twitter stream on a page using PHP, and later followed up with two more to polish the display. The article content and code examples have since been tweaked based on feedback and my own debugging.

If you haven’t already had a look, or missed a portion, here’s the full series:

  1. Display Twitter updates on your website
  2. Calculate dates and times in different timezones (translate Twitter timestamps)
  3. Parse URL’s in text, create links (automatically link URL’s in stream)
  4. Download and store your Twitter posts in a database

If you have any comments or questions, be sure to post them under the proper article.


Extract email addresses from tags

Ran into another cool hurdle today for my Fwd:Vault development. When I grab the message content to archive it in the system, first thing I do is scrub it out to ensure that (a) it displays properly, and (b) there are no misbehaving characters. I grab both plain text and HTML email formats (if present), so the scrubbing process is a little different in each case. For the plain text, I take some extra steps to ensure there is no HTML whatsoever. Naturally, at one point this involves a call to PHP’s ultra-useful strip_tags() function.

However, in the course of testing today, I realized that when a message is forwarded, the forward header will sometimes wrap the email address in angle brackets, and the address then gets stripped when I process the message. Allow me to demonstrate. Here’s the body of an example message that someone might send to Fwd:Vault for safekeeping…

---------- Forwarded message ----------
From: "Office Flirt" <flirt@example.com>
Date: Wed, Jan 14, 2009 at 10:14 AM
Subject: Delete those images
To: you@example.com

My boss is sniffing around. I want you to delete those pictures I sent you right away.

Signed,
Office Flirt

Obviously you’re tucking this one away in Fwd:Vault to provide a little CYA-insurance when the boss calls you into his office. Good call. Now, before today, this message would come out of the scrubbing process looking like this:

---------- Forwarded message ----------
From: Office Flirt
Date: Wed, Jan 14, 2009 at 10:14 AM
Subject:
To: you@example.com
...

Look at the From: line. The email address is gone. You don’t have any other copies of it, so your boss doesn’t believe your story, and you get the blame. You’re forced to attend one of those god-awful sexual harassment classes. Fail.

So, what happened? Remember, you are looking at the body of a message in plain text. That “Forwarded message” block at the beginning is just part of the body text. So when the text was scrubbed by strip_tags(), the function picked up <flirt@example.com> as just another tag, which it dutifully removed.
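You can reproduce the problem in a couple of lines. This is just a quick illustration using the From: line of the example message; the variable names are mine.

```php
<?php
// The From: line as it appears in the plain-text body
$header = 'From: "Office Flirt" <flirt@example.com>';

// strip_tags() treats <flirt@example.com> as markup and removes it
$scrubbed = strip_tags($header);

var_dump($scrubbed);
var_dump(strpos($scrubbed, 'flirt@example.com')); // false once the address is gone
```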

To handle this situation, I came up with a piece of code that will look for email addresses in “tagged format” — i.e. surrounded by < and > — and remove the surrounding symbols, leaving us with harmless text.

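// Illustrative input: an address in "tagged format" inside ordinary text
$test = 'some surrounding <flirt@example.com> text';
// Drop the angle brackets, padding the address with spaces
$test = preg_replace( "/(\<)(.+@[^\(\);:,<>]+\.[a-zA-Z]{2,4})(\>)/",
                      ' $2 ',
                      $test);
// Collapse the extra whitespace down to single spaces
$test = preg_replace('/[\s]+/', ' ', $test);
echo $test;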

Let’s break this down. First, we have a regular expression that identifies email addresses: .+@[^\(\);:,<>]+\.[a-zA-Z]{2,4}. This is the same expression set in the example on the Quanetic Software Regular Expression Tester (an excellent tool). We surround that in parentheses to isolate it as a subpattern. Then on either end of the expression, we tack on more regex voodoo to look for tag syntax: (\<) and (\>). These also get parentheses to identify them as subpatterns. Once it’s finished, we have an expression that will only match addresses wrapped in tagging structure.

The second argument in preg_replace() is the replacement, or what we should replace any matches with. In this case, we’ve isolated the address from the tags using subpatterns. So all we need to do is make a single call to the proper reference, which is $2, because it’s the second set of parentheses in the expression. Confused? You can learn about subpatterns on the PHP manual page for preg_replace().
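If the numbering still feels abstract, it can help to dump the subpatterns with preg_match() before trusting them in a replacement. Here is a small sketch using the same expression; the input string is made up for illustration and deliberately has no space between the name and the address.

```php
<?php
$pattern = '/(\<)(.+@[^\(\);:,<>]+\.[a-zA-Z]{2,4})(\>)/';
$line    = 'From: "Office Flirt"<flirt@example.com>';

// $m[0] is the whole match; $m[1], $m[2], $m[3] are the three subpatterns
preg_match($pattern, $line, $m);
print_r($m);

// The same references are available to the replacement as $1, $2 and $3
$untagged = preg_replace($pattern, ' $2 ', $line);
var_dump($untagged);
```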

Note the spaces around the $2 in the second argument. Sometimes the address will not have any spaces between the person’s name and the actual address. This could lead to the address being combined with the name which, in the case of Fwd:Vault, would screw up our search indexing. So we add spaces during the replace, then make a second call to preg_replace() to collapse the extra spaces: $test = preg_replace('/[\s]+/', ' ', $test);.

Legal Disclaimer: In case you do end up using Fwd:Vault when it launches, I’m fairly certain the service wouldn’t be liable in this silly hypothetical. Just make sure you read the terms before you sign up if you play the field at your office. Sorry to everyone going “duh” right now; it’s a sue-happy world.

Update: When I went to implement this change today, I discovered that the code was catching newlines (\n or \r) in the crossfire. The culprit was the second call to preg_replace(): the “\s” character class includes not only spaces but line terminators as well. Oops. The revised version looks like this:

$body_text = preg_replace('/[ ]{2,}/', ' ', $body_text);
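To see the difference side by side, here is a quick comparison of the two patterns on a two-line string. It is a sketch that assumes the first attempt replaced each whitespace run with a single space; the sample text is made up.

```php
<?php
$body_text = "From: Office Flirt  flirt@example.com \nDate: Wed, Jan 14, 2009";

// First attempt: \s also matches \n and \r, so the line break is swallowed
$first = preg_replace('/[\s]+/', ' ', $body_text);

// Revised: only runs of two or more literal spaces are collapsed
$revised = preg_replace('/[ ]{2,}/', ' ', $body_text);

var_dump(strpos($first, "\n"));   // false: the newline is gone
var_dump(strpos($revised, "\n")); // an integer offset: the newline survives
```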

Parse URL’s in text, create links

I’m absolutely in love with the status update stream I’ve put together for Fwd:Vault (follow link for example). However, in the process, I’ve discovered a huge drawback to the Twitter messaging system: it does not store links. The Twitter site itself will identify URLs in messages and convert them into clickable links for you automatically. But the magic ends at Twitter’s borders; anyone who wants to do the same on their site is on their own.

So I consulted the almighty Google. I found plenty of raw regex, JavaScript, and Twitter-focused discussions on the matter, but I found the offered solutions and tips lacking. I wanted to do this up right, transparently via PHP in the background. No JS required.

Finally, I found a small PHP script that accomplished what I needed. Here’s a renamed version—all code intact—that will find and convert any well-formed URL into a clickable <a> tag link.

Update: My buddy Tonk has updated the code to link up @replies and #hashtags as well. He also switched from POSIX to Perl-compatible regular expression syntax, mostly ’cause he’s a regex dork.

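function linkify( $text ) {
  // URLs: scheme://... with whitespace (or the string edge) on both sides
  $text = preg_replace( '/(?<!\S)(\w+:\/\/[^<>\s]+\w)(?!\S)/i', '<a href="$1" target="_blank">$1</a>', $text );
  // #hashtags: %23 is the URL-encoded "#", so the tag stays in the query string
  $text = preg_replace( '/(?<!\S)#(\w+\w)(?!\S)/i', '<a href="http://twitter.com/search?q=%23$1" target="_blank">#$1</a>', $text );
  // @replies: link the username, leave the "@" outside the link
  $text = preg_replace( '/(?<!\S)@(\w+\w)(?!\S)/i', '@<a href="http://twitter.com/$1" target="_blank">$1</a>', $text );
  return $text;
}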

Copy that into your code, then run your text containing unlinked URLs through it. Let’s apply it to the Twitter feed example as we left it in Step 2:

    <li><?php echo linkify($status->text) . '<br />' . $time_display; ?></li>

You can find this code at work on Fwd:Vault.
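One thing you may want to consider if the text comes from somewhere you don’t control: escape it before adding links, so that any markup in the message shows up as text instead of being rendered. The sketch below is one possible approach, not what runs on Fwd:Vault. linkify_urls() is a hypothetical stand-in that only carries the URL rule (with a lookbehind for the leading boundary), and the sample text is made up.

```php
<?php
// Hypothetical stand-in for the URL rule only, so the example runs on its own
function linkify_urls($text) {
  return preg_replace('/(?<!\S)(\w+:\/\/[^<>\s]+\w)(?!\S)/i',
                      '<a href="$1" target="_blank">$1</a>',
                      $text);
}

$tweet = 'Tags like <b> stay text: http://example.com/?a=1&b=2 ok';

// Escape first so markup in the message is neutralized, then add our own links
$safe = linkify_urls(htmlspecialchars($tweet, ENT_QUOTES, 'UTF-8'));
echo $safe;
```

Escaping first also turns the & in the query string into &amp;, which is how it should appear inside an href attribute anyway.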
