Get domain out of any URL string (yes, really)

It’s a common problem with no single right answer: extract the top domain (e.g. example.com) from a given string, which may or may not be a valid URL. I had need of such functionality recently and found answers around the web lacking. So if you ever “just wanted the domain name” out of a string, give this a shot…

<?php
function get_top_domain($url, $remove_subdomains = 'all') {
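  $host = parse_url($url, PHP_URL_HOST);
  // no parsable host (e.g. no scheme): fall back to the raw string
  if (!is_string($host) || $host === '') $host = $url;
  $host = strtolower($host);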
  switch ($remove_subdomains) {
    case 'www':
      if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
      }
      return $host;
    case 'all':
    default:
      if (substr_count($host, '.') > 1) {
        preg_match("/^.+\.([a-z0-9\.\-]+\.[a-z]{2,4})$/", $host, $host);
        if (isset($host[1])) {
          return $host[1];
        } else {
          // not a valid domain
          return false;
        }
      } else {
        return $host;
      }
    break;
  }
}
 
// some examples
var_dump(get_top_domain('http://www.validurl.example.com/directory', 'all'));
var_dump(get_top_domain('http://www.validurl.example.com/directory', 'www'));
var_dump(get_top_domain('domain-string.example.com', 'all'));
var_dump(get_top_domain('domain-string.example.com/nowfails', 'all'));
var_dump(get_top_domain('finds the domain url.example.com', 'all'));
var_dump(get_top_domain('12.34.56.78', 'all'));
?>

Most of the examples are simply proofs, but I want to draw attention to the string in example #4, 'domain-string.example.com/nowfails'. This is not a valid URL, so the call to parse_url() fails, forcing the script to use the entire original string. In turn, the path part of the string causes the regex to break, causing a complete failout (return false;).

Is there a way to account for this? Surely, but I’m not about to tap that massive keg of exceptions (e.g. just a slash, slash plus path, slash plus another domain in a human-readable string, etc.).
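
For what it’s worth, the single most common miss (a bare host followed by a path, as in example #4) can be sidestepped by giving parse_url() a scheme to work with. Here’s a minimal sketch, assuming the string starts with the host. The helper name host_from_loose_string() is made up for illustration, and it does nothing for the human-readable-sentence cases.

```php
<?php
// Hypothetical helper (not part of get_top_domain() above).
// Assumes the input either has a scheme or starts directly with the host.
function host_from_loose_string($input) {
  $input = trim($input);
  // parse_url() only reports a host when a scheme (or leading //) is present,
  // so borrow a placeholder scheme for bare strings
  if (!preg_match('#^[a-z][a-z0-9+.\-]*://#i', $input)) {
    $input = 'http://' . $input;
  }
  $host = parse_url($input, PHP_URL_HOST);
  return is_string($host) ? strtolower($host) : false;
}

var_dump(host_from_loose_string('domain-string.example.com/nowfails'));
var_dump(host_from_loose_string('http://www.validurl.example.com/directory'));
```

The returned host could then be fed to get_top_domain() in place of the raw string.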

No regex for validating URLs or email addresses is ever perfect; the “strict” RFC requirements are too damn broad. So I did what I always do: chose “what works” over “what’s technically right.” This one requires 2-4 letters for the top-level domain (TLD), so it doesn’t allow for longer TLDs such as .museum, and it doesn’t check whether the provided TLD is actually valid. If you need to do further verification, that’s on you. Here’s the current full list of valid TLDs provided by the IANA.

If you need to modify the regex at all, I highly recommend you read this article about email address regex first for two reasons:

  1. There’s a ton of overlap between email and URL regex matching
  2. It will point out all the gotcha’s in your “better” regex theory that you didn’t think about

Get HTTP status code of cURL call in PHP

With all the fancy cURL-based API’s out there these days (Facebook and Twitter immediately come to mind), using cURL to directly access and manipulate data is becoming quite common. However like all programming, there’s always the chance for an error to occur, and thus these calls must be immediately followed by error checks to ensure everything went as planned.

Most decent APIs will return their own custom errors when an internal problem occurs, but that does not account for issues dealing directly with the connection. So before your application goes looking for API-based errors, it should first check the returned HTTP status code to ensure the connection itself went well.

For example, Twitter-specific error messages are always paired with a “400 Bad Request” status. The message is of course helpful, but it’s far easier (as you’ll see) to find the status code from the response headers and then code for the exceptions as necessary, using the error text for logging and future debugging.

Anyway, the HTTP status code, also called the “response code,” is a number that corresponds with the result of an HTTP request. Your browser gets these codes every time you access a webpage, and cURL calls are no different. The following codes are the most common (excerpted from the Wikipedia entry on the subject)…

  • 200 OK
    Standard response for successful HTTP requests. The actual response will depend on the request method used. In a GET request, the response will contain an entity corresponding to the requested resource. In a POST request the response will contain an entity describing or containing the result of the action.
  • 301 Moved Permanently
    This and all future requests should be directed to the given URI.
  • 400 Bad Request
    The request contains bad syntax or cannot be fulfilled.
  • 401 Unauthorized
    Similar to 403 Forbidden, but specifically for use when authentication is possible but has failed or not yet been provided. The response must include a WWW-Authenticate header field containing a challenge applicable to the requested resource.
  • 403 Forbidden
    The request was a legal request, but the server is refusing to respond to it. Unlike a 401 Unauthorized response, authenticating will make no difference.
  • 404 Not Found
    The requested resource could not be found but may be available again in the future. Subsequent requests by the client are permissible.
  • 500 Internal Server Error
    A generic error message, given when no more specific message is suitable.

So now that we know what we’re looking for, how do we go about actually getting it? Fortunately, PHP’s cURL support makes performing these checks pretty easy; the documentation just doesn’t make the process plain. We need a function called curl_getinfo(). Called with only the handle, it returns an array full of useful information, but we only need the status number. We can set the second argument so that we only get this number back, like so…

// must set $url first. Duh...
$http = curl_init($url);
// do your curl thing here
$result = curl_exec($http);
$http_status = curl_getinfo($http, CURLINFO_HTTP_CODE);
echo $http_status;

curl_getinfo() returns data for the last cURL request made on the handle, so you must execute the cURL call first, then call curl_getinfo(). The key is the second argument: the predefined constant CURLINFO_HTTP_CODE tells the function to forgo all the extra data and return just the HTTP code as an integer.

Echoing out the variable $http_status gets us the status code number, typically one of those outlined above.
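
Once you have the number, you’ll usually want to branch on it rather than echo it. Here’s one possible way to bucket the codes, offered as a sketch: classify_http_status() and its bucket names are made up for this example, not part of cURL. The one cURL-specific detail is that the reported code is 0 when no HTTP response came back at all.

```php
<?php
// Hypothetical helper: bucket a status code so calling code can branch on it.
function classify_http_status($code) {
  $code = (int) $code;
  if ($code === 0) return 'no_response';   // cURL reports 0 when no HTTP response arrived
  if ($code >= 200 && $code < 300) return 'success';
  if ($code >= 300 && $code < 400) return 'redirect';
  if ($code >= 400 && $code < 500) return 'client_error';
  if ($code >= 500 && $code < 600) return 'server_error';
  return 'unknown';
}

// Usage sketch (needs the cURL extension and a reachable $url):
// $http = curl_init($url);
// curl_setopt($http, CURLOPT_RETURNTRANSFER, true);
// $result = curl_exec($http);
// $http_status = curl_getinfo($http, CURLINFO_HTTP_CODE);
// if (classify_http_status($http_status) !== 'success') {
//   // log $http_status and skip the API-level error checks
// }
```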


Build a slick Twitter feed on your site

A few months ago I published an article describing how to output a Twitter stream on a page using PHP, and later followed up with two more to polish the display. The article content and code examples have since been tweaked based on feedback and my own debugging.

If you haven’t already had a look, or missed a portion, here’s the full series:

  1. Display Twitter updates on your website
  2. Calculate dates and times in different timezones (translate Twitter timestamps)
  3. Parse URL’s in text, create links (automatically link URL’s in stream)
  4. Download and store your Twitter posts in a database

If you have any comments or questions, be sure to post them under the proper article.


Translating PHP error constants

I wanted to log all the errors thrown out by Fwd:Vault processes to ensure that any bugs I don’t catch myself bubble to the top very quickly. To get started, I replaced PHP’s default error handling with a custom error handler function, which simply logs the error in a MySQL table before passing it along to the normal internal error handler.

Later, I’m going to add non-error notices to the mix, and set up an RSS feed to output these errors, allowing me real-time updates on overall system health.

If the error handling stuff sounds like Greek, read up before going further:

When PHP throws any kind of error, the error is assigned an error level, which can be expressed in two ways: an integer or a predefined constant. The constant represents the integer, making the two completely interchangeable. However if you build a custom error handler, you are only given the integer, which doesn’t automagically translate back to the constant value. It’s a heckuva lot easier to recognize E_USER_ERROR instead of the integer 256, so I want to store that error constant for reading purposes. If you find yourself looking at error numbers, and want the matching constant string, use this block of code:

switch ($errno) {
  case 1:     $e_type = 'E_ERROR'; break;
  case 2:     $e_type = 'E_WARNING'; break;
  case 4:     $e_type = 'E_PARSE'; break;
  case 8:     $e_type = 'E_NOTICE'; break;
  case 16:    $e_type = 'E_CORE_ERROR'; break;
  case 32:    $e_type = 'E_CORE_WARNING'; break;
  case 64:    $e_type = 'E_COMPILE_ERROR'; break;
  case 128:   $e_type = 'E_COMPILE_WARNING'; break;
  case 256:   $e_type = 'E_USER_ERROR'; break;
  case 512:   $e_type = 'E_USER_WARNING'; break;
  case 1024:  $e_type = 'E_USER_NOTICE'; break;
  case 2048:  $e_type = 'E_STRICT'; break;
  case 4096:  $e_type = 'E_RECOVERABLE_ERROR'; break;
  case 8192:  $e_type = 'E_DEPRECATED'; break;
  case 16384: $e_type = 'E_USER_DEPRECATED'; break;
  case 30719: $e_type = 'E_ALL'; break;
  default:    $e_type = 'E_UNKNOWN'; break;
}

This will give you a string in $e_type matching the proper constants. The switch block accounts for all the current PHP constants as of this posting, plus a catch-all E_UNKNOWN in case you’re doing something really weird.
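
If you’d rather not maintain the list by hand, a lookup against PHP’s own constant table is one alternative. This is only a sketch: it assumes the E_* error constants are listed under the Core group of get_defined_constants(true), and error_constant_name() is a name I made up for the example.

```php
<?php
// Hypothetical helper: find the E_* constant whose value equals $errno.
function error_constant_name($errno) {
  $groups = get_defined_constants(true);
  // Assumption: the error level constants live in the "Core" group
  $core = isset($groups['Core']) ? $groups['Core'] : array();
  foreach ($core as $name => $value) {
    if (strpos($name, 'E_') === 0 && $value === $errno) {
      return $name;
    }
  }
  return 'E_UNKNOWN'; // same catch-all as the switch block
}

echo error_constant_name(256), "\n";
echo error_constant_name(3), "\n";
```

The trade-off is a loop on every call versus a switch that has to be updated whenever PHP adds or renumbers a level.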

Now let’s add some perspective to this code block. Here’s a sample custom error handler that grabs the constant string for logging purposes and outputs the error to the screen. The internal handler is bypassed in this example, since we don’t need it to do anything (note how the function kills page processing when a fatal error occurs). We’ll also set this custom function as the default error handler.

function custom_error_handler($errno, $errstr, $errfile, $errline) {
  $exit_now = false;
  switch ($errno) {
    case 1:     $e_type = 'E_ERROR'; $exit_now = true; break;
    case 2:     $e_type = 'E_WARNING'; break;
    case 4:     $e_type = 'E_PARSE'; break;
    case 8:     $e_type = 'E_NOTICE'; break;
    case 16:    $e_type = 'E_CORE_ERROR'; $exit_now = true; break;
    case 32:    $e_type = 'E_CORE_WARNING'; break;
    case 64:    $e_type = 'E_COMPILE_ERROR'; $exit_now = true; break;
    case 128:   $e_type = 'E_COMPILE_WARNING'; break;
    case 256:   $e_type = 'E_USER_ERROR'; $exit_now = true; break;
    case 512:   $e_type = 'E_USER_WARNING'; break;
    case 1024:  $e_type = 'E_USER_NOTICE'; break;
    case 2048:  $e_type = 'E_STRICT'; break;
    case 4096:  $e_type = 'E_RECOVERABLE_ERROR'; $exit_now = true; break;
    case 8192:  $e_type = 'E_DEPRECATED'; break;
    case 16384: $e_type = 'E_USER_DEPRECATED'; break;
    case 30719: $e_type = 'E_ALL'; $exit_now = true; break;
    default:    $e_type = 'E_UNKNOWN'; break;
  }
  echo "<strong>$e_type</strong> &mdash; $errstr on line $errline in file $errfile<br />\n";
  //send_to_log("$e_type - $errstr on line $errline in file $errfile");
  if ($exit_now) exit(1);
  // Don't execute PHP internal error handler
  return true;
}
set_error_handler('custom_error_handler');

At this point you have all the information necessary to do whatever you want with the error. A future post will expand on that send_to_log() statement; for now it just serves as a placeholder.


Circumvent PHP errors with define_once()

Core PHP does not include a define_once() function to complement functions like require_once() and include_once(), which is pretty silly in my opinion. While I am generally not a fan of using *_once statements due to the performance penalty (and incurred laziness), define_once is the exception. There are ways to look for a loaded/missing file, but a define is not a define until you define it, so you really have no choice.

So in situations where you have to blindly load defines — I do it to build language defines in a cascading templating system — use this function to achieve the proper results:

function define_once($define, $value) {
  if (!defined((string)$define)) {
    define($define, $value);
    return true;
  }
  return false;
}
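
To show the behavior, here’s a short usage sketch. The constant names and values are invented for the example, and the function_exists() guard is only there so the snippet runs on its own.

```php
<?php
if (!function_exists('define_once')) {
  // Same function as above, repeated so this snippet is self-contained
  function define_once($define, $value) {
    if (!defined((string)$define)) {
      define($define, $value);
      return true;
    }
    return false;
  }
}

// LANG_GREETING is a made-up constant name
var_dump(define_once('LANG_GREETING', 'Hello'));    // first definition is stored
var_dump(define_once('LANG_GREETING', 'Bonjour'));  // already defined, so ignored
echo LANG_GREETING, "\n";
```

The first value wins, which is what you want in a cascade where the most specific template loads its defines first.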

Find the second (or third, or fourth) occurrence in a string

PHP includes some handy functions to find the first or last occurrence of a given string token in a string: strpos and strrpos. However, these functions are limited to just the first or last occurrence; what if I want to know the position of the second token, or the third? These problems usually result in some serious coding acrobatics.

Well no need for code-jitsu anymore. Based almost completely on a post I found at another blog — which is now down, how’s that for timing? — here are two functions which allow you to search for any occurrence of a specific token in a string…

/**
 * Find position of Nth $occurrence of $needle in $haystack
 * Starts from the beginning of the string
**/
function strpos_offset($needle, $haystack, $occurrence) {
  // explode the haystack
  $arr = explode($needle, $haystack);
  // check the occurrence is not out of bounds
  if ($occurrence < 1 || $occurrence > count($arr) - 1) {
    return false;
  }
  // everything before the Nth needle, glued back together, is as long as its offset
  return strlen(implode($needle, array_slice($arr, 0, $occurrence)));
}
 
/**
 * Find position of Nth $occurrence of $needle in $haystack
 * Starts from the end of the string
**/
function strrpos_offset($needle, $haystack, $occurrence) {
  // explode the haystack
  $arr = array_reverse(explode($needle, $haystack));
  // check the occurrence is not out of bounds
  if ($occurrence < 1 || $occurrence > count($arr) - 1) {
    return false;
  }
  // length of everything that follows the Nth needle from the end
  $inverted = strlen(implode($needle, array_slice($arr, 0, $occurrence)));
  // subtract the needle's own length so multi-character needles work too
  return strlen($haystack) - strlen($needle) - $inverted;
}
 
// look for second occurrence of letter 'a' from the start of string
echo strpos_offset('a', 'abracadabra', 2);
// returns 3
 
// look for second occurrence of letter 'a' from the end of string
echo strrpos_offset('a', 'abracadabra', 2);
// returns 7

In terms of use, we’ve essentially added an extra argument to strpos and strrpos that specifies which occurrence you’re looking for. In other words, you can make both of these functions work like the PHP standards by setting the third $occurrence variable to 1.
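
For comparison, the same forward search can be done without exploding the string, by stepping strpos() along with its offset argument. This is a sketch of that different technique, not the code from the original post; nth_strpos() is a made-up name and it keeps strpos’s own argument order. One difference to be aware of: it counts overlapping matches, which the explode() version does not.

```php
<?php
// Hypothetical alternative: walk the string with strpos() and its offset argument.
function nth_strpos($haystack, $needle, $n) {
  if ($n < 1 || $needle === '') {
    return false;
  }
  $pos = -1;
  for ($i = 0; $i < $n; $i++) {
    // resume the search one character past the previous match
    $pos = strpos($haystack, $needle, $pos + 1);
    if ($pos === false) {
      return false; // fewer than $n occurrences
    }
  }
  return $pos;
}

echo nth_strpos('abracadabra', 'a', 2), "\n";
```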


Extract email addresses from tags

Ran into another cool hurdle today for my Fwd:Vault development. When I grab the message content to archive it in the system, first thing I do is scrub it out to ensure that (a) it displays properly, and (b) there are no misbehaving characters. I grab both plain text and HTML email formats (if present), so the scrubbing process is a little different in each case. For the plain text, I take some extra steps to ensure there is no HTML whatsoever. Naturally, at one point this involves a call to PHP’s ultra-useful strip_tags() function.

However, in the course of testing today, I realized that when a message is forwarded, sometimes the forward header will wrap the email address in angle brackets, and the address gets stripped when I process the message. Allow me to demonstrate. Here’s the body of an example message that someone might send to Fwd:Vault for safe keeping…

---------- Forwarded message ----------
From: "Office Flirt" <flirt@example.com>
Date: Wed, Jan 14, 2009 at 10:14 AM
Subject: Delete those images
To: you@example.com

My boss is sniffing around. I want you to delete those pictures I sent you right away.

Signed,
Office Flirt

Obviously you’re tucking this one away in Fwd:Vault to provide a little CYA-insurance when the boss calls you into his office. Good call. Now, before today, this message would come out of the scrubbing process looking like this:

---------- Forwarded message ----------
From: Office Flirt
Date: Wed, Jan 14, 2009 at 10:14 AM
Subject:
To: you@example.com
...

Look at the From line. The email address is gone. You don’t have any other copies of it, so your boss doesn’t believe your story, and you get the blame. You’re forced to attend one of those god-awful sexual harassment classes. Fail.

So, what happened? Remember, you are looking at the body of a message in plain text. That “Forwarded message” block at the beginning is just part of the body text. So when the text was scrubbed by strip_tags(), the function saw <flirt@example.com> as just another tag, which it dutifully removed.

To handle this situation, I came up with a piece of code that will look for email addresses in “tagged format” — i.e. surrounded by < and > — and remove the surrounding symbols, leaving us with harmless text.

$test = 'some <flirt@example.com> surrounding text';
$test = preg_replace( "/(\<)(.+@[^\(\);:,<>]+\.[a-zA-Z]{2,4})(\>)/",
                      ' $2 ',
                      $test);
$test = preg_replace('/[\s]+/', ' ', $test);
echo $test;

Let’s break this down. First, we have a regular expression that identifies email addresses: .+@[^\(\);:,<>]+\.[a-zA-Z]{2,4}. This is the same expression set in the example on the Quanetic Software Regular Expression Tester (an excellent tool). We surround that in parentheses to isolate it as a subpattern. Then on either end of the expression, we tack on more regex voodoo to look for tag syntax: (\<) and (\>). These also get parentheses to identify them as subpatterns. Once it’s finished, we have an expression that will only match addresses wrapped in tagging structure.

The second argument in preg_replace() is the replacement, or what we should replace any matches with. In this case, we’ve isolated the address from the tags using subpatterns. So all we need to do is make a single call to the proper reference, which is $2, because it’s the second set of parentheses in the expression. Confused? You can learn about subpatterns on the PHP manual page for preg_replace().

Note the spaces around the $2 in the second argument. Sometimes the address will not have any spaces between the person’s name and the actual address. This could lead to the address being combined with the name which, in the case of Fwd:Vault, would screw up our search indexing. So we add spaces during the replace, then make a second call to preg_replace() to collapse the extra spaces: $test = preg_replace('/[\s]+/', ' ', $test);.

Legal Disclaimer: In case you do end up using Fwd:Vault when it launches, I’m fairly certain the service wouldn’t be liable in this silly hypothetical. Just make sure you read the terms before you sign up if you play the field at your office. Sorry to everyone going “duh” right now; it’s a sue-happy world.

Update: When I went to implement this change today, I discovered that the code was catching newlines (\n or \r) in the crossfire. It was actually due to the second call to preg_replace(): the “\s” character class includes not only spaces but line terminators as well. Oops. The revised version looks like this:

$body_text = preg_replace('/[ ]{2,}/', ' ', $body_text);
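
Putting both steps together, here’s a sketch of the whole scrub as one function. Treat it as an illustration rather than the Fwd:Vault code: untag_addresses() is a made-up name, and I’ve swapped in a tighter address pattern (no whitespace, angle brackets or extra @ inside the match) so a single line with two tagged addresses isn’t swallowed as one.

```php
<?php
// Hypothetical wrapper around the two preg_replace() calls described above.
function untag_addresses($body_text) {
  // Step 1: unwrap <address> into plain text, padded with spaces
  $body_text = preg_replace(
    '/<([^<>\s@]+@[^<>\s@]+\.[a-zA-Z]{2,4})>/',
    ' $1 ',
    $body_text
  );
  // Step 2: collapse runs of spaces only, leaving newlines alone
  return preg_replace('/[ ]{2,}/', ' ', $body_text);
}

$sample = "From: \"Office Flirt\" <flirt@example.com>\nDate: Wed, Jan 14, 2009 at 10:14 AM";
echo strip_tags(untag_addresses($sample)), "\n";
```

Because the brackets are gone before strip_tags() runs, the address survives the scrub and the line break between the header lines stays put.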

Parse URL’s in text, create links

I’m absolutely in love with the status update stream I’ve put together for Fwd:Vault (follow link for example). However in the process, I’ve discovered a huge drawback to the Twitter messaging system: it does not store links. The Twitter site itself will identify URL’s in messages and convert them into clickable links for you automatically. But the magic ends at Twitter’s borders; anyone who wants to do the same on their site is on their own.

So I consulted the almighty Google. I found plenty of raw regex, javascript, and Twitter-focused discussions on the matter, but I found the offered solutions and tips lacking. I wanted to do this up right, transparently via PHP in the background. No JS required.

Finally, I found a small PHP script that accomplished what I needed. Here’s a renamed version—all code intact—that will find and convert any well-formed URL into a clickable <a> tag link.

Update: My buddy Tonk has updated the code to link up @replies and #hashtags as well. He also switched from POSIX to Perl regular expressions syntax, mostly cause he’s a regex dork.

function linkify( $text ) {
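  // (?<!\S) and (?!\S): only match tokens bounded by whitespace or the string edges
  $text = preg_replace( '/(?<!\S)(\w+:\/\/[^<>\s]+\w)(?!\S)/i', '<a href="$1" target="_blank">$1</a>', $text );
  // %23 is the URL-encoded #, so the hashtag reaches the search query instead of becoming a fragment
  $text = preg_replace( '/(?<!\S)#(\w+\w)(?!\S)/i', '<a href="http://twitter.com/search?q=%23$1" target="_blank">#$1</a>', $text );
  $text = preg_replace( '/(?<!\S)@(\w+\w)(?!\S)/i', '@<a href="http://twitter.com/$1" target="_blank">$1</a>', $text );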
  return $text;
}

Copy that into your code, then run your text containing unlinked URL’s through it. Let’s apply it to the Twitter feed example as we left it in Step 2:

    <li><?php echo linkify($status->text) . '<br />' . $time_display; ?></li>

You can find this code at work on Fwd:Vault.
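
To see what the function does to a message, here’s a quick standalone run. It’s a sketch under a couple of assumptions: the copy is named linkify_demo() so it won’t collide with the real function, it uses lookbehind assertions to require whitespace before each token, and the sample status text is invented.

```php
<?php
// Hypothetical standalone copy of the three replacements, for experimenting.
function linkify_demo($text) {
  // URLs: scheme://something, bounded by whitespace or the string edges
  $text = preg_replace('/(?<!\S)(\w+:\/\/[^<>\s]+\w)(?!\S)/i', '<a href="$1" target="_blank">$1</a>', $text);
  // #hashtags
  $text = preg_replace('/(?<!\S)#(\w+\w)(?!\S)/i', '<a href="http://twitter.com/search?q=%23$1" target="_blank">#$1</a>', $text);
  // @replies
  $text = preg_replace('/(?<!\S)@(\w+\w)(?!\S)/i', '@<a href="http://twitter.com/$1" target="_blank">$1</a>', $text);
  return $text;
}

echo linkify_demo('Launch notes at http://example.com/news via @fwdvault #php'), "\n";
// An email address is left alone because its @ is not preceded by whitespace
echo linkify_demo('mail user@example.com now'), "\n";
```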

Display Twitter updates on your website

Update: I’ve added a new chunk of code that will download and store your Twitter posts in a database, allowing you to do whatever the heck you want with them. After you’ve finished reading this, be sure to check that out as well.

I am not a fan of social networking or so-called lifestreaming. I think it’s a BS excuse to fiddle on your computer more. Instead of telling everyone where you are and what you’re doing, go out and meet some friends for a drink.

However I did find a practical use for Twitter in a recent issue of php|architect (Twitter as a Development Tool by Sam McCallum). The article discussed using Twitter as an automated logger, where a program would make posts to a Twitter account based on system actions (i.e. log in/out, create accounts, etc.).

I decided to turn the idea around a bit and use Twitter as an activity log to chronicle my development work on a new project. Think SVN log comments without the repository. The site itself is currently a simple placeholder page, so Twitter updates make an easy way to keep a website fresh while building out the service that will eventually reside there. It also engages the users that wind up looking at the site, letting them know that it might be something of interest to them. That’s to say nothing of any SEO or attention-grabbing effects that may result from having a Twitter stream.

Given the rabidity surrounding said social networking silliness, I thought that finding a suitable plug ‘n play solution to this would be easy. Surprisingly (or perhaps unsurprisingly) many of the Twitter scripts I found were plain garbage. The following code was put together by sifting through what I found and putting the best working bits together. So if this sounds interesting, or if you were also frustrated with the plethora of crappy Twitter code, here’s how you can easily display your Twitter updates on any site using PHP.

First, grab this function…

function twitter_status($twitter_id, $hyperlinks = true) {
  $c = curl_init();
  curl_setopt($c, CURLOPT_URL, "http://twitter.com/statuses/user_timeline/$twitter_id.xml");
  curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt($c, CURLOPT_CONNECTTIMEOUT, 3);
  curl_setopt($c, CURLOPT_TIMEOUT, 5);
  $response = curl_exec($c);
  $responseInfo = curl_getinfo($c);
  curl_close($c);
  if (intval($responseInfo['http_code']) == 200) {
    if (class_exists('SimpleXMLElement')) {
      $xml = new SimpleXMLElement($response);
      return $xml;
    } else {
      return $response;
    }
  } else {
    return false;
  }
}

I’m not going to discuss the various cURL options or how the Twitter API is accessed over cURL, as that’s outside the scope of this post. If you’re lost or curious, you can read up on the cURL library, cURL in PHP, and/or the Twitter API.

As its name implies, twitter_status() will connect to Twitter and grab the timeline for the Twitter account identified by the $twitter_id. The $twitter_id is a unique number assigned to every Twitter account. You can find yours by visiting your profile page and examining the RSS link at the bottom left of the page. The URL will look like this:

http://twitter.com/statuses/user_timeline/12345678.rss

That 8-digit number at the end is your ID. Grab it and pass it as the lone argument to twitter_status(). Note that, as long as your Twitter profile is public, you do not need to pass any credentials to retrieve a user timeline. The API makes this information available to anyone, anywhere. There are more options that can be accessed through the user_timeline() function, if you’re curious.

The next step is to actually use the returned data, which comes in one of two forms: a SimpleXML object, or a raw XML document. SimpleXML is preferred because it’s a PHP object, and allows you access to all the usual object manipulation. Very easy. SimpleXML was added to PHP starting with version 5. The PHP manual has all the necessary details on SimpleXML.

The following code example assumes you’re using SimpleXML. Here I am taking the first five results and putting them in an HTML list. I’ll include a link to view the profile, as well as an error message in case Twitter is suffering from one of its famous fail-whale spasms.

<ul>
<?php
if ($twitter_xml = twitter_status('12345678')) {
  $i = 0; // number of statuses printed so far
  foreach ($twitter_xml->status as $key => $status) {
?>
  <li><?php echo $status->text; ?></li>
<?php
    ++$i;
    if ($i == 5) break;
  }
?>
  <li><a href="http://twitter.com/YOUR_PROFILE_HERE">more...</a></li>
<?php
} else {
  echo 'Sorry, Twitter seems to be unavailable at the moment...again...';
}
?>
</ul>
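
If you want to poke at the loop without hitting Twitter, you can feed SimpleXML a hand-written document. The XML below is a stand-in I made up; the only thing it takes from the code above is the shape it implies (a root element holding status elements, each with a text child).

```php
<?php
// Made-up sample data shaped like the response the loop above expects
$sample = '<statuses>'
        . '<status><text>First update</text></status>'
        . '<status><text>Second update</text></status>'
        . '<status><text>Third update</text></status>'
        . '<status><text>Fourth update</text></status>'
        . '<status><text>Fifth update</text></status>'
        . '<status><text>Sixth update</text></status>'
        . '</statuses>';

$xml = new SimpleXMLElement($sample);
$shown = array();
foreach ($xml->status as $status) {
  // cast the element to a string and escape it before it goes into HTML
  $shown[] = htmlspecialchars((string) $status->text);
  if (count($shown) === 5) break; // same five-item cap as above
}

echo "<ul>\n";
foreach ($shown as $text) {
  echo "  <li>$text</li>\n";
}
echo "</ul>\n";
```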

If you want to see this code in action, just check out the front page of Fwd:Vault, my new full-time startup. While you’re checking out the code in action, why don’t you follow along with me @fwdvault?

Get number of message parts in an email using PHP

Alright, I admit up front that this is a pretty specific problem, but hopefully some Googlers will find it useful.

I recently had need for a small side project to read e-mails. Every e-mail is split up into parts; each “part” is one separate piece of the e-mail. The plain text format, rich text or HTML formats, and attachments are all sent as parts. Problem is that there is no obvious way to quickly decipher just how many parts you have in a message. The documentation for the imap functions in PHP is also woefully inadequate. Maybe I’ll help flesh it out once this project is done.

Anyway, you can ascertain the total number of parts using the results from the function imap_fetchstructure(). The parts array in the returned object contains ALL the parts of the message, including the top level used to construct the rest of the object’s data. So, this simple call is all you need…

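$structure = imap_fetchstructure($mbox, $message_num);
// parts is only set for multipart messages; a simple message is a single part
$total_parts = isset($structure->parts) ? count($structure->parts) : 1;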
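
That count only covers the top level. Parts can themselves contain parts (a text/HTML alternative bundled next to an attachment, for instance), so if you need every leaf you have to walk the tree. Here’s a rough sketch of that walk; count_leaf_parts() is a made-up helper, and the object below is a hand-built stand-in rather than a real imap_fetchstructure() result.

```php
<?php
// Hypothetical helper: count the leaf parts of an imap_fetchstructure()-style object.
function count_leaf_parts($structure) {
  if (empty($structure->parts)) {
    return 1; // no children: this node is itself a part
  }
  $total = 0;
  foreach ($structure->parts as $part) {
    $total += count_leaf_parts($part);
  }
  return $total;
}

// Stand-in mimicking a message with a text/HTML alternative plus one attachment
$fake = (object) array(
  'parts' => array(
    (object) array('parts' => array((object) array(), (object) array())),
    (object) array(),
  ),
);
echo count_leaf_parts($fake), "\n";
```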
