Get domain out of any URL string (yes, really)

It’s a common problem with no single right answer: extract the top domain (e.g. example.com) from a given string, which may or may not be a valid URL. I had need of such functionality recently and found answers around the web lacking. So if you ever “just wanted the domain name” out of a string, give this a shot…

<?php
function get_top_domain($url, $remove_subdomains = 'all') {
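  // parse_url() returns null (or false) when it finds no host component
  $host = parse_url($url, PHP_URL_HOST);
  // fall back on the raw string, lowercased so the regex below can match it
  $host = strtolower(is_string($host) && $host != '' ? $host : $url);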
  switch ($remove_subdomains) {
    case 'www':
      if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
      }
      return $host;
    case 'all':
    default:
      if (substr_count($host, '.') > 1) {
        // capture into a separate $matches array instead of overwriting $host
        if (preg_match("/^.+\.([a-z0-9\.\-]+\.[a-z]{2,4})$/", $host, $matches)) {
          return $matches[1];
        } else {
          // not a valid domain
          return false;
        }
      } else {
        return $host;
      }
    break;
  }
}
 
// some examples
var_dump(get_top_domain('http://www.validurl.example.com/directory', 'all'));
var_dump(get_top_domain('http://www.validurl.example.com/directory', 'www'));
var_dump(get_top_domain('domain-string.example.com', 'all'));
var_dump(get_top_domain('domain-string.example.com/nowfails', 'all'));
var_dump(get_top_domain('finds the domain url.example.com', 'all'));
var_dump(get_top_domain('12.34.56.78', 'all'));
?>

Most of the examples are simply proofs, but I want to draw attention to the string in example #4, 'domain-string.example.com/nowfails'. With no scheme in front, parse_url() finds no host component (it treats the whole string as a path), forcing the script to fall back on the entire original string. In turn, the path part of the string keeps the regex from matching, so the function bails out completely (return false;).

Is there a way to account for this? Surely, but I’m not about to tap that massive keg of exceptions (e.g. just a slash, slash plus path, slash plus another domain in a human-readable string, etc.).
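
For the simplest of those cases (a bare domain followed by a path), one partial workaround is to hand parse_url() a scheme when the string has none. This is only a sketch: the function name extract_host_lenient is made up for illustration, it is not part of get_top_domain(), and it does nothing for the messier human-readable strings.

```php
<?php
// Hypothetical helper, not part of the original function: give parse_url()
// a scheme to work with when the input string has none.
function extract_host_lenient($input) {
  $input = trim($input);
  // Prepend a placeholder scheme only when the string doesn't start with one.
  if (!preg_match('#^[a-z][a-z0-9+.\-]*://#i', $input)) {
    $input = 'http://' . $input;
  }
  $host = parse_url($input, PHP_URL_HOST);
  // parse_url() gives back null or false when it can't find a host.
  return is_string($host) ? strtolower($host) : false;
}

var_dump(extract_host_lenient('domain-string.example.com/nowfails'));
// string(25) "domain-string.example.com"
```

The result could then be passed on to get_top_domain() in place of the raw string.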

No regex for validating URLs or email addresses is ever perfect; the “strict” RFC requirements are too damn broad. So I did what I always do: chose “what works” over “what’s technically right.” This one requires 2-4 letters for the top-level domain (TLD), so it doesn’t allow for the .museum TLD, and doesn’t check to see if the provided TLD is actually valid. If you need to do further verification, that’s on you. Here’s the current full list of valid TLDs provided by the IANA.
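
If you do need that further verification, a minimal sketch might look like the following. The function name has_known_tld and the three-entry list are assumptions for illustration only; a real check would load the full IANA list rather than hard-coding a few entries.

```php
<?php
// Hypothetical helper: check the last label of a host against a
// caller-supplied list of TLDs (lowercase, without the leading dot).
function has_known_tld($host, array $known_tlds) {
  $pos = strrpos($host, '.');
  if ($pos === false) {
    return false; // no dot at all, so no TLD to check
  }
  $tld = strtolower(substr($host, $pos + 1));
  return in_array($tld, $known_tlds, true);
}

// Illustrative list only, NOT the IANA list.
$tlds = array('com', 'org', 'museum');
var_dump(has_known_tld('example.museum', $tlds)); // bool(true)
var_dump(has_known_tld('12.34.56.78', $tlds));    // bool(false)
```

A list lookup like this sidesteps the 2-4 letter limit in the regex, at the cost of keeping the list up to date.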

If you need to modify the regex at all, I highly recommend you read this article about email address regex first for two reasons:

  1. There’s a ton of overlap between email and URL regex matching
  2. It will point out all the gotchas in your “better” regex theory that you didn’t think about


Fix emails dropped or blocked by Comcast

As an email-based backup service, Fwd:Vault ran into spam filters pretty quickly. Most of this can be mitigated with proper server configuration and getting records in the right places (e.g. abuse.net). From there it’s simply a matter of reminding users to check the spam folder when things are missing.

However, through the tribulations of one of my testers, I found out that Comcast goes the extra mile for users of their comcast.net webmail. Unlike most setups, where spam is simply redirected to a spam-specific folder, Comcast will delete the message outright, without issuing any kind of notice to the sender or recipient.

Truly, above and beyond (belief).

Of all the lousy IT practices I’ve seen over the years, this one takes the cake. No spam filter is perfect, so it’s guaranteed that they are dropping legitimate emails (case in point: I’m losing Fwd:Vault account emails). Plus it appears they default to a “highly suspicious” mode with newer systems, as fwdvault.com, my IP address, and my DNS records are completely fresh and unblemished.

Finally, the sheer size of their operation means that the chance of getting hold of anyone to actually fix the problem when it happens to you is virtually zero. I’d go so far as to say that they can get away with this nonsense precisely because they are a large ISP. As a former “your company IT guy,” I can imagine getting at least an earful, and at worst a pink slip, if I were caught doing this.

Despite my astonishment, I couldn’t deny reality. Through my logs I watched Fwd:Vault’s mail server find their systems, connect, and deliver the message and get a 250 response code (i.e. all good). Then over in my comcast.net inbox I’d get exactly nada, ditto for the spam folder. Since the actual delivery had no technical issue, I had zero clue as to the cause of the problem. I wasn’t on any blacklists, the IP was static, and my DNS records were in good order, including a reverse DNS record with my hosting service.

Fortunately, it seems that someone in the trenches at Comcast is fighting the good fight, as I took two long-shot attempts today and it seems one of them paid off. Here’s what I did; hopefully it works for you.

1. Use the feedback form at comcastsupport.com
I tried to retrace my steps on how I found this one, but their sites are so damn convoluted I kept going in circles. However, I know I started from inside the webmail interface, aka their “SmartZone”.

(See kids? That’s what we call irony. Can you say, “irony?”)

Whatever, here’s the link. You don’t need to log in to use the form:

http://www.comcastsupport.com/forms/net/sccfeedback.asp

I selected Spam or Junk Mail in the checkboxes and wrote something to the effect of:

I am not receiving mail from example.com in my Comcast email. I own and operate the mail server for this domain and have confirmed through my logs that the message is delivered properly (response code 250) to Comcast MX servers.

My tests delivered via the server mx.comcast.net (IP 00.00.00.00). It’s been over 24 hours and I have not received a bounce, nor is anything showing up in my inbox or spam folder.

As I have nothing else to go on, I am looking for help from your end.

I did not receive any reply. However, I also took another step…

2. Use their RBL Removal Form
This should only apply if your mail server has actually been blocked by Comcast, in which case you would likely see an error code of 550 in your logs. If your server picks up the full response from Comcast, you may also get additional helpful information as outlined in their list of custom mail delivery error codes.

None of this applied to me, as the connection and delivery went off without a hitch. Still, I figured it was worth a shot; a bureaucracy this big is bound to have systems running into one another.
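
For reference, the two reply codes mentioned so far (250 and 550) fall into different SMTP classes, and the first digit tells you which: 2xx means the receiving server accepted the message, 4xx is a temporary failure, and 5xx is a permanent rejection. A small sketch of that mapping follows; the helper name classify_smtp_reply is hypothetical, and how you pull the code out of your own mail logs depends entirely on your server.

```php
<?php
// Hypothetical helper: map an SMTP reply code to its general class
// based on the leading digit.
function classify_smtp_reply($code) {
  $code = (int) $code;
  if ($code >= 200 && $code < 300) return 'accepted';
  if ($code >= 400 && $code < 500) return 'temporary failure';
  if ($code >= 500 && $code < 600) return 'permanent failure';
  return 'other';
}

var_dump(classify_smtp_reply(250)); // string(8) "accepted"
var_dump(classify_smtp_reply(550)); // string(17) "permanent failure"
```

Note that "accepted" only means the receiving server took responsibility for the message. As my own case shows, it says nothing about whether the message ever reaches the inbox.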

I sent in a request to be removed from their RBL by way of this form:

http://www.comcastsupport.com/Forms/NET/blockedprovider.asp

Most of the information will depend on your setup. However, just in case, I did check the boxes for “Implemented technology to filter or prevent transmission of spam” and “Changed the rDNS records to reflect a consistent and non-dynamic setting”. I included text similar to what I outlined earlier in the Issue Description box.

I saw emails coming through less than 30 minutes after sending this message. However, I sent the feedback first, followed by a brief online chat with their support, who directed me to the RBL form. All told it was at least an hour between my first step and the delivered message.

Update: I received this message back in response to my RBL request…

Thank you for contacting Comcast Customer Security Assurance. We have received and reviewed your RBL removal request.

Below each IP address you submitted in your request, we have included the result of our research. Please do not reply to this message.

[IP address(es)]

We have received your request for removal from our inbound blocklist. After investigating the issue, we have found that the IP you provided for removal is currently not on our blocklist.

We need the IP address currently blocked to further investigate this issue. The IP address is a number separated by decimals and is located in an error code starting with “550” in the returned email from Comcast. You can learn more about how to identify a blocked IP by visiting our Frequently Asked Question page at:
http://www.comcast.net/help/faq/index.jsp?faq=SecurityMail_Policy18667

Please verify the IP(s) and resubmit your request to http://www.comcastsupport.com/rbl

So it looks like the RBL request didn’t do anything. Unless it did, and some numb-nut at Comcast was covering for their idiotic policies.

My gut tells me that I caught a particularly helpful support person manning the feedback desk who was able to punch the few keys it took to rectify the problem. If that’s the case, thanks for the help, and I hope the rest of you get to run into him/her as well. I sent the message around 2:00 pm on a Monday.

You can find more helpful information, including a link to the Blacklist Removal Request Form, on the Comcast Postmaster Site.

Best advice I can give: encourage your users to switch to Gmail. :)


Opera hanging on page load on your site? Check for missing files.

I recently discovered an issue with the mailing list signup on the Fwd:Vault placeholder site. [Aside: It was perfect when I set it up, I swear there are gremlins in my code sometimes. If you tried to sign up and were unable to, I apologize, try it one more time.] After fixing the issue, I naturally tested it in every browser, and Opera was simply hanging at the page load. The page would visually come up fully, but the Javascript effects wouldn’t fire, making the signup form unresponsive.

To start figuring it out, I enabled the Progress Bar to check out when/where the problem occurred. The bar contains a bunch of useful stats while the page is loading. To see the bar yourself:

Shift-F12 > "Toolbars" tab > choose "Pop-up at bottom" from dropdown.

Here’s a screenshot of the menu:

[Screenshot: Opera appearance menu]

Then I refreshed the pageload and saw this…

[Screenshot: Opera page load hang]

Note the “Elements” load is off by one, yet the request is listed as “completed.” Keep in mind that the browser is still acting like it’s loading at this point: hourglass cursor, and I couldn’t interact with the page. To make matters worse, you can’t click through to see any further details on any of these stats, which is really dumb since this summary information is coming from somewhere.

Opera’s Developer Tools, Error Console, and Java Console interfaces showed nothing. I probed the rest of Opera’s menus and output options, and couldn’t find anything that would help remedy the situation. I could be wrong of course, so if you have more Opera experience, please let me know where to look.

Instead, I used Firebug to discover a missing Javascript file, as the title suggests. I had a standard <script src=[...]></script> block, but the target file wasn’t in place. Now every other browser had handled this situation just fine: they timed out on the missing file and moved on. Opera is apparently more insistent, which wouldn’t be a problem if they provided the information necessary to diagnose the situation or manually override the hanging pageload.
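
If you’d rather catch this before a browser does, a server-side check is one option. The sketch below is not what I used (that was Firebug). It assumes the function name find_missing_scripts, assumes every local src path resolves against a single document root, and skips remote URLs entirely.

```php
<?php
// Hypothetical helper: list local <script src> targets that don't exist on disk.
// $doc_root is an assumed document root; relative paths are treated as
// relative to it, which is a simplification.
function find_missing_scripts($html, $doc_root) {
  $dom = new DOMDocument();
  // Suppress warnings from imperfect real-world markup.
  @$dom->loadHTML($html);
  $missing = array();
  foreach ($dom->getElementsByTagName('script') as $script) {
    $src = $script->getAttribute('src');
    // Skip inline scripts and remote files (http://, https://, //host/...).
    if ($src === '' || preg_match('#^(?:[a-z][a-z0-9+.\-]*:)?//#i', $src)) {
      continue;
    }
    // Drop any query string or fragment before checking the filesystem.
    $path = preg_replace('/[?#].*$/', '', $src);
    if (!is_file(rtrim($doc_root, '/') . '/' . ltrim($path, '/'))) {
      $missing[] = $src;
    }
  }
  return $missing;
}

// Example call; the path and document root here are placeholders.
$html = '<html><head><script src="/js/signup.js"></script></head><body></body></html>';
print_r(find_missing_scripts($html, __DIR__));
```

Anything returned by the function is a script tag pointing at a file that isn’t there, which is exactly the situation that left Opera hanging.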

This behavior is simply idiotic planning on Opera’s part: end users see a finished page that doesn’t allow interaction, and webmasters get zero clue to the cause of the hangup.

Expounding, this seeming lack of user focus likely contributes to Opera’s weak market position. With all the free choices out there, they have to be the most appealing in one or more categories to one or more demographics, and I never saw anything in Opera that differentiated it significantly from the pack. That seems like a “Business 101” observation to me, but they certainly aren’t the first netcom to come along without a clearly defined path to market and profit.


Where is the Google Analytics code block?

This is the kind of “usability fail” that drives me nuts, especially from a company of Google’s caliber. When you add a new site to Google Analytics, they present you with the Javascript code block you need during the setup process. But what if you need to look up that code block again? What if Google updates the Analytics code block (they just did it in Dec 2008)? Naturally you go back to the Analytics site to get the code…
[Screenshot: Analytics]
Where’s the code block at? Anyone else stumped? Keep in mind that I limited this puzzle to just a single page; I scoured the whole site before finding the link on this page. As it turns out, the code block is hiding in the Check Status link in the top right corner of the main column.
[Screenshot: Analytics tracking code link]
It’s not part of any of the standardized navigation — the header, the breadcrumbs, the sidebar — and the link text is not at all indicative of what the resulting page contains. Speaking of which, the page that actually houses the code is even more incriminating.
[Screenshot: Analytics code page]
The “status” information appears at least in part on the previous page, so the overt purpose of this page is pretty worthless. To make matters worse, the breadcrumb actually says “Tracking Code.” So someone got the right idea when they put this page together, but didn’t follow through when hooking it up to the rest of the site navigation.

Everyone (and so every company) makes mistakes. However, Analytics is an offshoot of the Google AdWords service, which is their bread and butter. This kind of oversight on a flagship product is simply sloppy.

Make sure that you spend all the necessary time on the key “touch points” of your site/service to ensure they are operating at their maximum.


Do the upfront planning

Since announcing my work on Fwd:Vault, a few friends have come forward expressing interest in being angel investors. From what I’ve read, most entrepreneurs have to seek such people out, so I’ll take it as a compliment that these friends either (a) really like me, or (b) really trust my technical prowess. I’m leaning towards (a)…

Anyway, despite my obvious flattery at the offer, I always answer the same way…

I’m not looking for investors at the moment, no. My goal is to get this thing off the ground with zero incurred debt or outside investment, and grow with revenue. That’s the beauty of a dotcom startup: depending on what you’re doing, if you put it together the right way the first time, it can largely run itself.

Think about that. After several months analyzing technical requirements, programming, support, customer service, marketing, etc., I’m confident that I can run the entire business solo until reaching a certain point in revenue. The only reason I can do this is that I am setting myself up for a launch with as many core features and as few bugs as humanly possible. That will allow me to focus on customer support and marketing, both of which require that I not have my head buried in a computer screen 12 hours a day.

I don’t care what Michael Masterson says, “ready-fire-aim” will only create more work for you down the road. Don’t fire off projects half-cocked. Take the time to lay the groundwork, and you’ll reap huge benefits in the long-term.


Free and open source alternative to ShareThis, AddThis, AddToAny

Update: Make sure you check out the comments! My post is just a launching point for some great commentary from staff at iBegin Share and Add to Any.

Every site with timely or useful content should utilize some on-site bookmark sharing tool. I’m talking about the bar of links to social networking sites like Facebook, Digg, Reddit, Twitter, etc. that you find at the end of a post. These buttons are preset to recognize the URL of the page they appear on, allowing visitors to quickly propagate your content to their digital lifestream. Wordpress specifically offers a ton of plugins that offer such functionality.

The most popular tools use Javascript to display all the sites in a popup: Add to Any, AddThis, and ShareThis. Speaking in terms of pure function, these tools are great: they make sharing functionality readily available without cluttering up the display.

However these JS-based bookmarkers possess some significant downsides. First and foremost are the performance concerns. These tools are all stored remotely, and get loaded on your page as a javascript include. Here’s an example of the code from ShareThis:

<script src="http://w.sharethis.com/button/sharethis.js#tabs=web%2Cpost%2Cemail&amp;charset=utf-8&amp;style=default&amp;publisher=abc123" type="text/javascript"></script>

Pay attention specifically to src="http://w.sharethis.com/button/sharethis.js[...]". It’s just a normal URL, like any page you visit. This means that each time the page is loaded, the user’s browser goes off to retrieve a copy of the javascript required to display the button. Aside from the bump in bandwidth usage, this can cause a noticeable delay in page loading. Worse, if the service is experiencing any kind of slowdown or outage, including these scripts can cause your site to hang and time out. And these services do hang on a regular basis. I’ve seen it last so long on my own blog that I’ve had to disable the plugin until service returned. That the delay is not your fault does not matter; it slows your page down, making you the laggard in the eyes of users. Not good.

But while these services are not focused on reliability and uptime, they do spend an awful lot of time on data collection/aggregation, legal, and advertising. None of these are good for you, the site owner. All activity surrounding the button on your site is tracked. They can partner with ad networks, packaging in extra ad cookies when the button is served up. Aside from the privacy issues, this again increases bandwidth. Imagery (specifically the branded icons of each service) is copyrighted, making it subject to usage restrictions and leaving you open to dealing with pain-in-the-ass take-down requests. Update: Per conversation with Add to Any Founder Pat Driven in the comments, Add to Any actually avoids this type of language entirely, limiting all their legal jargon to a plain-speak Privacy Policy.

To be clear, there’s nothing inherently wrong with any of this. These are businesses; they provide a service and have to make money to stay alive. However, I think the vast majority of users just want the fancy javascript popup, and everything else is excess baggage.

Enter iBegin Share, a free, open source alternative for javascript-powered bookmark sharing. Instead of going offsite to retrieve code at each page load, iBegin Share runs locally on your site, saving you bandwidth and decreasing load time. iBegin Share tracks usage like its corporate counterparts, but that data is stored in your database and used for your own data tracking purposes only, saving more bandwidth (since it doesn’t have to communicate back) and your privacy. Finally, since it’s open source you can modify the code any way you want: change the look, layout, color scheme (the tool includes 4 preset color schemes, plus an option for text vs. button link), even add totally new share options. A Wordpress plugin version is available.

On the downside, external documentation is pretty thin at the moment, but the code is well-commented. There is also a forum, but activity there is rather limited right now — a discussion on a seemingly common issue started earlier this month has yet to receive any official word. So you’re on your own with any heavy customizing or problems, but I suppose that’s the tradeoff for eliminating any third party eyes poking around your traffic. Assuming it works as advertised, I’d argue that it’s a far better deal than the other tools, even without any customization ability.

If you decide to give iBegin Share a shot, or if you’re using it already, I’d love to hear how it’s working for you. Please share your experiences in the comments.


Trouble logging into anything Google from Firefox

400 Bad Request
That’s what I’ve been getting everywhere I go in the Google universe for the past several days using Firefox (latest stable release, v3.0.6). I know I’m not the only one; I’ve found a plethora of recent support posts discussing the same issue. Normally I’m the one writing here to say “Hey, look out for this one, here’s how to handle it.” But this time I’m at a loss.

I automatically attribute any and all odd Firefox behavior to my ridiculous extension collection. There are so many, and sometimes one of them will get out of date and go all bull-china-shop on me. So I start disabling the most likely suspects…then a few more…finally all of them. Google still won’t behave. I go so far as to reset my Firefox profile (Beginners: try the safe way; Experts: go full monty). I log in once, but the behavior rapidly returns within a couple sessions (didn’t even have to restart).

Then I started looking at more systemic fixes, and I discovered that clearing cache and cookies will fix the problem (with or without extensions in the picture), but again the fix only lasts a few sessions. Since the problem is definitely with cache and/or cookies, I’m pretty much out of moves. I have no cookie or cache manager plugins that could be responsible, so this one rests squarely at Firefox’s furry feet.

At first I blamed Google, but then I saw similar behavior when trying to log into the Webmin interface I use for some servers. Logging in just kicks right back to the login prompt, with no errors.

In the meantime, I’ve switched all my logged-in Google activities (GMail, Analytics, etc.) over to Chrome, where I’ve had no problems whatsoever. I never used Chrome for routine browsing; the lack of extensions — notably AdBlock Plus and Firebug — really makes it unappealing. However I must admit that the overall UI is fantastic. If the Mozilla team doesn’t figure this cookie issue out soon, or if Google gets their act together and starts allowing Firefox plugins, I may make the full switch.

Still, I’d like to get back to a single window if possible, so if anyone has any other suggestions, I’m all ears.

Update: I’ve now reinstalled Firefox as well, to no avail. Until I find a permanent solution, I’ve installed another add-on called Clear Private Data, which gives you an icon to do exactly that. Now whenever Firefox acts up, I hit the button and continue. It still happens with ridiculous regularity, but at least the browser stays usable.


There are still kinks in Apple’s armor

To: Anyone at Apple who works on iPod/iTunes
From: Annoyed iPod Shuffle user
Subject: Stop deleting my songs

Message:
A recent run was rudely cut short when I realized that my iPod Shuffle contained just two of the hundred or so songs that I listen to while working out. Some sleuthing led me to discover that every time I connect my iPod to the docking station and open iTunes, the program re-syncs every song I have on the device. If the song is not located in the original location (i.e. the place on my computer that I uploaded it from), iTunes jumps to the conclusion that I must not want the song on my iPod any longer. I can sincerely assure you that this is never the case. The NAS device where I house my music is only connected when I need something, and I do not consider your RIAA-worshipping deletion paranoia a legitimate reason to alter my behavior.

Your devices are indeed wonderful; I bought one, after all. However, the software connected to them leaves much to be desired. Obviously I’m not alone, as there are a plethora of alternatives to your crap-tastic iTunes interface. If the Genius Bar isn’t smart enough to realize that *I* get to decide what goes on and comes off the device, it’s hardly deserving of the title, and I suggest you put your efforts toward a better overall experience.

In a way, I can’t say I blame you. Your devices bring in hordes of money; the software that runs on the desktop is almost an afterthought. However, I can assure you that I am keeping my eyes keen for the company that can deliver a competitive product with a superior interface with my computer.


You hear that, competitors? I want to give you money for the stuff that Apple can’t deliver. Quit charging them in the open field, and start sniping at range. Their armor is weak just below the neck…


What not to put on your homepage

I have a friend who showed me something on the WineAccess homepage tonight that just made me laugh. Check it out.

See anything wrong with that picture? How about the “What You Missed” section at the bottom? It contains wine bottles that you can’t purchase, because they’re sold out. Hence the title. Worthless information with which the user can do nothing, essentially wasted primo homepage real estate.

Homepage design is a very complex topic, but you can definitely tuck this one in the “What not to do” column.


Smash bugs, don’t treat symptoms

I previously discussed why certain “automagical” features can sometimes facilitate the creation of crappy code. However they only create a possibility of crappy code. Today I want to warn you against a practice that will create crappy code 100% of the time.

First a scenario – you have written a program in your language of choice. It’s fairly complex, partially because of the basic needs of your client or employer, and partly because every project is a moving target to a certain extent. At some point in the logic flow, your code behaves aberrantly; let’s keep it really simple and say that it’s outputting dashes instead of spaces in a block of text. “Well these shouldn’t be here at this point,” you think. “That text was scrubbed out when it came out of the database.” You confirm the scrubbing occurs, and check some things along the way to the output. Everything checks out.

However there’s a huge nebulous area that you conveniently sidestep, a big ol’ chunk of code written by Larry. The same Larry who got fired last month for half-assing that reporting module for the marketing team. A piece of his code still sits between your perfect database setup and your equally perfect outputting logic. You don’t want to touch Larry’s code with a 10-foot pole. “The problem MUST be in there,” you decide. “I’ll just undo the text change on the other side and be done with it.”

In other words, you treated the symptom, and didn’t solve the problem. This time-saving decision, while fairly innocent on its own, has far-reaching consequences for both your software and your own career as a developer. None of them are good.

Because you did not identify for certain where the problem lies, you have absolutely zero guarantee that Larry’s code is the problem at all. It could very well be in Larry’s code, but you didn’t look everywhere so you can’t say for sure.
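
Identifying where the problem lies doesn’t have to mean reading every line of Larry’s code up front. One way to narrow it down is to run the text through each stage in turn and stop at the first one whose output breaks your expectation. The sketch below is purely illustrative: the function name, the three stand-in stages, and the “no dashes” check are all assumptions made up to match the scenario.

```php
<?php
// Hypothetical helper: run $input through each named stage in order and
// return the name of the first stage whose output fails the $is_valid check.
function find_first_bad_stage($input, array $stages, $is_valid) {
  $value = $input;
  foreach ($stages as $name => $stage) {
    $value = $stage($value);
    if (!$is_valid($value)) {
      return $name;
    }
  }
  return null; // every stage passed
}

// Stand-in stages for the scenario: database scrub, Larry's code, output logic.
$stages = array(
  'scrub'  => function ($s) { return trim($s); },
  'legacy' => function ($s) { return str_replace(' ', '-', $s); },
  'output' => function ($s) { return htmlspecialchars($s); },
);
$no_dashes = function ($s) { return strpos($s, '-') === false; };

var_dump(find_first_bad_stage(' some text ', $stages, $no_dashes));
// string(6) "legacy"
```

Once you know which stage introduces the dashes, you can fix it there instead of undoing its work further down the line.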

The symptom you treated may well lead to a much larger problem. Perhaps it’s not only replacing spaces with dashes, but also truncating the text beyond a certain length. You won’t see that until a long-enough string passes by, and it may not pass by a person’s eyes for even longer. That’s the nasty thing about bugs: a human being must find and remove them. No matter what your philosopher-slash-uber coder friend says, the Matrix and its self-making code do not exist, so get in there and clean up the mess.

In short, with that one move, you’ve started down the path of writing crappy code. Keep taking that shortcut, and it won’t be long before you’re fired too, because your code will be a bug-ridden mess. Kind of like Larry, right? That’s because it’s the same path taken by Larry and every other lazy coder you’ve ever known.

The good news is that this path is easily avoidable: don’t be lazy. Do the work right the first time, stick with your syntax rules, and get to the root of every problem, every time.

Also, if the hypothetical scenario sounds eerily familiar, you might want to finish reading this post and go double-check that page slug creation code you wrote.

