Hot Koehls

Extract email addresses from tags

Posted by Frank - January 15, 2009 - For techies

Ran into another cool hurdle today for my Fwd:Vault development. When I grab the message content to archive it in the system, first thing I do is scrub it out to ensure that (a) it displays properly, and (b) there are no misbehaving characters. I grab both plain text and HTML email formats (if present), so the scrubbing process is a little different in each case. For the plain text, I take some extra steps to ensure there is no HTML whatsoever. Naturally, at one point this involves a call to PHP’s ultra-useful strip_tags() function.

However, in the course of testing today, I realized that when a message is forwarded, sometimes the forward header will wrap the email address in angle brackets, and the address gets stripped when I process the message. Allow me to demonstrate. Here’s the body of an example message that someone might send to Fwd:Vault for safe keeping…

---------- Forwarded message ----------
From: "Office Flirt" <flirt@example.com>
Date: Wed, Jan 14, 2009 at 10:14 AM
Subject: Delete those images
To: you@example.com

My boss is sniffing around. I want you to delete those pictures I sent you right away.

Signed,
Office Flirt

Obviously you’re tucking this one away in Fwd:Vault to provide a little CYA-insurance when the boss calls you into his office. Good call. Now, before today, this message would come out of the scrubbing process looking like this:

---------- Forwarded message ----------
From: Office Flirt
Date: Wed, Jan 14, 2009 at 10:14 AM
Subject: Delete those images
To: you@example.com
...

Look at the From: line. The email address is gone. You don’t have any other copies of it, so your boss doesn’t believe your story, and you get the blame. You’re forced to attend one of those god-awful sexual harassment classes. Fail.

So, what happened? Remember, you are looking at the body of a message in plain text. That “Forwarded message” block at the beginning is just part of the body text. So when the text was scrubbed by strip_tags(), the function picked it up as just another tag, which it dutifully removed.
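
Here is a minimal sketch of the problem, using the From: line of the sample message as input. The variable names are made up for illustration; the only real API involved is strip_tags().

```php
<?php
// Illustrative header line taken from the sample message above.
$header = 'From: "Office Flirt" <flirt@example.com>';

// strip_tags() sees "<flirt@example.com>" as markup and drops it,
// so only the display name survives.
$scrubbed = strip_tags($header);

echo $scrubbed, "\n";
```

Running this should print the From: line with the name intact and the address missing, which is exactly the failure described above.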

To handle this situation, I came up with a piece of code that will look for email addresses in “tagged format” — i.e. surrounded by < and > — and remove the surrounding symbols, leaving us with harmless text.

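// Illustrative input: an address in "tagged format" inside other text
$test = 'some <flirt@example.com> surrounding text';
// Swap <address> for the bare address, padded with spaces
$test = preg_replace( "/(\<)(.+@[^\(\);:,<>]+\.[a-zA-Z]{2,4})(\>)/",
                      ' $2 ',
                      $test);
// Collapse each run of whitespace into a single space
$test = preg_replace('/[\s]+/', ' ', $test);
echo $test;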

Let’s break this down. First, we have a regular expression that identifies email addresses: .+@[^\(\);:,<>]+\.[a-zA-Z]{2,4}. This is the same expression set in the example on the Quanetic Software Regular Expression Tester (an excellent tool). We surround that in parentheses to isolate it as a subpattern. Then on either end of the expression, we tack on more regex voodoo to look for tag syntax: (\<) and (\>). These also get parentheses to identify them as subpatterns. Once it’s finished, we have an expression that will only match addresses wrapped in tagging structure.

The second argument in preg_replace() is the replacement, or what we should replace any matches with. In this case, we’ve isolated the address from the tags using subpatterns. So all we need to do is make a single call to the proper reference, which is $2, because it’s the second set of parentheses in the expression. Confused? You can learn about subpatterns on the PHP manual page for preg_replace().
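
If the numbering of subpatterns is hard to picture, a quick preg_match() call shows what each set of parentheses captures. This is only a sketch: the input line is the illustrative From: header from the sample message, and $m is an arbitrary variable name.

```php
<?php
// Same pattern as in the post, written as a single-quoted string.
$pattern = '/(\<)(.+@[^\(\);:,<>]+\.[a-zA-Z]{2,4})(\>)/';
$line    = 'From: "Office Flirt" <flirt@example.com>';

if (preg_match($pattern, $line, $m)) {
    // $m[0] is the whole match; $m[1], $m[2], $m[3] are the subpatterns
    // in the order their opening parentheses appear.
    echo 'whole match: ', $m[0], "\n";
    echo '$1 (open):   ', $m[1], "\n";
    echo '$2 (email):  ', $m[2], "\n";
    echo '$3 (close):  ', $m[3], "\n";
}
```

The second entry is the bare address, which is why the replacement string refers to $2.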

Note the spaces around the $2 in the second argument. Sometimes the address will not have any spaces between the person’s name and the actual address. This could lead to the address being combined with the name which, in the case of Fwd:Vault, would screw up our search indexing. So we add spaces during the replace, then make a second call to preg_replace() to collapse the extra spaces down to one: $test = preg_replace('/[\s]+/', ' ', $test);.

Legal Disclaimer: In case you do end up using Fwd:Vault when it launches, I’m fairly certain the service wouldn’t be liable in this silly hypothetical. Just make sure you read the terms before you sign up if you play the field at your office. Sorry to everyone going “duh” right now; it’s a sue-happy world.

Update: When I went to implement this change today, I discovered that the code was catching newlines (\n or \r) in the crossfire. The culprit was the second call to preg_replace(): the “\s” character class includes not only spaces but line terminators as well. Oops. The revised version looks like this:

$body_text = preg_replace('/[ ]{2,}/', ' ', $body_text);
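
Putting both steps together with the revised space handling might look something like the sketch below. The function name untag_addresses() and the sample body are assumptions made for illustration, not the actual Fwd:Vault code.

```php
<?php
// Hypothetical helper: combines the two preg_replace() calls from the post.
function untag_addresses(string $body_text): string
{
    // Unwrap <address> into " address " so strip_tags() leaves it alone.
    $unwrapped = preg_replace(
        '/(\<)(.+@[^\(\);:,<>]+\.[a-zA-Z]{2,4})(\>)/',
        ' $2 ',
        $body_text
    ) ?? $body_text;

    // Collapse runs of two or more spaces only; newlines are not touched.
    return preg_replace('/[ ]{2,}/', ' ', $unwrapped) ?? $unwrapped;
}

// Illustrative body with no space between the name and the address.
$body  = "From: \"Office Flirt\"<flirt@example.com>\n"
       . "Subject: Delete those images\n";
$clean = strip_tags(untag_addresses($body));

echo $clean;
```

With this ordering, the address is already plain text by the time strip_tags() runs, and the line breaks in the forward header survive the cleanup.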

Tags: email, fwdvault, php, programming, regular expressions, security, usability

One comment on “Extract email addresses from tags”

  1. ercan says:
    June 16, 2011 at 5:41 pm

    Good job!
    But I need something little bit complicated.
    I want to resolve To: field and this field may have more than 1 email addresses. So is it possible resolve something like this?

    To: barbara , yalintan , aac

    I will be glad if you can help, Thank you


(c) 2012 Frank Koehl. All Rights Reserved.