MySQL’s Monty Widenius responds

My summarizing and opining post discussing MySQL founder Michael Widenius’ protest of the Sun purchase by Oracle prompted a response from none other than Monty himself. Hit the comments to see what he has to say about my response, which was definitely net-negative in the final analysis. I have responded in the comments of that post as well.

I must be moving up in the world, or Monty was just really bored over his Christmas vacation. :)


Select a single dataset across multiple tables

Say you’ve got two tables with the same structural layout but contain logically different information. A common example would be storing “deleted” records in a separate table to reduce table sizes, simplify queries, and improve performance. A record only exists in one of the two tables, either it’s deleted or it’s not.

But sometimes you just need to find the information. “I don’t care where record #123 is, I just need to see it.” This situation presents a unique problem. A standard SELECT statement can combine tables to find a unified set of data, but can’t look for the same thing in two different place simultaneously. So you can do the obvious thing, which is the run the same query twice, once on each table. But that’s a way bigger performance penalty than necessary. Get to know the UNION statement

(SELECT * FROM tbl_stuff WHERE id = '123')
UNION
(SELECT * FROM tbl_stuff_deleted WHERE id = '123')

One query, a single set of results, and performance optimization all wrapped into one. Sweet.


Archive your entire Twitter timeline

My code for displaying Twitter posts on your site is pretty handy, but it does have drawbacks. Each page load involves calling a remote URL, downloading a resulting XML file, and parsing the results, increasing your load times and using bandwidth. To minimize the impact, you can really only display a handful of the most recent posts.

Plus, the downloaded stream is never saved. Google does index Twitter, but the thoroughness and benefit to you are subject to much speculation.

We can solve both problems by locally storing and serving Twitter posts ourselves. Once you have them in your own system, you can display as many of them as you want without expensive external URL lookups. Plus, with the content centrally located on your site, getting Google to index and apply it to your rankings is straightforward.

Note for SEO geeks:
Yes, I am aware that displaying and indexing Twitter posts on your own site does technically fall under the category of duplicate content, so save your typing.

Given the disparate nature of Twitter content and the utter disconnect from my sites, I’m not too concerned about incurring a penalty for it. Your opinion and experience may vary. You should at least familiarize yourself with Google’s rules for duplicate content. If your paranoid, consider applying canonicalization to pages that display large portions of a Twitter timeline.

Let’s get started
The end of the post includes a link to download all the code, as well as a link to a live demo.

I am assuming that you’ve got a standard PHP/MySQL stack for your site, ideally running on Linux, super-ideally Debian (Digg uses it for a reason, you know).

I am also assuming that you know how to use it; bring a decent understanding of SQL, PHP, and basic web programming. Here’s your first test: the demo assumes your PHP installation is version 5 and includes the Simple XML libraries.

First, here’s the SQL INSERT command for the table that our example will use. Apply this to your database:

CREATE TABLE IF NOT EXISTS twitter (
  `id` bigint(10) unsigned NOT NULL,
  `created_at` datetime NOT NULL,
  `source` varchar(255) NOT NULL,
  `in_reply_to_screen_name` varchar(255) NOT NULL,
  `text` varchar(255) NOT NULL,
  UNIQUE KEY `id` (id)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

Now let’s have a look at the class, which is the meat of the entire thing:

class Twitter {
  public function __construct($twitter_id) {
    $this->id = (int)$twitter_id;
  }
 
  public function user_timeline($page, $count = '200', $since_id = '') {
    $url = 'http://twitter.com/statuses/user_timeline/' . $this->id . '.xml?count=' . $count . '&page=' . $page;
    if ($since_id && $since_id != '') {
      $url .= '&since_id=' . $since_id;
    }
    $c = curl_init();
    curl_setopt($c, CURLOPT_URL, $url);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($c, CURLOPT_CONNECTTIMEOUT, 3);
    curl_setopt($c, CURLOPT_TIMEOUT, 5);
    $response = curl_exec($c);
    $responseInfo = curl_getinfo($c);
    curl_close($c);
    if ($response != '' && intval($responseInfo['http_code']) == 200) {
      if (class_exists('SimpleXMLElement')) {
        return new SimpleXMLElement($response);
      } else {
        return $response;
      }
    } else {
      return false;
    }
  }
 
  public function rebuild_archive($your_timezone) {
    $orig_tz = date_default_timezone_get();
    date_default_timezone_set('GMT');
    $tz = new DateTimeZone($your_timezone);
    $sql = "SELECT id FROM twitter ORDER BY id DESC LIMIT 1";
    /**
     * INSTALLATION
     * execute $sql on your DB to get the latest twitter post
     * set the value of `id` to a variable named $since_id
     * set $since_id to false if the table is empty (i.e. a new install)
    **/
    $tweet_count = 0;
    for ($page = 1; $page <= 200; ++$page) {
      if ($twitter_xml = $this->user_timeline($page, '200', $since_id)) {
        foreach ($twitter_xml->status as $key => $status) {
          $datetime = new DateTime($status->created_at);
          $datetime->setTimezone($tz);
          $created_at = $datetime->format('Y-m-d H:i:s');
          $sql = "INSERT IGNORE INTO twitter
                    (id, created_at, source, in_reply_to_screen_name, text)
                  VALUES (
                    '" . $status->id . "',
                    '" . $created_at . "',
                    '" . addslashes((string)$status->source) . "',
                    '" . addslashes((string)$status->in_reply_to_screen_name) . "',
                    '" . addslashes((string)$status->text) . "'
                  )";
          /**
           * INSTALLATION
           * Execute $sql over your DB here
          **/
          ++$tweet_count;
        }
      } else {
        break;
      }
    }
    $sql = "ALTER TABLE twitter ORDER BY `id`";
    /**
     * INSTALLATION
     * Execute $sql over your DB here
    **/
    date_default_timezone_set($orig_tz);
    return $tweet_count;
  }
}

Twitter::user_timeline()
This method is a modified version of my previous twitter_status() function.

The big difference is that we’re passing additional arguments to Twitter’s user_timeline API call: count (specifies the number of statuses to retrieve) and page (specifies the page of results to retrieve).

Twitter::rebuild_archive()
This method takes the results from user_timeline() and places them in your DB. Its lone argument is the string representation for the timezone of your server. To find out what the string is and why you need it, just read the second post of my twitter series. For me on the US east coast, I use 'America/New_York'.

Quick Warning
Hopefully you noticed several large comment blocks with INSTALLATION in all caps: I didn’t include any code to run SQL over your DB. Every system includes their own wrapper for database calls, including mine, so I’m not wasting time writing out SQL inserts using raw PHP functions that you’ll just remove. Find the three blocks labeled “INSTALLATION” and follow the instructions to execute the list SQL.

Now we just need to run it.

require('/path/to/twitter.class.php');
$Twitter = new Twitter('12345678');
$Twitter->rebuild_archive('America/New_York');

We instantiate the class and pass the ID number of our Twitter account. You’ll find instructions on getting this number about halfway down my first post on displaying Twitter updates. After that, a single call to Twitter::rebuild_archive() will grab all available updates and store them.

If the `twitter` table is empty, it will grab your entire Twitter timeline, up to 3200 posts. If you have more than 3200 posts, you’re out of luck for the time being, although I’d recommend you take a break from the computer, take a shower, and say “Hi” to the wife and kids.

After the first run, subsequent runs will only grab new posts by way of the API’s since_id argument.

If you have the access, you can easily make this into a cron job:

#!/usr/bin/php5
<?php
require('/path/to/twitter.class.php');
$Twitter = new Twitter('12345678');
$Twitter->rebuild_archive('America/New_York');
?>

Save that last block of code to a file, set it to be executable (chmod 755 usually), and set the job to run hourly. That top line identifies the interpreter that the system should use to read the file. You may need to change it to reflect the location of the PHP executable on your system.

Want to see everything described above in action? Check out the Developer’s Diary on Fwd:Vault.

Don’t worry about cut ‘n paste, just download the zip file with the class and all the examples:
Twitter Archiver (.zip)

Update 08-19-2009: Removed references to function calls specific to my framework.

Update 12-16-2009: The `id` field has been bumped up to a BIGINT. Twitter ID numbers are bigger than what an unsigned INT field can hold.


Format a RFC2822 date for mysql datetime fields

I was crash-coursing myself on PHP’s IMAP functionality recently, one of the first questions I came across was how I might store the date from an e-mail header in a MySQL DATETIME field. I was afraid I was going to have to parse out the string using a bunch of calls to substr(), but then I remembered that the strtotime() function recognizes various RFC formats.

So, if you need to get the timestamp of an RFC2822 e-mail header—or most any RFC-based timestamp—into MySQL DATETIME format, it’s easy:

$timestamp = strtotime('Tue, 30 Sep 2008 10:30:00 EDT');
echo date('Y-m-d g:i:s a', $timestamp);

Keep in mind that if there are differences between the timezones of your server and the timestamp, the date() function could screw with your desired output. You should probably take a look at date_default_timezone_set().


Automating SSH or SFTP in scripts

Recently I needed to automate copying a MySQL database to a backup server. We keep a copy of our site and DB on this box in the event that our main systems go down, or there’s a problem with our internet connection. It’s kind of like a poor man’s colocation setup. I actually prefer the setup over true colocation for the vast majority of small and medium-sized business, because it’s far simpler and requires far less overhead and continuous support.

When I started searching around for resources on how to automate the SFTP connection, I was hit in the face with tons of dead ends. Several Google searches were spitting back mailing list and forum archives of plenty of questions regarding how to create backup scripts that connect to a remote server via SFTP. If you are in this boat, read on.

Here’s the problem. At some point in the development of SFTP, the writers decided that storing access credentials in files as part of an automated process was a very bad idea. So they coded SFTP to bypass the password challenge when invoked from a script (aka the -b flag, which runs commands from a file).

Instead, they recommend that you create a private key pair between the two systems. This preemptive measure handshake eliminates the need for passwords entirely, making your code a bit simpler. It’s fairly easy to do but, of course, most developers groan at the thought of having to learn yet another technique, and looks for ways around the restriction. I did both, and recommend the key pair approach. I’ll describe both here, and let you decide for yourself.

SSH/SFTP connection without passwords

The following example is borrowed from an article on The Linux Problem Base, but there are several out there explaining the same approach.

First log in on A as user a and generate a pair of authentication keys. Do not enter a passphrase:

a@A:~> ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/a/.ssh/id_rsa):
Created directory '/home/a/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/a/.ssh/id_rsa.
Your public key has been saved in /home/a/.ssh/id_rsa.pub.
The key fingerprint is:
3e:4f:05:79:3a:9f:96:7c:3b:ad:e9:58:37:bc:37:e4 a@A

Now use ssh to create a directory ~/.ssh as user b on system B. The directory may already exist, which is fine:

a@A:~> ssh b@B mkdir -p .ssh
b@B's password:

Finally append a’s new public key to b@B:.ssh/authorized_keys and enter b’s password one last time. Note that you must be looking at the local directory in which you saved the key in step one.

cd /home/a/
a@A:~> cat .ssh/id_rsa.pub | ssh b@B 'cat >> .ssh/authorized_keys'
b@B's password:

From now on you can log into B as b from A as a without password. Try this to confirm:

a@A:~> ssh b@B hostname

It should return the hostname of system B without prompting for a password.

It’s painless and easy, and every SSH connection you make going forward requires less typing. However, if you still really want to use a password, you have two options.

Utilize the -o Flag

So the SFTP team made their stance clear, and backed it up with action. However, it’s not impossible to bypass the restriction. The -o flag allows you to access all the options available in the sshd_config file, so you can change any of them on the fly. Here we need to disable the batchmode directive, so your SFTP call would look something like this:

sftp -o "batchmode no" -b /tmp/bat user@host

I found this one on a random forum post, and it comes with an important warning:

Note that it must come *before* -b, which may be surprising – this is
due to ssh processing -o options as if they were read from the config
file – ssh_config(5) again:

The only problem here is that the password challenge gets sent back out to the command line, requiring normal keyboard interaction.

Use SSHPass

So if that’s still not good enough for you, check out a SourceForge project called SSHPass. From the link:

Sshpass is a tool for non-interactivly performing password authentication with SSH’s so called “interactive keyboard password authentication”. Most user should use SSH’s more secure public key authentiaction instead.

SSHPass is available from default Debian apt servers; I couldn’t find anything reliable on its availability through yum.

Proceed at your own risk. If you server allows any sort of public access, even to a large handful of outside users, I strongly recommend going the key route.

Update Feb 27, 2009: Reader pointed out that OpenSSH has a shortcut function, ssh-copy-id, to install your public key on a remote machine. Nice.


Using SQL computations in WHERE clause

When writing SQL statements, sometimes I want to filter the result set based on a value that must be computed out of the stored data. Here’s an example:

SELECT widgetID, count( widgetID ) AS totalWidgets
FROM table_widget_sales
WHERE totalWidgets > 1
GROUP BY widgetID
ORDER BY totalWidgets DESC

I want to see which widgets have been sold more than once, and how many times each of those remaining widgets have sold. Obviously we could pass a simpler query into a scripting language to calculate it for us, but that’s more cycles and memory. I love to squeeze as much out of my SQL statements as possible.

Problem is, when I run that query above, I get this error:

#1054 - Unknown column 'totalWidgets' in 'where clause'

My first reaction is, “I can use that column in the order by clause, why not where clauses!?” Simple, the value doesn’t exist when the WHERE clause is evaluated. The number is calculated as each matching record is passed. By the time we reach the ORDER BY clause, we have the data we need. But the WHERE clause is used to actually figure out what records we need to perform that calculation, so we’ve got a chicken-egg issue.

The solution is to use a subquery, sometimes called a “sub-select statement.” Using two select statements, we can split the duties for filtering (more than 1 widget sale) and counting (total number of sales per widget).

Here’s the working query:

SELECT w.widgetID, COUNT( ws.widgetID ) AS totalWidgets
FROM table_widgets AS w
INNER JOIN table_widget_sales AS ws ON ( w.widgetID = ws.widgetID )
WHERE (
SELECT COUNT( ws.widgetID ) AS number
FROM table_widget_sales AS ws
WHERE ws.widgetID = w.widgetID
) > 1
GROUP BY widgetID
ORDER BY totalWidgets DESC

See what’s going on? We’re doing the same count twice, but the subquery returns everything (number of sales per widget for all widgets), while the first select only grabs the results we want (sales greater than 1).