Two Coins Bug

There’s an old riddle I heard while growing up that I used to hate.

You have two coins that add up to 30 cents, but one isn’t a nickel. What are the two coins?

The answer, of course, is a quarter and a nickel. Like many riddles, it comes down to an assumption we make when parsing the question. In this case, it’s absolutely true that one of the coins isn’t a nickel — but the other one is.

This riddle felt particularly salient to some work I was doing in WordPress this week, around the logic of MySQL queries and the WP_Meta_Query functionality.

As sometimes an example can explain a bit more than theoretical, consider the following:

Post ID: 50, has a title of "Fifty", and two separate rows in the postmeta table, both under the 'color' key -- 'red', and 'blue'.

Post ID: 51, has a title of "Fiftyone", and one row in the postmeta table, under the 'color' key -- 'red'

How would you query for all posts that have color of red, but do not have color of blue? My first attempt was something along the lines of this:

get_posts(
  array(
    'meta_query' => array(
      array(
        'key' => 'color',
        'compare' => '=',
        'value' => 'red',
      ),
      array(
        'key' => 'color',
        'compare' => '!=',
        'value' => 'blue',
      ),
    ),
  )
)

However, to my initial surprise, this returned both ID 50 and ID 51! So I pulled up the query to see why that could be, and it turned out the generated SQL looked something like:

SELECT *
FROM wp_posts
INNER JOIN wp_postmeta ON ( wp_posts.ID = wp_postmeta.post_id )
INNER JOIN wp_postmeta AS mt1 ON ( wp_posts.ID = mt1.post_id )
WHERE ( wp_postmeta.meta_key = 'color' AND wp_postmeta.meta_value = 'red' )
AND ( mt1.meta_key = 'color' AND mt1.meta_value != 'blue' )
GROUP BY wp_posts.ID

(slightly simplified)

Well, you would ask, how does that return post ID 50? Well, because both of the joins to the meta table wind up finding the same meta! We say not to return a result with a meta key of blue, but SQL interprets that as find me any entry with a meta key that is not blue — and as post ID 50 also has a meta key that is red (which is not blue) we’re in business!

Anyway, finding the absence of something in the where clause is tricky — as the where clause specifies what you want to select. I’d looked and I’m pretty sure just using what’s already there in the WP_Meta_Query class it’s not possible to properly structure this, I’d done some work on it a decade ago, implementing the EXISTS and NOT EXISTS comparators.

So we could get the result we wanted, if we rewrote the above MySQL to:

SELECT *
FROM wp_posts
INNER JOIN wp_postmeta ON ( wp_posts.ID = wp_postmeta.post_id )
LEFT JOIN wp_postmeta AS mt1 ON ( wp_posts.ID = mt1.post_id AND mt1.meta_key = 'color' AND mt1.meta_value = 'blue' )
WHERE ( wp_postmeta.meta_key = 'color' AND wp_postmeta.meta_value = 'red' )
AND ( mt1.post_id IS NULL )
GROUP BY wp_posts.ID

The rough idea is that it will attempt to do a left join instead of an inner join, which could not turn up any results from the joined table — so we explicitly want to query for rows where a column that would have to be there — post_id — being NULL to ensure that there was nothing for it to join to. Thereby ensuring it NO HAZ any blue values.

So here’s what I wound up writing. It’s still rough, but functional thus far in my testing (so long as the NO HAZ isn’t the first parameter in the meta_query — as written the regex relies on the aliasing of the table name to parse it out). There’s probably some cleanup that it needs, however …

/**
 * An extra filter to let our custom `NO HAZ` meta_compare work!
 */
function no_haz_meta_compare( $sql, $queries, $type, $primary_table, $primary_id_column, $context ) {
	global $wpdb;

	foreach ( $queries as $key => $query ) {
		if ( ! empty( $query['compare'] ) && 'NO HAZ' === $query['compare'] ) {

			// If we've got a NO HAZ query, pull out the table alias it's using for the join.
			$matches = array();
			$search  = sprintf(
				'#\( (\w+)\.meta_key = \'%1$s\' AND (\w+)\.meta_value = \'%2$s\' \)#',
				preg_quote( $query['key'], '#' ),
				preg_quote( $query['value'], '#' )
			);
			if ( preg_match( $search, $sql['where'], $matches ) ) {
				$table_alias = esc_sql( $matches[1] );

				// And then rewrite the JOIN bit to do a LEFT join instead of inner, so it can check for null (absence)
				$sql['join'] = str_replace(
					array(
						sprintf(
							"INNER JOIN {$wpdb->postmeta} AS {$table_alias} ON ( {$wpdb->posts}.ID = {$table_alias}.post_id )",
							$matches[1]
						),
						sprintf(
							"LEFT JOIN {$wpdb->postmeta} AS {$table_alias} ON ( {$wpdb->posts}.ID = {$table_alias}.post_id )",
							$matches[1]
						),
					),
					$wpdb->prepare(
						"LEFT JOIN {$wpdb->postmeta} AS {$table_alias} ON ( {$wpdb->posts}.ID = {$table_alias}.post_id AND {$table_alias}.meta_key = %s AND {$table_alias}.meta_value = %s )",
						$query['key'],
						$query['value']
					),
					$sql['join']
				);

				// And now yeet the original WHERE bit and swap it to testing for `post_id` in the postmeta table being null -- or confirming that none were found.
				$sql['where'] = str_replace(
					$matches[0],
					"( {$table_alias}.post_id IS NULL )",
					$sql['where']
				);

			}
		}
	}

	return $sql;
}
add_filter( 'get_meta_sql', 'no_haz_meta_compare', 10, 6 );

which would permit the following to return only post id 51:

get_posts(
  array(
    'meta_query' => array(
      array(
        'key' => 'color',
        'compare' => '=',
        'value' => 'red',
      ),
      array(
        'key' => 'color',
        'compare' => 'NO HAZ',
        'value' => 'blue',
      ),
    ),
  )
)

The code would probably be a lot cleaner if it was in WP_Meta_Query proper, instead of regex parsing it all out — or if there was a filter in `WP_Meta_Query::get_sql_for_clause` to allow overriding — but hopefully this serves as a useful proof of concept for someone else hitting a similar issue down the road.

Final-ish legislation.sql table structure

The final (until I add a sponsors table) db structure for the legislation.  Works with the existing import script, with added indexes for easier querying.

Legislation DB Dump

Still not quite the final DB structure I’d like, but this is available for data mining and trying to build something awesome out of.

OpenDataDay Hackathon DC!

So I went down to DC this weekend to participate in the Open Data Day Hackathon!  There were some tremendous projects proposed, but the one that caught my eye from the start of the day was one proposed by Jim Harper of the Cato Institute to track down the genealogy of legislation put forth in congress.

Basically, the goal is to programatically find similar passages in multiple bills.  This can be used for many purposes, including looking at sections in large omnibus bills and getting an idea if the things that get shoehorned in it have been proposed previously, and what happened then.

So, our team largely consisted of  myself, Alexander Furnas of the Sunlight Foundation, and John Bloch of 10up, with guidance from Jim Harper (previously mentioned, of the Cato Institute), Molly Bohmer (also of the Cato Institute), and Kirsten Gullickson providing some clarification on the way the XML data we were working with was structured.

I spent my time building a MySQL database and a PHP import script that could map all the relevant data from the XML files in to it.

Alexander worked in Python primarily fleshing out a way of doing Latent Semantic Analysis on the data we’ve extracted to sort out what is similar to what, and where can we find meaning in it.

John spent his time working on a front-end for the final dataset, to help end-users get something useful out of the data we’re building.

The data that we were pulling from can be readily accessed by anyone through the Library of Congress at the following URLs:

I’m currently putting some finishing touches on the DB structure, but when that’s done, I’ll be releasing that and the import script in a subsequent post, as well as a SQL dump for the final accumulated and sorted data — ripe for data mining.  As the day was wrapping up, I had someone come to me inquiring about data mining for references to money allocated in appropriations bills and the like, and I was able to very quickly do a MySQL query along the lines of

SELECT * FROM `resolution_text` WHERE `text` LIKE '$%'

to find anything that started with a dollar sign and then listed an amount over a very limited data set of three million rows or such.  The final data set will be much larger.

Magento Duplicate Orders MySQL Search

A handy MySQL query to check a Magento DB for potential duplicate orders!

SELECT `quote_id`,
	COUNT(`quote_id`) AS `qty_duplicates`,
	`increment_id` AS `first_increment_id`,
	GROUP_CONCAT( `increment_id` SEPARATOR ' | ' ) AS `increment_ids`,
	`created_at`,
	`state`,
	`status`,
	`customer_firstname`,
	`customer_lastname`,
	`customer_email`,
	`grand_total`
FROM `sales_flat_order`
GROUP BY `quote_id`
HAVING COUNT(`quote_id`) > 1
ORDER BY `created_at` ASC