<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>philihp.com &#187; SQL</title>
	<atom:link href="http://www.philihp.com/blog/tag/sql/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.philihp.com/blog</link>
	<description>I do things, and then I tell the internet about them.</description>
	<lastBuildDate>Mon, 06 Feb 2012 05:40:30 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>SQL Optimization: Union vs. Union All</title>
		<link>http://www.philihp.com/blog/2010/sql-optimization-union-vs-union-all/</link>
		<comments>http://www.philihp.com/blog/2010/sql-optimization-union-vs-union-all/#comments</comments>
		<pubDate>Thu, 27 May 2010 22:08:45 +0000</pubDate>
		<dc:creator>philihp</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Optimization]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[Union]]></category>

		<guid isPermaLink="false">http://www.philihp.com/blog/?p=582</guid>
		<description><![CDATA[Everyone should learn the difference between Union and Union All. Knowing it will make you a better programmer, and it&#8217;s fairly trivial to understand. SELECT * FROM apples UNION SELECT * FROM oranges When you know for a fact that there will never be any common rows between the apples table and the oranges table, [...]]]></description>
			<content:encoded><![CDATA[<p>Everyone should learn the difference between Union and Union All. Knowing it will make you a better programmer, and it&#8217;s fairly trivial to understand.</p>
<pre>SELECT * FROM apples
UNION
SELECT * FROM oranges</pre>
<p>When you know for a fact that there will never be any common rows between the <code>apples</code> table and the <code>oranges</code> table, switching to &#8220;UNION ALL&#8221; will make this query slightly faster at low cardinality, and dramatically faster at high cardinality:</p>
<pre>SELECT * FROM apples
UNION ALL
SELECT * FROM oranges</pre>
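<p>The difference between the two queries is this: UNION ALL simply concatenates the results of the two queries into one result set. Plain UNION concatenates them too, but then removes duplicate rows, which typically means a sort or hash over the whole combined result. Leaving out this second step can vastly reduce the time it takes for your query to run. Keep in mind that UNION also collapses duplicates within each table, so the two forms return the same rows only when no row is repeated anywhere.</p>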
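<p>As a small illustration of the difference in results (a sketch only: the <code>name</code> column and the sample rows are made up for this example, and nothing here was timed):</p>
<pre>CREATE TABLE apples  (name VARCHAR(20));
CREATE TABLE oranges (name VARCHAR(20));

INSERT INTO apples  VALUES ('fuji');
INSERT INTO apples  VALUES ('shared');
INSERT INTO oranges VALUES ('shared');
INSERT INTO oranges VALUES ('navel');

-- duplicates removed: 3 rows (fuji, shared, navel)
SELECT name FROM apples
UNION
SELECT name FROM oranges;

-- plain concatenation: 4 rows, 'shared' appears twice
SELECT name FROM apples
UNION ALL
SELECT name FROM oranges;</pre>
<p>If the two inputs can never overlap, both queries return the same rows and UNION ALL just skips the de-duplication work.</p>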
]]></content:encoded>
			<wfw:commentRss>http://www.philihp.com/blog/2010/sql-optimization-union-vs-union-all/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Counting distinct variables in SQL with SAS</title>
		<link>http://www.philihp.com/blog/2009/counting-distinct-variables-in-sql/</link>
		<comments>http://www.philihp.com/blog/2009/counting-distinct-variables-in-sql/#comments</comments>
		<pubDate>Mon, 26 Oct 2009 06:44:45 +0000</pubDate>
		<dc:creator>philihp</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Count]]></category>
		<category><![CDATA[Distinct]]></category>
		<category><![CDATA[SAS]]></category>
		<category><![CDATA[SQL]]></category>

		<guid isPermaLink="false">http://www.philihp.com/blog/?p=504</guid>
		<description><![CDATA[One way to get the count of distinct values of a variable, which works in most flavors of SQL, is to use a subquery. For instance, in Oracle this is: SELECT count(*) FROM (SELECT DISTINCT foo FROM table) In SAS, using PROC SQL, you can do that too, but you can also simply do this: SELECT count(distinct foo) [...]]]></description>
			<content:encoded><![CDATA[<p>One way to get the count of distinct values of a variable, which works in most flavors of SQL, is to use a subquery. For instance, in Oracle this is:</p>
<pre>SELECT count(*) FROM (SELECT DISTINCT foo FROM table)</pre>
<p>In SAS, using PROC SQL, you can do that too, but you can also simply do this (<code>count(distinct ...)</code> is standard SQL, so it is not limited to SAS):</p>
<pre>SELECT count(distinct foo) FROM table</pre>
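<p>One caveat when choosing between the two forms: under standard SQL NULL handling, <code>count(distinct foo)</code> skips NULLs, while counting the rows of a <code>SELECT DISTINCT</code> treats NULL as one more value. A minimal sketch, assuming a hypothetical table <code>samples</code> invented for this example:</p>
<pre>CREATE TABLE samples (foo VARCHAR(10));

INSERT INTO samples VALUES ('a');
INSERT INTO samples VALUES ('a');
INSERT INTO samples VALUES ('b');
INSERT INTO samples VALUES (NULL);

-- NULL is ignored: 2
SELECT count(DISTINCT foo) FROM samples;

-- NULL counts as one distinct row: 3
SELECT count(*) FROM (SELECT DISTINCT foo FROM samples) d;</pre>
<p>The <code>d</code> alias on the derived table is required by some databases and harmless in others.</p>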
]]></content:encoded>
			<wfw:commentRss>http://www.philihp.com/blog/2009/counting-distinct-variables-in-sql/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Avoid Correlated Subqueries</title>
		<link>http://www.philihp.com/blog/2008/avoid-correlated-subqueries/</link>
		<comments>http://www.philihp.com/blog/2008/avoid-correlated-subqueries/#comments</comments>
		<pubDate>Tue, 16 Sep 2008 03:58:00 +0000</pubDate>
		<dc:creator>philihp</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Optimization]]></category>
		<category><![CDATA[SAS]]></category>
		<category><![CDATA[SQL]]></category>

		<guid isPermaLink="false">http://www.philihp.com/blog/2008/09/avoid-correlated-subqueries/</guid>
		<description><![CDATA[If your SQL code has a nested select that references a column in an outer select, such as the following, it may be possible to rewrite it to run orders of magnitude faster. proc sql; create table new_rates as select * from work.exchange_rate n where not exists( select * from imf.exchange_rate o where n.effective_date=o.effective_date and n.iso_char_code=o.iso_char_code ); NOTE: [...]]]></description>
			<content:encoded><![CDATA[<p>If your SQL code has a nested select that references a column in an outer select (a correlated subquery), such as the following, it may be possible to rewrite it to run orders of magnitude faster.<br />
<code>proc sql;<br />
   create table new_rates as<br />
   select *<br />
     from work.exchange_rate n<br />
     where not exists(<br />
       select * from imf.exchange_rate o<br />
       where n.effective_date=o.effective_date and n.iso_char_code=o.iso_char_code );</p>
<p><span style="color: #0000FF">NOTE: Table WORK.NEW_RATES created, with 49 rows and 4 columns.</span></p>
<p> quit;</p>
<p><span style="color: #0000FF">NOTE: PROCEDURE SQL used (Total process time):<br />
     <strong>real time           8.83 seconds</strong><br />
     cpu time            8.65 seconds</span></code><br />
Here, the table imf.exchange_rate has 13416 rows, covering daily closing exchange rates for 39 different currencies over nearly 1 year: a real data set, but a fairly small one. It has no indexes, and has not been sorted (or marked as sorted). work.exchange_rate is a smaller version of it, covering only exchange rates for the last month, with 980 rows. The query is trying to return any exchange rates that we didn&#8217;t have before.</p>
<p>Should be simple, right? There&#8217;s no reason for it to take this long. By rewriting the query as a left join, below, and keeping only the rows that found no match, SAS merges the tables behind the scenes, then finishes the query in a single scan.<br />
<code>proc sql;<br />
   create table new_rates as<br />
   select n.*<br />
     from work.exchange_rate n<br />
       left join imf.exchange_rate o<br />
         on (n.effective_date = o.effective_date and n.iso_char_code = o.iso_char_code)<br />
     where o.iso_char_code = '';</p>
<p><span style="color: #0000FF">NOTE: Table WORK.NEW_RATES created, with 49 rows and 4 columns.</span></p>
<p> quit;</p>
<p><span style="color: #0000FF">NOTE: PROCEDURE SQL used (Total process time):<br />
     <strong>real time           0.13 seconds</strong><br />
     cpu time            0.13 seconds</span></code></p>
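<p>The <code>= ''</code> test works here because SAS stores a missing character value as blanks. As a sketch of the same anti-join for databases where an unmatched left join yields NULL (an assumption about the target database, and not something timed above), the filter would be written with <code>is null</code>; the table names are kept from the example:</p>
<pre>select n.*
  from work.exchange_rate n
    left join imf.exchange_rate o
      on (n.effective_date = o.effective_date and n.iso_char_code = o.iso_char_code)
  where o.iso_char_code is null</pre>
<p>Testing a join key for NULL is the safe choice, since a matched row can never have a NULL there.</p>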
]]></content:encoded>
			<wfw:commentRss>http://www.philihp.com/blog/2008/avoid-correlated-subqueries/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using a Materialized Path Model for Trees within OLTP Databases (part 1)</title>
		<link>http://www.philihp.com/blog/2008/using-a-materialized-path-model-for-trees-within-oltp-databases-part-1/</link>
		<comments>http://www.philihp.com/blog/2008/using-a-materialized-path-model-for-trees-within-oltp-databases-part-1/#comments</comments>
		<pubDate>Sun, 18 May 2008 08:24:00 +0000</pubDate>
		<dc:creator>philihp</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[SQL]]></category>

		<guid isPermaLink="false">http://www.philihp.com/blog/2008/05/using-a-materialized-path-model-for-trees-within-oltp-databases-part-1/</guid>
		<description><![CDATA[Databases are very very good at storing tabular, dimensional data; and in a world where everything is a spreadsheet (your finance department), this works very well. Often, though, there&#8217;s a need for an application to store and deal with a tree of data; such as in classification systems or management structures. The Library of Congress, [...]]]></description>
			<content:encoded><![CDATA[<p>Databases are very, very good at storing tabular, dimensional data; and in a world where everything is a spreadsheet (your finance department), this works very well. Often, though, there&#8217;s a need for an application to store and deal with a tree of data, such as in classification systems or management structures. The <a href="http://en.wikipedia.org/wiki/Library_of_Congress_Classification">Library of Congress</a>, the <a href="http://en.wikipedia.org/wiki/Dewey_Decimal_Classification">Dewey Decimal System</a>, and <a href="http://www.census.gov/epcd/naics02/naicod02.htm">NAICS</a> all come to mind. Anyone who has ever worked with their company&#8217;s employee data has certainly come across the management hierarchy. In each of these, you&#8217;ve got a tree of nodes, each node (except the top; the tree root) having exactly one parent, and each node (except those on the bottom; the leaves) having any number of children.
<pre>          [node]
         /      \
    [node]    [node]
    /    \         \
[node]  [node]   [node]
                    \
                     [node]</pre>
<p>In a properly normalized <a href="http://en.wikipedia.org/wiki/OLTP">OLTP</a> database, what people usually do is create a table structure where each node only knows its parent. This satisfies the requirement of database normalization to have each fact/idea in one and only one location, so that two copies of the same fact can&#8217;t disagree (aside: it doesn&#8217;t rule out every inconsistency. Node A could say node B is its parent, and node B could say node A is its parent, which is a cycle rather than a tree).</p>
<p>This is called the <b>Adjacency Model</b> (often the adjacency list model), since each record stores the node adjacent to it in the tree, its parent; two records are siblings when they share the same parent.</p>
<p>As a table schema, this looks like:
<pre>+--------------+
| NODE         | &lt;---.
+--------------+     |
| node_id      |     |
| parent_id    | ----'
| node_data... |
+--------------+</pre>
<p>The other day I was asked how one could quickly ask the database the question &#8220;How do I find all of the direct and indirect children of a node?&#8221;</p>
<p>Direct children are simple&#8230;
<pre>SELECT *
  FROM node
  WHERE parent_id = ?</pre>
<p>Adding the children of those children is also simple&#8230;
<pre>SELECT *
  FROM node
  WHERE parent_id = ?
    OR parent_id in (
      SELECT node_id
        FROM node
        WHERE parent_id = ?)</pre>
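<p>But this gets out of hand pretty quickly, since every extra level needs another nested subquery, and it becomes obvious that, short of vendor-specific recursive query features, we won&#8217;t be able to have a single, simple, fast query that gives ALL descendants of a node, no matter how many levels down they are.</p>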
<p><a href="http://www.amazon.com/exec/obidos/ASIN/0596008945"><b>Enter the Materialized Path Model</b></a></p>
<p>In a Materialized Path model, each row of your table knows its entire ancestry. It&#8217;s a little like every file on your file system knowing its entire path, rather than just the folder it is in.</p>
<p>Probably the best way to describe it is by example. Say we had this tree:
<pre>             0
           / | \
          1  2  3
         /  / \
        4  5   6
              /|\
             7 8 9</pre>
<p>In an adjacency model, our table looks like:
<pre>+---------+-----------+
| node_id | parent_id |
+---------+-----------+
| 0       | .         |
| 1       | 0         |
| 2       | 0         |
| 3       | 0         |
| 4       | 1         |
| 5       | 2         |
| 6       | 2         |
| 7       | 6         |
| 8       | 6         |
| 9       | 6         |
+---------+-----------+</pre>
<p>However, in a materialized path model, we would store this as
<pre>+---------+-------+
| node_id | path  |
+---------+-------+
| 0       |       |
| 1       | 1     |
| 2       | 2     |
| 3       | 3     |
| 4       | 1.4   |
| 5       | 2.5   |
| 6       | 2.6   |
| 7       | 2.6.7 |
| 8       | 2.6.8 |
| 9       | 2.6.9 |
+---------+-------+</pre>
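<p>In order to ask the database &#8220;which nodes are descendants of <i>x</i>&#8221; you can query with a prefix match on the path. Including the dot delimiter in the pattern leaves out <i>x</i> itself and keeps node 1 from matching a node whose path starts with 10; the root, whose path is empty, is the special case where every other node is a descendant.
<pre>SELECT *
  FROM node
  WHERE path LIKE
    (SELECT path
       FROM node
       WHERE node_id = <i>x</i>)||'.%'</pre>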
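<p>In some databases, the &#8220;begins with&#8221; operation is optimized; a LIKE pattern with a fixed prefix and a trailing wildcard can often use an index on the path column. MySQL can also use regular expressions. SAS has the <code>=:</code> operator.</p>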
<p>This model lends itself to other nice things; the depth of a node can be calculated from the number of dot delimiters in its path.</p>
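<p>A minimal sketch of that depth calculation, assuming the <code>node</code> table above and a database that provides <code>LENGTH</code> and <code>REPLACE</code> string functions (most do, though the names and the handling of empty strings vary, so treat this as illustrative):</p>
<pre>-- depth = number of path elements = number of dots + 1;
-- the root has an empty path and is given depth 0
SELECT node_id,
       CASE
         WHEN path IS NULL OR path = '' THEN 0
         ELSE LENGTH(path) - LENGTH(REPLACE(path, '.', '')) + 1
       END AS depth
  FROM node</pre>
<p>For the sample tree, node 6 (path 2.6) comes out at depth 2 and node 7 (path 2.6.7) at depth 3.</p>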
<p>The Materialized Path model works very well, compared to the Adjacency model, when the root node and the nodes with many descendants rarely move, because moving a node means rewriting the path of every node beneath it. That is usually the case in classification schemes (base categories tend to be the established ones) and company org hierarchies (upper level executive managers tend to be senior).</p>
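<p>Changes near the top are the costly ones because a stored path embeds every ancestor. As a hedged sketch, here is node 6 in the sample tree being re-parented from node 2 to node 3. It assumes the <code>node</code> table above, the <code>||</code> concatenation operator already used in this post, and a <code>SUBSTR(string, start)</code> function; the literal paths are just the example values.</p>
<pre>-- every row at or below the old path 2.6 has to be rewritten:
-- 2.6 becomes 3.6, 2.6.7 becomes 3.6.7, and so on
UPDATE node
   SET path = '3.6' || SUBSTR(path, LENGTH('2.6') + 1)
 WHERE path = '2.6'
    OR path LIKE '2.6.%'</pre>
<p>Here that touches four rows (6, 7, 8 and 9); in the adjacency model the same move is a one-row update of <code>parent_id</code>.</p>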
<p>We can do a little bit better with a materialized path model, though&#8230; More in Part 2.</p>
<!--
part 2:
standardize the length of each element in the path for depth calc
no need for delimiters now

part 3:
disconnect using the ID, instead use a letter
letters don&#8217;t have to be unique to the table, just the last element has to be unique
  for all the direct reports of that record&#8217;s manager
by using alphanumerics, support up to 62 direct reports
-->
]]></content:encoded>
			<wfw:commentRss>http://www.philihp.com/blog/2008/using-a-materialized-path-model-for-trees-within-oltp-databases-part-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

