INCLUDE_DATA

Dramatically Increasing SAS DI Studio performance of SCD Type-2 Loader Transforms

SCD LoaderIn SAS DI Studio 3.4 (and I imagine in future versions), the prepackaged code for the SCD Type-2 Loader works like this: Does the dataset exist? If not, create an empty dataset with structure and indexes as defined from metadata. Then detect differences between it and the source dataset and the target dataset, expire any observations that are modified or deleted by setting their valid-to-date to now, and append any modified or new observations with a valid-from-date. The expire bit is done in-place with a data step modify statement, and the append is done with PROC APPEND. I assume this is done to reduce the amount of locking necessary on the dimension dataset. Because new observations are appended, the dataset never actually gets sorted by the business key, so this could lead to exponential growth over time on the expire bit; every time the transform wants to change a single observation’s valid-to-date, it scans the entire table. And it doesn’t help that compression is off by default.

In the instance below, we had an uncompressed dataset with a modest 220,000 rows. The dataset had a variable “description” defined as a 4000-length string which was usually but not always null. Most steps of the SCD Loader run in a few seconds, but the following is usually the bottleneck. On a run one night to populate a new variable, this data step ran in a little over 9 hours:

     data PRESTAGE.LOAD_SW_ENTITY_RELEASE_X_HOST;
        modify PRESTAGE.LOAD_SW_ENTITY_RELEASE_X_HOST
           work.etls_close
            (rename = (ETLS_KEY = SW_ENTITY_REL_X_HOST_ID
           ETLS_FROMDATE = VALID_FROM_DTTM))
            updatemode=nomissingcheck;
        by SW_ENTITY_REL_X_HOST_ID VALID_FROM_DTTM;
        VALID_TO_DTTM = ETLS_CLSDATE;
        if %sysrc(_SOK) eq _iorc_ then
           replace;
        _iorc_ = 0;
        _error_ = 0;
     run;
NOTE: There were 1 observations read from the data set PRESTAGE.LOAD_SW_ENTITY_RELEASE_X_HOST.
NOTE: The data set PRESTAGE.LOAD_SW_ENTITY_RELEASE_X_HOST has been updated.  There were 45924 observations rewritten, 0
     observations added and 0 observations deleted.
NOTE: There were 45924 observations read from the data set WORK.ETLS_CLOSE.
NOTE: DATA statement used (Total process time):
     real time           9:06:00.46
     cpu time            9:05:58.43

Turning compression on reduced the size of this dataset from 1.1 gigs to 4.5 megs, mostly from compressing the 4000-char string that was usually empty; with compression off, an X-length string always takes up X bytes because it’s faster to seek to a certain observation if all observations are the same size. Additionally, in metadata an index was defined on the two variables used above in the BY statement. I ran this PROC DATASETS statement to create the index by hand (the Loader would create them if the table didn’t exist, but it assumes indexes exist if the table exists).

proc datasets library=prestage nolist;
 modify load_sw_entity_release_x_host;
 index create ndx=(sw_entity_rel_x_host_id valid_from_dttm) / unique;
quit;

With this metadata in place, the SCD Loader uses the index in the datastep and generates the following datastep instead:

     data PRESTAGE.LOAD_SW_ENTITY_RELEASE_X_HOST;
        set work.etls_close(rename = (ETLS_KEY = SW_ENTITY_REL_X_HOST_ID
                                      ETLS_FROMDATE = VALID_FROM_DTTM));
        modify PRESTAGE.LOAD_SW_ENTITY_RELEASE_X_HOST
            key=ndx / unique;
        VALID_TO_DTTM = ETLS_CLSDATE;
        if %sysrc(_SOK) eq _iorc_ then
           replace;
        _iorc_ = 0;
        _error_ = 0;
     run;

By using the index, the runtime was reduced to under two seconds.

NOTE: There were 45924 observations read from the data set WORK.ETLS_CLOSE.
NOTE: The data set PRESTAGE.LOAD_SW_ENTITY_RELEASE_X_HOST has been updated.  There were 45924 observations rewritten, 0
     observations added and 0 observations deleted.
NOTE: DATA statement used (Total process time):
     real time           1.41 seconds
     cpu time            1.21 seconds

Friends in 2009

Lots of differences between now and last year. It’s interesting that now that I’m single, a good percentage of my 10 closest friends are also single. The age distribution is a lot wider now too, I expect this trend to continue as I am no longer meeting friends almost exclusively in school.

Summary of Top 10 Closest Friends at End of 2009

2009_7_virgin

Easy Inserting/Appending Libraries into FMTSEARCH path lists in SAS 9.2

In SAS 9.1.3 and prior, Options list (such as FMTSEARCH for libraries with format catalogs, and SASAUTOS for paths that contain macros shared across jobs) were annoying to work with. If you have nested code that wants to add a library or a path the list, doing so like this could potentially clobber statements executed outside of the nested block:

OPTIONS FMTSEARCH=(mylib.formats);

To really be safe, you want to add your library to the list. Doing this wasn’t very intuitive. I had always done this like this:

%let FMTSEARCH = %substr(%sysfunc(getoption(fmtsearch) ) ,2);
OPTIONS FMTSEARCH = (mylib.formats &FMTSEARCH;

But in 9.2, there are two new System Options, INSERT= and APPEND=. This is now as simple as

OPTIONS INSERT=(FMTSEARCH=mylib.formats);

Counting distinct variables in SQL with SAS

One way to get the count of distinct variables, which works in most flavors of SQL, is to use a subquery. For instance, in Oracle this is:

SELECT count(SELECT DISTINCT foo FROM table) FROM dual

In SAS, using PROC SQL, you can do that too, but you can also simply do this:

SELECT count(distinct foo) FROM table

The Family Reunion Problem

Here’s an interesting problem. I would like to call this the family reunion problem. As far as I know, this problem hasn’t been stated or generalized anywhere else. But maybe it has?

Suppose a set of people live in different cities and wish to have a reunion. They’re fairly far apart such that they’ll be flying, cost is an issue. Everyone is indifferent as to the city, but they wish to minimize the total travel cost of the trip.

The family consists of the following people:

  • Alice living in Boston
  • Bobby living in Miami
  • Cindy living in Los Angeles
  • David living in Seattle
  • Euclid living in San Francisco

Costs to fly between each city are given in this matrix:

BOS MIA LAX SEA
SFO 218 233 98 98
SEA 218 238 138
LAX 218 283
MIA 168

Which member of the family should host the reunion?

Of course this is easy with just 4 people, but family reunions usually have many times this. And family reunions don’t necessarily have to be held in a home city; it could be an entirely new venue.

My intuition tells me this is NP-Complete, but it has been a while.

For loops in SAS

When coding for loops in SAS, one neat thing to remember is that all of the parts of it are optional.

start <TO stop> <BY increment> <WHILE(expression) | UNTIL(expression)>

http://support.sas.com/documentation/cdl/en/lrdict/62618/HTML/default/a000201276.htm

So in another language, what might be

for(int i=0;i < max;i++) {

In SAS, you can leave out the "by 1", since it's the default increment. So this would be

do i=0 to max;

What's fun is in SAS you can combine a Repeat-Until/Do-While loop with a for loop. This might be useful if for some reason you don't set the var "max" until you've iterated through the loop once, such as in this case, where "metadata_getnobj" returns a negative on fail, but otherwise returns the number of objects that match the query.

do i=1 by 1 until (i >= objs);
  rc=metadata_getnobj("omsobj:SASLibrary?@Libref='MYLIBREF'",i,uri);
  if rc>=0 then objs=rc;

  rc=metadata_getnasn(uri,"LibraryConnection",1,uri);
  if rc>=0 then leave;
end;

This code loops through all the libraries in the metadata with libref='MYLIBREF', and exits with a URI for the first one it sees with at least one LibraryConnection association. This is useful when you have a remote library pointing to a local library and they both have the same name and you want a URI to the remote one.

Another cool thing with FOR loops in SAS is iteration through lists. In Java, say you have a block like this:

for(String x : new String[] { "a","b","c" }) {
  System.out.println("x="+x);
}

In SAS, this is simply

do x="a","b","c";
  put x=;
end;

NC Democrat Party refers to people as “Tea Baggers”

You keep using that word. I do not think it means what you think it means.

From: David Young, Chairman, NC Democrats <headquarters@ncdp.org>
Sent: Friday, September 18, 2009 9:46 AM
Subject: Democrats Counter Tea Baggers

Democratic strength continues to grow

Democratic leaders from across the country met last week in Austin, TX for the Democratic National Committee’s Fall Meeting.

In addition to electing a slate of organizational and caucus officers, Democrats learned about the rebirth and growth of the Democratic Party in Texas and the strength of our gubernatorial candidates in New Jersey and Virginia.

North Carolina continues to lead the way with DNC Member Susan Burgess’ pitch to host the 2012 Democratic National Convention in Charlotte, where she serves as Mayor Pro-Tem.

Joyce Brayboy and Ed Turlington were elected at-large DNC members. Turlington and Everett Ward were elected to the Resolutions and Credentials committees, respectively, with Ward elected as co-chairman of his committee.

David Parker was also elected Treasurer of the Southern Caucus. Spanning states from Virginia to Texas, the caucus discusses regional issues and concerns. DNC Chairman Tim Kaine said the organization will continue to focus its efforts on winning elections in the South.

Removing formats from all variables in a SAS Dataset

In SAS, every field/variable in a table/dataset can be given a format. This format tells SAS how to display the data. The following datastep will create a table called “formatted” having 1 row containing 3 variables: x, y, and d.

data formatted;
  x=9000;
  y=42;
  now=16761;
  format x comma6.;
  format y dollar5.;
  format now date7.;
  put x=;
  put y=;
  put now=;
run;

Additionally it prints to the log the formatted values.

x=9,000
y=$42
now=21NOV05

In SAS you can also create your own formats, and you can assign these formats to whichever datasets you like. However if that format goes away for some reason and you try to look at the data again, you will get this error:

ERROR: Informat $YOURFMT not found or couldn't be loaded for variable YOURVAR.

You can get around this by turning off this option

options nofmterr;

Then your ERROR turns into NOTE, and things work normally, except when you try to view the data you see it unformatted. This could be useful, however, if you find yourself in a situation like this:

data _null_;
  a=20.0000001;
  b=20.000001;
  format a dollar4.2;
  format b dollar4.2;
  put a= b=;
  if(a=b) then put 'matches';
   else put 'no match';
run;

Runs and outputs:

a=20.0 b=20.0
no match

At any rate, there are times in SAS when you simply want to remove all the formats from a dataset. This can be done in one line in a datastep:

data unformatted;
  set formatted;
  format _all_;
run;

Tablesaw that Won’t Cut Fingers

This is one of the greatest inventions I’ve seen in a long time. Using a charge in the saw blade, if it detects capacitance from human flesh it immediately halts and retracts the spinning blade. When pushing wood through at any reasonable speed, it’s supposed to stop before the blade cuts any deeper than 1/16th an inch. Probably does irreparable damage to the saw’s blade, but at least you keep your fingers.

Not that I have any use for a table saw, but if I did, I’d buy this one.

AT&T U-Verse DNS Loopback workaround for 3800HGV-B

AT&T’s U-Verse network just came up on my street.

So… since the gateway doubles as a router & switch, I was trying to set it up so my httpd server was accessible from the outside. Setup the port forwarding through the firewall just fine, and indeed everything works for the rest of the internet. All of you should be able to get to shatner.philihp.com.

However I can’t. From what I gather, the gateway doesn’t support DNS loopback. So when I resolve shatner.philihp.com, it resolves to my WAN IP, then I try to connect to it and I timeout. It’s similar to the problem this guy had with his FTP server.

Eric DeVaudreuil suggested that if I could get the DNS on the gateway to resolve the domain differently for LAN users, it would work. And it looks like from the manual, that there used to be a quazi-secret management console that had a “DNS Resolve” page for doing just this. However a recent firmware upgrade totally removed this in order to fix some XSS attack. So that’s out of the picture.

this guy got around it by setting up a router behind his gateway… I suppose I could do this, but it seems like overkill for something that should just work.

Someone at Experts-Exchange says it can’t be done… but over there they use a sort of “democratic truth” similar to the abomination that is Yahoo Answers; so I wouldn’t believe much I see there.

Right now, I have this working solution/workaround. Admittedly it doesn’t scale well, but this isn’t an issue when my network is under half a dozen devices. My solution is just to add in lines to my /etc/hosts file for all of my machines to make the domain resolve to the LAN IP of the HTTP server, rather than the WAN IP. The only drawback/annoyance to this is with my laptop which constantly goes in and out of the local network.