# Finding the Center of US Counties in SAS

I had a problem where I needed the center longitude and latitude of each US county. SAS comes with some datasets containing map data in the format of tables that trace a line around the borders of counties. I was interested in the MAPSGFK.US_COUNTIES dataset, which has data that looks like this:

The State and County numbers were
FIPS codes, which
were also stored in the US_COUNTIES_ATTR table, so those are easy enough to
figure out. Segment *seemed* to be always 1. Resolution and Density were
numbers that I figured had to do with how much detail the drawing would have
if these points were used in a line drawing.

The longitude and latitude here are all points along the border. The simplest, and most incorrect way of getting the center of the county would be to take an average of them, which would be fine for rectangular counties, but for territories where one side is flat (and has few waypoints) and one side is jagged (such as a coastal border), the midpoint will be weighted to that side. So instead, use geometry.

In geometry, there’s a formula for the Centroid of a polygon. This is a magic formula that I just trust is correct, because everything on Wikipedia is true, without exception. Here was my first pass at an implementation of it in SAS:

```
proc sort data=mapsgfk.us_counties out=centroids;
by state county;
run;
data centroids(keep=cx cy county state);
retain yi yj xi xj a cx cy x0 y0 0;
set centroids(keep=state county lat long rename=(lat=yj long=xj));
by state county;
if(first.county) then do;
cx = 0;
cy = 0;
a = 0;
x0 = xj;
y0 = yj;
end;
else if(not first.county) then do;
ta = (xi*yj - xj*yi);
cx + ((xi+xj)*ta);
cy + ((yi+yj)*ta);
a + ta;
end;
if(last.county) then do;
ta = (xj*y0 - x0*yj);
cx + ((xj+x0)*ta);
cy + ((yj+y0)*ta);
a = ta + a * 0.5;
cx = cx / (6*a);
cy = cy / (6*a);
output;
end;
xi = xj;
yi = yj;
run;
```

Here, *xi* and *yi* are always the coordinates from the previous
point, and “`if not first.county`

” prevents processing of the first
point of a county because it didn’t have a previous point. The variables
*a*, *cx*, and *cy* accumulate for every point. The variable
*ta* is the area of the rectangle defined by the two points. Once all the
points have been accumulated, the area is halfed so it’s the actual area, and
the area is used in calculating the centroid (*cx* and *cy*).

But then I noticed the 2nd county was positioned somewhere in the mid-Atlantic. Something had to be up. Looking closer, this was Baldwin County, which had an island. Looking at the source data, this segment was drawn in two segments, which caused the centroid formula to choke, as it assumed contiguous shapes. So instead, in my second pass, I compute the centroid of every segment, then average them weighted by their areas.

```
proc sort data=mapsgfk.us_counties out=centroids;
by state county segment;
run;
data centroids_temp;
retain yi yj xi xj a cx cy x0 y0 0;
set centroids(keep=state county segment lat long rename=(lat=yj long=xj));
by state county segment;
if(first.segment) then do;
cx = 0;
cy = 0;
a = 0;
x0 = xj;
y0 = yj;
end;
else if(not first.segment) then do;
ta = (xi*yj - xj*yi);
cx + ((xi+xj)*ta);
cy + ((yi+yj)*ta);
a + ta;
end;
if(last.segment) then do;
ta = (xj*y0 - x0*yj);
cx + ((xj+x0)*ta);
cy + ((yj+y0)*ta);
a = ta + a * 0.5;
cx = cx / (6*a);
cy = cy / (6*a);
output;
end;
xi = xj;
yi = yj;
run;
proc sql;
create table centroid_weight as
select
state, county, sum(a) as sum
from centroids_temp
group by state, county;
quit;
proc sql;
create table centroids as
select a.state, a.county,
sum(cx*(a / sum)) as lat,
sum(cy*(a / sum)) as long
from centroids_temp a
inner join centroid_weight b
on (a.state = b.state and a.county = b.county)
group by a.state, a.county;
quit;
proc sql;
drop table centroids_temp;
drop table centroid_weight;
quit;
```

Here, I add two more steps. The first proc SQL block sums up the total area of each county, which is used in the second block to find the average centroid of each county’s centroids, weighted by the total area that centroid represents, which gives us correct centroids for counties drawn in multiple segments.