From one of my data sources, Segment, gives me event data with a GUID/UUID that looked like this when I queried it
Querying this table was slow, resulting in dashboard views that took 60+ seconds to load, and I wanted a quick win to reduce the size of it. Interestingly, the casing wasn’t consistent, but that shouldn’t matter.
CREATE TABLE tmp_raw_table SELECT id as event_id, received_at FROM segment.event LIMIT 100000; CREATE TABLE tmp_raw_uuid SELECT id::uuid as event_id, received_at FROM segment.event LIMIT 10000000;
SELECT pg_size_pretty( pg_total_relation_size('tmp_raw_table') ); SELECT pg_size_pretty( pg_total_relation_size('tmp_uuid_table') );
Running this to test my syntax, and the reported size of the differences in tables goes from 7488 kB to 5096 kB. Not bad for something as low-effort as coercing a type, so I tried to ship it to the full table, and then it failed…
ERROR: invalid input syntax for type uuid: "0qREGUQhbFNoRd-aeFacw"
In my experience, halting functions like these are nasty when used in a production ETL flow. Imagine if a malformed ID like that were introduced months later; now the flow is broken until developer time can be allocated to fix it. Gross. In some situations you want your code to fail-fast, but I just wanted a damned UUID. Or maybe null would have been acceptable. This post had a good snippet, so I turned that into a database function.
CREATE OR REPLACE FUNCTION to_uuid(raw text) RETURNS uuid IMMUTABLE STRICT AS $$ BEGIN RETURN raw::uuid; EXCEPTION WHEN invalid_text_representation THEN RETURN uuid_in(overlay(overlay(md5(raw) placing '4' from 13) placing '8' from 17)::cstring); END; $$ LANGUAGE plpgsql;
There are a lot of good reasons to avoid database functions, but I think as long as they’re very concise, it shouldn’t be any more annoying than using built-in functions. I added that
IMMUTABLE bit to make sure it’s deterministic and fast/safe to use in indexes.
SELECT to_uuid(id) as event_id, received_at FROM segment.event;