I am using Snowflake for this SQL question if there are any unique functions I can use, please help me out!
I have a data set with unique ids, other attributes that aren’t important, and then a list of categories (~22) each Unique id could fall into (denoted by a 1 if it’s in the category and 0 if not.)
I am trying to figure out how to write something where I could see if across all the categories, if a category was removed if any of the unique ids would then be left without any category and count how many unique ids would then do total how many ids would be left category less.
Example below for unique id Jshshsv it is only in CatAA but id Hairbdb is in CatY and CatAA. If CatAA was dropped, how many Ids would be left with no category?
| UniqueID | Sum across Categories | CatX | CatY | CatZ | CatAA |
|---|---|---|---|---|---|
| Hairbdb | 2 | 0 | 1 | 0 | 1 |
| Jshshsv | 1 | 0 | 0 | 0 | 1 |
For some reason I just cannot figure out how to do this in a manageable way in sql with so many category buckets. Any tips or things to try would be appreciated.
CodePudding user response:
So if you data is in the form of pairs, with repeats ID|Cat --|-- Hairbdb|CatY Hairbdb|CatY Hairbdb|CatAA Jshshsv|CatAA Jshshsv|CatAA
The follow SQL can be used to find the Catagories are the single match for an ID.
WITH data AS (
SELECT * FROM VALUES
('Hairbdb','CatY'),
('Hairbdb','CatAA'),
('Jshshsv','CatAA')
v(id, cat)
), dist_data AS (
SELECT DISTINCT id, cat FROM data
), cat_counts AS (
SELECT id, count(distinct cat) c_cat
FROM data
GROUP BY 1
HAVING c_cat = 1
)
SELECT a.cat, a.id
FROM dist_data AS a
JOIN cat_counts AS b
ON b.id = a.id;
This works because you first count per id, how many categories the id is in, then you join the distinct data with those where the id is only in one cat, will give you id & cat
| CAT | ID |
|---|---|
| CatAA | Jshshsv |
IF you data is in a wide format (like how you present it), you can turn it into my form via UNPIVOT like so:
WITH data AS (
SELECT * FROM VALUES
('Hairbdb',0,1,0,1),
('Jshshsv',0,0,0,1)
v(id, catx, caty, catz, cataa )
)
SELECT id, cat from data unpivot(catv for cat in (catx, caty, catz, cataa))
WHERE catv = 1;
giving:
| ID | CAT |
|---|---|
| Hairbdb | CATY |
| Hairbdb | CATAA |
| Jshshsv | CATAA |
But if it's in your form with duplicates removed you could just use a WHERE clause:
WITH data AS (
SELECT * from values
('Hairbdb', 0, 1, 0, 1),
('Jshshsv', 0, 0, 0, 1)
v(UniqueID, CatX,CatY,CatZ, CatAA)
)
SELECT UniqueID,
CatX CatY CatZ CatAA as "Sum across Categories",
CatX,
CatY,
CatZ,
CatAA
FROM data
WHERE "Sum across Categories" = 1;
So another variation, if you have many rows, per id, and the category allocation is not the same across the set, you can use a COUNT_IF and the greater than 0 test to turn the data into a in any, then using a HAVING clause to filter out those that are are in many columns
WITH data AS (
SELECT * FROM VALUES
('Hairbdb',0,1,0,1),
('Hairbdb',1,1,0,1),
('Hairbdb',0,1,1,1),
('Jshshsv',0,0,0,1),
('Jshshsv',0,0,0,1)
v(id, catx, caty, catz, cataa )
)
SELECT id,
COUNT_IF(catx=1)>0 AS catx_a,
COUNT_IF(caty=1)>0 AS caty_a,
COUNT_IF(catz=1)>0 AS catz_a,
COUNT_IF(cataa=1)>0 AS cataa_a
FROM data
GROUP BY 1
HAVING catx_a::int caty_a::int catz_a::int cataa_a::int = 1;
CodePudding user response:
if you are storing the categories in columns (though not a good design) you could try this.
SELECT UniqueID , sum(CatX CatY CatZ CatAA) over (partition by UniqueID) as "Sum across Categories",
CatX, CatY, CatZ, CatAA FROM (
SELECT 'Hairbdb' as UniqueID, 0 as CatX, 1 as CatY, 0 as CatZ, 1 as CatAA from dual
UNION ALL
SELECT 'Jshshsv', 0,0,0,1 from dual
);
