I have a list of people, and am trying to pull out those people that come up more frequently than average (we'll start with 3 or more times). Simplified data set looks like this:
| Person_ID | other value |
|---|---|
| 1 | irrelevantvalue |
| 1 | irrelevant_value |
| 1 | irrelevant_value |
| 1 | irrelevant_value |
| 2 | irrelevant_value |
| 2 | irrelevant_value |
| 3 | irrelevant_value |
| 4 | irrelevant_value |
| 4 | irrelevant_value |
| 4 | irrelevant_value |
The output table would be the following, keeping only the values where this person appeared 3 or more times:
| Person_ID | other value |
|---|---|
| 1 | irrelevantvalue |
| 1 | irrelevant_value |
| 1 | irrelevant_value |
| 1 | irrelevant_value |
| 4 | irrelevant_value |
| 4 | irrelevant_value |
| 4 | irrelevant_value |
Thoughts on how to proceed here? Using SAS EG
CodePudding user response:
You can do this with SQL. Filter only to irrelevant values with a where statement, then count all groups where the total number of irrelevant values is >= 3.
Example data:
data have;
length group 8. value $15.;
input group value$;
datalines;
1 relevant
1 relevant
1 irrelevant
1 irrelevant
1 irrelevant
2 irrelevant
2 relevant
2 relevant
3 irrelevant
3 relevant
3 relevant
3 irrelevant
3 irrelevant
4 irrelevant
4 irrelevant
4 irrelevant
;
run;
Code:
proc sql;
create table irrelevant_values as
select *
from have
where value = 'irrelevant'
group by id
having count(*) GE 3
;
quit;
Output:
group value
1 irrelevant
1 irrelevant
1 irrelevant
3 irrelevant
3 irrelevant
3 irrelevant
4 irrelevant
4 irrelevant
4 irrelevant
CodePudding user response:
You can select the ids having a frequency greater than the average frequency by using sub-queries.
Example:
data have(keep=id rownum);
input id times;
do _n_ = 1 to times; rownum 1; output; end;
datalines;
1 4
2 2
3 1
4 3
;
proc sql;
create table want as
select
have.id, rownum
from
have
, (select mean(freq) as avg_freq
from (select id, count(*) as freq from have group id)
)
group
have.id
having
count(*) > avg_freq
;
