Count Frequency Of Values In A Column Using Apache Pig
Sometimes you need to count how many times each value occurs in a field. Let this be the sample input:
user_id course_name user_name
1 Social Anju
2 Maths Malu
1 English Anju
1 Maths Anju
Say we need to count the number of occurrences of each user_name. The expected output is:
Anju 3
Malu 1
To achieve this, the built-in COUNT function can be used.
COUNT Function in Apache Pig
The COUNT function computes the number of elements in a bag.
COUNT requires a preceding GROUP BY statement for per-group counts and a GROUP ALL statement for global counts.
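As a minimal sketch of the global-count case, the snippet below groups the whole relation with GROUP ALL and counts its tuples. The aliases allRows and total are illustrative names, and the input path and schema are assumed to be the same as in count.pig.

```pig
-- Sketch only: assumes the same input file and schema as count.pig
userAlias = LOAD '/home/sreeveni/myfiles/pig/count.txt' AS
    (user_id:long, course_name:chararray, user_name:chararray);
-- GROUP ALL puts every tuple into a single group
allRows = GROUP userAlias ALL;
-- COUNT then returns the number of tuples in that single bag
total = FOREACH allRows GENERATE COUNT(userAlias) AS total_rows;
DUMP total;
```

For the four sample rows, assuming the file contains no header line and every user_id parses as a long, this should print a single tuple holding 4.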
For the sample input, the basic idea is to group by user_name and count the tuples in each group's bag.
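--count.pig
-- Load the input; the default loader (PigStorage) expects tab-separated fields.
userAlias = LOAD '/home/sreeveni/myfiles/pig/count.txt' AS
(user_id:long,course_name:chararray,user_name:chararray);
-- Group the rows by user_name; each group holds a bag of the matching tuples.
groupedByUser = GROUP userAlias BY user_name;
-- Count the tuples in each group's bag.
counted = FOREACH groupedByUser GENERATE group AS user_name,COUNT(userAlias) AS cnt;
result = FOREACH counted GENERATE user_name, cnt;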
STORE result INTO '/home/sreeveni/myfiles/pig/OUT/count';
The COUNT function ignores NULLs: a tuple in the bag is not counted if its first field is NULL.
COUNT_STAR can be used instead to count all tuples in the bag, including those containing NULL values.
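To make the difference visible, here is a hedged sketch that computes both counts side by side for each user_name. The alias bothCounts and the field names non_null_cnt and all_cnt are illustrative, and the input is assumed to be the same file and schema as in count.pig.

```pig
-- Sketch only: assumes the same input file and schema as count.pig
userAlias = LOAD '/home/sreeveni/myfiles/pig/count.txt' AS
    (user_id:long, course_name:chararray, user_name:chararray);
groupedByUser = GROUP userAlias BY user_name;
bothCounts = FOREACH groupedByUser GENERATE
    group AS user_name,
    COUNT(userAlias) AS non_null_cnt,   -- skips tuples whose first field (user_id) is NULL
    COUNT_STAR(userAlias) AS all_cnt;   -- counts every tuple in the bag
DUMP bothCounts;
```

The two columns differ only for groups that contain tuples with a NULL user_id, for example rows where that field is empty or cannot be parsed as a long.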