Monday, 17 April 2017

Count Frequency Of Values In A Column Using Apache Pig



There may be situations where you need to count the occurrences of each value in a field.
Let this be the sample input bag.


user_id course_name user_name
1 Social Anju
2 Maths Malu
1 English Anju
1 Maths Anju

Say we need to count the number of occurrences of each user_name. The expected output is:
Anju 3
Malu 1

To achieve this, the COUNT built-in function can be used.


COUNT Function in Apache Pig


The COUNT function computes the number of elements in a bag.
For per-group counts, COUNT requires a preceding GROUP BY statement; for a global count, it requires a GROUP ALL statement.
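The global-count case can be sketched as below. This is a minimal sketch that reuses the input path and schema from the script in this post; the alias names allUsers and total are made up for illustration.

```pig
-- Same input file and schema as the count.pig script in this post.
userAlias = LOAD '/home/sreeveni/myfiles/pig/count.txt' AS
(user_id:long, course_name:chararray, user_name:chararray);
-- GROUP ... ALL collects every tuple into a single group.
allUsers = GROUP userAlias ALL;
-- COUNT over that single bag yields one global count.
total = FOREACH allUsers GENERATE COUNT(userAlias) AS cnt;
DUMP total;
```

DUMP prints the result to the console, which is handy for a quick check before switching to STORE.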

The basic idea for the example above is to group by user_name and count the tuples in each group's bag.


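-- count.pig

-- LOAD splits fields on tabs by default; add USING PigStorage(' ') if the file is space-separated.
userAlias = LOAD '/home/sreeveni/myfiles/pig/count.txt' AS
(user_id:long,course_name:chararray,user_name:chararray);
-- One group per distinct user_name; each group holds a bag of matching tuples.
groupedByUser = GROUP userAlias BY user_name;
-- Count the tuples in each group's bag.
counted = FOREACH groupedByUser GENERATE group AS user_name, COUNT(userAlias) AS cnt;
result = FOREACH counted GENERATE user_name, cnt;
STORE result INTO '/home/sreeveni/myfiles/pig/OUT/count';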

The COUNT function ignores NULLs: a tuple in the bag is not counted if its first field is NULL.
COUNT_STAR can be used to count all tuples, including those with NULL values.
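To see the difference, both functions can be applied to the same grouped bag. This is a sketch under the same input path and schema as above; the alias compared and the field names cnt_non_null and cnt_all are made up for illustration.

```pig
userAlias = LOAD '/home/sreeveni/myfiles/pig/count.txt' AS
(user_id:long, course_name:chararray, user_name:chararray);
groupedByUser = GROUP userAlias BY user_name;
-- COUNT skips tuples whose first field (user_id here) is NULL;
-- COUNT_STAR counts every tuple in the bag.
compared = FOREACH groupedByUser GENERATE group AS user_name,
    COUNT(userAlias) AS cnt_non_null,
    COUNT_STAR(userAlias) AS cnt_all;
DUMP compared;
```

The two counts differ only for groups that contain tuples with a NULL user_id, for example rows where the value could not be cast to long.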



