Data Recipes: Structural Similarity With Apache Pig

Tuesday, May 3, 2011

Structural Similarity With Apache Pig

A while ago I posted this about computing the jaccard similarity (otherwise known as the structural similarity) of nodes in a network graph. Looking back on it over the past few days I realize there are some areas for serious improvement. Actually, it's just completely wrong. The approach is broken. So, I'm going to revisit the algorithm again, in more detail, and write a vastly improved Pig script for computing the structural similarity of vertices in a network graph.

The graph

Of course, before you can do a whole lot of anything, you've got to have a graph to work with. To keep things dead simple (and breaking from what I normally do) I'm going to draw an arbitrary graph (and by draw I mean draw). Here it is:

This can be represented as a tab-separated-values set of adjacency pairs in a file called 'graph.tsv':

$: cat graph.tsv
A C
B C
C D
C E
E F
C H
G H
C G
D E
A G
B G

I don't think it gets any simpler.

The Measure

Now, the idea here is to compute the similarity between nodes, eg similarity(A,B). There's already a well defined measure for this called the jaccard similarity. The key take away is to notice that:

The Pig

This can be broken into a set of Pig operations quite easily actually:

load data

Remember, I saved the data into a file called 'graph.tsv'. So:


edges     = LOAD 'graph.tsv' AS (v1:chararray, v2:chararray);
edges_dup = LOAD 'graph.tsv' AS (v1:chararray, v2:chararray);

It is necessary, but still a hack, to load the data twice at the moment. This is because self-intersections (which I'll be doing in a moment) don't work with Pig 0.8 at the moment.

augment with sizes

Now I need the 'size' of each of the sets. In this case the 'sets' are the list of nodes each node links to. So, all I really have to do is calculate the number of outgoing links:


grouped_edges = GROUP edges BY v1;
aug_edges     = FOREACH grouped_edges GENERATE FLATTEN(edges) AS (v1, v2), COUNT(edges) AS v1_out;

grouped_dups  = GROUP edges_dup BY v1;
aug_dups      = FOREACH grouped_dups GENERATE FLATTEN(edges_dup) AS (v1, v2), COUNT(edges_dup) AS v1_out;

Again, if self-intersections worked, the operation of counting the outgoing links would only have to happen once. Now I've got a handle on the |A| and |B| parts of the equation above.

intersection

Next I'm going to compute the intersection using a Pig join:


edges_joined  = JOIN aug_edges BY v2, aug_dups BY v2;
intersection  = FOREACH edges_joined {
                  --
                  -- results in:
                  -- (X, Y, |X| + |Y|)
                  -- 
                  added_size = aug_edges::v1_out + aug_dups::v1_out;
                  GENERATE
                    aug_edges::v1 AS v1,
                    aug_dups::v1  AS v2,
                    added_size    AS added_size
                  ;
                };

Notice I'm adding the individual set sizes. This is to come up the |A| + |B| portion of the denominator in the jaccard index.

intersection sizes

In order to compute the size of the intersection I've got to use a Pig GROUP and collect all the elements that matched on the JOIN:


intersect_grp   = GROUP intersection BY (v1, v2);
intersect_sizes = FOREACH intersect_grp {
                    --
                    -- results in:
                    -- (X, Y, |X /\ Y|, |X| + |Y|)
                    --
                    intersection_size = (double)COUNT(intersection);
                    GENERATE
                      FLATTEN(group)               AS (v1, v2),
                      intersection_size            AS intersection_size,
                      MAX(intersection.added_size) AS added_size -- hack, we only need this one time
                    ;
                  };

There's a hack in there. The reason is that 'intersection.added_size' is a Pig data bag containing some number of identical tuples (equal to the intersection size). Each of these tuples is the 'added_size' from the previous step. Using MAX is just a convenient way of pulling out only one of them.

similarity

And finally, I have all the pieces in place to compute the similarities:


similarities = FOREACH intersect_sizes {
                 --
                 -- results in:
                 -- (X, Y, |X /\ Y|/|X U Y|)
                 --
                 similarity = (double)intersection_size/((double)added_size-(double)intersection_size);
                 GENERATE
                   v1         AS v1,
                   v2         AS v2,
                   similarity AS similarity
                 ;
               };

That'll do it. Here's the full script for completeness:


edges     = LOAD '$GRAPH' AS (v1:chararray, v2:chararray);
edges_dup = LOAD '$GRAPH' AS (v1:chararray, v2:chararray);

--
-- Augment the edges with the sizes of their outgoing adjacency lists. Note that
-- if a self join was possible we would only have to do this once.
--
grouped_edges = GROUP edges BY v1;
aug_edges     = FOREACH grouped_edges GENERATE FLATTEN(edges) AS (v1, v2), COUNT(edges) AS v1_out;

grouped_dups  = GROUP edges_dup BY v1;
aug_dups      = FOREACH grouped_dups GENERATE FLATTEN(edges_dup) AS (v1, v2), COUNT(edges_dup) AS v1_out;

--
-- Compute the sizes of the intersections of outgoing adjacency lists
--
edges_joined  = JOIN aug_edges BY v2, aug_dups BY v2;
intersection  = FOREACH edges_joined {
                  --
                  -- results in:
                  -- (X, Y, |X| + |Y|)
                  -- 
                  added_size = aug_edges::v1_out + aug_dups::v1_out;
                  GENERATE
                    aug_edges::v1 AS v1,
                    aug_dups::v1  AS v2,
                    added_size    AS added_size
                  ;
                };

intersect_grp   = GROUP intersection BY (v1, v2);
intersect_sizes = FOREACH intersect_grp {
                    --
                    -- results in:
                    -- (X, Y, |X /\ Y|, |X| + |Y|)
                    --
                    intersection_size = (double)COUNT(intersection);
                    GENERATE
                      FLATTEN(group)               AS (v1, v2),
                      intersection_size            AS intersection_size,
                      MAX(intersection.added_size) AS added_size -- hack, we only need this one time
                    ;
                  };

similarities = FOREACH intersect_sizes {
                 --
                 -- results in:
                 -- (X, Y, |X /\ Y|/|X U Y|)
                 --
                 similarity = (double)intersection_size/((double)added_size-(double)intersection_size);
                 GENERATE
                   v1         AS v1,
                   v2         AS v2,
                   similarity AS similarity
                 ;
               };

DUMP similarities;

Results

Run it:


$: pig -x local jaccard.pig
...snip...
(A,A,1.0)
(A,B,1.0)
(A,C,0.2)
(B,A,1.0)
(B,B,1.0)
(B,C,0.2)
(C,A,0.2)
(C,B,0.2)
(C,C,1.0)
(C,D,0.25)
(C,G,0.25)
(D,C,0.25)
(D,D,1.0)
(E,E,1.0)
(G,C,0.25)
(G,G,1.0)

And it's done! Hurray. Now, go make a recommender system or something.

13 comments:

UnknownOctober 2, 2011 at 4:14 AM
To get around the hack couldn't you use 'added_size' in the group?

> GROUP intersection BY (v1, v2, added_size);

Since it has a constant value for all occurrences of a given (v1, v2) pair this should give you the same grouping as on (v1,v2) but now you don't have to pull the size out of the bag.
ReplyDelete
Replies
Andrew CleggJanuary 8, 2012 at 8:11 AM
Many thanks for this. I'm doing something similar for a tutorial I'm writing for the Hadoop UK Users' Group, and it would have taken much longer without your example to follow.

One suggestion -- maybe it would be a bit faster to load the edges file once, do the degree counting once, and then do a

edges_dup = foreach edges generate *;

Re. Kurt's comment, I once wrote a "First" UDF for exactly these cases -- where you know all the elements in the bag are the same, so finding their MAX (or copying into the grouping key) is just waited cycles...
ReplyDelete
Replies
Andrew CleggJanuary 8, 2012 at 8:32 AM
Hah, not just me:

https://github.com/linkedin/datafu/blob/master/src/java/datafu/pig/bags/FirstTupleFromBag.java
ReplyDelete
Replies
UnknownSeptember 4, 2018 at 12:08 AM
nice article
data science training in bangalore
hadoop training in bangalore
python online training
ReplyDelete
Replies
lavanyaJuly 28, 2020 at 12:32 PM
I must thank you for the efforts you have put in penning this site. I am hoping to check out the same high-grade content by you later on as well. Keep up the good workI am really enjoyed a lot when reading your well-written posts. It shows like you spend more effort and time to write this blog. I have saved it for my future reference. Keep it up the good work.Java training in Chennai

Java Online training in Chennai

Java Course in Chennai

Best JAVA Training Institutes in Chennai

Java training in Bangalore

Java training in Hyderabad

Java Training in Coimbatore

Java Training

Java Online Training
ReplyDelete
Replies
aravindAugust 8, 2020 at 2:23 AM
Good job in presenting the correct content with the clear explanation. The content looks real with valid information. Good Work, Thank you for sharing such detailed article. I am learning a lot from you.
DevOps Training in Chennai

DevOps Online Training in Chennai

DevOps Training in Bangalore

DevOps Training in Hyderabad

DevOps Training in Coimbatore

DevOps Training

DevOps Online Training
ReplyDelete
Replies
RashikaAugust 11, 2020 at 10:22 PM
I am feeling grateful to read this.you gave a nice info for us.please update more.

Digital Marketing Training in Chennai

Digital Marketing Course in Chennai

SEO Training in Chennai

Digital Marketing Training in Bangalore

Digital Marketing Training in Hyderabad

Digital Marketing Training in Coimbatore

Digital Marketing Training

Digital Marketing Course

Digital Marketing Online Training

ReplyDelete
Replies
deviAugust 14, 2020 at 10:08 AM
Very nice post here and thanks for it .I always like and such a super contents of these post.Excellent and very cool idea and great content of different kinds of the valuable information's.
Data Science Training In Chennai

Data Science Online Training In Chennai

Data Science Training In Bangalore

Data Science Training In Hyderabad

Data Science Training In Coimbatore

Data Science Training

Data Science Online Training
ReplyDelete
Replies
vivekvedhaAugust 26, 2020 at 12:26 AM
Great site and a great topic as well I really get amazed to read this.This is incredible,I feel really happy to have seen your webpage.
acte reviews

acte velachery reviews

acte tambaram reviews

acte anna nagar reviews

acte porur reviews

acte omr reviews

acte chennai reviews

acte student reviews

ReplyDelete
Replies
vivekvedhaAugust 26, 2020 at 12:28 AM
Great and Useful Article.
acte reviews

acte velachery reviews

acte tambaram reviews

acte anna nagar reviews

acte porur reviews

acte omr reviews

acte chennai reviews

acte student reviews

ReplyDelete
Replies
radhikaAugust 26, 2020 at 1:02 AM
I just got to this amazing site not long ago. I was actually captured with the piece of resources you have got here. Big thumbs up for making such wonderful blog page!

AWS Course in Bangalore

AWS Course in Hyderabad

AWS Course in Coimbatore

AWS Course

AWS Certification Course

AWS Certification Training

AWS Online Training

AWS Training

ReplyDelete
Replies
Silent GirlOctober 25, 2022 at 9:22 AM
Flvto is superb Reddit video downloader. It helps convert Reddit videos to mp3, and of course video file download includes audio track. Flvto Youtube Downloader App
ReplyDelete
Replies
haseebOctober 25, 2022 at 11:30 AM
Tally ERP 9 Crack 2022 With Latest Version Full Download. This is easy software which is perfect for all users involved in improved business .Tally GST Crack
ReplyDelete
Replies

Add comment

mathjax