mathjax

Tuesday, May 3, 2011

Structural Similarity With Apache Pig

A while ago I posted this about computing the jaccard similarity (otherwise known as the structural similarity) of nodes in a network graph. Looking back on it over the past few days I realize there are some areas for serious improvement. Actually, it's just completely wrong. The approach is broken. So, I'm going to revisit the algorithm again, in more detail, and write a vastly improved Pig script for computing the structural similarity of vertices in a network graph.

The graph


Of course, before you can do a whole lot of anything, you've got to have a graph to work with. To keep things dead simple (and breaking from what I normally do) I'm going to draw an arbitrary graph (and by draw I mean draw). Here it is:

This can be represented as a tab-separated-values set of adjacency pairs in a file called 'graph.tsv':

$: cat graph.tsv
A C
B C
C D
C E
E F
C H
G H
C G
D E
A G
B G


I don't think it gets any simpler.

The Measure


Now, the idea here is to compute the similarity between nodes, eg similarity(A,B). There's already a well defined measure for this called the jaccard similarity. The key take away is to notice that:



The Pig


This can be broken into a set of Pig operations quite easily actually:

load data


Remember, I saved the data into a file called 'graph.tsv'. So:


edges = LOAD 'graph.tsv' AS (v1:chararray, v2:chararray);
edges_dup = LOAD 'graph.tsv' AS (v1:chararray, v2:chararray);


It is necessary, but still a hack, to load the data twice at the moment. This is because self-intersections (which I'll be doing in a moment) don't work with Pig 0.8 at the moment.

augment with sizes


Now I need the 'size' of each of the sets. In this case the 'sets' are the list of nodes each node links to. So, all I really have to do is calculate the number of outgoing links:


grouped_edges = GROUP edges BY v1;
aug_edges = FOREACH grouped_edges GENERATE FLATTEN(edges) AS (v1, v2), COUNT(edges) AS v1_out;

grouped_dups = GROUP edges_dup BY v1;
aug_dups = FOREACH grouped_dups GENERATE FLATTEN(edges_dup) AS (v1, v2), COUNT(edges_dup) AS v1_out;


Again, if self-intersections worked, the operation of counting the outgoing links would only have to happen once. Now I've got a handle on the |A| and |B| parts of the equation above.

intersection


Next I'm going to compute the intersection using a Pig join:


edges_joined = JOIN aug_edges BY v2, aug_dups BY v2;
intersection = FOREACH edges_joined {
--
-- results in:
-- (X, Y, |X| + |Y|)
--
added_size = aug_edges::v1_out + aug_dups::v1_out;
GENERATE
aug_edges::v1 AS v1,
aug_dups::v1 AS v2,
added_size AS added_size
;
};



Notice I'm adding the individual set sizes. This is to come up the |A| + |B| portion of the denominator in the jaccard index.

intersection sizes


In order to compute the size of the intersection I've got to use a Pig GROUP and collect all the elements that matched on the JOIN:


intersect_grp = GROUP intersection BY (v1, v2);
intersect_sizes = FOREACH intersect_grp {
--
-- results in:
-- (X, Y, |X /\ Y|, |X| + |Y|)
--
intersection_size = (double)COUNT(intersection);
GENERATE
FLATTEN(group) AS (v1, v2),
intersection_size AS intersection_size,
MAX(intersection.added_size) AS added_size -- hack, we only need this one time
;
};


There's a hack in there. The reason is that 'intersection.added_size' is a Pig data bag containing some number of identical tuples (equal to the intersection size). Each of these tuples is the 'added_size' from the previous step. Using MAX is just a convenient way of pulling out only one of them.

similarity


And finally, I have all the pieces in place to compute the similarities:


similarities = FOREACH intersect_sizes {
--
-- results in:
-- (X, Y, |X /\ Y|/|X U Y|)
--
similarity = (double)intersection_size/((double)added_size-(double)intersection_size);
GENERATE
v1 AS v1,
v2 AS v2,
similarity AS similarity
;
};


That'll do it. Here's the full script for completeness:


edges = LOAD '$GRAPH' AS (v1:chararray, v2:chararray);
edges_dup = LOAD '$GRAPH' AS (v1:chararray, v2:chararray);

--
-- Augment the edges with the sizes of their outgoing adjacency lists. Note that
-- if a self join was possible we would only have to do this once.
--
grouped_edges = GROUP edges BY v1;
aug_edges = FOREACH grouped_edges GENERATE FLATTEN(edges) AS (v1, v2), COUNT(edges) AS v1_out;

grouped_dups = GROUP edges_dup BY v1;
aug_dups = FOREACH grouped_dups GENERATE FLATTEN(edges_dup) AS (v1, v2), COUNT(edges_dup) AS v1_out;

--
-- Compute the sizes of the intersections of outgoing adjacency lists
--
edges_joined = JOIN aug_edges BY v2, aug_dups BY v2;
intersection = FOREACH edges_joined {
--
-- results in:
-- (X, Y, |X| + |Y|)
--
added_size = aug_edges::v1_out + aug_dups::v1_out;
GENERATE
aug_edges::v1 AS v1,
aug_dups::v1 AS v2,
added_size AS added_size
;
};

intersect_grp = GROUP intersection BY (v1, v2);
intersect_sizes = FOREACH intersect_grp {
--
-- results in:
-- (X, Y, |X /\ Y|, |X| + |Y|)
--
intersection_size = (double)COUNT(intersection);
GENERATE
FLATTEN(group) AS (v1, v2),
intersection_size AS intersection_size,
MAX(intersection.added_size) AS added_size -- hack, we only need this one time
;
};

similarities = FOREACH intersect_sizes {
--
-- results in:
-- (X, Y, |X /\ Y|/|X U Y|)
--
similarity = (double)intersection_size/((double)added_size-(double)intersection_size);
GENERATE
v1 AS v1,
v2 AS v2,
similarity AS similarity
;
};

DUMP similarities;


Results



Run it:


$: pig -x local jaccard.pig
...snip...
(A,A,1.0)
(A,B,1.0)
(A,C,0.2)
(B,A,1.0)
(B,B,1.0)
(B,C,0.2)
(C,A,0.2)
(C,B,0.2)
(C,C,1.0)
(C,D,0.25)
(C,G,0.25)
(D,C,0.25)
(D,D,1.0)
(E,E,1.0)
(G,C,0.25)
(G,G,1.0)


And it's done! Hurray. Now, go make a recommender system or something.

13 comments:

  1. To get around the hack couldn't you use 'added_size' in the group?

    > GROUP intersection BY (v1, v2, added_size);

    Since it has a constant value for all occurrences of a given (v1, v2) pair this should give you the same grouping as on (v1,v2) but now you don't have to pull the size out of the bag.

    ReplyDelete
  2. Many thanks for this. I'm doing something similar for a tutorial I'm writing for the Hadoop UK Users' Group, and it would have taken much longer without your example to follow.

    One suggestion -- maybe it would be a bit faster to load the edges file once, do the degree counting once, and then do a

    edges_dup = foreach edges generate *;

    Re. Kurt's comment, I once wrote a "First" UDF for exactly these cases -- where you know all the elements in the bag are the same, so finding their MAX (or copying into the grouping key) is just waited cycles...

    ReplyDelete
  3. Hah, not just me:

    https://github.com/linkedin/datafu/blob/master/src/java/datafu/pig/bags/FirstTupleFromBag.java

    ReplyDelete
  4. I must thank you for the efforts you have put in penning this site. I am hoping to check out the same high-grade content by you later on as well. Keep up the good workI am really enjoyed a lot when reading your well-written posts. It shows like you spend more effort and time to write this blog. I have saved it for my future reference. Keep it up the good work.Java training in Chennai

    Java Online training in Chennai

    Java Course in Chennai

    Best JAVA Training Institutes in Chennai

    Java training in Bangalore

    Java training in Hyderabad

    Java Training in Coimbatore

    Java Training

    Java Online Training

    ReplyDelete
  5. Good job in presenting the correct content with the clear explanation. The content looks real with valid information. Good Work, Thank you for sharing such detailed article. I am learning a lot from you.
    DevOps Training in Chennai

    DevOps Online Training in Chennai

    DevOps Training in Bangalore

    DevOps Training in Hyderabad

    DevOps Training in Coimbatore

    DevOps Training

    DevOps Online Training

    ReplyDelete
  6. Very nice post here and thanks for it .I always like and such a super contents of these post.Excellent and very cool idea and great content of different kinds of the valuable information's.
    Data Science Training In Chennai

    Data Science Online Training In Chennai

    Data Science Training In Bangalore

    Data Science Training In Hyderabad

    Data Science Training In Coimbatore

    Data Science Training

    Data Science Online Training

    ReplyDelete
  7. Great site and a great topic as well I really get amazed to read this.This is incredible,I feel really happy to have seen your webpage.
    acte reviews

    acte velachery reviews

    acte tambaram reviews

    acte anna nagar reviews

    acte porur reviews

    acte omr reviews

    acte chennai reviews

    acte student reviews

    ReplyDelete
  8. I just got to this amazing site not long ago. I was actually captured with the piece of resources you have got here. Big thumbs up for making such wonderful blog page!


    AWS Course in Bangalore

    AWS Course in Hyderabad

    AWS Course in Coimbatore

    AWS Course

    AWS Certification Course

    AWS Certification Training

    AWS Online Training

    AWS Training

    ReplyDelete
  9. Flvto is superb Reddit video downloader. It helps convert Reddit videos to mp3, and of course video file download includes audio track. Flvto Youtube Downloader App

    ReplyDelete
  10. Tally ERP 9 Crack 2022 With Latest Version Full Download. This is easy software which is perfect for all users involved in improved business .Tally GST Crack

    ReplyDelete