Google+ Network Extractor - version 1

-----------------------------------------------------------------------

Copyright 2011-2012 Vincent Labatut
http://vincentlabatut.org
vincent.labatut@free.fr

Google+ Network Extractor is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation.

For source availability and license information see licence.txt

-----------------------------------------------------------------------

Description: This software was designed to retrieve the social network
behind the Google+ service. It was developed during the summer of 2011 and
tested starting in September 2011. At that time, there was no official,
documented G+ API, so I had to search the web for related programs. I found
a very nice piece of open source code by Jason Grey (cf. the parser package),
which I used to access the relevant data: followers and followees. Note that
as of 2012, the official Google+ API still does not document this feature
(it focuses on personal data, as opposed to relational data).
I also needed to obtain the complete list of G+ users. For this, I used
some ideas from Paul Anthony and NgocChinh (cf. the explorer package). 

Here is how the program works: first, it scans some publicly available Google 
files to retrieve all the Google ids (gids), which are unique 21-digit
numbers associated with each Google user (including people not actually using
Google+). Those are inserted into a Person table located in an H2 database.
Second, for each of these gids, it retrieves the lists of followers
and followees, also in the form of gids. These are inserted into a Relationship
table. To speed up the process, this part is multithreaded. Third, once all 
the data has been gathered, it exports the social network (basically the 
relationship table) in the form of an edgelist file (a list of links, each 
one defined by its source and target nodes).
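
As an illustration, here is a minimal, hedged sketch of the first and third
steps, using the embedded H2 database. The table and column names follow the
description above, but the exact schema, connection URL and file names are
assumptions, not the program's actual code; the multithreaded retrieval step
(step 2) is omitted.

  import java.io.FileWriter;
  import java.io.IOException;
  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.SQLException;
  import java.sql.Statement;

  public class EdgelistExportSketch {
      public static void main(String[] args) throws SQLException, IOException {
          // Embedded H2 database (the H2 driver must be on the classpath).
          try (Connection conn = DriverManager.getConnection("jdbc:h2:./gplus", "sa", "")) {
              try (Statement st = conn.createStatement()) {
                  // gids are 21-digit numbers, stored as strings (see the
                  // limitation discussed at the end of this file)
                  st.execute("CREATE TABLE IF NOT EXISTS PERSON(GID VARCHAR(21) PRIMARY KEY)");
                  st.execute("CREATE TABLE IF NOT EXISTS RELATIONSHIP("
                          + "SOURCE_GID VARCHAR(21), TARGET_GID VARCHAR(21))");
              }
              // Step 3: export the Relationship table as an edgelist file,
              // one link per line, defined by its source and target nodes.
              try (Statement st = conn.createStatement();
                   ResultSet rs = st.executeQuery("SELECT SOURCE_GID, TARGET_GID FROM RELATIONSHIP");
                   FileWriter out = new FileWriter("network.edgelist")) {
                  while (rs.next())
                      out.write(rs.getString(1) + "\t" + rs.getString(2) + "\n");
              }
          }
      }
  }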

The program was tested first to retrieve my ego-network (i.e. the people
around me, and those around them, etc.). Once the program was working, I 
started retrieving the whole G+ network. It took much longer than I thought
(around 2 months), for a network containing 80,080,893 nodes and 473,106,758 
links (not counting isolates). I initially had planned to retrieve the network 
every two months to track its evolution. However, Google eventually added a
captcha system, preventing me from reaching this goal (I started retrieving 
the November network, but could not finish). Finally, I decided to focus on 
what I had, and to just analyze the September-October network. The result is 
the following paper, which I ask you to cite if you use this software for 
academic purposes:
<TODO to be completed> 

Besides the functionalities exposed here, the software can also retrieve
other data from a gid, such as name, first name, etc. I eventually did not
retrieve these because, first, I did not really need them, and second, doing
so made the whole process extremely long. 

Moreover, I implemented some additional post-processing, partly in R and 
partly in Java. Indeed, the R igraph library is very convenient for computing
topological properties, but some of its functions did not work on such a
large network (at least on my computer), for memory reasons. So I had to
code part of the post-processing in Java.
The res/rscript folder contains the following R scripts:
  - edgelist2graphml.R is able to convert an edgelist file to a graphml file, 
    so that you can play with it in Gephi (http://gephi.org/), for instance. 
    The script requires the igraph library v0.5 (http://igraph.sourceforge.net/). 
    Note that all of these tools are open source.
  - extractGiantComponent.R computes some properties of the whole network,
    and extracts the list of nodes corresponding to its giant component.
    This list must then be processed by the Java class SubgraphExtractor
    in order to get the actual subgraph, represented as an edgelist (see
    the sketch after this list).
  - giantcompAnalysis.R is in charge of computing the properties of the
    giant component (i.e. of the file generated by the Java class).
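
For reference, here is a hedged sketch of what the SubgraphExtractor step
could look like: it keeps only the links whose two endpoints belong to the
node list produced by extractGiantComponent.R. The file names and formats
(one gid per line, tab-separated edgelist) are assumptions, not the actual
class.

  import java.io.BufferedReader;
  import java.io.BufferedWriter;
  import java.io.IOException;
  import java.nio.file.Files;
  import java.nio.file.Paths;
  import java.util.HashSet;
  import java.util.Set;

  public class SubgraphExtractorSketch {
      public static void main(String[] args) throws IOException {
          // Node ids of the giant component, one per line (assumed format).
          Set<String> giant = new HashSet<>(Files.readAllLines(Paths.get("giantcomp.nodes")));
          try (BufferedReader in = Files.newBufferedReader(Paths.get("network.edgelist"));
               BufferedWriter out = Files.newBufferedWriter(Paths.get("giantcomp.edgelist"))) {
              String line;
              while ((line = in.readLine()) != null) {
                  String[] ends = line.split("\t");
                  // keep a link only if both endpoints are in the giant component
                  if (giant.contains(ends[0]) && giant.contains(ends[1])) {
                      out.write(line);
                      out.newLine();
                  }
              }
          }
      }
  }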

It is important to know that the captcha system does not completely prevent
the use of this software: you just cannot make too many requests from the same
IP. This considerably slows the process down, too much for my academic
purposes... but it is still enough to play with it a bit. Google might also
raise this limit one day, or provide academic access (which is not the case
for this service, up to now).
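
Concretely, staying under such a per-IP limit just means spacing out the
requests, along these lines. The retrieve() call and the 5-second delay are
purely illustrative assumptions; the actual threshold that triggers the
captcha is unknown.

  import java.util.List;

  public class ThrottledCrawlSketch {
      // hypothetical call standing for the retrieval of one gid's
      // followers/followees
      static void retrieve(String gid) { /* ... */ }

      public static void main(String[] args) throws InterruptedException {
          List<String> gids = List.of("100000000000000000001",
                                      "100000000000000000002");
          for (String gid : gids) {
              retrieve(gid);
              Thread.sleep(5000); // at most one request every 5 seconds per IP
          }
      }
  }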

Google might also change the structure of its JSON objects, or the way its
(undocumented) API is invoked, which would make the parser obsolete. Note that
I do not plan to maintain this software or update it if such changes were to
happen. Please do not contact me about these matters. I think the Javadoc is
complete enough to allow anyone to adapt the program to their needs. I would
be interested in being informed of any use/modification, though! 

If you plan to use this software to retrieve the whole G+ network, note it
has an important limitation: gids are stored as strings in the DB. Why strings
and not integers, since they are actually integers, you might ask? Well, the
DB was not able to store such large integers (a 21-digit gid exceeds even a
64-bit BIGINT). A renumbering of the nodes would have allowed the use of much
smaller values than those bloody 21 digits, but I was not aware of that at
the time. It turns out using strings was a huge problem when I started
processing the network (because of all the memory they require, of course).
So I eventually had to renumber the nodes, which took quite a while. I did
not update the program though, since I did not plan to use it again. In other
words, this renumbering was performed completely as a post-processing step
(sketched below). The program still retrieves and stores strings.
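
For illustration, such a renumbering can be done in a single pass over the
edgelist, mapping each 21-digit gid string to the next free small integer.
The file names and tab-separated format below are assumptions, not the actual
post-processing code.

  import java.io.BufferedReader;
  import java.io.BufferedWriter;
  import java.io.IOException;
  import java.nio.file.Files;
  import java.nio.file.Paths;
  import java.util.HashMap;
  import java.util.Map;

  public class RenumberSketch {
      // assign the next free integer to a gid seen for the first time
      static int idOf(Map<String, Integer> ids, String gid) {
          Integer id = ids.get(gid);
          if (id == null) {
              id = ids.size();
              ids.put(gid, id);
          }
          return id;
      }

      public static void main(String[] args) throws IOException {
          Map<String, Integer> ids = new HashMap<>();
          try (BufferedReader in = Files.newBufferedReader(Paths.get("network.edgelist"));
               BufferedWriter out = Files.newBufferedWriter(Paths.get("renumbered.edgelist"))) {
              String line;
              while ((line = in.readLine()) != null) {
                  String[] ends = line.split("\t");
                  out.write(idOf(ids, ends[0]) + "\t" + idOf(ids, ends[1]));
                  out.newLine();
              }
          }
      }
  }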

-----------------------------------------------------------------------

This product uses the following open source software:
  + H2 - Java SQL Database
    http://www.h2database.com
    Mozilla Public License
    Used to store the data retrieved from Google+
	
  + Jersey (Sun/Oracle implementation of JAX-RS - Java API for RESTful Web Services)
    http://jersey.java.net/
    GPL2 License
    Used to query the Google+ service
    
  + Jettison - JSON StAX Implementation
    http://jettison.codehaus.org/
    Apache License v2 
    Used to handle the JSON objects
    
  + Log4j
    http://logging.apache.org/log4j
    Apache License v2 
    Used to log the program activity
  	
  + Some ideas from these blogs
    Paul Anthony: http://blog.webdistortion.com/2011/06/12/google-people-search-how-big-is-googles-social-network/
    NgocChinh: http://www.blackhatworld.com/blackhat-seo/black-hat-seo/315777-list-google-profiles-do-what-you-want.html
    Used to obtain all the Google+ ids

  + Some source code from Jason Grey in the following classes:
    GoogleJSONFilter
    GooglePlusParser
    https://plus.google.com/114074126238467553553/posts
    (A G+ post, incidentally!)  
