-
Notifications
You must be signed in to change notification settings - Fork 4
/
Social Web 2015-2016 - Hands-on Session 1.txt
434 lines (294 loc) · 20.2 KB
/
Social Web 2015-2016 - Hands-on Session 1.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
*****************************************************
* The Social Web *
* 2015-2016 Master Information Sciences *
* Instructors Davide Ceolin, Anca Dumitrache, *
* and Niels Ockeloen *
* *
* Exercises for Hands-on session 1 *
* 4 February 2016 09:00 - 10:45 *
* WN-F153 (L) WN-P345 *
*****************************************************
BEFORE INSTALLING PACKAGES: SET UP YOUR PYTHON VIRTUAL ENVIRONMENT!! SEE THE HANDS ON STARETING GUIDE.
Prerequisites:
- Python 2.7
- Python packages: twitter, prettytable, numpy (SEE BELOW), python-dateutil, pyparsing, matplotlib
===================== Installing numpy package on Windows ===============
NB: On the WINDOWS machines only, you will have to install a precompiled version of the numpy package.
Please still stick to the package installation order above.
To install numpy on the VU Windows machines, use the following commands:
> curl -O http://www.few.vu.nl/~con220/socialweb/numpy-1.8.2+mkl-cp27-none-win32.whl
> pip install numpy-1.8.2+mkl-cp27-none-win32.whl
NB2: This is a 32-bits version targeted for the VU Windows Lab computers. If you need a different version for your own Windows Machine, e.g. 64 bits or different Python version, you can download the package from:
http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy
Download the needed numpy-1.8.2 package directly from the website, place it in your working folder and 'pip install' it using its file name.
=========================================================================
NOTE1: All python script lines are placed in script blocks designated by =[START SCRIPT]=== and =[END SCRIPT]=== marker lines.
NOTE2: Identation has a functianal meaning in python scripts, so be carefull with it. The examples in this file use four spaces for one identation step. Using tabs can be tricky and result in indentation errors.
First you need to know how to retrieve some social web data. Exercises 1 and 2 will show you how to retrieve trends and search results from Twitter.
Exercise 1: Authorizing an application to access Twitter account data (from Example 1-1/9-1 in Mining the Social Web):
################# TO DO BEFORE STARTING!!! ##############################
# Go to https://apps.twitter.com/ and sign in with a twitter account #
# to create an app and get values for the credentials mentioned below, #
# which you'll need to provide in place of the empty string values that #
# are defined as placeholders. If you don't have a Twitter account, #
# create one first. NB: YOU WILL HAVE TO REGISTER YOUR MOBILE PHONE #
# NUMBER IN YOUR PROFILE BEFORE YOU CAN CREATE AN APP. #
#########################################################################
After creating the app, click 'manage keys and access tokens' on the resulting page.
=======[START SCRIPT]==========================================================
import twitter # Tell Python to use the twitter package
CONSUMER_KEY = ''
CONSUMER_SECRET = ''
OAUTH_TOKEN = '' # to get the oauth credential you need to click on the 'Create my access token' button and wait few moments
OAUTH_TOKEN_SECRET = ''
auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,CONSUMER_KEY, CONSUMER_SECRET)
twitter_api = twitter.Twitter(auth=auth)
print twitter_api # Nothing to see by displaying twitter_api except that it's now a defined variable
=======[END SCRIPT]============================================================
Exercise 2: Retrieving twitter search trends (from Example 1-2/9-2 in Mining the Social Web):
=======[START SCRIPT]==========================================================
WORLD_WOE_ID = 1 # The Yahoo! Where On Earth ID for the entire world
world_trends = twitter_api.trends.place(_id=WORLD_WOE_ID) # get back a callable
print world_trends
=======[END SCRIPT]============================================================
Your output should look like something like this:
[{u'created_at': u'2014-01-22T13:52:40Z', u'trends': [{u'url': u'http://twitter.com/search?q=%22Iqbal+CJR%22', u'query': u'%22Iqbal+CJR%22', u'name': u'Iqbal CJR', u'promoted_content': None, u'events': None}, {u'url': u'http://twitter.com/search?q=%23QueremosP9NaEliana', u'query': u'%23QueremosP9NaEliana', u'name': u'#QueremosP9NaEliana', u'promoted_content': None, u'events': None}, {u'url': u'http://twitter.com/search?q=%23PresidenSobatIndonesia', u'query': u'%23PresidenSobatIndonesia', u'name': u'#PresidenSobatIndonesia', u'promoted_content': None, u'events': None}, {u'url': u'http://twitter.com/search?q=%23CrushKitaKaso', u'query': u'%23CrushKitaKaso', u'name': u'#CrushKitaKaso', u'promoted_content': None, u'events': None}, {u'url': u'http://twitter.com/search?q=%233FlasherNobitaNioners', u'query': u'%233FlasherNobitaNioners', u'name': u'#3FlasherNobitaNioners', u'promoted_content': None, u'events': None}, {u'url': u'http://twitter.com/search?q=%23bagasrdsREPLY', u'query': u'%23bagasrdsREPLY', u'name': u'#bagasrdsREPLY', u'promoted_content': None, u'events': None}, {u'url': u'http://twitter.com/search?q=%22Warna+HPmu%22', u'query': u'%22Warna+HPmu%22', u'name': u'Warna HPmu', u'promoted_content': None, u'events': None}, {u'url': u'http://twitter.com/search?q=%22AkDe%C4%9Fil+Karanl%C4%B1k%C4%B0%C5%9FlerinPartisi%22', u'query': u'%22AkDe%C4%9Fil+Karanl%C4%B1k%C4%B0%C5%9FlerinPartisi%22', u'name': u'AkDe\u011fil Karanl\u0131k\u0130\u015flerinPartisi', u'promoted_content': None, u'events': None}, {u'url': u'http://twitter.com/search?q=Abueva', u'query': u'Abueva', u'name': u'Abueva', u'promoted_content': None, u'events': None}, {u'url': u'http://twitter.com/search?q=%22YesOrNo+PengenTinggian%22', u'query': u'%22YesOrNo+PengenTinggian%22', u'name': u'YesOrNo PengenTinggian', u'promoted_content': None, u'events': None}], u'as_of': u'2014-01-22T13:55:54Z', u'locations': [{u'woeid': 1, u'name': u'Worldwide'}]}]
-------------------------------------------------------------------------------
Task 1: Find out how WORLD_WOE_IDs are defined by Yahoo and try to use others in a query. What kind of differences do you find between the worldwide trends and the local trends?
-------------------------------------------------------------------------------
Exercise 2.1: Retrieving search results (from Example 1-5/9-3 in Mining the Social Web):
=======[START SCRIPT]==========================================================
q = '#MentionSomeoneImportantForYou' # XXX: Set this variable to a trending topic, or anything else you like.
count = 100 # number of results to retrieve
# See https://dev.twitter.com/docs/api/1.1/get/search/tweets for more info
search_results = twitter_api.search.tweets(q=q, count=count) # search for your query 'q' 100 times
statuses = search_results['statuses'] # extract the tweets found
# The following code allows you to print in a nice format the contents of search_results
import json
print json.dumps(statuses[0], indent=1)
=======[END SCRIPT]============================================================
The start of your output should look like something like this:
{
"contributors": null,
"truncated": false,
"text": "RT @JamesMCallaghan: Love her!!!! RT @Erin_131: #MentionSomeoneImportantForYou My dad",
"in_reply_to_status_id": null,
"id": 425527242413080576,
"favorite_count": 0,
"source": "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>",
"retweeted": false,
"coordinates": null,
"entities": {
"symbols": [],
"user_mentions": [
{
"id": 43795567,
"indices": [
3,
19
],
"id_str": "43795567",
"screen_name": "JamesMCallaghan",
"name": "Jim Callaghan"
},
{
"id": 149373154,
"indices": [
39,
48
],
"id_str": "149373154",
"screen_name": "Erin_131",
"name": "Erin Callaghan (:"
}
],
"hashtags": [
{
"indices": [
50,
80
],
"text": "MentionSomeoneImportantForYou"
}
],
"urls": []
},
[...]
-------------------------------------------------------------------------------
Task 2: Create a second variable (e.g., statuses2) that holds the results of a query other than the query you posed in exercise 2.1. Think about a query that would yield very different results than the first one, for example one that may yield shorter tweets or less diverse tweets.
-------------------------------------------------------------------------------
Simply printing all the search results to screen is nice, but to really start analysing them, it is handy to select interesting parts from the results and store them in a different structure such as a list.
Exercise 3.1: Extracting text, screen names, and hashtags from tweets (from Example 1-6 in Mining the Social Web):
In this example you are using a thing called "List Comprehension".
It is basically a list that consists of lists. In this example, you are creating a variable 'status_texts' of type list. That list will be filled with the 'text' elements from each 'statuses', and 'statuses' comes from looping through all items in the 'search_results' list.
Look up the list comprehensions in your Python reference materials to make sure you understand what's happening here.
=======[START SCRIPT]==========================================================
status_texts = [ status['text']
for status in statuses ]
screen_names = [ user_mention['screen_name']
for status in statuses
for user_mention in status['entities']['user_mentions'] ]
hashtags = [ hashtag['text']
for status in statuses
for hashtag in status['entities']['hashtags'] ]
# Compute a collection of all words from all tweets
words = [ w
for t in status_texts
for w in t.split() ] #split the string on the empty spaces
# Explore the first 5 items for each...
print json.dumps(status_texts[0:5], indent=1)
print json.dumps(screen_names[0:5], indent=1)
print json.dumps(hashtags[0:5], indent=1)
print json.dumps(words[0:5], indent=1)
=======[END SCRIPT]============================================================
-------------------------------------------------------------------------------
Task 3: Also carry out this exercise for the results you obtained in Task 2. Please remember to not overwrite the variables that you just assigned.
-------------------------------------------------------------------------------
Exercise 3.2: Creating a basic frequency distribution from the words in tweets (from Example 1-7 in Mining the Social Web):
=======[START SCRIPT]==========================================================
from collections import Counter
for item in [words, screen_names, hashtags]:
c = Counter(item)
print c.most_common()[:10] # top 10
=======[END SCRIPT]============================================================
Your output should look like this:
[(u'MentionSomeoneImportantForYou', 97), (u'mentionsomeoneimportantforyou', 1), (u'Perfection', 1)]
-------------------------------------------------------------------------------
Task 4: Also carry out this exercise for the second query that you posed in Task 2. Think about possible explanations for the different results you get from the analyses for the different queries.
-------------------------------------------------------------------------------
So far, we have been storing the data in working memory. Often it's handy to store your data more permanently so you can also return to it in a next session.
The cPickle module lets you write data to a file in a format that is easily imported again later.
Exercise 4: Storing data for later use:
=======[START SCRIPT]==========================================================
import cPickle
f = open("myData.pickle", "wb") # create a file handle for writing (w) in binary mode (b) named myData.pickle,
cPickle.dump(words, f) # write the contents of list 'words' to file 'f'
f.close() # clean up after yourself
=======[END SCRIPT]============================================================
If you browse to your working directory, you should find a file there named "myData.pickle". You can open this in a text editor, or load its contents back into a variable to do some more analyses on.
=======[START SCRIPT]==========================================================
words=cPickle.load(open("myData.pickle")) # open the myData.pickle file and store its contents into variable 'words'
=======[END SCRIPT]============================================================
Exercise 5: Using prettytable to display tuples in a nice tabular format (from Example 1-8 in Mining the Social Web):
=======[START SCRIPT]==========================================================
from prettytable import PrettyTable
for label, data in (('Word', words),
('Screen Name', screen_names),
('Hashtag', hashtags)):
pt = PrettyTable(field_names=[label, 'Count'])
c = Counter(data)
[ pt.add_row(kv) for kv in c.most_common()[:10] ]
pt.align[label], pt.align['Count'] = 'l', 'r' # Set column alignment
print pt
=======[END SCRIPT]============================================================
Your output should look like this:
+--------------------------------+-------+
| Word | Count |
+--------------------------------+-------+
| #MentionSomeoneImportantForYou | 97 |
| RT | 78 |
| @_CollegeHumor_: | 61 |
| vodka | 61 |
| “@_CollegeHumor_: | 13 |
| vodka” | 12 |
| ♥ | 7 |
| @DinaHeshamAlii: | 4 |
| 😂😂😂 | 3 |
| "@_CollegeHumor_: | 3 |
+--------------------------------+-------+
Exercise 6: Calculating lexical diversity for tweets (from Example 1-9 in Mining the Social Web):
=======[START SCRIPT]==========================================================
# Define a function for computing lexical diversity
def lexical_diversity(tokens): # This is the way to declare user defined functions
return 1.0*len(set(tokens))/len(tokens)
# Define a function for computing the average number of words per tweet
def average_words(statuses):
total_words = sum([ len(s.split()) for s in statuses ])
return 1.0*total_words/len(statuses) # Prior to Python 3.0, the division operator (/) applies the floor function and returns an integer value (unless one of the operands is a floating-point value). Multiply either the numerator or the denominator by 1.0 to avoid truncation errors.
# Let's use these functions:
print lexical_diversity(words)
print lexical_diversity(screen_names)
print lexical_diversity(hashtags)
print average_words(status_texts)
=======[END SCRIPT]============================================================
-------------------------------------------------------------------------------
Task 5: What do the printed numbers indicate? Try to explain them.
-------------------------------------------------------------------------------
Exercise 7: Finding the most popular retweets (from Example 1-10 in Mining the Social Web):
=======[START SCRIPT]==========================================================
retweets = [(status['retweet_count'],
status['retweeted_status']['user']['screen_name'],
status['text'])# Store out a tuple of these three values
for status in statuses # for each status
if status.has_key('retweeted_status') # ... so long as the status meets this condition.
]
# Slice off the first 5 from the sorted results and display each item in the tuple
pt = PrettyTable(field_names=['Count', 'Screen Name', 'Text'])
[ pt.add_row(row) for row in sorted(retweets, reverse=True)[:5] ]
pt.max_width['Text'] = 50
pt.align= 'l'
print pt
=======[END SCRIPT]============================================================
Your output should look like this:
+-------+----------------+----------------------------------------------------+
| Count | Screen Name | Text |
+-------+----------------+----------------------------------------------------+
| 82 | _CollegeHumor_ | RT @_CollegeHumor_: #MentionSomeoneImportantForYou |
| | | vodka |
| 82 | _CollegeHumor_ | RT @_CollegeHumor_: #MentionSomeoneImportantForYou |
| | | vodka |
| 82 | _CollegeHumor_ | RT @_CollegeHumor_: #MentionSomeoneImportantForYou |
| | | vodka |
| 82 | _CollegeHumor_ | RT @_CollegeHumor_: #MentionSomeoneImportantForYou |
| | | vodka |
| 82 | _CollegeHumor_ | RT @_CollegeHumor_: #MentionSomeoneImportantForYou |
| | | vodka |
+-------+----------------+----------------------------------------------------+
Exercise 8: Looking up users who have retweeted a status (from Example 1-11 in Mining the Social Web):
=======[START SCRIPT]==========================================================
_retweets = twitter_api.statuses.retweets(id=) # Get the original tweet id for a tweet from its retweeted_status node and insert it here
print [r['user']['screen_name'] for r in _retweets]
=======[END SCRIPT]============================================================
-------------------------------------------------------------------------------
Task 6 (For experienced users):
If you have a Twitter account with a nontrivial number of tweets, request your historical tweet archive from your account settings and analyze it. The export of your account data includes files organized by time period in a convenient JSON
format. See the README.txt file included in the downloaded archive for more details.
What are the most common terms that appear in your tweets?
Who do you retweet the most often?
How many of your tweets are retweeted (and why do you think this is the case)?
-------------------------------------------------------------------------------
In the previous exercises we have been looking at the text from the tweets, but when you retrieved the results, you retrieved much more information about the tweets, such as the username of the person who shared this tweet with the world. You can use this information to find out who retweets whom in our examples.
Exercise 9.1: Plotting frequencies of words (from Example 1-12 in Mining the Social Web):
=======[START SCRIPT]==========================================================
word_counts = sorted(Counter(words).values(), reverse=True)
import matplotlib.pyplot as plt
plt.loglog(word_counts)
plt.ylabel("Freq")
plt.xlabel("Word Rank")
plt.show()
=======[END SCRIPT]============================================================
Exercise 9.2: Generating histograms of words, screen names, and hashtags (from Example 1-13 in Mining the Social Web):
=======[START SCRIPT]==========================================================
for label, data in (('Words', words), # Build a frequency map for each set of data
('Screen Names', screen_names),
('Hashtags', hashtags)):
c = Counter(data)
# plot the values
plt.hist(c.values())
# Add a title and y-label ...
plt.title(label)
plt.ylabel("Number of items in bin")
plt.xlabel("Bins (number of times an item appeared)")
plt.show()
=======[END SCRIPT]============================================================
Exercise 9.3: Generating a histogram of retweet counts (from Example 1-14 in Mining the Social Web):
NB: Using underscores while unpacking values in a tuple is idiomatic for discarding them
=======[START SCRIPT]==========================================================
counts = [count for count, _, _ in retweets]
plt.hist(counts)
plt.title("Retweets")
plt.xlabel('Bins (number of times retweeted)')
plt.ylabel('Number of tweets in bin')
print counts
plt.show()
=======[END SCRIPT]============================================================