Last month, in the Plaksha Tech Leaders Fellowship that I am currently pursuing, I began learning about working with big data and deploying things at scale. In the course, I was introduced to a library called Dask.

"Dask is a flexible library for parallel computing in Python." (the first line of its documentation)

Unlike Apache Spark, which is written in Scala and was originally aimed at Java and Scala developers, Dask is a pure-Python library with similar functionality.

Dask operates in layers. At the highest level, where we usually work, there are arrays, dataframes, machine learning models, and the Bag, a collection type unique to Dask.

Below that sit Delayed and Futures. Delayed builds a directed acyclic graph (DAG) of all the functions and processes that need to be executed before running them, whereas Futures works like eager execution in TensorFlow: tasks get executed as they are submitted.

At the lowest level is the scheduler. While there are several different schedulers, we will be working with the threaded scheduler, as we will mainly have arrays and dataframes. The scheduler is responsible for actually running the DAGs. It can be visualised using the Client, which we will also use to monitor how everything is running.
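As a minimal sketch of the Delayed-then-scheduler flow described above (assuming dask is installed), the calls below only build a task graph; nothing runs until `compute()` hands the graph to a scheduler:

```python
import dask
from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def add(a, b):
    return a + b

# Nothing runs yet -- these calls only build a DAG of three tasks.
graph = add(inc(1), inc(2))

# compute() hands the DAG to a scheduler; here, the threaded one.
result = graph.compute(scheduler="threads")
print(result)  # 5
```

Because `inc(1)` and `inc(2)` have no dependency on each other, the scheduler is free to run them in parallel before the final `add`.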

First, we do our imports and read in the data. For this tutorial, we will be using the Yelp dataset.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import dask.dataframe as dd
import dask.array as da
from dask import delayed


Here we start the Client, which will let us monitor the scheduler.

In [ ]:
from dask.distributed import Client
client = Client(n_workers=8, threads_per_worker=1, processes=False, scheduler_port=0)
client


If the status screen is mostly green, the tasks were parallelised well and utilised the available hardware efficiently. If it is broken up by large gaps instead, the "workers" (the cores of your processor) were not fully utilised.

Another screen is the graph screen, where you can see the DAG of your task.

The blue nodes are completed subtasks, the green nodes are pending ones, while the red ones are in memory and currently being processed.

Finally, you can go to the Workers tab to see all your processor cores, their memory, and the load they are taking.

You can also go to the Status tab to see graphs of how disk, processor, network, etc. are performing.

While running this notebook, it will be helpful to keep the client dashboard open in another tab and watch how the tasks parallelise and make your machine turn up the heat.

In [2]:
review = dd.read_json("/kaggle/input/yelp-dataset/yelp_academic_dataset_review.json", lines=True, encoding='utf-8', blocksize="100MB")
business = dd.read_json("/kaggle/input/yelp-dataset/yelp_academic_dataset_business.json", lines=True, encoding='utf-8', blocksize="100MB")
user = dd.read_json("/kaggle/input/yelp-dataset/yelp_academic_dataset_user.json", lines=True, encoding='utf-8', blocksize="100MB")
checkin = dd.read_json("/kaggle/input/yelp-dataset/yelp_academic_dataset_checkin.json", lines=True, encoding='utf-8', blocksize="100MB")
tip = dd.read_json("/kaggle/input/yelp-dataset/yelp_academic_dataset_tip.json", lines=True, encoding='utf-8', blocksize="100MB")

In [3]:
review.head()

Out[3]:
review_id user_id business_id stars useful funny cool text date
0 Q1sbwvVQXV2734tPgoKj4Q hG7b0MtEbXx5QzbzE6C_VA ujmEBvifdJM6h6RLv4wQIg 1 6 1 0 Total bill for this horrible service? Over $8G... 2013-05-07 04:34:36
1 GJXCdrto3ASJOqKeVWPi6Q yXQM5uF2jS6es16SJzNHfg NZnhc2sEQy3RmzKTZnqtwQ 5 0 0 0 I *adore* Travis at the Hard Rock's new Kelly ... 2017-01-14 21:30:33
2 2TzJjDVDEuAW6MR5Vuc1ug n6-Gk65cPZL6Uz8qRm3NYw WTqjgwHlXbSFevF32_DJVw 5 3 0 0 I have to say that this office really has it t... 2016-11-09 20:09:03
3 yi0R0Ugj_xUx_Nek0-_Qig dacAIZ6fTM6mqwW5uxkskg ikCg8xy5JIg_NGPx-MSIDA 5 0 0 0 Went in for a lunch. Steak sandwich was delici... 2018-01-09 20:56:38
4 11a8sVPMUFtaC7_ABRkmtw ssoyf2_x0EQMed6fgHeMyQ b1b1eb3uo-w561D0ZfCEiQ 1 7 0 0 Today was my second out of three sessions I ha... 2018-01-30 23:07:38

In [4]:
business.head()

Out[4]:
business_id name address city state postal_code latitude longitude stars review_count is_open attributes categories hours
0 1SWheh84yJXfytovILXOAQ Arizona Biltmore Golf Club 2818 E Camino Acequia Drive Phoenix AZ 85016 33.522143 -112.018481 3.0 5 0 {'GoodForKids': 'False'} Golf, Active Life None
1 QXAEGFB4oINsVuTFxEYKFQ Emerald Chinese Restaurant 30 Eglinton Avenue W Mississauga ON L5R 3E7 43.605499 -79.652289 2.5 128 1 {'RestaurantsReservations': 'True', 'GoodForMe... Specialty Food, Restaurants, Dim Sum, Imported... {'Monday': '9:0-0:0', 'Tuesday': '9:0-0:0', 'W...
2 gnKjwL_1w79qoiV3IC_xQQ Musashi Japanese Restaurant 10110 Johnston Rd, Ste 15 Charlotte NC 28210 35.092564 -80.859132 4.0 170 1 {'GoodForKids': 'True', 'NoiseLevel': 'u'avera... Sushi Bars, Restaurants, Japanese {'Monday': '17:30-21:30', 'Wednesday': '17:30-...
3 xvX2CttrVhyG2z1dFg_0xw Farmers Insurance - Paul Lorenz 15655 W Roosevelt St, Ste 237 Goodyear AZ 85338 33.455613 -112.395596 5.0 3 1 None Insurance, Financial Services {'Monday': '8:0-17:0', 'Tuesday': '8:0-17:0', ...
4 HhyxOkGAM07SRYtlQ4wMFQ Queen City Plumbing 4209 Stuart Andrew Blvd, Ste F Charlotte NC 28217 35.190012 -80.887223 4.0 4 1 {'BusinessAcceptsBitcoin': 'False', 'ByAppoint... Plumbing, Shopping, Local Services, Home Servi... {'Monday': '7:0-23:0', 'Tuesday': '7:0-23:0', ...

In [5]:
user.head()

Out[5]:
user_id name review_count yelping_since useful funny cool elite friends fans ... compliment_more compliment_profile compliment_cute compliment_list compliment_note compliment_plain compliment_cool compliment_funny compliment_writer compliment_photos
0 l6BmjZMeQD3rDxWUbiAiow Rashmi 95 2013-10-08 23:11:33 84 17 25 2015,2016,2017 c78V-rj8NQcQjOI8KP3UEA, alRMgPcngYSCJ5naFRBz5g... 5 ... 0 0 0 0 1 1 1 1 2 0
1 4XChL029mKr5hydo79Ljxg Jenna 33 2013-02-21 22:29:06 48 22 16 kEBTgDvFX754S68FllfCaA, aB2DynOxNOJK9st2ZeGTPg... 4 ... 0 0 0 0 0 0 1 1 0 0
2 bc8C_eETBWL0olvFSJJd0w David 16 2013-10-04 00:16:10 28 8 10 4N-HU_T32hLENLntsNKNBg, pSY2vwWLgWfGVAAiKQzMng... 0 ... 0 0 0 0 1 0 0 0 0 0
3 dD0gZpBctWGdWo9WlGuhlA Angela 17 2014-05-22 15:57:30 30 4 14 RZ6wS38wnlXyj-OOdTzBxA, l5jxZh1KsgI8rMunm-GN6A... 5 ... 0 0 0 0 0 2 0 0 1 0
4 MM4RJAeH6yuaN8oZDSt0RA Nancy 361 2013-10-23 07:02:50 1114 279 665 2015,2016,2017,2018 mbwrZ-RS76V1HoJ0bF_Geg, g64lOV39xSLRZO0aQQ6DeQ... 39 ... 1 0 0 1 16 57 80 80 25 5

5 rows × 22 columns

In [6]:
checkin.head()

Out[6]:
business_id date
0 --1UhMGODdWsrMastO9DZw 2016-04-26 19:49:16, 2016-08-30 18:36:57, 2016...
1 --6MefnULPED_I942VcFNA 2011-06-04 18:22:23, 2011-07-23 23:51:33, 2012...
2 --7zmmkVg-IMGaXbuVd0SQ 2014-12-29 19:25:50, 2015-01-17 01:49:14, 2015...
3 --8LPVSo5i0Oo61X01sV9A 2016-07-08 16:43:30
4 --9QQLMTbFzLJ_oT-ON3Xw 2010-06-26 17:39:07, 2010-08-01 20:06:21, 2010...

In [7]:
tip.head()

Out[7]:
user_id business_id text date compliment_count
0 UPw5DWs_b-e2JRBS-t37Ag VaKXUpmWTTWDKbpJ3aQdMw Great for watching games, ufc, and whatever el... 2014-03-27 03:51:24 0
1 Ocha4kZBHb4JK0lOWvE0sg OPiPeoJiv92rENwbq76orA Happy Hour 2-4 daily with 1/2 price drinks and... 2013-05-25 06:00:56 0
2 jRyO2V1pA4CdVVqCIOPc1Q 5KheTjYPu1HcQzQFtm4_vw Good chips and salsa. Loud at times. Good serv... 2011-12-26 01:46:17 0
3 FuTJWFYm4UKqewaosss1KA TkoyGi8J7YFjA6SbaRzrxg The setting and decoration here is amazing. Co... 2014-03-23 21:32:49 0
4 LUlKtaM3nXd-E4N4uOk_fQ AkL6Ous6A1atZejfZXn1Bg Molly is definately taking a picture with Sant... 2012-10-06 00:19:27 0

Now that we know what sort of data we have, let us pick three things that we can possibly learn from it:

1. Which state has the best restaurants? What makes them so special?
2. What is the difference between a good review and a useful review?
3. Who are the most influential users?

## State with the best restaurants

In [8]:
star_list = business.groupby('state').stars.mean().compute()

In [9]:
plt.figure(figsize=(10,10))
sns.barplot(star_list.index, star_list.values)

Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f747ee25f60>

In [10]:
star_list.sort_values(ascending=False)

Out[10]:
state
TN     5.000000
NJ     5.000000
XWY    4.500000
XGL    4.500000
VT     4.250000
TX     4.166667
CA     4.026316
BAS    4.000000
VA     4.000000
XGM    3.875000
AL     3.833333
GA     3.750000
AZ     3.707185
NV     3.696423
QC     3.635535
WI     3.610691
PA     3.577523
NC     3.542187
OH     3.505341
SC     3.503873
CON    3.500000
CT     3.500000
DUR    3.500000
IL     3.464286
AB     3.385359
ON     3.356504
NY     3.250000
NE     3.000000
DOW    3.000000
AK     2.750000
FL     2.500000
NM     2.500000
WA     2.333333
UT     2.000000
AR     2.000000
BC     1.500000
Name: stars, dtype: float64

Clearly, Tennessee and New Jersey have the highest star ratings. But what about the number of reviews? If they have very few reviews, the averages could be biased. Let us check.

In [11]:
rev_count_list = business.groupby('state').review_count.sum().compute()

In [12]:
rev_count_list.sort_values()

Out[12]:
state
XGL          3
DUR          3
CON          3
BC           3
TN           3
DOW          4
BAS          4
UT           4
AK           7
AR           7
XWY          8
NJ           8
VT           8
AL          12
NE          12
XGM         13
GA          14
NM          14
VA          16
CT          16
WA          18
CA         247
NY         273
FL         698
TX        1052
SC       20467
IL       41021
AB       96764
WI      129609
QC      175745
PA      281129
OH      310545
NC      394317
ON      761180
AZ     2003145
NV     2243534
Name: review_count, dtype: int64

Aha! So that intuition was right. Both those states have only 3 reviews, which makes for a very skewed sample. Now we can take out the states that don't have at least a three-digit count of reviews and revisit the star ratings.

In [13]:
sufficient_rev_states = rev_count_list.loc[rev_count_list > 100].index.tolist()

In [14]:
star_list = business.groupby('state').stars.mean().compute()

In [15]:
star_rev = pd.concat([star_list[sufficient_rev_states], rev_count_list.loc[rev_count_list > 100]], axis=1)
star_rev.sort_values(by='stars', ascending=False)

Out[15]:
       stars      review_count
state
TX     4.166667   1052
CA     4.026316   247
AZ     3.707185   2003145
NV     3.696423   2243534
QC     3.635535   175745
WI     3.610691   129609
PA     3.577523   281129
NC     3.542187   394317
OH     3.505341   310545
SC     3.503873   20467
IL     3.464286   41021
AB     3.385359   96764
ON     3.356504   761180
NY     3.250000   273
FL     2.500000   698

This still does not seem to represent the whole picture. If we create a new feature, an average of stars per review, it could be a more accurate representation of how good the restaurants in each state are.

In [16]:
star_count = business.groupby('state').stars.sum().compute()[sufficient_rev_states]

In [17]:
star_rev['star_count'] = star_count

In [18]:
star_rev['stars_per_review'] = star_rev.apply(lambda row: row.star_count / row.review_count, axis=1)
star_rev.rename(columns={'stars': 'star_mean'}, inplace=True)
star_rev.sort_values(by='stars_per_review', ascending=False)

Out[18]:
       star_mean  review_count  star_count  stars_per_review
state
CA     4.026316   247           76.5        0.309717
AB     3.385359   96764         27123.5     0.280306
NY     3.250000   273           71.5        0.261905
SC     3.503873   20467         4071.5      0.198930
QC     3.635535   175745        33516.0     0.190708
OH     3.505341   310545        51518.0     0.165895
IL     3.464286   41021         6693.0      0.163160
ON     3.356504   761180        112147.5    0.147334
WI     3.610691   129609        18609.5     0.143582
PA     3.577523   281129        40125.5     0.142730
NC     3.542187   394317        52141.0     0.132231
AZ     3.707185   2003145       210145.5    0.104908
NV     3.696423   2243534       134224.5    0.059827
TX     4.166667   1052          25.0        0.023764

FL     2.500000   698           10.0        0.014327

California is the state with the highest stars per review. It can be thought of as the state with the best restaurants, at least until more reviews come in with low ratings.

## Difference between a good review and a useful review

To find the difference, let us find the most commonly appearing words in a good review (5 stars) and in a useful review (the top quartile of usefulness).

In [19]:
review.describe().compute()

Out[19]:
       stars         useful        funny         cool
count  6.685900e+06  6.685900e+06  6.685900e+06  6.685900e+06
mean   3.716199e+00  1.354134e+00  4.827667e-01  5.787708e-01
std    1.463643e+00  3.700192e+00  2.378646e+00  2.359024e+00
min    1.000000e+00 -1.000000e+00  0.000000e+00 -1.000000e+00
25%    3.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
50%    4.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
75%    5.000000e+00  2.000000e+00  0.000000e+00  1.000000e+00
max    5.000000e+00  1.241000e+03  1.290000e+03  5.060000e+02

In [20]:
useful = review.loc[review['useful'] > 2]
useful.head()

Out[20]:
review_id user_id business_id stars useful funny cool text date
0 Q1sbwvVQXV2734tPgoKj4Q hG7b0MtEbXx5QzbzE6C_VA ujmEBvifdJM6h6RLv4wQIg 1 6 1 0 Total bill for this horrible service? Over $8G... 2013-05-07 04:34:36
2 2TzJjDVDEuAW6MR5Vuc1ug n6-Gk65cPZL6Uz8qRm3NYw WTqjgwHlXbSFevF32_DJVw 5 3 0 0 I have to say that this office really has it t... 2016-11-09 20:09:03
4 11a8sVPMUFtaC7_ABRkmtw ssoyf2_x0EQMed6fgHeMyQ b1b1eb3uo-w561D0ZfCEiQ 1 7 0 0 Today was my second out of three sessions I ha... 2018-01-30 23:07:38
6 G7XHMxG0bx9oBJNECG4IFg jlu4CztcSxrKx56ba1a5AQ 3fw2X5bZYeW9xCz_zGhOHg 3 5 4 5 Tracy dessert had a big name in Hong Kong and ... 2016-05-07 01:21:02
7 8e9HxxLjjqc9ez5ezzN7iQ d6xvYpyzcfbF_AZ8vMB7QA zvO-PJCpNk4fgAVUnExYAA 1 3 1 1 This place has gone down hill.  Clearly they h... 2010-10-05 19:12:35
In [21]:
u_text = useful.text.values.compute()


Here we use a Dask array instead of a normal NumPy array. Splitting it into chunks of size 1000 lets us run functions on the chunks in parallel and aggregate their results before outputting them.
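For instance, here is a minimal sketch of how chunking turns one array operation into many parallel tasks (the array and chunk sizes are made up for illustration):

```python
import numpy as np
import dask.array as da

arr = np.arange(10_000)
# Ten chunks of 1000 elements; each chunk becomes its own task.
darr = da.from_array(arr, chunks=1000)

# Each chunk's partial sum runs independently, then the partials
# are aggregated into the final result by the scheduler.
total = darr.sum().compute()
print(total)  # 49995000
```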

In [22]:
# Printing some sample reviews
print(u_text[0])
print(u_text[1])
print(u_text[2])
dist_u_text = da.from_array(u_text, chunks=1000)

Total bill for this horrible service? Over $8Gs. These crooks actually had the nerve to charge us$69 for 3 pills. I checked online the pills can be had for 19 cents EACH! Avoid Hospital ERs at all costs.
I have to say that this office really has it together, they are so organized and friendly!  Dr. J. Phillipp is a great dentist, very friendly and professional.  The dental assistants that helped in my procedure were amazing, Jewel and Bailey helped me to feel comfortable!  I don't have dental insurance, but they have this insurance through their office you can purchase for $80 something a year and this gave me 25% off all of my dental work, plus they helped me get signed up for care credit which I knew nothing about before this visit!  I highly recommend this office for the nice synergy the whole office has!


CountVectorizer is a scikit-learn class that builds a term-frequency matrix for us. We also use NLTK (the Natural Language Toolkit) to remove stopwords, i.e. words such as "the", "and", and "a". Stopwords are almost always the most commonly occurring words and would drown out the interesting ones, so we remove them.
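As a pure-Python illustration of the idea (not the scikit-learn implementation, and with a toy stopword set rather than NLTK's full list):

```python
from collections import Counter

# Toy stopword set and corpus, purely for illustration.
stop = {"the", "and", "a", "is"}
docs = ["the food is great", "great service and great food"]

# Count every token that is not a stopword, across all documents.
counts = Counter(
    tok for doc in docs for tok in doc.lower().split() if tok not in stop
)
print(counts.most_common(2))  # [('great', 3), ('food', 2)]
```

CountVectorizer does essentially this, but returns the counts as a sparse document-term matrix instead of a flat Counter.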

In [23]:
from sklearn.feature_extraction.text import CountVectorizer
import nltk
nltk.download('stopwords')  # no-op if the corpus is already present
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))


We modify a function written by Cristhian Boujon in one of his blog posts to accommodate the removal of stopwords.

In [24]:
# Function adapted from: https://gist.github.com/CristhianBoujon/c719ba2287a630a6d3821d37a9608ac8#file-get_top_n_words-py
def get_top_n_words(corpus, n=None):
    """
    List the top n words in a vocabulary according to occurrence in a text
    corpus, skipping the stopwords in `stop`.

    get_top_n_words(["I love Python", "Python is a language programming",
                     "Hello world", "I love the world"]) ->
    [('python', 2),
     ('world', 2),
     ('love', 2),
     ('hello', 1),
     ('programming', 1),
     ('language', 1)]
    """
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = []
    for word, idx in vec.vocabulary_.items():
        if word not in stop:
            words_freq.append((word, sum_words[0, idx]))
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]


We wrap the call in delayed, which adds it to the task graph instead of executing it immediately; the work actually runs when we call compute().
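A tiny sketch of what delayed does (the tracked function and its calls list are just illustrative):

```python
from dask import delayed

calls = []

def tracked(x):
    # Record that the function actually ran.
    calls.append(x)
    return x * 2

# Building the task does NOT run tracked -- calls is still empty here.
lazy = delayed(tracked)(21)
before = list(calls)

# compute() triggers the actual call.
result = lazy.compute()
print(before, result)  # [] 42
```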

In [25]:
top_u_words = delayed(get_top_n_words)(dist_u_text,15)

In [26]:
out_u = top_u_words.compute()

In [27]:
out_u

Out[27]:
[('place', 648874),
('like', 627164),
('good', 606280),
('one', 591128),
('get', 578194),
('food', 558900),
('time', 537234),
('would', 533604),
('back', 453894),
('great', 436558),
('service', 430809),
('go', 403360),
('really', 396775),
('also', 359216),
('us', 345261)]
In [28]:
good = review.loc[review['stars'] == 5]
good.head()

Out[28]:
review_id user_id business_id stars useful funny cool text date
1 GJXCdrto3ASJOqKeVWPi6Q yXQM5uF2jS6es16SJzNHfg NZnhc2sEQy3RmzKTZnqtwQ 5 0 0 0 I *adore* Travis at the Hard Rock's new Kelly ... 2017-01-14 21:30:33
2 2TzJjDVDEuAW6MR5Vuc1ug n6-Gk65cPZL6Uz8qRm3NYw WTqjgwHlXbSFevF32_DJVw 5 3 0 0 I have to say that this office really has it t... 2016-11-09 20:09:03
3 yi0R0Ugj_xUx_Nek0-_Qig dacAIZ6fTM6mqwW5uxkskg ikCg8xy5JIg_NGPx-MSIDA 5 0 0 0 Went in for a lunch. Steak sandwich was delici... 2018-01-09 20:56:38
15 svK3nBU7Rk8VfGorlrN52A NJlxGtouq06hhC7sS2ECYw YvrylyuWgbP90RgMqZQVnQ 5 0 0 0 You can't really find anything wrong with this... 2017-04-07 21:27:49
18 rEITo90tpyKmEfNDp3Ou3A 6Fz_nus_OG4gar721OKgZA 6lj2BJ4tJeu7db5asGHQ4w 5 0 0 0 We've been a huge Slim's fan since they opened... 2017-05-26 01:23:19
In [29]:
g_text = good.text.values.compute()

In [30]:
dist_g_text = da.from_array(g_text, chunks=1000)

In [31]:
top_g_words = delayed(get_top_n_words)(dist_g_text,15)

In [32]:
out_g = top_g_words.compute()

In [33]:
out_g

Out[33]:
[('great', 1628982),
('place', 1323345),
('food', 1159451),
('good', 1026461),
('service', 938286),
('time', 882483),
('best', 733795),
('back', 713984),
('one', 713598),
('get', 711772),
('like', 701400),
('go', 666276),
('love', 640133),
('amazing', 635195),
('always', 591582)]

While there is some definite overlap, such as "place" and "good", strongly positive words such as "love", "best" and "amazing" appear more often in good reviews, while weaker words such as "like" and "really" show up in useful reviews. The useful list also contains the distinct set "would", "go", "back", suggesting "would go back" is a commonly occurring phrase in useful reviews.

## Who is the most influential user?

In [34]:
user.head()

Out[34]:
user_id name review_count yelping_since useful funny cool elite friends fans ... compliment_more compliment_profile compliment_cute compliment_list compliment_note compliment_plain compliment_cool compliment_funny compliment_writer compliment_photos
0 l6BmjZMeQD3rDxWUbiAiow Rashmi 95 2013-10-08 23:11:33 84 17 25 2015,2016,2017 c78V-rj8NQcQjOI8KP3UEA, alRMgPcngYSCJ5naFRBz5g... 5 ... 0 0 0 0 1 1 1 1 2 0
1 4XChL029mKr5hydo79Ljxg Jenna 33 2013-02-21 22:29:06 48 22 16 kEBTgDvFX754S68FllfCaA, aB2DynOxNOJK9st2ZeGTPg... 4 ... 0 0 0 0 0 0 1 1 0 0
2 bc8C_eETBWL0olvFSJJd0w David 16 2013-10-04 00:16:10 28 8 10 4N-HU_T32hLENLntsNKNBg, pSY2vwWLgWfGVAAiKQzMng... 0 ... 0 0 0 0 1 0 0 0 0 0
3 dD0gZpBctWGdWo9WlGuhlA Angela 17 2014-05-22 15:57:30 30 4 14 RZ6wS38wnlXyj-OOdTzBxA, l5jxZh1KsgI8rMunm-GN6A... 5 ... 0 0 0 0 0 2 0 0 1 0
4 MM4RJAeH6yuaN8oZDSt0RA Nancy 361 2013-10-23 07:02:50 1114 279 665 2015,2016,2017,2018 mbwrZ-RS76V1HoJ0bF_Geg, g64lOV39xSLRZO0aQQ6DeQ... 39 ... 1 0 0 1 16 57 80 80 25 5

5 rows × 22 columns

In [35]:
elites = user[user['elite'] != '']
elites.describe().compute()

Out[35]:
review_count useful funny cool fans average_stars compliment_hot compliment_more compliment_profile compliment_cute compliment_list compliment_note compliment_plain compliment_cool compliment_funny compliment_writer compliment_photos
count 71377.000000 71377.000000 71377.000000 71377.000000 71377.000000 71377.000000 71377.000000 71377.000000 71377.000000 71377.000000 71377.000000 71377.000000 71377.000000 71377.000000 71377.000000 71377.000000 71377.000000
mean 222.831767 624.389159 316.346764 406.685739 23.255867 3.878246 46.161761 5.661614 4.097903 3.590540 1.585833 27.299004 60.564748 63.232526 63.232526 23.327991 22.549939
std 270.113953 2135.464138 1530.817538 1857.576475 70.265278 0.332848 365.742739 38.537678 52.248527 31.782833 19.919108 293.858759 421.655730 397.645877 397.645877 141.929850 334.541271
min 1.000000 0.000000 0.000000 0.000000 0.000000 2.160000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 115.000000 143.500000 41.000000 60.000000 6.000000 3.700000 2.000000 1.000000 0.000000 0.000000 0.000000 3.000000 4.000000 4.000000 4.000000 3.000000 1.000000
50% 187.000000 312.000000 100.000000 133.750000 13.000000 3.930000 7.000000 2.000000 1.000000 0.000000 0.000000 8.000000 11.000000 13.000000 13.000000 7.000000 3.000000
75% 581.250000 1885.250000 912.000000 1229.750000 75.000000 4.170000 91.000000 13.000000 7.000000 6.000000 3.000000 63.000000 122.250000 152.000000 152.000000 62.000000 31.000000
max 13278.000000 154202.000000 130207.000000 148658.000000 9538.000000 5.000000 34167.000000 3928.000000 6473.000000 2829.000000 2374.000000 57833.000000 52103.000000 32266.000000 32266.000000 12128.000000 44390.000000
In [36]:
famous = elites[elites['fans'] > 75]
famous_sm = famous[['name','review_count','useful','funny','cool']]

In [37]:
famous_sm.head()

Out[37]:
name review_count useful funny cool
5 Marilyn 214 3475 2424 3048
6 Keane 1122 13311 19356 15319
18 Diana 453 3578 1501 2532
20 Aurélie 1563 4172 1661 3246
32 Katharine 412 1816 463 1341
In [38]:
sns.violinplot(x=famous_sm['review_count'])

Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7470498048>
In [39]:
# Removing outliers for better plot
famous_to_plot = famous_sm[(famous_sm['useful'] < 10000) & (famous_sm['funny'] < 10000) & (famous_sm['cool'] < 10000)]

plt.subplot(1,3,1)
sns.violinplot(x=famous_to_plot['useful'],cut=0)

plt.subplot(1,3,2)
sns.violinplot(x=famous_to_plot['funny'],cut=0)

plt.subplot(1,3,3)
sns.violinplot(x=famous_to_plot['cool'],cut=0)

Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f74a92d8f98>

From the plots above, we can assign equal weight to funny and cool, and a little more weight to useful.

In [40]:
famous_sm['score'] = famous_sm.apply(lambda row: (0.4*row.useful + 0.3*row.funny + 0.3*row.cool) / (3*row.review_count), meta=(None, 'float64'), axis=1)

In [41]:
famous_sm[famous_sm['score'] > 15].head()

/opt/conda/lib/python3.6/site-packages/dask/dataframe/core.py:5868: UserWarning: Insufficient elements for head. 5 elements requested, only 4 elements available. Try passing larger npartitions to head.
warnings.warn(msg.format(n, len(r)))

Out[41]:
name review_count useful funny cool score
101 Alan 1379 78486 49550 70825 16.317839
3845 Rodney 851 50327 45166 48429 18.883392
4961 Gem 64 3672 2791 3411 17.340625
6120 Jen 398 19406 16287 18464 15.232580

Clearly, Rodney seems to be the most influential: not only by the defined score, but also by the sheer number of his reviews and their attributes. You can definitely see that he is a loud voice in the Yelp community.

So, hopefully this short exercise was a good introduction to using Dask dataframes and arrays to parallelise work on large datasets. Going forward, the official Dask tutorials are a great next step if you want to continue working with the library.

Thank you for reading and good luck!
