Working with Dask

Last month, in the Plaksha Tech Leaders Fellowship that I am currently pursuing, I began learning about working with big data and deploying things at scale. In the course, I was introduced to a library called Dask.

"Dask is a flexible library for parallel computing in Python." - the first line of its documentation

Unlike Apache Spark, which is written in Scala and runs on the JVM, Dask is a completely Pythonic library that offers similar functionality.

Dask operates in layers. At the highest level, where we usually work, you have arrays, dataframes, machine learning models, and Bags, a collection type unique to Dask.

Below that sits either Delayed or Futures. Delayed builds a directed acyclic graph (DAG) of all the functions and processes that need to be executed before running any of them. Futures, on the other hand, works like eager execution in TensorFlow: tasks start running as soon as they are submitted.
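As a minimal sketch of the Delayed idea (a toy example, not part of the Yelp analysis below):

from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def add(a, b):
    return a + b

total = add(inc(1), inc(2))  # nothing has run yet; Dask has only built the DAG
print(total.compute())       # 5 -- the two inc() calls are independent and can run in parallel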

At the lowest level is the scheduler, which is responsible for actually running the DAGs. While there are several schedulers, we will be working with the threaded scheduler, since we will mainly be using arrays and dataframes. The scheduler can be visualised using the Client, and we will also use that to monitor how it is running.
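As an aside, the scheduler can also be chosen per call rather than globally. Reusing total from the sketch above:

total.compute(scheduler='threads')      # thread pool; a good fit for arrays and dataframes
total.compute(scheduler='synchronous')  # single-threaded; handy for debugging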



Reading in the data

First, we do our imports and read in the data. For this tutorial, we will be using the Yelp dataset.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import dask
import dask.dataframe as dd
import dask.array as da
from dask import delayed
dask.config.set(scheduler='threads')
Out[1]:
<dask.config.set at 0x7f74f56d2ba8>

Here we start the Client, which will enable us to monitor the scheduler.

In [ ]:
from dask.distributed import Client
client = Client(n_workers=8, threads_per_worker=1, processes=False, scheduler_port=0)
client

If you follow the link shown by the client, you will see a dashboard that helps you monitor your Dask tasks. Once you run a command, you should see something similar to this:

[Figure: task stream dashboard, "All is well"]

All the green means that the tasks were parallelised well and utilised all available hardware efficiently. But if it looks something like this:

[Figure: task stream dashboard, "All is not well"]

This means that the "workers" (the cores of your processor) were not fully utilised.

Another screen is the graph screen, where you can see the DAG of your task.

[Figure: the graph screen showing a task DAG]

The blue nodes are completed subtasks, the green ones are pending, and the red ones are in memory and currently being processed.

Finally, you can go to the workers tab to see all your processor cores, their memory, and the load they are taking.

[Figure: the workers tab]

You can also go to the status tab to see graphs of how the disk, processor, network, etc. are doing.

[Figure: the status tab]

While running this notebook, it will be helpful to keep the client open in one tab and watch how the tasks parallelise and make your computer(s?) turn up the heat.

In [2]:
review = dd.read_json("/kaggle/input/yelp-dataset/yelp_academic_dataset_review.json",lines = True, encoding = 'utf-8', blocksize="100MB")
business = dd.read_json("/kaggle/input/yelp-dataset/yelp_academic_dataset_business.json",lines = True, encoding = 'utf-8', blocksize="100MB")
user = dd.read_json("/kaggle/input/yelp-dataset/yelp_academic_dataset_user.json",lines = True, encoding = 'utf-8', blocksize="100MB")
checkin = dd.read_json("/kaggle/input/yelp-dataset/yelp_academic_dataset_checkin.json",lines = True, encoding = 'utf-8', blocksize="100MB")
tip = dd.read_json("/kaggle/input/yelp-dataset/yelp_academic_dataset_tip.json",lines = True, encoding = 'utf-8', blocksize="100MB")
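Note that these read_json calls are lazy: each one only records the file layout and splits it into roughly 100MB partitions, and nothing is actually loaded until we ask for a result. A quick way to inspect the partitioning:

print(review.npartitions)  # the number of ~100MB partitions Dask will work on in parallel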
In [3]:
review.head()
Out[3]:
review_id user_id business_id stars useful funny cool text date
0 Q1sbwvVQXV2734tPgoKj4Q hG7b0MtEbXx5QzbzE6C_VA ujmEBvifdJM6h6RLv4wQIg 1 6 1 0 Total bill for this horrible service? Over $8G... 2013-05-07 04:34:36
1 GJXCdrto3ASJOqKeVWPi6Q yXQM5uF2jS6es16SJzNHfg NZnhc2sEQy3RmzKTZnqtwQ 5 0 0 0 I *adore* Travis at the Hard Rock's new Kelly ... 2017-01-14 21:30:33
2 2TzJjDVDEuAW6MR5Vuc1ug n6-Gk65cPZL6Uz8qRm3NYw WTqjgwHlXbSFevF32_DJVw 5 3 0 0 I have to say that this office really has it t... 2016-11-09 20:09:03
3 yi0R0Ugj_xUx_Nek0-_Qig dacAIZ6fTM6mqwW5uxkskg ikCg8xy5JIg_NGPx-MSIDA 5 0 0 0 Went in for a lunch. Steak sandwich was delici... 2018-01-09 20:56:38
4 11a8sVPMUFtaC7_ABRkmtw ssoyf2_x0EQMed6fgHeMyQ b1b1eb3uo-w561D0ZfCEiQ 1 7 0 0 Today was my second out of three sessions I ha... 2018-01-30 23:07:38
In [4]:
business.head()
Out[4]:
business_id name address city state postal_code latitude longitude stars review_count is_open attributes categories hours
0 1SWheh84yJXfytovILXOAQ Arizona Biltmore Golf Club 2818 E Camino Acequia Drive Phoenix AZ 85016 33.522143 -112.018481 3.0 5 0 {'GoodForKids': 'False'} Golf, Active Life None
1 QXAEGFB4oINsVuTFxEYKFQ Emerald Chinese Restaurant 30 Eglinton Avenue W Mississauga ON L5R 3E7 43.605499 -79.652289 2.5 128 1 {'RestaurantsReservations': 'True', 'GoodForMe... Specialty Food, Restaurants, Dim Sum, Imported... {'Monday': '9:0-0:0', 'Tuesday': '9:0-0:0', 'W...
2 gnKjwL_1w79qoiV3IC_xQQ Musashi Japanese Restaurant 10110 Johnston Rd, Ste 15 Charlotte NC 28210 35.092564 -80.859132 4.0 170 1 {'GoodForKids': 'True', 'NoiseLevel': 'u'avera... Sushi Bars, Restaurants, Japanese {'Monday': '17:30-21:30', 'Wednesday': '17:30-...
3 xvX2CttrVhyG2z1dFg_0xw Farmers Insurance - Paul Lorenz 15655 W Roosevelt St, Ste 237 Goodyear AZ 85338 33.455613 -112.395596 5.0 3 1 None Insurance, Financial Services {'Monday': '8:0-17:0', 'Tuesday': '8:0-17:0', ...
4 HhyxOkGAM07SRYtlQ4wMFQ Queen City Plumbing 4209 Stuart Andrew Blvd, Ste F Charlotte NC 28217 35.190012 -80.887223 4.0 4 1 {'BusinessAcceptsBitcoin': 'False', 'ByAppoint... Plumbing, Shopping, Local Services, Home Servi... {'Monday': '7:0-23:0', 'Tuesday': '7:0-23:0', ...
In [5]:
user.head()
Out[5]:
user_id name review_count yelping_since useful funny cool elite friends fans ... compliment_more compliment_profile compliment_cute compliment_list compliment_note compliment_plain compliment_cool compliment_funny compliment_writer compliment_photos
0 l6BmjZMeQD3rDxWUbiAiow Rashmi 95 2013-10-08 23:11:33 84 17 25 2015,2016,2017 c78V-rj8NQcQjOI8KP3UEA, alRMgPcngYSCJ5naFRBz5g... 5 ... 0 0 0 0 1 1 1 1 2 0
1 4XChL029mKr5hydo79Ljxg Jenna 33 2013-02-21 22:29:06 48 22 16 kEBTgDvFX754S68FllfCaA, aB2DynOxNOJK9st2ZeGTPg... 4 ... 0 0 0 0 0 0 1 1 0 0
2 bc8C_eETBWL0olvFSJJd0w David 16 2013-10-04 00:16:10 28 8 10 4N-HU_T32hLENLntsNKNBg, pSY2vwWLgWfGVAAiKQzMng... 0 ... 0 0 0 0 1 0 0 0 0 0
3 dD0gZpBctWGdWo9WlGuhlA Angela 17 2014-05-22 15:57:30 30 4 14 RZ6wS38wnlXyj-OOdTzBxA, l5jxZh1KsgI8rMunm-GN6A... 5 ... 0 0 0 0 0 2 0 0 1 0
4 MM4RJAeH6yuaN8oZDSt0RA Nancy 361 2013-10-23 07:02:50 1114 279 665 2015,2016,2017,2018 mbwrZ-RS76V1HoJ0bF_Geg, g64lOV39xSLRZO0aQQ6DeQ... 39 ... 1 0 0 1 16 57 80 80 25 5

5 rows × 22 columns

In [6]:
checkin.head()
Out[6]:
business_id date
0 --1UhMGODdWsrMastO9DZw 2016-04-26 19:49:16, 2016-08-30 18:36:57, 2016...
1 --6MefnULPED_I942VcFNA 2011-06-04 18:22:23, 2011-07-23 23:51:33, 2012...
2 --7zmmkVg-IMGaXbuVd0SQ 2014-12-29 19:25:50, 2015-01-17 01:49:14, 2015...
3 --8LPVSo5i0Oo61X01sV9A 2016-07-08 16:43:30
4 --9QQLMTbFzLJ_oT-ON3Xw 2010-06-26 17:39:07, 2010-08-01 20:06:21, 2010...
In [7]:
tip.head()
Out[7]:
user_id business_id text date compliment_count
0 UPw5DWs_b-e2JRBS-t37Ag VaKXUpmWTTWDKbpJ3aQdMw Great for watching games, ufc, and whatever el... 2014-03-27 03:51:24 0
1 Ocha4kZBHb4JK0lOWvE0sg OPiPeoJiv92rENwbq76orA Happy Hour 2-4 daily with 1/2 price drinks and... 2013-05-25 06:00:56 0
2 jRyO2V1pA4CdVVqCIOPc1Q 5KheTjYPu1HcQzQFtm4_vw Good chips and salsa. Loud at times. Good serv... 2011-12-26 01:46:17 0
3 FuTJWFYm4UKqewaosss1KA TkoyGi8J7YFjA6SbaRzrxg The setting and decoration here is amazing. Co... 2014-03-23 21:32:49 0
4 LUlKtaM3nXd-E4N4uOk_fQ AkL6Ous6A1atZejfZXn1Bg Molly is definately taking a picture with Sant... 2012-10-06 00:19:27 0

Now that we know what sort of data we have, let us pick three things that we can possibly learn from it:

  1. Which state has the best restaurants? What makes them so special?
  2. What is the difference between a good review and a useful review?
  3. Who are the most influential users?

State with the best restaurants

In [8]:
star_list = business.groupby('state').stars.mean().compute()
In [9]:
plt.figure(figsize=(10,10))
sns.barplot(star_list.index,star_list.values)
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f747ee25f60>
In [10]:
star_list.sort_values(ascending=False)
Out[10]:
state
TN     5.000000
NJ     5.000000
XWY    4.500000
XGL    4.500000
VT     4.250000
TX     4.166667
CA     4.026316
BAS    4.000000
VA     4.000000
XGM    3.875000
AL     3.833333
GA     3.750000
AZ     3.707185
NV     3.696423
QC     3.635535
WI     3.610691
PA     3.577523
NC     3.542187
OH     3.505341
SC     3.503873
CON    3.500000
CT     3.500000
DUR    3.500000
IL     3.464286
AB     3.385359
ON     3.356504
NY     3.250000
NE     3.000000
DOW    3.000000
AK     2.750000
FL     2.500000
NM     2.500000
WA     2.333333
UT     2.000000
AR     2.000000
BC     1.500000
Name: stars, dtype: float64

Clearly, Tennessee and New Jersey have the highest star ratings. But what about the number of reviews? If they have fewer reviews, the averages could be biased. Let us check that.

In [11]:
rev_count_list = business.groupby('state').review_count.sum().compute()
In [12]:
rev_count_list.sort_values()
Out[12]:
state
XGL          3
DUR          3
CON          3
BC           3
TN           3
DOW          4
BAS          4
UT           4
AK           7
AR           7
XWY          8
NJ           8
VT           8
AL          12
NE          12
XGM         13
GA          14
NM          14
VA          16
CT          16
WA          18
CA         247
NY         273
FL         698
TX        1052
SC       20467
IL       41021
AB       96764
WI      129609
QC      175745
PA      281129
OH      310545
NC      394317
ON      761180
AZ     2003145
NV     2243534
Name: review_count, dtype: int64

Aha! So that intuition was right. Those states have only a handful of reviews (3 and 8, respectively), which makes for a very skewed sample. Now we can take out the states that don't have at least a three-digit count of reviews and revisit the star ratings.

In [13]:
sufficient_rev_states = rev_count_list.loc[rev_count_list > 100].index.tolist()
In [14]:
star_list = business.groupby('state').stars.mean().compute()
In [15]:
star_rev = pd.concat([star_list[sufficient_rev_states], rev_count_list.loc[rev_count_list > 100]], axis=1)
star_rev.sort_values(by='stars',ascending=False)
Out[15]:
stars review_count
state
TX 4.166667 1052
CA 4.026316 247
AZ 3.707185 2003145
NV 3.696423 2243534
QC 3.635535 175745
WI 3.610691 129609
PA 3.577523 281129
NC 3.542187 394317
OH 3.505341 310545
SC 3.503873 20467
IL 3.464286 41021
AB 3.385359 96764
ON 3.356504 761180
NY 3.250000 273
FL 2.500000 698

This still does not seem to show the whole picture. If we create a new feature, the average number of stars per review, it could be a more accurate representation of how good the restaurants in a state are.

In [16]:
star_count = business.groupby('state').stars.sum().compute()[sufficient_rev_states]
In [17]:
star_rev['star_count'] = star_count
In [18]:
star_rev['stars_per_review'] = star_rev.apply(lambda row: row.star_count / row.review_count, axis = 1)
star_rev.rename(columns={'stars':'star_mean'},inplace=True)
star_rev.sort_values(by='stars_per_review',ascending=False)
Out[18]:
star_mean review_count star_count stars_per_review
state
CA 4.026316 247 76.5 0.309717
AB 3.385359 96764 27123.5 0.280306
NY 3.250000 273 71.5 0.261905
SC 3.503873 20467 4071.5 0.198930
QC 3.635535 175745 33516.0 0.190708
OH 3.505341 310545 51518.0 0.165895
IL 3.464286 41021 6693.0 0.163160
ON 3.356504 761180 112147.5 0.147334
WI 3.610691 129609 18609.5 0.143582
PA 3.577523 281129 40125.5 0.142730
NC 3.542187 394317 52141.0 0.132231
AZ 3.707185 2003145 210145.5 0.104908
NV 3.696423 2243534 134224.5 0.059827
TX 4.166667 1052 25.0 0.023764
FL 2.500000 698 10.0 0.014327

California is the state with the highest stars per review. It can be thought of as the state with the best restaurants.

Well unless more reviews come in with low ratings of course.

Difference between a good review and a useful review

To find the difference, let us try to find the most commonly appearing words in a good review (5 stars) and in a useful review (above the 75th percentile of useful votes).

In [19]:
review.describe().compute()
Out[19]:
stars useful funny cool
count 6.685900e+06 6.685900e+06 6.685900e+06 6.685900e+06
mean 3.716199e+00 1.354134e+00 4.827667e-01 5.787708e-01
std 1.463643e+00 3.700192e+00 2.378646e+00 2.359024e+00
min 1.000000e+00 -1.000000e+00 0.000000e+00 -1.000000e+00
25% 3.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
50% 4.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
75% 5.000000e+00 2.000000e+00 0.000000e+00 1.000000e+00
max 5.000000e+00 1.241000e+03 1.290000e+03 5.060000e+02
In [20]:
useful = review.loc[review['useful'] > 2]
useful.head()
Out[20]:
review_id user_id business_id stars useful funny cool text date
0 Q1sbwvVQXV2734tPgoKj4Q hG7b0MtEbXx5QzbzE6C_VA ujmEBvifdJM6h6RLv4wQIg 1 6 1 0 Total bill for this horrible service? Over $8G... 2013-05-07 04:34:36
2 2TzJjDVDEuAW6MR5Vuc1ug n6-Gk65cPZL6Uz8qRm3NYw WTqjgwHlXbSFevF32_DJVw 5 3 0 0 I have to say that this office really has it t... 2016-11-09 20:09:03
4 11a8sVPMUFtaC7_ABRkmtw ssoyf2_x0EQMed6fgHeMyQ b1b1eb3uo-w561D0ZfCEiQ 1 7 0 0 Today was my second out of three sessions I ha... 2018-01-30 23:07:38
6 G7XHMxG0bx9oBJNECG4IFg jlu4CztcSxrKx56ba1a5AQ 3fw2X5bZYeW9xCz_zGhOHg 3 5 4 5 Tracy dessert had a big name in Hong Kong and ... 2016-05-07 01:21:02
7 8e9HxxLjjqc9ez5ezzN7iQ d6xvYpyzcfbF_AZ8vMB7QA zvO-PJCpNk4fgAVUnExYAA 1 3 1 1 This place has gone down hill. Clearly they h... 2010-10-05 19:12:35
In [21]:
u_text = useful.text.values.compute()

Here we use a Dask array instead of storing the text in a normal array. The chunks of size 1000 will let us run functions on the chunks in parallel and aggregate the results before outputting them.
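To see in isolation what the chunking buys us, here is a tiny sketch with a numeric array (a toy example, separate from the review text):

x = da.from_array(np.arange(10_000), chunks=1000)  # 10 chunks of 1000 elements each
print(x.sum().compute())  # each chunk is summed in parallel, then the partial sums are combined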

In [22]:
# Printing some sample reviews
print(u_text[0])
print(u_text[1])
print(u_text[2])
dist_u_text = da.from_array(u_text, chunks=1000)
Total bill for this horrible service? Over $8Gs. These crooks actually had the nerve to charge us $69 for 3 pills. I checked online the pills can be had for 19 cents EACH! Avoid Hospital ERs at all costs.
I have to say that this office really has it together, they are so organized and friendly!  Dr. J. Phillipp is a great dentist, very friendly and professional.  The dental assistants that helped in my procedure were amazing, Jewel and Bailey helped me to feel comfortable!  I don't have dental insurance, but they have this insurance through their office you can purchase for $80 something a year and this gave me 25% off all of my dental work, plus they helped me get signed up for care credit which I knew nothing about before this visit!  I highly recommend this office for the nice synergy the whole office has!
Today was my second out of three sessions I had paid for. Although my first session went well, I could tell Meredith had a particular enjoyment for her male clients over her female. However, I returned because she did my teeth fine and I was pleased with the results. When I went in today, I was in the whitening room with three other gentlemen. My appointment started out well, although, being a person who is in the service industry, I always attend to my female clientele first when a couple arrives. Unbothered by those signs, I waited my turn. She checked on me once after my original 30 minute timer to ask if I was ok. She attended my boyfriend on numerous occasions, as well as the other men, and would exit the room without even asking me or looking to see if I had any irritation. Half way through, another woman had showed up who she was explaining the deals to in the lobby. While she admits timers must be reset half way through the process, she reset my boyfriends, left, rest the gentleman furthest away from me who had time to come in, redeem his deal, get set, and gave his timer done, before me, then left, and at this point my time was at 10 minutes. So, she should have reset it 5 minutes ago, according to her. While I sat there patiently this whole time with major pain in my gums, i watched the time until the lamp shut off. Not only had she reset two others, explained deals to other guest, but she never once checked on my time. When my light turned off, I released the stance of my mouth to a more relaxed state, assuming I was only getting a thirty minute session instead of the usual 45, because she had yet to come in. At this point, the teeth formula was not only burning the gum she neglected for 25 minutes now, but it began to burn my lips. I began squealing and slapping my chair trying to get her attention from the other room in a panic. I was in so much pain, that by the time she entered the room I was already out of my chair. She finally then acknowledged me, and asked if she could put vitamin E on my gum burn (pictured below). At this point, she has treated two other gums burns, while neglecting me, and I was so irritated that I had to suffer, all I wanted was to leave. While I waited for my boyfriend, she kept harassing me about the issue. Saying, "well burns come with teeth whitening." While I totally agree, and under justifiable circumstances would not be as irritate, it could have easily been avoid if she had checked on me even a second time, so I could let her know. Not only did she never check on my physical health, she couldn't even take two seconds to reset the timer, which she even admitted to me. Her accuse was that she was coming in to do it, but I had the light off for a solid two minutes before I couldn't stand the pain. She admitted it should be reset every 15 minutes, which means for 25 minutes she did not bother to help me at all. Her guest in the lobby then proceeded to attack me as well, simply because I wanted to leave after the way I was treated. I also expected a refund for not getting a complete session today, due to the neglect, and the fact I won't be returning for my last, she had failed to do that. She was even screaming from the door, and continued to until my boyfriend and I were down the steps. I have never in my life been more appalled by a grown woman's behavior, who claims to be in the business for "10 years." Admit your wrongs, but don't make your guest feel unwelcome because you can't do you job properly.

CountVectorizer is a scikit-learn class that will help us create a term-frequency matrix. We also use NLTK (the Natural Language Toolkit) to remove stopwords, i.e. words such as "the", "and", and "a". These stopwords are almost certainly the most commonly occurring words and would introduce noise in our outputs, so we remove them.

In [23]:
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))
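To make the term-frequency matrix concrete, here is a tiny example on a made-up corpus (illustration only):

docs = ["good food", "good service", "bad food"]  # hypothetical mini-corpus
vec = CountVectorizer().fit(docs)
print(vec.vocabulary_)                # maps each word to a column index
print(vec.transform(docs).toarray())  # one row per document, one column per word, entries are counts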

We modify the function written by Cristhian Boujon in one of his blog posts to accommodate the removal of stopwords.

In [24]:
# Function from : https://gist.github.com/CristhianBoujon/c719ba2287a630a6d3821d37a9608ac8#file-get_top_n_words-py
def get_top_n_words(corpus, n=None):
    """
    List the top n non-stopword words in a vocabulary according to occurrence in a text corpus.
    
    get_top_n_words(["I love Python", "Python is a language programming", "Hello world", "I love the world"]) -> 
    [('python', 2),
     ('world', 2),
     ('love', 2),
     ('hello', 1),
     ('programming', 1),
     ('language', 1)]
    """
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = []
    for word, idx in vec.vocabulary_.items():
        if word not in stop:
            words_freq.append((word, sum_words[0, idx]))
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

We use the delayed wrapper so that the function call becomes a lazy task in the graph, executed only when we call compute().

In [25]:
top_u_words = delayed(get_top_n_words)(dist_u_text,15)
In [26]:
out_u = top_u_words.compute()
In [27]:
out_u
Out[27]:
[('place', 648874),
 ('like', 627164),
 ('good', 606280),
 ('one', 591128),
 ('get', 578194),
 ('food', 558900),
 ('time', 537234),
 ('would', 533604),
 ('back', 453894),
 ('great', 436558),
 ('service', 430809),
 ('go', 403360),
 ('really', 396775),
 ('also', 359216),
 ('us', 345261)]
In [28]:
good = review.loc[review['stars'] == 5]
good.head()
Out[28]:
review_id user_id business_id stars useful funny cool text date
1 GJXCdrto3ASJOqKeVWPi6Q yXQM5uF2jS6es16SJzNHfg NZnhc2sEQy3RmzKTZnqtwQ 5 0 0 0 I *adore* Travis at the Hard Rock's new Kelly ... 2017-01-14 21:30:33
2 2TzJjDVDEuAW6MR5Vuc1ug n6-Gk65cPZL6Uz8qRm3NYw WTqjgwHlXbSFevF32_DJVw 5 3 0 0 I have to say that this office really has it t... 2016-11-09 20:09:03
3 yi0R0Ugj_xUx_Nek0-_Qig dacAIZ6fTM6mqwW5uxkskg ikCg8xy5JIg_NGPx-MSIDA 5 0 0 0 Went in for a lunch. Steak sandwich was delici... 2018-01-09 20:56:38
15 svK3nBU7Rk8VfGorlrN52A NJlxGtouq06hhC7sS2ECYw YvrylyuWgbP90RgMqZQVnQ 5 0 0 0 You can't really find anything wrong with this... 2017-04-07 21:27:49
18 rEITo90tpyKmEfNDp3Ou3A 6Fz_nus_OG4gar721OKgZA 6lj2BJ4tJeu7db5asGHQ4w 5 0 0 0 We've been a huge Slim's fan since they opened... 2017-05-26 01:23:19
In [29]:
g_text = good.text.values.compute()
In [30]:
dist_g_text = da.from_array(g_text, chunks=1000)
In [31]:
top_g_words = delayed(get_top_n_words)(dist_g_text,15)
In [32]:
out_g = top_g_words.compute()
In [33]:
out_g
Out[33]:
[('great', 1628982),
 ('place', 1323345),
 ('food', 1159451),
 ('good', 1026461),
 ('service', 938286),
 ('time', 882483),
 ('best', 733795),
 ('back', 713984),
 ('one', 713598),
 ('get', 711772),
 ('like', 701400),
 ('go', 666276),
 ('love', 640133),
 ('amazing', 635195),
 ('always', 591582)]

While there is some definite overlap, such as "place" and "good", strongly positive words such as "love", "best", and "amazing" appear more in the good reviews, while milder words such as "like" and "really" show up in the useful reviews. The useful reviews also feature the distinct trio "would", "go", "back", which suggests "would go back" is a commonly occurring phrase in them.

Who is the most influential user?

In [34]:
user.head()
Out[34]:
user_id name review_count yelping_since useful funny cool elite friends fans ... compliment_more compliment_profile compliment_cute compliment_list compliment_note compliment_plain compliment_cool compliment_funny compliment_writer compliment_photos
0 l6BmjZMeQD3rDxWUbiAiow Rashmi 95 2013-10-08 23:11:33 84 17 25 2015,2016,2017 c78V-rj8NQcQjOI8KP3UEA, alRMgPcngYSCJ5naFRBz5g... 5 ... 0 0 0 0 1 1 1 1 2 0
1 4XChL029mKr5hydo79Ljxg Jenna 33 2013-02-21 22:29:06 48 22 16 kEBTgDvFX754S68FllfCaA, aB2DynOxNOJK9st2ZeGTPg... 4 ... 0 0 0 0 0 0 1 1 0 0
2 bc8C_eETBWL0olvFSJJd0w David 16 2013-10-04 00:16:10 28 8 10 4N-HU_T32hLENLntsNKNBg, pSY2vwWLgWfGVAAiKQzMng... 0 ... 0 0 0 0 1 0 0 0 0 0
3 dD0gZpBctWGdWo9WlGuhlA Angela 17 2014-05-22 15:57:30 30 4 14 RZ6wS38wnlXyj-OOdTzBxA, l5jxZh1KsgI8rMunm-GN6A... 5 ... 0 0 0 0 0 2 0 0 1 0
4 MM4RJAeH6yuaN8oZDSt0RA Nancy 361 2013-10-23 07:02:50 1114 279 665 2015,2016,2017,2018 mbwrZ-RS76V1HoJ0bF_Geg, g64lOV39xSLRZO0aQQ6DeQ... 39 ... 1 0 0 1 16 57 80 80 25 5

5 rows × 22 columns

In [35]:
elites = user[user['elite'] != '']
elites.describe().compute()
Out[35]:
review_count useful funny cool fans average_stars compliment_hot compliment_more compliment_profile compliment_cute compliment_list compliment_note compliment_plain compliment_cool compliment_funny compliment_writer compliment_photos
count 71377.000000 71377.000000 71377.000000 71377.000000 71377.000000 71377.000000 71377.000000 71377.000000 71377.000000 71377.000000 71377.000000 71377.000000 71377.000000 71377.000000 71377.000000 71377.000000 71377.000000
mean 222.831767 624.389159 316.346764 406.685739 23.255867 3.878246 46.161761 5.661614 4.097903 3.590540 1.585833 27.299004 60.564748 63.232526 63.232526 23.327991 22.549939
std 270.113953 2135.464138 1530.817538 1857.576475 70.265278 0.332848 365.742739 38.537678 52.248527 31.782833 19.919108 293.858759 421.655730 397.645877 397.645877 141.929850 334.541271
min 1.000000 0.000000 0.000000 0.000000 0.000000 2.160000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 115.000000 143.500000 41.000000 60.000000 6.000000 3.700000 2.000000 1.000000 0.000000 0.000000 0.000000 3.000000 4.000000 4.000000 4.000000 3.000000 1.000000
50% 187.000000 312.000000 100.000000 133.750000 13.000000 3.930000 7.000000 2.000000 1.000000 0.000000 0.000000 8.000000 11.000000 13.000000 13.000000 7.000000 3.000000
75% 581.250000 1885.250000 912.000000 1229.750000 75.000000 4.170000 91.000000 13.000000 7.000000 6.000000 3.000000 63.000000 122.250000 152.000000 152.000000 62.000000 31.000000
max 13278.000000 154202.000000 130207.000000 148658.000000 9538.000000 5.000000 34167.000000 3928.000000 6473.000000 2829.000000 2374.000000 57833.000000 52103.000000 32266.000000 32266.000000 12128.000000 44390.000000
In [36]:
famous = elites[elites['fans'] > 75]
famous_sm = famous[['name','review_count','useful','funny','cool']]
In [37]:
famous_sm.head()
Out[37]:
name review_count useful funny cool
5 Marilyn 214 3475 2424 3048
6 Keane 1122 13311 19356 15319
18 Diana 453 3578 1501 2532
20 Aurélie 1563 4172 1661 3246
32 Katharine 412 1816 463 1341
In [38]:
sns.violinplot(x=famous_sm['review_count'])
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7470498048>
In [39]:
# Removing outliers for better plot
famous_to_plot = famous_sm[(famous_sm['useful'] < 10000) & (famous_sm['funny'] < 10000) & (famous_sm['cool'] < 10000)]

plt.subplot(1,3,1)
sns.violinplot(x=famous_to_plot['useful'],cut=0)

plt.subplot(1,3,2)
sns.violinplot(x=famous_to_plot['funny'],cut=0)

plt.subplot(1,3,3)
sns.violinplot(x=famous_to_plot['cool'],cut=0)
Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f74a92d8f98>

From the above, we can assign equal weight to funny and cool, and a little more weight to useful.

In [40]:
famous_sm['score'] = famous_sm.apply(lambda row: (0.4*row.useful + 0.3*row.funny + 0.3*row.cool)/(3*row.review_count) , meta=(None, 'float64'),axis=1)
In [41]:
famous_sm[famous_sm['score'] > 15].head()
/opt/conda/lib/python3.6/site-packages/dask/dataframe/core.py:5868: UserWarning: Insufficient elements for `head`. 5 elements requested, only 4 elements available. Try passing larger `npartitions` to `head`.
  warnings.warn(msg.format(n, len(r)))
Out[41]:
name review_count useful funny cool score
101 Alan 1379 78486 49550 70825 16.317839
3845 Rodney 851 50327 45166 48429 18.883392
4961 Gem 64 3672 2791 3411 17.340625
6120 Jen 398 19406 16287 18464 15.232580
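As a sanity check on the formula, Rodney's row works out to (0.4 × 50327 + 0.3 × 45166 + 0.3 × 48429) / (3 × 851) = 48209.3 / 2553 ≈ 18.88, which matches the score column above.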

Clearly, Rodney seems to be the most influential, and not only by the defined score: looking at the number of his reviews and their attributes, you can definitely see that he is a loud voice in the Yelp community.


So, hopefully this short exercise was a good introduction to using Dask dataframes and arrays to parallelise work on large datasets. Going forward, the official Dask tutorials would be great to go through if you want to continue working with this library.

Thank you for reading and good luck!
