Jordi Casanellas

Intro to Machine Learning with Spark

24/9/2015


 
Did you know that you can apply machine learning algorithms to big data very easily? What makes it simple is Spark and its machine learning library, MLlib. And it gets even simpler with PySpark, the Python API.

To see how this works in practice, please take a look at this notebook:
              Spark_MLlib_Classification
The notebook covers everything in more detail, but this post walks through the most important steps. As an example, we use the publicly available product data from the online shop Otto. The data set contains more than 60k products with 93 numerical features (yes, it is small data) and their classification into 9 product categories. We will try to predict the product category using MLlib classification algorithms.

First, we want to use PySpark interactively in the IPython notebook (Jupyter). To do that, we use findspark:
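
A minimal sketch of that setup, assuming Spark is installed locally and that findspark can locate it (for example through the SPARK_HOME environment variable). The helper name `start_spark`, the application name and the `local[*]` master are illustrative choices, not taken from the notebook.

```python
def start_spark(app_name="otto-mllib", master="local[*]"):
    """Make pyspark importable from a notebook and return a SparkContext."""
    # Imported lazily so that defining this helper does not itself require Spark.
    import findspark

    # Locates the Spark installation and adds pyspark to sys.path.
    findspark.init()

    from pyspark import SparkContext

    # local[*] runs Spark on this machine using all available cores.
    return SparkContext(master=master, appName=app_name)
```

In a notebook cell, `sc = start_spark()` would then give the context used in the following steps.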

    
Then, we create an RDD (the basic type of dataset in Spark) from our sample train set:
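
A sketch of how the loading step could look. It assumes the file is the Otto `train.csv` with a header row that starts with `id,`; the helper names `is_header`, `parse_line` and `load_products` are hypothetical, and `sc` is an existing SparkContext.

```python
def is_header(line):
    """True for the CSV header row, assumed here to start with 'id,'."""
    return line.startswith("id,")


def parse_line(line):
    """Split one CSV row into a list of string fields."""
    return [field.strip() for field in line.split(",")]


def load_products(sc, path="train.csv"):
    """Read the CSV as an RDD of field lists, skipping the header row."""
    # textFile yields one string per line of the file.
    lines = sc.textFile(path)
    return lines.filter(lambda line: not is_header(line)).map(parse_line)
```

Filtering the header inside Spark avoids having to prepare a separate header-free copy of the file.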

    
We can process the raw data and explore it using built-in functions like map, distinct, count, countByValue, etc. (see notebook). With the Pandas and Seaborn libraries we can plot, in just a few lines of code, the distribution of the values of each feature separated by class:
Figure: Box plots of the feature values, grouped by class. Only 3 of the 93 features of the dataset are shown.
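
The built-in functions mentioned above can be combined roughly as follows. This is only a sketch: it assumes `rows` is an RDD of parsed rows (lists of strings) whose last column is the class, and the names `class_label` and `summarize` are made up for illustration.

```python
def class_label(fields):
    """Return the class of a parsed row, assumed to be its last column."""
    return fields[-1]


def summarize(rows):
    """Print basic counts for an RDD of parsed rows (lists of strings)."""
    # count() triggers a Spark job and returns the number of rows.
    print("rows: %d" % rows.count())

    labels = rows.map(class_label)
    # distinct() removes duplicates, so this is the number of classes.
    print("classes: %d" % labels.distinct().count())

    # countByValue() returns a dict-like mapping from value to occurrences.
    for label, n in sorted(labels.countByValue().items()):
        print("%s %d" % (label, n))
```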
Now let's load our machine learning weapons from MLlib.

    
We must label the RDD using LabeledPoint, whose arguments are the label (the class, in the last column) and the features. Note that the features are first converted to a dense vector:
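
A hedged sketch of the labeling step. It assumes each parsed row has the form `[id, feat_1, ..., feat_93, 'Class_k']`, as in the Otto file; if the id column has already been dropped, the slice has to change. The helper names are hypothetical.

```python
def split_label_features(fields):
    """Turn ['id', f1, ..., fN, 'Class_k'] into (label, features)."""
    # Labels must be numeric; MLlib's tree classifiers additionally expect
    # 0, 1, ..., numClasses - 1, so 'Class_1' becomes 0.0.
    label = float(int(fields[-1].split("_")[1]) - 1)
    # Skip the first column (row id) and the last column (class).
    features = [float(value) for value in fields[1:-1]]
    return label, features


def to_labeled_point(fields):
    """Build an MLlib LabeledPoint with a dense feature vector."""
    # Imported lazily so the pure-Python helper above works without Spark.
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint

    label, features = split_label_features(fields)
    return LabeledPoint(label, Vectors.dense(features))
```

With an RDD of parsed rows called `rows`, the labeled data would be `rows.map(to_labeled_point)`.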

    
Then, after shuffling and splitting the data, we are ready to use the MLlib classification algorithms, such as the Naive Bayes classifier:
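
A sketch of the split and the Naive Bayes step, under stated assumptions: `labeled` is an RDD of LabeledPoint, the 80/20 split and the seed are arbitrary choices, and `randomSplit` stands in for a separate shuffle because it assigns each row to a part at random. The function names are illustrative, not the notebook's.

```python
def accuracy(pairs):
    """Fraction of (predicted, actual) pairs that agree."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no predictions to score")
    hits = sum(1 for predicted, actual in pairs if predicted == actual)
    return hits / float(len(pairs))


def train_naive_bayes(labeled, train_fraction=0.8, seed=42):
    """Split an RDD of LabeledPoint, fit Naive Bayes, score the held-out part."""
    from pyspark.mllib.classification import NaiveBayes

    # Each row lands in the train or test part at random.
    train, test = labeled.randomSplit(
        [train_fraction, 1.0 - train_fraction], seed=seed)

    # The second argument is the additive smoothing parameter.
    model = NaiveBayes.train(train, 1.0)

    # The held-out set is small here, so collecting it to the driver is fine.
    pairs = test.map(lambda p: (model.predict(p.features), p.label)).collect()
    return model, accuracy(pairs)
```

Naive Bayes in MLlib requires non-negative feature values, which is worth checking before applying this sketch to other data.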

    
In the notebook you will find examples of how to use decision trees, random forest classifiers and gradient boosted trees.
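
For orientation, the tree-based classifiers can be trained along these lines. This is a sketch, not the notebook's code: the hyperparameters are arbitrary, `train` is assumed to be an RDD of LabeledPoint with labels 0 to 8, and gradient boosted trees are left out because MLlib's `GradientBoostedTrees.trainClassifier` handles only binary labels.

```python
def train_tree_models(train, num_classes=9, num_trees=50, seed=42):
    """Fit a decision tree and a random forest on an RDD of LabeledPoint."""
    from pyspark.mllib.tree import DecisionTree, RandomForest

    # An empty dict declares every feature as continuous.
    categorical = {}

    tree = DecisionTree.trainClassifier(
        train, numClasses=num_classes, categoricalFeaturesInfo=categorical,
        impurity="gini", maxDepth=5, maxBins=32)

    forest = RandomForest.trainClassifier(
        train, numClasses=num_classes, categoricalFeaturesInfo=categorical,
        numTrees=num_trees, featureSubsetStrategy="auto",
        impurity="gini", maxDepth=5, maxBins=32, seed=seed)

    return {"decision_tree": tree, "random_forest": forest}
```

Unlike the Naive Bayes model, these tree models cannot be called inside an RDD transformation in Python; predictions are obtained with `model.predict(test.map(lambda p: p.features))` and then zipped with the labels.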
8 Comments
Dr Alan Beckles
29/1/2017 04:45:42

When I try to load data with "train_noheader.csv" I receive an error msg with product_data_raw.first() telling me that the RDD is empty. When data is loaded with "train.csv" product_data_raw.first() outputs the header. How can the header be removed from the RDD?

Jordi
29/1/2017 16:19:53

When you load the data with header from "train.csv" you can remove the header from the RDD a posteriori with:

product_data_raw = product_data_raw.zipWithIndex().filter(lambda pair: pair[1] > 0).keys()

Alternatively, to remove the header before reading the file, you can use Notepad or any other editor. On Linux you can also use the following command:

tail -n +2 "train.csv" > "train_noheader.csv"


Dr Alan Beckles
29/1/2017 22:09:07

I am using python 2.7

Rahul
2/5/2017 06:43:14

Even after removing the header, I'm getting the error below:
"ValueError: RDD is empty"



