Do you know that you can apply machine learning algorithms to big data very easily? What makes it simple is Spark and its machine learning library MLlib. And it gets even simpler using the Python API PySpark. To better visualize how to do that, please take a look at this notebook: Spark_MLlib_Classification

Although it would be much better if you visited the link above, I will show you some of the more important steps in this post anyway. As an example, we use the publicly available data of products of the online shop Otto. The data contains more than 60k products with 93 numerical features (yes, it is small data) and their classification into 9 product categories. We will try to predict the product category using MLlib classification algorithms.

First, we want to use PySpark interactively in the IPython notebook (Jupyter). To do that, we use findspark. Then, we create an RDD (the basic dataset type in Spark) from our sample train set:
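A minimal sketch of these first steps, assuming a local Spark installation and a headerless copy of the train set named train_noheader.csv (file names and variable names may differ from the notebook):

```python
import findspark
findspark.init()                      # make the local Spark installation importable

from pyspark import SparkContext
sc = SparkContext(appName="OttoClassification")

# Create an RDD of raw CSV lines from the Otto train set
product_data_raw = sc.textFile("train_noheader.csv")
print(product_data_raw.first())       # first row: id, 93 feature values, class label
```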
We can process the raw data and explore it using built-in functions like map, distinct, count, countByValue, etc.; a short sketch follows below, and the notebook has more. With the Pandas and Seaborn libraries we can plot, in just a few lines of code, the distribution of the values of each feature separated by class (see the notebook for the plots). Now let's charge our machine learning weapons:
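A rough sketch of both steps, reusing the variable names from the sketch above (the notebook's exact code may differ):

```python
# Quick exploration of the parsed rows (hypothetical variable names)
product_data = product_data_raw.map(lambda line: line.split(","))
print(product_data.count())                                   # number of products
print(product_data.map(lambda row: row[-1]).countByValue())   # products per category

# The "machine learning weapons": classes from the RDD-based MLlib API
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.tree import DecisionTree, RandomForest, GradientBoostedTrees
```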
We must label the RDD using the LabeledPoint function, whose arguments are the label (the class, last column) and the features. Note that the features are first converted to the dense format:
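A sketch of this step, assuming the Otto CSV layout (an id column, 93 numeric features, and a target such as "Class_1"); the parsing in the notebook may differ:

```python
def parse_point(row):
    # Turn "Class_1" ... "Class_9" into numeric labels 0.0 ... 8.0 (assumed target format)
    label = float(row[-1].split("_")[1]) - 1
    # Skip the id column and convert the 93 feature values to a dense vector
    features = Vectors.dense([float(x) for x in row[1:-1]])
    return LabeledPoint(label, features)

labeled_data = product_data.map(parse_point)
```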
Then, after shuffling and splitting the data,
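for example like this (the 70/30 ratio and the seed are assumptions, not the notebook's values):

```python
# Shuffle and split into training and test sets
train_data, test_data = labeled_data.randomSplit([0.7, 0.3], seed=42)
train_data.cache()
```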
we are ready to use the MLlib classification algorithms, like the Naive Bayes classifier:
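A minimal sketch of training and evaluating the Naive Bayes model, following the standard MLlib example pattern (the smoothing parameter and the accuracy computation are assumptions):

```python
# Train a Naive Bayes model on the training set
model = NaiveBayes.train(train_data, lambda_=1.0)

# Evaluate on the test set: fraction of correctly predicted categories
predictions_and_labels = test_data.map(lambda p: (model.predict(p.features), p.label))
accuracy = predictions_and_labels.filter(lambda pl: pl[0] == pl[1]).count() / float(test_data.count())
print("Test accuracy: %.3f" % accuracy)
```

Note that MLlib's Naive Bayes expects non-negative feature values, which the Otto count features satisfy.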
In the notebook you will find examples of how to use decision trees, random forest classifiers, and gradient boosted trees.
8 Comments
Dr Alan Beckles
29/1/2017 04:45:42
When I try to load data with "train_noheader.csv" I receive an error msg with product_data_raw.first() telling me that the RDD is empty. When data is loaded with "train.csv" product_data_raw.first() outputs the header. How can the header be removed from the RDD?
When you load the data with header from "train.csv" you can remove the header from the RDD a posteriori with:
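For example (a sketch using the variable name from the question above):

```python
# Drop the header line a posteriori: keep every line that differs from the first one
header = product_data_raw.first()
product_data_raw = product_data_raw.filter(lambda line: line != header)
```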
Dr Alan Beckles
29/1/2017 22:09:07
I am using python 2.7
Rahul
2/5/2017 06:43:14
Even after removing the header I'm getting the below error
3/4/2020 08:34:26
I really enjoy studying on this website, it holds wonderful posts. Excellent Machine Learning blog you have got here... It’s difficult to find excellent writing like yours these days. I really appreciate individuals like you!