Machine Learning Library (MLlib) of Apache Spark
MLlib is Spark’s scalable machine learning library. It consists of common learning algorithms and utilities, spanning classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as the underlying optimization primitives, as outlined below (a short usage sketch follows the list):
- Data types
- Basic statistics
  - summary statistics
  - correlations
  - stratified sampling
  - hypothesis testing
  - random data generation
- Classification and regression
  - linear models (SVMs, logistic regression, linear regression)
  - decision trees
  - naive Bayes
- Collaborative filtering
  - alternating least squares (ALS)
- Clustering
  - k-means
- Dimensionality reduction
  - singular value decomposition (SVD)
  - principal component analysis (PCA)
- Feature extraction and transformation
- Optimization (developer)
  - stochastic gradient descent
  - limited-memory BFGS (L-BFGS)
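As a brief illustration of one of these algorithms, the following sketch clusters a toy dataset with MLlib’s k-means. The data points and parameter values are illustrative, and a running SparkContext named sc is assumed:

from pyspark.mllib.clustering import KMeans

# Toy 2-D points forming two well-separated clusters (illustrative data).
points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])

# Train k-means with k=2; maxIterations is kept small for the example.
model = KMeans.train(points, k=2, maxIterations=10)

# Two cluster centers, near (0.5, 0.5) and (8.5, 8.5).
print(model.clusterCenters)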
MLlib is under active development. APIs marked Experimental or DeveloperApi may change in future releases, and the migration guide in the MLlib documentation explains the changes between releases.
Dependencies
MLlib uses the linear algebra package Breeze, which depends on netlib-java and jblas. Both netlib-java and jblas depend on native Fortran routines, so you need to install the gfortran runtime library if it is not already present on your nodes; MLlib will throw a linking error if it cannot detect these libraries automatically. Due to license issues, we do not include netlib-java’s native libraries in MLlib’s dependency set under the default settings. If no native library is available at runtime, you will see a warning message. To use native libraries from netlib-java, either build Spark with -Pnetlib-lgpl or include com.github.fommil.netlib:all:1.1.2 as a dependency of your project. If you want to use optimized BLAS/LAPACK libraries such as OpenBLAS, link their shared libraries to /usr/lib/libblas.so.3 and /usr/lib/liblapack.so.3, respectively. BLAS/LAPACK libraries on worker nodes should be built without multithreading.
To use MLlib in Python, you will need NumPy version 1.4 or newer.
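As a minimal illustration, NumPy arrays are accepted directly wherever MLlib’s Python API expects a dense vector:

import numpy as np
from pyspark.mllib.linalg import Vectors

# A NumPy array works as a dense vector throughout pyspark.mllib.
dv = Vectors.dense(np.array([1.0, 0.0, 3.0]))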
Usage
Java Example
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

double[] array = {1.0, 0.0, 3.0}; // example values for the double array
Vector vector = Vectors.dense(array); // create a dense vector backed by the array
Python Example
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
# Create a labeled point with a positive label and a dense feature vector.
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])
# Create a labeled point with a negative label and a sparse feature vector.
neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))
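Building on the labeled points above, here is a minimal sketch of training and applying a classifier; the parameters are illustrative and a running SparkContext named sc is assumed:

from pyspark.mllib.classification import LogisticRegressionWithSGD

# Train a logistic regression model on the two labeled points above
# (a real dataset would be far larger; iterations=10 is illustrative).
training = sc.parallelize([pos, neg])
model = LogisticRegressionWithSGD.train(training, iterations=10)

# Predict the class (0 or 1) of a new feature vector.
print(model.predict([1.0, 0.0, 3.0]))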
The book: Machine Learning with Spark
by Nick Pentreath, February 2015
The Spark Workshop
Hosted by Stanford ICME on Friday, April 4, 2014, in the Clark Center Auditorium, Stanford University
Organizers: Reza Zadeh | Matei Zaharia | Ion Stoica
A workshop on Spark, the high-speed cluster programming framework. We will discuss machine learning and matrix computations, along with other projects in the Spark platform such as Shark and MLlib, and new programming tools for big data.
The workshop will include introductions to Spark’s many features, case studies from current users, best practices for deployment and tuning, future development plans, and hands-on exercises on Amazon EC2.