MLlib is Spark’s scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives, as outlined below:

Online Documentation

  • Data types
  • Basic statistics
    • summary statistics
    • correlations
    • stratified sampling
    • hypothesis testing
    • random data generation
  • Classification and regression
    • linear models (SVMs, logistic regression, linear regression)
    • decision trees
    • naive Bayes
  • Collaborative filtering
    • alternating least squares (ALS)
  • Clustering
    • k-means
  • Dimensionality reduction
    • singular value decomposition (SVD)
    • principal component analysis (PCA)
  • Feature extraction and transformation
  • Optimization (developer)
    • stochastic gradient descent
    • limited-memory BFGS (L-BFGS)

MLlib is under active development. The APIs marked Experimental/DeveloperApi may change in future releases, and the migration guide below will explain all changes between releases.

Dependencies

MLlib uses the linear algebra package Breeze, which depends on netlib-java, and jblas. netlib-java and jblas depend on native Fortran routines. You need to install the gfortran runtime library if it is not already present on your nodes. MLlib will throw a linking error if it cannot detect these libraries automatically. Due to license issues, we do not include netlib-java’s native libraries in MLlib’s dependency set under default settings. If no native library is available at runtime, you will see a warning message. To use native libraries from netlib-java, please build Spark with -Pnetlib-lgpl or include com.github.fommil.netlib:all:1.1.2 as a dependency of your project. If you want to use optimized BLAS/LAPACK libraries such as OpenBLAS, please link its shared libraries to /usr/lib/libblas.so.3 and /usr/lib/liblapack.so.3, respectively. BLAS/LAPACK libraries on worker nodes should be built without multithreading.

To use MLlib in Python, you will need NumPy version 1.4 or newer.

Usage

Java Example

import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

double[] array = ... // a double array
Vector vector = Vectors.dense(array); // a dense vector

Python Example

from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint

# Create a labeled point with a positive label and a dense feature vector.
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])

# Create a labeled point with a negative label and a sparse feature vector.
neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))

The book: Machine Learning with Spark

by Nick Pentreath February 2015

DOUBAN Link

The Spark Workshop

Hosted by Stanford ICME Friday April 4th 2014 Clark Center Auditorium, Stanford University

Organizers: Reza Zadeh | Matei Zaharia | Ion Stoica

A workshop on the high-speed cluster programming framework, Spark. We will discuss Machine Learning and Matrix Computations, along with other projects in the Spark platform such as Shark and MLlib, and new programming tools for big data.

The workshop will include introductions to the many Spark features, case studies from current users, best practices for deployment and tuning, future development plans, and hands-on exercises for Amazon EC2.

  • Overview of the Spark Ecosystem (Patrick Wendell) [slides]
  • Parallel Programming with Spark (Holden Karau) [slides]
  • Spark for Numerical Computing (Hossein Falaki) [slides]
  • Machine Learning with MLlib (Xiangrui Meng) [slides]