## Example: Spam Classification

This program uses two MLlib algorithms: HashingTF, which builds term frequency feature vectors from text data, and LogisticRegressionWithSGD, which implements the logistic regression procedure using stochastic gradient descent (SGD). We assume that we start with two files, spam.txt an normal.txt, each of which contains examples of spam and non-spam emails, one per line. We then turn the text in each file into a feature vector with TF, and train a logistic regression model to separate the two types of messages.

## Algorithms

Here only has some usual APIs.

### Feature Extraction

#### Scaling

Most machine learning algorithms consider the magnitude of each element in the feature vector, and thus work best when the features are scaled so they weigh equally (e.g., all features have a mean of 0 and standard deviation of 1). Once you have built feature vectors, you can use the StandardScaler class in MLlib to do this scaling, both for the mean and the standard deviation. You create a StandardScaler, call fit() on a dataset to obtain a StandardScalerModel (i.e., compute the mean and variance of each column), and then call transform() on the model to scale a dataset.

#### Normalization

Simply use Normalizer().transform(rdd). By default Normalizer uses the L 2 norm (i.e, Euclidean length), but you can also pass a power pto Normalizer to use the L p norm.

#### Word2Vec

Once you have trained the model (withWord2Vec.fit(rdd)), you will receive a Word2VecModel that can be used to transform() each word into a vector. Note that the size of the models in Word2Vec will be equal to the number of words in your vocabulary times the size of a vector (by default, 100). You may wish to filter out words that are not in a standard dictionary to limit the size. In general, a good size for the vocabulary is 100,000 words.

### Statistics

Statistics.colStats(rdd)
Computes a statistical summary of an RDD of vectors, which stores the min, max, mean, and variance for each column in the set of vectors. This can be used to obtain a wide variety of statistics in one pass.

Statistics.corr(rdd, method)
Computes the correlation matrix between columns in an RDD of vectors, using either the Pearson or Spearman correlation (method must be one of pearson and spearman).

Statistics.corr(rdd1, rdd2, method)
Computes the correlation between two RDDs of floating-point values, using either the Pearson or Spearman correlation (method must be one of pearson and spearman).

Statistics.chiSqTest(rdd)
Computes Pearson’s independence test for every feature with the label on an RDD of LabeledPoint objects. Returns an array of ChiSqTestResult objects that capture the p-value, test statistic, and degrees of freedom for each feature. Label and feature values must be categorical (i.e., discrete values).

### Classification and Regression

MLlib includes a variety of methods for classification and regression, including simple linear methods and decision trees and forests.

#### Logistic regression

The logistic regression algorithm has a very similar API to linear regression, covered in the previous section. One difference is that there are two algorithms available for solving it: SGD and LBFGS. LBFGS is generally the best choice, but is not available in some earlier versions of MLlib (before Spark 1.2). These algorithms are available in the mllib.classification.LogisticRegressionWithLBFGS and WithSGD classes, which have interfaces similar to LinearRegressionWithSGD. They take all the same parameters as linear regression.

#### Support Vector Machines

They are available through the SVMWithSGD class, with similar parameters to linear and logisitic regression. The returned SVMModel uses a threshold for prediction like LogisticRegressionModel.

#### Naive Bayes

In MLlib, you can use Naive Bayes through themllib.classification.NaiveBayes class. It supports one parameter, lambda (or lambda_ in Python), used for smoothing. You can call it on an RDD of LabeledPoints, where the labels are between 0 and C–1 for C classes.

#### Decision trees and random forests

In MLlib, you can train trees using the mllib.tree.DecisionTree class, through the static methods trainClassifier() and trainRegressor(). Unlike in some of the other algorithms, the Java and Scala APIs also use static methods instead of a DecisionTree object with setters.

### Collaborative Filtering and Recommendation

The training exercises from the Spark Summit 2014 include a hands-on tutorial for personalized movie recommendation with spark.mllib.

PCA in Scala

SVD in Scala

### Pipeline API

Pipeline API version of spam classification in Scala