Commit 8532e26f authored by Nick Pentreath

[SPARK-23112][DOC] Add highlights and migration guide for 2.3

Update ML user guide with highlights and migration guide for `2.3`.

## How was this patch tested?

Doc only.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #20363 from MLnick/SPARK-23112-ml-guide.
parent d20bbc2d
@@ -72,32 +72,31 @@ To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4
 [^1]: To learn more about the benefits and background of system optimised natives, you may wish to
 watch Sam Halliday's ScalaX talk on [High Performance Linear Algebra in Scala](http://fommil.github.io/scalax14/#/).
-# Highlights in 2.2
+# Highlights in 2.3
-The list below highlights some of the new features and enhancements added to MLlib in the `2.2`
+The list below highlights some of the new features and enhancements added to MLlib in the `2.3`
 release of Spark:
-* [`ALS`](ml-collaborative-filtering.html) methods for _top-k_ recommendations for all
-users or items, matching the functionality in `mllib`
-([SPARK-19535](https://issues.apache.org/jira/browse/SPARK-19535)).
-Performance was also improved for both `ml` and `mllib`
-([SPARK-11968](https://issues.apache.org/jira/browse/SPARK-11968) and
-[SPARK-20587](https://issues.apache.org/jira/browse/SPARK-20587))
-* [`Correlation`](ml-statistics.html#correlation) and
-[`ChiSquareTest`](ml-statistics.html#hypothesis-testing) stats functions for `DataFrames`
-([SPARK-19636](https://issues.apache.org/jira/browse/SPARK-19636) and
-[SPARK-19635](https://issues.apache.org/jira/browse/SPARK-19635))
-* [`FPGrowth`](ml-frequent-pattern-mining.html#fp-growth) algorithm for frequent pattern mining
-([SPARK-14503](https://issues.apache.org/jira/browse/SPARK-14503))
-* `GLM` now supports the full `Tweedie` family
-([SPARK-18929](https://issues.apache.org/jira/browse/SPARK-18929))
-* [`Imputer`](ml-features.html#imputer) feature transformer to impute missing values in a dataset
-([SPARK-13568](https://issues.apache.org/jira/browse/SPARK-13568))
-* [`LinearSVC`](ml-classification-regression.html#linear-support-vector-machine)
-for linear Support Vector Machine classification
-([SPARK-14709](https://issues.apache.org/jira/browse/SPARK-14709))
-* Logistic regression now supports constraints on the coefficients during training
-([SPARK-20047](https://issues.apache.org/jira/browse/SPARK-20047))
+* Built-in support for reading images into a `DataFrame` was added
+([SPARK-21866](https://issues.apache.org/jira/browse/SPARK-21866)).
+* [`OneHotEncoderEstimator`](ml-features.html#onehotencoderestimator) was added, and should be
+used instead of the existing `OneHotEncoder` transformer. The new estimator supports
+transforming multiple columns.
+* Multiple column support was also added to `QuantileDiscretizer` and `Bucketizer`
+([SPARK-22397](https://issues.apache.org/jira/browse/SPARK-22397) and
+[SPARK-20542](https://issues.apache.org/jira/browse/SPARK-20542))
+* A new [`FeatureHasher`](ml-features.html#featurehasher) transformer was added
+([SPARK-13969](https://issues.apache.org/jira/browse/SPARK-13969)).
+* Added support for evaluating multiple models in parallel when performing cross-validation using
+[`TrainValidationSplit` or `CrossValidator`](ml-tuning.html)
+([SPARK-19357](https://issues.apache.org/jira/browse/SPARK-19357)).
+* Improved support for custom pipeline components in Python (see
+[SPARK-21633](https://issues.apache.org/jira/browse/SPARK-21633) and
+[SPARK-21542](https://issues.apache.org/jira/browse/SPARK-21542)).
+* `DataFrame` functions for descriptive summary statistics over vector columns
+([SPARK-19634](https://issues.apache.org/jira/browse/SPARK-19634)).
+* Robust linear regression with Huber loss
+([SPARK-3181](https://issues.apache.org/jira/browse/SPARK-3181)).
 # Migration guide
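For readers unfamiliar with the "Huber loss" named in the last 2.3 highlight of the hunk above, here is a minimal sketch of the textbook per-residual Huber loss in plain Python. It is an illustration only, not Spark's implementation: the function name `huber_loss` and the threshold parameter `delta` are assumptions made for this example, and Spark's estimator may use a different parameterisation.

```python
def huber_loss(residual, delta=1.0):
    """Textbook Huber loss for a single residual (illustrative, not Spark code).

    Quadratic for small residuals (|r| <= delta), linear for large ones,
    which limits the influence of outliers compared with squared error.
    `delta` is a hypothetical name for the switch-over threshold.
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    magnitude = abs(residual)
    if magnitude <= delta:
        return 0.5 * residual * residual
    # Linear branch, offset so both branches agree at |r| == delta.
    return delta * (magnitude - 0.5 * delta)
```

The two branches meet at `|residual| == delta`, so the loss is continuous; beyond that point it grows linearly rather than quadratically, which is what makes the fit "robust".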
@@ -115,36 +114,17 @@ There are no breaking changes.
 **Deprecations**
-There are no deprecations.
+* `OneHotEncoder` has been deprecated and will be removed in `3.0`. It has been replaced by the
+new [`OneHotEncoderEstimator`](ml-features.html#onehotencoderestimator)
+(see [SPARK-13030](https://issues.apache.org/jira/browse/SPARK-13030)). **Note** that
+`OneHotEncoderEstimator` will be renamed to `OneHotEncoder` in `3.0` (but
+`OneHotEncoderEstimator` will be kept as an alias).
 **Changes of behavior**
 * [SPARK-21027](https://issues.apache.org/jira/browse/SPARK-21027):
-We are now setting the default parallelism used in `OneVsRest` to be 1 (i.e. serial), in 2.2 and earlier version,
-the `OneVsRest` parallelism would be parallelism of the default threadpool in scala.
+We are now setting the default parallelism used in `OneVsRest` to be 1 (i.e. serial). In 2.2 and
+earlier versions, the level of parallelism was set to the default threadpool size in Scala.
-## From 2.1 to 2.2
-### Breaking changes
-There are no breaking changes.
-### Deprecations and changes of behavior
-**Deprecations**
-There are no deprecations.
-**Changes of behavior**
-* [SPARK-19787](https://issues.apache.org/jira/browse/SPARK-19787):
-Default value of `regParam` changed from `1.0` to `0.1` for `ALS.train` method (marked `DeveloperApi`).
-**Note** this does _not affect_ the `ALS` Estimator or Model, nor MLlib's `ALS` class.
-* [SPARK-14772](https://issues.apache.org/jira/browse/SPARK-14772):
-Fixed inconsistency between Python and Scala APIs for `Param.copy` method.
-* [SPARK-11569](https://issues.apache.org/jira/browse/SPARK-11569):
-`StringIndexer` now handles `NULL` values in the same way as unseen values. Previously an exception
-would always be thrown regardless of the setting of the `handleInvalid` parameter.
 ## Previous Spark versions
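The SPARK-21027 note in the hunk above says a parallelism of 1 means serial execution. As a rough illustration of what such a setting controls, the sketch below evaluates a list of candidates either serially or on a bounded thread pool. This is plain Python, not Spark code: `evaluate_all`, `candidates`, and `evaluate` are hypothetical names chosen for the example, and Spark's internals may differ.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_all(candidates, evaluate, parallelism=1):
    """Apply `evaluate` to every candidate (illustrative, not Spark code).

    parallelism == 1 runs the calls one after another in the caller's thread;
    larger values run up to `parallelism` calls at a time on a thread pool.
    Results are returned in the same order as `candidates` in both modes.
    """
    if parallelism < 1:
        raise ValueError("parallelism must be >= 1")
    if parallelism == 1:
        return [evaluate(candidate) for candidate in candidates]
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # Executor.map preserves input order even if tasks finish out of order.
        return list(pool.map(evaluate, candidates))
```

Defaulting to 1 keeps resource usage predictable; a caller opts in to concurrency by passing a larger value explicitly.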
......
@@ -7,6 +7,29 @@ description: MLlib migration guides from before Spark SPARK_VERSION_SHORT
 The migration guide for the current Spark version is kept on the [MLlib Guide main page](ml-guide.html#migration-guide).
+## From 2.1 to 2.2
+### Breaking changes
+There are no breaking changes.
+### Deprecations and changes of behavior
+**Deprecations**
+There are no deprecations.
+**Changes of behavior**
+* [SPARK-19787](https://issues.apache.org/jira/browse/SPARK-19787):
+Default value of `regParam` changed from `1.0` to `0.1` for `ALS.train` method (marked `DeveloperApi`).
+**Note** this does _not affect_ the `ALS` Estimator or Model, nor MLlib's `ALS` class.
+* [SPARK-14772](https://issues.apache.org/jira/browse/SPARK-14772):
+Fixed inconsistency between Python and Scala APIs for `Param.copy` method.
+* [SPARK-11569](https://issues.apache.org/jira/browse/SPARK-11569):
+`StringIndexer` now handles `NULL` values in the same way as unseen values. Previously an exception
+would always be thrown regardless of the setting of the `handleInvalid` parameter.
 ## From 2.0 to 2.1
 ### Breaking changes
......