-
- Downloads
[SPARK-19762][ML] Hierarchy for consolidating ML aggregator/loss code
## What changes were proposed in this pull request? JIRA: [SPARK-19762](https://issues.apache.org/jira/browse/SPARK-19762) The larger changes in this patch are: * Adds a `DifferentiableLossAggregator` trait which is intended to be used as a common parent trait to all Spark ML aggregator classes. It factors out the common methods: `merge, gradient, loss, weight` from the aggregator subclasses. * Adds a `RDDLossFunction` which is intended to be the only implementation of Breeze's `DiffFunction` necessary in Spark ML, and can be used by all other algorithms. It takes the aggregator type as a type parameter, and maps the aggregator over an RDD. It additionally takes in a optional regularization loss function for applying the differentiable part of regularization. * Factors out the regularization from the data part of the cost function, and treats regularization as a separate independent cost function which can be evaluated and added to the data cost function. * Changes `LinearRegression` to use this new hierarchy as a proof of concept. * Adds the following new namespaces `o.a.s.ml.optim.loss` and `o.a.s.ml.optim.aggregator` Also note that none of these are public-facing changes. All of these classes are internal to Spark ML and remain that way. **NOTE: The large majority of the "lines added" and "lines deleted" are simply code moving around or unit tests.** BTW, I also converted LinearSVC to this framework as a way to prove that this new hierarchy is flexible enough for the other algorithms, but I backed those changes out because the PR is large enough as is. ## How was this patch tested? Test suites are added for the new components, and some test suites are also added to provide coverage where there wasn't any before. * DifferentiablLossAggregatorSuite * LeastSquaresAggregatorSuite * RDDLossFunctionSuite * DifferentiableRegularizationSuite Below are some performance testing numbers. Run on a 6 node virtual cluster with 44 cores and ~110G RAM, the dataset size is about 37G. These are not "large-scale" tests, but we really want to just make sure the iteration times don't increase with this patch. Notably we are doing the regularization a bit differently than before, but that should cost very little. I think there's very little risk otherwise, and these numbers don't show a difference. Of course I'm happy to add more tests as we think it's necessary, but I think the patch is ready for review now. **Note:** timings are best of 3 runs. | | numFeatures | numPoints | maxIter | regParam | elasticNetParam | SPARK-19762 (sec) | master (sec) | |----|---------------|-------------|-----------|------------|-------------------|---------------------|----------------| | 0 | 5000 | 1e+06 | 30 | 0 | 0 | 129.594 | 131.153 | | 1 | 5000 | 1e+06 | 30 | 0.1 | 0 | 135.54 | 136.327 | | 2 | 5000 | 1e+06 | 30 | 0.01 | 0.5 | 135.148 | 129.771 | | 3 | 50000 | 100000 | 30 | 0 | 0 | 145.764 | 144.096 | ## Follow ups If this design is accepted, we will convert the other ML algorithms that use this aggregator pattern to this new hierarchy in follow up PRs. Author: sethah <seth.hendrickson16@gmail.com> Author: sethah <shendrickson@cloudera.com> Closes #17094 from sethah/ml_aggregators.
Showing
- mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/DifferentiableLossAggregator.scala 88 additions, 0 deletions...rk/ml/optim/aggregator/DifferentiableLossAggregator.scala
- mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LeastSquaresAggregator.scala 224 additions, 0 deletions...he/spark/ml/optim/aggregator/LeastSquaresAggregator.scala
- mllib/src/main/scala/org/apache/spark/ml/optim/loss/DifferentiableRegularization.scala 71 additions, 0 deletions...he/spark/ml/optim/loss/DifferentiableRegularization.scala
- mllib/src/main/scala/org/apache/spark/ml/optim/loss/RDDLossFunction.scala 72 additions, 0 deletions...cala/org/apache/spark/ml/optim/loss/RDDLossFunction.scala
- mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala 14 additions, 313 deletions...ala/org/apache/spark/ml/regression/LinearRegression.scala
- mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/DifferentiableLossAggregatorSuite.scala 160 additions, 0 deletions.../optim/aggregator/DifferentiableLossAggregatorSuite.scala
- mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/LeastSquaresAggregatorSuite.scala 157 additions, 0 deletions...ark/ml/optim/aggregator/LeastSquaresAggregatorSuite.scala
- mllib/src/test/scala/org/apache/spark/ml/optim/loss/DifferentiableRegularizationSuite.scala 61 additions, 0 deletions...ark/ml/optim/loss/DifferentiableRegularizationSuite.scala
- mllib/src/test/scala/org/apache/spark/ml/optim/loss/RDDLossFunctionSuite.scala 83 additions, 0 deletions...org/apache/spark/ml/optim/loss/RDDLossFunctionSuite.scala
Loading
Please register or sign in to comment