Unverified Commit 16d8472f authored 8 years ago by sethah Committed by DB Tsai 8 years ago

[SPARK-19746][ML] Faster indexing for logistic aggregator

## What changes were proposed in this pull request?

JIRA: [SPARK-19746](https://issues.apache.org/jira/browse/SPARK-19746)

The following code is inefficient:

````scala
    val localCoefficients: Vector = bcCoefficients.value

    features.foreachActive { (index, value) =>
      val stdValue = value / localFeaturesStd(index)
      var j = 0
      while (j < numClasses) {
        margins(j) += localCoefficients(index * numClasses + j) * stdValue
        j += 1
      }
    }
````

`localCoefficients(index * numClasses + j)` calls `Vector.apply` which creates a new Breeze vector and indexes that. Even if it is not that slow to create the object, we will generate a lot of extra garbage that may result in longer GC pauses. This is a hot inner loop, so we should optimize wherever possible.

## How was this patch tested?

I don't think there's a great way to test this patch. It's purely performance related, so unit tests should guarantee that we haven't made any unwanted changes. Empirically I observed between 10-40% speedups just running short local tests. I suspect the big differences will be seen when large data/coefficient sizes have to pause for GC more often. I welcome other ideas for testing.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #17078 from sethah/logistic_agg_indexing.

parent 8a5a5850

No related branches found

No related tags found

No related merge requests found

Hide whitespace changes

Inline Side-by-side

Showing with 34 additions and 3 deletions

Please register or to comment