Commit 89f91226 authored by DB Tsai, committed by Xiangrui Meng

[SPARK-4596][MLLib] Refactorize Normalizer to make code cleaner

This refactoring slightly improves performance by removing the overhead of
Breeze vectors from the scaling step. The bottleneck is still Breeze's norm,
which is implemented with activeIterator.

That inefficiency in Breeze's norm will be addressed in the next PR. In the
meantime, this PR makes the code more consistent with the rest of the codebase.

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #3446 from dbtsai/normalizer and squashes the following commits:

e20a2b9 [DB Tsai] first commit
parent 0fe54cff
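
The commit message attributes the remaining cost to Breeze's norm going through activeIterator. As a rough illustration of the alternative, the sketch below computes a p-norm with a plain while loop over a primitive array. It is not the follow-up PR's code: `NormSketch` and `pNorm` are hypothetical names, and the `p >= 1` check is an assumption made for this example.

```scala
object NormSketch {
  /** p-norm of `values` via a while loop (hypothetical helper, not from the commit). */
  def pNorm(values: Array[Double], p: Double): Double = {
    // Assumption for this sketch: only p >= 1 is supported.
    require(p >= 1.0, "p must be >= 1")
    val n = values.length
    var i = 0
    if (p == 1.0) {
      // Sum of absolute values.
      var sum = 0.0
      while (i < n) { sum += math.abs(values(i)); i += 1 }
      sum
    } else if (p == 2.0) {
      // Euclidean norm.
      var sum = 0.0
      while (i < n) { sum += values(i) * values(i); i += 1 }
      math.sqrt(sum)
    } else if (p == Double.PositiveInfinity) {
      // Largest absolute value.
      var max = 0.0
      while (i < n) {
        val a = math.abs(values(i))
        if (a > max) max = a
        i += 1
      }
      max
    } else {
      // General case: (sum |x_i|^p)^(1/p).
      var sum = 0.0
      while (i < n) { sum += math.pow(math.abs(values(i)), p); i += 1 }
      math.pow(sum, 1.0 / p)
    }
  }
}
```

For a sparse vector, passing only the stored values gives the same result as the full vector, since zero entries contribute nothing to any of these norms.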
@@ -17,10 +17,10 @@
 package org.apache.spark.mllib.feature
-import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, norm => brzNorm}
+import breeze.linalg.{norm => brzNorm}
 import org.apache.spark.annotation.Experimental
-import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors}
 /**
  * :: Experimental ::
@@ -47,22 +47,31 @@ class Normalizer(p: Double) extends VectorTransformer {
    * @return normalized vector. If the norm of the input is zero, it will return the input vector.
    */
   override def transform(vector: Vector): Vector = {
-    var norm = brzNorm(vector.toBreeze, p)
+    val norm = brzNorm(vector.toBreeze, p)
     if (norm != 0.0) {
       // For dense vector, we've to allocate new memory for new output vector.
       // However, for sparse vector, the `index` array will not be changed,
       // so we can re-use it to save memory.
-      vector.toBreeze match {
-        case dv: BDV[Double] => Vectors.fromBreeze(dv :/ norm)
-        case sv: BSV[Double] =>
-          val output = new BSV[Double](sv.index, sv.data.clone(), sv.length)
+      vector match {
+        case dv: DenseVector =>
+          val values = dv.values.clone()
+          val size = values.size
           var i = 0
-          while (i < output.data.length) {
-            output.data(i) /= norm
+          while (i < size) {
+            values(i) /= norm
             i += 1
           }
-          Vectors.fromBreeze(output)
+          Vectors.dense(values)
+        case sv: SparseVector =>
+          val values = sv.values.clone()
+          val nnz = values.size
+          var i = 0
+          while (i < nnz) {
+            values(i) /= norm
+            i += 1
+          }
+          Vectors.sparse(sv.size, sv.indices, values)
         case v => throw new IllegalArgumentException("Do not support vector type " + v.getClass)
       }
     } else {
......
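
The pattern in the new `transform` body can be tried without a Spark dependency. The sketch below is a minimal stand-alone version under stated assumptions: `Vec`, `Dense`, and `Sparse` are hypothetical stand-ins for MLlib's vector classes, the norm is passed in rather than computed, and the two loops from the diff are folded into one helper (`scaled`) for brevity, which the commit itself does not do.

```scala
object NormalizeSketch {
  // Hypothetical stand-ins for MLlib's DenseVector and SparseVector.
  sealed trait Vec
  final case class Dense(values: Array[Double]) extends Vec
  final case class Sparse(size: Int, indices: Array[Int], values: Array[Double]) extends Vec

  // Copy `values` and divide each element of the copy by `norm`.
  private def scaled(values: Array[Double], norm: Double): Array[Double] = {
    val out = values.clone()
    var i = 0
    while (i < out.length) {
      out(i) /= norm
      i += 1
    }
    out
  }

  // Mirrors the shape of the diff: a zero norm returns the input unchanged,
  // a dense vector gets a fresh values array, and a sparse vector re-uses
  // its indices array and only copies the stored values.
  def normalize(v: Vec, norm: Double): Vec =
    if (norm == 0.0) v
    else v match {
      case Dense(values) => Dense(scaled(values, norm))
      case Sparse(size, indices, values) => Sparse(size, indices, scaled(values, norm))
    }
}
```

Because the sparse branch shares the `indices` array between input and output, this only stays safe as long as neither side mutates it afterwards.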