Commit 864be935 authored by Yanbo Liang

[SPARK-17141][ML] MinMaxScaler should remain NaN value.

## What changes were proposed in this pull request?
In the existing code, ```MinMaxScaler``` handles ```NaN``` values inconsistently.
* If a column is constant, that is ```max == min```, the ```MinMaxScalerModel``` transformation outputs ```0.5``` for all rows, even when the original value is ```NaN```.
* Otherwise, the value remains ```NaN``` after transformation.

I think we should unify the behavior by keeping ```NaN``` values unchanged under all conditions, since we don't know how to transform a ```NaN``` value. For comparison, Python's scikit-learn throws an exception when the dataset contains ```NaN```.
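For illustration, here is a minimal standalone sketch of the intended element-wise rule (not the actual ```MinMaxScalerModel``` code; the object and method names are hypothetical), assuming the usual rescaling formula ```(x - Emin) / (Emax - Emin) * (max - min) + min```:

```scala
// Hypothetical sketch of NaN-preserving min-max rescaling for one row.
// A NaN entry is passed through untouched instead of being rescaled
// (or, for a constant column, silently replaced by the midpoint 0.5).
object MinMaxRescaleSketch {
  def rescale(
      values: Array[Double],       // feature values of a single row
      originalMin: Array[Double],  // per-column minimum seen during fit
      originalMax: Array[Double],  // per-column maximum seen during fit
      min: Double,                 // lower bound of the target range
      max: Double): Array[Double] = {
    values.indices.map { i =>
      if (values(i).isNaN) {
        Double.NaN                 // remain NaN under any condition
      } else {
        val range = originalMax(i) - originalMin(i)
        val raw = if (range != 0) (values(i) - originalMin(i)) / range else 0.5
        raw * (max - min) + min
      }
    }.toArray
  }
}
```

For example, ```rescale(Array(Double.NaN, 2.0), Array(0.0, 0.0), Array(4.0, 4.0), -5, 5)``` yields ```Array(NaN, 0.0)```.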

## How was this patch tested?
Unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14716 from yanboliang/spark-17141.
parent 5377fc62
```diff
@@ -186,8 +186,10 @@ class MinMaxScalerModel private[ml] (
     val size = values.length
     var i = 0
     while (i < size) {
-      val raw = if (originalRange(i) != 0) (values(i) - minArray(i)) / originalRange(i) else 0.5
-      values(i) = raw * scale + $(min)
+      if (!values(i).isNaN) {
+        val raw = if (originalRange(i) != 0) (values(i) - minArray(i)) / originalRange(i) else 0.5
+        values(i) = raw * scale + $(min)
+      }
       i += 1
     }
     Vectors.dense(values)
```
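A side note on why the explicit guard is needed: for non-constant columns, ```NaN``` already propagates through the arithmetic, so only the constant-column branch (```originalRange(i) == 0```) discarded ```NaN``` by returning ```0.5``` outright. A quick sketch of the two branches:

```scala
// NaN propagates through normal Double arithmetic, so the non-constant
// branch was already NaN-safe; the constant branch was not.
val x = Double.NaN
assert(((x - 1.0) / 2.0).isNaN)  // non-constant column: stays NaN
val midpoint = 0.5               // constant column: old code dropped NaN here
assert(!midpoint.isNaN)          // hence the new isNaN guard before rescaling
```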
```diff
@@ -90,4 +90,31 @@ class MinMaxScalerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
     assert(newInstance.originalMin === instance.originalMin)
     assert(newInstance.originalMax === instance.originalMax)
   }
+
+  test("MinMaxScaler should remain NaN value") {
+    val data = Array(
+      Vectors.dense(1, Double.NaN, 2.0, 2.0),
+      Vectors.dense(2, 2.0, 0.0, 3.0),
+      Vectors.dense(3, Double.NaN, 0.0, 1.0),
+      Vectors.dense(6, 2.0, 2.0, Double.NaN))
+
+    val expected: Array[Vector] = Array(
+      Vectors.dense(-5.0, Double.NaN, 5.0, 0.0),
+      Vectors.dense(-3.0, 0.0, -5.0, 5.0),
+      Vectors.dense(-1.0, Double.NaN, -5.0, -5.0),
+      Vectors.dense(5.0, 0.0, 5.0, Double.NaN))
+
+    val df = spark.createDataFrame(data.zip(expected)).toDF("features", "expected")
+    val scaler = new MinMaxScaler()
+      .setInputCol("features")
+      .setOutputCol("scaled")
+      .setMin(-5)
+      .setMax(5)
+
+    val model = scaler.fit(df)
+    model.transform(df).select("expected", "scaled").collect()
+      .foreach { case Row(vector1: Vector, vector2: Vector) =>
+        assert(vector1.equals(vector2), "Transformed vector is different with expected.")
+      }
+  }
 }
```