Commit 14e2700d authored by Yu ISHIKAWA, committed by Xiangrui Meng

[SPARK-12874][ML] ML StringIndexer does not protect itself from column name duplication

## What changes were proposed in this pull request?
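ML `StringIndexer` does not protect itself from column name duplication. This patch validates the schema in `StringIndexerModel` before transforming, so an output column that already exists in the dataset is rejected with an `IllegalArgumentException`.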

Schema validation in `StringIndexer` and `StringIndexerModel` could still be improved, but that is better handled in a separate issue.
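
As a rough illustration of the kind of check involved, here is a minimal, Spark-free sketch. The object `OutputColumnCheck` and the function `validateOutputCol` are hypothetical names invented for this example; they are not Spark's `validateAndTransformSchema`, and a plain `Seq[String]` of column names stands in for a real schema.

```scala
// Hypothetical stand-in for a schema check; NOT Spark's actual implementation.
object OutputColumnCheck {
  // Returns the column names after appending outputCol.
  // `require` throws IllegalArgumentException when a condition fails.
  def validateOutputCol(
      existingCols: Seq[String],
      inputCol: String,
      outputCol: String): Seq[String] = {
    require(existingCols.contains(inputCol),
      s"Input column $inputCol does not exist.")
    require(!existingCols.contains(outputCol),
      s"Output column $outputCol already exists.")
    existingCols :+ outputCol
  }
}
```

With columns `input` and `output` already present, calling `OutputColumnCheck.validateOutputCol(Seq("input", "output"), "input", "output")` fails with an `IllegalArgumentException`, which is the exception type the new unit test expects from `transform`.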

## How was this patch tested?
Unit test: a new `StringIndexerSuite` case, "StringIndexerModel can't overwrite output column".

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #11370 from yu-iskw/SPARK-12874.
parent fb8bb047
......@@ -150,6 +150,7 @@ class StringIndexerModel (
        "Skip StringIndexerModel.")
      return dataset
    }
    validateAndTransformSchema(dataset.schema)
    val indexer = udf { label: String =>
      if (labelToIndex.contains(label)) {
......
......@@ -118,6 +118,17 @@ class StringIndexerSuite
    assert(indexerModel.transform(df).eq(df))
  }

  test("StringIndexerModel can't overwrite output column") {
    val df = sqlContext.createDataFrame(Seq((1, 2), (3, 4))).toDF("input", "output")
    val indexer = new StringIndexer()
      .setInputCol("input")
      .setOutputCol("output")
      .fit(df)
    intercept[IllegalArgumentException] {
      indexer.transform(df)
    }
  }

  test("StringIndexer read/write") {
    val t = new StringIndexer()
      .setInputCol("myInputCol")
......