Commit d51d6ba1 authored by Peter Rudenko's avatar Peter Rudenko Committed by Xiangrui Meng

[ML] SPARK-5804 Explicitly manage cache in CrossValidator k-fold loop

On a big dataset, explicitly unpersisting the training and validation folds allows more data to be loaded into memory in the next loop iteration. On my environment (single-node worker with 8 GB RAM, a 2 GB dataset file, 3-fold cross-validation), this saved more than 5 minutes.
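The pattern behind the patch is: cache each fold while it is in use, then unpersist it as soon as the training (or evaluation) step that needed it finishes, so the next fold's data can occupy that memory. A minimal, self-contained sketch of that lifecycle is below; the `Cached` class and `KFoldSketch` object are hypothetical stand-ins for illustration (the real code manages Spark datasets, not this toy class):

```scala
// Hypothetical stand-in for a cacheable Spark dataset, for illustration only.
class Cached(val name: String) {
  var persisted = false
  def persist(): this.type = { persisted = true; this }
  def unpersist(): this.type = { persisted = false; this }
}

object KFoldSketch {
  // Mirrors the cache management in the k-fold loop:
  // persist both folds, drop the training fold right after fitting,
  // drop the validation fold right after evaluation.
  def run(numFolds: Int): Seq[(Boolean, Boolean)] =
    (0 until numFolds).map { split =>
      val trainingDataset = new Cached(s"train-$split").persist()
      val validationDataset = new Cached(s"valid-$split").persist()
      // ... fit models on trainingDataset ...
      trainingDataset.unpersist()   // free cache before evaluation
      // ... evaluate models on validationDataset ...
      validationDataset.unpersist() // free cache before the next fold
      (trainingDataset.persisted, validationDataset.persisted)
    }
}
```

After each iteration both flags are false, i.e. neither fold stays cached into the next iteration, which is exactly what frees memory for the next fold's data.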

Author: Peter Rudenko <petro.rudenko@gmail.com>

Closes #4595 from petro-rudenko/patch-2 and squashes the following commits:

66a7cfb [Peter Rudenko] Move validationDataset cache to declaration
c5f3265 [Peter Rudenko] [Ml] SPARK-5804 Explicitly manage cache in Crossvalidator k-fold loop
parent c78a12c4
@@ -108,6 +108,7 @@ class CrossValidator extends Estimator[CrossValidatorModel] with CrossValidatorP
       // multi-model training
       logDebug(s"Train split $splitIndex with multiple sets of parameters.")
       val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
+      trainingDataset.unpersist()
       var i = 0
       while (i < numModels) {
         val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)), map)
@@ -115,6 +116,7 @@ class CrossValidator extends Estimator[CrossValidatorModel] with CrossValidatorP
         metrics(i) += metric
         i += 1
       }
+      validationDataset.unpersist()
     }
     f2jBLAS.dscal(numModels, 1.0 / map(numFolds), metrics, 1)
     logInfo(s"Average cross-validation metrics: ${metrics.toSeq}")