[SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib] Standardize ML Prediction APIs
This is part (1a) of the updates from the design doc in [https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]

**UPDATE**: Most of the APIs are being kept private[spark] to allow further discussion. Here is a list of changes which are public:
* new output columns: rawPrediction, probabilities
* The “score” column is now called “rawPrediction”
* Classifiers now provide numClasses
* Params.get and .set are now protected instead of private[ml].
* ParamMap now has a size method.
* new classes: LinearRegression, LinearRegressionModel
* LogisticRegression now has an intercept.

### Sketch of APIs (most of which are private[spark] for now)

Abstract classes for learning algorithms (+ corresponding Model abstractions):
* Classifier (+ ClassificationModel)
* ProbabilisticClassifier (+ ProbabilisticClassificationModel)
* Regressor (+ RegressionModel)
* Predictor (+ PredictionModel)
* *For all of these*:
  * There is no strongly typed training-time API.
  * There is a strongly typed test-time (prediction) API which helps developers implement new algorithms.

Concrete classes: learning algorithms
* LinearRegression
* LogisticRegression (updated to use new abstract classes)
  * Also, removed "score" in favor of "probability" output column. Changed BinaryClassificationEvaluator to match. (SPARK-5031)

Other updates:
* params.scala: Changed Params.set/get to be protected instead of private[ml]
  * This was needed for the example of defining a class from outside of the MLlib namespace.
* VectorUDT: Will later change from private[spark] to public.
  * This is needed for outside users to write their own validateAndTransformSchema() methods using vectors.
  * Also, added equals() method.
* SPARK-4942 : ML Transformers should allow output cols to be turned on,off
  * Update validateAndTransformSchema
  * Update transform
* (Updated examples, test suites according to other changes)

New examples:
* DeveloperApiExample.scala (example of defining algorithm from outside of the MLlib namespace)
  * Added Java version too

Test Suites:
* LinearRegressionSuite
* LogisticRegressionSuite
* + Java versions of above suites

CC: mengxr etrain shivaram

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #3637 from jkbradley/ml-api-part1 and squashes the following commits:

405bfb8 [Joseph K. Bradley] Last edits based on code review. Small cleanups
fec348a [Joseph K. Bradley] Added JavaDeveloperApiExample.java and fixed other issues: Made developer API private[spark] for now. Added constructors Java can understand to specialized Param types.
8316d5e [Joseph K. Bradley] fixes after rebasing on master
fc62406 [Joseph K. Bradley] fixed test suites after last commit
bcb9549 [Joseph K. Bradley] Fixed issues after rebasing from master (after move from SchemaRDD to DataFrame)
9872424 [Joseph K. Bradley] fixed JavaLinearRegressionSuite.java Java sql api
f542997 [Joseph K. Bradley] Added MIMA excludes for VectorUDT (now public), and added DeveloperApi annotation to it
216d199 [Joseph K. Bradley] fixed after sql datatypes PR got merged
f549e34 [Joseph K. Bradley] Updates based on code review. Major ones are:
* Created weakly typed Predictor.train() method which is called by fit() so that developers do not have to call schema validation or copy parameters.
* Made Predictor.featuresDataType have a default value of VectorUDT.
  * NOTE: This could be dangerous since the FeaturesType type parameter cannot have a default value.
343e7bd [Joseph K. Bradley] added blanket mima exclude for ml package
82f340b [Joseph K. Bradley] Fixed bug in LogisticRegression (introduced in this PR). Fixed Java suites
0a16da9 [Joseph K. Bradley] Fixed Linear/Logistic RegressionSuites
c3c8da5 [Joseph K. Bradley] small cleanup
934f97b [Joseph K. Bradley] Fixed bugs from previous commit.
1c61723 [Joseph K. Bradley]
* Made ProbabilisticClassificationModel into a subclass of ClassificationModel. Also introduced ProbabilisticClassifier.
  * This was to support output column “probabilityCol” in transform().
4e2f711 [Joseph K. Bradley] rat fix
bc654e1 [Joseph K. Bradley] Added spark.ml LinearRegressionSuite
8d13233 [Joseph K. Bradley] Added methods:
* Classifier: batch predictRaw()
* Predictor: train() without paramMap ProbabilisticClassificationModel.predictProbabilities()
* Java versions of all above batch methods + others
1680905 [Joseph K. Bradley] Added JavaLabeledPointSuite.java for spark.ml, and added constructor to LabeledPoint which defaults weight to 1.0
adbe50a [Joseph K. Bradley]
* fixed LinearRegression train() to use embedded paramMap
* added Predictor.predict(RDD[Vector]) method
* updated Linear/LogisticRegressionSuites
58802e3 [Joseph K. Bradley] added train() to Predictor subclasses which does not take a ParamMap.
57d54ab [Joseph K. Bradley]
* Changed semantics of Predictor.train() to merge the given paramMap with the embedded paramMap.
* remove threshold_internal from logreg
* Added Predictor.copy()
* Extended LogisticRegressionSuite
e433872 [Joseph K. Bradley] Updated docs. Added LabeledPointSuite to spark.ml
54b7b31 [Joseph K. Bradley] Fixed issue with logreg threshold being set correctly
0617d61 [Joseph K. Bradley] Fixed bug from last commit (sorting paramMap by parameter names in toString). Fixed bug in persisting logreg data. Added threshold_internal to logreg for faster test-time prediction (avoiding map lookup).
601e792 [Joseph K. Bradley] Modified ParamMap to sort parameters in toString. Cleaned up classes in class hierarchy, before implementing tests and examples.
d705e87 [Joseph K. Bradley] Added LinearRegression and Regressor back from ml-api branch
52f4fde [Joseph K. Bradley] removing everything except for simple class hierarchy for classification
d35bb5d [Joseph K. Bradley] fixed compilation issues, but have not added tests yet
bfade12 [Joseph K. Bradley] Added lots of classes for new ML API:
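
For readers new to the hierarchy described under "Sketch of APIs", the following is a minimal, Spark-free sketch of the idea: a weakly typed `fit()` that delegates to `train()`, a strongly typed `predict()` on the model, and output columns that can be switched off. Everything here is a simplified stand-in and not the code from this commit: `Row` is a plain `Map`, `MidpointClassifier` and `MidpointModel` are hypothetical, and treating an empty column name as "column disabled" is an assumption about how the SPARK-4942 toggle behaves.

```scala
object PredictionApiSketch {
  // Stand-in for a DataFrame row; the real API works on DataFrames.
  type Row = Map[String, Any]

  // Test-time API: strongly typed in the features type.
  abstract class PredictionModel[FeaturesType] {
    def featuresCol: String = "features"
    def predictionCol: String = "prediction"
    def predict(features: FeaturesType): Double

    // Weakly typed batch transform that appends the enabled output columns.
    def transform(rows: Seq[Row]): Seq[Row] = rows.map { row =>
      row ++ outputColumns(row(featuresCol).asInstanceOf[FeaturesType])
    }

    // Assumption: an empty column name means "do not produce this column".
    protected def outputColumns(features: FeaturesType): Map[String, Any] =
      if (predictionCol.nonEmpty) Map(predictionCol -> predict(features)) else Map.empty
  }

  abstract class ClassificationModel[FeaturesType] extends PredictionModel[FeaturesType] {
    def numClasses: Int
    def rawPredictionCol: String = "rawPrediction"
    def predictRaw(features: FeaturesType): Array[Double]

    // The predicted label is the index of the largest raw score.
    override def predict(features: FeaturesType): Double = {
      val raw = predictRaw(features)
      raw.indexOf(raw.max).toDouble
    }

    override protected def outputColumns(features: FeaturesType): Map[String, Any] = {
      val raw: Map[String, Any] =
        if (rawPredictionCol.nonEmpty) Map(rawPredictionCol -> predictRaw(features).toSeq)
        else Map.empty
      super.outputColumns(features) ++ raw
    }
  }

  // Training-time API: fit() does the generic checks, developers implement train().
  abstract class Predictor[FeaturesType, M <: PredictionModel[FeaturesType]] {
    protected def train(rows: Seq[Row]): M

    def fit(rows: Seq[Row]): M = {
      // Stand-in for schema validation.
      require(rows.forall(r => r.contains("features") && r.contains("label")),
        "every training row needs 'features' and 'label'")
      train(rows)
    }
  }

  // Hypothetical binary classifier on one Double feature:
  // the decision threshold is the midpoint of the two class means.
  class MidpointModel(threshold: Double, override val rawPredictionCol: String)
      extends ClassificationModel[Double] {
    val numClasses = 2
    def predictRaw(x: Double): Array[Double] = Array(threshold - x, x - threshold)
  }

  class MidpointClassifier(rawPredictionCol: String = "rawPrediction")
      extends Predictor[Double, MidpointModel] {
    protected def train(rows: Seq[Row]): MidpointModel = {
      def meanFor(label: Double): Double = {
        val xs = rows.collect {
          case r if r("label").asInstanceOf[Double] == label =>
            r("features").asInstanceOf[Double]
        }
        xs.sum / xs.size
      }
      new MidpointModel((meanFor(0.0) + meanFor(1.0)) / 2, rawPredictionCol)
    }
  }

  def main(args: Array[String]): Unit = {
    val training: Seq[Row] = Seq(
      Map("features" -> 1.0, "label" -> 0.0),
      Map("features" -> 2.0, "label" -> 0.0),
      Map("features" -> 8.0, "label" -> 1.0),
      Map("features" -> 9.0, "label" -> 1.0))
    val test: Seq[Row] = Seq(Map("features" -> 7.0))

    // All output columns enabled.
    println(new MidpointClassifier().fit(training).transform(test).head.keySet)
    // Raw prediction column switched off.
    println(new MidpointClassifier(rawPredictionCol = "").fit(training).transform(test).head.keySet)
  }
}
```

Because `train()` is the only method a new algorithm has to supply, validation lives in one place. That is the motivation given in commit f549e34 above for having `fit()` call `train()`.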
Changed files:
- examples/src/main/java/org/apache/spark/examples/ml/JavaCrossValidatorExample.java (4 additions, 2 deletions)
- examples/src/main/java/org/apache/spark/examples/ml/JavaDeveloperApiExample.java (217 additions, 0 deletions)
- examples/src/main/java/org/apache/spark/examples/ml/JavaSimpleParamsExample.java (6 additions, 4 deletions)
- examples/src/main/java/org/apache/spark/examples/ml/JavaSimpleTextClassificationPipeline.java (3 additions, 1 deletion)
- examples/src/main/scala/org/apache/spark/examples/ml/CrossValidatorExample.scala (4 additions, 3 deletions)
- examples/src/main/scala/org/apache/spark/examples/ml/DeveloperApiExample.scala (184 additions, 0 deletions)
- examples/src/main/scala/org/apache/spark/examples/ml/SimpleParamsExample.scala (8 additions, 8 deletions)
- examples/src/main/scala/org/apache/spark/examples/ml/SimpleTextClassificationPipeline.scala (4 additions, 3 deletions)
- mllib/src/main/scala/org/apache/spark/ml/Estimator.scala (6 additions, 3 deletions)
- mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala (206 additions, 0 deletions)
- mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala (133 additions, 79 deletions)
- mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala (147 additions, 0 deletions)
- mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala (12 additions, 12 deletions)
- mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala (2 additions, 2 deletions)
- mllib/src/main/scala/org/apache/spark/ml/impl/estimator/Predictor.scala (234 additions, 0 deletions)
- mllib/src/main/scala/org/apache/spark/ml/param/params.scala (58 additions, 10 deletions)
- mllib/src/main/scala/org/apache/spark/ml/param/sharedParams.scala (22 additions, 6 deletions)
- mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala (96 additions, 0 deletions)
- mllib/src/main/scala/org/apache/spark/ml/regression/Regressor.scala (78 additions, 0 deletions)
- mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala (13 additions, 0 deletions)