  1. Apr 08, 2016
    • wm624@hotmail.com's avatar
      [SPARK-12569][PYSPARK][ML] DecisionTreeRegressor: provide variance of prediction: Python API · e0ad75f2
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      A new output column, varianceCol, has been added to DecisionTreeRegressor in the ML Scala code.
      
      This patch adds the corresponding Python API (the HasVarianceCol mixin) to the DecisionTreeRegressor class.
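      A minimal sketch of how the new parameter might be used from Python (train_df and test_df are placeholder DataFrames, not part of this patch):
      
      ```python
      from pyspark.ml.regression import DecisionTreeRegressor
      
      # Request the extra per-prediction variance output column
      dt = DecisionTreeRegressor(varianceCol="variance")
      model = dt.fit(train_df)                 # train_df: placeholder training DataFrame
      model.transform(test_df).select("prediction", "variance").show()
      ```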
      
      ## How was this patch tested?
      ./dev/lint-python
      PEP8 checks passed.
      rm -rf _build/*
      pydoc checks passed.
      
      ./python/run-tests --python-executables=python2.7 --modules=pyspark-ml
      Running PySpark tests. Output is in /Users/mwang/spark_ws_0904/python/unit-tests.log
      Will test against the following Python executables: ['python2.7']
      Will test the following Python modules: ['pyspark-ml']
      Finished test(python2.7): pyspark.ml.evaluation (12s)
      Finished test(python2.7): pyspark.ml.clustering (18s)
      Finished test(python2.7): pyspark.ml.classification (30s)
      Finished test(python2.7): pyspark.ml.recommendation (28s)
      Finished test(python2.7): pyspark.ml.feature (43s)
      Finished test(python2.7): pyspark.ml.regression (31s)
      Finished test(python2.7): pyspark.ml.tuning (19s)
      Finished test(python2.7): pyspark.ml.tests (34s)
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #12116 from wangmiao1981/fix_api.
      e0ad75f2
    • Kai Jiang's avatar
      [SPARK-14373][PYSPARK] PySpark RandomForestClassifier, Regressor support export/import · e5d8d6e0
      Kai Jiang authored
      ## What changes were proposed in this pull request?
      Support `RandomForest{Classifier, Regressor}` save/load in the Python API.
      [JIRA](https://issues.apache.org/jira/browse/SPARK-14373)
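      A rough usage sketch, assuming a placeholder training DataFrame `train_df` and an arbitrary output path:
      
      ```python
      from pyspark.ml.classification import RandomForestClassifier, RandomForestClassificationModel
      
      rf = RandomForestClassifier(numTrees=3)
      model = rf.fit(train_df)                               # train_df: placeholder DataFrame
      model.save("/tmp/rf_model")                            # persist the fitted model
      same_model = RandomForestClassificationModel.load("/tmp/rf_model")
      ```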
      ## How was this patch tested?
      doctest
      
      Author: Kai Jiang <jiangkai@gmail.com>
      
      Closes #12238 from vectorijk/spark-14373.
      e5d8d6e0
  2. Apr 06, 2016
    • Bryan Cutler's avatar
      [SPARK-13430][PYSPARK][ML] Python API for training summaries of linear and logistic regression · 9c6556c5
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      
      Adding Python API for training summaries of LogisticRegression and LinearRegression in PySpark ML.
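      A minimal sketch of how the new summaries could be accessed (train_df is a placeholder DataFrame):
      
      ```python
      from pyspark.ml.classification import LogisticRegression
      
      lr = LogisticRegression(maxIter=10)
      model = lr.fit(train_df)               # train_df: placeholder DataFrame
      
      if model.hasSummary:                   # summaries exist only on freshly trained models
          summary = model.summary
          print(summary.totalIterations)
          print(summary.objectiveHistory)    # objective value at each iteration
      ```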
      
      ## How was this patch tested?
      Added unit tests to exercise the API calls for the summary classes. Also manually verified that values are as expected and match those from Scala.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #11621 from BryanCutler/pyspark-ml-summary-SPARK-13430.
      9c6556c5
    • Xusen Yin's avatar
      [SPARK-13786][ML][PYSPARK] Add save/load for pyspark.ml.tuning · db0b06c6
      Xusen Yin authored
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-13786
      
      Add save/load for Python CrossValidator/Model and TrainValidationSplit/Model.
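      A hedged sketch of the intended usage (train_df is a placeholder DataFrame; the paths are arbitrary):
      
      ```python
      from pyspark.ml.classification import LogisticRegression
      from pyspark.ml.evaluation import BinaryClassificationEvaluator
      from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, ParamGridBuilder
      
      lr = LogisticRegression()
      grid = ParamGridBuilder().addGrid(lr.maxIter, [5, 10]).build()
      cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                          evaluator=BinaryClassificationEvaluator())
      
      cv.save("/tmp/cv")                       # persist the (unfitted) CrossValidator
      restored = CrossValidator.load("/tmp/cv")
      
      cv_model = cv.fit(train_df)              # train_df: placeholder DataFrame
      cv_model.save("/tmp/cv_model")           # the fitted CrossValidatorModel round-trips too
      restored_model = CrossValidatorModel.load("/tmp/cv_model")
      ```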
      
      ## How was this patch tested?
      
      Test with Python doctest.
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #12020 from yinxusen/SPARK-13786.
      db0b06c6
    • Davies Liu's avatar
      [SPARK-14418][PYSPARK] fix unpersist of Broadcast in Python · 90ca1844
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Currently, Broadcast.unpersist() removes the broadcast's underlying file, which should be the behavior of destroy().
      
      This PR adds destroy() for Broadcast in Python, to match the semantics in Scala.
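      A small sketch of the distinction, assuming an existing SparkContext `sc`:
      
      ```python
      b = sc.broadcast([1, 2, 3])
      assert b.value == [1, 2, 3]
      
      b.unpersist()   # frees cached copies on the executors; driver-side state is kept
      b.destroy()     # releases all resources, including the driver-side broadcast file
      ```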
      
      ## How was this patch tested?
      
      Added regression tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12189 from davies/py_unpersist.
      90ca1844
  3. Apr 05, 2016
    • Burak Yavuz's avatar
      [SPARK-14353] Dataset Time Window `window` API for Python, and SQL · 9ee5c257
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      The `window` function was added to Dataset with [this PR](https://github.com/apache/spark/pull/12008).
      This PR adds the Python, and SQL, API for this function.
      
      With this PR, SQL, Java, and Scala share the same APIs; users can call:
       - `window(timeColumn, windowDuration)`
       - `window(timeColumn, windowDuration, slideDuration)`
       - `window(timeColumn, windowDuration, slideDuration, startTime)`
      
      In Python, users can access all the APIs above, and in addition they can call:
       - `window(timeColumn, windowDuration, startTime=...)`
      
      that is, they can provide `startTime` without providing `slideDuration`. In that case, tumbling windows are generated.
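      A short usage sketch (events is a placeholder DataFrame with a timestamp column named "time"):
      
      ```python
      from pyspark.sql.functions import window
      
      # Sliding windows: 10-minute windows, sliding every 5 minutes
      sliding = events.groupBy(window("time", "10 minutes", "5 minutes")).count()
      
      # Python-only shorthand: startTime without slideDuration yields tumbling windows
      tumbling = events.groupBy(window("time", "10 minutes", startTime="1 minute")).count()
      ```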
      
      ## How was this patch tested?
      
      Unit tests + manual tests
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #12136 from brkyvz/python-windows.
      9ee5c257
  4. Apr 04, 2016
    • Yong Tang's avatar
      [SPARK-14368][PYSPARK] Support python.spark.worker.memory with upper-case unit. · 7db56244
      Yong Tang authored
      ## What changes were proposed in this pull request?
      
      This fix addresses an issue in PySpark where `spark.python.worker.memory`
      could only be configured with a lower-case unit (`k`, `m`, `g`, `t`). It now
      allows upper-case units (`K`, `M`, `G`, `T`) as well, conforming to the JVM
      memory string format specified in the documentation.
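      For illustration, a configuration snippet that should now be accepted (the value "512M" is arbitrary):
      
      ```python
      from pyspark import SparkConf, SparkContext
      
      # Upper-case units such as "512M" or "2G" now work, matching JVM memory strings
      conf = SparkConf().set("spark.python.worker.memory", "512M")
      sc = SparkContext(conf=conf)
      ```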
      
      ## How was this patch tested?
      
      This fix adds additional test to cover the changes.
      
      Author: Yong Tang <yong.tang.github@outlook.com>
      
      Closes #12163 from yongtang/SPARK-14368.
      7db56244
    • Marcelo Vanzin's avatar
      [SPARK-13579][BUILD] Stop building the main Spark assembly. · 24d7d2e4
      Marcelo Vanzin authored
      This change modifies the "assembly/" module to just copy needed
      dependencies to its build directory, and modifies the packaging
      script to pick those up (and remove duplicate jars packaged in the
      examples module).
      
      I also made some minor adjustments to dependencies to remove some
      test jars from the final packaging, and remove jars that conflict with each
      other when packaged separately (e.g. servlet api).
      
      Also note that this change restores guava in applications' classpaths, even
      though it's still shaded inside Spark. This is now needed for the Hadoop
      libraries that are packaged with Spark, which now are not processed by
      the shade plugin.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #11796 from vanzin/SPARK-13579.
      24d7d2e4
    • Davies Liu's avatar
      [SPARK-14334] [SQL] add toLocalIterator for Dataset/DataFrame · cc70f174
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      RDD.toLocalIterator() can be used to fetch one partition at a time to reduce memory usage. Right now, for a Dataset/DataFrame we have to use df.rdd.toLocalIterator, which is very slow and also requires lots of memory (because of the Java or Kryo serializer).
      
      This PR introduces an optimized toLocalIterator for Dataset/DataFrame, which is much faster and requires much less memory. For a partition with 5 million rows, `df.rdd.toLocalIterator` took about 100 seconds, but df.toLocalIterator took less than 7 seconds. For 10 million rows, rdd.toLocalIterator crashed (out of memory) with a 4G heap, but df.toLocalIterator finished in 12 seconds.
      
      The JDBC server has been updated to use DataFrame.toLocalIterator.
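      A minimal sketch of iterating a large DataFrame without collecting it (sqlContext is an existing SQLContext):
      
      ```python
      df = sqlContext.range(0, 10000000)     # a large single-column DataFrame of ids
      total = 0
      for row in df.toLocalIterator():       # fetches one partition at a time
          total += row.id
      ```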
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12114 from davies/local_iterator.
      cc70f174
    • Davies Liu's avatar
      [SPARK-12981] [SQL] extract Python UDF in physical plan · 5743c647
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Currently we extract Python UDFs into a special logical plan node, EvaluatePython, in the analyzer. But EvaluatePython is not part of Catalyst; many rules have no knowledge of it, which breaks many things (for example, filter push-down or column pruning).
      
      We should treat Python UDFs as normal expressions until we want to evaluate them in the physical plan; we can extract them at the end of the optimizer, or in the physical plan.
      
      This PR extracts Python UDFs in the physical plan.
      
      Closes #10935
      
      ## How was this patch tested?
      
      Added regression tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12127 from davies/py_udf.
      5743c647
  5. Apr 03, 2016
    • hyukjinkwon's avatar
      [SPARK-14231] [SQL] JSON data source infers floating-point values as a double... · 2262a933
      hyukjinkwon authored
      [SPARK-14231] [SQL] JSON data source infers floating-point values as a double when they do not fit in a decimal
      
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-14231
      
      Currently, the JSON data source supports inferring `DecimalType` for big numbers, and the `floatAsBigDecimal` option reads floating-point values as `DecimalType`.
      
      But there are a few restrictions on Spark `DecimalType`:
      
      1. The precision cannot be bigger than 38.
      2. The scale cannot be bigger than the precision.
      
      Currently, neither restriction is handled.
      
      This PR handles the cases by inferring them as `DoubleType`. Also, the option name was changed from `floatAsBigDecimal` to `prefersDecimal` as suggested [here](https://issues.apache.org/jira/browse/SPARK-14231?focusedCommentId=15215579&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15215579).
      
      So, the codes below:
      
      ```scala
      def doubleRecords: RDD[String] =
        sqlContext.sparkContext.parallelize(
          s"""{"a": 1${"0" * 38}, "b": 0.01}""" ::
          s"""{"a": 2${"0" * 38}, "b": 0.02}""" :: Nil)
      
      val jsonDF = sqlContext.read
        .option("prefersDecimal", "true")
        .json(doubleRecords)
      jsonDF.printSchema()
      ```
      
      produces the output below:
      
      - **Before**
      
      ```scala
      org.apache.spark.sql.AnalysisException: Decimal scale (2) cannot be greater than precision (1).;
      	at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:44)
      	at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:144)
      	at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:108)
      	at
      ...
      ```
      
      - **After**
      
      ```scala
      root
       |-- a: double (nullable = true)
       |-- b: double (nullable = true)
      ```
      
      ## How was this patch tested?
      
      Unit tests were used, and `./dev/run_tests` for code style checks.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #12030 from HyukjinKwon/SPARK-14231.
      2262a933
  6. Apr 02, 2016
  7. Apr 01, 2016
    • Yanbo Liang's avatar
      [SPARK-14305][ML][PYSPARK] PySpark ml.clustering BisectingKMeans support export/import · 381358fb
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Add export/import support for BisectingKMeans in PySpark ml.clustering.
      ## How was this patch tested?
      Doctests.
      
      cc jkbradley
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #12112 from yanboliang/spark-14305.
      381358fb
    • Alexander Ulanov's avatar
      [SPARK-11262][ML] Unit test for gradient, loss layers, memory management for multilayer perceptron · 26867ebc
      Alexander Ulanov authored
      1. Implement LossFunction trait and implement squared error and cross entropy
      loss with it
      2. Implement unit test for gradient and loss
      3. Implement InPlace trait and in-place layer evaluation
      4. Refactor interface for ActivationFunction
      5. Update of Layer and LayerModel interfaces
      6. Fix random weights assignment
      7. Implement memory allocation by MLP model instead of individual layers
      
      These changes decreased memory usage and increased the flexibility of the internal API.
      
      Author: Alexander Ulanov <nashb@yandex.ru>
      Author: avulanov <avulanov@gmail.com>
      
      Closes #9229 from avulanov/mlp-refactoring.
      26867ebc
  8. Mar 31, 2016
    • Davies Liu's avatar
      [SPARK-14267] [SQL] [PYSPARK] execute multiple Python UDFs within single batch · f0afafdc
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR supports multiple Python UDFs within a single batch and also improves performance.
      
      ```python
      >>> from pyspark.sql.types import IntegerType
      >>> sqlContext.registerFunction("double", lambda x: x * 2, IntegerType())
      >>> sqlContext.registerFunction("add", lambda x, y: x + y, IntegerType())
      >>> sqlContext.sql("SELECT double(add(1, 2)), add(double(2), 1)").explain(True)
      == Parsed Logical Plan ==
      'Project [unresolvedalias('double('add(1, 2)), None),unresolvedalias('add('double(2), 1), None)]
      +- OneRowRelation$
      
      == Analyzed Logical Plan ==
      double(add(1, 2)): int, add(double(2), 1): int
      Project [double(add(1, 2))#14,add(double(2), 1)#15]
      +- Project [double(add(1, 2))#14,add(double(2), 1)#15]
         +- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
            +- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18]
               +- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
                  +- OneRowRelation$
      
      == Optimized Logical Plan ==
      Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
      +- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18]
         +- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
            +- OneRowRelation$
      
      == Physical Plan ==
      WholeStageCodegen
      :  +- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
      :     +- INPUT
      +- !BatchPythonEvaluation [add(pythonUDF1#17, 1)], [pythonUDF0#16,pythonUDF1#17,pythonUDF0#18]
         +- !BatchPythonEvaluation [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
            +- Scan OneRowRelation[]
      ```
      
      ## How was this patch tested?
      
      Added new tests.
      
      The following script was used to benchmark 1, 2, and 3 UDFs:
      ```
      df = sqlContext.range(1, 1 << 23, 1, 4)
      double = F.udf(lambda x: x * 2, LongType())
      print df.select(double(df.id)).count()
      print df.select(double(df.id), double(df.id + 1)).count()
      print df.select(double(df.id), double(df.id + 1), double(df.id + 2)).count()
      ```
      Here are the results:
      
      N | Before | After  | speed up
      ---- |------------ | -------------|------
      1 | 22 s | 7 s |  3.1X
      2 | 38 s | 13 s | 2.9X
      3 | 58 s | 16 s | 3.6X
      
      This benchmark ran locally with 4 CPUs. For 3 UDFs, it launched 12 Python processes before this patch and 4 processes after it. After this patch, multiple UDFs also use less memory than before (less buffering).
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12057 from davies/multi_udfs.
      f0afafdc
    • sethah's avatar
      [SPARK-14264][PYSPARK][ML] Add feature importance for GBTs in pyspark · b11887c0
      sethah authored
      ## What changes were proposed in this pull request?
      
      Feature importances are exposed in the Python API for GBTs (see the sketch below).
      
      Other changes:
      * Update the random forest feature importance documentation to not repeat the decision tree docstring and instead reference it.
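      A rough sketch of the exposed attribute (train_df is a placeholder DataFrame of labeled feature vectors):
      
      ```python
      from pyspark.ml.classification import GBTClassifier
      
      gbt = GBTClassifier(maxIter=5)
      model = gbt.fit(train_df)          # train_df: placeholder DataFrame
      print(model.featureImportances)    # vector of per-feature importance scores
      ```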
      
      ## How was this patch tested?
      
      Python doc tests were updated to validate GBT feature importance.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #12056 from sethah/Pyspark_GBT_feature_importance.
      b11887c0
    • Herman van Hovell's avatar
      [SPARK-14211][SQL] Remove ANTLR3 based parser · a9b93e07
      Herman van Hovell authored
      ### What changes were proposed in this pull request?
      
      This PR removes the ANTLR3 based parser, and moves the new ANTLR4 based parser into the `org.apache.spark.sql.catalyst.parser` package.
      
      ### How was this patch tested?
      
      Existing unit tests.
      
      cc rxin andrewor14 yhuai
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #12071 from hvanhovell/SPARK-14211.
      a9b93e07
  9. Mar 30, 2016
  10. Mar 29, 2016
    • Davies Liu's avatar
      [SPARK-14215] [SQL] [PYSPARK] Support chained Python UDFs · a7a93a11
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR brings the support for chained Python UDFs, for example
      
      ```sql
      select udf1(udf2(a))
      select udf1(udf2(a) + 3)
      select udf1(udf2(a) + udf3(b))
      ```
      
      Also, directly chained unary Python UDFs are put in a single batch of Python UDFs; others may require multiple batches.
      
      For example,
      ```python
      >>> sqlContext.sql("select double(double(1))").explain()
      == Physical Plan ==
      WholeStageCodegen
      :  +- Project [pythonUDF#10 AS double(double(1))#9]
      :     +- INPUT
      +- !BatchPythonEvaluation double(double(1)), [pythonUDF#10]
         +- Scan OneRowRelation[]
      >>> sqlContext.sql("select double(double(1) + double(2))").explain()
      == Physical Plan ==
      WholeStageCodegen
      :  +- Project [pythonUDF#19 AS double((double(1) + double(2)))#16]
      :     +- INPUT
      +- !BatchPythonEvaluation double((pythonUDF#17 + pythonUDF#18)), [pythonUDF#17,pythonUDF#18,pythonUDF#19]
         +- !BatchPythonEvaluation double(2), [pythonUDF#17,pythonUDF#18]
            +- !BatchPythonEvaluation double(1), [pythonUDF#17]
               +- Scan OneRowRelation[]
      ```
      
      TODO: will support multiple unrelated Python UDFs in one batch (another PR).
      
      ## How was this patch tested?
      
      Added new unit tests for chained UDFs.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12014 from davies/py_udfs.
      a7a93a11
    • wm624@hotmail.com's avatar
      [SPARK-14071][PYSPARK][ML] Change MLWritable.write to be a property · 63b200e8
      wm624@hotmail.com authored
      Make MLWritable.write a property, so we can use .write instead of .write().
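      A minimal before/after sketch of the usage this change enables, assuming `model` is an already-fitted PySpark ML model:
      
      ```python
      # Before this change: write was a method returning an MLWriter
      model.write().save("/tmp/model")
      
      # After this change: write is a property, so the parentheses are dropped
      model.write.save("/tmp/model")
      ```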
      
      Add a new test to ml/tests.py to check whether write is a property.
      ./python/run-tests --python-executables=python2.7 --modules=pyspark-ml
      
      Will test against the following Python executables: ['python2.7']
      Will test the following Python modules: ['pyspark-ml']
      Finished test(python2.7): pyspark.ml.evaluation (11s)
      Finished test(python2.7): pyspark.ml.clustering (16s)
      Finished test(python2.7): pyspark.ml.classification (24s)
      Finished test(python2.7): pyspark.ml.recommendation (24s)
      Finished test(python2.7): pyspark.ml.feature (39s)
      Finished test(python2.7): pyspark.ml.regression (26s)
      Finished test(python2.7): pyspark.ml.tuning (15s)
      Finished test(python2.7): pyspark.ml.tests (30s)
      Tests passed in 55 seconds
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #11945 from wangmiao1981/fix_property.
      63b200e8
  11. Mar 28, 2016
    • zero323's avatar
      [SPARK-14202] [PYTHON] Use generator expression instead of list comp in python_full_outer_jo… · 39f743a6
      zero323 authored
      ## What changes were proposed in this pull request?
      
      This PR replaces the list comprehension in python_full_outer_join.dispatch with a generator expression.
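      A schematic of the change (vbuf and wbuf stand in for the buffered left/right values inside the dispatch helper):
      
      ```python
      vbuf, wbuf = [1, 2, None], [10, None]
      
      # Before: a list comprehension materializes every pair up front
      pairs_list = [(v, w) for v in vbuf for w in wbuf]
      
      # After: a generator expression yields pairs lazily, avoiding the intermediate list
      pairs_gen = ((v, w) for v in vbuf for w in wbuf)
      assert list(pairs_gen) == pairs_list
      ```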
      
      ## How was this patch tested?
      
      PySpark-Core, PySpark-MLlib test suites against Python 2.7, 3.5.
      
      Author: zero323 <matthew.szymkiewicz@gmail.com>
      
      Closes #11998 from zero323/pyspark-join-generator-expr.
      39f743a6
    • Herman van Hovell's avatar
      [SPARK-13713][SQL] Migrate parser from ANTLR3 to ANTLR4 · 600c0b69
      Herman van Hovell authored
      ### What changes were proposed in this pull request?
      The current ANTLR3 parser is quite complex to maintain and suffers from code blow-ups. This PR introduces a new parser that is based on ANTLR4.
      
      This parser is based on [Presto's SQL parser](https://github.com/facebook/presto/blob/master/presto-parser/src/main/antlr4/com/facebook/presto/sql/parser/SqlBase.g4). The current implementation can parse and create Catalyst and SQL plans. Large parts of the HiveQL DDL and some of the DML functionality are currently missing; the plan is to add these in follow-up PRs.
      
      This PR is a work in progress, and work needs to be done in the following areas:
      
      - [x] Error handling should be improved.
      - [x] Documentation should be improved.
      - [x] Multi-Insert needs to be tested.
      - [ ] Naming and package locations.
      
      ### How was this patch tested?
      
      Catalyst and SQL unit tests.
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #11557 from hvanhovell/ngParser.
      600c0b69
  12. Mar 26, 2016
    • Shixiong Zhu's avatar
      [SPARK-13874][DOC] Remove docs of streaming-akka, streaming-zeromq,... · d23ad7c1
      Shixiong Zhu authored
      [SPARK-13874][DOC] Remove docs of streaming-akka, streaming-zeromq, streaming-mqtt and streaming-twitter
      
      ## What changes were proposed in this pull request?
      
      This PR removes all docs about the old streaming-akka, streaming-zeromq, streaming-mqtt and streaming-twitter projects since I have already copied them to https://github.com/spark-packages
      
      Also remove mqtt_wordcount.py that I forgot to remove previously.
      
      ## How was this patch tested?
      
      Jenkins PR Build.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11824 from zsxwing/remove-doc.
      d23ad7c1
  13. Mar 25, 2016
    • Shixiong Zhu's avatar
      [SPARK-14073][STREAMING][TEST-MAVEN] Move flume back to Spark · 24587ce4
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR moves Flume back to Spark, as per the discussion on the dev mailing list.
      
      ## How was this patch tested?
      
      Existing Jenkins tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11895 from zsxwing/move-flume-back.
      24587ce4
    • Wenchen Fan's avatar
      [SPARK-14061][SQL] implement CreateMap · 43b15e01
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      As we have `CreateArray` and `CreateStruct`, we should also have `CreateMap`. This PR adds the `CreateMap` expression, along with its DataFrame and Python APIs.
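      A brief sketch of the Python side (sqlContext is an existing SQLContext; the sample rows are arbitrary):
      
      ```python
      from pyspark.sql import Row
      from pyspark.sql.functions import create_map, col
      
      # create_map takes alternating key and value columns and builds a map column
      df = sqlContext.createDataFrame([Row(k="a", v=1), Row(k="b", v=2)])
      df.select(create_map(col("k"), col("v")).alias("m")).show()
      ```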
      
      ## How was this patch tested?
      
      various new tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11879 from cloud-fan/create_map.
      43b15e01
    • Andrew Or's avatar
      [SPARK-14014][SQL] Integrate session catalog (attempt #2) · 20ddf5fd
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      This reopens #11836, which was merged but promptly reverted because it introduced flaky Hive tests.
      
      ## How was this patch tested?
      
      See `CatalogTestCases`, `SessionCatalogSuite` and `HiveContextSuite`.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #11938 from andrewor14/session-catalog-again.
      20ddf5fd
    • Reynold Xin's avatar
      [SPARK-14142][SQL] Replace internal use of unionAll with union · 3619fec1
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      unionAll has been deprecated in SPARK-14088.
      
      ## How was this patch tested?
      Should be covered by all existing tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11946 from rxin/SPARK-14142.
      3619fec1
  14. Mar 24, 2016
  15. Mar 23, 2016
    • Joseph K. Bradley's avatar
      [SPARK-12183][ML][MLLIB] Remove mllib tree implementation, and wrap spark.ml one · cf823bea
      Joseph K. Bradley authored
      Primary change:
      * Removed spark.mllib.tree.DecisionTree implementation of tree and forest learning.
      * spark.mllib now calls the spark.ml implementation.
      * Moved unit tests (of tree learning internals) from spark.mllib to spark.ml as needed.
      
      ml.tree.DecisionTreeModel
      * Added toOld and made ```private[spark]```, implemented for Classifier and Regressor in subclasses.  These methods now use OldInformationGainStats.invalidInformationGainStats for LeafNodes in order to mimic the spark.mllib implementation.
      
      ml.tree.Node
      * Added ```private[tree] def deepCopy```, used by unit tests
      
      Copied developer comments from spark.mllib implementation to spark.ml one.
      
      Moving unit tests
      * Tree learning internals were tested by spark.mllib.tree.DecisionTreeSuite, or spark.mllib.tree.RandomForestSuite.
      * Those tests were all moved to spark.ml.tree.impl.RandomForestSuite.  The order in the file + the test names are the same, so you should be able to compare them by opening them in 2 windows side-by-side.
      * I made minimal changes to each test to allow it to run.  Each test makes the same checks as before, except for a few removed assertions which were checking irrelevant values.
      * No new unit tests were added.
      * mllib.tree.DecisionTreeSuite: I removed some checks of splits and bins which were not relevant to the unit tests they were in.  Those same split calculations were already being tested in other unit tests, for each dataset type.
      
      **Changes of behavior** (to be noted in SPARK-13448 once this PR is merged)
      * spark.ml.tree.impl.RandomForest: Rather than throwing an error when maxMemoryInMB is set to too small a value (to split any node), we now allow 1 node to be split, even if its memory requirements exceed maxMemoryInMB.  This involved removing the maxMemoryPerNode check in RandomForest.run, as well as modifying selectNodesToSplit().  Once this PR is merged, I will note the change of behavior on SPARK-13448.
      * spark.mllib.tree.DecisionTree: When a tree only has one node (root = leaf node), the "stats" field will now be empty, rather than being set to InformationGainStats.invalidInformationGainStats.  This does not remove information from the tree, and it will save a bit of storage.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #11855 from jkbradley/remove-mllib-tree-impl.
      cf823bea
    • Andrew Or's avatar
      [SPARK-14014][SQL] Replace existing catalog with SessionCatalog · 5dfc0197
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      `SessionCatalog`, introduced in #11750, is a catalog that keeps track of temporary functions and tables, and delegates metastore operations to `ExternalCatalog`. This functionality overlaps a lot with the existing `analysis.Catalog`.
      
      As of this commit, `SessionCatalog` and `ExternalCatalog` will no longer be dead code. There are still things that need to be done after this patch, namely:
      - SPARK-14013: Properly implement temporary functions in `SessionCatalog`
      - SPARK-13879: Decide which DDL/DML commands to support natively in Spark
      - SPARK-?????: Implement the ones we do want to support through `SessionCatalog`.
      - SPARK-?????: Merge SQL/HiveContext
      
      ## How was this patch tested?
      
      This is largely a refactoring task so there are no new tests introduced. The particularly relevant tests are `SessionCatalogSuite` and `ExternalCatalogSuite`.
      
      Author: Andrew Or <andrew@databricks.com>
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #11836 from andrewor14/use-session-catalog.
      5dfc0197
    • sethah's avatar
      [SPARK-13068][PYSPARK][ML] Type conversion for Pyspark params · 30bdb5cb
      sethah authored
      ## What changes were proposed in this pull request?
      
      This patch adds type conversion functionality for parameters in Pyspark. A `typeConverter` field is added to the constructor of `Param` class. This argument is a function which converts values passed to this param to the appropriate type if possible. This is beneficial so that the params can fail at set time if they are given inappropriate values, but even more so because coherent error messages are now provided when Py4J cannot cast the python type to the appropriate Java type.
      
      This patch also adds a `TypeConverters` class with factory methods for common type conversions. Most of the changes involve adding these factory type converters to existing params. The previous solution to this issue, `expectedType`, is deprecated and can be removed in 2.1.0 as discussed on the Jira.
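      A minimal sketch of declaring a param with a converter (the param name "threshold" is illustrative):
      
      ```python
      from pyspark.ml.param import Param, Params, TypeConverters
      
      # Values set for this param are coerced to float; values that cannot be
      # converted now fail at set time with a coherent error message.
      threshold = Param(Params._dummy(), "threshold", "decision threshold",
                        typeConverter=TypeConverters.toFloat)
      ```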
      
      ## How was this patch tested?
      
      Unit tests were added in python/pyspark/ml/tests.py to test parameter type conversion. These tests check that values that should be convertible are converted correctly, and that the appropriate errors are thrown when invalid values are provided.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #11663 from sethah/SPARK-13068-tc.
      30bdb5cb
    • Reynold Xin's avatar
      [SPARK-14088][SQL] Some Dataset API touch-up · 926a93e5
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      1. Deprecated unionAll. It is pretty confusing to have both "union" and "unionAll" when the two do the same thing in Spark but are different in SQL.
      2. Rename reduce in KeyValueGroupedDataset to reduceGroups so it is more consistent with rest of the functions in KeyValueGroupedDataset. Also makes it more obvious what "reduce" and "reduceGroups" mean. Previously it was confusing because it could be reducing a Dataset, or just reducing groups.
      3. Added a "name" function, which is more natural to name columns than "as" for non-SQL users.
      4. Remove "subtract" function since it is just an alias for "except".
      
      ## How was this patch tested?
      All changes should be covered by existing tests. Also added a couple of test cases to cover "name".
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11908 from rxin/SPARK-14088.
      926a93e5
  16. Mar 22, 2016
    • Joseph K. Bradley's avatar
      [SPARK-13951][ML][PYTHON] Nested Pipeline persistence · 7e3423b9
      Joseph K. Bradley authored
      Adds support for saving and loading nested ML Pipelines from Python. Pipeline and PipelineModel do not extend JavaWrapper, but they are able to use the JavaMLWriter and JavaMLReader implementations.
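      A small sketch of the new capability (column names and the save path are arbitrary):
      
      ```python
      from pyspark.ml import Pipeline
      from pyspark.ml.classification import LogisticRegression
      from pyspark.ml.feature import HashingTF, Tokenizer
      
      tokenizer = Tokenizer(inputCol="text", outputCol="words")
      inner = Pipeline(stages=[HashingTF(inputCol="words", outputCol="features"),
                               LogisticRegression()])
      outer = Pipeline(stages=[tokenizer, inner])   # a Pipeline nested inside a Pipeline
      
      outer.save("/tmp/nested_pipeline")
      restored = Pipeline.load("/tmp/nested_pipeline")
      ```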
      
      Also:
      * Separates out interfaces from Java wrapper implementations for MLWritable, MLReadable, MLWriter, MLReader.
      * Moves methods _stages_java2py, _stages_py2java into Pipeline, PipelineModel as _transfer_stage_from_java, _transfer_stage_to_java
      
      Added new unit test for nested Pipelines.  Abstracted validity check into a helper method for the 2 unit tests.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #11866 from jkbradley/nested-pipeline-io.
      Closes #11835
      7e3423b9
    • hyukjinkwon's avatar
      [SPARK-13953][SQL] Specifying the field name for corrupted record via option at JSON datasource · 4e09a0d5
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-13953
      
      Currently, the JSON data source creates a new field in `PERMISSIVE` mode for storing the malformed string.
      This field can be renamed via the `spark.sql.columnNameOfCorruptRecord` option, but it is a global configuration.
      
      This PR makes that option applicable per read, specified via `option()`. It overrides `spark.sql.columnNameOfCorruptRecord` if that is set.
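      A hedged sketch of the per-read option (the column name "_malformed" and the path are placeholders):
      
      ```python
      df = (sqlContext.read
            .option("columnNameOfCorruptRecord", "_malformed")   # per-read override
            .json("/path/to/records.json"))
      ```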
      
      ## How was this patch tested?
      
      Unit tests were used, and `./dev/run_tests` for code style checks.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #11881 from HyukjinKwon/SPARK-13953.
      4e09a0d5
    • zero323's avatar
      [SPARK-14058][PYTHON] Incorrect docstring in Window.order · 8193a266
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Replaces current docstring ("Creates a :class:`WindowSpec` with the partitioning defined.") with "Creates a :class:`WindowSpec` with the ordering defined."
      
      ## How was this patch tested?
      
      PySpark unit tests (no regression introduced). No changes to the code.
      
      Author: zero323 <matthew.szymkiewicz@gmail.com>
      
      Closes #11877 from zero323/order-by-description.
      8193a266
  17. Mar 21, 2016
    • hyukjinkwon's avatar
      [SPARK-13764][SQL] Parse modes in JSON data source · e4740881
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Currently, there is no way to control the behaviour when the JSON data source fails to parse corrupt records.
      
      This PR adds support for parse modes, just like the CSV data source. There are three modes:
      
      - `PERMISSIVE`: When parsing fails, this sets the field to `null`. This is the default mode.
      - `DROPMALFORMED`: When parsing fails, this drops the whole record.
      - `FAILFAST`: When parsing fails, this throws an exception.
      
      This PR also makes the JSON data source share `ParseModes` with the CSV data source.
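      A short usage sketch (corrupt_records is a placeholder path or RDD of JSON strings):
      
      ```python
      permissive = sqlContext.read.option("mode", "PERMISSIVE").json(corrupt_records)
      dropping   = sqlContext.read.option("mode", "DROPMALFORMED").json(corrupt_records)
      failfast   = sqlContext.read.option("mode", "FAILFAST").json(corrupt_records)  # raises on bad input
      ```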
      
      ## How was this patch tested?
      
      Unit tests were used, and `./dev/run_tests` for code style checks.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #11756 from HyukjinKwon/SPARK-13764.
      e4740881
  18. Mar 20, 2016
  19. Mar 17, 2016
    • Bryan Cutler's avatar
      [SPARK-13937][PYSPARK][ML] Change JavaWrapper _java_obj from static to member variable · 828213d4
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      In PySpark's wrapper.py, change JavaWrapper._java_obj from an unused static variable to a member variable, consistent with its usage in derived classes.
      
      ## How was this patch tested?
      Ran python tests for ML and MLlib.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #11767 from BryanCutler/JavaWrapper-static-_java_obj-SPARK-13937.
      828213d4