  1. Mar 04, 2016
    • [SPARK-13676] Fix mismatched default values for regParam in LogisticRegression · c8f25459
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      The default value of the regularization parameter for the `LogisticRegression` algorithm differs between Scala and Python. We should provide the same value.
      
      **Scala**
      ```
      scala> new org.apache.spark.ml.classification.LogisticRegression().getRegParam
      res0: Double = 0.0
      ```
      
      **Python**
      ```
      >>> from pyspark.ml.classification import LogisticRegression
      >>> LogisticRegression().getRegParam()
      0.1
      ```
      
      ## How was this patch tested?
      Manual. Check the following in `pyspark`.
      ```
      >>> from pyspark.ml.classification import LogisticRegression
      >>> LogisticRegression().getRegParam()
      0.0
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11519 from dongjoon-hyun/SPARK-13676.
  2. Mar 03, 2016
    • [SPARK-13647] [SQL] also check if numeric value is within allowed range in _verify_type · 15d57f9c
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This PR makes `_verify_type` in `types.py` stricter: it now also checks whether numeric values are within the allowed range for their declared type.
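      
      For illustration, a minimal sketch of the kind of range check this adds (hypothetical code, not the actual `types.py` implementation; the ranges assume signed fixed-width integer types):
      ```
      # Illustrative sketch only -- not the actual Spark types.py code.
      # A numeric value must fit the declared type's width, e.g. ByteType
      # in a signed byte and ShortType in a signed short.
      _NUMERIC_RANGES = {
          "byte": (-128, 127),
          "short": (-32768, 32767),
          "int": (-2147483648, 2147483647),
      }

      def _check_numeric_range(value, type_name):
          lo, hi = _NUMERIC_RANGES[type_name]
          if not lo <= value <= hi:
              raise ValueError("%r is out of range for %s" % (value, type_name))

      _check_numeric_range(100, "byte")     # passes
      # _check_numeric_range(1000, "byte")  # would raise ValueError
      ```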
      
      ## How was this patch tested?
      
      Newly added doctest.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11492 from cloud-fan/py-verify.
    • [MINOR] Fix typos in comments and testcase name of code · 941b270b
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR fixes typos in comments and testcase name of code.
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11481 from dongjoon-hyun/minor_fix_typos_in_code.
    • [SPARK-13543][SQL] Support for specifying compression codec for Parquet/ORC via option() · cf95d728
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR adds support for specifying compression codecs for both ORC and Parquet.
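      
      For example, usage along these lines (the DataFrame `df` and the paths are illustrative, and the available codec names depend on the format):
      ```
      # Write Parquet with snappy and ORC with zlib compression.
      df.write.format("parquet").option("compression", "snappy").save("/tmp/out_parquet")
      df.write.format("orc").option("compression", "zlib").save("/tmp/out_orc")
      ```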
      
      ## How was this patch tested?
      
      Unit tests within the IDE and code style tests with `dev/run_tests`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #11464 from HyukjinKwon/SPARK-13543.
    • [SPARK-12877][ML] Add train-validation-split to pyspark · 511d4929
      JeremyNixon authored
      ## What changes were proposed in this pull request?
      The changes proposed were to add train-validation-split to pyspark.ml.tuning.
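      
      A sketch of the expected usage (the training DataFrame `train_df` here is hypothetical):
      ```
      from pyspark.ml.classification import LogisticRegression
      from pyspark.ml.evaluation import BinaryClassificationEvaluator
      from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

      lr = LogisticRegression()
      grid = ParamGridBuilder().addGrid(lr.regParam, [0.0, 0.1]).build()
      tvs = TrainValidationSplit(estimator=lr,
                                 estimatorParamMaps=grid,
                                 evaluator=BinaryClassificationEvaluator(),
                                 trainRatio=0.75)
      model = tvs.fit(train_df)  # train_df: a hypothetical labeled DataFrame
      ```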
      
      ## How was this patch tested?
      This patch was tested through unit tests located in pyspark/ml/tests.py.
      
      This is my original work and I license it to Spark.
      
      Author: JeremyNixon <jnixon2@gmail.com>
      
      Closes #11335 from JeremyNixon/tvs_pyspark.
  3. Mar 01, 2016
    • [SPARK-13008][ML][PYTHON] Put one alg per line in pyspark.ml all lists · 9495c40f
      Joseph K. Bradley authored
      This is to fix a long-time annoyance: Whenever we add a new algorithm to pyspark.ml, we have to add it to the ```__all__``` list at the top.  Since we keep it alphabetized, it often creates a lot more changes than needed.  It is also easy to add the Estimator and forget the Model.  I'm going to switch it to have one algorithm per line.
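      
      Illustratively, the ```__all__``` list changes to one name per line, keeping each Estimator and its Model adjacent (class names here are just examples):
      ```
      __all__ = [
          "LogisticRegression",
          "LogisticRegressionModel",
          "NaiveBayes",
          "NaiveBayesModel",
      ]
      ```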
      
      This also alphabetizes a few out-of-place classes in pyspark.ml.feature.  No changes have been made to the moved classes.
      
      CC: thunterdb
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #10927 from jkbradley/ml-python-all-list.
  4. Feb 29, 2016
    • [SPARK-13509][SPARK-13507][SQL] Support for writing CSV with a single function call · 02aa499d
      hyukjinkwon authored
      https://issues.apache.org/jira/browse/SPARK-13507
      https://issues.apache.org/jira/browse/SPARK-13509
      
      ## What changes were proposed in this pull request?
      This PR adds support for writing CSV data directly to a given path with a single function call.
      
      Several unit tests were added for each functionality.
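      
      For example, writing along these lines (the DataFrame `df` and the path are illustrative):
      ```
      df.write.csv("/tmp/out")
      # equivalent to: df.write.format("csv").save("/tmp/out")
      ```
      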
      ## How was this patch tested?
      
      This was tested with unit tests and with `dev/run_tests` for coding style.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      Author: Hyukjin Kwon <gurwls223@gmail.com>
      
      Closes #11389 from HyukjinKwon/SPARK-13507-13509.
    • [SPARK-12633][PYSPARK] [DOC] PySpark regression parameter desc to consistent format · 236e3c8f
      vijaykiran authored
      Part of the task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the regression module. Also updated two params in classification to read as `Supported values:` for consistency.
      
      closes #10600
      
      Author: vijaykiran <mail@vijaykiran.com>
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #11404 from BryanCutler/param-desc-consistent-regression-SPARK-12633.
    • [SPARK-13545][MLLIB][PYSPARK] Make MLlib LogisticRegressionWithLBFGS's default parameters consistent in Scala and Python · d81a7135
      Yanbo Liang authored
      
      ## What changes were proposed in this pull request?
      * The default value of ```regParam``` of PySpark MLlib ```LogisticRegressionWithLBFGS``` should be consistent with Scala, which is ```0.0```. (This is also consistent with ML ```LogisticRegression```.)
      * BTW, if we use a known updater (L1 or L2) for binary classification, ```LogisticRegressionWithLBFGS``` will call the ML implementation. We should update the API doc to clarify that ```numCorrections``` will have no effect if we fall into that route.
      * Made a pass over all parameters of ```LogisticRegressionWithLBFGS```; the others are set properly.
      
      cc mengxr dbtsai
      ## How was this patch tested?
      No new tests; it should pass all current tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11424 from yanboliang/spark-13545.
  5. Feb 25, 2016
    • [SPARK-13033] [ML] [PYSPARK] Add import/export for ml.regression · f3be369e
      Tommy YU authored
      Add export/import for all estimators and transformers (which have a Scala implementation) under pyspark/ml/regression.py.
      
      yanboliang Please help to review.
      For doctests, I thought it was enough to add one since it covers the common usage, but I can add them to all if we want.
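      
      A sketch of the resulting usage, assuming the save/load pattern used elsewhere in pyspark.ml (the paths are hypothetical):
      ```
      from pyspark.ml.regression import LinearRegression

      lr = LinearRegression(maxIter=5)
      lr.save("/tmp/lr_estimator")                      # export
      lr2 = LinearRegression.load("/tmp/lr_estimator")  # import
      ```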
      
      Author: Tommy YU <tummyyu@163.com>
      
      Closes #11000 from Wenpei/spark-13033-ml.regression-exprot-import and squashes the following commits:
      
      3646b36 [Tommy YU] address review comments
      9cddc98 [Tommy YU] change base on review and pr 11197
      cc61d9d [Tommy YU] remove default parameter set
      19535d4 [Tommy YU] add export/import to regression
      44a9dc2 [Tommy YU] add import/export for ml.regression
    • [SPARK-13292] [ML] [PYTHON] QuantileDiscretizer should take random seed in PySpark · 35316cb0
      Yu ISHIKAWA authored
      ## What changes were proposed in this pull request?
      QuantileDiscretizer in Python should also accept a random seed.
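      
      For example (input/output column names are illustrative):
      ```
      from pyspark.ml.feature import QuantileDiscretizer

      qd = QuantileDiscretizer(numBuckets=2, inputCol="x",
                               outputCol="bucket", seed=42)
      ```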
      
      ## How was this patch tested?
      unit tests
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #11362 from yu-iskw/SPARK-13292 and squashes the following commits:
      
      02ffa76 [Yu ISHIKAWA] [SPARK-13292][ML][PYTHON] QuantileDiscretizer should take random seed in PySpark
    • [SPARK-7106][MLLIB][PYSPARK] Support model save/load in Python's FPGrowth · 4d2864b2
      Kai Jiang authored
      ## What changes were proposed in this pull request?
      
      The Python API supports model save/load in FPGrowth.
      JIRA: [https://issues.apache.org/jira/browse/SPARK-7106](https://issues.apache.org/jira/browse/SPARK-7106)
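      
      A sketch of the expected usage, following the usual MLlib save/load pattern (`sc` and the `transactions` RDD are assumed to exist):
      ```
      from pyspark.mllib.fpm import FPGrowth, FPGrowthModel

      model = FPGrowth.train(transactions, minSupport=0.3)
      model.save(sc, "/tmp/fpm_model")
      same_model = FPGrowthModel.load(sc, "/tmp/fpm_model")
      ```
      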
      ## How was this patch tested?
      
      The patch is tested with Python doctest.
      
      Author: Kai Jiang <jiangkai@gmail.com>
      
      Closes #11321 from vectorijk/spark-7106.
    • [SPARK-13479][SQL][PYTHON] Added Python API for approxQuantile · 13ce10e9
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      * Scala DataFrameStatFunctions: Added a version of approxQuantile taking a List instead of an Array, for Python compatibility
      * Python DataFrame and DataFrameStatFunctions: Added approxQuantile
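      
      For example, the Python API can be called along these lines (the DataFrame `df` and the column name are illustrative):
      ```
      # approximate 25th/50th/75th percentiles of column "value",
      # with a relative error of 0.05
      quantiles = df.approxQuantile("value", [0.25, 0.5, 0.75], 0.05)
      ```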
      
      ## How was this patch tested?
      
      * unit test in sql/tests.py
      
      Documentation was copied from the existing approxQuantile exactly.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #11356 from jkbradley/approx-quantile-python.
  6. Feb 24, 2016
    • [SPARK-13250] [SQL] Update PhysicalRDD to convert to UnsafeRow if using the vectorized scanner. · 5a7af9e7
      Nong Li authored
      Some parts of the engine rely on UnsafeRow, which the vectorized Parquet scanner does not want
      to produce. This adds a conversion in PhysicalRDD. In the case where codegen is used (and the
      scan is the start of the pipeline), there is no requirement to use UnsafeRow. This patch updates
      PhysicalRDD to support codegen, which eliminates the need for the UnsafeRow conversion
      in all cases.
      
      These changes reduce the query time for TPC-DS Q19 at the 10 GB scale factor from 9.5 seconds
      to 6.5 seconds.
      
      Author: Nong Li <nong@databricks.com>
      
      Closes #11141 from nongli/spark-13250.
    • [SPARK-13467] [PYSPARK] abstract python function to simplify pyspark code · a60f9128
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      When we pass a Python function to the JVM side, we also need to send its context, e.g. `envVars`, `pythonIncludes`, `pythonExec`, etc. However, it's annoying to pass around so many parameters in many places. This PR abstracts the Python function along with its context, to simplify some PySpark code and make the logic clearer.
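      
      A rough sketch of the idea in Python (illustrative only; the actual abstraction lives on the JVM side, and any field beyond those mentioned above is an assumption):
      ```
      from collections import namedtuple

      # Bundle the function with its context instead of threading each
      # piece through every call site.
      PythonFunction = namedtuple("PythonFunction", [
          "command",         # the serialized function (assumed field)
          "envVars",
          "pythonIncludes",
          "pythonExec",
      ])
      ```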
      
      ## How was this patch tested?
      
      By existing unit tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11342 from cloud-fan/python-clean.
  7. Feb 23, 2016
    • [SPARK-13329] [SQL] considering output for statistics of logical plan · c481bdf5
      Davies Liu authored
      The current implementation of statistics for UnaryNode does not consider the output (for example, Project may produce far fewer columns than its child); we should consider it to get a better estimate.
      
      We usually join with only a few columns from a Parquet table, so the size of the projected plan can be much smaller than the original Parquet files. Having a better size estimate helps us choose between a broadcast join and a sort merge join.
      
      After this PR, I saw a few queries choose broadcast join rather than sort merge join without tuning spark.sql.autoBroadcastJoinThreshold for every query, ending up with about 6-8X improvements in end-to-end time.
      
      We use the `defaultSize` of a DataType to estimate the size of a column. Currently, for DecimalType/StringType/BinaryType and UDTs, we over-estimate by far too much (4096 bytes), so this PR changes them to more reasonable values. Here are the new defaultSize values:
      
      DecimalType: 8 or 16 bytes, based on the precision
      StringType: 20 bytes
      BinaryType: 100 bytes
      UDT: default size of the underlying SQL type
      
      These numbers are not perfect (hard to have a perfect number for them), but should be better than 4096.
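      
      As a rough illustration of the impact (simple arithmetic based on the numbers above, not Spark code):
      ```
      # Estimated width of a (DecimalType, StringType, BinaryType) row:
      old_estimate = 4096 * 3       # 12288 bytes with the old defaults
      new_estimate = 16 + 20 + 100  # 136 bytes with the new defaults
      ```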
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11210 from davies/statics.
    • [SPARK-13429][MLLIB] Unify Logistic Regression convergence tolerance of ML & MLlib · 72427c3e
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      In order to provide better and more consistent results, let's change the default value of MLlib ```LogisticRegressionWithLBFGS```'s ```convergenceTol``` from ```1E-4``` to ```1E-6```, which will be equal to ML ```LogisticRegression```.
      cc dbtsai
      ## How was this patch tested?
      unit tests
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11299 from yanboliang/spark-13429.
  8. Feb 21, 2016
    • [SPARK-13410][SQL] Support unionAll for DataFrames with UDT columns. · 0f90f4e6
      Franklyn D'souza authored
      ## What changes were proposed in this pull request?
      
      This PR adds equality operators to UDT classes so that they can be correctly tested for dataType equality during union operations.
      
      This was previously causing ``AnalysisException: u"unresolved operator 'Union;"`` when trying to unionAll two DataFrames with UDT columns, as below.
      
      ```
      from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
      from pyspark.sql import types
      
      schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)])
      
      a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema)
      b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema)
      
      c = a.unionAll(b)
      ```
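      
      A minimal sketch of the equality idea (illustrative, not the actual patch): two UDT instances describe the same data type when they are instances of the same class.
      ```
      class MyUDT(object):  # hypothetical UDT
          def __eq__(self, other):
              return isinstance(other, self.__class__)

          def __ne__(self, other):
              return not self.__eq__(other)
      ```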
      
      ## How was this patch tested?
      
      Tested using two unit tests in sql/tests.py and the DataFrameSuite.
      
      Additional information here : https://issues.apache.org/jira/browse/SPARK-13410
      
      Author: Franklyn D'souza <franklynd@gmail.com>
      
      Closes #11279 from damnMeddlingKid/udt-union-all.
    • [SPARK-12799] Simplify various string output for expressions · d9efe63e
      Cheng Lian authored
      This PR introduces several major changes:
      
      1. Replacing `Expression.prettyString` with `Expression.sql`
      
         The `prettyString` method is mostly an internal, developer-facing facility for debugging purposes, and shouldn't be exposed to users.
      
      2. Using SQL-like representations as column names for selected fields that are not named expressions (back-ticks and double quotes should be removed)
      
         Before, we were using `prettyString` for column names when possible, and sometimes the resulting column names could be weird.  Here are several examples:
      
         Expression         | `prettyString` | `sql`      | Note
         ------------------ | -------------- | ---------- | ---------------
         `a && b`           | `a && b`       | `a AND b`  |
         `a.getField("f")`  | `a[f]`         | `a.f`      | `a` is a struct
      
      3. Adding trait `NonSQLExpression` extending from `Expression` for expressions that don't have a SQL representation (e.g. Scala UDF/UDAF and Java/Scala object expressions used for encoders)
      
         `NonSQLExpression.sql` may return an arbitrary user facing string representation of the expression.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #10757 from liancheng/spark-12799.simplify-expression-string-methods.
  9. Feb 16, 2016
    • Correct SparseVector.parse documentation · 827ed1c0
      Miles Yucht authored
      There's a small typo in the SparseVector.parse docstring: it says that the method returns a DenseVector rather than a SparseVector, which is incorrect.
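      
      For reference, a usage sketch (the string format shown is illustrative):
      ```
      from pyspark.mllib.linalg import SparseVector

      sv = SparseVector.parse("(4, [0, 1], [2.0, 3.0])")
      # sv is a SparseVector, not a DenseVector
      ```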
      
      Author: Miles Yucht <miles@databricks.com>
      
      Closes #11213 from mgyucht/fix-sparsevector-docs.
  10. Feb 13, 2016
    • [SPARK-13296][SQL] Move UserDefinedFunction into sql.expressions. · 354d4c24
      Reynold Xin authored
      This pull request has the following changes:
      
      1. Moved UserDefinedFunction into expressions package. This is more consistent with how we structure the packages for window functions and UDAFs.
      
      2. Moved UserDefinedPythonFunction into execution.python package, so we don't have a random private class in the top level sql package.
      
      3. Moved everything in execution/python.scala into the newly created execution.python package.
      
      Most of the diffs are just straight copy-paste.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11181 from rxin/SPARK-13296.
    • [SPARK-12363][MLLIB] Remove setRuns and fix PowerIterationClustering failed test · e3441e3f
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-12363
      
      This issue was pointed out by yanboliang. When `setRuns` is removed from PowerIterationClustering, one of the tests fails. I found that some `dstAttr`s of the normalized graph are not correct values but 0.0. Setting `TripletFields.All` in `mapTriplets` makes it work.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #10539 from viirya/fix-poweriter.