  1. Jun 20, 2017
    • [SPARK-20929][ML] LinearSVC should use its own threshold param · cc67bd57
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      LinearSVC should use its own threshold param, rather than the shared one, since it applies to rawPrediction instead of probability.  This PR changes the param in the Scala, Python and R APIs.
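The distinction between a threshold on a raw prediction and one on a probability can be illustrated with a small plain-Python sketch. This is not Spark code: `predict_from_raw` is a hypothetical helper, and the "label 1 when the margin exceeds the threshold" rule and the default of 0.0 are assumptions made for illustration.

```python
def predict_from_raw(raw_margin: float, threshold: float = 0.0) -> int:
    """Return label 1 when the raw margin exceeds the threshold, else 0."""
    return 1 if raw_margin > threshold else 0


# A raw margin is unbounded, so thresholds outside [0, 1] are meaningful,
# unlike a threshold that is compared against a probability.
margins = [-3.5, -0.2, 0.4, 7.0]
labels_default = [predict_from_raw(m) for m in margins]
labels_strict = [predict_from_raw(m, threshold=5.0) for m in margins]
labels_loose = [predict_from_raw(m, threshold=-1.0) for m in margins]
```

Because the margin can be any real number, a threshold such as 5.0 or -1.0 is a sensible setting here, which is why a parameter restricted to the probability range would not fit.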
      ## How was this patch tested?
      New unit test to make sure the threshold can be set to any Double value.
      Author: Joseph K. Bradley <>
      Closes #18151 from jkbradley/ml-2.2-linearsvc-cleanup.
  2. Jun 19, 2017
  3. Jun 15, 2017
    • [SPARK-20980][SQL] Rename `wholeFile` to `multiLine` for both CSV and JSON · 20514281
      Xiao Li authored
      ### What changes were proposed in this pull request?
      The current option name `wholeFile` is misleading for CSV users: it does not mean one record per file, since a single file can contain multiple records. We should therefore rename it; the proposed name is `multiLine`.
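The naming concern can be seen with Python's standard `csv` module, used here only as a stand-in for Spark's CSV reader. The sample data is made up for illustration: one "file" holds two records, and one of those records spans two physical lines.

```python
import csv
import io

# One "file" holding two records; the first record has a quoted field
# that spans two physical lines.
data = 'id,comment\n1,"first line\nsecond line"\n2,"single line"\n'

rows = list(csv.reader(io.StringIO(data)))
header, records = rows[0], rows[1:]
physical_lines = data.count("\n")
```

The file has more physical lines than records, yet it still contains more than one record, so "multi-line records" describes the situation better than "whole file".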
      ### How was this patch tested?
      Author: Xiao Li <>
      Closes #18202 from gatorsmile/renameCVSOption.
  4. Jun 09, 2017
    • [SPARK-21042][SQL] Document Dataset.union is resolution by position · b78e3849
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      Document Dataset.union is resolution by position, not by name, since this has been a confusing point for a lot of users.
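A plain-Python model of the two resolution strategies may make the confusion concrete. Both functions below are hypothetical helpers written for this illustration, not Spark APIs; rows are modelled as tuples and schemas as lists of column names.

```python
def union_by_position(left_cols, left_rows, right_cols, right_rows):
    """Append right rows unchanged; the result keeps the left column names."""
    if len(left_cols) != len(right_cols):
        raise ValueError("both sides need the same number of columns")
    return left_cols, left_rows + right_rows


def union_by_name(left_cols, left_rows, right_cols, right_rows):
    """Reorder each right row to match the left column names first."""
    order = [right_cols.index(name) for name in left_cols]
    reordered = [tuple(row[i] for i in order) for row in right_rows]
    return left_cols, left_rows + reordered


left_cols, left_rows = ["id", "name"], [(1, "a")]
right_cols, right_rows = ["name", "id"], [("b", 2)]

_, by_position = union_by_position(left_cols, left_rows, right_cols, right_rows)
_, by_name = union_by_name(left_cols, left_rows, right_cols, right_rows)
```

With positional resolution the second row ends up with `"b"` under `id`, which is the surprise the documentation change warns about; selecting the columns in the same order on both sides before the union avoids it.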
      ## How was this patch tested?
      N/A - doc only change.
      Author: Reynold Xin <>
      Closes #18256 from rxin/SPARK-21042.
  5. Jun 03, 2017
    • [SPARK-19732][SQL][PYSPARK] Add fill functions for nulls in bool fields of datasets · 6cbc61d1
      Ruben Berenguel Montoro authored
      ## What changes were proposed in this pull request?
      Allow fill/replace of NAs with booleans, in both Python and Scala.
      ## How was this patch tested?
      Unit tests, doctests
      This PR is original work from me and I license this work to the Spark project
      Author: Ruben Berenguel Montoro <>
      Author: Ruben Berenguel <>
      Closes #18164 from rberenguel/SPARK-19732-fillna-bools.
  6. May 31, 2017
    • [SPARK-19236][SQL][FOLLOW-UP] Added createOrReplaceGlobalTempView method · de934e67
      gatorsmile authored
      ### What changes were proposed in this pull request?
      This PR does the following tasks:
      - Added  since
      - Added the Python API
      - Added test cases
      ### How was this patch tested?
      Added test cases to both Scala and Python
      Author: gatorsmile <>
      Closes #18147 from gatorsmile/createOrReplaceGlobalTempView.
  7. May 30, 2017
  8. May 26, 2017
    • [SPARK-20844] Remove experimental from Structured Streaming APIs · d935e0a9
      Michael Armbrust authored
      Now that Structured Streaming has been out for several Spark releases and has large production use cases, the `Experimental` label is no longer appropriate. I've left `InterfaceStability.Evolving`, however, as I think we may make a few changes to the pluggable Source & Sink API in Spark 2.3.
      Author: Michael Armbrust <>
      Closes #18065 from marmbrus/streamingGA.
  9. May 25, 2017
  10. May 24, 2017
    • [SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape in LogisticRegressionModel · bc66a77b
      Bago Amirbekian authored
      ## What changes were proposed in this pull request?
      Fixed a TypeError with python3 and numpy 1.12.1. Numpy's `reshape` no longer takes floats as arguments as of 1.12. Also, python3 uses float division for `/`, so we should use `//` to ensure that `_dataWithBiasSize` doesn't get set to a float.
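The division behaviour behind this fix can be checked without numpy. The sketch below uses a hypothetical helper, `features_per_row`, and made-up sizes; it only shows that `/` yields a float under Python 3 while `//` keeps an int, which is what a shape argument needs.

```python
def features_per_row(total_size: int, num_rows: int) -> int:
    """Row width computed with floor division so the result stays an int."""
    return total_size // num_rows


total_size, num_rows = 12, 3
true_div = total_size / num_rows                     # float under Python 3
floor_div = features_per_row(total_size, num_rows)   # int
```

Even though both results are numerically equal here, only the floor-division result has an integer type.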
      ## How was this patch tested?
      Existing tests run using python3 and numpy 1.12.
      Author: Bago Amirbekian <>
      Closes #18081 from MrBago/BF-py3floatbug.
    • [SPARK-20631][FOLLOW-UP] Fix incorrect tests. · 1816eb3b
      zero323 authored
      ## What changes were proposed in this pull request?
      - Fix incorrect tests for `_check_thresholds`.
      - Move test to `ParamTests`.
      ## How was this patch tested?
      Unit tests.
      Author: zero323 <>
      Closes #18085 from zero323/SPARK-20631-FOLLOW-UP.
    • [SPARK-20764][ML][PYSPARK][FOLLOWUP] Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version · 9afcf127
      Peng authored
      ## What changes were proposed in this pull request?
      Add test cases for PR-18062
      ## How was this patch tested?
      The existing UT
      Author: Peng <>
      Closes #18068 from mpjlu/moreTest.
  11. May 23, 2017
    • [SPARK-20861][ML][PYTHON] Delegate looping over paramMaps to estimators · 9434280c
      Bago Amirbekian authored
      Estimators can take either a list of param maps or a dict of params. This change allows the CrossValidator and TrainValidationSplit Estimators to pass lists of param maps through to the underlying estimators so that those estimators can handle parallelization when appropriate (e.g. distributed hyperparameter tuning).
      Existing unit tests.
      Author: Bago Amirbekian <>
      Closes #18077 from MrBago/delegate_params.
  12. May 22, 2017
    • [SPARK-20764][ML][PYSPARK] Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version · cfca0113
      Peng authored
      ## What changes were proposed in this pull request?
      SPARK-20097 exposed degreesOfFreedom in LinearRegressionSummary and numInstances in GeneralizedLinearRegressionSummary. Python API should be updated to reflect these changes.
      ## How was this patch tested?
      The existing UT
      Author: Peng <>
      Closes #18062 from mpjlu/spark-20764.
  13. May 21, 2017
  14. May 15, 2017
  15. May 12, 2017
    • [SPARK-20639][SQL] Add single argument support for to_timestamp in SQL with documentation improvement · 720708cc
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      This PR proposes three things as below:
      - Use casting rules to a timestamp in `to_timestamp` by default (it was `yyyy-MM-dd HH:mm:ss`).
      - Support a single argument for `to_timestamp`, similar to the APIs in other languages.
        For example, the call below works:
        import org.apache.spark.sql.functions._
        Seq("2016-12-31 00:12:00.00").toDF("a").select(to_timestamp(col("a"))).show()
        |to_timestamp(`a`, 'yyyy-MM-dd HH:mm:ss')|
        |                     2016-12-31 00:12:00|
        whereas the same call does not work in SQL.
        Before:
        spark-sql> SELECT to_timestamp('2016-12-31 00:12:00');
        Error in query: Invalid number of arguments for function to_timestamp; line 1 pos 7
        After:
        spark-sql> SELECT to_timestamp('2016-12-31 00:12:00');
        2016-12-31 00:12:00
      - Improve the related documentation for SQL function descriptions and other API descriptions accordingly.
        Before:
        spark-sql> DESCRIBE FUNCTION extended to_date;
        Usage: to_date(date_str, fmt) - Parses the `left` expression with the `fmt` expression. Returns null with invalid input.
        Extended Usage:
              > SELECT to_date('2016-12-31', 'yyyy-MM-dd');
        spark-sql> DESCRIBE FUNCTION extended to_timestamp;
        Usage: to_timestamp(timestamp, fmt) - Parses the `left` expression with the `format` expression to a timestamp. Returns null with invalid input.
        Extended Usage:
              > SELECT to_timestamp('2016-12-31', 'yyyy-MM-dd');
               2016-12-31 00:00:00.0
        After:
        spark-sql> DESCRIBE FUNCTION extended to_date;
            to_date(date_str[, fmt]) - Parses the `date_str` expression with the `fmt` expression to
              a date. Returns null with invalid input. By default, it follows casting rules to a date if
              the `fmt` is omitted.
        Extended Usage:
              > SELECT to_date('2009-07-30 04:17:52');
              > SELECT to_date('2016-12-31', 'yyyy-MM-dd');
        spark-sql> DESCRIBE FUNCTION extended to_timestamp;
            to_timestamp(timestamp[, fmt]) - Parses the `timestamp` expression with the `fmt` expression to
              a timestamp. Returns null with invalid input. By default, it follows casting rules to
              a timestamp if the `fmt` is omitted.
        Extended Usage:
              > SELECT to_timestamp('2016-12-31 00:12:00');
               2016-12-31 00:12:00
              > SELECT to_timestamp('2016-12-31', 'yyyy-MM-dd');
               2016-12-31 00:00:00
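The documented contract (an optional format argument, and null on invalid input) can be modelled in plain Python. This is only an analogy under stated assumptions: `parse_timestamp` is a hypothetical helper, it takes Python `strptime` directives rather than Spark's `yyyy-MM-dd` style patterns, and `datetime.fromisoformat` merely stands in for Spark's casting rules without reproducing them.

```python
from datetime import datetime


def parse_timestamp(text, fmt=None):
    """Parse text to a datetime; return None when the input is invalid.

    With no format, fall back to ISO-style parsing (a stand-in for
    'casting rules'); otherwise parse with the given strptime format.
    """
    try:
        if fmt is None:
            return datetime.fromisoformat(text)
        return datetime.strptime(text, fmt)
    except ValueError:
        return None


single_arg = parse_timestamp("2016-12-31 00:12:00")
with_format = parse_timestamp("2016-12-31", "%Y-%m-%d")
invalid = parse_timestamp("not a timestamp")
```

The single-argument call and the two-argument call both succeed, while malformed input yields `None` instead of raising, mirroring the "Returns null with invalid input" wording.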
      ## How was this patch tested?
      Added tests in `datetime.sql`.
      Author: hyukjinkwon <>
      Closes #17901 from HyukjinKwon/to_timestamp_arg.
  16. May 11, 2017
  17. May 10, 2017
    • [SPARK-20685] Fix BatchPythonEvaluation bug in case of single UDF w/ repeated arg. · 8ddbc431
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      There's a latent corner-case bug in PySpark UDF evaluation where executing a `BatchPythonEvaluation` with a single multi-argument UDF where _at least one argument value is repeated_ will crash at execution with a confusing error.
      This problem was introduced in #12057: the code there has a fast path for handling a "batch UDF evaluation consisting of a single Python UDF", but that branch incorrectly assumes that a single UDF won't have repeated arguments and therefore skips the code for unpacking arguments from the input row (whose schema may not necessarily match the UDF inputs due to de-duplication of repeated arguments which occurred in the JVM before sending UDF inputs to Python).
      The fix here is simply to remove this special-casing: it turns out that the code in the "multiple UDFs" branch just so happens to work for the single-UDF case because Python treats `(x)` as equivalent to `x`, not as a single-argument tuple.
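Both halves of this explanation can be reproduced in plain Python. The sketch is not the PySpark worker code: `unpack_args`, `row`, and `offsets` are hypothetical names, and the idea that a repeated column arrives once and is referenced by offset is a simplified model of the de-duplication described above.

```python
x = 5
parenthesized = (x)   # parentheses only group, so this is just x
one_tuple = (x,)      # the trailing comma is what makes a tuple


def unpack_args(row, offsets):
    """Rebuild a UDF's argument list from a de-duplicated input row."""
    return [row[i] for i in offsets]


# Hypothetical call f(a, a): the repeated column is sent only once,
# so both arguments point at position 0 of the input row.
row = (10,)
offsets = [0, 0]
args = unpack_args(row, offsets)
result = (lambda a, b: a + b)(*args)
```

Skipping the unpacking step would hand the function a one-element row where it expects two arguments, which is the kind of mismatch the removed fast path ran into.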
      ## How was this patch tested?
      New regression test in `pyspark.python.sql.tests` module (tested and confirmed that it fails before my fix).
      Author: Josh Rosen <>
      Closes #17927 from JoshRosen/SPARK-20685.
    • [SPARK-20689][PYSPARK] python doctest leaking bucketed table · af8b6cc8
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      It turns out the pyspark doctest calls saveAsTable without ever dropping the tables it creates. Since we have separate python tests for bucketed tables, and the doctest does not check results, there is really no need to run it, other than leaving it as an example in the generated doc.
      ## How was this patch tested?
      Author: Felix Cheung <>
      Closes #17932 from felixcheung/pytablecleanup.
    • [SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency should use values not Params · 804949c6
      zero323 authored
      ## What changes were proposed in this pull request?
      - Replace `getParam` calls with `getOrDefault` calls.
      - Fix exception message to avoid unintended `TypeError`.
      - Add unit tests
      ## How was this patch tested?
      New unit tests.
      Author: zero323 <>
      Closes #17891 from zero323/SPARK-20631.
  18. May 09, 2017
  19. May 07, 2017
    • [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucketBy · f53a8207
      zero323 authored
      ## What changes were proposed in this pull request?
      Adds Python wrappers for `DataFrameWriter.bucketBy` and `DataFrameWriter.sortBy` (SPARK-16931).
      ## How was this patch tested?
      Unit tests covering new feature.
      __Note__: Based on work of GregBowyer (f49b9a23468f7af32cb53d2b654272757c151725)
      CC HyukjinKwon
      Author: zero323 <>
      Author: Greg Bowyer <>
      Closes #17077 from zero323/SPARK-16931.
    • [SPARK-18777][PYTHON][SQL] Return UDF from udf.register · 63d90e7d
      zero323 authored
      ## What changes were proposed in this pull request?
      - Move udf wrapping code from `functions.udf` to `functions.UserDefinedFunction`.
      - Return wrapped udf from `catalog.registerFunction` and dependent methods.
      - Update docstrings in `catalog.registerFunction` and `SQLContext.registerFunction`.
      - Unit tests.
      ## How was this patch tested?
      - Existing unit tests and doctests.
      - Additional tests covering new feature.
      Author: zero323 <>
      Closes #17831 from zero323/SPARK-18777.
  20. May 03, 2017
    • [SPARK-20584][PYSPARK][SQL] Python generic hint support · 02bbe731
      zero323 authored
      ## What changes were proposed in this pull request?
      Adds `hint` method to PySpark `DataFrame`.
      ## How was this patch tested?
      Unit tests, doctests.
      Author: zero323 <>
      Closes #17850 from zero323/SPARK-20584.
    • [SPARK-16957][MLLIB] Use midpoints for split values. · 7f96f2d7
      Yan Facai (颜发才) authored
      ## What changes were proposed in this pull request?
      Use midpoints for split values for now; they may be made weighted later.
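As a rough illustration of the idea, candidate thresholds for a continuous feature can be taken halfway between adjacent distinct values. The helper `midpoint_splits` below is hypothetical and deliberately simplified; it is not the MLlib implementation and ignores any sampling or binning.

```python
def midpoint_splits(values):
    """Return the midpoints between adjacent distinct feature values."""
    distinct = sorted(set(values))
    return [(lo + hi) / 2.0 for lo, hi in zip(distinct, distinct[1:])]


splits = midpoint_splits([1, 2, 4, 2])
no_splits = midpoint_splits([3, 3, 3])
```

A midpoint sits equally far from the values on either side of it, so it does not favour one side the way reusing an observed value as the threshold would.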
      ## How was this patch tested?
      + [x] add unit test.
      + [x] revise Split's unit test.
      Author: Yan Facai (颜发才) <>
      Author: 颜发才(Yan Facai) <>
      Closes #17556 from facaiy/ENH/decision_tree_overflow_and_precision_in_aggregation.
    • [SPARK-6227][MLLIB][PYSPARK] Implement PySpark wrappers for SVD and PCA (v2) · db2fb84b
      MechCoder authored
      Add PCA and SVD to PySpark's wrappers for `RowMatrix` and `IndexedRowMatrix` (SVD only).
      Based on #7963, updated.
      ## How was this patch tested?
      New doc tests and unit tests. Ran all examples locally.
      Author: MechCoder <>
      Author: Nick Pentreath <>
      Closes #17621 from MLnick/SPARK-6227-pyspark-svd-pca.
  21. May 02, 2017
  22. May 01, 2017
    • [SPARK-20290][MINOR][PYTHON][SQL] Add PySpark wrapper for eqNullSafe · f0169a1c
      zero323 authored
      ## What changes were proposed in this pull request?
      Adds Python bindings for `Column.eqNullSafe`
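For readers unfamiliar with null-safe equality, the difference from ordinary SQL equality can be modelled in plain Python, with `None` playing the role of SQL null. Both functions are hypothetical helpers for illustration and do not call Spark.

```python
def sql_equals(a, b):
    """Ordinary SQL '=': the result is unknown (None) if either side is null."""
    if a is None or b is None:
        return None
    return a == b


def eq_null_safe(a, b):
    """Null-safe equality: two nulls are equal, null versus a value is not."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


pairs = [(1, 1), (1, 2), (None, 1), (None, None)]
plain = [sql_equals(a, b) for a, b in pairs]
null_safe = [eq_null_safe(a, b) for a, b in pairs]
```

The null-safe variant always returns a definite boolean, which is what makes it useful in join conditions and filters over nullable columns.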
      ## How was this patch tested?
      Manual tests, existing unit tests, doc build.
      Author: zero323 <>
      Closes #17605 from zero323/SPARK-20290.
  23. Apr 30, 2017
  24. Apr 29, 2017
    • [SPARK-20442][PYTHON][DOCS] Fill up documentations for functions in Column API in PySpark · d228cd0b
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      This PR proposes to fill up the documentation with examples for `bitwiseOR`, `bitwiseAND`, `bitwiseXOR`, `contains`, `asc` and `desc` in the `Column` API.
      Also, this PR fixes minor typos in the documentation and matches some of the contents between Scala doc and Python doc.
      Lastly, this PR suggests using `spark` rather than `sc` in doc tests in `Column` for Python documentation.
      ## How was this patch tested?
      Doc tests were added and manually tested with the commands below:
      `./python/ --module pyspark-sql`
      `./python/ --module pyspark-sql --python-executable python3`
      Output was checked via `make html` under `./python/docs`. The snapshots will be left as comments on the code.
      Author: hyukjinkwon <>
      Closes #17737 from HyukjinKwon/SPARK-20442.
  25. Apr 27, 2017
    • [SPARK-20425][SQL] Support a vertical display mode for · b4724db1
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR adds a new display mode for `` to print output rows vertically (one line per column value). In the current master, readability is poor when printing a Dataset with many columns, for example:
      scala> val df = spark.range(100).selectExpr((0 until 100).map(i => s"rand() AS c$i"): _*)
      scala>, 0)
      |c0                |c1                |c2                |c3                 |c4                |c5                |c6                 |c7                |c8                |c9                |c10               |c11                |c12               |c13               |c14               |c15                |c16                |c17                |c18               |c19               |c20                |c21               |c22                |c23               |c24                |c25                |c26                |c27                 |c28                |c29               |c30                |c31                 |c32               |c33               |c34                |c35                |c36                |c37               |c38               |c39                |c40               |c41               |c42                |c43                |c44                |c45               |c46                 |c47                 |c48                |c49                |c50                |c51                |c52                |c53                |c54                 |c55                |c56                |c57                |c58                |c59               |c60               |c61                |c62                |c63               |c64                |c65               |c66               |c67              |c68                |c69                |c70               |c71                |c72               |c73                |c74                |c75                |c76               |c77                |c78               |c79                |c80                |c81                |c82                |c83                |c84                |c85                |c86                |c87               |c88                |c89                |c90               |c91               |c92               |c93                |c94               |c95                |c96               |c97                |c98                |c99                |
      |0.6306087152476858|0.9174349686288383|0.5511324165035159|0.3320844128641819 |0.7738486877101489|0.2154915886962553|0.4754997600674299 |0.922780639280355 |0.7136894772661909|0.2277580838165979|0.5926874459847249|0.40311408392226633|0.467830264333843 |0.8330466896984213|0.1893258482389527|0.6320849515511165 |0.7530911056912044 |0.06700254871955424|0.370528597355559 |0.2755437445193154|0.23704391110980128|0.8067400174905822|0.13597793616251852|0.1708888820162453|0.01672725007605702|0.983118121881555  |0.25040195628629924|0.060537253723083384|0.20000530582637488|0.3400572407133511|0.9375689433322597 |0.057039316954370256|0.8053269714347623|0.5247817572228813|0.28419308820527944|0.9798908885194533 |0.31805988175678146|0.7034448027077574|0.5400575751346084|0.25336322371116216|0.9361634546853429|0.6118681368289798|0.6295081549153907 |0.13417468943957422|0.41617137072255794|0.7267230869252035|0.023792726137561115|0.5776157058356362  |0.04884204913195467|0.26728716103441275|0.646680370807925  |0.9782712690657244 |0.16434031314818154|0.20985522381321275|0.24739842475440077 |0.26335189682977334|0.19604841662422068|0.10742950487300651|0.20283136488091502|0.3100312319723688|0.886959006630645 |0.25157102269776244|0.34428775168410786|0.3500506818575777|0.3781142441912052 |0.8560316444386715|0.4737104888956839|0.735903101602148|0.02236617130529006|0.8769074095835873 |0.2001426662503153|0.5534032319238532 |0.7289496620397098|0.41955191309992157|0.9337700133660436 |0.34059094378451005|0.6419144759403556|0.08167496930341167|0.9947099478497635|0.48010888605366586|0.22314796858167918|0.17786598882331306|0.7351521162297135 |0.5422057170020095 |0.9521927872726792 |0.7459825486368227 |0.40907708791990627|0.8903819313311575|0.7251413746923618 |0.2977174938745204 |0.9515209660203555|0.9375968604766713|0.5087851740042524|0.4255237544908751 |0.8023768698664653|0.48003189618006703|0.1775841829745185|0.09050775629268382|0.6743909291138167 |0.2498415755876865 |
      |0.6866473844170801|0.4774360641212433|0.631696201340726 |0.33979113021468343|0.5663049010847052|0.7280190472258865|0.41370958502324806|0.9977433873622218|0.7671957338989901|0.2788708556233931|0.3355106391656496|0.88478952319287   |0.0333974166999893|0.6061744715862606|0.9617779139652359|0.22484954822341863|0.12770906021550898|0.5577789629508672 |0.2877649024640704|0.5566577406549361|0.9334933255278052 |0.9166720585157266|0.9689249324600591 |0.6367502457478598|0.7993572745928459 |0.23213222324218108|0.11928284054154137|0.6173493362456599  |0.0505122058694798 |0.9050228629552983|0.17112767911121707|0.47395598348370005 |0.5820498657823081|0.6241124650645072|0.18587258258036776|0.14987593554122225|0.3079446253653946 |0.9414228822867968|0.8362276265462365|0.9155655305576353 |0.5121559807153562|0.8963362656525707|0.22765970274318037|0.8177039187132797 |0.8190326635933787 |0.5256005177032199|0.8167598457269669  |0.030936807130934496|0.6733006585281015 |0.4208049626816347 |0.24603085738518538|0.22719198954208153|0.1622280557565281 |0.22217325159218038|0.014684419513742553|0.08987111517447499|0.2157764759142622 |0.8223414104088321 |0.4868624404491777 |0.4016191733088167|0.6169281906889263|0.15603611040433385|0.18289285085714913|0.9538408988218972|0.15037154865295121|0.5364516961987454|0.8077254873163031|0.712600478545675|0.7277477241003857 |0.19822912960348305|0.8305051199208777|0.18631911396566114|0.8909532487898342|0.3470409226992506 |0.35306974180587636|0.9107058868891469 |0.3321327206004986|0.48952332459050607|0.3630403307479373|0.5400046826340376 |0.5387377194310529 |0.42860539421837585|0.23214101630985995|0.21438968839794847|0.15370603160082352|0.04355605642700022|0.6096006707067466 |0.6933354157094292|0.06302172470859002|0.03174631856164001|0.664243581650643 |0.7833239547446621|0.696884598352864 |0.34626385933237736|0.9263495598791336|0.404818892816584  |0.2085585394755507|0.6150004897990109 |0.05391193524302473|0.28188484028329097|
      only showing top 2 rows
      `psql`, CLI for PostgreSQL, supports a vertical display mode for this case like:
      -RECORD 0-------------------
       c0  | 0.6306087152476858
       c1  | 0.9174349686288383
       c2  | 0.5511324165035159
       c98 | 0.05391193524302473
       c99 | 0.28188484028329097
      -RECORD 1-------------------
       c0  | 0.6866473844170801
       c1  | 0.4774360641212433
       c2  | 0.631696201340726
       c98 | 0.05391193524302473
       c99 | 0.28188484028329097
      only showing top 2 rows
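The vertical layout shown above can be approximated with a short plain-Python formatter. `show_vertical` and its `rule_width` parameter are hypothetical and chosen for this sketch; the function is not the Spark implementation and only mimics the general shape of the output.

```python
def show_vertical(columns, rows, rule_width=28):
    """Format rows one column value per line, with a header rule per record."""
    name_width = max(len(c) for c in columns)
    lines = []
    for i, row in enumerate(rows):
        # Record header, padded with dashes to a fixed width.
        lines.append(f"-RECORD {i}".ljust(rule_width, "-"))
        for name, value in zip(columns, row):
            lines.append(f" {name.ljust(name_width)} | {value}")
    return "\n".join(lines)


output = show_vertical(["c0", "c10"], [(0.5, 1.5), (2.5, 3.5)])
```

Each record takes one line per column, so the width of the output depends on the longest column name and value rather than on the number of columns.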
      ## How was this patch tested?
      Added tests in `DataFrameSuite`.
      Author: Takeshi Yamamuro <>
      Closes #17733 from maropu/SPARK-20425.
  26. Apr 26, 2017
    • [MINOR][ML] Fix some PySpark & SparkR flaky tests · dbb06c68
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Some PySpark & SparkR tests run with a tiny dataset and a tiny `maxIter`, which means they do not converge. I don't think checking intermediate results during iteration makes sense, and these intermediate results may be fragile and unstable, so we should switch to checking the converged result. We hit this issue in #17746 when we upgraded breeze to 0.13.1.
      ## How was this patch tested?
      Existing tests.
      Author: Yanbo Liang <>
      Closes #17757 from yanboliang/flaky-test.
  27. Apr 25, 2017
    • [SPARK-20449][ML] Upgrade breeze version to 0.13.1 · 67eef47a
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Upgrade breeze version to 0.13.1, which fixes some critical bugs in L-BFGS-B.
      ## How was this patch tested?
      Existing unit tests.
      Author: Yanbo Liang <>
      Closes #17746 from yanboliang/spark-20449.
  28. Apr 22, 2017
    • [SPARK-20132][DOCS] Add documentation for column string functions · 8765bc17
      Michael Patterson authored
      ## What changes were proposed in this pull request?
      Add docstrings for the Column functions `rlike`, `like`, `startswith`, and `endswith`. Pass these docstrings through `_bin_op`.
      There may be a better place to put the docstrings. I put them immediately above the Column class.
      ## How was this patch tested?
      I ran `make html` on my local computer to remake the documentation, and verified that the html pages were displaying the docstrings correctly. I tried running `dev-tests`, and the formatting tests passed. However, my mvn build didn't work, I think due to issues on my computer.
      These docstrings are my original work and free license.
      davies has done the most recent work reorganizing `_bin_op`
      Author: Michael Patterson <>
      Closes #17469 from map222/patterson-documentation.