  1. Mar 02, 2017
    • Felix Cheung's avatar
      [SPARK-18352][DOCS] wholeFile JSON update doc and programming guide · 8d6ef895
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Update the docs for R and the programming guide. Clarify the default behavior for all languages.
      
      ## How was this patch tested?
      
      manually
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17128 from felixcheung/jsonwholefiledoc.
      8d6ef895
    • Mark Grover's avatar
      [SPARK-19734][PYTHON][ML] Correct OneHotEncoder doc string to say dropLast · d2a87976
      Mark Grover authored
      ## What changes were proposed in this pull request?
      Updates the doc string to match up with the code
      i.e. say dropLast instead of includeFirst
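      For reference, a minimal PySpark sketch of the parameter the corrected docstring describes (column names here are illustrative):
      
      ```python
      from pyspark.ml.feature import OneHotEncoder
      
      # dropLast=True (the default) omits the last category so the encoded
      # vectors stay linearly independent; set it to False to keep every category.
      encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec",
                              dropLast=False)
      ```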
      
      ## How was this patch tested?
      Not much, since it's a doc-like change. Will run unit tests via Jenkins job.
      
      Author: Mark Grover <mark@apache.org>
      
      Closes #17127 from markgrover/spark_19734.
      d2a87976
    • Yun Ni's avatar
      [MINOR][ML] Fix comments in LSH Examples and Python API · 3bd8ddf7
      Yun Ni authored
      ## What changes were proposed in this pull request?
      Remove the `org.apache.spark.examples.` prefix from the run commands in the LSH example comments.
      Add a missing slash in one of the Python docs.
      
      ## How was this patch tested?
      Run examples using the commands in the comments.
      
      Author: Yun Ni <yunn@uber.com>
      
      Closes #17104 from Yunni/yunn_minor.
      3bd8ddf7
  2. Feb 28, 2017
  3. Feb 24, 2017
    • Jeff Zhang's avatar
      [SPARK-13330][PYSPARK] PYTHONHASHSEED is not propagated to python worker · 330c3e33
      Jeff Zhang authored
      ## What changes were proposed in this pull request?
      self.environment will be propagated to the executors, so PYTHONHASHSEED should be set there whenever the Python version is greater than 3.3.
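      A small standalone illustration (not part of the patch) of why the seed must be shared between driver and workers on Python 3, where string hashing is randomized per process:
      
      ```python
      import os
      import subprocess
      import sys
      
      code = "print(hash('spark'))"
      # Without a fixed PYTHONHASHSEED, each Python 3 process may print a different
      # value, so hash-partitioned shuffles on string keys would disagree across workers.
      print(subprocess.check_output([sys.executable, "-c", code]).decode().strip())
      # Pinning the seed makes the hash deterministic across processes.
      env = dict(os.environ, PYTHONHASHSEED="0")
      print(subprocess.check_output([sys.executable, "-c", code], env=env).decode().strip())
      ```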
      
      ## How was this patch tested?
      Manually tested it.
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #11211 from zjffdu/SPARK-13330.
      330c3e33
    • zero323's avatar
      [SPARK-19161][PYTHON][SQL] Improving UDF Docstrings · 4a5e38f5
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Replaces `UserDefinedFunction` object returned from `udf` with a function wrapper providing docstring and arguments information as proposed in [SPARK-19161](https://issues.apache.org/jira/browse/SPARK-19161).
      
      ### Backward incompatible changes:
      
      - `pyspark.sql.functions.udf` will return a `function` instead of a `UserDefinedFunction`. To keep the public API backward compatible, we use function attributes to mimic the `UserDefinedFunction` API (`func` and `returnType` attributes); a short sketch follows this list. This should have a minimal impact on user code.
      
        An alternative implementation could use dynamic sub-classing. This would ensure full backward compatibility but is more fragile in practice.
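      A minimal sketch of the compatibility shim described above, using the `func` and `returnType` attributes mentioned in this PR:
      
      ```python
      from pyspark.sql.functions import udf
      from pyspark.sql.types import IntegerType
      
      def add_one(x):
          """Adds one to the input value."""
          return x + 1 if x is not None else None
      
      add_one_udf = udf(add_one, IntegerType())
      # The returned wrapper keeps the original docstring ...
      print(add_one_udf.__doc__)
      # ... and mimics the old UserDefinedFunction API through attributes.
      print(add_one_udf.func is add_one, add_one_udf.returnType)
      ```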
      
      ### Limitations:
      
      Full functionality (retained docstring and argument list) is achieved only on recent Python versions. Legacy Python versions preserve only the docstring, not the argument list. This should be an acceptable trade-off between the achieved improvements and overall complexity.
      
      ### Possible impact on other tickets:
      
      This can affect [SPARK-18777](https://issues.apache.org/jira/browse/SPARK-18777).
      
      ## How was this patch tested?
      
      Existing unit tests to ensure backward compatibility, additional tests targeting proposed changes.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16534 from zero323/SPARK-19161.
      4a5e38f5
  4. Feb 23, 2017
    • Bryan Cutler's avatar
      [SPARK-14772][PYTHON][ML] Fixed Params.copy method to match Scala implementation · 2f69e3f6
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      Fixed the PySpark Params.copy method to behave like the Scala implementation.  The main issue was that it did not properly account for the _defaultParamMap, merging it into the explicitly created param map.
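      A minimal sketch of the behavior the fix targets (the estimator choice is illustrative):
      
      ```python
      from pyspark.sql import SparkSession
      from pyspark.ml.classification import LogisticRegression
      
      spark = SparkSession.builder.getOrCreate()
      lr = LogisticRegression(maxIter=5)      # explicitly set param
      copied = lr.copy({lr.regParam: 0.1})    # extra params applied on copy
      # Defaults, explicitly set params, and the extra map should all be carried
      # over, matching the Scala implementation.
      print(copied.getMaxIter(), copied.getRegParam())  # 5 0.1
      ```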
      
      ## How was this patch tested?
      Added new unit test to verify the copy method behaves correctly for copying uid, explicitly created params, and default params.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #16772 from BryanCutler/pyspark-ml-param_copy-Scala_sync-SPARK-14772.
      2f69e3f6
    • Wenchen Fan's avatar
      [SPARK-19706][PYSPARK] add Column.contains in pyspark · 4fa4cf1d
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      To be consistent with the Scala API, we should also add `contains` to `Column` in PySpark.
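      A minimal usage sketch (the data is illustrative):
      
      ```python
      from pyspark.sql import SparkSession
      
      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([("Alice",), ("Bob",)], ["name"])
      # Column.contains mirrors the Scala API: keep rows whose name contains "li".
      df.filter(df.name.contains("li")).show()
      ```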
      
      ## How was this patch tested?
      
      updated unit test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #17036 from cloud-fan/pyspark.
      4fa4cf1d
    • Takeshi Yamamuro's avatar
      [SPARK-18699][SQL] Put malformed tokens into a new field when parsing CSV data · 09ed6e77
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR adds logic to put malformed tokens into a new field when parsing CSV data in permissive mode. In the current master, if the CSV parser hits such malformed tokens, it throws the exception below (and the job fails):
      ```
      Caused by: java.lang.IllegalArgumentException
      	at java.sql.Date.valueOf(Date.java:143)
      	at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
      	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272)
      	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
      	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
      	at scala.util.Try.getOrElse(Try.scala:79)
      	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269)
      	at
      ```
      When users load large CSV-formatted data, this job failure is confusing. So, this fix sets the original columns to NULL and puts the malformed tokens in a new field.
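      A hedged PySpark sketch of how the new behavior is typically used; the path is a placeholder and the `columnNameOfCorruptRecord` option name is assumed to match the JSON reader's option:
      
      ```python
      from pyspark.sql import SparkSession
      from pyspark.sql.types import StructType, StructField, DateType, StringType
      
      spark = SparkSession.builder.getOrCreate()
      # The schema carries an extra string column that receives the raw malformed row.
      schema = StructType([
          StructField("d", DateType(), True),
          StructField("_corrupt_record", StringType(), True),
      ])
      df = (spark.read
            .option("mode", "PERMISSIVE")
            .option("columnNameOfCorruptRecord", "_corrupt_record")  # assumed option name
            .schema(schema)
            .csv("/path/to/data.csv"))  # placeholder path
      # Rows that fail to parse get NULL in the data columns and the original
      # malformed line in _corrupt_record, instead of failing the whole job.
      ```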
      
      ## How was this patch tested?
      Added tests in `CSVSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #16928 from maropu/SPARK-18699-2.
      09ed6e77
    • Shixiong Zhu's avatar
      [SPARK-19497][SS] Implement streaming deduplication · 9bf4e2ba
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR adds a special streaming deduplication operator to support `dropDuplicates` with `aggregation` and watermark. It reuses the `dropDuplicates` API but creates a new logical plan `Deduplication` and a new physical plan `DeduplicationExec`. A short usage sketch follows the lists below.
      
      The following cases are supported:
      
      - one or multiple `dropDuplicates()` without aggregation (with or without watermark)
      - `dropDuplicates` before aggregation
      
      Not supported cases:
      
      - `dropDuplicates` after aggregation
      
      Breaking changes:
      - `dropDuplicates` without aggregation doesn't work with `complete` or `update` mode.
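      A minimal sketch of a supported case (the source and columns are illustrative):
      
      ```python
      from pyspark.sql import SparkSession
      
      spark = SparkSession.builder.getOrCreate()
      events = spark.readStream.format("rate").load()  # has `timestamp` and `value`
      # dropDuplicates without aggregation, bounded by a watermark so old
      # deduplication state can be evicted.
      deduped = (events
                 .withWatermark("timestamp", "10 minutes")
                 .dropDuplicates(["value", "timestamp"]))
      ```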
      
      ## How was this patch tested?
      
      The new unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16970 from zsxwing/dedup.
      9bf4e2ba
  5. Feb 22, 2017
    • Adam Budde's avatar
      [SPARK-19405][STREAMING] Support for cross-account Kinesis reads via STS · e4065376
      Adam Budde authored
      - Add dependency on aws-java-sdk-sts
      - Replace SerializableAWSCredentials with new SerializableCredentialsProvider interface
      - Make KinesisReceiver take SerializableCredentialsProvider as argument and
        pass credential provider to KCL
      - Add new implementations of KinesisUtils.createStream() that take STS
        arguments
      - Make JavaKinesisStreamSuite test the entire KinesisUtils Java API
      - Update KCL/AWS SDK dependencies to 1.7.x/1.11.x
      
      ## What changes were proposed in this pull request?
      
      [JIRA link with detailed description.](https://issues.apache.org/jira/browse/SPARK-19405)
      
      * Replace SerializableAWSCredentials with new SerializableKCLAuthProvider class that takes 5 optional config params for configuring AWS auth and returns the appropriate credential provider object
      * Add new public createStream() APIs for specifying these parameters in KinesisUtils
      
      ## How was this patch tested?
      
      * Manually tested using explicit keypair and instance profile to read data from Kinesis stream in separate account (difficult to write a test orchestrating creation and assumption of IAM roles across separate accounts)
      * Expanded JavaKinesisStreamSuite to test the entire Java API in KinesisUtils
      
      ## License acknowledgement
      This contribution is my original work and I license the work to the project under the project’s open source license.
      
      Author: Budde <budde@amazon.com>
      
      Closes #16744 from budde/master.
      e4065376
  6. Feb 17, 2017
  7. Feb 16, 2017
    • Nathan Howell's avatar
      [SPARK-18352][SQL] Support parsing multiline json files · 21fde57f
      Nathan Howell authored
      ## What changes were proposed in this pull request?
      
      If a new option `wholeFile` is set to `true` the JSON reader will parse each file (instead of a single line) as a value. This is done with Jackson streaming and it should be capable of parsing very large documents, assuming the row will fit in memory.
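      A minimal read-side sketch (the path is a placeholder):
      
      ```python
      from pyspark.sql import SparkSession
      
      spark = SparkSession.builder.getOrCreate()
      # With wholeFile enabled, each file is parsed as a single JSON value instead
      # of one JSON document per line.
      df = spark.read.option("wholeFile", True).json("/path/to/multiline.json")
      ```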
      
      Because the file is not buffered in memory, the corrupt record handling is also slightly different when `wholeFile` is enabled: the corrupt column will contain the filename instead of the literal JSON if there is a parsing failure. It would be easy to extend this to add the parser location (line, column and byte offsets) to the output if desired.
      
      These changes have allowed types other than `String` to be parsed. Support for `UTF8String` and `Text` has been added (alongside `String` and `InputFormat`), and these no longer require a conversion to `String` just for parsing.
      
      I've also included a few other changes that generate slightly better bytecode and (imo) make it more obvious when and where boxing is occurring in the parser. These are included as separate commits; let me know if they should be flattened into this PR or moved to a new one.
      
      ## How was this patch tested?
      
      New and existing unit tests. No performance or load tests have been run.
      
      Author: Nathan Howell <nhowell@godaddy.com>
      
      Closes #16386 from NathanHowell/SPARK-18352.
      21fde57f
  8. Feb 15, 2017
    • Yun Ni's avatar
      [SPARK-18080][ML][PYTHON] Python API & Examples for Locality Sensitive Hashing · 08c1972a
      Yun Ni authored
      ## What changes were proposed in this pull request?
      This pull request includes the Python API and examples for LSH. The API changes were based on yanboliang's PR #15768, resolving conflicts and following the API changes on the Scala side. The examples are consistent with the Scala examples of MinHashLSH and BucketedRandomProjectionLSH.
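      A short MinHashLSH sketch in the spirit of the added examples (the data is illustrative):
      
      ```python
      from pyspark.sql import SparkSession
      from pyspark.ml.feature import MinHashLSH
      from pyspark.ml.linalg import Vectors
      
      spark = SparkSession.builder.getOrCreate()
      data = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0])),
              (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]))]
      df = spark.createDataFrame(data, ["id", "features"])
      mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=3)
      model = mh.fit(df)
      model.transform(df).show()
      ```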
      
      ## How was this patch tested?
      API and examples are tested using spark-submit:
      `bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py`
      `bin/spark-submit examples/src/main/python/ml/bucketed_random_projection_lsh.py`
      
      User guide changes are generated and manually inspected:
      `SKIP_API=1 jekyll build`
      
      Author: Yun Ni <yunn@uber.com>
      Author: Yanbo Liang <ybliang8@gmail.com>
      Author: Yunni <Euler57721@gmail.com>
      
      Closes #16715 from Yunni/spark-18080.
      08c1972a
    • Yin Huai's avatar
      [SPARK-19604][TESTS] Log the start of every Python test · f6c3bba2
      Yin Huai authored
      ## What changes were proposed in this pull request?
      Right now, we only log at info level after we finish the tests of a Python test file. We should also log the start of a test, so that if a test is hanging we can tell which test file is running.
      
      ## How was this patch tested?
      This is a change for python tests.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #16935 from yhuai/SPARK-19604.
      f6c3bba2
    • Takuya UESHIN's avatar
      [SPARK-18937][SQL] Timezone support in CSV/JSON parsing · 865b2fd8
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up pr of #16308.
      
      This pr enables timezone support in CSV/JSON parsing.
      
      This introduces a `timeZone` option for the CSV/JSON datasources (the default value of the option is the session local timezone).
      
      The datasources use the `timeZone` option to format timestamp values when writing and to parse them when reading.
      Notice that when reading, if the `timestampFormat` includes timezone info, the timezone from the option will not be used because we should respect the timezone in the values.
      
      For example, if you have the timestamp `"2016-01-01 00:00:00"` in `GMT`, the values written with the default timezone option, which is `"GMT"` because the session local timezone is `"GMT"` here, are:
      
      ```scala
      scala> spark.conf.set("spark.sql.session.timeZone", "GMT")
      
      scala> val df = Seq(new java.sql.Timestamp(1451606400000L)).toDF("ts")
      df: org.apache.spark.sql.DataFrame = [ts: timestamp]
      
      scala> df.show()
      +-------------------+
      |ts                 |
      +-------------------+
      |2016-01-01 00:00:00|
      +-------------------+
      
      scala> df.write.json("/path/to/gmtjson")
      ```
      
      ```sh
      $ cat /path/to/gmtjson/part-*
      {"ts":"2016-01-01T00:00:00.000Z"}
      ```
      
      whereas with the option set to `"PST"`, they are:
      
      ```scala
      scala> df.write.option("timeZone", "PST").json("/path/to/pstjson")
      ```
      
      ```sh
      $ cat /path/to/pstjson/part-*
      {"ts":"2015-12-31T16:00:00.000-08:00"}
      ```
      
      We can properly read these files even if the timezone option is wrong because the timestamp values have timezone info:
      
      ```scala
      scala> val schema = new StructType().add("ts", TimestampType)
      schema: org.apache.spark.sql.types.StructType = StructType(StructField(ts,TimestampType,true))
      
      scala> spark.read.schema(schema).json("/path/to/gmtjson").show()
      +-------------------+
      |ts                 |
      +-------------------+
      |2016-01-01 00:00:00|
      +-------------------+
      
      scala> spark.read.schema(schema).option("timeZone", "PST").json("/path/to/gmtjson").show()
      +-------------------+
      |ts                 |
      +-------------------+
      |2016-01-01 00:00:00|
      +-------------------+
      ```
      
      And even if `timestampFormat` doesn't contain timezone info, we can properly read the values by setting the correct timezone option:
      
      ```scala
      scala> df.write.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("timeZone", "JST").json("/path/to/jstjson")
      ```
      
      ```sh
      $ cat /path/to/jstjson/part-*
      {"ts":"2016-01-01T09:00:00"}
      ```
      
      ```scala
      // wrong result
      scala> spark.read.schema(schema).option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").json("/path/to/jstjson").show()
      +-------------------+
      |ts                 |
      +-------------------+
      |2016-01-01 09:00:00|
      +-------------------+
      
      // correct result
      scala> spark.read.schema(schema).option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("timeZone", "JST").json("/path/to/jstjson").show()
      +-------------------+
      |ts                 |
      +-------------------+
      |2016-01-01 00:00:00|
      +-------------------+
      ```
      
      This PR also makes `JsonToStruct` and `StructToJson` extend `TimeZoneAwareExpression` so they can evaluate values with the timezone option.
      
      ## How was this patch tested?
      
      Existing tests and added some tests.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #16750 from ueshin/issues/SPARK-18937.
      865b2fd8
    • Felix Cheung's avatar
      [SPARK-19399][SPARKR] Add R coalesce API for DataFrame and Column · 671bc08e
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Add coalesce on DataFrame for reducing the number of partitions without a shuffle, and coalesce on Column
      
      ## How was this patch tested?
      
      manual, unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16739 from felixcheung/rcoalesce.
      671bc08e
    • zero323's avatar
      [SPARK-19160][PYTHON][SQL] Add udf decorator · c97f4e17
      zero323 authored
      ## What changes were proposed in this pull request?
      
      This PR adds `udf` decorator syntax as proposed in [SPARK-19160](https://issues.apache.org/jira/browse/SPARK-19160).
      
      This allows users to define UDF using simplified syntax:
      
      ```python
      from pyspark.sql.functions import udf
      from pyspark.sql.types import IntegerType
      
      @udf(IntegerType())
      def add_one(x):
          """Adds one"""
          if x is not None:
              return x + 1
      ```
      
      without the need to define a separate function and then wrap it with `udf`.
      
      ## How was this patch tested?
      
      Existing unit tests to ensure backward compatibility and additional unit tests covering new functionality.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16533 from zero323/SPARK-19160.
      c97f4e17
    • VinceShieh's avatar
      [SPARK-19590][PYSPARK][ML] Update the document for QuantileDiscretizer in pyspark · 6eca21ba
      VinceShieh authored
      ## What changes were proposed in this pull request?
      This PR documents the changes made to QuantileDiscretizer in PySpark by PR:
      https://github.com/apache/spark/pull/15428
      
      ## How was this patch tested?
      No test needed
      
      Signed-off-by: VinceShieh <vincent.xie@intel.com>
      
      Author: VinceShieh <vincent.xie@intel.com>
      
      Closes #16922 from VinceShieh/spark-19590.
      6eca21ba
  9. Feb 14, 2017
    • Sheamus K. Parkes's avatar
      [SPARK-18541][PYTHON] Add metadata parameter to pyspark.sql.Column.alias() · 7b64f7aa
      Sheamus K. Parkes authored
      ## What changes were proposed in this pull request?
      
      Add a `metadata` keyword parameter to `pyspark.sql.Column.alias()` to allow users to mix in metadata while manipulating `DataFrame`s in `pyspark`.  Without this, I believe it was necessary to pass back through `SparkSession.createDataFrame` each time a user wanted to manipulate `StructField.metadata` in `pyspark`.
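      A minimal sketch of the new keyword (the field name and metadata content are illustrative):
      
      ```python
      from pyspark.sql import SparkSession
      from pyspark.sql.functions import col
      
      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([(1, "a")], ["id", "label"])
      # Attach metadata while aliasing instead of round-tripping through createDataFrame.
      df2 = df.select(col("id").alias("id", metadata={"comment": "primary key"}))
      print(df2.schema["id"].metadata)
      ```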
      
      This pull request also improves consistency between the Scala and Python APIs (i.e. I did not add any functionality that was not already in the Scala API).
      
      Discussed ahead of time on JIRA with marmbrus
      
      ## How was this patch tested?
      
      Added unit tests (and doc tests).  Ran the pertinent tests manually.
      
      Author: Sheamus K. Parkes <shea.parkes@milliman.com>
      
      Closes #16094 from shea-parkes/pyspark-column-alias-metadata.
      7b64f7aa
    • zero323's avatar
      [SPARK-19162][PYTHON][SQL] UserDefinedFunction should validate that func is callable · e0eeb0f8
      zero323 authored
      ## What changes were proposed in this pull request?
      
      UDF constructor checks if `func` argument is callable and if it is not, fails fast instead of waiting for an action.
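      A minimal sketch of the fail-fast behavior (the exact exception type and message are my reading of the change, so treat them as an assumption):
      
      ```python
      from pyspark.sql.functions import udf
      
      try:
          bad = udf(42)  # not callable: should now raise at construction time ...
      except TypeError as exc:  # ... instead of failing later at action time
          print(exc)
      ```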
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16535 from zero323/SPARK-19162.
      e0eeb0f8
    • zero323's avatar
      [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataFrame.replace docstring · 9c4405e8
      zero323 authored
      ## What changes were proposed in this pull request?
      
      - Provides correct description of the semantics of a `dict` argument passed as `to_replace`.
      - Describes type requirements for collection arguments.
      - Describes behavior with `to_replace: List[T]` and `value: T` (see the sketch below)
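      A short sketch of the documented semantics (the data is illustrative):
      
      ```python
      from pyspark.sql import SparkSession
      
      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([("Alice", 10), ("Bob", 20)], ["name", "age"])
      # dict as to_replace: keys are replaced by their values, so no separate
      # `value` argument is needed.
      df.replace({"Alice": "Ann"}).show()
      # list to_replace with a scalar value: every listed value maps to that value.
      df.replace([10, 20], 0, subset=["age"]).show()
      ```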
      
      ## How was this patch tested?
      
      Manual testing, documentation build.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16792 from zero323/SPARK-19453.
      9c4405e8
  10. Feb 13, 2017
    • zero323's avatar
      [SPARK-19429][PYTHON][SQL] Support slice arguments in Column.__getitem__ · e02ac303
      zero323 authored
      ## What changes were proposed in this pull request?
      
      - Add support for `slice` arguments in `Column.__getitem__` (see the sketch below).
      - Remove obsolete `__getslice__` bindings.
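      A minimal sketch of the new slice support; my reading is that the slice bounds are forwarded to `Column.substr`, so treat that detail as an assumption:
      
      ```python
      from pyspark.sql import SparkSession
      from pyspark.sql.functions import col
      
      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([("abcdef",)], ["s"])
      # col("s")[1:3] behaves like col("s").substr(1, 3) under this change.
      df.select(col("s")[1:3].alias("prefix")).show()
      ```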
      
      ## How was this patch tested?
      
      Existing unit tests, additional tests covering `[]` with `slice`.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16771 from zero323/SPARK-19429.
      e02ac303
    • zero323's avatar
      [SPARK-19427][PYTHON][SQL] Support data type string as a returnType argument of UDF · ab88b241
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Add support for data type string as a return type argument of `UserDefinedFunction`:
      
      ```python
      from pyspark.sql.functions import udf
      
      f = udf(lambda x: x, "integer")
      f.returnType
      ## IntegerType
      ```
      
      ## How was this patch tested?
      
      Existing unit tests, additional unit tests covering new feature.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16769 from zero323/SPARK-19427.
      ab88b241
    • zero323's avatar
      [SPARK-19506][ML][PYTHON] Import warnings in pyspark.ml.util · 5e7cd332
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Add missing `warnings` import.
      
      ## How was this patch tested?
      
      Manual tests.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16846 from zero323/SPARK-19506.
      5e7cd332
  11. Feb 07, 2017
    • anabranch's avatar
      [SPARK-16609] Add to_date/to_timestamp with format functions · 7a7ce272
      anabranch authored
      ## What changes were proposed in this pull request?
      
      This pull request adds two new user facing functions:
      - `to_date` which accepts an expression and a format and returns a date.
      - `to_timestamp` which accepts an expression and a format and returns a timestamp.
      
      For example, given a date in the format `2016-21-05` (yyyy-dd-MM):
      
      ### Date Function
      *Previously*
      ```
      to_date(unix_timestamp(lit("2016-21-05"), "yyyy-dd-MM").cast("timestamp"))
      ```
      *Current*
      ```
      to_date(lit("2016-21-05"), "yyyy-dd-MM")
      ```
      
      ### Timestamp Function
      *Previously*
      ```
      unix_timestamp(lit("2016-21-05"), "yyyy-dd-MM").cast("timestamp")
      ```
      *Current*
      ```
      to_timestamp(lit("2016-21-05"), "yyyy-dd-MM")
      ```
      ### Tasks
      
      - [X] Add `to_date` to Scala Functions
      - [x] Add `to_date` to Python Functions
      - [x] Add `to_date` to SQL Functions
      - [X] Add `to_timestamp` to Scala Functions
      - [x] Add `to_timestamp` to Python Functions
      - [x] Add `to_timestamp` to SQL Functions
      - [x] Add function to R
      
      ## How was this patch tested?
      
      - [x] Add Functions to `DateFunctionsSuite`
      - Test new `ParseToTimestamp` Expression (*not necessary*)
      - Test new `ParseToDate` Expression (*not necessary*)
      - [x] Add test for R
      - [x] Add test for Python in test.py
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: anabranch <wac.chambers@gmail.com>
      Author: Bill Chambers <bill@databricks.com>
      Author: anabranch <bill@databricks.com>
      
      Closes #16138 from anabranch/SPARK-16609.
      7a7ce272
  12. Feb 06, 2017
  13. Feb 05, 2017
    • Zheng RuiFeng's avatar
      [SPARK-19421][ML][PYSPARK] Remove numClasses and numFeatures methods in LinearSVC · 317fa750
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Methods `numClasses` and `numFeatures` in LinearSVCModel are already available by inheriting from `JavaClassificationModel`,
      so we should not explicitly add them.
      
      ## How was this patch tested?
      existing tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #16727 from zhengruifeng/nits_in_linearSVC.
      317fa750
  14. Feb 02, 2017
  15. Feb 01, 2017
    • Zheng RuiFeng's avatar
      [SPARK-14352][SQL] approxQuantile should support multi columns · b0985764
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      
      1. Add multi-column support based on the current private API.
      2. Add multi-column support to PySpark (see the sketch below).
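      A minimal PySpark sketch of the multi-column form (the data is illustrative):
      
      ```python
      from pyspark.sql import SparkSession
      
      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)], ["a", "b"])
      # Passing a list of columns returns one list of quantiles per column.
      print(df.approxQuantile(["a", "b"], [0.5], 0.25))
      ```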
      ## How was this patch tested?
      
      unit tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      Author: Ruifeng Zheng <ruifengz@foxmail.com>
      
      Closes #12135 from zhengruifeng/quantile4multicols.
      b0985764
  16. Jan 31, 2017
    • zero323's avatar
      [SPARK-19163][PYTHON][SQL] Delay _judf initialization to the __call__ · 90638358
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Defer `UserDefinedFunction._judf` initialization to the first call. This prevents unintended `SparkSession` initialization.  This allows users to define and import UDFs without creating a context / session as a side effect.
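      A minimal sketch of the intended effect, assuming no SparkSession exists yet in the process:
      
      ```python
      from pyspark.sql.functions import udf
      
      # Merely defining the UDF no longer spins up a SparkSession; the underlying
      # Java UDF is only materialized the first time the wrapper is used.
      strip_udf = udf(lambda s: s.strip() if s is not None else None)
      ```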
      
      [SPARK-19163](https://issues.apache.org/jira/browse/SPARK-19163)
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16536 from zero323/SPARK-19163.
      90638358
    • Bryan Cutler's avatar
      [SPARK-17161][PYSPARK][ML] Add PySpark-ML JavaWrapper convenience function to... · 57d70d26
      Bryan Cutler authored
      [SPARK-17161][PYSPARK][ML] Add PySpark-ML JavaWrapper convenience function to create Py4J JavaArrays
      
      ## What changes were proposed in this pull request?
      
      Adding a convenience function to the Python `JavaWrapper` so that it is easy to create a Py4J JavaArray compatible with current class constructors that take a Scala `Array` as input, removing the need for a Java/Python-friendly constructor.  The function takes a Java class as input, which Py4J uses to create the Java array of the given class.  As an example, `OneVsRest` has been updated to use this, and the alternate constructor is removed.
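      A rough usage sketch; the `_new_java_array` helper name and signature are my assumption based on this description, and a running SparkContext is required:
      
      ```python
      from pyspark import SparkContext
      from pyspark.ml.wrapper import JavaWrapper
      
      sc = SparkContext.getOrCreate()
      # Build a Py4J array of java.lang.String from a Python list, suitable for
      # constructors that expect a Scala/Java Array.
      java_strings = JavaWrapper._new_java_array(
          ["a", "b", "c"], sc._gateway.jvm.java.lang.String)
      ```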
      
      ## How was this patch tested?
      
      Added unit tests for the new convenience function and updated `OneVsRest` doctests which use this to persist the model.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #14725 from BryanCutler/pyspark-new_java_array-CountVectorizer-SPARK-17161.
      57d70d26
  17. Jan 30, 2017
    • zero323's avatar
      [SPARK-19403][PYTHON][SQL] Correct pyspark.sql.column.__all__ list. · 06fbc355
      zero323 authored
      ## What changes were proposed in this pull request?
      
      This removes from the `__all__` list class names that are not defined (visible) in `pyspark.sql.column`.
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16742 from zero323/SPARK-19403.
      06fbc355
  18. Jan 27, 2017
    • wm624@hotmail.com's avatar
      [SPARK-19336][ML][PYSPARK] LinearSVC Python API · bb1a1fe0
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      Add Python API for the newly added LinearSVC algorithm.
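      A minimal usage sketch (the training data is illustrative):
      
      ```python
      from pyspark.sql import SparkSession
      from pyspark.ml.classification import LinearSVC
      from pyspark.ml.linalg import Vectors
      
      spark = SparkSession.builder.getOrCreate()
      train = spark.createDataFrame(
          [(0.0, Vectors.dense(0.0, 1.0)), (1.0, Vectors.dense(1.0, 0.0))],
          ["label", "features"])
      svc = LinearSVC(maxIter=10, regParam=0.01)
      model = svc.fit(train)
      print(model.coefficients, model.intercept)
      ```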
      
      ## How was this patch tested?
      
      Add new doc string test.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #16694 from wangmiao1981/ser.
      bb1a1fe0
  19. Jan 25, 2017
    • Takeshi YAMAMURO's avatar
      [SPARK-18020][STREAMING][KINESIS] Checkpoint SHARD_END to finish reading closed shards · 256a3a80
      Takeshi YAMAMURO authored
      ## What changes were proposed in this pull request?
      This PR fixes an issue that occurred when resharding Kinesis streams: the resharding makes the KCL throw an exception because Spark does not checkpoint `SHARD_END` when it finishes reading closed shards in `KinesisRecordProcessor#shutdown`. This bug ultimately stops the subscription of new split (or merged) shards.
      
      ## How was this patch tested?
      Added a test in `KinesisStreamSuite` to check if it works well when splitting/merging shards.
      
      Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
      
      Closes #16213 from maropu/SPARK-18020.
      256a3a80
    • Holden Karau's avatar
      [SPARK-19064][PYSPARK] Fix pip installing of sub components · 965c82d8
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      Fix installation of the mllib and ml sub-components, and more eagerly clean up cache files during the test script & make-distribution.
      
      ## How was this patch tested?
      
      Updated sanity test script to import mllib and ml sub-components.
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #16465 from holdenk/SPARK-19064-fix-pip-install-sub-components.
      965c82d8
    • Marcelo Vanzin's avatar
      [SPARK-19307][PYSPARK] Make sure user conf is propagated to SparkContext. · 92afaa93
      Marcelo Vanzin authored
      The code was failing to propagate the user conf in the case where the
      JVM was already initialized, which happens when a user submits a
      python script via spark-submit.
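      A minimal sketch of the scenario (run via `spark-submit script.py`; the config key shown is illustrative):
      
      ```python
      from pyspark import SparkConf, SparkContext
      
      conf = SparkConf().set("spark.app.name", "conf-propagation-check")
      sc = SparkContext(conf=conf)
      # With the fix, settings made here reach the JVM-side SparkContext even
      # when the JVM was already launched by spark-submit.
      print(sc.getConf().get("spark.app.name"))
      ```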
      
      Tested with new unit test and by running a python script in a real cluster.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #16682 from vanzin/SPARK-19307.
      92afaa93
  20. Jan 22, 2017