  1. Dec 08, 2016
  2. Dec 07, 2016
  3. Dec 06, 2016
    • [SPARK-18652][PYTHON] Include the example data and third-party licenses in pyspark package. · 65f5331a
      Shuai Lin authored
      
      ## What changes were proposed in this pull request?
      
      Since we already include the python examples in the pyspark package, we should include the example data with it as well.
      
      We should also include the third-party licenses since we distribute their jars with the pyspark package.
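
      For illustration, the extra files could be declared through setuptools `package_data`. This is a minimal sketch under assumed package names and glob patterns, not the actual setup.py layout used in this PR:

      ```python
      # Illustrative only: assumed package names and globs, not the PR's actual diff.
      from setuptools import setup

      setup(
          name="pyspark",
          packages=["pyspark", "pyspark.data", "pyspark.licenses"],
          package_data={
              "pyspark.data": ["graphx/*", "mllib/*", "streaming/*"],
              "pyspark.licenses": ["*.txt"],
          },
      )
      ```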
      
      ## How was this patch tested?
      
      Manually tested with Python 2.7 and Python 3.4:
      ```sh
      $ ./build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Pmesos clean package
      $ cd python
      $ python setup.py sdist
      $ pip install  dist/pyspark-2.1.0.dev0.tar.gz
      
      $ ls -1 /usr/local/lib/python2.7/dist-packages/pyspark/data/
      graphx
      mllib
      streaming
      
      $ du -sh /usr/local/lib/python2.7/dist-packages/pyspark/data/
      600K    /usr/local/lib/python2.7/dist-packages/pyspark/data/
      
      $ ls -1  /usr/local/lib/python2.7/dist-packages/pyspark/licenses/|head -5
      LICENSE-AnchorJS.txt
      LICENSE-DPark.txt
      LICENSE-Mockito.txt
      LICENSE-SnapTree.txt
      LICENSE-antlr.txt
      ```
      
      Author: Shuai Lin <linshuai2012@gmail.com>
      
      Closes #16082 from lins05/include-data-in-pyspark-dist.
      
      (cherry picked from commit bd9a4a5a)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
  4. Dec 05, 2016
    • [SPARK-18657][SPARK-18668] Make StreamingQuery.id persists across restart and... · 1946854a
      Tathagata Das authored
      [SPARK-18657][SPARK-18668] Make StreamingQuery.id persist across restarts and do not auto-generate StreamingQuery.name
      
      Here are the major changes in this PR.
      - Added the ability to recover `StreamingQuery.id` from checkpoint location, by writing the id to `checkpointLoc/metadata`.
      - Added `StreamingQuery.runId` which is unique for every query started and does not persist across restarts. This is to identify each restart of a query separately (same as earlier behavior of `id`).
      - Removed auto-generation of `StreamingQuery.name`. The purpose of name was to have the ability to define an identifier across restarts, but since id is precisely that, there is no need for an auto-generated name. This means name becomes purely cosmetic, and is null by default.
      - Added `runId` to `StreamingQueryListener` events and `StreamingQueryProgress`.
      
      Implementation details
      - Renamed existing `StreamExecutionMetadata` to `OffsetSeqMetadata`, and moved it to the file `OffsetSeq.scala`, because that is what this metadata is tied to. Also did some refactoring to make the code cleaner (got rid of a lot of `.json` and `.getOrElse("{}")`).
      - Added the `id` as the new `StreamMetadata`.
      - When a StreamingQuery is created it gets or writes the `StreamMetadata` from `checkpointLoc/metadata`.
      - All internal logging in `StreamExecution` uses `(name, id, runId)` instead of just `name`
      
      TODO
      - [x] Test handling of name=null in json generation of StreamingQueryProgress
      - [x] Test handling of name=null in json generation of StreamingQueryListener events
      - [x] Test python API of runId
      
      Updated existing unit tests and added new unit tests.
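
      As a rough illustration of the intended semantics, here is a hypothetical PySpark session (`df` is an assumed streaming DataFrame and the paths are made up):

      ```python
      # Hypothetical sketch of the id/runId/name semantics described above.
      q1 = (df.writeStream
              .format("parquet")
              .option("path", "/tmp/out")
              .option("checkpointLocation", "/tmp/ckpt")
              .start())
      first_id, first_run_id = q1.id, q1.runId
      q1.stop()

      # Restart from the same checkpoint location.
      q2 = (df.writeStream
              .format("parquet")
              .option("path", "/tmp/out")
              .option("checkpointLocation", "/tmp/ckpt")
              .start())
      assert q2.id == first_id          # id is recovered from checkpointLoc/metadata
      assert q2.runId != first_run_id   # runId is unique to every (re)start
      assert q2.name is None            # name is no longer auto-generated
      ```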
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #16113 from tdas/SPARK-18657.
      
      (cherry picked from commit bb57bfe9)
      Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
    • [SPARK-18634][PYSPARK][SQL] Corruption and Correctness issues with exploding Python UDFs · fecd23d2
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      As reported in the Jira, there are some weird issues with exploding Python UDFs in SparkSQL.
      
      The following test code can reproduce it. Note: this code is reported in the Jira to return wrong results; however, when I tested it on the master branch, it throws an exception and so cannot return any result.
      
          >>> from pyspark.sql.functions import *
          >>> from pyspark.sql.types import *
          >>>
          >>> df = spark.range(10)
          >>>
          >>> def return_range(value):
          ...   return [(i, str(i)) for i in range(value - 1, value + 1)]
          ...
          >>> range_udf = udf(return_range, ArrayType(StructType([StructField("integer_val", IntegerType()),
          ...                                                     StructField("string_val", StringType())])))
          >>>
          >>> df.select("id", explode(range_udf(df.id))).show()
          Traceback (most recent call last):
            File "<stdin>", line 1, in <module>
            File "/spark/python/pyspark/sql/dataframe.py", line 318, in show
              print(self._jdf.showString(n, 20))
            File "/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
            File "/spark/python/pyspark/sql/utils.py", line 63, in deco
              return f(*a, **kw)
            File "/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o126.showString.: java.lang.AssertionError: assertion failed
              at scala.Predef$.assert(Predef.scala:156)
              at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:120)
              at org.apache.spark.sql.execution.GenerateExec.consume(GenerateExec.scala:57)
      
      The cause of this issue is that in `ExtractPythonUDFs` we insert a `BatchEvalPythonExec` to run the Python UDFs in batch. `BatchEvalPythonExec` adds extra outputs (e.g., `pythonUDF0`) to the original plan. In the above case, the original `Range` has only one output, `id`. After `ExtractPythonUDFs`, the added `BatchEvalPythonExec` has two outputs, `id` and `pythonUDF0`.

      Because the output of `GenerateExec` is fixed after the analysis phase, in the above case it is the combination of `id` (i.e., the output of `Range`) and `col`. But in the planning phase, we change `GenerateExec`'s child plan to the `BatchEvalPythonExec` with the additional output attributes.

      This causes no problem without whole-stage codegen, because during evaluation the additional attributes are projected out of the final output of `GenerateExec`.

      However, as `GenerateExec` now supports whole-stage codegen, the framework feeds all of the child plan's outputs into `GenerateExec`. Then, when consuming `GenerateExec`'s output data (i.e., calling `consume`), the number of output attributes differs from the number of output variables in whole-stage codegen.

      To solve this issue, this patch gives only the generator's output to `GenerateExec` after the analysis phase. `GenerateExec`'s output is the combination of its child plan's output and the generator's output, so when we change `GenerateExec`'s child, its output is still correct.
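
      With the fix in place, the exploded struct fields should be selectable as usual. A small usage sketch (not taken from the patch) continuing the reproduction above:

      ```python
      # Usage sketch assuming the fix: explode the UDF result and select the struct fields.
      from pyspark.sql.functions import explode

      exploded = (df.select("id", explode(range_udf(df.id)).alias("gen"))
                    .select("id", "gen.integer_val", "gen.string_val"))
      exploded.show()
      ```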
      
      ## How was this patch tested?
      
      Added test cases to PySpark.
      
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #16120 from viirya/fix-py-udf-with-generator.
      
      (cherry picked from commit 3ba69b64)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
    • [SPARK-18694][SS] Add StreamingQuery.explain and exception to Python and fix... · c6a4e3d9
      Shixiong Zhu authored
      [SPARK-18694][SS] Add StreamingQuery.explain and exception to Python and fix StreamingQueryException (branch 2.1)
      
      ## What changes were proposed in this pull request?
      
      Backport #16125 to branch 2.1.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16153 from zsxwing/SPARK-18694-2.1.
  5. Dec 02, 2016
  6. Dec 01, 2016
    • [SPARK-18274][ML][PYSPARK] Memory leak in PySpark JavaWrapper · 4c673c65
      Sandeep Singh authored
      
      ## What changes were proposed in this pull request?
      In `JavaWrapper`'s destructor, make the Java gateway dereference the wrapped Java object, using `SparkContext._active_spark_context._gateway.detach`.
      Also fix the parameter-copying bug by moving the `copy` method from `JavaModel` to `JavaParams`.
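
      The destructor change described above might look roughly like the following; this is a sketch, not necessarily the exact code that was merged:

      ```python
      # Sketch of the described destructor: detaching drops the Py4J-side reference
      # so the wrapped JVM object becomes eligible for garbage collection.
      from pyspark import SparkContext

      class JavaWrapper(object):
          def __init__(self, java_obj=None):
              self._java_obj = java_obj

          def __del__(self):
              sc = SparkContext._active_spark_context
              if sc is not None and self._java_obj is not None:
                  sc._gateway.detach(self._java_obj)
      ```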
      
      ## How was this patch tested?
      ```python
      import random, string
      from pyspark.ml.feature import StringIndexer
      
      l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) for _ in range(int(7e5))]  # 700000 random strings of 10 characters
      df = spark.createDataFrame(l, ['string'])
      
      for i in range(50):
          indexer = StringIndexer(inputCol='string', outputCol='index')
          indexer.fit(df)
      ```
      * Before: a strong reference to each StringIndexer was kept, causing GC issues, and the run halted midway.
      * After: garbage collection works as the objects are dereferenced, and the computation completes.
      * Mem footprint tested using profiler
      * Added a parameter copy related test which was failing before.
      
      Author: Sandeep Singh <sandeep@techaddict.me>
      Author: jkbradley <joseph.kurata.bradley@gmail.com>
      
      Closes #15843 from techaddict/SPARK-18274.
      
      (cherry picked from commit 78bb7f80)
      Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
  7. Nov 30, 2016
  8. Nov 29, 2016
    • [SPARK-15819][PYSPARK][ML] Add KMeanSummary in KMeans of PySpark · b95aad7c
      Jeff Zhang authored
      
      ## What changes were proposed in this pull request?
      
      Add a Python API for KMeansSummary.
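
      A hedged usage sketch of the new API (the input DataFrame `df` and the attribute shown are assumptions, based on the Scala KMeansSummary):

      ```python
      # Illustrative only: fit KMeans and read the training summary through the Python API.
      from pyspark.ml.clustering import KMeans

      model = KMeans(k=3, featuresCol="features").fit(df)
      if model.hasSummary:
          summary = model.summary        # KMeansSummary
          print(summary.clusterSizes)    # number of points assigned to each cluster
      ```
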
      ## How was this patch tested?
      
      unit test added
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #13557 from zjffdu/SPARK-15819.
      
      (cherry picked from commit 4c82ca86)
      Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
    • [SPARK-18319][ML][QA2.1] 2.1 QA: API: Experimental, DeveloperApi, final, sealed audit · eb0b3631
      Yuhao authored
      ## What changes were proposed in this pull request?
      Make a pass through the items marked as Experimental or DeveloperApi and see if any are stable enough to be unmarked. Also check for items marked final or sealed to see if they are stable enough to be opened up as APIs.
      
      Some discussions in the jira: https://issues.apache.org/jira/browse/SPARK-18319
      
      
      
      ## How was this patch tested?
      Existing unit tests.
      
      Author: Yuhao <yuhao.yang@intel.com>
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #15972 from hhbyyh/experimental21.
      
      (cherry picked from commit 9b670bca)
      Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    • [SPARK-18516][SQL] Split state and progress in streaming · 28b57c8a
      Tathagata Das authored
      
      This PR separates the status of a `StreamingQuery` into two separate APIs:
       - `status` - describes the status of a `StreamingQuery` at this moment, including what phase of processing is currently happening and if data is available.
       - `recentProgress` - an array of statistics about the most recent microbatches that have executed.
      
      Each progress entry contains the following information:
      ```
      {
        "id" : "2be8670a-fce1-4859-a530-748f29553bb6",
        "name" : "query-29",
        "timestamp" : 1479705392724,
        "inputRowsPerSecond" : 230.76923076923077,
        "processedRowsPerSecond" : 10.869565217391303,
        "durationMs" : {
          "triggerExecution" : 276,
          "queryPlanning" : 3,
          "getBatch" : 5,
          "getOffset" : 3,
          "addBatch" : 234,
          "walCommit" : 30
        },
        "currentWatermark" : 0,
        "stateOperators" : [ ],
        "sources" : [ {
          "description" : "KafkaSource[Subscribe[topic-14]]",
          "startOffset" : {
            "topic-14" : {
              "2" : 0,
              "4" : 1,
              "1" : 0,
              "3" : 0,
              "0" : 0
            }
          },
          "endOffset" : {
            "topic-14" : {
              "2" : 1,
              "4" : 2,
              "1" : 0,
              "3" : 0,
              "0" : 1
            }
          },
          "numRecords" : 3,
          "inputRowsPerSecond" : 230.76923076923077,
          "processedRowsPerSecond" : 10.869565217391303
        } ]
      }
      ```
      
      Additionally, in order to make it possible to correlate progress updates across restarts, we change the `id` field from an integer that is unique within the JVM to a `UUID` that is globally unique.
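
      From PySpark, the split looks roughly like this (a sketch; `df` is an assumed streaming DataFrame):

      ```python
      # Sketch of the two APIs described above, as seen from PySpark.
      query = df.writeStream.format("console").start()

      current = query.status           # what the query is doing right now
      history = query.recentProgress   # progress of the most recent micro-batches
      latest = query.lastProgress      # the most recent progress entry, if any
      ```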
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #15954 from marmbrus/queryProgress.
      
      (cherry picked from commit c3d08e2f)
      Signed-off-by: Michael Armbrust <michael@databricks.com>
  9. Nov 28, 2016
  10. Nov 26, 2016
  11. Nov 22, 2016
  12. Nov 21, 2016
  13. Nov 19, 2016
    • [SPARK-18445][BUILD][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note... · 4b396a65
      hyukjinkwon authored
      [SPARK-18445][BUILD][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note that`/`'''Note:'''` across Scala/Java API documentation
      
      It seems that, in Scala/Java, the following markers are used inconsistently for notes:
      
      - `Note:`
      - `NOTE:`
      - `Note that`
      - `'''Note:'''`
      - `note`
      
      This PR proposes to fix those to `note` to be consistent.
      
      **Before**
      
      - Scala
        ![2016-11-17 6 16 39](https://cloud.githubusercontent.com/assets/6477701/20383180/1a7aed8c-acf2-11e6-9611-5eaf6d52c2e0.png)
      
      - Java
        ![2016-11-17 6 14 41](https://cloud.githubusercontent.com/assets/6477701/20383096/c8ffc680-acf1-11e6-914a-33460bf1401d.png)
      
      **After**
      
      - Scala
        ![2016-11-17 6 16 44](https://cloud.githubusercontent.com/assets/6477701/20383167/09940490-acf2-11e6-937a-0d5e1dc2cadf.png)
      
      - Java
        ![2016-11-17 6 13 39](https://cloud.githubusercontent.com/assets/6477701/20383132/e7c2a57e-acf1-11e6-9c47-b849674d4d88.png)
      
      The notes were found via
      
      ```bash
      grep -r "NOTE: " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// NOTE: " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
      -e 'org.apache.spark.api.java.function' \ # note that this is a regular expression. So actual matches were mostly `org/apache/spark/api/java/functions ...`
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "Note that " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// Note that " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "Note: " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// Note: " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "'''Note:'''" . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// '''Note:''' " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      And then fixed one by one comparing with API documentation/access modifiers.
      
      After that, manually tested via `jekyll build`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15889 from HyukjinKwon/SPARK-18437.
      
      (cherry picked from commit d5b1d5fc)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
  14. Nov 17, 2016
  15. Nov 16, 2016
    • [SPARK-1267][SPARK-18129] Allow PySpark to be pip installed · 6a3cbbc0
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      This PR aims to provide a pip installable PySpark package. This does a bunch of work to copy the jars over and package them with the Python code (to prevent challenges from trying to use different versions of the Python code with different versions of the JAR). It does not currently publish to PyPI but that is the natural follow up (SPARK-18129).
      
      Done:
      - pip installable on conda [manual tested]
      - setup.py installed on a non-pip managed system (RHEL) with YARN [manual tested]
      - Automated testing of this (virtualenv)
      - packaging and signing with release-build*
      
      Possible follow up work:
      - release-build update to publish to PyPI (SPARK-18128)
      - figure out who owns the pyspark package name on prod PyPI (is it someone within the project, or should we ask PyPI, or should we choose a different name to publish with, like ApachePySpark?)
      - Windows support and/or testing (SPARK-18136)
      - investigate details of wheel caching and see if we can avoid cleaning the wheel cache during our test
      - consider how we want to number our dev/snapshot versions
      
      Explicitly out of scope:
      - Using pip installed PySpark to start a standalone cluster
      - Using pip installed PySpark for non-Python Spark programs
      
      *I've done some work to test release-build locally but as a non-committer I've just done local testing.
      ## How was this patch tested?
      
      Automated testing with virtualenv, manual testing with conda, a system wide install, and YARN integration.
      
      release-build changes were tested locally as a non-committer (no testing of uploading artifacts to Apache staging websites)
      
      Author: Holden Karau <holden@us.ibm.com>
      Author: Juliet Hougland <juliet@cloudera.com>
      Author: Juliet Hougland <not@myemail.com>
      
      Closes #15659 from holdenk/SPARK-1267-pip-install-pyspark.
    • [SPARK-18459][SPARK-18460][STRUCTUREDSTREAMING] Rename triggerId to batchId... · b86e962c
      Tathagata Das authored
      [SPARK-18459][SPARK-18460][STRUCTUREDSTREAMING] Rename triggerId to batchId and add triggerDetails to json in StreamingQueryStatus
      
      ## What changes were proposed in this pull request?
      
      SPARK-18459: triggerId seems like a number that should increase with each trigger, whether or not there is data in it. However, triggerId actually increases only when there is a batch of data in a trigger, so it is better to rename it to batchId.
      
      SPARK-18460: triggerDetails was missing from json representation. Fixed it.
      
      ## How was this patch tested?
      Updated existing unit tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #15895 from tdas/SPARK-18459.
      
      (cherry picked from commit 0048ce7c)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
  16. Nov 10, 2016
  17. Nov 09, 2016
    • [SPARK-17829][SQL] Stable format for offset log · b7d29256
      Tyson Condie authored
      ## What changes were proposed in this pull request?
      
      Currently we use Java serialization for the WAL that stores the offsets contained in each batch. This has two main issues:
      - It can break across Spark releases (though this is not the only thing preventing us from upgrading a running query).
      - It is unnecessarily opaque to the user.

      I'd propose we require offsets to provide a user-readable serialization and use that instead. JSON is probably a good option.
      ## How was this patch tested?
      
      Tests were added for KafkaSourceOffset in [KafkaSourceOffsetSuite](external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceOffsetSuite.scala) and for LongOffset in [OffsetSuite](sql/core/src/test/scala/org/apache/spark/sql/streaming/OffsetSuite.scala)
      
      
      zsxwing marmbrus
      
      Author: Tyson Condie <tcondie@gmail.com>
      Author: Tyson Condie <tcondie@clash.local>
      
      Closes #15626 from tcondie/spark-8360.
      
      (cherry picked from commit 3f62e1b5)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
  18. Nov 08, 2016
    • [SPARK-18239][SPARKR] Gradient Boosted Tree for R · 98dd7ac7
      Felix Cheung authored
      
      ## What changes were proposed in this pull request?
      
      Gradient Boosted Tree in R.
      With a few minor improvements to RandomForest in R.
      
      Since this is relatively isolated I'd like to target this for branch-2.1
      
      ## How was this patch tested?
      
      manual tests, unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #15746 from felixcheung/rgbt.
      
      (cherry picked from commit 55964c15)
      Signed-off-by: Felix Cheung <felixcheung@apache.org>
  19. Nov 05, 2016
  20. Nov 04, 2016
  21. Nov 03, 2016
  22. Nov 01, 2016
    • [SPARK-18088][ML] Various ChiSqSelector cleanups · 91c33a0c
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      - Renamed kbest to numTopFeatures (see the usage sketch below)
      - Renamed alpha to fpr
      - Added missing Since annotations
      - Doc cleanups
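
      A hedged PySpark usage sketch with the renamed parameter (the column names and input DataFrame `df` are assumptions, not taken from the patch):

      ```python
      # Illustrative only: select the top 50 features by chi-squared test.
      from pyspark.ml.feature import ChiSqSelector

      selector = ChiSqSelector(numTopFeatures=50, featuresCol="features",
                               labelCol="label", outputCol="selectedFeatures")
      selected = selector.fit(df).transform(df)
      ```
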
      ## How was this patch tested?
      
      Added new standardized unit tests for spark.ml.
      Improved existing unit test coverage a bit.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #15647 from jkbradley/chisqselector-follow-ups.
    • [SPARK-17764][SQL] Add `to_json` supporting to convert nested struct column to JSON string · 01dd0083
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to add `to_json` function in contrast with `from_json` in Scala, Java and Python.
      
      It'd be useful if we could convert the same column from/to JSON. Also, some data sources do not support nested types; if we are forced to save a dataframe into such a data source, we might be able to work around that with this function.
      
      The usage is as below:
      
      ``` scala
      val df = Seq(Tuple1(Tuple1(1))).toDF("a")
      df.select(to_json($"a").as("json")).show()
      ```
      
      ``` bash
      +--------+
      |    json|
      +--------+
      |{"_1":1}|
      +--------+
      ```
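
      Since the function is also added to Python, a rough PySpark equivalent of the Scala example is sketched below (not taken from the patch):

      ```python
      # Rough PySpark counterpart of the Scala example above.
      from pyspark.sql.functions import to_json

      df = spark.range(1).selectExpr("named_struct('_1', 1) AS a")
      df.select(to_json(df.a).alias("json")).show()
      # Expected to print the same {"_1":1} row as the Scala example.
      ```
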
      ## How was this patch tested?
      
      Unit tests in `JsonFunctionsSuite` and `JsonExpressionsSuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15354 from HyukjinKwon/SPARK-17764.
  23. Oct 30, 2016
    • [SPARK-18110][PYTHON][ML] add missing parameter in Python for RandomForest... · 7c378692
      Felix Cheung authored
      [SPARK-18110][PYTHON][ML] add missing parameter in Python for RandomForest regression and classification
      
      ## What changes were proposed in this pull request?
      
      Add subsamplingRate to RandomForestClassifier and varianceCol to RandomForestRegressor in the Python API; a usage sketch follows below.
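
      A hedged sketch of the new parameters in use (the other arguments and the DataFrame column names are illustrative):

      ```python
      # Illustrative sketch of the two newly exposed Python parameters.
      from pyspark.ml.classification import RandomForestClassifier
      from pyspark.ml.regression import RandomForestRegressor

      rfc = RandomForestClassifier(subsamplingRate=0.8, labelCol="label", featuresCol="features")
      rfr = RandomForestRegressor(varianceCol="variance", labelCol="label", featuresCol="features")
      ```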
      
      ## How was this patch tested?
      
      manual tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #15638 from felixcheung/pyrandomforest.
  24. Oct 27, 2016
    • [SPARK-17219][ML] enhanced NaN value handling in Bucketizer · 0b076d4c
      VinceShieh authored
      ## What changes were proposed in this pull request?
      
      This PR is an enhancement of the PR with commit ID 57dc326b.
      NaN is a special value which is commonly treated as invalid. But we find that there are certain cases where NaN values are also valuable and thus need special handling. We provide users with 3 options for dealing with NaN values: reserve an extra bucket for them, remove them, or report an error, by setting handleNaN to "keep", "skip", or "error" (default) respectively.
      
      Before:

          val bucketizer: Bucketizer = new Bucketizer()
            .setInputCol("feature")
            .setOutputCol("result")
            .setSplits(splits)

      After:

          val bucketizer: Bucketizer = new Bucketizer()
            .setInputCol("feature")
            .setOutputCol("result")
            .setSplits(splits)
            .setHandleNaN("keep")
      
      ## How was this patch tested?
      Tests added in QuantileDiscretizerSuite, BucketizerSuite and DataFrameStatSuite
      
      Signed-off-by: VinceShieh <vincent.xie@intel.com>
      
      Author: VinceShieh <vincent.xie@intel.com>
      Author: Vincent Xie <vincent.xie@intel.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #15428 from VinceShieh/spark-17219_followup.
    • [SQL][DOC] updating doc for JSON source to link to jsonlines.org · 44c8bfda
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      API and programming guide doc changes for Scala, Python and R.
      
      ## How was this patch tested?
      
      manual test
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #15629 from felixcheung/jsondoc.
  25. Oct 21, 2016
    • [SPARK-17926][SQL][STREAMING] Added json for statuses · 7a531e30
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      StreamingQueryStatus exposed through StreamingQueryListener often needs to be recorded (similar to SparkListener events). This PR adds `.json` and `.prettyJson` to `StreamingQueryStatus`, `SourceStatus` and `SinkStatus`.
      
      ## How was this patch tested?
      New unit tests
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #15476 from tdas/SPARK-17926.
    • [SPARK-17960][PYSPARK][UPGRADE TO PY4J 0.10.4] · 595893d3
      Jagadeesan authored
      ## What changes were proposed in this pull request?
      
      1) Upgrade the Py4J version on the Java side
      2) Update the py4j src zip file we bundle with Spark
      
      ## How was this patch tested?
      
      Existing doctests & unit tests pass
      
      Author: Jagadeesan <as2@us.ibm.com>
      
      Closes #15514 from jagadeesanas2/SPARK-17960.
  26. Oct 18, 2016
    • [SPARK-17817] [PYSPARK] [FOLLOWUP] PySpark RDD Repartitioning Results in... · 1e35e969
      Liang-Chi Hsieh authored
      [SPARK-17817] [PYSPARK] [FOLLOWUP] PySpark RDD Repartitioning Results in Highly Skewed Partition Sizes
      
      ## What changes were proposed in this pull request?
      
      This change is a follow-up to #15389, which calls `_to_java_object_rdd()` to solve this issue. Due to concerns about the potentially expensive cost of that call, we can instead choose to decrease the batch size to solve this issue.
      
      Simple benchmark:
      
          import time
          num_partitions = 20000
          a = sc.parallelize(range(int(1e6)), 2)
          start = time.time()
          l = a.repartition(num_partitions).glom().map(len).collect()
          end = time.time()
          print(end - start)
      
      Before: 419.447577953
      _to_java_object_rdd(): 421.916361094
      decreasing the batch size: 423.712255955
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #15445 from viirya/repartition-batch-size.