  1. Oct 18, 2016
    • Liang-Chi Hsieh's avatar
      [SPARK-17817] [PYSPARK] [FOLLOWUP] PySpark RDD Repartitioning Results in... · 1e35e969
      Liang-Chi Hsieh authored
      [SPARK-17817] [PYSPARK] [FOLLOWUP] PySpark RDD Repartitioning Results in Highly Skewed Partition Sizes
      
      ## What changes were proposed in this pull request?
      
      This is a follow-up to #15389, which called `_to_java_object_rdd()` to solve the issue. Because that call can be expensive, we can instead decrease the batch size to address the issue.
      
      Simple benchmark:
      
          import time
          num_partitions = 20000
          a = sc.parallelize(range(int(1e6)), 2)
          start = time.time()
          l = a.repartition(num_partitions).glom().map(len).collect()
          end = time.time()
          print(end - start)
      
      Before: 419.447577953
      _to_java_object_rdd(): 421.916361094
      decreasing the batch size: 423.712255955
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #15445 from viirya/repartition-batch-size.
      1e35e969
  2. Oct 14, 2016
    • Srinath Shankar's avatar
      [SPARK-17946][PYSPARK] Python crossJoin API similar to Scala · 2d96d35d
      Srinath Shankar authored
      ## What changes were proposed in this pull request?
      
      Add a crossJoin function to the DataFrame API, similar to the one in Scala. Joins with no condition (cartesian products) must now be specified with the crossJoin API.
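      As a hedged illustration (not taken from the patch itself), the new API might be used like this; the DataFrames below are hypothetical:
      ```python
      # Hypothetical data; crossJoin produces the cartesian product explicitly.
      df1 = spark.createDataFrame([(1,), (2,)], ["a"])
      df2 = spark.createDataFrame([("x",), ("y",)], ["b"])

      df1.crossJoin(df2).show()   # 4 rows
      # A condition-less df1.join(df2) now fails analysis unless crossJoin is
      # used (or the cross-join conf is explicitly enabled).
      ```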
      
      ## How was this patch tested?
      Added Python tests to ensure that an AnalysisException is raised if a cartesian product is specified without crossJoin(), and that cartesian products execute when specified via crossJoin().
      
      
      Author: Srinath Shankar <srinath@databricks.com>
      
      Closes #15493 from srinathshankar/crosspython.
      2d96d35d
    • Jeff Zhang's avatar
      [SPARK-11775][PYSPARK][SQL] Allow PySpark to register Java UDF · f00df40c
      Jeff Zhang authored
      Currently PySpark can only call built-in Java UDFs, not custom Java UDFs. It would be better to allow that, for two benefits (a hedged sketch follows this list):
      * Leverage the power of rich third-party Java libraries.
      * Improve performance: with Python UDFs, Python daemons have to be started on the workers, which hurts performance.
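      A minimal sketch of how this could look from PySpark, assuming the API is exposed as `registerJavaFunction` and using a hypothetical UDF class name:
      ```python
      from pyspark.sql.types import IntegerType

      # "com.example.udf.StringLength" is a hypothetical Java class implementing
      # Spark's Java UDF interface; it must be on the application classpath.
      sqlContext.registerJavaFunction("javaStringLength",
                                      "com.example.udf.StringLength",
                                      IntegerType())
      sqlContext.sql("SELECT javaStringLength('spark')").show()
      ```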
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #9766 from zjffdu/SPARK-11775.
      f00df40c
    • Nick Pentreath's avatar
      [SPARK-16063][SQL] Add storageLevel to Dataset · 5aeb7384
      Nick Pentreath authored
      [SPARK-11905](https://issues.apache.org/jira/browse/SPARK-11905) added support for `persist`/`cache` for `Dataset`. However, there is no user-facing API to check if a `Dataset` is cached and, if so, what the storage level is. This PR adds `getStorageLevel` to `Dataset`, analogous to `RDD.getStorageLevel`.
      
      Updated `DatasetCacheSuite`.
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #13780 from MLnick/ds-storagelevel.
      
      Signed-off-by: Michael Armbrust <michael@databricks.com>
      5aeb7384
    • Peng's avatar
      [SPARK-17870][MLLIB][ML] Change statistic to pValue for SelectKBest and... · c8b612de
      Peng authored
      [SPARK-17870][MLLIB][ML] Change statistic to pValue for SelectKBest and SelectPercentile because of DoF difference
      
      ## What changes were proposed in this pull request?
      
      The feature selection method ChiSquareSelector selects features based on ChiSquareTestResult.statistic (the chi-square value), keeping the features with the largest chi-square values. However, the degrees of freedom (df) of the chi-square values returned by Statistics.chiSqTest(RDD) differ across features, and chi-square values with different df are not directly comparable, so they should not be used as the selection criterion.
      
      So we change the selection criterion from the statistic to the pValue for SelectKBest and SelectPercentile.
      
      ## How was this patch tested?
      Changed existing tests.
      
      Author: Peng <peng.meng@intel.com>
      
      Closes #15444 from mpjlu/chisqure-bug.
      c8b612de
    • Yanbo Liang's avatar
      [SPARK-15402][ML][PYSPARK] PySpark ml.evaluation should support save/load · 1db8feab
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Since ```ml.evaluation``` already supports save/load on the Scala side, supporting it on the Python side is straightforward.
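      A hedged sketch of the resulting Python usage (the path is hypothetical):
      ```python
      from pyspark.ml.evaluation import RegressionEvaluator

      evaluator = RegressionEvaluator(metricName="rmse")
      evaluator.save("/tmp/rmse_evaluator")            # hypothetical path
      restored = RegressionEvaluator.load("/tmp/rmse_evaluator")
      print(restored.getMetricName())                  # 'rmse'
      ```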
      
      ## How was this patch tested?
      Add python doctest.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #13194 from yanboliang/spark-15402.
      1db8feab
  3. Oct 13, 2016
    • Yanbo Liang's avatar
      [SPARK-15957][FOLLOW-UP][ML][PYSPARK] Add Python API for RFormula forceIndexLabel. · 44cbb61b
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Follow-up work of #13675, add Python API for ```RFormula forceIndexLabel```.
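      A hedged usage sketch (data and column names are hypothetical):
      ```python
      from pyspark.ml.feature import RFormula

      df = spark.createDataFrame([(1.0, 2.0, "a"), (0.0, 3.0, "b")], ["y", "x", "s"])
      # forceIndexLabel forces the (already numeric) label column to be indexed.
      rf = RFormula(formula="y ~ x + s", forceIndexLabel=True)
      rf.fit(df).transform(df).select("features", "label").show()
      ```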
      
      ## How was this patch tested?
      Unit test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15430 from yanboliang/spark-15957-python.
      44cbb61b
    • Tathagata Das's avatar
      [SPARK-17731][SQL][STREAMING] Metrics for structured streaming · 7106866c
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Metrics are needed for monitoring structured streaming apps. Here is the design doc for implementing the necessary metrics.
      https://docs.google.com/document/d/1NIdcGuR1B3WIe8t7VxLrt58TJB4DtipWEbj5I_mzJys/edit?usp=sharing
      
      Specifically, this PR adds the following public APIs changes.
      
      ### New APIs
      - `StreamingQuery.status` returns a `StreamingQueryStatus` object (renamed from `StreamingQueryInfo`, see later)
      
      - `StreamingQueryStatus` has the following important fields
        - inputRate - Current rate (rows/sec) at which data is being generated by all the sources
        - processingRate - Current rate (rows/sec) at which the query is processing data from all the sources
        - ~~outputRate~~ - *Does not work with wholestage codegen*
        - latency - Current average latency between the data being available in source and the sink writing the corresponding output
        - sourceStatuses: Array[SourceStatus] - Current statuses of the sources
        - sinkStatus: SinkStatus - Current status of the sink
        - triggerStatus - Low-level detailed status of the last completed/currently active trigger
          - latencies - getOffset, getBatch, full trigger, wal writes
          - timestamps - trigger start, finish, after getOffset, after getBatch
          - numRows - input, output, state total/updated rows for aggregations
      
      - `SourceStatus` has the following important fields
        - inputRate - Current rate (rows/sec) at which data is being generated by the source
        - processingRate - Current rate (rows/sec) at which the query is processing data from the source
        - triggerStatus - Low-level detailed status of the last completed/currently active trigger
      
      - Python API for `StreamingQuery.status()`
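      A hedged sketch of how the Python side might be exercised; the streaming DataFrame below is hypothetical and the field names follow the description above:
      ```python
      # Hypothetical streaming DataFrame `streaming_df`; the memory sink is used
      # only to keep the example self-contained.
      query = (streaming_df.writeStream
               .format("memory")
               .queryName("metrics_demo")
               .start())

      status = query.status   # described as StreamingQuery.status() in this PR
      print(status)           # input/processing rates, source and sink statuses
      ```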
      
      ### Breaking changes to existing APIs
      **Existing direct public facing APIs**
      - Deprecated the direct public-facing APIs `StreamingQuery.sourceStatuses` and `StreamingQuery.sinkStatus` in favour of `StreamingQuery.status.sourceStatuses/sinkStatus`.
        - They should be deprecated in branch-2.0 and removed in master.
      
      **Existing advanced listener APIs**
      - `StreamingQueryInfo` renamed to `StreamingQueryStatus` for consistency with `SourceStatus`, `SinkStatus`
         - Earlier, StreamingQueryInfo was used only in the advanced listener API; now it is also used in the direct public-facing API (StreamingQuery.status)
      
      - Field `queryInfo` in the listener events `QueryStarted`, `QueryProgress`, and `QueryTerminated` was renamed to `queryStatus`, with return type `StreamingQueryStatus`.
      
      - Field `offsetDesc` in `SourceStatus` was `Option[String]`; it was converted to `String`.
      
      - For `SourceStatus` and `SinkStatus`, the constructors were made private instead of private[sql] to make them more Java-safe. Instead, `private[sql]` `SourceStatus.apply()`/`SinkStatus.apply()` methods were added, which are harder to accidentally use from Java.
      
      ## How was this patch tested?
      
      Old and new unit tests.
      - Rate calculation and other internal logic of StreamMetrics tested by StreamMetricsSuite.
      - New info in statuses returned through StreamingQueryListener is tested in StreamingQueryListenerSuite.
      - New and old info returned through StreamingQuery.status is tested in StreamingQuerySuite.
      - Source-specific tests for making sure input rows are counted are in source-specific test suites.
      - Additional tests to test minor additions in LocalTableScanExec, StateStore, etc.
      
      Metrics were also manually tested using the Ganglia sink.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #15307 from tdas/SPARK-17731.
      7106866c
  4. Oct 12, 2016
    • WeichenXu's avatar
      [SPARK-17745][ML][PYSPARK] update NB python api - add weight col parameter · 0d4a6952
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Update the Python API for NaiveBayes: add a weight column (`weightCol`) parameter.
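      A hedged sketch of the updated Python API (data is hypothetical):
      ```python
      from pyspark.ml.classification import NaiveBayes
      from pyspark.ml.linalg import Vectors

      df = spark.createDataFrame(
          [(0.0, 1.0, Vectors.dense([0.0, 1.0])),
           (1.0, 2.0, Vectors.dense([1.0, 0.0]))],
          ["label", "weight", "features"])

      nb = NaiveBayes(smoothing=1.0, weightCol="weight")  # new weightCol parameter
      model = nb.fit(df)
      ```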
      
      ## How was this patch tested?
      
      doctests added.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #15406 from WeichenXu123/nb_python_update.
      0d4a6952
    • Reynold Xin's avatar
      [SPARK-17845] [SQL] More self-evident window function frame boundary API · 6f20a92c
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch improves the window function frame boundary API to make it more obvious to read and to use. The two high level changes are:
      
      1. Create Window.currentRow, Window.unboundedPreceding, Window.unboundedFollowing to indicate the special values in frame boundaries. These methods map to the special integral values so we are not breaking backward compatibility here. This change makes the frame boundaries more self-evident (instead of Long.MinValue, it becomes Window.unboundedPreceding).
      
      2. In Python, for any value less than or equal to JVM's Long.MinValue, treat it as Window.unboundedPreceding. For any value larger than or equal to JVM's Long.MaxValue, treat it as Window.unboundedFollowing. Before this change, if the user specifies any value that is less than Long.MinValue but not -sys.maxsize (e.g. -sys.maxsize + 1), the number we pass over to the JVM would overflow, resulting in a frame that does not make sense.
      
      Code example required to specify a frame before this patch:
      ```
      Window.rowsBetween(Long.MinValue, 0)
      ```
      
      While the above code should still work, the new way is more obvious to read:
      ```
      Window.rowsBetween(Window.unboundedPreceding, Window.currentRow)
      ```
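      For reference, a hedged PySpark counterpart using the new constants (the DataFrame and column names are hypothetical):
      ```python
      from pyspark.sql import Window
      from pyspark.sql import functions as F

      # Running sum per key, from the start of the partition to the current row.
      w = (Window.partitionBy("key")
           .orderBy("ts")
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))
      df.select("key", F.sum("value").over(w).alias("running_sum"))
      ```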
      
      ## How was this patch tested?
      - Updated DataFrameWindowSuite (for Scala/Java)
      - Updated test_window_functions_cumulative_sum (for Python)
      - Renamed DataFrameWindowSuite to DataFrameWindowFunctionsSuite to better reflect its purpose
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15438 from rxin/SPARK-17845.
      6f20a92c
    • Bijay Pathak's avatar
      [SPARK-14761][SQL] Reject invalid join methods when join columns are not... · 8880fd13
      Bijay Pathak authored
      [SPARK-14761][SQL] Reject invalid join methods when join columns are not specified in PySpark DataFrame join.
      
      ## What changes were proposed in this pull request?
      
      In PySpark, an invalid join type does not raise an error for the following join:
      ```df1.join(df2, how='not-a-valid-join-type')```
      
      The signature of join is:
      ```def join(self, other, on=None, how=None):```
      The existing code completely ignores the `how` parameter when `on` is `None`. This patch processes the arguments passed to join and passes them to the JVM Spark SQL Analyzer, which validates the join type.
      
      ## How was this patch tested?
      Used manual and existing test suites.
      
      Author: Bijay Pathak <bkpathak@mtu.edu>
      
      Closes #15409 from bkpathak/SPARK-14761.
      8880fd13
  5. Oct 11, 2016
    • Wenchen Fan's avatar
      [SPARK-17720][SQL] introduce static SQL conf · b9a14718
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      SQLConf is session-scoped and mutable. However, we do have the requirement for a static SQL conf, which is global and immutable, e.g. the `schemaStringThreshold` in `HiveExternalCatalog`, the flag to enable/disable hive support, the global temp view database in https://github.com/apache/spark/pull/14897.
      
      We have actually already implemented static SQL confs implicitly via `SparkConf`; this PR just makes them explicit and exposes them to users, so that they can see the config values via SQL commands or `SparkSession.conf`, and it forbids users from setting/unsetting static SQL confs.
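      A hedged sketch of the intended user-facing behaviour; `spark.sql.globalTempDatabase` is used here only as an assumed example of a static conf:
      ```python
      # Static confs can be read like any other conf...
      print(spark.conf.get("spark.sql.globalTempDatabase"))   # e.g. 'global_temp'

      # ...but attempting to set one at runtime is now rejected.
      try:
          spark.conf.set("spark.sql.globalTempDatabase", "other_db")
      except Exception as e:
          print(e)
      ```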
      
      ## How was this patch tested?
      
      new tests in SQLConfSuite
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15295 from cloud-fan/global-conf.
      b9a14718
    • Jeff Zhang's avatar
      [SPARK-17387][PYSPARK] Creating SparkContext() from python without spark-submit ignores user conf · 5b77e66d
      Jeff Zhang authored
      ## What changes were proposed in this pull request?
      
      The root cause of the user's SparkConf being ignored when launching the JVM is that SparkConf requires the JVM to be created first: https://github.com/apache/spark/blob/master/python/pyspark/conf.py#L106
      In this PR, the launch of the JVM is deferred until SparkContext is created, so that SparkConf can be passed to the JVM correctly.
      
      ## How was this patch tested?
      
      Use the example code in the description of SPARK-17387,
      ```
      $ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python
      Python 2.7.12 (default, Jul  1 2016, 15:12:24)
      [GCC 5.4.0 20160609] on linux2
      Type "help", "copyright", "credits" or "license" for more information.
      >>> from pyspark import SparkContext
      >>> from pyspark import SparkConf
      >>> conf = SparkConf().set("spark.driver.memory", "4g")
      >>> sc = SparkContext(conf=conf)
      ```
      And verify that spark.driver.memory is correctly picked up:
      
      ```
      ...op/ -Xmx4g org.apache.spark.deploy.SparkSubmit --conf spark.driver.memory=4g pyspark-shell
      ```
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #14959 from zjffdu/SPARK-17387.
      5b77e66d
    • Liang-Chi Hsieh's avatar
      [SPARK-17817][PYSPARK] PySpark RDD Repartitioning Results in Highly Skewed Partition Sizes · 07508bd0
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      Quoted from JIRA description:
      
      Calling repartition on a PySpark RDD to increase the number of partitions results in highly skewed partition sizes, with most having 0 rows. The repartition method should evenly spread out the rows across the partitions, and this behavior is correctly seen on the Scala side.
      
      Please reference the following code for a reproducible example of this issue:
      
          num_partitions = 20000
          a = sc.parallelize(range(int(1e6)), 2)  # start with 2 even partitions
          l = a.repartition(num_partitions).glom().map(len).collect()  # get length of each partition
          min(l), max(l), sum(l)/len(l), len(l)  # skewed!
      
      In Scala's `repartition` code, we distribute elements evenly across output partitions. However, the RDD from Python is serialized as single chunks of binary data, so the distribution fails. We need to convert the Python RDD to Java objects before repartitioning.
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #15389 from viirya/pyspark-rdd-repartition.
      07508bd0
    • Wenchen Fan's avatar
      [SPARK-17338][SQL][FOLLOW-UP] add global temp view · 7388ad94
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      address post hoc review comments for https://github.com/apache/spark/pull/14897
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15424 from cloud-fan/global-temp-view.
      7388ad94
    • Bryan Cutler's avatar
      [SPARK-17808][PYSPARK] Upgraded version of Pyrolite to 4.13 · 658c7147
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      Upgraded to a newer version of Pyrolite which supports serialization of a BinaryType StructField for PySpark.SQL
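      A hedged sketch of the case this enables (the schema is illustrative):
      ```python
      from pyspark.sql.types import StructType, StructField, BinaryType

      schema = StructType([StructField("payload", BinaryType())])
      df = spark.createDataFrame([(bytearray(b"\x00\x01\x02"),)], schema)
      print(df.first().payload)   # round-trips through Pyrolite without error
      ```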
      
      ## How was this patch tested?
      Added a unit test that fails with a ValueError when using the previous version of Pyrolite (4.9) with Python 3.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #15386 from BryanCutler/pyrolite-upgrade-SPARK-17808.
      658c7147
    • Reynold Xin's avatar
      [SPARK-17844] Simplify DataFrame API for defining frame boundaries in window functions · b515768f
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      When I was creating the example code for SPARK-10496, I realized it was pretty convoluted to define the frame boundaries for window functions when there is no partition column or ordering column. The reason is that we don't provide a way to create a WindowSpec directly with the frame boundaries. We can trivially improve this by adding rowsBetween and rangeBetween to Window object.
      
      As an example, to compute cumulative sum using the natural ordering, before this pr:
      ```
      df.select('key, sum("value").over(Window.partitionBy(lit(1)).rowsBetween(Long.MinValue, 0)))
      ```
      
      After this pr:
      ```
      df.select('key, sum("value").over(Window.rowsBetween(Long.MinValue, 0)))
      ```
      
      Note that you could argue there is no point specifying a window frame without partitionBy/orderBy -- but it is strange that rowsBetween and rangeBetween are the only two APIs not available directly on the Window object.
      
      This also fixes https://issues.apache.org/jira/browse/SPARK-17656 (removing _root_.scala).
      
      ## How was this patch tested?
      Added test cases to compute cumulative sum in DataFrameWindowSuite for Scala/Java and tests.py for Python.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15412 from rxin/SPARK-17844.
      b515768f
  6. Oct 10, 2016
    • Wenchen Fan's avatar
      [SPARK-17338][SQL] add global temp view · 23ddff4b
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Global temporary view is a cross-session temporary view, which means it's shared among all sessions. Its lifetime is the lifetime of the Spark application, i.e. it will be automatically dropped when the application terminates. It's tied to a system-preserved database `global_temp` (configurable via SparkConf), and we must use the qualified name to refer to a global temp view, e.g. `SELECT * FROM global_temp.view1`.
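      A hedged PySpark/SQL sketch of this behaviour:
      ```python
      df = spark.range(3)
      df.createGlobalTempView("view1")

      # Must be referenced through the qualified name.
      spark.sql("SELECT * FROM global_temp.view1").show()

      # Visible from another session of the same application.
      spark.newSession().sql("SELECT * FROM global_temp.view1").show()
      ```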
      
      changes for `SessionCatalog`:
      
      1. add a new field `globalTempViews: GlobalTempViewManager`, to access the shared global temp views and the global temp database name.
      2. `createDatabase` will fail if users want to create `global_temp`, which is system preserved.
      3. `setCurrentDatabase` will fail if users want to set `global_temp`, which is system preserved.
      4. add `createGlobalTempView`, which is used in `CreateViewCommand` to create global temp views.
      5. add `dropGlobalTempView`, which is used in `CatalogImpl` to drop global temp view.
      6. add `alterTempViewDefinition`, which is used in `AlterViewAsCommand` to update the view definition for local/global temp views.
      7. `renameTable`/`dropTable`/`isTemporaryTable`/`lookupRelation`/`getTempViewOrPermanentTableMetadata`/`refreshTable` will handle global temp views.
      
      changes for SQL commands:
      
      1. `CreateViewCommand`/`AlterViewAsCommand` is updated to support global temp views
      2. `ShowTablesCommand` outputs a new column `database`, which is used to distinguish global and local temp views.
      3. other commands can also handle global temp views if they call `SessionCatalog` APIs which accepts global temp views, e.g. `DropTableCommand`, `AlterTableRenameCommand`, `ShowColumnsCommand`, etc.
      
      changes for other public API
      
      1. add a new method `dropGlobalTempView` in `Catalog`
      2. `Catalog.findTable` can find global temp view
      3. add a new method `createGlobalTempView` in `Dataset`
      
      ## How was this patch tested?
      
      new tests in `SQLViewSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #14897 from cloud-fan/global-temp-view.
      23ddff4b
  7. Oct 07, 2016
  8. Oct 04, 2016
  9. Oct 03, 2016
  10. Oct 01, 2016
  11. Sep 29, 2016
    • Michael Armbrust's avatar
      [SPARK-17699] Support for parsing JSON string columns · fe33121a
      Michael Armbrust authored
      Spark SQL has great support for reading text files that contain JSON data. However, in many cases the JSON data is just one column amongst others. This is particularly true when reading from sources such as Kafka. This PR adds a new function `from_json` that converts a string column into a nested `StructType` with a user-specified schema.
      
      Example usage:
      ```scala
      val df = Seq("""{"a": 1}""").toDS()
      val schema = new StructType().add("a", IntegerType)
      
      df.select(from_json($"value", schema) as 'json) // => [json: <a: int>]
      ```
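      A hedged PySpark equivalent of the example above:
      ```python
      from pyspark.sql.functions import from_json
      from pyspark.sql.types import StructType, IntegerType

      schema = StructType().add("a", IntegerType())
      df = spark.createDataFrame([('{"a": 1}',)], ["value"])
      df.select(from_json(df["value"], schema).alias("json")).show()
      ```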
      
      This PR adds support for java, scala and python.  I leveraged our existing JSON parsing support by moving it into catalyst (so that we could define expressions using it).  I left SQL out for now, because I'm not sure how users would specify a schema.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #15274 from marmbrus/jsonParser.
      fe33121a
  12. Sep 28, 2016
  13. Sep 27, 2016
    • WeichenXu's avatar
      [SPARK-17138][ML][MLIB] Add Python API for multinomial logistic regression · 7f16affa
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Add Python API for multinomial logistic regression.
      
      - add `family` param in python api.
      - expose `coefficientMatrix` and `interceptVector` for `LogisticRegressionModel`
      - add python-side testcase for multinomial logistic regression
      - update python doc.
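      A hedged sketch of the new Python API (data is hypothetical):
      ```python
      from pyspark.ml.classification import LogisticRegression
      from pyspark.ml.linalg import Vectors

      df = spark.createDataFrame(
          [(0.0, Vectors.dense([0.0, 1.0])),
           (1.0, Vectors.dense([1.0, 0.0])),
           (2.0, Vectors.dense([1.0, 1.0]))],
          ["label", "features"])

      mlor = LogisticRegression(family="multinomial", maxIter=10)
      model = mlor.fit(df)
      print(model.coefficientMatrix)   # one row of coefficients per class
      print(model.interceptVector)
      ```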
      
      ## How was this patch tested?
      
      existing and added doc tests.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #14852 from WeichenXu123/add_MLOR_python.
      7f16affa
  14. Sep 26, 2016
    • Yanbo Liang's avatar
      [SPARK-17017][FOLLOW-UP][ML] Refactor of ChiSqSelector and add ML Python API. · ac65139b
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      #14597 modified ```ChiSqSelector``` to support the ```fpr``` selector type; however, it left some issues that need to be addressed (a hedged usage sketch follows this list):
      * We should allow users to set the selector type explicitly rather than switching it implicitly via different setter functions, since the setting order can cause unexpected behaviour. For example, if users set both ```numTopFeatures``` and ```percentile```, whether a ```kbest``` or ```percentile``` model is trained depends on the order of the calls (the later setting wins). This confuses users, so we should allow them to set the selector type explicitly. We handle similar issues elsewhere in the ML code base, such as in ```GeneralizedLinearRegression``` and ```LogisticRegression```.
      * Meanwhile, if more than one parameter besides ```alpha``` could be set for the ```fpr``` model, we could not handle it elegantly in the existing framework; there are similar issues for the ```kbest``` and ```percentile``` models. Setting the selector type explicitly also solves this.
      * If users are allowed to set the selector type explicitly, we should handle parameter interactions: for example, if users set ```selectorType = percentile``` and ```alpha = 0.1```, we should notify them that the parameter ```alpha``` will have no effect. Such parameter interaction checks should be handled in ```transformSchema```. (FYI #11620)
      * We should use lower-case selector type names to follow the MLlib convention.
      * Add the ML Python API.
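      A hedged sketch of the explicit selector-type API in Python (param values as exposed in ML; data is hypothetical):
      ```python
      from pyspark.ml.feature import ChiSqSelector
      from pyspark.ml.linalg import Vectors

      df = spark.createDataFrame(
          [(Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0),
           (Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0)],
          ["features", "label"])

      selector = ChiSqSelector(selectorType="percentile", percentile=0.5,
                               outputCol="selected")
      selector.fit(df).transform(df).show()
      ```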
      
      ## How was this patch tested?
      Unit test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15214 from yanboliang/spark-17017.
      ac65139b
  15. Sep 24, 2016
  16. Sep 23, 2016
    • Holden Karau's avatar
      [SPARK-16861][PYSPARK][CORE] Refactor PySpark accumulator API on top of Accumulator V2 · 90d57542
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      Move the internals of the PySpark accumulator API from the old deprecated API on top of the new accumulator API.
      
      ## How was this patch tested?
      
      The existing PySpark accumulator tests (both unit tests and doc tests at the start of accumulator.py).
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #14467 from holdenk/SPARK-16861-refactor-pyspark-accumulator-api.
      90d57542
  17. Sep 22, 2016
  18. Sep 21, 2016
  19. Sep 20, 2016
    • Adrian Petrescu's avatar
      [SPARK-17437] Add uiWebUrl to JavaSparkContext and pyspark.SparkContext · 4a426ff8
      Adrian Petrescu authored
      ## What changes were proposed in this pull request?
      
      The Scala version of `SparkContext` has a handy field called `uiWebUrl` that tells you which URL the SparkUI spawned by that instance lives at. This is often very useful because the value for `spark.ui.port` in the config is only a suggestion; if that port number is taken by another Spark instance on the same machine, Spark will just keep incrementing the port until it finds a free one. So, on a machine with a lot of running PySpark instances, you often have to start trying all of them one-by-one until you find your application name.
      
      Scala users have a way around this with `uiWebUrl` but Java and Python users do not. This pull request fixes this in the most straightforward way possible, simply propagating this field through the `JavaSparkContext` and into pyspark through the Java gateway.
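      A hedged sketch of the new attribute from PySpark:
      ```python
      from pyspark import SparkContext

      sc = SparkContext(appName="ui-url-demo")   # or reuse an existing context
      print(sc.uiWebUrl)                         # e.g. http://<driver-host>:4040
      ```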
      
      Please let me know if any additional documentation/testing is needed.
      
      ## How was this patch tested?
      
      Existing tests were run to make sure there were no regressions, and a binary distribution was created and tested manually for the correct value of `sc.uiWebUrl` in a variety of circumstances.
      
      Author: Adrian Petrescu <apetresc@gmail.com>
      
      Closes #15000 from apetresc/pyspark-uiweburl.
      4a426ff8
  20. Sep 19, 2016
    • Davies Liu's avatar
      [SPARK-17100] [SQL] fix Python udf in filter on top of outer join · d8104158
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      In the optimizer, we try to evaluate the filter condition to see whether it's nullable or not, but some expressions are not evaluable; we should check for that before evaluating them.
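      A hedged reproduction sketch of the scenario that is fixed (data and column names are hypothetical):
      ```python
      from pyspark.sql.functions import udf
      from pyspark.sql.types import BooleanType

      left = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "l"])
      right = spark.createDataFrame([(1, "x")], ["id", "r"])

      # A Python UDF in a filter on top of an outer join; its result is not
      # evaluable at optimization time.
      keep = udf(lambda v: v is not None, BooleanType())
      left.join(right, "id", "left_outer").filter(keep("r")).show()
      ```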
      
      ## How was this patch tested?
      
      Added regression tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #15103 from davies/udf_join.
      d8104158
  21. Sep 18, 2016
    • Liwei Lin's avatar
      [SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV cast null values properly · 1dbb725d
      Liwei Lin authored
      ## Problem
      
      CSV in Spark 2.0.0:
      - does not read null values back correctly for certain data types such as `Boolean`, `TimestampType`, `DateType` -- this is a regression compared to 1.6;
      - does not read empty values (specified by `options.nullValue`) as `null`s for `StringType` -- this is compatible with 1.6 but leads to problems like SPARK-16903.
      
      ## What changes were proposed in this pull request?
      
      This patch makes changes to read all empty values back as `null`s.
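      A hedged sketch of the reading behaviour (path, options and column name are hypothetical):
      ```python
      df = (spark.read
            .option("header", "true")
            .option("nullValue", "NA")       # values matching "NA" become null
            .option("inferSchema", "true")
            .csv("/tmp/example.csv"))        # hypothetical path

      # Empty fields and "NA" now come back as nulls for all supported types,
      # including StringType.
      df.filter(df["some_col"].isNull()).show()
      ```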
      
      ## How was this patch tested?
      
      New test cases.
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #14118 from lw-lin/csv-cast-null.
      1dbb725d
  22. Sep 17, 2016
    • William Benton's avatar
      [SPARK-17548][MLLIB] Word2VecModel.findSynonyms no longer spuriously rejects... · 25cbbe6c
      William Benton authored
      [SPARK-17548][MLLIB] Word2VecModel.findSynonyms no longer spuriously rejects the best match when invoked with a vector
      
      ## What changes were proposed in this pull request?
      
      This pull request changes the behavior of `Word2VecModel.findSynonyms` so that it will not spuriously reject the best match when invoked with a vector that does not correspond to a word in the model's vocabulary.  Instead of blindly discarding the best match, the changed implementation discards a match that corresponds to the query word (in cases where `findSynonyms` is invoked with a word) or that has an identical angle to the query vector.
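      A hedged sketch against the MLlib Python wrapper (toy corpus; the fix itself is in the Scala model that backs it):
      ```python
      from pyspark.mllib.feature import Word2Vec

      corpus = sc.parallelize([["spark", "streaming", "sql"],
                               ["spark", "mllib", "sql"]])
      model = Word2Vec().setVectorSize(10).setSeed(42).setMinCount(1).fit(corpus)

      vec = model.transform("spark")              # query with a raw vector
      print(list(model.findSynonyms(vec, 2)))     # best match is no longer dropped
      ```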
      
      ## How was this patch tested?
      
      I added a test to `Word2VecSuite` to ensure that the word with the most similar vector from a supplied vector would not be spuriously rejected.
      
      Author: William Benton <willb@redhat.com>
      
      Closes #15105 from willb/fix/findSynonyms.
      25cbbe6c
  23. Sep 14, 2016
    • Eric Liang's avatar
      [SPARK-17472] [PYSPARK] Better error message for serialization failures of large objects in Python · dbfc7aa4
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      For large objects, pickle does not raise useful error messages. However, we can wrap them to be slightly more user friendly:
      
      Example 1:
      ```
      def run():
        import numpy.random as nr
        b = nr.bytes(8 * 1000000000)
        sc.parallelize(range(1000), 1000).map(lambda x: len(b)).count()
      
      run()
      ```
      
      Before:
      ```
      error: 'i' format requires -2147483648 <= number <= 2147483647
      ```
      
      After:
      ```
      pickle.PicklingError: Object too large to serialize: 'i' format requires -2147483648 <= number <= 2147483647
      ```
      
      Example 2:
      ```
      def run():
        import numpy.random as nr
        b = sc.broadcast(nr.bytes(8 * 1000000000))
        sc.parallelize(range(1000), 1000).map(lambda x: len(b.value)).count()
      
      run()
      ```
      
      Before:
      ```
      SystemError: error return without exception set
      ```
      
      After:
      ```
      cPickle.PicklingError: Could not serialize broadcast: SystemError: error return without exception set
      ```
      
      ## How was this patch tested?
      
      Manually tried out these cases
      
      cc davies
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #15026 from ericl/spark-17472.
      dbfc7aa4