Skip to content
Snippets Groups Projects
  1. Mar 07, 2017
    • Jason White's avatar
      [SPARK-19561] [PYTHON] cast TimestampType.toInternal output to long · 711addd4
      Jason White authored
      
      ## What changes were proposed in this pull request?
      
      Cast the output of `TimestampType.toInternal` to long to allow for proper Timestamp creation in DataFrames near the epoch.
      
      ## How was this patch tested?
      
      Added a new test that fails without the change.
      
      dongjoon-hyun davies Mind taking a look?
      
      The contribution is my original work and I license the work to the project under the project’s open source license.
      
      Author: Jason White <jason.white@shopify.com>
      
      Closes #16896 from JasonMWhite/SPARK-19561.
      
      (cherry picked from commit 6f468462)
      Signed-off-by: default avatarDavies Liu <davies.liu@gmail.com>
      711addd4
  2. Feb 25, 2017
    • Bryan Cutler's avatar
      [SPARK-14772][PYTHON][ML] Fixed Params.copy method to match Scala implementation · 20a43295
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      Fixed the PySpark Params.copy method to behave like the Scala implementation.  The main issue was that it did not account for the _defaultParamMap and merged it into the explicitly created param map.
      
      ## How was this patch tested?
      Added new unit test to verify the copy method behaves correctly for copying uid, explicitly created params, and default params.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #17048 from BryanCutler/pyspark-ml-param_copy-Scala_sync-SPARK-14772-2_1.
      20a43295
  3. Feb 15, 2017
  4. Feb 13, 2017
  5. Jan 25, 2017
  6. Jan 20, 2017
    • Davies Liu's avatar
      [SPARK-18589][SQL] Fix Python UDF accessing attributes from both side of join · 4d286c90
      Davies Liu authored
      PythonUDF is unevaluable, which can not be used inside a join condition, currently the optimizer will push a PythonUDF which accessing both side of join into the join condition, then the query will fail to plan.
      
      This PR fix this issue by checking the expression is evaluable  or not before pushing it into Join.
      
      Add a regression test.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #16581 from davies/pyudf_join.
      4d286c90
  7. Jan 17, 2017
    • hyukjinkwon's avatar
      [SPARK-19019] [PYTHON] Fix hijacked `collections.namedtuple` and port... · 2ff36691
      hyukjinkwon authored
      [SPARK-19019] [PYTHON] Fix hijacked `collections.namedtuple` and port cloudpickle changes for PySpark to work with Python 3.6.0
      
      ## What changes were proposed in this pull request?
      
      Currently, PySpark does not work with Python 3.6.0.
      
      Running `./bin/pyspark` simply throws the error as below and PySpark does not work at all:
      
      ```
      Traceback (most recent call last):
        File ".../spark/python/pyspark/shell.py", line 30, in <module>
          import pyspark
        File ".../spark/python/pyspark/__init__.py", line 46, in <module>
          from pyspark.context import SparkContext
        File ".../spark/python/pyspark/context.py", line 36, in <module>
          from pyspark.java_gateway import launch_gateway
        File ".../spark/python/pyspark/java_gateway.py", line 31, in <module>
          from py4j.java_gateway import java_import, JavaGateway, GatewayClient
        File "<frozen importlib._bootstrap>", line 961, in _find_and_load
        File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
        File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
        File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 18, in <module>
        File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py", line 62, in <module>
          import pkgutil
        File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py", line 22, in <module>
          ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg')
        File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple
          cls = _old_namedtuple(*args, **kwargs)
      TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
      ```
      
      The root cause seems because some arguments of `namedtuple` are now completely keyword-only arguments from Python 3.6.0 (See https://bugs.python.org/issue25628
      
      ).
      
      We currently copy this function via `types.FunctionType` which does not set the default values of keyword-only arguments (meaning `namedtuple.__kwdefaults__`) and this seems causing internally missing values in the function (non-bound arguments).
      
      This PR proposes to work around this by manually setting it via `kwargs` as `types.FunctionType` seems not supporting to set this.
      
      Also, this PR ports the changes in cloudpickle for compatibility for Python 3.6.0.
      
      ## How was this patch tested?
      
      Manually tested with Python 2.7.6 and Python 3.6.0.
      
      ```
      ./bin/pyspsark
      ```
      
      , manual creation of `namedtuple` both in local and rdd with Python 3.6.0,
      
      and Jenkins tests for other Python versions.
      
      Also,
      
      ```
      ./run-tests --python-executables=python3.6
      ```
      
      ```
      Will test against the following Python executables: ['python3.6']
      Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
      Finished test(python3.6): pyspark.sql.tests (192s)
      Finished test(python3.6): pyspark.accumulators (3s)
      Finished test(python3.6): pyspark.mllib.tests (198s)
      Finished test(python3.6): pyspark.broadcast (3s)
      Finished test(python3.6): pyspark.conf (2s)
      Finished test(python3.6): pyspark.context (14s)
      Finished test(python3.6): pyspark.ml.classification (21s)
      Finished test(python3.6): pyspark.ml.evaluation (11s)
      Finished test(python3.6): pyspark.ml.clustering (20s)
      Finished test(python3.6): pyspark.ml.linalg.__init__ (0s)
      Finished test(python3.6): pyspark.streaming.tests (240s)
      Finished test(python3.6): pyspark.tests (240s)
      Finished test(python3.6): pyspark.ml.recommendation (19s)
      Finished test(python3.6): pyspark.ml.feature (36s)
      Finished test(python3.6): pyspark.ml.regression (37s)
      Finished test(python3.6): pyspark.ml.tuning (28s)
      Finished test(python3.6): pyspark.mllib.classification (26s)
      Finished test(python3.6): pyspark.mllib.evaluation (18s)
      Finished test(python3.6): pyspark.mllib.clustering (44s)
      Finished test(python3.6): pyspark.mllib.linalg.__init__ (0s)
      Finished test(python3.6): pyspark.mllib.feature (26s)
      Finished test(python3.6): pyspark.mllib.fpm (23s)
      Finished test(python3.6): pyspark.mllib.random (8s)
      Finished test(python3.6): pyspark.ml.tests (92s)
      Finished test(python3.6): pyspark.mllib.stat.KernelDensity (0s)
      Finished test(python3.6): pyspark.mllib.linalg.distributed (25s)
      Finished test(python3.6): pyspark.mllib.stat._statistics (15s)
      Finished test(python3.6): pyspark.mllib.recommendation (24s)
      Finished test(python3.6): pyspark.mllib.regression (26s)
      Finished test(python3.6): pyspark.profiler (9s)
      Finished test(python3.6): pyspark.mllib.tree (16s)
      Finished test(python3.6): pyspark.shuffle (1s)
      Finished test(python3.6): pyspark.mllib.util (18s)
      Finished test(python3.6): pyspark.serializers (11s)
      Finished test(python3.6): pyspark.rdd (20s)
      Finished test(python3.6): pyspark.sql.conf (8s)
      Finished test(python3.6): pyspark.sql.catalog (17s)
      Finished test(python3.6): pyspark.sql.column (18s)
      Finished test(python3.6): pyspark.sql.context (18s)
      Finished test(python3.6): pyspark.sql.group (27s)
      Finished test(python3.6): pyspark.sql.dataframe (33s)
      Finished test(python3.6): pyspark.sql.functions (35s)
      Finished test(python3.6): pyspark.sql.types (6s)
      Finished test(python3.6): pyspark.sql.streaming (13s)
      Finished test(python3.6): pyspark.streaming.util (0s)
      Finished test(python3.6): pyspark.sql.session (16s)
      Finished test(python3.6): pyspark.sql.window (4s)
      Finished test(python3.6): pyspark.sql.readwriter (35s)
      Tests passed in 433 seconds
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16429 from HyukjinKwon/SPARK-19019.
      
      (cherry picked from commit 20e62806)
      Signed-off-by: default avatarDavies Liu <davies.liu@gmail.com>
      2ff36691
  8. Jan 13, 2017
    • Vinayak's avatar
      [SPARK-18687][PYSPARK][SQL] Backward compatibility - creating a Dataframe on a... · b2c9a2c8
      Vinayak authored
      [SPARK-18687][PYSPARK][SQL] Backward compatibility - creating a Dataframe on a new SQLContext object fails with a Derby error
      
      Change is for SQLContext to reuse the active SparkSession during construction if the sparkContext supplied is the same as the currently active SparkContext. Without this change, a new SparkSession is instantiated that results in a Derby error when attempting to create a dataframe using a new SQLContext object even though the SparkContext supplied to the new SQLContext is same as the currently active one. Refer https://issues.apache.org/jira/browse/SPARK-18687 for details on the error and a repro.
      
      Existing unit tests and a new unit test added to pyspark-sql:
      
      /python/run-tests --python-executables=python --modules=pyspark-sql
      
      Please review http://spark.apache.org/contributing.html
      
       before opening a pull request.
      
      Author: Vinayak <vijoshi5@in.ibm.com>
      Author: Vinayak Joshi <vijoshi@users.noreply.github.com>
      
      Closes #16119 from vijoshi/SPARK-18687_master.
      
      (cherry picked from commit 285a7798)
      Signed-off-by: default avatarWenchen Fan <wenchen@databricks.com>
      b2c9a2c8
  9. Jan 12, 2017
    • Liang-Chi Hsieh's avatar
      [SPARK-19055][SQL][PYSPARK] Fix SparkSession initialization when SparkContext is stopped · 042e32d1
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      In SparkSession initialization, we store created the instance of SparkSession into a class variable _instantiatedContext. Next time we can use SparkSession.builder.getOrCreate() to retrieve the existing SparkSession instance.
      
      However, when the active SparkContext is stopped and we create another new SparkContext to use, the existing SparkSession is still associated with the stopped SparkContext. So the operations with this existing SparkSession will be failed.
      
      We need to detect such case in SparkSession and renew the class variable _instantiatedContext if needed.
      
      ## How was this patch tested?
      
      New test added in PySpark.
      
      Please review http://spark.apache.org/contributing.html
      
       before opening a pull request.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #16454 from viirya/fix-pyspark-sparksession.
      
      (cherry picked from commit c6c37b8a)
      Signed-off-by: default avatarWenchen Fan <wenchen@databricks.com>
      042e32d1
  10. Jan 10, 2017
  11. Jan 08, 2017
  12. Dec 21, 2016
    • gatorsmile's avatar
      [SPARK-18949][SQL][BACKPORT-2.1] Add recoverPartitions API to Catalog · 0e51bb08
      gatorsmile authored
      ### What changes were proposed in this pull request?
      
      This PR is to backport https://github.com/apache/spark/pull/16356 to Spark 2.1.1 branch.
      
      ----
      
      Currently, we only have a SQL interface for recovering all the partitions in the directory of a table and update the catalog. `MSCK REPAIR TABLE` or `ALTER TABLE table RECOVER PARTITIONS`. (Actually, very hard for me to remember `MSCK` and have no clue what it means)
      
      After the new "Scalable Partition Handling", the table repair becomes much more important for making visible the data in the created data source partitioned table.
      
      Thus, this PR is to add it into the Catalog interface. After this PR, users can repair the table by
      ```Scala
      spark.catalog.recoverPartitions("testTable")
      ```
      
      ### How was this patch tested?
      Modified the existing test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16372 from gatorsmile/repairTable2.1.1.
      0e51bb08
  13. Dec 20, 2016
    • Liang-Chi Hsieh's avatar
      [SPARK-18281] [SQL] [PYSPARK] Remove timeout for reading data through socket for local iterator · cd297c39
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      There is a timeout failure when using `rdd.toLocalIterator()` or `df.toLocalIterator()` for a PySpark RDD and DataFrame:
      
          df = spark.createDataFrame([[1],[2],[3]])
          it = df.toLocalIterator()
          row = next(it)
      
          df2 = df.repartition(1000)  # create many empty partitions which increase materialization time so causing timeout
          it2 = df2.toLocalIterator()
          row = next(it2)
      
      The cause of this issue is, we open a socket to serve the data from JVM side. We set timeout for connection and reading through the socket in Python side. In Python we use a generator to read the data, so we only begin to connect the socket once we start to ask data from it. If we don't consume it immediately, there is connection timeout.
      
      In the other side, the materialization time for RDD partitions is unpredictable. So we can't set a timeout for reading data through the socket. Otherwise, it is very possibly to fail.
      
      ## How was this patch tested?
      
      Added tests into PySpark.
      
      Please review http://spark.apache.org/contributing.html
      
       before opening a pull request.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #16263 from viirya/fix-pyspark-localiterator.
      
      (cherry picked from commit 95c95b71)
      Signed-off-by: default avatarDavies Liu <davies.liu@gmail.com>
      cd297c39
  14. Dec 15, 2016
  15. Dec 14, 2016
  16. Dec 11, 2016
  17. Dec 08, 2016
    • Andrew Ray's avatar
      [SPARK-16589] [PYTHON] Chained cartesian produces incorrect number of records · e0173f14
      Andrew Ray authored
      
      ## What changes were proposed in this pull request?
      
      Fixes a bug in the python implementation of rdd cartesian product related to batching that showed up in repeated cartesian products with seemingly random results. The root cause being multiple iterators pulling from the same stream in the wrong order because of logic that ignored batching.
      
      `CartesianDeserializer` and `PairDeserializer` were changed to implement `_load_stream_without_unbatching` and borrow the one line implementation of `load_stream` from `BatchedSerializer`. The default implementation of `_load_stream_without_unbatching` was changed to give consistent results (always an iterable) so that it could be used without additional checks.
      
      `PairDeserializer` no longer extends `CartesianDeserializer` as it was not really proper. If wanted a new common super class could be added.
      
      Both `CartesianDeserializer` and `PairDeserializer` now only extend `Serializer` (which has no `dump_stream` implementation) since they are only meant for *de*serialization.
      
      ## How was this patch tested?
      
      Additional unit tests (sourced from #14248) plus one for testing a cartesian with zip.
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #16121 from aray/fix-cartesian.
      
      (cherry picked from commit 3c68944b)
      Signed-off-by: default avatarDavies Liu <davies.liu@gmail.com>
      e0173f14
    • Liang-Chi Hsieh's avatar
      [SPARK-18667][PYSPARK][SQL] Change the way to group row in BatchEvalPythonExec... · 726217eb
      Liang-Chi Hsieh authored
      [SPARK-18667][PYSPARK][SQL] Change the way to group row in BatchEvalPythonExec so input_file_name function can work with UDF in pyspark
      
      ## What changes were proposed in this pull request?
      
      `input_file_name` doesn't return filename when working with UDF in PySpark. An example shows the problem:
      
          from pyspark.sql.functions import *
          from pyspark.sql.types import *
      
          def filename(path):
              return path
      
          sourceFile = udf(filename, StringType())
          spark.read.json("tmp.json").select(sourceFile(input_file_name())).show()
      
          +---------------------------+
          |filename(input_file_name())|
          +---------------------------+
          |                           |
          +---------------------------+
      
      The cause of this issue is, we group rows in `BatchEvalPythonExec` for batching processing of PythonUDF. Currently we group rows first and then evaluate expressions on the rows. If the data is less than the required number of rows for a group, the iterator will be consumed to the end before the evaluation. However, once the iterator reaches the end, we will unset input filename. So the input_file_name expression can't return correct filename.
      
      This patch fixes the approach to group the batch of rows. We evaluate the expression first and then group evaluated results to batch.
      
      ## How was this patch tested?
      
      Added unit test to PySpark.
      
      Please review http://spark.apache.org/contributing.html
      
       before opening a pull request.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #16115 from viirya/fix-py-udf-input-filename.
      
      (cherry picked from commit 6a5a7254)
      Signed-off-by: default avatarWenchen Fan <wenchen@databricks.com>
      726217eb
    • Patrick Wendell's avatar
      48aa6775
    • Patrick Wendell's avatar
      Preparing Spark release v2.1.0-rc2 · 08071749
      Patrick Wendell authored
      08071749
  18. Dec 07, 2016
  19. Dec 06, 2016
    • Shuai Lin's avatar
      [SPARK-18652][PYTHON] Include the example data and third-party licenses in pyspark package. · 65f5331a
      Shuai Lin authored
      
      ## What changes were proposed in this pull request?
      
      Since we already include the python examples in the pyspark package, we should include the example data with it as well.
      
      We should also include the third-party licences since we distribute their jars with the pyspark package.
      
      ## How was this patch tested?
      
      Manually tested with python2.7 and python3.4
      ```sh
      $ ./build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Pmesos clean package
      $ cd python
      $ python setup.py sdist
      $ pip install  dist/pyspark-2.1.0.dev0.tar.gz
      
      $ ls -1 /usr/local/lib/python2.7/dist-packages/pyspark/data/
      graphx
      mllib
      streaming
      
      $ du -sh /usr/local/lib/python2.7/dist-packages/pyspark/data/
      600K    /usr/local/lib/python2.7/dist-packages/pyspark/data/
      
      $ ls -1  /usr/local/lib/python2.7/dist-packages/pyspark/licenses/|head -5
      LICENSE-AnchorJS.txt
      LICENSE-DPark.txt
      LICENSE-Mockito.txt
      LICENSE-SnapTree.txt
      LICENSE-antlr.txt
      ```
      
      Author: Shuai Lin <linshuai2012@gmail.com>
      
      Closes #16082 from lins05/include-data-in-pyspark-dist.
      
      (cherry picked from commit bd9a4a5a)
      Signed-off-by: default avatarSean Owen <sowen@cloudera.com>
      65f5331a
  20. Dec 05, 2016
    • Tathagata Das's avatar
      [SPARK-18657][SPARK-18668] Make StreamingQuery.id persists across restart and... · 1946854a
      Tathagata Das authored
      [SPARK-18657][SPARK-18668] Make StreamingQuery.id persists across restart and not auto-generate StreamingQuery.name
      
      Here are the major changes in this PR.
      - Added the ability to recover `StreamingQuery.id` from checkpoint location, by writing the id to `checkpointLoc/metadata`.
      - Added `StreamingQuery.runId` which is unique for every query started and does not persist across restarts. This is to identify each restart of a query separately (same as earlier behavior of `id`).
      - Removed auto-generation of `StreamingQuery.name`. The purpose of name was to have the ability to define an identifier across restarts, but since id is precisely that, there is no need for a auto-generated name. This means name becomes purely cosmetic, and is null by default.
      - Added `runId` to `StreamingQueryListener` events and `StreamingQueryProgress`.
      
      Implementation details
      - Renamed existing `StreamExecutionMetadata` to `OffsetSeqMetadata`, and moved it to the file `OffsetSeq.scala`, because that is what this metadata is tied to. Also did some refactoring to make the code cleaner (got rid of a lot of `.json` and `.getOrElse("{}")`).
      - Added the `id` as the new `StreamMetadata`.
      - When a StreamingQuery is created it gets or writes the `StreamMetadata` from `checkpointLoc/metadata`.
      - All internal logging in `StreamExecution` uses `(name, id, runId)` instead of just `name`
      
      TODO
      - [x] Test handling of name=null in json generation of StreamingQueryProgress
      - [x] Test handling of name=null in json generation of StreamingQueryListener events
      - [x] Test python API of runId
      
      Updated unit tests and new unit tests
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #16113 from tdas/SPARK-18657.
      
      (cherry picked from commit bb57bfe9)
      Signed-off-by: default avatarTathagata Das <tathagata.das1565@gmail.com>
      1946854a
    • Liang-Chi Hsieh's avatar
      [SPARK-18634][PYSPARK][SQL] Corruption and Correctness issues with exploding Python UDFs · fecd23d2
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      As reported in the Jira, there are some weird issues with exploding Python UDFs in SparkSQL.
      
      The following test code can reproduce it. Notice: the following test code is reported to return wrong results in the Jira. However, as I tested on master branch, it causes exception and so can't return any result.
      
          >>> from pyspark.sql.functions import *
          >>> from pyspark.sql.types import *
          >>>
          >>> df = spark.range(10)
          >>>
          >>> def return_range(value):
          ...   return [(i, str(i)) for i in range(value - 1, value + 1)]
          ...
          >>> range_udf = udf(return_range, ArrayType(StructType([StructField("integer_val", IntegerType()),
          ...                                                     StructField("string_val", StringType())])))
          >>>
          >>> df.select("id", explode(range_udf(df.id))).show()
          Traceback (most recent call last):
            File "<stdin>", line 1, in <module>
            File "/spark/python/pyspark/sql/dataframe.py", line 318, in show
              print(self._jdf.showString(n, 20))
            File "/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
            File "/spark/python/pyspark/sql/utils.py", line 63, in deco
              return f(*a, **kw)
            File "/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o126.showString.: java.lang.AssertionError: assertion failed
              at scala.Predef$.assert(Predef.scala:156)
              at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:120)
              at org.apache.spark.sql.execution.GenerateExec.consume(GenerateExec.scala:57)
      
      The cause of this issue is, in `ExtractPythonUDFs` we insert `BatchEvalPythonExec` to run PythonUDFs in batch. `BatchEvalPythonExec` will add extra outputs (e.g., `pythonUDF0`) to original plan. In above case, the original `Range` only has one output `id`. After `ExtractPythonUDFs`, the added `BatchEvalPythonExec` has two outputs `id` and `pythonUDF0`.
      
      Because the output of `GenerateExec` is given after analysis phase, in above case, it is the combination of `id`, i.e., the output of `Range`, and `col`. But in planning phase, we change `GenerateExec`'s child plan to `BatchEvalPythonExec` with additional output attributes.
      
      It will cause no problem in non wholestage codegen. Because when evaluating the additional attributes are projected out the final output of `GenerateExec`.
      
      However, as `GenerateExec` now supports wholestage codegen, the framework will input all the outputs of the child plan to `GenerateExec`. Then when consuming `GenerateExec`'s output data (i.e., calling `consume`), the number of output attributes is different to the output variables in wholestage codegen.
      
      To solve this issue, this patch only gives the generator's output to `GenerateExec` after analysis phase. `GenerateExec`'s output is the combination of its child plan's output and the generator's output. So when we change `GenerateExec`'s child, its output is still correct.
      
      ## How was this patch tested?
      
      Added test cases to PySpark.
      
      Please review http://spark.apache.org/contributing.html
      
       before opening a pull request.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #16120 from viirya/fix-py-udf-with-generator.
      
      (cherry picked from commit 3ba69b64)
      Signed-off-by: default avatarHerman van Hovell <hvanhovell@databricks.com>
      fecd23d2
    • Shixiong Zhu's avatar
      [SPARK-18694][SS] Add StreamingQuery.explain and exception to Python and fix... · c6a4e3d9
      Shixiong Zhu authored
      [SPARK-18694][SS] Add StreamingQuery.explain and exception to Python and fix StreamingQueryException (branch 2.1)
      
      ## What changes were proposed in this pull request?
      
      Backport #16125 to branch 2.1.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16153 from zsxwing/SPARK-18694-2.1.
      c6a4e3d9
  21. Dec 02, 2016
  22. Dec 01, 2016
    • Sandeep Singh's avatar
      [SPARK-18274][ML][PYSPARK] Memory leak in PySpark JavaWrapper · 4c673c65
      Sandeep Singh authored
      
      ## What changes were proposed in this pull request?
      In`JavaWrapper `'s destructor make Java Gateway dereference object in destructor, using `SparkContext._active_spark_context._gateway.detach`
      Fixing the copying parameter bug, by moving the `copy` method from `JavaModel` to `JavaParams`
      
      ## How was this patch tested?
      ```scala
      import random, string
      from pyspark.ml.feature import StringIndexer
      
      l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) for _ in range(int(7e5))]  # 700000 random strings of 10 characters
      df = spark.createDataFrame(l, ['string'])
      
      for i in range(50):
          indexer = StringIndexer(inputCol='string', outputCol='index')
          indexer.fit(df)
      ```
      * Before: would keep StringIndexer strong reference, causing GC issues and is halted midway
      After: garbage collection works as the object is dereferenced, and computation completes
      * Mem footprint tested using profiler
      * Added a parameter copy related test which was failing before.
      
      Author: Sandeep Singh <sandeep@techaddict.me>
      Author: jkbradley <joseph.kurata.bradley@gmail.com>
      
      Closes #15843 from techaddict/SPARK-18274.
      
      (cherry picked from commit 78bb7f80)
      Signed-off-by: default avatarJoseph K. Bradley <joseph@databricks.com>
      4c673c65
  23. Nov 30, 2016
  24. Nov 29, 2016
Loading