  1. Feb 25, 2017
    • Bryan Cutler's avatar
      [SPARK-14772][PYTHON][ML] Fixed Params.copy method to match Scala implementation · 20a43295
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      Fixed the PySpark Params.copy method to behave like the Scala implementation. The main issue was that it did not handle the _defaultParamMap separately and instead merged it into the explicitly set param map.
      
      ## How was this patch tested?
      Added new unit test to verify the copy method behaves correctly for copying uid, explicitly created params, and default params.
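
      A hedged sketch of what that behavior looks like from the Python API (assuming an active SparkSession; the estimator and the particular params below are chosen only for illustration):

      ```python
      from pyspark.ml.classification import LogisticRegression

      lr = LogisticRegression(maxIter=5)              # maxIter set explicitly; regParam stays a default
      lr_copy = lr.copy({lr.regParam: 0.05})          # an extra value supplied through copy's param map

      assert lr_copy.uid == lr.uid                    # the uid is copied
      assert lr_copy.getMaxIter() == 5                # explicitly set params carry over
      assert lr_copy.getRegParam() == 0.05            # extra param-map entries are applied
      assert not lr_copy.isSet(lr_copy.fitIntercept)  # defaults remain defaults, not explicit params
      ```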
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #17048 from BryanCutler/pyspark-ml-param_copy-Scala_sync-SPARK-14772-2_1.
      20a43295
  2. Feb 15, 2017
  3. Feb 13, 2017
  4. Jan 25, 2017
  5. Jan 20, 2017
    • Davies Liu's avatar
      [SPARK-18589][SQL] Fix Python UDF accessing attributes from both side of join · 4d286c90
      Davies Liu authored
      PythonUDF is unevaluable and therefore cannot be used inside a join condition. Currently, the optimizer will push a PythonUDF that accesses attributes from both sides of a join into the join condition, and the query then fails to plan.
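
      A hedged sketch of the failing pattern (assuming an active `spark` session; the tables, columns, and UDF below are made up for illustration):

      ```python
      from pyspark.sql.functions import col, udf
      from pyspark.sql.types import BooleanType

      left = spark.range(3).select(col("id").alias("k"), col("id").alias("a"))
      right = spark.range(3).select(col("id").alias("k"), (col("id") * 2).alias("b"))

      close_enough = udf(lambda a, b: abs(a - b) <= 1, BooleanType())

      joined = left.join(right, "k")
      # The predicate references columns from both sides, so the optimizer used to
      # push the (unevaluable) PythonUDF into the join condition and planning failed.
      joined.filter(close_enough(left.a, right.b)).show()
      ```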
      
      This PR fixes the issue by checking whether an expression is evaluable before pushing it into the Join.
      
      Add a regression test.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #16581 from davies/pyudf_join.
      4d286c90
  6. Jan 17, 2017
    • hyukjinkwon's avatar
      [SPARK-19019] [PYTHON] Fix hijacked `collections.namedtuple` and port... · 2ff36691
      hyukjinkwon authored
      [SPARK-19019] [PYTHON] Fix hijacked `collections.namedtuple` and port cloudpickle changes for PySpark to work with Python 3.6.0
      
      ## What changes were proposed in this pull request?
      
      Currently, PySpark does not work with Python 3.6.0.
      
      Running `./bin/pyspark` simply throws the error as below and PySpark does not work at all:
      
      ```
      Traceback (most recent call last):
        File ".../spark/python/pyspark/shell.py", line 30, in <module>
          import pyspark
        File ".../spark/python/pyspark/__init__.py", line 46, in <module>
          from pyspark.context import SparkContext
        File ".../spark/python/pyspark/context.py", line 36, in <module>
          from pyspark.java_gateway import launch_gateway
        File ".../spark/python/pyspark/java_gateway.py", line 31, in <module>
          from py4j.java_gateway import java_import, JavaGateway, GatewayClient
        File "<frozen importlib._bootstrap>", line 961, in _find_and_load
        File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
        File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
        File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 18, in <module>
        File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py", line 62, in <module>
          import pkgutil
        File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py", line 22, in <module>
          ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg')
        File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple
          cls = _old_namedtuple(*args, **kwargs)
      TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
      ```
      
      The root cause appears to be that some arguments of `namedtuple` became keyword-only in Python 3.6.0 (see https://bugs.python.org/issue25628).
      
      We currently copy this function via `types.FunctionType`, which does not carry over the default values of keyword-only arguments (i.e. `namedtuple.__kwdefaults__`), leaving those arguments unbound inside the copied function.
      
      This PR works around this by supplying those defaults manually via `kwargs`, since `types.FunctionType` does not provide a way to set them.
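
      A minimal, standalone sketch of the underlying Python behavior (this is not the actual PySpark code): a function copied with `types.FunctionType` loses its `__kwdefaults__`, so the keyword-only defaults have to be supplied again by hand.

      ```python
      import types

      def greet(name, *, punctuation="!"):     # "punctuation" is a keyword-only argument with a default
          return name + punctuation

      # Copying the function object the way the namedtuple hijack does ...
      copied = types.FunctionType(greet.__code__, greet.__globals__, greet.__name__,
                                  greet.__defaults__, greet.__closure__)

      # ... loses the keyword-only defaults:
      assert copied.__kwdefaults__ is None
      try:
          copied("hi")
      except TypeError as e:
          print(e)                             # missing 1 required keyword-only argument: 'punctuation'

      # Supplying the stored defaults again fixes the call (the PR forwards them
      # through `kwargs`; reattaching `__kwdefaults__` directly has the same effect here).
      copied.__kwdefaults__ = greet.__kwdefaults__
      assert copied("hi") == "hi!"
      ```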
      
      Also, this PR ports the changes in cloudpickle for compatibility for Python 3.6.0.
      
      ## How was this patch tested?
      
      Manually tested with Python 2.7.6 and Python 3.6.0 by running

      ```
      ./bin/pyspark
      ```

      by manually creating `namedtuple` both locally and in an RDD with Python 3.6.0,

      and via Jenkins tests for other Python versions.
      
      Also,
      
      ```
      ./run-tests --python-executables=python3.6
      ```
      
      ```
      Will test against the following Python executables: ['python3.6']
      Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
      Finished test(python3.6): pyspark.sql.tests (192s)
      Finished test(python3.6): pyspark.accumulators (3s)
      Finished test(python3.6): pyspark.mllib.tests (198s)
      Finished test(python3.6): pyspark.broadcast (3s)
      Finished test(python3.6): pyspark.conf (2s)
      Finished test(python3.6): pyspark.context (14s)
      Finished test(python3.6): pyspark.ml.classification (21s)
      Finished test(python3.6): pyspark.ml.evaluation (11s)
      Finished test(python3.6): pyspark.ml.clustering (20s)
      Finished test(python3.6): pyspark.ml.linalg.__init__ (0s)
      Finished test(python3.6): pyspark.streaming.tests (240s)
      Finished test(python3.6): pyspark.tests (240s)
      Finished test(python3.6): pyspark.ml.recommendation (19s)
      Finished test(python3.6): pyspark.ml.feature (36s)
      Finished test(python3.6): pyspark.ml.regression (37s)
      Finished test(python3.6): pyspark.ml.tuning (28s)
      Finished test(python3.6): pyspark.mllib.classification (26s)
      Finished test(python3.6): pyspark.mllib.evaluation (18s)
      Finished test(python3.6): pyspark.mllib.clustering (44s)
      Finished test(python3.6): pyspark.mllib.linalg.__init__ (0s)
      Finished test(python3.6): pyspark.mllib.feature (26s)
      Finished test(python3.6): pyspark.mllib.fpm (23s)
      Finished test(python3.6): pyspark.mllib.random (8s)
      Finished test(python3.6): pyspark.ml.tests (92s)
      Finished test(python3.6): pyspark.mllib.stat.KernelDensity (0s)
      Finished test(python3.6): pyspark.mllib.linalg.distributed (25s)
      Finished test(python3.6): pyspark.mllib.stat._statistics (15s)
      Finished test(python3.6): pyspark.mllib.recommendation (24s)
      Finished test(python3.6): pyspark.mllib.regression (26s)
      Finished test(python3.6): pyspark.profiler (9s)
      Finished test(python3.6): pyspark.mllib.tree (16s)
      Finished test(python3.6): pyspark.shuffle (1s)
      Finished test(python3.6): pyspark.mllib.util (18s)
      Finished test(python3.6): pyspark.serializers (11s)
      Finished test(python3.6): pyspark.rdd (20s)
      Finished test(python3.6): pyspark.sql.conf (8s)
      Finished test(python3.6): pyspark.sql.catalog (17s)
      Finished test(python3.6): pyspark.sql.column (18s)
      Finished test(python3.6): pyspark.sql.context (18s)
      Finished test(python3.6): pyspark.sql.group (27s)
      Finished test(python3.6): pyspark.sql.dataframe (33s)
      Finished test(python3.6): pyspark.sql.functions (35s)
      Finished test(python3.6): pyspark.sql.types (6s)
      Finished test(python3.6): pyspark.sql.streaming (13s)
      Finished test(python3.6): pyspark.streaming.util (0s)
      Finished test(python3.6): pyspark.sql.session (16s)
      Finished test(python3.6): pyspark.sql.window (4s)
      Finished test(python3.6): pyspark.sql.readwriter (35s)
      Tests passed in 433 seconds
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16429 from HyukjinKwon/SPARK-19019.
      
      (cherry picked from commit 20e62806)
      Signed-off-by: Davies Liu <davies.liu@gmail.com>
      2ff36691
  7. Jan 13, 2017
    • Vinayak's avatar
      [SPARK-18687][PYSPARK][SQL] Backward compatibility - creating a Dataframe on a... · b2c9a2c8
      Vinayak authored
      [SPARK-18687][PYSPARK][SQL] Backward compatibility - creating a Dataframe on a new SQLContext object fails with a Derby error
      
      The change makes SQLContext reuse the active SparkSession during construction if the supplied SparkContext is the same as the currently active one. Without this change, a new SparkSession is instantiated, which results in a Derby error when attempting to create a DataFrame with the new SQLContext object, even though the SparkContext supplied to it is the same as the currently active one. Refer to https://issues.apache.org/jira/browse/SPARK-18687 for details on the error and a repro.
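
      A hedged sketch of the scenario (the data and column names are made up; the Derby error itself only shows up with a Hive-enabled build):

      ```python
      from pyspark.sql import SparkSession, SQLContext

      spark = SparkSession.builder.getOrCreate()   # in the reported case this was a Hive-enabled session
      sc = spark.sparkContext

      # Constructing a second SQLContext on the same, still-active SparkContext.
      # Without the change this instantiated a brand-new SparkSession (triggering the
      # Derby metastore error); with the change the active SparkSession is reused.
      sqlContext = SQLContext(sc)
      sqlContext.createDataFrame([(1, "a")], ["id", "value"]).show()
      ```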
      
      Existing unit tests and a new unit test added to pyspark-sql:
      
      /python/run-tests --python-executables=python --modules=pyspark-sql
      
      
      Author: Vinayak <vijoshi5@in.ibm.com>
      Author: Vinayak Joshi <vijoshi@users.noreply.github.com>
      
      Closes #16119 from vijoshi/SPARK-18687_master.
      
      (cherry picked from commit 285a7798)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      b2c9a2c8
  8. Jan 12, 2017
    • Liang-Chi Hsieh's avatar
      [SPARK-19055][SQL][PYSPARK] Fix SparkSession initialization when SparkContext is stopped · 042e32d1
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      During SparkSession initialization, we store the created SparkSession instance in the class variable _instantiatedContext, so that SparkSession.builder.getOrCreate() can later retrieve the existing instance.

      However, when the active SparkContext is stopped and we create another new SparkContext to use, the existing SparkSession is still associated with the stopped SparkContext, so operations on that SparkSession fail.

      We need to detect this case in SparkSession and renew the class variable _instantiatedContext when necessary.
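
      A hedged sketch of the failure scenario (the master URL and application names here are placeholders):

      ```python
      from pyspark import SparkContext
      from pyspark.sql import SparkSession

      spark1 = SparkSession.builder.master("local[2]").appName("demo").getOrCreate()
      spark1.sparkContext.stop()                 # stop the SparkContext backing the cached session

      sc = SparkContext("local[2]", "demo2")     # start a fresh SparkContext
      spark2 = SparkSession.builder.getOrCreate()

      # Before the fix, spark2 could still be the cached SparkSession bound to the
      # stopped SparkContext, so this call failed; after the fix the cached instance
      # is renewed and the query runs.
      spark2.range(3).collect()
      ```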
      
      ## How was this patch tested?
      
      New test added in PySpark.
      
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #16454 from viirya/fix-pyspark-sparksession.
      
      (cherry picked from commit c6c37b8a)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      042e32d1
  9. Jan 10, 2017
  10. Jan 08, 2017
  11. Dec 21, 2016
    • gatorsmile's avatar
      [SPARK-18949][SQL][BACKPORT-2.1] Add recoverPartitions API to Catalog · 0e51bb08
      gatorsmile authored
      ### What changes were proposed in this pull request?
      
      This PR is to backport https://github.com/apache/spark/pull/16356 to Spark 2.1.1 branch.
      
      ----
      
      Currently, we only have a SQL interface for recovering all the partitions in a table's directory and updating the catalog: `MSCK REPAIR TABLE` or `ALTER TABLE table RECOVER PARTITIONS`. (It is actually quite hard to remember `MSCK`, and not obvious what it means.)
      
      After the new "Scalable Partition Handling" work, table repair becomes much more important for making the data in a newly created, partitioned data source table visible.
      
      Thus, this PR adds it to the Catalog interface. After this PR, users can repair a table with:
      ```Scala
      spark.catalog.recoverPartitions("testTable")
      ```
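
      The equivalent call from PySpark would look like the following (a hedged sketch, assuming the Python `Catalog` exposes the same method and that `testTable` is an existing partitioned data source table):

      ```python
      spark.catalog.recoverPartitions("testTable")
      ```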
      
      ### How was this patch tested?
      Modified the existing test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16372 from gatorsmile/repairTable2.1.1.
      0e51bb08
  12. Dec 20, 2016
    • Liang-Chi Hsieh's avatar
      [SPARK-18281] [SQL] [PYSPARK] Remove timeout for reading data through socket for local iterator · cd297c39
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      There is a timeout failure when using `rdd.toLocalIterator()` or `df.toLocalIterator()` for a PySpark RDD and DataFrame:
      
          df = spark.createDataFrame([[1],[2],[3]])
          it = df.toLocalIterator()
          row = next(it)
      
          df2 = df.repartition(1000)  # create many empty partitions which increase materialization time so causing timeout
          it2 = df2.toLocalIterator()
          row = next(it2)
      
      The cause of this issue is that we open a socket on the JVM side to serve the data, and on the Python side we set a timeout for connecting to and reading from that socket. In Python we use a generator to read the data, so we only connect to the socket once we start asking it for data; if we don't consume it immediately, the connection times out.

      On the other side, the materialization time for RDD partitions is unpredictable, so we can't set a timeout for reading data through the socket; otherwise the read is very likely to fail.
      
      ## How was this patch tested?
      
      Added tests into PySpark.
      
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #16263 from viirya/fix-pyspark-localiterator.
      
      (cherry picked from commit 95c95b71)
      Signed-off-by: Davies Liu <davies.liu@gmail.com>
      cd297c39
  13. Dec 15, 2016
  14. Dec 14, 2016
  15. Dec 11, 2016
  16. Dec 08, 2016
    • Andrew Ray's avatar
      [SPARK-16589] [PYTHON] Chained cartesian produces incorrect number of records · e0173f14
      Andrew Ray authored
      
      ## What changes were proposed in this pull request?
      
      Fixes a bug in the Python implementation of the RDD cartesian product, related to batching, that showed up in repeated cartesian products with seemingly random results. The root cause was multiple iterators pulling from the same stream in the wrong order because of logic that ignored batching.
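
      A hedged sketch of the kind of chained cartesian product that misbehaved (assuming an active SparkContext `sc`; the counts shown are the expected, post-fix results):

      ```python
      rdd = sc.parallelize(range(10), 2)

      # Chained cartesian products; before the fix the record count (and contents)
      # could come out wrong because batching was ignored during deserialization.
      assert rdd.cartesian(rdd).count() == 100
      assert rdd.cartesian(rdd).cartesian(rdd).count() == 1000

      # The zip-with-cartesian combination covered by the new tests:
      assert rdd.zip(rdd).cartesian(rdd).count() == 100
      ```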
      
      `CartesianDeserializer` and `PairDeserializer` were changed to implement `_load_stream_without_unbatching` and borrow the one line implementation of `load_stream` from `BatchedSerializer`. The default implementation of `_load_stream_without_unbatching` was changed to give consistent results (always an iterable) so that it could be used without additional checks.
      
      `PairDeserializer` no longer extends `CartesianDeserializer`, as that relationship was not really proper. If desired, a new common superclass could be added.
      
      Both `CartesianDeserializer` and `PairDeserializer` now only extend `Serializer` (which has no `dump_stream` implementation) since they are only meant for *de*serialization.
      
      ## How was this patch tested?
      
      Additional unit tests (sourced from #14248) plus one for testing a cartesian with zip.
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #16121 from aray/fix-cartesian.
      
      (cherry picked from commit 3c68944b)
      Signed-off-by: Davies Liu <davies.liu@gmail.com>
      e0173f14
    • Liang-Chi Hsieh's avatar
      [SPARK-18667][PYSPARK][SQL] Change the way to group row in BatchEvalPythonExec... · 726217eb
      Liang-Chi Hsieh authored
      [SPARK-18667][PYSPARK][SQL] Change the way to group row in BatchEvalPythonExec so input_file_name function can work with UDF in pyspark
      
      ## What changes were proposed in this pull request?
      
      `input_file_name` doesn't return filename when working with UDF in PySpark. An example shows the problem:
      
          from pyspark.sql.functions import *
          from pyspark.sql.types import *
      
          def filename(path):
              return path
      
          sourceFile = udf(filename, StringType())
          spark.read.json("tmp.json").select(sourceFile(input_file_name())).show()
      
          +---------------------------+
          |filename(input_file_name())|
          +---------------------------+
          |                           |
          +---------------------------+
      
      The cause of this issue is that we group rows in `BatchEvalPythonExec` for batch processing of PythonUDFs. Currently we group the rows first and then evaluate expressions on them. If there are fewer rows than the required group size, the iterator is consumed to the end before evaluation. However, once the iterator reaches the end, we unset the input filename, so the input_file_name expression can't return the correct filename.

      This patch changes how the batch of rows is grouped: we evaluate the expressions first and then group the evaluated results into batches.
      
      ## How was this patch tested?
      
      Added unit test to PySpark.
      
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #16115 from viirya/fix-py-udf-input-filename.
      
      (cherry picked from commit 6a5a7254)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      726217eb
    • Patrick Wendell's avatar
      48aa6775
    • Patrick Wendell's avatar
      Preparing Spark release v2.1.0-rc2 · 08071749
      Patrick Wendell authored
      08071749
  17. Dec 07, 2016
  18. Dec 06, 2016
    • Shuai Lin's avatar
      [SPARK-18652][PYTHON] Include the example data and third-party licenses in pyspark package. · 65f5331a
      Shuai Lin authored
      
      ## What changes were proposed in this pull request?
      
      Since we already include the Python examples in the pyspark package, we should include the example data with it as well.

      We should also include the third-party licenses, since we distribute their jars with the pyspark package.
      
      ## How was this patch tested?
      
      Manually tested with python2.7 and python3.4
      ```sh
      $ ./build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Pmesos clean package
      $ cd python
      $ python setup.py sdist
      $ pip install  dist/pyspark-2.1.0.dev0.tar.gz
      
      $ ls -1 /usr/local/lib/python2.7/dist-packages/pyspark/data/
      graphx
      mllib
      streaming
      
      $ du -sh /usr/local/lib/python2.7/dist-packages/pyspark/data/
      600K    /usr/local/lib/python2.7/dist-packages/pyspark/data/
      
      $ ls -1  /usr/local/lib/python2.7/dist-packages/pyspark/licenses/|head -5
      LICENSE-AnchorJS.txt
      LICENSE-DPark.txt
      LICENSE-Mockito.txt
      LICENSE-SnapTree.txt
      LICENSE-antlr.txt
      ```
      
      Author: Shuai Lin <linshuai2012@gmail.com>
      
      Closes #16082 from lins05/include-data-in-pyspark-dist.
      
      (cherry picked from commit bd9a4a5a)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      65f5331a
  19. Dec 05, 2016
    • Tathagata Das's avatar
      [SPARK-18657][SPARK-18668] Make StreamingQuery.id persists across restart and... · 1946854a
      Tathagata Das authored
      [SPARK-18657][SPARK-18668] Make StreamingQuery.id persists across restart and not auto-generate StreamingQuery.name
      
      Here are the major changes in this PR.
      - Added the ability to recover `StreamingQuery.id` from checkpoint location, by writing the id to `checkpointLoc/metadata`.
      - Added `StreamingQuery.runId` which is unique for every query started and does not persist across restarts. This is to identify each restart of a query separately (same as earlier behavior of `id`).
      - Removed auto-generation of `StreamingQuery.name`. The purpose of name was to provide an identifier across restarts, but since id is precisely that, there is no need for an auto-generated name. This means name becomes purely cosmetic and is null by default.
      - Added `runId` to `StreamingQueryListener` events and `StreamingQueryProgress`.
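
      A hedged PySpark sketch of how the resulting identifiers behave (the socket source, port, and checkpoint path are placeholders, and the Python properties are assumed to mirror the Scala API):

      ```python
      stream = (spark.readStream.format("socket")
                .option("host", "localhost").option("port", 9999)   # assumes something is listening here
                .load())

      query = (stream.writeStream.format("console")
               .option("checkpointLocation", "/tmp/demo-checkpoint")
               .start())

      print(query.id)      # recovered from <checkpointLocation>/metadata, so it survives restarts
      print(query.runId)   # new for every (re)start of the query
      print(query.name)    # None unless set via queryName(); no longer auto-generated
      query.stop()
      ```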
      
      Implementation details
      - Renamed existing `StreamExecutionMetadata` to `OffsetSeqMetadata`, and moved it to the file `OffsetSeq.scala`, because that is what this metadata is tied to. Also did some refactoring to make the code cleaner (got rid of a lot of `.json` and `.getOrElse("{}")`).
      - Added the `id` as the new `StreamMetadata`.
      - When a StreamingQuery is created it gets or writes the `StreamMetadata` from `checkpointLoc/metadata`.
      - All internal logging in `StreamExecution` uses `(name, id, runId)` instead of just `name`
      
      TODO
      - [x] Test handling of name=null in json generation of StreamingQueryProgress
      - [x] Test handling of name=null in json generation of StreamingQueryListener events
      - [x] Test python API of runId
      
      Updated unit tests and new unit tests
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #16113 from tdas/SPARK-18657.
      
      (cherry picked from commit bb57bfe9)
      Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
      1946854a
    • Liang-Chi Hsieh's avatar
      [SPARK-18634][PYSPARK][SQL] Corruption and Correctness issues with exploding Python UDFs · fecd23d2
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      As reported in the Jira, there are some weird issues with exploding Python UDFs in SparkSQL.
      
      The following test code can reproduce it. Note: this test code is reported to return wrong results in the JIRA; however, when I tested it on the master branch, it threw an exception and so could not return any result.
      
          >>> from pyspark.sql.functions import *
          >>> from pyspark.sql.types import *
          >>>
          >>> df = spark.range(10)
          >>>
          >>> def return_range(value):
          ...   return [(i, str(i)) for i in range(value - 1, value + 1)]
          ...
          >>> range_udf = udf(return_range, ArrayType(StructType([StructField("integer_val", IntegerType()),
          ...                                                     StructField("string_val", StringType())])))
          >>>
          >>> df.select("id", explode(range_udf(df.id))).show()
          Traceback (most recent call last):
            File "<stdin>", line 1, in <module>
            File "/spark/python/pyspark/sql/dataframe.py", line 318, in show
              print(self._jdf.showString(n, 20))
            File "/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
            File "/spark/python/pyspark/sql/utils.py", line 63, in deco
              return f(*a, **kw)
            File "/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o126.showString.: java.lang.AssertionError: assertion failed
              at scala.Predef$.assert(Predef.scala:156)
              at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:120)
              at org.apache.spark.sql.execution.GenerateExec.consume(GenerateExec.scala:57)
      
      The cause of this issue is that in `ExtractPythonUDFs` we insert a `BatchEvalPythonExec` to run PythonUDFs in batch. `BatchEvalPythonExec` adds extra outputs (e.g., `pythonUDF0`) to the original plan. In the above case, the original `Range` has only one output, `id`; after `ExtractPythonUDFs`, the added `BatchEvalPythonExec` has two outputs, `id` and `pythonUDF0`.

      Because the output of `GenerateExec` is determined in the analysis phase, in the above case it is the combination of `id` (i.e., the output of `Range`) and `col`. But in the planning phase, we change `GenerateExec`'s child plan to `BatchEvalPythonExec`, which has additional output attributes.

      This causes no problem without whole-stage codegen, because during evaluation the additional attributes are projected out of the final output of `GenerateExec`.

      However, as `GenerateExec` now supports whole-stage codegen, the framework feeds all of the child plan's outputs into `GenerateExec`. Then, when consuming `GenerateExec`'s output data (i.e., calling `consume`), the number of output attributes differs from the number of output variables in whole-stage codegen.

      To solve this issue, this patch gives only the generator's output to `GenerateExec` after the analysis phase. `GenerateExec`'s output is the combination of its child plan's output and the generator's output, so when we change `GenerateExec`'s child, its output is still correct.
      
      ## How was this patch tested?
      
      Added test cases to PySpark.
      
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #16120 from viirya/fix-py-udf-with-generator.
      
      (cherry picked from commit 3ba69b64)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      fecd23d2
    • Shixiong Zhu's avatar
      [SPARK-18694][SS] Add StreamingQuery.explain and exception to Python and fix... · c6a4e3d9
      Shixiong Zhu authored
      [SPARK-18694][SS] Add StreamingQuery.explain and exception to Python and fix StreamingQueryException (branch 2.1)
      
      ## What changes were proposed in this pull request?
      
      Backport #16125 to branch 2.1.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16153 from zsxwing/SPARK-18694-2.1.
      c6a4e3d9
  20. Dec 02, 2016
  21. Dec 01, 2016
    • Sandeep Singh's avatar
      [SPARK-18274][ML][PYSPARK] Memory leak in PySpark JavaWrapper · 4c673c65
      Sandeep Singh authored
      
      ## What changes were proposed in this pull request?
      In `JavaWrapper`'s destructor, make the Java gateway dereference the wrapped Java object, using `SparkContext._active_spark_context._gateway.detach`.
      Fix the parameter-copying bug by moving the `copy` method from `JavaModel` to `JavaParams`.
      
      ## How was this patch tested?
      ```python
      import random, string
      from pyspark.ml.feature import StringIndexer
      
      l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) for _ in range(int(7e5))]  # 700000 random strings of 10 characters
      df = spark.createDataFrame(l, ['string'])
      
      for i in range(50):
          indexer = StringIndexer(inputCol='string', outputCol='index')
          indexer.fit(df)
      ```
      * Before: a strong reference to the StringIndexer was kept, causing GC issues, and the run halted midway.
      * After: garbage collection works as the object is dereferenced, and the computation completes.
      * Memory footprint tested using a profiler.
      * Added a parameter copy related test which was failing before.
      
      Author: Sandeep Singh <sandeep@techaddict.me>
      Author: jkbradley <joseph.kurata.bradley@gmail.com>
      
      Closes #15843 from techaddict/SPARK-18274.
      
      (cherry picked from commit 78bb7f80)
      Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
      4c673c65
  22. Nov 30, 2016
  23. Nov 29, 2016
    • Jeff Zhang's avatar
      [SPARK-15819][PYSPARK][ML] Add KMeanSummary in KMeans of PySpark · b95aad7c
      Jeff Zhang authored
      
      ## What changes were proposed in this pull request?
      
      Add a Python API for KMeansSummary.
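
      A hedged sketch of how the new summary might be used (assuming an active SparkSession; the toy data is made up):

      ```python
      from pyspark.ml.clustering import KMeans
      from pyspark.ml.linalg import Vectors

      df = spark.createDataFrame([(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
                                  (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)],
                                 ["features"])

      model = KMeans(k=2, seed=1).fit(df)

      print(model.hasSummary)        # True when the model was produced by fit()
      summary = model.summary        # the KMeansSummary added by this change
      print(summary.clusterSizes)    # e.g. [2, 2]
      summary.predictions.show()     # the input rows with their assigned cluster
      ```
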
      ## How was this patch tested?
      
      unit test added
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #13557 from zjffdu/SPARK-15819.
      
      (cherry picked from commit 4c82ca86)
      Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
      b95aad7c
    • Yuhao's avatar
      [SPARK-18319][ML][QA2.1] 2.1 QA: API: Experimental, DeveloperApi, final, sealed audit · eb0b3631
      Yuhao authored
      ## What changes were proposed in this pull request?
      Make a pass through the items marked as Experimental or DeveloperApi and see if any are stable enough to be unmarked. Also check items marked final or sealed to see if they are stable enough to be opened up as APIs.
      
      Some discussions are in the JIRA: https://issues.apache.org/jira/browse/SPARK-18319
      
      
      
      ## How was this patch tested?
      Existing unit tests.
      
      Author: Yuhao <yuhao.yang@intel.com>
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #15972 from hhbyyh/experimental21.
      
      (cherry picked from commit 9b670bca)
      Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
      eb0b3631
    • Tathagata Das's avatar
      [SPARK-18516][SQL] Split state and progress in streaming · 28b57c8a
      Tathagata Das authored
      
      This PR separates the status of a `StreamingQuery` into two separate APIs:
       - `status` - describes the status of a `StreamingQuery` at this moment, including what phase of processing is currently happening and if data is available.
       - `recentProgress` - an array of statistics about the most recent microbatches that have executed.
      
      A recent progress contains the following information:
      ```
      {
        "id" : "2be8670a-fce1-4859-a530-748f29553bb6",
        "name" : "query-29",
        "timestamp" : 1479705392724,
        "inputRowsPerSecond" : 230.76923076923077,
        "processedRowsPerSecond" : 10.869565217391303,
        "durationMs" : {
          "triggerExecution" : 276,
          "queryPlanning" : 3,
          "getBatch" : 5,
          "getOffset" : 3,
          "addBatch" : 234,
          "walCommit" : 30
        },
        "currentWatermark" : 0,
        "stateOperators" : [ ],
        "sources" : [ {
          "description" : "KafkaSource[Subscribe[topic-14]]",
          "startOffset" : {
            "topic-14" : {
              "2" : 0,
              "4" : 1,
              "1" : 0,
              "3" : 0,
              "0" : 0
            }
          },
          "endOffset" : {
            "topic-14" : {
              "2" : 1,
              "4" : 2,
              "1" : 0,
              "3" : 0,
              "0" : 1
            }
          },
          "numRecords" : 3,
          "inputRowsPerSecond" : 230.76923076923077,
          "processedRowsPerSecond" : 10.869565217391303
        } ]
      }
      ```
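
      From PySpark, the same information is exposed on the query object (a hedged sketch; `query` is assumed to be a running StreamingQuery):

      ```python
      # `status` describes what the query is doing at this moment ...
      print(query.status)

      # ... while `recentProgress` / `lastProgress` expose the per-microbatch statistics
      # as dicts shaped like the JSON above.
      for progress in query.recentProgress:
          print(progress["id"], progress["durationMs"], progress["inputRowsPerSecond"])

      last = query.lastProgress
      if last is not None:
          print(last["processedRowsPerSecond"])
      ```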
      
      Additionally, in order to make it possible to correlate progress updates across restarts, we change the `id` field from an integer that is unique within the JVM to a `UUID` that is globally unique.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #15954 from marmbrus/queryProgress.
      
      (cherry picked from commit c3d08e2f)
      Signed-off-by: Michael Armbrust <michael@databricks.com>
      28b57c8a