  1. May 24, 2017
  2. May 10, 2017
    • [SPARK-20685] Fix BatchPythonEvaluation bug in case of single UDF w/ repeated arg. · 92a71a66
      Josh Rosen authored
      
      ## What changes were proposed in this pull request?
      
      There's a latent corner-case bug in PySpark UDF evaluation where executing a `BatchPythonEvaluation` with a single multi-argument UDF where _at least one argument value is repeated_ will crash at execution with a confusing error.
      
      This problem was introduced in #12057: the code there has a fast path for handling a "batch UDF evaluation consisting of a single Python UDF", but that branch incorrectly assumes that a single UDF won't have repeated arguments and therefore skips the code for unpacking arguments from the input row (whose schema may not necessarily match the UDF inputs due to de-duplication of repeated arguments which occurred in the JVM before sending UDF inputs to Python).
      
      The fix here is simply to remove this special-casing: it turns out that the code in the "multiple UDFs" branch just so happens to work for the single-UDF case because Python treats `(x)` as equivalent to `x`, not as a single-argument tuple.
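
      A minimal repro sketch of the failing shape (illustrative only; assumes a local SparkSession and a toy column name):

      ```python
      from pyspark.sql import SparkSession
      from pyspark.sql.functions import col, udf
      from pyspark.sql.types import IntegerType

      spark = SparkSession.builder.master("local[1]").getOrCreate()
      df = spark.createDataFrame([(1,), (2,)], ["x"])

      # A single UDF whose two arguments are the same column: the JVM de-duplicates the
      # repeated input, so the row sent to Python has fewer fields than the UDF expects.
      add = udf(lambda a, b: a + b, IntegerType())
      df.select(add(col("x"), col("x"))).collect()  # crashed before this fix
      ```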
      
      ## How was this patch tested?
      
      New regression test in the `pyspark.sql.tests` module (tested and confirmed that it fails before my fix).
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #17927 from JoshRosen/SPARK-20685.
      
      (cherry picked from commit 8ddbc431)
      Signed-off-by: Xiao Li <gatorsmile@gmail.com>
      92a71a66
    • [SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency should... · 69786ea3
      zero323 authored
      [SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency should use values not Params
      
      ## What changes were proposed in this pull request?
      
      - Replace `getParam` calls with `getOrDefault` calls (see the sketch after this list).
      - Fix the exception message to avoid an unintended `TypeError`.
      - Add unit tests.
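
      A hedged sketch of the distinction driving the first item (estimator and values are illustrative):

      ```python
      from pyspark.ml.classification import LogisticRegression

      lr = LogisticRegression(threshold=0.5)

      lr.getParam("threshold")      # returns the Param descriptor itself, not a number
      lr.getOrDefault("threshold")  # returns the value 0.5 (or the default if unset)
      # The consistency check needs the values, not the Param descriptors.
      ```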
      
      ## How was this patch tested?
      
      New unit tests.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17891 from zero323/SPARK-20631.
      
      (cherry picked from commit 804949c6)
      Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
      69786ea3
  3. Apr 25, 2017
  4. Apr 14, 2017
  5. Apr 10, 2017
    • [SPARK-20285][TESTS] Increase the pyspark streaming test timeout to 30 seconds · 489c1f35
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Saw the following failure locally:
      
      ```
      Traceback (most recent call last):
        File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 351, in test_cogroup
          self._test_func(input, func, expected, sort=True, input2=input2)
        File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 162, in _test_func
          self.assertEqual(expected, result)
      AssertionError: Lists differ: [[(1, ([1], [2])), (2, ([1], [... != []
      
      First list contains 3 additional elements.
      First extra element 0:
      [(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))]
      
      + []
      - [[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))],
      -  [(1, ([1, 1, 1], [])), (2, ([1], [])), (4, ([], [1]))],
      -  [('', ([1, 1], [1, 2])), ('a', ([1, 1], [1, 1])), ('b', ([1], [1]))]]
      ```
      
      It also happened on Jenkins: http://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.7/120
      
      
      
      This happens because, when the machine is overloaded, the timeout is not long enough. This PR just increases the timeout to 30 seconds.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #17597 from zsxwing/SPARK-20285.
      
      (cherry picked from commit f9a50ba2)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      489c1f35
  6. Apr 05, 2017
    • [SPARK-20214][ML] Make sure converted csc matrix has sorted indices · fb81a412
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      `_convert_to_vector` converts a scipy sparse matrix to a csc matrix for initializing `SparseVector`. However, it doesn't guarantee the converted csc matrix has sorted indices, so a failure happens when you do something like this:
      
          from scipy.sparse import lil_matrix
          lil = lil_matrix((4, 1))
          lil[1, 0] = 1
          lil[3, 0] = 2
          _convert_to_vector(lil.todok())
      
          File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 78, in _convert_to_vector
            return SparseVector(l.shape[0], csc.indices, csc.data)
          File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 556, in __init__
            % (self.indices[i], self.indices[i + 1]))
          TypeError: Indices 3 and 1 are not strictly increasing
      
      A simple test can confirm that `dok_matrix.tocsc()` won't guarantee sorted indices:
      
          >>> from scipy.sparse import lil_matrix
          >>> lil = lil_matrix((4, 1))
          >>> lil[1, 0] = 1
          >>> lil[3, 0] = 2
          >>> dok = lil.todok()
          >>> csc = dok.tocsc()
          >>> csc.has_sorted_indices
          0
          >>> csc.indices
          array([3, 1], dtype=int32)
      
      I checked the scipy source code. The only way to guarantee sorted indices is to convert via `csc_matrix.tocsr()` and `csr_matrix.tocsc()`.
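
      For illustration, a hedged scipy-only sketch of that round-trip (no Spark required):

      ```python
      from scipy.sparse import lil_matrix

      lil = lil_matrix((4, 1))
      lil[1, 0] = 1
      lil[3, 0] = 2

      csc = lil.todok().tocsc()
      csc.has_sorted_indices            # may be 0, as shown above

      csc_sorted = csc.tocsr().tocsc()  # the round-trip described in this message
      csc_sorted.has_sorted_indices     # 1
      csc_sorted.indices                # array([1, 3], dtype=int32)
      ```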
      
      ## How was this patch tested?
      
      Existing tests.
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #17532 from viirya/make-sure-sorted-indices.
      
      (cherry picked from commit 12206058)
      Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
      fb81a412
  7. Mar 28, 2017
  8. Mar 27, 2017
    • [SPARK-20102] Fix nightly packaging and RC packaging scripts w/ two minor build fixes · 4056191d
      Josh Rosen authored
      
      ## What changes were proposed in this pull request?
      
      The master snapshot publisher builds are currently broken due to two minor build issues:
      
      1. For unknown reasons, the LFTP `mkdir -p` command began throwing errors when the remote directory already exists. This change of behavior might have been caused by configuration changes in the ASF's SFTP server, but I'm not entirely sure of that. To work around this problem, this patch updates the script to ignore errors from the `lftp mkdir -p` commands.
      2. The PySpark `setup.py` file references a non-existent `pyspark.ml.stat` module, causing Python packaging to fail by complaining about a missing directory. The fix is to simply drop that line from the setup script.
      
      ## How was this patch tested?
      
      The LFTP fix was tested by manually running the failing commands on AMPLab Jenkins against the ASF SFTP server. The PySpark fix was tested locally.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #17437 from JoshRosen/spark-20102.
      
      (cherry picked from commit 314cf51d)
      Signed-off-by: Josh Rosen <joshrosen@databricks.com>
      4056191d
  9. Mar 21, 2017
  10. Mar 17, 2017
    • [SPARK-19986][TESTS] Make pyspark.streaming.tests.CheckpointTests more stable · 5fb70831
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      Sometimes, CheckpointTests will hang on a busy machine because the streaming jobs are too slow and cannot catch up. I observed locally that the scheduling delay kept increasing for dozens of seconds.
      
      This PR increases the batch interval from 0.5 seconds to 2 seconds to generate fewer Spark jobs. It should make `pyspark.streaming.tests.CheckpointTests` more stable. I also replaced `sleep` with `awaitTerminationOrTimeout` so that if the streaming job fails, it will also fail the test.
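
      A hedged sketch of that swap (`sc` is an existing SparkContext; the 30-second figure is illustrative):

      ```python
      from pyspark.streaming import StreamingContext

      ssc = StreamingContext(sc, batchDuration=2)  # 2-second batches instead of 0.5
      # ... build the DStreams under test ...
      ssc.start()
      # Unlike time.sleep(30), this returns as soon as the streaming context terminates,
      # so a failed job surfaces its error instead of being slept over.
      ssc.awaitTerminationOrTimeout(30)
      ```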
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #17323 from zsxwing/SPARK-19986.
      
      (cherry picked from commit 376d7821)
      Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
      5fb70831
  11. Mar 15, 2017
    • [SPARK-19872] [PYTHON] Use the correct deserializer for RDD construction for coalesce/repartition · 06225463
      hyukjinkwon authored
      
      ## What changes were proposed in this pull request?
      
      This PR proposes to use the correct deserializer, `BatchedSerializer`, for RDD construction for coalesce/repartition when the shuffle is enabled. Currently, the `UTF8Deserializer` from the copied RDD is passed as-is instead of a `BatchedSerializer`.
      
      With the file `text.txt` below:
      
      ```
      a
      b
      
      d
      e
      f
      g
      h
      i
      j
      k
      l
      
      ```
      
      - Before
      
      ```python
      >>> sc.textFile('text.txt').repartition(1).collect()
      ```
      
      ```
      UTF8Deserializer(True)
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark/python/pyspark/rdd.py", line 811, in collect
          return list(_load_from_socket(port, self._jrdd_deserializer))
        File ".../spark/python/pyspark/serializers.py", line 549, in load_stream
          yield self.loads(stream)
        File ".../spark/python/pyspark/serializers.py", line 544, in loads
          return s.decode("utf-8") if self.use_unicode else s
        File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
          return codecs.utf_8_decode(input, errors, True)
      UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
      ```
      
      - After
      
      ```python
      >>> sc.textFile('text.txt').repartition(1).collect()
      ```
      
      ```
      [u'a', u'b', u'', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l', u'']
      ```
      
      ## How was this patch tested?
      
      Unit test in `python/pyspark/tests.py`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17282 from HyukjinKwon/SPARK-19872.
      
      (cherry picked from commit 7387126f)
      Signed-off-by: Davies Liu <davies.liu@gmail.com>
      06225463
  12. Mar 09, 2017
    • [SPARK-19561][SQL] add int case handling for TimestampType · 2a76e242
      Jason White authored
      ## What changes were proposed in this pull request?
      
      Add handling of input of type `Int` for dataType `TimestampType` to `EvaluatePython.scala`. Py4J serializes ints smaller than MIN_INT or larger than MAX_INT to Long, which are handled correctly already, but values between MIN_INT and MAX_INT are serialized to Int.
      
      These range limits correspond to roughly half an hour on either side of the epoch. As a result, PySpark doesn't allow TimestampType values to be created in this range.
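
      A hedged sketch of a value in that window (assuming a UTC machine timezone; `spark` is an active SparkSession):

      ```python
      import datetime

      # ~10 minutes after the epoch: the internal value (600 * 10**6 microseconds)
      # fits in a 32-bit int, so Py4J sends it to the JVM as an Int rather than a Long.
      ts = datetime.datetime(1970, 1, 1, 0, 10, 0)
      df = spark.createDataFrame([(ts,)], ["t"])
      df.collect()  # failed before this change; works once the Int case is handled
      ```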
      
      Alternatives attempted: patching the `TimestampType.toInternal` function to cast return values to `long`, so Py4J would always serialize them to Scala Long. Python3 does not have a `long` type, so this approach failed on Python3.
      
      ## How was this patch tested?
      
      Added a new PySpark-side test that fails without the change.
      
      The contribution is my original work and I license the work to the project under the project’s open source license.
      
      Resubmission of https://github.com/apache/spark/pull/16896. The original PR didn't go through Jenkins and broke the build.

      davies dongjoon-hyun cloud-fan Could you kick off a Jenkins run for me? It passed everything for me locally, but it's possible something has changed in the last few weeks.
      
      Author: Jason White <jason.white@shopify.com>
      
      Closes #17200 from JasonMWhite/SPARK-19561.
      
      (cherry picked from commit 206030bd)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      2a76e242
  13. Mar 07, 2017
    • [SPARK-19348][PYTHON] PySpark keyword_only decorator is not thread-safe · 0ba9ecbe
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      The `keyword_only` decorator in PySpark is not thread-safe.  It writes kwargs to a static class variable in the decorator, which is then retrieved later in the class method as `_input_kwargs`.  If multiple threads are constructing the same class with different kwargs, it becomes a race condition to read from the static class variable before it's overwritten.  See [SPARK-19348](https://issues.apache.org/jira/browse/SPARK-19348) for reproduction code.
      
      This change will write the kwargs to a member variable so that multiple threads can operate on separate instances without the race condition.  It does not protect against multiple threads operating on a single instance, but that is better left to the user to synchronize.
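
      A hedged sketch of the pattern (simplified; not the exact Spark source):

      ```python
      import functools

      def keyword_only(func):
          """Force keyword arguments and stash them on the instance, not the class."""
          @functools.wraps(func)
          def wrapper(self, *args, **kwargs):
              if args:
                  raise TypeError("Method %s only takes keyword arguments." % func.__name__)
              # Per-instance storage: constructing different objects from different
              # threads no longer races on a shared class-level variable.
              self._input_kwargs = kwargs
              return func(self, **kwargs)
          return wrapper
      ```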
      
      ## How was this patch tested?
      Added new unit tests for using the keyword_only decorator and a regression test that verifies `_input_kwargs` can be overwritten from different class instances.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #17193 from BryanCutler/pyspark-keyword_only-threadsafe-SPARK-19348-2_1.
      0ba9ecbe
    • Wenchen Fan · cbc37007
    • [SPARK-19561] [PYTHON] cast TimestampType.toInternal output to long · 711addd4
      Jason White authored
      
      ## What changes were proposed in this pull request?
      
      Cast the output of `TimestampType.toInternal` to long to allow for proper Timestamp creation in DataFrames near the epoch.
      
      ## How was this patch tested?
      
      Added a new test that fails without the change.
      
      dongjoon-hyun davies Mind taking a look?
      
      The contribution is my original work and I license the work to the project under the project’s open source license.
      
      Author: Jason White <jason.white@shopify.com>
      
      Closes #16896 from JasonMWhite/SPARK-19561.
      
      (cherry picked from commit 6f468462)
      Signed-off-by: Davies Liu <davies.liu@gmail.com>
      711addd4
  14. Feb 25, 2017
    • [SPARK-14772][PYTHON][ML] Fixed Params.copy method to match Scala implementation · 20a43295
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      Fixed the PySpark Params.copy method to behave like the Scala implementation.  The main issue was that it did not account for the _defaultParamMap and merged it into the explicitly created param map.
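
      A hedged usage sketch of the expected behavior (estimator and values are illustrative):

      ```python
      from pyspark.ml.classification import LogisticRegression

      lr = LogisticRegression(maxIter=5)        # maxIter is explicitly set
      lr2 = lr.copy({lr.threshold: 0.8})        # extra params apply only to the copy

      assert lr2.uid == lr.uid                  # uid is preserved
      assert lr2.getOrDefault("maxIter") == 5   # explicitly set params are copied
      assert not lr2.isSet(lr2.regParam)        # defaults are not folded into the explicit param map
      ```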
      
      ## How was this patch tested?
      Added new unit test to verify the copy method behaves correctly for copying uid, explicitly created params, and default params.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #17048 from BryanCutler/pyspark-ml-param_copy-Scala_sync-SPARK-14772-2_1.
      20a43295
  15. Feb 15, 2017
  16. Feb 13, 2017
  17. Jan 25, 2017
  18. Jan 20, 2017
    • Davies Liu's avatar
      [SPARK-18589][SQL] Fix Python UDF accessing attributes from both side of join · 4d286c90
      Davies Liu authored
      PythonUDF is unevaluable, so it cannot be used inside a join condition. Currently the optimizer will push a PythonUDF that accesses both sides of a join into the join condition, and the query will then fail to plan.
      
      This PR fixes the issue by checking whether the expression is evaluable before pushing it into the join condition.
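
      A hedged repro sketch of the failing shape (toy tables; `spark` is an active SparkSession):

      ```python
      from pyspark.sql.functions import udf
      from pyspark.sql.types import BooleanType

      left = spark.createDataFrame([(1, 2)], ["id", "a"])
      right = spark.createDataFrame([(1, 3)], ["id", "b"])

      # A Python UDF whose predicate needs columns from both sides of the join.
      both_sides = udf(lambda a, b: a < b, BooleanType())
      left.join(right, "id").where(both_sides(left.a, right.b)).collect()  # failed to plan before this fix
      ```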
      
      Add a regression test.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #16581 from davies/pyudf_join.
      4d286c90
  19. Jan 17, 2017
    • [SPARK-19019] [PYTHON] Fix hijacked `collections.namedtuple` and port... · 2ff36691
      hyukjinkwon authored
      [SPARK-19019] [PYTHON] Fix hijacked `collections.namedtuple` and port cloudpickle changes for PySpark to work with Python 3.6.0
      
      ## What changes were proposed in this pull request?
      
      Currently, PySpark does not work with Python 3.6.0.
      
      Running `./bin/pyspark` simply throws the error as below and PySpark does not work at all:
      
      ```
      Traceback (most recent call last):
        File ".../spark/python/pyspark/shell.py", line 30, in <module>
          import pyspark
        File ".../spark/python/pyspark/__init__.py", line 46, in <module>
          from pyspark.context import SparkContext
        File ".../spark/python/pyspark/context.py", line 36, in <module>
          from pyspark.java_gateway import launch_gateway
        File ".../spark/python/pyspark/java_gateway.py", line 31, in <module>
          from py4j.java_gateway import java_import, JavaGateway, GatewayClient
        File "<frozen importlib._bootstrap>", line 961, in _find_and_load
        File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
        File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
        File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 18, in <module>
        File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py", line 62, in <module>
          import pkgutil
        File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py", line 22, in <module>
          ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg')
        File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple
          cls = _old_namedtuple(*args, **kwargs)
      TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
      ```
      
      The root cause seems to be that some arguments of `namedtuple` became keyword-only arguments as of Python 3.6.0 (see https://bugs.python.org/issue25628).

      We currently copy this function via `types.FunctionType`, which does not set the default values of keyword-only arguments (meaning `namedtuple.__kwdefaults__`), and this seems to cause internally missing values in the function (non-bound arguments).
      
      This PR proposes to work around this by manually setting the defaults via `kwargs`, as `types.FunctionType` does not seem to support setting them.
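
      A minimal, hedged illustration of that root cause (a toy function standing in for `namedtuple`):

      ```python
      import types

      def f(a, *, flag=True):
          return (a, flag)

      g = types.FunctionType(f.__code__, f.__globals__, f.__name__, f.__defaults__, f.__closure__)
      f.__kwdefaults__    # {'flag': True}
      g.__kwdefaults__    # None -- so g(1) raises TypeError: missing keyword-only argument 'flag'

      g.__kwdefaults__ = dict(f.__kwdefaults__)  # restore the keyword-only defaults manually
      g(1)                # (1, True)
      ```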
      
      Also, this PR ports the changes in cloudpickle for compatibility for Python 3.6.0.
      
      ## How was this patch tested?
      
      Manually tested with Python 2.7.6 and Python 3.6.0.
      
      ```
      ./bin/pyspark
      ```
      
      , manual creation of `namedtuple` both locally and in an RDD with Python 3.6.0,
      
      and Jenkins tests for other Python versions.
      
      Also,
      
      ```
      ./run-tests --python-executables=python3.6
      ```
      
      ```
      Will test against the following Python executables: ['python3.6']
      Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
      Finished test(python3.6): pyspark.sql.tests (192s)
      Finished test(python3.6): pyspark.accumulators (3s)
      Finished test(python3.6): pyspark.mllib.tests (198s)
      Finished test(python3.6): pyspark.broadcast (3s)
      Finished test(python3.6): pyspark.conf (2s)
      Finished test(python3.6): pyspark.context (14s)
      Finished test(python3.6): pyspark.ml.classification (21s)
      Finished test(python3.6): pyspark.ml.evaluation (11s)
      Finished test(python3.6): pyspark.ml.clustering (20s)
      Finished test(python3.6): pyspark.ml.linalg.__init__ (0s)
      Finished test(python3.6): pyspark.streaming.tests (240s)
      Finished test(python3.6): pyspark.tests (240s)
      Finished test(python3.6): pyspark.ml.recommendation (19s)
      Finished test(python3.6): pyspark.ml.feature (36s)
      Finished test(python3.6): pyspark.ml.regression (37s)
      Finished test(python3.6): pyspark.ml.tuning (28s)
      Finished test(python3.6): pyspark.mllib.classification (26s)
      Finished test(python3.6): pyspark.mllib.evaluation (18s)
      Finished test(python3.6): pyspark.mllib.clustering (44s)
      Finished test(python3.6): pyspark.mllib.linalg.__init__ (0s)
      Finished test(python3.6): pyspark.mllib.feature (26s)
      Finished test(python3.6): pyspark.mllib.fpm (23s)
      Finished test(python3.6): pyspark.mllib.random (8s)
      Finished test(python3.6): pyspark.ml.tests (92s)
      Finished test(python3.6): pyspark.mllib.stat.KernelDensity (0s)
      Finished test(python3.6): pyspark.mllib.linalg.distributed (25s)
      Finished test(python3.6): pyspark.mllib.stat._statistics (15s)
      Finished test(python3.6): pyspark.mllib.recommendation (24s)
      Finished test(python3.6): pyspark.mllib.regression (26s)
      Finished test(python3.6): pyspark.profiler (9s)
      Finished test(python3.6): pyspark.mllib.tree (16s)
      Finished test(python3.6): pyspark.shuffle (1s)
      Finished test(python3.6): pyspark.mllib.util (18s)
      Finished test(python3.6): pyspark.serializers (11s)
      Finished test(python3.6): pyspark.rdd (20s)
      Finished test(python3.6): pyspark.sql.conf (8s)
      Finished test(python3.6): pyspark.sql.catalog (17s)
      Finished test(python3.6): pyspark.sql.column (18s)
      Finished test(python3.6): pyspark.sql.context (18s)
      Finished test(python3.6): pyspark.sql.group (27s)
      Finished test(python3.6): pyspark.sql.dataframe (33s)
      Finished test(python3.6): pyspark.sql.functions (35s)
      Finished test(python3.6): pyspark.sql.types (6s)
      Finished test(python3.6): pyspark.sql.streaming (13s)
      Finished test(python3.6): pyspark.streaming.util (0s)
      Finished test(python3.6): pyspark.sql.session (16s)
      Finished test(python3.6): pyspark.sql.window (4s)
      Finished test(python3.6): pyspark.sql.readwriter (35s)
      Tests passed in 433 seconds
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16429 from HyukjinKwon/SPARK-19019.
      
      (cherry picked from commit 20e62806)
      Signed-off-by: Davies Liu <davies.liu@gmail.com>
      2ff36691
  20. Jan 13, 2017
    • [SPARK-18687][PYSPARK][SQL] Backward compatibility - creating a Dataframe on a... · b2c9a2c8
      Vinayak authored
      [SPARK-18687][PYSPARK][SQL] Backward compatibility - creating a Dataframe on a new SQLContext object fails with a Derby error
      
      The change is for SQLContext to reuse the active SparkSession during construction if the SparkContext supplied is the same as the currently active one. Without this change, a new SparkSession is instantiated, which results in a Derby error when attempting to create a DataFrame using a new SQLContext object even though the SparkContext supplied to the new SQLContext is the same as the currently active one. Refer to https://issues.apache.org/jira/browse/SPARK-18687 for details on the error and a repro.
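
      A hedged sketch of the repro shape (toy data; see the JIRA above for the full reproduction):

      ```python
      from pyspark import SparkContext
      from pyspark.sql import SQLContext

      sc = SparkContext.getOrCreate()
      sqlc1 = SQLContext(sc)
      sqlc2 = SQLContext(sc)  # with this change, reuses the already-active SparkSession

      # Previously this instantiated a second SparkSession and could hit the Derby error.
      sqlc2.createDataFrame([(1, "a")], ["id", "v"]).show()
      ```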
      
      Existing unit tests and a new unit test added to pyspark-sql:
      
      /python/run-tests --python-executables=python --modules=pyspark-sql
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Vinayak <vijoshi5@in.ibm.com>
      Author: Vinayak Joshi <vijoshi@users.noreply.github.com>
      
      Closes #16119 from vijoshi/SPARK-18687_master.
      
      (cherry picked from commit 285a7798)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      b2c9a2c8
  21. Jan 12, 2017
    • [SPARK-19055][SQL][PYSPARK] Fix SparkSession initialization when SparkContext is stopped · 042e32d1
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      In SparkSession initialization, we store the created SparkSession instance in a class variable, `_instantiatedContext`. Next time, we can use `SparkSession.builder.getOrCreate()` to retrieve the existing SparkSession instance.
      
      However, when the active SparkContext is stopped and we create another new SparkContext to use, the existing SparkSession is still associated with the stopped SparkContext. So operations with this existing SparkSession will fail.
      
      We need to detect such a case in SparkSession and renew the class variable `_instantiatedContext` when needed.
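
      A hedged sketch of the failing sequence (local master; illustrative):

      ```python
      from pyspark import SparkContext
      from pyspark.sql import SparkSession

      spark1 = SparkSession.builder.master("local[1]").getOrCreate()
      spark1.sparkContext.stop()

      sc = SparkContext("local[1]", "new-context")  # a fresh SparkContext
      spark2 = SparkSession.builder.getOrCreate()
      # Before the fix, spark2 could be the stale session bound to the stopped
      # SparkContext, so even simple operations failed:
      spark2.range(3).collect()
      ```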
      
      ## How was this patch tested?
      
      New test added in PySpark.
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #16454 from viirya/fix-pyspark-sparksession.
      
      (cherry picked from commit c6c37b8a)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      042e32d1
  22. Jan 10, 2017
  23. Jan 08, 2017
  24. Dec 21, 2016
    • [SPARK-18949][SQL][BACKPORT-2.1] Add recoverPartitions API to Catalog · 0e51bb08
      gatorsmile authored
      ### What changes were proposed in this pull request?
      
      This PR is to backport https://github.com/apache/spark/pull/16356 to Spark 2.1.1 branch.
      
      ----
      
      Currently, we only have a SQL interface for recovering all the partitions in the directory of a table and updating the catalog: `MSCK REPAIR TABLE` or `ALTER TABLE table RECOVER PARTITIONS`. (Actually, it is very hard for me to remember `MSCK`, and I have no clue what it means.)
      
      After the new "Scalable Partition Handling", table repair becomes much more important for making the data in a newly created data-source partitioned table visible.
      
      Thus, this PR adds it to the Catalog interface. After this PR, users can repair the table with:
      ```Scala
      spark.catalog.recoverPartitions("testTable")
      ```
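
      The hedged PySpark equivalent, assuming the Python `Catalog` mirrors the Scala API here:

      ```python
      spark.catalog.recoverPartitions("testTable")
      ```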
      
      ### How was this patch tested?
      Modified the existing test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16372 from gatorsmile/repairTable2.1.1.
      0e51bb08
  25. Dec 20, 2016
    • [SPARK-18281] [SQL] [PYSPARK] Remove timeout for reading data through socket for local iterator · cd297c39
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      There is a timeout failure when using `rdd.toLocalIterator()` or `df.toLocalIterator()` for a PySpark RDD and DataFrame:
      
          df = spark.createDataFrame([[1],[2],[3]])
          it = df.toLocalIterator()
          row = next(it)
      
          df2 = df.repartition(1000)  # create many empty partitions which increase materialization time so causing timeout
          it2 = df2.toLocalIterator()
          row = next(it2)
      
      The cause of this issue is that we open a socket to serve the data from the JVM side and set a timeout for connecting and reading through the socket on the Python side. In Python we use a generator to read the data, so we only connect to the socket once we start asking for data from it. If we don't consume the iterator immediately, there is a connection timeout.
      
      On the other side, the materialization time for RDD partitions is unpredictable, so we can't set a timeout for reading data through the socket; otherwise it is very likely to fail.
      
      ## How was this patch tested?
      
      Added tests into PySpark.
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #16263 from viirya/fix-pyspark-localiterator.
      
      (cherry picked from commit 95c95b71)
      Signed-off-by: Davies Liu <davies.liu@gmail.com>
      cd297c39
  26. Dec 15, 2016