  1. Mar 01, 2018
    • KaiXinXiaoLei's avatar
      [SPARK-23405] Generate additional constraints for Join's children · cdcccd7b
      KaiXinXiaoLei authored
      ## What changes were proposed in this pull request?
      
      I ran this SQL: `select ls.cs_order_number from ls left semi join catalog_sales cs on ls.cs_order_number = cs.cs_order_number`. The `ls` table is small (one row) and the `catalog_sales` table is big (10 billion rows), and the query hung. I found many null values of `cs_order_number` in the `catalog_sales` table, and I think those null values should be filtered out in the logical plan.
      
      ```
      == Optimized Logical Plan ==
      Join LeftSemi, (cs_order_number#1 = cs_order_number#22)
      :- Project cs_order_number#1
      :  +- Filter isnotnull(cs_order_number#1)
      :     +- MetastoreRelation 100t, ls
      +- Project cs_order_number#22
         +- MetastoreRelation 100t, catalog_sales
      ```
      
      With this patch, the plan becomes (note the new `Filter isnotnull(cs_order_number#22)` on the `catalog_sales` side):
      ```
      == Optimized Logical Plan ==
      Join LeftSemi, (cs_order_number#1 = cs_order_number#22)
      :- Project cs_order_number#1
      :  +- Filter isnotnull(cs_order_number#1)
      :     +- MetastoreRelation 100t, ls
      +- Project cs_order_number#22
         +- Filter isnotnull(cs_order_number#22)
            +- MetastoreRelation 100t, catalog_sales
      ```
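
      For reference, a quick way to compare the plans (a sketch; it assumes a SparkSession `spark` with the `ls` and `catalog_sales` tables registered) is to print the optimized plan directly:

      ```scala
      // Sketch: print the optimized logical plan of the query above to check
      // whether an isnotnull filter was inferred on both join children.
      val optimized = spark.sql(
        """SELECT ls.cs_order_number FROM ls
          |LEFT SEMI JOIN catalog_sales cs
          |ON ls.cs_order_number = cs.cs_order_number""".stripMargin)
        .queryExecution.optimizedPlan
      println(optimized.treeString)
      ```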
      
      ## How was this patch tested?
      
      
      Author: KaiXinXiaoLei <584620569@qq.com>
      Author: hanghang <584620569@qq.com>
      
      Closes #20670 from KaiXinXiaoLei/Spark-23405.
      cdcccd7b
    • Yuming Wang's avatar
      [SPARK-23510][SQL] Support Hive 2.2 and Hive 2.3 metastore · ff148018
      Yuming Wang authored
      ## What changes were proposed in this pull request?
      This is based on https://github.com/apache/spark/pull/20668 and adds support for the Hive 2.2 and Hive 2.3 metastore.
      
      When we merge the PR, we should give the major credit to wangyum.
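
      For context, this is how a user would point Spark at one of the newly supported metastore versions (a sketch; the exact version string `2.3.2` is an assumption, and the config keys are the standard isolated-client-loader settings):

      ```scala
      import org.apache.spark.sql.SparkSession

      // Sketch: connect to a Hive 2.3.x metastore; "maven" tells Spark to
      // download the matching Hive client jars at runtime.
      val spark = SparkSession.builder()
        .appName("hive-2.3-metastore")
        .config("spark.sql.hive.metastore.version", "2.3.2") // assumed version string
        .config("spark.sql.hive.metastore.jars", "maven")
        .enableHiveSupport()
        .getOrCreate()
      ```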
      
      ## How was this patch tested?
      Added test cases.
      
      Author: Yuming Wang <yumwang@ebay.com>
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #20671 from gatorsmile/pr-20668.
      ff148018
    • liuxian's avatar
      [SPARK-23389][CORE] When the shuffle dependency specifies aggregation, and... · 22f3d333
      liuxian authored
      [SPARK-23389][CORE] When the shuffle dependency specifies aggregation, and `dependency.mapSideCombine = false`, we should be able to use serialized sorting.
      
      ## What changes were proposed in this pull request?
      When the shuffle dependency specifies aggregation and `dependency.mapSideCombine = false`, there is no need for aggregation or sorting on the map side, so we should be able to use serialized sorting.
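
      A simplified sketch of the eligibility check this change relaxes (illustrative only; the real logic lives in Spark's internal `SortShuffleManager`, and the partition bound here is an assumed value):

      ```scala
      import org.apache.spark.ShuffleDependency

      // Assumed upper bound on reduce partitions for serialized shuffle.
      val MaxSerializedModePartitions: Int = 1 << 24

      // Mirrors the idea of the internal check: only map-side combine rules
      // out serialized sorting; merely having an aggregator with
      // mapSideCombine = false no longer does.
      def canUseSerializedShuffle(dep: ShuffleDependency[_, _, _]): Boolean = {
        dep.serializer.supportsRelocationOfSerializedObjects &&
          !dep.mapSideCombine &&
          dep.partitioner.numPartitions <= MaxSerializedModePartitions
      }
      ```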
      
      ## How was this patch tested?
      Existing unit test
      
      Author: liuxian <liu.xian3@zte.com.cn>
      
      Closes #20576 from 10110346/mapsidecombine.
      22f3d333
  2. Feb 28, 2018
    • Xingbo Jiang's avatar
      [SPARK-23523][SQL][FOLLOWUP] Minor refactor of OptimizeMetadataOnlyQuery · 25c2776d
      Xingbo Jiang authored
      ## What changes were proposed in this pull request?
      
      Inside `OptimizeMetadataOnlyQuery.getPartitionAttrs`, avoid using `zip` to generate the attribute map (see the sketch below).
      Also includes other minor updates to comments and formatting.
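
      To illustrate why the positional `zip` was worth replacing, a toy sketch (all names are hypothetical):

      ```scala
      // Sketch: if the name list and the attribute list are ordered
      // differently, a positional zip silently pairs names with the wrong
      // attributes.
      case class Attr(name: String)

      val partitionColumnNames = Seq("col1", "col3")
      val attrsInOutputOrder = Seq(Attr("col3"), Attr("col1"))

      val byZip = partitionColumnNames.zip(attrsInOutputOrder).toMap
      // Map(col1 -> Attr(col3), col3 -> Attr(col1))  -- mismatched

      val byName = attrsInOutputOrder.map(a => a.name -> a).toMap
      // Map(col3 -> Attr(col3), col1 -> Attr(col1))  -- correct
      ```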
      
      ## How was this patch tested?
      
      Existing test cases.
      
      Author: Xingbo Jiang <xingbo.jiang@databricks.com>
      
      Closes #20693 from jiangxb1987/SPARK-23523.
      25c2776d
    • Juliusz Sompolski's avatar
      [SPARK-23514] Use SessionState.newHadoopConf() to propagate hadoop configs set in SQLConf. · 476a7f02
      Juliusz Sompolski authored
      ## What changes were proposed in this pull request?
      
      A few places in `spark-sql` were using `sc.hadoopConfiguration` directly. They should be using `sessionState.newHadoopConf()` to blend in configs that were set through `SQLConf`.
      
      Also, for better UX, for these configs blended in from `SQLConf`, we should consider removing the `spark.hadoop` prefix, so that the settings are recognized whether or not they were specified by the user. A sketch of the blending follows.
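
      A minimal sketch of the difference (this touches internal API, and the Hadoop config key is just an illustrative choice):

      ```scala
      // Set a Hadoop config through SQLConf at runtime.
      spark.conf.set("fs.s3a.connection.maximum", "200") // hypothetical setting

      // sessionState.newHadoopConf() layers SQLConf entries (and spark.hadoop.*
      // settings) on top of the base Hadoop configuration...
      val blended = spark.sessionState.newHadoopConf()
      assert(blended.get("fs.s3a.connection.maximum") == "200")

      // ...whereas sc.hadoopConfiguration never sees the SQLConf-set value.
      assert(spark.sparkContext.hadoopConfiguration.get("fs.s3a.connection.maximum") == null)
      ```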
      
      ## How was this patch tested?
      
      Tested that AlterTableRecoverPartitions now correctly recognizes settings that are passed in to the FileSystem through SQLConf.
      
      Author: Juliusz Sompolski <julek@databricks.com>
      
      Closes #20679 from juliuszsompolski/SPARK-23514.
      476a7f02
    • hyukjinkwon's avatar
      [SPARK-23517][PYTHON] Make `pyspark.util._exception_message` produce the trace... · fab563b9
      hyukjinkwon authored
      [SPARK-23517][PYTHON] Make `pyspark.util._exception_message` produce the trace from Java side by Py4JJavaError
      
      ## What changes were proposed in this pull request?
      
      This PR proposes for `pyspark.util._exception_message` to produce the trace from Java side by `Py4JJavaError`.
      
      Currently, in Python 2, it uses the `message` attribute, which `Py4JJavaError` does not have:
      
      ```python
      >>> from pyspark.util import _exception_message
      >>> try:
      ...     sc._jvm.java.lang.String(None)
      ... except Exception as e:
      ...     pass
      ...
      >>> e.message
      ''
      ```
      
      It seems we should use `str` instead for now:
      
       https://github.com/bartdag/py4j/blob/aa6c53b59027925a426eb09b58c453de02c21b7c/py4j-python/src/py4j/protocol.py#L412
      
      but this doesn't address the problem of non-ASCII strings coming from the Java side (https://github.com/bartdag/py4j/issues/306).
      
      So, we could directly call `__str__()`:
      
      ```python
      >>> e.__str__()
      u'An error occurred while calling None.java.lang.String.\n: java.lang.NullPointerException\n\tat java.lang.String.<init>(String.java:588)\n\tat sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)\n\tat sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)\n\tat sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)\n\tat java.lang.reflect.Constructor.newInstance(Constructor.java:422)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\tat py4j.Gateway.invoke(Gateway.java:238)\n\tat py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)\n\tat py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:214)\n\tat java.lang.Thread.run(Thread.java:745)\n'
      ```
      
      which does not coerce unicode to `str` in Python 2.
      
      This can actually be a problem:
      
      ```python
      from pyspark.sql.functions import udf
      spark.conf.set("spark.sql.execution.arrow.enabled", True)
      spark.range(1).select(udf(lambda x: [[]])()).toPandas()
      ```
      
      **Before**
      
      ```
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/.../spark/python/pyspark/sql/dataframe.py", line 2009, in toPandas
          raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
      RuntimeError:
      Note: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true. Please set it to false to disable this.
      ```
      
      **After**
      
      ```
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/.../spark/python/pyspark/sql/dataframe.py", line 2009, in toPandas
          raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
      RuntimeError: An error occurred while calling o47.collectAsArrowToPython.
      : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 1 times, most recent failure: Lost task 7.0 in stage 0.0 (TID 7, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
        File "/.../spark/python/pyspark/worker.py", line 245, in main
          process()
        File "/.../spark/python/pyspark/worker.py", line 240, in process
      ...
      Note: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true. Please set it to false to disable this.
      ```
      
      ## How was this patch tested?
      
      Manually tested and unit tests were added.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #20680 from HyukjinKwon/SPARK-23517.
      fab563b9
    • zhoukang's avatar
      [SPARK-23508][CORE] Fix BlockManagerId in case blockManagerIdCache causes OOM · 6a8abe29
      zhoukang authored
      
      ## What changes were proposed in this pull request?
      `blockManagerIdCache` in `BlockManagerId` never removes old values, which may cause an OOM:
      
      `val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, BlockManagerId]()`
      Whenever we obtain a new `BlockManagerId`, it is put into this map and never evicted.
      
      This patch uses a Guava cache for `blockManagerIdCache` instead.
      
      A heap dump is shown in [SPARK-23508](https://issues.apache.org/jira/browse/SPARK-23508).
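
      A minimal sketch of the approach (the size bound is an assumption):

      ```scala
      import com.google.common.cache.{CacheBuilder, CacheLoader}
      import org.apache.spark.storage.BlockManagerId

      // Sketch: a bounded, loading cache in place of the unbounded
      // ConcurrentHashMap, so old entries are evicted once the cap is reached.
      val blockManagerIdCache = CacheBuilder.newBuilder()
        .maximumSize(10000) // assumed bound
        .build(new CacheLoader[BlockManagerId, BlockManagerId] {
          override def load(id: BlockManagerId): BlockManagerId = id
        })

      def getCachedBlockManagerId(id: BlockManagerId): BlockManagerId =
        blockManagerIdCache.get(id)
      ```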
      
      ## How was this patch tested?
      Existing tests.
      
      Author: zhoukang <zhoukang199191@gmail.com>
      
      Closes #20667 from caneGuy/zhoukang/fix-history.
      6a8abe29
  3. Feb 27, 2018
    • Liang-Chi Hsieh's avatar
      [SPARK-23448][SQL] Clarify JSON and CSV parser behavior in document · b14993e1
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      Clarify JSON and CSV reader behavior in the documentation.
      
      JSON doesn't support partial results for corrupted records.
      CSV only supports partial results for records with more or fewer tokens.
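
      To make the JSON behavior concrete, a sketch (the sample records and column names are made up):

      ```scala
      import spark.implicits._

      // Sketch: in PERMISSIVE mode a malformed JSON record produces nulls for
      // every data column (no partial result); the raw text goes to the
      // corrupt-record column.
      val records = Seq(
        """{"a": 1, "b": 2}""",
        """{"a": 1, "b": "broken""" // malformed: unterminated record
      ).toDS()

      val df = spark.read
        .option("mode", "PERMISSIVE")
        .option("columnNameOfCorruptRecord", "_corrupt_record")
        .schema("a INT, b INT, _corrupt_record STRING")
        .json(records)

      df.show(truncate = false)
      ```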
      
      ## How was this patch tested?
      
      Pass existing tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #20666 from viirya/SPARK-23448-2.
      b14993e1
    • Bruce Robbins's avatar
      [SPARK-23417][PYTHON] Fix the build instructions supplied by exception... · 23ac3aab
      Bruce Robbins authored
      [SPARK-23417][PYTHON] Fix the build instructions supplied by exception messages in python streaming tests
      
      ## What changes were proposed in this pull request?
      
      Fix the build instructions supplied by exception messages in python streaming tests.
      
      I also added -DskipTests to the maven instructions to avoid the 170 minutes of scala tests that occur each time one wants to add a jar to the assembly directory.
      
      ## How was this patch tested?
      
      - clone branch
      - run build/sbt package
      - run python/run-tests --modules "pyspark-streaming", expect error message
      - follow instructions in error message, i.e., run build/sbt assembly/package streaming-kafka-0-8-assembly/assembly
      - rerun python tests, expect error message
      - follow instructions in error message, i.e., run build/sbt -Pflume assembly/package streaming-flume-assembly/assembly
      - rerun python tests, see success
      - repeat all of the above for the mvn version of the process
      
      Author: Bruce Robbins <bersprockets@gmail.com>
      
      Closes #20638 from bersprockets/SPARK-23417_propa.
      23ac3aab
    • Marco Gaido's avatar
      [SPARK-23501][UI] Refactor AllStagesPage in order to avoid redundant code · 598446b7
      Marco Gaido authored
      As suggested in #20651, the code in `AllStagesPage` is very redundant, and modifying it is copy-and-paste work. We should avoid such an error-prone pattern and adopt a cleaner solution that avoids code redundancy.
      
      Existing UTs.
      
      Author: Marco Gaido <marcogaido91@gmail.com>
      
      Closes #20663 from mgaido91/SPARK-23475_followup.
      598446b7
    • Imran Rashid's avatar
      [SPARK-23365][CORE] Do not adjust num executors when killing idle executors. · ecb8b383
      Imran Rashid authored
      The ExecutorAllocationManager should not adjust the target number of
      executors when killing idle executors, as it has already adjusted the
      target number down based on the task backlog.
      
      The name `replace` was misleading with dynamic allocation on, as the target number of executors is changed outside of the call to `killExecutors`, so I renamed it. I also separated out the logic of `countFailures`, as you don't always want that tied to `replace`. A sketch of the resulting interface follows.
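
      As a sketch (parameter names follow the description above; the exact signature in the patch may differ):

      ```scala
      // Sketch of the adjusted client interface.
      trait ExecutorAllocationClient {
        def killExecutors(
            executorIds: Seq[String],
            adjustTargetNumExecutors: Boolean, // was the misleading `replace`
            countFailures: Boolean,            // now independent of the rename
            force: Boolean = false): Seq[String]
      }
      ```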
      
      While I was there I made two changes that weren't directly related to this:
      1) Fixed `countFailures` in a couple of cases where it was getting an incorrect value since it used to be tied to `replace`, e.g., when killing executors on a blacklisted node.
      2) Hard error if you call `sc.killExecutors` with dynamic allocation on, since that's another way the ExecutorAllocationManager and the CoarseGrainedSchedulerBackend would get out of sync.
      
      Added a unit test case which verifies that the calls to ExecutorAllocationClient do not adjust the number of executors.
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #20604 from squito/SPARK-23365.
      ecb8b383
    • gatorsmile's avatar
      [SPARK-23523][SQL] Fix the incorrect result caused by the rule OptimizeMetadataOnlyQuery · 414ee867
      gatorsmile authored
      ## What changes were proposed in this pull request?
      ```Scala
      import java.io.File

      val tablePath = new File(s"${path.getCanonicalPath}/cOl3=c/cOl1=a/cOl5=e")
      Seq(("a", "b", "c", "d", "e")).toDF("cOl1", "cOl2", "cOl3", "cOl4", "cOl5")
        .write.json(tablePath.getCanonicalPath)
      val df = spark.read.json(path.getCanonicalPath).select("CoL1", "CoL5", "CoL3").distinct()
      df.show()
      ```
      
      It generates a wrong result:
      ```
      [c,e,a]
      ```
      The expected result is `[a,e,c]`.
      
      We have a bug in the rule `OptimizeMetadataOnlyQuery`. We should respect the attribute order of the original leaf node. This PR fixes it (see the sketch below).
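
      A sketch of the fix's idea (assumed shape, not the exact patch):

      ```scala
      import org.apache.spark.sql.catalyst.expressions.Attribute
      import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

      // Sketch: derive partition attributes in the order of the relation's
      // output, not in the order of the partition-column name list, so the
      // rewritten scan keeps the attribute order the rest of the plan expects.
      def partitionAttrsInOutputOrder(
          partitionColumnNames: Seq[String],
          relation: LogicalPlan): Seq[Attribute] = {
        val partColNames = partitionColumnNames.map(_.toLowerCase).toSet
        relation.output.filter(a => partColNames.contains(a.name.toLowerCase))
      }
      ```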
      
      ## How was this patch tested?
      Added a test case
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #20684 from gatorsmile/optimizeMetadataOnly.
      414ee867
    • cody koeninger's avatar
      [SPARK-17147][STREAMING][KAFKA] Allow non-consecutive offsets · eac0b067
      cody koeninger authored
      ## What changes were proposed in this pull request?
      
      Add a configuration `spark.streaming.kafka.allowNonConsecutiveOffsets` to allow streaming jobs to proceed on compacted topics (or other situations involving gaps between offsets in the log).
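
      Enabling the flag is a one-line configuration change (a sketch; the app name is made up):

      ```scala
      import org.apache.spark.SparkConf

      // Sketch: opt in to offset ranges with gaps, e.g. compacted topics.
      val conf = new SparkConf()
        .setAppName("compacted-topic-stream")
        .set("spark.streaming.kafka.allowNonConsecutiveOffsets", "true")
      ```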
      
      ## How was this patch tested?
      
      Added new unit test
      
      justinrmiller has been testing this branch in production for a few weeks.
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #20572 from koeninger/SPARK-17147.
      eac0b067
    • Kazuaki Ishizaki's avatar
      [SPARK-23509][BUILD] Upgrade commons-net from 2.2 to 3.1 · 649ed9c5
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR avoids version conflicts of `commons-net` by upgrading it from 2.2 to 3.1. We are seeing the following message during builds with sbt:
      
      ```
      [warn] Found version conflict(s) in library dependencies; some are suspected to be binary incompatible:
      ...
      [warn] 	* commons-net:commons-net:3.1 is selected over 2.2
      [warn] 	    +- org.apache.hadoop:hadoop-common:2.6.5              (depends on 3.1)
      [warn] 	    +- org.apache.spark:spark-core_2.11:2.4.0-SNAPSHOT    (depends on 2.2)
      [warn]
      ```
      
      [Here](https://commons.apache.org/proper/commons-net/changes-report.html) is a release history.
      
      [Here](https://commons.apache.org/proper/commons-net/migration.html) is a migration guide from 2.x to 3.0.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #20672 from kiszk/SPARK-23509.
      649ed9c5
    • Juliusz Sompolski's avatar
      [SPARK-23445] ColumnStat refactoring · 8077bb04
      Juliusz Sompolski authored
      ## What changes were proposed in this pull request?
      
      Refactor ColumnStat to be more flexible.
      
      * Split `ColumnStat` and `CatalogColumnStat` just like `CatalogStatistics` is split from `Statistics`. This detaches how the statistics are stored from how they are processed in the query plan. `CatalogColumnStat` keeps `min` and `max` as `String`, making it not depend on dataType information.
      * For `CatalogColumnStat`, parse column names from property names in the metastore (`KEY_VERSION` property), not from metastore schema. This means that `CatalogColumnStat`s can be created for columns even if the schema itself is not stored in the metastore.
      * Make all fields optional. `min`, `max` and `histogram` for columns were optional already. Having them all optional is more consistent, and gives flexibility to e.g. drop some of the fields through transformations if they are difficult / impossible to calculate.
      
      The added flexibility will make it possible to have alternative implementations for stats, and it separates stats collection from stats and estimation processing in plans. A sketch of the resulting catalog-side shape follows.
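
      As a sketch (the exact field set is an assumption, not the real Spark class):

      ```scala
      // Sketch of the catalog-side shape: min/max are kept as strings so the
      // stored form does not depend on the column's data type, and every
      // field is optional.
      sealed trait Histogram // stand-in for catalyst's histogram type

      case class CatalogColumnStat(
          distinctCount: Option[BigInt] = None,
          min: Option[String] = None,
          max: Option[String] = None,
          nullCount: Option[BigInt] = None,
          avgLen: Option[Long] = None,
          maxLen: Option[Long] = None,
          histogram: Option[Histogram] = None)
      ```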
      
      ## How was this patch tested?
      
      Refactored existing tests to work with refactored `ColumnStat` and `CatalogColumnStat`.
      New tests added in `StatisticsSuite` checking that backwards / forwards compatibility is not broken.
      
      Author: Juliusz Sompolski <julek@databricks.com>
      
      Closes #20624 from juliuszsompolski/SPARK-23445.
      8077bb04