  1. Sep 19, 2017
• [MINOR][CORE] Cleanup dead code and duplication in Mem. Management · 7c92351f
      Armin authored
      ## What changes were proposed in this pull request?
      
      * Removed the method `org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter#alignToWords`.
      It became unused as a result of 85b0a157
      (SPARK-15962) introducing word alignment for unsafe arrays.
      * Cleaned up duplicate code in memory management and unsafe sorters
  * The change extracting the exception paths is more than cosmetic, since it definitely reduces the compiled size of the affected methods
      
      ## How was this patch tested?
      
* The build still passes after removing the method, and grepping the codebase for `alignToWords` shows no reference to it anywhere.
* The deduplicated (DRY-ed up) code is covered by existing tests.
      
      Author: Armin <me@obrown.io>
      
      Closes #19254 from original-brownbear/cleanup-mem-consumer.
      7c92351f
• [SPARK-21923][CORE] Avoid calling reserveUnrollMemoryForThisTask for every record · a11db942
      Xianyang Liu authored
      ## What changes were proposed in this pull request?
When Spark persists data to off-heap (unsafe) memory, we call `MemoryStore.putIteratorAsBytes`, which needs to synchronize on the `memoryManager` for every record written. This synchronization is not necessary: we can reserve more memory at a time and thereby reduce it.
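The amortization can be sketched as follows; this is an illustrative model with hypothetical names (`UnrollBuffer`, `requestFromManager`, `chunkBytes`), not the actual `MemoryStore` code:

```scala
// Instead of asking the synchronized memory manager for unroll memory on every
// record, reserve a larger chunk and only go back to the manager once it is used up.
class UnrollBuffer(requestFromManager: Long => Boolean, chunkBytes: Long = 1024L * 1024L) {
  private var reservedBytes = 0L // granted by the memory manager
  private var usedBytes = 0L     // consumed by records so far

  /** Returns false if the memory manager could not grant more memory. */
  def recordWritten(recordSize: Long): Boolean = {
    if (usedBytes + recordSize > reservedBytes) {
      // Only this branch touches the memory manager, so lock contention is amortized.
      val toRequest = math.max(chunkBytes, recordSize)
      if (!requestFromManager(toRequest)) return false
      reservedBytes += toRequest
    }
    usedBytes += recordSize
    true
  }
}
```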
      
      ## How was this patch tested?
      
Test case (with 1 executor, 20 cores):
      ```scala
import org.apache.spark.storage.StorageLevel

val start = System.currentTimeMillis()
      val data = sc.parallelize(0 until Integer.MAX_VALUE, 100)
            .persist(StorageLevel.OFF_HEAP)
            .count()
      
      println(System.currentTimeMillis() - start)
      
      ```
      
Test results (milliseconds, 5 runs):

|        | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
|--------|-------|-------|-------|-------|-------|
| Before | 27647 | 29108 | 28591 | 28264 | 27232 |
| After  | 26868 | 26358 | 27767 | 26653 | 26693 |
      
      Author: Xianyang Liu <xianyang.liu@intel.com>
      
      Closes #19135 from ConeyLiu/memorystore.
      a11db942
  2. Sep 18, 2017
  3. Sep 17, 2017
• [SPARK-22043][PYTHON] Improves error message for show_profiles and dump_profiles · 7c726620
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
This PR proposes to improve the error message from:
      
      ```
      >>> sc.show_profiles()
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark/python/pyspark/context.py", line 1000, in show_profiles
          self.profiler_collector.show_profiles()
      AttributeError: 'NoneType' object has no attribute 'show_profiles'
      >>> sc.dump_profiles("/tmp/abc")
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark/python/pyspark/context.py", line 1005, in dump_profiles
          self.profiler_collector.dump_profiles(path)
      AttributeError: 'NoneType' object has no attribute 'dump_profiles'
      ```
      
      to
      
      ```
      >>> sc.show_profiles()
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark/python/pyspark/context.py", line 1003, in show_profiles
          raise RuntimeError("'spark.python.profile' configuration must be set "
      RuntimeError: 'spark.python.profile' configuration must be set to 'true' to enable Python profile.
      >>> sc.dump_profiles("/tmp/abc")
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark/python/pyspark/context.py", line 1012, in dump_profiles
          raise RuntimeError("'spark.python.profile' configuration must be set "
      RuntimeError: 'spark.python.profile' configuration must be set to 'true' to enable Python profile.
      ```
      
      ## How was this patch tested?
      
      Unit tests added in `python/pyspark/tests.py` and manual tests.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #19260 from HyukjinKwon/profile-errors.
      7c726620
• [SPARK-21953] Show both memory and disk bytes spilled if either is present · 6308c65f
      Andrew Ash authored
      As written now, there must be both memory and disk bytes spilled to show either of them. If there is only one of those types of spill recorded, it will be hidden.
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #19164 from ash211/patch-3.
      6308c65f
• [SPARK-21985][PYSPARK] PairDeserializer is broken for double-zipped RDDs · 6adf67dd
      Andrew Ray authored
      ## What changes were proposed in this pull request?
      Fixes a bug introduced in #16121
      
In PairDeserializer, convert each batch of keys and values to lists (if they do not already have `__len__`) so that we can check that they are the same size. Normally they are already lists, so this should not have a performance impact, but the conversion is needed when repeated `zip`s are done.
      
      ## How was this patch tested?
      
      Additional unit test
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #19226 from aray/SPARK-21985.
      6adf67dd
• [SPARK-22032][PYSPARK] Speed up StructType conversion · f4073020
      Maciej Bryński authored
      ## What changes were proposed in this pull request?
      
`StructType.fromInternal` calls `f.fromInternal(v)` for every field.
We can use precalculated type information to limit the number of function calls (it is calculated once per `StructType` and reused in the per-record conversions).
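A language-agnostic sketch of the idea, written here in Scala (the real change lives in PySpark's `types.py`; the class and field names below are illustrative only):

```scala
// Precompute, once per schema, which fields actually need a conversion, so the
// per-record path can skip the no-op calls entirely.
case class Field(name: String, needsConversion: Boolean, convert: Any => Any)

class RowConverter(fields: Seq[Field]) {
  // Computed once per schema, not once per record.
  private val converters: Array[Any => Any] =
    fields.map(f => if (f.needsConversion) f.convert else identity[Any] _).toArray

  def fromInternal(values: Seq[Any]): Seq[Any] =
    values.zipWithIndex.map { case (v, i) => converters(i)(v) }
}
```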
      
      Benchmarks (Python profiler)
      ```
      df = spark.range(10000000).selectExpr("id as id0", "id as id1", "id as id2", "id as id3", "id as id4", "id as id5", "id as id6", "id as id7", "id as id8", "id as id9", "struct(id) as s").cache()
      df.count()
      df.rdd.map(lambda x: x).count()
      ```
      
      Before
      ```
      310274584 function calls (300272456 primitive calls) in 1320.684 seconds
      
      Ordered by: internal time, cumulative time
      
         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       10000000  253.417    0.000  486.991    0.000 types.py:619(<listcomp>)
       30000000  192.272    0.000 1009.986    0.000 types.py:612(fromInternal)
      100000000  176.140    0.000  176.140    0.000 types.py:88(fromInternal)
       20000000  156.832    0.000  328.093    0.000 types.py:1471(_create_row)
          14000  107.206    0.008 1237.917    0.088 {built-in method loads}
       20000000   80.176    0.000 1090.162    0.000 types.py:1468(<lambda>)
      ```
      
      After
      ```
      210274584 function calls (200272456 primitive calls) in 1035.974 seconds
      
      Ordered by: internal time, cumulative time
      
         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       30000000  215.845    0.000  698.748    0.000 types.py:612(fromInternal)
       20000000  165.042    0.000  351.572    0.000 types.py:1471(_create_row)
          14000  116.834    0.008  946.791    0.068 {built-in method loads}
       20000000   87.326    0.000  786.073    0.000 types.py:1468(<lambda>)
       20000000   85.477    0.000  134.607    0.000 types.py:1519(__new__)
       10000000   65.777    0.000  126.712    0.000 types.py:619(<listcomp>)
      ```
      
The main difference is in `types.py:619(<listcomp>)` and `types.py:88(fromInternal)` (the latter disappears in the After profile).
The number of function calls is 100 million lower, and performance is about 20% better.
      
Benchmark (worst-case scenario)
      
      Test
      ```
      df = spark.range(1000000).selectExpr("current_timestamp as id0", "current_timestamp as id1", "current_timestamp as id2", "current_timestamp as id3", "current_timestamp as id4", "current_timestamp as id5", "current_timestamp as id6", "current_timestamp as id7", "current_timestamp as id8", "current_timestamp as id9").cache()
      df.count()
      df.rdd.map(lambda x: x).count()
      ```
      
      Before
      ```
      31166064 function calls (31163984 primitive calls) in 150.882 seconds
      ```
      
      After
      ```
      31166064 function calls (31163984 primitive calls) in 153.220 seconds
      ```
      
      IMPORTANT:
      The benchmark was done on top of https://github.com/apache/spark/pull/19246.
      Without https://github.com/apache/spark/pull/19246 the performance improvement will be even greater.
      
      ## How was this patch tested?
      
      Existing tests.
      Performance benchmark.
      
      Author: Maciej Bryński <maciek-github@brynski.pl>
      
      Closes #19249 from maver1ck/spark_22032.
      f4073020
  4. Sep 16, 2017
• [SPARK-21967][CORE] org.apache.spark.unsafe.types.UTF8String#compareTo Should... · 73d90672
      Armin authored
      [SPARK-21967][CORE] org.apache.spark.unsafe.types.UTF8String#compareTo Should Compare 8 Bytes at a Time for Better Performance
      
      ## What changes were proposed in this pull request?
      
* Use 64-bit unsigned long comparison instead of unsigned int comparison in `org.apache.spark.unsafe.types.UTF8String#compareTo` for better performance (see the sketch below).
* Make `IS_LITTLE_ENDIAN` a constant for correctness (a `compareTo` implementation shouldn't rely on a non-constant, and the value is definitely constant per JVM).
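A hedged sketch of the word-at-a-time comparison, using plain byte arrays and `ByteBuffer` rather than Spark's `Platform` memory access, so names and details differ from the actual `UTF8String` code:

```scala
import java.lang.{Long => JLong}
import java.nio.{ByteBuffer, ByteOrder}

// Lexicographic unsigned byte comparison, 8 bytes at a time, byte-by-byte at the tail.
def compareBytes(a: Array[Byte], b: Array[Byte]): Int = {
  val isLittleEndian = ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN // constant per JVM
  val bufA = ByteBuffer.wrap(a).order(ByteOrder.nativeOrder())
  val bufB = ByteBuffer.wrap(b).order(ByteOrder.nativeOrder())
  val minLen = math.min(a.length, b.length)
  val wordEnd = minLen - (minLen % 8)
  var i = 0
  while (i < wordEnd) {
    var la = bufA.getLong(i)
    var lb = bufB.getLong(i)
    if (la != lb) {
      // On little-endian hardware the bytes inside the word are reversed, so flip
      // them back before the unsigned comparison to keep lexicographic order.
      if (isLittleEndian) { la = JLong.reverseBytes(la); lb = JLong.reverseBytes(lb) }
      return JLong.compareUnsigned(la, lb)
    }
    i += 8
  }
  while (i < minLen) {
    val cmp = (a(i) & 0xff) - (b(i) & 0xff) // unsigned byte comparison
    if (cmp != 0) return cmp
    i += 1
  }
  a.length - b.length
}
```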
      
      ## How was this patch tested?
      
      Build passes and the functionality is widely covered by existing tests as far as I can see.
      
      Author: Armin <me@obrown.io>
      
      Closes #19180 from original-brownbear/SPARK-21967.
      73d90672
  5. Sep 15, 2017
• [SPARK-22017] Take minimum of all watermark execs in StreamExecution. · 0bad10d3
      Jose Torres authored
      ## What changes were proposed in this pull request?
      
      Take the minimum of all watermark exec nodes as the "real" watermark in StreamExecution, rather than picking one arbitrarily.
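A minimal sketch of the policy, with assumed names (the real logic lives in `StreamExecution`):

```scala
// Track the latest watermark reported by each event-time watermark operator and
// advance the global watermark to the minimum across all of them, never backwards.
def nextGlobalWatermarkMs(perOperatorWatermarkMs: Map[Int, Long],
                          currentGlobalMs: Long): Long = {
  if (perOperatorWatermarkMs.isEmpty) currentGlobalMs
  else math.max(currentGlobalMs, perOperatorWatermarkMs.values.min)
}
```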
      
      ## How was this patch tested?
      
      new unit test
      
      Author: Jose Torres <jose@databricks.com>
      
      Closes #19239 from joseph-torres/SPARK-22017.
      0bad10d3
• [SPARK-15689][SQL] data source v2 read path · c7307acd
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
This PR adds the infrastructure for data source v2 and implements features which Spark already has in data source v1, i.e. column pruning, filter push-down, catalyst expression filter push-down, InternalRow scan, schema inference, and data size reporting. The write path is excluded to avoid making this PR grow too big; it will be added in a follow-up PR.
      
      ## How was this patch tested?
      
      new tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #19136 from cloud-fan/data-source-v2.
      c7307acd
• [SPARK-21958][ML] Word2VecModel save: transform data in the cluster · 79a4dab6
      Travis Hegner authored
      ## What changes were proposed in this pull request?
      
      Change a data transformation while saving a Word2VecModel to happen with distributed data instead of local driver data.
      
      ## How was this patch tested?
      
      Unit tests for the ML sub-component still pass.
      Running this patch against v2.2.0 in a fully distributed production cluster allows a 4.0G model to save and load correctly, where it would not do so without the patch.
      
      Author: Travis Hegner <thegner@trilliumit.com>
      
      Closes #19191 from travishegner/master.
      79a4dab6
• [SPARK-21987][SQL] fix a compatibility issue of sql event logs · 3c6198c8
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      In https://github.com/apache/spark/pull/18600 we removed the `metadata` field from `SparkPlanInfo`. This causes a problem when we replay event logs that are generated by older Spark versions.
      
      ## How was this patch tested?
      
      a regression test.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #19237 from cloud-fan/event.
      3c6198c8
• [SPARK-22002][SQL] Read JDBC table use custom schema support specify partial fields. · 4decedfd
      Yuming Wang authored
      ## What changes were proposed in this pull request?
      
https://github.com/apache/spark/pull/18266 added a new feature to support reading a JDBC table with a custom schema, but all fields had to be specified. For convenience, this PR supports specifying only some of the fields.
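For illustration, with this change the `customSchema` option only needs to list the columns whose mapping should be overridden. A minimal sketch, assuming an active `SparkSession` (`spark`), a `jdbcUrl`, and a table with an `ID` column (the names are illustrative):

```scala
import java.util.Properties

val props = new Properties()
// Only ID is overridden; the remaining columns keep the types inferred from JDBC metadata.
props.put("customSchema", "ID decimal(38, 0)")
val df = spark.read.jdbc(jdbcUrl, "tableWithCustomSchema", props)
```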
      
      ## How was this patch tested?
      unit tests
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #19231 from wangyum/SPARK-22002.
      4decedfd
• [SPARK-21902][CORE] Print root cause for BlockManager#doPut · 22b111ef
      zhoukang authored
      ## What changes were proposed in this pull request?
      
As the log below shows, the actual exception is hidden when `removeBlockInternal` throws an exception.
```
2017-08-31,10:26:57,733 WARN org.apache.spark.storage.BlockManager: Putting block broadcast_110 failed due to an exception
      2017-08-31,10:26:57,734 WARN org.apache.spark.broadcast.BroadcastManager: Failed to create a new broadcast in 1 attempts
      java.io.IOException: Failed to create local dir in /tmp/blockmgr-5bb5ac1e-c494-434a-ab89-bd1808c6b9ed/2e.
              at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:70)
              at org.apache.spark.storage.DiskStore.remove(DiskStore.scala:115)
              at org.apache.spark.storage.BlockManager.removeBlockInternal(BlockManager.scala:1339)
              at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:910)
              at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
              at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:726)
              at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:1233)
              at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:122)
              at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:88)
              at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
              at org.apache.spark.broadcast.BroadcastManager$$anonfun$newBroadcast$1.apply$mcVI$sp(BroadcastManager.scala:60)
              at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
              at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:58)
              at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1415)
              at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1002)
              at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:924)
              at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:771)
              at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:770)
              at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
              at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
              at org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:770)
              at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1235)
              at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1662)
              at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620)
              at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
```
      
In this PR the exception is printed first, which makes troubleshooting more convenient (a rough sketch of the pattern is shown below).
PS:
This was split out from [PR-19133](https://github.com/apache/spark/pull/19133).
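A hedged sketch of the pattern with illustrative names, not the actual `BlockManager` code: log the original failure before running cleanup that can itself throw and mask the root cause.

```scala
import org.slf4j.LoggerFactory

val logger = LoggerFactory.getLogger("BlockManagerSketch")

def doPutSketch(blockId: String)(body: => Unit): Unit = {
  try {
    body
  } catch {
    case e: Exception =>
      // Print the root cause first so it is not lost if the cleanup below also throws.
      logger.warn(s"Putting block $blockId failed due to exception $e.")
      removeBlockInternalSketch(blockId) // may throw, e.g. when the local dir is missing
      throw e
  }
}

def removeBlockInternalSketch(blockId: String): Unit = ()
```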
      
      ## How was this patch tested?
Existing unit tests.
      
      Author: zhoukang <zhoukang199191@gmail.com>
      
      Closes #19171 from caneGuy/zhoukang/print-rootcause.
      22b111ef
• [SPARK-22018][SQL] Preserve top-level alias metadata when collapsing projects · 88661747
      Tathagata Das authored
      ## What changes were proposed in this pull request?
Suppose there are two Projects like the following:
      ```
      Project [a_with_metadata#27 AS b#26]
      +- Project [a#0 AS a_with_metadata#27]
         +- LocalRelation <empty>, [a#0, b#1]
      ```
The child Project has an output column with metadata on it, and the parent Project has an alias that implicitly forwards that metadata, so the metadata is visible to higher operators. However, after the CollapseProject optimizer rule is applied, the metadata is not preserved:
      ```
      Project [a#0 AS b#26]
      +- LocalRelation <empty>, [a#0, b#1]
      ```
This is incorrect, because downstream operators that rely on that metadata to identify certain fields (e.g. the watermark column in Structured Streaming) will fail to do so. This PR fixes it by preserving the metadata of top-level aliases.
      
      ## How was this patch tested?
      New unit test
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #19240 from tdas/SPARK-22018.
      88661747
  6. Sep 14, 2017
• [SPARK-21513][SQL][FOLLOWUP] Allow UDF to_json support converting MapType to... · a28728a9
      goldmedal authored
      [SPARK-21513][SQL][FOLLOWUP] Allow UDF to_json support converting MapType to json for PySpark and SparkR
      
      ## What changes were proposed in this pull request?
In the previous work for SPARK-21513, we allowed `MapType` and `ArrayType` of `MapType`s to be converted to a JSON string, but only for the Scala API. In this follow-up PR, we make Spark SQL support it for PySpark and SparkR, too. We also fix some small bugs and comments from the previous work.
      
      ### For PySpark
      ```
      >>> data = [(1, {"name": "Alice"})]
      >>> df = spark.createDataFrame(data, ("key", "value"))
      >>> df.select(to_json(df.value).alias("json")).collect()
      [Row(json=u'{"name":"Alice")']
      >>> data = [(1, [{"name": "Alice"}, {"name": "Bob"}])]
      >>> df = spark.createDataFrame(data, ("key", "value"))
      >>> df.select(to_json(df.value).alias("json")).collect()
      [Row(json=u'[{"name":"Alice"},{"name":"Bob"}]')]
      ```
      ### For SparkR
      ```
      # Converts a map into a JSON object
      df2 <- sql("SELECT map('name', 'Bob')) as people")
      df2 <- mutate(df2, people_json = to_json(df2$people))
      # Converts an array of maps into a JSON array
      df2 <- sql("SELECT array(map('name', 'Bob'), map('name', 'Alice')) as people")
      df2 <- mutate(df2, people_json = to_json(df2$people))
      ```
      ## How was this patch tested?
      Add unit test cases.
      
      cc viirya HyukjinKwon
      
      Author: goldmedal <liugs963@gmail.com>
      
      Closes #19223 from goldmedal/SPARK-21513-fp-PySaprkAndSparkR.
      a28728a9
• [SPARK-21988] Add default stats to StreamingExecutionRelation. · 054ddb2f
      Jose Torres authored
      ## What changes were proposed in this pull request?
      
      Add default stats to StreamingExecutionRelation.
      
      ## How was this patch tested?
      
      existing unit tests and an explain() test to be sure
      
      Author: Jose Torres <jose@databricks.com>
      
      Closes #19212 from joseph-torres/SPARK-21988.
      054ddb2f
• [SPARK-17642][SQL][FOLLOWUP] drop test tables and improve comments · ddd7f5e1
      Zhenhua Wang authored
      ## What changes were proposed in this pull request?
      
      Drop test tables and improve comments.
      
      ## How was this patch tested?
      
      Modified existing test.
      
      Author: Zhenhua Wang <wangzhenhua@huawei.com>
      
      Closes #19213 from wzhfy/useless_comment.
      ddd7f5e1
• [SPARK-21922] Fix duration always updating when task failed but status is still RUN… · 4b88393c
      zhoukang authored
[SPARK-21922] Fix duration always updating when task failed but status is still RUNNING
      
      ## What changes were proposed in this pull request?
When the driver quits abnormally, the executors shut down and task metrics can no longer be sent to the driver for updating. In this case the status stays 'RUNNING' and the duration shown on the history UI is 'CurrentTime - launchTime', which keeps growing indefinitely.
We can fix this by using the modification time of the event log, which is already available when `FSHistoryProvider` fetches the event log from the file system (a minimal sketch of the idea follows below).
The resulting screenshot is uploaded in [SPARK-21922](https://issues.apache.org/jira/browse/SPARK-21922).
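A minimal sketch of the idea with assumed names, not the actual history server code:

```scala
// For tasks stuck in RUNNING because the driver died, cap the displayed end time at
// the event log's last modification time instead of the current wall-clock time.
def displayedDurationMs(launchTimeMs: Long,
                        finishTimeMs: Option[Long],
                        logLastModifiedMs: Long): Long = {
  val endMs = finishTimeMs.getOrElse(logLastModifiedMs) // not System.currentTimeMillis()
  math.max(0L, endMs - launchTimeMs)
}
```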
How to reproduce?
(1) Submit a job to Spark on YARN
(2) Mock an OOM (or any other scenario that makes the driver quit abnormally) for the driver
(3) Make sure an executor is running a task when the driver quits
(4) Open the history server and check the result
      It is not a corner case since there are many such jobs in our current cluster.
      
      ## How was this patch tested?
Deployed the history server and opened a job that has this problem.
      
      Author: zhoukang <zhoukang199191@gmail.com>
      
      Closes #19132 from caneGuy/zhoukang/fix-duration.
      4b88393c
• [SPARK-4131][FOLLOW-UP] Support "Writing data into the filesystem from queries" · 4e6fc690
      gatorsmile authored
      ## What changes were proposed in this pull request?
This PR cleans up the code from https://github.com/apache/spark/pull/18975.
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19225 from gatorsmile/refactorSPARK-4131.
      4e6fc690
• [SPARK-18608][ML][FOLLOWUP] Fix double caching for PySpark OneVsRest. · c76153cc
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
#19197 fixed double caching for MLlib algorithms, but missed PySpark `OneVsRest`; this PR fixes it.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #19220 from yanboliang/SPARK-18608.
      c76153cc
• [MINOR][DOC] Add missing call of `update()` in examples of... · 66cb72d7
      Zheng RuiFeng authored
      [MINOR][DOC] Add missing call of `update()` in examples of PeriodicGraphCheckpointer & PeriodicRDDCheckpointer
      
      ## What changes were proposed in this pull request?
The examples for `PeriodicGraphCheckpointer` & `PeriodicRDDCheckpointer` forgot to call `update()` with `graph1` & `rdd1`.
      ## How was this patch tested?
      existing tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #19198 from zhengruifeng/fix_doc_checkpointer.
      66cb72d7
• [SPARK-21854] Added LogisticRegressionTrainingSummary for... · 8d8641f1
      Ming Jiang authored
      [SPARK-21854] Added LogisticRegressionTrainingSummary for MultinomialLogisticRegression in Python API
      
      ## What changes were proposed in this pull request?
      
      Added LogisticRegressionTrainingSummary for MultinomialLogisticRegression in Python API
      
      ## How was this patch tested?
      
      Added unit test
      
      
      Author: Ming Jiang <mjiang@fanatics.com>
      Author: Ming Jiang <jmwdpk@gmail.com>
      Author: jmwdpk <jmwdpk@gmail.com>
      
      Closes #19185 from jmwdpk/SPARK-21854.
      8d8641f1
• [MINOR][SQL] Only populate type metadata for required types such as CHAR/VARCHAR. · dcbb2294
      Dilip Biswal authored
      ## What changes were proposed in this pull request?
When reading column descriptions from the Hive catalog, we currently populate the metadata for all types to record the raw Hive type string. For processing, we only need this additional metadata for CHAR/VARCHAR types or complex types containing CHAR/VARCHAR types (a rough sketch of the check is shown below).
      
It's a minor cleanup; I haven't created a JIRA for it.
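The check can be sketched roughly like this (a hypothetical helper, not the actual catalog code):

```scala
// Only record the raw Hive type string as metadata when the type involves
// CHAR/VARCHAR, including complex types that contain them.
def needsRawTypeMetadata(hiveTypeString: String): Boolean =
  hiveTypeString.toLowerCase.contains("char(") // matches both char(...) and varchar(...)
```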
      
      ## How was this patch tested?
      Test added in HiveMetastoreCatalogSuite
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #19215 from dilipbiswal/column_metadata.
      dcbb2294
  7. Sep 13, 2017
• [SPARK-21973][SQL] Add an new option to filter queries in TPC-DS · 8be7e6bb
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
This PR adds a new option to filter which TPC-DS queries to run in `TPCDSQueryBenchmark`.
By default, `TPCDSQueryBenchmark` runs all the TPC-DS queries.
This change lets developers run only some of the TPC-DS queries via this option,
e.g., to run q2, q4, and q6 only:
      ```
      spark-submit --class <this class> --conf spark.sql.tpcds.queryFilter="q2,q4,q6" --jars <spark sql test jar>
      ```
      
      ## How was this patch tested?
      Manually checked.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #19188 from maropu/RunPartialQueriesInTPCDS.
      8be7e6bb
• [SPARK-20427][SQL] Read JDBC table use custom schema · 17edfec5
      Yuming Wang authored
      ## What changes were proposed in this pull request?
      
Auto-generated Oracle schemas are sometimes not what we expect:

- `number(1)` is auto-mapped to BooleanType, which is sometimes not what we want, per [SPARK-20921](https://issues.apache.org/jira/browse/SPARK-20921).
- `number` is auto-mapped to Decimal(38,10), which cannot read big values, per [SPARK-20427](https://issues.apache.org/jira/browse/SPARK-20427).

This PR fixes the issue by supporting a custom schema, as follows:
      ```scala
import java.util.Properties

val props = new Properties()
      props.put("customSchema", "ID decimal(38, 0), N1 int, N2 boolean")
val dfRead = spark.read.jdbc(jdbcUrl, "tableWithCustomSchema", props)
      dfRead.show()
      ```
      or
      ```sql
      CREATE TEMPORARY VIEW tableWithCustomSchema
      USING org.apache.spark.sql.jdbc
OPTIONS (url '$jdbcUrl', dbTable 'tableWithCustomSchema', customSchema 'ID decimal(38, 0), N1 int, N2 boolean')
      ```
      
      ## How was this patch tested?
      
      unit tests
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #18266 from wangyum/SPARK-20427.
      17edfec5
• [SPARK-4131] Merge HiveTmpFile.scala to SaveAsHiveFile.scala · 8c7e19a3
      Jane Wang authored
      ## What changes were proposed in this pull request?
      
The code has already been merged to master:
https://github.com/apache/spark/pull/18975

This is a follow-up PR to merge HiveTmpFile.scala into SaveAsHiveFile.scala.
      
      ## How was this patch tested?
      
Builds successfully.
      
      Author: Jane Wang <janewang@fb.com>
      
      Closes #19221 from janewangfb/merge_savehivefile_hivetmpfile.
      8c7e19a3
• [SPARK-21980][SQL] References in grouping functions should be indexed with semanticEquals · 21c4450f
      donnyzone authored
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-21980
      
This PR fixes an issue in the ResolveGroupingAnalytics rule, which indexes the column references in grouping functions without considering the case-sensitivity configuration.
      
      The problem can be reproduced by:
      
```scala
import org.apache.spark.sql.functions.grouping

val df = spark.createDataFrame(Seq((1, 1), (2, 1), (2, 2))).toDF("a", "b")
df.cube("a").agg(grouping("A")).show()
```
      
      ## How was this patch tested?
      unit tests
      
      Author: donnyzone <wellfengzhu@gmail.com>
      
      Closes #19202 from DonnyZone/ResolveGroupingAnalytics.
      21c4450f
• [SPARK-21970][CORE] Fix Redundant Throws Declarations in Java Codebase · b6ef1f57
      Armin authored
      ## What changes were proposed in this pull request?
      
      1. Removing all redundant throws declarations from Java codebase.
      2. Removing dead code made visible by this from `ShuffleExternalSorter#closeAndGetSpills`
      
      ## How was this patch tested?
      
      Build still passes.
      
      Author: Armin <me@obrown.io>
      
      Closes #19182 from original-brownbear/SPARK-21970.
      b6ef1f57
• [SPARK-21690][ML] one-pass imputer · 0fa5b7ca
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
Parallelize the computation of all columns (a rough sketch of the idea follows after the benchmark table).
      
      performance tests:
      
| numColumns | Mean(Old) | Median(Old) | Mean(RDD) | Median(RDD) | Mean(DF) | Median(DF) |
      |------|----------|------------|----------|------------|----------|------------|
      |1|0.0771394713|0.0658712813|0.080779802|0.048165981499999996|0.10525509870000001|0.0499620203|
      |10|0.7234340630999999|0.5954440414|0.0867935197|0.13263428659999998|0.09255724889999999|0.1573943635|
      |100|7.3756451568|6.2196631259|0.1911931552|0.8625376817000001|0.5557462431|1.7216837982000002|
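As a rough illustration of the one-pass idea (illustrative code, not the actual `Imputer` implementation): aggregate the statistics for all columns in a single pass over the data instead of launching one job per column.

```scala
import org.apache.spark.rdd.RDD

// Compute the mean of every column in one treeAggregate over the data.
def columnMeans(data: RDD[Array[Double]], numCols: Int): Array[Double] = {
  val (sums, count) = data.treeAggregate((new Array[Double](numCols), 0L))(
    seqOp = { case ((s, n), row) =>
      var i = 0
      while (i < numCols) { s(i) += row(i); i += 1 }
      (s, n + 1)
    },
    combOp = { case ((s1, n1), (s2, n2)) =>
      var i = 0
      while (i < numCols) { s1(i) += s2(i); i += 1 }
      (s1, n1 + n2)
    })
  sums.map(_ / count)
}
```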
      
      ## How was this patch tested?
      existing tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #18902 from zhengruifeng/parallelize_imputer.
      0fa5b7ca
• [SPARK-21963][CORE][TEST] Create temp file should be delete after use · ca00cc70
      caoxuewen authored
      ## What changes were proposed in this pull request?
      
After a test creates a temporary file, it needs to delete it; otherwise it leaves behind a file with a name like 'SPARK194465907929586320484966temp'.
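A minimal sketch of the cleanup pattern (illustrative, not the specific test that was fixed):

```scala
import java.nio.file.Files

// Create a temp file for the test and make sure it is removed afterwards.
val tempFile = Files.createTempFile("SPARK", "temp")
try {
  // ... exercise the code under test with tempFile ...
} finally {
  Files.deleteIfExists(tempFile)
}
```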
      
      ## How was this patch tested?
      
      N / A
      
      Author: caoxuewen <cao.xuewen@zte.com.cn>
      
      Closes #19174 from heary-cao/DeleteTempFile.
      ca00cc70
• [SPARK-21893][BUILD][STREAMING][WIP] Put Kafka 0.8 behind a profile · 4fbf748b
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Put Kafka 0.8 support behind a kafka-0-8 profile.
      
      ## How was this patch tested?
      
Existing tests; but until the PR builder and Jenkins configs are updated, the effect here is to not build or test Kafka 0.8 support at all.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #19134 from srowen/SPARK-21893.
      4fbf748b
• [SPARK-21982] Set locale to US · a1d98c6d
      German Schiavon authored
      ## What changes were proposed in this pull request?
      
In UtilsSuite the Locale was set to US by default, but at the moment the format function was used it wasn't, so it picked up the default JVM locale, which could be different from US and made this test fail.
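A minimal illustration of the pitfall (illustrative values, not the UtilsSuite code itself):

```scala
import java.util.Locale

// With a default locale such as de_DE, "%.1f" uses a comma as the decimal separator.
val dependsOnJvm = "%.1f".format(1.0)                  // "1.0" or "1,0", depending on the default locale
val alwaysUs     = "%.1f".formatLocal(Locale.US, 1.0)  // always "1.0"
```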
      
      ## How was this patch tested?
      Unit test (UtilsSuite)
      
      Author: German Schiavon <germanschiavon@gmail.com>
      
      Closes #19205 from Gschiavon/fix/test-locale.
      a1d98c6d
• [BUILD] Close stale PRs · dd88fa3d
      Sean Owen authored
      Closes #18522
      Closes #17722
      Closes #18879
      Closes #18891
      Closes #18806
      Closes #18948
      Closes #18949
      Closes #19070
      Closes #19039
      Closes #19142
      Closes #18515
      Closes #19154
      Closes #19162
      Closes #19187
      Closes #19091
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #19203 from srowen/CloseStalePRs3.
      dd88fa3d
• [SPARK-21027][MINOR][FOLLOW-UP] add missing since tag · f6c5d8f6
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      add missing since tag for `setParallelism` in #19110
      
      ## How was this patch tested?
      
      N/A
      
      Author: WeichenXu <weichen.xu@databricks.com>
      
      Closes #19214 from WeichenXu123/minor01.
      f6c5d8f6