  1. Apr 02, 2017
    • zuotingbing's avatar
[SPARK-20123][BUILD] SPARK_HOME variable might have spaces in it (e.g. $SPARK… · 76de2d11
      zuotingbing authored
      JIRA Issue: https://issues.apache.org/jira/browse/SPARK-20123
      
      ## What changes were proposed in this pull request?
      
If the $SPARK_HOME or $FWDIR variable contains spaces, then building Spark with `./dev/make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn` will fail.
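The root cause is ordinary shell word-splitting: an unquoted `$SPARK_HOME` expansion splits on spaces. A minimal sketch of the difference (the path is a made-up example):

```shell
# An unquoted expansion of a path containing a space word-splits
# into two arguments; a quoted expansion stays one argument.
SPARK_HOME="/opt/my spark"

set -- $SPARK_HOME
echo "unquoted: $# arguments"   # prints: unquoted: 2 arguments

set -- "$SPARK_HOME"
echo "quoted: $# arguments"     # prints: quoted: 1 arguments
```

The fix in the build scripts is, correspondingly, to quote every expansion of `$SPARK_HOME` and `$FWDIR`.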
      
      ## How was this patch tested?
      
      manual tests
      
      Author: zuotingbing <zuo.tingbing9@zte.com.cn>
      
      Closes #17452 from zuotingbing/spark-bulid.
      76de2d11
    • hyukjinkwon's avatar
      [SPARK-20143][SQL] DataType.fromJson should throw an exception with better message · d40cbb86
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Currently, `DataType.fromJson` throws `scala.MatchError` or `java.util.NoSuchElementException` in some cases when the JSON input is invalid as below:
      
      ```scala
      DataType.fromJson(""""abcd"""")
      ```
      
      ```
      java.util.NoSuchElementException: key not found: abcd
        at ...
      ```
      
      ```scala
      DataType.fromJson("""{"abcd":"a"}""")
      ```
      
      ```
      scala.MatchError: JObject(List((abcd,JString(a)))) (of class org.json4s.JsonAST$JObject)
        at ...
      ```
      
      ```scala
      DataType.fromJson("""{"fields": [{"a":123}], "type": "struct"}""")
      ```
      
      ```
      scala.MatchError: JObject(List((a,JInt(123)))) (of class org.json4s.JsonAST$JObject)
        at ...
      ```
      
      After this PR,
      
      ```scala
      DataType.fromJson(""""abcd"""")
      ```
      
      ```
      java.lang.IllegalArgumentException: Failed to convert the JSON string 'abcd' to a data type.
        at ...
      ```
      
      ```scala
      DataType.fromJson("""{"abcd":"a"}""")
      ```
      
      ```
      java.lang.IllegalArgumentException: Failed to convert the JSON string '{"abcd":"a"}' to a data type.
        at ...
      ```
      
```scala
DataType.fromJson("""{"fields": [{"a":123}], "type": "struct"}""")
```

```
java.lang.IllegalArgumentException: Failed to convert the JSON string '{"a":123}' to a field.
  at ...
```
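The pattern the PR applies can be sketched outside Spark: catch low-level lookup/match failures and rethrow with the offending JSON fragment in the message. A toy Python analogue (helper names and the atomic-type set are illustrative, not Spark's API):

```python
import json

# Accepted atomic type names in this toy schema language (illustrative,
# not Spark's full set).
ATOMIC_TYPES = {"string", "integer", "double", "boolean"}

def parse_data_type(value):
    """Toy analogue of DataType.fromJson: parse already-decoded JSON."""
    if isinstance(value, str) and value in ATOMIC_TYPES:
        return value
    if isinstance(value, dict) and value.get("type") == "struct":
        return {"struct": [parse_field(f) for f in value.get("fields", [])]}
    _fail(value, "data type")

def parse_field(value):
    if isinstance(value, dict) and {"name", "type"} <= value.keys():
        return (value["name"], parse_data_type(value["type"]))
    _fail(value, "field")

def _fail(value, kind):
    # Instead of letting a low-level KeyError/MatchError-style failure
    # escape, report the offending JSON fragment and what was expected.
    raise ValueError(
        "Failed to convert the JSON string '%s' to a %s." % (json.dumps(value), kind))
```

The key point mirrors the PR: every dead-end in the parser funnels through one place that knows both the bad fragment and the expected kind ("data type" vs "field").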
      
      ## How was this patch tested?
      
      Unit test added in `DataTypeSuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17468 from HyukjinKwon/fromjson_exception.
      d40cbb86
  2. Apr 01, 2017
    • wangzhenhua's avatar
      [SPARK-20186][SQL] BroadcastHint should use child's stats · 2287f3d0
      wangzhenhua authored
      ## What changes were proposed in this pull request?
      
      `BroadcastHint` should use child's statistics and set `isBroadcastable` to true.
      
      ## How was this patch tested?
      
      Added a new stats estimation test for `BroadcastHint`.
      
      Author: wangzhenhua <wangzhenhua@huawei.com>
      
      Closes #17504 from wzhfy/broadcastHintEstimation.
      2287f3d0
    • Xiao Li's avatar
      [SPARK-19148][SQL][FOLLOW-UP] do not expose the external table concept in Catalog · 89d6822f
      Xiao Li authored
      ### What changes were proposed in this pull request?
After renaming `Catalog.createExternalTable` to `createTable` in the PR: https://github.com/apache/spark/pull/16528, we also need to deprecate the corresponding functions in `SQLContext`.
      
      ### How was this patch tested?
      N/A
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17502 from gatorsmile/deprecateCreateExternalTable.
      89d6822f
    • 郭小龙 10207633's avatar
      [SPARK-20177] Document about compression way has some little detail ch… · cf5963c9
      郭小龙 10207633 authored
      …anges.
      
      ## What changes were proposed in this pull request?
      
Small documentation changes about compression:
1. `spark.eventLog.compress`: add 'Compression will use spark.io.compression.codec.'
2. `spark.broadcast.compress`: add 'Compression will use spark.io.compression.codec.'
3. `spark.rdd.compress`: add 'Compression will use spark.io.compression.codec.'
4. `spark.io.compression.codec`: describe its use for the event log.

For example, from the current documentation it is not clear which compression codec is used for the 'event log'.
      
      ## How was this patch tested?
      
      manual tests
      
      Author: 郭小龙 10207633 <guo.xiaolong1@zte.com.cn>
      
      Closes #17498 from guoxiaolongzte/SPARK-20177.
      cf5963c9
  3. Mar 31, 2017
    • Tathagata Das's avatar
      [SPARK-20165][SS] Resolve state encoder's deserializer in driver in FlatMapGroupsWithStateExec · 567a50ac
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
- Encoder's deserializer must be resolved at the driver, where the class is defined. Otherwise there are corner cases using nested classes where resolving at the executor can fail.

- Fixed flaky test related to processing-time timeout. The flakiness is caused by a race condition between the test thread (which adds data to the memory source) and the streaming query thread. When testing the manual clock, the goal is to add data and increment the clock together atomically, such that a trigger sees new data AND the updated clock simultaneously (both or none). This fix adds additional synchronization when adding data; it makes sure that the streaming query thread is waiting on the manual clock to be incremented (so no batch is currently running) before adding data.

- Added `testQuietly` to some tests that generate a lot of error logs.
      
      ## How was this patch tested?
      Multiple runs on existing unit tests
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #17488 from tdas/SPARK-20165.
      567a50ac
    • Xiao Li's avatar
      [SPARK-20160][SQL] Move ParquetConversions and OrcConversions Out Of HiveSessionCatalog · b2349e6a
      Xiao Li authored
      ### What changes were proposed in this pull request?
      `ParquetConversions` and `OrcConversions` should be treated as regular `Analyzer` rules. It is not reasonable to be part of `HiveSessionCatalog`. This PR also combines two rules `ParquetConversions` and `OrcConversions` to build a new rule `RelationConversions `.
      
      After moving these two rules out of HiveSessionCatalog, the next step is to clean up, rename and move `HiveMetastoreCatalog` because it is not related to the hive package any more.
      
      ### How was this patch tested?
      The existing test cases
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17484 from gatorsmile/cleanup.
      b2349e6a
    • Ryan Blue's avatar
      [SPARK-20084][CORE] Remove internal.metrics.updatedBlockStatuses from history files. · c4c03eed
      Ryan Blue authored
      ## What changes were proposed in this pull request?
      
      Remove accumulator updates for internal.metrics.updatedBlockStatuses from SparkListenerTaskEnd entries in the history file. These can cause history files to grow to hundreds of GB because the value of the accumulator contains all tracked blocks.
      
      ## How was this patch tested?
      
      Current History UI tests cover use of the history file.
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #17412 from rdblue/SPARK-20084-remove-block-accumulator-info.
      c4c03eed
    • Kunal Khamar's avatar
      [SPARK-20164][SQL] AnalysisException not tolerant of null query plan. · 254877c2
      Kunal Khamar authored
      ## What changes were proposed in this pull request?
      
      The query plan in an `AnalysisException` may be `null` when an `AnalysisException` object is serialized and then deserialized, since `plan` is marked `transient`. Or when someone throws an `AnalysisException` with a null query plan (which should not happen).
      `def getMessage` is not tolerant of this and throws a `NullPointerException`, leading to loss of information about the original exception.
The fix is to add a `null` check in `getMessage`.
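The shape of the fix can be sketched in Python (the real class is Scala's `AnalysisException`; names here are illustrative): guard the optional plan before formatting the message, so the original error text survives.

```python
class AnalysisError(Exception):
    """Toy analogue of an exception whose `plan` may be None,
    e.g. after serialization round-trips a transient field away."""

    def __init__(self, message, plan=None):
        super().__init__(message)
        self.message = message
        self.plan = plan  # may legitimately be None

    def get_message(self):
        # The null check: format the plan only when one is attached,
        # instead of failing and losing the original message.
        if self.plan is not None:
            return "%s;\n%s" % (self.message, self.plan)
        return self.message
```

Without the guard, calling `get_message` on a deserialized instance would blow up on the missing plan rather than report the analysis error itself.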
      
      ## How was this patch tested?
      
      - Unit test
      
      Author: Kunal Khamar <kkhamar@outlook.com>
      
      Closes #17486 from kunalkhamar/spark-20164.
      254877c2
    • Reynold Xin's avatar
      [SPARK-20151][SQL] Account for partition pruning in scan metadataTime metrics · a8a765b3
      Reynold Xin authored
      ## What changes were proposed in this pull request?
After SPARK-20136, we report metadata timing metrics in the scan operator. However, that timing metric doesn't include one of the most important parts of metadata operations: partition pruning. This patch adds that time measurement to the scan metrics.
      
      ## How was this patch tested?
      N/A - I tried adding a test in SQLMetricsSuite but it was extremely convoluted to the point that I'm not sure if this is worth it.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #17476 from rxin/SPARK-20151.
      a8a765b3
  4. Mar 30, 2017
    • Wenchen Fan's avatar
      [SPARK-20121][SQL] simplify NullPropagation with NullIntolerant · c734fc50
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Instead of iterating all expressions that can return null for null inputs, we can just check `NullIntolerant`.
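The idea is a marker trait replacing an explicit whitelist of expression types. A Python sketch using a marker base class (names illustrative; Spark's real hierarchy is Scala traits):

```python
class Expression:
    def __init__(self, *children):
        self.children = children

class NullIntolerant(Expression):
    """Marker: the expression evaluates to null if any input is null."""

class Add(NullIntolerant):
    pass

class Coalesce(Expression):  # null-tolerant: skips null inputs
    pass

NULL = object()  # stand-in for SQL NULL

def propagate_null(expr, child_values):
    # Instead of enumerating every null-propagating expression type,
    # just check for the marker.
    if isinstance(expr, NullIntolerant) and NULL in child_values:
        return NULL
    return expr  # otherwise leave the expression for normal evaluation
```

The rule stays correct as new expressions are added: authors tag the expression with the marker once, rather than remembering to extend a list in `NullPropagation`.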
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #17450 from cloud-fan/null.
      c734fc50
    • Denis Bolshakov's avatar
[SPARK-20127][CORE] Fix a few warnings reported by IntelliJ IDEA · 5e00a5de
      Denis Bolshakov authored
      ## What changes were proposed in this pull request?
A few changes related to IntelliJ IDEA inspections.
      
      ## How was this patch tested?
      Changes were tested by existing unit tests
      
      Author: Denis Bolshakov <denis.bolshakov@onefactor.com>
      
      Closes #17458 from dbolshak/SPARK-20127.
      5e00a5de
    • Seigneurin, Alexis (CONT)'s avatar
      [DOCS][MINOR] Fixed a few typos in the Structured Streaming documentation · 669a11b6
      Seigneurin, Alexis (CONT) authored
      Fixed a few typos.
      
      There is one more I'm not sure of:
      
      ```
              Append mode uses watermark to drop old aggregation state. But the output of a
              windowed aggregation is delayed the late threshold specified in `withWatermark()` as by
              the modes semantics, rows can be added to the Result Table only once after they are
      ```
      
      Not sure how to change `is delayed the late threshold`.
      
      Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com>
      
      Closes #17443 from aseigneurin/typos.
      669a11b6
    • Kent Yao's avatar
      [SPARK-20096][SPARK SUBMIT][MINOR] Expose the right queue name not null if set... · e9d268f6
      Kent Yao authored
      [SPARK-20096][SPARK SUBMIT][MINOR] Expose the right queue name not null if set by --conf or configure file
      
      ## What changes were proposed in this pull request?
      
When submitting apps with -v or --verbose, the correct queue name should be printed. But if a queue is set with `spark.yarn.queue` via --conf or in spark-defaults.conf, we just get `null` for the queue in the parsed arguments.
      ```
      bin/spark-shell -v --conf spark.yarn.queue=thequeue
      Using properties file: /home/hadoop/spark-2.1.0-bin-apache-hdp2.7.3/conf/spark-defaults.conf
      ....
      Adding default property: spark.yarn.queue=default
      Parsed arguments:
        master                  yarn
        deployMode              client
        ...
        queue                   null
        ....
        verbose                 true
      Spark properties used, including those specified through
       --conf and those from the properties file /home/hadoop/spark-2.1.0-bin-apache-hdp2.7.3/conf/spark-defaults.conf:
        spark.yarn.queue -> thequeue
        ....
      ```
      ## How was this patch tested?
      
      ut and local verify
      
      Author: Kent Yao <yaooqinn@hotmail.com>
      
      Closes #17430 from yaooqinn/SPARK-20096.
      e9d268f6
    • samelamin's avatar
      [SPARK-19999] Workaround JDK-8165231 to identify PPC64 architectures as supporting unaligned access · 258bff2c
      samelamin authored
       java.nio.Bits.unaligned() does not return true for the ppc64le arch.
      see https://bugs.openjdk.java.net/browse/JDK-8165231
      ## What changes were proposed in this pull request?
      check architecture
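The workaround amounts to treating known-good architectures as supporting unaligned access even when the `java.nio.Bits.unaligned()` probe says otherwise. A Python sketch of the same check (the architecture list is illustrative, not copied from the PR):

```python
import platform

# Architectures known to support unaligned memory access even though
# java.nio.Bits.unaligned() reports false for some of them (JDK-8165231).
# Illustrative list, not the PR's exact one.
_UNALIGNED_OK_ARCHES = {"ppc64", "ppc64le", "i386", "x86", "amd64", "x86_64"}

def supports_unaligned_access(arch=None, jvm_probe=False):
    """Trust the JVM probe when it says yes; otherwise fall back to a
    whitelist of architectures where the probe is known to be wrong."""
    arch = (arch or platform.machine()).lower()
    return jvm_probe or arch in _UNALIGNED_OK_ARCHES
```

On ppc64le the probe returns false, so the whitelist is what rescues the platform.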
      
      ## How was this patch tested?
      
      unit test
      
      Author: samelamin <hussam.elamin@gmail.com>
      Author: samelamin <sam_elamin@discovery.com>
      
      Closes #17472 from samelamin/SPARK-19999.
      258bff2c
    • Jacek Laskowski's avatar
      [DOCS] Docs-only improvements · 0197262a
      Jacek Laskowski authored
      …adoc
      
      ## What changes were proposed in this pull request?
      
      Use recommended values for row boundaries in Window's scaladoc, i.e. `Window.unboundedPreceding`, `Window.unboundedFollowing`, and `Window.currentRow` (that were introduced in 2.1.0).
      
      ## How was this patch tested?
      
      Local build
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #17417 from jaceklaskowski/window-expression-scaladoc.
      0197262a
    • Shubham Chopra's avatar
      [SPARK-15354][CORE] Topology aware block replication strategies · b454d440
      Shubham Chopra authored
      ## What changes were proposed in this pull request?
      
This adds implementations of resilient block replication strategies for different resource managers, mirroring the 3-replica strategy used by HDFS: the first replica is on an executor, the second replica is within the same rack as the executor, and the third replica is on a different rack.
The implementation provides two pluggable classes: one running in the driver that provides topology information for every host at cluster start, and a second that prioritizes a list of peer BlockManagerIds.

The prioritization itself can be thought of as an optimization problem: find a minimal set of peers that satisfy certain objectives, and replicate to these peers first. The objectives can be used to express richer constraints over and above the HDFS-like 3-replica strategy.
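The prioritization step can be pictured as picking, from a list of candidate peers, one in-rack peer and one off-rack peer first. A simplified Python sketch (not Spark's `BlockReplicationPolicy` API; field names are made up):

```python
import random

def prioritize_peers(executor_rack, peers, seed=0):
    """Order candidate peers so the first replicas mirror HDFS-style
    placement: one peer in the executor's rack, then one in a different
    rack, then everything else as fallback."""
    rng = random.Random(seed)
    in_rack = [p for p in peers if p["rack"] == executor_rack]
    off_rack = [p for p in peers if p["rack"] != executor_rack]
    rng.shuffle(in_rack)   # randomize within each objective group
    rng.shuffle(off_rack)
    return in_rack[:1] + off_rack[:1] + in_rack[1:] + off_rack[1:]
```

Replicating down this ordered list satisfies the placement objectives first and degrades gracefully when a rack has no available peers.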
      ## How was this patch tested?
      
      This patch was tested with unit tests for storage, along with new unit tests to verify prioritization behaviour.
      
      Author: Shubham Chopra <schopra31@bloomberg.net>
      
      Closes #13932 from shubhamchopra/PrioritizerStrategy.
      b454d440
    • Yuming Wang's avatar
      [SPARK-20107][DOC] Add... · edc87d76
      Yuming Wang authored
      [SPARK-20107][DOC] Add spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version option to configuration.md
      
      ## What changes were proposed in this pull request?
      
      Add `spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version` option to `configuration.md`.
Setting `spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2` can speed up [HadoopMapReduceCommitProtocol.commitJob](https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121) for many output files.

Cloudera's Hadoop 2.6.0-cdh5.4.0 or higher (see: https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433 and https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0) and Apache Hadoop 2.7.0 or higher support this improvement.
      
      More see:
      
      1. [MAPREDUCE-4815](https://issues.apache.org/jira/browse/MAPREDUCE-4815): Speed up FileOutputCommitter#commitJob for many output files.
      2. [MAPREDUCE-6406](https://issues.apache.org/jira/browse/MAPREDUCE-6406): Update the default version for the property mapreduce.fileoutputcommitter.algorithm.version to 2.
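With the option documented, it can be passed like any other Hadoop property via the `spark.hadoop.` prefix. A hedged example invocation (the class and jar names are placeholders):

```shell
# Use the v2 commit algorithm to speed up commitJob with many output files.
# Requires Apache Hadoop 2.7.0+ (or a CDH build with MAPREDUCE-4815 backported).
spark-submit \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
  --class com.example.MyApp \
  my-app.jar
```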
      
      ## How was this patch tested?
      
Manual test and existing tests.
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #17442 from wangyum/SPARK-20107.
      edc87d76
  5. Mar 29, 2017
    • wm624@hotmail.com's avatar
      [MINOR][SPARKR] Add run command comment in examples · 471de5db
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
There are two examples in the R folder missing the run commands.
      
      In this PR, I just add the missing comment, which is consistent with other examples.
      
      ## How was this patch tested?
      
      Manual test.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #17474 from wangmiao1981/stat.
      471de5db
    • Eric Liang's avatar
      [SPARK-20148][SQL] Extend the file commit API to allow subscribing to task commit messages · 79636054
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      The internal FileCommitProtocol interface returns all task commit messages in bulk to the implementation when a job finishes. However, it is sometimes useful to access those messages before the job completes, so that the driver gets incremental progress updates before the job finishes.
      
This adds an `onTaskCommit` listener to the internal API.
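The extension point can be sketched as a no-op hook on the commit protocol that implementations may override; a Python analogue (the real interface is Scala's `FileCommitProtocol`; this is illustrative only):

```python
class FileCommitProtocolSketch:
    """Minimal sketch: a commit protocol that surfaces task commit
    messages incrementally instead of only in bulk at job end."""

    def on_task_commit(self, task_message):
        # New listener hook: called once per committed task.
        # Default is a no-op so existing implementations keep working.
        pass

    def commit_task(self, task_message):
        # ... perform the actual commit, then notify the listener ...
        self.on_task_commit(task_message)
        return task_message

class ProgressTracker(FileCommitProtocolSketch):
    """Example subscriber: records commits as they happen, giving the
    driver incremental progress before the job finishes."""

    def __init__(self):
        self.committed = []

    def on_task_commit(self, task_message):
        self.committed.append(task_message)
```

Because the hook defaults to a no-op, the change is backward compatible for implementations that don't care about incremental progress.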
      
      ## How was this patch tested?
      
      Unit tests.
      
      cc rxin
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #17475 from ericl/file-commit-api-ext.
      79636054
    • Reynold Xin's avatar
      [SPARK-20136][SQL] Add num files and metadata operation timing to scan operator metrics · 60977889
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch adds explicit metadata operation timing and number of files in data source metrics. Those would be useful to include for performance profiling.
      
      Screenshot of a UI with this change (num files and metadata time are new metrics):
      
      <img width="321" alt="screen shot 2017-03-29 at 12 29 28 am" src="https://cloud.githubusercontent.com/assets/323388/24443272/d4ea58c0-1416-11e7-8940-ecb69375554a.png">
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #17465 from rxin/SPARK-20136.
      60977889
    • bomeng's avatar
      [SPARK-20146][SQL] fix comment missing issue for thrift server · 22f07fef
      bomeng authored
      ## What changes were proposed in this pull request?
      
      The column comment was missing while constructing the Hive TableSchema. This fix will preserve the original comment.
      
      ## How was this patch tested?
      
      I have added a new test case to test the column with/without comment.
      
      Author: bomeng <bmeng@us.ibm.com>
      
      Closes #17470 from bomeng/SPARK-20146.
      22f07fef
    • Takuya UESHIN's avatar
      [SPARK-19088][SQL] Fix 2.10 build. · dd2e7d52
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
Commit 6c70a38c broke the build for Scala 2.10. The commit uses some reflection APIs that are not available in Scala 2.10. This PR fixes them.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #17473 from ueshin/issues/SPARK-19088.
      dd2e7d52
    • Yuming Wang's avatar
      [SPARK-20120][SQL] spark-sql support silent mode · fe1d6b05
      Yuming Wang authored
      ## What changes were proposed in this pull request?
      
It is similar to Hive's silent mode, which shows only the query result. See: [Hive LanguageManual+Cli](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli) and [the implementation of Hive silent mode](https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/session/SessionState.java#L948-L950).

This PR sets the logger level to `WARN` to achieve a similar result.
      
      ## How was this patch tested?
      
      manual tests
      
      ![manual test spark sql silent mode](https://cloud.githubusercontent.com/assets/5399861/24390165/989b7780-13b9-11e7-8496-6e68f55757e3.gif)
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #17449 from wangyum/SPARK-20120.
      fe1d6b05
    • Xiao Li's avatar
      [SPARK-17075][SQL][FOLLOWUP] Add Estimation of Constant Literal · 5c8ef376
      Xiao Li authored
      ### What changes were proposed in this pull request?
      `FalseLiteral` and `TrueLiteral` should have been eliminated by optimizer rule `BooleanSimplification`, but null literals might be added by optimizer rule `NullPropagation`. For safety, our filter estimation should handle all the eligible literal cases.
      
      Our optimizer rule BooleanSimplification is unable to remove the null literal in many cases. For example, `a < 0 or null`. Thus, we need to handle null literal in filter estimation.
      
`Not` can be pushed down below `And` and `Or`. We could then see two consecutive `Not`s, which need to be collapsed into one. Because of the limited expression support in filter estimation, we only need to handle the case `Not(null)`, to avoid incorrect errors due to boolean operations on null. For details, see the matrix below.
      
      ```
      not NULL = NULL
      NULL or false = NULL
      NULL or true = true
      NULL or NULL = NULL
      NULL and false = false
      NULL and true = NULL
      NULL and NULL = NULL
      ```
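The matrix above is just SQL's three-valued logic with `None` standing for NULL; it can be checked mechanically in Python:

```python
def sql_not(a):
    # not NULL = NULL
    return None if a is None else (not a)

def sql_or(a, b):
    if a is True or b is True:
        return True          # NULL or true = true
    if a is False and b is False:
        return False
    return None              # any remaining combination involves NULL

def sql_and(a, b):
    if a is False or b is False:
        return False         # NULL and false = false
    if a is True and b is True:
        return True
    return None              # any remaining combination involves NULL
```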
      ### How was this patch tested?
      Added the test cases.
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17446 from gatorsmile/constantFilterEstimation.
      5c8ef376
    • Takeshi Yamamuro's avatar
      [SPARK-20009][SQL] Support DDL strings for defining schema in functions.from_json · c4008480
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
This PR adds `StructType.fromDDL` to convert a DDL-format string into a `StructType` for defining schemas in `functions.from_json`.
      
      ## How was this patch tested?
      Added tests in `JsonFunctionsSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #17406 from maropu/SPARK-20009.
      c4008480
    • Kunal Khamar's avatar
      [SPARK-20048][SQL] Cloning SessionState does not clone query execution listeners · 142f6d14
      Kunal Khamar authored
      ## What changes were proposed in this pull request?
      
      Bugfix from [SPARK-19540.](https://github.com/apache/spark/pull/16826)
      Cloning SessionState does not clone query execution listeners, so cloned session is unable to listen to events on queries.
      
      ## How was this patch tested?
      
      - Unit test
      
      Author: Kunal Khamar <kkhamar@outlook.com>
      
      Closes #17379 from kunalkhamar/clone-bugfix.
      142f6d14
    • Holden Karau's avatar
      [SPARK-19955][PYSPARK] Jenkins Python Conda based test. · d6ddfdf6
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      Allow Jenkins Python tests to use the installed conda to test Python 2.7 support & test pip installability.
      
      ## How was this patch tested?
      
      Updated shell scripts, ran tests locally with installed conda, ran tests in Jenkins.
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #17355 from holdenk/SPARK-19955-support-python-tests-with-conda.
      d6ddfdf6
    • jerryshao's avatar
      [SPARK-20059][YARN] Use the correct classloader for HBaseCredentialProvider · c622a87c
      jerryshao authored
      ## What changes were proposed in this pull request?
      
Currently we use the system classloader to find HBase jars; if they are specified via `--jars`, lookup fails with a ClassNotFound issue. So this changes the code to use the child classloader.

This also puts added jars and the main jar into the classpath of the submitted application in yarn cluster mode; otherwise HBase jars specified with `--jars` will never be honored in cluster mode, and fetching tokens on the client side will always fail.
      
      ## How was this patch tested?
      
      Unit test and local verification.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #17388 from jerryshao/SPARK-20059.
      c622a87c
    • Marcelo Vanzin's avatar
      [SPARK-19556][CORE] Do not encrypt block manager data in memory. · b56ad2b1
      Marcelo Vanzin authored
      This change modifies the way block data is encrypted to make the more
      common cases faster, while penalizing an edge case. As a side effect
      of the change, all data that goes through the block manager is now
      encrypted only when needed, including the previous path (broadcast
      variables) where that did not happen.
      
      The way the change works is by not encrypting data that is stored in
      memory; so if a serialized block is in memory, it will only be encrypted
      once it is evicted to disk.
      
      The penalty comes when transferring that encrypted data from disk. If the
      data ends up in memory again, it is as efficient as before; but if the
      evicted block needs to be transferred directly to a remote executor, then
      there's now a performance penalty, since the code now uses a custom
      FileRegion implementation to decrypt the data before transferring.
      
      This also means that block data transferred between executors now is
      not encrypted (and thus relies on the network library encryption support
      for secrecy). Shuffle blocks are still transferred in encrypted form,
      since they're handled in a slightly different way by the code. This also
      keeps compatibility with existing external shuffle services, which transfer
      encrypted shuffle blocks, and avoids having to make the external service
      aware of encryption at all.
      
      The serialization and deserialization APIs in the SerializerManager now
      do not do encryption automatically; callers need to explicitly wrap their
      streams with an appropriate crypto stream before using those.
      
      As a result of these changes, some of the workarounds added in SPARK-19520
      are removed here.
      
      Testing: a new trait ("EncryptionFunSuite") was added that provides an easy
      way to run a test twice, with encryption on and off; broadcast, block manager
      and caching tests were modified to use this new trait so that the existing
      tests exercise both encrypted and non-encrypted paths. I also ran some
      applications with encryption turned on to verify that they still work,
      including streaming tests that failed without the fix for SPARK-19520.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #17295 from vanzin/SPARK-19556.
      b56ad2b1
    • Reynold Xin's avatar
      [SPARK-20134][SQL] SQLMetrics.postDriverMetricUpdates to simplify driver side metric updates · 9712bd39
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      It is not super intuitive how to update SQLMetric on the driver side. This patch introduces a new SQLMetrics.postDriverMetricUpdates function to do that, and adds documentation to make it more obvious.
      
      ## How was this patch tested?
      Updated a test case to use this method.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #17464 from rxin/SPARK-20134.
      9712bd39
  6. Mar 28, 2017
    • Bago Amirbekian's avatar
      [SPARK-20040][ML][PYTHON] pyspark wrapper for ChiSquareTest · a5c87707
      Bago Amirbekian authored
      ## What changes were proposed in this pull request?
      
      A pyspark wrapper for spark.ml.stat.ChiSquareTest
      
      ## How was this patch tested?
      
      unit tests
      doctests
      
      Author: Bago Amirbekian <bago@databricks.com>
      
      Closes #17421 from MrBago/chiSquareTestWrapper.
      a5c87707
    • 颜发才(Yan Facai)'s avatar
      [SPARK-20043][ML] DecisionTreeModel: ImpurityCalculator builder fails for... · 7d432af8
      颜发才(Yan Facai) authored
      [SPARK-20043][ML] DecisionTreeModel: ImpurityCalculator builder fails for uppercase impurity type Gini
      
Fix bug: DecisionTreeModel can't recognize impurity "Gini" when loading
      
      TODO:
      + [x] add unit test
      + [x] fix the bug
      
      Author: 颜发才(Yan Facai) <facai.yan@gmail.com>
      
      Closes #17407 from facaiy/BUG/decision_tree_loader_failer_with_Gini_impurity.
      7d432af8
    • liujianhui's avatar
[SPARK-19868] Conflicting TaskSetManagers lead to Spark being stopped · 92e385e0
      liujianhui authored
      ## What changes were proposed in this pull request?
      
      We must set the taskset to zombie before the DAGScheduler handles the taskEnded event. It's possible the taskEnded event will cause the DAGScheduler to launch a new stage attempt (this happens when map output data was lost), and if this happens before the taskSet has been set to zombie, it will appear that we have conflicting task sets.
      
      Author: liujianhui <liujianhui@didichuxing>
      
      Closes #17208 from liujianhuiouc/spark-19868.
      92e385e0
    • Wenchen Fan's avatar
      [SPARK-20125][SQL] Dataset of type option of map does not work · d4fac410
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
When we build the deserializer expression for map type, we use `StaticInvoke` to call `ArrayBasedMapData.toScalaMap`, and declare the return type as `scala.collection.immutable.Map`. If the map is inside an Option, we wrap this `StaticInvoke` with `WrapOption`, which requires the input to be `scala.collection.Map`. Ideally this should be fine, as `scala.collection.immutable.Map` extends `scala.collection.Map`, but our `ObjectType` is too strict about this; this PR fixes it.
      
      ## How was this patch tested?
      
      new regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #17454 from cloud-fan/map.
      d4fac410
    • jerryshao's avatar
      [SPARK-19995][YARN] Register tokens to current UGI to avoid re-issuing of... · 17eddb35
      jerryshao authored
      [SPARK-19995][YARN] Register tokens to current UGI to avoid re-issuing of tokens in yarn client mode
      
      ## What changes were proposed in this pull request?
      
In the current Spark on YARN code, we obtain tokens from provided services, but we do not add these tokens to the current user's credentials. This makes all subsequent operations against these services still require a TGT rather than delegation tokens. This is unnecessary since we already obtained the tokens, and it also leads to failures in user impersonation scenarios, because the TGT is granted to the real user, not the proxy user.

So this changes the code to put all the tokens into the current UGI, so that subsequent operations against these services will honor tokens rather than the TGT, which also handles the proxy user issue mentioned above.
      
      ## How was this patch tested?
      
      Local verified in secure cluster.
      
      vanzin tgravescs mridulm  dongjoon-hyun please help to review, thanks a lot.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #17335 from jerryshao/SPARK-19995.
      17eddb35
    • Herman van Hovell's avatar
      [SPARK-20126][SQL] Remove HiveSessionState · f82461fc
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      Commit https://github.com/apache/spark/commit/ea361165e1ddce4d8aa0242ae3e878d7b39f1de2 moved most of the logic from the SessionState classes into an accompanying builder. This makes the existence of the `HiveSessionState` redundant. This PR removes the `HiveSessionState`.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #17457 from hvanhovell/SPARK-20126.
      f82461fc
    • wangzhenhua's avatar
      [SPARK-20124][SQL] Join reorder should keep the same order of final project attributes · 4fcc214d
      wangzhenhua authored
      ## What changes were proposed in this pull request?
      
The join reorder algorithm should keep exactly the same order of output attributes in the top project.
For example, if a user wants to select a, b, c, then after reordering we should output a, b, c in the order the user specified, instead of b, a, c or some other order.
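The invariant can be sketched as: whatever internal column order the reordered joins produce, a final projection restores the user-specified order. A toy Python illustration (not Spark's optimizer API):

```python
def project_in_requested_order(requested, rows):
    """`rows` come out of the reordered joins with an optimizer-chosen
    column order (modeled here as dicts); the top project puts the
    columns back into the user's requested order."""
    return [tuple(row[col] for col in requested) for row in rows]
```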
      
      ## How was this patch tested?
      
      A new test case is added in `JoinReorderSuite`.
      
      Author: wangzhenhua <wangzhenhua@huawei.com>
      
      Closes #17453 from wzhfy/keepOrderInProject.
      4fcc214d
    • wangzhenhua's avatar
      [SPARK-20094][SQL] Preventing push down of IN subquery to Join operator · 91559d27
      wangzhenhua authored
      ## What changes were proposed in this pull request?
      
TPCDS q45 fails because:
`ReorderJoin` collects all predicates and tries to put them into the join condition when creating the ordered join. If a predicate with an IN subquery (`ListQuery`) ends up in a join condition instead of a filter condition, `RewritePredicateSubquery.rewriteExistentialExpr` fails to convert the subquery to an `ExistenceJoin`, resulting in an error.
      
      We should prevent push down of IN subquery to Join operator.
      
      ## How was this patch tested?
      
      Add a new test case in `FilterPushdownSuite`.
      
      Author: wangzhenhua <wangzhenhua@huawei.com>
      
      Closes #17428 from wzhfy/noSubqueryInJoinCond.
      91559d27
    • Xiao Li's avatar
      [SPARK-20119][TEST-MAVEN] Fix the test case fail in DataSourceScanExecRedactionSuite · a9abff28
      Xiao Li authored
      ### What changes were proposed in this pull request?
      Changed the pattern to match the first n characters in the location field so that the string truncation does not affect it.
      
      ### How was this patch tested?
      N/A
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17448 from gatorsmile/fixTestCAse.
      a9abff28