  1. Jun 15, 2016
    • Reynold Xin's avatar
      [SPARK-15851][BUILD] Fix the call of the bash script to enable proper run in Windows · 5a52ba0f
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      The way the bash script `build/spark-build-info` is called from core/pom.xml prevents Spark from building on Windows. Instead of calling the script directly, we call bash and pass the script as an argument. This enables running it on Windows with bash installed, which typically comes with Git.
      
      This brings https://github.com/apache/spark/pull/13612 up-to-date and also addresses comments from the code review.
      
      Closes #13612
      
      ## How was this patch tested?
      I built manually (on a Mac) to verify it didn't break Mac compilation.
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: avulanov <nashb@yandex.ru>
      
      Closes #13691 from rxin/SPARK-15851.
      5a52ba0f
    • Wayne Song's avatar
      [SPARK-13498][SQL] Increment the recordsRead input metric for JDBC data source · ebdd7512
      Wayne Song authored
      ## What changes were proposed in this pull request?
      This patch brings https://github.com/apache/spark/pull/11373 up-to-date and increments the record count for JDBC data source.
      
      Closes #11373.
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13694 from rxin/SPARK-13498.
      ebdd7512
    • Reynold Xin's avatar
      [SPARK-15979][SQL] Rename various Parquet support classes. · 865e7cc3
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch renames various Parquet support classes from CatalystAbc to ParquetAbc. This new naming makes more sense for two reasons:
      
      1. These are not optimizer related (i.e. Catalyst) classes.
      2. We are in the Spark code base, and as a result it'd be more clear to call out these are Parquet support classes, rather than some Spark classes.
      
      ## How was this patch tested?
      Renamed test cases as well.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13696 from rxin/parquet-rename.
      865e7cc3
    • KaiXinXiaoLei's avatar
      [SPARK-12492][SQL] Add missing SQLExecution.withNewExecutionId for hiveResultString · 3e6d567a
      KaiXinXiaoLei authored
      ## What changes were proposed in this pull request?
      
      Add missing SQLExecution.withNewExecutionId for hiveResultString so that queries running in `spark-sql` will be shown in Web UI.
      
      Closes #13115
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: KaiXinXiaoLei <huleilei1@huawei.com>
      
      Closes #13689 from zsxwing/pr13115.
      3e6d567a
    • Wojciech Jurczyk's avatar
      [DOCS] Fix Gini and Entropy scaladocs in context of multiclass classification · 6e0b3d79
      Wojciech Jurczyk authored
      The PR updates the outdated scaladocs for the Gini and Entropy classes. Since PR #886, Spark has supported multiclass classification, but the docs describe only binary classification.
      
      Author: Wojciech Jurczyk <wojciech.jurczyk@codilime.com>
      
      Closes #11252 from wjur/wjur/docs_multiclass.
      6e0b3d79
    • Davies Liu's avatar
      a153e41c
    • Reynold Xin's avatar
      Closing stale pull requests. · 1a33f2e0
      Reynold Xin authored
      Closes #13103
      Closes #8320
      Closes #7871
      Closes #7461
      Closes #9159
      Closes #9150
      Closes #9200
      Closes #9089
      Closes #8022
      Closes #6767
      Closes #8505
      Closes #9457
      Closes #9397
      Closes #8563
      Closes #10062
      Closes #9944
      Closes #10137
      Closes #10148
      Closes #9057
      Closes #10163
      Closes #8023
      Closes #10302
      Closes #8979
      Closes #8981
      Closes #10258
      Closes #7345
      Closes #9183
      Closes #10087
      Closes #10292
      Closes #10254
      Closes #10374
      Closes #8915
      Closes #10128
      Closes #10666
      Closes #8533
      Closes #10625
      Closes #8013
      Closes #8427
      Closes #7753
      Closes #10116
      Closes #11005
      Closes #10797
      Closes #11026
      Closes #11009
      Closes #10117
      Closes #11382
      Closes #9483
      Closes #10566
      Closes #10753
      Closes #11386
      Closes #9097
      Closes #11245
      Closes #11257
      Closes #11045
      Closes #10144
      Closes #11066
      Closes #8610
      Closes #10634
      Closes #11224
      Closes #11212
      Closes #11244
      Closes #10326
      Closes #13524
      1a33f2e0
    • Nirman Narang's avatar
      [SPARK-7848][STREAMING][UPDATE SPARKSTREAMING DOCS TO INCORPORATE IMPORTANT POINTS.] · 04d7b3d2
      Nirman Narang authored
      Updated the SparkStreaming Doc with some important points.
      
      Author: Nirman Narang <narang@us.ibm.com>
      
      Closes #11114 from nirmannarang/SPARK-7848.
      04d7b3d2
    • Imran Rashid's avatar
      [HOTFIX][CORE] fix flaky BasicSchedulerIntegrationTest · cafc696d
      Imran Rashid authored
      ## What changes were proposed in this pull request?
      
      SPARK-15927 exacerbated a race in BasicSchedulerIntegrationTest, so it went from very unlikely to fairly frequent.  The issue is that stage numbering is not completely deterministic, but these tests treated it like it was.  So turn off the tests.
      
      ## How was this patch tested?
      
      On my laptop the test failed about 10% of the time before this change, and didn't fail in 500 runs after the change.
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #13688 from squito/hotfix_basic_scheduler.
      cafc696d
    • Sean Zhong's avatar
      [SPARK-15776][SQL] Divide Expression inside Aggregation function is casted to wrong type · 9bd80ad6
      Sean Zhong authored
      ## What changes were proposed in this pull request?
      
      This PR fixes the problem that a Divide expression inside an aggregation function is cast to the wrong type, which causes `select 1/2` and `select sum(1/2)` to return different results.
      
      **Before the change:**
      
      ```
      scala> sql("select 1/2 as a").show()
      +---+
      |  a|
      +---+
      |0.5|
      +---+
      
      scala> sql("select sum(1/2) as a").show()
      +---+
      |  a|
      +---+
      |0  |
      +---+
      
      scala> sql("select sum(1 / 2) as a").schema
      res4: org.apache.spark.sql.types.StructType = StructType(StructField(a,LongType,true))
      ```
      
      **After the change:**
      
      ```
      scala> sql("select 1/2 as a").show()
      +---+
      |  a|
      +---+
      |0.5|
      +---+
      
      scala> sql("select sum(1/2) as a").show()
      +---+
      |  a|
      +---+
      |0.5|
      +---+
      
      scala> sql("select sum(1/2) as a").schema
      res4: org.apache.spark.sql.types.StructType = StructType(StructField(a,DoubleType,true))
      ```
      
      ## How was this patch tested?
      
      Unit test.
      
      This PR is based on https://github.com/apache/spark/pull/13524 by Sephiroth-Lin
      
      Author: Sean Zhong <seanzhong@databricks.com>
      
      Closes #13651 from clockfly/SPARK-15776.
      9bd80ad6
    • Egor Pakhomov's avatar
      [SPARK-15934] [SQL] Return binary mode in ThriftServer · 049e639f
      Egor Pakhomov authored
      Returning binary mode to ThriftServer for backward compatibility.
      
      Tested with Squirrel and Tableau.
      
      Author: Egor Pakhomov <egor@anchorfree.com>
      
      Closes #13667 from epahomov/SPARK-15095-2.0.
      049e639f
    • gatorsmile's avatar
      [SPARK-15901][SQL][TEST] Verification of CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET · 09925735
      gatorsmile authored
      #### What changes were proposed in this pull request?
      So far, we do not have test cases for verifying whether the external parameters `HiveUtils.CONVERT_METASTORE_ORC` and `HiveUtils.CONVERT_METASTORE_PARQUET` work properly when users use non-default values. This PR adds such test cases to avoid potential regressions.
      
      #### How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #13622 from gatorsmile/addTestCase4parquetOrcConversion.
      09925735
    • Nezih Yigitbasi's avatar
      [SPARK-15782][YARN] Set spark.jars system property in client mode · 4df8df5c
      Nezih Yigitbasi authored
      ## What changes were proposed in this pull request?
      
      When `--packages` is specified with `spark-shell` the classes from those packages cannot be found, which I think is due to some of the changes in `SPARK-12343`. In particular `SPARK-12343` removes a line that sets the `spark.jars` system property in client mode, which is used by the repl main class to set the classpath.
      
      ## How was this patch tested?
      
      Tested manually.
      
      This system property is used by the repl to populate its classpath. If
      this is not set properly the classes for external packages cannot be
      found.
      
      tgravescs vanzin as you may be familiar with this part of the code.
      
      Author: Nezih Yigitbasi <nyigitbasi@netflix.com>
      
      Closes #13527 from nezihyigitbasi/repl-fix.
      4df8df5c
    • Davies Liu's avatar
      [SPARK-15888] [SQL] fix Python UDF with aggregate · 5389013a
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      After we moved the ExtractPythonUDF rule into the physical plan, Python UDFs no longer work on top of an aggregate, because they can't be evaluated before the aggregate and must be evaluated after it. This PR adds another rule that extracts this kind of Python UDF from the logical aggregate and creates a Project on top of the Aggregate.
      
      ## How was this patch tested?
      
      Added regression tests. The plan of added test query looks like this:
      ```
      == Parsed Logical Plan ==
      'Project [<lambda>('k, 's) AS t#26]
      +- Aggregate [<lambda>(key#5L)], [<lambda>(key#5L) AS k#17, sum(cast(<lambda>(value#6) as bigint)) AS s#22L]
         +- LogicalRDD [key#5L, value#6]
      
      == Analyzed Logical Plan ==
      t: int
      Project [<lambda>(k#17, s#22L) AS t#26]
      +- Aggregate [<lambda>(key#5L)], [<lambda>(key#5L) AS k#17, sum(cast(<lambda>(value#6) as bigint)) AS s#22L]
         +- LogicalRDD [key#5L, value#6]
      
      == Optimized Logical Plan ==
      Project [<lambda>(agg#29, agg#30L) AS t#26]
      +- Aggregate [<lambda>(key#5L)], [<lambda>(key#5L) AS agg#29, sum(cast(<lambda>(value#6) as bigint)) AS agg#30L]
         +- LogicalRDD [key#5L, value#6]
      
      == Physical Plan ==
      *Project [pythonUDF0#37 AS t#26]
      +- BatchEvalPython [<lambda>(agg#29, agg#30L)], [agg#29, agg#30L, pythonUDF0#37]
         +- *HashAggregate(key=[<lambda>(key#5L)#31], functions=[sum(cast(<lambda>(value#6) as bigint))], output=[agg#29,agg#30L])
            +- Exchange hashpartitioning(<lambda>(key#5L)#31, 200)
               +- *HashAggregate(key=[pythonUDF0#34 AS <lambda>(key#5L)#31], functions=[partial_sum(cast(pythonUDF1#35 as bigint))], output=[<lambda>(key#5L)#31,sum#33L])
                  +- BatchEvalPython [<lambda>(key#5L), <lambda>(value#6)], [key#5L, value#6, pythonUDF0#34, pythonUDF1#35]
                     +- Scan ExistingRDD[key#5L,value#6]
      ```
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13682 from davies/fix_py_udf.
      5389013a
    • Tejas Patil's avatar
      [SPARK-15826][CORE] PipedRDD to allow configurable char encoding · 279bd4aa
      Tejas Patil authored
      ## What changes were proposed in this pull request?
      
      Link to jira which describes the problem: https://issues.apache.org/jira/browse/SPARK-15826
      
      The fix in this PR is to allow users to specify the encoding in the pipe() operation. For backward compatibility,
      the default value remains the system default.
      
      ## How was this patch tested?
      
      Ran existing unit tests
      
      Author: Tejas Patil <tejasp@fb.com>
      
      Closes #13563 from tejasapatil/pipedrdd_utf8.
      279bd4aa
    • Liwei Lin's avatar
      [SPARK-15518][CORE][FOLLOW-UP] Rename LocalSchedulerBackendEndpoint -> LocalSchedulerBackend · 9b234b55
      Liwei Lin authored
      ## What changes were proposed in this pull request?
      
      This patch is a follow-up to https://github.com/apache/spark/pull/13288 completing the renaming:
       - LocalScheduler -> LocalSchedulerBackend~~Endpoint~~
      
      ## How was this patch tested?
      
      Updated test cases to reflect the name change.
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #13683 from lw-lin/rename-backend.
      9b234b55
    • Yin Huai's avatar
      [SPARK-15959][SQL] Add the support of hive.metastore.warehouse.dir back · e1585cc7
      Yin Huai authored
      ## What changes were proposed in this pull request?
      This PR adds the support of conf `hive.metastore.warehouse.dir` back. With this patch, the way of setting the warehouse dir is described as follows:
      * If `spark.sql.warehouse.dir` is set, `hive.metastore.warehouse.dir` will be automatically set to the value of `spark.sql.warehouse.dir`. The warehouse dir is effectively set to the value of `spark.sql.warehouse.dir`.
      * If `spark.sql.warehouse.dir` is not set but `hive.metastore.warehouse.dir` is set, `spark.sql.warehouse.dir` will be automatically set to the value of `hive.metastore.warehouse.dir`. The warehouse dir is effectively set to the value of `hive.metastore.warehouse.dir`.
      * If neither `spark.sql.warehouse.dir` nor `hive.metastore.warehouse.dir` is set, `hive.metastore.warehouse.dir` will be automatically set to the default value of `spark.sql.warehouse.dir`. The warehouse dir is effectively set to the default value of `spark.sql.warehouse.dir`.
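The three rules above amount to a small precedence function. The sketch below is plain Python rather than Spark's implementation; the function name `resolve_warehouse_dir` and the `spark_default` argument are assumptions made for illustration.

```python
SPARK_KEY = "spark.sql.warehouse.dir"
HIVE_KEY = "hive.metastore.warehouse.dir"


def resolve_warehouse_dir(conf, spark_default):
    """Return a copy of conf with both keys set to the effective warehouse dir.

    conf is a plain dict of user-supplied settings; spark_default stands in
    for the default value of spark.sql.warehouse.dir (hypothetical here).
    """
    resolved = dict(conf)
    if SPARK_KEY in conf:
        # Rule 1: spark.sql.warehouse.dir wins and is copied to the Hive key.
        effective = conf[SPARK_KEY]
    elif HIVE_KEY in conf:
        # Rule 2: only the Hive key is set, so it is copied to the Spark key.
        effective = conf[HIVE_KEY]
    else:
        # Rule 3: neither key is set, so the Spark default is used for both.
        effective = spark_default
    resolved[SPARK_KEY] = effective
    resolved[HIVE_KEY] = effective
    return resolved
```

Whichever rule applies, both keys end up holding the same value, so code reading either one sees the same warehouse dir.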
      
      ## How was this patch tested?
      `set hive.metastore.warehouse.dir` in `HiveSparkSubmitSuite`.
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-15959
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #13679 from yhuai/hiveWarehouseDir.
      e1585cc7
    • Tathagata Das's avatar
      [SPARK-15953][WIP][STREAMING] Renamed ContinuousQuery to StreamingQuery · 9a507199
      Tathagata Das authored
      Renamed for simplicity, so that it's obvious that it's related to streaming.
      
      Existing unit tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #13673 from tdas/SPARK-15953.
      9a507199
    • Felix Cheung's avatar
      [SPARK-15637][SPARK-15931][SPARKR] Fix R masked functions checks · d30b7e66
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Because of the fix in SPARK-15684, this exclusion is no longer necessary.
      
      ## How was this patch tested?
      
      unit tests
      
      shivaram
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #13636 from felixcheung/rendswith.
      d30b7e66
    • Herman van Hovell's avatar
      [SPARK-15960][SQL] Rename `spark.sql.enableFallBackToHdfsForStats` config · de99c3d0
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      Since we are probably going to add more statistics related configurations in the future, I'd like to rename the newly added `spark.sql.enableFallBackToHdfsForStats` configuration option to `spark.sql.statistics.fallBackToHdfs`. This allows us to put all statistics related configurations in the same namespace.
      
      ## How was this patch tested?
      None - just a usability thing
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #13681 from hvanhovell/SPARK-15960.
      de99c3d0
    • Marcelo Vanzin's avatar
      [SPARK-15046][YARN] Parse value of token renewal interval correctly. · 40eeef95
      Marcelo Vanzin authored
      Use the config variable definition both to set and parse the value,
      avoiding issues with code expecting the value in a different format.
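The idea of letting one definition own both the write and the read format can be sketched as follows. This is an illustrative Python model, not Spark's config API; the class `MillisConfigEntry`, the `ms` suffix convention, and the key name are all hypothetical.

```python
class MillisConfigEntry:
    """A hypothetical config definition that owns both the write and read format."""

    def __init__(self, key):
        self.key = key

    def set(self, conf, millis):
        # Always store the value with an explicit unit suffix.
        conf[self.key] = f"{int(millis)}ms"

    def get(self, conf):
        # Parse with the same convention used by set().
        raw = conf[self.key]
        if not raw.endswith("ms"):
            raise ValueError(f"{self.key}: expected a value ending in 'ms', got {raw!r}")
        return int(raw[:-2])


# The key name below is made up for the example.
RENEWAL_INTERVAL = MillisConfigEntry("example.token.renewal.interval")

conf = {}
RENEWAL_INTERVAL.set(conf, 24 * 60 * 60 * 1000)
interval_ms = RENEWAL_INTERVAL.get(conf)
```

Because callers never format or parse the raw string themselves, the writer and the reader cannot disagree about the format.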
      
      Tested by running spark-submit with --principal / --keytab.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #13669 from vanzin/SPARK-15046.
      40eeef95
  2. Jun 14, 2016
    • Shixiong Zhu's avatar
      [SPARK-15935][PYSPARK] Fix a wrong format tag in the error message · 0ee9fd9e
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      A follow up PR for #13655 to fix a wrong format tag.
      
      ## How was this patch tested?
      
      Jenkins unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13665 from zsxwing/fix.
      0ee9fd9e
    • Xiangrui Meng's avatar
      [SPARK-15945][MLLIB] Conversion between old/new vector columns in a DataFrame (Scala/Java) · 63e0aebe
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
      This PR provides conversion utils between old/new vector columns in a DataFrame. So users can use it to migrate their datasets and pipelines manually. The methods are implemented under `MLUtils` and called `convertVectorColumnsToML` and `convertVectorColumnsFromML`. Both take a DataFrame and a list of vector columns to be converted. It is a no-op on vector columns that are already converted. A warning message is logged if actual conversion happens.
      
      This is the first sub-task under SPARK-15944 to make it easier to migrate existing pipelines to Spark 2.0.
      
      ## How was this patch tested?
      
      Unit tests in Scala and Java.
      
      cc: yanboliang
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #13662 from mengxr/SPARK-15945.
      63e0aebe
    • bomeng's avatar
      [SPARK-15952][SQL] fix "show databases" ordering issue · 42a28caf
      bomeng authored
      ## What changes were proposed in this pull request?
      
      Two issues I've found for "show databases" command:
      
      1. The returned list of database names was not sorted; it was sorted only when "like" was also used. (Hive always returns a sorted list.)
      
      2. When it is used as sql("show databases").show, it will output a table with column named as "result", but for sql("show tables").show, it will output the column name as "tableName", so I think we should be consistent and use "databaseName" at least.
      
      ## How was this patch tested?
      
      Updated existing test case to test its ordering as well.
      
      Author: bomeng <bmeng@us.ibm.com>
      
      Closes #13671 from bomeng/SPARK-15952.
      42a28caf
    • Herman van Hovell's avatar
      [SPARK-15011][SQL] Re-enable 'analyze MetastoreRelations' in hive StatisticsSuite · 0bd86c0f
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      This PR re-enables the `analyze MetastoreRelations` test in `org.apache.spark.sql.hive.StatisticsSuite`.
      
      The flakiness of this test was traced back to a shared configuration option, `hive.exec.compress.output`, in `TestHive`. This property was set to `true` by the `HiveCompatibilitySuite`. I have added configuration resetting logic to `HiveComparisonTest`, in order to prevent such a thing from happening again.
      
      ## How was this patch tested?
      Is a test.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #13498 from hvanhovell/SPARK-15011.
      0bd86c0f
    • Tathagata Das's avatar
      [SPARK-15933][SQL][STREAMING] Refactored DF reader-writer to use readStream... · 214adb14
      Tathagata Das authored
      [SPARK-15933][SQL][STREAMING] Refactored DF reader-writer to use readStream and writeStream for streaming DFs
      
      ## What changes were proposed in this pull request?
      Currently, the DataFrameReader/Writer has methods that are needed for streaming and non-streaming DFs. This is quite awkward because each method in them throws a runtime exception for one case or the other. So rather than having half the methods throw runtime exceptions, it's better to have a different reader/writer API for streams.
      
      - [x] Python API!!
      
      ## How was this patch tested?
      Existing unit tests + two sets of unit tests for DataFrameReader/Writer and DataStreamReader/Writer.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #13653 from tdas/SPARK-15933.
      214adb14
    • Kay Ousterhout's avatar
      [SPARK-15927] Eliminate redundant DAGScheduler code. · 5d50d4f0
      Kay Ousterhout authored
      To try to eliminate redundant code to traverse the RDD dependency graph,
      this PR creates a new function getShuffleDependencies that returns
      shuffle dependencies that are immediate parents of a given RDD.  This
      new function is used by getParentStages and
      getAncestorShuffleDependencies.
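A toy model of such a traversal is sketched below in Python. It is not the DAGScheduler code: the `Rdd`, `NarrowDep`, and `ShuffleDep` classes are simplified stand-ins invented for the example. The walk follows narrow dependencies and stops at the first shuffle dependency on each path, so only immediate shuffle parents are returned.

```python
class Rdd:
    """Toy stand-in for an RDD: a name plus a list of dependencies."""

    def __init__(self, name, deps=()):
        self.name = name
        self.deps = list(deps)


class NarrowDep:
    def __init__(self, rdd):
        self.rdd = rdd


class ShuffleDep:
    def __init__(self, rdd):
        self.rdd = rdd


def get_shuffle_dependencies(rdd):
    """Return the shuffle dependencies that are immediate parents of rdd.

    Narrow dependencies are followed; the walk stops at each shuffle
    dependency, so shuffles further up the lineage are not returned.
    """
    parents = []
    visited = set()
    waiting = [rdd]
    while waiting:
        current = waiting.pop()
        if id(current) in visited:
            continue
        visited.add(id(current))
        for dep in current.deps:
            if isinstance(dep, ShuffleDep):
                if dep not in parents:
                    parents.append(dep)
            else:
                waiting.append(dep.rdd)
    return parents


# Lineage: a <-shuffle- b <-narrow- c <-shuffle- d
a = Rdd("a")
dep_ab = ShuffleDep(a)
b = Rdd("b", [dep_ab])
c = Rdd("c", [NarrowDep(b)])
dep_cd = ShuffleDep(c)
d = Rdd("d", [dep_cd])
```

Here `get_shuffle_dependencies(d)` yields only `dep_cd`, while calling it on `c` reaches `dep_ab` through the narrow dependency on `b`.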
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #13646 from kayousterhout/SPARK-15927.
      5d50d4f0
    • Takeshi YAMAMURO's avatar
      [SPARK-15247][SQL] Set the default number of partitions for reading parquet schemas · dae4d5db
      Takeshi YAMAMURO authored
      ## What changes were proposed in this pull request?
      This pr sets the default number of partitions when reading parquet schemas.
      SQLContext#read#parquet currently yields at least n_executors * n_cores tasks even if the parquet data consists of a single small file. This can increase the latency of small jobs.
      
      ## How was this patch tested?
      Manually tested and checked.
      
      Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
      
      Closes #13137 from maropu/SPARK-15247.
      dae4d5db
    • Cheng Lian's avatar
      [SPARK-15895][SQL] Filters out metadata files while doing partition discovery · bd39ffe3
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      Take the following directory layout as an example:
      
      ```
      dir/
      +- p0=0/
         |-_metadata
         +- p1=0/
            |-part-00001.parquet
            |-part-00002.parquet
            |-...
      ```
      
      The `_metadata` file under `p0=0` shouldn't fail partition discovery.
      
      This PR filters out all metadata files whose names start with `_` while doing partition discovery.
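The filtering rule can be illustrated with a simplified Python sketch. It is a stand-in for Spark's discovery logic, not a port of it; the helper names `is_metadata_file` and `discover_partitions` are assumptions, and paths are treated as plain `/`-separated strings.

```python
import posixpath


def is_metadata_file(path):
    """True if the last path component starts with an underscore."""
    return posixpath.basename(path.rstrip("/")).startswith("_")


def discover_partitions(paths):
    """Map each data file to its partition values, skipping metadata files.

    Partition values are read from directory components of the form
    name=value; components without a value are ignored.
    """
    result = {}
    for path in paths:
        if is_metadata_file(path):
            continue
        directories = path.split("/")[:-1]
        result[path] = {
            name: value
            for name, _, value in (d.partition("=") for d in directories)
            if value
        }
    return result


paths = [
    "dir/p0=0/_metadata",
    "dir/p0=0/p1=0/part-00001.parquet",
    "dir/p0=0/p1=0/part-00002.parquet",
]
found = discover_partitions(paths)
```

With the layout from the example above, `_metadata` is skipped and both part files map to the partition values `p0=0` and `p1=0`.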
      
      ## How was this patch tested?
      
      New unit test added in `ParquetPartitionDiscoverySuite`.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #13623 from liancheng/spark-15895-partition-disco-no-metafiles.
      bd39ffe3
    • gatorsmile's avatar
      [SPARK-15864][SQL] Fix Inconsistent Behaviors when Uncaching Non-cached Tables · df4ea661
      gatorsmile authored
      #### What changes were proposed in this pull request?
      To uncache a table, we have three different ways:
      - _SQL interface_: `UNCACHE TABLE`
      - _DataSet API_: `sparkSession.catalog.uncacheTable`
      - _DataSet API_: `sparkSession.table(tableName).unpersist()`
      
      When the table is not cached,
      - _SQL interface_: `UNCACHE TABLE non-cachedTable` -> **no error message**
      - _Dataset API_: `sparkSession.catalog.uncacheTable("non-cachedTable")` -> **report a strange error message:**
      ```requirement failed: Table [a: int] is not cached```
      - _Dataset API_: `sparkSession.table("non-cachedTable").unpersist()` -> **no error message**
      
      This PR will make them consistent. No operation if the table has already been uncached.
      
      In addition, this PR removes `uncacheQuery`, renames `tryUncacheQuery` to `uncacheQuery`, and documents that it is a no-op if the table has already been uncached.
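The intended behavior, where uncaching is a no-op for a table that is not cached, can be modeled in a few lines of Python. The `CacheRegistry` class is a toy keyed by table name, invented for this example; it is not Spark's `CacheManager`.

```python
class CacheRegistry:
    """Toy cache registry keyed by table name (not Spark's CacheManager)."""

    def __init__(self):
        self._cached = set()

    def cache_table(self, name):
        self._cached.add(name)

    def is_cached(self, name):
        return name in self._cached

    def uncache_table(self, name):
        # discard() is a no-op when the name is absent, so uncaching a table
        # that was never cached, or uncaching it twice, raises no error.
        self._cached.discard(name)


registry = CacheRegistry()
registry.uncache_table("t")  # never cached: no error
registry.cache_table("t")
registry.uncache_table("t")
registry.uncache_table("t")  # already uncached: no error
```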
      
      #### How was this patch tested?
      Improved the existing test case for verifying the cases when the table has not been cached.
      Also added test cases for verifying the cases when the table does not exist
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #13593 from gatorsmile/uncacheNonCachedTable.
      df4ea661
    • Takuya UESHIN's avatar
      [SPARK-15915][SQL] Logical plans should use canonicalized plan when override sameResult. · c5b73558
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      A `DataFrame` whose plan overrides `sameResult` without comparing canonicalized plans can't be cached with `cacheTable`.
      
      The example is like:
      
      ```
          val localRelation = Seq(1, 2, 3).toDF()
          localRelation.createOrReplaceTempView("localRelation")
      
          spark.catalog.cacheTable("localRelation")
          assert(
            localRelation.queryExecution.withCachedData.collect {
              case i: InMemoryRelation => i
            }.size == 1)
      ```
      
      and this will fail as:
      
      ```
      ArrayBuffer() had size 0 instead of expected size 1
      ```
      
      The reason is that when calling `spark.catalog.cacheTable("localRelation")`, `CacheManager` tries to cache the plan wrapped by `SubqueryAlias`, but when planning for the DataFrame `localRelation`, `CacheManager` tries to find a cached table for the unwrapped plan, because the plan for DataFrame `localRelation` is not wrapped.
      Some plans, like `LocalRelation`, `LogicalRDD`, etc., override the `sameResult` method but do not use the canonicalized plan for the comparison, so `CacheManager` can't detect that the plans are the same.
      
      This PR modifies them to use the canonicalized plan when overriding the `sameResult` method.
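A minimal Python model of the fix is sketched below. The two classes borrow the names `LocalRelation` and `SubqueryAlias` but are toy stand-ins, and the methods `canonicalized` and `same_result` are simplified assumptions about how canonicalization strips the alias wrapper.

```python
class LocalRelation:
    def __init__(self, rows):
        self.rows = list(rows)

    def canonicalized(self):
        return self

    def same_result(self, other):
        # Compare against the other plan's canonical form, so a wrapper
        # such as SubqueryAlias does not hide an otherwise identical plan.
        other = other.canonicalized()
        return isinstance(other, LocalRelation) and other.rows == self.rows


class SubqueryAlias:
    def __init__(self, alias, child):
        self.alias = alias
        self.child = child

    def canonicalized(self):
        # The alias does not change the result, so drop it.
        return self.child.canonicalized()

    def same_result(self, other):
        return self.canonicalized().same_result(other)


plain = LocalRelation([1, 2, 3])
wrapped = SubqueryAlias("localRelation", LocalRelation([1, 2, 3]))
```

Without the `other.canonicalized()` call, `plain.same_result(wrapped)` would fail the type check and report a mismatch, which mirrors the cache miss described above.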
      
      ## How was this patch tested?
      
      Added a test to check if DataFrame with plan overriding sameResult but not using canonicalized plan to compare can cacheTable.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #13638 from ueshin/issues/SPARK-15915.
      c5b73558
    • gatorsmile's avatar
      [SPARK-15655][SQL] Fix Wrong Partition Column Order when Fetching Partitioned Tables · bc02d011
      gatorsmile authored
      #### What changes were proposed in this pull request?
      When fetching a partitioned table, the output contains wrong results. The order of the partition key values does not match the order of the partition key columns in the output schema. For example,
      
      ```SQL
      CREATE TABLE table_with_partition(c1 string) PARTITIONED BY (p1 string,p2 string,p3 string,p4 string,p5 string)
      
      INSERT OVERWRITE TABLE table_with_partition PARTITION (p1='a',p2='b',p3='c',p4='d',p5='e') SELECT 'blarr'
      
      SELECT p1, p2, p3, p4, p5, c1 FROM table_with_partition
      ```
      ```
      +---+---+---+---+---+-----+
      | p1| p2| p3| p4| p5|   c1|
      +---+---+---+---+---+-----+
      |  d|  e|  c|  b|  a|blarr|
      +---+---+---+---+---+-----+
      ```
      
      The expected result should be
      ```
      +---+---+---+---+---+-----+
      | p1| p2| p3| p4| p5|   c1|
      +---+---+---+---+---+-----+
      |  a|  b|  c|  d|  e|blarr|
      +---+---+---+---+---+-----+
      ```
      This PR fixes this by enforcing that the order matches the table partition definition.
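The ordering rule can be sketched in Python as follows. The function `order_partition_values` is hypothetical and only illustrates the idea of driving the output order from the table's partition column list rather than from the partition spec.

```python
def order_partition_values(partition_columns, partition_spec):
    """Return partition values in the order of the table's partition columns.

    partition_spec is a mapping from column name to value whose iteration
    order is not trusted.
    """
    missing = [c for c in partition_columns if c not in partition_spec]
    if missing:
        raise KeyError(f"partition spec is missing columns: {missing}")
    return [partition_spec[c] for c in partition_columns]


columns = ["p1", "p2", "p3", "p4", "p5"]
# A spec whose iteration order differs from the table definition.
spec = {"p4": "d", "p5": "e", "p3": "c", "p2": "b", "p1": "a"}
ordered = order_partition_values(columns, spec)
```

Looking values up by column name makes the result independent of how the spec happens to be ordered.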
      
      #### How was this patch tested?
      Added a test case into `SQLQuerySuite`
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #13400 from gatorsmile/partitionedTableFetch.
      bc02d011
    • Sean Owen's avatar
      [MINOR] Clean up several build warnings, mostly due to internal use of old accumulators · 6151d264
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Another PR to clean up recent build warnings. This particularly cleans up several instances of the old accumulator API usage in tests that are straightforward to update. I think this qualifies as "minor".
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #13642 from srowen/BuildWarnings.
      6151d264
    • Sean Zhong's avatar
      [SPARK-15914][SQL] Add deprecated method back to SQLContext for backward source code compatibility · 6e8cdef0
      Sean Zhong authored
      ## What changes were proposed in this pull request?
      
      Partially revert the changes in SPARK-12600, and add some deprecated methods back to SQLContext for backward source code compatibility.
      
      ## How was this patch tested?
      
      Manual test.
      
      Author: Sean Zhong <seanzhong@databricks.com>
      
      Closes #13637 from clockfly/SPARK-15914.
      6e8cdef0
    • Jeff Zhang's avatar
      doc fix of HiveThriftServer · 53bb0308
      Jeff Zhang authored
      ## What changes were proposed in this pull request?
      
      Just minor doc fix.
      
      \cc yhuai
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #13659 from zjffdu/doc_fix.
      53bb0308
    • Adam Roberts's avatar
      [SPARK-15821][DOCS] Include parallel build info · a431e3f1
      Adam Roberts authored
      ## What changes were proposed in this pull request?
      
      We should mention that users can build Spark using multiple threads to decrease build times; either here or in "Building Spark"
      
      ## How was this patch tested?
      
      Built on machines with between one and 192 cores using mvn -T 1C and observed faster build times with no loss in stability.
      
      In response to the question here https://issues.apache.org/jira/browse/SPARK-15821 I think we should suggest this option as we know it works for Spark and can result in faster builds
      
      Author: Adam Roberts <aroberts@uk.ibm.com>
      
      Closes #13562 from a-roberts/patch-3.
      a431e3f1
    • Shixiong Zhu's avatar
      [SPARK-15935][PYSPARK] Enable test for sql/streaming.py and fix these tests · 96c3500c
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR just enables tests for sql/streaming.py and also fixes the failures.
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13655 from zsxwing/python-streaming-test.
      96c3500c
    • Mortada Mehyar's avatar
      [DOCUMENTATION] fixed typos in python programming guide · a87a56f5
      Mortada Mehyar authored
      ## What changes were proposed in this pull request?
      
      minor typo
      
      ## How was this patch tested?
      
      minor typo in the doc, should be self explanatory
      
      Author: Mortada Mehyar <mortada.mehyar@gmail.com>
      
      Closes #13639 from mortada/typo.
      a87a56f5
    • Wenchen Fan's avatar
      [SPARK-15932][SQL][DOC] document the contract of encoder serializer expressions · 688b6ef9
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      In our encoder framework, we imply that serializer expressions should use `BoundReference` to refer to the input object, and a lot of code depends on this contract (e.g. ExpressionEncoder.tuple). This PR adds some documentation and an assert in `ExpressionEncoder` to make it clearer.
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13648 from cloud-fan/comment.
      688b6ef9
  3. Jun 13, 2016
    • Sandeep Singh's avatar
      [SPARK-15663][SQL] SparkSession.catalog.listFunctions shouldn't include the... · 1842cdd4
      Sandeep Singh authored
      [SPARK-15663][SQL] SparkSession.catalog.listFunctions shouldn't include the list of built-in functions
      
      ## What changes were proposed in this pull request?
      SparkSession.catalog.listFunctions currently returns all functions, including the list of built-in functions. This makes the method not as useful because anytime it is run the result set contains over 100 built-in functions.
      
      ## How was this patch tested?
      CatalogSuite
      
      Author: Sandeep Singh <sandeep@techaddict.me>
      
      Closes #13413 from techaddict/SPARK-15663.
      1842cdd4