  1. Jul 07, 2016
    • Xusen Yin's avatar
      [SPARK-16372][MLLIB] Retag RDD to tallSkinnyQR of RowMatrix · 4c6f00d0
      Xusen Yin authored
      ## What changes were proposed in this pull request?
      
      The following Java code fails because of type erasure:
      
      ```Java
      JavaRDD<Vector> rows = jsc.parallelize(...);
      RowMatrix mat = new RowMatrix(rows.rdd());
      QRDecomposition<RowMatrix, Matrix> result = mat.tallSkinnyQR(true);
      ```
      
      We should use `retag` to restore the type and prevent the following exception:
      
      ```Java
      java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lorg.apache.spark.mllib.linalg.Vector;
      ```
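      For illustration, a hedged sketch of the fix's shape (not the exact patch; `RDD.retag` is a `private[spark]` helper, so code like this only compiles inside Spark's own packages):
      
      ```scala
      package org.apache.spark.mllib.linalg.distributed
      
      import org.apache.spark.mllib.linalg.Vector
      import org.apache.spark.rdd.RDD
      
      object RetagSketch {
        // Restore the element ClassTag that was erased when the RDD came through the Java API,
        // so a later collect() yields Array[Vector] instead of Array[Object].
        def retagRows(rows: RDD[Vector]): RDD[Vector] = rows.retag(classOf[Vector])
      }
      ```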
      
      ## How was this patch tested?
      
      Java unit test
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #14051 from yinxusen/SPARK-16372.
      4c6f00d0
    • Reynold Xin's avatar
      [SPARK-16400][SQL] Remove InSet filter pushdown from Parquet · 986b2514
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch removes InSet filter pushdown from Parquet data source, since row-based pushdown is not beneficial to Spark and brings extra complexity to the code base.
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #14076 from rxin/SPARK-16400.
      986b2514
    • gatorsmile's avatar
      [SPARK-16368][SQL] Fix Strange Errors When Creating View With Unmatched Column Num · ab05db0b
      gatorsmile authored
      #### What changes were proposed in this pull request?
      When creating a view, a common user error is the number of columns produced by the `SELECT` clause does not match the number of column names specified by `CREATE VIEW`.
      
      For example, given that table `t1` has only 3 columns:
      ```SQL
      create view v1(col2, col4, col3, col5) as select * from t1
      ```
      Currently, Spark SQL reports the following error:
      ```
      requirement failed
      java.lang.IllegalArgumentException: requirement failed
      	at scala.Predef$.require(Predef.scala:212)
      	at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:90)
      ```
      
      This error message is very confusing. This PR is to detect the error and issue a meaningful error message.
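      A minimal sketch of the kind of check involved (the helper name and exception type are assumed; the real change lives in `CreateViewCommand`):
      
      ```scala
      // Hypothetical helper: fail early with a descriptive message when the user-specified
      // column list and the analyzed query output have different lengths.
      def checkViewColumnCount(userColumns: Seq[String], queryOutput: Seq[String]): Unit = {
        if (userColumns.nonEmpty && userColumns.length != queryOutput.length) {
          throw new IllegalArgumentException(
            s"The number of columns produced by the SELECT clause (${queryOutput.length}) does not " +
              s"match the number of column names specified by CREATE VIEW (${userColumns.length}).")
        }
      }
      ```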
      
      #### How was this patch tested?
      Added test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14047 from gatorsmile/viewMismatchedColumns.
      ab05db0b
    • Tom Magrino's avatar
      [SPARK-15885][WEB UI] Provide links to executor logs from stage details page in UI · ce3ea969
      Tom Magrino authored
      ## What changes were proposed in this pull request?
      
      This moves over old PR https://github.com/apache/spark/pull/13664 to target master rather than branch-1.6.
      
      Added links to logs (or an indication that there are no logs) for entries which list an executor in the stage details page of the UI.
      
      This helps streamline the workflow where a user views a stage details page and determines that they would like to see the associated executor log for further examination.  Previously, a user would have to cross reference the executor id listed on the stage details page with the corresponding entry on the executors tab.
      
      Link to the JIRA: https://issues.apache.org/jira/browse/SPARK-15885
      
      ## How was this patch tested?
      
      Ran existing unit tests.
      Ran test queries on a platform which did not record executor logs and again on a platform which did record executor logs and verified that the new table column was empty and links to the logs (which were verified as linking to the appropriate files), respectively.
      
      Attached is a screenshot of the UI page with no links, with the new columns highlighted, and an additional screenshot of these columns with the populated links.
      
      Without links:
      ![updated without logs](https://cloud.githubusercontent.com/assets/1450821/16059721/2b69dbaa-3239-11e6-9eed-e539764ca159.png)
      
      With links:
      ![updated with logs](https://cloud.githubusercontent.com/assets/1450821/16059725/32c6e316-3239-11e6-90bd-2553f43f7779.png)
      
      This contribution is my original work and I license the work to the project under the Apache Spark project's open source license.
      
      Author: Tom Magrino <tmagrino@fb.com>
      
      Closes #13861 from tmagrino/uilogstweak.
      ce3ea969
    • Shixiong Zhu's avatar
      [SPARK-16021][TEST-MAVEN] Fix the maven build · 4b5a72c7
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Fixed the maven build for #13983
      
      ## How was this patch tested?
      
      The existing tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #14084 from zsxwing/fix-maven.
      4b5a72c7
    • MasterDDT's avatar
      [SPARK-16398][CORE] Make cancelJob and cancelStage APIs public · 69f53914
      MasterDDT authored
      ## What changes were proposed in this pull request?
      
      Make SparkContext `cancelJob` and `cancelStage` APIs public. This allows applications to use `SparkListener` to do their own management of jobs via events, but without using the REST API.
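      As a hedged illustration of what this enables (the listener and its task budget are hypothetical, not part of this patch):
      
      ```scala
      import org.apache.spark.SparkContext
      import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}
      
      // Cancel any job whose stages would run more tasks than a given budget.
      class JobBudgetListener(sc: SparkContext, maxTasks: Int) extends SparkListener {
        override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
          val totalTasks = jobStart.stageInfos.map(_.numTasks).sum
          if (totalTasks > maxTasks) {
            sc.cancelJob(jobStart.jobId)  // now public; previously only reachable inside Spark
          }
        }
      }
      ```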
      
      ## How was this patch tested?
      
      Existing tests (dev/run-tests)
      
      Author: MasterDDT <miteshp@live.com>
      
      Closes #14072 from MasterDDT/SPARK-16398.
      69f53914
  2. Jul 06, 2016
    • gatorsmile's avatar
      [SPARK-16374][SQL] Remove Alias from MetastoreRelation and SimpleCatalogRelation · 42279bff
      gatorsmile authored
      #### What changes were proposed in this pull request?
      Different from the other leaf nodes, `MetastoreRelation` and `SimpleCatalogRelation` have a pre-defined `alias`, which is used to change the qualifier of the node. However, based on the existing alias handling, alias should be put in `SubqueryAlias`.
      
      This PR is to separate alias handling from `MetastoreRelation` and `SimpleCatalogRelation` to make it consistent with the other nodes. It simplifies the signature and conversion to a `BaseRelation`.
      
      For example, below is an example query for `MetastoreRelation`,  which is converted to a `LogicalRelation`:
      ```SQL
      SELECT tmp.a + 1 FROM test_parquet_ctas tmp WHERE tmp.a > 2
      ```
      
      Before changes, the analyzed plan is
      ```
      == Analyzed Logical Plan ==
      (a + 1): int
      Project [(a#951 + 1) AS (a + 1)#952]
      +- Filter (a#951 > 2)
         +- SubqueryAlias tmp
            +- Relation[a#951] parquet
      ```
      After changes, the analyzed plan becomes
      ```
      == Analyzed Logical Plan ==
      (a + 1): int
      Project [(a#951 + 1) AS (a + 1)#952]
      +- Filter (a#951 > 2)
         +- SubqueryAlias tmp
            +- SubqueryAlias test_parquet_ctas
               +- Relation[a#951] parquet
      ```
      
      **Note: the optimized plans are the same.**
      
      For `SimpleCatalogRelation`, the existing code always generates two Subqueries. Thus, no change is needed.
      
      #### How was this patch tested?
      Added test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14053 from gatorsmile/removeAliasFromMetastoreRelation.
      42279bff
    • hyukjinkwon's avatar
      [SPARK-14839][SQL] Support for other types for `tableProperty` rule in SQL syntax · 34283de1
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Currently, the Scala API supports options with the types `String`, `Long`, `Double`, and `Boolean`, and the Python API also supports other types.
      
      This PR corrects the `tableProperty` rule to support other types (string, boolean, double, and integer) so that options for data sources are supported in a consistent way. This also affects other rules such as DBPROPERTIES and TBLPROPERTIES (allowing other types as values).
      
      Also, `TODO add bucketing and partitioning.` was removed because it was resolved in https://github.com/apache/spark/commit/24bea000476cdd0b43be5160a76bc5b170ef0b42
      
      ## How was this patch tested?
      
      Unit test in `MetastoreDataSourcesSuite.scala`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #13517 from HyukjinKwon/SPARK-14839.
      34283de1
    • Eric Liang's avatar
      [SPARK-16021] Fill freed memory in test to help catch correctness bugs · 44c7c62b
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      This patches `MemoryAllocator` to fill clean and freed memory with known byte values, similar to https://github.com/jemalloc/jemalloc/wiki/Use-Case:-Find-a-memory-corruption-bug . Memory filling is flag-enabled in test only by default.
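      A conceptual sketch of the poisoning technique (names and fill values assumed; Spark's actual `MemoryAllocator` operates on on-heap/off-heap memory blocks rather than plain arrays):
      
      ```scala
      import java.util.Arrays
      
      object PoisoningAllocator {
        // Distinct patterns so reads of never-written or already-freed memory stand out in tests.
        val CleanFill: Byte = 0xa5.toByte
        val FreedFill: Byte = 0x5a.toByte
      
        def allocate(size: Int): Array[Byte] = {
          val block = new Array[Byte](size)
          Arrays.fill(block, CleanFill)   // poison fresh memory
          block
        }
      
        def free(block: Array[Byte]): Unit =
          Arrays.fill(block, FreedFill)   // poison freed memory so stale reads are obviously wrong
      }
      ```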
      
      ## How was this patch tested?
      
      A unit test verifies that it is enabled in tests.
      
      cc sameeragarwal
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #13983 from ericl/spark-16021.
      44c7c62b
    • cody koeninger's avatar
      [SPARK-16212][STREAMING][KAFKA] apply test tweaks from 0-10 to 0-8 as well · b8ebf63c
      cody koeninger authored
      ## What changes were proposed in this pull request?
      
      Bring the kafka-0-8 subproject up to date with some test modifications from development on 0-10.
      
      Main changes are
      - eliminating waits on concurrent queue in favor of an assert on received results,
      - atomics instead of volatile (although this probably doesn't matter)
      - increasing uniqueness of topic names
      
      ## How was this patch tested?
      
      Unit tests
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #14073 from koeninger/kafka-0-8-test-direct-cleanup.
      b8ebf63c
    • Reynold Xin's avatar
      [SPARK-16371][SQL] Two follow-up tasks · 8e3e4ed6
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This is a small follow-up for SPARK-16371:
      
      1. Hide removeMetadata from public API.
      2. Add JIRA ticket number to test case name.
      
      ## How was this patch tested?
      Updated a test comment.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #14074 from rxin/parquet-filter.
      8e3e4ed6
    • Michael Gummelt's avatar
      [MESOS] expand coarse-grained mode docs · 9c041990
      Michael Gummelt authored
      ## What changes were proposed in this pull request?
      
      docs
      
      ## How was this patch tested?
      
      viewed the docs in github
      
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      
      Closes #14059 from mgummelt/coarse-grained.
      9c041990
    • Sean Owen's avatar
      [SPARK-16379][CORE][MESOS] Spark on mesos is broken due to race condition in Logging · a8f89df3
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      The commit https://github.com/apache/spark/commit/044971eca0ff3c2ce62afa665dbd3072d52cbbec introduced a lazy val to simplify code in Logging. Simple enough, though one side effect is that accessing log now means grabbing the instance's lock. This in turn turned up a form of deadlock in the Mesos code. It was arguably a bit of a problem in how this code is structured, but, in any event the safest thing to do seems to be to revert the commit, and that's 90% of the change here; it's just not worth the risk of similar more subtle issues.
      
      What I didn't revert here was the removal of this odd override of log in the Mesos code. In retrospect it might have been put in place at some stage as a defense against this type of problem. After all the Logging code still involved a lock at initialization before the change in question.
      
      Even after the revert, it doesn't seem like it does anything, given how Logging works now, so I left it removed. However, I also removed the particular log message that ended up playing a part in this problem anyway, maybe being paranoid, to make sure this type of problem can't happen even with how the current locking works in logging initialization.
      
      ## How was this patch tested?
      
      Jenkins tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14069 from srowen/SPARK-16379.
      a8f89df3
    • tmnd1991's avatar
      [SPARK-15740][MLLIB] Word2VecSuite "big model load / save" caused OOM in maven jenkins builds · 040f6f9f
      tmnd1991 authored
      ## What changes were proposed in this pull request?
      "test big model load / save" in Word2VecSuite, lately resulted into OOM.
      Therefore we decided to make the partitioning adaptive (not based on spark default "spark.kryoserializer.buffer.max" conf) and then testing it using a small buffer size in order to trigger partitioning without allocating too much memory for the test.
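      A rough sketch of the adaptive idea (the formula is assumed, not the exact patch): derive the number of save partitions from the estimated model size and the serializer buffer limit.
      
      ```scala
      // Hypothetical calculation: use enough partitions that each serialized slice of the model
      // stays under the kryo buffer limit.
      def numSavePartitions(approxModelSizeBytes: Long, bufferMaxBytes: Long): Int =
        math.max(1, math.ceil(approxModelSizeBytes.toDouble / bufferMaxBytes).toInt)
      ```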
      
      ## How was this patch tested?
      It was tested running the following unit test:
      org.apache.spark.mllib.feature.Word2VecSuite
      
      Author: tmnd1991 <antonio.murgia2@studio.unibo.it>
      
      Closes #13509 from tmnd1991/SPARK-15740.
      040f6f9f
    • hyukjinkwon's avatar
      [SPARK-16371][SQL] Do not push down filters incorrectly when inner name and... · 4f8ceed5
      hyukjinkwon authored
      [SPARK-16371][SQL] Do not push down filters incorrectly when inner name and outer name are the same in Parquet
      
      ## What changes were proposed in this pull request?
      
      Currently, if there is a schema as below:
      
      ```
      root
        |-- _1: struct (nullable = true)
        |    |-- _1: integer (nullable = true)
      ```
      
      and if we execute the codes below:
      
      ```scala
      df.filter("_1 IS NOT NULL").count()
      ```
      
      This pushes down a filter although the filter is being applied to a `StructType`. (If my understanding is correct, Spark does not push down filters for those.)
      
      The reason is, `ParquetFilters.getFieldMap` produces results below:
      
      ```
      (_1,StructType(StructField(_1,IntegerType,true)))
      (_1,IntegerType)
      ```
      
      and then it becomes a `Map`
      
      ```
      (_1,IntegerType)
      ```
      
      Now, because of ` ....lift(dataTypeOf(name)).map(_(name, value))`, this pushes down filters for `_1`, which Parquet thinks is an `IntegerType` but is actually a `StructType`.
      
      So, Parquet filter2 produces incorrect results; for example, the code below:
      
      ```
      df.filter("_1 IS NOT NULL").count()
      ```
      
      always produces 0.
      
      This PR prevents this by not finding nested fields.
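      A minimal sketch of the fix's shape (assumed, not the actual `ParquetFilters` code): build the name-to-type map from top-level non-struct fields only, so a nested field can no longer shadow its parent with the wrong type.
      
      ```scala
      import org.apache.spark.sql.types.{DataType, StructField, StructType}
      
      def pushDownCandidates(schema: StructType): Map[String, DataType] =
        schema.fields.collect {
          // keep only top-level atomic columns; never recurse into struct fields
          case StructField(name, dataType, _, _) if !dataType.isInstanceOf[StructType] =>
            name -> dataType
        }.toMap
      ```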
      
      ## How was this patch tested?
      
      Unit test in `ParquetFilterSuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #14067 from HyukjinKwon/SPARK-16371.
      4f8ceed5
    • petermaxlee's avatar
      [SPARK-16304] LinkageError should not crash Spark executor · 480357cc
      petermaxlee authored
      ## What changes were proposed in this pull request?
      This patch updates the failure handling logic so Spark executor does not crash when seeing LinkageError.
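      A hedged sketch of the idea (not the actual `Executor` code): classify `LinkageError` as a task failure rather than a fatal error that takes the executor down.
      
      ```scala
      // Hypothetical classifier: only truly unrecoverable errors should kill the executor process.
      def isFatalForExecutor(t: Throwable): Boolean = t match {
        case _: LinkageError        => false  // e.g. NoClassDefFoundError from a bad user jar
        case _: VirtualMachineError => true   // OutOfMemoryError, StackOverflowError, ...
        case _                      => false
      }
      ```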
      
      ## How was this patch tested?
      Added an end-to-end test in FailureSuite.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      
      Closes #13982 from petermaxlee/SPARK-16304.
      480357cc
    • hyukjinkwon's avatar
      [MINOR][PYSPARK][DOC] Fix wrongly formatted examples in PySpark documentation · 4e14199f
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR fixes wrongly formatted examples in PySpark documentation as below:
      
      - **`SparkSession`**
      
        - **Before**
      
          ![2016-07-06 11 34 41](https://cloud.githubusercontent.com/assets/6477701/16605847/ae939526-436d-11e6-8ab8-6ad578362425.png)
      
        - **After**
      
          ![2016-07-06 11 33 56](https://cloud.githubusercontent.com/assets/6477701/16605845/ace9ee78-436d-11e6-8923-b76d4fc3e7c3.png)
      
      - **`Builder`**
      
        - **Before**
          ![2016-07-06 11 34 44](https://cloud.githubusercontent.com/assets/6477701/16605844/aba60dbc-436d-11e6-990a-c87bc0281c6b.png)
      
        - **After**
          ![2016-07-06 1 26 37](https://cloud.githubusercontent.com/assets/6477701/16607562/586704c0-437d-11e6-9483-e0af93d8f74e.png)
      
      This PR also fixes several similar instances across the documentation in `sql` PySpark module.
      
      ## How was this patch tested?
      
      N/A
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #14063 from HyukjinKwon/minor-pyspark-builder.
      4e14199f
    • WeichenXu's avatar
      [DOC][SQL] update out-of-date code snippets using SQLContext in all documents. · b1310425
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      I searched the whole docs directory for SQLContext and updated the following places:
      
      - docs/configuration.md, SparkR code snippets.
      - docs/streaming-programming-guide.md, several example code snippets.
      
      ## How was this patch tested?
      
      N/A
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #14025 from WeichenXu123/WIP_SQLContext_update.
      b1310425
    • Cheng Lian's avatar
      [SPARK-15979][SQL] Renames CatalystWriteSupport to ParquetWriteSupport · 23eff5e5
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      PR #13696 renamed various Parquet support classes but left `CatalystWriteSupport` behind. This PR renames it as a follow-up.
      
      ## How was this patch tested?
      
      N/A.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #14070 from liancheng/spark-15979-follow-up.
      23eff5e5
    • Tao Lin's avatar
      [SPARK-15591][WEBUI] Paginate Stage Table in Stages tab · 478b71d0
      Tao Lin authored
      ## What changes were proposed in this pull request?
      
      This patch adds pagination support for the Stage Tables in the Stages tab. Pagination is provided for all four stage tables (active, pending, completed, and failed). The paged stage tables are also used in JobPage (the detail page for one job) and PoolPage.
      
      Interactions (jumping, sorting, and setting page size) for paged tables are also included.
      
      ## How was this patch tested?
      
      Tested manually by checking the Web UI after completing and failing hundreds of jobs, same as the testing for [Paginate Job Table in Jobs tab](https://github.com/apache/spark/pull/13620).
      
      This shows the pagination for completed stages:
      ![paged stage table](https://cloud.githubusercontent.com/assets/5558370/16125696/5804e35e-3427-11e6-8923-5c5948982648.png)
      
      Author: Tao Lin <nblintao@gmail.com>
      
      Closes #13708 from nblintao/stageTable.
      478b71d0
    • gatorsmile's avatar
      [SPARK-16229][SQL] Drop Empty Table After CREATE TABLE AS SELECT fails · 21eadd1d
      gatorsmile authored
      #### What changes were proposed in this pull request?
      In `CREATE TABLE AS SELECT`, if the `SELECT` query failed, the table should not exist. For example,
      
      ```SQL
      CREATE TABLE tab
      STORED AS TEXTFILE
      SELECT 1 AS a, (SELECT a FROM (SELECT 1 AS a UNION ALL SELECT 2 AS a) t) AS b
      ```
      The above query fails as expected, but an empty table `tab` is created.
      
      This PR is to drop the created table when hitting any non-fatal exception.
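      A rough sketch of the cleanup (the structure is assumed, not the actual command implementation):
      
      ```scala
      import scala.util.control.NonFatal
      
      // Hypothetical control flow for CREATE TABLE AS SELECT: if the query fails after the table
      // has been created, drop the table before propagating the error.
      def runCtas(createTable: () => Unit, runQuery: () => Unit, dropTable: () => Unit): Unit = {
        createTable()
        try {
          runQuery()
        } catch {
          case NonFatal(e) =>
            dropTable()
            throw e
        }
      }
      ```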
      
      #### How was this patch tested?
      Added a test case to verify the behavior
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #13926 from gatorsmile/dropTableAfterException.
      21eadd1d
    • MechCoder's avatar
      [SPARK-16307][ML] Add test to verify the predicted variances of a DT on toy data · 909c6d81
      MechCoder authored
      ## What changes were proposed in this pull request?
      
      The current tests assume that `impurity.calculate()` returns the variance correctly. It would be better to make the tests independent of this assumption, in other words, to verify that the computed variance equals the variance computed manually on a small tree.
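      For reference, the manual computation such a test can compare against is just the variance of the labels reaching a leaf (a sketch; the actual toy data lives in the test):
      
      ```scala
      // Population variance of the labels that end up in one leaf node.
      def leafVariance(labels: Seq[Double]): Double = {
        val mean = labels.sum / labels.size
        labels.map(x => (x - mean) * (x - mean)).sum / labels.size
      }
      ```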
      
      ## How was this patch tested?
      
      The patch is a test....
      
      Author: MechCoder <mks542@nyu.edu>
      
      Closes #13981 from MechCoder/dt_variance.
      909c6d81
    • Reynold Xin's avatar
      [SPARK-16388][SQL] Remove spark.sql.nativeView and spark.sql.nativeView.canonical config · 7e28fabd
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      These two configs should always be true after Spark 2.0. This patch removes them from the config list. Note that ideally this should've gone into branch-2.0, but due to the timing of the release we should only merge this in master for Spark 2.1.
      
      ## How was this patch tested?
      Updated test cases.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #14061 from rxin/SPARK-16388.
      7e28fabd
    • Yuhao Yang's avatar
      [SPARK-16249][ML] Change visibility of Object ml.clustering.LDA to public for loading · 5497242c
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      jira: https://issues.apache.org/jira/browse/SPARK-16249
      Change the visibility of object ml.clustering.LDA to public for loading, so that users can invoke `LDA.load("path")`.
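      Usage this change enables (the path is hypothetical):
      
      ```scala
      import org.apache.spark.ml.clustering.LDA
      
      // Loading a previously saved LDA estimator works now that the companion object is public.
      val lda = LDA.load("/tmp/saved-lda-estimator")
      ```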
      
      ## How was this patch tested?
      
      Existing unit tests, plus a manual test of loading a model saved with the current code.
      
      Author: Yuhao Yang <yuhao.yang@intel.com>
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #13941 from hhbyyh/ldapublic.
      5497242c
    • Tejas Patil's avatar
      [SPARK-16339][CORE] ScriptTransform does not print stderr when outstream is lost · 5f342049
      Tejas Patil authored
      ## What changes were proposed in this pull request?
      
      Currently, if the outstream gets destroyed or closed due to some failure, a later `outstream.close()` leads to an IOException. Because of this, the `stderrBuffer` does not get logged and there is no way for users to see why the job failed.
      
      The change is to first display the stderr buffer and then try closing the outstream.
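      A minimal sketch of the reordering (names assumed; the real code lives in the script transformation writer thread):
      
      ```scala
      import java.io.OutputStream
      
      // Hypothetical shape of the fix: log whatever the script wrote to stderr before attempting
      // the close that may itself throw.
      def finishScript(stderrBuffer: StringBuilder, outstream: OutputStream,
          logError: String => Unit): Unit = {
        if (stderrBuffer.nonEmpty) {
          logError(s"script process stderr:\n$stderrBuffer")
        }
        outstream.close() // may throw IOException if the stream was already destroyed
      }
      ```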
      
      ## How was this patch tested?
      
      The correct way to test this fix would be to grep the log to see if the `stderrBuffer` gets logged, but I don't think having test cases which do that is a good idea.
      
      Author: Tejas Patil <tejasp@fb.com>
      
      Closes #13834 from tejasapatil/script_transform.
      5f342049
    • Dongjoon Hyun's avatar
      [SPARK-16340][SQL] Support column arguments for `regexp_replace` Dataset operation · ec79183a
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, `regexp_replace` function supports `Column` arguments in a query. This PR supports that in a `Dataset` operation, too.
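      A hedged usage example (the data and column names are hypothetical): each row can supply its own pattern and replacement as `Column`s.
      
      ```scala
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions.regexp_replace
      
      val spark = SparkSession.builder().master("local[*]").appName("regexpReplaceColumns").getOrCreate()
      import spark.implicits._
      
      val df = Seq(("ab-123", "\\d+", "N"), ("xy-789", "[a-z]+", "W"))
        .toDF("text", "pattern", "replacement")
      
      // The pattern and replacement are taken from columns instead of string literals.
      df.select(regexp_replace($"text", $"pattern", $"replacement").as("replaced")).show()
      ```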
      
      ## How was this patch tested?
      
      Pass the Jenkins tests with an updated testcase.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14060 from dongjoon-hyun/SPARK-16340.
      ec79183a
  3. Jul 05, 2016
    • gatorsmile's avatar
      [SPARK-16389][SQL] Remove MetastoreRelation from SparkHiveWriterContainer and... · ec18cd0a
      gatorsmile authored
      [SPARK-16389][SQL] Remove MetastoreRelation from SparkHiveWriterContainer and SparkHiveDynamicPartitionWriterContainer
      
      #### What changes were proposed in this pull request?
      - Remove useless `MetastoreRelation` from the signature of `SparkHiveWriterContainer` and `SparkHiveDynamicPartitionWriterContainer`.
      - Avoid unnecessary metadata retrieval using Hive client in `InsertIntoHiveTable`.
      
      #### How was this patch tested?
      Existing test cases already cover it.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14062 from gatorsmile/removeMetastoreRelation.
      ec18cd0a
    • Dongjoon Hyun's avatar
      [SPARK-16286][SQL] Implement stack table generating function · d0d28507
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR implements `stack` table generating function.
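      For reference, a small usage example (assuming a running `SparkSession` named `spark`): `stack(n, v1, v2, ...)` splits the supplied values into `n` rows.
      
      ```scala
      spark.sql("SELECT stack(2, 1, 2, 3, 4) AS (a, b)").show()
      // +---+---+
      // |  a|  b|
      // +---+---+
      // |  1|  2|
      // |  3|  4|
      // +---+---+
      ```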
      
      ## How was this patch tested?
      
      Pass the Jenkins tests including new testcases.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14033 from dongjoon-hyun/SPARK-16286.
      d0d28507
    • Joseph K. Bradley's avatar
      [SPARK-16348][ML][MLLIB][PYTHON] Use full classpaths for pyspark ML JVM calls · fdde7d0a
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Issue: Omitting the full classpath can cause problems when calling JVM methods or classes from pyspark.
      
      This PR: Changed all uses of jvm.X in pyspark.ml and pyspark.mllib to use full classpath for X
      
      ## How was this patch tested?
      
      Existing unit tests.  Manual testing in an environment where this was an issue.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #14023 from jkbradley/SPARK-16348.
      fdde7d0a
    • Marcelo Vanzin's avatar
      [SPARK-16385][CORE] Catch correct exception when calling method via reflection. · 59f9c1bd
      Marcelo Vanzin authored
      Using "Method.invoke" causes an exception to be thrown, not an error, so
      Utils.waitForProcess() was always throwing an exception when run on Java 7.
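      A hedged sketch of the pattern (not Spark's actual `Utils` code): reflective lookups throw checked exceptions such as `NoSuchMethodException`, not `Error`s, so that is what the fallback path should catch.
      
      ```scala
      import java.lang.reflect.InvocationTargetException
      
      // Hypothetical helper: call a zero-argument method if it exists on this JVM, otherwise fall
      // back; real failures inside the invoked method are unwrapped and rethrown.
      def invokeIfPresent(target: AnyRef, methodName: String): Option[AnyRef] =
        try {
          Option(target.getClass.getMethod(methodName).invoke(target))
        } catch {
          case _: NoSuchMethodException => None
          case e: InvocationTargetException => throw e.getCause
        }
      ```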
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #14056 from vanzin/SPARK-16385.
      59f9c1bd
    • Dongjoon Hyun's avatar
      [SPARK-16383][SQL] Remove `SessionState.executeSql` · 4db63fd2
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR removes `SessionState.executeSql` in favor of `SparkSession.sql`. We can remove this safely since the visibility of `SessionState` is `private[sql]` and `executeSql` is only used in one **ignored** test, `test("Multiple Hive Instances")`.
      
      ## How was this patch tested?
      
      Pass the Jenkins tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14055 from dongjoon-hyun/SPARK-16383.
      4db63fd2
    • cody koeninger's avatar
      [SPARK-16359][STREAMING][KAFKA] unidoc skip kafka 0.10 · 1f0d0213
      cody koeninger authored
      ## What changes were proposed in this pull request?
      During the sbt unidoc task, skip the streamingKafka010 subproject and filter Kafka 0.10 classes from the classpath, so that at least the existing Kafka 0.8 doc can be included in unidoc without error.
      
      ## How was this patch tested?
      sbt spark/scalaunidoc:doc | grep -i error
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #14041 from koeninger/SPARK-16359.
      1f0d0213
    • Cheng Hao's avatar
      [SPARK-15730][SQL] Respect the --hiveconf in the spark-sql command line · 920cb5fe
      Cheng Hao authored
      ## What changes were proposed in this pull request?
      This PR makes spark-sql (backed by SparkSQLCLIDriver) respect confs set via --hiveconf, which is what we did in previous versions. The change is that when we start SparkSQLCLIDriver, we explicitly set confs passed through --hiveconf in SQLContext's conf (basically treating those confs as Spark SQL confs).
      
      ## How was this patch tested?
      A new test in CliSuite.
      
      Closes #13542
      
      Author: Cheng Hao <hao.cheng@intel.com>
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #14058 from yhuai/hiveConfThriftServer.
      920cb5fe
    • Reynold Xin's avatar
      [HOTFIX] Fix build break. · 5b7a1770
      Reynold Xin authored
      5b7a1770
    • cody koeninger's avatar
      [SPARK-16212][STREAMING][KAFKA] use random port for embedded kafka · 1fca9da9
      cody koeninger authored
      ## What changes were proposed in this pull request?
      
      Testing for 0.10 uncovered an issue with a fixed port number being used in KafkaTestUtils. This makes a roughly equivalent fix for the 0.8 connector.
      
      ## How was this patch tested?
      
      Unit tests, manual tests
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #14018 from koeninger/kafka-0-8-test-port.
      1fca9da9
    • Reynold Xin's avatar
      [SPARK-16311][SQL] Metadata refresh should work on temporary views · 16a2a7d7
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch fixes the bug that the refresh command does not work on temporary views. This patch is based on https://github.com/apache/spark/pull/13989, but removes the public Dataset.refresh() API and improves test coverage.
      
      Note that I actually think the public refresh() API is very useful. We can in the future implement it by also invalidating the lazy vals in QueryExecution (or alternatively just create a new QueryExecution).
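      A hedged usage sketch of the behavior being fixed (the path and view name are hypothetical, and `spark` is assumed to be an active `SparkSession`):
      
      ```scala
      // Register a temporary view over a path, then refresh it after new files arrive so the
      // cached metadata is invalidated and subsequent queries see the new data.
      spark.read.parquet("/tmp/events").createOrReplaceTempView("events")
      // ... another process writes new files under /tmp/events ...
      spark.catalog.refreshTable("events")
      spark.table("events").count()
      ```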
      
      ## How was this patch tested?
      Re-enabled a previously ignored test, and added a new test suite for Hive testing behavior of temporary views against MetastoreRelation.
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: petermaxlee <petermaxlee@gmail.com>
      
      Closes #14009 from rxin/SPARK-16311.
      16a2a7d7
    • hyukjinkwon's avatar
      [SPARK-9876][SQL][FOLLOWUP] Enable string and binary tests for Parquet... · 07d9c532
      hyukjinkwon authored
      [SPARK-9876][SQL][FOLLOWUP] Enable string and binary tests for Parquet predicate pushdown and replace deprecated fromByteArray.
      
      ## What changes were proposed in this pull request?
      
      It seems Parquet has been upgraded to 1.8.1 by https://github.com/apache/spark/pull/13280. So, this PR enables string and binary predicate push down, which was disabled due to [SPARK-11153](https://issues.apache.org/jira/browse/SPARK-11153) and [PARQUET-251](https://issues.apache.org/jira/browse/PARQUET-251), and cleans up some comments that were left unremoved (I think by mistake).
      
      This PR also replaces the `fromByteArray()` API, which was deprecated in [PARQUET-251](https://issues.apache.org/jira/browse/PARQUET-251).
      
      ## How was this patch tested?
      
      Unit tests in `ParquetFilters`
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #13389 from HyukjinKwon/parquet-1.8-followup.
      07d9c532
    • Dongjoon Hyun's avatar
      [SPARK-16360][SQL] Speed up SQL query performance by removing redundant `executePlan` call · 7f7eb393
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, there are a few reports about Spark 2.0 query performance regression for large queries.
      
      This PR speeds up SQL query processing performance by removing a redundant **consecutive `executePlan`** call in the `Dataset.ofRows` function and `Dataset` instantiation. Specifically, this PR aims to reduce the overhead of SQL query execution plan generation, not real query execution, so we cannot see the result in the Spark Web UI. Please use the following query script. The result is **25.78 sec** -> **12.36 sec**, as expected.
      
      **Sample Query**
      ```scala
      val n = 4000
      val values = (1 to n).map(_.toString).mkString(", ")
      val columns = (1 to n).map("column" + _).mkString(", ")
      val query =
        s"""
           |SELECT $columns
           |FROM VALUES ($values) T($columns)
           |WHERE 1=2 AND 1 IN ($columns)
           |GROUP BY $columns
           |ORDER BY $columns
           |""".stripMargin
      
      def time[R](block: => R): R = {
        val t0 = System.nanoTime()
        val result = block
        println("Elapsed time: " + ((System.nanoTime - t0) / 1e9) + "s")
        result
      }
      ```
      
      **Before**
      ```scala
      scala> time(sql(query))
      Elapsed time: 30.138142577s  // First query has a little overhead of initialization.
      res0: org.apache.spark.sql.DataFrame = [column1: int, column2: int ... 3998 more fields]
      scala> time(sql(query))
      Elapsed time: 25.787751452s  // Let's compare this one.
      res1: org.apache.spark.sql.DataFrame = [column1: int, column2: int ... 3998 more fields]
      ```
      
      **After**
      ```scala
      scala> time(sql(query))
      Elapsed time: 17.500279659s  // First query has a little overhead of initialization.
      res0: org.apache.spark.sql.DataFrame = [column1: int, column2: int ... 3998 more fields]
      scala> time(sql(query))
      Elapsed time: 12.364812255s  // This shows the real difference. The speed up is about 2 times.
      res1: org.apache.spark.sql.DataFrame = [column1: int, column2: int ... 3998 more fields]
      ```
      
      ## How was this patch tested?
      
      Manual by the above script.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14044 from dongjoon-hyun/SPARK-16360.
      7f7eb393
    • hyukjinkwon's avatar
      [SPARK-15198][SQL] Support for pushing down filters for boolean types in ORC data source · 7742d9f1
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      It seems ORC supports all the types in [`PredicateLeaf.Type`](https://github.com/apache/hive/blob/e085b7e9bd059d91aaf013df0db4d71dca90ec6f/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/PredicateLeaf.java#L50-L56), which include boolean types, so this was tested first.
      
      This PR adds the support for pushing filters down for `BooleanType` in ORC data source.
      
      This PR also removes `OrcTableScan` class and the companion object, which is not used anymore.
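      A hedged usage example (the path and column are hypothetical, and `spark` is assumed to be an active `SparkSession`): with this change, an equality predicate on a boolean column becomes eligible for ORC predicate pushdown.
      
      ```scala
      import org.apache.spark.sql.functions.col
      
      // The filter on the BooleanType column can now be pushed into the ORC reader rather than
      // being evaluated only after the rows are read.
      val errors = spark.read.orc("/tmp/events_orc").filter(col("is_error") === true)
      errors.show()
      ```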
      
      ## How was this patch tested?
      
      Unit tests in `OrcFilterSuite` and `OrcQuerySuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #12972 from HyukjinKwon/SPARK-15198.
      7742d9f1
  4. Jul 04, 2016
    • Michael Allman's avatar
      [SPARK-15968][SQL] Nonempty partitioned metastore tables are not cached · 8f6cf00c
      Michael Allman authored
      (Please note this is a revision of PR #13686, which has been closed in favor of this PR.)
      
      This PR addresses [SPARK-15968](https://issues.apache.org/jira/browse/SPARK-15968).
      
      ## What changes were proposed in this pull request?
      
      The `getCached` method of [HiveMetastoreCatalog](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala) computes `pathsInMetastore` from the metastore relation's catalog table. This only returns the table base path, which is incomplete/inaccurate for a nonempty partitioned table. As a result, cached lookups on nonempty partitioned tables always miss.
      
      Rather than get `pathsInMetastore` from
      
          metastoreRelation.catalogTable.storage.locationUri.toSeq
      
      I modified the `getCached` method to take a `pathsInMetastore` argument. Calls to this method pass in the paths computed from calls to the Hive metastore. This is how `getCached` was implemented in Spark 1.5:
      
      https://github.com/apache/spark/blob/e0c3212a9b42e3e704b070da4ac25b68c584427f/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L444.
      
      I also added a call in `InsertIntoHiveTable.scala` to invalidate the table from the SQL session catalog.
      
      ## How was this patch tested?
      
      I've added a new unit test to `parquetSuites.scala`:
      
          SPARK-15968: nonempty partitioned metastore Parquet table lookup should use cached relation
      
      Note that the only difference between this new test and the one above it in the file is that the new test populates its partitioned table with a single value, while the existing test leaves the table empty. This reveals a subtle, unexpected hole in test coverage present before this patch.
      
      Note I also modified a different but related unit test in `parquetSuites.scala`:
      
          SPARK-15248: explicitly added partitions should be readable
      
      This unit test asserts that Spark SQL should return data from a table partition which has been placed there outside a metastore query immediately after it is added. I changed the test so that, instead of adding the data as a parquet file saved in the partition's location, the data is added through a SQL `INSERT` query. I made this change because I could find no way to efficiently support partitioned table caching without failing that test.
      
      In addition to my primary motivation, I can offer a few reasons I believe this is an acceptable weakening of that test. First, it still validates a fix for [SPARK-15248](https://issues.apache.org/jira/browse/SPARK-15248), the issue for which it was written. Second, the assertion made is stronger than that required for non-partitioned tables. If you write data to the storage location of a non-partitioned metastore table without using a proper SQL DML query, a subsequent call to show that data will not return it. I believe this is an intentional limitation put in place to make table caching feasible, but I'm only speculating.
      
      Building a large `HadoopFsRelation` requires `stat`-ing all of its data files. In our environment, where we have tables with 10's of thousands of partitions, the difference between using a cached relation versus a new one is a matter of seconds versus minutes. Caching partitioned table metadata vastly improves the usability of Spark SQL for these cases.
      
      Thanks.
      
      Author: Michael Allman <michael@videoamp.com>
      
      Closes #13818 from mallman/spark-15968.
      8f6cf00c