  1. Jul 07, 2016
    • [SPARK-16399][PYSPARK] Force PYSPARK_PYTHON to python · 6343f665
      MechCoder authored
      ## What changes were proposed in this pull request?
      
      I would like to change
      
      ```bash
      if hash python2.7 2>/dev/null; then
        # Attempt to use Python 2.7, if installed:
        DEFAULT_PYTHON="python2.7"
      else
        DEFAULT_PYTHON="python"
      fi
      ```
      
      to just ```DEFAULT_PYTHON="python"```
      
      I'm not sure it is a safe assumption to default to python2.7 when `python` points to something else.
      
      ## How was this patch tested?
      
      Author: MechCoder <mks542@nyu.edu>
      
      Closes #14016 from MechCoder/followup.
    • [SPARK-16372][MLLIB] Retag RDD to tallSkinnyQR of RowMatrix · 4c6f00d0
      Xusen Yin authored
      ## What changes were proposed in this pull request?
      
      The following Java code fails because of type erasure:
      
      ```Java
      JavaRDD<Vector> rows = jsc.parallelize(...);
      RowMatrix mat = new RowMatrix(rows.rdd());
      QRDecomposition<RowMatrix, Matrix> result = mat.tallSkinnyQR(true);
      ```
      
      We should use `retag` to restore the RDD's element type and prevent the following exception:
      
      ```Java
      java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lorg.apache.spark.mllib.linalg.Vector;
      ```
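
      A hedged Scala sketch of the idea (`retag` is a Spark-internal RDD method, so this lives inside Spark rather than user code; the helper name is made up):

      ```scala
      import org.apache.spark.mllib.linalg.Vector
      import org.apache.spark.rdd.RDD

      // Restore the ClassTag lost to Java type erasure so that internal
      // collect/cast operations see Vector[] rather than Object[].
      def withRestoredType(rows: RDD[Vector]): RDD[Vector] =
        rows.retag(classOf[Vector])
      ```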
      
      ## How was this patch tested?
      
      Java unit test
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #14051 from yinxusen/SPARK-16372.
    • [SPARK-16400][SQL] Remove InSet filter pushdown from Parquet · 986b2514
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch removes InSet filter pushdown from Parquet data source, since row-based pushdown is not beneficial to Spark and brings extra complexity to the code base.
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #14076 from rxin/SPARK-16400.
    • [SPARK-16368][SQL] Fix Strange Errors When Creating View With Unmatched Column Num · ab05db0b
      gatorsmile authored
      #### What changes were proposed in this pull request?
      When creating a view, a common user error is that the number of columns produced by the `SELECT` clause does not match the number of column names specified by `CREATE VIEW`.
      
      For example, given that table `t1` has only 3 columns:
      ```SQL
      create view v1(col2, col4, col3, col5) as select * from t1
      ```
      Currently, Spark SQL reports the following error:
      ```
      requirement failed
      java.lang.IllegalArgumentException: requirement failed
      	at scala.Predef$.require(Predef.scala:212)
      	at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:90)
      ```
      
      This error message is very confusing. This PR is to detect the error and issue a meaningful error message.
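
      A minimal sketch of the intended check (identifier names such as `userSpecifiedColumns` and `analyzedPlan` are assumptions, not the PR's exact code):

      ```scala
      import org.apache.spark.sql.AnalysisException

      if (userSpecifiedColumns.nonEmpty &&
          userSpecifiedColumns.length != analyzedPlan.output.length) {
        throw new AnalysisException(
          s"The number of columns produced by the SELECT clause " +
          s"(${analyzedPlan.output.length}) does not match the number of " +
          s"column names specified by CREATE VIEW (${userSpecifiedColumns.length}).")
      }
      ```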
      
      #### How was this patch tested?
      Added test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14047 from gatorsmile/viewMismatchedColumns.
    • [SPARK-15885][WEB UI] Provide links to executor logs from stage details page in UI · ce3ea969
      Tom Magrino authored
      ## What changes were proposed in this pull request?
      
      This moves over old PR https://github.com/apache/spark/pull/13664 to target master rather than branch-1.6.
      
      Added links to logs (or an indication that there are no logs) for entries which list an executor in the stage details page of the UI.
      
      This helps streamline the workflow where a user views a stage details page and determines that they would like to see the associated executor log for further examination. Previously, a user would have to cross-reference the executor id listed on the stage details page with the corresponding entry on the executors tab.
      
      Link to the JIRA: https://issues.apache.org/jira/browse/SPARK-15885
      
      ## How was this patch tested?
      
      Ran existing unit tests.
      Ran test queries on a platform which did not record executor logs and again on one which did, and verified that the new table column was empty in the first case and contained links to the logs (verified as linking to the appropriate files) in the second.
      
      Attached is a screenshot of the UI page with no links, with the new columns highlighted, plus an additional screenshot of these columns with the populated links.
      
      Without links:
      ![updated without logs](https://cloud.githubusercontent.com/assets/1450821/16059721/2b69dbaa-3239-11e6-9eed-e539764ca159.png)
      
      With links:
      ![updated with logs](https://cloud.githubusercontent.com/assets/1450821/16059725/32c6e316-3239-11e6-90bd-2553f43f7779.png)
      
      This contribution is my original work and I license the work to the project under the Apache Spark project's open source license.
      
      Author: Tom Magrino <tmagrino@fb.com>
      
      Closes #13861 from tmagrino/uilogstweak.
    • [SPARK-16021][TEST-MAVEN] Fix the maven build · 4b5a72c7
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Fixed the maven build for #13983
      
      ## How was this patch tested?
      
      The existing tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #14084 from zsxwing/fix-maven.
    • [SPARK-16398][CORE] Make cancelJob and cancelStage APIs public · 69f53914
      MasterDDT authored
      ## What changes were proposed in this pull request?
      
      Make SparkContext `cancelJob` and `cancelStage` APIs public. This allows applications to use `SparkListener` to do their own management of jobs via events, but without using the REST API.
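
      A hedged usage sketch of what the now-public APIs allow (`shouldCancel` is a hypothetical application-defined policy):

      ```scala
      import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

      def shouldCancel(jobId: Int): Boolean = ???  // hypothetical policy

      sc.addSparkListener(new SparkListener {
        override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
          if (shouldCancel(jobStart.jobId)) {
            sc.cancelJob(jobStart.jobId)  // now public on SparkContext
          }
        }
      })
      ```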
      
      ## How was this patch tested?
      
      Existing tests (dev/run-tests)
      
      Author: MasterDDT <miteshp@live.com>
      
      Closes #14072 from MasterDDT/SPARK-16398.
  2. Jul 06, 2016
    • [SPARK-16374][SQL] Remove Alias from MetastoreRelation and SimpleCatalogRelation · 42279bff
      gatorsmile authored
      #### What changes were proposed in this pull request?
      Different from the other leaf nodes, `MetastoreRelation` and `SimpleCatalogRelation` have a pre-defined `alias`, which is used to change the qualifier of the node. However, based on the existing alias handling, alias should be put in `SubqueryAlias`.
      
      This PR is to separate alias handling from `MetastoreRelation` and `SimpleCatalogRelation` to make it consistent with the other nodes. It simplifies the signature and conversion to a `BaseRelation`.
      
      For example, below is an example query for `MetastoreRelation`, which is converted to a `LogicalRelation`:
      ```SQL
      SELECT tmp.a + 1 FROM test_parquet_ctas tmp WHERE tmp.a > 2
      ```
      
      Before changes, the analyzed plan is
      ```
      == Analyzed Logical Plan ==
      (a + 1): int
      Project [(a#951 + 1) AS (a + 1)#952]
      +- Filter (a#951 > 2)
         +- SubqueryAlias tmp
            +- Relation[a#951] parquet
      ```
      After changes, the analyzed plan becomes
      ```
      == Analyzed Logical Plan ==
      (a + 1): int
      Project [(a#951 + 1) AS (a + 1)#952]
      +- Filter (a#951 > 2)
         +- SubqueryAlias tmp
            +- SubqueryAlias test_parquet_ctas
               +- Relation[a#951] parquet
      ```
      
      **Note: the optimized plans are the same.**
      
      For `SimpleCatalogRelation`, the existing code always generates two Subqueries. Thus, no change is needed.
      
      #### How was this patch tested?
      Added test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14053 from gatorsmile/removeAliasFromMetastoreRelation.
    • [SPARK-14839][SQL] Support for other types for `tableProperty` rule in SQL syntax · 34283de1
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Currently, the Scala API supports options of the types `String`, `Long`, `Double`, and `Boolean`, and the Python API also supports other types.
      
      This PR corrects the `tableProperty` rule to support other types (string, boolean, double, and integer) so that the options for data sources are supported in a consistent way. This will affect other rules such as DBPROPERTIES and TBLPROPERTIES (allowing other types as values).
      
      Also, `TODO add bucketing and partitioning.` was removed because it was resolved in https://github.com/apache/spark/commit/24bea000476cdd0b43be5160a76bc5b170ef0b42
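
      A hedged example of DDL the relaxed rule should accept (table and option names are illustrative):

      ```scala
      // Option keys stay identifiers; values may now be unquoted booleans,
      // integers, and doubles in addition to quoted strings.
      spark.sql("""
        CREATE TABLE src (key INT)
        USING parquet
        OPTIONS (compression 'snappy', mergeSchema true)
      """)
      ```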
      
      ## How was this patch tested?
      
      Unit test in `MetastoreDataSourcesSuite.scala`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #13517 from HyukjinKwon/SPARK-14839.
    • [SPARK-16021] Fill freed memory in test to help catch correctness bugs · 44c7c62b
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      This patches `MemoryAllocator` to fill clean and freed memory with known byte values, similar to https://github.com/jemalloc/jemalloc/wiki/Use-Case:-Find-a-memory-corruption-bug . Memory filling is flag-enabled in test only by default.
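
      A conceptual sketch of the technique (the fill bytes and method shapes here are assumptions; Spark's allocator works on memory blocks, not plain arrays):

      ```scala
      // Fill fresh allocations with one recognizable byte and freed memory
      // with another, so reads of uninitialized or freed memory show up as
      // deterministic garbage in tests instead of silent corruption.
      val CLEAN_FILL: Byte = 0xa5.toByte
      val FREE_FILL: Byte  = 0x5a.toByte

      def allocate(size: Int, debugFill: Boolean): Array[Byte] = {
        val block = new Array[Byte](size)
        if (debugFill) java.util.Arrays.fill(block, CLEAN_FILL)
        block
      }

      def free(block: Array[Byte], debugFill: Boolean): Unit = {
        if (debugFill) java.util.Arrays.fill(block, FREE_FILL)  // poison before release
      }
      ```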
      
      ## How was this patch tested?
      
      A unit test verifying that it is enabled in tests.
      
      cc sameeragarwal
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #13983 from ericl/spark-16021.
    • [SPARK-16212][STREAMING][KAFKA] apply test tweaks from 0-10 to 0-8 as well · b8ebf63c
      cody koeninger authored
      ## What changes were proposed in this pull request?
      
      Bring the kafka-0-8 subproject up to date with some test modifications from development on 0-10.
      
      Main changes are:
      - eliminating waits on concurrent queue in favor of an assert on received results,
      - atomics instead of volatile (although this probably doesn't matter)
      - increasing uniqueness of topic names
      
      ## How was this patch tested?
      
      Unit tests
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #14073 from koeninger/kafka-0-8-test-direct-cleanup.
    • [SPARK-16371][SQL] Two follow-up tasks · 8e3e4ed6
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This is a small follow-up for SPARK-16371:
      
      1. Hide removeMetadata from public API.
      2. Add JIRA ticket number to test case name.
      
      ## How was this patch tested?
      Updated a test comment.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #14074 from rxin/parquet-filter.
    • [MESOS] expand coarse-grained mode docs · 9c041990
      Michael Gummelt authored
      ## What changes were proposed in this pull request?
      
      docs
      
      ## How was this patch tested?
      
      viewed the docs in github
      
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      
      Closes #14059 from mgummelt/coarse-grained.
    • [SPARK-16379][CORE][MESOS] Spark on mesos is broken due to race condition in Logging · a8f89df3
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      The commit https://github.com/apache/spark/commit/044971eca0ff3c2ce62afa665dbd3072d52cbbec introduced a lazy val to simplify code in Logging. Simple enough, though one side effect is that accessing log now means grabbing the instance's lock. This in turn turned up a form of deadlock in the Mesos code. It was arguably a bit of a problem in how this code is structured, but, in any event, the safest thing to do seems to be to revert the commit, and that's 90% of the change here; it's just not worth the risk of similar more subtle issues.
      
      What I didn't revert here was the removal of this odd override of log in the Mesos code. In retrospect it might have been put in place at some stage as a defense against this type of problem. After all the Logging code still involved a lock at initialization before the change in question.
      
      Even after the revert, it doesn't seem like it does anything, given how Logging works now, so I left it removed. However, I also removed the particular log message that ended up playing a part in this problem anyway, maybe being paranoid, to make sure this type of problem can't happen even with how the current locking works in logging initialization.
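
      Illustrative only (standard Scala semantics, not the reverted Spark code): a `lazy val` initializes under the instance monitor, which is how `log` access can join a lock cycle.

      ```scala
      class Worker {
        // First access runs the initializer while holding the lock on `this`.
        lazy val log = org.slf4j.LoggerFactory.getLogger(getClass)

        // Also locks `this`; two threads taking this lock and another lock
        // in opposite orders is the classic deadlock shape.
        def doWork(): Unit = synchronized { log.info("working") }
      }
      ```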
      
      ## How was this patch tested?
      
      Jenkins tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14069 from srowen/SPARK-16379.
    • [SPARK-15740][MLLIB] Word2VecSuite "big model load / save" caused OOM in maven jenkins builds · 040f6f9f
      tmnd1991 authored
      ## What changes were proposed in this pull request?
      The "test big model load / save" test in Word2VecSuite has lately been resulting in OOMs.
      Therefore we decided to make the partitioning adaptive (not based on the Spark default "spark.kryoserializer.buffer.max" conf) and to test it using a small buffer size, in order to trigger partitioning without allocating too much memory for the test.
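
      A hedged sketch of the adaptive computation (the size estimate and names are assumptions, not the PR's exact formula):

      ```scala
      val vectorSize = 100     // example model dimensions
      val numWords = 1000000

      // Derive the partition count from the serialized model size rather than
      // hard-coding it, so each slice stays under the kryo buffer cap.
      val bufferSize = sc.getConf.getSizeAsBytes("spark.kryoserializer.buffer.max", "64m")
      val approxModelBytes = 4L * vectorSize * numWords  // float payload estimate
      val numPartitions = (approxModelBytes / bufferSize).toInt + 1
      ```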
      
      ## How was this patch tested?
      It was tested running the following unit test:
      org.apache.spark.mllib.feature.Word2VecSuite
      
      Author: tmnd1991 <antonio.murgia2@studio.unibo.it>
      
      Closes #13509 from tmnd1991/SPARK-15740.
    • [SPARK-16371][SQL] Do not push down filters incorrectly when inner name and outer name are the same in Parquet · 4f8ceed5
      hyukjinkwon authored
      
      ## What changes were proposed in this pull request?
      
      Currently, if there is a schema as below:
      
      ```
      root
        |-- _1: struct (nullable = true)
        |    |-- _1: integer (nullable = true)
      ```
      
      and if we execute the code below:
      
      ```scala
      df.filter("_1 IS NOT NULL").count()
      ```
      
      This pushes down a filter although the filter is being applied to a `StructType`. (If my understanding is correct, Spark does not push down filters for those.)
      
      The reason is that `ParquetFilters.getFieldMap` produces the results below:
      
      ```
      (_1,StructType(StructField(_1,IntegerType,true)))
      (_1,IntegerType)
      ```
      
      and then it becomes the following `Map`:
      
      ```
      (_1,IntegerType)
      ```
      
      Now, because of `....lift(dataTypeOf(name)).map(_(name, value))`, this pushes down filters for `_1`, which Parquet thinks is `IntegerType`. However, it is actually a `StructType`.
      
      So, the Parquet filter2 API produces incorrect results; for example, the code below:
      
      ```
      df.filter("_1 IS NOT NULL").count()
      ```
      
      always produces 0.
      
      This PR prevents this by not including nested fields in the name-to-type map used for pushdown.
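
      A minimal sketch of the prevention, with the shape assumed from the description (only top-level non-struct fields stay eligible for pushdown):

      ```scala
      import org.apache.spark.sql.types._

      def getFieldMap(dataType: DataType): Map[String, DataType] = dataType match {
        case StructType(fields) =>
          // Drop struct-typed fields instead of recursing into them, so a
          // nested `_1` can no longer shadow the outer `_1` in the map.
          fields.filterNot(_.dataType.isInstanceOf[StructType])
            .map(f => f.name -> f.dataType).toMap
        case _ => Map.empty[String, DataType]
      }
      ```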
      
      ## How was this patch tested?
      
      Unit test in `ParquetFilterSuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #14067 from HyukjinKwon/SPARK-16371.
    • [SPARK-16304] LinkageError should not crash Spark executor · 480357cc
      petermaxlee authored
      ## What changes were proposed in this pull request?
      This patch updates the failure handling logic so Spark executor does not crash when seeing LinkageError.
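
      A sketch of the idea, modeled on `Utils.isFatalError` (treat it as an approximation of the patch, not its exact diff):

      ```scala
      import scala.util.control.{ControlThrowable, NonFatal}

      def isFatalError(e: Throwable): Boolean = e match {
        // LinkageError joins the throwables that fail the task but no
        // longer bring down the executor JVM.
        case NonFatal(_) | _: InterruptedException | _: NotImplementedError |
             _: ControlThrowable | _: LinkageError =>
          false
        case _ =>
          true
      }
      ```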
      
      ## How was this patch tested?
      Added an end-to-end test in FailureSuite.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      
      Closes #13982 from petermaxlee/SPARK-16304.
    • [MINOR][PYSPARK][DOC] Fix wrongly formatted examples in PySpark documentation · 4e14199f
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR fixes wrongly formatted examples in PySpark documentation as below:
      
      - **`SparkSession`**
      
        - **Before**
      
          ![2016-07-06 11 34 41](https://cloud.githubusercontent.com/assets/6477701/16605847/ae939526-436d-11e6-8ab8-6ad578362425.png)
      
        - **After**
      
          ![2016-07-06 11 33 56](https://cloud.githubusercontent.com/assets/6477701/16605845/ace9ee78-436d-11e6-8923-b76d4fc3e7c3.png)
      
      - **`Builder`**
      
        - **Before**
          ![2016-07-06 11 34 44](https://cloud.githubusercontent.com/assets/6477701/16605844/aba60dbc-436d-11e6-990a-c87bc0281c6b.png)
      
        - **After**
          ![2016-07-06 1 26 37](https://cloud.githubusercontent.com/assets/6477701/16607562/586704c0-437d-11e6-9483-e0af93d8f74e.png)
      
      This PR also fixes several similar instances across the documentation in the `sql` PySpark module.
      
      ## How was this patch tested?
      
      N/A
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #14063 from HyukjinKwon/minor-pyspark-builder.
    • [DOC][SQL] update out-of-date code snippets using SQLContext in all documents. · b1310425
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      I searched the whole docs directory for SQLContext usages and updated the following places:
      
      - docs/configuration.md, sparkR code snippets.
      - docs/streaming-programming-guide.md, several example code snippets.
      
      ## How was this patch tested?
      
      N/A
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #14025 from WeichenXu123/WIP_SQLContext_update.
    • [SPARK-15979][SQL] Renames CatalystWriteSupport to ParquetWriteSupport · 23eff5e5
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      PR #13696 renamed various Parquet support classes but left `CatalystWriteSupport` behind. This PR renames it as a follow-up.
      
      ## How was this patch tested?
      
      N/A.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #14070 from liancheng/spark-15979-follow-up.
    • [SPARK-15591][WEBUI] Paginate Stage Table in Stages tab · 478b71d0
      Tao Lin authored
      ## What changes were proposed in this pull request?
      
      This patch adds pagination support for the Stage Tables in the Stages tab. Pagination is provided for all four Stage Tables (active, pending, completed, and failed). In addition, the paged stage tables are also used in JobPage (the detail page for one job) and PoolPage.
      
      Interactions (jumping, sorting, and setting page size) for paged tables are also included.
      
      ## How was this patch tested?
      
      Tested manually by checking the Web UI after completing and failing hundreds of jobs, the same as the testing for [Paginate Job Table in Jobs tab](https://github.com/apache/spark/pull/13620).
      
      This shows the pagination for completed stages:
      ![paged stage table](https://cloud.githubusercontent.com/assets/5558370/16125696/5804e35e-3427-11e6-8923-5c5948982648.png)
      
      Author: Tao Lin <nblintao@gmail.com>
      
      Closes #13708 from nblintao/stageTable.
    • [SPARK-16229][SQL] Drop Empty Table After CREATE TABLE AS SELECT fails · 21eadd1d
      gatorsmile authored
      #### What changes were proposed in this pull request?
      In `CREATE TABLE AS SELECT`, if the `SELECT` query fails, the table should not exist. For example:
      
      ```SQL
      CREATE TABLE tab
      STORED AS TEXTFILE
      SELECT 1 AS a, (SELECT a FROM (SELECT 1 AS a UNION ALL SELECT 2 AS a) t) AS b
      ```
      The above query fails as expected, but an empty table `tab` is created.
      
      This PR is to drop the created table when hitting any non-fatal exception.
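
      A hedged sketch of the guard (names are assumptions):

      ```scala
      import scala.util.control.NonFatal

      try {
        writeQueryResult()  // hypothetical: run the SELECT into the new table
      } catch {
        case NonFatal(e) =>
          // The table was created before the query ran; remove it so a
          // failed CTAS leaves nothing behind.
          sparkSession.sql(s"DROP TABLE IF EXISTS $tableName")
          throw e
      }
      ```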
      
      #### How was this patch tested?
      Added a test case to verify the behavior
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #13926 from gatorsmile/dropTableAfterException.
    • [SPARK-16307][ML] Add test to verify the predicted variances of a DT on toy data · 909c6d81
      MechCoder authored
      ## What changes were proposed in this pull request?
      
      The current tests assume that `impurity.calculate()` returns the variance correctly. It would be better to make the tests independent of this assumption; in other words, verify that the computed variance equals the variance computed manually on a small tree.
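
      The spirit of the added test, as a hedged sketch: compute the variance of a leaf's labels by hand and compare it with the tree's prediction.

      ```scala
      // Toy leaf with three labels; population variance computed manually.
      val labels = Seq(1.0, 2.0, 3.0)
      val mean = labels.sum / labels.size
      val expectedVariance = labels.map(x => math.pow(x - mean, 2)).sum / labels.size
      // expectedVariance == 2.0 / 3.0; compare it against the tree's predicted
      // variance within a tolerance (assertion API omitted here).
      ```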
      
      ## How was this patch tested?
      
      The patch is a test....
      
      Author: MechCoder <mks542@nyu.edu>
      
      Closes #13981 from MechCoder/dt_variance.
    • [SPARK-16388][SQL] Remove spark.sql.nativeView and spark.sql.nativeView.canonical config · 7e28fabd
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      These two configs should always be true after Spark 2.0. This patch removes them from the config list. Note that ideally this should've gone into branch-2.0, but due to the timing of the release we should only merge this in master for Spark 2.1.
      
      ## How was this patch tested?
      Updated test cases.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #14061 from rxin/SPARK-16388.
    • [SPARK-16249][ML] Change visibility of Object ml.clustering.LDA to public for loading · 5497242c
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      jira: https://issues.apache.org/jira/browse/SPARK-16249
      Change the visibility of the object `ml.clustering.LDA` to public for loading, so that users can invoke `LDA.load("path")`.
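
      What the visibility change enables:

      ```scala
      import org.apache.spark.ml.clustering.LDA

      // Previously failed because the companion object was not public.
      val lda = LDA.load("path")
      ```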
      
      ## How was this patch tested?
      
      Existing unit tests, plus a manual test of loading a model saved with the current code.
      
      Author: Yuhao Yang <yuhao.yang@intel.com>
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #13941 from hhbyyh/ldapublic.
    • [SPARK-16339][CORE] ScriptTransform does not print stderr when outstream is lost · 5f342049
      Tejas Patil authored
      ## What changes were proposed in this pull request?
      
      Currently, if the outstream gets destroyed or closed due to some failure, a later `outstream.close()` leads to an IOException. Because of this, the `stderrBuffer` does not get logged and there is no way for users to see why the job failed.
      
      The change is to first display the stderr buffer and then try closing the outstream.
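
      A minimal sketch of the reordering (identifiers are assumptions):

      ```scala
      // Log the buffered stderr first, then attempt the close that can throw;
      // an IOException from close() no longer hides the script's own errors.
      logError(stderrBuffer.toString)
      outstream.close()
      ```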
      
      ## How was this patch tested?
      
      The correct way to test this fix would be to grep the log to see if the `stderrBuffer` gets logged, but I don't think having test cases which do that is a good idea.
      
      Author: Tejas Patil <tejasp@fb.com>
      
      Closes #13834 from tejasapatil/script_transform.
    • [SPARK-16340][SQL] Support column arguments for `regexp_replace` Dataset operation · ec79183a
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, the `regexp_replace` function supports `Column` arguments in a SQL query. This PR adds that support to the `Dataset` operation, too.
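
      The new overload in use, as a hedged example (assumes `df` has columns `value`, `pattern`, and `replacement`):

      ```scala
      import org.apache.spark.sql.functions.{col, regexp_replace}

      // Pattern and replacement can now vary per row, coming from columns.
      df.select(regexp_replace(col("value"), col("pattern"), col("replacement")))
      ```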
      
      ## How was this patch tested?
      
      Pass the Jenkins tests with an updated testcase.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14060 from dongjoon-hyun/SPARK-16340.
  3. Jul 05, 2016
    • [SPARK-16389][SQL] Remove MetastoreRelation from SparkHiveWriterContainer and SparkHiveDynamicPartitionWriterContainer · ec18cd0a
      gatorsmile authored
      
      #### What changes were proposed in this pull request?
      - Remove useless `MetastoreRelation` from the signature of `SparkHiveWriterContainer` and `SparkHiveDynamicPartitionWriterContainer`.
      - Avoid unnecessary metadata retrieval using Hive client in `InsertIntoHiveTable`.
      
      #### How was this patch tested?
      Existing test cases already cover it.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14062 from gatorsmile/removeMetastoreRelation.
    • [SPARK-16286][SQL] Implement stack table generating function · d0d28507
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR implements the `stack` table generating function.
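
      A hedged usage example of the generator:

      ```scala
      // stack(n, v1, v2, ...) distributes the values across n rows.
      spark.sql("SELECT stack(2, 1, 2, 3) AS (a, b)").show()
      // +---+----+
      // |  a|   b|
      // +---+----+
      // |  1|   2|
      // |  3|null|
      // +---+----+
      ```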
      
      ## How was this patch tested?
      
      Pass the Jenkins tests including new testcases.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14033 from dongjoon-hyun/SPARK-16286.
    • [SPARK-16348][ML][MLLIB][PYTHON] Use full classpaths for pyspark ML JVM calls · fdde7d0a
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Issue: Omitting the full classpath can cause problems when calling JVM methods or classes from pyspark.
      
      This PR: Changed all uses of jvm.X in pyspark.ml and pyspark.mllib to use full classpath for X
      
      ## How was this patch tested?
      
      Existing unit tests.  Manual testing in an environment where this was an issue.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #14023 from jkbradley/SPARK-16348.
    • [SPARK-16385][CORE] Catch correct exception when calling method via reflection. · 59f9c1bd
      Marcelo Vanzin authored
      Calling a method via `Method.invoke` causes an exception to be thrown, not an error, so
      `Utils.waitForProcess()` was always throwing an exception when run on Java 7.
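
      An illustrative sketch of the distinction (not the exact Spark code): the reflective lookup fails with `NoSuchMethodException`, an `Exception`, whereas only a direct link failure would raise `NoSuchMethodError`.

      ```scala
      import java.util.concurrent.TimeUnit

      def waitForProcess(process: Process, timeoutMs: Long): Unit = {
        try {
          // Java 8 only: Process.waitFor(long, TimeUnit), looked up reflectively.
          val m = classOf[Process].getMethod("waitFor", classOf[Long], classOf[TimeUnit])
          m.invoke(process, java.lang.Long.valueOf(timeoutMs), TimeUnit.MILLISECONDS)
        } catch {
          case _: NoSuchMethodException =>
            // Java 7 fallback: poll process.exitValue() until the deadline.
        }
      }
      ```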
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #14056 from vanzin/SPARK-16385.
    • [SPARK-16383][SQL] Remove `SessionState.executeSql` · 4db63fd2
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR removes `SessionState.executeSql` in favor of `SparkSession.sql`. We can remove this safely since the visibility of `SessionState` is `private[sql]` and `executeSql` is only used in one **ignored** test, `test("Multiple Hive Instances")`.
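
      The migration, as a one-line sketch:

      ```scala
      // before (internal API, now removed):
      //   sparkSession.sessionState.executeSql("SELECT 1")
      // after (public API):
      sparkSession.sql("SELECT 1")
      ```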
      
      ## How was this patch tested?
      
      Pass the Jenkins tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14055 from dongjoon-hyun/SPARK-16383.
    • [SPARK-16359][STREAMING][KAFKA] unidoc skip kafka 0.10 · 1f0d0213
      cody koeninger authored
      ## What changes were proposed in this pull request?
      During the sbt unidoc task, skip the streamingKafka010 subproject and filter Kafka 0.10 classes from the classpath, so that at least the existing Kafka 0.8 doc can be included in unidoc without error.
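
      A hedged sbt sketch of the approach (the project reference name is an assumption):

      ```scala
      // Exclude the Kafka 0.10 subproject from the aggregated scaladoc so its
      // conflicting classes never reach the unidoc task.
      unidocProjectFilter in (ScalaUnidoc, unidoc) :=
        inAnyProject -- inProjects(streamingKafka010)
      ```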
      
      ## How was this patch tested?
      sbt spark/scalaunidoc:doc | grep -i error
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #14041 from koeninger/SPARK-16359.
    • [SPARK-15730][SQL] Respect the --hiveconf in the spark-sql command line · 920cb5fe
      Cheng Hao authored
      ## What changes were proposed in this pull request?
      This PR makes spark-sql (backed by SparkSQLCLIDriver) respect confs set through --hiveconf, which is what we did in previous versions. The change is that when we start SparkSQLCLIDriver, we explicitly set confs passed through --hiveconf on the SQLContext's conf (basically treating those confs as SparkSQL confs).
      
      ## How was this patch tested?
      A new test in CliSuite.
      
      Closes #13542
      
      Author: Cheng Hao <hao.cheng@intel.com>
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #14058 from yhuai/hiveConfThriftServer.
    • [HOTFIX] Fix build break. · 5b7a1770
      Reynold Xin authored
    • [SPARK-16212][STREAMING][KAFKA] use random port for embedded kafka · 1fca9da9
      cody koeninger authored
      ## What changes were proposed in this pull request?
      
      Testing for 0.10 uncovered an issue with a fixed port number being used in KafkaTestUtils. This makes a roughly equivalent fix for the 0.8 connector.
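
      The usual shape of such a fix, as a hedged sketch: bind to port 0 and let the OS pick a free ephemeral port.

      ```scala
      import java.net.ServerSocket

      val socket = new ServerSocket(0)      // 0 asks the OS for any free port
      val brokerPort = socket.getLocalPort  // read back the assigned port
      socket.close()                        // small race window; fine for tests
      ```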
      
      ## How was this patch tested?
      
      Unit tests, manual tests
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #14018 from koeninger/kafka-0-8-test-port.
    • [SPARK-16311][SQL] Metadata refresh should work on temporary views · 16a2a7d7
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch fixes the bug that the refresh command does not work on temporary views. This patch is based on https://github.com/apache/spark/pull/13989, but removes the public Dataset.refresh() API and improves test coverage.
      
      Note that I actually think the public refresh() API is very useful. We can implement it in the future by also invalidating the lazy vals in QueryExecution (or alternatively just creating a new QueryExecution).
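
      A hedged usage sketch of the fixed behavior:

      ```scala
      // Refreshing now also invalidates cached metadata for a temporary view.
      df.createOrReplaceTempView("my_view")
      spark.catalog.refreshTable("my_view")  // or: spark.sql("REFRESH TABLE my_view")
      ```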
      
      ## How was this patch tested?
      Re-enabled a previously ignored test, and added a new test suite for Hive testing behavior of temporary views against MetastoreRelation.
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: petermaxlee <petermaxlee@gmail.com>
      
      Closes #14009 from rxin/SPARK-16311.
    • [SPARK-9876][SQL][FOLLOWUP] Enable string and binary tests for Parquet predicate pushdown and replace deprecated fromByteArray. · 07d9c532
      hyukjinkwon authored
      
      ## What changes were proposed in this pull request?
      
      It seems Parquet has been upgraded to 1.8.1 by https://github.com/apache/spark/pull/13280. So, this PR enables string and binary predicate push-down, which was disabled due to [SPARK-11153](https://issues.apache.org/jira/browse/SPARK-11153) and [PARQUET-251](https://issues.apache.org/jira/browse/PARQUET-251), and cleans up some comments that were left unremoved (I think by mistake).
      
      This PR also replaces the `fromByteArray()` API, deprecated in [PARQUET-251](https://issues.apache.org/jira/browse/PARQUET-251).
      
      ## How was this patch tested?
      
      Unit tests in `ParquetFilters`
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #13389 from HyukjinKwon/parquet-1.8-followup.
    • [SPARK-16360][SQL] Speed up SQL query performance by removing redundant `executePlan` call · 7f7eb393
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, there are a few reports about Spark 2.0 query performance regression for large queries.
      
      This PR speeds up SQL query processing by removing a redundant **consecutive `executePlan`** call in the `Dataset.ofRows` function and `Dataset` instantiation. Specifically, this PR aims to reduce the overhead of SQL query execution plan generation, not of the real query execution, so the effect is not visible in the Spark Web UI. Please use the following query script; the result improves from **25.78 sec** to **12.36 sec** as expected.
      
      **Sample Query**
      ```scala
      val n = 4000
      val values = (1 to n).map(_.toString).mkString(", ")
      val columns = (1 to n).map("column" + _).mkString(", ")
      val query =
        s"""
           |SELECT $columns
           |FROM VALUES ($values) T($columns)
           |WHERE 1=2 AND 1 IN ($columns)
           |GROUP BY $columns
           |ORDER BY $columns
           |""".stripMargin
      
      def time[R](block: => R): R = {
        val t0 = System.nanoTime()
        val result = block
        println("Elapsed time: " + ((System.nanoTime - t0) / 1e9) + "s")
        result
      }
      ```
      
      **Before**
      ```scala
      scala> time(sql(query))
      Elapsed time: 30.138142577s  // First query has a little overhead of initialization.
      res0: org.apache.spark.sql.DataFrame = [column1: int, column2: int ... 3998 more fields]
      scala> time(sql(query))
      Elapsed time: 25.787751452s  // Let's compare this one.
      res1: org.apache.spark.sql.DataFrame = [column1: int, column2: int ... 3998 more fields]
      ```
      
      **After**
      ```scala
      scala> time(sql(query))
      Elapsed time: 17.500279659s  // First query has a little overhead of initialization.
      res0: org.apache.spark.sql.DataFrame = [column1: int, column2: int ... 3998 more fields]
      scala> time(sql(query))
      Elapsed time: 12.364812255s  // This shows the real difference. The speed up is about 2 times.
      res1: org.apache.spark.sql.DataFrame = [column1: int, column2: int ... 3998 more fields]
      ```
      
      ## How was this patch tested?
      
      Manually, using the above script.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14044 from dongjoon-hyun/SPARK-16360.
    • [SPARK-15198][SQL] Support for pushing down filters for boolean types in ORC data source · 7742d9f1
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      It seems ORC supports all the types in [`PredicateLeaf.Type`](https://github.com/apache/hive/blob/e085b7e9bd059d91aaf013df0db4d71dca90ec6f/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/PredicateLeaf.java#L50-L56), which include boolean types, so this was tested first.
      
      This PR adds support for pushing down filters for `BooleanType` in the ORC data source.
      
      This PR also removes the `OrcTableScan` class and its companion object, which are not used anymore.
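
      What the change enables, as a hedged example:

      ```scala
      import org.apache.spark.sql.functions.col

      // The boolean predicate can now be pushed into the ORC reader instead
      // of being evaluated only after rows reach Spark.
      spark.read.orc("/path/to/data").filter(col("flag") === true).count()
      ```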
      
      ## How was this patch tested?
      
      Unit tests in `OrcFilterSuite` and `OrcQuerySuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #12972 from HyukjinKwon/SPARK-15198.