  1. Jun 17, 2016
    • Sameer Agarwal's avatar
      Remove non-obvious conf settings from TPCDS benchmark · 34d6c4cd
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      My fault -- these 2 conf entries were mysteriously hidden inside the benchmark code, which made it non-obvious to disable whole-stage codegen and/or the vectorized parquet reader.
      
      PS: Didn't attach a JIRA as this change should otherwise be a no-op (both of these confs are enabled by default in Spark)
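      For reference, a minimal sketch of disabling these features explicitly now that the hidden settings are gone, assuming the standard conf keys for whole-stage codegen and the vectorized Parquet reader:
      
      ```scala
      import org.apache.spark.sql.SparkSession
      
      // Disabling either feature is now an explicit, visible choice rather than
      // something buried in the benchmark code.
      val spark = SparkSession.builder()
        .config("spark.sql.codegen.wholeStage", "false")             // whole-stage codegen off
        .config("spark.sql.parquet.enableVectorizedReader", "false") // vectorized Parquet reader off
        .getOrCreate()
      ```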
      
      ## How was this patch tested?
      
      N/A
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #13726 from sameeragarwal/tpcds-conf.
      34d6c4cd
    • Davies Liu's avatar
      [SPARK-15811][SQL] fix the Python UDF in Scala 2.10 · ef43b4ed
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Iterators can't be serialized in Scala 2.10, so we should force the iterator into an array to make sure it can be serialized.
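      A minimal sketch of the workaround, assuming the value only needs to survive serialization:
      
      ```scala
      // Scala 2.10 iterators are not serializable, so materialize the elements
      // into an array before the value is shipped to executors.
      val it: Iterator[Int] = Iterator(1, 2, 3)
      val serializableCopy: Array[Int] = it.toArray
      ```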
      
      ## How was this patch tested?
      
      Build with Scala 2.10 and ran all the Python unit tests manually (will be covered by a jenkins build).
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13717 from davies/fix_udf_210.
      ef43b4ed
    • gatorsmile's avatar
      [SPARK-15706][SQL] Fix Wrong Answer when using IF NOT EXISTS in INSERT... · e5d703bc
      gatorsmile authored
      [SPARK-15706][SQL] Fix Wrong Answer when using IF NOT EXISTS in INSERT OVERWRITE for DYNAMIC PARTITION
      
      #### What changes were proposed in this pull request?
      `IF NOT EXISTS` in `INSERT OVERWRITE` should not support dynamic partitions. If we specify `IF NOT EXISTS`, the inserted data does not show up in the table.
      
      This PR is to issue an exception in this case, just like what Hive does. Also issue an exception if users specify `IF NOT EXISTS` without any `PARTITION` specification.
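      Illustrative only (table and column names are made up); with this change the dynamic-partition form below raises an exception instead of silently producing a wrong answer:
      
      ```scala
      spark.sql(
        """INSERT OVERWRITE TABLE dest PARTITION (part)
          |IF NOT EXISTS
          |SELECT key, value, part FROM src""".stripMargin)
      ```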
      
      #### How was this patch tested?
      Added test cases into `PlanParserSuite` and `InsertIntoHiveTableSuite`
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #13447 from gatorsmile/insertIfNotExist.
      e5d703bc
    • Pete Robbins's avatar
      [SPARK-15822] [SQL] Prevent byte array backed classes from referencing freed memory · 5ada6061
      Pete Robbins authored
      ## What changes were proposed in this pull request?
      `UTF8String` and all `Unsafe*` classes are backed by either on-heap or off-heap byte arrays. The code-generated version of `SortMergeJoin` buffers the left-hand-side join keys during iteration. This was actually problematic in off-heap mode when one of the keys is a `UTF8String` (or any other `Unsafe*` object) and the left-hand-side iterator was exhausted (and released its memory); the buffered keys would reference freed memory. This causes segfaults and all kinds of other undefined behavior when we use one of these buffered keys.
      
      This PR fixes this problem by creating copies of the buffered variables. I have added a general method to the `CodeGenerator` for this. I have checked all places in which this could happen, and only `SortMergeJoin` had this problem.
      
      This PR is largely based on the work of robbinspg and he should be credited for this.
      
      closes https://github.com/apache/spark/pull/13707
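      A sketch of the idea (not the generated code itself), assuming `UTF8String.clone()` copies the underlying bytes:
      
      ```scala
      import org.apache.spark.unsafe.types.UTF8String
      
      // Buffer a copy of the key rather than the original object, which may point
      // into memory that the exhausted iterator has already freed.
      def bufferKey(key: UTF8String): UTF8String = key.clone()
      ```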
      
      ## How was this patch tested?
      Manually tested on problematic workloads.
      
      Author: Pete Robbins <robbinspg@gmail.com>
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #13723 from hvanhovell/SPARK-15822-2.
      5ada6061
  2. Jun 16, 2016
    • Dongjoon Hyun's avatar
      [SPARK-15908][R] Add varargs-type dropDuplicates() function in SparkR · 513a03e4
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR adds varargs-type `dropDuplicates` function to SparkR for API parity.
      Refer to https://issues.apache.org/jira/browse/SPARK-15807, too.
      
      ## How was this patch tested?
      
      Pass the Jenkins tests with new testcases.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13684 from dongjoon-hyun/SPARK-15908.
      513a03e4
    • Kai Jiang's avatar
      [SPARK-15490][R][DOC] SparkR 2.0 QA: New R APIs and API docs for non-MLlib changes · 5fd20b66
      Kai Jiang authored
      ## What changes were proposed in this pull request?
      R docs changes covering typos, formatting, and layout.
      ## How was this patch tested?
      Test locally.
      
      Author: Kai Jiang <jiangkai@gmail.com>
      
      Closes #13394 from vectorijk/spark-15490.
      5fd20b66
    • Nezih Yigitbasi's avatar
      [SPARK-15782][YARN] Fix spark.jars and spark.yarn.dist.jars handling · 63470afc
      Nezih Yigitbasi authored
      When `--packages` is specified with `spark-shell`, the classes from those packages cannot be found, which I think is due to some of the changes in SPARK-12343.
      
      Tested manually with both scala 2.10 and 2.11 repls.
      
      vanzin davies can you guys please review?
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      Author: Nezih Yigitbasi <nyigitbasi@netflix.com>
      
      Closes #13709 from nezihyigitbasi/SPARK-15782.
      63470afc
    • Dhruve Ashar's avatar
      [SPARK-15966][DOC] Add closing tag to fix rendering issue for Spark monitoring · f1bf0d2f
      Dhruve Ashar authored
      ## What changes were proposed in this pull request?
      Adds the missing closing tag for spark.ui.view.acls.groups
      
      ## How was this patch tested?
      I built the docs locally and verified the change in the browser.
      
      **Before:**
      ![image](https://cloud.githubusercontent.com/assets/7732317/16135005/49fc0724-33e6-11e6-9390-98711593fa5b.png)
      
      **After:**
      ![image](https://cloud.githubusercontent.com/assets/7732317/16135021/62b5c4a8-33e6-11e6-8118-b22fda5c66eb.png)
      
      Author: Dhruve Ashar <dhruveashar@gmail.com>
      
      Closes #13719 from dhruve/doc/SPARK-15966.
      f1bf0d2f
    • WeichenXu's avatar
      [SPARK-15608][ML][EXAMPLES][DOC] add examples and documents of ml.isotonic regression · 9040d83b
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      - add ml doc for ml isotonic regression
      - add scala example for ml isotonic regression
      - add java example for ml isotonic regression
      - add python example for ml isotonic regression
      - modify scala example for mllib isotonic regression
      - modify java example for mllib isotonic regression
      - modify python example for mllib isotonic regression
      - add data/mllib/sample_isotonic_regression_libsvm_data.txt
      - delete data/mllib/sample_isotonic_regression_data.txt
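      A minimal Scala sketch of the new ml example (an existing SparkSession `spark` is assumed; the libsvm file is the one added by this PR):
      
      ```scala
      import org.apache.spark.ml.regression.IsotonicRegression
      
      val data = spark.read.format("libsvm")
        .load("data/mllib/sample_isotonic_regression_libsvm_data.txt")
      
      val model = new IsotonicRegression().fit(data)
      println(s"Boundaries in increasing order: ${model.boundaries}")
      model.transform(data).show()
      ```
      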
      ## How was this patch tested?
      
      N/A
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #13381 from WeichenXu123/add_isotonic_regression_doc.
      9040d83b
    • Yin Huai's avatar
      [SPARK-15991] SparkContext.hadoopConfiguration should be always the base of... · d9c6628c
      Yin Huai authored
      [SPARK-15991] SparkContext.hadoopConfiguration should be always the base of hadoop conf created by SessionState
      
      ## What changes were proposed in this pull request?
      Before this patch, after a SparkSession has been created, Hadoop conf entries set directly on SparkContext.hadoopConfiguration would not affect the Hadoop conf created by SessionState. This patch makes the change to always use SparkContext.hadoopConfiguration as the base.
      
      This patch also changes the behavior of hive-site.xml support added in https://github.com/apache/spark/pull/12689/. With this patch, we will load hive-site.xml to SparkContext.hadoopConfiguration.
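      A sketch of the intended behavior (the assertion wording is illustrative, not a test from the PR):
      
      ```scala
      import org.apache.spark.sql.SparkSession
      
      val spark = SparkSession.builder().master("local").getOrCreate()
      // Set directly on SparkContext.hadoopConfiguration after the session exists...
      spark.sparkContext.hadoopConfiguration.set("some.test.key", "some-value")
      // ...and any Hadoop conf the session state derives from it is now expected
      // to contain "some.test.key" -> "some-value" as well.
      ```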
      
      ## How was this patch tested?
      New test in SparkSessionBuilderSuite.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #13711 from yhuai/SPARK-15991.
      d9c6628c
    • Huaxin Gao's avatar
      [SPARK-15749][SQL] make the error message more meaningful · 62d2fa5e
      Huaxin Gao authored
      ## What changes were proposed in this pull request?
      
      For table test1 (C1 varchar (10), C2 varchar (10)), when I insert a row using
      ```
      sqlContext.sql("insert into test1 values ('abc', 'def', 1)")
      ```
      I got error message
      
      ```
      Exception in thread "main" java.lang.RuntimeException: RelationC1#0,C2#1 JDBCRelation(test1)
      requires that the query in the SELECT clause of the INSERT INTO/OVERWRITE statement
      generates the same number of columns as its schema.
      ```
      The error message is a little confusing. In my simple insert statement, it doesn't have a SELECT clause.
      
      I will change the error message to a more general one
      
      ```
      Exception in thread "main" java.lang.RuntimeException: RelationC1#0,C2#1 JDBCRelation(test1)
      requires that the data to be inserted have the same number of columns as the target table.
      ```
      
      ## How was this patch tested?
      
      I tested the patch using my simple unit test, but it's a very trivial change and I don't think I need to check in any test.
      
      Author: Huaxin Gao <huaxing@us.ibm.com>
      
      Closes #13492 from huaxingao/spark-15749.
      62d2fa5e
    • Alex Bozarth's avatar
      [SPARK-15868][WEB UI] Executors table in Executors tab should sort Executor IDs in numerical order · e849285d
      Alex Bozarth authored
      ## What changes were proposed in this pull request?
      
      Currently the Executors table sorts by id using a string sort (since that's what it is stored as). Since the id is a number (other than the driver), we should be sorting numerically. I have changed both the initial sort on page load and the table sort to sort on id numerically, treating non-numeric strings (like the driver) as "-1".
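      The actual change is in the web UI's JavaScript table sort; stated as a small Scala sketch, this is the rule it implements:
      
      ```scala
      // Numeric executor IDs sort numerically; non-numeric IDs (the driver) sort as -1.
      def executorSortKey(id: String): Int =
        scala.util.Try(id.toInt).getOrElse(-1)
      ```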
      
      ## How was this patch tested?
      
      Manually tested and dev/run-tests
      
      ![pageload](https://cloud.githubusercontent.com/assets/13952758/16027882/d32edd0a-318e-11e6-9faf-fc972b7c36ab.png)
      ![sorted](https://cloud.githubusercontent.com/assets/13952758/16027883/d34541c6-318e-11e6-9ed7-6bfc0cd4152e.png)
      
      Author: Alex Bozarth <ajbozart@us.ibm.com>
      
      Closes #13654 from ajbozarth/spark15868.
      e849285d
    • Dongjoon Hyun's avatar
      [MINOR][DOCS][SQL] Fix some comments about types(TypeCoercion,Partition) and exceptions. · 2d27eb1e
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR contains a few changes on code comments.
      - `HiveTypeCoercion` is renamed into `TypeCoercion`.
      - `NoSuchDatabaseException` is only used for the absence of database.
      - For partition type inference, only `DoubleType` is considered.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13674 from dongjoon-hyun/minor_doc_types.
      2d27eb1e
    • gatorsmile's avatar
      [SPARK-15998][SQL] Verification of SQLConf HIVE_METASTORE_PARTITION_PRUNING · 796429d7
      gatorsmile authored
      #### What changes were proposed in this pull request?
      `HIVE_METASTORE_PARTITION_PRUNING` is a public `SQLConf`. When `true`, some predicates will be pushed down into the Hive metastore so that non-matching partitions can be eliminated earlier. The current default value is `false`. For performance improvement, users might turn this parameter on.
      
      So far, the code base does not have a test case to verify whether this `SQLConf` properly works. This PR improves the test coverage to avoid future regressions.
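      A minimal sketch of turning the parameter on, assuming the documented conf key behind `HIVE_METASTORE_PARTITION_PRUNING`:
      
      ```scala
      // With this enabled, partition predicates are pushed to the Hive metastore so
      // non-matching partitions are pruned before they are listed.
      spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")
      ```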
      
      #### How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #13716 from gatorsmile/addTestMetastorePartitionPruning.
      796429d7
    • Cheng Lian's avatar
      [SQL] Minor HashAggregateExec string output fixes · 7a89f2ad
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR fixes some minor `.toString` format issues for `HashAggregateExec`.
      
      Before:
      
      ```
      *HashAggregate(key=[a#234L,b#235L], functions=[count(1),max(c#236L)], output=[a#234L,b#235L,count(c)#247L,max(c)#248L])
      ```
      
      After:
      
      ```
      *HashAggregate(keys=[a#234L, b#235L], functions=[count(1), max(c#236L)], output=[a#234L, b#235L, count(c)#247L, max(c)#248L])
      ```
      
      ## How was this patch tested?
      
      Manually tested.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #13710 from liancheng/minor-agg-string-fix.
      7a89f2ad
    • Josh Rosen's avatar
      [SPARK-15975] Fix improper Popen retcode code handling in dev/run-tests · acef843f
      Josh Rosen authored
      In the `dev/run-tests.py` script we check a `Popen.retcode` for success using `retcode > 0`, but this is subtly wrong because Popen's return code will be negative if the child process was terminated by a signal: https://docs.python.org/2/library/subprocess.html#subprocess.Popen.returncode
      
      In order to properly handle signals, we should change this to check `retcode != 0` instead.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #13692 from JoshRosen/dev-run-tests-return-code-handling.
      acef843f
    • bomeng's avatar
      [SPARK-15978][SQL] improve 'show tables' command related codes · bbad4cb4
      bomeng authored
      ## What changes were proposed in this pull request?
      
      I've found some minor issues in "show tables" command:
      
      1. In `SessionCatalog.scala`, the `listTables(db: String)` method calls `listTables(formatDatabaseName(db), "*")` to list all the tables for a certain db, but in the method `listTables(db: String, pattern: String)` this db name is formatted once more. So I think we should remove `formatDatabaseName()` from the caller (see the sketch after this list).
      
      2. I suggest adding a sort to `listTables(db: String)` in `InMemoryCatalog.scala`, just like `listDatabases()`.
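      A simplified sketch of the redundancy in item 1 (illustrative, not the actual `SessionCatalog` source):
      
      ```scala
      object CatalogSketch {
        def formatDatabaseName(db: String): String = db.toLowerCase
      
        def listTables(db: String): Seq[String] =
          listTables(formatDatabaseName(db), "*")   // db is formatted here ...
      
        def listTables(db: String, pattern: String): Seq[String] = {
          val dbName = formatDatabaseName(db)       // ... and formatted a second time here
          Seq.empty                                 // look up tables in dbName matching pattern
        }
      }
      ```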
      
      ## How was this patch tested?
      
      The existing test cases should cover it.
      
      Author: bomeng <bmeng@us.ibm.com>
      
      Closes #13695 from bomeng/SPARK-15978.
      bbad4cb4
    • Sean Owen's avatar
      [SPARK-15796][CORE] Reduce spark.memory.fraction default to avoid overrunning... · 457126e4
      Sean Owen authored
      [SPARK-15796][CORE] Reduce spark.memory.fraction default to avoid overrunning old gen in JVM default config
      
      ## What changes were proposed in this pull request?
      
      Reduce `spark.memory.fraction` default to 0.6 in order to make it fit within default JVM old generation size (2/3 heap). See JIRA discussion. This means a full cache doesn't spill into the new gen. CC andrewor14
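      For users who prefer the previous behavior, the setting can still be overridden explicitly (a minimal sketch; 0.75 was the old default):
      
      ```scala
      import org.apache.spark.SparkConf
      
      val conf = new SparkConf().set("spark.memory.fraction", "0.75")
      ```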
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #13618 from srowen/SPARK-15796.
      457126e4
    • Dongjoon Hyun's avatar
      [SPARK-15922][MLLIB] `toIndexedRowMatrix` should consider the case `cols < offset+colsPerBlock` · 36110a83
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      SPARK-15922 reports the following scenario throwing an exception due to the mismatched vector sizes. This PR handles the exceptional case, `cols < (offset + colsPerBlock)`.
      
      **Before**
      ```scala
      scala> import org.apache.spark.mllib.linalg.distributed._
      scala> import org.apache.spark.mllib.linalg._
      scala> val rows = IndexedRow(0L, new DenseVector(Array(1,2,3))) :: IndexedRow(1L, new DenseVector(Array(1,2,3))):: IndexedRow(2L, new DenseVector(Array(1,2,3))):: Nil
      scala> val rdd = sc.parallelize(rows)
      scala> val matrix = new IndexedRowMatrix(rdd, 3, 3)
      scala> val bmat = matrix.toBlockMatrix
      scala> val imat = bmat.toIndexedRowMatrix
      scala> imat.rows.collect
      ... // java.lang.IllegalArgumentException: requirement failed: Vectors must be the same length!
      ```
      
      **After**
      ```scala
      ...
      scala> imat.rows.collect
      res0: Array[org.apache.spark.mllib.linalg.distributed.IndexedRow] = Array(IndexedRow(0,[1.0,2.0,3.0]), IndexedRow(1,[1.0,2.0,3.0]), IndexedRow(2,[1.0,2.0,3.0]))
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (including the above case)
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13643 from dongjoon-hyun/SPARK-15922.
      36110a83
    • Herman van Hovell's avatar
      [SPARK-15977][SQL] Fix TRUNCATE TABLE for Spark specific datasource tables · f9bf15d9
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      `TRUNCATE TABLE` is currently broken for Spark-specific datasource tables (json, csv, ...). This PR correctly sets the location for these datasources, which allows them to be truncated.
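      Illustrative only (table name and path are made up): a Spark-specific datasource table of the kind this PR makes truncatable:
      
      ```scala
      spark.range(10).write.json("/tmp/truncate_me")
      spark.sql("CREATE TABLE truncate_me USING json OPTIONS (path '/tmp/truncate_me')")
      spark.sql("TRUNCATE TABLE truncate_me")
      spark.sql("SELECT count(*) FROM truncate_me").show()   // count is 0 after truncation
      ```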
      
      ## How was this patch tested?
      Extended the datasources `TRUNCATE TABLE` tests in `DDLSuite`.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #13697 from hvanhovell/SPARK-15977.
      f9bf15d9
    • Tathagata Das's avatar
      [SPARK-15981][SQL][STREAMING] Fixed bug and added tests in DataStreamReader Python API · 084dca77
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      - Fixed bug in the Python API of DataStreamReader. Because a single path was being converted to an array before calling the Java DataStreamReader method (which takes a string only), it gave the following error.
      ```
      File "/Users/tdas/Projects/Spark/spark/python/pyspark/sql/readwriter.py", line 947, in pyspark.sql.readwriter.DataStreamReader.json
      Failed example:
          json_sdf = spark.readStream.json(os.path.join(tempfile.mkdtemp(), 'data'),                 schema = sdf_schema)
      Exception raised:
          Traceback (most recent call last):
            File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", line 1253, in __run
              compileflags, 1) in test.globs
            File "<doctest pyspark.sql.readwriter.DataStreamReader.json[0]>", line 1, in <module>
              json_sdf = spark.readStream.json(os.path.join(tempfile.mkdtemp(), 'data'),                 schema = sdf_schema)
            File "/Users/tdas/Projects/Spark/spark/python/pyspark/sql/readwriter.py", line 963, in json
              return self._df(self._jreader.json(path))
            File "/Users/tdas/Projects/Spark/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
              answer, self.gateway_client, self.target_id, self.name)
            File "/Users/tdas/Projects/Spark/spark/python/pyspark/sql/utils.py", line 63, in deco
              return f(*a, **kw)
            File "/Users/tdas/Projects/Spark/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 316, in get_return_value
              format(target_id, ".", name, value))
          Py4JError: An error occurred while calling o121.json. Trace:
          py4j.Py4JException: Method json([class java.util.ArrayList]) does not exist
          	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
          	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
          	at py4j.Gateway.invoke(Gateway.java:272)
          	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
          	at py4j.commands.CallCommand.execute(CallCommand.java:79)
          	at py4j.GatewayConnection.run(GatewayConnection.java:211)
          	at java.lang.Thread.run(Thread.java:744)
      ```
      
      - Reduced code duplication between DataStreamReader and DataFrameWriter
      - Added missing Python doctests
      
      ## How was this patch tested?
      New tests
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #13703 from tdas/SPARK-15981.
      084dca77
    • Dongjoon Hyun's avatar
      [SPARK-15996][R] Fix R examples by removing deprecated functions · a865f6e0
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, R examples (`dataframe.R` and `data-manipulation.R`) fail like the following. We had better update them before releasing 2.0 RC. This PR updates them to use up-to-date APIs.
      
      ```bash
      $ bin/spark-submit examples/src/main/r/dataframe.R
      ...
      Warning message:
      'createDataFrame(sqlContext...)' is deprecated.
      Use 'createDataFrame(data, schema = NULL, samplingRatio = 1.0)' instead.
      See help("Deprecated")
      ...
      Warning message:
      'read.json(sqlContext...)' is deprecated.
      Use 'read.json(path)' instead.
      See help("Deprecated")
      ...
      Error: could not find function "registerTempTable"
      Execution halted
      ```
      
      ## How was this patch tested?
      
      Manual.
      ```
      curl -LO http://s3-us-west-2.amazonaws.com/sparkr-data/flights.csv
      bin/spark-submit examples/src/main/r/dataframe.R
      bin/spark-submit examples/src/main/r/data-manipulation.R flights.csv
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13714 from dongjoon-hyun/SPARK-15996.
      a865f6e0
    • Cheng Lian's avatar
      [SPARK-15983][SQL] Removes FileFormat.prepareRead · 9ea0d5e3
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      Interface method `FileFormat.prepareRead()` was added in #12088 to handle a special case in the LibSVM data source.
      
      However, the semantics of this interface method aren't intuitive: it returns a modified version of the data source options map. Considering that the LibSVM case can be easily handled using schema metadata inside `inferSchema`, we can remove this interface method to keep the `FileFormat` interface clean.
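      A hedged sketch of the alternative mentioned above: options that `prepareRead()` used to thread through can instead be recorded as metadata on the inferred schema (the field and key names are illustrative, not the actual LibSVM implementation):
      
      ```scala
      import org.apache.spark.sql.types._
      
      val featureMeta = new MetadataBuilder().putLong("numFeatures", 780).build()
      val inferred = StructType(Seq(
        StructField("label", DoubleType, nullable = false),
        StructField("features", StringType, nullable = false, metadata = featureMeta)))
      ```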
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #13698 from liancheng/remove-prepare-read.
      9ea0d5e3
    • gatorsmile's avatar
      [SPARK-15862][SQL] Better Error Message When Having Database Name in CACHE TABLE AS SELECT · 6451cf92
      gatorsmile authored
      #### What changes were proposed in this pull request?
      ~~If the temp table already exists, we should not silently replace it when doing `CACHE TABLE AS SELECT`. This is inconsistent with the behavior of `CREATE VIEW` or `CREATE TABLE`. This PR is to fix this silent drop.~~
      
      ~~Maybe, we also can introduce new syntax for replacing the existing one. For example, in Hive, to replace a view, the syntax should be like `ALTER VIEW AS SELECT` or `CREATE OR REPLACE VIEW AS SELECT`~~
      
      The table name in `CACHE TABLE AS SELECT` should NOT contain database prefix like "database.table". Thus, this PR captures this in Parser and outputs a better error message, instead of reporting the view already exists.
      
      In addition, this PR refactors the `Parser` to generate table identifiers instead of returning the table name string.
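      Illustrative only; with this PR the qualified form is rejected with a clearer parse-time error instead of a misleading "already exists" message:
      
      ```scala
      spark.sql("CACHE TABLE cached_t AS SELECT 1 AS a")        // still allowed
      spark.sql("CACHE TABLE somedb.cached_t AS SELECT 1 AS a") // now fails with a better error
      ```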
      
      #### How was this patch tested?
      - Added a test case for caching and uncaching qualified table names
      - Fixed a few test cases that do not drop temp table at the end
      - Added the related test case for the issue resolved in this PR
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #13572 from gatorsmile/cacheTableAsSelect.
      6451cf92
  3. Jun 15, 2016
    • Narine Kokhlikyan's avatar
      [SPARK-12922][SPARKR][WIP] Implement gapply() on DataFrame in SparkR · 7c6c6926
      Narine Kokhlikyan authored
      ## What changes were proposed in this pull request?
      
      gapply() applies an R function on groups grouped by one or more columns of a DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() in the Dataset API.
      
      Please let me know what you think and if you have any ideas to improve it.
      
      Thank you!
      
      ## How was this patch tested?
      Unit tests.
      1. Primitive test with different column types
      2. Add a boolean column
      3. Compute average by a group
      
      Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
      Author: NarineK <narine.kokhlikyan@us.ibm.com>
      
      Closes #12836 from NarineK/gapply2.
      7c6c6926
    • Herman van Hovell's avatar
      [SPARK-15824][SQL] Execute WITH .... INSERT ... statements immediately · b75f454f
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      We currently immediately execute `INSERT` commands when they are issued. This is not the case as soon as we use a `WITH` to define common table expressions, for example:
      ```sql
      WITH
      tbl AS (SELECT * FROM x WHERE id = 10)
      INSERT INTO y
      SELECT *
      FROM   tbl
      ```
      
      This PR fixes this problem. This PR closes https://github.com/apache/spark/pull/13561 (which fixes an instance of this problem in the ThriftServer).
      
      ## How was this patch tested?
      Added a test to `InsertSuite`
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #13678 from hvanhovell/SPARK-15824.
      b75f454f
    • Reynold Xin's avatar
      [SPARK-15851][BUILD] Fix the call of the bash script to enable proper run in Windows · 5a52ba0f
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      The way the bash script `build/spark-build-info` is called from core/pom.xml prevents Spark from building on Windows. Instead of calling the script directly we call bash and pass the script as an argument. This enables running it on Windows with bash installed, which typically comes with Git.
      
      This brings https://github.com/apache/spark/pull/13612 up-to-date and also addresses comments from the code review.
      
      Closes #13612
      
      ## How was this patch tested?
      I built manually (on a Mac) to verify it didn't break Mac compilation.
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: avulanov <nashb@yandex.ru>
      
      Closes #13691 from rxin/SPARK-15851.
      5a52ba0f
    • Wayne Song's avatar
      [SPARK-13498][SQL] Increment the recordsRead input metric for JDBC data source · ebdd7512
      Wayne Song authored
      ## What changes were proposed in this pull request?
      This patch brings https://github.com/apache/spark/pull/11373 up-to-date and increments the record count for JDBC data source.
      
      Closes #11373.
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13694 from rxin/SPARK-13498.
      ebdd7512
    • Reynold Xin's avatar
      [SPARK-15979][SQL] Rename various Parquet support classes. · 865e7cc3
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch renames various Parquet support classes from CatalystAbc to ParquetAbc. This new naming makes more sense for two reasons:
      
      1. These are not optimizer related (i.e. Catalyst) classes.
      2. We are in the Spark code base, and as a result it'd be more clear to call out these are Parquet support classes, rather than some Spark classes.
      
      ## How was this patch tested?
      Renamed test cases as well.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13696 from rxin/parquet-rename.
      865e7cc3
    • KaiXinXiaoLei's avatar
      [SPARK-12492][SQL] Add missing SQLExecution.withNewExecutionId for hiveResultString · 3e6d567a
      KaiXinXiaoLei authored
      ## What changes were proposed in this pull request?
      
      Add missing SQLExecution.withNewExecutionId for hiveResultString so that queries running in `spark-sql` will be shown in Web UI.
      
      Closes #13115
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: KaiXinXiaoLei <huleilei1@huawei.com>
      
      Closes #13689 from zsxwing/pr13115.
      3e6d567a
    • Wojciech Jurczyk's avatar
      [DOCS] Fix Gini and Entropy scaladocs in context of multiclass classification · 6e0b3d79
      Wojciech Jurczyk authored
      The PR changes outdated scaladocs for the Gini and Entropy classes. Since PR #886 Spark has supported multiclass classification, but the docs only describe binary classification.
      
      Author: Wojciech Jurczyk <wojciech.jurczyk@codilime.com>
      
      Closes #11252 from wjur/wjur/docs_multiclass.
      6e0b3d79
    • Davies Liu's avatar
      a153e41c
    • Reynold Xin's avatar
      Closing stale pull requests. · 1a33f2e0
      Reynold Xin authored
      Closes #13103
      Closes #8320
      Closes #7871
      Closes #7461
      Closes #9159
      Closes #9150
      Closes #9200
      Closes #9089
      Closes #8022
      Closes #6767
      Closes #8505
      Closes #9457
      Closes #9397
      Closes #8563
      Closes #10062
      Closes #9944
      Closes #10137
      Closes #10148
      Closes #9057
      Closes #10163
      Closes #8023
      Closes #10302
      Closes #8979
      Closes #8981
      Closes #10258
      Closes #7345
      Closes #9183
      Closes #10087
      Closes #10292
      Closes #10254
      Closes #10374
      Closes #8915
      Closes #10128
      Closes #10666
      Closes #8533
      Closes #10625
      Closes #8013
      Closes #8427
      Closes #7753
      Closes #10116
      Closes #11005
      Closes #10797
      Closes #11026
      Closes #11009
      Closes #10117
      Closes #11382
      Closes #9483
      Closes #10566
      Closes #10753
      Closes #11386
      Closes #9097
      Closes #11245
      Closes #11257
      Closes #11045
      Closes #10144
      Closes #11066
      Closes #8610
      Closes #10634
      Closes #11224
      Closes #11212
      Closes #11244
      Closes #10326
      Closes #13524
      1a33f2e0
    • Nirman Narang's avatar
      [SPARK-7848][STREAMING][UPDATE SPARKSTREAMING DOCS TO INCORPORATE IMPORTANT POINTS.] · 04d7b3d2
      Nirman Narang authored
      Updated the SparkStreaming Doc with some important points.
      
      Author: Nirman Narang <narang@us.ibm.com>
      
      Closes #11114 from nirmannarang/SPARK-7848.
      04d7b3d2
    • Imran Rashid's avatar
      [HOTFIX][CORE] fix flaky BasicSchedulerIntegrationTest · cafc696d
      Imran Rashid authored
      ## What changes were proposed in this pull request?
      
      SPARK-15927 exacerbated a race in BasicSchedulerIntegrationTest, so it went from very unlikely to fairly frequent.  The issue is that stage numbering is not completely deterministic, but these tests treated it like it was.  So turn off the tests.
      
      ## How was this patch tested?
      
      On my laptop the test failed about 10% of the time before this change, and didn't fail in 500 runs after the change.
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #13688 from squito/hotfix_basic_scheduler.
      cafc696d
    • Sean Zhong's avatar
      [SPARK-15776][SQL] Divide Expression inside Aggregation function is casted to wrong type · 9bd80ad6
      Sean Zhong authored
      ## What changes were proposed in this pull request?
      
      This PR fixes the problem that a Divide expression inside an aggregation function is cast to the wrong type, which causes `select 1/2` and `select sum(1/2)` to return different results.
      
      **Before the change:**
      
      ```
      scala> sql("select 1/2 as a").show()
      +---+
      |  a|
      +---+
      |0.5|
      +---+
      
      scala> sql("select sum(1/2) as a").show()
      +---+
      |  a|
      +---+
      |0  |
      +---+
      
      scala> sql("select sum(1 / 2) as a").schema
      res4: org.apache.spark.sql.types.StructType = StructType(StructField(a,LongType,true))
      ```
      
      **After the change:**
      
      ```
      scala> sql("select 1/2 as a").show()
      +---+
      |  a|
      +---+
      |0.5|
      +---+
      
      scala> sql("select sum(1/2) as a").show()
      +---+
      |  a|
      +---+
      |0.5|
      +---+
      
      scala> sql("select sum(1/2) as a").schema
      res4: org.apache.spark.sql.types.StructType = StructType(StructField(a,DoubleType,true))
      ```
      
      ## How was this patch tested?
      
      Unit test.
      
      This PR is based on https://github.com/apache/spark/pull/13524 by Sephiroth-Lin
      
      Author: Sean Zhong <seanzhong@databricks.com>
      
      Closes #13651 from clockfly/SPARK-15776.
      9bd80ad6
    • Egor Pakhomov's avatar
      [SPARK-15934] [SQL] Return binary mode in ThriftServer · 049e639f
      Egor Pakhomov authored
      Returning binary mode to ThriftServer for backward compatibility.
      
      Tested with Squirrel and Tableau.
      
      Author: Egor Pakhomov <egor@anchorfree.com>
      
      Closes #13667 from epahomov/SPARK-15095-2.0.
      049e639f
    • gatorsmile's avatar
      [SPARK-15901][SQL][TEST] Verification of CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET · 09925735
      gatorsmile authored
      #### What changes were proposed in this pull request?
      So far, we do not have test cases verifying whether the external parameters `HiveUtils.CONVERT_METASTORE_ORC` and `HiveUtils.CONVERT_METASTORE_PARQUET` properly work when users use non-default values. This PR adds such test cases to avoid potential regressions.
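      A minimal sketch of exercising the non-default values the new tests cover, assuming the documented conf keys behind `HiveUtils.CONVERT_METASTORE_*`:
      
      ```scala
      spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
      spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
      ```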
      
      #### How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #13622 from gatorsmile/addTestCase4parquetOrcConversion.
      09925735
    • Nezih Yigitbasi's avatar
      [SPARK-15782][YARN] Set spark.jars system property in client mode · 4df8df5c
      Nezih Yigitbasi authored
      ## What changes were proposed in this pull request?
      
      When `--packages` is specified with `spark-shell` the classes from those packages cannot be found, which I think is due to some of the changes in `SPARK-12343`. In particular `SPARK-12343` removes a line that sets the `spark.jars` system property in client mode, which is used by the repl main class to set the classpath.
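      A hedged sketch of what can be checked from the repl: in client mode the resolved `--packages` jars should show up in the `spark.jars` system property that the repl uses to build its classpath.
      
      ```scala
      // Run inside spark-shell started with --packages; prints the resolved jars.
      sys.props.get("spark.jars").foreach(jars => jars.split(",").foreach(println))
      ```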
      
      ## How was this patch tested?
      
      Tested manually.
      
      This system property is used by the repl to populate its classpath. If
      this is not set properly the classes for external packages cannot be
      found.
      
      tgravescs vanzin as you may be familiar with this part of the code.
      
      Author: Nezih Yigitbasi <nyigitbasi@netflix.com>
      
      Closes #13527 from nezihyigitbasi/repl-fix.
      4df8df5c
    • Davies Liu's avatar
      [SPARK-15888] [SQL] fix Python UDF with aggregate · 5389013a
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      After we moved the ExtractPythonUDF rule into the physical plan, Python UDFs can't work on top of an aggregate anymore, because they can't be evaluated before the aggregate; they should be evaluated after it. This PR adds another rule to extract these kinds of Python UDFs from the logical aggregate and create a Project on top of the Aggregate.
      
      ## How was this patch tested?
      
      Added regression tests. The plan of added test query looks like this:
      ```
      == Parsed Logical Plan ==
      'Project [<lambda>('k, 's) AS t#26]
      +- Aggregate [<lambda>(key#5L)], [<lambda>(key#5L) AS k#17, sum(cast(<lambda>(value#6) as bigint)) AS s#22L]
         +- LogicalRDD [key#5L, value#6]
      
      == Analyzed Logical Plan ==
      t: int
      Project [<lambda>(k#17, s#22L) AS t#26]
      +- Aggregate [<lambda>(key#5L)], [<lambda>(key#5L) AS k#17, sum(cast(<lambda>(value#6) as bigint)) AS s#22L]
         +- LogicalRDD [key#5L, value#6]
      
      == Optimized Logical Plan ==
      Project [<lambda>(agg#29, agg#30L) AS t#26]
      +- Aggregate [<lambda>(key#5L)], [<lambda>(key#5L) AS agg#29, sum(cast(<lambda>(value#6) as bigint)) AS agg#30L]
         +- LogicalRDD [key#5L, value#6]
      
      == Physical Plan ==
      *Project [pythonUDF0#37 AS t#26]
      +- BatchEvalPython [<lambda>(agg#29, agg#30L)], [agg#29, agg#30L, pythonUDF0#37]
         +- *HashAggregate(key=[<lambda>(key#5L)#31], functions=[sum(cast(<lambda>(value#6) as bigint))], output=[agg#29,agg#30L])
            +- Exchange hashpartitioning(<lambda>(key#5L)#31, 200)
               +- *HashAggregate(key=[pythonUDF0#34 AS <lambda>(key#5L)#31], functions=[partial_sum(cast(pythonUDF1#35 as bigint))], output=[<lambda>(key#5L)#31,sum#33L])
                  +- BatchEvalPython [<lambda>(key#5L), <lambda>(value#6)], [key#5L, value#6, pythonUDF0#34, pythonUDF1#35]
                     +- Scan ExistingRDD[key#5L,value#6]
      ```
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13682 from davies/fix_py_udf.
      5389013a