  1. May 02, 2017
    • Wenchen Fan's avatar
      [SPARK-20558][CORE] clear InheritableThreadLocal variables in SparkContext when stopping it · b946f316
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      To better understand this problem, let's take a look at an example first:
      ```
      object Main {
        def main(args: Array[String]): Unit = {
          var t = new Test
          new Thread(new Runnable {
            override def run() = {}
          }).start()
          println("first thread finished")
      
          t.a = null
          t = new Test
          new Thread(new Runnable {
            override def run() = {}
          }).start()
        }
      
      }
      
      class Test {
        var a = new InheritableThreadLocal[String] {
          override protected def childValue(parent: String): String = {
            println("parent value is: " + parent)
            parent
          }
        }
        a.set("hello")
      }
      ```
      The result is:
      ```
      parent value is: hello
      first thread finished
      parent value is: hello
      parent value is: hello
      ```
      
      Once an `InheritableThreadLocal` has been set, child threads will inherit its value for as long as it has not been GCed, so setting the variable that holds the `InheritableThreadLocal` to `null` does not work as expected.

      In `SparkContext`, we have an `InheritableThreadLocal` for local properties. We should clear it when stopping `SparkContext`; otherwise all future child threads will still inherit it, copy the properties, and waste memory.

      This is the root cause of https://issues.apache.org/jira/browse/SPARK-20548, which creates and stops `SparkContext` many times, eventually leaving many `InheritableThreadLocal` instances alive and causing an OOM when new threads are started in the internal thread pools.
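
      A minimal, hedged sketch of the fix idea (not the actual `SparkContext` code; names are illustrative): removing the thread-local value in `stop()` means threads created afterwards no longer inherit it.

      ```scala
      import java.util.Properties

      // Illustrative holder with an InheritableThreadLocal that is cleared on stop().
      class LocalPropertiesHolder {
        private val localProperties = new InheritableThreadLocal[Properties] {
          override protected def childValue(parent: Properties): Properties = new Properties(parent)
          override protected def initialValue(): Properties = new Properties()
        }

        def set(key: String, value: String): Unit = {
          localProperties.get().setProperty(key, value)
        }

        // Clearing the value here is what prevents future child threads of this
        // thread from inheriting (and copying) the properties after stop().
        def stop(): Unit = localProperties.remove()
      }
      ```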
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #17833 from cloud-fan/core.
      b946f316
    • Marcelo Vanzin's avatar
      [SPARK-20421][CORE] Add a missing deprecation tag. · ef3df912
      Marcelo Vanzin authored
      In the previous patch I deprecated StorageStatus, but not the
      method in SparkContext that exposes that class publicly. So deprecate
      the method too.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #17824 from vanzin/SPARK-20421.
      ef3df912
    • Felix Cheung's avatar
      [SPARK-20490][SPARKR][DOC] add family tag for not function · 13f47dc5
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      doc only
      
      ## How was this patch tested?
      
      manual
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17828 from felixcheung/rnotfamily.
      13f47dc5
    • Xiao Li's avatar
      [SPARK-19235][SQL][TEST][FOLLOW-UP] Enable Test Cases in DDLSuite with Hive Metastore · b1e639ab
      Xiao Li authored
      ### What changes were proposed in this pull request?
      This is a follow-up of enabling test cases in DDLSuite with Hive Metastore. It consists of the following remaining tasks:
      - Run all the `alter table` and `drop table` DDL tests against data source tables when using Hive metastore.
      - Do not run any `alter table` and `drop table` DDL test against Hive serde tables when using InMemoryCatalog.
      - Reenable `alter table: set serde partition` and `alter table: set serde` tests for Hive serde tables.
      
      ### How was this patch tested?
      N/A
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17524 from gatorsmile/cleanupDDLSuite.
      b1e639ab
    • Nick Pentreath's avatar
      [SPARK-20300][ML][PYSPARK] Python API for ALSModel.recommendForAllUsers,Items · e300a5a1
      Nick Pentreath authored
      Add Python API for `ALSModel` methods `recommendForAllUsers`, `recommendForAllItems`
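
      For reference, a hedged Scala sketch of the underlying `ALSModel` methods the new Python API exposes (the data and parameters below are made up for illustration):

      ```scala
      import org.apache.spark.ml.recommendation.ALS
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().master("local[2]").appName("als-recommend").getOrCreate()
      import spark.implicits._

      // Tiny made-up ratings set: (user, item, rating)
      val ratings = Seq((0, 10, 4.0f), (0, 11, 1.0f), (1, 10, 5.0f), (1, 12, 2.0f))
        .toDF("user", "item", "rating")

      val model = new ALS()
        .setUserCol("user").setItemCol("item").setRatingCol("rating")
        .setRank(4).setMaxIter(5)
        .fit(ratings)

      model.recommendForAllUsers(2).show(false)   // top-2 item recommendations per user
      model.recommendForAllItems(2).show(false)   // top-2 user recommendations per item
      ```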
      
      ## How was this patch tested?
      
      New doc tests.
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #17622 from MLnick/SPARK-20300-pyspark-recall.
      e300a5a1
    • Burak Yavuz's avatar
      [SPARK-20549] java.io.CharConversionException: Invalid UTF-32' in JsonToStructs · 86174ea8
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      A fix for the same problem was made in #17693 but ignored `JsonToStructs`. This PR uses the same fix for `JsonToStructs`.
      
      ## How was this patch tested?
      
      Regression test
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #17826 from brkyvz/SPARK-20549.
      86174ea8
    • Kazuaki Ishizaki's avatar
      [SPARK-20537][CORE] Fixing OffHeapColumnVector reallocation · afb21bf2
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      As #17773 revealed, `OnHeapColumnVector` may copy only a part of the original storage.

      `OffHeapColumnVector` reallocation likewise copies data to the new storage only up to `elementsAppended`. This variable is only updated by the `ColumnVector.appendX` APIs, while `ColumnVector.putX` is more commonly used.
      This PR makes `OffHeapColumnVector` copy data up to the previously-allocated size.
      
      ## How was this patch tested?
      
      Existing test suites
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #17811 from kiszk/SPARK-20537.
      afb21bf2
  2. May 01, 2017
    • zero323's avatar
      [SPARK-20532][SPARKR] Implement grouping and grouping_id · 90d77e97
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Adds R wrappers for:
      
      - `o.a.s.sql.functions.grouping` as `o.a.s.sql.functions.is_grouping` (to avoid shadowing `base::grouping`)
      - `o.a.s.sql.functions.grouping_id`
      
      ## How was this patch tested?
      
      Existing unit tests, additional unit tests. `check-cran.sh`.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17807 from zero323/SPARK-20532.
      90d77e97
    • Felix Cheung's avatar
      [SPARK-20192][SPARKR][DOC] SparkR migration guide to 2.2.0 · d20a976e
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Updating R Programming Guide
      
      ## How was this patch tested?
      
      manually
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17816 from felixcheung/r22relnote.
      d20a976e
    • Sameer Agarwal's avatar
      [SPARK-20548] Disable ReplSuite.newProductSeqEncoder with REPL defined class · 943a684b
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      `newProductSeqEncoder with REPL defined class` in `ReplSuite` has been failing non-deterministically over the last few days: https://spark-tests.appspot.com/failed-tests. Disabling the test until a fix is in place.
      
      https://spark.test.databricks.com/job/spark-master-test-sbt-hadoop-2.7/176/testReport/junit/org.apache.spark.repl/ReplSuite/newProductSeqEncoder_with_REPL_defined_class/history/
      
      ## How was this patch tested?
      
      N/A
      
      Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
      
      Closes #17823 from sameeragarwal/disable-test.
      943a684b
    • ptkool's avatar
      [SPARK-20463] Add support for IS [NOT] DISTINCT FROM. · 259860d2
      ptkool authored
      ## What changes were proposed in this pull request?
      
      Add support for the SQL standard distinct predicate to Spark SQL.
      
      ```
      <expression> IS [NOT] DISTINCT FROM <expression>
      ```
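
      A hedged illustration of the predicate's null-safe semantics (assumes an existing `SparkSession` named `spark`, e.g. in spark-shell):

      ```scala
      // Unlike `=` / `<>`, these comparisons never return NULL.
      spark.sql("SELECT 1 IS DISTINCT FROM NULL").show()        // true
      spark.sql("SELECT NULL IS NOT DISTINCT FROM NULL").show() // true
      spark.sql("SELECT 1 IS NOT DISTINCT FROM 1").show()       // true
      ```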
      
      ## How was this patch tested?
      
      Tested using unit tests, integration tests, manual tests.
      
      Author: ptkool <michael.styles@shopify.com>
      
      Closes #17764 from ptkool/is_not_distinct_from.
      259860d2
    • Sean Owen's avatar
      [SPARK-20459][SQL] JdbcUtils throws IllegalStateException: Cause already... · af726cd6
      Sean Owen authored
      [SPARK-20459][SQL] JdbcUtils throws IllegalStateException: Cause already initialized after getting SQLException
      
      ## What changes were proposed in this pull request?
      
      Avoid failing to initCause on JDBC exception with cause initialized to null
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17800 from srowen/SPARK-20459.
      af726cd6
    • Ryan Blue's avatar
      [SPARK-20540][CORE] Fix unstable executor requests. · 2b2dd08e
      Ryan Blue authored
      There are two problems fixed in this commit. First, the
      ExecutorAllocationManager sets a timeout to avoid requesting executors
      too often. However, the timeout is always updated by adding an interval to
      its previous value rather than to the current time. If the call is delayed
      by locking for more than the ongoing scheduler timeout, the manager will
      request more executors on every run. This seems to be the main cause of SPARK-20540.
      
      The second problem is that the total number of requested executors is
      not tracked by the CoarseGrainedSchedulerBackend. Instead, it calculates
      the value based on the current status of 3 variables: the number of
      known executors, the number of executors that have been killed, and the
      number of pending executors. But, the number of pending executors is
      never less than 0, even though there may be more known than requested.
      When executors are killed and not replaced, this can cause the request
      sent to YARN to be incorrect because there were too many executors due
      to the scheduler's state being slightly out of date. This is fixed by tracking
      the currently requested size explicitly.
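
      A hedged sketch of the "track the requested size explicitly" idea (illustrative names, not the actual CoarseGrainedSchedulerBackend code):

      ```scala
      // The requested total is a single piece of state, instead of being recomputed
      // from the known, pending, and killed executor counts.
      class ExecutorRequestTracker {
        private var requestedTotal: Int = 0

        def requestTotalExecutors(newTotal: Int): Int = synchronized {
          requestedTotal = newTotal
          requestedTotal
        }

        // Killing executors without replacement lowers the requested total directly,
        // so the number sent to the cluster manager cannot drift out of date.
        def killExecutors(count: Int, replace: Boolean): Int = synchronized {
          if (!replace) requestedTotal = math.max(0, requestedTotal - count)
          requestedTotal
        }
      }
      ```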
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #17813 from rdblue/SPARK-20540-fix-dynamic-allocation.
      2b2dd08e
    • Kunal Khamar's avatar
      [SPARK-20464][SS] Add a job group and description for streaming queries and... · 6fc6cf88
      Kunal Khamar authored
      [SPARK-20464][SS] Add a job group and description for streaming queries and fix cancellation of running jobs using the job group
      
      ## What changes were proposed in this pull request?
      
      Job group: adding a job group is required to properly cancel running jobs related to a query.
      Description: the new description makes it easier to group the batches of a query by sorting by name in the Spark Jobs UI.
      
      ## How was this patch tested?
      
      - Unit tests
      - UI screenshot
      
        - Order by job id:
      ![screen shot 2017-04-27 at 5 10 09 pm](https://cloud.githubusercontent.com/assets/7865120/25509468/15452274-2b6e-11e7-87ba-d929816688cf.png)
      
        - Order by description:
      ![screen shot 2017-04-27 at 5 10 22 pm](https://cloud.githubusercontent.com/assets/7865120/25509474/1c298512-2b6e-11e7-99b8-fef1ef7665c1.png)
      
        - Order by job id (no query name):
      ![screen shot 2017-04-27 at 5 21 33 pm](https://cloud.githubusercontent.com/assets/7865120/25509482/28c96dc8-2b6e-11e7-8df0-9d3cdbb05e36.png)
      
        - Order by description (no query name):
      ![screen shot 2017-04-27 at 5 21 44 pm](https://cloud.githubusercontent.com/assets/7865120/25509489/37674742-2b6e-11e7-9357-b5c38ec16ac4.png)
      
      Author: Kunal Khamar <kkhamar@outlook.com>
      
      Closes #17765 from kunalkhamar/sc-6696.
      6fc6cf88
    • jerryshao's avatar
      [SPARK-20517][UI] Fix broken history UI download link · ab30590f
      jerryshao authored
      The download link in the history server UI is concatenated as:
      
      ```
       <td><a href="{{uiroot}}/api/v1/applications/{{id}}/{{num}}/logs" class="btn btn-info btn-mini">Download</a></td>
      ```
      
      Here the `num` field represents the number of attempts, which does not match the REST API. In the REST API, if the attempt id does not exist the URL should be `api/v1/applications/<id>/logs`, otherwise the URL should be `api/v1/applications/<id>/<attemptId>/logs`. Using `<num>` to represent `<attemptId>` leads to the "no such app" issue.
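
      A hedged sketch of the URL rule described above (the helper name is hypothetical):

      ```scala
      // Include the attempt segment only when an attempt id actually exists,
      // instead of always appending the number of attempts.
      def logsUrl(uiroot: String, appId: String, attemptId: Option[String]): String =
        attemptId match {
          case Some(attempt) => s"$uiroot/api/v1/applications/$appId/$attempt/logs"
          case None          => s"$uiroot/api/v1/applications/$appId/logs"
        }
      ```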
      
      Manual verification.
      
      CC ajbozarth can you please review this change, since you added this feature before? Thanks!
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #17795 from jerryshao/SPARK-20517.
      ab30590f
    • Herman van Hovell's avatar
      [SPARK-20534][SQL] Make outer generate exec return empty rows · 6b44c4d6
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      Generate exec does not produce `null` values if the generator for the input row is empty and the generate operates in outer mode without join. This is caused by the fact that the `join=false` code path is different from the `join=true` code path, and that the `join=false` code path did not deal with outer properly. This PR addresses this issue.
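
      A hedged illustration of the behavior being fixed (assumes an existing `SparkSession` named `spark`, e.g. in spark-shell):

      ```scala
      import org.apache.spark.sql.functions.explode_outer
      import spark.implicits._

      val df = Seq((1, Seq("a", "b")), (2, Seq.empty[String])).toDF("id", "xs")
      // Selecting only the generator exercises the join=false path: the empty array
      // for id 2 should still yield a row containing NULL after this fix.
      df.select(explode_outer($"xs")).show()
      ```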
      
      ## How was this patch tested?
      Updated `outer*` tests in `GeneratorFunctionSuite`.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #17810 from hvanhovell/SPARK-20534.
      6b44c4d6
    • zero323's avatar
      [SPARK-20290][MINOR][PYTHON][SQL] Add PySpark wrapper for eqNullSafe · f0169a1c
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Adds Python bindings for `Column.eqNullSafe`
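
      For reference, a hedged Scala illustration of the null-safe equality being exposed to Python (assumes an existing `SparkSession` named `spark`, e.g. in spark-shell):

      ```scala
      import spark.implicits._

      val df = Seq[(Option[String], Option[String])](
        (Some("a"), Some("a")), (None, Some("a")), (None, None)
      ).toDF("x", "y")

      // Plain equality returns NULL when either side is NULL; the null-safe variant
      // (eqNullSafe / <=>) returns false or true instead.
      df.select($"x" === $"y", $"x" <=> $"y").show()
      ```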
      
      ## How was this patch tested?
      
      Manual tests, existing unit tests, doc build.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17605 from zero323/SPARK-20290.
      f0169a1c
    • Felix Cheung's avatar
      [SPARK-20541][SPARKR][SS] support awaitTermination without timeout · a355b667
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Add support for calling `awaitTermination` without the timeout param - we will need this to submit a job that runs until stopped.
      Need this for 2.2.
      
      ## How was this patch tested?
      
      manually, unit test
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17815 from felixcheung/rssawaitinfinite.
      a355b667
    • zero323's avatar
      [SPARK-20490][SPARKR] Add R wrappers for eqNullSafe and ! / not · 80e9cf1b
      zero323 authored
      ## What changes were proposed in this pull request?
      
      - Add null-safe equality operator `%<=>%` (same as `o.a.s.sql.Column.eqNullSafe`, `o.a.s.sql.Column.<=>`)
      - Add boolean negation operator `!` and function `not`.
      
      ## How was this patch tested?
      
      Existing unit tests, additional unit tests, `check-cran.sh`.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17783 from zero323/SPARK-20490.
      80e9cf1b
  3. Apr 30, 2017
    • Srinivasa Reddy Vundela's avatar
      [MINOR][DOCS][PYTHON] Adding missing boolean type for replacement value in fillna · 6613046c
      Srinivasa Reddy Vundela authored
      ## What changes were proposed in this pull request?
      
      Currently the PySpark `DataFrame.fillna` API supports the boolean type when we pass a dict, but this is missing from the documentation.
      
      ## How was this patch tested?
      ```
      >>> spark.createDataFrame([Row(a=True),Row(a=None)]).fillna({"a" : True}).show()
      +----+
      |   a|
      +----+
      |true|
      |true|
      +----+
      ```
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Srinivasa Reddy Vundela <vsr@cloudera.com>
      
      Closes #17688 from vundela/fillna_doc_fix.
      6613046c
    • zero323's avatar
      [SPARK-20535][SPARKR] R wrappers for explode_outer and posexplode_outer · ae3df4e9
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Add R wrappers for
      
      - `o.a.s.sql.functions.explode_outer`
      - `o.a.s.sql.functions.posexplode_outer`
      
      ## How was this patch tested?
      
      Additional unit tests, manual testing.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17809 from zero323/SPARK-20535.
      ae3df4e9
    • hyukjinkwon's avatar
      [SPARK-20492][SQL] Do not print empty parentheses for invalid primitive types in parser · 1ee494d0
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Currently, when the type string is invalid, the error message prints empty parentheses. This PR proposes a small improvement to the error message by removing them in the parser, as below:
      
      ```scala
      spark.range(1).select($"col".cast("aa"))
      ```
      
      **Before**
      
      ```
      org.apache.spark.sql.catalyst.parser.ParseException:
      DataType aa() is not supported.(line 1, pos 0)
      
      == SQL ==
      aa
      ^^^
      ```
      
      **After**
      
      ```
      org.apache.spark.sql.catalyst.parser.ParseException:
      DataType aa is not supported.(line 1, pos 0)
      
      == SQL ==
      aa
      ^^^
      ```
      
      ## How was this patch tested?
      
      Unit tests in `DataTypeParserSuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17784 from HyukjinKwon/SPARK-20492.
      1ee494d0
    • 郭小龙 10207633's avatar
      [SPARK-20521][DOC][CORE] The default of 'spark.worker.cleanup.appDataTtl'... · 4d99b95a
      郭小龙 10207633 authored
      [SPARK-20521][DOC][CORE] The default of 'spark.worker.cleanup.appDataTtl' should be 604800 in spark-standalone.md
      
      ## What changes were proposed in this pull request?
      
      Currently, our project needs the worker directory cleanup cycle set to three days.
      Following http://spark.apache.org/docs/latest/spark-standalone.html, I configured the 'spark.worker.cleanup.appDataTtl' parameter to 3 * 24 * 3600.
      When I start the Spark service, startup fails, and the worker log shows the following error:
      
      2017-04-28 15:02:03,306 INFO Utils: Successfully started service 'sparkWorker' on port 48728.
      Exception in thread "main" java.lang.NumberFormatException: For input string: "3 * 24 * 3600"
      	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
      	at java.lang.Long.parseLong(Long.java:430)
      	at java.lang.Long.parseLong(Long.java:483)
      	at scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276)
      	at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
      	at org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380)
      	at org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380)
      	at scala.Option.map(Option.scala:146)
      	at org.apache.spark.SparkConf.getLong(SparkConf.scala:380)
      	at org.apache.spark.deploy.worker.Worker.<init>(Worker.scala:100)
      	at org.apache.spark.deploy.worker.Worker$.startRpcEnvAndEndpoint(Worker.scala:730)
      	at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:709)
      	at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
      
      **Because 7 * 24 * 3600 is given as a string, it is force-converted to a Long, which leads to problems in the program.**

      **So I think the default value shown for this configuration should be a concrete long value, 604800, rather than 7 * 24 * 3600, because the current form misleads users into similar configurations and makes Spark fail to start.**
      
      ## How was this patch tested?
      manual tests
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: 郭小龙 10207633 <guo.xiaolong1@zte.com.cn>
      Author: guoxiaolong <guo.xiaolong1@zte.com.cn>
      Author: guoxiaolongzte <guo.xiaolong1@zte.com.cn>
      
      Closes #17798 from guoxiaolongzte/SPARK-20521.
      4d99b95a
  4. Apr 29, 2017
    • hyukjinkwon's avatar
      [SPARK-20442][PYTHON][DOCS] Fill up documentations for functions in Column API in PySpark · d228cd0b
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to fill up the documentation with examples for `bitwiseOR`, `bitwiseAND`, `bitwiseXOR`, `contains`, `asc` and `desc` in the `Column` API.
      
      Also, this PR fixes minor typos in the documentation and matches some of the contents between Scala doc and Python doc.
      
      Lastly, this PR suggests using `spark` rather than `sc` in the doctests in `Column` for the Python documentation.
      
      ## How was this patch tested?
      
      Doc tests were added and manually tested with the commands below:
      
      `./python/run-tests.py --module pyspark-sql`
      `./python/run-tests.py --module pyspark-sql --python-executable python3`
      `./dev/lint-python`
      
      Output was checked via `make html` under `./python/docs`. The snapshots will be left on the codes with comments.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17737 from HyukjinKwon/SPARK-20442.
      d228cd0b
    • hyukjinkwon's avatar
      [SPARK-20493][R] De-duplicate parse logics for DDL-like type strings in R · 70f1bcd7
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      It seems we are using `SQLUtils.getSQLDataType` for the type string in `structField`. It looks like we can replace this with `CatalystSqlParser.parseDataType`.

      They accept similar DDL-like type definitions, as below:
      
      ```scala
      scala> Seq(Tuple1(Tuple1("a"))).toDF.show()
      ```
      ```
      +---+
      | _1|
      +---+
      |[a]|
      +---+
      ```
      
      ```scala
      scala> Seq(Tuple1(Tuple1("a"))).toDF.select($"_1".cast("struct<_1:string>")).show()
      ```
      ```
      +---+
      | _1|
      +---+
      |[a]|
      +---+
      ```
      
      Such type strings look identical to the R one below:
      
      ```R
      > write.df(sql("SELECT named_struct('_1', 'a') as struct"), "/tmp/aa", "parquet")
      > collect(read.df("/tmp/aa", "parquet", structType(structField("struct", "struct<_1:string>"))))
        struct
      1      a
      ```
      
      The R one is stricter because we check the types via regular expressions on the R side ahead of time.

      The actual logics there look a bit different, but as we check it ahead on the R side, it looks like replacing it would not introduce (I think) any behaviour changes. To make sure of this, dedicated tests were added in SPARK-20105. (It looks like `structField` is the only place that calls this method.)
      
      ## How was this patch tested?
      
      Existing tests - https://github.com/apache/spark/blob/master/R/pkg/inst/tests/testthat/test_sparkSQL.R#L143-L194 should cover this.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17785 from HyukjinKwon/SPARK-20493.
      70f1bcd7
    • wangmiao1981's avatar
      [SPARK-20533][SPARKR] SparkR Wrappers Model should be private and value should be lazy · ee694cdf
      wangmiao1981 authored
      ## What changes were proposed in this pull request?
      
      MultilayerPerceptronClassifierWrapper model should be private.
      LogisticRegressionWrapper.scala rFeatures and rCoefficients should be lazy.
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: wangmiao1981 <wm624@hotmail.com>
      
      Closes #17808 from wangmiao1981/lazy.
      ee694cdf
    • Yuhao Yang's avatar
      [SPARK-19791][ML] Add doc and example for fpgrowth · add9d1bb
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      
      Add a new section for fpm.
      Add examples for FPGrowth in Scala and Java.
      
      updated: Rewrite transform to be more compact.
      
      ## How was this patch tested?
      
      local doc generation.
      
      Author: Yuhao Yang <yuhao.yang@intel.com>
      
      Closes #17130 from hhbyyh/fpmdoc.
      add9d1bb
    • wangmiao1981's avatar
      [SPARK-20477][SPARKR][DOC] Document R bisecting k-means in R programming guide · b28c3bc2
      wangmiao1981 authored
      ## What changes were proposed in this pull request?
      
      Add hyper link in the SparkR programming guide.
      
      ## How was this patch tested?
      
      Build doc and manually check the doc link.
      
      Author: wangmiao1981 <wm624@hotmail.com>
      
      Closes #17805 from wangmiao1981/doc.
      b28c3bc2
    • Tejas Patil's avatar
      [SPARK-20487][SQL] Display `serde` for `HiveTableScan` node in explained plan · 814a61a8
      Tejas Patil authored
      ## What changes were proposed in this pull request?
      
      This was a suggestion by rxin at https://github.com/apache/spark/pull/17780#issuecomment-298073408
      
      ## How was this patch tested?
      
      - modified existing unit test
      - manual testing:
      
      ```
      scala> hc.sql(" SELECT * FROM tejasp_bucketed_partitioned_1  where name = ''  ").explain(true)
      == Parsed Logical Plan ==
      'Project [*]
      +- 'Filter ('name = )
         +- 'UnresolvedRelation `tejasp_bucketed_partitioned_1`
      
      == Analyzed Logical Plan ==
      user_id: bigint, name: string, ds: string
      Project [user_id#24L, name#25, ds#26]
      +- Filter (name#25 = )
         +- SubqueryAlias tejasp_bucketed_partitioned_1
            +- CatalogRelation `default`.`tejasp_bucketed_partitioned_1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [user_id#24L, name#25], [ds#26]
      
      == Optimized Logical Plan ==
      Filter (isnotnull(name#25) && (name#25 = ))
      +- CatalogRelation `default`.`tejasp_bucketed_partitioned_1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [user_id#24L, name#25], [ds#26]
      
      == Physical Plan ==
      *Filter (isnotnull(name#25) && (name#25 = ))
      +- HiveTableScan [user_id#24L, name#25, ds#26], CatalogRelation `default`.`tejasp_bucketed_partitioned_1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [user_id#24L, name#25], [ds#26]
      ```
      
      Author: Tejas Patil <tejasp@fb.com>
      
      Closes #17806 from tejasapatil/add_serde.
      814a61a8
  5. Apr 28, 2017
    • Aaditya Ramesh's avatar
      [SPARK-19525][CORE] Add RDD checkpoint compression support · 77bcd77e
      Aaditya Ramesh authored
      ## What changes were proposed in this pull request?
      
      This PR adds RDD checkpoint compression support and adds a new config `spark.checkpoint.compress` to enable/disable it. Credit goes to aramesh117
      
      Closes #17024
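
      A hedged usage sketch of the new flag (the config name comes from the description above; paths and data are illustrative):

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}

      val conf = new SparkConf()
        .setAppName("checkpoint-compress-demo")
        .setMaster("local[2]")
        .set("spark.checkpoint.compress", "true")   // the new switch added by this PR

      val sc = new SparkContext(conf)
      sc.setCheckpointDir("/tmp/checkpoints")       // hypothetical directory
      val rdd = sc.parallelize(1 to 1000).map(_ * 2)
      rdd.checkpoint()                              // checkpoint data will now be compressed
      rdd.count()                                   // materializes the RDD and writes the checkpoint
      ```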
      
      ## How was this patch tested?
      
      The new unit test.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      Author: Aaditya Ramesh <aramesh@conviva.com>
      
      Closes #17789 from zsxwing/pr17024.
      77bcd77e
    • caoxuewen's avatar
      [SPARK-20471] Remove AggregateBenchmark testsuite warning: Two level hashmap... · ebff519c
      caoxuewen authored
      [SPARK-20471] Remove AggregateBenchmark testsuite warning: Two level hashmap is disabled but vectorized hashmap is enabled
      
      ## What changes were proposed in this pull request?

      Remove the AggregateBenchmark test-suite warning, such as:
      '14:26:33.220 WARN org.apache.spark.sql.execution.aggregate.HashAggregateExec: Two level hashmap is disabled but vectorized hashmap is enabled.'

      ## How was this patch tested?
      Unit tests: AggregateBenchmark.
      Changed the `ignore` cases to `test` cases to run them.
      
      Author: caoxuewen <cao.xuewen@zte.com.cn>
      
      Closes #17771 from heary-cao/AggregateBenchmark.
      ebff519c
    • Mark Grover's avatar
      [SPARK-20514][CORE] Upgrade Jetty to 9.3.11.v20160721 · 5d71f3db
      Mark Grover authored
      Upgrade Jetty so it can work with Hadoop 3 (the alpha 2 release, in particular).
      Without this change, because of incompatibility between Jetty versions,
      Spark fails to compile when built against Hadoop 3.
      
      ## How was this patch tested?
      Unit tests being run.
      
      Author: Mark Grover <mark@apache.org>
      
      Closes #17790 from markgrover/spark-20514.
      5d71f3db
    • Bill Chambers's avatar
      [SPARK-20496][SS] Bug in KafkaWriter Looks at Unanalyzed Plans · 733b81b8
      Bill Chambers authored
      ## What changes were proposed in this pull request?
      
      We didn't enforce analyzed plans in Spark 2.1 when writing out to Kafka.
      
      ## How was this patch tested?
      
      New unit test.
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Bill Chambers <bill@databricks.com>
      
      Closes #17804 from anabranch/SPARK-20496-2.
      733b81b8
    • hyukjinkwon's avatar
      [SPARK-20465][CORE] Throws a proper exception when any temp directory could not be got · 8c911ada
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to throw an exception with a better message, rather than an `ArrayIndexOutOfBoundsException`, when temp directories could not be created.
      
      Running the commands below:
      
      ```bash
      ./bin/spark-shell --conf spark.local.dir=/NONEXISTENT_DIR_ONE,/NONEXISTENT_DIR_TWO
      ```
      
      produces ...
      
      **Before**
      
      ```
      Exception in thread "main" java.lang.ExceptionInInitializerError
              ...
      Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
              ...
      ```
      
      **After**
      
      ```
      Exception in thread "main" java.lang.ExceptionInInitializerError
              ...
      Caused by: java.io.IOException: Failed to get a temp directory under [/NONEXISTENT_DIR_ONE,/NONEXISTENT_DIR_TWO].
              ...
      ```
      
      ## How was this patch tested?
      
      Unit tests in `LocalDirsSuite.scala`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17768 from HyukjinKwon/throws-temp-dir-exception.
      8c911ada
    • Takeshi Yamamuro's avatar
      [SPARK-14471][SQL] Aliases in SELECT could be used in GROUP BY · 59e3a564
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR adds a new rule in `Analyzer` to resolve aliases in `GROUP BY`.
      The current master throws an exception if `GROUP BY` clauses reference aliases defined in `SELECT`:
      ```
      scala> spark.sql("select a a1, a1 + 1 as b, count(1) from t group by a1")
      org.apache.spark.sql.AnalysisException: cannot resolve '`a1`' given input columns: [a]; line 1 pos 51;
      'Aggregate ['a1], [a#83L AS a1#87L, ('a1 + 1) AS b#88, count(1) AS count(1)#90L]
      +- SubqueryAlias t
         +- Project [id#80L AS a#83L]
            +- Range (0, 10, step=1, splits=Some(8))
      
        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
      ```
      
      ## How was this patch tested?
      Added tests in `SQLQuerySuite` and `SQLQueryTestSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #17191 from maropu/SPARK-14471.
      59e3a564
    • Xiao Li's avatar
      [SPARK-20476][SQL] Block users to create a table that use commas in the column names · e3c81604
      Xiao Li authored
      ### What changes were proposed in this pull request?
      ```SQL
      hive> create table t1(`a,` string);
      OK
      Time taken: 1.399 seconds
      
      hive> create table t2(`a,` string, b string);
      FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 3 elements while columns.types has 2 elements!)
      
      hive> create table t2(`a,` string, b string) stored as parquet;
      FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.IllegalArgumentException: ParquetHiveSerde initialization failed. Number of column name and column type differs. columnNames = [a, , b], columnTypes = [string, string]
      ```
      This is due to a bug in the Hive metastore.
      
      When users do not provide an alias name in the SELECT query, we call `toPrettySQL` to generate the alias name. For example, the string `get_json_object(jstring, '$.f1')` will be the alias name for the function call in the statement
      ```SQL
      SELECT key, get_json_object(jstring, '$.f1') FROM tempView
      ```
      The above is not an issue for SELECT statements. However, for CTAS, we hit the issue due to a bug in the Hive metastore: it does not accept column names containing commas and returns a confusing error message, like:
      ```
      17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements while columns.types has 1 elements!
      org.apache.hadoop.hive.serde2.SerDeException: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements while columns.types has 1 elements!
      ```
      
      Thus, this PR blocks users from creating a table in the Hive metastore when the table has a column whose name contains a comma.
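
      As a hedged illustration, explicitly aliasing such expressions in a CTAS avoids the auto-generated, comma-containing column name (assumes a `SparkSession` named `spark` with Hive support and a temp view `tempView` with a string column `jstring`):

      ```scala
      // Without an alias, the generated column name would be
      // "get_json_object(jstring, '$.f1')", which the Hive metastore rejects.
      spark.sql("""
        CREATE TABLE t_ctas AS
        SELECT key, get_json_object(jstring, '$.f1') AS f1 FROM tempView
      """)
      ```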
      
      ### How was this patch tested?
      Added a test case
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17781 from gatorsmile/blockIllegalColumnNames.
      e3c81604
    • wangmiao1981's avatar
      [SPARKR][DOC] Document LinearSVC in R programming guide · 7fe82497
      wangmiao1981 authored
      ## What changes were proposed in this pull request?
      
      add link to svmLinear in the SparkR programming document.
      
      ## How was this patch tested?
      
      Build doc manually and click the link to the document. It looks good.
      
      Author: wangmiao1981 <wm624@hotmail.com>
      
      Closes #17797 from wangmiao1981/doc.
      7fe82497
  6. Apr 27, 2017
    • Wenchen Fan's avatar
      [SPARK-12837][CORE] Do not send the name of internal accumulator to executor side · b90bf520
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      When sending accumulator updates back to the driver, the network overhead is pretty big as there are a lot of accumulators: `TaskMetrics` will send about 20 accumulators every time, and there may be a lot of `SQLMetric`s if the query plan is complicated.

      Therefore, it's critical to reduce the size of serialized accumulators. A simple way is to not send the names of internal accumulators to the executor side, as they are unnecessary. When the executor sends accumulator updates back to the driver, we can look up the accumulator name in `AccumulatorContext` easily. Note that we still need to send the names of normal accumulators, as user code run at the executor side may rely on accumulator names.
      
      In the future, we should reimplement `TaskMetrics` to not rely on accumulators and use custom serialization.
      
      Tried on the example in https://issues.apache.org/jira/browse/SPARK-12837, the size of serialized accumulator has been cut down by about 40%.
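
      A hedged sketch of the idea (illustrative only, not the actual `AccumulatorContext` code): internal accumulator updates travel back carrying only an id, and the driver restores the name from a registry.

      ```scala
      import scala.collection.concurrent.TrieMap

      object AccumNameRegistry {
        private val names = TrieMap.empty[Long, String]
        def register(id: Long, name: String): Unit = names.update(id, name)
        def nameOf(id: Long): Option[String] = names.get(id)
      }

      // Executor -> driver update: internal accumulators omit the name entirely.
      final case class AccumUpdate(id: Long, value: Long, name: Option[String] = None)

      // Driver side: user accumulators keep the name they sent; internal ones are
      // looked up by id, so nothing is lost by not serializing their names.
      def displayName(update: AccumUpdate): String =
        update.name.orElse(AccumNameRegistry.nameOf(update.id)).getOrElse(s"accum ${update.id}")
      ```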
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #17596 from cloud-fan/oom.
      b90bf520
    • Shixiong Zhu's avatar
      [SPARK-20452][SS][KAFKA] Fix a potential ConcurrentModificationException for batch Kafka DataFrame · 823baca2
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      If a batch Kafka query is cancelled but one of its tasks cannot be cancelled, rerunning the same DataFrame may cause a ConcurrentModificationException, because it may launch two tasks sharing the same group id.

      This PR always creates a new consumer when `reuseKafkaConsumer = false` to avoid the ConcurrentModificationException. It also contains other minor fixes.
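
      A hedged sketch of the fix idea (illustrative; not the actual `CachedKafkaConsumer` code): when reuse is disabled, every task gets a brand-new consumer, so a straggler task from a cancelled run can never share state (such as a group id) with a rerun.

      ```scala
      import scala.collection.mutable

      class ConsumerPool[C](newConsumer: () => C) {
        private val cache = mutable.Map.empty[String, C]

        def acquire(key: String, reuse: Boolean): C =
          if (reuse) synchronized { cache.getOrElseUpdate(key, newConsumer()) }
          else newConsumer()   // always a fresh consumer when reuseKafkaConsumer = false
      }
      ```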
      
      ## How was this patch tested?
      
      Jenkins.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #17752 from zsxwing/kafka-fix.
      823baca2
    • Shixiong Zhu's avatar
      [SPARK-20461][CORE][SS] Use UninterruptibleThread for Executor and fix the... · 01c999e7
      Shixiong Zhu authored
      [SPARK-20461][CORE][SS] Use UninterruptibleThread for Executor and fix the potential hang in CachedKafkaConsumer
      
      ## What changes were proposed in this pull request?
      
      This PR changes Executor's threads to `UninterruptibleThread` so that we can use `runUninterruptibly` in `CachedKafkaConsumer`. However, this is just a best effort to avoid hanging forever. If the user uses `CachedKafkaConsumer` in another thread (e.g., creates a new thread or Future), the potential hang may still happen.
      
      ## How was this patch tested?
      
      The new added test.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #17761 from zsxwing/int.
      01c999e7