  1. Dec 02, 2016
    • [SPARK-18690][PYTHON][SQL] Backward compatibility of unbounded frames · cf3dbec6
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Makes `Window.unboundedPreceding` and `Window.unboundedFollowing` backward compatible.
      
      ## How was this patch tested?
      
      PySpark SQL unit tests.
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16123 from zero323/SPARK-17845-follow-up.
      
      (cherry picked from commit a9cbfc4f)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
    • [SPARK-18324][ML][DOC] Update ML programming and migration guide for 2.1 release · 839d4e9c
      Yanbo Liang authored
      
      ## What changes were proposed in this pull request?
      Update ML programming and migration guide for 2.1 release.
      
      ## How was this patch tested?
      Doc change, no test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #16076 from yanboliang/spark-18324.
      
      (cherry picked from commit 2dc0d7ef)
      Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    • [SPARK-18670][SS] Limit the number of StreamingQueryListener.StreamProgressEvent when there is no data · f5376327
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      This PR adds a SQL conf `spark.sql.streaming.noDataReportInterval` to control how long to wait before outputting the next StreamProgressEvent when there is no data.
      
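      A minimal usage sketch (assuming a spark-shell session, so `spark` is in scope, and the usual Spark time-string format for the value):
      ```scala
      // Report progress for an idle stream at most once per minute.
      spark.conf.set("spark.sql.streaming.noDataReportInterval", "60s")
      ```
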
      ## How was this patch tested?
      
      The added unit test.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16108 from zsxwing/SPARK-18670.
      
      (cherry picked from commit 56a503df)
      Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
    • [SPARK-18291][SPARKR][ML] Revert "[SPARK-18291][SPARKR][ML] SparkR glm predict should output original label when family = binomial." · f915f812
      Yanbo Liang authored
      
      ## What changes were proposed in this pull request?
      It's better to fix this issue by providing an option ```type``` for users to change the ```predict``` output schema, so they could output probabilities, log-space predictions, or original labels. To avoid a breaking API change in 2.1, revert this change first and add it back after [SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) is resolved.
      
      ## How was this patch tested?
      Existing unit tests.
      
      This reverts commit daa975f4.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #16118 from yanboliang/spark-18291-revert.
      
      (cherry picked from commit a985dd8e)
      Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    • [SPARK-18677] Fix parsing ['key'] in JSON path expressions. · c69825a9
      Ryan Blue authored
      
      ## What changes were proposed in this pull request?
      
      This fixes the parser rule for matching named expressions, which didn't work for two reasons (a usage sketch follows the list):
      1. The name match is not coerced to a regular expression (missing `.r`).
      2. The surrounding literals are incorrect and attempt to escape a single quote, which is unnecessary.
      
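      A rough usage sketch of the bracket syntax (illustrative JSON and session; assumes a spark-shell where `spark` is defined):
      ```scala
      import org.apache.spark.sql.functions.get_json_object
      import spark.implicits._

      // A named expression in bracket form, including a quoted key containing a space.
      val df = Seq("""{"a key": 1, "b": 2}""").toDF("json")
      df.select(get_json_object($"json", "$['a key']")).show()
      ```
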
      ## How was this patch tested?
      
      This adds test cases for named expressions using the bracket syntax, including one with quoted spaces.
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #16107 from rdblue/SPARK-18677-fix-json-path.
      
      (cherry picked from commit 48778976)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
    • [SPARK-18674][SQL][FOLLOW-UP] improve the error message of using join · 32c85383
      gatorsmile authored
      
      ### What changes were proposed in this pull request?
      Added a test case for using joins with nested fields.
      
      ### How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16110 from gatorsmile/followup-18674.
      
      (cherry picked from commit 2f8776cc)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    • [SPARK-18659][SQL] Incorrect behaviors in overwrite table for datasource tables · e374b242
      Eric Liang authored
      
      ## What changes were proposed in this pull request?
      
      Two bugs are addressed here
      1. INSERT OVERWRITE TABLE sometimes crashed when catalog partition management was enabled. This was because, when dropping partitions after an overwrite operation, the Hive client would attempt to delete the partition files. If the entire partition directory had already been dropped, this would fail. The PR fixes this by adding a flag to control whether the Hive client should attempt to delete files.
      2. The static partition spec for OVERWRITE TABLE was not correctly resolved to the case-sensitive original partition names. This resulted in the entire table being overwritten if you did not correctly capitalize your partition names (see the sketch below).
      
      cc yhuai cloud-fan
      
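      A minimal sketch of the second case (illustrative table and names, assuming a spark-shell session):
      ```scala
      // Partition column declared as lower-case "p1".
      spark.sql("CREATE TABLE t18659 (a INT, p1 STRING) USING parquet PARTITIONED BY (p1)")
      // Static partition spec written with different capitalization; before this fix the
      // spec was not resolved to the original name and the whole table could be overwritten.
      spark.sql("INSERT OVERWRITE TABLE t18659 PARTITION (P1 = 'x') SELECT 1")
      ```
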
      ## How was this patch tested?
      
      Unit tests. Surprisingly, the existing overwrite table tests did not catch these edge cases.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #16088 from ericl/spark-18659.
      
      (cherry picked from commit 7935c847)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    • [SPARK-18419][SQL] `JDBCRelation.insert` should not remove Spark options · 415730e1
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, `JDBCRelation.insert` removes Spark options too early by mistakenly using `asConnectionProperties`. Spark options like `numPartitions` should be passed into `DataFrameWriter.jdbc` correctly. This bug has been **hidden** because `JDBCOptions.asConnectionProperties` fails to filter out mixed-case options. This PR aims to fix both.
      
      **JDBCRelation.insert**
      ```scala
      override def insert(data: DataFrame, overwrite: Boolean): Unit = {
        val url = jdbcOptions.url
        val table = jdbcOptions.table
      - val properties = jdbcOptions.asConnectionProperties
      + val properties = jdbcOptions.asProperties
        data.write
          .mode(if (overwrite) SaveMode.Overwrite else SaveMode.Append)
          .jdbc(url, table, properties)
      ```
      
      **JDBCOptions.asConnectionProperties**
      ```scala
      scala> import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions
      scala> import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
      scala> new JDBCOptions(Map("url" -> "jdbc:mysql://localhost:3306/temp", "dbtable" -> "t1", "numPartitions" -> "10")).asConnectionProperties
      res0: java.util.Properties = {numpartitions=10}
      scala> new JDBCOptions(new CaseInsensitiveMap(Map("url" -> "jdbc:mysql://localhost:3306/temp", "dbtable" -> "t1", "numPartitions" -> "10"))).asConnectionProperties
      res1: java.util.Properties = {numpartitions=10}
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins with a new testcase.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15863 from dongjoon-hyun/SPARK-18419.
      
      (cherry picked from commit 55d528f2)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    • [SPARK-18679][SQL] Fix regression in file listing performance for non-catalog tables · 65e896a6
      Eric Liang authored
      
      ## What changes were proposed in this pull request?
      
      In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed to InMemoryFileIndex). This introduced a regression where parallelism could only be introduced at the very top of the tree. However, in many cases (e.g. `spark.read.parquet(topLevelDir)`), the top of the tree is only a single directory.
      
      This PR simplifies and fixes the parallel recursive listing code to allow parallelism to be introduced at any level during recursive descent (though note that once we decide to list a sub-tree in parallel, the sub-tree is listed in serial on executors).
      
      cc mallman  cloud-fan
      
      ## How was this patch tested?
      
      Checked metrics in unit tests.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #16112 from ericl/spark-18679.
      
      (cherry picked from commit 294163ee)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    • [SPARK-17213][SQL] Disable Parquet filter push-down for string and binary columns due to PARQUET-686 · a7f8ebb8
      Cheng Lian authored
      
      This PR targets both master and branch-2.1.
      
      ## What changes were proposed in this pull request?
      
      Due to PARQUET-686, Parquet doesn't do string comparison correctly while doing filter push-down for string columns. This PR disables filter push-down for both string and binary columns to work around this issue. Binary columns are also affected because some Parquet data models (like Hive) may store string columns as a plain Parquet `binary` instead of a `binary (UTF8)`.
      
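      A rough sketch of the kind of string-filter scan affected (illustrative path and data; assumes a spark-shell session):
      ```scala
      import spark.implicits._

      // Write a small Parquet file with a string column, then filter on it.
      // With push-down disabled for strings, the comparison is evaluated by Spark itself.
      spark.range(10).selectExpr("CAST(id AS STRING) AS s")
        .write.mode("overwrite").parquet("/tmp/spark-17213-demo")
      spark.read.parquet("/tmp/spark-17213-demo").filter($"s" > "5").show()
      ```
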
      ## How was this patch tested?
      
      New test case added in `ParquetFilterSuite`.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #16106 from liancheng/spark-17213-bad-string-ppd.
      
      (cherry picked from commit ca639163)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
  2. Dec 01, 2016
    • [SPARK-18647][SQL] do not put provider in table properties for Hive serde table · 0f0903d1
      Wenchen Fan authored
      
      ## What changes were proposed in this pull request?
      
      In Spark 2.1, we make Hive serde tables case-preserving by putting the table metadata in table properties, as we did for data source tables. However, we should not put the table provider there, as it will break forward compatibility. e.g. if we create a Hive serde table with Spark 2.1, using `sql("create table test stored as parquet as select 1")`, we will fail to read it with Spark 2.0, as Spark 2.0 mistakenly treats it as a data source table because there is a `provider` entry in the table properties.
      
      Logically, a Hive serde table's provider is always hive, so we don't need to store it in table properties; this PR removes it.
      
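      The forward-compatibility scenario above, as a minimal sketch (assumes a Hive-enabled spark-shell session):
      ```scala
      // Create a Hive serde table with Spark 2.1; with this change no `provider` entry is
      // written into its table properties, so Spark 2.0 still treats it as a Hive serde table.
      spark.sql("CREATE TABLE test STORED AS PARQUET AS SELECT 1")
      spark.table("test").show()
      ```
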
      ## How was this patch tested?
      
      Manually tested the forward compatibility issue.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16080 from cloud-fan/hive.
      
      (cherry picked from commit a5f02b00)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    • [SPARK-18284][SQL] Make ExpressionEncoder.serializer.nullable precise · fce1be6c
      Kazuaki Ishizaki authored
      
      ## What changes were proposed in this pull request?
      
      This PR makes `ExpressionEncoder.serializer.nullable` `false` for a flat encoder of a primitive type. It is currently `true`, which is too conservative.
      While `ExpressionEncoder.schema` has the correct information (e.g. `<IntegerType, false>`), `serializer.head.nullable` of the `ExpressionEncoder` obtained from `encoderFor[T]` is always `true`, which is too conservative.
      
      This is accomplished by checking whether the type is a primitive type; if it is, `nullable` is set to `false`.
      
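      A minimal sketch of the behavior this targets (internal API; assumes a spark-shell session, and the printed values describe the intended outcome of this change):
      ```scala
      import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

      // Flat encoder for a primitive type: the schema already reports non-nullable,
      // and with this change the serializer's nullability matches it.
      val enc = ExpressionEncoder[Int]()
      println(enc.schema)                    // StructType(StructField(value,IntegerType,false))
      println(enc.serializer.head.nullable)  // expected to be false after this change
      ```
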
      ## How was this patch tested?
      
      Added new tests for encoder and dataframe
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #15780 from kiszk/SPARK-18284.
      
      (cherry picked from commit 38b9e696)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    • [SPARK-18538][SQL][BACKPORT-2.1] Fix Concurrent Table Fetching Using DataFrameReader JDBC APIs · b9eb1004
      gatorsmile authored
      ### What changes were proposed in this pull request?
      
      #### This PR is to backport https://github.com/apache/spark/pull/15975 to Branch 2.1
      
      ---
      
      The following two `DataFrameReader` JDBC APIs ignore the user-specified parallelism parameters.
      
      ```Scala
        def jdbc(
            url: String,
            table: String,
            columnName: String,
            lowerBound: Long,
            upperBound: Long,
            numPartitions: Int,
            connectionProperties: Properties): DataFrame
      ```
      
      ```Scala
        def jdbc(
            url: String,
            table: String,
            predicates: Array[String],
            connectionProperties: Properties): DataFrame
      ```
      
      This PR fixes these issues. To verify the behavior's correctness, we improve the plan output of the `EXPLAIN` command by adding `numPartitions` to the `JDBCRelation` node.
      
      Before the fix,
      ```
      == Physical Plan ==
      *Scan JDBCRelation(TEST.PEOPLE) [NAME#1896,THEID#1897] ReadSchema: struct<NAME:string,THEID:int>
      ```
      
      After the fix,
      ```
      == Physical Plan ==
      *Scan JDBCRelation(TEST.PEOPLE) [numPartitions=3] [NAME#1896,THEID#1897] ReadSchema: struct<NAME:string,THEID:int>
      ```
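
      A minimal sketch of the partitioned read path being verified (illustrative H2 URL and table; assumes such a table exists):
      ```scala
      // numPartitions (and the column bounds) should now be honored and shown by EXPLAIN.
      val props = new java.util.Properties()
      val people = spark.read.jdbc(
        url = "jdbc:h2:mem:testdb0",
        table = "TEST.PEOPLE",
        columnName = "THEID",
        lowerBound = 1L,
        upperBound = 4L,
        numPartitions = 3,
        connectionProperties = props)
      people.explain()
      ```
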
      ### How was this patch tested?
      Added verification logic to all the test cases for JDBC concurrent fetching.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16111 from gatorsmile/jdbcFix2.1.
    • [SPARK-18141][SQL] Fix to quote column names in the predicate clause of the JDBC RDD generated SQL statement · 2f91b015
      sureshthalamati authored
      
      ## What changes were proposed in this pull request?
      
      The SQL query generated for the JDBC data source does not quote columns in the predicate clause. When the source table has quoted column names, the Spark JDBC read incorrectly fails with a column-not-found error.
      
      Error:
      org.h2.jdbc.JdbcSQLException: Column "ID" not found;
      Source SQL statement:
      SELECT "Name","Id" FROM TEST."mixedCaseCols" WHERE (Id < 1)
      
      This PR fixes the issue by quoting column names in the generated SQL predicate clause when filters are pushed down to the data source.
      
      Source SQL statement after the fix:
      SELECT "Name","Id" FROM TEST."mixedCaseCols" WHERE ("Id" < 1)
      
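      A rough usage sketch of the read path this affects (illustrative H2 URL; assumes the mixed-case table exists and a spark-shell session):
      ```scala
      import spark.implicits._

      // A pushed-down filter on a mixed-case column; with the fix the generated predicate
      // quotes "Id", so the JDBC source can resolve the column.
      val props = new java.util.Properties()
      val df = spark.read.jdbc("jdbc:h2:mem:testdb0", "TEST.\"mixedCaseCols\"", props)
      df.filter($"Id" < 1).show()
      ```
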
      ## How was this patch tested?
      
      Added new test case to the JdbcSuite
      
      Author: sureshthalamati <suresh.thalamati@gmail.com>
      
      Closes #15662 from sureshthalamati/filter_quoted_cols-SPARK-18141.
      
      (cherry picked from commit 70c5549e)
      Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    • [SPARK-18639] Build only a single pip package · 2d2e8018
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
      We currently build 5 separate pip binary tarballs, doubling the release script runtime. It'd be better to build one, especially for use cases that just use Spark locally. In the long run, it would make more sense to have Hadoop support be pluggable.
      
      ## How was this patch tested?
      N/A - this is a release build script that doesn't have any automated test coverage. We will know if it goes wrong when we prepare releases.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #16072 from rxin/SPARK-18639.
      
      (cherry picked from commit 37e52f87)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
    • [SPARK-18617][SPARK-18560][TESTS] Fix flaky test: StreamingContextSuite. Receiver data should be deserialized properly · 4746674a
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      Avoid creating multiple threads to stop StreamingContext. Otherwise, the latch added in #16091 can be passed too early.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16105 from zsxwing/SPARK-18617-2.
      
      (cherry picked from commit 086b0c8f)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    • [SPARK-18274][ML][PYSPARK] Memory leak in PySpark JavaWrapper · 4c673c65
      Sandeep Singh authored
      
      ## What changes were proposed in this pull request?
      In `JavaWrapper`'s destructor, make the Java Gateway dereference the wrapped Java object using `SparkContext._active_spark_context._gateway.detach`.
      Also fix the parameter copying bug by moving the `copy` method from `JavaModel` to `JavaParams`.
      
      ## How was this patch tested?
      ```python
      import random, string
      from pyspark.ml.feature import StringIndexer
      
      l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) for _ in range(int(7e5))]  # 700000 random strings of 10 characters
      df = spark.createDataFrame(l, ['string'])
      
      for i in range(50):
          indexer = StringIndexer(inputCol='string', outputCol='index')
          indexer.fit(df)
      ```
      * Before: kept a strong reference to the StringIndexer, causing GC issues, and the run halted midway
      * After: garbage collection works as the object is dereferenced, and the computation completes
      * Mem footprint tested using profiler
      * Added a parameter copy related test which was failing before.
      
      Author: Sandeep Singh <sandeep@techaddict.me>
      Author: jkbradley <joseph.kurata.bradley@gmail.com>
      
      Closes #15843 from techaddict/SPARK-18274.
      
      (cherry picked from commit 78bb7f80)
      Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    • [SPARK-18674][SQL] improve the error message of using join · 6916ddc3
      Wenchen Fan authored
      
      ## What changes were proposed in this pull request?
      
      The current error message of USING join is quite confusing, for example:
      ```
      scala> val df1 = List(1,2,3).toDS.withColumnRenamed("value", "c1")
      df1: org.apache.spark.sql.DataFrame = [c1: int]
      
      scala> val df2 = List(1,2,3).toDS.withColumnRenamed("value", "c2")
      df2: org.apache.spark.sql.DataFrame = [c2: int]
      
      scala> df1.join(df2, usingColumn = "c1")
      org.apache.spark.sql.AnalysisException: using columns ['c1] can not be resolved given input columns: [c1, c2] ;;
      'Join UsingJoin(Inner,List('c1))
      :- Project [value#1 AS c1#3]
      :  +- LocalRelation [value#1]
      +- Project [value#7 AS c2#9]
         +- LocalRelation [value#7]
      ```
      
      after this PR, it becomes:
      ```
      scala> val df1 = List(1,2,3).toDS.withColumnRenamed("value", "c1")
      df1: org.apache.spark.sql.DataFrame = [c1: int]
      
      scala> val df2 = List(1,2,3).toDS.withColumnRenamed("value", "c2")
      df2: org.apache.spark.sql.DataFrame = [c2: int]
      
      scala> df1.join(df2, usingColumn = "c1")
      org.apache.spark.sql.AnalysisException: USING column `c1` can not be resolved with the right join side, the right output is: [c2];
      ```
      
      ## How was this patch tested?
      
      updated tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16100 from cloud-fan/natural.
      
      (cherry picked from commit e6534847)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
    • [SPARK-18645][DEPLOY] Fix spark-daemon.sh arguments error lead to throws Unrecognized option · cbbe2177
      Yuming Wang authored
      
      ## What changes were proposed in this pull request?
      
      spark-daemon.sh loses the single quotes around arguments after #15338, as follows:
      ```
      execute_command nice -n 0 bash /opt/cloudera/parcels/SPARK-2.1.0-cdh5.4.3.d20161129-21.04.38/lib/spark/bin/spark-submit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --name Thrift JDBC/ODBC Server --conf spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:-HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp
      ```
      With this fix, it becomes:
      ```
      execute_command nice -n 0 bash /opt/cloudera/parcels/SPARK-2.1.0-cdh5.4.3.d20161129-21.04.38/lib/spark/bin/spark-submit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --name 'Thrift JDBC/ODBC Server' --conf 'spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:-HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp'
      ```
      
      ## How was this patch tested?
      
      - Manual tests
      - Build the package and start-thriftserver.sh with `--conf 'spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:-HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp'`
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #16079 from wangyum/SPARK-18645.
      
      (cherry picked from commit 2ab8551e)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
    • [SPARK-18666][WEB UI] Remove the codes checking deprecated config spark.sql.unsafe.enabled · 8579ab5d
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      `spark.sql.unsafe.enabled` has been deprecated since 1.6. There is still code in the UI that checks it. We should remove that check and clean up the code.
      
      ## How was this patch tested?
      
      Changes to related existing unit test.
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #16095 from viirya/remove-deprecated-config-code.
      
      (cherry picked from commit dbf842b7)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
    • [SPARK-18635][SQL] Partition name/values not escaped correctly in some cases · 9dc3ef6e
      Eric Liang authored
      
      ## What changes were proposed in this pull request?
      
      Due to confusion between URIs and paths, in certain cases we escape partition values too many times, which causes some Hive client operations to fail or write data to the wrong location (a short sketch of such values follows below). This PR fixes at least some of these cases.
      
      To my understanding this is how values, filesystem paths, and URIs interact.
      - Hive stores raw (unescaped) partition values that are returned to you directly when you call listPartitions.
      - Internally, we convert these raw values to filesystem paths via `ExternalCatalogUtils.[un]escapePathName`.
      - In some circumstances we store URIs instead of filesystem paths. When a path is converted to a URI via `path.toURI`, the escaped partition values are further URI-encoded. This means that to get a path back from a URI, you must call `new Path(new URI(uriTxt))` in order to decode the URI-encoded string.
      - In `CatalogStorageFormat` we store URIs as strings. This makes it easy to forget to URI-decode the value before converting it into a path.
      - Finally, the Hive client itself uses mostly Paths for representing locations, and only URIs occasionally.
      
      In the future we should probably clean this up, perhaps by dropping use of URIs when unnecessary. We should also try fixing escaping for partition names as well as values, though names are unlikely to contain special characters.
      
      cc mallman cloud-fan yhuai
      
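      A rough sketch of the kind of partition values involved (illustrative path and data; assumes a spark-shell session):
      ```scala
      // Partition values containing characters that must be escaped in filesystem paths
      // (a space and a '/'); these are the values that could end up escaped more than once.
      spark.range(3)
        .selectExpr("id", "concat('value ', id, '/x') AS p")
        .write.partitionBy("p").mode("overwrite").parquet("/tmp/spark-18635-demo")
      ```
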
      ## How was this patch tested?
      
      Unit tests.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #16071 from ericl/spark-18635.
      
      (cherry picked from commit 88f559f2)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
  3. Nov 30, 2016
    • [SPARK-18476][SPARKR][ML] SparkR Logistic Regression should support output original label. · e8d8e350
      wm624@hotmail.com authored
      
      ## What changes were proposed in this pull request?
      
      Similar to SPARK-18401, as a classification algorithm, logistic regression should support outputting the original label instead of the index label.
      
      In this PR, original label output is supported and test cases are modified and added. Document is also modified.
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #15910 from wangmiao1981/audit.
      
      (cherry picked from commit 2eb6764f)
      Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
    • [SPARK-18617][SPARK-18560][TEST] Fix flaky test: StreamingContextSuite. Receiver data should be deserialized properly · 7d459673
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      Fixed the potential SparkContext leak in `StreamingContextSuite.SPARK-18560 Receiver data should be deserialized properly` which was added in #16052. I also removed FakeByteArrayReceiver and used TestReceiver directly.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16091 from zsxwing/SPARK-18617-follow-up.
      
      (cherry picked from commit 0a811210)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
    • [SPARK-18655][SS] Ignore Structured Streaming 2.0.2 logs in history server · 6e2e987b
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      As `queryStatus` in StreamingQueryListener events was removed in #15954, parsing 2.0.2 structured streaming logs will throw the following error:
      
      ```
      [info]   com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "queryStatus" (class org.apache.spark.sql.streaming.StreamingQueryListener$QueryTerminatedEvent), not marked as ignorable (2 known properties: "id", "exception"])
      [info]  at [Source: {"Event":"org.apache.spark.sql.streaming.StreamingQueryListener$QueryTerminatedEvent","queryStatus":{"name":"query-1","id":1,"timestamp":1480491532753,"inputRate":0.0,"processingRate":0.0,"latency":null,"sourceStatuses":[{"description":"FileStreamSource[file:/Users/zsx/stream]","offsetDesc":"#0","inputRate":0.0,"processingRate":0.0,"triggerDetails":{"latency.getOffset.source":"1","triggerId":"1"}}],"sinkStatus":{"description":"FileSink[/Users/zsx/stream2]","offsetDesc":"[#0]"},"triggerDetails":{}},"exception":null}; line: 1, column: 521] (through reference chain: org.apache.spark.sql.streaming.QueryTerminatedEvent["queryStatus"])
      [info]   at com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:51)
      [info]   at com.fasterxml.jackson.databind.DeserializationContext.reportUnknownProperty(DeserializationContext.java:839)
      [info]   at com.fasterxml.jackson.databind.deser.std.StdDeserializer.handleUnknownProperty(StdDeserializer.java:1045)
      [info]   at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownProperty(BeanDeserializerBase.java:1352)
      [info]   at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownProperties(BeanDeserializerBase.java:1306)
      [info]   at com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:453)
      [info]   at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.deserializeFromObjectUsingNonDefault(BeanDeserializerBase.java:1099)
      ...
      ```
      
      This PR just ignores such errors and adds a test to make sure we can read 2.0.2 logs.
      
      ## How was this patch tested?
      
      `query-event-logs-version-2.0.2.txt` has all types of events generated by Structured Streaming in Spark 2.0.2. `testQuietly("ReplayListenerBus should ignore broken event jsons generated in 2.0.2")` verified we can load them without any error.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16085 from zsxwing/SPARK-18655.
      
      (cherry picked from commit c4979f6e)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    • [SPARK-18546][CORE] Fix merging shuffle spills when using encryption. · c2c2fdcb
      Marcelo Vanzin authored
      
      The problem exists because it's not possible to just concatenate encrypted
      partition data from different spill files; currently each partition would
      have its own initial vector to set up encryption, and the final merged file
      should contain a single initial vector for each merged partition, otherwise
      iterating over each record becomes really hard.
      
      To fix that, UnsafeShuffleWriter now decrypts the partitions when merging,
      so that the merged file contains a single initial vector at the start of
      the partition data.
      
      Because it's not possible to do that using the fast transferTo path, when
      encryption is enabled UnsafeShuffleWriter will revert back to using file
      streams when merging. It may be possible to use a hybrid approach when
      using encryption, using an intermediate direct buffer when reading from
      files and encrypting the data, but that's better left for a separate patch.
      
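      For context, a minimal sketch of enabling I/O encryption, the mode under which the stream-based merge path is used (illustrative app setup):
      ```scala
      import org.apache.spark.SparkConf

      // With I/O encryption on, UnsafeShuffleWriter merges spill files via streams
      // (decrypting and re-encrypting) rather than the raw transferTo fast path.
      val conf = new SparkConf()
        .setAppName("encrypted-shuffle-demo")
        .set("spark.io.encryption.enabled", "true")
      ```
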
      As part of the change I made DiskBlockObjectWriter take a SerializerManager
      instead of a "wrap stream" closure, since that makes it easier to test the
      code without having to mock SerializerManager functionality.
      
      Tested with newly added unit tests (UnsafeShuffleWriterSuite for the write
      side and ExternalAppendOnlyMapSuite for integration), and by running some
      apps that failed without the fix.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #15982 from vanzin/SPARK-18546.
      
      (cherry picked from commit 93e9d880)
      Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    • [SPARK-18251][SQL] the type of Dataset can't be Option of non-flat type · 9e96ac5a
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      For an input object of non-flat type, we can't encode it to a row if it's null, as Spark SQL doesn't allow the entire row to be null; only its columns can be null. That's the reason we forbid users to use top-level null objects in https://github.com/apache/spark/pull/13469

      However, if users wrap a non-flat type with `Option`, then we may still encode a top-level null object to a row, which is not allowed.

      This PR fixes this case, and suggests users wrap their type with `Tuple1` if they do want top-level null objects.
      
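      A minimal sketch of the two shapes involved (illustrative case class; assumes a spark-shell session):
      ```scala
      import spark.implicits._

      case class Point(x: Int, y: Int)

      // Top-level Option of a non-flat type is now rejected:
      // Seq(Option(Point(1, 2)), Option.empty[Point]).toDS()

      // Suggested workaround: wrap in Tuple1, so the null lives in a column rather than the row.
      val ds = Seq(Tuple1(Option(Point(1, 2))), Tuple1(Option.empty[Point])).toDS()
      ds.show()
      ```
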
      ## How was this patch tested?
      
      new test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15979 from cloud-fan/option.
      
      (cherry picked from commit f135b70f)
      Signed-off-by: Cheng Lian <lian@databricks.com>
    • [SPARK-18318][ML] ML, Graph 2.1 QA: API: New Scala APIs, docs · f542df31
      Yanbo Liang authored
      
      ## What changes were proposed in this pull request?
      API review for 2.1, except for ```LSH```-related classes, which are still under development.
      
      ## How was this patch tested?
      Only doc changes, no new tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #16009 from yanboliang/spark-18318.
      
      (cherry picked from commit 60022bfd)
      Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    • [SPARK-18640] Add synchronization to TaskScheduler.runningTasksByExecutors · 7c0e2962
      Josh Rosen authored
      
      ## What changes were proposed in this pull request?
      
      The method `TaskSchedulerImpl.runningTasksByExecutors()` accesses the mutable `executorIdToRunningTaskIds` map without proper synchronization. In addition, as markhamstra pointed out in #15986, the signature's use of parentheses is a little odd given that this is a pure getter method.
      
      This patch fixes both issues.
      
      ## How was this patch tested?
      
      Covered by existing tests.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #16073 from JoshRosen/runningTasksByExecutors-thread-safety.
      
      (cherry picked from commit c51c7725)
      Signed-off-by: Andrew Or <andrewor14@gmail.com>
    • [SPARK][EXAMPLE] Added missing semicolon in quick-start-guide example · eae85da3
      manishAtGit authored
      ## What changes were proposed in this pull request?
      
      Added missing semicolon in quick-start-guide java example code which wasn't compiling before.
      
      ## How was this patch tested?
      Locally, by running and generating the site for the docs. You can see that the last line contains ";" in the snapshot below.
      ![image](https://cloud.githubusercontent.com/assets/10628224/20751760/9a7e0402-b723-11e6-9aa8-3b6ca2d92ebf.png)
      
      Author: manishAtGit <manish@knoldus.com>
      
      Closes #16081 from manishatGit/fixed-quick-start-guide.
      
      (cherry picked from commit bc95ea0b)
      Signed-off-by: Andrew Or <andrewor14@gmail.com>
    • [SPARK-18220][SQL] read Hive orc table with varchar column should not fail · 3de93fb4
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Spark SQL only has `StringType`, so when reading a Hive table with a varchar column, we read that column as `StringType`. However, we still need to use the varchar `ObjectInspector` to read varchar columns in Hive tables, which means we need to know the actual column type on the Hive side.
      
      In Spark 2.1, after https://github.com/apache/spark/pull/14363, we parse the Hive type string to a Catalyst type, which means the actual column type on the Hive side is erased. Then we may use the string `ObjectInspector` to read a varchar column and fail.
      
      This PR keeps the original Hive column type string in the metadata of `StructField`, and uses it when we convert it back to a Hive column.
      
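      A minimal sketch of the failing scenario (illustrative table name; assumes a Hive-enabled spark-shell session):
      ```scala
      // A Hive ORC table with a varchar column; reading it previously failed because Spark
      // used a string ObjectInspector where a varchar ObjectInspector was required.
      spark.sql("CREATE TABLE orc_varchar_tbl (c VARCHAR(10)) STORED AS ORC")
      spark.sql("INSERT INTO orc_varchar_tbl SELECT 'abc'")
      spark.table("orc_varchar_tbl").show()
      ```
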
      ## How was this patch tested?
      
      newly added regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16060 from cloud-fan/varchar.
      
      (cherry picked from commit 3f03c90a)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
    • [SPARK-17897][SQL] Fixed IsNotNull Constraint Inference Rule · 6e044ab9
      gatorsmile authored
      
      ### What changes were proposed in this pull request?
      The `constraints` of an operator are the expressions that evaluate to `true` for all the rows produced. That means the expression result should be neither `false` nor `unknown` (NULL). Thus, we can infer `IsNotNull` on all the constraints, which are generated by the operator's own predicates or propagated from its children. A constraint can be a complex expression. For better usage of these constraints, we try to push down `IsNotNull` to the lowest-level expressions (i.e., `Attribute`). `IsNotNull` can be pushed through an expression when it is null-intolerant. (When the input is NULL, a null-intolerant expression always evaluates to NULL.)
      
      Below is the existing code we have for `IsNotNull` pushdown.
      ```Scala
        private def scanNullIntolerantExpr(expr: Expression): Seq[Attribute] = expr match {
          case a: Attribute => Seq(a)
          case _: NullIntolerant | IsNotNull(_: NullIntolerant) =>
            expr.children.flatMap(scanNullIntolerantExpr)
          case _ => Seq.empty[Attribute]
        }
      ```
      
      **`IsNotNull` itself is not null-intolerant.** It converts `null` to `false`. If the expression does not include any `Not`-like expression, it works; otherwise, it could generate a wrong result. This PR fixes the above function by removing `IsNotNull` from the inference. After the fix, when a constraint has an `IsNotNull` expression, we infer new attribute-specific `IsNotNull` constraints if and only if `IsNotNull` appears at the root.
      
      Without the fix, the following test case will return empty.
      ```Scala
      val data = Seq[java.lang.Integer](1, null).toDF("key")
      data.filter("not key is not null").show()
      ```
      Before the fix, the optimized plan is like
      ```
      == Optimized Logical Plan ==
      Project [value#1 AS key#3]
      +- Filter (isnotnull(value#1) && NOT isnotnull(value#1))
         +- LocalRelation [value#1]
      ```
      
      After the fix, the optimized plan is like
      ```
      == Optimized Logical Plan ==
      Project [value#1 AS key#3]
      +- Filter NOT isnotnull(value#1)
         +- LocalRelation [value#1]
      ```
      
      ### How was this patch tested?
      Added a test
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16067 from gatorsmile/isNotNull2.
      
      (cherry picked from commit 2eb093de)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    • [SPARK-18612][MLLIB] Delete broadcasted variable in LBFGS CostFun · 05ba5eed
      Anthony Truchet authored
      ## What changes were proposed in this pull request?
      
      Fix a broadcasted variable leak occurring at each invocation of CostFun in L-BFGS.
      
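      A minimal sketch of the general pattern applied here (illustrative names, not the actual `CostFun` code): broadcast, use, then explicitly destroy.
      ```scala
      import org.apache.spark.SparkContext

      // Broadcast the weights, use them in a distributed aggregation, then destroy the
      // broadcast so it does not accumulate across repeated invocations.
      def costWithBroadcast(sc: SparkContext, weights: Array[Double]): Double = {
        val bcWeights = sc.broadcast(weights)
        try {
          sc.parallelize(1 to 100).map(i => i * bcWeights.value.sum).reduce(_ + _)
        } finally {
          bcWeights.destroy()
        }
      }
      ```
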
      ## How was this patch tested?
      
      Unit tests, plus a check that this fixed the fatal memory consumption on Criteo's use cases.
      
      This contribution is made on behalf of Criteo S.A.
      (http://labs.criteo.com/) under the terms of the Apache v2 License.
      
      Author: Anthony Truchet <a.truchet@criteo.com>
      
      Closes #16040 from AnthonyTruchet/SPARK-18612-lbfgs-cost-fun.
      
      (cherry picked from commit c5a64d76)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
    • [SPARK-18366][PYSPARK][ML] Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer · 7043c6b6
      Sandeep Singh authored
      
      ## What changes were proposed in this pull request?
      Added the new handleInvalid param for these transformers to Python to maintain API parity.
      
      ## How was this patch tested?
      Existing tests; testing is done with new doctests.
      
      Author: Sandeep Singh <sandeep@techaddict.me>
      
      Closes #15817 from techaddict/SPARK-18366.
      
      (cherry picked from commit fe854f2e)
      Signed-off-by: Nick Pentreath <nickp@za.ibm.com>
    • [SPARK-18617][CORE][STREAMING] Close "kryo auto pick" feature for Spark Streaming · 5e4afbfb
      uncleGen authored
      
      ## What changes were proposed in this pull request?
      
      #15992 provided a solution to fix the bug, i.e. **receiver data can not be deserialized properly**. As zsxwing said, it is a critical bug, but we should not break APIs between maintenance releases. It may be a reasonable first step to disable the automatic Kryo serializer pick for Spark Streaming. I will continue #15992 to optimize the solution.
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #16052 from uncleGen/SPARK-18617.
      
      (cherry picked from commit 56c82eda)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
    • [SPARK-18622][SQL] Fix the datatype of the Sum aggregate function · 8cd466e8
      Herman van Hovell authored
      
      ## What changes were proposed in this pull request?
      The result of a `sum` aggregate function is typically a Decimal, Double or Long. Currently the output dataType is based on the input's dataType.

      The `FunctionArgumentConversion` rule will make sure that the input is promoted to the largest type, and that also ensures that the output uses a (hopefully) sufficiently large output dataType. The issue is that `sum` is in a resolved state when we cast the input type; this means that rules assuming that the dataType of the expression no longer changes could have been applied in the meantime. This is what happens if we apply `WidenSetOperationTypes` before applying the casts, and this breaks analysis.

      The most straightforward and future-proof solution is to make `sum` always output the widest dataType in its class (Long for IntegralTypes, Decimal for DecimalTypes & Double for FloatType and DoubleType). This PR implements that solution.

      We should move expression-specific type casting rules into the given Expression at some point.
      
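      A rough sketch of the query shape involved, where `WidenSetOperationTypes` and `sum` interact (illustrative only; assumes a spark-shell session):
      ```scala
      // A union whose branches have different numeric types, aggregated with sum.
      // With this change, sum's result type is the widest type in its class.
      spark.sql(
        """SELECT sum(c)
          |FROM (SELECT 1 AS c
          |      UNION ALL
          |      SELECT CAST(2 AS DECIMAL(10, 2)) AS c) t
        """.stripMargin).show()
      ```
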
      ## How was this patch tested?
      Added (regression) tests to SQLQueryTestSuite's `union.sql`.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #16063 from hvanhovell/SPARK-18622.
      
      (cherry picked from commit 879ba711)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    • [SPARK-17680][SQL][TEST] Added a Testcase for Verifying Unicode Character Support for Column Names and Comments · a5ec2a7b
      gatorsmile authored
      
      ### What changes were proposed in this pull request?
      
      Spark SQL supports Unicode characters for column names when specified within backticks (`). When Hive support is enabled, the version of the Hive metastore must be higher than 0.12, since the Hive metastore only supports Unicode characters for column names from 0.13 onwards. See the JIRA: https://issues.apache.org/jira/browse/HIVE-6013
      
      In Spark SQL, table comments and view comments always allow Unicode characters without backticks.
      
      BTW, a separate PR has been submitted for database and table name validation because we do not support Unicode characters in these two cases.
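
      A minimal sketch of the kind of statement the new test covers (illustrative identifiers; assumes a spark-shell session):
      ```scala
      // A Unicode column name specified within backticks.
      spark.sql("CREATE TABLE unicode_tbl (`列名` INT) USING parquet")
      spark.sql("DESCRIBE unicode_tbl").show(truncate = false)
      ```
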
      ### How was this patch tested?
      
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15255 from gatorsmile/unicodeSupport.
      
      (cherry picked from commit a1d9138a)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    • [SPARK-18516][STRUCTURED STREAMING] Follow up PR to add StreamingQuery.status to Python · e780733b
      Tathagata Das authored
      
      ## What changes were proposed in this pull request?
      - Add StreamingQueryStatus.json
      - Make it not a case class (to avoid unnecessarily exposing the implicit object StreamingQueryStatus, consistent with StreamingQueryProgress)
      - Add StreamingQuery.status to Python
      - Fix post-termination status
      
      ## How was this patch tested?
      New unit tests
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #16075 from tdas/SPARK-18516-1.
      
      (cherry picked from commit bc09a2b8)
      Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
  4. Nov 29, 2016