  1. Mar 22, 2017
    • Prashant Sharma's avatar
      [SPARK-20027][DOCS] Compilation fix in java docs. · 0caade63
      Prashant Sharma authored
      ## What changes were proposed in this pull request?
      
      During `build/sbt publish-local`, the build breaks due to javadoc errors. This patch fixes those errors.
      
      ## How was this patch tested?
      
      Tested by running the sbt build.
      
      Author: Prashant Sharma <prashsh1@in.ibm.com>
      
      Closes #17358 from ScrapCodes/docs-fix.
      0caade63
    • uncleGen's avatar
      [SPARK-20021][PYSPARK] Miss backslash in python code · facfd608
      uncleGen authored
      ## What changes were proposed in this pull request?
      
      Add a backslash for line continuation in the Python code.
      
      ## How was this patch tested?
      
      Jenkins.
      
      Author: uncleGen <hustyugm@gmail.com>
      Author: dylon <hustyugm@gmail.com>
      
      Closes #17352 from uncleGen/python-example-doc.
      facfd608
    • Xiao Li's avatar
      [SPARK-20023][SQL] Output table comment for DESC FORMATTED · 7343a094
      Xiao Li authored
      ### What changes were proposed in this pull request?
      Currently, `DESC FORMATTED` does not output the table comment, unlike `DESC EXTENDED`. This PR fixes that.
      
      It also corrects the following display names in `DESC FORMATTED` to be consistent with `DESC EXTENDED` (a small sketch follows the list):
      - `"Create Time:"` -> `"Created:"`
      - `"Last Access Time:"` -> `"Last Access:"`
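      
      A quick way to see the effect in the spark-shell (a sketch; the exact output layout may differ):
      
      ```scala
      scala> sql("CREATE TABLE t(a INT) USING parquet COMMENT 'my table'")
      scala> sql("DESC FORMATTED t").show(100, truncate = false)
      // After this change the detailed output includes the table comment ('my table'),
      // and the corresponding rows are labeled "Created:" and "Last Access:".
      ```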
      
      ### How was this patch tested?
      Added test cases in `describe.sql`
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17381 from gatorsmile/descFormattedTableComment.
      7343a094
  2. Mar 21, 2017
    • Yanbo Liang's avatar
      [SPARK-19925][SPARKR] Fix SparkR spark.getSparkFiles fails when it was called on executors. · 478fbc86
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      SparkR's `spark.getSparkFiles` fails when called on executors; see details at [SPARK-19925](https://issues.apache.org/jira/browse/SPARK-19925).
      
      ## How was this patch tested?
      Added unit tests, and verified this fix on standalone and YARN clusters.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #17274 from yanboliang/spark-19925.
      478fbc86
    • Tathagata Das's avatar
      [SPARK-20030][SS] Event-time-based timeout for MapGroupsWithState · c1e87e38
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Adds an event-time-based timeout. The user sets the timeout timestamp directly using `KeyedState.setTimeoutTimestamp`; a key times out when the watermark crosses its timeout timestamp.
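      
      A rough sketch of a state function that sets an event-time timeout, using the API names from this description (`KeyedState.setTimeoutTimestamp`); the value type and field layout are illustrative assumptions:
      
      ```scala
      // Counts events per key; the key's state expires one hour (in event time) after the
      // latest event seen, i.e. once the watermark crosses that timeout timestamp.
      def stateFunc(key: String, values: Iterator[(String, Long)], state: KeyedState[Long]): Long = {
        val eventTimesMs = values.map(_._2).toSeq
        val count = state.getOption.getOrElse(0L) + eventTimesMs.size
        state.update(count)
        if (eventTimesMs.nonEmpty) {
          state.setTimeoutTimestamp(eventTimesMs.max + 3600 * 1000L)
        }
        count
      }
      ```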
      
      ## How was this patch tested?
      Unit tests
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #17361 from tdas/SPARK-20030.
      c1e87e38
    • Kunal Khamar's avatar
      [SPARK-20051][SS] Fix StreamSuite flaky test - recover from v2.1 checkpoint · 2d73fcce
      Kunal Khamar authored
      ## What changes were proposed in this pull request?
      
      There is a race condition between calling stop on a streaming query and deleting directories in `withTempDir` that causes the test to fail. The fix is to delete the directories lazily via a JVM shutdown hook.
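      
      A minimal sketch of the "delete on JVM shutdown" idea (not necessarily the exact helper used in the patch):
      
      ```scala
      import java.io.File
      import org.apache.commons.io.FileUtils
      
      // Instead of deleting the temp dir eagerly (racing with the query's stop()),
      // register a JVM shutdown hook that removes it after everything has shut down.
      def deleteOnShutdown(dir: File): Unit = {
        Runtime.getRuntime.addShutdownHook(new Thread() {
          override def run(): Unit = FileUtils.deleteQuietly(dir)
        })
      }
      ```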
      
      ## How was this patch tested?
      
      - Unit test
        - repeated 300 runs with no failure
      
      Author: Kunal Khamar <kkhamar@outlook.com>
      
      Closes #17382 from kunalkhamar/partition-bugfix.
      2d73fcce
    • hyukjinkwon's avatar
      [SPARK-19919][SQL] Defer throwing the exception for empty paths in CSV datasource into `DataSource` · 9281a3d5
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to defer throwing the exception within `DataSource`.
      
      Currently, when other datasources fail to infer the schema, they return `None`, and this is then validated in `DataSource` as below:
      
      ```
      scala> spark.read.json("emptydir")
      org.apache.spark.sql.AnalysisException: Unable to infer schema for JSON. It must be specified manually.;
      ```
      
      ```
      scala> spark.read.orc("emptydir")
      org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must be specified manually.;
      ```
      
      ```
      scala> spark.read.parquet("emptydir")
      org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
      ```
      
      However, CSV checks this within the datasource implementation and throws a different exception message, as below:
      
      ```
      scala> spark.read.csv("emptydir")
      java.lang.IllegalArgumentException: requirement failed: Cannot infer schema from an empty set of files
      ```
      
      We can remove this duplicated check and validate it in one place, in the same way and with the same message.
      
      ## How was this patch tested?
      
      Unit test in `CSVSuite` and manual test.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17256 from HyukjinKwon/SPARK-19919.
      9281a3d5
    • Will Manning's avatar
      clarify array_contains function description · a04dcde8
      Will Manning authored
      ## What changes were proposed in this pull request?
      
      The description in the comment for array_contains is vague/incomplete (i.e., doesn't mention that it returns `null` if the array is `null`); this PR fixes that.
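      
      A small illustration of the clarified behavior in the spark-shell (the exact column header may differ):
      
      ```scala
      import org.apache.spark.sql.functions._
      
      val df = Seq(Some(Seq(1, 2)), None).toDF("arr")
      df.select(array_contains($"arr", 1)).show()
      // returns true for Seq(1, 2), and null for the null array
      ```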
      
      ## How was this patch tested?
      
      No testing, since it merely changes a comment.
      
      Author: Will Manning <lwwmanning@gmail.com>
      
      Closes #17380 from lwwmanning/patch-1.
      a04dcde8
    • Felix Cheung's avatar
      [SPARK-19237][SPARKR][CORE] On Windows spark-submit should handle when java is not installed · a8877bdb
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      When SparkR is installed as an R package, there might not be any Java runtime available.
      If Java is not there, SparkR's `sparkR.session()` will block waiting for the connection timeout, hanging the R IDE/shell without any notification or message.
      
      ## How was this patch tested?
      
      manually
      
      - [x] need to test on Windows
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16596 from felixcheung/rcheckjava.
      a8877bdb
    • zhaorongsheng's avatar
      [SPARK-20017][SQL] change the nullability of function 'StringToMap' from 'false' to 'true' · 7dbc162f
      zhaorongsheng authored
      ## What changes were proposed in this pull request?
      
      Change the nullability of function `StringToMap` from `false` to `true`.
      
      Author: zhaorongsheng <334362872@qq.com>
      
      Closes #17350 from zhaorongsheng/bug-fix_strToMap_NPE.
      7dbc162f
    • Joseph K. Bradley's avatar
      [SPARK-20039][ML] rename ChiSquare to ChiSquareTest · ae4b91d1
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Since ChiSquare is in the stat package, it is unclear whether it refers to the hypothesis test, the distribution, or something else. This PR renames it to ChiSquareTest to remove the ambiguity.
      
      ## How was this patch tested?
      
      Existing unit tests
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #17368 from jkbradley/SPARK-20039.
      ae4b91d1
    • Xin Wu's avatar
      [SPARK-19261][SQL] Alter add columns for Hive serde and some datasource tables · 4c0ff5f5
      Xin Wu authored
      ## What changes were proposed in this pull request?
      Support `ALTER TABLE ADD COLUMNS (...)` syntax for Hive serde and some datasource tables (a usage sketch follows the list below).
      In this PR, we consider a few aspects:
      
      1. View is not supported for `ALTER ADD COLUMNS`
      
      2. Since tables created in SparkSQL with Hive DDL syntax populate table properties with schema information, we need to make sure the schema is consistent before and after the ALTER operation for future use.
      
      3. For embedded-schema formats such as `parquet`, we need to make sure that predicates on the newly-added columns can be evaluated or pushed down properly. If a data file does not contain the newly-added columns, such predicates should behave as if the column values were NULL.
      
      4. For datasource tables, this feature does not support the following:
      4.1 TEXT format, since only a single default column `value` is inferred for text format data.
      4.2 ORC format, since SparkSQL's native ORC reader does not support the difference between a user-specified schema and the schema inferred from ORC files.
      4.3 Third-party datasource types that implement RelationProvider, including the built-in JDBC format, since different vendor implementations may deal with schema differently.
      4.4 Other datasource types, such as `parquet`, `json`, `csv`, and `hive`, are supported.
      
      5. Newly added column names cannot duplicate any existing data column or partition column names. Case sensitivity is handled according to the SQL configuration.
      
      6. This feature also supports the in-memory catalog when Hive support is turned off.
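      
      A usage sketch, assuming a parquet-backed table and shown as spark-shell commands:
      
      ```scala
      scala> sql("CREATE TABLE t1 (a INT, b STRING) USING parquet")
      scala> sql("ALTER TABLE t1 ADD COLUMNS (c DOUBLE, d STRING)")
      scala> sql("DESC t1").show()
      // The new columns c and d appear in the schema; rows written before the
      // ALTER return NULL for them.
      ```
      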
      ## How was this patch tested?
      Add new test cases
      
      Author: Xin Wu <xinwu@us.ibm.com>
      
      Closes #16626 from xwu0226/alter_add_columns.
      4c0ff5f5
    • Zheng RuiFeng's avatar
      [SPARK-20041][DOC] Update docs for NaN handling in approxQuantile · 63f077fb
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Update docs for NaN handling in approxQuantile.
      
      ## How was this patch tested?
      existing tests.
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #17369 from zhengruifeng/doc_quantiles_nan.
      63f077fb
    • wangzhenhua's avatar
      [SPARK-17080][SQL][FOLLOWUP] Improve documentation, change buildJoin method... · 14865d7f
      wangzhenhua authored
      [SPARK-17080][SQL][FOLLOWUP] Improve documentation, change buildJoin method structure and add a debug log
      
      ## What changes were proposed in this pull request?
      
      1. Improve documentation for class `Cost` and `JoinReorderDP` and method `buildJoin()`.
      2. Change code structure of `buildJoin()` to make the logic clearer.
      3. Add a debug-level log to record information for join reordering, including time cost, the number of items, and the number of plans in the memo.
      
      ## How was this patch tested?
      
      Not related.
      
      Author: wangzhenhua <wangzhenhua@huawei.com>
      
      Closes #17353 from wzhfy/reorderFollow.
      14865d7f
    • jianran.tfh's avatar
      [SPARK-19998][BLOCK MANAGER] Change the exception log to add RDD id of the related block · 650d03cf
      jianran.tfh authored
      ## What changes were proposed in this pull request?
      
      The message "java.lang.Exception: Could not compute split, block $blockId not found" does not include the RDD id, and the log "BlockManager: Removing RDD $id" includes only the RDD id, so it is hard to tell that the exception was caused by the removal. It is better to add the RDD id to the "block not found" exception.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: jianran.tfh <jianran.tfh@taobao.com>
      Author: jianran <tanfanhua1984@163.com>
      
      Closes #17334 from jianran/SPARK-19998.
      650d03cf
    • christopher snow's avatar
      [SPARK-20011][ML][DOCS] Clarify documentation for ALS 'rank' parameter · 7620aed8
      christopher snow authored
      ## What changes were proposed in this pull request?
      
      API documentation and collaborative filtering documentation page changes to clarify inconsistent description of ALS rank parameter.
      
       - [DOCS] was previously: "rank is the number of latent factors in the model."
       - [API] was previously:  "rank - number of features to use"
      
      This change describes rank in both places consistently as:
      
       - "Number of features to use (also referred to as the number of latent factors)"
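      
      For reference, a minimal sketch of setting the parameter on the ML API (column names are illustrative):
      
      ```scala
      import org.apache.spark.ml.recommendation.ALS
      
      val als = new ALS()
        .setRank(10)            // number of features, i.e. number of latent factors
        .setUserCol("userId")
        .setItemCol("movieId")
        .setRatingCol("rating")
      ```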
      
      Author: Chris Snow <chris.snowuk.ibm.com>
      
      Author: christopher snow <chsnow123@gmail.com>
      
      Closes #17345 from snowch/SPARK-20011.
      7620aed8
    • Xiao Li's avatar
      [SPARK-20024][SQL][TEST-MAVEN] SessionCatalog reset need to set the current... · d2dcd679
      Xiao Li authored
      [SPARK-20024][SQL][TEST-MAVEN] SessionCatalog reset need to set the current database of ExternalCatalog
      
      ### What changes were proposed in this pull request?
      The SessionCatalog API setCurrentDatabase does not set the current database of the underlying ExternalCatalog. Thus, unexpected errors can show up in the test suites after we call reset. We need to fix it.
      
      So far, we have not found a direct impact on the other code paths, because all the SessionCatalog APIs are expected to use the current database value it manages, unless some code path skips it. Thus, we fix it in the test-only function reset().
      
      ### How was this patch tested?
      Multiple test case failures were observed in the mvn build; a test case is added in SessionCatalogSuite.
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17354 from gatorsmile/useDB.
      d2dcd679
  3. Mar 20, 2017
    • Wenchen Fan's avatar
      [SPARK-19949][SQL] unify bad record handling in CSV and JSON · 68d65fae
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Currently JSON and CSV have exactly the same logic for handling bad records; this PR abstracts it and puts it at an upper level to reduce code duplication.
      
      The overall idea is that the JSON and CSV parsers throw a BadRecordException, and the upper level, FailureSafeParser, handles bad records according to the parse mode.
      
      Behavior changes:
      1. With PERMISSIVE mode, if the number of tokens doesn't match the schema, the CSV parser previously treated the row as a legal record and parsed as many tokens as possible. After this PR, we treat it as an illegal record and put the raw record string in a special column, while still parsing as many tokens as possible (see the sketch after this list).
      2. All such logging is removed, as it is not very useful in practice.
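      
      A hedged sketch of the PERMISSIVE-mode behavior described in (1); `mode` and `columnNameOfCorruptRecord` are standard `DataFrameReader` options, and the file path is illustrative:
      
      ```scala
      import org.apache.spark.sql.types._
      
      val schema = new StructType()
        .add("a", IntegerType)
        .add("b", IntegerType)
        .add("_corrupt_record", StringType)
      
      spark.read
        .schema(schema)
        .option("mode", "PERMISSIVE")
        .option("columnNameOfCorruptRecord", "_corrupt_record")
        .csv("/tmp/data.csv")   // e.g. contains a line with too few tokens
        .show()
      // The malformed line is still parsed as far as possible, and its raw
      // text is placed in the _corrupt_record column.
      ```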
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      Author: hyukjinkwon <gurwls223@gmail.com>
      Author: Wenchen Fan <cloud0fan@gmail.com>
      
      Closes #17315 from cloud-fan/bad-record2.
      68d65fae
    • Dongjoon Hyun's avatar
      [SPARK-19912][SQL] String literals should be escaped for Hive metastore partition pruning · 21e366ae
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      The current `HiveShim`'s `convertFilters` does not escape string literals, so the following correctness issues exist. This PR aims to return the correct result and to show a clearer exception message.
      
      **BEFORE**
      
      ```scala
      scala> Seq((1, "p1", "q1"), (2, "p1\" and q=\"q1", "q2")).toDF("a", "p", "q").write.partitionBy("p", "q").saveAsTable("t1")
      
      scala> spark.table("t1").filter($"p" === "p1\" and q=\"q1").select($"a").show
      +---+
      |  a|
      +---+
      +---+
      
      scala> spark.table("t1").filter($"p" === "'\"").select($"a").show
      java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from ...
      ```
      
      **AFTER**
      
      ```scala
      scala> spark.table("t1").filter($"p" === "p1\" and q=\"q1").select($"a").show
      +---+
      |  a|
      +---+
      |  2|
      +---+
      
      scala> spark.table("t1").filter($"p" === "'\"").select($"a").show
      java.lang.UnsupportedOperationException: Partition filter cannot have both `"` and `'` characters
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins test with new test cases.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #17266 from dongjoon-hyun/SPARK-19912.
      21e366ae
    • Michael Allman's avatar
      [SPARK-17204][CORE] Fix replicated off heap storage · 7fa116f8
      Michael Allman authored
      (Jira: https://issues.apache.org/jira/browse/SPARK-17204)
      
      ## What changes were proposed in this pull request?
      
      There are a couple of bugs in the `BlockManager` with respect to support for replicated off-heap storage. First, the locally-stored off-heap byte buffer is disposed of when it is replicated. It should not be. Second, the replica byte buffers are stored as heap byte buffers instead of direct byte buffers even when the storage level memory mode is off-heap. This PR addresses both of these problems.
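      
      For context, a sketch of requesting replicated off-heap storage for an existing `rdd` (this assumes off-heap memory is enabled via `spark.memory.offHeap.enabled` and `spark.memory.offHeap.size`):
      
      ```scala
      import org.apache.spark.storage.StorageLevel
      
      // Off-heap, serialized, replicated to two executors -- the case fixed by this patch.
      val offHeap2x = StorageLevel(
        useDisk = false, useMemory = true, useOffHeap = true,
        deserialized = false, replication = 2)
      
      rdd.persist(offHeap2x)
      ```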
      
      ## How was this patch tested?
      
      `BlockManagerReplicationSuite` was enhanced to fill in the coverage gaps. It now fails if either of the bugs in this PR exist.
      
      Author: Michael Allman <michael@videoamp.com>
      
      Closes #16499 from mallman/spark-17204-replicated_off_heap_storage.
      7fa116f8
    • Takeshi Yamamuro's avatar
      [SPARK-19980][SQL] Add NULL checks in Bean serializer · 0ec1db54
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      The Bean serializer in `ExpressionEncoder` could change values when a Bean contains NULLs. A concrete example is as follows:
      ```
      scala> :paste
      class Outer extends Serializable {
        private var cls: Inner = _
        def setCls(c: Inner): Unit = cls = c
        def getCls(): Inner = cls
      }
      
      class Inner extends Serializable {
        private var str: String = _
        def setStr(s: String): Unit = str = s
        def getStr(): String = str
      }
      
      scala> Seq("""{"cls":null}""", """{"cls": {"str":null}}""").toDF().write.text("data")
      scala> val encoder = Encoders.bean(classOf[Outer])
      scala> val schema = encoder.schema
      scala> val df = spark.read.schema(schema).json("data").as[Outer](encoder)
      scala> df.show
      +------+
      |   cls|
      +------+
      |[null]|
      |  null|
      +------+
      
      scala> df.map(x => x)(encoder).show()
      +------+
      |   cls|
      +------+
      |[null]|
      |[null]|     // <-- Value changed
      +------+
      ```
      
      This is because the Bean serializer does not have the NULL-check expressions that the serializer of Scala's product types has. Actually, this value change does not happen in Scala's product types;
      
      ```
      scala> :paste
      case class Outer(cls: Inner)
      case class Inner(str: String)
      
      scala> val encoder = Encoders.product[Outer]
      scala> val schema = encoder.schema
      scala> val df = spark.read.schema(schema).json("data").as[Outer](encoder)
      scala> df.show
      +------+
      |   cls|
      +------+
      |[null]|
      |  null|
      +------+
      
      scala> df.map(x => x)(encoder).show()
      +------+
      |   cls|
      +------+
      |[null]|
      |  null|
      +------+
      ```
      
      This PR adds the NULL-check expressions to the Bean serializer, in line with the serializer for Scala's product types.
      
      ## How was this patch tested?
      Added tests in `JavaDatasetSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #17347 from maropu/SPARK-19980.
      0ec1db54
    • wangzhenhua's avatar
      [SPARK-20010][SQL] Sort information is lost after sort merge join · e9c91bad
      wangzhenhua authored
      ## What changes were proposed in this pull request?
      
      After a sort merge join for an inner join, we currently keep only the left key's ordering. However, after an inner join the right key has the same values and order as the left key, so if we need another sort merge join on the right key, we unnecessarily add a sort, which incurs additional cost.
      
      As a more complicated example, consider A join B on A.key = B.key join C on B.key = C.key join D on A.key = D.key. We unnecessarily add a sort on B.key when joining {A, B} with C, and a sort on A.key when joining {A, B, C} with D.
      
      To fix this, we need to propagate all sorted information (equivalent expressions) from bottom up through `outputOrdering` and `SortOrder`.
      
      ## How was this patch tested?
      
      Test cases are added.
      
      Author: wangzhenhua <wangzhenhua@huawei.com>
      
      Closes #17339 from wzhfy/sortEnhance.
      e9c91bad
    • Zheng RuiFeng's avatar
      [SPARK-19573][SQL] Make NaN/null handling consistent in approxQuantile · 10691d36
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Update `StatFunctions.multipleApproxQuantiles` to handle NaN/null values.
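      
      A minimal sketch of the affected public API (`df.stat.approxQuantile`); the exact NaN/null semantics are those introduced by this patch:
      
      ```scala
      val df = Seq(1.0, 2.0, 3.0, Double.NaN).toDF("x")
      // Approximate median with 1% relative error; NaN/null values are now handled consistently.
      val median = df.stat.approxQuantile("x", Array(0.5), 0.01)
      ```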
      
      ## How was this patch tested?
      existing tests and added tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #16971 from zhengruifeng/quantiles_nan.
      10691d36
    • Tyson Condie's avatar
      [SPARK-19906][SS][DOCS] Documentation describing how to write queries to Kafka · c2d1761a
      Tyson Condie authored
      ## What changes were proposed in this pull request?
      
      Add documentation that describes how to write streaming and batch queries to Kafka.
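      
      For reference, a sketch of the kind of streaming query the new documentation covers; `df` is an existing streaming DataFrame with key/value columns, and the broker address, topic, and checkpoint path are placeholders:
      
      ```scala
      df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
        .writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host1:9092")
        .option("topic", "topic1")
        .option("checkpointLocation", "/tmp/checkpoints")
        .start()
      // Batch queries use df.write.format("kafka")...save() analogously.
      ```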
      
      zsxwing tdas
      
      Author: Tyson Condie <tcondie@gmail.com>
      
      Closes #17246 from tcondie/kafka-write-docs.
      c2d1761a
    • zero323's avatar
      [SPARK-19899][ML] Replace featuresCol with itemsCol in ml.fpm.FPGrowth · bec6b16c
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Replaces `featuresCol` `Param` with `itemsCol`. See [SPARK-19899](https://issues.apache.org/jira/browse/SPARK-19899).
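      
      A small spark-shell sketch of the renamed parameter in use (the dataset and column name are illustrative):
      
      ```scala
      import org.apache.spark.ml.fpm.FPGrowth
      
      val transactions = Seq(
        Array("bread", "milk"),
        Array("bread", "butter"),
        Array("milk", "butter", "bread")
      ).toDF("items")
      
      val model = new FPGrowth()
        .setItemsCol("items")   // previously featuresCol
        .setMinSupport(0.5)
        .fit(transactions)
      
      model.freqItemsets.show()
      ```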
      
      ## How was this patch tested?
      
      Manual tests. Existing unit tests.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17321 from zero323/SPARK-19899.
      bec6b16c
    • Dongjoon Hyun's avatar
      [SPARK-19970][SQL] Table owner should be USER instead of PRINCIPAL in kerberized clusters · fc755459
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      In a kerberized Hadoop cluster, when Spark creates tables, the owners of the tables are filled with PRINCIPAL strings instead of USER names. This is inconsistent with Hive and causes problems when using [ROLE](https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+Based+Hive+Authorization) in Hive. We had better fix this.
      
      **BEFORE**
      ```scala
      scala> sql("create table t(a int)").show
      scala> sql("desc formatted t").show(false)
      ...
      |Owner:                      |sparkEXAMPLE.COM                                         |       |
      ```
      
      **AFTER**
      ```scala
      scala> sql("create table t(a int)").show
      scala> sql("desc formatted t").show(false)
      ...
      |Owner:                      |spark                                         |       |
      ```
      
      ## How was this patch tested?
      
      Manually do `create table` and `desc formatted` because this happens in Kerberized clusters.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #17311 from dongjoon-hyun/SPARK-19970.
      fc755459
    • windpiger's avatar
      [SPARK-19990][SQL][TEST-MAVEN] create a temp file for file in test.jar's... · 7ce30e00
      windpiger authored
      [SPARK-19990][SQL][TEST-MAVEN] create a temp file for a file in test.jar's resource when running mvn test across different modules
      
      ## What changes were proposed in this pull request?
      
      After merging `HiveDDLSuite` and `DDLSuite` in [SPARK-19235](https://issues.apache.org/jira/browse/SPARK-19235), we have two subclasses of `DDLSuite`: `HiveCatalogedDDLSuite` and `InMemoryCatalogDDLSuite`.
      
      While `DDLSuite` is in the `sql/core` module and `HiveCatalogedDDLSuite` is in the `sql/hive` module, running mvn test on `HiveCatalogedDDLSuite` also runs the tests in its parent class `DDLSuite`. This causes some test cases to fail because they get and use a test file path from the `sql/core` module's `resource` directory.
      
      The test file path obtained starts with 'jar:', like "jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv", which fails in new Path() in datasource.scala.
      
      This PR fixes this by copying the file from the resource to a temp dir.
      
      ## How was this patch tested?
      N/A
      
      Author: windpiger <songjun@outlook.com>
      
      Closes #17338 from windpiger/fixtestfailemvn.
      7ce30e00
    • Ioana Delaney's avatar
      [SPARK-17791][SQL] Join reordering using star schema detection · 81639115
      Ioana Delaney authored
      ## What changes were proposed in this pull request?
      
      Star schema consists of one or more fact tables referencing a number of dimension tables. In general, queries against star schema are expected to run fast because of the established RI constraints among the tables. This design proposes a join reordering based on natural, generally accepted heuristics for star schema queries:
      - Finds the star join with the largest fact table and places it on the driving arm of the left-deep join. This plan avoids large tables on the inner side and thus favors hash joins.
      - Applies the most selective dimensions early in the plan to reduce the amount of data flowing through it (a sample star-schema query is sketched below).
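      
      For illustration, a star-schema query of the kind these heuristics target (table and column names are made up):
      
      ```scala
      sql("""
        SELECT d.year, s.region, SUM(f.amount) AS total
        FROM sales f                              -- large fact table: driving arm
        JOIN store_dim s ON f.store_id = s.id     -- selective dimension applied early
        JOIN date_dim  d ON f.date_id  = d.id
        WHERE s.region = 'EMEA'
        GROUP BY d.year, s.region
      """)
      ```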
      
      The design document was included in SPARK-17791.
      
      Link to the google doc: [StarSchemaDetection](https://docs.google.com/document/d/1UAfwbm_A6wo7goHlVZfYK99pqDMEZUumi7pubJXETEA/edit?usp=sharing)
      
      ## How was this patch tested?
      
      A new test suite StarJoinSuite.scala was implemented.
      
      Author: Ioana Delaney <ioanamdelaney@gmail.com>
      
      Closes #15363 from ioana-delaney/starJoinReord2.
      81639115
    • Felix Cheung's avatar
      [SPARK-20020][SPARKR][FOLLOWUP] DataFrame checkpoint API fix version tag · f14f81e9
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      doc only change
      
      ## How was this patch tested?
      
      manual
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17356 from felixcheung/rdfcheckpoint2.
      f14f81e9
    • wangzhenhua's avatar
      [SPARK-19994][SQL] Wrong outputOrdering for right/full outer smj · 965a5abc
      wangzhenhua authored
      ## What changes were proposed in this pull request?
      
      For a right outer join, values of the left key will be filled with nulls if they can't match a value of the right key, so the `nullOrdering` of the left key can't be guaranteed. We should output the right key's ordering instead of the left key's.
      
      For full outer join, neither left key nor right key guarantees `nullOrdering`. We should not output any ordering.
      
      In tests, besides adding three test cases for left/right/full outer sort merge join, this patch also reorganizes code in `PlannerSuite` by putting the `Sort` tests together and extracting the common logic of the Sort tests into a method.
      
      ## How was this patch tested?
      
      Corresponding test cases are added.
      
      Author: wangzhenhua <wangzhenhua@huawei.com>
      Author: Zhenhua Wang <wzh_zju@163.com>
      
      Closes #17331 from wzhfy/wrongOrdering.
      965a5abc
    • Felix Cheung's avatar
      [SPARK-20020][SPARKR] DataFrame checkpoint API · c4059772
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Add checkpoint, setCheckpointDir API to R
      
      ## How was this patch tested?
      
      unit tests, manual tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17351 from felixcheung/rdfcheckpoint.
      c4059772
    • hyukjinkwon's avatar
      [SPARK-19849][SQL] Support ArrayType in to_json to produce JSON array · 0cdcf911
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to support an array of struct type in `to_json` as below:
      
      ```scala
      import org.apache.spark.sql.functions._
      
      val df = Seq(Tuple1(Tuple1(1) :: Nil)).toDF("a")
      df.select(to_json($"a").as("json")).show()
      ```
      
      ```
      +----------+
      |      json|
      +----------+
      |[{"_1":1}]|
      +----------+
      ```
      
      Currently, it throws an exception as below (a newline manually inserted for readability):
      
      ```
      org.apache.spark.sql.AnalysisException: cannot resolve 'structtojson(`array`)' due to data type
      mismatch: structtojson requires that the expression is a struct expression.;;
      ```
      
      This allows the roundtrip with `from_json` as below:
      
      ```scala
      import org.apache.spark.sql.functions._
      import org.apache.spark.sql.types._
      
      val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil))
      val df = Seq("""[{"a":1}, {"a":2}]""").toDF("json").select(from_json($"json", schema).as("array"))
      df.show()
      
      // Read back.
      df.select(to_json($"array").as("json")).show()
      ```
      
      ```
      +----------+
      |     array|
      +----------+
      |[[1], [2]]|
      +----------+
      
      +-----------------+
      |             json|
      +-----------------+
      |[{"a":1},{"a":2}]|
      +-----------------+
      ```
      
      Also, this PR proposes to rename `StructToJson` to `StructsToJson` and `JsonToStruct` to `JsonToStructs`.
      
      ## How was this patch tested?
      
      Unit tests in `JsonFunctionsSuite` and `JsonExpressionsSuite` for Scala, doctest for Python and test in `test_sparkSQL.R` for R.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17192 from HyukjinKwon/SPARK-19849.
      0cdcf911
  4. Mar 19, 2017
    • Tathagata Das's avatar
      [SPARK-19067][SS] Processing-time-based timeout in MapGroupsWithState · 990af630
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      When a key does not get any new data in `mapGroupsWithState`, the mapping function is never called on it. So we need a timeout feature that calls the function again in such cases, so that the user can decide whether to continue waiting or clean up (remove state, save stuff externally, etc.).
      Timeouts can be based on either processing time or event time. This JIRA is for processing time, but it defines the high-level API design for both. The usage would look like this:
      ```
      def stateFunction(key: K, value: Iterator[V], state: KeyedState[S]): U = {
        ...
        state.setTimeoutDuration(10000)
        ...
      }
      
      dataset					// type is Dataset[T]
        .groupByKey[K](keyingFunc)   // generates KeyValueGroupedDataset[K, T]
        .mapGroupsWithState[S, U](
           func = stateFunction,
           timeout = KeyedStateTimeout.withProcessingTime)	// returns Dataset[U]
      ```
      
      Note the following design aspects.
      
      - The timeout type is provided as a parameter to mapGroupsWithState and is global to all the keys. This is so that the planner knows it at planning time and can accordingly optimize the execution based on whether extra info (e.g. timeout durations or timestamps) needs to be saved in state or not.
      
      - The exact timeout duration is provided inside the function call so that it can be customized on a per key basis.
      
      - When the timeout occurs for a key, the function is called with no values and with KeyedState.isTimingOut() set to true.
      
      - The timeout is reset for a key every time the function is called on that key, that is, when the key has new data or the key has timed out. So the user has to set the timeout duration every time the function is called, otherwise no timeout will be set.
      
      Guarantees provided on timeout of key, when timeout duration is D ms:
      - Timeout will never be called before real clock time has advanced by D ms
      - Timeout will be called eventually when there is a trigger with any data in it (i.e. after D ms). So there is no strict upper bound on when the timeout would occur. For example, if there is no data in the stream (for any key) for a while, the timeout will not be hit.
      
      Implementation details:
      - Added new param to `mapGroupsWithState` for timeout
      - Added new method to `StateStore` to filter data based on timeout timestamp
      - Changed the internal map type of `HDFSBackedStateStore` from Java's `HashMap` to `ConcurrentHashMap` as the latter allows weakly-consistent fail-safe iterators on the map data. See comments in code for more details.
      - Refactored logic of `MapGroupsWithStateExec` to
        - Save timeout info to state store for each key that has data.
        - Then, filter states that should be timed out based on the current batch processing timestamp.
      - Moved KeyedState from `o.a.s.sql` to `o.a.s.sql.streaming`. This was feedback from the MapGroupsWithState PR that I had forgotten to address.
      
      ## How was this patch tested?
      New unit tests in
      - MapGroupsWithStateSuite for timeouts.
      - StateStoreSuite for new APIs in StateStore.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #17179 from tdas/mapgroupwithstate-timeout.
      990af630
    • Xiao Li's avatar
      [SPARK-19990][TEST] Use the database after Hive's current Database is dropped · 0ee9fbf5
      Xiao Li authored
      ### What changes were proposed in this pull request?
      This PR fixes the following test failure in the maven build and in the PR https://github.com/apache/spark/pull/15363.
      
      > org.apache.spark.sql.hive.orc.OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by Hive
      
      The [test history](https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.orc.OrcSourceSuite&test_name=SPARK-19459%2FSPARK-18220%3A+read+char%2Fvarchar+column+written+by+Hive) shows that all the maven builds failed this test case with the same error message.
      
      ```
      FAILED: SemanticException [Error 10072]: Database does not exist: db2
      
            org.apache.spark.sql.execution.QueryExecutionException: FAILED: SemanticException [Error 10072]: Database does not exist: db2
            at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:637)
            at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:621)
            at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:288)
            at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
            at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
            at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
            at org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:621)
            at org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:611)
            at org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply$mcV$sp(OrcSourceSuite.scala:160)
            at org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
            at org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
            at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
            at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
            at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
            at org.scalatest.Transformer.apply(Transformer.scala:22)
            at org.scalatest.Transformer.apply(Transformer.scala:20)
            at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
            at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
            at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
            at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
      ```
      
      ### How was this patch tested?
      N/A
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17344 from gatorsmile/testtest.
      0ee9fbf5
    • Felix Cheung's avatar
      [SPARK-18817][SPARKR][SQL] change derby log output to temp dir · 422aa67d
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Passes R `tempdir()` (the R session temp dir, shared with other temp files/dirs) to the JVM, and sets the system property for the derby home dir so that derby.log is moved there.
      
      ## How was this patch tested?
      
      Manually, unit tests
      
      With this, these files are relocated under /tmp:
      ```
      # ls /tmp/RtmpG2M0cB/
      derby.log
      ```
      And they are removed automatically when the R session is ended.
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16330 from felixcheung/rderby.
      422aa67d
    • hyukjinkwon's avatar
      [MINOR][R] Reorder `Collate` fields in DESCRIPTION file · 60262bc9
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      It seems the CRAN check script corrects `R/pkg/DESCRIPTION` and reorders the `Collate` fields.
      
      This PR fixes the field order so that running the script does not produce a diff in this file.
      
      ## How was this patch tested?
      
      Manually via `./R/check-cran.sh`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17349 from HyukjinKwon/minor-cran.
      60262bc9
  5. Mar 18, 2017
    • Felix Cheung's avatar
      [SPARK-19654][SPARKR][SS] Structured Streaming API for R · 5c165596
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Add "experimental" API for SS in R
      
      ## How was this patch tested?
      
      manual, unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16982 from felixcheung/rss.
      5c165596
    • Sean Owen's avatar
      [SPARK-16599][CORE] java.util.NoSuchElementException: None.get at at... · 54e61df2
      Sean Owen authored
      [SPARK-16599][CORE] java.util.NoSuchElementException: None.get at at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask
      
      ## What changes were proposed in this pull request?
      
      Avoid a None.get exception in the (rare?) case that no read locks exist.
      Note that while this resolves the immediate cause of the exception, it's not clear it is the root problem.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17290 from srowen/SPARK-16599.
      54e61df2
    • Takeshi Yamamuro's avatar
      [SPARK-19896][SQL] Throw an exception if case classes have circular references in toDS · ccba622e
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      If case classes have circular references, as below, `toDS()` throws a StackOverflowError:
      ```
      scala> :paste
      case class classA(i: Int, cls: classB)
      case class classB(cls: classA)
      
      scala> Seq(classA(0, null)).toDS()
      java.lang.StackOverflowError
        at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1494)
        at scala.reflect.runtime.JavaMirrors$JavaMirror$$anon$1.scala$reflect$runtime$SynchronizedSymbols$SynchronizedSymbol$$super$info(JavaMirrors.scala:66)
        at scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anonfun$info$1.apply(SynchronizedSymbols.scala:127)
        at scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anonfun$info$1.apply(SynchronizedSymbols.scala:127)
        at scala.reflect.runtime.Gil$class.gilSynchronized(Gil.scala:19)
        at scala.reflect.runtime.JavaUniverse.gilSynchronized(JavaUniverse.scala:16)
        at scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$class.gilSynchronizedIfNotThreadsafe(SynchronizedSymbols.scala:123)
        at scala.reflect.runtime.JavaMirrors$JavaMirror$$anon$1.gilSynchronizedIfNotThreadsafe(JavaMirrors.scala:66)
        at scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$class.info(SynchronizedSymbols.scala:127)
        at scala.reflect.runtime.JavaMirrors$JavaMirror$$anon$1.info(JavaMirrors.scala:66)
        at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
        at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:45)
        at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:45)
        at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:45)
        at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:45)
      ```
      This PR adds code to throw an UnsupportedOperationException in that case, as follows:
      ```
      scala> :paste
      case class A(cls: B)
      case class B(cls: A)
      
      scala> Seq(A(null)).toDS()
      java.lang.UnsupportedOperationException: cannot have circular references in class, but got the circular reference of class B
        at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:627)
        at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:644)
        at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:632)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
      ```
      
      ## How was this patch tested?
      Added tests in `DatasetSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #17318 from maropu/SPARK-19896.
      ccba622e
    • wangzhenhua's avatar
      [SPARK-19915][SQL] Exclude cartesian product candidates to reduce the search space · c083b6b7
      wangzhenhua authored
      ## What changes were proposed in this pull request?
      
      We have some concerns about removing size from the cost model [in the previous pr](https://github.com/apache/spark/pull/17240). It's a tradeoff between code structure and algorithm completeness. I tend to keep the size, and thus created this new PR without changing the cost model.
      
      What this PR does:
      1. We only consider consecutive inner-joinable items, thus excluding cartesian products from the reordering procedure. This significantly reduces the search space and the memory overhead of the memo; otherwise, every combination of items would exist in the memo.
      2. This PR also includes a bug fix: if a leaf item is a Project(_, child), the current solution will miss the Project.
      
      ## How was this patch tested?
      
      Added test cases.
      
      Author: wangzhenhua <wangzhenhua@huawei.com>
      
      Closes #17286 from wzhfy/joinReorder3.
      c083b6b7