  1. Jul 25, 2016
    • Philipp Hoffmann's avatar
      [SPARK-15271][MESOS] Allow force pulling executor docker images · 978cd5f1
      Philipp Hoffmann authored
      ## What changes were proposed in this pull request?
      
      Mesos agents by default will not pull Docker images that are already cached locally.
      To run Spark executors from mutable tags like `:latest`, this commit introduces a Spark
      setting, `spark.mesos.executor.docker.forcePullImage`. Setting this flag to `true` tells
      the Mesos agent to force pull the Docker image (the default is `false`, which is
      consistent with the previous implementation and Mesos' default behaviour).
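
      For illustration, a minimal sketch of enabling the new setting from application code
      (the image name and the `SparkConf` usage here are illustrative, not part of this change):

      ```scala
      // Hedged sketch: force pulling of the executor Docker image.
      // Only the forcePullImage key is introduced by this change; the image name is an example.
      import org.apache.spark.SparkConf

      val conf = new SparkConf()
        .set("spark.mesos.executor.docker.image", "myrepo/spark-executor:latest")
        .set("spark.mesos.executor.docker.forcePullImage", "true")
      ```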
      
      ## How was this patch tested?
      
      I ran a sample application including this change on a Mesos cluster and verified the correct behaviour for both, with and without, force pulling the executor image. As expected the image is being force pulled if the flag is set.
      
      Author: Philipp Hoffmann <mail@philipphoffmann.de>
      
      Closes #13051 from philipphoffmann/force-pull-image.
      978cd5f1
    • Reynold Xin's avatar
      [SPARK-16685] Remove audit-release scripts. · dd784a88
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch removes dev/audit-release. It was initially created to do basic release auditing, but these scripts have been unused for more than a year.
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #14342 from rxin/SPARK-16685.
      dd784a88
    • WeichenXu's avatar
      [SPARK-16653][ML][OPTIMIZER] update ANN convergence tolerance param default to 1e-6 · ad3708e7
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Change the ANN convergence tolerance param default from 1e-4 to 1e-6,
      so that it matches the other algorithms in MLlib that use LBFGS as the optimizer.
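
      As a hedged usage sketch (layer sizes here are illustrative), the tolerance remains
      overridable through the shared `tol` param:

      ```scala
      // Hedged sketch: after this change the default tol is 1e-6, but it can still be set explicitly.
      import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

      val mlp = new MultilayerPerceptronClassifier()
        .setLayers(Array(4, 5, 3))   // illustrative layer sizes
        .setTol(1e-6)
      ```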
      
      ## How was this patch tested?
      
      Existing Test.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #14286 from WeichenXu123/update_ann_tol.
      ad3708e7
    • Felix Cheung's avatar
      [SPARKR][DOCS] fix broken url in doc · b73defdd
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Fix a broken URL. Also:

      The sparkR.session.stop doc page should have that name in the header, instead of saying "sparkR.stop":
      ![image](https://cloud.githubusercontent.com/assets/8969467/17080129/26d41308-50d9-11e6-8967-79d6c920313f.png)
      
      The data type section is in the middle of a list of gapply/gapplyCollect subsections:
      ![image](https://cloud.githubusercontent.com/assets/8969467/17080122/f992d00a-50d8-11e6-8f2c-fd5786213920.png)
      
      ## How was this patch tested?
      
      manual test
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #14329 from felixcheung/rdoclinkfix.
      b73defdd
    • Cheng Lian's avatar
      [SPARK-16703][SQL] Remove extra whitespace in SQL generation for window functions · 7ea6d282
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR fixes a minor formatting issue of `WindowSpecDefinition.sql` when no partitioning expressions are present.
      
      Before:
      
      ```sql
      ( ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
      ```
      
      After:
      
      ```sql
      (ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
      ```
      
      ## How was this patch tested?
      
      New test case added in `ExpressionSQLBuilderSuite`.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #14334 from liancheng/window-spec-sql-format.
      7ea6d282
    • hyukjinkwon's avatar
      [SPARK-16698][SQL] Field names having dots should be allowed for datasources based on FileFormat · 79826f3c
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This seems to be a regression, judging from https://issues.apache.org/jira/browse/SPARK-16698.
      
      A field name containing dots throws an exception. For example, the code below:
      
      ```scala
      val path = "/tmp/path"
      val json = """ {"a.b":"data"}"""
      spark.sparkContext
        .parallelize(json :: Nil)
        .saveAsTextFile(path)
      spark.read.json(path).collect()
      ```
      
      throws an exception as below:
      
      ```
      Unable to resolve a.b given [a.b];
      org.apache.spark.sql.AnalysisException: Unable to resolve a.b given [a.b];
      	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
      	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
      	at scala.Option.getOrElse(Option.scala:121)
      ```
      
      This problem was introduced in https://github.com/apache/spark/commit/17eec0a71ba8713c559d641e3f43a1be726b037c#diff-27c76f96a7b2733ecfd6f46a1716e153R121
      
      When extracting the data columns, it does not take into account that field names can contain dots. Actually, it seems field names are not expected to be quoted when defining a schema, so it does not have to consider whether a name is wrapped in quotes, because the actual schema (inferred or user-given) would not contain quotes in its field names.
      
      For example, this throws an exception. (**Loading JSON from RDD is fine**)
      
      ```scala
      val json = """ {"a.b":"data"}"""
      val rdd = spark.sparkContext.parallelize(json :: Nil)
      spark.read.schema(StructType(Seq(StructField("`a.b`", StringType, true))))
        .json(rdd).select("`a.b`").printSchema()
      ```
      
      as below:
      
      ```
      cannot resolve '```a.b```' given input columns: [`a.b`];
      org.apache.spark.sql.AnalysisException: cannot resolve '```a.b```' given input columns: [`a.b`];
      	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
      ```
      
      ## How was this patch tested?
      
      Unit tests in `FileSourceStrategySuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #14339 from HyukjinKwon/SPARK-16698-regression.
      79826f3c
    • Sameer Agarwal's avatar
      [SPARK-16668][TEST] Test parquet reader for row groups containing both... · d6a52176
      Sameer Agarwal authored
      [SPARK-16668][TEST] Test parquet reader for row groups containing both dictionary and plain encoded pages
      
      ## What changes were proposed in this pull request?
      
      This patch adds an explicit test for [SPARK-14217] by setting the parquet dictionary and page sizes such that the generated parquet file spans 3 pages (within a single row group), where the first page is dictionary encoded and the remaining two are plain encoded.
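
      A hedged sketch of the idea (the config keys and sizes below are assumptions for
      illustration, not necessarily what the test itself uses):

      ```scala
      // Hedged sketch: shrink the dictionary and data page sizes so a single row group
      // ends up with one dictionary-encoded page followed by plain-encoded pages.
      spark.sparkContext.hadoopConfiguration.setInt("parquet.page.size", 2 * 1024)
      spark.sparkContext.hadoopConfiguration.setInt("parquet.dictionary.page.size", 2 * 1024)
      spark.range(10000).selectExpr("cast(id % 10 as string) AS s")
        .write.mode("overwrite").parquet("/tmp/hybrid-encoding-test")
      ```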
      
      ## How was this patch tested?
      
      1. ParquetEncodingSuite
      2. Also manually tested that this test fails without https://github.com/apache/spark/pull/12279
      
      Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
      
      Closes #14304 from sameeragarwal/hybrid-encoding-test.
      d6a52176
    • Wenchen Fan's avatar
      [SPARK-16691][SQL] move BucketSpec to catalyst module and use it in CatalogTable · 64529b18
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      It's weird that we have `BucketSpec` to abstract bucket info, but don't use it in `CatalogTable`. This PR moves `BucketSpec` into catalyst module.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #14331 from cloud-fan/check.
      64529b18
    • Wenchen Fan's avatar
      [SPARK-16660][SQL] CreateViewCommand should not take CatalogTable · d27d362e
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      `CreateViewCommand` only needs some of the information in a `CatalogTable`, not all of it. We have some tricks (e.g. we need to check that the table type is `VIEW`, and we need to make `CatalogColumn.dataType` nullable) to allow it to take a `CatalogTable`.
      This PR cleans it up and only passes the necessary information to `CreateViewCommand`.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #14297 from cloud-fan/minor2.
      d27d362e
    • hyukjinkwon's avatar
      [SPARK-16674][SQL] Avoid per-record type dispatch in JDBC when reading · 7ffd99ec
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Currently, `JDBCRDD.compute` is doing type dispatch for each row to read appropriate values.
      It might not have to be done like this because the schema is already kept in `JDBCRDD`.
      
      So, appropriate converters can be created first according to the schema and then applied to each row.
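
      A hedged sketch of the pattern (not the actual `JDBCRDD` code): build one converter
      per column from the schema up front, then reuse the converters for every row.

      ```scala
      import java.sql.ResultSet

      import org.apache.spark.sql.types._

      // One converter per column, built once from the schema.
      def makeConverter(dt: DataType, pos: Int): ResultSet => Any = dt match {
        case IntegerType => rs => rs.getInt(pos + 1)      // JDBC columns are 1-based
        case LongType    => rs => rs.getLong(pos + 1)
        case StringType  => rs => rs.getString(pos + 1)
        case _           => rs => rs.getObject(pos + 1)
      }

      def converters(schema: StructType): Array[ResultSet => Any] =
        schema.fields.zipWithIndex.map { case (f, i) => makeConverter(f.dataType, i) }

      // Per row there is no type dispatch left; we just invoke the prebuilt converters.
      def readRow(rs: ResultSet, convs: Array[ResultSet => Any]): Array[Any] =
        convs.map(_(rs))
      ```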
      
      ## How was this patch tested?
      
      Existing tests should cover this.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #14313 from HyukjinKwon/SPARK-16674.
      7ffd99ec
    • Cheng Lian's avatar
      [SPARK-16648][SQL] Make ignoreNullsExpr a child expression of First and Last · 68b4020d
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      The default `TreeNode.withNewChildren` implementation doesn't work for `Last` when both constructor arguments are the same, e.g.:
      
      ```sql
      LAST_VALUE(FALSE) -- The 2nd argument defaults to FALSE
      LAST_VALUE(FALSE, FALSE)
      LAST_VALUE(TRUE, TRUE)
      ```
      
      This is because although `Last` is a unary expression, both of its constructor arguments, `child` and `ignoreNullsExpr`, are `Expression`s. When they have the same value, `TreeNode.withNewChildren` treats both of them as child nodes by mistake. `First` is also affected by this issue in exactly the same way.
      
      This PR fixes this issue by making `ignoreNullsExpr` a child expression of `First` and `Last`.
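
      For context, a hedged DataFrame-API usage sketch of the affected aggregates; the boolean
      argument corresponds to the `ignoreNullsExpr` discussed above (the data and the
      SparkSession named `spark` are assumptions for illustration):

      ```scala
      import org.apache.spark.sql.functions.{col, first, last}
      import spark.implicits._   // assumes a SparkSession named `spark`, as in spark-shell

      val df = Seq(Some(1), None, Some(3)).toDF("v")
      df.agg(first(col("v"), ignoreNulls = true), last(col("v"), ignoreNulls = true)).show()
      ```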
      
      ## How was this patch tested?
      
      New test case added in `WindowQuerySuite`.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #14295 from liancheng/spark-16648-last-value.
      68b4020d
  2. Jul 24, 2016
    • Qifan Pu's avatar
      [SPARK-16699][SQL] Fix performance bug in hash aggregate on long string keys · 468a3c3a
      Qifan Pu authored
      
      In the following code in `VectorizedHashMapGenerator.scala`:
      ```
          def hashBytes(b: String): String = {
            val hash = ctx.freshName("hash")
            s"""
               |int $result = 0;
               |for (int i = 0; i < $b.length; i++) {
               |  ${genComputeHash(ctx, s"$b[i]", ByteType, hash)}
               |  $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + ($result >>> 2);
               |}
             """.stripMargin
          }
      
      ```
      When `b = input.getBytes()`, the current 2.0 code results in `getBytes()` being called n times, where n is the length of the input. `getBytes()` involves a memory copy and is therefore expensive, causing a performance degradation.
      The fix is to evaluate `getBytes()` once, before the for loop.
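
      A hedged illustration of the shape of the fix in plain Scala (the actual generated
      Java code differs in detail):

      ```scala
      // Call getBytes() once before the loop instead of on every iteration.
      val input = "some aggregation key"   // illustrative input
      val bytes = input.getBytes           // one memory copy, hoisted out of the loop
      var result = 0
      var i = 0
      while (i < bytes.length) {
        val hash = bytes(i).toInt          // stand-in for genComputeHash on a byte
        result = (result ^ 0x9e3779b9) + hash + (result << 6) + (result >>> 2)
        i += 1
      }
      ```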
      
      Performance bug, no additional test added.
      
      Author: Qifan Pu <qifan.pu@gmail.com>
      
      Closes #14337 from ooq/SPARK-16699.
      
      (cherry picked from commit d226dce1)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      468a3c3a
    • Brian Cho's avatar
      [SPARK-5581][CORE] When writing sorted map output file, avoid open / … · daace601
      Brian Cho authored
      …close between each partition
      
      ## What changes were proposed in this pull request?
      
      Replace commitAndClose with separate commit and close to avoid opening and closing
      the file between partitions.
      
      ## How was this patch tested?
      
      Run existing unit tests, add a few unit tests regarding reverts.
      
      Observed a ~20% reduction in total time in tasks on stages with shuffle
      writes to many partitions.
      
      JoshRosen
      
      Author: Brian Cho <bcho@fb.com>
      
      Closes #13382 from dafrista/separatecommit-master.
      daace601
    • Wenchen Fan's avatar
      [SPARK-16645][SQL] rename CatalogStorageFormat.serdeProperties to properties · 1221ce04
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      We also store data source table options in this field, so it's unreasonable to call it `serdeProperties`.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #14283 from cloud-fan/minor1.
      1221ce04
    • Mikael Ståldal's avatar
      [SPARK-16416][CORE] force eager creation of loggers to avoid shutdown hook conflicts · 23e047f4
      Mikael Ståldal authored
      ## What changes were proposed in this pull request?
      
      Force eager creation of loggers to avoid shutdown hook conflicts.
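
      A hedged sketch of the general idea (not the exact Spark change): make sure the logger
      is constructed eagerly, before any JVM shutdown hook runs.

      ```scala
      // An eager val (rather than a logger created lazily on the first log call) ensures the
      // logging framework is initialized before shutdown hooks start firing.
      import org.slf4j.LoggerFactory

      object ShutdownSafeComponent {
        private val log = LoggerFactory.getLogger(getClass)   // created when the object initializes
        log.debug("logger initialized eagerly")

        def shutdownHookBody(): Unit = log.info("shutting down")   // safe: logger already exists
      }
      ```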
      
      ## How was this patch tested?
      
      Manually tested with a project using Log4j 2, verified that the shutdown hook conflict issue was solved.
      
      Author: Mikael Ståldal <mikael.staldal@magine.com>
      
      Closes #14320 from mikaelstaldal/shutdown-hook-logging.
      23e047f4
    • WeichenXu's avatar
      [PYSPARK] add picklable SparseMatrix in pyspark.ml.common · 37bed97d
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Add a `SparseMatrix` class which supports the pickler.
      
      ## How was this patch tested?
      
      Existing test.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #14265 from WeichenXu123/picklable_py.
      37bed97d
    • Dongjoon Hyun's avatar
      [SPARK-16463][SQL] Support `truncate` option in Overwrite mode for JDBC DataFrameWriter · cc1d2dcb
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR adds a boolean option, `truncate`, for `SaveMode.Overwrite` of the JDBC DataFrameWriter. If this option is `true`, Spark tries to take advantage of `TRUNCATE TABLE` instead of `DROP TABLE`. This is a trivial option, but it will provide great **convenience** for BI tool users who work with RDBMS tables generated by Spark.
      
      **Goal**
      - Without the `CREATE/DROP` privilege, we can still save a dataframe to the database. Sometimes these privileges are not granted for security reasons.
      - It will preserve the existing table information, so users can add and keep some additional `INDEX` and `CONSTRAINT`s for the table.
      - Sometimes, `TRUNCATE` is faster than the `DROP/CREATE` combination.
      
      **Supported DBMS**
      The following is the `truncate`-option support table. Due to the differing behavior of `TRUNCATE TABLE` among DBMSs, it's not always safe to use `TRUNCATE TABLE`. Spark will ignore the `truncate` option for **unknown** DBMSs and **some** DBMSs with **default CASCADING** behavior. A newly added JDBCDialect should additionally implement the corresponding function to support the `truncate` option.
      
      Spark Dialects | `truncate` OPTION SUPPORT
      ---------------|-------------------------------
      MySQLDialect | O
      PostgresDialect | X
      DB2Dialect | O
      MsSqlServerDialect | O
      DerbyDialect | O
      OracleDialect | O
      
      **Before (TABLE with INDEX case)**: SparkShell & MySQL CLI are interleaved intentionally.
      ```scala
      scala> val (url, prop)=("jdbc:mysql://localhost:3306/temp?useSSL=false", new java.util.Properties)
      scala> prop.setProperty("user","root")
      scala> df.write.mode("overwrite").jdbc(url, "table_with_index", prop)
      scala> spark.range(10).write.mode("overwrite").jdbc(url, "table_with_index", prop)
      mysql> DESC table_with_index;
      +-------+------------+------+-----+---------+-------+
      | Field | Type       | Null | Key | Default | Extra |
      +-------+------------+------+-----+---------+-------+
      | id    | bigint(20) | NO   |     | NULL    |       |
      +-------+------------+------+-----+---------+-------+
      mysql> CREATE UNIQUE INDEX idx_id ON table_with_index(id);
      mysql> DESC table_with_index;
      +-------+------------+------+-----+---------+-------+
      | Field | Type       | Null | Key | Default | Extra |
      +-------+------------+------+-----+---------+-------+
      | id    | bigint(20) | NO   | PRI | NULL    |       |
      +-------+------------+------+-----+---------+-------+
      scala> spark.range(10).write.mode("overwrite").jdbc(url, "table_with_index", prop)
      mysql> DESC table_with_index;
      +-------+------------+------+-----+---------+-------+
      | Field | Type       | Null | Key | Default | Extra |
      +-------+------------+------+-----+---------+-------+
      | id    | bigint(20) | NO   |     | NULL    |       |
      +-------+------------+------+-----+---------+-------+
      ```
      
      **After (TABLE with INDEX case)**
      ```scala
      scala> spark.range(10).write.mode("overwrite").option("truncate", true).jdbc(url, "table_with_index", prop)
      mysql> DESC table_with_index;
      +-------+------------+------+-----+---------+-------+
      | Field | Type       | Null | Key | Default | Extra |
      +-------+------------+------+-----+---------+-------+
      | id    | bigint(20) | NO   | PRI | NULL    |       |
      +-------+------------+------+-----+---------+-------+
      ```
      
      **Error Handling**
      - In case of exceptions, Spark will not retry. Users should turn off the `truncate` option.
      - In case of schema change:
        - If one of the column names changes, this will intuitively raise exceptions.
        - If there is only a type difference, this will work like Append mode.
      
      ## How was this patch tested?
      
      Pass the Jenkins tests with an updated test case.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14086 from dongjoon-hyun/SPARK-16410.
      cc1d2dcb
    • Liwei Lin's avatar
      [SPARK-16515][SQL][FOLLOW-UP] Fix test `script` on OS X/Windows... · d6795c7a
      Liwei Lin authored
      ## Problem
      
      The current `sed` in `test_script.sh` is missing a `$`, leading to the failure of `script` test on OS X:
      ```
      == Results ==
      !== Correct Answer - 2 ==   == Spark Answer - 2 ==
      ![x1_y1]                    [x1]
      ![x2_y2]                    [x2]
      ```
      
      In addition, this `script` test would also fail on systems like Windows where we cannot invoke `bash` or `echo | sed`.
      
      ## What changes were proposed in this pull request?
      This patch
      - fixes `sed` in `test_script.sh`
      - adds command guards so that the `script` test would pass on systems like Windows
      
      ## How was this patch tested?
      
      - Jenkins
      - Manually verified tests pass on OS X
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #14280 from lw-lin/osx-sed.
      d6795c7a
    • Sean Owen's avatar
      [MINOR] Close old PRs that should be closed but have not been · e3c7039b
      Sean Owen authored
      Closes #11598
      Closes #7278
      Closes #13882
      Closes #12053
      Closes #14125
      Closes #8760
      Closes #12848
      Closes #14224
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14328 from srowen/CloseOldPRs.
      e3c7039b
  3. Jul 23, 2016
    • Cheng Lian's avatar
      [SPARK-16380][EXAMPLES] Update SQL examples and programming guide for Python language binding · 53b2456d
      Cheng Lian authored
      This PR is based on PR #14098 authored by wangmiao1981.
      
      ## What changes were proposed in this pull request?
      
      This PR replaces the original Python Spark SQL example file with the following three files:
      
      - `sql/basic.py`
      
        Demonstrates basic Spark SQL features.
      
      - `sql/datasource.py`
      
        Demonstrates various Spark SQL data sources.
      
      - `sql/hive.py`
      
        Demonstrates Spark SQL Hive interaction.
      
      This PR also removes hard-coded Python example snippets in the SQL programming guide by extracting snippets from the above files using the `include_example` Liquid template tag.
      
      ## How was this patch tested?
      
      Manually tested.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #14317 from liancheng/py-examples-update.
      53b2456d
    • Wenchen Fan's avatar
      [SPARK-16690][TEST] rename SQLTestUtils.withTempTable to withTempView · 86c27520
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      After https://github.com/apache/spark/pull/12945, we renamed `registerTempTable` to `createTempView`, as it actually creates a view. This PR renames `SQLTestUtils.withTempTable` to reflect this change.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #14318 from cloud-fan/minor4.
      86c27520
    • WeichenXu's avatar
      [SPARK-16662][PYSPARK][SQL] fix HiveContext warning bug · ab6e4aea
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Move the `HiveContext` deprecation warning into the `HiveContext` constructor,
      so that the warning appears only when we actually use `HiveContext`;
      otherwise the warning would always appear whenever we merely reference the pyspark.ml.context code file.
      
      ## How was this patch tested?
      
      Manual.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #14301 from WeichenXu123/hiveContext_python_warning_update.
      ab6e4aea
    • WeichenXu's avatar
      [SPARK-16561][MLLIB] fix multivarOnlineSummary min/max bug · 25db5167
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Rename variables to make the code clearer:
      nnz => weightSum
      weightSum => totalWeightSum

      Also add a new member vector `nnz` (not the `nnz` in the previous code, which was renamed to `weightSum`) that counts the number of non-zero values in each dimension.
      Using this new `nnz` instead of `weightSum` when calculating min/max fixes several numerical errors in some extreme cases.
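
      For reference, a hedged usage sketch of the summarizer whose min/max computation this
      change affects (assuming it is `MultivariateOnlineSummarizer`):

      ```scala
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

      val summarizer = new MultivariateOnlineSummarizer()
      summarizer.add(Vectors.dense(1.0, 0.0))
      summarizer.add(Vectors.dense(-2.0, 3.0))
      println(summarizer.min)   // per-dimension minimum; must account for implicit zeros
      println(summarizer.max)   // per-dimension maximum
      ```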
      
      ## How was this patch tested?
      
      A new testcase added.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #14216 from WeichenXu123/multivarOnlineSummary.
      25db5167
  4. Jul 22, 2016
    • Liang-Chi Hsieh's avatar
      [SPARK-16622][SQL] Fix NullPointerException when the returned value of the... · e10b8741
      Liang-Chi Hsieh authored
      [SPARK-16622][SQL] Fix NullPointerException when the returned value of the called method in Invoke is null
      
      ## What changes were proposed in this pull request?
      
      Currently we don't check the value returned by the called method in `Invoke`. When the returned value is null and is assigned to a variable of a primitive type, a `NullPointerException` will be thrown.
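
      A hedged illustration of the failure mode in plain Scala (not the generated `Invoke` code itself):

      ```scala
      // Unboxing a null boxed value into a primitive throws NullPointerException.
      val boxed: java.lang.Integer = null
      val primitive: Int = boxed.intValue()   // NullPointerException at runtime
      ```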
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #14259 from viirya/agg-empty-ds.
      e10b8741
    • Dongjoon Hyun's avatar
      [SPARK-16651][PYSPARK][DOC] Make `withColumnRenamed/drop` description more... · 47f5b88d
      Dongjoon Hyun authored
      [SPARK-16651][PYSPARK][DOC] Make `withColumnRenamed/drop` description more consistent with Scala API
      
      ## What changes were proposed in this pull request?
      
      `withColumnRenamed` and `drop` are no-ops if the given column name does not exist. The Python documentation also describes that, but this PR adds a more explicit line, consistent with the Scala API, to reduce ambiguity.
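
      The documented behavior, shown as a hedged Scala snippet (the Python API behaves the same
      way; the SparkSession named `spark` is assumed, as in spark-shell):

      ```scala
      // Dropping or renaming a column that does not exist is a no-op.
      val df = spark.range(3).toDF("id")
      df.drop("no_such_column").columns                      // Array("id") -- unchanged
      df.withColumnRenamed("no_such_column", "x").columns    // Array("id") -- unchanged
      ```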
      
      ## How was this patch tested?
      
      It's about docs.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14288 from dongjoon-hyun/SPARK-16651.
      47f5b88d
    • Tom Graves's avatar
      [SPARK-16650] Improve documentation of spark.task.maxFailures · 6c56fff1
      Tom Graves authored
      Clarify documentation on spark.task.maxFailures
      
      No tests were run, as this is a documentation-only change.
      
      Author: Tom Graves <tgraves@yahoo-inc.com>
      
      Closes #14287 from tgravescs/SPARK-16650.
      6c56fff1
    • WeichenXu's avatar
      [GIT] add pydev & Rstudio project file to gitignore list · b4e16bd5
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Add PyDev & RStudio project files to the gitignore list; I think these two IDEs are used by many developers,
      so they won't need a personal gitignore_global config.
      
      ## How was this patch tested?
      
      N/A
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #14293 from WeichenXu123/update_gitignore.
      b4e16bd5
    • Ahmed Mahran's avatar
      [SPARK-16487][STREAMING] Fix some batches might not get marked as fully processed in JobGenerator · 2c72a443
      Ahmed Mahran authored
      ## What changes were proposed in this pull request?
      
      In `JobGenerator`, the code reads as though some batches might not get marked as fully processed. In the following flowchart, the batch should get marked as fully processed before endpoint C, however it is not. Currently, this does not actually cause an issue, because the condition `(time - zeroTime) is multiple of checkpoint duration?` always evaluates to `true`, as the `checkpoint duration` is always set to be equal to the `batch duration`.
      
      ![Flowchart](https://s31.postimg.org/udy9lti2j/spark_streaming_job_generator.png)
      
      This PR fixes this issue to improve code readability and to avoid any potential problem in case a future change sets the checkpoint duration to differ from the batch duration.
      
      Author: Ahmed Mahran <ahmed.mahran@mashin.io>
      
      Closes #14145 from ahmed-mahran/b-mark-batch-fully-processed.
      2c72a443
    • Jacek Laskowski's avatar
      [SPARK-16287][HOTFIX][BUILD][SQL] Fix annotation argument needs to be a constant · e1bd70f4
      Jacek Laskowski authored
      ## What changes were proposed in this pull request?
      
      Build fix for [SPARK-16287][SQL] Implement str_to_map SQL function, which introduced this compilation error:
      
      ```
      /Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala:402: error: annotation argument needs to be a constant; found: "_FUNC_(text[, pairDelim, keyValueDelim]) - Creates a map after splitting the text ".+("into key/value pairs using delimiters. ").+("Default delimiters are \',\' for pairDelim and \':\' for keyValueDelim.")
          "into key/value pairs using delimiters. " +
                                                    ^
      ```
      
      ## How was this patch tested?
      
      Local build
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #14315 from jaceklaskowski/build-fix-complexTypeCreator.
      e1bd70f4
    • gatorsmile's avatar
      [SPARK-16556][SPARK-16559][SQL] Fix Two Bugs in Bucket Specification · 94f14b52
      gatorsmile authored
      ### What changes were proposed in this pull request?
      
      **Issue 1: Silent Ignorance of Bucket Specification When Creating Table Using Schema Inference**
      
      When creating a data source table without explicit specification of schema or SELECT clause, we silently ignore the bucket specification (CLUSTERED BY... SORTED BY...) in [the code](https://github.com/apache/spark/blob/ce3b98bae28af72299722f56e4e4ef831f471ec0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala#L339-L354).
      
      For example,
      ```SQL
      CREATE TABLE jsonTable
      USING org.apache.spark.sql.json
      OPTIONS (
        path '${tempDir.getCanonicalPath}'
      )
      CLUSTERED BY (inexistentColumnA) SORTED BY (inexistentColumnB) INTO 2 BUCKETS
      ```
      
      This PR captures it and issues an error message.
      
      **Issue 2: A run-time `java.lang.ArithmeticException` when the number of buckets is set to zero.**
      
      For example,
      ```SQL
      CREATE TABLE t USING PARQUET
      OPTIONS (PATH '${path.toString}')
      CLUSTERED BY (a) SORTED BY (b) INTO 0 BUCKETS
      AS SELECT 1 AS a, 2 AS b
      ```
      The exception we got is
      ```
      ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 1.0 (TID 2)
      java.lang.ArithmeticException: / by zero
      ```
      
      This PR captures the misuse and issues an appropriate error message.
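
      A hedged illustration of the failure mode (the actual bucketing code computes a
      positive modulus of the hash):

      ```scala
      // The bucket id is essentially hash mod numBuckets, so numBuckets == 0
      // triggers ArithmeticException: / by zero.
      def bucketId(hashValue: Int, numBuckets: Int): Int = {
        val mod = hashValue % numBuckets   // "/ by zero" when numBuckets == 0
        if (mod < 0) mod + numBuckets else mod
      }
      ```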
      
      ### How was this patch tested?
      Added a test case in DDLSuite
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14210 from gatorsmile/createTableWithoutSchema.
      94f14b52
  5. Jul 21, 2016
  6. Jul 20, 2016
    • Wenchen Fan's avatar
      [SPARK-16644][SQL] Aggregate should not propagate constraints containing aggregate expressions · cfa5ae84
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Aggregate expressions can only be executed inside `Aggregate`; if we propagate them up as constraints, the parent operator cannot execute them and will fail at runtime.
      
      ## How was this patch tested?
      
      new test in SQLQuerySuite
      
      Author: Wenchen Fan <wenchen@databricks.com>
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #14281 from cloud-fan/bug.
      cfa5ae84