  1. Apr 05, 2017
    • Tathagata Das's avatar
      [SPARK-20209][SS] Execute next trigger immediately if previous batch took... · dad499f3
      Tathagata Das authored
      [SPARK-20209][SS] Execute next trigger immediately if previous batch took longer than trigger interval
      
      ## What changes were proposed in this pull request?
      
      For large trigger intervals (e.g. 10 minutes), if a batch takes 11 minutes, the engine will then wait 9 more minutes before starting the next batch. This does not make sense. The processing-time trigger policy should be to process batches as fast as possible, but no faster than once per trigger interval. If batches are already taking longer than the trigger interval, there is no point in waiting an extra trigger interval.
      
      In this PR, I modified the ProcessingTimeExecutor to do so. Another minor change was to extract StreamManualClock into a separate class so that it can be used outside subclasses of StreamTest. For example, ProcessingTimeExecutorSuite does not need to create any context for testing; it just needs the StreamManualClock.
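
      A minimal, self-contained sketch of the intended policy (hypothetical names, not the actual `ProcessingTimeExecutor` code): a batch that overran the interval triggers the next batch immediately, while a fast batch waits out the remainder.

      ```scala
      object TriggerPolicy {
        // Milliseconds to wait after a batch finishes before starting the next one.
        def waitTimeMs(batchStartMs: Long, batchEndMs: Long, intervalMs: Long): Long = {
          val elapsed = batchEndMs - batchStartMs
          if (elapsed >= intervalMs) 0L        // overran the interval: trigger immediately
          else intervalMs - elapsed            // wait out the rest of the interval
        }

        def main(args: Array[String]): Unit = {
          val tenMin = 10 * 60 * 1000L
          println(waitTimeMs(0L, 11 * 60 * 1000L, tenMin))  // 0 (was ~9 min before this PR)
          println(waitTimeMs(0L, 4 * 60 * 1000L, tenMin))   // 360000 (wait the remaining 6 min)
        }
      }
      ```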
      
      ## How was this patch tested?
      Added new unit tests to comprehensively test this behavior.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #17525 from tdas/SPARK-20209.
      dad499f3
    • Reynold Xin's avatar
      Small doc fix for ReuseSubquery. · b6e71032
      Reynold Xin authored
      b6e71032
    • Felix Cheung's avatar
      [SPARKR][DOC] update doc for fpgrowth · c1b8b667
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      minor update
      
      zero323
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17526 from felixcheung/rfpgrowthfollowup.
      c1b8b667
  2. Apr 04, 2017
    • Yuhao Yang's avatar
      [SPARK-20003][ML] FPGrowthModel setMinConfidence should affect rules generation and transform · b28bbffb
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      
      jira: https://issues.apache.org/jira/browse/SPARK-20003
      I was doing some tests and found the issue. ml.fpm.FPGrowthModel `setMinConfidence` should always affect rules generation and transform.
      Currently associationRules in FPGrowthModel is a lazy val, so `setMinConfidence` has no impact once associationRules has been computed.
      
      I cache the associationRules to avoid re-computation when `minConfidence` is unchanged, but this makes FPGrowthModel somewhat stateful. Let me know if there's any concern.
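
      A minimal, self-contained sketch of the caching idea described above (hypothetical names, not the actual FPGrowthModel code): rules are recomputed only when the confidence threshold changes.

      ```scala
      // Each rule is (antecedent items, consequent item, confidence).
      class CachedRules(compute: Double => Seq[(Seq[String], String, Double)]) {
        private var lastMinConfidence: Option[Double] = None
        private var cached: Seq[(Seq[String], String, Double)] = Seq.empty

        def associationRules(minConfidence: Double): Seq[(Seq[String], String, Double)] =
          synchronized {
            if (!lastMinConfidence.contains(minConfidence)) {
              cached = compute(minConfidence)     // recompute on threshold change
              lastMinConfidence = Some(minConfidence)
            }
            cached                                // otherwise reuse the cached rules
          }
      }
      ```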
      
      ## How was this patch tested?
      
      New unit tests; I also strengthened the unit test for model save/load to verify the caching mechanism.
      
      Author: Yuhao Yang <yuhao.yang@intel.com>
      
      Closes #17336 from hhbyyh/fpmodelminconf.
      b28bbffb
    • Seth Hendrickson's avatar
      [SPARK-20183][ML] Added outlierRatio arg to MLTestingUtils.testOutliersWithSmallWeights · a59759e6
      Seth Hendrickson authored
      ## What changes were proposed in this pull request?
      
      This is a small piece from https://github.com/apache/spark/pull/16722 which ultimately will add sample weights to decision trees.  This is to allow more flexibility in testing outliers since linear models and trees behave differently.
      
      Note: The primary author when this is committed should be sethah since this is taken from his code.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #17501 from jkbradley/SPARK-20183.
      a59759e6
    • Wenchen Fan's avatar
      [SPARK-19716][SQL] support by-name resolution for struct type elements in array · 295747e5
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Previously, when we construct the deserializer expression for an array type, we first cast the corresponding field to the expected array type and then apply `MapObjects`.
      
      However, by doing that, we lose the opportunity to do by-name resolution for struct types inside the array type. In this PR, I introduce an `UnresolvedMapObjects` to hold the lambda function and the input array expression. Then during analysis, after the input array expression is resolved, we get the actual array element type and apply by-name resolution. We then don't need to add a `Cast` for the array type when constructing the deserializer expression, as the element type is determined later, in the analyzer.
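
      A hedged illustration of the user-facing effect (the example is assumed, not taken from the PR): in a `spark-shell` session, struct fields inside array elements are matched by name rather than by position.

      ```scala
      import spark.implicits._

      case class Inner(b: Int, a: Int)     // declaration order differs from the schema
      case class Outer(arr: Seq[Inner])

      val df = spark.sql("SELECT array(named_struct('a', 1, 'b', 2)) AS arr")
      // With by-name resolution, each element maps a -> 1 and b -> 2 regardless
      // of the case class's field order.
      df.as[Outer].collect()               // Array(Outer(List(Inner(2, 1))))
      ```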
      
      ## How was this patch tested?
      
      new regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #17398 from cloud-fan/dataset.
      295747e5
    • Wenchen Fan's avatar
      [SPARK-20204][SQL] remove SimpleCatalystConf and CatalystConf type alias · 402bf2a5
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up of https://github.com/apache/spark/pull/17285 .
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #17521 from cloud-fan/conf.
      402bf2a5
    • hyukjinkwon's avatar
      [MINOR][R] Reorder `Collate` fields in DESCRIPTION file · 0e2ee820
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      It seems the CRAN check script corrects `R/pkg/DESCRIPTION` and enforces the order of the `Collate` fields.
      
      This PR proposes to fix the order of `catalog.R` so that running this script does not produce a small diff in this file every time.
      
      ## How was this patch tested?
      
      Manually via `./R/check-cran.sh`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17528 from HyukjinKwon/minor-reorder-description.
      0e2ee820
    • Marcelo Vanzin's avatar
      [SPARK-20191][YARN] Create wrapper for RackResolver so tests can override it. · 0736980f
      Marcelo Vanzin authored
      The current test code tries to override the RackResolver by setting
      configuration params, but because the YARN libs statically initialize the
      resolver the first time it is used, those configs don't really take effect
      during Spark tests.
      
      This change adds a wrapper class that easily allows tests to override the
      behavior of the resolver for the Spark code that uses it.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #17508 from vanzin/SPARK-20191.
      0736980f
    • Anirudh Ramanathan's avatar
      [SPARK-18278][SCHEDULER] Documentation to point to Kubernetes cluster scheduler · 11238d4c
      Anirudh Ramanathan authored
      ## What changes were proposed in this pull request?
      
      Adding documentation to point to the Kubernetes cluster scheduler being developed out-of-repo at https://github.com/apache-spark-on-k8s/spark
      cc rxin srowen tnachen ash211 mccheah erikerlandson
      
      ## How was this patch tested?
      
      Docs only change
      
      Author: Anirudh Ramanathan <foxish@users.noreply.github.com>
      Author: foxish <ramanathana@google.com>
      
      Closes #17522 from foxish/upstream-doc.
      11238d4c
    • Xiao Li's avatar
      [SPARK-20198][SQL] Remove the inconsistency in table/function name conventions... · 26e7bca2
      Xiao Li authored
      [SPARK-20198][SQL] Remove the inconsistency in table/function name conventions in SparkSession.Catalog APIs
      
      ### What changes were proposed in this pull request?
      Observed by felixcheung: in the `SparkSession`.`Catalog` APIs, we have different conventions/rules for table/function identifiers/names. Most APIs accept a qualified name (i.e., `databaseName`.`tableName` or `databaseName`.`functionName`). However, the following five APIs do not accept it:
      - def listColumns(tableName: String): Dataset[Column]
      - def getTable(tableName: String): Table
      - def getFunction(functionName: String): Function
      - def tableExists(tableName: String): Boolean
      - def functionExists(functionName: String): Boolean
      
      To make them consistent with the other Catalog APIs, this PR makes these changes, updates the function/API comments, and adds `params` to clarify the inputs we allow.
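
      A hedged usage sketch (the database and table names are hypothetical) of the convention these APIs now share: both unqualified and database-qualified names are accepted.

      ```scala
      spark.catalog.tableExists("my_table")          // resolved against the current database
      spark.catalog.tableExists("my_db.my_table")    // database-qualified name
      spark.catalog.listColumns("my_db.my_table").show()
      ```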
      
      ### How was this patch tested?
      Added test cases.
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17518 from gatorsmile/tableIdentifier.
      26e7bca2
    • guoxiaolongzte's avatar
      [SPARK-20190][APP-ID] applications/[app-id]/jobs' in rest api,status should be [running|s… · c95fbea6
      guoxiaolongzte authored
      …ucceeded|failed|unknown]
      
      ## What changes were proposed in this pull request?
      
      The `status` parameter of '/applications/[app-id]/jobs' in the REST API should accept '[running|succeeded|failed|unknown]'.
      Currently the documented values are '[complete|succeeded|failed]', but '/applications/[app-id]/jobs?status=complete' makes the server return 'HTTP ERROR 404'.
      This PR adds '?status=running' and '?status=unknown'.
      The relevant enum:
      ```java
      public enum JobExecutionStatus {
        RUNNING,
        SUCCEEDED,
        FAILED,
        UNKNOWN;
      }
      ```
      ## How was this patch tested?
      
       manual tests
      
      
      Author: guoxiaolongzte <guo.xiaolong1@zte.com.cn>
      
      Closes #17507 from guoxiaolongzte/SPARK-20190.
      c95fbea6
    • zero323's avatar
      [SPARK-19825][R][ML] spark.ml R API for FPGrowth · b34f7665
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Adds SparkR API for FPGrowth: [SPARK-19825](https://issues.apache.org/jira/browse/SPARK-19825):
      
      - `spark.fpGrowth` - model training.
      - `freqItemsets` and `associationRules` methods with new corresponding generics.
      - Scala helper: `org.apache.spark.ml.r.FPGrowthWrapper`
      - unit tests.
      
      ## How was this patch tested?
      
      Feature specific unit tests.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17170 from zero323/SPARK-19825.
      b34f7665
    • Xiao Li's avatar
      [SPARK-20067][SQL] Unify and Clean Up Desc Commands Using Catalog Interface · 51d3c854
      Xiao Li authored
      ### What changes were proposed in this pull request?
      
      This PR is to unify and clean up the outputs of `DESC EXTENDED/FORMATTED` and `SHOW TABLE EXTENDED` by moving the logic into the Catalog interface. The output formats are improved and the missing attributes are added. It impacts DDL commands like `SHOW TABLE EXTENDED`, `DESC EXTENDED` and `DESC FORMATTED`.
      
      In addition, following what we did for `printSchema` in the Dataset API, we can use `treeString` to show the schema in a more readable way.
      
      Below is the current way:
      ```
      Schema: STRUCT<`a`: STRING (nullable = true), `b`: INT (nullable = true), `c`: STRING (nullable = true), `d`: STRING (nullable = true)>
      ```
      After the change, it should look like
      ```
      Schema: root
       |-- a: string (nullable = true)
       |-- b: integer (nullable = true)
       |-- c: string (nullable = true)
       |-- d: string (nullable = true)
      ```
      
      ### How was this patch tested?
      `describe.sql` and `show-tables.sql`
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17394 from gatorsmile/descFollowUp.
      51d3c854
  3. Apr 03, 2017
    • Dilip Biswal's avatar
      [SPARK-10364][SQL] Support Parquet logical type TIMESTAMP_MILLIS · 3bfb639c
      Dilip Biswal authored
      ## What changes were proposed in this pull request?
      
      **Description** from JIRA
      
      The TimestampType in Spark SQL is of microsecond precision. Ideally, we should convert Spark SQL timestamp values into Parquet TIMESTAMP_MICROS, but unfortunately parquet-mr hasn't supported it yet.
      For the read path, we should be able to read TIMESTAMP_MILLIS Parquet values and pad a zero microsecond part onto the read values.
      For the write path, we currently write timestamps as INT96, similarly to Impala and Hive. One alternative is to add a separate SQL option that lets users write Spark SQL timestamp values as TIMESTAMP_MILLIS. Of course, in this case the microsecond part will be truncated.
      
      ## How was this patch tested?
      
      Added new tests in ParquetQuerySuite and ParquetIOSuite
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #15332 from dilipbiswal/parquet-time-millis.
      3bfb639c
    • Ron Hu's avatar
      [SPARK-19408][SQL] filter estimation on two columns of same table · e7877fd4
      Ron Hu authored
      ## What changes were proposed in this pull request?
      
      In SQL queries, we also see predicate expressions involving two columns, such as "column-1 (op) column-2", where column-1 and column-2 belong to the same table. Note that if column-1 and column-2 belong to different tables, then it is a join operator's work, NOT a filter operator's work.
      
      This PR estimates filter selectivity on two columns of the same table. For example, multiple TPC-H queries have the predicate "WHERE l_commitdate < l_receiptdate".
      
      ## How was this patch tested?
      
      We added 6 new test cases covering various logical predicates involving two columns of the same table.
      
      
      Author: Ron Hu <ron.hu@huawei.com>
      Author: U-CHINA\r00754707 <r00754707@R00754707-SC04.china.huawei.com>
      
      Closes #17415 from ron8hu/filterTwoColumns.
      e7877fd4
    • samelamin's avatar
      [SPARK-20145] Fix range case insensitive bug in SQL · 58c9e6e7
      samelamin authored
      ## What changes were proposed in this pull request?
      Range in SQL should be case insensitive
      
      ## How was this patch tested?
      unit test
      
      Author: samelamin <hussam.elamin@gmail.com>
      Author: samelamin <sam_elamin@discovery.com>
      
      Closes #17487 from samelamin/SPARK-20145.
      58c9e6e7
    • Adrian Ionescu's avatar
      [SPARK-20194] Add support for partition pruning to in-memory catalog · 703c42c3
      Adrian Ionescu authored
      ## What changes were proposed in this pull request?
      This patch implements `listPartitionsByFilter()` for `InMemoryCatalog` and thus resolves an outstanding TODO causing the `PruneFileSourcePartitions` optimizer rule not to apply when "spark.sql.catalogImplementation" is set to "in-memory" (which is the default).
      
      The change is straightforward: it extracts the code for further filtering of the list of partitions returned by the metastore's `getPartitionsByFilter()` out from `HiveExternalCatalog` into `ExternalCatalogUtils` and calls this new function from `InMemoryCatalog` on the whole list of partitions.
      
      Now that this method is implemented we can always pass the `CatalogTable` to the `DataSource` in `FindDataSourceTable`, so that the latter is resolved to a relation with a `CatalogFileIndex`, which is what the `PruneFileSourcePartitions` rule matches for.
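
      A minimal, self-contained sketch of the post-hoc pruning idea (toy types; the names are assumed and are not Spark's actual API): list every partition, then keep only those whose partition values satisfy the filter.

      ```scala
      case class PartitionSpec(values: Map[String, String])

      def listPartitionsByFilter(
          partitions: Seq[PartitionSpec],
          predicate: Map[String, String] => Boolean): Seq[PartitionSpec] =
        partitions.filter(p => predicate(p.values))

      val parts = Seq(
        PartitionSpec(Map("country" -> "US")),
        PartitionSpec(Map("country" -> "CN")))

      // Prune to country = 'US', as PruneFileSourcePartitions would request.
      listPartitionsByFilter(parts, _.get("country").contains("US"))
      ```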
      
      ## How was this patch tested?
      Ran existing tests and added new test for `listPartitionsByFilter` in `ExternalCatalogSuite`, which is subclassed by both `InMemoryCatalogSuite` and `HiveExternalCatalogSuite`.
      
      Author: Adrian Ionescu <adrian@databricks.com>
      
      Closes #17510 from adrian-ionescu/InMemoryCatalog.
      703c42c3
    • hyukjinkwon's avatar
      [SPARK-19641][SQL] JSON schema inference in DROPMALFORMED mode produces... · 4fa1a43a
      hyukjinkwon authored
      [SPARK-19641][SQL] JSON schema inference in DROPMALFORMED mode produces incorrect schema for non-array/object JSONs
      
      ## What changes were proposed in this pull request?
      
      Currently, when we infer the types for valid JSON strings that are not objects or arrays, we produce empty schemas regardless of parse mode, as below:
      
      ```scala
      scala> spark.read.option("mode", "DROPMALFORMED").json(Seq("""{"a": 1}""", """"a"""").toDS).printSchema()
      root
      ```
      
      ```scala
      scala> spark.read.option("mode", "FAILFAST").json(Seq("""{"a": 1}""", """"a"""").toDS).printSchema()
      root
      ```
      
      This PR proposes to handle parse modes in type inference.
      
      After this PR,
      
      ```scala
      
      scala> spark.read.option("mode", "DROPMALFORMED").json(Seq("""{"a": 1}""", """"a"""").toDS).printSchema()
      root
       |-- a: long (nullable = true)
      ```
      
      ```
      scala> spark.read.option("mode", "FAILFAST").json(Seq("""{"a": 1}""", """"a"""").toDS).printSchema()
      java.lang.RuntimeException: Failed to infer a common schema. Struct types are expected but string was found.
      ```
      
      This PR is based on https://github.com/NathanHowell/spark/commit/e233fd03346a73b3b447fa4c24f3b12c8b2e53ae; NathanHowell and I talked about this in https://issues.apache.org/jira/browse/SPARK-19641
      
      ## How was this patch tested?
      
      Unit tests in `JsonSuite` for both `DROPMALFORMED` and `FAILFAST` modes.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17492 from HyukjinKwon/SPARK-19641.
      4fa1a43a
    • Yuhao Yang's avatar
      [SPARK-19969][ML] Imputer doc and example · 4d28e843
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      
      Add docs and examples for spark.ml.feature.Imputer. Currently Scala and Java examples are included. A Python example will be added after https://github.com/apache/spark/pull/17316
      
      ## How was this patch tested?
      
      local doc generation and example execution
      
      Author: Yuhao Yang <yuhao.yang@intel.com>
      
      Closes #17324 from hhbyyh/imputerdoc.
      4d28e843
    • Denis Bolshakov's avatar
      [SPARK-9002][CORE] KryoSerializer initialization does not include 'Array[Int]' · fb5869f2
      Denis Bolshakov authored
      [SPARK-9002][CORE] KryoSerializer initialization does not include 'Array[Int]'
      
      ## What changes were proposed in this pull request?
      
      Array[Int] has been registered in KryoSerializer.
      The following file has been changed:
      core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala
      
      ## How was this patch tested?
      
      First, the issue was reproduced by a new unit test.
      Then, the issue was fixed so that the failing test passes.
      
      Author: Denis Bolshakov <denis.bolshakov@onefactor.com>
      
      Closes #17482 from dbolshak/SPARK-9002.
      fb5869f2
    • hyukjinkwon's avatar
      [MINOR][DOCS] Replace non-breaking space to normal spaces that breaks rendering markdown · 364b0db7
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      It seems several non-breaking spaces were inserted into several `.md` files, and they appear to break the rendering of those markdown files.
      
      The two characters are different. For example, this can be checked via `python` as below:
      
      ```python
      >>> " "
      '\xc2\xa0'
      >>> " "
      ' '
      ```
      
      _Note that it seems this PR description automatically replaces non-breaking spaces with normal spaces. Please open `vi` and copy and paste the text into `python` to verify this (do not copy the characters from here)._
      
      I checked the output below in Safari and Chrome on Mac OS, and Internet Explorer on Windows 10.
      
      **Before**
      
      ![2017-04-03 12 37 17](https://cloud.githubusercontent.com/assets/6477701/24594655/50aaba02-186a-11e7-80bb-d34b17a3398a.png)
      ![2017-04-03 12 36 57](https://cloud.githubusercontent.com/assets/6477701/24594654/50a855e6-186a-11e7-94e2-661e56544b0f.png)
      
      **After**
      
      ![2017-04-03 12 36 46](https://cloud.githubusercontent.com/assets/6477701/24594657/53c2545c-186a-11e7-9a73-00529afbfd75.png)
      ![2017-04-03 12 36 31](https://cloud.githubusercontent.com/assets/6477701/24594658/53c286c0-186a-11e7-99c9-e66b1f510fe7.png)
      
      ## How was this patch tested?
      
      Manually checking.
      
      These instances were found via
      
      ```
      grep --include=*.scala --include=*.python --include=*.java --include=*.r --include=*.R --include=*.md --include=*.r -r -I " " .
      ```
      
      in Mac OS.
      
      It seems there are several more instances, as below:
      
      ```
      ./docs/sql-programming-guide.md:        │   ├── ...
      ./docs/sql-programming-guide.md:        │   │
      ./docs/sql-programming-guide.md:        │   ├── country=US
      ./docs/sql-programming-guide.md:        │   │   └── data.parquet
      ./docs/sql-programming-guide.md:        │   ├── country=CN
      ./docs/sql-programming-guide.md:        │   │   └── data.parquet
      ./docs/sql-programming-guide.md:        │   └── ...
      ./docs/sql-programming-guide.md:            ├── ...
      ./docs/sql-programming-guide.md:            │
      ./docs/sql-programming-guide.md:            ├── country=US
      ./docs/sql-programming-guide.md:            │   └── data.parquet
      ./docs/sql-programming-guide.md:            ├── country=CN
      ./docs/sql-programming-guide.md:            │   └── data.parquet
      ./docs/sql-programming-guide.md:            └── ...
      ./sql/core/src/test/README.md:│   ├── *.avdl                  # Testing Avro IDL(s)
      ./sql/core/src/test/README.md:│   └── *.avpr                  # !! NO TOUCH !! Protocol files generated from Avro IDL(s)
      ./sql/core/src/test/README.md:│   ├── gen-avro.sh             # Script used to generate Java code for Avro
      ./sql/core/src/test/README.md:│   └── gen-thrift.sh           # Script used to generate Java code for Thrift
      ```
      
      These seem to have been generated via the `tree` command, which inserts non-breaking spaces. They do not appear to cause any problem for rendering within code blocks, so I did not fix them, to avoid the overhead of manually replacing them whenever the files are regenerated via the `tree` command in the future.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17517 from HyukjinKwon/non-breaking-space.
      364b0db7
    • hyukjinkwon's avatar
      [SPARK-20166][SQL] Use XXX for ISO 8601 timezone instead of ZZ (FastDateFormat... · cff11fd2
      hyukjinkwon authored
      [SPARK-20166][SQL] Use XXX for ISO 8601 timezone instead of ZZ (FastDateFormat specific) in CSV/JSON timeformat options
      
      ## What changes were proposed in this pull request?
      
      This PR proposes to use the `XXX` format instead of `ZZ`, which seems to be `FastDateFormat`-specific.
      
      `ZZ` supports "ISO 8601 extended format time zones" but it seems `FastDateFormat` specific option.
      I misunderstood this is compatible format with `SimpleDateFormat` when this change is introduced.
      Please see [SimpleDateFormat documentation]( https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html#iso8601timezone) and [FastDateFormat documentation](https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/time/FastDateFormat.html).
      
      It seems we had better replace `ZZ` with `XXX` because they appear to use the same strategy - [FastDateParser.java#L930](https://github.com/apache/commons-lang/blob/8767cd4f1a6af07093c1e6c422dae8e574be7e5e/src/main/java/org/apache/commons/lang3/time/FastDateParser.java#L930), [FastDateParser.java#L932-L951](https://github.com/apache/commons-lang/blob/8767cd4f1a6af07093c1e6c422dae8e574be7e5e/src/main/java/org/apache/commons/lang3/time/FastDateParser.java#L932-L951) and [FastDateParser.java#L596-L601](https://github.com/apache/commons-lang/blob/8767cd4f1a6af07093c1e6c422dae8e574be7e5e/src/main/java/org/apache/commons/lang3/time/FastDateParser.java#L596-L601).
      
      I also checked the code and manually debugged it to be sure. It seems both cases use the same pattern `( Z|(?:[+-]\\d{2}(?::)\\d{2}))`.
      
      _Note that this should be rather a documentation fix than a behaviour change, because `ZZ` seems to be an invalid date format in `SimpleDateFormat`, as documented in `DataFrameReader` and elsewhere, and both `ZZ` and `XXX` appear to work identically with `FastDateFormat`._
      
      Current documentation is as below:
      
      ```
         * <li>`timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss.SSSZZ`): sets the string that
         * indicates a timestamp format. Custom date formats follow the formats at
         * `java.text.SimpleDateFormat`. This applies to timestamp type.</li>
      ```
      
      ## How was this patch tested?
      
      Existing tests should cover this. Also, manually tested as below (BTW, I don't think these are worth being added as tests within Spark):
      
      **Parse**
      
      ```scala
      scala> new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00")
      res4: java.util.Date = Tue Mar 21 20:00:00 KST 2017
      
      scala>  new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000Z")
      res10: java.util.Date = Tue Mar 21 09:00:00 KST 2017
      
      scala> new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-11:00")
      java.text.ParseException: Unparseable date: "2017-03-21T00:00:00.000-11:00"
        at java.text.DateFormat.parse(DateFormat.java:366)
        ... 48 elided
      scala>  new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000Z")
      java.text.ParseException: Unparseable date: "2017-03-21T00:00:00.000Z"
        at java.text.DateFormat.parse(DateFormat.java:366)
        ... 48 elided
      ```
      
      ```scala
      scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00")
      res7: java.util.Date = Tue Mar 21 20:00:00 KST 2017
      
      scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000Z")
      res1: java.util.Date = Tue Mar 21 09:00:00 KST 2017
      
      scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-11:00")
      res8: java.util.Date = Tue Mar 21 20:00:00 KST 2017
      
      scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000Z")
      res2: java.util.Date = Tue Mar 21 09:00:00 KST 2017
      ```
      
      **Format**
      
      ```scala
      scala> new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").format(new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00"))
      res6: String = 2017-03-21T20:00:00.000+09:00
      ```
      
      ```scala
      scala> val fd = org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ")
      fd: org.apache.commons.lang3.time.FastDateFormat = FastDateFormat[yyyy-MM-dd'T'HH:mm:ss.SSSZZ,ko_KR,Asia/Seoul]
      
      scala> fd.format(fd.parse("2017-03-21T00:00:00.000-11:00"))
      res1: String = 2017-03-21T20:00:00.000+09:00
      
      scala> val fd = org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
      fd: org.apache.commons.lang3.time.FastDateFormat = FastDateFormat[yyyy-MM-dd'T'HH:mm:ss.SSSXXX,ko_KR,Asia/Seoul]
      
      scala> fd.format(fd.parse("2017-03-21T00:00:00.000-11:00"))
      res2: String = 2017-03-21T20:00:00.000+09:00
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17489 from HyukjinKwon/SPARK-20166.
      cff11fd2
    • Bryan Cutler's avatar
      [SPARK-19985][ML] Fixed copy method for some ML Models · 2a903a1e
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      Some ML Models were using `defaultCopy`, which expects a default constructor, and others were not setting the parent estimator. This change fixes these by creating a new instance of the model and explicitly setting values and parent.
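
      A hedged sketch of the pattern applied here (`MyModel` and its fields are assumed, not one of the fixed models): construct a new instance explicitly instead of relying on `defaultCopy`, copy the param values, and set the parent.

      ```scala
      import org.apache.spark.ml.Model
      import org.apache.spark.ml.param.ParamMap
      import org.apache.spark.sql.{DataFrame, Dataset}
      import org.apache.spark.sql.types.StructType

      class MyModel(override val uid: String, val coefficients: Array[Double])
          extends Model[MyModel] {
        override def copy(extra: ParamMap): MyModel = {
          val copied = new MyModel(uid, coefficients)    // no default constructor needed
          copyValues(copied, extra).setParent(parent)    // carry over params and parent
        }
        // Trivial pass-through implementations to keep the sketch compilable.
        override def transform(dataset: Dataset[_]): DataFrame = dataset.toDF()
        override def transformSchema(schema: StructType): StructType = schema
      }
      ```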
      
      ## How was this patch tested?
      Added `MLTestingUtils.checkCopy` to the offending models' tests to verify that the copy is made and the parent is set.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #17326 from BryanCutler/ml-model-copy-error-SPARK-19985.
      2a903a1e
  4. Apr 02, 2017
    • Felix Cheung's avatar
      [SPARK-20159][SPARKR][SQL] Support all catalog API in R · 93dbfe70
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Add a set of catalog API in R
      
      ```
      "currentDatabase",
      "listColumns",
      "listDatabases",
      "listFunctions",
      "listTables",
      "recoverPartitions",
      "refreshByPath",
      "refreshTable",
      "setCurrentDatabase",
      ```
      https://github.com/apache/spark/pull/17483/files#diff-6929e6c5e59017ff954e110df20ed7ff
      
      ## How was this patch tested?
      
      manual tests, unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17483 from felixcheung/rcatalog.
      93dbfe70
    • zuotingbing's avatar
      [SPARK-20173][SQL][HIVE-THRIFTSERVER] Throw NullPointerException when HiveThriftServer2 is shutdown · 657cb954
      zuotingbing authored
      ## What changes were proposed in this pull request?
      
      If the shutdown hook is called before the variable `uiTab` is set, it will throw a NullPointerException.
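
      A minimal, self-contained sketch of the defensive pattern (names assumed, not the actual HiveThriftServer2 code): the shutdown hook guards against the reference not having been set yet.

      ```scala
      object ShutdownGuard {
        @volatile private var uiTab: AnyRef = _    // assigned later, during startup

        def shutdownHook(): Unit =
          Option(uiTab).foreach { tab =>           // no-op if startup never got that far
            println(s"Detaching $tab")
          }
      }
      ```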
      
      ## How was this patch tested?
      
      manual tests
      
      Author: zuotingbing <zuo.tingbing9@zte.com.cn>
      
      Closes #17496 from zuotingbing/SPARK-HiveThriftServer2.
      657cb954
    • zuotingbing's avatar
      [SPARK-20123][BUILD] SPARK_HOME variable might have spaces in it(e.g. $SPARK… · 76de2d11
      zuotingbing authored
      JIRA Issue: https://issues.apache.org/jira/browse/SPARK-20123
      
      ## What changes were proposed in this pull request?
      
      If the $SPARK_HOME or $FWDIR variable contains spaces, then building Spark with "./dev/make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn" will fail.
      
      ## How was this patch tested?
      
      manual tests
      
      Author: zuotingbing <zuo.tingbing9@zte.com.cn>
      
      Closes #17452 from zuotingbing/spark-bulid.
      76de2d11
    • hyukjinkwon's avatar
      [SPARK-20143][SQL] DataType.fromJson should throw an exception with better message · d40cbb86
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Currently, `DataType.fromJson` throws `scala.MatchError` or `java.util.NoSuchElementException` in some cases when the JSON input is invalid as below:
      
      ```scala
      DataType.fromJson(""""abcd"""")
      ```
      
      ```
      java.util.NoSuchElementException: key not found: abcd
        at ...
      ```
      
      ```scala
      DataType.fromJson("""{"abcd":"a"}""")
      ```
      
      ```
      scala.MatchError: JObject(List((abcd,JString(a)))) (of class org.json4s.JsonAST$JObject)
        at ...
      ```
      
      ```scala
      DataType.fromJson("""{"fields": [{"a":123}], "type": "struct"}""")
      ```
      
      ```
      scala.MatchError: JObject(List((a,JInt(123)))) (of class org.json4s.JsonAST$JObject)
        at ...
      ```
      
      After this PR,
      
      ```scala
      DataType.fromJson(""""abcd"""")
      ```
      
      ```
      java.lang.IllegalArgumentException: Failed to convert the JSON string 'abcd' to a data type.
        at ...
      ```
      
      ```scala
      DataType.fromJson("""{"abcd":"a"}""")
      ```
      
      ```
      java.lang.IllegalArgumentException: Failed to convert the JSON string '{"abcd":"a"}' to a data type.
        at ...
      ```
      
      ```scala
      DataType.fromJson("""{"fields": [{"a":123}], "type": "struct"}""")
      ```
      
      ```
      java.lang.IllegalArgumentException: Failed to convert the JSON string '{"a":123}' to a field.
        at ...
      ```
      
      ## How was this patch tested?
      
      Unit test added in `DataTypeSuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17468 from HyukjinKwon/fromjson_exception.
      d40cbb86
  5. Apr 01, 2017
    • wangzhenhua's avatar
      [SPARK-20186][SQL] BroadcastHint should use child's stats · 2287f3d0
      wangzhenhua authored
      ## What changes were proposed in this pull request?
      
      `BroadcastHint` should use child's statistics and set `isBroadcastable` to true.
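
      A minimal, self-contained sketch of the change (a toy `Statistics` with an `isBroadcastable` flag is assumed, mirroring the shape used in Spark 2.x): reuse the child's statistics and flip only the broadcast flag.

      ```scala
      case class Statistics(
          sizeInBytes: BigInt,
          rowCount: Option[BigInt] = None,
          isBroadcastable: Boolean = false)

      // BroadcastHint keeps the child's size and row count instead of defaults.
      def broadcastHintStats(childStats: Statistics): Statistics =
        childStats.copy(isBroadcastable = true)
      ```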
      
      ## How was this patch tested?
      
      Added a new stats estimation test for `BroadcastHint`.
      
      Author: wangzhenhua <wangzhenhua@huawei.com>
      
      Closes #17504 from wzhfy/broadcastHintEstimation.
      2287f3d0
    • Xiao Li's avatar
      [SPARK-19148][SQL][FOLLOW-UP] do not expose the external table concept in Catalog · 89d6822f
      Xiao Li authored
      ### What changes were proposed in this pull request?
      After we renames `Catalog`.`createExternalTable` to `createTable` in the PR: https://github.com/apache/spark/pull/16528, we also need to deprecate the corresponding functions in `SQLContext`.
      
      ### How was this patch tested?
      N/A
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17502 from gatorsmile/deprecateCreateExternalTable.
      89d6822f
    • 郭小龙 10207633's avatar
      [SPARK-20177] Document about compression way has some little detail ch… · cf5963c9
      郭小龙 10207633 authored
      [SPARK-20177] Document about compression way has some little detail changes.
      
      ## What changes were proposed in this pull request?
      
      Small documentation changes about compression:
      1. `spark.eventLog.compress`: add 'Compression will use spark.io.compression.codec.'
      2. `spark.broadcast.compress`: add 'Compression will use spark.io.compression.codec.'
      3. `spark.rdd.compress`: add 'Compression will use spark.io.compression.codec.'
      4. `spark.io.compression.codec`: add a description of event log compression.
      
      For example, from the existing documents one cannot tell which compression codec is used for the event log.
      
      ## How was this patch tested?
      
      manual tests
      
      
      Author: 郭小龙 10207633 <guo.xiaolong1@zte.com.cn>
      
      Closes #17498 from guoxiaolongzte/SPARK-20177.
      cf5963c9
  6. Mar 31, 2017
    • Tathagata Das's avatar
      [SPARK-20165][SS] Resolve state encoder's deserializer in driver in FlatMapGroupsWithStateExec · 567a50ac
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      - The encoder's deserializer must be resolved at the driver, where the class is defined. Otherwise there are corner cases using nested classes where resolving at the executor can fail.
      
      - Fixed a flaky test related to the processing-time timeout. The flakiness is caused by a race condition between the test thread (which adds data to the memory source) and the streaming query thread. When testing with the manual clock, the goal is to add data and increment the clock together atomically, so that a trigger sees new data AND the updated clock simultaneously (both or none). This fix adds additional synchronization when adding data: it makes sure that the streaming query thread is waiting on the manual clock to be incremented (so no batch is currently running) before adding data; see the sketch after this list.
      
      - Added `testQuietly` to some tests that generate a lot of error logs.
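
      A minimal, self-contained sketch (simplified; names are assumed, not the exact StreamManualClock API) of how a manual clock lets the test thread verify that the query thread is parked between triggers before adding data and advancing time together:

      ```scala
      class StreamManualClock(start: Long = 0L) {
        private var now = start
        private var waiting = false

        // Query thread: block between triggers until the clock reaches `target`.
        def waitTillTime(target: Long): Unit = synchronized {
          waiting = true
          while (now < target) wait()
          waiting = false
        }

        // Test thread: advance time, waking the query thread for the next trigger.
        def advance(ms: Long): Unit = synchronized { now += ms; notifyAll() }

        // Test thread: add data only while this is true, so the next trigger sees
        // the new data and the new time atomically.
        def isStreamWaiting: Boolean = synchronized { waiting }
      }
      ```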
      
      ## How was this patch tested?
      Multiple runs on existing unit tests
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #17488 from tdas/SPARK-20165.
      567a50ac
    • Xiao Li's avatar
      [SPARK-20160][SQL] Move ParquetConversions and OrcConversions Out Of HiveSessionCatalog · b2349e6a
      Xiao Li authored
      ### What changes were proposed in this pull request?
      `ParquetConversions` and `OrcConversions` should be treated as regular `Analyzer` rules; it is not reasonable for them to be part of `HiveSessionCatalog`. This PR also combines the two rules `ParquetConversions` and `OrcConversions` into a new rule `RelationConversions`.
      
      After moving these two rules out of HiveSessionCatalog, the next step is to clean up, rename and move `HiveMetastoreCatalog` because it is not related to the hive package any more.
      
      ### How was this patch tested?
      The existing test cases
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17484 from gatorsmile/cleanup.
      b2349e6a
    • Ryan Blue's avatar
      [SPARK-20084][CORE] Remove internal.metrics.updatedBlockStatuses from history files. · c4c03eed
      Ryan Blue authored
      ## What changes were proposed in this pull request?
      
      Remove accumulator updates for internal.metrics.updatedBlockStatuses from SparkListenerTaskEnd entries in the history file. These can cause history files to grow to hundreds of GB because the value of the accumulator contains all tracked blocks.
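
      A minimal sketch (names assumed, not the exact listener code) of the filtering: drop the oversized block-status accumulator from a task-end event's accumulator updates before writing the event to the log.

      ```scala
      object EventLogRedaction {
        val UpdatedBlockStatuses = "internal.metrics.updatedBlockStatuses"

        // updates: (accumulator id, optional accumulator name) pairs from a task end.
        def dropBlockStatuses(
            updates: Seq[(Long, Option[String])]): Seq[(Long, Option[String])] =
          updates.filterNot { case (_, name) => name.contains(UpdatedBlockStatuses) }
      }
      ```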
      
      ## How was this patch tested?
      
      Current History UI tests cover use of the history file.
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #17412 from rdblue/SPARK-20084-remove-block-accumulator-info.
      c4c03eed
    • Kunal Khamar's avatar
      [SPARK-20164][SQL] AnalysisException not tolerant of null query plan. · 254877c2
      Kunal Khamar authored
      ## What changes were proposed in this pull request?
      
      The query plan in an `AnalysisException` may be `null` when an `AnalysisException` object is serialized and then deserialized, since `plan` is marked `transient`. Or when someone throws an `AnalysisException` with a null query plan (which should not happen).
      `def getMessage` is not tolerant of this and throws a `NullPointerException`, leading to loss of information about the original exception.
      The fix is to add a `null` check in `getMessage`.
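
      A hedged sketch of the guard (the field shape is assumed): a `transient` `Option[LogicalPlan]` can itself be `null` after deserialization, so wrap it before appending the plan to the message.

      ```scala
      // `plan` may be null (not just None) after deserialization.
      def buildMessage(message: String, plan: Option[AnyRef]): String = {
        val planAnnotation = Option(plan).flatten.map(p => s";\n$p").getOrElse("")
        message + planAnnotation
      }
      ```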
      
      ## How was this patch tested?
      
      - Unit test
      
      Author: Kunal Khamar <kkhamar@outlook.com>
      
      Closes #17486 from kunalkhamar/spark-20164.
      254877c2
    • Reynold Xin's avatar
      [SPARK-20151][SQL] Account for partition pruning in scan metadataTime metrics · a8a765b3
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      After SPARK-20136, we report metadata timing metrics in the scan operator. However, that timing metric doesn't include one of the most important parts of metadata handling, which is partition pruning. This patch adds that time measurement to the scan metrics.
      
      ## How was this patch tested?
      N/A - I tried adding a test in SQLMetricsSuite but it was extremely convoluted to the point that I'm not sure if this is worth it.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #17476 from rxin/SPARK-20151.
      a8a765b3
  7. Mar 30, 2017
    • Wenchen Fan's avatar
      [SPARK-20121][SQL] simplify NullPropagation with NullIntolerant · c734fc50
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Instead of iterating all expressions that can return null for null inputs, we can just check `NullIntolerant`.
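
      A minimal, self-contained sketch of the idea (toy expression types, not Spark's): a `NullIntolerant` expression returns null whenever any input is null, so it can be folded to a null literal as soon as any child is the null literal.

      ```scala
      sealed trait Expr { def children: Seq[Expr] }
      trait NullIntolerant extends Expr
      case object NullLit extends Expr { val children: Seq[Expr] = Nil }
      case class Add(left: Expr, right: Expr) extends Expr with NullIntolerant {
        val children: Seq[Expr] = Seq(left, right)
      }

      // Fold a null-intolerant expression with a null child to the null literal.
      def propagateNull(e: Expr): Expr = e match {
        case ni: NullIntolerant if ni.children.contains(NullLit) => NullLit
        case other => other
      }
      ```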
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #17450 from cloud-fan/null.
      c734fc50
    • Denis Bolshakov's avatar
      [SPARK-20127][CORE] A few warnings reported by IntelliJ IDEA have been fixed · 5e00a5de
      Denis Bolshakov authored
      ## What changes were proposed in this pull request?
      A few changes related to IntelliJ IDEA inspections.
      
      ## How was this patch tested?
      Changes were tested with existing unit tests.
      
      Author: Denis Bolshakov <denis.bolshakov@onefactor.com>
      
      Closes #17458 from dbolshak/SPARK-20127.
      5e00a5de
    • Seigneurin, Alexis (CONT)'s avatar
      [DOCS][MINOR] Fixed a few typos in the Structured Streaming documentation · 669a11b6
      Seigneurin, Alexis (CONT) authored
      Fixed a few typos.
      
      There is one more I'm not sure of:
      
      ```
              Append mode uses watermark to drop old aggregation state. But the output of a
              windowed aggregation is delayed the late threshold specified in `withWatermark()` as by
              the modes semantics, rows can be added to the Result Table only once after they are
      ```
      
      Not sure how to change `is delayed the late threshold`.
      
      Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com>
      
      Closes #17443 from aseigneurin/typos.
      669a11b6
    • Kent Yao's avatar
      [SPARK-20096][SPARK SUBMIT][MINOR] Expose the right queue name not null if set... · e9d268f6
      Kent Yao authored
      [SPARK-20096][SPARK SUBMIT][MINOR] Expose the right queue name not null if set by --conf or configure file
      
      ## What changes were proposed in this pull request?
      
      While submitting apps with -v or --verbose, we can print the right queue name; but if we set a queue name with `spark.yarn.queue` via --conf or in spark-defaults.conf, we just get `null` for the queue under "Parsed arguments":
      ```
      bin/spark-shell -v --conf spark.yarn.queue=thequeue
      Using properties file: /home/hadoop/spark-2.1.0-bin-apache-hdp2.7.3/conf/spark-defaults.conf
      ....
      Adding default property: spark.yarn.queue=default
      Parsed arguments:
        master                  yarn
        deployMode              client
        ...
        queue                   null
        ....
        verbose                 true
      Spark properties used, including those specified through
       --conf and those from the properties file /home/hadoop/spark-2.1.0-bin-apache-hdp2.7.3/conf/spark-defaults.conf:
        spark.yarn.queue -> thequeue
        ....
      ```
      ## How was this patch tested?
      
      Unit tests and local verification.
      
      Author: Kent Yao <yaooqinn@hotmail.com>
      
      Closes #17430 from yaooqinn/SPARK-20096.
      e9d268f6