  1. Sep 05, 2017
    • Xingbo Jiang's avatar
      [SPARK-21652][SQL] Fix rule confliction between InferFiltersFromConstraints and ConstantPropagation · fd60d4fa
      Xingbo Jiang authored
      ## What changes were proposed in this pull request?
      
      In the example below, the predicate added by `InferFiltersFromConstraints` is later folded away by `ConstantPropagation`, so the optimizer iteration never converges:
      ```
      Seq((1, 1)).toDF("col1", "col2").createOrReplaceTempView("t1")
      Seq(1, 2).toDF("col").createOrReplaceTempView("t2")
      sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 = t2.col AND t1.col2 = t2.col")
      ```
      
      We can fix this by adjusting the order of the optimizer rules.
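The non-convergence can be sketched with a toy fixed-point loop in plain Python (the rule bodies and predicate strings below are invented for illustration; this is not Spark code):

```python
# Toy model of two optimizer rules that undo each other, so the
# fixed-point iteration never converges on its own.
def infer_filters(preds):
    # infer "t2.col = 1" from t1.col1 = 1 and t1.col1 = t2.col
    if {"t1.col1 = 1", "t1.col1 = t2.col"} <= preds:
        return preds | {"t2.col = 1"}
    return preds

def constant_propagation(preds):
    # constant folding drops the inferred predicate again
    return preds - {"t2.col = 1"}

def optimize(preds, max_iters=10):
    for i in range(1, max_iters + 1):
        changed = False
        for rule in (infer_filters, constant_propagation):
            new = rule(preds)
            changed = changed or new != preds
            preds = new
        if not changed:
            return preds, i   # reached a fixed point
    return preds, None        # hit the iteration limit: unconverged

plan, iters = optimize({"t1.col1 = 1", "t1.col1 = t2.col"})
```

Each pass changes the plan (one rule adds the inferred predicate, the other removes it), so the loop only stops at the iteration limit.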
      
      ## How was this patch tested?
      
      Add test case that would have failed in `SQLQuerySuite`.
      
      Author: Xingbo Jiang <xingbo.jiang@databricks.com>
      
      Closes #19099 from jiangxb1987/unconverge-optimization.
      fd60d4fa
    • Burak Yavuz's avatar
      [SPARK-21925] Update trigger interval documentation in docs with behavior change in Spark 2.2 · 8c954d2c
      Burak Yavuz authored
      Forgot to update docs with behavior change.
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #19138 from brkyvz/trigger-doc-fix.
      8c954d2c
    • gatorsmile's avatar
      [SPARK-21845][SQL][TEST-MAVEN] Make codegen fallback of expressions configurable · 2974406d
      gatorsmile authored
      ## What changes were proposed in this pull request?
      We should make the codegen fallback of expressions configurable. So far it is always on, which might hide compilation bugs in our codegen. Thus, we should also disable the codegen fallback when running test cases.
      
      ## How was this patch tested?
      Added test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19119 from gatorsmile/fallbackCodegen.
      2974406d
    • hyukjinkwon's avatar
      [SPARK-20978][SQL] Bump up Univocity version to 2.5.4 · 02a4386a
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      There was a bug in the Univocity parser that caused the issue in SPARK-20978, reproducible as below:
      
      ```scala
      val df = spark.read.schema("a string, b string, unparsed string").option("columnNameOfCorruptRecord", "unparsed").csv(Seq("a").toDS())
      df.show()
      ```
      
      **Before**
      
      ```
      java.lang.NullPointerException
      	at scala.collection.immutable.StringLike$class.stripLineEnd(StringLike.scala:89)
      	at scala.collection.immutable.StringOps.stripLineEnd(StringOps.scala:29)
      	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$getCurrentInput(UnivocityParser.scala:56)
      	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207)
      	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207)
      ...
      ```
      
      **After**
      
      ```
      +---+----+--------+
      |  a|   b|unparsed|
      +---+----+--------+
      |  a|null|       a|
      +---+----+--------+
      ```
      
      The bug was fixed in 2.5.0, and 2.5.4 has since been released, so it should be safe to upgrade.
      
      ## How was this patch tested?
      
      Unit test added in `CSVSuite.scala`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #19113 from HyukjinKwon/bump-up-univocity.
      02a4386a
    • hyukjinkwon's avatar
      [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0. · 7f3c6ff4
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      1.0.0 fixes issues with import ordering, explicit types for public methods, the line-length limit, and comment validation:
      
      ```
      [error] .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/Main.scala:50:16: Are you sure you want to println? If yes, wrap the code block with
      [error]       // scalastyle:off println
      [error]       println(...)
      [error]       // scalastyle:on println
      [error] .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala:49: File line length exceeds 100 characters
      [error] .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala:22:21: Are you sure you want to println? If yes, wrap the code block with
      [error]       // scalastyle:off println
      [error]       println(...)
      [error]       // scalastyle:on println
      [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:35:6: Public method must have explicit type
      [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:51:6: Public method must have explicit type
      [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:93:15: Public method must have explicit type
      [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:98:15: Public method must have explicit type
      [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:47:2: Insert a space after the start of the comment
      [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:26:43: JavaDStream should come before JavaDStreamLike.
      ```
      
      This PR also fixes the workaround added in SPARK-16877 for `org.scalastyle.scalariform.OverrideJavaChecker` feature, added from 0.9.0.
      
      ## How was this patch tested?
      
      Manually tested.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #19116 from HyukjinKwon/scalastyle-1.0.0.
      7f3c6ff4
    • Dongjoon Hyun's avatar
      [SPARK-21913][SQL][TEST] `withDatabase` should drop database with CASCADE · 4e7a29ef
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, `withDatabase` fails if the database is not empty. It would be great if we drop cleanly with CASCADE.
      
      ## How was this patch tested?
      
      This is a change on test util. Pass the existing Jenkins.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #19125 from dongjoon-hyun/SPARK-21913.
      4e7a29ef
  2. Sep 04, 2017
    • Sean Owen's avatar
      [SPARK-21418][SQL] NoSuchElementException: None.get in DataSourceScanExec with... · ca59445a
      Sean Owen authored
      [SPARK-21418][SQL] NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true
      
      ## What changes were proposed in this pull request?
      
      If no SparkConf is available to Utils.redact, simply don't redact.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #19123 from srowen/SPARK-21418.
      ca59445a
  3. Sep 03, 2017
    • Liang-Chi Hsieh's avatar
      [SPARK-21654][SQL] Complement SQL predicates expression description · 9f30d928
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      SQL predicates do not have complete expression descriptions. This patch complements the descriptions by adding arguments and examples.
      
      This change also adds related test cases for the SQL predicate expressions.
      
      ## How was this patch tested?
      
      Existing tests. And added predicate test.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #18869 from viirya/SPARK-21654.
      9f30d928
    • hyukjinkwon's avatar
      [SPARK-21897][PYTHON][R] Add unionByName API to DataFrame in Python and R · 07fd68a2
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to add a wrapper for `unionByName` API to R and Python as well.
      
      **Python**
      
      ```python
      df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
      df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col0"])
      df1.unionByName(df2).show()
      ```
      
      ```
      +----+----+----+
      |col0|col1|col2|
      +----+----+----+
      |   1|   2|   3|
      |   6|   4|   5|
      +----+----+----+
      ```
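The by-name alignment that produces the second row above can be sketched in a few lines of plain Python (illustrative only, not the PySpark implementation):

```python
def union_by_name(cols1, rows1, cols2, rows2):
    # reorder each row of the second table to match the first table's columns
    idx = [cols2.index(c) for c in cols1]
    return cols1, rows1 + [[row[i] for i in idx] for row in rows2]

cols, rows = union_by_name(
    ["col0", "col1", "col2"], [[1, 2, 3]],
    ["col1", "col2", "col0"], [[4, 5, 6]],
)
# rows is [[1, 2, 3], [6, 4, 5]]
```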
      
      **R**
      
      ```R
      df1 <- select(createDataFrame(mtcars), "carb", "am", "gear")
      df2 <- select(createDataFrame(mtcars), "am", "gear", "carb")
      head(unionByName(limit(df1, 2), limit(df2, 2)))
      ```
      
      ```
        carb am gear
      1    4  1    4
      2    4  1    4
      3    4  1    4
      4    4  1    4
      ```
      
      ## How was this patch tested?
      
      Doctests for Python and unit test added in `test_sparkSQL.R` for R.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #19105 from HyukjinKwon/unionByName-r-python.
      07fd68a2
  4. Sep 02, 2017
    • gatorsmile's avatar
      [SPARK-21891][SQL] Add TBLPROPERTIES to DDL statement: CREATE TABLE USING · acb7fed2
      gatorsmile authored
      ## What changes were proposed in this pull request?
      Add `TBLPROPERTIES` to the DDL statement `CREATE TABLE USING`.
      
      After this change, the DDL becomes
      ```
      CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name
      USING table_provider
      [OPTIONS table_property_list]
      [PARTITIONED BY (col_name, col_name, ...)]
      [CLUSTERED BY (col_name, col_name, ...)
       [SORTED BY (col_name [ASC|DESC], ...)]
       INTO num_buckets BUCKETS
      ]
      [LOCATION path]
      [COMMENT table_comment]
      [TBLPROPERTIES (property_name=property_value, ...)]
      [[AS] select_statement];
      ```
      
      ## How was this patch tested?
      Add a few tests
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19100 from gatorsmile/addTablePropsToCreateTableUsing.
      acb7fed2
  5. Sep 01, 2017
    • WeichenXu's avatar
      [SPARK-21729][ML][TEST] Generic test for ProbabilisticClassifier to ensure... · 900f14f6
      WeichenXu authored
      [SPARK-21729][ML][TEST] Generic test for ProbabilisticClassifier to ensure consistent output columns
      
      ## What changes were proposed in this pull request?
      
      Add test for prediction using the model with all combinations of output columns turned on/off.
      Make sure the output column values match, presumably by comparing vs. the case with all 3 output columns turned on.
      
      ## How was this patch tested?
      
      Test updated.
      
      Author: WeichenXu <weichen.xu@databricks.com>
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #19065 from WeichenXu123/generic_test_for_prob_classifier.
      900f14f6
    • gatorsmile's avatar
      [SPARK-21895][SQL] Support changing database in HiveClient · aba9492d
      gatorsmile authored
      ## What changes were proposed in this pull request?
      Support moving tables across databases in HiveClient's `alterTable`.
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19104 from gatorsmile/alterTable.
      aba9492d
    • Sean Owen's avatar
      [SPARK-14280][BUILD][WIP] Update change-version.sh and pom.xml to add Scala... · 12ab7f7e
      Sean Owen authored
      [SPARK-14280][BUILD][WIP] Update change-version.sh and pom.xml to add Scala 2.12 profiles and enable 2.12 compilation
      
      …build; fix some things that will be warnings or errors in 2.12; restore Scala 2.12 profile infrastructure
      
      ## What changes were proposed in this pull request?
      
      This change adds back the infrastructure for a Scala 2.12 build, but does not enable it in the release or Python test scripts.
      
      In order to make that meaningful, it also resolves compile errors that the code hits in 2.12 only, in a way that still works with 2.11.
      
      It also updates dependencies to the earliest minor release of dependencies whose current version does not yet support Scala 2.12. This is in a sense covered by other JIRAs under the main umbrella, but implemented here. The versions below still work with 2.11, and are the _latest_ maintenance release in the _earliest_ viable minor release.
      
      - Scalatest 2.x -> 3.0.3
      - Chill 0.8.0 -> 0.8.4
      - Clapper 1.0.x -> 1.1.2
      - json4s 3.2.x -> 3.4.2
      - Jackson 2.6.x -> 2.7.9 (required by json4s)
      
      This change does _not_ fully enable a Scala 2.12 build:
      
      - It will also require dropping support for Kafka before 0.10. Easy enough, just didn't do it yet here
      - It will require recreating `SparkILoop` and `Main` for REPL 2.12, which is SPARK-14650. Possible to do here too.
      
      What it does do is make changes that resolve much of the remaining gap without affecting the current 2.11 build.
      
      ## How was this patch tested?
      
      Existing tests and build. Manually tested with `./dev/change-scala-version.sh 2.12` to verify it compiles, modulo the exceptions above.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #18645 from srowen/SPARK-14280.
      12ab7f7e
    • he.qiao's avatar
      [SPARK-21880][WEB UI] In the SQL table page, modify jobs trace information · 12f0d242
      he.qiao authored
      ## What changes were proposed in this pull request?
      As shown below, when job 5 is running, the "jobs" label can be misread as five jobs running, so it is clearer to change "jobs" to "job id".
      ![image](https://user-images.githubusercontent.com/21355020/29909612-4dc85064-8e59-11e7-87cd-275a869243bb.png)
      
      ## How was this patch tested?
      no need
      
      Author: he.qiao <he.qiao17@zte.com.cn>
      
      Closes #19093 from Geek-He/08_31_sqltable.
      12f0d242
    • Marcelo Vanzin's avatar
      [SPARK-21728][CORE] Follow up: fix user config, auth in SparkSubmit logging. · 0bdbefe9
      Marcelo Vanzin authored
      - SecurityManager complains when auth is enabled but no secret is defined;
        SparkSubmit doesn't use the auth functionality of the SecurityManager,
        so use a dummy secret to work around the exception.
      
      - Only reset the log4j configuration when Spark was the one initializing
        it, otherwise user-defined log configuration may be lost.
      
      Tested with the log config file posted to the bug, on a secured YARN cluster.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #19089 from vanzin/SPARK-21728.
      0bdbefe9
  6. Aug 31, 2017
    • hyukjinkwon's avatar
      [SPARK-21789][PYTHON] Remove obsolete codes for parsing abstract schema strings · 648a8626
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to remove private functions that look not used in the main codes, `_split_schema_abstract`, `_parse_field_abstract`, `_parse_schema_abstract` and `_infer_schema_type`.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18647 from HyukjinKwon/remove-abstract.
      648a8626
    • hyukjinkwon's avatar
      [SPARK-21779][PYTHON] Simpler DataFrame.sample API in Python · 5cd8ea99
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR makes `DataFrame.sample(...)` work with `withReplacement` omitted, defaulting it to `False`, consistent with the equivalent Scala / Java API.
      
      In short, the following examples are allowed:
      
      ```python
      >>> df = spark.range(10)
      >>> df.sample(0.5).count()
      7
      >>> df.sample(fraction=0.5).count()
      3
      >>> df.sample(0.5, seed=42).count()
      5
      >>> df.sample(fraction=0.5, seed=42).count()
      5
      ```
      
      In addition, this PR also adds some type-checking logic, as below:
      
      ```python
      >>> df = spark.range(10)
      >>> df.sample().count()
      ...
      TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [].
      >>> df.sample(True).count()
      ...
      TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'bool'>].
      >>> df.sample(42).count()
      ...
      TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'int'>].
      >>> df.sample(fraction=False, seed="a").count()
      ...
      TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'bool'>, <type 'str'>].
      >>> df.sample(seed=[1]).count()
      ...
      TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'list'>].
      >>> df.sample(withReplacement="a", fraction=0.5, seed=1)
      ...
      TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'str'>, <type 'float'>, <type 'int'>].
      ```
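The argument handling can be sketched standalone in plain Python (a simplified model; this is not the actual PySpark source):

```python
def sample(withReplacement=None, fraction=None, seed=None):
    # allow sample(0.5): shift a lone float into `fraction`
    if fraction is None and isinstance(withReplacement, float):
        withReplacement, fraction = None, withReplacement
    valid = (
        (withReplacement is None or isinstance(withReplacement, bool))
        and isinstance(fraction, float)
        and (seed is None or isinstance(seed, int))
    )
    if not valid:
        got = [type(a).__name__
               for a in (withReplacement, fraction, seed) if a is not None]
        raise TypeError(
            "withReplacement (optional), fraction (required) and seed "
            "(optional) should be a bool, float and number; "
            "however, got %s." % got)
    return withReplacement, fraction, seed
```

A lone positional float is reinterpreted as the fraction, and every other shape is validated up front with an error listing the types actually received.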
      
      ## How was this patch tested?
      
      Manually tested, unit tests added in doc tests and manually checked the built documentation for Python.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18999 from HyukjinKwon/SPARK-21779.
      5cd8ea99
    • WeichenXu's avatar
      [SPARK-21862][ML] Add overflow check in PCA · f5e10a34
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Add an overflow check in PCA; otherwise it is possible to throw `NegativeArraySizeException` when `k` and `numFeatures` are too large.
      The overflow checking formula is here:
      https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala#L87
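The shape of such a guard, in plain Python (the size expression here is hypothetical and merely stands in for the breeze formula linked above): compute the required length in 64-bit arithmetic and reject it before a JVM `Int` allocation can go negative.

```python
INT_MAX = 2**31 - 1  # JVM Int.MaxValue

def check_work_array(num_features, k):
    # hypothetical work-array size expression (illustration only)
    small = min(num_features, k)
    required = 3 * small + max(3 * small * small, max(num_features, k))
    if required > INT_MAX:
        raise ValueError(
            "k and numFeatures are too large: required work array of "
            "%d elements exceeds Int.MaxValue" % required)
    return required
```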
      
      ## How was this patch tested?
      
      N/A
      
      Author: WeichenXu <weichen.xu@databricks.com>
      
      Closes #19078 from WeichenXu123/SVD_overflow_check.
      f5e10a34
    • WeichenXu's avatar
      [SPARK-17139][ML][FOLLOW-UP] Add convenient method `asBinary` for casting to... · 96028e36
      WeichenXu authored
      [SPARK-17139][ML][FOLLOW-UP] Add convenient method `asBinary` for casting to BinaryLogisticRegressionSummary
      
      ## What changes were proposed in this pull request?
      
      Add an `asBinary` method to `LogisticRegressionSummary` for convenient casting to `BinaryLogisticRegressionSummary`.
      
      ## How was this patch tested?
      
      Testcase updated.
      
      Author: WeichenXu <weichen.xu@databricks.com>
      
      Closes #19072 from WeichenXu123/mlor_summary_as_binary.
      96028e36
    • Andrew Ray's avatar
      [SPARK-21110][SQL] Structs, arrays, and other orderable datatypes should be usable in inequalities · cba69aeb
      Andrew Ray authored
      ## What changes were proposed in this pull request?
      
      Allows `BinaryComparison` operators to work on any data type that actually supports ordering as verified by `TypeUtils.checkForOrderingExpr` instead of relying on the incomplete list `TypeCollection.Ordered` (which is removed by this PR).
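As an analogy, Python's built-in composite types already order this way, which mirrors the semantics this change gives SQL structs and arrays in `<`, `<=`, `>`, `>=`:

```python
# struct-like values compare field by field, in declaration order
assert (1, "a") < (1, "b")
# array-like values compare lexicographically
assert [1, 2] < [1, 3]
assert [1, 2] < [1, 2, 0]   # a strict prefix sorts first
# and the orderings compose through nesting
assert ([1], (2, "x")) < ([1], (2, "y"))
```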
      
      ## How was this patch tested?
      
      Updated unit tests to cover structs and arrays.
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #18818 from aray/SPARK-21110.
      cba69aeb
    • gatorsmile's avatar
      [SPARK-17107][SQL][FOLLOW-UP] Remove redundant pushdown rule for Union · 7ce11082
      gatorsmile authored
      ## What changes were proposed in this pull request?
      Also remove useless function `partitionByDeterministic` after the changes of https://github.com/apache/spark/pull/14687
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19097 from gatorsmile/followupSPARK-17107.
      7ce11082
    • Bryan Cutler's avatar
      [SPARK-21583][HOTFIX] Removed intercept in test causing failures · 501370d9
      Bryan Cutler authored
      Removing a check in ColumnarBatchSuite that depended on a Java assertion. This assertion is compiled out in the Maven builds, causing the test to fail. This part of the test is not specifically related to the functionality being tested here.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #19098 from BryanCutler/hotfix-ColumnarBatchSuite-assertion.
      501370d9
    • ArtRand's avatar
      [SPARK-20812][MESOS] Add secrets support to the dispatcher · fc45c2c8
      ArtRand authored
      Mesos has secrets primitives for environment- and file-based secrets; this PR adds that functionality to the Spark dispatcher along with the appropriate configuration flags.
      Unit tested and manually tested against a DC/OS cluster with Mesos 1.4.
      
      Author: ArtRand <arand@soe.ucsc.edu>
      
      Closes #18837 from ArtRand/spark-20812-dispatcher-secrets-and-labels.
      fc45c2c8
    • Jacek Laskowski's avatar
      [SPARK-21886][SQL] Use SparkSession.internalCreateDataFrame to create… · 9696580c
      Jacek Laskowski authored
      … Dataset with LogicalRDD logical operator
      
      ## What changes were proposed in this pull request?
      
      Reusing `SparkSession.internalCreateDataFrame` wherever possible (to cut dups)
      
      ## How was this patch tested?
      
      Local build and waiting for Jenkins
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #19095 from jaceklaskowski/SPARK-21886-internalCreateDataFrame.
      9696580c
    • gatorsmile's avatar
      [SPARK-21878][SQL][TEST] Create SQLMetricsTestUtils · 19b0240d
      gatorsmile authored
      ## What changes were proposed in this pull request?
      Creates `SQLMetricsTestUtils` for the utility functions of both Hive-specific and the other SQLMetrics test cases.
      
      Also, move two SQLMetrics test cases from sql/hive to sql/core.
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19092 from gatorsmile/rewriteSQLMetrics.
      19b0240d
  7. Aug 30, 2017
    • Bryan Cutler's avatar
      [SPARK-21583][SQL] Create a ColumnarBatch from ArrowColumnVectors · 964b507c
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      
      This PR allows the creation of a `ColumnarBatch` from `ReadOnlyColumnVectors` where previously a columnar batch could only allocate vectors internally.  This is useful for using `ArrowColumnVectors` in a batch form to do row-based iteration.  Also added `ArrowConverter.fromPayloadIterator` which converts `ArrowPayload` iterator to `InternalRow` iterator and uses a `ColumnarBatch` internally.
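Illustratively (plain Python, not the Java/Scala API), a columnar batch is a set of equal-length column vectors plus row-wise iteration over them:

```python
def iterate_rows(columns):
    # every column vector must have the same number of entries
    n = len(columns[0])
    assert all(len(col) == n for col in columns)
    for i in range(n):
        yield tuple(col[i] for col in columns)

batch = [["a", "b"], [1, 2]]          # two column vectors
rows = list(iterate_rows(batch))      # row-based view of the columns
```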
      
      ## How was this patch tested?
      
      Added a new unit test for creating a `ColumnarBatch` with `ReadOnlyColumnVectors` and a test to verify the roundtrip of rows -> ArrowPayload -> rows, using `toPayloadIterator` and `fromPayloadIterator`.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #18787 from BryanCutler/arrow-ColumnarBatch-support-SPARK-21583.
      964b507c
    • Liang-Chi Hsieh's avatar
      [SPARK-21534][SQL][PYSPARK] PickleException when creating dataframe from... · ecf437a6
      Liang-Chi Hsieh authored
      [SPARK-21534][SQL][PYSPARK] PickleException when creating dataframe from python row with empty bytearray
      
      ## What changes were proposed in this pull request?
      
      `PickleException` is thrown when creating dataframe from python row with empty bytearray
      
          spark.createDataFrame(spark.sql("select unhex('') as xx").rdd.map(lambda x: {"abc": x.xx})).show()
      
          net.razorvine.pickle.PickleException: invalid pickle data for bytearray; expected 1 or 2 args, got 0
          	at net.razorvine.pickle.objects.ByteArrayConstructor.construct(ByteArrayConstructor.java
              ...
      
      `ByteArrayConstructor` doesn't deal with empty byte array pickled by Python3.
      
      ## How was this patch tested?
      
      Added test.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #19085 from viirya/SPARK-21534.
      ecf437a6
    • jerryshao's avatar
      [SPARK-17321][YARN] Avoid writing shuffle metadata to disk if NM recovery is disabled · 4482ff23
      jerryshao authored
      In the current code, if NM recovery is not enabled then `YarnShuffleService` writes shuffle metadata to NM local dir-1; if that local dir is on a bad disk, `YarnShuffleService` fails to start. To solve this, on the Spark side we no longer persist data into leveldb when NM recovery is disabled. The YARN shuffle service can still serve shuffle data but loses the ability to recover, which is fine because without NM recovery the failure of an NM kills the containers as well as the applications.
      
      Tested in a local cluster with NM recovery off and on to see whether the folder is created. A MiniCluster UT isn't added because in MiniCluster the NM always sets its port to 0, but NM recovery requires a non-ephemeral port.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #19032 from jerryshao/SPARK-17321.
      
      Change-Id: I8f2fe73d175e2ad2c4e380caede3873e0192d027
      4482ff23
    • Xiaofeng Lin's avatar
      [SPARK-11574][CORE] Add metrics StatsD sink · cd5d0f33
      Xiaofeng Lin authored
      This patch adds statsd sink to the current metrics system in spark core.
      
      Author: Xiaofeng Lin <xlin@twilio.com>
      
      Closes #9518 from xflin/statsd.
      
      Change-Id: Ib8720e86223d4a650df53f51ceb963cd95b49a44
      cd5d0f33
    • Andrew Ash's avatar
      [SPARK-21875][BUILD] Fix Java style bugs · 313c6ca4
      Andrew Ash authored
      ## What changes were proposed in this pull request?
      
      Fix Java code style so `./dev/lint-java` succeeds
      
      ## How was this patch tested?
      
      Run `./dev/lint-java`
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #19088 from ash211/spark-21875-lint-java.
      313c6ca4
    • Dongjoon Hyun's avatar
      [SPARK-21839][SQL] Support SQL config for ORC compression · d8f45408
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR aims to support `spark.sql.orc.compression.codec` like Parquet's `spark.sql.parquet.compression.codec`. Users can use SQLConf to control ORC compression, too.
      
      ## How was this patch tested?
      
      Pass the Jenkins with new and updated test cases.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #19055 from dongjoon-hyun/SPARK-21839.
      d8f45408
    • Sital Kedia's avatar
      [SPARK-21834] Incorrect executor request in case of dynamic allocation · 6949a9c5
      Sital Kedia authored
      ## What changes were proposed in this pull request?
      
      The `killExecutor` API currently does not allow killing an executor without updating the total number of executors needed. When dynamic allocation is turned on and the allocator tries to kill an executor, the scheduler also reduces the total number of executors needed (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L635), which is incorrect because the allocator already takes care of setting the required number of executors itself.
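A toy model of the accounting bug (invented names, not the Spark API): the allocator has already lowered its own target, so the kill path must not lower it again.

```python
class Backend:
    def __init__(self, target):
        self.target = target  # total number of executors needed

    def kill_executor(self, adjust_target):
        # adjust_target should be True only for explicit user-driven kills
        if adjust_target:
            self.target -= 1

buggy = Backend(target=10)
buggy.target -= 1                         # allocator lowers its target to 9
buggy.kill_executor(adjust_target=True)   # bug: target drops again, to 8

fixed = Backend(target=10)
fixed.target -= 1                         # allocator lowers its target to 9
fixed.kill_executor(adjust_target=False)  # fix: target stays at 9
```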
      
      ## How was this patch tested?
      
      Ran a job on the cluster and made sure the executor request is correct
      
      Author: Sital Kedia <skedia@fb.com>
      
      Closes #19081 from sitalkedia/skedia/oss_fix_executor_allocation.
      6949a9c5
    • caoxuewen's avatar
      [MINOR][SQL][TEST] Test shuffle hash join while is not expected · 235d2833
      caoxuewen authored
      ## What changes were proposed in this pull request?
      
      `ignore("shuffle hash join")` is meant to exercise _case class ShuffledHashJoinExec_, but when the 'ignore' is changed to 'test', the join actually tested is _case class BroadcastHashJoinExec_.
      
      Before the change, this happened because `canBroadcast` was true.
      Print information in _canBroadcast(plan: LogicalPlan)_
      ```
      canBroadcast plan.stats.sizeInBytes:6710880
      canBroadcast conf.autoBroadcastJoinThreshold:10000000
      ```
      
      After the change, `plan.stats.sizeInBytes` is 11184808.
      Print information in _canBuildLocalHashMap(plan: LogicalPlan)_
      and _muchSmaller(a: LogicalPlan, b: LogicalPlan)_ :
      
      ```
      canBuildLocalHashMap plan.stats.sizeInBytes:11184808
      canBuildLocalHashMap conf.autoBroadcastJoinThreshold:10000000
      canBuildLocalHashMap conf.numShufflePartitions:2
      ```
      ```
      muchSmaller a.stats.sizeInBytes * 3:33554424
      muchSmaller b.stats.sizeInBytes:33554432
      ```
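Cross-checking the printed sizes against the planner's conditions (simplified forms of Spark's join-selection checks):

```python
threshold = 10_000_000   # autoBroadcastJoinThreshold
partitions = 2           # numShufflePartitions

# before: small enough to broadcast, so BroadcastHashJoinExec is picked
assert 6_710_880 <= threshold

# after: too big to broadcast ...
assert 11_184_808 > threshold
# ... but small enough to build a local hash map per partition
assert 11_184_808 < threshold * partitions
# and one side is "much smaller" than the other (a * 3 <= b)
assert 33_554_424 <= 33_554_432
```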
      ## How was this patch tested?
      
      existing test case.
      
      Author: caoxuewen <cao.xuewen@zte.com.cn>
      
      Closes #19069 from heary-cao/shuffle_hash_join.
      235d2833
    • gatorsmile's avatar
      32d6d9d7
    • Bryan Cutler's avatar
      [SPARK-21469][ML][EXAMPLES] Adding Examples for FeatureHasher · 4133c1b0
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      
      This PR adds ML examples for the FeatureHasher transform in Scala, Java, Python.
      
      ## How was this patch tested?
      
      Manually ran examples and verified that output is consistent for different APIs
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #19024 from BryanCutler/ml-examples-FeatureHasher-SPARK-21810.
      4133c1b0
    • hyukjinkwon's avatar
      [SPARK-21764][TESTS] Fix tests failures on Windows: resources not being closed and incorrect paths · b30a11a6
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      `org.apache.spark.deploy.RPackageUtilsSuite`
      
      ```
       - jars without manifest return false *** FAILED *** (109 milliseconds)
         java.io.IOException: Unable to delete file: C:\projects\spark\target\tmp\1500266936418-0\dep1-c.jar
      ```
      
      `org.apache.spark.deploy.SparkSubmitSuite`
      
      ```
       - download one file to local *** FAILED *** (16 milliseconds)
         java.net.URISyntaxException: Illegal character in authority at index 6: s3a://C:\projects\spark\target\tmp\test2630198944759847458.jar
      
       - download list of files to local *** FAILED *** (0 milliseconds)
         java.net.URISyntaxException: Illegal character in authority at index 6: s3a://C:\projects\spark\target\tmp\test2783551769392880031.jar
      ```
      
      `org.apache.spark.scheduler.ReplayListenerSuite`
      
      ```
       - Replay compressed inprogress log file succeeding on partial read (156 milliseconds)
         Exception encountered when attempting to run a suite with class name:
         org.apache.spark.scheduler.ReplayListenerSuite *** ABORTED *** (1 second, 391 milliseconds)
         java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-8f3cacd6-faad-4121-b901-ba1bba8025a0
      
       - End-to-end replay *** FAILED *** (62 milliseconds)
         java.io.IOException: No FileSystem for scheme: C
      
       - End-to-end replay with compression *** FAILED *** (110 milliseconds)
         java.io.IOException: No FileSystem for scheme: C
      ```
      
      `org.apache.spark.sql.hive.StatisticsSuite`
      
      ```
       - SPARK-21079 - analyze table with location different than that of individual partitions *** FAILED *** (875 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - SPARK-21079 - analyze partitioned table with only a subset of partitions visible *** FAILED *** (47 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      ```
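The failures above share one root cause: a raw Windows path embedded in a URI string is not valid URI syntax, so the drive letter ends up parsed as a scheme or authority. A minimal sketch of the failure mode and the usual remedy (hypothetical helpers, not the actual patch):

```scala
import java.io.File
import java.net.{URI, URISyntaxException}

// A raw Windows path inside a URI string is rejected by java.net.URI --
// the backslashes (and the drive-letter colon in the authority position)
// are illegal characters, matching the URISyntaxException seen above.
def isValidUri(s: String): Boolean =
  try { new URI(s); true } catch { case _: URISyntaxException => false }

// Building the URI from a java.io.File lets Java escape the path correctly,
// yielding a well-formed file: URI on every platform.
def fileUri(path: String): URI = new File(path).toURI
```
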
      
      **Note:** this PR does not fix:
      
      `org.apache.spark.deploy.SparkSubmitSuite`
      
      ```
       - launch simple application with spark-submit with redaction *** FAILED *** (172 milliseconds)
         java.util.NoSuchElementException: next on empty iterator
      ```
      
I can't reproduce this on my Windows machine, but it appears to fail consistently on AppVeyor. The cause is still unclear to me and hard to debug, so I did not include a fix for it here.
      
**Note:** there appear to be more instances, but they are hard to identify, partly due to flakiness and partly due to noisy logs and errors. I will probably make another pass later.
      
      ## How was this patch tested?
      
      Manually via AppVeyor:
      
      **Before**
      
      - `org.apache.spark.deploy.RPackageUtilsSuite`: https://ci.appveyor.com/project/spark-test/spark/build/771-windows-fix/job/8t8ra3lrljuir7q4
      - `org.apache.spark.deploy.SparkSubmitSuite`: https://ci.appveyor.com/project/spark-test/spark/build/771-windows-fix/job/taquy84yudjjen64
      - `org.apache.spark.scheduler.ReplayListenerSuite`: https://ci.appveyor.com/project/spark-test/spark/build/771-windows-fix/job/24omrfn2k0xfa9xq
      - `org.apache.spark.sql.hive.StatisticsSuite`: https://ci.appveyor.com/project/spark-test/spark/build/771-windows-fix/job/2079y1plgj76dc9l
      
      **After**
      
      - `org.apache.spark.deploy.RPackageUtilsSuite`: https://ci.appveyor.com/project/spark-test/spark/build/775-windows-fix/job/3803dbfn89ne1164
      - `org.apache.spark.deploy.SparkSubmitSuite`: https://ci.appveyor.com/project/spark-test/spark/build/775-windows-fix/job/m5l350dp7u9a4xjr
      - `org.apache.spark.scheduler.ReplayListenerSuite`: https://ci.appveyor.com/project/spark-test/spark/build/775-windows-fix/job/565vf74pp6bfdk18
      - `org.apache.spark.sql.hive.StatisticsSuite`: https://ci.appveyor.com/project/spark-test/spark/build/775-windows-fix/job/qm78tsk8c37jb6s4
      
      Jenkins tests are required and AppVeyor tests will be triggered.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18971 from HyukjinKwon/windows-fixes.
      b30a11a6
    • Sean Owen's avatar
      [SPARK-21806][MLLIB] BinaryClassificationMetrics pr(): first point (0.0, 1.0) is misleading · 734ed7a7
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
Prepend (0, p) to the precision-recall curve instead of (0, 1), where p is the precision at the lowest recall point.
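A minimal sketch of the change, assuming the points are (recall, precision) pairs sorted by increasing recall (a hypothetical helper, not the actual `BinaryClassificationMetrics` code):

```scala
// Prepend (0, p) where p is the precision of the lowest-recall point,
// instead of the previous, misleading (0.0, 1.0) anchor.
def prependFirstPoint(points: Seq[(Double, Double)]): Seq[(Double, Double)] =
  points.headOption match {
    case Some((_, firstPrecision)) => (0.0, firstPrecision) +: points
    case None                      => points
  }
```
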
      
      ## How was this patch tested?
      
      Updated tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #19038 from srowen/SPARK-21806.
      734ed7a7
    • Yuval Itzchakov's avatar
      [SPARK-21873][SS] - Avoid using `return` inside `CachedKafkaConsumer.get` · 8f0df6bc
      Yuval Itzchakov authored
      During profiling of a structured streaming application with Kafka as the source, I came across this exception:
      
      ![Structured Streaming Kafka Exceptions](https://user-images.githubusercontent.com/3448320/29743366-4149ef78-8a99-11e7-94d6-f0cbb691134a.png)
      
This is a one-minute sample, which caused 106K `NonLocalReturnControl` exceptions to be thrown.
This happens because `CachedKafkaConsumer.get` is run inside:
      
      `private def runUninterruptiblyIfPossible[T](body: => T): T`
      
Where `body: => T` is the `get` method. Turning the method into a by-name function means that, for `return` to escape the `while` loop defined in `get`, the runtime has to throw the above exception and unwind back to the enclosing method.
      
      ## What changes were proposed in this pull request?
      
Instead of using `return` (which is generally not recommended in Scala), we place the result of the `fetchData` method in a local variable and use a boolean flag to indicate whether data has been fetched, which serves as the predicate of the `while` loop.
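The rewrite described above can be sketched as follows (hypothetical names, not the actual `CachedKafkaConsumer` code): a `return` inside a closure compiles to a thrown `scala.runtime.NonLocalReturnControl`, while the flag-based loop avoids it entirely.

```scala
object ReturnDemo {
  // `return` inside the foreach lambda can only reach the enclosing method
  // by throwing NonLocalReturnControl, which the method catches -- costly
  // when executed at high frequency.
  def firstPositiveWithReturn(xs: Seq[Int]): Int = {
    xs.foreach { x => if (x > 0) return x }
    -1
  }

  // Rewrite in the style of the fix: store the result in a local variable
  // and use a boolean flag as the while-loop predicate, so no exception
  // is needed to exit the loop.
  def firstPositiveWithFlag(xs: Seq[Int]): Int = {
    var result = -1
    var found = false
    val it = xs.iterator
    while (!found && it.hasNext) {
      val x = it.next()
      if (x > 0) { result = x; found = true }
    }
    result
  }
}
```
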
      
      ## How was this patch tested?
      
I've run the `KafkaSourceSuite` to make sure there are no regressions. Since the exception isn't visible from user code, there is no way (at least none I could think of) to add a test for it to the existing suite.
      
      Author: Yuval Itzchakov <yuval.itzchakov@clicktale.com>
      
      Closes #19059 from YuvalItzchakov/master.
      8f0df6bc
    • liuxian's avatar
[MINOR][TEST] Fix off-heap memory leaks in unit tests · d4895c9d
      liuxian authored
      ## What changes were proposed in this pull request?
Free off-heap memory.
I have checked all the unit tests.
      
      ## How was this patch tested?
      N/A
      
      Author: liuxian <liu.xian3@zte.com.cn>
      
      Closes #19075 from 10110346/memleak.
      d4895c9d
  8. Aug 29, 2017
    • Steve Loughran's avatar
      [SPARK-20886][CORE] HadoopMapReduceCommitProtocol to handle FileOutputCommitter.getWorkPath==null · e47f48c7
      Steve Loughran authored
      ## What changes were proposed in this pull request?
      
Handles the situation where `FileOutputCommitter.getWorkPath()` returns `null` by falling back to the supplied `path` argument.
      
The existing code does `Option(workPath.toString).getOrElse(path)`, which triggers an NPE in the `toString()` call if `workPath == null`. The code was apparently meant to handle this case (hence the `getOrElse()` clause), but since the NPE has already been thrown at that point, the else-clause is never invoked.
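A minimal sketch of the bug pattern and the fix (hypothetical signature; the real code lives in `HadoopMapReduceCommitProtocol`):

```scala
// Buggy: workPath.toString is evaluated before Option() can intercept null,
// so a null workPath throws NPE and the getOrElse fallback is never reached.
//   val dir = Option(workPath.toString).getOrElse(path)

// Fixed: wrap the nullable reference itself, then map over it.
def stagingDir(workPath: AnyRef, path: String): String =
  Option(workPath).map(_.toString).getOrElse(path)
```
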
      
      ## How was this patch tested?
      
      Manually, with some later code review.
      
      Author: Steve Loughran <stevel@hortonworks.com>
      
      Closes #18111 from steveloughran/cloud/SPARK-20886-committer-NPE.
      e47f48c7