  1. Sep 06, 2017
    • Jacek Laskowski's avatar
      [SPARK-21901][SS] Define toString for StateOperatorProgress · fa0092bd
      Jacek Laskowski authored
      ## What changes were proposed in this pull request?
      
      Just `StateOperatorProgress.toString` + a few formatting fixes
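
      A minimal sketch of the idea, using a stand-in class (not the real `StateOperatorProgress`); `numRowsTotal` and `numRowsUpdated` are existing fields of the progress class, but the exact format produced by the patch may differ:

      ```scala
      // Stand-in class, just to illustrate defining toString for a progress-like type.
      case class StateOperatorProgressSketch(numRowsTotal: Long, numRowsUpdated: Long) {
        override def toString: String =
          s"StateOperatorProgress(numRowsTotal = $numRowsTotal, numRowsUpdated = $numRowsUpdated)"
      }
      ```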
      
      ## How was this patch tested?
      
      Local build. Waiting for OK from Jenkins.
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #19112 from jaceklaskowski/SPARK-21901-StateOperatorProgress-toString.
      fa0092bd
    • Jose Torres's avatar
      [SPARK-21765] Check that optimization doesn't affect isStreaming bit. · acdf45fb
      Jose Torres authored
      ## What changes were proposed in this pull request?
      
      Add an assert in logical plan optimization that the isStreaming bit stays the same, and fix empty relation rules where that wasn't happening.
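
      A hedged sketch of the invariant being asserted (not the code in the patch; `df` is an assumed streaming Dataset):

      ```scala
      // The optimizer must not flip the isStreaming bit of the logical plan.
      val analyzed = df.queryExecution.analyzed
      val optimized = df.queryExecution.optimizedPlan
      assert(optimized.isStreaming == analyzed.isStreaming,
        "Optimization changed the isStreaming bit of the logical plan")
      ```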
      
      ## How was this patch tested?
      
      new and existing unit tests
      
      Author: Jose Torres <joseph.torres@databricks.com>
      Author: Jose Torres <joseph-torres@databricks.com>
      
      Closes #19056 from joseph-torres/SPARK-21765-followup.
      acdf45fb
    • Felix Cheung's avatar
      [SPARK-21801][SPARKR][TEST] set random seed for predictable test · 36b48ee6
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      set.seed() before running tests
      
      ## How was this patch tested?
      
      jenkins, appveyor
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #19111 from felixcheung/rranseed.
      36b48ee6
    • Liang-Chi Hsieh's avatar
      [SPARK-21835][SQL] RewritePredicateSubquery should not produce unresolved query plans · f2e22aeb
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      Correlated predicate subqueries are rewritten into `Join` by the rule `RewritePredicateSubquery`  during optimization.
      
      It is possible that the two sides of the `Join` have conflicting attributes. The query plans produced by `RewritePredicateSubquery` then become unresolved and break structural integrity.
      
      We should check if there are conflicting attributes in the `Join` and de-duplicate them by adding a `Project`.
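
      A hedged illustration (not taken from the patch) of the kind of query involved: the IN predicate is rewritten into a `Join` whose right side comes from the same table, so both sides can carry conflicting attribute ids unless they are de-duplicated.

      ```scala
      import spark.implicits._

      Seq((1, 2), (2, 3)).toDF("a", "b").createOrReplaceTempView("t")
      spark.sql("SELECT * FROM t WHERE a IN (SELECT a FROM t WHERE b > 2)").explain(true)
      ```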
      
      ## How was this patch tested?
      
      Added tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #19050 from viirya/SPARK-21835.
      f2e22aeb
    • hyukjinkwon's avatar
      [SPARK-21903][BUILD][FOLLOWUP] Upgrade scalastyle-maven-plugin and scalastyle... · 64936c14
      hyukjinkwon authored
      [SPARK-21903][BUILD][FOLLOWUP] Upgrade scalastyle-maven-plugin and scalastyle as well in POM and SparkBuild.scala
      
      ## What changes were proposed in this pull request?
      
      This PR proposes to match the scalastyle version in the POM and SparkBuild.scala.
      
      ## How was this patch tested?
      
      Manual builds.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #19146 from HyukjinKwon/SPARK-21903-follow-up.
      64936c14
    • Bryan Cutler's avatar
      [SPARK-19357][ML] Adding parallel model evaluation in ML tuning · 16c4c03c
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      Modified `CrossValidator` and `TrainValidationSplit` to be able to evaluate models in parallel for a given parameter grid. The level of parallelism is controlled by the parameter `numParallelEval`, which schedules that many models to be trained/evaluated so the jobs can run concurrently. This is a naive approach that does not check the cluster for needed resources, so care must be taken by the user to tune the parameter appropriately. The default value is `1`, which will train/evaluate serially.
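
      A hedged usage sketch; the setter name below is assumed to mirror the `numParallelEval` param described above and may differ from the final API:

      ```scala
      import org.apache.spark.ml.classification.LogisticRegression
      import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
      import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

      val lr = new LogisticRegression()
      val grid = new ParamGridBuilder()
        .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
        .build()

      val cv = new CrossValidator()
        .setEstimator(lr)
        .setEvaluator(new BinaryClassificationEvaluator())
        .setEstimatorParamMaps(grid)
        .setNumFolds(3)
        .setNumParallelEval(2) // assumed setter mirroring numParallelEval; 1 (default) evaluates serially
      ```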
      
      ## How was this patch tested?
      Added unit tests for CrossValidator and TrainValidationSplit to verify that model selection is the same when run in serial vs parallel.  Manual testing to verify tasks run in parallel when param is > 1. Added parameter usage to relevant examples.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #16774 from BryanCutler/parallel-model-eval-SPARK-19357.
      16c4c03c
    • Riccardo Corbella's avatar
      [SPARK-21924][DOCS] Update structured streaming programming guide doc · 4ee7dfe4
      Riccardo Corbella authored
      ## What changes were proposed in this pull request?
      
      Update the line "For example, the data (12:09, cat) is out of order and late, and it falls in windows 12:05 - 12:15 and 12:10 - 12:20." to "For example, the data (12:09, cat) is out of order and late, and it falls in windows 12:00 - 12:10 and 12:05 - 12:15." in the structured streaming programming guide.
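
      A hedged illustration of the corrected example (`events`, `timestamp` and `word` are made-up names): with 10-minute windows sliding every 5 minutes, an event at 12:09 falls into the 12:00 - 12:10 and 12:05 - 12:15 windows.

      ```scala
      import org.apache.spark.sql.functions.window
      import spark.implicits._

      val windowedCounts = events
        .groupBy(window($"timestamp", "10 minutes", "5 minutes"), $"word")
        .count()
      ```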
      
      Author: Riccardo Corbella <r.corbella@reply.it>
      
      Closes #19137 from riccardocorbella/bugfix.
      4ee7dfe4
  2. Sep 05, 2017
    • jerryshao's avatar
      [SPARK-9104][CORE] Expose Netty memory metrics in Spark · 445f1790
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      This PR exposes Netty memory usage for Spark's `TransportClientFactory` and `TransportServer`, including the details of each direct arena and heap arena as well as aggregated metrics. The purpose of adding the Netty metrics is to better understand the memory usage of Netty in Spark shuffle, RPC and other network communications, and to guide better configuration of executor memory.
      
      This PR doesn't expose these metrics to any sink; to leverage this feature, one still needs to connect them to the MetricsSystem or collect them back to the driver for display.
      
      ## How was this patch tested?
      
      Added a unit test to verify it; also manually verified in a real cluster.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #18935 from jerryshao/SPARK-9104.
      445f1790
    • jerryshao's avatar
      [SPARK-18061][THRIFTSERVER] Add spnego auth support for ThriftServer thrift/http protocol · 6a232544
      jerryshao authored
      Spark ThriftServer doesn't support SPNEGO auth for the thrift/http protocol; this is mainly needed for the Knox + ThriftServer scenario. HiveServer2's CLIService already has code to support it, so this copies that code to Spark ThriftServer.
      
      Related Hive JIRA HIVE-6697.
      
      Manual verification.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #18628 from jerryshao/SPARK-21407.
      
      Change-Id: I61ef0c09f6972bba982475084a6b0ae3a74e385e
      6a232544
    • Dongjoon Hyun's avatar
      [MINOR][DOC] Update `Partition Discovery` section to enumerate all available file sources · 9e451bcf
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      All built-in data sources support `Partition Discovery`. We should update the document to make this benefit clear to users.
      
      **AFTER**
      
      <img width="906" alt="1" src="https://user-images.githubusercontent.com/9700541/30083628-14278908-9244-11e7-98dc-9ad45fe233a9.png">
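
      A hedged sketch of what partition discovery gives users (paths and column names are made up): a layout such as `/data/events/year=2017/month=09/...` is discovered automatically, and the partition columns appear in the schema.

      ```scala
      val df = spark.read.parquet("/data/events")
      df.printSchema() // includes year and month as partition columns
      ```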
      
      ## How was this patch tested?
      
      ```
      SKIP_API=1 jekyll serve --watch
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #19139 from dongjoon-hyun/partitiondiscovery.
      9e451bcf
    • Xingbo Jiang's avatar
      [SPARK-21652][SQL] Fix rule confliction between InferFiltersFromConstraints and ConstantPropagation · fd60d4fa
      Xingbo Jiang authored
      ## What changes were proposed in this pull request?
      
      For the given example below, the predicate added by `InferFiltersFromConstraints` is later folded by `ConstantPropagation`, which leads to optimizer iterations that never converge:
      ```
      Seq((1, 1)).toDF("col1", "col2").createOrReplaceTempView("t1")
      Seq(1, 2).toDF("col").createOrReplaceTempView("t2")
      sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 = t2.col AND t1.col2 = t2.col")
      ```
      
      We can fix this by adjusting the placement of the optimizer rules.
      
      ## How was this patch tested?
      
      Added a test case to `SQLQuerySuite` that would previously have failed.
      
      Author: Xingbo Jiang <xingbo.jiang@databricks.com>
      
      Closes #19099 from jiangxb1987/unconverge-optimization.
      fd60d4fa
    • Burak Yavuz's avatar
      [SPARK-21925] Update trigger interval documentation in docs with behavior change in Spark 2.2 · 8c954d2c
      Burak Yavuz authored
      Forgot to update docs with behavior change.
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #19138 from brkyvz/trigger-doc-fix.
      8c954d2c
    • gatorsmile's avatar
      [SPARK-21845][SQL][TEST-MAVEN] Make codegen fallback of expressions configurable · 2974406d
      gatorsmile authored
      ## What changes were proposed in this pull request?
      We should make the codegen fallback of expressions configurable. So far, it is always on; it might hide compilation bugs in our generated code. Thus, we should also disable the codegen fallback when running test cases.
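
      A hedged sketch of toggling such a fallback in a session; the config key is assumed here and may not match the one introduced by this patch:

      ```scala
      spark.conf.set("spark.sql.codegen.fallback", "false")
      ```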
      
      ## How was this patch tested?
      Added test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19119 from gatorsmile/fallbackCodegen.
      2974406d
    • hyukjinkwon's avatar
      [SPARK-20978][SQL] Bump up Univocity version to 2.5.4 · 02a4386a
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      There was a bug in Univocity Parser that caused the issue in SPARK-20978. The fix is demonstrated below:
      
      ```scala
      val df = spark.read.schema("a string, b string, unparsed string").option("columnNameOfCorruptRecord", "unparsed").csv(Seq("a").toDS())
      df.show()
      ```
      
      **Before**
      
      ```
      java.lang.NullPointerException
      	at scala.collection.immutable.StringLike$class.stripLineEnd(StringLike.scala:89)
      	at scala.collection.immutable.StringOps.stripLineEnd(StringOps.scala:29)
      	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$getCurrentInput(UnivocityParser.scala:56)
      	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207)
      	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207)
      ...
      ```
      
      **After**
      
      ```
      +---+----+--------+
      |  a|   b|unparsed|
      +---+----+--------+
      |  a|null|       a|
      +---+----+--------+
      ```
      
      It was fixed in 2.5.0 and 2.5.4 was released. I guess it'd be safe to upgrade this.
      
      ## How was this patch tested?
      
      Unit test added in `CSVSuite.scala`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #19113 from HyukjinKwon/bump-up-univocity.
      02a4386a
    • hyukjinkwon's avatar
      [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0. · 7f3c6ff4
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      1.0.0 fixes issues with import order, explicit types for public methods, the line length limit and comment validation:
      
      ```
      [error] .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/Main.scala:50:16: Are you sure you want to println? If yes, wrap the code block with
      [error]       // scalastyle:off println
      [error]       println(...)
      [error]       // scalastyle:on println
      [error] .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala:49: File line length exceeds 100 characters
      [error] .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala:22:21: Are you sure you want to println? If yes, wrap the code block with
      [error]       // scalastyle:off println
      [error]       println(...)
      [error]       // scalastyle:on println
      [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:35:6: Public method must have explicit type
      [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:51:6: Public method must have explicit type
      [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:93:15: Public method must have explicit type
      [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:98:15: Public method must have explicit type
      [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:47:2: Insert a space after the start of the comment
      [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:26:43: JavaDStream should come before JavaDStreamLike.
      ```
      
      This PR also fixes the workaround added in SPARK-16877 for the `org.scalastyle.scalariform.OverrideJavaChecker` feature, which was added in 0.9.0.
      
      ## How was this patch tested?
      
      Manually tested.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #19116 from HyukjinKwon/scalastyle-1.0.0.
      7f3c6ff4
    • Dongjoon Hyun's avatar
      [SPARK-21913][SQL][TEST] withDatabase` should drop database with CASCADE · 4e7a29ef
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, `withDatabase` fails if the database is not empty. It would be better to drop it cleanly with CASCADE.
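
      A minimal sketch of the idea behind the change (`dbName` is a placeholder), not the test util's actual code:

      ```scala
      spark.sql(s"DROP DATABASE IF EXISTS $dbName CASCADE")
      ```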
      
      ## How was this patch tested?
      
      This is a change to a test utility; passing the existing Jenkins tests is sufficient.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #19125 from dongjoon-hyun/SPARK-21913.
      4e7a29ef
  3. Sep 04, 2017
    • Sean Owen's avatar
      [SPARK-21418][SQL] NoSuchElementException: None.get in DataSourceScanExec with... · ca59445a
      Sean Owen authored
      [SPARK-21418][SQL] NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true
      
      ## What changes were proposed in this pull request?
      
      If no SparkConf is available to Utils.redact, simply don't redact.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #19123 from srowen/SPARK-21418.
      ca59445a
  4. Sep 03, 2017
    • Liang-Chi Hsieh's avatar
      [SPARK-21654][SQL] Complement SQL predicates expression description · 9f30d928
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      SQL predicates don't have complete expression descriptions. This patch complements the descriptions by adding arguments and examples.
      
      This change also adds related test cases for the SQL predicate expressions.
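
      The added arguments and examples surface through the extended function description, e.g. (using `in` as one of the predicate expressions; exact output depends on the Spark version):

      ```scala
      spark.sql("DESCRIBE FUNCTION EXTENDED in").show(truncate = false)
      ```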
      
      ## How was this patch tested?
      
      Existing tests. And added predicate test.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #18869 from viirya/SPARK-21654.
      9f30d928
    • hyukjinkwon's avatar
      [SPARK-21897][PYTHON][R] Add unionByName API to DataFrame in Python and R · 07fd68a2
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to add a wrapper for `unionByName` API to R and Python as well.
      
      **Python**
      
      ```python
      df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
      df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col0"])
      df1.unionByName(df2).show()
      ```
      
      ```
      +----+----+----+
      |col0|col1|col2|
      +----+----+----+
      |   1|   2|   3|
      |   6|   4|   5|
      +----+----+----+
      ```
      
      **R**
      
      ```R
      df1 <- select(createDataFrame(mtcars), "carb", "am", "gear")
      df2 <- select(createDataFrame(mtcars), "am", "gear", "carb")
      head(unionByName(limit(df1, 2), limit(df2, 2)))
      ```
      
      ```
        carb am gear
      1    4  1    4
      2    4  1    4
      3    4  1    4
      4    4  1    4
      ```
      
      ## How was this patch tested?
      
      Doctests for Python and unit test added in `test_sparkSQL.R` for R.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #19105 from HyukjinKwon/unionByName-r-python.
      07fd68a2
  5. Sep 02, 2017
    • gatorsmile's avatar
      [SPARK-21891][SQL] Add TBLPROPERTIES to DDL statement: CREATE TABLE USING · acb7fed2
      gatorsmile authored
      ## What changes were proposed in this pull request?
      Add `TBLPROPERTIES` to the DDL statement `CREATE TABLE USING`.
      
      After this change, the DDL becomes
      ```
      CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name
      USING table_provider
      [OPTIONS table_property_list]
      [PARTITIONED BY (col_name, col_name, ...)]
      [CLUSTERED BY (col_name, col_name, ...)
       [SORTED BY (col_name [ASC|DESC], ...)]
       INTO num_buckets BUCKETS
      ]
      [LOCATION path]
      [COMMENT table_comment]
      [TBLPROPERTIES (property_name=property_value, ...)]
      [[AS] select_statement];
      ```
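
      A hedged example instance of the extended DDL (table and property names are made up):

      ```scala
      spark.sql("""
        CREATE TABLE t
        USING parquet
        TBLPROPERTIES ('created.by' = 'example', 'quality' = 'draft')
        AS SELECT 1 AS a
      """)
      ```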
      
      ## How was this patch tested?
      Add a few tests
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19100 from gatorsmile/addTablePropsToCreateTableUsing.
      acb7fed2
  6. Sep 01, 2017
    • WeichenXu's avatar
      [SPARK-21729][ML][TEST] Generic test for ProbabilisticClassifier to ensure... · 900f14f6
      WeichenXu authored
      [SPARK-21729][ML][TEST] Generic test for ProbabilisticClassifier to ensure consistent output columns
      
      ## What changes were proposed in this pull request?
      
      Add test for prediction using the model with all combinations of output columns turned on/off.
      Make sure the output column values match, presumably by comparing vs. the case with all 3 output columns turned on.
      
      ## How was this patch tested?
      
      Test updated.
      
      Author: WeichenXu <weichen.xu@databricks.com>
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #19065 from WeichenXu123/generic_test_for_prob_classifier.
      900f14f6
    • gatorsmile's avatar
      [SPARK-21895][SQL] Support changing database in HiveClient · aba9492d
      gatorsmile authored
      ## What changes were proposed in this pull request?
      Support moving tables across databases in HiveClient's `alterTable`.
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19104 from gatorsmile/alterTable.
      aba9492d
    • Sean Owen's avatar
      [SPARK-14280][BUILD][WIP] Update change-version.sh and pom.xml to add Scala... · 12ab7f7e
      Sean Owen authored
      [SPARK-14280][BUILD][WIP] Update change-version.sh and pom.xml to add Scala 2.12 profiles and enable 2.12 compilation
      
      …build; fix some things that will be warnings or errors in 2.12; restore Scala 2.12 profile infrastructure
      
      ## What changes were proposed in this pull request?
      
      This change adds back the infrastructure for a Scala 2.12 build, but does not enable it in the release or Python test scripts.
      
      In order to make that meaningful, it also resolves compile errors that the code hits in 2.12 only, in a way that still works with 2.11.
      
      It also updates dependencies to the earliest minor release of dependencies whose current version does not yet support Scala 2.12. This is in a sense covered by other JIRAs under the main umbrella, but implemented here. The versions below still work with 2.11, and are the _latest_ maintenance release in the _earliest_ viable minor release.
      
      - Scalatest 2.x -> 3.0.3
      - Chill 0.8.0 -> 0.8.4
      - Clapper 1.0.x -> 1.1.2
      - json4s 3.2.x -> 3.4.2
      - Jackson 2.6.x -> 2.7.9 (required by json4s)
      
      This change does _not_ fully enable a Scala 2.12 build:
      
      - It will also require dropping support for Kafka before 0.10. Easy enough, just didn't do it yet here
      - It will require recreating `SparkILoop` and `Main` for REPL 2.12, which is SPARK-14650. Possible to do here too.
      
      What it does do is make changes that resolve much of the remaining gap without affecting the current 2.11 build.
      
      ## How was this patch tested?
      
      Existing tests and build. Manually tested with `./dev/change-scala-version.sh 2.12` to verify it compiles, modulo the exceptions above.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #18645 from srowen/SPARK-14280.
      12ab7f7e
    • he.qiao's avatar
      [SPARK-21880][WEB UI] In the SQL table page, modify jobs trace information · 12f0d242
      he.qiao authored
      ## What changes were proposed in this pull request?
      As shown below, for example, when job 5 is running it is easy to mistakenly think that five jobs are running, so it would be more appropriate to change "jobs" to the job id.
      ![image](https://user-images.githubusercontent.com/21355020/29909612-4dc85064-8e59-11e7-87cd-275a869243bb.png)
      
      ## How was this patch tested?
      No tests needed; this is a UI-only change.
      
      Author: he.qiao <he.qiao17@zte.com.cn>
      
      Closes #19093 from Geek-He/08_31_sqltable.
      12f0d242
    • Marcelo Vanzin's avatar
      [SPARK-21728][CORE] Follow up: fix user config, auth in SparkSubmit logging. · 0bdbefe9
      Marcelo Vanzin authored
      - SecurityManager complains when auth is enabled but no secret is defined;
        SparkSubmit doesn't use the auth functionality of the SecurityManager,
        so use a dummy secret to work around the exception.
      
      - Only reset the log4j configuration when Spark was the one initializing
        it, otherwise user-defined log configuration may be lost.
      
      Tested with the log config file posted to the bug, on a secured YARN cluster.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #19089 from vanzin/SPARK-21728.
      0bdbefe9
  7. Aug 31, 2017
    • hyukjinkwon's avatar
      [SPARK-21789][PYTHON] Remove obsolete codes for parsing abstract schema strings · 648a8626
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to remove private functions that appear to be unused in the main code: `_split_schema_abstract`, `_parse_field_abstract`, `_parse_schema_abstract` and `_infer_schema_type`.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18647 from HyukjinKwon/remove-abstract.
      648a8626
    • hyukjinkwon's avatar
      [SPARK-21779][PYTHON] Simpler DataFrame.sample API in Python · 5cd8ea99
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR makes it possible to omit `withReplacement` in `DataFrame.sample(...)`, defaulting to `False`, consistent with the equivalent Scala / Java API.
      
      In short, the following examples are allowed:
      
      ```python
      >>> df = spark.range(10)
      >>> df.sample(0.5).count()
      7
      >>> df.sample(fraction=0.5).count()
      3
      >>> df.sample(0.5, seed=42).count()
      5
      >>> df.sample(fraction=0.5, seed=42).count()
      5
      ```
      
      In addition, this PR also adds some type-checking logic, as below:
      
      ```python
      >>> df = spark.range(10)
      >>> df.sample().count()
      ...
      TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [].
      >>> df.sample(True).count()
      ...
      TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'bool'>].
      >>> df.sample(42).count()
      ...
      TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'int'>].
      >>> df.sample(fraction=False, seed="a").count()
      ...
      TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'bool'>, <type 'str'>].
      >>> df.sample(seed=[1]).count()
      ...
      TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'list'>].
      >>> df.sample(withReplacement="a", fraction=0.5, seed=1)
      ...
      TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'str'>, <type 'float'>, <type 'int'>].
      ```
      
      ## How was this patch tested?
      
      Manually tested, unit tests added in doc tests and manually checked the built documentation for Python.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18999 from HyukjinKwon/SPARK-21779.
      5cd8ea99
    • WeichenXu's avatar
      [SPARK-21862][ML] Add overflow check in PCA · f5e10a34
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Add an overflow check in PCA; otherwise it is possible to throw a `NegativeArraySizeException` when `k` and `numFeatures` are too large.
      The overflow checking formula is here:
      https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala#L87
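
      A simplified sketch of the kind of guard described above; the real bound follows the breeze SVD formula linked in the description, so this only shows the shape of such a check:

      ```scala
      def checkOverflow(k: Int, numFeatures: Int): Unit = {
        // Simplified bound: the actual check should follow the breeze SVD work-array formula.
        require(
          numFeatures.toLong * k <= Int.MaxValue,
          s"k = $k and numFeatures = $numFeatures would overflow the SVD work array size.")
      }
      ```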
      
      ## How was this patch tested?
      
      N/A
      
      Author: WeichenXu <weichen.xu@databricks.com>
      
      Closes #19078 from WeichenXu123/SVD_overflow_check.
      f5e10a34
    • WeichenXu's avatar
      [SPARK-17139][ML][FOLLOW-UP] Add convenient method `asBinary` for casting to... · 96028e36
      WeichenXu authored
      [SPARK-17139][ML][FOLLOW-UP] Add convenient method `asBinary` for casting to BinaryLogisticRegressionSummary
      
      ## What changes were proposed in this pull request?
      
      Add an `asBinary` method to `LogisticRegressionSummary` for convenient casting to `BinaryLogisticRegressionSummary`.
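
      A hedged usage sketch (`lrModel` is an assumed, already-fitted `LogisticRegressionModel` trained on a binary label):

      ```scala
      val binarySummary = lrModel.summary.asBinary
      println(binarySummary.areaUnderROC)
      ```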
      
      ## How was this patch tested?
      
      Testcase updated.
      
      Author: WeichenXu <weichen.xu@databricks.com>
      
      Closes #19072 from WeichenXu123/mlor_summary_as_binary.
      96028e36
    • Andrew Ray's avatar
      [SPARK-21110][SQL] Structs, arrays, and other orderable datatypes should be usable in inequalities · cba69aeb
      Andrew Ray authored
      ## What changes were proposed in this pull request?
      
      Allows `BinaryComparison` operators to work on any data type that actually supports ordering as verified by `TypeUtils.checkForOrderingExpr` instead of relying on the incomplete list `TypeCollection.Ordered` (which is removed by this PR).
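
      A hedged sketch of a comparison this change enables (previously rejected because `ArrayType` is not in `TypeCollection.Ordered`):

      ```scala
      import spark.implicits._

      Seq((Array(1, 2), Array(1, 3))).toDF("a", "b").where($"a" < $"b").show()
      ```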
      
      ## How was this patch tested?
      
      Updated unit tests to cover structs and arrays.
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #18818 from aray/SPARK-21110.
      cba69aeb
    • gatorsmile's avatar
      [SPARK-17107][SQL][FOLLOW-UP] Remove redundant pushdown rule for Union · 7ce11082
      gatorsmile authored
      ## What changes were proposed in this pull request?
      Also remove the now-unused function `partitionByDeterministic` after the changes of https://github.com/apache/spark/pull/14687
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19097 from gatorsmile/followupSPARK-17107.
      7ce11082
    • Bryan Cutler's avatar
      [SPARK-21583][HOTFIX] Removed intercept in test causing failures · 501370d9
      Bryan Cutler authored
      Removing a check in the ColumnarBatchSuite that depended on a Java assertion. This assertion is compiled out in the Maven builds, causing the test to fail. This part of the test is not specific to the functionality being tested here.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #19098 from BryanCutler/hotfix-ColumnarBatchSuite-assertion.
      501370d9
    • ArtRand's avatar
      [SPARK-20812][MESOS] Add secrets support to the dispatcher · fc45c2c8
      ArtRand authored
      Mesos has secrets primitives for environment- and file-based secrets; this PR adds that functionality, along with the appropriate configuration flags, to the Spark dispatcher.
      Unit tested and manually tested against a DC/OS cluster with Mesos 1.4.
      
      Author: ArtRand <arand@soe.ucsc.edu>
      
      Closes #18837 from ArtRand/spark-20812-dispatcher-secrets-and-labels.
      fc45c2c8
    • Jacek Laskowski's avatar
      [SPARK-21886][SQL] Use SparkSession.internalCreateDataFrame to create… · 9696580c
      Jacek Laskowski authored
      … Dataset with LogicalRDD logical operator
      
      ## What changes were proposed in this pull request?
      
      Reusing `SparkSession.internalCreateDataFrame` wherever possible (to cut duplication)
      
      ## How was this patch tested?
      
      Local build and waiting for Jenkins
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #19095 from jaceklaskowski/SPARK-21886-internalCreateDataFrame.
      9696580c
    • gatorsmile's avatar
      [SPARK-21878][SQL][TEST] Create SQLMetricsTestUtils · 19b0240d
      gatorsmile authored
      ## What changes were proposed in this pull request?
      Creates `SQLMetricsTestUtils` for the utility functions of both Hive-specific and the other SQLMetrics test cases.
      
      Also, move two SQLMetrics test cases from sql/hive to sql/core.
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19092 from gatorsmile/rewriteSQLMetrics.
      19b0240d
  8. Aug 30, 2017
    • Bryan Cutler's avatar
      [SPARK-21583][SQL] Create a ColumnarBatch from ArrowColumnVectors · 964b507c
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      
      This PR allows the creation of a `ColumnarBatch` from `ReadOnlyColumnVectors` where previously a columnar batch could only allocate vectors internally.  This is useful for using `ArrowColumnVectors` in a batch form to do row-based iteration.  Also added `ArrowConverter.fromPayloadIterator` which converts `ArrowPayload` iterator to `InternalRow` iterator and uses a `ColumnarBatch` internally.
      
      ## How was this patch tested?
      
      Added a new unit test for creating a `ColumnarBatch` with `ReadOnlyColumnVectors` and a test to verify the roundtrip of rows -> ArrowPayload -> rows, using `toPayloadIterator` and `fromPayloadIterator`.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #18787 from BryanCutler/arrow-ColumnarBatch-support-SPARK-21583.
      964b507c
    • Liang-Chi Hsieh's avatar
      [SPARK-21534][SQL][PYSPARK] PickleException when creating dataframe from... · ecf437a6
      Liang-Chi Hsieh authored
      [SPARK-21534][SQL][PYSPARK] PickleException when creating dataframe from python row with empty bytearray
      
      ## What changes were proposed in this pull request?
      
      `PickleException` is thrown when creating dataframe from python row with empty bytearray
      
          spark.createDataFrame(spark.sql("select unhex('') as xx").rdd.map(lambda x: {"abc": x.xx})).show()
      
          net.razorvine.pickle.PickleException: invalid pickle data for bytearray; expected 1 or 2 args, got 0
          	at net.razorvine.pickle.objects.ByteArrayConstructor.construct(ByteArrayConstructor.java
              ...
      
      `ByteArrayConstructor` doesn't deal with empty byte array pickled by Python3.
      
      ## How was this patch tested?
      
      Added test.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #19085 from viirya/SPARK-21534.
      ecf437a6
    • jerryshao's avatar
      [SPARK-17321][YARN] Avoid writing shuffle metadata to disk if NM recovery is disabled · 4482ff23
      jerryshao authored
      In the current code, if NM recovery is not enabled then `YarnShuffleService` will write shuffle metadata to NM local dir-1; if this local dir-1 is on a bad disk, `YarnShuffleService` will fail to start. To solve this issue, on the Spark side, if NM recovery is not enabled then Spark will not persist data into leveldb; in that case the YARN shuffle service can still be served but loses the ability to recover (which is fine, because the failure of the NM will kill the containers as well as the applications).
      
      Tested in a local cluster with NM recovery off and on to see whether the folder is created. A MiniCluster unit test isn't added because in MiniCluster the NM always sets the port to 0, but NM recovery requires a non-ephemeral port.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #19032 from jerryshao/SPARK-17321.
      
      Change-Id: I8f2fe73d175e2ad2c4e380caede3873e0192d027
      4482ff23
    • Xiaofeng Lin's avatar
      [SPARK-11574][CORE] Add metrics StatsD sink · cd5d0f33
      Xiaofeng Lin authored
      This patch adds a StatsD sink to the current metrics system in Spark core.
      
      Author: Xiaofeng Lin <xlin@twilio.com>
      
      Closes #9518 from xflin/statsd.
      
      Change-Id: Ib8720e86223d4a650df53f51ceb963cd95b49a44
      cd5d0f33
    • Andrew Ash's avatar
      [SPARK-21875][BUILD] Fix Java style bugs · 313c6ca4
      Andrew Ash authored
      ## What changes were proposed in this pull request?
      
      Fix Java code style so `./dev/lint-java` succeeds
      
      ## How was this patch tested?
      
      Run `./dev/lint-java`
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #19088 from ash211/spark-21875-lint-java.
      313c6ca4