  1. Jul 30, 2017
    • GuoChenzhao's avatar
      [SQL] Fix typo in DataframeWriter doc · 51f99fb2
      GuoChenzhao authored
      ## What changes were proposed in this pull request?
      
      The format of `none` should be consistent with the other compression codecs (`snappy`, `lz4`), i.e. `none`.
      
      ## How was this patch tested?
      
      This is only a typo fix, so no tests are needed.
      
      Author: GuoChenzhao <chenzhao.guo@intel.com>
      
      Closes #18758 from gczsjdy/typo.
      51f99fb2
  2. Jul 29, 2017
    • Takeshi Yamamuro's avatar
      [SPARK-20962][SQL] Support subquery column aliases in FROM clause · 6550086b
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR adds parsing rules to support subquery column aliases in the FROM clause.
      This PR is a sub-task of #18079.
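
      A minimal sketch of the syntax this enables (assuming a running `SparkSession` named `spark`; the alias and values are illustrative):

      ```scala
      // Column aliases on a FROM-clause subquery: t(a, b) renames the subquery's
      // output columns to `a` and `b`.
      spark.sql("SELECT t.a, t.b FROM (SELECT 1, 2) AS t(a, b)").show()
      ```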
      
      ## How was this patch tested?
      Added tests in `PlanParserSuite` and `SQLQueryTestSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #18185 from maropu/SPARK-20962.
      6550086b
    • Xingbo Jiang's avatar
      [SPARK-19451][SQL] rangeBetween method should accept Long value as boundary · 92d85637
      Xingbo Jiang authored
      ## What changes were proposed in this pull request?
      
      Long values can be passed to `rangeBetween` as range frame boundaries, but we silently convert them to Int values; this can cause wrong results, so we should fix it.
      
      Furthermore, we should accept any legal literal value as a range frame boundary. This PR makes that possible for Long values and makes it easy to add support for other DataTypes.
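
      A hedged sketch of what this enables (assumes a `SparkSession` named `spark`; the column names are illustrative): a range boundary outside the Int range is no longer silently narrowed.

      ```scala
      import org.apache.spark.sql.expressions.Window
      import org.apache.spark.sql.functions.sum

      val df = spark.range(0, 10).toDF("id")
      // A Long boundary well below Int.MinValue, which previously would have been truncated to an Int.
      val w = Window.orderBy("id").rangeBetween(-3000000000L, 0L)
      df.withColumn("runningSum", sum("id").over(w)).show()
      ```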
      
      This PR is mostly based on Herman's previous amazing work: https://github.com/hvanhovell/spark/commit/596f53c339b1b4629f5651070e56a8836a397768
      
      After this been merged, we can close #16818 .
      
      ## How was this patch tested?
      
      Add new tests in `DataFrameWindowFunctionsSuite` and `TypeCoercionSuite`.
      
      Author: Xingbo Jiang <xingbo.jiang@databricks.com>
      
      Closes #18540 from jiangxb1987/rangeFrame.
      92d85637
    • Liang-Chi Hsieh's avatar
      [SPARK-21555][SQL] RuntimeReplaceable should be compared semantically by its canonicalized child · 9c8109ef
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      When there are aliases (added for nested fields) as parameters in `RuntimeReplaceable`, those aliases can't be cleaned up by the analyzer rule `CleanupAliases` because they are not among the child expressions.
      
      An expression such as `nvl(foo.foo1, "value")` can be resolved to two semantically different expressions in a group-by query because the two resolutions contain different aliases.
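
      A hedged sketch of the kind of query that hits this (the table name `t` and nested column `foo.foo1` are illustrative):

      ```scala
      // Each occurrence of nvl() over the nested field gets its own alias during
      // resolution, so the grouping expression and the select expression may fail
      // to compare as semantically equal.
      spark.sql("""
        SELECT nvl(foo.foo1, "value")
        FROM t
        GROUP BY nvl(foo.foo1, "value")
      """)
      ```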
      
      Because those aliases are not children of `RuntimeReplaceable`, which is a `UnaryExpression`, we can't trim them out by simply transforming the expressions in `CleanupAliases`.
      
      If we want to replace the non-children aliases in `RuntimeReplaceable`, we would need to add more code to `RuntimeReplaceable` and modify all `RuntimeReplaceable` expressions. That makes the interface ugly IMO.
      
      Considering that those aliases will be replaced later during optimization and so do no harm, this patch chooses to simply override `canonicalized` in `RuntimeReplaceable`.
      
      One concern is about `CleanupAliases`: it actually cannot clean up ALL aliases inside a plan. To make callers of this rule aware of that, this patch adds a comment to `CleanupAliases`.
      
      ## How was this patch tested?
      
      Added test.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #18761 from viirya/SPARK-21555.
      9c8109ef
    • shaofei007's avatar
      [SPARK-21357][DSTREAMS] FileInputDStream not remove out of date RDD · 60e9b2bd
      shaofei007 authored
      ## What changes were proposed in this pull request?
      
      ```scala
      // In class FileInputDStream (around line 162):
      protected[streaming] override def clearMetadata(time: Time) {
        batchTimeToSelectedFiles.synchronized {
          val oldFiles = batchTimeToSelectedFiles.filter(_._1 < (time - rememberDuration))
          batchTimeToSelectedFiles --= oldFiles.keys
          // ...
        }
      }
      ```
      The above code does not remove the old `generatedRDDs`. `super.clearMetadata(time)` was added at the beginning of `clearMetadata` to remove the old `generatedRDDs`.
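
      A minimal sketch of the fixed method described above (assuming the `FileInputDStream` members shown in the snippet; the rest of the method body is unchanged):

      ```scala
      protected[streaming] override def clearMetadata(time: Time) {
        super.clearMetadata(time)  // also removes the out-of-date entries from generatedRDDs
        batchTimeToSelectedFiles.synchronized {
          val oldFiles = batchTimeToSelectedFiles.filter(_._1 < (time - rememberDuration))
          batchTimeToSelectedFiles --= oldFiles.keys
          // ... rest of the original method body ...
        }
      }
      ```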
      
      ## How was this patch tested?
      
      Testing code (printing the number of `generatedRDDs`) was added at the end of `clearMetadata` to manually check that the old RDDs were removed.
      
      Author: shaofei007 <1427357147@qq.com>
      Author: Fei Shao <1427357147@qq.com>
      
      Closes #18718 from shaofei007/master.
      60e9b2bd
    • Remis Haroon's avatar
      [SPARK-21508][DOC] Fix example code provided in Spark Streaming Documentation · c1438203
      Remis Haroon authored
      ## What changes were proposed in this pull request?
      
      JIRA ticket : [SPARK-21508](https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21508)
      
      This corrects a mistake in the example code provided in the 'Spark Streaming Custom Receivers' documentation.
      Doc link: https://spark.apache.org/docs/latest/streaming-custom-receivers.html
      
      ```
      
      // Assuming ssc is the StreamingContext
      val customReceiverStream = ssc.receiverStream(new CustomReceiver(host, port))
      val words = lines.flatMap(_.split(" "))
      ...
      ```
      
      Instead of `lines.flatMap(_.split(" "))`, it should be `customReceiverStream.flatMap(_.split(" "))`.
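
      The corrected snippet (as described above; `host` and `port` come from the surrounding example):

      ```scala
      // Assuming ssc is the StreamingContext
      val customReceiverStream = ssc.receiverStream(new CustomReceiver(host, port))
      val words = customReceiverStream.flatMap(_.split(" "))
      ```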
      
      ## How was this patch tested?
      This documentation change was tested manually via a Jekyll build, running the commands below:
      ```
      jekyll build
      jekyll serve --watch
      ```
      screen-shots provided below
      ![screenshot1](https://user-images.githubusercontent.com/8828470/28744636-a6de1ac6-7482-11e7-843b-ff84b5855ec0.png)
      ![screenshot2](https://user-images.githubusercontent.com/8828470/28744637-a6def496-7482-11e7-9512-7f4bbe027c6a.png)
      
      Author: Remis Haroon <Remis.Haroon@insdc01.pwc.com>
      
      Closes #18770 from remisharoon/master.
      c1438203
  3. Jul 28, 2017
    • hyukjinkwon's avatar
      [SPARK-20090][PYTHON] Add StructType.fieldNames in PySpark · b56f79cc
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes `StructType.fieldNames`, which returns a copy of the field name list, rather than the (undocumented) `StructType.names`.
      
      There are two points here:
      
        - API consistency with Scala/Java
      
        - Provide a safe way to get the field names. Manipulating these might cause unexpected behaviour as below:
      
          ```python
          from pyspark.sql.types import *
      
          struct = StructType([StructField("f1", StringType(), True)])
          names = struct.names
          del names[0]
          spark.createDataFrame([{"f1": 1}], struct).show()
          ```
      
          ```
          ...
          java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 1 fields are required while 0 values are provided.
          	at org.apache.spark.sql.execution.python.EvaluatePython$.fromJava(EvaluatePython.scala:138)
          	at org.apache.spark.sql.SparkSession$$anonfun$6.apply(SparkSession.scala:741)
          	at org.apache.spark.sql.SparkSession$$anonfun$6.apply(SparkSession.scala:741)
          ...
          ```
      
      ## How was this patch tested?
      
      Added tests in `python/pyspark/sql/tests.py`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18618 from HyukjinKwon/SPARK-20090.
      b56f79cc
    • Johan Grande's avatar
      Typo in comment · 0ef9fe64
      Johan Grande authored
      -
      
      Author: Johan Grande <nahoj@crans.org>
      
      Closes #18738 from nahoj/patch-1.
      0ef9fe64
    • pgandhi's avatar
      [SPARK-21541][YARN] Spark Logs show incorrect job status for a job that does... · 69ab0e4b
      pgandhi authored
      [SPARK-21541][YARN] Spark Logs show incorrect job status for a job that does not create SparkContext
      
      If you run a Spark job without creating the SparkSession or SparkContext, the Spark job logs say it succeeded, but YARN says it failed and retries 3 times. Also, since the Application Master unregisters with the Resource Manager and exits successfully, it deletes the Spark staging directory, so when YARN makes subsequent retries, it fails to find the staging directory and the retries fail.
      
      Added a flag to check whether the user has initialized the SparkContext. If it is true, we let the Application Master unregister with the Resource Manager; otherwise, we do not let the AM unregister with the RM.
      
      ## How was this patch tested?
      Manually tested the fix.
      Before:
      <img width="1253" alt="screen shot-before" src="https://user-images.githubusercontent.com/22228190/28647214-69bf81e2-722b-11e7-9ed0-d416d2bf23be.png">
      
      After:
      <img width="1319" alt="screen shot-after" src="https://user-images.githubusercontent.com/22228190/28647220-70f9eea2-722b-11e7-85c6-e56276b15614.png">
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: pgandhi <pgandhi@yahoo-inc.com>
      Author: pgandhi999 <parthkgandhi9@gmail.com>
      
      Closes #18741 from pgandhi999/SPARK-21541.
      69ab0e4b
    • davidxdh's avatar
      [SPARK-21553][SPARK SHELL] Add the description of the default value of master... · 78468090
      davidxdh authored
      [SPARK-21553][SPARK SHELL] Add the description of the default value of master parameter in the spark-shell
      
      When I type `spark-shell --help`, I find that the default value description for the master parameter is missing. The user does not know what the default value is when the master parameter is omitted, so we need to add the default value of the master parameter to the help information.
      [https://issues.apache.org/jira/browse/SPARK-21553](https://issues.apache.org/jira/browse/SPARK-21553)
      
      Author: davidxdh <xu.donghui@zte.com.cn>
      Author: Donghui Xu <xu.donghui@zte.com.cn>
      
      Closes #18755 from davidxdh/dev_0728.
      78468090
    • Sean Owen's avatar
      [MINOR][BUILD] Fix current lint-java failures · 63d168cb
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Fixes current failures in dev/lint-java
      
      ## How was this patch tested?
      
      Existing linter, tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #18757 from srowen/LintJava.
      63d168cb
  4. Jul 27, 2017
    • Yan Facai (颜发才)'s avatar
      [SPARK-21306][ML] OneVsRest should support setWeightCol · a5a31899
      Yan Facai (颜发才) authored
      ## What changes were proposed in this pull request?
      
      Add a `setWeightCol` method for OneVsRest.

      `weightCol` is ignored if the classifier doesn't inherit the HasWeightCol trait.
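
      A hedged usage sketch of the new API (assumes a training DataFrame `train` with "features", "label" and "weight" columns; the names are illustrative):

      ```scala
      import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

      // LogisticRegression mixes in HasWeightCol, so the instance weights are honored.
      val ovr = new OneVsRest()
        .setClassifier(new LogisticRegression())
        .setWeightCol("weight")
      val ovrModel = ovr.fit(train)
      ```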
      
      ## How was this patch tested?
      
      + [x] add a unit test.
      
      Author: Yan Facai (颜发才) <facai.yan@gmail.com>
      
      Closes #18554 from facaiy/BUG/oneVsRest_missing_weightCol.
      a5a31899
    • aokolnychyi's avatar
      [SPARK-21538][SQL] Attribute resolution inconsistency in the Dataset API · f44ead89
      aokolnychyi authored
      ## What changes were proposed in this pull request?
      
      This PR contains a tiny update that removes an attribute resolution inconsistency in the Dataset API. The following example is taken from the ticket description:
      
      ```
      spark.range(1).withColumnRenamed("id", "x").sort(col("id"))  // works
      spark.range(1).withColumnRenamed("id", "x").sort($"id")  // works
      spark.range(1).withColumnRenamed("id", "x").sort('id) // works
      spark.range(1).withColumnRenamed("id", "x").sort("id") // fails with:
      org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among (x);
      ```
      The above `AnalysisException` happens because the last case calls `Dataset.apply()` to convert strings into columns, which triggers attribute resolution. To make the API consistent between overloaded methods, this PR defers the resolution and constructs columns directly.
      
      Author: aokolnychyi <anton.okolnychyi@sap.com>
      
      Closes #18740 from aokolnychyi/spark-21538.
      f44ead89
    • Wenchen Fan's avatar
      [SPARK-21319][SQL] Fix memory leak in sorter · 9f5647d6
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      `UnsafeExternalSorter.recordComparator` can be either `KVComparator` or `RowComparator`, and both of them will keep the reference to the input rows they compared last time.
      
      After sorting, we return the sorted iterator to upstream operators. However, the upstream operators may take a while to fully consume the sorted iterator, and `UnsafeExternalSorter` is registered to `TaskContext` [here](https://github.com/apache/spark/blob/v2.2.0/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L159-L161), which means we keep the `UnsafeExternalSorter` instance, and thus the last compared input rows, in memory until the sorted iterator is fully consumed.
      
      Things get worse if we sort within the partitions of a dataset and then coalesce all partitions into one, as we will keep a lot of input rows in memory and it takes a long time to fully consume all the sorted iterators.
      
      This PR takes over https://github.com/apache/spark/pull/18543; the idea is that we do not keep the record comparator instance in `UnsafeExternalSorter`, but rather a generator of record comparators.
      
      close #18543
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18679 from cloud-fan/memory-leak.
      9f5647d6
    • actuaryzhang's avatar
      [SPARK-19270][ML] Add summary table to GLM summary · ddcd2e82
      actuaryzhang authored
      ## What changes were proposed in this pull request?
      
      Add an R-like summary table to the GLM summary, which includes the feature name (if it exists), parameter estimate, standard error, t-statistic and p-value. This allows Scala users to easily gather these commonly used inference results.
      
      srowen yanboliang  felixcheung
      
      ## How was this patch tested?
      New tests: one for testing the feature names, and one for testing the summary table.
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      Author: Wayne Zhang <actuaryzhang10@gmail.com>
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #16630 from actuaryzhang/glmTable.
      ddcd2e82
    • Takuya UESHIN's avatar
      [SPARK-21440][SQL][PYSPARK] Refactor ArrowConverters and add ArrayType and StructType support. · 2ff35a05
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      This is a refactoring of `ArrowConverters` and related classes.
      
      1. Refactor `ColumnWriter` as `ArrowWriter`.
      2. Add `ArrayType` and `StructType` support.
      3. Refactor `ArrowConverters` to skip intermediate `ArrowRecordBatch` creation.
      
      ## How was this patch tested?
      
      Added some tests and existing tests.
      
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #18655 from ueshin/issues/SPARK-21440.
      2ff35a05
    • Kazuaki Ishizaki's avatar
      [SPARK-21271][SQL] Ensure Unsafe.sizeInBytes is a multiple of 8 · ebbe589d
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR ensures that `Unsafe.sizeInBytes` is a multiple of 8. If that is not satisfied, `Unsafe.hashCode` causes an assertion violation.
      
      ## How was this patch tested?
      
      Will add test cases
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #18503 from kiszk/SPARK-21271.
      ebbe589d
  5. Jul 26, 2017
    • jinxing's avatar
      [SPARK-21530] Update description of spark.shuffle.maxChunksBeingTransferred. · cfb25b27
      jinxing authored
      ## What changes were proposed in this pull request?
      
      Update the description of `spark.shuffle.maxChunksBeingTransferred` to note that newly incoming connections will be closed when the max is hit and that the client should have a retry mechanism.
      
      Author: jinxing <jinxing6042@126.com>
      
      Closes #18735 from jinxing64/SPARK-21530.
      cfb25b27
    • hyukjinkwon's avatar
      [SPARK-21485][SQL][DOCS] Spark SQL documentation generation for built-in functions · 60472dbf
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This generates a documentation for Spark SQL built-in functions.
      
      One drawback is that this requires a proper build to generate the built-in function list.
      Once it is built, it only takes a few seconds via `sql/create-docs.sh`.
      
      Please see https://spark-test.github.io/sparksqldoc/ that I hosted to show the output documentation.
      
      There is a bit more work to be done in order to make the documentation pretty, for example, separating `Arguments:` and `Examples:`, but I guess this should be done within `ExpressionDescription` and `ExpressionInfo` rather than by manually parsing them. I will fix these in a follow-up.
      
      This requires `pip install mkdocs` to generate HTMLs from markdown files.
      
      ## How was this patch tested?
      
      Manually tested:
      
      ```
      cd docs
      jekyll build
      ```
      ,
      
      ```
      cd docs
      jekyll serve
      ```
      
      and
      
      ```
      cd sql
      create-docs.sh
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18702 from HyukjinKwon/SPARK-21485.
      60472dbf
    • sethah's avatar
      [SPARK-20988][ML] Logistic regression uses aggregator hierarchy · cf29828d
      sethah authored
      ## What changes were proposed in this pull request?
      
      This change pulls the `LogisticAggregator` class out of LogisticRegression.scala and makes it extend `DifferentiableLossAggregator`. It also changes logistic regression to use the generic `RDDLossFunction` instead of having its own.
      
      Other minor changes:
      * L2Regularization accepts `Option[Int => Double]` for features standard deviation
      * L2Regularization uses `Vector` type instead of Array
      * Some tests added to LeastSquaresAggregator
      
      ## How was this patch tested?
      
      Unit test suites are added.
      
      Author: sethah <shendrickson@cloudera.com>
      
      Closes #18305 from sethah/SPARK-20988.
      cf29828d
    • Yuhao Yang's avatar
      [SPARK-21524][ML] unit test fix: ValidatorParamsSuiteHelpers generates wrong temp files · ae4ea5fe
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      jira: https://issues.apache.org/jira/browse/SPARK-21524
      
      ValidatorParamsSuiteHelpers.testFileMove() generates temp dirs in the wrong place and does not delete them.
      
      ValidatorParamsSuiteHelpers.testFileMove() is invoked by TrainValidationSplitSuite and crossValidatorSuite. Currently it uses `tempDir` from `TempDirectory`, which unfortunately is never initialized since the `beforeAll()` of `ValidatorParamsSuiteHelpers` is never invoked.
      
      On my system, it leaves some temp directories in the assembly folder each time I run TrainValidationSplitSuite and crossValidatorSuite.
      
      ## How was this patch tested?
      unit test fix
      
      Author: Yuhao Yang <yuhao.yang@intel.com>
      
      Closes #18728 from hhbyyh/tempDirFix.
      ae4ea5fe
  6. Jul 25, 2017
    • zhoukang's avatar
      [SPARK-21517][CORE] Avoid copying memory when transfer chunks remotely · 16612638
      zhoukang authored
      ## What changes were proposed in this pull request?
      
      In our production cluster, OOM happens when NettyBlockRpcServer receives an OpenBlocks message. The reason we observed is below:
      When BlockManagerManagedBuffer calls ChunkedByteBuffer#toNetty, it uses Unpooled.wrappedBuffer(ByteBuffer... buffers), which uses the default maxNumComponents=16 in the low-level CompositeByteBuf. When the number of components is bigger than 16, it executes consolidateIfNeeded:
      
              int numComponents = this.components.size();
              if(numComponents > this.maxNumComponents) {
                  int capacity = ((CompositeByteBuf.Component)this.components.get(numComponents - 1)).endOffset;
                  ByteBuf consolidated = this.allocBuffer(capacity);
      
                  for(int c = 0; c < numComponents; ++c) {
                      CompositeByteBuf.Component c1 = (CompositeByteBuf.Component)this.components.get(c);
                      ByteBuf b = c1.buf;
                      consolidated.writeBytes(b);
                      c1.freeIfNecessary();
                  }
      
                  CompositeByteBuf.Component var7 = new CompositeByteBuf.Component(consolidated);
                  var7.endOffset = var7.length;
                  this.components.clear();
                  this.components.add(var7);
              }
      
      This consolidation in CompositeByteBuf consumes some memory during the buffer copy.
      We can use another API, Unpooled.wrappedBuffer(int maxNumComponents, ByteBuffer... buffers), to avoid this consumption.
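
      A hedged sketch of the pattern (Netty's Unpooled API; the buffers here are illustrative):

      ```scala
      import java.nio.ByteBuffer
      import io.netty.buffer.{ByteBuf, Unpooled}

      val chunks: Array[ByteBuffer] = Array.fill(32)(ByteBuffer.allocate(1024))
      // Passing chunks.length as maxNumComponents keeps CompositeByteBuf from
      // consolidating (copying) the wrapped buffers once there are more than 16.
      val buf: ByteBuf = Unpooled.wrappedBuffer(chunks.length, chunks: _*)
      ```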
      
      ## How was this patch tested?
      
      Test in production cluster.
      
      Author: zhoukang <zhoukang@xiaomi.com>
      
      Closes #18723 from caneGuy/zhoukang/fix-chunkbuffer.
      16612638
    • Marcelo Vanzin's avatar
      [SPARK-21494][NETWORK] Use correct app id when authenticating to external service. · 300807c6
      Marcelo Vanzin authored
      There was some code based on the old SASL handler in the new auth client that
      was incorrectly using the SASL user as the user to authenticate against the
      external shuffle service. This caused the external service to not be able to
      find the correct secret to authenticate the connection, failing the connection.
      
      In the course of debugging, I found that some log messages from the YARN shuffle
      service were a little noisy, so I silenced some of them, and also added a couple
      of new ones that helped find this issue. On top of that, I found that a check
      in the code that records app secrets was wrong, causing more log spam and also
      using an O(n) operation instead of an O(1) call.
      
      Also added a new integration suite for the YARN shuffle service with auth on,
      and verified it failed before, and passes now.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #18706 from vanzin/SPARK-21494.
      300807c6
    • gatorsmile's avatar
      [SPARK-20586][SQL] Add deterministic to ScalaUDF · ebc24a9b
      gatorsmile authored
      ### What changes were proposed in this pull request?
      Like [Hive UDFType](https://hive.apache.org/javadocs/r2.0.1/api/org/apache/hadoop/hive/ql/udf/UDFType.html), we should allow users to add the extra flags for ScalaUDF and JavaUDF too. _stateful_/_impliesOrder_ are not applicable to our Scala UDF. Thus, we only add the following two flags.
      
      - deterministic: Certain optimizations should not be applied if the UDF is not deterministic. A deterministic UDF returns the same result each time it is invoked with a particular input. This determinism only needs to hold within the context of a query.
      
      When the deterministic flag is not correctly set, the results could be wrong.
      
      For ScalaUDF in Dataset APIs, users can call the following extra APIs for `UserDefinedFunction` to make the corresponding changes.
      - `nonDeterministic`: Updates UserDefinedFunction to non-deterministic.
      
      Also fixed the Java UDF name loss issue.
      
      Will submit a separate PR for `distinctLike`  for UDAF
      
      ### How was this patch tested?
      Added test cases for both ScalaUDF and JavaUDF.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: Wenchen Fan <cloud0fan@gmail.com>
      
      Closes #17848 from gatorsmile/udfRegister.
      ebc24a9b
    • iurii.ant's avatar
      [SPARK-21491][GRAPHX] Enhance GraphX performance: breakOut instead of .toMap · 9b4da7b7
      iurii.ant authored
      ## What changes were proposed in this pull request?
      
      `Traversable.toMap` was changed to `collection.breakOut`, which eliminates the creation of an intermediate tuple collection; see this [Stack Overflow article](https://stackoverflow.com/questions/1715681/scala-2-8-breakout).
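
      A hedged sketch of the pattern (pre-2.13 Scala collections; the data is illustrative):

      ```scala
      import scala.collection.breakOut

      val words = Seq("graph", "vertex", "edge")
      // breakOut supplies a CanBuildFrom that builds the Map directly while mapping,
      // instead of materializing an intermediate Seq of tuples and calling .toMap.
      val lengths: Map[String, Int] = words.map(w => w -> w.length)(breakOut)
      ```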
      
      ## How was this patch tested?
      Unit tests run.
      No performance tests performed yet.
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: iurii.ant <sereneant@gmail.com>
      
      Closes #18693 from SereneAnt/performance_toMap-breakOut.
      9b4da7b7
    • Eric Vandenberg's avatar
      [SPARK-21447][WEB UI] Spark history server fails to render compressed · 06a97937
      Eric Vandenberg authored
      inprogress history file in some cases.
      
      Add failure handling for the EOFException that can be thrown during
      decompression of an in-progress Spark history file, treating it the same as the
      case where the last line can't be parsed.
      
      ## What changes were proposed in this pull request?
      
      Failure handling for the case of an EOFException thrown within the ReplayListenerBus.replay method, analogous to the JSON parse failure case. This path can arise in compressed in-progress history files, since an incomplete compression block could be read (not yet flushed by the writer on a block boundary). See the stack trace of this occurrence in the JIRA ticket (https://issues.apache.org/jira/browse/SPARK-21447).
      
      ## How was this patch tested?
      
      Added a unit test that specifically targets validating the failure handling path appropriately when maybeTruncated is true and false.
      
      Author: Eric Vandenberg <ericvandenberg@fb.com>
      
      Closes #18673 from ericvandenbergfb/fix_inprogress_compr_history_file.
      06a97937
    • DjvuLee's avatar
      [SPARK-21383][YARN] Fix the YarnAllocator allocates more Resource · 8de080d9
      DjvuLee authored
      When NodeManagers are launching executors, the `missing` value can exceed the
      real value if the launch is slow; this can lead YARN to allocate more resources than needed.
      
      We add `numExecutorsRunning` when calculating `missing` to avoid this.
      
      Tested by experiment.
      
      Author: DjvuLee <lihu@bytedance.com>
      
      Closes #18651 from djvulee/YarnAllocate.
      8de080d9
    • jinxing's avatar
      [SPARK-21175] Reject OpenBlocks when memory shortage on shuffle service. · 799e1316
      jinxing authored
      ## What changes were proposed in this pull request?
      
      A shuffle service can serve blocks for multiple apps/tasks. Thus the shuffle service can suffer high memory usage when lots of shuffle reads happen at the same time. In my cluster, OOM always happens on the shuffle service. Analyzing a heap dump, the memory used by Netty (ChannelOutboundBufferEntry) can be up to 2~3G. It might make sense to reject "open blocks" requests when memory usage is high on the shuffle service.
      
      https://github.com/apache/spark/commit/93dd0c518d040155b04e5ab258c5835aec7776fc and https://github.com/apache/spark/commit/85c6ce61930490e2247fb4b0e22dfebbb8b6a1ee tried to alleviate the memory pressure on the shuffle service but cannot solve the root cause. This PR proposes to control the concurrency of shuffle reads.
      
      ## How was this patch tested?
      Added unit test.
      
      Author: jinxing <jinxing6042@126.com>
      
      Closes #18388 from jinxing64/SPARK-21175.
      799e1316
    • Trueman's avatar
      [SPARK-21498][EXAMPLES] quick start -> one py demo have some bug in code · 996a809c
      Trueman authored
      I found a bug in the 'quick start' guide and created a new issue. Sean Owen asked
      me to make a pull request, and I did.
      
      ## What changes were proposed in this pull request?
      
      (Please fill in changes proposed in this fix)
      
      ## How was this patch tested?
      
      (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
      (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Trueman <lizhaoch@users.noreply.github.com>
      Author: lizhaoch <lizhaoc@163.com>
      
      Closes #18722 from lizhaoch/master.
      996a809c
    • Yash Sharma's avatar
      [SPARK-20855][Docs][DStream] Update the Spark kinesis docs to use the... · 4f77c062
      Yash Sharma authored
      [SPARK-20855][Docs][DStream] Update the Spark kinesis docs to use the KinesisInputDStream builder instead of deprecated KinesisUtils
      
      ## What changes were proposed in this pull request?
      
      The examples and docs for Spark-Kinesis integrations use the deprecated KinesisUtils. We should update the docs to use the KinesisInputDStream builder to create DStreams.
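
      A hedged sketch of the builder-based API the docs should recommend (the stream name, endpoint, region and app name are illustrative; this assumes the Spark 2.2 Kinesis module and a StreamingContext named `ssc`):

      ```scala
      import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
      import org.apache.spark.storage.StorageLevel
      import org.apache.spark.streaming.Seconds
      import org.apache.spark.streaming.kinesis.KinesisInputDStream

      val kinesisStream = KinesisInputDStream.builder
        .streamingContext(ssc)
        .streamName("my-kinesis-stream")
        .endpointUrl("https://kinesis.us-east-1.amazonaws.com")
        .regionName("us-east-1")
        .initialPositionInStream(InitialPositionInStream.LATEST)
        .checkpointAppName("my-kinesis-app")
        .checkpointInterval(Seconds(10))
        .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
        .build()
      ```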
      
      ## How was this patch tested?
      
      The patch primarily updates the documents. The patch will also need to make changes to the Spark-Kinesis examples. The examples need to be tested.
      
      Author: Yash Sharma <ysharma@atlassian.com>
      
      Closes #18071 from yssharma/ysharma/kinesis_docs.
      4f77c062
  7. Jul 24, 2017
    • Kazuaki Ishizaki's avatar
      [SPARK-21516][SQL][TEST] Overriding afterEach() in DatasetCacheSuite must call super.afterEach() · 7f295059
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR ensures that the overridden `afterEach()` method in `DatasetCacheSuite` calls `super.afterEach()`. When we override the `afterEach()` method in a test suite, we have to call `super.afterEach()`.
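
      A minimal sketch of the pattern (plain ScalaTest; the suite name and cleanup are illustrative):

      ```scala
      import org.scalatest.{BeforeAndAfterEach, FunSuite}

      class MySuite extends FunSuite with BeforeAndAfterEach {
        override def afterEach(): Unit = {
          try {
            // suite-specific cleanup, e.g. unpersisting cached datasets
          } finally {
            super.afterEach()  // always let the parent suite run its own cleanup
          }
        }
      }
      ```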
      
      This is a follow-up of #18719 and SPARK-21512.
      
      ## How was this patch tested?
      
      Used the existing test suite
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #18721 from kiszk/SPARK-21516.
      7f295059
    • Stavros Kontopoulos's avatar
      [SPARK-21502][MESOS] fix --supervise for mesos in cluster mode · b09ec92a
      Stavros Kontopoulos authored
      ## What changes were proposed in this pull request?
      With supervise enabled for a driver, re-launching it was failing because the driver had the same framework id. This patch creates a new driver framework id every time we re-launch a driver, but we keep the driver submission id the same, since it matches the task id the driver was launched with on Mesos, and the retry state and other info within the Dispatcher's data structures use it as a key.
      We append a "-retry-%4d" string as a suffix to the framework id passed by the dispatcher to the driver, and the same value to the app_id created by each driver, except the first time, where we don't need the retry suffix.
      The previous format for the frameworkId was 'DispatcherFId-DriverSubmissionId'.
      
      We also detect the case where multiple Spark contexts are started from within the same driver, and we set proper names for their corresponding app-ids. The old practice was to unset the framework id passed from the dispatcher after the driver framework was started for the first time and let Mesos decide the framework id for subsequent Spark contexts. The decided framework id was then passed as an appId.
      This patch heavily affects the history server. By the way, we don't have the issue of the standalone case where the driver id must be different, since the dispatcher will re-launch a driver (Mesos task) only if it gets an update that it is dead, and this is verified by Mesos implicitly. We also don't fix the fine-grained mode, which is deprecated and of no use.
      
      ## How was this patch tested?
      
      This task was manually tested on dc/os. Launched a driver, stopped its container and verified the expected behavior.
      
      Initial retry of the driver, driver in pending state:
      
      ![image](https://user-images.githubusercontent.com/7945591/28473862-1088b736-6e4f-11e7-8d7d-7b785b1da6a6.png)
      
      Driver re-launched:
      ![image](https://user-images.githubusercontent.com/7945591/28473885-26e02d16-6e4f-11e7-9eb8-6bf7bdb10cb8.png)
      
      Another re-try:
      ![image](https://user-images.githubusercontent.com/7945591/28473897-35702318-6e4f-11e7-9585-fd295ad7c6b6.png)
      
      The resulted entries in history server at the bottom:
      
      ![image](https://user-images.githubusercontent.com/7945591/28473910-4946dabc-6e4f-11e7-90a6-fa4f80893c61.png)
      
      Regarding multiple spark contexts here is the end result regarding the spark history server, for the second spark context we add an increasing number as a suffix:
      
      ![image](https://user-images.githubusercontent.com/7945591/28474432-69cf8b06-6e51-11e7-93c7-e6c0b04dec93.png)
      
      Author: Stavros Kontopoulos <st.kontopoulos@gmail.com>
      
      Closes #18705 from skonto/fix_supervise_flag.
      b09ec92a
    • Wenchen Fan's avatar
      [SPARK-17528][SQL][FOLLOWUP] remove unnecessary data copy in object hash aggregate · 86664338
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      In #18483, we fixed the data copy bug when saving into `InternalRow`, and removed all workarounds for this bug in the aggregate code path. However, the object hash aggregate was missed; this PR fixes it.
      
      This patch is also a requirement for #17419 , which shows that DataFrame version is slower than RDD version because of this issue.
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18712 from cloud-fan/minor.
      86664338
  8. Jul 23, 2017
    • Kazuaki Ishizaki's avatar
      [SPARK-21512][SQL][TEST] DatasetCacheSuite needs to execute unpersistent after executing peristent · 481f0792
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR avoids reusing a persisted dataset across test cases by unpersisting the dataset at the end of each test case.

      In `DatasetCacheSuite`, the test case `"get storage level"` does not unpersist the dataset after persisting it. The same dataset is then persisted again by the test case `"persist and then rebind right encoder when join 2 datasets"`. Thus, when we run both test cases, the second one does not actually persist the dataset, because it is still cached from the first case (see the "Asked to cache already cached data" warnings below). When we run only the second case, it does persist the dataset. It is not good to change the behavior of the second test case; the first test case should correctly unpersist the dataset.
      
      ```
      Testing started at 17:52 ...
      01:52:15.053 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      01:52:48.595 WARN org.apache.spark.sql.execution.CacheManager: Asked to cache already cached data.
      01:52:48.692 WARN org.apache.spark.sql.execution.CacheManager: Asked to cache already cached data.
      01:52:50.864 WARN org.apache.spark.storage.RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
      01:52:50.864 WARN org.apache.spark.storage.RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
      01:52:50.868 WARN org.apache.spark.storage.BlockManager: Block rdd_8_1 replicated to only 0 peer(s) instead of 1 peers
      01:52:50.868 WARN org.apache.spark.storage.BlockManager: Block rdd_8_0 replicated to only 0 peer(s) instead of 1 peers
      ```
      
      After this PR, these messages do not appear
      ```
      Testing started at 18:14 ...
      02:15:05.329 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      
      Process finished with exit code 0
      ```
      
      ## How was this patch tested?
      
      Used the existing test
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #18719 from kiszk/SPARK-21512.
      481f0792
    • Reynold Xin's avatar
      [MINOR] Remove **** in test case names in FlatMapGroupsWithStateSuite · a4eac8b0
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch removes the `****` string from test names in FlatMapGroupsWithStateSuite. `***` is a common string developers grep for when using Scala test (because it immediately shows the failing test cases). The existence of the `****` in test names disrupts that workflow.
      
      ## How was this patch tested?
      N/A - test only change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #18715 from rxin/FlatMapGroupsWithStateStar.
      a4eac8b0
    • pj.fanning's avatar
      [SPARK-20871][SQL] limit logging of Janino code · 2a53fbfc
      pj.fanning authored
      ## What changes were proposed in this pull request?
      
      When the generated code is greater than 64KB, the Janino compile will fail and CodeGenerator.scala will log the entire code at ERROR level.
      SPARK-20871 suggests only logging the code at DEBUG level.
      Since the code is already logged at DEBUG level, this pull request proposes not including the formatted code in the ERROR logging and exception message at all.
      When an exception occurs, the code will be logged at INFO level but truncated if it is more than 1000 lines long.
      
      ## How was this patch tested?
      
      Existing tests were run.
      An extra test case was added to CodeFormatterSuite to test the new maxLines parameter.
      
      Author: pj.fanning <pj.fanning@workday.com>
      
      Closes #18658 from pjfanning/SPARK-20871.
      2a53fbfc
    • Marcelo Vanzin's avatar
      [SPARK-20904][CORE] Don't report task failures to driver during shutdown. · cecd285a
      Marcelo Vanzin authored
      Executors run a thread pool with daemon threads to run tasks. This means
      that those threads remain active when the JVM is shutting down, meaning
      those tasks are affected by code that runs in shutdown hooks.
      
      So if a shutdown hook messes with something that the task is using (e.g.
      an HDFS connection), the task will fail and will report that failure to
      the driver. That will make the driver mark the task as failed regardless
      of what caused the executor to shut down. So, for example, if YARN pre-empted
      that executor, the driver would consider that task failed when it should
      instead ignore the failure.
      
      This change avoids reporting failures to the driver when shutdown hooks
      are executing; this fixes the YARN preemption accounting, and doesn't really
      change things much for other scenarios, other than reporting a more generic
      error ("Executor lost") when the executor shuts down unexpectedly - which
      is arguably more correct.
      
      Tested with a hacky app running on spark-shell that tried to cause failures
      only when shutdown hooks were running, verified that preemption didn't cause
      the app to fail because of task failures exceeding the threshold.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #18594 from vanzin/SPARK-20904.
      cecd285a
    • Wenchen Fan's avatar
      [SPARK-10063] Follow-up: remove a useless test related to an old output committer · ccaee5b5
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      It's a follow-up of https://github.com/apache/spark/pull/18689 , which forgot to remove a useless test.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18716 from cloud-fan/test.
      ccaee5b5
  9. Jul 21, 2017