  1. Nov 25, 2016
    • Takuya UESHIN's avatar
      [SPARK-18583][SQL] Fix nullability of InputFileName. · da66b974
      Takuya UESHIN authored
      
      ## What changes were proposed in this pull request?
      
      The nullability of `InputFileName` should be `false`.
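
      A quick way to observe the change (a minimal sketch, assuming a running `SparkSession` named `spark` and any existing text file path):

      ```scala
      // Minimal sketch: after this fix, the input_file_name() column should report
      // nullable = false in the schema. The file path below is only a placeholder.
      import org.apache.spark.sql.functions.input_file_name

      val df = spark.read.textFile("/tmp/example.txt").select(input_file_name().as("file"))
      df.schema.foreach(f => println(s"${f.name}: nullable = ${f.nullable}"))
      ```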
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #16007 from ueshin/issues/SPARK-18583.
      
      (cherry picked from commit a88329d4)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      da66b974
    • hyukjinkwon's avatar
      [SPARK-3359][BUILD][DOCS] More changes to resolve javadoc 8 errors that will... · 69856f28
      hyukjinkwon authored
      [SPARK-3359][BUILD][DOCS] More changes to resolve javadoc 8 errors that will help unidoc/genjavadoc compatibility
      
      ## What changes were proposed in this pull request?
      
      This PR only tries to fix things that look pretty straightforward and were already fixed in other previous PRs.
      
      This PR roughly fixes several things as below:
      
      - Fix unrecognisable class and method links in javadoc by changing them from `[[..]]` to `` `...` ``
      
        ```
        [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/DataStreamReader.java:226: error: reference not found
        [error]    * Loads text files and returns a {link DataFrame} whose schema starts with a string column named
        ```
      
      - Fix an exception annotation and remove code backticks in `throws` annotation
      
        Currently, sbt unidoc with Java 8 complains as below:
      
        ```
        [error] .../java/org/apache/spark/sql/streaming/StreamingQuery.java:72: error: unexpected text
        [error]    * throws StreamingQueryException, if <code>this</code> query has terminated with an exception.
        ```
      
        `throws` should specify the correct class name, `StreamingQueryException` rather than `StreamingQueryException,`, and without code backticks (see [JDK-8007644](https://bugs.openjdk.java.net/browse/JDK-8007644)); a schematic sketch appears after this list.
      
      - Fix `[[http..]]` to `<a href="http..."></a>`.
      
        ```diff
        -   * [[https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https Oracle
        -   * blog page]].
        +   * <a href="https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https">
        +   * Oracle blog page</a>.
        ```
      
         `[[http...]]` link markdown in scaladoc is unrecognisable in javadoc.
      
      - It seems a class can't have a `return` annotation, so two such cases were removed.
      
        ```
        [error] .../java/org/apache/spark/mllib/regression/IsotonicRegression.java:27: error: invalid use of return
        [error]    * return New instance of IsotonicRegression.
        ```
      
      - Fix < to `&lt;` and > to `&gt;` according to HTML rules.
      
      - Fix `</p>` complaint
      
      - Exclude unrecognisable in javadoc, `constructor`, `todo` and `groupname`.
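
      A schematic sketch of the `throws` fix above (illustrative class and method, not the actual Spark source):

      ```scala
      // Illustrative only: the point is the @throws format that javadoc 8 (via genjavadoc)
      // accepts, i.e. a bare class name with no trailing comma and no code backticks.
      class StreamingQueryException(msg: String) extends Exception(msg)

      object AwaitExample {
        /**
         * Waits for the termination of this query.
         *
         * @throws StreamingQueryException if this query has terminated with an exception
         */
        @throws[StreamingQueryException]
        def awaitTermination(): Unit = ()
      }
      ```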
      
      ## How was this patch tested?
      
      Manually tested by `jekyll build` with Java 7 and 8
      
      ```
      java version "1.7.0_80"
      Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
      Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
      ```
      
      ```
      java version "1.8.0_45"
      Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
      Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
      ```
      
      Note: this does not yet make sbt unidoc succeed with Java 8, but it reduces the number of errors.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15999 from HyukjinKwon/SPARK-3359-errors.
      
      (cherry picked from commit 51b1c155)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      69856f28
    • n.fraison's avatar
      [SPARK-18119][SPARK-CORE] Namenode safemode check is only performed on one... · a49dfa93
      n.fraison authored
      [SPARK-18119][SPARK-CORE] Namenode safemode check is only performed on one namenode which can stall the startup of the SparkHistory server
      
      ## What changes were proposed in this pull request?
      
      Instead of using the setSafeMode method that checks only the first namenode, use the one that permits checking only the active NNs.
      ## How was this patch tested?
      
      manual tests
      
      This commit is contributed by Criteo SA under the Apache v2 licence.
      
      Author: n.fraison <n.fraison@criteo.com>
      
      Closes #15648 from ashangit/SPARK-18119.
      
      (cherry picked from commit f42db0c0)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      a49dfa93
  2. Nov 23, 2016
    • Reynold Xin's avatar
      [SPARK-18557] Downgrade confusing memory leak warning message · e11d7c68
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
      TaskMemoryManager has a memory leak detector that gets called at task completion callback and checks whether any memory has not been released. If they are not released by the time the callback is invoked, TaskMemoryManager releases them.
      
      The current error message says something like the following:
      ```
      WARN  [Executor task launch worker-0]
      org.apache.spark.memory.TaskMemoryManager - leak 16.3 MB memory from
      org.apache.spark.unsafe.map.BytesToBytesMap33fb6a15
      ```
      In practice, there are multiple reasons why these can be triggered in the normal code path (e.g. limit, or task failures), and the fact that these messages are merely logged means the "leak" is fixed by TaskMemoryManager.

      To avoid confusing users, this patch downgrades the message from warning to debug level and avoids using the word "leak", since it is not actually a leak.
      
      ## How was this patch tested?
      N/A - this is a simple logging improvement.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15989 from rxin/SPARK-18557.
      
      (cherry picked from commit 9785ed40)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      e11d7c68
    • Eric Liang's avatar
      [SPARK-18545][SQL] Verify number of hive client RPCs in PartitionedTablePerfStatsSuite · 539c193a
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      This would help catch accidental O(n) calls to the hive client as in https://issues.apache.org/jira/browse/SPARK-18507
      
      ## How was this patch tested?
      
      Checked that the test fails before https://issues.apache.org/jira/browse/SPARK-18507 was patched. cc cloud-fan
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #15985 from ericl/spark-18545.
      
      (cherry picked from commit 85235ed6)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      539c193a
  3. Nov 19, 2016
    • Kazuaki Ishizaki's avatar
      [SPARK-18458][CORE] Fix signed integer overflow problem at an expression in RadixSort.java · b0b2f108
      Kazuaki Ishizaki authored
      
      ## What changes were proposed in this pull request?
      
      This PR prevents the result of an expression from becoming negative due to signed integer overflow (e.g. 0x10?????? * 8 < 0). It casts each operand to `long` before the calculation; since the result is then interpreted as a long, the value of the expression stays positive.
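
      A small illustration of the overflow pattern and the fix (plain Scala, not the actual `RadixSort` code):

      ```scala
      // With Int arithmetic the product wraps around and becomes negative; promoting an
      // operand to Long before multiplying keeps the result positive.
      val offset = 0x10000000                // a large Int, e.g. an element count
      val overflowed = offset * 8            // Int overflow: yields a negative number
      val correct    = offset.toLong * 8     // Long arithmetic: stays positive
      println(s"overflowed = $overflowed, correct = $correct")
      ```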
      
      ## How was this patch tested?
      
      Manually executed query82 of TPC-DS with 100TB
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #15907 from kiszk/SPARK-18458.
      
      (cherry picked from commit d93b6552)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      b0b2f108
    • Sean Owen's avatar
      [SPARK-18353][CORE] spark.rpc.askTimeout default value is not 120s · 30a6fbbb
      Sean Owen authored
      
      ## What changes were proposed in this pull request?
      
      Avoid hard-coding spark.rpc.askTimeout to non-default in Client; fix doc about spark.rpc.askTimeout default
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15833 from srowen/SPARK-18353.
      
      (cherry picked from commit 8b1e1088)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      30a6fbbb
    • hyukjinkwon's avatar
      [SPARK-18445][BUILD][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note... · 4b396a65
      hyukjinkwon authored
      [SPARK-18445][BUILD][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note that`/`'''Note:'''` across Scala/Java API documentation
      
      It seems that in Scala/Java, the following variants are used to mark notes:
      
      - `Note:`
      - `NOTE:`
      - `Note that`
      - `'''Note:'''`
      - `note`
      
      This PR proposes to fix those to `note` to be consistent.
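
      A schematic sketch of the intended style (illustrative method, not Spark source):

      ```scala
      // Illustrative only: a plain "Note:" sentence in the scaladoc body is replaced by the
      // @note tag, which renders consistently in scaladoc and, via genjavadoc, in javadoc.
      object NoteExample {
        /**
         * Returns twice the given value.
         *
         * @note Overflow is not checked; Int.MaxValue doubles to a negative number.
         */
        def double(x: Int): Int = 2 * x
      }
      ```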
      
      **Before**
      
      - Scala
        ![2016-11-17 6 16 39](https://cloud.githubusercontent.com/assets/6477701/20383180/1a7aed8c-acf2-11e6-9611-5eaf6d52c2e0.png)
      
      - Java
        ![2016-11-17 6 14 41](https://cloud.githubusercontent.com/assets/6477701/20383096/c8ffc680-acf1-11e6-914a-33460bf1401d.png)
      
      **After**
      
      - Scala
        ![2016-11-17 6 16 44](https://cloud.githubusercontent.com/assets/6477701/20383167/09940490-acf2-11e6-937a-0d5e1dc2cadf.png)
      
      - Java
        ![2016-11-17 6 13 39](https://cloud.githubusercontent.com/assets/6477701/20383132/e7c2a57e-acf1-11e6-9c47-b849674d4d88.png)
      
      The notes were found via
      
      ```bash
      grep -r "NOTE: " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// NOTE: " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages that appear in API documentation
      -e 'org.apache.spark.api.java.function' \ # note that this is a regular expression. So actual matches were mostly `org/apache/spark/api/java/functions ...`
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "Note that " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// Note that " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages that appear in API documentation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "Note: " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// Note: " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages that appear in API documentation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "'''Note:'''" . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// '''Note:''' " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages that appear in API documentation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      And then fixed them one by one, comparing with the API documentation and access modifiers.
      
      After that, manually tested via `jekyll build`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15889 from HyukjinKwon/SPARK-18437.
      
      (cherry picked from commit d5b1d5fc)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      4b396a65
  4. Nov 18, 2016
  5. Nov 17, 2016
  6. Nov 16, 2016
    • Xianyang Liu's avatar
      [SPARK-18420][BUILD] Fix the errors caused by lint check in Java · b0ae8712
      Xianyang Liu authored
      
      Small fix: resolve the errors reported by the Java lint check.
      
      - Remove unused objects and unused imports (the `UnusedImports` check).
      - Add comments around the method `finalize` of `NioBufferedFileInputStream` to turn off checkstyle.
      - Split the line that is longer than 100 characters into two lines.
      
      Travis CI.
      ```
      $ build/mvn -T 4 -q -DskipTests -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive -Phive-thriftserver install
      $ dev/lint-java
      ```
      Before:
      ```
      Checkstyle checks failed at following occurrences:
      [ERROR] src/main/java/org/apache/spark/network/util/TransportConf.java:[21,8] (imports) UnusedImports: Unused import - org.apache.commons.crypto.cipher.CryptoCipherFactory.
      [ERROR] src/test/java/org/apache/spark/network/sasl/SparkSaslSuite.java:[516,5] (modifier) RedundantModifier: Redundant 'public' modifier.
      [ERROR] src/main/java/org/apache/spark/io/NioBufferedFileInputStream.java:[133] (coding) NoFinalizer: Avoid using finalizer method.
      [ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeMapData.java:[71] (sizes) LineLength: Line is longer than 100 characters (found 113).
      [ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java:[112] (sizes) LineLength: Line is longer than 100 characters (found 110).
      [ERROR] src/test/java/org/apache/spark/sql/catalyst/expressions/HiveHasherSuite.java:[31,17] (modifier) ModifierOrder: 'static' modifier out of order with the JLS suggestions.
      [ERROR]src/main/java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java:[64] (sizes) LineLength: Line is longer than 100 characters (found 103).
      [ERROR] src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[22,8] (imports) UnusedImports: Unused import - org.apache.spark.ml.linalg.Vectors.
      [ERROR] src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[51] (regexp) RegexpSingleline: No trailing whitespace allowed.
      ```
      
      After:
      ```
      $ build/mvn -T 4 -q -DskipTests -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive -Phive-thriftserver install
      $ dev/lint-java
      Using `mvn` from path: /home/travis/build/ConeyLiu/spark/build/apache-maven-3.3.9/bin/mvn
      Checkstyle checks passed.
      ```
      
      Author: Xianyang Liu <xyliu0530@icloud.com>
      
      Closes #15865 from ConeyLiu/master.
      
      (cherry picked from commit 7569cf6c)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      b0ae8712
  7. Nov 14, 2016
  8. Nov 12, 2016
    • Guoqiang Li's avatar
      [SPARK-18375][SPARK-18383][BUILD][CORE] Upgrade netty to 4.0.42.Final · 89335514
      Guoqiang Li authored
      
      ## What changes were proposed in this pull request?
      
      One of the important changes in 4.0.42.Final is "Support any FileRegion implementation when using epoll transport" (netty/netty#5825).
      In 4.0.42.Final, `MessageWithHeader` works properly when `spark.[shuffle|rpc].io.mode` is set to epoll.
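
      For reference, a hedged sketch of enabling the epoll transport that exercises this path (assuming a Linux host and that the `spark.<module>.io.mode` keys accept `NIO` or `EPOLL`):

      ```scala
      // Sketch only: epoll is Linux-specific, and the key names below are assumed to follow
      // the spark.<module>.io.mode pattern used by the network module.
      import org.apache.spark.SparkConf

      val conf = new SparkConf()
        .set("spark.shuffle.io.mode", "EPOLL")
        .set("spark.rpc.io.mode", "EPOLL")
      ```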
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Guoqiang Li <witgo@qq.com>
      
      Closes #15830 from witgo/SPARK-18375_netty-4.0.42.
      
      (cherry picked from commit bc41d997)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      89335514
  9. Nov 11, 2016
  10. Nov 10, 2016
    • Eric Liang's avatar
      [SPARK-18185] Fix all forms of INSERT / OVERWRITE TABLE for Datasource tables · 064d4315
      Eric Liang authored
      
      ## What changes were proposed in this pull request?
      
      As of current 2.1, INSERT OVERWRITE with dynamic partitions against a Datasource table will overwrite the entire table instead of only the partitions matching the static keys, as in Hive. It also doesn't respect custom partition locations.
      
      This PR adds support for all these operations to Datasource tables managed by the Hive metastore. It is implemented as follows:
      - During planning time, the full set of partitions affected by an INSERT or OVERWRITE command is read from the Hive metastore.
      - The planner identifies any partitions with custom locations and includes this in the write task metadata.
      - FileFormatWriter tasks refer to this custom locations map when determining where to write for dynamic partition output.
      - When the write job finishes, the set of written partitions is compared against the initial set of matched partitions, and the Hive metastore is updated to reflect the newly added / removed partitions.
      
      It was necessary to introduce a method for staging files with absolute output paths to `FileCommitProtocol`. These files are not handled by the Hadoop output committer but are moved to their final locations when the job commits.
      
      The overwrite behavior of legacy Datasource tables is also changed: no longer will the entire table be overwritten if a partial partition spec is present.
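
      A sketch of the targeted behavior, with hypothetical table and column names (assuming a `SparkSession` named `spark`):

      ```scala
      // With a mixed static/dynamic partition spec, only partitions under region='EU' should
      // be overwritten after this change, not the entire Datasource table.
      spark.sql(
        """CREATE TABLE sales (amount INT, region STRING, day STRING)
          |USING parquet PARTITIONED BY (region, day)""".stripMargin)
      spark.sql(
        """INSERT OVERWRITE TABLE sales PARTITION (region = 'EU', day)
          |SELECT amount, day FROM staged_sales""".stripMargin)
      ```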
      
      cc cloud-fan yhuai
      
      ## How was this patch tested?
      
      Unit tests, existing tests.
      
      Author: Eric Liang <ekl@databricks.com>
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15814 from ericl/sc-5027.
      
      (cherry picked from commit a3356343)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      064d4315
  11. Nov 09, 2016
    • Vinayak's avatar
      [SPARK-16808][CORE] History Server main page does not honor APPLICATION_WEB_PROXY_BASE · 5bd31dc9
      Vinayak authored
      
      ## What changes were proposed in this pull request?
      
      Application links generated on the history server UI no longer (a regression from 1.6) contain the configured spark.ui.proxyBase. To address this, the uiRoot is made available globally to all JavaScript for the Web UI, and the mustache template (historypage-template.html) is updated to include the uiRoot when rendering links to the applications.

      The existing test was not sufficient to verify the scenario where an AJAX call populates the application listing template, so a new Selenium test case was added to cover this scenario.
      
      ## How was this patch tested?
      
      Existing tests and a new unit test.
      No visual changes to the UI.
      
      Author: Vinayak <vijoshi5@in.ibm.com>
      
      Closes #15742 from vijoshi/SPARK-16808_master.
      
      (cherry picked from commit 06a13ecc)
      Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
      5bd31dc9
  12. Nov 08, 2016
    • Shixiong Zhu's avatar
      [SPARK-18280][CORE] Fix potential deadlock in `StandaloneSchedulerBackend.dead` · ba80eaf7
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      "StandaloneSchedulerBackend.dead" is called in a RPC thread, so it should not call "SparkContext.stop" in the same thread. "SparkContext.stop" will block until all RPC threads exit, if it's called inside a RPC thread, it will be dead-lock.
      
      This PR add a thread local flag inside RPC threads. `SparkContext.stop` uses it to decide if launching a new thread to stop the SparkContext.
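
      A generic sketch of that pattern (illustrative names, not the actual Spark code):

      ```scala
      // RPC threads are marked with a ThreadLocal flag; if stop() is requested from such a
      // thread, the shutdown runs on a separate thread so the RPC thread never blocks on itself.
      object RpcThreadGuard {
        private val inRpcThread = new ThreadLocal[Boolean] {
          override def initialValue(): Boolean = false
        }

        def markRpcThread(): Unit = inRpcThread.set(true)

        def safeStop(doStop: () => Unit): Unit = {
          if (inRpcThread.get()) {
            new Thread("stop-from-rpc-thread") {
              override def run(): Unit = doStop()
            }.start()
          } else {
            doStop()
          }
        }
      }
      ```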
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15775 from zsxwing/SPARK-18280.
      ba80eaf7
  13. Nov 07, 2016
    • fidato's avatar
      [SPARK-16575][CORE] partition calculation mismatch with sc.binaryFiles · c8879bf1
      fidato authored
      
      ## What changes were proposed in this pull request?
      
      This pull request contains the changes for the critical bug SPARK-16575. It rectifies the BinaryFileRDD partition calculation: upon creating an RDD with sc.binaryFiles, the resulting RDD always consisted of just two partitions.
      ## How was this patch tested?
      
      The original issue (getNumPartitions on a binary-files RDD always returning two partitions) was first reproduced, and the changes were then verified against it. The unit tests were also run and passed.
      
      This contribution is my original work and I license the work to the project under the project's open source license.
      
      srowen hvanhovell rxin vanzin skyluc kmader zsxwing datafarmer Please have a look.
      
      Author: fidato <fidato.july13@gmail.com>
      
      Closes #15327 from fidato13/SPARK-16575.
      
      (cherry picked from commit 6f369713)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      c8879bf1
  14. Nov 05, 2016
    • Susan X. Huynh's avatar
      [SPARK-17964][SPARKR] Enable SparkR with Mesos client mode and cluster mode · dcbc4265
      Susan X. Huynh authored
      
      ## What changes were proposed in this pull request?
      
      Enabled SparkR with Mesos client mode and cluster mode. Just a few changes were required to get this working on Mesos: (1) removed the SparkR-on-Mesos error checks and (2) no longer require "--class" to be specified for R apps. The logic to check spark.mesos.executor.home was already in place.
      
      sun-rui
      
      ## How was this patch tested?
      
      1. SparkSubmitSuite
      2. On local mesos cluster (on laptop): ran SparkR shell, spark-submit client mode, and spark-submit cluster mode, with the "examples/src/main/R/dataframe.R" example application.
      3. On multi-node mesos cluster: ran SparkR shell, spark-submit client mode, and spark-submit cluster mode, with the "examples/src/main/R/dataframe.R" example application. I tested with the following --conf values set: spark.mesos.executor.docker.image and spark.mesos.executor.home
      
      This contribution is my original work and I license the work to the project under the project's open source license.
      
      Author: Susan X. Huynh <xhuynh@mesosphere.com>
      
      Closes #15700 from susanxhuynh/susan-r-branch.
      
      (cherry picked from commit 9a87c313)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      dcbc4265
    • Weiqing Yang's avatar
      [SPARK-17710][FOLLOW UP] Add comments to state why 'Utils.classForName' is not used · 70763014
      Weiqing Yang authored
      
      ## What changes were proposed in this pull request?
      Add comments.
      
      ## How was this patch tested?
      Build passed.
      
      Author: Weiqing Yang <yangweiqing001@gmail.com>
      
      Closes #15776 from weiqingy/SPARK-17710.
      
      (cherry picked from commit 8a9ca192)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      70763014
  15. Nov 04, 2016
    • Adam Roberts's avatar
      [SPARK-18197][CORE] Optimise AppendOnlyMap implementation · a2d7e25e
      Adam Roberts authored
      
      ## What changes were proposed in this pull request?
      This improvement works by using the fastest comparison test first and we observed a 1% throughput performance improvement on PageRank (HiBench large profile) with this change.
      
      We used tprof and before the change in AppendOnlyMap.changeValue (where the optimisation occurs) this method was being used for 8053 profiling ticks representing 0.72% of the overall application time.
      
      After this change we observed this method only occurring for 2786 ticks and for 0.25% of the overall time.
      
      ## How was this patch tested?
      Existing unit tests; for performance, we used HiBench large, profiling with tprof and IBM Healthcenter.
      
      Author: Adam Roberts <aroberts@uk.ibm.com>
      
      Closes #15714 from a-roberts/patch-9.
      
      (cherry picked from commit a42d738c)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      a2d7e25e
    • Dongjoon Hyun's avatar
      [SPARK-18200][GRAPHX][FOLLOW-UP] Support zero as an initial capacity in OpenHashSet · cfe76028
      Dongjoon Hyun authored
      
      ## What changes were proposed in this pull request?
      
      This is a follow-up PR of #15741 in order to keep `nextPowerOf2` consistent.
      
      **Before**
      ```
      nextPowerOf2(0) => 2
      nextPowerOf2(1) => 1
      nextPowerOf2(2) => 2
      nextPowerOf2(3) => 4
      nextPowerOf2(4) => 4
      nextPowerOf2(5) => 8
      ```
      
      **After**
      ```
      nextPowerOf2(0) => 1
      nextPowerOf2(1) => 1
      nextPowerOf2(2) => 2
      nextPowerOf2(3) => 4
      nextPowerOf2(4) => 4
      nextPowerOf2(5) => 8
      ```
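
      A sketch of a `nextPowerOf2` consistent with the "After" table (an assumed shape, not necessarily the exact `OpenHashSet` code):

      ```scala
      // Powers of two (and 1) map to themselves; 0 maps to 1; everything else rounds up.
      def nextPowerOf2(n: Int): Int = {
        val highBit = Integer.highestOneBit(n)
        if (highBit == n) math.max(n, 1) else highBit << 1
      }

      (0 to 5).foreach(i => println(s"nextPowerOf2($i) => ${nextPowerOf2(i)}"))
      ```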
      
      ## How was this patch tested?
      
      N/A
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15754 from dongjoon-hyun/SPARK-18200-2.
      
      (cherry picked from commit 27602c33)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      cfe76028
  16. Nov 03, 2016
    • Sean Owen's avatar
      [SPARK-18138][DOCS] Document that Java 7, Python 2.6, Scala 2.10, Hadoop < 2.6... · 37550c49
      Sean Owen authored
      [SPARK-18138][DOCS] Document that Java 7, Python 2.6, Scala 2.10, Hadoop < 2.6 are deprecated in Spark 2.1.0
      
      ## What changes were proposed in this pull request?
      
      Document that Java 7, Python 2.6, Scala 2.10, Hadoop < 2.6 are deprecated in Spark 2.1.0. This does not actually implement any of the changes in SPARK-18138; it just peppers the documentation with notices about it.
      
      ## How was this patch tested?
      
      Doc build
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15733 from srowen/SPARK-18138.
      
      (cherry picked from commit dc4c6009)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      37550c49
    • Reynold Xin's avatar
      [SPARK-18219] Move commit protocol API (internal) from sql/core to core module · bc7f05f5
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
      This patch moves the new commit protocol API from sql/core to the core module, so we can use it in the RDD API in the future.
      
      As part of this patch, I also moved the specification of the random UUID for the write path out of the commit protocol, passing in a job id instead.
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15731 from rxin/SPARK-18219.
      
      (cherry picked from commit 937af592)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      bc7f05f5
    • Dongjoon Hyun's avatar
      [SPARK-18200][GRAPHX] Support zero as an initial capacity in OpenHashSet · 965c964c
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      [SPARK-18200](https://issues.apache.org/jira/browse/SPARK-18200) reports that Apache Spark 2.x raises `java.lang.IllegalArgumentException: requirement failed: Invalid initial capacity` while running `triangleCount`. The root cause is that `VertexSet`, a type alias of `OpenHashSet`, does not allow zero as an initial size. This PR loosens the restriction to allow zero.
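
      The user-facing symptom can be reproduced with a small GraphX job like the sketch below (assumes a `SparkContext` named `sc` and any edge-list file, e.g. the bundled `data/graphx/followers.txt`):

      ```scala
      // Before this fix, triangleCount() on some graphs failed with
      // "requirement failed: Invalid initial capacity" coming from a zero-sized VertexSet.
      import org.apache.spark.graphx.GraphLoader

      val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")
      val triangles = graph.triangleCount().vertices
      triangles.take(5).foreach(println)
      ```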
      
      ## How was this patch tested?
      
      Pass the Jenkins test with a new test case in `OpenHashSetSuite`.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15741 from dongjoon-hyun/SPARK-18200.
      
      (cherry picked from commit d24e7364)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      965c964c
  17. Nov 02, 2016
    • Jeff Zhang's avatar
      [SPARK-18160][CORE][YARN] spark.files & spark.jars should not be passed to driver in yarn mode · bd3ea659
      Jeff Zhang authored
      
      ## What changes were proposed in this pull request?
      
      spark.files is still passed to the driver in yarn mode, so SparkContext will still handle it, which causes the error described in the JIRA.
      
      ## How was this patch tested?
      
      Tested manually on a 5-node cluster. As this issue only happens in a multi-node cluster, I didn't write a test for it.
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #15669 from zjffdu/SPARK-18160.
      
      (cherry picked from commit 3c24299b)
      Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
      bd3ea659
    • Xiangrui Meng's avatar
      [SPARK-14393][SQL] values generated by non-deterministic functions shouldn't... · 0093257e
      Xiangrui Meng authored
      [SPARK-14393][SQL] values generated by non-deterministic functions shouldn't change after coalesce or union
      
      ## What changes were proposed in this pull request?
      
      When a user appended a column using a "nondeterministic" function to a DataFrame, e.g., `rand`, `randn`, and `monotonically_increasing_id`, the expected semantic is the following:
      - The value in each row should remain unchanged, as if we materialize the column immediately, regardless of later DataFrame operations.
      
      However, since we use `TaskContext.getPartitionId` to get the partition index from the current thread, the values from nondeterministic columns might change if we call `union` or `coalesce` after. `TaskContext.getPartitionId` returns the partition index of the current Spark task, which might not be the corresponding partition index of the DataFrame where we defined the column.
      
      See the unit tests below or JIRA for examples.
      
      This PR uses the partition index from `RDD.mapPartitionWithIndex` instead of `TaskContext` and fixes the partition initialization logic in whole-stage codegen, normal codegen, and codegen fallback. `initializeStatesForPartition(partitionIndex: Int)` was added to `Projection`, `Nondeterministic`, and `Predicate` (codegen) and initialized right after object creation in `mapPartitionWithIndex`. `newPredicate` now returns a `Predicate` instance rather than a function for proper initialization.
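
      A sketch of the expected semantics (assuming a `SparkSession` named `spark`):

      ```scala
      // Once the nondeterministic column is added, its values should stay the same even if the
      // DataFrame is later coalesced (or unioned), as if the column had been materialized.
      // Before this fix, the assertion could fail when the range spans multiple partitions.
      import org.apache.spark.sql.functions.monotonically_increasing_id

      val df = spark.range(10).withColumn("uid", monotonically_increasing_id())
      val original  = df.collect().map(_.getLong(1)).toSet
      val coalesced = df.coalesce(1).collect().map(_.getLong(1)).toSet
      assert(original == coalesced, "nondeterministic column changed after coalesce")
      ```
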
      ## How was this patch tested?
      
      Unit tests. (Actually I'm not very confident that this PR fixed all issues without introducing new ones ...)
      
      cc: rxin davies
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #15567 from mengxr/SPARK-14393.
      
      (cherry picked from commit 02f20310)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      0093257e
    • Sean Owen's avatar
      [SPARK-18076][CORE][SQL] Fix default Locale used in DateFormat, NumberFormat to Locale.US · 176afa5e
      Sean Owen authored
      
      ## What changes were proposed in this pull request?
      
      Fix `Locale.US` for all usages of `DateFormat`, `NumberFormat`
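
      A sketch of the pattern this standardizes on, i.e. passing an explicit locale instead of relying on the JVM default:

      ```scala
      // Plain JDK usage; Locale.US makes formatting independent of the JVM's default locale.
      import java.text.{NumberFormat, SimpleDateFormat}
      import java.util.{Date, Locale}

      val dateFormat   = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.US)
      val numberFormat = NumberFormat.getInstance(Locale.US)
      println(dateFormat.format(new Date()))
      println(numberFormat.format(1234567.89))
      ```
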
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15610 from srowen/SPARK-18076.
      
      (cherry picked from commit 9c8deef6)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      176afa5e
    • Ryan Blue's avatar
      [SPARK-17532] Add lock debugging info to thread dumps. · 3b624bed
      Ryan Blue authored
      ## What changes were proposed in this pull request?
      
      This adds information to the web UI thread dump page about the JVM locks
      held by threads and the locks that threads are blocked waiting to
      acquire. This should help find cases where lock contention is causing
      Spark applications to run slowly.
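
      For context, a generic sketch of where such lock information comes from in the JVM (not the Spark UI code itself):

      ```scala
      // ThreadMXBean reports, per thread, the lock it is blocked on and the monitors and
      // ownable synchronizers it currently holds.
      import java.lang.management.ManagementFactory

      val threadBean = ManagementFactory.getThreadMXBean
      val infos = threadBean.dumpAllThreads(/* lockedMonitors = */ true, /* lockedSynchronizers = */ true)
      infos.take(3).foreach { ti =>
        val blockedOn = Option(ti.getLockName).getOrElse("-")
        println(s"${ti.getThreadName}: blocked on $blockedOn, holding " +
          s"${ti.getLockedMonitors.length} monitors and ${ti.getLockedSynchronizers.length} synchronizers")
      }
      ```
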
      ## How was this patch tested?
      
      Tested by applying this patch and viewing the change in the web UI.
      
      ![thread-lock-info](https://cloud.githubusercontent.com/assets/87915/18493057/6e5da870-79c3-11e6-8c20-f54c18a37544.png)
      
      Additions:
      - A "Thread Locking" column with the locks held by the thread or that are blocking the thread
      - Links from a blocked thread to the thread holding the lock
      - Stack frames show where threads are inside `synchronized` blocks, "holding Monitor(...)"
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #15088 from rdblue/SPARK-17532-add-thread-lock-info.
      
      (cherry picked from commit 2dc04808)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      3b624bed
  18. Nov 01, 2016
    • Josh Rosen's avatar
      [SPARK-18182] Expose ReplayListenerBus.read() overload which takes string iterator · b929537b
      Josh Rosen authored
      The `ReplayListenerBus.read()` method is used when implementing a custom `ApplicationHistoryProvider`. The current interface only exposes a `read()` method which takes an `InputStream` and performs stream-to-lines conversion itself, but it would also be useful to expose an overloaded method which accepts an iterator of strings, thereby enabling events to be provided from non-`InputStream` sources.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #15698 from JoshRosen/replay-listener-bus-interface.
      b929537b
  19. Oct 31, 2016
    • Shixiong Zhu's avatar
      [SPARK-18143][SQL] Ignore Structured Streaming event logs to avoid breaking history server · d2923f17
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Because of the refactoring work in Structured Streaming, the event logs generated by Structured Streaming in Spark 2.0.0 and 2.0.1 cannot be parsed.
      
      This PR just ignores these logs in ReplayListenerBus because no places use them.
      ## How was this patch tested?
      - Generated event logs using Spark 2.0.0 and 2.0.1, and saved them as `structured-streaming-query-event-logs-2.0.0.txt` and `structured-streaming-query-event-logs-2.0.1.txt`
      - The newly added test makes sure ReplayListenerBus will skip these bad JSONs.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15663 from zsxwing/fix-event-log.
      d2923f17
  20. Oct 30, 2016
    • Hossein's avatar
      [SPARK-17919] Make timeout to RBackend configurable in SparkR · 2881a2d1
      Hossein authored
      ## What changes were proposed in this pull request?
      
      This patch makes the RBackend connection timeout configurable by the user.
      
      ## How was this patch tested?
      N/A
      
      Author: Hossein <hossein@databricks.com>
      
      Closes #15471 from falaki/SPARK-17919.
      2881a2d1
    • Eric Liang's avatar
      [SPARK-18103][SQL] Rename *FileCatalog to *FileIndex · 90d3b91f
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      To reduce the number of components in SQL named *Catalog, rename *FileCatalog to *FileIndex. A FileIndex is responsible for returning the list of partitions / files to scan given a filtering expression.
      
      ```
      TableFileCatalog => CatalogFileIndex
      FileCatalog => FileIndex
      ListingFileCatalog => InMemoryFileIndex
      MetadataLogFileCatalog => MetadataLogFileIndex
      PrunedTableFileCatalog => PrunedInMemoryFileIndex
      ```
      
      cc yhuai marmbrus
      
      ## How was this patch tested?
      
      N/A
      
      Author: Eric Liang <ekl@databricks.com>
      Author: Eric Liang <ekhliang@gmail.com>
      
      Closes #15634 from ericl/rename-file-provider.
      90d3b91f
  21. Oct 27, 2016
    • wm624@hotmail.com's avatar
      [SPARK-CORE][TEST][MINOR] Fix the wrong comment in test · 701a9d36
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      While learning core scheduler code, I found two lines of wrong comments. This PR simply corrects the comments.
      
      ## How was this patch tested?
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #15631 from wangmiao1981/Rbug.
      701a9d36
    • Yin Huai's avatar
      [SPARK-18132] Fix checkstyle · d3b4831d
      Yin Huai authored
      This PR fixes checkstyle.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #15656 from yhuai/fix-format.
      d3b4831d
  22. Oct 26, 2016
    • Miao Wang's avatar
      [SPARK-18126][SPARK-CORE] getIteratorZipWithIndex accepts negative value as index · a76846cf
      Miao Wang authored
      ## What changes were proposed in this pull request?
      
      `Utils.getIteratorZipWithIndex` was added to deal with more than 2147483647 records in a single partition.
      
      However, the method `getIteratorZipWithIndex` accepts `startIndex` < 0, which leads to a negative index.
      
      This PR just adds a defensive check on `startIndex` to make sure it is >= 0.
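
      A sketch of the utility with the defensive check described above (the signature is assumed, modeled on `Utils.getIteratorZipWithIndex`):

      ```scala
      // startIndex must be non-negative; indices are tracked as Long so counts past
      // Int.MaxValue within a partition do not overflow.
      def getIteratorZipWithIndex[T](iter: Iterator[T], startIndex: Long): Iterator[(T, Long)] = {
        require(startIndex >= 0, "startIndex should be >= 0.")
        new Iterator[(T, Long)] {
          private var index: Long = startIndex - 1L
          override def hasNext: Boolean = iter.hasNext
          override def next(): (T, Long) = {
            index += 1L
            (iter.next(), index)
          }
        }
      }
      ```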
      
      ## How was this patch tested?
      
      Add a new unit test.
      
      Author: Miao Wang <miaowang@Miaos-MacBook-Pro.local>
      
      Closes #15639 from wangmiao1981/zip.
      a76846cf
    • Shixiong Zhu's avatar
      [SPARK-13747][SQL] Fix concurrent executions in ForkJoinPool for SQL · 7ac70e7b
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Calling `Await.result` will allow other tasks to be run on the same thread when using ForkJoinPool. However, SQL uses a `ThreadLocal` execution id to trace Spark jobs launched by a query, which doesn't work perfectly in ForkJoinPool.
      
      This PR just uses `Awaitable.result` instead to prevent ForkJoinPool from running other tasks in the current waiting thread.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15520 from zsxwing/SPARK-13747.
      7ac70e7b
    • Shuai Lin's avatar
      [SPARK-17802] Improved caller context logging. · 402205dd
      Shuai Lin authored
      ## What changes were proposed in this pull request?
      
      [SPARK-16757](https://issues.apache.org/jira/browse/SPARK-16757) sets the hadoop `CallerContext` when calling hadoop/hdfs apis to make spark applications more diagnosable in hadoop/hdfs logs. However, the `org.apache.hadoop.ipc.CallerContext` class is only added since [hadoop 2.8](https://issues.apache.org/jira/browse/HDFS-9184), which is not officially released yet. So each time `utils.CallerContext.setCurrentContext()` is called (e.g. [when a task is created](https://github.com/apache/spark/blob/b678e46/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L95-L96)), a "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext"
      error is logged, which pollutes the spark logs when there are lots of tasks.
      
      This patch improves this behaviour by only logging the `ClassNotFoundException` once.
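
      A generic sketch of the "log only once" pattern described above (illustrative names, not the actual `Utils.CallerContext` code):

      ```scala
      // The first ClassNotFoundException is reported; subsequent ones are silently ignored.
      import java.util.concurrent.atomic.AtomicBoolean

      object CallerContextSupport {
        private val warned = new AtomicBoolean(false)

        def trySetContext(context: String): Unit = {
          try {
            // Only present on Hadoop 2.8+; reflection keeps older Hadoop versions working.
            Class.forName("org.apache.hadoop.ipc.CallerContext")
            // ... reflectively build and install the caller context here ...
          } catch {
            case _: ClassNotFoundException =>
              if (warned.compareAndSet(false, true)) {
                println(s"Hadoop CallerContext is unavailable; context '$context' was not set")
              }
          }
        }
      }
      ```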
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Shuai Lin <linshuai2012@gmail.com>
      
      Closes #15377 from lins05/spark-17802-improve-callercontext-logging.
      402205dd
    • Alex Bozarth's avatar
      [SPARK-4411][WEB UI] Add "kill" link for jobs in the UI · 5d0f81da
      Alex Bozarth authored
      ## What changes were proposed in this pull request?
      
      Currently users can kill stages via the web UI but not jobs directly (jobs are killed if one of their stages is). I've added the ability to kill jobs via the web UI. This code change is based on #4823 by lianhuiwang and updated to work with the latest code, matching how stages are currently killed. In general I've copied the kill-stage code, including its warnings and note comments. I also updated the applicable tests and documentation.
      
      ## How was this patch tested?
      
      Manually tested and dev/run-tests
      
      ![screen shot 2016-10-11 at 4 49 43 pm](https://cloud.githubusercontent.com/assets/13952758/19292857/12f1b7c0-8fd4-11e6-8982-210249f7b697.png)
      
      Author: Alex Bozarth <ajbozart@us.ibm.com>
      Author: Lianhui Wang <lianhuiwang09@gmail.com>
      
      Closes #15441 from ajbozarth/spark4411.
      5d0f81da