  1. Sep 06, 2017
    • [SPARK-21903][BUILD][FOLLOWUP] Upgrade scalastyle-maven-plugin and scalastyle as well in POM and SparkBuild.scala · 64936c14
      hyukjinkwon authored
      
      ## What changes were proposed in this pull request?
      
      This PR proposes to match the scalastyle version in the POM and in SparkBuild.scala.
      
      ## How was this patch tested?
      
      Manual builds.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #19146 from HyukjinKwon/SPARK-21903-follow-up.
      64936c14
  2. Sep 05, 2017
    • [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0. · 7f3c6ff4
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      1.0.0 fixes issues with import order, explicit types for public methods, the line-length limit, and comment validation:
      
      ```
      [error] .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/Main.scala:50:16: Are you sure you want to println? If yes, wrap the code block with
      [error]       // scalastyle:off println
      [error]       println(...)
      [error]       // scalastyle:on println
      [error] .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala:49: File line length exceeds 100 characters
      [error] .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala:22:21: Are you sure you want to println? If yes, wrap the code block with
      [error]       // scalastyle:off println
      [error]       println(...)
      [error]       // scalastyle:on println
      [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:35:6: Public method must have explicit type
      [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:51:6: Public method must have explicit type
      [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:93:15: Public method must have explicit type
      [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:98:15: Public method must have explicit type
      [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:47:2: Insert a space after the start of the comment
      [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:26:43: JavaDStream should come before JavaDStreamLike.
      ```
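
      For reference, the fix suggested in the message above is to wrap intentional console output in scalastyle markers. A minimal sketch (the printed text is illustrative):
      ```scala
      // Intentional console output; disable the println check only for this block
      // scalastyle:off println
      println("Welcome to Spark!")
      // scalastyle:on println
      ```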
      
      This PR also fixes the workaround added in SPARK-16877 for the `org.scalastyle.scalariform.OverrideJavaChecker` feature, which was added in 0.9.0.
      
      ## How was this patch tested?
      
      Manually tested.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #19116 from HyukjinKwon/scalastyle-1.0.0.
      7f3c6ff4
  3. Sep 01, 2017
    • [SPARK-14280][BUILD][WIP] Update change-version.sh and pom.xml to add Scala 2.12 profiles and enable 2.12 compilation · 12ab7f7e
      Sean Owen authored
      
      …build; fix some things that will be warnings or errors in 2.12; restore Scala 2.12 profile infrastructure
      
      ## What changes were proposed in this pull request?
      
      This change adds back the infrastructure for a Scala 2.12 build, but does not enable it in the release or Python test scripts.
      
      In order to make that meaningful, it also resolves compile errors that the code hits in 2.12 only, in a way that still works with 2.11.
      
      For dependencies whose current version does not yet support Scala 2.12, it also updates them to the earliest minor release that does. This is in a sense covered by other JIRAs under the main umbrella, but implemented here. The versions below still work with 2.11, and are the _latest_ maintenance release in the _earliest_ viable minor release.
      
      - Scalatest 2.x -> 3.0.3
      - Chill 0.8.0 -> 0.8.4
      - Clapper 1.0.x -> 1.1.2
      - json4s 3.2.x -> 3.4.2
      - Jackson 2.6.x -> 2.7.9 (required by json4s)
      
      This change does _not_ fully enable a Scala 2.12 build:
      
      - It will also require dropping support for Kafka before 0.10. Easy enough, just didn't do it yet here
      - It will require recreating `SparkILoop` and `Main` for REPL 2.12, which is SPARK-14650. Possible to do here too.
      
      What it does do is make changes that resolve much of the remaining gap without affecting the current 2.11 build.
      
      ## How was this patch tested?
      
      Existing tests and build. Manually tested with `./dev/change-scala-version.sh 2.12` to verify it compiles, modulo the exceptions above.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #18645 from srowen/SPARK-14280.
      12ab7f7e
  4. Aug 31, 2017
    • [SPARK-17139][ML][FOLLOW-UP] Add convenient method `asBinary` for casting to BinaryLogisticRegressionSummary · 96028e36
      WeichenXu authored
      
      ## What changes were proposed in this pull request?
      
      add an "asBinary" method to LogisticRegressionSummary for convenient casting to BinaryLogisticRegressionSummary.
      
      ## How was this patch tested?
      
      Testcase updated.
      
      Author: WeichenXu <weichen.xu@databricks.com>
      
      Closes #19072 from WeichenXu123/mlor_summary_as_binary.
      96028e36
  5. Aug 28, 2017
    • [SPARK-17139][ML] Add model summary for MultinomialLogisticRegression · c7270a46
      Weichen Xu authored
      ## What changes were proposed in this pull request?
      
      Add 4 traits, using the following hierarchy:
      - LogisticRegressionSummary
      - LogisticRegressionTrainingSummary: LogisticRegressionSummary
      - BinaryLogisticRegressionSummary: LogisticRegressionSummary
      - BinaryLogisticRegressionTrainingSummary: LogisticRegressionTrainingSummary, BinaryLogisticRegressionSummary
      
      Public methods such as `def summary` only return the trait types listed above.
      
      Four concrete classes implement them:
      - LogisticRegressionSummaryImpl (multiclass case)
      - LogisticRegressionTrainingSummaryImpl (multiclass case)
      - BinaryLogisticRegressionSummaryImpl (binary case)
      - BinaryLogisticRegressionTrainingSummaryImpl (binary case)
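
      A minimal sketch of the trait relationships described above (members omitted):
      ```scala
      trait LogisticRegressionSummary
      trait LogisticRegressionTrainingSummary extends LogisticRegressionSummary
      trait BinaryLogisticRegressionSummary extends LogisticRegressionSummary
      trait BinaryLogisticRegressionTrainingSummary
        extends LogisticRegressionTrainingSummary with BinaryLogisticRegressionSummary
      ```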
      
      ## How was this patch tested?
      
      Existing tests & added tests.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #15435 from WeichenXu123/mlor_summary.
      c7270a46
  6. Aug 24, 2017
    • [SPARK-21830][SQL] Bump ANTLR version and fix a few issues. · 05af2de0
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      This PR bumps the ANTLR version to 4.7, and fixes a number of small parser related issues uncovered by the bump.
      
      The main reason for upgrading is that in some cases the current version of ANTLR (4.5) can exhibit exponential slowdowns if it needs to parse boolean predicates. For example the following query will take forever to parse:
      ```sql
      SELECT *
      FROM RANGE(1000)
      WHERE
      TRUE
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      ```
      
      This is caused by a known bug in ANTLR (https://github.com/antlr/antlr4/issues/994), which was fixed in version 4.6.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #19042 from hvanhovell/SPARK-21830.
      05af2de0
  7. Aug 16, 2017
    • [SPARK-21680][ML][MLLIB] optimize Vector compress · a0345cbe
      Peng Meng authored
      ## What changes were proposed in this pull request?
      
      When using Vector.compressed to convert a Vector to a SparseVector, performance is much lower than with Vector.toSparse.
      This is because Vector.compressed has to scan the values three times, while Vector.toSparse only needs two.
      When the vector is long, the performance difference between the two methods is significant.
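
      For context, a small sketch of the two APIs (values are illustrative):
      ```scala
      import org.apache.spark.ml.linalg.Vectors

      val v = Vectors.dense(0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 7.0, 0.0)

      val sv = v.toSparse      // always returns a SparseVector
      val cv = v.compressed    // returns whichever of dense/sparse needs less storage
      ```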
      
      ## How was this patch tested?
      
      The existing UT
      
      Author: Peng Meng <peng.meng@intel.com>
      
      Closes #18899 from mpjlu/optVectorCompress.
      a0345cbe
  8. Aug 15, 2017
    • [SPARK-21731][BUILD] Upgrade scalastyle to 0.9. · 3f958a99
      Marcelo Vanzin authored
      This version fixes a few issues in the import order checker; it provides
      better error messages, and detects more improper ordering (thus the need
      to change a lot of files in this patch). The main fix is that it correctly
      complains about the order of packages vs. classes.
      
      As part of the above, I moved some "SparkSession" import in ML examples
      inside the "$example on$" blocks; that didn't seem consistent across
      different source files to start with, and avoids having to add more on/off blocks
      around specific imports.
      
      The new scalastyle also seems to have a better header detector, so a few
      license headers had to be updated to match the expected indentation.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #18943 from vanzin/SPARK-21731.
      3f958a99
  9. Aug 12, 2017
    • [SPARK-21709][BUILD] sbt 0.13.16 and some plugin updates · c0e333db
      pj.fanning authored
      ## What changes were proposed in this pull request?
      
      Update sbt version to 0.13.16. I think this is a useful stepping stone to getting to sbt 1.0.0.
      
      ## How was this patch tested?
      
      Existing Build.
      
      Author: pj.fanning <pj.fanning@workday.com>
      
      Closes #18921 from pjfanning/SPARK-21709.
      c0e333db
  10. Aug 09, 2017
    • [SPARK-21276][CORE] Update lz4-java to the latest (v1.4.0) · b78cf13b
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR updates `lz4-java` to the latest release (v1.4.0) and removes the custom `LZ4BlockInputStream`. We currently use the custom `LZ4BlockInputStream` to read concatenated byte streams in shuffle, but this functionality has been implemented in the latest lz4-java (https://github.com/lz4/lz4-java/pull/105), so we can move to the latest release and drop the custom class.
      
      Major diffs between the latest release and the v1.3.0 currently in master are as follows (https://github.com/lz4/lz4-java/compare/62f7547abb0819d1ca1e669645ee1a9d26cd60b0...6d4693f56253fcddfad7b441bb8d917b182efa2d):
      - fixed NPE in XXHashFactory similarly
      - Don't place resources in default package to support shading
      - Fixes ByteBuffer methods failing to apply arrayOffset() for array-backed
      - Try to load lz4-java from java.library.path, then fallback to bundled
      - Add ppc64le binary
      - Add s390x JNI binding
      - Add basic LZ4 Frame v1.5.0 support
      - enable aarch64 support for lz4-java
      - Allow unsafeInstance() for ppc64le architecture
      - Add unsafeInstance support for AArch64
      - Support 64-bit JNI build on Solaris
      - Avoid over-allocating a buffer
      - Allow EndMark to be incompressible for LZ4FrameInputStream.
      - Concat byte stream
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #18883 from maropu/SPARK-21276.
      b78cf13b
  11. Aug 08, 2017
    • [SPARK-20655][CORE] In-memory KVStore implementation. · 979bf946
      Marcelo Vanzin authored
      This change adds an in-memory implementation of KVStore that can be
      used by the live UI.
      
      The implementation is not fully optimized, neither for speed nor
      space, but should be fast enough for using in the listener bus.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #18395 from vanzin/SPARK-20655.
      979bf946
  12. Jul 18, 2017
    • [SPARK-21415] Triage scapegoat warnings, part 1 · e26dac5f
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Address scapegoat warnings for:
      - BigDecimal double constructor
      - Catching NPE
      - Finalizer without super
      - List.size is O(n)
      - Prefer Seq.empty
      - Prefer Set.empty
      - reverse.map instead of reverseMap
      - Type shadowing
      - Unnecessary if condition.
      - Use .log1p
      - Var could be val
      
      In some instances like Seq.empty, I avoided making the change even where valid in test code to keep the scope of the change smaller. Those issues are concerned with performance and it won't matter for tests.
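
      For illustration, a few of the patterns above in before/after form (a hedged sketch, not code taken from the patch):
      ```scala
      // Prefer Seq.empty / Set.empty over constructing empty collections
      val before1 = Seq[Int]()
      val after1  = Seq.empty[Int]

      // Use .log1p for better precision when the argument is close to zero
      val x = 1e-12
      val before2 = math.log(1 + x)
      val after2  = math.log1p(x)

      // Var could be val: prefer immutable bindings that are never reassigned
      val total = (1 to 10).sum
      ```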
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #18635 from srowen/Scapegoat1.
      e26dac5f
  13. Jul 13, 2017
    • [SPARK-19810][BUILD][CORE] Remove support for Scala 2.10 · 425c4ada
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      - Remove Scala 2.10 build profiles and support
      - Replace some 2.10 support in scripts with commented placeholders for 2.12 later
      - Remove deprecated API calls from 2.10 support
      - Remove usages of deprecated context bounds where possible
      - Remove Scala 2.10 workarounds like ScalaReflectionLock
      - Other minor Scala warning fixes
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17150 from srowen/SPARK-19810.
      425c4ada
  15. Jun 06, 2017
    • [SPARK-20641][CORE] Add key-value store abstraction and LevelDB implementation. · 0cba4951
      Marcelo Vanzin authored
      This change adds an abstraction and LevelDB implementation for a key-value
      store that will be used to store UI and SHS data.
      
      The interface is described in KVStore.java (see javadoc). Specifics
      of the LevelDB implementation are discussed in the javadocs of both
      LevelDB.java and LevelDBTypeInfo.java.
      
      Included also are a few small benchmarks just to get some idea of
      latency. Because they're too slow for regular unit test runs, they're
      disabled by default.
      
      Tested with the included unit tests, and also as part of the overall feature
      implementation (including running SHS with hundreds of apps).
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #17902 from vanzin/shs-ng/M1.
      0cba4951
  18. May 07, 2017
    • [SPARK-7481][BUILD] Add spark-hadoop-cloud module to pull in object store access. · 2cf83c47
      Steve Loughran authored
      ## What changes were proposed in this pull request?
      
      Add a new `spark-hadoop-cloud` module and maven profile to pull in object store support from `hadoop-openstack`, `hadoop-aws` and `hadoop-azure` (Hadoop 2.7+) JARs, along with their dependencies, fixing up the dependencies so that everything works, in particular Jackson.
      
      It restores `s3n://` access to S3, adds its `s3a://` replacement, OpenStack `swift://`, and Azure `wasb://`.
      
      There's a documentation page, `cloud_integration.md`, which covers the basic details of using Spark with object stores, referring the reader to the supplier's own documentation, with specific warnings on security and the possible mismatch between a store's behavior and that of a filesystem. In particular, users are advised to be very cautious when trying to use an object store as the destination of data, and to consult the documentation of the storage supplier and the connector.
      
      (this is the successor to #12004; I can't re-open it)
      
      ## How was this patch tested?
      
      Downstream tests exist in [https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples](https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples)
      
      Those verify that the dependencies are sufficient to allow downstream applications to work with s3a, azure wasb and swift storage connectors, and perform basic IO & dataframe operations thereon. All seems well.
      
      Manually clean build & verify that assembly contains the relevant aws-* hadoop-* artifacts on Hadoop 2.6; azure on a hadoop-2.7 profile.
      
      SBT build: `build/sbt -Phadoop-cloud -Phadoop-2.7 package`
      Maven build: `mvn install -Phadoop-cloud -Phadoop-2.7`
      
      This PR *does not* update `dev/deps/spark-deps-hadoop-2.7` or `dev/deps/spark-deps-hadoop-2.6`, because unless the hadoop-cloud profile is enabled, no extra JARs show up in the dependency list. The dependency check in Jenkins isn't setting the property, so the new JARs aren't visible.
      
      Author: Steve Loughran <stevel@apache.org>
      Author: Steve Loughran <stevel@hortonworks.com>
      
      Closes #17834 from steveloughran/cloud/SPARK-7481-current.
      2cf83c47
  19. May 05, 2017
    • [SPARK-20495][SQL][CORE] Add StorageLevel to cacheTable API · 9064f1b0
      madhu authored
      ## What changes were proposed in this pull request?
      Currently the cacheTable API only supports MEMORY_AND_DISK. This PR adds an additional API that accepts different storage levels.
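
      A hedged sketch of the new overload (table name and storage level are illustrative; `spark` is an active SparkSession):
      ```scala
      import org.apache.spark.storage.StorageLevel

      spark.catalog.cacheTable("events", StorageLevel.MEMORY_ONLY)
      ```
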
      ## How was this patch tested?
      unit tests
      
      Author: madhu <phatak.dev@gmail.com>
      
      Closes #17802 from phatak-dev/cacheTableAPI.
      9064f1b0
  23. Apr 06, 2017
    • [SPARK-17019][CORE] Expose on-heap and off-heap memory usage in various places · a4491626
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      With [SPARK-13992](https://issues.apache.org/jira/browse/SPARK-13992), Spark supports persisting data in off-heap memory, but on-heap and off-heap memory usage is not currently exposed, which makes it inconvenient for users to monitor and profile. This PR proposes to expose off-heap as well as on-heap memory usage in various places:
      1. Spark UI's executor page will display both on-heap and off-heap memory usage.
      2. REST request returns both on-heap and off-heap memory.
      3. The usage is also reported through the MetricsSystem.
      4. The usage can be obtained programmatically from SparkListener.
      
      Attached are the UI changes:
      
      ![screen shot 2016-08-12 at 11 20 44 am](https://cloud.githubusercontent.com/assets/850797/17612032/6c2f4480-607f-11e6-82e8-a27fb8cbb4ae.png)
      
      Backward compatibility is also considered for event-log and REST API. Old event log can still be replayed with off-heap usage displayed as 0. For REST API, only adds the new fields, so JSON backward compatibility can still be kept.
      ## How was this patch tested?
      
      Unit test added and manual verification.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #14617 from jerryshao/SPARK-17019.
      a4491626
  24. Mar 24, 2017
    • [SPARK-17471][ML] Add compressed method to ML matrices · e8810b73
      sethah authored
      ## What changes were proposed in this pull request?
      
      This patch adds a `compressed` method to ML `Matrix` class, which returns the minimal storage representation of the matrix - either sparse or dense. Because the space occupied by a sparse matrix is dependent upon its layout (i.e. column major or row major), this method must consider both cases. It may also be useful to force the layout to be column or row major beforehand, so an overload is added which takes in a `columnMajor: Boolean` parameter.
      
      The compressed implementation relies upon two new abstract methods `toDense(columnMajor: Boolean)` and `toSparse(columnMajor: Boolean)`, similar to the compressed method implemented in the `Vector` class. These methods also allow the layout of the resulting matrix to be specified via the `columnMajor` parameter. More detail on the new methods is given below.
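
      A hedged usage sketch of the no-argument variant (values are illustrative):
      ```scala
      import org.apache.spark.ml.linalg.Matrices

      // 3 x 2 matrix, column-major data, mostly zeros
      val m = Matrices.dense(3, 2, Array(1.0, 0.0, 0.0, 0.0, 5.0, 0.0))

      // Returns whichever of the sparse or dense representations occupies less space
      val compact = m.compressed
      ```
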
      ## How was this patch tested?
      
      Added many new unit tests
      ## New methods (summary, not exhaustive list)
      
      **Matrix trait**
      - `private[ml] def toDenseMatrix(columnMajor: Boolean): DenseMatrix` (abstract) - converts the matrix (either sparse or dense) to dense format
      - `private[ml] def toSparseMatrix(columnMajor: Boolean): SparseMatrix` (abstract) -  converts the matrix (either sparse or dense) to sparse format
      - `def toDense: DenseMatrix = toDense(true)`  - converts the matrix (either sparse or dense) to dense format in column major layout
      - `def toSparse: SparseMatrix = toSparse(true)` -  converts the matrix (either sparse or dense) to sparse format in column major layout
      - `def compressed: Matrix` - finds the minimum space representation of this matrix, considering both column and row major layouts, and converts it
      - `def compressed(columnMajor: Boolean): Matrix` - finds the minimum space representation of this matrix considering only column OR row major, and converts it
      
      **DenseMatrix class**
      - `private[ml] def toDenseMatrix(columnMajor: Boolean): DenseMatrix` - converts the dense matrix to a dense matrix, optionally changing the layout (data is NOT duplicated if the layouts are the same)
      - `private[ml] def toSparseMatrix(columnMajor: Boolean): SparseMatrix` - converts the dense matrix to sparse matrix, using the specified layout
      
      **SparseMatrix class**
      - `private[ml] def toDenseMatrix(columnMajor: Boolean): DenseMatrix` - converts the sparse matrix to a dense matrix, using the specified layout
      - `private[ml] def toSparseMatrix(columnMajors: Boolean): SparseMatrix` - converts the sparse matrix to sparse matrix. If the sparse matrix contains any explicit zeros, they are removed. If the layout requested does not match the current layout, data is copied to a new representation. If the layouts match and no explicit zeros exist, the current matrix is returned.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #15628 from sethah/matrix_compress.
      e8810b73
    • [SPARK-19820][CORE] Add interface to kill tasks w/ a reason · 8e558041
      Eric Liang authored
      This commit adds a killTaskAttempt method to SparkContext, to allow users to
      kill tasks so that they can be re-scheduled elsewhere.
      
      This also refactors the task kill path to allow specifying a reason for the task kill. The reason is propagated opaquely through events, and will show up in the UI automatically as `(N killed: $reason)` and `TaskKilled: $reason`. Without this change, there is no way to provide the user feedback through the UI.
      
      Currently used reasons are "stage cancelled", "another attempt succeeded", and "killed via SparkContext.killTask". The user can also specify a custom reason through `SparkContext.killTask`.
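
      A hedged sketch of the new API (the task id and the exact parameter list are assumptions for illustration):
      ```scala
      // sc is an active SparkContext; 42L stands in for a real task attempt id,
      // the second argument is whether to interrupt the task thread
      val killed: Boolean = sc.killTaskAttempt(42L, false, "straggler; another attempt already succeeded")
      ```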
      
      cc rxin
      
      In the stage overview UI the reasons are summarized:
      ![1](https://cloud.githubusercontent.com/assets/14922/23929209/a83b2862-08e1-11e7-8b3e-ae1967bbe2e5.png)
      
      Within the stage UI you can see individual task kill reasons:
      ![2](https://cloud.githubusercontent.com/assets/14922/23929200/9a798692-08e1-11e7-8697-72b27ad8a287.png)
      
      Existing tests, tried killing some stages in the UI and verified the messages are as expected.
      
      Author: Eric Liang <ekl@databricks.com>
      Author: Eric Liang <ekl@google.com>
      
      Closes #17166 from ericl/kill-reason.
      8e558041
  25. Mar 23, 2017
    • [SPARK-19876][SS][WIP] OneTime Trigger Executor · 746a558d
      Tyson Condie authored
      ## What changes were proposed in this pull request?
      
      An additional trigger and trigger executor that will execute a single trigger only. One can use this OneTime trigger to have more control over the scheduling of triggers.
      
      In addition, this patch requires an optimization to StreamExecution that logs a commit record at the end of successfully processing a batch. This new commit log will be used to determine the next batch (offsets) to process after a restart, instead of using the offset log itself to determine what batch to process next after restart; using the offset log to determine this would process the previously logged batch, always, thus not permitting a OneTime trigger feature.
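
      A hedged sketch of how the one-time trigger might be used through the public streaming API (source, sink, and paths are illustrative; assumes the `Trigger.Once()` entry point):
      ```scala
      import org.apache.spark.sql.streaming.Trigger

      val df = spark.readStream.format("rate").load()     // illustrative streaming source

      val query = df.writeStream
        .format("parquet")
        .option("path", "/tmp/out/events")                // illustrative paths
        .option("checkpointLocation", "/tmp/chk/events")
        .trigger(Trigger.Once())                          // process a single batch, then stop
        .start()

      query.awaitTermination()
      ```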
      
      ## How was this patch tested?
      
      A number of existing tests have been revised. These tests all assumed that when restarting a stream, the last batch in the offset log is to be re-processed. Given that we now have a commit log that will tell us if that last batch was processed successfully, the results/assumptions of those tests needed to be revised accordingly.
      
      In addition, a OneTime trigger test was added to StreamingQuerySuite, which tests:
      - The semantics of OneTime trigger (i.e., on start, execute a single batch, then stop).
      - The case when the commit log was not able to successfully log the completion of a batch before restart, which would mean that we should fall back to what's in the offset log.
      - A OneTime trigger execution that results in an exception being thrown.
      
      marmbrus tdas zsxwing
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Tyson Condie <tcondie@gmail.com>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #17219 from tcondie/stream-commit.
      746a558d
  26. Mar 09, 2017
    • [SPARK-19874][BUILD] Hide API docs for org.apache.spark.sql.internal · 029e40b4
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      The API docs should not include the "org.apache.spark.sql.internal" package because they are internal private APIs.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #17217 from zsxwing/SPARK-19874.
      029e40b4
  27. Mar 07, 2017
    • [SPARK-17498][ML] StringIndexer enhancement for handling unseen labels · 4a9034b1
      VinceShieh authored
      ## What changes were proposed in this pull request?
      This PR is an enhancement to ML StringIndexer.
      Before this PR, StringIndexer only supported the "skip"/"error" options for dealing with unseen records.
      But those unseen records might still be useful, and in certain use cases users would like to keep the unseen labels.
      This PR enables StringIndexer to keep unseen labels, assigning them the index [numLabels].
      
      Before:
      ```
      StringIndexer().setHandleInvalid("skip")
      StringIndexer().setHandleInvalid("error")
      ```
      After, a third option "keep" is supported:
      ```
      StringIndexer().setHandleInvalid("keep")
      ```
      
      ## How was this patch tested?
      Test added in StringIndexerSuite
      
      Signed-off-by: VinceShieh <vincent.xie@intel.com>
      
      Author: VinceShieh <vincent.xie@intel.com>
      
      Closes #16883 from VinceShieh/spark-17498.
      4a9034b1
  28. Mar 02, 2017
    • [SPARK-19276][CORE] Fetch Failure handling robust to user error handling · 8417a7ae
      Imran Rashid authored
      ## What changes were proposed in this pull request?
      
      Fault-tolerance in Spark requires special handling of shuffle fetch
      failures.  The Executor would catch FetchFailedException and send a
      special message back to the driver.
      
      However, intervening user code could intercept that exception, and wrap
      it with something else.  This even happens in SparkSQL.  So rather than
      checking the thrown exception only, we'll store the fetch failure directly
      in the TaskContext, where users can't touch it.
      
      ## How was this patch tested?
      
      Added a test case which failed before the fix.  Full test suite via jenkins.
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #16639 from squito/SPARK-19276.
      8417a7ae
  29. Feb 21, 2017
    • [SPARK-19652][UI] Do auth checks for REST API access. · 17d83e1e
      Marcelo Vanzin authored
      The REST API has a security filter that performs auth checks
      based on the UI root's security manager. That works fine when
      the UI root is the app's UI, but not when it's the history server.
      
      In the SHS case, all users would be allowed to see all applications
      through the REST API, even if the UI itself wouldn't be available
      to them.
      
      This change adds auth checks for each app access through the API
      too, so that only authorized users can see the app's data.
      
      The change also modifies the existing security filter to use
      `HttpServletRequest.getRemoteUser()`, which is used in other
      places. That is not necessarily the same as the principal's
      name; for example, when using Hadoop's SPNEGO auth filter,
      the remote user strips the realm information, which then matches
      the user name registered as the owner of the application.
      
      I also renamed the UIRootFromServletContext trait to a more generic
      name since I'm using it to store more context information now.
      
      Tested manually with an authentication filter enabled.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #16978 from vanzin/SPARK-19652.
      17d83e1e
  30. Feb 19, 2017
    • [SPARK-19550][BUILD][WIP] Addendum: select Java 1.7 for scalac 2.10, still · df3cbe3a
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Go back to selecting source/target 1.7 for Scala 2.10 builds, because the SBT-based build for 2.10 won't work otherwise.
      
      ## How was this patch tested?
      
      Existing tests, but, we need to verify this vs what the SBT build would exactly run on Jenkins
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16983 from srowen/SPARK-19550.3.
      df3cbe3a
  31. Feb 16, 2017
    • [SPARK-19550][BUILD][CORE][WIP] Remove Java 7 support · 0e240549
      Sean Owen authored
      - Move external/java8-tests tests into core, streaming, and sql, and remove the separate module
      - Remove MaxPermGen and related options
      - Fix some reflection / TODOs around Java 8+ methods
      - Update doc references to 1.7/1.8 differences
      - Remove Java 7/8 related build profiles
      - Update some plugins for better Java 8 compatibility
      - Fix a few Java-related warnings
      
      For the future:
      
      - Update Java 8 examples to fully use Java 8
      - Update Java tests to use lambdas for simplicity
      - Update Java internal implementations to use lambdas
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16871 from srowen/SPARK-19493.
      0e240549
  32. Feb 08, 2017
    • [SPARK-19464][CORE][YARN][TEST-HADOOP2.6] Remove support for Hadoop 2.5 and earlier · e8d3fca4
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      - Remove support for Hadoop 2.5 and earlier
      - Remove reflection and code constructs only needed to support multiple versions at once
      - Update docs to reflect newer versions
      - Remove older versions' builds and profiles.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16810 from srowen/SPARK-19464.
      e8d3fca4
  33. Jan 31, 2017
    • [SPARK-17161][PYSPARK][ML] Add PySpark-ML JavaWrapper convenience function to create Py4J JavaArrays · 57d70d26
      Bryan Cutler authored
      
      ## What changes were proposed in this pull request?
      
      Adding convenience function to Python `JavaWrapper` so that it is easy to create a Py4J JavaArray that is compatible with current class constructors that have a Scala `Array` as input so that it is not necessary to have a Java/Python friendly constructor.  The function takes a Java class as input that is used by Py4J to create the Java array of the given class.  As an example, `OneVsRest` has been updated to use this and the alternate constructor is removed.
      
      ## How was this patch tested?
      
      Added unit tests for the new convenience function and updated `OneVsRest` doctests which use this to persist the model.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #14725 from BryanCutler/pyspark-new_java_array-CountVectorizer-SPARK-17161.
      57d70d26
  34. Jan 20, 2017
    • [SPARK-19069][CORE] Expose task ‘status’ and ‘duration’ in spark history server REST API. · e20d9b15
      Parag Chaudhari authored
      ## What changes were proposed in this pull request?
      
      Although Spark history server UI shows task ‘status’ and ‘duration’ fields, it does not expose these fields in the REST API response. For the Spark history server API users, it is not possible to determine task status and duration. Spark history server has access to task status and duration from event log, but it is not exposing these in API. This patch is proposed to expose task ‘status’ and ‘duration’ fields in Spark history server REST API.
      
      ## How was this patch tested?
      
      Modified existing test cases in org.apache.spark.deploy.history.HistoryServerSuite.
      
      Author: Parag Chaudhari <paragpc@amazon.com>
      
      Closes #16473 from paragpc/expose_task_status.
      e20d9b15
  35. Jan 19, 2017
    • [SPARK-14272][ML] Add Loglikelihood in GaussianMixtureSummary · 8ccca917
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      
      Add logLikelihood to GMM.summary.
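
      A hedged usage sketch (dataset and parameters are illustrative):
      ```scala
      import org.apache.spark.ml.clustering.GaussianMixture

      val gmm = new GaussianMixture().setK(3).fit(dataset)   // dataset: DataFrame with a "features" column
      println(gmm.summary.logLikelihood)                      // total log-likelihood of the fitted model
      ```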
      
      ## How was this patch tested?
      
      added tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      Author: Ruifeng Zheng <ruifengz@foxmail.com>
      
      Closes #12064 from zhengruifeng/gmm_metric.
      8ccca917
  36. Jan 16, 2017
    • [SPARK-19148][SQL] do not expose the external table concept in Catalog · 18ee55dd
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      In https://github.com/apache/spark/pull/16296 , we reached a consensus that we should hide the external/managed table concept from users and only expose custom table paths.
      
      This PR renames `Catalog.createExternalTable` to `createTable` (the old versions are kept for backward compatibility), and only sets the table type to EXTERNAL if `path` is specified in the options.
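
      A hedged sketch of the renamed API (table name, source, and path are illustrative):
      ```scala
      spark.catalog.createTable(
        "events",                          // table name
        "parquet",                         // data source
        Map("path" -> "/data/events"))     // custom path => table type EXTERNAL
      ```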
      
      ## How was this patch tested?
      
      new tests in `CatalogSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16528 from cloud-fan/create-table.
      18ee55dd
  37. Dec 22, 2016
    • [SPARK-18537][WEB UI] Add a REST api to serve spark streaming information · ce99f51d
      saturday_s authored
      ## What changes were proposed in this pull request?
      
      This PR is an inheritance from #16000, and is a completion of #15904.
      
      **Description**
      
      - Augment the `org.apache.spark.status.api.v1` package for serving streaming information.
      - Retrieve the streaming information through StreamingJobProgressListener.
      
      > this API should cover exactly the same amount of information as you can get from the web interface
      > the implementation is based on the current REST implementation of spark-core
      > and will be available for running applications only
      >
      > https://issues.apache.org/jira/browse/SPARK-18537
      
      ## How was this patch tested?
      
      Local test.
      
      Author: saturday_s <shi.indetail@gmail.com>
      Author: Chan Chor Pang <ChorPang.Chan@access-company.com>
      Author: peterCPChan <universknight@gmail.com>
      
      Closes #16253 from saturday-shi/SPARK-18537.
      ce99f51d
  38. Dec 21, 2016
    • [SPARK-18949][SQL] Add recoverPartitions API to Catalog · 24c0c941
      gatorsmile authored
      ### What changes were proposed in this pull request?
      
      Currently, we only have a SQL interface for recovering all the partitions in the directory of a table and updating the catalog: `MSCK REPAIR TABLE` or `ALTER TABLE table RECOVER PARTITIONS`. (Actually, it is very hard for me to remember `MSCK`, and I have no clue what it means.)
      
      After the new "Scalable Partition Handling" work, table repair becomes much more important for making the data in a newly created partitioned data source table visible.
      
      Thus, this PR is to add it into the Catalog interface. After this PR, users can repair the table by
      ```Scala
      spark.catalog.recoverPartitions("testTable")
      ```
      
      ### How was this patch tested?
      Modified the existing test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16356 from gatorsmile/repairTable.
      24c0c941