Skip to content
Snippets Groups Projects
  1. Jul 18, 2017
    • Sean Owen's avatar
      [SPARK-21415] Triage scapegoat warnings, part 1 · e26dac5f
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Address scapegoat warnings for:
      - BigDecimal double constructor
      - Catching NPE
      - Finalizer without super
      - List.size is O(n)
      - Prefer Seq.empty
      - Prefer Set.empty
      - reverse.map instead of reverseMap
      - Type shadowing
      - Unnecessary if condition.
      - Use .log1p
      - Var could be val
      
      In some instances like Seq.empty, I avoided making the change even where valid in test code to keep the scope of the change smaller. Those issues are concerned with performance and it won't matter for tests.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #18635 from srowen/Scapegoat1.
      e26dac5f
  2. Jul 13, 2017
    • Sean Owen's avatar
      [SPARK-19810][BUILD][CORE] Remove support for Scala 2.10 · 425c4ada
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      - Remove Scala 2.10 build profiles and support
      - Replace some 2.10 support in scripts with commented placeholders for 2.12 later
      - Remove deprecated API calls from 2.10 support
      - Remove usages of deprecated context bounds where possible
      - Remove Scala 2.10 workarounds like ScalaReflectionLock
      - Other minor Scala warning fixes
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17150 from srowen/SPARK-19810.
      425c4ada
  3. Jun 22, 2017
  4. Jun 06, 2017
    • Marcelo Vanzin's avatar
      [SPARK-20641][CORE] Add key-value store abstraction and LevelDB implementation. · 0cba4951
      Marcelo Vanzin authored
      This change adds an abstraction and LevelDB implementation for a key-value
      store that will be used to store UI and SHS data.
      
      The interface is described in KVStore.java (see javadoc). Specifics
      of the LevelDB implementation are discussed in the javadocs of both
      LevelDB.java and LevelDBTypeInfo.java.
      
      Included also are a few small benchmarks just to get some idea of
      latency. Because they're too slow for regular unit test runs, they're
      disabled by default.
      
      Tested with the included unit tests, and also as part of the overall feature
      implementation (including running SHS with hundreds of apps).
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #17902 from vanzin/shs-ng/M1.
      0cba4951
  5. May 11, 2017
  6. May 09, 2017
  7. May 07, 2017
    • Steve Loughran's avatar
      [SPARK-7481][BUILD] Add spark-hadoop-cloud module to pull in object store access. · 2cf83c47
      Steve Loughran authored
      ## What changes were proposed in this pull request?
      
      Add a new `spark-hadoop-cloud` module and maven profile to pull in object store support from `hadoop-openstack`, `hadoop-aws` and `hadoop-azure` (Hadoop 2.7+) JARs, along with their dependencies, fixing up the dependencies so that everything works, in particular Jackson.
      
      It restores `s3n://` access to S3, adds its `s3a://` replacement, OpenStack `swift://` and azure `wasb://`.
      
      There's a documentation page, `cloud_integration.md`, which covers the basic details of using Spark with object stores, referring the reader to the supplier's own documentation, with specific warnings on security and the possible mismatch between a store's behavior and that of a filesystem. In particular, users are advised be very cautious when trying to use an object store as the destination of data, and to consult the documentation of the storage supplier and the connector.
      
      (this is the successor to #12004; I can't re-open it)
      
      ## How was this patch tested?
      
      Downstream tests exist in [https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples](https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples)
      
      Those verify that the dependencies are sufficient to allow downstream applications to work with s3a, azure wasb and swift storage connectors, and perform basic IO & dataframe operations thereon. All seems well.
      
      Manually clean build & verify that assembly contains the relevant aws-* hadoop-* artifacts on Hadoop 2.6; azure on a hadoop-2.7 profile.
      
      SBT build: `build/sbt -Phadoop-cloud -Phadoop-2.7 package`
      maven build `mvn install -Phadoop-cloud -Phadoop-2.7`
      
      This PR *does not* update `dev/deps/spark-deps-hadoop-2.7` or `dev/deps/spark-deps-hadoop-2.6`, because unless the hadoop-cloud profile is enabled, no extra JARs show up in the dependency list. The dependency check in Jenkins isn't setting the property, so the new JARs aren't visible.
      
      Author: Steve Loughran <stevel@apache.org>
      Author: Steve Loughran <stevel@hortonworks.com>
      
      Closes #17834 from steveloughran/cloud/SPARK-7481-current.
      2cf83c47
  8. May 05, 2017
    • madhu's avatar
      [SPARK-20495][SQL][CORE] Add StorageLevel to cacheTable API · 9064f1b0
      madhu authored
      ## What changes were proposed in this pull request?
      Currently cacheTable API only supports MEMORY_AND_DISK. This PR adds additional API to take different storage levels.
      ## How was this patch tested?
      unit tests
      
      Author: madhu <phatak.dev@gmail.com>
      
      Closes #17802 from phatak-dev/cacheTableAPI.
      9064f1b0
  9. Apr 24, 2017
  10. Apr 19, 2017
  11. Apr 18, 2017
  12. Apr 06, 2017
    • jerryshao's avatar
      [SPARK-17019][CORE] Expose on-heap and off-heap memory usage in various places · a4491626
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      With [SPARK-13992](https://issues.apache.org/jira/browse/SPARK-13992), Spark supports persisting data into off-heap memory, but the usage of on-heap and off-heap memory is not exposed currently, it is not so convenient for user to monitor and profile, so here propose to expose off-heap memory as well as on-heap memory usage in various places:
      1. Spark UI's executor page will display both on-heap and off-heap memory usage.
      2. REST request returns both on-heap and off-heap memory.
      3. Also this can be gotten from MetricsSystem.
      4. Last this usage can be obtained programmatically from SparkListener.
      
      Attach the UI changes:
      
      ![screen shot 2016-08-12 at 11 20 44 am](https://cloud.githubusercontent.com/assets/850797/17612032/6c2f4480-607f-11e6-82e8-a27fb8cbb4ae.png)
      
      Backward compatibility is also considered for event-log and REST API. Old event log can still be replayed with off-heap usage displayed as 0. For REST API, only adds the new fields, so JSON backward compatibility can still be kept.
      ## How was this patch tested?
      
      Unit test added and manual verification.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #14617 from jerryshao/SPARK-17019.
      a4491626
  13. Mar 24, 2017
    • sethah's avatar
      [SPARK-17471][ML] Add compressed method to ML matrices · e8810b73
      sethah authored
      ## What changes were proposed in this pull request?
      
      This patch adds a `compressed` method to ML `Matrix` class, which returns the minimal storage representation of the matrix - either sparse or dense. Because the space occupied by a sparse matrix is dependent upon its layout (i.e. column major or row major), this method must consider both cases. It may also be useful to force the layout to be column or row major beforehand, so an overload is added which takes in a `columnMajor: Boolean` parameter.
      
      The compressed implementation relies upon two new abstract methods `toDense(columnMajor: Boolean)` and `toSparse(columnMajor: Boolean)`, similar to the compressed method implemented in the `Vector` class. These methods also allow the layout of the resulting matrix to be specified via the `columnMajor` parameter. More detail on the new methods is given below.
      ## How was this patch tested?
      
      Added many new unit tests
      ## New methods (summary, not exhaustive list)
      
      **Matrix trait**
      - `private[ml] def toDenseMatrix(columnMajor: Boolean): DenseMatrix` (abstract) - converts the matrix (either sparse or dense) to dense format
      - `private[ml] def toSparseMatrix(columnMajor: Boolean): SparseMatrix` (abstract) -  converts the matrix (either sparse or dense) to sparse format
      - `def toDense: DenseMatrix = toDense(true)`  - converts the matrix (either sparse or dense) to dense format in column major layout
      - `def toSparse: SparseMatrix = toSparse(true)` -  converts the matrix (either sparse or dense) to sparse format in column major layout
      - `def compressed: Matrix` - finds the minimum space representation of this matrix, considering both column and row major layouts, and converts it
      - `def compressed(columnMajor: Boolean): Matrix` - finds the minimum space representation of this matrix considering only column OR row major, and converts it
      
      **DenseMatrix class**
      - `private[ml] def toDenseMatrix(columnMajor: Boolean): DenseMatrix` - converts the dense matrix to a dense matrix, optionally changing the layout (data is NOT duplicated if the layouts are the same)
      - `private[ml] def toSparseMatrix(columnMajor: Boolean): SparseMatrix` - converts the dense matrix to sparse matrix, using the specified layout
      
      **SparseMatrix class**
      - `private[ml] def toDenseMatrix(columnMajor: Boolean): DenseMatrix` - converts the sparse matrix to a dense matrix, using the specified layout
      - `private[ml] def toSparseMatrix(columnMajors: Boolean): SparseMatrix` - converts the sparse matrix to sparse matrix. If the sparse matrix contains any explicit zeros, they are removed. If the layout requested does not match the current layout, data is copied to a new representation. If the layouts match and no explicit zeros exist, the current matrix is returned.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #15628 from sethah/matrix_compress.
      e8810b73
    • Eric Liang's avatar
      [SPARK-19820][CORE] Add interface to kill tasks w/ a reason · 8e558041
      Eric Liang authored
      This commit adds a killTaskAttempt method to SparkContext, to allow users to
      kill tasks so that they can be re-scheduled elsewhere.
      
      This also refactors the task kill path to allow specifying a reason for the task kill. The reason is propagated opaquely through events, and will show up in the UI automatically as `(N killed: $reason)` and `TaskKilled: $reason`. Without this change, there is no way to provide the user feedback through the UI.
      
      Currently used reasons are "stage cancelled", "another attempt succeeded", and "killed via SparkContext.killTask". The user can also specify a custom reason through `SparkContext.killTask`.
      
      cc rxin
      
      In the stage overview UI the reasons are summarized:
      ![1](https://cloud.githubusercontent.com/assets/14922/23929209/a83b2862-08e1-11e7-8b3e-ae1967bbe2e5.png)
      
      Within the stage UI you can see individual task kill reasons:
      ![2](https://cloud.githubusercontent.com/assets/14922/23929200/9a798692-08e1-11e7-8697-72b27ad8a287.png)
      
      Existing tests, tried killing some stages in the UI and verified the messages are as expected.
      
      Author: Eric Liang <ekl@databricks.com>
      Author: Eric Liang <ekl@google.com>
      
      Closes #17166 from ericl/kill-reason.
      8e558041
  14. Mar 23, 2017
    • Tyson Condie's avatar
      [SPARK-19876][SS][WIP] OneTime Trigger Executor · 746a558d
      Tyson Condie authored
      ## What changes were proposed in this pull request?
      
      An additional trigger and trigger executor that will execute a single trigger only. One can use this OneTime trigger to have more control over the scheduling of triggers.
      
      In addition, this patch requires an optimization to StreamExecution that logs a commit record at the end of successfully processing a batch. This new commit log will be used to determine the next batch (offsets) to process after a restart, instead of using the offset log itself to determine what batch to process next after restart; using the offset log to determine this would process the previously logged batch, always, thus not permitting a OneTime trigger feature.
      
      ## How was this patch tested?
      
      A number of existing tests have been revised. These tests all assumed that when restarting a stream, the last batch in the offset log is to be re-processed. Given that we now have a commit log that will tell us if that last batch was processed successfully, the results/assumptions of those tests needed to be revised accordingly.
      
      In addition, a OneTime trigger test was added to StreamingQuerySuite, which tests:
      - The semantics of OneTime trigger (i.e., on start, execute a single batch, then stop).
      - The case when the commit log was not able to successfully log the completion of a batch before restart, which would mean that we should fall back to what's in the offset log.
      - A OneTime trigger execution that results in an exception being thrown.
      
      marmbrus tdas zsxwing
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Tyson Condie <tcondie@gmail.com>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #17219 from tcondie/stream-commit.
      746a558d
  15. Mar 09, 2017
    • Shixiong Zhu's avatar
      [SPARK-19874][BUILD] Hide API docs for org.apache.spark.sql.internal · 029e40b4
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      The API docs should not include the "org.apache.spark.sql.internal" package because they are internal private APIs.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #17217 from zsxwing/SPARK-19874.
      029e40b4
  16. Mar 07, 2017
    • VinceShieh's avatar
      [SPARK-17498][ML] StringIndexer enhancement for handling unseen labels · 4a9034b1
      VinceShieh authored
      ## What changes were proposed in this pull request?
      This PR is an enhancement to ML StringIndexer.
      Before this PR, String Indexer only supports "skip"/"error" options to deal with unseen records.
      But those unseen records might still be useful and user would like to keep the unseen labels in
      certain use cases, This PR enables StringIndexer to support keeping unseen labels as
      indices [numLabels].
      
      '''Before
      StringIndexer().setHandleInvalid("skip")
      StringIndexer().setHandleInvalid("error")
      '''After
      support the third option "keep"
      StringIndexer().setHandleInvalid("keep")
      
      ## How was this patch tested?
      Test added in StringIndexerSuite
      
      Signed-off-by: VinceShieh <vincent.xieintel.com>
      (Please fill in changes proposed in this fix)
      
      Author: VinceShieh <vincent.xie@intel.com>
      
      Closes #16883 from VinceShieh/spark-17498.
      4a9034b1
  17. Mar 02, 2017
    • Imran Rashid's avatar
      [SPARK-19276][CORE] Fetch Failure handling robust to user error handling · 8417a7ae
      Imran Rashid authored
      ## What changes were proposed in this pull request?
      
      Fault-tolerance in spark requires special handling of shuffle fetch
      failures.  The Executor would catch FetchFailedException and send a
      special msg back to the driver.
      
      However, intervening user code could intercept that exception, and wrap
      it with something else.  This even happens in SparkSQL.  So rather than
      checking the thrown exception only, we'll store the fetch failure directly
      in the TaskContext, where users can't touch it.
      
      ## How was this patch tested?
      
      Added a test case which failed before the fix.  Full test suite via jenkins.
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #16639 from squito/SPARK-19276.
      8417a7ae
  18. Feb 21, 2017
    • Marcelo Vanzin's avatar
      [SPARK-19652][UI] Do auth checks for REST API access. · 17d83e1e
      Marcelo Vanzin authored
      The REST API has a security filter that performs auth checks
      based on the UI root's security manager. That works fine when
      the UI root is the app's UI, but not when it's the history server.
      
      In the SHS case, all users would be allowed to see all applications
      through the REST API, even if the UI itself wouldn't be available
      to them.
      
      This change adds auth checks for each app access through the API
      too, so that only authorized users can see the app's data.
      
      The change also modifies the existing security filter to use
      `HttpServletRequest.getRemoteUser()`, which is used in other
      places. That is not necessarily the same as the principal's
      name; for example, when using Hadoop's SPNEGO auth filter,
      the remote user strips the realm information, which then matches
      the user name registered as the owner of the application.
      
      I also renamed the UIRootFromServletContext trait to a more generic
      name since I'm using it to store more context information now.
      
      Tested manually with an authentication filter enabled.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #16978 from vanzin/SPARK-19652.
      17d83e1e
  19. Feb 19, 2017
    • Sean Owen's avatar
      [SPARK-19550][BUILD][WIP] Addendum: select Java 1.7 for scalac 2.10, still · df3cbe3a
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Go back to selecting source/target 1.7 for Scala 2.10 builds, because the SBT-based build for 2.10 won't work otherwise.
      
      ## How was this patch tested?
      
      Existing tests, but, we need to verify this vs what the SBT build would exactly run on Jenkins
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16983 from srowen/SPARK-19550.3.
      df3cbe3a
  20. Feb 16, 2017
    • Sean Owen's avatar
      [SPARK-19550][BUILD][CORE][WIP] Remove Java 7 support · 0e240549
      Sean Owen authored
      - Move external/java8-tests tests into core, streaming, sql and remove
      - Remove MaxPermGen and related options
      - Fix some reflection / TODOs around Java 8+ methods
      - Update doc references to 1.7/1.8 differences
      - Remove Java 7/8 related build profiles
      - Update some plugins for better Java 8 compatibility
      - Fix a few Java-related warnings
      
      For the future:
      
      - Update Java 8 examples to fully use Java 8
      - Update Java tests to use lambdas for simplicity
      - Update Java internal implementations to use lambdas
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16871 from srowen/SPARK-19493.
      0e240549
  21. Feb 08, 2017
    • Sean Owen's avatar
      [SPARK-19464][CORE][YARN][TEST-HADOOP2.6] Remove support for Hadoop 2.5 and earlier · e8d3fca4
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      - Remove support for Hadoop 2.5 and earlier
      - Remove reflection and code constructs only needed to support multiple versions at once
      - Update docs to reflect newer versions
      - Remove older versions' builds and profiles.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16810 from srowen/SPARK-19464.
      e8d3fca4
  22. Jan 31, 2017
    • Bryan Cutler's avatar
      [SPARK-17161][PYSPARK][ML] Add PySpark-ML JavaWrapper convenience function to... · 57d70d26
      Bryan Cutler authored
      [SPARK-17161][PYSPARK][ML] Add PySpark-ML JavaWrapper convenience function to create Py4J JavaArrays
      
      ## What changes were proposed in this pull request?
      
      Adding convenience function to Python `JavaWrapper` so that it is easy to create a Py4J JavaArray that is compatible with current class constructors that have a Scala `Array` as input so that it is not necessary to have a Java/Python friendly constructor.  The function takes a Java class as input that is used by Py4J to create the Java array of the given class.  As an example, `OneVsRest` has been updated to use this and the alternate constructor is removed.
      
      ## How was this patch tested?
      
      Added unit tests for the new convenience function and updated `OneVsRest` doctests which use this to persist the model.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #14725 from BryanCutler/pyspark-new_java_array-CountVectorizer-SPARK-17161.
      57d70d26
  23. Jan 20, 2017
    • Parag Chaudhari's avatar
      [SPARK-19069][CORE] Expose task 'status' and 'duration' in spark history server REST API. · e20d9b15
      Parag Chaudhari authored
      ## What changes were proposed in this pull request?
      
      Although Spark history server UI shows task ‘status’ and ‘duration’ fields, it does not expose these fields in the REST API response. For the Spark history server API users, it is not possible to determine task status and duration. Spark history server has access to task status and duration from event log, but it is not exposing these in API. This patch is proposed to expose task ‘status’ and ‘duration’ fields in Spark history server REST API.
      
      ## How was this patch tested?
      
      Modified existing test cases in org.apache.spark.deploy.history.HistoryServerSuite.
      
      Author: Parag Chaudhari <paragpc@amazon.com>
      
      Closes #16473 from paragpc/expose_task_status.
      e20d9b15
  24. Jan 19, 2017
    • Zheng RuiFeng's avatar
      [SPARK-14272][ML] Add Loglikelihood in GaussianMixtureSummary · 8ccca917
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      
      add loglikelihood in GMM.summary
      
      ## How was this patch tested?
      
      added tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      Author: Ruifeng Zheng <ruifengz@foxmail.com>
      
      Closes #12064 from zhengruifeng/gmm_metric.
      8ccca917
  25. Jan 16, 2017
    • Wenchen Fan's avatar
      [SPARK-19148][SQL] do not expose the external table concept in Catalog · 18ee55dd
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      In https://github.com/apache/spark/pull/16296 , we reached a consensus that we should hide the external/managed table concept to users and only expose custom table path.
      
      This PR renames `Catalog.createExternalTable` to `createTable`(still keep the old versions for backward compatibility), and only set the table type to EXTERNAL if `path` is specified in options.
      
      ## How was this patch tested?
      
      new tests in `CatalogSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16528 from cloud-fan/create-table.
      18ee55dd
  26. Dec 22, 2016
    • saturday_s's avatar
      [SPARK-18537][WEB UI] Add a REST api to serve spark streaming information · ce99f51d
      saturday_s authored
      ## What changes were proposed in this pull request?
      
      This PR is an inheritance from #16000, and is a completion of #15904.
      
      **Description**
      
      - Augment the `org.apache.spark.status.api.v1` package for serving streaming information.
      - Retrieve the streaming information through StreamingJobProgressListener.
      
      > this api should cover exceptly the same amount of information as you can get from the web interface
      > the implementation is base on the current REST implementation of spark-core
      > and will be available for running applications only
      >
      > https://issues.apache.org/jira/browse/SPARK-18537
      
      ## How was this patch tested?
      
      Local test.
      
      Author: saturday_s <shi.indetail@gmail.com>
      Author: Chan Chor Pang <ChorPang.Chan@access-company.com>
      Author: peterCPChan <universknight@gmail.com>
      
      Closes #16253 from saturday-shi/SPARK-18537.
      ce99f51d
  27. Dec 21, 2016
    • gatorsmile's avatar
      [SPARK-18949][SQL] Add recoverPartitions API to Catalog · 24c0c941
      gatorsmile authored
      ### What changes were proposed in this pull request?
      
      Currently, we only have a SQL interface for recovering all the partitions in the directory of a table and update the catalog. `MSCK REPAIR TABLE` or `ALTER TABLE table RECOVER PARTITIONS`. (Actually, very hard for me to remember `MSCK` and have no clue what it means)
      
      After the new "Scalable Partition Handling", the table repair becomes much more important for making visible the data in the created data source partitioned table.
      
      Thus, this PR is to add it into the Catalog interface. After this PR, users can repair the table by
      ```Scala
      spark.catalog.recoverPartitions("testTable")
      ```
      
      ### How was this patch tested?
      Modified the existing test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16356 from gatorsmile/repairTable.
      24c0c941
  28. Dec 14, 2016
  29. Dec 09, 2016
    • Weiqing Yang's avatar
      [SPARK-18697][BUILD] Upgrade sbt plugins · 9338aa4f
      Weiqing Yang authored
      ## What changes were proposed in this pull request?
      
      This PR is to upgrade sbt plugins. The following sbt plugins will be upgraded:
      ```
      sbteclipse-plugin: 4.0.0 -> 5.0.1
      sbt-mima-plugin: 0.1.11 -> 0.1.12
      org.ow2.asm/asm: 5.0.3 -> 5.1
      org.ow2.asm/asm-commons: 5.0.3 -> 5.1
      ```
      ## How was this patch tested?
      Pass the Jenkins build.
      
      Author: Weiqing Yang <yangweiqing001@gmail.com>
      
      Closes #16223 from weiqingy/SPARK_18697.
      9338aa4f
  30. Dec 07, 2016
  31. Dec 06, 2016
    • Sean Owen's avatar
      Revert "[SPARK-18697][BUILD] Upgrade sbt plugins" · 4cc8d890
      Sean Owen authored
      This reverts commit 7f31d378.
      4cc8d890
    • Weiqing Yang's avatar
      [SPARK-18697][BUILD] Upgrade sbt plugins · 7f31d378
      Weiqing Yang authored
      ## What changes were proposed in this pull request?
      
      This PR is to upgrade sbt plugins. The following sbt plugins will be upgraded:
      ```
      sbt-assembly: 0.11.2 -> 0.14.3
      sbteclipse-plugin: 4.0.0 -> 5.0.1
      sbt-mima-plugin: 0.1.11 -> 0.1.12
      org.ow2.asm/asm: 5.0.3 -> 5.1
      org.ow2.asm/asm-commons: 5.0.3 -> 5.1
      ```
      All other plugins are up-to-date.
      
      ## How was this patch tested?
      Pass the Jenkins build.
      
      Author: Weiqing Yang <yangweiqing001@gmail.com>
      
      Closes #16159 from weiqingy/SPARK-18697.
      7f31d378
  32. Dec 05, 2016
    • Tathagata Das's avatar
      [SPARK-18657][SPARK-18668] Make StreamingQuery.id persists across restart and... · bb57bfe9
      Tathagata Das authored
      [SPARK-18657][SPARK-18668] Make StreamingQuery.id persists across restart and not auto-generate StreamingQuery.name
      
      ## What changes were proposed in this pull request?
      Here are the major changes in this PR.
      - Added the ability to recover `StreamingQuery.id` from checkpoint location, by writing the id to `checkpointLoc/metadata`.
      - Added `StreamingQuery.runId` which is unique for every query started and does not persist across restarts. This is to identify each restart of a query separately (same as earlier behavior of `id`).
      - Removed auto-generation of `StreamingQuery.name`. The purpose of name was to have the ability to define an identifier across restarts, but since id is precisely that, there is no need for a auto-generated name. This means name becomes purely cosmetic, and is null by default.
      - Added `runId` to `StreamingQueryListener` events and `StreamingQueryProgress`.
      
      Implementation details
      - Renamed existing `StreamExecutionMetadata` to `OffsetSeqMetadata`, and moved it to the file `OffsetSeq.scala`, because that is what this metadata is tied to. Also did some refactoring to make the code cleaner (got rid of a lot of `.json` and `.getOrElse("{}")`).
      - Added the `id` as the new `StreamMetadata`.
      - When a StreamingQuery is created it gets or writes the `StreamMetadata` from `checkpointLoc/metadata`.
      - All internal logging in `StreamExecution` uses `(name, id, runId)` instead of just `name`
      
      TODO
      - [x] Test handling of name=null in json generation of StreamingQueryProgress
      - [x] Test handling of name=null in json generation of StreamingQueryListener events
      - [x] Test python API of runId
      
      ## How was this patch tested?
      Updated unit tests and new unit tests
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #16113 from tdas/SPARK-18657.
      bb57bfe9
    • Shixiong Zhu's avatar
      [SPARK-18694][SS] Add StreamingQuery.explain and exception to Python and fix... · 24601285
      Shixiong Zhu authored
      [SPARK-18694][SS] Add StreamingQuery.explain and exception to Python and fix StreamingQueryException
      
      ## What changes were proposed in this pull request?
      
      - Add StreamingQuery.explain and exception to Python.
      - Fix StreamingQueryException to not expose `OffsetSeq`.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16125 from zsxwing/py-streaming-explain.
      24601285
  33. Dec 03, 2016
    • Weiqing Yang's avatar
      [SPARK-18638][BUILD] Upgrade sbt, Zinc, and Maven plugins · 57619732
      Weiqing Yang authored
      ## What changes were proposed in this pull request?
      This PR is to upgrade:
      ```
         sbt: 0.13.11 -> 0.13.13,
         zinc: 0.3.9 -> 0.3.11,
         maven-assembly-plugin: 2.6 -> 3.0.0
         maven-compiler-plugin: 3.5.1 -> 3.6.
         maven-jar-plugin: 2.6 -> 3.0.2
         maven-javadoc-plugin: 2.10.3 -> 2.10.4
         maven-source-plugin: 2.4 -> 3.0.1
         org.codehaus.mojo:build-helper-maven-plugin: 1.10 -> 1.12
         org.codehaus.mojo:exec-maven-plugin: 1.4.0 -> 1.5.0
      ```
      
      The sbt release notes since the last version we used are: [v0.13.12](https://github.com/sbt/sbt/releases/tag/v0.13.12)  and [v0.13.13 ](https://github.com/sbt/sbt/releases/tag/v0.13.13).
      
      ## How was this patch tested?
      Pass build and the existing tests.
      
      Author: Weiqing Yang <yangweiqing001@gmail.com>
      
      Closes #16069 from weiqingy/SPARK-18638.
      57619732
  34. Dec 02, 2016
  35. Dec 01, 2016
    • Reynold Xin's avatar
      [SPARK-18663][SQL] Simplify CountMinSketch aggregate implementation · d3c90b74
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      SPARK-18429 introduced count-min sketch aggregate function for SQL, but the implementation and testing is more complicated than needed. This simplifies the test cases and removes support for data types that don't have clear equality semantics:
      
      1. Removed support for floating point and decimal types.
      
      2. Removed the heavy randomized tests. The underlying CountMinSketch implementation already had pretty good test coverage through randomized tests, and the SPARK-18429 implementation is just to add an aggregate function wrapper around CountMinSketch. There is no need for randomized tests at three different levels of the implementations.
      
      ## How was this patch tested?
      A lot of the change is to simplify test cases.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #16093 from rxin/SPARK-18663.
      d3c90b74
  36. Nov 29, 2016
    • Tathagata Das's avatar
      [SPARK-18516][SQL] Split state and progress in streaming · c3d08e2f
      Tathagata Das authored
      This PR separates the status of a `StreamingQuery` into two separate APIs:
       - `status` - describes the status of a `StreamingQuery` at this moment, including what phase of processing is currently happening and if data is available.
       - `recentProgress` - an array of statistics about the most recent microbatches that have executed.
      
      A recent progress contains the following information:
      ```
      {
        "id" : "2be8670a-fce1-4859-a530-748f29553bb6",
        "name" : "query-29",
        "timestamp" : 1479705392724,
        "inputRowsPerSecond" : 230.76923076923077,
        "processedRowsPerSecond" : 10.869565217391303,
        "durationMs" : {
          "triggerExecution" : 276,
          "queryPlanning" : 3,
          "getBatch" : 5,
          "getOffset" : 3,
          "addBatch" : 234,
          "walCommit" : 30
        },
        "currentWatermark" : 0,
        "stateOperators" : [ ],
        "sources" : [ {
          "description" : "KafkaSource[Subscribe[topic-14]]",
          "startOffset" : {
            "topic-14" : {
              "2" : 0,
              "4" : 1,
              "1" : 0,
              "3" : 0,
              "0" : 0
            }
          },
          "endOffset" : {
            "topic-14" : {
              "2" : 1,
              "4" : 2,
              "1" : 0,
              "3" : 0,
              "0" : 1
            }
          },
          "numRecords" : 3,
          "inputRowsPerSecond" : 230.76923076923077,
          "processedRowsPerSecond" : 10.869565217391303
        } ]
      }
      ```
      
      Additionally, in order to make it possible to correlate progress updates across restarts, we change the `id` field from an integer that is unique with in the JVM to a `UUID` that is globally unique.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #15954 from marmbrus/queryProgress.
      c3d08e2f
Loading