  1. Mar 09, 2017
  2. Feb 22, 2017
    • Marcelo Vanzin
      [SPARK-19652][UI] Do auth checks for REST API access (branch-2.1). · 21afc453
      Marcelo Vanzin authored
      The REST API has a security filter that performs auth checks
      based on the UI root's security manager. That works fine when
      the UI root is the app's UI, but not when it's the history server.
      
      In the SHS case, all users would be allowed to see all applications
      through the REST API, even if the UI itself wouldn't be available
      to them.
      
      This change adds auth checks for each app access through the API
      too, so that only authorized users can see the app's data.
      
      The change also modifies the existing security filter to use
      `HttpServletRequest.getRemoteUser()`, which is used in other
      places. That is not necessarily the same as the principal's
      name; for example, when using Hadoop's SPNEGO auth filter,
      the remote user has the realm information stripped, which then matches
      the user name registered as the owner of the application.
      
      I also renamed the UIRootFromServletContext trait to a more generic
      name since I'm using it to store more context information now.
      
      Tested manually with an authentication filter enabled.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #17019 from vanzin/SPARK-19652_2.1.
      21afc453
  3. Dec 21, 2016
    • gatorsmile
      [SPARK-18949][SQL][BACKPORT-2.1] Add recoverPartitions API to Catalog · 0e51bb08
      gatorsmile authored
      ### What changes were proposed in this pull request?
      
      This PR is to backport https://github.com/apache/spark/pull/16356 to Spark 2.1.1 branch.
      
      ----
      
      Currently, we only have SQL interfaces for recovering all the partitions in the directory of a table and updating the catalog: `MSCK REPAIR TABLE` or `ALTER TABLE table RECOVER PARTITIONS`. (Frankly, `MSCK` is hard to remember, and its meaning is not obvious.)
      
      After the new "Scalable Partition Handling" work, table repair becomes much more important for making the data in a newly created, partitioned data source table visible.
      
      Thus, this PR adds it to the Catalog interface. After this PR, users can repair a table by
      ```Scala
      spark.catalog.recoverPartitions("testTable")
      ```
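
      For comparison, a minimal sketch of the pre-existing SQL routes mentioned above (the table name is the same illustrative one):
      ```Scala
      // Equivalent SQL interfaces, shown only for comparison with the new Catalog method
      spark.sql("MSCK REPAIR TABLE testTable")
      spark.sql("ALTER TABLE testTable RECOVER PARTITIONS")
      ```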
      
      ### How was this patch tested?
      Modified the existing test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16372 from gatorsmile/repairTable2.1.1.
      0e51bb08
  4. Dec 07, 2016
  5. Dec 05, 2016
    • Tathagata Das
      [SPARK-18657][SPARK-18668] Make StreamingQuery.id persist across restarts and... · 1946854a
      Tathagata Das authored
      [SPARK-18657][SPARK-18668] Make StreamingQuery.id persist across restarts and do not auto-generate StreamingQuery.name
      
      Here are the major changes in this PR.
      - Added the ability to recover `StreamingQuery.id` from checkpoint location, by writing the id to `checkpointLoc/metadata`.
      - Added `StreamingQuery.runId`, which is unique for every query start and does not persist across restarts. This identifies each restart of a query separately (the same as the earlier behavior of `id`).
      - Removed auto-generation of `StreamingQuery.name`. The purpose of name was to provide an identifier that survives restarts, but since id is precisely that, there is no need for an auto-generated name. This means name becomes purely cosmetic and is null by default (see the sketch after this list).
      - Added `runId` to `StreamingQueryListener` events and `StreamingQueryProgress`.
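
      A minimal sketch of how the two identifiers behave (assuming the standard `DataStreamReader`/`DataStreamWriter` APIs; the source, checkpoint path and comments here are illustrative):
      ```Scala
      // Illustrative only: any streaming DataFrame works; the checkpoint path is made up
      val df = spark.readStream.format("socket")
        .option("host", "localhost").option("port", "9999").load()

      val query = df.writeStream
        .format("console")
        .option("checkpointLocation", "/tmp/checkpoints/q1")
        .start()

      query.id     // recovered from <checkpointLocation>/metadata, stable across restarts
      query.runId  // regenerated for every (re)start of the query
      query.name   // no longer auto-generated; null unless set via queryName(...)
      ```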
      
      Implementation details
      - Renamed existing `StreamExecutionMetadata` to `OffsetSeqMetadata`, and moved it to the file `OffsetSeq.scala`, because that is what this metadata is tied to. Also did some refactoring to make the code cleaner (got rid of a lot of `.json` and `.getOrElse("{}")`).
      - Added the `id` as the new `StreamMetadata`.
      - When a StreamingQuery is created it gets or writes the `StreamMetadata` from `checkpointLoc/metadata`.
      - All internal logging in `StreamExecution` uses `(name, id, runId)` instead of just `name`
      
      TODO
      - [x] Test handling of name=null in json generation of StreamingQueryProgress
      - [x] Test handling of name=null in json generation of StreamingQueryListener events
      - [x] Test python API of runId
      
      Updated unit tests and new unit tests
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #16113 from tdas/SPARK-18657.
      
      (cherry picked from commit bb57bfe9)
      Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
      1946854a
    • Shixiong Zhu
      [SPARK-18694][SS] Add StreamingQuery.explain and exception to Python and fix... · c6a4e3d9
      Shixiong Zhu authored
      [SPARK-18694][SS] Add StreamingQuery.explain and exception to Python and fix StreamingQueryException (branch 2.1)
      
      ## What changes were proposed in this pull request?
      
      Backport #16125 to branch 2.1.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16153 from zsxwing/SPARK-18694-2.1.
      c6a4e3d9
  6. Nov 29, 2016
    • Tathagata Das
      [SPARK-18516][SQL] Split state and progress in streaming · 28b57c8a
      Tathagata Das authored
      
      This PR separates the status of a `StreamingQuery` into two separate APIs:
       - `status` - describes the status of a `StreamingQuery` at this moment, including what phase of processing is currently happening and if data is available.
       - `recentProgress` - an array of statistics about the most recent microbatches that have executed.
      
      A recent progress contains the following information:
      ```
      {
        "id" : "2be8670a-fce1-4859-a530-748f29553bb6",
        "name" : "query-29",
        "timestamp" : 1479705392724,
        "inputRowsPerSecond" : 230.76923076923077,
        "processedRowsPerSecond" : 10.869565217391303,
        "durationMs" : {
          "triggerExecution" : 276,
          "queryPlanning" : 3,
          "getBatch" : 5,
          "getOffset" : 3,
          "addBatch" : 234,
          "walCommit" : 30
        },
        "currentWatermark" : 0,
        "stateOperators" : [ ],
        "sources" : [ {
          "description" : "KafkaSource[Subscribe[topic-14]]",
          "startOffset" : {
            "topic-14" : {
              "2" : 0,
              "4" : 1,
              "1" : 0,
              "3" : 0,
              "0" : 0
            }
          },
          "endOffset" : {
            "topic-14" : {
              "2" : 1,
              "4" : 2,
              "1" : 0,
              "3" : 0,
              "0" : 1
            }
          },
          "numRecords" : 3,
          "inputRowsPerSecond" : 230.76923076923077,
          "processedRowsPerSecond" : 10.869565217391303
        } ]
      }
      ```
      
      Additionally, in order to make it possible to correlate progress updates across restarts, we change the `id` field from an integer that is unique within the JVM to a `UUID` that is globally unique.
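
      A minimal sketch of how the two new accessors described above are used (`query` is any running `StreamingQuery`; the printed fields are just examples):
      ```Scala
      // Point-in-time status vs. recent per-micro-batch statistics
      println(query.status)
      query.recentProgress.foreach { p =>
        println(s"${p.id} processed at ${p.processedRowsPerSecond} rows/sec")
      }
      ```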
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #15954 from marmbrus/queryProgress.
      
      (cherry picked from commit c3d08e2f)
      Signed-off-by: Michael Armbrust <michael@databricks.com>
      28b57c8a
  7. Nov 26, 2016
  8. Nov 25, 2016
    • hyukjinkwon
      [SPARK-3359][BUILD][DOCS] More changes to resolve javadoc 8 errors that will... · 69856f28
      hyukjinkwon authored
      [SPARK-3359][BUILD][DOCS] More changes to resolve javadoc 8 errors that will help unidoc/genjavadoc compatibility
      
      ## What changes were proposed in this pull request?
      
      This PR only tries to fix things that look pretty straightforward and were already fixed in previous PRs.
      
      This PR roughly fixes several things as below:
      
      - Fix unrecognisable class and method links in javadoc by changing them from `[[..]]` to `` `...` ``
      
        ```
        [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/DataStreamReader.java:226: error: reference not found
        [error]    * Loads text files and returns a {link DataFrame} whose schema starts with a string column named
        ```
      
      - Fix an exception annotation and remove code backticks in `throws` annotation
      
        Currently, sbt unidoc with Java 8 complains as below:
      
        ```
        [error] .../java/org/apache/spark/sql/streaming/StreamingQuery.java:72: error: unexpected text
        [error]    * throws StreamingQueryException, if <code>this</code> query has terminated with an exception.
        ```
      
        `throws` should specify the correct class name, i.e. `StreamingQueryException` without a trailing comma or backticks (see [JDK-8007644](https://bugs.openjdk.java.net/browse/JDK-8007644)).
      
      - Fix `[[http..]]` to `<a href="http..."></a>`.
      
        ```diff
        -   * [[https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https Oracle
        -   * blog page]].
        +   * <a href="https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https">
        +   * Oracle blog page</a>.
        ```
      
         `[[http...]]` link markdown in scaladoc is unrecognisable in javadoc.
      
      - It seems classes can't have a `return` annotation, so two cases of this were removed.
      
        ```
        [error] .../java/org/apache/spark/mllib/regression/IsotonicRegression.java:27: error: invalid use of return
        [error]    * return New instance of IsotonicRegression.
        ```
      
      - Fix < to `&lt;` and > to `&gt;` according to HTML rules.
      
      - Fix `</p>` complaint
      
      - Exclude tags that are unrecognisable in javadoc: `constructor`, `todo` and `groupname`. (A small before/after sketch of the link-style fixes above follows this list.)
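
      A hedged before/after sketch of the link-style changes described above (the object and method names are made up for illustration):

      ```Scala
      object LinkStyleExample {
        // Before: scaladoc-only wiki links that javadoc 8 cannot resolve
        /** See [[org.apache.spark.sql.Dataset]] and [[https://spark.apache.org Spark homepage]]. */
        def before(): Unit = ()

        // After: backticked identifiers and an explicit HTML anchor
        /**
         * See `org.apache.spark.sql.Dataset` and
         * <a href="https://spark.apache.org">Spark homepage</a>.
         */
        def after(): Unit = ()
      }
      ```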
      
      ## How was this patch tested?
      
      Manually tested by `jekyll build` with Java 7 and 8
      
      ```
      java version "1.7.0_80"
      Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
      Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
      ```
      
      ```
      java version "1.8.0_45"
      Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
      Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
      ```
      
      Note: this does not yet make sbt unidoc succeed with Java 8, but it reduces the number of errors with Java 8.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15999 from HyukjinKwon/SPARK-3359-errors.
      
      (cherry picked from commit 51b1c155)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      69856f28
  9. Nov 20, 2016
    • hyukjinkwon
      [SPARK-3359][BUILD][DOCS] Print examples and disable group and tparam tags in javadoc · bc3e7b3b
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes/fixes two things.
      
      - Remove many of the errors produced when generating javadoc with Java 8 due to the unrecognisable tags `tparam` and `group`.
      
        ```
        [error] .../spark/mllib/target/java/org/apache/spark/ml/classification/Classifier.java:18: error: unknown tag: group
        [error]   /** group setParam */
        [error]       ^
        [error] .../spark/mllib/target/java/org/apache/spark/ml/classification/Classifier.java:8: error: unknown tag: tparam
        [error]  * tparam FeaturesType  Type of input features.  E.g., <code>Vector</code>
        [error]    ^
        ...
        ```
      
        It does not fully resolve the problem, but it removes many errors. It seems both `group` and `tparam` are unrecognisable in javadoc, and it seems we can't render them nicely in javadoc the way we do for `example` here, because they appear differently (examples of both can be found in http://spark.apache.org/docs/2.0.2/api/scala/index.html#org.apache.spark.ml.classification.Classifier).
      
      - Print `example` in javadoc.
        Currently, there are a few `example` tags in several places.
      
        ```
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example This operation might be used to evaluate a graph
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example We might use this operation to change the vertex values
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example This function might be used to initialize edge
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example This function might be used to initialize edge
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example This function might be used to initialize edge
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example We can use this function to compute the in-degree of each
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example This function is used to update the vertices with new values based on external data.
        ./graphx/src/main/scala/org/apache/spark/graphx/GraphLoader.scala:   * example Loads a file in the following format:
        ./graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala:   * example This function is used to update the vertices with new
        ./graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala:   * example This function can be used to filter the graph based on some property, without
        ./graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala: * example We can use the Pregel abstraction to implement PageRank:
        ./graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala: * example Construct a `VertexRDD` from a plain RDD:
        ./repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkCommandLine.scala: * example new SparkCommandLine(Nil).settings
        ./repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkIMain.scala:   * example addImports("org.apache.spark.SparkContext")
        ./sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/LiteralGenerator.scala: * example {{{
        ```
      
      **Before**
      
        <img width="505" alt="2016-11-20 2 43 23" src="https://cloud.githubusercontent.com/assets/6477701/20457285/26f07e1c-aecb-11e6-9ae9-d9dee66845f4.png">
      
      **After**
        <img width="499" alt="2016-11-20 1 27 17" src="https://cloud.githubusercontent.com/assets/6477701/20457240/409124e4-aeca-11e6-9a91-0ba514148b52.png
      
      ">
      
      ## How was this patch tested?
      
      Manually tested by `jekyll build` with Java 7 and 8
      
      ```
      java version "1.7.0_80"
      Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
      Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
      ```
      
      ```
      java version "1.8.0_45"
      Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
      Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
      ```
      
      Note: this does not make sbt unidoc succeed with Java 8 yet, but it reduces the number of errors with Java 8.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15939 from HyukjinKwon/SPARK-3359-javadoc.
      
      (cherry picked from commit c528812c)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      bc3e7b3b
  10. Nov 19, 2016
    • hyukjinkwon
      [SPARK-18445][BUILD][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note... · 4b396a65
      hyukjinkwon authored
      [SPARK-18445][BUILD][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note that`/`'''Note:'''` across Scala/Java API documentation
      
      It seems that in Scala/Java, notes are written in the following inconsistent forms:
      
      - `Note:`
      - `NOTE:`
      - `Note that`
      - `'''Note:'''`
      - `note`
      
      This PR proposes to fix all of those to the `note` tag for consistency.
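
      A hedged before/after sketch of the textual change (the object and method names below are made up; the rendered results are in the screenshots that follow):

      ```Scala
      object NoteExample {
        // Before: a plain-text note, rendered inconsistently by scaladoc/javadoc
        /** Persists this value. Note: the storage level cannot be changed afterwards. */
        def cacheBefore(): Unit = ()

        // After: the scaladoc note tag that this PR standardizes on
        /**
         * Persists this value.
         * @note The storage level cannot be changed afterwards.
         */
        def cacheAfter(): Unit = ()
      }
      ```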
      
      **Before**
      
      - Scala
        ![2016-11-17 6 16 39](https://cloud.githubusercontent.com/assets/6477701/20383180/1a7aed8c-acf2-11e6-9611-5eaf6d52c2e0.png)
      
      - Java
        ![2016-11-17 6 14 41](https://cloud.githubusercontent.com/assets/6477701/20383096/c8ffc680-acf1-11e6-914a-33460bf1401d.png)
      
      **After**
      
      - Scala
        ![2016-11-17 6 16 44](https://cloud.githubusercontent.com/assets/6477701/20383167/09940490-acf2-11e6-937a-0d5e1dc2cadf.png)
      
      - Java
        ![2016-11-17 6 13 39](https://cloud.githubusercontent.com/assets/6477701/20383132/e7c2a57e-acf1-11e6-9c47-b849674d4d88.png)
      
      The notes were found via
      
      ```bash
      grep -r "NOTE: " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// NOTE: " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documentation
      -e 'org.apache.spark.api.java.function' \ # note that this is a regular expression. So actual matches were mostly `org/apache/spark/api/java/functions ...`
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "Note that " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// Note that " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documentation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "Note: " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// Note: " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documentation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "'''Note:'''" . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// '''Note:''' " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documentation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      And then fixed one by one comparing with API documentation/access modifiers.
      
      After that, manually tested via `jekyll build`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15889 from HyukjinKwon/SPARK-18437.
      
      (cherry picked from commit d5b1d5fc)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      4b396a65
  11. Oct 26, 2016
    • Shixiong Zhu
      [SPARK-18104][DOC] Don't build KafkaSource doc · 7d10631c
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      We don't need to build docs for KafkaSource because users should use the data source APIs to access it; all KafkaSource APIs are internal.
      
      ## How was this patch tested?
      
      Verified manually.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15630 from zsxwing/kafka-unidoc.
      7d10631c
  12. Oct 21, 2016
  13. Oct 18, 2016
    • Tathagata Das
      [SPARK-17731][SQL][STREAMING][FOLLOWUP] Refactored StreamingQueryListener APIs · 941b3f9a
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      As per rxin's request, here are further API changes (a listener sketch follows the list):
      - Changed `Stream(Started/Progress/Terminated)` events to `Stream*Event`
      - Changed the fields in `StreamingQueryListener.on***` from `query*` to `event`
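
      A minimal sketch of a listener written against the renamed event classes (assuming the `StreamingQueryListener` API; the logging is illustrative):
      ```Scala
      import org.apache.spark.sql.streaming.StreamingQueryListener
      import org.apache.spark.sql.streaming.StreamingQueryListener._

      // Sketch only: each callback now receives an `event` of the corresponding *Event class
      class LoggingListener extends StreamingQueryListener {
        override def onQueryStarted(event: QueryStartedEvent): Unit = println(s"started: $event")
        override def onQueryProgress(event: QueryProgressEvent): Unit = println(s"progress: $event")
        override def onQueryTerminated(event: QueryTerminatedEvent): Unit = println(s"terminated: $event")
      }

      spark.streams.addListener(new LoggingListener)
      ```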
      
      ## How was this patch tested?
      Existing unit tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #15530 from tdas/SPARK-17731-1.
      941b3f9a
  14. Oct 13, 2016
    • Tathagata Das
      [SPARK-17731][SQL][STREAMING] Metrics for structured streaming · 7106866c
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Metrics are needed for monitoring structured streaming apps. Here is the design doc for implementing the necessary metrics.
      https://docs.google.com/document/d/1NIdcGuR1B3WIe8t7VxLrt58TJB4DtipWEbj5I_mzJys/edit?usp=sharing
      
      Specifically, this PR adds the following public APIs changes.
      
      ### New APIs
      - `StreamingQuery.status` returns a `StreamingQueryStatus` object (renamed from `StreamingQueryInfo`, see later)
      
      - `StreamingQueryStatus` has the following important fields
        - inputRate - Current rate (rows/sec) at which data is being generated by all the sources
        - processingRate - Current rate (rows/sec) at which the query is processing data from
                                        all the sources
        - ~~outputRate~~ - *Does not work with wholestage codegen*
        - latency - Current average latency between the data being available in source and the sink writing the corresponding output
        - sourceStatuses: Array[SourceStatus] - Current statuses of the sources
        - sinkStatus: SinkStatus - Current status of the sink
        - triggerStatus - Low-level detailed status of the last completed/currently active trigger
          - latencies - getOffset, getBatch, full trigger, wal writes
          - timestamps - trigger start, finish, after getOffset, after getBatch
          - numRows - input, output, state total/updated rows for aggregations
      
      - `SourceStatus` has the following important fields
        - inputRate - Current rate (rows/sec) at which data is being generated by the source
        - processingRate - Current rate (rows/sec) at which the query is processing data from the source
        - triggerStatus - Low-level detailed status of the last completed/currently active trigger
      
      - Python API for `StreamingQuery.status()`
      
      ### Breaking changes to existing APIs
      **Existing direct public facing APIs**
      - Deprecated direct public-facing APIs `StreamingQuery.sourceStatuses` and `StreamingQuery.sinkStatus` in favour of `StreamingQuery.status.sourceStatuses/sinkStatus`.
        - Branch 2.0 should have it deprecated, master should have it removed.
      
      **Existing advanced listener APIs**
      - `StreamingQueryInfo` renamed to `StreamingQueryStatus` for consistency with `SourceStatus`, `SinkStatus`
         - Earlier StreamingQueryInfo was used only in the advanced listener API, but now it is used in direct public-facing API (StreamingQuery.status)
      
      - Field `queryInfo` in the listener events `QueryStarted`, `QueryProgress`, `QueryTerminated` was changed to have the name `queryStatus` and the return type `StreamingQueryStatus`.
      
      - Field `offsetDesc` in `SourceStatus` was `Option[String]`; it was converted to `String`.
      
      - For `SourceStatus` and `SinkStatus`, the constructors were made private instead of `private[sql]` to make them more Java-safe. Instead, `private[sql]` `SourceStatus.apply()`/`SinkStatus.apply()` methods were added, which are harder to accidentally use from Java.
      
      ## How was this patch tested?
      
      Old and new unit tests.
      - Rate calculation and other internal logic of StreamMetrics tested by StreamMetricsSuite.
      - New info in statuses returned through StreamingQueryListener is tested in StreamingQueryListenerSuite.
      - New and old info returned through StreamingQuery.status is tested in StreamingQuerySuite.
      - Source-specific tests for making sure input rows are counted are in the source-specific test suites.
      - Additional tests to test minor additions in LocalTableScanExec, StateStore, etc.
      
      Metrics also manually tested using Ganglia sink
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #15307 from tdas/SPARK-17731.
      7106866c
  15. Oct 11, 2016
  16. Oct 10, 2016
    • Wenchen Fan
      [SPARK-17338][SQL] add global temp view · 23ddff4b
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Global temporary view is a cross-session temporary view, which means it's shared among all sessions. Its lifetime is the lifetime of the Spark application, i.e. it will be automatically dropped when the application terminates. It's tied to a system-preserved database `global_temp` (configurable via SparkConf), and we must use the qualified name to refer to a global temp view, e.g. `SELECT * FROM global_temp.view1`.
      
      changes for `SessionCatalog`:
      
      1. add a new field `globalTempViews: GlobalTempViewManager`, to access the shared global temp views and the global temp db name.
      2. `createDatabase` will fail if users want to create `global_temp`, which is system preserved.
      3. `setCurrentDatabase` will fail if users want to set `global_temp`, which is system preserved.
      4. add `createGlobalTempView`, which is used in `CreateViewCommand` to create global temp views.
      5. add `dropGlobalTempView`, which is used in `CatalogImpl` to drop global temp view.
      6. add `alterTempViewDefinition`, which is used in `AlterViewAsCommand` to update the view definition for local/global temp views.
      7. `renameTable`/`dropTable`/`isTemporaryTable`/`lookupRelation`/`getTempViewOrPermanentTableMetadata`/`refreshTable` will handle global temp views.
      
      changes for SQL commands:
      
      1. `CreateViewCommand`/`AlterViewAsCommand` is updated to support global temp views
      2. `ShowTablesCommand` outputs a new column `database`, which is used to distinguish global and local temp views.
      3. other commands can also handle global temp views if they call `SessionCatalog` APIs which accepts global temp views, e.g. `DropTableCommand`, `AlterTableRenameCommand`, `ShowColumnsCommand`, etc.
      
      changes for other public API (a usage sketch follows this list):
      
      1. add a new method `dropGlobalTempView` in `Catalog`
      2. `Catalog.findTable` can find global temp view
      3. add a new method `createGlobalTempView` in `Dataset`
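
      A minimal sketch of the new user-facing calls (the view name is illustrative; `global_temp` is the default system-preserved database mentioned above):
      ```Scala
      // Create a cross-session view and query it with the qualified name
      spark.range(10).createGlobalTempView("view1")
      spark.sql("SELECT * FROM global_temp.view1").show()

      // Drop it via the new Catalog method
      spark.catalog.dropGlobalTempView("view1")
      ```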
      
      ## How was this patch tested?
      
      new tests in `SQLViewSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #14897 from cloud-fan/global-temp-view.
      23ddff4b
  17. Oct 05, 2016
    • Shixiong Zhu
      [SPARK-17346][SQL] Add Kafka source for Structured Streaming · 9293734d
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR adds a new project `external/kafka-0-10-sql` for the Structured Streaming Kafka source.
      
      It's based on the design doc: https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
      
      tdas did most of the work, and part of it was inspired by koeninger's work.
      
      ### Introduction
      
      The Kafka source is a Structured Streaming data source that polls data from Kafka. The schema of the data it reads is as follows:
      
      Column | Type
      ---- | ----
      key | binary
      value | binary
      topic | string
      partition | int
      offset | long
      timestamp | long
      timestampType | int
      
      The source can deal with deleted topics. However, the user should make sure no Spark job is processing the data when a topic is deleted.
      
      ### Configuration
      
      The user can use `DataStreamReader.option` to set the following configurations.
      
      Kafka Source's options | value | default | meaning
      ------ | ------- | ------ | -----
      startingOffset | ["earliest", "latest"] | "latest" | The start point when a query is started, either "earliest", which starts from the earliest offset, or "latest", which starts from the latest offset. Note: this only applies when a new streaming query is started; resuming will always pick up from where the query left off.
      failOnDataLoss | [true, false] | true | Whether to fail the query when it's possible that data was lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm; you can disable it when it doesn't work as you expect.
      subscribe | A comma-separated list of topics | (none) | The list of topics to subscribe to. Only one of the "subscribe" and "subscribePattern" options can be specified for the Kafka source.
      subscribePattern | Java regex string | (none) | The pattern used to subscribe to topics. Only one of the "subscribe" and "subscribePattern" options can be specified for the Kafka source.
      kafka.consumer.poll.timeoutMs | long | 512 | The timeout in milliseconds to poll data from Kafka in executors.
      fetchOffset.numRetries | int | 3 | Number of times to retry before giving up fetching the latest Kafka offsets.
      fetchOffset.retryIntervalMs | long | 10 | Milliseconds to wait before retrying to fetch Kafka offsets.
      
      Kafka's own configurations can be set via `DataStreamReader.option` with the `kafka.` prefix, e.g. `stream.option("kafka.bootstrap.servers", "host:port")`
      
      ### Usage
      
      * Subscribe to 1 topic
      ```Scala
      spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host:port")
        .option("subscribe", "topic1")
        .load()
      ```
      
      * Subscribe to multiple topics
      ```Scala
      spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host:port")
        .option("subscribe", "topic1,topic2")
        .load()
      ```
      
      * Subscribe to a pattern
      ```Scala
      spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host:port")
        .option("subscribePattern", "topic.*")
        .load()
      ```
      
      ## How was this patch tested?
      
      The new unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: Shixiong Zhu <zsxwing@gmail.com>
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #15102 from zsxwing/kafka-source.
      9293734d
  18. Oct 04, 2016
    • Sean Owen
      [SPARK-17671][WEBUI] Spark 2.0 history server summary page is slow even set... · 8e8de007
      Sean Owen authored
      [SPARK-17671][WEBUI] Spark 2.0 history server summary page is slow even set spark.history.ui.maxApplications
      
      ## What changes were proposed in this pull request?
      
      Return Iterator of applications internally in history server, for consistency and performance. See https://github.com/apache/spark/pull/15248 for some back-story.
      
      The code called by and calling HistoryServer.getApplicationList wants an Iterator, but this method materializes an Iterable, which potentially causes a performance problem. It's simpler too to make this internal method also pass through an Iterator.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15321 from srowen/SPARK-17671.
      8e8de007
  19. Oct 01, 2016
    • Herman van Hovell
      [SPARK-17717][SQL] Add Exist/find methods to Catalog [FOLLOW-UP] · af6ece33
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      We added find and exists methods for Databases, Tables and Functions to the user-facing Catalog in PR https://github.com/apache/spark/pull/15301. However, it was brought up that the semantics of the `find` methods are more in line with a `get` method (get an object or else fail), so we rename these methods in this PR.
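
      A minimal sketch of the resulting API shape (assuming the renamed `get*` methods alongside the `exists*` checks; the table name is illustrative):
      ```Scala
      // exists* returns a Boolean; get* returns the object or fails if it is missing
      if (spark.catalog.tableExists("testTable")) {
        val table = spark.catalog.getTable("testTable")
        println(table.description)
      }
      ```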
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #15308 from hvanhovell/SPARK-17717-2.
      af6ece33
  20. Sep 29, 2016
    • Herman van Hovell
      [SPARK-17717][SQL] Add exist/find methods to Catalog. · 74ac1c43
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      The current user-facing catalog does not implement methods for checking object existence or finding objects. You could theoretically do this using the `list*` commands, but this is rather cumbersome and can actually be costly when there are many objects. This PR adds `exists*` and `find*` methods for Databases, Tables and Functions.
      
      ## How was this patch tested?
      Added tests to `org.apache.spark.sql.internal.CatalogSuite`
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #15301 from hvanhovell/SPARK-17717.
      74ac1c43
    • Yanbo Liang
      [SPARK-17704][ML][MLLIB] ChiSqSelector performance improvement. · f7082ac1
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Several performance improvements for `ChiSqSelector`:
      1. Keep `selectedFeatures` sorted in ascending order.
      `ChiSqSelectorModel.transform` needs `selectedFeatures` ordered to make predictions. We should sort it when training the model rather than when making predictions, since users usually train a model once and then use it for prediction many times.
      2. When training an `fpr`-type `ChiSqSelectorModel`, it's not necessary to sort the ChiSq test results by statistic.
      
      ## How was this patch tested?
      Existing unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15277 from yanboliang/spark-17704.
      f7082ac1
  21. Sep 23, 2016
    • jisookim
      [SPARK-12221] add cpu time to metrics · 90a30f46
      jisookim authored
      Currently task metrics don't support executor CPU time, so there's no way to calculate how much CPU time a stage/task took from History Server metrics. This PR enables reporting CPU time.
      
      Author: jisookim <jisookim0513@gmail.com>
      
      Closes #10212 from jisookim0513/add-cpu-time-metric.
      90a30f46
  22. Sep 22, 2016
    • Marcelo Vanzin
      [SPARK-17639][BUILD] Add jce.jar to buildclasspath when building. · a4aeb767
      Marcelo Vanzin authored
      This was missing, preventing code that uses javax.crypto from
      compiling properly in Spark.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #15204 from vanzin/SPARK-17639.
      a4aeb767
    • Gayathri Murali
      [SPARK-16240][ML] ML persistence backward compatibility for LDA · f4f6bd8c
      Gayathri Murali authored
      ## What changes were proposed in this pull request?
      
      Allow Spark 2.x to load instances of LDA, LocalLDAModel, and DistributedLDAModel saved from Spark 1.6.
      
      ## How was this patch tested?
      
      I tested this manually, saving the 3 types from 1.6 and loading them into master (2.x).  In the future, we can add generic tests for testing backwards compatibility across all ML models in SPARK-15573.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #15034 from jkbradley/lda-backwards.
      f4f6bd8c
    • Dhruve Ashar
      [SPARK-17365][CORE] Remove/Kill multiple executors together to reduce RPC call time. · 17b72d31
      Dhruve Ashar authored
      ## What changes were proposed in this pull request?
      We now kill multiple executors together instead of iterating over expensive RPC calls to kill a single executor at a time.
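
      A hedged sketch of the batched request from the caller's side (assuming the developer-API `killExecutors`; the executor IDs are made up):
      ```Scala
      // One request for several executors rather than one call per executor
      sc.killExecutors(Seq("1", "2", "3"))
      ```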
      
      ## How was this patch tested?
      Executed sample spark job to observe executors being killed/removed with dynamic allocation enabled.
      
      Author: Dhruve Ashar <dashar@yahoo-inc.com>
      Author: Dhruve Ashar <dhruveashar@gmail.com>
      
      Closes #15152 from dhruve/impr/SPARK-17365.
      17b72d31
  23. Sep 21, 2016
  24. Sep 19, 2016
    • sethah
      [SPARK-17163][ML] Unified LogisticRegression interface · 26145a5a
      sethah authored
      ## What changes were proposed in this pull request?
      
      Merge `MultinomialLogisticRegression` into `LogisticRegression` and remove `MultinomialLogisticRegression`.
      
      Marked as WIP because we should discuss the coefficients API in the model. See discussion below.
      
      JIRA: [SPARK-17163](https://issues.apache.org/jira/browse/SPARK-17163)
      
      ## How was this patch tested?
      
      Merged test suites and added some new unit tests.
      
      ## Design
      
      ### Switching between binomial and multinomial
      
      We default to automatically detecting whether we should run binomial or multinomial lor. We expose a new parameter called `family` which defaults to auto. When "auto" is used, we run normal binomial lor with pivoting if there are 1 or 2 label classes. Otherwise, we run multinomial. If the user explicitly sets the family, then we abide by that setting. In the case where "binomial" is set but multiclass lor is detected, we throw an error.
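
      A minimal sketch of the new parameter (assuming the `ml` `LogisticRegression` estimator; the explicit setting shown simply overrides the automatic detection):
      ```Scala
      import org.apache.spark.ml.classification.LogisticRegression

      // "auto" (the default) picks binomial vs. multinomial from the number of label classes;
      // setting the family explicitly overrides that detection
      val autoLr  = new LogisticRegression()
      val multiLr = new LogisticRegression().setFamily("multinomial")
      ```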
      
      ### coefficients/intercept model API (TODO)
      
      This is the biggest design point remaining, IMO. We need to decide how to store the coefficients and intercepts in the model, and in turn how to expose them via the API. Two important points:
      
      * We must maintain compatibility with the old API, i.e. we must expose `def coefficients: Vector` and `def intercept: Double`
      * There are two separate cases: binomial lr where we have a single set of coefficients and a single intercept and multinomial lr where we have `numClasses` sets of coefficients and `numClasses` intercepts.
      
      Some options:
      
      1. **Store the binomial coefficients as a `2 x numFeatures` matrix.** This means that we would center the model coefficients before storing them in the model. The BLOR algorithm gives `1 * numFeatures` coefficients, but we would convert them to `2 x numFeatures` coefficients before storing them, effectively doubling the storage in the model. This has the advantage that we can make the code cleaner (i.e. less `if (isMultinomial) ... else ...`) and we don't have to reason about the different cases as much. It has the disadvantage that we double the storage space and we could see small regressions at prediction time since there are 2x the number of operations in the prediction algorithms. Additionally, we still have to produce the uncentered coefficients/intercept via the API, so we will have to either ALSO store the uncentered version, or compute it in `def coefficients: Vector` every time.
      
      2. **Store the binomial coefficients as a `1 x numFeatures` matrix.** We still store the coefficients as a matrix and the intercepts as a vector. When users call `coefficients` we return them a `Vector` that is backed by the same underlying array as the `coefficientMatrix`, so we don't duplicate any data. At prediction time, we use the old prediction methods that are specialized for binary LOR. The benefits here are that we don't store extra data, and we won't see any regressions in performance. The cost of this is that we have separate implementations for predict methods in the binary vs multiclass case. The duplicated code is really not very high, but it's still a bit messy.
      
      If we do decide to store the 2x coefficients, we would likely want to see some performance tests to understand the potential regressions.
      
      **Update:** We have chosen option 2
      
      ### Threshold/thresholds (TODO)
      
      Currently, when `threshold` is set we clear whatever value is in `thresholds` and when `thresholds` is set we clear whatever value is in `threshold`. [SPARK-11543](https://issues.apache.org/jira/browse/SPARK-11543) was created to prefer thresholds over threshold. We should decide if we should implement this behavior now or if we want to do it in a separate JIRA.
      
      **Update:** Let's leave it for a follow up PR
      
      ## Follow up
      
      * Summary model for multiclass logistic regression [SPARK-17139](https://issues.apache.org/jira/browse/SPARK-17139)
      * Thresholds vs threshold [SPARK-11543](https://issues.apache.org/jira/browse/SPARK-11543)
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #14834 from sethah/SPARK-17163.
      26145a5a
  25. Sep 15, 2016
  26. Sep 12, 2016
    • Josh Rosen
      [SPARK-14818] Post-2.0 MiMa exclusion and build changes · 7c51b99a
      Josh Rosen authored
      This patch makes a handful of post-Spark-2.0 MiMa exclusion and build updates. It should be merged to master and a subset of it should be picked into branch-2.0 in order to test Spark 2.0.1-SNAPSHOT.
      
      - Remove `sketch`, `mllibLocal`, and `streamingKafka010` from the list of excluded subprojects so that MiMa checks them.
      - Remove now-unnecessary special-case handling of the Kafka 0.8 artifact in `mimaSettings`.
      - Move the exclusion added in SPARK-14743 from `v20excludes` to `v21excludes`, since that patch was only merged into master and not branch-2.0.
      - Add exclusions for an API change introduced by SPARK-17096 / #14675.
      - Add missing exclusions for the `o.a.spark.internal` and `o.a.spark.sql.internal` packages.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #15061 from JoshRosen/post-2.0-mima-changes.
      7c51b99a
  27. Sep 04, 2016
    • Shivansh
      [SPARK-17308] Improved the spark core code by replacing all pattern match on... · e75c162e
      Shivansh authored
      [SPARK-17308] Improved the Spark core code by replacing all pattern matches on boolean values with if/else blocks.
      
      ## What changes were proposed in this pull request?
      Improved the code quality of Spark by replacing all pattern matches on boolean values with if/else blocks.
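
      A small sketch of the change in style (the names below are made up):
      ```Scala
      object BooleanMatchExample {
        // Before: pattern matching on a Boolean value
        def describeBefore(enabled: Boolean): String = enabled match {
          case true  => "enabled"
          case false => "disabled"
        }

        // After: a plain if/else block, the style this PR standardizes on
        def describeAfter(enabled: Boolean): String =
          if (enabled) "enabled" else "disabled"
      }
      ```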
      
      ## How was this patch tested?
      
      By running the tests
      
      Author: Shivansh <shiv4nsh@gmail.com>
      
      Closes #14873 from shiv4nsh/SPARK-17308.
      e75c162e
  28. Aug 26, 2016
    • Michael Gummelt
      [SPARK-16967] move mesos to module · 8e5475be
      Michael Gummelt authored
      ## What changes were proposed in this pull request?
      
      Move Mesos code into a mvn module
      
      ## How was this patch tested?
      
      unit tests
      manually submitting a client mode and cluster mode job
      spark/mesos integration test suite
      
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      
      Closes #14637 from mgummelt/mesos-module.
      8e5475be
  29. Aug 10, 2016
    • jerryshao
      [SPARK-14743][YARN] Add a configurable credential manager for Spark running on YARN · ab648c00
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      Add a configurable token manager for Spark running on YARN.
      
      ### Current Problems ###
      
      1. The supported token providers are hard-coded; currently only HDFS, HBase and Hive are supported, and it is impossible for users to add a new token provider without code changes.
      2. The same problem exists in the timely token renewer and updater.
      
      ### Changes In This Proposal ###
      
      To address the problems mentioned above and make the current code cleaner and easier to understand, this proposal mainly makes 3 changes:
      
      1. Abstract a `ServiceTokenProvider` interface (as well as `ServiceTokenRenewable`) for token providers. Each service that wants to communicate with Spark via tokens needs to implement this interface.
      2. Provide a `ConfigurableTokenManager` to manage all the registered token providers, as well as the token renewer and updater. This class also offers the API for other modules to obtain tokens, get the renewal interval and so on.
      3. Implement 3 built-in token providers, `HDFSTokenProvider`, `HiveTokenProvider` and `HBaseTokenProvider`, to keep the same semantics as supported today. Whether to load these built-in token providers is controlled by the configuration "spark.yarn.security.tokens.${service}.enabled"; by default all the built-in token providers are loaded.
      
      ### Behavior Changes ###
      
      For the end user there's no behavior change; we still use the same configuration `spark.yarn.security.tokens.${service}.enabled` to decide which token providers are enabled (e.g. hbase or hive).
      
      A user-implemented token provider (assume the provider is named "test") needs two configurations to be registered (see the sketch after this list):
      
      1. `spark.yarn.security.tokens.test.enabled` to true
      2. `spark.yarn.security.tokens.test.class` to the full qualified class name.
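
      A minimal sketch of those two settings (the provider name "test" comes from the example above; the class name below is made up):
      ```Scala
      import org.apache.spark.SparkConf

      // Sketch only: registers a hypothetical user-defined provider named "test"
      val conf = new SparkConf()
        .set("spark.yarn.security.tokens.test.enabled", "true")
        .set("spark.yarn.security.tokens.test.class", "com.example.TestTokenProvider")
      ```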
      
      So we still keep the same semantics as the current code while adding one new configuration.
      
      ### Current Status ###
      
      - [x] token provider interface and management framework.
      - [x] implement built-in token providers (hdfs, hbase, hive).
      - [x] Coverage of unit test.
      - [x] Integrated test with security cluster.
      
      ## How was this patch tested?
      
      Unit test and integrated test.
      
      Please suggest and review, any comment is greatly appreciated.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #14065 from jerryshao/SPARK-16342.
      ab648c00
  30. Aug 04, 2016
    • Sean Zhong
      [SPARK-16853][SQL] fixes encoder error in DataSet typed select · 9d7a4740
      Sean Zhong authored
      ## What changes were proposed in this pull request?
      
      For DataSet typed select:
      ```
      def select[U1: Encoder](c1: TypedColumn[T, U1]): Dataset[U1]
      ```
      If type T is a case class or a tuple class that is not atomic, the resulting logical plan's schema will mismatch the `Dataset[T]` encoder's schema, which will cause an encoder error and throw an AnalysisException.
      
      ### Before change:
      ```
      scala> case class A(a: Int, b: Int)
      scala> Seq((0, A(1,2))).toDS.select($"_2".as[A])
      org.apache.spark.sql.AnalysisException: cannot resolve '`a`' given input columns: [_2];
      ..
      ```
      
      ### After change:
      ```
      scala> case class A(a: Int, b: Int)
      scala> Seq((0, A(1,2))).toDS.select($"_2".as[A]).show
      +---+---+
      |  a|  b|
      +---+---+
      |  1|  2|
      +---+---+
      ```
      
      ## How was this patch tested?
      
      Unit test.
      
      Author: Sean Zhong <seanzhong@databricks.com>
      
      Closes #14474 from clockfly/SPARK-16853.
      9d7a4740
  31. Jul 12, 2016
    • petermaxlee
      [SPARK-16199][SQL] Add a method to list the referenced columns in data source Filter · c9a67621
      petermaxlee authored
      ## What changes were proposed in this pull request?
      It would be useful to support listing the columns that are referenced by a filter. This can help simplify data source planning, because with this we would be able to implement the `unhandledFilters` method in `HadoopFsRelation`.
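
      A minimal sketch of the idea (assuming the method is exposed as `references` on `org.apache.spark.sql.sources.Filter`; the filter itself is illustrative):
      ```Scala
      import org.apache.spark.sql.sources.{And, EqualTo, GreaterThan}

      // List the columns a data source Filter touches
      val filter = And(EqualTo("a", 1), GreaterThan("b", 10))
      println(filter.references.mkString(", "))   // expected: a, b
      ```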
      
      This is based on rxin's patch (#13901) and adds unit tests.
      
      ## How was this patch tested?
      Added a new suite FiltersSuite.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #14120 from petermaxlee/SPARK-16199.
      c9a67621
  32. Jul 11, 2016
    • Reynold Xin
      [SPARK-16476] Restructure MimaExcludes for easier union excludes · 52b5bb0b
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      It is currently fairly difficult to have proper mima excludes when we cut a version branch. I'm proposing a small change to take the exclude list out of the exclude function, and put it in a variable so we can easily union excludes.
      
      After this change, we can bump pom.xml version to 2.1.0-SNAPSHOT, without bumping the diff base version. Note that I also deleted all the exclude rules for version 1.x, to cut down the size of the file.
      
      ## How was this patch tested?
      N/A - this is a build infra change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #14128 from rxin/SPARK-16476.
      52b5bb0b
  33. Jul 06, 2016
  34. Jul 05, 2016
    • cody koeninger
      [SPARK-16359][STREAMING][KAFKA] unidoc skip kafka 0.10 · 1f0d0213
      cody koeninger authored
      ## What changes were proposed in this pull request?
      During the sbt unidoc task, skip the streamingKafka010 subproject and filter Kafka 0.10 classes from the classpath, so that at least the existing Kafka 0.8 docs can be included in unidoc without errors.
      
      ## How was this patch tested?
      sbt spark/scalaunidoc:doc | grep -i error
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #14041 from koeninger/SPARK-16359.
      1f0d0213
  35. Jul 04, 2016