  1. Sep 26, 2016
    • Yanbo Liang's avatar
      [SPARK-17017][FOLLOW-UP][ML] Refactor of ChiSqSelector and add ML Python API. · ac65139b
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      #14597 modified ```ChiSqSelector``` to support the ```fpr``` selector type; however, it left some issues that need to be addressed:
      * We should allow users to set the selector type explicitly rather than switching between types through different setter calls, since the setting order can cause unexpected behavior. For example, if users set both ```numTopFeatures``` and ```percentile```, whether a ```kbest``` or a ```percentile``` model is trained depends on which setter was called last. This confuses users, so we should let them set the selector type explicitly. We handle similar issues elsewhere in the ML code base, such as ```GeneralizedLinearRegression``` and ```LogisticRegression```. (See the usage sketch after this list.)
      * Meanwhile, if the ```fpr``` model ever needs more than one parameter besides ```alpha```, the existing framework cannot handle it elegantly; the same applies to the ```kbest``` and ```percentile``` models. Setting the selector type explicitly solves this issue as well.
      * If users are allowed to set the selector type explicitly, we should handle parameter interactions: for example, if users set ```selectorType = percentile``` and ```alpha = 0.1```, we should notify them that ```alpha``` will have no effect. Complex parameter interaction checks should be handled in ```transformSchema```. (FYI #11620)
      * We should use lower-case selector type names to follow the MLlib convention.
      * Add an ML Python API.
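      A minimal usage sketch of the resulting Scala API (setter and type names taken from the description above; the DataFrame ```df``` with ```features```/```label``` columns is hypothetical):
      ```scala
      import org.apache.spark.ml.feature.ChiSqSelector

      // The selector type is chosen explicitly instead of being inferred from
      // whichever parameter happened to be set last.
      val selector = new ChiSqSelector()
        .setSelectorType("percentile")   // explicit: only setPercentile matters now
        .setPercentile(0.1)
        .setFeaturesCol("features")
        .setLabelCol("label")
        .setOutputCol("selectedFeatures")

      val model = selector.fit(df)
      val selected = model.transform(df)
      ```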
      
      ## How was this patch tested?
      Unit test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15214 from yanboliang/spark-17017.
      ac65139b
    • Burak Yavuz's avatar
      [SPARK-17650] malformed url's throw exceptions before bricking Executors · 59d87d24
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      When a malformed URL is sent to executors through `sc.addJar` or `sc.addFile`, the executors become unusable, because they constantly throw `MalformedURLException`s and can never acknowledge that the file or jar is simply bad input.
      
      This PR fixes that problem by making sure malformed URLs can never be submitted through `sc.addJar` and `sc.addFile`. An alternative would be to blacklist bad files and jars on executors: fail the first time, then ignore subsequent attempts (but print a warning message). A hedged sketch of the submission-time check is shown below.
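      A hedged sketch of failing fast in the driver (illustrative only; the helper name and exact validation are not the actual code in this patch):
      ```scala
      import java.net.{URI, URISyntaxException}

      // Reject malformed paths up front instead of letting every executor hit
      // repeated exceptions when it later tries to fetch the file or jar.
      def assertValidPath(path: String): Unit = {
        try {
          new URI(path)
        } catch {
          case e: URISyntaxException =>
            throw new IllegalArgumentException(s"Malformed path passed to addJar/addFile: $path", e)
        }
      }

      assertValidPath("hdfs:///jars/my-lib.jar")     // fine
      assertValidPath("hdfs://bad host/with spaces") // throws before reaching executors
      ```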
      
      ## How was this patch tested?
      
      Unit tests in SparkContextSuite
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #15224 from brkyvz/SPARK-17650.
      59d87d24
  2. Sep 25, 2016
    • xin wu's avatar
      [SPARK-17551][SQL] Add DataFrame API for null ordering · de333d12
      xin wu authored
      ## What changes were proposed in this pull request?
      This pull request adds a Scala/Java DataFrame API for null ordering (NULLS FIRST | NULLS LAST).
      
      It also does some minor cleanup of related code (e.g. incorrect indentation), and renames "orderby-nulls-ordering.sql" to be consistent with existing test files. A usage sketch is shown below.
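      A usage sketch of the new Column methods (```df``` is a hypothetical DataFrame with a nullable ```age``` column):
      ```scala
      import org.apache.spark.sql.functions.col

      // NULLS LAST when sorting descending, NULLS FIRST when sorting ascending.
      df.orderBy(col("age").desc_nulls_last).show()
      df.orderBy(col("age").asc_nulls_first).show()
      ```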
      
      ## How was this patch tested?
      Added a new test case in DataFrameSuite.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      Author: Xin Wu <xinwu@us.ibm.com>
      
      Closes #15123 from petermaxlee/SPARK-17551.
      de333d12
  3. Sep 24, 2016
  4. Sep 23, 2016
    • Shivaram Venkataraman's avatar
      [SPARK-17651][SPARKR] Set R package version number along with mvn · 7c382524
      Shivaram Venkataraman authored
      ## What changes were proposed in this pull request?
      
      This PR sets the R package version while tagging releases. Note that since R doesn't accept `-SNAPSHOT` in the version number field, we remove it when setting the next version.
      
      ## How was this patch tested?
      
      Tested manually by running locally
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #15223 from shivaram/sparkr-version-change.
      7c382524
    • jisookim's avatar
      [SPARK-12221] add cpu time to metrics · 90a30f46
      jisookim authored
      Currently task metrics don't support executor CPU time, so there's no way to calculate how much CPU time a stage/task took from History Server metrics. This PR enables reporting CPU time.
      
      Author: jisookim <jisookim0513@gmail.com>
      
      Closes #10212 from jisookim0513/add-cpu-time-metric.
      90a30f46
    • Michael Armbrust's avatar
      [SPARK-17643] Remove comparable requirement from Offset · 988c7145
      Michael Armbrust authored
      For some sources, it is difficult to provide a global ordering based only on the data in the offset. Since we don't use comparison for correctness, let's remove it.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #15207 from marmbrus/removeComparable.
      988c7145
    • Jeff Zhang's avatar
      [SPARK-17210][SPARKR] sparkr.zip is not distributed to executors when running sparkr in RStudio · f62ddc59
      Jeff Zhang authored
      ## What changes were proposed in this pull request?
      
      Spark adds sparkr.zip to the archives only in YARN mode (SparkSubmit.scala):
      ```
          if (args.isR && clusterManager == YARN) {
            val sparkRPackagePath = RUtils.localSparkRPackagePath
            if (sparkRPackagePath.isEmpty) {
              printErrorAndExit("SPARK_HOME does not exist for R application in YARN mode.")
            }
            val sparkRPackageFile = new File(sparkRPackagePath.get, SPARKR_PACKAGE_ARCHIVE)
            if (!sparkRPackageFile.exists()) {
              printErrorAndExit(s"$SPARKR_PACKAGE_ARCHIVE does not exist for R application in YARN mode.")
            }
            val sparkRPackageURI = Utils.resolveURI(sparkRPackageFile.getAbsolutePath).toString
      
            // Distribute the SparkR package.
            // Assigns a symbol link name "sparkr" to the shipped package.
            args.archives = mergeFileLists(args.archives, sparkRPackageURI + "#sparkr")
      
            // Distribute the R package archive containing all the built R packages.
            if (!RUtils.rPackages.isEmpty) {
              val rPackageFile =
                RPackageUtils.zipRLibraries(new File(RUtils.rPackages.get), R_PACKAGE_ARCHIVE)
              if (!rPackageFile.exists()) {
                printErrorAndExit("Failed to zip all the built R packages.")
              }
      
              val rPackageURI = Utils.resolveURI(rPackageFile.getAbsolutePath).toString
              // Assigns a symbol link name "rpkg" to the shipped package.
              args.archives = mergeFileLists(args.archives, rPackageURI + "#rpkg")
            }
          }
      ```
      So it is necessary to pass spark.master from the R process to the JVM; otherwise sparkr.zip won't be distributed to the executors. Besides that, I also pass spark.yarn.keytab/spark.yarn.principal to the Spark side, because the JVM process needs them to access a secured cluster.
      
      ## How was this patch tested?
      
      Verified manually in RStudio using the following code.
      ```
      Sys.setenv(SPARK_HOME="/Users/jzhang/github/spark")
      .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
      library(SparkR)
      sparkR.session(master="yarn-client", sparkConfig = list(spark.executor.instances="1"))
      df <- as.DataFrame(mtcars)
      head(df)
      
      ```
      
      …
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #14784 from zjffdu/SPARK-17210.
      f62ddc59
    • WeichenXu's avatar
      [SPARK-17499][SPARKR][ML][MLLIB] make the default params in sparkR spark.mlp... · f89808b0
      WeichenXu authored
      [SPARK-17499][SPARKR][ML][MLLIB] make the default params in sparkR spark.mlp consistent with MultilayerPerceptronClassifier
      
      ## What changes were proposed in this pull request?
      
      Update `MultilayerPerceptronClassifierWrapper.fit` parameter types:
      `layers: Array[Int]`
      `seed: String`
      
      Update several default params in SparkR `spark.mlp`:
      `tol` --> 1e-6
      `stepSize` --> 0.03
      `seed` --> NULL (when seed is NULL, the Scala-side wrapper regards it as a `null` value and the default seed is used)
      The R-side `seed` only supports 32-bit integers.
      
      Remove the `layers` default value and move it in front of the parameters that have default values.
      Add a validation check for the `layers` parameter.
      
      ## How was this patch tested?
      
      tests added.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #15051 from WeichenXu123/update_py_mlp_default.
      f89808b0
    • Holden Karau's avatar
      [SPARK-16861][PYSPARK][CORE] Refactor PySpark accumulator API on top of Accumulator V2 · 90d57542
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      Move the internals of the PySpark accumulator API off the old deprecated API and onto the new accumulator (V2) API.
      
      ## How was this patch tested?
      
      The existing PySpark accumulator tests (both unit tests and doc tests at the start of accumulator.py).
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #14467 from holdenk/SPARK-16861-refactor-pyspark-accumulator-api.
      90d57542
    • hyukjinkwon's avatar
      [BUILD] Closes some stale PRs · 5c5396cb
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to close some stale PRs and ones suggested to be closed by committer(s)
      
      Closes #12415
      Closes #14765
      Closes #15118
      Closes #15184
      Closes #15183
      Closes #9440
      Closes #15023
      Closes #14643
      Closes #14827
      
      ## How was this patch tested?
      
      N/A
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15198 from HyukjinKwon/stale-prs.
      5c5396cb
    • Shixiong Zhu's avatar
      [SPARK-17640][SQL] Avoid using -1 as the default batchId for FileStreamSource.FileEntry · 62ccf27a
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Avoid using -1 as the default batchId for FileStreamSource.FileEntry so that we can make sure we never write any FileEntry(..., batchId = -1) into the log. This also avoids people misusing it in the future (#15203 is an example).
      
      ## How was this patch tested?
      
      Jenkins.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15206 from zsxwing/cleanup.
      62ccf27a
    • Joseph K. Bradley's avatar
      [SPARK-16719][ML] Random Forests should communicate fewer trees on each iteration · 947b8c6e
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      RandomForest currently sends the entire forest to each worker on each iteration. This is because (a) the node queue is FIFO and (b) the closure references the entire array of trees (topNodes). (a) causes RFs to handle splits in many trees, especially early on in learning. (b) sends all trees explicitly.
      
      This PR:
      (a) Changes the RF node queue to be LIFO (a stack), so that RFs tend to focus on one or a few trees before moving on to others (see the illustrative sketch after this list).
      (b) Changes topNodes to pass only the trees required on that iteration.
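      An illustrative sketch, in plain Scala collections rather than Spark internals, of why a LIFO stack keeps an iteration's node batch within fewer trees than a FIFO queue (all names here are hypothetical):
      ```scala
      import scala.collection.mutable

      // Each queued item identifies one node of one tree in the forest.
      case class NodeRef(treeIndex: Int, nodeId: Int)
      def children(n: NodeRef): Seq[NodeRef] =
        Seq(NodeRef(n.treeIndex, 2 * n.nodeId + 1), NodeRef(n.treeIndex, 2 * n.nodeId + 2))

      val roots = (0 until 5).map(t => NodeRef(t, 0))

      // FIFO: children go to the back, so the first iterations touch every tree.
      val queue = mutable.Queue(roots: _*)
      val fifoTrees = (1 to 5).map { _ =>
        val n = queue.dequeue()
        children(n).foreach(c => queue.enqueue(c))
        n.treeIndex
      }
      println(s"FIFO touched ${fifoTrees.distinct.size} trees")  // 5

      // LIFO: children are popped next, so the same iterations stay within one tree.
      val stack = mutable.Stack(roots: _*)
      val lifoTrees = (1 to 5).map { _ =>
        val n = stack.pop()
        children(n).foreach(c => stack.push(c))
        n.treeIndex
      }
      println(s"LIFO touched ${lifoTrees.distinct.size} trees")  // 1
      ```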
      
      ## How was this patch tested?
      
      Unit tests:
      * Existing tests for correctness of tree learning
      * Manually modifying code and running tests to verify that a small number of trees are communicated on each iteration
        * This last item is hard to test via unit tests given the current APIs.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #14359 from jkbradley/rfs-fewer-trees.
      947b8c6e
  5. Sep 22, 2016
    • Marcelo Vanzin's avatar
      [SPARK-17639][BUILD] Add jce.jar to buildclasspath when building. · a4aeb767
      Marcelo Vanzin authored
      This was missing, preventing code that uses javax.crypto from compiling properly in Spark.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #15204 from vanzin/SPARK-17639.
      a4aeb767
    • Yucai Yu's avatar
      [SPARK-17635][SQL] Remove hardcode "agg_plan" in HashAggregateExec · 79159a1e
      Yucai Yu authored
      ## What changes were proposed in this pull request?
      
      "agg_plan" are hardcoded in HashAggregateExec, which have potential issue, so removing them.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Yucai Yu <yucai.yu@intel.com>
      
      Closes #15199 from yucai/agg_plan.
      79159a1e
    • Burak Yavuz's avatar
      [SPARK-17569][SPARK-17569][TEST] Make the unit test added for work again · a1661968
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      A [PR](https://github.com/apache/spark/commit/a6aade0042d9c065669f46d2dac40ec6ce361e63) was merged concurrently that made the unit test for PR #15122 not test anything anymore. This PR fixes the test.
      
      ## How was this patch tested?
      
      Changed line https://github.com/apache/spark/blob/0d634875026ccf1eaf984996e9460d7673561f80/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L137
      from `false` to `true` and made sure the unit test failed.
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #15203 from brkyvz/fix-test.
      a1661968
    • Gayathri Murali's avatar
      [SPARK-16240][ML] ML persistence backward compatibility for LDA · f4f6bd8c
      Gayathri Murali authored
      ## What changes were proposed in this pull request?
      
      Allow Spark 2.x to load instances of LDA, LocalLDAModel, and DistributedLDAModel saved from Spark 1.6.
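      A hedged sketch of what this enables (the model paths are hypothetical, and an active SparkSession is assumed):
      ```scala
      import org.apache.spark.ml.clustering.{DistributedLDAModel, LocalLDAModel}

      // Models written by Spark 1.6 can now be read back with the 2.x loaders.
      val localModel = LocalLDAModel.load("/models/lda-local-saved-by-1.6")
      val distModel = DistributedLDAModel.load("/models/lda-distributed-saved-by-1.6")
      ```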
      
      ## How was this patch tested?
      
      I tested this manually, saving the 3 types from 1.6 and loading them into master (2.x).  In the future, we can add generic tests for testing backwards compatibility across all ML models in SPARK-15573.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #15034 from jkbradley/lda-backwards.
      f4f6bd8c
    • Herman van Hovell's avatar
      [SPARK-17616][SQL] Support a single distinct aggregate combined with a non-partial aggregate · 0d634875
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      We currently cannot execute an aggregate that contains a single distinct aggregate function and one or more non-partially-plannable aggregate functions, for example:
      ```sql
      select   grp,
               collect_list(col1),
               count(distinct col2)
      from     tbl_a
      group by 1
      ```
      This is a regression from Spark 1.6. It is caused by the fact that the single-distinct aggregation code path assumes that all aggregates can be planned in two phases (i.e., are partially aggregatable). This PR works around the issue by triggering `RewriteDistinctAggregates` in such cases (similar to the approach taken in 1.6). An equivalent DataFrame-side query is sketched below.
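      A sketch of the same query through the DataFrame API (```tblA``` is a hypothetical DataFrame for `tbl_a`):
      ```scala
      import org.apache.spark.sql.functions.{collect_list, countDistinct}

      // One non-partially-plannable aggregate (collect_list) plus one distinct aggregate.
      tblA.groupBy("grp")
        .agg(collect_list("col1"), countDistinct("col2"))
        .show()
      ```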
      
      ## How was this patch tested?
      Created `RewriteDistinctAggregatesSuite` which checks if the aggregates with distinct aggregate functions get rewritten into two `Aggregates` and an `Expand`. Added a regression test to `DataFrameAggregateSuite`.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #15187 from hvanhovell/SPARK-17616.
      0d634875
    • Shixiong Zhu's avatar
      [SPARK-17638][STREAMING] Stop JVM StreamingContext when the Python process is dead · 3cdae0ff
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      When the Python process is dead, the JVM StreamingContext is still running. Hence we will see a lot of `Py4JException`s before the JVM process exits. It's better to stop the JVM StreamingContext to avoid those annoying logs.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15201 from zsxwing/stop-jvm-ssc.
      3cdae0ff
    • Burak Yavuz's avatar
      [SPARK-17613] S3A base paths with no '/' at the end return empty DataFrames · 85d609cf
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      Suppose you have a bucket `s3a://some-bucket` with the following files under it:
      ```
      s3a://some-bucket/file1.parquet
      s3a://some-bucket/file2.parquet
      ```
      Getting the parent path of `s3a://some-bucket/file1.parquet` yields
      `s3a://some-bucket/` and the ListingFileCatalog uses this as the key in the hash map.
      
      When catalog.allFiles is called, we use `s3a://some-bucket` (no slash at the end) to get the list of files, and we're left with an empty list!
      
      This PR fixes the problem by adding a `/` at the end of the `URI` iff the given `Path` doesn't have a parent, i.e. is the root. This is a no-op if the path already ends with a `/`, and it is handled through Hadoop's Path merging semantics. A hedged sketch of the idea follows.
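      A hedged sketch of the idea (not the actual `ListingFileCatalog` change; the helper name is illustrative):
      ```scala
      import org.apache.hadoop.fs.Path

      // If the path has no parent it is a bucket/filesystem root, so make sure its
      // string form carries the trailing "/" that the listing keys use.
      def rootAwareKey(path: Path): String = {
        val uri = path.toUri.toString
        if (path.getParent == null && !uri.endsWith("/")) uri + "/" else uri
      }

      rootAwareKey(new Path("s3a://some-bucket"))               // "s3a://some-bucket/"
      rootAwareKey(new Path("s3a://some-bucket/file1.parquet")) // unchanged
      ```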
      
      ## How was this patch tested?
      
      Unit test in `FileCatalogSuite`.
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #15169 from brkyvz/SPARK-17613.
      85d609cf
    • Shivaram Venkataraman's avatar
      Skip building R vignettes if Spark is not built · 9f24a17c
      Shivaram Venkataraman authored
      ## What changes were proposed in this pull request?
      
      When we build the docs separately, we don't have the JAR files from the Spark build in
      the same tree. As the SparkR vignettes need to launch a SparkContext to be built, we skip building them if the JAR files don't exist.
      
      ## How was this patch tested?
      
      To test this we can run the following:
      ```
      build/mvn -DskipTests -Psparkr clean
      ./R/create-docs.sh
      ```
      You should see a line `Skipping R vignettes as Spark JARs not found` at the end
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #15200 from shivaram/sparkr-vignette-skip.
      9f24a17c
    • Dhruve Ashar's avatar
      [SPARK-17365][CORE] Remove/Kill multiple executors together to reduce RPC call time. · 17b72d31
      Dhruve Ashar authored
      ## What changes were proposed in this pull request?
      We now kill multiple executors together in a single request instead of issuing an expensive RPC call per executor. For reference, a sketch of the batched public API is shown below.
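      A hedged sketch of the batched public entry point (assumes an existing `SparkContext` named `sc`; the executor IDs are hypothetical):
      ```scala
      // One request covering several executors instead of one RPC round trip per executor.
      val removed: Boolean = sc.killExecutors(Seq("1", "2", "3"))
      ```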
      
      ## How was this patch tested?
      Executed sample spark job to observe executors being killed/removed with dynamic allocation enabled.
      
      Author: Dhruve Ashar <dashar@yahoo-inc.com>
      Author: Dhruve Ashar <dhruveashar@gmail.com>
      
      Closes #15152 from dhruve/impr/SPARK-17365.
      17b72d31
    • Wenchen Fan's avatar
      [SQL][MINOR] correct the comment of SortBasedAggregationIterator.safeProj · 8a02410a
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This comment went stale a long time ago; this PR fixes it according to my understanding.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15095 from cloud-fan/update-comment.
      8a02410a
    • WeichenXu's avatar
      [SPARK-17281][ML][MLLIB] Add treeAggregateDepth parameter for AFTSurvivalRegression · 72d9fba2
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Add a treeAggregateDepth parameter for AFTSurvivalRegression to keep it consistent with LiR/LoR. A hedged usage sketch follows.
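      A hedged usage sketch (the setter name is assumed to follow the LiR/LoR convention; `training` is a hypothetical DataFrame with `features`, `label`, and `censor` columns):
      ```scala
      import org.apache.spark.ml.regression.AFTSurvivalRegression

      val aft = new AFTSurvivalRegression()
        .setFeaturesCol("features")
        .setLabelCol("label")
        .setCensorCol("censor")
        .setAggregationDepth(3)   // deeper treeAggregate for jobs with many partitions

      val aftModel = aft.fit(training)
      ```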
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #14851 from WeichenXu123/add_treeAggregate_param_for_survival_regression.
      72d9fba2
    • frreiss's avatar
      [SPARK-17421][DOCS] Documenting the current treatment of MAVEN_OPTS. · 646f3834
      frreiss authored
      ## What changes were proposed in this pull request?
      
      Modified the documentation to clarify that `build/mvn` and `pom.xml` always add Java 7-specific parameters to `MAVEN_OPTS`, and that developers can safely ignore warnings about `-XX:MaxPermSize` that may result from compiling or running tests with Java 8.
      
      ## How was this patch tested?
      
      Rebuilt HTML documentation, made sure that building-spark.html displays correctly in a browser.
      
      Author: frreiss <frreiss@us.ibm.com>
      
      Closes #15005 from frreiss/fred-17421a.
      646f3834
    • Zhenhua Wang's avatar
      [SPARK-17625][SQL] set expectedOutputAttributes when converting... · de7df7de
      Zhenhua Wang authored
      [SPARK-17625][SQL] set expectedOutputAttributes when converting SimpleCatalogRelation to LogicalRelation
      
      ## What changes were proposed in this pull request?
      
      We should set expectedOutputAttributes when converting SimpleCatalogRelation to LogicalRelation; otherwise the outputs of the LogicalRelation differ from those of the SimpleCatalogRelation: they have different exprIds.
      
      ## How was this patch tested?
      
      add a test case
      
      Author: Zhenhua Wang <wzh_zju@163.com>
      
      Closes #15182 from wzhfy/expectedAttributes.
      de7df7de
    • gatorsmile's avatar
      [SPARK-17492][SQL] Fix Reading Cataloged Data Sources without Extending SchemaRelationProvider · 3a80f92f
      gatorsmile authored
      ### What changes were proposed in this pull request?
      For data sources that do not extend `SchemaRelationProvider`, we expect users not to specify a schema when creating tables. If a schema is provided by the user, an exception is issued.
      
      Since Spark 2.1, to avoid inferring the schema every time, we store the schema of any data source table in the metastore catalog. Thus, when reading a cataloged data source table, the schema can be read from the metastore catalog. In that case, we also get an exception. For example,
      
      ```Scala
      sql(
        s"""
           |CREATE TABLE relationProvierWithSchema
           |USING org.apache.spark.sql.sources.SimpleScanSource
           |OPTIONS (
           |  From '1',
           |  To '10'
           |)
         """.stripMargin)
      spark.table(tableName).show()
      ```
      ```
      org.apache.spark.sql.sources.SimpleScanSource does not allow user-specified schemas.;
      ```
      
      This PR fixes the above issue. When building a data source, we introduce a flag `isSchemaFromUsers` to indicate whether the schema is really input from users. If it is, we issue an exception. Otherwise, we call the `createRelation` of `RelationProvider` to generate the `BaseRelation`, which contains the actual schema.
      
      ### How was this patch tested?
      Added a few cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15046 from gatorsmile/tempViewCases.
      3a80f92f
    • Yadong Qi's avatar
      [SPARK-17425][SQL] Override sameResult in HiveTableScanExec to make... · cb324f61
      Yadong Qi authored
      [SPARK-17425][SQL] Override sameResult in HiveTableScanExec to make ReuseExchange work in text format table
      
      ## What changes were proposed in this pull request?
      This PR overrides `sameResult` in `HiveTableScanExec` to make `ReuseExchange` work for text-format tables.
      
      ## How was this patch tested?
      ### SQL
      ```sql
      SELECT * FROM src t1
      JOIN src t2 ON t1.key = t2.key
      JOIN src t3 ON t1.key = t3.key;
      ```
      
      ### Before
      ```
      == Physical Plan ==
      *BroadcastHashJoin [key#30], [key#34], Inner, BuildRight
      :- *BroadcastHashJoin [key#30], [key#32], Inner, BuildRight
      :  :- *Filter isnotnull(key#30)
      :  :  +- HiveTableScan [key#30, value#31], MetastoreRelation default, src
      :  +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
      :     +- *Filter isnotnull(key#32)
      :        +- HiveTableScan [key#32, value#33], MetastoreRelation default, src
      +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
         +- *Filter isnotnull(key#34)
            +- HiveTableScan [key#34, value#35], MetastoreRelation default, src
      ```
      
      ### After
      ```
      == Physical Plan ==
      *BroadcastHashJoin [key#2], [key#6], Inner, BuildRight
      :- *BroadcastHashJoin [key#2], [key#4], Inner, BuildRight
      :  :- *Filter isnotnull(key#2)
      :  :  +- HiveTableScan [key#2, value#3], MetastoreRelation default, src
      :  +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
      :     +- *Filter isnotnull(key#4)
      :        +- HiveTableScan [key#4, value#5], MetastoreRelation default, src
      +- ReusedExchange [key#6, value#7], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
      ```
      
      cc: davies cloud-fan
      
      Author: Yadong Qi <qiyadong2010@gmail.com>
      
      Closes #14988 from watermen/SPARK-17425.
      cb324f61
  6. Sep 21, 2016
    • Wenchen Fan's avatar
      [SPARK-17609][SQL] SessionCatalog.tableExists should not check temp view · b50b34f5
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      After #15054, there is no place in Spark SQL that needs `SessionCatalog.tableExists` to check temp views, so this PR makes `SessionCatalog.tableExists` check only permanent tables/views and removes some hacks.
      
      This PR also improves `getTempViewOrPermanentTableMetadata`, introduced in #15054, to make the code simpler.
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15160 from cloud-fan/exists.
      b50b34f5
    • Davies Liu's avatar
      [SPARK-17494][SQL] changePrecision() on compact decimal should respect rounding mode · 8bde03bf
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Floor()/Ceil() of decimal is implemented using changePrecision() by passing a rounding mode, but the rounding mode is not respected when the decimal is in compact mode (could fit within a Long).
      
      This updates changePrecision() to respect the rounding mode, which can be ROUND_FLOOR, ROUND_CEIL, ROUND_HALF_UP, or ROUND_HALF_EVEN. A hedged illustration of the user-visible behavior follows.
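      A hedged illustration of the user-visible behavior (assumes an existing `SparkSession` named `spark`):
      ```scala
      // floor/ceil on small decimals take the compact (Long-backed) code path;
      // with this change they honor ROUND_FLOOR / ROUND_CEIL there as well.
      spark.sql("SELECT floor(CAST(-0.2 AS DECIMAL(10, 1))), ceil(CAST(0.2 AS DECIMAL(10, 1)))").show()
      // expected: -1 and 1
      ```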
      
      ## How was this patch tested?
      
      Added regression tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #15154 from davies/decimal_round.
      8bde03bf
    • Michael Armbrust's avatar
      [SPARK-17627] Mark Streaming Providers Experimental · 3497ebe5
      Michael Armbrust authored
      All of structured streaming is experimental in its first release.  We missed the annotation on two of the APIs.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #15188 from marmbrus/experimentalApi.
      3497ebe5
    • Yanbo Liang's avatar
      [SPARK-17315][FOLLOW-UP][SPARKR][ML] Fix print of Kolmogorov-Smirnov test summary · 6902edab
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      #14881 added a Kolmogorov-Smirnov test wrapper to SparkR. I found that ```print.summary.KSTest``` was implemented incorrectly and had no effect.
      Running the following code for KSTest:
      ```Scala
      data <- data.frame(test = c(0.1, 0.15, 0.2, 0.3, 0.25, -1, -0.5))
      df <- createDataFrame(data)
      testResult <- spark.kstest(df, "test", "norm")
      summary(testResult)
      ```
      Before this PR:
      ![image](https://cloud.githubusercontent.com/assets/1962026/18615016/b9a2823a-7d4f-11e6-934b-128beade355e.png)
      After this PR:
      ![image](https://cloud.githubusercontent.com/assets/1962026/18615014/aafe2798-7d4f-11e6-8b99-c705bb9fe8f2.png)
      The new implementation is similar to [```print.summary.GeneralizedLinearRegressionModel```](https://github.com/apache/spark/blob/master/R/pkg/R/mllib.R#L284) in SparkR and [```print.summary.glm```](https://svn.r-project.org/R/trunk/src/library/stats/R/glm.R) in native R.
      
      BTW, I removed the comparison against ```print.summary.KSTest``` output in the unit test, since it only wraps the summary output, which has already been checked. Another reason is that the comparison prints summary information to the test console and makes the test output messy.
      
      ## How was this patch tested?
      Existing test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15139 from yanboliang/spark-17315.
      6902edab
    • Yanbo Liang's avatar
      [SPARK-17577][SPARKR][CORE] SparkR support add files to Spark job and get by executors · c133907c
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Scala/Python users can add files to a Spark job with the ```--files``` submit option or ```SparkContext.addFile()```, and can then retrieve an added file with ```SparkFiles.get(filename)```.
      We should also support this for SparkR users, since they have the same need for shared dependency files. For example, SparkR users can first download third-party R packages to the driver, add these files to the Spark job as dependencies through this API, and then each executor can install the packages with ```install.packages```. The existing Scala pattern is sketched below.
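      A hedged sketch of the Scala pattern this change mirrors for SparkR (assumes an existing `SparkContext` named `sc`; the file path is hypothetical):
      ```scala
      import org.apache.spark.SparkFiles

      sc.addFile("hdfs:///shared/deps/my-package.tar.gz")   // ship the file with the job
      val localPath = SparkFiles.get("my-package.tar.gz")   // resolve it on the driver or inside tasks
      ```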
      
      ## How was this patch tested?
      Add unit test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15131 from yanboliang/spark-17577.
      c133907c
    • Burak Yavuz's avatar
      [SPARK-17569] Make StructuredStreaming FileStreamSource batch generation faster · 7cbe2164
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      While getting the batch for a `FileStreamSource` in Structured Streaming, we know exactly which files we must take. We have already verified that they exist and have committed them to a metadata log. When creating the FileSourceRelation for an incremental execution, however, the code checks the existence of every single file once again!
      
      When you have 100,000s of files in a folder, creating the first batch takes 2+ hours when working with S3! This PR disables that check.
      
      ## How was this patch tested?
      
      Added a unit test to `FileStreamSource`.
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #15122 from brkyvz/SPARK-17569.
      7cbe2164
    • jerryshao's avatar
      [SPARK-17512][CORE] Avoid formatting to python path for yarn and mesos cluster mode · 8c3ee2bc
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      YARN and Mesos cluster modes support remote Python paths (HDFS/S3 schemes) through their own mechanisms, so it is not necessary to check and reformat the Python path when running in these modes. This is a potential regression compared to 1.6, so this PR proposes to fix it.
      
      ## How was this patch tested?
      
      Unit test to verify SparkSubmit arguments, plus local cluster verification. Because of the lack of `MiniDFSCluster` support in Spark unit tests, no integration test is added.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #15137 from jerryshao/SPARK-17512.
      8c3ee2bc
    • Imran Rashid's avatar
      [SPARK-17623][CORE] Clarify type of TaskEndReason with a failed task. · 9fcf1c51
      Imran Rashid authored
      ## What changes were proposed in this pull request?
      
      In TaskResultGetter, enqueueFailedTask currently deserializes the result
      as a TaskEndReason. But the type is actually more specific: it's a
      TaskFailedReason. This just leads to more blind casting later on; it
      would be clearer if the msg were cast to the right type immediately,
      so method parameter types could be tightened.
      
      ## How was this patch tested?
      
      Existing unit tests via jenkins.  Note that the code was already performing a blind-cast to a TaskFailedReason before in any case, just in a different spot, so there shouldn't be any behavior change.
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #15181 from squito/SPARK-17623.
      9fcf1c51
    • Marcelo Vanzin's avatar
      [SPARK-4563][CORE] Allow driver to advertise a different network address. · 2cd1bfa4
      Marcelo Vanzin authored
      The goal of this feature is to allow the Spark driver to run in an
      isolated environment, such as a docker container, and be able to use
      the host's port forwarding mechanism to be able to accept connections
      from the outside world.
      
      The change is restricted to the driver: there is no support for achieving
      the same thing on executors (or the YARN AM for that matter). Those still
      need full access to the outside world so that, for example, connections
      can be made to an executor's block manager.
      
      The core of the change is simple: add a new configuration that tells what
      address the driver should bind to, which can be different from the address
      it advertises to executors (spark.driver.host). Everything else is plumbing
      the new configuration to where it's needed.
      
      To use the feature, the host starting the container needs to set up the
      driver's port range to fall into a range that is being forwarded; this
      requires a special block manager port configuration just for the driver,
      which falls back to the existing spark.blockManager.port when
      not set. This way, users can modify the driver settings without affecting
      the executors; it would theoretically be nice to also have different
      retry counts for driver and executors, but given that docker (at least)
      allows forwarding port ranges, we can probably live without that for now.
      
      Because of the nature of the feature it's kinda hard to add unit tests;
      I just added a simple one to make sure the configuration works.
      
      This was tested with a docker image running spark-shell with the following
      command:
      
       docker blah blah blah \
         -p 38000-38100:38000-38100 \
         [image] \
         spark-shell \
           --num-executors 3 \
           --conf spark.shuffle.service.enabled=false \
           --conf spark.dynamicAllocation.enabled=false \
           --conf spark.driver.host=[host's address] \
           --conf spark.driver.port=38000 \
           --conf spark.driver.blockManager.port=38020 \
           --conf spark.ui.port=38040
      
      Running on YARN; verified the driver works, executors start up and listen
      on ephemeral ports (instead of using the driver's config), and that caching
      and shuffling (without the shuffle service) works. Clicked through the UI
      to make sure all pages (including executor thread dumps) worked. Also tested
      apps without docker, and ran unit tests.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #15120 from vanzin/SPARK-4563.
      2cd1bfa4
    • Sean Owen's avatar
      [SPARK-11918][ML] Better error from WLS for cases like singular input · b4a4421b
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Update error handling for Cholesky decomposition to provide a little more info when input is singular.
      
      ## How was this patch tested?
      
      New test case; jenkins tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15177 from srowen/SPARK-11918.
      b4a4421b