  1. Oct 18, 2016
    • Takuya UESHIN's avatar
      [SPARK-17985][CORE] Bump commons-lang3 version to 3.5. · bfe7885a
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      `SerializationUtils.clone()` in commons-lang3 (< 3.5) has a thread-safety bug: it sometimes gets stuck because of a race condition when initializing an internal hash map.
      See https://issues.apache.org/jira/browse/LANG-1251.
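
      As a hedged illustration (not Spark code), the sketch below shows the kind of concurrent-clone pattern on which the LANG-1251 race in commons-lang3 < 3.5 can surface; all names in it are hypothetical.

      ```scala
      // Hypothetical reproduction sketch: several threads cloning the same
      // serializable object concurrently. With commons-lang3 < 3.5 this pattern can
      // occasionally get stuck (per the description above); 3.5 is thread-safe.
      import org.apache.commons.lang3.SerializationUtils

      case class Conf(settings: Map[String, String])

      object CloneRaceSketch {
        def main(args: Array[String]): Unit = {
          val conf = Conf(Map("spark.app.name" -> "demo"))
          val threads = (1 to 8).map { _ =>
            new Thread(new Runnable {
              override def run(): Unit = SerializationUtils.clone(conf)
            })
          }
          threads.foreach(_.start())
          threads.foreach(_.join())
        }
      }
      ```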
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #15525 from ueshin/issues/SPARK-17985.
      bfe7885a
    • Eric Liang's avatar
      [SPARK-17974] try 2) Refactor FileCatalog classes to simplify the inheritance tree · 4ef39c2f
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      This renames `BasicFileCatalog => FileCatalog`, combines  `SessionFileCatalog` with `PartitioningAwareFileCatalog`, and removes the old `FileCatalog` trait.
      
      In summary,
      ```
      MetadataLogFileCatalog extends PartitioningAwareFileCatalog
      ListingFileCatalog extends PartitioningAwareFileCatalog
      PartitioningAwareFileCatalog extends FileCatalog
      TableFileCatalog extends FileCatalog
      ```
      
      (note that this is a re-submission of https://github.com/apache/spark/pull/15518 which got reverted)
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #15533 from ericl/fix-scalastyle-revert.
      4ef39c2f
    • Yu Peng's avatar
      [SPARK-17711] Compress rolled executor log · 231f39e3
      Yu Peng authored
      ## What changes were proposed in this pull request?
      
      This PR adds support for executor log compression.
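
      A hedged configuration sketch: the `enableCompression` key below is my assumption of the knob added by this PR, while the other rolling-log keys are pre-existing Spark settings.

      ```scala
      import org.apache.spark.SparkConf

      // Sketch: roll executor logs by size and (assumed key) compress rolled files.
      val conf = new SparkConf()
        .set("spark.executor.logs.rolling.strategy", "size")
        .set("spark.executor.logs.rolling.maxSize", "1048576")
        .set("spark.executor.logs.rolling.enableCompression", "true") // assumed name of the new knob
      ```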
      
      ## How was this patch tested?
      
      Unit tests
      
      cc: yhuai tdas mengxr
      
      Author: Yu Peng <loneknightpy@gmail.com>
      
      Closes #15285 from loneknightpy/compress-executor-log.
      231f39e3
    • hyukjinkwon's avatar
      [SPARK-17388] [SQL] Support for inferring type date/timestamp/decimal for partition column · 37686539
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Currently, Spark only supports inferring `IntegerType`, `LongType`, `DoubleType`, and `StringType` for partition columns.
      
      `DecimalType` is attempted, but it is never actually inferred because `DoubleType` is tried first. `DateType` and `TimestampType` could also be inferred.
      
      As far as I know, it is pretty common to use both types for partition columns.
      
      This PR fixes the incorrect `DecimalType` inference and adds support for inferring `DateType` and `TimestampType` partition columns.
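
      As a hedged illustration of the new behaviour (the paths and the commented output are assumptions, not taken from the PR):

      ```scala
      // Write into a date-valued partition directory and read the table back; with
      // this change the partition column can be inferred as a date, not a string.
      spark.range(3).write.parquet("/tmp/t/dt=2016-10-18")
      spark.read.parquet("/tmp/t").printSchema()
      // root
      //  |-- id: long (nullable = true)
      //  |-- dt: date (nullable = true)   // previously inferred as string
      ```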
      
      ## How was this patch tested?
      
      Unit tests in `ParquetPartitionDiscoverySuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #14947 from HyukjinKwon/SPARK-17388.
      37686539
    • Wenchen Fan's avatar
      [SPARK-17899][SQL][FOLLOW-UP] debug mode should work for corrupted table · e59df62e
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Debug mode should work for corrupted tables, so that we can actually debug them.
      
      ## How was this patch tested?
      
      new test in `MetastoreDataSourcesSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15528 from cloud-fan/debug.
      e59df62e
    • Tathagata Das's avatar
      [SQL][STREAMING][TEST] Follow up to remove Option.contains for Scala 2.10 compatibility · a9e79a41
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Scala 2.10 does not have `Option.contains`, which broke the Scala 2.10 build.
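
      For reference, a generic 2.10-compatible replacement looks like the sketch below (not necessarily the exact change made in this patch):

      ```scala
      val opt: Option[Int] = Some(1)
      // opt.contains(1)                 // Option.contains exists only in Scala 2.11+
      val viaEquals = opt == Some(1)     // works on 2.10
      val viaExists = opt.exists(_ == 1) // also works on 2.10
      ```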
      
      ## How was this patch tested?
      Locally compiled and ran sql/core unit tests in 2.10
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #15531 from tdas/metrics-flaky-test-fix-1.
      a9e79a41
    • Liwei Lin's avatar
      [SQL][STREAMING][TEST] Fix flaky tests in StreamingQueryListenerSuite · 7d878cf2
      Liwei Lin authored
      This work has largely been done by lw-lin in his PR #15497. This is a slight refactoring of it.
      
      ## What changes were proposed in this pull request?
      There were two sources of flakiness in the StreamingQueryListener tests:
      
      - When testing with a manual clock, consecutive attempts to advance the clock can occur without the stream execution thread being unblocked and doing some work between the attempts. Hence the following can happen with the current ManualClock:
      ```
      +-----------------------------------+--------------------------------+
      |      StreamExecution thread       |         testing thread         |
      +-----------------------------------+--------------------------------+
      |  ManualClock.waitTillTime(100) {  |                                |
      |        _isWaiting = true          |                                |
      |            wait(10)               |                                |
      |        still in wait(10)          |  if (_isWaiting) advance(100)  |
      |        still in wait(10)          |  if (_isWaiting) advance(200)  | <- this should be disallowed !
      |        still in wait(10)          |  if (_isWaiting) advance(300)  | <- this should be disallowed !
      |      wake up from wait(10)        |                                |
      |       current time is 600         |                                |
      |       _isWaiting = false          |                                |
      |  }                                |                                |
      +-----------------------------------+--------------------------------+
      ```
      
      - The second source of flakiness is that data added to the memory stream may get processed in any trigger, not just the first one.
      
      My fix is to make the manual clock wait for the stream execution thread to start waiting for the clock at the right wait start time. That is, `advance(200)` (see above) will wait for the stream execution thread to complete the wait that started at time 0 and start a new wait at time 200 (i.e. the timestamp after the previous `advance(100)`).
      
      In addition, since this feature is used solely by StreamExecution, I removed all the non-generic code from ManualClock and put it in a StreamManualClock inside StreamTest.
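
      A minimal sketch of the synchronization idea (not the actual StreamManualClock implementation): the advancing thread blocks until the other thread is parked in a wait that started at the expected time, which rules out the back-to-back advances shown above.

      ```scala
      // Minimal sketch only; names and details differ from the real StreamManualClock.
      class SketchManualClock {
        private var now: Long = 0L
        private var waitStartTime: Long = -1L // -1 means "nobody is waiting"

        def waitTillTime(target: Long): Long = synchronized {
          waitStartTime = now
          notifyAll()                         // let an advancer see that we are waiting
          while (now < target) wait()
          waitStartTime = -1L
          now
        }

        // Blocks until a wait that began at `expectedWaitStart` is in progress,
        // then advances the clock and wakes the waiting thread.
        def advanceWhenWaitingAt(expectedWaitStart: Long, delta: Long): Unit = synchronized {
          while (waitStartTime != expectedWaitStart) wait()
          now += delta
          notifyAll()
        }
      }
      ```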
      
      ## How was this patch tested?
      Ran the existing unit tests many times in Jenkins.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #15519 from tdas/metrics-flaky-test-fix.
      7d878cf2
  2. Oct 17, 2016
    • Eric Liang's avatar
      [SPARK-17974] Refactor FileCatalog classes to simplify the inheritance tree · 8daa1a29
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      This renames `BasicFileCatalog => FileCatalog`, combines  `SessionFileCatalog` with `PartitioningAwareFileCatalog`, and removes the old `FileCatalog` trait.
      
      In summary,
      ```
      MetadataLogFileCatalog extends PartitioningAwareFileCatalog
      ListingFileCatalog extends PartitioningAwareFileCatalog
      PartitioningAwareFileCatalog extends FileCatalog
      TableFileCatalog extends FileCatalog
      ```
      
      cc cloud-fan mallman
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #15518 from ericl/refactor-session-file-catalog.
      8daa1a29
    • Dilip Biswal's avatar
      [SPARK-17620][SQL] Determine Serde by hive.default.fileformat when Creating Hive Serde Tables · 813ab5e0
      Dilip Biswal authored
      ## What changes were proposed in this pull request?
      Reopens the closed PR https://github.com/apache/spark/pull/15190
      (Please refer to the above link for review comments on the PR)
      
      Make sure hive.default.fileformat is used when creating the storage format metadata.
      
      Output
      ``` SQL
      scala> spark.sql("SET hive.default.fileformat=orc")
      res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
      
      scala> spark.sql("CREATE TABLE tmp_default(id INT)")
      res2: org.apache.spark.sql.DataFrame = []
      ```
      Before
      ```SQL
      scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
      ..
      [# Storage Information,,]
      [SerDe Library:,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,]
      [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
      [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
      [Compressed:,No,]
      [Storage Desc Parameters:,,]
      [  serialization.format,1,]
      ```
      After
      ```SQL
      scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
      ..
      [# Storage Information,,]
      [SerDe Library:,org.apache.hadoop.hive.ql.io.orc.OrcSerde,]
      [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
      [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
      [Compressed:,No,]
      [Storage Desc Parameters:,,]
      [  serialization.format,1,]
      
      ```
      ## How was this patch tested?
      
      Added new tests to HiveDDLCommandSuite, SQLQuerySuite
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #15495 from dilipbiswal/orc2.
      813ab5e0
    • gatorsmile's avatar
      [SPARK-17751][SQL] Remove spark.sql.eagerAnalysis and Output the Plan if... · d88a1bae
      gatorsmile authored
      [SPARK-17751][SQL] Remove spark.sql.eagerAnalysis and Output the Plan if Existed in AnalysisException
      
      ### What changes were proposed in this pull request?
      Dataset always does eager analysis now, so `spark.sql.eagerAnalysis` is no longer used and should be removed.
      
      This PR also outputs the plan in the AnalysisException. Without the fix, the analysis error looks like this:
      ```
      cannot resolve '`k1`' given input columns: [k, v]; line 1 pos 12
      ```
      
      After the fix, the analysis error becomes:
      ```
      org.apache.spark.sql.AnalysisException: cannot resolve '`k1`' given input columns: [k, v]; line 1 pos 12;
      'Project [unresolvedalias(CASE WHEN ('k1 = 2) THEN 22 WHEN ('k1 = 4) THEN 44 ELSE 0 END, None), v#6]
      +- SubqueryAlias t
         +- Project [_1#2 AS k#5, _2#3 AS v#6]
            +- LocalRelation [_1#2, _2#3]
      ```
      
      ### How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15316 from gatorsmile/eagerAnalysis.
      d88a1bae
    • Sital Kedia's avatar
      [SPARK-17839][CORE] Use Nio's directbuffer instead of BufferedInputStream in... · c7ac027d
      Sital Kedia authored
      [SPARK-17839][CORE] Use Nio's directbuffer instead of BufferedInputStream in order to avoid additional copy from os buffer cache to user buffer
      
      ## What changes were proposed in this pull request?
      
      Currently we use a BufferedInputStream to read the shuffle file, which copies the file content from the OS buffer cache to a user buffer. This adds additional latency when reading the spill files. We changed the code to use Java NIO's direct buffer to read the spill files, and for certain pipelines that spill a significant amount of data we see up to a 7% speedup for the entire pipeline.
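
      The sketch below illustrates the read pattern described above as a generic, standalone example (it is not the actual Spark reader code):

      ```scala
      import java.io.File
      import java.nio.ByteBuffer
      import java.nio.channels.FileChannel
      import java.nio.file.StandardOpenOption

      // Read a file through a FileChannel into a direct ByteBuffer, avoiding the
      // extra heap copy a BufferedInputStream would make.
      def readWithDirectBuffer(file: File, bufferSize: Int = 1 << 20): Long = {
        val channel = FileChannel.open(file.toPath, StandardOpenOption.READ)
        val buffer = ByteBuffer.allocateDirect(bufferSize)
        var total = 0L
        try {
          var read = channel.read(buffer)
          while (read != -1) {
            buffer.flip()
            // ... hand the buffered bytes to the deserializer here ...
            total += read
            buffer.clear()
            read = channel.read(buffer)
          }
        } finally {
          channel.close()
        }
        total
      }
      ```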
      
      ## How was this patch tested?
      Tested by running the job in the cluster and observed up to 7% speedup.
      
      Author: Sital Kedia <skedia@fb.com>
      
      Closes #15408 from sitalkedia/skedia/nio_spill_read.
      c7ac027d
    • Maxime Rihouey's avatar
      Fix example of tf_idf with minDocFreq · e3bf37fa
      Maxime Rihouey authored
      ## What changes were proposed in this pull request?
      
      The Python example for tf_idf with the "minDocFreq" parameter is not set up properly: the same IDF model is used to transform the documents both with and without the "minDocFreq" parameter.
      The IDF(minDocFreq=2) model is stored in the variable "idfIgnore", but the original variable "idf" is then used to transform "tf" instead of "idfIgnore".
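
      For reference, the corrected pattern looks like the sketch below, written here as a Scala analogue of the Python example (`documents` is assumed to be an existing RDD[Seq[String]]):

      ```scala
      import org.apache.spark.mllib.feature.{HashingTF, IDF}

      val tf = new HashingTF().transform(documents) // documents: RDD[Seq[String]], assumed defined
      tf.cache()

      val idf = new IDF().fit(tf)
      val tfidf = idf.transform(tf)

      val idfIgnore = new IDF(minDocFreq = 2).fit(tf)
      val tfidfIgnore = idfIgnore.transform(tf)     // the fix: transform with idfIgnore, not idf
      ```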
      
      ## How was this patch tested?
      
      Before the fix, the results for "tfidf" and "tfidfIgnore" were the same:
      tfidf:
      (1048576,[1046921],[3.75828890549])
      (1048576,[1046920],[3.75828890549])
      (1048576,[1046923],[3.75828890549])
      (1048576,[892732],[3.75828890549])
      (1048576,[892733],[3.75828890549])
      (1048576,[892734],[3.75828890549])
      tfidfIgnore:
      (1048576,[1046921],[3.75828890549])
      (1048576,[1046920],[3.75828890549])
      (1048576,[1046923],[3.75828890549])
      (1048576,[892732],[3.75828890549])
      (1048576,[892733],[3.75828890549])
      (1048576,[892734],[3.75828890549])
      
      After the fix, they are as they should be:
      tfidf:
      (1048576,[1046921],[3.75828890549])
      (1048576,[1046920],[3.75828890549])
      (1048576,[1046923],[3.75828890549])
      (1048576,[892732],[3.75828890549])
      (1048576,[892733],[3.75828890549])
      (1048576,[892734],[3.75828890549])
      tfidfIgnore:
      (1048576,[1046921],[0.0])
      (1048576,[1046920],[0.0])
      (1048576,[1046923],[0.0])
      (1048576,[892732],[0.0])
      (1048576,[892733],[0.0])
      (1048576,[892734],[0.0])
      
      Author: Maxime Rihouey <maxime.rihouey@gmail.com>
      
      Closes #15503 from maximerihouey/patch-1.
      e3bf37fa
    • Weiqing Yang's avatar
      [MINOR][SQL] Add prettyName for current_database function · 56b0f5f4
      Weiqing Yang authored
      ## What changes were proposed in this pull request?
      Added a `prettyName` for the current_database function.
      
      ## How was this patch tested?
      Manually.
      
      Before:
      ```
      scala> sql("select current_database()").show
      +-----------------+
      |currentdatabase()|
      +-----------------+
      |          default|
      +-----------------+
      ```
      
      After:
      ```
      scala> sql("select current_database()").show
      +------------------+
      |current_database()|
      +------------------+
      |           default|
      +------------------+
      ```
      
      Author: Weiqing Yang <yangweiqing001@gmail.com>
      
      Closes #15506 from weiqingy/prettyName.
      56b0f5f4
  3. Oct 16, 2016
    • gatorsmile's avatar
      [SPARK-17947][SQL] Add Doc and Comment about spark.sql.debug · e18d02c5
      gatorsmile authored
      ### What changes were proposed in this pull request?
      Just document the impact of `spark.sql.debug`:
      
      When debug mode is enabled, Spark SQL internal table properties are not filtered out; however, some related DDL commands (e.g., ANALYZE TABLE and CREATE TABLE LIKE) might not work properly.
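
      A hedged usage sketch, assuming `spark.sql.debug` is a static flag supplied at launch time (the table name is hypothetical):

      ```scala
      // Launch with the flag, e.g.: spark-shell --conf spark.sql.debug=true
      // With it enabled, internal table properties show up in the output instead of
      // being filtered out, at the cost of some DDL commands possibly misbehaving.
      spark.sql("DESC EXTENDED some_table").show(numRows = 100, truncate = false)
      ```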
      
      ### How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15494 from gatorsmile/addDocForSQLDebug.
      e18d02c5
    • Dongjoon Hyun's avatar
      [SPARK-17819][SQL] Support default database in connection URIs for Spark Thrift Server · 59e3eb5a
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, Spark Thrift Server ignores the default database in the connection URI. This PR adds support for it, as shown below.
      
      ```sql
      $ bin/beeline -u jdbc:hive2://localhost:10000 -e "create database testdb"
      $ bin/beeline -u jdbc:hive2://localhost:10000/testdb -e "create table t(a int)"
      $ bin/beeline -u jdbc:hive2://localhost:10000/testdb -e "show tables"
      ...
      +------------+--------------+--+
      | tableName  | isTemporary  |
      +------------+--------------+--+
      | t          | false        |
      +------------+--------------+--+
      1 row selected (0.347 seconds)
      $ bin/beeline -u jdbc:hive2://localhost:10000 -e "show tables"
      ...
      +------------+--------------+--+
      | tableName  | isTemporary  |
      +------------+--------------+--+
      +------------+--------------+--+
      No rows selected (0.098 seconds)
      ```
      
      ## How was this patch tested?
      
      Manual.
      
      Note: I tried to add a test case for this, but I could not find a suitable test suite. I'll add the test case if given some advice.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15399 from dongjoon-hyun/SPARK-17819.
      59e3eb5a
    • Reynold Xin's avatar
      Revert "[SPARK-17637][SCHEDULER] Packed scheduling for Spark tasks across executors" · 72a6e7a5
      Reynold Xin authored
      This reverts commit ed146334.
      
      The merged patch had obvious quality and documentation issues. The idea is useful, and we should work towards improving its quality and merging it in again.
      72a6e7a5
  4. Oct 15, 2016
    • Zhan Zhang's avatar
      [SPARK-17637][SCHEDULER] Packed scheduling for Spark tasks across executors · ed146334
      Zhan Zhang authored
      ## What changes were proposed in this pull request?
      
      Restructure the code and implement two new task assigners.
      PackedAssigner: tries to allocate tasks to the executors with the fewest available cores, so that Spark can release reserved executors when dynamic allocation is enabled.
      
      BalancedAssigner: tries to allocate tasks to the executors with the most available cores, in order to balance the workload across all executors.
      
      By default, the original round robin assigner is used.
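
      A minimal sketch of the two ordering strategies (illustrative only; the names and the actual TaskSchedulerImpl integration are assumptions):

      ```scala
      case class ExecutorOffer(executorId: String, freeCores: Int)

      // PackedAssigner idea: offer executors with the fewest free cores first, so
      // lightly used executors stay idle and can be released by dynamic allocation.
      def packedOrder(offers: Seq[ExecutorOffer]): Seq[ExecutorOffer] =
        offers.sortBy(_.freeCores)

      // BalancedAssigner idea: offer executors with the most free cores first,
      // spreading tasks evenly across the cluster.
      def balancedOrder(offers: Seq[ExecutorOffer]): Seq[ExecutorOffer] =
        offers.sortBy(o => -o.freeCores)
      ```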
      
      We tested a pipeline, and the new PackedAssigner saves around 45% of the reserved CPU and memory with dynamic allocation enabled.
      
      ## How was this patch tested?
      
      Both unit tests in TaskSchedulerImplSuite and manual tests in a production pipeline.
      
      Author: Zhan Zhang <zhanzhang@fb.com>
      
      Closes #15218 from zhzhan/packed-scheduler.
      ed146334
    • Jun Kim's avatar
      [SPARK-17953][DOCUMENTATION] Fix typo in SparkSession scaladoc · 36d81c2c
      Jun Kim authored
      ## What changes were proposed in this pull request?
      
      ### Before:
      ```scala
      SparkSession.builder()
           .master("local")
           .appName("Word Count")
           .config("spark.some.config.option", "some-value").
           .getOrCreate()
      ```
      
      ### After:
      ```scala
      SparkSession.builder()
           .master("local")
           .appName("Word Count")
           .config("spark.some.config.option", "some-value")
           .getOrCreate()
      ```
      
      There was one unexpected dot!
      
      Author: Jun Kim <i2r.jun@gmail.com>
      
      Closes #15498 from tae-jun/SPARK-17953.
      36d81c2c
  5. Oct 14, 2016
    • Michael Allman's avatar
      [SPARK-16980][SQL] Load only catalog table partition metadata required to answer a query · 6ce1b675
      Michael Allman authored
      (This PR addresses https://issues.apache.org/jira/browse/SPARK-16980.)
      
      ## What changes were proposed in this pull request?
      
      In a new Spark session, when a partitioned Hive table is converted to use Spark's `HadoopFsRelation` in `HiveMetastoreCatalog`, metadata for every partition of that table are retrieved from the metastore and loaded into driver memory. In addition, every partition's metadata files are read from the filesystem to perform schema inference.
      
      If a user queries such a table with predicates which prune that table's partitions, we would like to be able to answer that query without consulting partition metadata which are not involved in the query. When querying a table with a large number of partitions for some data from a small number of partitions (maybe even a single partition), the current conversion strategy is highly inefficient. I suspect this scenario is not uncommon in the wild.
      
      In addition to being inefficient in running time, the current strategy is inefficient in its use of driver memory. When the sum of the number of partitions of all tables loaded in a driver reaches a certain level (somewhere in the tens of thousands), their cached data exhaust all driver heap memory in the default configuration. I suspect this scenario is less common (in that not too many deployments work with tables with tens of thousands of partitions), however this does illustrate how large the memory footprint of this metadata can be. With tables with hundreds or thousands of partitions, I would expect the `HiveMetastoreCatalog` table cache to represent a significant portion of the driver's heap space.
      
      This PR proposes an alternative approach. Basically, it makes four changes:
      
      1. It adds a new method, `listPartitionsByFilter` to the Catalyst `ExternalCatalog` trait which returns the partition metadata for a given sequence of partition pruning predicates.
      1. It refactors the `FileCatalog` type hierarchy to include a new `TableFileCatalog` to efficiently return files only for partitions matching a sequence of partition pruning predicates.
      1. It removes partition loading and caching from `HiveMetastoreCatalog`.
      1. It adds a new Catalyst optimizer rule, `PruneFileSourcePartitions`, which applies a plan's partition-pruning predicates to prune out unnecessary partition files from a `HadoopFsRelation`'s underlying file catalog.
      
      The net effect is that when a query over a partitioned Hive table is planned, the analyzer retrieves the table metadata from `HiveMetastoreCatalog`. As part of this operation, the `HiveMetastoreCatalog` builds a `HadoopFsRelation` with a `TableFileCatalog`. It does not load any partition metadata or scan any files. The optimizer prunes away unnecessary table partitions by sending the partition-pruning predicates to the relation's `TableFileCatalog`. The `TableFileCatalog` in turn calls the `listPartitionsByFilter` method on its external catalog. This queries the Hive metastore, passing along those filters.
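
      A hedged sketch of the shape of the new catalog call (the exact signature in `ExternalCatalog` may differ):

      ```scala
      import org.apache.spark.sql.catalyst.catalog.CatalogTablePartition
      import org.apache.spark.sql.catalyst.expressions.Expression

      // Sketch of the method's shape only; the real declaration lives in
      // ExternalCatalog and may differ in its details.
      trait PartitionPruningCatalogSketch {
        def listPartitionsByFilter(
            db: String,
            table: String,
            predicates: Seq[Expression]): Seq[CatalogTablePartition]
      }
      ```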
      
      As a bonus, performing partition pruning during optimization leads to a more accurate relation size estimate. This, along with c481bdf5, can lead to automatic, safe application of the broadcast optimization in a join where it might previously have been omitted.
      
      ## Open Issues
      
      1. This PR omits partition metadata caching. I can add this once the overall strategy for the cold path is established, perhaps in a future PR.
      1. This PR removes and omits partitioned Hive table schema reconciliation. As a result, it fails to find Parquet schema columns with upper case letters because of the Hive metastore's case-insensitivity. This issue may be fixed by #14750, but that PR appears to have stalled. ericl has contributed to this PR a workaround for Parquet wherein schema reconciliation occurs at query execution time instead of planning. Whether ORC requires a similar patch is an open issue.
      1. This PR omits an implementation of `listPartitionsByFilter` for the `InMemoryCatalog`.
      1. This PR breaks parquet log output redirection during query execution. I can work around this by running `Class.forName("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$")` first thing in a Spark shell session, but I haven't figured out how to fix this properly.
      
      ## How was this patch tested?
      
      The current Spark unit tests were run, and some ad-hoc tests were performed to validate that only the necessary partition metadata is loaded.
      
      Author: Michael Allman <michael@videoamp.com>
      Author: Eric Liang <ekl@databricks.com>
      Author: Eric Liang <ekhliang@gmail.com>
      
      Closes #14690 from mallman/spark-16980-lazy_partition_fetching.
      6ce1b675
    • Srinath Shankar's avatar
      [SPARK-17946][PYSPARK] Python crossJoin API similar to Scala · 2d96d35d
      Srinath Shankar authored
      ## What changes were proposed in this pull request?
      
      Add a crossJoin function to the Python DataFrame API, similar to the one in Scala. Joins with no condition (Cartesian products) must be specified with the crossJoin API.
      
      ## How was this patch tested?
      Added Python tests to ensure that an AnalysisException is raised if a Cartesian product is specified without crossJoin(), and that Cartesian products can execute if specified via crossJoin().
      
      Author: Srinath Shankar <srinath@databricks.com>
      
      Closes #15493 from srinathshankar/crosspython.
      2d96d35d
    • Reynold Xin's avatar
      [SPARK-17900][SQL] Graduate a list of Spark SQL APIs to stable · 72adfbf9
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch graduates a list of Spark SQL APIs and marks them as stable.
      
      The following are marked stable:
      
      Dataset/DataFrame
      - functions, since 1.3
      - ColumnName, since 1.3
      - DataFrameNaFunctions, since 1.3.1
      - DataFrameStatFunctions, since 1.4
      - UserDefinedFunction, since 1.3
      - UserDefinedAggregateFunction, since 1.5
      - Window and WindowSpec, since 1.4
      
      Data sources:
      - DataSourceRegister, since 1.5
      - RelationProvider, since 1.3
      - SchemaRelationProvider, since 1.3
      - CreatableRelationProvider, since 1.3
      - BaseRelation, since 1.3
      - TableScan, since 1.3
      - PrunedScan, since 1.3
      - PrunedFilteredScan, since 1.3
      - InsertableRelation, since 1.3
      
      The following are kept experimental / evolving:
      
      Data sources:
      - CatalystScan (tied to internal logical plans so it is not stable by definition)
      
      Structured streaming:
      - all classes (introduced new in 2.0 and will likely change)
      
      Dataset typed operations (introduced in 1.6 and 2.0 and might change, although probability is low)
      - all typed methods on Dataset
      - KeyValueGroupedDataset
      - o.a.s.sql.expressions.javalang.typed
      - o.a.s.sql.expressions.scalalang.typed
      - methods that return typed Dataset in SparkSession
      
      We should discuss more whether we want to mark Dataset typed operations stable in 2.1.
      
      ## How was this patch tested?
      N/A - just annotation changes.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15469 from rxin/SPARK-17900.
      72adfbf9
    • Jeff Zhang's avatar
      [SPARK-11775][PYSPARK][SQL] Allow PySpark to register Java UDF · f00df40c
      Jeff Zhang authored
      Currently PySpark can only call built-in Java UDFs, but cannot call custom Java UDFs. It would be better to allow that, for two benefits:
      * Leverage the power of rich third-party Java libraries.
      * Improve performance: if we use a Python UDF, Python daemons will be started on the workers, which affects performance.
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #9766 from zjffdu/SPARK-11775.
      f00df40c
    • Nick Pentreath's avatar
      [SPARK-16063][SQL] Add storageLevel to Dataset · 5aeb7384
      Nick Pentreath authored
      [SPARK-11905](https://issues.apache.org/jira/browse/SPARK-11905) added support for `persist`/`cache` for `Dataset`. However, there is no user-facing API to check if a `Dataset` is cached and, if so, what its storage level is. This PR adds `getStorageLevel` to `Dataset`, analogous to `RDD.getStorageLevel`.
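
      A hedged usage sketch (the method name here follows the PR title, `storageLevel`; the description also mentions `getStorageLevel`, so the final name may differ):

      ```scala
      // Sketch: check whether a Dataset is cached and at which level.
      val ds = spark.range(10).cache()
      println(ds.storageLevel)              // the level set by cache(), e.g. MEMORY_AND_DISK
      println(spark.range(10).storageLevel) // StorageLevel.NONE for an uncached Dataset
      ```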
      
      Updated `DatasetCacheSuite`.
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #13780 from MLnick/ds-storagelevel.
      
      Signed-off-by: Michael Armbrust <michael@databricks.com>
      5aeb7384
    • Davies Liu's avatar
      [SPARK-17863][SQL] should not add column into Distinct · da9aeb0f
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      We were trying to resolve an attribute in Sort by pulling up a column from the grandchild into the child, but that is wrong when the child is a Distinct: the added column changes the behavior of Distinct, so we should not do that.
      
      ## How was this patch tested?
      
      Added regression test.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #15489 from davies/order_distinct.
      da9aeb0f
    • Yin Huai's avatar
      Revert "[SPARK-17620][SQL] Determine Serde by hive.default.fileformat when... · 522dd0d0
      Yin Huai authored
      Revert "[SPARK-17620][SQL] Determine Serde by hive.default.fileformat when Creating Hive Serde Tables"
      
      This reverts commit 7ab86244.
      522dd0d0
    • Dilip Biswal's avatar
      [SPARK-17620][SQL] Determine Serde by hive.default.fileformat when Creating Hive Serde Tables · 7ab86244
      Dilip Biswal authored
      ## What changes were proposed in this pull request?
      Make sure hive.default.fileformat is used when creating the storage format metadata.
      
      Output
      ``` SQL
      scala> spark.sql("SET hive.default.fileformat=orc")
      res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
      
      scala> spark.sql("CREATE TABLE tmp_default(id INT)")
      res2: org.apache.spark.sql.DataFrame = []
      ```
      Before
      ```SQL
      scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
      ..
      [# Storage Information,,]
      [SerDe Library:,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,]
      [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
      [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
      [Compressed:,No,]
      [Storage Desc Parameters:,,]
      [  serialization.format,1,]
      ```
      After
      ```SQL
      scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
      ..
      [# Storage Information,,]
      [SerDe Library:,org.apache.hadoop.hive.ql.io.orc.OrcSerde,]
      [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
      [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
      [Compressed:,No,]
      [Storage Desc Parameters:,,]
      [  serialization.format,1,]
      
      ```
      
      ## How was this patch tested?
      Added new tests to HiveDDLCommandSuite
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #15190 from dilipbiswal/orc.
      7ab86244
    • sethah's avatar
      [SPARK-17941][ML][TEST] Logistic regression tests should use sample weights. · de1c1ca5
      sethah authored
      ## What changes were proposed in this pull request?
      
      The sample weight testing for logistic regression is not robust. The logistic regression suite already has many test cases comparing results to R's glmnet. Since both libraries support sample weights, we should use sample weights in these tests to increase coverage for sample weighting. This patch doesn't really add any new code; it makes the testing more complete.
      
      Also fixed some errors in the R code referenced in the test suite. Changed `standardization=T` to `standardize=T` since the former is invalid.
      
      ## How was this patch tested?
      
      Existing unit tests are modified. No non-test code is touched.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #15488 from sethah/logreg_weight_tests.
      de1c1ca5
    • Tathagata Das's avatar
      [TEST] Ignore flaky test in StreamingQueryListenerSuite · 05800b4b
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Ignoring the flaky test introduced in #15307
      
      https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/1736/testReport/junit/org.apache.spark.sql.streaming/StreamingQueryListenerSuite/single_listener__check_trigger_statuses/
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #15491 from tdas/metrics-flaky-test.
      05800b4b
    • Andrew Ash's avatar
      Typo: form -> from · fa37877a
      Andrew Ash authored
      ## What changes were proposed in this pull request?
      
      Minor typo fix
      
      ## How was this patch tested?
      
      Existing unit tests on Jenkins
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #15486 from ash211/patch-8.
      fa37877a
    • Dhruve Ashar's avatar
      [DOC] Fix typo in sql hive doc · a0ebcb3a
      Dhruve Ashar authored
      Change is too trivial to file a JIRA.
      
      Author: Dhruve Ashar <dhruveashar@gmail.com>
      
      Closes #15485 from dhruve/master.
      a0ebcb3a
    • wangzhenhua's avatar
      [SPARK-17073][SQL][FOLLOWUP] generate column-level statistics · 7486442f
      wangzhenhua authored
      ## What changes were proposed in this pull request?
      This PR adds some test cases for statistics (case-sensitive column names, non-ASCII column names, refresh table) and also improves some documentation.
      
      ## How was this patch tested?
      add test cases
      
      Author: wangzhenhua <wangzhenhua@huawei.com>
      
      Closes #15360 from wzhfy/colStats2.
      7486442f
    • invkrh's avatar
      [SPARK-17855][CORE] Remove query string from jar url · 28b645b1
      invkrh authored
      ## What changes were proposed in this pull request?
      
      Spark-submit supports jar URLs with the http protocol. However, if the URL contains a query string, the `worker.DriverRunner.downloadUserJar()` method will throw a "Did not see expected jar" exception. This is because the method checks for the existence of a downloaded jar whose name contains the query string. This is a problem when your jar is hosted on a web service that requires additional information in the URL to retrieve the file.
      
      This PR simply removes the query string before checking for the jar's existence on the worker.
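
      A minimal sketch of the idea with a hypothetical helper (not the actual DriverRunner code):

      ```scala
      import java.io.File
      import java.net.URI

      // Derive the expected local file name from the URL path only, so that
      // "?param=1" never ends up in the name we check for on disk.
      def expectedJarName(jarUrl: String): String =
        new File(new URI(jarUrl).getPath).getName

      expectedJarName("http://localhost/spark-job.jar?param=1") // "spark-job.jar"
      ```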
      
      ## How was this patch tested?
      
      For now, this patch can only be tested manually:
      * Deploy a spark cluster locally
      * Make sure apache httpd service is on
      * Save an uber jar, e.g spark-job.jar under `/var/www/html/`
      * Use http://localhost/spark-job.jar?param=1 as jar url when running `spark-submit`
      * Job should be launched
      
      Author: invkrh <invkrh@gmail.com>
      
      Closes #15420 from invkrh/spark-17855.
      28b645b1
    • Peng's avatar
      [SPARK-17870][MLLIB][ML] Change statistic to pValue for SelectKBest and... · c8b612de
      Peng authored
      [SPARK-17870][MLLIB][ML] Change statistic to pValue for SelectKBest and SelectPercentile because of DoF difference
      
      ## What changes were proposed in this pull request?
      
      The feature selection method ChiSquareSelector selects features based on ChiSquareTestResult.statistic (the chi-square value), picking the features with the largest chi-square values. But the degrees of freedom (df) of the chi-square values returned by Statistics.chiSqTest(RDD) differ across features, and chi-square values with different df are not comparable, so they cannot be used directly to rank features.
      
      So we change the selection criterion from the statistic to the pValue for SelectKBest and SelectPercentile.
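
      A simplified sketch of the criterion change (illustrative only; not the MLlib implementation):

      ```scala
      case class FeatureTestResult(featureIndex: Int, statistic: Double, pValue: Double)

      // Rank features by ascending p-value rather than by descending chi-square
      // statistic, since statistics with different degrees of freedom are not comparable.
      def selectKBest(results: Seq[FeatureTestResult], k: Int): Seq[Int] =
        results.sortBy(_.pValue).take(k).map(_.featureIndex)
      ```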
      
      ## How was this patch tested?
      change existing test
      
      Author: Peng <peng.meng@intel.com>
      
      Closes #15444 from mpjlu/chisqure-bug.
      c8b612de
    • Zheng RuiFeng's avatar
      [SPARK-14634][ML] Add BisectingKMeansSummary · a1b136d0
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Add BisectingKMeansSummary
      
      ## How was this patch tested?
      unit test
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #12394 from zhengruifeng/biKMSummary.
      a1b136d0
    • Yanbo Liang's avatar
      [SPARK-15402][ML][PYSPARK] PySpark ml.evaluation should support save/load · 1db8feab
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Since ```ml.evaluation``` already supports save/load on the Scala side, supporting it on the Python side is straightforward.
      
      ## How was this patch tested?
      Add python doctest.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #13194 from yanboliang/spark-15402.
      1db8feab
    • Wenchen Fan's avatar
      [SPARK-17903][SQL] MetastoreRelation should talk to external catalog instead of hive client · 2fb12b0a
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      `HiveExternalCatalog` should be the only interface for talking to the Hive metastore. In `MetastoreRelation` we can just use `ExternalCatalog` instead of `HiveClient` to interact with the metastore, and add the missing APIs to `ExternalCatalog`.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15460 from cloud-fan/relation.
      2fb12b0a
    • Reynold Xin's avatar
      [SPARK-17925][SQL] Break fileSourceInterfaces.scala into multiple pieces · 6c29b3de
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch does a few changes to the file structure of data sources:
      
      - Break fileSourceInterfaces.scala into multiple pieces (HadoopFsRelation, FileFormat, OutputWriter)
      - Move ParquetOutputWriter into its own file
      
      I created this as a separate patch so it'd be easier to review my future PRs that focus on refactoring this internal logic. This patch only moves code around, and has no logic changes.
      
      ## How was this patch tested?
      N/A - should be covered by existing tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15473 from rxin/SPARK-17925.
      6c29b3de