  1. Oct 21, 2016
    • Jagadeesan's avatar
      [SPARK-17960][PYSPARK][UPGRADE TO PY4J 0.10.4] · 595893d3
      Jagadeesan authored
      ## What changes were proposed in this pull request?
      
      1) Upgrade the Py4J version on the Java side
      2) Update the py4j src zip file we bundle with Spark
      
      ## How was this patch tested?
      
      Existing doctests & unit tests pass
      
      Author: Jagadeesan <as2@us.ibm.com>
      
      Closes #15514 from jagadeesanas2/SPARK-17960.
      595893d3
  2. Oct 20, 2016
    • Wenchen Fan's avatar
      [SPARK-18029][SQL] PruneFileSourcePartitions should not change the output of LogicalRelation · 57e97fcb
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      In `PruneFileSourcePartitions`, we will replace the `LogicalRelation` with a pruned one. However, this replacement may change the output of the `LogicalRelation` if it doesn't have `expectedOutputAttributes`. This PR fixes it.
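
      A minimal sketch of the idea (an illustrative model, not the actual Catalyst code): whatever replaces the `LogicalRelation` must expose the same output attributes, otherwise references elsewhere in the plan no longer resolve.

      ```scala
      // Illustrative model only -- not the Catalyst implementation.
      case class Attr(name: String, id: Long)
      case class Rel(files: Seq[String], output: Seq[Attr])

      // Buggy replacement: rebuilding the relation mints fresh attribute ids, so
      // parent operators that referenced the old ids can no longer resolve them.
      def pruneBuggy(r: Rel, kept: Seq[String], freshId: () => Long): Rel =
        Rel(kept, r.output.map(a => a.copy(id = freshId())))

      // Fixed replacement: keep the expected output attributes of the original node.
      def pruneFixed(r: Rel, kept: Seq[String]): Rel =
        Rel(kept, r.output)
      ```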
      
      ## How was this patch tested?
      
      the new `PruneFileSourcePartitionsSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15569 from cloud-fan/partition-bug.
      57e97fcb
    • Felix Cheung's avatar
      [SPARKR] fix warnings · 3180272d
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Fix for a bunch of test warnings that were added recently.
      We need to investigate why warnings are not turning into errors.
      
      ```
      Warnings -----------------------------------------------------------------------
      1. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Sepal_Length instead of Sepal.Length  as column name
      
      2. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Sepal_Width instead of Sepal.Width  as column name
      
      3. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Petal_Length instead of Petal.Length  as column name
      
      4. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Petal_Width instead of Petal.Width  as column name
      
      Consider adding
        importFrom("utils", "object.size")
      to your NAMESPACE file.
      ```
      
      ## How was this patch tested?
      
      unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #15560 from felixcheung/rwarnings.
      3180272d
    • Shixiong Zhu's avatar
      [SPARK-18030][TESTS] Adds more checks to collect more info about FileStreamSourceSuite failure · 1bb99c48
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      My hunch is `mkdirs` fails. Just add more checks to collect more info.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15577 from zsxwing/SPARK-18030-debug.
      1bb99c48
    • Mark Grover's avatar
      [DOCS] Update docs to not suggest to package Spark before running tests. · 2d14ab7e
      Mark Grover authored
      ## What changes were proposed in this pull request?
      
      Update docs to not suggest to package Spark before running tests.
      
      ## How was this patch tested?
      
      Not creating a JIRA since this is pretty small. We haven't needed to run mvn package before mvn test since at least 1.6, or so I am told. So, updating the docs so they are not misleading.
      
      Author: Mark Grover <mark@apache.org>
      
      Closes #15572 from markgrover/doc_update.
      2d14ab7e
    • Reynold Xin's avatar
      [SPARK-18021][SQL] Refactor file name specification for data sources · 7f9ec19e
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      Currently each data source OutputWriter is responsible for specifying the entire file name for each file output. This, however, does not make any sense because we rely on file naming schemes for certain behaviors in Spark SQL, e.g. bucket id. The current approach allows individual data sources to break the implementation of bucketing.
      
      On the flip side, we also don't want to move file naming entirely out of data sources, because different data sources do want to specify different extensions.
      
      This patch divides file name specification into two parts: the first part is a prefix specified by the caller of OutputWriter (in WriteOutput), and the second part is the suffix that can be specified by the OutputWriter itself. Note that a side effect of this change is that now all file based data sources also support bucketing automatically.
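
      A rough sketch of the resulting naming contract (identifiers and the prefix format are illustrative, not the actual internals):

      ```scala
      // Illustrative only: the caller of OutputWriter supplies the prefix (which
      // encodes things Spark relies on, such as the bucket id), and the
      // OutputWriter only contributes its format-specific extension.
      def outputFileName(callerPrefix: String, writerExtension: String): String =
        callerPrefix + writerExtension

      // e.g. a prefix from the write path plus ".snappy.parquet" from the Parquet writer
      val fileName = outputFileName("part-00003-attempt-0_00002", ".snappy.parquet")
      ```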
      
      There are also some other minor cleanups:
      
      - Removed the UUID passed through generic Configuration string
      - Some minor rewrites for better clarity
      - Renamed "path" in multiple places to "stagingDir", to more accurately reflect its meaning
      
      ## How was this patch tested?
      This should be covered by existing data source tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15562 from rxin/SPARK-18021.
      7f9ec19e
    • jerryshao's avatar
      [SPARK-17999][KAFKA][SQL] Add getPreferredLocations for KafkaSourceRDD · 947f4f25
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      The newly implemented Structured Streaming `KafkaSource` did calculate the preferred locations for each topic partition, but didn't expose this information through the RDD's `getPreferredLocations` method. This PR proposes to add that method to `KafkaSourceRDD`.
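
      For context, `getPreferredLocations` is a standard RDD hook; a minimal sketch of overriding it in a custom RDD (simplified, not the actual `KafkaSourceRDD` code):

      ```scala
      import org.apache.spark.{Partition, SparkContext, TaskContext}
      import org.apache.spark.rdd.RDD

      // Simplified sketch: an RDD that reports a preferred host per partition.
      final case class SimplePartition(index: Int) extends Partition

      class LocationAwareRDD(sc: SparkContext, hosts: Seq[String])
        extends RDD[Int](sc, Nil) {

        override protected def getPartitions: Array[Partition] =
          hosts.indices.map(i => SimplePartition(i): Partition).toArray

        override def compute(split: Partition, context: TaskContext): Iterator[Int] =
          Iterator(split.index)

        // The hook this PR wires up for KafkaSourceRDD: tell the scheduler where
        // each partition would prefer to run.
        override protected def getPreferredLocations(split: Partition): Seq[String] =
          Seq(hosts(split.index))
      }
      ```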
      
      ## How was this patch tested?
      
      Manual verification.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #15545 from jerryshao/SPARK-17999.
      947f4f25
    • Koert Kuipers's avatar
      [SPARK-15780][SQL] Support mapValues on KeyValueGroupedDataset · 84b245f2
      Koert Kuipers authored
      ## What changes were proposed in this pull request?
      
      Add mapValues to KeyValueGroupedDataset
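
      A small usage sketch of the new method (assuming an active SparkSession named `spark`):

      ```scala
      import spark.implicits._

      val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()

      val sums = ds
        .groupByKey(_._1)      // KeyValueGroupedDataset[String, (String, Int)]
        .mapValues(_._2)       // new in this PR: KeyValueGroupedDataset[String, Int]
        .reduceGroups(_ + _)   // Dataset[(String, Int)]

      sums.show()
      ```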
      
      ## How was this patch tested?
      
      New test in DatasetSuite for groupBy function, mapValues, flatMap
      
      Author: Koert Kuipers <koert@tresata.com>
      
      Closes #13526 from koertkuipers/feat-keyvaluegroupeddataset-mapvalues.
      84b245f2
    • Tejas Patil's avatar
      [SPARK-17698][SQL] Join predicates should not contain filter clauses · fb0894b3
      Tejas Patil authored
      ## What changes were proposed in this pull request?
      
      Jira : https://issues.apache.org/jira/browse/SPARK-17698
      
      `ExtractEquiJoinKeys` is incorrectly using filter predicates as the join condition for joins. `canEvaluate` [0] tries to see if an `Expression` can be evaluated using the output of a given `Plan`. In the case of filter predicates (e.g. `a.id='1'`), the `Expression` passed for the right hand side (i.e. '1') is a `Literal`, which does not have any attribute references. Thus `expr.references` is an empty set, which is trivially a subset of any set. This leads to `canEvaluate` returning `true`, and `a.id='1'` is treated as a join predicate. While this does not lead to incorrect results, for bucketed + sorted tables we might miss out on avoiding an unnecessary shuffle + sort. See the example below:
      
      [0] : https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala#L91
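
      The subset check behind this can be illustrated with plain Scala sets (illustrative only, not the Catalyst code):

      ```scala
      // A Literal such as '1' carries no attribute references.
      val literalReferences = Set.empty[String]
      // Output attributes of one side of the join.
      val planOutput = Set("b.id")

      // The empty set is a subset of every set, so a filter-only predicate like
      // a.id='1' is accepted as evaluable against either side and ends up in the
      // join condition.
      literalReferences.subsetOf(planOutput)   // true
      ```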
      
      eg.
      
      ```
      val df = (1 until 10).toDF("id").coalesce(1)
      hc.sql("DROP TABLE IF EXISTS table1").collect
      df.write.bucketBy(8, "id").sortBy("id").saveAsTable("table1")
      hc.sql("DROP TABLE IF EXISTS table2").collect
      df.write.bucketBy(8, "id").sortBy("id").saveAsTable("table2")
      
      sqlContext.sql("""
        SELECT a.id, b.id
        FROM table1 a
        FULL OUTER JOIN table2 b
        ON a.id = b.id AND a.id='1' AND b.id='1'
      """).explain(true)
      ```
      
      BEFORE: This is doing shuffle + sort over table scan outputs which is not needed as both tables are bucketed and sorted on the same columns and have same number of buckets. This should be a single stage job.
      
      ```
      SortMergeJoin [id#38, cast(id#38 as double), 1.0], [id#39, 1.0, cast(id#39 as double)], FullOuter
      :- *Sort [id#38 ASC NULLS FIRST, cast(id#38 as double) ASC NULLS FIRST, 1.0 ASC NULLS FIRST], false, 0
      :  +- Exchange hashpartitioning(id#38, cast(id#38 as double), 1.0, 200)
      :     +- *FileScan parquet default.table1[id#38] Batched: true, Format: ParquetFormat, InputPaths: file:spark-warehouse/table1, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
      +- *Sort [id#39 ASC NULLS FIRST, 1.0 ASC NULLS FIRST, cast(id#39 as double) ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(id#39, 1.0, cast(id#39 as double), 200)
            +- *FileScan parquet default.table2[id#39] Batched: true, Format: ParquetFormat, InputPaths: file:spark-warehouse/table2, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
      ```
      
      AFTER :
      
      ```
      SortMergeJoin [id#32], [id#33], FullOuter, ((cast(id#32 as double) = 1.0) && (cast(id#33 as double) = 1.0))
      :- *FileScan parquet default.table1[id#32] Batched: true, Format: ParquetFormat, InputPaths: file:spark-warehouse/table1, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
      +- *FileScan parquet default.table2[id#33] Batched: true, Format: ParquetFormat, InputPaths: file:spark-warehouse/table2, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
      ```
      
      ## How was this patch tested?
      
      - Added a new test case for this scenario : `SPARK-17698 Join predicates should not contain filter clauses`
      - Ran all the tests in `BucketedReadSuite`
      
      Author: Tejas Patil <tejasp@fb.com>
      
      Closes #15272 from tejasapatil/SPARK-17698_join_predicate_filter_clause.
      fb0894b3
    • Dilip Biswal's avatar
      [SPARK-17860][SQL] SHOW COLUMN's database conflict check should respect case... · e895bc25
      Dilip Biswal authored
      [SPARK-17860][SQL] SHOW COLUMN's database conflict check should respect case sensitivity configuration
      
      ## What changes were proposed in this pull request?
      The SHOW COLUMNS command validates the user-supplied database
      name against the database name from the qualified table name to
      make sure the two are consistent. This comparison should respect
      the case sensitivity configuration.
      
      ## How was this patch tested?
      Added tests in DDLSuite and existing tests were moved to use new sql based test infrastructure.
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #15423 from dilipbiswal/dkb_show_column_fix.
      e895bc25
    • Dongjoon Hyun's avatar
      [SPARK-17796][SQL] Support wildcard character in filename for LOAD DATA LOCAL INPATH · 986a3b8b
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, Spark 2.0 raises an `input path does not exist` AnalysisException if the file name contains '*'. This is misleading since it occurs even when matching files exist. Also, this was a supported feature in Spark 1.6.2. This PR aims to support wildcard characters in filenames for the `LOAD DATA LOCAL INPATH` SQL command, as Spark 1.6.2 did.
      
      **Reported Error Scenario**
      ```scala
      scala> sql("CREATE TABLE t(a string)")
      res0: org.apache.spark.sql.DataFrame = []
      
      scala> sql("LOAD DATA LOCAL INPATH '/tmp/x*' INTO TABLE t")
      org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: /tmp/x*;
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins test with a new test case.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15376 from dongjoon-hyun/SPARK-17796.
      986a3b8b
    • Mike Ihbe's avatar
      [SPARK-11653][DEPLOY] Allow spark-daemon.sh to run in the foreground · c2c107ab
      Mike Ihbe authored
      ## What changes were proposed in this pull request?
      
      Add a SPARK_NO_DAEMONIZE environment variable flag to spark-daemon.sh that causes the process it launches to run in the foreground.
      
      It looks like there has been some prior work in https://github.com/apache/spark/pull/3881, but there was some talk about these being refactored. I'm not sure if that happened or not, but that PR is almost 2 years old at this point so it was worth revisiting.
      
      ## How was this patch tested?
      
      ./dev/run-tests still seems to work. It doesn't look like these scripts have tests, but if I missed them just let me know.
      
      Author: Mike Ihbe <mikejihbe@gmail.com>
      
      Closes #15338 from mikejihbe/SPARK-11653.
      c2c107ab
    • Eric Liang's avatar
      [SPARK-17991][SQL] Enable metastore partition pruning by default. · 4bd17c46
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      This should apply to non-converted metastore relations. WIP to see if this causes any test failures.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #15475 from ericl/try-enabling-pruning.
      4bd17c46
    • WeichenXu's avatar
      [SPARK-18003][SPARK CORE] Fix bug of RDD zipWithIndex & zipWithUniqueId index value overflowing · 39755169
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      - Fix a bug where RDD `zipWithIndex` generates wrong results when one partition contains more than 2147483647 records.
      
      - Fix a bug where RDD `zipWithUniqueId` generates wrong results when one partition contains more than 2147483647 records (see the overflow sketch below).
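
      The root cause is plain Int index arithmetic; a sketch of the failure mode and of a Long-based alternative (not the exact helper used by the fix):

      ```scala
      // Scala's Iterator.zipWithIndex produces Int indices, which wrap around once a
      // single partition holds more than Int.MaxValue (2147483647) records:
      val wrapped: Int = (Int.MaxValue.toLong + 1L).toInt   // -2147483648

      // Long-based sketch: pair each element with a Long index starting at an offset.
      def zipWithLongIndex[T](iter: Iterator[T], startIndex: Long): Iterator[(T, Long)] = {
        var i = startIndex - 1L
        iter.map { t => i += 1L; (t, i) }
      }
      ```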
      
      ## How was this patch tested?
      
      test added.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #15550 from WeichenXu123/fix_rdd_zipWithIndex_overflow.
      39755169
    • Reynold Xin's avatar
      [SPARK-18012][SQL] Simplify WriterContainer · f313117b
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch refactors WriterContainer to simplify the logic and make the control flow more obvious. The previous code setup made it pretty difficult to track the actual dependencies on variables and setups, because the driver side and the executor side were using the same set of variables.
      
      ## How was this patch tested?
      N/A - this should be covered by existing tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15551 from rxin/writercontainer-refactor.
      f313117b
  3. Oct 19, 2016
    • hyukjinkwon's avatar
      [SPARK-17989][SQL] Check ascendingOrder type in sort_array function rather... · 4b2011ec
      hyukjinkwon authored
      [SPARK-17989][SQL] Check ascendingOrder type in sort_array function rather than throwing ClassCastException
      
      ## What changes were proposed in this pull request?
      
      This PR proposes to check the type of the second argument, `ascendingOrder`, rather than throwing a `ClassCastException` with a confusing message.
      
      ```sql
      select sort_array(array('b', 'd'), '1');
      ```
      
      **Before**
      
      ```
      16/10/19 13:16:08 ERROR SparkSQLDriver: Failed in [select sort_array(array('b', 'd'), '1')]
      java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Boolean
      	at scala.runtime.BoxesRunTime.unboxToBoolean(BoxesRunTime.java:85)
      	at org.apache.spark.sql.catalyst.expressions.SortArray.nullSafeEval(collectionOperations.scala:185)
      	at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:416)
      	at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1$$anonfun$applyOrElse$1.applyOrElse(expressions.scala:50)
      	at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1$$anonfun$applyOrElse$1.applyOrElse(expressions.scala:43)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
      	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:297)
      ```
      
      **After**
      
      ```
      Error in query: cannot resolve 'sort_array(array('b', 'd'), '1')' due to data type mismatch: Sort order in second argument requires a boolean literal.; line 1 pos 7;
      ```
      
      ## How was this patch tested?
      
      Unit test in `DataFrameFunctionsSuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15532 from HyukjinKwon/SPARK-17989.
      4b2011ec
    • Alex Bozarth's avatar
      [SPARK-10541][WEB UI] Allow ApplicationHistoryProviders to provide their own... · 444c2d22
      Alex Bozarth authored
      [SPARK-10541][WEB UI] Allow ApplicationHistoryProviders to provide their own text when there aren't any complete apps
      
      ## What changes were proposed in this pull request?
      
      I've added a method to `ApplicationHistoryProvider` that returns the HTML paragraph to display when there are no applications. This allows providers other than `FsHistoryProvider` to determine what is printed. The current hard-coded text is now moved into `FsHistoryProvider`, since it previously assumed that was the provider being used.
      
      I chose to make the function return HTML rather than plain text because the current text block had inline HTML in it, and this gives a new implementation of `ApplicationHistoryProvider` more versatility. I did not see any security issues with this, since injecting HTML here requires implementing `ApplicationHistoryProvider` and can't be done outside of code.
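
      A rough sketch of the shape of such a hook (class and method names here are illustrative, not the exact ones added):

      ```scala
      import scala.xml.{Node, Text}

      // Illustrative only.
      abstract class HistoryProviderSketch {
        // What to render when no completed applications are available.
        def emptyListingHtml(): Seq[Node] = Seq.empty
      }

      class FsProviderSketch extends HistoryProviderSketch {
        override def emptyListingHtml(): Seq[Node] =
          Seq(Text("No completed applications found! Did you specify the correct logging directory?"))
      }
      ```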
      
      ## How was this patch tested?
      
      Manual testing and dev/run-tests
      
      No visible changes to the UI
      
      Author: Alex Bozarth <ajbozart@us.ibm.com>
      
      Closes #15490 from ajbozarth/spark10541.
      444c2d22
    • Takuya UESHIN's avatar
      [SPARK-17985][CORE] Bump commons-lang3 version to 3.5. · 9540357a
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      `SerializationUtils.clone()` of commons-lang3 (< 3.5) has a bug that breaks thread safety: it sometimes gets stuck due to a race condition when initializing a hash map.
      See https://issues.apache.org/jira/browse/LANG-1251.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #15548 from ueshin/issues/SPARK-17985.
      9540357a
  4. Oct 18, 2016
    • Tommy YU's avatar
      [SPARK-18001][DOCUMENT] fix broke link to SparkDataFrame · f39852e5
      Tommy YU authored
      ## What changes were proposed in this pull request?
      
      In http://spark.apache.org/docs/latest/sql-programming-guide.html, section "Untyped Dataset Operations (aka DataFrame Operations)", the link to the R DataFrame API is broken and returns:
      
      The requested URL /docs/latest/api/R/DataFrame.html was not found on this server.
      
      The correct link for Spark 2.0 is SparkDataFrame.html.
      
      ## How was this patch tested?
      
      Manual checked.
      
      Author: Tommy YU <tummyyu@163.com>
      
      Closes #15543 from Wenpei/spark-18001.
      f39852e5
    • Wenchen Fan's avatar
      [SPARK-17873][SQL] ALTER TABLE RENAME TO should allow users to specify... · 4329c5ce
      Wenchen Fan authored
      [SPARK-17873][SQL] ALTER TABLE RENAME TO should allow users to specify database in destination table name(but have to be same as source table)
      
      ## What changes were proposed in this pull request?
      
      Unlike Hive, in Spark SQL ALTER TABLE RENAME TO cannot move a table from one database to another (e.g. `ALTER TABLE db1.tbl RENAME TO db2.tbl2`), and will report an error if the databases of the source table and destination table differ. So in #14955, we forbade users from specifying the database of the destination table in ALTER TABLE RENAME TO, to be consistent with other database systems and also to make it easier to rename tables in a non-current database, e.g. users can write `ALTER TABLE db1.tbl RENAME TO tbl2` instead of `ALTER TABLE db1.tbl RENAME TO db1.tbl2`.
      
      However, this is a breaking change. Users may already have queries that specify the database of the destination table in ALTER TABLE RENAME TO.
      
      This PR reverts most of #14955 and simplifies the usage of ALTER TABLE RENAME TO by making the database of the source table the default database of the destination table, instead of the current database, so that users can still write `ALTER TABLE db1.tbl RENAME TO tbl2`, which is consistent with other databases like MySQL, Postgres, etc.
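
      Concretely, the behavior is as below (a sketch, assuming `spark` is a SparkSession; database and table names are illustrative):

      ```scala
      spark.sql("CREATE DATABASE IF NOT EXISTS db1")
      spark.sql("CREATE TABLE db1.tbl (id INT) USING parquet")

      // The destination defaults to the source table's database (db1), not the current one:
      spark.sql("ALTER TABLE db1.tbl RENAME TO tbl2")        // becomes db1.tbl2
      spark.sql("ALTER TABLE db1.tbl2 RENAME TO db1.tbl3")   // same database, explicitly: allowed

      // Moving a table across databases is still not supported:
      // spark.sql("ALTER TABLE db1.tbl3 RENAME TO db2.tbl3") // error
      ```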
      
      ## How was this patch tested?
      
      The added back tests and some new tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15434 from cloud-fan/revert.
      4329c5ce
    • Yu Peng's avatar
      [SPARK-17711][TEST-HADOOP2.2] Fix hadoop2.2 compilation error · 2629cd74
      Yu Peng authored
      ## What changes were proposed in this pull request?
      
      Fix hadoop2.2 compilation error.
      
      ## How was this patch tested?
      
      Existing tests.
      
      cc tdas zsxwing
      
      Author: Yu Peng <loneknightpy@gmail.com>
      
      Closes #15537 from loneknightpy/fix-17711.
      2629cd74
    • Eric Liang's avatar
      [SPARK-17980][SQL] Fix refreshByPath for converted Hive tables · 5f20ae03
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      There was a bug introduced in https://github.com/apache/spark/pull/14690 which broke refreshByPath with converted Hive tables (though, it turns out it was very difficult to refresh converted Hive tables anyway, since you had to specify the exact path of one of the partitions).
      
      This changes refreshByPath to invalidate by prefix instead of exact match, and fixes the issue.
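
      For reference, the user-facing entry point is the catalog API; with prefix matching, something like the following should now also refresh a converted Hive table given its root path (paths are illustrative):

      ```scala
      // Refreshing the table's root path now invalidates cached entries for files
      // under its partition directories as well.
      spark.catalog.refreshByPath("/warehouse/events")
      // Before the fix, an exact partition path was effectively required, e.g.:
      // spark.catalog.refreshByPath("/warehouse/events/ds=2016-10-18")
      ```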
      
      cc sameeragarwal for refreshByPath changes
      mallman
      
      ## How was this patch tested?
      
      Extended unit test.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #15521 from ericl/fix-caching.
      5f20ae03
    • Tathagata Das's avatar
      [SPARK-17731][SQL][STREAMING][FOLLOWUP] Refactored StreamingQueryListener APIs · 941b3f9a
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      As per rxin request, here are further API changes
      - Changed `Stream(Started/Progress/Terminated)` events to `Stream*Event`
      - Changed the fields in `StreamingQueryListener.on***` from `query*` to `event`
      
      ## How was this patch tested?
      Existing unit tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #15530 from tdas/SPARK-17731-1.
      941b3f9a
    • Liang-Chi Hsieh's avatar
      [SPARK-17817] [PYSPARK] [FOLLOWUP] PySpark RDD Repartitioning Results in... · 1e35e969
      Liang-Chi Hsieh authored
      [SPARK-17817] [PYSPARK] [FOLLOWUP] PySpark RDD Repartitioning Results in Highly Skewed Partition Sizes
      
      ## What changes were proposed in this pull request?
      
      This change is a follow-up for #15389, which calls `_to_java_object_rdd()` to solve this issue. Given concerns about the potentially expensive cost of that call, we can instead decrease the batch size to solve this issue.
      
      Simple benchmark:
      
          import time
          num_partitions = 20000
          a = sc.parallelize(range(int(1e6)), 2)
          start = time.time()
          l = a.repartition(num_partitions).glom().map(len).collect()
          end = time.time()
          print(end - start)
      
      Before: 419.447577953
      _to_java_object_rdd(): 421.916361094
      decreasing the batch size: 423.712255955
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #15445 from viirya/repartition-batch-size.
      1e35e969
    • cody koeninger's avatar
      [SPARK-17841][STREAMING][KAFKA] drain commitQueue · cd106b05
      cody koeninger authored
      ## What changes were proposed in this pull request?
      
      Actually drain the commit queue rather than just iterating over it.
      iterator() on a concurrent linked queue won't remove items from the queue, but poll() will.
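
      The difference in a standalone sketch (plain java.util.concurrent, not the Kafka connector code itself):

      ```scala
      import java.util.concurrent.ConcurrentLinkedQueue
      import scala.collection.mutable.ArrayBuffer

      val queue = new ConcurrentLinkedQueue[String]()
      queue.add("offset-range-1")
      queue.add("offset-range-2")

      // iterator() only reads; both elements are still queued afterwards.
      val viaIterator = new ArrayBuffer[String]()
      val it = queue.iterator()
      while (it.hasNext) viaIterator += it.next()   // queue.size() is still 2

      // poll() actually drains: each call removes the head, returning null when empty.
      val drained = new ArrayBuffer[String]()
      var elem = queue.poll()
      while (elem != null) {
        drained += elem
        elem = queue.poll()
      }                                             // queue.size() is now 0
      ```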
      
      ## How was this patch tested?
      Unit tests
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #15407 from koeninger/SPARK-17841.
      cd106b05
    • Reynold Xin's avatar
      Revert "[SPARK-17985][CORE] Bump commons-lang3 version to 3.5." · cd662bc7
      Reynold Xin authored
      This reverts commit bfe7885a.
      
      The commit caused build failures on Hadoop 2.2 profile:
      
      ```
      [error] /scratch/rxin/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:1489: value read is not a member of object org.apache.commons.io.IOUtils
      [error]       var numBytes = IOUtils.read(gzInputStream, buf)
      [error]                              ^
      [error] /scratch/rxin/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:1492: value read is not a member of object org.apache.commons.io.IOUtils
      [error]         numBytes = IOUtils.read(gzInputStream, buf)
      [error]                            ^
      ```
      cd662bc7
    • hyukjinkwon's avatar
      [SPARK-17955][SQL] Make DataFrameReader.jdbc call DataFrameReader.format("jdbc").load · b3130c7b
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to make `DataFrameReader.jdbc` call `DataFrameReader.format("jdbc").load` consistently with other APIs in `DataFrameReader`/`DataFrameWriter` and avoid calling `sparkSession.baseRelationToDataFrame(..)` here and there.
      
      The changes were mostly copied from `DataFrameWriter.jdbc()` which was recently updated.
      
      ```diff
      -    val params = extraOptions.toMap ++ connectionProperties.asScala.toMap
      -    val options = new JDBCOptions(url, table, params)
      -    val relation = JDBCRelation(parts, options)(sparkSession)
      -    sparkSession.baseRelationToDataFrame(relation)
      +    this.extraOptions = this.extraOptions ++ connectionProperties.asScala
      +    // explicit url and dbtable should override all
      +    this.extraOptions += ("url" -> url, "dbtable" -> table)
      +    format("jdbc").load()
      ```
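
      In user-facing terms, both forms below now go through the same code path (a sketch, assuming `spark` is a SparkSession; connection details are placeholders):

      ```scala
      val url = "jdbc:postgresql://localhost:5432/testdb"

      // The convenience API...
      val df1 = spark.read.jdbc(url, "people", new java.util.Properties())

      // ...is now a thin wrapper over the generic source API:
      val df2 = spark.read
        .format("jdbc")
        .option("url", url)
        .option("dbtable", "people")
        .load()
      ```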
      
      ## How was this patch tested?
      
      Existing tests should cover this.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15499 from HyukjinKwon/SPARK-17955.
      b3130c7b
    • Guoqiang Li's avatar
      [SPARK-17930][CORE] The SerializerInstance instance used when deserializing a... · 4518642a
      Guoqiang Li authored
      [SPARK-17930][CORE] The SerializerInstance instance used when deserializing a TaskResult is not reused
      
      ## What changes were proposed in this pull request?
      The following code is called when the DirectTaskResult instance is deserialized
      
      ```scala
      
        def value(): T = {
          if (valueObjectDeserialized) {
            valueObject
          } else {
            // Each deserialization creates a new instance of SerializerInstance, which is very time-consuming
            val resultSer = SparkEnv.get.serializer.newInstance()
            valueObject = resultSer.deserialize(valueBytes)
            valueObjectDeserialized = true
            valueObject
          }
        }
      
      ```
      
      When a stage has a lot of tasks, reusing the SerializerInstance can improve scheduling performance by about three times.
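
      A sketch of the kind of reuse meant here (simplified, not the exact change to the task-result path; a per-thread instance is one safe way to reuse, since SerializerInstance is not guaranteed to be thread-safe):

      ```scala
      import org.apache.spark.SparkEnv
      import org.apache.spark.serializer.SerializerInstance

      // Sketch only: create the instance once per thread and reuse it.
      object ReusableResultSerializer {
        private val instance = new ThreadLocal[SerializerInstance] {
          override def initialValue(): SerializerInstance =
            SparkEnv.get.serializer.newInstance()
        }
        def get: SerializerInstance = instance.get()
      }
      ```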
      
      The test data is TPC-DS 2T (Parquet), and the SQL statement is as follows (query 2):
      
      ```sql
      
      select  i_item_id,
              avg(ss_quantity) agg1,
              avg(ss_list_price) agg2,
              avg(ss_coupon_amt) agg3,
              avg(ss_sales_price) agg4
       from store_sales, customer_demographics, date_dim, item, promotion
       where ss_sold_date_sk = d_date_sk and
             ss_item_sk = i_item_sk and
             ss_cdemo_sk = cd_demo_sk and
             ss_promo_sk = p_promo_sk and
             cd_gender = 'M' and
             cd_marital_status = 'M' and
             cd_education_status = '4 yr Degree' and
             (p_channel_email = 'N' or p_channel_event = 'N') and
             d_year = 2001
       group by i_item_id
       order by i_item_id
       limit 100;
      
      ```
      
      `spark-defaults.conf` file:
      
      ```
      spark.master                           yarn-client
      spark.executor.instances               20
      spark.driver.memory                    16g
      spark.executor.memory                  30g
      spark.executor.cores                   5
      spark.default.parallelism              100
      spark.sql.shuffle.partitions           100000
      spark.serializer                       org.apache.spark.serializer.KryoSerializer
      spark.driver.maxResultSize              0
      spark.rpc.netty.dispatcher.numThreads   8
      spark.executor.extraJavaOptions          -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1HeapRegionSize=16M -XX:MetaspaceSize=256M
      spark.cleaner.referenceTracking.blocking true
      spark.cleaner.referenceTracking.blocking.shuffle true
      
      ```
      
      Performance test results are as follows
      
      [SPARK-17930](https://github.com/witgo/spark/tree/SPARK-17930)| [ed146334](https://github.com/witgo/spark/commit/ed1463341455830b8867b721a1b34f291139baf3)
      ------------ | -------------
      54.5 s|231.7 s
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Guoqiang Li <witgo@qq.com>
      
      Closes #15512 from witgo/SPARK-17930.
      4518642a
    • Weiqing Yang's avatar
      [MINOR][DOC] Add more built-in sources in sql-programming-guide.md · 20dd1109
      Weiqing Yang authored
      ## What changes were proposed in this pull request?
      Add more built-in sources in sql-programming-guide.md.
      
      ## How was this patch tested?
      Manually.
      
      Author: Weiqing Yang <yangweiqing001@gmail.com>
      
      Closes #15522 from weiqingy/dsDoc.
      20dd1109
    • Takuya UESHIN's avatar
      [SPARK-17985][CORE] Bump commons-lang3 version to 3.5. · bfe7885a
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      `SerializationUtils.clone()` of commons-lang3 (< 3.5) has a bug that breaks thread safety: it sometimes gets stuck due to a race condition when initializing a hash map.
      See https://issues.apache.org/jira/browse/LANG-1251.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #15525 from ueshin/issues/SPARK-17985.
      bfe7885a
    • Eric Liang's avatar
      [SPARK-17974] try 2) Refactor FileCatalog classes to simplify the inheritance tree · 4ef39c2f
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      This renames `BasicFileCatalog => FileCatalog`, combines  `SessionFileCatalog` with `PartitioningAwareFileCatalog`, and removes the old `FileCatalog` trait.
      
      In summary,
      ```
      MetadataLogFileCatalog extends PartitioningAwareFileCatalog
      ListingFileCatalog extends PartitioningAwareFileCatalog
      PartitioningAwareFileCatalog extends FileCatalog
      TableFileCatalog extends FileCatalog
      ```
      
      (note that this is a re-submission of https://github.com/apache/spark/pull/15518 which got reverted)
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #15533 from ericl/fix-scalastyle-revert.
      4ef39c2f
    • Yu Peng's avatar
      [SPARK-17711] Compress rolled executor log · 231f39e3
      Yu Peng authored
      ## What changes were proposed in this pull request?
      
      This PR adds support for executor log compression.
      
      ## How was this patch tested?
      
      Unit tests
      
      cc: yhuai tdas mengxr
      
      Author: Yu Peng <loneknightpy@gmail.com>
      
      Closes #15285 from loneknightpy/compress-executor-log.
      231f39e3
    • hyukjinkwon's avatar
      [SPARK-17388] [SQL] Support for inferring type date/timestamp/decimal for partition column · 37686539
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Currently, Spark only supports to infer `IntegerType`, `LongType`, `DoubleType` and `StringType`.
      
      `DecimalType` is being tried, but it seems the type is never inferred as `DecimalType` because `DoubleType` is tried first. Also, it seems `DateType` and `TimestampType` could be inferred.
      
      As far as I know, it is pretty common to use both for a partition column.
      
      This PR fixes the incorrect `DecimalType` try and also adds the support for both `DateType` and `TimestampType` for inferring partition column type.
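
      As an example of what this enables (a sketch, assuming `spark` is a SparkSession and the output path is writable; paths are illustrative):

      ```scala
      import java.sql.Date
      import spark.implicits._

      val df = Seq((1, Date.valueOf("2016-10-18")), (2, Date.valueOf("2016-10-19")))
        .toDF("id", "event_date")

      df.write.partitionBy("event_date").parquet("/tmp/events")

      // On disk: /tmp/events/event_date=2016-10-18/..., event_date=2016-10-19/...
      // With this change, the partition column is read back as DateType instead of
      // falling back to StringType.
      spark.read.parquet("/tmp/events").printSchema()
      ```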
      
      ## How was this patch tested?
      
      Unit tests in `ParquetPartitionDiscoverySuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #14947 from HyukjinKwon/SPARK-17388.
      37686539
    • Wenchen Fan's avatar
      [SPARK-17899][SQL][FOLLOW-UP] debug mode should work for corrupted table · e59df62e
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Debug mode should work for corrupted tables, so that we can actually debug them.
      
      ## How was this patch tested?
      
      new test in `MetastoreDataSourcesSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15528 from cloud-fan/debug.
      e59df62e
    • Tathagata Das's avatar
      [SQL][STREAMING][TEST] Follow up to remove Option.contains for Scala 2.10 compatibility · a9e79a41
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Scala 2.10 does not have Option.contains, which broke the Scala 2.10 build.
      
      ## How was this patch tested?
      Locally compiled and ran sql/core unit tests in 2.10
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #15531 from tdas/metrics-flaky-test-fix-1.
      a9e79a41
    • Liwei Lin's avatar
      [SQL][STREAMING][TEST] Fix flaky tests in StreamingQueryListenerSuite · 7d878cf2
      Liwei Lin authored
      This work has largely been done by lw-lin in his PR #15497. This is a slight refactoring of it.
      
      ## What changes were proposed in this pull request?
      There were two sources of flakiness in StreamingQueryListener test.
      
      - When testing with manual clock, consecutive attempts to advance the clock can occur without the stream execution thread being unblocked and doing some work between the two attempts. Hence the following can happen with the current ManualClock.
      ```
      +-----------------------------------+--------------------------------+
      |      StreamExecution thread       |         testing thread         |
      +-----------------------------------+--------------------------------+
      |  ManualClock.waitTillTime(100) {  |                                |
      |        _isWaiting = true          |                                |
      |            wait(10)               |                                |
      |        still in wait(10)          |  if (_isWaiting) advance(100)  |
      |        still in wait(10)          |  if (_isWaiting) advance(200)  | <- this should be disallowed !
      |        still in wait(10)          |  if (_isWaiting) advance(300)  | <- this should be disallowed !
      |      wake up from wait(10)        |                                |
      |       current time is 600         |                                |
      |       _isWaiting = false          |                                |
      |  }                                |                                |
      +-----------------------------------+--------------------------------+
      ```
      
      - The second source of flakiness is that adding data to the memory stream may get processed in any trigger, not just the first one.
      
      My fix is to make the manual clock wait for the other stream execution thread to start waiting for the clock at the right wait start time. That is, `advance(200)` (see above) will wait for stream execution thread to complete the wait that started at time 0, and start a new wait at time 200 (i.e. time stamp after the previous `advance(100)`).
      
      In addition, since this is a feature that is solely used by StreamExecution, I removed all the non-generic code from ManualClock and put them in StreamManualClock inside StreamTest.
      
      ## How was this patch tested?
      Ran existing unit test MANY TIME in Jenkins
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #15519 from tdas/metrics-flaky-test-fix.
      7d878cf2
  5. Oct 17, 2016
    • Eric Liang's avatar
      [SPARK-17974] Refactor FileCatalog classes to simplify the inheritance tree · 8daa1a29
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      This renames `BasicFileCatalog => FileCatalog`, combines  `SessionFileCatalog` with `PartitioningAwareFileCatalog`, and removes the old `FileCatalog` trait.
      
      In summary,
      ```
      MetadataLogFileCatalog extends PartitioningAwareFileCatalog
      ListingFileCatalog extends PartitioningAwareFileCatalog
      PartitioningAwareFileCatalog extends FileCatalog
      TableFileCatalog extends FileCatalog
      ```
      
      cc cloud-fan mallman
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #15518 from ericl/refactor-session-file-catalog.
      8daa1a29
    • Dilip Biswal's avatar
      [SPARK-17620][SQL] Determine Serde by hive.default.fileformat when Creating Hive Serde Tables · 813ab5e0
      Dilip Biswal authored
      ## What changes were proposed in this pull request?
      Reopens the closed PR https://github.com/apache/spark/pull/15190
      (Please refer to the above link for review comments on the PR)
      
      Make sure hive.default.fileformat is used when creating the storage format metadata.
      
      Output
      ``` SQL
      scala> spark.sql("SET hive.default.fileformat=orc")
      res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
      
      scala> spark.sql("CREATE TABLE tmp_default(id INT)")
      res2: org.apache.spark.sql.DataFrame = []
      ```
      Before
      ```SQL
      scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
      ..
      [# Storage Information,,]
      [SerDe Library:,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,]
      [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
      [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
      [Compressed:,No,]
      [Storage Desc Parameters:,,]
      [  serialization.format,1,]
      ```
      After
      ```SQL
      scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
      ..
      [# Storage Information,,]
      [SerDe Library:,org.apache.hadoop.hive.ql.io.orc.OrcSerde,]
      [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
      [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
      [Compressed:,No,]
      [Storage Desc Parameters:,,]
      [  serialization.format,1,]
      
      ```
      ## How was this patch tested?
      
      Added new tests to HiveDDLCommandSuite, SQLQuerySuite
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #15495 from dilipbiswal/orc2.
      813ab5e0
    • gatorsmile's avatar
      [SPARK-17751][SQL] Remove spark.sql.eagerAnalysis and Output the Plan if... · d88a1bae
      gatorsmile authored
      [SPARK-17751][SQL] Remove spark.sql.eagerAnalysis and Output the Plan if Existed in AnalysisException
      
      ### What changes were proposed in this pull request?
      Dataset always does eager analysis now, so `spark.sql.eagerAnalysis` is no longer used, and this PR removes it.
      
      This PR also outputs the plan in the error message. Without the fix, the analysis error looks like:
      ```
      cannot resolve '`k1`' given input columns: [k, v]; line 1 pos 12
      ```
      
      After the fix, the analysis error becomes:
      ```
      org.apache.spark.sql.AnalysisException: cannot resolve '`k1`' given input columns: [k, v]; line 1 pos 12;
      'Project [unresolvedalias(CASE WHEN ('k1 = 2) THEN 22 WHEN ('k1 = 4) THEN 44 ELSE 0 END, None), v#6]
      +- SubqueryAlias t
         +- Project [_1#2 AS k#5, _2#3 AS v#6]
            +- LocalRelation [_1#2, _2#3]
      ```
      
      ### How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15316 from gatorsmile/eagerAnalysis.
      d88a1bae