  1. Sep 22, 2015
    • [SPARK-10737] [SQL] When using UnsafeRows, SortMergeJoin may return wrong results · 5aea987c
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-10737
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #8854 from yhuai/SMJBug.
    • [SPARK-10672] [SQL] Do not fail when we cannot save the metadata of a data source table in a hive compatible way · 2204cdb2
      Yin Huai authored
      
      https://issues.apache.org/jira/browse/SPARK-10672
      
      With the changes in this PR, we fall back to saving the metadata of a table in a Spark SQL specific way if we fail to save it in a Hive compatible way (Hive throws an exception because of its internal restrictions, e.g. binary and decimal types cannot be saved to Parquet if the metastore is running Hive 0.13). I manually tested the fix with the following test in `DataSourceWithHiveMetastoreCatalogSuite` (with `spark.sql.hive.metastore.version=0.13` and `spark.sql.hive.metastore.jars=maven`).
      
      ```scala
          test(s"fail to save metadata of a parquet table in hive 0.13") {
            withTempPath { dir =>
              withTable("t") {
                val path = dir.getCanonicalPath
      
                sql(
                  s"""CREATE TABLE t USING $provider
                     |OPTIONS (path '$path')
                     |AS SELECT 1 AS d1, cast("val_1" as binary) AS d2
                   """.stripMargin)
      
                sql(
                  s"""describe formatted t
                   """.stripMargin).collect.foreach(println)
      
                sqlContext.table("t").show
              }
            }
          }
      ```
      
      Without this fix, we will fail with the following error.
      ```
      org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.UnsupportedOperationException: Unknown field type: binary
      	at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:619)
      	at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:576)
      	at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply$mcV$sp(ClientWrapper.scala:359)
      	at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply(ClientWrapper.scala:357)
      	at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply(ClientWrapper.scala:357)
      	at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
      	at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211)
      	at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248)
      	at org.apache.spark.sql.hive.client.ClientWrapper.createTable(ClientWrapper.scala:357)
      	at org.apache.spark.sql.hive.HiveMetastoreCatalog.createDataSourceTable(HiveMetastoreCatalog.scala:358)
      	at org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.run(commands.scala:285)
      	at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
      	at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
      	at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
      	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
      	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
      	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:58)
      	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:58)
      	at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:144)
      	at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:129)
      	at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
      	at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:725)
      	at org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:56)
      	at org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:56)
      	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcV$sp$2$$anonfun$apply$2.apply$mcV$sp(HiveMetastoreCatalogSuite.scala:165)
      	at org.apache.spark.sql.test.SQLTestUtils$class.withTable(SQLTestUtils.scala:150)
      	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite.withTable(HiveMetastoreCatalogSuite.scala:52)
      	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcV$sp$2.apply(HiveMetastoreCatalogSuite.scala:162)
      	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcV$sp$2.apply(HiveMetastoreCatalogSuite.scala:161)
      	at org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:125)
      	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite.withTempPath(HiveMetastoreCatalogSuite.scala:52)
      	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$4$$anonfun$apply$1.apply$mcV$sp(HiveMetastoreCatalogSuite.scala:161)
      	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$4$$anonfun$apply$1.apply(HiveMetastoreCatalogSuite.scala:161)
      	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$4$$anonfun$apply$1.apply(HiveMetastoreCatalogSuite.scala:161)
      	at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
      	at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
      	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
      	at org.scalatest.Transformer.apply(Transformer.scala:22)
      	at org.scalatest.Transformer.apply(Transformer.scala:20)
      	at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
      	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42)
      	at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
      	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
      	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
      	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
      	at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
      	at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
      	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
      	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
      	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
      	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
      	at scala.collection.immutable.List.foreach(List.scala:318)
      	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
      	at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
      	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
      	at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
      	at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
      	at org.scalatest.Suite$class.run(Suite.scala:1424)
      	at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
      	at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
      	at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
      	at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
      	at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
      	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite.org$scalatest$BeforeAndAfterAll$$super$run(HiveMetastoreCatalogSuite.scala:52)
      	at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
      	at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
      	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite.run(HiveMetastoreCatalogSuite.scala:52)
      	at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
      	at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
      	at sbt.ForkMain$Run$2.call(ForkMain.java:294)
      	at sbt.ForkMain$Run$2.call(ForkMain.java:284)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
      Caused by: java.lang.UnsupportedOperationException: Unknown field type: binary
      	at org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.getObjectInspector(ArrayWritableObjectInspector.java:108)
      	at org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.<init>(ArrayWritableObjectInspector.java:60)
      	at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.initialize(ParquetHiveSerDe.java:113)
      	at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:339)
      	at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:288)
      	at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:194)
      	at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:597)
      	... 76 more
      ```
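      
      For illustration, a minimal sketch of the fallback behaviour described above (the method names are hypothetical, not the actual HiveMetastoreCatalog API):
      
      ```scala
      object MetadataFallbackSketch {
        // Try the Hive compatible form first; if the metastore client rejects it,
        // persist the table with Spark SQL specific metadata instead.
        def persistTable(saveHiveCompatible: () => Unit, saveSparkSqlSpecific: () => Unit): Unit = {
          try {
            saveHiveCompatible()
          } catch {
            case e: Exception =>
              // e.g. Hive 0.13 cannot describe binary/decimal columns of a Parquet table
              println(s"Could not save metadata in a Hive compatible way (${e.getMessage}); " +
                "falling back to the Spark SQL specific format")
              saveSparkSqlSpecific()
          }
        }
      }
      ```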
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #8824 from yhuai/datasourceMetadata.
    • [SPARK-10740] [SQL] handle nondeterministic expressions correctly for set operations · 5017c685
      Wenchen Fan authored
      https://issues.apache.org/jira/browse/SPARK-10740
      
      Author: Wenchen Fan <cloud0fan@163.com>
      
      Closes #8858 from cloud-fan/non-deter.
    • [SPARK-10704] Rename HashShuffleReader to BlockStoreShuffleReader · 1ca5e2e0
      Josh Rosen authored
      The current shuffle code has an interface named ShuffleReader with only one implementation, HashShuffleReader. This naming is confusing, since the same read path code is used for both sort- and hash-based shuffle. This patch addresses this by renaming HashShuffleReader to BlockStoreShuffleReader.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #8825 from JoshRosen/shuffle-reader-cleanup.
    • [SPARK-10593] [SQL] fix resolve output of Generate · 22d40159
      Davies Liu authored
      The output of Generate should not be resolved as Reference.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #8755 from davies/view.
    • [SPARK-9585] Delete the input format caching because some input format are non thread safe · 2ea0f2e1
      xutingjun authored
      If we cache the InputFormat, all tasks on the same executor will share it.
      Some InputFormats are thread safe, but some are not, such as HiveHBaseTableInputFormat. If tasks share a non-thread-safe InputFormat, unexpected errors may occur.
      To avoid this, I think we should delete the input format caching.
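      
      A minimal sketch of the alternative (an illustrative helper, not the actual HadoopRDD code): instantiate a fresh InputFormat for each task instead of sharing a cached instance across threads.
      
      ```scala
      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.mapred.InputFormat
      import org.apache.hadoop.util.ReflectionUtils
      
      // Create a new InputFormat instance per task, so non-thread-safe
      // implementations such as HiveHBaseTableInputFormat are never shared.
      def newInputFormat[K, V](inputFormatClass: Class[_], conf: Configuration): InputFormat[K, V] =
        ReflectionUtils.newInstance(inputFormatClass, conf).asInstanceOf[InputFormat[K, V]]
      ```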
      
      Author: xutingjun <xutingjun@huawei.com>
      Author: meiyoula <1039320815@qq.com>
      Author: Xutingjun <xutingjun@huawei.com>
      
      Closes #7918 from XuTingjun/cached_inputFormat.
    • [SPARK-10750] [ML] ML Param validate should print better error information · 7104ee0e
      Yanbo Liang authored
      Currently, when you set an illegal value for a param of array type (such as IntArrayParam, DoubleArrayParam, or StringArrayParam), it throws an IllegalArgumentException with incomprehensible error information.
      Take `VectorSlicer.setNames` as an example:
      ```scala
      val vectorSlicer = new VectorSlicer().setInputCol("features").setOutputCol("result")
      // The value of setNames must contain distinct elements, so the next line will throw an exception.
      vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1"))
      ```
      It will throw IllegalArgumentException as:
      ```
      vectorSlicer_b3b4d1a10f43 parameter names given invalid value [Ljava.lang.String;798256c5.
      java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names given invalid value [Ljava.lang.String;798256c5.
      ```
      We should distinguish array-typed values from primitive values in Param.validate(value: T), so that we get better error information:
      ```
      vectorSlicer_3b744ea277b2 parameter names given invalid value [f1,f4,f1].
      java.lang.IllegalArgumentException: vectorSlicer_3b744ea277b2 parameter names given invalid value [f1,f4,f1].
      ```
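      
      A minimal sketch of the idea (the helper below is illustrative; the actual fix lives in Param.validate): render array-typed values element by element instead of relying on the default Java array toString.
      
      ```scala
      // Format the offending value so arrays print their elements rather than
      // something like "[Ljava.lang.String;798256c5".
      def describeInvalidValue(value: Any): String = value match {
        case arr: Array[_] => arr.mkString("[", ",", "]")
        case other         => String.valueOf(other)
      }
      
      // describeInvalidValue(Array("f1", "f4", "f1")) returns "[f1,f4,f1]"
      ```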
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #8863 from yanboliang/spark-10750.
    • [SPARK-9962] [ML] Decision Tree training: prevNodeIdsForInstances.unpersist() at end of training · f4a3c4e3
      Holden Karau authored
      In NodeIdCache, prevNodeIdsForInstances.unpersist() needs to be called at the end of training.
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #8541 from holdenk/SPARK-9962-decission-tree-training-prevNodeIdsForiNstances-unpersist-at-end-of-training.
    • [SPARK-10706] [MLLIB] Add java wrapper for random vector rdd · 870b8a2e
      Meihua Wu authored
      Add java wrapper for random vector rdd
      
      holdenk srowen
      
      Author: Meihua Wu <meihuawu@umich.edu>
      
      Closes #8841 from rotationsymmetry/SPARK-10706.
    • [SPARK-10718] [BUILD] Update License on conf files and corresponding excludes file update · 7278f792
      Rekha Joshi authored
      Update License on conf files and corresponding excludes file update
      
      Author: Rekha Joshi <rekhajoshm@gmail.com>
      Author: Joshi <rekhajoshm@gmail.com>
      
      Closes #8842 from rekhajoshm/SPARK-10718.
    • [SPARK-10695] [DOCUMENTATION] [MESOS] Fixing incorrect value information for spark.mesos.constraints parameter · 0bd0e5be
      Akash Mishra authored
      
      Author: Akash Mishra <akash.mishra20@gmail.com>
      
      Closes #8816 from SleepyThread/constraint-fix.
    • [SQL] [MINOR] map -> foreach. · f3b727c8
      Reynold Xin authored
      DataFrame.explain should use foreach to print the explain content.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8862 from rxin/map-foreach.
    • [SPARK-8567] [SQL] Increase the timeout of o.a.s.sql.hive.HiveSparkSubmitSuite to 5 minutes. · 4da32bc0
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-8567
      
      Looks like "SPARK-8368: includes jars passed in through --jars" is pretty flaky now. Based on some history runs, the time spent on a successful run may be from 1.5 minutes to almost 3 minutes. Let's try to increase the timeout and see if we can fix this test.
      
      https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.5-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=spark-test/385/testReport/junit/org.apache.spark.sql.hive/HiveSparkSubmitSuite/SPARK_8368__includes_jars_passed_in_through___jars/history/?start=25
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #8850 from yhuai/SPARK-8567-anotherTry.
    • fd61b004
      Andrew Or authored
    • [SPARK-10458] [SPARK CORE] Added isStopped() method in SparkContext · f24316e6
      Madhusudanan Kandasamy authored
      Added isStopped() method in SparkContext
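      
      An illustrative use of the new accessor (a sketch, assuming isStopped simply reports whether stop() has already been called on this SparkContext):
      
      ```scala
      import org.apache.spark.SparkContext
      
      // Guard cleanup code so it is safe to call even if the context was already
      // shut down somewhere else.
      def stopIfRunning(sc: SparkContext): Unit = {
        if (!sc.isStopped) {
          sc.stop()
        }
      }
      ```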
      
      Author: Madhusudanan Kandasamy <madhusudanan@in.ibm.com>
      
      Closes #8749 from kmadhugit/SPARK-10458.
    • [SPARK-10446][SQL] Support to specify join type when calling join with usingColumns · 1fcefef0
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-10446
      
      Currently the method `join(right: DataFrame, usingColumns: Seq[String])` only supports inner join. It is more convenient to have it support other join types.
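      
      A sketch of the call shape this change enables (the column name "id" is illustrative):
      
      ```scala
      import org.apache.spark.sql.DataFrame
      
      // Join on a shared "id" column, now with an explicit join type.
      def leftOuterJoinOnId(left: DataFrame, right: DataFrame): DataFrame =
        left.join(right, Seq("id"), "left_outer")
      ```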
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #8600 from viirya/usingcolumns_df.
    • [SPARK-10419] [SQL] Adding SQLServer support for datetimeoffset types to JdbcDialects · 781b21ba
      Ewan Leith authored
      Reading from Microsoft SQL Server over jdbc fails when the table contains datetimeoffset types.
      
      This patch registers a SQL Server JDBC dialect that maps datetimeoffset to a String, as Microsoft suggests.
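      
      A hedged sketch of such a dialect (the object name is illustrative and details may differ from the dialect actually added by this patch):
      
      ```scala
      import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
      import org.apache.spark.sql.types.{DataType, MetadataBuilder, StringType}
      
      // Map SQL Server's DATETIMEOFFSET column type to StringType when reading over JDBC.
      object SqlServerDatetimeOffsetDialect extends JdbcDialect {
        override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")
      
        override def getCatalystType(
            sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
          if (typeName.equalsIgnoreCase("datetimeoffset")) Some(StringType) else None
        }
      }
      
      // Register it once, e.g. at application startup:
      // JdbcDialects.registerDialect(SqlServerDatetimeOffsetDialect)
      ```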
      
      Author: Ewan Leith <ewan.leith@realitymine.com>
      
      Closes #8575 from realitymine-coordinator/sqlserver.
    • [SPARK-10577] [PYSPARK] DataFrame hint for broadcast join · 0180b849
      Jian Feng authored
      https://issues.apache.org/jira/browse/SPARK-10577
      
      Author: Jian Feng <jzhang.chs@gmail.com>
      
      Closes #8801 from Jianfeng-chs/master.
    • [SPARK-10716] [BUILD] spark-1.5.0-bin-hadoop2.6.tgz file doesn't uncompress on OS X due to hidden file · bf20d6c9
      Sean Owen authored
      
      Remove the ._SUCCESS.crc hidden file, which may cause problems in the distribution tar archive and is not used.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #8846 from srowen/SPARK-10716.
    • [SPARK-9821] [PYSPARK] pyspark-reduceByKey-should-take-a-custom-partitioner · 1cd67415
      Holden Karau authored
      from the issue:
      
      In Scala, I can supply a custom partitioner to reduceByKey (and other aggregation/repartitioning methods like aggregateByKey and combineByKey), but as far as I can tell from the PySpark API, there's no way to do the same in Python.
      Here's an example of my code in Scala:
      weblogs.map(s => (getFileType(s), 1)).reduceByKey(new FileTypePartitioner(),_+_)
      But I can't figure out how to do the same in Python. The closest I can get is to call repartition before reduceByKey like so:
      weblogs.map(lambda s: (getFileType(s), 1)).partitionBy(3,hash_filetype).reduceByKey(lambda v1,v2: v1+v2).collect()
      But that defeats the purpose, because I'm shuffling twice instead of once, so my performance is worse instead of better.
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #8569 from holdenk/SPARK-9821-pyspark-reduceByKey-should-take-a-custom-partitioner.
  2. Sep 21, 2015
  3. Sep 20, 2015
  4. Sep 19, 2015
    • [SPARK-10710] Remove ability to disable spilling in core and SQL · 2117eea7
      Josh Rosen authored
      It does not make much sense to set `spark.shuffle.spill` or `spark.sql.planner.externalSort` to false: I believe that these configurations were initially added as "escape hatches" to guard against bugs in the external operators, but these operators are now mature and well-tested. In addition, these configurations are not handled in a consistent way anymore: SQL's Tungsten codepath ignores these configurations and will continue to use spilling operators. Similarly, Spark Core's `tungsten-sort` shuffle manager does not respect `spark.shuffle.spill=false`.
      
      This pull request removes these configurations, adds warnings at the appropriate places, and deletes a large amount of code which was only used in code paths that did not support spilling.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #8831 from JoshRosen/remove-ability-to-disable-spilling.
    • [SPARK-10155] [SQL] Change SqlParser to object to avoid memory leak · e789000b
      zsxwing authored
      Because `scala.util.parsing.combinator.Parsers` has been thread-safe since Scala 2.10 (see [SI-4929](https://issues.scala-lang.org/browse/SI-4929)), we can change SqlParser to an object to avoid a memory leak.
      
      I didn't change other subclasses of `scala.util.parsing.combinator.Parsers` because there is only one instance in one SQLContext, which should not be an issue.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #8357 from zsxwing/sql-memory-leak.
    • Fixed links to the API · d83b6aae
      Alexis Seigneurin authored
      Submitting this change on the master branch as requested in https://github.com/apache/spark/pull/8819#issuecomment-141505941
      
      Author: Alexis Seigneurin <alexis.seigneurin@gmail.com>
      
      Closes #8838 from aseigneurin/patch-2.
    • [SPARK-10584] [SQL] [DOC] Documentation about the compatible Hive version is wrong. · d507f9c0
      Kousuke Saruta authored
      In Spark 1.5.0, Spark SQL is compatible with Hive 0.12.0 through 1.2.1 but the documentation is wrong.
      
      /CC yhuai
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #8776 from sarutak/SPARK-10584-2.