  1. Sep 28, 2015
    • Sean Owen's avatar
      [SPARK-10833] [BUILD] Inline, organize BSD/MIT licenses in LICENSE · bf4199e2
      Sean Owen authored
      In the course of https://issues.apache.org/jira/browse/LEGAL-226 it came to light that the guidance at http://www.apache.org/dev/licensing-howto.html#permissive-deps for permissively-licensed dependencies has a different interpretation than we (er, I) had been operating under. "pointer ... to the license within the source tree" specifically means a copy of the license within Spark's distribution, whereas at the moment Spark's LICENSE only points to the license in the other project's source tree.
      
      The remedy is simply to inline all such license references (i.e. BSD/MIT licenses) or to include their text in a "licenses" subdirectory and point to that.
      
      Along the way, we can also treat other BSD/MIT licenses, whose text has been inlined into LICENSE, in the same way.
      
      The LICENSE file can continue to provide a helpful list of BSD/MIT licensed projects and a pointer to their sites. This would be over and above including license text in the distro, which is the essential thing.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #8919 from srowen/SPARK-10833.
      bf4199e2
    • Davies Liu's avatar
      [SPARK-10859] [SQL] fix stats of StringType in columnar cache · ea02e551
      Davies Liu authored
      A UTF8String may come from an UnsafeRow, in which case its underlying buffer is not copied, so we should clone it before holding it in the stats.
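
      The pattern, in a minimal self-contained sketch (plain byte arrays stand in for UnsafeRow's reused memory and UTF8String, so the names below are illustrative only):

      ```scala
      object StatsCopyDemo {
        def main(args: Array[String]): Unit = {
          val rowBuffer = new Array[Byte](2)            // plays the role of an UnsafeRow's reused memory
          def setRow(s: String): Unit = s.getBytes("UTF-8").copyToArray(rowBuffer)

          setRow("zz")
          val keptByReference = rowBuffer               // no copy: changes when the row buffer is reused
          val keptByClone     = rowBuffer.clone()       // defensive copy, as the fix does before storing stats

          setRow("aa")                                  // the buffer is reused for the next record
          println(new String(keptByReference, "UTF-8")) // "aa" -- the retained value was silently corrupted
          println(new String(keptByClone, "UTF-8"))     // "zz" -- the cloned value stays stable
        }
      }
      ```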
      
      cc yhuai
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #8929 from davies/pushdown_string.
      ea02e551
    • Cheng Lian's avatar
      [SPARK-10395] [SQL] Simplifies CatalystReadSupport · 14978b78
      Cheng Lian authored
      Please refer to [SPARK-10395] [1] for details.
      
      [1]: https://issues.apache.org/jira/browse/SPARK-10395
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #8553 from liancheng/spark-10395/simplify-parquet-read-support.
      14978b78
    • jerryshao's avatar
      [SPARK-10790] [YARN] Fix initial executor number not set issue and consolidate the codes · 353c30bd
      jerryshao authored
      This bug was introduced in [SPARK-9092](https://issues.apache.org/jira/browse/SPARK-9092): `targetExecutorNumber` should use `minExecutors` if `initialExecutors` is not set. Using 0 instead leads to the problem described in [SPARK-10790](https://issues.apache.org/jira/browse/SPARK-10790).
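
      The intended fallback order, as a hedged sketch (the real change lives in the YARN client and allocator code; the config names are the standard dynamic allocation keys):

      ```scala
      // Hedged sketch: the initial target should fall back to minExecutors, not 0, when unset.
      def initialTargetExecutors(conf: Map[String, String]): Int = {
        val minExecutors = conf.get("spark.dynamicAllocation.minExecutors").map(_.toInt).getOrElse(0)
        conf.get("spark.dynamicAllocation.initialExecutors").map(_.toInt).getOrElse(minExecutors)
      }
      ```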
      
      Also consolidate and simplify some similar code snippets to keep the semantics consistent.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #8910 from jerryshao/SPARK-10790.
      353c30bd
    • Holden Karau's avatar
      [SPARK-10812] [YARN] Spark hadoop util support switching to yarn · d8d50ed3
      Holden Karau authored
      While this is likely not a huge issue for real production systems, it can be a problem for test suites that set up a SparkContext, tear it down, and then stand up another SparkContext with a different master (e.g. some local-mode and some yarn-mode tests). Discovered during work on spark-testing-base on Spark 1.4.1, but the logic that triggers it seems to be present in master (see the SparkHadoopUtil object). A valid workaround for users encountering this issue is to fork a different JVM, however this can be heavyweight.
      
      ```
      [info] SampleMiniClusterTest:
      [info] Exception encountered when attempting to run a suite with class name: com.holdenkarau.spark.testing.SampleMiniClusterTest *** ABORTED ***
      [info] java.lang.ClassCastException: org.apache.spark.deploy.SparkHadoopUtil cannot be cast to org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
      [info] at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.get(YarnSparkHadoopUtil.scala:163)
      [info] at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:257)
      [info] at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:561)
      [info] at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115)
      [info] at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
      [info] at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
      [info] at org.apache.spark.SparkContext.<init>(SparkContext.scala:497)
      [info] at com.holdenkarau.spark.testing.SharedMiniCluster$class.setup(SharedMiniCluster.scala:186)
      [info] at com.holdenkarau.spark.testing.SampleMiniClusterTest.setup(SampleMiniClusterTest.scala:26)
      [info] at com.holdenkarau.spark.testing.SharedMiniCluster$class.beforeAll(SharedMiniCluster.scala:103)
      ```
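
      The shape of the fix, as a hedged sketch (class and method names are simplified stand-ins, not the actual SparkHadoopUtil API): choose the implementation based on the currently requested mode instead of caching whichever instance was created first in the JVM.

      ```scala
      class SimpleHadoopUtil
      class YarnAwareHadoopUtil extends SimpleHadoopUtil

      object HadoopUtilHolder {
        @volatile private var yarnMode = false            // flipped when a yarn-mode context starts
        def setYarnMode(enabled: Boolean): Unit = { yarnMode = enabled }

        // Re-evaluated on each call, so a later yarn-mode SparkContext in the same JVM
        // gets a YARN-aware instance rather than a stale, cached non-YARN one.
        def get: SimpleHadoopUtil = if (yarnMode) new YarnAwareHadoopUtil else new SimpleHadoopUtil
      }
      ```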
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #8911 from holdenk/SPARK-10812-spark-hadoop-util-support-switching-to-yarn.
      d8d50ed3
    • David Martin's avatar
      Fix two mistakes in programming-guide page · b5824993
      David Martin authored
      seperate -> separate
      sees -> see
      
      Author: David Martin <dmartinpro@users.noreply.github.com>
      
      Closes #8928 from dmartinpro/patch-1.
      b5824993
  2. Sep 27, 2015
  3. Sep 26, 2015
    • Cheng Lian's avatar
      [SPARK-10845] [SQL] Makes spark.sql.hive.version a SQLConfEntry · 6f94d56a
      Cheng Lian authored
      When refactoring SQL options from plain strings to the strongly typed `SQLConfEntry`, `spark.sql.hive.version` wasn't migrated, and doesn't show up in the result of `SET -v`, as `SET -v` only shows public `SQLConfEntry` instances. This affects compatibility with the Simba ODBC driver.
      
      This PR migrates this SQL option as a `SQLConfEntry` to fix this issue.
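
      A hedged sketch of what the migration looks like (the case class below is a stand-in for the real `SQLConfEntry`, and the default and doc strings are illustrative):

      ```scala
      final case class ConfEntry[T](key: String, defaultValue: T, doc: String, isPublic: Boolean)

      object HiveEntries {
        // Registered as a typed, public entry so that SET -v lists it.
        val HIVE_VERSION: ConfEntry[String] = ConfEntry(
          key = "spark.sql.hive.version",
          defaultValue = "<built-in Hive version>",
          doc = "Version of the Hive client that Spark SQL communicates with.",
          isPublic = true)
      }
      ```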
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #8925 from liancheng/spark-10845/hive-version-conf.
      6f94d56a
  4. Sep 25, 2015
    • Narine Kokhlikyan's avatar
      [SPARK-10760] [SPARKR] SparkR glm: the documentation in examples - family argument is missing · 6fcee906
      Narine Kokhlikyan authored
      Hi everyone,
      
      Since the family argument is required for the glm function, the execution of:
      
      model <- glm(Sepal_Length ~ Sepal_Width, df)
      
      is failing.
      
      I've fixed the documentation by adding the family argument, and also added summary(model), which shows the coefficients for the model.
      
      Thanks,
      Narine
      
      Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
      
      Closes #8870 from NarineK/sparkrml.
      6fcee906
    • Eric Liang's avatar
      [SPARK-9681] [ML] Support R feature interactions in RFormula · 92233881
      Eric Liang authored
      This integrates the Interaction feature transformer with SparkR R formula support (i.e. support `:`).
      
      To generate reasonable ML attribute names for feature interactions, it was necessary to add the ability to read the original attribute names back from `StructField`, and also to specify custom group prefixes in `VectorAssembler`. This also has the side benefit of cleaning up the double underscores in the attributes generated for non-interaction terms.
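
      A hedged usage sketch (column names are made up, and `df` is assumed to be a DataFrame with numeric columns `a` and `b` plus a `label` column):

      ```scala
      import org.apache.spark.ml.feature.RFormula

      val formula = new RFormula()
        .setFormula("label ~ a + b + a:b")   // a:b is the interaction of columns a and b
        .setFeaturesCol("features")
        .setLabelCol("label")

      // formula.fit(df).transform(df) appends a "features" vector that includes the a:b term,
      // with ML attribute names derived from the original column names.
      ```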
      
      mengxr
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #8830 from ericl/interaction-2.
      92233881
  5. Sep 24, 2015
  6. Sep 23, 2015
  7. Sep 22, 2015
    • Matt Hagen's avatar
      [SPARK-10663] Removed unnecessary invocation of DataFrame.toDF method. · 558e9c7e
      Matt Hagen authored
      The Scala example under the "Example: Pipeline" heading in this
      document initializes the "test" variable to a DataFrame. Because test
      is already a DataFrame, there is no need to call test.toDF as the example
      does in a subsequent line: model.transform(test.toDF). So, I removed
      the extraneous toDF invocation.
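
      For reference, a hedged sketch of the change (the surrounding Pipeline example defines `model` and `sqlContext`; the data here is abbreviated):

      ```scala
      // `test` is created as a DataFrame directly, so no further conversion is needed.
      val test = sqlContext.createDataFrame(Seq(
        (4L, "spark i j k"),
        (5L, "l m n")
      )).toDF("id", "text")

      // Before: model.transform(test.toDF)   // redundant -- test is already a DataFrame
      // After:  model.transform(test)
      ```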
      
      Author: Matt Hagen <anonz3000@gmail.com>
      
      Closes #8875 from hagenhaus/SPARK-10663.
      558e9c7e
    • Zhichao Li's avatar
      [SPARK-10310] [SQL] Fixes script transformation field/line delimiters · 84f81e03
      Zhichao Li authored
      **Please attribute this PR to `Zhichao Li <zhichao.li@intel.com>`.**
      
      This PR is based on PR #8476 authored by zhichao-li. It fixes SPARK-10310 by adding field delimiter SerDe property to the default `LazySimpleSerDe`, and enabling default record reader/writer classes.
      
      Currently, we only support `LazySimpleSerDe`, used together with `TextRecordReader` and `TextRecordWriter`, and don't support customizing record reader/writer using `RECORDREADER`/`RECORDWRITER` clauses. This should be addressed in separate PR(s).
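
      A hedged illustration of the kind of query this affects (table, columns, and script are placeholders); the declared delimiters are forwarded to `LazySimpleSerDe`, and the default text record reader/writer handle the script's input and output:

      ```scala
      sqlContext.sql(
        """SELECT TRANSFORM (key, value)
          |  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
          |  USING 'cat'
          |  AS (k, v)
          |  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
          |FROM src
        """.stripMargin)
      ```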
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #8860 from liancheng/spark-10310/fix-script-trans-delimiters.
      84f81e03
    • Andrew Or's avatar
      [SPARK-10640] History server fails to parse TaskCommitDenied · 61d4c07f
      Andrew Or authored
      ... simply because the code is missing!
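
      A hedged sketch of the missing piece (the case class and field names below are illustrative stand-ins, not JsonProtocol verbatim): the history server needs both the write and the read side for this task end reason.

      ```scala
      import org.json4s._
      import org.json4s.JsonDSL._

      case class TaskCommitDenied(jobId: Int, partitionId: Int, attemptNumber: Int)

      def taskCommitDeniedToJson(r: TaskCommitDenied): JValue =
        ("Reason" -> "TaskCommitDenied") ~
          ("Job ID" -> r.jobId) ~
          ("Partition ID" -> r.partitionId) ~
          ("Attempt Number" -> r.attemptNumber)

      def taskCommitDeniedFromJson(json: JValue): TaskCommitDenied = {
        implicit val formats: Formats = DefaultFormats
        TaskCommitDenied(
          (json \ "Job ID").extract[Int],
          (json \ "Partition ID").extract[Int],
          (json \ "Attempt Number").extract[Int])
      }
      ```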
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #8828 from andrewor14/task-end-reason-json.
      61d4c07f
    • Reynold Xin's avatar
      [SPARK-10714] [SPARK-8632] [SPARK-10685] [SQL] Refactor Python UDF handling · a96ba40f
      Reynold Xin authored
      This patch refactors Python UDF handling:
      
      1. Extract the per-partition Python UDF calling logic from PythonRDD into a PythonRunner. PythonRunner itself expects iterators as input and output, and thus has no dependency on RDD. This way, we can use PythonRunner directly in a mapPartitions call, or in the future in an environment without RDDs.
      2. Use PythonRunner in Spark SQL's BatchPythonEvaluation.
      3. Update BatchPythonEvaluation to use its input only once, rather than twice. This should fix the Python UDF performance regression in Spark 1.5.
      
      There are a number of small cleanups I wanted to do when I looked at the code, but I kept most of those out so the diff looks small.
      
      This basically implements the approach in https://github.com/apache/spark/pull/8833, but with some code moving around so the correctness doesn't depend on the inner workings of Spark serialization and task execution.
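
      The shape described in (1), as a hedged sketch (trait and method names are illustrative, not the actual PythonRunner signature):

      ```scala
      // A runner that maps an input iterator to an output iterator, with no dependency on RDD.
      trait IteratorRunner[IN, OUT] {
        def compute(input: Iterator[IN], partitionIndex: Int): Iterator[OUT]
      }

      // The RDD integration then stays a thin wrapper, e.g. (illustrative only):
      //   rdd.mapPartitionsWithIndex((idx, it) => runner.compute(it, idx))
      // while SQL's BatchPythonEvaluation can call compute() on its own row iterators directly.
      ```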
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8835 from rxin/python-iter-refactor.
      a96ba40f
    • Yin Huai's avatar
      [SPARK-10737] [SQL] When using UnsafeRows, SortMergeJoin may return wrong results · 5aea987c
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-10737
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #8854 from yhuai/SMJBug.
      5aea987c
    • Yin Huai's avatar
      [SPARK-10672] [SQL] Do not fail when we cannot save the metadata of a data source table in a hive compatible way · 2204cdb2
      Yin Huai authored
      [SPARK-10672] [SQL] Do not fail when we cannot save the metadata of a data source table in a hive compatible way
      
      https://issues.apache.org/jira/browse/SPARK-10672
      
      With the changes in this PR, we will fall back to saving the metadata of a table in a Spark SQL specific way if we fail to save it in a Hive compatible way (Hive throws an exception because of its internal restrictions, e.g. binary and decimal types cannot be saved to parquet if the metastore is running Hive 0.13). I manually tested the fix with the following test in `DataSourceWithHiveMetastoreCatalogSuite` (`spark.sql.hive.metastore.version=0.13` and `spark.sql.hive.metastore.jars=maven`).
      
      ```
          test(s"fail to save metadata of a parquet table in hive 0.13") {
            withTempPath { dir =>
              withTable("t") {
                val path = dir.getCanonicalPath
      
                sql(
                  s"""CREATE TABLE t USING $provider
                     |OPTIONS (path '$path')
                     |AS SELECT 1 AS d1, cast("val_1" as binary) AS d2
                   """.stripMargin)
      
                sql(
                  s"""describe formatted t
                   """.stripMargin).collect.foreach(println)
      
                sqlContext.table("t").show
              }
            }
          }
        }
      ```
      
      Without this fix, we will fail with the following error.
      ```
      org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.UnsupportedOperationException: Unknown field type: binary
      	at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:619)
      	at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:576)
      	at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply$mcV$sp(ClientWrapper.scala:359)
      	at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply(ClientWrapper.scala:357)
      	at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply(ClientWrapper.scala:357)
      	at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
      	at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211)
      	at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248)
      	at org.apache.spark.sql.hive.client.ClientWrapper.createTable(ClientWrapper.scala:357)
      	at org.apache.spark.sql.hive.HiveMetastoreCatalog.createDataSourceTable(HiveMetastoreCatalog.scala:358)
      	at org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.run(commands.scala:285)
      	at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
      	at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
      	at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
      	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
      	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
      	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:58)
      	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:58)
      	at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:144)
      	at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:129)
      	at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
      	at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:725)
      	at org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:56)
      	at org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:56)
      	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcV$sp$2$$anonfun$apply$2.apply$mcV$sp(HiveMetastoreCatalogSuite.scala:165)
      	at org.apache.spark.sql.test.SQLTestUtils$class.withTable(SQLTestUtils.scala:150)
      	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite.withTable(HiveMetastoreCatalogSuite.scala:52)
      	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcV$sp$2.apply(HiveMetastoreCatalogSuite.scala:162)
      	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcV$sp$2.apply(HiveMetastoreCatalogSuite.scala:161)
      	at org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:125)
      	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite.withTempPath(HiveMetastoreCatalogSuite.scala:52)
      	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$4$$anonfun$apply$1.apply$mcV$sp(HiveMetastoreCatalogSuite.scala:161)
      	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$4$$anonfun$apply$1.apply(HiveMetastoreCatalogSuite.scala:161)
      	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$4$$anonfun$apply$1.apply(HiveMetastoreCatalogSuite.scala:161)
      	at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
      	at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
      	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
      	at org.scalatest.Transformer.apply(Transformer.scala:22)
      	at org.scalatest.Transformer.apply(Transformer.scala:20)
      	at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
      	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42)
      	at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
      	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
      	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
      	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
      	at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
      	at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
      	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
      	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
      	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
      	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
      	at scala.collection.immutable.List.foreach(List.scala:318)
      	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
      	at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
      	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
      	at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
      	at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
      	at org.scalatest.Suite$class.run(Suite.scala:1424)
      	at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
      	at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
      	at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
      	at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
      	at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
      	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite.org$scalatest$BeforeAndAfterAll$$super$run(HiveMetastoreCatalogSuite.scala:52)
      	at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
      	at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
      	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite.run(HiveMetastoreCatalogSuite.scala:52)
      	at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
      	at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
      	at sbt.ForkMain$Run$2.call(ForkMain.java:294)
      	at sbt.ForkMain$Run$2.call(ForkMain.java:284)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
      Caused by: java.lang.UnsupportedOperationException: Unknown field type: binary
      	at org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.getObjectInspector(ArrayWritableObjectInspector.java:108)
      	at org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.<init>(ArrayWritableObjectInspector.java:60)
      	at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.initialize(ParquetHiveSerDe.java:113)
      	at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:339)
      	at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:288)
      	at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:194)
      	at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:597)
      	... 76 more
      ```
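
      The fallback itself, as a hedged sketch (method names are illustrative, not HiveMetastoreCatalog's exact API):

      ```scala
      def createDataSourceTable(saveHiveCompatible: () => Unit,
                                saveSparkSqlSpecific: () => Unit,
                                logWarning: String => Unit): Unit = {
        try {
          // Prefer the Hive-compatible representation so that other tools can read the table.
          saveHiveCompatible()
        } catch {
          case e: Exception =>
            // Hive rejected the schema (e.g. binary/decimal columns with a 0.13 metastore):
            // record the table in the Spark SQL specific format instead of failing the command.
            logWarning(s"Could not persist the table in a Hive compatible way: ${e.getMessage}. " +
              "Persisting it in Spark SQL specific format instead.")
            saveSparkSqlSpecific()
        }
      }
      ```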
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #8824 from yhuai/datasourceMetadata.
      2204cdb2
    • Wenchen Fan's avatar
      [SPARK-10740] [SQL] handle nondeterministic expressions correctly for set operations · 5017c685
      Wenchen Fan authored
      https://issues.apache.org/jira/browse/SPARK-10740
      
      Author: Wenchen Fan <cloud0fan@163.com>
      
      Closes #8858 from cloud-fan/non-deter.
      5017c685
    • Josh Rosen's avatar
      [SPARK-10704] Rename HashShuffleReader to BlockStoreShuffleReader · 1ca5e2e0
      Josh Rosen authored
      The current shuffle code has an interface named ShuffleReader with only one implementation, HashShuffleReader. This naming is confusing, since the same read path code is used for both sort- and hash-based shuffle. This patch addresses this by renaming HashShuffleReader to BlockStoreShuffleReader.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #8825 from JoshRosen/shuffle-reader-cleanup.
      1ca5e2e0
    • Davies Liu's avatar
      [SPARK-10593] [SQL] fix resolve output of Generate · 22d40159
      Davies Liu authored
      The output of Generate should not be resolved as Reference.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #8755 from davies/view.
      22d40159