  1. Feb 05, 2015
    • SPARK-5557: Explicitly include servlet API in dependencies. · 793dbaef
      Patrick Wendell authored
      Because of the way we shade Jetty, we lose its orbit dependency
      in the assembly jar, which includes the javax servlet APIs. This
      adds orbit back explicitly, using the version that matches
      our Jetty version.
      
      Author: Patrick Wendell <patrick@databricks.com>
      
      Closes #4411 from pwendell/servlet-api and squashes the following commits:
      
      445f868 [Patrick Wendell] SPARK-5557: Explicitly include servlet API in dependencies.
      793dbaef
    • [HOTFIX] [SQL] Disables Metastore Parquet table conversion for "SQLQuerySuite.CTAS with serde" · 7c0a648f
      Cheng Lian authored
      Ideally we should convert Metastore Parquet tables with our own Parquet implementation on both the read path and the write path. However, the write path is not well covered, which causes this test failure. This PR is a hotfix to bring back the Jenkins PR builder. A proper fix will be delivered in a follow-up PR.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #4413 from liancheng/hotfix-parquet-ctas and squashes the following commits:
      
      5291289 [Cheng Lian] Hot fix for "SQLQuerySuite.CTAS with serde"
      7c0a648f
    • [SPARK-5638][SQL] Add a config flag to disable eager analysis of DataFrames · e8a5d50a
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4408 from rxin/df-config-eager and squashes the following commits:
      
      c0204cf [Reynold Xin] [SPARK-5638][SQL] Add a config flag to disable eager analysis of DataFrames.
      e8a5d50a
    • [SPARK-5620][DOC] group methods in generated unidoc · 85ccee81
      Xiangrui Meng authored
      It seems that `(ScalaUnidoc, unidoc)` is the correct way to overwrite `scalacOptions` in unidoc.
      
      CC: rxin gzm0
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4404 from mengxr/SPARK-5620 and squashes the following commits:
      
      f890cf5 [Xiangrui Meng] add -groups to scalacOptions in unidoc
      85ccee81
    • [SPARK-5182] [SPARK-5528] [SPARK-5509] [SPARK-3575] [SQL] Parquet data source improvements · a9ed5117
      Cheng Lian authored
      This PR adds three major improvements to Parquet data source:
      
      1.  Partition discovery

          When reading Parquet files that reside in Hive-style partition directories, `ParquetRelation2` automatically discovers partitioning information and infers partition column types.

          This is also partial work toward [SPARK-5182] [1], which aims to provide first-class partitioning support for the data source API.  Related code in this PR can be easily extracted to the data source API level in future versions.

      2.  Schema merging

          When enabled, the Parquet data source collects schema information from all Parquet part-files and tries to merge them.  Exceptions are thrown when incompatible schemas are detected.  This feature is controlled by the data source option `parquet.mergeSchema`, and is enabled by default.

      3.  Metastore Parquet table conversion moved to analysis phase

          This greatly simplifies the conversion logic.  `ParquetConversion` strategy can be removed once the old Parquet implementation is removed in the future.
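
The first two improvements above can be illustrated with a minimal, Spark-free sketch. It is not the `ParquetRelation2` implementation: the function names, the int-then-float-then-string inference order, and the dict-based schema representation are all assumptions made for illustration.

```python
from typing import Dict, List


def infer_value(raw: str):
    """Try int, then float, and fall back to the raw string (assumed inference order)."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def discover_partitions(path: str) -> Dict[str, object]:
    """Collect key=value directory segments from a Hive-style partition path."""
    values: Dict[str, object] = {}
    for segment in path.split("/")[:-1]:  # the last segment is the file name
        if "=" in segment:
            key, raw = segment.split("=", 1)
            values[key] = infer_value(raw)
    return values


def merge_schemas(schemas: List[Dict[str, str]]) -> Dict[str, str]:
    """Union column -> type maps; raise when one column has conflicting types."""
    merged: Dict[str, str] = {}
    for schema in schemas:
        for name, dtype in schema.items():
            if merged.setdefault(name, dtype) != dtype:
                raise ValueError(
                    f"incompatible types for column {name!r}: {merged[name]} vs {dtype}"
                )
    return merged
```

For example, `discover_partitions("table/year=2015/month=2/part-00000.parquet")` yields `year` and `month` as integer partition columns, and `merge_schemas` raises as soon as two part-file schemas disagree on a column type.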
      
      This version of Parquet data source aims to entirely replace the old Parquet implementation.  However, the old version hasn't been removed yet.  Users can fall back to the old version by turning off SQL configuration `spark.sql.parquet.useDataSourceApi`.
      
      Other JIRA tickets fixed as side effects in this PR:
      
      - [SPARK-5509] [3]: `EqualTo` now uses a proper `Ordering` to compare binary types.
      
      - [SPARK-3575] [4]: Metastore schema is now preserved and passed to `ParquetRelation2` via data source option `parquet.metastoreSchema`.
      
      TODO:
      
      - [ ] More test cases for partition discovery
      - [x] Fix write path after data source write support (#4294) is merged
      
            It turned out to be non-trivial to fall back to old Parquet implementation on the write path when Parquet data source is enabled.  Since we're planning to include data source write support in 1.3.0, I simply ignored two test cases involving Parquet insertion for now.
      
      - [ ] Fix outdated comments and documentation
      
      PS: This PR looks big, but more than half of the changed lines in this PR are trivial changes to test cases. To test Parquet with and without the new data source, almost all Parquet test cases are moved into wrapper driver functions. This introduces hundreds of lines of changes.
      
      [1]: https://issues.apache.org/jira/browse/SPARK-5182
      [2]: https://issues.apache.org/jira/browse/SPARK-5528
      [3]: https://issues.apache.org/jira/browse/SPARK-5509
      [4]: https://issues.apache.org/jira/browse/SPARK-3575
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #4308 from liancheng/parquet-partition-discovery and squashes the following commits:
      
      b6946e6 [Cheng Lian] Fixes MiMA issues, addresses comments
      8232e17 [Cheng Lian] Write support for Parquet data source
      a49bd28 [Cheng Lian] Fixes spelling typo in trait name "CreateableRelationProvider"
      808380f [Cheng Lian] Fixes issues introduced while rebasing
      50dd8d1 [Cheng Lian] Addresses @rxin's comment, fixes UDT schema merging
      adf2aae [Cheng Lian] Fixes compilation error introduced while rebasing
      4e0175f [Cheng Lian] Fixes Python Parquet API, we need Py4J array to call varargs method
      0d8ec1d [Cheng Lian] Adds more test cases
      b35c8c6 [Cheng Lian] Fixes some typos and outdated comments
      dd704fd [Cheng Lian] Fixes Python Parquet API
      596c312 [Cheng Lian] Uses switch to control whether use Parquet data source or not
      7d0f7a2 [Cheng Lian] Fixes Metastore Parquet table conversion
      a1896c7 [Cheng Lian] Fixes all existing Parquet test suites except for ParquetMetastoreSuite
      5654c9d [Cheng Lian] Draft version of Parquet partition discovery and schema merging
      a9ed5117
    • [SPARK-5604][MLLIB] remove checkpointDir from LDA · c19152cd
      Xiangrui Meng authored
      `checkpointDir` is a Spark global configuration. Users should set it outside LDA. This PR also hides some methods under `private[clustering] object LDA`, so they don't show up in the generated Java doc (SPARK-5610).
      
      jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4390 from mengxr/SPARK-5604 and squashes the following commits:
      
      a34bb39 [Xiangrui Meng] remove checkpointDir from LDA
      c19152cd
    • [SPARK-5460][MLlib] Wrapped `Try` around `deleteAllCheckpoints` - RandomForest. · 62371ada
      x1- authored
      `deleteAllCheckpoints` can throw an `IOException`, so this wraps the call to handle the failure.
      
      Author: x1- <viva008@gmail.com>
      
      Closes #4347 from x1-/SPARK-5460 and squashes the following commits:
      
      7a3d8de [x1-] change `Try()` to `try catch { case ... }` at RandomForest.
      3a52745 [x1-] modified typo. 'faild' -> 'failed' and remove disused '-'.
      1572576 [x1-] Wrapped `Try` around `deleteAllCheckpoints` - RandomForest.
      62371ada
    • [SPARK-5135][SQL] Add support for describe table to DDL in SQLContext · 4d8d070c
      OopsOutOfMemory authored
      Hi, rxin marmbrus
      I considered your suggestion (in #4127) and have rewritten it. It is now up to date.
      Could you please review it?
      
      Author: OopsOutOfMemory <victorshengli@126.com>
      
      Closes #4227 from OopsOutOfMemory/describe and squashes the following commits:
      
      053826f [OopsOutOfMemory] describe
      4d8d070c
    • [SPARK-5617][SQL] fix test failure of SQLQuerySuite · a83936e1
      wangfei authored
      SQLQuerySuite test failure:
      [info] - simple select (22 milliseconds)
      [info] - sorting (722 milliseconds)
      [info] - external sorting (728 milliseconds)
      [info] - limit (95 milliseconds)
      [info] - date row *** FAILED *** (35 milliseconds)
      [info]   Results do not match for query:
      [info]   'Limit 1
      [info]    'Project [CAST(2015-01-28, DateType) AS c0#3630]
      [info]     'UnresolvedRelation [testData], None
      [info]
      [info]   == Analyzed Plan ==
      [info]   Limit 1
      [info]    Project [CAST(2015-01-28, DateType) AS c0#3630]
      [info]     LogicalRDD [key#0,value#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:35
      [info]
      [info]   == Physical Plan ==
      [info]   Limit 1
      [info]    Project [16463 AS c0#3630]
      [info]     PhysicalRDD [key#0,value#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:35
      [info]
      [info]   == Results ==
      [info]   !== Correct Answer - 1 ==   == Spark Answer - 1 ==
      [info]   ![2015-01-28]               [2015-01-27] (QueryTest.scala:77)
      [info]   org.scalatest.exceptions.TestFailedException:
      [info]   at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
      [info]   at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
      [info]   at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
      [info]   at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
      [info]   at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:77)
      [info]   at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:95)
      [info]   at org.apache.spark.sql.SQLQuerySuite$$anonfun$23.apply$mcV$sp(SQLQuerySuite.scala:300)
      [info]   at org.apache.spark.sql.SQLQuerySuite$$anonfun$23.apply(SQLQuerySuite.scala:300)
      [info]   at org.apache.spark.sql.SQLQuerySuite$$anonfun$23.apply(SQLQuerySuite.scala:300)
      [info]   at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
      [info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
      [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
      [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
      [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
      [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
      [info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
      [info]   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
      [info]   at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
      [info]   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
      [info]   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
      [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
      [info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
      [info]   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
      [info]   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
      [info]   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
      [info]   at org.scalatest.SuperEngine$$anonfun$traverseSubNode
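
A quick arithmetic check of the log above, assuming the folded constant `16463` in the physical plan is a count of days since the Unix epoch (the log itself does not say so). Under that assumption the constant corresponds to the expected 2015-01-28, which suggests the off-by-one appears when the value is turned back into a date rather than in the constant itself; the log does not show where. The helper names are hypothetical.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def days_to_date(days: int) -> date:
    """Convert a days-since-epoch count to a calendar date."""
    return EPOCH + timedelta(days=days)


def date_to_days(d: date) -> int:
    """Convert a calendar date to a days-since-epoch count."""
    return (d - EPOCH).days


print(days_to_date(16463))               # the constant from the physical plan
print(date_to_days(date(2015, 1, 27)))   # the day count behind the wrong answer
```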
      
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #4395 from scwf/SQLQuerySuite and squashes the following commits:
      
      1431a2d [wangfei] fix conflicts
      c35fe5e [wangfei] minor fix
      01dab3a [wangfei] fix test failure of SQLQuerySuite
      a83936e1
    • [Branch-1.3] [DOC] doc fix for date · 6fa4ac1b
      Daoyuan Wang authored
      Trivial fix.
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #4400 from adrian-wang/docdate and squashes the following commits:
      
      31bbe40 [Daoyuan Wang] doc fix for date
      6fa4ac1b
    • SPARK-5548: Fixed a race condition in AkkaUtilsSuite · 081ac69f
      Jacek Lewandowski authored
      `Await.result` and `selection.resolveOne` run with the same timeout simultaneously. When the `Await.result` timeout is reached first, a `TimeoutException` is thrown. On the other hand, when the `selection.resolveOne` timeout is reached first, an `ActorNotFoundException` is thrown. This is a race condition, and the easiest way to fix it is to increase the timeout of one method so that the code always fails on the other method first.
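
The race between two timeouts can be reproduced in miniature without Akka. The sketch below uses Python's `concurrent.futures` as a stand-in: `resolve_one` plays the role of `selection.resolveOne` and the outer `result(timeout=...)` plays the role of `Await.result`. All names here are hypothetical, and the timings are chosen only to make each outcome deterministic.

```python
import concurrent.futures
import time


class ActorNotFound(Exception):
    """Stand-in for the exception raised when the inner lookup times out."""


def resolve_one(inner_timeout: float) -> None:
    # Simulate a lookup that never succeeds and gives up after inner_timeout.
    time.sleep(inner_timeout)
    raise ActorNotFound("lookup timed out")


def await_result(inner_timeout: float, outer_timeout: float) -> str:
    """Report which of the two timeouts fires first."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(resolve_one, inner_timeout)
        try:
            future.result(timeout=outer_timeout)
        except concurrent.futures.TimeoutError:
            return "TimeoutException"
        except ActorNotFound:
            return "ActorNotFoundException"
    return "resolved"


# With equal timeouts either outcome is possible; widening the outer timeout
# makes the inner failure win every time.
print(await_result(inner_timeout=0.05, outer_timeout=1.0))
```

Giving the outer wait a clearly larger timeout is the same idea as the fix described above: only one of the two failure modes can then be observed.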
      
      Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>
      
      Closes #4343 from jacek-lewandowski/SPARK-5548-1.3 and squashes the following commits:
      
      b9ba47e [Jacek Lewandowski] SPARK-5548: Fixed a race condition in AkkaUtilsSuite
      081ac69f
    • [SPARK-5474][Build] curl should support URL redirection in build/mvn · 34147549
      GuoQiang Li authored
      Author: GuoQiang Li <witgo@qq.com>
      
      Closes #4263 from witgo/SPARK-5474 and squashes the following commits:
      
      ef397ff [GuoQiang Li] review commits
      a398324 [GuoQiang Li] curl should support URL redirection in build/mvn
      34147549
    • [SPARK-5608] Improve SEO of Spark documentation pages · 4d74f060
      Matei Zaharia authored
      - Add meta description tags on some of the most important doc pages
      - Shorten the titles of some pages to have more relevant keywords; for
        example there's no reason to have "Spark SQL Programming Guide - Spark
        1.2.0 documentation", we can just say "Spark SQL - Spark 1.2.0
        documentation".
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #4381 from mateiz/docs-seo and squashes the following commits:
      
      4940563 [Matei Zaharia] [SPARK-5608] Improve SEO of Spark documentation pages
      4d74f060
    • SPARK-4687. Add a recursive option to the addFile API · c4b1108c
      Sandy Ryza authored
      This adds a recursive option to the addFile API to satisfy Hive's needs.  It only allows specifying HDFS dirs that will be copied down on every executor.
      
      There are a couple outstanding questions.
      * Should we allow specifying local dirs as well?  The best way to do this would probably be to archive them.  The drawback is that it would require a fair bit of code that I don't know of any current use cases for.
      * The addFiles implementation has a caching component that I don't entirely understand.  What events are we caching between?  AFAICT it's users calling addFile on the same file in the same app at different times?  Do we want/need to add something similar for addDirectory?
      *  The addFiles implementation will check to see if an added file already exists and has the same contents.  I imagine we want the same behavior, so I'm planning to add this unless people think otherwise.
      
      I plan to add some tests if people are OK with the approach.
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #3670 from sryza/sandy-spark-4687 and squashes the following commits:
      
      f9fc77f [Sandy Ryza] Josh's comments
      70cd24d [Sandy Ryza] Add another test
      13da824 [Sandy Ryza] Revert executor changes
      38bf94d [Sandy Ryza] Marcelo's comments
      ca83849 [Sandy Ryza] Add addFile test
      1941be3 [Sandy Ryza] Fix test and avoid HTTP server in local mode
      31f15a9 [Sandy Ryza] Use cache recursively and fix some compile errors
      0239c3d [Sandy Ryza] Change addDirectory to addFile with recursive
      46fe70a [Sandy Ryza] SPARK-4687. Add a addDirectory API
      c4b1108c
    • [HOTFIX] MLlib build break. · 6580929f
      Reynold Xin authored
      6580929f
    • [MLlib] Minor: UDF style update. · c3ba4d4c
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4388 from rxin/mllib-style and squashes the following commits:
      
      61d465b [Reynold Xin] oops
      3364295 [Reynold Xin] Missed one ..
      5e068e3 [Reynold Xin] [MLlib] Minor: UDF style update.
      c3ba4d4c
    • [SPARK-5612][SQL] Move DataFrame implicit functions into SQLContext.implicits. · 7d789e11
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4386 from rxin/df-implicits and squashes the following commits:
      
      9d96606 [Reynold Xin] style fix
      edd296b [Reynold Xin] ReplSuite
      1c946ab [Reynold Xin] [SPARK-5612][SQL] Move DataFrame implicit functions into SQLContext.implicits.
      7d789e11
    • [SPARK-5606][SQL] Support plus sign in HiveContext · 9d3a75ef
      q00251598 authored
      Currently Spark only supports ```SELECT -key FROM DECIMAL_UDF;``` in HiveContext.
      This patch adds support for ```SELECT +key FROM DECIMAL_UDF;``` in HiveContext.
      
      Author: q00251598 <qiyadong@huawei.com>
      
      Closes #4378 from watermen/SPARK-5606 and squashes the following commits:
      
      777f132 [q00251598] sql-case22
      74dd368 [q00251598] sql-case22
      1a67410 [q00251598] sql-case22
      c5cd5bc [q00251598] sql-case22
      9d3a75ef
    • [SPARK-5599] Check MLlib public APIs for 1.3 · db346904
      Xiangrui Meng authored
      There are no breaking changes (against 1.2) in this PR. I hid `PythonMLLibAPI`, which is only called by Py4J, and renamed `SparseMatrix.diag` to `SparseMatrix.spdiag`. All other changes are documentation and annotations. The `Experimental` tag is removed from `ALS.setAlpha` and `Rating`. One issue not addressed in this PR is the `setCheckpointDir` in `LDA` (https://issues.apache.org/jira/browse/SPARK-5604).
      
      CC: srowen jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4377 from mengxr/SPARK-5599 and squashes the following commits:
      
      17975dc [Xiangrui Meng] fix tests
      4487f20 [Xiangrui Meng] remove experimental tag from each stat method because Statistics is experimental already
      3cd969a [Xiangrui Meng] remove freeman (sorry~) from StreamLA public doc
      55900f5 [Xiangrui Meng] make IR experimental and update its doc
      9b8eed3 [Xiangrui Meng] graduate Rating and setAlpha in ALS
      b854d28 [Xiangrui Meng] correct iid doc in RandomRDDs
      27f5bdd [Xiangrui Meng] update linalg docs and some new method signatures
      371721b [Xiangrui Meng] mark fpg as experimental and update its doc
      8aca7ee [Xiangrui Meng] change SLR to experimental and update the doc
      ebbb2e9 [Xiangrui Meng] mark PIC experimental and update the doc
      7830d3b [Xiangrui Meng] mark GMM experimental
      a378496 [Xiangrui Meng] use the correct subscript syntax in PIC
      c65c424 [Xiangrui Meng] update LDAModel doc
      a213b0c [Xiangrui Meng] update GMM constructor
      3993054 [Xiangrui Meng] hide algorithm in SLR
      ad6b9ce [Xiangrui Meng] Revert "make ClassificatinModel.predict(JavaRDD) return JavaDoubleRDD"
      0054684 [Xiangrui Meng] add doc to LRModel's constructor
      a89763b [Xiangrui Meng] make ClassificatinModel.predict(JavaRDD) return JavaDoubleRDD
      7c0946c [Xiangrui Meng] hide PythonMLLibAPI
      db346904
    • [SPARK-5596] [mllib] ML model import/export for GLMs, NaiveBayes · 975bcef4
      Joseph K. Bradley authored
      This is a PR for Parquet-based model import/export.  Please see the design doc on [the JIRA](https://issues.apache.org/jira/browse/SPARK-4587).
      
      Note: This includes only a subset of regression and classification models:
      * NaiveBayes, SVM, LogisticRegression
      * LinearRegression, RidgeRegression, Lasso
      
      Follow-up PRs will cover other models.
      
      Sketch of current contents:
      * New traits: Saveable, Loader
      * Implementations for some algorithms
      * Also: Added LogisticRegressionModel.getThreshold method (so that unit test could check the threshold)
      
      CC: mengxr  selvinsource
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #4233 from jkbradley/ml-import-export and squashes the following commits:
      
      87c4eb8 [Joseph K. Bradley] small cleanups
      12d9059 [Joseph K. Bradley] Many cleanups after code review.  Major changes: Storing numFeatures, numClasses in model metadata. Improvements to unit tests
      b4ee064 [Joseph K. Bradley] Reorganized save/load for regression and classification.  Renamed concepts to Saveable, Loader
      a34aef5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into ml-import-export
      ee99228 [Joseph K. Bradley] scala style fix
      79675d5 [Joseph K. Bradley] cleanups in LogisticRegression after rebasing after multinomial PR
      d1e5882 [Joseph K. Bradley] organized imports
      2935963 [Joseph K. Bradley] Added save/load and tests for most classification and regression models
      c495dba [Joseph K. Bradley] made version for model import/export local to each model
      1496852 [Joseph K. Bradley] Added save/load for NaiveBayes
      8d46386 [Joseph K. Bradley] Added save/load to NaiveBayes
      1577d70 [Joseph K. Bradley] fixed issues after rebasing on master (DataFrame patch)
      64914a3 [Joseph K. Bradley] added getThreshold to SVMModel
      b1fc5ec [Joseph K. Bradley] small cleanups
      418ba1b [Joseph K. Bradley] Added save, load to mllib.classification.LogisticRegressionModel, plus test suite
      975bcef4
    • SPARK-5607: Update to Kryo 2.24.0 to avoid including objenesis 1.2. · c23ac03c
      Patrick Wendell authored
      Our existing Kryo version actually embeds objenesis 1.2 classes in
      its jar, causing dependency conflicts during tests. This updates us to
      Kryo 2.24.0 (which was changed to not embed objenesis) to avoid this
      behavior. See the JIRA for more detail.
      
      Author: Patrick Wendell <patrick@databricks.com>
      
      Closes #4383 from pwendell/SPARK-5607 and squashes the following commits:
      
      c3b8d27 [Patrick Wendell] SPARK-5607: Update to Kryo 2.24.0 to avoid including objenesis 1.2.
      c23ac03c
  2. Feb 04, 2015
    • [SPARK-5602][SQL] Better support for creating DataFrame from local data collection · 84acd08e
      Reynold Xin authored
      1. Added methods to create DataFrames from Seq[Product]
      2. Added executeTake to avoid running a Spark job on LocalRelations.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4372 from rxin/localDataFrame and squashes the following commits:
      
      f696858 [Reynold Xin] style checker.
      839ef7f [Reynold Xin] [SPARK-5602][SQL] Better support for creating DataFrame from local data collection.
      84acd08e
    • [SPARK-5538][SQL] Fix flaky CachedTableSuite · 206f9bc3
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4379 from rxin/CachedTableSuite and squashes the following commits:
      
      f2b44ce [Reynold Xin] [SQL] Fix flaky CachedTableSuite.
      206f9bc3
    • [SQL][DataFrame] Minor cleanup. · 6b4c7f08
      Reynold Xin authored
      1. Removed LocalHiveContext in Python.
      2. Reduced DSL UDF support from 22 arguments to 10 arguments so JavaDoc/ScalaDoc look nicer.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4374 from rxin/df-style and squashes the following commits:
      
      e493342 [Reynold Xin] [SQL][DataFrame] Minor cleanup.
      6b4c7f08
    • [SPARK-4520] [SQL] This pr fixes the ArrayIndexOutOfBoundsException as raised in SPARK-4520. · dba98bf6
      Sadhan Sood authored
      
      The exception is thrown only for Thrift-generated Parquet files. The array element schema name is assumed to be "array", as in ParquetAvro, but for Thrift-generated Parquet files it is array_name + "_tuple". As a result the child of the array group type is missing, which causes the exception when the Parquet rows are materialized.
      
      Author: Sadhan Sood <sadhan@tellapart.com>
      
      Closes #4148 from sadhan/SPARK-4520 and squashes the following commits:
      
      c5ccde8 [Sadhan Sood] [SPARK-4520] [SQL] This pr fixes the ArrayIndexOutOfBoundsException as raised in SPARK-4520.
      dba98bf6
    • [SPARK-5605][SQL][DF] Allow using String to specify column name in DSL aggregate functions · 1fbd124b
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4376 from rxin/SPARK-5605 and squashes the following commits:
      
      c55f5fa [Reynold Xin] Added a Python test.
      f4b8dbb [Reynold Xin] [SPARK-5605][SQL][DF] Allow using String to specify column name in DSL aggregate functions.
      1fbd124b
    • [SPARK-5411] Allow SparkListeners to be specified in SparkConf and loaded when creating SparkContext · 9a7ce70e
      Josh Rosen authored
      
      This patch introduces a new configuration option, `spark.extraListeners`, that allows SparkListeners to be specified in SparkConf and registered before the SparkContext is initialized.  From the configuration documentation:
      
      > A comma-separated list of classes that implement SparkListener; when initializing SparkContext, instances of these classes will be created and registered with Spark's listener bus. If a class has a single-argument constructor that accepts a SparkConf, that constructor will be called; otherwise, a zero-argument constructor will be called. If no valid constructor can be found, the SparkContext creation will fail with an exception.
      
      The motivation for this patch is to allow monitoring code to be easily injected into existing Spark programs without having to modify those programs' code.
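
The constructor-selection rule quoted above can be sketched outside Spark. This is a Python analogue, not the Scala implementation: the `registry` lookup stands in for class loading, a plain dict stands in for `SparkConf`, and every name below is hypothetical.

```python
import inspect


def instantiate_listener(cls, conf):
    """Prefer a one-argument constructor taking the conf, else a zero-argument one."""
    positional = [
        p for p in inspect.signature(cls).parameters.values()
        if p.kind in (inspect.Parameter.POSITIONAL_ONLY,
                      inspect.Parameter.POSITIONAL_OR_KEYWORD)
    ]
    if len(positional) == 1:
        return cls(conf)
    if len(positional) == 0:
        return cls()
    raise TypeError(f"{cls.__name__} has no usable constructor")


def load_listeners(spec: str, registry: dict, conf) -> list:
    """Parse a comma-separated class list and instantiate each listed class."""
    names = [name.strip() for name in spec.split(",") if name.strip()]
    return [instantiate_listener(registry[name], conf) for name in names]


class ConfListener:
    def __init__(self, conf):
        self.conf = conf


class PlainListener:
    def __init__(self):
        self.conf = None


class BadListener:
    def __init__(self, a, b):
        pass


REGISTRY = {"ConfListener": ConfListener, "PlainListener": PlainListener,
            "BadListener": BadListener}
listeners = load_listeners("ConfListener, PlainListener", REGISTRY, {"app": "demo"})
```

Listing `BadListener` in the spec raises, mirroring the documented behaviour that context creation fails when no valid constructor is found.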
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #4111 from JoshRosen/SPARK-5190-register-sparklistener-in-sc-constructor and squashes the following commits:
      
      8370839 [Josh Rosen] Two minor fixes after merging with master
      6e0122c [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-5190-register-sparklistener-in-sc-constructor
      1a5b9a0 [Josh Rosen] Remove SPARK_EXTRA_LISTENERS environment variable.
      2daff9b [Josh Rosen] Add a couple of explanatory comments for SPARK_EXTRA_LISTENERS.
      b9973da [Josh Rosen] Add test to ensure that conf and env var settings are merged, not overriden.
      d6f3113 [Josh Rosen] Use getConstructors() instead of try-catch to find right constructor.
      d0d276d [Josh Rosen] Move code into setupAndStartListenerBus() method
      b22b379 [Josh Rosen] Instantiate SparkListeners from classes listed in configurations.
      9c0d8f1 [Josh Rosen] Revert "[SPARK-5190] Allow SparkListeners to be registered before SparkContext starts."
      217ecc0 [Josh Rosen] Revert "Add addSparkListener to JavaSparkContext"
      25988f3 [Josh Rosen] Add addSparkListener to JavaSparkContext
      163ba19 [Josh Rosen] [SPARK-5190] Allow SparkListeners to be registered before SparkContext starts.
      9a7ce70e
    • [SPARK-5577] Python udf for DataFrame · dc101b0e
      Davies Liu authored
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4351 from davies/python_udf and squashes the following commits:
      
      d250692 [Davies Liu] fix conflict
      34234d4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python_udf
      440f769 [Davies Liu] address comments
      f0a3121 [Davies Liu] track life cycle of broadcast
      f99b2e1 [Davies Liu] address comments
      462b334 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python_udf
      7bccc3b [Davies Liu] python udf
      58dee20 [Davies Liu] clean up
      dc101b0e
    • [SPARK-5118][SQL] Fix: create table test stored as parquet as select .. · e0490e27
      guowei2 authored
      Author: guowei2 <guowei2@asiainfo.com>
      
      Closes #3921 from guowei2/SPARK-5118 and squashes the following commits:
      
      b1ba3be [guowei2] add table file check in test case
      9da56f8 [guowei2] test case only run in Shim13
      112a0b6 [guowei2] add test case
      187c7d8 [guowei2] Fix: create table test stored as parquet as select ..
      e0490e27
    • [SQL] Use HiveContext's sessionState in HiveMetastoreCatalog.hiveDefaultTableFilePath · 548c9c2b
      Yin Huai authored
      `client.getDatabaseCurrent` uses SessionState's thread-local variable, which can be an issue.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #4355 from yhuai/defaultTablePath and squashes the following commits:
      
      84a29e5 [Yin Huai] Use HiveContext's sessionState instead of using SessionState's thread local variable.
      548c9c2b
    • [SQL] Correct the default size of TimestampType and expose NumericType · 0d81645f
      Yin Huai authored
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #4314 from yhuai/minor and squashes the following commits:
      
      d3870a7 [Yin Huai] Update test.
      6e4b0c0 [Yin Huai] Two minor changes.
      0d81645f
    • [SQL][Hiveconsole] Bring hive console code up to date and update README.md · b73d5fff
      OopsOutOfMemory authored
      Add `import org.apache.spark.sql.Dsl._` to make DSL queries work.
      Remove queryExecution, since it is not available in DataFrame.
      
      Author: OopsOutOfMemory <victorshengli@126.com>
      Author: Sheng, Li <OopsOutOfMemory@users.noreply.github.com>
      
      Closes #4330 from OopsOutOfMemory/hiveconsole and squashes the following commits:
      
      46eb790 [Sheng, Li] Update SparkBuild.scala
      d23ee9f [OopsOutOfMemory] minor
      d4dd593 [OopsOutOfMemory] refine hive console
      b73d5fff
    • [SPARK-5367][SQL] Support star expression in udfs · 417d1118
      wangfei authored
      A follow up for #4163: support  `select array(key, *) from src`
      
      Since `array(key, *)` does not match this case
      ```
      case Alias(f @ UnresolvedFunction(_, args), name) if containsStar(args) =>
        val expandedArgs = args.flatMap {
          case s: Star => s.expand(child.output, resolver)
          case o => o :: Nil
        }
      ```
      this PR adds a case to cover the `array` corner case.
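
The `flatMap` in the snippet above replaces each star argument with the child's output columns and leaves other arguments untouched. A minimal Python analogue, using plain strings in place of Catalyst expressions (the function names and the `"*"` marker are assumptions for illustration):

```python
from typing import List

STAR = "*"  # stand-in for a Star expression


def contains_star(args: List[str]) -> bool:
    """True when at least one argument is a star."""
    return STAR in args


def expand_args(args: List[str], child_output: List[str]) -> List[str]:
    """Replace every star with the child's output columns, keeping argument order."""
    expanded: List[str] = []
    for arg in args:
        if arg == STAR:
            expanded.extend(child_output)
        else:
            expanded.append(arg)
    return expanded


# array(key, *) over a child that outputs (key, value)
print(expand_args(["key", "*"], ["key", "value"]))
```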
      
      /cc liancheng
      
      Author: wangfei <wangfei1@huawei.com>
      Author: scwf <wangfei1@huawei.com>
      
      Closes #4353 from scwf/udf-star1 and squashes the following commits:
      
      4350d17 [wangfei] minor fix
      a7cd191 [wangfei] minor fix
      0942fb1 [wangfei] follow up: support select array(key, *) from src
      6ae00db [wangfei] also fix problem with array
      da1da09 [scwf] minor fix
      f87b5f9 [scwf] added test case
      587bf7e [wangfei] compile fix
      eb93c16 [wangfei] fix star resolve issue in udf
      417d1118
    • kul's avatar
      [SPARK-5426][SQL] Add SparkSQL Java API helper methods. · 424cb699
      kul authored
      Right now the PR adds a few helper methods for the Java APIs. The issue, however, was opened mainly to get rid of conversions such as `.rdd` and `.toJavaRDD` in the Java API when working with `SQLContext` or `HiveContext`.
      
      Author: kul <kuldeep.bora@gmail.com>
      
      Closes #4243 from kul/master and squashes the following commits:
      
      2390fba [kul] [SPARK-5426][SQL] Add SparkSQL Java API helper methods.
      424cb699
    • wangfei's avatar
      [SPARK-5587][SQL] Support change database owner · b90dd397
      wangfei authored
      Support changing the database owner. The golden files are not added here, since the golden answer depends on the tmp dir path (see https://github.com/scwf/spark/commit/6331e4ac0f982caf70531defcb957be76fe093c7)
      
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #4357 from scwf/db_owner and squashes the following commits:
      
      f761533 [wangfei] remove the alter_db_owner which have added to whitelist
      79413c6 [wangfei] Revert "added golden files"
      6331e4a [wangfei] added golden files
      6f7cacd [wangfei] support change database owner
      b90dd397
    • wangfei's avatar
      [SPARK-5591][SQL] Fix NoSuchObjectException for CTAS · a9f0db1f
      wangfei authored
      Currently, CTAS runs successfully but still throws a NoSuchObjectException. For example:
      ```
      create table sc as select *
      from (select '2011-01-11', '2011-01-11+14:18:26' from src tablesample (1 rows)
      union all
      select '2011-01-11', '2011-01-11+15:18:26' from src tablesample (1 rows)
      union all
      select '2011-01-11', '2011-01-11+16:18:26' from src tablesample (1 rows) ) s;
      ```
      This produces the following exception:
      ERROR Hive: NoSuchObjectException(message:default.sc table not found)
      at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1560)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:601)
      at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105)
      at $Proxy8.get_table(Unknown Source)
      at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:997)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:601)
      at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
      at $Proxy9.getTable(Unknown Source)
      at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:976)
      at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950)
      at org.apache.spark.sql.hive.HiveMetastoreCatalog.tableExists(HiveMetastoreCatalog.scala:152)
      at org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$tableExists(HiveContext.scala:309)
      at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.tableExists(Catalog.scala:121)
      at org.apache.spark.sql.hive.HiveContext$$anon$2.tableExists(HiveContext.scala:309)
      at org.apache.spark.sql.hive.execution.CreateTableAsSelect.run(CreateTableAsSelect.scala:63)
      at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:53)
      
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #4365 from scwf/ctas-exception and squashes the following commits:
      
      c7c67bc [wangfei] no used imports
      f54eb2a [wangfei] fix exception for CTAS
      a9f0db1f
    • Davies Liu's avatar
      [SPARK-4939] move to next locality when no pending tasks · 0a89b156
      Davies Liu authored
      Currently, if a task set contains tasks with different locality levels, the NODE_LOCAL tasks only get scheduled after all the PROCESS_LOCAL tasks have been scheduled and the spark.locality.wait.process timeout (3 seconds by default) has expired. In local mode, the LocalScheduler never calls resourceOffer() again once it has failed to get a task at the same locality level, so the NODE_LOCAL tasks are never scheduled.
      
      This bug can be reproduced by running the example python/streaming/stateful_network_wordcount.py: it hangs after finishing a batch with some data.
      
      This patch checks whether there are any tasks for the current locality level; if not, it moves to the next locality level without waiting for `spark.locality.wait.process` seconds. It works for all locality levels.
      
      Because the list of pending tasks is updated lazily, the check can give a false positive: the scheduler may stay at the current locality level even though there are no valid pending tasks left for it, and will then wait for the timeout as before.
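
      The behaviour described above can be illustrated with a small standalone model. This is a simplified sketch, not the TaskSetManager code: `LocalityTracker`, its fields, and the single shared timeout are assumptions made for illustration (Spark keeps a separate wait per level), and only the 3-second default comes from the description.

```python
import time
from typing import Dict, List, Optional

# Locality levels ordered from most to least local, using Spark's level names.
LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]


class LocalityTracker:
    """Hypothetical model of how the allowed locality level advances."""

    def __init__(self, pending: Dict[str, List[str]], wait_seconds: float = 3.0):
        self.pending = pending              # level name -> pending task ids
        self.wait_seconds = wait_seconds    # assumed: one timeout shared by all levels
        self.index = 0                      # position of the current level in LEVELS
        self.last_launch = time.monotonic()

    def has_pending(self, level: str) -> bool:
        return bool(self.pending.get(level))

    def allowed_level(self, now: Optional[float] = None) -> str:
        now = time.monotonic() if now is None else now
        while self.index < len(LEVELS) - 1:
            if not self.has_pending(LEVELS[self.index]):
                # Nothing can run at this level, so waiting is pointless:
                # move on immediately and restart the wait clock.
                self.last_launch = now
                self.index += 1
            elif now - self.last_launch >= self.wait_seconds:
                # Tasks exist at this level but the wait has expired.
                self.last_launch += self.wait_seconds
                self.index += 1
            else:
                break
        return LEVELS[self.index]
```

      With only NODE_LOCAL tasks pending, the first call already returns `NODE_LOCAL` instead of idling at `PROCESS_LOCAL` until the timeout. If the pending list still holds stale entries for a level, `has_pending` stays true and the model falls back to the timeout branch, which is the false-positive case mentioned above.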
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3779 from davies/local_streaming and squashes the following commits:
      
      2d25fb3 [Davies Liu] Update TaskSetManager.scala
      1550668 [Davies Liu] add comment
      1c37aac [Davies Liu] address comments
      6b13824 [Davies Liu] address comments
      906f456 [Davies Liu] Merge branch 'master' of github.com:apache/spark into local_streaming
      414e79e [Davies Liu] fix bug, add logging
      ff8eabb [Davies Liu] Merge branch 'master' into local_streaming
      28d1b3c [Davies Liu] check tasks
      9d0ceab [Davies Liu] Merge branch 'master' of github.com:apache/spark into local_streaming
      37a2804 [Davies Liu] fix tests
      49bda82 [Davies Liu] address comment
      d8fb95a [Davies Liu] move to next locality level if no more tasks
      2d6ae73 [Davies Liu] add comments
      32d363f [Davies Liu] add regression test
      7d8c5a5 [Davies Liu] jump to next locality if no pending tasks for executors
      0a89b156
    • Hari Shreedharan's avatar
      [SPARK-4707][STREAMING] Reliable Kafka Receiver can lose data if the blo... · f0500f9f
      Hari Shreedharan authored
      ...ck generator fails to store data.
      
      The Reliable Kafka Receiver commits offsets only when events are actually stored, which ensures that on restart we start where we left off. But if a failure happens in the store() call and the block generator reports an error, the receiver does nothing and continues reading from the current offset rather than from the last commit. This means that messages between the last commit and the current offset are lost.
      
      This PR retries the store call four times and then stops the receiver with an error message and the last exception that was received from the store.
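
      The retry-then-stop behaviour could look roughly like the sketch below. It is not the receiver's actual code: `store_with_retry`, the `store` and `stop_receiver` callbacks, and the error message are hypothetical, and "four times" is read here as four attempts in total, which is an assumption.

```python
from typing import Callable, Optional

# Assumption: "retries the store call four times" means four attempts in total.
MAX_STORE_ATTEMPTS = 4


def store_with_retry(
    store: Callable[[], None],
    stop_receiver: Callable[[str, Optional[Exception]], None],
    max_attempts: int = MAX_STORE_ATTEMPTS,
) -> bool:
    """Try `store` up to `max_attempts` times.

    Returns True once a store succeeds, so the caller may commit offsets.
    If every attempt fails, calls `stop_receiver` with a message and the
    last exception, then returns False so no offsets are committed.
    """
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            store()
            return True
        except Exception as exc:  # keep only the most recent failure
            last_error = exc
    stop_receiver("Error while storing block", last_error)
    return False
```

      The point of the design is that offsets are committed only when the function returns True, so a failed store can no longer silently advance past unstored messages.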
      
      Author: Hari Shreedharan <hshreedharan@apache.org>
      
      Closes #3655 from harishreedharan/kafka-failure-fix and squashes the following commits:
      
      5e2e7ad [Hari Shreedharan] [SPARK-4704][STREAMING] Reliable Kafka Receiver can lose data if the block generator fails to store data.
      f0500f9f
    • cody koeninger's avatar
      [SPARK-4964] [Streaming] Exactly-once semantics for Kafka · b0c00219
      cody koeninger authored
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #3798 from koeninger/kafkaRdd and squashes the following commits:
      
      1dc2941 [cody koeninger] [SPARK-4964] silence ConsumerConfig warnings about broker connection props
      59e29f6 [cody koeninger] [SPARK-4964] settle on "Direct" as a naming convention for the new stream
      8c31855 [cody koeninger] [SPARK-4964] remove HasOffsetRanges interface from return types
      0df3ebe [cody koeninger] [SPARK-4964] add comments per pwendell / dibbhatt
      8991017 [cody koeninger] [SPARK-4964] formatting
      825110f [cody koeninger] [SPARK-4964] rename stuff per TD
      4354bce [cody koeninger] [SPARK-4964] per td, remove java interfaces, replace with final classes, corresponding changes to KafkaRDD constructor and checkpointing
      9adaa0a [cody koeninger] [SPARK-4964] formatting
      0090553 [cody koeninger] [SPARK-4964] javafication of interfaces
      9a838c2 [cody koeninger] [SPARK-4964] code cleanup, add more tests
      2b340d8 [cody koeninger] [SPARK-4964] refactor per TD feedback
      80fd6ae [cody koeninger] [SPARK-4964] Rename createExactlyOnceStream so it isnt over-promising, change doc
      99d2eba [cody koeninger] [SPARK-4964] Reduce level of nesting.  If beginning is past end, its actually an error (may happen if Kafka topic was deleted and recreated)
      19406cc [cody koeninger] Merge branch 'master' of https://github.com/apache/spark into kafkaRdd
      2e67117 [cody koeninger] [SPARK-4964] one potential way of hiding most of the implementation, while still allowing access to offsets (but not subclassing)
      bb80bbe [cody koeninger] [SPARK-4964] scalastyle line length
      d4a7cf7 [cody koeninger] [SPARK-4964] allow for use cases that need to override compute for custom kafka dstreams
      c1bd6d9 [cody koeninger] [SPARK-4964] use newly available attemptNumber for correct retry behavior
      548d529 [cody koeninger] Merge branch 'master' of https://github.com/apache/spark into kafkaRdd
      0458e4e [cody koeninger] [SPARK-4964] recovery of generated rdds from checkpoint
      e86317b [cody koeninger] [SPARK-4964] try seed brokers in random order to spread metadata requests
      e93eb72 [cody koeninger] [SPARK-4964] refactor to add preferredLocations.  depends on SPARK-4014
      356c7cc [cody koeninger] [SPARK-4964] code cleanup per helena
      adf99a6 [cody koeninger] [SPARK-4964] fix serialization issues for checkpointing
      1d50749 [cody koeninger] [SPARK-4964] code cleanup per tdas
      8bfd6c0 [cody koeninger] [SPARK-4964] configure rate limiting via spark.streaming.receiver.maxRate
      e09045b [cody koeninger] [SPARK-4964] add foreachPartitionWithIndex, to avoid doing equivalent map + empty foreach boilerplate
      cac63ee [cody koeninger] additional testing, fix fencepost error
      37d3053 [cody koeninger] make KafkaRDDPartition available to users so offsets can be committed per partition
      bcca8a4 [cody koeninger] Merge branch 'master' of https://github.com/apache/spark into kafkaRdd
      6bf14f2 [cody koeninger] first attempt at a Kafka dstream that allows for exactly-once semantics
      326ff3c [cody koeninger] add some tests
      38bb727 [cody koeninger] give easy access to the parameters of a KafkaRDD
      979da25 [cody koeninger] dont allow empty leader offsets to be returned
      8d7de4a [cody koeninger] make sure leader offsets can be found even for leaders that arent in the seed brokers
      4b078bf [cody koeninger] differentiate between leader and consumer offsets in error message
      3c2a96a [cody koeninger] fix scalastyle errors
      29c6b43 [cody koeninger] cleanup logging
      783b477 [cody koeninger] update tests for kafka 8.1.1
      7d050bc [cody koeninger] methods to set consumer offsets and get topic metadata, switch back to inclusive start / exclusive end to match typical kafka consumer behavior
      ce91c59 [cody koeninger] method to get consumer offsets, explicit error handling
      4dafd1b [cody koeninger] method to get leader offsets, switch rdd bound to being exclusive start, inclusive end to match offsets typically returned from cluster
      0b94b33 [cody koeninger] use dropWhile rather than filter to trim beginning of fetch response
      1d70625 [cody koeninger] WIP on kafka cluster
      76913e2 [cody koeninger] Batch oriented kafka rdd, WIP. todo: cluster metadata / finding leader
      b0c00219
    • Davies Liu's avatar
      [SPARK-5588] [SQL] support select/filter by SQL expression · ac0b2b78
      Davies Liu authored
      ```
      df.selectExpr('a + 1', 'abs(age)')
      df.filter('age > 3')
      df[ df.age > 3 ]
      df[ ['age', 'name'] ]
      ```
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4359 from davies/select_expr and squashes the following commits:
      
      d99856b [Davies Liu] support select/filter by SQL expression
      ac0b2b78