  1. Nov 04, 2014
    • [SQL] Add String option for DSL AS · 515abb9a
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #3097 from marmbrus/asString and squashes the following commits:
      
      6430520 [Michael Armbrust] Add String option for DSL AS
    • [SPARK-2938] Support SASL authentication in NettyBlockTransferService · 5e73138a
      Aaron Davidson authored
      Also lays the groundwork for supporting it inside the external shuffle service.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3087 from aarondav/sasl and squashes the following commits:
      
      3481718 [Aaron Davidson] Delete rogue println
      44f8410 [Aaron Davidson] Delete documentation - muahaha!
      eb9f065 [Aaron Davidson] Improve documentation and add end-to-end test at Spark-level
      a6b95f1 [Aaron Davidson] Address comments
      785bbde [Aaron Davidson] Cleanup
      79973cb [Aaron Davidson] Remove unused file
      151b3c5 [Aaron Davidson] Add docs, timeout config, better failure handling
      f6177d7 [Aaron Davidson] Cleanup SASL state upon connection termination
      7b42adb [Aaron Davidson] Add unit tests
      8191bcb [Aaron Davidson] [SPARK-2938] Support SASL authentication in NettyBlockTransferService
    • [Spark-4060] [MLlib] exposing special rdd functions to the public · f90ad5d4
      Niklas Wilcke authored
      Author: Niklas Wilcke <1wilcke@informatik.uni-hamburg.de>
      
      Closes #2907 from numbnut/master and squashes the following commits:
      
      7f7c767 [Niklas Wilcke] [Spark-4060] [MLlib] exposing special rdd functions to the public, #2907
    • fixed MLlib Naive-Bayes java example bug · bcecd73f
      Dariusz Kobylarz authored
      The filter tested Double objects by reference, whereas it should compare their values.
      
      Author: Dariusz Kobylarz <darek.kobylarz@gmail.com>
      
      Closes #3081 from dkobylarz/master and squashes the following commits:
      
      5d43a39 [Dariusz Kobylarz] naive bayes example update
      a304b93 [Dariusz Kobylarz] fixed MLlib Naive-Bayes java example bug
    • [SPARK-3886] [PySpark] simplify serializer, use AutoBatchedSerializer by default. · e4f42631
      Davies Liu authored
      This PR simplifies the serializer: it always uses a batched serializer (AutoBatchedSerializer by default), even when the batch size is 1.
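
To illustrate the idea of always writing batches, here is a minimal sketch in plain Python. It is not PySpark's implementation: `dump_batched`, `load_batched`, and the `target_bytes` growth rule are hypothetical names and assumptions made for illustration only.

```python
import io
import pickle
import struct


def dump_batched(items, stream, target_bytes=1 << 16):
    """Write items as length-prefixed pickled batches.

    The batch size starts at 1 and doubles while serialized batches stay
    below target_bytes (an assumed growth rule, not PySpark's).
    """
    batch_size = 1
    iterator = iter(items)
    while True:
        batch = []
        for item in iterator:
            batch.append(item)
            if len(batch) >= batch_size:
                break
        if not batch:
            break
        payload = pickle.dumps(batch)
        stream.write(struct.pack("!I", len(payload)))
        stream.write(payload)
        if len(payload) < target_bytes:
            batch_size *= 2


def load_batched(stream):
    """Yield items one by one, regardless of where batch boundaries fell."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return
        (length,) = struct.unpack("!I", header)
        for item in pickle.loads(stream.read(length)):
            yield item
```

Because the reader only ever sees batches, a batch size of 1 needs no special code path, which is the simplification the description refers to.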
      
      Author: Davies Liu <davies@databricks.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Josh Rosen <joshrosen@databricks.com>
      
      Closes #2920 from davies/fix_autobatch and squashes the following commits:
      
      e544ef9 [Davies Liu] revert unrelated change
      6880b14 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      1d557fc [Davies Liu] fix tests
      8180907 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      76abdce [Davies Liu] clean up
      53fa60b [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      d7ac751 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      2cc2497 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      b4292ce [Davies Liu] fix bug in master
      d79744c [Davies Liu] recover hive tests
      be37ece [Davies Liu] refactor
      eb3938d [Davies Liu] refactor serializer in scala
      8d77ef2 [Davies Liu] simplify serializer, use AutoBatchedSerializer by default.
    • [SPARK-4166][Core] Add a backward compatibility test for ExecutorLostFailure · b671ce04
      zsxwing authored
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3085 from zsxwing/SPARK-4166-back-comp and squashes the following commits:
      
      89329f4 [zsxwing] Add a backward compatibility test for ExecutorLostFailure
    • [SPARK-4163][Core] Add a backward compatibility test for FetchFailed · 9bdc8412
      zsxwing authored
      /cc aarondav
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3086 from zsxwing/SPARK-4163-back-comp and squashes the following commits:
      
      21cb2a8 [zsxwing] Add a backward compatibility test for FetchFailed
    • [SPARK-3573][MLLIB] Make MLlib's Vector compatible with SQL's SchemaRDD · 1a9c6cdd
      Xiangrui Meng authored
      Register MLlib's Vector as a SQL user-defined type (UDT) in both Scala and Python. With this PR, we can easily map a RDD[LabeledPoint] to a SchemaRDD, and then select columns or save to a Parquet file. Examples in Scala/Python are attached. The Scala code was copied from jkbradley.
      
      ~~This PR contains the changes from #3068 . I will rebase after #3068 is merged.~~
      
      marmbrus jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3070 from mengxr/SPARK-3573 and squashes the following commits:
      
      3a0b6e5 [Xiangrui Meng] organize imports
      236f0a0 [Xiangrui Meng] register vector as UDT and provide dataset examples
  2. Nov 03, 2014
    • [SPARK-4192][SQL] Internal API for Python UDT · 04450d11
      Xiangrui Meng authored
      Following #2919, this PR adds Python UDT (for internal use only) with tests under "pyspark.tests". Before `SQLContext.applySchema`, we check whether we need to convert user-type instances into SQL-recognizable data. In the current implementation, a Python UDT must be paired with a Scala UDT for serialization on the JVM side. A follow-up PR will add VectorUDT in MLlib for both Scala and Python.
      
      marmbrus jkbradley davies
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3068 from mengxr/SPARK-4192-sql and squashes the following commits:
      
      acff637 [Xiangrui Meng] merge master
      dba5ea7 [Xiangrui Meng] only use pyClass for Python UDT output sqlType as well
      2c9d7e4 [Xiangrui Meng] move import to global setup; update needsConversion
      7c4a6a9 [Xiangrui Meng] address comments
      75223db [Xiangrui Meng] minor update
      f740379 [Xiangrui Meng] remove UDT from default imports
      e98d9d0 [Xiangrui Meng] fix py style
      4e84fce [Xiangrui Meng] remove local hive tests and add more tests
      39f19e0 [Xiangrui Meng] add tests
      b7f666d [Xiangrui Meng] add Python UDT
    • [FIX][MLLIB] fix seed in BaggedPointSuite · c5912ecc
      Xiangrui Meng authored
      Saw Jenkins test failures due to random seeds.
      
      jkbradley manishamde
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3084 from mengxr/fix-baggedpoint-suite and squashes the following commits:
      
      f735a43 [Xiangrui Meng] fix seed in BaggedPointSuite
    • [SPARK-611] Display executor thread dumps in web UI · 4f035dd2
      Josh Rosen authored
      This patch allows executor thread dumps to be collected on-demand and viewed in the Spark web UI.
      
      The thread dumps are collected using Thread.getAllStackTraces().  To allow remote thread dumps to be triggered from the web UI, I added a new `ExecutorActor` that runs inside of the Executor actor system and responds to RPCs from the driver.  The driver's mechanism for obtaining a reference to this actor is a little bit hacky: it uses the block manager master actor to determine the host/port of the executor actor systems in order to construct ActorRefs to ExecutorActor.  Unfortunately, I couldn't find a much cleaner way to do this without a big refactoring of the executor -> driver communication.
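
As a rough analogue of collecting stack traces for every live thread on demand, the sketch below does the same in Python. It is only an illustration of the idea: the patch itself uses the JVM's Thread.getAllStackTraces(), and `collect_thread_dump` is a hypothetical helper, not part of Spark.

```python
import sys
import threading
import traceback


def collect_thread_dump():
    """Return {thread name: formatted stack trace} for all live threads."""
    names = {t.ident: t.name for t in threading.enumerate()}
    dump = {}
    # sys._current_frames() maps each thread id to its current frame.
    for ident, frame in sys._current_frames().items():
        name = names.get(ident, "thread-%d" % ident)
        dump[name] = "".join(traceback.format_stack(frame))
    return dump
```

Collecting the dump only when it is requested keeps the cost at zero until someone opens the page, which matches the on-demand approach described above.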
      
      Screenshots:
      
      ![image](https://cloud.githubusercontent.com/assets/50748/4781793/7e7a0776-5cbf-11e4-874d-a91cd04620bd.png)
      
      ![image](https://cloud.githubusercontent.com/assets/50748/4781794/8bce76aa-5cbf-11e4-8d13-8477748c9f7e.png)
      
      ![image](https://cloud.githubusercontent.com/assets/50748/4781797/bd11a8b8-5cbf-11e4-9ad7-a7459467ec8e.png)
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #2944 from JoshRosen/jstack-in-web-ui and squashes the following commits:
      
      3c21a5d [Josh Rosen] Address review comments:
      880f7f7 [Josh Rosen] Merge remote-tracking branch 'origin/master' into jstack-in-web-ui
      f719266 [Josh Rosen] Merge remote-tracking branch 'origin/master' into jstack-in-web-ui
      19707b0 [Josh Rosen] Add one comment.
      127a130 [Josh Rosen] Update to use SparkContext.DRIVER_IDENTIFIER
      b8e69aa [Josh Rosen] Merge remote-tracking branch 'origin/master' into jstack-in-web-ui
      3dfc2d4 [Josh Rosen] Add missing file.
      bc1e675 [Josh Rosen] Undo some leftover changes from the earlier approach.
      f4ac1c1 [Josh Rosen] Switch to on-demand collection of thread dumps
      dfec08b [Josh Rosen] Add option to disable thread dumps in UI.
      4c87d7f [Josh Rosen] Use separate RPC for sending thread dumps.
      2b8bdf3 [Josh Rosen] Enable thread dumps from the driver when running in non-local mode.
      cc3e6b3 [Josh Rosen] Fix test code in DAGSchedulerSuite.
      87b8b65 [Josh Rosen] Add new listener event for thread dumps.
      8c10216 [Josh Rosen] Add missing file.
      0f198ac [Josh Rosen] [SPARK-611] Display executor thread dumps in web UI
    • [SPARK-4168][WebUI] web statges number should show correctly when stages are more than 1000 · 97a466ec
      Zhang, Liye authored
      The numbers of completed and failed stages shown on the web UI are always less than 1000. This is misleading when thousands of stages have already completed or failed. The numbers should be correct even when only some of the stages are listed on the web UI (stage info is removed once the number of stages grows too large).
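
A minimal sketch of the idea, assuming the fix amounts to keeping a running count that is independent of the bounded list of retained stages. `StageTracker` and `retained_stages` are hypothetical names for illustration, not Spark's classes or settings.

```python
from collections import deque


class StageTracker:
    """Keeps a bounded list of recent stages plus an unbounded running count."""

    def __init__(self, retained_stages=1000):
        # Old entries are evicted once the deque is full.
        self.completed = deque(maxlen=retained_stages)
        # The counter is never decremented when entries are evicted.
        self.num_completed = 0

    def on_stage_completed(self, stage_id):
        self.completed.append(stage_id)
        self.num_completed += 1
```

The UI can then list `completed` while reporting `num_completed` as the total, so the displayed number stays correct after eviction.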
      
      Author: Zhang, Liye <liye.zhang@intel.com>
      
      Closes #3035 from liyezhang556520/webStageNum and squashes the following commits:
      
      d9e29fb [Zhang, Liye] add detailed comments for variables
      4ea8fd1 [Zhang, Liye] change variable name accroding to comments
      f4c404d [Zhang, Liye] [SPARK-4168][WebUI] web statges number should show correctly when stages are more than 1000
    • [SQL] Convert arguments to Scala UDFs · 15b58a22
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #3077 from marmbrus/udfsWithUdts and squashes the following commits:
      
      34b5f27 [Michael Armbrust] style
      504adef [Michael Armbrust] Convert arguments to Scala UDFs
    • SPARK-4178. Hadoop input metrics ignore bytes read in RecordReader instantiation · 28128150
      Sandy Ryza authored
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #3045 from sryza/sandy-spark-4178 and squashes the following commits:
      
      8d2e70e [Sandy Ryza] Kostas's review feedback
      e5b27c0 [Sandy Ryza] SPARK-4178. Hadoop input metrics ignore bytes read in RecordReader instantiation
    • [SQL] More aggressive defaults · 25bef7e6
      Michael Armbrust authored
       - Turns on compression for in-memory cached data by default
       - Changes the default parquet compression format back to gzip (we have seen more OOMs with production workloads due to the way Snappy allocates memory)
       - Ups the batch size to 10,000 rows
       - Increases the broadcast threshold to 10mb.
       - Uses our parquet implementation instead of the hive one by default.
       - Cache parquet metadata by default.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #3064 from marmbrus/fasterDefaults and squashes the following commits:
      
      97ee9f8 [Michael Armbrust] parquet codec docs
      e641694 [Michael Armbrust] Remote also
      a12866a [Michael Armbrust] Cache metadata.
      2d73acc [Michael Armbrust] Update docs defaults.
      d63d2d5 [Michael Armbrust] document parquet option
      da373f9 [Michael Armbrust] More aggressive defaults
    • [SPARK-4152] [SQL] Avoid data change in CTAS while table already existed · e83f13e8
      Cheng Hao authored
      CREATE TABLE t1 (a String);
      CREATE TABLE t1 AS SELECT key FROM src; -- throws an exception
      CREATE TABLE IF NOT EXISTS t1 AS SELECT key FROM src; -- expected to do nothing; currently it overwrites t1, which is incorrect
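
The expected semantics can be checked against another engine. The sketch below uses Python's built-in `sqlite3` purely as a stand-in to show the intended behaviour; it says nothing about how Spark SQL implements the fix, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (key INTEGER)")
conn.executemany("INSERT INTO src VALUES (?)", [(1,), (2,), (3,)])
conn.execute("CREATE TABLE t1 (a TEXT)")
conn.execute("INSERT INTO t1 VALUES ('original')")

# Plain CTAS on an existing table must fail.
try:
    conn.execute("CREATE TABLE t1 AS SELECT key FROM src")
    ctas_failed = False
except sqlite3.OperationalError:
    ctas_failed = True

# CTAS with IF NOT EXISTS must leave the existing table untouched.
conn.execute("CREATE TABLE IF NOT EXISTS t1 AS SELECT key FROM src")
rows = conn.execute("SELECT a FROM t1").fetchall()
```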
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3013 from chenghao-intel/ctas_unittest and squashes the following commits:
      
      194113e [Cheng Hao] fix bug in CTAS when table already existed
    • [SPARK-4202][SQL] Simple DSL support for Scala UDF · c238fb42
      Cheng Lian authored
      This feature is based on an offline discussion with mengxr and will hopefully be useful for the new MLlib pipeline API.
      
      For the following test snippet
      
      ```scala
      case class KeyValue(key: Int, value: String)
      val testData = sc.parallelize(1 to 10).map(i => KeyValue(i, i.toString)).toSchemaRDD
      val foo = (a: Int, b: String) => a.toString + b
      ```
      
      the newly introduced DSL enables the following syntax
      
      ```scala
      import org.apache.spark.sql.catalyst.dsl._
      testData.select(Star(None), foo.call('key, 'value) as 'result)
      ```
      
      which is equivalent to
      
      ```scala
      testData.registerTempTable("testData")
      sqlContext.registerFunction("foo", foo)
      sql("SELECT *, foo(key, value) AS result FROM testData")
      ```
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3067 from liancheng/udf-dsl and squashes the following commits:
      
      f132818 [Cheng Lian] Adds DSL support for Scala UDF
    • [SPARK-3594] [PySpark] [SQL] take more rows to infer schema or sampling · 24544fbc
      Davies Liu authored
      This patch tries to infer the schema of an RDD whose first row contains empty values (None, [], {}). It examines the first 100 rows and merges their types into one schema, also merging the fields of StructTypes. If the schema still contains a NullType, it shows a warning telling the user to retry with sampling.

      If a sampling ratio is given, the schema is inferred from all rows in the sample.

      Also adds samplingRatio to jsonFile() and jsonRDD().
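
The merging step can be sketched in plain Python as follows. This is a simplified illustration, not the PySpark code: rows are plain dicts, types are represented by their Python type names, and `infer_type`, `merge_types`, and `infer_schema` are hypothetical helpers.

```python
def infer_type(value):
    """Map a Python value to a simple type name; empty values give 'null'."""
    if value is None or value == [] or value == {}:
        return "null"
    return type(value).__name__


def merge_types(a, b):
    """Prefer the non-null type; conflicting non-null types are an error."""
    if a == "null":
        return b
    if b == "null" or a == b:
        return a
    raise TypeError("cannot merge %s and %s" % (a, b))


def infer_schema(rows, max_rows=100):
    """Merge per-field types over the first max_rows rows (dicts)."""
    schema = {}
    for row in rows[:max_rows]:
        for field, value in row.items():
            schema[field] = merge_types(schema.get(field, "null"), infer_type(value))
    return schema
```

A field that is still `"null"` after the merge corresponds to the case where the patch warns the user and suggests sampling.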
      
      Author: Davies Liu <davies.liu@gmail.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #2716 from davies/infer and squashes the following commits:
      
      e678f6d [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
      34b5c63 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
      567dc60 [Davies Liu] update docs
      9767b27 [Davies Liu] Merge branch 'master' into infer
      e48d7fb [Davies Liu] fix tests
      29e94d5 [Davies Liu] let NullType inherit from PrimitiveType
      ee5d524 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
      540d1d5 [Davies Liu] merge fields for StructType
      f93fd84 [Davies Liu] add more tests
      3603e00 [Davies Liu] take more rows to infer schema, or infer the schema by sampling the RDD
    • [SPARK-4207][SQL] Query which has syntax like 'not like' is not working in Spark SQL · 2b6e1ce6
      ravipesala authored
      Queries that use 'not like' do not work in Spark SQL.
      
      sql("SELECT * FROM records where value not like 'val%'")
      The same query works in Spark HiveQL.
      
      Author: ravipesala <ravindra.pesala@huawei.com>
      
      Closes #3075 from ravipesala/SPARK-4207 and squashes the following commits:
      
      35c11e7 [ravipesala] Supported 'not like' syntax in sql
    • [SPARK-4211][Build] Fixes hive.version in Maven profile hive-0.13.1 · df607da0
      fi authored
      The `hive-0.13.1` Maven profile now uses the Spark-compatible `hive.version=0.13.1a` instead of `hive.version=0.13.1`.
      e.g. mvn -Phive -Phive-0.13.1
      
      Note: `hive.version=0.13.1a` is the default property value. However, when explicitly specifying the `hive-0.13.1` maven profile, the wrong one would be selected.
      References:  PR #2685, which resolved a package incompatibility issue with Hive-0.13.1 by introducing a special version Hive-0.13.1a
      
      Author: fi <coderfi@gmail.com>
      
      Closes #3072 from coderfi/master and squashes the following commits:
      
      7ca4b1e [fi] Fixes the `hive-0.13.1` maven profile referencing `hive.version=0.13.1` instead of the Spark compatible `hive.version=0.13.1a` Note: `hive.version=0.13.1a` is the default version. However, when explicitly specifying the `hive-0.13.1` maven profile, the wrong one would be selected. e.g. mvn -Phive -Phive=0.13.1 See PR #2685
    • [SPARK-4148][PySpark] fix seed distribution and add some tests for rdd.sample · 3cca1962
      Xiangrui Meng authored
      The current way of seed distribution makes the random sequences from partition i and i+1 offset by 1.
      
      ~~~
      In [14]: import random
      
      In [15]: r1 = random.Random(10)
      
      In [16]: r1.randint(0, 1)
      Out[16]: 1
      
      In [17]: r1.random()
      Out[17]: 0.4288890546751146
      
      In [18]: r1.random()
      Out[18]: 0.5780913011344704
      
      In [19]: r2 = random.Random(10)
      
      In [20]: r2.randint(0, 1)
      Out[20]: 1
      
      In [21]: r2.randint(0, 1)
      Out[21]: 0
      
      In [22]: r2.random()
      Out[22]: 0.5780913011344704
      ~~~
      
      Note: The new tests are not for this bug fix.
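
The offset problem shown in the session above can be reproduced with a small sketch. Both functions are hypothetical: `skip_ahead_stream` mimics a scheme in which every partition starts from the same seed and discards one draw per partition index, and `mixed_seed_stream` shows one possible alternative (XOR-ing the seed with the partition index), which is an assumption and not necessarily what this patch does.

```python
import random


def skip_ahead_stream(seed, split, n):
    """Same seed for every partition; partition `split` discards `split` draws."""
    rng = random.Random(seed)
    for _ in range(split):
        rng.random()
    return [rng.random() for _ in range(n)]


def mixed_seed_stream(seed, split, n):
    """Derive a distinct seed per partition (here by XOR, as an assumption)."""
    rng = random.Random(seed ^ split)
    return [rng.random() for _ in range(n)]
```

With the first scheme, the sequence for partition 1 is the sequence for partition 0 shifted by one element, so samples drawn in neighbouring partitions are strongly correlated.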
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3010 from mengxr/SPARK-4148 and squashes the following commits:
      
      869ae4b [Xiangrui Meng] move tests tests.py
      c1bacd9 [Xiangrui Meng] fix seed distribution and add some tests for rdd.sample
    • [EC2] Factor out Mesos spark-ec2 branch · 2aca97c7
      Nicholas Chammas authored
      We reference a specific branch in two places. This patch makes it one place.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #3008 from nchammas/mesos-spark-ec2-branch and squashes the following commits:
      
      10a6089 [Nicholas Chammas] factor out mess spark-ec2 branch
    • [SPARK-4163][Core][WebUI] Send the fetch failure message back to Web UI · 76386e1a
      zsxwing authored
      This is a PR to send the fetch failure message back to Web UI.
      Before:
      ![f1](https://cloud.githubusercontent.com/assets/1000778/4856595/1f036c80-60be-11e4-956f-335147fbccb7.png)
      ![f2](https://cloud.githubusercontent.com/assets/1000778/4856596/1f11cbea-60be-11e4-8fe9-9f9b2b35c884.png)
      
      After (Please ignore the meaning of exception, I threw it in the code directly because it's hard to simulate a fetch failure):
      ![e1](https://cloud.githubusercontent.com/assets/1000778/4856600/2657ea38-60be-11e4-9f2d-d56c5f900f10.png)
      ![e2](https://cloud.githubusercontent.com/assets/1000778/4856601/26595008-60be-11e4-912b-2744af786991.png)
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3032 from zsxwing/SPARK-4163 and squashes the following commits:
      
      f7e1faf [zsxwing] Discard changes for FetchFailedException and minor modification
      4e946f7 [zsxwing] Add e as the cause of SparkException
      316767d [zsxwing] Add private[storage] to FetchResult
      d51b0b6 [zsxwing] Set e as the cause of FetchFailedException
      b88c919 [zsxwing] Use 'private[storage]' for case classes instead of 'sealed'
      62103fd [zsxwing] Update as per review
      0c07d1f [zsxwing] Backward-compatible support
      a3bca65 [zsxwing] Send the fetch failure message back to Web UI
    • [SPARK-4177][Doc]update build doc since JDBC/CLI support hive 13 now · 001acc44
      wangfei authored
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #3042 from scwf/patch-9 and squashes the following commits:
      
      3784ed1 [wangfei] remove 'TODO'
      1891553 [wangfei] update build doc since JDBC/CLI support hive 13
  3. Nov 02, 2014
    • Close #2971. · d6e4c591
      Reynold Xin authored
    • [SPARK-4183] Enable NettyBlockTransferService by default · 1ae51f6d
      Aaron Davidson authored
      Note that we're turning this on for at least the first part of the QA period as a trial. We want to enable this (and deprecate the NioBlockTransferService) as soon as possible in the hopes that NettyBlockTransferService will be more stable and easier to maintain. We will turn it off if we run into major issues.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3049 from aarondav/enable-netty and squashes the following commits:
      
      bb981cc [Aaron Davidson] [SPARK-4183] Enable NettyBlockTransferService by default
    • [SPARK-3572] [SQL] Internal API for User-Defined Types · ebd64805
      Joseph K. Bradley authored
      This PR adds User-Defined Types (UDTs) to SQL. It is a precursor to using SchemaRDD as a Dataset for the new MLlib API. Currently, the UDT API is private since there is incomplete support (e.g., no Java or Python support yet).
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      Author: Michael Armbrust <michael@databricks.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3063 from marmbrus/udts and squashes the following commits:
      
      7ccfc0d [Michael Armbrust] remove println
      46a3aee [Michael Armbrust] Slightly easier to read test output.
      6cc434d [Michael Armbrust] Recursively convert rows.
      e369b91 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into udts
      15c10a6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into sql-udt2
      f3c72fe [Joseph K. Bradley] Fixing merge
      e13cd8a [Joseph K. Bradley] Removed Vector UDTs
      5817b2b [Joseph K. Bradley] style edits
      30ce5b2 [Joseph K. Bradley] updates based on code review
      d063380 [Joseph K. Bradley] Cleaned up Java UDT Suite, and added warning about element ordering when creating schema from Java Bean
      a571bb6 [Joseph K. Bradley] Removed old UDT code (registry and Java UDTs).  Cleaned up other code.  Extended JavaUserDefinedTypeSuite
      6fddc1c [Joseph K. Bradley] Made MyLabeledPoint into a Java Bean
      20630bc [Joseph K. Bradley] fixed scalastyle
      fa86b20 [Joseph K. Bradley] Removed Java UserDefinedType, and made UDTs private[spark] for now
      8de957c [Joseph K. Bradley] Modified UserDefinedType to store Java class of user type so that registerUDT takes only the udt argument.
      8b242ea [Joseph K. Bradley] Fixed merge error after last merge.  Note: Last merge commit also removed SQL UDT examples from mllib.
      7f29656 [Joseph K. Bradley] Moved udt case to top of all matches.  Small cleanups
      b028675 [Xiangrui Meng] allow any type in UDT
      4500d8a [Xiangrui Meng] update example code
      87264a5 [Xiangrui Meng] remove debug code
      3143ac3 [Xiangrui Meng] remove unnecessary changes
      cfbc321 [Xiangrui Meng] support UDT in parquet
      db16139 [Joseph K. Bradley] Added more doc for UserDefinedType.  Removed unused code in Suite
      759af7a [Joseph K. Bradley] Added more doc to UserDefineType
      63626a4 [Joseph K. Bradley] Updated ScalaReflectionsSuite per @marmbrus suggestions
      51e5282 [Joseph K. Bradley] fixed 1 test
      f025035 [Joseph K. Bradley] Cleanups before PR.  Added new tests
      85872f6 [Michael Armbrust] Allow schema calculation to be lazy, but ensure its available on executors.
      dff99d6 [Joseph K. Bradley] Added UDTs for Vectors in MLlib, plus DatasetExample using the UDTs
      cd60cb4 [Joseph K. Bradley] Trying to get other SQL tests to run
      34a5831 [Joseph K. Bradley] Added MLlib dependency on SQL.
      e1f7b9c [Joseph K. Bradley] blah
      2f40c02 [Joseph K. Bradley] renamed UDT types
      3579035 [Joseph K. Bradley] udt annotation now working
      b226b9e [Joseph K. Bradley] Changing UDT to annotation
      fea04af [Joseph K. Bradley] more cleanups
      964b32e [Joseph K. Bradley] some cleanups
      893ee4c [Joseph K. Bradley] udt finallly working
      50f9726 [Joseph K. Bradley] udts
      04303c9 [Joseph K. Bradley] udts
      39f8707 [Joseph K. Bradley] removed old udt suite
      273ac96 [Joseph K. Bradley] basic UDT is working, but deserialization has yet to be done
      8bebf24 [Joseph K. Bradley] commented out convertRowToScala for debugging
      53de70f [Joseph K. Bradley] more udts...
      982c035 [Joseph K. Bradley] still working on UDTs
      19b2f60 [Joseph K. Bradley] still working on UDTs
      0eaeb81 [Joseph K. Bradley] Still working on UDTs
      105c5a3 [Joseph K. Bradley] Adding UserDefinedType to SQL, not done yet.
    • [SPARK-4183] Close transport-related resources between SparkContexts · 2ebd1df3
      Aaron Davidson authored
      A leak of event loops may be causing test failures.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3053 from aarondav/leak and squashes the following commits:
      
      e676d18 [Aaron Davidson] Typo!
      8f96475 [Aaron Davidson] Keep original ssc semantics
      7e49f10 [Aaron Davidson] A leak of event loops may be causing test failures.
    • [SPARK-2189][SQL] Adds dropTempTable API · 9081b9f9
      Cheng Lian authored
      This PR adds an API for unregistering temporary tables. If a temporary table has been cached before, it's unpersisted as well.
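
The behaviour described above (drop the table and, if it was cached, release the cache as well) can be sketched with a toy catalog. Everything here is hypothetical and only illustrates the semantics: `TempTableCatalog` and its methods are not Spark classes, and the lock is an assumption about how thread safety might be ensured.

```python
import threading


class TempTableCatalog:
    """Hypothetical in-memory catalog of temporary tables."""

    def __init__(self):
        self._lock = threading.Lock()
        self._tables = {}
        self._cached = set()

    def register_temp_table(self, name, data):
        with self._lock:
            self._tables[name] = data

    def cache_table(self, name):
        with self._lock:
            if name not in self._tables:
                raise KeyError(name)
            self._cached.add(name)

    def is_cached(self, name):
        with self._lock:
            return name in self._cached

    def drop_temp_table(self, name):
        """Unregister the table and drop its cached data, if any."""
        with self._lock:
            self._cached.discard(name)
            self._tables.pop(name, None)

    def table_names(self):
        with self._lock:
            return sorted(self._tables)
```

Dropping a table that was never registered is a no-op in this sketch; whether the real API behaves the same way is not stated in the description.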
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #3039 from liancheng/unregister-temp-table and squashes the following commits:
      
      54ae99f [Cheng Lian] Fixes Scala styling issue
      1948c14 [Cheng Lian] Removes the unpersist argument
      aca41d3 [Cheng Lian] Ensures thread safety
      7d4fb2b [Cheng Lian] Adds unregisterTempTable API
    • [SPARK-4185][SQL] JSON schema inference failed when dealing with type conflicts in arrays · 06232d23
      Yin Huai authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-4185.
      
      This PR also has the fix of #3052.
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #3056 from yhuai/SPARK-4185 and squashes the following commits:
      
      ed3a5a8 [Yin Huai] Correctly handle type conflicts between structs and primitive types in an array.
    • [SPARK-4191][SQL]move wrapperFor to HiveInspectors to reuse it · e749f5de
      wangfei authored
      Move wrapperFor from InsertIntoHiveTable to HiveInspectors so that it can be reused, for example when writing data with an ObjectInspector (such as for ORC support).
      
      Author: wangfei <wangfei1@huawei.com>
      Author: scwf <wangfei1@huawei.com>
      
      Closes #3057 from scwf/reuse-wraperfor and squashes the following commits:
      
      7ccf932 [scwf] fix conflicts
      d44f4da [wangfei] fix imports
      9bf1b50 [wangfei] revert no related change
      9a5276a [wangfei] move wrapfor to hiveinspector to reuse them
    • [SPARK-3791][SQL] Provides Spark version and Hive version in HiveThriftServer2 · c9f84004
      Cheng Lian authored
      This PR overrides the `GetInfo` Hive Thrift API to provide correct version information. Another property `spark.sql.hive.version` is added to reveal the underlying Hive version. These are generally useful for Spark SQL ODBC driver providers. The Spark version information is extracted from the jar manifest. Also took the chance to remove the `SET -v` hack, which was a workaround for Simba ODBC driver connectivity.
      
      TODO
      
      - [x] Find a general way to figure out Hive (or even any dependency) version.
      
        This [blog post](http://blog.soebes.de/blog/2014/01/02/version-information-into-your-appas-with-maven/) suggests several methods to inspect application version. In the case of Spark, this can be tricky because the chosen method:
      
        1. must apply to both Maven and SBT builds
      
          For Maven builds, we can retrieve the version information from the META-INF/maven directory within the assembly jar. But this doesn't work for SBT builds.
      
        2. must not rely on the original jars of dependencies to extract a specific dependency version, because Spark uses an assembly jar.

          This implies we can't read the Hive version from Hive jar files, since the standard Spark distribution doesn't include them.
      
        3. should play well with `SPARK_PREPEND_CLASSES` to ease local testing during development.
      
           `SPARK_PREPEND_CLASSES` prevents classes from being loaded from the assembly jar, thus we can't locate the jar file and read its manifest.
      
        Given these, maybe the only reliable method is to generate a source file containing version information at build time. pwendell Do you have any suggestions from the perspective of the build process?
      
      **Update** Hive version is now retrieved from the newly introduced `HiveShim` object.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #2843 from liancheng/get-info and squashes the following commits:
      
      a873d0f [Cheng Lian] Updates test case
      53f43cd [Cheng Lian] Retrieves underlying Hive verson via HiveShim
      1d282b8 [Cheng Lian] Removes the Simba ODBC "SET -v" hack
      f857fce [Cheng Lian] Overrides Hive GetInfo Thrift API and adds Hive version property
    • [SQL] Fixes race condition in CliSuite · 495a1320
      Cheng Lian authored
      `CliSuite` has been flaky for a while. This PR tries to improve the situation by fixing a race condition in `CliSuite`. The `captureOutput` function captures both the stdout and stderr output of the forked external process in two background threads and searches for expected strings, but it was not properly synchronized before.
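
The pattern of two reader threads scanning the output of a forked process for expected strings can be sketched in Python as below. This is an illustration of the synchronization issue only, not the Scala test code: `capture_output` is a hypothetical helper, and the lock guards the shared set that both threads update.

```python
import subprocess
import sys
import threading


def capture_output(cmd, expected):
    """Run cmd and return the expected strings seen on stdout or stderr."""
    lock = threading.Lock()
    remaining = set(expected)

    def scan(stream):
        for line in stream:
            with lock:  # both reader threads update the same set
                for text in list(remaining):
                    if text in line:
                        remaining.discard(text)

    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    readers = [
        threading.Thread(target=scan, args=(stream,))
        for stream in (proc.stdout, proc.stderr)
    ]
    for reader in readers:
        reader.start()
    for reader in readers:
        reader.join()
    proc.wait()
    proc.stdout.close()
    proc.stderr.close()
    return set(expected) - remaining
```

Without the lock, the two threads could interleave their reads and writes of `remaining`, which is the kind of race the description refers to.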
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3060 from liancheng/fix-cli-suite and squashes the following commits:
      
      a70569c [Cheng Lian] Fixes race condition in CliSuite
    • [SPARK-4182][SQL] Fixes ColumnStats classes for boolean, binary and complex data types · e4b80894
      Cheng Lian authored
      `NoopColumnStats` was once used for binary, boolean and complex data types. This `ColumnStats` doesn't return properly shaped column statistics and causes caching failure if a table contains columns of the aforementioned types.
      
      This PR adds `BooleanColumnStats`, `BinaryColumnStats` and `GenericColumnStats`, used for boolean, binary and all complex data types respectively. In addition, `NoopColumnStats` now returns properly shaped column statistics containing null count and row count, but this class is now used for testing purposes only.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3059 from liancheng/spark-4182 and squashes the following commits:
      
      b398cfd [Cheng Lian] Fixes failed test case
      fb3ee85 [Cheng Lian] Fixes SPARK-4182
      e4b80894
    • Michael Armbrust's avatar
      [SPARK-3247][SQL] An API for adding data sources to Spark SQL · 9c0eb57c
      Michael Armbrust authored
      This PR introduces a new set of APIs to Spark SQL to allow other developers to add support for reading data from new sources in `org.apache.spark.sql.sources`.
      
      New sources must implement the interface `BaseRelation`, which is responsible for describing the schema of the data.  BaseRelations have three `Scan` subclasses, which are responsible for producing an RDD containing row objects.  The [various Scan interfaces](https://github.com/marmbrus/spark/blob/foreign/sql/core/src/main/scala/org/apache/spark/sql/sources/package.scala#L50) allow for optimizations such as column pruning and filter push down, when the underlying data source can handle these operations.
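
The division of labour between a relation and its scan interface can be sketched roughly as below. This is a Python analogue for illustration only; the real API is Scala, and the method name `build_scan`, the `ListRelation` class, and the representation of filters as plain predicates are all assumptions made for this sketch.

```python
from abc import ABC, abstractmethod


class BaseRelation(ABC):
    """Describes the schema of the data."""

    @property
    @abstractmethod
    def schema(self):
        """Column names of the relation."""


class PrunedFilteredScan(BaseRelation):
    """A relation that can apply column pruning and filters itself."""

    @abstractmethod
    def build_scan(self, required_columns, filters):
        """Yield rows containing only `required_columns` that pass `filters`."""


class ListRelation(PrunedFilteredScan):
    """Hypothetical source backed by an in-memory list of dicts."""

    def __init__(self, column_names, rows):
        self._column_names = list(column_names)
        self._rows = list(rows)

    @property
    def schema(self):
        return self._column_names

    def build_scan(self, required_columns, filters):
        for row in self._rows:
            # Filter push down: rows are rejected inside the source.
            if all(keep(row) for keep in filters):
                # Column pruning: only the requested columns are produced.
                yield tuple(row[name] for name in required_columns)
```

A source that cannot evaluate filters or prune columns would implement a simpler scan interface and leave that work to the query engine.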
      
      By implementing a class that inherits from `RelationProvider`, these data sources can be accessed using pure SQL. I've used this functionality to update the JSON support so that it can now be used in this way:
      
      ```sql
      CREATE TEMPORARY TABLE jsonTableSQL
      USING org.apache.spark.sql.json
      OPTIONS (
        path '/home/michael/data.json'
      )
      ```
      
      Further example usage can be found in the test cases: https://github.com/marmbrus/spark/tree/foreign/sql/core/src/test/scala/org/apache/spark/sql/sources
      
      There is also a library that uses this new API to read Avro data, available here:
      https://github.com/marmbrus/sql-avro
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2475 from marmbrus/foreign and squashes the following commits:
      
      1ed6010 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into foreign
      ab2c31f [Michael Armbrust] fix test
      1d41bb5 [Michael Armbrust] unify argument names
      5b47901 [Michael Armbrust] Remove sealed, more filter types
      fab154a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into foreign
      e3e690e [Michael Armbrust] Add hook for extraStrategies
      a70d602 [Michael Armbrust] Fix style, more tests, FilteredSuite => PrunedFilteredSuite
      70da6d9 [Michael Armbrust] Modify API to ease binary compatibility and interop with Java
      7d948ae [Michael Armbrust] Fix equality of AttributeReference.
      5545491 [Michael Armbrust] Address comments
      5031ac3 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into foreign
      22963ef [Michael Armbrust] package objects compile weirdly...
      b069146 [Michael Armbrust] traits => abstract classes
      34f836a [Michael Armbrust] Make @DeveloperApi
      0d74bcf [Michael Armbrust] Add documentation on object life cycle
      3e06776 [Michael Armbrust] remove line wraps
      de3b68c [Michael Armbrust] Remove empty file
      360cb30 [Michael Armbrust] style and java api
      2957875 [Michael Armbrust] add override
      0fd3a07 [Michael Armbrust] Draft of data sources API
      9c0eb57c
    • wangfei's avatar
      [HOTFIX][SQL] hive test missing some golden files · f0a4b630
      wangfei authored
      cc marmbrus
      
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #3055 from scwf/hotfix and squashes the following commits:
      
      d881bd7 [wangfei] miss golden files
      f0a4b630
    • zsxwing's avatar
      [SPARK-4166][Core][WebUI] Display the executor ID in the Web UI when ExecutorLostFailure happens · 4e6a7a0b
      zsxwing authored
      Currently, when an ExecutorLostFailure happens, the Web UI only displays `ExecutorLostFailure (executor lost)`. Adding the executor ID will help locate the faulty executor.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3033 from zsxwing/SPARK-4166 and squashes the following commits:
      
      ff4664c [zsxwing] Backward-compatible support
      c5c4cf2 [zsxwing] Display the executor ID in the Web UI when ExecutorLostFailure happens
      4e6a7a0b
    • Davies Liu's avatar
      [SPARK-3466] Limit size of results that a driver collects for each action · 6181577e
      Davies Liu authored
      Right now, operations like collect() and take() can crash the driver with an OOM if they bring back too much data.
      
      This PR introduces `spark.driver.maxResultSize`. Once it is set, the driver aborts a job if its result is bigger than that limit.
      
      By default, it is 1g (for backward compatibility in most cases).
      
      In local mode, the driver and executor share the same JVM, so the default setting cannot protect the JVM from OOM.
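
The size check described above can be sketched as follows. This is a simplified Python illustration under stated assumptions, not Spark's implementation: `collect_with_limit`, `parse_size` and `ResultTooLargeError` are hypothetical names, and task results are assumed to arrive as already-serialized byte strings.

```python
class ResultTooLargeError(RuntimeError):
    """Raised when collected results exceed the configured limit."""


_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text):
    """Convert a size string such as '512m' or '1g' to a number of bytes."""
    text = text.strip().lower()
    if text and text[-1] in _UNITS:
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text)


def collect_with_limit(task_results, max_result_size="1g"):
    """Gather serialized task results, aborting once the total passes the limit."""
    limit = parse_size(max_result_size)
    total = 0
    collected = []
    for result in task_results:
        total += len(result)
        if total > limit:
            # Include both sizes in the abort message.
            raise ResultTooLargeError(
                f"total size of results ({total} bytes) is bigger than "
                f"the maximum result size ({limit} bytes)")
        collected.append(result)
    return collected
```

Checking the running total as each result arrives means the job is aborted before the driver has to hold the full oversized result in memory.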
      
      cc mateiz
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3003 from davies/collect and squashes the following commits:
      
      248ed5e [Davies Liu] fix compile
      272522e [Davies Liu] address comments
      2c35773 [Davies Liu] add sizes in message of abort()
      5d62303 [Davies Liu] address comments
      bc3c077 [Davies Liu] Merge branch 'master' of github.com:apache/spark into collect
      11f97c5 [Davies Liu] address comments
      47b144f [Davies Liu] check the size of result before send and fetch
      3d81af2 [Davies Liu] address comments
      ca8267d [Davies Liu] limit the size of data by collect
      6181577e
  4. Nov 01, 2014
    • Matei Zaharia's avatar
      [SPARK-3930] [SPARK-3933] Support fixed-precision decimal in SQL, and some optimizations · 23f966f4
      Matei Zaharia authored
      - Adds optional precision and scale to Spark SQL's decimal type, which behave similarly to those in Hive 13 (https://cwiki.apache.org/confluence/download/attachments/27362075/Hive_Decimal_Precision_Scale_Support.pdf)
      - Replaces our internal representation of decimals with a Decimal class that can store small values in a mutable Long, saving memory in this situation and letting some operations happen directly on Longs
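
As a rough illustration of the two points above, the sketch below models a fixed-scale decimal as an unscaled integer plus a scale. It is written in Python, where integers are unbounded, so the "fits in a Long" check is only informational here; in the PR the compact form is a mutable Long. The class name `FixedDecimal` and the 18-digit threshold constant are assumptions made for this sketch.

```python
from decimal import Decimal

# Assumption: any integer with at most 18 decimal digits fits in a signed 64-bit long.
MAX_LONG_DIGITS = 18


class FixedDecimal:
    """Value represented as unscaled * 10**-scale."""

    def __init__(self, unscaled, scale):
        self.unscaled = unscaled  # integer holding all digits of the value
        self.scale = scale        # number of digits after the decimal point

    @property
    def precision(self):
        return len(str(abs(self.unscaled)))

    @property
    def fits_in_long(self):
        return self.precision <= MAX_LONG_DIGITS

    def __add__(self, other):
        # Align both operands to the larger scale, then add as plain integers.
        scale = max(self.scale, other.scale)
        a = self.unscaled * 10 ** (scale - self.scale)
        b = other.unscaled * 10 ** (scale - other.scale)
        return FixedDecimal(a + b, scale)

    def to_decimal(self):
        # Build the Decimal from its parts so no context rounding is applied.
        sign = 1 if self.unscaled < 0 else 0
        digits = tuple(int(c) for c in str(abs(self.unscaled)))
        return Decimal((sign, digits, -self.scale))
```

For example, `FixedDecimal(1250, 2)` stands for 12.50, and adding `FixedDecimal(5, 1)` (0.5) only requires integer arithmetic once both values share a scale of 2.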
      
      This is still marked WIP because there are a few TODOs, but I'll remove that tag when done.
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #2983 from mateiz/decimal-1 and squashes the following commits:
      
      35e6b02 [Matei Zaharia] Fix issues after merge
      227f24a [Matei Zaharia] Review comments
      31f915e [Matei Zaharia] Implement Davies's suggestions in Python
      eb84820 [Matei Zaharia] Support reading/writing decimals as fixed-length binary in Parquet
      4dc6bae [Matei Zaharia] Fix decimal support in PySpark
      d1d9d68 [Matei Zaharia] Fix compile error and test issues after rebase
      b28933d [Matei Zaharia] Support decimal precision/scale in Hive metastore
      2118c0d [Matei Zaharia] Some test and bug fixes
      81db9cb [Matei Zaharia] Added mutable Decimal that will be more efficient for small precisions
      7af0c3b [Matei Zaharia] Add optional precision and scale to DecimalType, but use Unlimited for now
      ec0a947 [Matei Zaharia] Make the result of AVG on Decimals be Decimal, not Double
      23f966f4
    • Sung Chung's avatar
      [SPARK-3161][MLLIB] Adding a node Id caching mechanism for training deci... · 56f2c61c
      Sung Chung authored
      ...sion trees. jkbradley mengxr chouqin Please review this.
      
      Author: Sung Chung <schung@alpinenow.com>
      
      Closes #2868 from codedeft/SPARK-3161 and squashes the following commits:
      
      5f5a156 [Sung Chung] [SPARK-3161][MLLIB] Adding a node Id caching mechanism for training decision trees.
      56f2c61c