  1. Nov 05, 2014
    • [SPARK-4242] [Core] Add SASL to external shuffle service · 4c42986c
      Aaron Davidson authored
      Does three things: (1) adds SASL to ExternalShuffleClient, (2) puts SecurityManager in BlockManager's constructor, and (3) adds a unit test.
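
      A minimal configuration sketch of how these pieces fit together; the keys below are the standard authentication and external-shuffle-service settings and are an assumption as far as this commit message goes:

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}

      // Enable SASL authentication and route shuffle reads through the external service.
      val conf = new SparkConf()
        .setAppName("sasl-shuffle-example")
        .set("spark.authenticate", "true")            // turn on SASL authentication
        .set("spark.authenticate.secret", "secret")   // shared secret (standalone/local modes)
        .set("spark.shuffle.service.enabled", "true") // use the external shuffle service
      val sc = new SparkContext(conf)
      ```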
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3108 from aarondav/sasl-client and squashes the following commits:
      
      48b622d [Aaron Davidson] Screw it, let's just get LimitedInputStream
      3543b70 [Aaron Davidson] Back out of pom change due to unknown test issue?
      b58518a [Aaron Davidson] ByteStreams.limit() not available :(
      cbe451a [Aaron Davidson] Address comments
      2bf2908 [Aaron Davidson] [SPARK-4242] [Core] Add SASL to external shuffle service
      4c42986c
    • [SPARK-4197] [mllib] GradientBoosting API cleanup and examples in Scala, Java · 5b3b6f6f
      Joseph K. Bradley authored
      ### Summary
      
      * Made it easier to construct default Strategy and BoostingStrategy and to set parameters using simple types.
      * Added Scala and Java examples for GradientBoostedTrees
      * small cleanups and fixes
      
      ### Details
      
      GradientBoosting bug fixes (“bug” = bad default options)
      * Force boostingStrategy.weakLearnerParams.algo = Regression
      * Force boostingStrategy.weakLearnerParams.impurity = impurity.Variance
      * Only persist data if not yet persisted (since it causes an error if persisted twice)
      
      BoostingStrategy
      * numEstimators: renamed to numIterations
      * removed subsamplingRate (duplicated by Strategy)
      * removed categoricalFeaturesInfo since it belongs with the weak learner params (since boosting can be oblivious to feature type)
      * Changed algo to var (not val) and added BeanProperty, with overload taking String argument
      * Added assertValid() method
      * Updated defaultParams() method and eliminated defaultWeakLearnerParams() since that belongs in Strategy
      
      Strategy (for DecisionTree)
      * Changed algo to var (not val) and added BeanProperty, with overload taking String argument
      * Added setCategoricalFeaturesInfo method taking Java Map.
      * Cleaned up assertValid
      * Changed val’s to def’s since parameters can now be changed.
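
      For illustration, a minimal Scala sketch of the resulting workflow using the parameter names above; the object name (GradientBoostedTrees), defaultParams("Regression"), and the weakLearnerParams field are taken from this description and the 1.2-era tree API, so treat them as assumptions:

      ```scala
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.regression.LabeledPoint
      import org.apache.spark.mllib.tree.GradientBoostedTrees
      import org.apache.spark.mllib.tree.configuration.BoostingStrategy

      // A tiny regression dataset (sc is assumed to be an existing SparkContext).
      val trainingData = sc.parallelize(Seq(
        LabeledPoint(1.0, Vectors.dense(0.0, 1.1)),
        LabeledPoint(0.0, Vectors.dense(2.0, 1.0))))

      val boostingStrategy = BoostingStrategy.defaultParams("Regression")
      boostingStrategy.numIterations = 10              // renamed from numEstimators
      boostingStrategy.weakLearnerParams.maxDepth = 3  // weak-learner (DecisionTree) settings

      val model = GradientBoostedTrees.train(trainingData, boostingStrategy)
      ```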
      
      CC: manishamde mengxr codedeft
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #3094 from jkbradley/gbt-api and squashes the following commits:
      
      7a27e22 [Joseph K. Bradley] scalastyle fix
      52013d5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into gbt-api
      e9b8410 [Joseph K. Bradley] Summary of changes
      5b3b6f6f
    • [SPARK-4029][Streaming] Update streaming driver to reliably save and recover... · 5f13759d
      Tathagata Das authored
      [SPARK-4029][Streaming] Update streaming driver to reliably save and recover received block metadata on driver failures
      
      As part of the initiative of preventing data loss on driver failure, this JIRA tracks the sub task of modifying the streaming driver to reliably save received block metadata, and recover them on driver restart.
      
      This was solved by introducing a `ReceivedBlockTracker` that takes all responsibility for managing the metadata of received blocks (i.e. `ReceivedBlockInfo`) and any actions on them (e.g., allocating blocks to batches). All actions on block info get written out to a write ahead log (using `WriteAheadLogManager`). On recovery, all the actions are replayed to recreate the pre-failure state of the `ReceivedBlockTracker`, which includes the batch-to-block allocations and the unallocated blocks.
      
      Furthermore, the `ReceiverInputDStream` was modified to create `WriteAheadLogBackedBlockRDD`s when file segment info is present in the `ReceivedBlockInfo`. After all the block info is recovered (through the recovered `ReceivedBlockTracker`), the `WriteAheadLogBackedBlockRDD`s get recreated with the recovered info and jobs are submitted. The data of the blocks gets pulled from the write ahead logs, thanks to the segment info present in the `ReceivedBlockInfo`.
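
      For context, this driver-side recovery is what backs the existing checkpoint-based restart pattern; a minimal sketch of that pattern (the checkpoint directory and batch interval are placeholders):

      ```scala
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      // Build a fresh context the first time; on restart, state is recovered from the checkpoint.
      def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName("wal-recovery-example")
        val ssc = new StreamingContext(conf, Seconds(1))
        ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")
        // set up receiver-based input streams and transformations here
        ssc
      }

      val ssc = StreamingContext.getOrCreate("hdfs:///tmp/streaming-checkpoint", createContext _)
      ssc.start()
      ssc.awaitTermination()
      ```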
      
      This is still a WIP. The things missing here are:
      
      - *End-to-end integration tests:* Unit tests that test driver recovery by killing and restarting the streaming context and verifying that all the input data gets processed. This has been implemented but not included in this PR yet. A sneak peek of that DriverFailureSuite can be found in this PR (on my personal repo): https://github.com/tdas/spark/pull/25 I can either include it in this PR, or submit it as a separate PR after this gets in.
      
      - *WAL cleanup:* Cleaning up the received data write ahead log, by calling `ReceivedBlockHandler.cleanupOldBlocks`. This is being worked on.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #3026 from tdas/driver-ha-rbt and squashes the following commits:
      
      a8009ed [Tathagata Das] Added comment
      1d704bb [Tathagata Das] Enabled storing recovered WAL-backed blocks to BM
      2ee2484 [Tathagata Das] More minor changes based on PR
      47fc1e3 [Tathagata Das] Addressed PR comments.
      9a7e3e4 [Tathagata Das] Refactored ReceivedBlockTracker API a bit to make things a little cleaner for users of the tracker.
      af63655 [Tathagata Das] Minor changes.
      fce2b21 [Tathagata Das] Removed commented lines
      59496d3 [Tathagata Das] Changed class names, made allocation more explicit and added cleanup
      19aec7d [Tathagata Das] Fixed casting bug.
      f66d277 [Tathagata Das] Fix line lengths.
      cda62ee [Tathagata Das] Added license
      25611d6 [Tathagata Das] Minor changes before submitting PR
      7ae0a7fb [Tathagata Das] Transferred changes from driver-ha-working branch
      5f13759d
  2. Nov 04, 2014
    • [SPARK-3964] [MLlib] [PySpark] add Hypothesis test Python API · c8abddc5
      Davies Liu authored
      ```
      pyspark.mllib.stat.Statistics.chiSqTest(observed, expected=None)
          :: Experimental ::
      
          If `observed` is Vector, conduct Pearson's chi-squared goodness
          of fit test of the observed data against the expected distribution,
          or against the uniform distribution (by default), with each category
          having an expected frequency of `1 / len(observed)`.
          (Note: `observed` cannot contain negative values)
      
          If `observed` is a matrix, conduct Pearson's independence test on the
          input contingency matrix, which cannot contain negative entries or
          columns or rows that sum up to 0.
      
          If `observed` is an RDD of LabeledPoint, conduct Pearson's independence
          test for every feature against the label across the input RDD.
          For each feature, the (feature, label) pairs are converted into a
          contingency matrix for which the chi-squared statistic is computed.
          All label and feature values must be categorical.
      
          :param observed: it could be a vector containing the observed categorical
                           counts/relative frequencies, or the contingency matrix
                           (containing either counts or relative frequencies),
                           or an RDD of LabeledPoint containing the labeled dataset
                           with categorical features. Real-valued features will be
                           treated as categorical for each distinct value.
          :param expected: Vector containing the expected categorical counts/relative
                           frequencies. `expected` is rescaled if the `expected` sum
                           differs from the `observed` sum.
          :return: ChiSquaredTest object containing the test statistic, degrees
                   of freedom, p-value, the method used, and the null hypothesis.
      ```
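
      For reference, a minimal sketch of the corresponding Scala API (`Statistics.chiSqTest`) that this Python wrapper mirrors:

      ```scala
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.stat.Statistics

      // Goodness-of-fit test of observed counts against the uniform distribution
      // (no expected vector supplied).
      val observed = Vectors.dense(4.0, 6.0, 5.0)
      val result = Statistics.chiSqTest(observed)
      println(s"statistic=${result.statistic}, pValue=${result.pValue}")
      ```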
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3091 from davies/his and squashes the following commits:
      
      145d16c [Davies Liu] address comments
      0ab0764 [Davies Liu] fix float
      5097d54 [Davies Liu] add Hypothesis test Python API
      c8abddc5
    • [SQL] Add String option for DSL AS · 515abb9a
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #3097 from marmbrus/asString and squashes the following commits:
      
      6430520 [Michael Armbrust] Add String option for DSL AS
      515abb9a
    • [SPARK-2938] Support SASL authentication in NettyBlockTransferService · 5e73138a
      Aaron Davidson authored
      Also lays the groundwork for supporting it inside the external shuffle service.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3087 from aarondav/sasl and squashes the following commits:
      
      3481718 [Aaron Davidson] Delete rogue println
      44f8410 [Aaron Davidson] Delete documentation - muahaha!
      eb9f065 [Aaron Davidson] Improve documentation and add end-to-end test at Spark-level
      a6b95f1 [Aaron Davidson] Address comments
      785bbde [Aaron Davidson] Cleanup
      79973cb [Aaron Davidson] Remove unused file
      151b3c5 [Aaron Davidson] Add docs, timeout config, better failure handling
      f6177d7 [Aaron Davidson] Cleanup SASL state upon connection termination
      7b42adb [Aaron Davidson] Add unit tests
      8191bcb [Aaron Davidson] [SPARK-2938] Support SASL authentication in NettyBlockTransferService
      5e73138a
    • [Spark-4060] [MLlib] exposing special rdd functions to the public · f90ad5d4
      Niklas Wilcke authored
      Author: Niklas Wilcke <1wilcke@informatik.uni-hamburg.de>
      
      Closes #2907 from numbnut/master and squashes the following commits:
      
      7f7c767 [Niklas Wilcke] [Spark-4060] [MLlib] exposing special rdd functions to the public, #2907
      f90ad5d4
    • fixed MLlib Naive-Bayes java example bug · bcecd73f
      Dariusz Kobylarz authored
      the filter tests Double objects by reference whereas it should test their values
      
      Author: Dariusz Kobylarz <darek.kobylarz@gmail.com>
      
      Closes #3081 from dkobylarz/master and squashes the following commits:
      
      5d43a39 [Dariusz Kobylarz] naive bayes example update
      a304b93 [Dariusz Kobylarz] fixed MLlib Naive-Bayes java example bug
      bcecd73f
    • [SPARK-3886] [PySpark] simplify serializer, use AutoBatchedSerializer by default. · e4f42631
      Davies Liu authored
      This PR simplifies the serializer: it always uses a batched serializer (AutoBatchedSerializer by default), even when the batch size is 1.
      
      Author: Davies Liu <davies@databricks.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Josh Rosen <joshrosen@databricks.com>
      
      Closes #2920 from davies/fix_autobatch and squashes the following commits:
      
      e544ef9 [Davies Liu] revert unrelated change
      6880b14 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      1d557fc [Davies Liu] fix tests
      8180907 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      76abdce [Davies Liu] clean up
      53fa60b [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      d7ac751 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      2cc2497 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      b4292ce [Davies Liu] fix bug in master
      d79744c [Davies Liu] recover hive tests
      be37ece [Davies Liu] refactor
      eb3938d [Davies Liu] refactor serializer in scala
      8d77ef2 [Davies Liu] simplify serializer, use AutoBatchedSerializer by default.
      e4f42631
    • [SPARK-4166][Core] Add a backward compatibility test for ExecutorLostFailure · b671ce04
      zsxwing authored
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3085 from zsxwing/SPARK-4166-back-comp and squashes the following commits:
      
      89329f4 [zsxwing] Add a backward compatibility test for ExecutorLostFailure
      b671ce04
    • [SPARK-4163][Core] Add a backward compatibility test for FetchFailed · 9bdc8412
      zsxwing authored
      /cc aarondav
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3086 from zsxwing/SPARK-4163-back-comp and squashes the following commits:
      
      21cb2a8 [zsxwing] Add a backward compatibility test for FetchFailed
      9bdc8412
    • [SPARK-3573][MLLIB] Make MLlib's Vector compatible with SQL's SchemaRDD · 1a9c6cdd
      Xiangrui Meng authored
      Register MLlib's Vector as a SQL user-defined type (UDT) in both Scala and Python. With this PR, we can easily map an RDD[LabeledPoint] to a SchemaRDD, and then select columns or save to a Parquet file. Examples in Scala/Python are attached. The Scala code was copied from jkbradley.
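
      A minimal Scala sketch of that workflow, using the 1.x SchemaRDD API (the implicit-conversion import and the table/Parquet calls are the standard API of this era, not taken from the patch itself):

      ```scala
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.regression.LabeledPoint
      import org.apache.spark.sql.SQLContext

      val sqlContext = new SQLContext(sc)
      import sqlContext.createSchemaRDD

      val points = sc.parallelize(Seq(
        LabeledPoint(1.0, Vectors.dense(0.1, 0.2)),
        LabeledPoint(0.0, Vectors.dense(0.3, 0.4))))

      // Works because Vector is now registered as a UDT.
      val schemaRDD = points.toSchemaRDD
      schemaRDD.registerTempTable("points")
      sqlContext.sql("SELECT label FROM points").collect()
      schemaRDD.saveAsParquetFile("/tmp/labeled_points.parquet")
      ```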
      
      ~~This PR contains the changes from #3068 . I will rebase after #3068 is merged.~~
      
      marmbrus jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3070 from mengxr/SPARK-3573 and squashes the following commits:
      
      3a0b6e5 [Xiangrui Meng] organize imports
      236f0a0 [Xiangrui Meng] register vector as UDT and provide dataset examples
      1a9c6cdd
  3. Nov 03, 2014
    • [SPARK-4192][SQL] Internal API for Python UDT · 04450d11
      Xiangrui Meng authored
      Following #2919, this PR adds Python UDT (for internal use only) with tests under "pyspark.tests". Before `SQLContext.applySchema`, we check whether we need to convert user-type instances into SQL recognizable data. In the current implementation, a Python UDT must be paired with a Scala UDT for serialization on the JVM side. A following PR will add VectorUDT in MLlib for both Scala and Python.
      
      marmbrus jkbradley davies
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3068 from mengxr/SPARK-4192-sql and squashes the following commits:
      
      acff637 [Xiangrui Meng] merge master
      dba5ea7 [Xiangrui Meng] only use pyClass for Python UDT output sqlType as well
      2c9d7e4 [Xiangrui Meng] move import to global setup; update needsConversion
      7c4a6a9 [Xiangrui Meng] address comments
      75223db [Xiangrui Meng] minor update
      f740379 [Xiangrui Meng] remove UDT from default imports
      e98d9d0 [Xiangrui Meng] fix py style
      4e84fce [Xiangrui Meng] remove local hive tests and add more tests
      39f19e0 [Xiangrui Meng] add tests
      b7f666d [Xiangrui Meng] add Python UDT
      04450d11
    • [FIX][MLLIB] fix seed in BaggedPointSuite · c5912ecc
      Xiangrui Meng authored
      Saw Jenkins test failures due to random seeds.
      
      jkbradley manishamde
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3084 from mengxr/fix-baggedpoint-suite and squashes the following commits:
      
      f735a43 [Xiangrui Meng] fix seed in BaggedPointSuite
      c5912ecc
    • [SPARK-611] Display executor thread dumps in web UI · 4f035dd2
      Josh Rosen authored
      This patch allows executor thread dumps to be collected on-demand and viewed in the Spark web UI.
      
      The thread dumps are collected using Thread.getAllStackTraces().  To allow remote thread dumps to be triggered from the web UI, I added a new `ExecutorActor` that runs inside of the Executor actor system and responds to RPCs from the driver.  The driver's mechanism for obtaining a reference to this actor is a little bit hacky: it uses the block manager master actor to determine the host/port of the executor actor systems in order to construct ActorRefs to ExecutorActor.  Unfortunately, I couldn't find a much cleaner way to do this without a big refactoring of the executor -> driver communication.
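
      For illustration, a simplified sketch of what collecting a dump via `Thread.getAllStackTraces()` looks like; the actual change wraps this in an RPC handled by the new `ExecutorActor`:

      ```scala
      import scala.collection.JavaConverters._

      // Render every live thread's name, state, and stack frames as text.
      val dump = Thread.getAllStackTraces.asScala.map { case (thread, frames) =>
        s"${thread.getName} (${thread.getState})\n" +
          frames.map(frame => s"    at $frame").mkString("\n")
      }.mkString("\n\n")
      println(dump)
      ```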
      
      Screenshots:
      
      ![image](https://cloud.githubusercontent.com/assets/50748/4781793/7e7a0776-5cbf-11e4-874d-a91cd04620bd.png)
      
      ![image](https://cloud.githubusercontent.com/assets/50748/4781794/8bce76aa-5cbf-11e4-8d13-8477748c9f7e.png)
      
      ![image](https://cloud.githubusercontent.com/assets/50748/4781797/bd11a8b8-5cbf-11e4-9ad7-a7459467ec8e.png)
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #2944 from JoshRosen/jstack-in-web-ui and squashes the following commits:
      
      3c21a5d [Josh Rosen] Address review comments:
      880f7f7 [Josh Rosen] Merge remote-tracking branch 'origin/master' into jstack-in-web-ui
      f719266 [Josh Rosen] Merge remote-tracking branch 'origin/master' into jstack-in-web-ui
      19707b0 [Josh Rosen] Add one comment.
      127a130 [Josh Rosen] Update to use SparkContext.DRIVER_IDENTIFIER
      b8e69aa [Josh Rosen] Merge remote-tracking branch 'origin/master' into jstack-in-web-ui
      3dfc2d4 [Josh Rosen] Add missing file.
      bc1e675 [Josh Rosen] Undo some leftover changes from the earlier approach.
      f4ac1c1 [Josh Rosen] Switch to on-demand collection of thread dumps
      dfec08b [Josh Rosen] Add option to disable thread dumps in UI.
      4c87d7f [Josh Rosen] Use separate RPC for sending thread dumps.
      2b8bdf3 [Josh Rosen] Enable thread dumps from the driver when running in non-local mode.
      cc3e6b3 [Josh Rosen] Fix test code in DAGSchedulerSuite.
      87b8b65 [Josh Rosen] Add new listener event for thread dumps.
      8c10216 [Josh Rosen] Add missing file.
      0f198ac [Josh Rosen] [SPARK-611] Display executor thread dumps in web UI
      4f035dd2
    • [SPARK-4168][WebUI] web stages number should show correctly when stages are more than 1000 · 97a466ec
      Zhang, Liye authored
      The number of completed stages and failed stages shown on the web UI will always be less than 1000. This is really misleading when there are already thousands of stages completed or failed. The number should be correct even when only some of the stages are listed on the web UI (stage info is removed when the number grows too large).
      
      Author: Zhang, Liye <liye.zhang@intel.com>
      
      Closes #3035 from liyezhang556520/webStageNum and squashes the following commits:
      
      d9e29fb [Zhang, Liye] add detailed comments for variables
      4ea8fd1 [Zhang, Liye] change variable name accroding to comments
      f4c404d [Zhang, Liye] [SPARK-4168][WebUI] web statges number should show correctly when stages are more than 1000
      97a466ec
    • [SQL] Convert arguments to Scala UDFs · 15b58a22
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #3077 from marmbrus/udfsWithUdts and squashes the following commits:
      
      34b5f27 [Michael Armbrust] style
      504adef [Michael Armbrust] Convert arguments to Scala UDFs
      15b58a22
    • SPARK-4178. Hadoop input metrics ignore bytes read in RecordReader instantiation · 28128150
      Sandy Ryza authored
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #3045 from sryza/sandy-spark-4178 and squashes the following commits:
      
      8d2e70e [Sandy Ryza] Kostas's review feedback
      e5b27c0 [Sandy Ryza] SPARK-4178. Hadoop input metrics ignore bytes read in RecordReader instantiation
      28128150
    • [SQL] More aggressive defaults · 25bef7e6
      Michael Armbrust authored
       - Turns on compression for in-memory cached data by default
       - Changes the default Parquet compression format back to gzip (we have seen more OOMs with production workloads due to the way Snappy allocates memory)
       - Increases the batch size to 10,000 rows
       - Increases the broadcast threshold to 10 MB
       - Uses our Parquet implementation instead of the Hive one by default
       - Caches Parquet metadata by default (the corresponding overrides are sketched below)
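
      These defaults can still be overridden per `SQLContext`; the config key names below follow the SQLConf naming of this era and are assumptions as far as this commit message goes:

      ```scala
      // Restore the previous behavior selectively (sqlContext is an existing SQLContext).
      sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "false")
      sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
      sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "1000")
      sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (5 * 1024 * 1024).toString)
      ```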
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #3064 from marmbrus/fasterDefaults and squashes the following commits:
      
      97ee9f8 [Michael Armbrust] parquet codec docs
      e641694 [Michael Armbrust] Remote also
      a12866a [Michael Armbrust] Cache metadata.
      2d73acc [Michael Armbrust] Update docs defaults.
      d63d2d5 [Michael Armbrust] document parquet option
      da373f9 [Michael Armbrust] More aggressive defaults
      25bef7e6
    • [SPARK-4152] [SQL] Avoid data change in CTAS while table already existed · e83f13e8
      Cheng Hao authored
      CREATE TABLE t1 (a String);
      CREATE TABLE t1 AS SELECT key FROM src; -- throws an exception
      CREATE TABLE IF NOT EXISTS t1 AS SELECT key FROM src; -- expected to do nothing, but currently it overwrites t1, which is incorrect.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3013 from chenghao-intel/ctas_unittest and squashes the following commits:
      
      194113e [Cheng Hao] fix bug in CTAS when table already existed
      e83f13e8
    • [SPARK-4202][SQL] Simple DSL support for Scala UDF · c238fb42
      Cheng Lian authored
      This feature is based on an offline discussion with mengxr, hopefully can be useful for the new MLlib pipeline API.
      
      For the following test snippet
      
      ```scala
      case class KeyValue(key: Int, value: String)
      val testData = sc.parallelize(1 to 10).map(i => KeyValue(i, i.toString)).toSchemaRDD
      val foo = (a: Int, b: String) => a.toString + b
      ```
      
      the newly introduced DSL enables the following syntax
      
      ```scala
      import org.apache.spark.sql.catalyst.dsl._
      testData.select(Star(None), foo.call('key, 'value) as 'result)
      ```
      
      which is equivalent to
      
      ```scala
      testData.registerTempTable("testData")
      sqlContext.registerFunction("foo", foo)
      sql("SELECT *, foo(key, value) AS result FROM testData")
      ```
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3067 from liancheng/udf-dsl and squashes the following commits:
      
      f132818 [Cheng Lian] Adds DSL support for Scala UDF
      c238fb42
    • [SPARK-3594] [PySpark] [SQL] take more rows to infer schema or sampling · 24544fbc
      Davies Liu authored
      This patch tries to infer the schema for an RDD which has empty values (None, [], {}) in the first row. It looks at the first 100 rows and merges the types into a schema, also merging fields of StructType together. If there is still a NullType in the schema, it shows a warning telling the user to try sampling.
      
      If a sampling ratio is provided, it infers the schema from all the sampled rows.
      
      Also adds samplingRatio to jsonFile() and jsonRDD().
      
      Author: Davies Liu <davies.liu@gmail.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #2716 from davies/infer and squashes the following commits:
      
      e678f6d [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
      34b5c63 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
      567dc60 [Davies Liu] update docs
      9767b27 [Davies Liu] Merge branch 'master' into infer
      e48d7fb [Davies Liu] fix tests
      29e94d5 [Davies Liu] let NullType inherit from PrimitiveType
      ee5d524 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
      540d1d5 [Davies Liu] merge fields for StructType
      f93fd84 [Davies Liu] add more tests
      3603e00 [Davies Liu] take more rows to infer schema, or infer the schema by sampling the RDD
      24544fbc
    • [SPARK-4207][SQL] Query which has syntax like 'not like' is not working in Spark SQL · 2b6e1ce6
      ravipesala authored
      Queries that use 'not like' do not work in Spark SQL.
      
      sql("SELECT * FROM records where value not like 'val%'")
      The same query works in Spark HiveQL.
      
      Author: ravipesala <ravindra.pesala@huawei.com>
      
      Closes #3075 from ravipesala/SPARK-4207 and squashes the following commits:
      
      35c11e7 [ravipesala] Supported 'not like' syntax in sql
      2b6e1ce6
    • [SPARK-4211][Build] Fixes hive.version in Maven profile hive-0.13.1 · df607da0
      fi authored
      The `hive-0.13.1` Maven profile should reference `hive.version=0.13.1a` instead of `hive.version=0.13.1`,
      e.g. when building with `mvn -Phive -Phive-0.13.1`.

      Note: `hive.version=0.13.1a` is the default property value. However, when the `hive-0.13.1` Maven profile was specified explicitly, the wrong version was selected.
      References: PR #2685, which resolved a package incompatibility issue with Hive 0.13.1 by introducing the special version Hive 0.13.1a.
      
      Author: fi <coderfi@gmail.com>
      
      Closes #3072 from coderfi/master and squashes the following commits:
      
      7ca4b1e [fi] Fixes the `hive-0.13.1` maven profile referencing `hive.version=0.13.1` instead of the Spark compatible `hive.version=0.13.1a` Note: `hive.version=0.13.1a` is the default version. However, when explicitly specifying the `hive-0.13.1` maven profile, the wrong one would be selected. e.g. mvn -Phive -Phive=0.13.1 See PR #2685
      df607da0
    • [SPARK-4148][PySpark] fix seed distribution and add some tests for rdd.sample · 3cca1962
      Xiangrui Meng authored
      The current way of seed distribution makes the random sequences from partition i and i+1 offset by 1.
      
      ~~~
      In [14]: import random
      
      In [15]: r1 = random.Random(10)
      
      In [16]: r1.randint(0, 1)
      Out[16]: 1
      
      In [17]: r1.random()
      Out[17]: 0.4288890546751146
      
      In [18]: r1.random()
      Out[18]: 0.5780913011344704
      
      In [19]: r2 = random.Random(10)
      
      In [20]: r2.randint(0, 1)
      Out[20]: 1
      
      In [21]: r2.randint(0, 1)
      Out[21]: 0
      
      In [22]: r2.random()
      Out[22]: 0.5780913011344704
      ~~~
      
      Note: The new tests are not for this bug fix.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3010 from mengxr/SPARK-4148 and squashes the following commits:
      
      869ae4b [Xiangrui Meng] move tests tests.py
      c1bacd9 [Xiangrui Meng] fix seed distribution and add some tests for rdd.sample
      3cca1962
    • [EC2] Factor out Mesos spark-ec2 branch · 2aca97c7
      Nicholas Chammas authored
      We reference a specific branch in two places. This patch makes it one place.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #3008 from nchammas/mesos-spark-ec2-branch and squashes the following commits:
      
      10a6089 [Nicholas Chammas] factor out mess spark-ec2 branch
      2aca97c7
    • [SPARK-4163][Core][WebUI] Send the fetch failure message back to Web UI · 76386e1a
      zsxwing authored
      This is a PR to send the fetch failure message back to Web UI.
      Before:
      ![f1](https://cloud.githubusercontent.com/assets/1000778/4856595/1f036c80-60be-11e4-956f-335147fbccb7.png)
      ![f2](https://cloud.githubusercontent.com/assets/1000778/4856596/1f11cbea-60be-11e4-8fe9-9f9b2b35c884.png)
      
      After (please ignore the meaning of the exception; I threw it in the code directly because it's hard to simulate a fetch failure):
      ![e1](https://cloud.githubusercontent.com/assets/1000778/4856600/2657ea38-60be-11e4-9f2d-d56c5f900f10.png)
      ![e2](https://cloud.githubusercontent.com/assets/1000778/4856601/26595008-60be-11e4-912b-2744af786991.png)
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3032 from zsxwing/SPARK-4163 and squashes the following commits:
      
      f7e1faf [zsxwing] Discard changes for FetchFailedException and minor modification
      4e946f7 [zsxwing] Add e as the cause of SparkException
      316767d [zsxwing] Add private[storage] to FetchResult
      d51b0b6 [zsxwing] Set e as the cause of FetchFailedException
      b88c919 [zsxwing] Use 'private[storage]' for case classes instead of 'sealed'
      62103fd [zsxwing] Update as per review
      0c07d1f [zsxwing] Backward-compatible support
      a3bca65 [zsxwing] Send the fetch failure message back to Web UI
      76386e1a
    • [SPARK-4177][Doc]update build doc since JDBC/CLI support hive 13 now · 001acc44
      wangfei authored
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #3042 from scwf/patch-9 and squashes the following commits:
      
      3784ed1 [wangfei] remove 'TODO'
      1891553 [wangfei] update build doc since JDBC/CLI support hive 13
      001acc44
  4. Nov 02, 2014
    • Close #2971. · d6e4c591
      Reynold Xin authored
      d6e4c591
    • [SPARK-4183] Enable NettyBlockTransferService by default · 1ae51f6d
      Aaron Davidson authored
      Note that we're turning this on for at least the first part of the QA period as a trial. We want to enable this (and deprecate the NioBlockTransferService) as soon as possible in the hopes that NettyBlockTransferService will be more stable and easier to maintain. We will turn it off if we run into major issues.
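
      A minimal sketch of opting back into the previous implementation; treat the exact configuration key as an assumption:

      ```scala
      import org.apache.spark.SparkConf

      // Fall back to the NIO-based transport if the Netty one causes problems.
      val conf = new SparkConf().set("spark.shuffle.blockTransferService", "nio")
      ```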
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3049 from aarondav/enable-netty and squashes the following commits:
      
      bb981cc [Aaron Davidson] [SPARK-4183] Enable NettyBlockTransferService by default
      1ae51f6d
    • [SPARK-3572] [SQL] Internal API for User-Defined Types · ebd64805
      Joseph K. Bradley authored
      This PR adds User-Defined Types (UDTs) to SQL. It is a precursor to using SchemaRDD as a Dataset for the new MLlib API. Currently, the UDT API is private since there is incomplete support (e.g., no Java or Python support yet).
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      Author: Michael Armbrust <michael@databricks.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3063 from marmbrus/udts and squashes the following commits:
      
      7ccfc0d [Michael Armbrust] remove println
      46a3aee [Michael Armbrust] Slightly easier to read test output.
      6cc434d [Michael Armbrust] Recursively convert rows.
      e369b91 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into udts
      15c10a6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into sql-udt2
      f3c72fe [Joseph K. Bradley] Fixing merge
      e13cd8a [Joseph K. Bradley] Removed Vector UDTs
      5817b2b [Joseph K. Bradley] style edits
      30ce5b2 [Joseph K. Bradley] updates based on code review
      d063380 [Joseph K. Bradley] Cleaned up Java UDT Suite, and added warning about element ordering when creating schema from Java Bean
      a571bb6 [Joseph K. Bradley] Removed old UDT code (registry and Java UDTs).  Cleaned up other code.  Extended JavaUserDefinedTypeSuite
      6fddc1c [Joseph K. Bradley] Made MyLabeledPoint into a Java Bean
      20630bc [Joseph K. Bradley] fixed scalastyle
      fa86b20 [Joseph K. Bradley] Removed Java UserDefinedType, and made UDTs private[spark] for now
      8de957c [Joseph K. Bradley] Modified UserDefinedType to store Java class of user type so that registerUDT takes only the udt argument.
      8b242ea [Joseph K. Bradley] Fixed merge error after last merge.  Note: Last merge commit also removed SQL UDT examples from mllib.
      7f29656 [Joseph K. Bradley] Moved udt case to top of all matches.  Small cleanups
      b028675 [Xiangrui Meng] allow any type in UDT
      4500d8a [Xiangrui Meng] update example code
      87264a5 [Xiangrui Meng] remove debug code
      3143ac3 [Xiangrui Meng] remove unnecessary changes
      cfbc321 [Xiangrui Meng] support UDT in parquet
      db16139 [Joseph K. Bradley] Added more doc for UserDefinedType.  Removed unused code in Suite
      759af7a [Joseph K. Bradley] Added more doc to UserDefineType
      63626a4 [Joseph K. Bradley] Updated ScalaReflectionsSuite per @marmbrus suggestions
      51e5282 [Joseph K. Bradley] fixed 1 test
      f025035 [Joseph K. Bradley] Cleanups before PR.  Added new tests
      85872f6 [Michael Armbrust] Allow schema calculation to be lazy, but ensure its available on executors.
      dff99d6 [Joseph K. Bradley] Added UDTs for Vectors in MLlib, plus DatasetExample using the UDTs
      cd60cb4 [Joseph K. Bradley] Trying to get other SQL tests to run
      34a5831 [Joseph K. Bradley] Added MLlib dependency on SQL.
      e1f7b9c [Joseph K. Bradley] blah
      2f40c02 [Joseph K. Bradley] renamed UDT types
      3579035 [Joseph K. Bradley] udt annotation now working
      b226b9e [Joseph K. Bradley] Changing UDT to annotation
      fea04af [Joseph K. Bradley] more cleanups
      964b32e [Joseph K. Bradley] some cleanups
      893ee4c [Joseph K. Bradley] udt finallly working
      50f9726 [Joseph K. Bradley] udts
      04303c9 [Joseph K. Bradley] udts
      39f8707 [Joseph K. Bradley] removed old udt suite
      273ac96 [Joseph K. Bradley] basic UDT is working, but deserialization has yet to be done
      8bebf24 [Joseph K. Bradley] commented out convertRowToScala for debugging
      53de70f [Joseph K. Bradley] more udts...
      982c035 [Joseph K. Bradley] still working on UDTs
      19b2f60 [Joseph K. Bradley] still working on UDTs
      0eaeb81 [Joseph K. Bradley] Still working on UDTs
      105c5a3 [Joseph K. Bradley] Adding UserDefinedType to SQL, not done yet.
      ebd64805
    • [SPARK-4183] Close transport-related resources between SparkContexts · 2ebd1df3
      Aaron Davidson authored
      A leak of event loops may be causing test failures.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3053 from aarondav/leak and squashes the following commits:
      
      e676d18 [Aaron Davidson] Typo!
      8f96475 [Aaron Davidson] Keep original ssc semantics
      7e49f10 [Aaron Davidson] A leak of event loops may be causing test failures.
      2ebd1df3
    • [SPARK-2189][SQL] Adds dropTempTable API · 9081b9f9
      Cheng Lian authored
      This PR adds an API for unregistering temporary tables. If a temporary table has been cached before, it's unpersisted as well.
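
      A usage sketch, assuming the SchemaRDD-era `SQLContext` methods (`registerTempTable`, `cacheTable`) alongside the new API:

      ```scala
      // sc and sqlContext are assumed to be an existing SparkContext and SQLContext.
      import sqlContext.createSchemaRDD

      case class Record(key: Int, value: String)
      val records = sc.parallelize(1 to 10).map(i => Record(i, i.toString))

      records.registerTempTable("records")
      sqlContext.cacheTable("records")

      sqlContext.dropTempTable("records") // unregisters the table and unpersists the cached data
      ```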
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #3039 from liancheng/unregister-temp-table and squashes the following commits:
      
      54ae99f [Cheng Lian] Fixes Scala styling issue
      1948c14 [Cheng Lian] Removes the unpersist argument
      aca41d3 [Cheng Lian] Ensures thread safety
      7d4fb2b [Cheng Lian] Adds unregisterTempTable API
      9081b9f9
    • [SPARK-4185][SQL] JSON schema inference failed when dealing with type conflicts in arrays · 06232d23
      Yin Huai authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-4185.
      
      This PR also has the fix of #3052.
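
      For illustration, a sketch of the kind of array-element type conflict this addresses (the concrete JSON is made up):

      ```scala
      // An array mixing a primitive and a struct previously broke schema inference.
      val json = sc.parallelize(Seq("""{"a": [1, {"b": 2}]}"""))
      val schemaRDD = sqlContext.jsonRDD(json)
      schemaRDD.printSchema()
      ```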
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #3056 from yhuai/SPARK-4185 and squashes the following commits:
      
      ed3a5a8 [Yin Huai] Correctly handle type conflicts between structs and primitive types in an array.
      06232d23
    • [SPARK-4191][SQL]move wrapperFor to HiveInspectors to reuse it · e749f5de
      wangfei authored
      Moves wrapperFor from InsertIntoHiveTable to HiveInspectors so it can be reused, e.g. when writing data with an ObjectInspector (such as for ORC support).
      
      Author: wangfei <wangfei1@huawei.com>
      Author: scwf <wangfei1@huawei.com>
      
      Closes #3057 from scwf/reuse-wraperfor and squashes the following commits:
      
      7ccf932 [scwf] fix conflicts
      d44f4da [wangfei] fix imports
      9bf1b50 [wangfei] revert no related change
      9a5276a [wangfei] move wrapfor to hiveinspector to reuse them
      e749f5de
    • [SPARK-3791][SQL] Provides Spark version and Hive version in HiveThriftServer2 · c9f84004
      Cheng Lian authored
      This PR overrides the `GetInfo` Hive Thrift API to provide correct version information. Another property `spark.sql.hive.version` is added to reveal the underlying Hive version. These are generally useful for Spark SQL ODBC driver providers. The Spark version information is extracted from the jar manifest. Also took the chance to remove the `SET -v` hack, which was a workaround for Simba ODBC driver connectivity.
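
      For illustration, one generic way to read a version from a jar manifest (not necessarily the exact lookup used in this patch):

      ```scala
      // getImplementationVersion returns null when the manifest has no such attribute.
      val sparkVersion =
        Option(classOf[org.apache.spark.SparkContext].getPackage.getImplementationVersion)
          .getOrElse("<unknown>")
      println(s"Spark version from manifest: $sparkVersion")
      ```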
      
      TODO
      
      - [x] Find a general way to figure out Hive (or even any dependency) version.
      
        This [blog post](http://blog.soebes.de/blog/2014/01/02/version-information-into-your-appas-with-maven/) suggests several methods to inspect application version. In the case of Spark, this can be tricky because the chosen method:
      
        1. must apply to both Maven and SBT builds
      
          For Maven builds, we can retrieve the version information from the META-INF/maven directory within the assembly jar. But this doesn't work for SBT builds.
      
        2. must not rely on the original jars of dependencies to extract a specific dependency's version, because Spark uses an assembly jar.
      
          This implies we can't read the Hive version from Hive jar files, since the standard Spark distribution doesn't include them.
      
        3. should play well with `SPARK_PREPEND_CLASSES` to ease local testing during development.
      
           `SPARK_PREPEND_CLASSES` prevents classes from being loaded from the assembly jar, so we can't locate the jar file and read its manifest.
      
        Given these, maybe the only reliable method is to generate a source file containing version information at build time. pwendell Do you have any suggestions from the perspective of the build process?
      
      **Update** Hive version is now retrieved from the newly introduced `HiveShim` object.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #2843 from liancheng/get-info and squashes the following commits:
      
      a873d0f [Cheng Lian] Updates test case
      53f43cd [Cheng Lian] Retrieves underlying Hive verson via HiveShim
      1d282b8 [Cheng Lian] Removes the Simba ODBC "SET -v" hack
      f857fce [Cheng Lian] Overrides Hive GetInfo Thrift API and adds Hive version property
      c9f84004
    • [SQL] Fixes race condition in CliSuite · 495a1320
      Cheng Lian authored
      `CliSuite` has been flaky for a while; this PR tries to improve the situation by fixing a race condition in `CliSuite`. The `captureOutput` function is used to capture both stdout and stderr output of the forked external process in two background threads and search for expected strings, but it wasn't properly synchronized before.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3060 from liancheng/fix-cli-suite and squashes the following commits:
      
      a70569c [Cheng Lian] Fixes race condition in CliSuite
      495a1320
    • [SPARK-4182][SQL] Fixes ColumnStats classes for boolean, binary and complex data types · e4b80894
      Cheng Lian authored
      `NoopColumnStats` was once used for binary, boolean and complex data types. This `ColumnStats` doesn't return properly shaped column statistics and causes caching failure if a table contains columns of the aforementioned types.
      
      This PR adds `BooleanColumnStats`, `BinaryColumnStats` and `GenericColumnStats`, used for boolean, binary and all complex data types respectively. In addition, `NoopColumnStats` returns properly shaped column statistics containing null count and row count, but this class is now used for testing purpose only.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3059 from liancheng/spark-4182 and squashes the following commits:
      
      b398cfd [Cheng Lian] Fixes failed test case
      fb3ee85 [Cheng Lian] Fixes SPARK-4182
      e4b80894
    • [SPARK-3247][SQL] An API for adding data sources to Spark SQL · 9c0eb57c
      Michael Armbrust authored
      This PR introduces a new set of APIs to Spark SQL to allow other developers to add support for reading data from new sources in `org.apache.spark.sql.sources`.
      
      New sources must implement the interface `BaseRelation`, which is responsible for describing the schema of the data.  BaseRelations have three `Scan` subclasses, which are responsible for producing an RDD containing row objects.  The [various Scan interfaces](https://github.com/marmbrus/spark/blob/foreign/sql/core/src/main/scala/org/apache/spark/sql/sources/package.scala#L50) allow for optimizations such as column pruning and filter push down, when the underlying data source can handle these operations.
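
      A minimal sketch of a custom relation against the interface shape described above (the exact package layout, and whether these are traits or abstract classes, may differ in the merged API):

      ```scala
      import org.apache.spark.rdd.RDD
      import org.apache.spark.sql._          // SQLContext, Row, StructType, StructField, IntegerType
      import org.apache.spark.sql.sources._  // TableScan, RelationProvider (assumed paths)

      // A single-table relation that produces the integers 1 to 10.
      class OneToTenRelation(val sqlContext: SQLContext) extends TableScan {
        override def schema: StructType =
          StructType(StructField("i", IntegerType, nullable = false) :: Nil)
        override def buildScan(): RDD[Row] =
          sqlContext.sparkContext.parallelize(1 to 10).map(Row(_))
      }

      // By convention, the class named in a SQL USING clause resolves to a provider like this.
      class DefaultSource extends RelationProvider {
        override def createRelation(sqlContext: SQLContext, parameters: Map[String, String]) =
          new OneToTenRelation(sqlContext)
      }
      ```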
      
      By implementing a class that inherits from RelationProvider, these data sources can be accessed using pure SQL.  I've used the functionality to update the JSON support so it can now be used in this way, as follows:
      
      ```sql
      CREATE TEMPORARY TABLE jsonTableSQL
      USING org.apache.spark.sql.json
      OPTIONS (
        path '/home/michael/data.json'
      )
      ```
      
      Further example usage can be found in the test cases: https://github.com/marmbrus/spark/tree/foreign/sql/core/src/test/scala/org/apache/spark/sql/sources
      
      There is also a library that uses this new API to read avro data available here:
      https://github.com/marmbrus/sql-avro
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2475 from marmbrus/foreign and squashes the following commits:
      
      1ed6010 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into foreign
      ab2c31f [Michael Armbrust] fix test
      1d41bb5 [Michael Armbrust] unify argument names
      5b47901 [Michael Armbrust] Remove sealed, more filter types
      fab154a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into foreign
      e3e690e [Michael Armbrust] Add hook for extraStrategies
      a70d602 [Michael Armbrust] Fix style, more tests, FilteredSuite => PrunedFilteredSuite
      70da6d9 [Michael Armbrust] Modify API to ease binary compatibility and interop with Java
      7d948ae [Michael Armbrust] Fix equality of AttributeReference.
      5545491 [Michael Armbrust] Address comments
      5031ac3 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into foreign
      22963ef [Michael Armbrust] package objects compile wierdly...
      b069146 [Michael Armbrust] traits => abstract classes
      34f836a [Michael Armbrust] Make @DeveloperApi
      0d74bcf [Michael Armbrust] Add documention on object life cycle
      3e06776 [Michael Armbrust] remove line wraps
      de3b68c [Michael Armbrust] Remove empty file
      360cb30 [Michael Armbrust] style and java api
      2957875 [Michael Armbrust] add override
      0fd3a07 [Michael Armbrust] Draft of data sources API
      9c0eb57c
    • [HOTFIX][SQL] hive test missing some golden files · f0a4b630
      wangfei authored
      cc marmbrus
      
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #3055 from scwf/hotfix and squashes the following commits:
      
      d881bd7 [wangfei] miss golden files
      f0a4b630