  1. Jul 27, 2014
    • Rahul Singhal's avatar
      SPARK-2651: Add maven scalastyle plugin · d7eac4c3
      Rahul Singhal authored
      Can be run as: "mvn scalastyle:check"
      
      Author: Rahul Singhal <rahul.singhal@guavus.com>
      
      Closes #1550 from rahulsinghaliitd/SPARK-2651 and squashes the following commits:
      
      53748dd [Rahul Singhal] SPARK-2651: Add maven scalastyle plugin
      d7eac4c3
    • Patrick Wendell's avatar
      Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server" · e5bbce9a
      Patrick Wendell authored
      This reverts commit f6ff2a61.
      e5bbce9a
    • Doris Xin's avatar
      [SPARK-2514] [mllib] Random RDD generator · 81fcdd22
      Doris Xin authored
      Utilities for generating random RDDs.
      
      RandomRDD and RandomVectorRDD are created instead of using `sc.parallelize(range:Range)` because `Range` objects in Scala can only have `size <= Int.MaxValue`.
      
      The object `RandomRDDGenerators` can be transformed into a generator class to reduce the number of auxiliary methods for optional arguments.
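
The size limit described above can be illustrated without Spark. The following is a minimal Python sketch, not the patch's code: `partition_sizes` and `random_partition` are hypothetical helper names. Each partition is described only by an element count and a seed, so the total size can exceed `Int.MaxValue` without materializing an index range.

```python
import random

INT_MAX = 2**31 - 1  # Scala's Int.MaxValue


def partition_sizes(total, num_partitions):
    """Split `total` elements as evenly as possible across partitions."""
    base, extra = divmod(total, num_partitions)
    return [base + (1 if i < extra else 0) for i in range(num_partitions)]


def random_partition(size, seed):
    """Lazily yield `size` uniform samples for one partition (hypothetical helper)."""
    rng = random.Random(seed)
    for _ in range(size):
        yield rng.random()


# A total size three times larger than a Scala Range could hold.
sizes = partition_sizes(3 * INT_MAX, 8)

# Only the values actually consumed are ever generated.
gen = random_partition(sizes[0], seed=42)
first_values = [next(gen) for _ in range(3)]
```

Seeding each partition separately keeps the output reproducible when a partition has to be recomputed.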
      
      Author: Doris Xin <doris.s.xin@gmail.com>
      
      Closes #1520 from dorx/randomRDD and squashes the following commits:
      
      01121ac [Doris Xin] reviewer comments
      6bf27d8 [Doris Xin] Merge branch 'master' into randomRDD
      a8ea92d [Doris Xin] Reviewer comments
      063ea0b [Doris Xin] Merge branch 'master' into randomRDD
      aec68eb [Doris Xin] newline
      bc90234 [Doris Xin] units passed.
      d56cacb [Doris Xin] impl with RandomRDD
      92d6f1c [Doris Xin] solution for Cloneable
      df5bcff [Doris Xin] Merge branch 'generator' into randomRDD
      f46d928 [Doris Xin] WIP
      49ed20d [Doris Xin] alternative poisson distribution generator
      7cb0e40 [Doris Xin] fix for data inconsistency
      8881444 [Doris Xin] RandomRDDGenerator: initial design
      81fcdd22
    • Andrew Or's avatar
      [SPARK-1777] Prevent OOMs from single partitions · ecf30ee7
      Andrew Or authored
      **Problem.** When caching, we currently unroll the entire RDD partition before making sure we have enough free memory. This is a common cause for OOMs especially when (1) the BlockManager has little free space left in memory, and (2) the partition is large.
      
      **Solution.** We maintain a global memory pool of `M` bytes shared across all threads, similar to the way we currently manage memory for shuffle aggregation. Then, while we unroll each partition, periodically check if there is enough space to continue. If not, drop enough RDD blocks to ensure we have at least `M` bytes to work with, then try again. If we still don't have enough space to unroll the partition, give up and drop the block to disk directly if applicable.
      
      **New configurations.**
      - `spark.storage.bufferFraction` - the value of `M` as a fraction of the storage memory. (default: 0.2)
      - `spark.storage.safetyFraction` - a margin of safety in case size estimation is slightly off. This is the equivalent of the existing `spark.shuffle.safetyFraction`. (default 0.9)
      
      For more detail, see the [design document](https://issues.apache.org/jira/secure/attachment/12651793/spark-1777-design-doc.pdf). Tests pending for performance and memory usage patterns.
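
The periodic check described in the solution can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Spark implementation: `unroll_safely`, `free_bytes`, `check_period` and `size_of` are hypothetical names, and the sketch neither evicts other blocks nor writes to disk. It only shows how unrolling can stop early and hand back an iterator instead of exhausting memory.

```python
import sys


def _chain(head, tail):
    """Replay the already unrolled values, then continue with the rest."""
    yield from head
    yield from tail


def unroll_safely(values, free_bytes, check_period=16, size_of=sys.getsizeof):
    """Unroll `values` into a list while periodically checking a byte budget.

    Returns ("unrolled", list) if everything fit within `free_bytes`, or
    ("partial", iterator) where the iterator yields every value in order.
    """
    it = iter(values)
    unrolled = []
    used = 0
    for count, value in enumerate(it, start=1):
        unrolled.append(value)
        used += size_of(value)
        # Check only every `check_period` elements to keep the check cheap.
        if count % check_period == 0 and used > free_bytes:
            return "partial", _chain(unrolled, it)
    if used > free_bytes:
        # Final check for a tail that ended between two periodic checks.
        return "partial", iter(unrolled)
    return "unrolled", unrolled
```

A caller that receives `"partial"` can stream the iterator to another destination (for example disk) rather than holding the whole partition in memory.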
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1165 from andrewor14/them-rdd-memories and squashes the following commits:
      
      e77f451 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      c7c8832 [Andrew Or] Simplify logic + update a few comments
      269d07b [Andrew Or] Very minor changes to tests
      6645a8a [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      b7e165c [Andrew Or] Add new tests for unrolling blocks
      f12916d [Andrew Or] Slightly clean up tests
      71672a7 [Andrew Or] Update unrollSafely tests
      369ad07 [Andrew Or] Correct ensureFreeSpace and requestMemory behavior
      f4d035c [Andrew Or] Allow one thread to unroll multiple blocks
      a66fbd2 [Andrew Or] Rename a few things + update comments
      68730b3 [Andrew Or] Fix weird scalatest behavior
      e40c60d [Andrew Or] Fix MIMA excludes
      ff77aa1 [Andrew Or] Fix tests
      1a43c06 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      b9a6eee [Andrew Or] Simplify locking behavior on unrollMemoryMap
      ed6cda4 [Andrew Or] Formatting fix (super minor)
      f9ff82e [Andrew Or] putValues -> putIterator + putArray
      beb368f [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      8448c9b [Andrew Or] Fix tests
      a49ba4d [Andrew Or] Do not expose unroll memory check period
      69bc0a5 [Andrew Or] Always synchronize on putLock before unrollMemoryMap
      3f5a083 [Andrew Or] Simplify signature of ensureFreeSpace
      dce55c8 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      8288228 [Andrew Or] Synchronize put and unroll properly
      4f18a3d [Andrew Or] bufferFraction -> unrollFraction
      28edfa3 [Andrew Or] Update a few comments / log messages
      728323b [Andrew Or] Do not synchronize every 1000 elements
      5ab2329 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      129c441 [Andrew Or] Fix bug: Use toArray rather than array
      9a65245 [Andrew Or] Update a few comments + minor control flow changes
      57f8d85 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      abeae4f [Andrew Or] Add comment clarifying the MEMORY_AND_DISK case
      3dd96aa [Andrew Or] AppendOnlyBuffer -> Vector (+ a few small changes)
      f920531 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      0871835 [Andrew Or] Add an effective storage level interface to BlockManager
      64e7d4c [Andrew Or] Add/modify a few comments (minor)
      8af2f35 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      4f4834e [Andrew Or] Use original storage level for blocks dropped to disk
      ecc8c2d [Andrew Or] Fix binary incompatibility
      24185ea [Andrew Or] Avoid dropping a block back to disk if reading from disk
      2b7ee66 [Andrew Or] Fix bug in SizeTracking*
      9b9a273 [Andrew Or] Fix tests
      20eb3e5 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      649bdb3 [Andrew Or] Document spark.storage.bufferFraction
      a10b0e7 [Andrew Or] Add initial memory request threshold + rename a few things
      e9c3cb0 [Andrew Or] cacheMemoryMap -> unrollMemoryMap
      198e374 [Andrew Or] Unfold -> unroll
      0d50155 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      d9d02a8 [Andrew Or] Remove unused param in unfoldSafely
      ec728d8 [Andrew Or] Add tests for safe unfolding of blocks
      22b2209 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      078eb83 [Andrew Or] Add check for hasNext in PrimitiveVector.iterator
      0871535 [Andrew Or] Fix tests in BlockManagerSuite
      d68f31e [Andrew Or] Safely unfold blocks for all memory puts
      5961f50 [Andrew Or] Fix tests
      195abd7 [Andrew Or] Refactor: move unfold logic to MemoryStore
      1e82d00 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      3ce413e [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      d5dd3b4 [Andrew Or] Free buffer memory in finally
      ea02eec [Andrew Or] Fix tests
      b8e1d9c [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      a8704c1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      e1b8b25 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      87aa75c [Andrew Or] Fix mima excludes again (typo)
      11eb921 [Andrew Or] Clarify comment (minor)
      50cae44 [Andrew Or] Remove now duplicate mima exclude
      7de5ef9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      df47265 [Andrew Or] Fix binary incompatibility
      6d05a81 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      f94f5af [Andrew Or] Update a few comments (minor)
      776aec9 [Andrew Or] Prevent OOM if a single RDD partition is too large
      bbd3eea [Andrew Or] Fix CacheManagerSuite to use Array
      97ea499 [Andrew Or] Change BlockManager interface to use Arrays
      c12f093 [Andrew Or] Add SizeTrackingAppendOnlyBuffer and tests
      ecf30ee7
    • Cheng Lian's avatar
      [SPARK-2410][SQL] Merging Hive Thrift/JDBC server · f6ff2a61
      Cheng Lian authored
      (This is a replacement of #1399, trying to fix potential `HiveThriftServer2` port collision between parallel builds. Please refer to [these comments](https://github.com/apache/spark/pull/1399#issuecomment-50212572) for details.)
      
      JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
      
      Merging the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).
      
      Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1600 from liancheng/jdbc and squashes the following commits:
      
      ac4618b [Cheng Lian] Uses random port for HiveThriftServer2 to avoid collision with parallel builds
      090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
      21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
      fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
      199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
      1083e9d [Cheng Lian] Fixed failed test suites
      7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
      9cc0f06 [Cheng Lian] Starts beeline with spark-submit
      cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
      061880f [Cheng Lian] Addressed all comments by @pwendell
      7755062 [Cheng Lian] Adapts test suites to spark-submit settings
      40bafef [Cheng Lian] Fixed more license header issues
      e214aab [Cheng Lian] Added missing license headers
      b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
      f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
      3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
      a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
      61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
      2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
      f6ff2a61
    • Cheng Lian's avatar
      [SPARK-2705][CORE] Fixed stage description in stage info page · 2bbf2353
      Cheng Lian authored
      Stage description should be a `String`, but was changed to an `Option[String]` by mistake:
      
      ![stage-desc-small](https://cloud.githubusercontent.com/assets/230655/3655611/f6d0b0f6-117b-11e4-83ed-71000dcd5009.png)
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1524 from liancheng/fix-stage-desc and squashes the following commits:
      
      3c69327 [Cheng Lian] Fixed stage description object type in Web UI stage table
      2bbf2353
    • Matei Zaharia's avatar
      SPARK-2684: Update ExternalAppendOnlyMap to take an iterator as input · 98570530
      Matei Zaharia authored
      This will decrease object allocation from the "update" closure used in map.changeValue.
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #1607 from mateiz/spark-2684 and squashes the following commits:
      
      b7d89e6 [Matei Zaharia] Add insertAll for Iterables too, and fix some code style
      561fc97 [Matei Zaharia] Update ExternalAppendOnlyMap to take an iterator as input
      98570530
    • Doris Xin's avatar
      [SPARK-2679] [MLLib] Ser/De for Double · 3a69c72e
      Doris Xin authored
      Added a set of serializer/deserializer for Double in _common.py and PythonMLLibAPI in MLLib.
      
      Author: Doris Xin <doris.s.xin@gmail.com>
      
      Closes #1581 from dorx/doubleSerDe and squashes the following commits:
      
      86a85b3 [Doris Xin] Merge branch 'master' into doubleSerDe
      2bfe7a4 [Doris Xin] Removed magic byte
      ad4d0d9 [Doris Xin] removed a space in unit
      a9020bc [Doris Xin] units passed
      7dad9af [Doris Xin] WIP
      3a69c72e
    • Xiangrui Meng's avatar
      [SPARK-2361][MLLIB] Use broadcast instead of serializing data directly into task closure · aaf2b735
      Xiangrui Meng authored
      We saw task serialization problems with large feature dimensions, which could be avoided if we don't serialize data directly into the task closure but use broadcast variables instead. This PR uses broadcast in both training and prediction and adds tests to make sure the task size is small.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1427 from mengxr/broadcast-new and squashes the following commits:
      
      b9a1228 [Xiangrui Meng] style update
      b97c184 [Xiangrui Meng] minimal change to LBFGS
      9ebadcc [Xiangrui Meng] add task size test to RowMatrix
      9427bf0 [Xiangrui Meng] add task size tests to linear methods
      e0a5cf2 [Xiangrui Meng] add task size test to GD
      28a8411 [Xiangrui Meng] add test for NaiveBayes
      380778c [Xiangrui Meng] update KMeans test
      bccab92 [Xiangrui Meng] add task size test to LBFGS
      02103ba [Xiangrui Meng] remove print
      e73d68e [Xiangrui Meng] update tests for k-means
      174cb15 [Xiangrui Meng] use local-cluster for test with a small akka.frameSize
      1928a5a [Xiangrui Meng] add test for KMeans task size
      e00c2da [Xiangrui Meng] use broadcast in GD, KMeans
      010d076 [Xiangrui Meng] modify NaiveBayesModel and GLM to use broadcast
      aaf2b735
    • Matei Zaharia's avatar
      SPARK-2680: Lower spark.shuffle.memoryFraction to 0.2 by default · b547f69b
      Matei Zaharia authored
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #1593 from mateiz/spark-2680 and squashes the following commits:
      
      3c949c4 [Matei Zaharia] Lower spark.shuffle.memoryFraction to 0.2 by default
      b547f69b
  2. Jul 26, 2014
    • Josh Rosen's avatar
      [SPARK-2601] [PySpark] Fix Py4J error when transforming pickleFiles · ba46bbed
      Josh Rosen authored
      Similar to SPARK-1034, the problem was that Py4J didn’t cope well with the fake ClassTags used in the Java API.  It doesn’t look like there’s any reason why PythonRDD needs to take a ClassTag, since it just ignores the type of the previous RDD, so I removed the type parameter and we no longer pass ClassTags from Python.
      
      Author: Josh Rosen <joshrosen@apache.org>
      
      Closes #1605 from JoshRosen/spark-2601 and squashes the following commits:
      
      b68e118 [Josh Rosen] Fix Py4J error when transforming pickleFiles [SPARK-2601]
      ba46bbed
    • Reynold Xin's avatar
      [SPARK-2704] Name threads in ConnectionManager and mark them as daemon. · 12901643
      Reynold Xin authored
      handleMessageExecutor, handleReadWriteExecutor, and handleConnectExecutor are not marked as daemon and not named. I think there exists some condition in which Spark programs won't terminate because of this.
      
      Stack dump attached in https://issues.apache.org/jira/browse/SPARK-2704
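
The same concern exists outside the JVM: a non-daemon thread keeps its process alive. As a rough Python analogue (not the ConnectionManager code; the thread name and `make_worker` helper are made up for illustration), naming a thread and marking it as daemon looks like this:

```python
import threading
import time


def make_worker(name, target):
    """Create a named daemon thread; it will not block interpreter exit."""
    return threading.Thread(target=target, name=name, daemon=True)


def idle_loop():
    # Stands in for an executor thread waiting for work forever.
    while True:
        time.sleep(0.1)


worker = make_worker("handle-message-executor-0", idle_loop)
worker.start()
```

With `daemon=False` the script above would never exit, which mirrors the hang described in the commit message; the explicit name makes the thread easy to spot in a stack dump.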
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1604 from rxin/daemon and squashes the following commits:
      
      98d6a6c [Reynold Xin] [SPARK-2704] Name threads in ConnectionManager and mark them as daemon.
      12901643
    • bpaulin's avatar
      [SPARK-2279] Added emptyRDD method to Java API · c183b92c
      bpaulin authored
      Added emptyRDD method to Java API with tests.
      
      Author: bpaulin <bob@bobpaulin.com>
      
      Closes #1597 from bobpaulin/SPARK-2279 and squashes the following commits:
      
      5ad57c2 [bpaulin] [SPARK-2279] Added emptyRDD method to Java API
      c183b92c
    • Davies Liu's avatar
      [SPARK-2652] [PySpark] Turning some default configs for PySpark · 75663b57
      Davies Liu authored
      Add several default configs for PySpark, related to serialization in JVM.
      
      spark.serializer = org.apache.spark.serializer.KryoSerializer
      spark.serializer.objectStreamReset = 100
      spark.rdd.compress = True
      
      This will help to reduce the memory usage during RDD.partitionBy()
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1568 from davies/conf and squashes the following commits:
      
      cd316f1 [Davies Liu] remove duplicated line
      f71a355 [Davies Liu] rebase to master, add spark.rdd.compress = True
      8f63f45 [Davies Liu] Merge branch 'master' into conf
      8bc9f08 [Davies Liu] fix unittest
      c04a83d [Davies Liu] some default configs for PySpark
      75663b57
    • Hossein's avatar
      [SPARK-2696] Reduce default value of spark.serializer.objectStreamReset · 66f26a46
      Hossein authored
      The current default value of spark.serializer.objectStreamReset is 10,000.
      When trying to re-partition (e.g., to 64 partitions) a large file (e.g., 500MB), containing 1MB records, the serializer will cache 10000 x 1MB x 64 ~= 640 GB which will cause out of memory errors.
      
      This patch sets the default value to a more reasonable default value (100).
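
The estimate above can be reproduced with a short calculation. This is only a back-of-the-envelope sketch using the figures from the commit message (1 MB records, 64 partitions); `cached_mb_estimate` is a hypothetical helper name.

```python
def cached_mb_estimate(reset_interval, record_mb, num_partitions):
    """Upper bound, in MB, on records held between serializer stream resets."""
    return reset_interval * record_mb * num_partitions


old_default_mb = cached_mb_estimate(10_000, 1, 64)  # previous default
new_default_mb = cached_mb_estimate(100, 1, 64)     # default after this patch

old_default_gb = old_default_mb / 1000
new_default_gb = new_default_mb / 1000
```

With the old default the bound is about 640 GB; with the new default of 100 it drops to about 6.4 GB for the same workload.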
      
      Author: Hossein <hossein@databricks.com>
      
      Closes #1595 from falaki/objectStreamReset and squashes the following commits:
      
      650a935 [Hossein] Updated documentation
      1aa0df8 [Hossein] Reduce default value of spark.serializer.objectStreamReset
      66f26a46
    • Josh Rosen's avatar
      [SPARK-1458] [PySpark] Expose sc.version in Java and PySpark · cf3e9fd8
      Josh Rosen authored
      Author: Josh Rosen <joshrosen@apache.org>
      
      Closes #1596 from JoshRosen/spark-1458 and squashes the following commits:
      
      fdbb0bf [Josh Rosen] Add SparkContext.version to Python & Java [SPARK-1458]
      cf3e9fd8
  3. Jul 25, 2014
    • Michael Armbrust's avatar
      [SPARK-2659][SQL] Fix division semantics for hive · 89047912
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1557 from marmbrus/fixDivision and squashes the following commits:
      
      b85077f [Michael Armbrust] Fix unit tests.
      af98f29 [Michael Armbrust] Change DIV to long type
      0c29ae8 [Michael Armbrust] Fix division semantics for hive
      89047912
    • Reynold Xin's avatar
      Part of [SPARK-2456] Removed some HashMaps from DAGScheduler by storing information in Stage. · 9d8666ca
      Reynold Xin authored
      This is part of the scheduler cleanup/refactoring effort to make the scheduler code easier to maintain.
      
      @kayousterhout @markhamstra please take a look ...
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1561 from rxin/dagSchedulerHashMaps and squashes the following commits:
      
      1c44e15 [Reynold Xin] Clear pending tasks in submitMissingTasks.
      620a0d1 [Reynold Xin] Use filterKeys.
      5b54404 [Reynold Xin] Code review feedback.
      c1e9a1c [Reynold Xin] Removed some HashMaps from DAGScheduler by storing information in Stage.
      9d8666ca
    • Michael Armbrust's avatar
      Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server" · afd757a2
      Michael Armbrust authored
      This reverts commit 06dc0d2c.
      
      #1399 is making Jenkins fail.  We should investigate and put this back after it passes tests.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1594 from marmbrus/revertJDBC and squashes the following commits:
      
      59748da [Michael Armbrust] Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
      afd757a2
    • Kay Ousterhout's avatar
      [SPARK-1726] [SPARK-2567] Eliminate zombie stages in UI. · 37ad3b72
      Kay Ousterhout authored
      Due to problems with when we update runningStages (in DAGScheduler.scala)
      and how we decide to send a SparkListenerStageCompleted message to
      SparkListeners, sometimes stages can be shown as "running" in the UI forever
      (even after they have failed).  This issue can manifest when stages are
      resubmitted with 0 tasks, or when the DAGScheduler catches non-serializable
      tasks. The problem also resulted in a (small) memory leak in the DAGScheduler,
      where stages can stay in runningStages forever. This commit fixes
      that problem and adds a unit test.
      
      Thanks tsudukim for helping to look into this issue!
      
      cc markhamstra rxin
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #1566 from kayousterhout/dag_fix and squashes the following commits:
      
      217d74b [Kay Ousterhout] [SPARK-1726] [SPARK-2567] Eliminate zombie stages in UI.
      37ad3b72
    • jerryshao's avatar
      [SPARK-2125] Add sort flag and move sort into shuffle implementations · 47b6b38c
      jerryshao authored
      This patch adds a sort flag to ShuffleDependency and moves sorting into the hash shuffle implementation.
      
      Moving the sort into the shuffle implementation leaves room for other shuffle implementations (such as sort-based shuffle) to optimize sorting as part of the shuffle.
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #1210 from jerryshao/SPARK-2125 and squashes the following commits:
      
      2feaf7b [jerryshao] revert MimaExcludes
      ceddf75 [jerryshao] add MimaExeclude
      f674ff4 [jerryshao] Add missing Scope restriction
      b9fe0dd [jerryshao] Fix some style issues according to comments
      ef6b729 [jerryshao] Change sort flag into Option
      3f6eeed [jerryshao] Fix issues related to unit test
      2f552a5 [jerryshao] Minor changes about naming and order
      c92a281 [jerryshao] Move sort into shuffle implementations
      47b6b38c
    • baishuo(白硕)'s avatar
      [SQL]Update HiveMetastoreCatalog.scala · ab3c6a45
      baishuo(白硕) authored
      I think it's better to define hiveQlTable as a val.
      
      Author: baishuo(白硕) <vc_java@hotmail.com>
      
      Closes #1569 from baishuo/patch-1 and squashes the following commits:
      
      dc2f895 [baishuo(白硕)] Update HiveMetastoreCatalog.scala
      a7b32a2 [baishuo(白硕)] Update HiveMetastoreCatalog.scala
      ab3c6a45
    • Yin Huai's avatar
      [SPARK-2682] Javadoc generated from Scala source code is not in javadoc's index · a19d8c89
      Yin Huai authored
      Add genjavadocSettings back to SparkBuild. It requires #1585 .
      
      https://issues.apache.org/jira/browse/SPARK-2682
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #1584 from yhuai/SPARK-2682 and squashes the following commits:
      
      2e89461 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2682
      54e3b66 [Yin Huai] Add genjavadocSettings back.
      a19d8c89
    • Cheng Lian's avatar
      [SPARK-2410][SQL] Merging Hive Thrift/JDBC server · 06dc0d2c
      Cheng Lian authored
      JIRA issue:
      
      - Main: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
      - Related: [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678)
      
      Cherry picked the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).
      
      (Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.)
      
      TODO
      
      - [x] Use `spark-submit` to launch the server, the CLI and beeline
      - [x] Migration guideline draft for Shark users
      
      ----
      
      Hit by a bug in `SparkSubmitArguments` while working on this PR: all application options that are recognized by `SparkSubmitArguments` are stolen as `SparkSubmit` options. For example:
      
      ```bash
      $ spark-submit --class org.apache.hive.beeline.BeeLine spark-internal --help
      ```
      
      This actually shows usage information of `SparkSubmit` rather than `BeeLine`.
      
      ~~Fixed this bug here since the `spark-internal` related stuff also touches `SparkSubmitArguments` and I'd like to avoid conflict.~~
      
      **UPDATE** The bug mentioned above is now tracked by [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678). Decided to revert the changes for this bug since it involves more subtle considerations and is worth a separate PR.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1399 from liancheng/thriftserver and squashes the following commits:
      
      090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
      21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
      fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
      199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
      1083e9d [Cheng Lian] Fixed failed test suites
      7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
      9cc0f06 [Cheng Lian] Starts beeline with spark-submit
      cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
      061880f [Cheng Lian] Addressed all comments by @pwendell
      7755062 [Cheng Lian] Adapts test suites to spark-submit settings
      40bafef [Cheng Lian] Fixed more license header issues
      e214aab [Cheng Lian] Added missing license headers
      b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
      f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
      3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
      a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
      61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
      2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
      06dc0d2c
    • Yin Huai's avatar
      [SPARK-2683] unidoc failed because org.apache.spark.util.CallSite uses Java keywords as value names · 32bcf9af
      Yin Huai authored
      Renaming `short` to `shortForm` and `long` to `longForm`.
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-2683
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #1585 from yhuai/SPARK-2683 and squashes the following commits:
      
      5ddb843 [Yin Huai] "short" and "long" are Java keywords. In order to generate javadoc, renaming "short" to "shortForm" and "long" to "longForm".
      32bcf9af
    • fireflyc's avatar
      replace println to log4j · a2715ccd
      fireflyc authored
      Our program needs to receive a large amount of data and run for a long
      time.
      We set the log level to WARN, but messages such as "Storing iterator" and
      "received single" were still written to the log file (on YARN).
      
      Author: fireflyc <fireflyc@126.com>
      
      Closes #1372 from fireflyc/fix-replace-stdout-log and squashes the following commits:
      
      e684140 [fireflyc] 'info' modified into the 'debug'
      fa22a38 [fireflyc] replace println to log4j
      a2715ccd
    • Cheng Hao's avatar
      [SPARK-2665] [SQL] Add EqualNS & Unit Tests · 184aa1c6
      Cheng Hao authored
      Hive supports the operator "<=>", which returns the same result as the EQUAL (=) operator for non-null operands, but returns TRUE if both operands are NULL and FALSE if only one of them is NULL.
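
The difference between the two operators can be shown with a small Python model in which `None` stands for SQL NULL. The function names `equal` and `equal_null_safe` are illustrative only and are not taken from the patch.

```python
def equal(a, b):
    """SQL-style `=`: the result is NULL (None) if either operand is NULL."""
    if a is None or b is None:
        return None
    return a == b


def equal_null_safe(a, b):
    """`<=>`: behaves like `=` for non-null operands but never returns NULL."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b
```

For example, `equal(None, None)` yields `None`, while `equal_null_safe(None, None)` yields `True` and `equal_null_safe(1, None)` yields `False`.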
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #1570 from chenghao-intel/equalns and squashes the following commits:
      
      8d6c789 [Cheng Hao] Remove the test case orc_predicate_pushdown
      5b2ca88 [Cheng Hao] Add cases into whitelist
      8e66cdd [Cheng Hao] Rename the EqualNSTo ==> EqualNullSafe
      7af4b0b [Cheng Hao] Add EqualNS & Unit Tests
      184aa1c6
    • Reynold Xin's avatar
      [SPARK-2529] Clean closures in foreach and foreachPartition. · eb82abd8
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1583 from rxin/closureClean and squashes the following commits:
      
      8982fe6 [Reynold Xin] [SPARK-2529] Clean closures in foreach and foreachPartition.
      eb82abd8
    • Matei Zaharia's avatar
      SPARK-2657 Use more compact data structures than ArrayBuffer in groupBy & cogroup · 8529ced3
      Matei Zaharia authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-2657
      
      Our current code uses ArrayBuffers for each group of values in groupBy, as well as for the key's elements in CoGroupedRDD. ArrayBuffers have a lot of overhead if there are few values in them, which is likely to happen in cases such as join. In particular, they have a pointer to an Object[] of size 16 by default, which is 24 bytes for the array header + 128 for the pointers in there, plus at least 32 for the ArrayBuffer data structure. This patch replaces the per-group buffers with a CompactBuffer class that can store up to 2 elements more efficiently (in fields of itself) and acts like an ArrayBuffer beyond that. For a key's elements in CoGroupedRDD, we use an Array of CompactBuffers instead of an ArrayBuffer of ArrayBuffers.
      
      There are some changes throughout the code to deal with CoGroupedRDD returning Array instead. We can also decide not to do that but CoGroupedRDD is a `DeveloperAPI` so I think it's okay to change it here.
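
By the figures quoted above, an ArrayBuffer costs at least 24 + 128 + 32 = 184 bytes per group even when it holds one or two values. The idea of keeping the first two elements in fields can be sketched in Python as follows. This is an illustration of the technique only, not the Scala `CompactBuffer` class, and the field names are invented.

```python
class CompactBuffer:
    """Append-only buffer that keeps its first two elements in plain fields."""

    __slots__ = ("_e0", "_e1", "_rest", "_size")

    def __init__(self):
        self._e0 = None
        self._e1 = None
        self._rest = None  # allocated only when a third element arrives
        self._size = 0

    def append(self, value):
        if self._size == 0:
            self._e0 = value
        elif self._size == 1:
            self._e1 = value
        else:
            if self._rest is None:
                self._rest = []
            self._rest.append(value)
        self._size += 1
        return self

    def __len__(self):
        return self._size

    def __getitem__(self, index):
        if not 0 <= index < self._size:
            raise IndexError(index)
        if index == 0:
            return self._e0
        if index == 1:
            return self._e1
        return self._rest[index - 2]

    def __iter__(self):
        for i in range(self._size):
            yield self[i]


buf = CompactBuffer()
buf.append("a").append("b")  # no backing list has been allocated yet
```

Groups with one or two values, which the commit message says are common in joins, never pay for a separate backing collection.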
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #1555 from mateiz/compact-groupby and squashes the following commits:
      
      845a356 [Matei Zaharia] Lower initial size of CompactBuffer's vector to 8
      07621a7 [Matei Zaharia] Review comments
      0c1cd12 [Matei Zaharia] Don't use varargs in CompactBuffer.apply
      bdc8a39 [Matei Zaharia] Small tweak to +=, and typos
      f61f040 [Matei Zaharia] Fix line lengths
      59da88b0 [Matei Zaharia] Fix line lengths
      197cde8 [Matei Zaharia] Make CompactBuffer extend Seq to make its toSeq more efficient
      775110f [Matei Zaharia] Change CoGroupedRDD to give (K, Array[Iterable[_]]) to avoid wrappers
      9b4c6e8 [Matei Zaharia] Use CompactBuffer in CoGroupedRDD
      ed577ab [Matei Zaharia] Use CompactBuffer in groupByKey
      10f0de1 [Matei Zaharia] A CompactBuffer that's more memory-efficient than ArrayBuffer for small buffers
      8529ced3
    • Doris Xin's avatar
      [SPARK-2656] Python version of stratified sampling · 2f75a4a3
      Doris Xin authored
      Exact sample size is not supported for now.
      
      Author: Doris Xin <doris.s.xin@gmail.com>
      
      Closes #1554 from dorx/pystratified and squashes the following commits:
      
      4ba927a [Doris Xin] use rel diff (+- 50%) instead of abs diff (+- 50)
      bdc3f8b [Doris Xin] updated unit to check sample holistically
      7713c7b [Doris Xin] Python version of stratified sampling
      2f75a4a3
    • Davies Liu's avatar
      [SPARK-2538] [PySpark] Hash based disk spilling aggregation · 14174abd
      Davies Liu authored
      During aggregation in the Python worker, if memory usage rises above spark.executor.memory, it switches to disk-spilling aggregation.
      
      It splits the aggregation into multiple stages; in each stage, it partitions the aggregated data by hash and dumps the partitions to disk. After all the data has been aggregated, it merges the stages together (partition by partition).
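
The spill-then-merge scheme can be sketched as below. This is a simplified stand-in rather than the code in the patch: `spilling_aggregate` is a hypothetical function, the spill trigger is a key-count limit (`max_keys`) instead of a memory measurement, and the merged partitions are collected into one dict only to keep the example easy to check.

```python
import os
import pickle
import tempfile


def spilling_aggregate(pairs, combine, max_keys, num_partitions=4):
    """Aggregate (key, value) pairs, spilling to disk when too many keys are held."""
    data = {}
    spills = 0
    with tempfile.TemporaryDirectory() as spill_dir:

        def spill():
            nonlocal spills
            # One file per hash partition for this spill round.
            parts = [{} for _ in range(num_partitions)]
            for k, v in data.items():
                parts[hash(k) % num_partitions][k] = v
            for p, part in enumerate(parts):
                path = os.path.join(spill_dir, f"{spills}-{p}.pkl")
                with open(path, "wb") as f:
                    pickle.dump(part, f)
            data.clear()
            spills += 1

        for k, v in pairs:
            data[k] = combine(data[k], v) if k in data else v
            if len(data) > max_keys:
                spill()

        if spills == 0:
            return dict(data)  # everything fit in memory

        spill()  # flush the remainder so every key lives on disk
        result = {}
        # Merge partition by partition: a key always hashes to the same
        # partition, so each partition can be combined independently.
        for p in range(num_partitions):
            merged = {}
            for s in range(spills):
                path = os.path.join(spill_dir, f"{s}-{p}.pkl")
                with open(path, "rb") as f:
                    for k, v in pickle.load(f).items():
                        merged[k] = combine(merged[k], v) if k in merged else v
            result.update(merged)
        return result
```

Because partitioning is by key hash, only one partition's worth of keys needs to be in memory during the merge, whatever the number of spill rounds.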
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1460 from davies/spill and squashes the following commits:
      
      cad91bf [Davies Liu] call gc.collect() after data.clear() to release memory as much as possible.
      37d71f7 [Davies Liu] balance the partitions
      902f036 [Davies Liu] add shuffle.py into run-tests
      dcf03a9 [Davies Liu] fix memory_info() of psutil
      67e6eba [Davies Liu] comment for MAX_TOTAL_PARTITIONS
      f6bd5d6 [Davies Liu] rollback next_limit() again, the performance difference is huge:
      e74b785 [Davies Liu] fix code style and change next_limit to memory_limit
      400be01 [Davies Liu] address all the comments
      6178844 [Davies Liu] refactor and improve docs
      fdd0a49 [Davies Liu] add long doc string for ExternalMerger
      1a97ce4 [Davies Liu] limit used memory and size of objects in partitionBy()
      e6cc7f9 [Davies Liu] Merge branch 'master' into spill
      3652583 [Davies Liu] address comments
      e78a0a0 [Davies Liu] fix style
      24cec6a [Davies Liu] get local directory by SPARK_LOCAL_DIR
      57ee7ef [Davies Liu] update docs
      286aaff [Davies Liu] let spilled aggregation in Python configurable
      e9a40f6 [Davies Liu] recursive merger
      6edbd1f [Davies Liu] Hash based disk spilling aggregation
      14174abd
  4. Jul 24, 2014
    • Prashant Sharma's avatar
      [SPARK-2014] Make PySpark store RDDs in MEMORY_ONLY_SER with compression by default · eff9714e
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #1051 from ScrapCodes/SPARK-2014/pyspark-cache and squashes the following commits:
      
      f192df7 [Prashant Sharma] Code Review
      2a2f43f [Prashant Sharma] [SPARK-2014] Make PySpark store RDDs in MEMORY_ONLY_SER with compression by default
      eff9714e
    • Tathagata Das's avatar
      [SPARK-2464][Streaming] Fixed Twitter stream stopping bug · a45d5480
      Tathagata Das authored
      Stopping the Twitter Receiver would call twitter4j's TwitterStream.shutdown, which in turn causes an Exception to be thrown to the listener. This exception caused the Receiver to be restarted. This patch checks whether the receiver was stopped, and restarts on exception only if it was not.
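
      The pattern can be sketched with a simple stop flag. The class and method names below are hypothetical and only model the control flow described above; they are not the Spark Streaming or twitter4j API.

```python
class Receiver:
    """Hypothetical receiver that restarts on errors unless it was stopped."""

    def __init__(self):
        self.stopped = False
        self.restarts = 0

    def on_stop(self):
        # Record the stop request before shutting the stream down, because
        # the shutdown itself reports an exception to the listener.
        self.stopped = True
        self.on_exception(RuntimeError("stream shut down"))

    def on_exception(self, exc):
        # Restart only for genuine failures, not for the exception that a
        # deliberate shutdown produces.
        if not self.stopped:
            self.restart(exc)

    def restart(self, exc):
        self.restarts += 1


receiver = Receiver()
receiver.on_exception(RuntimeError("connection lost"))  # triggers a restart
receiver.on_stop()                                      # does not
```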
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #1577 from tdas/twitter-stop and squashes the following commits:
      
      011b525 [Tathagata Das] Fixed Twitter stream stopping bug.
      a45d5480
    • Neville Li's avatar
      SPARK-2250: show stage RDDs in UI · fec641b8
      Neville Li authored
      Author: Neville Li <neville@spotify.com>
      
      Closes #1188 from nevillelyh/neville/ui and squashes the following commits:
      
      d3ac425 [Neville Li] SPARK-2250: show persisted RDD in stage UI
      f075db9 [Neville Li] SPARK-2035: show call stack even when description is available
      fec641b8
    • GuoQiang Li's avatar
      [SPARK-2037]: yarn client mode doesn't support spark.yarn.max.executor.failures · 323a83c5
      GuoQiang Li authored
      Author: GuoQiang Li <witgo@qq.com>
      
      Closes #1180 from witgo/SPARK-2037 and squashes the following commits:
      
      3d52411 [GuoQiang Li] review commit
      7058f4d [GuoQiang Li] Correctly stop SparkContext
      6d0561f [GuoQiang Li] Fix: yarn client mode doesn't support spark.yarn.max.executor.failures
      323a83c5
    • Xiangrui Meng's avatar
      [SPARK-2479 (partial)][MLLIB] fix binary metrics unit tests · c960b505
      Xiangrui Meng authored
      Allow small errors in comparison.
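
      As a generic illustration of tolerance-based comparison in tests, a minimal sketch follows. The helper `assert_approx_equal_seq` and the tolerance value are hypothetical; this is not the code used in the MLlib test suite.

```python
import math


def assert_approx_equal_seq(actual, expected, tol=1e-9):
    """Assert two sequences of floats match element-wise within an absolute tolerance."""
    assert len(actual) == len(expected), "length mismatch"
    for i, (a, e) in enumerate(zip(actual, expected)):
        assert math.isclose(a, e, rel_tol=0.0, abs_tol=tol), (
            f"element {i}: {a} differs from {e} by more than {tol}"
        )


# Floating-point arithmetic rarely reproduces hand-written expected values exactly.
computed = [0.1 + 0.2, 1.0 / 3.0]
expected = [0.3, 0.3333333333]
assert_approx_equal_seq(computed, expected)
```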
      
      @dbtsai , this unit test blocks https://github.com/apache/spark/pull/1562 . I may need to merge this one first. We can change it to use the tools in https://github.com/apache/spark/pull/1425 after that PR gets merged.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1576 from mengxr/fix-binary-metrics-unit-tests and squashes the following commits:
      
      5076a7f [Xiangrui Meng] fix binary metrics unit tests
      c960b505
    • Yin Huai's avatar
      [SPARK-2603][SQL] Remove unnecessary toMap and toList in converting Java... · b352ef17
      Yin Huai authored
      [SPARK-2603][SQL] Remove unnecessary toMap and toList in converting Java collections to Scala collections JsonRDD.scala
      
      In JsonRDD.scalafy, we are using toMap/toList to convert a Java Map/List to a Scala one. These two operations are pretty expensive because they read elements from a Java Map/List and then load to a Scala Map/List. We can use Scala wrappers to wrap those Java collections instead of using toMap/toList.
      
      I did a quick test to see the performance. I had a 2.9GB cached RDD[String] storing one JSON object per record (twitter dataset). My simple test program is attached below.
      ```scala
      val sqlContext = new org.apache.spark.sql.SQLContext(sc)
      import sqlContext._
      
      val jsonData = sc.textFile("...")
      jsonData.cache.count
      
      val jsonSchemaRDD = sqlContext.jsonRDD(jsonData)
      jsonSchemaRDD.registerAsTable("jt")
      
      sqlContext.sql("select count(*) from jt").collect
      ```
      Stages for the schema inference and the table scan both had 48 tasks. These tasks were executed sequentially. For the current implementation, scanning the JSON dataset will materialize values of all fields of a record. The inferred schema of the dataset can be accessed at https://gist.github.com/yhuai/05fe8a57c638c6666f8d.
      
      From the results, there was no significant difference in running `jsonRDD`. For the simple aggregation query, results are attached below.
      ```
      Original:
      Run 1: 26.1s
      Run 2: 27.03s
      Run 3: 27.035s
      
      With this change:
      Run 1: 21.086s
      Run 2: 21.035s
      Run 3: 21.029s
      ```
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-2603
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #1504 from yhuai/removeToMapToList and squashes the following commits:
      
      6831b77 [Yin Huai] Fix failed tests.
      09b9bca [Yin Huai] Merge remote-tracking branch 'upstream/master' into removeToMapToList
      d1abdb8 [Yin Huai] Remove unnecessary toMap and toList.
      b352ef17
    • tzolov's avatar
      [Build] SPARK-2619: Configurable filemode for the spark/bin folder in debian package · 9fd14147
      tzolov authored
      Add a `<deb.bin.filemode>744</deb.bin.filemode>` property to the `assembly/pom.xml` that defaults to `744`.
      Use this property for the ../bin folder's <filemode>.
      
      This patch doesn't change the current default modes but allows one to override the modes at build time:
      `-Ddeb.bin.filemode=<new mode>`
      
      Author: tzolov <christian.tzolov@gmail.com>
      
      Closes #1531 from tzolov/SPARK-2619 and squashes the following commits:
      
      6d95343 [tzolov] [Build] SPARK-2619: Configurable filemode for the spark/bin folder in the .deb package
      9fd14147
    • Rahul Singhal's avatar
      SPARK-2150: Provide direct link to finished application UI in yarn resou... · 46e224aa
      Rahul Singhal authored
      ...rce manager UI
      
      Use the event logger directory to provide a direct link to finished
      application UI in yarn resourcemanager UI.
      
      Author: Rahul Singhal <rahul.singhal@guavus.com>
      
      Closes #1094 from rahulsinghaliitd/SPARK-2150 and squashes the following commits:
      
      95f230c [Rahul Singhal] SPARK-2150: Provide direct link to finished application UI in yarn resource manager UI
      46e224aa
    • Daoyuan's avatar
      [SPARK-2661][bagel]unpersist old processed rdd · 42dfab7d
      Daoyuan authored
      Unpersist RDDs that are no longer needed during Bagel iterations to make full use of memory.
      
      Author: Daoyuan <daoyuan.wang@intel.com>
      
      Closes #1519 from adrian-wang/bagelunpersist and squashes the following commits:
      
      182c9dd [Daoyuan] rename var nextUseless to lastRDD
      87fd3a4 [Daoyuan] bagel unpersist old processed rdd
      42dfab7d