  1. Jul 27, 2014
    • [SPARK-2410][SQL] Merging Hive Thrift/JDBC server · f6ff2a61
      Cheng Lian authored
      (This is a replacement of #1399, trying to fix potential `HiveThriftServer2` port collision between parallel builds. Please refer to [these comments](https://github.com/apache/spark/pull/1399#issuecomment-50212572) for details.)
      
      JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
      
      Merging the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).
      
      Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1600 from liancheng/jdbc and squashes the following commits:
      
      ac4618b [Cheng Lian] Uses random port for HiveThriftServer2 to avoid collision with parallel builds
      090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
      21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
      fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
      199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
      1083e9d [Cheng Lian] Fixed failed test suites
      7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
      9cc0f06 [Cheng Lian] Starts beeline with spark-submit
      cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
      061880f [Cheng Lian] Addressed all comments by @pwendell
      7755062 [Cheng Lian] Adapts test suites to spark-submit settings
      40bafef [Cheng Lian] Fixed more license header issues
      e214aab [Cheng Lian] Added missing license headers
      b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
      f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
      3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
      a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
      61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
      2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
  2. Jul 25, 2014
    • Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server" · afd757a2
      Michael Armbrust authored
      This reverts commit 06dc0d2c.
      
      #1399 is making Jenkins fail. We should investigate and put this back once it is passing tests.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1594 from marmbrus/revertJDBC and squashes the following commits:
      
      59748da [Michael Armbrust] Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
    • [SPARK-2410][SQL] Merging Hive Thrift/JDBC server · 06dc0d2c
      Cheng Lian authored
      JIRA issue:
      
      - Main: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
      - Related: [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678)
      
      Cherry picked the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).
      
      (Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.)
      
      TODO
      
      - [x] Use `spark-submit` to launch the server, the CLI and beeline
      - [x] Migration guideline draft for Shark users
      
      ----
      
      Hit by a bug in `SparkSubmitArguments` while working on this PR: all application options that are recognized by `SparkSubmitArguments` are stolen as `SparkSubmit` options. For example:
      
      ```bash
      $ spark-submit --class org.apache.hive.beeline.BeeLine spark-internal --help
      ```
      
      This actually shows usage information of `SparkSubmit` rather than `BeeLine`.
      
      ~~Fixed this bug here since the `spark-internal` related stuff also touches `SparkSubmitArguments` and I'd like to avoid conflict.~~
      
      **UPDATE** The bug mentioned above is now tracked by [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678). Decided to revert the changes for this bug since it involves more subtle considerations and is worth a separate PR.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1399 from liancheng/thriftserver and squashes the following commits:
      
      090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
      21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
      fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
      199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
      1083e9d [Cheng Lian] Fixed failed test suites
      7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
      9cc0f06 [Cheng Lian] Starts beeline with spark-submit
      cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
      061880f [Cheng Lian] Addressed all comments by @pwendell
      7755062 [Cheng Lian] Adapts test suites to spark-submit settings
      40bafef [Cheng Lian] Fixed more license header issues
      e214aab [Cheng Lian] Added missing license headers
      b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
      f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
      3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
      a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
      61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
      2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
  3. Jul 23, 2014
    • Replace RoutingTableMessage with pair · 2d25e348
      Ankur Dave authored
      RoutingTableMessage was used to construct routing tables to enable
      joining VertexRDDs with partitioned edges. It stored three elements: the
      destination vertex ID, the source edge partition, and a byte specifying
      the position in which the edge partition referenced the vertex to enable
      join elimination.
      
      However, this was incompatible with sort-based shuffle (SPARK-2045). It
      was also slightly wasteful, because partition IDs are usually much
      smaller than 2^32, though this was mitigated by a custom serializer that
      used variable-length encoding.
      
      This commit replaces RoutingTableMessage with a pair of (VertexId, Int)
      where the Int encodes both the source partition ID (in the lower 30
      bits) and the position (in the top 2 bits).
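
      The bit-packing scheme above can be sketched as follows; the object and method names are illustrative, not the actual GraphX code:

      ```scala
      // Hypothetical sketch: source partition ID in the lower 30 bits,
      // 2-bit position flag in the top 2 bits of a single Int.
      object RoutingPacking {
        def pack(partitionId: Int, position: Byte): Int = {
          require((partitionId >>> 30) == 0, "partition ID must fit in 30 bits")
          (position.toInt << 30) | partitionId
        }
        def partitionId(packed: Int): Int = packed & 0x3FFFFFFF          // lower 30 bits
        def position(packed: Int): Byte = ((packed >>> 30) & 0x3).toByte // top 2 bits
      }
      ```

      Packing into a primitive pair avoids the custom message class (and its custom serializer) at the cost of capping the number of edge partitions at 2^30.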
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #1553 from ankurdave/remove-RoutingTableMessage and squashes the following commits:
      
      697e17b [Ankur Dave] Replace RoutingTableMessage with pair
    • Remove GraphX MessageToPartition for compatibility with sort-based shuffle · 6c2be93f
      Ankur Dave authored
      MessageToPartition was used in `Graph#partitionBy`. Unlike a Tuple2, it marked the key as transient to avoid sending it over the network. However, it was incompatible with sort-based shuffle (SPARK-2045) and represented only a minor optimization: for partitionBy, it improved performance by 6.3% (30.4 s to 28.5 s) and reduced communication by 5.6% (114.2 MB to 107.8 MB).
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #1537 from ankurdave/remove-MessageToPartition and squashes the following commits:
      
      f9d0054 [Ankur Dave] Remove MessageToPartition
      ab71364 [Ankur Dave] Remove unused VertexBroadcastMsg
  4. Jul 22, 2014
    • Graphx example · 5f7b9916
      CrazyJvm authored
      fix examples
      
      Author: CrazyJvm <crazyjvm@gmail.com>
      
      Closes #1523 from CrazyJvm/graphx-example and squashes the following commits:
      
      663457a [CrazyJvm] outDegrees does not take parameters
      7cfff1d [CrazyJvm] fix example for joinVertices
  5. Jul 12, 2014
    • [SPARK-2455] Mark (Shippable)VertexPartition serializable · 7a013529
      Ankur Dave authored
      VertexPartition and ShippableVertexPartition are contained in RDDs but are not marked Serializable, leading to NotSerializableExceptions when using Java serialization.
      
      The fix is simply to mark them as Serializable. This PR does that and adds a test for serializing them using Java and Kryo serialization.
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #1376 from ankurdave/SPARK-2455 and squashes the following commits:
      
      ed4a51b [Ankur Dave] Make (Shippable)VertexPartition serializable
      1fd42c5 [Ankur Dave] Add failing tests for Java serialization
  6. Jul 11, 2014
    • fix Graph partitionStrategy comment · 282cca0e
      CrazyJvm authored
      Author: CrazyJvm <crazyjvm@gmail.com>
      
      Closes #1368 from CrazyJvm/graph-comment-1 and squashes the following commits:
      
      d47f3c5 [CrazyJvm] fix style
      e190d6f [CrazyJvm] fix Graph partitionStrategy comment
  7. Jul 10, 2014
    • [SPARK-1776] Have Spark's SBT build read dependencies from Maven. · 628932b8
      Prashant Sharma authored
      This patch introduces the new way of working while retaining the existing ways of doing things.
      
      For example, the build instruction for YARN in Maven is
      `mvn -Pyarn -Phadoop-2.2 clean package -DskipTests`
      and in sbt it can become
      `MAVEN_PROFILES="yarn, hadoop-2.2" sbt/sbt clean assembly`
      It also supports
      `sbt/sbt -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 clean assembly`
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #772 from ScrapCodes/sbt-maven and squashes the following commits:
      
      a8ac951 [Prashant Sharma] Updated sbt version.
      62b09bb [Prashant Sharma] Improvements.
      fa6221d [Prashant Sharma] Excluding sql from mima
      4b8875e [Prashant Sharma] Sbt assembly no longer builds tools by default.
      72651ca [Prashant Sharma] Addresses code reivew comments.
      acab73d [Prashant Sharma] Revert "Small fix to run-examples script."
      ac4312c [Prashant Sharma] Revert "minor fix"
      6af91ac [Prashant Sharma] Ported oldDeps back. + fixes issues with prev commit.
      65cf06c [Prashant Sharma] Servelet API jars mess up with the other servlet jars on the class path.
      446768e [Prashant Sharma] minor fix
      89b9777 [Prashant Sharma] Merge conflicts
      d0a02f2 [Prashant Sharma] Bumped up pom versions, Since the build now depends on pom it is better updated there. + general cleanups.
      dccc8ac [Prashant Sharma] updated mima to check against 1.0
      a49c61b [Prashant Sharma] Fix for tools jar
      a2f5ae1 [Prashant Sharma] Fixes a bug in dependencies.
      cf88758 [Prashant Sharma] cleanup
      9439ea3 [Prashant Sharma] Small fix to run-examples script.
      96cea1f [Prashant Sharma] SPARK-1776 Have Spark's SBT build read dependencies from Maven.
      36efa62 [Patrick Wendell] Set project name in pom files and added eclipse/intellij plugins.
      4973dbd [Patrick Wendell] Example build using pom reader.
  8. Jun 23, 2014
    • [SPARK-2124] Move aggregation into shuffle implementations · 56eb8af1
      jerryshao authored
      This PR is a sub-task of SPARK-2044 to move the execution of aggregation into shuffle implementations.
      
      I leave `CoGroupedRDD` and `SubtractedRDD` unchanged because they have their own implementations of aggregation. I'm not sure whether it is suitable to change these two RDDs.
      
      Also, I do not move the sort-related code of `OrderedRDDFunctions` into shuffle; this will be solved in another sub-task.
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #1064 from jerryshao/SPARK-2124 and squashes the following commits:
      
      4a05a40 [jerryshao] Modify according to comments
      1f7dcc8 [jerryshao] Style changes
      50a2fd6 [jerryshao] Fix test suite issue after moving aggregator to Shuffle reader and writer
      1a96190 [jerryshao] Code modification related to the ShuffledRDD
      308f635 [jerryshao] initial works of move combiner to ShuffleManager's reader and writer
  9. Jun 06, 2014
    • [SPARK-1552] Fix type comparison bug in {map,outerJoin}Vertices · 8d85359f
      Ankur Dave authored
      In GraphImpl, mapVertices and outerJoinVertices use a more efficient implementation when the map function conserves vertex attribute types. This is implemented by comparing the ClassTags of the old and new vertex attribute types. However, ClassTags store erased types, so the comparison will return a false positive for types with different type parameters, such as Option[Int] and Option[Double].
      
      This PR resolves the problem by requesting that the compiler generate evidence of equality between the old and new vertex attribute types, and providing a default value for the evidence parameter if the two types are not equal. The methods can then check the value of the evidence parameter to see whether the types are equal.
      
      It also adds a test called "mapVertices changing type with same erased type" that failed before the PR and succeeds now.
      
      Callers of mapVertices and outerJoinVertices can no longer use a wildcard for a graph's VD type. To avoid "Error occurred in an application involving default arguments," they must bind VD to a type parameter, as this PR does for ShortestPaths and LabelPropagation.
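
      The evidence-parameter trick can be sketched with a minimal, hypothetical example (not the actual GraphX signatures):

      ```scala
      // When VD2 is the same type as VD, the compiler supplies the implicit
      // =:= evidence; for a different type no evidence exists, so the default
      // null is used and the method can detect the case at runtime.
      class VertexBag[VD](val values: List[VD]) {
        def mapValues[VD2](f: VD => VD2)(implicit eq: VD =:= VD2 = null): (Boolean, List[VD2]) = {
          val sameType = eq != null // true only when VD and VD2 are statically equal
          (sameType, values.map(f))
        }
      }
      ```

      Unlike a ClassTag comparison, this distinguishes `Option[Int]` from `Option[Double]`, because the `=:=` evidence is resolved by the compiler before erasure.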
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #967 from ankurdave/SPARK-1552 and squashes the following commits:
      
      68a4fff [Ankur Dave] Undo conserve naming
      7388705 [Ankur Dave] Remove unnecessary ClassTag for VD parameters
      a704e5f [Ankur Dave] Use type equality constraint with default argument
      29a5ab7 [Ankur Dave] Add failing test
      f458c83 [Ankur Dave] Revert "[SPARK-1552] Fix type comparison bug in mapVertices and outerJoinVertices"
      16d6af8 [Ankur Dave] [SPARK-1552] Fix type comparison bug in mapVertices and outerJoinVertices
  10. Jun 05, 2014
    • [SPARK-2025] Unpersist edges of previous graph in Pregel · 9bad0b73
      Ankur Dave authored
      Due to a bug introduced by apache/spark#497, Pregel does not unpersist replicated vertices from previous iterations. As a result, they stay cached until memory is full, wasting GC time.
      
      This PR corrects the problem by unpersisting both the edges and the replicated vertices of previous iterations. This is safe because the edges and replicated vertices of the current iteration are cached by the call to `g.cache()` and then materialized by the call to `messages.count()`. Therefore no unmaterialized RDDs depend on `prevG.edges`. I verified that no recomputation occurs by running PageRank with a custom patch to Spark that warns when a partition is recomputed.
      
      Thanks to Tim Weninger for reporting this bug.
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #972 from ankurdave/SPARK-2025 and squashes the following commits:
      
      13d5b07 [Ankur Dave] Unpersist edges of previous graph in Pregel
    • [SPARK-2029] Bump pom.xml version number of master branch to 1.1.0-SNAPSHOT. · 7c160293
      Takuya UESHIN authored
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #974 from ueshin/issues/SPARK-2029 and squashes the following commits:
      
      e19e8f4 [Takuya UESHIN] Bump version number to 1.1.0-SNAPSHOT.
  11. Jun 04, 2014
  12. Jun 03, 2014
    • Enable repartitioning of graph over different number of partitions · 5284ca78
      Joseph E. Gonzalez authored
      It is currently very difficult to repartition a graph over a different number of partitions.  This PR adds an additional `partitionBy` function that takes the number of partitions.
      
      Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
      
      Closes #719 from jegonzal/graph_partitioning_options and squashes the following commits:
      
      730b405 [Joseph E. Gonzalez] adding an additional number of partitions option to partitionBy
    • [SPARK-1991] Support custom storage levels for vertices and edges · b1feb602
      Ankur Dave authored
      This PR adds support for specifying custom storage levels for the vertices and edges of a graph. This enables GraphX to handle graphs larger than memory size by specifying MEMORY_AND_DISK and then repartitioning the graph to use many small partitions, each of which does fit in memory. Spark will then automatically load partitions from disk as needed.
      
      The user specifies the desired vertex and edge storage levels when building the graph by passing them to the graph constructor. These are then stored in the `targetStorageLevel` attribute of the VertexRDD and EdgeRDD respectively. Whenever GraphX needs to cache a VertexRDD or EdgeRDD (because it plans to use it more than once, for example), it uses the specified target storage level. Also, when the user calls `Graph#cache()`, the vertices and edges are persisted using their target storage levels.
      
      In order to facilitate propagating the target storage levels across VertexRDD and EdgeRDD operations, we remove raw calls to the constructors and instead introduce the `withPartitionsRDD` and `withTargetStorageLevel` methods.
      
      I tested this change by running PageRank and triangle count on a severely memory-constrained cluster (1 executor with 300 MB of memory, and a 1 GB graph). Before this PR, these algorithms used to fail with OutOfMemoryErrors. With this PR, and using the DISK_ONLY storage level, they succeed.
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #946 from ankurdave/SPARK-1991 and squashes the following commits:
      
      ce17d95 [Ankur Dave] Move pickStorageLevel to StorageLevel.fromString
      ccaf06f [Ankur Dave] Shadow members in withXYZ() methods rather than using underscores
      c34abc0 [Ankur Dave] Exclude all of GraphX from compatibility checks vs. 1.0.0
      c5ca068 [Ankur Dave] Revert "Exclude all of GraphX from binary compatibility checks"
      34bcefb [Ankur Dave] Exclude all of GraphX from binary compatibility checks
      6fdd137 [Ankur Dave] [SPARK-1991] Support custom storage levels for vertices and edges
    • Synthetic GraphX Benchmark · 894ecde0
      Joseph E. Gonzalez authored
      This PR accomplishes two things:
      
      1. It introduces a Synthetic Benchmark application that generates an arbitrarily large log-normal graph and executes either PageRank or connected components on the graph. This can be used to profile the GraphX system on arbitrary clusters without access to large graph datasets.
      
      2. This PR improves the implementation of the log-normal graph generator.
      
      Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #720 from jegonzal/graphx_synth_benchmark and squashes the following commits:
      
      e40812a [Ankur Dave] Exclude all of GraphX from compatibility checks vs. 1.0.0
      bccccad [Ankur Dave] Fix long lines
      374678a [Ankur Dave] Bugfix and style changes
      1bdf39a [Joseph E. Gonzalez] updating options
      d943972 [Joseph E. Gonzalez] moving the benchmark application into the examples folder.
      f4f839a [Joseph E. Gonzalez] Creating a synthetic benchmark script.
    • [SPARK-1942] Stop clearing spark.driver.port in unit tests · 7782a304
      Syed Hashmi authored
      stop resetting spark.driver.port in unit tests (scala, java and python).
      
      Author: Syed Hashmi <shashmi@cloudera.com>
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #943 from syedhashmi/master and squashes the following commits:
      
      885f210 [Syed Hashmi] Removing unnecessary file (created by mergetool)
      b8bd4b5 [Syed Hashmi] Merge remote-tracking branch 'upstream/master'
      b895e59 [Syed Hashmi] Revert "[SPARK-1784] Add a new partitioner"
      57b6587 [Syed Hashmi] Revert "[SPARK-1784] Add a balanced partitioner"
      1574769 [Syed Hashmi] [SPARK-1942] Stop clearing spark.driver.port in unit tests
      4354836 [Syed Hashmi] Revert "SPARK-1686: keep schedule() calling in the main thread"
      fd36542 [Syed Hashmi] [SPARK-1784] Add a balanced partitioner
      6668015 [CodingCat] SPARK-1686: keep schedule() calling in the main thread
      4ca94cc [Syed Hashmi] [SPARK-1784] Add a new partitioner
  13. Jun 02, 2014
    • Add landmark-based Shortest Path algorithm to graphx.lib · 9535f404
      Ankur Dave authored
      This is a modified version of apache/spark#10.
      
      Author: Ankur Dave <ankurdave@gmail.com>
      Author: Andres Perez <andres@tresata.com>
      
      Closes #933 from ankurdave/shortestpaths and squashes the following commits:
      
      03a103c [Ankur Dave] Style fixes
      7a1ff48 [Ankur Dave] Improve ShortestPaths documentation
      d75c8fc [Ankur Dave] Remove unnecessary VD type param, and pass through ED
      d983fb4 [Ankur Dave] Fix style errors
      60ed8e6 [Andres Perez] Add Shortest-path computations to graphx.lib with unit tests.
  14. May 29, 2014
    • initial version of LPA · b7e28fa4
      Ankur Dave authored
      A straightforward implementation of the LPA algorithm for detecting graph communities using the Pregel framework. Amongst the growing literature on community detection algorithms in networks, LPA is perhaps the most elementary, and despite its flaws it remains a nice and simple approach.
      
      Author: Ankur Dave <ankurdave@gmail.com>
      Author: haroldsultan <haroldsultan@gmail.com>
      Author: Harold Sultan <haroldsultan@gmail.com>
      
      Closes #905 from haroldsultan/master and squashes the following commits:
      
      327aee0 [haroldsultan] Merge pull request #2 from ankurdave/label-propagation
      227a4d0 [Ankur Dave] Untabify
      0ac574c [haroldsultan] Merge pull request #1 from ankurdave/label-propagation
      0e24303 [Ankur Dave] Add LabelPropagationSuite
      84aa061 [Ankur Dave] LabelPropagation: Fix compile errors and style; rename from LPA
      9830342 [Harold Sultan] initial version of LPA
  15. May 26, 2014
    • [SPARK-1931] Reconstruct routing tables in Graph.partitionBy · 56c771cb
      Ankur Dave authored
      905173df introduced a bug in partitionBy where, after repartitioning the edges, it reuses the VertexRDD without updating the routing tables to reflect the new edge layout. Subsequent accesses of the triplets contain nulls for many vertex properties.
      
      This commit adds a test for this bug and fixes it by introducing `VertexRDD#withEdges` and calling it in `partitionBy`.
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #885 from ankurdave/SPARK-1931 and squashes the following commits:
      
      3930cdd [Ankur Dave] Note how to set up VertexRDD for efficient joins
      9bdbaa4 [Ankur Dave] [SPARK-1931] Reconstruct routing tables in Graph.partitionBy
  16. May 16, 2014
    • bugfix: overflow of graphx Edge compare function · fa6de408
      Zhen Peng authored
      Author: Zhen Peng <zhenpeng01@baidu.com>
      
      Closes #769 from zhpengg/bugfix-graphx-edge-compare and squashes the following commits:
      
      8a978ff [Zhen Peng] add ut for graphx Edge.lexicographicOrdering.compare
      413c258 [Zhen Peng] there maybe a overflow for two Long's substraction
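
      The overflow can be reproduced in isolation (hypothetical helper names; the fix is to compare rather than subtract):

      ```scala
      // (a - b).toInt truncates the 64-bit difference to its low 32 bits,
      // which can report two distinct Longs as equal or even flip the sign.
      def subtractCompare(a: Long, b: Long): Int = (a - b).toInt
      def safeCompare(a: Long, b: Long): Int = java.lang.Long.compare(a, b)
      ```

      For example, `subtractCompare(1L << 32, 0L)` yields 0 (the low 32 bits of 2^32), wrongly reporting equality, while `safeCompare` returns a positive number.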
  17. May 15, 2014
    • Fixes a misplaced comment. · e1e3416c
      Prashant Sharma authored
      Fixes a misplaced comment from #785.
      
      @pwendell
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #788 from ScrapCodes/patch-1 and squashes the following commits:
      
      3ef6a69 [Prashant Sharma] Update package-info.java
      67d9461 [Prashant Sharma] Update package-info.java
    • Package docs · 46324279
      Prashant Sharma authored
      These are a few changes based on the original patch by @scrapcodes.
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #785 from pwendell/package-docs and squashes the following commits:
      
      c32b731 [Patrick Wendell] Changes based on Prashant's patch
      c0463d3 [Prashant Sharma] added eof new line
      ce8bf73 [Prashant Sharma] Added eof new line to all files.
      4c35f2e [Prashant Sharma] SPARK-1563 Add package-info.java and package.scala files for all packages that appear in docs
  18. May 12, 2014
    • SPARK-1798. Tests should clean up temp files · 7120a297
      Sean Owen authored
      Three issues related to temp files that tests generate – these should be touched up for hygiene but are not urgent.
      
      Modules have a log4j.properties which directs the unit-test.log output file to a directory like `[module]/target/unit-test.log`. But this ends up creating `[module]/[module]/target/unit-test.log` instead of the former.
      
      The `work/` directory is not deleted by "mvn clean", in the parent and in modules. Neither is the `checkpoint/` directory created under the various external modules.
      
      Many tests create a temp directory, which is not usually deleted. This can be largely resolved by calling `deleteOnExit()` at creation and trying to call `Utils.deleteRecursively` consistently to clean up, sometimes in an `@After` method.
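
      The cleanup pattern described above can be sketched as follows; `Utils.deleteRecursively` is Spark-internal, so a plain recursive delete stands in for it here:

      ```scala
      import java.io.File
      import java.nio.file.Files

      // Stand-in for Utils.deleteRecursively: remove a directory tree.
      def deleteRecursively(f: File): Unit = {
        if (f.isDirectory) f.listFiles().foreach(deleteRecursively)
        f.delete()
      }

      val tmpDir = Files.createTempDirectory("spark-test").toFile
      tmpDir.deleteOnExit() // safety net if the JVM exits before cleanup runs
      try {
        // ... the test body writes into tmpDir ...
        Files.write(new File(tmpDir, "unit-test.log").toPath, "ok".getBytes)
      } finally {
        deleteRecursively(tmpDir) // the @After-style cleanup
      }
      ```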
      
      _If anyone seconds the motion, I can create a more significant change that introduces a new test trait along the lines of `LocalSparkContext`, which provides management of temp directories for subclasses to take advantage of._
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #732 from srowen/SPARK-1798 and squashes the following commits:
      
      5af578e [Sean Owen] Try to consistently delete test temp dirs and files, and set deleteOnExit() for each
      b21b356 [Sean Owen] Remove work/ and checkpoint/ dirs with mvn clean
      bdd0f41 [Sean Owen] Remove duplicate module dir in log4j.properties output path for tests
    • SPARK-1786: Reopening PR 724 · 0e2bde20
      Ankur Dave authored
      Addressing issue in MimaBuild.scala.
      
      Author: Ankur Dave <ankurdave@gmail.com>
      Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
      
      Closes #742 from jegonzal/edge_partition_serialization and squashes the following commits:
      
      8ba6e0d [Ankur Dave] Add concatenation operators to MimaBuild.scala
      cb2ed3a [Joseph E. Gonzalez] addressing missing exclusion in MimaBuild.scala
      5d27824 [Ankur Dave] Disable reference tracking to fix serialization test
      c0a9ae5 [Ankur Dave] Add failing test for EdgePartition Kryo serialization
      a4a3faa [Joseph E. Gonzalez] Making EdgePartition serializable.
    • Revert "SPARK-1786: Edge Partition Serialization" · af15c82b
      Patrick Wendell authored
      This reverts commit a6b02fb7.
  19. May 11, 2014
    • SPARK-1786: Edge Partition Serialization · a6b02fb7
      Ankur Dave authored
      This appears to address the issue with edge partition serialization.  The solution appears to be just registering the `PrimitiveKeyOpenHashMap`.  However I noticed that we appear to have forked that code in GraphX but retained the same name (which is confusing).  I also renamed our local copy to `GraphXPrimitiveKeyOpenHashMap`.  We should consider dropping that and using the one in Spark if possible.
      
      Author: Ankur Dave <ankurdave@gmail.com>
      Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
      
      Closes #724 from jegonzal/edge_partition_serialization and squashes the following commits:
      
      b0a525a [Ankur Dave] Disable reference tracking to fix serialization test
      bb7f548 [Ankur Dave] Add failing test for EdgePartition Kryo serialization
      67dac22 [Joseph E. Gonzalez] Making EdgePartition serializable.
    • Fix error in 2d Graph Partitioner · f938a155
      Joseph E. Gonzalez authored
      There was a minor bug in which negative partition ids could be generated when constructing a 2D partitioning of a graph. This could lead to an inefficient 2D partition for large vertex id values.
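
      The failure mode can be sketched in isolation (the mixing constant and names are illustrative, not the actual GraphX code):

      ```scala
      // Mixing a vertex id with a large prime can overflow Long, and Scala's %
      // keeps the sign of the dividend, so the partition id can come out negative.
      val mixingPrime = 1125899906842597L
      def overflowingPartition(vid: Long, numParts: Int): Int =
        ((vid * mixingPrime) % numParts).toInt
      def safePartition(vid: Long, numParts: Int): Int =
        (math.abs(vid * mixingPrime) % numParts).toInt
      ```

      Even a small vertex id such as 8193 overflows the product past `Long.MaxValue`, so `overflowingPartition` can return a negative id; taking the absolute value before the remainder keeps the result in `[0, numParts)`.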
      
      Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
      
      Closes #709 from jegonzal/fix_2d_partitioning and squashes the following commits:
      
      937c562 [Joseph E. Gonzalez] fixing bug in 2d partitioning algorithm where negative partition ids could be generated.
  20. May 10, 2014
    • Unify GraphImpl RDDs + other graph load optimizations · 905173df
      Ankur Dave authored
      This PR makes the following changes, primarily in e4fbd329aef85fe2c38b0167255d2a712893d683:
      
      1. *Unify RDDs to avoid zipPartitions.* A graph used to be four RDDs: vertices, edges, routing table, and triplet view. This commit merges them down to two: vertices (with routing table), and edges (with replicated vertices).
      
      2. *Avoid duplicate shuffle in graph building.* We used to do two shuffles when building a graph: one to extract routing information from the edges and move it to the vertices, and another to find nonexistent vertices referred to by edges. With this commit, the latter is done as a side effect of the former.
      
      3. *Avoid no-op shuffle when joins are fully eliminated.* This is a side effect of unifying the edges and the triplet view.
      
      4. *Join elimination for mapTriplets.*
      
      5. *Ship only the needed vertex attributes when upgrading the triplet view.* If the triplet view already contains source attributes, and we now need both attributes, only ship destination attributes rather than re-shipping both. This is done in `ReplicatedVertexView#upgrade`.
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #497 from ankurdave/unify-rdds and squashes the following commits:
      
      332ab43 [Ankur Dave] Merge remote-tracking branch 'apache-spark/master' into unify-rdds
      4933e2e [Ankur Dave] Exclude RoutingTable from binary compatibility check
      5ba8789 [Ankur Dave] Add GraphX upgrade guide from Spark 0.9.1
      13ac845 [Ankur Dave] Merge remote-tracking branch 'apache-spark/master' into unify-rdds
      a04765c [Ankur Dave] Remove unnecessary toOps call
      57202e8 [Ankur Dave] Replace case with pair parameter
      75af062 [Ankur Dave] Add explicit return types
      04d3ae5 [Ankur Dave] Convert implicit parameter to context bound
      c88b269 [Ankur Dave] Revert upgradeIterator to if-in-a-loop
      0d3584c [Ankur Dave] EdgePartition.size should be val
      2a928b2 [Ankur Dave] Set locality wait
      10b3596 [Ankur Dave] Clean up public API
      ae36110 [Ankur Dave] Fix style errors
      e4fbd32 [Ankur Dave] Unify GraphImpl RDDs + other graph load optimizations
      d6d60e2 [Ankur Dave] In GraphLoader, coalesce to minEdgePartitions
      62c7b78 [Ankur Dave] In Analytics, take PageRank numIter
      d64e8d4 [Ankur Dave] Log current Pregel iteration
      905173df
    • Matei Zaharia's avatar
      SPARK-1708. Add a ClassTag on Serializer and things that depend on it · 7eefc9d2
      Matei Zaharia authored
      This pull request contains a rebased patch from @heathermiller (https://github.com/heathermiller/spark/pull/1) to add ClassTags on Serializer and types that depend on it (Broadcast and AccumulableCollection). Putting these in the public API signatures now will allow us to use Scala Pickling for serialization down the line without breaking binary compatibility.
      
      One question remaining is whether we also want them on Accumulator -- Accumulator is passed as part of a bigger Task or TaskResult object via the closure serializer so it doesn't seem super useful to add the ClassTag there. Broadcast and AccumulableCollection in contrast were being serialized directly.
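      The shape of the change can be sketched as a `ClassTag` context bound on the serializer's methods. This is an illustrative interface, not Spark's actual `Serializer` API; `SketchSerializer` and `JavaLikeSerializer` are hypothetical names.

      ```scala
      import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
      import scala.reflect.ClassTag

      trait SketchSerializer {
        def serialize[T: ClassTag](value: T): Array[Byte]
        def deserialize[T: ClassTag](bytes: Array[Byte]): T
      }

      class JavaLikeSerializer extends SketchSerializer {
        def serialize[T: ClassTag](value: T): Array[Byte] = {
          // The ClassTag[T] is in scope here, so a pickling framework could
          // later obtain the concrete class via classTag[T].runtimeClass.
          val bos = new ByteArrayOutputStream()
          val oos = new ObjectOutputStream(bos)
          oos.writeObject(value)
          oos.close()
          bos.toByteArray
        }
        def deserialize[T: ClassTag](bytes: Array[Byte]): T = {
          val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
          ois.readObject().asInstanceOf[T]
        }
      }
      ```

      Adding the bound now means callers already supply the tag, so a later switch to a tag-requiring serialization backend needs no signature change.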
      
      CC @rxin, @pwendell, @heathermiller
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #700 from mateiz/spark-1708 and squashes the following commits:
      
      1a3d8b0 [Matei Zaharia] Use fake ClassTag in Java
      3b449ed [Matei Zaharia] test fix
      2209a27 [Matei Zaharia] Code style fixes
      9d48830 [Matei Zaharia] Add a ClassTag on Serializer and things that depend on it
      7eefc9d2
  21. May 08, 2014
    • Prashant Sharma's avatar
      SPARK-1565, update examples to be used with spark-submit script. · 44dd57fb
      Prashant Sharma authored
      Commit for initial feedback; basically I am curious whether we should prompt the user to provide args, especially when they are mandatory, and whether we can skip them when they are not.

      
      Also few other things that did not work like
      `bin/spark-submit examples/target/scala-2.10/spark-examples-1.0.0-SNAPSHOT-hadoop1.0.4.jar --class org.apache.spark.examples.SparkALS --arg 100 500 10 5 2`
      
      Not all the args get passed properly; maybe I have messed something up, and will hopefully sort it out.
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #552 from ScrapCodes/SPARK-1565/update-examples and squashes the following commits:
      
      669dd23 [Prashant Sharma] Review comments
      2727e70 [Prashant Sharma] SPARK-1565, update examples to be used with spark-submit script.
      44dd57fb
  22. May 07, 2014
    • Kan Zhang's avatar
      [SPARK-1460] Returning SchemaRDD instead of normal RDD on Set operations... · 967635a2
      Kan Zhang authored
      ... that do not change schema
      
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #448 from kanzhang/SPARK-1460 and squashes the following commits:
      
      111e388 [Kan Zhang] silence MiMa errors in EdgeRDD and VertexRDD
      91dc787 [Kan Zhang] Taking into account newly added Ordering param
      79ed52a [Kan Zhang] [SPARK-1460] Returning SchemaRDD on Set operations that do not change schema
      967635a2
  23. Apr 29, 2014
    • witgo's avatar
      Improved build configuration · 030f2c21
      witgo authored
      1. Fix SPARK-1441: compile error in Spark core with Hadoop 0.23.x
      2. Fix SPARK-1491: Maven hadoop-provided profile fails to build
      3. Fix inconsistent dependency versions for org.scala-lang:* and org.apache.avro:*
      4. Modified sql/catalyst/pom.xml, sql/hive/pom.xml, and sql/core/pom.xml (reformatted four-space indentation to two spaces)
      
      Author: witgo <witgo@qq.com>
      
      Closes #480 from witgo/format_pom and squashes the following commits:
      
      03f652f [witgo] review commit
      b452680 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
      bee920d [witgo] revert fix SPARK-1629: Spark Core missing commons-lang dependence
      7382a07 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
      6902c91 [witgo] fix SPARK-1629: Spark Core missing commons-lang dependence
      0da4bc3 [witgo] merge master
      d1718ed [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
      e345919 [witgo] add avro dependency to yarn-alpha
      77fad08 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
      62d0862 [witgo] Fix org.scala-lang: * inconsistent versions dependency
      1a162d7 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
      934f24d [witgo] review commit
      cf46edc [witgo] exclude jruby
      06e7328 [witgo] Merge branch 'SparkBuild' into format_pom
      99464d2 [witgo] fix maven hadoop-provided profile fails to build
      0c6c1fc [witgo] Fix compile spark core error with hadoop 0.23.x
      6851bec [witgo] Maintain consistent SparkBuild.scala, pom.xml
      030f2c21
  24. Apr 24, 2014
    • Sandeep's avatar
      Fix Scala Style · a03ac222
      Sandeep authored
      Any comments are welcome
      
      Author: Sandeep <sandeep@techaddict.me>
      
      Closes #531 from techaddict/stylefix-1 and squashes the following commits:
      
      7492730 [Sandeep] Pass 4
      98b2428 [Sandeep] fix rxin suggestions
      b5e2e6f [Sandeep] Pass 3
      05932d7 [Sandeep] fix if else styling 2
      08690e5 [Sandeep] fix if else styling
      a03ac222
    • Ankur Dave's avatar
      Mark all fields of EdgePartition, Graph, and GraphOps transient · 1d6abe3a
      Ankur Dave authored
      These classes are only serializable to work around closure capture, so their fields should all be marked `@transient` to avoid wasteful serialization.
      
      This PR supersedes apache/spark#519 and fixes the same bug.
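      A minimal sketch of why this matters, with a hypothetical `Wrapper` class standing in for `EdgePartition`/`Graph`/`GraphOps`: a class made `Serializable` only to survive closure capture should not drag its heavyweight fields along.

      ```scala
      import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

      // @transient on the field: it is skipped during Java serialization.
      class Wrapper(@transient val cache: Array[Int]) extends Serializable {
        val id: Int = 42
      }

      object TransientDemo {
        def roundtrip[A](a: A): A = {
          val bos = new ByteArrayOutputStream()
          val oos = new ObjectOutputStream(bos)
          oos.writeObject(a)
          oos.close()
          new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
            .readObject().asInstanceOf[A]
        }

        def main(args: Array[String]): Unit = {
          val w = roundtrip(new Wrapper(Array.fill(1000)(0)))
          assert(w.id == 42)      // ordinary field survives the round trip
          assert(w.cache == null) // @transient field was not serialized
        }
      }
      ```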
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #520 from ankurdave/graphx-transient and squashes the following commits:
      
      6431760 [Ankur Dave] Mark all fields of EdgePartition, Graph, and GraphOps `@transient`
      1d6abe3a
  25. Apr 16, 2014
    • Ankur Dave's avatar
      SPARK-1329: Create pid2vid with correct number of partitions · 17d32345
      Ankur Dave authored
      Each vertex partition is co-located with a pid2vid array created in RoutingTable.scala. This array maps edge partition IDs to the list of vertices in the current vertex partition that are mentioned by edges in that partition. Therefore the pid2vid array should have one entry per edge partition.
      
      GraphX currently creates one entry per *vertex* partition, which is a bug that leads to an ArrayIndexOutOfBoundsException when there are more edge partitions than vertex partitions. This commit fixes the bug and adds a test for this case.
      
      Resolves SPARK-1329. Thanks to Daniel Darabos for reporting this bug.
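      The sizing bug can be illustrated with a small sketch (the names are hypothetical, not GraphX internals): since pid2vid is indexed by *edge* partition ID, sizing it by the vertex partition count leaves out-of-range indices whenever there are more edge partitions than vertex partitions.

      ```scala
      object Pid2VidSketch {
        def main(args: Array[String]): Unit = {
          val numVertexPartitions = 2
          val numEdgePartitions = 4

          // Buggy: sized by vertex partitions. Looking up edge partition 3
          // throws ArrayIndexOutOfBoundsException.
          val buggyPid2Vid = Array.fill(numVertexPartitions)(List.empty[Long])
          // Fixed: one entry per edge partition.
          val fixedPid2Vid = Array.fill(numEdgePartitions)(List.empty[Long])

          assert(scala.util.Try(buggyPid2Vid(3)).isFailure)
          assert(scala.util.Try(fixedPid2Vid(3)).isSuccess)
        }
      }
      ```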
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #368 from ankurdave/fix-pid2vid-size and squashes the following commits:
      
      5a5c52a [Ankur Dave] SPARK-1329: Create pid2vid with correct number of partitions
      17d32345
    • Ankur Dave's avatar
      Rebuild routing table after Graph.reverse · 235a47ce
      Ankur Dave authored
      GraphImpl.reverse used to reverse edges in each partition of the edge RDD but preserve the routing table and replicated vertex view, since reversing should not affect partitioning.
      
      However, the old routing table would then have incorrect information: its srcAttrOnly and dstAttrOnly RDDs should be switched, since the edge directions they were built for have been reversed.
      
      A simple fix is for Graph.reverse to rebuild the routing table and replicated vertex view.
      
      Thanks to Bogdan Ghidireac for reporting this issue on the [mailing list](http://apache-spark-user-list.1001560.n3.nabble.com/graph-reverse-amp-Pregel-API-td4338.html).
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #431 from ankurdave/fix-reverse-bug and squashes the following commits:
      
      75d63cb [Ankur Dave] Rebuild routing table after Graph.reverse
      235a47ce
  26. Apr 15, 2014
    • William Benton's avatar
      SPARK-1501: Ensure assertions in Graph.apply are asserted. · 2580a3b1
      William Benton authored
      The Graph.apply test in GraphSuite had some assertions in a closure in
      a graph transformation. As a consequence, these assertions never
      actually executed.  Furthermore, these closures had a reference to
      (non-serializable) test harness classes because they called assert(),
      which could be a problem if we proactively check closure serializability
      in the future.
      
      This commit simply changes the Graph.apply test to collect the graph
      triplets so it can assert about each triplet from a map method.
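      The pitfall can be reproduced without Spark: like RDD transformations, `Iterator.map` is lazy, so an assertion inside the mapped closure never runs until the result is materialized. This is an illustrative sketch, not the GraphSuite test itself.

      ```scala
      object LazyAssertDemo {
        def main(args: Array[String]): Unit = {
          var ran = false
          // Nothing executes yet: the closure (and its assert) is deferred.
          val it = Iterator(1, 2, 3).map { x => ran = true; assert(x > 0); x }
          assert(!ran) // the "assertions" have not actually executed

          // Analogous to collect(): forcing evaluation runs the closure.
          val collected = it.toList
          assert(ran && collected == List(1, 2, 3))
        }
      }
      ```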
      
      Author: William Benton <willb@redhat.com>
      
      Closes #415 from willb/graphsuite-nop-fix and squashes the following commits:
      
      0b63658 [William Benton] Ensure assertions in Graph.apply are asserted.
      2580a3b1
  27. Apr 14, 2014
    • Sean Owen's avatar
      SPARK-1488. Resolve scalac feature warnings during build · 0247b5c5
      Sean Owen authored
      For your consideration: scalac currently notes a number of feature warnings during compilation:
      
      ```
      [warn] there were 65 feature warning(s); re-run with -feature for details
      ```
      
      Warnings are like:
      
      ```
      [warn] /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:1261: implicit conversion method rddToPairRDDFunctions should be enabled
      [warn] by making the implicit value scala.language.implicitConversions visible.
      [warn] This can be achieved by adding the import clause 'import scala.language.implicitConversions'
      [warn] or by setting the compiler option -language:implicitConversions.
      [warn] See the Scala docs for value scala.language.implicitConversions for a discussion
      [warn] why the feature should be explicitly enabled.
      [warn]   implicit def rddToPairRDDFunctions[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]) =
      [warn]                ^
      ```
      
      scalac is suggesting that it's just best practice to explicitly enable certain language features by importing them where used.
      
      This PR simply adds the imports it suggests (and squashes one other Java warning along the way). This leaves just deprecation warnings in the build.
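      The per-site fix looks like the following sketch (`FeatureImportDemo` is a hypothetical example, not code from the PR): the import explicitly enables the language feature where it is used, silencing the `-feature` warning.

      ```scala
      // Without this import, scalac (with -feature) warns that implicit
      // conversions should be explicitly enabled.
      import scala.language.implicitConversions

      object FeatureImportDemo {
        implicit def intToString(i: Int): String = i.toString

        def main(args: Array[String]): Unit = {
          val s: String = 7 // uses the implicit conversion Int => String
          assert(s == "7")
        }
      }
      ```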
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #404 from srowen/SPARK-1488 and squashes the following commits:
      
      8598980 [Sean Owen] Quiet scalac warnings about language features by explicitly importing language features.
      39bc831 [Sean Owen] Enable -feature in scalac to emit language feature warnings
      0247b5c5