  1. May 12, 2014
  2. May 11, 2014
    • SPARK-1786: Edge Partition Serialization · a6b02fb7
      Ankur Dave authored
      This appears to address the issue with edge partition serialization. The solution appears to be just registering the `PrimitiveKeyOpenHashMap`. However, I noticed that we appear to have forked that code in GraphX while retaining the same name (which is confusing), so I also renamed our local copy to `GraphXPrimitiveKeyOpenHashMap`. We should consider dropping that and using the one in Spark if possible. (A minimal registration sketch follows this entry.)
      
      Author: Ankur Dave <ankurdave@gmail.com>
      Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
      
      Closes #724 from jegonzal/edge_partition_serialization and squashes the following commits:
      
      b0a525a [Ankur Dave] Disable reference tracking to fix serialization test
      bb7f548 [Ankur Dave] Add failing test for EdgePartition Kryo serialization
      67dac22 [Joseph E. Gonzalez] Making EdgePartition serializable.
      a6b02fb7
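      A minimal sketch of how a missing Kryo registration like this is typically wired up, assuming Spark 1.0-era APIs; the registrator and class names below are illustrative, not the actual GraphX code:

      ```scala
      import com.esotericsoftware.kryo.Kryo
      import org.apache.spark.SparkConf
      import org.apache.spark.serializer.KryoRegistrator

      // Hypothetical registrator: register the types EdgePartition relies on so
      // Kryo can serialize edge partitions (the fix here registers the hash map class).
      class GraphXExampleRegistrator extends KryoRegistrator {
        override def registerClasses(kryo: Kryo) {
          kryo.register(classOf[Array[Int]])
          kryo.register(classOf[Array[Long]])
          // ...and register the (GraphX)PrimitiveKeyOpenHashMap class here...
        }
      }

      object KryoSetupExample {
        def sparkConf(): SparkConf = new SparkConf()
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
          .set("spark.kryo.registrator", "GraphXExampleRegistrator")
          // Reference tracking is disabled in this commit to fix the serialization test.
          .set("spark.kryo.referenceTracking", "false")
      }
      ```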
    • Fix error in 2d Graph Partitioner · f938a155
      Joseph E. Gonzalez authored
      There was a minor bug in which negative partition ids could be generated when constructing a 2D partitioning of a graph. This could lead to an inefficient 2D partitioning for large vertex id values. (A sketch of the failure mode follows this entry.)
      
      Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
      
      Closes #709 from jegonzal/fix_2d_partitioning and squashes the following commits:
      
      937c562 [Joseph E. Gonzalez] fixing bug in 2d partitioning algorithm where negative partition ids could be generated.
      f938a155
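      A minimal sketch of the failure mode, not the exact GraphX code: multiplying large vertex ids by a mixing constant can overflow a Long, and the modulo of a negative Long is negative, so the fix is to take the absolute value before the modulo:

      ```scala
      object TwoDPartitionExample {
        // Illustrative mixing prime, in the spirit of a 2D edge partitioner.
        val mixingPrime: Long = 1125899906842597L

        // Buggy variant: (src * mixingPrime) can overflow to a negative Long,
        // and the modulo of a negative Long yields a negative partition id.
        def buggyPart(src: Long, numParts: Int): Int =
          ((src * mixingPrime) % numParts).toInt

        // Fixed variant: take the absolute value first so the resulting
        // partition id always lands in [0, numParts).
        def fixedPart(src: Long, numParts: Int): Int =
          (math.abs(src * mixingPrime) % numParts).toInt

        def main(args: Array[String]): Unit = {
          val bigVertexId = Long.MaxValue / 3
          println(buggyPart(bigVertexId, 8)) // may be negative
          println(fixedPart(bigVertexId, 8)) // always in 0..7
        }
      }
      ```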
    • SPARK-1652: Set driver memory correctly in spark-submit. · 05c9aa9e
      Patrick Wendell authored
      The previous check didn't account for the fact that the default
      deploy mode is "client" unless otherwise specified. Also, this
      sets the more narrowly defined SPARK_DRIVER_MEMORY instead of setting
      SPARK_MEM.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #730 from pwendell/spark-submit and squashes the following commits:
      
      430b98f [Patrick Wendell] Feedback from Aaron
      e788edf [Patrick Wendell] Changes based on Aaron's feedback
      f508146 [Patrick Wendell] SPARK-1652: Set driver memory correctly in spark-submit.
      05c9aa9e
    • SPARK-1770: Load balance elements when repartitioning. · 7d9cc921
      Patrick Wendell authored
      This patch adds better balancing when performing a repartition of an
      RDD. Previously the elements in the RDD were hash partitioned, meaning
      if the RDD was skewed certain partitions would end up being very large.
      
      This commit adds load balancing of elements across the repartitioned
      RDD splits. The load balancing is not perfect: a given output partition
      can have up to N more elements than the average if there are N input
      partitions. However, some randomization is used to minimize the
      probability that this happens. (A sketch of the balancing idea follows this entry.)
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #727 from pwendell/load-balance and squashes the following commits:
      
      f9da752 [Patrick Wendell] Response to Matei's feedback
      acfa46a [Patrick Wendell] SPARK-1770: Load balance elements when repartitioning.
      7d9cc921
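      A rough sketch of the balancing idea, not the exact patch: each input partition starts at a random output slot and assigns keys round-robin, so a skewed input partition gets spread roughly evenly over the output partitions:

      ```scala
      import scala.util.Random
      import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
      import org.apache.spark.SparkContext._ // pair-RDD functions in Spark 1.0

      object RepartitionBalanceExample {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(
            new SparkConf().setAppName("repartition-sketch").setMaster("local[2]"))
          val skewed = sc.parallelize(1 to 100000, 2) // imagine one partition far larger

          val numPartitions = 8
          val balanced = skewed
            .mapPartitionsWithIndex { (index, items) =>
              // Start at a random slot per input partition, then assign keys
              // round-robin; the randomization keeps outputs from piling up on slot 0.
              var position = new Random(index).nextInt(numPartitions)
              items.map { item =>
                position += 1
                (position, item)
              }
            }
            .partitionBy(new HashPartitioner(numPartitions))
            .values

          println(balanced.glom().map(_.length).collect().mkString(", "))
          sc.stop()
        }
      }
      ```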
    • remove outdated runtime Information scala home · 6bee01dd
      witgo authored
      Author: witgo <witgo@qq.com>
      
      Closes #728 from witgo/scala_home and squashes the following commits:
      
      cdfd8be [witgo] Merge branch 'master' of https://github.com/apache/spark into scala_home
      fac094a [witgo] remove outdated runtime Information scala home
      6bee01dd
  3. May 10, 2014
    • Enabled incremental build that comes with sbt 0.13.2 · 70bcdef4
      Prashant Sharma authored
      More info at https://github.com/sbt/sbt/issues/1010. (The relevant sbt setting is sketched after this entry.)
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #525 from ScrapCodes/sbt-inc-opt and squashes the following commits:
      
      ba8fa42 [Prashant Sharma] Enabled incremental build that comes with sbt 0.13.2
      70bcdef4
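      For reference, the sbt 0.13.2 feature is enabled with a setting along these lines in the build definition (a sketch; the exact placement in SparkBuild may differ):

      ```scala
      // Enable the name-hashing-based incremental compiler that ships with sbt 0.13.2,
      // so unrelated source changes trigger fewer recompilations.
      incOptions := incOptions.value.withNameHashing(true)
      ```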
    • [SPARK-1774] Respect SparkSubmit --jars on YARN (client) · 83e0424d
      Andrew Or authored
      SparkSubmit ignores `--jars` for YARN client. This is a bug.
      
      This PR also automatically adds the application jar to `spark.jar`. Previously, when running as yarn-client, you had to specify the jar additionally through `--files` (because `--jars` didn't work). Now you don't have to specify it explicitly through either.
      
      Tested on a YARN cluster.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #710 from andrewor14/yarn-jars and squashes the following commits:
      
      35d1928 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-jars
      c27bf6c [Andrew Or] For yarn-cluster and python, do not add primaryResource to spark.jar
      c92c5bf [Andrew Or] Minor cleanups
      269f9f3 [Andrew Or] Fix format
      013d840 [Andrew Or] Fix tests
      1407474 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-jars
      3bb75e8 [Andrew Or] Allow SparkSubmit --jars to take effect in yarn-client mode
      83e0424d
    • SPARK-1789. Multiple versions of Netty dependencies cause FlumeStreamSuite failure · 2b7bd29e
      Sean Owen authored
      TL;DR is there is a bit of JAR hell trouble with Netty, that can be mostly resolved and will resolve a test failure.
      
      I hit the error described at http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-startup-time-out-td1753.html while running FlumeStreamingSuite, and have been hitting it for a short while (is it just me?).
      
      velvia notes:
      "I have found a workaround.  If you add akka 2.2.4 to your dependencies, then everything works, probably because akka 2.2.4 brings in newer version of Jetty."
      
      There are at least 3 versions of Netty in play in the build:
      
      - the new Flume 1.4.0 dependency brings in io.netty:netty:3.4.0.Final, and that is the immediate problem
      - the custom version of akka 2.2.3 depends on io.netty:netty:3.6.6.
      - but, Spark Core directly uses io.netty:netty-all:4.0.17.Final
      
      The POMs try to exclude other versions of netty, but are excluding org.jboss.netty:netty, when in fact older versions of io.netty:netty (not netty-all) are also an issue.
      
      The org.jboss.netty:netty excludes are largely unnecessary. I replaced many of them with io.netty:netty exclusions until everything agreed on io.netty:netty-all:4.0.17.Final.
      
      But this didn't work, since Akka 2.2.3 doesn't work with Netty 4.x. Down-grading to 3.6.6.Final across the board made some Spark code not compile.
      
      If the build *keeps* io.netty:netty:3.6.6.Final as well, everything seems to work. Part of the reason seems to be that Netty 3.x used the old `org.jboss.netty` packages. This is less than ideal, but is no worse than the current situation.
      
      So this PR resolves the issue and improves the JAR hell, even if it leaves the existing theoretical Netty 3-vs-4 conflict:
      
      - Remove org.jboss.netty excludes where possible, for clarity; they're not needed except with Hadoop artifacts
      - Add io.netty:netty excludes where needed -- except, let akka keep its io.netty:netty
      - Change a bit of test code that actually depended on Netty 3.x, to use 4.x equivalent
      - Update SBT build accordingly
      
      A better change would be to update Akka far enough such that it agrees on Netty 4.x, but I don't know if that's feasible.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #723 from srowen/SPARK-1789 and squashes the following commits:
      
      43661b7 [Sean Owen] Update and add Netty excludes to prevent some JAR conflicts that cause test issues
      2b7bd29e
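      A sketch of what the sbt side of such an exclusion looks like; the module list here is illustrative, not the exact set touched by the patch:

      ```scala
      // Keep everything on io.netty:netty-all by excluding the old io.netty:netty
      // artifact from dependencies that transitively pull in Netty 3.x.
      libraryDependencies ++= Seq(
        "org.apache.flume" % "flume-ng-sdk" % "1.4.0" exclude("io.netty", "netty"),
        "io.netty" % "netty-all" % "4.0.17.Final"
      )
      ```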
    • Unify GraphImpl RDDs + other graph load optimizations · 905173df
      Ankur Dave authored
      This PR makes the following changes, primarily in e4fbd329aef85fe2c38b0167255d2a712893d683:
      
      1. *Unify RDDs to avoid zipPartitions.* A graph used to be four RDDs: vertices, edges, routing table, and triplet view. This commit merges them down to two: vertices (with routing table), and edges (with replicated vertices).
      
      2. *Avoid duplicate shuffle in graph building.* We used to do two shuffles when building a graph: one to extract routing information from the edges and move it to the vertices, and another to find nonexistent vertices referred to by edges. With this commit, the latter is done as a side effect of the former.
      
      3. *Avoid no-op shuffle when joins are fully eliminated.* This is a side effect of unifying the edges and the triplet view.
      
      4. *Join elimination for mapTriplets.*
      
      5. *Ship only the needed vertex attributes when upgrading the triplet view.* If the triplet view already contains source attributes, and we now need both attributes, only ship destination attributes rather than re-shipping both. This is done in `ReplicatedVertexView#upgrade`.
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #497 from ankurdave/unify-rdds and squashes the following commits:
      
      332ab43 [Ankur Dave] Merge remote-tracking branch 'apache-spark/master' into unify-rdds
      4933e2e [Ankur Dave] Exclude RoutingTable from binary compatibility check
      5ba8789 [Ankur Dave] Add GraphX upgrade guide from Spark 0.9.1
      13ac845 [Ankur Dave] Merge remote-tracking branch 'apache-spark/master' into unify-rdds
      a04765c [Ankur Dave] Remove unnecessary toOps call
      57202e8 [Ankur Dave] Replace case with pair parameter
      75af062 [Ankur Dave] Add explicit return types
      04d3ae5 [Ankur Dave] Convert implicit parameter to context bound
      c88b269 [Ankur Dave] Revert upgradeIterator to if-in-a-loop
      0d3584c [Ankur Dave] EdgePartition.size should be val
      2a928b2 [Ankur Dave] Set locality wait
      10b3596 [Ankur Dave] Clean up public API
      ae36110 [Ankur Dave] Fix style errors
      e4fbd32 [Ankur Dave] Unify GraphImpl RDDs + other graph load optimizations
      d6d60e2 [Ankur Dave] In GraphLoader, coalesce to minEdgePartitions
      62c7b78 [Ankur Dave] In Analytics, take PageRank numIter
      d64e8d4 [Ankur Dave] Log current Pregel iteration
      905173df
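      A small usage sketch of the mapTriplets case in point 4 above (illustrative data; the join elimination itself happens inside GraphX): when the supplied function only reads the source and edge attributes, the join that would ship destination attributes can be skipped.

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.graphx._

      object MapTripletsExample {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(
            new SparkConf().setAppName("triplets-sketch").setMaster("local[2]"))

          // A tiny graph: vertex attribute is an Int weight, edge attribute a Double.
          val vertices = sc.parallelize(Seq((1L, 10), (2L, 20), (3L, 30)))
          val edges = sc.parallelize(Seq(Edge(1L, 2L, 0.5), Edge(2L, 3L, 1.5)))
          val graph = Graph(vertices, edges)

          // Only srcAttr and attr are used here, so destination attributes
          // do not need to be shipped to build these triplets.
          val scaled = graph.mapTriplets(t => t.srcAttr * t.attr)
          scaled.edges.collect().foreach(println)

          sc.stop()
        }
      }
      ```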
    • [SPARK-1690] Tolerating empty elements when saving Python RDD to text files · 6c2691d0
      Kan Zhang authored
      Tolerate empty strings in PythonRDD
      
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #644 from kanzhang/SPARK-1690 and squashes the following commits:
      
      c62ad33 [Kan Zhang] Adding Python doctest
      473ec4b [Kan Zhang] [SPARK-1690] Tolerating empty elements when saving Python RDD to text files
      6c2691d0
    • Add Python includes to path before depickling broadcast values · 3776f2f2
      Bouke van der Bijl authored
      This fixes https://issues.apache.org/jira/browse/SPARK-1731 by adding the Python includes to the PYTHONPATH before depickling the broadcast values
      
      @airhorns
      
      Author: Bouke van der Bijl <boukevanderbijl@gmail.com>
      
      Closes #656 from bouk/python-includes-before-broadcast and squashes the following commits:
      
      7b0dfe4 [Bouke van der Bijl] Add Python includes to path before depickling broadcast values
      3776f2f2
    • fix broken in link in python docs · c05d11bb
      Andy Konwinski authored
      Author: Andy Konwinski <andykonwinski@gmail.com>
      
      Closes #650 from andyk/python-docs-link-fix and squashes the following commits:
      
      a1f9d51 [Andy Konwinski] fix broken in link in python docs
      c05d11bb
    • SPARK-1708. Add a ClassTag on Serializer and things that depend on it · 7eefc9d2
      Matei Zaharia authored
      This pull request contains a rebased patch from @heathermiller (https://github.com/heathermiller/spark/pull/1) to add ClassTags on Serializer and types that depend on it (Broadcast and AccumulableCollection). Putting these in the public API signatures now will allow us to use Scala Pickling for serialization down the line without breaking binary compatibility.
      
      One question remaining is whether we also want them on Accumulator -- Accumulator is passed as part of a bigger Task or TaskResult object via the closure serializer so it doesn't seem super useful to add the ClassTag there. Broadcast and AccumulableCollection in contrast were being serialized directly.
      
      CC @rxin, @pwendell, @heathermiller
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #700 from mateiz/spark-1708 and squashes the following commits:
      
      1a3d8b0 [Matei Zaharia] Use fake ClassTag in Java
      3b449ed [Matei Zaharia] test fix
      2209a27 [Matei Zaharia] Code style fixes
      9d48830 [Matei Zaharia] Add a ClassTag on Serializer and things that depend on it
      7eefc9d2
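      A sketch of what adding the ClassTag looks like at the API level; these are simplified signatures for illustration, not the actual Spark ones:

      ```scala
      import scala.reflect.ClassTag

      // Carrying a ClassTag through a serialization-facing API records the static
      // type being serialized (useful for a pickling-style serializer later),
      // while call sites stay unchanged because the compiler supplies the tag.
      trait SimpleSerializerApi {
        def serialize[T: ClassTag](value: T): Array[Byte]
        def deserialize[T: ClassTag](bytes: Array[Byte]): T
      }

      object ClassTagExample {
        // The ClassTag for Seq[Int] is filled in implicitly at the call site.
        def roundTrip(s: SimpleSerializerApi, xs: Seq[Int]): Seq[Int] =
          s.deserialize[Seq[Int]](s.serialize(xs))
      }
      ```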
    • [SPARK-1778] [SQL] Add 'limit' transformation to SchemaRDD. · 8e94d272
      Takuya UESHIN authored
      Add `limit` transformation to `SchemaRDD`.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #711 from ueshin/issues/SPARK-1778 and squashes the following commits:
      
      33169df [Takuya UESHIN] Add 'limit' transformation to SchemaRDD.
      8e94d272
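      A small usage sketch, assuming a Spark 1.0-era SQLContext and a made-up table name:

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.SQLContext

      case class Record(key: Int, value: String)

      object LimitExample {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(
            new SparkConf().setAppName("limit-sketch").setMaster("local[2]"))
          val sqlContext = new SQLContext(sc)
          import sqlContext.createSchemaRDD

          val records = sc.parallelize(1 to 100).map(i => Record(i, s"val_$i"))
          records.registerAsTable("records")

          // The new transformation: keep only the first 10 rows of the SchemaRDD.
          sqlContext.sql("SELECT key, value FROM records").limit(10).collect().foreach(println)

          sc.stop()
        }
      }
      ```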
    • [SQL] Upgrade parquet library. · 4d605532
      Michael Armbrust authored
      I think we are hitting this issue in some perf tests: https://github.com/Parquet/parquet-mr/commit/6aed5288fd4a1398063a5a219b2ae4a9f71b02cf
      
      Credit to @aarondav !
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #684 from marmbrus/upgradeParquet and squashes the following commits:
      
      e10a619 [Michael Armbrust] Upgrade parquet library.
      4d605532
    • [SPARK-1644] The org.datanucleus:* should not be packaged into spark-assembly-*.jar · 56151086
      witgo authored
      Author: witgo <witgo@qq.com>
      
      Closes #688 from witgo/SPARK-1644 and squashes the following commits:
      
      56ad6ac [witgo] review commit
      87c03e4 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1644
      6ffa7e4 [witgo] review commit
      a597414 [witgo] The org.datanucleus:* should not be packaged into spark-assembly-*.jar
      56151086
  4. May 09, 2014
    • SPARK-1686: keep schedule() calling in the main thread · 2f452cba
      CodingCat authored
      https://issues.apache.org/jira/browse/SPARK-1686
      
      moved from original JIRA (by @markhamstra):
      
      In deploy.master.Master, the completeRecovery method is the last thing to be called when a standalone Master is recovering from failure. It is responsible for resetting some state, relaunching drivers, and eventually resuming its scheduling duties.
      
      There are currently four places in Master.scala where completeRecovery is called. Three of them are from within the actor's receive method, and aren't problems. The last starts from within receive when the ElectedLeader message is received, but the actual completeRecovery() call is made from the Akka scheduler. That means that it will execute on a different scheduler thread, and Master itself will end up running schedule() from that Akka scheduler thread.
      
      In this PR, I added a new Master message, TriggerSchedule, to trigger the "local" call of schedule() on the main (actor) thread. (A generic sketch of the pattern follows this entry.)
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #639 from CodingCat/SPARK-1686 and squashes the following commits:
      
      81bb4ca [CodingCat] rename variable
      69e0a2a [CodingCat] style fix
      36a2ac0 [CodingCat] address Aaron's comments
      ec9b7bb [CodingCat] address the comments
      02b37ca [CodingCat] keep schedule() calling in the main thread
      2f452cba
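      A minimal sketch of the pattern with generic Akka actors, not the actual Master code; the message and class names are made up. Instead of calling schedule() from the Akka scheduler's callback thread, the callback sends the actor a message, and schedule() runs when the actor processes that message on its own thread:

      ```scala
      import scala.concurrent.duration._
      import akka.actor.{Actor, ActorSystem, Props}

      case object TriggerSchedule // hypothetical message, for illustration only

      class MasterLike extends Actor {
        import context.dispatcher // execution context for the Akka scheduler

        def completeRecovery(): Unit = {
          // ...reset state, relaunch drivers...
          // Calling schedule() directly here would run it on a scheduler thread;
          // sending a message makes it run on the actor's own thread instead.
          self ! TriggerSchedule
        }

        def schedule(): Unit = { /* resume scheduling duties */ }

        def receive = {
          case "electedLeader" =>
            // Defer completeRecovery via the Akka scheduler, as Master does.
            context.system.scheduler.scheduleOnce(1.second)(completeRecovery())
          case TriggerSchedule =>
            schedule() // runs on the actor's message-processing thread
        }
      }

      object TriggerScheduleExample extends App {
        val system = ActorSystem("sketch")
        val master = system.actorOf(Props[MasterLike], "master")
        master ! "electedLeader"
        Thread.sleep(3000)
        system.shutdown()
      }
      ```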
    • SPARK-1770: Revert accidental(?) fix · 59577df1
      Aaron Davidson authored
      Looks like this change was accidentally committed here: https://github.com/apache/spark/commit/06b15baab25951d124bbe6b64906f4139e037deb
      but the change does not show up in the PR itself (#704).
      
      Besides not being intended to go in with that PR, this change also broke the test JavaAPISuite.repartition.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #716 from aarondav/shufflerand and squashes the following commits:
      
      b1cf70b [Aaron Davidson] SPARK-1770: Revert accidental(?) fix
      59577df1
    • [SPARK-1760]: fix building spark with maven documentation · bd67551e
      witgo authored
      Author: witgo <witgo@qq.com>
      
      Closes #712 from witgo/building-with-maven and squashes the following commits:
      
      215523b [witgo] fix building spark with maven documentation
      bd67551e
    • Converted bang to ask to avoid scary warning when a block is removed · 32868f31
      Tathagata Das authored
      Removing a block through the block manager gave scary warning messages in the driver.
      ```
      2014-05-08 20:16:19,172 WARN BlockManagerMasterActor: Got unknown message: true
      2014-05-08 20:16:19,172 WARN BlockManagerMasterActor: Got unknown message: true
      2014-05-08 20:16:19,172 WARN BlockManagerMasterActor: Got unknown message: true
      ```
      
      This is because the [BlockManagerSlaveActor](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManagerSlaveActor.scala#L44) would send back an acknowledgement ("true"). But the BlockManagerMasterActor had sent the RemoveBlock message as a fire-and-forget send (!), not as ask(), so it would reject the received "true" as an unknown message. (A generic sketch of the difference follows this entry.)
      @pwendell
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #708 from tdas/bm-fix and squashes the following commits:
      
      ed4ef15 [Tathagata Das] Converted bang to ask to avoid scary warning when a block is removed.
      32868f31
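      A generic sketch of the difference with plain Akka actors, not the actual BlockManager classes: with `!` the reply is not consumed (and shows up as an unknown message when the sender is an actor, as in the warning above), whereas with ask() the reply is captured as a Future:

      ```scala
      import scala.concurrent.Await
      import scala.concurrent.duration._
      import akka.actor.{Actor, ActorSystem, Props}
      import akka.pattern.ask
      import akka.util.Timeout

      case class RemoveBlock(blockId: String) // illustrative message

      class SlaveLike extends Actor {
        def receive = {
          case RemoveBlock(id) =>
            // ...remove the block...
            sender ! true // acknowledgement
        }
      }

      object BangVsAskExample extends App {
        implicit val timeout = Timeout(10.seconds)
        val system = ActorSystem("sketch")
        val slave = system.actorOf(Props[SlaveLike], "slave")

        // Fire-and-forget: nothing here consumes the `true` acknowledgement.
        slave ! RemoveBlock("rdd_0_0")

        // Ask: the acknowledgement becomes the Future's result instead.
        val ack = Await.result((slave ? RemoveBlock("rdd_0_1")).mapTo[Boolean], 10.seconds)
        println(s"removed: $ack")

        system.shutdown()
      }
      ```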
    • MINOR: Removing dead code. · 4c60fd1e
      Patrick Wendell authored
      Meant to do this when patching up the last merge.
      4c60fd1e
    • SPARK-1775: Unneeded lock in ShuffleMapTask.deserializeInfo · 7db47c46
      Sandeep authored
      This was used in the past to have a cache of deserialized ShuffleMapTasks, but that's been removed, so there's no need for a lock. It slows down Spark when task descriptions are large, e.g. due to large lineage graphs or local variables.
      
      Author: Sandeep <sandeep@techaddict.me>
      
      Closes #707 from techaddict/SPARK-1775 and squashes the following commits:
      
      18d8ebf [Sandeep] SPARK-1775: Unneeded lock in ShuffleMapTask.deserializeInfo This was used in the past to have a cache of deserialized ShuffleMapTasks, but that's been removed, so there's no need for a lock. It slows down Spark when task descriptions are large, e.g. due to large lineage graphs or local variables.
      7db47c46
    • SPARK-1565 (Addendum): Replace `run-example` with `spark-submit`. · 06b15baa
      Patrick Wendell authored
      Gives a nicely formatted message to the user when `run-example` is run to
      tell them to use `spark-submit`.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #704 from pwendell/examples and squashes the following commits:
      
      1996ee8 [Patrick Wendell] Feedback form Andrew
      3eb7803 [Patrick Wendell] Suggestions from TD
      2474668 [Patrick Wendell] SPARK-1565 (Addendum): Replace `run-example` with `spark-submit`.
      06b15baa
  5. May 08, 2014
    • [SPARK-1631] Correctly set the Yarn app name when launching the AM. · 3f779d87
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #539 from vanzin/yarn-app-name and squashes the following commits:
      
      7d1ca4f [Marcelo Vanzin] [SPARK-1631] Correctly set the Yarn app name when launching the AM.
      3f779d87
    • [SPARK-1755] Respect SparkSubmit --name on YARN · 8b784129
      Andrew Or authored
      Right now, SparkSubmit ignores the `--name` flag for both yarn-client and yarn-cluster. This is a bug.
      
      In client mode, SparkSubmit treats `--name` as a [cluster config](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L170) and does not propagate this to SparkContext.
      
      In cluster mode, SparkSubmit passes this flag to `org.apache.spark.deploy.yarn.Client`, which only uses it for the [YARN ResourceManager](https://github.com/apache/spark/blob/master/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L80), but does not propagate this to SparkContext.
      
      This PR ensures that `spark.app.name` is always set if SparkSubmit receives the `--name` flag, which is what the usage promises. This makes it possible for applications to start a SparkContext with an empty conf `val sc = new SparkContext(new SparkConf)`, and inherit the app name from SparkSubmit.
      
      Tested both modes on a YARN cluster.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #699 from andrewor14/yarn-app-name and squashes the following commits:
      
      98f6a79 [Andrew Or] Fix tests
      dea932f [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-app-name
      c86d9ca [Andrew Or] Respect SparkSubmit --name on YARN
      8b784129
    • Include the sbin/spark-config.sh in spark-executor · 2fd2752e
      Bouke van der Bijl authored
      This is needed because broadcast values are broken in PySpark on Mesos: it tries to import pyspark but can't, because the PYTHONPATH is not set, due to changes in ff5be9a4.
      
      https://issues.apache.org/jira/browse/SPARK-1725
      
      Author: Bouke van der Bijl <boukevanderbijl@gmail.com>
      
      Closes #651 from bouk/include-spark-config-in-mesos-executor and squashes the following commits:
      
      b2f1295 [Bouke van der Bijl] Inline PYTHONPATH in spark-executor
      eedbbcc [Bouke van der Bijl] Include the sbin/spark-config.sh in spark-executor
      2fd2752e
    • Bug fix of sparse vector conversion · 191279ce
      Funes authored
      Fixed a small bug caused by an inconsistency between the index/data array sizes and the vector length. (A small sketch using the public vector API follows this entry.)
      
      Author: Funes <tianshaocun@gmail.com>
      Author: funes <tianshaocun@gmail.com>
      
      Closes #661 from funes/bugfix and squashes the following commits:
      
      edb2b9d [funes] remove unused import
      75dced3 [Funes] update test case
      d129a66 [Funes] Add test for sparse breeze by vector builder
      64e7198 [Funes] Copy data only when necessary
      b85806c [Funes] Bug fix of sparse vector conversion
      191279ce
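      A small sketch of the consistency this relies on, using the public MLlib vector API rather than the Breeze conversion internals:

      ```scala
      import org.apache.spark.mllib.linalg.Vectors

      object SparseVectorExample {
        def main(args: Array[String]): Unit = {
          // A sparse vector of logical length 10 with 3 active entries: the index
          // and value arrays must agree with each other and with the declared size,
          // which is the invariant the conversion bug violated.
          val v = Vectors.sparse(10, Array(0, 3, 7), Array(1.0, 2.5, -4.0))
          println(v.size)           // 10, the logical length
          println(v.toArray.length) // also 10: the dense copy pads inactive entries
        }
      }
      ```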
    • [SPARK-1157][MLlib] Bug fix: lossHistory should exclude rejection steps, and remove miniBatch · 910a13b3
      DB Tsai authored
      Getting the lossHistory from Breeze's API which already excludes the rejection steps in line search. Also, remove the miniBatch in LBFGS since those quasi-Newton methods approximate the inverse of Hessian. It doesn't make sense if the gradients are computed from a varying objective.
      
      Author: DB Tsai <dbtsai@alpinenow.com>
      
      Closes #582 from dbtsai/dbtsai-lbfgs-bug and squashes the following commits:
      
      9cc6cf9 [DB Tsai] Removed the miniBatch in LBFGS.
      1ba6a33 [DB Tsai] Formatting the code.
      d72c679 [DB Tsai] Using Breeze's states to get the loss.
      910a13b3
    • MLlib documentation fix · d38febee
      DB Tsai authored
      Fixed the documentation to reflect that `loadLibSVMData` has been changed to `loadLibSVMFile`.
      
      Author: DB Tsai <dbtsai@alpinenow.com>
      
      Closes #703 from dbtsai/dbtsai-docfix and squashes the following commits:
      
      71dd508 [DB Tsai] loadLibSVMData is changed to loadLibSVMFile
      d38febee
    • [SPARK-1754] [SQL] Add missing arithmetic DSL operations. · 322b1808
      Takuya UESHIN authored
      Add missing arithmetic DSL operations: `unary_-`, `%`.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #689 from ueshin/issues/SPARK-1754 and squashes the following commits:
      
      a09ef69 [Takuya UESHIN] Add also missing ! (not) operation.
      f73ae2c [Takuya UESHIN] Remove redundant tests.
      5b3f087 [Takuya UESHIN] Add tests relating DSL operations.
      e09c5b8 [Takuya UESHIN] Add missing arithmetic DSL operations.
      322b1808
    • Fixing typo in als.py · 5c5e7d58
      Evan Sparks authored
      XtY should be Xty.
      
      Author: Evan Sparks <evan.sparks@gmail.com>
      
      Closes #696 from etrain/patch-2 and squashes the following commits:
      
      634cb8d [Evan Sparks] Fixing typo in als.py
      5c5e7d58
    • [SPARK-1745] Move interrupted flag from TaskContext constructor (minor) · c3f8b78c
      Andrew Or authored
      It makes little sense to start a TaskContext that is interrupted. Indeed, I searched for all use cases of it and didn't find a single instance in which `interrupted` is true on construction.
      
      This was inspired by reviewing #640, which adds an additional `@volatile var completed` that is similar. These are not the most urgent changes, but I wanted to push them out before I forget.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #675 from andrewor14/task-context and squashes the following commits:
      
      9575e02 [Andrew Or] Add space
      69455d1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into task-context
      c471490 [Andrew Or] Oops, removed one flag too many. Adding it back.
      85311f8 [Andrew Or] Move interrupted flag from TaskContext constructor
      c3f8b78c
    • SPARK-1565, update examples to be used with spark-submit script. · 44dd57fb
      Prashant Sharma authored
      Commit for initial feedback. Basically, I am curious whether we should prompt the user to provide args, especially when they are mandatory, and whether we can skip the prompt when they are not.
      
      Also, a few other things did not work, like
      `bin/spark-submit examples/target/scala-2.10/spark-examples-1.0.0-SNAPSHOT-hadoop1.0.4.jar --class org.apache.spark.examples.SparkALS --arg 100 500 10 5 2`
      
      Not all the args get passed properly; maybe I have messed something up, and I will try to sort it out.
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #552 from ScrapCodes/SPARK-1565/update-examples and squashes the following commits:
      
      669dd23 [Prashant Sharma] Review comments
      2727e70 [Prashant Sharma] SPARK-1565, update examples to be used with spark-submit script.
      44dd57fb
    • [SQL] Improve SparkSQL Aggregates · 19c8fb02
      Michael Armbrust authored
      * Add native min/max (was using hive before).
      * Handle nulls correctly in Avg and Sum.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #683 from marmbrus/aggFixes and squashes the following commits:
      
      64fe30b [Michael Armbrust] Improve SparkSQL Aggregates * Add native min/max (was using hive before). * Handle nulls correctly in Avg and Sum.
      19c8fb02
  6. May 07, 2014
    • Use numpy directly for matrix multiply. · 6ed7e2cd
      Evan Sparks authored
      Using matrix multiply to compute XtX and XtY yields a 5-20x speedup depending on problem size.
      
      For example - the following takes 19s locally after this change vs. 5m21s before the change. (16x speedup).
      bin/pyspark examples/src/main/python/als.py local[8] 1000 1000 50 10 10
      
      Author: Evan Sparks <evan.sparks@gmail.com>
      
      Closes #687 from etrain/patch-1 and squashes the following commits:
      
      e094dbc [Evan Sparks] Touching only diaganols on update.
      d1ab9b6 [Evan Sparks] Use numpy directly for matrix multiply.
      6ed7e2cd
    • SPARK-1668: Add implicit preference as an option to examples/MovieLensALS · 108c4c16
      Sandeep authored
      Add --implicitPrefs as a command-line option to the example app MovieLensALS under examples/. (A sketch of the underlying MLlib call follows this entry.)
      
      Author: Sandeep <sandeep@techaddict.me>
      
      Closes #597 from techaddict/SPARK-1668 and squashes the following commits:
      
      8b371dc [Sandeep] Second Pass on reviews by mengxr
      eca9d37 [Sandeep] based on mengxr's suggestions
      937e54c [Sandeep] Changes
      5149d40 [Sandeep] Changes based on review
      1dd7657 [Sandeep] use mean()
      42444d7 [Sandeep] Based on Suggestions by mengxr
      e3082fa [Sandeep] SPARK-1668: Add implicit preference as an option to examples/MovieLensALS Add --implicitPrefs as an command-line option to the example app MovieLensALS under examples/
      108c4c16
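      A minimal sketch of the implicit-preference path in the MLlib API that the example wraps; the ratings here are made up:

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.mllib.recommendation.{ALS, Rating}

      object ImplicitAlsExample {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(
            new SparkConf().setAppName("implicit-als-sketch").setMaster("local[2]"))

          // For implicit feedback, the "rating" is a confidence-like count
          // (e.g. number of views), not an explicit 1-5 score.
          val ratings = sc.parallelize(Seq(
            Rating(1, 10, 3.0), Rating(1, 20, 1.0),
            Rating(2, 10, 5.0), Rating(2, 30, 2.0)))

          // Explicit-feedback ALS (rank = 8, 5 iterations, lambda = 0.01):
          val explicitModel = ALS.train(ratings, 8, 5, 0.01)

          // Implicit-feedback ALS (what --implicitPrefs switches on); the last
          // argument, alpha, scales the confidence of observed interactions.
          val implicitModel = ALS.trainImplicit(ratings, 8, 5, 0.01, 1.0)

          println(implicitModel.predict(1, 30))
          sc.stop()
        }
      }
      ```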
    • SPARK-1544 Add support for deep decision trees. · f269b016
      Manish Amde authored
      @etrain and I came up with a PR for arbitrarily deep decision trees at the cost of multiple passes over the data at deep tree levels. (A toy sketch of the grouping scheme follows this entry.)
      
      To summarize:
      1) We take a parameter that indicates the amount of memory users want to reserve for computation on each worker (and 2x that at the driver).
      2) Using that information, we calculate two things - the maximum depth to which we train as usual (which is, implicitly, the maximum number of nodes we want to train in parallel), and the size of the groups we should use in the case where we exceed this depth.
      
      cc: @atalwalkar, @hirakendu, @mengxr
      
      Author: Manish Amde <manish9ue@gmail.com>
      Author: manishamde <manish9ue@gmail.com>
      Author: Evan Sparks <sparks@cs.berkeley.edu>
      
      Closes #475 from manishamde/deep_tree and squashes the following commits:
      
      968ca9d [Manish Amde] merged master
      7fc9545 [Manish Amde] added docs
      ce004a1 [Manish Amde] minor formatting
      b27ad2c [Manish Amde] formatting
      426bb28 [Manish Amde] programming guide blurb
      8053fed [Manish Amde] more formatting
      5eca9e4 [Manish Amde] grammar
      4731cda [Manish Amde] formatting
      5e82202 [Manish Amde] added documentation, fixed off by 1 error in max level calculation
      cbd9f14 [Manish Amde] modified scala.math to math
      dad9652 [Manish Amde] removed unused imports
      e0426ee [Manish Amde] renamed parameter
      718506b [Manish Amde] added unit test
      1517155 [Manish Amde] updated documentation
      9dbdabe [Manish Amde] merge from master
      719d009 [Manish Amde] updating user documentation
      fecf89a [manishamde] Merge pull request #6 from etrain/deep_tree
      0287772 [Evan Sparks] Fixing scalastyle issue.
      2f1e093 [Manish Amde] minor: added doc for maxMemory parameter
      2f6072c [manishamde] Merge pull request #5 from etrain/deep_tree
      abc5a23 [Evan Sparks] Parameterizing max memory.
      50b143a [Manish Amde] adding support for very deep trees
      f269b016
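      A toy sketch of the grouping scheme described above, in pure Scala rather than the MLlib implementation: train level by level, and once the nodes at a level no longer fit the memory budget, split them into groups and make one pass over the data per group.

      ```scala
      object DeepTreeGroupingExample {
        /**
         * Plan how many passes over the data each tree level needs, given a memory
         * budget expressed as the maximum number of nodes whose statistics fit in
         * memory at once (a stand-in for the reserved-memory parameter above).
         */
        def passesPerLevel(maxDepth: Int, maxNodesInMemory: Int): Seq[(Int, Int)] =
          (0 until maxDepth).map { level =>
            val nodesAtLevel = 1 << level // a binary tree has 2^level nodes at `level`
            val groups = math.max(1, math.ceil(nodesAtLevel.toDouble / maxNodesInMemory).toInt)
            (level, groups) // one pass over the data per group
          }

        def main(args: Array[String]): Unit = {
          // With room for 64 nodes' statistics, levels 0-6 need one pass each,
          // level 7 needs 2 passes, level 8 needs 4, and so on.
          passesPerLevel(maxDepth = 10, maxNodesInMemory = 64).foreach {
            case (level, groups) => println(s"level $level: $groups pass(es)")
          }
        }
      }
      ```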
    • Update GradientDescentSuite.scala · 0c19bb16
      baishuo(白硕) authored
      use a faster way to construct an array
      
      Author: baishuo(白硕) <vc_java@hotmail.com>
      
      Closes #588 from baishuo/master and squashes the following commits:
      
      45b95fb [baishuo(白硕)] Update GradientDescentSuite.scala
      c03b61c [baishuo(白硕)] Update GradientDescentSuite.scala
      b666d27 [baishuo(白硕)] Update GradientDescentSuite.scala
      0c19bb16
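      A generic sketch of the kind of change, since the suite's code is not shown here: allocating the array once with `Array.tabulate` (or `Array.fill`) avoids the resizing and final copy that growing a buffer element by element incurs.

      ```scala
      import scala.collection.mutable.ArrayBuffer

      object ArrayConstructionExample {
        val n = 100000

        // Slower pattern: grow a buffer element by element, then convert.
        def viaBuffer(): Array[Double] = {
          val buf = new ArrayBuffer[Double]()
          var i = 0
          while (i < n) { buf += i * 0.5; i += 1 }
          buf.toArray
        }

        // Faster pattern: allocate once and fill by index.
        def viaTabulate(): Array[Double] = Array.tabulate(n)(i => i * 0.5)

        def main(args: Array[String]): Unit = {
          assert(viaBuffer().sameElements(viaTabulate()))
        }
      }
      ```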
    • [SPARK-1743][MLLIB] add loadLibSVMFile and saveAsLibSVMFile to pyspark · 3188553f
      Xiangrui Meng authored
      Make loading/saving labeled data easier for pyspark users.
      
      Also changed type check in `SparseVector` to allow numpy integers.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #672 from mengxr/pyspark-mllib-util and squashes the following commits:
      
      2943fa7 [Xiangrui Meng] format docs
      d61668d [Xiangrui Meng] add loadLibSVMFile and saveAsLibSVMFile to pyspark
      3188553f