  1. Mar 20, 2014
    • Principal Component Analysis · 66a03e5f
      Reza Zadeh authored
      # Principal Component Analysis
      
      Computes the top k principal component coefficients for the m-by-n data matrix X. Rows of X correspond to observations and columns correspond to variables. The coefficient matrix is n-by-k. Each column of the returned coefficient matrix contains the coefficients for one principal component, and the columns are in descending order of component variance. This function centers the data and uses the singular value decomposition (SVD) algorithm.
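
      As a rough local illustration of the centering + SVD approach, here is a minimal sketch using Breeze for the linear algebra (the PR itself works on RDD-backed tall-and-skinny matrices with JBLAS, so this is a conceptual sketch, not the new API):

      ```scala
      import breeze.linalg.{svd, DenseMatrix, DenseVector, *}
      import breeze.stats.mean

      // Top-k PCA coefficients of an m-by-n data matrix via centering + SVD.
      def pcaCoefficients(x: DenseMatrix[Double], k: Int): DenseMatrix[Double] = {
        val colMeans: DenseVector[Double] = mean(x(::, *)).t // mean of each variable
        val centered = x(*, ::) - colMeans                   // center the data
        val svd.SVD(_, _, vt) = svd(centered)                // X_c = U * diag(s) * Vt
        // Rows of vt are the right singular vectors, already ordered by
        // decreasing singular value, so the first k (transposed) give the
        // n-by-k coefficient matrix described above.
        vt(0 until k, ::).t
      }
      ```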
      
      ## Testing
      Tests included:
       * All principal components
       * Only top k principal components
       * Dense SVD tests
       * Dense/sparse matrix tests
      
      The results are tested against MATLAB's pca: http://www.mathworks.com/help/stats/pca.html
      
      ## Documentation
      Added to mllib-guide.md
      
      ## Example Usage
      Added to examples directory under SparkPCA.scala
      
      Author: Reza Zadeh <rizlar@gmail.com>
      
      Closes #88 from rezazadeh/sparkpca and squashes the following commits:
      
      e298700 [Reza Zadeh] reformat using IDE
      3f23271 [Reza Zadeh] documentation and cleanup
      b025ab2 [Reza Zadeh] documentation
      e2667d4 [Reza Zadeh] assertMatrixApproximatelyEquals
      3787bb4 [Reza Zadeh] stylin
      c6ecc1f [Reza Zadeh] docs
      aa2bbcb [Reza Zadeh] rename sparseToTallSkinnyDense
      56975b0 [Reza Zadeh] docs
      2df9bde [Reza Zadeh] docs update
      8fb0015 [Reza Zadeh] rcond documentation
      dbf7797 [Reza Zadeh] correct argument number
      a9f1f62 [Reza Zadeh] documentation
      4ce6caa [Reza Zadeh] style changes
      9a56a02 [Reza Zadeh] use rcond relative to larget svalue
      120f796 [Reza Zadeh] housekeeping
      156ff78 [Reza Zadeh] string comprehension
      2e1cf43 [Reza Zadeh] rename rcond
      ea223a6 [Reza Zadeh] many style changes
      f4002d7 [Reza Zadeh] more docs
      bd53c7a [Reza Zadeh] proper accumulator
      a8b5ecf [Reza Zadeh] Don't use for loops
      0dc7980 [Reza Zadeh] filter zeros in sparse
      6115610 [Reza Zadeh] More documentation
      36d51e8 [Reza Zadeh] use JBLAS for UVS^-1 computation
      bc4599f [Reza Zadeh] configurable rcond
      86f7515 [Reza Zadeh] compute per parition, use while
      09726b3 [Reza Zadeh] more style changes
      4195e69 [Reza Zadeh] private, accumulator
      17002be [Reza Zadeh] style changes
      4ba7471 [Reza Zadeh] style change
      f4982e6 [Reza Zadeh] Use dense matrix in example
      2828d28 [Reza Zadeh] optimizations: normalize once, use inplace ops
      72c9fa1 [Reza Zadeh] rename DenseMatrix to TallSkinnyDenseMatrix, lean
      f807be9 [Reza Zadeh] fix typo
      2d7ccde [Reza Zadeh] Array interface for dense svd and pca
      cd290fa [Reza Zadeh] provide RDD[Array[Double]] support
      398d123 [Reza Zadeh] style change
      55abbfa [Reza Zadeh] docs fix
      ef29644 [Reza Zadeh] bad chnage undo
      472566e [Reza Zadeh] all files from old pr
      555168f [Reza Zadeh] initial files
  2. Mar 19, 2014
    • Revert "SPARK-1099:Spark's local mode should probably respect spark.cores.max by default" · ffe272d9
      Aaron Davidson authored
      This reverts commit 16789317. Jenkins was not run for this PR.
    • SPARK-1099:Spark's local mode should probably respect spark.cores.max by default · 16789317
      qqsun8819 authored
      This is for JIRA: https://spark-project.atlassian.net/browse/SPARK-1099
      Here is what I do in this patch (also commented in the JIRA) @aarondav
      
      This is really a behavioral change, so I do this with great caution and welcome any review advice:
      
      1 I change how the "MASTER=local" pattern creates the LocalBackend. In the past we passed 1 core to it; now it uses a default core count.
      The reason is that when someone uses spark-shell to start local mode, the REPL uses this "MASTER=local" pattern by default.
      So if one also specifies cores on the spark-shell command line, it all goes through here, and hard-coding 1 core no longer matches the behavior we want.
      2 In the LocalBackend, the "totalCores" value is fetched by a different rule. (In the past it just took the user-passed cores: 1 in the "MASTER=local" pattern, 2 in the "MASTER=local[2]" pattern.) The rules, also sketched in code after this list:
      a. The second argument of LocalBackend's constructor, indicating cores, has a default value of Int.MaxValue; if the user didn't pass a value, it stays at Int.MaxValue.
      b. In getMaxCores, we first compare that value to Int.MaxValue. If it is not equal, the user has passed their desired value, so we just use it.
      c. If b is not satisfied, we get cores from spark.cores.max and the real logical core count from Runtime. If the cores specified by spark.cores.max exceed the logical cores, we use the logical cores; otherwise we use spark.cores.max.
      3 In SparkContextSchedulerCreationSuite's test("local") case, the assertion is modified from 1 to the logical core count, because the "MASTER=local" pattern now uses default values.
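
      A minimal sketch of the resolution rule in point 2 (the names are illustrative, not the exact LocalBackend code):

      ```scala
      // An explicit user value wins; otherwise spark.cores.max, capped at
      // the machine's logical core count.
      def getMaxCores(passedCores: Int = Int.MaxValue,
                      coresMaxConf: Option[Int]): Int = {
        if (passedCores != Int.MaxValue) {
          passedCores // the user passed a value, e.g. MASTER=local[2]
        } else {
          val logicalCores = Runtime.getRuntime.availableProcessors()
          coresMaxConf.fold(logicalCores)(math.min(_, logicalCores))
        }
      }
      ```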
      
      Author: qqsun8819 <jin.oyj@alibaba-inc.com>
      
      Closes #110 from qqsun8819/local-cores and squashes the following commits:
      
      731aefa [qqsun8819] 1 LocalBackend not change 2 In SparkContext do some process to the cores and pass it to original LocalBackend constructor
      78b9c60 [qqsun8819] 1 SparkContext MASTER=local pattern use default cores instead of 1 to construct LocalBackEnd , for use of spark-shell and cores specified in cmd line 2 some test case change from local to local[1]. 3 SparkContextSchedulerCreationSuite test spark.cores.max config in local pattern
      6ae1ee8 [qqsun8819] Add a static function in LocalBackEnd to let it use spark.cores.max specified cores when no cores are passed to it
    • Added doctest for map function in rdd.py · 67fa71cb
      Jyotiska NK authored
      Doctest added for map in rdd.py
      
      Author: Jyotiska NK <jyotiska123@gmail.com>
      
      Closes #177 from jyotiska/pyspark_rdd_map_doctest and squashes the following commits:
      
      a38527f [Jyotiska NK] Added doctest for map function in rdd.py
    • [SPARK-1132] Persisting Web UI through refactoring the SparkListener interface · 79d07d66
      Andrew Or authored
      The fleeting nature of the Spark Web UI has long been a problem reported by many users: The existing Web UI disappears as soon as the associated application terminates. This is because SparkUI is tightly coupled with SparkContext, and cannot be instantiated independently from it. To solve this, some state must be saved to persistent storage while the application is still running.
      
      The approach taken by this PR involves persisting the UI state through SparkListenerEvents. This requires a major refactor of the SparkListener interface because existing events (1) maintain deep references, making de/serialization difficult, and (2) do not encode all the information displayed on the UI. In this design, each existing listener for the UI (e.g. ExecutorsListener) maintains state that can be fully constructed from SparkListenerEvents. This state is then supplied to the parent UI (e.g. ExecutorsUI), which renders the associated page(s) on demand.
      
      This PR introduces two important classes: the **EventLoggingListener** and the **ReplayListenerBus**. In a live application, SparkUI registers an EventLoggingListener with the SparkContext in addition to the existing listeners. Over the course of the application, this listener serializes and logs all events to persistent storage. Then, after the application has finished, the SparkUI can be revived by replaying all the logged events to the existing UI listeners through the ReplayListenerBus.
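
      Reduced to a hedged sketch (illustrative types only; the real interface is SparkListener, with many event classes and JSON serialization):

      ```scala
      import java.io.PrintWriter

      sealed trait Event
      case class TaskEnd(taskId: Long, durationMs: Long) extends Event

      trait Listener { def onEvent(e: Event): Unit }

      // Live application: write every event to persistent storage as it arrives.
      class EventLogger(out: PrintWriter) extends Listener {
        def onEvent(e: Event): Unit = e match {
          case TaskEnd(id, ms) => out.println(s"TaskEnd,$id,$ms"); out.flush()
        }
      }

      // After the application finishes: parse the log and drive the same UI
      // listeners again, rebuilding their state event by event.
      class ReplayBus(listeners: Seq[Listener]) {
        def replay(lines: Iterator[String]): Unit = lines.foreach { line =>
          val Array("TaskEnd", id, ms) = line.split(",")
          listeners.foreach(_.onEvent(TaskEnd(id.toLong, ms.toLong)))
        }
      }
      ```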
      
      This feature is currently integrated with the Master Web UI, which optionally rebuilds a SparkUI from event logs as soon as the corresponding application finishes.
      
      More details can be found in the commit messages, comments within the code, and the [design doc](https://spark-project.atlassian.net/secure/attachment/12900/PersistingSparkWebUI.pdf). Comments and feedback are most welcome.
      
      Author: Andrew Or <andrewor14@gmail.com>
      Author: andrewor14 <andrewor14@gmail.com>
      
      Closes #42 from andrewor14/master and squashes the following commits:
      
      e5f14fa [Andrew Or] Merge github.com:apache/spark
      a1c5cd9 [Andrew Or] Merge github.com:apache/spark
      b8ba817 [Andrew Or] Remove UI from map when removing application in Master
      83af656 [Andrew Or] Scraps and pieces (no functionality change)
      222adcd [Andrew Or] Merge github.com:apache/spark
      124429f [Andrew Or] Clarify LiveListenerBus behavior + Add tests for new behavior
      f80bd31 [Andrew Or] Simplify static handler and BlockManager status update logic
      9e14f97 [Andrew Or] Moved around functionality + renamed classes per Patrick
      6740e49 [Andrew Or] Fix comment nits
      650eb12 [Andrew Or] Add unit tests + Fix bugs found through tests
      45fd84c [Andrew Or] Remove now deprecated test
      c5c2c8f [Andrew Or] Remove list of (TaskInfo, TaskMetrics) from StageInfo
      3456090 [Andrew Or] Address Patrick's comments
      bf80e3d [Andrew Or] Imports, comments, and code formatting, once again (minor)
      ac69ec8 [Andrew Or] Fix test fail
      d801d11 [Andrew Or] Merge github.com:apache/spark (major)
      dc93915 [Andrew Or] Imports, comments, and code formatting (minor)
      77ba283 [Andrew Or] Address Kay's and Patrick's comments
      b6eaea7 [Andrew Or] Treating SparkUI as a handler of MasterUI
      d59da5f [Andrew Or] Avoid logging all the blocks on each executor
      d6e3b4a [Andrew Or] Merge github.com:apache/spark
      ca258a4 [Andrew Or] Master UI - add support for reading compressed event logs
      176e68e [Andrew Or] Fix deprecated message for JavaSparkContext (minor)
      4f69c4a [Andrew Or] Master UI - Rebuild SparkUI on application finish
      291b2be [Andrew Or] Correct directory in log message "INFO: Logging events to <dir>"
      1ba3407 [Andrew Or] Add a few configurable options to event logging
      e375431 [Andrew Or] Add new constructors for SparkUI
      18b256d [Andrew Or] Refactor out event logging and replaying logic from UI
      bb4c503 [Andrew Or] Use a more mnemonic path for logging
      aef411c [Andrew Or] Fix bug: storage status was not reflected on UI in the local case
      03eda0b [Andrew Or] Fix HDFS flush behavior
      36b3e5d [Andrew Or] Add HDFS support for event logging
      cceff2b [andrewor14] Fix 100 char format fail
      2fee310 [Andrew Or] Address Patrick's comments
      2981d61 [Andrew Or] Move SparkListenerBus out of DAGScheduler + Clean up
      5d2cec1 [Andrew Or] JobLogger: ID -> Id
      0503e4b [Andrew Or] Fix PySpark tests + remove sc.clearFiles/clearJars
      4d2fb0c [Andrew Or] Fix format fail
      faa113e [Andrew Or] General clean up
      d47585f [Andrew Or] Clean up FileLogger
      472fd8a [Andrew Or] Fix a couple of tests
      996d7a2 [Andrew Or] Reflect RDD unpersist on UI
      7b2f811 [Andrew Or] Guard against TaskMetrics NPE + Fix tests
      d1f4285 [Andrew Or] Migrate from lift-json to json4s-jackson
      28019ca [Andrew Or] Merge github.com:apache/spark
      bbe3501 [Andrew Or] Embed storage status and RDD info in Task events
      6631c02 [Andrew Or] More formatting changes, this time mainly for Json DSL
      70e7e7a [Andrew Or] Formatting changes
      e9e1c6d [Andrew Or] Move all JSON de/serialization logic to JsonProtocol
      d646df6 [Andrew Or] Completely decouple SparkUI from SparkContext
      6814da0 [Andrew Or] Explicitly register each UI listener rather than through some magic
      64d2ce1 [Andrew Or] Fix BlockManagerUI bug by introducing new event
      4273013 [Andrew Or] Add a gateway SparkListener to simplify event logging
      904c729 [Andrew Or] Fix another major bug
      5ac906d [Andrew Or] Mostly naming, formatting, and code style changes
      3fd584e [Andrew Or] Fix two major bugs
      f3fc13b [Andrew Or] General refactor
      4dfcd22 [Andrew Or] Merge git://git.apache.org/incubator-spark into persist-ui
      b3976b0 [Andrew Or] Add functionality of reconstructing a persisted UI from SparkContext
      8add36b [Andrew Or] JobProgressUI: Add JSON functionality
      d859efc [Andrew Or] BlockManagerUI: Add JSON functionality
      c4cd480 [Andrew Or] Also deserialize new events
      8a2ebe6 [Andrew Or] Fix bugs for EnvironmentUI and ExecutorsUI
      de8a1cd [Andrew Or] Serialize events both to and from JSON (rather than just to)
      bf0b2e9 [Andrew Or] ExecutorUI: Serialize events rather than arbitary executor information
      bb222b9 [Andrew Or] ExecutorUI: render completely from JSON
      dcbd312 [Andrew Or] Add JSON Serializability for all SparkListenerEvent's
      10ed49d [Andrew Or] Merge github.com:apache/incubator-spark into persist-ui
      8e09306 [Andrew Or] Use JSON for ExecutorsUI
      e3ae35f [Andrew Or] Merge github.com:apache/incubator-spark
      3ddeb7e [Andrew Or] Also privatize fields
      090544a [Andrew Or] Privatize methods
      13920c9 [Andrew Or] Update docs
      bd5a1d7 [Andrew Or] Typo: phyiscal -> physical
      287ef44 [Andrew Or] Avoid reading the entire batch into memory; also simplify streaming logic
      3df7005 [Andrew Or] Merge branch 'master' of github.com:andrewor14/incubator-spark
      a531d2e [Andrew Or] Relax assumptions on compressors and serializers when batching
      164489d [Andrew Or] Relax assumptions on compressors and serializers when batching
    • Bugfixes/improvements to scheduler · ab747d39
      Mridul Muralidharan authored
      This moves PR #517 of apache-incubator-spark to apache-spark.
      
      Author: Mridul Muralidharan <mridul@gmail.com>
      
      Closes #159 from mridulm/master and squashes the following commits:
      
      5ff59c2 [Mridul Muralidharan] Change property in suite also
      167fad8 [Mridul Muralidharan] Address review comments
      9bda70e [Mridul Muralidharan] Address review comments, akwats add to failedExecutors
      270d841 [Mridul Muralidharan] Address review comments
      fa5d9f1 [Mridul Muralidharan] Bugfixes/improvements to scheduler : PR #517
    • SPARK-1203 fix saving to hdfs from yarn · 6112270c
      Thomas Graves authored
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #173 from tgravescs/SPARK-1203 and squashes the following commits:
      
      4fd5ded [Thomas Graves] adding import
      964e3f7 [Thomas Graves] SPARK-1203 fix saving to hdfs from yarn
    • bugfix: Wrong "Duration" in "Active Stages" in stages page · d55ec86d
      shiyun.wxm authored
      If a stage that has completed once loses part of its data, it will be resubmitted. When that happens, it appears that stage.completionTime > stage.submissionTime.
      
      Author: shiyun.wxm <shiyun.wxm@taobao.com>
      
      Closes #170 from BlackNiuza/duration_problem and squashes the following commits:
      
      a86d261 [shiyun.wxm] tow space indent
      c0d7b24 [shiyun.wxm] change the style
      3b072e1 [shiyun.wxm] fix scala style
      f20701e [shiyun.wxm] bugfix: "Duration" in "Active Stages" in stages page
    • Bundle tachyon: SPARK-1269 · a18ea00f
      Nick Lanham authored
      This should all work as expected with the current version of the tachyon tarball (0.4.1)
      
      Author: Nick Lanham <nick@afternight.org>
      
      Closes #137 from nicklan/bundle-tachyon and squashes the following commits:
      
      2eee15b [Nick Lanham] Put back in exec, start tachyon first
      738ba23 [Nick Lanham] Move tachyon out of sbin
      f2f9bc6 [Nick Lanham] More checks for tachyon script
      111e8e1 [Nick Lanham] Only try tachyon operations if tachyon script exists
      0561574 [Nick Lanham] Copy over web resources so web interface can run
      4dc9809 [Nick Lanham] Update to tachyon 0.4.1
      0a1a20c [Nick Lanham] Add scripts using tachyon tarball
  3. Mar 18, 2014
    • Fix SPARK-1256: Master web UI and Worker web UI returns a 404 error · cc2655a2
      witgo authored
      Author: witgo <witgo@qq.com>
      
      Closes #150 from witgo/SPARK-1256 and squashes the following commits:
      
      08044a2 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1256
      c99b030 [witgo] Fix SPARK-1256
    • [SPARK-1266] persist factors in implicit ALS · f9d8a83c
      Xiangrui Meng authored
      In implicit ALS computation, the user or product factor is used twice in each iteration. Caching can certainly help accelerate the computation. I saw the running time decrease by ~70% for implicit ALS on the MovieLens data.
      
      I also made the following changes:
      
      1. Change `YtYb` type from `Broadcast[Option[DoubleMatrix]]` to `Option[Broadcast[DoubleMatrix]]`, so we don't need to broadcast None in explicit computation.
      
      2. Mark the methods `computeYtY`, `unblockFactors`, `updateBlock`, and `updateFeatures` private. Users do not need those methods.
      
      3. Materialize the final matrix factors before returning the model, which allows us to clean up the other cached RDDs first. I do not have a better solution here, so I use `RDD.count()` (see the sketch below).
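
      Point 3 is the familiar persist/count/unpersist pattern; a self-contained sketch with stand-in computations (not the ALS code):

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}

      object PersistSketch extends App {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[*]").setAppName("persist-sketch"))

        val intermediate = sc.parallelize(1 to 100000).map(_ * 2).persist()
        val factors = intermediate.map(_ + 1).persist() // stand-in for the factors

        factors.count()          // force materialization of the final factors...
        intermediate.unpersist() // ...so the intermediate RDD can be released

        sc.stop()
      }
      ```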
      
      JIRA: https://spark-project.atlassian.net/browse/SPARK-1266
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #165 from mengxr/als and squashes the following commits:
      
      c9676a6 [Xiangrui Meng] add a comment about the last products.persist
      d3a88aa [Xiangrui Meng] change implicitPrefs match to if ... else ...
      63862d6 [Xiangrui Meng] persist factors in implicit ALS
    • [SPARK-1260]: faster construction of features with intercept · e108b9ab
      Xiangrui Meng authored
      The current implementation uses `Array(1.0, features: _*)` to construct a new array with intercept. This is not efficient for big arrays because `Array.apply` uses a for loop that iterates over the arguments. `Array.+:` is a better choice here.
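
      The difference in a couple of lines (standard Scala, just illustrating the claim):

      ```scala
      val features = Array(0.5, 1.2, -0.3)

      // Array.apply walks its varargs one element at a time:
      val slow = Array(1.0, features: _*)

      // +: allocates the destination once and copies the source in bulk:
      val fast = 1.0 +: features
      ```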
      
      Also, I don't see a reason to set the initial weights to ones, so I set them to zeros.
      
      JIRA: https://spark-project.atlassian.net/browse/SPARK-1260
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #161 from mengxr/sgd and squashes the following commits:
      
      b5cfc53 [Xiangrui Meng] set default weights to zeros
      a1439c2 [Xiangrui Meng] faster construction of features with intercept
    • Update copyright year in NOTICE to 2014 · 79e547fe
      Matei Zaharia authored
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #174 from mateiz/update-notice and squashes the following commits:
      
      47fc1a5 [Matei Zaharia] Update copyright year in NOTICE to 2014
    • SPARK-1102: Create a saveAsNewAPIHadoopDataset method · 2fa26ec0
      CodingCat authored
      https://spark-project.atlassian.net/browse/SPARK-1102
      
      Create a saveAsNewAPIHadoopDataset method
      
      By @mateiz: "Right now RDDs can only be saved as files using the new Hadoop API, not as "datasets" with no filename and just a JobConf. See http://codeforhire.com/2014/02/18/using-spark-with-mongodb/ for an example of how you have to give a bogus filename. For the old Hadoop API, we have saveAsHadoopDataset."
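
      A hedged usage sketch of the new method (the output path and formats here are chosen only for illustration):

      ```scala
      import org.apache.hadoop.fs.Path
      import org.apache.hadoop.io.{NullWritable, Text}
      import org.apache.hadoop.mapreduce.Job
      import org.apache.hadoop.mapreduce.lib.output.{FileOutputFormat, TextOutputFormat}
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.SparkContext._ // pair RDD functions on older Spark

      object SaveDatasetSketch extends App {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[*]").setAppName("save-sketch"))
        val pairs = sc.parallelize(Seq("a", "b"))
          .map(s => (NullWritable.get(), new Text(s)))

        // All output configuration lives on the Job/Configuration; no bogus
        // filename argument as with saveAsNewAPIHadoopFile.
        val job = Job.getInstance(sc.hadoopConfiguration)
        job.setOutputKeyClass(classOf[NullWritable])
        job.setOutputValueClass(classOf[Text])
        job.setOutputFormatClass(classOf[TextOutputFormat[NullWritable, Text]])
        FileOutputFormat.setOutputPath(job, new Path("/tmp/save-sketch-out"))

        pairs.saveAsNewAPIHadoopDataset(job.getConfiguration)
        sc.stop()
      }
      ```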
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #12 from CodingCat/SPARK-1102 and squashes the following commits:
      
      6ba0c83 [CodingCat] add test cases for saveAsHadoopDataSet (new&old API)
      a8d11ba [CodingCat] style fix.........
      95a6929 [CodingCat] code clean
      7643c88 [CodingCat] change the parameter type back to Configuration
      a8583ee [CodingCat] Create a saveAsNewAPIHadoopDataset method
    • Revert "SPARK-1236 - Upgrade Jetty to 9.1.3.v20140225." · e7423d40
      Patrick Wendell authored
      This reverts commit ca4bf8c5.
      
      Jetty 9 requires JDK7, which is probably not a dependency we want to bump right now. Before Spark 1.0 we should consider upgrading to Jetty 8. However, in the meantime, to ease some pain, let's revert this. Sorry for not catching this during the initial review. cc/ @rxin
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #167 from pwendell/jetty-revert and squashes the following commits:
      
      811b1c5 [Patrick Wendell] Revert "SPARK-1236 - Upgrade Jetty to 9.1.3.v20140225."
    • Spark 1246 add min max to stat counter · e3681f26
      Dan McClary authored
      Here's the addition of min and max to statscounter.py and min and max methods to rdd.py.
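
      The core of the change, as a hedged sketch in the style of Spark's StatCounter (an illustrative class, not the patched code):

      ```scala
      // Running min/max merge cleanly per element and per partition, so the
      // whole computation stays one pass. Note the initial values: min starts
      // at +infinity and max at -infinity (commit 82cde0e below fixes a flip).
      class MinMaxCounter(
          var count: Long = 0L,
          var min: Double = Double.PositiveInfinity,
          var max: Double = Double.NegativeInfinity) {

        def merge(value: Double): this.type = {
          count += 1
          if (value < min) min = value
          if (value > max) max = value
          this
        }

        def merge(other: MinMaxCounter): this.type = { // per-partition results
          count += other.count
          min = math.min(min, other.min)
          max = math.max(max, other.max)
          this
        }

        override def toString = s"(count: $count, min: $min, max: $max)"
      }
      ```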
      
      Author: Dan McClary <dan.mcclary@gmail.com>
      
      Closes #144 from dwmclary/SPARK-1246-add-min-max-to-stat-counter and squashes the following commits:
      
      fd3fd4b [Dan McClary] fixed  error, updated test
      82cde0e [Dan McClary] flipped incorrectly assigned inf values in StatCounter
      5d96799 [Dan McClary] added max and min to StatCounter repr for pyspark
      21dd366 [Dan McClary] added max and min to StatCounter output, updated doc
      1a97558 [Dan McClary] added max and min to StatCounter output, updated doc
      a5c13b0 [Dan McClary] Added min and max to Scala and Java RDD, added min and max to StatCounter
      ed67136 [Dan McClary] broke min/max out into separate transaction, added to rdd.py
      1e7056d [Dan McClary] added underscore to getBucket
      37a7dea [Dan McClary] cleaned up boundaries for histogram -- uses real min/max when buckets are derived
      29981f2 [Dan McClary] fixed indentation on doctest comment
      eaf89d9 [Dan McClary] added correct doctest for histogram
      4916016 [Dan McClary] added histogram method, added max and min to statscounter
  4. Mar 17, 2014
    • [Spark-1261] add instructions for running python examples to doc overview page · 087eedca
      Diana Carroll authored
      Author: Diana Carroll <dcarroll@cloudera.com>
      
      Closes #162 from dianacarroll/SPARK-1261 and squashes the following commits:
      
      14ac602 [Diana Carroll] typo in python example text
      5121e3e [Diana Carroll] Add explanation of how to run Python examples to main doc overview page
    • SPARK-1244: Throw exception if map output status exceeds frame size · 796977ac
      Patrick Wendell authored
      This is a very small change on top of @andrewor14's patch in #147.
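
      The gist of the guard, as a hedged sketch (the exception type and message are illustrative, not Spark's exact code):

      ```scala
      def checkFrameSize(serialized: Array[Byte], maxFrameSizeBytes: Int): Unit = {
        if (serialized.length > maxFrameSizeBytes) {
          // Fail loudly instead of letting Akka silently drop the message.
          throw new IllegalStateException(
            s"Map output statuses are ${serialized.length} bytes, exceeding " +
            s"the frame size of $maxFrameSizeBytes bytes; " +
            "consider raising spark.akka.frameSize")
        }
      }
      ```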
      
      Author: Patrick Wendell <pwendell@gmail.com>
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #152 from pwendell/akka-frame and squashes the following commits:
      
      e5fb3ff [Patrick Wendell] Reversing test order
      393af4c [Patrick Wendell] Small improvement suggested by Andrew Or
      8045103 [Patrick Wendell] Breaking out into two tests
      2b4e085 [Patrick Wendell] Consolidate Executor use of akka frame size
      c9b6109 [Andrew Or] Simplify test + make access to akka frame size more modular
      281d7c9 [Andrew Or] Throw exception on spark.akka.frameSize exceeded + Unit tests
    • SPARK-1240: handle the case of empty RDD when takeSample · dc965463
      CodingCat authored
      https://spark-project.atlassian.net/browse/SPARK-1240
      
      It seems that the current implementation does not handle the empty-RDD case when takeSample is run.
      
      In this patch, before calling sample() inside the takeSample API, I add a check for this case and return an empty Array for an empty RDD; in sample(), I also add a check for invalid fraction values (see the sketch below).
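
      A simplified sketch of both checks (no seed handling or resampling loop, unlike the real takeSample):

      ```scala
      import scala.reflect.ClassTag
      import org.apache.spark.rdd.RDD

      def takeSampleSketch[T: ClassTag](rdd: RDD[T], num: Int): Array[T] = {
        require(num >= 0, "Negative number of elements requested")
        val total = rdd.count()
        if (total == 0) {
          Array.empty[T] // empty RDD: return an empty Array instead of failing
        } else {
          val fraction = math.min(1.0, num.toDouble / total)
          rdd.sample(withReplacement = false, fraction).take(num)
        }
      }
      ```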
      
      I also added several lines to the test cases to cover this.
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #135 from CodingCat/SPARK-1240 and squashes the following commits:
      
      fef57d4 [CodingCat] fix the same problem in PySpark
      36db06b [CodingCat] create new test cases for takeSample from an empty red
      810948d [CodingCat] further fix
      a40e8fb [CodingCat] replace if with require
      ad483fd [CodingCat] handle the case with empty RDD when take sample
  5. Mar 16, 2014
    • SPARK-1255: Allow user to pass Serializer object instead of class name for shuffle. · f5486e9f
      Reynold Xin authored
      This is more general than simply passing a string name and leaves more room for performance optimizations.
      
      Note that this is technically an API breaking change in the following two ways:
      1. The shuffle serializer specification in ShuffleDependency now requires an object instead of a String (of the class name), but I suspect nobody else in this world has used this API other than me in GraphX and Shark.
      2. Serializers in Spark are from now on required to be serializable. (Both points are sketched below.)
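
      A self-contained sketch of the design point, with stand-in classes rather than Spark's actual Serializer or ShuffleDependency:

      ```scala
      // A pre-configured object can carry state that a class-name string,
      // instantiated reflectively through a no-arg constructor, cannot.
      trait Serializer extends Serializable { def describe: String } // point 2

      final class KryoLikeSerializer(bufferKb: Int) extends Serializer {
        def describe = s"kryo-like(buffer = ${bufferKb}KB)"
      }

      // After this change: the caller constructs and tunes the serializer.
      def shuffleWith(serializer: Serializer): Unit =
        println(s"shuffling with ${serializer.describe}")

      // Before: the framework could only instantiate it by name.
      def shuffleWithClassName(className: String): Unit = {
        val s = Class.forName(className).getDeclaredConstructor()
          .newInstance().asInstanceOf[Serializer]
        println(s"shuffling with ${s.describe}")
      }

      object SerializerDemo extends App {
        shuffleWith(new KryoLikeSerializer(bufferKb = 64)) // impossible by name
      }
      ```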
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #149 from rxin/serializer and squashes the following commits:
      
      5acaccd [Reynold Xin] Properly call serializer's constructors.
      2a8d75a [Reynold Xin] Added more documentation for the serializer option in ShuffleDependency.
      7420185 [Reynold Xin] Allow user to pass Serializer object instead of class name for shuffle.
  6. Mar 15, 2014
    • SPARK-1254. Consolidate, order, and harmonize repository declarations in Maven/SBT builds · 97e4459e
      Sean Owen authored
      This suggestion addresses a few minor suboptimalities with how repositories are handled.
      
      1) Use HTTPS consistently to access repos, instead of HTTP
      
      2) Consolidate repository declarations in the parent POM file, in the case of the Maven build, so that their ordering can be controlled to put the fully optional Cloudera repo at the end, after required repos. (This was prompted by the untimely failure of the Cloudera repo this week, which made the Spark build fail. #2 would have prevented that.)
      
      3) Update SBT build to match Maven build in this regard
      
      4) Update SBT build to not refer to Sonatype snapshot repos. This wasn't in Maven, and a build generally would not refer to external snapshots, but I'm not 100% sure on this one.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #145 from srowen/SPARK-1254 and squashes the following commits:
      
      42f9bfc [Sean Owen] Use HTTPS for repos; consolidate repos in parent in order to put optional Cloudera repo last; harmonize SBT build repos with Maven; remove snapshot repos from SBT build which weren't in Maven
  7. Mar 14, 2014
  8. Mar 13, 2014
    • [bugfix] wrong client arg, should use executor-cores · 181b130a
      Tianshuo Deng authored
      The client arg is wrong; it should be executor-cores. This causes executors to fail to start when executor-cores is specified.
      
      Author: Tianshuo Deng <tdeng@twitter.com>
      
      Closes #138 from tsdeng/bugfix_wrong_client_args and squashes the following commits:
      
      304826d [Tianshuo Deng] wrong client arg, should use executor-cores
    • SPARK-1236 - Upgrade Jetty to 9.1.3.v20140225. · ca4bf8c5
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #113 from rxin/jetty9 and squashes the following commits:
      
      867a2ce [Reynold Xin] Updated Jetty version to 9.1.3.v20140225 in Maven build file.
      d7c97ca [Reynold Xin] Return the correctly bound port.
      d14706f [Reynold Xin] Upgrade Jetty to 9.1.3.v20140225.
    • SPARK-1183. Don't use "worker" to mean executor · 69837321
      Sandy Ryza authored
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #120 from sryza/sandy-spark-1183 and squashes the following commits:
      
      5066a4a [Sandy Ryza] Remove "worker" in a couple comments
      0bd1e46 [Sandy Ryza] Remove --am-class from usage
      bfc8fe0 [Sandy Ryza] Remove am-class from doc and fix yarn-alpha
      607539f [Sandy Ryza] Address review comments
      74d087a [Sandy Ryza] SPARK-1183. Don't use "worker" to mean executor
    • [SPARK-1237, 1238] Improve the computation of YtY for implicit ALS · e4e8d8f3
      Xiangrui Meng authored
      Computing YtY can be implemented using BLAS's DSPR operations instead of generating y_i y_i^T and then combining them. The latter generates many k-by-k matrices. On the movielens data, this change improves the performance by 10-20%. The algorithm remains the same, verified by computing RMSE on the movielens data.
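
      For reference, DSPR is the BLAS rank-1 symmetric update A += alpha * x * x^T on packed triangular storage; here is a plain-Scala sketch of the same update (the PR calls the BLAS routine rather than a hand-written loop):

      ```scala
      // Accumulate x * x^T into the upper triangle of a k-by-k symmetric
      // matrix stored column-major in packed form: entry (i, j) with i <= j
      // lives at index i + j * (j + 1) / 2. Only k*(k+1)/2 doubles of state,
      // and no fresh k-by-k outer-product matrix per rating vector.
      def dsprUpper(k: Int, x: Array[Double], packed: Array[Double]): Unit = {
        var j = 0
        while (j < k) {
          val base = j * (j + 1) / 2
          val xj = x(j)
          var i = 0
          while (i <= j) {
            packed(base + i) += x(i) * xj
            i += 1
          }
          j += 1
        }
      }
      ```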
      
      To compare the results, I also added an option to set a random seed in ALS.
      
      JIRA:
      1. https://spark-project.atlassian.net/browse/SPARK-1237
      2. https://spark-project.atlassian.net/browse/SPARK-1238
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #131 from mengxr/als and squashes the following commits:
      
      ed00432 [Xiangrui Meng] minor changes
      d984623 [Xiangrui Meng] minor changes
      2fc1641 [Xiangrui Meng] remove commented code
      4c7cde2 [Xiangrui Meng] allow specifying a random seed in ALS
      200bef0 [Xiangrui Meng] optimize computeYtY and updateBlock
    • SPARK-1019: pyspark RDD take() throws an NPE · 4ea23db0
      Patrick Wendell authored
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #112 from pwendell/pyspark-take and squashes the following commits:
      
      daae80e [Patrick Wendell] SPARK-1019: pyspark RDD take() throws an NPE
  9. Mar 12, 2014
    • hot fix for PR105 - change to Java annotation · 6bd2eaa4
      CodingCat authored
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #133 from CodingCat/SPARK-1160-2 and squashes the following commits:
      
      6607155 [CodingCat] hot fix for PR105 - change to Java annotation
    • Fix example bug: compile error · 31a70400
      jianghan authored
      Author: jianghan <jianghan@xiaomi.com>
      
      Closes #132 from pooorman/master and squashes the following commits:
      
      54afbe0 [jianghan] Fix example bug: compile error
    • SPARK-1160: Deprecate toArray in RDD · 9032f7c0
      CodingCat authored
      https://spark-project.atlassian.net/browse/SPARK-1160
      
      reported by @mateiz: "It's redundant with collect() and the name doesn't make sense in Java, where we return a List (we can't return an array due to the way Java generics work). It's also missing in Python."
      
      In this patch, I deprecated the method and changed the source files that used it by replacing toArray with collect() directly (the deprecation pattern is sketched below).
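
      The deprecation pattern itself, as a minimal sketch (the version string is illustrative):

      ```scala
      import scala.reflect.ClassTag

      class MiniRdd[T: ClassTag](data: Seq[T]) {
        def collect(): Array[T] = data.toArray

        @deprecated("use collect() instead", "1.0.0")
        def toArray(): Array[T] = collect() // redundant alias, warned at compile time
      }
      ```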
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #105 from CodingCat/SPARK-1060 and squashes the following commits:
      
      286f163 [CodingCat] deprecate in JavaRDDLike
      ee17b4e [CodingCat] add message and since
      2ff7319 [CodingCat] deprecate toArray in RDD
    • SPARK-1162 Added top in python. · b8afe305
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #93 from ScrapCodes/SPARK-1162/pyspark-top-takeOrdered and squashes the following commits:
      
      ece1fa4 [Prashant Sharma] Added top in python.
    • Fix #SPARK-1149 Bad partitioners can cause Spark to hang · 5d1ec64e
      liguoqiang authored
      Author: liguoqiang <liguoqiang@rd.tuan800.com>
      
      Closes #44 from witgo/SPARK-1149 and squashes the following commits:
      
      3dcdcaf [liguoqiang] Merge branch 'master' into SPARK-1149
      8425395 [liguoqiang] Merge remote-tracking branch 'upstream/master' into SPARK-1149
      3dad595 [liguoqiang] review comment
      e3e56aa [liguoqiang] Merge branch 'master' into SPARK-1149
      b0d5c07 [liguoqiang] review comment
      d0a6005 [liguoqiang] review comment
      3395ee7 [liguoqiang] Merge remote-tracking branch 'upstream/master' into SPARK-1149
      ac006a3 [liguoqiang] code Formatting
      3feb3a8 [liguoqiang] Merge branch 'master' into SPARK-1149
      adc443e [liguoqiang] partitions check  bugfix
      928e1e3 [liguoqiang] Added a unit test for PairRDDFunctions.lookup with bad partitioner
      db6ecc5 [liguoqiang] Merge branch 'master' into SPARK-1149
      1e3331e [liguoqiang] Merge branch 'master' into SPARK-1149
      3348619 [liguoqiang] Optimize performance for partitions check
      61e5a87 [liguoqiang] Merge branch 'master' into SPARK-1149
      e68210a [liguoqiang] add partition index check to submitJob
      3a65903 [liguoqiang] make the code more readable
      6bb725e [liguoqiang] fix #SPARK-1149 Bad partitioners can cause Spark to hang
    • [SPARK-1233] Fix running hadoop 0.23 due to java.lang.NoSuchFieldException: DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH · b5162f44
      Thomas Graves authored
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #129 from tgravescs/SPARK-1233 and squashes the following commits:
      
      85ff5a6 [Thomas Graves] Fix running hadoop 0.23 due to java.lang.NoSuchFieldException: DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH
    • [SPARK-1232] Fix the hadoop 0.23 yarn build · c8c59b32
      Thomas Graves authored
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #127 from tgravescs/SPARK-1232 and squashes the following commits:
      
      c05cfd4 [Thomas Graves] Fix the hadoop 0.23 yarn build
    • Spark-1163, Added missing Python RDD functions · af7f2f10
      prabinb authored
      Author: prabinb <prabin.banka@imaginea.com>
      
      Closes #92 from prabinb/python-api-rdd and squashes the following commits:
      
      51129ca [prabinb] Added missing Python RDD functions Added __repr__ function to StorageLevel class. Added doctest for RDD.getStorageLevel().
    • SPARK-1064 · 2409af9d
      Sandy Ryza authored
      This reopens PR 649 from incubator-spark against the new repo
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #102 from sryza/sandy-spark-1064 and squashes the following commits:
      
      270e490 [Sandy Ryza] Handle different application classpath variables in different versions
      88b04e0 [Sandy Ryza] SPARK-1064. Make it possible to run on YARN without bundling Hadoop jars in Spark assembly
  10. Mar 11, 2014
    • SPARK-1167: Remove metrics-ganglia from default build due to LGPL issues... · 16788a65
      Patrick Wendell authored
      This patch removes Ganglia integration from the default build. It
      allows users willing to link against LGPL code to use Ganglia
      by adding build flags or linking against a new Spark artifact called
      spark-ganglia-lgpl.
      
      This brings Spark in line with the Apache policy on LGPL code
      enumerated here:
      
      https://www.apache.org/legal/3party.html#options-optional
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #108 from pwendell/ganglia and squashes the following commits:
      
      326712a [Patrick Wendell] Responding to review feedback
      5f28ee4 [Patrick Wendell] SPARK-1167: Remove metrics-ganglia from default build due to LGPL issues.
  11. Mar 10, 2014
    • SPARK-1211. In ApplicationMaster, set spark.master system property to "yarn-cluster" · 2a2c9645
      Sandy Ryza authored
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #118 from sryza/sandy-spark-1211 and squashes the following commits:
      
      d4001c7 [Sandy Ryza] SPARK-1211. In ApplicationMaster, set spark.master system property to "yarn-cluster"
    • SPARK-1205: Clean up callSite/origin/generator. · 2a516170
      Patrick Wendell authored
      This patch removes the `generator` field and simplifies + documents
      the tracking of callsites.
      
      There are two places where we care about call sites: when a job is
      run and when an RDD is created. This patch retains both of those
      features but does a slight refactoring and renaming to make things
      less confusing.
      
      There was another feature of an RDD called the `generator`, which was
      by default the user class in which the RDD was created. This is
      used exclusively in the JobLogger. It has been subsumed by the ability
      to name a job group. The job logger can later be refactored to
      read the job group directly (this will require some work), but for now
      this just preserves the default logged value of the user class.
      I'm not sure any users ever used the ability to override this.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #106 from pwendell/callsite and squashes the following commits:
      
      fc1d009 [Patrick Wendell] Compile fix
      e17fb76 [Patrick Wendell] Review feedback: callSite -> creationSite
      62e77ef [Patrick Wendell] Review feedback
      576e60b [Patrick Wendell] SPARK-1205: Clean up callSite/origin/generator.
    • SPARK-1168, Added foldByKey to pyspark. · a59419c2
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #115 from ScrapCodes/SPARK-1168/pyspark-foldByKey and squashes the following commits:
      
      db6f67e [Prashant Sharma] SPARK-1168, Added foldByKey to pyspark.