- Mar 20, 2014
Reza Zadeh authored
# Principal Component Analysis

Computes the top k principal component coefficients for the m-by-n data matrix X. Rows of X correspond to observations and columns correspond to variables. The coefficient matrix is n-by-k: each column of the returned coefficient matrix contains the coefficients for one principal component, and the columns are in descending order of component variance. This function centers the data and uses the singular value decomposition (SVD) algorithm. (A standalone sketch of this procedure follows this entry.)

## Testing

Tests included:
* All principal components
* Only top k principal components
* Dense SVD tests
* Dense/sparse matrix tests

The results are tested against MATLAB's pca: http://www.mathworks.com/help/stats/pca.html

## Documentation

Added to mllib-guide.md

## Example Usage

Added to examples directory under SparkPCA.scala

Author: Reza Zadeh <rizlar@gmail.com>

Closes #88 from rezazadeh/sparkpca and squashes the following commits: e298700 [Reza Zadeh] reformat using IDE 3f23271 [Reza Zadeh] documentation and cleanup b025ab2 [Reza Zadeh] documentation e2667d4 [Reza Zadeh] assertMatrixApproximatelyEquals 3787bb4 [Reza Zadeh] stylin c6ecc1f [Reza Zadeh] docs aa2bbcb [Reza Zadeh] rename sparseToTallSkinnyDense 56975b0 [Reza Zadeh] docs 2df9bde [Reza Zadeh] docs update 8fb0015 [Reza Zadeh] rcond documentation dbf7797 [Reza Zadeh] correct argument number a9f1f62 [Reza Zadeh] documentation 4ce6caa [Reza Zadeh] style changes 9a56a02 [Reza Zadeh] use rcond relative to larget svalue 120f796 [Reza Zadeh] housekeeping 156ff78 [Reza Zadeh] string comprehension 2e1cf43 [Reza Zadeh] rename rcond ea223a6 [Reza Zadeh] many style changes f4002d7 [Reza Zadeh] more docs bd53c7a [Reza Zadeh] proper accumulator a8b5ecf [Reza Zadeh] Don't use for loops 0dc7980 [Reza Zadeh] filter zeros in sparse 6115610 [Reza Zadeh] More documentation 36d51e8 [Reza Zadeh] use JBLAS for UVS^-1 computation bc4599f [Reza Zadeh] configurable rcond 86f7515 [Reza Zadeh] compute per parition, use while 09726b3 [Reza Zadeh] more style changes 4195e69 [Reza Zadeh] private, accumulator 17002be [Reza Zadeh] style changes 4ba7471 [Reza Zadeh] style change f4982e6 [Reza Zadeh] Use dense matrix in example 2828d28 [Reza Zadeh] optimizations: normalize once, use inplace ops 72c9fa1 [Reza Zadeh] rename DenseMatrix to TallSkinnyDenseMatrix, lean f807be9 [Reza Zadeh] fix typo 2d7ccde [Reza Zadeh] Array interface for dense svd and pca cd290fa [Reza Zadeh] provide RDD[Array[Double]] support 398d123 [Reza Zadeh] style change 55abbfa [Reza Zadeh] docs fix ef29644 [Reza Zadeh] bad chnage undo 472566e [Reza Zadeh] all files from old pr 555168f [Reza Zadeh] initial files
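For illustration, here is a minimal standalone sketch of the center-then-SVD procedure described above, written against Breeze on a local matrix rather than the distributed MLlib code added by this PR; the helper name and the use of Breeze are assumptions for illustration only.

```scala
import breeze.linalg.{DenseMatrix, sum, svd}

// Illustrative only: returns the n-by-k principal component coefficients of an m-by-n matrix.
def pcaCoefficients(x: DenseMatrix[Double], k: Int): DenseMatrix[Double] = {
  // Center each column (variable) by subtracting its mean, as described above.
  val centered = x.copy
  for (j <- 0 until x.cols) {
    centered(::, j) :-= sum(x(::, j)) / x.rows.toDouble
  }
  // SVD of the centered data: centered = U * diag(s) * Vt.
  val svd.SVD(_, _, vt) = svd(centered)
  // The coefficients are the top-k right singular vectors, ordered by decreasing component variance.
  vt(0 until k, ::).t
}
```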
-
- Mar 19, 2014
Aaron Davidson authored
This reverts commit 16789317. Jenkins was not run for this PR.
-
qqsun8819 authored
This is for JIRA: https://spark-project.atlassian.net/browse/SPARK-1099. Here is what this patch does (also commented in the JIRA). @aarondav This is really a behavioral change, so I make it with great caution and welcome any review advice:

1. Change the "MASTER=local" pattern to create the LocalBackend with a default number of cores instead of passing 1. When someone uses spark-shell to start local mode, the REPL uses this "MASTER=local" pattern by default, so cores specified on the spark-shell command line also go through this path; always passing 1 core is therefore no longer suitable after this change.
2. In the LocalBackend, the "totalCores" value is fetched by a different rule (previously it simply used the user-passed cores, e.g. 1 for the "MASTER=local" pattern and 2 for "MASTER=local[2]"):
   a. The second argument of LocalBackend's constructor, indicating cores, has a default value of Int.MaxValue; if the user did not pass a value, it stays at Int.MaxValue.
   b. In getMaxCores, we first compare this value to Int.MaxValue. If it differs, the user passed a desired value, so we just use it.
   c. Otherwise, we read cores from spark.cores.max and the real logical core count from the Runtime. If spark.cores.max is bigger than the logical core count, we use the logical cores; otherwise we use spark.cores.max. (A sketch of this rule follows this entry.)
3. In SparkContextSchedulerCreationSuite's test("local") case, the assertion is changed from 1 to the logical core count, because the "MASTER=local" pattern now uses default values.

Author: qqsun8819 <jin.oyj@alibaba-inc.com>

Closes #110 from qqsun8819/local-cores and squashes the following commits: 731aefa [qqsun8819] 1 LocalBackend not change 2 In SparkContext do some process to the cores and pass it to original LocalBackend constructor 78b9c60 [qqsun8819] 1 SparkContext MASTER=local pattern use default cores instead of 1 to construct LocalBackEnd , for use of spark-shell and cores specified in cmd line 2 some test case change from local to local[1]. 3 SparkContextSchedulerCreationSuite test spark.cores.max config in local pattern 6ae1ee8 [qqsun8819] Add a static function in LocalBackEnd to let it use spark.cores.max specified cores when no cores are passed to it
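A self-contained sketch of the core-selection rule described in item 2 above; the method and parameter names are illustrative assumptions, not the actual LocalBackend code.

```scala
// Illustrative only: mirrors rules (a)-(c) above, not the real LocalBackend implementation.
def resolveLocalCores(passedCores: Int = Int.MaxValue, coresMaxConf: Option[Int] = None): Int = {
  val logicalCores = Runtime.getRuntime.availableProcessors()
  if (passedCores != Int.MaxValue) {
    passedCores                                    // (b) the user explicitly passed a core count
  } else coresMaxConf match {
    case Some(max) => math.min(max, logicalCores)  // (c) cap spark.cores.max at the machine's logical cores
    case None      => logicalCores                 //     nothing configured: fall back to logical cores
  }
}
```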
-
Jyotiska NK authored
Doctest added for map in rdd.py Author: Jyotiska NK <jyotiska123@gmail.com> Closes #177 from jyotiska/pyspark_rdd_map_doctest and squashes the following commits: a38527f [Jyotiska NK] Added doctest for map function in rdd.py
-
Andrew Or authored
The fleeting nature of the Spark Web UI has long been a problem reported by many users: the existing Web UI disappears as soon as the associated application terminates. This is because SparkUI is tightly coupled with SparkContext, and cannot be instantiated independently from it. To solve this, some state must be saved to persistent storage while the application is still running. The approach taken by this PR involves persisting the UI state through SparkListenerEvents. This requires a major refactor of the SparkListener interface because existing events (1) maintain deep references, making de/serialization difficult, and (2) do not encode all the information displayed on the UI. In this design, each existing listener for the UI (e.g. ExecutorsListener) maintains state that can be fully constructed from SparkListenerEvents. This state is then supplied to the parent UI (e.g. ExecutorsUI), which renders the associated page(s) on demand. This PR introduces two important classes: the **EventLoggingListener** and the **ReplayListenerBus**. In a live application, SparkUI registers an EventLoggingListener with the SparkContext in addition to the existing listeners. Over the course of the application, this listener serializes and logs all events to persistent storage. Then, after the application has finished, the SparkUI can be revived by replaying all the logged events to the existing UI listeners through the ReplayListenerBus. (A toy sketch of this log-and-replay flow appears after the commit list below.) This feature is currently integrated with the Master Web UI, which optionally rebuilds a SparkUI from event logs as soon as the corresponding application finishes. More details can be found in the commit messages, comments within the code, and the [design doc](https://spark-project.atlassian.net/secure/attachment/12900/PersistingSparkWebUI.pdf). Comments and feedback are most welcome.
Author: Andrew Or <andrewor14@gmail.com> Author: andrewor14 <andrewor14@gmail.com> Closes #42 from andrewor14/master and squashes the following commits: e5f14fa [Andrew Or] Merge github.com:apache/spark a1c5cd9 [Andrew Or] Merge github.com:apache/spark b8ba817 [Andrew Or] Remove UI from map when removing application in Master 83af656 [Andrew Or] Scraps and pieces (no functionality change) 222adcd [Andrew Or] Merge github.com:apache/spark 124429f [Andrew Or] Clarify LiveListenerBus behavior + Add tests for new behavior f80bd31 [Andrew Or] Simplify static handler and BlockManager status update logic 9e14f97 [Andrew Or] Moved around functionality + renamed classes per Patrick 6740e49 [Andrew Or] Fix comment nits 650eb12 [Andrew Or] Add unit tests + Fix bugs found through tests 45fd84c [Andrew Or] Remove now deprecated test c5c2c8f [Andrew Or] Remove list of (TaskInfo, TaskMetrics) from StageInfo 3456090 [Andrew Or] Address Patrick's comments bf80e3d [Andrew Or] Imports, comments, and code formatting, once again (minor) ac69ec8 [Andrew Or] Fix test fail d801d11 [Andrew Or] Merge github.com:apache/spark (major) dc93915 [Andrew Or] Imports, comments, and code formatting (minor) 77ba283 [Andrew Or] Address Kay's and Patrick's comments b6eaea7 [Andrew Or] Treating SparkUI as a handler of MasterUI d59da5f [Andrew Or] Avoid logging all the blocks on each executor d6e3b4a [Andrew Or] Merge github.com:apache/spark ca258a4 [Andrew Or] Master UI - add support for reading compressed event logs 176e68e [Andrew Or] Fix deprecated message for JavaSparkContext (minor) 4f69c4a [Andrew Or] Master UI - Rebuild SparkUI on application finish 291b2be [Andrew Or] Correct directory in log message "INFO: Logging events to <dir>" 1ba3407 [Andrew Or] Add a few configurable options to event logging e375431 [Andrew Or] Add new constructors for SparkUI 18b256d [Andrew Or] Refactor out event logging and replaying logic from UI bb4c503 [Andrew Or] Use a more mnemonic path for logging aef411c [Andrew Or] Fix bug: storage status was not reflected on UI in the local case 03eda0b [Andrew Or] Fix HDFS flush behavior 36b3e5d [Andrew Or] Add HDFS support for event logging cceff2b [andrewor14] Fix 100 char format fail 2fee310 [Andrew Or] Address Patrick's comments 2981d61 [Andrew Or] Move SparkListenerBus out of DAGScheduler + Clean up 5d2cec1 [Andrew Or] JobLogger: ID -> Id 0503e4b [Andrew Or] Fix PySpark tests + remove sc.clearFiles/clearJars 4d2fb0c [Andrew Or] Fix format fail faa113e [Andrew Or] General clean up d47585f [Andrew Or] Clean up FileLogger 472fd8a [Andrew Or] Fix a couple of tests 996d7a2 [Andrew Or] Reflect RDD unpersist on UI 7b2f811 [Andrew Or] Guard against TaskMetrics NPE + Fix tests d1f4285 [Andrew Or] Migrate from lift-json to json4s-jackson 28019ca [Andrew Or] Merge github.com:apache/spark bbe3501 [Andrew Or] Embed storage status and RDD info in Task events 6631c02 [Andrew Or] More formatting changes, this time mainly for Json DSL 70e7e7a [Andrew Or] Formatting changes e9e1c6d [Andrew Or] Move all JSON de/serialization logic to JsonProtocol d646df6 [Andrew Or] Completely decouple SparkUI from SparkContext 6814da0 [Andrew Or] Explicitly register each UI listener rather than through some magic 64d2ce1 [Andrew Or] Fix BlockManagerUI bug by introducing new event 4273013 [Andrew Or] Add a gateway SparkListener to simplify event logging 904c729 [Andrew Or] Fix another major bug 5ac906d [Andrew Or] Mostly naming, formatting, and code style changes 3fd584e [Andrew Or] Fix two major bugs f3fc13b [Andrew Or] General 
refactor 4dfcd22 [Andrew Or] Merge git://git.apache.org/incubator-spark into persist-ui b3976b0 [Andrew Or] Add functionality of reconstructing a persisted UI from SparkContext 8add36b [Andrew Or] JobProgressUI: Add JSON functionality d859efc [Andrew Or] BlockManagerUI: Add JSON functionality c4cd480 [Andrew Or] Also deserialize new events 8a2ebe6 [Andrew Or] Fix bugs for EnvironmentUI and ExecutorsUI de8a1cd [Andrew Or] Serialize events both to and from JSON (rather than just to) bf0b2e9 [Andrew Or] ExecutorUI: Serialize events rather than arbitary executor information bb222b9 [Andrew Or] ExecutorUI: render completely from JSON dcbd312 [Andrew Or] Add JSON Serializability for all SparkListenerEvent's 10ed49d [Andrew Or] Merge github.com:apache/incubator-spark into persist-ui 8e09306 [Andrew Or] Use JSON for ExecutorsUI e3ae35f [Andrew Or] Merge github.com:apache/incubator-spark 3ddeb7e [Andrew Or] Also privatize fields 090544a [Andrew Or] Privatize methods 13920c9 [Andrew Or] Update docs bd5a1d7 [Andrew Or] Typo: phyiscal -> physical 287ef44 [Andrew Or] Avoid reading the entire batch into memory; also simplify streaming logic 3df7005 [Andrew Or] Merge branch 'master' of github.com:andrewor14/incubator-spark a531d2e [Andrew Or] Relax assumptions on compressors and serializers when batching 164489d [Andrew Or] Relax assumptions on compressors and serializers when batching
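To make the log-then-replay flow concrete, here is a deliberately tiny sketch in the spirit of EventLoggingListener and ReplayListenerBus; the event types and class names below are invented for illustration, and the real implementation serializes events to persistent storage rather than keeping them in memory.

```scala
import scala.collection.mutable.ArrayBuffer

// Invented event types standing in for SparkListenerEvents.
sealed trait UiEvent
case class StageCompleted(stageId: Int) extends UiEvent
case class TaskEnded(stageId: Int, durationMs: Long) extends UiEvent

class ToyEventLog {
  private val events = ArrayBuffer.empty[UiEvent]
  // Logging side: every event seen while the application runs is appended to the log.
  def log(event: UiEvent): Unit = events.append(event)
  // Replay side: after the application finishes, feed the logged events back to the UI listeners.
  def replay(listener: UiEvent => Unit): Unit = events.foreach(listener)
}
```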
-
Mridul Muralidharan authored
Move the PR#517 of apache-incubator-spark to the apache-spark Author: Mridul Muralidharan <mridul@gmail.com> Closes #159 from mridulm/master and squashes the following commits: 5ff59c2 [Mridul Muralidharan] Change property in suite also 167fad8 [Mridul Muralidharan] Address review comments 9bda70e [Mridul Muralidharan] Address review comments, akwats add to failedExecutors 270d841 [Mridul Muralidharan] Address review comments fa5d9f1 [Mridul Muralidharan] Bugfixes/improvements to scheduler : PR #517
-
Thomas Graves authored
Author: Thomas Graves <tgraves@apache.org> Closes #173 from tgravescs/SPARK-1203 and squashes the following commits: 4fd5ded [Thomas Graves] adding import 964e3f7 [Thomas Graves] SPARK-1203 fix saving to hdfs from yarn
-
shiyun.wxm authored
If a stage that has completed once loses part of its data, it will be resubmitted. When that happens, it can appear that stage.completionTime > stage.submissionTime. Author: shiyun.wxm <shiyun.wxm@taobao.com> Closes #170 from BlackNiuza/duration_problem and squashes the following commits: a86d261 [shiyun.wxm] tow space indent c0d7b24 [shiyun.wxm] change the style 3b072e1 [shiyun.wxm] fix scala style f20701e [shiyun.wxm] bugfix: "Duration" in "Active Stages" in stages page
-
Nick Lanham authored
This should all work as expected with the current version of the tachyon tarball (0.4.1) Author: Nick Lanham <nick@afternight.org> Closes #137 from nicklan/bundle-tachyon and squashes the following commits: 2eee15b [Nick Lanham] Put back in exec, start tachyon first 738ba23 [Nick Lanham] Move tachyon out of sbin f2f9bc6 [Nick Lanham] More checks for tachyon script 111e8e1 [Nick Lanham] Only try tachyon operations if tachyon script exists 0561574 [Nick Lanham] Copy over web resources so web interface can run 4dc9809 [Nick Lanham] Update to tachyon 0.4.1 0a1a20c [Nick Lanham] Add scripts using tachyon tarball
-
- Mar 18, 2014
witgo authored
Author: witgo <witgo@qq.com> Closes #150 from witgo/SPARK-1256 and squashes the following commits: 08044a2 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1256 c99b030 [witgo] Fix SPARK-1256
-
Xiangrui Meng authored
In implicit ALS computation, the user or product factor is used twice in each iteration. Caching can certainly help accelerate the computation. I saw the running time decrease by ~70% for implicit ALS on the movielens data. I also made the following changes: 1. Change `YtYb` type from `Broadcast[Option[DoubleMatrix]]` to `Option[Broadcast[DoubleMatrix]]`, so we don't need to broadcast None in explicit computation. 2. Mark the methods `computeYtY`, `unblockFactors`, `updateBlock`, and `updateFeatures` private. Users do not need those methods. 3. Materialize the final matrix factors before returning the model. It allows us to clean up other cached RDDs before returning the model. I do not have a better solution here, so I use `RDD.count()`. (A sketch of this materialization step follows this entry.) JIRA: https://spark-project.atlassian.net/browse/SPARK-1266 Author: Xiangrui Meng <meng@databricks.com> Closes #165 from mengxr/als and squashes the following commits: c9676a6 [Xiangrui Meng] add a comment about the last products.persist d3a88aa [Xiangrui Meng] change implicitPrefs match to if ... else ... 63862d6 [Xiangrui Meng] persist factors in implicit ALS
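A small sketch of the "materialize before returning" step mentioned in point 3, using only persist() and count() from the core RDD API; the helper name and storage level are assumptions, not the actual ALS code.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Forces evaluation of the factor RDD so that upstream cached RDDs can be
// unpersisted safely before the model is returned.
def materialize[T](factors: RDD[T]): RDD[T] = {
  factors.persist(StorageLevel.MEMORY_AND_DISK)
  factors.count()  // count() is simply a cheap action that triggers the computation
  factors
}
```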
-
Xiangrui Meng authored
The current implementation uses `Array(1.0, features: _*)` to construct a new array with intercept. This is not efficient for big arrays because `Array.apply` uses a for loop that iterates over the arguments. `Array.+:` is a better choice here. Also, I don't see a reason to set initial weights to ones. So I set them to zeros. JIRA: https://spark-project.atlassian.net/browse/SPARK-1260 Author: Xiangrui Meng <meng@databricks.com> Closes #161 from mengxr/sgd and squashes the following commits: b5cfc53 [Xiangrui Meng] set default weights to zeros a1439c2 [Xiangrui Meng] faster construction of features with intercept
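A brief illustration of the two construction styles contrasted above; the variable names are made up for the example.

```scala
val features = Array(0.5, 1.2, -3.0)

// Array.apply walks the varargs one element at a time to build the new array.
val slower = Array(1.0, features: _*)

// +: allocates the result once and bulk-copies the existing array.
val faster = 1.0 +: features
```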
-
Matei Zaharia authored
Author: Matei Zaharia <matei@databricks.com> Closes #174 from mateiz/update-notice and squashes the following commits: 47fc1a5 [Matei Zaharia] Update copyright year in NOTICE to 2014
-
CodingCat authored
https://spark-project.atlassian.net/browse/SPARK-1102 Create a saveAsNewAPIHadoopDataset method. By @mateiz: "Right now RDDs can only be saved as files using the new Hadoop API, not as "datasets" with no filename and just a JobConf. See http://codeforhire.com/2014/02/18/using-spark-with-mongodb/ for an example of how you have to give a bogus filename. For the old Hadoop API, we have saveAsHadoopDataset." Author: CodingCat <zhunansjtu@gmail.com> Closes #12 from CodingCat/SPARK-1102 and squashes the following commits: 6ba0c83 [CodingCat] add test cases for saveAsHadoopDataSet (new&old API) a8d11ba [CodingCat] style fix......... 95a6929 [CodingCat] code clean 7643c88 [CodingCat] change the parameter type back to Configuration a8583ee [CodingCat] Create a saveAsNewAPIHadoopDataset method
-
Patrick Wendell authored
This reverts commit ca4bf8c5. Jetty 9 requires JDK7, which is probably not a dependency we want to bump right now. Before Spark 1.0 we should consider upgrading to Jetty 8. However, in the meantime, to ease some pain, let's revert this. Sorry for not catching this during the initial review. cc/ @rxin Author: Patrick Wendell <pwendell@gmail.com> Closes #167 from pwendell/jetty-revert and squashes the following commits: 811b1c5 [Patrick Wendell] Revert "SPARK-1236 - Upgrade Jetty to 9.1.3.v20140225."
-
Dan McClary authored
Here's the addition of min and max to statscounter.py and min and max methods to rdd.py. Author: Dan McClary <dan.mcclary@gmail.com> Closes #144 from dwmclary/SPARK-1246-add-min-max-to-stat-counter and squashes the following commits: fd3fd4b [Dan McClary] fixed error, updated test 82cde0e [Dan McClary] flipped incorrectly assigned inf values in StatCounter 5d96799 [Dan McClary] added max and min to StatCounter repr for pyspark 21dd366 [Dan McClary] added max and min to StatCounter output, updated doc 1a97558 [Dan McClary] added max and min to StatCounter output, updated doc a5c13b0 [Dan McClary] Added min and max to Scala and Java RDD, added min and max to StatCounter ed67136 [Dan McClary] broke min/max out into separate transaction, added to rdd.py 1e7056d [Dan McClary] added underscore to getBucket 37a7dea [Dan McClary] cleaned up boundaries for histogram -- uses real min/max when buckets are derived 29981f2 [Dan McClary] fixed indentation on doctest comment eaf89d9 [Dan McClary] added correct doctest for histogram 4916016 [Dan McClary] added histogram method, added max and min to statscounter
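A usage sketch of the min/max additions on the Scala RDD side; the method and values are for illustration, and it assumes a SparkContext is supplied by the caller (e.g. the spark-shell's sc).

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // brings in DoubleRDDFunctions for stats()

def example(sc: SparkContext): Unit = {
  val nums = sc.parallelize(Seq(3.0, 1.0, 4.0, 1.5))
  println(nums.min())   // 1.0
  println(nums.max())   // 4.0
  println(nums.stats()) // the StatCounter now reports min and max alongside count/mean/stdev
}
```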
-
- Mar 17, 2014
Diana Carroll authored
Author: Diana Carroll <dcarroll@cloudera.com> Closes #162 from dianacarroll/SPARK-1261 and squashes the following commits: 14ac602 [Diana Carroll] typo in python example text 5121e3e [Diana Carroll] Add explanation of how to run Python examples to main doc overview page
-
Patrick Wendell authored
This is a very small change on top of @andrewor14's patch in #147. Author: Patrick Wendell <pwendell@gmail.com> Author: Andrew Or <andrewor14@gmail.com> Closes #152 from pwendell/akka-frame and squashes the following commits: e5fb3ff [Patrick Wendell] Reversing test order 393af4c [Patrick Wendell] Small improvement suggested by Andrew Or 8045103 [Patrick Wendell] Breaking out into two tests 2b4e085 [Patrick Wendell] Consolidate Executor use of akka frame size c9b6109 [Andrew Or] Simplify test + make access to akka frame size more modular 281d7c9 [Andrew Or] Throw exception on spark.akka.frameSize exceeded + Unit tests
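A rough sketch of the guard described in the commits above (throw when a serialized task exceeds spark.akka.frameSize instead of failing silently); the method and parameter names are illustrative only.

```scala
// Illustrative helper, not the actual Spark code.
def checkFrameSize(serializedTask: Array[Byte], maxFrameSizeBytes: Int): Unit = {
  if (serializedTask.length > maxFrameSizeBytes) {
    throw new IllegalArgumentException(
      s"Serialized task of ${serializedTask.length} bytes exceeds spark.akka.frameSize " +
      s"($maxFrameSizeBytes bytes); consider increasing the frame size or broadcasting large values")
  }
}
```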
-
CodingCat authored
https://spark-project.atlassian.net/browse/SPARK-1240 It seems that the current implementation does not handle the empty-RDD case when running takeSample. In this patch, before calling sample() inside the takeSample API, I add a check for this case and return an empty Array when the RDD is empty; in sample(), I also add a check for invalid fraction values. I also add several lines to the test cases for this scenario. Author: CodingCat <zhunansjtu@gmail.com> Closes #135 from CodingCat/SPARK-1240 and squashes the following commits: fef57d4 [CodingCat] fix the same problem in PySpark 36db06b [CodingCat] create new test cases for takeSample from an empty red 810948d [CodingCat] further fix a40e8fb [CodingCat] replace if with require ad483fd [CodingCat] handle the case with empty RDD when take sample
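A hedged sketch of the two checks described above, written as a standalone helper rather than the actual RDD.takeSample code; the helper name, the emptiness test, and the fixed seed are assumptions for illustration.

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def safeTakeSample[T: ClassTag](rdd: RDD[T], num: Int, fraction: Double): Array[T] = {
  require(fraction >= 0.0, s"Invalid fraction value: $fraction")  // reject bad fractions early
  if (num == 0 || rdd.partitions.isEmpty) {
    Array.empty[T]               // an empty RDD (or a request for zero elements) yields an empty result
  } else {
    rdd.sample(false, fraction, 42).take(num)
  }
}
```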
-
- Mar 16, 2014
Reynold Xin authored
This is more general than simply passing a string name and leaves more room for performance optimizations. Note that this is technically an API-breaking change in the following two ways: 1. The shuffle serializer specification in ShuffleDependency now requires an object instead of a String (of the class name), but I suspect nobody else in this world has used this API other than me in GraphX and Shark. 2. Serializers in Spark are from now on required to be serializable. Author: Reynold Xin <rxin@apache.org> Closes #149 from rxin/serializer and squashes the following commits: 5acaccd [Reynold Xin] Properly call serializer's constructors. 2a8d75a [Reynold Xin] Added more documentation for the serializer option in ShuffleDependency. 7420185 [Reynold Xin] Allow user to pass Serializer object instead of class name for shuffle.
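A generic, non-Spark sketch of the API shift described above: the caller hands over a configured, serializable serializer instance instead of a class name that the framework instantiates reflectively. The trait and method names below are invented for illustration.

```scala
trait ToySerializer extends Serializable {
  def serialize[T](value: T): Array[Byte]
}

// Before: pass a class name and have the framework instantiate it via reflection,
// losing any construction-time configuration.
def fromClassName(className: String): ToySerializer =
  Class.forName(className).newInstance().asInstanceOf[ToySerializer]

// After: pass the instance directly, keeping whatever configuration it carries.
def fromInstance(serializer: ToySerializer): ToySerializer = serializer
```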
-
- Mar 15, 2014
Sean Owen authored
This suggestion addresses a few minor suboptimalities with how repositories are handled. 1) Use HTTPS consistently to access repos, instead of HTTP 2) Consolidate repository declarations in the parent POM file, in the case of the Maven build, so that their ordering can be controlled to put the fully optional Cloudera repo at the end, after required repos. (This was prompted by the untimely failure of the Cloudera repo this week, which made the Spark build fail. #2 would have prevented that.) 3) Update SBT build to match Maven build in this regard 4) Update SBT build to not refer to Sonatype snapshot repos. This wasn't in Maven, and a build generally would not refer to external snapshots, but I'm not 100% sure on this one. Author: Sean Owen <sowen@cloudera.com> Closes #145 from srowen/SPARK-1254 and squashes the following commits: 42f9bfc [Sean Owen] Use HTTPS for repos; consolidate repos in parent in order to put optional Cloudera repo last; harmonize SBT build repos with Maven; remove snapshot repos from SBT build which weren't in Maven
-
- Mar 14, 2014
Michael Armbrust authored
Author: Michael Armbrust <michael@databricks.com> Closes #141 from marmbrus/mutablePair and squashes the following commits: f5c4783 [Michael Armbrust] Change function name to update 8bfd973 [Michael Armbrust] Fix serialization of MutablePair. Also provide an interface for easy updating.
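A minimal sketch of a serializable mutable pair with an update method, in the spirit of the change above; this is an illustrative stand-in, not the actual org.apache.spark.util.MutablePair source.

```scala
// Reuse one object across records instead of allocating a new pair each time.
class ToyMutablePair[T1, T2](var _1: T1, var _2: T2) extends Serializable {
  def update(n1: T1, n2: T2): this.type = {
    _1 = n1
    _2 = n2
    this
  }
}
```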
-
- Mar 13, 2014
Tianshuo Deng authored
The client arg is wrong; it should be executor-cores. It causes the executor to fail to start when executor-cores is specified. Author: Tianshuo Deng <tdeng@twitter.com> Closes #138 from tsdeng/bugfix_wrong_client_args and squashes the following commits: 304826d [Tianshuo Deng] wrong client arg, should use executor-cores
-
Reynold Xin authored
Author: Reynold Xin <rxin@apache.org> Closes #113 from rxin/jetty9 and squashes the following commits: 867a2ce [Reynold Xin] Updated Jetty version to 9.1.3.v20140225 in Maven build file. d7c97ca [Reynold Xin] Return the correctly bound port. d14706f [Reynold Xin] Upgrade Jetty to 9.1.3.v20140225.
-
Sandy Ryza authored
Author: Sandy Ryza <sandy@cloudera.com> Closes #120 from sryza/sandy-spark-1183 and squashes the following commits: 5066a4a [Sandy Ryza] Remove "worker" in a couple comments 0bd1e46 [Sandy Ryza] Remove --am-class from usage bfc8fe0 [Sandy Ryza] Remove am-class from doc and fix yarn-alpha 607539f [Sandy Ryza] Address review comments 74d087a [Sandy Ryza] SPARK-1183. Don't use "worker" to mean executor
-
Xiangrui Meng authored
Computing YtY can be implemented using BLAS's DSPR operations instead of generating y_i y_i^T and then combining them. The latter generates many k-by-k matrices. On the movielens data, this change improves the performance by 10-20%. The algorithm remains the same, verified by computing RMSE on the movielens data. To compare the results, I also added an option to set a random seed in ALS. JIRA: 1. https://spark-project.atlassian.net/browse/SPARK-1237 2. https://spark-project.atlassian.net/browse/SPARK-1238 Author: Xiangrui Meng <meng@databricks.com> Closes #131 from mengxr/als and squashes the following commits: ed00432 [Xiangrui Meng] minor changes d984623 [Xiangrui Meng] minor changes 2fc1641 [Xiangrui Meng] remove commented code 4c7cde2 [Xiangrui Meng] allow specifying a random seed in ALS 200bef0 [Xiangrui Meng] optimize computeYtY and updateBlock
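To make the idea concrete, here is a plain-Scala sketch of accumulating YtY with rank-one updates into a single k-by-k buffer, which is what a DSPR call does more efficiently (and on the packed upper triangle only); the method name and row-major layout are illustrative, not the actual BLAS-based code in the PR.

```scala
// Illustrative only: accumulate sum_i y_i * y_i^T without materializing a k-by-k matrix per row.
def accumulateYtY(rows: Iterator[Array[Double]], k: Int): Array[Double] = {
  val ytY = new Array[Double](k * k)     // row-major k-by-k accumulator
  while (rows.hasNext) {
    val y = rows.next()
    var i = 0
    while (i < k) {
      var j = 0
      while (j < k) {
        ytY(i * k + j) += y(i) * y(j)    // += y * y^T, entry by entry
        j += 1
      }
      i += 1
    }
  }
  ytY
}
```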
-
Patrick Wendell authored
Author: Patrick Wendell <pwendell@gmail.com> Closes #112 from pwendell/pyspark-take and squashes the following commits: daae80e [Patrick Wendell] SPARK-1019: pyspark RDD take() throws an NPE
-
- Mar 12, 2014
CodingCat authored
Author: CodingCat <zhunansjtu@gmail.com> Closes #133 from CodingCat/SPARK-1160-2 and squashes the following commits: 6607155 [CodingCat] hot fix for PR105 - change to Java annotation
-
jianghan authored
Author: jianghan <jianghan@xiaomi.com> Closes #132 from pooorman/master and squashes the following commits: 54afbe0 [jianghan] Fix example bug: compile error
-
CodingCat authored
https://spark-project.atlassian.net/browse/SPARK-1160 reported by @mateiz: "It's redundant with collect() and the name doesn't make sense in Java, where we return a List (we can't return an array due to the way Java generics work). It's also missing in Python." In this patch, I deprecated the method and changed the source files using it by replacing toArray with collect() directly Author: CodingCat <zhunansjtu@gmail.com> Closes #105 from CodingCat/SPARK-1060 and squashes the following commits: 286f163 [CodingCat] deprecate in JavaRDDLike ee17b4e [CodingCat] add message and since 2ff7319 [CodingCat] deprecate toArray in RDD
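A migration sketch for callers of the deprecated method; the example wraps the call in a helper that takes a SparkContext so it is self-contained.

```scala
import org.apache.spark.SparkContext

def migrated(sc: SparkContext): Array[Int] = {
  val rdd = sc.parallelize(1 to 5)
  rdd.collect()  // drop-in replacement for the now-deprecated rdd.toArray
}
```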
-
Prashant Sharma authored
Author: Prashant Sharma <prashant.s@imaginea.com> Closes #93 from ScrapCodes/SPARK-1162/pyspark-top-takeOrdered and squashes the following commits: ece1fa4 [Prashant Sharma] Added top in python.
-
liguoqiang authored
Author: liguoqiang <liguoqiang@rd.tuan800.com> Closes #44 from witgo/SPARK-1149 and squashes the following commits: 3dcdcaf [liguoqiang] Merge branch 'master' into SPARK-1149 8425395 [liguoqiang] Merge remote-tracking branch 'upstream/master' into SPARK-1149 3dad595 [liguoqiang] review comment e3e56aa [liguoqiang] Merge branch 'master' into SPARK-1149 b0d5c07 [liguoqiang] review comment d0a6005 [liguoqiang] review comment 3395ee7 [liguoqiang] Merge remote-tracking branch 'upstream/master' into SPARK-1149 ac006a3 [liguoqiang] code Formatting 3feb3a8 [liguoqiang] Merge branch 'master' into SPARK-1149 adc443e [liguoqiang] partitions check bugfix 928e1e3 [liguoqiang] Added a unit test for PairRDDFunctions.lookup with bad partitioner db6ecc5 [liguoqiang] Merge branch 'master' into SPARK-1149 1e3331e [liguoqiang] Merge branch 'master' into SPARK-1149 3348619 [liguoqiang] Optimize performance for partitions check 61e5a87 [liguoqiang] Merge branch 'master' into SPARK-1149 e68210a [liguoqiang] add partition index check to submitJob 3a65903 [liguoqiang] make the code more readable 6bb725e [liguoqiang] fix #SPARK-1149 Bad partitioners can cause Spark to hang
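A hedged sketch of the kind of partition-index validation referred to in the commit list above (e.g. "add partition index check to submitJob"); the function name and message wording are assumptions, not the actual scheduler code.

```scala
// Illustrative only: reject out-of-range partition indices up front,
// so a bad partitioner cannot make the job submission hang.
def checkPartitionIndices(requested: Seq[Int], numPartitions: Int): Unit = {
  requested.foreach { p =>
    require(p >= 0 && p < numPartitions,
      s"Attempting to access a non-existent partition: $p (total partitions: $numPartitions)")
  }
}
```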
-
Thomas Graves authored
...APREDUCE_APPLICATION_CLASSPATH Author: Thomas Graves <tgraves@apache.org> Closes #129 from tgravescs/SPARK-1233 and squashes the following commits: 85ff5a6 [Thomas Graves] Fix running hadoop 0.23 due to java.lang.NoSuchFieldException: DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH
-
Thomas Graves authored
Author: Thomas Graves <tgraves@apache.org> Closes #127 from tgravescs/SPARK-1232 and squashes the following commits: c05cfd4 [Thomas Graves] Fix the hadoop 0.23 yarn build
-
prabinb authored
Author: prabinb <prabin.banka@imaginea.com> Closes #92 from prabinb/python-api-rdd and squashes the following commits: 51129ca [prabinb] Added missing Python RDD functions Added __repr__ function to StorageLevel class. Added doctest for RDD.getStorageLevel().
-
Sandy Ryza authored
This reopens PR 649 from incubator-spark against the new repo Author: Sandy Ryza <sandy@cloudera.com> Closes #102 from sryza/sandy-spark-1064 and squashes the following commits: 270e490 [Sandy Ryza] Handle different application classpath variables in different versions 88b04e0 [Sandy Ryza] SPARK-1064. Make it possible to run on YARN without bundling Hadoop jars in Spark assembly
-
- Mar 11, 2014
Patrick Wendell authored
This patch removes Ganglia integration from the default build. It allows users willing to link against LGPL code to use Ganglia by adding build flags or linking against a new Spark artifact called spark-ganglia-lgpl. This brings Spark in line with the Apache policy on LGPL code enumerated here: https://www.apache.org/legal/3party.html#options-optional Author: Patrick Wendell <pwendell@gmail.com> Closes #108 from pwendell/ganglia and squashes the following commits: 326712a [Patrick Wendell] Responding to review feedback 5f28ee4 [Patrick Wendell] SPARK-1167: Remove metrics-ganglia from default build due to LGPL issues.
-
- Mar 10, 2014
Sandy Ryza authored
...arn-cluster" Author: Sandy Ryza <sandy@cloudera.com> Closes #118 from sryza/sandy-spark-1211 and squashes the following commits: d4001c7 [Sandy Ryza] SPARK-1211. In ApplicationMaster, set spark.master system property to "yarn-cluster"
-
Patrick Wendell authored
This patch removes the `generator` field and simplifies + documents the tracking of callsites. There are two places where we care about call sites: when a job is run and when an RDD is created. This patch retains both of those features but does a slight refactoring and renaming to make things less confusing. There was another feature of an RDD called the `generator`, which was by default the user class in which the RDD was created. This is used exclusively in the JobLogger. It has been subsumed by the ability to name a job group. The job logger can later be refactored to read the job group directly (this will require some work), but for now this just preserves the default logged value of the user class. I'm not sure any users ever used the ability to override this. Author: Patrick Wendell <pwendell@gmail.com> Closes #106 from pwendell/callsite and squashes the following commits: fc1d009 [Patrick Wendell] Compile fix e17fb76 [Patrick Wendell] Review feedback: callSite -> creationSite 62e77ef [Patrick Wendell] Review feedback 576e60b [Patrick Wendell] SPARK-1205: Clean up callSite/origin/generator.
-
Prashant Sharma authored
Author: Prashant Sharma <prashant.s@imaginea.com> Closes #115 from ScrapCodes/SPARK-1168/pyspark-foldByKey and squashes the following commits: db6f67e [Prashant Sharma] SPARK-1168, Added foldByKey to pyspark.
-