  1. Jan 27, 2014
• Merge pull request #466 from liyinan926/file-overwrite-new · 84670f27
      Reynold Xin authored
      Allow files added through SparkContext.addFile() to be overwritten
      
      This is useful for the cases when a file needs to be refreshed and downloaded by the executors periodically. For example, a possible use case is: the driver periodically renews a Hadoop delegation token and writes it to a token file. The token file needs to be downloaded by the executors whenever it gets renewed. However, the current implementation throws an exception when the target file exists and its contents do not match those of the new source. This PR adds an option to allow files to be overwritten to support use cases similar to the above.
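A minimal sketch of the intended usage, assuming the new option is the `spark.files.overwrite` configuration key introduced by this PR (app name and file path are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    // Allow a re-added file to replace the copy executors already downloaded.
    val conf = new SparkConf()
      .setAppName("TokenRefreshExample")
      .set("spark.files.overwrite", "true")
    val sc = new SparkContext(conf)

    // The driver can periodically rewrite this file and add it again; with the
    // option enabled, executors replace their cached copy instead of failing
    // when the contents differ.
    sc.addFile("/tmp/hadoop.token")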
• Merge pull request #516 from sarutak/master · 3d5c03e2
      Reynold Xin authored
modified SparkPluginBuild.scala to use https protocol for accessing github
      
We cannot build Spark behind a proxy, even when sbt is run with the -Dhttp(s).proxyHost, -Dhttp(s).proxyPort, -Dhttp(s).proxyUser, and -Dhttp(s).proxyPassword options.
This is because the git protocol is used to clone junit_xml_listener.git, and git:// connections do not go through the HTTP(S) proxy.
I could build successfully after modifying SparkPluginBuild.scala to use https instead.
      
      I reported this issue to JIRA.
      https://spark-project.atlassian.net/browse/SPARK-1046
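A sketch of the kind of change involved, assuming the plugin build definition has roughly this shape (not the exact diff):

    import sbt._

    object SparkPluginDef extends Build {
      lazy val root = Project("plugins", file(".")) dependsOn(junitXmlListener)
      // Before: uri("git://github.com/ijuma/junit_xml_listener.git"); the git
      // protocol ignores the -Dhttp(s).proxy* settings passed to sbt, so the
      // clone fails behind a proxy. An https URL goes through the proxy.
      lazy val junitXmlListener = uri("https://github.com/ijuma/junit_xml_listener.git")
    }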
• Merge pull request #490 from hsaputra/modify_checkoption_with_isdefined · f16c21e2
      Reynold Xin authored
      Replace the check for None Option with isDefined and isEmpty in Scala code
      
Propose to replace the Scala check for Option "!= None" with Option.isDefined, and "== None" with Option.isEmpty.

I think this, using an intention-revealing method call rather than an operator comparison against None, will make the Scala code easier to read and understand (see the sketch below).

Compiles and passes tests.
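For illustration, the style change looks like this (hypothetical variable):

    val maybePort: Option[Int] = Some(8080)

    // Before: comparing against None with an operator.
    if (maybePort != None) { /* use the port */ }
    if (maybePort == None) { /* fall back to a default */ }

    // After: intention-revealing method calls.
    if (maybePort.isDefined) { /* use the port */ }
    if (maybePort.isEmpty) { /* fall back to a default */ }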
• Merge pull request #460 from srowen/RandomInitialALSVectors · f67ce3e2
      Sean Owen authored
      Choose initial user/item vectors uniformly on the unit sphere
      
      ...rather than within the unit square to possibly avoid bias in the initial state and improve convergence.
      
The current implementation picks the N vector elements uniformly at random from [0,1). This means they all point into one quadrant of the vector space. As N gets even a little large, the vectors tend strongly to point into the "corner", towards (1,1,...,1). The vectors are not unit vectors either.
      
I suggest choosing the elements as Gaussian ~ N(0,1) and normalizing. This yields choices distributed uniformly on the unit sphere, which is more what's of interest here. It has worked a little better for me in the past. (A minimal sketch appears after the commit list below.)
      
This is pretty minor, but I wanted to warm up by suggesting a few tweaks to ALS.
Please excuse my Scala; I'm pretty new to it.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      == Merge branch commits ==
      
      commit 492b13a7469e5a4ed7591ee8e56d8bd7570dfab6
      Author: Sean Owen <sowen@cloudera.com>
      Date:   Mon Jan 27 08:05:25 2014 +0000
      
          Style: spaces around binary operators
      
      commit ce2b5b5a4fefa0356875701f668f01f02ba4d87e
      Author: Sean Owen <sowen@cloudera.com>
      Date:   Sun Jan 19 22:50:03 2014 +0000
      
          Generate factors with all positive components, per discussion in https://github.com/apache/incubator-spark/pull/460
      
      commit b6f7a8a61643a8209e8bc662e8e81f2d15c710c7
      Author: Sean Owen <sowen@cloudera.com>
      Date:   Sat Jan 18 15:54:42 2014 +0000
      
          Choose initial user/item vectors uniformly on the unit sphere rather than within the unit square to possibly avoid bias in the initial state and improve convergence
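A minimal sketch of the sampling scheme described above (hypothetical helper, not the PR's exact code):

    import java.util.Random

    // Draw i.i.d. Gaussian components and normalize: the direction of such a
    // vector is uniformly distributed on the unit sphere. Wrapping
    // nextGaussian() in math.abs would keep all components positive, as the
    // merged version does per the commit list above.
    def randomUnitVector(rank: Int, rand: Random): Array[Double] = {
      val v = Array.fill(rank)(rand.nextGaussian())
      val norm = math.sqrt(v.map(x => x * x).sum)
      v.map(_ / norm)
    }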
  2. Jan 26, 2014
  3. Jan 25, 2014
• Fix ClassCastException in JavaPairRDD.collectAsMap() (SPARK-1040) · 740e865f
      Josh Rosen authored
      This fixes an issue where collectAsMap() could
      fail when called on a JavaPairRDD that was derived
      by transforming a non-JavaPairRDD.
      
      The root problem was that we were creating the
      JavaPairRDD's ClassTag by casting a
      ClassTag[AnyRef] to a ClassTag[Tuple2[K2, V2]].
      To fix this, I cast a ClassTag[Tuple2[_, _]]
      instead, since this actually produces a ClassTag
      of the appropriate type because ClassTags don't
      capture type parameters:
      
      scala> implicitly[ClassTag[Tuple2[_, _]]] == implicitly[ClassTag[Tuple2[Int, Int]]]
      res8: Boolean = true
      
      scala> implicitly[ClassTag[AnyRef]].asInstanceOf[ClassTag[Tuple2[Int, Int]]] == implicitly[ClassTag[Tuple2[Int, Int]]]
      res9: Boolean = false
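A sketch of the core idea (hypothetical helper name):

    import scala.reflect.{ClassTag, classTag}

    // Because ClassTags don't capture type parameters, a tag created for
    // Tuple2[_, _] is a correct tag for any Tuple2[K2, V2]: its runtimeClass
    // is classOf[Tuple2[_, _]], which is what array creation needs.
    def tupleClassTag[K2, V2]: ClassTag[(K2, V2)] =
      classTag[Tuple2[_, _]].asInstanceOf[ClassTag[(K2, V2)]]

    // A ClassTag[AnyRef] cast the same way reports java.lang.Object instead,
    // which later causes the ClassCastException in collectAsMap().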
• Increase JUnit test verbosity under SBT. · 531d9d75
      Josh Rosen authored
      Upgrade junit-interface plugin from 0.9 to 0.10.
      
      I noticed that the JavaAPISuite tests didn't
      appear to display any output locally or under
      Jenkins, making it difficult to know whether they
      were running.  This change increases the verbosity
      to more closely match the ScalaTest tests.
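In sbt terms, the change is along these lines (assumed form for a build.sbt; per the junit-interface docs, -v logs test names as they run and -a shows assertion stack traces):

    // Upgrade the JUnit adapter and bring its output closer to ScalaTest's.
    libraryDependencies += "com.novocode" % "junit-interface" % "0.10" % "test"
    testOptions += Tests.Argument(TestFrameworks.JUnit, "-v", "-a")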
  4. Jan 23, 2014
  5. Jan 22, 2014
• Merge pull request #496 from pwendell/master · a1cd1851
      Patrick Wendell authored
      Fix bug in worker clean-up in UI
      
      Introduced in d5a96fec (/cc @aarondav).
      
      This should be picked into 0.8 and 0.9 as well. The bug causes old (zombie) workers on a node to not disappear immediately from the UI when a new one registers.
• Merge pull request #447 from CodingCat/SPARK-1027 · 034dce2a
      Patrick Wendell authored
      fix for SPARK-1027
      
      fix for SPARK-1027  (https://spark-project.atlassian.net/browse/SPARK-1027)
      
      FIXES
      
1. Change sparkHome from String to Option[String] in ApplicationDesc (see the sketch below)

2. Remove the sparkHome parameter from the LaunchExecutor message

3. Adjust the involved files
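A sketch of fix 1 (assumed reduced shape of the class; other fields elided):

    // None now means "fall back to the worker's own Spark home" instead of
    // requiring the driver to know the executor-side path.
    case class ApplicationDesc(
        name: String,
        sparkHome: Option[String]) // was: sparkHome: String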
• Fix bug in worker clean-up in UI · 62855131
      Patrick Wendell authored
      Introduced in d5a96fec. This should be picked into 0.8 and 0.9 as well.
• refactor sparkHome to val · 2b3c4614
      CodingCat authored
      clean code
• Merge pull request #495 from srowen/GraphXCommonsMathDependency · 3184facd
      Patrick Wendell authored
      Fix graphx Commons Math dependency
      
`graphx` depends on Commons Math (2.x) in `SVDPlusPlus.scala`, but the module doesn't declare this dependency. It happens to work only because Commons Math is pulled in transitively by the Hadoop artifacts. That stopped being true as of a month or so ago: building against recent Hadoop fails. (That's how we noticed.)
      
The simple fix is to declare the dependency, as it should be. But it's also worth noting that `commons-math` is the old-ish 2.x line, while `commons-math3` is where the newer 3.x releases live. It's a drop-in replacement, just a different artifact and package name. Changing this single usage to `commons-math3` works and tests pass, which isn't surprising, so it is probably also worth changing. (A comment in some test code also references `commons-math3`, FWIW.)
      
      It does raise another question though: `mllib` looks like it uses the `jblas` `DoubleMatrix` for general purpose vector/matrix stuff. Should `graphx` really use Commons Math for this? Beyond the tiny scope here but worth asking.
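The fix, expressed as an sbt dependency (version assumed; the Maven build needs a matching <dependency> entry):

    // Declare the dependency explicitly instead of relying on Hadoop's
    // transitive copy, and move to the maintained 3.x line.
    libraryDependencies += "org.apache.commons" % "commons-math3" % "3.2"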
• 4476398f
• Merge pull request #492 from skicavs/master · a1238bb5
      Patrick Wendell authored
      fixed job name and usage information for the JavaSparkPi example
• Depend on Commons Math explicitly instead of accidentally getting it from... · fd0c5b8c
      Sean Owen authored
      Depend on Commons Math explicitly instead of accidentally getting it from Hadoop (which stops working in 2.2.x) and also use the newer commons-math3
• Merge pull request #478 from sryza/sandy-spark-1033 · 576c4a4c
      Patrick Wendell authored
      SPARK-1033. Ask for cores in Yarn container requests
      
      Tested on a pseudo-distributed cluster against the Fair Scheduler and observed a worker taking more than a single core.
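Roughly what asking for cores looks like against the YARN records API (assumed shape, not the PR's exact code; values are hypothetical):

    import org.apache.hadoop.yarn.api.records.Resource
    import org.apache.hadoop.yarn.util.Records

    val executorMemory = 1024 // MB
    val executorCores = 2

    // Request CPU capacity alongside memory so schedulers that account for
    // cores (e.g. the Fair Scheduler) can enforce them.
    val capability = Records.newRecord(classOf[Resource])
    capability.setMemory(executorMemory)
    capability.setVirtualCores(executorCores)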
• Merge pull request #493 from kayousterhout/double_add · 5bcfd798
      Matei Zaharia authored
      Fixed bug where task set managers are added to queue twice
      
      @mateiz can you verify that this is a bug and wasn't intentional? (https://github.com/apache/incubator-spark/commit/90a04dab8d9a2a9a372cea7cdf46cc0fd0f2f76c#diff-7fa4f84a961750c374f2120ca70e96edR551)
      
      This bug leads to a small performance hit because task
      set managers will get offered each rejected resource
      offer twice, but doesn't lead to any incorrect functionality.
      
      Thanks to @hdc1112 for pointing this out.
• Merge pull request #315 from rezazadeh/sparsesvd · d009b17d
      Matei Zaharia authored
      Sparse SVD
      
      # Singular Value Decomposition
      Given an *m x n* matrix *A*, compute matrices *U, S, V* such that
      
      *A = U * S * V^T*
      
      There is no restriction on m, but we require n^2 doubles to fit in memory.
      Further, n should be less than m.
      
The decomposition is computed by first forming *A^T A = V S^2 V^T*,
computing the SVD locally on that (since *n x n* is small),
from which we recover *S* and *V*.
Then we compute *U* via a simple matrix multiplication
as *U = A * V * S^-1*.
      
Only the singular vectors associated with the largest k singular values are recovered.
If there are k such values, then the dimensions of the return will be:
      
* *S* is *k x k* and diagonal, holding the singular values on the diagonal.
* *U* is *m x k* and satisfies *U^T U = eye(k)*.
* *V* is *n x k* and satisfies *V^T V = eye(k)*.
      
All input and output is expected in sparse matrix format: 0-indexed
tuples of the form ((i,j), value), all in RDDs.
      
      # Testing
      Tests included. They test:
      - Decomposition promise (A = USV^T)
      - For small matrices, output is compared to that of jblas
      - Rank 1 matrix test included
      - Full Rank matrix test included
      - Middle-rank matrix forced via k included
      
      # Example Usage
      
      import org.apache.spark.SparkContext
      import org.apache.spark.mllib.linalg.SVD
      import org.apache.spark.mllib.linalg.SparseMatrix
import org.apache.spark.mllib.linalg.MatrixEntry
      
      // Load and parse the data file
      val data = sc.textFile("mllib/data/als/test.data").map { line =>
            val parts = line.split(',')
            MatrixEntry(parts(0).toInt, parts(1).toInt, parts(2).toDouble)
      }
      val m = 4
      val n = 4
      
      // recover top 1 singular vector
      val decomposed = SVD.sparseSVD(SparseMatrix(data, m, n), 1)
      
      println("singular values = " + decomposed.S.data.toArray.mkString)
      
      # Documentation
      Added to docs/mllib-guide.md
• Fixed bug where task set managers are added to queue twice · 19da82c5
      Kay Ousterhout authored
      This bug leads to a small performance hit because task
      set managers will get offered each rejected resource
      offer twice, but doesn't lead to any incorrect functionality.