  1. Feb 26, 2014
    • Updated link for pyspark examples in docs · 26450351
      Jyotiska NK authored
      Author: Jyotiska NK <jyotiska123@gmail.com>
      
      Closes #22 from jyotiska/pyspark_docs and squashes the following commits:
      
      426136c [Jyotiska NK] Updated link for pyspark examples
    • Deprecated and added a few java api methods for corresponding scala api. · 0e40e2b1
      Prashant Sharma authored
      PR [402](https://github.com/apache/incubator-spark/pull/402) from the incubator repo.
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #19 from ScrapCodes/java-api-completeness and squashes the following commits:
      
      11d0c2b [Prashant Sharma] Integer -> java.lang.Integer
      737819a [Prashant Sharma] SPARK-1095 add explicit return types to APIs.
      3ddc8bb [Prashant Sharma] Deprected *With functions in scala and added a few missing Java APIs
    • Removed reference to incubation in README.md. · 84f7ca13
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1 from rxin/readme and squashes the following commits:
      
      b3a77cd [Reynold Xin] Removed reference to incubation in README.md.
    • SPARK-1115: Catch depickling errors · 12738c1a
      Bouke van der Bijl authored
      This surrounds the complete worker code in a try/except block so we catch any error that occurs. An example would be depickling failing for some reason.
      
      @JoshRosen
      
      Author: Bouke van der Bijl <boukevanderbijl@gmail.com>
      
      Closes #644 from bouk/catch-depickling-errors and squashes the following commits:
      
      f0f67cc [Bouke van der Bijl] Lol indentation
      0e4d504 [Bouke van der Bijl] Surround the complete python worker with the try block
    • SPARK-1135: fix broken anchors in docs · c86eec58
      Matei Zaharia authored
      A recent PR that added Java vs Scala tabs for streaming also
      inadvertently added some bad code to a document.ready handler, breaking
      our other handler that manages scrolling to anchors correctly with the
      floating top bar. As a result the section title ended up always being
      hidden below the top bar. This removes the unnecessary JavaScript code.
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #3 from mateiz/doc-links and squashes the following commits:
      
      e2a3488 [Matei Zaharia] SPARK-1135: fix broken anchors in docs
    • SPARK-1078: Replace lift-json with json4s-jackson. · fbedc8ef
      William Benton authored
      The aim of the Json4s project is to provide a common API for
      Scala JSON libraries.  It is Apache-licensed, easier for
      downstream distributions to package, and mostly API-compatible
      with lift-json.  Furthermore, the Jackson-backed implementation
      parses faster than lift-json on all but the smallest inputs.
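      For reference, a small usage sketch of the json4s-jackson API (not code from the patch; the JSON value and names here are made up):
      
      ```
      import org.json4s._
      import org.json4s.jackson.JsonMethods._
      
      object Json4sSketch extends App {
        implicit val formats: Formats = DefaultFormats
      
        // Parse with the Jackson-backed parser, then query with the lift-json-style DSL.
        val json: JValue = parse("""{"name": "spark", "cores": 8}""")
        println((json \ "name").extract[String])   // spark
        println(compact(render(json)))             // {"name":"spark","cores":8}
      }
      ```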
      
      Author: William Benton <willb@redhat.com>
      
      Closes #582 from willb/json4s and squashes the following commits:
      
      7ca62c4 [William Benton] Replace lift-json with json4s-jackson.
    • SPARK-1053. Don't require SPARK_YARN_APP_JAR · b8a18719
      Sandy Ryza authored
      It looks like this just requires taking out the checks.
      
      I verified that, with the patch, I was able to run spark-shell through yarn without setting the environment variable.
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #553 from sryza/sandy-spark-1053 and squashes the following commits:
      
      b037676 [Sandy Ryza] SPARK-1053.  Don't require SPARK_YARN_APP_JAR
  2. Feb 25, 2014
    • For SPARK-1082, Use Curator for ZK interaction in standalone cluster · c852201c
      Raymond Liu authored
      Author: Raymond Liu <raymond.liu@intel.com>
      
      Closes #611 from colorant/curator and squashes the following commits:
      
      7556aa1 [Raymond Liu] Address review comments
      af92e1f [Raymond Liu] Fix coding style
      964f3c2 [Raymond Liu] Ignore NodeExists exception
      6df2966 [Raymond Liu] Rewrite zookeeper client code with curator
    • Graph primitives2 · 1f4c7f7e
      Semih Salihoglu authored
      Hi guys,
      
      I'm following Joey and Ankur's suggestions to add collectEdges and pickRandomVertex. I'm also adding tests for collectEdges and refactoring the getCycleGraph method in GraphOpsSuite.scala.
      
      Thank you,
      
      semih
      
      Author: Semih Salihoglu <semihsalihoglu@gmail.com>
      
      Closes #580 from semihsalihoglu/GraphPrimitives2 and squashes the following commits:
      
      937d3ec [Semih Salihoglu] - Fixed the scalastyle errors.
      a69a152 [Semih Salihoglu] - Adding collectEdges and pickRandomVertices. - Adding tests for collectEdges. - Refactoring a getCycle utility function for GraphOpsSuite.scala.
      41265a6 [Semih Salihoglu] - Adding collectEdges and pickRandomVertex. - Adding tests for collectEdges. - Recycling a getCycle utility test file.
  3. Feb 24, 2014
    • Include reference to twitter/chill in tuning docs · a4f4fbc8
      Andrew Ash authored
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #647 from ash211/doc-tuning and squashes the following commits:
      
      b87de0a [Andrew Ash] Include reference to twitter/chill in tuning docs
    • For outputformats that are Configurable, call setConf before sending data to them. · 4d880304
      Bryn Keller authored
      [SPARK-1108] This allows us to use, e.g., HBase's TableOutputFormat with PairRDDFunctions.saveAsNewAPIHadoopFile, which would otherwise throw a NullPointerException because the output table name hasn't been configured.
      
      Note this bug also affects branch-0.9
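      A minimal sketch of the idea (illustrative names, relying on Hadoop's Configurable interface; this is not the exact code from the patch):
      
      ```
      import org.apache.hadoop.conf.{Configurable, Configuration}
      import org.apache.hadoop.mapreduce.OutputFormat
      
      // If the instantiated OutputFormat also implements Configurable, hand it the job's
      // Configuration before any records are written (e.g. so TableOutputFormat can learn
      // its output table name).
      def prepareOutputFormat[K, V](
          outputFormatClass: Class[_ <: OutputFormat[K, V]],
          conf: Configuration): OutputFormat[K, V] = {
        val format = outputFormatClass.newInstance()
        format match {
          case c: Configurable => c.setConf(conf)
          case _ => // nothing to configure
        }
        format
      }
      ```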
      
      Author: Bryn Keller <bryn.keller@intel.com>
      
      Closes #638 from xoltar/SPARK-1108 and squashes the following commits:
      
      7e94e7d [Bryn Keller] Import, comment, and format cleanup per code review
      7cbcaa1 [Bryn Keller] For outputformats that are Configurable, call setConf before sending data to them. This allows us to use, e.g. HBase TableOutputFormat, which otherwise would throw NullPointerException because the output table name hasn't been configured
    • Merge pull request #641 from mateiz/spark-1124-master · d8d190ef
      Matei Zaharia authored
      SPARK-1124: Fix infinite retries of reduce stage when a map stage failed
      
      In the previous code, if you had a failing map stage and then tried to run reduce stages on it repeatedly, the first reduce stage would fail correctly, but the later ones would mistakenly believe that all map outputs are available and start failing infinitely with fetch failures from "null". See https://spark-project.atlassian.net/browse/SPARK-1124 for an example.
      
      This PR also cleans up code style slightly where there was a variable named "s" and some weird map manipulation.
    • Fix removal from shuffleToMapStage to search for a key-value pair with · 0187cef0
      Matei Zaharia authored
      our stage instead of using our shuffleID.
    • SPARK-1124: Fix infinite retries of reduce stage when a map stage failed · cd32d5e4
      Matei Zaharia authored
      In the previous code, if you had a failing map stage and then tried to
      run reduce stages on it repeatedly, the first reduce stage would fail
      correctly, but the later ones would mistakenly believe that all map
      outputs are available and start failing infinitely with fetch failures
      from "null".
  4. Feb 23, 2014
    • SPARK-1071: Tidy logging strategy and use of log4j · c0ef3afa
      Sean Owen authored
      Prompted by a recent thread on the mailing list, I tried and failed to see if Spark can be made independent of log4j. There are a few cases where control of the underlying logging is pretty useful, and to do that, you have to bind to a specific logger.
      
      Instead I propose some tidying that leaves Spark's use of log4j, but gets rid of warnings and should still enable downstream users to switch. The idea is to pipe everything (except log4j) through SLF4J, and have Spark use SLF4J directly when logging, and where Spark needs to output info (REPL and tests), bind from SLF4J to log4j.
      
      This leaves the same behavior in Spark. It means that downstream users who want to use something other than log4j should:
      
      - Exclude dependencies on log4j, slf4j-log4j12 from Spark
      - Include dependency on log4j-over-slf4j
      - Include dependency on another logger X, and another slf4j-X
      - Recreate any log config that Spark sets up, as needed, in the other logger's config (see the sketch below)
      
      That sounds about right.
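      For illustration, a downstream sbt build that follows those steps might look roughly like this (logback is assumed here as the replacement logger; the Spark artifact name and versions are illustrative):
      
      ```
      // build.sbt (sketch)
      libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating" excludeAll(
        ExclusionRule(organization = "log4j"),
        ExclusionRule(organization = "org.slf4j", name = "slf4j-log4j12")
      )
      
      libraryDependencies ++= Seq(
        "org.slf4j"      % "log4j-over-slf4j" % "1.7.5",   // route log4j calls into SLF4J
        "ch.qos.logback" % "logback-classic"  % "1.0.13"   // logback binds to SLF4J natively
      )
      ```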
      
      Here are the key changes:
      
      - Include the jcl-over-slf4j shim everywhere by depending on it in core.
      - Exclude dependencies on commons-logging from third-party libraries.
      - Include the jul-to-slf4j shim everywhere by depending on it in core.
      - Exclude slf4j-* dependencies from third-party libraries to prevent collision or warnings
      - Added missing slf4j-log4j12 binding to GraphX, Bagel module tests
      
      And minor/incidental changes:
      
      - Update to SLF4J 1.7.5, which happily matches Hadoop 2’s version and is a recommended update over 1.7.2
      - (Remove a duplicate HBase dependency declaration in SparkBuild.scala)
      - (Remove a duplicate mockito dependency declaration that was causing warnings and bugging me)
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #570 from srowen/SPARK-1071 and squashes the following commits:
      
      52eac9f [Sean Owen] Add slf4j-over-log4j12 dependency to core (non-test) and remove it from things that depend on core.
      77a7fa9 [Sean Owen] SPARK-1071: Tidy logging strategy and use of log4j
  5. Feb 22, 2014
    • [SPARK-1041] remove dead code in start script, remind user to set that in spark-env.sh · 437b62fc
      CodingCat authored
      the lines in start-master.sh and start-slave.sh no longer work
      
      in ec2, the host name has changed, e.g.
      
      ubuntu@ip-172-31-36-93:~$ hostname
      ip-172-31-36-93
      
      also, the URL to fetch public DNS name also changed, e.g.
      
      ubuntu@ip-172-31-36-93:~$ wget -q -O - http://instance-data.ec2.internal/latest/meta-data/public-hostname
      ubuntu@ip-172-31-36-93:~$  (returns nothing)
      
      Since we have the spark-ec2 project, we don't need such EC2-specific lines here; instead, users only need to set this in spark-env.sh.
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #588 from CodingCat/deadcode_in_sbin and squashes the following commits:
      
      e4236e0 [CodingCat] remove dead code in start script, remind user set that in spark-env.sh
    • Migrate Java code to Scala or move it to src/main/java · 29ac7ea5
      Punya Biswal authored
      These classes can't be migrated:
        StorageLevels: impossible to create static fields in Scala
        JavaSparkContextVarargsWorkaround: incompatible varargs
        JavaAPISuite: should test Java APIs in pure Java (for sanity)
      
      Author: Punya Biswal <pbiswal@palantir.com>
      
      Closes #605 from punya/move-java-sources and squashes the following commits:
      
      25b00b2 [Punya Biswal] Remove redundant type param; reformat
      853da46 [Punya Biswal] Use factory method rather than constructor
      e5d53d9 [Punya Biswal] Migrate Java code to Scala or move it to src/main/java
    • [SPARK-1055] fix the SCALA_VERSION and SPARK_VERSION in docker file · 1aa4f8af
      CodingCat authored
      As reported in https://spark-project.atlassian.net/browse/SPARK-1055
      
      "The used Spark version in the .../base/Dockerfile is stale on 0.8.1 and should be updated to 0.9.x to match the release."
      
      Author: CodingCat <zhunansjtu@gmail.com>
      Author: Nan Zhu <CodingCat@users.noreply.github.com>
      
      Closes #634 from CodingCat/SPARK-1055 and squashes the following commits:
      
      cb7330e [Nan Zhu] Update Dockerfile
      adf8259 [CodingCat] fix the SCALA_VERSION and SPARK_VERSION in docker file
    • doctest updated for mapValues, flatMapValues in rdd.py · 722199fa
      jyotiska authored
      Updated doctests for mapValues and flatMapValues in rdd.py
      
      Author: jyotiska <jyotiska123@gmail.com>
      
      Closes #621 from jyotiska/python_spark and squashes the following commits:
      
      716f7cd [jyotiska] doctest updated for mapValues, flatMapValues in rdd.py
    • Fixed minor typo in worker.py · 3ff077d4
      jyotiska authored
      Fixed minor typo in worker.py
      
      Author: jyotiska <jyotiska123@gmail.com>
      
      Closes #630 from jyotiska/pyspark_code and squashes the following commits:
      
      ee44201 [jyotiska] typo fixed in worker.py
    • SPARK-1117: update accumulator docs · aaec7d4a
      Xiangrui Meng authored
      The current doc hints that Spark doesn't support accumulators of type `Long`, which is wrong.
      
      JIRA: https://spark-project.atlassian.net/browse/SPARK-1117
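      For example, a `Long` accumulator works out of the box (a small sketch; the local master and values are only illustrative):
      
      ```
      import org.apache.spark.SparkContext
      import org.apache.spark.SparkContext._   // implicit AccumulatorParam instances
      
      val sc = new SparkContext("local", "long-accumulator-demo")
      val total = sc.accumulator(0L)           // accumulator of type Long
      sc.parallelize(1L to 100L).foreach(n => total += n)
      println(total.value)                     // 5050
      sc.stop()
      ```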
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #631 from mengxr/acc and squashes the following commits:
      
      45ecd25 [Xiangrui Meng] update accumulator docs
  6. Feb 21, 2014
    • [SPARK-1113] External spilling - fix Int.MaxValue hash code collision bug · fefd22f4
      Andrew Or authored
      The original poster of this bug is @guojc, who opened a PR that preceded this one at https://github.com/apache/incubator-spark/pull/612.
      
      ExternalAppendOnlyMap uses key hash code to order the buffer streams from which spilled files are read back into memory. When a buffer stream is empty, the default hash code for that stream is equal to Int.MaxValue. This is, however, a perfectly legitimate candidate for a key hash code. When reading from a spilled map containing such a key, a hash collision may occur, in which case we attempt to read from an empty stream and throw NoSuchElementException.
      
      The fix is to maintain the invariant that empty buffer streams are never added back to the merge queue. This guarantees that we never read from an empty buffer stream again.
      
      This PR also includes two new tests for hash collisions.
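      A self-contained illustration of that invariant (this is not the actual ExternalAppendOnlyMap code; the keys and values are made up):
      
      ```
      import scala.collection.mutable
      
      object MergeSketch {
        // Streams are ordered by the hash of their next key; an exhausted stream is simply
        // dropped from the heap rather than kept around with a sentinel hash such as
        // Int.MaxValue, so a real key that happens to hash to Int.MaxValue cannot collide.
        final case class StreamBuffer(iter: BufferedIterator[(String, Int)])
      
        def mergeByKeyHash(streams: Seq[StreamBuffer]): Seq[(String, Int)] = {
          // PriorityQueue is a max-heap, so reverse the ordering to pop the smallest hash first.
          implicit val byHeadHash: Ordering[StreamBuffer] =
            Ordering.by[StreamBuffer, Int](_.iter.head._1.hashCode).reverse
          val heap = mutable.PriorityQueue(streams.filter(_.iter.hasNext): _*)
          val out = mutable.ArrayBuffer.empty[(String, Int)]
          while (heap.nonEmpty) {
            val s = heap.dequeue()
            out += s.iter.next()
            if (s.iter.hasNext) heap.enqueue(s)   // never re-enqueue an empty stream
          }
          out
        }
      
        def main(args: Array[String]): Unit = {
          val a = StreamBuffer(Iterator("apple" -> 1, "pear" -> 2).buffered)
          val b = StreamBuffer(Iterator("plum" -> 3).buffered)
          mergeByKeyHash(Seq(a, b)).foreach(println)
        }
      }
      ```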
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #624 from andrewor14/spilling-bug and squashes the following commits:
      
      9e7263d [Andrew Or] Slightly optimize next()
      2037ae2 [Andrew Or] Move a few comments around...
      cf95942 [Andrew Or] Remove default value of Int.MaxValue for minKeyHash
      c11f03b [Andrew Or] Fix Int.MaxValue hash collision bug in ExternalAppendOnlyMap
      21c1a39 [Andrew Or] Add hash collision tests to ExternalAppendOnlyMapSuite
    • MLLIB-25: Implicit ALS runs out of memory for moderately large numbers of features · c8a4c9b1
      Sean Owen authored
      There's a step in implicit ALS where the matrix `Yt * Y` is computed. It's computed as the sum of matrices; an f x f matrix is created for each of n user/item rows in a partition. In `ALS.scala:214`:
      
      ```
              factors.flatMapValues{ case factorArray =>
                factorArray.map{ vector =>
                  val x = new DoubleMatrix(vector)
                  x.mmul(x.transpose())
                }
              }.reduceByKeyLocally((a, b) => a.addi(b))
               .values
               .reduce((a, b) => a.addi(b))
      ```
      
      Completely correct, but there's a subtle yet quite large memory problem here: map() is going to create all of these matrices in memory at once, even though they never need to all exist at the same time.
      For example, if a partition has n = 100000 rows, and f = 200, then this intermediate product requires 32GB of heap. The computation will never work unless you can cough up workers with (more than) that much heap.
      
      Fortunately there's a trivial change that fixes it; just add `.view` in there.
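      Roughly, with that change applied, the expression above becomes (sketch):
      
      ```
              factors.flatMapValues { case factorArray =>
                factorArray.view.map { vector =>   // .view: build each f x f product lazily
                  val x = new DoubleMatrix(vector)
                  x.mmul(x.transpose())
                }
              }.reduceByKeyLocally((a, b) => a.addi(b))
               .values
               .reduce((a, b) => a.addi(b))
      ```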
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #629 from srowen/ALSMatrixAllocationOptimization and squashes the following commits:
      
      062cda9 [Sean Owen] Update style per review comments
      e9a5d63 [Sean Owen] Avoid unnecessary out of memory situation by not simultaneously allocating lots of matrices
    • SPARK-1111: URL Validation Throws Error for HDFS URL's · 45b15e27
      Patrick Wendell authored
      Fixes an error where HDFS URLs cause an exception. Should be merged into master and 0.9.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #625 from pwendell/url-validation and squashes the following commits:
      
      d14bfe3 [Patrick Wendell] SPARK-1111: URL Validation Throws Error for HDFS URL's
  7. Feb 20, 2014
    • SPARK-1114: Allow PySpark to use existing JVM and Gateway · 59b13795
      Ahir Reddy authored
      Patch to allow PySpark to use an existing JVM and Gateway. Changes the PySpark implementation of SparkConf to take an existing SparkConf JVM handle, and changes the PySpark SparkContext to allow subclass-specific context initialization.
      
      Author: Ahir Reddy <ahirreddy@gmail.com>
      
      Closes #622 from ahirreddy/pyspark-existing-jvm and squashes the following commits:
      
      a86f457 [Ahir Reddy] Patch to allow PySpark to use existing JVM and Gateway. Changes to PySpark implementation of SparkConf to take existing SparkConf JVM handle. Change to PySpark SparkContext to allow subclass specific context initialization.
    • Super minor: Add require for mergeCombiners in combineByKey · 3fede483
      Aaron Davidson authored
      We changed the behavior in 0.9.0 from requiring that mergeCombiners be null when mapSideCombine was false to requiring that mergeCombiners *never* be null, for external sorting. This patch adds a require() so that this behavior change produces an explicit message rather than an NPE.
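      A minimal sketch of the guard, on a simplified stand-in for combineByKey's signature rather than the real method:
      
      ```
      // Fail fast with a clear message instead of a NullPointerException later on.
      def combineByKeySketch[K, V, C](
          createCombiner: V => C,
          mergeValue: (C, V) => C,
          mergeCombiners: (C, C) => C): Unit = {
        require(mergeCombiners != null, "mergeCombiners must be defined")
      }
      ```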
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #623 from aarondav/master and squashes the following commits:
      
      520b80c [Aaron Davidson] Super minor: Add require for mergeCombiners in combineByKey
    • MLLIB-22. Support negative implicit input in ALS · 9e63f80e
      Sean Owen authored
      I'm back with another less trivial suggestion for ALS:
      
      In ALS for implicit feedback, input values are treated as weights on squared-errors in a loss function (or rather, the weight is a simple function of the input r, like c = 1 + alpha*r). The paper on which it's based assumes that the input is positive. Indeed, if the input is negative, it will create a negative weight on squared-errors, which causes things to go haywire. The optimization will try to make the error in a cell as large as possible, and the result is silently bogus.
      
      There is a good use case for negative input values, though. Implicit feedback is usually collected from signals of positive interaction like a view, a like, or a buy, but it can equally come from "not interested" signals. The natural representation is negative values.
      
      The algorithm can be extended quite simply to provide a sound interpretation of these values: negative values should encourage the factorization to come up with 0 for cells with large negative input values, just as much as positive values encourage it to come up with 1.
      
      The implications for the algorithm are simple:
      * the confidence function value must not be negative, and so can become 1 + alpha*|r|
      * the matrix P should have a value 1 where the input R is _positive_, not merely where it is non-zero. Actually, that's what the paper already says; it's just that we can no longer assume P = 1 wherever a cell in R is specified, since it may be negative (a small sketch of this mapping follows this list)
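      A tiny sketch of that mapping (illustrative only, not the patch itself):
      
      ```
      // Confidence uses |r| so it can never go negative; preference is 1 only for positive input.
      def confidence(r: Double, alpha: Double): Double = 1.0 + alpha * math.abs(r)
      def preference(r: Double): Double = if (r > 0) 1.0 else 0.0
      ```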
      
      This in turn entails just a few lines of code change in `ALS.scala`:
      * `rs(i)` becomes `abs(rs(i))`
      * When constructing `userXy(us(i))`, it's implicitly only adding where P is 1. That had been true before for any us(i) that is iterated over, since those were exactly the ones for which P is 1. But now P is zero where rs(i) <= 0, and those should not be added
      
      I think it's a safe change because:
      * It doesn't change any existing behavior (unless you're using negative values, in which case results are already borked)
      * It's the simplest direct extension of the paper's algorithm
      * (I've used it to good effect in production FWIW)
      
      Tests included.
      
      I tweaked minor things en route:
      * `ALS.scala` javadoc writes "R = Xt*Y" where the paper and the rest of the code define it as "R = X*Yt"
      * RMSE in the ALS tests uses a confidence-weighted mean, but the denominator is not actually the sum of weights
      
      Excuse my Scala style; I'm sure it needs tweaks.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #500 from srowen/ALSNegativeImplicitInput and squashes the following commits:
      
      cf902a9 [Sean Owen] Support negative implicit input in ALS
      953be1c [Sean Owen] Make weighted RMSE in ALS test actually weighted; adjust comment about R = X*Yt
    • MLLIB-24: url of "Collaborative Filtering for Implicit Feedback Datasets" in ALS is invalid now · f9b7d64a
      Chen Chao authored
      The URL for "Collaborative Filtering for Implicit Feedback Datasets" is now invalid. A new URL is provided: http://research.yahoo.com/files/HuKorenVolinsky-ICDM08.pdf
      
      Author: Chen Chao <crazyjvm@gmail.com>
      
      Closes #619 from CrazyJvm/master and squashes the following commits:
      
      a0b54e4 [Chen Chao] change url to IEEE
      9e0e9f0 [Chen Chao] correct spell mistale
      fcfab5d [Chen Chao] wrap line to to fit within 100 chars
      590d56e [Chen Chao] url error
  8. Feb 17, 2014
    • SPARK-1098: Minor cleanup of ClassTag usage in Java API · f74ae0eb
      Aaron Davidson authored
      Our usage of fake ClassTags in this manner is probably not healthy, but I'm not sure if there's a better solution available, so I just cleaned up and documented the current one.
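      The "fake ClassTag" pattern being referred to looks roughly like this (sketch; the helper name is illustrative, not taken from the patch):
      
      ```
      import scala.reflect.ClassTag
      
      // Since the Java API erases element types anyway, a ClassTag for AnyRef is cast
      // and used wherever a Scala method demands a ClassTag[T].
      def fakeClassTag[T]: ClassTag[T] = ClassTag.AnyRef.asInstanceOf[ClassTag[T]]
      ```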
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #604 from aarondav/master and squashes the following commits:
      
      b398e89 [Aaron Davidson] SPARK-1098: Minor cleanup of ClassTag usage in Java API