  1. Dec 29, 2014
    • SPARK-4156 [MLLIB] EM algorithm for GMMs · 6cf6fdf3
      Travis Galoppo authored
      Implementation of Expectation-Maximization for Gaussian Mixture Models.
      
      This is my maiden contribution to Apache Spark, so I apologize now if I have done anything incorrectly; having said that, this work is my own, and I offer it to the project under the project's open source license.
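
      For context, EM for a GMM alternates an E-step (compute each point's membership weights under the current Gaussians) and an M-step (re-estimate mixture weights, means, and variances from those memberships). A minimal single-machine sketch for a 1-D mixture, illustrative only and not the MLlib implementation:

      ```Scala
      object GmmEmSketch {
        case class Gaussian(weight: Double, mean: Double, variance: Double) {
          def pdf(x: Double): Double =
            math.exp(-(x - mean) * (x - mean) / (2 * variance)) /
              math.sqrt(2 * math.Pi * variance)
        }

        // One EM iteration for a k-component 1-D mixture.
        def step(data: Array[Double], gs: Array[Gaussian]): Array[Gaussian] = {
          // E-step: per-point responsibilities, normalized across components
          val resp = data.map { x =>
            val w = gs.map(g => g.weight * g.pdf(x))
            w.map(_ / w.sum)
          }
          // M-step: re-estimate weight, mean, and variance per component
          gs.indices.toArray.map { j =>
            val rj = resp.map(_(j))
            val nj = rj.sum
            val mean = rj.zip(data).map { case (r, x) => r * x }.sum / nj
            val variance = rj.zip(data).map { case (r, x) => r * (x - mean) * (x - mean) }.sum / nj
            Gaussian(nj / data.length, mean, variance)
          }
        }

        def main(args: Array[String]): Unit = {
          val data = Array(0.9, 1.1, 1.0, 4.8, 5.2, 5.0)
          var gs = Array(Gaussian(0.5, 0.0, 1.0), Gaussian(0.5, 6.0, 1.0))
          for (_ <- 1 to 50) gs = step(data, gs)
          gs.foreach(println) // converges near means 1.0 and 5.0
        }
      }
      ```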
      
      Author: Travis Galoppo <tjg2107@columbia.edu>
      Author: Travis Galoppo <travis@localhost.localdomain>
      Author: tgaloppo <tjg2107@columbia.edu>
      Author: FlytxtRnD <meethu.mathew@flytxt.com>
      
      Closes #3022 from tgaloppo/master and squashes the following commits:
      
      aaa8f25 [Travis Galoppo] MLUtils: changed privacy of EPSILON from [util] to [mllib]
      709e4bf [Travis Galoppo] fixed usage line to include optional maxIterations parameter
      acf1fba [Travis Galoppo] Fixed parameter comment in GaussianMixtureModel; made maximum iterations an optional parameter to DenseGmmEM
      9b2fc2a [Travis Galoppo] Style improvements; changed ExpectationSum to a private class
      b97fe00 [Travis Galoppo] Minor fixes and tweaks.
      1de73f3 [Travis Galoppo] Removed redundant array from array creation
      578c2d1 [Travis Galoppo] Removed unused import
      227ad66 [Travis Galoppo] Moved prediction methods into model class.
      308c8ad [Travis Galoppo] Numerous changes to improve code
      cff73e0 [Travis Galoppo] Replaced accumulators with RDD.aggregate
      20ebca1 [Travis Galoppo] Removed unused code
      42b2142 [Travis Galoppo] Added functionality to allow setting of GMM starting point. Added a two-cluster test to the test suite.
      8b633f3 [Travis Galoppo] Style issue
      9be2534 [Travis Galoppo] Style issue
      d695034 [Travis Galoppo] Fixed style issues
      c3b8ce0 [Travis Galoppo] Merge branch 'master' of https://github.com/tgaloppo/spark; adds predict() method
      2df336b [Travis Galoppo] Fixed style issue
      b99ecc4 [tgaloppo] Merge pull request #1 from FlytxtRnD/predictBranch
      f407b4c [FlytxtRnD] Added predict() to return the cluster labels and membership values
      97044cf [Travis Galoppo] Fixed style issues
      dc9c742 [Travis Galoppo] Moved MultivariateGaussian utility class
      e7d413b [Travis Galoppo] Moved multivariate Gaussian utility class to mllib/stat/impl; improved comments
      9770261 [Travis Galoppo] Corrected a variety of style and naming issues.
      8aaa17d [Travis Galoppo] Added additional train() method to companion object for cluster count and tolerance parameters.
      676e523 [Travis Galoppo] Fixed to no longer ignore delta value provided on command line
      e6ea805 [Travis Galoppo] Merged with master branch; updated test suite with latest context changes. Improved cluster initialization strategy.
      86fb382 [Travis Galoppo] Merge remote-tracking branch 'upstream/master'
      719d8cc [Travis Galoppo] Added scala test suite with basic test
      c1a8e16 [Travis Galoppo] Made GaussianMixtureModel class serializable; modified sum function for better performance
      5c96c57 [Travis Galoppo] Merge remote-tracking branch 'upstream/master'
      c15405c [Travis Galoppo] SPARK-4156
    • SPARK-4968: takeOrdered to skip reduce step in case mappers return no partitions · 9bc0df68
      Yash Datta authored
      takeOrdered should skip the reduce step when the mapped RDDs have no partitions. This prevents the following exception:
      
      Run the query:
      SELECT * FROM testTable WHERE market = 'market2' ORDER BY End_Time DESC LIMIT 100;
      Error trace:
      java.lang.UnsupportedOperationException: empty collection
      at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:863)
      at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:863)
      at scala.Option.getOrElse(Option.scala:120)
      at org.apache.spark.rdd.RDD.reduce(RDD.scala:863)
      at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1136)
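
      The fix boils down to guarding the final reduce; a sketch of the idea (helper name assumed, not the exact patch):

      ```Scala
      import scala.reflect.ClassTag
      import org.apache.spark.rdd.RDD

      // Short-circuit when the per-partition top-K RDD has no partitions, so
      // RDD.reduce (which throws "empty collection") is never invoked.
      def takeOrderedSafely[T: Ordering: ClassTag](mapRDDs: RDD[Seq[T]], num: Int): Array[T] = {
        if (mapRDDs.partitions.length == 0) {
          Array.empty[T]
        } else {
          mapRDDs.reduce { (a, b) => (a ++ b).sorted.take(num) }.toArray
        }
      }
      ```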
      
      Author: Yash Datta <Yash.Datta@guavus.com>
      
      Closes #3830 from saucam/fix_takeorder and squashes the following commits:
      
      5974d10 [Yash Datta] SPARK-4968: takeOrdered to skip reduce step in case mappers return no partitions
    • [SPARK-4409][MLlib] Additional Linear Algebra Utils · 02b55de3
      Burak Yavuz authored
      Addition of a very limited number of local matrix manipulation and generation methods that would be helpful in the further development of algorithms on top of BlockMatrix (SPARK-3974), such as Randomized SVD and Multi Model Training (SPARK-1486).
      The proposed methods for addition are:
      
      For `Matrix`
       - map: maps the values in the matrix with a given function. Produces a new matrix.
       - update: the values in the matrix are updated with a given function. Occurs in place.
      
      Factory methods for `DenseMatrix`:
       - *zeros: Generate a matrix consisting of zeros
       - *ones: Generate a matrix consisting of ones
       - *eye: Generate an identity matrix
       - *rand: Generate a matrix consisting of i.i.d. uniform random numbers
       - *randn: Generate a matrix consisting of i.i.d. gaussian random numbers
       - *diag: Generate a diagonal matrix from a supplied vector
      *These methods already exist among the factory methods for `Matrices`; however, for cases where we require a `DenseMatrix`, we constantly have to add `.asInstanceOf[DenseMatrix]` everywhere, which makes the code "dirtier". I propose moving these functions to factory methods for `DenseMatrix`, where the output will be a `DenseMatrix`, and having the factory methods for `Matrices` call these functions directly and output a generic `Matrix`.
      
      Factory methods for `SparseMatrix`:
       - speye: Identity matrix in sparse format. Saves a ton of memory when dimensions are large, especially in Multi Model Training, where each row requires being multiplied by a scalar.
       - sprand: Generate a sparse matrix with a given density consisting of i.i.d. uniform random numbers.
       - sprandn: Generate a sparse matrix with a given density consisting of i.i.d. gaussian random numbers.
       - diag: Generate a diagonal matrix from a supplied vector, but is memory efficient, because it just stores the diagonal. Again, very helpful in Multi Model Training.
      
      Factory methods for `Matrices`:
       - Include all the factory methods given above, but return a generic `Matrix` rather than `SparseMatrix` or `DenseMatrix`.
       - horzCat: Horizontally concatenate matrices to form one larger matrix. Very useful in both Multi Model Training, and for the repartitioning of BlockMatrix.
       - vertCat: Vertically concatenate matrices to form one larger matrix. Very useful for the repartitioning of BlockMatrix.
      
      The names for these methods were taken from MATLAB.
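
      A usage sketch, assuming these factory methods land under the names listed above:

      ```Scala
      import org.apache.spark.mllib.linalg.{DenseMatrix, Matrices, Matrix, Vectors}

      // Typed factories: no .asInstanceOf[DenseMatrix] needed
      val zeros = DenseMatrix.zeros(3, 3)  // 3x3 dense zero matrix
      val id    = DenseMatrix.eye(3)       // 3x3 identity

      // Sparse variants save memory when dimensions are large
      val sparseId = Matrices.speye(1000)  // 1000x1000 sparse identity
      val rnd      = Matrices.sprand(100, 100, 0.1, new java.util.Random(42))
      val diag     = Matrices.diag(Vectors.dense(1.0, 2.0, 3.0))

      // Concatenation for block-matrix style repartitioning
      val wide = Matrices.horzcat(Array[Matrix](zeros, id))
      val tall = Matrices.vertcat(Array[Matrix](zeros, id))
      ```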
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3319 from brkyvz/SPARK-4409 and squashes the following commits:
      
      b0354f6 [Burak Yavuz] [SPARK-4409] Incorporated mengxr's code
      04c4829 [Burak Yavuz] Merge pull request #1 from mengxr/SPARK-4409
      80cfa29 [Xiangrui Meng] minor changes
      ecc937a [Xiangrui Meng] update sprand
      4e95e24 [Xiangrui Meng] simplify fromCOO implementation
      10a63a6 [Burak Yavuz] [SPARK-4409] Fourth pass of code review
      f62d6c7 [Burak Yavuz] [SPARK-4409] Modified genRandMatrix
      3971c93 [Burak Yavuz] [SPARK-4409] Third pass of code review
      75239f8 [Burak Yavuz] [SPARK-4409] Second pass of code review
      e4bd0c0 [Burak Yavuz] [SPARK-4409] Modified horzcat and vertcat
      65c562e [Burak Yavuz] [SPARK-4409] Hopefully fixed Java Test
      d8be7bc [Burak Yavuz] [SPARK-4409] Organized imports
      065b531 [Burak Yavuz] [SPARK-4409] First pass after code review
      a8120d2 [Burak Yavuz] [SPARK-4409] Finished updates to API according to SPARK-4614
      f798c82 [Burak Yavuz] [SPARK-4409] Updated API according to SPARK-4614
      c75f3cd [Burak Yavuz] [SPARK-4409] Added JavaAPI Tests, and fixed a couple of bugs
      d662f9d [Burak Yavuz] [SPARK-4409] Modified according to remote repo
      83dfe37 [Burak Yavuz] [SPARK-4409] Scalastyle error fixed
      a14c0da [Burak Yavuz] [SPARK-4409] Initial commit to add methods
    • [Minor] Fix a typo in a type parameter in JavaUtils.scala · 8d72341a
      Kousuke Saruta authored
      In JavaUtils.scala, there is a typo in a type parameter. In addition, the type information is removed at compile time by erasure.
      
      This issue is really minor, so I didn't file it in JIRA.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #3789 from sarutak/fix-typo-in-javautils and squashes the following commits:
      
      e20193d [Kousuke Saruta] Fixed a typo of type parameter
      82bc5d9 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into fix-typo-in-javautils
      99f6f63 [Kousuke Saruta] Fixed a typo of type parameter in JavaUtils.scala
    • [SPARK-4946] [CORE] Using AkkaUtils.askWithReply in MapOutputTracker.askTracker to reduce the chance of communication problems · 815de540
      YanTangZhai authored
      Uses AkkaUtils.askWithReply in MapOutputTracker.askTracker to reduce the chance of communication problems.
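
      `askWithReply` wraps a bare Akka `ask` in retries so a single dropped message does not fail the whole lookup; a generic sketch of that pattern (not the internal `AkkaUtils` source):

      ```Scala
      import akka.actor.ActorRef
      import akka.pattern.ask
      import scala.concurrent.Await
      import scala.concurrent.duration._

      // Ask an actor, retrying a few times before giving up.
      def askWithRetries[T](actor: ActorRef, message: Any,
          maxAttempts: Int = 3, timeout: FiniteDuration = 30.seconds): T = {
        var lastError: Throwable = null
        for (_ <- 1 to maxAttempts) {
          try {
            return Await.result(actor.ask(message)(timeout), timeout).asInstanceOf[T]
          } catch {
            case e: Exception => lastError = e
          }
        }
        throw new RuntimeException(s"Error communicating with $actor", lastError)
      }
      ```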
      
      Author: YanTangZhai <hakeemzhai@tencent.com>
      Author: yantangzhai <tyz0303@163.com>
      
      Closes #3785 from YanTangZhai/SPARK-4946 and squashes the following commits:
      
      9ca6541 [yantangzhai] [SPARK-4946] [CORE] Using AkkaUtils.askWithReply in MapOutputTracker.askTracker to reduce the chance of the communicating problem
      e4c2c0a [YanTangZhai] Merge pull request #15 from apache/master
      718afeb [YanTangZhai] Merge pull request #12 from apache/master
      6e643f8 [YanTangZhai] Merge pull request #11 from apache/master
      e249846 [YanTangZhai] Merge pull request #10 from apache/master
      d26d982 [YanTangZhai] Merge pull request #9 from apache/master
      76d4027 [YanTangZhai] Merge pull request #8 from apache/master
      03b62b0 [YanTangZhai] Merge pull request #7 from apache/master
      8a00106 [YanTangZhai] Merge pull request #6 from apache/master
      cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
      cdef539 [YanTangZhai] Merge pull request #1 from apache/master
    • Add LICENSE header to build/mvn, build/sbt and sbt/sbt · 4cef05e1
      Kousuke Saruta authored
      Recently, build/mvn and build/sbt were added, and sbt/sbt was changed, but they have no license headers. Is it right to add license headers to these scripts? If not, please let me know and I will correct it.
      
      This PR doesn't affect the behavior of Spark, so I didn't file a JIRA issue.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #3817 from sarutak/add-license-header and squashes the following commits:
      
      1abc972 [Kousuke Saruta] Added LICENSE Header
    • [SPARK-4982][DOC] `spark.ui.retainedJobs` description is wrong in Spark UI configuration guide · 6645e525
      wangxiaojing authored
      Author: wangxiaojing <u9jing@gmail.com>
      
      Closes #3818 from wangxiaojing/SPARK-4982 and squashes the following commits:
      
      fe2ad5f [wangxiaojing] change stages to jobs
    • [SPARK-4966][YARN] The MemoryOverhead value is not set correctly · 14fa87bd
      meiyoula authored
      Author: meiyoula <1039320815@qq.com>
      
      Closes #3797 from XuTingjun/MemoryOverhead and squashes the following commits:
      
      5a780fc [meiyoula] Update ClientArguments.scala
  2. Dec 27, 2014
    • [SPARK-4501][Core] - Create build/mvn to automatically download maven/zinc/scalac · a3e51cc9
      Brennon York authored
      Creates a top-level directory script (`build/mvn`) to automatically download zinc and the specific version of Scala needed to build Spark. It will also download and install Maven if the user doesn't already have it; all packages are hosted under the `build/` directory. Tested on both Linux and OS X, and both work. All commands pass through to the Maven binary, so it acts exactly as a traditional Maven call would.
      
      Author: Brennon York <brennon.york@capitalone.com>
      
      Closes #3707 from brennonyork/SPARK-4501 and squashes the following commits:
      
      0e5a0e4 [Brennon York] minor incorrect doc verbiage (with -> this)
      9b79e38 [Brennon York] fixed merge conflicts with dev/run-tests, properly quoted args in sbt/sbt, fixed bug where relative paths would fail if passed in from build/mvn
      d2d41b6 [Brennon York] added blurb about leveraging zinc with build/mvn
      b979c58 [Brennon York] updated the merge conflict
      c5634de [Brennon York] updated documentation to overview build/mvn, updated all points where sbt/sbt was referenced with build/sbt
      b8437ba [Brennon York] set progress bars for curl and wget when not run on jenkins, no progress bar when run on jenkins, moved sbt script to build/sbt, wrote stub and warning under sbt/sbt which calls build/sbt, modified build/sbt to use the correct directory, fixed bug in build/sbt-launch-lib.bash to correctly pull the sbt version
      be11317 [Brennon York] added switch to silence download progress only if AMPLAB_JENKINS is set
      28d0a99 [Brennon York] updated to remove the python dependency, uses grep instead
      7e785a6 [Brennon York] added silent and quiet flags to curl and wget respectively, added single echo output to denote start of a download if download is needed
      14a5da0 [Brennon York] removed unnecessary zinc output on startup
      1af4a94 [Brennon York] fixed bug with uppercase vs lowercase variable
      3e8b9b3 [Brennon York] updated to properly only restart zinc if it was freshly installed
      a680d12 [Brennon York] Added comments to functions and tested various mvn calls
      bb8cc9d [Brennon York] removed package files
      ef017e6 [Brennon York] removed OS complexities, setup generic install_app call, removed extra file complexities, removed help, removed forced install (defaults now), removed double-dash from cli
      07bf018 [Brennon York] Updated to specifically handle pulling down the correct scala version
      f914dea [Brennon York] Beginning final portions of localized scala home
      69c4e44 [Brennon York] working linux and osx installers for purely local mvn build
      4a1609c [Brennon York] finalizing working linux install for maven to local ./build/apache-maven folder
      cbfcc68 [Brennon York] Changed the default sbt/sbt to build/sbt and added a build/mvn which will automatically download, install, and execute maven with zinc for easier build capability
    • [SPARK-4952][Core] Handle ConcurrentModificationExceptions in SparkEnv.environmentDetails · 080ceb77
      GuoQiang Li authored
      Author: GuoQiang Li <witgo@qq.com>
      
      Closes #3788 from witgo/SPARK-4952 and squashes the following commits:
      
      d903529 [GuoQiang Li] Handle ConcurrentModificationExceptions in SparkEnv.environmentDetails
    • [SPARK-4954][Core] add spark version information in log for standalone mode · 786808ab
      Zhang, Liye authored
      The Master and Worker Spark versions may not be the same as the driver's Spark version, because the Spark jar file might be replaced for a new application without restarting the Spark cluster. So the Spark version should be logged in both the Master and Worker logs.
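
      A minimal sketch of the intent, assuming the `SPARK_VERSION` constant from the `org.apache.spark` package object:

      ```Scala
      import org.apache.spark.SPARK_VERSION

      // Log the running version when a standalone Master or Worker starts, so
      // mismatches with the driver's version show up in the daemon logs.
      def logStartupVersion(role: String): Unit =
        println(s"Starting Spark $role, version $SPARK_VERSION")
      ```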
      
      Author: Zhang, Liye <liye.zhang@intel.com>
      
      Closes #3790 from liyezhang556520/version4Standalone and squashes the following commits:
      
      e05e1e3 [Zhang, Liye] add spark version information in log for standalone mode
    • [SPARK-3955] Different versions between jackson-mapper-asl and jackson-core-asl · 2483c1ef
      Jongyoul Lee authored
      - Set the same version for jackson-mapper-asl and jackson-core-asl
      - Related to #2818
      - Recreated the same patch from the latest master
      
      Author: Jongyoul Lee <jongyoul@gmail.com>
      
      Closes #3716 from jongyoul/SPARK-3955 and squashes the following commits:
      
      efa29aa [Jongyoul Lee] [SPARK-3955] Different versions between jackson-mapper-asl and jackson-core-asl - set the same version to jackson-mapper-asl and jackson-core-asl
    • HOTFIX: Slight tweak on previous commit. · 82bf4bee
      Patrick Wendell authored
      Meant to merge this in when committing SPARK-3787.
    • [SPARK-3787][BUILD] Assembly jar name is wrong when we build with sbt omitting -Dhadoop.version · de95c57a
      Kousuke Saruta authored
      This PR is another solution for the following problem: when we build with sbt with a hadoop profile but without a property for the hadoop version, like:

          sbt/sbt -Phadoop-2.2 assembly

      the jar name always uses the default version (1.0.4).

      When we build with maven under the same conditions, the default version for each profile is used. For instance, if we build like:

          mvn -Phadoop-2.2 package

      the jar name uses hadoop2.2.0, the default version for the hadoop-2.2 profile.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #3046 from sarutak/fix-assembly-jarname-2 and squashes the following commits:
      
      41ef90e [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into fix-assembly-jarname-2
      50c8676 [Kousuke Saruta] Merge branch 'fix-assembly-jarname-2' of github.com:sarutak/spark into fix-assembly-jarname-2
      52a1cd2 [Kousuke Saruta] Fixed conflicts
      dd30768 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into fix-assembly-jarname2
      f1c90bb [Kousuke Saruta] Fixed SparkBuild.scala in order to read `hadoop.version` property from pom.xml
      af6b100 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into fix-assembly-jarname
      c81806b [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into fix-assembly-jarname
      ad1f96e [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into fix-assembly-jarname
      b2318eb [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into fix-assembly-jarname
      5fc1259 [Kousuke Saruta] Fixed typo.
      eebbb7d [Kousuke Saruta] Fixed wrong jar name
    • MAINTENANCE: Automated closing of pull requests. · 534f24b2
      Patrick Wendell authored
      This commit exists to close the following pull requests on Github:
      
      Closes #3456 (close requested by 'pwendell')
      Closes #1602 (close requested by 'tdas')
      Closes #2633 (close requested by 'tdas')
      Closes #2059 (close requested by 'JoshRosen')
      Closes #2348 (close requested by 'tdas')
      Closes #3662 (close requested by 'tdas')
      Closes #2031 (close requested by 'andrewor14')
      Closes #265 (close requested by 'JoshRosen')
  3. Dec 26, 2014
  4. Dec 25, 2014
    • [SPARK-4608][Streaming] Reorganize StreamingContext implicit to improve API convenience · f9ed2b66
      zsxwing authored
      There is only one implicit function, `toPairDStreamFunctions`, in `StreamingContext`. This PR does a reorganization similar to [SPARK-4397](https://issues.apache.org/jira/browse/SPARK-4397).
      
      Compiled the following code against Spark Streaming 1.1.0 and ran it with this PR. Everything works fine.
      ```Scala
      import org.apache.spark._
      import org.apache.spark.streaming._
      import org.apache.spark.streaming.StreamingContext._
      
      object StreamingApp {
      
        def main(args: Array[String]) {
          val conf = new SparkConf().setMaster("local[2]").setAppName("FileWordCount")
          val ssc = new StreamingContext(conf, Seconds(10))
          val lines = ssc.textFileStream("/some/path")
          val words = lines.flatMap(_.split(" "))
          val pairs = words.map(word => (word, 1))
          val wordCounts = pairs.reduceByKey(_ + _)
          wordCounts.print()
      
          ssc.start()
          ssc.awaitTermination()
        }
      }
      ```
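
      The reorganization follows the SPARK-4397 pattern: host the implicit conversion where the compiler finds it automatically (for example, a companion object), so `import StreamingContext._` becomes unnecessary while old code keeps compiling. A schematic sketch with simplified stand-in classes:

      ```Scala
      import scala.language.implicitConversions
      import scala.reflect.ClassTag

      // Stand-ins for the real Streaming classes, to show the pattern only
      class DStream[T]
      class PairDStreamFunctions[K, V](stream: DStream[(K, V)]) {
        def reduceByKey(f: (V, V) => V): DStream[(K, V)] = new DStream[(K, V)]
      }

      object DStream {
        // Implicits in the companion object are found automatically; no
        // explicit import is required at the call site.
        implicit def toPairDStreamFunctions[K: ClassTag, V: ClassTag](
            stream: DStream[(K, V)]): PairDStreamFunctions[K, V] =
          new PairDStreamFunctions(stream)
      }
      ```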
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3464 from zsxwing/SPARK-4608 and squashes the following commits:
      
      aa6d44a [zsxwing] Fix a copy-paste error
      f74c190 [zsxwing] Merge branch 'master' into SPARK-4608
      e6f9cc9 [zsxwing] Update the docs
      27833bb [zsxwing] Remove `import StreamingContext._`
      c15162c [zsxwing] Reorganize StreamingContext implicit to improve API convenience
    • [SPARK-4537][Streaming] Expand StreamingSource to add more metrics · f205fe47
      jerryshao authored
      Add `processingDelay`, `schedulingDelay` and `totalDelay` for the last completed batch. Add `lastReceivedBatchRecords` and `totalReceivedBatchRecords` to the received records counting.
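
      Spark metric sources expose such values as Codahale gauges; a sketch of the registration pattern, with a hypothetical holder for the last batch's timings:

      ```Scala
      import com.codahale.metrics.{Gauge, MetricRegistry}

      class BatchStats {  // hypothetical: milliseconds for the last completed batch
        @volatile var processingDelay: Long = 0L
        @volatile var schedulingDelay: Long = 0L
      }

      val stats = new BatchStats
      val registry = new MetricRegistry

      // Register gauges so a metrics sink can poll the latest batch delays
      registry.register(MetricRegistry.name("streaming", "lastCompletedBatch_processingDelay"),
        new Gauge[Long] { override def getValue: Long = stats.processingDelay })
      registry.register(MetricRegistry.name("streaming", "lastCompletedBatch_totalDelay"),
        new Gauge[Long] { override def getValue: Long = stats.schedulingDelay + stats.processingDelay })
      ```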
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #3466 from jerryshao/SPARK-4537 and squashes the following commits:
      
      00f5f7f [jerryshao] Change the code style and add totalProcessedRecords
      44721a6 [jerryshao] Further address the comments
      c097ddc [jerryshao] Address the comments
      02dd44f [jerryshao] Fix the addressed comments
      c7a9376 [jerryshao] Expand StreamingSource to add more metrics
    • [EC2] Update mesos/spark-ec2 branch to branch-1.3 · ac827859
      Nicholas Chammas authored
      Going forward, we'll use matching branch names across the mesos/spark-ec2 and apache/spark repositories, per [the discussion here](https://github.com/mesos/spark-ec2/pull/85#issuecomment-68069589).
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #3804 from nchammas/patch-2 and squashes the following commits:
      
      cd2c0d4 [Nicholas Chammas] [EC2] Update mesos/spark-ec2 branch to branch-1.3
    • [EC2] Update default Spark version to 1.2.0 · b6b6393b
      Nicholas Chammas authored
      Now that 1.2.0 is out, let's update the default Spark version.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #3793 from nchammas/patch-1 and squashes the following commits:
      
      3255832 [Nicholas Chammas] add 1.2.0 version to Spark-Shark map
      ec0e904 [Nicholas Chammas] [EC2] Update default Spark version to 1.2.0
    • Fix "Building Spark With Maven" link in README.md · 08b18c7e
      Denny Lee authored
      Corrected link to the Building Spark with Maven page from its original (http://spark.apache.org/docs/latest/building-with-maven.html) to the current page (http://spark.apache.org/docs/latest/building-spark.html)
      
      Author: Denny Lee <denny.g.lee@gmail.com>
      
      Closes #3802 from dennyglee/patch-1 and squashes the following commits:
      
      15f601a [Denny Lee] Update README.md
    • [SPARK-4953][Doc] Fix the description of building Spark with YARN · 11dd9931
      Kousuke Saruta authored
      In the section "Specifying the Hadoop Version" in building-spark.md, there is a description of building with YARN against Hadoop 0.23.
      Spark 1.3.0 will not support Hadoop 0.23, so the description should be fixed.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #3787 from sarutak/SPARK-4953 and squashes the following commits:
      
      ee9c355 [Kousuke Saruta] Removed description related to a specific vendor
      9ab0c24 [Kousuke Saruta] Fix the description about building SPARK with YARN
  5. Dec 24, 2014
    • [SPARK-4873][Streaming] Use `Future.zip` instead of `Future.flatMap` (for-loop) in WriteAheadLogBasedBlockHandler · b4d0db80
      zsxwing authored
      Use `Future.zip` instead of `Future.flatMap` (for-loop). `zip` implies the two Futures will run concurrently, while `flatMap` usually means one Future depends on the other.
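
      The distinction in a self-contained sketch: futures only run concurrently if both are created before being combined, and `zip` encourages exactly that shape:

      ```Scala
      import scala.concurrent.{Await, Future}
      import scala.concurrent.ExecutionContext.Implicits.global
      import scala.concurrent.duration._

      def storeInBlockManager(): Future[String] = Future { "block-id" }
      def storeInWriteAheadLog(): Future[String] = Future { "wal-segment" }

      // Sequential: the WAL write does not even start until the block-manager
      // write completes, because the second future is created inside flatMap.
      val sequential: Future[(String, String)] =
        storeInBlockManager().flatMap(b => storeInWriteAheadLog().map(w => (b, w)))

      // Concurrent: both futures are created up front; zip pairs their results.
      val concurrent: Future[(String, String)] =
        storeInBlockManager().zip(storeInWriteAheadLog())

      println(Await.result(concurrent, 10.seconds))
      ```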
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3721 from zsxwing/SPARK-4873 and squashes the following commits:
      
      46a2cd9 [zsxwing] Use Future.zip instead of Future.flatMap(for-loop)
    • SPARK-4297 [BUILD] Build warning fixes omnibus · 29fabb1b
      Sean Owen authored
      There are a number of warnings generated in a normal, successful build right now. They're mostly Java unchecked cast warnings, which can be suppressed. But there's a grab bag of other Scala language warnings and so on that can all be easily fixed. The forthcoming PR fixes about 90% of the build warnings I see now.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #3157 from srowen/SPARK-4297 and squashes the following commits:
      
      8c9e469 [Sean Owen] Suppress unchecked cast warnings, and several other build warning fixes
  6. Dec 23, 2014
    • [SPARK-4881][Minor] Use SparkConf#getBoolean instead of get().toBoolean · 199e59aa
      Kousuke Saruta authored
      It's really a minor issue.
      
      In ApplicationMaster, there is code like the following:
      
          val preserveFiles = sparkConf.get("spark.yarn.preserve.staging.files", "false").toBoolean
      
      I think the code can be simplified as follows:
      
          val preserveFiles = sparkConf.getBoolean("spark.yarn.preserve.staging.files", false)
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #3733 from sarutak/SPARK-4881 and squashes the following commits:
      
      1771430 [Kousuke Saruta] Modified the code like sparkConf.get(...).toBoolean to sparkConf.getBoolean(...)
      c63daa0 [Kousuke Saruta] Simplified code
    • [SPARK-4860][pyspark][sql] speeding up `sample()` and `takeSample()` · fd41eb95
      jbencook authored
      This PR modifies the python `SchemaRDD` to use `sample()` and `takeSample()` from Scala instead of the slower python implementations from `rdd.py`. This is worthwhile because the `Row`s are already serialized as Java objects.
      
      In order to use the faster `takeSample()`, a `takeSampleToPython()` method was implemented in `SchemaRDD.scala` following the pattern of `collectToPython()`.
      
      Author: jbencook <jbenjamincook@gmail.com>
      Author: J. Benjamin Cook <jbenjamincook@gmail.com>
      
      Closes #3764 from jbencook/master and squashes the following commits:
      
      6fbc769 [J. Benjamin Cook] [SPARK-4860][pyspark][sql] fixing sloppy indentation for takeSampleToPython() arguments
      5170da2 [J. Benjamin Cook] [SPARK-4860][pyspark][sql] fixing typo: from RDD to SchemaRDD
      de22f70 [jbencook] [SPARK-4860][pyspark][sql] using sample() method from JavaSchemaRDD
      b916442 [jbencook] [SPARK-4860][pyspark][sql] adding sample() to JavaSchemaRDD
      020cbdf [jbencook] [SPARK-4860][pyspark][sql] using Scala implementations of `sample()` and `takeSample()`
    • [SPARK-4606] Send EOF to child JVM when there's no more data to read. · 7e2deb71
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #3460 from vanzin/SPARK-4606 and squashes the following commits:
      
      031207d [Marcelo Vanzin] [SPARK-4606] Send EOF to child JVM when there's no more data to read.
    • [SPARK-4671][Streaming] Do not replicate streaming block when WAL is enabled · 3f5f4cc4
      jerryshao authored
      Currently a streaming block is replicated when a storage level with replication is set. Since the WAL is already fault-tolerant, replication is unnecessary and hurts the throughput of the streaming application.
      
      Hi tdas, as discussed about this issue, I fixed it with this implementation. I'm not sure if this is the way you want it; would you mind taking a look at it? Thanks a lot.
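
      A sketch of the core idea, assuming a small helper around `StorageLevel` (the real handler's wiring differs):

      ```Scala
      import org.apache.spark.storage.StorageLevel

      // If the write-ahead log already provides fault tolerance, drop the
      // replication factor to 1 so blocks are not redundantly copied.
      def effectiveStorageLevel(requested: StorageLevel, walEnabled: Boolean): StorageLevel = {
        if (walEnabled && requested.replication > 1) {
          StorageLevel(requested.useDisk, requested.useMemory, requested.useOffHeap,
            requested.deserialized, 1)
        } else {
          requested
        }
      }

      // e.g. effectiveStorageLevel(StorageLevel.MEMORY_AND_DISK_SER_2, walEnabled = true)
      // yields an equivalent level with replication == 1
      ```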
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #3534 from jerryshao/SPARK-4671 and squashes the following commits:
      
      500b456 [jerryshao] Do not replicate streaming block when WAL is enabled
    • [SPARK-4802] [streaming] Remove receiverInfo once receiver is de-registered · 10d69e9c
      Ilayaperumal Gopinathan authored
      Once the streaming receiver is de-registered at the executor, the `ReceiverTrackerActor` needs to remove the corresponding receiverInfo from the `receiverInfo` map in `ReceiverTracker`.
      
      Author: Ilayaperumal Gopinathan <igopinathan@pivotal.io>
      
      Closes #3647 from ilayaperumalg/receiverInfo-RTracker and squashes the following commits:
      
      6eb97d5 [Ilayaperumal Gopinathan] Polishing based on the review
      3640c86 [Ilayaperumal Gopinathan] Remove receiverInfo once receiver is de-registered
    • [SPARK-4913] Fix incorrect event log path · 96281cd0
      Liang-Chi Hsieh authored
      SPARK-2261 uses a single file to log events for an app. `eventLogDir` in `ApplicationDescription` is replaced with `eventLogFile`. However, `ApplicationDescription` in `SparkDeploySchedulerBackend` is initialized with `SparkContext`'s `eventLogDir`, which is just the log directory, not the actual log file path, so `Master.rebuildSparkUI` cannot correctly rebuild a new SparkUI for the app.

      Because the `ApplicationDescription` is remotely registered with `Master` and the app's id is generated in `Master`, we cannot get the app id in advance of registration. So the received description needs to be modified with the correct `eventLogFile` value.
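
      A schematic sketch of the fix's shape, with a simplified stand-in for `ApplicationDescription` (field names hypothetical):

      ```Scala
      // The submitted description carries only the log directory; the Master
      // appends the app id it generates to obtain the per-app log file path.
      case class AppDescription(name: String, eventLogFile: Option[String])

      def resolveEventLogFile(received: AppDescription, appId: String): AppDescription =
        received.copy(eventLogFile = received.eventLogFile.map(dir => s"$dir/$appId"))

      // resolveEventLogFile(AppDescription("app", Some("/spark-events")), "app-20141223")
      //   => eventLogFile = Some("/spark-events/app-20141223")
      ```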
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #3755 from viirya/fix_app_logdir and squashes the following commits:
      
      5e0ea35 [Liang-Chi Hsieh] Revision for comment.
      b5730a1 [Liang-Chi Hsieh] Fix incorrect event log path.
      
      Closes #3777 (a duplicate PR for the same JIRA)
    • [SPARK-4730][YARN] Warn against deprecated YARN settings · 27c5399f
      Andrew Or authored
      See https://issues.apache.org/jira/browse/SPARK-4730.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3590 from andrewor14/yarn-settings and squashes the following commits:
      
      36e0753 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-settings
      dcd1316 [Andrew Or] Warn against deprecated YARN settings
    • [SPARK-4914][Build] Cleans lib_managed before compiling with Hive 0.13.1 · 395b771f
      Cheng Lian authored
      This PR tries to fix the Hive test failures encountered in PR #3157 by cleaning `lib_managed` before building the assembly jar against Hive 0.13.1 in `dev/run-tests`. Otherwise two sets of datanucleus jars would be left in `lib_managed` and may mess up class paths while executing Hive test suites. Please refer to [this thread] [1] for details. A clean build would be even safer, but we only clean `lib_managed` here to save build time.
      
      This PR also takes the chance to clean up some minor typos and formatting issues in the comments.
      
      [1]: https://github.com/apache/spark/pull/3157#issuecomment-67656488
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3756 from liancheng/clean-lib-managed and squashes the following commits:
      
      e2bd21d [Cheng Lian] Adds lib_managed to clean set
      c9f2f3e [Cheng Lian] Cleans lib_managed before compiling with Hive 0.13.1
    • [SPARK-4932] Add help comments in Analytics · 9c251c55
      Takeshi Yamamuro authored
      Trivial modifications for usability.
      
      Author: Takeshi Yamamuro <linguin.m.s@gmail.com>
      
      Closes #3775 from maropu/AddHelpCommentInAnalytics and squashes the following commits:
      
      fbea8f5 [Takeshi Yamamuro] Add help comments in Analytics
    • [SPARK-4834] [standalone] Clean up application files after app finishes. · dd155369
      Marcelo Vanzin authored
      Commit 7aacb7bf added support for sharing downloaded files among multiple
      executors of the same app. That works great in Yarn, since the app's directory
      is cleaned up after the app is done.
      
      But Spark standalone mode didn't do that, so the lock/cache files created
      by that change were left around and could eventually fill up the disk hosting
      /tmp.
      
      To solve that, create app-specific directories under the local dirs when
      launching executors. Multiple executors launched by the same Worker will
      use the same app directories, so they should be able to share the downloaded
      files. When the application finishes, a new message is sent to all workers
      telling them the application has finished; once that message has been received,
      and all executors registered for the application shut down, then those
      directories will be cleaned up by the Worker.
      
      Note: Unit testing this is hard (if even possible), since local-cluster mode
      doesn't seem to leave the Master/Worker daemons running long enough after
      `sc.stop()` is called for the clean up protocol to take effect.
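
      A schematic sketch of the protocol described above (message and helper names hypothetical):

      ```Scala
      import java.io.File
      import scala.collection.mutable

      case class ApplicationFinished(appId: String) // hypothetical Master-to-Worker message

      class AppDirCleaner(appDirs: mutable.Map[String, File]) {
        private val finishedApps = mutable.Set.empty[String]
        private val liveExecutors = mutable.Map.empty[String, Int].withDefaultValue(0)

        def onExecutorStart(appId: String): Unit = liveExecutors(appId) += 1

        def onExecutorExit(appId: String): Unit = {
          liveExecutors(appId) -= 1
          maybeCleanup(appId)
        }

        // Invoked when the "application finished" message arrives from the Master
        def onApplicationFinished(appId: String): Unit = {
          finishedApps += appId
          maybeCleanup(appId)
        }

        // Delete the app's directories only after the finish message arrived
        // AND every executor registered for the app has shut down.
        private def maybeCleanup(appId: String): Unit = {
          if (finishedApps.contains(appId) && liveExecutors(appId) <= 0) {
            appDirs.remove(appId).foreach(deleteRecursively)
          }
        }

        private def deleteRecursively(f: File): Unit = {
          Option(f.listFiles).foreach(_.foreach(deleteRecursively))
          f.delete()
        }
      }
      ```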
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #3705 from vanzin/SPARK-4834 and squashes the following commits:
      
      b430534 [Marcelo Vanzin] Remove seemingly unnecessary synchronization.
      50eb4b9 [Marcelo Vanzin] Review feedback.
      c0e5ea5 [Marcelo Vanzin] [SPARK-4834] [standalone] Clean up application files after app finishes.
    • [SPARK-4931][Yarn][Docs] Fix the format of running-on-yarn.md · 2d215aeb
      zsxwing authored
      Currently, the formatting of the log4j section in running-on-yarn.md is a bit messy.
      
      ![running-on-yarn](https://cloud.githubusercontent.com/assets/1000778/5535248/204c4b64-8ab4-11e4-83c3-b4722ea0ad9d.png)
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3774 from zsxwing/SPARK-4931 and squashes the following commits:
      
      4a5f853 [zsxwing] Fix the format of running-on-yarn.md
    • [SPARK-4890] Ignore downloaded EC2 libs · 2823c7f0
      Nicholas Chammas authored
      PR #3737 changed `spark-ec2` to automatically download boto from PyPI. This PR tells git to ignore those downloaded library files.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #3770 from nchammas/ignore-ec2-lib and squashes the following commits:
      
      5c440d3 [Nicholas Chammas] gitignore downloaded EC2 libs
    • [Docs] Minor typo fixes · 0e532ccb
      Nicholas Chammas authored
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #3772 from nchammas/patch-1 and squashes the following commits:
      
      b7d9083 [Nicholas Chammas] [Docs] Minor typo fixes