  1. Nov 27, 2014
    • SPARK-4170 [CORE] Closure problems when running Scala app that "extends App" · 5d7fe178
      Sean Owen authored
      Warn against subclassing scala.App, and remove one instance of this in examples
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #3497 from srowen/SPARK-4170 and squashes the following commits:
      
      4a6131f [Sean Owen] Restore multiline string formatting
      a8ca895 [Sean Owen] Warn against subclassing scala.App, and remove one instance of this in examples
      5d7fe178
    • [Release] Automate generation of contributors list · c86e9bc4
      Andrew Or authored
      This commit provides a script that computes the contributors list
      by linking the github commits with JIRA issues. Automatically
      translating github usernames remains a TODO at this point.
      c86e9bc4
  2. Nov 26, 2014
    • [SPARK-732][SPARK-3628][CORE][RESUBMIT] eliminate duplicate update on accumulator · 5af53ada
      CodingCat authored
      https://issues.apache.org/jira/browse/SPARK-3628
      
      In the current implementation, the accumulator is updated for every successfully finished task, even if the task belongs to a resubmitted stage, which makes the accumulator's behavior counter-intuitive.
      
      In this patch, I changed the way the DAGScheduler updates the accumulator:
      
      The DAGScheduler maintains a hash table mapping each stage ID to the <accumulator_id, value> pairs received for it. The values of these pairs are accumulated only once the stage becomes independent (no job needs it any more). When a task finishes, we check whether the hash table already contains its stage ID; the <accumulator_id, value> pair is saved only when the task is the first finished task of a new stage or the stage is running its first attempt...
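
The bookkeeping described above can be sketched in a few lines. This is a simplified illustration in Python, not the DAGScheduler code: the `StageAccumulatorTracker` class, its method names, and the choice to key duplicates by task partition are assumptions made for the example.

```python
from collections import defaultdict


class StageAccumulatorTracker:
    """Hypothetical helper: buffers accumulator updates per stage and counts
    each task partition at most once, even if the stage is resubmitted."""

    def __init__(self):
        # stage_id -> {partition -> {accumulator_id: value}}
        self._pending = defaultdict(dict)
        # accumulator_id -> merged total
        self.totals = defaultdict(int)

    def task_finished(self, stage_id, partition, updates):
        """Record updates from a finished task; return False for duplicates."""
        seen = self._pending[stage_id]
        if partition in seen:
            return False  # rerun of an already-counted task: ignore it
        seen[partition] = dict(updates)
        return True

    def stage_released(self, stage_id):
        """Merge the buffered values once no job needs the stage any more."""
        for updates in self._pending.pop(stage_id, {}).values():
            for acc_id, value in updates.items():
                self.totals[acc_id] += value


tracker = StageAccumulatorTracker()
tracker.task_finished(stage_id=1, partition=0, updates={"rows": 10})
tracker.task_finished(stage_id=1, partition=1, updates={"rows": 5})
# The stage is resubmitted and partition 0 runs again: its update is dropped.
duplicate_accepted = tracker.task_finished(stage_id=1, partition=0, updates={"rows": 10})
tracker.stage_released(1)
```

With the duplicate dropped, the `rows` total reflects each partition exactly once.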
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #2524 from CodingCat/SPARK-732-1 and squashes the following commits:
      
      701a1e8 [CodingCat] roll back change on Accumulator.scala
      1433e6f [CodingCat] make MIMA happy
      b233737 [CodingCat] address Matei's comments
      02261b8 [CodingCat] rollback  some changes
      6b0aff9 [CodingCat] update document
      2b2e8cf [CodingCat] updateAccumulator
      83b75f8 [CodingCat] style fix
      84570d2 [CodingCat] re-enable  the bad accumulator guard
      1e9e14d [CodingCat] add NPE guard
      21b6840 [CodingCat] simplify the patch
      88d1f03 [CodingCat] fix rebase error
      f74266b [CodingCat] add test case for resubmitted result stage
      5cf586f [CodingCat] de-duplicate on task level
      138f9b3 [CodingCat] make MIMA happy
      67593d2 [CodingCat] make if allowing duplicate update as an option of accumulator
      5af53ada
    • [SPARK-4614][MLLIB] Slight API changes in Matrix and Matrices · 561d31d2
      Xiangrui Meng authored
      Until we have a full picture of the operators we want to add, it is safer to hide `Matrix.transposeMultiply` in 1.2.0. Another change is to `Matrix.randn` and `Matrix.rand`, both of which should take a `Random` implementation; otherwise, they are very likely to produce inconsistent RDDs. I also added some unit tests for the matrix factory methods. All of these APIs are new in 1.2, so there are no incompatible changes.
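
A small sketch of why a factory method benefits from taking the random generator as an argument. It is written in Python for brevity and does not use the MLlib API; `rand_matrix` is a hypothetical stand-in for a matrix factory method.

```python
import random


def rand_matrix(num_rows, num_cols, rng):
    """Return a num_rows x num_cols matrix of uniform values drawn from rng.

    Hypothetical factory: the caller owns the generator, so the same seed
    always reproduces the same matrix, for example when a partition is
    recomputed.
    """
    return [[rng.random() for _ in range(num_cols)] for _ in range(num_rows)]


a = rand_matrix(2, 3, random.Random(42))
b = rand_matrix(2, 3, random.Random(42))  # same seed: identical matrix
c = rand_matrix(2, 3, random.Random(7))   # different seed: different matrix
```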
      
      brkyvz
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3468 from mengxr/SPARK-4614 and squashes the following commits:
      
      3b0e4e2 [Xiangrui Meng] add mima excludes
      6bfd8a4 [Xiangrui Meng] hide transposeMultiply; add rng to rand and randn; add unit tests
      561d31d2
    • Removing confusing TripletFields · 288ce583
      Joseph E. Gonzalez authored
      After additional discussion with rxin, I think having all the possible `TripletField` options is confusing.  This pull request reduces the triplet fields to:
      
      ```java
        /**
         * None of the triplet fields are exposed.
         */
        public static final TripletFields None = new TripletFields(false, false, false);
      
        /**
         * Expose only the edge field and not the source or destination field.
         */
        public static final TripletFields EdgeOnly = new TripletFields(false, false, true);
      
        /**
         * Expose the source and edge fields but not the destination field. (Same as Src)
         */
        public static final TripletFields Src = new TripletFields(true, false, true);
      
        /**
         * Expose the destination and edge fields but not the source field. (Same as Dst)
         */
        public static final TripletFields Dst = new TripletFields(false, true, true);
      
        /**
         * Expose all the fields (source, edge, and destination).
         */
        public static final TripletFields All = new TripletFields(true, true, true);
      ```
      
      Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
      
      Closes #3472 from jegonzal/SimplifyTripletFields and squashes the following commits:
      
      91796b5 [Joseph E. Gonzalez] removing confusing triplet fields
      288ce583
    • [SPARK-4612] Reduce task latency and increase scheduling throughput by making... · e7f4d253
      Tathagata Das authored
      [SPARK-4612] Reduce task latency and increase scheduling throughput by making configuration initialization lazy
      
      https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L337 creates a configuration object for every task that is launched, even if there is no new dependent file/JAR to update. This is a heavy-weight creation that should be avoided if there is no new file/JAR to update. This PR makes that creation lazy. Quick local test in spark-perf scheduling throughput tests gives the following numbers in a local standalone scheduler mode.
      1 job with 10000 tasks: before 7.8395 seconds, after 2.6415 seconds = 3x increase in task scheduling throughput
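
The idea of deferring an expensive construction until it is actually needed can be sketched as follows. This is an illustration in Python, not the Executor code; `update_dependencies` and `make_conf` are hypothetical names, and the dictionary stands in for the real configuration object.

```python
def update_dependencies(new_files, make_conf):
    """Fetch new files, building the configuration only if one is needed.

    `make_conf` is a hypothetical factory for the expensive configuration
    object; it is called at most once per invocation, and not at all when
    there is nothing new to fetch.
    """
    conf = None
    fetched = []
    for path in new_files:
        if conf is None:
            conf = make_conf()  # deferred until the first real use
        fetched.append((path, conf["scheme"]))
    return fetched


conf_builds = []


def make_conf():
    conf_builds.append(1)  # record each expensive construction
    return {"scheme": "file"}


# 10000 task launches with no new dependencies never build a configuration.
for _ in range(10000):
    update_dependencies([], make_conf)
builds_without_new_files = len(conf_builds)

fetched = update_dependencies(["dep.jar", "lib.jar"], make_conf)
```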
      
      pwendell JoshRosen
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #3463 from tdas/lazy-config and squashes the following commits:
      
      c791c1e [Tathagata Das] Reduce task latency by making configuration initialization lazy
      e7f4d253
  3. Nov 25, 2014
    • [SPARK-4516] Avoid allocating Netty PooledByteBufAllocators unnecessarily · 346bc17a
      Aaron Davidson authored
      Turns out we are allocating an allocator pool for every TransportClient (which means that the number increases with the number of nodes in the cluster), when really we should just reuse one for all clients.
      
      This patch, as expected, greatly decreases off-heap memory allocation, and appears to make allocation only proportional to the number of cores.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3465 from aarondav/fewer-pools and squashes the following commits:
      
      36c49da [Aaron Davidson] [SPARK-4516] Avoid allocating unnecessarily Netty PooledByteBufAllocators
      346bc17a
    • [SPARK-4516] Cap default number of Netty threads at 8 · f5f2d273
      Aaron Davidson authored
      In practice, only 2-4 cores should be required to transfer roughly 10 Gb/s, and each core that we use will have an initial overhead of roughly 32 MB of off-heap memory, which comes at a premium.
      
      Thus, this value should still retain maximum throughput and reduce wasted off-heap memory allocation. It can be overridden by setting the number of serverThreads and clientThreads manually in Spark's configuration.
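
A minimal sketch of the capping rule and the memory arithmetic behind it, in Python. The function names and the `configured` parameter are assumptions for illustration; only the cap of 8 and the rough figure of 32 MB per core come from the message above.

```python
MAX_DEFAULT_NETTY_THREADS = 8  # cap described in the commit message
OFF_HEAP_MB_PER_THREAD = 32    # rough per-core overhead quoted above


def default_num_threads(available_cores, configured=0):
    """Return the thread count: an explicit setting wins, else min(cores, 8).

    `configured` is a hypothetical stand-in for a user-supplied
    serverThreads/clientThreads value; 0 means "not set".
    """
    if configured > 0:
        return configured
    return min(available_cores, MAX_DEFAULT_NETTY_THREADS)


def initial_off_heap_mb(num_threads):
    """Approximate initial off-heap overhead for the given thread count."""
    return num_threads * OFF_HEAP_MB_PER_THREAD


uncapped_mb = initial_off_heap_mb(32)                     # one thread per core on 32 cores
capped_mb = initial_off_heap_mb(default_num_threads(32))  # with the default cap
```

Under these assumptions, a 32-core machine goes from roughly 1024 MB to roughly 256 MB of initial off-heap overhead.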
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3469 from aarondav/fewer-pools2 and squashes the following commits:
      
      087c59f [Aaron Davidson] [SPARK-4516] Cap default number of Netty threads at 8
      f5f2d273
    • [SPARK-4604][MLLIB] make MatrixFactorizationModel public · b5fb1410
      Xiangrui Meng authored
      Users can now construct an MF model directly. I added a note about the performance.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3459 from mengxr/SPARK-4604 and squashes the following commits:
      
      f64bcd3 [Xiangrui Meng] organize imports
      ed08214 [Xiangrui Meng] check preconditions and unit tests
      a624c12 [Xiangrui Meng] make MatrixFactorizationModel public
      b5fb1410
    • [HOTFIX]: Adding back without-hive dist · 4d95526a
      Patrick Wendell authored
      4d95526a
    • [SPARK-4583] [mllib] LogLoss for GradientBoostedTrees fix + doc updates · c251fd74
      Joseph K. Bradley authored
      Currently, the LogLoss used by GradientBoostedTrees has 2 issues:
      * the gradient (and therefore loss) does not match that used by Friedman (1999)
      * the error computation uses 0/1 accuracy, not log loss
      
      This PR updates LogLoss.
      It also adds some doc for boosting and forests.
      
      I tested it on sample data and made sure the log loss is monotonically decreasing with each boosting iteration.
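
For reference, here is a sketch of a binomial log loss and its gradient for labels in {-1, +1}, in the form 2 log(1 + exp(-2 y F)). The constant factor and the function names are assumptions for this example and are not taken from the patch; the point is that the gradient must be the derivative of the same loss that is reported as the error.

```python
import math


def log_loss(label, prediction):
    """2 * log(1 + exp(-2 * label * prediction)) for label in {-1, +1}."""
    margin = 2.0 * label * prediction
    # Evaluate log(1 + exp(-margin)) without overflow for large |margin|.
    if margin >= 0:
        return 2.0 * math.log1p(math.exp(-margin))
    return 2.0 * (-margin + math.log1p(math.exp(margin)))


def log_loss_gradient(label, prediction):
    """Derivative of log_loss with respect to the prediction."""
    margin = 2.0 * label * prediction
    if margin > 0:
        e = math.exp(-margin)
        return -4.0 * label * e / (1.0 + e)
    return -4.0 * label / (1.0 + math.exp(margin))


# Central finite difference as a consistency check between loss and gradient.
y, f, h = 1.0, 0.3, 1e-6
numeric = (log_loss(y, f + h) - log_loss(y, f - h)) / (2 * h)
analytic = log_loss_gradient(y, f)
```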
      
      CC: mengxr manishamde codedeft
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #3439 from jkbradley/gbt-loss-fix and squashes the following commits:
      
      cfec17e [Joseph K. Bradley] removed forgotten temp comments
      a27eb6d [Joseph K. Bradley] corrections to last log loss commit
      ed5da2c [Joseph K. Bradley] updated LogLoss (boosting) for numerical stability
      5e52bff [Joseph K. Bradley] * Removed the 1/2 from SquaredError.  This also required updating the test suite since it effectively doubles the gradient and loss. * Added doc for developers within RandomForest. * Small cleanup in test suite (generating data only once)
      e57897a [Joseph K. Bradley] Fixed LogLoss for GradientBoostedTrees, and updated doc for losses, forests, and boosting
      c251fd74
    • [Spark-4509] Revert EC2 tag-based cluster membership patch · 7eba0fbe
      Xiangrui Meng authored
      This PR reverts changes related to tag-based cluster membership. As discussed in SPARK-3332, we didn't figure out a safe strategy to use tags to determine cluster membership, because tagging is not atomic. The following changes are reverted:
      
      SPARK-2333: 94053a7b
      SPARK-3213: 7faf755a
      SPARK-3608: 78d4220f.
      
      I tested launch, login, and destroy. It is easy to check the diff by comparing it to Josh's patch for branch-1.1:
      
      https://github.com/apache/spark/pull/2225/files
      
      JoshRosen I sent the PR to master. It might be easier for us to keep master and branch-1.2 the same at this time. We can always re-apply the patch once we figure out a stable solution.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3453 from mengxr/SPARK-4509 and squashes the following commits:
      
      f0b708b [Xiangrui Meng] revert 94053a7b
      4298ea5 [Xiangrui Meng] revert 7faf755a
      35963a1 [Xiangrui Meng] Revert "SPARK-3608 Break if the instance tag naming succeeds"
      7eba0fbe
    • Fix SPARK-4471: blockManagerIdFromJson function throws exception while B... · 9bdf5da5
      hushan[胡珊] authored
      Fix [SPARK-4471](https://issues.apache.org/jira/browse/SPARK-4471): the blockManagerIdFromJson function throws an exception when BlockManagerId is null in MetadataFetchFailedException
      
      Author: hushan[胡珊] <hushan@xiaomi.com>
      
      Closes #3340 from suyanNone/fix-blockmanagerId-jnothing-2 and squashes the following commits:
      
      159f9a3 [hushan[胡珊]] Refine test code for blockmanager is null
      4380d73 [hushan[胡珊]] remove useless blank line
      3ccf651 [hushan[胡珊]] Fix SPARK-4471: blockManagerIdFromJson function throws exception while metadata fetch failed
      9bdf5da5
    • [SPARK-4546] Improve HistoryServer first time user experience · 9afcbe49
      Andrew Or authored
      The documentation points the user to run the following
      ```
      sbin/start-history-server.sh
      ```
      The first thing this does is throw an exception that complains a log directory is not specified. The exception message itself does not say anything about what to set. Instead we should have a default and a landing page with a better message. The new default log directory is `file:/tmp/spark-events`.
      
      This is what it looks like as of this PR:
      
      ![after](https://issues.apache.org/jira/secure/attachment/12682985/after.png)
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3411 from andrewor14/minor-history-improvements and squashes the following commits:
      
      f33d6b3 [Andrew Or] Point user to set config if default log dir does not exist
      fc4c17a [Andrew Or] Improve HistoryServer UX
      9afcbe49
    • [SPARK-4592] Avoid duplicate worker registrations in standalone mode · 1b2ab1cd
      Andrew Or authored
      **Summary.** On failover, the Master may receive duplicate registrations from the same worker, causing the worker to exit. This is caused by this commit https://github.com/apache/spark/commit/4afe9a4852ebeb4cc77322a14225cd3dec165f3f, which adds logic for the worker to re-register with the master in case of failures. However, the following race condition may occur:
      
      (1) Master A fails and Worker attempts to reconnect to all masters
      (2) Master B takes over and notifies Worker
      (3) Worker responds by registering with Master B
      (4) Meanwhile, Worker's previous reconnection attempt reaches Master B, causing the same Worker to register with Master B twice
      
      **Fix.** Instead of attempting to register with all known masters, the worker should re-register with only the one that it has been communicating with. This is safe because the fact that a failover has occurred means the old master must have died. Then, when the worker is finally notified of a new master, it gives up on the old one in favor of the new one.
      
      **Caveat.** Even this fix is subject to more obscure race conditions. For instance, if Master B fails and Master A recovers immediately, then Master A may still observe duplicate worker registrations. However, this and other potential race conditions, summarized in [SPARK-4592](https://issues.apache.org/jira/browse/SPARK-4592), are much less likely than the one described above, which is deterministically reproducible.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3447 from andrewor14/standalone-failover and squashes the following commits:
      
      0d9716c [Andrew Or] Move re-registration logic to actor for thread-safety
      79286dc [Andrew Or] Preserve old behavior for initial retries
      83b321c [Andrew Or] Tweak wording
      1fce6a9 [Andrew Or] Active master actor could be null in the beginning
      b6f269e [Andrew Or] Avoid duplicate worker registrations
      1b2ab1cd
    • [SPARK-4196][SPARK-4602][Streaming] Fix serialization issue in... · 8838ad7c
      Tathagata Das authored
      [SPARK-4196][SPARK-4602][Streaming] Fix serialization issue in PairDStreamFunctions.saveAsNewAPIHadoopFiles
      
      Solves two JIRAs in one shot:
      - Makes the ForEachDStream created by saveAsNewAPIHadoopFiles serializable for checkpoints
      - Makes the default configuration object used by saveAsNewAPIHadoopFiles be Spark's Hadoop configuration
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #3457 from tdas/savefiles-fix and squashes the following commits:
      
      bb4729a [Tathagata Das] Same treatment for saveAsHadoopFiles
      b382ea9 [Tathagata Das] Fix serialization issue in PairDStreamFunctions.saveAsNewAPIHadoopFiles.
      8838ad7c
    • [SPARK-4581][MLlib] Refactorize StandardScaler to improve the transformation performance · bf1a6aaa
      DB Tsai authored
      The following optimizations are done to improve the StandardScaler model
      transformation performance.
      
      1) Convert Breeze dense vector to primitive vector to reduce the overhead.
      2) Since mean can be potentially a sparse vector, we explicitly convert it to dense primitive vector.
      3) Have a local reference to `shift` and `factor` array so JVM can locate the value with one operation call.
      4) In pattern matching part, we use the mllib SparseVector/DenseVector instead of breeze's vector to
      make the codebase cleaner.
      
      Benchmark with mnist8m dataset:
      
      Before,
      DenseVector withMean and withStd: 50.97secs
      DenseVector withMean and withoutStd: 42.11secs
      DenseVector withoutMean and withStd: 8.75secs
      SparseVector withoutMean and withStd: 5.437secs
      
      With this PR,
      DenseVector withMean and withStd: 5.76secs
      DenseVector withMean and withoutStd: 5.28secs
      DenseVector withoutMean and withStd: 5.30secs
      SparseVector withoutMean and withStd: 1.27secs
      
      Note that without the local reference copy of the `factor` and `shift` arrays,
      the runtime is almost three times slower.
      
      DenseVector withMean and withStd: 18.15secs
      DenseVector withMean and withoutStd: 18.05secs
      DenseVector withoutMean and withStd: 18.54secs
      SparseVector withoutMean and withStd: 2.01secs
      
      The following code,
      ```scala
      while (i < size) {
         values(i) = (values(i) - shift(i)) * factor(i)
         i += 1
      }
      ```
      will generate the bytecode
      ```
         L13
          LINENUMBER 106 L13
         FRAME FULL [org/apache/spark/mllib/feature/StandardScalerModel org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/DenseVector T [D I I] []
          ILOAD 7
          ILOAD 6
          IF_ICMPGE L14
         L15
          LINENUMBER 107 L15
          ALOAD 5
          ILOAD 7
          ALOAD 5
          ILOAD 7
          DALOAD
          ALOAD 0
          INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.shift ()[D
          ILOAD 7
          DALOAD
          DSUB
          ALOAD 0
          INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.factor ()[D
          ILOAD 7
          DALOAD
          DMUL
          DASTORE
         L16
          LINENUMBER 108 L16
          ILOAD 7
          ICONST_1
          IADD
          ISTORE 7
          GOTO L13
      ```
      , while with the local reference of the `shift` and `factor` arrays, the bytecode will be
      ```
         L14
          LINENUMBER 107 L14
          ALOAD 0
          INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.factor ()[D
          ASTORE 9
         L15
          LINENUMBER 108 L15
         FRAME FULL [org/apache/spark/mllib/feature/StandardScalerModel org/apache/spark/mllib/linalg/Vector [D org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/DenseVector T [D I I [D] []
          ILOAD 8
          ILOAD 7
          IF_ICMPGE L16
         L17
          LINENUMBER 109 L17
          ALOAD 6
          ILOAD 8
          ALOAD 6
          ILOAD 8
          DALOAD
          ALOAD 2
          ILOAD 8
          DALOAD
          DSUB
          ALOAD 9
          ILOAD 8
          DALOAD
          DMUL
          DASTORE
         L18
          LINENUMBER 110 L18
          ILOAD 8
          ICONST_1
          IADD
          ISTORE 8
          GOTO L15
      ```
      
      You can see that with the local references, both arrays are held in local variables of the stack frame, so the JVM can access the values without calling `INVOKESPECIAL` on every iteration.
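
The same pattern can be shown outside the JVM. The sketch below is a Python analogy, not the MLlib code: the `StandardScalerModel` class here is a hypothetical stand-in whose `shift` and `factor` accessors count how often they are invoked, so the effect of hoisting them into local variables is visible without reading bytecode.

```python
class StandardScalerModel:
    """Hypothetical stand-in with `shift` and `factor` exposed via accessors."""

    def __init__(self, shift, factor):
        self._shift = shift
        self._factor = factor
        self.accessor_calls = 0

    @property
    def shift(self):
        self.accessor_calls += 1
        return self._shift

    @property
    def factor(self):
        self.accessor_calls += 1
        return self._factor

    def transform_slow(self, values):
        out = list(values)
        for i in range(len(out)):
            # Two accessor calls per element.
            out[i] = (out[i] - self.shift[i]) * self.factor[i]
        return out

    def transform_fast(self, values):
        # Hoist the accessors out of the loop: one call each in total.
        local_shift = self.shift
        local_factor = self.factor
        out = list(values)
        for i in range(len(out)):
            out[i] = (out[i] - local_shift[i]) * local_factor[i]
        return out


model = StandardScalerModel(shift=[1.0, 2.0], factor=[2.0, 0.5])
slow = model.transform_slow([3.0, 4.0])
slow_calls = model.accessor_calls
model.accessor_calls = 0
fast = model.transform_fast([3.0, 4.0])
fast_calls = model.accessor_calls
```

Both variants produce the same result; the hoisted version calls each accessor once instead of once per element.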
      
      Author: DB Tsai <dbtsai@alpinenow.com>
      
      Closes #3435 from dbtsai/standardscaler and squashes the following commits:
      
      85885a9 [DB Tsai] revert to have lazy in shift array.
      daf2b06 [DB Tsai] Address the feedback
      cdb5cef [DB Tsai] small change
      9c51eef [DB Tsai] style
      fc795e4 [DB Tsai] update
      5bffd3d [DB Tsai] first commit
      bf1a6aaa
    • [SPARK-4601][Streaming] Set correct call site for streaming jobs so that it is... · 69cd53ea
      Tathagata Das authored
      [SPARK-4601][Streaming] Set correct call site for streaming jobs so that it is displayed correctly on the Spark UI
      
      When running NetworkWordCount, the description of the word count jobs is set to "getCallsite at DStream:xxx". It should instead be the line number in the streaming application of the output operation that led to the job being created. The wrong description appears because the call site is incorrectly set in the thread that launches the jobs. This PR fixes that.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #3455 from tdas/streaming-callsite-fix and squashes the following commits:
      
      69fc26f [Tathagata Das] Set correct call site for streaming jobs so that it is displayed correctly on the Spark UI
      69cd53ea
    • [SPARK-4344][DOCS] adding documentation on spark.yarn.user.classpath.first · d2407601
      arahuja authored
      The documentation for the two parameters is the same with a pointer from the standalone parameter to the yarn parameter
      
      Author: arahuja <aahuja11@gmail.com>
      
      Closes #3209 from arahuja/yarn-classpath-first-param and squashes the following commits:
      
      51cb9b2 [arahuja] [SPARK-4344][DOCS] adding documentation for YARN on userClassPathFirst
      d2407601
    • [SPARK-4381][Streaming]Add warning log when user set spark.master to local in... · fef27b29
      jerryshao authored
      [SPARK-4381][Streaming]Add warning log when user set spark.master to local in Spark Streaming and there's no job executed
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #3244 from jerryshao/SPARK-4381 and squashes the following commits:
      
      d2486c7 [jerryshao] Improve the warning log
      d726e85 [jerryshao] Add local[1] to the filter condition
      eca428b [jerryshao] Add warning log
      fef27b29
    • [SPARK-4535][Streaming] Fix the error in comments · a51118a3
      q00251598 authored
      change `NetworkInputDStream` to `ReceiverInputDStream`
      change `ReceiverInputTracker` to `ReceiverTracker`
      
      Author: q00251598 <qiyadong@huawei.com>
      
      Closes #3400 from watermen/fix-comments and squashes the following commits:
      
      75d795c [q00251598] change 'NetworkInputDStream' to 'ReceiverInputDStream' && change 'ReceiverInputTracker' to 'ReceiverTracker'
      a51118a3
    • [SPARK-4526][MLLIB]GradientDescent get a wrong gradient value according to the gradient formula. · f515f943
      GuoQiang Li authored
      This is caused by the miniBatchSize parameter. The number of elements that `RDD.sample` returns is not fixed.
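
A short sketch of the underlying issue, in Python rather than against the RDD API. Sampling each element independently yields a batch whose size varies from draw to draw, so an average gradient should divide by the number of elements actually sampled. The helper names are hypothetical.

```python
import random


def bernoulli_sample(data, fraction, rng):
    """Keep each element independently with probability `fraction`."""
    return [x for x in data if rng.random() < fraction]


def mean_gradient(sampled_gradients):
    """Average over the number of elements actually sampled."""
    if not sampled_gradients:
        return None  # empty mini-batch: nothing to update in this iteration
    return sum(sampled_gradients) / len(sampled_gradients)


rng = random.Random(0)
gradients = [1.0] * 1000  # every example contributes a gradient of 1.0
batch_sizes = set()
means = []
for _ in range(20):
    batch = bernoulli_sample(gradients, 0.1, rng)
    batch_sizes.add(len(batch))
    means.append(mean_gradient(batch))
```

The batch sizes differ between draws, yet dividing by the observed count keeps the mean at 1.0, whereas dividing by the nominal size `0.1 * 1000` would not.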
      cc mengxr
      
      Author: GuoQiang Li <witgo@qq.com>
      
      Closes #3399 from witgo/GradientDescent and squashes the following commits:
      
      13cb228 [GuoQiang Li] review commit
      668ab66 [GuoQiang Li] Double to Long
      b6aa11a [GuoQiang Li] Check miniBatchSize is greater than 0
      0b5c3e3 [GuoQiang Li] Minor fix
      12e7424 [GuoQiang Li] GradientDescent get a wrong gradient value according to the gradient formula, which is caused by the miniBatchSize parameter.
      f515f943
    • [SPARK-4596][MLLib] Refactorize Normalizer to make code cleaner · 89f91226
      DB Tsai authored
      In this refactoring, the performance will be slightly increased due to removing
      the overhead from breeze vector. The bottleneck is still in breeze norm
      which is implemented by activeIterator.
      
      This inefficiency of breeze norm will be addressed in next PR. At least,
      this PR makes the code more consistent in the codebase.
      
      Author: DB Tsai <dbtsai@alpinenow.com>
      
      Closes #3446 from dbtsai/normalizer and squashes the following commits:
      
      e20a2b9 [DB Tsai] first commit
      89f91226
    • [DOC][Build] Wrong cmd for build spark with apache hadoop 2.4.X and hive 12 · 0fe54cff
      wangfei authored
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #3335 from scwf/patch-10 and squashes the following commits:
      
      d343113 [wangfei] add '-Phive'
      60d595e [wangfei] [DOC] Wrong cmd for build spark with apache hadoop 2.4.X and Hive 12 support
      0fe54cff
  4. Nov 24, 2014
    • [SQL] Compute timeTaken correctly · 723be60e
      w00228970 authored
      `timeTaken` should not include the time spent printing the result.
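
A minimal sketch of the measurement, in Python. `run_and_report`, `execute`, and `emit` are hypothetical names; the point is only that the clock stops before the result is printed.

```python
import time


def run_and_report(execute, emit):
    """Run a query, then emit its result; return the execution time only."""
    start = time.perf_counter()
    result = execute()
    time_taken = time.perf_counter() - start  # stop the clock before printing
    emit(result)
    return time_taken


# A fast query paired with a deliberately slow output step.
time_taken = run_and_report(lambda: 42, lambda result: time.sleep(0.2))
```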
      
      Author: w00228970 <wangfei1@huawei.com>
      
      Closes #3423 from scwf/time-taken-bug and squashes the following commits:
      
      da7e102 [w00228970] compute time taken correctly
      723be60e
    • [SPARK-4582][MLLIB] get raw vectors for further processing in Word2Vec · 9ce2bf38
      tkaessmann authored
      This is #3309 for the master branch.
      
      e.g. clustering
      
      Author: tkaessmann <tobias.kaessmann@s24.com>
      
      Closes #3309 from tkaessmann/branch-1.2 and squashes the following commits:
      
      e3a3142 [tkaessmann] changes the comment for getVectors
      58d3d83 [tkaessmann] removes sign from comment
      a5be213 [tkaessmann] fixes getVectors to fit code guidelines
      3782fa9 [tkaessmann] get raw vectors for further processing
      
      Author: tkaessmann <tobias.kaessmann@s24.com>
      
      Closes #3437 from mengxr/SPARK-4582 and squashes the following commits:
      
      6c666b4 [tkaessmann] get raw vectors for further processing in Word2Vec
      9ce2bf38
    • [SPARK-4525] Mesos should decline unused offers · f0afb623
      Jongyoul Lee authored
      Functionally, this is just a small change on top of #3393 (by jongyoul). The issue being addressed is discussed in the comments there. I have not yet added a test for the bug there. I will add one shortly.
      
      I've also done some minor renaming/clean-up of variables in this class and tests.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      Author: Jongyoul Lee <jongyoul@gmail.com>
      
      Closes #3436 from pwendell/mesos-issue and squashes the following commits:
      
      58c35b5 [Patrick Wendell] Adding unit test for this situation
      c4f0697 [Patrick Wendell] Additional clean-up and fixes on top of existing fix
      f20f1b3 [Jongyoul Lee] [SPARK-4525] MesosSchedulerBackend.resourceOffers cannot decline unused offers from acceptedOffers - Added code for declining unused offers among acceptedOffers - Edited testCase for checking declining unused offers
      f0afb623
    • Revert "[SPARK-4525] Mesos should decline unused offers" · a68d4422
      Patrick Wendell authored
      This reverts commit b043c274.
      
      I accidentally committed this using my own authorship credential. However,
      I should have given authorship to the original author: Jongyoul Lee.
      a68d4422
    • [SPARK-4525] Mesos should decline unused offers · b043c274
      Patrick Wendell authored
      Functionally, this is just a small change on top of #3393 (by jongyoul). The issue being addressed is discussed in the comments there. I have not yet added a test for the bug there. I will add one shortly.
      
      I've also done some minor renaming/clean-up of variables in this class and tests.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      Author: Jongyoul Lee <jongyoul@gmail.com>
      
      Closes #3436 from pwendell/mesos-issue and squashes the following commits:
      
      58c35b5 [Patrick Wendell] Adding unit test for this situation
      c4f0697 [Patrick Wendell] Additional clean-up and fixes on top of existing fix
      f20f1b3 [Jongyoul Lee] [SPARK-4525] MesosSchedulerBackend.resourceOffers cannot decline unused offers from acceptedOffers - Added code for declining unused offers among acceptedOffers - Edited testCase for checking declining unused offers
      b043c274
    • [SPARK-4266] [Web-UI] Reduce stage page load time. · d24d5bf0
      Kay Ousterhout authored
      The commit changes the JavaScript used to show/hide additional
      metrics in order to reduce page load time. SPARK-4016 significantly
      increased page load time for the stage page when stages had a lot
      (thousands or tens of thousands) of tasks, due to the additional
      Javascript to hide some metrics by default and stripe the tables.
      This commit reduces page load time in two ways:
      
      (1) Now, all of the metrics that are hidden by default are
      hidden by setting "display: none;" using CSS for the page,
      rather than hiding them using javascript after the page loads.
      Without this change, for stages with thousands of tasks, there
      was a few second delay after page load, where first the additional
      metrics were shown, and then after a delay were hidden once the
      relevant JS finished running.
      
      (2) CSS is used to stripe all of the tables except for the summary
      table. The summary table needs javascript to do the striping because
      some rows are hidden, but the javascript striping is slower, which
      again resulted in a delay when it was used for the task table (where
      for a few seconds after page load, all of the rows in the task table
      would be white, while the browser finished running the JS to stripe
      the table).
      
      cc pwendell
      
      This change is intended to be backported to 1.2 to avoid a regression in
      UI performance when users run large jobs.
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #3328 from kayousterhout/SPARK-4266 and squashes the following commits:
      
      f964091 [Kay Ousterhout] [SPARK-4266] [Web-UI] Reduce stage page load time.
      d24d5bf0
    • [SPARK-4548] [SPARK-4517] improve performance of python broadcast · 6cf50768
      Davies Liu authored
      Re-implement the Python broadcast using files:
      
      1) Serialize the Python object using cPickle and write it to disk.
      2) Create a wrapper in the JVM for the dumped file; it reads the data from the file during serialization.
      3) Use TorrentBroadcast or HttpBroadcast to transfer the (compressed) data to the executors.
      4) During deserialization, write the data to disk.
      5) Pass the path to the Python worker, which reads the data from disk and unpickles it into a Python object on first access.
      
      This fixes the performance regression introduced in #2659. Performance is similar to 1.1, but objects larger than 2G are now supported and memory efficiency is improved (only one compressed copy in the driver and executor).
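
The Python side of this scheme (steps 1 and 5) can be sketched as a file-backed value that is only unpickled on first access. This is an illustration, not the PySpark implementation: `FileBroadcast` and its methods are hypothetical, the JVM transfer in steps 2 to 4 is left out, and the modern `pickle` module is used in place of cPickle.

```python
import os
import pickle
import tempfile


class FileBroadcast:
    """Hypothetical file-backed broadcast value, unpickled on first access."""

    def __init__(self, path):
        self.path = path
        self._value = None
        self._loaded = False

    @classmethod
    def dump(cls, obj):
        """Driver side: pickle the object to a temporary file."""
        fd, path = tempfile.mkstemp(suffix=".pkl")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
        return cls(path)

    @property
    def value(self):
        """Worker side: read and unpickle the file only when first needed."""
        if not self._loaded:
            with open(self.path, "rb") as f:
                self._value = pickle.load(f)
            self._loaded = True
        return self._value


driver_side = FileBroadcast.dump({"weights": [0.1, 0.2, 0.3]})
worker_side = FileBroadcast(driver_side.path)  # only the path is passed along
loaded_before_access = worker_side._loaded
value = worker_side.value
loaded_after_access = worker_side._loaded
os.remove(driver_side.path)
```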
      
      Testing with a 500M broadcast and 4 tasks (excluding the benefit from reused worker in 1.2):
      
      name | 1.1 | 1.2 with this patch | improvement
      ---------|--------|---------|--------
      python-broadcast-w-bytes | 25.20 | 9.33 | 170.13%
      python-broadcast-w-set | 4.13 | 4.50 | -8.35%
      
      Testing with 100 tasks (16 CPUs):
      
      name | 1.1 | 1.2 with this patch | improvement
      ---------|--------|---------|--------
      python-broadcast-w-bytes | 38.16 | 8.40 | 353.98%
      python-broadcast-w-set | 23.29 | 9.59 | 142.80%
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3417 from davies/pybroadcast and squashes the following commits:
      
      50a58e0 [Davies Liu] address comments
      b98de1d [Davies Liu] disable gc while unpickle
      e5ee6b9 [Davies Liu] support large string
      09303b8 [Davies Liu] read all data into memory
      dde02dd [Davies Liu] improve performance of python broadcast
      6cf50768
    • Davies Liu's avatar
      [SPARK-4578] fix asDict() with nested Row() · 050616b4
      Davies Liu authored
      The Row object is created on the fly once the field is accessed, so we should access the fields via getattr() in asDict().
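
Why going through getattr() matters can be seen in a toy model. The sketch below is not PySpark's `Row`; `RawStruct` and this `Row` class are hypothetical stand-ins, assuming only that nested structs are stored in a raw form and wrapped into a `Row` at attribute-access time.

```python
from collections import namedtuple

# Hypothetical raw storage for a struct: field names plus their values.
RawStruct = namedtuple("RawStruct", ["names", "values"])


class Row:
    """Toy row whose nested rows are created on the fly when a field is read."""

    def __init__(self, raw):
        self._raw = raw

    def __getattr__(self, name):
        # Only reached for names not found normally, i.e. field names.
        if name.startswith("_"):
            raise AttributeError(name)
        raw = self._raw
        try:
            value = raw.values[raw.names.index(name)]
        except ValueError:
            raise AttributeError(name)
        # A nested struct is wrapped into a Row only at access time.
        return Row(value) if isinstance(value, RawStruct) else value

    def asDict(self):
        # getattr() returns the wrapped Row for nested fields; reading
        # self._raw.values directly would leak the raw storage instead.
        return {name: getattr(self, name) for name in self._raw.names}
```

With this model, `Row(RawStruct(("name", "address"), ("Ann", RawStruct(("city",), ("Oslo",))))).asDict()["address"]` is a `Row`, so `.city` works on it.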
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3434 from davies/fix_asDict and squashes the following commits:
      
      b20f1e7 [Davies Liu] fix asDict() with nested Row()
      050616b4
    • Davies Liu's avatar
      [SPARK-4562] [MLlib] speedup vector · b660de7a
      Davies Liu authored
      This PR changes the underlying array of DenseVector to numpy.ndarray to avoid the conversion, because most users will be using numpy.array.
      
      It also improves the serialization of DenseVector.
      
      Before this change:
      
      trial | trainingTime | testTime
      -------|--------|--------
      0 | 5.126 | 1.786
      1 | 2.698 | 1.693
      
      After the change:
      
      trial | trainingTime | testTime
      -------|--------|--------
      0 | 4.692 | 0.554
      1 | 2.307 | 0.525
      
      This could partially fix the performance regression during test.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3420 from davies/ser2 and squashes the following commits:
      
      0e1e6f3 [Davies Liu] fix tests
      426f5db [Davies Liu] impove toArray()
      44707ec [Davies Liu] add name for ISO-8859-1
      fa7d791 [Davies Liu] address comments
      1cfb137 [Davies Liu] handle zero sparse vector
      2548ee2 [Davies Liu] fix tests
      9e6389d [Davies Liu] bugfix
      470f702 [Davies Liu] speed up DenseMatrix
      f0d3c40 [Davies Liu] speedup SparseVector
      ef6ce70 [Davies Liu] speed up dense vector
      b660de7a
    • Tathagata Das's avatar
      [SPARK-4518][SPARK-4519][Streaming] Refactored file stream to prevent files... · cb0e9b09
      Tathagata Das authored
      [SPARK-4518][SPARK-4519][Streaming] Refactored file stream to prevent files from being processed multiple times
      
      Because of a corner case, a file already selected for batch t can get considered again for batch t+2. This refactoring fixes it by remembering all the files selected in the last 1 minute, so that this corner case does not arise. Also uses spark context's hadoop configuration to access the file system API for listing directories.
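
The "remember recently selected files" idea can be sketched as below. This is an illustration under assumptions, not the `FileInputDStream` code: `RecentFileTracker`, its method name, and the use of plain numbers for batch and modification times are all hypothetical.

```python
class RecentFileTracker:
    """Hypothetical helper: remembers files selected within a time window."""

    def __init__(self, remember_seconds=60):
        self.remember_seconds = remember_seconds
        self._selected = {}  # path -> batch time at which it was selected

    def select_new_files(self, batch_time, candidates):
        """candidates: iterable of (path, modification_time) pairs."""
        cutoff = batch_time - self.remember_seconds
        # Forget selections that have fallen out of the remember window.
        self._selected = {p: t for p, t in self._selected.items() if t >= cutoff}
        new_files = [
            path
            for path, mod_time in candidates
            # Files older than the window are ignored, so forgetting them is safe.
            if mod_time >= cutoff and path not in self._selected
        ]
        for path in new_files:
            self._selected[path] = batch_time
        return new_files
```

A file picked for batch t is still remembered at batch t+2, so it is not selected again; once it is old enough to be forgotten, its modification time already excludes it.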
      
      pwendell Please take a look. I still have not run long-running integration tests, so I cannot say for sure whether this has indeed solved the issue. You could do a first pass on this in the meantime.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #3419 from tdas/filestream-fix2 and squashes the following commits:
      
      c19dd8a [Tathagata Das] Addressed PR comments.
      513b608 [Tathagata Das] Updated docs.
      d364faf [Tathagata Das] Added the current time condition back
      5526222 [Tathagata Das] Removed unnecessary imports.
      38bb736 [Tathagata Das] Fix long line.
      203bbc7 [Tathagata Das] Un-ignore tests.
      eaef4e1 [Tathagata Das] Fixed SPARK-4519
      9dbd40a [Tathagata Das] Refactored FileInputDStream to remember last few batches.
      cb0e9b09
    • Josh Rosen's avatar
      [SPARK-4145] Web UI job pages · 4a90276a
      Josh Rosen authored
      This PR adds two new pages to the Spark Web UI:
      
      - A jobs overview page, which shows details on running / completed / failed jobs.
      - A job details page, which displays information on an individual job's stages.
      
      The jobs overview page is now the default UI homepage; the old homepage is still accessible at `/stages`.
      
      ### Screenshots
      
      #### New UI homepage
      
      ![image](https://cloud.githubusercontent.com/assets/50748/5119035/fd0a69e6-701f-11e4-89cb-db7e9705714f.png)
      
      #### Job details page
      
      (This is effectively a per-job version of the stages page that can be extended later with other things, such as DAG visualizations)
      
      ![image](https://cloud.githubusercontent.com/assets/50748/5134910/50b340d4-70c7-11e4-88e1-6b73237ea7c8.png)
      
      ### Key changes in this PR
      
      - Rename `JobProgressPage` to `AllStagesPage`
      - Expose `StageInfo` objects in the `SparkListenerJobStart` event; add backwards-compatibility tests to JsonProtocol.
      - Add additional data structures to `JobProgressListener` to map from stages to jobs.
      - Add several fields to `JobUIData`.
      
      I also added ~150 lines of Selenium tests as I uncovered UI issues while developing this patch.
      
      ### Limitations
      
      If a job contains stages that aren't run, then its overall job progress bar may be an underestimate of the total job progress; in other words, a completed job may appear to have a progress bar that's not at 100%.
      
      If stages or tasks fail, then the progress bar will not go backwards to reflect the true amount of remaining work.
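
The first limitation is simple arithmetic. The sketch below assumes, purely for illustration, that progress is computed as completed tasks over the total tasks of all of a job's stages; `job_progress` is a hypothetical function, not the UI code.

```python
def job_progress(stages):
    """stages: list of (num_tasks, completed_tasks) pairs, one per stage."""
    total = sum(num_tasks for num_tasks, _ in stages)
    done = sum(completed for _, completed in stages)
    # A job with no tasks at all is treated as complete.
    return done / total if total else 1.0


# A finished job whose first stage (10 tasks) was never run:
skipped_then_done = [(10, 0), (10, 10)]
progress = job_progress(skipped_then_done)
```

Under that assumption the finished job reports half of its tasks as complete, which is the underestimate described above.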
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3009 from JoshRosen/job-page and squashes the following commits:
      
      eb05e90 [Josh Rosen] Disable kill button in completed stages tables.
      f00c851 [Josh Rosen] Fix JsonProtocol compatibility
      b89c258 [Josh Rosen] More JSON protocol backwards-compatibility fixes.
      ff804cd [Josh Rosen] Don't write "Stage Ids" field in JobStartEvent JSON.
      6f17f3f [Josh Rosen] Only store StageInfos in SparkListenerJobStart event.
      2bbf41a [Josh Rosen] Update job progress bar to reflect skipped tasks/stages.
      61c265a [Josh Rosen] Add “skipped stages” table; only display non-empty tables.
      1f45d44 [Josh Rosen] Incorporate a bunch of minor review feedback.
      0b77e3e [Josh Rosen] More bug fixes for phantom stages.
      034aa8d [Josh Rosen] Use `.max()` to find result stage for job.
      eebdc2c [Josh Rosen] Don’t display pending stages for completed jobs.
      67080ba [Josh Rosen] Ensure that "phantom stages" don't cause memory leaks.
      7d10b97 [Josh Rosen] Merge remote-tracking branch 'apache/master' into job-page
      d69c775 [Josh Rosen] Fix table sorting on all jobs page.
      5eb39dc [Josh Rosen] Add pending stages table to job page.
      f2a15da [Josh Rosen] Add status field to job details page.
      171b53c [Josh Rosen] Move `startTime` to the start of SparkContext.
      e2f2c43 [Josh Rosen] Fix sorting of stages in job details page.
      8955f4c [Josh Rosen] Display information for pending stages on jobs page.
      8ab6c28 [Josh Rosen] Compute numTasks from job start stage infos.
      5884f91 [Josh Rosen] Add StageInfos to SparkListenerJobStart event.
      79793cd [Josh Rosen] Track indices of completed stage to avoid overcounting when failures occur.
      d62ea7b [Josh Rosen] Add failing Selenium test for stage overcounting issue.
      1145c60 [Josh Rosen] Display text instead of progress bar for stages.
      3d0a007 [Josh Rosen] Merge remote-tracking branch 'origin/master' into job-page
      8a2351b [Josh Rosen] Add help tooltip to Spark Jobs page.
      b7bf30e [Josh Rosen] Add stages progress bar; fix bug where active stages show as completed.
      4846ce4 [Josh Rosen] Hide "(Job Group") if no jobs were submitted in job groups.
      4d58e55 [Josh Rosen] Change label to "Tasks (for all stages)"
      85e9c85 [Josh Rosen] Extract startTime into separate variable.
      1cf4987 [Josh Rosen] Fix broken kill links; add Selenium test to avoid future regressions.
      56701fa [Josh Rosen] Move last stage name / description logic out of markup.
      a475ea1 [Josh Rosen] Add progress bars to jobs page.
      45343b8 [Josh Rosen] More comments
      4b206fb [Josh Rosen] Merge remote-tracking branch 'origin/master' into job-page
      bfce2b9 [Josh Rosen] Address review comments, except for progress bar.
      4487dcb [Josh Rosen] [SPARK-4145] Web UI job pages
      2568a6c [Josh Rosen] Rename JobProgressPage to AllStagesPage:
      4a90276a
    • Kousuke Saruta's avatar
      [SPARK-4487][SQL] Fix attribute reference resolution error when using ORDER BY. · dd1c9cb3
      Kousuke Saruta authored
      When an ORDER BY clause is used, the attributes referenced by the projection are resolved first (1).
      Then the attributes referenced in the ORDER BY clause are resolved (2).
      But when the attributes referenced in the ORDER BY clause are resolved, the resolution result generated in (1) is discarded, so, for example, the following query fails.
      
          SELECT c1 + c2 FROM mytable ORDER BY c1;
      
      The query above fails because when resolving the attribute reference 'c1', the resolution result of 'c2' is discarded.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #3363 from sarutak/SPARK-4487 and squashes the following commits:
      
      fd314f3 [Kousuke Saruta] Fixed attribute resolution logic in Analyzer
      6e60c20 [Kousuke Saruta] Fixed conflicts
      cb5b7e9 [Kousuke Saruta] Added test case for SPARK-4487
      282d529 [Kousuke Saruta] Fixed attributes reference resolution error
      b6123e6 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into concat-feature
      317b7fb [Kousuke Saruta] WIP
      dd1c9cb3
    • scwf's avatar
      [SQL] Fix path in HiveFromSpark · b3841193
      scwf authored
      ```HiveFromSpark``` uses a relative path, so it has to be run from a specific directory; otherwise ```run-example``` fails (http://apache-spark-developers-list.1001551.n3.nabble.com/src-main-resources-kv1-txt-not-found-in-example-of-HiveFromSpark-td9100.html).
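
The general remedy, resolving the resource against a known base directory instead of the current working directory, might look like the sketch below. `example_resource` is a hypothetical helper, and falling back to the `SPARK_HOME` environment variable is an assumption suggested by the squashed commit titles rather than a description of the actual change.

```python
import os


def example_resource(relative_path, base_dir=None):
    """Resolve relative_path against base_dir, SPARK_HOME, or the cwd."""
    # Assumption: SPARK_HOME, when set, points at the Spark installation root.
    base = base_dir or os.environ.get("SPARK_HOME") or os.getcwd()
    return os.path.join(base, relative_path)
```

The resulting path no longer depends on the directory the example is launched from, as long as the base directory is known.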
      
      Author: scwf <wangfei1@huawei.com>
      
      Closes #3415 from scwf/HiveFromSpark and squashes the following commits:
      
      ed3d6c9 [scwf] revert no need change
      b00e20c [scwf] fix path usring spark_home
      dbd321b [scwf] fix path in hivefromspark
      b3841193
    • Daniel Darabos's avatar
      [SQL] Fix comment in HiveShim · d5834f07
      Daniel Darabos authored
      This file is for Hive 0.13.1 I think.
      
      Author: Daniel Darabos <darabos.daniel@gmail.com>
      
      Closes #3432 from darabos/patch-2 and squashes the following commits:
      
      4fd22ed [Daniel Darabos] Fix comment. This file is for Hive 0.13.1.
      d5834f07
    • Cheng Lian's avatar
      [SPARK-4479][SQL] Avoids unnecessary defensive copies when sort based shuffle is on · a6d7b61f
      Cheng Lian authored
      This PR is a workaround for SPARK-4479. Two changes are introduced: when merge sort is bypassed in `ExternalSorter`,
      
      1. also bypass RDD element buffering, since buffering is the reason that `MutableRow`-backed row objects must be copied, and
      2. avoid defensive copies in the `Exchange` operator
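
Why buffering forces copies while streaming does not can be shown with a toy iterator that reuses one mutable object for every element. This is an analogy in plain Python, not Spark code; all names below are hypothetical.

```python
def mutable_rows(n):
    """Yield the same list object n times, overwriting its content each time."""
    row = [None]  # one buffer reused for every element
    for i in range(n):
        row[0] = i
        yield row


def buffer_without_copy(n):
    # Every buffered entry is the same object, so all show the last value.
    return list(mutable_rows(n))


def buffer_with_copy(n):
    # A defensive copy per element keeps the values distinct.
    return [list(row) for row in mutable_rows(n)]


def stream(n):
    # Consuming each element before asking for the next needs no copy.
    return [row[0] for row in mutable_rows(n)]
```

If elements are consumed one at a time instead of being buffered, the per-element copy becomes unnecessary, which is the saving the two changes above aim for.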
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3422 from liancheng/avoids-defensive-copies and squashes the following commits:
      
      591f2e9 [Cheng Lian] Passes all shuffle suites
      0c3c91e [Cheng Lian] Fixes shuffle write metrics when merge sort is bypassed
      ed5df3c [Cheng Lian] Fixes styling changes
      f75089b [Cheng Lian] Avoids unnecessary defensive copies when sort based shuffle is on
      a6d7b61f
    • Sandy Ryza's avatar
      SPARK-4457. Document how to build for Hadoop versions greater than 2.4 · 29372b63
      Sandy Ryza authored
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #3322 from sryza/sandy-spark-4457 and squashes the following commits:
      
      5e72b77 [Sandy Ryza] Feedback
      0cf05c1 [Sandy Ryza] Caveat
      be8084b [Sandy Ryza] SPARK-4457. Document how to build for Hadoop versions greater than 2.4
      29372b63