  1. Dec 01, 2014
    • Cheng Lian's avatar
      [SPARK-4258][SQL][DOC] Documents spark.sql.parquet.filterPushdown · 5db8dcaf
      Cheng Lian authored
      Documents `spark.sql.parquet.filterPushdown`, explains why it's turned off by default and when it's safe to be turned on.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3440 from liancheng/parquet-filter-pushdown-doc and squashes the following commits:
      
      2104311 [Cheng Lian] Documents spark.sql.parquet.filterPushdown
      5db8dcaf
    • Madhu Siddalingaiah's avatar
      Documentation: add description for repartitionAndSortWithinPartitions · 2b233f5f
      Madhu Siddalingaiah authored
      Author: Madhu Siddalingaiah <madhu@madhu.com>
      
      Closes #3390 from msiddalingaiah/master and squashes the following commits:
      
      cbccbfe [Madhu Siddalingaiah] Documentation: replace <b> with <code> (again)
      332f7a2 [Madhu Siddalingaiah] Documentation: replace <b> with <code>
      cd2b05a [Madhu Siddalingaiah] Merge remote-tracking branch 'upstream/master'
      0fc12d7 [Madhu Siddalingaiah] Documentation: add description for repartitionAndSortWithinPartitions
      2b233f5f
    • zsxwing's avatar
      [SPARK-4661][Core] Minor code and docs cleanup · 30a86acd
      zsxwing authored
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3521 from zsxwing/SPARK-4661 and squashes the following commits:
      
      03cbe3f [zsxwing] Minor code and docs cleanup
      30a86acd
    • zsxwing's avatar
      [SPARK-4664][Core] Throw an exception when spark.akka.frameSize > 2047 · 1d238f22
      zsxwing authored
      If `spark.akka.frameSize` > 2047, it will overflow and become negative. We should add an assertion in `maxFrameSizeBytes` to warn people.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3527 from zsxwing/SPARK-4664 and squashes the following commits:
      
      0089c7a [zsxwing] Throw an exception when spark.akka.frameSize > 2047
      1d238f22
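      The overflow is easy to demonstrate: the frame size is configured in MB and converted to bytes with 32-bit integer arithmetic, so any value above 2047 wraps negative. A minimal sketch, assuming a conversion like the one described (the method and constant names are illustrative, not the actual Spark code):

      ```java
      public class FrameSizeCheck {
          // floor(Integer.MAX_VALUE / 1024 / 1024) = 2047
          static final int MAX_FRAME_SIZE_MB = 2047;

          // Mirrors the MB-to-bytes conversion; an int overflows past 2047 MB,
          // so validate before multiplying.
          static int maxFrameSizeBytes(int frameSizeMb) {
              if (frameSizeMb > MAX_FRAME_SIZE_MB) {
                  throw new IllegalArgumentException(
                      "spark.akka.frameSize should not be greater than "
                      + MAX_FRAME_SIZE_MB + " MB");
              }
              return frameSizeMb * 1024 * 1024;
          }

          public static void main(String[] args) {
              System.out.println(2048 * 1024 * 1024);      // wraps to -2147483648
              System.out.println(maxFrameSizeBytes(2047)); // 2146435072
          }
      }
      ```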
    • Sean Owen's avatar
      SPARK-2192 [BUILD] Examples Data Not in Binary Distribution · 6384f42a
      Sean Owen authored
      Simply, add data/ to distributions. This adds about 291KB (compressed) to the tarball, FYI.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #3480 from srowen/SPARK-2192 and squashes the following commits:
      
      47688f1 [Sean Owen] Add data/ to distributions
      6384f42a
    • Kousuke Saruta's avatar
      Fix wrong file name pattern in .gitignore · 97eb6d7f
      Kousuke Saruta authored
      In .gitignore, there is an entry for spark-*-bin.tar.gz, but considering make-distribution.sh, the name pattern should be spark-*-bin-*.tgz.
      
      This change is really small, so I didn't open a JIRA issue. If one is needed, please let me know.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #3529 from sarutak/fix-wrong-tgz-pattern and squashes the following commits:
      
      de3c70a [Kousuke Saruta] Fixed wrong file name pattern in .gitignore
      97eb6d7f
  2. Nov 30, 2014
    • Prabeesh K's avatar
      [SPARK-4632] version update · 5e7a6dcb
      Prabeesh K authored
      Author: Prabeesh K <prabsmails@gmail.com>
      
      Closes #3495 from prabeesh/master and squashes the following commits:
      
      ab03d50 [Prabeesh K] Update pom.xml
      8c6437e [Prabeesh K] Revert
      e10b40a [Prabeesh K] version update
      dbac9eb [Prabeesh K] Revert
      ec0b1c3 [Prabeesh K] [SPARK-4632] version update
      a835505 [Prabeesh K] [SPARK-4632] version update
      831391b [Prabeesh K]  [SPARK-4632] version update
      5e7a6dcb
    • Patrick Wendell's avatar
      MAINTENANCE: Automated closing of pull requests. · 06dc1b15
      Patrick Wendell authored
      This commit exists to close the following pull requests on Github:
      
      Closes #2915 (close requested by 'JoshRosen')
      Closes #3140 (close requested by 'JoshRosen')
      Closes #3366 (close requested by 'JoshRosen')
      06dc1b15
    • Cheng Lian's avatar
      [DOC] Fixes formatting typo in SQL programming guide · 2a4d389f
      Cheng Lian authored
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3498 from liancheng/fix-sql-doc-typo and squashes the following commits:
      
      865ecd7 [Cheng Lian] Fixes formatting typo in SQL programming guide
      2a4d389f
    • lewuathe's avatar
      [SPARK-4656][Doc] Typo in Programming Guide markdown · a217ec5f
      lewuathe authored
      Fixes a grammatical error in the Programming Guide document.
      
      Author: lewuathe <lewuathe@me.com>
      
      Closes #3412 from Lewuathe/typo-programming-guide and squashes the following commits:
      
      a3e2f00 [lewuathe] Typo in Programming Guide markdown
      a217ec5f
    • carlmartin's avatar
      [SPARK-4623] Add an error message when using spark-sql in yarn-cluster mode · aea7a997
      carlmartin authored
      When spark-sql is used in yarn-cluster mode, print an error message, just as the Spark shell does in yarn-cluster mode.
      
      Author: carlmartin <carlmartinmax@gmail.com>
      Author: huangzhaowei <carlmartinmax@gmail.com>
      
      Closes #3479 from SaintBacchus/sparkSqlShell and squashes the following commits:
      
      35829a9 [carlmartin] improve the description of comment
      e6c1eb7 [carlmartin] add a comment in bin/spark-sql to remind user who wants to change the class
      f1c5c8d [carlmartin] Merge branch 'master' into sparkSqlShell
      8e112c5 [huangzhaowei] singular form
      ec957bc [carlmartin] Add the some error infomation if using spark-sql in yarn-cluster mode
      7bcecc2 [carlmartin] Merge branch 'master' of https://github.com/apache/spark into codereview
      4fad75a [carlmartin] Add the Error infomation using spark-sql in yarn-cluster mode
      aea7a997
    • Sean Owen's avatar
      SPARK-2143 [WEB UI] Add Spark version to UI footer · 048ecca6
      Sean Owen authored
      This PR adds the Spark version number to the UI footer; this is how it looks:
      
      ![screen shot 2014-11-21 at 22 58 40](https://cloud.githubusercontent.com/assets/822522/5157738/f4822094-7316-11e4-98f1-333a535fdcfa.png)
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #3410 from srowen/SPARK-2143 and squashes the following commits:
      
      e9b3a7a [Sean Owen] Add Spark version to footer
      048ecca6
  3. Nov 29, 2014
    • Takuya UESHIN's avatar
      [DOCS][BUILD] Add instruction to use change-version-to-2.11.sh in 'Building for Scala 2.11'. · 0fcd24cc
      Takuya UESHIN authored
      To build with Scala 2.11, we have to execute `change-version-to-2.11.sh` before running Maven; otherwise inter-module dependencies are broken.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #3361 from ueshin/docs/building-spark_2.11 and squashes the following commits:
      
      1d29126 [Takuya UESHIN] Add instruction to use change-version-to-2.11.sh in 'Building for Scala 2.11'.
      0fcd24cc
    • Takayuki Hasegawa's avatar
      SPARK-4507: PR merge script should support closing multiple JIRA tickets · 4316a7b0
      Takayuki Hasegawa authored
      This will fix SPARK-4507.
      
      For pull requests that reference multiple JIRAs in their titles, it would be helpful if the PR merge script offered to close all of them.
      
      Author: Takayuki Hasegawa <takayuki.hasegawa0311@gmail.com>
      
      Closes #3428 from hase1031/SPARK-4507 and squashes the following commits:
      
      bf6d64b [Takayuki Hasegawa] SPARK-4507: try to resolve issue when no JIRAs in title
      401224c [Takayuki Hasegawa] SPARK-4507: moved codes as before
      ce89021 [Takayuki Hasegawa] SPARK-4507: PR merge script should support closing multiple JIRA tickets
      4316a7b0
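      The core of the merge-script change is extracting every ticket id from the PR title rather than just the first one. The real merge script is Python; a sketch of the same parsing in Java (`extractJiraIds` is an illustrative name, not from the script):

      ```java
      import java.util.ArrayList;
      import java.util.List;
      import java.util.regex.Matcher;
      import java.util.regex.Pattern;

      public class JiraIds {
          // Find every SPARK-XXXX id in a PR title so each referenced
          // JIRA ticket can be offered for closing.
          static List<String> extractJiraIds(String title) {
              List<String> ids = new ArrayList<>();
              Matcher m = Pattern.compile("SPARK-\\d+").matcher(title);
              while (m.find()) {
                  ids.add(m.group());
              }
              return ids;
          }

          public static void main(String[] args) {
              System.out.println(extractJiraIds("[SPARK-4507][SPARK-4508] Fix two things"));
              // [SPARK-4507, SPARK-4508]
          }
      }
      ```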
    • zsxwing's avatar
      [SPARK-4505][Core] Add a ClassTag parameter to CompactBuffer[T] · c0622242
      zsxwing authored
      Added a ClassTag parameter to CompactBuffer, so CompactBuffer[T] can create primitive arrays for primitive types. This reduces memory usage significantly for primitive types, at only a minor performance cost.
      
      Here is my test code:
      ```scala
        // Call org.apache.spark.util.SizeEstimator.estimate
        def estimateSize(obj: AnyRef): Long = {
          val c = Class.forName("org.apache.spark.util.SizeEstimator$")
          val f = c.getField("MODULE$")
          val o = f.get(c)
          val m = c.getMethod("estimate", classOf[Object])
          m.setAccessible(true)
          m.invoke(o, obj).asInstanceOf[Long]
        }
      
        sc.parallelize(1 to 10000).groupBy(_ => 1).foreach {
          case (k, v) =>
            println(v.getClass() + " size: " + estimateSize(v))
        }
      ```
      
      Using the previous CompactBuffer, the output was
      ```
      class org.apache.spark.util.collection.CompactBuffer size: 313358
      ```
      
      Using the new CompactBuffer, the output was
      ```
      class org.apache.spark.util.collection.CompactBuffer size: 65712
      ```
      
      In this case, the new `CompactBuffer` only used 20% memory of the previous one. It's really helpful for `groupByKey` when using a primitive value.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3378 from zsxwing/SPARK-4505 and squashes the following commits:
      
      4abdbba [zsxwing] Add a ClassTag parameter to reduce the memory usage of CompactBuffer[T] when T is a primitive type
      c0622242
    • Kousuke Saruta's avatar
      [SPARK-4057] Use -agentlib instead of -Xdebug in sbt-launch-lib.bash for debugging · 938dc141
      Kousuke Saruta authored
      In sbt-launch-lib.bash, the -Xdebug option is used for debugging. We should use the -agentlib option for Java 6+.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #2904 from sarutak/SPARK-4057 and squashes the following commits:
      
      39b5320 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-4057
      26b4af8 [Kousuke Saruta] Improved java option for debugging
      938dc141
    • Stephen Haberman's avatar
      Include the key name when failing on an invalid value. · 95290bf4
      Stephen Haberman authored
      Admittedly a really small tweak.
      
      Author: Stephen Haberman <stephen@exigencecorp.com>
      
      Closes #3514 from stephenh/include-key-name-in-npe and squashes the following commits:
      
      937740a [Stephen Haberman] Include the key name when failing on an invalid value.
      95290bf4
    • Nicholas Chammas's avatar
      [SPARK-3398] [SPARK-4325] [EC2] Use EC2 status checks. · 317e114e
      Nicholas Chammas authored
      This PR re-introduces [0e648bc](https://github.com/apache/spark/commit/0e648bc2bedcbeb55fce5efac04f6dbad9f063b4) from PR #2339, which somehow never made it into the codebase.
      
      Additionally, it removes a now-unnecessary linear backoff on the SSH checks since we are blocking on EC2 status checks before testing SSH.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #3195 from nchammas/remove-ec2-ssh-backoff and squashes the following commits:
      
      efb29e1 [Nicholas Chammas] Revert "Remove linear backoff."
      ef3ca99 [Nicholas Chammas] reuse conn
      adb4eaa [Nicholas Chammas] Remove linear backoff.
      55caa24 [Nicholas Chammas] Check EC2 status checks before SSH.
      317e114e
  4. Nov 28, 2014
    • Patrick Wendell's avatar
      MAINTENANCE: Automated closing of pull requests. · 047ff573
      Patrick Wendell authored
      This commit exists to close the following pull requests on Github:
      
      Closes #3451 (close requested by 'pwendell')
      Closes #1310 (close requested by 'pwendell')
      Closes #3207 (close requested by 'JoshRosen')
      047ff573
    • Liang-Chi Hsieh's avatar
      [SPARK-4597] Use proper exception and reset variable in Utils.createTempDir() · 49fe8797
      Liang-Chi Hsieh authored
      `File.exists()` and `File.mkdirs()` only throw `SecurityException`, not `IOException`. Also, when an exception is thrown, `dir` should be reset.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #3449 from viirya/fix_createtempdir and squashes the following commits:
      
      36cacbd [Liang-Chi Hsieh] Use proper exception and reset variable.
      49fe8797
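      The fix amounts to catching the exception the file operations can actually throw and resetting `dir` before retrying. A hypothetical sketch of the pattern (class and method names are illustrative, not the actual `Utils.createTempDir`):

      ```java
      import java.io.File;
      import java.util.UUID;

      public class TempDirs {
          // Catch the SecurityException that File.exists()/File.mkdirs() can
          // actually throw (rather than IOException), and reset `dir` so a
          // failed attempt never leaks a half-initialized path into the next
          // iteration.
          public static File createTempDir(String root) {
              int attempts = 0;
              File dir = null;
              while (dir == null) {
                  if (++attempts > 10) {
                      throw new RuntimeException(
                          "Failed to create a temp directory under " + root);
                  }
                  try {
                      File candidate = new File(root, "spark-" + UUID.randomUUID());
                      if (!candidate.exists() && candidate.mkdirs()) {
                          dir = candidate;
                      }
                  } catch (SecurityException e) {
                      dir = null; // reset the variable on failure, as the fix does
                  }
              }
              return dir;
          }

          public static void main(String[] args) {
              File dir = createTempDir(System.getProperty("java.io.tmpdir"));
              System.out.println(dir.isDirectory()); // true
              dir.delete();
          }
      }
      ```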
    • Sean Owen's avatar
      SPARK-1450 [EC2] Specify the default zone in the EC2 script help · 48223d88
      Sean Owen authored
      This looks like a one-liner, so I took a shot at it. There can be no fixed default availability zone, since the names differ per region, but the default behavior can be documented:
      
      ```
          if opts.zone == "":
              opts.zone = random.choice(conn.get_all_zones()).name
      ```
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #3454 from srowen/SPARK-1450 and squashes the following commits:
      
      9193cf3 [Sean Owen] Document that --zone defaults to a single random zone
      48223d88
    • Marcelo Vanzin's avatar
      [SPARK-4584] [yarn] Remove security manager from Yarn AM. · 915f8eeb
      Marcelo Vanzin authored
      The security manager adds a lot of overhead to the runtime of the
      app, and causes a severe performance regression. Even stubbing out
      all unneeded methods (all except checkExit()) does not help.
      
      So, instead, penalize users who do an explicit System.exit() by leaving
      them in "undefined behavior" territory: if they do that, the Yarn
      backend won't be able to report the final app status to the RM.
      The result is that the final status of the application might not match
      the user's expectations.
      
      One side-effect of the change is that users who do an explicit
      System.exit() will lose the AM retry functionality. Since there is
      no way to know if the exit was because of success or failure, the
      AM right now errs on the side of it being a successful exit.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #3484 from vanzin/SPARK-4584 and squashes the following commits:
      
      21f2502 [Marcelo Vanzin] Do not retry apps that use System.exit().
      4198b3b [Marcelo Vanzin] [SPARK-4584] [yarn] Remove security manager from Yarn AM.
      915f8eeb
    • Takuya UESHIN's avatar
      [SPARK-4193][BUILD] Disable doclint in Java 8 to prevent from build error. · e464f0ac
      Takuya UESHIN authored
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #3058 from ueshin/issues/SPARK-4193 and squashes the following commits:
      
      e096bb1 [Takuya UESHIN] Add a plugin declaration to pluginManagement.
      6762ec2 [Takuya UESHIN] Fix usage of -Xdoclint javadoc option.
      fdb280a [Takuya UESHIN] Fix Javadoc errors.
      4745f3c [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4193
      923e2f0 [Takuya UESHIN] Use doclint option `-missing` instead of `none`.
      30d6718 [Takuya UESHIN] Fix Javadoc errors.
      b548017 [Takuya UESHIN] Disable doclint in Java 8 to prevent from build error.
      e464f0ac
    • Daoyuan Wang's avatar
      [SPARK-4643] [Build] Remove unneeded staging repositories from build · 53ed7f1c
      Daoyuan Wang authored
      The old location will return a 404.
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #3504 from adrian-wang/repo and squashes the following commits:
      
      f604e05 [Daoyuan Wang] already in maven central, remove at all
      f494fac [Daoyuan Wang] spark staging repo outdated
      53ed7f1c
    • KaiXinXiaoLei's avatar
      Delete unnecessary function · 052e6581
      KaiXinXiaoLei authored
      When building Spark with sbt, the function "runAlternateBoot" in sbt/sbt-launch-lib.bash is not used, nor is it used anywhere else in the Spark code, so I think it is unnecessary. The "sbt.boot.properties" option can instead be set on the command line when building Spark, e.g.:
      sbt/sbt assembly -Dsbt.boot.properties=$bootpropsfile
      
      The file comes from https://github.com/sbt/sbt-launcher-package, where the function "runAlternateBoot" has already been deleted upstream. I think the Spark project should delete this function from sbt/sbt-launch-lib.bash as well. Thanks.
      
      Author: KaiXinXiaoLei <huleilei1@huawei.com>
      
      Closes #3224 from KaiXinXiaoLei/deleteFunction and squashes the following commits:
      
      e8eac49 [KaiXinXiaoLei] Delete blank lines.
      efe36d4 [KaiXinXiaoLei] Delete unnecessary function
      052e6581
    • Cheng Lian's avatar
      [SPARK-4645][SQL] Disables asynchronous execution in Hive 0.13.1 HiveThriftServer2 · 5b99bf24
      Cheng Lian authored
      This PR disables HiveThriftServer2 asynchronous execution by setting `runInBackground` argument in `ExecuteStatementOperation` to `false`, and reverting `SparkExecuteStatementOperation.run` in Hive 13 shim to Hive 12 version. This change makes Simba ODBC driver v1.0.0.1000 work.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3506 from liancheng/disable-async-exec and squashes the following commits:
      
      593804d [Cheng Lian] Disables asynchronous execution in Hive 0.13.1 HiveThriftServer2
      5b99bf24
    • maji2014's avatar
      [SPARK-4619][Storage]delete redundant time suffix · ceb62819
      maji2014 authored
      Utils.getUsedTimeMs(startTime) already includes the time suffix, so there is no need to append it again; delete the redundant suffix.
      
      Author: maji2014 <maji3@asiainfo.com>
      
      Closes #3475 from maji2014/SPARK-4619 and squashes the following commits:
      
      df0da4e [maji2014] delete redundant time suffix
      ceb62819
  5. Nov 27, 2014
    • Cheng Lian's avatar
      [SPARK-4613][Core] Java API for JdbcRDD · 120a3502
      Cheng Lian authored
      This PR introduces a set of Java APIs for using `JdbcRDD`:
      
      1. Trait (interface) `JdbcRDD.ConnectionFactory`: equivalent to the `getConnection: () => Connection` parameter in `JdbcRDD` constructor.
      2. Two overloaded versions of `Jdbc.create`: used to create `JavaRDD` that wraps a `JdbcRDD`.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3478 from liancheng/japi-jdbc-rdd and squashes the following commits:
      
      9a54625 [Cheng Lian] Only shutdowns a single DB rather than the whole Derby driver
      d4cedc5 [Cheng Lian] Moves Java JdbcRDD test case to a separate test suite
      ffcdf2e [Cheng Lian] Java API for JdbcRDD
      120a3502
    • roxchkplusony's avatar
      [SPARK-4626] Kill a task only if the executorId is (still) registered with the scheduler · 84376d31
      roxchkplusony authored
      Author: roxchkplusony <roxchkplusony@gmail.com>
      
      Closes #3483 from roxchkplusony/bugfix/4626 and squashes the following commits:
      
      aba9184 [roxchkplusony] replace warning message per review
      5e7fdea [roxchkplusony] [SPARK-4626] Kill a task only if the executorId is (still) registered with the scheduler
      84376d31
    • Sean Owen's avatar
      SPARK-4170 [CORE] Closure problems when running Scala app that "extends App" · 5d7fe178
      Sean Owen authored
      Warn against subclassing scala.App, and remove one instance of this in examples
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #3497 from srowen/SPARK-4170 and squashes the following commits:
      
      4a6131f [Sean Owen] Restore multiline string formatting
      a8ca895 [Sean Owen] Warn against subclassing scala.App, and remove one instance of this in examples
      5d7fe178
    • Andrew Or's avatar
      [Release] Automate generation of contributors list · c86e9bc4
      Andrew Or authored
      This commit provides a script that computes the contributors list
      by linking the github commits with JIRA issues. Automatically
      translating github usernames remains a TODO at this point.
      c86e9bc4
  6. Nov 26, 2014
    • CodingCat's avatar
      [SPARK-732][SPARK-3628][CORE][RESUBMIT] eliminate duplicate update on accumulator · 5af53ada
      CodingCat authored
      https://issues.apache.org/jira/browse/SPARK-3628
      
      In the current implementation, the accumulator is updated for every successfully finished task, even when the task comes from a resubmitted stage, which makes the accumulator counter-intuitive.
      
      In this patch, I changed the way the DAGScheduler updates the accumulator.
      
      The DAGScheduler maintains a hash table mapping each stage id to the received <accumulator_id, value> pairs. Only when the stage becomes independent (no job needs it any more) do we accumulate the values of the <accumulator_id, value> pairs. When a task finishes, we check whether the hash table already contains that stage id; it saves the <accumulator_id, value> pair only when the task is the first finished task of a new stage or the stage is running its first attempt.
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #2524 from CodingCat/SPARK-732-1 and squashes the following commits:
      
      701a1e8 [CodingCat] roll back change on Accumulator.scala
      1433e6f [CodingCat] make MIMA happy
      b233737 [CodingCat] address Matei's comments
      02261b8 [CodingCat] rollback  some changes
      6b0aff9 [CodingCat] update document
      2b2e8cf [CodingCat] updateAccumulator
      83b75f8 [CodingCat] style fix
      84570d2 [CodingCat] re-enable  the bad accumulator guard
      1e9e14d [CodingCat] add NPE guard
      21b6840 [CodingCat] simplify the patch
      88d1f03 [CodingCat] fix rebase error
      f74266b [CodingCat] add test case for resubmitted result stage
      5cf586f [CodingCat] de-duplicate on task level
      138f9b3 [CodingCat] make MIMA happy
      67593d2 [CodingCat] make if allowing duplicate update as an option of accumulator
      5af53ada
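      A much-simplified sketch of the de-duplication idea (illustrative names; the real DAGScheduler bookkeeping is more involved): remember which partitions of a stage have already contributed, so a resubmitted task's update is dropped instead of double-counted.

      ```java
      import java.util.HashMap;
      import java.util.HashSet;
      import java.util.Map;
      import java.util.Set;

      public class AccumulatorDedup {
          // Which partitions of each stage have already contributed an update.
          private final Map<Integer, Set<Integer>> seenPartitions = new HashMap<>();
          private final Map<Long, Long> accumulators = new HashMap<>();

          public void onTaskFinished(int stageId, int partition, long accumId, long delta) {
              Set<Integer> seen =
                  seenPartitions.computeIfAbsent(stageId, s -> new HashSet<>());
              if (seen.add(partition)) {
                  // First completion of this partition in this stage: apply it.
                  accumulators.merge(accumId, delta, Long::sum);
              }
              // A later completion of the same partition (e.g. from a
              // resubmitted stage attempt) is silently dropped.
          }

          public long value(long accumId) {
              return accumulators.getOrDefault(accumId, 0L);
          }

          public static void main(String[] args) {
              AccumulatorDedup scheduler = new AccumulatorDedup();
              scheduler.onTaskFinished(1, 0, 100L, 5);
              scheduler.onTaskFinished(1, 0, 100L, 5); // resubmitted duplicate
              scheduler.onTaskFinished(1, 1, 100L, 7);
              System.out.println(scheduler.value(100L)); // 12, not 17
          }
      }
      ```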
    • Xiangrui Meng's avatar
      [SPARK-4614][MLLIB] Slight API changes in Matrix and Matrices · 561d31d2
      Xiangrui Meng authored
      Before we have a full picture of the operators we want to add, it might be safer to hide `Matrix.transposeMultiply` in 1.2.0. Another change we want to make is to `Matrix.randn` and `Matrix.rand`, both of which should take a `Random` implementation; otherwise, they are very likely to produce inconsistent RDDs. I also added some unit tests for matrix factory methods. All APIs are new in 1.2, so there are no incompatible changes.
      
      brkyvz
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3468 from mengxr/SPARK-4614 and squashes the following commits:
      
      3b0e4e2 [Xiangrui Meng] add mima excludes
      6bfd8a4 [Xiangrui Meng] hide transposeMultiply; add rng to rand and randn; add unit tests
      561d31d2
    • Joseph E. Gonzalez's avatar
      Removing confusing TripletFields · 288ce583
      Joseph E. Gonzalez authored
      After additional discussion with rxin, I think having all the possible `TripletField` options is confusing.  This pull request reduces the triplet fields to:
      
      ```java
        /**
         * None of the triplet fields are exposed.
         */
        public static final TripletFields None = new TripletFields(false, false, false);
      
        /**
         * Expose only the edge field and not the source or destination field.
         */
        public static final TripletFields EdgeOnly = new TripletFields(false, false, true);
      
        /**
         * Expose the source and edge fields but not the destination field. (Same as Src)
         */
        public static final TripletFields Src = new TripletFields(true, false, true);
      
        /**
         * Expose the destination and edge fields but not the source field. (Same as Dst)
         */
        public static final TripletFields Dst = new TripletFields(false, true, true);
      
        /**
         * Expose all the fields (source, edge, and destination).
         */
        public static final TripletFields All = new TripletFields(true, true, true);
      ```
      
      Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
      
      Closes #3472 from jegonzal/SimplifyTripletFields and squashes the following commits:
      
      91796b5 [Joseph E. Gonzalez] removing confusing triplet fields
      288ce583
    • Tathagata Das's avatar
      [SPARK-4612] Reduce task latency and increase scheduling throughput by making... · e7f4d253
      Tathagata Das authored
      [SPARK-4612] Reduce task latency and increase scheduling throughput by making configuration initialization lazy
      
      https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L337 creates a configuration object for every task that is launched, even if there is no new dependent file/JAR to update. This heavyweight creation should be avoided when there is nothing new to update, so this PR makes the creation lazy. A quick local test with the spark-perf scheduling throughput tests gives the following numbers in local standalone scheduler mode.
      1 job with 10000 tasks: before 7.8395 seconds, after 2.6415 seconds = 3x increase in task scheduling throughput
      
      pwendell JoshRosen
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #3463 from tdas/lazy-config and squashes the following commits:
      
      c791c1e [Tathagata Das] Reduce task latency by making configuration initialization lazy
      e7f4d253
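      The optimization is the classic lazy-initialization pattern: build the heavyweight configuration object at most once, on first use. The actual fix uses Scala's `lazy val`; the following is only an illustrative Java analogue:

      ```java
      import java.util.function.Supplier;

      public class Lazy<T> {
          private final Supplier<T> init;
          private volatile T value;

          public Lazy(Supplier<T> init) { this.init = init; }

          // Build the value on first access only, and at most once
          // (double-checked locking over a volatile field).
          public T get() {
              T result = value;
              if (result == null) {
                  synchronized (this) {
                      if (value == null) {
                          value = init.get();
                      }
                      result = value;
                  }
              }
              return result;
          }

          public static void main(String[] args) {
              int[] constructions = {0};
              Lazy<String> conf = new Lazy<>(() -> {
                  constructions[0]++;           // stands in for the expensive
                  return "hadoopConfiguration"; // configuration creation
              });
              System.out.println(constructions[0]); // 0: nothing built yet
              conf.get();
              conf.get();
              System.out.println(constructions[0]); // 1: built exactly once
          }
      }
      ```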
  7. Nov 25, 2014
    • Aaron Davidson's avatar
      [SPARK-4516] Avoid allocating Netty PooledByteBufAllocators unnecessarily · 346bc17a
      Aaron Davidson authored
      Turns out we are allocating an allocator pool for every TransportClient (which means that the number increases with the number of nodes in the cluster), when really we should just reuse one for all clients.
      
      This patch, as expected, greatly decreases off-heap memory allocation, and appears to make allocation only proportional to the number of cores.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3465 from aarondav/fewer-pools and squashes the following commits:
      
      36c49da [Aaron Davidson] [SPARK-4516] Avoid allocating unnecessarily Netty PooledByteBufAllocators
      346bc17a
    • Aaron Davidson's avatar
      [SPARK-4516] Cap default number of Netty threads at 8 · f5f2d273
      Aaron Davidson authored
      In practice, only 2-4 cores should be required to transfer roughly 10 Gb/s, and each core that we use will have an initial overhead of roughly 32 MB of off-heap memory, which comes at a premium.
      
      Thus, this value should still retain maximum throughput and reduce wasted off-heap memory allocation. It can be overridden by setting the number of serverThreads and clientThreads manually in Spark's configuration.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3469 from aarondav/fewer-pools2 and squashes the following commits:
      
      087c59f [Aaron Davidson] [SPARK-4516] Cap default number of Netty threads at 8
      f5f2d273
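      The resulting default can be sketched as a simple cap (illustrative names; the real code reads Spark's configuration for serverThreads/clientThreads): use the configured thread count if set, otherwise the number of available cores capped at 8.

      ```java
      public class NettyThreads {
          static final int MAX_DEFAULT_NETTY_THREADS = 8;

          // An explicit configuration wins; otherwise default to the core
          // count, capped at 8 to bound off-heap allocation (~32 MB/thread).
          static int defaultNumThreads(int configuredThreads, int availableCores) {
              return configuredThreads > 0
                  ? configuredThreads
                  : Math.min(availableCores, MAX_DEFAULT_NETTY_THREADS);
          }

          public static void main(String[] args) {
              System.out.println(defaultNumThreads(0, 32)); // 8: capped default
              System.out.println(defaultNumThreads(0, 4));  // 4: small machine
              System.out.println(defaultNumThreads(16, 4)); // 16: override wins
          }
      }
      ```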
    • Xiangrui Meng's avatar
      [SPARK-4604][MLLIB] make MatrixFactorizationModel public · b5fb1410
      Xiangrui Meng authored
      Users can now construct an MF model directly. I added a note about the performance.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3459 from mengxr/SPARK-4604 and squashes the following commits:
      
      f64bcd3 [Xiangrui Meng] organize imports
      ed08214 [Xiangrui Meng] check preconditions and unit tests
      a624c12 [Xiangrui Meng] make MatrixFactorizationModel public
      b5fb1410
    • Patrick Wendell's avatar
      [HOTFIX]: Adding back without-hive dist · 4d95526a
      Patrick Wendell authored
      4d95526a
    • Joseph K. Bradley's avatar
      [SPARK-4583] [mllib] LogLoss for GradientBoostedTrees fix + doc updates · c251fd74
      Joseph K. Bradley authored
      Currently, the LogLoss used by GradientBoostedTrees has 2 issues:
      * the gradient (and therefore loss) does not match that used by Friedman (1999)
      * the error computation uses 0/1 accuracy, not log loss
      
      This PR updates LogLoss.
      It also adds some doc for boosting and forests.
      
      I tested it on sample data and made sure the log loss is monotonically decreasing with each boosting iteration.
      
      CC: mengxr manishamde codedeft
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #3439 from jkbradley/gbt-loss-fix and squashes the following commits:
      
      cfec17e [Joseph K. Bradley] removed forgotten temp comments
      a27eb6d [Joseph K. Bradley] corrections to last log loss commit
      ed5da2c [Joseph K. Bradley] updated LogLoss (boosting) for numerical stability
      5e52bff [Joseph K. Bradley] * Removed the 1/2 from SquaredError.  This also required updating the test suite since it effectively doubles the gradient and loss. * Added doc for developers within RandomForest. * Small cleanup in test suite (generating data only once)
      e57897a [Joseph K. Bradley] Fixed LogLoss for GradientBoostedTrees, and updated doc for losses, forests, and boosting
      c251fd74