  1. Nov 18, 2014
    • Davies Liu's avatar
      [SPARK-4017] show progress bar in console · e34f38ff
      Davies Liu authored
      The progress bar will look like this:
      
      ![1___spark_job__85_250_finished__4_are_running___java_](https://cloud.githubusercontent.com/assets/40902/4854813/a02f44ac-6099-11e4-9060-7c73a73151d6.png)
      
      In the right corner, the numbers are: finished tasks, running tasks, total tasks.
      
      After the stage has finished, it will disappear.
      
      The progress bar is only shown if the logging level is WARN or higher (the progress in the title is still shown); it can be turned off via spark.driver.showConsoleProgress (a minimal sketch follows this entry).
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3029 from davies/progress and squashes the following commits:
      
      95336d5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress
      fc49ac8 [Davies Liu] address commentse
      2e90f75 [Davies Liu] show multiple stages in same time
      0081bcc [Davies Liu] address comments
      38c42f1 [Davies Liu] fix tests
      ab87958 [Davies Liu] disable progress bar during tests
      30ac852 [Davies Liu] re-implement progress bar
      b3f34e5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress
      6fd30ff [Davies Liu] show progress bar if no task finished in 500ms
      e4e7344 [Davies Liu] refactor
      e1f524d [Davies Liu] revert unnecessary change
      a60477c [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress
      5cae3f2 [Davies Liu] fix style
      ea49fe0 [Davies Liu] address comments
      bc53d99 [Davies Liu] refactor
      e6bb189 [Davies Liu] fix logging in sparkshell
      7e7d4e7 [Davies Liu] address commments
      5df26bb [Davies Liu] fix style
      9e42208 [Davies Liu] show progress bar in console and title
      e34f38ff
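      A minimal sketch of turning the bar off via the flag mentioned above; the property name is taken from the description, and the default behavior may differ across Spark versions.

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}

      // Hypothetical driver program: disable the console progress bar using the
      // property named in the description (spark.driver.showConsoleProgress).
      object NoConsoleProgress {
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf()
            .setAppName("no-console-progress")
            .setMaster("local[*]")
            .set("spark.driver.showConsoleProgress", "false")
          val sc = new SparkContext(conf)
          sc.parallelize(1 to 100000).map(_ * 2).count()  // runs without drawing the bar
          sc.stop()
        }
      }
      ```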
    • Davies Liu's avatar
      [SPARK-4404] remove sys.exit() in shutdown hook · 80f31778
      Davies Liu authored
      If SparkSubmit dies first, then the bootstrapper will be blocked by the shutdown hook. Calling sys.exit() in a shutdown hook causes a deadlock (a minimal illustration follows this entry).
      
      cc andrewor14
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3289 from davies/fix_bootstraper and squashes the following commits:
      
      ea5cdd1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_bootstraper
      e04b690 [Davies Liu] remove sys.exit in hook
      4d11366 [Davies Liu] remove shutdown hook if subprocess die fist
      80f31778
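      The deadlock comes from a general JVM rule: System.exit() (which sys.exit() delegates to) blocks indefinitely when invoked while shutdown hooks are already running. A standalone illustration, not the Spark code itself:

      ```scala
      object ShutdownHookSketch {
        def main(args: Array[String]): Unit = {
          Runtime.getRuntime.addShutdownHook(new Thread {
            override def run(): Unit = {
              // sys.exit(1)  // would block forever: the JVM is already running shutdown hooks
              println("cleaning up without calling sys.exit()")
            }
          })
          println("main finished; JVM shuts down normally")
        }
      }
      ```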
    • Kousuke Saruta's avatar
      [SPARK-4075][SPARK-4434] Fix the URI validation logic for Application Jar name. · bfebfd8b
      Kousuke Saruta authored
      This PR adds a regression test for SPARK-4434.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #3326 from sarutak/add-triple-slash-testcase and squashes the following commits:
      
      82bc9cc [Kousuke Saruta] Fixed wrong grammar in comment
      9149027 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into add-triple-slash-testcase
      c1c80ca [Kousuke Saruta] Fixed style
      4f30210 [Kousuke Saruta] Modified comments
      9e09da2 [Kousuke Saruta] Fixed URI validation for jar file
      d4b99ef [Kousuke Saruta] [SPARK-4075] [Deploy] Jar url validation is not enough for Jar file
      ac79906 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into add-triple-slash-testcase
      6d4f47e [Kousuke Saruta] Added a test case as a regression check for SPARK-4434
      bfebfd8b
    • Michael Armbrust's avatar
      [SQL] Support partitioned parquet tables that have the key in both the directory and the file · 90d72ec8
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #3272 from marmbrus/keyInPartitionedTable and squashes the following commits:
      
      447f08c [Michael Armbrust] Support partitioned parquet tables that have the key in both the directory and the file
      90d72ec8
    • Xiangrui Meng's avatar
      [SPARK-4396] allow lookup by index in Python's Rating · b54c6ab3
      Xiangrui Meng authored
      In PySpark, ALS can take an RDD of (user, product, rating) tuples as input. However, model.predict outputs an RDD of Rating. So on the input side, users can use r[0], r[1], r[2], while on the output side, users have to use r.user, r.product, r.rating. We should allow lookup by index in Rating by making Rating a namedtuple.
      
      davies
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3261 from mengxr/SPARK-4396 and squashes the following commits:
      
      543aef0 [Xiangrui Meng] use named tuple to implement ALS
      0b61bae [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4396
      d3bd7d4 [Xiangrui Meng] allow lookup by index in Python's Rating
      b54c6ab3
    • Davies Liu's avatar
      [SPARK-4435] [MLlib] [PySpark] improve classification · 8fbf72b7
      Davies Liu authored
      This PR adds setThreshold() and clearThreshold() to LogisticRegressionModel and SVMModel, and also supports RDDs of vectors in LogisticRegressionModel.predict(), SVMModel.predict() and NaiveBayes.predict().
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3305 from davies/setThreshold and squashes the following commits:
      
      d0b835f [Davies Liu] Merge branch 'master' of github.com:apache/spark into setThreshold
      e4acd76 [Davies Liu] address comments
      2231a5f [Davies Liu] bugfix
      7bd9009 [Davies Liu] address comments
      0b0a8a7 [Davies Liu] address comments
      c1e5573 [Davies Liu] improve classification
      8fbf72b7
    • Felix Maximilian Möller's avatar
      ALS implicit: added missing parameter alpha in doc string · cedc3b5a
      Felix Maximilian Möller authored
      Author: Felix Maximilian Möller <felixmaximilian.moeller@immobilienscout24.de>
      
      Closes #3343 from felixmaximilian/fix-documentation and squashes the following commits:
      
      43dcdfb [Felix Maximilian Möller] Removed the information about the switch implicitPrefs. The parameter implicitPrefs cannot be set in this context because it is inherent true when calling the trainImplicit method.
      7d172ba [Felix Maximilian Möller] added missing parameter alpha in doc string.
      cedc3b5a
  2. Nov 17, 2014
    • Patrick Wendell's avatar
      SPARK-4466: Provide support for publishing Scala 2.11 artifacts to Maven · c6e0c2ab
      Patrick Wendell authored
      The Maven release plug-in does not support publishing two separate sets of artifacts for a single release. Because of the way that Scala 2.11 support in Spark works, we have to write some customized code to do this. The good news is that the Maven release API is just a thin wrapper around doing git commits and pushing artifacts to the HTTP API of Apache's Sonatype server, so this might overall make our deployment easier to understand.
      
      This was already used for the 1.2 snapshot, so I think it is working well. One other nice thing is this could be pretty easily extended to publish nightly snapshots.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #3332 from pwendell/releases and squashes the following commits:
      
      2fedaed [Patrick Wendell] Automate the opening and closing of Sonatype repos
      e2a24bb [Patrick Wendell] Fixing issue where we overrode non-spark version numbers
      9df3a50 [Patrick Wendell] Adding TODO
      1cc1749 [Patrick Wendell] Don't build the thriftserver for 2.11
      933201a [Patrick Wendell] Make tagging of release commit eager
      d0388a6 [Patrick Wendell] Support Scala 2.11 build
      4f4dc62 [Patrick Wendell] Change to 2.11 should not be included when committing new patch
      bf742e1 [Patrick Wendell] Minor fixes
      ffa1df2 [Patrick Wendell] Adding a Scala 2.11 package to test it
      9ac4381 [Patrick Wendell] Addressing TODO
      b3105ff [Patrick Wendell] Removing commented out code
      d906803 [Patrick Wendell] Small fix
      3f4d985 [Patrick Wendell] More work
      fcd54c2 [Patrick Wendell] Consolidating use of keys
      df2af30 [Patrick Wendell] Changes to release stuff
      c6e0c2ab
    • Cheng Lian's avatar
      [SPARK-4453][SPARK-4213][SQL] Simplifies Parquet filter generation code · 36b0956a
      Cheng Lian authored
      While reviewing PR #3083 and #3161, I noticed that Parquet record filter generation code can be simplified significantly according to the clue stated in [SPARK-4453](https://issues.apache.org/jira/browse/SPARK-4213). This PR addresses both SPARK-4453 and SPARK-4213 with this simplification.
      
      While generating the `ParquetTableScan` operator, we need to remove all Catalyst predicates that have already been pushed down to Parquet. Originally, we first generate the record filter and then call `findExpression` to traverse the generated filter and find all pushed-down predicates [[1](https://github.com/apache/spark/blob/64c6b9bad559c21f25cd9fbe37c8813cdab939f2/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L213-L228)]. This forces us to introduce the `CatalystFilter` class hierarchy to bind the Catalyst predicates together with their generated Parquet filters, which complicates the code base a lot.
      
      The basic idea of this PR is that we don't need `findExpression` after filter generation, because we already know a predicate can be pushed down if we can successfully generate its corresponding Parquet filter. SPARK-4213 is fixed by returning `None` for any unsupported predicate type (a conceptual sketch follows this entry).
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3317 from liancheng/simplify-parquet-filters and squashes the following commits:
      
      d6a9499 [Cheng Lian] Fixes import styling issue
      43760e8 [Cheng Lian] Simplifies Parquet filter generation logic
      36b0956a
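      A conceptual sketch of the approach, using a made-up ParquetPred type in place of the real Parquet filter classes; only the shape of the logic (convert or return `None`, with no follow-up `findExpression` pass) reflects the description above.

      ```scala
      import org.apache.spark.sql.catalyst.expressions._

      // Hypothetical stand-ins for the real Parquet filter types.
      sealed trait ParquetPred
      case class ColEq(column: String, value: Any) extends ParquetPred
      case class AndPred(left: ParquetPred, right: ParquetPred) extends ParquetPred

      // A predicate counts as pushed down iff a filter can be generated for it, so
      // unsupported predicate types simply yield None (which is what fixes SPARK-4213).
      def toParquetFilter(e: Expression): Option[ParquetPred] = e match {
        case EqualTo(a: Attribute, Literal(v, _)) => Some(ColEq(a.name, v))
        case EqualTo(Literal(v, _), a: Attribute) => Some(ColEq(a.name, v))
        case And(l, r) =>
          for (lf <- toParquetFilter(l); rf <- toParquetFilter(r)) yield AndPred(lf, rf)
        case _ => None
      }
      ```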
    • Cheng Hao's avatar
      [SPARK-4448] [SQL] unwrap for the ConstantObjectInspector · ef7c464e
      Cheng Hao authored
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3308 from chenghao-intel/unwrap_constant_oi and squashes the following commits:
      
      156b500 [Cheng Hao] rebase the master
      c5b20ab [Cheng Hao] unwrap for the ConstantObjectInspector
      ef7c464e
    • w00228970's avatar
      [SPARK-4443][SQL] Fix statistics for external table in spark sql hive · 42389b17
      w00228970 authored
      The `totalSize` of an external table is always zero, which influences the join strategy (a broadcast join is always used for external tables).
      
      Author: w00228970 <wangfei1@huawei.com>
      
      Closes #3304 from scwf/statistics and squashes the following commits:
      
      568f321 [w00228970] fix statistics for external table
      42389b17
    • Cheng Lian's avatar
      [SPARK-4309][SPARK-4407][SQL] Date type support for Thrift server, and fixes for complex types · 6b7f2f75
      Cheng Lian authored
      This PR is exactly the same as #3178 except it reverts the `FileStatus.isDir` to `FileStatus.isDirectory` change, since it doesn't compile with Hadoop 1.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3298 from liancheng/date-for-thriftserver and squashes the following commits:
      
      866037e [Cheng Lian] Revers isDirectory to isDir (it breaks Hadoop 1 profile)
      6f71d0b [Cheng Lian] Makes toHiveString static
      26fa955 [Cheng Lian] Fixes complex type support in Hive 0.13.1 shim
      a92882a [Cheng Lian] Updates HiveShim for 0.13.1
      73f442b [Cheng Lian] Adds Date support for HiveThriftServer2 (Hive 0.12.0)
      6b7f2f75
    • Cheng Hao's avatar
      [SQL] Construct the MutableRow from an Array · 69e858cc
      Cheng Hao authored
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3217 from chenghao-intel/mutablerow and squashes the following commits:
      
      e8a10bd [Cheng Hao] revert the change of Row object
      4681aea [Cheng Hao] Add toMutableRow method in object Row
      a751838 [Cheng Hao] Construct the MutableRow from an existed row
      69e858cc
    • Takuya UESHIN's avatar
      [SPARK-4425][SQL] Handle NaN or Infinity cast to Timestamp correctly. · 566c7919
      Takuya UESHIN authored
      Casting `NaN` or `Infinity` values of `Double` or `Float` to `TimestampType` throws `NumberFormatException` (a minimal sketch of the intended behavior follows this entry).
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #3283 from ueshin/issues/SPARK-4425 and squashes the following commits:
      
      14def0c [Takuya UESHIN] Fix Cast to be able to handle NaN or Infinity to TimestampType.
      566c7919
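      A minimal sketch of the intended behavior, independent of Catalyst internals: a non-finite double has no sensible timestamp, so the cast should produce null instead of throwing. The helper below is hypothetical.

      ```scala
      import java.sql.Timestamp

      // Return None (SQL NULL) for NaN/Infinity instead of letting the numeric conversion throw.
      def doubleToTimestamp(seconds: Double): Option[Timestamp] =
        if (seconds.isNaN || seconds.isInfinite) None
        else Some(new Timestamp((seconds * 1000).toLong))  // seconds -> milliseconds

      doubleToTimestamp(Double.NaN)              // None (previously: NumberFormatException)
      doubleToTimestamp(Double.PositiveInfinity) // None
      doubleToTimestamp(1.4e9)                   // Some(...)
      ```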
    • Takuya UESHIN's avatar
      [SPARK-4420][SQL] Change nullability of Cast from DoubleType/FloatType to DecimalType. · 3a81a1c9
      Takuya UESHIN authored
      This is follow-up of [SPARK-4390](https://issues.apache.org/jira/browse/SPARK-4390) (#3256).
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #3278 from ueshin/issues/SPARK-4420 and squashes the following commits:
      
      7fea558 [Takuya UESHIN] Add some tests.
      cb2301a [Takuya UESHIN] Fix tests.
      133bad5 [Takuya UESHIN] Change nullability of Cast from DoubleType/FloatType to DecimalType.
      3a81a1c9
    • Cheng Lian's avatar
      [SQL] Makes conjunction pushdown more aggressive for in-memory table · 5ce7dae8
      Cheng Lian authored
      This is inspired by the [Parquet record filter generation code](https://github.com/apache/spark/blob/64c6b9bad559c21f25cd9fbe37c8813cdab939f2/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetFilters.scala#L387-L400); a sketch of the conjunction handling follows this entry.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3318 from liancheng/aggresive-conj-pushdown and squashes the following commits:
      
      78b69d2 [Cheng Lian] Makes conjunction pushdown more aggressive
      5ce7dae8
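      A conceptual sketch of the "aggressive" part, with hypothetical Pred/AndP stand-ins: for a conjunction, keep whichever side converts instead of requiring both. This is safe because filtering on one conjunct alone only lets extra rows through, and the remaining predicate is still evaluated afterwards.

      ```scala
      sealed trait Pred
      case class AndP(left: Pred, right: Pred) extends Pred

      def pushDownAnd(left: Option[Pred], right: Option[Pred]): Option[Pred] =
        (left, right) match {
          case (Some(lf), Some(rf)) => Some(AndP(lf, rf))  // both sides convertible
          case (Some(lf), None)     => Some(lf)            // keep the convertible half
          case (None, Some(rf))     => Some(rf)
          case (None, None)         => None                // nothing to push down
        }
      ```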
    • Josh Rosen's avatar
      [SPARK-4180] [Core] Prevent creation of multiple active SparkContexts · 0f3ceb56
      Josh Rosen authored
      This patch adds error-detection logic to throw an exception when attempting to create multiple active SparkContexts in the same JVM, since this is currently unsupported and has been known to cause confusing behavior (see SPARK-2243 for more details).
      
      **The solution implemented here is only a partial fix.**  A complete fix would have the following properties:
      
      1. Only one SparkContext may ever be under construction at any given time.
      2. Once a SparkContext has been successfully constructed, any subsequent construction attempts should fail until the active SparkContext is stopped.
      3. If the SparkContext constructor throws an exception, then all resources created in the constructor should be cleaned up (SPARK-4194).
      4. If a user attempts to create a SparkContext but the creation fails, then the user should be able to create new SparkContexts.
      
      This PR only provides 2) and 4); we should be able to provide all of these properties, but the correct fix will involve larger changes to SparkContext's construction / initialization, so we'll target it for a different Spark release.
      
      ### The correct solution:
      
      I think that the correct way to do this would be to move the construction of SparkContext's dependencies into a static method in the SparkContext companion object.  Specifically, we could make the default SparkContext constructor `private` and change it to accept a `SparkContextDependencies` object that contains all of SparkContext's dependencies (e.g. DAGScheduler, ContextCleaner, etc.).  Secondary constructors could call a method on the SparkContext companion object to create the `SparkContextDependencies` and pass the result to the primary SparkContext constructor.  For example:
      
      ```scala
      class SparkContext private (deps: SparkContextDependencies) {
        def this(conf: SparkConf) {
          this(SparkContext.getDeps(conf))
        }
      }

      object SparkContext {
        private[spark] def getDeps(conf: SparkConf): SparkContextDependencies = synchronized {
          if (anotherSparkContextIsActive) { throw new Exception(...) }
          var dagScheduler: DAGScheduler = null
          try {
            dagScheduler = new DAGScheduler(...)
            [...]
          } catch {
            case e: Exception =>
              Option(dagScheduler).foreach(_.stop())
              [...]
          }
          SparkContextDependencies(dagScheduler, ...)
        }
      }
      ```
      
      This gives us mutual exclusion and ensures that any resources created during the failed SparkContext initialization are properly cleaned up.
      
      This indirection is necessary to maintain binary compatibility.  In retrospect, it would have been nice if SparkContext had no private constructors and could only be created through builder / factory methods on its companion object, since this buys us lots of flexibility and makes dependency injection easier.
      
      ### Alternative solutions:
      
      As an alternative solution, we could refactor SparkContext's primary constructor to perform all object creation in a giant `try-finally` block.  Unfortunately, this will require us to turn a bunch of `vals` into `vars` so that they can be assigned from the `try` block.  If we still want `vals`, we could wrap each `val` in its own `try` block (since the try block can return a value), but this will lead to extremely messy code and won't guard against the introduction of future code which doesn't properly handle failures.
      
      The more complex approach outlined above gives us some nice dependency injection benefits, so I think that might be preferable to a `var`-ification.
      
      ### This PR's solution:
      
      - At the start of the constructor, check whether some other SparkContext is active; if so, throw an exception.
      - If another SparkContext might be under construction (or has thrown an exception during construction), allow the new SparkContext to begin construction but log a warning (since resources might have been leaked from a failed creation attempt).
      - At the end of the SparkContext constructor, check whether some other SparkContext constructor has raced and successfully created an active context.  If so, throw an exception.
      
      This guarantees that no two SparkContexts will ever be active and exposed to users (since we check at the very end of the constructor).  If two threads race to construct SparkContexts, then one of them will win and another will throw an exception.
      
      This exception can be turned into a warning by setting `spark.driver.allowMultipleContexts = true` (a sketch follows this entry). The exception is disabled in unit tests, since there are some suites (such as Hive) that may require more significant refactoring to clean up their SparkContexts. I've made a few changes to other suites' test fixtures to properly clean up SparkContexts so that the unit test logs contain fewer warnings.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3121 from JoshRosen/SPARK-4180 and squashes the following commits:
      
      23c7123 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
      d38251b [Josh Rosen] Address latest round of feedback.
      c0987d3 [Josh Rosen] Accept boolean instead of SparkConf in methods.
      85a424a [Josh Rosen] Incorporate more review feedback.
      372d0d3 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
      f5bb78c [Josh Rosen] Update mvn build, too.
      d809cb4 [Josh Rosen] Improve handling of failed SparkContext creation attempts.
      79a7e6f [Josh Rosen] Fix commented out test
      a1cba65 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
      7ba6db8 [Josh Rosen] Add utility to set system properties in tests.
      4629d5c [Josh Rosen] Set spark.driver.allowMultipleContexts=true in tests.
      ed17e14 [Josh Rosen] Address review feedback; expose hack workaround for existing unit tests.
      1c66070 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
      06c5c54 [Josh Rosen] Add / improve SparkContext cleanup in streaming BasicOperationsSuite
      d0437eb [Josh Rosen] StreamingContext.stop() should stop SparkContext even if StreamingContext has not been started yet.
      c4d35a2 [Josh Rosen] Log long form of creation site to aid debugging.
      918e878 [Josh Rosen] Document "one SparkContext per JVM" limitation.
      afaa7e3 [Josh Rosen] [SPARK-4180] Prevent creations of multiple active SparkContexts.
      0f3ceb56
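      A minimal spark-shell-style sketch of the escape hatch, assuming only the property name given in the description:

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}

      // With another SparkContext already active, a second construction attempt throws
      // unless the check is explicitly downgraded to a warning.
      val conf = new SparkConf()
        .setAppName("second-context")
        .setMaster("local[*]")
        .set("spark.driver.allowMultipleContexts", "true")  // turn the exception into a warning
      val sc2 = new SparkContext(conf)  // would throw without the flag
      ```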
    • Andy Konwinski's avatar
      [DOCS][SQL] Fix broken link to Row class scaladoc · cec1116b
      Andy Konwinski authored
      Author: Andy Konwinski <andykonwinski@gmail.com>
      
      Closes #3323 from andyk/patch-2 and squashes the following commits:
      
      4699fdc [Andy Konwinski] Fix broken link to Row class scaladoc
      cec1116b
    • Andrew Or's avatar
      dbb9da5c
    • Ankur Dave's avatar
      [SPARK-4444] Drop VD type parameter from EdgeRDD · 9ac2bb18
      Ankur Dave authored
      Due to vertex attribute caching, EdgeRDD previously took two type parameters: ED and VD. However, this is an implementation detail that should not be exposed in the interface, so this PR drops the VD type parameter.
      
      This requires removing the `filter` method from the EdgeRDD interface, because it depends on vertex attribute caching.
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #3303 from ankurdave/edgerdd-drop-tparam and squashes the following commits:
      
      38dca9b [Ankur Dave] Leave EdgeRDD.fromEdges public
      fafeb51 [Ankur Dave] Drop VD type parameter from EdgeRDD
      9ac2bb18
    • Adam Pingel's avatar
      SPARK-2811 upgrade algebird to 0.8.1 · e7690ed2
      Adam Pingel authored
      Author: Adam Pingel <adam@axle-lang.org>
      
      Closes #3282 from adampingel/master and squashes the following commits:
      
      70c8d3c [Adam Pingel] relocate the algebird example back to example/src
      7a9d8be [Adam Pingel] SPARK-2811 upgrade algebird to 0.8.1
      e7690ed2
    • Prashant Sharma's avatar
      SPARK-4445, Don't display storage level in toDebugString unless RDD is persisted. · 5c92d47a
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #3310 from ScrapCodes/SPARK-4445/rddDebugStringFix and squashes the following commits:
      
      4e57c52 [Prashant Sharma] SPARK-4445, Don't display storage level in toDebugString unless RDD is persisted
      5c92d47a
  3. Nov 16, 2014
    • Michael Armbrust's avatar
      [SPARK-4410][SQL] Add support for external sort · 64c6b9ba
      Michael Armbrust authored
      Adds a new operator that uses Spark's `ExternalSort` class.  It is off by default now, but we might consider making it the default if benchmarks show that it does not regress performance.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #3268 from marmbrus/externalSort and squashes the following commits:
      
      48b9726 [Michael Armbrust] comments
      b98799d [Michael Armbrust] Add test
      afd7562 [Michael Armbrust] Add support for external sort.
      64c6b9ba
    • GuoQiang Li's avatar
      [SPARK-4422][MLLIB]In some cases, Vectors.fromBreeze get wrong results. · 5168c6ca
      GuoQiang Li authored
      cc mengxr
      
      Author: GuoQiang Li <witgo@qq.com>
      
      Closes #3281 from witgo/SPARK-4422 and squashes the following commits:
      
      5f1fa5e [GuoQiang Li] import order
      50783bd [GuoQiang Li] review commits
      7a10123 [GuoQiang Li] In some cases, Vectors.fromBreeze get wrong results.
      5168c6ca
    • Michael Armbrust's avatar
      Revert "[SPARK-4309][SPARK-4407][SQL] Date type support for Thrift server, and... · 45ce3273
      Michael Armbrust authored
      Revert "[SPARK-4309][SPARK-4407][SQL] Date type support for Thrift server, and fixes for complex types"
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #3292 from marmbrus/revert4309 and squashes the following commits:
      
      808e96e [Michael Armbrust] Revert "[SPARK-4309][SPARK-4407][SQL] Date type support for Thrift server, and fixes for complex types"
      45ce3273
    • Cheng Lian's avatar
      [SPARK-4309][SPARK-4407][SQL] Date type support for Thrift server, and fixes for complex types · cb6bd83a
      Cheng Lian authored
      SPARK-4407 was detected while working on SPARK-4309. Merged these two into a single PR since 1.2.0 RC is approaching.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3178 from liancheng/date-for-thriftserver and squashes the following commits:
      
      6f71d0b [Cheng Lian] Makes toHiveString static
      26fa955 [Cheng Lian] Fixes complex type support in Hive 0.13.1 shim
      a92882a [Cheng Lian] Updates HiveShim for 0.13.1
      73f442b [Cheng Lian] Adds Date support for HiveThriftServer2 (Hive 0.12.0)
      cb6bd83a
    • Josh Rosen's avatar
      [SPARK-4393] Fix memory leak in ConnectionManager ACK timeout TimerTasks; use HashedWheelTimer · 7850e0c7
      Josh Rosen authored
      This patch is intended to fix a subtle memory leak in ConnectionManager's ACK timeout TimerTasks: in the old code, each TimerTask held a reference to the message being sent, and a cancelled TimerTask wasn't necessarily garbage-collected until it was scheduled to run, so messages built up and weren't garbage-collected until their timeouts expired, leading to OOMs.

      This patch addresses the problem by capturing only the message ID in the TimerTask instead of the whole message, and by keeping a WeakReference to the promise in the TimerTask. I've also modified this code to use Netty's HashedWheelTimer, whose performance characteristics should be better for this use case (a sketch of the pattern follows this entry).
      
      Thanks to cristianopris for narrowing down this issue!
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3259 from JoshRosen/connection-manager-timeout-bugfix and squashes the following commits:
      
      afcc8d6 [Josh Rosen] Address rxin's review feedback.
      2a2e92d [Josh Rosen] Keep only WeakReference to promise in TimerTask;
      0f0913b [Josh Rosen] Spelling fix: timout => timeout
      3200c33 [Josh Rosen] Use Netty HashedWheelTimer
      f847dd4 [Josh Rosen] Don't capture entire message in ACK timeout task.
      7850e0c7
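      A sketch of the pattern described above using Netty's io.netty.util.HashedWheelTimer; the helper and field names are illustrative, not the actual ConnectionManager code.

      ```scala
      import java.lang.ref.WeakReference
      import java.util.concurrent.TimeUnit
      import io.netty.util.{HashedWheelTimer, Timeout, TimerTask}
      import scala.concurrent.Promise

      val ackTimer = new HashedWheelTimer()

      // Capture only the message id and a WeakReference to the promise -- never the
      // message itself -- so a cancelled-but-not-yet-fired task cannot pin large payloads.
      def scheduleAckTimeout(messageId: Long, promise: Promise[Unit], timeoutSecs: Long): Timeout = {
        val promiseRef = new WeakReference(promise)
        ackTimer.newTimeout(new TimerTask {
          override def run(timeout: Timeout): Unit = {
            Option(promiseRef.get()).foreach { p =>
              p.tryFailure(new java.io.IOException(s"ACK timeout for message $messageId"))
            }
          }
        }, timeoutSecs, TimeUnit.SECONDS)
      }
      // The returned Timeout can be cancelled once the ACK arrives: timeout.cancel()
      ```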
    • Kousuke Saruta's avatar
      [SPARK-4426][SQL][Minor] The symbol of BitwiseOr is wrong, should not be '&' · 84468b2e
      Kousuke Saruta authored
      The symbol of BitwiseOr is defined as '&' but I think it's wrong. It should be '|'.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #3284 from sarutak/bitwise-or-symbol-fix and squashes the following commits:
      
      aff4be5 [Kousuke Saruta] Fixed symbol of BitwiseOr
      84468b2e
    • Josh Rosen's avatar
      [SPARK-4419] Upgrade snappy-java to 1.1.1.6 · 7d8e152e
      Josh Rosen authored
      This upgrades snappy-java to 1.1.1.6, which includes a patch that improves error messages when attempting to deserialize empty inputs using SnappyInputStream (see xerial/snappy-java#89).
      
      We previously tried to upgrade to 1.1.1.5 in #2911 but reverted that patch after discovering a memory leak in snappy-java. That leak should have been fixed in 1.1.1.6, though (see xerial/snappy-java#92).
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3287 from JoshRosen/SPARK-4419 and squashes the following commits:
      
      5d6f4cc [Josh Rosen] [SPARK-4419] Upgrade snappy-java to 1.1.1.6.
      7d8e152e
  4. Nov 15, 2014
    • Josh Rosen's avatar
      [SPARK-2321] Several progress API improvements / refactorings · 40eb8b6e
      Josh Rosen authored
      This PR refactors / extends the status API introduced in #2696.
      
      - Change StatusAPI from a mixin trait to a class.  Before, the new status API methods were directly accessible through SparkContext, whereas now they're accessed through a `sc.statusAPI` field.  As long as we were going to add these methods directly to SparkContext, the mixin trait seemed like a good idea, but this might be simpler to reason about and may avoid pitfalls that I've run into while attempting to refactor other parts of SparkContext to use mixins (see #3071, for example).
      - Change the name from SparkStatusAPI to SparkStatusTracker.
      - Make `getJobIdsForGroup(null)` return ids for jobs that aren't associated with any job group.
      - Add `getActiveStageIds()` and `getActiveJobIds()` methods that return the ids of whatever's currently active in this SparkContext. This should simplify davies's progress bar code (a usage sketch follows this entry).
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3197 from JoshRosen/progress-api-improvements and squashes the following commits:
      
      30b0afa [Josh Rosen] Rename SparkStatusAPI to SparkStatusTracker.
      d1b08d8 [Josh Rosen] Add missing newlines
      2cc7353 [Josh Rosen] Add missing file.
      d5eab1f [Josh Rosen] Add getActive[Stage|Job]Ids() methods.
      a227984 [Josh Rosen] getJobIdsForGroup(null) should return jobs for default group
      c47e294 [Josh Rosen] Remove StatusAPI mixin trait.
      40eb8b6e
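      A usage sketch, assuming the tracker ends up exposed on SparkContext as `sc.statusTracker` after the rename:

      ```scala
      // sc is an existing SparkContext; poll what's currently running in it.
      val tracker = sc.statusTracker
      val activeJobs   = tracker.getActiveJobIds()       // ids of currently active jobs
      val activeStages = tracker.getActiveStageIds()     // ids of currently active stages
      val ungrouped    = tracker.getJobIdsForGroup(null) // jobs not associated with any job group
      println(s"${activeJobs.length} active job(s), ${activeStages.length} active stage(s)")
      ```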
    • kai's avatar
      Added contains(key) to Metadata · cbddac23
      kai authored
      Add contains(key) to org.apache.spark.sql.catalyst.util.Metadata to test for the existence of a key; otherwise, Metadata's get methods may throw a NoSuchElementException if the key does not exist (a usage sketch follows this entry).
      Test cases are added to MetadataSuite as well.
      
      Author: kai <kaizeng@eecs.berkeley.edu>
      
      Closes #3273 from kai-zeng/metadata-fix and squashes the following commits:
      
      74b3d03 [kai] Added contains(key) to Metadata
      cbddac23
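      A small usage sketch, assuming the MetadataBuilder/getter names in the same package at the time of this commit:

      ```scala
      import org.apache.spark.sql.catalyst.util.{Metadata, MetadataBuilder}

      val meta: Metadata = new MetadataBuilder()
        .putString("comment", "user id column")
        .build()

      // New: probe for a key before reading it.
      if (meta.contains("comment")) {
        println(meta.getString("comment"))
      }
      // Without contains(), meta.getString("missing") would throw a NoSuchElementException.
      ```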
    • Kousuke Saruta's avatar
      [SPARK-4260] Httpbroadcast should set connection timeout. · 60969b03
      Kousuke Saruta authored
      HttpBroadcast sets a read timeout but doesn't set a connection timeout (a sketch of the fix follows this entry).
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #3122 from sarutak/httpbroadcast-timeout and squashes the following commits:
      
      c7f3a56 [Kousuke Saruta] Added Connection timeout for Http Connection to HttpBroadcast.scala
      60969b03
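      An illustration of the fix in plain java.net terms (not the actual HttpBroadcast code); the URL and timeout values are placeholders.

      ```scala
      import java.net.{HttpURLConnection, URL}

      val conn = new URL("http://driver-host:12345/broadcast_0")
        .openConnection().asInstanceOf[HttpURLConnection]
      conn.setConnectTimeout(60 * 1000)  // previously missing: bound how long connecting may block
      conn.setReadTimeout(60 * 1000)     // already present: bound how long a read may block
      val in = conn.getInputStream()
      ```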
    • zsxwing's avatar
      [SPARK-4363][Doc] Update the Broadcast example · 861223ee
      zsxwing authored
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3226 from zsxwing/SPARK-4363 and squashes the following commits:
      
      8109914 [zsxwing] Update the Broadcast example
      861223ee
    • zsxwing's avatar
      [SPARK-4379][Core] Change Exception to SparkException in checkpoint · dba14058
      zsxwing authored
      It's better to change to SparkException. However, it's a breaking change since it will change the exception type.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3241 from zsxwing/SPARK-4379 and squashes the following commits:
      
      409f3af [zsxwing] Change Exception to SparkException in checkpoint
      dba14058
  5. Nov 14, 2014
    • Davies Liu's avatar
      [SPARK-4415] [PySpark] JVM should exit after Python exit · 7fe08b43
      Davies Liu authored
      When the JVM is started from a Python process, it should exit once its stdin is closed.

      Test: add spark.driver.memory to conf/spark-defaults.conf
      
      ```
      daviesdm:~/work/spark$ cat conf/spark-defaults.conf
      spark.driver.memory       8g
      daviesdm:~/work/spark$ bin/pyspark
      >>> quit
      daviesdm:~/work/spark$ jps
      4931 Jps
      286
      daviesdm:~/work/spark$ python wc.py
      943738
      0.719928026199
      daviesdm:~/work/spark$ jps
      286
      4990 Jps
      ```
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3274 from davies/exit and squashes the following commits:
      
      df0e524 [Davies Liu] address comments
      ce8599c [Davies Liu] address comments
      050651f [Davies Liu] JVM should exit after Python exit
      7fe08b43
    • WangTao's avatar
      [SPARK-4404]SparkSubmitDriverBootstrapper should stop after its SparkSubmit sub-proc... · 303a4e4d
      WangTao authored
      ...ess ends
      
      https://issues.apache.org/jira/browse/SPARK-4404
      
      When we have spark.driver.extra* or spark.driver.memory in SPARK_SUBMIT_PROPERTIES_FILE, spark-class will use SparkSubmitDriverBootstrapper to launch the driver.
      If we get the process id of SparkSubmitDriverBootstrapper and want to kill it while it is running, we expect its SparkSubmit sub-process to stop as well.
      
      Author: WangTao <barneystinson@aliyun.com>
      Author: WangTaoTheTonic <barneystinson@aliyun.com>
      
      Closes #3266 from WangTaoTheTonic/killsubmit and squashes the following commits:
      
      e03eba5 [WangTaoTheTonic] add comments
      57b5ca1 [WangTao] SparkSubmitDriverBootstrapper should stop after its SparkSubmit sub-process ends
      303a4e4d
    • Sandy Ryza's avatar
      SPARK-4214. With dynamic allocation, avoid outstanding requests for more... · ad42b283
      Sandy Ryza authored
      ... executors than pending tasks need.
      
      WIP. Still need to add and fix tests.
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #3204 from sryza/sandy-spark-4214 and squashes the following commits:
      
      35cf0e0 [Sandy Ryza] Add comment
      13b53df [Sandy Ryza] Review feedback
      067465f [Sandy Ryza] Whitespace fix
      6ae080c [Sandy Ryza] Add tests and get num pending tasks from ExecutorAllocationListener
      531e2b6 [Sandy Ryza] SPARK-4214. With dynamic allocation, avoid outstanding requests for more executors than pending tasks need.
      ad42b283
    • Jim Carroll's avatar
      [SPARK-4412][SQL] Fix Spark's control of Parquet logging. · 37482ce5
      Jim Carroll authored
      The Spark ParquetRelation.scala code assumes that the parquet.Log class has already been loaded. If ParquetRelation.enableLogForwarding executes before the parquet.Log class is loaded, then the code in enableLogForwarding has no effect.

      ParquetRelation.scala attempts to override the parquet logger but, at least currently (and if your application simply reads a parquet file before it does anything else with Parquet), the parquet.Log class hasn't been loaded yet. Therefore the code in ParquetRelation.enableLogForwarding has no effect. If you look at the code in parquet.Log, there's a static initializer that needs to run prior to enableLogForwarding, or whatever enableLogForwarding does gets undone by that static initializer.

      The fix is to force the static initializer in parquet.Log to be called as part of enableLogForwarding (a sketch of the idea follows this entry).
      
      Author: Jim Carroll <jim@dontcallme.com>
      
      Closes #3271 from jimfcarroll/parquet-logging and squashes the following commits:
      
      37bdff7 [Jim Carroll] Fix Spark's control of Parquet logging.
      37482ce5
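      A sketch of the idea (not the actual ParquetRelation code): load parquet.Log explicitly so its static initializer runs before the logger is reconfigured.

      ```scala
      import java.util.logging.Logger

      def forceParquetLogInit(): Unit = {
        // Trigger parquet.Log's static initializer exactly once, before touching its loggers,
        // so whatever handlers it installs cannot later undo our reconfiguration.
        Class.forName("parquet.Log")
        val parquetLogger = Logger.getLogger("parquet")
        parquetLogger.getHandlers.foreach(parquetLogger.removeHandler)
        parquetLogger.setUseParentHandlers(true)  // let the parent handlers receive the records
      }
      ```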
    • Yash Datta's avatar
      [SPARK-4365][SQL] Remove unnecessary filter call on records returned from parquet library · 63ca3af6
      Yash Datta authored
      Since the parquet library has been updated, we no longer need to filter the records returned from it for null records, as the library now skips those:

      From parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordReader.java
      
      public boolean nextKeyValue() throws IOException, InterruptedException {
        boolean recordFound = false;
        while (!recordFound) {
          // no more records left
          if (current >= total) {
            return false;
          }
          try {
            checkRead();
            currentValue = recordReader.read();
            current++;
            if (recordReader.shouldSkipCurrentRecord()) {
              // this record is being filtered via the filter2 package
              if (DEBUG) LOG.debug("skipping record");
              continue;
            }
            if (currentValue == null) {
              // only happens with FilteredRecordReader at end of block
              current = totalCountLoadedSoFar;
              if (DEBUG) LOG.debug("filtered record reader reached end of block");
              continue;
            }
            recordFound = true;
            if (DEBUG) LOG.debug("read value: " + currentValue);
          } catch (RuntimeException e) {
            throw new ParquetDecodingException(
              format("Can not read value at %d in block %d in file %s", current, currentBlock, file), e);
          }
        }
        return true;
      }
      
      Author: Yash Datta <Yash.Datta@guavus.com>
      
      Closes #3229 from saucam/remove_filter and squashes the following commits:
      
      8909ae9 [Yash Datta] SPARK-4365: Remove unnecessary filter call on records returned from parquet library
      63ca3af6
    • Jim Carroll's avatar
      [SPARK-4386] Improve performance when writing Parquet files. · f76b9683
      Jim Carroll authored
      If you profile the writing of a Parquet file, the single most time-consuming call inside org.apache.spark.sql.parquet.MutableRowWriteSupport.write is actually scala.collection.AbstractSequence.size. This is because the size call ends up COUNTING the elements via scala.collection.LinearSeqOptimized.length ("optimized?").

      This doesn't need to be done: "size" is currently called repeatedly wherever it is needed, rather than being called once at the top of the method and stored in a 'val' (a before/after sketch follows this entry).
      
      Author: Jim Carroll <jim@dontcallme.com>
      
      Closes #3254 from jimfcarroll/parquet-perf and squashes the following commits:
      
      30cc0b5 [Jim Carroll] Improve performance when writing Parquet files.
      f76b9683
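      An illustrative before/after of the change; the real code lives in MutableRowWriteSupport, and the names here are simplified.

      ```scala
      // Before: row.size is re-evaluated on every loop iteration, and on a linear Seq
      // each call walks the whole collection to count its elements.
      def writeFieldsSlow(row: Seq[Any]): Unit = {
        var i = 0
        while (i < row.size) {   // O(n) per check
          // writeField(row(i)) ...
          i += 1
        }
      }

      // After: compute the size once and store it in a val.
      def writeFieldsFast(row: Seq[Any]): Unit = {
        val size = row.size      // counted once
        var i = 0
        while (i < size) {
          // writeField(row(i)) ...
          i += 1
        }
      }
      ```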