  1. Oct 26, 2014
    • Josh Rosen's avatar
      [SPARK-3616] Add basic Selenium tests to WebUISuite · bf589fc7
      Josh Rosen authored
      This patch adds Selenium tests for Spark's web UI.  To avoid adding extra
      dependencies to the test environment, the tests use Selenium's HtmlUnitDriver,
      which is pure-Java, instead of, say, ChromeDriver.
      
      I added new tests to try to reproduce a few UI bugs reported on JIRA, namely
      SPARK-3021, SPARK-2105, and SPARK-2527.  I wasn't able to reproduce these bugs;
      I suspect that the older ones might have been fixed by other patches.
      
      In order to use HtmlUnitDriver, I added an explicit dependency on the
      org.apache.httpcomponents version of httpclient in order to prevent jets3t's
      older version from taking precedence on the classpath.
      
      I also upgraded ScalaTest to 2.2.1.
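
      A hedged sketch of the kind of test such a suite contains (the URL and assertion are illustrative, not the actual WebUISuite code); HtmlUnitDriver is pure Java, so no external browser binary is needed on CI:

      ```scala
      import org.openqa.selenium.WebDriver
      import org.openqa.selenium.htmlunit.HtmlUnitDriver
      import org.scalatest.FunSuite
      import org.scalatest.selenium.WebBrowser

      class WebUISmokeTest extends FunSuite with WebBrowser {
        // HtmlUnitDriver runs in-process, which keeps the test environment dependency-light.
        implicit val webDriver: WebDriver = new HtmlUnitDriver

        test("stages page renders") {
          // Assumes a SparkContext with its web UI already running on the default port.
          go to "http://localhost:4040/stages"
          assert(pageTitle.toLowerCase.contains("spark"))
        }
      }
      ```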
      
      Author: Josh Rosen <joshrosen@apache.org>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #2474 from JoshRosen/webui-selenium-tests and squashes the following commits:
      
      fcc9e83 [Josh Rosen] scalautils -> scalactic package rename
      510e54a [Josh Rosen] [SPARK-3616] Add basic Selenium tests to WebUISuite.
      bf589fc7
    • Daniel Lemire's avatar
      Update RoaringBitmap to 0.4.3 · b7595401
      Daniel Lemire authored
      Roaring has been updated to version 0.4.3. We fixed a rarely occurring bug with serialization. No API or format changes were made.
      
      Author: Daniel Lemire <lemire@gmail.com>
      
      Closes #2938 from lemire/master and squashes the following commits:
      
      431f3a0 [Daniel Lemire] Recommended bug fix release
      b7595401
    • Sean Owen's avatar
      SPARK-3359 [DOCS] sbt/sbt unidoc doesn't work with Java 8 · df7974b8
      Sean Owen authored
      This follows https://github.com/apache/spark/pull/2893 , but does not completely fix SPARK-3359 either. This fixes minor scaladoc/javadoc issues that Javadoc 8 will treat as errors.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #2909 from srowen/SPARK-3359 and squashes the following commits:
      
      f62c347 [Sean Owen] Fix some javadoc issues that javadoc 8 considers errors. This is not all of the errors turned up when javadoc 8 runs on output of genjavadoc.
      df7974b8
  2. Oct 25, 2014
    • Andrew Or's avatar
      [SPARK-4071] Unroll fails silently if BlockManager is small · c6834440
      Andrew Or authored
      In tests, we may want to have BlockManagers of size < 1MB (spark.storage.unrollMemoryThreshold). However, these BlockManagers are useless because we can't unroll anything in them ever. At the very least we need to log a warning.
      
      tdas
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #2917 from andrewor14/unroll-safely-logging and squashes the following commits:
      
      38947e3 [Andrew Or] Warn against starting a block manager that's too small
      fd621b4 [Andrew Or] Warn against failure to reserve initial memory threshold
      c6834440
    • Josh Rosen's avatar
      Revert "[SPARK-4056] Upgrade snappy-java to 1.1.1.5" · 2e52e4f8
      Josh Rosen authored
      This reverts commit 898b22ab.
      
      Reverting because this may be causing OOMs.
      2e52e4f8
    • Davies Liu's avatar
      [SPARK-4088] [PySpark] Python worker should exit after socket is closed by JVM · e41786c7
      Davies Liu authored
      In the case of take() or an exception in Python, the Python worker may exit before the JVM has read the whole response, and then the write thread may raise a "Connection reset" exception.
      
      Python should always wait for the JVM to close the socket first.
      
      cc JoshRosen This is a hotfix; without it the tests will be flaky, sorry for that.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #2941 from davies/fix_exit and squashes the following commits:
      
      9d4d21e [Davies Liu] fix race
      e41786c7
    • Josh Rosen's avatar
      [SPARK-2321] Stable pull-based progress / status API · 95303168
      Josh Rosen authored
      This pull request is a first step towards the implementation of a stable, pull-based progress / status API for Spark (see [SPARK-2321](https://issues.apache.org/jira/browse/SPARK-2321)).  For now, I'd like to discuss the basic implementation, API names, and overall interface design.  Once we arrive at a good design, I'll go back and add additional methods to expose more information via these APIs.
      
      #### Design goals:
      
      - Pull-based API
      - Usable from Java / Scala / Python (eventually, likely with a wrapper)
      - Can be extended to expose more information without introducing binary incompatibilities.
      - Returns immutable objects.
      - Don't leak any implementation details, preserving our freedom to change the implementation.
      
      #### Implementation:
      
      - Add public methods (`getJobInfo`, `getStageInfo`) to SparkContext to allow status / progress information to be retrieved.
      - Add public interfaces (`SparkJobInfo`, `SparkStageInfo`) for our API return values.  These interfaces consist entirely of Java-style getter methods.  The interfaces are currently implemented in Java.  I decided to explicitly separate the interface from its implementation (`SparkJobInfoImpl`, `SparkStageInfoImpl`) in order to prevent users from constructing these responses themselves.
      - Allow an existing JobProgressListener to be used when constructing a live SparkUI.  This allows us to re-use these listeners in the implementation of this status API.  There are a few reasons why this listener re-use makes sense:
         - The status API and web UI are guaranteed to show consistent information.
         - These listeners are already well-tested.
         - The same garbage-collection / information retention configurations can apply to both this API and the web UI.
      - Extend JobProgressListener to maintain `jobId -> Job` and `stageId -> Stage` mappings.
      
      The progress API methods are implemented in a separate trait that's mixed into SparkContext.  This helps keep SparkContext.scala from becoming larger and more difficult to read.
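
      For illustration, a hedged sketch of the interface shape described above (Java-style getters; the real SparkJobInfo / SparkStageInfo are Java interfaces inside Spark and may differ in detail, so the names below carry a "Sketch" suffix):

      ```scala
      trait SparkJobInfoSketch {
        def jobId(): Int
        def stageIds(): Array[Int]
        def status(): String            // e.g. "RUNNING", "SUCCEEDED", "FAILED" (assumed representation)
      }

      trait SparkStageInfoSketch {
        def stageId(): Int
        def currentAttemptId(): Int     // "Expose current stage attempt id" appears in the squashed commits
        def numTasks(): Int
        def numActiveTasks(): Int
        def numCompletedTasks(): Int
        def numFailedTasks(): Int
      }
      ```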
      
      Author: Josh Rosen <joshrosen@databricks.com>
      Author: Josh Rosen <joshrosen@apache.org>
      
      Closes #2696 from JoshRosen/progress-reporting-api and squashes the following commits:
      
      e6aa78d [Josh Rosen] Add tests.
      b585c16 [Josh Rosen] Accept SparkListenerBus instead of more specific subclasses.
      c96402d [Josh Rosen] Address review comments.
      2707f98 [Josh Rosen] Expose current stage attempt id
      c28ba76 [Josh Rosen] Update demo code:
      646ff1d [Josh Rosen] Document spark.ui.retainedJobs.
      7f47d6d [Josh Rosen] Clean up SparkUI constructors, per Andrew's feedback.
      b77b3d8 [Josh Rosen] Merge remote-tracking branch 'origin/master' into progress-reporting-api
      787444c [Josh Rosen] Move status API methods into trait that can be mixed into SparkContext.
      f9a9a00 [Josh Rosen] More review comments:
      3dc79af [Josh Rosen] Remove creation of unused listeners in SparkContext.
      249ca16 [Josh Rosen] Address several review comments:
      da5648e [Josh Rosen] Add example of basic progress reporting in Java.
      7319ffd [Josh Rosen] Add getJobIdsForGroup() and num*Tasks() methods.
      cc568e5 [Josh Rosen] Add note explaining that interfaces should not be implemented outside of Spark.
      6e840d4 [Josh Rosen] Remove getter-style names and "consistent snapshot" semantics:
      08cbec9 [Josh Rosen] Begin to sketch the interfaces for a stable, public status API.
      ac2d13a [Josh Rosen] Add jobId->stage, stageId->stage mappings in JobProgressListener
      24de263 [Josh Rosen] Create UI listeners in SparkContext instead of in Tabs:
      95303168
  3. Oct 24, 2014
    • Michael Armbrust's avatar
      [SQL] Update Hive test harness for Hive 12 and 13 · 3a845d3c
      Michael Armbrust authored
      As part of the upgrade I also copy the newest version of the query tests, and whitelist a bunch of new ones that are now passing.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2936 from marmbrus/fix13tests and squashes the following commits:
      
      d9cbdab [Michael Armbrust] Remove user specific tests
      65801cd [Michael Armbrust] style and rat
      8f6b09a [Michael Armbrust] Update test harness to work with both Hive 12 and 13.
      f044843 [Michael Armbrust] Update Hive query tests and golden files to 0.13
      3a845d3c
    • Josh Rosen's avatar
      [SPARK-4056] Upgrade snappy-java to 1.1.1.5 · 898b22ab
      Josh Rosen authored
      This upgrades snappy-java to 1.1.1.5, which improves error messages when attempting to deserialize empty inputs using SnappyInputStream (see https://github.com/xerial/snappy-java/issues/89).
      
      Author: Josh Rosen <rosenville@gmail.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #2911 from JoshRosen/upgrade-snappy-java and squashes the following commits:
      
      adec96c [Josh Rosen] Use snappy-java 1.1.1.5
      cc953d6 [Josh Rosen] [SPARK-4056] Upgrade snappy-java to 1.1.1.4
      898b22ab
    • Josh Rosen's avatar
      [SPARK-4080] Only throw IOException from [write|read][Object|External] · 6c98c29a
      Josh Rosen authored
      If classes implementing Serializable or Externalizable interfaces throw
      exceptions other than IOException or ClassNotFoundException from their
      (de)serialization methods, then this results in an unhelpful
      "IOException: unexpected exception type" rather than the actual exception that
      produced the (de)serialization error.
      
      This patch fixes this by adding a utility method that re-wraps any uncaught
      exceptions in IOException (unless they are already instances of IOException).
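
      A minimal sketch of the re-wrapping utility described above (the method and object names are assumed, not necessarily the exact helper added by the patch):

      ```scala
      import java.io.IOException

      object SerializationUtilSketch {
        def tryOrIOException[T](block: => T): T =
          try {
            block
          } catch {
            case e: IOException => throw e                  // already an IOException, rethrow as-is
            case t: Throwable   => throw new IOException(t) // wrap anything else, preserving the cause
          }
      }
      ```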
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #2932 from JoshRosen/SPARK-4080 and squashes the following commits:
      
      cd3a9be [Josh Rosen] [SPARK-4080] Only throw IOException from [write|read][Object|External].
      6c98c29a
    • Michael Armbrust's avatar
      [HOTFIX][SQL] Remove sleep on reset() failure. · 3a906c66
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2934 from marmbrus/patch-2 and squashes the following commits:
      
      a96dab2 [Michael Armbrust] Remove sleep on reset() failure.
      3a906c66
    • Grace's avatar
      [GraphX] Modify option name according to example doc in SynthBenchmark · 07e439b4
      Grace authored
      The graphx.SynthBenchmark example has an iteration-count option named "niter". However, in its documentation it is named "niters". The mismatch between the implementation and the documentation causes an IllegalArgumentException when trying that example.
      
      Author: Grace <jie.huang@intel.com>
      
      Closes #2888 from GraceH/synthbenchmark and squashes the following commits:
      
      f101ee1 [Grace] Modify option name according to example doc
      07e439b4
    • Nan Zhu's avatar
      [SPARK-4067] refactor ExecutorUncaughtExceptionHandler · f80dcf2a
      Nan Zhu authored
      https://issues.apache.org/jira/browse/SPARK-4067
      
      Currently we call Utils.tryOrExit everywhere:
      AppClient
      Executor
      TaskSchedulerImpl
      This makes the name ExecutorUncaughtExceptionHandler no longer fit its actual usage.
      
      Author: Nan Zhu <nanzhu@Nans-MacBook-Pro.local>
      Author: Nan Zhu <nanzhu@nans-mbp.home>
      
      Closes #2913 from CodingCat/SPARK-4067 and squashes the following commits:
      
      035ee3d [Nan Zhu] make RAT happy
      e62e416 [Nan Zhu] add some general Exit code
      a10b63f [Nan Zhu] refactor
      f80dcf2a
    • Andrew Or's avatar
      [SPARK-4013] Do not create multiple actor systems on each executor · b563987e
      Andrew Or authored
      In the existing code, each coarse-grained executor has two concurrently running actor systems. This causes many more error messages to be logged than necessary when the executor is lost or killed because we receive a disassociation event for each of these actor systems.
      
      This is blocking #2840.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #2863 from andrewor14/executor-actor-system and squashes the following commits:
      
      44ce2e0 [Andrew Or] Avoid starting two actor systems on each executor
      b563987e
    • Kousuke Saruta's avatar
      [SPARK-4075] [Deploy] Jar url validation is not enough for Jar file · 098f83c7
      Kousuke Saruta authored
      In deploy.ClientArguments.isValidJarUrl, the url is checked as follows.
      
          def isValidJarUrl(s: String): Boolean = s.matches("(.+):(.+)jar")
      
      So it accepts URLs like 'hdfs:file.jar' (no authority).
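
      A hedged sketch of a stricter check using java.net.URI (illustrative only, not necessarily the exact code in the patch): require a scheme, a ".jar" path, and an authority for remote schemes.

      ```scala
      import java.net.URI
      import scala.util.Try

      def isValidJarUrlSketch(s: String): Boolean = Try {
        val uri = new URI(s)
        uri.getScheme != null &&
          uri.getPath != null && uri.getPath.endsWith(".jar") &&
          // 'file:' and 'local:' URLs legitimately have no authority; remote schemes must have one.
          (Set("file", "local").contains(uri.getScheme) || uri.getAuthority != null)
      }.getOrElse(false)
      ```

      Under this sketch, 'hdfs:file.jar' parses as an opaque URI with no path and no authority, so it is rejected, while 'hdfs://host/path/app.jar' passes.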
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #2925 from sarutak/uri-syntax-check-improvement and squashes the following commits:
      
      cf06173 [Kousuke Saruta] Improved URI syntax checking
      098f83c7
    • Kousuke Saruta's avatar
      [SPARK-4076] Parameter expansion in spark-config is wrong · 30ea2868
      Kousuke Saruta authored
      In sbin/spark-config.sh, parameter expansion is used to extract source root as follows.
      
          this="${BASH_SOURCE-$0}"
      
      I think the parameter expansion should use ":-" instead of "-".
      If we use "-" and BASH_SOURCE="" (set to the empty string, not unset),
      then $this is set to "" (the empty string) instead of falling back to $0.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #2930 from sarutak/SPARK-4076 and squashes the following commits:
      
      32a0370 [Kousuke Saruta] Fixed wrong parameter expansion
      30ea2868
    • Li Zhihui's avatar
      [SPARK-2713] Executors of same application in same host should only download files & jars once · 7aacb7bf
      Li Zhihui authored
      If Spark launches multiple executors on one host for one application, every executor downloads its dependent files and jars (if not using a local: url) independently. This can result in significant latency. In my case, it took 20 seconds to download the dependent jars (about 17 MB) when I launched 32 executors on every host (4 hosts in total).
      
      This patch caches downloaded files and jars for executors to reduce network traffic and download latency. In my case, the latency was reduced from 20 seconds to less than 1 second.
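
      A hedged sketch of the download-once-per-host idea (the names and cache layout are assumed, not the patch itself): executors on the same host serialize on a lock file; the winner downloads into a shared cache and the others simply reuse the cached copy.

      ```scala
      import java.io.{File, RandomAccessFile}

      def fetchCachedSketch(url: String, cacheDir: File)(fetch: (String, File) => Unit): File = {
        val cached   = new File(cacheDir, s"${url.hashCode}_cache")
        val lockFile = new File(cacheDir, s"${url.hashCode}_lock")
        val raf  = new RandomAccessFile(lockFile, "rw")
        val lock = raf.getChannel.lock()              // blocks until we hold the host-wide file lock
        try {
          if (!cached.exists()) fetch(url, cached)    // only the first executor actually downloads
        } finally {
          lock.release()
          raf.close()
        }
        cached
      }
      ```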
      
      Author: Li Zhihui <zhihui.li@intel.com>
      Author: li-zhihui <zhihui.li@intel.com>
      
      Closes #1616 from li-zhihui/cachefiles and squashes the following commits:
      
      36940df [Li Zhihui] Close cache for local mode
      935fed6 [Li Zhihui] Clean code.
      f9330d4 [Li Zhihui] Clean code again
      7050d46 [Li Zhihui] Clean code
      074a422 [Li Zhihui] Fix: deal with spark.files.overwrite
      03ed3a8 [li-zhihui] rename cache file name as XXXXXXXXX_cache
      2766055 [li-zhihui] Use url.hashCode + timestamp as cachedFileName
      76a7b66 [Li Zhihui] Clean code & use applcation work directory as cache directory
      3510eb0 [Li Zhihui] Keep fetchFile private
      2ffd742 [Li Zhihui] add comment for FileLock
      e0ebd48 [Li Zhihui] Try and finally lock.release
      7fb7c0b [Li Zhihui] Release lock before copy files
      6b997bf [Li Zhihui] Executors of same application in same host should only download files & jars once
      7aacb7bf
    • Hari Shreedharan's avatar
      [SPARK-4026][Streaming] Write ahead log management · 6a40a768
      Hari Shreedharan authored
      As part of the effort to avoid data loss on Spark Streaming driver failure, we want to implement a write ahead log that can write received data to HDFS. This allows the received data to persist across driver failures, so when the streaming driver is restarted, it can find and reprocess all the data that was received but not processed.
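
      A minimal, hedged sketch of the write-ahead-log idea (assumed names and a plain local file, not the streaming WAL code): length-prefixed records are appended and flushed before being acknowledged, so a restarted driver can replay everything that was received but not yet processed.

      ```scala
      import java.io.{DataOutputStream, FileOutputStream}

      class SimpleWriteAheadLog(path: String) {
        private val out = new DataOutputStream(new FileOutputStream(path, true))

        def write(record: Array[Byte]): Unit = synchronized {
          out.writeInt(record.length)   // length prefix makes sequential replay straightforward
          out.write(record)
          out.flush()                   // the real implementation would hflush/hsync to HDFS
        }

        def close(): Unit = out.close()
      }
      ```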
      
      This was primarily implemented by @harishreedharan. This is still WIP, as he is going to improve the unit tests by using an HDFS mini cluster.
      
      Author: Hari Shreedharan <hshreedharan@apache.org>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #2882 from tdas/driver-ha-wal and squashes the following commits:
      
      e4bee20 [Tathagata Das] Removed synchronized, Path.getFileSystem is threadsafe
      55514e2 [Tathagata Das] Minor changes based on PR comments.
      d29fddd [Tathagata Das] Merge pull request #20 from harishreedharan/driver-ha-wal
      a317a4d [Hari Shreedharan] Directory deletion should not fail tests
      9514dc8 [Tathagata Das] Added unit tests to test reading of corrupted data and other minor edits
      3881706 [Tathagata Das] Merge pull request #19 from harishreedharan/driver-ha-wal
      4705fff [Hari Shreedharan] Sort listed files by name. Use local files for WAL tests.
      eb356ca [Tathagata Das] Merge pull request #18 from harishreedharan/driver-ha-wal
      82ce56e [Hari Shreedharan] Fix file ordering issue in WALManager tests
      5ff90ee [Hari Shreedharan] Fix tests to not ignore ordering and also assert all data is present
      ef8db09 [Tathagata Das] Merge pull request #17 from harishreedharan/driver-ha-wal
      7e40e56 [Hari Shreedharan] Restore old build directory after tests
      587b876 [Hari Shreedharan] Fix broken test. Call getFileSystem only from synchronized method.
      b4be0c1 [Hari Shreedharan] Remove unused method
      edcbee1 [Hari Shreedharan] Tests reading and writing data using writers now use Minicluster.
      5c70d1f [Hari Shreedharan] Remove underlying stream from the WALWriter.
      4ab602a [Tathagata Das] Refactored write ahead stuff from streaming.storage to streaming.util
      b06be2b [Tathagata Das] Adding missing license.
      5182ffb [Hari Shreedharan] Added documentation
      172358d [Tathagata Das] Pulled WriteAheadLog-related stuff from tdas/spark/tree/driver-ha-working
      6a40a768
    • Zhan Zhang's avatar
      [SPARK-2706][SQL] Enable Spark to support Hive 0.13 · 7c89a8f0
      Zhan Zhang authored
      Given that a lot of users are trying to use Hive 0.13 in Spark, and that hive-0.12 and hive-0.13 are incompatible at the API level, I want to propose the following approach. It has no or minimal impact on existing hive-0.12 support, but makes it possible to jumpstart the development of hive-0.13 and future-version support.
      
      Approach: Introduce a "hive-version" property, and manipulate the pom.xml files to support different Hive versions at compile time through a shim layer, e.g., hive-0.12.0 and hive-0.13.1. More specifically,
      
      1. For each different hive version, there is a very light layer of shim code to handle API differences, sitting in sql/hive/hive-version, e.g., sql/hive/v0.12.0 or sql/hive/v0.13.1
      
      2. Add a new profile, hive-default, active by default, which picks up all existing configuration and the hive-0.12.0 shim (v0.12.0) if no hive.version is specified.
      
      3. If the user specifies a different version (currently only 0.13.1, via -Dhive.version=0.13.1), the hive-versions profile is activated, which picks up the version-specific shim layer and configuration, mainly the Hive jars and the hive-version shim, e.g., v0.13.1.
      
      4. With this approach, nothing is changed with current hive-0.12 support.
      
      No change by default: sbt/sbt -Phive
      For example: sbt/sbt -Phive -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 assembly
      
      To enable hive-0.13: sbt/sbt -Dhive.version=0.13.1
      For example: sbt/sbt -Dhive.version=0.13.1 -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 assembly
      
      Note that with hive-0.13, hive-thriftserver is not enabled; that should be fixed in another JIRA. We also don't need -Phive together with -Dhive.version when building (we should probably use -Phive -Dhive.version=xxx instead once the thrift server is also supported on hive-0.13.1).
      
      Author: Zhan Zhang <zhazhan@gmail.com>
      Author: zhzhan <zhazhan@gmail.com>
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #2241 from zhzhan/spark-2706 and squashes the following commits:
      
      3ece905 [Zhan Zhang] minor fix
      410b668 [Zhan Zhang] solve review comments
      cbb4691 [Zhan Zhang] change run-test for new options
      0d4d2ed [Zhan Zhang] rebase
      497b0f4 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      8fad1cf [Zhan Zhang] change the pom file and make hive-0.13.1 as the default
      ab028d1 [Zhan Zhang] rebase
      4a2e36d [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      4cb1b93 [zhzhan] Merge pull request #1 from pwendell/pr-2241
      b0478c0 [Patrick Wendell] Changes to simplify the build of SPARK-2706
      2b50502 [Zhan Zhang] rebase
      a72c0d4 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      cb22863 [Zhan Zhang] correct the typo
      20f6cf7 [Zhan Zhang] solve compatability issue
      f7912a9 [Zhan Zhang] rebase and solve review feedback
      301eb4a [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      10c3565 [Zhan Zhang] address review comments
      6bc9204 [Zhan Zhang] rebase and remove temparory repo
      d3aa3f2 [Zhan Zhang] Merge branch 'master' into spark-2706
      cedcc6f [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      3ced0d7 [Zhan Zhang] rebase
      d9b981d [Zhan Zhang] rebase and fix error due to rollback
      adf4924 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      3dd50e8 [Zhan Zhang] solve conflicts and remove unnecessary implicts
      d10bf00 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      dc7bdb3 [Zhan Zhang] solve conflicts
      7e0cc36 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      d7c3e1e [Zhan Zhang] Merge branch 'master' into spark-2706
      68deb11 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      d48bd18 [Zhan Zhang] address review comments
      3ee3b2b [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      57ea52e [Zhan Zhang] Merge branch 'master' into spark-2706
      2b0d513 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      9412d24 [Zhan Zhang] address review comments
      f4af934 [Zhan Zhang] rebase
      1ccd7cc [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      128b60b [Zhan Zhang] ignore 0.12.0 test cases for the time being
      af9feb9 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      5f5619f [Zhan Zhang] restructure the directory and different hive version support
      05d3683 [Zhan Zhang] solve conflicts
      e4c1982 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      94b4fdc [Zhan Zhang] Spark-2706: hive-0.13.1 support on spark
      87ebf3b [Zhan Zhang] Merge branch 'master' into spark-2706
      921e914 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      f896b2a [Zhan Zhang] Merge branch 'master' into spark-2706
      789ea21 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      cb53a2c [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      f6a8a40 [Zhan Zhang] revert
      ba14f28 [Zhan Zhang] test
      dbedff3 [Zhan Zhang] Merge remote-tracking branch 'upstream/master'
      70964fe [Zhan Zhang] revert
      fe0f379 [Zhan Zhang] Merge branch 'master' of https://github.com/zhzhan/spark
      70ffd93 [Zhan Zhang] revert
      42585ec [Zhan Zhang] test
      7d5fce2 [Zhan Zhang] test
      7c89a8f0
    • Michael Armbrust's avatar
      [SPARK-4050][SQL] Fix caching of temporary tables with projections. · 0e886610
      Michael Armbrust authored
      Previously cached data was found by `sameResult` plan matching on optimized plans.  This technique however fails to locate the cached data when a temporary table with a projection is queried with a further reduced projection.  The failure is due to the fact that optimization will collapse the projections, producing a plan that no longer produces the sameResult as the cached data (though the cached data still subsumes the desired data).  For example consider the following previously failing test case.
      
      ```scala
      sql("CACHE TABLE tempTable AS SELECT key FROM testData")
      assertCached(sql("SELECT COUNT(*) FROM tempTable"))
      ```
      
      In this PR I change the matching to occur after analysis instead of optimization, so that in the case of temporary tables, the plans will always match.  I think this should work generally, however, this error does raise questions about the need to do more thorough subsumption checking when locating cached data.
      
      Another question is what sort of semantics we want to provide when uncaching data from temporary tables.  For example consider the following sequence of commands:
      
      ```scala
      testData.select('key).registerTempTable("tempTable1")
      testData.select('key).registerTempTable("tempTable2")
      cacheTable("tempTable1")
      
      // This obviously works.
      assertCached(sql("SELECT COUNT(*) FROM tempTable1"))
      
      // It seems good that this works ...
      assertCached(sql("SELECT COUNT(*) FROM tempTable2"))
      
      // ... but is this valid?
      uncacheTable("tempTable2")
      
      // Should this still be cached?
      assertCached(sql("SELECT COUNT(*) FROM tempTable1"), 0)
      ```
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2912 from marmbrus/cachingBug and squashes the following commits:
      
      9c822d4 [Michael Armbrust] remove commented out code
      5c72fb7 [Michael Armbrust] Add a test case / question about uncaching semantics.
      63a23e4 [Michael Armbrust] Perform caching on analyzed instead of optimized plan.
      03f1cfe [Michael Armbrust] Clean-up / add tests to SameResult suite.
      0e886610
    • Davies Liu's avatar
      [SPARK-4051] [SQL] [PySpark] Convert Row into dictionary · d60a9d44
      Davies Liu authored
      Added a method to Row to turn a row into a dict:
      
      ```
      >>> row = Row(a=1)
      >>> row.asDict()
      {'a': 1}
      ```
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #2896 from davies/dict and squashes the following commits:
      
      8d97366 [Davies Liu] convert Row into dict
      d60a9d44
    • Kousuke Saruta's avatar
      [SPARK-3900][YARN] ApplicationMaster's shutdown hook fails and IllegalStateException is thrown. · d2987e8f
      Kousuke Saruta authored
      ApplicationMaster registers a shutdown hook and it calls ApplicationMaster#cleanupStagingDir.
      
      cleanupStagingDir invokes FileSystem.get(yarnConf), which invokes FileSystem.getInternal, and FileSystem.getInternal also registers a shutdown hook.
      In Hadoop 0.23's FileSystem, the shutdown hook registration does not consider whether shutdown is already in progress (in 2.2, it does).
      
          // 0.23
          if (map.isEmpty() ) {
            ShutdownHookManager.get().addShutdownHook(clientFinalizer, SHUTDOWN_HOOK_PRIORITY);
          }
      
          // 2.2
          if (map.isEmpty()
                      && !ShutdownHookManager.get().isShutdownInProgress()) {
             ShutdownHookManager.get().addShutdownHook(clientFinalizer, SHUTDOWN_HOOK_PRIORITY);
          }
      
      Thus, in 0.23, another shutdown hook can be registered while ApplicationMaster's shutdown hook runs.
      
      This issue causes an IllegalStateException as follows.
      
          java.lang.IllegalStateException: Shutdown in progress, cannot add a shutdownHook
                  at org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:152)
                  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2306)
                  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2278)
                  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:316)
                  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:162)
                  at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:307)
                  at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:118)
                  at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #2924 from sarutak/SPARK-3900-2 and squashes the following commits:
      
      9112817 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3900-2
      97018fa [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3900
      2c2850e [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3900
      ee52db2 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3900
      a7d6c9b [Kousuke Saruta] Merge branch 'SPARK-3900' of github.com:sarutak/spark into SPARK-3900
      1cdf03c [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3900
      a5f6443 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3900
      57b397d [Kousuke Saruta] Fixed IllegalStateException caused by shutdown hook registration in another shutdown hook
      d2987e8f
    • Davies Liu's avatar
      [SPARK-2652] [PySpark] donot use KyroSerializer as default serializer · 809c785b
      Davies Liu authored
      KryoSerializer cannot serialize customized classes unless they are registered explicitly, so using it as the default serializer in PySpark would introduce regressions in MLlib.
      
      cc mengxr
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #2916 from davies/revert and squashes the following commits:
      
      43eb6d3 [Davies Liu] donot use KyroSerializer as default serializer
      809c785b
    • Prashant Sharma's avatar
      SPARK-3812 Build changes to publish effective pom. · 0aea2289
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #2921 from ScrapCodes/build-changes-effective-pom and squashes the following commits:
      
      8841491 [Prashant Sharma] Fixed broken maven build.
      aa7b91d [Prashant Sharma] used an unused dep.
      0300dac [Prashant Sharma] improved comment messages..
      28f891e [Prashant Sharma] Added a useless dependency, so that we can shade it. And realized fake shading works for us.
      553d96b [Prashant Sharma] Shaded some unused class of an unused dep, to generate effective pom(s)
      0aea2289
    • Cheng Lian's avatar
      [SPARK-4000][BUILD] Sends archived unit tests logs to Jenkins master · a29c9bd6
      Cheng Lian authored
      This PR sends archived unit test logs to the build history directory on the Jenkins master, so that we can serve them via HTTP later to help debug Jenkins build failures.
      
      pwendell JoshRosen Please help review, thanks!
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #2845 from liancheng/log-archive and squashes the following commits:
      
      ac8d9d4 [Cheng Lian] Includes build number in messages posted to GitHub
      68c7010 [Cheng Lian] Logs backup should be implemented in dev/run-tests-jenkins
      4b912f7 [Cheng Lian] Sends archived unit tests logs to Jenkins master
      a29c9bd6
  4. Oct 23, 2014
    • Davies Liu's avatar
      [SPARK-3993] [PySpark] fix bug while reuse worker after take() · e595c8d0
      Davies Liu authored
      After take(), there may be some garbage left in the socket, and the next task assigned to this worker will then hang because of the corrupted data.
      
      We should make sure the socket is clean before reusing it: write END_OF_STREAM at the end, and check for it after reading out all results from Python.
      
      Author: Davies Liu <davies.liu@gmail.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #2838 from davies/fix_reuse and squashes the following commits:
      
      8872914 [Davies Liu] fix tests
      660875b [Davies Liu] fix bug while reuse worker after take()
      e595c8d0
    • Josh Rosen's avatar
      [SPARK-4019] [SPARK-3740] Fix MapStatus compression bug that could lead to... · 83b7a1c6
      Josh Rosen authored
      [SPARK-4019] [SPARK-3740] Fix MapStatus compression bug that could lead to empty results or Snappy errors
      
      This commit fixes a bug in MapStatus that could cause jobs to wrongly return
      empty results if those jobs contained stages with more than 2000 partitions
      where most of those partitions were empty.
      
      For jobs with > 2000 partitions, MapStatus uses HighlyCompressedMapStatus,
      which only stores the average size of blocks.  If the average block size is
      zero, then this will cause all blocks to be reported as empty, causing
      BlockFetcherIterator to mistakenly skip them.
      
      For example, this would return an empty result:
      
          sc.makeRDD(0 until 10, 1000).repartition(2001).collect()
      
      This can also lead to deserialization errors (e.g. Snappy decoding errors)
      for jobs with > 2000 partitions where the average block size is non-zero but
      there is at least one empty block.  In this case, the BlockFetcher attempts to
      fetch empty blocks and fails when trying to deserialize them.
      
      The root problem here is that MapStatus has a (previously undocumented)
      correctness property that was violated by HighlyCompressedMapStatus:
      
          If a block is non-empty, then getSizeForBlock must be non-zero.
      
      I fixed this by modifying HighlyCompressedMapStatus to store the average size
      of _non-empty_ blocks and to use a compressed bitmap to track which blocks are
      empty.
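
      A hedged sketch of the fix's core idea (not the actual HighlyCompressedMapStatus code): record which blocks are empty in a compressed bitmap and report the average size of the non-empty blocks for everything else, so a non-empty block can never be reported as size zero.

      ```scala
      import org.roaringbitmap.RoaringBitmap

      class CompressedSizesSketch(uncompressedSizes: Array[Long]) {
        private val emptyBlocks = new RoaringBitmap()
        private val nonEmpty    = uncompressedSizes.filter(_ > 0)
        private val avgSize: Long = if (nonEmpty.nonEmpty) nonEmpty.sum / nonEmpty.length else 0L

        // Mark the indexes of the empty blocks; RoaringBitmap keeps this compact.
        uncompressedSizes.zipWithIndex.foreach { case (size, i) =>
          if (size == 0) emptyBlocks.add(i)
        }

        def getSizeForBlock(reduceId: Int): Long =
          if (emptyBlocks.contains(reduceId)) 0L else avgSize
      }
      ```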
      
      I also removed a test which was broken as originally written: it attempted
      to check that HighlyCompressedMapStatus's size estimation error was < 10%,
      but this was broken because HighlyCompressedMapStatus is only used for map
      statuses with > 2000 partitions, but the test only created 50.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #2866 from JoshRosen/spark-4019 and squashes the following commits:
      
      fc8b490 [Josh Rosen] Roll back hashset change, which didn't improve performance.
      5faa0a4 [Josh Rosen] Incorporate review feedback
      c8b8cae [Josh Rosen] Two performance fixes:
      3b892dd [Josh Rosen] Address Reynold's review comments
      ba2e71c [Josh Rosen] Add missing newline
      609407d [Josh Rosen] Use Roaring Bitmap to track non-empty blocks.
      c23897a [Josh Rosen] Use sets when comparing collect() results
      91276a3 [Josh Rosen] [SPARK-4019] Fix MapStatus compression bug that could lead to empty results.
      83b7a1c6
    • Patrick Wendell's avatar
      Revert "[SPARK-3812] [BUILD] Adapt maven build to publish effective pom." · 222fa47f
      Patrick Wendell authored
      This reverts commit c5882c66.
      
      I am reverting this because it appears to cause the maven tests to hang.
      222fa47f
    • Holden Karau's avatar
      specify unidocGenjavadocVersion of 0.8 · 293672c4
      Holden Karau authored
      Fixes an issue where javadoc generation was too strict, causing errors.
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #2893 from holdenk/SPARK-3359-sbtunidoc-java8 and squashes the following commits:
      
      9379a70 [Holden Karau] specify unidocGenjavadocVersion of 0.8
      293672c4
    • Tal Sliwowicz's avatar
      [SPARK-4006] In long running contexts, we encountered the situation of double registe... · 6b485225
      Tal Sliwowicz authored
      ...r without a remove in between. The cause for that is unknown; we assume it was a temporary network issue.
      
      However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.
      
      The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.
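
      A hedged sketch of the fix described above (simplified types and map names, not the actual BlockManagerMasterActor code): when an executor re-registers under a new id, drop the stale entry from both maps first so they stay consistent instead of the driver calling System.exit(1).

      ```scala
      import scala.collection.mutable

      def registerSketch(executorId: String, newBlockManagerId: String,
                         infoById: mutable.Map[String, String],        // blockManagerId -> info (simplified)
                         idByExecutor: mutable.Map[String, String]): Unit = {
        idByExecutor.get(executorId).filter(_ != newBlockManagerId).foreach { staleId =>
          infoById.remove(staleId)          // local cleanup, mimicking what expireDeadHosts() does
          idByExecutor.remove(executorId)
        }
        idByExecutor(executorId) = newBlockManagerId
        infoById(newBlockManagerId) = s"info for $executorId"
      }
      ```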
      
      Also - added some logging for register and unregister.
      
      This is just like https://github.com/apache/spark/pull/2854 except it's on master
      
      Author: Tal Sliwowicz <tal.s@taboola.com>
      
      Closes #2886 from tsliwowicz/master-block-mgr-removal and squashes the following commits:
      
      094d508 [Tal Sliwowicz] some more white space change undone
      41a2217 [Tal Sliwowicz] some more whitspaces change undone
      7bcfc3d [Tal Sliwowicz] whitspaces fix
      df9d98f [Tal Sliwowicz] Code review comments fixed
      f48bce9 [Tal Sliwowicz] In long running contexts, we encountered the situation of double register without a remove in between. The cause for that is unknown, and assumed a temp network issue.
      6b485225
    • Kousuke Saruta's avatar
      [SPARK-4055][MLlib] Inconsistent spelling 'MLlib' and 'MLLib' · f799700e
      Kousuke Saruta authored
      There are some inconsistent spellings, 'MLlib' and 'MLLib', in some documents and source code.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #2903 from sarutak/SPARK-4055 and squashes the following commits:
      
      b031640 [Kousuke Saruta] Fixed inconsistent spelling "MLlib and MLLib"
      f799700e
    • Prashant Sharma's avatar
      [BUILD] Fixed resolver for scalastyle plugin and upgrade sbt version. · d6a30253
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #2877 from ScrapCodes/scalastyle-fix and squashes the following commits:
      
      a17b9fe [Prashant Sharma] [BUILD] Fixed resolver for scalastyle plugin.
      d6a30253
  5. Oct 22, 2014
    • Prashant Sharma's avatar
      [SPARK-3812] [BUILD] Adapt maven build to publish effective pom. · c5882c66
      Prashant Sharma authored
      I tried the maven-help-plugin first, but that published all projects in the top-level pom, so I was left with no choice but to roll my own trivial plugin. This patch basically installs an effective pom after maven install finishes.
      
      The problem it fixes is described as follows:
      If you install using maven
      ` mvn install -DskipTests -Dhadoop.version=2.2.0 -Phadoop-2.2 `
      Then without this patch the published pom(s) will have hadoop version as 1.0.4. This can be a problem at some point.
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #2673 from ScrapCodes/build-changes-effective-pom and squashes the following commits:
      
      aa7b91d [Prashant Sharma] used an unused dep.
      0300dac [Prashant Sharma] improved comment messages..
      28f891e [Prashant Sharma] Added a useless dependency, so that we can shade it. And realized fake shading works for us.
      553d96b [Prashant Sharma] Shaded some unused class of an unused dep, to generate effective pom(s)
      c5882c66
    • zsxwing's avatar
      [SPARK-3877][YARN] Throw an exception when application is not successful so... · 137d9423
      zsxwing authored
      [SPARK-3877][YARN] Throw an exception when the application is not successful so that the exit code will be set to 1
      
      When a YARN application fails (yarn-cluster mode), the exit code of spark-submit is still 0. That makes it hard to write automatic scripts that run Spark jobs on YARN, because the failure cannot be detected from those scripts.
      
      This PR adds a status check after `monitorApplication`. If the application is not successful, `run()` throws a `SparkException`, so that Client.scala exits with code 1. Therefore, people can use the exit code of `spark-submit` in automatic scripts.
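
      A minimal sketch of the pattern described above (assumed names, not the actual Client.scala): after monitoring finishes, a non-successful final status is turned into an exception so the spark-submit JVM exits with a non-zero code.

      ```scala
      object YarnClientSketch {
        sealed trait FinalStatus
        case object Succeeded extends FinalStatus
        case object Failed    extends FinalStatus
        case object Killed    extends FinalStatus

        def monitorApplication(): FinalStatus = Failed   // placeholder for the real YARN polling loop

        def run(): Unit = monitorApplication() match {
          case Succeeded => ()
          case other     => throw new RuntimeException(s"Application finished with failed status: $other")
        }
      }
      ```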
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #2732 from zsxwing/SPARK-3877 and squashes the following commits:
      
      1f89fa5 [zsxwing] Fix the unit test
      a0498e1 [zsxwing] Update the docs and the error message
      e1cb9ef [zsxwing] Fix the hacky way of calling Client
      ff16fec [zsxwing] Remove System.exit in Client.scala and add a test
      6a2c103 [zsxwing] [SPARK-3877] Throw an exception when application is not successful so that the exit code wil be set to 1
      137d9423
    • Josh Rosen's avatar
      [SPARK-3426] Fix sort-based shuffle error when spark.shuffle.compress and... · 813effc7
      Josh Rosen authored
      [SPARK-3426] Fix sort-based shuffle error when spark.shuffle.compress and spark.shuffle.spill.compress settings are different
      
      This PR fixes SPARK-3426, an issue where sort-based shuffle crashes if the
      `spark.shuffle.spill.compress` and `spark.shuffle.compress` settings have
      different values.
      
      The problem is that sort-based shuffle's read and write paths use different
      settings for determining whether to apply compression.  ExternalSorter writes
      runs to files using `TempBlockId` ids, which causes
      `spark.shuffle.spill.compress` to be used for enabling compression, but these
      spilled files end up being shuffled over the network and read as shuffle files
      using `ShuffleBlockId` by BlockStoreShuffleFetcher, which causes
      `spark.shuffle.compress` to be used for enabling decompression.  As a result,
      this leads to errors when these settings disagree.
      
      Based on the discussions in #2247 and #2178, it sounds like we don't want to
      remove the `spark.shuffle.spill.compress` setting.  Therefore, I've tried to
      come up with a fix where `spark.shuffle.spill.compress` is used to compress
      data that's read and written locally and `spark.shuffle.compress` is used to
      compress any data that will be fetched / read as shuffle blocks.
      
      To do this, I split `TempBlockId` into two new id types, `TempLocalBlockId` and
      `TempShuffleBlockId`, which map to `spark.shuffle.spill.compress` and
      `spark.shuffle.compress`, respectively.  ExternalAppendOnlyMap also used temp
      blocks for spilling data.  It looks like ExternalSorter was designed to be
      a generic sorter but its configuration already happens to be tied to sort-based
      shuffle, so I think it's fine if we use `spark.shuffle.compress` to compress
      its spills; we can move the compression configuration to the constructor in
      a later commit if we find that ExternalSorter is being used in other contexts
      where we want different configuration options to control compression.  To
      summarize:
      
      **Before:**
      
      |       | ExternalAppendOnlyMap        | ExternalSorter               |
      |-------|------------------------------|------------------------------|
      | Read  | spark.shuffle.spill.compress | spark.shuffle.compress       |
      | Write | spark.shuffle.spill.compress | spark.shuffle.spill.compress |
      
      **After:**
      
      |       | ExternalAppendOnlyMap        | ExternalSorter         |
      |-------|------------------------------|------------------------|
      | Read  | spark.shuffle.spill.compress | spark.shuffle.compress |
      | Write | spark.shuffle.spill.compress | spark.shuffle.compress |
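
      A hedged sketch of the block-id-to-setting mapping summarized above (simplified id types, not Spark's actual BlockId hierarchy): local-only temp blocks follow spark.shuffle.spill.compress, while anything read as shuffle data follows spark.shuffle.compress.

      ```scala
      sealed trait BlockIdSketch
      case class TempLocalBlockId(name: String)   extends BlockIdSketch  // spilled and read back locally
      case class TempShuffleBlockId(name: String) extends BlockIdSketch  // will be served as shuffle data
      case class ShuffleBlockId(name: String)     extends BlockIdSketch

      def shouldCompress(id: BlockIdSketch, conf: Map[String, String]): Boolean = id match {
        case _: TempLocalBlockId =>
          conf.getOrElse("spark.shuffle.spill.compress", "true").toBoolean
        case _: TempShuffleBlockId | _: ShuffleBlockId =>
          conf.getOrElse("spark.shuffle.compress", "true").toBoolean
      }
      ```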
      
      Thanks to andrewor14 for debugging this with me!
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #2890 from JoshRosen/SPARK-3426 and squashes the following commits:
      
      1921cf6 [Josh Rosen] Minor edit for clarity.
      c8dd8f2 [Josh Rosen] Add comment explaining use of createTempShuffleBlock().
      2c687b9 [Josh Rosen] Fix SPARK-3426.
      91e7e40 [Josh Rosen] Combine tests into single test of all combinations
      76ca65e [Josh Rosen] Add regression test for SPARK-3426.
      813effc7
    • freeman's avatar
      Fix for sampling error in NumPy v1.9 [SPARK-3995][PYSPARK] · 97cf19f6
      freeman authored
      Change maximum value for default seed during RDD sampling so that it is strictly less than 2 ** 32. This prevents a bug in the most recent version of NumPy, which cannot accept random seeds above this bound.
      
      Adds an extra test that uses the default seed (instead of setting it manually, as in the docstrings).
      
      mengxr
      
      Author: freeman <the.freeman.lab@gmail.com>
      
      Closes #2889 from freeman-lab/pyspark-sampling and squashes the following commits:
      
      dc385ef [freeman] Change maximum value for default seed
      97cf19f6
    • CrazyJvm's avatar
      use isRunningLocally rather than runningLocally · f05e09b4
      CrazyJvm authored
      runningLocally is deprecated now
      
      Author: CrazyJvm <crazyjvm@gmail.com>
      
      Closes #2879 from CrazyJvm/runningLocally and squashes the following commits:
      
      bec0b3e [CrazyJvm] use isRunningLocally rather than runningLocally
      f05e09b4
    • Karthik's avatar
      Update JavaCustomReceiver.java · bae4ca3b
      Karthik authored
      Changed the usage string to correctly reflect the file name.
      
      Author: Karthik <karthik.gomadam@gmail.com>
      
      Closes #2699 from namelessnerd/patch-1 and squashes the following commits:
      
      8570e33 [Karthik] Update JavaCustomReceiver.java
      bae4ca3b
  6. Oct 21, 2014
    • Sandy Ryza's avatar
      SPARK-1813. Add a utility to SparkConf that makes using Kryo really easy · 6bb56fae
      Sandy Ryza authored
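
      Roughly, the utility described here lets users register Kryo classes directly on SparkConf instead of hand-writing a KryoRegistrator. A hedged usage sketch (the exact signature is assumed from the commit title and the spark.kryo.classesToRegister entry below):

      ```scala
      import org.apache.spark.SparkConf

      object KryoConfExample {
        case class MyRecord(id: Int, name: String)      // hypothetical user class to register

        val conf = new SparkConf()
          .setAppName("kryo-example")
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
          .registerKryoClasses(Array(classOf[MyRecord]))  // replaces writing a custom KryoRegistrator
      }
      ```
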
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #789 from sryza/sandy-spark-1813 and squashes the following commits:
      
      48b05e9 [Sandy Ryza] Simplify
      b824932 [Sandy Ryza] Allow both spark.kryo.classesToRegister and spark.kryo.registrator at the same time
      6a15bb7 [Sandy Ryza] Small fix
      a2278c0 [Sandy Ryza] Respond to review comments
      6ef592e [Sandy Ryza] SPARK-1813. Add a utility to SparkConf that makes using Kryo really easy
      6bb56fae
    • wangfei's avatar
      [SQL]redundant methods for broadcast · 856b0817
      wangfei authored
      redundant methods for broadcast in ```TableReader```
      
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #2862 from scwf/TableReader and squashes the following commits:
      
      414cc24 [wangfei] unnecessary methods for broadcast
      856b0817