  1. Aug 20, 2014
    • [SPARK-2849] Handle driver configs separately in client mode · b3ec51bf
      Andrew Or authored
      In client deploy mode, the driver is launched from within `SparkSubmit`'s JVM. This means by the time we parse Spark configs from `spark-defaults.conf`, it is already too late to control certain properties of the driver's JVM. We currently ignore these configs in client mode altogether.
      ```
      spark.driver.memory
      spark.driver.extraJavaOptions
      spark.driver.extraClassPath
      spark.driver.extraLibraryPath
      ```
      This PR handles these properties before launching the driver JVM. It achieves this by spawning a separate JVM that runs a new class called `SparkSubmitDriverBootstrapper`, which spawns `SparkSubmit` as a sub-process with the appropriate classpath, library paths, java opts and memory.
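
      A minimal sketch of the bootstrapping idea, with hard-coded stand-ins for the parsed properties (the real class also handles java opts and library paths):

      ```
      import scala.collection.JavaConverters._

      // Sketch only: pick up the driver's JVM settings before any driver JVM
      // exists, then launch SparkSubmit as a child process with them applied.
      object BootstrapSketch {
        def main(args: Array[String]): Unit = {
          val props = Map( // stand-in for parsing spark-defaults.conf
            "spark.driver.memory" -> "2g",
            "spark.driver.extraClassPath" -> "/opt/extra/*")
          val cmd = Seq(
            "java", s"-Xmx${props("spark.driver.memory")}",
            "-cp", props("spark.driver.extraClassPath"),
            "org.apache.spark.deploy.SparkSubmit") ++ args
          val proc = new ProcessBuilder(cmd.asJava).inheritIO().start()
          sys.exit(proc.waitFor()) // propagate the driver's exit code
        }
      }
      ```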
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1845 from andrewor14/handle-configs-bash and squashes the following commits:
      
      bed4bdf [Andrew Or] Change a few comments / messages (minor)
      24dba60 [Andrew Or] Merge branch 'master' of github.com:apache/spark into handle-configs-bash
      08fd788 [Andrew Or] Warn against external usages of SparkSubmitDriverBootstrapper
      ff34728 [Andrew Or] Minor comments
      51aeb01 [Andrew Or] Filter out JVM memory in Scala rather than Bash (minor)
      9a778f6 [Andrew Or] Fix PySpark: actually kill driver on termination
      d0f20db [Andrew Or] Don't pass empty library paths, classpath, java opts etc.
      a78cb26 [Andrew Or] Revert a few changes in utils.sh (minor)
      9ba37e2 [Andrew Or] Don't barf when the properties file does not exist
      8867a09 [Andrew Or] A few more naming things (minor)
      19464ad [Andrew Or] SPARK_SUBMIT_JAVA_OPTS -> SPARK_SUBMIT_OPTS
      d6488f9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into handle-configs-bash
      1ea6bbe [Andrew Or] SparkClassLauncher -> SparkSubmitDriverBootstrapper
      a91ea19 [Andrew Or] Fix precedence of library paths, classpath, java opts and memory
      158f813 [Andrew Or] Remove "client mode" boolean argument
      c84f5c8 [Andrew Or] Remove debug print statement (minor)
      b71f52b [Andrew Or] Revert a few more changes (minor)
      7d94a8d [Andrew Or] Merge branch 'master' of github.com:apache/spark into handle-configs-bash
      3a8235d [Andrew Or] Only parse the properties file if special configs exist
      c37e08d [Andrew Or] Revert a few more changes
      a396eda [Andrew Or] Nullify my own hard work to simplify bash
      0effa1e [Andrew Or] Add code in Scala that handles special configs
      c886568 [Andrew Or] Fix lines too long + a few comments / style (minor)
      7a4190a [Andrew Or] Merge branch 'master' of github.com:apache/spark into handle-configs-bash
      7396be2 [Andrew Or] Explicitly comment that multi-line properties are not supported
      fa11ef8 [Andrew Or] Parse the properties file only if the special configs exist
      371cac4 [Andrew Or] Add function prefix (minor)
      be99eb3 [Andrew Or] Fix tests to not include multi-line configs
      bd0d468 [Andrew Or] Simplify parsing config file by ignoring multi-line arguments
      56ac247 [Andrew Or] Use eval and set to simplify splitting
      8d4614c [Andrew Or] Merge branch 'master' of github.com:apache/spark into handle-configs-bash
      aeb79c7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into handle-configs-bash
      2732ac0 [Andrew Or] Integrate BASH tests into dev/run-tests + log error properly
      8d26a5c [Andrew Or] Add tests for bash/utils.sh
      4ae24c3 [Andrew Or] Fix bug: escape properly in quote_java_property
      b3c4cd5 [Andrew Or] Fix bug: count the number of quotes instead of detecting presence
      c2273fc [Andrew Or] Fix typo (minor)
      e793e5f [Andrew Or] Handle multi-line arguments
      5d8f8c4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-driver-extra
      c7b9926 [Andrew Or] Minor changes to spark-defaults.conf.template
      a992ae2 [Andrew Or] Escape spark.*.extraJavaOptions correctly
      aabfc7e [Andrew Or] escape -> split (minor)
      45a1eb9 [Andrew Or] Fix bug: escape escaped backslashes and quotes properly...
      1cdc6b1 [Andrew Or] Fix bug: escape escaped double quotes properly
      c854859 [Andrew Or] Add small comment
      c13a2cb [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-driver-extra
      8e552b7 [Andrew Or] Include an example of spark.*.extraJavaOptions
      de765c9 [Andrew Or] Print spark-class command properly
      a4df3c4 [Andrew Or] Move parsing and escaping logic to utils.sh
      dec2343 [Andrew Or] Only export variables if they exist
      fa2136e [Andrew Or] Escape Java options + parse java properties files properly
      ef12f74 [Andrew Or] Minor formatting
      4ec22a1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-driver-extra
      e5cfb46 [Andrew Or] Collapse duplicate code + fix potential whitespace issues
      4edcaa8 [Andrew Or] Redirect stdout to stderr for python
      130f295 [Andrew Or] Handle spark.driver.memory too
      98dd8e3 [Andrew Or] Add warning if properties file does not exist
      8843562 [Andrew Or] Fix compilation issues...
      75ee6b4 [Andrew Or] Remove accidentally added file
      63ed2e9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-driver-extra
      0025474 [Andrew Or] Revert SparkSubmit handling of --driver-* options for only cluster mode
      a2ab1b0 [Andrew Or] Parse spark.driver.extra* in bash
      250cb95 [Andrew Or] Do not ignore spark.driver.extra* for client mode
    • [SPARK-3149] Connection establishment information is not enough. · c1ba4cd6
      Kousuke Saruta authored
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #2060 from sarutak/SPARK-3149 and squashes the following commits:
      
      1cc89af [Kousuke Saruta] Modified log message of accepting connection
    • [SPARK-3062] [SPARK-2970] [SQL] spark-sql script ends with IOException when EventLogging is enabled · 0ea46ac8
      Kousuke Saruta authored
      #1891 was meant to avoid an IOException when EventLogging is enabled.
      That solution used ShutdownHookManager, but it is defined only in Hadoop 2.x; Hadoop 1.x doesn't have ShutdownHookManager, so #1891 doesn't compile on Hadoop 1.x.
      
      This is a compromise solution that works on both Hadoop 1.x and 2.x:
      a unique FileSystem object is created for FileLogger alone.
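
      A sketch of one way to get such a private, uncached instance, assuming Hadoop's `fs.<scheme>.impl.disable.cache` setting; the actual patch may differ in detail:

      ```
      import java.net.URI
      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.FileSystem

      object FileLoggerFsSketch {
        // Give FileLogger its own FileSystem instead of the JVM-wide cached
        // one, so a shutdown hook closing the shared instance cannot break it.
        def privateFileSystem(logDir: String, conf: Configuration): FileSystem = {
          val uri = new URI(logDir)
          val scheme = Option(uri.getScheme).getOrElse("file")
          conf.setBoolean(s"fs.$scheme.impl.disable.cache", true)
          FileSystem.get(uri, conf)
        }
      }
      ```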
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #1970 from sarutak/SPARK-2970 and squashes the following commits:
      
      240c91e [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2970
      0e7b45d [Kousuke Saruta] Revert "[SPARK-2970] [SQL] spark-sql script ends with IOException when EventLogging is enabled"
      e1262ec [Kousuke Saruta] Modified Filelogger to use unique FileSystem instance
    • [SPARK-3126][SPARK-3127][SQL] Fixed HiveThriftServer2Suite · cf46e725
      Cheng Lian authored
      This PR fixes two issues:
      
      1. Fixes a wrongly quoted command line option in `HiveThriftServer2Suite` that made test cases hang until timeout.
      2. Asks `dev/run-tests` to run Spark SQL tests when `bin/spark-sql` and/or `sbin/start-thriftserver.sh` are modified.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #2036 from liancheng/fix-thriftserver-test and squashes the following commits:
      
      f38c4eb [Cheng Lian] Fixed the same quotation issue in CliSuite
      26b82a0 [Cheng Lian] Run SQL tests when dff contains bin/spark-sql and/or sbin/start-thriftserver.sh
      a87f83d [Cheng Lian] Extended timeout
      e5aa31a [Cheng Lian] Fixed metastore JDBC URI quotation
    • BUILD: Bump Hadoop versions in the release build. · ceb19830
      Patrick Wendell authored
      Also, minor modifications to the MapR profile.
    • SPARK-3092 [SQL]: Always include the thriftserver when -Phive is enabled. · f2f26c2a
      Patrick Wendell authored
      Currently we have a separate profile called hive-thriftserver. I originally suggested this in case users did not want to bundle the thriftserver, but it has ultimately led to a lot of confusion. Since the thriftserver is only a few classes, I don't see a really good reason to isolate it from the rest of Hive. So let's go ahead and just include it in the same profile to simplify things.
      
      This has been suggested in the past by liancheng.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #2006 from pwendell/hiveserver and squashes the following commits:
      
      742ea40 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into hiveserver
      034ad47 [Patrick Wendell] SPARK-3092: Always include the thriftserver when -Phive is enabled.
    • [SPARK-3054][STREAMING] Add unit tests for Spark Sink. · 8c5a2226
      Hari Shreedharan authored
      This patch adds unit tests for Spark Sink.
      
      It also removes the private[flume] modifier from Spark Sink, since the
      sink is instantiated from Flume configuration (the modifier appears to be
      ignored by the reflection Flume uses, but we should remove it anyway).
      
      Author: Hari Shreedharan <hshreedharan@apache.org>
      Author: Hari Shreedharan <hshreedharan@cloudera.com>
      
      Closes #1958 from harishreedharan/spark-sink-test and squashes the following commits:
      
      e3110b9 [Hari Shreedharan] Add a sleep to allow sink to commit the transactions
      120b81e [Hari Shreedharan] Fix complexity in threading model in test
      4df5be6 [Hari Shreedharan] Merge remote-tracking branch 'asf/master' into spark-sink-test
      c9190d1 [Hari Shreedharan] Indentation and spaces changes
      7fedc5a [Hari Shreedharan] Merge remote-tracking branch 'asf/master' into spark-sink-test
      abc20cb [Hari Shreedharan] Minor test changes
      7b9b649 [Hari Shreedharan] Merge branch 'master' into spark-sink-test
      f2c56c9 [Hari Shreedharan] Update SparkSinkSuite.scala
      a24aac8 [Hari Shreedharan] Remove unused var
      c86d615 [Hari Shreedharan] [SPARK-3054][STREAMING] Add unit tests for Spark Sink.
    • [SPARK-3141] [PySpark] fix sortByKey() with take() · 0a7ef633
      Davies Liu authored
      Fix sortByKey() with take()
      
      The function `f` used in mapPartitions should always return an iterator.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2045 from davies/fix_sortbykey and squashes the following commits:
      
      1160f59 [Davies Liu] fix sortByKey() with take()
    • [DOCS] Fixed wrong links · 8a74e4b2
      Ken Takagiwa authored
      Author: Ken Takagiwa <ugw.gi.world@gmail.com>
      
      Closes #2042 from giwa/patch-1 and squashes the following commits:
      
      216fe0e [Ken Takagiwa] Fixed wrong links
    • [SPARK-2974] [SPARK-2975] Fix two bugs related to spark.local.dirs · ebcb94f7
      Josh Rosen authored
      This PR fixes two bugs related to `spark.local.dirs` and `SPARK_LOCAL_DIRS`, one where `Utils.getLocalDir()` might return an invalid directory (SPARK-2974) and another where the `SPARK_LOCAL_DIRS` override didn't affect the driver, which could cause problems when running tasks in local mode (SPARK-2975).
      
      This patch fixes both issues: the new `Utils.getOrCreateLocalRootDirs(conf: SparkConf)` utility method manages the creation of local directories and handles the precedence among the different configuration options, so we should see the same behavior whether we're running in local mode or on a worker.
      
      It's kind of a pain to mock out environment variables in tests (no easy way to mock System.getenv), so I added a `private[spark]` method to SparkConf for accessing environment variables (by default, it just delegates to System.getenv).  By subclassing SparkConf and overriding this method, we can mock out SPARK_LOCAL_DIRS in tests.
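
      A sketch of that subclassing trick, assuming the new hook is named `getenv`; the package declaration is needed because the method is `private[spark]`:

      ```
      package org.apache.spark

      // Test-only subclass: fake the environment seen by SparkConf so a test
      // can exercise SPARK_LOCAL_DIRS handling without mocking System.getenv.
      class TestSparkConf(env: Map[String, String]) extends SparkConf(loadDefaults = false) {
        override private[spark] def getenv(name: String): String =
          env.getOrElse(name, null)
      }
      // usage: new TestSparkConf(Map("SPARK_LOCAL_DIRS" -> "/tmp/a,/tmp/b"))
      ```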
      
      I also fixed a typo in PySpark where we used `SPARK_LOCAL_DIR` instead of `SPARK_LOCAL_DIRS` (I think this was technically innocuous, but it seemed worth fixing).
      
      Author: Josh Rosen <joshrosen@apache.org>
      
      Closes #2002 from JoshRosen/local-dirs and squashes the following commits:
      
      efad8c6 [Josh Rosen] Address review comments:
      1dec709 [Josh Rosen] Minor updates to Javadocs.
      7f36999 [Josh Rosen] Use env vars to detect if running in YARN container.
      399ac25 [Josh Rosen] Update getLocalDir() documentation.
      bb3ad89 [Josh Rosen] Remove duplicated YARN getLocalDirs() code.
      3e92d44 [Josh Rosen] Move local dirs override logic into Utils; fix bugs:
      b2c4736 [Josh Rosen] Add failing tests for SPARK-2974 and SPARK-2975.
      007298b [Josh Rosen] Allow environment variables to be mocked in tests.
      6d9259b [Josh Rosen] Fix typo in PySpark: SPARK_LOCAL_DIR should be SPARK_LOCAL_DIRS
    • [SPARK-3142][MLLIB] output shuffle data directly in Word2Vec · 0a984aa1
      Xiangrui Meng authored
      Sorry I didn't realize this in #2043. Ishiihara
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2049 from mengxr/more-w2v and squashes the following commits:
      
      050b1c5 [Xiangrui Meng] output shuffle data directly
    • [SPARK-3119] Re-implementation of TorrentBroadcast. · 8adfbc2b
      Reynold Xin authored
      This is a re-implementation of TorrentBroadcast, with the following changes:
      
      1. Removes most of the mutable, transient state from TorrentBroadcast (e.g. totalBytes, num of blocks fetched).
      2. Removes TorrentInfo and TorrentBlock.
      3. Replaces the BlockManager.getSingle call in readObject with a getLocal, resulting in one less RPC call to the BlockManagerMasterActor to find the location of the block.
      4. Removes the metadata block, resulting in one less block to fetch.
      5. Removes an extra memory copy for deserialization (by using Java's SequenceInputStream); see the sketch after the RPC breakdown below.
      
      Basically, for a regular broadcasted object with only one block, the number of RPC calls goes from 5+1 to 2+1.
      
      Old TorrentBroadcast, for an object of a single block:
      1 RPC to ask for location of the broadcast variable
      1 RPC to ask for location of the metadata block
      1 RPC to fetch the metadata block
      1 RPC to ask for location of the first data block
      1 RPC to fetch the first data block
      1 RPC to tell the driver we put the first data block in
      i.e. 5 + 1
      
      New TorrentBroadcast, for an object of a single block:
      1 RPC to ask for location of the first data block
      1 RPC to get the first data block
      1 RPC to tell the driver we put the first data block in
      i.e. 2 + 1
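
      A sketch of the SequenceInputStream trick from change 5; Java serialization stands in for Spark's pluggable serializer, and `blocks` stands in for the chunks fetched from the block manager:

      ```
      import java.io.{ByteArrayInputStream, InputStream, ObjectInputStream, SequenceInputStream}
      import scala.collection.JavaConverters._

      object TorrentDeserializeSketch {
        // Chain the per-block streams together instead of first copying every
        // block into one large contiguous array and deserializing from that.
        def deserializeBlocks[T](blocks: Seq[Array[Byte]]): T = {
          val streams = blocks.iterator
            .map(b => new ByteArrayInputStream(b): InputStream)
            .asJavaEnumeration
          val in = new ObjectInputStream(new SequenceInputStream(streams))
          try in.readObject().asInstanceOf[T] finally in.close()
        }
      }
      ```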
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #2030 from rxin/torrentBroadcast and squashes the following commits:
      
      5bacb9d [Reynold Xin] Always add the object to driver's block manager.
      0d8ed5b [Reynold Xin] Added getBytes to BlockManager and uses that in TorrentBroadcast.
      2d6a5fb [Reynold Xin] Use putBytes/getRemoteBytes throughout.
      3670f00 [Reynold Xin] Code review feedback.
      c1185cd [Reynold Xin] [SPARK-3119] Re-implementation of TorrentBroadcast.
    • [HOTFIX][Streaming][MLlib] use temp folder for checkpoint · fce5c0fb
      Xiangrui Meng authored
      Otherwise Jenkins will complain about the missing Apache header in checkpoint files. tdas rxin
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2046 from mengxr/tmp-checkpoint and squashes the following commits:
      
      0d3ec73 [Xiangrui Meng] remove ssc.stop
      9797843 [Xiangrui Meng] change checkpointDir to lazy val
      89964ab [Xiangrui Meng] use temp folder for checkpoint
  2. Aug 19, 2014
    • [SPARK-3130][MLLIB] detect negative values in naive Bayes · 068b6fe6
      Xiangrui Meng authored
      NB treats feature values as term frequencies, so negative values are invalid input. jkbradley
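
      A sketch of the kind of check involved (names are illustrative):

      ```
      import org.apache.spark.mllib.linalg.Vector

      object NaiveBayesCheckSketch {
        // Multinomial NB reads features as term frequencies, so a negative
        // component can only be a usage error; fail fast with a clear message.
        def requireNonnegative(v: Vector): Unit =
          require(v.toArray.forall(_ >= 0.0),
            s"Naive Bayes requires nonnegative feature values but found $v")
      }
      ```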
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2038 from mengxr/nb-neg and squashes the following commits:
      
      52c37c3 [Xiangrui Meng] address comments
      65f892d [Xiangrui Meng] detect negative values in nb
    • [SQL] add note of use synchronizedMap in SQLConf · 0e3ab94d
      wangfei authored
      Refer to:
      http://stackoverflow.com/questions/510632/whats-the-difference-between-concurrenthashmap-and-collections-synchronizedmap
      Collections.synchronizedMap(map) creates a blocking Map, which degrades performance while ensuring consistency, so use ConcurrentHashMap (a more efficient thread-safe hashmap) instead.
      
      This also updates HiveQuerySuite to fix a test error triggered by the change to ConcurrentHashMap.
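
      A sketch of the difference, with names loosely mirroring SQLConf:

      ```
      import java.util.concurrent.ConcurrentHashMap

      object SQLConfSketch {
        // ConcurrentHashMap lets many readers proceed without a global lock;
        // Collections.synchronizedMap funnels every call through one monitor.
        private val settings = new ConcurrentHashMap[String, String]()
        def set(key: String, value: String): Unit = settings.put(key, value)
        def get(key: String): Option[String] = Option(settings.get(key))
      }
      ```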
      
      Author: wangfei <wangfei_hello@126.com>
      Author: scwf <wangfei1@huawei.com>
      
      Closes #1996 from scwf/sqlconf and squashes the following commits:
      
      93bc0c5 [wangfei] revert change of HiveQuerySuite
      0cc05dd [wangfei] add note for use synchronizedMap
      3c224d31 [scwf] fix formate
      a7bcb98 [scwf] use ConcurrentHashMap in sql conf, intead synchronizedMap
    • [SPARK-3112][MLLIB] Add documentation and example for StreamingLR · c7252b00
      freeman authored
      Added a documentation section on StreamingLR to the ``MLlib - Linear Methods`` guide, including a worked example.
      
      mengxr tdas
      
      Author: freeman <the.freeman.lab@gmail.com>
      
      Closes #2047 from freeman-lab/streaming-lr-docs and squashes the following commits:
      
      568d250 [freeman] Tweaks to wording / formatting
      05a1139 [freeman] Added documentation and example for StreamingLR
    • [MLLIB] minor update to word2vec · 1870dbaa
      Xiangrui Meng authored
      very minor update Ishiihara
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2043 from mengxr/minor-w2v and squashes the following commits:
      
      be649fd [Xiangrui Meng] remove map because we only need append
      eccefcc [Xiangrui Meng] minor updates to word2vec
    • [SPARK-2468] Netty based block server / client module · 8b9dc991
      Reynold Xin authored
      Previous pull request (#1907) was reverted. This brings it back. Still looking into the hang.
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1971 from rxin/netty1 and squashes the following commits:
      
      b0be96f [Reynold Xin] Added test to make sure outstandingRequests are cleaned after firing the events.
      4c6d0ee [Reynold Xin] Pass callbacks cleanly.
      603dce7 [Reynold Xin] Upgrade Netty to 4.0.23 to fix the DefaultFileRegion bug.
      88be1d4 [Reynold Xin] Downgrade to 4.0.21 to work around a bug in writing DefaultFileRegion.
      002626a [Reynold Xin] Remove netty-test-file.txt.
      db6e6e0 [Reynold Xin] Revert "Revert "[SPARK-2468] Netty based block server / client module""
    • [SPARK-3136][MLLIB] Create Java-friendly methods in RandomRDDs · 825d4fe4
      Xiangrui Meng authored
      Though we don't use default arguments for the methods in RandomRDDs, they are still not easy to call from Java because the output type is either `RDD[Double]` or `RDD[Vector]`, where Java users would expect `JavaDoubleRDD` and `JavaRDD[Vector]`, respectively. We should create dedicated methods for Java users, and allow default arguments in the Scala methods in RandomRDDs, to make life easier for both Java and Scala users. This PR also contains documentation for random data generation. brkyvz
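
      A sketch of the wrapping pattern (the method name and omitted parameters are illustrative, not the exact API added here):

      ```
      import org.apache.spark.api.java.{JavaDoubleRDD, JavaSparkContext}
      import org.apache.spark.mllib.random.RandomRDDs

      object RandomRDDsJavaSketch {
        // Java callers get a JavaDoubleRDD instead of the Scala RDD[Double].
        def normalJavaRDD(jsc: JavaSparkContext, size: Long): JavaDoubleRDD =
          JavaDoubleRDD.fromRDD(RandomRDDs.normalRDD(jsc.sc, size))
      }
      ```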
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2041 from mengxr/stat-doc and squashes the following commits:
      
      fc5eedf [Xiangrui Meng] add missing comma
      ffde810 [Xiangrui Meng] address comments
      aef6d07 [Xiangrui Meng] add doc for random data generation
      b99d94b [Xiangrui Meng] add java-friendly methods to RandomRDDs
    • [SPARK-2790] [PySpark] fix zip with serializers which have different batch sizes. · d7e80c25
      Davies Liu authored
      If two RDDs have different batch sizes in their serializers, this re-serializes the one with the smaller batch size before calling RDD.zip() in Spark.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1894 from davies/zip and squashes the following commits:
      
      c4652ea [Davies Liu] add more test cases
      6d05fc8 [Davies Liu] Merge branch 'master' into zip
      813b1e4 [Davies Liu] add more tests for failed cases
      a4aafda [Davies Liu] fix zip with serializers which have different batch sizes.
    • Move a bracket in validateSettings of SparkConf · 76eaeb45
      hzw19900416 authored
      Move a bracket in validateSettings of SparkConf
      
      Author: hzw19900416 <carlmartinmax@gmail.com>
      
      Closes #2012 from hzw19900416/codereading and squashes the following commits:
      
      e717fb6 [hzw19900416] Move a bracket in validateSettings of SparkConf
    • SPARK-2333 - spark_ec2 script should allow option for existing security group · 94053a7b
      Vida Ha authored
          - Uses the name tag to identify machines in a cluster.
          - Allows overriding the security group name so it doesn't need to coincide with the cluster name.
          - Outputs the request IDs of up to 10 pending spot instance requests.
      
      Author: Vida Ha <vida@databricks.com>
      
      Closes #1899 from vidaha/vida/ec2-reuse-security-group and squashes the following commits:
      
      c80d5c3 [Vida Ha] wrap retries in a try catch block
      b2989d5 [Vida Ha] SPARK-2333: spark_ec2 script should allow option for existing security group
    • [SPARK-3128][MLLIB] Use streaming test suite for StreamingLR · 31f0b071
      freeman authored
      Refactored tests for streaming linear regression to use existing  streaming test utilities. Summary of changes:
      - Made ``mllib`` depend on tests from ``streaming``
      - Rewrote accuracy and convergence tests to use ``setupStreams`` and ``runStreams``
      - Added a new test for the accuracy of predictions generated by ``predictOnValues``
      
      These tests should run faster, be easier to extend/maintain, and provide a reference for new tests.
      
      mengxr tdas
      
      Author: freeman <the.freeman.lab@gmail.com>
      
      Closes #2037 from freeman-lab/streamingLR-predict-tests and squashes the following commits:
      
      e851ca7 [freeman] Fixed long lines
      50eb0bf [freeman] Refactored tests to use streaming test tools
      32c43c2 [freeman] Added test for prediction
    • [SPARK-3089] Fix meaningless error message in ConnectionManager · cbfc26ba
      Kousuke Saruta authored
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #2000 from sarutak/SPARK-3089 and squashes the following commits:
      
      02dfdea [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3089
      e759ce7 [Kousuke Saruta] Improved error message when closing SendingConnection
    • [SPARK-3072] YARN - Exit when reach max number failed executors · 7eb9cbc2
      Thomas Graves authored
      In some cases on Hadoop 2.x the Spark application master doesn't properly exit, and hangs around for 10 minutes after it's really done. We should make sure it exits properly and stops the driver.
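
      A sketch of the guard involved; all names are illustrative, and the real change wires this into the YARN application master:

      ```
      import java.util.concurrent.atomic.AtomicInteger

      // Count executor failures and shut down once the configured maximum is
      // reached, instead of lingering after the job is effectively dead.
      class ExecutorFailureGuard(maxExecutorFailures: Int) {
        private val failures = new AtomicInteger(0)
        def onExecutorFailure(): Unit =
          if (failures.incrementAndGet() >= maxExecutorFailures) {
            System.err.println(s"Reached max executor failures ($maxExecutorFailures); exiting")
            sys.exit(1) // the real code reports FAILED and stops the driver
          }
      }
      ```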
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #2022 from tgravescs/SPARK-3072 and squashes the following commits:
      
      665701d [Thomas Graves] Exit when reach max number failed executors
  3. Aug 18, 2014
    • Fix typo in decision tree docs · cd0720ca
      Matt Forbes authored
      Candidate splits were inconsistent with the example.
      
      Author: Matt Forbes <matt@tellapart.com>
      
      Closes #1837 from emef/tree-doc and squashes the following commits:
      
      3be14a1 [Matt Forbes] Fix typo in decision tree docs
    • [SPARK-3116] Remove the excessive lockings in TorrentBroadcast · 82577339
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #2028 from rxin/torrentBroadcast and squashes the following commits:
      
      92c62a5 [Reynold Xin] Revert the MEMORY_AND_DISK_SER changes.
      03a5221 [Reynold Xin] [SPARK-3116] Remove the excessive lockings in TorrentBroadcast
    • [SPARK-3114] [PySpark] Fix Python UDFs in Spark SQL. · 1f1819b2
      Josh Rosen authored
      This fixes SPARK-3114, an issue where we inadvertently broke Python UDFs in Spark SQL.
      
      This PR modifies the test runner script to always run the PySpark SQL tests, irrespective of whether SparkSQL itself has been modified. It also includes Davies' fix for the bug.
      
      Closes #2026.
      
      Author: Josh Rosen <joshrosen@apache.org>
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2027 from JoshRosen/pyspark-sql-fix and squashes the following commits:
      
      9af2708 [Davies Liu] bugfix: disable compression of command
      0d8d3a4 [Josh Rosen] Always run Python Spark SQL tests.
    • [SPARK-3108][MLLIB] add predictOnValues to StreamingLR and fix predictOn · 217b5e91
      Xiangrui Meng authored
      It is useful in streaming to allow users to carry extra data with the prediction, for monitoring the prediction error for example. freeman-lab
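
      A sketch of the intended usage, pairing each prediction with a caller-supplied key (here the true label); stream construction is omitted:

      ```
      import org.apache.spark.mllib.linalg.Vector
      import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
      import org.apache.spark.streaming.dstream.DStream

      object PredictOnValuesSketch {
        // Input carries (label, features); output carries (label, prediction),
        // so prediction error can be monitored batch by batch.
        def monitored(model: StreamingLinearRegressionWithSGD,
                      data: DStream[(Double, Vector)]): DStream[(Double, Double)] =
          model.predictOnValues(data)
      }
      ```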
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2023 from mengxr/predict-on-values and squashes the following commits:
      
      cac47b8 [Xiangrui Meng] add classtag
      2821b3b [Xiangrui Meng] use mapValues
      0925efa [Xiangrui Meng] add predictOnValues to StreamingLR and fix predictOn
    • [SPARK-2850] [SPARK-2626] [mllib] MLlib stats examples + small fixes · c8b16ca0
      Joseph K. Bradley authored
      Added examples for statistical summarization:
      * Scala: StatisticalSummary.scala
      ** Tests: correlation, MultivariateOnlineSummarizer
      * python: statistical_summary.py
      ** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)
      
      Added examples for random and sampled RDDs:
      * Scala: RandomAndSampledRDDs.scala
      * python: random_and_sampled_rdds.py
      * Both test:
      ** RandomRDDGenerators.normalRDD, normalVectorRDD
      ** RDD.sample, takeSample, sampleByKey
      
      Added sc.stop() to all examples.
      
      CorrelationSuite.scala
      * Added 1 test for RDDs with only 1 value
      
      RowMatrix.scala
      * numCols(): Added check for numRows = 0, with error message.
      * computeCovariance(): Added check for numRows <= 1, with error message.
      
      Python SparseVector (pyspark/mllib/linalg.py)
      * Added toDense() function
      
      python/run-tests script
      * Added stat.py (doc test)
      
      CC: mengxr dorx. The main changes were examples to show usage across APIs.
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #1878 from jkbradley/mllib-stats-api-check and squashes the following commits:
      
      ea5c047 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
      dafebe2 [Joseph K. Bradley] Bug fixes for examples SampledRDDs.scala and sampled_rdds.py: Check for division by 0 and for missing key in maps.
      8d1e555 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
      60c72d9 [Joseph K. Bradley] Fixed stat.py doc test to work for Python versions printing nan or NaN.
      b20d90a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
      4e5d15e [Joseph K. Bradley] Changed pyspark/mllib/stat.py doc tests to use NaN instead of nan.
      32173b7 [Joseph K. Bradley] Stats examples update.
      c8c20dc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
      cf70b07 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
      0b7cec3 [Joseph K. Bradley] Small updates based on code review.  Renamed statistical_summary.py to correlations.py
      ab48f6e [Joseph K. Bradley] RowMatrix.scala * numCols(): Added check for numRows = 0, with error message. * computeCovariance(): Added check for numRows <= 1, with error message.
      65e4ebc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
      8195c78 [Joseph K. Bradley] Added examples for random and sampled RDDs: * Scala: RandomAndSampledRDDs.scala * python: random_and_sampled_rdds.py * Both test: ** RandomRDDGenerators.normalRDD, normalVectorRDD ** RDD.sample, takeSample, sampleByKey
      064985b [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
      ee918e9 [Joseph K. Bradley] Added examples for statistical summarization: * Scala: StatisticalSummary.scala ** Tests: correlation, MultivariateOnlineSummarizer * python: statistical_summary.py ** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)
    • [mllib] DecisionTree: treeAggregate + Python example bug fix · 115eeb30
      Joseph K. Bradley authored
      Small DecisionTree updates:
      * Changed the main DecisionTree aggregate to treeAggregate (see the sketch after this list).
      * Fixed a bug in the python example decision_tree_runner.py with a missing argument (since categoricalFeaturesInfo is no longer an optional argument for trainClassifier).
      * Fixed the same bug in the python doc tests, and added tree.py to the doc tests.
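
      A sketch of the aggregate-to-treeAggregate swap, assuming the MLlib implicit that adds treeAggregate to RDDs; a toy sum stands in for DecisionTree's statistics aggregation:

      ```
      import org.apache.spark.mllib.rdd.RDDFunctions._
      import org.apache.spark.rdd.RDD

      object TreeAggregateSketch {
        // Combine partial aggregates through a tree of intermediate reducers
        // instead of sending every partition's result straight to the driver.
        def sum(data: RDD[Double]): Double =
          data.treeAggregate(0.0)(_ + _, _ + _)
      }
      ```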
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #2015 from jkbradley/dt-opt2 and squashes the following commits:
      
      b5114fa [Joseph K. Bradley] Fixed python tree.py doc test (extra newline)
      8e4665d [Joseph K. Bradley] Added tree.py to python doc tests.  Fixed bug from missing categoricalFeaturesInfo argument.
      b7b2922 [Joseph K. Bradley] Fixed bug in python example decision_tree_runner.py with missing argument.  Changed main DecisionTree aggregate to treeAggregate.
      85bbc1f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt2
      66d076f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt2
      a0ed0da [Joseph K. Bradley] Renamed DTMetadata to DecisionTreeMetadata.  Small doc updates.
      3726d20 [Joseph K. Bradley] Small code improvements based on code review.
      ac0b9f8 [Joseph K. Bradley] Small updates based on code review. Main change: Now using << instead of math.pow.
      db0d773 [Joseph K. Bradley] scala style fix
      6a38f48 [Joseph K. Bradley] Added DTMetadata class for cleaner code
      931a3a7 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt2
      797f68a [Joseph K. Bradley] Fixed DecisionTreeSuite bug for training second level.  Needed to update treePointToNodeIndex with groupShift.
      f40381c [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
      5f2dec2 [Joseph K. Bradley] Fixed scalastyle issue in TreePoint
      6b5651e [Joseph K. Bradley] Updates based on code review.  1 major change: persisting to memory + disk, not just memory.
      2d2aaaf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
      26d10dd [Joseph K. Bradley] Removed tree/model/Filter.scala since no longer used.  Removed debugging println calls in DecisionTree.scala.
      356daba [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
      430d782 [Joseph K. Bradley] Added more debug info on binning error.  Added some docs.
      d036089 [Joseph K. Bradley] Print timing info to logDebug.
      e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods private
      8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own file, and cleaned it up.  Removed debugging println calls from DecisionTree.  Made TreePoint extend Serialiable
      a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
      c1565a5 [Joseph K. Bradley] Small DecisionTree updates: * Simplification: Updated calculateGainForSplit to take aggregates for a single (feature, split) pair. * Internal doc: findAggForOrderedFeatureClassification
      b914f3b [Joseph K. Bradley] DecisionTree optimization: eliminated filters + small changes
      b2ed1f3 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt
      0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree
      3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging)
      f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
      bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
      511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
      a95bc22 [Joseph K. Bradley] timing for DecisionTree internals
    • [SPARK-2718] [yarn] Handle quotes and other characters in user args. · 6201b276
      Marcelo Vanzin authored
      Due to the way Yarn runs things through bash, normal quoting doesn't
      work as expected. This change applies the necessary voodoo to the user
      args to avoid issues with bash and special characters.
      
      The change also uncovered an issue with the event logger app name
      sanitizing code; it wasn't cleaning up all "bad" characters, so
      sometimes it would fail to create the log dirs. I just added some
      more bad character replacements.
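
      A sketch of the flavor of escaping involved; this mirrors the idea, not the patch's exact rules:

      ```
      object ShellEscapeSketch {
        // Wrap the arg in single quotes and escape any embedded single quotes,
        // so bash passes the value through verbatim, special characters and all.
        def escapeForShell(arg: String): String =
          "'" + arg.replace("'", "'\\''") + "'"
        // escapeForShell("""echo "hi" $USER""") returns 'echo "hi" $USER'
      }
      ```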
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #1724 from vanzin/SPARK-2718 and squashes the following commits:
      
      cc84b89 [Marcelo Vanzin] Review feedback.
      c1a257a [Marcelo Vanzin] Add test for backslashes.
      55571d4 [Marcelo Vanzin] Unbreak yarn-client.
      515613d [Marcelo Vanzin] [SPARK-2718] [yarn] Handle quotes and other characters in user args.
    • [SPARK-3103] [PySpark] fix saveAsTextFile() with utf-8 · d1d0ee41
      Davies Liu authored
      Bugfix: it raised an exception when trying to encode non-ASCII strings into unicode. It should only encode unicode strings as "utf-8".
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2018 from davies/fix_utf8 and squashes the following commits:
      
      4db7967 [Davies Liu] fix saveAsTextFile() with utf-8
    • Reynold Xin · 3a5962f0
    • [SPARK-2169] Don't copy appName / basePath everywhere. · 66ade00f
      Marcelo Vanzin authored
      Instead of keeping copies in all pages, just reference the values
      kept in the base SparkUI instance (by making them available via
      getters).
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #1252 from vanzin/SPARK-2169 and squashes the following commits:
      
      4412fc6 [Marcelo Vanzin] Simplify UIUtils.headerSparkPage signature.
      4e5d35a [Marcelo Vanzin] [SPARK-2169] Don't copy appName / basePath everywhere.
    • [SPARK-2406][SQL] Initial support for using ParquetTableScan to read HiveMetaStore tables. · 3abd0c1c
      Michael Armbrust authored
      This PR adds an experimental flag `spark.sql.hive.convertMetastoreParquet` that, when true, causes the planner to detect tables that use Hive's Parquet SerDe and instead plan them using Spark SQL's native `ParquetTableScan`.
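
      A usage sketch for the flag, assuming it defaults to off while experimental; the table name is illustrative:

      ```
      import org.apache.spark.SparkContext
      import org.apache.spark.sql.hive.HiveContext

      object ParquetConversionSketch {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext("local", "parquet-conversion")
          val hive = new HiveContext(sc)
          hive.setConf("spark.sql.hive.convertMetastoreParquet", "true")
          // Scans of Parquet-backed metastore tables now use ParquetTableScan.
          hive.sql("SELECT COUNT(*) FROM some_parquet_table").collect().foreach(println)
          sc.stop()
        }
      }
      ```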
      
      Author: Michael Armbrust <michael@databricks.com>
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #1819 from marmbrus/parquetMetastore and squashes the following commits:
      
      1620079 [Michael Armbrust] Revert "remove hive parquet bundle"
      cc30430 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into parquetMetastore
      4f3d54f [Michael Armbrust] fix style
      41ebc5f [Michael Armbrust] remove hive parquet bundle
      a43e0da [Michael Armbrust] Merge remote-tracking branch 'origin/master' into parquetMetastore
      4c4dc19 [Michael Armbrust] Fix bug with tree splicing.
      ebb267e [Michael Armbrust] include parquet hive to tests pass (Remove this later).
      c0d9b72 [Michael Armbrust] Avoid creating a HadoopRDD per partition.  Add dirty hacks to retrieve partition values from the InputSplit.
      8cdc93c [Michael Armbrust] Merge pull request #8 from yhuai/parquetMetastore
      a0baec7 [Yin Huai] Partitioning columns can be resolved.
      1161338 [Michael Armbrust] Add a test to make sure conversion is actually happening
      212d5cd [Michael Armbrust] Initial support for using ParquetTableScan to read HiveMetaStore tables.
    • [SPARK-3091] [SQL] Add support for caching metadata on Parquet files · 9eb74c7d
      Matei Zaharia authored
      For larger Parquet files, reading the file footers (which is done in parallel on up to 5 threads) and HDFS block locations (which is serial) can take multiple seconds. We can add an option to cache this data within FilteringParquetInputFormat. Unfortunately ParquetInputFormat only caches footers within each instance of ParquetInputFormat, not across them.
      
      Note: this PR leaves this turned off by default for 1.1, but I believe it's safe to turn on afterward. The keys in the hash maps are FileStatus objects that include a modification time, so this will work fine if files are modified. The location cache could become invalid if files have moved within HDFS, but that's rare, so I just made it invalidate entries every 15 minutes.
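
      A sketch of the caching scheme (field names are illustrative):

      ```
      import java.util.concurrent.TimeUnit
      import com.google.common.cache.{Cache, CacheBuilder}
      import org.apache.hadoop.fs.{BlockLocation, FileStatus}

      object FooterCacheSketch {
        // Keyed by FileStatus, which embeds a modification time, so modified
        // files miss the cache; locations expire in case files move in HDFS.
        val blockLocations: Cache[FileStatus, Array[BlockLocation]] =
          CacheBuilder.newBuilder()
            .expireAfterWrite(15, TimeUnit.MINUTES)
            .build[FileStatus, Array[BlockLocation]]()
      }
      ```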
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #2005 from mateiz/parquet-cache and squashes the following commits:
      
      dae8efe [Matei Zaharia] Bug fix
      c71e9ed [Matei Zaharia] Handle empty statuses directly
      22072b0 [Matei Zaharia] Use Guava caches and add a config option for caching metadata
      8fb56ce [Matei Zaharia] Cache file block locations too
      453bd21 [Matei Zaharia] Bug fix
      4094df6 [Matei Zaharia] First attempt at caching Parquet footers
    • SPARK-3025 [SQL]: Allow JDBC clients to set a fair scheduler pool · 6bca8898
      Patrick Wendell authored
      This definitely needs review as I am not familiar with this part of Spark.
      I tested this locally and it did seem to work.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #1937 from pwendell/scheduler and squashes the following commits:
      
      b858e33 [Patrick Wendell] SPARK-3025: Allow JDBC clients to set a fair scheduler pool
    • [SPARK-3085] [SQL] Use compact data structures in SQL joins · 4bf3de71
      Matei Zaharia authored
      This reuses the CompactBuffer from Spark Core to save memory and pointer
      dereferences. I also tried AppendOnlyMap instead of java.util.HashMap,
      but unfortunately that slows things down, because it seems to make more
      equals() calls, and equals() on GenericRow, and especially on JoinedRow,
      is pretty expensive.
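
      A sketch of the build side being optimized; CompactBuffer is package-private in Spark, so ArrayBuffer stands in here (CompactBuffer's win is the zero/one-element common case):

      ```
      import java.util.{HashMap => JHashMap}
      import scala.collection.mutable.ArrayBuffer

      object JoinBuildSketch {
        type Row = Seq[Any]
        // Group build-side rows by join key; most keys see exactly one row,
        // which is where a compact, array-backed buffer saves memory.
        def buildHashTable(rows: Iterator[Row], key: Row => Any): JHashMap[Any, ArrayBuffer[Row]] = {
          val table = new JHashMap[Any, ArrayBuffer[Row]]()
          rows.foreach { row =>
            val k = key(row)
            var buf = table.get(k)
            if (buf == null) { buf = new ArrayBuffer[Row](1); table.put(k, buf) }
            buf += row
          }
          table
        }
      }
      ```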
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #1993 from mateiz/spark-3085 and squashes the following commits:
      
      188221e [Matei Zaharia] Remove unneeded import
      5f903ee [Matei Zaharia] [SPARK-3085] [SQL] Use compact data structures in SQL joins
    • [SPARK-3084] [SQL] Collect broadcasted tables in parallel in joins · 6a13dca1
      Matei Zaharia authored
      BroadcastHashJoin has a broadcastFuture variable that tries to collect
      the broadcasted table in a separate thread, but this doesn't help
      because it's a lazy val that only gets initialized when you attempt to
      build the RDD. Thus queries that broadcast multiple tables would collect
      and broadcast them sequentially. I changed this to a val to let it start
      collecting right when the operator is created.
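
      The one-word change, sketched (the collect callback is a stand-in for building the broadcast relation):

      ```
      import scala.concurrent.ExecutionContext.Implicits.global
      import scala.concurrent.Future

      class BroadcastSideSketch(collectTable: () => Array[AnyRef]) {
        // Was `lazy val`: the collect waited until execution first touched the
        // future, so multiple broadcast joins collected one after another.
        // As a plain `val`, collection starts when the operator is constructed.
        val broadcastFuture: Future[Array[AnyRef]] = Future { collectTable() }
      }
      ```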
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #1990 from mateiz/spark-3084 and squashes the following commits:
      
      f468766 [Matei Zaharia] [SPARK-3084] Collect broadcasted tables in parallel in joins