  1. Aug 18, 2014
    • [SPARK-3085] [SQL] Use compact data structures in SQL joins · 4bf3de71
      Matei Zaharia authored
      This reuses the CompactBuffer from Spark Core to save memory and pointer
      dereferences. I also tried AppendOnlyMap instead of java.util.HashMap,
      but unfortunately that slows things down because it seems to make more
      equals() calls, and equals() on GenericRow, and especially on JoinedRow,
      is pretty expensive.
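      A hedged sketch of the build side this describes, with ArrayBuffer standing in for the Spark-internal CompactBuffer and a String join key standing in for a row key:

      ```
      import java.util.{HashMap => JHashMap}
      import scala.collection.mutable.ArrayBuffer

      object HashJoinSketch {
        // Group build-side rows by join key. CompactBuffer plays ArrayBuffer's
        // role in the real patch, but specializes the common 1-2 element case
        // to avoid an extra array allocation and pointer chase per key.
        def buildHashTable(rows: Iterator[(String, Seq[Any])]): JHashMap[String, ArrayBuffer[Seq[Any]]] = {
          val table = new JHashMap[String, ArrayBuffer[Seq[Any]]]()
          while (rows.hasNext) {
            val (key, row) = rows.next()
            var matches = table.get(key)
            if (matches == null) {
              matches = new ArrayBuffer[Seq[Any]](1) // most keys match only a few rows
              table.put(key, matches)
            }
            matches += row
          }
          table
        }
      }
      ```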
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #1993 from mateiz/spark-3085 and squashes the following commits:
      
      188221e [Matei Zaharia] Remove unneeded import
      5f903ee [Matei Zaharia] [SPARK-3085] [SQL] Use compact data structures in SQL joins
    • [SPARK-3084] [SQL] Collect broadcasted tables in parallel in joins · 6a13dca1
      Matei Zaharia authored
      BroadcastHashJoin has a broadcastFuture variable that tries to collect
      the broadcasted table in a separate thread, but this doesn't help
      because it's a lazy val that only gets initialized when you attempt to
      build the RDD. Thus queries that broadcast multiple tables would collect
      and broadcast them sequentially. I changed this to a val to let it start
      collecting right when the operator is created.
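      A minimal sketch of the one-keyword fix (names simplified; `collectTable` stands in for collecting the broadcast side):

      ```
      import scala.concurrent.{ExecutionContext, Future}
      import ExecutionContext.Implicits.global

      class BroadcastJoinSketch(collectTable: () => AnyRef) {
        // Before: `lazy val broadcastFuture = Future { collectTable() }`, which
        // deferred collection until the RDD was built, serializing multiple
        // broadcast joins. A plain `val` starts collecting as soon as the
        // operator is constructed.
        val broadcastFuture: Future[AnyRef] = Future { collectTable() }
      }
      ```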
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #1990 from mateiz/spark-3084 and squashes the following commits:
      
      f468766 [Matei Zaharia] [SPARK-3084] Collect broadcasted tables in parallel in joins
    • SPARK-3096: Include parquet hive serde by default in build · 7ae28d12
      Patrick Wendell authored
      A small change - we should just add this dependency. It doesn't have any recursive deps and it's needed for reading Hive Parquet tables.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #2009 from pwendell/parquet and squashes the following commits:
      
      e411f9f [Patrick Wendell] SPARK-3096: Include parquet hive serde by default in build
    • [SPARK-2862] histogram method fails on some choices of bucketCount · f45efbb8
      Chandan Kumar authored
      Author: Chandan Kumar <chandan.kumar@imaginea.com>
      
      Closes #1787 from nrchandan/spark-2862 and squashes the following commits:
      
      a76bbf6 [Chandan Kumar] [SPARK-2862] Fix for a broken test case and add new test cases
      4211eea [Chandan Kumar] [SPARK-2862] Add Scala bug id
      13854f1 [Chandan Kumar] [SPARK-2862] Use shorthand range notation to avoid Scala bug
    • SPARK-3093 : masterLock in Worker is no longer needed · c0cbbdea
      CrazyJvm authored
      there's no need to use masterLock in Worker now, since all communication happens within the Akka actor
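      A hedged illustration of why the lock is redundant (message shapes invented for the sketch):

      ```
      import akka.actor.{Actor, ActorRef}

      class WorkerSketch extends Actor {
        // An actor processes one message at a time, so this field -- previously
        // guarded by masterLock -- is only ever touched from a single thread.
        private var master: Option[ActorRef] = None

        def receive = {
          case ("RegisteredWorker", ref: ActorRef) => master = Some(ref)
          case "SendHeartbeat"                     => master.foreach(_ ! "Heartbeat")
        }
      }
      ```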
      
      Author: CrazyJvm <crazyjvm@gmail.com>
      
      Closes #2008 from CrazyJvm/no-need-master-lock and squashes the following commits:
      
      dd39e20 [CrazyJvm] fix format
      58e7fa5 [CrazyJvm] there's no need to use masterLock now since all communications are within Akka actor
    • [MLlib] Remove transform(dataset: RDD[String]) from Word2Vec public API · 9306b8c6
      Liquan Pei authored
      mengxr
      Remove  transform(dataset: RDD[String]) from public API.
      
      Author: Liquan Pei <liquanpei@gmail.com>
      
      Closes #2010 from Ishiihara/Word2Vec-api and squashes the following commits:
      
      17b1031 [Liquan Pei] remove transform(dataset: RDD[String]) from public API
    • [SPARK-2842][MLlib] Word2Vec documentation · eef779b8
      Liquan Pei authored
      mengxr
      Documentation for Word2Vec
      
      Author: Liquan Pei <liquanpei@gmail.com>
      
      Closes #2003 from Ishiihara/Word2Vec-doc and squashes the following commits:
      
      4ff11d4 [Liquan Pei] minor fix
      8d7458f [Liquan Pei] code reformat
      6df0dcb [Liquan Pei] add Word2Vec documentation
    • [SPARK-3097][MLlib] Word2Vec performance improvement · 3c8fa505
      Liquan Pei authored
      mengxr Please review the code. Adding weights in reduceByKey soon.
      
      Only output model entries for words that appeared in the partition before merging, and use reduceByKey to combine the models. In general, this implementation is 30s or so faster than the implementation using a big array.
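      A sketch of the merge step under the stated approach (per-partition output restricted to touched words, vectors summed per word index; Spark 1.x needs the SparkContext._ import for reduceByKey):

      ```
      import org.apache.spark.SparkContext._
      import org.apache.spark.rdd.RDD

      object Word2VecMergeSketch {
        // partials: one (wordIndex, partialVector) pair per word a partition
        // actually updated -- sparse output instead of a dense per-partition array.
        def mergeModels(partials: RDD[(Int, Array[Float])]): RDD[(Int, Array[Float])] =
          partials.reduceByKey { (a, b) =>
            var i = 0
            while (i < a.length) { a(i) += b(i); i += 1 } // element-wise sum
            a
          }
      }
      ```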
      
      Author: Liquan Pei <liquanpei@gmail.com>
      
      Closes #1932 from Ishiihara/Word2Vec-improve2 and squashes the following commits:
      
      d5377a9 [Liquan Pei] use syn0Global and syn1Global to represent model
      cad2011 [Liquan Pei] bug fix for synModify array out of bound
      083aa66 [Liquan Pei] update synGlobal in place and reduce synOut size
      9075e1c [Liquan Pei] combine syn0Global and syn1Global to synGlobal
      aa2ab36 [Liquan Pei] use reduceByKey to combine models
    • SPARK-2900. aggregate inputBytes per stage · df652ea0
      Sandy Ryza authored
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #1826 from sryza/sandy-spark-2900 and squashes the following commits:
      
      43f9091 [Sandy Ryza] SPARK-2900
  2. Aug 17, 2014
    • [SPARK-3087][MLLIB] fix col indexing bug in chi-square and add a check for number of distinct values · c77f4066
      Xiangrui Meng authored
      
      There was a bug in determining the column index. dorx
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1997 from mengxr/chisq-index and squashes the following commits:
      
      8fc2ab2 [Xiangrui Meng] fix col indexing bug and add a check for number of distinct values
    • [HOTFIX][STREAMING] Allow the JVM/Netty to decide which port to bind to in Flume Polling Tests. · 95470a03
      Hari Shreedharan authored
      Author: Hari Shreedharan <harishreedharan@gmail.com>
      
      Closes #1820 from harishreedharan/use-free-ports and squashes the following commits:
      
      b939067 [Hari Shreedharan] Remove unused import.
      67856a8 [Hari Shreedharan] Remove findFreePort.
      0ea51d1 [Hari Shreedharan] Make some changes to getPort to use map on the serverOpt.
      1fb0283 [Hari Shreedharan] Merge branch 'master' of https://github.com/apache/spark into use-free-ports
      b351651 [Hari Shreedharan] Allow Netty to choose port, and query it to decide the port to bind to. Leaving findFreePort as is, if other tests want to use it at some point.
      e6c9620 [Hari Shreedharan] Making sure the second sink uses the correct port.
      11c340d [Hari Shreedharan] Add info about race condition to scaladoc.
      e89d135 [Hari Shreedharan] Adding Scaladoc.
      6013bb0 [Hari Shreedharan] [STREAMING] Find free ports to use before attempting to create Flume Sink in Flume Polling Suite
    • [SPARK-1981] updated streaming-kinesis.md · 99243288
      Chris Fregly authored
      fixed markup, separated out sections more clearly, added more thorough explanations
      
      Author: Chris Fregly <chris@fregly.com>
      
      Closes #1757 from cfregly/master and squashes the following commits:
      
      9b1c71a [Chris Fregly] better explained why spark checkpoints are disabled in the example (due to no stateful operations being used)
      0f37061 [Chris Fregly] SPARK-1981:  (Kinesis streaming support) updated streaming-kinesis.md
      862df67 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      8e1ae2e [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      4774581 [Chris Fregly] updated docs, renamed retry to retryRandom to be more clear, removed retries around store() method
      0393795 [Chris Fregly] moved Kinesis examples out of examples/ and back into extras/kinesis-asl
      691a6be [Chris Fregly] fixed tests and formatting, fixed a bug with JavaKinesisWordCount during union of streams
      0e1c67b [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      74e5c7c [Chris Fregly] updated per TD's feedback.  simplified examples, updated docs
      e33cbeb [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      bf614e9 [Chris Fregly] per matei's feedback:  moved the kinesis examples into the examples/ dir
      d17ca6d [Chris Fregly] per TD's feedback:  updated docs, simplified the KinesisUtils api
      912640c [Chris Fregly] changed the foundKinesis class to be a publically-avail class
      db3eefd [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      21de67f [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      6c39561 [Chris Fregly] parameterized the versions of the aws java sdk and kinesis client
      338997e [Chris Fregly] improve build docs for kinesis
      828f8ae [Chris Fregly] more cleanup
      e7c8978 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      cd68c0d [Chris Fregly] fixed typos and backward compatibility
      d18e680 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      b3b0ff1 [Chris Fregly] [SPARK-1981] Add AWS Kinesis streaming support
    • [SQL] Improve debug logging and toStrings. · bfa09b01
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2004 from marmbrus/codgenDebugging and squashes the following commits:
      
      b7a7e41 [Michael Armbrust] Improve debug logging and toStrings.
    • Revert "[SPARK-2970] [SQL] spark-sql script ends with IOException when EventLogging is enabled" · 5ecb08ea
      Michael Armbrust authored
      Revert #1891 due to issues with Hadoop 1 compatibility.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2007 from marmbrus/revert1891 and squashes the following commits:
      
      68706c0 [Michael Armbrust] Revert "[SPARK-2970] [SQL] spark-sql script ends with IOException when EventLogging is enabled"
    • SPARK-2881. Upgrade snappy-java to 1.1.1.3. · 318e28b5
      Patrick Wendell authored
      This upgrades snappy-java which fixes the issue reported in SPARK-2881.
      This is the master branch equivalent to #1994 which provides a different
      work-around for the 1.1 branch.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #1995 from pwendell/snappy-1.1 and squashes the following commits:
      
      0c7c4c2 [Patrick Wendell] SPARK-2881. Upgrade snappy-java to 1.1.1.3.
    • [SPARK-3042] [mllib] DecisionTree Filter top-down instead of bottom-up · 73ab7f14
      Joseph K. Bradley authored
      DecisionTree needs to match each example to a node at each iteration.  It currently does this with a set of filters very inefficiently: For each example, it examines each node at the current level and traces up to the root to see if that example should be handled by that node.
      
      Fix: Filter top-down using the partly built tree itself.
      
      Major changes:
      * Eliminated Filter class, findBinsForLevel() method.
      * Set up node parent links in main loop over levels in train().
      * Added predictNodeIndex() for filtering top-down.
      * Added DTMetadata class
      
      Other changes:
      * Pre-compute set of unorderedFeatures.
      
      Notes for following expected PR based on [https://issues.apache.org/jira/browse/SPARK-3043]:
      * The unorderedFeatures set will next be stored in a metadata structure to simplify function calls (to store other items such as the data in strategy).
      
      I've done initial tests indicating that this speeds things up, but am only now running large-scale ones.
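      A hedged sketch of the top-down routing (binary numeric splits only; the real predictNodeIndex also handles categorical features):

      ```
      object TopDownFilterSketch {
        // A partly built tree: frontier nodes have no split yet.
        final case class Node(id: Int,
                              split: Option[(Int, Double)], // (featureIndex, threshold)
                              left: Option[Node],
                              right: Option[Node])

        // Route one example from the root down to the deepest node that can own
        // it, instead of testing the example against every node's ancestor filters.
        def predictNodeIndex(node: Node, features: Array[Double]): Int =
          node.split match {
            case Some((f, t)) =>
              val child = if (features(f) <= t) node.left else node.right
              child.map(predictNodeIndex(_, features)).getOrElse(node.id)
            case None => node.id // frontier node: the example stops here
          }
      }
      ```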
      
      CC: mengxr manishamde chouqin  Any comments are welcome---thanks!
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #1975 from jkbradley/dt-opt2 and squashes the following commits:
      
      a0ed0da [Joseph K. Bradley] Renamed DTMetadata to DecisionTreeMetadata.  Small doc updates.
      3726d20 [Joseph K. Bradley] Small code improvements based on code review.
      ac0b9f8 [Joseph K. Bradley] Small updates based on code review. Main change: Now using << instead of math.pow.
      db0d773 [Joseph K. Bradley] scala style fix
      6a38f48 [Joseph K. Bradley] Added DTMetadata class for cleaner code
      931a3a7 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt2
      797f68a [Joseph K. Bradley] Fixed DecisionTreeSuite bug for training second level.  Needed to update treePointToNodeIndex with groupShift.
      f40381c [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
      5f2dec2 [Joseph K. Bradley] Fixed scalastyle issue in TreePoint
      6b5651e [Joseph K. Bradley] Updates based on code review.  1 major change: persisting to memory + disk, not just memory.
      2d2aaaf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
      26d10dd [Joseph K. Bradley] Removed tree/model/Filter.scala since no longer used.  Removed debugging println calls in DecisionTree.scala.
      356daba [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
      430d782 [Joseph K. Bradley] Added more debug info on binning error.  Added some docs.
      d036089 [Joseph K. Bradley] Print timing info to logDebug.
      e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods private
      8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own file, and cleaned it up.  Removed debugging println calls from DecisionTree.  Made TreePoint extend Serialiable
      a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
      c1565a5 [Joseph K. Bradley] Small DecisionTree updates: * Simplification: Updated calculateGainForSplit to take aggregates for a single (feature, split) pair. * Internal doc: findAggForOrderedFeatureClassification
      b914f3b [Joseph K. Bradley] DecisionTree optimization: eliminated filters + small changes
      b2ed1f3 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt
      0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree
      3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging)
      f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
      bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
      511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
      a95bc22 [Joseph K. Bradley] timing for DecisionTree internals
  3. Aug 16, 2014
    • [SPARK-3077][MLLIB] fix some chisq-test · fbad7228
      Xiangrui Meng authored
      - promote nullHypothesis field in ChiSqTestResult to TestResult. Every test should have a null hypothesis
      - correct null hypothesis statement for independence test
      - p-value: 0.01 -> 0.1
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1982 from mengxr/fix-chisq and squashes the following commits:
      
      5f0de02 [Xiangrui Meng] make ChiSqTestResult constructor package private
      bc74ea1 [Xiangrui Meng] update chisq-test
    • In the stop method of ConnectionManager to cancel the ackTimeoutMonitor · bc95fe08
      GuoQiang Li authored
      cc JoshRosen sarutak
      
      Author: GuoQiang Li <witgo@qq.com>
      
      Closes #1989 from witgo/cancel_ackTimeoutMonitor and squashes the following commits:
      
      4a700fa [GuoQiang Li] In the stop method of ConnectionManager to cancel the ackTimeoutMonitor
    • [SPARK-1065] [PySpark] improve support for large broadcast · 2fc8aca0
      Davies Liu authored
      Passing a large object through Py4J is very slow (and costs a lot of memory), so broadcast objects are passed via files instead (similar to parallelize()).
      
      Add an option to keep the object in the driver (False by default) to save driver memory.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1912 from davies/broadcast and squashes the following commits:
      
      e06df4a [Davies Liu] load broadcast from disk in driver automatically
      db3f232 [Davies Liu] fix serialization of accumulator
      631a827 [Davies Liu] Merge branch 'master' into broadcast
      c7baa8c [Davies Liu] compress serrialized broadcast and command
      9a7161f [Davies Liu] fix doc tests
      e93cf4b [Davies Liu] address comments: add test
      6226189 [Davies Liu] improve large broadcast
    • [SPARK-3035] Wrong example with SparkContext.addFile · 379e7585
      iAmGhost authored
      https://issues.apache.org/jira/browse/SPARK-3035
      
      Fix for the incorrect documentation example.
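      For reference, the usage pattern the corrected docs converge on, rendered as a hedged Scala snippet (the JIRA concerns the Python docs; assumes an existing SparkContext `sc`):

      ```
      import org.apache.spark.SparkFiles

      sc.addFile("/path/to/lookup.txt")            // distribute the file
      val localPath = SparkFiles.get("lookup.txt") // read it back by name, not by the driver-side path
      ```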
      
      Author: iAmGhost <kdh7807@gmail.com>
      
      Closes #1942 from iAmGhost/master and squashes the following commits:
      
      487528a [iAmGhost] [SPARK-3035] Wrong example with SparkContext.addFile fix for wrong document.
    • [SPARK-3081][MLLIB] rename RandomRDDGenerators to RandomRDDs · ac6411c6
      Xiangrui Meng authored
      The name `RandomRDDGenerators` suggests a factory for `RandomRDDGenerator`; however, its methods return RDDs, not RDD generators. So a more proper (and shorter) name is `RandomRDDs`.
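      After the rename, call sites read naturally; a small sketch assuming an existing SparkContext `sc`:

      ```
      import org.apache.spark.mllib.random.RandomRDDs

      // The methods return RDDs, and the object name now says so:
      val normals = RandomRDDs.normalRDD(sc, 1000000L, 10) // 10^6 N(0,1) samples in 10 partitions
      ```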
      
      dorx brkyvz
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1979 from mengxr/randomrdds and squashes the following commits:
      
      b161a2d [Xiangrui Meng] rename RandomRDDGenerators to RandomRDDs
    • [SPARK-3048][MLLIB] add LabeledPoint.parse and remove loadStreamingLabeledPoints · 7e70708a
      Xiangrui Meng authored
      Move `parse()` from `LabeledPointParser` to `LabeledPoint` and make it public. This breaks binary compatibility only when a user uses synthesized methods like `tupled` and `curried`, which is rare.
      
      `LabeledPoint.parse` is more consistent with `Vectors.parse`, which is why `LabeledPointParser` is not preferred.
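      A hedged usage sketch of the promoted method (the string format shown mirrors `Vectors.parse` and is my assumption, not quoted from the patch):

      ```
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.regression.LabeledPoint

      val v = Vectors.parse("[1.0,0.0,3.0]")            // established vector notation
      val p = LabeledPoint.parse("(1.0,[1.0,0.0,3.0])") // label plus features, same notation
      ```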
      
      freeman-lab tdas
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1952 from mengxr/labelparser and squashes the following commits:
      
      c818fb2 [Xiangrui Meng] merge master
      ce20e6f [Xiangrui Meng] update mima excludes
      b386b8d [Xiangrui Meng] fix tests
      2436b3d [Xiangrui Meng] add parse() to LabeledPoint
    • [SPARK-2677] BasicBlockFetchIterator#next can wait forever · 76fa0eaf
      Kousuke Saruta authored
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #1632 from sarutak/SPARK-2677 and squashes the following commits:
      
      cddbc7b [Kousuke Saruta] Removed Exception throwing when ConnectionManager#handleMessage receives ack for non-referenced message
      d3bd2a8 [Kousuke Saruta] Modified configuration.md for spark.core.connection.ack.timeout
      e85f88b [Kousuke Saruta] Removed useless synchronized blocks
      7ed48be [Kousuke Saruta] Modified ConnectionManager to use ackTimeoutMonitor ConnectionManager-wide
      9b620a6 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2677
      0dd9ad3 [Kousuke Saruta] Modified typo in ConnectionManagerSuite.scala
      7cbb8ca [Kousuke Saruta] Modified to match with scalastyle
      8a73974 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2677
      ade279a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2677
      0174d6a [Kousuke Saruta] Modified ConnectionManager.scala to handle the case remote Executor cannot ack
      a454239 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2677
      9b7b7c1 [Kousuke Saruta] (WIP) Modifying ConnectionManager.scala
    • [SPARK-3076] [Jenkins] catch & report test timeouts · 4bdfaa16
      Nicholas Chammas authored
      * Remove unused code to get jq
      * Set timeout on tests and report gracefully on them
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #1974 from nchammas/master and squashes the following commits:
      
      d1f1b6b [Nicholas Chammas] set timeout to realistic number
      8b1ea41 [Nicholas Chammas] fix formatting
      279526e [Nicholas Chammas] [SPARK-3076] catch & report test timeouts
    • [SQL] Using safe floating-point numbers in doctest · b4a05928
      Cheng Lian authored
      Test code in `sql.py` tries to compare two floating-point numbers directly, and caused [build failure(s)](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18365/consoleFull).
      
      [Doctest documentation](https://docs.python.org/3/library/doctest.html#warnings) recommends using numbers in the form of `I/2**J` to avoid the precision issue.
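      The same pitfall reproduces in any language; a Scala REPL check of the I/2**J rule:

      ```
      scala> 0.1 + 0.2 == 0.3
      res0: Boolean = false   // 0.1 and 0.2 have no exact binary representation

      scala> 0.25 + 0.5 == 0.75
      res1: Boolean = true    // 1/4 and 1/2 are of the form I/2**J, so exact
      ```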
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1925 from liancheng/fix-pysql-fp-test and squashes the following commits:
      
      0fbf584 [Cheng Lian] Removed unnecessary `...' from inferSchema doctest
      e8059d4 [Cheng Lian] Using safe floating-point numbers in doctest
    • [SPARK-2977] Ensure ShuffleManager is created before ShuffleBlockManager · 20fcf3d0
      Josh Rosen authored
      This is intended to fix SPARK-2977.  Before, there was an implicit ordering dependency where we needed to know the ShuffleManager implementation before creating the ShuffleBlockManager.  This patch makes that dependency explicit by adding ShuffleManager to a bunch of constructors.
      
      I think it's a little odd for BlockManager to take a ShuffleManager only to pass it to ShuffleBlockManager without using it itself; there's an opportunity to clean this up later if we sever the circular dependencies between BlockManager and other components and pass those components to BlockManager's constructor.
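      A hedged sketch of the before/after shape (names simplified):

      ```
      trait ShuffleManager
      class HashShuffleManager extends ShuffleManager

      // The ordering is now explicit: a ShuffleBlockManager can only be built
      // once a ShuffleManager exists, because it arrives via the constructor.
      class ShuffleBlockManager(val shuffleManager: ShuffleManager)

      class BlockManager(shuffleManager: ShuffleManager) {
        // BlockManager doesn't use shuffleManager itself; it only threads it through.
        val shuffleBlockManager = new ShuffleBlockManager(shuffleManager)
      }
      ```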
      
      Author: Josh Rosen <joshrosen@apache.org>
      
      Closes #1976 from JoshRosen/SPARK-2977 and squashes the following commits:
      
      a9cd1e1 [Josh Rosen] [SPARK-2977] Ensure ShuffleManager is created before ShuffleBlockManager.
    • [SPARK-3045] Make Serializer interface Java friendly · a83c7723
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1948 from rxin/kryo and squashes the following commits:
      
      a3a80d8 [Reynold Xin] [SPARK-3046] use executor's class loader as the default serializer classloader
      3d13277 [Reynold Xin] Reverted that in TestJavaSerializerImpl too.
      196f3dc [Reynold Xin] Ok one more commit to revert the classloader change.
      c49b50c [Reynold Xin] Removed JavaSerializer change.
      afbf37d [Reynold Xin] Moved the test case also.
      a2e693e [Reynold Xin] Removed the Kryo bug fix from this pull request.
      c81bd6c [Reynold Xin] Use defaultClassLoader when executing user specified custom registrator.
      68f261e [Reynold Xin] Added license check excludes.
      0c28179 [Reynold Xin] [SPARK-3045] Make Serializer interface Java friendly [SPARK-3046] Set executor's class loader as the default serializer class loader
    • [SPARK-3015] Block on cleaning tasks to prevent Akka timeouts · c9da466e
      Andrew Or authored
      More detail on the issue is described in [SPARK-3015](https://issues.apache.org/jira/browse/SPARK-3015), but the TLDR is that if we send too many blocking Akka messages that are dependent on each other in quick succession, then we end up causing a few of these messages to time out and ultimately kill the executors. As of #1498, we broadcast each RDD whether or not it is persisted. This means that if we create many RDDs (each of which becomes a broadcast) and the driver performs a GC that cleans up all of these broadcast blocks, then we end up sending many `RemoveBroadcast` messages in parallel and trigger the chain of blocking messages at high frequency.
      
      We do not know of the Akka-level root cause yet, so this is intended to be a temporary solution until we identify the real issue. I have done some preliminary testing of enabling blocking and observed that the queue length remains quite low (< 1000) even under very intensive workloads.
      
      In the long run, we should do something more sophisticated to allow a limited degree of parallelism through batching clean up tasks or processing them in a sliding window. In the longer run, we should clean up the whole `BlockManager*` message passing interface to avoid unnecessarily awaiting on futures created from Akka asks.
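      A minimal sketch of the blocking behaviour, under hypothetical shapes (the real logic lives in ContextCleaner):

      ```
      import java.util.concurrent.LinkedBlockingQueue
      import scala.concurrent.{Await, Future}
      import scala.concurrent.duration._

      object CleanerSketch {
        // Each task fires an asynchronous RemoveBroadcast-style message and
        // returns a future for its ack.
        def drain(tasks: LinkedBlockingQueue[() => Future[Unit]], blockOnCleanup: Boolean): Unit =
          while (!tasks.isEmpty) {
            val ack = tasks.take().apply()
            if (blockOnCleanup) Await.result(ack, 30.seconds) // one message in flight at a time
            // otherwise fire-and-forget, which is what overwhelmed Akka
          }
      }
      ```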
      
      tdas pwendell mengxr
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1931 from andrewor14/reference-blocking and squashes the following commits:
      
      d0f7195 [Andrew Or] Merge branch 'master' of github.com:apache/spark into reference-blocking
      ce9daf5 [Andrew Or] Remove logic for logging queue length
      111192a [Andrew Or] Add missing space in log message (minor)
      a183b83 [Andrew Or] Switch order of code blocks (minor)
      9fd1fe6 [Andrew Or] Remove outdated log
      104b366 [Andrew Or] Use the actual reference queue length
      0b7e768 [Andrew Or] Block on cleaning tasks by default + log error on queue full
  4. Aug 15, 2014
    • [SPARK-3001][MLLIB] Improve Spearman's correlation · 2e069ca6
      Xiangrui Meng authored
      The current implementation sorts each column individually; this can instead be done with a single global sort.
      
      Result on a 32-node cluster (m = number of rows, n = number of columns):
      
      m | n | prev | this
      ---|---|-------|-----
      1000000 | 50 | 55s | 9s
      10000000 | 50 | 97s | 76s
      1000000 | 100  | 119s | 15s
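      A hedged sketch of the ranking step behind the speedup (ties ignored; real Spearman averages tied ranks; Spark 1.x needs the SparkContext._ import for sortByKey):

      ```
      import org.apache.spark.SparkContext._
      import org.apache.spark.rdd.RDD

      object SpearmanSketch {
        // One global sort per column replaces repeated local sorts: sort the
        // values, take the position as the rank, then feed the ranks to the
        // ordinary Pearson correlation.
        def ranks(column: RDD[(Long, Double)]): RDD[(Long, Double)] = // (rowId, value)
          column.map { case (id, value) => (value, id) }
            .sortByKey()
            .zipWithIndex()
            .map { case ((_, id), index) => (id, index + 1.0) } // 1-based rank
      }
      ```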
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1917 from mengxr/spearman and squashes the following commits:
      
      4d5d262 [Xiangrui Meng] remove unused import
      85c48de [Xiangrui Meng] minor updates
      a048d0c [Xiangrui Meng] remove cache and set a limit to cachedIds
      b98bb18 [Xiangrui Meng] add comments
      0846e07 [Xiangrui Meng] first version
    • [SPARK-3078][MLLIB] Make LRWithLBFGS API consistent with others · 5d25c0b7
      Xiangrui Meng authored
      Should ask users to set parameters through the optimizer. dbtsai
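      The resulting call pattern, consistent with the other MLlib algorithms (sketch; `training` is a hypothetical RDD[LabeledPoint]):

      ```
      import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

      val lr = new LogisticRegressionWithLBFGS()
      lr.optimizer                 // parameters go through the exposed optimizer...
        .setNumIterations(100)
        .setRegParam(0.01)
        .setConvergenceTol(1e-4)
      // val model = lr.run(training) // ...not through constructor arguments
      ```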
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1973 from mengxr/lr-lbfgs and squashes the following commits:
      
      e3efbb1 [Xiangrui Meng] fix tests
      21b3579 [Xiangrui Meng] fix method name
      641eea4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into lr-lbfgs
      456ab7c [Xiangrui Meng] update LRWithLBFGS
    • [SPARK-3046] use executor's class loader as the default serializer classloader · cc364877
      Reynold Xin authored
      The serializer is not always used in an executor thread (e.g. connection manager, broadcast), in which case the classloader might not have the user jar set, leading to corruption in deserialization.
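      A sketch of the underlying mechanism: deserialization must resolve classes against an explicitly supplied loader, not whatever the calling thread happens to carry:

      ```
      import java.io.{InputStream, ObjectInputStream, ObjectStreamClass}

      // Resolve classes against `loader` (e.g. the executor's loader, which
      // includes the user jar) instead of the calling thread's context class
      // loader, which is wrong for threads like the connection manager's.
      class LoaderAwareObjectInputStream(in: InputStream, loader: ClassLoader)
          extends ObjectInputStream(in) {
        override def resolveClass(desc: ObjectStreamClass): Class[_] =
          Class.forName(desc.getName, false, loader)
      }
      ```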
      
      https://issues.apache.org/jira/browse/SPARK-3046
      
      https://issues.apache.org/jira/browse/SPARK-2878
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1972 from rxin/kryoBug and squashes the following commits:
      
      c1c7bf0 [Reynold Xin] Made change to JavaSerializer.
      7204c33 [Reynold Xin] Added imports back.
      d879e67 [Reynold Xin] [SPARK-3046] use executor's class loader as the default serializer class loader.
    • [SPARK-3022] [SPARK-3041] [mllib] Call findBins once per level + unordered feature bug fix · c7032290
      Joseph K. Bradley authored
      DecisionTree improvements:
      (1) TreePoint representation to avoid binning multiple times
      (2) Bug fix: isSampleValid indexed bins incorrectly for unordered categorical features
      (3) Timing for DecisionTree internals
      
      Details:
      
      (1) TreePoint representation to avoid binning multiple times
      
      [https://issues.apache.org/jira/browse/SPARK-3022]
      
      Added private[tree] TreePoint class for representing binned feature values.
      
      The input RDD of LabeledPoint is converted to the TreePoint representation initially and then cached.  This avoids the previous problem of re-computing bins multiple times.
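      A hedged sketch of the representation (the real TreePoint is private[tree]; `findBin` stands in for the binning function, and the memory-plus-disk persistence mirrors the squashed commits below):

      ```
      import org.apache.spark.mllib.regression.LabeledPoint
      import org.apache.spark.rdd.RDD
      import org.apache.spark.storage.StorageLevel

      case class TreePointSketch(label: Double, binnedFeatures: Array[Int])

      object TreePointSketch {
        // Bin every feature exactly once up front and cache the result; later
        // levels reuse the binned data instead of re-running findBin.
        def convert(input: RDD[LabeledPoint], findBin: (Int, Double) => Int): RDD[TreePointSketch] =
          input.map { lp =>
            val bins = Array.tabulate(lp.features.size)(j => findBin(j, lp.features(j)))
            TreePointSketch(lp.label, bins)
          }.persist(StorageLevel.MEMORY_AND_DISK)
      }
      ```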
      
      (2) Bug fix: isSampleValid indexed bins incorrectly for unordered categorical features
      
      [https://issues.apache.org/jira/browse/SPARK-3041]
      
      isSampleValid used to treat unordered categorical features incorrectly: It treated the bins as if indexed by feature values, rather than by subsets of values/categories.
      * exhibited for unordered features (multi-class classification with categorical features of low arity)
      * Fix: Index bins correctly for unordered categorical features.
      
      (3) Timing for DecisionTree internals
      
      Added tree/impl/TimeTracker.scala class which is private[tree] for now, for timing key parts of DT code.
      Prints timing info via logDebug.
      
      CC: mengxr manishamde chouqin  Very similar update, with one bug fix.  Many apologies for the conflicting update, but I hope that a few more optimizations I have on the way (which depend on this update) will prove valuable to you: SPARK-3042 and SPARK-3043
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #1950 from jkbradley/dt-opt1 and squashes the following commits:
      
      5f2dec2 [Joseph K. Bradley] Fixed scalastyle issue in TreePoint
      6b5651e [Joseph K. Bradley] Updates based on code review.  1 major change: persisting to memory + disk, not just memory.
      2d2aaaf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
      430d782 [Joseph K. Bradley] Added more debug info on binning error.  Added some docs.
      d036089 [Joseph K. Bradley] Print timing info to logDebug.
      e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods private
      8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own file, and cleaned it up.  Removed debugging println calls from DecisionTree.  Made TreePoint extend Serialiable
      a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
      0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree
      3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging)
      f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
      bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
      511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
      a95bc22 [Joseph K. Bradley] timing for DecisionTree internals
    • SPARK-3028. sparkEventToJson should support SparkListenerExecutorMetricsUpdate · 0afe5cb6
      Sandy Ryza authored
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #1961 from sryza/sandy-spark-3028 and squashes the following commits:
      
      dccdff5 [Sandy Ryza] Fix compile error
      f883ded [Sandy Ryza] SPARK-3028. sparkEventToJson should support SparkListenerExecutorMetricsUpdate
    • Revert "[SPARK-2468] Netty based block server / client module" · fd9fcd25
      Patrick Wendell authored
      This reverts commit 3a8b68b7.
    • [SPARK-2924] remove default args to overloaded methods · 7589c39d
      Anand Avati authored
      Default arguments on overloaded methods are not supported in Scala 2.11. Split them into separate methods instead.
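      The shape of the change, sketched with hypothetical method names:

      ```
      object OverloadSketch {
        // Rejected when overloaded ("multiple overloaded alternatives of method
        // save define default arguments"):
        //   def save(path: String, overwrite: Boolean = false): Unit = ...
        //   def save(conf: Map[String, String]): Unit = ...

        // The split that replaces the default argument:
        def save(path: String): Unit = save(path, overwrite = false)
        def save(path: String, overwrite: Boolean): Unit = { /* write */ }
        def save(conf: Map[String, String]): Unit = { /* write */ }
      }
      ```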
      
      Author: Anand Avati <avati@redhat.com>
      
      Closes #1704 from avati/SPARK-1812-default-args and squashes the following commits:
      
      3e3924a [Anand Avati] SPARK-1812: Add Mima excludes for the broken ABI
      901dfc7 [Anand Avati] SPARK-1812: core - Fix overloaded methods with default arguments
      07f00af [Anand Avati] SPARK-1812: streaming - Fix overloaded methods with default arguments
    • Add caching information to rdd.toDebugString · fba8ec39
      Nathan Kronenfeld authored
      I find it useful to see where in an RDD's DAG data is cached, so I figured others might too.
      
      I've added both the caching level, and the actual memory state of the RDD.
      
      Some of this is redundant with the web UI (notably the actual memory state), but (a) that is temporary, and (b) putting it in the DAG tree shows some context that can help a lot.
      
      For example:
      ```
      (4) ShuffledRDD[3] at reduceByKey at <console>:14
       +-(4) MappedRDD[2] at map at <console>:14
          |  MapPartitionsRDD[1] at mapPartitions at <console>:12
          |  ParallelCollectionRDD[0] at parallelize at <console>:12
      ```
      should change to
      ```
      (4) ShuffledRDD[3] at reduceByKey at <console>:14 [Memory Deserialized 1x Replicated]
       |       CachedPartitions: 4; MemorySize: 50.8 MB; TachyonSize: 0.0 B; DiskSize: 0.0 B
       +-(4) MappedRDD[2] at map at <console>:14 [Memory Deserialized 1x Replicated]
          |  MapPartitionsRDD[1] at mapPartitions at <console>:12 [Memory Deserialized 1x Replicated]
          |      CachedPartitions: 4; MemorySize: 109.1 MB; TachyonSize: 0.0 B; DiskSize: 0.0 B
          |  ParallelCollectionRDD[0] at parallelize at <console>:12 [Memory Deserialized 1x Replicated]
      ```
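      To reproduce output of this shape, cache something mid-lineage, materialize it, and print (sketch; assumes an existing SparkContext `sc`):

      ```
      val pairs  = sc.parallelize(1 to 1000000, 4).map(x => (x % 100, x)).cache()
      val counts = pairs.reduceByKey(_ + _)
      counts.count()                // materialize, so partitions are actually cached
      println(counts.toDebugString) // storage level and CachedPartitions now appear
      ```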
      
      Author: Nathan Kronenfeld <nkronenfeld@oculusinfo.com>
      
      Closes #1535 from nkronenfeld/feature/debug-caching2 and squashes the following commits:
      
      40490bc [Nathan Kronenfeld] Back out DeveloperAPI and arguments to RDD.toDebugString, reinstate memory output
      794e6a3 [Nathan Kronenfeld] Attempt to merge mima changes from master
      6fe9e80 [Nathan Kronenfeld] Add exclusions to allow for signature change in toDebugString (will back out if necessary)
      31d6769 [Nathan Kronenfeld] Attempt to get rid of style errors.  Add comments for the new memory usage parameter.
      a0f6f76 [Nathan Kronenfeld] Add parameter to RDD.toDebugString to allow detailed memory info to be shown or not.  Default is for it not to be shown.
      f8f565a [Nathan Kronenfeld] Fix code style error
      8f54287 [Nathan Kronenfeld] Changed string addition to string interpolation as per PR comments
      2a0cd4d [Nathan Kronenfeld] Fixed a small formatting issue I forgot to copy over from the old branch
      8fbecb6 [Nathan Kronenfeld] Add caching information to rdd.toDebugString
    • SPARK-2955 [BUILD] Test code fails to compile with "mvn compile" without "install" · e1b85f31
      Sean Owen authored
      (This is the corrected follow-up to https://issues.apache.org/jira/browse/SPARK-2903)
      
      Right now, `mvn compile test-compile` fails to compile Spark. (Don't worry; `mvn package` works, so this is not major.) The issue stems from test code in some modules depending on test code in other modules. That is perfectly fine and supported by Maven.
      
      It takes extra work to get this to work with scalatest, and this has been attempted: https://github.com/apache/spark/blob/master/sql/catalyst/pom.xml#L86
      
      This formulation is not quite enough, since the SQL Core module's tests fail to compile for lack of finding test classes in SQL Catalyst, and likewise for most Streaming integration modules depending on core Streaming test code. Example:
      
      ```
      [error] /Users/srowen/Documents/spark/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala:23: not found: type PlanTest
      [error] class QueryTest extends PlanTest {
      [error]                         ^
      [error] /Users/srowen/Documents/spark/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala:28: package org.apache.spark.sql.test is not a value
      [error]   test("SPARK-1669: cacheTable should be idempotent") {
      [error]   ^
      ...
      ```
      
      The issue I believe is that generation of a `test-jar` is bound here to the `compile` phase, but the test classes are not being compiled in this phase. It should bind to the `test-compile` phase.
      
      It works when executing `mvn package` or `mvn install` since test-jar artifacts are actually generated and made available through normal Maven mechanisms as each module is built. They are then found normally, regardless of scalatest configuration.
      
      It would be nice for a simple `mvn compile test-compile` to work since the test code is perfectly compilable given the Maven declarations.
      
      On the plus side, this change is low-risk as it only affects tests.
      yhuai made the original scalatest change and has glanced at this and thinks it makes sense.
      
      Author: Sean Owen <srowen@gmail.com>
      
      Closes #1879 from srowen/SPARK-2955 and squashes the following commits:
      
      ad8242f [Sean Owen] Generate test-jar on test-compile for modules whose tests are needed by others' tests
    • [SPARK-2912] [Spark QA] Include commit hash in Spark QA messages · 500f84e4
      Nicholas Chammas authored
      You can find the [discussion that motivated this PR here](http://mail-archives.apache.org/mod_mbox/spark-dev/201408.mbox/%3CCABPQxssy0ri2QAz=cc9Tx+EXYWARm7pNcVm8apqCwc-esLbO4Qmail.gmail.com%3E).
      
      As described in [SPARK-2912](https://issues.apache.org/jira/browse/SPARK-2912), the goal of this PR (and related ones to come) is to include useful detail in Spark QA's messages that are intended to make a committer's job easier to do.
      
      Since this work depends on Jenkins, I cannot test this locally. Hence, I will be iterating via this PR.
      
      Notes:
      * This is a duplicate of a [previous PR](https://github.com/apache/spark/pull/1811), without the extraneous commits.
      * This PR also resolves an issue targeted by [another open PR](https://github.com/apache/spark/pull/1809).
      
      Closes #1809.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      Author: nchammas <nicholas.chammas@gmail.com>
      
      Closes #1816 from nchammas/master and squashes the following commits:
      
      c1be644 [Nicholas Chammas] [SPARK-2912] include commit hash in messages
      8f641ac [nchammas] Merge pull request #7 from apache/master
  5. Aug 14, 2014
    • [SPARK-2736] PySpark converter and example script for reading Avro files · 9422a9b0
      Kan Zhang authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-2736
      
      This patch includes:
      1. An Avro converter that converts Avro data types to Python. It handles all 3 Avro data mappings (Generic, Specific and Reflect).
      2. An example Python script for reading Avro files using AvroKeyInputFormat and the converter.
      3. Fixing a classloading issue.
      
      cc @MLnick @JoshRosen @mateiz
      
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #1916 from kanzhang/SPARK-2736 and squashes the following commits:
      
      02443f8 [Kan Zhang] [SPARK-2736] Adding .avsc files to .rat-excludes
      f74e9a9 [Kan Zhang] [SPARK-2736] nit: clazz -> className
      82cc505 [Kan Zhang] [SPARK-2736] Update data sample
      0be7761 [Kan Zhang] [SPARK-2736] Example pyspark script and data files
      c8e5881 [Kan Zhang] [SPARK-2736] Trying to work with all 3 Avro data models
      2271a5b [Kan Zhang] [SPARK-2736] Using the right class loader to find Avro classes
      536876b [Kan Zhang] [SPARK-2736] Adding Avro to Java converter