  1. Sep 09, 2014
    • [SPARK-3193] Output error info when Process exit code is not zero in test suite · 26862337
      scwf authored
      https://issues.apache.org/jira/browse/SPARK-3193
      I noticed that PR tests sometimes fail because the Process exit code is not zero; see
      https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18688/consoleFull
      https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19118/consoleFull
      
      [info] SparkSubmitSuite:
      [info] - prints usage on empty input
      [info] - prints usage with only --help
      [info] - prints error with unrecognized options
      [info] - handle binary specified but not class
      [info] - handles arguments with --key=val
      [info] - handles arguments to user program
      [info] - handles arguments to user program with name collision
      [info] - handles YARN cluster mode
      [info] - handles YARN client mode
      [info] - handles standalone cluster mode
      [info] - handles standalone client mode
      [info] - handles mesos client mode
      [info] - handles confs with flag equivalents
      [info] - launch simple application with spark-submit *** FAILED ***
      [info]   org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, local, file:/tmp/1408854098404-0/testJar-1408854098404.jar) exited with code 1
      [info]   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:872)
      [info]   at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
      [info]   at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291)
      [info]   at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
      [info]   at org.apacSpark assembly has been built with Hive, including Datanucleus jars on classpath
      
      This PR outputs the process error info when the process fails, which can be helpful for diagnosis.
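      The actual change is in the Scala `Utils.executeAndGetOutput`; the idea can be sketched in Python (helper name hypothetical): capture the child's stderr and attach it to the failure message instead of reporting only the exit code.

      ```python
      import subprocess

      def execute_and_get_output(cmd):
          """Run cmd; on a non-zero exit code, raise with the captured stderr."""
          proc = subprocess.run(cmd, capture_output=True, text=True)
          if proc.returncode != 0:
              raise RuntimeError(
                  f"Process {cmd} exited with code {proc.returncode}; "
                  f"stderr: {proc.stderr.strip()}")
          return proc.stdout
      ```

      With this, a failing spark-submit run would surface its stderr in the test report rather than just "exited with code 1".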
      
      Author: scwf <wangfei1@huawei.com>
      
      Closes #2108 from scwf/output-test-error-info and squashes the following commits:
      
      0c48082 [scwf] minor fix according to comments
      563fde1 [scwf] output errer info when Process exitcode not zero
      26862337
    • SPARK-3404 [BUILD] SparkSubmitSuite fails with "spark-submit exits with code 1" · f0f1ba09
      Sean Owen authored
      This fixes the `SparkSubmitSuite` failure by setting `<spark.ui.port>0</spark.ui.port>` in the Maven build, to match the SBT build. This avoids a port conflict which causes failures.
      
      (This also updates the `scalatest` plugin off of a release candidate, to the identical final release.)
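      Port 0 works here because binding to port 0 asks the OS for any free ephemeral port, so concurrent test JVMs never contend for the same UI port. A quick Python illustration of that OS behavior:

      ```python
      import socket

      def bind_ephemeral_port():
          """Bind to port 0; the OS assigns a free ephemeral port."""
          sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
          sock.bind(("127.0.0.1", 0))
          return sock, sock.getsockname()[1]
      ```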
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #2328 from srowen/SPARK-3404 and squashes the following commits:
      
      512d782 [Sean Owen] Set spark.ui.port=0 in Maven scalatest config to match SBT build and avoid SparkSubmitSuite failure due to port conflict
      f0f1ba09
    • SPARK-3422. JavaAPISuite.getHadoopInputSplits isn't used anywhere. · 88547a09
      Sandy Ryza authored
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #2324 from sryza/sandy-spark-3422 and squashes the following commits:
      
      6446175 [Sandy Ryza] SPARK-3422. JavaAPISuite.getHadoopInputSplits isn't used anywhere.
      88547a09
    • [SPARK-3455] [SQL] **HOT FIX** Fix the unit test failure · 1e03cf79
      Cheng Hao authored
      The unit test failed because it could not resolve the attribute references. Temporarily disable this test case as a quick fix; otherwise it will block the others.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #2334 from chenghao-intel/unit_test_failure and squashes the following commits:
      
      661f784 [Cheng Hao] temporally disable the failed test case
      1e03cf79
    • [Docs] actorStream storageLevel default is MEMORY_AND_DISK_SER_2 · c419e4f1
      Mario Pastorelli authored
      The comment on the storageLevel param of actorStream says that it defaults to memory-only, while the actual default is MEMORY_AND_DISK_SER_2.
      
      Author: Mario Pastorelli <pastorelli.mario@gmail.com>
      
      Closes #2319 from melrief/master and squashes the following commits:
      
      7b6ce68 [Mario Pastorelli] [Docs] actorStream storageLevel default is MEMORY_AND_DISK_SER_2
      c419e4f1
    • [Build] Removed -Phive-thriftserver since this profile has been removed · ce5cb325
      Cheng Lian authored
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #2269 from liancheng/clean-run-tests-profile and squashes the following commits:
      
      08617bd [Cheng Lian] Removed -Phive-thriftserver since this profile has been removed
      ce5cb325
  2. Sep 08, 2014
    • SPARK-2425 Don't kill a still-running Application because of some misbehaving Executors · 092e2f15
      Mark Hamstra authored
      Introduces a LOADING -> RUNNING ApplicationState transition and prevents Master from removing an Application with RUNNING Executors.
      
      Two basic changes: 1) Instead of allowing MAX_NUM_RETRY abnormal Executor exits over the entire lifetime of the Application, allow that many since any Executor successfully began running the Application; 2) Don't remove the Application while Master still thinks that there are RUNNING Executors.
      
      This should be fine as long as the ApplicationInfo doesn't believe any Executors are forever RUNNING when they are not.  I think that any non-RUNNING Executors will eventually no longer be RUNNING in Master's accounting, but another set of eyes should confirm that.  This PR also doesn't try to detect which nodes have gone rogue or to kill off bad Workers, so repeatedly failing Executors will continue to fail and fill up log files with failure reports as long as the Application keeps running.
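      Change (1) can be sketched in Python (names hypothetical; the real logic lives in the Master): the abnormal-exit budget resets whenever an Executor successfully reaches RUNNING, instead of accumulating over the Application's whole lifetime.

      ```python
      MAX_NUM_RETRY = 10  # illustrative value, not Spark's actual setting

      class ApplicationRetryState:
          """Count abnormal Executor exits since the last successful Executor start."""

          def __init__(self):
              self.retries = 0

          def on_executor_running(self):
              # An Executor successfully began running: reset the budget.
              self.retries = 0

          def on_executor_failed(self):
              self.retries += 1
              # True => the Application has exhausted its retry budget.
              return self.retries >= MAX_NUM_RETRY
      ```

      Under the old scheme the counter never reset, so a long-running Application with occasional failures would eventually be killed even while healthy Executors kept running.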
      
      Author: Mark Hamstra <markhamstra@gmail.com>
      
      Closes #1360 from markhamstra/SPARK-2425 and squashes the following commits:
      
      f099c0b [Mark Hamstra] Reuse appInfo
      b2b7b25 [Mark Hamstra] Moved 'Application failed' logging
      bdd0928 [Mark Hamstra] switched to string interpolation
      1dd591b [Mark Hamstra] SPARK-2425 introduce LOADING -> RUNNING ApplicationState transition and prevent Master from removing Application with RUNNING Executors
      092e2f15
    • [SPARK-3329][SQL] Don't depend on Hive SET pair ordering in tests. · 2b7ab814
      William Benton authored
      This fixes some possible spurious test failures in `HiveQuerySuite` by comparing sets of key-value pairs as sets, rather than as lists.
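      The fix boils down to an order-insensitive comparison; a minimal Python rendering of the idea:

      ```python
      def same_set_output(expected_pairs, actual_pairs):
          """Compare SET command output as a set of key-value pairs, not a list."""
          return set(expected_pairs) == set(actual_pairs)
      ```

      A list comparison of the same pairs is order-sensitive and would flag a spurious failure whenever Hive emits the pairs in a different order.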
      
      Author: William Benton <willb@redhat.com>
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #2220 from willb/spark-3329 and squashes the following commits:
      
      3b3e205 [William Benton] Collapse collectResults case match in HiveQuerySuite
      6525d8e [William Benton] Handle cases where SET returns Rows of (single) strings
      cf11b0e [Aaron Davidson] Fix flakey HiveQuerySuite test
      2b7ab814
    • [SPARK-3414][SQL] Stores analyzed logical plan when registering a temp table · dc1dbf20
      Cheng Lian authored
      Case insensitivity breaks when an unresolved relation contains attributes with uppercase letters in their names, because we store the unanalyzed logical plan when registering temp tables, while the `CaseInsensitivityAttributeReferences` batch runs before the `Resolution` batch. To fix this issue, we need to store the analyzed logical plan.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #2293 from liancheng/spark-3414 and squashes the following commits:
      
      d9fa1d6 [Cheng Lian] Stores analyzed logical plan when registering a temp table
      dc1dbf20
    • SPARK-3423: [SQL] Implement BETWEEN for SQLParser · ca0348e6
      William Benton authored
      This patch improves the SQLParser by adding support for BETWEEN conditions.
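      SQL's BETWEEN is inclusive on both bounds: `x BETWEEN lo AND hi` is equivalent to `lo <= x AND x <= hi`. The predicate the parser must produce, transcribed into Python:

      ```python
      def between(x, lo, hi):
          """Evaluate SQL's inclusive BETWEEN predicate."""
          return lo <= x <= hi
      ```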
      
      Author: William Benton <willb@redhat.com>
      
      Closes #2295 from willb/sql-between and squashes the following commits:
      
      0016d30 [William Benton] Implement BETWEEN for SQLParser
      ca0348e6
    • [SPARK-3443][MLLIB] update default values of tree: · 50a4fa77
      Xiangrui Meng authored
      Adjust the default values of decision tree, based on the memory requirement discussed in https://github.com/apache/spark/pull/2125 :
      
      1. maxMemoryInMB: 128 -> 256
      2. maxBins: 100 -> 32
      3. maxDepth: 4 -> 5 (in some example code)
      
      jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2322 from mengxr/tree-defaults and squashes the following commits:
      
      cda453a [Xiangrui Meng] fix tests
      5900445 [Xiangrui Meng] update comments
      8c81831 [Xiangrui Meng] update default values of tree:
      50a4fa77
    • [SPARK-3349][SQL] Output partitioning of limit should not be inherited from child · 7db53391
      Eric Liang authored
      This resolves https://issues.apache.org/jira/browse/SPARK-3349
      
      Author: Eric Liang <ekl@google.com>
      
      Closes #2262 from ericl/spark-3349 and squashes the following commits:
      
      3e1b05c [Eric Liang] add regression test
      ac32723 [Eric Liang] make limit/takeOrdered output SinglePartition
      7db53391
    • [SPARK-3019] Pluggable block transfer interface (BlockTransferService) · 08ce1888
      Reynold Xin authored
      This pull request creates a new BlockTransferService interface for block fetch/upload and refactors the existing ConnectionManager to implement BlockTransferService (NioBlockTransferService).
      
      Most of the changes are simply moving code around. The main class to inspect is ShuffleBlockFetcherIterator.
      
      Review guide:
      - Most of the ConnectionManager code is now in network.cm package
      - ManagedBuffer is a new buffer abstraction backed by several different implementations (file segment, nio ByteBuffer, Netty ByteBuf)
      - BlockTransferService is the main internal interface introduced in this PR
      - NioBlockTransferService implements BlockTransferService and replaces the old BlockManagerWorker
      - ShuffleBlockFetcherIterator replaces the old BlockFetcherIterator to use the new interface
      
      TODOs that should be separate PRs:
      - Implement NettyBlockTransferService
      - Finalize the API/semantics for ManagedBuffer.release()
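      The shape of the pluggable abstraction can be sketched in Python (method names illustrative, not the actual Scala signatures): a small interface that different transports (NIO today, Netty later) can implement interchangeably.

      ```python
      import abc

      class BlockTransferService(abc.ABC):
          """Pluggable interface for fetching and uploading blocks."""

          @abc.abstractmethod
          def fetch_block(self, block_id):
              ...

          @abc.abstractmethod
          def upload_block(self, block_id, data):
              ...

      class InMemoryTransferService(BlockTransferService):
          """Toy transport standing in for NioBlockTransferService."""

          def __init__(self):
              self.store = {}

          def fetch_block(self, block_id):
              return self.store[block_id]

          def upload_block(self, block_id, data):
              self.store[block_id] = data
      ```

      The point of the refactoring is that callers like ShuffleBlockFetcherIterator code against the interface, so swapping in a Netty-backed implementation later needs no caller changes.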
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #2240 from rxin/blockTransferService and squashes the following commits:
      
      64cd9d7 [Reynold Xin] Merge branch 'master' into blockTransferService
      1dfd3d7 [Reynold Xin] Limit the length of the FileInputStream.
      1332156 [Reynold Xin] Fixed style violation from refactoring.
      2960c93 [Reynold Xin] Added ShuffleBlockFetcherIteratorSuite.
      e29c721 [Reynold Xin] Updated comment for ShuffleBlockFetcherIterator.
      8a1046e [Reynold Xin] Code review feedback:
      2c6b1e1 [Reynold Xin] Removed println in test cases.
      2a907e4 [Reynold Xin] Merge branch 'master' into blockTransferService-merge
      07ccf0d [Reynold Xin] Added init check to CMBlockTransferService.
      98c668a [Reynold Xin] Added failure handling and fixed unit tests.
      ae05fcd [Reynold Xin] Updated tests, although DistributedSuite is hanging.
      d8d595c [Reynold Xin] Merge branch 'master' of github.com:apache/spark into blockTransferService
      9ef279c [Reynold Xin] Initial refactoring to move ConnectionManager to use the BlockTransferService.
      08ce1888
    • [SPARK-3417] Use new-style classes in PySpark · 939a322c
      Matthew Rocklin authored
      Tiny PR making SQLContext a new-style class. This allows various type logic to work more effectively.
      
      ```Python
      In [1]: import pyspark
      
      In [2]: pyspark.sql.SQLContext.mro()
      Out[2]: [pyspark.sql.SQLContext, object]
      ```
      
      Author: Matthew Rocklin <mrocklin@gmail.com>
      
      Closes #2288 from mrocklin/sqlcontext-new-style-class and squashes the following commits:
      
      4aadab6 [Matthew Rocklin] update other old-style classes
      a2dc02f [Matthew Rocklin] pyspark.sql.SQLContext is new-style class
      939a322c
    • [SQL] Minor edits to sql programming guide. · 26bc7655
      Henry Cook authored
      Author: Henry Cook <hcook@eecs.berkeley.edu>
      
      Closes #2316 from hcook/sql-docs and squashes the following commits:
      
      373f94b [Henry Cook] Minor edits to sql programming guide.
      26bc7655
    • Provide a default PYSPARK_PYTHON for python/run_tests · 386bc24e
      Matthew Farrellee authored
      Without this, the version of Python used in the tests is not
      recorded. The error is:
      
         Testing with Python version:
         ./run-tests: line 57: --version: command not found
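      The actual change is in the run-tests shell script, but it is the usual default-if-unset pattern, shown here in Python terms:

      ```python
      def pyspark_python(env):
          """Fall back to 'python' when PYSPARK_PYTHON is unset or empty."""
          return env.get("PYSPARK_PYTHON") or "python"
      ```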
      
      Author: Matthew Farrellee <matt@redhat.com>
      
      Closes #2300 from mattf/master-fix-python-run-tests and squashes the following commits:
      
      65a09f5 [Matthew Farrellee] Provide a default PYSPARK_PYTHON for python/run_tests
      386bc24e
    • SPARK-2978. Transformation with MR shuffle semantics · 16a73c24
      Sandy Ryza authored
      I didn't add this to the transformations list in the docs because it's kind of obscure, but would be happy to do so if others think it would be helpful.
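      The MR shuffle semantics in question: repartition by key and sort records within each partition in a single pass, as MapReduce's shuffle does. A plain-Python model of those semantics (no Spark required; the helper name is illustrative):

      ```python
      def repartition_and_sort_within_partitions(pairs, num_partitions):
          """Hash-partition (key, value) pairs by key, then sort each partition by key."""
          partitions = [[] for _ in range(num_partitions)]
          for key, value in pairs:
              partitions[hash(key) % num_partitions].append((key, value))
          return [sorted(part, key=lambda kv: kv[0]) for part in partitions]
      ```

      Pushing the sort into the shuffle machinery is more efficient than repartitioning and then sorting inside each partition as a separate step.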
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #2274 from sryza/sandy-spark-2978 and squashes the following commits:
      
      4a5332a [Sandy Ryza] Fix Java test
      c04b447 [Sandy Ryza] Fix Python doc and add back deleted code
      433ad5b [Sandy Ryza] Add Java test
      4c25a54 [Sandy Ryza] Add s at the end and a couple other fixes
      9b0ba99 [Sandy Ryza] Fix compilation
      36e0571 [Sandy Ryza] Fix import ordering
      48c12c2 [Sandy Ryza] Add Java version and additional doc
      e5381cd [Sandy Ryza] Fix python style warnings
      f147634 [Sandy Ryza] SPARK-2978. Transformation with MR shuffle semantics
      16a73c24
    • SPARK-3337 Paranoid quoting in shell to allow install dirs with spaces within. · e16a8e7d
      Prashant Sharma authored
      ...
      
      Tested! TBH, it isn't a great idea to have a directory with spaces in it: Emacs doesn't like it, then Hadoop doesn't like it, and so on...
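      The change wraps shell variable references in quotes so paths with spaces survive word splitting. The same concern arises when building shell commands from Python, where `shlex.quote` does the escaping (function name hypothetical):

      ```python
      import shlex

      def spark_submit_cmd(spark_home, jar):
          """Build a safely quoted command line even when spark_home contains spaces."""
          args = [f"{spark_home}/bin/spark-submit", jar]
          return " ".join(shlex.quote(a) for a in args)
      ```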
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #2229 from ScrapCodes/SPARK-3337/quoting-shell-scripts and squashes the following commits:
      
      d4ad660 [Prashant Sharma] SPARK-3337 Paranoid quoting in shell to allow install dirs with spaces within.
      e16a8e7d
    • [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib] DecisionTree aggregation improvements · 711356b4
      Joseph K. Bradley authored
      Summary:
      1. Variable numBins for each feature [SPARK-3043]
      2. Reduced data reshaping in aggregation [SPARK-3043]
      3. Choose ordering for ordered categorical features adaptively [SPARK-3156]
      4. Changed nodes to use 1-indexing [SPARK-3086]
      5. Small clean-ups
      
      Note: This PR looks bigger than it is since I moved several functions from inside findBestSplitsPerGroup to outside of it (to make it clear what was being serialized in the aggregation).
      
      Speedups: This update helps most when many features use few bins but a few features use many bins.  Some example results on speedups with 2M examples, 3.5K features (15-worker EC2 cluster):
      * Example where old code was reasonably efficient (1/2 continuous, 1/4 binary, 1/4 20-category): 164.813 --> 116.491 sec
      * Example where old code wasted many bins (1/10 continuous, 81/100 binary, 9/100 20-category): 128.701 --> 39.334 sec
      
      Details:
      
      (1) Variable numBins for each feature [SPARK-3043]
      
      DecisionTreeMetadata now computes a variable numBins for each feature.  It also tracks numSplits.
      
      (2) Reduced data reshaping in aggregation [SPARK-3043]
      
      Added DTStatsAggregator, a wrapper around the aggregate statistics array for easy but efficient indexing.
      * Added ImpurityAggregator and ImpurityCalculator classes, to make DecisionTree code more oblivious to the type of impurity.
      * Design note: I originally tried creating Impurity classes which stored data and storing the aggregates in an Array[Array[Array[Impurity]]].  However, this led to significant slowdowns, perhaps because of overhead in creating so many objects.
      
      The aggregate statistics are never reshaped, and cumulative sums are computed in-place.
      
      Updated the layout of aggregation functions.  The update simplifies things by (1) dividing features into ordered/unordered (instead of ordered/unordered/continuous) and (2) making use of the DTStatsAggregator for indexing.
      For this update, the following functions were refactored:
      * updateBinForOrderedFeature
      * updateBinForUnorderedFeature
      * binaryOrNotCategoricalBinSeqOp
      * multiclassWithCategoricalBinSeqOp
      * regressionBinSeqOp
      The above 5 functions were replaced with:
      * orderedBinSeqOp
      * someUnorderedBinSeqOp
      
      Other changes:
      * calculateGainForSplit now treats all feature types the same way.
      * Eliminated extractLeftRightNodeAggregates.
      
      (3) Choose ordering for ordered categorical features adaptively [SPARK-3156]
      
      Updated binsToBestSplit():
      * This now computes cumulative sums of stats for ordered features.
      * For ordered categorical features, it chooses an ordering for categories. (This used to be done by findSplitsBins.)
      * Uses iterators to shorten code and avoid building an Array[Array[InformationGainStats]].
      
      Side effects:
      * In findSplitsBins: A sample of the data is only taken for data with continuous features.  It is not needed for data with only categorical features.
      * In findSplitsBins: splits and bins are no longer pre-computed for ordered categorical features since they are not needed.
      * TreePoint binning is simpler for categorical features.
      
      (4) Changed nodes to use 1-indexing [SPARK-3086]
      
      Nodes used to be indexed from 0.  Now they are indexed from 1.
      Node indexing functions are now collected in object Node (Node.scala).
      
      (5) Small clean-ups
      
      Eliminated functions extractNodeInfo() and extractInfoForLowerLevels() to reduce duplicate code.
      Eliminated InvalidBinIndex since it is no longer used.
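      The per-feature split/bin bookkeeping behind (1) and (2) can be transcribed into a small Python sketch, consistent with the formulas quoted in the squashed commits for unordered features (numBins = 2 * numSplits = 2 * (2^(k-1) - 1)); the real accounting lives in DecisionTreeMetadata:

      ```python
      def num_splits_and_bins(arity, unordered):
          """(numSplits, numBins) for a categorical feature with `arity` categories.

          Unordered features enumerate 2^(arity-1) - 1 category subsets, and each
          split needs 2 bins; ordered features use one bin per category and one
          fewer split than bins.
          """
          if unordered:
              splits = (1 << (arity - 1)) - 1
              return splits, 2 * splits
          return arity - 1, arity
      ```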
      
      CC: mengxr  manishamde  Please let me know if you have thoughts on this—thanks!
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #2125 from jkbradley/dt-opt3alt and squashes the following commits:
      
      42c192a [Joseph K. Bradley] Merge branch 'rfs' into dt-opt3alt
      d3cc46b [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt3alt
      00e4404 [Joseph K. Bradley] optimization for TreePoint construction (pre-computing featureArity and isUnordered as arrays)
      425716c [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into rfs
      a2acea5 [Joseph K. Bradley] Small optimizations based on profiling
      aa4e4df [Joseph K. Bradley] Updated DTStatsAggregator with bug fix (nodeString should not be multiplied by statsSize)
      4651154 [Joseph K. Bradley] Changed numBins semantics for unordered features. * Before: numBins = numSplits = (1 << k - 1) - 1 * Now: numBins = 2 * numSplits = 2 * [(1 << k - 1) - 1] * This also involved changing the semantics of: ** DecisionTreeMetadata.numUnorderedBins()
      1e3b1c7 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt3alt
      1485fcc [Joseph K. Bradley] Made some DecisionTree methods private.
      92f934f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt3alt
      e676da1 [Joseph K. Bradley] Updated documentation for DecisionTree
      37ca845 [Joseph K. Bradley] Fixed problem with how DecisionTree handles ordered categorical	features.
      105f8ab [Joseph K. Bradley] Removed commented-out getEmptyBinAggregates from DecisionTree
      062c31d [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt3alt
      6d32ccd [Joseph K. Bradley] In DecisionTree.binsToBestSplit, changed loops to iterators to shorten code.
      807cd00 [Joseph K. Bradley] Finished DTStatsAggregator, a wrapper around the aggregate statistics for easy but hopefully efficient indexing.  Modified old ImpurityAggregator classes and renamed them ImpurityCalculator; added ImpurityAggregator classes which work with DTStatsAggregator but do not store data.  Unit tests all succeed.
      f2166fd [Joseph K. Bradley] still working on DTStatsAggregator
      92f7118 [Joseph K. Bradley] Added partly written DTStatsAggregator
      fd8df30 [Joseph K. Bradley] Moved some aggregation helpers outside of findBestSplitsPerGroup
      d7c53ee [Joseph K. Bradley] Added more doc for ImpurityAggregator
      a40f8f1 [Joseph K. Bradley] Changed nodes to be indexed from 1.  Tests work.
      95cad7c [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt3
      5f94342 [Joseph K. Bradley] Added treeAggregate since not yet merged from master.  Moved node indexing functions to Node.
      61c4509 [Joseph K. Bradley] Fixed bugs from merge: missing DT timer call, and numBins setting.  Cleaned up DT Suite some.
      3ba7166 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt3
      b314659 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt3
      9c83363 [Joseph K. Bradley] partial merge but not done yet
      45f7ea7 [Joseph K. Bradley] partial merge, not yet done
      5fce635 [Joseph K. Bradley] Merge branch 'dt-opt2' into dt-opt3
      26d10dd [Joseph K. Bradley] Removed tree/model/Filter.scala since no longer used.  Removed debugging println calls in DecisionTree.scala.
      356daba [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
      430d782 [Joseph K. Bradley] Added more debug info on binning error.  Added some docs.
      d036089 [Joseph K. Bradley] Print timing info to logDebug.
      e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods private
      8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own file, and cleaned it up.  Removed debugging println calls from DecisionTree.  Made TreePoint extend Serialiable
      a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
      dd4d3aa [Joseph K. Bradley] Mid-process in bug fix: bug for binary classification with categorical features * Bug: Categorical features were all treated as ordered for binary classification.  This is possible but would require the bin ordering to be determined on-the-fly after the aggregation.  Currently, the ordering is determined a priori and fixed for all splits. * (Temp) Fix: Treat low-arity categorical features as unordered for binary classification. * Related change: I removed most tests for isMulticlass in the code.  I instead test metadata for whether there are unordered features. * Status: The bug may be fixed, but more testing needs to be done.
      438a660 [Joseph K. Bradley] removed subsampling for mnist8m from DT
      86e217f [Joseph K. Bradley] added cache to DT input
      e3c84cc [Joseph K. Bradley] Added stuff fro mnist8m to D T Runner
      51ef781 [Joseph K. Bradley] Fixed bug introduced by last commit: Variance impurity calculation was incorrect since counts were swapped accidentally
      fd65372 [Joseph K. Bradley] Major changes: * Created ImpurityAggregator classes, rather than old aggregates. * Feature split/bin semantics are based on ordered vs. unordered ** E.g.: numSplits = numBins for all unordered features, and numSplits = numBins - 1 for all ordered features. * numBins can differ for each feature
      c1565a5 [Joseph K. Bradley] Small DecisionTree updates: * Simplification: Updated calculateGainForSplit to take aggregates for a single (feature, split) pair. * Internal doc: findAggForOrderedFeatureClassification
      b914f3b [Joseph K. Bradley] DecisionTree optimization: eliminated filters + small changes
      b2ed1f3 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt
      0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree
      3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging)
      f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
      bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
      511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
      a95bc22 [Joseph K. Bradley] timing for DecisionTree internals
      711356b4
    • [HOTFIX] A left over version change. It should make mima happy. · 0d1cc4ae
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #2317 from ScrapCodes/hotfix and squashes the following commits:
      
      b6472d4 [Prashant Sharma] [HOTFIX] for hotfixes, a left over version change.
      0d1cc4ae
  3. Sep 07, 2014
    • [SPARK-938][doc] Add OpenStack Swift support · eddfedda
      Reynold Xin authored
      See compiled doc at
      http://people.apache.org/~rxin/tmp/openstack-swift/_site/storage-openstack-swift.html
      
      This is based on #1010. Closes #1010.
      
      Author: Reynold Xin <rxin@apache.org>
      Author: Gil Vernik <gilv@il.ibm.com>
      
      Closes #2298 from rxin/openstack-swift and squashes the following commits:
      
      ff4e394 [Reynold Xin] Two minor comments from Patrick.
      279f6de [Reynold Xin] core-sites -> core-site
      dfb8fea [Reynold Xin] Updated based on Gil's suggestion.
      846f5cb [Reynold Xin] Added a link from overview page.
      0447c9f [Reynold Xin] Removed sample code.
      e9c3761 [Reynold Xin] Merge pull request #1010 from gilv/master
      9233fef [Gil Vernik] Fixed typos
      6994827 [Gil Vernik] Merge pull request #1 from rxin/openstack
      ac0679e [Reynold Xin] Fixed an unclosed tr.
      47ce99d [Reynold Xin] Merge branch 'master' into openstack
      cca7192 [Gil Vernik] Removed white spases from pom.xml
      99f095d [Reynold Xin] Pending openstack changes.
      eb22295 [Reynold Xin] Merge pull request #1010 from gilv/master
      39a9737 [Gil Vernik] Spark integration with Openstack Swift
      c977658 [Gil Vernik] Merge branch 'master' of https://github.com/gilv/spark
      2aba763 [Gil Vernik] Fix to docs/openstack-integration.md
      9b625b5 [Gil Vernik] Merge branch 'master' of https://github.com/gilv/spark
      eff538d [Gil Vernik] SPARK-938 - Openstack Swift object storage support
      ce483d7 [Gil Vernik] SPARK-938 - Openstack Swift object storage support
      b6c37ef [Gil Vernik] Openstack Swift support
      eddfedda
    • [SPARK-3280] Made sort-based shuffle the default implementation · f25bbbdb
      Reynold Xin authored
      Sort-based shuffle has lower memory usage and seems to outperform hash-based in almost all of our testing.
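      The change itself is a default flip on the `spark.shuffle.manager` setting; conceptually (lookup sketched in Python over a plain config dict):

      ```python
      def shuffle_manager(conf):
          """Resolve the shuffle implementation; the default is now 'sort', not 'hash'."""
          return conf.get("spark.shuffle.manager", "sort")
      ```

      Users who want the old behavior can still set the key to "hash" explicitly.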
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #2178 from rxin/sort-shuffle and squashes the following commits:
      
      713d341 [Reynold Xin] Fixed test failures by setting spark.shuffle.compress to the same value as spark.shuffle.spill.compress.
      85165e6 [Reynold Xin] Fixed a comment typo.
      aa0d372 [Reynold Xin] [SPARK-3280] Made sort-based shuffle the default implementation
      f25bbbdb
    • [HOTFIX] Fix broken Mima tests on the master branch · 4ba26735
      Josh Rosen authored
      By merging #2268, which bumped the Spark version to 1.2.0-SNAPSHOT, I inadvertently broke the Mima binary compatibility tests.  The issue is that we were comparing 1.2.0-SNAPSHOT against Spark 1.0.0 without using any Mima excludes.  The right long-term fix for this is probably to publish nightly snapshots on Maven central and change the master branch to test binary compatibility against the current release candidate branch's snapshots until that release is finalized.
      
      As a short-term fix until 1.1.0 is published on Maven central, I've configured the build to test the master branch for binary compatibility against the 1.1.0-RC4 jars.  I'll loop back and remove the Apache staging repo as soon as 1.1.0 final is available.
      
      Author: Josh Rosen <joshrosen@apache.org>
      
      Closes #2315 from JoshRosen/mima-fix and squashes the following commits:
      
      776bc2c [Josh Rosen] Add two excludes to workaround Mima annotation issues.
      ec90e21 [Josh Rosen] Add deploy and graphx to 1.2 MiMa excludes.
      57569be [Josh Rosen] Fix MiMa tests in master branch; test against 1.1.0 RC.
      4ba26735
    • Fixed typos in make-distribution.sh · 9d69a782
      Cheng Lian authored
      `hadoop.version` and `yarn.version` are properties rather than profiles, so we should use `-D` instead of `-P`.
      
      /cc pwendell
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #2121 from liancheng/fix-make-dist and squashes the following commits:
      
      4c49158 [Cheng Lian] Also mentions Hadoop version related Maven profiles
      ed5b42a [Cheng Lian] Fixed typos in make-distribution.sh
      9d69a782
    • [SPARK-3415] [PySpark] removes SerializingAdapter code · ecfa76cd
      Ward Viaene authored
      This code removes the SerializingAdapter code that was copied from PiCloud.
      
      Author: Ward Viaene <ward.viaene@bigdatapartnership.com>
      
      Closes #2287 from wardviaene/feature/pythonsys and squashes the following commits:
      
      5f0d426 [Ward Viaene] SPARK-3415: modified test class to do dump and load
      5f5d559 [Ward Viaene] SPARK-3415: modified test class name and call cloudpickle.dumps instead using StringIO
      afc4a9a [Ward Viaene] SPARK-3415: added newlines to pass lint
      aaf10b7 [Ward Viaene] SPARK-3415: removed references to SerializingAdapter and rewrote test
      65ffeff [Ward Viaene] removed duplicate test
      a958866 [Ward Viaene] SPARK-3415: test script
      e263bf5 [Ward Viaene] SPARK-3415: removes legacy SerializingAdapter code
      ecfa76cd
    • [SPARK-3408] Fixed Limit operator so it works with sort-based shuffle. · e2614038
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #2281 from rxin/sql-limit-sort and squashes the following commits:
      
      1ef7780 [Reynold Xin] [SPARK-3408] Fixed Limit operator so it works with sort-based shuffle.
      e2614038
    • [SQL] Update SQL Programming Guide · 39db1bfd
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #2258 from marmbrus/sqlDocUpdate and squashes the following commits:
      
      f3d450b [Michael Armbrust] fix brackets
      bea3bfa [Michael Armbrust] Davies suggestions
      3a29fe2 [Michael Armbrust] tighten visibility
      a71aa36 [Michael Armbrust] Draft of doc updates
      52932c0 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into sqlDocUpdate
      1e8c849 [Yin Huai] Update the example used for applySchema.
      9457c39 [Yin Huai] Update doc.
      31ba240 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeDoc
      29bc668 [Yin Huai] Draft doc for data type and schema APIs.
      39db1bfd
    • Eric Liang's avatar
      [SPARK-3394] [SQL] Fix crash in TakeOrdered when limit is 0 · 6754570d
      Eric Liang authored
      This resolves https://issues.apache.org/jira/browse/SPARK-3394
      
      Author: Eric Liang <ekl@google.com>
      
      Closes #2264 from ericl/spark-3394 and squashes the following commits:
      
      c87355b [Eric Liang] refactor
      bfb6140 [Eric Liang] change RDD takeOrdered instead
      7a51528 [Eric Liang] fix takeordered when limit = 0
      6754570d
  4. Sep 06, 2014
    • Reynold Xin's avatar
      [SPARK-3353] parent stage should have lower stage id. · 3fb57a0a
      Reynold Xin authored
      Previously, parent stages had higher stage ids even though parent stages are executed first. This pull request changes the behavior so that parent stages have lower stage ids.
      
      For example, command:
      ```scala
      sc.parallelize(1 to 10).map(x=>(x,x)).reduceByKey(_+_).count
      ```
      breaks down into 2 stages.
      
      The old web UI:
      ![screen shot 2014-09-04 at 12 42 44 am](https://cloud.githubusercontent.com/assets/323388/4146177/60fb4f42-3407-11e4-819f-853eb0e22b25.png)
      
      Web UI with this patch:
      ![screen shot 2014-09-04 at 12 44 55 am](https://cloud.githubusercontent.com/assets/323388/4146178/62e08e62-3407-11e4-867b-a36b10534464.png)
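      The new ordering can be sketched with a toy numbering pass (a hypothetical helper, not the actual DAGScheduler code): number the stage graph depth-first so every parent stage receives its id before its child does.

      ```python
      def number_stages(stage, parents, ids):
          """Assign ids so parent stages always get lower ids than children."""
          for parent in parents.get(stage, []):
              number_stages(parent, parents, ids)
          if stage not in ids:
              ids[stage] = len(ids)  # parents were numbered first, so lower ids
          return ids

      # reduceByKey inserts one shuffle dependency, so the job above
      # breaks down into a shuffle map stage and a result stage
      ids = number_stages("result", {"result": ["shuffle map"]}, {})
      ```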
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #2273 from rxin/lower-stage-id and squashes the following commits:
      
      abbb4c6 [Reynold Xin] Fixed SparkListenerSuite.
      0e02379 [Reynold Xin] Updated DAGSchedulerSuite.
      54ccea3 [Reynold Xin] [SPARK-3353] parent stage should have lower stage id.
      3fb57a0a
    • Davies Liu's avatar
      [SPARK-2334] fix AttributeError when call PipelineRDD.id() · 110fb8b2
      Davies Liu authored
      The underlying JavaRDD for a PipelinedRDD is created lazily; construction is deferred until `_jrdd` is first accessed.
      
      The id of the JavaRDD is cached as `_id`, which saves an RPC call through py4j on later calls.
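      A minimal sketch of the shape of the fix (names and structure are illustrative, not PySpark's actual code): `id()` goes through the lazy `_jrdd` property, forcing creation on first use, and caches the result.

      ```python
      class PipelinedRDD:
          def __init__(self, prev):
              self._prev = prev
              self._jrdd_val = None  # Java-side RDD, created lazily
              self._id = None        # cached id, avoids repeated py4j round trips

          @property
          def _jrdd(self):
              if self._jrdd_val is None:
                  self._jrdd_val = object()  # stands in for building the JavaRDD
              return self._jrdd_val

          def id(self):
              if self._id is None:
                  # go through _jrdd so lazy creation happens first, instead of
                  # touching a not-yet-set attribute and raising AttributeError
                  self._id = id(self._jrdd)
              return self._id
      ```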
      
      closes #1276
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2296 from davies/id and squashes the following commits:
      
      e197958 [Davies Liu] fix style
      9721716 [Davies Liu] fix id of PipelineRDD
      110fb8b2
    • GuoQiang Li's avatar
      [SPARK-3273][SPARK-3301]We should read the version information from the same place · 21a1e1bb
      GuoQiang Li authored
      Author: GuoQiang Li <witgo@qq.com>
      
      Closes #2175 from witgo/SPARK-3273 and squashes the following commits:
      
      cf9c65a [GuoQiang Li] We should read the version information from the same place
      2a44e2f [GuoQiang Li] The spark version in the welcome message of pyspark is not correct
      21a1e1bb
    • GuoQiang Li's avatar
      [SPARK-3397] Bump pom.xml version number of master branch to 1.2.0-SNAPSHOT · 607ae39c
      GuoQiang Li authored
      Author: GuoQiang Li <witgo@qq.com>
      
      Closes #2268 from witgo/SPARK-3397 and squashes the following commits:
      
      eaf913f [GuoQiang Li] Bump pom.xml version number of master branch to 1.2.0-SNAPSHOT
      607ae39c
    • Holden Karau's avatar
      Spark-3406 add a default storage level to python RDD persist API · da35330e
      Holden Karau authored
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #2280 from holdenk/SPARK-3406-Python-RDD-persist-api-does-not-have-default-storage-level and squashes the following commits:
      
      33eaade [Holden Karau] As Josh pointed out, sql also override persist. Make persist behave the same as in the underlying RDD as well
      e658227 [Holden Karau] Fix the test I added
      e95a6c5 [Holden Karau] The Python persist function did not have a default storageLevel unlike the Scala API. Noticed this issue because we got a bug report back from the book where we had documented it as if it was the same as the Scala API
      da35330e
    • Tathagata Das's avatar
      [SPARK-2419][Streaming][Docs] More updates to the streaming programming guide · baff7e93
      Tathagata Das authored
      - Improvements to the kinesis integration guide from @cfregly
      - More information about unified input dstreams in main guide
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: Chris Fregly <chris@fregly.com>
      
      Closes #2307 from tdas/streaming-doc-fix1 and squashes the following commits:
      
      ec40b5d [Tathagata Das] Updated figure with kinesis
      fdb9c5e [Tathagata Das] Fixed style issues with kinesis guide
      036d219 [Chris Fregly] updated kinesis docs and added an arch diagram
      24f622a [Tathagata Das] More modifications.
      baff7e93
    • Nicholas Chammas's avatar
      [EC2] don't duplicate default values · 0c681dd6
      Nicholas Chammas authored
      This PR makes two minor changes to the `spark-ec2` script:
      
      1. The script's input parameter default values were duplicated into the help text, which is unnecessary. This PR replaces the duplicated info with the appropriate `optparse` placeholder.
      2. The default Spark version currently has to be updated by hand for each release, a process known to be error-prone. This PR moves that default value to an easy-to-spot place.
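      The `optparse` placeholder in question is `%default`, which substitutes the option's default value into the help text so it never has to be repeated by hand. A minimal sketch (option name and version number illustrative):

      ```python
      from optparse import OptionParser

      DEFAULT_SPARK_VERSION = "1.1.0"  # illustrative; one easy-to-spot constant

      parser = OptionParser()
      parser.add_option(
          "-v", "--spark-version", default=DEFAULT_SPARK_VERSION,
          # %default is expanded by optparse, so the value is never duplicated
          help="Version of Spark to use (default: %default)")

      opts, _ = parser.parse_args([])
      ```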
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #2290 from nchammas/spark-ec2-default-version and squashes the following commits:
      
      0c6d3bb [Nicholas Chammas] don't duplicate default values
      0c681dd6
    • Reynold Xin's avatar
      [SPARK-3409][SQL] Avoid pulling in Exchange operator itself in Exchange's closures. · 1b9001f7
      Reynold Xin authored
      This is a tiny teeny optimization that moves the `if` check on sortBasedShuffledOn outside the closures, so the closures don't need to pull in the entire Exchange operator object.
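      The pattern generalizes beyond Spark SQL: a closure that reads a field through `this`/`self` captures the whole enclosing object, while hoisting the field into a local first captures only that one value. A Python illustration of the same idea (class and field names hypothetical):

      ```python
      class Exchange:
          def __init__(self, sort_based):
              self.sort_based = sort_based
              self.big_state = list(range(100_000))  # stands in for operator state

          def bad_closure(self):
              # reading self.sort_based captures `self`, i.e. the whole operator
              return lambda row: row if self.sort_based else -row

          def good_closure(self):
              sort_based = self.sort_based  # hoist the flag into a local first
              return lambda row: row if sort_based else -row
      ```

      The hoisted version's closure is free of `self`, so serializing it for shipment to executors no longer drags the operator (and its state) along.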
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #2282 from rxin/SPARK-3409 and squashes the following commits:
      
      1de3f88 [Reynold Xin] [SPARK-3409][SQL] Avoid pulling in Exchange operator itself in Exchange's closures.
      1b9001f7
    • Nicholas Chammas's avatar
      [SPARK-3361] Expand PEP 8 checks to include EC2 script and Python examples · 9422c4ee
      Nicholas Chammas authored
      This PR resolves [SPARK-3361](https://issues.apache.org/jira/browse/SPARK-3361) by expanding the PEP 8 checks to cover the remaining Python code base:
      * The EC2 script
      * All Python / PySpark examples
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #2297 from nchammas/pep8-rulez and squashes the following commits:
      
      1e5ac9a [Nicholas Chammas] PEP 8 fixes to Python examples
      c3dbeff [Nicholas Chammas] PEP 8 fixes to EC2 script
      65ef6e8 [Nicholas Chammas] expand PEP 8 checks
      9422c4ee
  5. Sep 05, 2014
    • Nicholas Chammas's avatar
      [Build] suppress curl/wget progress bars · 19f61c16
      Nicholas Chammas authored
      In the Jenkins console output, `curl` gives us mountains of `#` symbols as it tries to show its download progress.
      
      ![noise from curl in Jenkins output](http://i.imgur.com/P2E7yUw.png)
      
      I don't think this is useful so I've changed things to suppress these progress bars. If there is actually some use to this, feel free to reject this proposal.
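      The usual flags for this (general curl/wget options, not necessarily the exact invocation in the patch; URL is a placeholder):

      ```shell
      # curl: --silent turns the progress meter off, --show-error keeps
      # failures visible on stderr
      curl --silent --show-error --remote-name https://example.org/artifact.tgz

      # wget: --quiet suppresses the progress bar (and other output)
      wget --quiet https://example.org/artifact.tgz
      ```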
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #2279 from nchammas/trim-test-output and squashes the following commits:
      
      14a720c [Nicholas Chammas] suppress curl/wget progress bars
      19f61c16
    • Andrew Ash's avatar
      SPARK-3211 .take() is OOM-prone with empty partitions · ba5bcadd
      Andrew Ash authored
      Instead of jumping straight from 1 partition to all partitions, grow
      the number of partitions to attempt exponentially, doubling it each time.
      
      Fix proposed by Paul Nepywoda
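      A toy sketch of the scan schedule (plain Python over lists standing in for partitions, not PySpark's actual `take`): grow the number of partitions scanned each round instead of jumping from one partition straight to all of them, so a prefix of empty partitions no longer triggers collecting everything at once.

      ```python
      def take(partitions, num):
          taken = []
          scanned = 0       # partitions scanned so far
          num_to_scan = 1   # start with a single partition
          while len(taken) < num and scanned < len(partitions):
              for part in partitions[scanned:scanned + num_to_scan]:
                  taken.extend(part)
              scanned += num_to_scan
              num_to_scan *= 4  # the commit settled on quadrupling for speed
          return taken[:num]

      # many leading empty partitions no longer force a full scan in one jump
      parts = [[] for _ in range(10)] + [[1, 2, 3]]
      ```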
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #2117 from ash211/SPARK-3211 and squashes the following commits:
      
      8b2299a [Andrew Ash] Quadruple instead of double for a minor speedup
      e5f7e4d [Andrew Ash] Update comment to better reflect what we're doing
      09a27f7 [Andrew Ash] Update PySpark to be less OOM-prone as well
      3a156b8 [Andrew Ash] SPARK-3211 .take() is OOM-prone with empty partitions
      ba5bcadd
    • Kousuke Saruta's avatar
      [SPARK-3399][PySpark] Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR · 7ff8c45d
      Kousuke Saruta authored
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #2270 from sarutak/SPARK-3399 and squashes the following commits:
      
      7613be6 [Kousuke Saruta] Modified pyspark script to ignore environment variables YARN_CONF_DIR and HADOOP_CONF_DIR while testing
      7ff8c45d