  1. Sep 29, 2014
    • Reynold Xin's avatar
      Add more debug message for ManagedBuffer · e43c72fe
      Reynold Xin authored
      This is to help debug the error reported at http://apache-spark-user-list.1001560.n3.nabble.com/SQL-queries-fail-in-1-2-0-SNAPSHOT-td15327.html
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #2580 from rxin/buffer-debug and squashes the following commits:
      
      5814292 [Reynold Xin] Logging close() in case close() fails.
      323dfec [Reynold Xin] Add more debug message.
      e43c72fe
    • jerryshao's avatar
      [SPARK-3032][Shuffle] Fix key comparison integer overflow introduced sorting exception · dab1b0ae
      jerryshao authored
      The previous key comparison in `ExternalSorter` could produce a wrong sort order or throw an exception when the comparison overflowed; see [SPARK-3032](https://issues.apache.org/jira/browse/SPARK-3032) for details. This patch fixes the comparison and adds a unit test that reproduces the problem.
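
      To illustrate the failure mode, here is a minimal Python sketch. It assumes the comparison was done by subtracting two 32-bit integers (for example hash codes), which is a common JVM idiom; the description above does not show the actual code. The names `to_int32`, `compare_by_subtraction`, and `compare_safely` are hypothetical and not part of Spark. Python integers do not overflow, so `to_int32` emulates the JVM wrap-around.

```python
INT_MIN = -2**31
INT_MAX = 2**31 - 1


def to_int32(x):
    """Wrap a Python int to a signed 32-bit value, emulating JVM Int arithmetic."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def compare_by_subtraction(h1, h2):
    """Overflow-prone comparison: the sign is wrong when h1 - h2 wraps around."""
    return to_int32(h1 - h2)


def compare_safely(h1, h2):
    """Overflow-free comparison: returns -1, 0, or 1 without subtracting."""
    return (h1 > h2) - (h1 < h2)


# INT_MIN is clearly smaller than INT_MAX, but the subtraction wraps to a positive value.
print(compare_by_subtraction(INT_MIN, INT_MAX) > 0)
print(compare_safely(INT_MIN, INT_MAX))
```

      With a = INT_MIN, b = 0, and c = INT_MAX, the subtraction-based comparison reports a < b and b < c but a > c. That is the kind of inconsistency a sort routine may report as a comparator contract violation.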
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #2514 from jerryshao/SPARK-3032 and squashes the following commits:
      
      6f3c302 [jerryshao] Improve the unit test according to comments
      01911e6 [jerryshao] Change the test to show the contract violate exception
      83acb38 [jerryshao] Minor changes according to comments
      fa2a08f [jerryshao] Fix key comparison integer overflow introduced sorting exception
      dab1b0ae
    • Reza Zadeh's avatar
      [MLlib] [SPARK-2885] DIMSUM: All-pairs similarity · 587a0cd7
      Reza Zadeh authored
      # All-pairs similarity via DIMSUM
      Compute all pairs of similar vectors using a brute-force approach as well as the DIMSUM sampling approach.
      
      Laying down some notation: we are looking for all pairs of similar columns in an m x n RowMatrix whose entries are denoted a_ij, with the i’th row denoted r_i and the j’th column denoted c_j. There is an oversampling parameter labeled ɣ that should be set to 4 log(n)/s to get provably correct results (with high probability), where s is the similarity threshold.
      
      The algorithm is stated with a Map and Reduce, with proofs of correctness and efficiency in published papers [1] [2]. The reducer is simply the summation reducer. The mapper is more interesting, and is also the heart of the scheme. As an exercise, you should try to see why in expectation, the map-reduce below outputs cosine similarities.
      
      ![dimsumv2](https://cloud.githubusercontent.com/assets/3220351/3807272/d1d9514e-1c62-11e4-9f12-3cfdb1d78b3a.png)
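
      Since the mapper and reducer are only given in the image, the following single-machine Python sketch may help. It follows the scheme as I understand it from the cited papers: each co-occurring pair in a row is emitted with probability min(1, ɣ / (||c_j|| ||c_k||)), the reducer sums, and the sum is rescaled. The exact statement in the image may differ. The function names and the natural logarithm in `gamma_for` are assumptions, and none of this is the MLlib implementation.

```python
import math
import random
from collections import defaultdict


def gamma_for(n, s):
    """Oversampling parameter 4 log(n) / s (natural log assumed here)."""
    return 4.0 * math.log(n) / s


def column_norms(rows):
    """L2 norm of every column of a dense row-major matrix."""
    squares = [0.0] * len(rows[0])
    for row in rows:
        for j, value in enumerate(row):
            squares[j] += value * value
    return [math.sqrt(sq) for sq in squares]


def brute_force_cosine(rows):
    """Exact cosine similarity for every column pair with a nonzero dot product."""
    norms = column_norms(rows)
    n = len(norms)
    sims = {}
    for j in range(n):
        for k in range(j + 1, n):
            dot = sum(row[j] * row[k] for row in rows)
            if dot != 0.0 and norms[j] > 0.0 and norms[k] > 0.0:
                sims[(j, k)] = dot / (norms[j] * norms[k])
    return sims


def dimsum(rows, gamma, rng):
    """Sampled estimate of the same similarities."""
    norms = column_norms(rows)
    sums = defaultdict(float)
    for row in rows:  # "map" phase: one row at a time
        nonzero = [(j, v) for j, v in enumerate(row) if v != 0.0]
        for a in range(len(nonzero)):
            for b in range(a + 1, len(nonzero)):
                (j, vj), (k, vk) = nonzero[a], nonzero[b]
                keep_probability = min(1.0, gamma / (norms[j] * norms[k]))
                if rng.random() < keep_probability:
                    sums[(j, k)] += vj * vk  # "reduce" phase: plain summation
    sims = {}
    for (j, k), total in sums.items():
        denom = norms[j] * norms[k]
        # Undo the sampling rate so the expected value is the cosine similarity.
        sims[(j, k)] = total / denom if gamma > denom else total / gamma
    return sims


rows = [[1.0, 0.0, 2.0], [0.0, 3.0, 4.0], [5.0, 6.0, 0.0]]
print(brute_force_cosine(rows))
print(dimsum(rows, gamma_for(n=3, s=0.5), random.Random(0)))
```

      When ɣ is larger than every product of column norms, nothing is dropped and the estimate coincides with the brute-force result. Smaller values trade accuracy for less shuffled data.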
      
      [1] Bosagh-Zadeh, Reza and Carlsson, Gunnar (2013), Dimension Independent Matrix Square using MapReduce, arXiv:1304.1467 http://arxiv.org/abs/1304.1467
      
      [2] Bosagh-Zadeh, Reza and Goel, Ashish (2012), Dimension Independent Similarity Computation, arXiv:1206.2082 http://arxiv.org/abs/1206.2082
      
      # Testing
      
      Tests for all invocations included.
      
      Added L1 and L2 norm computation to MultivariateStatisticalSummary since it was needed. Added tests for both of them.
      
      Author: Reza Zadeh <rizlar@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1778 from rezazadeh/dimsumv2 and squashes the following commits:
      
      404c64c [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2
      4eb71c6 [Reza Zadeh] Add excludes for normL1 and normL2
      ee8bd65 [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2
      976ddd4 [Reza Zadeh] Broadcast colMags. Avoid div by zero.
      3467cff [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2
      aea0247 [Reza Zadeh] Allow large thresholds to promote sparsity
      9fe17c0 [Xiangrui Meng] organize imports
      2196ba5 [Xiangrui Meng] Merge branch 'rezazadeh-dimsumv2' into dimsumv2
      254ca08 [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2
      f2947e4 [Xiangrui Meng] some optimization
      3c4cf41 [Xiangrui Meng] Merge branch 'master' into rezazadeh-dimsumv2
      0e4eda4 [Reza Zadeh] Use partition index for RNG
      251bb9c [Reza Zadeh] Documentation
      25e9d0d [Reza Zadeh] Line length for style
      fb296f6 [Reza Zadeh] renamed to normL1 and normL2
      3764983 [Reza Zadeh] Documentation
      e9c6791 [Reza Zadeh] New interface and documentation
      613f261 [Reza Zadeh] Column magnitude summary
      75a0b51 [Reza Zadeh] Use Ints instead of Longs in the shuffle
      0f12ade [Reza Zadeh] Style changes
      eb1dc20 [Reza Zadeh] Use Double.PositiveInfinity instead of Double.Max
      f56a882 [Reza Zadeh] Remove changes to MultivariateOnlineSummarizer
      dbc55ba [Reza Zadeh] Make colMagnitudes a method in RowMatrix
      41e8ece [Reza Zadeh] style changes
      139c8e1 [Reza Zadeh] Syntax changes
      029aa9c [Reza Zadeh] javadoc and new test
      75edb25 [Reza Zadeh] All tests passing!
      05e59b8 [Reza Zadeh] Add test
      502ce52 [Reza Zadeh] new interface
      654c4fb [Reza Zadeh] default methods
      3726ca9 [Reza Zadeh] Remove MatrixAlgebra
      6bebabb [Reza Zadeh] remove changes to MatrixSuite
      5b8cd7d [Reza Zadeh] Initial files
      587a0cd7
    • Nicholas Chammas's avatar
      [EC2] Sort long, manually-inputted dictionaries · aedd251c
      Nicholas Chammas authored
      Similar to the work done in #2571, this PR just sorts the remaining manually-inputted dicts in the EC2 script so they are easier to maintain.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #2578 from nchammas/ec2-dict-sort and squashes the following commits:
      
      f55c692 [Nicholas Chammas] sort long dictionaries
      aedd251c
    • Zhang, Liye's avatar
      [CORE] Bugfix: LogErr format in DAGScheduler.scala · 657bdff4
      Zhang, Liye authored
      Author: Zhang, Liye <liye.zhang@intel.com>
      
      Closes #2572 from liyezhang556520/DAGLogErr and squashes the following commits:
      
      5be2491 [Zhang, Liye] Bugfix: LogErr format in DAGScheduler.scala
      657bdff4
  2. Sep 28, 2014
    • Nicholas Chammas's avatar
      [EC2] Cleanup Python parens and disk dict · 1651cc11
      Nicholas Chammas authored
      Minor fixes:
      * Remove unnecessary parens (Python style)
      * Sort `disks_by_instance` dict and remove duplicate `t1.micro` key
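
      As background on why the duplicate key matters: in Python, a dict literal with a repeated key is not an error, and the later entry silently wins. A small sketch with made-up values (the real contents of `disks_by_instance` are not reproduced here):

```python
# Hypothetical values for illustration only; not the real EC2 disk counts.
disks_by_instance = {
    "m1.small": 1,
    "t1.micro": 0,
    "t1.micro": 1,  # duplicate key: this entry silently replaces the one above
}

# Listing the keys in sorted order makes an accidental duplicate easy to spot.
for instance_type in sorted(disks_by_instance):
    print(instance_type, disks_by_instance[instance_type])
```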
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #2571 from nchammas/ec2-polish and squashes the following commits:
      
      9d203d5 [Nicholas Chammas] paren and dict cleanup
      1651cc11
    • Joseph K. Bradley's avatar
      [SPARK-1545] [mllib] Add Random Forests · 0dc2b636
      Joseph K. Bradley authored
      This PR adds RandomForest to MLlib.  The implementation is basic, and future performance optimizations will be important.  (Note: RFs = Random Forests.)
      
      # Overview
      
      ## RandomForest
      * trains multiple trees at once to reduce the number of passes over the data
      * allows feature subsets at each node
      * uses a queue of nodes instead of fixed groups for each level
      
      This implementation is based on an implementation by manishamde and the [Alpine Labs Sequoia Forest](https://github.com/AlpineNow/SparkML2) by codedeft (in particular, the TreePoint, BaggedPoint, and node queue implementations).  Thank you for your inputs!
      
      ## Testing
      
      Correctness: This has been tested for correctness with the test suites and with DecisionTreeRunner on example datasets.
      
      Performance: This has been performance tested using [this branch of spark-perf](https://github.com/jkbradley/spark-perf/tree/rfs).  Results below.
      
      ### Regression tests for DecisionTree
      
      Summary: For training 1 tree, there are small regressions, especially from feature subsampling.
      
      In the table below, each row is a single (random) dataset.  The 2 different sets of result columns are for 2 different RF implementations:
      * (numTrees): This is from an earlier commit, after implementing RandomForest to train multiple trees at once.  It does not include any code for feature subsampling.
      * (feature subsets): This is from this current PR's code, after implementing feature subsampling.
      These tests were to identify regressions in DecisionTree, so they are training 1 tree with all of the features (i.e., no feature subsampling).
      
      These were run on an EC2 cluster with 15 workers, training 1 tree with maxDepth = 5 (= 6 levels).  Speedup values < 1 indicate slowdowns from the old DecisionTree implementation.
      
      numInstances | numFeatures | runtime (sec), numTrees | speedup, numTrees | runtime (sec), feature subsets | speedup, feature subsets
      ---- | ---- | ---- | ---- | ---- | ----
      20000 | 100 | 4.051 | 1.044433473 | 4.478 | 0.9448414471
      20000 | 500 | 8.472 | 1.104461756 | 9.315 | 1.004508857
      20000 | 1500 | 19.354 | 1.05854087 | 20.863 | 0.9819776638
      20000 | 3500 | 43.674 | 1.072033704 | 45.887 | 1.020332556
      200000 | 100 | 4.196 | 1.171830315 | 4.848 | 1.014232673
      200000 | 500 | 8.926 | 1.082791844 | 9.771 | 0.989151571
      200000 | 1500 | 20.58 | 1.068415938 | 22.134 | 0.9934038131
      200000 | 3500 | 48.043 | 1.075203464 | 52.249 | 0.9886505005
      2000000 | 100 | 4.944 | 1.01355178 | 5.796 | 0.8645617667
      2000000 | 500 | 11.11 | 1.016831683 | 12.482 | 0.9050632911
      2000000 | 1500 | 31.144 | 1.017852556 | 35.274 | 0.8986789136
      2000000 | 3500 | 79.981 | 1.085382778 | 101.105 | 0.8586123337
      20000000 | 100 | 8.304 | 0.9270231214 | 9.073 | 0.8484514494
      20000000 | 500 | 28.174 | 1.083268262 | 34.236 | 0.8914592826
      20000000 | 1500 | 143.97 | 0.9579634646 | 159.275 | 0.8659111599
      
      ### Tests for forests
      
      I have run other tests with numTrees=10 and with sqrt(numFeatures), and those indicate that multi-model training and feature subsets can speed up training for forests, especially when training deeper trees.
      
      # Details on specific classes
      
      ## Changes to DecisionTree
      * Main train() method is now in RandomForest.
      * findBestSplits() is no longer needed.  (It split levels into groups, but we now use a queue of nodes.)
      * Many small changes to support RFs.  (Note: These methods should be moved to RandomForest.scala in a later PR, but are in DecisionTree.scala to make code comparison easier.)
      
      ## RandomForest
      * Main train() method is from old DecisionTree.
      * selectNodesToSplit: Note that it selects nodes and feature subsets jointly to track memory usage.
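
      A rough Python sketch of the idea behind selecting nodes and feature subsets jointly: nodes are taken from one shared queue until an estimated memory budget would be exceeded. The cost model (features times bins), the parameter names, and the function itself are assumptions for illustration, not the code in this PR.

```python
import random
from collections import deque


def select_nodes_to_split(node_queue, num_features, subset_size,
                          bins_per_feature, max_memory_units, rng):
    """Pop (tree_index, node_id) pairs until the memory budget is used up.

    Each selected node gets its own random feature subset, and the budget is
    charged for that subset only, so smaller subsets let more nodes fit.
    """
    selected = []
    used = 0
    while node_queue:
        features = sorted(rng.sample(range(num_features), subset_size))
        cost = len(features) * bins_per_feature  # assumed cost model
        # Always take at least one node so training makes progress.
        if selected and used + cost > max_memory_units:
            break
        tree_index, node_id = node_queue.popleft()
        selected.append((tree_index, node_id, features))
        used += cost
    return selected, used


queue = deque((tree, 1) for tree in range(5))  # root node of each of 5 trees
group, used = select_nodes_to_split(queue, num_features=10, subset_size=2,
                                    bins_per_feature=10, max_memory_units=50,
                                    rng=random.Random(1))
print(len(group), used, len(queue))
```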
      
      ## RandomForestModel
      * Stores an Array[DecisionTreeModel]
      * Prediction:
       * For classification, most common label.  For regression, mean.
       * We could support other methods later.
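
      The prediction rule can be sketched in a few lines of Python. Trees are modelled here as plain callables, and `predict_forest` is a hypothetical name rather than the RandomForestModel API.

```python
from collections import Counter
from statistics import mean


def predict_forest(trees, features, task):
    """Combine per-tree predictions: majority vote or mean."""
    predictions = [tree(features) for tree in trees]
    if task == "classification":
        # Most common label among the trees.
        return Counter(predictions).most_common(1)[0][0]
    if task == "regression":
        return mean(predictions)
    raise ValueError("unknown task: %r" % task)


# Three toy "trees" that threshold the first feature differently.
classifiers = [lambda x: int(x[0] > 0.2), lambda x: int(x[0] > 0.5), lambda x: int(x[0] > 0.8)]
regressors = [lambda x: 1.0, lambda x: 2.0, lambda x: 3.0]
print(predict_forest(classifiers, [0.6], "classification"))
print(predict_forest(regressors, [0.6], "regression"))
```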
      
      ## examples/.../DecisionTreeRunner
      * This now takes numTrees and featureSubsetStrategy, to support RFs.
      
      ## DTStatsAggregator
      * 2 types of functionality (w/ and w/o subsampling features): These require different indexing methods.  (We could treat both as subsampling, but this is less efficient.)
        DTStatsAggregator is now abstract, and 2 child classes implement these 2 types of functionality.
      
      ## impurities
      * These now take instance weights.
      
      ## Node
      * Some vals changed to vars.
       * This is unfortunately a public API change (DeveloperApi).  This could be avoided by creating a LearningNode struct, but would be awkward.
      
      ## RandomForestSuite
      Please let me know if there are missing tests!
      
      ## BaggedPoint
      This wraps TreePoint and holds bootstrap weights/counts.
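
      As an illustration of what such a wrapper might hold, here is a Python sketch that draws a classic bootstrap sample (n draws with replacement) per tree and stores the per-tree counts next to each point. The field names and the sampling scheme are assumptions; the PR's BaggedPoint may sample differently.

```python
import random
from dataclasses import dataclass
from typing import Any, List


@dataclass
class BaggedPoint:
    datum: Any                   # the wrapped point (a TreePoint in the PR)
    subsample_counts: List[int]  # how often this point appears in each tree's sample


def bag(data, num_trees, rng):
    """Attach bootstrap counts for every tree to every data point."""
    n = len(data)
    counts = [[0] * num_trees for _ in range(n)]
    for tree in range(num_trees):
        for _ in range(n):  # n draws with replacement for this tree
            counts[rng.randrange(n)][tree] += 1
    return [BaggedPoint(datum, c) for datum, c in zip(data, counts)]


for point in bag(["a", "b", "c", "d"], num_trees=3, rng=random.Random(0)):
    print(point)
```

      Keeping the counts beside the point lets one pass over the data serve all trees, instead of materialising one resampled dataset per tree.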
      
      # Design decisions
      
      * BaggedPoint: BaggedPoint is separate from TreePoint since it may be useful for other bagging algorithms later on.
      
      * RandomForest public API: What options should be easily supported by the train* methods?  Should ALL options be in the Java-friendly constructors?  Should there be a constructor taking Strategy?
      
      * Feature subsampling options: What options should be supported?  scikit-learn supports the same options, except for "onethird."  One option would be to allow users to specify fractions ("0.1"): the current options could be supported, and any unrecognized values would be parsed as Doubles in [0,1].
      
      * Splits and bins are computed before bootstrapping, so all trees use the same discretization.
      
      * One queue, instead of one queue per tree.
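
      The fraction idea from the feature subsampling question above could look roughly like this in Python: named strategies are resolved first, and anything else is parsed as a fraction in [0, 1]. The helper name, the exact set of named options, and the rounding are assumptions, not this PR's behavior.

```python
import math


def num_features_per_node(strategy, num_features):
    """Translate a featureSubsetStrategy-like string into a feature count."""
    if num_features < 1:
        raise ValueError("num_features must be at least 1")
    named = {
        "all": num_features,
        "sqrt": math.ceil(math.sqrt(num_features)),
        "log2": max(1, math.ceil(math.log2(num_features))),
        "onethird": math.ceil(num_features / 3.0),
    }
    if strategy in named:
        return named[strategy]
    fraction = float(strategy)  # raises ValueError for non-numeric strings
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]: %r" % strategy)
    return max(1, math.ceil(fraction * num_features))


for option in ["all", "sqrt", "log2", "onethird", "0.5"]:
    print(option, num_features_per_node(option, 100))
```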
      
      CC: mengxr manishamde codedeft chouqin  Please let me know if you have suggestions---thanks!
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      Author: qiping.lqp <qiping.lqp@alibaba-inc.com>
      Author: chouqin <liqiping1991@gmail.com>
      
      Closes #2435 from jkbradley/rfs-new and squashes the following commits:
      
      c694174 [Joseph K. Bradley] Fixed typo
      cc59d78 [Joseph K. Bradley] fixed imports
      e25909f [Joseph K. Bradley] Simplified node group maps.  Specifically, created NodeIndexInfo to store node index in agg and feature subsets, and no longer create extra maps in findBestSplits
      fbe9a1e [Joseph K. Bradley] Changed default featureSubsetStrategy to be sqrt for classification, onethird for regression.  Updated docs with references.
      ef7c293 [Joseph K. Bradley] Updates based on code review.  Most substantial changes: * Simplified DTStatsAggregator * Made RandomForestModel.trees public * Added test for regression to RandomForestSuite
      593b13c [Joseph K. Bradley] Fixed bug in metadata for computing log2(num features).  Now it checks >= 1.
      a1a08df [Joseph K. Bradley] Removed old comments
      866e766 [Joseph K. Bradley] Changed RandomForestSuite randomized tests to use multiple fixed random seeds.
      ff8bb96 [Joseph K. Bradley] removed usage of null from RandomForest and replaced with Option
      bf1a4c5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into rfs-new
      6b79c07 [Joseph K. Bradley] Added RandomForestSuite, and fixed small bugs, style issues.
      d7753d4 [Joseph K. Bradley] Added numTrees and featureSubsetStrategy to DecisionTreeRunner (to support RandomForest).  Fixed bugs so that RandomForest now runs.
      746d43c [Joseph K. Bradley] Implemented feature subsampling.  Tested DecisionTree but not RandomForest.
      6309d1d [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into rfs-new.  Added RandomForestModel.toString
      b7ae594 [Joseph K. Bradley] Updated docs.  Small fix for bug which does not cause errors: No longer allocate unused child nodes for leaf nodes.
      121c74e [Joseph K. Bradley] Basic random forests are implemented.  Random features per node not yet implemented.  Test suite not implemented.
      325d18a [Joseph K. Bradley] Merge branch 'chouqin-dt-preprune' into rfs-new
      4ef9bf1 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into rfs-new
      61b2e72 [Joseph K. Bradley] Added max of 10GB for maxMemoryInMB in Strategy.
      a95e7c8 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune
      6da8571 [Joseph K. Bradley] RFs partly implemented, not done yet
      eddd1eb [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into rfs-new
      5c4ac33 [Joseph K. Bradley] Added check in Strategy to make sure minInstancesPerNode >= 1
      0dd4d87 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-spark-3160
      95c479d [Joseph K. Bradley] * Fixed typo in tree suite test "do not choose split that does not satisfy min instance per node requirements" * small style fixes
      e2628b6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune
      19b01af [Joseph K. Bradley] Merge remote-tracking branch 'chouqin/dt-preprune' into chouqin-dt-preprune
      f1d11d1 [chouqin] fix typo
      c7ebaf1 [chouqin] fix typo
      39f9b60 [chouqin] change edge `minInstancesPerNode` to 2 and add one more test
      c6e2dfc [Joseph K. Bradley] Added minInstancesPerNode and minInfoGain parameters to DecisionTreeRunner.scala and to Python API in tree.py
      306120f [Joseph K. Bradley] Fixed typo in DecisionTreeModel.scala doc
      eaa1dcf [Joseph K. Bradley] Added topNode doc in DecisionTree and scalastyle fix
      d4d7864 [Joseph K. Bradley] Marked Node.build as deprecated
      d4dbb99 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-spark-3160
      1a8f0ad [Joseph K. Bradley] Eliminated pre-allocated nodes array in main train() method. * Nodes are constructed and added to the tree structure as needed during training.
      0278a11 [chouqin] remove `noSplit` and set `Predict` private to tree
      d593ec7 [chouqin] fix docs and change minInstancesPerNode to 1
      2ab763b [Joseph K. Bradley] Simplifications to DecisionTree code:
      efcc736 [qiping.lqp] fix bug
      10b8012 [qiping.lqp] fix style
      6728fad [qiping.lqp] minor fix: remove empty lines
      bb465ca [qiping.lqp] Merge branch 'master' of https://github.com/apache/spark into dt-preprune
      cadd569 [qiping.lqp] add api docs
      46b891f [qiping.lqp] fix bug
      e72c7e4 [qiping.lqp] add comments
      845c6fa [qiping.lqp] fix style
      f195e83 [qiping.lqp] fix style
      987cbf4 [qiping.lqp] fix bug
      ff34845 [qiping.lqp] separate calculation of predict of node from calculation of info gain
      ac42378 [qiping.lqp] add min info gain and min instances per node parameters in decision tree
      0dc2b636
    • Reynold Xin's avatar
      [SPARK-3543] TaskContext remaining cleanup work. · f350cd30
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #2560 from rxin/TaskContext and squashes the following commits:
      
      9eff95a [Reynold Xin] [SPARK-3543] remaining cleanup work.
      f350cd30
    • Jim Lim's avatar
      SPARK-2761 refactor #maybeSpill into Spillable · 25164a89
      Jim Lim authored
      Moved `#maybeSpill` from ExternalSorter and ExternalAppendOnlyMap (EAOM) into `Spillable`.
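
      The shape of this refactoring can be sketched in Python: the "should we spill now?" check lives once in a shared base class, and each collection only supplies its own spill routine. The threshold logic, the class names other than `Spillable`, and the use of element count as "memory" are simplifying assumptions, not Spark's actual behavior.

```python
from abc import ABC, abstractmethod


class Spillable(ABC):
    """Shared spill decision; subclasses say how to spill."""

    def __init__(self, memory_threshold):
        self._memory_threshold = memory_threshold
        self.spill_count = 0

    @abstractmethod
    def _spill(self, collection):
        """Write the in-memory collection out and empty it."""

    def maybe_spill(self, collection, current_memory):
        """Spill if the (estimated) memory use reached the threshold."""
        if current_memory >= self._memory_threshold:
            self._spill(collection)
            self.spill_count += 1
            return True
        return False


class SortedRunBuffer(Spillable):
    """Toy stand-in for a sorter that spills sorted runs."""

    def __init__(self, memory_threshold):
        super().__init__(memory_threshold)
        self.buffer = []
        self.spilled_runs = []  # stands in for files on disk

    def _spill(self, collection):
        self.spilled_runs.append(sorted(collection))
        collection.clear()

    def insert(self, value):
        self.buffer.append(value)
        # Element count is used as a crude memory estimate in this sketch.
        self.maybe_spill(self.buffer, current_memory=len(self.buffer))


sorter = SortedRunBuffer(memory_threshold=3)
for value in [5, 1, 4, 2]:
    sorter.insert(value)
print(sorter.spilled_runs, sorter.buffer)
```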
      
      Author: Jim Lim <jim@quixey.com>
      
      Closes #2416 from jimjh/SPARK-2761 and squashes the following commits:
      
      cf8be9a [Jim Lim] SPARK-2761 fix documentation, reorder code
      f94d522 [Jim Lim] SPARK-2761 refactor Spillable to simplify sig
      e75a24e [Jim Lim] SPARK-2761 use protected over protected[this]
      7270e0d [Jim Lim] SPARK-2761 refactor #maybeSpill into Spillable
      25164a89
    • Reynold Xin's avatar
      Revert "[SPARK-1021] Defer the data-driven computation of partition bounds in so..." · 8e874185
      Reynold Xin authored
      This reverts commit 2d972fd8.
      
      The commit was causing correlationoptimizer14 to hang.
      8e874185
    • WangTaoTheTonic's avatar
      [SPARK-3715][Docs]minor typo · 1f13a40c
      WangTaoTheTonic authored
      https://issues.apache.org/jira/browse/SPARK-3715
      
      Author: WangTaoTheTonic <barneystinson@aliyun.com>
      
      Closes #2567 from WangTaoTheTonic/minortypo and squashes the following commits:
      
      9cc3f7a [WangTaoTheTonic] minor typo
      1f13a40c
    • William Benton's avatar
      SPARK-3699: SQL and Hive console tasks now clean up appropriately · 6918012d
      William Benton authored
      The sbt tasks sql/console and hive/console will now `stop()`
      the `SparkContext` upon exit.  Previously, they left an ugly stack
      trace when quitting.
      
      Author: William Benton <willb@redhat.com>
      
      Closes #2547 from willb/consoleCleanup and squashes the following commits:
      
      d5e431f [William Benton] SQL and Hive console tasks now clean up.
      6918012d
    • Reynold Xin's avatar
      Minor fix for the previous commit. · 66e1c40c
      Reynold Xin authored
      66e1c40c
    • Dale's avatar
      SPARK-CORE [SPARK-3651] Group common CoarseGrainedSchedulerBackend variables together · 9966d1a8
      Dale authored
      from [SPARK-3651]
      In CoarseGrainedSchedulerBackend, we have:
      
          private val executorActor = new HashMap[String, ActorRef]
          private val executorAddress = new HashMap[String, Address]
          private val executorHost = new HashMap[String, String]
          private val freeCores = new HashMap[String, Int]
          private val totalCores = new HashMap[String, Int]
      
      We only ever put / remove stuff from these maps together. It would simplify the code if we consolidate these all into one map as we have done in JobProgressListener in https://issues.apache.org/jira/browse/SPARK-2299.
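
      The consolidation can be illustrated with a short Python sketch: the five parallel maps become one map from executor ID to a single record. Field names mirror the Scala vals above; the helper functions and sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class ExecutorData:
    executor_actor: Any
    executor_address: str
    executor_host: str
    free_cores: int
    total_cores: int


executor_data_map: Dict[str, ExecutorData] = {}


def register_executor(executor_id, actor, address, host, cores):
    """One insert replaces five separate puts that had to stay in sync."""
    executor_data_map[executor_id] = ExecutorData(actor, address, host, cores, cores)


def remove_executor(executor_id):
    """One removal; no risk of leaving a stale entry in a sibling map."""
    return executor_data_map.pop(executor_id, None)


register_executor("exec-1", actor=object(), address="10.0.0.5:7078", host="worker-1", cores=4)
executor_data_map["exec-1"].free_cores -= 1  # a task was launched
print(executor_data_map["exec-1"].free_cores, executor_data_map["exec-1"].total_cores)
```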
      
      Author: Dale <tigerquoll@outlook.com>
      
      Closes #2533 from tigerquoll/SPARK-3651 and squashes the following commits:
      
      d1be0a9 [Dale] [SPARK-3651]  implemented suggested changes. Changed a reference from executorInfo to executorData to be consistent with other usages
      6890663 [Dale] [SPARK-3651]  implemented suggested changes
      7d671cf [Dale] [SPARK-3651]  Grouped variables under a ExecutorDataObject, and reference them via a map entry as they are all retrieved under the same key
      9966d1a8
  3. Sep 27, 2014
    • Uri Laserson's avatar
      [SPARK-3389] Add Converter for ease of Parquet reading in PySpark · 24823293
      Uri Laserson authored
      https://issues.apache.org/jira/browse/SPARK-3389
      
      Author: Uri Laserson <laserson@cloudera.com>
      
      Closes #2256 from laserson/SPARK-3389 and squashes the following commits:
      
      0ed363e [Uri Laserson] PEP8'd the python file
      0b4b380 [Uri Laserson] Moved converter to examples and added python example
      eecf4dc [Uri Laserson] [SPARK-3389] Add Converter for ease of Parquet reading in PySpark
      24823293
    • Reynold Xin's avatar
      [SPARK-3543] Clean up Java TaskContext implementation. · 5b922bb4
      Reynold Xin authored
      This addresses some minor issues in https://github.com/apache/spark/pull/2425
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #2557 from rxin/TaskContext and squashes the following commits:
      
      a51e5f6 [Reynold Xin] [SPARK-3543] Clean up Java TaskContext implementation.
      5b922bb4
    • Davies Liu's avatar
      [SPARK-3681] [SQL] [PySpark] fix serialization of List and Map in SchemaRDD · 0d8cdf0e
      Davies Liu authored
      Currently, the schema of objects in an ArrayType or MapType is attached lazily. This performs better but causes problems during serialization and when accessing nested objects.
      
      This patch applies the schema to the objects of an ArrayType or MapType immediately when they are accessed. This is slightly slower but much more robust.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2526 from davies/nested and squashes the following commits:
      
      2399ae5 [Davies Liu] fix serialization of List and Map in SchemaRDD
      0d8cdf0e
    • Michael Armbrust's avatar
      [SPARK-3680][SQL] Fix bug caused by eager typing of HiveGenericUDFs · f0c7e195
      Michael Armbrust authored
      Typing of UDFs should be lazy as it is often not valid to call `dataType` on an expression until after all of its children are `resolved`.
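
      The difference between eager and lazy typing can be shown with a toy Python model. The classes below are hypothetical stand-ins, not Catalyst or Hive classes: the eager variant asks its child for a type in the constructor and fails while the child is unresolved, whereas the lazy variant defers the question until the type is actually needed.

```python
from functools import cached_property


class Attribute:
    """A child expression whose type is unknown until it is resolved."""

    def __init__(self):
        self.resolved = False
        self._data_type = None

    def resolve(self, data_type):
        self._data_type = data_type
        self.resolved = True

    @property
    def data_type(self):
        if not self.resolved:
            raise RuntimeError("data_type requested before resolution")
        return self._data_type


class EagerUdf:
    def __init__(self, children):
        self.children = children
        # Computed immediately: fails if any child is still unresolved.
        self.data_type = "array<%s>" % children[0].data_type


class LazyUdf:
    def __init__(self, children):
        self.children = children

    @cached_property
    def data_type(self):
        # Computed on first access, after the children have been resolved.
        return "array<%s>" % self.children[0].data_type


child = Attribute()
udf = LazyUdf([child])   # fine: nothing is typed yet
child.resolve("string")  # analysis resolves the child later
print(udf.data_type)
```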
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2525 from marmbrus/concatBug and squashes the following commits:
      
      5b8efe7 [Michael Armbrust] fix bug with eager typing of udfs
      f0c7e195
    • w00228970's avatar
      [SPARK-3676][SQL] Fix hive test suite failure due to diffs in JDK 1.6/1.7 · 08008810
      w00228970 authored
      This is a bug in JDK6: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4428022
      
      Different JDK versions format the same ```double``` differently. For example, ```System.out.println(1/500d)``` prints:
      jdk 1.6.0(_31) ---- 0.0020
      jdk 1.7.0(_05) ---- 0.002
      As a result, HiveQuerySuite fails when the golden answers are generated with JDK 1.7 and the tests are run with JDK 1.6, because the results do not match.
      
      Author: w00228970 <wangfei1@huawei.com>
      
      Closes #2517 from scwf/HiveQuerySuite and squashes the following commits:
      
      0cb5e8d [w00228970] delete golden answer of division-0 and timestamp cast #1
      1df3964 [w00228970] Jdk version leads to different query output for Double, this make HiveQuerySuite failed
      08008810
    • CrazyJvm's avatar
      Docs : use "--total-executor-cores" rather than "--cores" after spark-shell · 66107f46
      CrazyJvm authored
      Author: CrazyJvm <crazyjvm@gmail.com>
      
      Closes #2540 from CrazyJvm/standalone-core and squashes the following commits:
      
      66d9fc6 [CrazyJvm] use "--total-executor-cores" rather than "--cores" after spark-shell
      66107f46
    • Reynold Xin's avatar
      Minor cleanup to tighten visibility and remove compilation warning. · 436a7730
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #2555 from rxin/cleanup and squashes the following commits:
      
      6add199 [Reynold Xin] Minor cleanup to tighten visibility and remove compilation warning.
      436a7730
    • Erik Erlandson's avatar
      [SPARK-1021] Defer the data-driven computation of partition bounds in so... · 2d972fd8
      Erik Erlandson authored
      ...rtByKey() until evaluation.
      
      Author: Erik Erlandson <eerlands@redhat.com>
      
      Closes #1689 from erikerlandson/spark-1021-pr and squashes the following commits:
      
      50b6da6 [Erik Erlandson] use standard getIteratorSize in countAsync
      4e334a9 [Erik Erlandson] exception mystery fixed by fixing bug in ComplexFutureAction
      b88b5d4 [Erik Erlandson] tweak async actions to use ComplexFutureAction[T] so they handle RangePartitioner sampling job properly
      b2b20e8 [Erik Erlandson] Fix bug in exception passing with ComplexFutureAction[T]
      ca8913e [Erik Erlandson] RangePartition sampling job -> FutureAction
      7143f97 [Erik Erlandson] [SPARK-1021] modify range bounds variable to be thread safe
      ac67195 [Erik Erlandson] [SPARK-1021] Defer the data-driven computation of partition bounds in sortByKey() until evaluation.
      2d972fd8
    • Jeff Steinmetz's avatar
      stop, start and destroy require the EC2_REGION · 9e8ced78
      Jeff Steinmetz authored
      For example:
      ./spark-ec2 --region=us-west-1 stop yourclustername
      
      Author: Jeff Steinmetz <jeffrey.steinmetz@gmail.com>
      
      Closes #2473 from jeffsteinmetz/master and squashes the following commits:
      
      7491f2c [Jeff Steinmetz] fix case in EC2 cluster setup documentation
      bd3d777 [Jeff Steinmetz] standardized ec2 documenation to use <lower-case> sample args
      2bf4a57 [Jeff Steinmetz] standardized ec2 documenation to use <lower-case> sample args
      68d8372 [Jeff Steinmetz] standardized ec2 documenation to use <lower-case> sample args
      d2ab6e2 [Jeff Steinmetz] standardized ec2 documenation to use <lower-case> sample args
      520e6dc [Jeff Steinmetz] standardized ec2 documenation to use <lower-case> sample args
      37fc876 [Jeff Steinmetz] stop, start and destroy require the EC2_REGION
      9e8ced78
    • Michael Armbrust's avatar
      [SPARK-3675][SQL] Allow starting a JDBC server on an existing context · d8a9d1d4
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2515 from marmbrus/jdbcExistingContext and squashes the following commits:
      
      7866fad [Michael Armbrust] Allows starting a JDBC server on an existing context.
      d8a9d1d4
    • Michael Armbrust's avatar
      [SQL][DOCS] Clarify that the server is for JDBC and ODBC · f0eea76d
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2527 from marmbrus/patch-1 and squashes the following commits:
      
      a0f9f1c [Michael Armbrust] [SQL][DOCS] Clarify that the server is for JDBC and ODBC
      f0eea76d
    • wangfei's avatar
      [Build]remove spark-staging-1030 · 0cdcdd2c
      wangfei authored
      Since 1.1.0 has been published, remove spark-staging-1030.
      
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #2532 from scwf/patch-2 and squashes the following commits:
      
      bc9e00b [wangfei] remove spark-staging-1030
      0cdcdd2c
    • Sarah Gerweck's avatar
      Slaves file is now a template. · e976ca23
      Sarah Gerweck authored
      Change 0dc868e7 removed the `conf/slaves` file and made it a template like most of the other configuration files. This means you can no longer run `make-distribution.sh` unless you manually create a slaves file to be statically bundled in your distribution, which seems at odds with making it a template file.
      
      Author: Sarah Gerweck <sarah.a180@gmail.com>
      
      Closes #2549 from sarahgerweck/noMoreSlaves and squashes the following commits:
      
      d11d99a [Sarah Gerweck] Slaves file is now a template.
      e976ca23
  4. Sep 26, 2014
    • Reynold Xin's avatar
      Close #2194. · a3feaf04
      Reynold Xin authored
      a3feaf04
    • Prashant Sharma's avatar
      [SPARK-3543] Write TaskContext in Java and expose it through a static accessor. · 5e34855c
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Author: Shashank Sharma <shashank21j@gmail.com>
      
      Closes #2425 from ScrapCodes/SPARK-3543/withTaskContext and squashes the following commits:
      
      8ae414c [Shashank Sharma] CR
      ee8bd00 [Prashant Sharma] Added internal API in docs comments.
      ddb8cbe [Prashant Sharma] Moved setting the thread local to where TaskContext is instantiated.
      a7d5e23 [Prashant Sharma] Added doc comments.
      edf945e [Prashant Sharma] Code review git add -A
      f716fd1 [Prashant Sharma] introduced thread local for getting the task context.
      333c7d6 [Prashant Sharma] Translated Task context from scala to java.
      5e34855c
    • Josh Rosen's avatar
      Revert "[SPARK-3478] [PySpark] Profile the Python tasks" · f872e4fb
      Josh Rosen authored
      This reverts commit 1aa549ba.
      f872e4fb
    • Cheng Hao's avatar
      [SPARK-3393] [SQL] Align the log4j configuration for Spark & SparkSQLCLI · 7364fa5a
      Cheng Hao authored
      Users may be confused by the HQL logging and configuration, so we should provide default templates.
      
      Both files are copied from Hive.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #2263 from chenghao-intel/hive_template and squashes the following commits:
      
      53bffa9 [Cheng Hao] Remove the hive-log4j.properties initialization
      7364fa5a
    • Daoyuan Wang's avatar
      [SPARK-3531][SQL]select null from table would throw a MatchError · 0ec2d2e8
      Daoyuan Wang authored
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #2396 from adrian-wang/selectnull and squashes the following commits:
      
      2458229 [Daoyuan Wang] rebase solution
      0ec2d2e8
    • Andrew Or's avatar
      [SPARK-3476] Remove outdated memory checks in Yarn · 8da10bf1
      Andrew Or authored
      See description in [JIRA](https://issues.apache.org/jira/browse/SPARK-3476).
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #2528 from andrewor14/yarn-memory-checks and squashes the following commits:
      
      c5400cd [Andrew Or] Simplify checks
      e30ffac [Andrew Or] Remove outdated memory checks
      8da10bf1
    • Daoyuan Wang's avatar
      [SPARK-3695]shuffle fetch fail output · 30461c6a
      Daoyuan Wang authored
      The error message should include the detailed host and port.
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #2539 from adrian-wang/fetchfail and squashes the following commits:
      
      6c1b1e0 [Daoyuan Wang] shuffle fetch fail output
      30461c6a
    • RJ Nowling's avatar
      [SPARK-3614][MLLIB] Add minimumOccurence filtering to IDF · ec9df6a7
      RJ Nowling authored
      This PR for [SPARK-3614](https://issues.apache.org/jira/browse/SPARK-3614) adds functionality for filtering out terms which do not appear in at least a minimum number of documents.
      
      This is implemented using a minimumOccurence parameter (default 0).  When terms' document frequencies are less than minimumOccurence, their IDFs are set to 0, just like when the DF is 0.  As a result, the TF-IDFs for the terms are found to be 0, as if the terms were not present in the documents.
      
      This PR makes the following changes:
      * Add a minimumOccurence parameter to the IDF and DocumentFrequencyAggregator classes.
      * Create a parameter-less constructor for IDF with a default minimumOccurence value of 0 to remain backwards-compatible with the original IDF API.
      * Set the IDFs to 0 for terms whose DFs are less than minimumOccurence
      * Add tests to the Spark IDFSuite and Java JavaTfIdfSuite test suites
      * Update the MLlib Feature Extraction programming guide to describe the new feature
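
The filtering rule described above can be sketched in plain Python. This is a minimal illustration, not the MLlib implementation: the function names `idf_weights` and `tf_idf` are hypothetical, and the smoothed formula `log((m + 1) / (df + 1))` is an assumption of this sketch. The parameter is called `min_doc_freq` here, following the rename mentioned in the squashed commits.

```python
import math
from collections import Counter


def idf_weights(docs, min_doc_freq=0):
    """Return {term: idf} for tokenized documents (hypothetical helper).

    Terms that appear in fewer than min_doc_freq documents get an IDF of 0.0.
    """
    m = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term at most once per document

    idf = {}
    for term, d in df.items():
        if d < min_doc_freq:
            idf[term] = 0.0  # filtered out: treated as if the term were absent
        else:
            # Smoothed IDF; the exact formula is an assumption of this sketch.
            idf[term] = math.log((m + 1.0) / (d + 1.0))
    return idf


def tf_idf(doc, idf):
    """Multiply raw term counts by the IDF weights (unknown terms weigh 0)."""
    tf = Counter(doc)
    return {term: count * idf.get(term, 0.0) for term, count in tf.items()}


docs = [
    ["spark", "shuffle", "sort"],
    ["spark", "sql"],
    ["spark", "mllib", "idf"],
    ["sql", "idf"],
]
weights = idf_weights(docs, min_doc_freq=2)
scores = tf_idf(docs[0], weights)
```

With `min_doc_freq=2`, the terms that occur in only one document ("shuffle", "sort", "mllib") receive an IDF of 0, so their TF-IDF scores are 0 as well, while the default of 0 keeps every term.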
      
      Author: RJ Nowling <rnowling@gmail.com>
      
      Closes #2494 from rnowling/spark-3614-idf-filter and squashes the following commits:
      
      0aa3c63 [RJ Nowling] Fix identation
      e6523a8 [RJ Nowling] Remove unnecessary toDouble's from IDFSuite
      bfa82ec [RJ Nowling] Add space after if
      30d20b3 [RJ Nowling] Add spaces around equals signs
      9013447 [RJ Nowling] Add space before division operator
      79978fc [RJ Nowling] Remove unnecessary semi-colon
      40fd70c [RJ Nowling] Change minimumOccurence to minDocFreq in code and docs
      47850ab [RJ Nowling] Changed minimumOccurence to Int from Long
      9fb4093 [RJ Nowling] Remove unnecessary lines from IDF class docs
      1fc09d8 [RJ Nowling] Add backwards-compatible constructor to DocumentFrequencyAggregator
      1801fd2 [RJ Nowling] Fix style errors in IDF.scala
      6897252 [RJ Nowling] Preface minimumOccurence members with val to make them final and immutable
      a200bab [RJ Nowling] Remove unnecessary else statement
      4b974f5 [RJ Nowling] Remove accidentally-added import from testing
      c0cc643 [RJ Nowling] Add minimumOccurence filtering to IDF
      ec9df6a7
    • aniketbhatnagar's avatar
      SPARK-3639 | Removed settings master in examples · d16e161d
      aniketbhatnagar authored
      This patch removes the setting of master as local in the Kinesis examples so that users can set it using submit-job.
      
      Author: aniketbhatnagar <aniket.bhatnagar@gmail.com>
      
      Closes #2536 from aniketbhatnagar/Kinesis-Examples-Master-Unset and squashes the following commits:
      
      c9723ac [aniketbhatnagar] Merge remote-tracking branch 'origin/Kinesis-Examples-Master-Unset' into Kinesis-Examples-Master-Unset
      fec8ead [aniketbhatnagar] SPARK-3639 | Removed settings master in examples
      31cdc59 [aniketbhatnagar] SPARK-3639 | Removed settings master in examples
      d16e161d
    • Davies Liu's avatar
      [SPARK-3478] [PySpark] Profile the Python tasks · 1aa549ba
      Davies Liu authored
      This patch adds profiling support for PySpark. It shows the profiling results
      before the driver exits. Here is one example:
      
      ```
      ============================================================
      Profile of RDD<id=3>
      ============================================================
               5146507 function calls (5146487 primitive calls) in 71.094 seconds
      
         Ordered by: internal time, cumulative time
      
         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        5144576   68.331    0.000   68.331    0.000 statcounter.py:44(merge)
             20    2.735    0.137   71.071    3.554 statcounter.py:33(__init__)
             20    0.017    0.001    0.017    0.001 {cPickle.dumps}
           1024    0.003    0.000    0.003    0.000 t.py:16(<lambda>)
             20    0.001    0.000    0.001    0.000 {reduce}
             21    0.001    0.000    0.001    0.000 {cPickle.loads}
             20    0.001    0.000    0.001    0.000 copy_reg.py:95(_slotnames)
             41    0.001    0.000    0.001    0.000 serializers.py:461(read_int)
             40    0.001    0.000    0.002    0.000 serializers.py:179(_batched)
             62    0.000    0.000    0.000    0.000 {method 'read' of 'file' objects}
             20    0.000    0.000   71.072    3.554 rdd.py:863(<lambda>)
             20    0.000    0.000    0.001    0.000 serializers.py:198(load_stream)
          40/20    0.000    0.000   71.072    3.554 rdd.py:2093(pipeline_func)
             41    0.000    0.000    0.002    0.000 serializers.py:130(load_stream)
             40    0.000    0.000   71.072    1.777 rdd.py:304(func)
             20    0.000    0.000   71.094    3.555 worker.py:82(process)
      ```
      
      Also, users can show the profile results manually with `sc.show_profiles()` or dump them to disk
      with `sc.dump_profiles(path)`, for example:
      
      ```python
      >>> sc._conf.set("spark.python.profile", "true")
      >>> rdd = sc.parallelize(range(100)).map(str)
      >>> rdd.count()
      100
      >>> sc.show_profiles()
      ============================================================
      Profile of RDD<id=1>
      ============================================================
               284 function calls (276 primitive calls) in 0.001 seconds
      
         Ordered by: internal time, cumulative time
      
         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
              4    0.000    0.000    0.000    0.000 serializers.py:198(load_stream)
              4    0.000    0.000    0.000    0.000 {reduce}
           12/4    0.000    0.000    0.001    0.000 rdd.py:2092(pipeline_func)
              4    0.000    0.000    0.000    0.000 {cPickle.loads}
              4    0.000    0.000    0.000    0.000 {cPickle.dumps}
            104    0.000    0.000    0.000    0.000 rdd.py:852(<genexpr>)
              8    0.000    0.000    0.000    0.000 serializers.py:461(read_int)
             12    0.000    0.000    0.000    0.000 rdd.py:303(func)
      ```
      Profiling is disabled by default and can be enabled with "spark.python.profile=true".

      Users can also dump the results to disk automatically for future analysis with "spark.python.profile.dump=path_to_dump".
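
The reports shown above have the layout produced by Python's standard `cProfile` and `pstats` modules. The following sketch uses only those standard-library modules to produce a report with the same ordering and to dump the raw statistics to a file. It does not use PySpark, and `merge_stats` is a hypothetical workload chosen for illustration.

```python
import cProfile
import io
import os
import pstats
import tempfile


def merge_stats(n):
    """Hypothetical workload: sum the decimal lengths of 0..n-1."""
    total = 0
    for i in range(n):
        total += len(str(i))
    return total


# Collect profiling data only around the code of interest.
profiler = cProfile.Profile()
profiler.enable()
result = merge_stats(10000)
profiler.disable()

# Render a report sorted by internal time, then cumulative time.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("time", "cumulative").print_stats(10)
report = buffer.getvalue()
print(report)

# Dump the raw statistics to disk so they can be analysed later.
dump_path = os.path.join(tempfile.mkdtemp(), "merge_stats.pstats")
stats.dump_stats(dump_path)
reloaded = pstats.Stats(dump_path)
```

A dumped file can be reloaded with `pstats.Stats(path)` and sorted or filtered again without rerunning the job.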
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2351 from davies/profiler and squashes the following commits:
      
      7ef2aa0 [Davies Liu] bugfix, add tests for show_profiles and dump_profiles()
      2b0daf2 [Davies Liu] fix docs
      7a56c24 [Davies Liu] bugfix
      cba9463 [Davies Liu] move show_profiles and dump_profiles to SparkContext
      fb9565b [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
      116d52a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
      09d02c3 [Davies Liu] Merge branch 'master' into profiler
      c23865c [Davies Liu] Merge branch 'master' into profiler
      15d6f18 [Davies Liu] add docs for two configs
      dadee1a [Davies Liu] add docs string and clear profiles after show or dump
      4f8309d [Davies Liu] address comment, add tests
      0a5b6eb [Davies Liu] fix Python UDF
      4b20494 [Davies Liu] add profile for python
      1aa549ba
    • Hari Shreedharan's avatar
      [SPARK-3686][STREAMING] Wait for sink to commit the channel before check... · b235e013
      Hari Shreedharan authored
      ...ing for the channel size.
      
      Author: Hari Shreedharan <hshreedharan@apache.org>
      
      Closes #2531 from harishreedharan/sparksinksuite-fix and squashes the following commits:
      
      30393c1 [Hari Shreedharan] Use more deterministic method to figure out when batches come in.
      6ce9d8b [Hari Shreedharan] [SPARK-3686][STREAMING] Wait for sink to commit the channel before checking for the channel size.
      b235e013
  5. Sep 25, 2014
    • zsxwing's avatar
      SPARK-2634: Change MapOutputTrackerWorker.mapStatuses to ConcurrentHashMap · 86bce764
      zsxwing authored
      MapOutputTrackerWorker.mapStatuses is used concurrently, so it should be thread-safe. This bug has already been fixed in #1328. Nevertheless, since #1328 won't be merged soon, I am sending this trivial fix in the hope that the issue can be resolved soon.
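
The fix itself swaps in the JVM's ConcurrentHashMap. As a rough Python analog of the same idea, the sketch below guards a shared map with a lock so that several threads can update it safely. The `LockedMap` class and the `record_statuses` worker are hypothetical names for illustration and are not part of Spark.

```python
import threading


class LockedMap:
    """A dict whose reads and writes are serialized by a single lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)

    def __len__(self):
        with self._lock:
            return len(self._data)


def record_statuses(shared, worker_id, count):
    """Hypothetical worker: store `count` entries under distinct keys."""
    for i in range(count):
        shared.put((worker_id, i), "status-%d-%d" % (worker_id, i))


statuses = LockedMap()
threads = [
    threading.Thread(target=record_statuses, args=(statuses, w, 1000))
    for w in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

A single lock is the simplest way to make the map safe. A concurrent hash map gives the same guarantee with finer-grained synchronization, so readers and writers contend less.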
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #1541 from zsxwing/SPARK-2634 and squashes the following commits:
      
      d450053 [zsxwing] SPARK-2634: Change MapOutputTrackerWorker.mapStatuses to ConcurrentHashMap
      86bce764
    • Kousuke Saruta's avatar
      [SPARK-3584] sbin/slaves doesn't work when we use password authentication for SSH · 0dc868e7
      Kousuke Saruta authored
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #2444 from sarutak/slaves-scripts-modification and squashes the following commits:
      
      eff7394 [Kousuke Saruta] Improve the description about Cluster Launch Script in docs/spark-standalone.md
      7858225 [Kousuke Saruta] Modified sbin/slaves to use the environment variable "SPARK_SSH_FOREGROUND" as a flag
      53d7121 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into slaves-scripts-modification
      e570431 [Kousuke Saruta] Added a description for SPARK_SSH_FOREGROUND variable
      7120a0c [Kousuke Saruta] Added a description about default host for sbin/slaves
      1bba8a9 [Kousuke Saruta] Added SPARK_SSH_FOREGROUND flag to sbin/slaves
      88e2f17 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into slaves-scripts-modification
      297e75d [Kousuke Saruta] Modified sbin/slaves not to export HOSTLIST
      0dc868e7