  1. Nov 21, 2014
  2. Nov 20, 2014
    • Michael Armbrust's avatar
      [SPARK-4522][SQL] Parse schema with missing metadata. · 90a6a46b
      Michael Armbrust authored
      This is just a quick fix for 1.2.  SPARK-4523 describes a more complete solution.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #3392 from marmbrus/parquetMetadata and squashes the following commits:
      
      bcc6626 [Michael Armbrust] Parse schema with missing metadata.
      90a6a46b
    • Davies Liu's avatar
      add Sphinx as a dependency of building docs · 8cd6eea6
      Davies Liu authored
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3388 from davies/doc_readme and squashes the following commits:
      
      daa1482 [Davies Liu] add Sphinx dependency
      8cd6eea6
    • Michael Armbrust's avatar
      [SPARK-4413][SQL] Parquet support through datasource API · 02ec058e
      Michael Armbrust authored
      Goals:
       - Support for accessing parquet using SQL but not requiring Hive (thus allowing support of parquet tables with decimal columns)
       - Support for folder based partitioning with automatic discovery of available partitions
       - Caching of file metadata
      
      See scaladoc of `ParquetRelation2` for more details.
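
As a rough illustration of the intended usage, here is a minimal sketch only; the table name and path below are hypothetical, and it assumes a `SQLContext` named `sqlContext`:

```python
# Hypothetical sketch: register a Parquet directory through the new data source API
# and query it with plain Spark SQL, without requiring Hive.
sqlContext.sql("""
    CREATE TEMPORARY TABLE events
    USING org.apache.spark.sql.parquet
    OPTIONS (path '/data/events')
""")
sqlContext.sql("SELECT COUNT(*) FROM events").collect()
```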
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #3269 from marmbrus/newParquet and squashes the following commits:
      
      1dd75f1 [Michael Armbrust] Pass all paths for FileInputFormat at once.
      645768b [Michael Armbrust] Review comments.
      abd8e2f [Michael Armbrust] Alternative implementation of parquet based on the datasources API.
      938019e [Michael Armbrust] Add an experimental interface to data sources that exposes catalyst expressions.
      e9d2641 [Michael Armbrust] logging / formatting improvements.
      02ec058e
    • Cheng Hao's avatar
      [SPARK-4244] [SQL] Support Hive Generic UDFs with constant object inspector parameters · 84d79ee9
      Cheng Hao authored
The query `SELECT named_struct(lower("AA"), "12", lower("Bb"), "13") FROM src LIMIT 1` throws an exception: some Hive Generic UDFs/UDAFs require their input object inspector to be a `ConstantObjectInspector`, but we don't get one until after expression optimization (constant folding) has run.
      
This PR is a workaround for the issue. (Ideally, the `output` of a LogicalPlan should be identical before and after optimization.)
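
A minimal repro sketch, assuming a `HiveContext` named `hiveContext` and the usual Hive test table `src`:

```python
# The arguments to lower() are only known to be constant after constant folding,
# which is what tripped up ConstantObjectInspector-based UDFs before this fix.
hiveContext.sql(
    'SELECT named_struct(lower("AA"), "12", lower("Bb"), "13") FROM src LIMIT 1'
).collect()
```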
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3109 from chenghao-intel/optimized and squashes the following commits:
      
      487ff79 [Cheng Hao] rebase to the latest master & update the unittest
      84d79ee9
    • Davies Liu's avatar
      [SPARK-4477] [PySpark] remove numpy from RDDSampler · d39f2e9c
      Davies Liu authored
In RDDSampler, numpy is used to try to speed up poisson(), but the pure-Python implementation of poisson() makes only about (1 + fraction) * N calls to random(), so numpy provides little performance gain.

numpy is also not a dependency of pyspark, so it may introduce problems such as numpy being installed on the master but missing on the slaves, as reported in SPARK-927.

Finally, it complicates the code considerably, so we should remove numpy from RDDSampler.
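
For reference, a pure-Python Poisson sampler in this spirit might look like the following sketch (hypothetical helper names, not the actual RDDSampler code):

```python
import math
import random

def poisson_count(fraction):
    # Knuth's algorithm: the expected number of random() calls per element is
    # about 1 + fraction, giving roughly (1 + fraction) * N calls for N elements.
    threshold = math.exp(-fraction)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

def sample_with_replacement(iterator, fraction):
    # Emit each element poisson_count(fraction) times.
    for item in iterator:
        for _ in range(poisson_count(fraction)):
            yield item
```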
      
      I also did some benchmark to verify that:
      ```
      >>> from pyspark.mllib.random import RandomRDDs
      >>> rdd = RandomRDDs.uniformRDD(sc, 1 << 20, 1).cache()
      >>> rdd.count()  # cache it
      >>> rdd.sample(True, 0.9).count()    # measure this line
      ```
      the results:
      
| withReplacement | random | numpy.random |
| --------------- | ------ | ------------ |
| True            | 1.5 s  | 1.4 s        |
| False           | 0.6 s  | 0.8 s        |
      
      closes #2313
      
Note: this patch includes some commits that have not yet been mirrored to GitHub; it will be fine once the mirror catches up.
      
      Author: Davies Liu <davies@databricks.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3351 from davies/numpy and squashes the following commits:
      
      5c438d7 [Davies Liu] fix comment
      c5b9252 [Davies Liu] Merge pull request #1 from mengxr/SPARK-4477
      98eb31b [Xiangrui Meng] make poisson sampling slightly faster
      ee17d78 [Davies Liu] remove = for float
      13f7b05 [Davies Liu] Merge branch 'master' of http://git-wip-us.apache.org/repos/asf/spark into numpy
      f583023 [Davies Liu] fix tests
      51649f5 [Davies Liu] remove numpy in RDDSampler
      78bf997 [Davies Liu] fix tests, do not use numpy in randomSplit, no performance gain
      f5fdf63 [Davies Liu] fix bug with int in weights
      4dfa2cd [Davies Liu] refactor
      f866bcf [Davies Liu] remove unneeded change
      c7a2007 [Davies Liu] switch to python implementation
      95a48ac [Davies Liu] Merge branch 'master' of github.com:apache/spark into randomSplit
      0d9b256 [Davies Liu] refactor
      1715ee3 [Davies Liu] address comments
      41fce54 [Davies Liu] randomSplit()
      d39f2e9c
    • Jacky Li's avatar
      [SQL] fix function description mistake · ad5f1f3c
      Jacky Li authored
The sample code in the description of SchemaRDD.where is not correct.
      
      Author: Jacky Li <jacky.likun@gmail.com>
      
      Closes #3344 from jackylk/patch-6 and squashes the following commits:
      
      62cd126 [Jacky Li] [SQL] fix function description mistake
      ad5f1f3c
    • Cheng Hao's avatar
      [SPARK-2918] [SQL] Support the CTAS in EXPLAIN command · 6aa0fc9f
      Cheng Hao authored
Hive supports `EXPLAIN` on CTAS statements. Spark SQL previously supported this as well, but the support appears to have been lost during the HiveQL code refactoring.
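
A minimal sketch of the kind of statement this enables, assuming a `HiveContext` named `hiveContext` and a source table `src` (the target table name is hypothetical):

```python
# EXPLAIN over a CREATE TABLE AS SELECT (CTAS) statement.
for row in hiveContext.sql(
        "EXPLAIN CREATE TABLE kv_copy AS SELECT key, value FROM src").collect():
    print(row)
```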
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3357 from chenghao-intel/explain and squashes the following commits:
      
      7aace63 [Cheng Hao] Support the CTAS in EXPLAIN command
      6aa0fc9f
    • Takuya UESHIN's avatar
      [SPARK-4318][SQL] Fix empty sum distinct. · 2c2e7a44
      Takuya UESHIN authored
Executing sum distinct on an empty table throws `java.lang.UnsupportedOperationException: empty.reduceLeft`.
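
A repro sketch, assuming a `SQLContext` named `sqlContext` and an empty registered table `empty_table`; the post-fix result noted below is an assumption based on standard SQL semantics:

```python
# Before the fix this raised java.lang.UnsupportedOperationException: empty.reduceLeft;
# afterwards it should return a single row containing NULL.
sqlContext.sql("SELECT SUM(DISTINCT value) FROM empty_table").collect()
```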
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #3184 from ueshin/issues/SPARK-4318 and squashes the following commits:
      
      8168c42 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4318
      66fdb0a [Takuya UESHIN] Re-refine aggregate functions.
      6186eb4 [Takuya UESHIN] Fix Sum of GeneratedAggregate.
      d2975f6 [Takuya UESHIN] Refine Sum and Average of GeneratedAggregate.
      1bba675 [Takuya UESHIN] Refine Sum, SumDistinct and Average functions.
      917e533 [Takuya UESHIN] Use aggregate instead of groupBy().
      1a5f874 [Takuya UESHIN] Add tests to be executed as non-partial aggregation.
      a5a57d2 [Takuya UESHIN] Fix empty Average.
      22799dc [Takuya UESHIN] Fix empty Sum and SumDistinct.
      65b7dd2 [Takuya UESHIN] Fix empty sum distinct.
      2c2e7a44
    • ravipesala's avatar
      [SPARK-4513][SQL] Support relational operator '<=>' in Spark SQL · 98e94197
      ravipesala authored
The relational operator '<=>' does not work in Spark SQL, although the same query works in Spark HiveQL.
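
For context, `<=>` is HiveQL's null-safe equality operator; a hedged usage sketch (hypothetical table and columns, assuming a `SQLContext` named `sqlContext`):

```python
# Unlike '=', 'a <=> b' evaluates to true when both sides are NULL.
sqlContext.sql("SELECT * FROM t WHERE a <=> b").collect()
```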
      
      Author: ravipesala <ravindra.pesala@huawei.com>
      
      Closes #3387 from ravipesala/<=> and squashes the following commits:
      
      7198e90 [ravipesala] Supporting relational operator '<=>' in Spark SQL
      98e94197
    • Davies Liu's avatar
      [SPARK-4439] [MLlib] add python api for random forest · 1c53a5db
      Davies Liu authored
      ```
          class RandomForestModel
           |  A model trained by RandomForest
           |
           |  numTrees(self)
           |      Get number of trees in forest.
           |
           |  predict(self, x)
           |      Predict values for a single data point or an RDD of points using the model trained.
           |
           |  toDebugString(self)
           |      Full model
           |
           |  totalNumNodes(self)
           |      Get total number of nodes, summed over all trees in the forest.
           |
      
          class RandomForest
           |  trainClassifier(cls, data, numClassesForClassification, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='gini', maxDepth=4, maxBins=32, seed=None):
           |      Method to train a decision tree model for binary or multiclass classification.
           |
           |      :param data: Training dataset: RDD of LabeledPoint.
           |                   Labels should take values {0, 1, ..., numClasses-1}.
           |      :param numClassesForClassification: number of classes for classification.
           |      :param categoricalFeaturesInfo: Map storing arity of categorical features.
           |                                  E.g., an entry (n -> k) indicates that feature n is categorical
           |                                  with k categories indexed from 0: {0, 1, ..., k-1}.
           |      :param numTrees: Number of trees in the random forest.
           |      :param featureSubsetStrategy: Number of features to consider for splits at each node.
           |                                Supported: "auto" (default), "all", "sqrt", "log2", "onethird".
           |                                If "auto" is set, this parameter is set based on numTrees:
           |                                  if numTrees == 1, set to "all";
           |                                  if numTrees > 1 (forest) set to "sqrt".
           |      :param impurity: Criterion used for information gain calculation.
           |                   Supported values: "gini" (recommended) or "entropy".
           |      :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means
           |                       1 internal node + 2 leaf nodes. (default: 4)
           |      :param maxBins: maximum number of bins used for splitting features (default: 100)
           |      :param seed:  Random seed for bootstrapping and choosing feature subsets.
           |      :return: RandomForestModel that can be used for prediction
           |
           |   trainRegressor(cls, data, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='variance', maxDepth=4, maxBins=32, seed=None):
           |      Method to train a decision tree model for regression.
           |
           |      :param data: Training dataset: RDD of LabeledPoint.
           |                   Labels are real numbers.
           |      :param categoricalFeaturesInfo: Map storing arity of categorical features.
           |                                   E.g., an entry (n -> k) indicates that feature n is categorical
           |                                   with k categories indexed from 0: {0, 1, ..., k-1}.
           |      :param numTrees: Number of trees in the random forest.
           |      :param featureSubsetStrategy: Number of features to consider for splits at each node.
           |                                 Supported: "auto" (default), "all", "sqrt", "log2", "onethird".
           |                                 If "auto" is set, this parameter is set based on numTrees:
           |                                 if numTrees == 1, set to "all";
           |                                 if numTrees > 1 (forest) set to "onethird".
           |      :param impurity: Criterion used for information gain calculation.
           |                       Supported values: "variance".
           |      :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means
           |                       1 internal node + 2 leaf nodes.(default: 4)
           |      :param maxBins: maximum number of bins used for splitting features (default: 100)
           |      :param seed:  Random seed for bootstrapping and choosing feature subsets.
           |      :return: RandomForestModel that can be used for prediction
           |
      ```
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3320 from davies/forest and squashes the following commits:
      
      8003dfc [Davies Liu] reorder
      53cf510 [Davies Liu] fix docs
      4ca593d [Davies Liu] fix docs
      e0df852 [Davies Liu] fix docs
      0431746 [Davies Liu] rebased
      2b6f239 [Davies Liu] Merge branch 'master' of github.com:apache/spark into forest
      885abee [Davies Liu] address comments
      dae7fc0 [Davies Liu] address comments
      89a000f [Davies Liu] fix docs
      565d476 [Davies Liu] add python api for random forest
      1c53a5db
    • Dan McClary's avatar
      [SPARK-4228][SQL] SchemaRDD to JSON · b8e6886f
      Dan McClary authored
      Here's a simple fix for SchemaRDD to JSON.
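
A small usage sketch, assuming a `SQLContext` named `sqlContext`; the exact output formatting is an assumption:

```python
# Build a tiny SchemaRDD and serialize each row back to a JSON string.
people = sqlContext.jsonRDD(sc.parallelize(['{"name": "Alice", "age": 1}']))
print(people.toJSON().collect())  # e.g. ['{"name":"Alice","age":1}']
```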
      
      Author: Dan McClary <dan.mcclary@gmail.com>
      
      Closes #3213 from dwmclary/SPARK-4228 and squashes the following commits:
      
      d714e1d [Dan McClary] fixed PEP 8 error
      cac2879 [Dan McClary] move pyspark comment and doctest to correct location
      f9471d3 [Dan McClary] added pyspark doc and doctest
      6598cee [Dan McClary] adding complex type queries
      1a5fd30 [Dan McClary] removing SPARK-4228 from SQLQuerySuite
      4a651f0 [Dan McClary] cleaned PEP and Scala style failures.  Moved tests to JsonSuite
      47ceff6 [Dan McClary] cleaned up scala style issues
      2ee1e70 [Dan McClary] moved rowToJSON to JsonRDD
      4387dd5 [Dan McClary] Added UserDefinedType, cleaned up case formatting
      8f7bfb6 [Dan McClary] Map type added to SchemaRDD.toJSON
      1b11980 [Dan McClary] Map and UserDefinedTypes partially done
      11d2016 [Dan McClary] formatting and unicode deserialization default fixed
      6af72d1 [Dan McClary] deleted extaneous comment
      4d11c0c [Dan McClary] JsonFactory rewrite of toJSON for SchemaRDD
      149dafd [Dan McClary] wrapped scala toJSON in sql.py
      5e5eb1b [Dan McClary] switched to Jackson for JSON processing
      6c94a54 [Dan McClary] added toJSON to pyspark SchemaRDD
      aaeba58 [Dan McClary] added toJSON to pyspark SchemaRDD
      1d171aa [Dan McClary] upated missing brace on if statement
      319e3ba [Dan McClary] updated to upstream master with merged SPARK-4228
      424f130 [Dan McClary] tests pass, ready for pull and PR
      626a5b1 [Dan McClary] added toJSON to SchemaRDD
      f7d166a [Dan McClary] added toJSON method
      5d34e37 [Dan McClary] merge resolved
      d6d19e9 [Dan McClary] pr example
      b8e6886f
    • Cheng Lian's avatar
      [SPARK-3938][SQL] Names in-memory columnar RDD with corresponding table name · abf29187
      Cheng Lian authored
      This PR enables the Web UI storage tab to show the in-memory table name instead of the mysterious query plan string as the name of the in-memory columnar RDD.
      
      Note that after #2501, a single columnar RDD can be shared by multiple in-memory tables, as long as their query results are the same. In this case, only the first cached table name is shown. For example:
      
      ```sql
      CACHE TABLE first AS SELECT * FROM src;
      CACHE TABLE second AS SELECT * FROM src;
      ```
      
      The Web UI only shows "In-memory table first".
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3383 from liancheng/columnar-rdd-name and squashes the following commits:
      
      071907f [Cheng Lian] Fixes tests
      12ddfa6 [Cheng Lian] Names in-memory columnar RDD with corresponding table name
      abf29187
    • Xiangrui Meng's avatar
      [SPARK-4486][MLLIB] Improve GradientBoosting APIs and doc · 15cacc81
      Xiangrui Meng authored
      There are some inconsistencies in the gradient boosting APIs. The target is a general boosting meta-algorithm, but the implementation is attached to trees. This was partially due to the delay of SPARK-1856. But for the 1.2 release, we should make the APIs consistent.
      
1. WeightedEnsembleModel -> private[tree] TreeEnsembleModel and renamed members accordingly.
2. GradientBoosting -> GradientBoostedTrees
3. Add RandomForestModel and GradientBoostedTreesModel and hide CombiningStrategy
4. Slightly refactored TreeEnsembleModel (Vote takes weights into consideration.)
5. Remove `trainClassifier` and `trainRegressor` from `GradientBoostedTrees` because they are the same as `train`
6. Rename class `train` method to `run` because it hides the static methods with the same name in Java. Deprecated `DecisionTree.train` class method.
7. Simplify BoostingStrategy and make sure the input strategy is not modified. Users should put algo and numClasses in treeStrategy. We create ensembleStrategy inside boosting.
8. Fix a bug in GradientBoostedTreesSuite with AbsoluteError
9. doc updates
      
      manishamde jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3374 from mengxr/SPARK-4486 and squashes the following commits:
      
      7097251 [Xiangrui Meng] address joseph's comments
      98dea09 [Xiangrui Meng] address manish's comments
      4aae3b7 [Xiangrui Meng] add RandomForestModel and GradientBoostedTreesModel, hide CombiningStrategy
      ea4c467 [Xiangrui Meng] fix unit tests
      751da4e [Xiangrui Meng] rename class method train -> run
      19030a5 [Xiangrui Meng] update boosting public APIs
      15cacc81
  3. Nov 19, 2014
    • Leolh's avatar
      [SPARK-4446] [SPARK CORE] · e216ffae
      Leolh authored
MetadataCleaner schedules its task with the wrong parameter for the delay time.
      
      Author: Leolh <leosandylh@gmail.com>
      
      Closes #3306 from Leolh/master and squashes the following commits:
      
      4a21f4e [Leolh] Update MetadataCleaner.scala
      e216ffae
    • Andrew Or's avatar
      [SPARK-4480] Avoid many small spills in external data structures · 0eb4a7fb
      Andrew Or authored
      **Summary.** Currently, we may spill many small files in `ExternalAppendOnlyMap` and `ExternalSorter`. The underlying root cause of this is summarized in [SPARK-4452](https://issues.apache.org/jira/browse/SPARK-4452). This PR does not address this root cause, but simply provides the guarantee that we never spill the in-memory data structure if its size is less than a configurable threshold of 5MB. This config is not documented because we don't want users to set it themselves, and it is not hard-coded because we need to change it in tests.
      
      **Symptom.** Each spill is orders of magnitude smaller than 1MB, and there are many spills. In environments where the ulimit is set, this frequently causes "too many open file" exceptions observed in [SPARK-3633](https://issues.apache.org/jira/browse/SPARK-3633).
      ```
      14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4792 B to disk (292769 spills so far)
      14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4760 B to disk (292770 spills so far)
      14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4520 B to disk (292771 spills so far)
      14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4560 B to disk (292772 spills so far)
      14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4792 B to disk (292773 spills so far)
      14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4784 B to disk (292774 spills so far)
      ```
      
**Reproduction.** I ran the following on a small 4-node cluster with 512MB executors. Note that the back-to-back shuffle here is necessary for reasons described in [SPARK-4452](https://issues.apache.org/jira/browse/SPARK-4452). The second shuffle is a `reduceByKey` because it performs a map-side combine.
      ```
      sc.parallelize(1 to 100000000, 100)
        .map { i => (i, i) }
        .groupByKey()
        .reduceByKey(_ ++ _)
        .count()
      ```
      Before the change, I notice that each thread may spill up to 1000 times, and the size of each spill is on the order of 10KB. After the change, each thread spills only up to 20 times in the worst case, and the size of each spill is on the order of 1MB.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3353 from andrewor14/avoid-small-spills and squashes the following commits:
      
      49f380f [Andrew Or] Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/spark into avoid-small-spills
      27d6966 [Andrew Or] Merge branch 'master' of github.com:apache/spark into avoid-small-spills
      f4736e3 [Andrew Or] Fix tests
      a919776 [Andrew Or] Avoid many small spills
      0eb4a7fb
    • Nishkam Ravi's avatar
      [Spark-4484] Treat maxResultSize as unlimited when set to 0; improve error message · 73fedf5a
      Nishkam Ravi authored
The check for maxResultSize > 0 is missing, which results in failures. The error message also needs to be improved so that developers know there is a new parameter to configure.
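
A hedged configuration sketch, assuming the parameter in question is `spark.driver.maxResultSize`:

```python
from pyspark import SparkConf, SparkContext

# 0 is now treated as "unlimited" rather than failing the job.
conf = SparkConf().set("spark.driver.maxResultSize", "0")
sc = SparkContext(conf=conf)
```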
      
      Author: Nishkam Ravi <nravi@cloudera.com>
      Author: nravi <nravi@c1704.halxg.cloudera.com>
      Author: nishkamravi2 <nishkamravi@gmail.com>
      
      Closes #3360 from nishkamravi2/master_nravi and squashes the following commits:
      
      5c9a4cb [nishkamravi2] Update TaskSetManagerSuite.scala
      535295a [nishkamravi2] Update TaskSetManager.scala
      3e1b616 [Nishkam Ravi] Modify test for maxResultSize
      9f6583e [Nishkam Ravi] Changes to maxResultSize code (improve error message and add condition to check if maxResultSize > 0)
      5f8f9ed [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      636a9ff [nishkamravi2] Update YarnAllocator.scala
      8f76c8b [Nishkam Ravi] Doc change for yarn memory overhead
      35daa64 [Nishkam Ravi] Slight change in the doc for yarn memory overhead
      5ac2ec1 [Nishkam Ravi] Remove out
      dac1047 [Nishkam Ravi] Additional documentation for yarn memory overhead issue
      42c2c3d [Nishkam Ravi] Additional changes for yarn memory overhead issue
      362da5e [Nishkam Ravi] Additional changes for yarn memory overhead
      c726bd9 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      f00fa31 [Nishkam Ravi] Improving logging for AM memoryOverhead
      1cf2d1e [nishkamravi2] Update YarnAllocator.scala
      ebcde10 [Nishkam Ravi] Modify default YARN memory_overhead-- from an additive constant to a multiplier (redone to resolve merge conflicts)
      2e69f11 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      efd688a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark
      2b630f9 [nravi] Accept memory input as "30g", "512M" instead of an int value, to be consistent with rest of Spark
      3bf8fad [nravi] Merge branch 'master' of https://github.com/apache/spark
      5423a03 [nravi] Merge branch 'master' of https://github.com/apache/spark
      eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark
      df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456)
      6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed)
      5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456)
      681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles
      73fedf5a
    • Akshat Aranya's avatar
      [SPARK-4478] Keep totalRegisteredExecutors up-to-date · 9ccc53c7
      Akshat Aranya authored
      This rebases PR 3368.
      
This commit fixes the totalRegisteredExecutors update ([SPARK-4478]) so that we correctly keep track of the number of registered executors.
      
      Author: Akshat Aranya <aaranya@quantcast.com>
      
      Closes #3373 from coolfrood/topic/SPARK-4478 and squashes the following commits:
      
      8a4d1e4 [Akshat Aranya] Added comment
      150ae93 [Akshat Aranya] [SPARK-4478] Keep totalRegisteredExecutors up-to-date
      9ccc53c7
    • Joseph E. Gonzalez's avatar
      Updating GraphX programming guide and documentation · 377b0682
      Joseph E. Gonzalez authored
      This pull request revises the programming guide to reflect changes in the GraphX API as well as the deprecated mapReduceTriplets operator.
      
      Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
      
      Closes #3359 from jegonzal/GraphXProgrammingGuide and squashes the following commits:
      
      4421964 [Joseph E. Gonzalez] updating documentation for graphx
      377b0682
    • Josh Rosen's avatar
      [SPARK-4495] Fix memory leak in JobProgressListener · 04d462f6
      Josh Rosen authored
      This commit fixes a memory leak in JobProgressListener that I introduced in SPARK-2321 and adds a testing framework to ensure that it’s very difficult to inadvertently introduce new memory leaks.
      
      This solution might be overkill, but the main idea is to partition JobProgressListener's state into three buckets: collections that should be empty once Spark is idle, collections that must obey some hard size limit, and collections that have a soft size limit (they can grow arbitrarily large when Spark is active but must shrink to fit within some bound after Spark becomes idle).
      
      Based on this, we can write fairly generic tests that run workloads that submit more than `spark.ui.retainedStages` stages and `spark.ui.retainedJobs` jobs then check that these various collections' sizes obey their contracts.
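
The soft limits mentioned above correspond to configuration values along these lines (a sketch; the default of 1000 is an assumption):

```python
from pyspark import SparkConf

# Collections in JobProgressListener must shrink back under these bounds once idle.
conf = (SparkConf()
        .set("spark.ui.retainedStages", "1000")
        .set("spark.ui.retainedJobs", "1000"))
```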
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3372 from JoshRosen/SPARK-4495 and squashes the following commits:
      
      c73fab5 [Josh Rosen] "data structures" -> collections
      be72e81 [Josh Rosen] [SPARK-4495] Fix memory leaks in JobProgressListener
      04d462f6
    • Yadong Qi's avatar
      [SPARK-4294][Streaming] UnionDStream stream should express the requirements in... · c3002c4a
      Yadong Qi authored
      [SPARK-4294][Streaming] UnionDStream stream should express the requirements in the same way as TransformedDStream
      
      In class TransformedDStream:
      ```scala
require(parents.length > 0, "List of DStreams to transform is empty")
require(parents.map(_.ssc).distinct.size == 1, "Some of the DStreams have different contexts")
require(parents.map(_.slideDuration).distinct.size == 1,
  "Some of the DStreams have different slide durations")
      ```
      
      In class UnionDStream:
      ```scala
if (parents.length == 0) {
  throw new IllegalArgumentException("Empty array of parents")
}
if (parents.map(_.ssc).distinct.size > 1) {
  throw new IllegalArgumentException("Array of parents have different StreamingContexts")
}
if (parents.map(_.slideDuration).distinct.size > 1) {
  throw new IllegalArgumentException("Array of parents have different slide times")
}
      ```
      
The behavior is the same, but the implementations differ; I think they should be consistent.
      
      Author: Yadong Qi <qiyadong2010@gmail.com>
      
      Closes #3152 from watermen/bug-fix1 and squashes the following commits:
      
      ed66db6 [Yadong Qi] Change transform to union
      b6b3b8b [Yadong Qi] The same function should have the same realization.
      c3002c4a
    • Davies Liu's avatar
      [SPARK-4384] [PySpark] improve sort spilling · 73c8ea84
      Davies Liu authored
If there are big broadcasts (or other large objects) in a Python worker, the free memory available for sorting can become too small, so it keeps spilling small files to disk and eventually fails with too many open files.

This PR delays spilling until the used memory goes over the limit and has started to increase again since the last spill. This increases the size of the spill files, improving stability and performance in these cases. (We already do this in ExternalAggregator.)
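
For context, the memory budget involved here is the Python worker's limit, which (as an assumption about the relevant setting) is governed by `spark.python.worker.memory`:

```python
from pyspark import SparkConf

# A larger Python worker memory budget leaves more room for in-memory sorting
# before spilling, which is the limit this patch manages more carefully.
conf = SparkConf().set("spark.python.worker.memory", "1g")
```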
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3252 from davies/sort and squashes the following commits:
      
      711fb6c [Davies Liu] improve sort spilling
      73c8ea84
    • Takuya UESHIN's avatar
      [SPARK-4429][BUILD] Build for Scala 2.11 using sbt fails. · f9adda9a
      Takuya UESHIN authored
      I tried to build for Scala 2.11 using sbt with the following command:
      
      ```
      $ sbt/sbt -Dscala-2.11 assembly
      ```
      
      but it ends with the following error messages:
      
      ```
      [error] (streaming-kafka/*:update) sbt.ResolveException: unresolved dependency: org.apache.kafka#kafka_2.11;0.8.0: not found
      [error] (catalyst/*:update) sbt.ResolveException: unresolved dependency: org.scalamacros#quasiquotes_2.11;2.0.1: not found
      ```
      
The reason: if the system property `-Dscala-2.11` is set without a value, `SparkBuild.scala` adds the `scala-2.11` profile, but `sbt-pom-reader` activates the `scala-2.10` profile instead of `scala-2.11`, because the `PropertyProfileActivator` it uses internally checks whether the property value is empty.

If the property is instead set to a non-empty value, there is no need to add profiles in `SparkBuild.scala`, because `sbt-pom-reader` then handles it as expected.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #3342 from ueshin/issues/SPARK-4429 and squashes the following commits:
      
      14d86e8 [Takuya UESHIN] Add a comment.
      4eef52b [Takuya UESHIN] Remove unneeded condition.
      ce98d0f [Takuya UESHIN] Set non-empty value to system property "scala-2.11" if the property exists instead of adding profile.
      f9adda9a
    • Ken Takagiwa's avatar
      [DOC][PySpark][Streaming] Fix docstring for sphinx · 9b7bbcef
      Ken Takagiwa authored
      This commit should be merged for 1.2 release.
      cc tdas
      
      Author: Ken Takagiwa <ugw.gi.world@gmail.com>
      
      Closes #3311 from giwa/patch-3 and squashes the following commits:
      
      ab474a8 [Ken Takagiwa] [DOC][PySpark][Streaming] Fix docstring for sphinx
      9b7bbcef
    • Prashant Sharma's avatar
      SPARK-3962 Marked scope as provided for external projects. · 1c938413
      Prashant Sharma authored
Somehow the maven shade plugin gets stuck in an infinite loop of creating the effective pom.
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Author: Prashant Sharma <scrapcodes@gmail.com>
      
      Closes #2959 from ScrapCodes/SPARK-3962/scope-provided and squashes the following commits:
      
      994d1d3 [Prashant Sharma] Fixed failing flume tests
      270b4fb [Prashant Sharma] Removed most of the unused code.
      bb3bbfd [Prashant Sharma] SPARK-3962 Marked scope as provided for external.
      1c938413
    • Andrew Or's avatar
      [HOT FIX] MiMa tests are broken · 0df02ca4
      Andrew Or authored
      This is blocking #3353 and other patches.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3371 from andrewor14/mima-hot-fix and squashes the following commits:
      
      842d059 [Andrew Or] Move excludes to the right section
      c4d4f4e [Andrew Or] MIMA hot fix
      0df02ca4
    • zsxwing's avatar
      [SPARK-4481][Streaming][Doc] Fix the wrong description of updateFunc · 3bf7ceeb
      zsxwing authored
Removed the sentence "If `this` function returns None, then corresponding state key-value pair will be eliminated." from the description of `updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)]`.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3356 from zsxwing/SPARK-4481 and squashes the following commits:
      
      76a9891 [zsxwing] Add a note that keys may be added or removed
      0ebc42a [zsxwing] Fix the wrong description of updateFunc
      3bf7ceeb
    • Tathagata Das's avatar
      [SPARK-4482][Streaming] Disable ReceivedBlockTracker's write ahead log by default · 22fc4e75
      Tathagata Das authored
The write ahead log of ReceivedBlockTracker gets enabled as soon as the checkpoint directory is set. This should not happen; the WAL should be turned on only when it is explicitly enabled in the Spark configuration.
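
A configuration sketch; the flag name here is an assumption about the Spark configuration key controlling the receiver WAL:

```python
from pyspark import SparkConf

# The ReceivedBlockTracker WAL should only be active when this is explicitly enabled,
# not merely because a checkpoint directory has been set.
conf = SparkConf().set("spark.streaming.receiver.writeAheadLog.enable", "true")
```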
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #3358 from tdas/SPARK-4482 and squashes the following commits:
      
      b740136 [Tathagata Das] Fixed bug in ReceivedBlockTracker
      22fc4e75
    • Kenichi Maehashi's avatar
      [SPARK-4470] Validate number of threads in local mode · eacc7883
      Kenichi Maehashi authored
When running Spark locally, if the number of threads is specified as 0 (e.g., `spark-submit --master local[0] ...`), the job gets stuck and does not run at all.
      I think it's better to validate the parameter.
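
A sketch of the failure mode (hypothetical app name; the exact error raised after the fix is an assumption):

```python
from pyspark import SparkContext

# Zero worker threads: previously this silently hung; with validation it should
# fail fast with an error instead.
sc = SparkContext("local[0]", "zero-threads-demo")
```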
      
      Fix for [SPARK-4470](https://issues.apache.org/jira/browse/SPARK-4470).
      
      Author: Kenichi Maehashi <webmaster@kenichimaehashi.com>
      
      Closes #3337 from kmaehashi/spark-4470 and squashes the following commits:
      
      3ad76f3 [Kenichi Maehashi] fix code style
      7716734 [Kenichi Maehashi] SPARK-4470: Validate number of threads in local mode
      eacc7883
    • Tianshuo Deng's avatar
[SPARK-4467] fix elements read count for ExternalSorter · d75579d0
      Tianshuo Deng authored
      the elementsRead variable should be reset to 0 after each spilling
      
      Author: Tianshuo Deng <tdeng@twitter.com>
      
      Closes #3302 from tsdeng/fix_external_sorter_record_count and squashes the following commits:
      
      7b56ca0 [Tianshuo Deng] fix method signature
      782c7de [Tianshuo Deng] make elementsRead private, fix comment
      bb7ff28 [Tianshuo Deng] update elemetsRead through addElementsRead method
      74ca246 [Tianshuo Deng] fix elements read count
      d75579d0
    • tedyu's avatar
      SPARK-4455 Exclude dependency on hbase-annotations module · 5f5ac2da
      tedyu authored
      pwendell
      Please take a look
      
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #3286 from tedyu/master and squashes the following commits:
      
      e61e610 [tedyu] SPARK-4455 Exclude dependency on hbase-annotations module
      7e3a57a [tedyu] Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/spark
      2f28b08 [tedyu] Exclude dependency on hbase-annotations module
      5f5ac2da
    • Patrick Wendell's avatar
      MAINTENANCE: Automated closing of pull requests. · 8327df69
      Patrick Wendell authored
      This commit exists to close the following pull requests on Github:
      
      Closes #2777 (close requested by 'ankurdave')
      Closes #2947 (close requested by 'nchammas')
      Closes #3141 (close requested by 'tdas')
      Closes #2989 (close requested by 'pwendell')
      8327df69
    • Mingfei's avatar
[Spark-4432] close InStream after the block is accessed · 165cec9c
      Mingfei authored
InStream is not closed after data is read from Tachyon, which leaves the blocks in Tachyon locked after they are accessed.
      
      Author: Mingfei <mingfei.shi@intel.com>
      
      Closes #3290 from shimingfei/lockFix and squashes the following commits:
      
      fffe345 [Mingfei] close InStream after the block is accessed
      165cec9c
    • Mingfei's avatar
      [SPARK-4441] Close Tachyon client when TachyonBlockManager is shutdown · 67e9876b
      Mingfei authored
Currently the Tachyon client is not closed when TachyonBlockManager is shut down, which causes some resources in Tachyon to never be reclaimed.
      
      Author: Mingfei <mingfei.shi@intel.com>
      
      Closes #3299 from shimingfei/closeClient and squashes the following commits:
      
      0913fbd [Mingfei] close Tachyon client when TachyonBlockManager is shutdown
      67e9876b
  4. Nov 18, 2014
    • Marcelo Vanzin's avatar
      Bumping version to 1.3.0-SNAPSHOT. · 397d3aae
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #3277 from vanzin/version-1.3 and squashes the following commits:
      
      7c3c396 [Marcelo Vanzin] Added temp repo to sbt build.
      5f404ff [Marcelo Vanzin] Add another exclusion.
      19457e7 [Marcelo Vanzin] Update old version to 1.2, add temporary 1.2 repo.
      3c8d705 [Marcelo Vanzin] Workaround for MIMA checks.
      e940810 [Marcelo Vanzin] Bumping version to 1.3.0-SNAPSHOT.
      397d3aae
    • Cheng Lian's avatar
      [SPARK-4468][SQL] Fixes Parquet filter creation for inequality predicates with... · 423baea9
      Cheng Lian authored
      [SPARK-4468][SQL] Fixes Parquet filter creation for inequality predicates with literals on the left hand side
      
      For expressions like `10 < someVar`, we should create an `Operators.Gt` filter, but right now an `Operators.Lt` is created. This issue affects all inequality predicates with literals on the left hand side.
      
      (This bug existed before #3317 and affects branch-1.1. #3338 was opened to backport this to branch-1.1.)
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3334 from liancheng/fix-parquet-comp-filter and squashes the following commits:
      
      0130897 [Cheng Lian] Fixes Parquet comparison filter generation
      423baea9
    • Davies Liu's avatar
      [SPARK-4327] [PySpark] Python API for RDD.randomSplit() · 7f22fa81
      Davies Liu authored
      ```
      pyspark.RDD.randomSplit(self, weights, seed=None)
          Randomly splits this RDD with the provided weights.
      
          :param weights: weights for splits, will be normalized if they don't sum to 1
          :param seed: random seed
          :return: split RDDs in an list
      
          >>> rdd = sc.parallelize(range(10), 1)
          >>> rdd1, rdd2, rdd3 = rdd.randomSplit([0.4, 0.6, 1.0], 11)
          >>> rdd1.collect()
          [3, 6]
          >>> rdd2.collect()
          [0, 5, 7]
          >>> rdd3.collect()
          [1, 2, 4, 8, 9]
      ```
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3193 from davies/randomSplit and squashes the following commits:
      
      78bf997 [Davies Liu] fix tests, do not use numpy in randomSplit, no performance gain
      f5fdf63 [Davies Liu] fix bug with int in weights
      4dfa2cd [Davies Liu] refactor
      f866bcf [Davies Liu] remove unneeded change
      c7a2007 [Davies Liu] switch to python implementation
      95a48ac [Davies Liu] Merge branch 'master' of github.com:apache/spark into randomSplit
      0d9b256 [Davies Liu] refactor
      1715ee3 [Davies Liu] address comments
      41fce54 [Davies Liu] randomSplit()
      7f22fa81
    • Xiangrui Meng's avatar
      [SPARK-4433] fix a racing condition in zipWithIndex · bb460461
      Xiangrui Meng authored
      Spark hangs with the following code:
      
      ~~~
      sc.parallelize(1 to 10).zipWithIndex.repartition(10).count()
      ~~~
      
      This is because ZippedWithIndexRDD triggers a job in getPartitions and it causes a deadlock in DAGScheduler.getPreferredLocs (synced). The fix is to compute `startIndices` during construction.
      
      This should be applied to branch-1.0, branch-1.1, and branch-1.2.
      
      pwendell
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3291 from mengxr/SPARK-4433 and squashes the following commits:
      
      c284d9f [Xiangrui Meng] fix a racing condition in zipWithIndex
      bb460461
    • Davies Liu's avatar
      [SPARK-3721] [PySpark] broadcast objects larger than 2G · 4a377aff
      Davies Liu authored
      This patch will bring support for broadcasting objects larger than 2G.
      
pickle, zlib, FrameSerializer, and Array[Byte] all cannot handle objects larger than 2G, so this patch introduces LargeObjectSerializer for broadcast objects: the object is serialized and compressed into small chunks, and the type of the broadcast changes from Broadcast[Array[Byte]] to Broadcast[Array[Array[Byte]]].
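
A toy sketch of the chunking idea (not the actual LargeObjectSerializer; names and chunk size are hypothetical):

```python
def chunk_bytes(data, chunk_size=1 << 20):
    # Split a serialized payload into bounded-size pieces so no single
    # byte array has to exceed the 2G limit.
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

chunks = chunk_bytes(b"x" * (5 << 20))  # 5 MB payload -> 5 chunks of 1 MB
```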
      
Testing broadcasts of objects larger than 2G is slow and memory hungry, so this was tested manually; it could be added to SparkPerf.
      
      Author: Davies Liu <davies@databricks.com>
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2659 from davies/huge and squashes the following commits:
      
      7b57a14 [Davies Liu] add more tests for broadcast
      28acff9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
      a2f6a02 [Davies Liu] bug fix
      4820613 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
      5875c73 [Davies Liu] address comments
      10a349b [Davies Liu] address comments
      0c33016 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
      6182c8f [Davies Liu] Merge branch 'master' into huge
      d94b68f [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
      2514848 [Davies Liu] address comments
      fda395b [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
      1c2d928 [Davies Liu] fix scala style
      091b107 [Davies Liu] broadcast objects larger than 2G
      4a377aff