  Jan 06, 2016
    • [SPARK-12678][CORE] MapPartitionsRDD clearDependencies · b6738520
      Guillaume Poulin authored
      MapPartitionsRDD was keeping a reference to `prev` after a call to
      `clearDependencies`, which could lead to a memory leak.
      
      Author: Guillaume Poulin <poulin.guillaume@gmail.com>
      
      Closes #10623 from gpoulin/map_partition_deps.
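
      A hedged sketch of the kind of change involved: the wrapping RDD keeps its parent in a mutable field so the reference can be released in clearDependencies(). This is illustrative only; signatures are simplified from the real Spark core class.

      ```scala
      import scala.reflect.ClassTag
      import org.apache.spark.{Partition, TaskContext}
      import org.apache.spark.rdd.RDD

      // An RDD that wraps a parent should drop its own reference in clearDependencies(),
      // otherwise the parent cannot be garbage collected even after the dependency
      // list has been cleared.
      class MapPartitionsRDD[U: ClassTag, T: ClassTag](
          var prev: RDD[T], // a var, so the reference can be nulled out
          f: (TaskContext, Int, Iterator[T]) => Iterator[U])
        extends RDD[U](prev) {

        override def getPartitions: Array[Partition] = firstParent[T].partitions

        override def compute(split: Partition, context: TaskContext): Iterator[U] =
          f(context, split.index, firstParent[T].iterator(split, context))

        override def clearDependencies(): Unit = {
          super.clearDependencies()
          prev = null // release the reference that was previously retained
        }
      }
      ```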
    • [SPARK-12673][UI] Add missing uri prepending for job description · 174e72ce
      jerryshao authored
      Otherwise the URL will fail to be proxied to the right one when running in YARN mode. Here is the screenshot:
      
      ![screen shot 2016-01-06 at 5 28 26 pm](https://cloud.githubusercontent.com/assets/850797/12139632/bbe78ecc-b49c-11e5-8932-94e8b3622a09.png)
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #10618 from jerryshao/SPARK-12673.
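
      A hedged sketch of the kind of fix: links built inside the web UI should go through the base-URI helper instead of a hard-coded root-relative path, so they still resolve behind the YARN proxy. The call site and path below are assumptions for illustration.

      ```scala
      import org.apache.spark.ui.UIUtils

      // Illustrative only; UIUtils is private[spark], so real code like this lives
      // inside Spark's UI pages (e.g. the jobs page that renders the description).
      def jobDescriptionLink(basePath: String, jobId: Int): String = {
        // Before (breaks behind a proxy): s"/jobs/job?id=$jobId"
        s"${UIUtils.prependBaseUri(basePath)}/jobs/job?id=$jobId"
      }
      ```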
    • [SPARK-7689] Remove TTL-based metadata cleaning in Spark 2.0 · 8e19c766
      Josh Rosen authored
      This PR removes `spark.cleaner.ttl` and the associated TTL-based metadata cleaning code.
      
      Now that we have the `ContextCleaner` and a timer to trigger periodic GCs, I don't think that `spark.cleaner.ttl` is necessary anymore. The TTL-based cleaning isn't enabled by default, isn't included in our end-to-end tests, and has been a source of user confusion when it is misconfigured. If the TTL is set too low, data which is still being used may be evicted or deleted, leading to hard-to-diagnose bugs.
      
      For all of these reasons, I think that we should remove this functionality in Spark 2.0. Additional benefits of doing this include marginally reduced memory usage, since we no longer need to store timestamps in hashmaps, and a handful fewer threads.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10534 from JoshRosen/remove-ttl-based-cleaning.
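
      For reference, a hedged configuration sketch of what remains after this change; the keys below reflect roughly this era of Spark and are shown for illustration only.

      ```scala
      import org.apache.spark.SparkConf

      // Reference-tracking cleanup (the ContextCleaner) replaces TTL-based cleaning.
      val conf = new SparkConf()
        .setAppName("cleaner-demo")
        .set("spark.cleaner.referenceTracking", "true")    // on by default
        .set("spark.cleaner.periodicGC.interval", "30min") // drives the periodic GC timer
      // .set("spark.cleaner.ttl", "3600")                 // no longer supported after this change
      ```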
    • [SPARK-12663][MLLIB] More informative error message in MLUtils.loadLibSVMFile · 6b6d02be
      Robert Dodier authored
      This PR contains one commit, which resolves [SPARK-12663](https://issues.apache.org/jira/browse/SPARK-12663).
      
      For the record, I got a positive response from 2 people when I floated this idea on dev@spark.apache.org on 2015-10-23. [Link to archived discussion.](http://apache-spark-developers-list.1001551.n3.nabble.com/slightly-more-informative-error-message-in-MLUtils-loadLibSVMFile-td14764.html)
      
      Author: Robert Dodier <robert_dodier@users.sourceforge.net>
      
      Closes #10611 from robert-dodier/loadlibsvmfile-error-msg-branch.
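
      A hedged sketch of the kind of message improvement, assuming the usual LIBSVM parsing loop (names and structure here are illustrative, not the exact MLUtils code):

      ```scala
      // Parse one LIBSVM line and fail with an informative message that includes
      // the offending values and the raw line, instead of a bare requirement failure.
      def parseLibSVMRecord(line: String): (Double, Array[Int], Array[Double]) = {
        val items = line.split(' ')
        val label = items.head.toDouble
        val (indices, values) = items.tail.filter(_.nonEmpty).map { item =>
          val Array(i, v) = item.split(':')
          (i.toInt - 1, v.toDouble) // convert one-based index to zero-based
        }.unzip
        var previous = -1
        indices.foreach { current =>
          require(current > previous,
            s"indices should be one-based and in ascending order; " +
            s"found current=${current + 1}, previous=${previous + 1}; line=$line")
          previous = current
        }
        (label, indices.toArray, values.toArray)
      }
      ```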
    • [SPARK-12640][SQL] Add simple benchmarking utility class and add Parquet scan benchmarks. · a74d743c
      Nong Li authored
      
      We've been running ad hoc benchmarks to measure scanner performance. Since we will continue to invest in this, it makes sense to get these benchmarks into code. This adds a simple benchmarking utility to do that.
      
      Author: Nong Li <nong@databricks.com>
      Author: Nong <nongli@gmail.com>
      
      Closes #10589 from nongli/spark-12640.
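
      For illustration, a stripped-down sketch of what such a utility does: time several named cases over the same row count and report a rate. This is not the class added by the PR, just the idea.

      ```scala
      object MiniBenchmark {
        def run(name: String, numRows: Long)(cases: (String, () => Unit)*): Unit = {
          println(s"Benchmark: $name ($numRows rows)")
          cases.foreach { case (caseName, body) =>
            val start = System.nanoTime()
            body()
            val elapsedMs = (System.nanoTime() - start) / 1e6
            val millionsPerSec = (numRows / 1e6) / (elapsedMs / 1e3)
            println(f"  $caseName%-30s $elapsedMs%10.1f ms  $millionsPerSec%8.2f M rows/s")
          }
        }
      }

      // Usage sketch (the scan bodies are hypothetical placeholders):
      // MiniBenchmark.run("Parquet scan", 1000000L)(
      //   "SQL Parquet reader" -> (() => { /* scan with the SQL reader */ }),
      //   "Hive Parquet SerDe" -> (() => { /* scan via the Hive SerDe */ })
      // )
      ```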
    • [SPARK-12604][CORE] Java count(ApproxDistinct)ByKey methods return Scala Long not Java · ac56cf60
      Sean Owen authored
      Change the Java countByKey and countApproxDistinctByKey return types to use Java Long, not Scala Long; update similar methods for consistency on java.lang.Long.valueOf, with no API change.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #10554 from srowen/SPARK-12604.
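
      A hedged sketch of the conversion involved (names are illustrative of the change, not the exact JavaPairRDD code):

      ```scala
      import java.{lang => jl}
      import scala.collection.JavaConverters._

      // The Java API should expose java.lang.Long, not Scala Long, so the Scala
      // result map is converted with java.lang.Long.valueOf before being returned.
      def toJavaCounts[K](counts: Map[K, Long]): java.util.Map[K, jl.Long] =
        counts.mapValues(v => jl.Long.valueOf(v)).toMap.asJava
      ```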
    • [SPARK-12539][SQL] support writing bucketed table · 917d3fc0
      Wenchen Fan authored
      This PR adds bucket write support to Spark SQL. User can specify bucketing columns, numBuckets and sorting columns with or without partition columns. For example:
      ```
      df.write.partitionBy("year").bucketBy(8, "country").sortBy("amount").saveAsTable("sales")
      ```
      
      When bucketing is used, we calculate a bucket id for each record and group the records by bucket id. For each group, we create a file with the bucket id in its name and write the data into it. For each bucket file, if sorting columns are specified, the data is sorted before writing.
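
      Conceptually, each record's bucket id is a hash of the bucketing columns modulo the number of buckets. A minimal sketch of the idea (the real implementation uses Spark's own hash expression, not hashCode, which is why it is not Hive-compatible):

      ```scala
      // Illustrative only: map a record's bucketing-column values to a bucket id
      // in the range [0, numBuckets).
      def bucketIdFor(bucketColumnValues: Seq[Any], numBuckets: Int): Int = {
        val hash = bucketColumnValues.hashCode() // stand-in for Spark's hash expression
        val mod = hash % numBuckets
        if (mod < 0) mod + numBuckets else mod
      }
      ```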
      
      Note that there may be multiple files for one bucket, as the data is distributed.
      
      Currently we store the bucket metadata in the Hive metastore in a non-Hive-compatible way. We use a different bucketing hash function than Hive, so we can't be compatible anyway.
      
      Limitations:
      
      * Can't write bucketed data without hive metastore.
      * Can't insert bucketed data into existing hive tables.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10498 from cloud-fan/bucket-write.
    • [SPARK-12681] [SQL] split IdentifiersParser.g into two files · 6f7ba640
      Davies Liu authored
      This avoids having a huge Java source file (over 64K lines of code) that can't be compiled.
      
      cc hvanhovell
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #10624 from davies/split_ident.
    • Revert "[SPARK-12672][STREAMING][UI] Use the uiRoot function instead of... · cbaea959
      Shixiong Zhu authored
      Revert "[SPARK-12672][STREAMING][UI] Use the uiRoot function instead of default root path to gain the streaming batch url."
      
      This reverts commit 19e4e9fe. Will merge #10618 instead.
    • [SPARK-12672][STREAMING][UI] Use the uiRoot function instead of default root... · 19e4e9fe
      huangzhaowei authored
      [SPARK-12672][STREAMING][UI] Use the uiRoot function instead of default root path to gain the streaming batch url.
      
      Author: huangzhaowei <carlmartinmax@gmail.com>
      
      Closes #10617 from SaintBacchus/SPARK-12672.
    • [SPARK-12617][PYSPARK] Move Py4jCallbackConnectionCleaner to Streaming · 1e6648d6
      Shixiong Zhu authored
      Move Py4jCallbackConnectionCleaner to Streaming because the callback server starts only in StreamingContext.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #10621 from zsxwing/SPARK-12617-2.
    • [SPARK-12368][ML][DOC] Better doc for the binary classification evaluator's metricName · f82ebb15
      BenFradet authored
      For the BinaryClassificationEvaluator, the scaladoc doesn't mention that "areaUnderPR" is supported, only that the default is "areaUnderROC".
      Also, the documentation says:
      "The default metric used to choose the best ParamMap can be overridden by the setMetric method in each of these evaluators."
      However, the method is called setMetricName.
      
      This PR aims to fix both issues.
      
      Author: BenFradet <benjamin.fradet@gmail.com>
      
      Closes #10328 from BenFradet/SPARK-12368.
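
      A short usage sketch showing both supported metric names and the actual setter name (assuming the spark.ml evaluator API of this era):

      ```scala
      import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

      // "areaUnderROC" is the default; "areaUnderPR" is also supported.
      val evaluator = new BinaryClassificationEvaluator()
        .setMetricName("areaUnderPR") // note: setMetricName, not setMetric
        .setRawPredictionCol("rawPrediction")
        .setLabelCol("label")
      ```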
    • [SPARK-12006][ML][PYTHON] Fix GMM failure if initialModel is not None · fcd013cf
      zero323 authored
      If the initial model passed to GMM is not empty, it causes `net.razorvine.pickle.PickleException`. It can be fixed by converting `initialModel.weights` to `list`.
      
      Author: zero323 <matthew.szymkiewicz@gmail.com>
      
      Closes #9986 from zero323/SPARK-12006.
    • [SPARK-12573][SPARK-12574][SQL] Move SQL Parser from Hive to Catalyst · ea489f14
      Herman van Hovell authored
      This PR moves a major part of the new SQL parser to Catalyst. This is a prelude to start using this parser for all of our SQL parsing. The following key changes have been made:
      
      The ANTLR parser & supporting classes have been moved to the Catalyst project. They are now part of the ```org.apache.spark.sql.catalyst.parser``` package. These classes contained quite a bit of code that was originally from the Hive project; I have added acknowledgements wherever this applied. All Hive dependencies have been factored out. I have also taken this chance to clean up the ```ASTNode``` class and to improve the error handling.
      
      The HiveQl object that provides the functionality to convert an AST into a LogicalPlan has been refactored into three different classes, one for every SQL sub-project:
      - ```CatalystQl```: This implements Query and Expression parsing functionality.
      - ```SparkQl```: This is a subclass of ```CatalystQl``` and provides SQL/Core-only functionality such as Explain and Describe.
      - ```HiveQl```: This is a subclass of ```SparkQl``` and this adds Hive-only functionality to the parser such as Analyze, Drop, Views, CTAS & Transforms. This class still depends on Hive.
      
      cc rxin
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #10583 from hvanhovell/SPARK-12575.
    • [SPARK-11815][ML][PYSPARK] PySpark DecisionTreeClassifier &... · 3aa34882
      Yanbo Liang authored
      [SPARK-11815][ML][PYSPARK] PySpark DecisionTreeClassifier & DecisionTreeRegressor should support setSeed
      
      PySpark ```DecisionTreeClassifier``` & ```DecisionTreeRegressor``` should support ```setSeed``` like what we do on the Scala side.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #9807 from yanboliang/spark-11815.
    • [SPARK-11945][ML][PYSPARK] Add computeCost to KMeansModel for PySpark spark.ml · 95eb6516
      Yanbo Liang authored
      Add ```computeCost``` to ```KMeansModel``` as an evaluator for PySpark spark.ml.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #9931 from yanboliang/SPARK-11945.
    • [SPARK-11531][ML] SparseVector error Msg · 007da1a9
      Joshi authored
      PySpark SparseVector should raise a "Found duplicate indices" error message.
      
      Author: Joshi <rekhajoshm@gmail.com>
      Author: Rekha Joshi <rekhajoshm@gmail.com>
      
      Closes #9525 from rekhajoshm/SPARK-11531.
    • [SPARK-7675][ML][PYSPARK] sparkml params type conversion · 3b29004d
      Holden Karau authored
      From JIRA:
      Currently, PySpark wrappers for spark.ml Scala classes are brittle when accepting Param types. E.g., Normalizer's "p" param cannot be set to "2" (an integer); it must be set to "2.0" (a float). Fixing this is not trivial since there does not appear to be a natural place to insert the conversion before Python wrappers call Java's Params setter method.
      
      A possible fix would be to include a method "_checkType" in PySpark's Param class which checks the type, prints an error if needed, and converts types when relevant (e.g., int to float, or scipy matrix to array). The Java wrapper method which copies params to Scala can call this method when available.
      
      This fix instead checks the types at set time, since I think failing sooner is better, but I can switch it around to check at copy time if that would be better. So far this only converts int to float; other conversions (like scipy matrix to array) are left for the future.
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #9581 from holdenk/SPARK-7675-PySpark-sparkml-Params-type-conversion.
    • [SPARK-11878][SQL] Eliminate distribute by in case group by is present with... · 9061e777
      Yash Datta authored
      [SPARK-11878][SQL] Eliminate distribute by in case group by is present with exactly the same grouping expressions
      
      For queries like:
      select <> from table group by a distribute by a
      we can eliminate the distribute by, since the group by will do a hash partitioning anyway.
      This is also applicable when the user uses the DataFrame API.
      
      Author: Yash Datta <Yash.Datta@guavus.com>
      
      Closes #9858 from saucam/eliminatedistribute.
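
      For illustration, a hedged sketch assuming a SQLContext named sqlContext and a registered table t; with this change, both statements should plan a single hash-partitioning exchange on a.

      ```scala
      // Redundant: DISTRIBUTE BY on exactly the grouping expressions adds nothing,
      // because GROUP BY already hash-partitions by `a`.
      val withDistributeBy =
        sqlContext.sql("SELECT a, count(*) FROM t GROUP BY a DISTRIBUTE BY a")
      val groupByOnly =
        sqlContext.sql("SELECT a, count(*) FROM t GROUP BY a")

      // The same applies to the DataFrame API, e.g.:
      // df.repartition(df("a")).groupBy("a").count()
      ```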
    • [SPARK-12665][CORE][GRAPHX] Remove Vector, VectorSuite and... · 94c202c7
      Kousuke Saruta authored
      [SPARK-12665][CORE][GRAPHX] Remove Vector, VectorSuite and GraphKryoRegistrator which are deprecated and no longer used
      
      The whole of Vector.scala, VectorSuite.scala and GraphKryoRegistrator.scala is no longer used, so it's time to remove them in Spark 2.0.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #10613 from sarutak/SPARK-12665.
    • [SPARK-12340][SQL] fix Int overflow in the SparkPlan.executeTake, RDD.take and... · 5d871ea4
      QiangCai authored
      [SPARK-12340][SQL] fix Int overflow in the SparkPlan.executeTake, RDD.take and AsyncRDDActions.takeAsync
      
      I have closed pull request https://github.com/apache/spark/pull/10487 and created this pull request to resolve the problem.

      Spark JIRA: https://issues.apache.org/jira/browse/SPARK-12340
      
      Author: QiangCai <david.caiq@gmail.com>
      
      Closes #10562 from QiangCai/bugfix.
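
      A hedged sketch of the overflow pattern and the fix: do the scaling arithmetic in Long and clamp before converting back to Int. Variable names only loosely mirror the take() loop.

      ```scala
      // Estimate how many more partitions to scan without overflowing Int when
      // num and partsScanned are large.
      def nextPartsToTry(num: Int, partsScanned: Long, bufSize: Int, totalParts: Int): Int = {
        val estimate: Long =
          if (bufSize == 0) partsScanned * 4L
          else (1.5 * num * partsScanned / bufSize).toLong + 1L
        math.min(estimate, totalParts.toLong).toInt
      }
      ```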
    • [SPARK-12578][SQL] Distinct should not be silently ignored when used in an... · b2467b38
      Liang-Chi Hsieh authored
      [SPARK-12578][SQL] Distinct should not be silently ignored when used in an aggregate function with OVER clause
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-12578
      
      A slight update to the Hive parser. We should keep the DISTINCT keyword when it is used in an aggregate function with an OVER clause, so that CheckAnalysis will detect it and throw an exception later.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #10557 from viirya/keep-distinct-hivesql.
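
      For illustration, a hedged sketch assuming a HiveContext named sqlContext with a table t; with the parser keeping DISTINCT, this query should now be rejected during analysis instead of silently computing count(a) OVER (...).

      ```scala
      import org.apache.spark.sql.AnalysisException

      val query = "SELECT count(DISTINCT a) OVER (PARTITION BY b) FROM t"
      try {
        sqlContext.sql(query) // analysis should now fail instead of dropping DISTINCT
      } catch {
        case e: AnalysisException => println(s"rejected as expected: ${e.getMessage}")
      }
      ```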