  1. Feb 12, 2016
• [SPARK-13293][SQL] generate Expand · 2228f074
      Davies Liu authored
Expand suffers from creating the UnsafeRow from the same input multiple times; with codegen, it only needs to copy some of the columns.
      
After this, we can see a 3X improvement (from 43 seconds to 13 seconds) on a TPCDS query (Q67) that has eight columns in a Rollup.
      
Ideally, we could mask some of the columns based on a bitmask; I'd leave that for the future, because currently Aggregation (50 ns) is much slower than just copying the variables (1-2 ns).
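A minimal sketch of a rollup that exercises Expand, loosely in the spirit of Q67 (hypothetical column names, assuming a sqlContext in scope):

```scala
// Each input row is expanded once per rollup level; with generated code the
// projections share unchanged columns instead of rebuilding every UnsafeRow.
val df = sqlContext.range(1000000)
  .selectExpr("id % 10 AS a", "id % 7 AS b", "id AS v")
df.rollup("a", "b").sum("v").collect()
```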
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11177 from davies/gen_expand.
• [SPARK-5095] remove flaky test · 62b1c07e
      Michael Gummelt authored
Overrode the start() method, which was previously starting a thread that caused a race condition. I believe this should fix the flaky test.
      
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      
      Closes #11164 from mgummelt/fix_mesos_tests.
• [SPARK-5095] Fix style in mesos coarse grained scheduler code · 38bc6018
      Michael Gummelt authored
      andrewor14 This addressed your style comments from #10993
      
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      
      Closes #11187 from mgummelt/fix_mesos_style.
• [SPARK-12630][PYSPARK] [DOC] PySpark classification parameter desc to consistent format · 42d65681
      vijaykiran authored
      Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the classification module.
      
      Author: vijaykiran <mail@vijaykiran.com>
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #11183 from BryanCutler/pyspark-consistent-param-classification-SPARK-12630.
• [SPARK-12962] [SQL] [PySpark] PySpark support covar_samp and covar_pop · 90de6b2f
      Yanbo Liang authored
Add PySpark support for ```covar_samp``` and ```covar_pop```.
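The SQL aggregates themselves already exist; this PR exposes them in PySpark. A small hypothetical example of what they compute (in Scala/SQL, assuming a sqlContext in scope):

```scala
// Hypothetical table: b is a linear function of a, so both covariances are positive.
sqlContext.range(100).selectExpr("id AS a", "(id * 2 + 1) AS b").registerTempTable("t")
sqlContext.sql("SELECT covar_samp(a, b), covar_pop(a, b) FROM t").show()
```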
      
      cc rxin davies marmbrus
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #10876 from yanboliang/spark-12962.
• [SPARK-13260][SQL] count(*) does not work with CSV data source · ac7d6af1
      hyukjinkwon authored
      https://issues.apache.org/jira/browse/SPARK-13260
This is a quick fix for `count(*)`.
      
When `requiredColumns` is empty, it currently returns `sqlContext.sparkContext.emptyRDD[Row]`, which yields a count of zero rather than the actual row count.
      
Just like the JSON data source, this PR lets the CSV data source count the rows without parsing each set of tokens.
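A minimal sketch of the affected path, assuming the built-in CSV source and a hypothetical file path:

```scala
// count(*) requires no columns, so with this fix the CSV source counts rows
// directly instead of returning an empty RDD (and without tokenizing each line).
val df = sqlContext.read
  .format("csv")
  .option("header", "true")
  .load("/tmp/people.csv")
println(df.count())
```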
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #11169 from HyukjinKwon/SPARK-13260.
• [SPARK-13282][SQL] LogicalPlan toSql should just return a String · c4d5ad80
      Reynold Xin authored
Previously we were using Option[String], with None indicating the case when Spark fails to generate SQL. It is easier to just use exceptions to propagate error cases, rather than having for comprehensions everywhere. I also introduced a "build" function that simplifies string concatenation (i.e. no need to reason about whether we have an extra space or not).
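The concatenation helper might look roughly like this (an illustrative sketch, not the exact patch):

```scala
// Joins non-empty SQL fragments with single spaces, so callers never need to
// reason about leading or trailing whitespace.
def build(segments: String*): String =
  segments.filter(_.nonEmpty).mkString(" ")

build("SELECT", "`a`", "FROM", "", "`t`")  // "SELECT `a` FROM `t`"
```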
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11171 from rxin/SPARK-13282.
• [SPARK-12705] [SQL] push missing attributes for Sort · 5b805df2
      Davies Liu authored
The current implementation of ResolveSortReferences can only push one missing attribute into its child; it failed to analyze TPCDS Q98 because there are two missing attributes in that query (one from Window, another from Aggregate).
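A simplified, hypothetical illustration of the missing-attribute pattern (Q98's actual failure involves one attribute coming from a Window and one from an Aggregate):

```scala
sqlContext.range(10).selectExpr("id % 2 AS a", "id % 3 AS b", "id AS c")
  .registerTempTable("t")
// Neither b nor c appears in the output, so the analyzer must push both
// missing attributes into the Sort's child before the plan resolves.
sqlContext.sql("SELECT a FROM t ORDER BY b, c")
```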
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11153 from davies/resolve_sort.
• [SPARK-13154][PYTHON] Add linting for pydocs · 64515e5f
      Holden Karau authored
      We should have lint rules using sphinx to automatically catch the pydoc issues that are sometimes introduced.
      
Right now ./dev/lint-python will skip building the docs if sphinx isn't present, but it might make sense to fail hard; it's just a matter of whether we want to insist that all PySpark developers have sphinx present.
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #11109 from holdenk/SPARK-13154-add-pydoc-lint-for-docs.
• [SPARK-12974][ML][PYSPARK] Add Python API for spark.ml bisecting k-means · a183dda6
      Yanbo Liang authored
      Add Python API for spark.ml bisecting k-means.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #10889 from yanboliang/spark-12974.
• [SPARK-6166] Limit number of in flight outbound requests · 894921d8
      Sanket authored
      This JIRA is related to
      https://github.com/apache/spark/pull/5852
Had to do some minor rework and testing to make sure it works with the current version of Spark.
      
      Author: Sanket <schintap@untilservice-lm>
      
      Closes #10838 from redsanket/limit-outbound-connections.
  2. Feb 11, 2016
  3. Feb 10, 2016
• [SPARK-12706] [SQL] grouping() and grouping_id() · b5761d15
      Davies Liu authored
grouping() returns whether a column is aggregated or not, and grouping_id() returns the aggregation levels.
      
grouping()/grouping_id() can be used with window functions, but do not work in having/sort clauses; that will be fixed by another PR.
      
The GROUPING__ID/grouping_id() in Hive is wrong (according to its docs), and we also implemented it wrongly; this PR changes that to match the behavior in most databases (and in Hive's docs).
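A small hypothetical example of the two functions (assuming Hive-style WITH ROLLUP syntax and a sqlContext in scope):

```scala
sqlContext.range(12).selectExpr("id % 2 AS a", "id % 3 AS b", "id AS v")
  .registerTempTable("t")
// grouping(a) is 0 when a takes part in the grouping at that level and 1 when
// it has been aggregated away; grouping_id() packs those bits into one number.
sqlContext.sql(
  "SELECT a, b, grouping(a), grouping_id(), sum(v) FROM t " +
  "GROUP BY a, b WITH ROLLUP").show()
```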
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #10677 from davies/grouping.
• [SPARK-13205][SQL] SQL Generation Support for Self Join · 0f09f022
      gatorsmile authored
      This PR addresses two issues:
  - Self join does not work in SQL Generation (see the sketch below)
        - When creating new instances for `LogicalRelation`, `metastoreTableIdentifier` is lost.
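For the first issue, a hypothetical self join of the kind whose plan previously could not be converted back to SQL:

```scala
import sqlContext.implicits._

// Both sides reference the same underlying relation, so SQL generation must
// assign distinct qualifiers to keep the column references unambiguous.
val df = sqlContext.table("t")  // hypothetical table with a key column
df.as("x").join(df.as("y"), $"x.key" === $"y.key").select($"x.key")
```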
      
      liancheng Could you please review the code changes? Thank you!
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #11084 from gatorsmile/selfJoinInSQLGen.
• [SPARK-12725][SQL] Resolving Name Conflicts in SQL Generation and Name... · 663cc400
      gatorsmile authored
      [SPARK-12725][SQL] Resolving Name Conflicts in SQL Generation and Name Ambiguity Caused by Internally Generated Expressions
      
      Some analysis rules generate aliases or auxiliary attribute references with the same name but different expression IDs. For example, `ResolveAggregateFunctions` introduces `havingCondition` and `aggOrder`, and `DistinctAggregationRewriter` introduces `gid`.
      
This is OK for normal query execution since these attribute references are distinguished by their expression IDs. However, it's troublesome when converting resolved query plans back to SQL query strings since expression IDs are erased.
      
      Here's an example Spark 1.6.0 snippet for illustration:
      ```scala
      sqlContext.range(10).select('id as 'a, 'id as 'b).registerTempTable("t")
      sqlContext.sql("SELECT SUM(a) FROM t GROUP BY a, b ORDER BY COUNT(a), COUNT(b)").explain(true)
      ```
      The above code produces the following resolved plan:
      ```
      == Analyzed Logical Plan ==
      _c0: bigint
      Project [_c0#101L]
      +- Sort [aggOrder#102L ASC,aggOrder#103L ASC], true
         +- Aggregate [a#47L,b#48L], [(sum(a#47L),mode=Complete,isDistinct=false) AS _c0#101L,(count(a#47L),mode=Complete,isDistinct=false) AS aggOrder#102L,(count(b#48L),mode=Complete,isDistinct=false) AS aggOrder#103L]
            +- Subquery t
               +- Project [id#46L AS a#47L,id#46L AS b#48L]
                  +- LogicalRDD [id#46L], MapPartitionsRDD[44] at range at <console>:26
      ```
      Here we can see that both aggregate expressions in `ORDER BY` are extracted into an `Aggregate` operator, and both of them are named `aggOrder` with different expression IDs.
      
The solution is to automatically add the expression IDs into the attribute names of the Aliases and AttributeReferences that are generated by the Analyzer during SQL generation.
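Illustratively (a hypothetical helper, not the actual patch), the two aggOrder attributes above would stop colliding once the expression ID is baked into the rendered name:

```scala
// Append the expression ID so generated names stay unique after IDs are
// erased from the SQL string, e.g. aggOrder_102 vs. aggOrder_103.
def normalizedName(name: String, exprId: Long): String = s"${name}_$exprId"

normalizedName("aggOrder", 102L)  // "aggOrder_102"
```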
      
This PR also resolves another issue: users could use the same names as the internally generated ones. Such duplicate names should not cause name ambiguity; when resolving a column, Catalyst should not pick the one that was internally generated.
      
      Could you review the solution? marmbrus liancheng
      
I did not set the newly added flag for all the aliases and attribute references generated by Analyzer rules. Please let me know if I should. Thank you!
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #11050 from gatorsmile/namingConflicts.
• [SPARK-13274] Fix Aggregator Links on GroupedDataset Scala API · 719973b0
      raela authored
      Update Aggregator links to point to #org.apache.spark.sql.expressions.Aggregator
      
      Author: raela <raela@databricks.com>
      
      Closes #11158 from raelawang/master.
• [SPARK-13146][SQL] Management API for continuous queries · 0902e202
      Tathagata Das authored
      ### Management API for Continuous Queries
      
      **API for getting status of each query**
      - Whether active or not
      - Unique name of each query
      - Status of the sources and sinks
      - Exceptions
      
      **API for managing each query**
      - Immediately stop an active query
- Waiting for a query to terminate, either successfully or with an error
      
      **API for managing multiple queries**
      - Listing all active queries
      - Getting an active query by name
      - Waiting for any one of the active queries to be terminated
      
      **API for listening to query life cycle events**
- ContinuousQueryListener API for query start, progress and termination events (a sketch of the combined API shape follows below).
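Taken together, a hedged sketch of how these pieces might be used (names follow the description above and should be treated as illustrative):

```scala
// The query manager hangs off the SQLContext; assumes an active continuous
// query named "events" has already been started.
val cqm = sqlContext.streams

cqm.active.foreach(q => println(q.name))  // list all active queries
val query = cqm.get("events")             // get an active query by name
println(query.isActive)

query.stop()                              // immediately stop the query
cqm.awaitAnyTermination()                 // wait for any active query to terminate
```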
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #11030 from tdas/streaming-df-management-api.