  1. Feb 12, 2016
  2. Feb 11, 2016
  3. Feb 10, 2016
    • Davies Liu's avatar
      [SPARK-12706] [SQL] grouping() and grouping_id() · b5761d15
      Davies Liu authored
      grouping() indicates whether a column is aggregated in the current result row, and grouping_id() returns the aggregation level.
      
      grouping()/grouping_id() can be used with window functions, but do not yet work in HAVING/ORDER BY clauses; that will be fixed in another PR.
      
      Hive's GROUPING__ID/grouping_id() behavior is wrong (according to its own docs), and we had implemented it the same wrong way; this PR changes ours to match the behavior of most databases (and the Hive docs).
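      The adopted semantics can be sketched in plain Scala (an illustrative helper, not Spark's implementation):
      ```scala
      // grouping(col) is 0 when the column participates in the current grouping
      // set and 1 when it has been aggregated away; grouping_id() packs those
      // bits, with the first GROUP BY column as the most significant bit.
      def groupingId(groupingBits: Seq[Int]): Int =
        groupingBits.foldLeft(0)((acc, bit) => (acc << 1) | bit)

      // GROUP BY a, b WITH ROLLUP:
      //   (a, b) -> bits (0, 0) -> grouping_id = 0
      //   (a)    -> bits (0, 1) -> grouping_id = 1
      //   ()     -> bits (1, 1) -> grouping_id = 3
      ```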
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #10677 from davies/grouping.
      b5761d15
    • gatorsmile's avatar
      [SPARK-13205][SQL] SQL Generation Support for Self Join · 0f09f022
      gatorsmile authored
      This PR addresses two issues:
        - Self join does not work in SQL Generation
        - When creating new instances for `LogicalRelation`, `metastoreTableIdentifier` is lost.
      
      liancheng Could you please review the code changes? Thank you!
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #11084 from gatorsmile/selfJoinInSQLGen.
      0f09f022
    • gatorsmile's avatar
      [SPARK-12725][SQL] Resolving Name Conflicts in SQL Generation and Name... · 663cc400
      gatorsmile authored
      [SPARK-12725][SQL] Resolving Name Conflicts in SQL Generation and Name Ambiguity Caused by Internally Generated Expressions
      
      Some analysis rules generate aliases or auxiliary attribute references with the same name but different expression IDs. For example, `ResolveAggregateFunctions` introduces `havingCondition` and `aggOrder`, and `DistinctAggregationRewriter` introduces `gid`.
      
      This is OK for normal query execution, since attribute references are disambiguated by their expression IDs. However, it's troublesome when converting resolved query plans back to SQL query strings, because expression IDs are erased there.
      
      Here's an example Spark 1.6.0 snippet for illustration:
      ```scala
      sqlContext.range(10).select('id as 'a, 'id as 'b).registerTempTable("t")
      sqlContext.sql("SELECT SUM(a) FROM t GROUP BY a, b ORDER BY COUNT(a), COUNT(b)").explain(true)
      ```
      The above code produces the following resolved plan:
      ```
      == Analyzed Logical Plan ==
      _c0: bigint
      Project [_c0#101L]
      +- Sort [aggOrder#102L ASC,aggOrder#103L ASC], true
         +- Aggregate [a#47L,b#48L], [(sum(a#47L),mode=Complete,isDistinct=false) AS _c0#101L,(count(a#47L),mode=Complete,isDistinct=false) AS aggOrder#102L,(count(b#48L),mode=Complete,isDistinct=false) AS aggOrder#103L]
            +- Subquery t
               +- Project [id#46L AS a#47L,id#46L AS b#48L]
                  +- LogicalRDD [id#46L], MapPartitionsRDD[44] at range at <console>:26
      ```
      Here we can see that both aggregate expressions in `ORDER BY` are extracted into an `Aggregate` operator, and both of them are named `aggOrder` with different expression IDs.
      
      The solution is to automatically append the expression ID to the attribute name for aliases and attribute references generated by the Analyzer when doing SQL generation.
      
      This PR also resolves another issue: users may use the same names as the internally generated ones, and such duplicates should not cause name ambiguity. When resolving a column, Catalyst should never pick an internally generated one.
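      A minimal sketch of the disambiguation idea, using a hypothetical helper (not the actual Catalyst code):
      ```scala
      // Hypothetical illustration: when generating SQL, append the expression
      // ID to the names of analyzer-generated attributes, so that aggOrder#102L
      // and aggOrder#103L no longer collide once the #IDs are erased.
      def sqlName(name: String, exprId: Long, generatedByAnalyzer: Boolean): String =
        if (generatedByAnalyzer) s"${name}_$exprId" else name

      // sqlName("aggOrder", 102, generatedByAnalyzer = true) -> "aggOrder_102"
      // sqlName("aggOrder", 103, generatedByAnalyzer = true) -> "aggOrder_103"
      ```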
      
      Could you review the solution? marmbrus liancheng
      
      I did not set the newly added flag for all the aliases and attribute references generated by the Analyzer. Please let me know if I should. Thank you!
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #11050 from gatorsmile/namingConflicts.
      663cc400
    • raela's avatar
      [SPARK-13274] Fix Aggregator Links on GroupedDataset Scala API · 719973b0
      raela authored
      Update Aggregator links to point to #org.apache.spark.sql.expressions.Aggregator
      
      Author: raela <raela@databricks.com>
      
      Closes #11158 from raelawang/master.
      719973b0
    • Tathagata Das's avatar
      [SPARK-13146][SQL] Management API for continuous queries · 0902e202
      Tathagata Das authored
      ### Management API for Continuous Queries
      
      **API for getting status of each query**
      - Whether active or not
      - Unique name of each query
      - Status of the sources and sinks
      - Exceptions
      
      **API for managing each query**
      - Immediately stop an active query
      - Waiting for a query to terminate, either normally or with an error
      
      **API for managing multiple queries**
      - Listing all active queries
      - Getting an active query by name
      - Waiting for any one of the active queries to be terminated
      
      **API for listening to query life cycle events**
      - ContinuousQueryListener API for query start, progress and termination events.
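      A hypothetical usage sketch of the APIs listed above; method names are illustrative and may not match the committed API exactly:
      ```scala
      // Illustrative only -- names are assumptions, not the exact committed API.
      query.isActive                            // whether the query is still running
      query.awaitTermination()                  // block until it stops, rethrowing errors
      query.stop()                              // immediately stop the query

      sqlContext.streams.active                 // list all active queries
      sqlContext.streams.get("myQuery")         // look one up by name
      sqlContext.streams.awaitAnyTermination()  // wait for any active query to terminate
      ```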
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #11030 from tdas/streaming-df-management-api.
      0902e202
    • Sean Owen's avatar
      [SPARK-12414][CORE] Remove closure serializer · 29c54730
      Sean Owen authored
      Remove the spark.closure.serializer option and always use JavaSerializer.
      
      CC andrewor14 rxin I see there's a discussion in the JIRA but just thought I'd offer this for a look at what the change would be.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11150 from srowen/SPARK-12414.
      29c54730
    • Takeshi YAMAMURO's avatar
      [SPARK-13057][SQL] Add benchmark codes and the performance results for... · 5947fa8f
      Takeshi YAMAMURO authored
      [SPARK-13057][SQL] Add benchmark codes and the performance results for implemented compression schemes for InMemoryRelation
      
      This PR adds benchmark code for in-memory cache compression to make future development and discussion smoother.
      
      Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
      
      Closes #10965 from maropu/ImproveColumnarCache.
      5947fa8f
    • zhuol's avatar
      [SPARK-13126] fix the right margin of history page. · 4b80026f
      zhuol authored
      The right margin of the history page is a little bit off. This is a simple fix for that issue.
      
      Author: zhuol <zhuol@yahoo-inc.com>
      
      Closes #11029 from zhuoliu/13126.
      4b80026f
    • Alex Bozarth's avatar
      [SPARK-13163][WEB UI] Column width on new History Server DataTables not getting set correctly · 39cc620e
      Alex Bozarth authored
      The column width for the new DataTables now adjusts for the current page rather than being hard-coded for the entire table's data.
      
      Author: Alex Bozarth <ajbozart@us.ibm.com>
      
      Closes #11057 from ajbozarth/spark13163.
      39cc620e
    • Josh Rosen's avatar
      [SPARK-13254][SQL] Fix planning of TakeOrderedAndProject operator · 5cf20598
      Josh Rosen authored
      The patch for SPARK-8964 ("use Exchange to perform shuffle in Limit" / #7334) inadvertently broke the planning of the TakeOrderedAndProject operator: because ReturnAnswer was the new root of the query plan, the TakeOrderedAndProject rule was unable to match before BasicOperators.
      
      This patch fixes the issue by moving the `TakeOrderedAndProject` and `CollectLimit` planning rules into the same strategy.
      
      In addition, I made changes to the TakeOrderedAndProject operator in order to make its `doExecute()` method lazy and added a new TakeOrderedAndProjectSuite which tests the new code path.
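      For reference, a query shape that should plan as a TakeOrderedAndProject (a sketch for the Spark shell; the table name is illustrative):
      ```scala
      // An ORDER BY ... LIMIT with a projection on top is the shape that should
      // plan as a single TakeOrderedAndProject rather than a full sort + limit.
      sqlContext.sql("SELECT a FROM t ORDER BY b LIMIT 10").explain(true)
      ```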
      
      /cc davies and marmbrus for review.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11145 from JoshRosen/take-ordered-and-project-fix.
      5cf20598
    • Michael Gummelt's avatar
      [SPARK-5095][MESOS] Support launching multiple mesos executors in coarse grained mesos mode. · 80cb963a
      Michael Gummelt authored
      This is the next iteration of tnachen's previous PR: https://github.com/apache/spark/pull/4027
      
      In that PR, we resolved with andrewor14 and pwendell to implement the Mesos scheduler's support of `spark.executor.cores` to be consistent with YARN and Standalone.  This PR implements that resolution.
      
      This PR implements two high-level features. These two features are co-dependent, so they're both implemented here:
      - Mesos support for spark.executor.cores
      - Multiple executors per slave
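      Under the new behavior, executor sizing can be sketched as follows (the values are illustrative):
      ```scala
      // With 16 total cores available and 4 cores per executor, the
      // coarse-grained Mesos scheduler can now launch up to 4 executors,
      // possibly several on the same slave.
      val conf = new org.apache.spark.SparkConf()
        .set("spark.cores.max", "16")
        .set("spark.executor.cores", "4")
      ```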
      
      We at Mesosphere have been working with Typesafe on a Spark/Mesos integration test suite: https://github.com/typesafehub/mesos-spark-integration-tests, which passes for this PR.
      
      The contribution is my original work and I license the work to the project under the project's open source license.
      
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      
      Closes #10993 from mgummelt/executor_sizing.
      80cb963a
    • Sean Owen's avatar
      [SPARK-9307][CORE][SPARK] Logging: Make it either stable or private · c0b71e0b
      Sean Owen authored
      Make Logging private[spark]. Pretty much all there is to it.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11103 from srowen/SPARK-9307.
      c0b71e0b
    • tedyu's avatar
      [SPARK-13203] Add scalastyle rule banning use of mutable.SynchronizedBuffer · e834e421
      tedyu authored
      andrewor14
      Please take a look
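      For context, the deprecated pattern the new rule bans, and one common replacement (a sketch, not part of this PR):
      ```scala
      // Banned: mutable.SynchronizedBuffer is deprecated since Scala 2.11.
      // val buf = new mutable.ArrayBuffer[Int] with mutable.SynchronizedBuffer[Int]

      // A common thread-safe replacement:
      import java.util.concurrent.ConcurrentLinkedQueue
      val queue = new ConcurrentLinkedQueue[Int]()
      queue.add(1)
      ```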
      
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #11134 from tedyu/master.
      e834e421
    • Jon Maurer's avatar
      [SPARK-11518][DEPLOY, WINDOWS] Handle spaces in Windows command scripts · 2ba9b6a2
      Jon Maurer authored
      Author: Jon Maurer <tritab@gmail.com>
      Author: Jonathan Maurer <jmaurer@Jonathans-MacBook-Pro.local>
      
      Closes #10789 from tritab/cmd_updates.
      2ba9b6a2