  1. Jul 24, 2015
    • [SPARK-9295] Analysis should detect sorting on unsupported column types · 6aceaf3d
      Josh Rosen authored
      This patch extends CheckAnalysis to throw errors for queries that try to sort on unsupported column types, such as ArrayType.
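      As a hedged illustration (not code from the patch), sorting on an ArrayType column now fails during analysis rather than deep inside execution; the DataFrame and column names below are made up, and sqlContext is assumed to be in scope as in spark-shell:

        // Hypothetical example: ordering by an array column is rejected at analysis time.
        val df = sqlContext.createDataFrame(Seq((1, Seq("a", "b")), (2, Seq("c")))).toDF("id", "tags")
        df.orderBy("tags").collect()
        // => org.apache.spark.sql.AnalysisException: sorting is not supported for the
        //    ArrayType column "tags" (exact message may differ)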
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7633 from JoshRosen/SPARK-9295 and squashes the following commits:
      
      23b2fbf [Josh Rosen] Embed function in foreach
      bfe1451 [Josh Rosen] Update to allow sorting by null literals
      2f1b802 [Josh Rosen] Add analysis rule to detect sorting on unsupported column types (SPARK-9295)
      6aceaf3d
    • [SPARK-9292] Analysis should check that join conditions' data types are BooleanType · c2b50d69
      Josh Rosen authored
      This patch adds an analysis check to ensure that join conditions' data types are BooleanType. This check is necessary in order to report proper errors for non-boolean DataFrame join conditions.
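      A hedged usage illustration (not from the patch) of the kind of join this now rejects; the DataFrames are made up and sqlContext is assumed to be in scope:

        val left  = sqlContext.createDataFrame(Seq((1, "a"))).toDF("id", "v")
        val right = sqlContext.createDataFrame(Seq((1, "b"))).toDF("id", "w")
        left.join(right, left("id") + right("id")).collect()
        // => AnalysisException: the join condition is of IntegerType, not BooleanType
        //    (exact message may differ)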
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7630 from JoshRosen/SPARK-9292 and squashes the following commits:
      
      aec6c7b [Josh Rosen] Check condition type in resolved()
      75a3ea6 [Josh Rosen] Fix SPARK-9292.
      c2b50d69
    • [SPARK-9305] Rename org.apache.spark.Row to Item. · c8d71a41
      Reynold Xin authored
      It's a helper used in test cases, but it is named Row. Pretty annoying, because every time I search for Row, it shows up before the Spark SQL Row, which is what a developer wants most of the time.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7638 from rxin/remove-row and squashes the following commits:
      
      aeda52d [Reynold Xin] [SPARK-9305] Rename org.apache.spark.Row to Item.
      c8d71a41
    • [SPARK-9285][SQL] Remove InternalRow's inheritance from Row. · 431ca39b
      Reynold Xin authored
      I also renamed InternalRow's size/length function to numFields, to make it more obvious that it refers to the number of fields, not bytes.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7626 from rxin/internalRow and squashes the following commits:
      
      e124daf [Reynold Xin] Fixed test case.
      805ceb7 [Reynold Xin] Commented out the failed test suite.
      f8a9ca5 [Reynold Xin] Fixed more bugs. Still at least one more remaining.
      76d9081 [Reynold Xin] Fixed data sources.
      7807f70 [Reynold Xin] Fixed DataFrameSuite.
      cb60cd2 [Reynold Xin] Code review & small bug fixes.
      0a2948b [Reynold Xin] Fixed style.
      3280d03 [Reynold Xin] [SPARK-9285][SQL] Remove InternalRow's inheritance from Row.
      431ca39b
    • [SPARK-9069] [SQL] follow up · dfb18be0
      Davies Liu authored
      Address comments for #7605
      
      cc rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7634 from davies/decimal_unlimited2 and squashes the following commits:
      
      b2d8b0d [Davies Liu] add doc and test for DecimalType.isWiderThan
      65b251c [Davies Liu] fix test
      6a91f32 [Davies Liu] fix style
      ca9c973 [Davies Liu] address comments
      dfb18be0
    • [SPARK-8756] [SQL] Keep cached information and avoid re-calculating footers in ParquetRelation2 · 6a7e537f
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-8756
      
      Currently, in ParquetRelation2, footers are re-read every time refresh() is called. Since reading all footers is expensive when there are many partitions, we can first check whether anything has possibly changed before re-reading them. This PR does that by keeping some cached information to check against.
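      A minimal sketch of the caching idea, using illustrative names rather than ParquetRelation2's actual fields: remember the last-seen FileStatus per file and only re-read footers for files whose length or modification time changed.

        import org.apache.hadoop.fs.{FileStatus, Path}
        import scala.collection.mutable

        // Illustrative cache of the last-seen status per Parquet part-file.
        val statusCache = mutable.Map.empty[Path, FileStatus]

        // A footer only needs to be re-read when the file's size or modification time changed.
        def footerStale(current: FileStatus): Boolean = statusCache.get(current.getPath) match {
          case Some(old) => old.getLen != current.getLen ||
                            old.getModificationTime != current.getModificationTime
          case None      => true
        }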
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #7154 from viirya/cached_footer_parquet_relation and squashes the following commits:
      
      92e9347 [Liang-Chi Hsieh] Fix indentation.
      ae0ec64 [Liang-Chi Hsieh] Fix wrong assignment.
      c8fdfb7 [Liang-Chi Hsieh] Fix it.
      a52b6d1 [Liang-Chi Hsieh] For comments.
      c2a2420 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into cached_footer_parquet_relation
      fa5458f [Liang-Chi Hsieh] Use Map to cache FileStatus and do merging previously loaded schema and newly loaded one.
      6ae0911 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into cached_footer_parquet_relation
      21bbdec [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into cached_footer_parquet_relation
      12a0ed9 [Liang-Chi Hsieh] Add check of FileStatus's modification time.
      186429d [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into cached_footer_parquet_relation
      0ef8caf [Liang-Chi Hsieh] Keep cached information and avoid re-calculating footers.
      6a7e537f
    • [SPARK-9200][SQL] Don't implicitly cast non-atomic types to string type. · cb8c241f
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7636 from rxin/complex-string-implicit-cast and squashes the following commits:
      
      3e67327 [Reynold Xin] [SPARK-9200][SQL] Don't implicitly cast non-atomic types to string type.
      cb8c241f
    • [SPARK-9294][SQL] cleanup comments, code style, naming typo for the new aggregation · 408e64b2
      Wenchen Fan authored
      fix some comments and code style for https://github.com/apache/spark/pull/7458
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7619 from cloud-fan/agg-clean and squashes the following commits:
      
      3925457 [Wenchen Fan] one more...
      cc78357 [Wenchen Fan] one more cleanup
      26f6a93 [Wenchen Fan] some minor cleanup for the new aggregation
      408e64b2
  2. Jul 23, 2015
    • [SPARK-9069] [SPARK-9264] [SQL] remove unlimited precision support for DecimalType · 8a94eb23
      Davies Liu authored
      Remove Decimal.Unlimited (change to support precision up to 38, to match Hive and other databases).

      In order to keep backward source compatibility, Decimal.Unlimited is still there, but it now maps to Decimal(38, 18).

      If no precision and scale are provided, it is Decimal(10, 0), as before.
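      A short illustration of the resulting defaults (a sketch based on the description above, not the patch itself):

        import org.apache.spark.sql.types._

        DecimalType(20, 4)   // explicit precision/scale; precision must now be <= 38
        DecimalType(10, 0)   // the default when no precision and scale are provided, as before
        DecimalType(38, 18)  // what the old "unlimited" decimal now maps to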
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7605 from davies/decimal_unlimited and squashes the following commits:
      
      aa3f115 [Davies Liu] fix tests and style
      fb0d20d [Davies Liu] address comments
      bfaae35 [Davies Liu] fix style
      df93657 [Davies Liu] address comments and clean up
      06727fd [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_unlimited
      4c28969 [Davies Liu] fix tests
      8d783cc [Davies Liu] fix tests
      788631c [Davies Liu] fix double with decimal in Union/except
      1779bde [Davies Liu] fix scala style
      c9c7c78 [Davies Liu] remove Decimal.Unlimited
      8a94eb23
    • [SPARK-9207] [SQL] Enables Parquet filter push-down by default · bebe3f7b
      Cheng Lian authored
      PARQUET-136 and PARQUET-173 have been fixed in parquet-mr 1.7.0. It's time to enable filter push-down by default now.
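      For reference, the flag itself is unchanged and can still be toggled explicitly (config key as used at the time; a sketch, not part of the patch):

        // Opt back out of push-down, or state the new default explicitly:
        sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
        sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")   // now the default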
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #7612 from liancheng/spark-9207 and squashes the following commits:
      
      77e6b5e [Cheng Lian] Enables Parquet filter push-down by default
      bebe3f7b
    • [SPARK-9286] [SQL] Methods in Unevaluable should be final and AlgebraicAggregate should extend Unevaluable. · b2f3aca1
      Josh Rosen authored
      
      This patch marks the Unevaluable.eval() and Unevaluable.genCode() methods as final and fixes two cases where they were overridden. It also updates AggregateFunction2 to extend Unevaluable.
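      A rough sketch of the resulting trait (signatures as of this era of Catalyst; treat the details as approximate):

        import org.apache.spark.sql.catalyst.InternalRow
        import org.apache.spark.sql.catalyst.expressions.Expression
        import org.apache.spark.sql.catalyst.expressions.codegen.{CodeGenContext, GeneratedExpressionCode}

        trait Unevaluable extends Expression {
          // final: subclasses can no longer provide an implementation by accident.
          final override def eval(input: InternalRow = null): Any =
            throw new UnsupportedOperationException(s"Cannot evaluate expression: $this")
          final override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String =
            throw new UnsupportedOperationException(s"Cannot evaluate expression: $this")
        }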
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7627 from JoshRosen/unevaluable-fix and squashes the following commits:
      
      8d9ed22 [Josh Rosen] AlgebraicAggregate should extend Unevaluable
      65329c2 [Josh Rosen] Do not have AggregateFunction1 inherit from AggregateExpression1
      fa68a22 [Josh Rosen] Make eval() and genCode() final
      b2f3aca1
    • [SPARK-5447][SQL] Replace reference 'schema rdd' with DataFrame @rxin. · 662d60db
      David Arroyo Cazorla authored
      Author: David Arroyo Cazorla <darroyo@stratio.com>
      
      Closes #7618 from darroyocazorla/master and squashes the following commits:
      
      5f91379 [David Arroyo Cazorla] [SPARK-5447][SQL] Replace reference 'schema rdd' with DataFrame
      662d60db
    • [SPARK-9243] [Documentation] null -> zero in crosstab doc · ecfb3127
      Xiangrui Meng authored
      We forgot to update the doc. cc brkyvz
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #7608 from mengxr/SPARK-9243 and squashes the following commits:
      
      0ea3236 [Xiangrui Meng] null -> zero in crosstab doc
      ecfb3127
    • [Build][Minor] Fix building error & performance · 19aeab57
      Cheng Hao authored
      1. When building the latest code with sbt, it throws an exception like the following (a sketch of the fix appears after this list):
      [error] /home/hcheng/git/catalyst/core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala:78: match may not be exhaustive.
      [error] It would fail on the following input: UNKNOWN
      [error]       val classNameByStatus = status match {
      [error]

      2. There is a potential performance issue when implicitly converting an Array[Any] to a Seq[Any].
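      A sketch of the exhaustiveness fix for the first point (the returned strings are illustrative, not the page's actual CSS classes):

        import org.apache.spark.JobExecutionStatus

        def classNameByStatus(status: JobExecutionStatus): String = status match {
          case JobExecutionStatus.SUCCEEDED => "succeeded"
          case JobExecutionStatus.FAILED    => "failed"
          case JobExecutionStatus.RUNNING   => "running"
          case JobExecutionStatus.UNKNOWN   => "unknown"   // the previously missing case
        }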
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #7611 from chenghao-intel/toseq and squashes the following commits:
      
      cab75c5 [Cheng Hao] remove the toArray
      24df682 [Cheng Hao] fix building error & performance
      19aeab57
    • [SPARK-9082] [SQL] [FOLLOW-UP] use `partition` in `PushPredicateThroughProject` · 52ef76de
      Wenchen Fan authored
      a follow up of https://github.com/apache/spark/pull/7446
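      Illustrative use of `partition` for this kind of split (a toy model, not the optimizer's code): deterministic conjuncts can be pushed below the Project, the rest must stay above it.

        // Toy stand-in for Catalyst predicates, just to show the partition idiom.
        case class Pred(sql: String, deterministic: Boolean)

        val conjuncts = Seq(Pred("a > 1", true), Pred("rand() < 0.5", false), Pred("b = 2", true))
        val (pushDown, stayUp) = conjuncts.partition(_.deterministic)
        // pushDown: a > 1, b = 2   (safe to move below the Project)
        // stayUp:   rand() < 0.5   (non-deterministic, must not be pushed)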
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7607 from cloud-fan/tmp and squashes the following commits:
      
      7106989 [Wenchen Fan] use `partition` in `PushPredicateThroughProject`
      52ef76de
    • Revert "[SPARK-8579] [SQL] support arbitrary object in UnsafeRow" · fb36397b
      Reynold Xin authored
      Reverts ObjectPool. As it stands, it has a few problems:
      
      1. ObjectPool doesn't work with spilling and memory accounting.
      2. I don't think the idea of an object pool is what we want to support in the long run, since it essentially goes back to unmanaged memory, creates pressure on the GC, and makes it hard to account for the total in-memory size.
      3. The ObjectPool patch removed the specialized getters for strings and binary, and as a result actually introduced branches when reading non-primitive data types.
      
      If we do want to support arbitrary user defined types in the future, I think we can just add an object array in UnsafeRow, rather than relying on indirect memory addressing through a pool. We also need to pick execution strategies that are optimized for those, rather than keeping a lot of unserialized JVM objects in memory during aggregation.
      
      This is probably the hardest thing I had to revert in Spark, due to recent patches that also change the same part of the code. Would be great to get a careful look.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7591 from rxin/revert-object-pool and squashes the following commits:
      
      01db0bc [Reynold Xin] Scala style.
      eda89fc [Reynold Xin] Fixed describe.
      2967118 [Reynold Xin] Fixed accessor for JoinedRow.
      e3294eb [Reynold Xin] Merge branch 'master' into revert-object-pool
      657855f [Reynold Xin] Temp commit.
      c20f2c8 [Reynold Xin] Style fix.
      fe37079 [Reynold Xin] Revert "[SPARK-8579] [SQL] support arbitrary object in UnsafeRow"
      fb36397b
    • [SPARK-8935] [SQL] Implement code generation for all casts · 6d0d8b40
      Yijie Shen authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-8935
      
      Author: Yijie Shen <henry.yijieshen@gmail.com>
      
      Closes #7365 from yjshen/cast_codegen and squashes the following commits:
      
      ef6e8b5 [Yijie Shen] getColumn and setColumn in struct cast, autounboxing in array and map
      eaece18 [Yijie Shen] remove null case in cast code gen
      fd7eba4 [Yijie Shen] resolve comments
      80378a5 [Yijie Shen] the missing self cast
      611d66e [Yijie Shen] Bug fix: NullType & primitive object unboxing
      6d5c0fe [Yijie Shen] rebase and add Interval codegen
      9424b65 [Yijie Shen] tiny style fix
      4a1c801 [Yijie Shen] remove CodeHolder class, use function instead.
      3f5df88 [Yijie Shen] CodeHolder for complex dataTypes
      c286f13 [Yijie Shen] moved all the cast code into class body
      4edfd76 [Yijie Shen] [WIP] finished primitive part
      6d0d8b40
  3. Jul 22, 2015
    • [SPARK-9144] Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled · b217230f
      Josh Rosen authored
      Spark has an option called spark.localExecution.enabled; according to the docs:
      
      > Enables Spark to run certain jobs, such as first() or take() on the driver, without sending tasks to the cluster. This can make certain jobs execute very quickly, but may require shipping a whole partition of data to the driver.
      
      This feature ends up adding quite a bit of complexity to DAGScheduler, especially in the runLocallyWithinThread method, but as far as I know nobody uses this feature (I searched the mailing list and haven't seen any recent mentions of the configuration nor stacktraces including the runLocally method). As a step towards scheduler complexity reduction, I propose that we remove this feature and all code related to it for Spark 1.5.
      
      This pull request simply brings #7484 up to date.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7585 from rxin/remove-local-exec and squashes the following commits:
      
      84bd10e [Reynold Xin] Python fix.
      1d9739a [Reynold Xin] Merge pull request #7484 from JoshRosen/remove-localexecution
      eec39fa [Josh Rosen] Remove allowLocal(); deprecate user-facing uses of it.
      b0835dc [Josh Rosen] Remove local execution code in DAGScheduler
      8975d96 [Josh Rosen] Remove local execution tests.
      ffa8c9b [Josh Rosen] Remove documentation for configuration
      b217230f
    • [SPARK-9262][build] Treat Scala compiler warnings as errors · d71a13f4
      Reynold Xin authored
      I've seen a few cases in the past few weeks where the compiler throws warnings that are caused by legitimate bugs. This patch upgrades warnings to errors, except deprecation warnings.
      
      Note that ideally we should be able to mark deprecation warnings as errors as well. However, due to the lack of ability to suppress individual warning messages in the Scala compiler, we cannot do that (since we do need to access deprecated APIs in Hadoop).
      
      Most of the work was done by ericl.
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #7598 from rxin/warnings and squashes the following commits:
      
      beb311b [Reynold Xin] Fixed tests.
      542c031 [Reynold Xin] Fixed one more warning.
      87c354a [Reynold Xin] Fixed all non-deprecation warnings.
      78660ac [Eric Liang] first effort to fix warnings
      d71a13f4
    • [SPARK-9244] Increase some memory defaults · fe26584a
      Matei Zaharia authored
      There are a few memory limits that people hit often and that we could
      make higher, especially now that memory sizes have grown; equivalent
      explicit settings are sketched after this list.

      - spark.akka.frameSize: This defaults to 10 but is often hit for map
        output statuses in large shuffles. This memory is not fully allocated
        up-front, so we can just make this larger and still not affect jobs
        that never send a status that large. We increase it to 128.

      - spark.executor.memory: Defaults to 512m, which is really small. We
        increase it to 1g.
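      A sketch of the equivalent explicit settings (what the new defaults correspond to), assuming they are set through SparkConf:

        import org.apache.spark.SparkConf

        val conf = new SparkConf()
          .set("spark.akka.frameSize", "128")    // previously defaulted to 10 (MB)
          .set("spark.executor.memory", "1g")    // previously defaulted to 512m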
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #7586 from mateiz/configs and squashes the following commits:
      
      ce0038a [Matei Zaharia] [SPARK-9244] Increase some memory defaults
      fe26584a
    • [SPARK-4366] [SQL] [Follow-up] Fix SqlParser compiling warning. · cf21d05f
      Yin Huai authored
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #7588 from yhuai/SPARK-4366-update1 and squashes the following commits:
      
      25f5f36 [Yin Huai] Fix SqlParser Warning.
      cf21d05f
    • [SPARK-9024] Unsafe HashJoin/HashOuterJoin/HashSemiJoin · e0b7ba59
      Davies Liu authored
      This PR introduces an unsafe version (using UnsafeRow) of HashJoin, HashOuterJoin and HashSemiJoin, covering both the broadcast and shuffle variants (except FullOuterJoin, which is better implemented using SortMergeJoin).

      It uses a HashMap to store UnsafeRows right now; this will change to BytesToBytesMap for better performance (in another PR).
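      A very rough sketch of the build side of such a hash join (simplified: the real operators live in org.apache.spark.sql.execution.joins, key rows are assumed to define hashCode/equals as UnsafeRow does, and the map will later be swapped for BytesToBytesMap):

        import scala.collection.mutable
        import org.apache.spark.sql.catalyst.InternalRow

        // Build a multimap from join key to matching rows. Rows are copied because unsafe
        // rows may point into reused buffers.
        def buildHashedRelation(build: Iterator[InternalRow],
                                keyOf: InternalRow => InternalRow)
          : mutable.HashMap[InternalRow, mutable.ArrayBuffer[InternalRow]] = {
          val table = mutable.HashMap.empty[InternalRow, mutable.ArrayBuffer[InternalRow]]
          while (build.hasNext) {
            val row = build.next().copy()
            table.getOrElseUpdate(keyOf(row).copy(), mutable.ArrayBuffer.empty) += row
          }
          table
        }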
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7480 from davies/unsafe_join and squashes the following commits:
      
      6294b1e [Davies Liu] fix projection
      10583f1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_join
      dede020 [Davies Liu] fix test
      84c9807 [Davies Liu] address comments
      a05b4f6 [Davies Liu] support UnsafeRow in LeftSemiJoinBNL and BroadcastNestedLoopJoin
      611d2ed [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_join
      9481ae8 [Davies Liu] return UnsafeRow after join()
      ca2b40f [Davies Liu] revert unrelated change
      68f5cd9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_join
      0f4380d [Davies Liu] ada a comment
      69e38f5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_join
      1a40f02 [Davies Liu] refactor
      ab1690f [Davies Liu] address comments
      60371f2 [Davies Liu] use UnsafeRow in SemiJoin
      a6c0b7d [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_join
      184b852 [Davies Liu] fix style
      6acbb11 [Davies Liu] fix tests
      95d0762 [Davies Liu] remove println
      bea4a50 [Davies Liu] Unsafe HashJoin
      e0b7ba59
    • [SPARK-9165] [SQL] codegen for CreateArray, CreateStruct and CreateNamedStruct · 86f80e2b
      Yijie Shen authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9165
      
      Author: Yijie Shen <henry.yijieshen@gmail.com>
      
      Closes #7537 from yjshen/array_struct_codegen and squashes the following commits:
      
      3a6dce6 [Yijie Shen] use infix notion in createArray test
      5e90f0a [Yijie Shen] resolve comments: classOf
      39cefb8 [Yijie Shen] codegen for createArray createStruct & createNamedStruct
      86f80e2b
    • [SPARK-9082] [SQL] Filter using non-deterministic expressions should not be pushed down · 76520955
      Wenchen Fan authored
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7446 from cloud-fan/filter and squashes the following commits:
      
      330021e [Wenchen Fan] add exists to tree node
      2cab68c [Wenchen Fan] more enhance
      949be07 [Wenchen Fan] push down part of predicate if possible
      3912f84 [Wenchen Fan] address comments
      8ce15ca [Wenchen Fan] fix bug
      557158e [Wenchen Fan] Filter using non-deterministic expressions should not be pushed down
      76520955
    • [SPARK-4233] [SPARK-4367] [SPARK-3947] [SPARK-3056] [SQL] Aggregation Improvement · c03299a1
      Yin Huai authored
      This is the first PR for the aggregation improvement, which is tracked by https://issues.apache.org/jira/browse/SPARK-4366 (umbrella JIRA). This PR contains work for its subtasks, SPARK-3056, SPARK-3947, SPARK-4233, and SPARK-4367.
      
      This PR introduces a new code path for evaluating aggregate functions. This code path is guarded by `spark.sql.useAggregate2` and by default the value of this flag is true.
      
      This new code path contains:
      * A new aggregate function interface (`AggregateFunction2`) and 7 built-in aggregate functions based on this new interface (`AVG`, `COUNT`, `FIRST`, `LAST`, `MAX`, `MIN`, `SUM`)
      * A UDAF interface (`UserDefinedAggregateFunction`) based on the new code path and two example UDAFs (`MyDoubleAvg` and `MyDoubleSum`); a sketch of a UDAF against this interface appears at the end of this description.
      * A sort-based aggregate operator (`Aggregate2Sort`) for the new aggregate function interface.
      * A sort-based aggregate operator (`FinalAndCompleteAggregate2Sort`) for distinct aggregations (for distinct aggregations the query plan will use `Aggregate2Sort` and `FinalAndCompleteAggregate2Sort` together).
      
      With this change, when `spark.sql.useAggregate2` is `true`, the flow of compiling an aggregation query is:
      1. Our analyzer looks up functions and returns aggregate functions built based on the old aggregate function interface.
      2. When our planner is compiling the physical plan, it tries to convert all aggregate functions to the ones built based on the new interface. The planner will fall back to the old code path if any of the following conditions is true:
      * code-gen is disabled.
      * there is any function that cannot be converted (right now, Hive UDAFs).
      * the schema of grouping expressions contains any complex data type.
      * there are multiple distinct columns.

      Right now, the new code path handles a single distinct column in the query (you can have multiple aggregate functions using that distinct column). For a query having an aggregate function with DISTINCT and regular aggregate functions, the generated plan will do partial aggregations for those regular aggregate functions.
      
      Thanks chenghao-intel for his initial work on it.
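      For orientation, a hedged sketch of a sum-like UDAF written against the public interface as it eventually shipped (method names and the package of the example UDAFs at this exact commit may differ):

        import org.apache.spark.sql.Row
        import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
        import org.apache.spark.sql.types._

        class MyDoubleSumSketch extends UserDefinedAggregateFunction {
          def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
          def bufferSchema: StructType = StructType(StructField("sum", DoubleType) :: Nil)
          def dataType: DataType = DoubleType
          def deterministic: Boolean = true
          def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0.0
          def update(buffer: MutableAggregationBuffer, input: Row): Unit =
            if (!input.isNullAt(0)) buffer(0) = buffer.getDouble(0) + input.getDouble(0)
          def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
            buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
          def evaluate(buffer: Row): Any = buffer.getDouble(0)
        }
        // Hypothetical registration: sqlContext.udf.register("myDoubleSum", new MyDoubleSumSketch)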
      
      Author: Yin Huai <yhuai@databricks.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #7458 from yhuai/UDAF and squashes the following commits:
      
      7865f5e [Yin Huai] Put the catalyst expression in the comment of the generated code for it.
      b04d6c8 [Yin Huai] Remove unnecessary change.
      f1d5901 [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
      35b0520 [Yin Huai] Use semanticEquals to replace grouping expressions in the output of the aggregate operator.
      3b43b24 [Yin Huai] bug fix.
      00eb298 [Yin Huai] Make it compile.
      a3ca551 [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
      e0afca3 [Yin Huai] Gracefully fallback to old aggregation code path.
      8a8ac4a [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
      88c7d4d [Yin Huai] Enable spark.sql.useAggregate2 by default for testing purpose.
      dc96fd1 [Yin Huai] Many updates:
      85c9c4b [Yin Huai] newline.
      43de3de [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
      c3614d7 [Yin Huai] Handle single distinct column.
      68b8ee9 [Yin Huai] Support single distinct column set. WIP
      3013579 [Yin Huai] Format.
      d678aee [Yin Huai] Remove AggregateExpressionSuite.scala since our built-in aggregate functions will be based on AlgebraicAggregate and we need to have another way to test it.
      e243ca6 [Yin Huai] Add aggregation iterators.
      a101960 [Yin Huai] Change MyJavaUDAF to MyDoubleSum.
      594cdf5 [Yin Huai] Change existing AggregateExpression to AggregateExpression1 and add an AggregateExpression as the common interface for both AggregateExpression1 and AggregateExpression2.
      380880f [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
      0a827b3 [Yin Huai] Add comments and doc. Move some classes to the right places.
      a19fea6 [Yin Huai] Add UDAF interface.
      262d4c4 [Yin Huai] Make it compile.
      b2e358e [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
      6edb5ac [Yin Huai] Format update.
      70b169c [Yin Huai] Remove groupOrdering.
      4721936 [Yin Huai] Add CheckAggregateFunction to extendedCheckRules.
      d821a34 [Yin Huai] Cleanup.
      32aea9c [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
      5b46d41 [Yin Huai] Bug fix.
      aff9534 [Yin Huai] Make Aggregate2Sort work with both algebraic AggregateFunctions and non-algebraic AggregateFunctions.
      2857b55 [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
      4435f20 [Yin Huai] Add ConvertAggregateFunction to HiveContext's analyzer.
      1b490ed [Michael Armbrust] make hive test
      8cfa6a9 [Michael Armbrust] add test
      1b0bb3f [Yin Huai] Do not bind references in AlgebraicAggregate and use code gen for all places.
      072209f [Yin Huai] Bug fix: Handle expressions in grouping columns that are not attribute references.
      f7d9e54 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into UDAF
      39ee975 [Yin Huai] Code cleanup: Remove unnecesary AttributeReferences.
      b7720ba [Yin Huai] Add an analysis rule to convert aggregate function to the new version.
      5c00f3f [Michael Armbrust] First draft of codegen
      6bbc6ba [Michael Armbrust] now with correct answers\!
      f7996d0 [Michael Armbrust] Add AlgebraicAggregate
      dded1c5 [Yin Huai] wip
      c03299a1
    • [SPARK-9232] [SQL] Duplicate code in JSONRelation · f4785f5b
      Andrew Or authored
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #7576 from andrewor14/clean-up-json-relation and squashes the following commits:
      
      ea80803 [Andrew Or] Clean up duplicate code
      f4785f5b
  4. Jul 21, 2015
    • [SPARK-9154][SQL] Rename formatString to format_string. · a4c83cb1
      Reynold Xin authored
      Also make format_string the canonical form, rather than printf.
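      Usage sketch (the DataFrame df and the table name are hypothetical; the change itself is only about which SQL name is canonical):

        import org.apache.spark.sql.functions._

        df.select(format_string("id=%d name=%s", col("id"), col("name")))
        // In SQL, format_string is the canonical name (printf remains, per the description above):
        sqlContext.sql("SELECT format_string('id=%d name=%s', id, name) FROM people")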
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7579 from rxin/format_strings and squashes the following commits:
      
      53ee54f [Reynold Xin] Fixed unit tests.
      52357e1 [Reynold Xin] Add format_string alias.
      b40a42a [Reynold Xin] [SPARK-9154][SQL] Rename formatString to format_string.
      a4c83cb1
    • [SPARK-9154] [SQL] codegen StringFormat · d4c7a7a3
      Tarek Auel authored
      Jira: https://issues.apache.org/jira/browse/SPARK-9154
      
      Fixes a bug from #7546.

      marmbrus I can't reopen the other PR because I didn't close it. Can you trigger Jenkins?
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #7571 from tarekauel/SPARK-9154 and squashes the following commits:
      
      dcae272 [Tarek Auel] [SPARK-9154][SQL] build fix
      1487602 [Tarek Auel] Merge remote-tracking branch 'upstream/master' into SPARK-9154
      f512c5f [Tarek Auel] [SPARK-9154][SQL] build fix
      a943d3e [Tarek Auel] [SPARK-9154] implicit input cast, added tests for null, support for null primitives
      10b4de8 [Tarek Auel] [SPARK-9154][SQL] codegen removed fallback trait
      cd8322b [Tarek Auel] [SPARK-9154][SQL] codegen string format
      086caba [Tarek Auel] [SPARK-9154][SQL] codegen string format
      d4c7a7a3
    • [SPARK-9206] [SQL] Fix HiveContext classloading for GCS connector. · c07838b5
      Dennis Huo authored
      IsolatedClientLoader.isSharedClass includes all of com.google.*, presumably
      for Guava, protobuf, and/or other shared Google libraries, but needs to
      count com.google.cloud.* as "hive classes" when determining which ClassLoader
      to use. Otherwise, things like HiveContext.parquetFile will throw a
      ClassCastException when fs.defaultFS is set to a Google Cloud Storage (gs://)
      path. On StackOverflow: http://stackoverflow.com/questions/31478955
      
      EDIT: Adding yhuai who worked on the relevant classloading isolation pieces.
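      A schematic version of the predicate described above (heavily simplified; the real isSharedClass checks more prefixes than shown here):

        def isSharedClass(name: String): Boolean = {
          // GCS connector classes must be treated as "hive classes"...
          val isHiveClass = name.startsWith("com.google.cloud.")
          // ...while Guava, protobuf, and other shared libraries stay on the shared side.
          !isHiveClass && (name.startsWith("com.google.") || name.startsWith("scala.") ||
            name.startsWith("java.") || name.startsWith("org.apache.spark."))
        }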
      
      Author: Dennis Huo <dhuo@google.com>
      
      Closes #7549 from dennishuo/dhuo-fix-hivecontext-gcs and squashes the following commits:
      
      1f8db07 [Dennis Huo] Fix HiveContext classloading for GCS connector.
      c07838b5
    • [SPARK-8906][SQL] Move all internal data source classes into execution.datasources. · 60c0ce13
      Reynold Xin authored
      This way, the sources package contains only public facing interfaces.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7565 from rxin/move-ds and squashes the following commits:
      
      7661aff [Reynold Xin] Mima
      9d5196a [Reynold Xin] Rearranged imports.
      3dd7174 [Reynold Xin] [SPARK-8906][SQL] Move all internal data source classes into execution.datasources.
      60c0ce13
    • [SPARK-8357] Fix unsafe memory leak on empty inputs in GeneratedAggregate · 9ba7c64d
      navis.ryu authored
      This patch fixes a managed memory leak in GeneratedAggregate.  The leak occurs when the unsafe aggregation path is used to perform grouped aggregation on an empty input; in this case, GeneratedAggregate allocates an UnsafeFixedWidthAggregationMap that is never cleaned up because `next()` is never called on the aggregate result iterator.
      
      This patch fixes this by short-circuiting on empty inputs.
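      A schematic of the short-circuit (generic types so the sketch stands alone; the real check sits inside GeneratedAggregate's unsafe path):

        def aggregateWithShortCircuit[T](input: Iterator[T],
                                         hasGroupingKeys: Boolean,
                                         unsafeAggregate: Iterator[T] => Iterator[T]): Iterator[T] =
          if (hasGroupingKeys && !input.hasNext) {
            // Grouped aggregation over no rows yields no rows; returning early means the
            // UnsafeFixedWidthAggregationMap is never allocated, so there is nothing to leak.
            Iterator.empty
          } else {
            unsafeAggregate(input)
          }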
      
      This patch is an updated version of #6810.
      
      Closes #6810.
      
      Author: navis.ryu <navis@apache.org>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7560 from JoshRosen/SPARK-8357 and squashes the following commits:
      
      3486ce4 [Josh Rosen] Some minor cleanup
      c649310 [Josh Rosen] Revert SparkPlan change:
      3c7db0f [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-8357
      adc8239 [Josh Rosen] Back out Projection changes.
      c5419b3 [navis.ryu] addressed comments
      143e1ef [navis.ryu] fixed format & added test for CCE case
      735972f [navis.ryu] used new conf apis
      1a02a55 [navis.ryu] Rolled-back test-conf cleanup & fixed possible CCE & added more tests
      51178e8 [navis.ryu] addressed comments
      4d326b9 [navis.ryu] fixed test fails
      15c5afc [navis.ryu] added a test as suggested by JoshRosen
      d396589 [navis.ryu] added comments
      1b07556 [navis.ryu] [SPARK-8357] [SQL] Memory leakage on unsafe aggregation path with empty input
      9ba7c64d
    • Revert "[SPARK-9154] [SQL] codegen StringFormat" · 87d890cc
      Michael Armbrust authored
      This reverts commit 7f072c3d.
      
      Revert #7546
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #7570 from marmbrus/revert9154 and squashes the following commits:
      
      ed2c32a [Michael Armbrust] Revert "[SPARK-9154] [SQL] codegen StringFormat"
      87d890cc
    • [SPARK-9154] [SQL] codegen StringFormat · 7f072c3d
      Tarek Auel authored
      Jira: https://issues.apache.org/jira/browse/SPARK-9154
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #7546 from tarekauel/SPARK-9154 and squashes the following commits:
      
      a943d3e [Tarek Auel] [SPARK-9154] implicit input cast, added tests for null, support for null primitives
      10b4de8 [Tarek Auel] [SPARK-9154][SQL] codegen removed fallback trait
      cd8322b [Tarek Auel] [SPARK-9154][SQL] codegen string format
      086caba [Tarek Auel] [SPARK-9154][SQL] codegen string format
      7f072c3d
    • [SPARK-9081] [SPARK-9168] [SQL] nanvl & dropna/fillna supporting nan as well · be5c5d37
      Yijie Shen authored
      JIRA:
      https://issues.apache.org/jira/browse/SPARK-9081
      https://issues.apache.org/jira/browse/SPARK-9168
      
      This PR makes three modifications (usage sketch below):
      1.  Change `isNaN` to return `false` on `null` input
      2.  Make `dropna` and `fillna` fill/drop NaN values as well
      3.  Implement `nanvl`
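      Usage sketch (the DataFrame df and its Double column "x" are hypothetical):

        import org.apache.spark.sql.functions._

        df.select(nanvl(col("x"), lit(0.0)))   // new: return 0.0 where "x" is NaN, otherwise "x"
        df.na.fill(0.0, Seq("x"))              // fillna now replaces NaN as well as null
        df.na.drop(Seq("x"))                   // dropna now drops NaN rows as well as null rows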
      
      Author: Yijie Shen <henry.yijieshen@gmail.com>
      
      Closes #7523 from yjshen/fillna_dropna and squashes the following commits:
      
      f0a51db [Yijie Shen] make coalesce untouched and implement nanvl
      1d3e35f [Yijie Shen] make Coalesce aware of NaN in order to support fillna
      2760cbc [Yijie Shen] change isNaN(null) to false as well as implement dropna
      be5c5d37
    • [SPARK-9173][SQL]UnionPushDown should also support Intersect and Except · ae230596
      Yijie Shen authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9173
      
      Author: Yijie Shen <henry.yijieshen@gmail.com>
      
      Closes #7540 from yjshen/union_pushdown and squashes the following commits:
      
      278510a [Yijie Shen] rename UnionPushDown to SetOperationPushDown
      91741c1 [Yijie Shen] Add UnionPushDown support for intersect and except
      ae230596
    • [SPARK-8230][SQL] Add array/map size method · 560c658a
      Pedro Rodriguez authored
      Pull Request for: https://issues.apache.org/jira/browse/SPARK-8230
      
      The primary issue resolved is implementing array/map size for Spark SQL; a usage sketch follows the review notes below. Code is ready for review by a committer. Cheng Hao is on the JIRA ticket, but I don't know his username on GitHub; rxin is also on the JIRA ticket.
      
      Things to review:
      1. Where to put the added functions namespace-wise; they seem to be part of a few operations on collections, which include `sort_array` and `array_contains`. Hence the names `collectionOperations.scala` and `_collection_functions` in Python.
      2. In Python code, should it be in a `1.5.0` function array or in a collections array?
      3. Are there any missing methods on the `Size` case class? It looks like many of these functions have generated Java code; is that also needed in this case?
      4. Something else?
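      A usage sketch of the new function (DataFrame, table, and column names are hypothetical):

        import org.apache.spark.sql.functions._

        df.select(size(col("tags")), size(col("attrs")))             // array column, map column
        sqlContext.sql("SELECT size(tags), size(attrs) FROM events")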
      
      Author: Pedro Rodriguez <ski.rodriguez@gmail.com>
      Author: Pedro Rodriguez <prodriguez@trulia.com>
      
      Closes #7462 from EntilZha/SPARK-8230 and squashes the following commits:
      
      9a442ae [Pedro Rodriguez] fixed functions and sorted __all__
      9aea3bb [Pedro Rodriguez] removed imports from python docs
      15d4bf1 [Pedro Rodriguez] Added null test case and changed to nullSafeCodeGen
      d88247c [Pedro Rodriguez] removed python code
      bd5f0e4 [Pedro Rodriguez] removed duplicate function from rebase/merge
      59931b4 [Pedro Rodriguez] fixed compile bug instroduced when merging
      c187175 [Pedro Rodriguez] updated code to add size to __all__ directly and removed redundent pretty print
      130839f [Pedro Rodriguez] fixed failing test
      aa9bade [Pedro Rodriguez] fix style
      e093473 [Pedro Rodriguez] updated python code with docs, switched classes/traits implemented, added (failing) expression tests
      0449377 [Pedro Rodriguez] refactored code to use better abstract classes/traits and implementations
      9a1a2ff [Pedro Rodriguez] added unit tests for map size
      2bfbcb6 [Pedro Rodriguez] added unit test for size
      20df2b4 [Pedro Rodriguez] Finished working version of size function and added it to python
      b503e75 [Pedro Rodriguez] First attempt at implementing size for maps and arrays
      99a6a5c [Pedro Rodriguez] fixed failing test
      cac75ac [Pedro Rodriguez] fix style
      933d843 [Pedro Rodriguez] updated python code with docs, switched classes/traits implemented, added (failing) expression tests
      42bb7d4 [Pedro Rodriguez] refactored code to use better abstract classes/traits and implementations
      f9c3b8a [Pedro Rodriguez] added unit tests for map size
      2515d9f [Pedro Rodriguez] added documentation
      0e60541 [Pedro Rodriguez] added unit test for size
      acf9853 [Pedro Rodriguez] Finished working version of size function and added it to python
      84a5d38 [Pedro Rodriguez] First attempt at implementing size for maps and arrays
      560c658a
    • [SPARK-8255] [SPARK-8256] [SQL] Add regex_extract/regex_replace · 8c8f0ef5
      Cheng Hao authored
      Add expressions `regex_extract` & `regex_replace`
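      Per the commit list below, the registered SQL names are regexp_extract / regexp_replace. A usage sketch against a hypothetical string column "s" (the DataFrame-side wrappers shown here may have landed separately):

        import org.apache.spark.sql.functions._

        df.select(
          regexp_extract(col("s"), "(\\d+)-(\\d+)", 1),   // first capture group, e.g. "100" from "100-200"
          regexp_replace(col("s"), "\\d+", "num"))        // e.g. "100-200" -> "num-num"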
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #7468 from chenghao-intel/regexp and squashes the following commits:
      
      e5ea476 [Cheng Hao] minor update for documentation
      ef96fd6 [Cheng Hao] update the code gen
      72cf28f [Cheng Hao] Add more log for compilation error
      4e11381 [Cheng Hao] Add regexp_replace / regexp_extract support
      8c8f0ef5
    • [SPARK-9100] [SQL] Adds DataFrame reader/writer shortcut methods for ORC · d38c5029
      Cheng Lian authored
      This PR adds DataFrame reader/writer shortcut methods for ORC in both Scala and Python.
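      The new shortcuts on the Scala side (paths are made up; ORC support requires a HiveContext):

        // Write and read ORC without spelling out format("orc"):
        df.write.orc("/tmp/people.orc")
        val people = sqlContext.read.orc("/tmp/people.orc")

        // Equivalent long form that existed before:
        df.write.format("orc").save("/tmp/people.orc")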
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #7444 from liancheng/spark-9100 and squashes the following commits:
      
      284d043 [Cheng Lian] Fixes PySpark test cases and addresses PR comments
      e0b09fb [Cheng Lian] Adds DataFrame reader/writer shortcut methods for ORC
      d38c5029
    • [SPARK-9161][SQL] codegen FormatNumber · 1ddd0f2f
      Tarek Auel authored
      Jira https://issues.apache.org/jira/browse/SPARK-9161
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #7545 from tarekauel/SPARK-9161 and squashes the following commits:
      
      21425c8 [Tarek Auel] [SPARK-9161][SQL] codegen FormatNumber
      1ddd0f2f
    • [SPARK-9023] [SQL] Followup for #7456 (Efficiency improvements for UnsafeRows in Exchange) · 48f8fd46
      Josh Rosen authored
      This patch addresses code review feedback from #7456.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7551 from JoshRosen/unsafe-exchange-followup and squashes the following commits:
      
      76dbdf8 [Josh Rosen] Add comments + more methods to UnsafeRowSerializer
      3d7a1f2 [Josh Rosen] Add writeToStream() method to UnsafeRow
      48f8fd46