  1. Mar 07, 2016
    • [SPARK-13685][SQL] Rename catalog.Catalog to ExternalCatalog · bc7a3ec2
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Today we have `analysis.Catalog` and `catalog.Catalog`. In the future the former will call the latter. When that happens, if both of them are still called `Catalog` it will be very confusing. This patch renames the latter to `ExternalCatalog` because it is expected to talk to external systems.
      
      ## How was this patch tested?
      
      Jenkins.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #11526 from andrewor14/rename-catalog.
      bc7a3ec2
  2. Mar 06, 2016
    • [SPARK-13697] [PYSPARK] Fix the missing module name of TransformFunctionSerializer.loads · ee913e6e
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Set the function's module name to `__main__` if it's missing in `TransformFunctionSerializer.loads`.
      
      ## How was this patch tested?
      
      Manually test in the shell.
      
      Before this patch:
      ```
      >>> from pyspark.streaming import StreamingContext
      >>> from pyspark.streaming.util import TransformFunction
      >>> ssc = StreamingContext(sc, 1)
      >>> func = TransformFunction(sc, lambda x: x, sc.serializer)
      >>> func.rdd_wrapper(lambda x: x)
      TransformFunction(<function <lambda> at 0x106ac8b18>)
      >>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, func.rdd_wrap_func, func.deserializers)))
      >>> func2 = ssc._transformerSerializer.loads(bytes)
      >>> print(func2.func.__module__)
      None
      >>> print(func2.rdd_wrap_func.__module__)
      None
      >>>
      ```
      After this patch:
      ```
      >>> from pyspark.streaming import StreamingContext
      >>> from pyspark.streaming.util import TransformFunction
      >>> ssc = StreamingContext(sc, 1)
      >>> func = TransformFunction(sc, lambda x: x, sc.serializer)
      >>> func.rdd_wrapper(lambda x: x)
      TransformFunction(<function <lambda> at 0x108bf1b90>)
      >>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, func.rdd_wrap_func, func.deserializers)))
      >>> func2 = ssc._transformerSerializer.loads(bytes)
      >>> print(func2.func.__module__)
      __main__
      >>> print(func2.rdd_wrap_func.__module__)
      __main__
      >>>
      ```
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11535 from zsxwing/loads-module.
      ee913e6e
  3. Mar 05, 2016
    • Revert "[SPARK-13616][SQL] Let SQLBuilder convert logical plan without a project on top of it" · 8ff88094
      Cheng Lian authored
      This reverts commit f87ce050.
      
      According to the discussion in #11466, let's revert PR #11466 to be safe.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #11539 from liancheng/revert-pr-11466.
      8ff88094
    • [SPARK-13693][STREAMING][TESTS] Stop StreamingContext before deleting checkpoint dir · 8290004d
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Stop StreamingContext before deleting the checkpoint dir, to avoid the race condition where deleting the checkpoint dir and writing a checkpoint happen at the same time.
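
      A minimal sketch of that ordering (the helper name and parameters are illustrative, not the suite's actual code):
      ```scala
      import java.io.File
      import org.apache.commons.io.FileUtils
      import org.apache.spark.streaming.StreamingContext

      // Hypothetical teardown helper: stop first, delete second.
      def tearDown(ssc: StreamingContext, checkpointDir: File): Unit = {
        ssc.stop(stopSparkContext = true)        // no checkpoint writes can happen after this
        FileUtils.deleteDirectory(checkpointDir) // the delete can no longer race a write
      }
      ```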
      
      The flaky test log is here: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/256/testReport/junit/org.apache.spark.streaming/MapWithStateSuite/_It_is_not_a_test_/
      
      ## How was this patch tested?
      
      unit tests
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11531 from zsxwing/SPARK-13693.
      8290004d
    • [SPARK-12720][SQL] SQL Generation Support for Cube, Rollup, and Grouping Sets · adce5ee7
      gatorsmile authored
      #### What changes were proposed in this pull request?
      
      This PR adds SQL generation support for cube, rollup, and grouping sets.
      
      For example, a query using rollup:
      ```SQL
      SELECT count(*) as cnt, key % 5, grouping_id() FROM t1 GROUP BY key % 5 WITH ROLLUP
      ```
      Original logical plan:
      ```
        Aggregate [(key#17L % cast(5 as bigint))#47L,grouping__id#46],
                  [(count(1),mode=Complete,isDistinct=false) AS cnt#43L,
                   (key#17L % cast(5 as bigint))#47L AS _c1#45L,
                   grouping__id#46 AS _c2#44]
        +- Expand [List(key#17L, value#18, (key#17L % cast(5 as bigint))#47L, 0),
                   List(key#17L, value#18, null, 1)],
                  [key#17L,value#18,(key#17L % cast(5 as bigint))#47L,grouping__id#46]
           +- Project [key#17L,
                       value#18,
                       (key#17L % cast(5 as bigint)) AS (key#17L % cast(5 as bigint))#47L]
              +- Subquery t1
                 +- Relation[key#17L,value#18] ParquetRelation
      ```
      Converted SQL:
      ```SQL
        SELECT count( 1) AS `cnt`,
               (`t1`.`key` % CAST(5 AS BIGINT)),
               grouping_id() AS `_c2`
        FROM `default`.`t1`
        GROUP BY (`t1`.`key` % CAST(5 AS BIGINT))
        GROUPING SETS (((`t1`.`key` % CAST(5 AS BIGINT))), ())
      ```
      
      #### How was this patch tested?
      
      Added eight test cases in `LogicalPlanToSQLSuite`.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #11283 from gatorsmile/groupingSetsToSQL.
      adce5ee7
  4. Mar 04, 2016
    • [SPARK-12073][STREAMING] backpressure rate controller consumes events preferentially from lagg… · f19228ee
      Jason White authored
      …ing partitions
      
      I'm pretty sure this is the reason we couldn't easily recover from an unbalanced Kafka partition under heavy load when using backpressure.
      
      `maxMessagesPerPartition` calculates an appropriate limit for the message rate from all partitions, and then divides by the number of partitions to determine how many messages to retrieve per partition. The problem with this approach is that when one partition is behind by millions of records (due to random Kafka issues), but the rate estimator calculates that only 100k total messages can be retrieved, each partition (out of say 32) only retrieves at most 100k/32 = 3125 messages.
      
      This PR (still needing a test) determines a per-partition desired message count by using the current lag for each partition to preferentially weight the total message limit among the partitions. In this situation, if each partition gets 1k messages, but 1 partition starts 1M behind, then the total number of messages to retrieve is (32 * 1k + 1M) = 1032000 messages, of which the one partition needs 1001000. So, it gets (1001000 / 1032000) = 97% of the 100k messages, and the other 31 partitions share the remaining 3%.
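
      As a rough sketch of that weighting (names like `latestOffsets`, `currentOffsets`, and `totalLimit` are illustrative, not the PR's actual code):
      ```scala
      // Hypothetical per-partition cap: weight the global message limit by each
      // partition's current lag instead of splitting it evenly.
      def perPartitionLimits(
          latestOffsets: Map[Int, Long],   // head offset per Kafka partition
          currentOffsets: Map[Int, Long],  // last consumed offset per partition
          totalLimit: Long): Map[Int, Long] = {
        val lags = latestOffsets.map { case (p, head) => p -> (head - currentOffsets(p)) }
        val totalLag = lags.values.sum.toDouble // assumes at least some lag exists
        lags.map { case (p, lag) => p -> math.round(totalLimit * (lag / totalLag)) }
      }
      ```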
      
      Assuming all of the 100k messages are retrieved and processed within the batch window, the rate calculator will increase the number of messages to retrieve in the next batch, until it reaches a new stable point or the backlog is finished processing.
      
      We're going to try deploying this internally at Shopify to see if this resolves our issue.
      
      tdas koeninger holdenk
      
      Author: Jason White <jason.white@shopify.com>
      
      Closes #10089 from JasonMWhite/rate_controller_offsets.
      f19228ee
    • [SPARK-13255] [SQL] Update vectorized reader to directly return ColumnarBatch... · a6e2bd31
      Nong Li authored
      [SPARK-13255] [SQL] Update vectorized reader to directly return ColumnarBatch instead of InternalRows.
      
      ## What changes were proposed in this pull request?
      
      Currently, the parquet reader returns rows one by one which is bad for performance. This patch
      updates the reader to directly return ColumnarBatches. This is only enabled with whole stage
      codegen, which is the only operator currently that is able to consume ColumnarBatches (instead
      of rows). The current implementation is a bit of a hack to get this to work and we should do
      more refactoring of these low level interfaces to make this work better.
      
      ## How was this patch tested?
      
      ```
      Results:
      TPCDS:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)
      ---------------------------------------------------------------------------------
      q55 (before)                             8897 / 9265         12.9          77.2
      q55                                      5486 / 5753         21.0          47.6
      ```
      
      Author: Nong Li <nong@databricks.com>
      
      Closes #11435 from nongli/spark-13255.
      a6e2bd31
    • [SPARK-13459][WEB UI] Separate Alive and Dead Executors in Executor Totals Table · 5f42c28b
      Alex Bozarth authored
      ## What changes were proposed in this pull request?
      
      Now that dead executors are shown in the executors table (#10058), the totals table is updated to include separate totals for alive and dead executors, as well as the current total, as originally discussed in #10668.
      
      ## How was this patch tested?
      
      Manually verified by running the Standalone Web UI in the latest Safari and Firefox ESR
      
      Author: Alex Bozarth <ajbozart@us.ibm.com>
      
      Closes #11381 from ajbozarth/spark13459.
      5f42c28b
    • [SPARK-13633][SQL] Move things into catalyst.parser package · b7d41474
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      This patch simply moves things to the existing package `o.a.s.sql.catalyst.parser` in an effort to reduce the size of the diff in #11048. This is conceptually the same as the recently merged patch #11482.
      
      ## How was this patch tested?
      
      Jenkins.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #11506 from andrewor14/parser-package.
      b7d41474
    • [SPARK-13036][SPARK-13318][SPARK-13319] Add save/load for feature.py · 83302c3b
      Xusen Yin authored
      Add save/load for feature.py. Meanwhile, add save/load for `ElementwiseProduct` on the Scala side, and fix a bug of missing `setDefault` in `VectorSlicer` and `StopWordsRemover`.
      
      In this PR I ignore `RFormula` and `RFormulaModel` because their Scala implementation is pending in https://github.com/apache/spark/pull/9884. I'll add them in this PR if https://github.com/apache/spark/pull/9884 gets merged first, or file a follow-up JIRA for `RFormula`.
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #11203 from yinxusen/SPARK-13036.
      83302c3b
    • [SPARK-13676] Fix mismatched default values for regParam in LogisticRegression · c8f25459
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      The default value of the regularization parameter for the `LogisticRegression` algorithm is different in Scala and Python. We should provide the same value.
      
      **Scala**
      ```
      scala> new org.apache.spark.ml.classification.LogisticRegression().getRegParam
      res0: Double = 0.0
      ```
      
      **Python**
      ```
      >>> from pyspark.ml.classification import LogisticRegression
      >>> LogisticRegression().getRegParam()
      0.1
      ```
      
      ## How was this patch tested?
      Manual. Check the following in `pyspark`.
      ```
      >>> from pyspark.ml.classification import LogisticRegression
      >>> LogisticRegression().getRegParam()
      0.0
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11519 from dongjoon-hyun/SPARK-13676.
      c8f25459
    • [SPARK-13673][WINDOWS] Fixed not to pollute environment variables. · e6175082
      Masayoshi TSUZUKI authored
      ## What changes were proposed in this pull request?
      
      This patch fixes the problem that `bin\beeline.cmd` pollutes environment variables.
      A similar problem was reported and fixed in https://issues.apache.org/jira/browse/SPARK-3943, but `bin\beeline.cmd` seems to have been added later.
      
      ## How was this patch tested?
      
      manual tests:
        I executed the new `bin\beeline.cmd` and confirmed that %SPARK_HOME% doesn't remain in the command prompt.
      
      Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
      
      Closes #11516 from tsudukim/feature/SPARK-13673.
      e6175082
    • [SPARK-12925] Improve HiveInspectors.unwrap for StringObjectInspector.… · 204b02b5
      Rajesh Balamohan authored
      The earlier fix did not copy the bytes, and it is possible for a higher level to reuse the Text object, which was causing issues. The proposed fix copies the bytes from Text, which still avoids the expensive encoding/decoding.
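
      The core of the change is defensive copying. A minimal sketch of the pattern, assuming Hadoop's `Text` API (`getBytes` returns the reusable backing array, valid up to `getLength`):
      ```scala
      import java.util.Arrays
      import org.apache.hadoop.io.Text

      // Copy the valid bytes out of Text instead of aliasing its reusable buffer.
      def copyTextBytes(t: Text): Array[Byte] =
        Arrays.copyOfRange(t.getBytes, 0, t.getLength)
      ```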
      
      Author: Rajesh Balamohan <rbalamohan@apache.org>
      
      Closes #11477 from rajeshbalamohan/SPARK-12925.2.
      204b02b5
    • [SPARK-13398][STREAMING] Move away from thread pool task support to forkjoin · c04dc27c
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      Remove the old, deprecated ThreadPoolExecutor and replace it with an ExecutionContext using a ForkJoinPool. The downside of this is that Scala's ForkJoinPool doesn't give us a way to specify the thread pool name (and it is a wrapper of Java's in 2.12) except by providing a custom factory. Note that we can't use Java's ForkJoinPool directly in Scala 2.11, since it uses an ExecutionContext which reports system parallelism. One other implicit change is that the old ExecutionContext would have reported a different default parallelism, since it used system parallelism rather than thread pool parallelism (this was likely not intended, but also likely not a huge difference).
      
      The previous version of this PR attempted to use an execution context constructed on the ThreadPool (but not the deprecated ThreadPoolExecutor class) so as to keep the ability to have human-readable named threads, but this reported system parallelism.
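
      A sketch of the custom-factory route to keep readable thread names, assuming the Scala 2.11 `scala.concurrent.forkjoin` API, which mirrors Java's (illustrative, not the patch's exact code):
      ```scala
      import scala.concurrent.ExecutionContext
      import scala.concurrent.forkjoin.{ForkJoinPool, ForkJoinWorkerThread}

      // Build a ForkJoinPool whose worker threads carry a readable name prefix.
      def namedForkJoinPool(prefix: String, parallelism: Int): ForkJoinPool = {
        val factory = new ForkJoinPool.ForkJoinWorkerThreadFactory {
          override def newThread(pool: ForkJoinPool): ForkJoinWorkerThread =
            new ForkJoinWorkerThread(pool) { setName(prefix + "-" + super.getName) }
        }
        new ForkJoinPool(parallelism, factory, null, false)
      }

      val ec = ExecutionContext.fromExecutorService(namedForkJoinPool("streaming-util", 8))
      ```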
      
      ## How was this patch tested?
      
      unit tests: streaming/testOnly org.apache.spark.streaming.util.*
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #11423 from holdenk/SPARK-13398-move-away-from-ThreadPoolTaskSupport-java-forkjoin.
      c04dc27c
    • [SPARK-13646][MLLIB] QuantileDiscretizer counts dataset twice in get… · 27e88faa
      Abou Haydar Elias authored
      ## What changes were proposed in this pull request?
      
      It avoids counting the DataFrame twice.
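
      A tiny illustration of the pattern (the numbers are made up), showing why the count should be computed once and reused:
      ```scala
      // Assuming a SparkContext `sc`, e.g. in spark-shell:
      val data = sc.parallelize(1 to 100000)
      val n = data.count()                      // one Spark job
      val fraction = math.min(1.0, 10000.0 / n) // reuse n here...
      val sample = data.sample(withReplacement = false, fraction)
      // ...rather than calling data.count() again, which would run a second full pass.
      ```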
      
      Author: Abou Haydar Elias <abouhaydar.elias@gmail.com>
      Author: Elie A <abouhaydar.elias@gmail.com>
      
      Closes #11491 from eliasah/quantile-discretizer-patch.
      27e88faa
    • [SPARK-13603][SQL] support SQL generation for subquery · dd83c209
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This adds SQL generation support for subquery expressions, which are recursively replaced with a `SubqueryHolder` inside `SQLBuilder`.
      
      ## How was this patch tested?
      
      Added unit tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11453 from davies/sql_subquery.
      dd83c209
    • [SPARK-13652][CORE] Copy ByteBuffer in sendRpcSync as it will be recycled · 465c665d
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      `sendRpcSync` should copy the response content because the underlying buffer will be recycled and reused.
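
      A minimal sketch of the defensive copy using plain NIO (illustrative, not the exact patch):
      ```scala
      import java.nio.ByteBuffer

      // Copy a pooled response buffer before the transport layer recycles it.
      def copyResponse(src: ByteBuffer): ByteBuffer = {
        val copy = ByteBuffer.allocate(src.remaining())
        copy.put(src.duplicate()) // duplicate() leaves the source's position untouched
        copy.flip()
        copy
      }
      ```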
      
      ## How was this patch tested?
      
      Jenkins unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11499 from zsxwing/SPARK-13652.
      465c665d
  5. Mar 03, 2016
    • [SPARK-12941][SQL][MASTER] Spark-SQL JDBC Oracle dialect fails to map string... · f6ac7c30
      thomastechs authored
      [SPARK-12941][SQL][MASTER] Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR datatype mapping
      
      ## What changes were proposed in this pull request?
      A test suite is added for the bug fix SPARK-12941, covering the mapping of StringType to the corresponding Oracle VARCHAR datatype.
      
      ## How was this patch tested?
      Manual tests done.
      
      Author: thomastechs <thomas.sebastian@tcs.com>
      Author: THOMAS SEBASTIAN <thomas.sebastian@tcs.com>
      
      Closes #11489 from thomastechs/thomastechs-12941-master-new.
      f6ac7c30
    • [SPARK-13647] [SQL] also check if numeric value is within allowed range in _verify_type · 15d57f9c
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This PR makes the `_verify_type` in `types.py` more strict: it also checks whether a numeric value is within the allowed range.
      
      ## How was this patch tested?
      
      newly added doc test.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11492 from cloud-fan/py-verify.
      15d57f9c
    • [SPARK-13601] [TESTS] use 1 partition in tests to avoid race conditions · d062587d
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Fix race conditions when cleaning up files.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11507 from davies/flaky.
      d062587d
    • [SPARK-13415][SQL] Visualize subquery in SQL web UI · b373a888
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR supports visualization of subqueries in the SQL web UI, and also improves the explain output for subqueries, especially when used together with whole-stage codegen.
      
      For example:
      ```python
      >>> sqlContext.range(100).registerTempTable("range")
      >>> sqlContext.sql("select id / (select sum(id) from range) from range where id > (select id from range limit 1)").explain(True)
      == Parsed Logical Plan ==
      'Project [unresolvedalias(('id / subquery#9), None)]
      :  +- 'SubqueryAlias subquery#9
      :     +- 'Project [unresolvedalias('sum('id), None)]
      :        +- 'UnresolvedRelation `range`, None
      +- 'Filter ('id > subquery#8)
         :  +- 'SubqueryAlias subquery#8
         :     +- 'GlobalLimit 1
         :        +- 'LocalLimit 1
         :           +- 'Project [unresolvedalias('id, None)]
         :              +- 'UnresolvedRelation `range`, None
         +- 'UnresolvedRelation `range`, None
      
      == Analyzed Logical Plan ==
      (id / scalarsubquery()): double
      Project [(cast(id#0L as double) / cast(subquery#9 as double)) AS (id / scalarsubquery())#11]
      :  +- SubqueryAlias subquery#9
      :     +- Aggregate [(sum(id#0L),mode=Complete,isDistinct=false) AS sum(id)#10L]
      :        +- SubqueryAlias range
      :           +- Range 0, 100, 1, 4, [id#0L]
      +- Filter (id#0L > subquery#8)
         :  +- SubqueryAlias subquery#8
         :     +- GlobalLimit 1
         :        +- LocalLimit 1
         :           +- Project [id#0L]
         :              +- SubqueryAlias range
         :                 +- Range 0, 100, 1, 4, [id#0L]
         +- SubqueryAlias range
            +- Range 0, 100, 1, 4, [id#0L]
      
      == Optimized Logical Plan ==
      Project [(cast(id#0L as double) / cast(subquery#9 as double)) AS (id / scalarsubquery())#11]
      :  +- SubqueryAlias subquery#9
      :     +- Aggregate [(sum(id#0L),mode=Complete,isDistinct=false) AS sum(id)#10L]
      :        +- Range 0, 100, 1, 4, [id#0L]
      +- Filter (id#0L > subquery#8)
         :  +- SubqueryAlias subquery#8
         :     +- GlobalLimit 1
         :        +- LocalLimit 1
         :           +- Project [id#0L]
         :              +- Range 0, 100, 1, 4, [id#0L]
         +- Range 0, 100, 1, 4, [id#0L]
      
      == Physical Plan ==
      WholeStageCodegen
      :  +- Project [(cast(id#0L as double) / cast(subquery#9 as double)) AS (id / scalarsubquery())#11]
      :     :  +- Subquery subquery#9
      :     :     +- WholeStageCodegen
      :     :        :  +- TungstenAggregate(key=[], functions=[(sum(id#0L),mode=Final,isDistinct=false)], output=[sum(id)#10L])
      :     :        :     +- INPUT
      :     :        +- Exchange SinglePartition, None
      :     :           +- WholeStageCodegen
      :     :              :  +- TungstenAggregate(key=[], functions=[(sum(id#0L),mode=Partial,isDistinct=false)], output=[sum#14L])
      :     :              :     +- Range 0, 1, 4, 100, [id#0L]
      :     +- Filter (id#0L > subquery#8)
      :        :  +- Subquery subquery#8
      :        :     +- CollectLimit 1
      :        :        +- WholeStageCodegen
      :        :           :  +- Project [id#0L]
      :        :           :     +- Range 0, 1, 4, 100, [id#0L]
      :        +- Range 0, 1, 4, 100, [id#0L]
      ```
      
      The web UI looks like:
      
      ![subquery](https://cloud.githubusercontent.com/assets/40902/13377963/932bcbae-dda7-11e5-82f7-03c9be85d77c.png)
      
      This PR also changes the tree structure of WholeStageCodegen to make it consistent with other operators. Before this change, both WholeStageCodegen and InputAdapter held references to the same plans, which could be updated without notifying the other, causing problems; this was discovered by #11403.
      
      ## How was this patch tested?
      
      Existing tests, plus manual tests with the example query, checking the explain output and the web UI.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11417 from davies/viz_subquery.
      b373a888
    • [SPARK-13584][SQL][TESTS] Make ContinuousQueryManagerSuite not output logs to the console · ad0de99f
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Make ContinuousQueryManagerSuite not output logs to the console. The logs will still output to `unit-tests.log`.
      
      I also updated `SQLListenerMemoryLeakSuite` to use `quietly`, to avoid changing the log level, which would stop logs from reaching `unit-tests.log`.
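
      A rough sketch of what a `quietly`-style helper can do under these constraints, assuming log4j 1.x (illustrative, not Spark's actual helper): silence only console appenders, so file appenders keep writing to `unit-tests.log`.
      ```scala
      import org.apache.log4j.{ConsoleAppender, Level, Logger}
      import scala.collection.JavaConverters._

      // Raise the threshold of console appenders only, then restore it afterwards.
      def quietly[T](body: => T): T = {
        val consoles = Logger.getRootLogger.getAllAppenders.asScala
          .collect { case c: ConsoleAppender => c }.toList
        val saved = consoles.map(c => c -> c.getThreshold)
        consoles.foreach(_.setThreshold(Level.OFF))
        try body finally saved.foreach { case (c, t) => c.setThreshold(t) }
      }
      ```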
      
      ## How was this patch tested?
      
      Just check Jenkins output.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11439 from zsxwing/quietly-ContinuousQueryManagerSuite.
      ad0de99f
    • [SPARK-13632][SQL] Move commands.scala to command package · 3edcc402
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      This patch simply moves things to a new package in an effort to reduce the size of the diff in #11048. Currently the new package only has one file, but in the future we'll add many new commands in SPARK-13139.
      
      ## How was this patch tested?
      
      Jenkins.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #11482 from andrewor14/commands-package.
      3edcc402
    • [MINOR] Fix typos in comments and testcase name of code · 941b270b
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR fixes typos in code comments and in a test case name.
      
      ## How was this patch tested?
      
      manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11481 from dongjoon-hyun/minor_fix_typos_in_code.
      941b270b
    • [SPARK-13423][HOTFIX] Static analysis fixes for 2.x / fixed for Scala 2.10, again · 52035d10
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Fixes (another) compile problem due to inadvertent use of `Option.contains`, which only exists in Scala 2.11.
      
      ## How was this patch tested?
      
      Jenkins tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11496 from srowen/SPARK-13423.3.
      52035d10
    • [MINOR][ML][DOC] Remove duplicated periods at the end of some sharedParam · ce58e99a
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Remove duplicated periods at the end of some sharedParams in ScalaDoc, such as [here](https://github.com/apache/spark/pull/11344/files#diff-9edc669edcf2c0c7cf1efe4a0a57da80L367)
      cc mengxr srowen
      
      ## How was this patch tested?
      
      Documentation change, no test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11344 from yanboliang/shared-cleanup.
      ce58e99a
    • [SPARK-13543][SQL] Support for specifying compression codec for Parquet/ORC via option() · cf95d728
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR adds support for specifying compression codecs for both ORC and Parquet.
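
      For example, something along these lines (the codec names are common values such an option accepts, shown as an assumption rather than the PR's exact list):
      ```scala
      // Assuming a SQLContext `sqlContext`, e.g. in spark-shell:
      val df = sqlContext.range(10)
      df.write.format("parquet").option("compression", "snappy").save("/tmp/parquet_snappy")
      df.write.format("orc").option("compression", "zlib").save("/tmp/orc_zlib")
      ```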
      
      ## How was this patch tested?
      
      Unit tests within the IDE and code style tests with `dev/run_tests`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #11464 from HyukjinKwon/SPARK-13543.
      cf95d728
    • [SPARK-12877][ML] Add train-validation-split to pyspark · 511d4929
      JeremyNixon authored
      ## What changes were proposed in this pull request?
      The changes proposed were to add train-validation-split to pyspark.ml.tuning.
      
      ## How was this patch tested?
      This patch was tested through unit tests located in pyspark/ml/test.py.
      
      This is my original work and I license it to Spark.
      
      Author: JeremyNixon <jnixon2@gmail.com>
      
      Closes #11335 from JeremyNixon/tvs_pyspark.
      511d4929
    • [SPARK-13599][BUILD] remove transitive groovy dependencies from Hive · 9a48c656
      Steve Loughran authored
      ## What changes were proposed in this pull request?
      
      Modifies the dependency declarations of all the Hive artifacts to explicitly exclude the groovy-all JAR.
      
      This stops the groovy classes *and everything else in that uber-JAR* from getting into spark-assembly JAR.
      
      ## How was this patch tested?
      
      1. Pre-patch build was made: `mvn clean install -Pyarn,hive,hive-thriftserver`
      1. spark-assembly expanded, observed to have the org.codehaus.groovy packages and JARs
      1. A maven dependency tree was created `mvn dependency:tree -Pyarn,hive,hive-thriftserver  -Dverbose > target/dependencies.txt`
      1. This text file examined to confirm that groovy was being imported as a dependency of `org.spark-project.hive`
      1. Patch applied
      1. Repeated step1: clean build of project with ` -Pyarn,hive,hive-thriftserver` set
      1. Examined created spark-assembly, verified no org.codehaus packages
      1. Verified that the maven dependency tree no longer references groovy
      
      Note also that the size of the assembly JAR was 181628646 bytes before this patch and 166318515 after, about 15 MB smaller. That's a good indication that things are being excluded.
      
      Author: Steve Loughran <stevel@hortonworks.com>
      
      Closes #11449 from steveloughran/fixes/SPARK-13599-groovy-dependency.
      9a48c656
    • [SPARK-13013][DOCS] Replace example code in mllib-clustering.md using include_example · 70f6f964
      Xin Ren authored
      Replace example code in mllib-clustering.md using include_example
      https://issues.apache.org/jira/browse/SPARK-13013
      
      The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6.
      
      Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example.
      `{% include_example scala/org/apache/spark/examples/mllib/KMeansExample.scala %}`
      Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/KMeansExample.scala`, pick the code blocks marked "example", and replace the corresponding `{% highlight %}` code block in the markdown.
      
      See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337
      
      Author: Xin Ren <iamshrek@126.com>
      
      Closes #11116 from keypointt/SPARK-13013.
      70f6f964
    • [SPARK-13423][HOTFIX] Static analysis fixes for 2.x / fixed for Scala 2.10 · 645c3a85
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Fixes a compile problem due to inadvertent use of `Option.contains`, which only exists in Scala 2.11. The change should have been to replace `Option.exists(_ == x)` with `== Some(x)`. Replacing exists with contains only makes sense for collections; replacing the use of `Option.exists` still makes sense, though, as it's misleading.
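
      To spell out the difference (a small illustration, not from the patch):
      ```scala
      val opt: Option[Int] = Some(3)

      opt.exists(_ == 3) // compiles everywhere, but reads like a collection op
      opt == Some(3)     // the 2.10-safe form this fix switches to
      // opt.contains(3) // Option.contains only exists in Scala 2.11+
      ```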
      
      ## How was this patch tested?
      
      Jenkins tests / compilation
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11493 from srowen/SPARK-13423.2.
      645c3a85
    • [SPARK-13583][CORE][STREAMING] Remove unused imports and add checkstyle rule · b5f02d67
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review by saving a lot of time.
      This issue aims to remove unused imports from Java/Scala code and add the `UnusedImports` checkstyle rule to help developers.
      
      ## How was this patch tested?
      ```
      ./dev/lint-java
      ./build/sbt compile
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11438 from dongjoon-hyun/SPARK-13583.
      b5f02d67
    • [SPARK-13423][WIP][CORE][SQL][STREAMING] Static analysis fixes for 2.x · e97fc7f1
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Make some cross-cutting code improvements according to static analysis. These are individually up for discussion since they exist in separate commits that can be reverted. The changes are broadly:
      
      - Inner class should be static
      - Mismatched hashCode/equals
      - Overflow in compareTo
      - Unchecked warnings
      - Misuse of assert, vs junit.assert
      - get(a) + getOrElse(b) -> getOrElse(a,b)
      - Array/String .size -> .length (occasionally, -> .isEmpty / .nonEmpty) to avoid implicit conversions
      - Dead code
      - tailrec
      - exists(_ == x) -> contains(x)
      - find + nonEmpty -> exists
      - filter + size -> count
      - reduce(_+_) -> sum
      - map + flatten -> flatMap
      
      The most controversial may be .size -> .length, simply because of the sheer size of the change. It is intended to avoid implicit conversions that might be expensive in some places. A few of the collection rewrites are illustrated below.
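
      Small illustrations of those rewrites (not taken from the diff):
      ```scala
      val xs = Seq(1, 2, 3)

      xs.exists(_ == 2)        // -> xs.contains(2)
      xs.find(_ > 1).nonEmpty  // -> xs.exists(_ > 1)
      xs.filter(_ > 1).size    // -> xs.count(_ > 1)
      xs.reduce(_ + _)         // -> xs.sum
      xs.map(Seq(_)).flatten   // -> xs.flatMap(Seq(_))
      ```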
      
      ## How was this patch tested?
      
      Existing Jenkins unit tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11292 from srowen/SPARK-13423.
      e97fc7f1
    • [HOT-FIX] Recover some deprecations for 2.10 compatibility. · 02b7677e
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      #11479 [SPARK-13627] broke 2.10 compatibility: [2.10-Build](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-scala-2.10/292/console)
      At this moment, we need to support both 2.10 and 2.11.
      This PR recovers some deprecated methods which were replaced by [SPARK-13627].
      
      ## How was this patch tested?
      
      Jenkins build: Both 2.10, 2.11.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11488 from dongjoon-hyun/hotfix_compatibility_with_2.10.
      02b7677e
    • [SPARK-13466] [SQL] Remove projects that become redundant after column pruning rule · 7b25dc7b
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-13466
      
      ## What changes were proposed in this pull request?
      
      With column pruning rule in optimizer, some Project operators will become redundant. We should remove these redundant Projects.
      
      For an example query:
      
          val input = LocalRelation('key.int, 'value.string)
      
          val query =
            Project(Seq($"x.key", $"y.key"),
              Join(
                SubqueryAlias("x", input),
                BroadcastHint(SubqueryAlias("y", input)), Inner, None))
      
      After the first run of column pruning, it would look like:
      
          Project(Seq($"x.key", $"y.key"),
            Join(
              Project(Seq($"x.key"), SubqueryAlias("x", input)),
              Project(Seq($"y.key"),      <-- inserted by the rule
              BroadcastHint(SubqueryAlias("y", input))),
              Inner, None))
      
      Actually, we don't need the outer Project now. This patch removes it:
      
          Join(
            Project(Seq($"x.key"), SubqueryAlias("x", input)),
            Project(Seq($"y.key"),
            BroadcastHint(SubqueryAlias("y", input))),
            Inner, None)
      
      ## How was this patch tested?
      
      Unit test is added into ColumnPruningSuite.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #11341 from viirya/remove-redundant-project.
      7b25dc7b
    • [SPARK-13635] [SQL] Enable LimitPushdown optimizer rule because we have... · 1085bd86
      Liang-Chi Hsieh authored
      [SPARK-13635] [SQL] Enable LimitPushdown optimizer rule because we have whole-stage codegen for Limit
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-13635
      
      ## What changes were proposed in this pull request?
      
      The LimitPushdown optimizer rule had been disabled because there was no whole-stage codegen for Limit. As we have whole-stage codegen for Limit now, we should enable it.
      
      ## How was this patch tested?
      
      As we only re-enable the LimitPushdown optimizer rule, there is no need to add new tests for it.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #11483 from viirya/enable-limitpushdown.
      1085bd86
    • [SPARK-13621][CORE] TestExecutor.scala needs to be moved to test package · 56e3d007
      Devaraj K authored
      Moved TestExecutor.scala from src to test package and removed the unused file TestClient.scala.
      
      Author: Devaraj K <devaraj@apache.org>
      
      Closes #11474 from devaraj-kavali/SPARK-13621.
      56e3d007
    • [SPARK-13616][SQL] Let SQLBuilder convert logical plan without a project on top of it · f87ce050
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-13616
      
      ## What changes were proposed in this pull request?
      
      It is possible that a logical plan has had its `Project` removed from the top, or that the plan doesn't have a top `Project` from the beginning because it is not necessary. Currently the `SQLBuilder` can't convert such plans back to SQL. This change adds that feature.
      
      ## How was this patch tested?
      
      A test is added to `LogicalPlanToSQLSuite`.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #11466 from viirya/sqlbuilder-notopselect.
      f87ce050
  6. Mar 02, 2016
    • [SPARK-13627][SQL][YARN] Fix simple deprecation warnings. · 9c274ac4
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR aims to fix the following deprecation warnings (one of them is illustrated after the list).
        * MethodSymbolApi.paramss -> paramLists
        * AnnotationApi.tpe -> tree.tpe
        * BufferLike.readOnly -> toList
        * StandardNames.nme -> termNames
        * scala.tools.nsc.interpreter.AbstractFileClassLoader -> scala.reflect.internal.util.AbstractFileClassLoader
        * TypeApi.declarations -> decls
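
      One of these, `paramss -> paramLists`, as a standalone illustration on Scala 2.11 (not the patch itself):
      ```scala
      import scala.reflect.runtime.universe._

      // Reflect String.isEmpty and read its parameter lists with the new API.
      val isEmpty = typeOf[String].member(TermName("isEmpty")).asMethod
      // isEmpty.paramss   // deprecated since Scala 2.11
      isEmpty.paramLists   // the replacement
      ```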
      
      ## How was this patch tested?
      
      Check the compile build log and make sure the tests pass.
      ```
      ./build/sbt
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11479 from dongjoon-hyun/SPARK-13627.
      9c274ac4
    • [SPARK-13617][SQL] remove unnecessary GroupingAnalytics trait · b60b8137
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      The `GroupingAnalytics` trait only has one implementation; it's an unnecessary abstraction. This PR removes it, and does some code simplification when resolving `GroupingSet`.
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11469 from cloud-fan/groupingset.
      b60b8137