  1. May 06, 2016
    • Luciano Resende's avatar
      [SPARK-14738][BUILD] Separate docker integration tests from main build · a03c5e68
      Luciano Resende authored
      ## What changes were proposed in this pull request?
      
Create a Maven profile for executing the docker integration tests
      Remove docker integration tests from main sbt build
      Update documentation on how to run docker integration tests from sbt
      
      ## How was this patch tested?
      
Manual test of the docker integration tests, as in:
      mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 compile test
      
      ## Other comments
      
Note that the DB2 Docker tests are still disabled, as there is a kernel version issue on the AMPLab Jenkins slaves and we would need to get them to the right level before enabling those tests. They do run OK locally with the updates from PR #12348.
      
      Author: Luciano Resende <lresende@apache.org>
      
      Closes #12508 from lresende/docker.
      a03c5e68
  2. May 05, 2016
    • Sun Rui's avatar
      [SPARK-11395][SPARKR] Support over and window specification in SparkR. · 157a49aa
      Sun Rui authored
      This PR:
      1. Implement WindowSpec S4 class.
      2. Implement Window.partitionBy() and Window.orderBy() as utility functions to create WindowSpec objects.
      3. Implement over() of Column class.
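For illustration, a minimal sketch of the equivalent usage in the existing Scala window API, which the new SparkR functions mirror (the DataFrame `df` and its columns are hypothetical):
```
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

// Partition rows by department and order by salary within each partition
val w = Window.partitionBy("dept").orderBy("salary")
// Apply a window function over the spec via Column.over
val ranked = df.select(df("dept"), df("salary"), rank().over(w))
```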
      
      Author: Sun Rui <rui.sun@intel.com>
      Author: Sun Rui <sunrui2016@gmail.com>
      
      Closes #10094 from sun-rui/SPARK-11395.
      157a49aa
    • Andrew Or's avatar
      [HOTFIX] Fix MLUtils compile · 7f5922aa
      Andrew Or authored
      7f5922aa
    • Jacek Laskowski's avatar
      [SPARK-15152][DOC][MINOR] Scaladoc and Code style Improvements · bbb77734
      Jacek Laskowski authored
      ## What changes were proposed in this pull request?
      
      Minor doc and code style fixes
      
      ## How was this patch tested?
      
      local build
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #12928 from jaceklaskowski/SPARK-15152.
      bbb77734
    • Dilip Biswal's avatar
      [SPARK-14893][SQL] Re-enable HiveSparkSubmitSuite SPARK-8489 test after HiveContext is removed · 02c07e89
      Dilip Biswal authored
      ## What changes were proposed in this pull request?
      
      Enable the test that was disabled when HiveContext was removed.
      
      ## How was this patch tested?
      
      Made sure the enabled test passes with the new jar.
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #12924 from dilipbiswal/spark-14893.
      02c07e89
    • Ryan Blue's avatar
      [SPARK-9926] Parallelize partition logic in UnionRDD. · 08db4912
      Ryan Blue authored
This patch has the new logic from #8512 that uses a parallel collection to compute partitions in UnionRDD. The rest of #8512 added an alternative code path for calculating splits in S3, but that isn't necessary to get the same speedup. The underlying problem wasn't that bulk listing wasn't used; it was that an extra FileStatus was retrieved for each file. The fix was just committed as [HADOOP-12810](https://issues.apache.org/jira/browse/HADOOP-12810). (I think the original commit also used a single prefix to enumerate all paths, but that isn't always helpful and it was removed in later versions, so there is no need for SparkS3Utils.)
      
      I tested this using the same table that piapiaozhexiu was using. Calculating splits for a 10-day period took 25 seconds with this change and HADOOP-12810, which is on par with the results from #8512.
      
      Author: Ryan Blue <blue@apache.org>
      Author: Cheolsoo Park <cheolsoop@netflix.com>
      
      Closes #11242 from rdblue/SPARK-9926-parallelize-union-rdd.
      08db4912
    • depend's avatar
      [SPARK-15158][CORE] downgrade shouldRollover message to debug level · 5c47db06
      depend authored
      ## What changes were proposed in this pull request?
Set the log level to debug when checking shouldRollover.
      
      ## How was this patch tested?
      It's tested manually.
      
      Author: depend <depend@gmail.com>
      
      Closes #12931 from depend/master.
      5c47db06
    • Dongjoon Hyun's avatar
      [SPARK-15134][EXAMPLE] Indent SparkSession builder patterns and update... · 2c170dd3
      Dongjoon Hyun authored
      [SPARK-15134][EXAMPLE] Indent SparkSession builder patterns and update binary_classification_metrics_example.py
      
      ## What changes were proposed in this pull request?
      
This issue addresses the comments in SPARK-15031 and also fixes java-linter errors.
      - Use multiline format in SparkSession builder patterns.
      - Update `binary_classification_metrics_example.py` to use `SparkSession`.
      - Fix Java Linter errors (in SPARK-13745, SPARK-15031, and so far)
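A minimal sketch of the multiline builder format referred to above (the app name is illustrative):
```
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("BinaryClassificationMetricsExample")  // illustrative app name
  .getOrCreate()
```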
      
      ## How was this patch tested?
      
Passed the Jenkins tests and ran `dev/lint-java` manually.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12911 from dongjoon-hyun/SPARK-15134.
      2c170dd3
    • Shixiong Zhu's avatar
      [SPARK-15135][SQL] Make sure SparkSession thread safe · bb9991de
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Went through SparkSession and its members and fixed non-thread-safe classes used by SparkSession
      
      ## How was this patch tested?
      
      Existing unit tests
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #12915 from zsxwing/spark-session-thread-safe.
      bb9991de
    • Sandeep Singh's avatar
      [SPARK-15072][SQL][REPL][EXAMPLES] Remove SparkSession.withHiveSupport · ed6f3f8a
      Sandeep Singh authored
      ## What changes were proposed in this pull request?
Remove the `withHiveSupport` method of `SparkSession`; use `enableHiveSupport` instead.
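A minimal sketch of the replacement, assuming Hive classes are on the classpath:
```
import org.apache.spark.sql.SparkSession

// Instead of the removed withHiveSupport helper, callers opt in via the builder
val spark = SparkSession.builder
  .enableHiveSupport()
  .getOrCreate()
```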
      
      ## How was this patch tested?
      ran tests locally
      
      Author: Sandeep Singh <sandeep@techaddict.me>
      
      Closes #12851 from techaddict/SPARK-15072.
      ed6f3f8a
    • gatorsmile's avatar
      [SPARK-14124][SQL][FOLLOWUP] Implement Database-related DDL Commands · 8cba57a7
      gatorsmile authored
      #### What changes were proposed in this pull request?
      
First, a few test cases failed on Mac OS X because the value of the `java.io.tmpdir` property does not include a trailing slash on some platforms. Hive always removes the last trailing slash. For example, values reported on the web:
      ```
      Win NT  --> C:\TEMP\
      Win XP  --> C:\TEMP
      Solaris --> /var/tmp/
      Linux   --> /var/tmp
      ```
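A minimal sketch (illustrative only, not the exact fix) of normalizing the trailing slash so the paths compare equal on every platform:
```
// Drop a trailing separator, if any, before comparing paths
val tmpDir = System.getProperty("java.io.tmpdir").stripSuffix("/")
```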
      Second, a couple of test cases are added to verify if the commands work properly.
      
      #### How was this patch tested?
Added a test case for it and corrected the previous test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #12081 from gatorsmile/mkdir.
      8cba57a7
    • Cheng Lian's avatar
      [MINOR][BUILD] Adds spark-warehouse/ to .gitignore · 63db2bd2
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      Adds spark-warehouse/ to `.gitignore`.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #12929 from liancheng/gitignore-spark-warehouse.
      63db2bd2
    • NarineK's avatar
      [SPARK-15110] [SPARKR] Implement repartitionByColumn for SparkR DataFrames · 22226fcc
      NarineK authored
      ## What changes were proposed in this pull request?
      
      Implement repartitionByColumn on DataFrame.
This will allow us to run R functions on each partition identified by column groups with the dapply() method.
      
      ## How was this patch tested?
      
      Unit tests
      
      Author: NarineK <narine.kokhlikyan@us.ibm.com>
      
      Closes #12887 from NarineK/repartitionByColumns.
      22226fcc
    • hyukjinkwon's avatar
      [SPARK-15148][SQL] Upgrade Univocity library from 2.0.2 to 2.1.0 · ac12b35d
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-15148
      
Mainly, it improves performance by roughly 30%-40% according to the [release note](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.1.0). Details of the purpose are described in the JIRA.
      
      This PR upgrades Univocity library from 2.0.2 to 2.1.0.
      
      ## How was this patch tested?
      
      Existing tests should cover this.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #12923 from HyukjinKwon/SPARK-15148.
      ac12b35d
    • Wenchen Fan's avatar
      [SPARK-14139][SQL] RowEncoder should preserve schema nullability · 55cc1c99
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
The problem is: in `RowEncoder`, we use `Invoke` to get the field of an external row, which loses the nullability information. This PR creates a `GetExternalRowField` expression so that we can preserve the nullability info.
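A hedged sketch of the behavior being preserved (the schema is illustrative; `RowEncoder` is an internal API):
```
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("id", LongType, nullable = false)
  .add("name", StringType, nullable = true)

// With this change the encoder keeps the original nullability
// instead of marking every field nullable
val encoder = RowEncoder(schema)
assert(!encoder.schema("id").nullable)
```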
      
TODO: simplify the null-handling logic in `RowEncoder` to remove the many if branches, in a follow-up PR.
      
      ## How was this patch tested?
      
      new tests in `RowEncoderSuite`
      
Note that this PR takes over https://github.com/apache/spark/pull/11980, with a little simplification, so all credit should go to koertkuipers.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      Author: Koert Kuipers <koert@tresata.com>
      
      Closes #12364 from cloud-fan/nullable.
      55cc1c99
    • Jason Moore's avatar
      [SPARK-14915][CORE] Don't re-queue a task if another attempt has already succeeded · 77361a43
      Jason Moore authored
      ## What changes were proposed in this pull request?
      
      Don't re-queue a task if another attempt has already succeeded.  This currently happens when a speculative task is denied from committing the result due to another copy of the task already having succeeded.
      
      ## How was this patch tested?
      
I'm running a job with a fair bit of skew in processing time across its tasks, enough for speculation to trigger in the last quarter (default settings), causing many commit-denied exceptions to be thrown. Previously, these tasks were then retried over and over again until the stage possibly completed (despite using compute resources on these superfluous tasks). With this change (applied to the 1.6 branch), they no longer retry and the stage completes successfully without these extra task attempts.
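For reference, a hedged sketch of the configuration that drives the scenario described above (keys and defaults quoted from memory, not part of this change):
```
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")            // speculative execution is off by default
  .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish before speculating
```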
      
      Author: Jason Moore <jasonmoore2k@outlook.com>
      
      Closes #12751 from jasonmoore2k/SPARK-14915.
      77361a43
    • Luciano Resende's avatar
      [SPARK-14589][SQL] Enhance DB2 JDBC Dialect docker tests · 10443022
      Luciano Resende authored
      ## What changes were proposed in this pull request?
      
Enhance the DB2 JDBC Dialect docker tests, as they seemed to have had some issues in a previous merge that caused some tests to fail.
      
      ## How was this patch tested?
      
      By running the integration tests locally.
      
      Author: Luciano Resende <lresende@apache.org>
      
      Closes #12348 from lresende/SPARK-14589.
      10443022
    • Holden Karau's avatar
      [SPARK-15106][PYSPARK][ML] Add PySpark package doc for ML component & remove "BETA" · 4c0d827c
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
Copy the package documentation from Scala/Java to Python for the ML package and remove the beta tags. Not entirely sure we want to drop the BETA tag, but since we are making this the default, it seems like the time to remove it (happy to put it back if we want to keep it BETA).
      
      ## How was this patch tested?
      
      Python documentation built locally as HTML and text and verified output.
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #12883 from holdenk/SPARK-15106-add-pyspark-package-doc-for-ml.
      4c0d827c
    • mcheah's avatar
      [SPARK-12154] Upgrade to Jersey 2 · b7fdc23c
      mcheah authored
      ## What changes were proposed in this pull request?
      
      Replace com.sun.jersey with org.glassfish.jersey. Changes to the Spark Web UI code were required to compile. The changes were relatively standard Jersey migration things.
      
      ## How was this patch tested?
      
      I did a manual test for the standalone web APIs. Although I didn't test the functionality of the security filter itself, the code that changed non-trivially is how we actually register the filter. I attached a debugger to the Spark master and verified that the SecurityFilter code is indeed invoked upon hitting /api/v1/applications.
      
      Author: mcheah <mcheah@palantir.com>
      
      Closes #12715 from mccheah/feature/upgrade-jersey.
      b7fdc23c
    • Lining Sun's avatar
      [SPARK-15123] upgrade org.json4s to 3.2.11 version · 592fc455
      Lining Sun authored
      ## What changes were proposed in this pull request?
      
We had an issue when using Snowplow in our Spark applications. Snowplow requires json4s version 3.2.11 while Spark still uses the several-year-old version 3.2.10. The change is to upgrade the json4s jar to 3.2.11.
      
      ## How was this patch tested?
      
      We built Spark jar and successfully ran our applications in local and cluster modes.
      
      Author: Lining Sun <lining@gmail.com>
      
      Closes #12901 from liningalex/master.
      592fc455
    • Abhinav Gupta's avatar
      [SPARK-15045] [CORE] Remove dead code in TaskMemoryManager.cleanUpAllAllocatedMemory for pageTable · 1a5c6fce
      Abhinav Gupta authored
      ## What changes were proposed in this pull request?
      
Removed the dead code as suggested.
      
      Author: Abhinav Gupta <abhi.951990@gmail.com>
      
      Closes #12829 from abhi951990/master.
      1a5c6fce
    • Kousuke Saruta's avatar
      [SPARK-15132][MINOR][SQL] Debug log for generated code should be printed with proper indentation · 1a9b3415
      Kousuke Saruta authored
      ## What changes were proposed in this pull request?
      
      Similar to #11990, GenerateOrdering and GenerateColumnAccessor should print debug log for generated code with proper indentation.
      
      ## How was this patch tested?
      
      Manually checked.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #12908 from sarutak/SPARK-15132.
      1a9b3415
  3. May 04, 2016
    • Davies Liu's avatar
      [MINOR] remove dead code · 42837419
      Davies Liu authored
      42837419
    • Tathagata Das's avatar
      [SPARK-15131][SQL] Shutdown StateStore management thread when SparkContext has been shutdown · bde27b89
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
Make sure that whenever the StateStoreCoordinator cannot be contacted, we assume that the SparkContext and RpcEnv on the driver have been shut down, and therefore stop the StateStore management thread and unload all loaded stores.
      
      ## How was this patch tested?
      
      Updated unit tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #12905 from tdas/SPARK-15131.
      bde27b89
    • gatorsmile's avatar
      [SPARK-14993][SQL] Fix Partition Discovery Inconsistency when Input is a Path to Parquet File · ef55e46c
      gatorsmile authored
      #### What changes were proposed in this pull request?
      When we load a dataset, if we set the path to ```/path/a=1```, we will not take `a` as the partitioning column. However, if we set the path to ```/path/a=1/file.parquet```, we take `a` as the partitioning column and it shows up in the schema.
      
      This PR is to fix the behavior inconsistency issue.
      
      The base path contains a set of paths that are considered as the base dirs of the input datasets. The partitioning discovery logic will make sure it will stop when it reaches any base path.
      
      By default, the paths of the dataset provided by users will be base paths. Below are three typical cases,
      **Case 1**```sqlContext.read.parquet("/path/something=true/")```: the base path will be
      `/path/something=true/`, and the returned DataFrame will not contain a column of `something`.
      **Case 2**```sqlContext.read.parquet("/path/something=true/a.parquet")```: the base path will be
      still `/path/something=true/`, and the returned DataFrame will also not contain a column of
      `something`.
      **Case 3**```sqlContext.read.parquet("/path/")```: the base path will be `/path/`, and the returned
      DataFrame will have the column of `something`.
      
      Users also can override the basePath by setting `basePath` in the options to pass the new base
      path to the data source. For example,
      ```sqlContext.read.option("basePath", "/path/").parquet("/path/something=true/")```,
      and the returned DataFrame will have the column of `something`.
      
      The related PRs:
      - https://github.com/apache/spark/pull/9651
      - https://github.com/apache/spark/pull/10211
      
      #### How was this patch tested?
      Added a couple of test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #12828 from gatorsmile/readPartitionedTable.
      ef55e46c
    • Sean Zhong's avatar
      [SPARK-6339][SQL] Supports CREATE TEMPORARY VIEW tableIdentifier AS query · 8fb1463d
      Sean Zhong authored
      ## What changes were proposed in this pull request?
      
This PR supports the new SQL syntax CREATE TEMPORARY VIEW, for example:
      ```
      CREATE TEMPORARY VIEW viewName AS SELECT * from xx
      CREATE OR REPLACE TEMPORARY VIEW viewName AS SELECT * from xx
      CREATE TEMPORARY VIEW viewName (c1 COMMENT 'blabla', c2 COMMENT 'blabla') AS SELECT * FROM xx
      ```
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: Sean Zhong <clockfly@gmail.com>
      
      Closes #12872 from clockfly/spark-6399.
      8fb1463d
    • Andrew Or's avatar
      [SPARK-14896][SQL] Deprecate HiveContext in python · fa79d346
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      See title.
      
      ## How was this patch tested?
      
      PySpark tests.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12917 from andrewor14/deprecate-hive-context-python.
      fa79d346
    • sethah's avatar
      [MINOR][SQL] Fix typo in DataFrameReader csv documentation · b2813776
      sethah authored
      ## What changes were proposed in this pull request?
      Typo fix
      
      ## How was this patch tested?
      No tests
      
      My apologies for the tiny PR, but I stumbled across this today and wanted to get it corrected for 2.0.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #12912 from sethah/csv_typo.
      b2813776
    • Wenchen Fan's avatar
      [SPARK-15116] In REPL we should create SparkSession first and get SparkContext from it · a432a2b8
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
See https://github.com/apache/spark/pull/12873#discussion_r61993910. The problem is that if we create `SparkContext` first and then call `SparkSession.builder.enableHiveSupport().getOrCreate()`, we will reuse the existing `SparkContext` and the Hive flag won't be set.
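A minimal sketch of the new ordering (not the actual REPL code; the master setting is illustrative and Hive classes are assumed on the classpath):
```
import org.apache.spark.sql.SparkSession

// Previously the REPL created a SparkContext first; a later
// SparkSession.builder.enableHiveSupport().getOrCreate() then reused
// that context and the Hive flag was silently dropped.
// Building the session first avoids this:
val spark = SparkSession.builder
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()
val sc = spark.sparkContext   // the REPL now derives the context from the session
```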
      
      ## How was this patch tested?
      
      verified it locally.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12890 from cloud-fan/repl.
      a432a2b8
    • Sebastien Rainville's avatar
      [SPARK-13001][CORE][MESOS] Prevent getting offers when reached max cores · eb019af9
      Sebastien Rainville authored
      Similar to https://github.com/apache/spark/pull/8639
      
This change rejects offers for 120s when `spark.cores.max` is reached in coarse-grained mode, to mitigate offer starvation. This prevents Mesos from sending us offers again and again, starving other frameworks. This is especially problematic when running many small frameworks on the same Mesos cluster, e.g. many small Spark streaming jobs, and causes the bigger Spark jobs to stop receiving offers. By rejecting the offers for a long period of time, they become available to those other frameworks.
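For illustration only, a hedged sketch of what declining an offer with a refuse period looks like in the Mesos scheduler API (names and usage are assumptions, not the actual change; the 120s value comes from the description above):
```
import org.apache.mesos.Protos.{Filters, OfferID}
import org.apache.mesos.SchedulerDriver

// Decline an offer and ask Mesos not to re-offer these resources for 120 seconds
def declineForLong(driver: SchedulerDriver, offerId: OfferID): Unit = {
  val filters = Filters.newBuilder().setRefuseSeconds(120).build()
  driver.declineOffer(offerId, filters)
}
```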
      
      Author: Sebastien Rainville <sebastien@hopper.com>
      
      Closes #10924 from sebastienrainville/master.
      eb019af9
    • Dongjoon Hyun's avatar
      [SPARK-15031][EXAMPLE] Use SparkSession in Scala/Python/Java example. · cdce4e62
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR aims to update Scala/Python/Java examples by replacing `SQLContext` with newly added `SparkSession`.
      
- Use **SparkSession Builder Pattern** in 154 (Scala 55, Java 52, Python 47) files.
      - Add `getConf` in Python SparkContext class: `python/pyspark/context.py`
      - Replace **SQLContext Singleton Pattern** with **SparkSession Singleton Pattern**:
        - `SqlNetworkWordCount.scala`
        - `JavaSqlNetworkWordCount.java`
        - `sql_network_wordcount.py`
      
Now, `SQLContext` is used only in R examples and the following two Python examples. The Python examples are untouched in this PR since they already fail with some unknown issue.
      - `simple_params_example.py`
      - `aft_survival_regression.py`
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12809 from dongjoon-hyun/SPARK-15031.
      cdce4e62
    • Bryan Cutler's avatar
      [SPARK-12299][CORE] Remove history serving functionality from Master · cf2e9da6
      Bryan Cutler authored
Remove history server functionality from the standalone Master. Previously, the Master process rebuilt a SparkUI once the application was completed, which sometimes caused problems, such as OOM, when the application event log was large (see SPARK-6270). Keeping this functionality out of the Master will help to simplify the process and increase stability.
      
      Testing for this change included running core unit tests and manually running an application on a standalone cluster to verify that it completed successfully and that the Master UI functioned correctly.  Also added 2 unit tests to verify killing an application and driver from MasterWebUI makes the correct request to the Master.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #10991 from BryanCutler/remove-history-master-SPARK-12299.
      cf2e9da6
    • Thomas Graves's avatar
      [SPARK-15121] Improve logging of external shuffle handler · 0c00391f
      Thomas Graves authored
      ## What changes were proposed in this pull request?
      
      Add more informative logging in the external shuffle service to aid in debugging who is connecting to the YARN Nodemanager when the external shuffle service runs under it.
      
      ## How was this patch tested?
      
      Ran and saw logs coming out in log file.
      
      Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>
      
      Closes #12900 from tgravescs/SPARK-15121.
      0c00391f
    • Reynold Xin's avatar
      [SPARK-15126][SQL] RuntimeConfig.set should return Unit · 6ae9fc00
      Reynold Xin authored
      ## What changes were proposed in this pull request?
Currently we return RuntimeConfig itself to facilitate chaining. However, it makes the output in interactive environments (e.g. notebooks, the Scala REPL) weird because it shows the response of calling set as a RuntimeConfig itself.
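A minimal sketch, assuming an existing `SparkSession` named `spark` (key and value are illustrative):
```
// Before this change, set(...) returned the RuntimeConfig itself, so interactive
// environments echoed the config object after every call; it now returns Unit.
spark.conf.set("spark.sql.shuffle.partitions", "200")
```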
      
      ## How was this patch tested?
      Updated unit tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12902 from rxin/SPARK-15126.
      6ae9fc00
    • Tathagata Das's avatar
      [SPARK-15103][SQL] Refactored FileCatalog class to allow StreamFileCatalog to infer partitioning · 0fd3a474
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
File Stream Sink writes the list of written files in a metadata log. StreamFileCatalog reads the list of files for processing. However, StreamFileCatalog does not infer partitioning as HDFSFileCatalog does.
      
      This PR enables that by refactoring HDFSFileCatalog to create an abstract class PartitioningAwareFileCatalog, that has all the functionality to infer partitions from a list of leaf files.
      - HDFSFileCatalog has been renamed to ListingFileCatalog and it extends PartitioningAwareFileCatalog by providing a list of leaf files from recursive directory scanning.
      - StreamFileCatalog has been renamed to MetadataLogFileCatalog and it extends PartitioningAwareFileCatalog by providing a list of leaf files from the metadata log.
- The above two classes have been moved into their own files, as they are not interfaces that should be in fileSourceInterfaces.scala.
      
      ## How was this patch tested?
- FileStreamSinkSuite was updated to check whether partitioning gets inferred, and whether on read the partitions get pruned correctly based on the query.
      - Other unit tests are unchanged and pass as expected.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #12879 from tdas/SPARK-15103.
      0fd3a474
    • Reynold Xin's avatar
      [SPARK-15115][SQL] Reorganize whole stage codegen benchmark suites · 6274a520
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      We currently have a single suite that is very large, making it difficult to maintain and play with specific primitives. This patch reorganizes the file by creating multiple benchmark suites in a single package.
      
Most of the changes are a straightforward move of code. On top of the code moving, I did:
1. Use SparkSession instead of SQLContext.
2. Turn most benchmark scenarios into their own test cases, rather than having multiple scenarios in a single test case, which takes forever to run.
      
      ## How was this patch tested?
      This is a test only change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12891 from rxin/SPARK-15115.
      6274a520
    • Zheng RuiFeng's avatar
      [MINOR] Add python3 compatibility in python examples · 4530250f
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Add python3 compatibility in python examples
      
      ## How was this patch tested?
      manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #12868 from zhengruifeng/fix_gmm_py.
      4530250f
    • Liang-Chi Hsieh's avatar
      [SPARK-14951] [SQL] Support subexpression elimination in TungstenAggregate · b85d21fb
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
We can support subexpression elimination in TungstenAggregate by using the current `EquivalentExpressions`, which is already used in subexpression elimination for expression codegen.

However, in whole-stage codegen we can't wrap the common expressions' code in functions as before, so we simply generate the code snippets for common expressions. These code snippets are inserted before the common expressions are actually used in the generated Java code.

For multiple `TypedAggregateExpression`s used in an aggregation operator, the input types should be the same, so their `inputDeserializer` will be the same too. This patch can also reduce redundant input deserialization.
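As an illustration of the kind of query that benefits (the DataFrame `df` and its columns are hypothetical, and `spark` is an assumed SparkSession):
```
import org.apache.spark.sql.functions.{avg, sum}
import spark.implicits._

// Both aggregates share the common subexpression ($"a" + $"b"); with subexpression
// elimination it is evaluated once per input row instead of once per aggregate.
val result = df.groupBy($"k").agg(sum($"a" + $"b"), avg($"a" + $"b"))
```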
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #12729 from viirya/subexpr-elimination-tungstenaggregate.
      b85d21fb
    • Reynold Xin's avatar
      [SPARK-15109][SQL] Accept Dataset[_] in joins · d864c55c
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch changes the join API in Dataset so they can accept any Dataset, rather than just DataFrames.
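A minimal sketch of the relaxed signature in use, assuming an existing `SparkSession` named `spark`:
```
import spark.implicits._

case class Item(id: Long, name: String)

// A typed Dataset can now be passed to join directly, not just a DataFrame
val items = Seq(Item(1L, "a"), Item(2L, "b")).toDS()
val ids = spark.range(3).toDF("id")
val joined = ids.join(items, "id")
```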
      
      ## How was this patch tested?
      N/A.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12886 from rxin/SPARK-15109.
      d864c55c
    • Liwei Lin's avatar
      [SPARK-15022][SPARK-15023][SQL][STREAMING] Add support for testing against the... · e597ec6f
      Liwei Lin authored
      [SPARK-15022][SPARK-15023][SQL][STREAMING] Add support for testing against the `ProcessingTime(intervalMS > 0)` trigger and `ManualClock`
      
      ## What changes were proposed in this pull request?
      
Currently in `StreamTest`, we have a `StartStream` which will start a streaming query against trigger `ProcessingTime(intervalMS = 0)` and `SystemClock`.
      
We also need to test cases against `ProcessingTime(intervalMS > 0)`, which often requires `ManualClock`.
      
      This patch:
- fixes an issue in `ProcessingTimeExecutor` where, for a batch, it should run `batchRunner` only once but might run it multiple times under certain conditions;
      - adds support for testing against the `ProcessingTime(intervalMS > 0)` trigger and `AdvanceManualClock`, by specifying them as fields for `StartStream`, and by adding an `AdvanceClock` action;
      - adds a test, which takes advantage of the new `StartStream` and `AdvanceManualClock`, to test against [PR#[SPARK-14942] Reduce delay between batch construction and execution ](https://github.com/apache/spark/pull/12725).
      
      ## How was this patch tested?
      
      N/A
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #12797 from lw-lin/add-trigger-test-support.
      e597ec6f