  1. May 27, 2016
    • Sital Kedia's avatar
      [SPARK-15569] Reduce frequency of updateBytesWritten function in Disk… · ce756daa
      Sital Kedia authored
      ## What changes were proposed in this pull request?
      
      Profiling a Spark job that spills a large amount of intermediate data, we found that a significant portion of time is spent in the DiskObjectWriter.updateBytesWritten function. Looking at the code, we see that the function is called too frequently to update the number of bytes written to disk. We should reduce the frequency to avoid this.
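      
      A minimal sketch of the batching pattern described above (all names here are hypothetical, not the actual `DiskObjectWriter` internals): refresh the bytes-written metric once every N records instead of on every write, and reconcile once on close.
      
      ```scala
      // Hypothetical sketch: amortize an expensive metric update over many
      // record writes instead of paying for it on every record.
      class BatchedBytesWrittenTracker(updateInterval: Int = 16384) {
        private var recordsSinceLastUpdate = 0
        private var reportedBytesWritten = 0L
      
        // Called once per record written; `position` is the file's current offset.
        def recordWritten(position: Long): Unit = {
          recordsSinceLastUpdate += 1
          if (recordsSinceLastUpdate >= updateInterval) {
            reportedBytesWritten = position // the (formerly per-record) metric update
            recordsSinceLastUpdate = 0
          }
        }
      
        // Called when the writer is closed, so the final value is always exact.
        def commit(finalPosition: Long): Unit = {
          reportedBytesWritten = finalPosition
          recordsSinceLastUpdate = 0
        }
      }
      ```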
      
      ## How was this patch tested?
      
      Tested by running the job on a cluster; we saw a 20% CPU gain from this change.
      
      Author: Sital Kedia <skedia@fb.com>
      
      Closes #13332 from sitalkedia/DiskObjectWriter.
      ce756daa
    • Xinh Huynh's avatar
      [MINOR][DOCS] Typo fixes in Dataset scaladoc · 5bdbedf2
      Xinh Huynh authored
      ## What changes were proposed in this pull request?
      
      Minor typo fixes in Dataset scaladoc
      * Corrected the context type to SparkSession, not SQLContext.
      liancheng rxin andrewor14
      
      ## How was this patch tested?
      
      Compiled locally
      
      Author: Xinh Huynh <xinh_huynh@yahoo.com>
      
      Closes #13330 from xinhhuynh/fix-dataset-typos.
      5bdbedf2
    • Reynold Xin's avatar
      [SPARK-15597][SQL] Add SparkSession.emptyDataset · a52e6813
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch adds a new function emptyDataset to SparkSession, for creating an empty dataset.
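      
      For reference, a usage sketch of the new API (assuming a Spark 2.0 session):
      
      ```scala
      import org.apache.spark.sql.SparkSession
      
      val spark = SparkSession.builder().master("local").appName("demo").getOrCreate()
      import spark.implicits._
      
      // A Dataset[String] with zero rows; the encoder is resolved implicitly.
      val empty = spark.emptyDataset[String]
      assert(empty.count() == 0)
      ```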
      
      ## How was this patch tested?
      Added a test case.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13344 from rxin/SPARK-15597.
      a52e6813
    • Sameer Agarwal's avatar
      [SPARK-15599][SQL][DOCS] API docs for `createDataset` functions in SparkSession · 635fb30f
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      Adds API docs and usage examples for the 3 `createDataset` calls in `SparkSession`
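      
      A short usage sketch of two of the documented calls (assuming a `SparkSession` named `spark` and `import spark.implicits._` in scope for the encoders):
      
      ```scala
      // From a local Seq
      val ds1 = spark.createDataset(Seq(1, 2, 3))
      
      // From an RDD
      val ds2 = spark.createDataset(spark.sparkContext.parallelize(Seq("a", "b")))
      ```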
      
      ## How was this patch tested?
      
      N/A
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #13345 from sameeragarwal/dataset-doc.
      635fb30f
    • Dongjoon Hyun's avatar
      [SPARK-15584][SQL] Abstract duplicate code: `spark.sql.sources.` properties · 4538443e
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR replaces `spark.sql.sources.` strings with `CreateDataSourceTableUtils.*` constant variables.
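      
      The pattern, sketched with illustrative constant names (not necessarily the ones defined in `CreateDataSourceTableUtils`):
      
      ```scala
      // Magic strings hoisted into named constants so call sites cannot drift apart.
      object DataSourceProps {
        val DATASOURCE_PREFIX = "spark.sql.sources."
        val DATASOURCE_PROVIDER = DATASOURCE_PREFIX + "provider"
        val DATASOURCE_SCHEMA = DATASOURCE_PREFIX + "schema"
      }
      ```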
      
      ## How was this patch tested?
      
      Pass the existing Jenkins tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13349 from dongjoon-hyun/SPARK-15584.
      4538443e
    • Dongjoon Hyun's avatar
      [SPARK-15603][MLLIB] Replace SQLContext with SparkSession in ML/MLLib · d24e2515
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR replaces all deprecated `SQLContext` occurrences with `SparkSession` in `ML/MLLib` module except the following two classes. These two classes use `SQLContext` in their function signatures.
      - ReadWrite.scala
      - TreeModels.scala
      
      ## How was this patch tested?
      
      Pass the existing Jenkins tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13352 from dongjoon-hyun/SPARK-15603.
      d24e2515
    • gatorsmile's avatar
      [SPARK-15565][SQL] Add the File Scheme to the Default Value of WAREHOUSE_PATH · c1727290
      gatorsmile authored
      #### What changes were proposed in this pull request?
      The default value of `spark.sql.warehouse.dir` is `System.getProperty("user.dir")/spark-warehouse`. Since `System.getProperty("user.dir")` is a local directory, we should explicitly set the scheme to the local filesystem.
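      
      A sketch of the described change (assuming the default is still derived from `user.dir`):
      
      ```scala
      // Qualify the default warehouse location with an explicit local-FS scheme so
      // it is not resolved against fs.defaultFS (e.g. HDFS) on Hadoop clusters.
      val defaultWarehousePath = "file:" + System.getProperty("user.dir") + "/spark-warehouse"
      ```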
      
      cc yhuai
      
      #### How was this patch tested?
      Added two test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #13348 from gatorsmile/addSchemeToDefaultWarehousePath.
      c1727290
    • Xin Wu's avatar
      [SPARK-15431][SQL][HOTFIX] ignore 'list' command testcase from CliSuite for now · 6f95c6c0
      Xin Wu authored
      ## What changes were proposed in this pull request?
      The test cases for the `list` command added to `CliSuite` by PR #13212 cannot run in some Jenkins jobs after being merged.
      However, some Jenkins jobs pass:
      https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/
      https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/
      https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.2/
      https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/
      https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/
      
      Other jobs failed on this test case, but the failures occur at slightly different checkpoints across jobs. So it seems that `CliSuite`'s output capture is flaky when checking `list` commands for expected output. Test cases in `HiveQuerySuite` and `SparkContextSuite` already cover these cases, so I am ignoring the 2 test cases added by PR #13212.
      
      Author: Xin Wu <xinwu@us.ibm.com>
      
      Closes #13276 from xwu0226/SPARK-15431-clisuite.
      6f95c6c0
    • gatorsmile's avatar
      [SPARK-15529][SQL] Replace SQLContext and HiveContext with SparkSession in Test · d5911d11
      gatorsmile authored
      #### What changes were proposed in this pull request?
      This PR uses the new entry point `SparkSession` to replace the existing `SQLContext` and `HiveContext` in the SQL test suites.
      
      No change is made in the following suites:
      - `ListTablesSuite` tests the APIs of `SQLContext`.
      - `SQLContextSuite` tests `SQLContext`.
      - `HiveContextCompatibilitySuite` tests `HiveContext`.
      
      **Update**: Moved the tests in `ListTablesSuite` to `SQLContextSuite`.
      
      #### How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #13337 from gatorsmile/sparkSessionTest.
      d5911d11
    • Zheng RuiFeng's avatar
      [MINOR] Fix Typos 'a -> an' · 6b1a6180
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      
      `a` -> `an`
      
      I use regex to generate potential error lines:
      `grep -in ' a [aeiou]' mllib/src/main/scala/org/apache/spark/ml/*/*scala`
      and review them line by line.
      
      ## How was this patch tested?
      
      local build
      `lint-java` checking
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #13317 from zhengruifeng/a_an.
      6b1a6180
    • Joseph K. Bradley's avatar
      [MINOR][CORE] Fixed doc for Accumulator2.add · ee3609a2
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      The Scala doc used the outdated ```+=```. Replaced it with ```add```.
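      
      For reference, the documented usage with a built-in long accumulator (assuming a `SparkSession` named `spark`):
      
      ```scala
      val acc = spark.sparkContext.longAccumulator("my counter")
      
      // add() is the supported mutation method; += was the outdated doc example.
      spark.sparkContext.parallelize(1 to 100).foreach(_ => acc.add(1L))
      ```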
      
      ## How was this patch tested?
      
      N/A
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #13346 from jkbradley/accum-doc.
      ee3609a2
  2. May 26, 2016
    • felixcheung's avatar
      [SPARK-10903] followup - update API doc for SqlContext · c8288323
      felixcheung authored
      ## What changes were proposed in this pull request?
      
      Follow-up on the earlier PR: here we fix up the roxygen2 doc examples.
      Also adds to the migration section of the programming guide.
      
      ## How was this patch tested?
      
      SparkR tests
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #13340 from felixcheung/sqlcontextdoc.
      c8288323
    • hyukjinkwon's avatar
      [SPARK-8603][SPARKR] Use shell() instead of system2() for SparkR on Windows · 1c403733
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR corrects SparkR to use `shell()` instead of `system2()` on Windows.
      
      Using `system2(...)` on Windows does not handle the Windows file separator `\`. `shell(translate = TRUE, ...)` handles this problem, so SparkR now chooses between them according to the OS.
      
      Existing tests failed on Windows due to this problem. For example, these failed:
      
        ```
      8. Failure: sparkJars tag in SparkContext (test_includeJAR.R#34)
      9. Failure: sparkJars tag in SparkContext (test_includeJAR.R#36)
      ```
      
      The cases above were due to the use of `system2`.
      
      In addition, this PR also fixes some tests failed on Windows.
      
        ```
      5. Failure: sparkJars sparkPackages as comma-separated strings (test_context.R#128)
      6. Failure: sparkJars sparkPackages as comma-separated strings (test_context.R#131)
      7. Failure: sparkJars sparkPackages as comma-separated strings (test_context.R#134)
      ```
      
        The cases above were due to odd behaviour of `normalizePath()`: on Linux, if the path does not exist, it simply returns the input, but on Windows it returns the input resolved against the current directory.
      
        ```r
      # On Linux
      path <- normalizePath("aa")
      print(path)
      [1] "aa"
      
      # On Windows
      path <- normalizePath("aa")
      print(path)
      [1] "C:\\Users\\aa"
      ```
      
      ## How was this patch tested?
      
      Jenkins tests, and manually tested on a Windows machine as below:
      
      Here is the [stdout](https://gist.github.com/HyukjinKwon/4bf35184f3a30f3bce987a58ec2bbbab) of testing.
      
      Closes #7025
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      Author: Hyukjin Kwon <gurwls223@gmail.com>
      Author: Prakash PC <prakash.chinnu@gmail.com>
      
      Closes #13165 from HyukjinKwon/pr/7025.
      1c403733
    • Andrew Or's avatar
      [SPARK-15583][SQL] Disallow altering datasource properties · 3fca635b
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Certain table properties (and SerDe properties) are in the protected namespace `spark.sql.sources.`, which we use internally for datasource tables. The user should not be allowed to:
      
      (1) Create a Hive table setting these properties
      (2) Alter these properties in an existing table
      
      Previously, we threw an exception if the user tried to alter the properties of an existing datasource table. However, this is overly restrictive for datasource tables and does not do anything for Hive tables.
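      
      A sketch of the kind of validation this implies (hypothetical helper, not the PR's actual code):
      
      ```scala
      // Reject user-supplied table properties in the reserved namespace.
      val ReservedPrefix = "spark.sql.sources."
      
      def checkTableProperties(props: Map[String, String]): Unit = {
        val reserved = props.keys.filter(_.startsWith(ReservedPrefix))
        if (reserved.nonEmpty) {
          throw new IllegalArgumentException(
            s"Cannot set reserved table properties: ${reserved.mkString(", ")}")
        }
      }
      ```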
      
      ## How was this patch tested?
      
      DDLSuite
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #13341 from andrewor14/alter-table-props.
      3fca635b
    • Xin Ren's avatar
      [SPARK-15542][SPARKR] Make error message clear for script './R/install-dev.sh'... · 6ab973ec
      Xin Ren authored
      [SPARK-15542][SPARKR] Make error message clear for script './R/install-dev.sh' when R is missing on Mac
      
      https://issues.apache.org/jira/browse/SPARK-15542
      
      ## What changes were proposed in this pull request?
      
      When running `./R/install-dev.sh` in a **Mac OS X El Capitan** environment, I got
      ```
      mbp185-xr:spark xin$ ./R/install-dev.sh
      usage: dirname path
      ```
      This message was very confusing to me. I then found that R was not properly configured on my Mac; the script uses `$(which R)` to locate the R home.
      
      I tried a similar situation on CentOS with R missing, and it gives a very clear error message, while macOS does not.
      On CentOS:
      ```
      [rootip-xxx-31-9-xx spark]# which R
      /usr/bin/which: no R in (/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/lib/jvm/java-1.7.0-openjdk.x86_64/bin:/root/bin)
      ```
      but on Mac, if R is not found, nothing is returned, which causes the confusing failure when building R and running `R/install-dev.sh`:
      ```
      mbp185-xr:spark xin$ which R
      mbp185-xr:spark xin$
      ```
      
      Here I just added a clear message for this R misconfiguration when running `R/install-dev.sh`.
      ```
      mbp185-xr:spark xin$ ./R/install-dev.sh
      Cannot find R home by running 'which R', please make sure R is properly installed.
      ```
      
      ## How was this patch tested?
      Manually tested on local machine.
      
      Author: Xin Ren <iamshrek@126.com>
      
      Closes #13308 from keypointt/SPARK-15542.
      6ab973ec
    • Andrew Or's avatar
      [SPARK-15538][SPARK-15539][SQL] Truncate table fixes round 2 · 008a5377
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Two more changes:
      (1) Fix truncate table for data source tables (only for cases without `PARTITION`)
      (2) Disallow truncating external tables or views
      
      ## How was this patch tested?
      
      `DDLSuite`
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #13315 from andrewor14/truncate-table.
      008a5377
    • Yin Huai's avatar
      [SPARK-15532][SQL] SQLContext/HiveContext's public constructors should use... · 3ac2363d
      Yin Huai authored
      [SPARK-15532][SQL] SQLContext/HiveContext's public constructors should use SparkSession.build.getOrCreate
      
      ## What changes were proposed in this pull request?
      This PR changes SQLContext/HiveContext's public constructor to use SparkSession.build.getOrCreate and removes isRootContext from SQLContext.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #13310 from yhuai/SPARK-15532.
      3ac2363d
    • Cheng Lian's avatar
      [SPARK-15550][SQL] Dataset.show() should show contents nested products as rows · e7082cae
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR addresses two related issues:
      
      1. `Dataset.showString()` should show case classes/Java beans at all levels as rows, while master code only handles top level ones.
      
      2. `Dataset.showString()` should show the full contents produced by the underlying query plan
      
         Dataset is only a view of the underlying query plan. Columns not referenced by the encoder are still reachable using methods like `Dataset.col`. So it probably makes more sense to show the full contents of the query plan.
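      
      A small example of the behaviour being fixed (output shape approximate, assuming `import spark.implicits._`):
      
      ```scala
      case class Inner(i: Int, j: String)
      case class Outer(id: Int, inner: Inner)
      
      Seq(Outer(1, Inner(2, "a"))).toDS().show()
      // After this change, nested products render as rows too, roughly:
      // +---+-----+
      // | id|inner|
      // +---+-----+
      // |  1|[2,a]|
      // +---+-----+
      ```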
      
      ## How was this patch tested?
      
      Two new test cases are added in `DatasetSuite` to check `.showString()` output.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #13331 from liancheng/spark-15550-ds-show.
      e7082cae
    • Sameer Agarwal's avatar
      [SPARK-8428][SPARK-13850] Fix integer overflows in TimSort · fe6de16f
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This patch fixes a few integer overflows in `UnsafeSortDataFormat.copyRange()` and `ShuffleSortDataFormat.copyRange()` that seem to be the most likely cause behind a number of `TimSort` contract violation errors seen in Spark 2.0 and Spark 1.6 while sorting large datasets.
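      
      A minimal illustration of this class of overflow (not the project's code): an element count multiplied by a per-record width in 32-bit `Int` arithmetic wraps past 2^31.
      
      ```scala
      val numRecords = 300000000 // ~300 million records
      val wordsPerRecord = 8
      
      val badOffset: Int = numRecords * wordsPerRecord           // wraps to -1894967296
      val goodOffset: Long = numRecords.toLong * wordsPerRecord  // 2400000000, correct
      ```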
      
      ## How was this patch tested?
      
      Added a test in `ExternalSorterSuite` that instantiates a large array of the form [150000000, 150000001, 150000002, ..., 300000000, 0, 1, 2, ..., 149999999] that triggers a `copyRange` in `TimSort.mergeLo` or `TimSort.mergeHi`. Note that the input dataset should contain at least 268.43 million rows with a certain data distribution for an overflow to occur.
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #13336 from sameeragarwal/timsort-bug.
      fe6de16f
    • Sean Zhong's avatar
      [SPARK-13445][SQL] Improves error message and add test coverage for Window function · b5859e0b
      Sean Zhong authored
      ## What changes were proposed in this pull request?
      
      Add a more verbose error message for when the ORDER BY clause is missing while using a window function.
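      
      An example of the situation the improved message targets (assuming a DataFrame `df` with a column `k`): a ranking function over a window spec with no ordering.
      
      ```scala
      import org.apache.spark.sql.expressions.Window
      import org.apache.spark.sql.functions.rank
      
      // rank() requires the window to be ordered; without ORDER BY this fails with
      // an AnalysisException, which this patch makes more descriptive.
      val windowed = df.withColumn("r", rank().over(Window.partitionBy("k")))
      ```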
      
      ## How was this patch tested?
      
      Unit test.
      
      Author: Sean Zhong <seanzhong@databricks.com>
      
      Closes #13333 from clockfly/spark-13445.
      b5859e0b
    • Sean Owen's avatar
      [SPARK-15457][MLLIB][ML] Eliminate some warnings from MLlib about deprecations · b0a03fee
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Several classes and methods have been deprecated and are creating lots of build warnings in branch-2.0. This issue is to identify and fix those items:
      * WithSGD classes: Change to make class not deprecated, object deprecated, and public class constructor deprecated. Any public use will require a deprecated API. We need to keep a non-deprecated private API since we cannot eliminate certain uses: Python API, streaming algs, and examples.
        * Use in PythonMLlibAPI: Change to using private constructors
        * Streaming algs: No warnings after we un-deprecate the classes
        * Examples: Deprecate or change ones which use deprecated APIs
      * MulticlassMetrics fields (precision, etc.)
      * LinearRegressionSummary.model field
      
      ## How was this patch tested?
      
      Existing tests.  Checked for warnings manually.
      
      Author: Sean Owen <sowen@cloudera.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #13314 from jkbradley/warning-cleanups.
      b0a03fee
    • Reynold Xin's avatar
      [SPARK-15552][SQL] Remove unnecessary private[sql] methods in SparkSession · 0f61d6ef
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      SparkSession has a list of unnecessary private[sql] methods. These methods cause some trouble because private[sql] doesn't apply in Java. In the cases where they are easy to remove, we can simply remove them. This patch does that.
      
      As part of this pull request, I also replaced a bunch of protected[sql] with private[sql], to tighten up visibility.
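      
      The underlying reason, sketched (the class here is hypothetical): Scala's package-private qualifier compiles to a public method in JVM bytecode, so Java callers can still see and call it.
      
      ```scala
      package org.apache.spark.sql
      
      class Example {
        // Visible only to Scala code in the sql package, but it compiles to a
        // public JVM method, so Java code is not restricted by it.
        private[sql] def internalHelper(): Int = 42
      }
      ```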
      
      ## How was this patch tested?
      Updated test cases to reflect the changes.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13319 from rxin/SPARK-15552.
      0f61d6ef
    • Eric Liang's avatar
      [SPARK-15520][SQL] Also set sparkContext confs when using SparkSession builder in pyspark · 594a1bf2
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      Also sets confs in the underlying SparkContext when using `SparkSession.builder.getOrCreate()`. This is a bug fix from a post-merge comment in https://github.com/apache/spark/pull/13289
      
      ## How was this patch tested?
      
      Python doc-tests.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #13309 from ericl/spark-15520-1.
      594a1bf2
    • Andrew Or's avatar
      [SPARK-15539][SQL] DROP TABLE throw exception if table doesn't exist · 2b1ac6ce
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Same as #13302, but for DROP TABLE.
      
      ## How was this patch tested?
      
      `DDLSuite`
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #13307 from andrewor14/drop-table.
      2b1ac6ce
    • Steve Loughran's avatar
      [SPARK-13148][YARN] document zero-keytab Oozie application launch; add diagnostics · 01b350a4
      Steve Loughran authored
      This patch provides detail on what to do for keytabless Oozie launches of Spark apps, and adds some debug-level diagnostics of what credentials have been submitted.
      
      Author: Steve Loughran <stevel@hortonworks.com>
      Author: Steve Loughran <stevel@apache.org>
      
      Closes #11033 from steveloughran/stevel/feature/SPARK-13148-oozie.
      01b350a4
    • felixcheung's avatar
      [SPARK-10903][SPARKR] R - Simplify SQLContext method signatures and use a singleton · c76457c8
      felixcheung authored
      Eliminates the need to pass sqlContext to methods, since it is a singleton - and we don't want to support multiple contexts in an R session.
      
      Changes are done in a backward-compatible way, with a deprecation warning added. Method signatures for S3 methods are added in a concise, clean approach such that in the next release the deprecated signatures can be taken out easily/cleanly (just delete a few lines per method).
      
      Custom method dispatch is implemented to allow for multiple JVM reference types that are all 'jobj' in R and to avoid having to add 30 new exports.
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #9192 from felixcheung/rsqlcontext.
      c76457c8
    • Villu Ruusmann's avatar
      [SPARK-15523][ML][MLLIB] Update JPMML to 1.2.15 · 6d506c9a
      Villu Ruusmann authored
      ## What changes were proposed in this pull request?
      
      See https://issues.apache.org/jira/browse/SPARK-15523
      
      This PR replaces PR #13293. It's isolated to a new branch, and contains some more squashed changes.
      
      ## How was this patch tested?
      
      1. Executed `mvn clean package` in `mllib` directory
      2. Executed `dev/test-dependencies.sh --replace-manifest` in the root directory.
      
      Author: Villu Ruusmann <villu.ruusmann@gmail.com>
      
      Closes #13297 from vruusmann/update-jpmml.
      6d506c9a
    • wm624@hotmail.com's avatar
      [SPARK-15492][ML][DOC] Binarization scala example copy & paste to spark-shell error · e451f7f0
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      The Binarization Scala example contains `val dataFrame: DataFrame = spark.createDataFrame(data).toDF("label", "feature")`, which can't be pasted into spark-shell because `DataFrame` is not imported. Compared with other examples, this explicit type annotation is not required.
      
      So I removed `DataFrame` from the code.
      ## How was this patch tested?
      
      Manually tested.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #13266 from wangmiao1981/unit.
      e451f7f0
    • Bo Meng's avatar
      [SPARK-15537][SQL] fix dir delete issue · 53d4abe9
      Bo Meng authored
      ## What changes were proposed in this pull request?
      
      Some of the test cases, e.g. `OrcSourceSuite`, create temp folders and temp files inside them, but after the tests finish, the folders are not removed. This leaves lots of temp files behind and occupies disk space if we keep running the test cases.
      
      The reason is that `dir.delete()` won't work if `dir` is not empty. We need to recursively delete the contents before deleting the folder.
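      
      A minimal sketch of the recursive deletion pattern (not the project's actual utility):
      
      ```scala
      import java.io.File
      
      def deleteRecursively(f: File): Unit = {
        if (f.isDirectory) {
          // listFiles() returns null for non-directories or on I/O error
          Option(f.listFiles()).toSeq.flatten.foreach(deleteRecursively)
        }
        f.delete() // a directory can only be deleted once its contents are gone
      }
      ```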
      
      ## How was this patch tested?
      
      Manually checked the temp folder to make sure the temp files were deleted.
      
      Author: Bo Meng <mengbo@hotmail.com>
      
      Closes #13304 from bomeng/SPARK-15537.
      53d4abe9
    • Reynold Xin's avatar
      [SPARK-15543][SQL] Rename DefaultSources to make them more self-describing · 361ebc28
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch renames various DefaultSources to make their names more self-describing. The choice of "DefaultSource" was from the days when we did not have a good way to specify short names.
      
      They are now named:
      - LibSVMFileFormat
      - CSVFileFormat
      - JdbcRelationProvider
      - JsonFileFormat
      - ParquetFileFormat
      - TextFileFormat
      
      Backward compatibility is maintained through aliasing.
      
      ## How was this patch tested?
      Updated relevant test cases too.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13311 from rxin/SPARK-15543.
      361ebc28
    • Imran Rashid's avatar
      [SPARK-10372] [CORE] basic test framework for entire spark scheduler · dfc9fc02
      Imran Rashid authored
      This is a basic framework for testing the entire scheduler. The tests this adds aren't very interesting -- the point of this PR is just to set up the framework, to keep the initial change small, but it can be built upon to test more features (e.g., speculation, killing tasks, blacklisting, etc.).
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #8559 from squito/SPARK-10372-scheduler-integs.
      dfc9fc02
  3. May 25, 2016
    • wm624@hotmail.com's avatar
      [SPARK-15439][SPARKR] Failed to run unit test in SparkR · 06bae8af
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      There are some failures when running the SparkR unit tests.
      In this PR, I fixed two of these failures, in test_context.R and test_sparkSQL.R.
      The first one is due to a different masked name; I added the missing names to the expected arrays.
      The second one is because one PR removed the logic of a previous fix for the missing subset method.
      
      The file permission issue is still there; I am debugging it. The SparkR shell can run the test case successfully:
      test_that("pipeRDD() on RDDs", {
        actual <- collect(pipeRDD(rdd, "more"))
      When using the run-tests script, it complains that the directory does not exist:
      cannot open file '/tmp/Rtmp4FQbah/filee2273f9d47f7': No such file or directory
      
      ## How was this patch tested?
      
      Manually tested.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #13284 from wangmiao1981/R.
      06bae8af
    • Sameer Agarwal's avatar
      [SPARK-15533][SQL] Deprecate Dataset.explode · 06ed1fa3
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This patch deprecates `Dataset.explode` and documents appropriate workarounds to use `flatMap()` or `functions.explode()` instead.
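      
      The documented replacements, sketched (assuming a Dataset `ds` with a string column `line` and `import spark.implicits._` in scope):
      
      ```scala
      import org.apache.spark.sql.functions.{explode, split}
      
      // Untyped, SQL-style replacement:
      val words1 = ds.select(explode(split($"line", " ")).as("word"))
      
      // Typed replacement via flatMap:
      val words2 = ds.as[String].flatMap(_.split(" "))
      ```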
      
      ## How was this patch tested?
      
      N/A
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #13312 from sameeragarwal/deprecate.
      06ed1fa3
    • Herman van Hovell's avatar
      [SPARK-15525][SQL][BUILD] Upgrade ANTLR4 SBT plugin · 527499b6
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      The ANTLR4 SBT plugin has been moved from its own repo to one on Bintray. The version was also changed from `0.7.10` to `0.7.11`. The latter actually broke our build (ihji has fixed this by also adding `0.7.10` and others to the Bintray repo).
      
      This PR upgrades the SBT-ANTLR4 plugin and ANTLR4 to their most recent versions (`0.7.11`/`4.5.3`). I have also removed a few obsolete build configurations.
      
      ## How was this patch tested?
      Manually running SBT/Maven builds.
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #13299 from hvanhovell/SPARK-15525.
      527499b6
    • Andrew Or's avatar
      [SPARK-15534][SPARK-15535][SQL] Truncate table fixes · ee682fe2
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Two changes:
      - When things fail, `TRUNCATE TABLE` just returns nothing. Instead, we should throw exceptions.
      - Remove `TRUNCATE TABLE ... COLUMN`, which was never supported by either Spark or Hive.
      
      ## How was this patch tested?
      Jenkins.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #13302 from andrewor14/truncate-table.
      ee682fe2
    • Gio Borje's avatar
      Log warnings for numIterations * miniBatchFraction < 1.0 · 589cce93
      Gio Borje authored
      ## What changes were proposed in this pull request?
      
      Add a warning log for the case that `numIterations * miniBatchFraction < 1.0` during gradient descent. If the product of those two numbers is less than `1.0`, then not all training examples will be used during optimization. To put this concretely, suppose that `numExamples = 100`, `miniBatchFraction = 0.2` and `numIterations = 3`. Then, 3 iterations will occur, each sampling approximately 20 examples. In the best case the samples are all distinct, so at most 60/100 examples are used.
      
      This may be counter-intuitive to most users and led to an issue during the development of another Spark ML model: https://github.com/zhengruifeng/spark-libFM/issues/11. If a user actually does not require the full training data set, it would be easier and more intuitive to use `RDD.sample`.
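      
      The condition itself is simple arithmetic; a sketch of the warning check:
      
      ```scala
      val numIterations = 3
      val miniBatchFraction = 0.2
      
      // Upper bound on the fraction of the data ever sampled across all iterations.
      if (numIterations * miniBatchFraction < 1.0) {
        println("Warning: not all training examples will be used " +
          "because numIterations * miniBatchFraction < 1.0")
      }
      ```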
      
      ## How was this patch tested?
      
      `build/mvn -DskipTests clean package` build succeeds
      
      Author: Gio Borje <gborje@linkedin.com>
      
      Closes #13265 from Hydrotoast/master.
      589cce93
    • Bryan Cutler's avatar
      [MINOR] [PYSPARK] [EXAMPLES] Changed examples to use SparkSession.sparkContext instead of _sc · 9c297df3
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      
      Some PySpark examples need a SparkContext and get it by accessing _sc directly from the session.  These examples should use the provided property `sparkContext` in `SparkSession` instead.
      
      ## How was this patch tested?
      Ran modified examples
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #13303 from BryanCutler/pyspark-session-sparkContext-MINOR.
      9c297df3
    • Takuya UESHIN's avatar
      [SPARK-14269][SCHEDULER] Eliminate unnecessary submitStage() call. · 698ef762
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      Currently, the `submitStage()` method for waiting stages is called on every iteration of the event loop in `DAGScheduler` to submit all waiting stages, but most of these calls are unnecessary because they are unrelated to any change in stage status.
      The only case in which we should try to submit waiting stages is when their parent stages have successfully completed.
      
      This elimination can improve `DAGScheduler` performance.
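      
      A rough sketch of the idea (hypothetical shape, not the `DAGScheduler`'s actual code): waiting stages are submitted from the completion event of their parents, instead of re-scanning all waiting stages on every event.
      
      ```scala
      case class Stage(id: Int, parents: Seq[Stage])
      
      val waitingStages = scala.collection.mutable.Set[Stage]()
      val completedIds = scala.collection.mutable.Set[Int]()
      
      def submitStage(s: Stage): Unit = { /* hand the stage to the task scheduler */ }
      
      // Called only when `stage` finishes, not on every event-loop iteration.
      def onStageCompleted(stage: Stage): Unit = {
        completedIds += stage.id
        val ready = waitingStages.filter(_.parents.forall(p => completedIds.contains(p.id)))
        waitingStages --= ready
        ready.foreach(submitStage)
      }
      ```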
      
      ## How was this patch tested?
      
      Added some checks and other existing tests, and our projects.
      
      We have a project bottle-necked by `DAGScheduler`, having about 2000 stages.
      
      Before this patch the almost all execution time in `Driver` process was spent to process `submitStage()` of `dag-scheduler-event-loop` thread but after this patch the performance was improved as follows:
      
      |        | total execution time | `dag-scheduler-event-loop` thread time | `submitStage()` |
      |--------|---------------------:|---------------------------------------:|----------------:|
      | Before |              760 sec |                                710 sec |         667 sec |
      | After  |              440 sec |                                 14 sec |          10 sec |
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #12060 from ueshin/issues/SPARK-14269.
      698ef762
    • Jurriaan Pruis's avatar
      [SPARK-15493][SQL] default QuoteEscapingEnabled flag to true when writing CSV · c875d81a
      Jurriaan Pruis authored
      ## What changes were proposed in this pull request?
      
      Default QuoteEscapingEnabled flag to true when writing CSV and add an escapeQuotes option to be able to change this.
      
      See https://github.com/uniVocity/univocity-parsers/blob/f3eb2af26374940e60d91d1703bde54619f50c51/src/main/java/com/univocity/parsers/csv/CsvWriterSettings.java#L231-L247
      
      This change is needed to be able to write RFC 4180 compatible CSV files (https://tools.ietf.org/html/rfc4180#section-2)
      
      https://issues.apache.org/jira/browse/SPARK-15493
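      
      A usage sketch of the new option (assuming a DataFrame `df`; quote escaping is now on by default, and the option lets you opt out):
      
      ```scala
      df.write
        .format("csv")
        .option("escapeQuotes", "false") // the default is now true
        .save("/tmp/out")
      ```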
      
      ## How was this patch tested?
      
      Added a test that verifies the output is quoted correctly.
      
      Author: Jurriaan Pruis <email@jurriaanpruis.nl>
      
      Closes #13267 from jurriaan/quote-escaping.
      c875d81a
    • Takuya UESHIN's avatar
      [SPARK-15483][SQL] IncrementalExecution should use extra strategies. · 4b880674
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      Extra strategies do not work for streams because `IncrementalExecution` uses a modified planner with stateful operations, and that planner does not include the extra strategies.
      
      This PR fixes `IncrementalExecution` to include the extra strategies so that they are used.
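      
      For context, a sketch of how an extra strategy is registered (assuming a `SparkSession` named `spark`; the strategy body here is a placeholder):
      
      ```scala
      import org.apache.spark.sql.Strategy
      import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
      import org.apache.spark.sql.execution.SparkPlan
      
      // A placeholder strategy that never matches; real strategies pattern-match
      // on specific logical plans and return physical plans.
      object MyStrategy extends Strategy {
        def apply(plan: LogicalPlan): Seq[SparkPlan] = Nil
      }
      
      spark.experimental.extraStrategies = Seq(MyStrategy)
      ```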
      
      ## How was this patch tested?
      
      I added a test to check if extra strategies work for streams.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #13261 from ueshin/issues/SPARK-15483.
      4b880674