  1. May 26, 2016
    • hyukjinkwon's avatar
      [SPARK-8603][SPARKR] Use shell() instead of system2() for SparkR on Windows · 1c403733
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR corrects SparkR to use `shell()` instead of `system2()` on Windows.
      
Using `system2(...)` on Windows does not handle the Windows file separator `\`. `shell(translate = TRUE, ...)` handles this correctly, so the call is now chosen according to the OS.
      
Existing tests failed on Windows due to this problem. For example, the following failed:
      
        ```
      8. Failure: sparkJars tag in SparkContext (test_includeJAR.R#34)
      9. Failure: sparkJars tag in SparkContext (test_includeJAR.R#36)
      ```
      
The failures above were due to the use of `system2`.
      
In addition, this PR also fixes some other tests that failed on Windows:
      
        ```
      5. Failure: sparkJars sparkPackages as comma-separated strings (test_context.R#128)
      6. Failure: sparkJars sparkPackages as comma-separated strings (test_context.R#131)
      7. Failure: sparkJars sparkPackages as comma-separated strings (test_context.R#134)
      ```
      
The failures above were due to the differing behaviour of `normalizePath()`. On Linux, if the path does not exist, it simply returns the input, but on Windows it returns the input resolved against the current directory.
      
        ```r
# On Linux
      path <- normalizePath("aa")
      print(path)
      [1] "aa"
      
      # On Windows
      path <- normalizePath("aa")
      print(path)
      [1] "C:\\Users\\aa"
      ```
      
      ## How was this patch tested?
      
Jenkins tests, and manual testing on a Windows machine as described below:
      
      Here is the [stdout](https://gist.github.com/HyukjinKwon/4bf35184f3a30f3bce987a58ec2bbbab) of testing.
      
      Closes #7025
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      Author: Hyukjin Kwon <gurwls223@gmail.com>
      Author: Prakash PC <prakash.chinnu@gmail.com>
      
      Closes #13165 from HyukjinKwon/pr/7025.
      1c403733
    • Andrew Or's avatar
      [SPARK-15583][SQL] Disallow altering datasource properties · 3fca635b
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Certain table properties (and SerDe properties) are in the protected namespace `spark.sql.sources.`, which we use internally for datasource tables. The user should not be allowed to
      
      (1) Create a Hive table setting these properties
      (2) Alter these properties in an existing table
      
      Previously, we threw an exception if the user tried to alter the properties of an existing datasource table. However, this is overly restrictive for datasource tables and does not do anything for Hive tables.
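
For illustration, a rough sketch (table and property names here are only examples, not taken from the patch) of the kind of statements that should now be rejected, from a Spark 2.0 spark-shell session:

```scala
// Hypothetical illustration: the spark.sql.sources.* namespace is reserved for
// internal datasource bookkeeping, so both of these should now fail.
spark.sql("CREATE TABLE t (a INT) TBLPROPERTIES ('spark.sql.sources.provider' = 'parquet')")
spark.sql("ALTER TABLE t SET TBLPROPERTIES ('spark.sql.sources.schema.numParts' = '1')")
```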
      
      ## How was this patch tested?
      
      DDLSuite
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #13341 from andrewor14/alter-table-props.
      3fca635b
    • Xin Ren's avatar
      [SPARK-15542][SPARKR] Make error message clear for script './R/install-dev.sh'... · 6ab973ec
      Xin Ren authored
      [SPARK-15542][SPARKR] Make error message clear for script './R/install-dev.sh' when R is missing on Mac
      
      https://issues.apache.org/jira/browse/SPARK-15542
      
      ## What changes were proposed in this pull request?
      
When running `./R/install-dev.sh` in a **Mac OS El Capitan** environment, I got:
      ```
      mbp185-xr:spark xin$ ./R/install-dev.sh
      usage: dirname path
      ```
This message was very confusing to me. I then found that R was not properly configured on my Mac, and that this script uses `$(which R)` to get the R home.
      
I tried a similar situation on CentOS with R missing, and it gives a very clear error message, while macOS does not.
      on CentOS:
      ```
[root@ip-xxx-31-9-xx spark]# which R
      /usr/bin/which: no R in (/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/lib/jvm/java-1.7.0-openjdk.x86_64/bin:/root/bin)
      ```
But on Mac, if R is not found, nothing is returned, which causes the confusing message when building R and running `R/install-dev.sh`:
      ```
      mbp185-xr:spark xin$ which R
      mbp185-xr:spark xin$
      ```
      
Here I just added a clear message for this misconfiguration of R when running `R/install-dev.sh`:
      ```
      mbp185-xr:spark xin$ ./R/install-dev.sh
      Cannot find R home by running 'which R', please make sure R is properly installed.
      ```
      
      ## How was this patch tested?
      Manually tested on local machine.
      
      Author: Xin Ren <iamshrek@126.com>
      
      Closes #13308 from keypointt/SPARK-15542.
      6ab973ec
    • Andrew Or's avatar
      [SPARK-15538][SPARK-15539][SQL] Truncate table fixes round 2 · 008a5377
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Two more changes:
      (1) Fix truncate table for data source tables (only for cases without `PARTITION`)
      (2) Disallow truncating external tables or views
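
As a rough sketch of the intended behaviour (table names are hypothetical, not from the patch):

```scala
// Hypothetical illustration in a spark-shell session:
spark.sql("TRUNCATE TABLE managed_datasource_table")  // now actually clears the table's data
spark.sql("TRUNCATE TABLE some_external_table")       // now fails instead of silently doing nothing
spark.sql("TRUNCATE TABLE some_view")                 // views can no longer be truncated either
```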
      
      ## How was this patch tested?
      
      `DDLSuite`
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #13315 from andrewor14/truncate-table.
      008a5377
    • Yin Huai's avatar
      [SPARK-15532][SQL] SQLContext/HiveContext's public constructors should use... · 3ac2363d
      Yin Huai authored
[SPARK-15532][SQL] SQLContext/HiveContext's public constructors should use SparkSession.builder.getOrCreate
      
      ## What changes were proposed in this pull request?
This PR changes SQLContext/HiveContext's public constructors to use SparkSession.builder.getOrCreate and removes isRootContext from SQLContext.
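
For context, a small sketch (not the patched code itself) of the Spark 2.0 entry points involved:

```scala
import org.apache.spark.sql.SparkSession

// SparkSession is the primary entry point; the SQLContext/HiveContext constructors
// now go through the same builder path, so legacy code shares one underlying session.
val spark = SparkSession.builder()
  .appName("example")
  .master("local[*]")
  .getOrCreate()

val sqlContext = spark.sqlContext  // SQLContext view of the same session, for legacy APIs
```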
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #13310 from yhuai/SPARK-15532.
      3ac2363d
    • Cheng Lian's avatar
      [SPARK-15550][SQL] Dataset.show() should show contents nested products as rows · e7082cae
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR addresses two related issues:
      
1. `Dataset.showString()` should show case classes/Java beans at all levels as rows, while the current master code only handles top-level ones.
      
2. `Dataset.showString()` should show the full contents produced by the underlying query plan.
      
   A Dataset is only a view of the underlying query plan. Columns not referenced by the encoder are still reachable using methods like `Dataset.col`, so it probably makes more sense to show the full contents of the query plan.
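
A small sketch of the user-visible effect (the case classes and data here are made up, and the rendered output is approximate):

```scala
import spark.implicits._

// Hypothetical nested case classes; after this change, the nested value is rendered
// as a nested row (e.g. [10,a]) at every level instead of relying on its toString.
case class Inner(x: Int, y: String)
case class Outer(id: Int, inner: Inner)

val ds = spark.createDataset(Seq(Outer(1, Inner(10, "a")), Outer(2, Inner(20, "b"))))
ds.show()
```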
      
      ## How was this patch tested?
      
      Two new test cases are added in `DatasetSuite` to check `.showString()` output.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #13331 from liancheng/spark-15550-ds-show.
      e7082cae
    • Sameer Agarwal's avatar
      [SPARK-8428][SPARK-13850] Fix integer overflows in TimSort · fe6de16f
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
This patch fixes a few integer overflows in `UnsafeSortDataFormat.copyRange()` and `ShuffleSortDataFormat.copyRange()` that seem to be the most likely cause of a number of `TimSort` contract violation errors seen in Spark 2.0 and Spark 1.6 while sorting large datasets.
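
To illustrate the class of bug (simplified arithmetic, not the actual patched code): an offset computed in `Int` arithmetic silently wraps around once the true value exceeds `Int.MaxValue`:

```scala
// Simplified illustration of the overflow, not the real UnsafeSortDataFormat code.
val pos = 300000000                     // ~300 million records
val badOffset: Int  = pos * 8           // Int arithmetic wraps around to -1894967296
val goodOffset: Long = pos.toLong * 8   // widened to Long first: 2400000000
println(s"bad = $badOffset, good = $goodOffset")
```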
      
      ## How was this patch tested?
      
Added a test in `ExternalSorterSuite` that instantiates a large array of the form [150000000, 150000001, 150000002, ..., 300000000, 0, 1, 2, ..., 149999999], which triggers a `copyRange` in `TimSort.mergeLo` or `TimSort.mergeHi`. Note that the input dataset should contain at least 268.43 million rows with a certain data distribution for an overflow to occur.
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #13336 from sameeragarwal/timsort-bug.
      fe6de16f
    • Sean Zhong's avatar
      [SPARK-13445][SQL] Improves error message and add test coverage for Window function · b5859e0b
      Sean Zhong authored
      ## What changes were proposed in this pull request?
      
Add a more verbose error message when the ORDER BY clause is missing while using a window function.
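
For illustration, a rough sketch of the kind of query that now gets a clearer error (table and column names are hypothetical):

```scala
// Hypothetical example: ranking window functions need an ordered window specification.
spark.sql("SELECT row_number() OVER (PARTITION BY dept) AS rn FROM employees")                  // clearer error now
spark.sql("SELECT row_number() OVER (PARTITION BY dept ORDER BY salary) AS rn FROM employees")  // fine
```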
      
      ## How was this patch tested?
      
      Unit test.
      
      Author: Sean Zhong <seanzhong@databricks.com>
      
      Closes #13333 from clockfly/spark-13445.
      b5859e0b
    • Sean Owen's avatar
      [SPARK-15457][MLLIB][ML] Eliminate some warnings from MLlib about deprecations · b0a03fee
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Several classes and methods have been deprecated and are creating lots of build warnings in branch-2.0. This issue is to identify and fix those items:
      * WithSGD classes: Change to make class not deprecated, object deprecated, and public class constructor deprecated. Any public use will require a deprecated API. We need to keep a non-deprecated private API since we cannot eliminate certain uses: Python API, streaming algs, and examples.
        * Use in PythonMLlibAPI: Change to using private constructors
        * Streaming algs: No warnings after we un-deprecate the classes
        * Examples: Deprecate or change ones which use deprecated APIs
      * MulticlassMetrics fields (precision, etc.)
      * LinearRegressionSummary.model field
      
      ## How was this patch tested?
      
      Existing tests.  Checked for warnings manually.
      
      Author: Sean Owen <sowen@cloudera.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #13314 from jkbradley/warning-cleanups.
      b0a03fee
    • Reynold Xin's avatar
      [SPARK-15552][SQL] Remove unnecessary private[sql] methods in SparkSession · 0f61d6ef
      Reynold Xin authored
      ## What changes were proposed in this pull request?
SparkSession has a list of unnecessary private[sql] methods. These methods cause some trouble because private[sql] doesn't apply in Java. In the cases where they are easy to remove, we can simply remove them. This patch does that.
      
      As part of this pull request, I also replaced a bunch of protected[sql] with private[sql], to tighten up visibility.
      
      ## How was this patch tested?
      Updated test cases to reflect the changes.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13319 from rxin/SPARK-15552.
      0f61d6ef
    • Eric Liang's avatar
      [SPARK-15520][SQL] Also set sparkContext confs when using SparkSession builder in pyspark · 594a1bf2
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      Also sets confs in the underlying sc when using SparkSession.builder.getOrCreate(). This is a bug-fix from a post-merge comment in https://github.com/apache/spark/pull/13289
      
      ## How was this patch tested?
      
      Python doc-tests.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #13309 from ericl/spark-15520-1.
      594a1bf2
    • Andrew Or's avatar
      [SPARK-15539][SQL] DROP TABLE throw exception if table doesn't exist · 2b1ac6ce
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Same as #13302, but for DROP TABLE.
      
      ## How was this patch tested?
      
      `DDLSuite`
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #13307 from andrewor14/drop-table.
      2b1ac6ce
    • Steve Loughran's avatar
      [SPARK-13148][YARN] document zero-keytab Oozie application launch; add diagnostics · 01b350a4
      Steve Loughran authored
This patch documents what to do for keytab-less Oozie launches of Spark apps, and adds some debug-level diagnostics of which credentials have been submitted.
      
      Author: Steve Loughran <stevel@hortonworks.com>
      Author: Steve Loughran <stevel@apache.org>
      
      Closes #11033 from steveloughran/stevel/feature/SPARK-13148-oozie.
      01b350a4
    • felixcheung's avatar
      [SPARK-10903][SPARKR] R - Simplify SQLContext method signatures and use a singleton · c76457c8
      felixcheung authored
Eliminate the need to pass sqlContext to methods, since it is a singleton and we don't want to support multiple contexts in an R session.
      
Changes are done in a backward-compatible way, with a deprecation warning added. Method signatures for S3 methods are added in a concise, clean approach such that in the next release the deprecated signatures can be taken out easily and cleanly (just delete a few lines per method).
      
      Custom method dispatch is implemented to allow for multiple JVM reference types that are all 'jobj' in R and to avoid having to add 30 new exports.
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #9192 from felixcheung/rsqlcontext.
      c76457c8
    • Villu Ruusmann's avatar
      [SPARK-15523][ML][MLLIB] Update JPMML to 1.2.15 · 6d506c9a
      Villu Ruusmann authored
      ## What changes were proposed in this pull request?
      
      See https://issues.apache.org/jira/browse/SPARK-15523
      
      This PR replaces PR #13293. It's isolated to a new branch, and contains some more squashed changes.
      
      ## How was this patch tested?
      
      1. Executed `mvn clean package` in `mllib` directory
      2. Executed `dev/test-dependencies.sh --replace-manifest` in the root directory.
      
      Author: Villu Ruusmann <villu.ruusmann@gmail.com>
      
      Closes #13297 from vruusmann/update-jpmml.
      6d506c9a
    • wm624@hotmail.com's avatar
      [SPARK-15492][ML][DOC] Binarization scala example copy & paste to spark-shell error · e451f7f0
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
The Binarization Scala example contains `val dataFrame: DataFrame = spark.createDataFrame(data).toDF("label", "feature")`, which can't be pasted into the spark-shell because `DataFrame` is not imported. Compared with other examples, this explicit type is not required.
      
So I removed the explicit `DataFrame` type from the code.
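
A sketch of the before/after (the `data` values here are placeholders for the example's input):

```scala
// Before (fails when pasted into spark-shell because DataFrame is not imported):
//   val dataFrame: DataFrame = spark.createDataFrame(data).toDF("label", "feature")
// After (type is inferred, so it pastes cleanly):
val data = Array((0, 0.1), (1, 0.8), (2, 0.2))  // placeholder values
val dataFrame = spark.createDataFrame(data).toDF("label", "feature")
```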
      ## How was this patch tested?
      
      Manually tested
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #13266 from wangmiao1981/unit.
      e451f7f0
    • Bo Meng's avatar
      [SPARK-15537][SQL] fix dir delete issue · 53d4abe9
      Bo Meng authored
      ## What changes were proposed in this pull request?
      
For some of the test cases, e.g. `OrcSourceSuite`, temp folders and temp files are created inside them, but after the tests finish the folders are not removed. This causes lots of temp files to accumulate and occupy space if we keep running the test cases.

The reason is that `dir.delete()` won't work if the directory is not empty. We need to recursively delete the contents before deleting the folder.
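
A minimal sketch of the recursive-delete idea (plain Java I/O; not necessarily the exact helper used in the patch):

```scala
import java.io.File

// File.delete() fails on non-empty directories, so delete the children first.
def deleteRecursively(f: File): Boolean = {
  if (f.isDirectory) {
    Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
  }
  f.delete()
}
```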
      
      ## How was this patch tested?
      
      Manually checked the temp folder to make sure the temp files were deleted.
      
      Author: Bo Meng <mengbo@hotmail.com>
      
      Closes #13304 from bomeng/SPARK-15537.
      53d4abe9
    • Reynold Xin's avatar
      [SPARK-15543][SQL] Rename DefaultSources to make them more self-describing · 361ebc28
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch renames various DefaultSources to make their names more self-describing. The choice of "DefaultSource" was from the days when we did not have a good way to specify short names.
      
      They are now named:
      - LibSVMFileFormat
      - CSVFileFormat
      - JdbcRelationProvider
      - JsonFileFormat
      - ParquetFileFormat
      - TextFileFormat
      
      Backward compatibility is maintained through aliasing.
      
      ## How was this patch tested?
      Updated relevant test cases too.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13311 from rxin/SPARK-15543.
      361ebc28
    • Imran Rashid's avatar
      [SPARK-10372] [CORE] basic test framework for entire spark scheduler · dfc9fc02
      Imran Rashid authored
This is a basic framework for testing the entire scheduler. The tests this adds aren't very interesting -- the point of this PR is just to set up the framework, to keep the initial change small, but it can be built upon to test more features (e.g., speculation, killing tasks, blacklisting, etc.).
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #8559 from squito/SPARK-10372-scheduler-integs.
      dfc9fc02
  2. May 25, 2016
    • wm624@hotmail.com's avatar
      [SPARK-15439][SPARKR] Failed to run unit test in SparkR · 06bae8af
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      There are some failures when running SparkR unit tests.
      In this PR, I fixed two of these failures in test_context.R and test_sparkSQL.R
The first one is due to a different set of masked names; I added the missing names to the expected arrays.
The second one is because one PR removed the logic of a previous fix for the missing subset method.
      
The file privilege issue is still there; I am debugging it. The SparkR shell can run the test case successfully:
      test_that("pipeRDD() on RDDs", {
        actual <- collect(pipeRDD(rdd, "more"))
When using the run-tests script, it complains that there is no such directory, as below:
      cannot open file '/tmp/Rtmp4FQbah/filee2273f9d47f7': No such file or directory
      
      ## How was this patch tested?
      
Manually tested it.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #13284 from wangmiao1981/R.
      06bae8af
    • Sameer Agarwal's avatar
      [SPARK-15533][SQL] Deprecate Dataset.explode · 06ed1fa3
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This patch deprecates `Dataset.explode` and documents appropriate workarounds to use `flatMap()` or `functions.explode()` instead.
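
A brief sketch of the documented workarounds (the sample data is made up):

```scala
import org.apache.spark.sql.functions.{explode, split}
import spark.implicits._

val lines = Seq("a b", "c d e").toDS()  // Dataset[String] with a single column named "value"

// Typed alternative: flatMap on the Dataset.
val words = lines.flatMap(_.split(" "))

// Untyped alternative: the explode() function on an array column.
val exploded = lines.select(explode(split($"value", " ")).as("word"))
```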
      
      ## How was this patch tested?
      
      N/A
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #13312 from sameeragarwal/deprecate.
      06ed1fa3
    • Herman van Hovell's avatar
      [SPARK-15525][SQL][BUILD] Upgrade ANTLR4 SBT plugin · 527499b6
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      The ANTLR4 SBT plugin has been moved from its own repo to one on bintray. The version was also changed from `0.7.10` to `0.7.11`. The latter actually broke our build (ihji has fixed this by also adding `0.7.10` and others to the bin-tray repo).
      
      This PR upgrades the SBT-ANTLR4 plugin and ANTLR4 to their most recent versions (`0.7.11`/`4.5.3`). I have also removed a few obsolete build configurations.
      
      ## How was this patch tested?
      Manually running SBT/Maven builds.
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #13299 from hvanhovell/SPARK-15525.
      527499b6
    • Andrew Or's avatar
      [SPARK-15534][SPARK-15535][SQL] Truncate table fixes · ee682fe2
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Two changes:
      - When things fail, `TRUNCATE TABLE` just returns nothing. Instead, we should throw exceptions.
      - Remove `TRUNCATE TABLE ... COLUMN`, which was never supported by either Spark or Hive.
      
      ## How was this patch tested?
      Jenkins.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #13302 from andrewor14/truncate-table.
      ee682fe2
    • Gio Borje's avatar
      Log warnings for numIterations * miniBatchFraction < 1.0 · 589cce93
      Gio Borje authored
      ## What changes were proposed in this pull request?
      
Add a warning log for the case that `numIterations * miniBatchFraction < 1.0` during gradient descent. If the product of those two numbers is less than `1.0`, then not all training examples will be used during optimization. To put this concretely, suppose that `numExamples = 100`, `miniBatchFraction = 0.2` and `numIterations = 3`. Then 3 iterations will occur, each sampling approximately 20 examples. In the best case all sampled examples are unique, so at most 60 of the 100 examples are used.
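
A sketch of the kind of check this adds (the naming and logging details here are approximate, not the exact patched code):

```scala
// If numIterations * miniBatchFraction < 1.0, even perfectly disjoint samples
// cannot cover the whole training set, so warn the user.
val numIterations = 3
val miniBatchFraction = 0.2
if (numIterations * miniBatchFraction < 1.0) {
  println("Not all examples will be used if numIterations * miniBatchFraction < 1.0: " +
    s"only a fraction of about ${numIterations * miniBatchFraction} can be covered")
}
```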
      
This may be counter-intuitive to most users, and it led to an issue during the development of another Spark ML model: https://github.com/zhengruifeng/spark-libFM/issues/11. If a user actually does not require the full training data set, it would be easier and more intuitive to use `RDD.sample` directly.
      
      ## How was this patch tested?
      
      `build/mvn -DskipTests clean package` build succeeds
      
      Author: Gio Borje <gborje@linkedin.com>
      
      Closes #13265 from Hydrotoast/master.
      589cce93
    • Bryan Cutler's avatar
      [MINOR] [PYSPARK] [EXAMPLES] Changed examples to use SparkSession.sparkContext instead of _sc · 9c297df3
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      
      Some PySpark examples need a SparkContext and get it by accessing _sc directly from the session.  These examples should use the provided property `sparkContext` in `SparkSession` instead.
      
      ## How was this patch tested?
      Ran modified examples
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #13303 from BryanCutler/pyspark-session-sparkContext-MINOR.
      9c297df3
    • Takuya UESHIN's avatar
      [SPARK-14269][SCHEDULER] Eliminate unnecessary submitStage() call. · 698ef762
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
Currently the `submitStage()` method for waiting stages is called on every iteration of the event loop in `DAGScheduler` to submit all waiting stages, but most of these calls are unnecessary because they are not related to stage status.
The only case in which we should try to submit waiting stages is when their parent stages have successfully completed.
      
      This elimination can improve `DAGScheduler` performance.
      
      ## How was this patch tested?
      
Added some checks, ran the other existing tests, and verified with our own projects.
      
We have a project that is bottlenecked by `DAGScheduler`, with about 2000 stages.
      
Before this patch, almost all of the execution time in the `Driver` process was spent processing `submitStage()` on the `dag-scheduler-event-loop` thread; after this patch, performance improved as follows:
      
      |        | total execution time | `dag-scheduler-event-loop` thread time | `submitStage()` |
      |--------|---------------------:|---------------------------------------:|----------------:|
      | Before |              760 sec |                                710 sec |         667 sec |
      | After  |              440 sec |                                 14 sec |          10 sec |
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #12060 from ueshin/issues/SPARK-14269.
      698ef762
    • Jurriaan Pruis's avatar
      [SPARK-15493][SQL] default QuoteEscapingEnabled flag to true when writing CSV · c875d81a
      Jurriaan Pruis authored
      ## What changes were proposed in this pull request?
      
      Default QuoteEscapingEnabled flag to true when writing CSV and add an escapeQuotes option to be able to change this.
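
A short sketch of how the new option would be used from the DataFrame writer (`df` is an existing DataFrame and the path is a placeholder):

```scala
// With the new default, values containing quote characters are quoted and the embedded
// quotes escaped when writing CSV; escapeQuotes can be set explicitly to change this.
df.write
  .option("escapeQuotes", "true")   // the new default; set to "false" for the old behaviour
  .csv("/tmp/csv-output")           // placeholder output path
```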
      
      See https://github.com/uniVocity/univocity-parsers/blob/f3eb2af26374940e60d91d1703bde54619f50c51/src/main/java/com/univocity/parsers/csv/CsvWriterSettings.java#L231-L247
      
      This change is needed to be able to write RFC 4180 compatible CSV files (https://tools.ietf.org/html/rfc4180#section-2)
      
      https://issues.apache.org/jira/browse/SPARK-15493
      
      ## How was this patch tested?
      
      Added a test that verifies the output is quoted correctly.
      
      Author: Jurriaan Pruis <email@jurriaanpruis.nl>
      
      Closes #13267 from jurriaan/quote-escaping.
      c875d81a
    • Takuya UESHIN's avatar
      [SPARK-15483][SQL] IncrementalExecution should use extra strategies. · 4b880674
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
Extra strategies do not work for streams because `IncrementalExecution` uses a modified planner with stateful operations, and that planner does not include the extra strategies.

This PR fixes `IncrementalExecution` to include the extra strategies so that they are used.
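
For context, a sketch of how extra strategies are registered (the strategy below is a trivial placeholder); after this fix they are also picked up by `IncrementalExecution` for streaming queries:

```scala
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// A do-nothing placeholder strategy, purely for illustration.
object NoopStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = Nil
}

// User-registered extra strategies; previously these were ignored for streams.
spark.experimental.extraStrategies = Seq(NoopStrategy)
```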
      
      ## How was this patch tested?
      
      I added a test to check if extra strategies work for streams.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #13261 from ueshin/issues/SPARK-15483.
      4b880674
    • Nick Pentreath's avatar
      [SPARK-15500][DOC][ML][PYSPARK] Remove default value in Param doc field in ALS · 1cb347fb
      Nick Pentreath authored
      Remove "Default: MEMORY_AND_DISK" from `Param` doc field in ALS storage level params. This fixes up the output of `explainParam(s)` so that default values are not displayed twice.
      
      We can revisit in the case that [SPARK-15130](https://issues.apache.org/jira/browse/SPARK-15130) moves ahead with adding defaults in some way to PySpark param doc fields.
      
      Tests N/A.
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #13277 from MLnick/SPARK-15500-als-remove-default-storage-param.
      1cb347fb
    • lfzCarlosC's avatar
      [MINOR][MLLIB][STREAMING][SQL] Fix typos · 02c8072e
      lfzCarlosC authored
Fixed typos in source code for the [mllib], [streaming], and [SQL] components.
      
No tests; the changes are obvious.
      
      Author: lfzCarlosC <lfz.carlos@gmail.com>
      
      Closes #13298 from lfzCarlosC/master.
      02c8072e
    • Dongjoon Hyun's avatar
      [MINOR][CORE] Fix a HadoopRDD log message and remove unused imports in rdd files. · d6d3e507
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
This PR fixes the following typos in a log message and in comments of `HadoopRDD.scala`. It also removes unused imports.
      ```scala
      -      logWarning("Caching NewHadoopRDDs as deserialized objects usually leads to undesired" +
      +      logWarning("Caching HadoopRDDs as deserialized objects usually leads to undesired" +
      ...
      -      // since its not removed yet
      +      // since it's not removed yet
      ```
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13294 from dongjoon-hyun/minor_rdd_fix_log_message.
      d6d3e507
    • Eric Liang's avatar
      [SPARK-15520][SQL] SparkSession builder in python should also allow overriding... · 8239fdcb
      Eric Liang authored
      [SPARK-15520][SQL] SparkSession builder in python should also allow overriding confs of existing sessions
      
      ## What changes were proposed in this pull request?
      
      This fixes the python SparkSession builder to allow setting confs correctly. This was a leftover TODO from https://github.com/apache/spark/pull/13200.
      
      ## How was this patch tested?
      
      Python doc tests.
      
      cc andrewor14
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #13289 from ericl/spark-15520.
      8239fdcb
    • Jeff Zhang's avatar
      [SPARK-15345][SQL][PYSPARK] SparkSession's conf doesn't take effect when this... · 01e7b9c8
      Jeff Zhang authored
[SPARK-15345][SQL][PYSPARK] SparkSession's conf doesn't take effect when there is already an existing SparkContext
      
      ## What changes were proposed in this pull request?
      
Override the existing SparkContext if the provided SparkConf is different. The PySpark part hasn't been fixed yet; I will do that after the first round of review to ensure this is the correct approach.
      
      ## How was this patch tested?
      
      Manually verify it in spark-shell.
      
      rxin  Please help review it, I think this is a very critical issue for spark 2.0
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #13160 from zjffdu/SPARK-15345.
      01e7b9c8
    • Lukasz's avatar
      [SPARK-9044] Fix "Storage" tab in UI so that it reflects RDD name change. · b120fba6
      Lukasz authored
      ## What changes were proposed in this pull request?
      
1. Make the 'name' field of RDDInfo mutable.
2. In StorageListener: catch the fact that an RDD's name was changed and update it in RDDInfo.
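
A small sketch of the user-visible scenario being fixed (names are illustrative):

```scala
// Rename a cached RDD; with this fix, the Storage tab reflects the new name.
val rdd = sc.parallelize(1 to 100).setName("original-name")
rdd.cache()
rdd.count()                 // materialize so the RDD appears in the Storage tab
rdd.setName("renamed-rdd")  // previously the Storage tab kept showing "original-name"
```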
      
      ## How was this patch tested?
      
      1. Manual verification - the 'Storage' tab now behaves as expected.
      2. The commit also contains a new unit test which verifies this.
      
      Author: Lukasz <lgieron@gmail.com>
      
      Closes #13264 from lgieron/SPARK-9044.
      b120fba6
    • Reynold Xin's avatar
      [SPARK-15436][SQL] Remove DescribeFunction and ShowFunctions · 4f27b8dd
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch removes the last two commands defined in the catalyst module: DescribeFunction and ShowFunctions. They were unnecessary since the parser could just generate DescribeFunctionCommand and ShowFunctionsCommand directly.
      
      ## How was this patch tested?
      Created a new SparkSqlParserSuite.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13292 from rxin/SPARK-15436.
      4f27b8dd
    • Krishna Kalyan's avatar
      [SPARK-12071][DOC] Document the behaviour of NA in R · 9082b796
      Krishna Kalyan authored
      ## What changes were proposed in this pull request?
      
Under the "Upgrading From SparkR 1.5.x to 1.6.x" section, added the information that Spark SQL converts `NA` in R to `null`.
      
      ## How was this patch tested?
      
      Document update, no tests.
      
      Author: Krishna Kalyan <krishnakalyan3@gmail.com>
      
      Closes #13268 from krishnakalyan3/spark-12071-1.
      9082b796
    • Holden Karau's avatar
      [SPARK-15412][PYSPARK][SPARKR][DOCS] Improve linear isotonic regression pydoc... · cd9f1690
      Holden Karau authored
[SPARK-15412][PYSPARK][SPARKR][DOCS] Improve linear isotonic regression pydoc & doc build instructions
      
      ## What changes were proposed in this pull request?
      
      PySpark: Add links to the predictors from the models in regression.py, improve linear and isotonic pydoc in minor ways.
User guide / R: Update the installed package list so that it is enough to build the R docs on a "fresh" Ubuntu install, and add sudo to match the rest of the commands.
User Guide: Add a note about using gem2.0 for systems with both Ruby 1.9 and 2.0 (e.g. some Ubuntu versions, but maybe more).
      
      ## How was this patch tested?
      
      built pydocs locally, tested new user build instructions
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #13199 from holdenk/SPARK-15412-improve-linear-isotonic-regression-pydoc.
      cd9f1690
    • Shixiong Zhu's avatar
      [SPARK-15508][STREAMING][TESTS] Fix flaky test: JavaKafkaStreamSuite.testKafkaStream · c9c1c0e5
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
`JavaKafkaStreamSuite.testKafkaStream` assumes that when `sent.size == result.size`, the contents of `sent` and `result` should be the same. However, that's not true: the content of `result` may not yet be the final content.
      
      This PR modified the test to always retry the assertions even if the contents of `sent` and `result` are not same.
      
      Here is the failure in Jenkins: http://spark-tests.appspot.com/tests/org.apache.spark.streaming.kafka.JavaKafkaStreamSuite/testKafkaStream
      
      ## How was this patch tested?
      
      Jenkins unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13281 from zsxwing/flaky-kafka-test.
      c9c1c0e5
  3. May 24, 2016
    • Wenchen Fan's avatar
      [SPARK-15498][TESTS] fix slow tests · 50b660d7
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This PR fixes 3 slow tests:
      
1. `ParquetQuerySuite.read/write wide table`: This is not a good unit test as it runs for more than 5 minutes. This PR removes it and adds a new regression test in `CodeGenerationSuite`, which is more of a "unit" test.
2. `ParquetQuerySuite.returning batch for wide table`: reduce the threshold and use a smaller data size.
3. `DatasetSuite.SPARK-14554: Dataset.map may generate wrong java code for wide table`: Improving `CodeFormatter.format` (introduced in https://github.com/apache/spark/pull/12979) dramatically speeds it up.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13273 from cloud-fan/test.
      50b660d7
    • Parth Brahmbhatt's avatar
      [SPARK-15365][SQL] When table size statistics are not available from... · 4acababc
      Parth Brahmbhatt authored
[SPARK-15365][SQL] When table size statistics are not available from the metastore, we should fall back to HDFS
      
      ## What changes were proposed in this pull request?
Currently, if a table is used in a join operation, we rely on the size returned by the metastore to decide whether we can convert the operation to a broadcast join. This optimization only kicks in for tables that have statistics available in the metastore. Hive generally falls back to HDFS if the statistics are not available directly from the metastore, and this seems like a reasonable choice to adopt given the optimization benefit of using broadcast joins.
      
      ## How was this patch tested?
      I have executed queries locally to test.
      
      Author: Parth Brahmbhatt <pbrahmbhatt@netflix.com>
      
      Closes #13150 from Parth-Brahmbhatt/SPARK-15365.
      4acababc