  1. May 26, 2016
    • Sean Zhong's avatar
      [SPARK-13445][SQL] Improves error message and add test coverage for Window function · b5859e0b
      Sean Zhong authored
      ## What changes were proposed in this pull request?
      
      Add a more verbose error message when the ORDER BY clause is missing when using a window function.
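      For illustration, a hypothetical query (table and column names are made up, and a `SparkSession` named `spark` is assumed) that hits the improved message:

      ```scala
      // Ranking window functions such as row_number() require an ORDER BY in the
      // OVER clause; omitting it should now fail with a clearer explanation that
      // the window must be ordered, rather than a terse analysis error.
      spark.sql(
        """SELECT name, row_number() OVER (PARTITION BY dept) AS rn
          |FROM employees""".stripMargin)
      ```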
      
      ## How was this patch tested?
      
      Unit test.
      
      Author: Sean Zhong <seanzhong@databricks.com>
      
      Closes #13333 from clockfly/spark-13445.
      b5859e0b
    • Sean Owen's avatar
      [SPARK-15457][MLLIB][ML] Eliminate some warnings from MLlib about deprecations · b0a03fee
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Several classes and methods have been deprecated and are creating lots of build warnings in branch-2.0. This issue is to identify and fix those items:
      * WithSGD classes: Change to make class not deprecated, object deprecated, and public class constructor deprecated. Any public use will require a deprecated API. We need to keep a non-deprecated private API since we cannot eliminate certain uses: Python API, streaming algs, and examples.
        * Use in PythonMLlibAPI: Change to using private constructors
        * Streaming algs: No warnings after we un-deprecate the classes
        * Examples: Deprecate or change ones which use deprecated APIs
      * MulticlassMetrics fields (precision, etc.)
      * LinearRegressionSummary.model field
      
      ## How was this patch tested?
      
      Existing tests.  Checked for warnings manually.
      
      Author: Sean Owen <sowen@cloudera.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #13314 from jkbradley/warning-cleanups.
      b0a03fee
    • Reynold Xin's avatar
      [SPARK-15552][SQL] Remove unnecessary private[sql] methods in SparkSession · 0f61d6ef
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      SparkSession has a list of unnecessary private[sql] methods. These methods cause some trouble because private[sql] does not apply to Java callers (Scala's qualified-private visibility is not enforced in bytecode, so the methods are effectively public from Java). In the cases where they are easy to remove, we can simply remove them. This patch does that.
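      A minimal sketch (the member name is illustrative) of why `private[sql]` is not a real barrier for Java callers:

      ```scala
      package org.apache.spark.sql

      class InternalHolder {
        // Scala compiles qualified-private members like this one to public JVM
        // methods, so Java code in any package can still call them; only Scala
        // code outside org.apache.spark.sql is stopped at compile time.
        private[sql] def internalHelper(): Int = 42
      }
      ```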
      
      As part of this pull request, I also replaced a bunch of protected[sql] with private[sql], to tighten up visibility.
      
      ## How was this patch tested?
      Updated test cases to reflect the changes.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13319 from rxin/SPARK-15552.
      0f61d6ef
    • Eric Liang's avatar
      [SPARK-15520][SQL] Also set sparkContext confs when using SparkSession builder in pyspark · 594a1bf2
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      Also sets confs on the underlying `sc` when using `SparkSession.builder.getOrCreate()`. This is a bug fix following up on a post-merge comment in https://github.com/apache/spark/pull/13289
      
      ## How was this patch tested?
      
      Python doc-tests.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #13309 from ericl/spark-15520-1.
      594a1bf2
    • Andrew Or's avatar
      [SPARK-15539][SQL] DROP TABLE throw exception if table doesn't exist · 2b1ac6ce
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Same as #13302, but for DROP TABLE.
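      An illustration of the intended behaviour (table name is made up; the exact exception type is not shown here):

      ```scala
      // Before this change, dropping a missing table could silently do nothing.
      spark.sql("DROP TABLE does_not_exist")            // now fails with an error
      spark.sql("DROP TABLE IF EXISTS does_not_exist")  // still a silent no-op
      ```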
      
      ## How was this patch tested?
      
      `DDLSuite`
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #13307 from andrewor14/drop-table.
      2b1ac6ce
    • Steve Loughran's avatar
      [SPARK-13148][YARN] document zero-keytab Oozie application launch; add diagnostics · 01b350a4
      Steve Loughran authored
      This patch provides detail on what to do for keytab-less Oozie launches of Spark apps, and adds some debug-level diagnostics of what credentials have been submitted.
      
      Author: Steve Loughran <stevel@hortonworks.com>
      Author: Steve Loughran <stevel@apache.org>
      
      Closes #11033 from steveloughran/stevel/feature/SPARK-13148-oozie.
      01b350a4
    • felixcheung's avatar
      [SPARK-10903][SPARKR] R - Simplify SQLContext method signatures and use a singleton · c76457c8
      felixcheung authored
      Eliminate the need to pass sqlContext to methods, since it is a singleton - and we don't want to support multiple contexts in an R session.
      
      Changes are done in a backward-compatible way, with a deprecation warning added. Method signatures for S3 methods are added in a concise, clean approach such that in the next release the deprecated signatures can be taken out easily/cleanly (just delete a few lines per method).
      
      Custom method dispatch is implemented to allow for multiple JVM reference types that are all 'jobj' in R and to avoid having to add 30 new exports.
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #9192 from felixcheung/rsqlcontext.
      c76457c8
    • Villu Ruusmann's avatar
      [SPARK-15523][ML][MLLIB] Update JPMML to 1.2.15 · 6d506c9a
      Villu Ruusmann authored
      ## What changes were proposed in this pull request?
      
      See https://issues.apache.org/jira/browse/SPARK-15523
      
      This PR replaces PR #13293. It's isolated to a new branch, and contains some more squashed changes.
      
      ## How was this patch tested?
      
      1. Executed `mvn clean package` in `mllib` directory
      2. Executed `dev/test-dependencies.sh --replace-manifest` in the root directory.
      
      Author: Villu Ruusmann <villu.ruusmann@gmail.com>
      
      Closes #13297 from vruusmann/update-jpmml.
      6d506c9a
    • wm624@hotmail.com's avatar
      [SPARK-15492][ML][DOC] Binarization scala example copy & paste to spark-shell error · e451f7f0
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      The Binarization Scala example declares `val dataFrame: DataFrame = spark.createDataFrame(data).toDF("label", "feature")`, which cannot be pasted into spark-shell because `DataFrame` is not imported there. Compared with other examples, this explicit type annotation is not required.
      
      So I removed the explicit `DataFrame` type from the code.
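      A sketch of the change, assuming the example's existing `data` value and a `spark` session:

      ```scala
      // Before: cannot be pasted into spark-shell because DataFrame is not imported
      // val dataFrame: DataFrame = spark.createDataFrame(data).toDF("label", "feature")

      // After: drop the annotation and let the compiler infer the type
      val dataFrame = spark.createDataFrame(data).toDF("label", "feature")
      ```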
      ## How was this patch tested?
      
      Manually tested.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #13266 from wangmiao1981/unit.
      e451f7f0
    • Bo Meng's avatar
      [SPARK-15537][SQL] fix dir delete issue · 53d4abe9
      Bo Meng authored
      ## What changes were proposed in this pull request?
      
      For some of the test cases, e.g. `OrcSourceSuite`, temp folders and temp files inside them are created, but after the tests finish the folders are not removed. This leaves lots of temp files behind and occupies space if we keep running the test cases.
      
      The reason is that `dir.delete()` does not work if the directory is not empty; we need to recursively delete the contents before deleting the folder itself.
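      A minimal sketch (not necessarily the exact helper used in the patch) of deleting a directory recursively:

      ```scala
      import java.io.File

      // File.delete() fails on a non-empty directory, so delete the children first.
      def deleteRecursively(file: File): Unit = {
        if (file.isDirectory) {
          Option(file.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
        }
        file.delete()
      }
      ```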
      
      ## How was this patch tested?
      
      Manually checked the temp folder to make sure the temp files were deleted.
      
      Author: Bo Meng <mengbo@hotmail.com>
      
      Closes #13304 from bomeng/SPARK-15537.
      53d4abe9
    • Reynold Xin's avatar
      [SPARK-15543][SQL] Rename DefaultSources to make them more self-describing · 361ebc28
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch renames various DefaultSources to make their names more self-describing. The choice of "DefaultSource" was from the days when we did not have a good way to specify short names.
      
      They are now named:
      - LibSVMFileFormat
      - CSVFileFormat
      - JdbcRelationProvider
      - JsonFileFormat
      - ParquetFileFormat
      - TextFileFormat
      
      Backward compatibility is maintained through aliasing.
      
      ## How was this patch tested?
      Updated relevant test cases too.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13311 from rxin/SPARK-15543.
      361ebc28
    • Imran Rashid's avatar
      [SPARK-10372] [CORE] basic test framework for entire spark scheduler · dfc9fc02
      Imran Rashid authored
      This is a basic framework for testing the entire scheduler.  The tests this adds aren't very interesting -- the point of this PR is just to set up the framework, keeping the initial change small, but it can be built upon to test more features (e.g., speculation, killing tasks, blacklisting, etc.).
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #8559 from squito/SPARK-10372-scheduler-integs.
      dfc9fc02
  2. May 25, 2016
    • wm624@hotmail.com's avatar
      [SPARK-15439][SPARKR] Failed to run unit test in SparkR · 06bae8af
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      There are some failures when running the SparkR unit tests.
      In this PR, I fixed two of these failures, in test_context.R and test_sparkSQL.R:
      - The first one is due to a different masked name; I added the missing names to the expected arrays.
      - The second one is because a PR removed the logic of a previous fix for the missing subset method.
      
      The file privilege issue is still there; I am debugging it. The SparkR shell can run this test case successfully:
      ```r
      test_that("pipeRDD() on RDDs", {
        actual <- collect(pipeRDD(rdd, "more"))
      ```
      but when using the run-tests script, it complains that the directory does not exist:
      ```
      cannot open file '/tmp/Rtmp4FQbah/filee2273f9d47f7': No such file or directory
      ```
      
      ## How was this patch tested?
      
      Manually tested.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #13284 from wangmiao1981/R.
      06bae8af
    • Sameer Agarwal's avatar
      [SPARK-15533][SQL] Deprecate Dataset.explode · 06ed1fa3
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This patch deprecates `Dataset.explode` and documents appropriate workarounds to use `flatMap()` or `functions.explode()` instead.
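      For illustration, a self-contained sketch (made-up data and column names) of the documented workarounds:

      ```scala
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions.{col, explode}

      object ExplodeWorkarounds {
        case class Entry(id: Long, words: Seq[String])

        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().master("local[*]").appName("explode-demo").getOrCreate()
          import spark.implicits._

          val ds = Seq(Entry(1L, Seq("a", "b")), Entry(2L, Seq("c"))).toDS()

          // Instead of ds.explode(...): use the explode() function on the array column ...
          ds.select(col("id"), explode(col("words")).as("word")).show()

          // ... or flatMap on the typed Dataset.
          ds.flatMap(e => e.words.map(w => (e.id, w))).toDF("id", "word").show()

          spark.stop()
        }
      }
      ```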
      
      ## How was this patch tested?
      
      N/A
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #13312 from sameeragarwal/deprecate.
      06ed1fa3
    • Herman van Hovell's avatar
      [SPARK-15525][SQL][BUILD] Upgrade ANTLR4 SBT plugin · 527499b6
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      The ANTLR4 SBT plugin has been moved from its own repo to one on bintray. The version was also changed from `0.7.10` to `0.7.11`. The latter actually broke our build (ihji has fixed this by also adding `0.7.10` and others to the bintray repo).
      
      This PR upgrades the SBT-ANTLR4 plugin and ANTLR4 to their most recent versions (`0.7.11`/`4.5.3`). I have also removed a few obsolete build configurations.
      
      ## How was this patch tested?
      Manually running SBT/Maven builds.
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #13299 from hvanhovell/SPARK-15525.
      527499b6
    • Andrew Or's avatar
      [SPARK-15534][SPARK-15535][SQL] Truncate table fixes · ee682fe2
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Two changes:
      - When things fail, `TRUNCATE TABLE` just returns nothing. Instead, we should throw exceptions.
      - Remove `TRUNCATE TABLE ... COLUMN`, which was never supported by either Spark or Hive.
      
      ## How was this patch tested?
      Jenkins.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #13302 from andrewor14/truncate-table.
      ee682fe2
    • Gio Borje's avatar
      Log warnings for numIterations * miniBatchFraction < 1.0 · 589cce93
      Gio Borje authored
      ## What changes were proposed in this pull request?
      
      Add a warning log for the case that `numIterations * miniBatchFraction < 1.0` during gradient descent. If the product of those two numbers is less than `1.0`, then not all training examples will be used during optimization. To put this concretely, suppose that `numExamples = 100`, `miniBatchFraction = 0.06` and `numIterations = 3`. Then 3 iterations will occur, each sampling approximately 6 examples. In the best case each of the 6 examples is unique, so at most 18/100 examples are used.
      
      This may be counter-intuitive to most users, and it led to an issue during the development of another Spark ML model: https://github.com/zhengruifeng/spark-libFM/issues/11. If a user actually does not need the full training data set, it would be easier and more intuitive to downsample explicitly with `RDD.sample`.
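      A minimal sketch of the check this patch adds (the actual wording and logger call in `GradientDescent` may differ):

      ```scala
      // Warn when the expected coverage of the training data is below 100%.
      def warnIfUnderSampling(numIterations: Int, miniBatchFraction: Double): Unit = {
        if (numIterations * miniBatchFraction < 1.0) {
          println(s"Not all examples will be used if numIterations * miniBatchFraction < 1.0: " +
            s"numIterations=$numIterations, miniBatchFraction=$miniBatchFraction")
        }
      }

      warnIfUnderSampling(3, 0.2)  // 0.6 < 1.0, so the warning fires
      ```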
      
      ## How was this patch tested?
      
      `build/mvn -DskipTests clean package` build succeeds
      
      Author: Gio Borje <gborje@linkedin.com>
      
      Closes #13265 from Hydrotoast/master.
      589cce93
    • Bryan Cutler's avatar
      [MINOR] [PYSPARK] [EXAMPLES] Changed examples to use SparkSession.sparkContext instead of _sc · 9c297df3
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      
      Some PySpark examples need a SparkContext and get it by accessing _sc directly from the session.  These examples should use the provided property `sparkContext` in `SparkSession` instead.
      
      ## How was this patch tested?
      Ran modified examples
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #13303 from BryanCutler/pyspark-session-sparkContext-MINOR.
      9c297df3
    • Takuya UESHIN's avatar
      [SPARK-14269][SCHEDULER] Eliminate unnecessary submitStage() call. · 698ef762
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      Currently, `submitStage()` is called for waiting stages on every iteration of the event loop in `DAGScheduler` to submit all waiting stages, but most of these calls are unnecessary because they are unrelated to any change in stage status.
      The only case in which we should try to submit waiting stages is when their parent stages have successfully completed.
      
      This elimination can improve `DAGScheduler` performance.
      
      ## How was this patch tested?
      
      Added some checks, relied on other existing tests, and verified with our own projects.
      
      We have a project bottle-necked by `DAGScheduler`, having about 2000 stages.
      
      Before this patch, almost all of the execution time in the `Driver` process was spent processing `submitStage()` in the `dag-scheduler-event-loop` thread; after this patch, the performance improved as follows:
      
      |        | total execution time | `dag-scheduler-event-loop` thread time | `submitStage()` |
      |--------|---------------------:|---------------------------------------:|----------------:|
      | Before |              760 sec |                                710 sec |         667 sec |
      | After  |              440 sec |                                 14 sec |          10 sec |
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #12060 from ueshin/issues/SPARK-14269.
      698ef762
    • Jurriaan Pruis's avatar
      [SPARK-15493][SQL] default QuoteEscapingEnabled flag to true when writing CSV · c875d81a
      Jurriaan Pruis authored
      ## What changes were proposed in this pull request?
      
      Default QuoteEscapingEnabled flag to true when writing CSV and add an escapeQuotes option to be able to change this.
      
      See https://github.com/uniVocity/univocity-parsers/blob/f3eb2af26374940e60d91d1703bde54619f50c51/src/main/java/com/univocity/parsers/csv/CsvWriterSettings.java#L231-L247
      
      This change is needed to be able to write RFC 4180 compatible CSV files (https://tools.ietf.org/html/rfc4180#section-2)
      
      https://issues.apache.org/jira/browse/SPARK-15493
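      A sketch of the new writer option (assuming an existing DataFrame `df` and a made-up output path):

      ```scala
      // escapeQuotes now defaults to true, producing RFC 4180 style quoting;
      // it can still be turned off explicitly if the old behaviour is wanted.
      df.write
        .option("escapeQuotes", "false")
        .csv("/tmp/out-csv")
      ```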
      
      ## How was this patch tested?
      
      Added a test that verifies the output is quoted correctly.
      
      Author: Jurriaan Pruis <email@jurriaanpruis.nl>
      
      Closes #13267 from jurriaan/quote-escaping.
      c875d81a
    • Takuya UESHIN's avatar
      [SPARK-15483][SQL] IncrementalExecution should use extra strategies. · 4b880674
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      Extra strategies do not work for streams because `IncrementalExecution` uses a modified planner with stateful operations, but that planner does not include the extra strategies.
      
      This PR fixes `IncrementalExecution` to include the extra strategies so that they can be used.
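      For context, a sketch of how extra strategies are registered (the strategy below is a do-nothing placeholder); after this fix they are also picked up when planning streaming queries:

      ```scala
      import org.apache.spark.sql.{SparkSession, Strategy}
      import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
      import org.apache.spark.sql.execution.SparkPlan

      // A strategy that never matches anything, just to show the wiring.
      object NoopStrategy extends Strategy {
        override def apply(plan: LogicalPlan): Seq[SparkPlan] = Nil
      }

      val spark = SparkSession.builder().master("local[*]").appName("extra-strategies").getOrCreate()
      spark.experimental.extraStrategies = NoopStrategy :: Nil
      ```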
      
      ## How was this patch tested?
      
      I added a test to check if extra strategies work for streams.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #13261 from ueshin/issues/SPARK-15483.
      4b880674
    • Nick Pentreath's avatar
      [SPARK-15500][DOC][ML][PYSPARK] Remove default value in Param doc field in ALS · 1cb347fb
      Nick Pentreath authored
      Remove "Default: MEMORY_AND_DISK" from `Param` doc field in ALS storage level params. This fixes up the output of `explainParam(s)` so that default values are not displayed twice.
      
      We can revisit in the case that [SPARK-15130](https://issues.apache.org/jira/browse/SPARK-15130) moves ahead with adding defaults in some way to PySpark param doc fields.
      
      Tests N/A.
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #13277 from MLnick/SPARK-15500-als-remove-default-storage-param.
      1cb347fb
    • lfzCarlosC's avatar
      [MINOR][MLLIB][STREAMING][SQL] Fix typos · 02c8072e
      lfzCarlosC authored
      Fixed typos in source code for the [mllib], [streaming], and [SQL] components.
      
      None and obvious.
      
      Author: lfzCarlosC <lfz.carlos@gmail.com>
      
      Closes #13298 from lfzCarlosC/master.
      02c8072e
    • Dongjoon Hyun's avatar
      [MINOR][CORE] Fix a HadoopRDD log message and remove unused imports in rdd files. · d6d3e507
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR fixes the following typos in log message and comments of `HadoopRDD.scala`. Also, this removes unused imports.
      ```scala
      -      logWarning("Caching NewHadoopRDDs as deserialized objects usually leads to undesired" +
      +      logWarning("Caching HadoopRDDs as deserialized objects usually leads to undesired" +
      ...
      -      // since its not removed yet
      +      // since it's not removed yet
      ```
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13294 from dongjoon-hyun/minor_rdd_fix_log_message.
      d6d3e507
    • Eric Liang's avatar
      [SPARK-15520][SQL] SparkSession builder in python should also allow overriding... · 8239fdcb
      Eric Liang authored
      [SPARK-15520][SQL] SparkSession builder in python should also allow overriding confs of existing sessions
      
      ## What changes were proposed in this pull request?
      
      This fixes the python SparkSession builder to allow setting confs correctly. This was a leftover TODO from https://github.com/apache/spark/pull/13200.
      
      ## How was this patch tested?
      
      Python doc tests.
      
      cc andrewor14
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #13289 from ericl/spark-15520.
      8239fdcb
    • Jeff Zhang's avatar
      [SPARK-15345][SQL][PYSPARK] SparkSession's conf doesn't take effect when this... · 01e7b9c8
      Jeff Zhang authored
      [SPARK-15345][SQL][PYSPARK] SparkSession's conf doesn't take effect when there is already an existing SparkContext
      
      ## What changes were proposed in this pull request?
      
      Override the existing SparkContext if the provided SparkConf is different. The PySpark part hasn't been fixed yet; I will do that after the first round of review to ensure this is the correct approach.
      
      ## How was this patch tested?
      
      Manually verify it in spark-shell.
      
      rxin Please help review it; I think this is a very critical issue for Spark 2.0.
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #13160 from zjffdu/SPARK-15345.
      01e7b9c8
    • Lukasz's avatar
      [SPARK-9044] Fix "Storage" tab in UI so that it reflects RDD name change. · b120fba6
      Lukasz authored
      ## What changes were proposed in this pull request?
      
      1. Making 'name' field of RDDInfo mutable.
      2. In StorageListener: catching the fact that an RDD's name was changed and updating it in RDDInfo (see the sketch below).
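      A small illustration of the user-visible effect (names are made up), assuming an active `SparkContext` `sc`:

      ```scala
      // Rename a cached RDD after it has been materialized; the "Storage" tab
      // should now show the new name instead of the stale one.
      val rdd = sc.parallelize(1 to 100).setName("before-rename").cache()
      rdd.count()                  // materialize so it appears under "Storage"
      rdd.setName("after-rename")  // the tab now reflects the rename
      ```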
      
      ## How was this patch tested?
      
      1. Manual verification - the 'Storage' tab now behaves as expected.
      2. The commit also contains a new unit test which verifies this.
      
      Author: Lukasz <lgieron@gmail.com>
      
      Closes #13264 from lgieron/SPARK-9044.
      b120fba6
    • Reynold Xin's avatar
      [SPARK-15436][SQL] Remove DescribeFunction and ShowFunctions · 4f27b8dd
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch removes the last two commands defined in the catalyst module: DescribeFunction and ShowFunctions. They were unnecessary since the parser could just generate DescribeFunctionCommand and ShowFunctionsCommand directly.
      
      ## How was this patch tested?
      Created a new SparkSqlParserSuite.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13292 from rxin/SPARK-15436.
      4f27b8dd
    • Krishna Kalyan's avatar
      [SPARK-12071][DOC] Document the behaviour of NA in R · 9082b796
      Krishna Kalyan authored
      ## What changes were proposed in this pull request?
      
      Under the "Upgrading From SparkR 1.5.x to 1.6.x" section, added a note that SparkSQL converts `NA` in R to `null`.
      
      ## How was this patch tested?
      
      Document update, no tests.
      
      Author: Krishna Kalyan <krishnakalyan3@gmail.com>
      
      Closes #13268 from krishnakalyan3/spark-12071-1.
      9082b796
    • Holden Karau's avatar
      [SPARK-15412][PYSPARK][SPARKR][DOCS] Improve linear isotonic regression pydoc... · cd9f1690
      Holden Karau authored
      [SPARK-15412][PYSPARK][SPARKR][DOCS] Improve linear isotonic regression pydoc & doc build instructions
      
      ## What changes were proposed in this pull request?
      
      PySpark: Add links to the predictors from the models in regression.py, improve linear and isotonic pydoc in minor ways.
      User guide / R: Switch the installed package list to be enough to build the R docs on a "fresh" install on ubuntu and add sudo to match the rest of the commands.
      User Guide: Add a note about using gem2.0 for systems with both 1.9 and 2.0 (e.g. some ubuntu but maybe more).
      
      ## How was this patch tested?
      
      Built pydocs locally and tested the new user build instructions.
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #13199 from holdenk/SPARK-15412-improve-linear-isotonic-regression-pydoc.
      cd9f1690
    • Shixiong Zhu's avatar
      [SPARK-15508][STREAMING][TESTS] Fix flaky test: JavaKafkaStreamSuite.testKafkaStream · c9c1c0e5
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      `JavaKafkaStreamSuite.testKafkaStream` assumes that when `sent.size == result.size`, the contents of `sent` and `result` should be the same. However, that's not true: the content of `result` may not yet be the final content.
      
      This PR modified the test to always retry the assertions even if the contents of `sent` and `result` are not same.
      
      Here is the failure in Jenkins: http://spark-tests.appspot.com/tests/org.apache.spark.streaming.kafka.JavaKafkaStreamSuite/testKafkaStream
      
      ## How was this patch tested?
      
      Jenkins unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13281 from zsxwing/flaky-kafka-test.
      c9c1c0e5
  3. May 24, 2016
    • Wenchen Fan's avatar
      [SPARK-15498][TESTS] fix slow tests · 50b660d7
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This PR fixes 3 slow tests:
      
      1. `ParquetQuerySuite.read/write wide table`: this is not a good unit test as it runs for more than 5 minutes. This PR removes it and adds a new regression test in `CodeGenerationSuite`, which is more of a unit test.
      2. `ParquetQuerySuite.returning batch for wide table`: reduce the threshold and use a smaller data size.
      3. `DatasetSuite.SPARK-14554: Dataset.map may generate wrong java code for wide table`: improving `CodeFormatter.format` (introduced in https://github.com/apache/spark/pull/12979) speeds it up dramatically.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13273 from cloud-fan/test.
      50b660d7
    • Parth Brahmbhatt's avatar
      [SPARK-15365][SQL] When table size statistics are not available from... · 4acababc
      Parth Brahmbhatt authored
      [SPARK-15365][SQL] When table size statistics are not available from metastore, we should fallback to HDFS
      
      ## What changes were proposed in this pull request?
      Currently, if a table is used in a join operation, we rely on the size returned by the metastore to decide whether we can convert the operation to a broadcast join. This optimization only kicks in for tables that have statistics available in the metastore. Hive generally rolls over to HDFS if the statistics are not available directly from the metastore, and this seems like a reasonable choice to adopt given the optimization benefit of using broadcast joins.
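      For reference, the size estimate feeds into the existing broadcast join threshold; a sketch (value in bytes, assuming a `spark` session):

      ```scala
      // Tables whose estimated size (from the metastore, or now from HDFS as a
      // fallback) is below this threshold are broadcast in joins.
      spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)
      ```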
      
      ## How was this patch tested?
      I have executed queries locally to test.
      
      Author: Parth Brahmbhatt <pbrahmbhatt@netflix.com>
      
      Closes #13150 from Parth-Brahmbhatt/SPARK-15365.
      4acababc
    • Reynold Xin's avatar
      [SPARK-15518] Rename various scheduler backend for consistency · 14494da8
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch renames various scheduler backends to make them consistent:
      
      - LocalScheduler -> LocalSchedulerBackend
      - AppClient -> StandaloneAppClient
      - AppClientListener -> StandaloneAppClientListener
      - SparkDeploySchedulerBackend -> StandaloneSchedulerBackend
      - CoarseMesosSchedulerBackend -> MesosCoarseGrainedSchedulerBackend
      - MesosSchedulerBackend -> MesosFineGrainedSchedulerBackend
      
      ## How was this patch tested?
      Updated test cases to reflect the name change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13288 from rxin/SPARK-15518.
      14494da8
    • Dongjoon Hyun's avatar
      [SPARK-15512][CORE] repartition(0) should raise IllegalArgumentException · f08bf587
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Previously, SPARK-8893 added constraints requiring a positive number of partitions for repartition/coalesce operations in general. This PR adds the one missing part of that and adds two explicit test cases.
      
      **Before**
      ```scala
      scala> sc.parallelize(1 to 5).coalesce(0)
      java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
      ...
      scala> sc.parallelize(1 to 5).repartition(0).collect()
      res1: Array[Int] = Array()   // empty
      scala> spark.sql("select 1").coalesce(0)
      res2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [1: int]
      scala> spark.sql("select 1").coalesce(0).collect()
      java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
      scala> spark.sql("select 1").repartition(0)
      res3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [1: int]
      scala> spark.sql("select 1").repartition(0).collect()
      res4: Array[org.apache.spark.sql.Row] = Array()  // empty
      ```
      
      **After**
      ```scala
      scala> sc.parallelize(1 to 5).coalesce(0)
      java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
      ...
      scala> sc.parallelize(1 to 5).repartition(0)
      java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
      ...
      scala> spark.sql("select 1").coalesce(0)
      java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
      ...
      scala> spark.sql("select 1").repartition(0)
      java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
      ...
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests with new testcases.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13282 from dongjoon-hyun/SPARK-15512.
      f08bf587
    • Tathagata Das's avatar
      [SPARK-15458][SQL][STREAMING] Disable schema inference for streaming datasets on file streams · e631b819
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Relying on the schema being inferred in file streams can break easily for multiple reasons:
      - accidentally running on a directory which has no data
      - the schema changing underneath
      - on restart, the query will infer the schema again, and may unexpectedly infer an incorrect schema, as the files in the directory may be different at the time of the restart.
      
      To avoid these complicated scenarios, for Spark 2.0 we are going to disable schema inference by default behind a config, so that the user is forced to consider explicitly what schema they want, rather than having the system try to infer it and run into weird corner cases.
      
      In this PR, I introduce a SQLConf that determines whether schema inference for file streams is allowed or not. It is disabled by default.
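      A sketch of what this means in practice: supply the schema explicitly when creating a file stream (path and fields are made up; the exact name of the new SQLConf is not shown here):

      ```scala
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

      val spark = SparkSession.builder().master("local[*]").appName("file-stream").getOrCreate()

      // With inference disabled by default, the schema must be stated up front.
      val schema = StructType(Seq(
        StructField("id", LongType),
        StructField("event", StringType)))

      val events = spark.readStream
        .schema(schema)          // explicit schema instead of inference
        .json("/tmp/input-events")
      ```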
      
      ## How was this patch tested?
      Updated unit tests that test error behavior with and without schema inference enabled.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #13238 from tdas/SPARK-15458.
      e631b819
    • Nick Pentreath's avatar
      [SPARK-15502][DOC][ML][PYSPARK] add guide note that ALS only supports integer ids · 20900e5f
      Nick Pentreath authored
      This PR adds a note to clarify that the ML API for ALS only supports integers for user/item ids, and that other types for these columns can be used but the ids must fall within integer range.
      
      (Refer [SPARK-14891](https://issues.apache.org/jira/browse/SPARK-14891)).
      
      Also cleaned up a reference to `mllib` in the ML doc.
      
      ## How was this patch tested?
      Built and viewed User Guide doc locally.
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #13278 from MLnick/SPARK-15502-als-int-id-doc-note.
      20900e5f
    • Dongjoon Hyun's avatar
      [MINOR][CORE][TEST] Update obsolete `takeSample` test case. · be99a99f
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR fixes some obsolete comments and an assertion in the `takeSample` test case of `RDDSuite.scala`.
      
      ## How was this patch tested?
      
      This fixes the testcase only.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13260 from dongjoon-hyun/SPARK-15481.
      be99a99f
    • wangyang's avatar
      [SPARK-15388][SQL] Fix spark sql CREATE FUNCTION with hive 1.2.1 · 784cc07d
      wangyang authored
      ## What changes were proposed in this pull request?
      
      spark.sql("CREATE FUNCTION myfunc AS 'com.haizhi.bdp.udf.UDFGetGeoCode'") throws "org.apache.hadoop.hive.ql.metadata.HiveException:MetaException(message:NoSuchObjectException(message:Function default.myfunc does not exist))" with hive 1.2.1.
      
      I think it was introduced by PR #12853. It is fixed by catching `Exception` (not `NoSuchObjectException`) and string matching.
      
      ## How was this patch tested?
      
      added a unit test and also tested it manually
      
      Author: wangyang <wangyang@haizhi.com>
      
      Closes #13177 from wangyang1992/fixCreateFunc2.
      784cc07d
    • Marcelo Vanzin's avatar
      [SPARK-15405][YARN] Remove unnecessary upload of config archive. · a313a5ae
      Marcelo Vanzin authored
      We only need one copy of it. The client code that was uploading the
      second copy just needs to be modified to update the metadata in the
      cache, so that the AM knows where to find the configuration.
      
      Tested by running app on YARN and verifying in the logs only one archive
      is uploaded.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #13232 from vanzin/SPARK-15405.
      a313a5ae