  1. Oct 14, 2016
    • Nick Pentreath's avatar
      [SPARK-16063][SQL] Add storageLevel to Dataset · 5aeb7384
      Nick Pentreath authored
      [SPARK-11905](https://issues.apache.org/jira/browse/SPARK-11905) added support for `persist`/`cache` for `Dataset`. However, there is no user-facing API to check whether a `Dataset` is cached and, if so, what its storage level is. This PR adds `getStorageLevel` to `Dataset`, analogous to `RDD.getStorageLevel`.
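
      A brief usage sketch in a `spark-shell` session (assuming the user-facing API surfaces as `Dataset.storageLevel`, per the title above; output comments are illustrative):
      ```scala
      import org.apache.spark.storage.StorageLevel

      val ds = spark.range(10)
      ds.storageLevel                              // StorageLevel.NONE before caching
      ds.persist(StorageLevel.MEMORY_AND_DISK)
      ds.storageLevel                              // MEMORY_AND_DISK once persisted
      ds.unpersist()
      ```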
      
      Updated `DatasetCacheSuite`.
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #13780 from MLnick/ds-storagelevel.
      
      Signed-off-by: Michael Armbrust <michael@databricks.com>
      5aeb7384
    • Davies Liu's avatar
      [SPARK-17863][SQL] should not add column into Distinct · da9aeb0f
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      We try to resolve the attribute in a sort by pulling up some columns from the grandchild into the child, but that is wrong when the child is a Distinct: the added column changes the behavior of the Distinct, so we should not do it.
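
      A hypothetical illustration of why pulling the sort column into a Distinct is unsafe (table and column names are made up):
      ```scala
      import spark.implicits._  // assuming a spark-shell session where `spark` is in scope

      // Rows (a=1, b=10) and (a=1, b=20): only one distinct value of `a`.
      Seq((1, 10), (1, 20)).toDF("a", "b").createOrReplaceTempView("t")

      // For: SELECT DISTINCT a FROM t ORDER BY b
      // If the analyzer resolved `b` by pulling it up into the Distinct, the query would behave like
      //   SELECT a FROM (SELECT DISTINCT a, b FROM t) ORDER BY b
      // and return two rows instead of one, silently changing what DISTINCT means.
      ```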
      
      ## How was this patch tested?
      
      Added regression test.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #15489 from davies/order_distinct.
      da9aeb0f
    • Yin Huai's avatar
      Revert "[SPARK-17620][SQL] Determine Serde by hive.default.fileformat when... · 522dd0d0
      Yin Huai authored
      Revert "[SPARK-17620][SQL] Determine Serde by hive.default.fileformat when Creating Hive Serde Tables"
      
      This reverts commit 7ab86244.
      522dd0d0
    • Dilip Biswal's avatar
      [SPARK-17620][SQL] Determine Serde by hive.default.fileformat when Creating Hive Serde Tables · 7ab86244
      Dilip Biswal authored
      ## What changes were proposed in this pull request?
      Make sure hive.default.fileformat is used when creating the storage format metadata.
      
      Output
      ``` SQL
      scala> spark.sql("SET hive.default.fileformat=orc")
      res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
      
      scala> spark.sql("CREATE TABLE tmp_default(id INT)")
      res2: org.apache.spark.sql.DataFrame = []
      ```
      Before
      ```SQL
      scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
      ..
      [# Storage Information,,]
      [SerDe Library:,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,]
      [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
      [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
      [Compressed:,No,]
      [Storage Desc Parameters:,,]
      [  serialization.format,1,]
      ```
      After
      ```SQL
      scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
      ..
      [# Storage Information,,]
      [SerDe Library:,org.apache.hadoop.hive.ql.io.orc.OrcSerde,]
      [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
      [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
      [Compressed:,No,]
      [Storage Desc Parameters:,,]
      [  serialization.format,1,]
      
      ```
      
      ## How was this patch tested?
      Added new tests to HiveDDLCommandSuite
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #15190 from dilipbiswal/orc.
      7ab86244
    • sethah's avatar
      [SPARK-17941][ML][TEST] Logistic regression tests should use sample weights. · de1c1ca5
      sethah authored
      ## What changes were proposed in this pull request?
      
      The sample weight testing for logistic regression is not robust. The logistic regression suite already has many test cases comparing results to R glmnet. Since both libraries support sample weights, we should use sample weights in the tests to increase coverage for sample weighting. This patch doesn't really add any code; it makes the testing more complete.
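
      A rough sketch of the pattern such tests exercise (the data and column names here are made up, not the actual test fixtures):
      ```scala
      import org.apache.spark.ml.classification.LogisticRegression
      import org.apache.spark.ml.linalg.Vectors
      import spark.implicits._  // assuming a spark-shell session

      // Fit with an explicit sample-weight column, mirroring glmnet's `weights` argument.
      val df = Seq(
        (0.0, 1.0, Vectors.dense(0.1, 0.5)),
        (0.0, 2.0, Vectors.dense(0.2, 0.3)),
        (1.0, 3.0, Vectors.dense(0.9, 0.4))
      ).toDF("label", "weight", "features")

      val model = new LogisticRegression()
        .setWeightCol("weight")
        .fit(df)
      println(model.coefficients)
      ```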
      
      Also fixed some errors in the R code referenced in the test suite. Changed `standardization=T` to `standardize=T` since the former is invalid.
      
      ## How was this patch tested?
      
      Existing unit tests are modified. No non-test code is touched.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #15488 from sethah/logreg_weight_tests.
      de1c1ca5
    • Tathagata Das's avatar
      [TEST] Ignore flaky test in StreamingQueryListenerSuite · 05800b4b
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Ignoring the flaky test introduced in #15307
      
      https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/1736/testReport/junit/org.apache.spark.sql.streaming/StreamingQueryListenerSuite/single_listener__check_trigger_statuses/
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #15491 from tdas/metrics-flaky-test.
      05800b4b
    • Andrew Ash's avatar
      Typo: form -> from · fa37877a
      Andrew Ash authored
      ## What changes were proposed in this pull request?
      
      Minor typo fix
      
      ## How was this patch tested?
      
      Existing unit tests on Jenkins
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #15486 from ash211/patch-8.
      fa37877a
    • Dhruve Ashar's avatar
      [DOC] Fix typo in sql hive doc · a0ebcb3a
      Dhruve Ashar authored
      Change is too trivial to file a JIRA.
      
      Author: Dhruve Ashar <dhruveashar@gmail.com>
      
      Closes #15485 from dhruve/master.
      a0ebcb3a
    • wangzhenhua's avatar
      [SPARK-17073][SQL][FOLLOWUP] generate column-level statistics · 7486442f
      wangzhenhua authored
      ## What changes were proposed in this pull request?
      This PR adds some test cases for statistics: case-sensitive column names, non-ASCII column names, and refreshing tables. It also improves some documentation.
      
      ## How was this patch tested?
      add test cases
      
      Author: wangzhenhua <wangzhenhua@huawei.com>
      
      Closes #15360 from wzhfy/colStats2.
      7486442f
    • invkrh's avatar
      [SPARK-17855][CORE] Remove query string from jar url · 28b645b1
      invkrh authored
      ## What changes were proposed in this pull request?
      
      Spark-submit supports jar URLs with the http protocol. However, if the URL contains a query string, the `worker.DriverRunner.downloadUserJar()` method will throw a "Did not see expected jar" exception. This is because the method checks for the existence of a downloaded jar whose name contains the query string. This is a problem when your jar is located on a web service which requires additional information to retrieve the file.
      
      This PR just removes the query string before checking jar existence on the worker.
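
      A minimal sketch of the idea (not the actual `DriverRunner` code): strip the query string when deriving the local file name to check for.
      ```scala
      import java.net.URI

      // Derive the expected local jar name from the URL path, ignoring any query string.
      def expectedJarName(jarUrl: String): String = {
        val path = new URI(jarUrl).getPath          // "/spark-job.jar" (query string dropped)
        path.substring(path.lastIndexOf('/') + 1)
      }

      expectedJarName("http://localhost/spark-job.jar?param=1")  // "spark-job.jar"
      ```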
      
      ## How was this patch tested?
      
      For now, this patch can only be tested manually:
      * Deploy a spark cluster locally
      * Make sure apache httpd service is on
      * Save an uber jar, e.g. spark-job.jar, under `/var/www/html/`
      * Use http://localhost/spark-job.jar?param=1 as the jar URL when running `spark-submit`
      * The job should launch successfully
      
      Author: invkrh <invkrh@gmail.com>
      
      Closes #15420 from invkrh/spark-17855.
      28b645b1
    • Peng's avatar
      [SPARK-17870][MLLIB][ML] Change statistic to pValue for SelectKBest and... · c8b612de
      Peng authored
      [SPARK-17870][MLLIB][ML] Change statistic to pValue for SelectKBest and SelectPercentile because of DoF difference
      
      ## What changes were proposed in this pull request?
      
      The feature selection method ChiSquareSelector uses ChiSquareTestResult.statistic (the chi-square value) to select features, picking the features with the largest chi-square values. However, the degrees of freedom (df) of the chi-square values returned by Statistics.chiSqTest(RDD) can differ, and chi-square values with different df cannot be compared directly to select features.
      
      So we change the selection criterion from the statistic to the pValue for SelectKBest and SelectPercentile.
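
      A toy sketch of the criterion change (hypothetical numbers, not the actual selector code): rank by ascending p-value rather than by descending chi-square statistic.
      ```scala
      // (featureIndex, chiSqStatistic, pValue) — feature 0 has the larger statistic but,
      // because of a different degree of freedom, the weaker (larger) p-value.
      val testResults = Seq((0, 10.0, 0.20), (1, 6.0, 0.01))

      val byStatistic = testResults.sortBy { case (_, stat, _) => -stat }.map(_._1)  // Seq(0, 1)
      val byPValue    = testResults.sortBy { case (_, _, p) => p }.map(_._1)         // Seq(1, 0)
      // SelectKBest/SelectPercentile now use the p-value ordering.
      ```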
      
      ## How was this patch tested?
      change existing test
      
      Author: Peng <peng.meng@intel.com>
      
      Closes #15444 from mpjlu/chisqure-bug.
      c8b612de
    • Zheng RuiFeng's avatar
      [SPARK-14634][ML] Add BisectingKMeansSummary · a1b136d0
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Add BisectingKMeansSummary
      
      ## How was this patch tested?
      unit test
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #12394 from zhengruifeng/biKMSummary.
      a1b136d0
    • Yanbo Liang's avatar
      [SPARK-15402][ML][PYSPARK] PySpark ml.evaluation should support save/load · 1db8feab
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Since ```ml.evaluation``` already supports save/load on the Scala side, supporting it on the Python side is straightforward.
      
      ## How was this patch tested?
      Add python doctest.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #13194 from yanboliang/spark-15402.
      1db8feab
    • Wenchen Fan's avatar
      [SPARK-17903][SQL] MetastoreRelation should talk to external catalog instead of hive client · 2fb12b0a
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      `HiveExternalCatalog` should be the only interface for talking to the Hive metastore. In `MetastoreRelation` we can just use `ExternalCatalog` instead of `HiveClient` to interact with the Hive metastore, and add the missing APIs to `ExternalCatalog`.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15460 from cloud-fan/relation.
      2fb12b0a
    • Reynold Xin's avatar
      [SPARK-17925][SQL] Break fileSourceInterfaces.scala into multiple pieces · 6c29b3de
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch does a few changes to the file structure of data sources:
      
      - Break fileSourceInterfaces.scala into multiple pieces (HadoopFsRelation, FileFormat, OutputWriter)
      - Move ParquetOutputWriter into its own file
      
      I created this as a separate patch so it'd be easier to review my future PRs that focus on refactoring this internal logic. This patch only moves code around, and has no logic changes.
      
      ## How was this patch tested?
      N/A - should be covered by existing tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15473 from rxin/SPARK-17925.
      6c29b3de
  2. Oct 13, 2016
    • Reynold Xin's avatar
      [SPARK-17927][SQL] Remove dead code in WriterContainer. · 8543996c
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      speculationEnabled and DATASOURCE_OUTPUTPATH seem like just dead code.
      
      ## How was this patch tested?
      Tests should fail if they are not dead code.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15477 from rxin/SPARK-17927.
      8543996c
    • Yanbo Liang's avatar
      [SPARK-15957][FOLLOW-UP][ML][PYSPARK] Add Python API for RFormula forceIndexLabel. · 44cbb61b
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Follow-up work of #13675, add Python API for ```RFormula forceIndexLabel```.
      
      ## How was this patch tested?
      Unit test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15430 from yanboliang/spark-15957-python.
      44cbb61b
    • Jakob Odersky's avatar
      [SPARK-17368][SQL] Add support for value class serialization and deserialization · 9dc0ca06
      Jakob Odersky authored
      ## What changes were proposed in this pull request?
      Value classes were unsupported because catalyst data types were
      obtained through reflection on erased types, which would resolve to a
      value class' wrapped type and hence lead to unavailable methods during
      code generation.
      
      E.g. the following class
      ```scala
      case class Foo(x: Int) extends AnyVal
      ```
      would be seen as an `int` in catalyst and would cause cast failures when the generated Java code tries to treat it as a `Foo`.
      
      This patch simply removes the erasure step when getting data types for
      catalyst.
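
      A hypothetical usage sketch of what this enables (assuming value classes can now appear as fields of a case class encoded in a Dataset):
      ```scala
      import spark.implicits._  // assuming a spark-shell session

      case class Foo(x: Int) extends AnyVal
      case class Record(id: Long, foo: Foo)

      val ds = Seq(Record(1L, Foo(42)), Record(2L, Foo(7))).toDS()
      ds.map(_.foo.x).collect()   // Array(42, 7), with `foo` encoded via its wrapped Int
      ```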
      
      ## How was this patch tested?
      Additional tests in `ExpressionEncoderSuite`.
      
      Author: Jakob Odersky <jakob@odersky.com>
      
      Closes #15284 from jodersky/value-classes.
      9dc0ca06
    • petermaxlee's avatar
      [SPARK-17661][SQL] Consolidate various listLeafFiles implementations · adc11242
      petermaxlee authored
      ## What changes were proposed in this pull request?
      There are 4 listLeafFiles-related functions in Spark:
      
      - ListingFileCatalog.listLeafFiles (which calls HadoopFsRelation.listLeafFilesInParallel if the number of paths passed in is greater than a threshold; if it is lower, then it has its own serial version implemented)
      - HadoopFsRelation.listLeafFiles (called only by HadoopFsRelation.listLeafFilesInParallel)
      - HadoopFsRelation.listLeafFilesInParallel (called only by ListingFileCatalog.listLeafFiles)
      
      It is actually very confusing and error prone because there are effectively two distinct implementations of the serial version of listing leaf files. As an example, SPARK-17599 updated only one of the code paths and ignored the other.
      
      This code can be improved by:
      
      - Move all file listing code into ListingFileCatalog, since it is the only class that needs this.
      - Keep only one function for listing files in serial.
      
      ## How was this patch tested?
      This change should be covered by existing unit and integration tests. I also moved a test case for HadoopFsRelation.shouldFilterOut from HadoopFsRelationSuite to ListingFileCatalogSuite.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      
      Closes #15235 from petermaxlee/SPARK-17661.
      adc11242
    • Tathagata Das's avatar
      [SPARK-17731][SQL][STREAMING] Metrics for structured streaming · 7106866c
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Metrics are needed for monitoring structured streaming apps. Here is the design doc for implementing the necessary metrics.
      https://docs.google.com/document/d/1NIdcGuR1B3WIe8t7VxLrt58TJB4DtipWEbj5I_mzJys/edit?usp=sharing
      
      Specifically, this PR adds the following public API changes (a usage sketch follows the list of new APIs).
      
      ### New APIs
      - `StreamingQuery.status` returns a `StreamingQueryStatus` object (renamed from `StreamingQueryInfo`, see later)
      
      - `StreamingQueryStatus` has the following important fields
        - inputRate - Current rate (rows/sec) at which data is being generated by all the sources
        - processingRate - Current rate (rows/sec) at which the query is processing data from all the sources
        - ~~outputRate~~ - *Does not work with wholestage codegen*
        - latency - Current average latency between the data being available in source and the sink writing the corresponding output
        - sourceStatuses: Array[SourceStatus] - Current statuses of the sources
        - sinkStatus: SinkStatus - Current status of the sink
        - triggerStatus - Low-level detailed status of the last completed/currently active trigger
          - latencies - getOffset, getBatch, full trigger, wal writes
          - timestamps - trigger start, finish, after getOffset, after getBatch
          - numRows - input, output, state total/updated rows for aggregations
      
      - `SourceStatus` has the following important fields
        - inputRate - Current rate (rows/sec) at which data is being generated by the source
        - processingRate - Current rate (rows/sec) at which the query is processing data from the source
        - triggerStatus - Low-level detailed status of the last completed/currently active trigger
      
      - Python API for `StreamingQuery.status()`
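
      A rough usage sketch of the status API listed above (field names are taken from this description; exact types and signatures are assumptions):
      ```scala
      // Assuming `query` is a started org.apache.spark.sql.streaming.StreamingQuery.
      val status = query.status                    // StreamingQueryStatus (was StreamingQueryInfo)

      println(status.inputRate)                    // rows/sec generated across all sources
      println(status.processingRate)               // rows/sec processed across all sources
      println(status.latency)                      // avg source-to-sink latency
      status.sourceStatuses.foreach(s => println(s"${s.inputRate} / ${s.processingRate}"))
      println(status.sinkStatus)
      ```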
      
      ### Breaking changes to existing APIs
      **Existing direct public facing APIs**
      - Deprecated direct public-facing APIs `StreamingQuery.sourceStatuses` and `StreamingQuery.sinkStatus` in favour of `StreamingQuery.status.sourceStatuses/sinkStatus`.
        - Branch 2.0 should have it deprecated, master should have it removed.
      
      **Existing advanced listener APIs**
      - `StreamingQueryInfo` renamed to `StreamingQueryStatus` for consistency with `SourceStatus`, `SinkStatus`
         - Earlier StreamingQueryInfo was used only in the advanced listener API, but now it is used in direct public-facing API (StreamingQuery.status)
      
      - Field `queryInfo` in the listener events `QueryStarted`, `QueryProgress`, and `QueryTerminated` has been renamed to `queryStatus` and its return type changed to `StreamingQueryStatus`.
      
      - Field `offsetDesc` in `SourceStatus` was `Option[String]`; it has been converted to `String`.
      
      - For `SourceStatus` and `SinkStatus`, made the constructors private instead of private[sql] to make them more Java-safe. Instead added `private[sql] object SourceStatus/SinkStatus.apply()`, which are harder to accidentally use from Java.
      
      ## How was this patch tested?
      
      Old and new unit tests.
      - Rate calculation and other internal logic of StreamMetrics tested by StreamMetricsSuite.
      - New info in statuses returned through StreamingQueryListener is tested in StreamingQueryListenerSuite.
      - New and old info returned through StreamingQuery.status is tested in StreamingQuerySuite.
      - Source-specific tests for making sure input rows are counted are in source-specific test suites.
      - Additional tests to test minor additions in LocalTableScanExec, StateStore, etc.
      
      Metrics were also manually tested using the Ganglia sink.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #15307 from tdas/SPARK-17731.
      7106866c
    • Shixiong Zhu's avatar
      [SPARK-17834][SQL] Fetch the earliest offsets manually in KafkaSource instead... · 08eac356
      Shixiong Zhu authored
      [SPARK-17834][SQL] Fetch the earliest offsets manually in KafkaSource instead of counting on KafkaConsumer
      
      ## What changes were proposed in this pull request?
      
      Because `KafkaConsumer.poll(0)` may update the partition offsets, this PR just calls `seekToBeginning` to manually set the earliest offsets for the KafkaSource initial offsets.
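
      A rough sketch of the approach using the Kafka consumer API directly (not the actual `KafkaSource` code):
      ```scala
      import scala.collection.JavaConverters._
      import org.apache.kafka.clients.consumer.KafkaConsumer
      import org.apache.kafka.common.TopicPartition

      // Explicitly seek to the beginning of each assigned partition and read the positions,
      // instead of relying on poll(0) to have fetched (and possibly moved) the offsets.
      def earliestOffsets(consumer: KafkaConsumer[Array[Byte], Array[Byte]]): Map[TopicPartition, Long] = {
        val partitions = consumer.assignment().asScala.toSeq
        consumer.seekToBeginning(partitions.asJava)
        partitions.map(tp => tp -> consumer.position(tp)).toMap
      }
      ```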
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15397 from zsxwing/SPARK-17834.
      08eac356
    • Pete Robbins's avatar
      [SPARK-17827][SQL] maxColLength type should be Int for String and Binary · 84f149e4
      Pete Robbins authored
      ## What changes were proposed in this pull request?
      correct the expected type from Length function to be Int
      
      ## How was this patch tested?
      Test runs on little endian and big endian platforms
      
      Author: Pete Robbins <robbinspg@gmail.com>
      
      Closes #15464 from robbinspg/SPARK-17827.
      84f149e4
    • Reynold Xin's avatar
      [SPARK-17830][SQL] Annotate remaining SQL APIs with InterfaceStability · 04d417a7
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch annotates all the remaining APIs in SQL (excluding streaming) with InterfaceStability.
      
      ## How was this patch tested?
      N/A - just annotation change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15457 from rxin/SPARK-17830-2.
      04d417a7
    • gatorsmile's avatar
      [SPARK-17657][SQL] Disallow Users to Change Table Type · 0a8e51a5
      gatorsmile authored
      ### What changes were proposed in this pull request?
      Hive allows users to change the table type from `Managed` to `External` or from `External` to `Managed` by altering table's property `EXTERNAL`. See the JIRA: https://issues.apache.org/jira/browse/HIVE-1329
      
      So far, Spark SQL does not correctly support it, although users can do it. Many assumptions are broken in the implementation. Thus, this PR is to disallow users to change it.
      
      In addition, we also do not allow users to set the property `EXTERNAL` when creating a table.
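
      A hypothetical example of what is now rejected (statements kept commented out; the exact error type is an assumption):
      ```scala
      spark.sql("CREATE TABLE t_managed(id INT)")

      // Both of the following are now disallowed and should fail the DDL check:
      // spark.sql("ALTER TABLE t_managed SET TBLPROPERTIES ('EXTERNAL' = 'TRUE')")
      // spark.sql("CREATE TABLE t_bad(id INT) TBLPROPERTIES ('EXTERNAL' = 'TRUE')")
      ```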
      
      ### How was this patch tested?
      Added test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15230 from gatorsmile/alterTableSetExternal.
      0a8e51a5
    • jerryshao's avatar
      [SPARK-17686][CORE] Support printing out scala and java version with spark-submit --version command · 7bf8a404
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      In our universal gateway service we need to specify different jars to Spark according to the Scala version. Currently we can only know which version of Scala a Spark build depends on after launching an application, which makes it hard for us to pick the right jars for different Scala + Spark version combinations.
      
      So this PR proposes to print out the Scala version alongside the Spark version in "spark-submit --version", so that users can leverage this output to make the choice without having to launch an application.
      
      ## How was this patch tested?
      
      Manually verified in local environment.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #15456 from jerryshao/SPARK-17686.
      7bf8a404
    • Wenchen Fan's avatar
      [SPARK-17899][SQL] add a debug mode to keep raw table properties in HiveExternalCatalog · db8784fe
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Currently `HiveExternalCatalog` will filter out the Spark SQL internal table properties, e.g. `spark.sql.sources.provider`, `spark.sql.sources.schema`, etc. This is reasonable for external users as they don't want to see these internal properties in `DESC TABLE`.
      
      However, as Spark developers, sometimes we do want to see the raw table properties. This PR adds a new internal SQL conf, `spark.sql.debug`, to enable debug mode and keep these raw table properties.
      
      This config can also be used in similar places where we wanna retain debug information in the future.
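
      A sketch of how the flag might be enabled (assuming it needs to be set on the configuration before the session and catalog are created):
      ```scala
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder()
        .config("spark.sql.debug", "true")   // internal flag from this PR: keep raw table properties
        .enableHiveSupport()
        .getOrCreate()

      // With the flag on, describing a table should show the spark.sql.sources.* properties as well.
      ```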
      
      ## How was this patch tested?
      
      new test in MetastoreDataSourcesSuite
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15458 from cloud-fan/debug.
      db8784fe
    • Alex Bozarth's avatar
      [SPARK-11272][WEB UI] Add support for downloading event logs from HistoryServer UI · 6f2fa6c5
      Alex Bozarth authored
      ## What changes were proposed in this pull request?
      
      This is a reworked PR based on feedback in #9238 after it was closed and not reopened. As suggested in that PR, I've only added the download feature. This functionality already exists in the API, and this change allows easier access to download event logs to share with others.
      
      I've attached a screenshot of the committed version, but I will also include alternate options with screen shots in the comments below. I'm personally not sure which option is best.
      
      ## How was this patch tested?
      
      Manual testing
      
      ![screen shot 2016-10-07 at 6 11 12 pm](https://cloud.githubusercontent.com/assets/13952758/19209213/832fe48e-8cba-11e6-9840-749b1be4d399.png)
      
      Author: Alex Bozarth <ajbozart@us.ibm.com>
      
      Closes #15400 from ajbozarth/spark11272.
      6f2fa6c5
    • buzhihuojie's avatar
      minor doc fix for Row.scala · 7222a25a
      buzhihuojie authored
      ## What changes were proposed in this pull request?
      
      minor doc fix for "getAnyValAs" in class Row
      
      ## How was this patch tested?
      
      None.
      
      
      Author: buzhihuojie <ren.weiluo@gmail.com>
      
      Closes #15452 from david-weiluo-ren/minorDocFixForRow.
      7222a25a
    • Liang-Chi Hsieh's avatar
      [SPARK-17866][SPARK-17867][SQL] Fix Dataset.dropduplicates · 064d6650
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      Two issues regarding Dataset.dropduplicates:
      
      1. Dataset.dropDuplicates should consider the columns with same column name
      
          In `Dataset.dropDuplicates`, we find and use the first resolved attribute in the output with the given column name. When there is more than one column with the same name, the other columns are put into the aggregation columns instead of the grouping columns.
      
      2. Dataset.dropDuplicates should not change the output of child plan
      
          We currently create a new `Alias` with a new exprId in `Dataset.dropDuplicates`. However, this causes a problem when we want to select the columns as follows:
      
              val ds = Seq(("a", 1), ("a", 2), ("b", 1), ("a", 1)).toDS()
              // ds("_2") will cause analysis exception
              ds.dropDuplicates("_1").select(ds("_1").as[String], ds("_2").as[Int])
      
      Because the two issues are both related to `Dataset.dropDuplicates` and the code changes are not big, they are submitted together as one PR.
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #15427 from viirya/fix-dropduplicates.
      064d6650
  3. Oct 12, 2016
    • Burak Yavuz's avatar
      [SPARK-17876] Write StructuredStreaming WAL to a stream instead of materializing all at once · edeb51a3
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      The CompactibleFileStreamLog materializes the whole metadata log in memory as a String. This can cause issues when there are lots of files that are being committed, especially during a compaction batch.
      You may come across stacktraces that look like:
      ```
      java.lang.OutOfMemoryError: Requested array size exceeds VM limit
      at java.lang.StringCoding.encode(StringCoding.java:350)
      at java.lang.String.getBytes(String.java:941)
      at org.apache.spark.sql.execution.streaming.FileStreamSinkLog.serialize(FileStreamSinkLog.scala:127)
      
      ```
      The safer way is to write to an output stream so that we don't have to materialize a huge string.
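
      A minimal sketch of the serialization change (illustrative only; the real log format and types differ):
      ```scala
      import java.io.OutputStream
      import java.nio.charset.StandardCharsets

      // Write each metadata entry directly to the stream instead of first concatenating
      // everything into one giant String and calling getBytes on it.
      def serialize(entries: Seq[String], out: OutputStream): Unit = {
        entries.foreach { entry =>
          out.write(entry.getBytes(StandardCharsets.UTF_8))
          out.write('\n')
        }
      }
      ```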
      
      ## How was this patch tested?
      
      Existing unit tests
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #15437 from brkyvz/ser-to-stream.
      edeb51a3
    • Yanbo Liang's avatar
      [SPARK-17835][ML][MLLIB] Optimize NaiveBayes mllib wrapper to eliminate extra pass on data · 21cb59f1
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      [SPARK-14077](https://issues.apache.org/jira/browse/SPARK-14077) copied the ```NaiveBayes``` implementation from mllib to ml and left mllib as a wrapper. However, there are some differences between mllib and ml in how labels are handled:
      * mllib allows input labels in {-1, +1}, whereas ml assumes the input labels are in the range [0, numClasses).
      * mllib's ```NaiveBayesModel``` exposes ```labels```, but ml does not, due to the assumption mentioned above.
      
      During the copy in [SPARK-14077](https://issues.apache.org/jira/browse/SPARK-14077), we used
      ```val labels = data.map(_.label).distinct().collect().sorted```
      to get the distinct labels first and then encode them for training. This involves an extra Spark job compared with the original implementation. Since ```NaiveBayes``` only does one aggregation pass during training, adding another pass seems inefficient. We can get the labels in a single pass along with ```NaiveBayes``` training and send them to the MLlib side.
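
      A rough sketch of the idea (not the actual aggregation code): the per-label aggregation NaiveBayes already performs yields the distinct labels as its keys, so no separate `distinct()` job is needed.
      ```scala
      import org.apache.spark.mllib.regression.LabeledPoint
      import org.apache.spark.rdd.RDD

      // Stand-in for the real sufficient-statistics aggregation: its keys are exactly the labels.
      def labelsFromTrainingPass(data: RDD[LabeledPoint]): Array[Double] = {
        data.map(p => (p.label, 1L))
          .reduceByKey(_ + _)
          .keys
          .collect()
          .sorted
      }
      ```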
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15402 from yanboliang/spark-17835.
      21cb59f1
    • WeichenXu's avatar
      [SPARK-17745][ML][PYSPARK] update NB python api - add weight col parameter · 0d4a6952
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      update python api for NaiveBayes: add weight col parameter.
      
      ## How was this patch tested?
      
      doctests added.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #15406 from WeichenXu123/nb_python_update.
      0d4a6952
    • Reynold Xin's avatar
      [SPARK-17845] [SQL] More self-evident window function frame boundary API · 6f20a92c
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch improves the window function frame boundary API to make it more obvious to read and to use. The two high level changes are:
      
      1. Create Window.currentRow, Window.unboundedPreceding, Window.unboundedFollowing to indicate the special values in frame boundaries. These methods map to the special integral values so we are not breaking backward compatibility here. This change makes the frame boundaries more self-evident (instead of Long.MinValue, it becomes Window.unboundedPreceding).
      
      2. In Python, for any value less than or equal to JVM's Long.MinValue, treat it as Window.unboundedPreceding. For any value larger than or equal to JVM's Long.MaxValue, treat it as Window.unboundedFollowing. Before this change, if the user specifies any value that is less than Long.MinValue but not -sys.maxsize (e.g. -sys.maxsize + 1), the number we pass over to the JVM would overflow, resulting in a frame that does not make sense.
      
      Code example required to specify a frame before this patch:
      ```
      Window.rowsBetween(-Long.MinValue, 0)
      ```
      
      While the above code should still work, the new way is more obvious to read:
      ```
      Window.rowsBetween(Window.unboundedPreceding, Window.currentRow)
      ```
      
      ## How was this patch tested?
      - Updated DataFrameWindowSuite (for Scala/Java)
      - Updated test_window_functions_cumulative_sum (for Python)
      - Renamed DataFrameWindowSuite to DataFrameWindowFunctionsSuite to better reflect its purpose
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15438 from rxin/SPARK-17845.
      6f20a92c
    • cody koeninger's avatar
      [SPARK-17782][STREAMING][KAFKA] alternative eliminate race condition of poll twice · f9a56a15
      cody koeninger authored
      ## What changes were proposed in this pull request?
      
      Alternative approach to https://github.com/apache/spark/pull/15387
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #15401 from koeninger/SPARK-17782-alt.
      f9a56a15
    • Imran Rashid's avatar
      [SPARK-17675][CORE] Expand Blacklist for TaskSets · 9ce7d3e5
      Imran Rashid authored
      ## What changes were proposed in this pull request?
      
      This is a step along the way to SPARK-8425.
      
      To enable incremental review, the first step proposed here is to expand the blacklisting within tasksets. In particular, this will enable blacklisting for
      * (task, executor) pairs (this already exists via an undocumented config)
      * (task, node)
      * (taskset, executor)
      * (taskset, node)
      
      Adding (task, node) is critical to making Spark tolerant of one bad disk in a cluster, without requiring careful tuning of "spark.task.maxFailures". The other additions are also important to avoid many misleading task failures and long scheduling delays when there is one bad node in a large cluster.
      
      Note that some of the code changes here aren't really required for just this -- they put pieces in place for SPARK-8425 even though they are not used yet (eg. the `BlacklistTracker` helper is a little out of place, `TaskSetBlacklist` holds onto a little more info than it needs to for just this change, and `ExecutorFailuresInTaskSet` is more complex than it needs to be).
      
      ## How was this patch tested?
      
      Added unit tests, run tests via jenkins.
      
      Author: Imran Rashid <irashid@cloudera.com>
      Author: mwws <wei.mao@intel.com>
      
      Closes #15249 from squito/taskset_blacklist_only.
      9ce7d3e5
    • Shixiong Zhu's avatar
      [SPARK-17850][CORE] Add a flag to ignore corrupt files · 47776e7c
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Add a flag to ignore corrupt files. For Spark core, the configuration is `spark.files.ignoreCorruptFiles`. For Spark SQL, it's `spark.sql.files.ignoreCorruptFiles`.
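
      A usage sketch with the two flags named above (placement on the session builder is an assumption):
      ```scala
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder()
        .config("spark.files.ignoreCorruptFiles", "true")      // Spark core file reads
        .config("spark.sql.files.ignoreCorruptFiles", "true")  // Spark SQL file sources
        .getOrCreate()
      ```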
      
      ## How was this patch tested?
      
      The added unit tests
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15422 from zsxwing/SPARK-17850.
      47776e7c
    • Sean Owen's avatar
      [BUILD] Closing stale PRs · eb69335c
      Sean Owen authored
      Closes #15303
      Closes #15078
      Closes #15080
      Closes #15135
      Closes #14565
      Closes #12355
      Closes #15404
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15451 from srowen/CloseStalePRs.
      eb69335c
    • Sean Owen's avatar
      [SPARK-17840][DOCS] Add some pointers for wiki/CONTRIBUTING.md in README.md... · f8062b63
      Sean Owen authored
      [SPARK-17840][DOCS] Add some pointers for wiki/CONTRIBUTING.md in README.md and some warnings in PULL_REQUEST_TEMPLATE
      
      ## What changes were proposed in this pull request?
      
      Link to contributing wiki in PR template, README.md
      
      ## How was this patch tested?
      
      Doc-only change, tested by Jekyll
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15429 from srowen/SPARK-17840.
      f8062b63
    • Hossein's avatar
      [SPARK-17790][SPARKR] Support for parallelizing R data.frame larger than 2GB · 5cc503f4
      Hossein authored
      ## What changes were proposed in this pull request?
      If the R data structure being parallelized is larger than `INT_MAX`, we use files to transfer data to the JVM. The serialization protocol mimics Python pickling. This allows us to simply call `PythonRDD.readRDDFromFile` to create the RDD.
      
      I tested this on my MacBook. The following code works with this patch:
      ```R
      intMax <- .Machine$integer.max
      largeVec <- 1:intMax
      rdd <- SparkR:::parallelize(sc, largeVec, 2)
      ```
      
      ## How was this patch tested?
      * [x] Unit tests
      
      Author: Hossein <hossein@databricks.com>
      
      Closes #15375 from falaki/SPARK-17790.
      5cc503f4
    • prigarg's avatar
      [SPARK-17884][SQL] To resolve Null pointer exception when casting from empty... · d5580eba
      prigarg authored
      [SPARK-17884][SQL] To resolve Null pointer exception when casting from empty string to interval type.
      
      ## What changes were proposed in this pull request?
      This change adds a check in the castToInterval method of the Cast expression, such that if the converted value is null, the isNull variable is set to true.
      
      Earlier, the expression Cast(Literal(), CalendarIntervalType) was throwing a NullPointerException for the reason mentioned above.
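
      A sketch of the fixed behavior at the expression level, matching the description above (evaluation details are an assumption):
      ```scala
      import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
      import org.apache.spark.sql.types.CalendarIntervalType

      // Previously this threw a NullPointerException; with the added isNull check it evaluates to null.
      val result = Cast(Literal(""), CalendarIntervalType).eval()
      assert(result == null)
      ```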
      
      ## How was this patch tested?
      Added test case in CastSuite.scala
      
      jira entry for detail: https://issues.apache.org/jira/browse/SPARK-17884
      
      Author: prigarg <prigarg@adobe.com>
      
      Closes #15449 from priyankagargnitk/SPARK-17884.
      d5580eba