  1. May 26, 2017
      [SPARK-20694][DOCS][SQL] Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide · ae33abf7
      zero323 authored
      ## What changes were proposed in this pull request?
      
- Add Scala, Python and Java examples for `partitionBy`, `sortBy` and `bucketBy` (see the Scala sketch below).
      - Add _Bucketing, Sorting and Partitioning_ section to SQL Programming Guide
      - Remove bucketing from Unsupported Hive Functionalities.
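For reference, a minimal Scala sketch of the API being documented, assuming an existing DataFrame `peopleDF` (names are illustrative, in the guide's style):

```scala
// Partitioning lays out the output in one directory per value of the column.
peopleDF.write
  .partitionBy("favorite_color")
  .format("parquet")
  .save("namesPartByColor.parquet")

// Bucketing and sorting only work with persistent tables, hence saveAsTable.
peopleDF.write
  .bucketBy(42, "name")
  .sortBy("age")
  .saveAsTable("people_bucketed")
```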
      
      ## How was this patch tested?
      
      Manual tests, docs build.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17938 from zero323/DOCS-BUCKETING-AND-PARTITIONING.
      ae33abf7
    • Sital Kedia's avatar
      [SPARK-20014] Optimize mergeSpillsWithFileStream method · 473d7552
      Sital Kedia authored
      ## What changes were proposed in this pull request?
      
When the individual partition sizes in a spill are small, the mergeSpillsWithTransferTo method performs many small disk I/Os, which is really inefficient. One way to improve performance is to use the mergeSpillsWithFileStream method instead, by turning off transferTo and using buffered file reads/writes to improve the I/O throughput.
However, the current implementation of mergeSpillsWithFileStream does not do buffered reads/writes of the files, and in addition it unnecessarily flushes the output file for each partition.
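A hedged sketch of the buffering idea (buffer sizes and names are assumptions, not the PR's actual code):

```scala
import java.io._

// Copy a spill file into the merged output through large buffers, so small
// per-partition writes become big sequential I/Os; the flush happens once on
// close, not once per partition.
def copyBuffered(spillFile: File, mergedFile: File): Unit = {
  val in  = new BufferedInputStream(new FileInputStream(spillFile), 1024 * 1024)
  val out = new BufferedOutputStream(new FileOutputStream(mergedFile), 1024 * 1024)
  try {
    val buf = new Array[Byte](8192)
    var n = in.read(buf)
    while (n != -1) { out.write(buf, 0, n); n = in.read(buf) }
  } finally {
    out.close(); in.close()
  }
}
```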
      
      ## How was this patch tested?
      
Tested this change by running a job on the cluster; the map stage run time was reduced by around 20%.
      
      Author: Sital Kedia <skedia@fb.com>
      
      Closes #17343 from sitalkedia/upstream_mergeSpillsWithFileStream.
      [SPARK-20844] Remove experimental from Structured Streaming APIs · d935e0a9
      Michael Armbrust authored
Now that Structured Streaming has been out for several Spark releases and has large production use cases, the `Experimental` label is no longer appropriate. I've left `InterfaceStability.Evolving`, however, as I think we may make a few changes to the pluggable Source & Sink API in Spark 2.3.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #18065 from marmbrus/streamingGA.
      [SPARK-20835][CORE] It should exit directly when the --total-executor-cores... · 0fd84b05
      10129659 authored
[SPARK-20835][CORE] It should exit directly when the --total-executor-cores parameter is set to less than 0 when submitting an application
      
      ## What changes were proposed in this pull request?
In my test, the submitted app ran without an error when --total-executor-cores was less than 0,
and gave the warning:
      "2017-05-22 17:19:36,319 WARN org.apache.spark.scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources";
      
It should exit directly when the --total-executor-cores parameter is set to less than 0 when submitting an application.
      
      ## How was this patch tested?
Ran the unit tests.
      
      Author: 10129659 <chen.yanshan@zte.com.cn>
      
      Closes #18060 from eatoncys/totalcores.
      [SPARK-20887][CORE] support alternative keys in ConfigBuilder · 629f38e1
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
`ConfigBuilder` builds a `ConfigEntry` that can only read its value from a single key, so if we want to rename a config while still keeping the old name working, it is hard to do.

This PR introduces `ConfigBuilder.withAlternative` to support reading a config value from alternative keys. It also uses this feature to rename `spark.scheduler.listenerbus.eventqueue.size` to `spark.scheduler.listenerbus.eventqueue.capacity`, per https://github.com/apache/spark/pull/14269#discussion_r118432313
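For illustration, the builder call might look like this (a sketch based on the PR description; `ConfigBuilder` is Spark-internal and the exact form may differ):

```scala
// Reads the new "...capacity" key first and falls back to the old "...size" key.
val LISTENER_BUS_EVENT_QUEUE_CAPACITY =
  ConfigBuilder("spark.scheduler.listenerbus.eventqueue.capacity")
    .withAlternative("spark.scheduler.listenerbus.eventqueue.size")
    .intConf
    .createWithDefault(10000)
```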
      
      ## How was this patch tested?
      
      a new test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18110 from cloud-fan/config.
      [MINOR] document edge case of updateFunc usage · b6f2017a
      Wil Selwood authored
      ## What changes were proposed in this pull request?
      
Include documentation of the fact that updateFunc is sometimes called with no new values. This is documented in the main streaming guide: https://spark.apache.org/docs/latest/streaming-programming-guide.html#updatestatebykey-operation; however, from the docs included with the code it is not clear that this is the case.
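A small Scala sketch of the edge case (illustrative, not taken from the patch):

```scala
// updateStateByKey invokes this even for keys with no new data in a batch:
// newValues is then empty, and returning None drops the key from the state.
def updateFunc(newValues: Seq[Int], state: Option[Int]): Option[Int] = {
  if (newValues.isEmpty) state // decide whether to keep or expire existing state
  else Some(state.getOrElse(0) + newValues.sum)
}
```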
      
      ## How was this patch tested?
      
      PR only changes comments. Confirmed code still builds.
      
      Author: Wil Selwood <wil.selwood@sa.catapult.org.uk>
      
      Closes #18088 from wselwood/note-edge-case-in-docs.
      [SPARK-20868][CORE] UnsafeShuffleWriter should verify the position after FileChannel.transferTo · d9ad7890
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
A long time ago we fixed a [bug](https://issues.apache.org/jira/browse/SPARK-3948) in the shuffle writer involving `FileChannel.transferTo`. We were not very confident about that fix, so we added a position check after the write to try to discover the bug earlier.

However, this check is missing in the new `UnsafeShuffleWriter`; this PR adds it.

https://issues.apache.org/jira/browse/SPARK-18105 may be related to that `FileChannel.transferTo` bug; hopefully we can find the root cause after adding this position check.
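A hedged sketch of what such a position check looks like (simplified; not the actual `UnsafeShuffleWriter` code):

```scala
import java.io.IOException
import java.nio.channels.FileChannel

// After transferTo, verify the destination advanced by exactly `count` bytes;
// the old kernel bug could leave the copy silently incomplete.
def copyAndVerify(src: FileChannel, dst: FileChannel, count: Long): Unit = {
  val initialPos = dst.position()
  src.transferTo(0, count, dst)
  if (dst.position() != initialPos + count) {
    throw new IOException(
      s"Expected position ${initialPos + count} after transferTo, found ${dst.position()}")
  }
}
```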
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18091 from cloud-fan/shuffle.
      [SPARK-20849][DOC][SPARKR] Document R DecisionTree · a97c4970
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
1. Add an example for the SparkR `decisionTree`.
2. Document it in the user guide.
      
      ## How was this patch tested?
      local submit
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #18067 from zhengruifeng/dt_example.
      [SPARK-20392][SQL] Set barrier to prevent re-entering a tree · 8ce0d8ff
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
It is reported that there is a performance regression when applying an ML pipeline to a dataset with many columns but few rows.

A big part of the regression comes from operations (e.g., `select`) on DataFrame/Dataset that re-create a new DataFrame/Dataset with a new `LogicalPlan`. In normal SQL usage this cost can be ignored.

However, it's not rare to chain dozens of pipeline stages in ML. When the query plan grows incrementally while running those stages, the total cost spent on re-creating DataFrames grows too. In particular, the `Analyzer` walks the entire big query plan even when most of it has already been analyzed.

By eliminating part of the cost, the time to run the example code locally is reduced from about 1 min to about 30 secs.

In particular, the time spent applying the pipeline locally mostly goes to calling transform on the 137 `Bucketizer`s. Before the change, each call to a `Bucketizer`'s transform cost about 0.4 sec, so the total time spent on all `Bucketizer`s' transforms was about 50 secs. After the change, each call costs only about 0.1 sec.
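A hypothetical micro-illustration of the plan growth (not the benchmark code, which is on the JIRA; assumes a numeric column `x`):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Each select re-creates the Dataset with a larger LogicalPlan, so the
// Analyzer re-walks an ever-growing tree at every stage.
def chainStages(df: DataFrame, n: Int): DataFrame =
  (1 to n).foldLeft(df) { (cur, i) =>
    cur.select(col("*"), (col("x") * i).as(s"bucketized_$i"))
  }
```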
      
      <del>We also make `boundEnc` as lazy variable to reduce unnecessary running time.</del>
      
      ### Performance improvement
      
The code and datasets provided by Barry Becker to reproduce this issue and benchmark it can be found on the JIRA.
      
      Before this patch: about 1 min
      After this patch: about 20 secs
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #17770 from viirya/SPARK-20392.
  2. May 25, 2017
      [SPARK-14659][ML] RFormula consistent with R when handling strings · f47700c9
      Wayne Zhang authored
      ## What changes were proposed in this pull request?
When handling strings, RFormula and R drop different categories:
      - RFormula drops the least frequent level
      - R drops the first level after ascending alphabetical ordering
      
This PR uses the string ordering types added to StringIndexer in #17879 so that RFormula can drop the same level as R when handling strings, using `stringOrderType = "alphabetDesc"` (see the sketch below).
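For example, the ordering option on StringIndexer from #17879 (a sketch; treat setter and column names as illustrative):

```scala
import org.apache.spark.ml.feature.StringIndexer

// Per the PR, "alphabetDesc" lets RFormula drop the same level as R does.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setStringOrderType("alphabetDesc")
```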
      
      ## How was this patch tested?
      new tests
      
      Author: Wayne Zhang <actuaryzhang@uber.com>
      
      Closes #17967 from actuaryzhang/RFormula.
      [SPARK-20775][SQL] Added scala support from_json · 2dbe0c52
      setjet authored
      ## What changes were proposed in this pull request?
      
The `from_json` function required options to be passed in as a `java.util.HashMap`. For other functions, a Java wrapper is provided which casts a Java hashmap to a Scala map; in this case only the Java variant was provided, forcing Scala users to pass in a `java.util.HashMap`.
      
      Added the missing wrapper.
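With the wrapper in place, Scala callers can pass a plain Scala Map for the options (a sketch; assumes a DataFrame `df` with a string column `json`):

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StructType}

val schema = new StructType().add("a", IntegerType)
// The new Scala overload: options as a Scala Map instead of java.util.Map.
val parsed = df.select(from_json(df("json"), schema, Map("mode" -> "PERMISSIVE")))
```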
      
      ## How was this patch tested?
      Added a unit test for passing in a scala map
      
      Author: setjet <rubenljanssen@gmail.com>
      
      Closes #18094 from setjet/spark-20775.
      [SPARK-20888][SQL][DOCS] Document change of default setting of... · c1e7989c
      Michael Allman authored
      [SPARK-20888][SQL][DOCS] Document change of default setting of spark.sql.hive.caseSensitiveInferenceMode
      
      (Link to Jira: https://issues.apache.org/jira/browse/SPARK-20888)
      
      ## What changes were proposed in this pull request?
      
Document the change of the default setting of the spark.sql.hive.caseSensitiveInferenceMode configuration key from NEVER_INFER to INFER_AND_SAVE in the Spark SQL 2.1 to 2.2 migration notes.
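Users who need the old behavior can set the key back explicitly, e.g. (assuming an active SparkSession `spark`):

```scala
// Restore the pre-2.2 default documented in the migration notes.
spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
```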
      
      Author: Michael Allman <michael@videoamp.com>
      
      Closes #18112 from mallman/spark-20888-document_infer_and_save.
      [SPARK-20874][EXAMPLES] Add Structured Streaming Kafka Source to examples project · 98c38529
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Add Structured Streaming Kafka Source to the `examples` project so that people can run `bin/run-example StructuredKafkaWordCount ...`.
      
      ## How was this patch tested?
      
Manually tested it.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #18101 from zsxwing/add-missing-example-dep.
      [SPARK-19707][SPARK-18922][TESTS][SQL][CORE] Fix test failures/the invalid... · e9f983df
      hyukjinkwon authored
      [SPARK-19707][SPARK-18922][TESTS][SQL][CORE] Fix test failures/the invalid path check for sc.addJar on Windows
      
      ## What changes were proposed in this pull request?
      
      This PR proposes two things:
      
      - A follow up for SPARK-19707 (Improving the invalid path check for sc.addJar on Windows as well).
      
      ```
      org.apache.spark.SparkContextSuite:
       - add jar with invalid path *** FAILED *** (32 milliseconds)
         2 was not equal to 1 (SparkContextSuite.scala:309)
         ...
      ```
      
      - Fix path vs URI related test failures on Windows.
      
      ```
      org.apache.spark.storage.LocalDirsSuite:
       - SPARK_LOCAL_DIRS override also affects driver *** FAILED *** (0 milliseconds)
         new java.io.File("/NONEXISTENT_PATH").exists() was true (LocalDirsSuite.scala:50)
         ...
      
       - Utils.getLocalDir() throws an exception if any temporary directory cannot be retrieved *** FAILED *** (15 milliseconds)
         Expected exception java.io.IOException to be thrown, but no exception was thrown. (LocalDirsSuite.scala:64)
         ...
      ```
      
      ```
      org.apache.spark.sql.hive.HiveSchemaInferenceSuite:
       - orc: schema should be inferred and saved when INFER_AND_SAVE is specified *** FAILED *** (203 milliseconds)
         java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-dae61ab3-a851-4dd3-bf4e-be97c501f254
         ...
      
       - parquet: schema should be inferred and saved when INFER_AND_SAVE is specified *** FAILED *** (203 milliseconds)
         java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-fa3aff89-a66e-4376-9a37-2a9b87596939
         ...
      
       - orc: schema should be inferred but not stored when INFER_ONLY is specified *** FAILED *** (141 milliseconds)
         java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-fb464e59-b049-481b-9c75-f53295c9fc2c
         ...
      
       - parquet: schema should be inferred but not stored when INFER_ONLY is specified *** FAILED *** (125 milliseconds)
         java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-9487568e-80a4-42b3-b0a5-d95314c4ccbc
         ...
      
       - orc: schema should not be inferred when NEVER_INFER is specified *** FAILED *** (156 milliseconds)
         java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-0d2dfa45-1b0f-4958-a8be-1074ed0135a
         ...
      
       - parquet: schema should not be inferred when NEVER_INFER is specified *** FAILED *** (547 milliseconds)
         java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-6d95d64e-613e-4a59-a0f6-d198c5aa51ee
         ...
      ```
      
      ```
      org.apache.spark.sql.execution.command.DDLSuite:
       - create temporary view using *** FAILED *** (15 milliseconds)
         org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-3881d9ca-561b-488d-90b9-97587472b853	mp;
         ...
      
       - insert data to a data source table which has a non-existing location should succeed *** FAILED *** (109 milliseconds)
         file:/C:projectsspark%09arget%09mpspark-4cad3d19-6085-4b75-b407-fe5e9d21df54 did not equal file:///C:/projects/spark/target/tmp/spark-4cad3d19-6085-4b75-b407-fe5e9d21df54 (DDLSuite.scala:1869)
         ...
      
       - insert into a data source table with a non-existing partition location should succeed *** FAILED *** (94 milliseconds)
         file:/C:projectsspark%09arget%09mpspark-4b52e7de-e3aa-42fd-95d4-6d4d58d1d95d did not equal file:///C:/projects/spark/target/tmp/spark-4b52e7de-e3aa-42fd-95d4-6d4d58d1d95d (DDLSuite.scala:1910)
         ...
      
       - read data from a data source table which has a non-existing location should succeed *** FAILED *** (93 milliseconds)
         file:/C:projectsspark%09arget%09mpspark-f8c281e2-08c2-4f73-abbf-f3865b702c34 did not equal file:///C:/projects/spark/target/tmp/spark-f8c281e2-08c2-4f73-abbf-f3865b702c34 (DDLSuite.scala:1937)
         ...
      
       - read data from a data source table with non-existing partition location should succeed *** FAILED *** (110 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - create datasource table with a non-existing location *** FAILED *** (94 milliseconds)
         file:/C:projectsspark%09arget%09mpspark-387316ae-070c-4e78-9b78-19ebf7b29ec8 did not equal file:///C:/projects/spark/target/tmp/spark-387316ae-070c-4e78-9b78-19ebf7b29ec8 (DDLSuite.scala:1982)
         ...
      
       - CTAS for external data source table with a non-existing location *** FAILED *** (16 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - CTAS for external data source table with a existed location *** FAILED *** (15 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - data source table:partition column name containing a b *** FAILED *** (125 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - data source table:partition column name containing a:b *** FAILED *** (143 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - data source table:partition column name containing a%b *** FAILED *** (109 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - data source table:partition column name containing a,b *** FAILED *** (109 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - location uri contains a b for datasource table *** FAILED *** (94 milliseconds)
         file:/C:projectsspark%09arget%09mpspark-5739cda9-b702-4e14-932c-42e8c4174480a%20b did not equal file:///C:/projects/spark/target/tmp/spark-5739cda9-b702-4e14-932c-42e8c4174480/a%20b (DDLSuite.scala:2084)
         ...
      
       - location uri contains a:b for datasource table *** FAILED *** (78 milliseconds)
         file:/C:projectsspark%09arget%09mpspark-9bdd227c-840f-4f08-b7c5-4036638f098da:b did not equal file:///C:/projects/spark/target/tmp/spark-9bdd227c-840f-4f08-b7c5-4036638f098d/a:b (DDLSuite.scala:2084)
         ...
      
       - location uri contains a%b for datasource table *** FAILED *** (78 milliseconds)
         file:/C:projectsspark%09arget%09mpspark-62bb5f1d-fa20-460a-b534-cb2e172a3640a%25b did not equal file:///C:/projects/spark/target/tmp/spark-62bb5f1d-fa20-460a-b534-cb2e172a3640/a%25b (DDLSuite.scala:2084)
         ...
      
       - location uri contains a b for database *** FAILED *** (16 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - location uri contains a:b for database *** FAILED *** (15 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - location uri contains a%b for database *** FAILED *** (0 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      ```
      
      ```
      org.apache.spark.sql.hive.execution.HiveDDLSuite:
       - create hive table with a non-existing location *** FAILED *** (16 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - CTAS for external hive table with a non-existing location *** FAILED *** (16 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - CTAS for external hive table with a existed location *** FAILED *** (16 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - partition column name of parquet table containing a b *** FAILED *** (156 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - partition column name of parquet table containing a:b *** FAILED *** (94 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - partition column name of parquet table containing a%b *** FAILED *** (125 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - partition column name of parquet table containing a,b *** FAILED *** (110 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - partition column name of hive table containing a b *** FAILED *** (15 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - partition column name of hive table containing a:b *** FAILED *** (16 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - partition column name of hive table containing a%b *** FAILED *** (16 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - partition column name of hive table containing a,b *** FAILED *** (0 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - hive table: location uri contains a b *** FAILED *** (0 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - hive table: location uri contains a:b *** FAILED *** (0 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - hive table: location uri contains a%b *** FAILED *** (0 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      ```
      
      ```
      org.apache.spark.sql.sources.PathOptionSuite:
       - path option also exist for write path *** FAILED *** (94 milliseconds)
         file:/C:projectsspark%09arget%09mpspark-2870b281-7ac0-43d6-b6b6-134e01ab6fdc did not equal file:///C:/projects/spark/target/tmp/spark-2870b281-7ac0-43d6-b6b6-134e01ab6fdc (PathOptionSuite.scala:98)
         ...
      ```
      
      ```
      org.apache.spark.sql.CachedTableSuite:
       - SPARK-19765: UNCACHE TABLE should un-cache all cached plans that refer to this table *** FAILED *** (110 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      ```
      
      ```
      org.apache.spark.sql.execution.DataSourceScanExecRedactionSuite:
       - treeString is redacted *** FAILED *** (250 milliseconds)
         "file:/C:/projects/spark/target/tmp/spark-3ecc1fa4-3e76-489c-95f4-f0b0500eae28" did not contain "C:\projects\spark\target\tmp\spark-3ecc1fa4-3e76-489c-95f4-f0b0500eae28" (DataSourceScanExecRedactionSuite.scala:46)
         ...
      ```
      
      ## How was this patch tested?
      
Tested via AppVeyor and checked that each fix passed at least once. These should be retested via AppVeyor in this PR.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17987 from HyukjinKwon/windows-20170515.
      [SPARK-20741][SPARK SUBMIT] Added cleanup of JARs archive generated by SparkSubmit · 7306d556
      Lior Regev authored
      ## What changes were proposed in this pull request?
      
Deleted the generated JARs archive after it has been distributed to HDFS.
      
      ## How was this patch tested?
      
      Author: Lior Regev <lioregev@gmail.com>
      
      Closes #17986 from liorregev/master.
      [SPARK-20768][PYSPARK][ML] Expose numPartitions (expert) param of PySpark FPGrowth. · 139da116
      Yan Facai (颜发才) authored
      ## What changes were proposed in this pull request?
      
      Expose numPartitions (expert) param of PySpark FPGrowth.
      
      ## How was this patch tested?
      
      + [x] Pass all unit tests.
      
      Author: Yan Facai (颜发才) <facai.yan@gmail.com>
      
      Closes #18058 from facaiy/ENH/pyspark_fpg_add_num_partition.
      [SPARK-19281][FOLLOWUP][ML] Minor fix for PySpark FPGrowth. · 913a6bfe
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
Follow-up for #17218; some minor fixes for PySpark `FPGrowth`.
      
      ## How was this patch tested?
      Existing UT.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #18089 from yanboliang/spark-19281.
      [SPARK-19659] Fetch big blocks to disk when shuffle-read. · 3f94e64a
      jinxing authored
      ## What changes were proposed in this pull request?
      
Currently the whole block is fetched into memory (off-heap by default) during shuffle read. A block is identified by (shuffleId, mapId, reduceId), so it can be large in skewed situations. If OOM happens during shuffle read, the job is killed and users are advised to "Consider boosting spark.yarn.executor.memoryOverhead". Adjusting the parameter and allocating more memory can resolve the OOM, but this approach is not well suited to production environments, especially data warehouses.
When using Spark SQL as the data engine in a warehouse, users want a unified set of parameters (e.g. memory) with less wasted resource (allocated but not used). This is especially true when migrating the data engine to Spark from another one (e.g. Hive), since tuning the parameters for thousands of SQL queries one by one is very time consuming.
Skew is not always easy to predict; when it happens, it makes sense to fetch remote blocks to disk for shuffle read rather than kill the job because of OOM.

In this PR, I propose to fetch big blocks to disk (which is also mentioned in SPARK-3019):

1. Track the average block size, and also the outliers (larger than 2 * avgSize), in MapStatus (sketched below);
2. Request memory from `MemoryManager` before fetching blocks, and release it back to `MemoryManager` when the `ManagedBuffer` is released;
3. Fetch remote blocks to disk when acquiring memory from `MemoryManager` fails; otherwise fetch to memory.

This improves memory control for shuffle blocks and helps avoid OOM in scenarios such as:
1. A single huge block;
2. The sizes of many blocks are underestimated in `MapStatus`, so the actual footprint of the blocks is much larger than estimated.
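A hedged sketch of the outlier-tracking idea from item 1 (invented names; the real `MapStatus` encoding differs):

```scala
// Report an average for typical blocks plus exact sizes for blocks larger than
// 2 * avg, so huge blocks are not underestimated on the reduce side.
def summarizeBlockSizes(sizes: Array[Long]): (Long, Map[Int, Long]) = {
  val avg = if (sizes.isEmpty) 0L else sizes.sum / sizes.length
  val outliers = sizes.zipWithIndex.collect {
    case (size, idx) if size > 2 * avg => idx -> size
  }.toMap
  (avg, outliers)
}
```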
      
      ## How was this patch tested?
      Added unit test in `MapStatusSuite` and `ShuffleBlockFetcherIteratorSuite`.
      
      Author: jinxing <jinxing6042@126.com>
      
      Closes #16989 from jinxing64/SPARK-19659.
      [SPARK-20250][CORE] Improper OOM error when a task been killed while spilling data · 731462a0
      Xianyang Liu authored
      ## What changes were proposed in this pull request?
      
Currently, when a task calls spill() but receives a kill request from the driver (e.g., for a speculative task), the `TaskMemoryManager` throws an `OOM` exception. And we don't catch `Fatal` exceptions when the error is caused by `Thread.interrupt`. So for `ClosedByInterruptException`, we should throw a `RuntimeException` instead of an `OutOfMemoryError`.
      
      https://issues.apache.org/jira/browse/SPARK-20250?jql=project%20%3D%20SPARK
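A minimal Scala sketch of the distinction (illustrative only, not the actual `TaskMemoryManager` code):

```scala
import java.nio.channels.ClosedByInterruptException

// If a spill fails because the task was interrupted (killed), surface a
// RuntimeException rather than reporting a spurious OutOfMemoryError.
def spillOrFail(doSpill: () => Long): Long =
  try doSpill()
  catch {
    case e: ClosedByInterruptException =>
      throw new RuntimeException("Task was interrupted during spill", e)
  }
```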
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: Xianyang Liu <xianyang.liu@intel.com>
      
      Closes #18090 from ConeyLiu/SPARK-20250.
  3. May 24, 2017
      [SPARK-20848][SQL][FOLLOW-UP] Shutdown the pool after reading parquet files · 6b68d61c
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
This is a follow-up to #18073, taking a safer approach to shutting down the pool to prevent possible issues. It also uses `ThreadUtils.newForkJoinPool` instead, to set a better thread name.
      
      ## How was this patch tested?
      
Manually tested.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #18100 from viirya/SPARK-20848-followup.
      [SPARK-20403][SQL] Modify the instructions of some functions · 197f9018
      liuxian authored
      ## What changes were proposed in this pull request?
1. Add a usage description for the `cast` function, shown by the `show functions` and `desc function cast` commands in spark-sql.
2. Modify the usage descriptions of functions such as boolean, tinyint, smallint, int, bigint, float, double, decimal, date, timestamp, binary and string.
      
      ## How was this patch tested?
      Before modification:
spark-sql> desc function boolean;
      Function: boolean
      Class: org.apache.spark.sql.catalyst.expressions.Cast
      Usage: boolean(expr AS type) - Casts the value `expr` to the target data type `type`.
      
      After modification:
      spark-sql> desc function boolean;
      Function: boolean
      Class: org.apache.spark.sql.catalyst.expressions.Cast
      Usage: boolean(expr) - Casts the value `expr` to the target data type `boolean`.
      
      spark-sql> desc function cast
      Function: cast
      Class: org.apache.spark.sql.catalyst.expressions.Cast
      Usage: cast(expr AS type) - Casts the value `expr` to the target data type `type`.
      
      Author: liuxian <liu.xian3@zte.com.cn>
      
      Closes #17698 from 10110346/wip_lx_0418.
      [SPARK-16202][SQL][DOC] Follow-up to Correct The Description of... · 5f8ff2fc
      Jacek Laskowski authored
      [SPARK-16202][SQL][DOC] Follow-up to Correct The Description of CreatableRelationProvider's createRelation
      
      ## What changes were proposed in this pull request?
      
      Follow-up to SPARK-16202:
      
1. Remove the duplicated description of the meaning of `SaveMode` (one mode was in fact missing, which proves the duplication can become incomplete again in the future)
      
      2. Use standard scaladoc tags
      
      /cc gatorsmile rxin yhuai (as they were involved previously)
      
      ## How was this patch tested?
      
      local build
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #18026 from jaceklaskowski/CreatableRelationProvider-SPARK-16202.
      [SPARK-20872][SQL] ShuffleExchange.nodeName should handle null coordinator · c0b3e45e
      Kris Mok authored
      ## What changes were proposed in this pull request?
      
      A one-liner change in `ShuffleExchange.nodeName` to cover the case when `coordinator` is `null`, so that the match expression is exhaustive.
      
      Please refer to [SPARK-20872](https://issues.apache.org/jira/browse/SPARK-20872) for a description of the symptoms.
      TL;DR is that inspecting a `ShuffleExchange` (directly or transitively) on the Executor side can hit a case where the `coordinator` field of a `ShuffleExchange` is null, and thus will trigger a `MatchError` in `ShuffleExchange.nodeName()`'s inexhaustive match expression.
      
      Also changed two other match conditions in `ShuffleExchange` on the `coordinator` field to be consistent.
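A simplified sketch of the exhaustive match (the real `ShuffleExchange` code differs in detail; names here are illustrative):

```scala
// `coordinator` is an Option, but on the executor side the field itself can be
// null when it was not reconstructed during deserialization, hence the extra case.
def nodeNameFor(coordinator: Option[AnyRef]): String = coordinator match {
  case Some(c) => s"Exchange(coordinator id: ${System.identityHashCode(c)})"
  case None    => "Exchange"
  case null    => "Exchange"  // previously missing; triggered the MatchError
}
```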
      
      ## How was this patch tested?
      
      Manually tested this change with a case where the `coordinator` is null to make sure `ShuffleExchange.nodeName` doesn't throw a `MatchError` any more.
      
      Author: Kris Mok <kris.mok@databricks.com>
      
      Closes #18095 from rednaxelafx/shuffleexchange-nodename.
      [SPARK-20205][CORE] Make sure StageInfo is updated before sending event. · 95aef660
      Marcelo Vanzin authored
      The DAGScheduler was sending a "stage submitted" event before it properly
      updated the event's information. This meant that a listener (e.g. the
event logging listener) could record wrong information about the event.
      
      This change sets the stage's submission time before the event is submitted,
      when there are tasks to be executed in the stage.
      
      Tested with existing unit tests.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #17925 from vanzin/SPARK-20205.
      [SPARK-20867][SQL] Move hints from Statistics into HintInfo class · a6474667
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This is a follow-up to SPARK-20857 to move the broadcast hint from Statistics into a new HintInfo class, so we can be more flexible in adding new hints in the future.
      
      ## How was this patch tested?
      Updated test cases to reflect the change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #18087 from rxin/SPARK-20867.
      [SPARK-20848][SQL] Shutdown the pool after reading parquet files · f72ad303
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
From JIRA: On each call to spark.read.parquet, a new ForkJoinPool is created. One of the threads in the pool is kept in the WAITING state and is never stopped, which leads to unbounded growth in the number of threads.
      
We should shut down the pool after reading the parquet files (see the sketch below).
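A rough Scala sketch of the fix's shape (assumed pool size and structure; see the follow-up #18100 for the refined version using `ThreadUtils.newForkJoinPool`):

```scala
import java.util.concurrent.ForkJoinPool

// Shut the pool down once the footers are read so its worker threads
// don't linger in the WAITING state after each spark.read.parquet call.
val pool = new ForkJoinPool(8)
try {
  // ... schedule the parquet footer-reading tasks on `pool` ...
} finally {
  pool.shutdown()
}
```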
      
      ## How was this patch tested?
      
      Added a test to ParquetFileFormatSuite.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #18073 from viirya/SPARK-20848.
      [SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape in LogisticRegressionModel · bc66a77b
      Bago Amirbekian authored
      ## What changes were proposed in this pull request?
      
Fixed a TypeError with python3 and numpy 1.12.1. Numpy's `reshape` no longer takes floats as arguments as of 1.12. Also, python3 uses float division for `/`, so we should use `//` to ensure that `_dataWithBiasSize` doesn't get set to a float.
      
      ## How was this patch tested?
      
      Existing tests run using python3 and numpy 1.12.
      
      Author: Bago Amirbekian <bago@databricks.com>
      
      Closes #18081 from MrBago/BF-py3floatbug.
      [SPARK-20631][FOLLOW-UP] Fix incorrect tests. · 1816eb3b
      zero323 authored
      ## What changes were proposed in this pull request?
      
      - Fix incorrect tests for `_check_thresholds`.
      - Move test to `ParamTests`.
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #18085 from zero323/SPARK-20631-FOLLOW-UP.
      [SPARK-20764][ML][PYSPARK][FOLLOWUP] Fix visibility discrepancy with... · 9afcf127
      Peng authored
      [SPARK-20764][ML][PYSPARK][FOLLOWUP] Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version
      
      ## What changes were proposed in this pull request?
      Add test cases for PR-18062
      
      ## How was this patch tested?
      The existing UT
      
      Author: Peng <peng.meng@intel.com>
      
      Closes #18068 from mpjlu/moreTest.
      [SPARK-18406][CORE] Race between end-of-task and completion iterator read lock release · d76633e3
      Xingbo Jiang authored
      ## What changes were proposed in this pull request?
      
When a TaskContext is not propagated properly to all child threads of a task, as in the cases reported in this issue, we fail to get the TID from the TaskContext, which makes us unable to release the lock and causes assertion failures. To resolve this, we have to explicitly pass the TID value to the `unlock` method.
      
      ## How was this patch tested?
      
      Add new failing regression test case in `RDDSuite`.
      
      Author: Xingbo Jiang <xingbo.jiang@databricks.com>
      
      Closes #18076 from jiangxb1987/completion-iterator.
  4. May 23, 2017
      [SPARK-20861][ML][PYTHON] Delegate looping over paramMaps to estimators · 9434280c
      Bago Amirbekian authored
      Changes:
      
pyspark.ml Estimators can take either a list of param maps or a dict of params. This change allows the CrossValidator and TrainValidationSplit Estimators to pass lists of param maps through to the underlying estimators, so that those estimators can handle parallelization when appropriate (e.g. distributed hyperparameter tuning).
      
      Testing:
      
      Existing unit tests.
      
      Author: Bago Amirbekian <bago@databricks.com>
      
      Closes #18077 from MrBago/delegate_params.
      [SPARK-15648][SQL] Add teradataDialect for JDBC connection to Teradata · 4816c2ef
      Kirby Linvill authored
      The contribution is my original work and I license the work to the project under the project’s open source license.
      
      Note: the Teradata JDBC connector limits the row size to 64K. The default string datatype equivalent I used is a 255 character/byte length varchar. This effectively limits the max number of string columns to 250 when using the Teradata jdbc connector.
      
      ## What changes were proposed in this pull request?
      
      Added a teradataDialect for JDBC connection to Teradata. The Teradata dialect uses VARCHAR(255) in place of TEXT for string datatypes, and CHAR(1) in place of BIT(1) for boolean datatypes.
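A hedged sketch of such a dialect against Spark's JdbcDialect API (simplified; not the PR's exact code):

```scala
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types.{BooleanType, DataType, StringType}

case object TeradataDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:teradata")

  // Teradata has no TEXT or BIT(1); map Spark types to supported equivalents.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType  => Some(JdbcType("VARCHAR(255)", Types.VARCHAR))
    case BooleanType => Some(JdbcType("CHAR(1)", Types.CHAR))
    case _           => None
  }
}
```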
      
      ## How was this patch tested?
      
I added two unit tests to double check that the types get set correctly for a Teradata JDBC URL. I also ran a couple of manual tests to make sure the JDBC connector worked with Teradata, and to make sure that an error was thrown if a row could potentially exceed 64K (this error comes from the Teradata JDBC connector, not from the Spark code). I did not check how string columns longer than 255 characters are handled.
      
      Author: Kirby Linvill <kirby.linvill@teradata.com>
      Author: klinvill <kjlinvill@gmail.com>
      
      Closes #16746 from klinvill/master.
      [SPARK-20857][SQL] Generic resolved hint node · 0d589ba0
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch renames BroadcastHint to ResolvedHint (and Hint to UnresolvedHint) so the hint framework is more generic and would allow us to introduce other hint types in the future without introducing new hint nodes.
      
      ## How was this patch tested?
      Updated test cases.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #18072 from rxin/SPARK-20857.
      [MINOR][SPARKR][ML] Joint coefficients with intercept for SparkR linear SVM summary. · ad09e4ca
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
Join the coefficients with the intercept for the SparkR linear SVM summary.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #18035 from yanboliang/svm-r.
      [SPARK-20399][SQL][FOLLOW-UP] Add a config to fallback string literal parsing... · 442287ae
      Liang-Chi Hsieh authored
      [SPARK-20399][SQL][FOLLOW-UP] Add a config to fallback string literal parsing consistent with old sql parser behavior
      
      ## What changes were proposed in this pull request?
      
As srowen pointed out in https://github.com/apache/spark/commit/609ba5f2b9fd89b1b9971d08f7cc680d202dbc7c#commitcomment-22221259, the previous tests are not proper.

This follow-up fixes the tests.
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #18048 from viirya/SPARK-20399-follow-up.
      [SPARK-20727] Skip tests that use Hadoop utils on CRAN Windows · d06610f9
      Shivaram Venkataraman authored
      ## What changes were proposed in this pull request?
      
This change skips tests that use the Hadoop libraries while running
the CRAN check with Windows as the operating system. This is to handle
cases where the Hadoop winutils binaries are missing on the target
system. The skipped tests consist of
      1. Tests that save, load a model in MLlib
      2. Tests that save, load CSV, JSON and Parquet files in SQL
      3. Hive tests
      
      ## How was this patch tested?
      
Tested by running on a local Windows VM with HADOOP_HOME unset. Also tested with https://win-builder.r-project.org
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #17966 from shivaram/sparkr-windows-cran.
  5. May 22, 2017