  1. May 31, 2016
    • [SPARK-15451][BUILD] Use jdk7's rt.jar when available. · 57adb77e
      Marcelo Vanzin authored
      This helps prevent jdk8-specific calls from being checked in, because
      PR builders run the compiler with the wrong settings.
      
      If the JAVA_7_HOME env variable is set, assume it points at
      a jdk7 and use its rt.jar when invoking javac. For zinc, just run
      it with jdk7, and disable it when building jdk8-specific code.
      
      A big note for sbt usage: adding the bootstrap options forces sbt
      to fork the compiler, and that disables incremental compilation.
      That means that it's really not convenient to use for normal
      development, but should be ok for automated builds.
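
      A hedged sketch of the sbt side (sbt build settings are Scala); this
      illustrates the approach rather than the exact build change:

      ```scala
      // If JAVA_7_HOME is set, compile Java sources against jdk7's rt.jar so
      // jdk8-only library calls fail at compile time. Passing bootstrap options
      // like this is what forces sbt to fork the compiler, which disables
      // incremental compilation as noted above.
      javacOptions ++= sys.env.get("JAVA_7_HOME").toSeq.flatMap { jdk7 =>
        Seq("-bootclasspath", s"$jdk7/jre/lib/rt.jar")
      }
      ```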
      
      Tested with JAVA_HOME=jdk8 and JAVA_7_HOME=jdk7:
      - mvn + zinc
      - mvn sans zinc
      - sbt
      
      Verified that in all cases, jdk8-specific library calls fail to
      compile.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #13272 from vanzin/SPARK-15451.
    • [SPARK-15517][SQL][STREAMING] Add support for complete output mode in Structured Streaming · 90b11439
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      Currently, structured streaming supports only the append output mode. This PR adds the following.
      
      - Added support for Complete output mode in the internal state store, analyzer and planner.
      - Added public API in Scala and Python for users to specify the output mode (see the sketch after this list)
      - Added checks for unsupported combinations of output mode and DF operations
        - Plans with no aggregation should support only Append mode
        - Plans with aggregation should support only Update and Complete modes
        - Default output mode is Append mode (**Question: should we change this to automatically set to Complete mode when there is aggregation?**)
      - Added support for Complete output mode in Memory Sink. The Memory Sink internally supports Append, Complete, and Update, but the public API exposes only the Append and Complete output modes.
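
      A rough sketch of the new Scala API described above (names as of this
      PR's timeframe; the exact surface may differ):

      ```scala
      // A streaming aggregation written out in Complete mode; "append" remains
      // the default. streamingDf stands in for any streaming DataFrame.
      val query = streamingDf
        .groupBy("value").count()
        .write
        .outputMode("complete")
        .format("memory")
        .queryName("counts")
        .startStream()
      ```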
      
      ## How was this patch tested?
      Unit tests in various test suites
      - StreamingAggregationSuite: tests for complete mode
      - MemorySinkSuite: tests for checking behavior in Append and Complete modes.
      - UnsupportedOperationSuite: tests for checking unsupported combinations of DF ops and output modes
      - DataFrameReaderWriterSuite: tests for checking that output mode cannot be called on static DFs
      - Python doc test and existing unit tests modified to call write.outputMode.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #13286 from tdas/complete-mode.
    • [SPARK-15557] [SQL] cast the string into DoubleType when it's used together with decimal · dfe2cbeb
      Dilip Biswal authored
      In this case, the result type of the expression becomes DECIMAL(38, 36), as we promote the individual string literals to DECIMAL(38, 18) when we handle string promotions for `BinaryArithmeticExpression`.
      
      I think we need to cast the string literals to Double type instead. I looked at the history and found that  this was changed to use decimal instead of double to avoid potential loss of precision when we cast decimal to double.
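
      Illustratively (a sketch, not output from this patch):

      ```scala
      // Before the fix, the string literal was promoted to DECIMAL(38, 18),
      // yielding a DECIMAL(38, 36) result type with no room for the integral
      // digits, so the result overflowed to NULL. Casting the string to double
      // instead yields 101.0.
      spark.sql("SELECT CAST(99 AS DECIMAL(19, 6)) + '2'").show()
      ```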
      
      To double-check, I ran the query against Hive and MySQL. The query returns a non-NULL result in both databases, and both promote the expression to double.
      Here is the output.
      
      - Hive
      ```SQL
      hive> create table l2 as select (cast(99 as decimal(19,6)) + '2') from l1;
      OK
      hive> describe l2;
      OK
      _c0                 	double
      ```
      - MySQL
      ```SQL
      mysql> create table foo2 as select (cast(99 as decimal(19,6)) + '2') from test;
      Query OK, 1 row affected (0.01 sec)
      Records: 1  Duplicates: 0  Warnings: 0
      
      mysql> describe foo2;
      +-----------------------------------+--------+------+-----+---------+-------+
      | Field                             | Type   | Null | Key | Default | Extra |
      +-----------------------------------+--------+------+-----+---------+-------+
      | (cast(99 as decimal(19,6)) + '2') | double | NO   |     | 0       |       |
      +-----------------------------------+--------+------+-----+---------+-------+
      ```
      
      ## How was this patch tested?
      Added a new test in SQLQuerySuite
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #13368 from dilipbiswal/spark-15557.
    • [SPARK-15327] [SQL] fix split expression in whole stage codegen · 2df6ca84
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Right now, we split the generated code for expressions into multiple functions when it exceeds 64KB. This requires that the expressions use a Row object, which is not true for whole-stage codegen, so the generated code fails to compile after being split.
      
      This PR will not split the code in whole-stage codegen.
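
      A minimal sketch of the idea, with assumed names (not the actual codegen
      code): splitting only happens when expressions read from a named row
      variable, which whole-stage codegen does not provide.

      ```scala
      def maybeSplit(row: String, expressions: Seq[String]): String = {
        if (row == null) {
          // Whole-stage codegen: expressions use local variables, not a Row,
          // so the code must stay inline.
          expressions.mkString("\n")
        } else {
          // Row-based codegen: split into helper methods to stay under 64KB.
          expressions.zipWithIndex.map { case (expr, i) =>
            s"private void apply_$i(InternalRow $row) { $expr }"
          }.mkString("\n")
        }
      }
      ```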
      
      ## How was this patch tested?
      
      Added regression tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13235 from davies/fix_nested_codegen.
    • [MINOR][DOC][ML] ml.clustering scala & python api doc sync · 594484cd
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Since we did the Scala API audit for ml.clustering in #13148, we should also fix and update the corresponding Python API docs to keep them in sync.
      
      ## How was this patch tested?
      Docs change, no tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #13291 from yanboliang/spark-15361-followup.
    • Revert "[SPARK-11753][SQL][TEST-HADOOP2.2] Make allowNonNumericNumbers option work · 9a74de18
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This reverts commit c24b6b67. Sent a PR to run Jenkins tests due to the revert conflicts of `dev/deps/spark-deps-hadoop*`.
      
      ## How was this patch tested?
      
      Jenkins unit tests, integration tests, manual tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13417 from zsxwing/revert-SPARK-11753.
    • [SPARK-15622][SQL] Wrap the parent classloader of Janino's classloader in the ParentClassLoader. · c6de5832
      Yin Huai authored
      ## What changes were proposed in this pull request?
      At https://github.com/aunkrig/janino/blob/janino_2.7.8/janino/src/org/codehaus/janino/ClassLoaderIClassLoader.java#L80-L85, Janino's classloader throws the exception when its parent throws a ClassNotFoundException with a cause set, but not when no cause is set. It seems we need to wrap the actual parent classloader passed to Janino in a special ClassLoader to handle this behavior.
      
      ## How was this patch tested?
      I reverted the workaround made by https://issues.apache.org/jira/browse/SPARK-11636 (https://github.com/apache/spark/compare/master...yhuai:SPARK-15622?expand=1#diff-bb538fda94224dd0af01d0fd7e1b4ea0R81), and `test-only *ReplSuite -- -z "SPARK-2576 importing implicits"` still passes (without the change in `CodeGenerator`, this test does not pass with the change in `ExecutorClassLoader`).
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #13366 from yhuai/SPARK-15622.
    • [SPARK-15658][SQL] UDT serializer should declare its data type as udt instead of udt.sqlType · 2bfed1a0
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      When we build the serializer for a UDT object, we should declare its data type as the udt instead of udt.sqlType; otherwise, when we deserialize it again, we lose the information that it is a UDT object and throw an analysis exception.
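
      Illustratively (`ExamplePoint` is a hypothetical UDT-backed class, named
      here only for the sketch):

      ```scala
      import spark.implicits._

      // Round-tripping a Dataset of UDT-backed objects: before this fix, the
      // serializer declared udt.sqlType, so deserializing again lost the UDT
      // information and threw an analysis exception.
      val ds = Seq(new ExamplePoint(1.0, 2.0)).toDS()
      ds.map(identity).collect()
      ```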
      
      ## How was this patch tested?
      
      new test in `UserDefinedTypeSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13402 from cloud-fan/udt.
    • [SPARK-15647][SQL] Fix Boundary Cases in OptimizeCodegen Rule · d67c82e4
      gatorsmile authored
      #### What changes were proposed in this pull request?
      
      The following condition in the Optimizer rule `OptimizeCodegen` is not right.
      ```Scala
      branches.size < conf.maxCaseBranchesForCodegen
      ```
      
      - The number of branches in a CASE WHEN clause should be `branches.size + elseBranch.size`.
      - `maxCaseBranchesForCodegen` is the maximum boundary for enabling codegen. Thus, we should use `<=` instead of `<`.
      
      This PR fixes this boundary case and adds the missing test cases verifying the conf `MAX_CASES_BRANCHES`.
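
      Putting the two points together, the corrected guard looks roughly like
      this (a sketch with placeholder types):

      ```scala
      // Count the else branch as well, and make the boundary inclusive.
      def shouldCodegen(
          branches: Seq[Any],
          elseBranch: Option[Any],
          maxCaseBranchesForCodegen: Int): Boolean = {
        (branches.size + elseBranch.size) <= maxCaseBranchesForCodegen
      }
      ```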
      
      #### How was this patch tested?
      Added test cases in `SQLConfSuite`
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #13392 from gatorsmile/maxCaseWhen.
    • [SPARK-15649][SQL] Avoid serializing MetastoreRelation in HiveTableScanExec · 2bfc4f15
      Lianhui Wang authored
      ## What changes were proposed in this pull request?
      In HiveTableScanExec, schema is lazy and depends on relation.attributeMap, so the MetastoreRelation gets serialized along with the task binary. This patch avoids serializing the MetastoreRelation.
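
      A minimal sketch of the serialization pattern involved (`Relation` and
      `ScanNode` are hypothetical stand-ins):

      ```scala
      import org.apache.spark.sql.types.StructType

      // An eager schema val captures only the computed value, so the
      // heavyweight relation can be @transient and is not shipped with tasks.
      trait Relation extends Serializable { def schema: StructType }
      class ScanNode(@transient val relation: Relation) extends Serializable {
        val schema: StructType = relation.schema // evaluated before serialization
      }
      ```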
      
      ## How was this patch tested?
      
      Author: Lianhui Wang <lianhuiwang09@gmail.com>
      
      Closes #13397 from lianhuiwang/avoid-serialize.
    • [SPARK-15528][SQL] Fix race condition in NumberConverter · 95db8a44
      Takeshi YAMAMURO authored
      ## What changes were proposed in this pull request?
      A variable in NumberConverter is wrongly shared between threads.
      This PR fixes the race condition.
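
      The bug pattern, illustratively (this is not the actual NumberConverter
      code):

      ```scala
      // A buffer held in a shared field is mutated by concurrent callers and
      // races; allocating it per invocation, as this fix does, is thread-safe.
      object SharedBuffer {
        private val buf = new Array[Byte](64)          // shared: racy
        def encode(b: Byte): Array[Byte] = { buf(0) = b; buf }
      }
      object LocalBuffer {
        def encode(b: Byte): Array[Byte] = {
          val buf = new Array[Byte](64)                // per call: safe
          buf(0) = b
          buf
        }
      }
      ```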
      
      ## How was this patch tested?
      Manually checked.
      
      Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
      
      Closes #13391 from maropu/SPARK-15528.
    • [SPARK-15641] HistoryServer to not show invalid date for incomplete application · 6878f3e2
      catapan authored
      ## What changes were proposed in this pull request?
      For incomplete applications in the HistoryServer, the Completed column now shows "-" instead of an incorrect date.
      
      ## How was this patch tested?
      Manually tested.
      
      Author: catapan <cedarpan86@gmail.com>
      Author: Ziying Pan <cedarpan@Ziyings-MacBook.local>
      
      Closes #13396 from catapan/SPARK-15641_fix_completed_column.
    • [SPARK-15638][SQL] Audit Dataset, SparkSession, and SQLContext · 67592104
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch contains a list of changes as a result of my auditing Dataset, SparkSession, and SQLContext. The patch audits the categorization of experimental APIs, function groups, and deprecations. For the detailed list of changes, please see the diff.
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13370 from rxin/SPARK-15638.
  2. May 30, 2016
    • [SPARK-10530][CORE] Kill other task attempts when one taskattempt belonging... · 5b21139d
      Devaraj K authored
      [SPARK-10530][CORE] Kill other task attempts when one task attempt belonging to the same task succeeds in speculation
      
      ## What changes were proposed in this pull request?
      
      With this patch, TaskSetManager kills the other running attempts as soon as any one attempt of a task succeeds. Killed tasks are no longer counted as failed: they are listed separately in the UI and show the task state as KILLED instead of FAILED.
      
      ## How was this patch tested?
      
      core\src\test\scala\org\apache\spark\ui\jobs\JobProgressListenerSuite.scala
      core\src\test\scala\org\apache\spark\util\JsonProtocolSuite.scala
      
      I verified this patch manually by setting spark.speculation to true: when any attempt succeeds, the other running attempts for the same task are killed and pending tasks are assigned in their place. Killed attempts are reported as KILLED rather than FAILED. See the attached screenshots for reference.
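
      For reference, speculation is enabled as in the manual test above:

      ```scala
      import org.apache.spark.SparkConf

      // spark.speculation is the setting named in the description above.
      val conf = new SparkConf().set("spark.speculation", "true")
      ```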
      
      ![stage-tasks-table](https://cloud.githubusercontent.com/assets/3174804/14075132/394c6a12-f4f4-11e5-8638-20ff7b8cc9bc.png)
      ![stages-table](https://cloud.githubusercontent.com/assets/3174804/14075134/3b60f412-f4f4-11e5-9ea6-dd0dcc86eb03.png)
      
      Ref : https://github.com/apache/spark/pull/11916
      
      Author: Devaraj K <devaraj@apache.org>
      
      Closes #11996 from devaraj-kavali/SPARK-10530.
    • [DOCS] fix example code issues in documentation · 2d34183b
      Matthew Wise authored
      ## What changes were proposed in this pull request?
      
      Fixed broken Java code examples in the streaming documentation.
      
      Attn: tdas
      
      Author: Matthew Wise <matthew.rs.wise@gmail.com>
      
      Closes #13388 from mawise/fix_docs_java_streaming_example.
    • [SPARK-15645][STREAMING] Fix some typos of Streaming module · 5728aa55
      Xin Ren authored
      ## What changes were proposed in this pull request?
      
      No code change, just some typo fixing.
      
      ## How was this patch tested?
      
      Manually ran the project build with tests; the build succeeded.
      
      Author: Xin Ren <iamshrek@126.com>
      
      Closes #13385 from keypointt/codeWalkThroughStreaming.
    • [SPARK-15112][SQL] Disables EmbedSerializerInFilter for plan fragments that change schema · 1360a6d6
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      `EmbedSerializerInFilter` implicitly assumes that the plan fragment being optimized doesn't change plan schema, which is reasonable because `Dataset.filter` should never change the schema.
      
      However, due to another issue involving `DeserializeToObject` and `SerializeFromObject`, typed filter *does* change plan schema (see [SPARK-15632][1]). This breaks `EmbedSerializerInFilter` and causes corrupted data.
      
      This PR disables `EmbedSerializerInFilter` when there's a schema change to avoid data corruption. The schema change issue should be addressed in follow-up PRs.
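
      Illustrative of the affected pattern (a sketch only; the real
      reproduction is the new `DatasetSuite` case):

      ```scala
      import spark.implicits._

      // A typed filter wraps its predicate in DeserializeToObject /
      // SerializeFromObject; when that fragment changes the plan schema
      // (SPARK-15632), embedding the serializer corrupted the results.
      case class Rec(a: Int, b: String)
      val filtered = Seq(Rec(1, "x"), Rec(2, "y")).toDS().filter(_.a > 1)
      ```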
      
      ## How was this patch tested?
      
      New test case added in `DatasetSuite`.
      
      [1]: https://issues.apache.org/jira/browse/SPARK-15632
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #13362 from liancheng/spark-15112-corrupted-filter.
  3. May 29, 2016
    • [MINOR] Resolve a number of miscellaneous build warnings · ce1572d1
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      This change resolves a number of build warnings that have accumulated, before 2.x. It does not address a large number of deprecation warnings, especially related to the Accumulator API. That will happen separately.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #13377 from srowen/BuildWarnings.
  4. May 28, 2016
    • [SPARK-15636][SQL] Make aggregate expressions more concise in explain · 472f1618
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch reduces the verbosity of aggregate expressions in explain (but does not actually remove any information). As an example, for the following command:
      ```
      spark.range(10).selectExpr("sum(id) + 1", "count(distinct id)").explain(true)
      ```
      
      Output before this patch:
      ```
      == Physical Plan ==
      *TungstenAggregate(key=[], functions=[(sum(id#0L),mode=Final,isDistinct=false),(count(id#0L),mode=Final,isDistinct=true)], output=[(sum(id) + 1)#3L,count(DISTINCT id)#16L])
      +- Exchange SinglePartition, None
         +- *TungstenAggregate(key=[], functions=[(sum(id#0L),mode=PartialMerge,isDistinct=false),(count(id#0L),mode=Partial,isDistinct=true)], output=[sum#18L,count#21L])
            +- *TungstenAggregate(key=[id#0L], functions=[(sum(id#0L),mode=PartialMerge,isDistinct=false)], output=[id#0L,sum#18L])
               +- Exchange hashpartitioning(id#0L, 5), None
                  +- *TungstenAggregate(key=[id#0L], functions=[(sum(id#0L),mode=Partial,isDistinct=false)], output=[id#0L,sum#18L])
                     +- *Range (0, 10, splits=2)
      ```
      
      Output after this patch:
      ```
      == Physical Plan ==
      *TungstenAggregate(key=[], functions=[sum(id#0L),count(distinct id#0L)], output=[(sum(id) + 1)#3L,count(DISTINCT id)#16L])
      +- Exchange SinglePartition, None
         +- *TungstenAggregate(key=[], functions=[merge_sum(id#0L),partial_count(distinct id#0L)], output=[sum#18L,count#21L])
            +- *TungstenAggregate(key=[id#0L], functions=[merge_sum(id#0L)], output=[id#0L,sum#18L])
               +- Exchange hashpartitioning(id#0L, 5), None
                  +- *TungstenAggregate(key=[id#0L], functions=[partial_sum(id#0L)], output=[id#0L,sum#18L])
                     +- *Range (0, 10, splits=2)
      ```
      
      Note the change from `(sum(id#0L),mode=PartialMerge,isDistinct=false)` to `merge_sum(id#0L)`.
      
      In general aggregate explain is still very verbose, but further work will be done as follow-up pull requests.
      
      ## How was this patch tested?
      Tested manually.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13367 from rxin/SPARK-15636.
    • [SPARK-15637][SPARKR] fix R tests on R 3.2.2 · 74c1b79f
      felixcheung authored
      ## What changes were proposed in this pull request?
      
      Change version check in R tests
      
      ## How was this patch tested?
      
      R tests
      shivaram
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #13369 from felixcheung/rversioncheck.
    • [SPARK-15549][SQL] Disable bucketing when the output doesn't contain all bucketing columns · b4c32c49
      Yadong Qi authored
      ## What changes were proposed in this pull request?
      I created a bucketed table `bucketed_table` with bucket column `i`:
      ```scala
      case class Data(i: Int, j: Int, k: Int)
      sc.makeRDD(Array((1, 2, 3))).map(x => Data(x._1, x._2, x._3)).toDF.write.bucketBy(2, "i").saveAsTable("bucketed_table")
      ```
      
      and ran the following SQL queries:
      ```sql
      SELECT j FROM bucketed_table;
      Error in query: bucket column i not found in existing columns (j);
      
      SELECT j, MAX(k) FROM bucketed_table GROUP BY j;
      Error in query: bucket column i not found in existing columns (j, k);
      ```
      
      I think we should add a check so that bucketing is enabled only when all of the conditions below are satisfied (combined in the sketch after this list):
      1. the conf is enabled
      2. the relation is bucketed
      3. the output contains all bucketing columns
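
      A minimal sketch of the three conditions combined (all names here are
      assumed, not the actual planner code):

      ```scala
      def canUseBucketing(
          bucketingEnabled: Boolean,           // 1. the conf is enabled
          bucketColumns: Option[Seq[String]],  // 2. defined when the relation is bucketed
          outputColumns: Set[String]): Boolean = {
        // 3. the output must contain every bucketing column
        bucketingEnabled && bucketColumns.exists(_.forall(outputColumns.contains))
      }
      ```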
      
      ## How was this patch tested?
      Updated test cases to reflect the changes.
      
      Author: Yadong Qi <qiyadong2010@gmail.com>
      
      Closes #13321 from watermen/SPARK-15549.
  5. May 27, 2016
    • [SPARK-15553][SQL] Dataset.createTempView should use CreateViewCommand · f1b220ee
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      Let `Dataset.createTempView` and `Dataset.createOrReplaceTempView` use `CreateViewCommand`, rather than calling `SparkSession.createTempView`. Besides, this patch also removes `SparkSession.createTempView`.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #13327 from viirya/dataset-createtempview.
    • [SPARK-15633][MINOR] Make package name for Java tests consistent · 73178c75
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This is a simple patch that makes package names for Java 8 test suites consistent. I moved everything to test.org.apache.spark so we can test package-private APIs properly. Also added "java8" as the package name so we can easily run all the tests related to Java 8.
      
      ## How was this patch tested?
      This is a test only change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13364 from rxin/SPARK-15633.
    • [SPARK-15610][ML] update error message for k in pca · 9893dc97
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Fix the wrong bound of `k` in `PCA`:
      `require(k <= sources.first().size, ...)` -> `require(k < sources.first().size, ...)`
      
      BTW, remove unused import in `ml.ElementwiseProduct`
      
      ## How was this patch tested?
      
      manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #13356 from zhengruifeng/fix_pca.
    • [SPARK-15562][ML] Delete temp directory after program exit in DataFrameExample · 88c9c467
      dding3 authored
      ## What changes were proposed in this pull request?
      The temp directory used to save records in DataFrameExample is not deleted after the program exits: although deleteOnExit is called, it doesn't work because the directory is not empty. A similar thing happened in ContextCleanerSuite. This updates the code to make sure the temp directory is deleted after the program exits.
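
      A minimal sketch of the underlying idea (`tempDir` stands in for the
      example's temp directory; Spark also has its own utility for this):

      ```scala
      import java.io.File

      // File.deleteOnExit only removes empty directories, so delete the tree
      // bottom-up in a shutdown hook instead.
      def deleteRecursively(f: File): Unit = {
        Option(f.listFiles).toSeq.flatten.foreach(deleteRecursively)
        f.delete()
      }
      sys.addShutdownHook(deleteRecursively(tempDir))
      ```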
      
      ## How was this patch tested?
      
      unit tests and local build.
      
      Author: dding3 <ding.ding@intel.com>
      
      Closes #13328 from dding3/master.
    • [SPARK-15449][MLLIB][EXAMPLE] Wrong Data Format - Documentation Issue · 5d4dafe8
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      In the MLlib naive Bayes example, the Scala and Python examples don't use libsvm data, but the Java one does.

      I changed the Scala and Python examples to use the same libsvm data as the Java example.
      
      ## How was this patch tested?
      
      Manual tests
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #13301 from wangmiao1981/example.
    • [SPARK-15594][SQL] ALTER TABLE SERDEPROPERTIES does not respect partition spec · 4a2fb8b8
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      These commands ignore the partition spec and change the storage properties of the table itself:
      ```
      ALTER TABLE table_name PARTITION (a=1, b=2) SET SERDE 'my_serde'
      ALTER TABLE table_name PARTITION (a=1, b=2) SET SERDEPROPERTIES ('key1'='val1')
      ```
      Now they change the storage properties of the specified partition.
      
      ## How was this patch tested?
      
      DDLSuite
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #13343 from andrewor14/alter-table-serdeproperties.
    • [SPARK-9876][SQL] Update Parquet to 1.8.1. · 776d183c
      Ryan Blue authored
      ## What changes were proposed in this pull request?
      
      This includes minimal changes to get Spark using the current release of Parquet, 1.8.1.
      
      ## How was this patch tested?
      
      This uses the existing Parquet tests.
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #13280 from rdblue/SPARK-9876-update-parquet.
    • [SPARK-15431][SQL][BRANCH-2.0-TEST] rework the clisuite test cases · 019afd9c
      Xin Wu authored
      ## What changes were proposed in this pull request?
      This PR reworks the CliSuite test cases for the `LIST FILES/JARS` commands.
      
      CC yhuai Thanks!
      
      Author: Xin Wu <xinwu@us.ibm.com>
      
      Closes #13361 from xwu0226/SPARK-15431-clisuite-new.
    • [SPARK-15413][ML][MLLIB] Change `toBreeze` to `asBreeze` in Vector and Matrix · 21b2605d
      DB Tsai authored
      ## What changes were proposed in this pull request?
      
      We're using `asML` to convert the mllib vector/matrix to the ml vector/matrix now. Using `as` is more correct given that this conversion actually shares the same underlying data structure. As a result, this PR changes `toBreeze` to `asBreeze`. Since this is a private API, it will not affect any user's application.
      
      ## How was this patch tested?
      
      unit tests
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #13198 from dbtsai/minor.
    • [SPARK-15008][ML][PYSPARK] Add integration test for OneVsRest · 130b8d07
      yinxusen authored
      ## What changes were proposed in this pull request?
      
      1. Add `_transfer_param_map_to/from_java` for OneVsRest;
      
      2. Add `_compare_params` in ml/tests.py to help compare params.
      
      3. Add `test_onevsrest` as the integration test for OneVsRest.
      
      ## How was this patch tested?
      
      Python unit test.
      
      Author: yinxusen <yinxusen@gmail.com>
      
      Closes #12875 from yinxusen/SPARK-15008.
    • [SPARK-11959][SPARK-15484][DOC][ML] Document WLS and IRLS · a3550e37
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      * Document ```WeightedLeastSquares``` (normal equation) and ```IterativelyReweightedLeastSquares```.
      * Copy ```L-BFGS``` documents from ```spark.mllib``` to ```spark.ml```.
      
      Since the section ```Optimization of linear methods``` is aimed at developers, I think we should provide a brief introduction to the optimization methods, necessary references, and how they are implemented in Spark. It's not necessary to paste all the mathematical formulas and derivations here; developers/users who want to learn more can follow the references.
      
      ## How was this patch tested?
      Document update, no tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #13262 from yanboliang/spark-15484.
    • [SPARK-15186][ML][DOCS] Add user guide for generalized linear regression · c96244f5
      sethah authored
      ## What changes were proposed in this pull request?
      
      This patch adds a user guide section for generalized linear regression and includes the examples from [#12754](https://github.com/apache/spark/pull/12754).
      
      ## How was this patch tested?
      
      Documentation only, no tests required.
      
      ## Approach
      
      In general, it is a bit unclear what level of detail ought to be included in the user guide since there is a lot of variability within the current user guide. I tried to give a fairly brief mathematical introduction to GLMs, and cover what types of problems they could be used for. Additionally, I included a brief blurb on the IRLS solver. The input/output columns are given in a table as is found elsewhere in the docs (though, again, these appear rather intermittently in the current docs), as well as a table providing the supported families and their link functions.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #13139 from sethah/SPARK-15186.
    • [SPARK-14400][SQL] ScriptTransformation does not fail the job for bad user command · a96e4151
      Tejas Patil authored
      ## What changes were proposed in this pull request?
      
      - Refer to the JIRA for the problem: https://issues.apache.org/jira/browse/SPARK-14400
      - The fix is to check in `hasNext()` whether the process has exited with a non-zero exit code. I moved this check, along with the check for writer-thread exceptions, into a separate method.
      
      ## How was this patch tested?
      
      - Ran a job with an incorrect transform script command and saw that the job fails
      - Existing unit tests for `ScriptTransformationSuite`. Added a new unit test
      
      Author: Tejas Patil <tejasp@fb.com>
      
      Closes #12194 from tejasapatil/script_transform.
    • Andrew Or authored · b376a4ea
    • [YARN][DOC][MINOR] Remove several obsolete env variables and update the doc · 1b98fa2e
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      Removes several obsolete env variables no longer supported by Spark on YARN, and updates the docs to cover several related changes in 2.0.
      
      ## How was this patch tested?
      
      N/A
      
      CC vanzin tgravescs
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #13296 from jerryshao/yarn-doc.
    • [SPARK-15531][DEPLOY] spark-class tries to use too much memory when running Launcher · 623aae59
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Explicitly limit the launcher JVM memory to a modest 128m.
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #13360 from srowen/SPARK-15531.
    • [SPARK-15569] Reduce frequency of updateBytesWritten function in Disk… · ce756daa
      Sital Kedia authored
      ## What changes were proposed in this pull request?
      
      Profiling a Spark job that spills a large amount of intermediate data, we found that a significant portion of time is spent in the DiskObjectWriter.updateBytesWritten function. Looking at the code, we see that the function is called too frequently to update the number of bytes written to disk. We should reduce the frequency to avoid this.
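
      The batching idea, sketched (the update period and class name here are
      hypothetical):

      ```scala
      // Update the bytes-written metric once every N records instead of on
      // every single write.
      class BatchedWriterMetrics(updateBytesWritten: () => Unit) {
        private val UpdatePeriod = 16384
        private var recordsSinceUpdate = 0L
        def recordWritten(): Unit = {
          recordsSinceUpdate += 1
          if (recordsSinceUpdate % UpdatePeriod == 0) updateBytesWritten()
        }
      }
      ```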
      
      ## How was this patch tested?
      
      Tested by running the job on a cluster; saw a 20% CPU gain from this change.
      
      Author: Sital Kedia <skedia@fb.com>
      
      Closes #13332 from sitalkedia/DiskObjectWriter.
    • [MINOR][DOCS] Typo fixes in Dataset scaladoc · 5bdbedf2
      Xinh Huynh authored
      ## What changes were proposed in this pull request?
      
      Minor typo fixes in Dataset scaladoc
      * Corrected context type as SparkSession, not SQLContext.
      liancheng rxin andrewor14
      
      ## How was this patch tested?
      
      Compiled locally
      
      Author: Xinh Huynh <xinh_huynh@yahoo.com>
      
      Closes #13330 from xinhhuynh/fix-dataset-typos.
    • [SPARK-15597][SQL] Add SparkSession.emptyDataset · a52e6813
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch adds a new function emptyDataset to SparkSession, for creating an empty dataset.
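
      Usage, roughly (an implicit Encoder is required; primitives get one from
      spark.implicits):

      ```scala
      import spark.implicits._

      // Creates a Dataset[Int] with no rows.
      val empty = spark.emptyDataset[Int]
      assert(empty.count() == 0)
      ```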
      
      ## How was this patch tested?
      Added a test case.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13344 from rxin/SPARK-15597.