  1. Aug 08, 2016
    • Marcelo Vanzin's avatar
      [SPARK-16586][CORE] Handle JVM errors printed to stdout. · 1739e75f
      Marcelo Vanzin authored
      Some very rare JVM errors are printed to stdout, and that confuses
      the code in spark-class. So add a check so that those cases are
      detected and the proper error message is shown to the user.
      
      Tested by running spark-submit after setting "ulimit -v 32000".
      
      Closes #14231
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #14508 from vanzin/SPARK-16586.
      1739e75f
    • gatorsmile's avatar
      [SPARK-16936][SQL] Case Sensitivity Support for Refresh Temp Table · 5959df21
      gatorsmile authored
      ### What changes were proposed in this pull request?
      Currently, the `refreshTable` API is always case sensitive.
      
When users refer to the view name without an exact case match, the API silently ignores the call. Users might expect the command to have completed successfully; however, when they run subsequent SQL commands, they may still get an exception like
      ```
      Job aborted due to stage failure:
      Task 1 in stage 4.0 failed 1 times, most recent failure: Lost task 1.0 in stage 4.0 (TID 7, localhost):
      java.io.FileNotFoundException:
      File file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-bd4b9ea6-9aec-49c5-8f05-01cff426211e/part-r-00000-0c84b915-c032-4f2e-abf5-1d48fdbddf38.snappy.parquet does not exist
      ```
      
      This PR is to fix the issue.
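
To make the scenario concrete, here is a hedged repro sketch of the behaviour described above; the path and view names are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("refresh-case-demo").master("local[*]").getOrCreate()

// Register a temp view under a lower-case name, backed by Parquet files.
spark.range(10).write.mode("overwrite").parquet("/tmp/refresh_demo")
spark.read.parquet("/tmp/refresh_demo").createOrReplaceTempView("mytable")

// Before this fix, a call with a different case was silently ignored, so stale
// cached file metadata could later surface as a FileNotFoundException.
spark.catalog.refreshTable("MYTABLE")
```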
      
      ### How was this patch tested?
      Added a test case.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14523 from gatorsmile/refreshTempTable.
      5959df21
    • gatorsmile's avatar
      [SPARK-16457][SQL] Fix Wrong Messages when CTAS with a Partition By Clause · ab126909
      gatorsmile authored
      #### What changes were proposed in this pull request?
When doing a CTAS with a PARTITIONED BY clause, we currently get the wrong error message.
      
      For example,
      ```SQL
      CREATE TABLE gen__tmp
      PARTITIONED BY (key string)
      AS SELECT key, value FROM mytable1
      ```
      The error message we get now is like
      ```
      Operation not allowed: Schema may not be specified in a Create Table As Select (CTAS) statement(line 2, pos 0)
      ```
      
      However, based on the code, the message we should get is like
      ```
      Operation not allowed: A Create Table As Select (CTAS) statement is not allowed to create a partitioned table using Hive's file formats. Please use the syntax of "CREATE TABLE tableName USING dataSource OPTIONS (...) PARTITIONED BY ...\" to create a partitioned table through a CTAS statement.(line 2, pos 0)
      ```
      
Currently, the partitioning columns are part of the schema. This PR fixes the bug by changing the detection order.
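
For reference, a sketch of the data-source CTAS form that the corrected error message points users to (reusing the names from the example above; the `parquet` source is only an illustration, and `spark` is the session provided by spark-shell):

```scala
spark.sql("""
  CREATE TABLE gen__tmp
  USING parquet
  PARTITIONED BY (key)
  AS SELECT key, value FROM mytable1
""")
```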
      
      #### How was this patch tested?
      Added test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14113 from gatorsmile/ctas.
      ab126909
    • Sean Zhong's avatar
      [SPARK-16906][SQL] Adds auxiliary info like input class and input schema in... · 94a9d11e
      Sean Zhong authored
      [SPARK-16906][SQL] Adds auxiliary info like input class and input schema in TypedAggregateExpression
      
      ## What changes were proposed in this pull request?
      
      This PR adds auxiliary info like input class and input schema in TypedAggregateExpression
      
      ## How was this patch tested?
      
      Manual test.
      
      Author: Sean Zhong <seanzhong@databricks.com>
      
      Closes #14501 from clockfly/typed_aggregation.
      94a9d11e
    • Nattavut Sutyanyong's avatar
      [SPARK-16804][SQL] Correlated subqueries containing non-deterministic... · 06f5dc84
      Nattavut Sutyanyong authored
      [SPARK-16804][SQL] Correlated subqueries containing non-deterministic operations return incorrect results
      
      ## What changes were proposed in this pull request?
      
This patch fixes the incorrect results by having the rule ResolveSubquery in Catalyst's Analysis phase return an error message when a LIMIT is found on the path from the parent table to the correlated predicate in the subquery.
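
A hedged illustration of the kind of pattern that now triggers an analysis error instead of wrong results (table and column names are hypothetical, and `spark` is the session provided by spark-shell):

```scala
// A correlated subquery with a LIMIT between the correlated predicate and the parent table.
spark.sql("""
  SELECT *
  FROM t1
  WHERE t1.a IN (SELECT t2.a FROM t2 WHERE t2.b = t1.b LIMIT 1)
""")
```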
      
      ## How was this patch tested?
      
      ./dev/run-tests
      a new unit test on the problematic pattern.
      
      Author: Nattavut Sutyanyong <nsy.can@gmail.com>
      
      Closes #14411 from nsyca/master.
      06f5dc84
    • Weiqing Yang's avatar
      [SPARK-16945] Fix Java Lint errors · e10ca8de
      Weiqing Yang authored
      ## What changes were proposed in this pull request?
This PR fixes the following minor Java linter errors:
      [ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/VariableLengthRowBasedKeyValueBatch.java:[42,10] (modifier) RedundantModifier: Redundant 'final' modifier.
      [ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/VariableLengthRowBasedKeyValueBatch.java:[97,10] (modifier) RedundantModifier: Redundant 'final' modifier.
      
      ## How was this patch tested?
      Manual test.
      dev/lint-java
      Using `mvn` from path: /usr/local/bin/mvn
      Checkstyle checks passed.
      
      Author: Weiqing Yang <yangweiqing001@gmail.com>
      
      Closes #14532 from Sherry302/master.
      e10ca8de
    • sethah's avatar
      [SPARK-16404][ML] LeastSquaresAggregators serializes unnecessary data · 1db1c656
      sethah authored
      ## What changes were proposed in this pull request?
      Similar to `LogisticAggregator`, `LeastSquaresAggregator` used for linear regression ends up serializing the coefficients and the features standard deviations, which is not necessary and can cause performance issues for high dimensional data. This patch removes this serialization.
      
In https://github.com/apache/spark/pull/13729 the approach was to pass these values directly to the add method. The approach used here is instead to mark these fields as transient, which has the benefit of keeping the signature of the add method simple and interpretable. The downside is that it requires the use of `transient lazy val`s, which are difficult to reason about if one is not familiar with serialization in Scala/Spark.
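
A minimal sketch of the `transient lazy val` pattern described above (not the actual `LeastSquaresAggregator` code; names are illustrative):

```scala
import org.apache.spark.broadcast.Broadcast

class LeastSquaresAggregatorSketch(
    coefficientsBc: Broadcast[Array[Double]],
    featuresStdBc: Broadcast[Array[Double]]) extends Serializable {

  // Not serialized with the aggregator; rebuilt lazily on each executor from
  // the (cheaply shipped) broadcast handles, so the large arrays stay out of
  // the task closure.
  @transient private lazy val effectiveCoefficients: Array[Double] = {
    val std = featuresStdBc.value
    coefficientsBc.value.zipWithIndex.map { case (c, i) =>
      if (std(i) != 0.0) c / std(i) else 0.0
    }
  }

  // The add method keeps its simple signature; it just reads the lazy field.
  def add(features: Array[Double], label: Double): this.type = {
    val margin = features.zip(effectiveCoefficients).map { case (x, w) => x * w }.sum
    // ... accumulate loss/gradient statistics from `margin` and `label` ...
    this
  }
}
```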
      
      ## How was this patch tested?
      
      **MLlib**
      ![image](https://cloud.githubusercontent.com/assets/7275795/16703660/436f79fa-4524-11e6-9022-ef00058ec718.png)
      
      **ML without patch**
      ![image](https://cloud.githubusercontent.com/assets/7275795/16703831/c4d50b9e-4525-11e6-80cb-9b58c850cd41.png)
      
      **ML with patch**
      ![image](https://cloud.githubusercontent.com/assets/7275795/16703675/63e0cf40-4524-11e6-9120-1f512a70e083.png)
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #14109 from sethah/LIR_serialize.
      1db1c656
    • Tejas Patil's avatar
      [SPARK-16919] Configurable update interval for console progress bar · e076fb05
      Tejas Patil authored
      ## What changes were proposed in this pull request?
      
      Currently the update interval for the console progress bar is hardcoded. This PR makes it configurable for users.
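
A hedged usage sketch; the configuration key is not stated in this log, so `spark.ui.consoleProgress.update.interval` below is an assumption:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("console-progress-demo")
  .master("local[*]")
  // Assumed key name: refresh the console progress bar every second.
  .config("spark.ui.consoleProgress.update.interval", "1000")
  .getOrCreate()
```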
      
      ## How was this patch tested?
      
Ran a long-running job with a high value for the update interval and verified that the updates were shown less frequently.
      
      Author: Tejas Patil <tejasp@fb.com>
      
      Closes #14507 from tejasapatil/SPARK-16919.
      e076fb05
  2. Aug 07, 2016
  3. Aug 06, 2016
    • Josh Rosen's avatar
      [SPARK-16925] Master should call schedule() after all executor exit events, not only failures · 4f5f9b67
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      This patch fixes a bug in Spark's standalone Master which could cause applications to hang if tasks cause executors to exit with zero exit codes.
      
      As an example of the bug, run
      
      ```
      sc.parallelize(1 to 1, 1).foreachPartition { _ => System.exit(0) }
      ```
      
      on a standalone cluster which has a single Spark application. This will cause all executors to die but those executors won't be replaced unless another Spark application or worker joins or leaves the cluster (or if an executor exits with a non-zero exit code). This behavior is caused by a bug in how the Master handles the `ExecutorStateChanged` event: the current implementation calls `schedule()` only if the executor exited with a non-zero exit code, so a task which causes a JVM to unexpectedly exit "cleanly" will skip the `schedule()` call.
      
This patch addresses this by modifying the `ExecutorStateChanged` handling to unconditionally call `schedule()`. This should be safe because it is always safe to call `schedule()`; adding extra `schedule()` calls can only affect performance and should not introduce correctness bugs.
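
A minimal, self-contained sketch of the behavioural change (not the actual Master code; names are illustrative):

```scala
object MasterSchedulingSketch {
  private def removeExecutor(execId: Int): Unit = println(s"removed executor $execId")
  private def schedule(): Unit = println("schedule(): try to launch replacement executors")

  def onExecutorStateChanged(execId: Int, finished: Boolean, exitCode: Int): Unit = {
    if (finished) {
      removeExecutor(execId)
      // Before the fix this call was effectively guarded by `exitCode != 0`, so a
      // "clean" executor exit could leave the application without executors.
      schedule()
    }
  }
}
```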
      
      ## How was this patch tested?
      
      I added a regression test in `DistributedSuite`.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #14510 from JoshRosen/SPARK-16925.
      4f5f9b67
  4. Aug 05, 2016
    • Nicholas Chammas's avatar
      [SPARK-16772][PYTHON][DOCS] Fix API doc references to UDFRegistration + Update "important classes" · 2dd03886
      Nicholas Chammas authored
      ## Proposed Changes
      
      * Update the list of "important classes" in `pyspark.sql` to match 2.0.
      * Fix references to `UDFRegistration` so that the class shows up in the docs. It currently [doesn't](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html).
      * Remove some unnecessary whitespace in the Python RST doc files.
      
      I reused the [existing JIRA](https://issues.apache.org/jira/browse/SPARK-16772) I created last week for similar API doc fixes.
      
      ## How was this patch tested?
      
      * I ran `lint-python` successfully.
      * I ran `make clean build` on the Python docs and confirmed the results are as expected locally in my browser.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #14496 from nchammas/SPARK-16772-UDFRegistration.
      2dd03886
    • Artur Sukhenko's avatar
      [SPARK-16796][WEB UI] Mask spark.authenticate.secret on Spark environ… · 14dba452
      Artur Sukhenko authored
      ## What changes were proposed in this pull request?
      
      Mask `spark.authenticate.secret` on Spark environment page (Web UI).
This is an addition to https://github.com/apache/spark/pull/14409
      
      ## How was this patch tested?
      `./dev/run-tests`
      [info] ScalaTest
      [info] Run completed in 1 hour, 8 minutes, 38 seconds.
      [info] Total number of tests run: 2166
      [info] Suites: completed 65, aborted 0
      [info] Tests: succeeded 2166, failed 0, canceled 0, ignored 590, pending 0
      [info] All tests passed.
      
      Author: Artur Sukhenko <artur.sukhenko@gmail.com>
      
      Closes #14484 from Devian-ua/SPARK-16796.
      14dba452
    • hyukjinkwon's avatar
[SPARK-16847][SQL] Prevent to potentially read corrupt statistics on binary in... · 55d6dad6
      hyukjinkwon authored
[SPARK-16847][SQL] Prevent to potentially read corrupt statistics on binary in Parquet vectorized reader
      
      ## What changes were proposed in this pull request?
      
      This problem was found in [PARQUET-251](https://issues.apache.org/jira/browse/PARQUET-251) and we disabled filter pushdown on binary columns in Spark before. We enabled this after upgrading Parquet but it seems there is potential incompatibility for Parquet files written in lower Spark versions.
      
Currently, this does not happen in the normal Parquet reader. However, Spark implements its own vectorized reader, separate from Parquet's standard API. The normal Parquet reader handles this case, but the vectorized reader does not.
      
      It is okay to just pass `FileMetaData`. This is being handled in parquet-mr (See https://github.com/apache/parquet-mr/commit/e3b95020f777eb5e0651977f654c1662e3ea1f29). This will prevent loading corrupt statistics in each page in Parquet.
      
      This PR replaces the deprecated usage of constructor.
      
      ## How was this patch tested?
      
      N/A
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #14450 from HyukjinKwon/SPARK-16847.
      55d6dad6
    • Yin Huai's avatar
      [SPARK-16901] Hive settings in hive-site.xml may be overridden by Hive's default values · e679bc3c
      Yin Huai authored
      ## What changes were proposed in this pull request?
When we create the HiveConf for the metastore client, we use a Hadoop Conf as the base, which may contain Hive settings from hive-site.xml (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L49). However, HiveConf's initialize function basically ignores the base Hadoop Conf and always uses its default values (i.e. settings with non-null default values) as the base (https://github.com/apache/hive/blob/release-1.2.1/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L2687). So, even if a user puts javax.jdo.option.ConnectionURL in hive-site.xml, it is not used and Hive will use its default, which is jdbc:derby:;databaseName=metastore_db;create=true.
      
      This issue only shows up when `spark.sql.hive.metastore.jars` is not set to builtin.
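
A hedged sketch of the remedy implied above (assumed shape, not the actual SharedState/HiveUtils change): re-apply the base Hadoop Conf entries on top of the initialized HiveConf so values from hive-site.xml win over Hive's built-in defaults.

```scala
import org.apache.hadoop.conf.Configuration
import scala.collection.JavaConverters._

def overlayBaseConf(base: Configuration, hiveConf: Configuration): Configuration = {
  // HiveConf's initialize() starts from Hive's defaults, so copy the user's
  // settings (including those read from hive-site.xml) back on top.
  base.iterator().asScala.foreach { entry =>
    hiveConf.set(entry.getKey, entry.getValue)
  }
  hiveConf
}
```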
      
      ## How was this patch tested?
      New test in HiveSparkSubmitSuite.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #14497 from yhuai/SPARK-16901.
      e679bc3c
    • Yanbo Liang's avatar
      [SPARK-16750][FOLLOW-UP][ML] Add transformSchema for... · 6cbde337
      Yanbo Liang authored
      [SPARK-16750][FOLLOW-UP][ML] Add transformSchema for StringIndexer/VectorAssembler and fix failed tests.
      
      ## What changes were proposed in this pull request?
This is a follow-up for #14378. When adding ```transformSchema``` for all estimators and transformers, I found that tests failed for ```StringIndexer``` and ```VectorAssembler```, so I moved this part of the work into a separate PR to make it clearer to review.
The corresponding tests should throw ```IllegalArgumentException``` during schema validation once we add ```transformSchema```. It is more efficient to throw the exception at the start of ```fit``` or ```transform``` rather than partway through the process.
      
      ## How was this patch tested?
      Modified unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #14455 from yanboliang/transformSchema.
      6cbde337
    • Ekasit Kijsipongse's avatar
      [SPARK-13238][CORE] Add ganglia dmax parameter · 1f96c97f
      Ekasit Kijsipongse authored
The current Ganglia reporter doesn't set the metric expiration time (dmax), so the metrics of all finished applications are left displayed indefinitely in the Ganglia web UI. The dmax parameter allows the user to set the lifetime of the metrics. The default value is 0 for compatibility with previous versions.
      
      Author: Ekasit Kijsipongse <ekasitk@gmail.com>
      
      Closes #11127 from ekasitk/ganglia-dmax.
      1f96c97f
    • Bryan Cutler's avatar
      [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs · 180fd3e0
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
Improve example outputs to better reflect the functionality that is being presented.  This mostly consisted of modifying what was printed at the end of the example, such as calling show() with truncate=False, but sometimes required minor tweaks in the example data to get relevant output.  Explicitly set parameters when they are used as part of the example.  Fixed Java examples that failed to run because they used old-style MLlib Vectors or had schema problems.  Synced examples between the different APIs.
      
      ## How was this patch tested?
      Ran each example for Scala, Python, and Java and made sure output was legible on a terminal of width 100.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #14308 from BryanCutler/ml-examples-improve-output-SPARK-16260.
      180fd3e0
    • Sylvain Zimmer's avatar
      [SPARK-16826][SQL] Switch to java.net.URI for parse_url() · 2460f03f
      Sylvain Zimmer authored
      ## What changes were proposed in this pull request?
      The java.net.URL class has a globally synchronized Hashtable, which limits the throughput of any single executor doing lots of calls to parse_url(). Tests have shown that a 36-core machine can only get to 10% CPU use because the threads are locked most of the time.
      
This patch switches to java.net.URI, which has fewer features than java.net.URL but focuses on URI parsing, which is enough for parse_url().
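
A hedged sketch of the approach (not Spark's actual `ParseUrl` implementation): `java.net.URI` has no global lock and still exposes the pieces `parse_url()` needs.

```scala
import java.net.{URI, URISyntaxException}

def extractUrlPart(url: String, part: String): Option[String] =
  try {
    val uri = new URI(url)
    part match {
      case "HOST"     => Option(uri.getHost)
      case "PATH"     => Option(uri.getRawPath)
      case "QUERY"    => Option(uri.getRawQuery)
      case "REF"      => Option(uri.getRawFragment)
      case "PROTOCOL" => Option(uri.getScheme)
      case _          => None
    }
  } catch {
    case _: URISyntaxException => None // malformed URLs yield no result, mirroring parse_url()
  }
```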
      
      New tests were added to make sure a few common edge cases didn't change behaviour.
      https://issues.apache.org/jira/browse/SPARK-16826
      
      ## How was this patch tested?
      I've kept the old URL code commented for now, so that people can verify that the new unit tests do pass with java.net.URL.
      
      Thanks to srowen for the help!
      
      Author: Sylvain Zimmer <sylvain@sylvainzimmer.com>
      
      Closes #14488 from sylvinus/master.
      2460f03f
    • Yuming Wang's avatar
      [SPARK-16625][SQL] General data types to be mapped to Oracle · 39a2b2ea
      Yuming Wang authored
      ## What changes were proposed in this pull request?
      
Spark will convert **BooleanType** to **BIT(1)**, **LongType** to **BIGINT**, and **ByteType** to **BYTE** when saving a DataFrame to Oracle, but Oracle does not support the BIT, BIGINT, and BYTE types.

This PR converts the following _Spark types_ to _Oracle types_, per the [Oracle Developer's Guide](https://docs.oracle.com/cd/E19501-01/819-3659/gcmaz/); see the sketch after the table below.
      
      Spark Type | Oracle
      ----|----
      BooleanType | NUMBER(1)
      IntegerType | NUMBER(10)
      LongType | NUMBER(19)
      FloatType | NUMBER(19, 4)
      DoubleType | NUMBER(19, 4)
      ByteType | NUMBER(3)
      ShortType | NUMBER(5)
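
A hedged sketch of the kind of dialect mapping the table above describes (not the exact OracleDialect change):

```scala
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types._

object OracleDialectSketch extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")

  // Map Spark SQL types to Oracle NUMBER types per the table above.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case BooleanType => Some(JdbcType("NUMBER(1)", Types.NUMERIC))
    case IntegerType => Some(JdbcType("NUMBER(10)", Types.NUMERIC))
    case LongType    => Some(JdbcType("NUMBER(19)", Types.NUMERIC))
    case FloatType   => Some(JdbcType("NUMBER(19, 4)", Types.NUMERIC))
    case DoubleType  => Some(JdbcType("NUMBER(19, 4)", Types.NUMERIC))
    case ByteType    => Some(JdbcType("NUMBER(3)", Types.NUMERIC))
    case ShortType   => Some(JdbcType("NUMBER(5)", Types.NUMERIC))
    case _           => None
  }
}
```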
      
      ## How was this patch tested?
      
      Add new tests in [JDBCSuite.scala](https://github.com/wangyum/spark/commit/22b0c2a4228cb8b5098ad741ddf4d1904e745ff6#diff-dc4b58851b084b274df6fe6b189db84d) and [OracleDialect.scala](https://github.com/wangyum/spark/commit/22b0c2a4228cb8b5098ad741ddf4d1904e745ff6#diff-5e0cadf526662f9281aa26315b3750ad)
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #14377 from wangyum/SPARK-16625.
      39a2b2ea
    • petermaxlee's avatar
      [MINOR] Update AccumulatorV2 doc to not mention "+=". · e0260641
      petermaxlee authored
      ## What changes were proposed in this pull request?
      As reported by Bryan Cutler on the mailing list, AccumulatorV2 does not have a += method, yet the documentation still references it.
      
      ## How was this patch tested?
      N/A
      
      Author: petermaxlee <petermaxlee@gmail.com>
      
      Closes #14466 from petermaxlee/accumulator.
      e0260641
    • cody koeninger's avatar
      [SPARK-16312][STREAMING][KAFKA][DOC] Doc for Kafka 0.10 integration · c9f2501a
      cody koeninger authored
      ## What changes were proposed in this pull request?
      Doc for the Kafka 0.10 integration
      
      ## How was this patch tested?
      Scala code examples were taken from my example repo, so hopefully they compile.
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #14385 from koeninger/SPARK-16312.
      c9f2501a
    • Wenchen Fan's avatar
      [SPARK-16879][SQL] unify logical plans for CREATE TABLE and CTAS · 5effc016
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
We have various logical plans for CREATE TABLE and CTAS: `CreateTableUsing`, `CreateTableUsingAsSelect`, `CreateHiveTableAsSelectLogicalPlan`. This PR unifies them to reduce the complexity and centralize the error handling.
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #14482 from cloud-fan/table.
      5effc016
    • Hiroshi Inoue's avatar
      [SPARK-15726][SQL] Make DatasetBenchmark fairer among Dataset, DataFrame and RDD · faaefab2
      Hiroshi Inoue authored
      ## What changes were proposed in this pull request?
      
      DatasetBenchmark compares the performances of RDD, DataFrame and Dataset while running the same operations. However, there are two problems that make the comparisons unfair.
      
1) In the backToBackMap test case, the DataFrame implementation executes less work than the RDD or Dataset implementations. This test case processes Long+String pairs, but the output from the DataFrame implementation does not include the String part, while the RDD and Dataset implementations generate Long+String pairs as output. This difference significantly changes the performance characteristics due to the String manipulation and creation overheads.

2) In the back-to-back map and back-to-back filter test cases, the `map` or `filter` operation is executed only once for RDD regardless of the `numChains` parameter. Hence the execution times for RDD have been largely underestimated.

Of course, these issues do not affect Spark users, but they may confuse Spark developers.
      
      ## How was this patch tested?
      By executing the DatasetBenchmark
      
      Author: Hiroshi Inoue <inouehrs@jp.ibm.com>
      
      Closes #13459 from inouehrs/fix_benchmark_fairness.
      faaefab2
  5. Aug 04, 2016
    • Sean Zhong's avatar
      [SPARK-16907][SQL] Fix performance regression for parquet table when... · 1fa64449
      Sean Zhong authored
      [SPARK-16907][SQL] Fix performance regression for parquet table when vectorized parquet record reader is not being used
      
      ## What changes were proposed in this pull request?
      
For a non-partitioned Parquet table, if the vectorized Parquet record reader is not being used, Spark 2.0 adds an extra, unnecessary memory copy to append partition values for each row.
      
There are several typical cases where the vectorized Parquet record reader is not used:
      1. When the table schema is not flat, like containing nested fields.
      2. When `spark.sql.parquet.enableVectorizedReader = false`
      
      By fixing this bug, we get about 20% - 30% performance gain in test case like this:
      
      ```
      // Generates parquet table with nested columns
      spark.range(100000000).select(struct($"id").as("nc")).write.parquet("/tmp/data4")
      
      def time[R](block: => R): Long = {
          val t0 = System.nanoTime()
          val result = block    // call-by-name
          val t1 = System.nanoTime()
          println("Elapsed time: " + (t1 - t0)/1000000 + "ms")
          (t1 - t0)/1000000
      }
      
      val x = ((0 until 20).toList.map(x => time(spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()))).sum/20
      ```
      
      ## How was this patch tested?
      
      After a few times warm up, we get 26% performance improvement
      
      Before fix:
      ```
      Average: 4584ms, raw data (10 tries): 4726ms 4509ms 4454ms 4879ms 4586ms 4733ms 4500ms 4361ms 4456ms 4640ms
      ```
      
      After fix:
      ```
      Average: 3614ms, raw data(10 tries): 3554ms 3740ms 4019ms 3439ms 3460ms 3664ms 3557ms 3584ms 3612ms 3531ms
      ```
      
      Test env: Intel(R) Core(TM) i7-6700 CPU  3.40GHz, Intel SSD SC2KW24
      
      Author: Sean Zhong <seanzhong@databricks.com>
      
      Closes #14445 from clockfly/fix_parquet_regression_2.
      1fa64449
    • Marcelo Vanzin's avatar
      MAINTENANCE. Cleaning up stale PRs. · 53e766cf
      Marcelo Vanzin authored
      Closing the following PRs due to requests or unresponsive users.
      
      Closes #13923
      Closes #14462
      Closes #13123
      Closes #14423 (requested by srowen)
      Closes #14424 (requested by srowen)
      Closes #14101 (requested by jkbradley)
      Closes #10676 (requested by srowen)
      Closes #10943 (requested by yhuai)
      Closes #9936
      Closes #10701
      Closes #10474
      Closes #13248
      Closes #14347
      Closes #10356
      Closes #9866
      Closes #14310 (requested by srowen)
      Closes #14390 (requested by srowen)
      Closes #14343 (requested by srowen)
      Closes #14402 (requested by srowen)
      Closes #14437 (requested by srowen)
      Closes #12000 (already merged)
      53e766cf
    • Josh Rosen's avatar
      [HOTFIX] Remove unnecessary imports from #12944 that broke build · d91c6755
      Josh Rosen authored
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #14499 from JoshRosen/hotfix.
      d91c6755
    • Sital Kedia's avatar
      [SPARK-15074][SHUFFLE] Cache shuffle index file to speedup shuffle fetch · 9c15d079
      Sital Kedia authored
      ## What changes were proposed in this pull request?
      
Shuffle fetch on a large intermediate dataset is slow because the shuffle service opens and closes the index file for each shuffle fetch. This change introduces a cache for the index information so that we can avoid accessing the index files for each block fetch.
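
A hedged sketch of the caching idea (not the actual shuffle-service code; types and names are illustrative):

```scala
import java.io.File
import scala.collection.concurrent.TrieMap

// Parsed contents of a shuffle index file: offsets of each partition's block.
final case class ShuffleIndexInfo(offsets: Array[Long])

object ShuffleIndexCacheSketch {
  private val cache = TrieMap.empty[File, ShuffleIndexInfo]

  // Reuse the parsed index for repeated block fetches from the same map output
  // instead of reopening and reparsing the file every time.
  def getIndex(indexFile: File)(load: File => ShuffleIndexInfo): ShuffleIndexInfo =
    cache.getOrElseUpdate(indexFile, load(indexFile))
}
```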
      
      ## How was this patch tested?
      
      Tested by running a job on the cluster and the shuffle read time was reduced by 50%.
      
      Author: Sital Kedia <skedia@fb.com>
      
      Closes #12944 from sitalkedia/shuffle_service.
      9c15d079
    • Zheng RuiFeng's avatar
[SPARK-16863][ML] ProbabilisticClassifier.fit check thresholds' length · 0e2e5d7d
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      
Add thresholds' length checking for classifiers that extend ProbabilisticClassifier
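
A hedged sketch of the kind of check described (assumed form, not the exact code):

```scala
// When the thresholds param is set, it must contain exactly one value per class.
def validateThresholds(thresholds: Option[Array[Double]], numClasses: Int): Unit = {
  thresholds.foreach { t =>
    require(t.length == numClasses,
      s"thresholds has length ${t.length}, but numClasses = $numClasses; " +
        "they must match before fitting a ProbabilisticClassifier")
  }
}
```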
      
      ## How was this patch tested?
      
      unit tests and manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #14470 from zhengruifeng/classifier_check_setThreshoulds_length.
      0e2e5d7d
    • hyukjinkwon's avatar
      [SPARK-16877][BUILD] Add rules for preventing to use Java annotations (Deprecated and Override) · 1d781572
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
This PR adds rules to prevent the use of both Java's `Deprecated` and `Override` annotations.
      
      - Java's `Override`
  It seems the Scala compiler just ignores this. Apparently, the `override` modifier is only mandatory for members that override some other **concrete member definition** in a parent class, but not for an **incomplete member definition** (such as one from a trait or an abstract class); see (http://www.scala-lang.org/files/archive/spec/2.11/05-classes-and-objects.html#override)
      
        For a simple example,
      
        - Normal class - needs `override` modifier
      
        ```bash
        scala> class A { def say = {}}
        defined class A
      
        scala> class B extends A { def say = {}}
        <console>:8: error: overriding method say in class A of type => Unit;
         method say needs `override' modifier
               class B extends A { def say = {}}
                                       ^
        ```
      
        - Trait - does not need `override` modifier
      
        ```bash
        scala> trait A { def say }
        defined trait A
      
        scala> class B extends A { def say = {}}
        defined class B
        ```
      
        To cut this short, this case below is possible,
      
        ```bash
        scala> class B extends A {
     |    @Override
             |    def say = {}
             | }
        defined class B
        ```
  we can write the `@Override` annotation (meaning nothing), which might confuse engineers into thinking Java's annotation is working fine. It would be great if we could prevent this potential confusion.
      
      - Java's `Deprecated`
  When `Deprecated` is used, it seems the Scala compiler recognises it correctly, but we use Scala's own `deprecated` across the codebase.
      
      ## How was this patch tested?
      
Manually tested by inserting both `Override` and `Deprecated`. This shows the error messages as below:
      
      ```bash
      Scalastyle checks failed at following occurrences:
      [error] ... : deprecated should be used instead of java.lang.Deprecated.
      ```
      
```bash
      Scalastyle checks failed at following occurrences:
      [error] ... : override modifier should be used instead of java.lang.Override.
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #14490 from HyukjinKwon/SPARK-16877.
      1d781572
    • WeichenXu's avatar
      [SPARK-16880][ML][MLLIB] make ann training data persisted if needed · 462784ff
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
Make sure the ANN layer's input training data is persisted, so that we avoid the overhead of recomputing the RDD from its lineage.
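
A hedged sketch of the persist-if-needed pattern described above (not the exact MLlib change):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def withPersisted[T, R](data: RDD[T])(train: RDD[T] => R): R = {
  // Only take ownership of caching when the caller hasn't persisted the RDD.
  val handlePersistence = data.getStorageLevel == StorageLevel.NONE
  if (handlePersistence) data.persist(StorageLevel.MEMORY_AND_DISK)
  try train(data) finally {
    if (handlePersistence) data.unpersist()
  }
}
```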
      
      ## How was this patch tested?
      
      Existing Tests.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #14483 from WeichenXu123/add_ann_persist_training_data.
      462784ff
    • Zheng RuiFeng's avatar
      [SPARK-16875][SQL] Add args checking for DataSet randomSplit and sample · be8ea4b2
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      
      Add the missing args-checking for randomSplit and sample
      
      ## How was this patch tested?
      unit tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #14478 from zhengruifeng/fix_randomSplit.
      be8ea4b2
    • Eric Liang's avatar
      [SPARK-16884] Move DataSourceScanExec out of ExistingRDD.scala file · ac2a26d0
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
This moves DataSourceScanExec out so it's more discoverable, now that it doesn't necessarily depend on an existing RDD.  cc davies
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #14487 from ericl/split-scan.
      ac2a26d0
    • Davies Liu's avatar
      [SPARK-16802] [SQL] fix overflow in LongToUnsafeRowMap · 9d4e6212
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
This patch fixes the overflow in LongToUnsafeRowMap when the range of keys is very wide (the key is much smaller than minKey; for example, the key is Long.MinValue and minKey is > 0).
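
A hedged illustration of the overflow class described above (not the actual LongToUnsafeRowMap code):

```scala
val minKey = 10L
val key = Long.MinValue
// The subtraction wraps around to a huge positive Long instead of a negative
// distance, so any offset or bound check derived from it is meaningless.
val diff = key - minKey
// The range must be validated on the raw Long values before computing an offset.
val inRange = key >= minKey
println(s"diff = $diff, key in range: $inRange")
```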
      
      ## How was this patch tested?
      
      Added regression test (also for SPARK-16740)
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #14464 from davies/fix_overflow.
      9d4e6212
    • Sean Zhong's avatar
      [SPARK-16853][SQL] fixes encoder error in DataSet typed select · 9d7a4740
      Sean Zhong authored
      ## What changes were proposed in this pull request?
      
      For DataSet typed select:
      ```
      def select[U1: Encoder](c1: TypedColumn[T, U1]): Dataset[U1]
      ```
If type T is a case class or a tuple class that is not atomic, the resulting logical plan's schema will not match the `Dataset[T]` encoder's schema, which causes an encoder error and throws an AnalysisException.
      
      ### Before change:
      ```
      scala> case class A(a: Int, b: Int)
      scala> Seq((0, A(1,2))).toDS.select($"_2".as[A])
      org.apache.spark.sql.AnalysisException: cannot resolve '`a`' given input columns: [_2];
      ..
      ```
      
      ### After change:
      ```
      scala> case class A(a: Int, b: Int)
      scala> Seq((0, A(1,2))).toDS.select($"_2".as[A]).show
      +---+---+
      |  a|  b|
      +---+---+
      |  1|  2|
      +---+---+
      ```
      
      ## How was this patch tested?
      
      Unit test.
      
      Author: Sean Zhong <seanzhong@databricks.com>
      
      Closes #14474 from clockfly/SPARK-16853.
      9d7a4740