  1. Apr 25, 2017
    • Patrick Wendell · 8460b090
    • [SPARK-20239][CORE][2.1-BACKPORT] Improve HistoryServer's ACL mechanism · 359382c0
      jerryshao authored
      Current SHS (Spark History Server) has two different ACLs:
      
      * The ACL of the base URL, controlled by "spark.acls.enabled" or "spark.ui.acls.enabled". With this enabled, only users configured in "spark.admin.acls" (or group) or "spark.ui.view.acls" (or group), or the user who started the SHS, can list the applications; otherwise none can be listed. This also affects the REST APIs that list the summary of all apps and of a single app.
      * The per-application ACL, controlled by "spark.history.ui.acls.enabled". With this enabled, only the history admin users and the user/group who ran an app can access that app's details.
      
      With these two ACLs, we may encounter several unexpected behaviors:
      
      1. If the base URL's ACL (`spark.acls.enable`) is enabled but user "A" has no view permission, user "A" cannot see the app list but can still access the details of their own app.
      2. If the base URL's ACL (`spark.acls.enable`) is disabled, user "A" can download any application's event log, even one that was not run by user "A".
      3. Changes to the live UI's ACL also affect the history UI's ACL, since they share the same conf file.
      
      These unexpected behaviors occur mainly because we have two different ACLs; ideally we should have only one that manages everything.
      
      So, to improve the SHS's ACL mechanism, this PR proposes to:
      
      1. Disable "spark.acls.enable" and only use "spark.history.ui.acls.enable" for history server.
      2. Check permission for event-log download REST API.
      
      With this PR:
      
      1. Admin users can see and download the list of all applications, as well as application details.
      2. Normal users can see the list of all applications, but can only download and check the details of the applications accessible to them. A hedged configuration sketch follows.
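      A minimal sketch of the intended end state, using only the keys named above (Scala; whether a given key is available in branch-2.1 should be verified against the docs):
      
      ```scala
      // Hedged sketch: after this PR only the history-specific ACL key governs the SHS;
      // "spark.acls.enable" is force-disabled there. Keys shown are the ones named above.
      val conf = new org.apache.spark.SparkConf()
        .set("spark.history.ui.acls.enable", "true") // the single per-application switch
        .set("spark.admin.acls", "history-admin")    // admins: full list/download access
      ```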
      
      New UTs are added, and the change is also verified on a real cluster.
      
      CC tgravescs vanzin, please help review; this PR changes the semantics you implemented previously. Thanks a lot.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #17755 from jerryshao/SPARK-20239-2.1-backport.
      359382c0
    • [SPARK-20404][CORE] Using Option(name) instead of Some(name) · 2d47e1aa
      Sergey Zhemzhitsky authored
      
      Use Option(name) instead of Some(name) to prevent runtime failures when using accumulators created like the following:
      ```
      sparkContext.accumulator(0, null)
      ```
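      For illustration, a minimal sketch (not part of the patch) of why the factory method matters when a null name slips in:
      
      ```scala
      // Option(x) maps null to None, while Some(x) happily wraps null; code that later
      // calls name.map(...) or matches Some(n) and dereferences n can then NPE.
      val unsafe: Option[String] = Some(null)   // Some(null): using the name can NPE
      val safe: Option[String]   = Option(null) // None: an unnamed accumulator, handled safely
      ```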
      
      Author: Sergey Zhemzhitsky <szhemzhitski@gmail.com>
      
      Closes #17740 from szhem/SPARK-20404-null-acc-names.
      
      (cherry picked from commit 0bc7a902)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
    • [SPARK-20455][DOCS] Fix Broken Docker IT Docs · 65990fc5
      Armin Braun authored
      
      ## What changes were proposed in this pull request?
      
      Just added the Maven `test` goal.
      
      ## How was this patch tested?
      
      No test needed, just a trivial documentation fix.
      
      Author: Armin Braun <me@obrown.io>
      
      Closes #17756 from original-brownbear/SPARK-20455.
      
      (cherry picked from commit c8f12195)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
    • [SPARK-20451] Filter out nested mapType datatypes from sort order in randomSplit · 42796659
      Sameer Agarwal authored
      
      ## What changes were proposed in this pull request?
      
      In `randomSplit`, it is possible that the underlying dataset doesn't guarantee the ordering of rows in its constituent partitions each time a split is materialized, which can result in overlapping splits.
      
      To prevent this, as part of SPARK-12662, we explicitly sort each input partition to make the ordering deterministic. Given that `MapType` columns cannot be sorted, this patch explicitly prunes them from the sort order. Additionally, if the resulting sort order is empty, this patch materializes the dataset to guarantee determinism.
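      A hedged sketch of the pruning idea as a standalone helper (illustrative, not the actual patch code):
      
      ```scala
      import org.apache.spark.sql.types._
      
      // True if a data type contains a MapType anywhere, meaning the column cannot
      // participate in a deterministic sort order (maps have no ordering).
      def containsMap(dt: DataType): Boolean = dt match {
        case _: MapType    => true
        case s: StructType => s.fields.exists(f => containsMap(f.dataType))
        case a: ArrayType  => containsMap(a.elementType)
        case _             => false
      }
      
      // Candidate sort columns are then the ones free of maps:
      // df.schema.fields.filterNot(f => containsMap(f.dataType)).map(_.name)
      ```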
      
      ## How was this patch tested?
      
      Extended `randomSplit on reordered partitions` in `DataFrameStatSuite` to also test dataframes with mapTypes and nested mapTypes.
      
      Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
      
      Closes #17751 from sameeragarwal/randomsplit2.
      
      (cherry picked from commit 31345fde)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
  2. Apr 24, 2017
    • [SPARK-20450][SQL] Unexpected first-query schema inference cost with 2.1.1 · d99b49b1
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-19611 fixes a regression from 2.0 where Spark silently fails to read case-sensitive fields when a case-sensitive schema is missing from the table properties. The fix is to detect this situation, infer the schema, and write the case-sensitive schema into the metastore.
      
      However, this can incur an unexpected performance hit the first time such a problematic table is queried (and there is a high false-positive rate here, since most tables don't actually have case-sensitive fields).
      
      This PR changes the default to NEVER_INFER (the same behavior as 2.1.0). In 2.2, we can consider leaving the default as INFER_AND_SAVE.
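      For reference, a hedged example of pinning the mode explicitly (the key name comes from SPARK-19611; treat it as an assumption and verify against your Spark version):
      
      ```scala
      // Schema-inference mode for Hive tables that lack a case-sensitive schema.
      // Values in this line of work: INFER_AND_SAVE, INFER_ONLY, NEVER_INFER.
      spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
      ```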
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #17749 from ericl/spark-20450.
  3. Apr 22, 2017
    • [SPARK-20407][TESTS][BACKPORT-2.1] ParquetQuerySuite 'Enabling/disabling... · ba505805
      Bogdan Raducanu authored
      [SPARK-20407][TESTS][BACKPORT-2.1] ParquetQuerySuite 'Enabling/disabling ignoreCorruptFiles' flaky test
      
      ## What changes were proposed in this pull request?
      
      `SharedSQLContext.afterEach` now calls `DebugFilesystem.assertNoOpenStreams` inside `eventually`.
      `SQLTestUtils.withTempDir` calls `waitForTasksToFinish` before deleting the directory.
      
      ## How was this patch tested?
      New test added, but marked as ignored because it takes 30s. It can be unignored for review.
      
      Author: Bogdan Raducanu <bogdan@databricks.com>
      
      Closes #17720 from bogdanrdc/SPARK-20407-BACKPORT2.1.
  4. Apr 21, 2017
  5. Apr 20, 2017
    • [SPARK-20409][SQL] fail early if aggregate function in GROUP BY · 66e7a8f1
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      It's illegal to have an aggregate function in GROUP BY, and we should fail in the analysis phase if this happens. A hedged example follows.
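      An illustration of the query shape that now fails fast (table and column names are made up; assumes an active SparkSession `spark`):
      
      ```scala
      // An aggregate inside GROUP BY used to surface as a confusing later error;
      // with this patch, analysis raises an AnalysisException immediately.
      spark.sql("SELECT k, max(v) FROM t GROUP BY k, max(v)")
      ```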
      
      ## How was this patch tested?
      
      new regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #17704 from cloud-fan/minor.
  6. Apr 19, 2017
  7. Apr 18, 2017
  8. Apr 17, 2017
    • [SPARK-20349][SQL][REVERT-BRANCH2.1] ListFunctions returns duplicate functions... · 3808b472
      Xiao Li authored
      [SPARK-20349][SQL][REVERT-BRANCH2.1] ListFunctions returns duplicate functions after using persistent functions
      
      Revert the changes of https://github.com/apache/spark/pull/17646 made in branch-2.1, because they break the build: they need the parser interface, but SessionCatalog in branch-2.1 does not have it.
      
      ### What changes were proposed in this pull request?
      
      The session catalog caches some persistent functions in the `FunctionRegistry`, so there can be duplicates. Our Catalog API `listFunctions` does not handle them.
      
      It would be better if the `SessionCatalog` API de-duplicated the records, instead of each API caller doing it. In `FunctionRegistry`, our functions are identified by the unquoted string. Thus, this PR tries to parse that string using our parser interface and then de-duplicate the names.
      
      ### How was this patch tested?
      Added test cases.
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17661 from gatorsmile/compilationFix17646.
    • [HOTFIX] Fix compilation. · 622d7a8b
      Reynold Xin authored
    • [SPARK-17647][SQL] Fix backslash escaping in 'LIKE' patterns. · db9517c1
      Jakob Odersky authored
      This patch fixes a bug in the way LIKE patterns are translated to Java regexes. The bug causes any character following an escaped backslash to be escaped as well, i.e. there is double-escaping.
      A concrete example is the pattern `'%\\%'`. The Java regex that this pattern should correspond to (according to the behavior described below) is `'.*\\.*'`; however, the current implementation produces `'.*\\%'` instead.
      
      ---
      
      Update: in light of the discussion that ensued, we should explicitly define the expected behavior of LIKE expressions, especially in certain edge cases. With the help of gatorsmile, we put together a list of different RDBMSs and how they vary with respect to certain standard features.
      
      | RDBMS\Features | Wildcards | Default escape [1] | Case sensitivity |
      | --- | --- | --- | --- |
      | [MS SQL Server](https://msdn.microsoft.com/en-us/library/ms179859.aspx) | _, %, [], [^] | none | no |
      | [Oracle](https://docs.oracle.com/cd/B12037_01/server.101/b10759/conditions016.htm) | _, % | none | yes |
      | [DB2 z/OS](http://www.ibm.com/support/knowledgecenter/SSEPEK_11.0.0/sqlref/src/tpc/db2z_likepredicate.html) | _, % | none | yes |
      | [MySQL](http://dev.mysql.com/doc/refman/5.7/en/string-comparison-functions.html) | _, % | none | no |
      | [PostgreSQL](https://www.postgresql.org/docs/9.0/static/functions-matching.html) | _, % | \ | yes |
      | [Hive](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) | _, % | none | yes |
      | Current Spark | _, % | \ | yes |
      
      [1] Default escape character: most systems do not have a default escape character; instead, the user can specify one by writing a LIKE expression with an escape argument: `[A] LIKE [B] ESCAPE [C]`. This syntax is currently not supported by Spark; however, I would volunteer to implement this feature in a separate ticket.
      
      The specifications are often quite terse and certain scenarios are undocumented, so here is a list of scenarios that I am uncertain about and would appreciate any input. Specifically I am looking for feedback on whether or not Spark's current behavior should be changed.
      1. [x] Ending a pattern with the escape sequence, e.g. `like 'a\'`.
         PostgreSQL gives an error: 'LIKE pattern must not end with escape character', which I personally find logical. Currently, Spark allows "non-terminated" escapes and simply ignores them as part of the pattern.
         According to [DB2's documentation](http://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.messages.sql.doc/doc/msql00130n.html), ending a pattern in an escape character is invalid.
         _Proposed new behaviour in Spark: throw AnalysisException_
      2. [x] Empty input, e.g. `'' like ''`
         Postgres and DB2 will match empty input only if the pattern is empty as well, any other combination of empty input will not match. Spark currently follows this rule.
      3. [x] Escape before a non-special character, e.g. `'a' like '\a'`.
         Escaping a non-wildcard character is not really documented but PostgreSQL just treats it verbatim, which I also find the least surprising behavior. Spark does the same.
         According to [DB2's documentation](http://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.messages.sql.doc/doc/msql00130n.html), it is invalid to follow an escape character with anything other than an escape character, an underscore or a percent sign.
         _Proposed new behaviour in Spark: throw AnalysisException_
      
      The current specification is also described in the operator's source code in this patch.
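      To make the agreed-upon semantics concrete, here is a hedged, self-contained sketch of the escape handling (not the patch's actual implementation; the error type and messages are illustrative):
      
      ```scala
      import java.util.regex.Pattern
      
      // Translate a LIKE pattern (escape character '\') into a Java regex, enforcing
      // the two proposed rules: no trailing escape, and escapes only before \, _ or %.
      def likePatternToRegex(pattern: String): String = {
        val out = new StringBuilder
        val it = pattern.iterator
        while (it.hasNext) it.next() match {
          case '\\' =>
            if (!it.hasNext)
              throw new IllegalArgumentException("pattern must not end with the escape character")
            val c = it.next()
            if (c == '\\' || c == '_' || c == '%') out.append(Pattern.quote(c.toString))
            else throw new IllegalArgumentException(s"the escape character cannot precede '$c'")
          case '_' => out.append('.')  // single-character wildcard
          case '%' => out.append(".*") // multi-character wildcard
          case c   => out.append(Pattern.quote(c.toString))
        }
        out.toString
      }
      ```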
      
      Extra case in regex unit tests.
      
      Author: Jakob Odersky <jakob@odersky.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Reynold Xin <rxin@databricks.com>
      
      Closes #15398 from jodersky/SPARK-17647.
      
      (cherry picked from commit e5fee3e4)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
    • [SPARK-20349][SQL] ListFunctions returns duplicate functions after using persistent functions · 7aad057b
      Xiao Li authored
      
      ### What changes were proposed in this pull request?
      The session catalog caches some persistent functions in the `FunctionRegistry`, so there can be duplicates. Our Catalog API `listFunctions` does not handle them.
      
      It would be better if the `SessionCatalog` API de-duplicated the records, instead of each API caller doing it. In `FunctionRegistry`, our functions are identified by the unquoted string. Thus, this PR tries to parse that string using our parser interface and then de-duplicate the names.
      
      ### How was this patch tested?
      Added test cases.
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17646 from gatorsmile/showFunctions.
      
      (cherry picked from commit 01ff0350)
      Signed-off-by: Xiao Li <gatorsmile@gmail.com>
    • [SPARK-20335][SQL][BACKPORT-2.1] Children expressions of Hive UDF impacts the... · efa11a42
      Xiao Li authored
      [SPARK-20335][SQL][BACKPORT-2.1] Children expressions of Hive UDF impacts the determinism of Hive UDF
      
      ### What changes were proposed in this pull request?
      
      This PR is to backport https://github.com/apache/spark/pull/17635 to Spark 2.1
      
      ---
      ```JAVA
        /**
         * Certain optimizations should not be applied if UDF is not deterministic.
         * Deterministic UDF returns same result each time it is invoked with a
         * particular input. This determinism just needs to hold within the context of
         * a query.
         *
         * return true if the UDF is deterministic
         */
        boolean deterministic() default true;
      ```
      
      Based on the definition of [UDFType](https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFType.java#L42-L50), when a Hive UDF's children are non-deterministic, the Hive UDF itself is also non-deterministic.
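      A toy model of the resulting rule (illustrative names, not the patch's classes):
      
      ```scala
      // An expression is deterministic only if its own UDF annotation says so
      // AND all of its children are deterministic.
      case class Expr(udfDeterministic: Boolean, children: Seq[Expr] = Nil) {
        lazy val deterministic: Boolean =
          udfDeterministic && children.forall(_.deterministic)
      }
      
      // A rand()-like child makes the whole UDF call non-deterministic:
      val udfOverRand = Expr(udfDeterministic = true, Seq(Expr(udfDeterministic = false)))
      assert(!udfOverRand.deterministic)
      ```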
      
      ### How was this patch tested?
      Added test cases.
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17652 from gatorsmile/backport-17635.
  9. Apr 14, 2017
  10. Apr 13, 2017
    • [SPARK-19946][TESTS][BACKPORT-2.1] DebugFilesystem.assertNoOpenStreams should... · bca7ce28
      Bogdan Raducanu authored
      [SPARK-19946][TESTS][BACKPORT-2.1] DebugFilesystem.assertNoOpenStreams should report the open streams to help debugging
      
      ## What changes were proposed in this pull request?
      Backport of PR #17292.
      DebugFilesystem.assertNoOpenStreams now throws an exception with a cause exception that shows the code line which leaked the stream.
      
      ## How was this patch tested?
      New test in SparkContextSuite to check there is a cause exception.
      
      Author: Bogdan Raducanu <bogdan@databricks.com>
      
      Closes #17632 from bogdanrdc/SPARK-19946-BRANCH2.1.
    • [SPARK-19924][SQL][BACKPORT-2.1] Handle InvocationTargetException for all Hive Shim · 98ae5481
      Xiao Li authored
      ### What changes were proposed in this pull request?
      
      This is to backport the PR https://github.com/apache/spark/pull/17265 to Spark 2.1 branch.
      
      ---
      Since we use shims for most Hive metastore APIs, the exceptions thrown by the underlying methods invoked via Method.invoke() are wrapped in `InvocationTargetException`. Instead of unwrapping them one by one, we should handle all of them in `withClient`. If any is missed, the error message can look unfriendly. For example, below is the error for dropping tables.
      
      ```
      Expected exception org.apache.spark.sql.AnalysisException to be thrown, but java.lang.reflect.InvocationTargetException was thrown.
      ScalaTestFailureLocation: org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite$$anonfun$14 at (ExternalCatalogSuite.scala:193)
      org.scalatest.exceptions.TestFailedException: Expected exception org.apache.spark.sql.AnalysisException to be thrown, but java.lang.reflect.InvocationTargetException was thrown.
      	at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:496)
      	at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
      	at org.scalatest.Assertions$class.intercept(Assertions.scala:1004)
      	at org.scalatest.FunSuite.intercept(FunSuite.scala:1555)
      	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite$$anonfun$14.apply$mcV$sp(ExternalCatalogSuite.scala:193)
      	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite$$anonfun$14.apply(ExternalCatalogSuite.scala:183)
      	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite$$anonfun$14.apply(ExternalCatalogSuite.scala:183)
      	at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
      	at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
      	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
      	at org.scalatest.Transformer.apply(Transformer.scala:22)
      	at org.scalatest.Transformer.apply(Transformer.scala:20)
      	at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
      	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
      	at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
      	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
      	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
      	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
      	at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
      	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(ExternalCatalogSuite.scala:40)
      	at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
      	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite.runTest(ExternalCatalogSuite.scala:40)
      	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
      	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
      	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
      	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
      	at scala.collection.immutable.List.foreach(List.scala:381)
      	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
      	at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
      	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
      	at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
      	at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
      	at org.scalatest.Suite$class.run(Suite.scala:1424)
      	at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
      	at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
      	at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
      	at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
      	at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
      	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
      	at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
      	at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
      	at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:31)
      	at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:55)
      	at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2563)
      	at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2557)
      	at scala.collection.immutable.List.foreach(List.scala:381)
      	at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:2557)
      	at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1044)
      	at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1043)
      	at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:2722)
      	at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:1043)
      	at org.scalatest.tools.Runner$.run(Runner.scala:883)
      	at org.scalatest.tools.Runner.run(Runner.scala)
      	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2(ScalaTestRunner.java:138)
      	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:28)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
      Caused by: java.lang.reflect.InvocationTargetException
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at org.apache.spark.sql.hive.client.Shim_v0_14.dropTable(HiveShim.scala:736)
      	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropTable$1.apply$mcV$sp(HiveClientImpl.scala:451)
      	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropTable$1.apply(HiveClientImpl.scala:451)
      	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropTable$1.apply(HiveClientImpl.scala:451)
      	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:287)
      	at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:228)
      	at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:227)
      	at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:270)
      	at org.apache.spark.sql.hive.client.HiveClientImpl.dropTable(HiveClientImpl.scala:450)
      	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$dropTable$1.apply$mcV$sp(HiveExternalCatalog.scala:456)
      	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$dropTable$1.apply(HiveExternalCatalog.scala:454)
      	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$dropTable$1.apply(HiveExternalCatalog.scala:454)
      	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:94)
      	at org.apache.spark.sql.hive.HiveExternalCatalog.dropTable(HiveExternalCatalog.scala:454)
      	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite$$anonfun$14$$anonfun$apply$mcV$sp$8.apply$mcV$sp(ExternalCatalogSuite.scala:194)
      	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite$$anonfun$14$$anonfun$apply$mcV$sp$8.apply(ExternalCatalogSuite.scala:194)
      	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite$$anonfun$14$$anonfun$apply$mcV$sp$8.apply(ExternalCatalogSuite.scala:194)
      	at org.scalatest.Assertions$class.intercept(Assertions.scala:997)
      	... 57 more
      Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: NoSuchObjectException(message:db2.unknown_table table not found)
      	at org.apache.hadoop.hive.ql.metadata.Hive.dropTable(Hive.java:1038)
      	... 79 more
      Caused by: NoSuchObjectException(message:db2.unknown_table table not found)
      	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table_core(HiveMetaStore.java:1808)
      	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1778)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
      	at com.sun.proxy.$Proxy10.get_table(Unknown Source)
      	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:1208)
      	at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.getTable(SessionHiveMetaStoreClient.java:131)
      	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.dropTable(HiveMetaStoreClient.java:952)
      	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.dropTable(HiveMetaStoreClient.java:904)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
      	at com.sun.proxy.$Proxy11.dropTable(Unknown Source)
      	at org.apache.hadoop.hive.ql.metadata.Hive.dropTable(Hive.java:1035)
      	... 79 more
      ```
      
      After unwrapping the exception, the message looks like:
      ```
      org.apache.hadoop.hive.ql.metadata.HiveException: NoSuchObjectException(message:db2.unknown_table table not found);
      org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: NoSuchObjectException(message:db2.unknown_table table not found);
      	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:100)
      	at org.apache.spark.sql.hive.HiveExternalCatalog.dropTable(HiveExternalCatalog.scala:460)
      	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite$$anonfun$14.apply$mcV$sp(ExternalCatalogSuite.scala:193)
      	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite$$anonfun$14.apply(ExternalCatalogSuite.scala:183)
      	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite$$anonfun$14.apply(ExternalCatalogSuite.scala:183)
      	at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
      ...
      ```
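      A hedged sketch of the central unwrapping in `withClient` (simplified; the real method also decides which exception types to convert):
      
      ```scala
      import java.lang.reflect.InvocationTargetException
      import org.apache.spark.sql.AnalysisException
      
      // Inside HiveExternalCatalog: unwrap reflective-invocation wrappers once, centrally,
      // so every Hive call surfaces the underlying metastore error as an AnalysisException.
      private def withClient[T](body: => T): T = synchronized {
        try body catch {
          case i: InvocationTargetException =>
            val cause = i.getCause
            throw new AnalysisException(cause.getClass.getCanonicalName + ": " + cause.getMessage)
        }
      }
      ```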
      ### How was this patch tested?
      N/A
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17627 from gatorsmile/backport-17265.
  11. Apr 12, 2017
    • [SPARK-20131][CORE] Don't use `this` lock in StandaloneSchedulerBackend.stop · be36c2f1
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      `o.a.s.streaming.StreamingContextSuite.SPARK-18560 Receiver data should be deserialized properly` is flaky because there is a potential deadlock in StandaloneSchedulerBackend that causes the `await` to time out. Here is the related stack trace:
      ```
      "Thread-31" #211 daemon prio=5 os_prio=31 tid=0x00007fedd4808000 nid=0x16403 waiting on condition [0x00007000239b7000]
         java.lang.Thread.State: TIMED_WAITING (parking)
      	at sun.misc.Unsafe.park(Native Method)
      	- parking to wait for  <0x000000079b49ca10> (a scala.concurrent.impl.Promise$CompletionLatch)
      	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
      	at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
      	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
      	at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
      	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
      	at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
      	at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
      	at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
      	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:402)
      	at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.org$apache$spark$scheduler$cluster$StandaloneSchedulerBackend$$stop(StandaloneSchedulerBackend.scala:213)
      	- locked <0x00000007066fca38> (a org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend)
      	at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.stop(StandaloneSchedulerBackend.scala:116)
      	- locked <0x00000007066fca38> (a org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend)
      	at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:517)
      	at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1657)
      	at org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1921)
      	at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1302)
      	at org.apache.spark.SparkContext.stop(SparkContext.scala:1920)
      	at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:708)
      	at org.apache.spark.streaming.StreamingContextSuite$$anonfun$43$$anonfun$apply$mcV$sp$66$$anon$3.run(StreamingContextSuite.scala:827)
      
      "dispatcher-event-loop-3" #18 daemon prio=5 os_prio=31 tid=0x00007fedd603a000 nid=0x6203 waiting for monitor entry [0x0000700003be4000]
         java.lang.Thread.State: BLOCKED (on object monitor)
      	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$DriverEndpoint$$makeOffers(CoarseGrainedSchedulerBackend.scala:253)
      	- waiting to lock <0x00000007066fca38> (a org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend)
      	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:124)
      	at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
      	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
      	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
      	at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      ```
      
      This PR removes `synchronized` and changes `stopping` to an `AtomicBoolean` to ensure idempotency, fixing the deadlock.
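      A minimal sketch of the lock-free idempotent-stop pattern adopted here (names simplified):
      
      ```scala
      import java.util.concurrent.atomic.AtomicBoolean
      
      class Backend {
        private val stopping = new AtomicBoolean(false)
      
        // compareAndSet lets the first caller win without holding a monitor that
        // RPC dispatcher threads might also be blocked on.
        def stop(): Unit = {
          if (stopping.compareAndSet(false, true)) {
            // tear down executors and RPC endpoints exactly once, lock-free
          }
        }
      }
      ```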
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #17610 from zsxwing/SPARK-20131.
      
      (cherry picked from commit c5f1cc37)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    • [SPARK-20304][SQL] AssertNotNull should not include path in string representation · 7e0ddda3
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
      AssertNotNull's toString/simpleString dumps the entire walkedTypePath. walkedTypePath is used for error message reporting and shouldn't be part of the output.
      
      ## How was this patch tested?
      Manually tested.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #17616 from rxin/SPARK-20304.
      
      (cherry picked from commit 54085538)
      Signed-off-by: Xiao Li <gatorsmile@gmail.com>
    • [SPARK-20296][TRIVIAL][DOCS] Count distinct error message for streaming · dbb6d1b4
      jtoka authored
      
      ## What changes were proposed in this pull request?
      Update count distinct error message for streaming datasets/dataframes to match current behavior. These aggregations are not yet supported, regardless of whether the dataset/dataframe is aggregated.
      
      Author: jtoka <jason.tokayer@gmail.com>
      
      Closes #17609 from jtoka/master.
      
      (cherry picked from commit 2e1fd46e)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
    • [MINOR][DOCS] Fix spacings in Structured Streaming Programming Guide · b2970d97
      Lee Dongjin authored
      
      ## What changes were proposed in this pull request?
      
      1. Omitted space between the sentences: `... on static data.The Spark SQL engine will ...` -> `... on static data. The Spark SQL engine will ...`
      2. Omitted colon in Output Model section.
      
      ## How was this patch tested?
      
      None.
      
      Author: Lee Dongjin <dongjin@apache.org>
      
      Closes #17564 from dongjinleekr/feature/fix-programming-guide.
      
      (cherry picked from commit b9384382)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
    • [SPARK-20291][SQL] NaNvl(FloatType, NullType) should not be cast to NaNvl(DoubleType, DoubleType) · 46e212d2
      DB Tsai authored
      ## What changes were proposed in this pull request?
      
      `NaNvl(float value, null)` will be converted into `NaNvl(float value, Cast(null, DoubleType))` and finally `NaNvl(Cast(float value, DoubleType), Cast(null, DoubleType))`.
      
      This causes a mismatch in the output type when the input type is float.
      
      Adding an extra rule in TypeCoercion resolves this issue; a hedged sketch of the visible effect follows.
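      A usage example of the visible effect (assumes an active SparkSession `spark` with its implicits imported):
      
      ```scala
      import org.apache.spark.sql.functions._
      
      // Before the fix, the null replacement forced the float column through DoubleType;
      // with the fix, nanvl keeps FloatType when its second argument is a null literal.
      val df = Seq(1.5f, Float.NaN).toDF("f")
      df.select(nanvl(col("f"), lit(null))).printSchema() // float after the fix, double before
      ```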
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #17606 from dbtsai/fixNaNvl.
      
      (cherry picked from commit 8ad63ee1)
      Signed-off-by: DB Tsai <dbtsai@dbtsai.com>
  12. Apr 10, 2017
    • [SPARK-18555][MINOR][SQL] Fix the @since tag when backporting from 2.2 branch into 2.1 branch · 03a42c01
      DB Tsai authored
      ## What changes were proposed in this pull request?
      
      Fix the `@since` tag when backporting critical bug fixes (SPARK-18555) from the 2.2 branch into the 2.1 branch.
      
      ## How was this patch tested?
      
      N/A
      
      Author: DB Tsai <dbtsai@dbtsai.com>
      
      Closes #17600 from dbtsai/branch-2.1.
    • [SPARK-17564][TESTS] Fix flaky RequestTimeoutIntegrationSuite.furtherRequestsDelay · 8eb71b81
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR fixes the following failure:
      ```
      sbt.ForkMain$ForkError: java.lang.AssertionError: null
      	at org.junit.Assert.fail(Assert.java:86)
      	at org.junit.Assert.assertTrue(Assert.java:41)
      	at org.junit.Assert.assertTrue(Assert.java:52)
      	at org.apache.spark.network.RequestTimeoutIntegrationSuite.furtherRequestsDelay(RequestTimeoutIntegrationSuite.java:230)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:497)
      	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
      	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
      	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
      	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
      	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
      	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
      	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
      	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
      	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
      	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
      	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
      	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
      	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
      	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
      	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
      	at org.junit.runners.Suite.runChild(Suite.java:128)
      	at org.junit.runners.Suite.runChild(Suite.java:27)
      	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
      	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
      	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
      	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
      	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
      	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
      	at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
      	at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
      	at com.novocode.junit.JUnitRunner$1.execute(JUnitRunner.java:132)
      	at sbt.ForkMain$Run$2.call(ForkMain.java:296)
      	at sbt.ForkMain$Run$2.call(ForkMain.java:286)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      ```
      
      It happens several times per month on [Jenkins](http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.network.RequestTimeoutIntegrationSuite&test_name=furtherRequestsDelay). The failure occurs because `callback1` may not have been called before `assertTrue(callback1.failure instanceof IOException);`. It's pretty easy to reproduce this error by adding a sleep before this line: https://github.com/apache/spark/blob/379b0b0bbdbba2278ce3bcf471bd75f6ffd9cf0d/common/network-common/src/test/java/org/apache/spark/network/RequestTimeoutIntegrationSuite.java#L267
      
      The fix is straightforward: just use the latch to wait until `callback1` is called. A hedged sketch follows.
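      A minimal sketch of the latch-based fix (Scala for consistency with the other sketches here; the actual test is Java/JUnit):
      
      ```scala
      import java.util.concurrent.{CountDownLatch, TimeUnit}
      
      val latch = new CountDownLatch(1)
      
      // Stand-in for the async RPC callback; in the real test this is callback1.
      new Thread(new Runnable { def run(): Unit = latch.countDown() }).start()
      
      // Wait for the callback before asserting on its recorded failure.
      assert(latch.await(10, TimeUnit.SECONDS), "callback was never invoked")
      ```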
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #17599 from zsxwing/SPARK-17564.
      
      (cherry picked from commit 734dfbfc)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
    • [SPARK-20270][SQL] na.fill should not change the values in long or integer... · f40e44de
      DB Tsai authored
      [SPARK-20270][SQL] na.fill should not change the values in long or integer when the default value is in double
      
      ## What changes were proposed in this pull request?
      
      This bug was partially addressed in SPARK-18555 (https://github.com/apache/spark/pull/15994), but the root cause wasn't completely solved. The bug is pretty critical, since it changes a Long member id in our application whenever the member id is too big to be represented losslessly by a Double.
      
      Here is an example how this happens, with
      ```
            Seq[(java.lang.Long, java.lang.Double)]((null, 3.14), (9123146099426677101L, null),
              (9123146560113991650L, 1.6), (null, null)).toDF("a", "b").na.fill(0.2),
      ```
      the logical plan will be
      ```
      == Analyzed Logical Plan ==
      a: bigint, b: double
      Project [cast(coalesce(cast(a#232L as double), cast(0.2 as double)) as bigint) AS a#240L, cast(coalesce(nanvl(b#233, cast(null as double)), 0.2) as double) AS b#241]
      +- Project [_1#229L AS a#232L, _2#230 AS b#233]
         +- LocalRelation [_1#229L, _2#230]
      ```
      
      Note that even when the value is not null, Spark casts the Long to Double first; then, if it's not null, Spark casts it back to Long, which loses precision.
      
      The correct behavior is that the original value should not be changed when it's not null, but Spark changes it, which is wrong.
      
      With the PR, the logical plan will be
      ```
      == Analyzed Logical Plan ==
      a: bigint, b: double
      Project [coalesce(a#232L, cast(0.2 as bigint)) AS a#240L, coalesce(nanvl(b#233, cast(null as double)), cast(0.2 as double)) AS b#241]
      +- Project [_1#229L AS a#232L, _2#230 AS b#233]
         +- LocalRelation [_1#229L, _2#230]
      ```
      which behaves correctly, leaving the original Long values unchanged and also avoiding the extra cost of unnecessary casting.
      
      ## How was this patch tested?
      
      unit test added.
      
      +cc srowen rxin cloud-fan gatorsmile
      
      Thanks.
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #17577 from dbtsai/fixnafill.
      
      (cherry picked from commit 1a0bc416)
      Signed-off-by: DB Tsai <dbtsai@dbtsai.com>
    • [SPARK-18555][SQL] DataFrameNaFunctions.fill messes up original values in long integers · b26f2c2c
      root authored
      
      ## What changes were proposed in this pull request?
      
         When DataSet.na.fill(0) is used on a DataSet that has a long-valued column, it changes the original long values.
      
         The reason is that the fill function's parameter type is Double, so the numeric columns are always cast to double (`fillCol[Double](f, value)`).
      ```
        def fill(value: Double, cols: Seq[String]): DataFrame = {
          val columnEquals = df.sparkSession.sessionState.analyzer.resolver
          val projections = df.schema.fields.map { f =>
            // Only fill if the column is part of the cols list.
            if (f.dataType.isInstanceOf[NumericType] && cols.exists(col => columnEquals(f.name, col))) {
              fillCol[Double](f, value)
            } else {
              df.col(f.name)
            }
          }
          df.select(projections : _*)
        }
      ```
      
       For example:
      ```
      scala> val df = Seq[(Long, Long)]((1, 2), (-1, -2), (9123146099426677101L, 9123146560113991650L)).toDF("a", "b")
      df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint]
      
      scala> df.show
      +-------------------+-------------------+
      |                  a|                  b|
      +-------------------+-------------------+
      |                  1|                  2|
      |                 -1|                 -2|
      |9123146099426677101|9123146560113991650|
      +-------------------+-------------------+
      
      scala> df.na.fill(0).show
      +-------------------+-------------------+
      |                  a|                  b|
      +-------------------+-------------------+
      |                  1|                  2|
      |                 -1|                 -2|
      |9123146099426676736|9123146560113991680|
      +-------------------+-------------------+
       ```
      
      the original values changed (which is not the expected result):
      ```
       9123146099426677101 -> 9123146099426676736
       9123146560113991650 -> 9123146560113991680
      ```
      
      ## How was this patch tested?
      
      unit test added.
      
      Author: root <root@iZbp1gsnrlfzjxh82cz80vZ.(none)>
      
      Closes #15994 from windpiger/nafillMissupOriginalValue.
      
      (cherry picked from commit 508de38c)
      Signed-off-by: DB Tsai <dbtsai@dbtsai.com>
    • [SPARK-20285][TESTS] Increase the pyspark streaming test timeout to 30 seconds · 489c1f35
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Saw the following failure locally:
      
      ```
      Traceback (most recent call last):
        File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 351, in test_cogroup
          self._test_func(input, func, expected, sort=True, input2=input2)
        File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 162, in _test_func
          self.assertEqual(expected, result)
      AssertionError: Lists differ: [[(1, ([1], [2])), (2, ([1], [... != []
      
      First list contains 3 additional elements.
      First extra element 0:
      [(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))]
      
      + []
      - [[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))],
      -  [(1, ([1, 1, 1], [])), (2, ([1], [])), (4, ([], [1]))],
      -  [('', ([1, 1], [1, 2])), ('a', ([1, 1], [1, 1])), ('b', ([1], [1]))]]
      ```
      
      It also happened on Jenkins: http://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.7/120
      
      The failure occurs because, when the machine is overloaded, the timeout is not enough. This PR just increases the timeout to 30 seconds.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #17597 from zsxwing/SPARK-20285.
      
      (cherry picked from commit f9a50ba2)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    • [SPARK-20280][CORE] FileStatusCache Weigher integer overflow · bc7304e1
      Bogdan Raducanu authored
      
      ## What changes were proposed in this pull request?
      
      Weigher.weigh needs to return an Int, but it is possible for an Array[FileStatus] to have a size greater than Int.MaxValue. To avoid this, the size is scaled down by a factor of 32; the maximumWeight of the cache is scaled down by the same factor. A hedged sketch follows.
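      A minimal sketch of the scaled weigher (illustrative names; the real code estimates entry sizes rather than taking them as a parameter):
      
      ```scala
      // Scale both the per-entry weight and the cache's maximumWeight by the same
      // factor so huge Array[FileStatus] values can't overflow weigh()'s Int result.
      val weightScale = 32
      
      def scaledWeight(estimatedSizeInBytes: Long): Int =
        math.min(estimatedSizeInBytes / weightScale, Int.MaxValue.toLong).toInt
      
      // The cache is then built with maximumWeight(maxSizeInBytes / weightScale).
      ```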
      
      ## How was this patch tested?
      New test in FileIndexSuite
      
      Author: Bogdan Raducanu <bogdan@databricks.com>
      
      Closes #17591 from bogdanrdc/SPARK-20280.
      
      (cherry picked from commit f6dd8e0e)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
  13. Apr 09, 2017
    • [SPARK-20264][SQL] asm should be non-test dependency in sql/core · 1a73046b
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
      The sql/core module currently declares asm as a test-scope dependency. Transitively it should actually be a normal dependency, since the core module defines it. This occasionally confuses IntelliJ.
      
      ## How was this patch tested?
      N/A - This is a build change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #17574 from rxin/SPARK-20264.
      
      (cherry picked from commit 7bfa05e0)
      Signed-off-by: Xiao Li <gatorsmile@gmail.com>
    • [SPARK-20260][MLLIB] String interpolation required for error message · 43a7fcad
      Vijay Ramesh authored
      ## What changes were proposed in this pull request?
      This error message doesn't get properly formatted because of a missing `s` interpolator. Currently the error looks like:
      
      ```
      Caused by: java.lang.IllegalArgumentException: requirement failed: indices should be one-based and in ascending order; found current=$current, previous=$previous; line="$line"
      ```
      (note the literal `$current` instead of the interpolated value)
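      A toy reproduction of the bug and the one-character fix (variable names mirror the message above):
      
      ```scala
      val current = 3
      val previous = 5
      // Without the s-prefix the placeholders stay literal (the reported bug):
      val bad  = "indices should be in ascending order; found current=$current, previous=$previous"
      // With it, the values are interpolated (the fix):
      val good = s"indices should be in ascending order; found current=$current, previous=$previous"
      println(bad)  // ...current=$current, previous=$previous
      println(good) // ...current=3, previous=5
      ```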
      
      Author: Vijay Ramesh <vramesh@demandbase.com>
      
      Closes #17572 from vijaykramesh/master.
      
      (cherry picked from commit 261eaf51)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
  14. Apr 07, 2017
    • [SPARK-20262][SQL] AssertNotNull should throw NullPointerException · 658b3588
      Reynold Xin authored
      
      AssertNotNull currently throws RuntimeException. It should throw NullPointerException, which is more specific.
      
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #17573 from rxin/SPARK-20262.
      
      (cherry picked from commit e1afc4dc)
      Signed-off-by: Xiao Li <gatorsmile@gmail.com>
    • [SPARK-20246][SQL] should not push predicate down through aggregate with... · fc242ccf
      Wenchen Fan authored
      [SPARK-20246][SQL] should not push predicate down through aggregate with non-deterministic expressions
      
      ## What changes were proposed in this pull request?
      
      Similar to `Project`, when `Aggregate` has non-deterministic expressions, we should not push predicate down through it, as it will change the number of input rows and thus change the evaluation result of non-deterministic expressions in `Aggregate`.
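      A hedged DataFrame-level illustration of why the pushdown is unsafe (made-up column names):
      
      ```scala
      import org.apache.spark.sql.DataFrame
      import org.apache.spark.sql.functions._
      
      // rand() in the aggregate's output is non-deterministic: pushing the filter below
      // the aggregate would change the number of input rows and thus rand()'s draws.
      def example(df: DataFrame): DataFrame =
        df.groupBy(col("key"))
          .agg(count(col("value")).as("cnt"), rand().as("r"))
          .filter(col("r") > 0.5) // must stay above the aggregate
      ```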
      
      ## How was this patch tested?
      
      new regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #17562 from cloud-fan/filter.
      
      (cherry picked from commit 7577e9c3)
      Signed-off-by: Xiao Li <gatorsmile@gmail.com>
    • [SPARK-20218][DOC][APP-ID] '/applications/[app-id]/stages' in REST API, add description · 77911201
      郭小龙 10207633 authored
      ## What changes were proposed in this pull request?
      
      1. For '/applications/[app-id]/stages' in the REST API, the status parameter should get the description '?status=[active|complete|pending|failed] list only stages in the state.'
      
      Currently this description is missing, so users of this API don't know that they can filter the stage list by status.
      
      2. For '/applications/[app-id]/stages/[stage-id]' in the REST API, remove the redundant description '?status=[active|complete|pending|failed] list only stages in the state.'
      Because only one stage is determined by its stage-id, the filter is meaningless there.
      
      code:
      ```scala
      @GET
      def stageList(@QueryParam("status") statuses: JList[StageStatus]): Seq[StageData] = {
        val listener = ui.jobProgressListener
        val stageAndStatus = AllStagesResource.stagesAndStatus(ui)
        val adjStatuses = {
          if (statuses.isEmpty()) {
            Arrays.asList(StageStatus.values(): _*)
          } else {
            statuses
          }
        }
        // ...
      }
      ```
      
      ## How was this patch tested?
      
      manual tests
      
      Author: 郭小龙 10207633 <guo.xiaolong1@zte.com.cn>
      
      Closes #17534 from guoxiaolongzte/SPARK-20218.
      
      (cherry picked from commit 9e0893b5)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
  15. Apr 05, 2017
    • [SPARK-20214][ML] Make sure converted csc matrix has sorted indices · fb81a412
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      `_convert_to_vector` converts a scipy sparse matrix to a csc matrix for initializing `SparseVector`. However, it doesn't guarantee that the converted csc matrix has sorted indices, so a failure happens when you do something like this:
      
          from scipy.sparse import lil_matrix
          lil = lil_matrix((4, 1))
          lil[1, 0] = 1
          lil[3, 0] = 2
          _convert_to_vector(lil.todok())
      
          File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 78, in _convert_to_vector
            return SparseVector(l.shape[0], csc.indices, csc.data)
          File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 556, in __init__
            % (self.indices[i], self.indices[i + 1]))
          TypeError: Indices 3 and 1 are not strictly increasing
      
      A simple test can confirm that `dok_matrix.tocsc()` won't guarantee sorted indices:
      
          >>> from scipy.sparse import lil_matrix
          >>> lil = lil_matrix((4, 1))
          >>> lil[1, 0] = 1
          >>> lil[3, 0] = 2
          >>> dok = lil.todok()
          >>> csc = dok.tocsc()
          >>> csc.has_sorted_indices
          0
          >>> csc.indices
          array([3, 1], dtype=int32)
      
      I checked the scipy source code. The only conversions that guarantee sorted indices are `csc_matrix.tocsr()` and `csr_matrix.tocsc()`.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #17532 from viirya/make-sure-sorted-indices.
      
      (cherry picked from commit 12206058)
      Signed-off-by: Joseph K. Bradley <joseph@databricks.com>