  1. Jul 17, 2017
    • [MINOR] Improve SQLConf messages · a8c6d0f6
      gatorsmile authored
      ### What changes were proposed in this pull request?
      The current SQLConf messages of `spark.sql.hive.convertMetastoreParquet` and `spark.sql.hive.convertMetastoreOrc` are not very clear to end users. This PR is to improve them.
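
      As a minimal usage sketch (not part of this patch, and assuming a Hive-enabled SparkSession named `spark`), these two flags control whether Hive metastore Parquet/ORC tables are converted to use Spark's built-in file source readers instead of the Hive SerDe:

      ```scala
      // Toggle the conversions at runtime; the improved descriptions should show up
      // in the `meaning` column of SET -v.
      spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")
      spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")

      spark.sql("SET -v").filter("key LIKE '%convertMetastore%'").show(false)
      ```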
      
      ### How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18657 from gatorsmile/msgUpdates.
    • [SPARK-21409][SS] Expose state store memory usage in SQL metrics and progress updates · 9d8c8317
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Currently, there is no tracking of memory usage of state stores. This JIRA is to expose that through SQL metrics and StreamingQueryProgress.
      
      Additionally, added the ability to expose implementation-specific metrics through the StateStore APIs to the SQLMetrics.
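
      As a hedged sketch of how the new numbers could be consumed (assuming the metric is surfaced as `memoryUsedBytes` on `StateOperatorProgress`, and that `query` is a running streaming query):

      ```scala
      import org.apache.spark.sql.streaming.StreamingQuery

      // Print per-operator state store stats from the most recent progress update.
      def reportStateMemory(query: StreamingQuery): Unit = {
        Option(query.lastProgress).foreach { progress =>
          progress.stateOperators.zipWithIndex.foreach { case (op, i) =>
            println(s"state operator $i: numRowsTotal=${op.numRowsTotal}, " +
              s"memoryUsedBytes=${op.memoryUsedBytes}")
          }
        }
      }
      ```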
      
      ## How was this patch tested?
      Added unit tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #18629 from tdas/SPARK-21409.
    • [SPARK-21377][YARN] Make jars specified with --jars/--packages loadable in the AM's credential renewer · 53465075
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      In this issue we have a long-running Spark application with secure HBase, which requires `HBaseCredentialProvider` to obtain tokens periodically. We specify HBase-related jars with `--packages`, but these dependencies are not added to the AM classpath, so when `HBaseCredentialProvider` tries to initialize HBase connections to get tokens, it fails.
      
      Currently, because jars specified with `--jars` or `--packages` are not added to the AM classpath, the only way to extend it is to use "spark.driver.extraClassPath", which is supposed to be used in yarn cluster mode.
      
      So in this fix, we propose to use/reuse a classloader for `AMCredentialRenewer` to acquire new tokens.
      
      Also in this patch, we fixed an issue where the AM cannot get tokens from HDFS: the FileSystem was obtained before the Kerberos login, so using that FS to get tokens throws an exception.
      
      ## How was this patch tested?
      
      Manual verification.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #18616 from jerryshao/SPARK-21377.
    • [SPARK-21321][SPARK CORE] Spark very verbose on shutdown · 0e07a29c
      John Lee authored
      ## What changes were proposed in this pull request?
      
      The current code is very verbose on shutdown.
      
      The change I propose is to adjust the log level when the driver is shutting down and the RPC connections are closed (`RpcEnvStoppedException`).
      
      ## How was this patch tested?
      
      Tested with word count (deploy-mode = cluster, master = yarn, num-executors = 4) with 300 GB of data.
      
      Author: John Lee <jlee2@yahoo-inc.com>
      
      Closes #18547 from yoonlee95/SPARK-21321.
    • [SPARK-21221][ML] CrossValidator and TrainValidationSplit Persist Nested... · 7047f49f
      Ajay Saini authored
      [SPARK-21221][ML] CrossValidator and TrainValidationSplit Persist Nested Estimators such as OneVsRest
      
      ## What changes were proposed in this pull request?
      Added functionality for CrossValidator and TrainValidationSplit to persist nested estimators such as OneVsRest. Also added CrossValidator and TrainValidationSplit persistence to PySpark.
      
      ## How was this patch tested?
      Performed both cross validation and train validation split with a one vs. rest estimator and tested read/write functionality of the estimator parameter maps required by these meta-algorithms.
      
      Author: Ajay Saini <ajays725@gmail.com>
      
      Closes #18428 from ajaysaini725/MetaAlgorithmPersistNestedEstimators.
    • [SPARK-21394][SPARK-21432][PYTHON] Reviving callable object/partial function... · 4ce735ee
      hyukjinkwon authored
      [SPARK-21394][SPARK-21432][PYTHON] Reviving callable object/partial function support in UDF in PySpark
      
      ## What changes were proposed in this pull request?
      
      This PR proposes to avoid `__name__` in the tuple of attribute names assigned directly from the wrapped function to the wrapper function, and to use `self._name` (`func.__name__` or `obj.__class__.__name__`) instead.
      
      After SPARK-19161, we happened to break callable objects as UDFs in Python as below:
      
      ```python
      from pyspark.sql import functions
      
      class F(object):
          def __call__(self, x):
              return x
      
      foo = F()
      udf = functions.udf(foo)
      ```
      
      ```
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark/python/pyspark/sql/functions.py", line 2142, in udf
          return _udf(f=f, returnType=returnType)
        File ".../spark/python/pyspark/sql/functions.py", line 2133, in _udf
          return udf_obj._wrapped()
        File ".../spark/python/pyspark/sql/functions.py", line 2090, in _wrapped
          functools.wraps(self.func)
        File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/functools.py", line 33, in update_wrapper
          setattr(wrapper, attr, getattr(wrapped, attr))
      AttributeError: F instance has no attribute '__name__'
      ```
      
      This worked in Spark 2.1:
      
      ```python
      from pyspark.sql import functions
      
      class F(object):
          def __call__(self, x):
              return x
      
      foo = F()
      udf = functions.udf(foo)
      spark.range(1).select(udf("id")).show()
      ```
      
      ```
      +-----+
      |F(id)|
      +-----+
      |    0|
      +-----+
      ```
      
      **After**
      
      ```python
      from pyspark.sql import functions
      
      class F(object):
          def __call__(self, x):
              return x
      
      foo = F()
      udf = functions.udf(foo)
      spark.range(1).select(udf("id")).show()
      ```
      
      ```
      +-----+
      |F(id)|
      +-----+
      |    0|
      +-----+
      ```
      
      _In addition, we also happened to break partial functions as below_:
      
      ```python
      from pyspark.sql import functions
      from functools import partial
      
      partial_func = partial(lambda x: x, x=1)
      udf = functions.udf(partial_func)
      ```
      
      ```
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark/python/pyspark/sql/functions.py", line 2154, in udf
          return _udf(f=f, returnType=returnType)
        File ".../spark/python/pyspark/sql/functions.py", line 2145, in _udf
          return udf_obj._wrapped()
        File ".../spark/python/pyspark/sql/functions.py", line 2099, in _wrapped
          functools.wraps(self.func, assigned=assignments)
        File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/functools.py", line 33, in update_wrapper
          setattr(wrapper, attr, getattr(wrapped, attr))
      AttributeError: 'functools.partial' object has no attribute '__module__'
      ```
      
      This worked in Spark 2.1:
      
      ```python
      from pyspark.sql import functions
      from functools import partial
      
      partial_func = partial(lambda x: x, x=1)
      udf = functions.udf(partial_func)
      spark.range(1).select(udf()).show()
      ```
      
      ```
      +---------+
      |partial()|
      +---------+
      |        1|
      +---------+
      ```
      
      **After**
      
      ```python
      from pyspark.sql import functions
      from functools import partial
      
      partial_func = partial(lambda x: x, x=1)
      udf = functions.udf(partial_func)
      spark.range(1).select(udf()).show()
      ```
      
      ```
      +---------+
      |partial()|
      +---------+
      |        1|
      +---------+
      ```
      
      ## How was this patch tested?
      
      Unit tests in `python/pyspark/sql/tests.py` and manual tests.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18615 from HyukjinKwon/callable-object.
    • [SPARK-21354][SQL] INPUT FILE related functions do not support more than one sources · e398c281
      gatorsmile authored
      ### What changes were proposed in this pull request?
      The built-in functions `input_file_name`, `input_file_block_start`, and `input_file_block_length` do not support more than one source, just as in Hive. Currently, however, Spark does not block such queries, and the outputs are ambiguous/non-deterministic: the value could come from either source.
      
      ```
      hive> select *, INPUT__FILE__NAME FROM t1, t2;
      FAILED: SemanticException Column INPUT__FILE__NAME Found in more than One Tables/Subqueries
      ```
      
      This PR blocks it and issues an error.
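
      A hedged illustration of the new behavior (the table paths and the exact error text below are placeholders, not from this patch):

      ```scala
      import org.apache.spark.sql.functions.input_file_name

      // Hypothetical two-source query; the paths are placeholders.
      val t1 = spark.read.parquet("/path/to/t1")
      val t2 = spark.read.parquet("/path/to/t2")

      // Previously the file name was non-deterministic (it could come from either side);
      // with this change the query is expected to fail analysis instead.
      t1.crossJoin(t2).select(input_file_name()).show()
      // org.apache.spark.sql.AnalysisException: ... (analogous to the Hive error above)
      ```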
      
      ### How was this patch tested?
      Added a test case
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18580 from gatorsmile/inputFileName.
  2. Jul 16, 2017
  3. Jul 15, 2017
  4. Jul 14, 2017
    • [SPARK-21344][SQL] BinaryType comparison does signed byte array comparison · ac5d5d79
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR fixes an incorrect comparison for `BinaryType`. It enables unsigned comparison and unsigned prefix generation for byte arrays of `BinaryType`; the previous implementation used signed operations.
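
      A self-contained illustration of the difference (plain Scala, not Spark code): a byte with the high bit set must sort after small positive bytes when treated as unsigned, but signed comparison puts it first.

      ```scala
      val a: Byte = 0x01.toByte
      val b: Byte = 0x80.toByte // -128 as a signed byte, 128 as an unsigned value

      // Signed comparison: 1 > -128, so `a` incorrectly sorts after `b`.
      println(java.lang.Byte.compare(a, b))                  // positive
      // Unsigned comparison (what BinaryType needs): 1 < 128, so `a` sorts before `b`.
      println(java.lang.Integer.compare(a & 0xFF, b & 0xFF)) // negative
      ```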
      
      ## How was this patch tested?
      
      Added a test suite in `OrderingSuite`.
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #18571 from kiszk/SPARK-21344.
    • [SPARK-21421][SS] Add the query id as a local property to allow source and sink using it · 2d968a07
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Add the query id as a local property so that sources and sinks can use it.
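
      A minimal sketch of how a source or sink could read it (assuming the local property key is `sql.streaming.queryId`):

      ```scala
      // Inside source/sink code running on the driver during a batch; the key name
      // is an assumption based on the description above.
      val queryId = spark.sparkContext.getLocalProperty("sql.streaming.queryId")
      if (queryId != null) {
        println(s"executing on behalf of streaming query $queryId")
      }
      ```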
      
      ## How was this patch tested?
      
      The new unit test.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #18638 from zsxwing/SPARK-21421.
    • [SPARK-9825][YARN] Do not overwrite final Hadoop config entries. · 601a237b
      Marcelo Vanzin authored
      When localizing the gateway config files in a YARN application, avoid
      overwriting final configs by distributing the gateway files to a separate
      directory, and explicitly loading them into the Hadoop config, instead
      of placing those files before the cluster's files in the classpath.
      
      This is done by saving the gateway's config to a separate XML file
      distributed with the rest of the Spark app's config, and loading that
      file when creating a new config through `YarnSparkHadoopUtil`.
      
      Tested with existing unit tests, and by verifying the behavior in a YARN
      cluster (final values are not overridden, non-final values are).
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #18370 from vanzin/SPARK-9825.
  5. Jul 13, 2017
    • [SPARK-21376][YARN] Fix yarn client token expire issue when cleaning the... · cb8d5cc9
      jerryshao authored
      [SPARK-21376][YARN] Fix yarn client token expire issue when cleaning the staging files in long running scenario
      
      ## What changes were proposed in this pull request?
      
      This issue happens in long-running applications in yarn cluster mode. Because yarn#client doesn't sync tokens with the AM, it always keeps the initial token; this token may expire in the long-running scenario, so when yarn#client tries to clean up the staging directory after the application finishes, it uses this expired token and hits a token-expiration error.
      
      ## How was this patch tested?
      
      Manual verification in a secure cluster.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #18617 from jerryshao/SPARK-21376.
    • [SPARK-15526][MLLIB] Shade JPMML · 5c8edfc4
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Shade JPMML classes (`org.jpmml.**`) and related PMML model classes (`org.dmg.pmml.**`). This insulates downstream users from the version of JPMML in Spark, allows us to upgrade more freely, and allows downstream users to use a different version. JPMML minor releases are not generally forwards/backwards compatible.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #18584 from srowen/SPARK-15526.
    • [SPARK-21403][MESOS] fix --packages for mesos · d8257b99
      Stavros Kontopoulos authored
      ## What changes were proposed in this pull request?
      Fixes the --packages flag for Mesos in cluster mode. I will probably handle standalone and YARN in another commit; I need to investigate those cases as they are different.
      
      ## How was this patch tested?
      Tested with a community 1.9 DC/OS cluster. Packages were successfully resolved in cluster mode within a container.
      
      andrewor14 susanxhuynh ArtRand srowen please review.
      
      Author: Stavros Kontopoulos <st.kontopoulos@gmail.com>
      
      Closes #18587 from skonto/fix_packages_mesos_cluster.
    • [SPARK-21373][CORE] Update Jetty to 9.3.20.v20170531 · af80e01b
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR upgrades Jetty to the latest version, 9.3.20.v20170531. This version includes the fix for CVE-2017-9735.
      
      Here are links to descriptions for CVE-2017-9735.
      * https://nvd.nist.gov/vuln/detail/CVE-2017-9735
      * https://github.com/eclipse/jetty.project/issues/1556
      
      Here is [a release note](https://github.com/eclipse/jetty.project/blob/jetty-9.3.x/VERSION.txt) for the latest Jetty.
      
      ## How was this patch tested?
      
      Tested by existing test suites.
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #18601 from kiszk/SPARK-21373.
    • [SPARK-19810][BUILD][CORE] Remove support for Scala 2.10 · 425c4ada
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      - Remove Scala 2.10 build profiles and support
      - Replace some 2.10 support in scripts with commented placeholders for 2.12 later
      - Remove deprecated API calls from 2.10 support
      - Remove usages of deprecated context bounds where possible
      - Remove Scala 2.10 workarounds like ScalaReflectionLock
      - Other minor Scala warning fixes
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17150 from srowen/SPARK-19810.
  6. Jul 12, 2017
  7. Jul 11, 2017
    • [SPARK-19285][SQL] Implement UDF0 · d3e07165
      gatorsmile authored
      ### What changes were proposed in this pull request?
      This PR implements UDF0. `UDF0` is needed when users want to implement a Java UDF with no arguments.
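
      A hedged sketch of registering a no-argument Java UDF from Scala (assuming the matching `register` overload for `UDF0` added alongside the interface):

      ```scala
      import org.apache.spark.sql.api.java.UDF0
      import org.apache.spark.sql.types.StringType

      // A UDF that takes no arguments and returns a fresh UUID string.
      spark.udf.register("random_id", new UDF0[String] {
        override def call(): String = java.util.UUID.randomUUID().toString
      }, StringType)

      spark.sql("SELECT random_id()").show(false)
      ```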
      
      ### How was this patch tested?
      Added a test case
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18598 from gatorsmile/udf0.
    • [SPARK-16019][YARN] Use separate RM poll interval when starting client AM. · 1cad31f0
      Marcelo Vanzin authored
      Currently the code monitoring the launch of the client AM uses the value of
      spark.yarn.report.interval as the interval for polling the RM; if someone
      has that value set to a really large interval, it would take that long to detect
      that the client AM has started, which is not expected.
      
      Instead, have a separate config for the interval to use when the client AM is
      starting. The other config is still used in cluster mode, and to detect the
      status of the client AM after it is already running.
      
      Tested by running client and cluster mode apps with a modified value of
      spark.yarn.report.interval, verifying client AM launch is detected before
      that interval elapses.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #18380 from vanzin/SPARK-16019.
    • [SPARK-21365][PYTHON] Deduplicate logics parsing DDL type/schema definition · ebc124d4
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR deals with four points as below:
      
      - Reuse existing DDL parser APIs rather than reimplementing within PySpark
      
      - Support DDL-formatted strings such as `field type, field type`.
      
      - Support case-insensitivity for parsing.
      
      - Support nested data types as below:
      
        **Before**
        ```
        >>> spark.createDataFrame([[[1]]], "struct<a: struct<b: int>>").show()
        ...
        ValueError: The strcut field string format is: 'field_name:field_type', but got: a: struct<b: int>
        ```
      
        ```
        >>> spark.createDataFrame([[[1]]], "a: struct<b: int>").show()
        ...
        ValueError: The strcut field string format is: 'field_name:field_type', but got: a: struct<b: int>
        ```
      
        ```
        >>> spark.createDataFrame([[1]], "a int").show()
        ...
        ValueError: Could not parse datatype: a int
        ```
      
        **After**
        ```
        >>> spark.createDataFrame([[[1]]], "struct<a: struct<b: int>>").show()
        +---+
        |  a|
        +---+
        |[1]|
        +---+
        ```
      
        ```
        >>> spark.createDataFrame([[[1]]], "a: struct<b: int>").show()
        +---+
        |  a|
        +---+
        |[1]|
        +---+
        ```
      
        ```
        >>> spark.createDataFrame([[1]], "a int").show()
        +---+
        |  a|
        +---+
        |  1|
        +---+
        ```
      
      ## How was this patch tested?
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18590 from HyukjinKwon/deduplicate-python-ddl.
    • [SPARK-21366][SQL][TEST] Add sql test for window functions · 66d21686
      Xingbo Jiang authored
      ## What changes were proposed in this pull request?
      
      Add SQL tests for window functions, and remove unnecessary test cases in `WindowQuerySuite`.
      
      ## How was this patch tested?
      
      Added `window.sql` and the corresponding output file.
      
      Author: Xingbo Jiang <xingbo.jiang@databricks.com>
      
      Closes #18591 from jiangxb1987/window.
    • [SPARK-21263][SQL] Do not allow partially parsing double and floats via NumberFormat in CSV · 7514db1d
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to remove the use of `NumberFormat.parse` so that partially parsed data is disallowed. For example, currently a partially parseable value like `10u12` is silently accepted:
      
      ```
      scala> spark.read.schema("a DOUBLE").option("mode", "FAILFAST").csv(Seq("10u12").toDS).show()
      +----+
      |   a|
      +----+
      |10.0|
      +----+
      ```
      
      ## How was this patch tested?
      
      Unit tests added in `UnivocityParserSuite` and `CSVSuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18532 from HyukjinKwon/SPARK-21263.
    • [SPARK-20331][SQL] Enhanced Hive partition pruning predicate pushdown · a4baa8f4
      Michael Allman authored
      (Link to Jira: https://issues.apache.org/jira/browse/SPARK-20331)
      
      ## What changes were proposed in this pull request?
      
      Spark 2.1 introduced scalable support for Hive tables with huge numbers of partitions. Key to leveraging this support is the ability to prune unnecessary table partitions to answer queries. Spark supports a subset of the class of partition pruning predicates that the Hive metastore supports. If a user writes a query with a partition pruning predicate that is *not* supported by Spark, Spark falls back to loading all partitions and pruning client-side. We want to broaden Spark's current partition pruning predicate pushdown capabilities.
      
      One of the key missing capabilities is support for disjunctions. For example, for a table partitioned by date, writing a query with a predicate like
      
          date = 20161011 or date = 20161014
      
      will result in Spark fetching all partitions. For a table partitioned by date and hour, querying a range of hours across dates can be quite difficult to accomplish without fetching all partition metadata.
      
      The current partition pruning implementation supports only comparisons against literals. We can expand that to foldable expressions by evaluating them at planning time.
      
      We can also implement support for the "IN" comparison by expanding it to a sequence of "OR"s.
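
      A hedged illustration of the kinds of predicates discussed above, against a hypothetical Hive table `t` partitioned by (date, hour):

      ```scala
      // With broader pushdown, these partition predicates can be sent to the metastore
      // instead of forcing Spark to fetch metadata for every partition.
      spark.sql("SELECT * FROM t WHERE date = 20161011 OR date = 20161014")  // disjunction
      spark.sql("SELECT * FROM t WHERE date IN (20161011, 20161014)")        // IN expanded to ORs
      spark.sql("SELECT * FROM t WHERE date = 20160000 + 1011")              // foldable expression
      ```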
      
      ## How was this patch tested?
      
      The `HiveClientSuite` and `VersionsSuite` were refactored and simplified to make Hive client-based, version-specific testing more modular and conceptually simpler. There are now two Hive test suites: `HiveClientSuite` and `HivePartitionFilteringSuite`. These test suites have a single-argument constructor taking a `version` parameter. As such, these test suites cannot be run by themselves. Instead, they have been bundled into "aggregation" test suites which run each suite for each Hive client version. These aggregation suites are called `HiveClientSuites` and `HivePartitionFilteringSuites`. The `VersionsSuite` and `HiveClientSuite` have been refactored into each of these aggregation suites, respectively.
      
      `HiveClientSuite` and `HivePartitionFilteringSuite` subclass a new abstract class, `HiveVersionSuite`. `HiveVersionSuite` collects functionality related to testing a single Hive version and overrides relevant test suite methods to display version-specific information.
      
      A new trait, `HiveClientVersions`, has been added with a sequence of Hive test versions.
      
      Author: Michael Allman <michael@videoamp.com>
      
      Closes #17633 from mallman/spark-20331-enhanced_partition_pruning_pushdown.
    • [SPARK-20456][PYTHON][FOLLOWUP] Fix timezone-dependent doctests in unix_timestamp and from_unixtime · d4d9e17b
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to simply ignore the results in examples that are timezone-dependent in `unix_timestamp` and `from_unixtime`.
      
      ```
      Failed example:
          time_df.select(unix_timestamp('dt', 'yyyy-MM-dd').alias('unix_time')).collect()
      Expected:
          [Row(unix_time=1428476400)]
      Got:
          [Row(unix_time=1428418800)]
      ```
      
      ```
      Failed example:
          time_df.select(from_unixtime('unix_time').alias('ts')).collect()
      Expected:
          [Row(ts=u'2015-04-08 00:00:00')]
      Got:
          [Row(ts=u'2015-04-08 16:00:00')]
      ```
      
      ## How was this patch tested?
      
      Manually tested, and ran `./run-tests --modules pyspark-sql`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18597 from HyukjinKwon/SPARK-20456.
  8. Jul 10, 2017
    • [SPARK-21315][SQL] Skip some spill files when generateIterator(startIndex) in... · 97a1aa2c
      jinxing authored
      [SPARK-21315][SQL] Skip some spill files when generateIterator(startIndex) in ExternalAppendOnlyUnsafeRowArray.
      
      ## What changes were proposed in this pull request?
      
      In the current code, it is expensive to use `UnboundedFollowingWindowFunctionFrame`, because it iterates from the start to the lower bound every time the `write` method is called. When traversing the iterator, it is possible to skip some spill files and thus save time.
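
      A conceptual sketch of the skipping idea (standalone Scala, not the actual Spark code): given the row count of each spill file, whole files that lie entirely before `startIndex` can be jumped over instead of advancing the iterator row by row.

      ```scala
      // Returns (index of the first spill file that must be read, rows still to skip inside it).
      def seek(spillFileRowCounts: Seq[Long], startIndex: Long): (Int, Long) = {
        var fileIdx = 0
        var remaining = startIndex
        while (fileIdx < spillFileRowCounts.length && remaining >= spillFileRowCounts(fileIdx)) {
          remaining -= spillFileRowCounts(fileIdx) // skip this spill file entirely
          fileIdx += 1
        }
        (fileIdx, remaining)
      }

      // Matches the benchmark shape described under testing below: two spill files of
      // 1000000 rows each; seeking to 2000001 skips both files, leaving one row to skip
      // in the in-memory part.
      println(seek(Seq(1000000L, 1000000L), 2000001L)) // (2,1)
      ```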
      
      ## How was this patch tested?
      
      Added unit test
      
      Did a small benchmark:
      
      Put 2000200 rows into `UnsafeExternalSorter` -- 2 spill files (each containing 1000000 rows) while inMemSorter contains 200 rows.
      Move the iterator forward to index=2000001.
      
      *With this change*:
      `getIterator(2000001)` costs almost 0ms~1ms;
      *Without this change*:
      `for(int i=0; i<2000001; i++) getIterator().loadNext()` costs 300ms.
      
      Author: jinxing <jinxing6042@126.com>
      
      Closes #18541 from jinxing64/SPARK-21315.
    • [SPARK-21369][CORE] Don't use Scala Tuple2 in common/network-* · 833eab2c
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Remove all usages of Scala Tuple2 from common/network-* projects. Otherwise, Yarn users cannot use `spark.reducer.maxReqSizeShuffleToMem`.
      
      ## How was this patch tested?
      
      Jenkins.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #18593 from zsxwing/SPARK-21369.
    • [SPARK-21350][SQL] Fix the error message when the number of arguments is wrong when invoking a UDF · 1471ee7a
      gatorsmile authored
      ### What changes were proposed in this pull request?
      Users get a very confusing error when they specify the wrong number of parameters.
      ```Scala
          val df = spark.emptyDataFrame
          spark.udf.register("foo", (_: String).length)
          df.selectExpr("foo(2, 3, 4)")
      ```
      ```
      org.apache.spark.sql.UDFSuite$$anonfun$9$$anonfun$apply$mcV$sp$12 cannot be cast to scala.Function3
      java.lang.ClassCastException: org.apache.spark.sql.UDFSuite$$anonfun$9$$anonfun$apply$mcV$sp$12 cannot be cast to scala.Function3
      	at org.apache.spark.sql.catalyst.expressions.ScalaUDF.<init>(ScalaUDF.scala:109)
      ```
      
      This PR is to capture the exception and issue an error message that is consistent with what we did for built-in functions. After the fix, the error message is improved to
      ```
      Invalid number of arguments for function foo; line 1 pos 0
      org.apache.spark.sql.AnalysisException: Invalid number of arguments for function foo; line 1 pos 0
      	at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:119)
      ```
      
      ### How was this patch tested?
      Added a test case
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18574 from gatorsmile/statsCheck.
    • [SPARK-21043][SQL] Add unionByName in Dataset · a2bec6c9
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR adds `unionByName` to `Dataset`.
      Here is how to use it:
      ```
      val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
      val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
      df1.unionByName(df2).show
      
      // output:
      // +----+----+----+
      // |col0|col1|col2|
      // +----+----+----+
      // |   1|   2|   3|
      // |   6|   4|   5|
      // +----+----+----+
      ```
      
      ## How was this patch tested?
      Added tests in `DataFrameSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #18300 from maropu/SPARK-21043-2.
    • [SPARK-21358][EXAMPLES] Argument of repartitionandsortwithinpartitions at pyspark · c3713fde
      chie8842 authored
      ## What changes were proposed in this pull request?
      In the example of `repartitionAndSortWithinPartitions` in rdd.py, the third argument should be True or False.
      I propose a fix to the example code.
      
      ## How was this patch tested?
      * I renamed test_repartitionAndSortWithinPartitions to test_repartitionAndSortWithinPartitions_asc to make the boolean argument explicit.
      * I added test_repartitionAndSortWithinPartitions_desc to test the False pattern for the third argument.
      
      Author: chie8842 <chie8842@gmail.com>
      
      Closes #18586 from chie8842/SPARK-21358.