  1. Sep 13, 2017
    • Jane Wang's avatar
      [SPARK-4131] Merge HiveTmpFile.scala to SaveAsHiveFile.scala · 8c7e19a3
      Jane Wang authored
      ## What changes were proposed in this pull request?
      
      The code is already merged to master:
      https://github.com/apache/spark/pull/18975
      
      This is a follow-up PR that merges HiveTmpFile.scala into SaveAsHiveFile.scala.
      
      ## How was this patch tested?
      
      Built successfully.
      
      Author: Jane Wang <janewang@fb.com>
      
      Closes #19221 from janewangfb/merge_savehivefile_hivetmpfile.
      8c7e19a3
    • donnyzone's avatar
      [SPARK-21980][SQL] References in grouping functions should be indexed with semanticEquals · 21c4450f
      donnyzone authored
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-21980
      
      This PR fixes an issue in the ResolveGroupingAnalytics rule, which indexes the column references in grouping functions without taking the case-sensitivity configuration into account.
      
      The problem can be reproduced by:
      
      `val df = spark.createDataFrame(Seq((1, 1), (2, 1), (2, 2))).toDF("a", "b")
       df.cube("a").agg(grouping("A")).show()`
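
A minimal sketch of the idea in plain Python, not Catalyst code: the reference inside `grouping(...)` has to be located in the group-by list under the configured name-matching rule rather than by raw string equality. The helper `index_of_grouping_column` and its `case_sensitive` flag are hypothetical stand-ins for `semanticEquals` and the case-sensitivity configuration.

```python
from typing import Sequence


def index_of_grouping_column(ref: str, group_by: Sequence[str],
                             case_sensitive: bool = False) -> int:
    """Return the position of `ref` in `group_by`, or -1 if it is absent.

    Hypothetical helper: names are compared under the given case
    sensitivity instead of by raw string equality.
    """
    def normalize(name: str) -> str:
        return name if case_sensitive else name.lower()

    target = normalize(ref)
    for i, col in enumerate(group_by):
        if normalize(col) == target:
            return i
    return -1


# cube("a") followed by grouping("A"), as in the reproduction above
print(index_of_grouping_column("A", ["a"]))                       # 0
print(index_of_grouping_column("A", ["a"], case_sensitive=True))  # -1
```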
      
      ## How was this patch tested?
      unit tests
      
      Author: donnyzone <wellfengzhu@gmail.com>
      
      Closes #19202 from DonnyZone/ResolveGroupingAnalytics.
      21c4450f
    • Armin's avatar
      [SPARK-21970][CORE] Fix Redundant Throws Declarations in Java Codebase · b6ef1f57
      Armin authored
      ## What changes were proposed in this pull request?
      
      1. Removing all redundant throws declarations from Java codebase.
      2. Removing dead code made visible by this from `ShuffleExternalSorter#closeAndGetSpills`
      
      ## How was this patch tested?
      
      Build still passes.
      
      Author: Armin <me@obrown.io>
      
      Closes #19182 from original-brownbear/SPARK-21970.
      b6ef1f57
    • Zheng RuiFeng's avatar
      [SPARK-21690][ML] one-pass imputer · 0fa5b7ca
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Parallelize the computation of all columns.
      
      performance tests:
      
      |numColumns| Mean(Old) | Median(Old) | Mean(RDD) | Median(RDD) | Mean(DF) | Median(DF) |
      |------|----------|------------|----------|------------|----------|------------|
      |1|0.0771394713|0.0658712813|0.080779802|0.048165981499999996|0.10525509870000001|0.0499620203|
      |10|0.7234340630999999|0.5954440414|0.0867935197|0.13263428659999998|0.09255724889999999|0.1573943635|
      |100|7.3756451568|6.2196631259|0.1911931552|0.8625376817000001|0.5557462431|1.7216837982000002|
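
The "one-pass" idea in the title can be illustrated with a small sketch, assuming the surrogate is the mean and that missing values are `None` or NaN. This is plain Python over in-memory rows, not the Spark implementation; `column_means` is a hypothetical name.

```python
import math
from typing import List, Optional, Sequence


def column_means(rows: Sequence[Sequence[Optional[float]]]) -> List[float]:
    """Mean of every column in a single pass, skipping missing values."""
    if not rows:
        return []
    n_cols = len(rows[0])
    sums = [0.0] * n_cols
    counts = [0] * n_cols
    for row in rows:  # one scan over the data serves all columns
        for j, value in enumerate(row):
            if value is not None and not math.isnan(value):
                sums[j] += value
                counts[j] += 1
    # A column with no valid values has no mean; report NaN for it.
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]


rows = [(1.0, 10.0), (None, 20.0), (3.0, float("nan"))]
print(column_means(rows))  # [2.0, 15.0]
```

Scanning once per column instead would make the cost grow with the number of columns, which is the pattern the "Old" timings in the table suggest.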
      
      ## How was this patch tested?
      existing tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #18902 from zhengruifeng/parallelize_imputer.
      0fa5b7ca
    • caoxuewen's avatar
      [SPARK-21963][CORE][TEST] Create temp file should be delete after use · ca00cc70
      caoxuewen authored
      ## What changes were proposed in this pull request?
      
      After you create a temporary table, you need to delete it; otherwise it leaves behind a file with a name like ‘SPARK194465907929586320484966temp’.
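
As an illustration of the cleanup pattern (a Python sketch, not the Scala test code touched by this commit), a temporary file can be removed in a `finally` block so that it is deleted even when the body fails. The helper name `with_temp_file` and the prefix and suffix are assumptions chosen to resemble the file name quoted above.

```python
import os
import tempfile


def with_temp_file(action):
    """Create a temp file, run `action(path)`, and always delete the file."""
    fd, path = tempfile.mkstemp(prefix="SPARK", suffix="temp")
    os.close(fd)
    try:
        return action(path)
    finally:
        # Runs on success and on failure, so no stray file is left behind.
        if os.path.exists(path):
            os.remove(path)


seen = []
with_temp_file(seen.append)
print(os.path.exists(seen[0]))  # False
```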
      
      ## How was this patch tested?
      
      N / A
      
      Author: caoxuewen <cao.xuewen@zte.com.cn>
      
      Closes #19174 from heary-cao/DeleteTempFile.
      ca00cc70
    • Sean Owen's avatar
      [SPARK-21893][BUILD][STREAMING][WIP] Put Kafka 0.8 behind a profile · 4fbf748b
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Put Kafka 0.8 support behind a kafka-0-8 profile.
      
      ## How was this patch tested?
      
      Existing tests. However, until the PR builder and Jenkins configs are updated, the effect here is that Kafka 0.8 support is not built or tested at all.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #19134 from srowen/SPARK-21893.
      4fbf748b
    • German Schiavon's avatar
      [SPARK-21982] Set locale to US · a1d98c6d
      German Schiavon authored
      ## What changes were proposed in this pull request?
      
      In UtilsSuite, the Locale was set to US by default, but it was not in effect when the format function was called, which used the JVM default locale instead. That locale can differ from US, which makes this test fail.
      
      ## How was this patch tested?
      Unit test (UtilsSuite)
      
      Author: German Schiavon <germanschiavon@gmail.com>
      
      Closes #19205 from Gschiavon/fix/test-locale.
      a1d98c6d
    • Sean Owen's avatar
      [BUILD] Close stale PRs · dd88fa3d
      Sean Owen authored
      Closes #18522
      Closes #17722
      Closes #18879
      Closes #18891
      Closes #18806
      Closes #18948
      Closes #18949
      Closes #19070
      Closes #19039
      Closes #19142
      Closes #18515
      Closes #19154
      Closes #19162
      Closes #19187
      Closes #19091
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #19203 from srowen/CloseStalePRs3.
      dd88fa3d
    • WeichenXu's avatar
      [SPARK-21027][MINOR][FOLLOW-UP] add missing since tag · f6c5d8f6
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      add missing since tag for `setParallelism` in #19110
      
      ## How was this patch tested?
      
      N/A
      
      Author: WeichenXu <weichen.xu@databricks.com>
      
      Closes #19214 from WeichenXu123/minor01.
      f6c5d8f6
  2. Sep 12, 2017
    • goldmedal's avatar
      [SPARK-21513][SQL] Allow UDF to_json support converting MapType to json · 371e4e20
      goldmedal authored
      # What changes were proposed in this pull request?
      The UDF to_json currently only supports converting a `StructType` or an `ArrayType` of `StructType`s to a JSON output string.
      Following the discussion in JIRA SPARK-21513, this PR lets `to_json` also convert a `MapType` and an `ArrayType` of `MapType`s to a JSON output string.
      This PR is for SQL and Scala API only.
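
To make the newly accepted input shapes concrete, here is a plain-Python analogy using the standard `json` module. It is not the Spark API: the function below is a hypothetical stand-in, a `dict` plays the role of `MapType`, a list of dicts plays the role of an `ArrayType` of `MapType`s, and the compact output formatting is an assumption.

```python
import json


def to_json(value) -> str:
    """Serialize a dict or a list of dicts to a compact JSON string."""
    if isinstance(value, dict):
        ok = True
    elif isinstance(value, list):
        ok = all(isinstance(item, dict) for item in value)
    else:
        ok = False
    if not ok:
        raise TypeError("expected a map or an array of maps")
    return json.dumps(value, separators=(",", ":"))


print(to_json({"a": 1}))              # {"a":1}
print(to_json([{"a": 1}, {"b": 2}]))  # [{"a":1},{"b":2}]
```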
      
      # How was this patch tested?
      Adding unit test case.
      
      cc viirya HyukjinKwon
      
      Author: goldmedal <liugs963@gmail.com>
      Author: Jia-Xuan Liu <liugs963@gmail.com>
      
      Closes #18875 from goldmedal/SPARK-21513.
      371e4e20
    • Wang Gengliang's avatar
      [SPARK-21979][SQL] Improve QueryPlanConstraints framework · 1a985747
      Wang Gengliang authored
      ## What changes were proposed in this pull request?
      
      Improve the QueryPlanConstraints framework to make it robust and simple.
      In https://github.com/apache/spark/pull/15319, constraints for expressions like `a = f(b, c)` are resolved.
      However, for expressions like
      ```scala
      a = f(b, c) && c = g(a, b)
      ```
      the current QueryPlanConstraints framework produces non-converging constraints.
      Essentially, the problem is caused by having both the name and the child of an alias in the same constraint set. We infer constraints and push them down as predicates in filters; later these predicates are propagated as constraints again, and so on.
      Simply using only the alias names resolves these problems. The size of the constraint set is reduced without losing any information, and we can always recover the inferred constraints on the children of aliases when pushing down filters.
      
      Also, the EqualNullSafe between the name and the child when propagating an alias is meaningless:
      ```scala
      allConstraints += EqualNullSafe(e, a.toAttribute)
      ```
      It just produces redundant constraints.
      
      ## How was this patch tested?
      
      Unit test
      
      Author: Wang Gengliang <ltnwgl@gmail.com>
      
      Closes #19201 from gengliangwang/QueryPlanConstraints.
      1a985747
    • Zheng RuiFeng's avatar
      [SPARK-18608][ML] Fix double caching · c5f9b89d
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      `df.rdd.getStorageLevel` => `df.storageLevel`
      
      using cmd `find . -name '*.scala' | xargs -i bash -c 'egrep -in "\.rdd\.getStorageLevel" {} && echo {}'` to make sure all algs involved in this issue are fixed.
      
      Previous discussion in other PRs: https://github.com/apache/spark/pull/19107, https://github.com/apache/spark/pull/17014
      
      ## How was this patch tested?
      existing tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #19197 from zhengruifeng/double_caching.
      c5f9b89d
    • sarutak's avatar
      [SPARK-21368][SQL] TPCDSQueryBenchmark can't refer query files. · b9b54b1c
      sarutak authored
      ## What changes were proposed in this pull request?
      
      TPCDSQueryBenchmark doesn't work with spark-submit when it is packaged into a jar.
      The cause is that it fails to reference the query files inside the jar file.
      
      ## How was this patch tested?
      
      Ran the benchmark.
      
      Author: sarutak <sarutak@oss.nttdata.co.jp>
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #18592 from sarutak/fix-tpcds-benchmark.
      b9b54b1c
    • Ajay Saini's avatar
      [SPARK-21027][ML][PYTHON] Added tunable parallelism to one vs. rest in both Scala mllib and Pyspark · 720c94fe
      Ajay Saini authored
      # What changes were proposed in this pull request?
      
      Added tunable parallelism to the pyspark implementation of one vs. rest classification. Added a parallelism parameter to the Scala implementation of one vs. rest along with functionality for using the parameter to tune the level of parallelism.
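
The effect of a parallelism parameter can be sketched with a thread pool, assuming each per-class binary model can be fitted independently. This is plain Python, not the `OneVsRest` code from the PR; `fit_one_vs_rest` and `fit_binary` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def fit_one_vs_rest(labels: Sequence[int], classes: Sequence[int],
                    fit_binary: Callable[[List[int]], object],
                    parallelism: int = 1) -> List[object]:
    """Fit one binary model per class, at most `parallelism` at a time."""
    def fit_for(cls: int):
        # Relabel: 1 for the current class, 0 for all the others.
        binary = [1 if y == cls else 0 for y in labels]
        return fit_binary(binary)

    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(fit_for, classes))  # results keep class order


# Toy "model": the fraction of positive examples for each class.
models = fit_one_vs_rest([0, 1, 2, 1], [0, 1, 2],
                         lambda ys: sum(ys) / len(ys), parallelism=2)
print(models)  # [0.25, 0.5, 0.25]
```

With `parallelism=1` the models are fitted one after another, which matches the behavior of an implementation without the parameter.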
      
      I am taking over PR #18281 because the original author is busy and we need to merge this change soon.
      After this has been merged, we can close #18281.
      
      ## How was this patch tested?
      
      Test suite added.
      
      Author: Ajay Saini <ajays725@gmail.com>
      Author: WeichenXu <weichen.xu@databricks.com>
      
      Closes #19110 from WeichenXu123/spark-21027.
      720c94fe
    • Zhenhua Wang's avatar
      [SPARK-17642][SQL] support DESC EXTENDED/FORMATTED table column commands · 515910e9
      Zhenhua Wang authored
      ## What changes were proposed in this pull request?
      
      Support DESC (EXTENDED | FORMATTED) ? TABLE COLUMN command.
      Support DESC EXTENDED | FORMATTED TABLE COLUMN command to show column-level statistics.
      Describing nested columns is NOT supported.
      
      ## How was this patch tested?
      
      Added test cases.
      
      Author: Zhenhua Wang <wzh_zju@163.com>
      Author: Zhenhua Wang <wangzhenhua@huawei.com>
      Author: wangzhenhua <wangzhenhua@huawei.com>
      
      Closes #16422 from wzhfy/descColumn.
      515910e9
    • Kousuke Saruta's avatar
      [DOCS] Fix unreachable links in the document · 95755823
      Kousuke Saruta authored
      ## What changes were proposed in this pull request?
      
      Recently, I found two unreachable links in the documentation and fixed them.
      Because this is a small documentation-only change, I didn't file a JIRA issue, but please let me know if you think one is needed.
      
      ## How was this patch tested?
      
      Tested manually.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #19195 from sarutak/fix-unreachable-link.
      95755823
    • Jen-Ming Chung's avatar
      [SPARK-21610][SQL][FOLLOWUP] Corrupt records are not handled properly when... · 7d0a3ef4
      Jen-Ming Chung authored
      [SPARK-21610][SQL][FOLLOWUP] Corrupt records are not handled properly when creating a dataframe from a file
      
      ## What changes were proposed in this pull request?
      
      When the `requiredSchema` only contains `_corrupt_record`, the derived `actualSchema` is empty and `_corrupt_record` is null for all rows. This PR detects that situation and raises an exception with a message describing a reasonable workaround, so that users know what happened and how to fix the query.
      
      ## How was this patch tested?
      
      Added unit test in `CSVSuite`.
      
      Author: Jen-Ming Chung <jenmingisme@gmail.com>
      
      Closes #19199 from jmchung/SPARK-21610-FOLLOWUP.
      7d0a3ef4
    • Marco Gaido's avatar
      [SPARK-14516][ML] Adding ClusteringEvaluator with the implementation of Cosine... · dd781675
      Marco Gaido authored
      [SPARK-14516][ML] Adding ClusteringEvaluator with the implementation of Cosine silhouette and squared Euclidean silhouette.
      
      ## What changes were proposed in this pull request?
      
      This PR adds the ClusteringEvaluator Evaluator which contains two metrics:
       - **cosineSilhouette**: the Silhouette measure using the cosine distance;
       - **squaredSilhouette**: the Silhouette measure using the squared Euclidean distance.
      
      The implementation of the two metrics follows the algorithm proposed and explained [here](https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view). These algorithms were designed for a distributed and parallel environment, so they perform reasonably well, unlike a naive Silhouette implementation that follows the definition directly.
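
One way to avoid the point-to-point comparisons of the naive definition, shown here for the squared Euclidean case, is to precompute per-cluster aggregates. The sketch below is an assumption about how such an approach can work, not code taken from the PR or from the linked document. It relies on the identity (1/N) Σ‖p − q‖² = ‖p‖² + (Σ‖q‖²)/N − 2 p·(Σq)/N, and it expects at least two clusters and points that are not all identical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

Vector = Sequence[float]


def _sq_norm(v: Vector) -> float:
    return sum(x * x for x in v)


def squared_silhouette(points: Sequence[Vector], labels: Sequence[int]) -> float:
    """Mean silhouette under squared Euclidean distance.

    Uses per-cluster aggregates (count, vector sum, sum of squared norms)
    so each point is compared with clusters rather than with every point.
    """
    dim = len(points[0])
    count: Dict[int, int] = defaultdict(int)
    vec_sum: Dict[int, List[float]] = defaultdict(lambda: [0.0] * dim)
    sq_sum: Dict[int, float] = defaultdict(float)
    for p, c in zip(points, labels):  # pass 1: build cluster aggregates
        count[c] += 1
        sq_sum[c] += _sq_norm(p)
        for j, x in enumerate(p):
            vec_sum[c][j] += x

    def mean_sq_dist(p: Vector, c: int) -> float:
        # Average squared distance from p to all points of cluster c.
        dot = sum(x * y for x, y in zip(p, vec_sum[c]))
        return _sq_norm(p) + sq_sum[c] / count[c] - 2.0 * dot / count[c]

    total = 0.0
    for p, c in zip(points, labels):  # pass 2: score each point
        n = count[c]
        if n == 1:
            continue  # a singleton cluster contributes a score of 0
        a = mean_sq_dist(p, c) * n / (n - 1)  # exclude the point itself
        b = min(mean_sq_dist(p, o) for o in count if o != c)
        total += (b - a) / max(a, b)
    return total / len(points)


pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
print(squared_silhouette(pts, [0, 0, 1, 1]))
```

The two passes cost time proportional to the number of points times the number of clusters, whereas the definition compares every pair of points.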
      
      ## How was this patch tested?
      
      The patch has been tested with the additional unit tests added (comparing the results with the ones provided by [Python sklearn library](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html)).
      
      Author: Marco Gaido <mgaido@hortonworks.com>
      
      Closes #18538 from mgaido91/SPARK-14516.
      dd781675
    • FavioVazquez's avatar
      [SPARK-21976][DOC] Fix wrong documentation for Mean Absolute Error. · e2ac2f1c
      FavioVazquez authored
      ## What changes were proposed in this pull request?
      
      Fixed wrong documentation for Mean Absolute Error.
      
      Even though the code is correct for the MAE:
      
      ```scala
      @Since("1.2.0")
      def meanAbsoluteError: Double = {
        summary.normL1(1) / summary.count
      }
      ```
      In the documentation the division by N is missing.
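
For reference, the quantity the Scala snippet computes is the L1 norm of the errors divided by the number of observations N. A plain-Python equivalent (hypothetical function name, standard library only) looks like this:

```python
from typing import Sequence


def mean_absolute_error(observed: Sequence[float],
                        predicted: Sequence[float]) -> float:
    """MAE = (1/N) * sum(|observed_i - predicted_i|)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    l1_norm = sum(abs(y - p) for y, p in zip(observed, predicted))
    return l1_norm / len(observed)  # the division by N that the docs omitted


print(mean_absolute_error([1.0, 2.0, 3.0], [2.0, 2.0, 5.0]))  # 1.0
```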
      
      ## How was this patch tested?
      
      All of the Spark tests were run.
      
      
      Author: FavioVazquez <favio.vazquezp@gmail.com>
      Author: faviovazquez <favio.vazquezp@gmail.com>
      Author: Favio André Vázquez <favio.vazquezp@gmail.com>
      
      Closes #19190 from FavioVazquez/mae-fix.
      e2ac2f1c
  3. Sep 11, 2017
    • caoxuewen's avatar
      [MINOR][SQL] remove unuse import class · dc74c0e6
      caoxuewen authored
      ## What changes were proposed in this pull request?
      
      This PR removes imported classes that are unused.
      
      ## How was this patch tested?
      
      N/A
      
      Author: caoxuewen <cao.xuewen@zte.com.cn>
      
      Closes #19131 from heary-cao/unuse_import.
      dc74c0e6
    • Chunsheng Ji's avatar
      [SPARK-21856] Add probability and rawPrediction to MLPC for Python · 4bab8f59
      Chunsheng Ji authored
      Probability and rawPrediction have been added to MultilayerPerceptronClassifier for Python.
      
      Add unit test.
      
      Author: Chunsheng Ji <chunsheng.ji@gmail.com>
      
      Closes #19172 from chunshengji/SPARK-21856.
      4bab8f59
  4. Sep 10, 2017
    • Felix Cheung's avatar
      [BUILD][TEST][SPARKR] add sparksubmitsuite to appveyor tests · 828fab03
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      more file regex
      
      ## How was this patch tested?
      
      Jenkins, AppVeyor
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #19177 from felixcheung/rmoduletotest.
      828fab03
    • Jen-Ming Chung's avatar
      [SPARK-21610][SQL] Corrupt records are not handled properly when creating a dataframe from a file · 6273a711
      Jen-Ming Chung authored
      ## What changes were proposed in this pull request?
      ```
      echo '{"field": 1}
      {"field": 2}
      {"field": "3"}' >/tmp/sample.json
      ```
      
      ```scala
      import org.apache.spark.sql.types._
      
      val schema = new StructType()
        .add("field", ByteType)
        .add("_corrupt_record", StringType)
      
      val file = "/tmp/sample.json"
      
      val dfFromFile = spark.read.schema(schema).json(file)
      
      scala> dfFromFile.show(false)
      +-----+---------------+
      |field|_corrupt_record|
      +-----+---------------+
      |1    |null           |
      |2    |null           |
      |null |{"field": "3"} |
      +-----+---------------+
      
      scala> dfFromFile.filter($"_corrupt_record".isNotNull).count()
      res1: Long = 0
      
      scala> dfFromFile.filter($"_corrupt_record".isNull).count()
      res2: Long = 3
      ```
      When the `requiredSchema` only contains `_corrupt_record`, the derived `actualSchema` is empty and `_corrupt_record` is null for all rows. This PR detects that situation and raises an exception with a message describing a reasonable workaround, so that users know what happened and how to fix the query.
      
      ## How was this patch tested?
      
      Added test case.
      
      Author: Jen-Ming Chung <jenmingisme@gmail.com>
      
      Closes #18865 from jmchung/SPARK-21610.
      6273a711
    • Peter Szalai's avatar
      [SPARK-20098][PYSPARK] dataType's typeName fix · 520d92a1
      Peter Szalai authored
      ## What changes were proposed in this pull request?
      The `typeName` classmethod has been fixed by using a type -> typeName map.
      
      ## How was this patch tested?
      local build
      
      Author: Peter Szalai <szalaipeti.vagyok@gmail.com>
      
      Closes #17435 from szalai1/datatype-gettype-fix.
      520d92a1
  5. Sep 09, 2017
    • Jane Wang's avatar
      [SPARK-4131] Support "Writing data into the filesystem from queries" · f7679055
      Jane Wang authored
      ## What changes were proposed in this pull request?
      
      This PR implements the SQL feature:
      INSERT OVERWRITE [LOCAL] DIRECTORY directory1
        [ROW FORMAT row_format] [STORED AS file_format]
        SELECT ... FROM ...
      
      ## How was this patch tested?
      Added new unit tests and also pulled the code into fb-spark so that we could test writing to an HDFS directory.
      
      Author: Jane Wang <janewang@fb.com>
      
      Closes #18975 from janewangfb/port_local_directory.
      f7679055
    • Yanbo Liang's avatar
      [MINOR][SQL] Correct DataFrame doc. · e4d8f9a3
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Correct DataFrame doc.
      
      ## How was this patch tested?
      Only doc change, no tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #19173 from yanboliang/df-doc.
      e4d8f9a3
    • Liang-Chi Hsieh's avatar
      [SPARK-21954][SQL] JacksonUtils should verify MapType's value type instead of key type · 6b45d7e9
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      `JacksonUtils.verifySchema` verifies if a data type can be converted to JSON. For `MapType`, it now verifies the key type. However, in `JacksonGenerator`, when converting a map to JSON, we only care about its values and create a writer for the values. The keys in a map are treated as strings by calling `toString` on the keys.
      
      Thus, we should change `JacksonUtils.verifySchema` to verify the value type of `MapType`.
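
The proposed check can be sketched with a toy type model. Everything below is hypothetical illustration in Python (the classes are simplified stand-ins, not Spark's types): for a map, only the value type is verified, because keys are written out as strings regardless of their type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AtomicType:
    name: str


@dataclass(frozen=True)
class UnsupportedType:
    name: str


@dataclass(frozen=True)
class MapType:
    key_type: object
    value_type: object


@dataclass(frozen=True)
class StructType:
    fields: tuple  # tuple of (field name, field type) pairs


def verify_schema(data_type) -> None:
    """Raise TypeError if `data_type` cannot be converted to JSON."""
    if isinstance(data_type, AtomicType):
        return
    if isinstance(data_type, StructType):
        for _, field_type in data_type.fields:
            verify_schema(field_type)
        return
    if isinstance(data_type, MapType):
        # Keys are stringified when written, so only values are verified.
        verify_schema(data_type.value_type)
        return
    raise TypeError(f"cannot convert {data_type} to JSON")


# Accepted: the key type is never inspected.
verify_schema(MapType(UnsupportedType("interval"), AtomicType("int")))
```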
      
      ## How was this patch tested?
      
      Added tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #19167 from viirya/test-jacksonutils.
      6b45d7e9
    • Andrew Ash's avatar
      [SPARK-21941] Stop storing unused attemptId in SQLTaskMetrics · 8a5eb506
      Andrew Ash authored
      ## What changes were proposed in this pull request?
      
      In a driver heap dump containing 390,105 instances of SQLTaskMetrics this
      would have saved me approximately 3.2MB of memory.
      
      Since we're not getting any benefit from storing this unused value, let's
      eliminate it until a future PR makes use of it.
      
      ## How was this patch tested?
      
      Existing unit tests
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #19153 from ash211/aash/trim-sql-listener.
      8a5eb506
  6. Sep 08, 2017
    • Xin Ren's avatar
      [SPARK-19866][ML][PYSPARK] Add local version of Word2Vec findSynonyms for spark.ml: Python API · 31c74fec
      Xin Ren authored
      https://issues.apache.org/jira/browse/SPARK-19866
      
      ## What changes were proposed in this pull request?
      
      Add Python API for findSynonymsArray matching Scala API.
      
      ## How was this patch tested?
      
      Manual test
      `./python/run-tests --python-executables=python2.7 --modules=pyspark-ml`
      
      Author: Xin Ren <iamshrek@126.com>
      Author: Xin Ren <renxin.ubc@gmail.com>
      Author: Xin Ren <keypointt@users.noreply.github.com>
      
      Closes #17451 from keypointt/SPARK-19866.
      31c74fec
    • hyukjinkwon's avatar
      [SPARK-15243][ML][SQL][PYTHON] Add missing support for unicode in Param... · 8598d03a
      hyukjinkwon authored
      [SPARK-15243][ML][SQL][PYTHON] Add missing support for unicode in Param methods & functions in dataframe
      
      ## What changes were proposed in this pull request?
      
      This PR proposes to support unicode strings in Param methods in ML and in other DataFrame functions that were missed.

      For example, this causes a `TypeError` in Python 2.x when the param name is a unicode string:
      
      ```python
      >>> from pyspark.ml.classification import LogisticRegression
      >>> lr = LogisticRegression()
      >>> lr.hasParam("threshold")
      True
      >>> lr.hasParam(u"threshold")
      Traceback (most recent call last):
       ...
          raise TypeError("hasParam(): paramName must be a string")
      TypeError: hasParam(): paramName must be a string
      ```
      
      This PR is based on https://github.com/apache/spark/pull/13036
      
      ## How was this patch tested?
      
      Unit tests in `python/pyspark/ml/tests.py` and `python/pyspark/sql/tests.py`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #17096 from HyukjinKwon/SPARK-15243.
      8598d03a
    • Kazuaki Ishizaki's avatar
      [SPARK-21946][TEST] fix flaky test: "alter table: rename cached table" in InMemoryCatalogedDDLSuite · 8a4f228d
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR fixes flaky test `InMemoryCatalogedDDLSuite "alter table: rename cached table"`.
      Since this test validates a distributed DataFrame, the result should be checked with `checkAnswer`. The original version used the `df.collect().Seq` method, which does not guarantee the order of the elements in the result.
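
The source of the flakiness can be shown with a small Python sketch: comparing collected rows as ordered sequences fails whenever the rows arrive in a different order, while a multiset comparison does not. The helper `same_rows` is hypothetical and is not how `checkAnswer` is implemented.

```python
from collections import Counter
from typing import Iterable, Tuple


def same_rows(actual: Iterable[Tuple], expected: Iterable[Tuple]) -> bool:
    """Compare two row collections as multisets, ignoring order."""
    return Counter(actual) == Counter(expected)


expected = [(1, "a"), (2, "b")]
shuffled = [(2, "b"), (1, "a")]
print(shuffled == expected)           # False
print(same_rows(shuffled, expected))  # True
```

Using a `Counter` rather than a `set` keeps duplicate rows significant, so a result with a repeated row is not mistaken for one without it.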
      
      ## How was this patch tested?
      
      Use existing test case
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #19159 from kiszk/SPARK-21946.
      8a4f228d
    • Liang-Chi Hsieh's avatar
      [SPARK-21726][SQL][FOLLOW-UP] Check for structural integrity of the plan in Optimzer in test mode · 0dfc1ec5
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      The condition in `Optimizer.isPlanIntegral` is wrong. We should always return `true` if not in test mode.
      
      ## How was this patch tested?
      
      Tested manually.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #19161 from viirya/SPARK-21726-followup.
      0dfc1ec5
    • Wenchen Fan's avatar
      [SPARK-21936][SQL] backward compatibility test framework for HiveExternalCatalog · dbb82412
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      `HiveExternalCatalog` is a semi-public interface. When creating tables, `HiveExternalCatalog` converts the table metadata to the Hive table format and saves it into the Hive metastore. It's very important to guarantee backward compatibility here, i.e., tables created by previous Spark versions should still be readable in newer Spark versions.

      Previously we found backward compatibility issues manually, which makes it really easy to miss bugs. This PR introduces a test framework that automatically tests `HiveExternalCatalog` backward compatibility by downloading Spark binaries of different versions, creating tables with those versions, and reading the tables with the current Spark version.
      
      ## How was this patch tested?
      
      test-only change
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #19148 from cloud-fan/test.
      dbb82412
    • Liang-Chi Hsieh's avatar
      [SPARK-21726][SQL] Check for structural integrity of the plan in Optimzer in test mode. · 6e37524a
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      We now have many optimization rules in `Optimizer`. Right now the optimizer has no checks for the structural integrity of the plan (e.g. that it is resolved). When debugging, it is difficult to identify which rules return invalid plans.

      It would be great if, in test mode, we could check whether a plan is still resolved after the execution of each rule, so that we can catch rules that return invalid plans.
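
A rough sketch of such a check, in Python with hypothetical names (`Plan`, `execute`, and `is_plan_integral` here are simplified stand-ins, not the Catalyst classes): the executor validates the plan after every rule, and the validation always passes outside test mode.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Plan:
    description: str
    resolved: bool = True


Rule = Callable[[Plan], Plan]


def is_plan_integral(plan: Plan, testing: bool) -> bool:
    # Outside test mode the check is skipped and always passes.
    return (not testing) or plan.resolved


def execute(plan: Plan, rules: Sequence[Rule], testing: bool) -> Plan:
    """Apply the rules in order, checking the plan after each one."""
    for rule in rules:
        plan = rule(plan)
        if not is_plan_integral(plan, testing):
            name = getattr(rule, "__name__", repr(rule))
            raise RuntimeError(f"rule {name} produced an unresolved plan")
    return plan


def bad_rule(plan: Plan) -> Plan:
    """A deliberately broken rule that returns an unresolved plan."""
    return Plan(plan.description, resolved=False)


print(execute(Plan("scan"), [bad_rule], testing=False).resolved)  # False
```

Calling `execute(Plan("scan"), [bad_rule], testing=True)` raises and names the offending rule, which is the debugging aid the description asks for.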
      
      ## How was this patch tested?
      
      Added tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #18956 from viirya/SPARK-21726.
      6e37524a
    • liuxian's avatar
      [SPARK-21949][TEST] Tables created in unit tests should be dropped after use · f62b20f3
      liuxian authored
      ## What changes were proposed in this pull request?
       Tables should be dropped after use in unit tests.
      ## How was this patch tested?
      N/A
      
      Author: liuxian <liu.xian3@zte.com.cn>
      
      Closes #19155 from 10110346/droptable.
      f62b20f3
    • Takuya UESHIN's avatar
      [SPARK-21950][SQL][PYTHON][TEST] pyspark.sql.tests.SQLTests2 should stop SparkContext. · 57bc1e9e
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      `pyspark.sql.tests.SQLTests2` doesn't stop the Spark context it creates in the test, which might affect the tests that follow.
      This PR makes `pyspark.sql.tests.SQLTests2` stop the `SparkContext`.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #19158 from ueshin/issues/SPARK-21950.
      57bc1e9e
  7. Sep 07, 2017
    • Dongjoon Hyun's avatar
      [SPARK-21939][TEST] Use TimeLimits instead of Timeouts · c26976fe
      Dongjoon Hyun authored
      Since ScalaTest 3.0.0, `org.scalatest.concurrent.Timeouts` is deprecated.
      This PR replaces the deprecated one with `org.scalatest.concurrent.TimeLimits`.
      
      ```scala
      -import org.scalatest.concurrent.Timeouts._
      +import org.scalatest.concurrent.TimeLimits._
      ```
      
      Pass the existing test suites.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #19150 from dongjoon-hyun/SPARK-21939.
      
      Change-Id: I1a1b07f1b97e51e2263dfb34b7eaaa099b2ded5e
      c26976fe
    • Dongjoon Hyun's avatar
      [SPARK-13656][SQL] Delete spark.sql.parquet.cacheMetadata from SQLConf and docs · e00f1a1d
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Since [SPARK-15639](https://github.com/apache/spark/pull/13701), `spark.sql.parquet.cacheMetadata` and `PARQUET_CACHE_METADATA` are not used. This PR removes them from SQLConf and the docs.
      
      ## How was this patch tested?
      
      Pass the existing Jenkins.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #19129 from dongjoon-hyun/SPARK-13656.
      e00f1a1d
    • Sanket Chintapalli's avatar
      [SPARK-21890] Credentials not being passed to add the tokens · b9ab791a
      Sanket Chintapalli authored
      I observed this while running an Oozie job that tries to connect to HBase via Spark.
      It looks like the creds are not being passed in https://github.com/apache/spark/blob/branch-2.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/security/HadoopFSCredentialProvider.scala#L53 for the 2.2 release.
      More info on why it fails on a secure grid:
      The Oozie client gets the tokens the application needs before launching. It passes those tokens along to the Oozie launcher job (an MR job), which then actually calls the Spark client to launch the Spark app and passes the tokens along.
      The Oozie launcher job cannot get any more tokens because all it has is tokens (you can't get tokens with tokens; you need a TGT or keytab).
      The error occurs because the launcher job runs the Spark client to submit the Spark job, but the Spark client doesn't see that it already has the HDFS tokens, so it tries to get more, which ends with the exception.
      SPARK-19021 generalized the HDFS credentials provider and changed it so that the existing credentials are no longer passed into the call that gets tokens, so the call doesn't realize it already has the necessary tokens.
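
The failure mode can be modeled with a toy sketch in Python, under the assumption that the token-fetching call skips services for which a token is already present in the credentials it is given. All names below are hypothetical; this is not the Hadoop or Spark API.

```python
from typing import Callable, Dict, Iterable


def obtain_tokens(services: Iterable[str], existing: Dict[str, str],
                  fetch: Callable[[str], str]) -> Dict[str, str]:
    """Return tokens for `services`, fetching only those not already held."""
    tokens = dict(existing)
    for service in services:
        if service not in tokens:
            tokens[service] = fetch(service)  # needs a TGT or keytab
    return tokens


def fetch_without_kerberos(service: str) -> str:
    """Models a launcher that holds only tokens: fetching always fails."""
    raise PermissionError(f"cannot fetch a token for {service} using only tokens")


held = {"hdfs://nn1": "token-from-oozie"}
print(obtain_tokens(["hdfs://nn1"], held, fetch_without_kerberos))
# {'hdfs://nn1': 'token-from-oozie'}
```

Passing an empty dictionary instead of `held` mirrors the regression: the call no longer knows about the existing token, attempts a fetch, and fails.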
      
      https://issues.apache.org/jira/browse/SPARK-21890
      Modified to pass creds to get delegation tokens
      
      Author: Sanket Chintapalli <schintap@yahoo-inc.com>
      
      Closes #19140 from redsanket/SPARK-21890-master.
      b9ab791a
    • Dongjoon Hyun's avatar
      [SPARK-21912][SQL] ORC/Parquet table should not create invalid column names · eea2b877
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, jobs abort when users create or alter ORC/Parquet tables with invalid column names. We should prevent this by raising **AnalysisException** with a hint to use aliases instead, as is done for Parquet data source tables.
      
      **BEFORE**
      ```scala
      scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`")
      17/09/04 13:28:21 ERROR Utils: Aborting task
      java.lang.IllegalArgumentException: Error: : expected at the position 8 of 'struct<a b:int>' but ' ' is found.
      17/09/04 13:28:21 ERROR FileFormatWriter: Job job_20170904132821_0001 aborted.
      17/09/04 13:28:21 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
      org.apache.spark.SparkException: Task failed while writing rows.
      ```
      
      **AFTER**
      ```scala
      scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`")
      17/09/04 13:27:40 ERROR CreateDataSourceTableAsSelectCommand: Failed to write to table orc1
      org.apache.spark.sql.AnalysisException: Attribute name "a b" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
      ```
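
The kind of up-front validation shown in the AFTER output can be sketched as follows. This is an illustrative Python version with a hypothetical function name; the character set and the wording are copied from the error message above, and a plain `ValueError` stands in for `AnalysisException`.

```python
INVALID_CHARS = " ,;{}()\n\t="


def check_field_name(name: str) -> None:
    """Reject column names containing characters the format cannot store."""
    if any(ch in INVALID_CHARS for ch in name):
        raise ValueError(
            f'Attribute name "{name}" contains invalid character(s) among '
            f"{INVALID_CHARS!r}. Please use alias to rename it."
        )


check_field_name("a_b")  # accepted
try:
    check_field_name("a b")
except ValueError as err:
    print(err)
```

Failing during analysis, before any task starts writing, is what turns the aborted job in the BEFORE output into a single descriptive error.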
      
      ## How was this patch tested?
      
      Pass the Jenkins with a new test case.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #19124 from dongjoon-hyun/SPARK-21912.
      eea2b877