  1. Jul 05, 2017
    • [SPARK-20383][SQL] Supporting Create [temporary] Function with the keyword 'OR REPLACE' and 'IF NOT EXISTS' · 5787ace4
      ouyangxiaochen authored
      
      ## What changes were proposed in this pull request?
      
      Support creating [temporary] functions with the keywords 'OR REPLACE' and 'IF NOT EXISTS'.
      
      ## How was this patch tested?
      manual test and added test cases
      
      Author: ouyangxiaochen <ou.yangxiaochen@zte.com.cn>
      
      Closes #17681 from ouyangxiaochen/spark-419.
    • [SPARK-16167][SQL] RowEncoder should preserve array/map type nullability. · 873f3ad2
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      Currently `RowEncoder` doesn't preserve nullability of `ArrayType` or `MapType`.
      It always returns `containsNull = true` for `ArrayType` and `valueContainsNull = true` for `MapType`, and the nullability of the type itself is also always `true`.
      
      This PR fixes their nullability.
      ## How was this patch tested?
      
      Add tests to check if `RowEncoder` preserves array/map nullability.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #13873 from ueshin/issues/SPARK-16167.
    • [SPARK-21310][ML][PYSPARK] Expose offset in PySpark · 4852b7d4
      actuaryzhang authored
      ## What changes were proposed in this pull request?
      Add offset to PySpark in GLM as in #16699.
      
      ## How was this patch tested?
      Python test
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      
      Closes #18534 from actuaryzhang/pythonOffset.
    • [SPARK-18623][SQL] Add `returnNullable` to `StaticInvoke` and modify it to handle properly. · a3864325
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      Add `returnNullable` to `StaticInvoke`, in the same way that #15780 is trying to add it to `Invoke`, and handle it properly.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #16056 from ueshin/issues/SPARK-18623.
    • [SPARK-21304][SQL] remove unnecessary isNull variable for collection related encoder expressions · f2c3b1dd
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      For these collection-related encoder expressions, we don't need to create `isNull` variable if the loop element is not nullable.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18529 from cloud-fan/minor.
  2. Jul 04, 2017
    • [SPARK-20889][SPARKR][FOLLOWUP] Clean up grouped doc for column methods · e9a93f81
      actuaryzhang authored
      ## What changes were proposed in this pull request?
      Add doc for methods that were left out, and fix various style and consistency issues.
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      
      Closes #18493 from actuaryzhang/sparkRDocCleanup.
    • [SPARK-21300][SQL] ExternalMapToCatalyst should null-check map key prior to converting to internal value. · ce10545d
      Takuya UESHIN authored
      
      ## What changes were proposed in this pull request?
      
      `ExternalMapToCatalyst` should null-check map key prior to converting to internal value to throw an appropriate Exception instead of something like NPE.
      
      ## How was this patch tested?
      
      Added a test and existing tests.
      
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #18524 from ueshin/issues/SPARK-21300.
    • [SPARK-21295][SQL] Use qualified names in error message for missing references · de14086e
      gatorsmile authored
      ### What changes were proposed in this pull request?
      It is strange to see the following error message. Actually, the column is from another table.
      ```
      cannot resolve '`right.a`' given input columns: [a, c, d];
      ```
      
      After the PR, the error message looks like
      ```
      cannot resolve '`right.a`' given input columns: [left.a, right.c, right.d];
      ```
      
      ### How was this patch tested?
      Added a test case
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18520 from gatorsmile/removeSQLConf.
    • [MINOR][SPARKR] ignore Rplots.pdf test output after running R tests · daabf425
      wangmiao1981 authored
      ## What changes were proposed in this pull request?
      
      After running R tests in local build, it outputs Rplots.pdf. This one should be ignored in the git repository.
      
      Author: wangmiao1981 <wm624@hotmail.com>
      
      Closes #18518 from wangmiao1981/ignore.
    • [SPARK-20889][SPARKR] Grouped documentation for WINDOW column methods · cec39215
      actuaryzhang authored
      ## What changes were proposed in this pull request?
      
      Grouped documentation for column window methods.
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      
      Closes #18481 from actuaryzhang/sparkRDocWindow.
    • [SPARK-21268][MLLIB] Move center calculations to a distributed map in KMeans · 4d6d8192
      dardelet authored
      ## What changes were proposed in this pull request?
      
      The `scal()` call and the creation of the `newCenter` vector are done in the driver, after a `collectAsMap` operation, although they could be done in the distributed RDD.
      This PR moves this code before the `collectAsMap` for better efficiency.
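
The reordering can be pictured with a plain-Python sketch. This is only an illustration, not the Spark code: `reduce_by_key` and both `centers_*` functions are hypothetical stand-ins, points are 1-D floats, and nested lists play the role of RDD partitions.

```python
from collections import defaultdict

def reduce_by_key(partitions):
    """Sum 1-D points and count them per cluster id (stands in for reduceByKey)."""
    totals = defaultdict(lambda: [0.0, 0])
    for partition in partitions:
        for cluster_id, point in partition:
            totals[cluster_id][0] += point
            totals[cluster_id][1] += 1
    return [(cid, (s, n)) for cid, (s, n) in totals.items()]

def centers_scaled_on_driver(partitions):
    # Old shape: collect the (sum, count) pairs first, then divide on the driver.
    collected = dict(reduce_by_key(partitions))
    return {cid: s / n for cid, (s, n) in collected.items()}

def centers_scaled_before_collect(partitions):
    # New shape: divide inside the per-key map step, then collect the finished centers.
    scaled = [(cid, s / n) for cid, (s, n) in reduce_by_key(partitions)]
    return dict(scaled)

partitions = [[(0, 1.0), (1, 10.0)], [(0, 3.0), (1, 14.0)]]
print(centers_scaled_before_collect(partitions))
```

Both functions return the same centers; the difference is only where the division runs.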
      
      ## How was this patch tested?
      
      This was tested manually by running the KMeansExample and verifying that the new code ran without error and gave same output as before.
      
      Author: dardelet <guillaumegorp@gmail.com>
      Author: Guillaume Dardelet <dardelet@users.noreply.github.com>
      
      Closes #18491 from dardelet/move-center-calculation-to-distributed-map-kmean.
    • [SPARK-20256][SQL] SessionState should be created more lazily · 1b50e0e0
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      `SessionState` is designed to be created lazily. However, in reality, it is created immediately in `SparkSession.Builder.getOrCreate` ([here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L943)).
      
      This PR aims to recover the lazy behavior by keeping the options in `initialSessionOptions`. The benefit is as follows: users can start `spark-shell` and use RDD operations without any problems.
      
      **BEFORE**
      ```scala
      $ bin/spark-shell
      java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder'
      ...
      Caused by: org.apache.spark.sql.AnalysisException:
          org.apache.hadoop.hive.ql.metadata.HiveException:
             MetaException(message:java.security.AccessControlException:
                Permission denied: user=spark, access=READ,
                   inode="/apps/hive/warehouse":hive:hdfs:drwx------
      ```
      As reported in SPARK-20256, this happens when this user is not allowed to access the warehouse directory.
      
      **AFTER**
      ```scala
      $ bin/spark-shell
      ...
      Welcome to
            ____              __
           / __/__  ___ _____/ /__
          _\ \/ _ \/ _ `/ __/  '_/
         /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
            /_/
      
      Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)
      Type in expressions to have them evaluated.
      Type :help for more information.
      
      scala> sc.range(0, 10, 1).count()
      res0: Long = 10
      ```
      
      ## How was this patch tested?
      
      Manual.
      
      This closes #18512 .
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #18501 from dongjoon-hyun/SPARK-20256.
    • [SPARK-19726][SQL] Failed to insert null timestamp value to mysql using spark jdbc · a3c29fcb
      YIHAODIAN\wangshuangshuang authored
      ## What changes were proposed in this pull request?
      
      When creating a table like the following:
      > create table timestamp_test(id int(11), time_stamp timestamp not null default current_timestamp);
      
      The result of executing "insert into timestamp_test values (111, null)" differs between Spark and JDBC.
      ```
      mysql> select * from timestamp_test;
      +------+---------------------+
      | id   | time_stamp          |
      +------+---------------------+
      |  111 | 1970-01-01 00:00:00 | -> spark
      |  111 | 2017-06-27 19:32:38 | -> mysql
      +------+---------------------+
      2 rows in set (0.00 sec)
      ```
      Because `StructField.nullable` is false in this case, the code generated for `InvokeLike` and `BoundReference` does not check whether the field is null. Instead, it directly uses `CodegenContext.INPUT_ROW.getLong(1)`; however, `UnsafeRow.setNullAt(1)` puts 0 in the underlying memory.
      
      This PR always sets `StructField.nullable` to true after obtaining metadata from the JDBC connection, since we can insert null into a NOT NULL timestamp column in MySQL. In this way, Spark propagates null to the underlying DB engine and lets the DB choose how to process NULL.
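
A toy model may help show why the Spark row ends up as `1970-01-01 00:00:00`. The sketch below is not `UnsafeRow`; `TinyRow` and `read_timestamp` are hypothetical names, and the only behavior borrowed from the description above is that setting a field to null zeroes its slot and that a non-nullable read skips the null check.

```python
import struct
from datetime import datetime, timezone

class TinyRow:
    """Toy fixed-width row: one null flag and one 8-byte slot per field."""

    def __init__(self, num_fields):
        self.nulls = [False] * num_fields
        self.data = bytearray(8 * num_fields)

    def set_long(self, i, value):
        self.nulls[i] = False
        struct.pack_into("<q", self.data, 8 * i, value)

    def set_null_at(self, i):
        self.nulls[i] = True
        struct.pack_into("<q", self.data, 8 * i, 0)  # the slot itself is zeroed

    def get_long(self, i):
        return struct.unpack_from("<q", self.data, 8 * i)[0]

def read_timestamp(row, i, nullable):
    # With nullable=False the null flag is never consulted.
    if nullable and row.nulls[i]:
        return None
    micros = row.get_long(i)
    return datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc)

row = TinyRow(2)
row.set_long(0, 111)
row.set_null_at(1)
print(read_timestamp(row, 1, nullable=False))  # reads the zeroed slot as the epoch
print(read_timestamp(row, 1, nullable=True))   # respects the null flag
```

Treating the field as nullable is what lets the null reach the database instead of a zero timestamp.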
      
      ## How was this patch tested?
      
      Added tests.
      
      Author: YIHAODIAN\wangshuangshuang <wangshuangshuang@yihaodian.com>
      Author: Shuangshuang Wang <wsszone@gmail.com>
      
      Closes #18445 from shuangshuangwang/SPARK-19726.
    • [SPARK-21256][SQL] Add withSQLConf to Catalyst Test · 29b1f6b0
      gatorsmile authored
      ### What changes were proposed in this pull request?
      SQLConf is moved to Catalyst. We are adding more and more test cases for verifying the conf-specific behaviors. It is nice to add a helper function to simplify the test cases.
      
      ### How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18469 from gatorsmile/withSQLConf.
    • [SPARK-19507][SPARK-21296][PYTHON] Avoid per-record type dispatch in schema verification and improve exception message · d492cc5a
      hyukjinkwon authored
      
      ## What changes were proposed in this pull request?
      **Context**
      
      While reviewing https://github.com/apache/spark/pull/17227, I realised here we type-dispatch per record. The PR itself is fine in terms of performance as is but this prints a prefix, `"obj"` in exception message as below:
      
      ```
      from pyspark.sql.types import *
      schema = StructType([StructField('s', IntegerType(), nullable=False)])
      spark.createDataFrame([["1"]], schema)
      ...
      TypeError: obj.s: IntegerType can not accept object '1' in type <type 'str'>
      ```
      
      I suggested getting rid of this, but while investigating I realised my approach might bring a performance regression, as it is a hot path.
      
      Getting rid of the prefix cleanly for SPARK-19507 and https://github.com/apache/spark/pull/17227 alone needs more changes, so I decided to fix both issues together.
      
      **Proposal**
      
      This PR tries to:
      
        - get rid of per-record type dispatch, as we do in many code paths in Scala, so that it improves performance (roughly 25% improvement) - SPARK-21296
      
          This was tested with simple code, `spark.createDataFrame(range(1000000), "int")`. However, I am quite sure the actual improvement in practice is larger than this, in particular when the schema is complicated.
      
        - improve the error message in the exception by describing field information as prose - SPARK-19507
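
The idea behind both bullets can be sketched in plain Python. This is a simplified illustration rather than the PySpark implementation: `ACCEPTED`, `verify_per_record` and `make_verifier` are hypothetical names, and the type table covers only three types.

```python
ACCEPTED = {"int": int, "string": str, "double": float}

def verify_per_record(type_name, value):
    # Per-record dispatch: the expected type is looked up again for every value.
    expected = ACCEPTED[type_name]
    if not isinstance(value, expected):
        raise TypeError("%s can not accept object %r" % (type_name, value))

def make_verifier(type_name, field_name):
    # Dispatch once, when the schema is known, and return a closure for the hot path.
    expected = ACCEPTED[type_name]

    def verify(value):
        if not isinstance(value, expected):
            # The message describes the field in prose instead of a path-like prefix.
            raise TypeError(
                "field %s: %s can not accept object %r in type %s"
                % (field_name, type_name, value, type(value))
            )

    return verify

verify_s = make_verifier("int", "s")
for record in [1, 2, 3]:
    verify_s(record)  # only the isinstance check runs per record
```

The closure pays the lookup cost once per schema field rather than once per record, which is where the speed-up would come from.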
      
      ## How was this patch tested?
      
      Manually tested and unit tests were added in `python/pyspark/sql/tests.py`.
      
      Benchmark - codes: https://gist.github.com/HyukjinKwon/c3397469c56cb26c2d7dd521ed0bc5a3
      Error message - codes: https://gist.github.com/HyukjinKwon/b1b2c7f65865444c4a8836435100e398
      
      **Before**
      
      Benchmark:
        - Results: https://gist.github.com/HyukjinKwon/4a291dab45542106301a0c1abcdca924
      
      Error message
        - Results: https://gist.github.com/HyukjinKwon/57b1916395794ce924faa32b14a3fe19
      
      **After**
      
      Benchmark
        - Results: https://gist.github.com/HyukjinKwon/21496feecc4a920e50c4e455f836266e
      
      Error message
        - Results: https://gist.github.com/HyukjinKwon/7a494e4557fe32a652ce1236e504a395
      
      Closes #17227
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      Author: David Gingrich <david@textio.com>
      
      Closes #18521 from HyukjinKwon/python-type-dispatch.
    • [MINOR][SPARK SUBMIT] Print out R file usage in spark-submit · 2b1e94b9
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Currently, running the shell below:
      
      ```bash
      $ ./bin/spark-submit tmp.R a b c
      ```
      
      with an R file, `tmp.R`, as below:
      
      ```r
      #!/usr/bin/env Rscript
      
      library(SparkR)
      sparkRSQL.init(sparkR.init(master = "local"))
      collect(createDataFrame(list(list(1))))
      print(commandArgs(trailingOnly = TRUE))
      ```
      
      works fine, as below:
      
      ```bash
        _1
      1  1
      [1] "a" "b" "c"
      ```
      
      However, R files are not mentioned in the usage message, as below:
      
      ```bash
      $ ./bin/spark-submit
      ```
      
      ```
      Usage: spark-submit [options] <app jar | python file> [app arguments]
      ...
      ```
      
      For `./bin/sparkR`, it looks fine as below:
      
      ```bash
      $ ./bin/sparkR tmp.R
      ```
      
      ```
      Running R applications through 'sparkR' is not supported as of Spark 2.0.
      Use ./bin/spark-submit <R file>
      ```
      
      Running the script below:
      
      ```bash
      $ ./bin/spark-submit
      ```
      
      **Before**
      
      ```
      Usage: spark-submit [options] <app jar | python file> [app arguments]
      ...
      ```
      
      **After**
      
      ```
      Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
      ...
      ```
      
      ## How was this patch tested?
      
      Manually tested.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18505 from HyukjinKwon/minor-doc-summit.
    • [MINOR] Add french stop word "les" · 8ca4ebef
      Thomas Decaux authored
      ## What changes were proposed in this pull request?
      
      Added "les" as a French stop word (plural of "le").
      
      Author: Thomas Decaux <ebuildy@gmail.com>
      
      Closes #18514 from ebuildy/patch-1.
  3. Jul 03, 2017
    • [SPARK-21264][PYTHON] Call cross join path in join without 'on' and with 'how' · a848d552
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Currently, `join` in PySpark throws an NPE when the join columns are missing but the join type is specified, as below:
      
      ```python
      spark.conf.set("spark.sql.crossJoin.enabled", "false")
      spark.range(1).join(spark.range(1), how="inner").show()
      ```
      
      ```
      Traceback (most recent call last):
      ...
      py4j.protocol.Py4JJavaError: An error occurred while calling o66.join.
      : java.lang.NullPointerException
      	at org.apache.spark.sql.Dataset.join(Dataset.scala:931)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      ...
      ```
      
      ```python
      spark.conf.set("spark.sql.crossJoin.enabled", "true")
      spark.range(1).join(spark.range(1), how="inner").show()
      ```
      
      ```
      ...
      py4j.protocol.Py4JJavaError: An error occurred while calling o84.join.
      : java.lang.NullPointerException
      	at org.apache.spark.sql.Dataset.join(Dataset.scala:931)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      ...
      ```
      
      This PR proposes following the Scala behavior, as below:
      
      ```scala
      scala> spark.conf.set("spark.sql.crossJoin.enabled", "false")
      
      scala> spark.range(1).join(spark.range(1), Seq.empty[String], "inner").show()
      ```
      
      ```
      org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
      Range (0, 1, step=1, splits=Some(8))
      and
      Range (0, 1, step=1, splits=Some(8))
      Join condition is missing or trivial.
      Use the CROSS JOIN syntax to allow cartesian products between these relations.;
      ...
      ```
      
      ```scala
      scala> spark.conf.set("spark.sql.crossJoin.enabled", "true")
      
      scala> spark.range(1).join(spark.range(1), Seq.empty[String], "inner").show()
      ```
      ```
      +---+---+
      | id| id|
      +---+---+
      |  0|  0|
      +---+---+
      ```
      
      **After**
      
      ```python
      spark.conf.set("spark.sql.crossJoin.enabled", "false")
      spark.range(1).join(spark.range(1), how="inner").show()
      ```
      
      ```
      Traceback (most recent call last):
      ...
      pyspark.sql.utils.AnalysisException: u'Detected cartesian product for INNER join between logical plans\nRange (0, 1, step=1, splits=Some(8))\nand\nRange (0, 1, step=1, splits=Some(8))\nJoin condition is missing or trivial.\nUse the CROSS JOIN syntax to allow cartesian products between these relations.;'
      ```
      
      ```python
      spark.conf.set("spark.sql.crossJoin.enabled", "true")
      spark.range(1).join(spark.range(1), how="inner").show()
      ```
      ```
      +---+---+
      | id| id|
      +---+---+
      |  0|  0|
      +---+---+
      ```
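
One way to picture the argument handling that avoids the NPE is the sketch below. It is an assumption about the shape of the fix, not the code from `python/pyspark/sql/dataframe.py`; `resolve_join_call` is a hypothetical helper that only reports which kind of call would be made.

```python
def resolve_join_call(on=None, how=None):
    """Decide how a join(other, on, how) call is forwarded, without touching the JVM."""
    if on is None and how is None:
        # Neither given: plain join(other), with no columns and no join type.
        return ("join(other)", None, None)
    if how is None:
        how = "inner"  # default join type once columns are given
    if on is None:
        on = []        # empty column list instead of None, like Seq.empty[String]
    return ("join(other, on, how)", on, how)

print(resolve_join_call(how="inner"))
```

Passing an empty column list rather than a null is what lets the analyzer, instead of a null dereference, decide whether the cartesian product is allowed.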
      
      ## How was this patch tested?
      
      Added tests in `python/pyspark/sql/tests.py`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18484 from HyukjinKwon/SPARK-21264.
    • [SPARK-21283][CORE] FileOutputStream should be created as append mode · 6657e00d
      liuxian authored
      ## What changes were proposed in this pull request?
      
      `FileAppender` is used to write the `stderr` and `stdout` files in `ExecutorRunner`. However, before `ErrorStream` is written into the `stderr` file, the header information has already been written into it. If the `FileOutputStream` is not created in append mode, the header information is lost.
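
The effect is easy to reproduce outside Spark. The sketch below uses Python file modes as a stand-in for `FileOutputStream` with and without the append flag; `write_log` is a hypothetical helper, and the header and body strings are made up for the example.

```python
import os
import tempfile

def write_log(path, header, body, append):
    # First writer: the launcher puts the header into the file.
    with open(path, "w") as f:
        f.write(header)
    # Second writer: the appender opens the same file afterwards.
    mode = "a" if append else "w"
    with open(path, mode) as f:
        f.write(body)
    with open(path) as f:
        return f.read()

log_path = os.path.join(tempfile.mkdtemp(), "stderr")
print(repr(write_log(log_path, "HEADER\n", "error line\n", append=False)))
print(repr(write_log(log_path, "HEADER\n", "error line\n", append=True)))
```

Opening in truncating mode discards whatever the first writer produced, which is the lost-header symptom described above.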
      
      ## How was this patch tested?
      unit test case
      
      Author: liuxian <liu.xian3@zte.com.cn>
      
      Closes #18507 from 10110346/wip-lx-0703.
    • [TEST] Different behaviors of SparkContext Conf when building SparkSession · c79c10eb
      gatorsmile authored
      ## What changes were proposed in this pull request?
      If the created ACTIVE sparkContext is not EXPLICITLY passed through the Builder's API `sparkContext()`, the conf of this sparkContext will also contain the conf set through the API `config()`; otherwise, the conf of this sparkContext will NOT contain the conf set through the API `config()`
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18517 from gatorsmile/fixTestCase2.
    • [SPARK-21284][SQL] rename SessionCatalog.registerFunction parameter name · f953ca56
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Looking at the code in `SessionCatalog.registerFunction`, the parameter `ignoreIfExists` is misnamed. When `ignoreIfExists` is true, we override the function if it already exists, so `overrideIfExists` is the correct name.
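
The semantics behind the rename can be shown with a small registry. This is a hedged sketch, not `SessionCatalog`: `FunctionRegistry` and its methods are hypothetical, and raising `ValueError` when the function exists and the flag is false is an assumption about the non-override path.

```python
class FunctionRegistry:
    """Toy function registry keyed by function name."""

    def __init__(self):
        self._functions = {}

    def register(self, name, builder, override_if_exists):
        if name in self._functions and not override_if_exists:
            # Assumed behavior: refuse to clobber an existing function.
            raise ValueError("Function %s already exists" % name)
        # With override_if_exists=True an existing entry is replaced, not ignored.
        self._functions[name] = builder

    def lookup(self, name):
        return self._functions[name]

registry = FunctionRegistry()
registry.register("plus_one", lambda x: x + 1, override_if_exists=False)
registry.register("plus_one", lambda x: x + 10, override_if_exists=True)
print(registry.lookup("plus_one")(1))
```

A flag called "ignore" would suggest the second registration is silently dropped, which is the opposite of what happens here.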
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18510 from cloud-fan/minor.
    • [SPARK-20073][SQL] Prints an explicit warning message in case of NULL-safe equals · 363bfe30
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR adds code to print the same warning message as in the `===` case when using NULL-safe equals (`<=>`).
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #18436 from maropu/SPARK-20073.
    • [SPARK-21102][SQL] Refresh command is too aggressive in parsing · 17bdc36e
      aokolnychyi authored
      ### Idea
      
      This PR adds validation to REFRESH sql statements. Currently, users can specify whatever they want as resource path. For example, spark.sql("REFRESH ! $ !") will be executed without any exceptions.
      
      ### Implementation
      
      I am not sure that my current implementation is optimal, so any feedback is appreciated. My first idea was to make the grammar as strict as possible. Unfortunately, there were some problems. I tried the approach below:
      
      SqlBase.g4
      ```
      ...
          | REFRESH TABLE tableIdentifier                                    #refreshTable
          | REFRESH resourcePath                                             #refreshResource
      ...
      
      resourcePath
          : STRING
          | (IDENTIFIER | number | nonReserved | '/' | '-')+ // other symbols can be added if needed
          ;
      ```
      It is not flexible enough and requires explicitly listing all possible symbols. Therefore, I came up with the current approach that is implemented in the code.
      
      Let me know your opinion on which one is better.
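
For comparison, validation can also be done after parsing rather than in the grammar. The sketch below shows one possible rule and is not necessarily what this PR implements: `validate_refresh_path` is a hypothetical function, and rejecting empty or whitespace-containing unquoted paths is an assumption.

```python
def validate_refresh_path(statement):
    """Extract the resource path from a REFRESH statement, rejecting suspicious input.

    REFRESH TABLE is assumed to be handled by a separate rule and is not covered here.
    """
    keyword = "REFRESH"
    if not statement.upper().startswith(keyword):
        raise ValueError("not a REFRESH statement")
    path = statement[len(keyword):].strip()
    # Quoted paths are taken as-is, minus the quotes.
    if len(path) >= 2 and path[0] == path[-1] and path[0] in "'\"":
        return path[1:-1]
    if not path:
        raise ValueError("resource path cannot be empty")
    if any(ch.isspace() for ch in path):
        raise ValueError("unquoted resource path cannot contain whitespace")
    return path

print(validate_refresh_path("REFRESH /some/path"))
```

Under this rule the `REFRESH ! $ !` example is rejected because of the embedded spaces, while no list of allowed symbols has to be maintained.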
      
      Author: aokolnychyi <anton.okolnychyi@sap.com>
      
      Closes #18368 from aokolnychyi/spark-21102.
    • [TEST] Load test table based on case sensitivity · eb7a5a66
      Zhenhua Wang authored
      ## What changes were proposed in this pull request?
      
      It is strange that we get a "table not found" error if **the first SQL statement** uses upper-case table names when developers write tests with `TestHiveSingleton`, **even though resolution is case-insensitive**. This is because in `TestHiveQueryExecution`, test tables are loaded based on exact matching instead of respecting the case-sensitivity setting.
      
      ## How was this patch tested?
      
      Added a new test case.
      
      Author: Zhenhua Wang <wzh_zju@163.com>
      
      Closes #18504 from wzhfy/testHive.
    • [SPARK-21137][CORE] Spark reads many small files slowly · a9339db9
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Parallelize `FileInputFormat.listStatus` in the Hadoop API via `LIST_STATUS_NUM_THREADS` to speed up examination of file sizes for `wholeTextFiles` et al.
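
The general technique, listing several directories concurrently instead of one after another, can be sketched with a thread pool. This is an analogy in Python, not the Hadoop code path: `list_sizes` is a hypothetical helper and `num_threads` merely plays the role of the thread-count setting.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def list_sizes(directories, num_threads):
    """Return {file path: size in bytes} for the files directly inside each directory."""

    def list_one(directory):
        return [(e.path, e.stat().st_size) for e in os.scandir(directory) if e.is_file()]

    # Each directory listing is an independent I/O-bound task, so threads overlap the waits.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        chunks = list(pool.map(list_one, directories))
    return {path: size for chunk in chunks for path, size in chunk}

root = tempfile.mkdtemp()
dirs = []
for i in range(3):
    d = os.path.join(root, "part%d" % i)
    os.mkdir(d)
    with open(os.path.join(d, "data.txt"), "w") as f:
        f.write("x" * (i + 1))
    dirs.append(d)
print(sorted(list_sizes(dirs, num_threads=2).values()))
```

The gain depends on how slow each listing call is, so it matters most for many small files on a remote file system.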
      
      ## How was this patch tested?
      
      Existing tests, which will exercise the key path here: using a local file system.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #18441 from srowen/SPARK-21137.
    • [SPARK-21250][WEB-UI] Add a url in the table of 'Running Executors' in worker page to visit job page. · d913db16
      guoxiaolong authored
      
      ## What changes were proposed in this pull request?
      
      Add a url in the table of 'Running Executors' in worker page to visit job page.
      
      When I click URL of 'Name', the current page jumps to the job page. Of course this is only in the table of 'Running Executors'.
      
      In the 'Finished Executors' table, the 'Name' URL does not exist, so clicking it does not jump to any page.
      
      fix before:
      ![1](https://user-images.githubusercontent.com/26266482/27679397-30ddc262-5ceb-11e7-839b-0889d1f42480.png)
      
      fix after:
      ![2](https://user-images.githubusercontent.com/26266482/27679405-3588ef12-5ceb-11e7-9756-0a93815cd698.png)
      
      ## How was this patch tested?
      manual tests
      
      Author: guoxiaolong <guo.xiaolong1@zte.com.cn>
      
      Closes #18464 from guoxiaolongzte/SPARK-21250.
  4. Jul 02, 2017
    • [SPARK-18004][SQL] Make sure the date or timestamp related predicate can be pushed down to Oracle correctly · d4107196
      Rui Zha authored
      
      ## What changes were proposed in this pull request?
      
      Move the `compileValue` method from JDBCRDD to JdbcDialect, and override `compileValue` in OracleDialect to rewrite the Oracle-specific timestamp and date literals in the WHERE clause.
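
The dialect-override pattern can be sketched as follows. The classes below are hypothetical Python stand-ins, not Spark's Scala classes, and the use of the JDBC escape forms `{ts '...'}` and `{d '...'}` for Oracle is an assumption made for the example.

```python
from datetime import date, datetime

class JdbcDialect:
    """Base dialect: renders a Python value as a SQL literal."""

    def compile_value(self, value):
        if isinstance(value, str):
            return "'" + value.replace("'", "''") + "'"
        if isinstance(value, (date, datetime)):
            return "'" + str(value) + "'"
        return str(value)

class OracleDialect(JdbcDialect):
    """Overrides only the date and timestamp cases and defers the rest to the base."""

    def compile_value(self, value):
        # datetime is a subclass of date, so it has to be checked first.
        if isinstance(value, datetime):
            return "{ts '" + str(value) + "'}"
        if isinstance(value, date):
            return "{d '" + str(value) + "'}"
        return super().compile_value(value)

when = datetime(2017, 7, 2, 12, 0, 0)
print("WHERE t >= " + JdbcDialect().compile_value(when))
print("WHERE t >= " + OracleDialect().compile_value(when))
```

Keeping the literal rendering on the dialect means other databases can customise it the same way without touching the shared predicate-compilation code.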
      
      ## How was this patch tested?
      
      An integration test has been added.
      
      Author: Rui Zha <zrdt713@gmail.com>
      Author: Zharui <zrdt713@gmail.com>
      
      Closes #18451 from SharpRay/extend-compileValue-to-dialects.
    • [SPARK-19852][PYSPARK][ML] Python StringIndexer supports 'keep' to handle invalid data · c19680be
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      This PR is to maintain API parity with changes made in SPARK-17498 to support a new option
      'keep' in StringIndexer to handle unseen labels or NULL values with PySpark.
      
      Note: This is an updated version of #17237; the primary author of this PR is VinceShieh.
      ## How was this patch tested?
      Unit tests.
      
      Author: VinceShieh <vincent.xie@intel.com>
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #18453 from yanboliang/spark-19852.
    • [SPARK-21260][SQL][MINOR] Remove the unused OutputFakerExec · c605fee0
      Xingbo Jiang authored
      ## What changes were proposed in this pull request?
      
      OutputFakerExec was added long ago and is not used anywhere now so we should remove it.
      
      ## How was this patch tested?
      N/A
      
      Author: Xingbo Jiang <xingbo.jiang@databricks.com>
      
      Closes #18473 from jiangxb1987/OutputFakerExec.
  5. Jul 01, 2017
    • [SPARK-21170][CORE] Utils.tryWithSafeFinallyAndFailureCallbacks throws IllegalArgumentException: Self-suppression not permitted · 6beca9ce
      Devaraj K authored
      
      ## What changes were proposed in this pull request?
      
      Do not add the exception to the suppressed list if it is the same instance as `originalThrowable`.
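
The guard can be illustrated with a Python imitation of Java's suppressed-exception mechanism. Everything here is a hypothetical stand-in: `JavaLikeError` mimics the rule that a throwable cannot suppress itself (raising `ValueError` in place of `IllegalArgumentException`), and `try_with_failure_callback` is a much-reduced analogue of the Spark utility.

```python
class JavaLikeError(Exception):
    """Exception with a suppressed list that rejects self-suppression."""

    def __init__(self, message):
        super().__init__(message)
        self.suppressed = []

    def add_suppressed(self, other):
        if other is self:
            raise ValueError("Self-suppression not permitted")
        self.suppressed.append(other)

def try_with_failure_callback(block, on_failure):
    try:
        return block()
    except JavaLikeError as original:
        try:
            on_failure(original)
        except JavaLikeError as from_callback:
            # The guard: only record the callback's error if it is a different instance.
            if from_callback is not original:
                original.add_suppressed(from_callback)
        raise original

error = JavaLikeError("task failed")

def failing_block():
    raise error

def rethrowing_callback(e):
    raise e  # the callback surfaces the very same instance again

try:
    try_with_failure_callback(failing_block, rethrowing_callback)
except JavaLikeError as e:
    print(e, e.suppressed)
```

Without the identity check, the callback rethrowing the original error would turn a task failure into the unrelated self-suppression error.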
      
      ## How was this patch tested?
      
      Added new tests to verify this. These tests fail without the source code changes and pass with them.
      
      Author: Devaraj K <devaraj@apache.org>
      
      Closes #18384 from devaraj-kavali/SPARK-21170.
    • [SPARK-18518][ML] HasSolver supports override · e0b047ea
      Ruifeng Zheng authored
      ## What changes were proposed in this pull request?
      1, make param support non-final with `finalFields` option
      2, generate `HasSolver` with `finalFields = false`
      3, override `solver` in LiR, GLR, and make MLPC inherit `HasSolver`
      
      ## How was this patch tested?
      existing tests
      
      Author: Ruifeng Zheng <ruifengz@foxmail.com>
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #16028 from zhengruifeng/param_non_final.
    • [SPARK-21275][ML] Update GLM test to use supportedFamilyNames · 37ef32e5
      actuaryzhang authored
      ## What changes were proposed in this pull request?
      Update GLM test to use supportedFamilyNames as suggested here:
      https://github.com/apache/spark/pull/16699#discussion-diff-100574976R855
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      
      Closes #18495 from actuaryzhang/mlGlmTest2.
  6. Jun 30, 2017
    • [SPARK-21273][SQL] Propagate logical plan stats using visitor pattern and mixin · b1d719e7
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      We currently implement statistics propagation directly in logical plan. Given we already have two different implementations, it'd make sense to actually decouple the two and add stats propagation using mixin. This would reduce the coupling between logical plan and statistics handling.
      
      This can also be a powerful pattern in the future to add additional properties (e.g. constraints).
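
The decoupling can be sketched in a few lines. The classes below are hypothetical and far simpler than Spark's plan nodes; the row-count rules (a filter keeps its child's count as an upper bound, a union adds its children) are assumptions chosen only to make the example concrete.

```python
class StatsMixin:
    """Adds a stats entry point to plan nodes without putting the rules inside them."""

    def row_count(self):
        return RowCountVisitor().visit(self)

class LogicalPlan(StatsMixin):
    def __init__(self, *children):
        self.children = children

class Relation(LogicalPlan):
    def __init__(self, rows):
        super().__init__()
        self.rows = rows

class Filter(LogicalPlan):
    pass

class Union(LogicalPlan):
    pass

class RowCountVisitor:
    """All estimation rules live here, one method per node type."""

    def visit(self, plan):
        method = getattr(self, "visit_" + type(plan).__name__.lower())
        return method(plan)

    def visit_relation(self, plan):
        return plan.rows

    def visit_filter(self, plan):
        return self.visit(plan.children[0])  # upper bound: a filter never adds rows

    def visit_union(self, plan):
        return sum(self.visit(child) for child in plan.children)

plan = Union(Filter(Relation(10)), Relation(5))
print(plan.row_count())
```

A second estimator can be added as another visitor class without editing any plan node, which is the reduced coupling the description refers to.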
      
      ## How was this patch tested?
      Should be covered by existing test cases.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #18479 from rxin/stats-trait.
    • [SPARK-21127][SQL] Update statistics after data changing commands · 61b5df56
      wangzhenhua authored
      ## What changes were proposed in this pull request?
      
      Update stats after the following data changing commands:
      
      - InsertIntoHadoopFsRelationCommand
      - InsertIntoHiveTable
      - LoadDataCommand
      - TruncateTableCommand
      - AlterTableSetLocationCommand
      - AlterTableDropPartitionCommand
      
      ## How was this patch tested?
      Added new test cases.
      
      Author: wangzhenhua <wangzhenhua@huawei.com>
      Author: Zhenhua Wang <wzh_zju@163.com>
      
      Closes #18334 from wzhfy/changeStatsForOperation.
    • [SPARK-17528][SQL] data should be copied properly before saving into InternalRow · 4eb41879
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      For performance reasons, `UnsafeRow.getString`, `getStruct`, etc. return a "pointer" that points to a memory region of this unsafe row. This makes the unsafe projection a little dangerous, because all of its output rows share one instance.
      
      When we implement SQL operators, we should be careful not to cache the input rows, because they may be produced by an unsafe projection in a child operator, and thus their content may change over time.
      
      However, when we update the values of an InternalRow (e.g. in mutable projection and safe projection), we only copy UTF8String; we should also copy InternalRow, ArrayData and MapData. This PR fixes this, and also fixes the copy of various InternalRow, ArrayData and MapData implementations.
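
The hazard described above can be reproduced in a few lines of Python. This is only an analogy, not Spark code: `produce` is a hypothetical generator that reuses one row buffer (standing in for an unsafe projection), and the nested list stands in for nested data such as a struct, array or map.

```python
import copy


def produce(n):
    """Yield n rows from a single reused buffer (illustrative stand-in)."""
    nested = [0]          # nested value, reused across rows
    row = [0, nested]     # the one shared output row
    for i in range(n):
        row[0] = i
        nested[0] = i * 10
        yield row


# Caching references: every cached entry is the same object.
by_reference = [r for r in produce(3)]
print(by_reference)  # [[2, [20]], [2, [20]], [2, [20]]]

# Copying only the top level: the nested value is still shared.
shallow = [list(r) for r in produce(3)]
print(shallow)       # [[0, [20]], [1, [20]], [2, [20]]]

# Copying nested data as well gives independent rows.
deep = [copy.deepcopy(r) for r in produce(3)]
print(deep)          # [[0, [0]], [1, [10]], [2, [20]]]
```

The shallow case corresponds to copying only the top-level values, and the deep case corresponds to also copying the nested row, array and map values before saving them.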
      
      ## How was this patch tested?
      
      new regression tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18483 from cloud-fan/fix-copy.
      4eb41879
    • Liang-Chi Hsieh's avatar
      [SPARK-21052][SQL][FOLLOW-UP] Add hash map metrics to join · fd132552
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      Remove `numHashCollisions` from `BytesToBytesMap`, and change `getAverageProbesPerLookup()` to `getAverageProbesPerLookup` as suggested.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #18480 from viirya/SPARK-21052-followup.
      fd132552
    • Xiao Li's avatar
      [SPARK-21129][SQL] Arguments of SQL function call should not be named expressions · eed9c4ef
      Xiao Li authored
      ### What changes were proposed in this pull request?
      
      Function arguments should not be named expressions. Allowing them could cause two issues:
      - Misleading error message
      - Unexpected query results when the column name is `distinct`, which is not a reserved word in our parser.
      
      ```
      spark-sql> select count(distinct c1, distinct c2) from t1;
      Error in query: cannot resolve '`distinct`' given input columns: [c1, c2]; line 1 pos 26;
      'Project [unresolvedalias('count(c1#30, 'distinct), None)]
      +- SubqueryAlias t1
         +- CatalogRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#30, c2#31]
      ```
      
      After the fix, the error message becomes
      ```
      spark-sql> select count(distinct c1, distinct c2) from t1;
      Error in query:
      extraneous input 'c2' expecting {')', ',', '.', '[', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 'IS', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '||', '^'}(line 1, pos 35)
      
      == SQL ==
      select count(distinct c1, distinct c2) from t1
      -----------------------------------^^^
      ```
      
      ### How was this patch tested?
      Added a test case to parser suite.
      
      Author: Xiao Li <gatorsmile@gmail.com>
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18338 from gatorsmile/parserDistinctAggFunc.
      eed9c4ef
    • 曾林西's avatar
      [SPARK-21223] Change fileToAppInfo in FsHistoryProvider to fix concurrent issue. · 1fe08d62
      曾林西 authored
      # What issue does this PR address?
      Jira: https://issues.apache.org/jira/browse/SPARK-21223
      Fix the thread-safety issue in FsHistoryProvider.
      Currently, the Spark HistoryServer uses a HashMap named fileToAppInfo in class FsHistoryProvider to store the mapping from event log path to attemptInfo.
      When a thread pool is used to replay the log files in the list and merge the list of old applications with new ones, multiple threads may update fileToAppInfo at the same time. This may cause thread-safety issues, such as falling into an infinite loop when the hash table's resize function is called.
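
As a rough sketch of the general remedy (guarding a map that several worker threads write to), the Python below wraps a dict in a lock and fills it from a thread pool. It is an analogy only: the description above does not say which data structure the patch switched to, and `SynchronizedMap`, `replay` and the `eventlog-*` paths are hypothetical names. Python's dict also does not exhibit the resize loop described for the JVM `HashMap`; the point is just that all updates go through one synchronized path.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SynchronizedMap:
    """Dict wrapper whose reads and writes all take the same lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)

    def __len__(self):
        with self._lock:
            return len(self._data)


def replay(path):
    """Stand-in for parsing one event log into attempt info."""
    return f"attempt-info-for-{path}"


file_to_app_info = SynchronizedMap()


def replay_and_store(path):
    file_to_app_info.put(path, replay(path))


paths = [f"eventlog-{i}" for i in range(100)]
with ThreadPoolExecutor(max_workers=8) as pool:
    # Consume the iterator so any worker exception is raised here.
    list(pool.map(replay_and_store, paths))

print(len(file_to_app_info))  # 100
```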
      
      Author: 曾林西 <zenglinxi@meituan.com>
      
      Closes #18430 from zenglinxi0615/master.
      1fe08d62
    • Yanbo Liang's avatar
      [ML] Fix scala-2.10 build failure of GeneralizedLinearRegressionSuite. · 528c9281
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Fix scala-2.10 build failure of `GeneralizedLinearRegressionSuite`.
      
      ## How was this patch tested?
      Build with scala-2.10.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #18489 from yanboliang/glr.
      528c9281
    • Xingbo Jiang's avatar
      [SPARK-18294][CORE] Implement commit protocol to support `mapred` package's committer · 3c2fc19d
      Xingbo Jiang authored
      ## What changes were proposed in this pull request?
      
      This PR makes the following changes:
      
      - Implement a new commit protocol `HadoopMapRedCommitProtocol`, which supports the old `mapred` package's committer;
      - Refactor SparkHadoopWriter and SparkHadoopMapReduceWriter: they are now combined, so the new SparkHadoopWriter supports writing through both the mapred and mapreduce APIs, and a lot of duplicated code is removed.
      
      After this change, it should be pretty easy for us to support the committer from both the new and the old Hadoop API at a high level.
      
      ## How was this patch tested?
      No major behavior change, passed the existing test cases.
      
      Author: Xingbo Jiang <xingbo.jiang@databricks.com>
      
      Closes #18438 from jiangxb1987/SparkHadoopWriter.
      3c2fc19d