  1. Jul 06, 2017
    • [SPARK-21228][SQL] InSet incorrect handling of structs · 26ac085d
      Bogdan Raducanu authored
      ## What changes were proposed in this pull request?
      When the data type is a struct, InSet now uses TypeUtils.getInterpretedOrdering (similar to EqualTo) to build a TreeSet. In other cases it uses a HashSet as before (which should be faster). Similarly, In.eval uses Ordering.equiv instead of equals.
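      A minimal way to exercise this path from SQL, assuming a running `SparkSession` named `spark` (the struct values below are illustrative, not taken from the PR's test):
      ```scala
      // IN over struct values: after this change In/InSet compare structs with the
      // interpreted ordering (as EqualTo does) instead of Java equality on the rows.
      val df = spark.range(3).selectExpr("named_struct('a', id, 'b', id + 1L) AS s")
      df.where("s IN (named_struct('a', 0L, 'b', 1L), named_struct('a', 2L, 'b', 3L))").show()
      // Expected to keep the rows where s = {0, 1} or s = {2, 3}.
      ```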
      
      ## How was this patch tested?
      New test in SQLQuerySuite.
      
      Author: Bogdan Raducanu <bogdan@databricks.com>
      
      Closes #18455 from bogdanrdc/SPARK-21228.
      26ac085d
    • [SPARK-20950][CORE] add a new config to diskWriteBufferSize which is hard coded before · 565e7a8d
      caoxuewen authored
      ## What changes were proposed in this pull request?
      
      This PR makes two improvements:
      1. Make the diskWriteBufferSize of ShuffleExternalSorter configurable via spark.shuffle.spill.diskWriteBufferSize (a config sketch follows this list).
          Varying diskWriteBufferSize while testing `forceSorterToSpill` gives the following average times over 10 runs (in ms):
      ```
      diskWriteBufferSize:       1M    512K    256K    128K    64K    32K    16K    8K    4K
      ---------------------------------------------------------------------------------------
      RecordSize = 2.5M          742   722     694     686     667    668    671    669   683
      RecordSize = 1M            294   293     292     287     283    285    281    279   285
      ```
      
      2. Remove outputBufferSizeInBytes and inputBufferSizeInBytes as fields and initialize them inside the mergeSpillsWithFileStream function.
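      As a rough sketch of the first change (the config key comes from this PR; the byte-size string form of the value is assumed to follow Spark's usual convention):
      ```scala
      import org.apache.spark.SparkConf

      // Shrink the shuffle spill disk write buffer from its previously hard-coded size.
      val conf = new SparkConf()
        .setAppName("disk-write-buffer-demo")
        .set("spark.shuffle.spill.diskWriteBufferSize", "64k")
      ```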
      
      ## How was this patch tested?
      The unit test.
      
      Author: caoxuewen <cao.xuewen@zte.com.cn>
      
      Closes #18174 from heary-cao/buffersize.
      565e7a8d
    • [SPARK-21273][SQL][FOLLOW-UP] Add missing test cases back and revise code style · d540dfbf
      Wang Gengliang authored
      ## What changes were proposed in this pull request?
      
      Add missing test cases back and revise code style
      
      Follow up the previous PR: https://github.com/apache/spark/pull/18479
      
      ## How was this patch tested?
      
      Unit test
      
      Author: Wang Gengliang <ltnwgl@gmail.com>
      
      Closes #18548 from gengliangwang/stat_propagation_revise.
      d540dfbf
    • [SPARK-21324][TEST] Improve statistics test suites · b8e4d567
      wangzhenhua authored
      ## What changes were proposed in this pull request?
      
      1. move `StatisticsCollectionTestBase` to a separate file.
      2. move some test cases to `StatisticsCollectionSuite` so that `hive/StatisticsSuite` only keeps tests that need hive support.
      3. clean up some test cases.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: wangzhenhua <wangzhenhua@huawei.com>
      Author: Zhenhua Wang <wzh_zju@163.com>
      
      Closes #18545 from wzhfy/cleanStatSuites.
      b8e4d567
    • [SPARK-20703][SQL] Associate metrics with data writes onto DataFrameWriter operations · 6ff05a66
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      Right now in the UI, after SPARK-20213, we can show the operations that write data out. However, there is no way to associate metrics with those writes. We should show the relevant metrics for these operations.
      
      #### Supported commands
      
      This change supports updating metrics for file-based data writing operations, including `InsertIntoHadoopFsRelationCommand` and `InsertIntoHiveTable`.
      
      Supported metrics:
      
      * number of written files
      * number of dynamic partitions
      * total bytes of written data
      * total number of output rows
      * average writing data out time (ms)
      * (TODO) min/med/max number of output rows per file/partition
      * (TODO) min/med/max bytes of written data per file/partition
      
      ####  Commands not supported
      
      `InsertIntoDataSourceCommand`, `SaveIntoDataSourceCommand`:
      
      These two commands use DataSource APIs to write data out, i.e., the logic of writing data out is delegated to the DataSource implementations, such as `InsertableRelation.insert` and `CreatableRelationProvider.createRelation`, so we can't obtain metrics from the delegated methods for now.
      
      `CreateHiveTableAsSelectCommand`, `CreateDataSourceTableAsSelectCommand` :
      
      These two commands invoke other commands to write data out, and the invoked commands can even write to non-file-based data sources. We leave them as future TODOs.
      
      #### How to update metrics of writing files out
      
      A `RunnableCommand` that wants to update metrics needs to override its `metrics` and provide the metrics data structure to `ExecutedCommandExec`.
      
      The metrics are prepared during the execution of `FileFormatWriter`. The callback function passed to `FileFormatWriter` accepts the metrics and updates them accordingly.
      
      `RunnableCommand` provides a metrics-updating function. At runtime, this function is bound to the Spark context and the `metrics` of `ExecutedCommandExec`, and is passed to `FileFormatWriter`.
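      For example, a file-based write like the one below (the path is an arbitrary example) goes through `InsertIntoHadoopFsRelationCommand`, so its write node in the SQL tab should now display these metrics:
      ```scala
      // Write Parquet files through the DataFrameWriter; the SQL tab should show
      // written files / dynamic partitions / bytes / output rows / writing time.
      spark.range(1000).write.mode("overwrite").parquet("/tmp/write-metrics-demo")
      ```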
      
      ## How was this patch tested?
      
      Updated unit tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #18159 from viirya/SPARK-20703-2.
      6ff05a66
    • [SPARK-21012][SUBMIT] Add glob support for resources adding to Spark · 5800144a
      jerryshao authored
      Currently "--jars (spark.jars)", "--files (spark.files)", "--py-files (spark.submit.pyFiles)" and "--archives (spark.yarn.dist.archives)" only support non-glob paths. This is fine for most cases, but when a user needs to add many jars or files to Spark, listing them one by one is too verbose. This PR proposes adding glob path support for these resources.
      
      It also improves the code that downloads resources.
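      A hedged sketch of what this enables (all paths below are made up for illustration):
      ```scala
      // spark-submit can now accept glob patterns for these options, e.g.:
      //   ./bin/spark-submit --jars "hdfs:///deps/*.jar" --files "/etc/myapp/*.conf" app.jar
      // The same applies when setting the corresponding configuration keys directly:
      import org.apache.spark.SparkConf

      val conf = new SparkConf().set("spark.jars", "/opt/myapp/lib/*.jar")
      ```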
      
      ## How was this patch tested?
      
      UT added, also verified manually in local cluster.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #18235 from jerryshao/SPARK-21012.
      5800144a
    • [SS][MINOR] Fix flaky test in DatastreamReaderWriterSuite. temp checkpoint dir should be deleted · 60043f22
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Stopping a query while it is being initialized can throw an interrupt exception, in which case temporary checkpoint directories will not be deleted and the test will fail.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #18442 from tdas/DatastreamReaderWriterSuite-fix.
      60043f22
    • [SPARK-21312][SQL] correct offsetInBytes in UnsafeRow.writeToStream · 14a3bb3a
      Sumedh Wale authored
      ## What changes were proposed in this pull request?
      
      Corrects the offsetInBytes calculation in UnsafeRow.writeToStream. Known failures include writes to some DataSources that have their own SparkPlan implementations and cause an EXCHANGE during writes.
      
      ## How was this patch tested?
      
      Extended UnsafeRowSuite.writeToStream to include an UnsafeRow over byte array having non-zero offset.
      
      Author: Sumedh Wale <swale@snappydata.io>
      
      Closes #18535 from sumwale/SPARK-21312.
      14a3bb3a
    • [SPARK-21308][SQL] Remove SQLConf parameters from the optimizer · 75b168fd
      gatorsmile authored
      ### What changes were proposed in this pull request?
      This PR removes SQLConf parameters from the optimizer rules
      
      ### How was this patch tested?
      The existing test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18533 from gatorsmile/rmSQLConfOptimizer.
      75b168fd
  2. Jul 05, 2017
    • [SPARK-21248][SS] The clean up codes in StreamExecution should not be interrupted · ab866f11
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR uses `runUninterruptibly` to prevent the clean-up code in StreamExecution from being interrupted. It also removes an optimization in `runUninterruptibly` to make sure this method never throws `InterruptedException`.
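      A rough sketch of the pattern (the cleanup body is a placeholder, and the sketch sits under org.apache.spark only because `UninterruptibleThread` is internal to Spark):
      ```scala
      package org.apache.spark.demo

      import org.apache.spark.util.UninterruptibleThread

      object UninterruptibleCleanupSketch {
        def main(args: Array[String]): Unit = {
          val worker = new UninterruptibleThread("stream-execution-sketch") {
            override def run(): Unit = {
              try {
                // ... the main streaming loop would go here ...
              } finally {
                // runUninterruptibly defers any pending interrupt until the block
                // finishes, so the cleanup cannot be aborted halfway through.
                runUninterruptibly {
                  // delete temp checkpoint dirs, stop sources, close the sink, ...
                }
              }
            }
          }
          worker.start()
          worker.join()
        }
      }
      ```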
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #18461 from zsxwing/SPARK-21248.
      ab866f11
    • [SPARK-21278][PYSPARK] Upgrade to Py4J 0.10.6 · c8d0aba1
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR aims to bump Py4J in order to fix the following float/double bug.
      Py4J 0.10.5 fixes this (https://github.com/bartdag/py4j/issues/272) and the latest Py4J is 0.10.6.
      
      **BEFORE**
      ```
      >>> df = spark.range(1)
      >>> df.select(df['id'] + 17.133574204226083).show()
      +--------------------+
      |(id + 17.1335742042)|
      +--------------------+
      |       17.1335742042|
      +--------------------+
      ```
      
      **AFTER**
      ```
      >>> df = spark.range(1)
      >>> df.select(df['id'] + 17.133574204226083).show()
      +-------------------------+
      |(id + 17.133574204226083)|
      +-------------------------+
      |       17.133574204226083|
      +-------------------------+
      ```
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #18546 from dongjoon-hyun/SPARK-21278.
      c8d0aba1
    • [SPARK-21307][SQL] Remove SQLConf parameters from the parser-related classes. · c8e7f445
      gatorsmile authored
      ### What changes were proposed in this pull request?
      This PR is to remove SQLConf parameters from the parser-related classes.
      
      ### How was this patch tested?
      The existing test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18531 from gatorsmile/rmSQLConfParser.
      c8e7f445
    • [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFunction Should Support UDAFs · 742da086
      Jeff Zhang authored
      ## What changes were proposed in this pull request?
      
      Support registering Java UDAFs in PySpark so that users can use Java UDAFs from PySpark. Besides that, this also adds an API to `UDFRegistration`.
      
      ## How was this patch tested?
      
      Unit test is added
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #17222 from zjffdu/SPARK-19439.
      742da086
    • [SPARK-20858][DOC][MINOR] Document ListenerBus event queue size · 960298ee
      sadikovi authored
      ## What changes were proposed in this pull request?
      
      This change adds a new configuration option `spark.scheduler.listenerbus.eventqueue.size` to the configuration docs to specify the capacity of the Spark listener bus event queue. The default value is 10000.
      
      This is doc PR for [SPARK-15703](https://issues.apache.org/jira/browse/SPARK-15703).
      
      I added the option to the `Scheduling` section, though it might be more relevant to the `Spark UI` section.
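      For instance (the value is arbitrary), the capacity could be raised when listener events are being dropped:
      ```scala
      import org.apache.spark.SparkConf

      // Double the default listener bus event queue capacity (default is 10000).
      val conf = new SparkConf().set("spark.scheduler.listenerbus.eventqueue.size", "20000")
      ```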
      
      ## How was this patch tested?
      
      Manually verified correct rendering of configuration option.
      
      Author: sadikovi <ivan.sadikov@lincolnuni.ac.nz>
      Author: Ivan Sadikov <ivan.sadikov@team.telstra.com>
      
      Closes #18476 from sadikovi/SPARK-20858.
      960298ee
    • [SPARK-21286][TEST] Modified StorageTabSuite unit test · e3e2b5da
      he.qiao authored
      ## What changes were proposed in this pull request?
      The old unit test was not effective.
      
      ## How was this patch tested?
      unit test
      
      Author: he.qiao <he.qiao17@zte.com.cn>
      
      Closes #18511 from Geek-He/dev_0703.
      e3e2b5da
    • [SPARK-20383][SQL] Supporting Create [temporary] Function with the keyword 'OR... · 5787ace4
      ouyangxiaochen authored
      [SPARK-20383][SQL] Supporting Create [temporary] Function with the keyword 'OR REPLACE' and 'IF NOT EXISTS'
      
      ## What changes were proposed in this pull request?
      
      Support creating a [temporary] function with the keywords 'OR REPLACE' and 'IF NOT EXISTS'.
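      A sketch of the new syntax via `spark.sql` (class, database, and JAR names are placeholders):
      ```scala
      // Replace an existing temporary function, or create it if it is absent.
      spark.sql("CREATE OR REPLACE TEMPORARY FUNCTION my_upper AS 'com.example.udf.MyUpper'")

      // Create a permanent function only if it does not exist yet.
      spark.sql("CREATE FUNCTION IF NOT EXISTS my_db.my_upper AS 'com.example.udf.MyUpper' USING JAR '/tmp/my-udfs.jar'")
      ```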
      
      ## How was this patch tested?
      manual test and added test cases
      
      Author: ouyangxiaochen <ou.yangxiaochen@zte.com.cn>
      
      Closes #17681 from ouyangxiaochen/spark-419.
      5787ace4
    • [SPARK-16167][SQL] RowEncoder should preserve array/map type nullability. · 873f3ad2
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      Currently `RowEncoder` doesn't preserve nullability of `ArrayType` or `MapType`.
      It always returns `containsNull = true` for `ArrayType` and `valueContainsNull = true` for `MapType`, and the nullability of the type itself is always `true`.
      
      This PR fixes their nullability.
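      A small sketch of the behavior in question (assuming a build that includes this fix):
      ```scala
      import org.apache.spark.sql.catalyst.encoders.RowEncoder
      import org.apache.spark.sql.types._

      // An array column declared with containsNull = false...
      val schema = new StructType()
        .add("xs", ArrayType(IntegerType, containsNull = false), nullable = false)

      // ...should keep that nullability when round-tripped through RowEncoder,
      // instead of being widened to containsNull = true as before.
      val encoder = RowEncoder(schema)
      println(encoder.schema.treeString)
      ```
      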
      ## How was this patch tested?
      
      Add tests to check if `RowEncoder` preserves array/map nullability.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #13873 from ueshin/issues/SPARK-16167.
      873f3ad2
    • [SPARK-21310][ML][PYSPARK] Expose offset in PySpark · 4852b7d4
      actuaryzhang authored
      ## What changes were proposed in this pull request?
      Add offset to PySpark in GLM as in #16699.
      
      ## How was this patch tested?
      Python test
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      
      Closes #18534 from actuaryzhang/pythonOffset.
      4852b7d4
    • [SPARK-18623][SQL] Add `returnNullable` to `StaticInvoke` and modify it to handle properly. · a3864325
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      Add `returnNullable` to `StaticInvoke`, the same as #15780 is trying to add to `Invoke`, and modify it to handle nullability properly.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #16056 from ueshin/issues/SPARK-18623.
      a3864325
    • [SPARK-21304][SQL] remove unnecessary isNull variable for collection related encoder expressions · f2c3b1dd
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      For these collection-related encoder expressions, we don't need to create an `isNull` variable if the loop element is not nullable.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18529 from cloud-fan/minor.
      f2c3b1dd
  3. Jul 04, 2017
    • [SPARK-20889][SPARKR][FOLLOWUP] Clean up grouped doc for column methods · e9a93f81
      actuaryzhang authored
      ## What changes were proposed in this pull request?
      Add doc for methods that were left out, and fix various style and consistency issues.
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      
      Closes #18493 from actuaryzhang/sparkRDocCleanup.
      e9a93f81
    • [SPARK-21300][SQL] ExternalMapToCatalyst should null-check map key prior to... · ce10545d
      Takuya UESHIN authored
      [SPARK-21300][SQL] ExternalMapToCatalyst should null-check map key prior to converting to internal value.
      
      ## What changes were proposed in this pull request?
      
      `ExternalMapToCatalyst` should null-check the map key prior to converting it to an internal value, so that it throws an appropriate exception instead of something like an NPE.
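      An illustrative repro sketch (assumes `spark.implicits._` is in scope; the exact exception message is not quoted from the PR):
      ```scala
      import spark.implicits._

      case class Rec(m: Map[String, Int])

      // A map with a null key: encoding it now fails with a descriptive
      // "map key cannot be null"-style RuntimeException rather than an NPE.
      val ds = Seq(Rec(Map((null: String) -> 1))).toDS()
      ds.collect()
      ```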
      
      ## How was this patch tested?
      
      Added a test and existing tests.
      
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #18524 from ueshin/issues/SPARK-21300.
      ce10545d
    • [SPARK-21295][SQL] Use qualified names in error message for missing references · de14086e
      gatorsmile authored
      ### What changes were proposed in this pull request?
      It is confusing to see the following error message when the column actually comes from another table.
      ```
      cannot resolve '`right.a`' given input columns: [a, c, d];
      ```
      
      After the PR, the error message looks like
      ```
      cannot resolve '`right.a`' given input columns: [left.a, right.c, right.d];
      ```
      
      ### How was this patch tested?
      Added a test case
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18520 from gatorsmile/removeSQLConf.
      de14086e
    • [MINOR][SPARKR] ignore Rplots.pdf test output after running R tests · daabf425
      wangmiao1981 authored
      ## What changes were proposed in this pull request?
      
      Running R tests in a local build produces Rplots.pdf. This file should be ignored in the git repository.
      
      Author: wangmiao1981 <wm624@hotmail.com>
      
      Closes #18518 from wangmiao1981/ignore.
      daabf425
    • [SPARK-20889][SPARKR] Grouped documentation for WINDOW column methods · cec39215
      actuaryzhang authored
      ## What changes were proposed in this pull request?
      
      Grouped documentation for column window methods.
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      
      Closes #18481 from actuaryzhang/sparkRDocWindow.
      cec39215
    • [SPARK-21268][MLLIB] Move center calculations to a distributed map in KMeans · 4d6d8192
      dardelet authored
      ## What changes were proposed in this pull request?
      
      The scal() call and the creation of the newCenter vector are done in the driver, after a collectAsMap operation, although they could be done in the distributed RDD.
      This PR moves this code before the collectAsMap for more efficiency.
      
      ## How was this patch tested?
      
      This was tested manually by running the KMeansExample and verifying that the new code ran without error and gave the same output as before.
      
      Author: dardelet <guillaumegorp@gmail.com>
      Author: Guillaume Dardelet <dardelet@users.noreply.github.com>
      
      Closes #18491 from dardelet/move-center-calculation-to-distributed-map-kmean.
      4d6d8192
    • [SPARK-20256][SQL] SessionState should be created more lazily · 1b50e0e0
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      `SessionState` is designed to be created lazily. However, in reality, it is created immediately in `SparkSession.Builder.getOrCreate` ([here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L943)).
      
      This PR aims to restore the lazy behavior by keeping the options in `initialSessionOptions`. The benefit is as follows: users can start `spark-shell` and use RDD operations without any problems.
      
      **BEFORE**
      ```scala
      $ bin/spark-shell
      java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder'
      ...
      Caused by: org.apache.spark.sql.AnalysisException:
          org.apache.hadoop.hive.ql.metadata.HiveException:
             MetaException(message:java.security.AccessControlException:
                Permission denied: user=spark, access=READ,
                   inode="/apps/hive/warehouse":hive:hdfs:drwx------
      ```
      As reported in SPARK-20256, this happens when the user is not allowed to access the warehouse directory.
      
      **AFTER**
      ```scala
      $ bin/spark-shell
      ...
      Welcome to
            ____              __
           / __/__  ___ _____/ /__
          _\ \/ _ \/ _ `/ __/  '_/
         /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
            /_/
      
      Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)
      Type in expressions to have them evaluated.
      Type :help for more information.
      
      scala> sc.range(0, 10, 1).count()
      res0: Long = 10
      ```
      
      ## How was this patch tested?
      
      Manual.
      
      This closes #18512 .
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #18501 from dongjoon-hyun/SPARK-20256.
      1b50e0e0
    • [SPARK-19726][SQL] Faild to insert null timestamp value to mysql using spark jdbc · a3c29fcb
      YIHAODIAN\wangshuangshuang authored
      ## What changes were proposed in this pull request?
      
      When creating a table like the following:
      > create table timestamp_test(id int(11), time_stamp timestamp not null default current_timestamp);
      
      The result of executing "insert into timestamp_test values (111, null)" is different between Spark (via JDBC) and MySQL.
      ```
      mysql> select * from timestamp_test;
      +------+---------------------+
      | id   | time_stamp          |
      +------+---------------------+
      |  111 | 1970-01-01 00:00:00 | -> spark
      |  111 | 2017-06-27 19:32:38 | -> mysql
      +------+---------------------+
      2 rows in set (0.00 sec)
      ```
         Because in such a case ```StructField.nullable``` is false, the generated code of ```InvokeLike``` and ```BoundReference``` doesn't check whether the field is null. Instead, it directly uses ```CodegenContext.INPUT_ROW.getLong(1)```; however, ```UnsafeRow.setNullAt(1)``` will put 0 in the underlying memory.
      
         The PR ```always``` sets ```StructField.nullable``` to true after obtaining metadata from the JDBC connection, since we can insert null into a NOT NULL timestamp column in MySQL. In this way, Spark propagates null to the underlying DB engine and lets the DB choose how to process NULL.
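      A write-path sketch (the connection URL and properties are placeholders; assumes `spark.implicits._` is imported):
      ```scala
      import java.sql.Timestamp
      import java.util.Properties

      // With nullable forced to true, the null timestamp is propagated to MySQL,
      // which then applies its own NOT NULL / default-value handling.
      val df = Seq((111, null.asInstanceOf[Timestamp])).toDF("id", "time_stamp")
      df.write
        .mode("append")
        .jdbc("jdbc:mysql://localhost:3306/test", "timestamp_test", new Properties())
      ```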
      
      ## How was this patch tested?
      
      Added tests.
      
      Author: YIHAODIAN\wangshuangshuang <wangshuangshuang@yihaodian.com>
      Author: Shuangshuang Wang <wsszone@gmail.com>
      
      Closes #18445 from shuangshuangwang/SPARK-19726.
      a3c29fcb
    • [SPARK-21256][SQL] Add withSQLConf to Catalyst Test · 29b1f6b0
      gatorsmile authored
      ### What changes were proposed in this pull request?
      SQLConf is moved to Catalyst. We are adding more and more test cases for verifying the conf-specific behaviors. It is nice to add a helper function to simplify the test cases.
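      Typical usage inside a test would look roughly like this (a sketch; it assumes the test class mixes in the trait that provides the helper):
      ```scala
      import org.apache.spark.sql.internal.SQLConf

      // Temporarily flip a conf for the enclosed assertions; the helper restores
      // the previous value afterwards, even if the block throws.
      withSQLConf(SQLConf.CASE_SENSITIVE.key -> "true") {
        // assertions that rely on case-sensitive resolution go here
      }
      ```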
      
      ### How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18469 from gatorsmile/withSQLConf.
      29b1f6b0
    • [SPARK-19507][SPARK-21296][PYTHON] Avoid per-record type dispatch in schema... · d492cc5a
      hyukjinkwon authored
      [SPARK-19507][SPARK-21296][PYTHON] Avoid per-record type dispatch in schema verification and improve exception message
      
      ## What changes were proposed in this pull request?
      **Context**
      
      While reviewing https://github.com/apache/spark/pull/17227, I realised that we type-dispatch per record here. That PR itself is fine in terms of performance as is, but it prints a prefix, `"obj"`, in the exception message as below:
      
      ```
      from pyspark.sql.types import *
      schema = StructType([StructField('s', IntegerType(), nullable=False)])
      spark.createDataFrame([["1"]], schema)
      ...
      TypeError: obj.s: IntegerType can not accept object '1' in type <type 'str'>
      ```
      
      I suggested getting rid of this, but while investigating it I realised my approach might bring a performance regression since it is a hot path.
      
      Addressing only SPARK-19507 and https://github.com/apache/spark/pull/17227 would need more changes to cleanly get rid of the prefix, so I decided to fix both issues together.
      
      **Proposal**
      
      This PR tried to
      
        - get rid of per-record type dispatch, as we do in many code paths in Scala, so that it improves performance (roughly ~25% improvement) - SPARK-21296
      
          This was tested with simple code: `spark.createDataFrame(range(1000000), "int")`. However, I am quite sure the actual improvement in practice is larger than this, in particular when the schema is complicated.
      
        - improve the exception's error message by describing field information as prose - SPARK-19507
      
      ## How was this patch tested?
      
      Manually tested and unit tests were added in `python/pyspark/sql/tests.py`.
      
      Benchmark - codes: https://gist.github.com/HyukjinKwon/c3397469c56cb26c2d7dd521ed0bc5a3
      Error message - codes: https://gist.github.com/HyukjinKwon/b1b2c7f65865444c4a8836435100e398
      
      **Before**
      
      Benchmark:
        - Results: https://gist.github.com/HyukjinKwon/4a291dab45542106301a0c1abcdca924
      
      Error message
        - Results: https://gist.github.com/HyukjinKwon/57b1916395794ce924faa32b14a3fe19
      
      **After**
      
      Benchmark
        - Results: https://gist.github.com/HyukjinKwon/21496feecc4a920e50c4e455f836266e
      
      Error message
        - Results: https://gist.github.com/HyukjinKwon/7a494e4557fe32a652ce1236e504a395
      
      Closes #17227
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      Author: David Gingrich <david@textio.com>
      
      Closes #18521 from HyukjinKwon/python-type-dispatch.
      d492cc5a
    • [MINOR][SPARK SUBMIT] Print out R file usage in spark-submit · 2b1e94b9
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Currently, running the shell below:
      
      ```bash
      $ ./bin/spark-submit tmp.R a b c
      ```
      
      with an R file, `tmp.R`, as below:
      
      ```r
      #!/usr/bin/env Rscript
      
      library(SparkR)
      sparkRSQL.init(sparkR.init(master = "local"))
      collect(createDataFrame(list(list(1))))
      print(commandArgs(trailingOnly = TRUE))
      ```
      
      works fine, as below:
      
      ```bash
        _1
      1  1
      [1] "a" "b" "c"
      ```
      
      However, R files are not mentioned in the usage documentation, as below:
      
      ```bash
      $ ./bin/spark-submit
      ```
      
      ```
      Usage: spark-submit [options] <app jar | python file> [app arguments]
      ...
      ```
      
      For `./bin/sparkR`, the message is fine, as below:
      
      ```bash
      $ ./bin/sparkR tmp.R
      ```
      
      ```
      Running R applications through 'sparkR' is not supported as of Spark 2.0.
      Use ./bin/spark-submit <R file>
      ```
      
      Running the script below:
      
      ```bash
      $ ./bin/spark-submit
      ```
      
      **Before**
      
      ```
      Usage: spark-submit [options] <app jar | python file> [app arguments]
      ...
      ```
      
      **After**
      
      ```
      Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
      ...
      ```
      
      ## How was this patch tested?
      
      Manually tested.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18505 from HyukjinKwon/minor-doc-summit.
      2b1e94b9
    • [MINOR] Add french stop word "les" · 8ca4ebef
      Thomas Decaux authored
      ## What changes were proposed in this pull request?
      
      Added "les" as a French stop word (plural of "le").
      
      Author: Thomas Decaux <ebuildy@gmail.com>
      
      Closes #18514 from ebuildy/patch-1.
      8ca4ebef
  4. Jul 03, 2017
    • [SPARK-21264][PYTHON] Call cross join path in join without 'on' and with 'how' · a848d552
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Currently, PySpark throws an NPE when the join columns ('on') are missing but a join type ('how') is specified, as below:
      
      ```python
      spark.conf.set("spark.sql.crossJoin.enabled", "false")
      spark.range(1).join(spark.range(1), how="inner").show()
      ```
      
      ```
      Traceback (most recent call last):
      ...
      py4j.protocol.Py4JJavaError: An error occurred while calling o66.join.
      : java.lang.NullPointerException
      	at org.apache.spark.sql.Dataset.join(Dataset.scala:931)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      ...
      ```
      
      ```python
      spark.conf.set("spark.sql.crossJoin.enabled", "true")
      spark.range(1).join(spark.range(1), how="inner").show()
      ```
      
      ```
      ...
      py4j.protocol.Py4JJavaError: An error occurred while calling o84.join.
      : java.lang.NullPointerException
      	at org.apache.spark.sql.Dataset.join(Dataset.scala:931)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      ...
      ```
      
      This PR suggests following the Scala behavior, as below:
      
      ```scala
      scala> spark.conf.set("spark.sql.crossJoin.enabled", "false")
      
      scala> spark.range(1).join(spark.range(1), Seq.empty[String], "inner").show()
      ```
      
      ```
      org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
      Range (0, 1, step=1, splits=Some(8))
      and
      Range (0, 1, step=1, splits=Some(8))
      Join condition is missing or trivial.
      Use the CROSS JOIN syntax to allow cartesian products between these relations.;
      ...
      ```
      
      ```scala
      scala> spark.conf.set("spark.sql.crossJoin.enabled", "true")
      
      scala> spark.range(1).join(spark.range(1), Seq.empty[String], "inner").show()
      ```
      ```
      +---+---+
      | id| id|
      +---+---+
      |  0|  0|
      +---+---+
      ```
      
      **After**
      
      ```python
      spark.conf.set("spark.sql.crossJoin.enabled", "false")
      spark.range(1).join(spark.range(1), how="inner").show()
      ```
      
      ```
      Traceback (most recent call last):
      ...
      pyspark.sql.utils.AnalysisException: u'Detected cartesian product for INNER join between logical plans\nRange (0, 1, step=1, splits=Some(8))\nand\nRange (0, 1, step=1, splits=Some(8))\nJoin condition is missing or trivial.\nUse the CROSS JOIN syntax to allow cartesian products between these relations.;'
      ```
      
      ```python
      spark.conf.set("spark.sql.crossJoin.enabled", "true")
      spark.range(1).join(spark.range(1), how="inner").show()
      ```
      ```
      +---+---+
      | id| id|
      +---+---+
      |  0|  0|
      +---+---+
      ```
      
      ## How was this patch tested?
      
      Added tests in `python/pyspark/sql/tests.py`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18484 from HyukjinKwon/SPARK-21264.
      a848d552
    • [SPARK-21283][CORE] FileOutputStream should be created as append mode · 6657e00d
      liuxian authored
      ## What changes were proposed in this pull request?
      
      `FileAppender` is used to write the `stderr` and `stdout` files in `ExecutorRunner`. Before the `ErrorStream` is written into the `stderr` file, header information has already been written to it; if the FileOutputStream is not created in append mode, that header information will be lost.
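      The essence of the fix, as a standalone sketch (not the actual FileAppender code; the path is an example):
      ```scala
      import java.io.{File, FileOutputStream}

      // Passing append = true keeps whatever was already written to the file
      // (e.g. the stderr/stdout header) instead of truncating it.
      val stderrFile = new File("/tmp/executor-demo/stderr")
      val out = new FileOutputStream(stderrFile, /* append = */ true)
      ```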
      
      ## How was this patch tested?
      unit test case
      
      Author: liuxian <liu.xian3@zte.com.cn>
      
      Closes #18507 from 10110346/wip-lx-0703.
      6657e00d
    • [TEST] Different behaviors of SparkContext Conf when building SparkSession · c79c10eb
      gatorsmile authored
      ## What changes were proposed in this pull request?
      If the ACTIVE SparkContext is NOT explicitly passed through the Builder's API `sparkContext()`, its conf will also contain the conf set through the API `config()`; otherwise, its conf will NOT contain the conf set through `config()`.
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18517 from gatorsmile/fixTestCase2.
      c79c10eb
    • [SPARK-21284][SQL] rename SessionCatalog.registerFunction parameter name · f953ca56
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Looking at the code in `SessionCatalog.registerFunction`, the parameter `ignoreIfExists` is misnamed. When `ignoreIfExists` is true, we override the function if it already exists, so `overrideIfExists` would be the more accurate name.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18510 from cloud-fan/minor.
      f953ca56
    • [SPARK-20073][SQL] Prints an explicit warning message in case of NULL-safe equals · 363bfe30
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR adds code to print the same warning messages as the `===` case when using NULL-safe equals (`<=>`).
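      The situation the warning targets, sketched with the Column API (assumes `spark.implicits._` and `functions._` are imported; the exact warning text is not quoted here):
      ```scala
      import org.apache.spark.sql.functions._

      // Comparing against a null literal with <=> now logs the same kind of
      // "you probably meant isNull/isNotNull" warning that === already does.
      val df = spark.range(3).select(($"id" <=> lit(null)).as("null_safe_eq"))
      df.show()
      ```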
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #18436 from maropu/SPARK-20073.
      363bfe30
    • [SPARK-21102][SQL] Refresh command is too aggressive in parsing · 17bdc36e
      aokolnychyi authored
      ### Idea
      
      This PR adds validation to REFRESH SQL statements. Currently, users can specify whatever they want as the resource path. For example, spark.sql("REFRESH ! $ !") will execute without any exception.
      
      ### Implementation
      
      I am not sure that my current implementation is the most optimal, so any feedback is appreciated. My first idea was to make the grammar as strict as possible. Unfortunately, there were some problems. I tried the approach below:
      
      SqlBase.g4
      ```
      ...
          | REFRESH TABLE tableIdentifier                                    #refreshTable
          | REFRESH resourcePath                                             #refreshResource
      ...
      
      resourcePath
          : STRING
          | (IDENTIFIER | number | nonReserved | '/' | '-')+ // other symbols can be added if needed
          ;
      ```
      It is not flexible enough and requires explicitly listing all possible symbols. Therefore, I came up with the current approach, which is implemented in the code.
      
      Let me know your opinion on which one is better.
      
      Author: aokolnychyi <anton.okolnychyi@sap.com>
      
      Closes #18368 from aokolnychyi/spark-21102.
      17bdc36e
    • [TEST] Load test table based on case sensitivity · eb7a5a66
      Zhenhua Wang authored
      ## What changes were proposed in this pull request?
      
      It is strange that we get a "table not found" error if **the first SQL statement** uses upper-case table names when developers write tests with `TestHiveSingleton`, **even though table names are case-insensitive**. This is because in `TestHiveQueryExecution`, test tables are loaded based on exact matching instead of case-insensitive matching.
      
      ## How was this patch tested?
      
      Added a new test case.
      
      Author: Zhenhua Wang <wzh_zju@163.com>
      
      Closes #18504 from wzhfy/testHive.
      eb7a5a66
    • [SPARK-21137][CORE] Spark reads many small files slowly · a9339db9
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Parallelize FileInputFormat.listStatus in Hadoop API via LIST_STATUS_NUM_THREADS to speed up examination of file sizes for wholeTextFiles et al
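      Roughly, the knob this leverages can also be set explicitly (the key below is Hadoop's list-status property; the thread count is an arbitrary example):
      ```scala
      // Let FileInputFormat.listStatus use multiple threads when listing input paths,
      // which speeds up wholeTextFiles/binaryFiles over directories of many small files.
      sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.list-status.num-threads", "8")
      val rdd = sc.wholeTextFiles("/data/many-small-files/*")
      ```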
      
      ## How was this patch tested?
      
      Existing tests, which will exercise the key path here: using a local file system.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #18441 from srowen/SPARK-21137.
      a9339db9