  1. May 25, 2016
    • Jurriaan Pruis's avatar
      [SPARK-15493][SQL] default QuoteEscapingEnabled flag to true when writing CSV · c875d81a
      Jurriaan Pruis authored
      ## What changes were proposed in this pull request?
      
      Default QuoteEscapingEnabled flag to true when writing CSV and add an escapeQuotes option to be able to change this.
      
      See https://github.com/uniVocity/univocity-parsers/blob/f3eb2af26374940e60d91d1703bde54619f50c51/src/main/java/com/univocity/parsers/csv/CsvWriterSettings.java#L231-L247
      
      This change is needed to be able to write RFC 4180 compatible CSV files (https://tools.ietf.org/html/rfc4180#section-2)
      
      https://issues.apache.org/jira/browse/SPARK-15493
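      A minimal usage sketch of the new option; the DataFrame contents and output path below are illustrative:
      ```scala
      import org.apache.spark.sql.SparkSession
      
      val spark = SparkSession.builder().appName("escape-quotes-sketch").getOrCreate()
      import spark.implicits._
      
      val df = Seq(("a \"quoted\" value", 1)).toDF("text", "id")
      
      // Quotes inside quoted fields are now escaped by default when writing CSV;
      // the new escapeQuotes option turns this off if the previous behavior is needed.
      df.write.option("escapeQuotes", "false").csv("/tmp/csv-out")   // path is illustrative
      ```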
      
      ## How was this patch tested?
      
      Added a test that verifies the output is quoted correctly.
      
      Author: Jurriaan Pruis <email@jurriaanpruis.nl>
      
      Closes #13267 from jurriaan/quote-escaping.
      c875d81a
    • Takuya UESHIN's avatar
      [SPARK-15483][SQL] IncrementalExecution should use extra strategies. · 4b880674
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      Extra strategies do not work for streams because `IncrementalExecution` uses a modified planner with stateful operations, but that planner does not include the extra strategies.
      
      This PR fixes `IncrementalExecution` to include the extra strategies so they are used.
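      A hedged sketch of registering an extra strategy that, after this fix, should also be consulted when planning streaming queries (the strategy itself is a hypothetical no-op):
      ```scala
      import org.apache.spark.sql.{SparkSession, Strategy}
      import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
      import org.apache.spark.sql.execution.SparkPlan
      
      // Hypothetical no-op strategy, used only to illustrate registration.
      object NoopStrategy extends Strategy {
        override def apply(plan: LogicalPlan): Seq[SparkPlan] = Nil
      }
      
      val spark = SparkSession.builder().getOrCreate()
      spark.experimental.extraStrategies = Seq(NoopStrategy)
      // Streaming query planning via IncrementalExecution should now include NoopStrategy as well.
      ```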
      
      ## How was this patch tested?
      
      I added a test to check if extra strategies work for streams.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #13261 from ueshin/issues/SPARK-15483.
      4b880674
    • Nick Pentreath's avatar
      [SPARK-15500][DOC][ML][PYSPARK] Remove default value in Param doc field in ALS · 1cb347fb
      Nick Pentreath authored
      Remove "Default: MEMORY_AND_DISK" from `Param` doc field in ALS storage level params. This fixes up the output of `explainParam(s)` so that default values are not displayed twice.
      
      We can revisit in the case that [SPARK-15130](https://issues.apache.org/jira/browse/SPARK-15130) moves ahead with adding defaults in some way to PySpark param doc fields.
      
      Tests N/A.
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #13277 from MLnick/SPARK-15500-als-remove-default-storage-param.
      1cb347fb
    • lfzCarlosC's avatar
      [MINOR][MLLIB][STREAMING][SQL] Fix typos · 02c8072e
      lfzCarlosC authored
      Fixed typos in the source code of the [mllib], [streaming], and [SQL] components.
      
      None and obvious.
      
      Author: lfzCarlosC <lfz.carlos@gmail.com>
      
      Closes #13298 from lfzCarlosC/master.
      02c8072e
    • Dongjoon Hyun's avatar
      [MINOR][CORE] Fix a HadoopRDD log message and remove unused imports in rdd files. · d6d3e507
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR fixes the following typos in log message and comments of `HadoopRDD.scala`. Also, this removes unused imports.
      ```scala
      -      logWarning("Caching NewHadoopRDDs as deserialized objects usually leads to undesired" +
      +      logWarning("Caching HadoopRDDs as deserialized objects usually leads to undesired" +
      ...
      -      // since its not removed yet
      +      // since it's not removed yet
      ```
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13294 from dongjoon-hyun/minor_rdd_fix_log_message.
      d6d3e507
    • Eric Liang's avatar
      [SPARK-15520][SQL] SparkSession builder in python should also allow overriding... · 8239fdcb
      Eric Liang authored
      [SPARK-15520][SQL] SparkSession builder in python should also allow overriding confs of existing sessions
      
      ## What changes were proposed in this pull request?
      
      This fixes the python SparkSession builder to allow setting confs correctly. This was a leftover TODO from https://github.com/apache/spark/pull/13200.
      
      ## How was this patch tested?
      
      Python doc tests.
      
      cc andrewor14
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #13289 from ericl/spark-15520.
      8239fdcb
    • Jeff Zhang's avatar
      [SPARK-15345][SQL][PYSPARK] SparkSession's conf doesn't take effect when this... · 01e7b9c8
      Jeff Zhang authored
      [SPARK-15345][SQL][PYSPARK] SparkSession's conf doesn't take effect when there is already an existing SparkContext
      
      ## What changes were proposed in this pull request?
      
      Override the existing SparkContext's conf if the provided SparkConf is different. The PySpark part hasn't been fixed yet; that will be done after the first round of review to ensure this is the correct approach.
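      A sketch of the intended behavior on the Scala side; the conf key and value are only illustrative:
      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.SparkSession
      
      // A SparkContext already exists before the session is built.
      val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("existing-sc"))
      
      // Confs passed to the builder should still take effect on the resulting session.
      val spark = SparkSession.builder()
        .config("spark.sql.shuffle.partitions", "10")   // illustrative conf
        .getOrCreate()
      
      assert(spark.conf.get("spark.sql.shuffle.partitions") == "10")
      ```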
      
      ## How was this patch tested?
      
      Manually verify it in spark-shell.
      
      rxin Please help review it; I think this is a very critical issue for Spark 2.0.
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #13160 from zjffdu/SPARK-15345.
      01e7b9c8
    • Lukasz's avatar
      [SPARK-9044] Fix "Storage" tab in UI so that it reflects RDD name change. · b120fba6
      Lukasz authored
      ## What changes were proposed in this pull request?
      
      1. Make the 'name' field of RDDInfo mutable.
      2. In StorageListener: catch when an RDD's name changes and update it in RDDInfo (a minimal reproduction follows below).
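      A minimal reproduction of the scenario (the Storage tab check itself remains manual):
      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      
      val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("storage-tab-demo"))
      
      val rdd = sc.parallelize(1 to 100).setName("before").cache()
      rdd.count()              // materialize the RDD so it appears in the Storage tab
      rdd.setName("after")     // the Storage tab should now display "after" instead of "before"
      ```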
      
      ## How was this patch tested?
      
      1. Manual verification - the 'Storage' tab now behaves as expected.
      2. The commit also contains a new unit test which verifies this.
      
      Author: Lukasz <lgieron@gmail.com>
      
      Closes #13264 from lgieron/SPARK-9044.
      b120fba6
    • Reynold Xin's avatar
      [SPARK-15436][SQL] Remove DescribeFunction and ShowFunctions · 4f27b8dd
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch removes the last two commands defined in the catalyst module: DescribeFunction and ShowFunctions. They were unnecessary since the parser could just generate DescribeFunctionCommand and ShowFunctionsCommand directly.
      
      ## How was this patch tested?
      Created a new SparkSqlParserSuite.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13292 from rxin/SPARK-15436.
      4f27b8dd
    • Krishna Kalyan's avatar
      [SPARK-12071][DOC] Document the behaviour of NA in R · 9082b796
      Krishna Kalyan authored
      ## What changes were proposed in this pull request?
      
      Under the "Upgrading From SparkR 1.5.x to 1.6.x" section, added a note that Spark SQL converts `NA` in R to `null`.
      
      ## How was this patch tested?
      
      Document update, no tests.
      
      Author: Krishna Kalyan <krishnakalyan3@gmail.com>
      
      Closes #13268 from krishnakalyan3/spark-12071-1.
      9082b796
    • Holden Karau's avatar
      [SPARK-15412][PYSPARK][SPARKR][DOCS] Improve linear isotonic regression pydoc... · cd9f1690
      Holden Karau authored
      [SPARK-15412][PYSPARK][SPARKR][DOCS] Improve linear isotonic regression pydoc & doc build instructions
      
      ## What changes were proposed in this pull request?
      
      PySpark: Add links to the predictors from the models in regression.py, and improve the linear and isotonic pydoc in minor ways.
      User guide / R: Switch the installed package list to be enough to build the R docs on a "fresh" Ubuntu install, and add sudo to match the rest of the commands.
      User Guide: Add a note about using gem2.0 for systems with both Ruby 1.9 and 2.0 (e.g. some Ubuntu releases, but maybe more).
      
      ## How was this patch tested?
      
      built pydocs locally, tested new user build instructions
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #13199 from holdenk/SPARK-15412-improve-linear-isotonic-regression-pydoc.
      cd9f1690
    • Shixiong Zhu's avatar
      [SPARK-15508][STREAMING][TESTS] Fix flaky test: JavaKafkaStreamSuite.testKafkaStream · c9c1c0e5
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      `JavaKafkaStreamSuite.testKafkaStream` assumes when `sent.size == result.size`, the contents of `sent` and `result` should be same. However, that's not true. The content of `result` may not be the final content.
      
      This PR modified the test to always retry the assertions even if the contents of `sent` and `result` are not the same.
      
      Here is the failure in Jenkins: http://spark-tests.appspot.com/tests/org.apache.spark.streaming.kafka.JavaKafkaStreamSuite/testKafkaStream
      
      ## How was this patch tested?
      
      Jenkins unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13281 from zsxwing/flaky-kafka-test.
      c9c1c0e5
  2. May 24, 2016
    • Wenchen Fan's avatar
      [SPARK-15498][TESTS] fix slow tests · 50b660d7
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This PR fixes 3 slow tests:
      
      1. `ParquetQuerySuite.read/write wide table`: This is not a good unit test as it runs for more than 5 minutes. This PR removes it and adds a new regression test in `CodeGenerationSuite`, which is more "unit".
      2. `ParquetQuerySuite.returning batch for wide table`: reduce the threshold and use a smaller data size.
      3. `DatasetSuite.SPARK-14554: Dataset.map may generate wrong java code for wide table`: Improving `CodeFormatter.format` (introduced in https://github.com/apache/spark/pull/12979) dramatically speeds it up.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13273 from cloud-fan/test.
      50b660d7
    • Parth Brahmbhatt's avatar
      [SPARK-15365][SQL] When table size statistics are not available from... · 4acababc
      Parth Brahmbhatt authored
      [SPARK-15365][SQL] When table size statistics are not available from metastore, we should fallback to HDFS
      
      ## What changes were proposed in this pull request?
      Currently, if a table is used in a join operation, we rely on the size returned by the metastore to decide whether the operation can be converted to a broadcast join. This optimization only kicks in for tables that have statistics available in the metastore. Hive generally falls back to HDFS when statistics are not available directly from the metastore, and this seems like a reasonable choice to adopt given the optimization benefit of using broadcast joins.
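      A hedged illustration of the effect; the table names are placeholders and the threshold conf shown is the existing `spark.sql.autoBroadcastJoinThreshold`:
      ```scala
      import org.apache.spark.sql.SparkSession
      
      val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
      
      // Broadcast joins only kick in when the estimated table size is below this threshold.
      spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (10L * 1024 * 1024).toString)
      
      // With no metastore statistics, small_tbl's size should now be estimated from HDFS,
      // so the join can still be planned as a broadcast join (verify via explain()).
      spark.table("small_tbl").join(spark.table("big_tbl"), "id").explain()
      ```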
      
      ## How was this patch tested?
      I have executed queries locally to test.
      
      Author: Parth Brahmbhatt <pbrahmbhatt@netflix.com>
      
      Closes #13150 from Parth-Brahmbhatt/SPARK-15365.
      4acababc
    • Reynold Xin's avatar
      [SPARK-15518] Rename various scheduler backend for consistency · 14494da8
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch renames various scheduler backends to make them consistent:
      
      - LocalScheduler -> LocalSchedulerBackend
      - AppClient -> StandaloneAppClient
      - AppClientListener -> StandaloneAppClientListener
      - SparkDeploySchedulerBackend -> StandaloneSchedulerBackend
      - CoarseMesosSchedulerBackend -> MesosCoarseGrainedSchedulerBackend
      - MesosSchedulerBackend -> MesosFineGrainedSchedulerBackend
      
      ## How was this patch tested?
      Updated test cases to reflect the name change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13288 from rxin/SPARK-15518.
      14494da8
    • Dongjoon Hyun's avatar
      [SPARK-15512][CORE] repartition(0) should raise IllegalArgumentException · f08bf587
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Previously, SPARK-8893 added constraints requiring a positive number of partitions for repartition/coalesce operations in general. This PR adds the one missing piece and adds two explicit test cases.
      
      **Before**
      ```scala
      scala> sc.parallelize(1 to 5).coalesce(0)
      java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
      ...
      scala> sc.parallelize(1 to 5).repartition(0).collect()
      res1: Array[Int] = Array()   // empty
      scala> spark.sql("select 1").coalesce(0)
      res2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [1: int]
      scala> spark.sql("select 1").coalesce(0).collect()
      java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
      scala> spark.sql("select 1").repartition(0)
      res3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [1: int]
      scala> spark.sql("select 1").repartition(0).collect()
      res4: Array[org.apache.spark.sql.Row] = Array()  // empty
      ```
      
      **After**
      ```scala
      scala> sc.parallelize(1 to 5).coalesce(0)
      java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
      ...
      scala> sc.parallelize(1 to 5).repartition(0)
      java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
      ...
      scala> spark.sql("select 1").coalesce(0)
      java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
      ...
      scala> spark.sql("select 1").repartition(0)
      java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
      ...
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests with new testcases.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13282 from dongjoon-hyun/SPARK-15512.
      f08bf587
    • Tathagata Das's avatar
      [SPARK-15458][SQL][STREAMING] Disable schema inference for streaming datasets on file streams · e631b819
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Relying on the schema being inferred in file streams can break easily for multiple reasons:
      - accidentally running on a directory which has no data
      - schema changing underneath
      - on restart, the query will infer the schema again and may unexpectedly infer an incorrect schema, as the files in the directory may be different at the time of the restart.
      
      To avoid these complicated scenarios, for Spark 2.0 we are going to disable schema inference by default behind a config, so that users are forced to consider explicitly what schema they want, rather than the system trying to infer it and running into weird corner cases.
      
      In this PR, I introduce a SQLConf that determines whether schema inference for file streams is allowed or not. It is disabled by default.
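      A sketch of the resulting user-facing pattern: supply the schema explicitly for file streams, or opt back into inference via the new conf (the conf key shown is the one later documented for file-based structured streaming; treat it as illustrative here):
      ```scala
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.types.{IntegerType, StringType, StructType}
      
      val spark = SparkSession.builder().getOrCreate()
      
      // Explicit schema: the recommended path now that inference is disabled by default.
      val schema = new StructType().add("name", StringType).add("age", IntegerType)
      val stream = spark.readStream.schema(schema).json("/data/input")   // path is illustrative
      
      // Opting back into inference:
      // spark.conf.set("spark.sql.streaming.schemaInference", "true")
      ```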
      
      ## How was this patch tested?
      Updated unit tests that test error behavior with and without schema inference enabled.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #13238 from tdas/SPARK-15458.
      e631b819
    • Nick Pentreath's avatar
      [SPARK-15502][DOC][ML][PYSPARK] add guide note that ALS only supports integer ids · 20900e5f
      Nick Pentreath authored
      This PR adds a note to clarify that the ML API for ALS only supports integers for user/item ids, and that other types for these columns can be used but the ids must fall within integer range.
      
      (Refer [SPARK-14891](https://issues.apache.org/jira/browse/SPARK-14891)).
      
      Also cleaned up a reference to `mllib` in the ML doc.
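      A small illustration of the documented constraint; the column names and toy ratings are assumptions for the example:
      ```scala
      import org.apache.spark.ml.recommendation.ALS
      import org.apache.spark.sql.SparkSession
      
      val spark = SparkSession.builder().getOrCreate()
      import spark.implicits._
      
      // user/item ids must be integers, or numeric values that fall within the integer range.
      val ratings = Seq((1, 10, 4.0f), (1, 20, 3.0f), (2, 10, 5.0f)).toDF("userId", "itemId", "rating")
      
      val als = new ALS().setUserCol("userId").setItemCol("itemId").setRatingCol("rating")
      val model = als.fit(ratings)
      ```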
      
      ## How was this patch tested?
      Built and viewed User Guide doc locally.
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #13278 from MLnick/SPARK-15502-als-int-id-doc-note.
      20900e5f
    • Dongjoon Hyun's avatar
      [MINOR][CORE][TEST] Update obsolete `takeSample` test case. · be99a99f
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR fixes some obsolete comments and an assertion in the `takeSample` test case of `RDDSuite.scala`.
      
      ## How was this patch tested?
      
      This fixes the testcase only.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13260 from dongjoon-hyun/SPARK-15481.
      be99a99f
    • wangyang's avatar
      [SPARK-15388][SQL] Fix spark sql CREATE FUNCTION with hive 1.2.1 · 784cc07d
      wangyang authored
      ## What changes were proposed in this pull request?
      
      spark.sql("CREATE FUNCTION myfunc AS 'com.haizhi.bdp.udf.UDFGetGeoCode'") throws "org.apache.hadoop.hive.ql.metadata.HiveException:MetaException(message:NoSuchObjectException(message:Function default.myfunc does not exist))" with hive 1.2.1.
      
      I think this was introduced by PR #12853. This change fixes it by catching Exception (not NoSuchObjectException) and string matching.
      
      ## How was this patch tested?
      
      added a unit test and also tested it manually
      
      Author: wangyang <wangyang@haizhi.com>
      
      Closes #13177 from wangyang1992/fixCreateFunc2.
      784cc07d
    • Marcelo Vanzin's avatar
      [SPARK-15405][YARN] Remove unnecessary upload of config archive. · a313a5ae
      Marcelo Vanzin authored
      We only need one copy of it. The client code that was uploading the
      second copy just needs to be modified to update the metadata in the
      cache, so that the AM knows where to find the configuration.
      
      Tested by running app on YARN and verifying in the logs only one archive
      is uploaded.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #13232 from vanzin/SPARK-15405.
      a313a5ae
    • Liang-Chi Hsieh's avatar
      [SPARK-15433] [PYSPARK] PySpark core test should not use SerDe from PythonMLLibAPI · 695d9a0f
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      Currently PySpark core test uses the `SerDe` from `PythonMLLibAPI` which includes many MLlib things. It should use `SerDeUtil` instead.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #13214 from viirya/pycore-use-serdeutil.
      695d9a0f
    • Dongjoon Hyun's avatar
      [SPARK-13135] [SQL] Don't print expressions recursively in generated code · f8763b80
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR is an up-to-date and slightly improved version of rxin's #11019 for
      - (1) preventing recursive printing of expressions in generated code.
      
      Since the major function of this PR is indeed the above, he should be credited for the work he did. In addition to #11019, this PR improves the following in code generation:
      - (2) Improve multiline comment indentation.
      - (3) Reduce the number of empty lines (mainly consecutive empty lines).
      - (4) Remove all space characters on empty lines.
      
      **Example**
      ```scala
      spark.range(1, 1000).select('id+1+2+3, 'id+4+5+6)
      ```
      
      **Before**
      ```
      Generated code:
      /* 001 */ public Object generate(Object[] references) {
      ...
      /* 005 */ /**
      /* 006 */ * Codegend pipeline for
      /* 007 */ * Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
      /* 008 */ * +- Range 1, 1, 8, 999, [id#0L]
      /* 009 */ */
      ...
      /* 075 */     // PRODUCE: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
      /* 076 */
      /* 077 */     // PRODUCE: Range 1, 1, 8, 999, [id#0L]
      /* 078 */
      /* 079 */     // initialize Range
      ...
      /* 092 */       // CONSUME: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
      /* 093 */
      /* 094 */       // CONSUME: WholeStageCodegen
      /* 095 */
      /* 096 */       // (((input[0, bigint, false] + 1) + 2) + 3)
      /* 097 */       // ((input[0, bigint, false] + 1) + 2)
      /* 098 */       // (input[0, bigint, false] + 1)
      ...
      /* 107 */       // (((input[0, bigint, false] + 4) + 5) + 6)
      /* 108 */       // ((input[0, bigint, false] + 4) + 5)
      /* 109 */       // (input[0, bigint, false] + 4)
      ...
      /* 126 */ }
      ```
      
      **After**
      ```
      Generated code:
      /* 001 */ public Object generate(Object[] references) {
      ...
      /* 005 */ /**
      /* 006 */  * Codegend pipeline for
      /* 007 */  * Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
      /* 008 */  * +- Range 1, 1, 8, 999, [id#0L]
      /* 009 */  */
      ...
      /* 075 */     // PRODUCE: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
      /* 076 */     // PRODUCE: Range 1, 1, 8, 999, [id#0L]
      /* 077 */     // initialize Range
      ...
      /* 090 */       // CONSUME: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
      /* 091 */       // CONSUME: WholeStageCodegen
      /* 092 */       // (((input[0, bigint, false] + 1) + 2) + 3)
      ...
      /* 101 */       // (((input[0, bigint, false] + 4) + 5) + 6)
      ...
      /* 118 */ }
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests and see the result of the following command manually.
      ```scala
      scala> spark.range(1, 1000).select('id+1+2+3, 'id+4+5+6).queryExecution.debug.codegen()
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13192 from dongjoon-hyun/SPARK-13135.
      f8763b80
    • Liang-Chi Hsieh's avatar
      [SPARK-11753][SQL][TEST-HADOOP2.2] Make allowNonNumericNumbers option work · c24b6b67
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      Jackson supports the `allowNonNumericNumbers` option to parse non-standard non-numeric numbers such as "NaN", "Infinity", and "INF". The currently used Jackson version (2.5.3) doesn't fully support it. This patch upgrades the library and makes the two previously ignored tests in `JsonParsingOptionsSuite` pass.
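      A hedged usage sketch once the option works end to end; the input path and file contents are placeholders:
      ```scala
      import org.apache.spark.sql.SparkSession
      
      val spark = SparkSession.builder().getOrCreate()
      
      // e.g. a file containing a line such as {"value": NaN}
      val df = spark.read
        .option("allowNonNumericNumbers", "true")
        .json("/path/to/input.json")
      df.show()
      ```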
      
      ## How was this patch tested?
      
      `JsonParsingOptionsSuite`.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #9759 from viirya/fix-json-nonnumric.
      c24b6b67
    • Nick Pentreath's avatar
      [SPARK-15442][ML][PYSPARK] Add 'relativeError' param to PySpark QuantileDiscretizer · 6075f5b4
      Nick Pentreath authored
      This PR adds the `relativeError` param to PySpark's `QuantileDiscretizer` to match Scala.
      
      Also cleaned up a duplication of `numBuckets` where the param is both a class and instance attribute (I removed the instance attr to match the style of params throughout `ml`).
      
      Finally, cleaned up the docs for `QuantileDiscretizer` to reflect that it now uses `approxQuantile`.
      
      ## How was this patch tested?
      
      A little doctest and built API docs locally to check HTML doc generation.
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #13228 from MLnick/SPARK-15442-py-relerror-param.
      6075f5b4
    • Daoyuan Wang's avatar
      [SPARK-15397][SQL] fix string udf locate as hive · d642b273
      Daoyuan Wang authored
      ## What changes were proposed in this pull request?
      
      In Hive, `locate("aa", "aaa", 0)` yields 0, `locate("aa", "aaa", 1)` yields 1, and `locate("aa", "aaa", 2)` yields 2, while in Spark, `locate("aa", "aaa", 0)` yields 1, `locate("aa", "aaa", 1)` yields 2, and `locate("aa", "aaa", 2)` yields 0. This results from a different understanding of the third parameter of the `locate` UDF: it is the starting index and starts from 1, so when 0 is used Hive always returns 0.
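      The same behavior expressed as SQL run through Spark after this change (illustrative):
      ```scala
      import org.apache.spark.sql.SparkSession
      
      val spark = SparkSession.builder().getOrCreate()
      
      // Hive-compatible results after this fix:
      spark.sql("SELECT locate('aa', 'aaa', 0)").show()   // 0
      spark.sql("SELECT locate('aa', 'aaa', 1)").show()   // 1
      spark.sql("SELECT locate('aa', 'aaa', 2)").show()   // 2
      ```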
      
      ## How was this patch tested?
      
      tested with modified `StringExpressionsSuite` and `StringFunctionsSuite`
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #13186 from adrian-wang/locate.
      d642b273
  3. May 23, 2016
    • Kazuaki Ishizaki's avatar
      [SPARK-15285][SQL] Generated SpecificSafeProjection.apply method grows beyond 64 KB · fa244e5a
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR splits the generated code for ```SafeProjection.apply``` by using ```ctx.splitExpressions()```. This is because the large code body for ```NewInstance``` may grow beyond the 64 KB bytecode size limit for the ```apply()``` method.
      
      ## How was this patch tested?
      
      Added new tests
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #13243 from kiszk/SPARK-15285.
      fa244e5a
    • gatorsmile's avatar
      [SPARK-15485][SQL][DOCS] Spark SQL Configuration · d2077164
      gatorsmile authored
      #### What changes were proposed in this pull request?
      So far, the Configuration page in the official documentation does not have a section for Spark SQL.
      http://spark.apache.org/docs/latest/configuration.html
      
      For Spark users, the information and default values of these public configuration parameters are very useful. This PR adds the missing section to configuration.html.
      
      rxin yhuai marmbrus
      
      #### How was this patch tested?
      Below is the generated webpage.
      <img width="924" alt="screenshot 2016-05-23 11 35 57" src="https://cloud.githubusercontent.com/assets/11567269/15480492/b08fefc4-20da-11e6-9fa2-7cd5b699ed35.png">
      <img width="914" alt="screenshot 2016-05-23 11 37 38" src="https://cloud.githubusercontent.com/assets/11567269/15480499/c5f9482e-20da-11e6-95ff-10821add1af4.png">
      <img width="923" alt="screenshot 2016-05-23 11 36 11" src="https://cloud.githubusercontent.com/assets/11567269/15480506/cbd81644-20da-11e6-9d27-effb716b2fac.png">
      <img width="920" alt="screenshot 2016-05-23 11 36 18" src="https://cloud.githubusercontent.com/assets/11567269/15480511/d013e332-20da-11e6-854a-cf8813c46f36.png">
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #13263 from gatorsmile/configurationSQL.
      d2077164
    • WeichenXu's avatar
      [SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext with... · a15ca553
      WeichenXu authored
      [SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code
      
      ## What changes were proposed in this pull request?
      
      Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code.
      
      ## How was this patch tested?
      
      Existing test.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #13242 from WeichenXu123/python_doctest_update_sparksession.
      a15ca553
    • gatorsmile's avatar
      [SPARK-15311][SQL] Disallow DML on Regular Tables when Using In-Memory Catalog · 5afd927a
      gatorsmile authored
      #### What changes were proposed in this pull request?
      So far, when using In-Memory Catalog, we allow DDL operations for the tables. However, the corresponding DML operations are not supported for the tables that are neither temporary nor data source tables. For example,
      ```SQL
      CREATE TABLE tabName(i INT, j STRING)
      SELECT * FROM tabName
      INSERT OVERWRITE TABLE tabName SELECT 1, 'a'
      ```
      In the above example, before this PR fix, we will get very confusing exception messages for either `SELECT` or `INSERT`
      ```
      org.apache.spark.sql.AnalysisException: unresolved operator 'SimpleCatalogRelation default, CatalogTable(`default`.`tbl`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(i,int,true,None), CatalogColumn(j,string,true,None)),List(),List(),List(),-1,,1463928681802,-1,Map(),None,None,None,List()), None;
      ```
      
      This PR is to issue appropriate exceptions in this case. The message will be like
      ```
      org.apache.spark.sql.AnalysisException: Please enable Hive support when operating non-temporary tables: `tbl`;
      ```
      #### How was this patch tested?
      Added a test case in `DDLSuite`.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #13093 from gatorsmile/selectAfterCreate.
      5afd927a
    • Xin Wu's avatar
      [SPARK-15431][SQL] Support LIST FILE(s)|JAR(s) command natively · 01659bc5
      Xin Wu authored
      ## What changes were proposed in this pull request?
      Currently the command `ADD FILE|JAR <filepath | jarpath>` is supported natively in Spark SQL. However, when this command is run, the added file/jar cannot be looked up with a `LIST FILE(s)|JAR(s)` command, because the `LIST` command is passed to the Hive command processor in Spark SQL or is simply not supported in spark-shell. There is no way for users to find out what files/jars have been added to the SparkContext.
      Refer to [Hive commands](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli)
      
      This PR is to support following commands:
      `LIST (FILE[s] [filepath ...] | JAR[s] [jarfile ...])`
      
      ### For example:
      ##### LIST FILE(s)
      ```
      scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt")
      res1: org.apache.spark.sql.DataFrame = []
      scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt")
      res2: org.apache.spark.sql.DataFrame = []
      
      scala> spark.sql("list file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt").show(false)
      +----------------------------------------------+
      |result                                        |
      +----------------------------------------------+
      |hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt|
      +----------------------------------------------+
      
      scala> spark.sql("list files").show(false)
      +----------------------------------------------+
      |result                                        |
      +----------------------------------------------+
      |hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt|
      |hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt |
      +----------------------------------------------+
      ```
      
      ##### LIST JAR(s)
      ```
      scala> spark.sql("add jar /Users/xinwu/spark/core/src/test/resources/TestUDTF.jar")
      res9: org.apache.spark.sql.DataFrame = [result: int]
      
      scala> spark.sql("list jar TestUDTF.jar").show(false)
      +---------------------------------------------+
      |result                                       |
      +---------------------------------------------+
      |spark://192.168.1.234:50131/jars/TestUDTF.jar|
      +---------------------------------------------+
      
      scala> spark.sql("list jars").show(false)
      +---------------------------------------------+
      |result                                       |
      +---------------------------------------------+
      |spark://192.168.1.234:50131/jars/TestUDTF.jar|
      +---------------------------------------------+
      ```
      ## How was this patch tested?
      New test cases are added for Spark-SQL, Spark-Shell and SparkContext API code path.
      
      Author: Xin Wu <xinwu@us.ibm.com>
      Author: xin Wu <xinwu@us.ibm.com>
      
      Closes #13212 from xwu0226/list_command.
      01659bc5
    • hyukjinkwon's avatar
      [MINOR][SPARKR][DOC] Add a description for running unit tests in Windows · a8e97d17
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR adds the description for running unit tests in Windows.
      
      ## How was this patch tested?
      
      On a bare machine (Windows 7, 32-bit), this was manually built and tested.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #13217 from HyukjinKwon/minor-r-doc.
      a8e97d17
    • sureshthalamati's avatar
      [SPARK-15315][SQL] Adding error check to the CSV datasource writer for... · 03c7b7c4
      sureshthalamati authored
      [SPARK-15315][SQL] Adding error check to  the CSV datasource writer for unsupported complex data types.
      
      ## What changes were proposed in this pull request?
      
      Adds error handling to the CSV writer for unsupported complex data types. Currently, garbage gets written to the output CSV files if the DataFrame schema contains complex data types.
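      A hedged sketch of the case this now guards against; the output path is illustrative and the exact exception type is not asserted here:
      ```scala
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions.array
      
      val spark = SparkSession.builder().getOrCreate()
      import spark.implicits._
      
      // A column of ArrayType is not representable in CSV.
      val df = spark.range(3).select(array($"id", $"id" + 1).as("ids"))
      
      // Previously this wrote garbage; with this change it should fail with a clear error.
      // df.write.csv("/tmp/csv-complex")
      ```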
      
      ## How was this patch tested?
      
      Added new unit test case.
      
      Author: sureshthalamati <suresh.thalamati@gmail.com>
      
      Closes #13105 from sureshthalamati/csv_complex_types_SPARK-15315.
      03c7b7c4
    • Dongjoon Hyun's avatar
      [MINOR][SQL][DOCS] Add notes of the deterministic assumption on UDF functions · 37c617e4
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Spark assumes that UDF functions are deterministic. This PR adds explicit notes about that.
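      A small illustration of the assumption being documented; the UDF is a toy example:
      ```scala
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions.udf
      
      val spark = SparkSession.builder().getOrCreate()
      import spark.implicits._
      
      // Spark may evaluate a UDF more times than it appears in the query,
      // so the function should be deterministic and side-effect free.
      val plusOne = udf((x: Long) => x + 1)
      spark.range(5).select(plusOne($"id").as("idPlusOne")).show()
      ```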
      
      ## How was this patch tested?
      
      It's only about docs.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13087 from dongjoon-hyun/SPARK-15282.
      37c617e4
    • Andrew Or's avatar
      [SPARK-15279][SQL] Catch conflicting SerDe when creating table · 2585d2b3
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      The user may do something like:
      ```
      CREATE TABLE my_tab ROW FORMAT SERDE 'anything' STORED AS PARQUET
      CREATE TABLE my_tab ROW FORMAT SERDE 'anything' STORED AS ... SERDE 'myserde'
      CREATE TABLE my_tab ROW FORMAT DELIMITED ... STORED AS ORC
      CREATE TABLE my_tab ROW FORMAT DELIMITED ... STORED AS ... SERDE 'myserde'
      ```
      None of these should be allowed because the SerDes conflict. As of this patch:
      - `ROW FORMAT DELIMITED` is only compatible with `TEXTFILE`
      - `ROW FORMAT SERDE` is only compatible with `TEXTFILE`, `RCFILE` and `SEQUENCEFILE`
      
      ## How was this patch tested?
      
      New tests in `DDLCommandSuite`.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #13068 from andrewor14/row-format-conflict.
      2585d2b3
    • Wenchen Fan's avatar
      [SPARK-15471][SQL] ScalaReflection cleanup · 07c36a2f
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      1. simplify the logic of deserializing option type.
      2. simplify the logic of serializing array type, and remove silentSchemaFor
      3. remove some unnecessary code.
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13250 from cloud-fan/encoder.
      07c36a2f
    • Davies Liu's avatar
      [SPARK-14031][SQL] speedup CSV writer · 80091b8a
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Currently, we create a CSVWriter for every row; this is very expensive and memory hungry, and took about 15 seconds to write out 1 million rows (two columns).
      
      This PR writes the rows in batch mode, creating a CSVWriter for every 1k rows, which can write out 1 million rows in about 1 second (15X faster).
      
      ## How was this patch tested?
      
      Manually benchmark it.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13229 from davies/csv_writer.
      80091b8a
    • Sameer Agarwal's avatar
      [SPARK-15425][SQL] Disallow cross joins by default · dafcb05c
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      In order to prevent users from inadvertently writing queries with cartesian joins, this patch introduces a new conf `spark.sql.crossJoin.enabled` (set to `false` by default) that if not set, results in a `SparkException` if the query contains one or more cartesian products.
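      A hedged sketch of the new behavior; the tiny ranges are just to force a cartesian product:
      ```scala
      import org.apache.spark.sql.SparkSession
      
      val spark = SparkSession.builder().getOrCreate()
      
      // With the default spark.sql.crossJoin.enabled=false, a join with no condition
      // that produces a cartesian product is expected to fail:
      // spark.range(3).join(spark.range(3)).count()
      
      // Explicitly opting in:
      spark.conf.set("spark.sql.crossJoin.enabled", "true")
      spark.range(3).join(spark.range(3)).count()   // 9
      ```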
      
      ## How was this patch tested?
      
      Added a test to verify the new behavior in `JoinSuite`. Additionally, `SQLQuerySuite` and `SQLMetricsSuite` were modified to explicitly enable cartesian products.
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #13209 from sameeragarwal/disallow-cartesian.
      dafcb05c
  4. May 22, 2016
    • wangyang's avatar
      [SPARK-15379][SQL] check special invalid date · fc44b694
      wangyang authored
      ## What changes were proposed in this pull request?
      
      When an invalid date string like "2015-02-29 00:00:00" is cast as a date or timestamp using Spark SQL, it used to return not null but another valid date (2015-03-01 in this case).
      In this PR, invalid date strings like "2016-02-29" and "2016-04-31" are returned as null when cast as a date or timestamp.
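      Illustrative casts showing the new behavior:
      ```scala
      import org.apache.spark.sql.SparkSession
      
      val spark = SparkSession.builder().getOrCreate()
      
      // 2016-02-29 is a real leap day; 2015-02-29 and 2016-04-31 are not valid dates.
      spark.sql(
        "SELECT CAST('2016-02-29' AS DATE), CAST('2015-02-29' AS DATE), CAST('2016-04-31' AS DATE)"
      ).show()
      // expected: 2016-02-29, null, null
      ```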
      
      ## How was this patch tested?
      
      Unit tests are added.
      
      
      Author: wangyang <wangyang@haizhi.com>
      
      Closes #13169 from wangyang1992/invalid_date.
      fc44b694