  1. May 10, 2017
    • Ala Luszczak's avatar
      [SPARK-19447] Remove remaining references to generated rows metric · 5c2c4dcc
      Ala Luszczak authored
      ## What changes were proposed in this pull request?
      
      https://github.com/apache/spark/commit/b486ffc86d8ad6c303321dcf8514afee723f61f8 left behind references to the "number of generated rows" metric that should have been removed.
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: Ala Luszczak <ala@databricks.com>
      
      Closes #17939 from ala/SPARK-19447-fix.
      5c2c4dcc
    • wangzhenhua's avatar
      [SPARK-20678][SQL] Ndv for columns not in filter condition should also be updated · 76e4a556
      wangzhenhua authored
      ## What changes were proposed in this pull request?
      
      In filter estimation, we update column stats for the columns in the filter condition. However, if the number of rows decreases after the filter (i.e. the overall selectivity is less than 1), we need to update (scale down) the number of distinct values (NDV) for all columns, whether or not they appear in the filter condition.

      This PR also fixes the inconsistent rounding mode between ndv and rowCount.
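      As a rough illustration of the scaling (a minimal sketch assuming proportional scale-down by selectivity; not the actual `FilterEstimation` code, and the rounding choice is only illustrative):

      ```scala
      import scala.math.BigDecimal.RoundingMode

      // Scale an NDV estimate by the overall selectivity of the filter.
      def scaleDownNdv(oldNdv: BigInt, selectivity: Double): BigInt = {
        require(selectivity >= 0.0 && selectivity <= 1.0)
        (BigDecimal(oldNdv) * BigDecimal(selectivity)).setScale(0, RoundingMode.CEILING).toBigInt
      }

      // e.g. a column with 1000 distinct values and overall selectivity 0.1
      // would be estimated to keep about 100 distinct values after the filter.
      ```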
      
      ## How was this patch tested?
      
      Added new tests.
      
      Author: wangzhenhua <wangzhenhua@huawei.com>
      
      Closes #17918 from wzhfy/scaleDownNdvAfterFilter.
      76e4a556
    • Wenchen Fan's avatar
      [SPARK-20688][SQL] correctly check analysis for scalar sub-queries · 789bdbe3
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      In `CheckAnalysis`, we should call `checkAnalysis` for `ScalarSubquery` at the beginning, because later we call `plan.output`, which is invalid if `plan` is not resolved.
      
      ## How was this patch tested?
      
      new regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #17930 from cloud-fan/tmp.
      789bdbe3
    • NICHOLAS T. MARION's avatar
      [SPARK-20393][WEB UI] Strengthen Spark to prevent XSS vulnerabilities · b512233a
      NICHOLAS T. MARION authored
      ## What changes were proposed in this pull request?
      
      Add stripXSS and stripXSSMap to Spark Core's UIUtils, and call these functions wherever getParameter is called on an HttpServletRequest.
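      As a rough illustration only (the helper names follow the description above, but the body is an assumption rather than the actual `UIUtils` code), such sanitization can be as simple as stripping characters that could start an HTML/JS injection:

      ```scala
      // Hedged sketch of parameter-sanitizing helpers; the exact characters handled by
      // Spark's UIUtils.stripXSS may differ, this just shows the shape of the approach.
      object XssSketch {
        def stripXSS(value: String): String =
          if (value == null) null
          else value.replaceAll("[<>\"'%;()&+]", "")

        def stripXSSMap(params: Map[String, Array[String]]): Map[String, Array[String]] =
          params.map { case (k, vs) => stripXSS(k) -> vs.map(stripXSS) }
      }
      ```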
      
      ## How was this patch tested?
      
      Unit tests, IBM Security AppScan Standard no longer showing vulnerabilities, manual verification of WebUI pages.
      
      Author: NICHOLAS T. MARION <nmarion@us.ibm.com>
      
      Closes #17686 from n-marion/xss-fix.
      b512233a
    • Takuya UESHIN's avatar
      [SPARK-20668][SQL] Modify ScalaUDF to handle nullability. · 0ef16bd4
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      When registering a Scala UDF, we know whether it will return a nullable value. `ScalaUDF` and related classes should handle this nullability.
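      For illustration (assumed behavior based on the description above; the reported nullability may vary by Spark version), a UDF returning a primitive can be treated as non-nullable while one returning an `Option` cannot:

      ```scala
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions.udf

      object UdfNullabilitySketch extends App {
        val spark = SparkSession.builder().master("local[1]").appName("udf-nullability").getOrCreate()
        import spark.implicits._

        val intUdf    = udf((i: Int) => i + 1)           // primitive return type: value is never null
        val optionUdf = udf((i: Int) => Option(i + 1))   // Option return type: may produce null

        Seq(1, 2, 3).toDF("x")
          .select(intUdf($"x").as("a"), optionUdf($"x").as("b"))
          .printSchema()  // inspect the nullable flag of each column

        spark.stop()
      }
      ```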
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #17911 from ueshin/issues/SPARK-20668.
      0ef16bd4
    • Josh Rosen's avatar
      [SPARK-20686][SQL] PropagateEmptyRelation incorrectly handles aggregate without grouping · a90c5cd8
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      The query
      
      ```
      SELECT 1 FROM (SELECT COUNT(*) WHERE FALSE) t1
      ```
      
      should return a single row of output because the subquery is an aggregate without a group-by and thus always produces exactly one row. However, Spark incorrectly returns zero rows.
      
      This is caused by SPARK-16208 / #13906, a patch which added an optimizer rule to propagate EmptyRelation through operators. The logic for handling aggregates is wrong: it checks whether aggregate expressions are non-empty to decide whether the output should be empty, whereas it should be checking the grouping expressions instead:

      An aggregate with non-empty grouping expressions will return one output row per group. If the input to the grouped aggregate is empty then all groups will be empty and thus the output will be empty. It doesn't matter whether the aggregation output columns include aggregate expressions, since that won't affect the number of output rows.
      
      If the grouping expressions are empty, however, then the aggregate will always produce a single output row and thus we cannot propagate the EmptyRelation.
      
      The current implementation is incorrect and also misses an optimization opportunity by not propagating EmptyRelation in the case where a grouped aggregate has aggregate expressions (in other words, `SELECT COUNT(*) from emptyRelation GROUP BY x` would _not_ be optimized to `EmptyRelation` in the old code, even though it safely could be).
      
      This patch resolves this issue by modifying `PropagateEmptyRelation` to consider only the presence/absence of grouping expressions, not the aggregate functions themselves, when deciding whether to propagate EmptyRelation.
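      The decision boils down to a small predicate (a hedged sketch of the condition, not the actual optimizer rule):

      ```scala
      // Sketch: an Aggregate over an empty input can itself be replaced by an empty
      // relation only when it has grouping expressions; a global aggregate (no GROUP BY)
      // always produces exactly one row, so it must be kept.
      def canPropagateEmptyThroughAggregate(
          groupingExprs: Seq[String],   // placeholder for catalyst Expressions
          childIsEmpty: Boolean): Boolean =
        childIsEmpty && groupingExprs.nonEmpty
      ```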
      
      ## How was this patch tested?
      
      - Added end-to-end regression tests in `SQLQueryTest`'s `group-by.sql` file.
      - Updated unit tests in `PropagateEmptyRelationSuite`.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #17929 from JoshRosen/fix-PropagateEmptyRelation.
      a90c5cd8
    • hyukjinkwon's avatar
      [SPARK-20590][SQL] Use Spark internal datasource if multiple are found for the same shortened name · 3d2131ab
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      One of the common usability problems around reading data in Spark (particularly CSV) is that there can often be a conflict between different readers on the classpath.

      As an example, if someone launches a 2.x Spark shell with the spark-csv package on the classpath, Spark currently fails in an extremely unfriendly way (see databricks/spark-csv#367):
      
      ```bash
      ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
      scala> val df = spark.read.csv("/foo/bar.csv")
      java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name.
        at scala.sys.package$.error(package.scala:27)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:574)
        at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:85)
        at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:85)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:295)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
        at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
        at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
        ... 48 elided
      ```
      
      This PR proposes a simple way of fixing this error by picking the internal data source when there is a single one (the data source whose class name has the "org.apache.spark" prefix).
      
      ```scala
      scala> spark.range(1).write.format("csv").mode("overwrite").save("/tmp/abc")
      17/05/10 09:47:44 WARN DataSource: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat,
      com.databricks.spark.csv.DefaultSource15), defaulting to the internal datasource (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat).
      ```
      
      ```scala
      scala> spark.range(1).write.format("Csv").mode("overwrite").save("/tmp/abc")
      17/05/10 09:47:52 WARN DataSource: Multiple sources found for Csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat,
      com.databricks.spark.csv.DefaultSource15), defaulting to the internal datasource (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat).
      ```
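      A hedged sketch of the selection logic described above (names and structure are illustrative, not the actual `DataSource.lookupDataSource` code):

      ```scala
      // Given all provider class names matched for a short name, prefer the single
      // internal one (org.apache.spark prefix); otherwise keep the "multiple sources" error.
      def pickProvider(shortName: String, matched: Seq[String]): String = {
        val internal = matched.filter(_.startsWith("org.apache.spark"))
        if (matched.size == 1) matched.head
        else if (internal.size == 1) {
          println(s"WARN: multiple sources found for $shortName, defaulting to the internal one (${internal.head}).")
          internal.head
        } else {
          sys.error(s"Multiple sources found for $shortName (${matched.mkString(", ")}), " +
            "please specify the fully qualified class name.")
        }
      }
      ```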
      
      ## How was this patch tested?
      
      Manually tested as below:
      
      ```bash
      ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
      ```
      
      ```scala
      spark.sparkContext.setLogLevel("WARN")
      ```
      
      **positive cases**:
      
      ```scala
      scala> spark.range(1).write.format("csv").mode("overwrite").save("/tmp/abc")
      17/05/10 09:47:44 WARN DataSource: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat,
      com.databricks.spark.csv.DefaultSource15), defaulting to the internal datasource (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat).
      ```
      
      ```scala
      scala> spark.range(1).write.format("Csv").mode("overwrite").save("/tmp/abc")
      17/05/10 09:47:52 WARN DataSource: Multiple sources found for Csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat,
      com.databricks.spark.csv.DefaultSource15), defaulting to the internal datasource (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat).
      ```
      
      (newlines were inserted for readability).
      
      ```scala
      scala> spark.range(1).write.format("com.databricks.spark.csv").mode("overwrite").save("/tmp/abc")
      ```
      
      ```scala
      scala> spark.range(1).write.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").mode("overwrite").save("/tmp/abc")
      ```
      
      **negative cases**:
      
      ```scala
      scala> spark.range(1).write.format("com.databricks.spark.csv.CsvRelation").save("/tmp/abc")
      java.lang.InstantiationException: com.databricks.spark.csv.CsvRelation
      ...
      ```
      
      ```scala
      scala> spark.range(1).write.format("com.databricks.spark.csv.CsvRelatio").save("/tmp/abc")
      java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv.CsvRelatio. Please find packages at http://spark.apache.org/third-party-projects.html
      ...
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17916 from HyukjinKwon/datasource-detect.
      3d2131ab
  2. May 09, 2017
    • Yuming Wang's avatar
      [SPARK-17685][SQL] Make SortMergeJoinExec's currentVars null when calling createJoinKey · 771abeb4
      Yuming Wang authored
      ## What changes were proposed in this pull request?
      
      The following SQL query causes an `IndexOutOfBoundsException` when `LIMIT > 1310720`:
      ```sql
      CREATE TABLE tab1(int int, int2 int, str string);
      CREATE TABLE tab2(int int, int2 int, str string);
      INSERT INTO tab1 values(1,1,'str');
      INSERT INTO tab1 values(2,2,'str');
      INSERT INTO tab2 values(1,1,'str');
      INSERT INTO tab2 values(2,3,'str');
      
      SELECT
        count(*)
      FROM
        (
          SELECT t1.int, t2.int2
          FROM (SELECT * FROM tab1 LIMIT 1310721) t1
          INNER JOIN (SELECT * FROM tab2 LIMIT 1310721) t2
          ON (t1.int = t2.int AND t1.int2 = t2.int2)
        ) t;
      ```
      
      This pull request fixes this issue.
      
      ## How was this patch tested?
      
      unit tests
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #17920 from wangyum/SPARK-17685.
      771abeb4
    • uncleGen's avatar
      [SPARK-20373][SQL][SS] Batch queries with `Dataset/DataFrame.withWatermark()` do not execute · c0189abc
      uncleGen authored
      ## What changes were proposed in this pull request?
      
      Any Dataset/DataFrame batch query with the operation `withWatermark` does not execute because the batch planner does not have any rule to explicitly handle the EventTimeWatermark logical plan.
      The right solution is to simply remove the plan node, as the watermark should not affect any batch query in any way.
      
      Changes:
      - In this PR, we add a new rule, `EliminateEventTimeWatermark`, to check whether the event time watermark should be ignored. The watermark is ignored in any batch query.

      Depends upon:
      - [SPARK-20672](https://issues.apache.org/jira/browse/SPARK-20672). We cannot add this rule to the analyzer directly, because the streaming query is copied to `triggerLogicalPlan` on every trigger, and the rule would mistakenly be applied to `triggerLogicalPlan`.

      Others:
      - A typo fix in an example.
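      A minimal batch example of the scenario above (a sketch; column names and data are illustrative):

      ```scala
      import java.sql.Timestamp
      import org.apache.spark.sql.SparkSession

      object BatchWatermarkSketch extends App {
        val spark = SparkSession.builder().master("local[1]").appName("batch-watermark").getOrCreate()
        import spark.implicits._

        val batchDf = Seq(
          (Timestamp.valueOf("2017-05-09 00:00:00"), 1),
          (Timestamp.valueOf("2017-05-09 00:05:00"), 2)
        ).toDF("eventTime", "value")

        // Before this fix, the batch planner had no rule for the EventTimeWatermark node,
        // so a query like this failed; with the new rule the watermark node is removed.
        batchDf.withWatermark("eventTime", "10 minutes")
          .groupBy($"eventTime")
          .count()
          .show()

        spark.stop()
      }
      ```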
      
      ## How was this patch tested?
      
      Added a new unit test.
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #17896 from uncleGen/SPARK-20373.
      c0189abc
    • Yin Huai's avatar
      Revert "[SPARK-20311][SQL] Support aliases for table value functions" · f79aa285
      Yin Huai authored
      This reverts commit 714811d0.
      f79aa285
    • Reynold Xin's avatar
      Revert "[SPARK-12297][SQL] Hive compatibility for Parquet Timestamps" · ac1ab6b9
      Reynold Xin authored
      This reverts commit 22691556.
      
      See JIRA ticket for more information.
      ac1ab6b9
    • Sean Owen's avatar
      [SPARK-19876][BUILD] Move Trigger.java to java source hierarchy · 25ee816e
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Simply moves `Trigger.java` to `src/main/java` from `src/main/scala`
      See https://github.com/apache/spark/pull/17219
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17921 from srowen/SPARK-19876.2.
      25ee816e
    • Reynold Xin's avatar
      [SPARK-20674][SQL] Support registering UserDefinedFunction as named UDF · d099f414
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      For some reason we don't have an API to register a UserDefinedFunction as a named UDF. It is a no-brainer to add one, in addition to the existing register functions we have.
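      A usage sketch of the new registration path (illustrative; `plusOne` is just an example name):

      ```scala
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions.udf

      object RegisterNamedUdfSketch extends App {
        val spark = SparkSession.builder().master("local[1]").appName("register-udf").getOrCreate()
        import spark.implicits._

        // Build a UserDefinedFunction first, then register it under a name so it can be
        // used from SQL as well as from the DataFrame API.
        val plusOne = udf((i: Int) => i + 1)
        spark.udf.register("plusOne", plusOne)

        Seq(1, 2, 3).toDF("id").createOrReplaceTempView("t")
        spark.sql("SELECT id, plusOne(id) FROM t").show()

        spark.stop()
      }
      ```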
      
      ## How was this patch tested?
      Added a test case in UDFSuite for the new API.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #17915 from rxin/SPARK-20674.
      d099f414
    • Takeshi Yamamuro's avatar
      [SPARK-20311][SQL] Support aliases for table value functions · 714811d0
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR adds parsing rules to support aliases for table value functions.
      
      ## How was this patch tested?
      Added tests in `PlanParserSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #17666 from maropu/SPARK-20311.
      714811d0
    • Xiao Li's avatar
      [SPARK-20667][SQL][TESTS] Cleanup the cataloged metadata after completing the... · 0d00c768
      Xiao Li authored
      [SPARK-20667][SQL][TESTS] Cleanup the cataloged metadata after completing the package of sql/core and sql/hive
      
      ## What changes were proposed in this pull request?
      
      So far, we do not drop all the cataloged objects after each package. Sometimes, we might hit strange test case errors because the previous test suite did not drop the cataloged/temporary objects (tables/functions/databases). At least, we can first clean up the environment when completing the package of `sql/core` and `sql/hive`.
      
      ## How was this patch tested?
      N/A
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17908 from gatorsmile/reset.
      0d00c768
  3. May 08, 2017
    • sujith71955's avatar
      [SPARK-20380][SQL] Unable to set/unset table comment property using ALTER... · 42cc6d13
      sujith71955 authored
      [SPARK-20380][SQL] Unable to set/unset table comment property using ALTER TABLE SET/UNSET TBLPROPERTIES ddl
      
      ### What changes were proposed in this pull request?
      The table comment was not being set/unset by the **ALTER TABLE SET/UNSET TBLPROPERTIES** query,
      e.g. ALTER TABLE table_with_comment SET TBLPROPERTIES ("comment" = "modified comment").
      When a user altered the table properties and added/updated the table comment, the comment field of the **CatalogTable** instance was not updated, so the old comment (if any) was still shown. To handle this issue, the comment field of **CatalogTable** is now updated with the newly added/modified comment, along with the other table-level properties, when the user executes an **ALTER TABLE SET TBLPROPERTIES** query.

      This PR also takes care of unsetting the table comment when the user executes an **ALTER TABLE UNSET TBLPROPERTIES** query to remove it,
      e.g. ALTER TABLE table_comment UNSET TBLPROPERTIES IF EXISTS ('comment').
      
      ### How was this patch tested?
      Added test cases in **SQLQueryTestSuite** that verify the table comment via a DESC FORMATTED query after adding/modifying it through **AlterTableSetPropertiesCommand** and unsetting it through **AlterTableUnsetPropertiesCommand**.
      
      Author: sujith71955 <sujithchacko.2010@gmail.com>
      
      Closes #17649 from sujith71955/alter_table_comment.
      42cc6d13
  4. May 07, 2017
    • Imran Rashid's avatar
      [SPARK-12297][SQL] Hive compatibility for Parquet Timestamps · 22691556
      Imran Rashid authored
      ## What changes were proposed in this pull request?
      
      This change allows timestamps in Parquet-based Hive tables to behave as a "floating time", without a timezone, as timestamps do for other file formats.  If the storage timezone is the same as the session timezone, this conversion is a no-op.  When data is read from a Hive table, the table property is *always* respected.  This allows Spark to keep its existing behavior when reading old data, while reading newly written data correctly (whatever the source of the data is).
      
      Spark inherited the original behavior from Hive, but Hive is also updating behavior to use the same scheme in HIVE-12767 / HIVE-16231.
      
      The default for Spark remains unchanged; created tables do not include the new table property.
      
      This will only apply to hive tables; nothing is added to parquet metadata to indicate the timezone, so data that is read or written directly from parquet files will never have any conversions applied.
      
      ## How was this patch tested?
      
      Added a unit test which creates tables, reads and writes data, under a variety of permutations (different storage timezones, different session timezones, vectorized reading on and off).
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #16781 from squito/SPARK-12297.
      22691556
    • Jacek Laskowski's avatar
      [MINOR][SQL][DOCS] Improve unix_timestamp's scaladoc (and typo hunting) · 500436b4
      Jacek Laskowski authored
      ## What changes were proposed in this pull request?
      
      * Docs are consistent (across different `unix_timestamp` variants and their internal expressions)
      * typo hunting
      
      ## How was this patch tested?
      
      local build
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #17801 from jaceklaskowski/unix_timestamp.
      500436b4
    • Xiao Li's avatar
      [SPARK-20557][SQL] Support JDBC data type Time with Time Zone · cafca54c
      Xiao Li authored
      ### What changes were proposed in this pull request?
      
      This PR is to support the JDBC data type TIME WITH TIME ZONE. It can be converted to TIMESTAMP.
      
      In addition, before this PR, for unsupported data types, we simply output the type number instead of the type name.
      
      ```
      java.sql.SQLException: Unsupported type 2014
      ```
      After this PR, the message is like
      ```
      java.sql.SQLException: Unsupported type TIMESTAMP_WITH_TIMEZONE
      ```
      
      - Also upgrade the H2 version to `1.4.195`, which has the type fix for "TIMESTAMP WITH TIMEZONE". However, the type is not fully supported there, so we capture the exception, but we still use H2 to partially test the "TIMESTAMP WITH TIMEZONE" support, because the Docker tests are not run regularly.
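      One way to produce the friendlier message is to map the raw JDBC type code to its `java.sql.JDBCType` name (a hedged illustration of the idea; the actual Spark code may differ):

      ```scala
      import java.sql.JDBCType

      // Map a raw java.sql.Types code to a readable name; 2014 is TIMESTAMP_WITH_TIMEZONE,
      // which matches the error message shown above.
      def unsupportedTypeMessage(sqlType: Int): String =
        try s"Unsupported type ${JDBCType.valueOf(sqlType).getName}"
        catch { case _: IllegalArgumentException => s"Unsupported type $sqlType" }

      println(unsupportedTypeMessage(2014))  // Unsupported type TIMESTAMP_WITH_TIMEZONE
      ```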
      
      ### How was this patch tested?
      Added test cases.
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17835 from gatorsmile/h2.
      cafca54c
  5. May 05, 2017
  6. May 04, 2017
  7. May 03, 2017
    • hyukjinkwon's avatar
      [MINOR][SQL] Fix the test title from =!= to <=>, remove a duplicated test and add a test for =!= · 13eb37c8
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes three things as below:
      
      - This test does not appear to test `<=>` and is identical to the `===` test above, so this PR removes it.
      
        ```diff
        -   test("<=>") {
        -     checkAnswer(
        -      testData2.filter($"a" === 1),
        -      testData2.collect().toSeq.filter(r => r.getInt(0) == 1))
        -
        -    checkAnswer(
        -      testData2.filter($"a" === $"b"),
        -      testData2.collect().toSeq.filter(r => r.getInt(0) == r.getInt(1)))
        -   }
        ```
      
      - Rename the test from `=!=` to `<=>`, as it appears to actually test `<=>`.
      
        ```diff
        +  private lazy val nullData = Seq(
        +    (Some(1), Some(1)), (Some(1), Some(2)), (Some(1), None), (None, None)).toDF("a", "b")
        +
          ...
        -  test("=!=") {
        +  test("<=>") {
        -    val nullData = spark.createDataFrame(sparkContext.parallelize(
        -      Row(1, 1) ::
        -      Row(1, 2) ::
        -      Row(1, null) ::
        -      Row(null, null) :: Nil),
        -      StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType))))
        -
               checkAnswer(
                 nullData.filter($"b" <=> 1),
          ...
        ```
      
      - Add tests for `=!=`, which appear to be missing.
      
        ```diff
        +  test("=!=") {
        +    checkAnswer(
        +      nullData.filter($"b" =!= 1),
        +      Row(1, 2) :: Nil)
        +
        +    checkAnswer(nullData.filter($"b" =!= null), Nil)
        +
        +    checkAnswer(
        +      nullData.filter($"a" =!= $"b"),
        +      Row(1, 2) :: Nil)
        +  }
        ```
      
      ## How was this patch tested?
      
      Manually running the tests.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17842 from HyukjinKwon/minor-test-fix.
      13eb37c8
    • Liwei Lin's avatar
      [SPARK-19965][SS] DataFrame batch reader may fail to infer partitions when... · 6b9e49d1
      Liwei Lin authored
      [SPARK-19965][SS] DataFrame batch reader may fail to infer partitions when reading FileStreamSink's output
      
      ## The Problem
      
      Right now the DataFrame batch reader may fail to infer partitions when reading a FileStreamSink's output:
      
      ```
      [info] - partitioned writing and batch reading with 'basePath' *** FAILED *** (3 seconds, 928 milliseconds)
      [info]   java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
      [info] 	***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637
      [info] 	***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637/_spark_metadata
      [info]
      [info] If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.
      [info]   at scala.Predef$.assert(Predef.scala:170)
      [info]   at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:133)
      [info]   at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:98)
      [info]   at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:156)
      [info]   at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:54)
      [info]   at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:55)
      [info]   at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:133)
      [info]   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
      [info]   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:160)
      [info]   at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:536)
      [info]   at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:520)
      [info]   at org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply$mcV$sp(FileStreamSinkSuite.scala:292)
      [info]   at org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268)
      [info]   at org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268)
      ```
      
      ## What changes were proposed in this pull request?
      
      This patch alters `InMemoryFileIndex` to filter out those `basePath`s whose ancestor is the streaming metadata dir (`_spark_metadata`). E.g., the following (and other similar dirs or files) will be filtered out; a hedged sketch of this check follows the list:
      - (introduced by globbing `basePath/*`)
         - `basePath/_spark_metadata`
      - (introduced by globbing `basePath/*/*`)
         - `basePath/_spark_metadata/0`
         - `basePath/_spark_metadata/1`
         - ...
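      A hedged sketch of this check (illustrative only, not the actual `InMemoryFileIndex` change):

      ```scala
      import org.apache.hadoop.fs.Path

      // Drop any candidate path that is, or sits under, the streaming metadata directory.
      val metadataDirName = "_spark_metadata"

      def isUnderMetadataDir(path: Path): Boolean =
        Iterator.iterate(path)(_.getParent)
          .takeWhile(_ != null)
          .exists(_.getName == metadataDirName)

      // e.g. isUnderMetadataDir(new Path("s3://bucket/out/_spark_metadata/0")) == true
      //      isUnderMetadataDir(new Path("s3://bucket/out/part-0000.parquet")) == false
      ```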
      
      ## How was this patch tested?
      
      Added unit tests
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #17346 from lw-lin/filter-metadata.
      6b9e49d1
    • Reynold Xin's avatar
      [SPARK-20576][SQL] Support generic hint function in Dataset/DataFrame · 527fc5d0
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      We allow users to specify hints (currently only "broadcast" is supported) in SQL and DataFrame. However, while SQL has a standard hint format (/*+ ... */), DataFrame doesn't have one, and users are sometimes confused because they can't find how to apply a broadcast hint. This ticket adds a generic hint function to DataFrame that allows using the same hints on DataFrames as in SQL.
      
      As an example, after this patch, the following will apply a broadcast hint on a DataFrame using the new hint function:
      
      ```
      df1.join(df2.hint("broadcast"))
      ```
      
      ## How was this patch tested?
      Added a test case in DataFrameJoinSuite.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #17839 from rxin/SPARK-20576.
      527fc5d0
    • Liwei Lin's avatar
      [SPARK-20441][SPARK-20432][SS] Within the same streaming query, one... · 27f543b1
      Liwei Lin authored
      [SPARK-20441][SPARK-20432][SS] Within the same streaming query, one StreamingRelation should only be transformed to one StreamingExecutionRelation
      
      ## What changes were proposed in this pull request?
      
      Within the same streaming query, when one `StreamingRelation` is referred to multiple times – e.g. `df.union(df)` – we should transform it into only one `StreamingExecutionRelation`, instead of two or more different `StreamingExecutionRelation`s (each of which would have a separate set of sources, source logs, ...).
      
      ## How was this patch tested?
      
      Added two test cases, each of which would fail without this patch.
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #17735 from lw-lin/SPARK-20441.
      27f543b1
    • Sean Owen's avatar
      [SPARK-20523][BUILD] Clean up build warnings for 2.2.0 release · 16fab6b0
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Fix build warnings, primarily related to Breeze 0.13 operator changes and Java style problems.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17803 from srowen/SPARK-20523.
      16fab6b0
    • Michael Armbrust's avatar
      [SPARK-20567] Lazily bind in GenerateExec · 6235132a
      Michael Armbrust authored
      It is not valid to eagerly bind with the child's output as this causes failures when we attempt to canonicalize the plan (replacing the attribute references with dummies).
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #17838 from marmbrus/fixBindExplode.
      6235132a
  8. May 02, 2017
    • Xiao Li's avatar
      [SPARK-19235][SQL][TEST][FOLLOW-UP] Enable Test Cases in DDLSuite with Hive Metastore · b1e639ab
      Xiao Li authored
      ### What changes were proposed in this pull request?
      This is a follow-up of enabling test cases in DDLSuite with Hive Metastore. It consists of the following remaining tasks:
      - Run all the `alter table` and `drop table` DDL tests against data source tables when using Hive metastore.
      - Do not run any `alter table` and `drop table` DDL test against Hive serde tables when using InMemoryCatalog.
      - Reenable `alter table: set serde partition` and `alter table: set serde` tests for Hive serde tables.
      
      ### How was this patch tested?
      N/A
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17524 from gatorsmile/cleanupDDLSuite.
      b1e639ab
    • Burak Yavuz's avatar
      [SPARK-20549] java.io.CharConversionException: 'Invalid UTF-32' in JsonToStructs · 86174ea8
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      A fix for the same problem was made in #17693 but ignored `JsonToStructs`. This PR uses the same fix for `JsonToStructs`.
      
      ## How was this patch tested?
      
      Regression test
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #17826 from brkyvz/SPARK-20549.
      86174ea8
    • Kazuaki Ishizaki's avatar
      [SPARK-20537][CORE] Fixing OffHeapColumnVector reallocation · afb21bf2
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      As #17773 revealed, `OnHeapColumnVector` may copy only a part of the original storage.

      `OffHeapColumnVector` reallocation likewise copies data to the new storage only up to `elementsAppended`. This variable is only updated by the `ColumnVector.appendX` API, while `ColumnVector.putX` is more commonly used.
      This PR makes `OffHeapColumnVector` copy data up to the previously-allocated size when reallocating.
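      The gist, as a hedged sketch on a plain array (the real code manages off-heap memory, so this is only illustrative):

      ```scala
      // When growing the buffer, copy the whole previously-allocated region, not just
      // `elementsAppended` entries, because putX-style writes do not advance that counter.
      def grow(old: Array[Byte], oldCapacityInBytes: Int, newCapacityInBytes: Int): Array[Byte] = {
        val grown = new Array[Byte](newCapacityInBytes)
        System.arraycopy(old, 0, grown, 0, oldCapacityInBytes)  // copy up to the old capacity
        grown
      }
      ```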
      
      ## How was this patch tested?
      
      Existing test suites
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #17811 from kiszk/SPARK-20537.
      afb21bf2
  9. May 01, 2017
  10. Apr 30, 2017
    • hyukjinkwon's avatar
      [SPARK-20492][SQL] Do not print empty parentheses for invalid primitive types in parser · 1ee494d0
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Currently, when the type string is invalid, the error message prints empty parentheses. This PR proposes a small improvement to the error message by removing them in the parser, as below:
      
      ```scala
      spark.range(1).select($"col".cast("aa"))
      ```
      
      **Before**
      
      ```
      org.apache.spark.sql.catalyst.parser.ParseException:
      DataType aa() is not supported.(line 1, pos 0)
      
      == SQL ==
      aa
      ^^^
      ```
      
      **After**
      
      ```
      org.apache.spark.sql.catalyst.parser.ParseException:
      DataType aa is not supported.(line 1, pos 0)
      
      == SQL ==
      aa
      ^^^
      ```
      
      ## How was this patch tested?
      
      Unit tests in `DataTypeParserSuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17784 from HyukjinKwon/SPARK-20492.
      1ee494d0
  11. Apr 29, 2017
    • hyukjinkwon's avatar
      [SPARK-20442][PYTHON][DOCS] Fill up documentations for functions in Column API in PySpark · d228cd0b
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to fill up the documentation with examples for `bitwiseOR`, `bitwiseAND`, `bitwiseXOR`, `contains`, `asc` and `desc` in the `Column` API.
      
      Also, this PR fixes minor typos in the documentation and matches some of the contents between Scala doc and Python doc.
      
      Lastly, this PR suggests using `spark` rather than `sc` in the doctests in `Column` for the Python documentation.
      
      ## How was this patch tested?
      
      Doc tests were added and manually tested with the commands below:
      
      `./python/run-tests.py --module pyspark-sql`
      `./python/run-tests.py --module pyspark-sql --python-executable python3`
      `./dev/lint-python`
      
      Output was checked via `make html` under `./python/docs`. Snapshots will be left as comments on the code.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17737 from HyukjinKwon/SPARK-20442.
      d228cd0b