  1. May 10, 2017
    • [SPARK-20686][SQL] PropagateEmptyRelation incorrectly handles aggregate without grouping · a90c5cd8
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      The query
      
      ```
      SELECT 1 FROM (SELECT COUNT(*) WHERE FALSE) t1
      ```
      
      should return a single row of output because the subquery is an aggregate without a group-by, which always produces exactly one row. However, Spark incorrectly returns zero rows.
      
      This is caused by SPARK-16208 / #13906, a patch which added an optimizer rule to propagate EmptyRelation through operators. Its logic for handling aggregates is wrong: it checks whether the aggregate expressions are non-empty when deciding whether the output should be empty, whereas it should check the grouping expressions instead:
      
      An aggregate with non-empty grouping expressions will return one output row per group. If the input to the grouped aggregate is empty then all groups will be empty and thus the output will be empty. It doesn't matter whether the aggregation output columns include aggregate expressions, since that doesn't affect the number of output rows.
      
      If the grouping expressions are empty, however, then the aggregate will always produce a single output row and thus we cannot propagate the EmptyRelation.
      
      The current implementation is incorrect and also misses an optimization opportunity by not propagating EmptyRelation in the case where a grouped aggregate has aggregate expressions (in other words, `SELECT COUNT(*) from emptyRelation GROUP BY x` would _not_ be optimized to `EmptyRelation` in the old code, even though it safely could be).
      
      This patch resolves this issue by modifying `PropagateEmptyRelation` to consider only the presence/absence of grouping expressions, not the aggregate functions themselves, when deciding whether to propagate EmptyRelation.
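      
      To make the corrected condition concrete, here is a minimal, self-contained sketch (toy classes for illustration, not Spark's actual Catalyst types or the rule itself), assuming a plan is either an empty relation or an aggregate over a child:
      
      ```scala
      object PropagateEmptyRelationSketch {
        sealed trait Plan
        case object EmptyRelation extends Plan
        case class Aggregate(grouping: Seq[String], aggregates: Seq[String], child: Plan) extends Plan
      
        def propagateEmpty(plan: Plan): Plan = plan match {
          // Grouped aggregate over an empty child: every group is empty, so the result is empty.
          case Aggregate(grouping, _, EmptyRelation) if grouping.nonEmpty => EmptyRelation
          // Global aggregate (no GROUP BY) always emits exactly one row, so it must be kept.
          case other => other
        }
      
        def main(args: Array[String]): Unit = {
          // SELECT COUNT(*) WHERE FALSE: a global aggregate, must NOT be replaced.
          assert(propagateEmpty(Aggregate(Nil, Seq("count(1)"), EmptyRelation)) != EmptyRelation)
          // SELECT COUNT(*) FROM emptyRelation GROUP BY x: safe to collapse to an empty relation.
          assert(propagateEmpty(Aggregate(Seq("x"), Seq("count(1)"), EmptyRelation)) == EmptyRelation)
        }
      }
      ```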
      
      ## How was this patch tested?
      
      - Added end-to-end regression tests in `SQLQueryTest`'s `group-by.sql` file.
      - Updated unit tests in `PropagateEmptyRelationSuite`.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #17929 from JoshRosen/fix-PropagateEmptyRelation.
      a90c5cd8
    • [SPARK-20590][SQL] Use Spark internal datasource if multiples are found for the same shortened name · 3d2131ab
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      One of the common usability problems around reading data in Spark (particularly CSV) is that there can often be a conflict between different readers on the classpath.
      
      As an example, if someone launches a 2.x Spark shell with the spark-csv package on the classpath, Spark currently fails in an extremely unfriendly way (see databricks/spark-csv#367):
      
      ```bash
      ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
      scala> val df = spark.read.csv("/foo/bar.csv")
      java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name.
        at scala.sys.package$.error(package.scala:27)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:574)
        at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:85)
        at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:85)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:295)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
        at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
        at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
        ... 48 elided
      ```
      
      This PR proposes a simple way of fixing this error: pick the internal datasource if exactly one of the matches has the "org.apache.spark" prefix (a sketch of the idea is shown after the examples below).
      
      ```scala
      scala> spark.range(1).write.format("csv").mode("overwrite").save("/tmp/abc")
      17/05/10 09:47:44 WARN DataSource: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat,
      com.databricks.spark.csv.DefaultSource15), defaulting to the internal datasource (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat).
      ```
      
      ```scala
      scala> spark.range(1).write.format("Csv").mode("overwrite").save("/tmp/abc")
      17/05/10 09:47:52 WARN DataSource: Multiple sources found for Csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat,
      com.databricks.spark.csv.DefaultSource15), defaulting to the internal datasource (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat).
      ```
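      
      A minimal sketch of the disambiguation idea (illustrative names only, not the actual `DataSource.lookupDataSource` implementation), assuming the matched provider class names are already known:
      
      ```scala
      object PickInternalSourceSketch {
        def resolve(shortName: String, matches: Seq[String]): String = matches match {
          case Seq(single) => single
          case several =>
            // Prefer the single provider under the "org.apache.spark" namespace, if any.
            val internal = several.filter(_.startsWith("org.apache.spark"))
            if (internal.size == 1) {
              println(s"WARN Multiple sources found for $shortName (${several.mkString(", ")}), " +
                s"defaulting to the internal datasource (${internal.head}).")
              internal.head
            } else {
              sys.error(s"Multiple sources found for $shortName (${several.mkString(", ")}), " +
                "please specify the fully qualified class name.")
            }
        }
      
        def main(args: Array[String]): Unit = {
          println(resolve("csv", Seq(
            "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat",
            "com.databricks.spark.csv.DefaultSource15")))
        }
      }
      ```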
      
      ## How was this patch tested?
      
      Manually tested as below:
      
      ```bash
      ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
      ```
      
      ```scala
      spark.sparkContext.setLogLevel("WARN")
      ```
      
      **positive cases**:
      
      ```scala
      scala> spark.range(1).write.format("csv").mode("overwrite").save("/tmp/abc")
      17/05/10 09:47:44 WARN DataSource: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat,
      com.databricks.spark.csv.DefaultSource15), defaulting to the internal datasource (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat).
      ```
      
      ```scala
      scala> spark.range(1).write.format("Csv").mode("overwrite").save("/tmp/abc")
      17/05/10 09:47:52 WARN DataSource: Multiple sources found for Csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat,
      com.databricks.spark.csv.DefaultSource15), defaulting to the internal datasource (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat).
      ```
      
      (newlines were inserted for readability).
      
      ```scala
      scala> spark.range(1).write.format("com.databricks.spark.csv").mode("overwrite").save("/tmp/abc")
      ```
      
      ```scala
      scala> spark.range(1).write.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").mode("overwrite").save("/tmp/abc")
      ```
      
      **negative cases**:
      
      ```scala
      scala> spark.range(1).write.format("com.databricks.spark.csv.CsvRelation").save("/tmp/abc")
      java.lang.InstantiationException: com.databricks.spark.csv.CsvRelation
      ...
      ```
      
      ```scala
      scala> spark.range(1).write.format("com.databricks.spark.csv.CsvRelatio").save("/tmp/abc")
      java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv.CsvRelatio. Please find packages at http://spark.apache.org/third-party-projects.html
      ...
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17916 from HyukjinKwon/datasource-detect.
      3d2131ab
  2. May 09, 2017
    • [SPARK-17685][SQL] Make SortMergeJoinExec's currentVars null when calling createJoinKey · 771abeb4
      Yuming Wang authored
      ## What changes were proposed in this pull request?
      
      The following SQL query causes an `IndexOutOfBoundsException` when `LIMIT > 1310720`:
      ```sql
      CREATE TABLE tab1(int int, int2 int, str string);
      CREATE TABLE tab2(int int, int2 int, str string);
      INSERT INTO tab1 values(1,1,'str');
      INSERT INTO tab1 values(2,2,'str');
      INSERT INTO tab2 values(1,1,'str');
      INSERT INTO tab2 values(2,3,'str');
      
      SELECT
        count(*)
      FROM
        (
          SELECT t1.int, t2.int2
          FROM (SELECT * FROM tab1 LIMIT 1310721) t1
          INNER JOIN (SELECT * FROM tab2 LIMIT 1310721) t2
          ON (t1.int = t2.int AND t1.int2 = t2.int2)
        ) t;
      ```
      
      This pull request fixes this issue.
      
      ## How was this patch tested?
      
      unit tests
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #17920 from wangyum/SPARK-17685.
      771abeb4
    • [SPARK-20373][SQL][SS] Batch queries with `Dataset/DataFrame.withWatermark()` do not execute · c0189abc
      uncleGen authored
      ## What changes were proposed in this pull request?
      
      Any Dataset/DataFrame batch query with the operation `withWatermark` does not execute because the batch planner does not have any rule to explicitly handle the EventTimeWatermark logical plan.
      The right solution is to simply remove the plan node, as the watermark should not affect any batch query in any way.
      
      Changes:
      - In this PR, we add a new rule, `EliminateEventTimeWatermark`, that checks whether the event-time watermark should be ignored; it is always ignored in batch queries (a toy sketch of the idea follows these notes).
      
      Depends upon:
      - [SPARK-20672](https://issues.apache.org/jira/browse/SPARK-20672). We cannot add this rule to the analyzer directly, because the streaming query is copied to `triggerLogicalPlan` on every trigger, and the rule would mistakenly be applied to `triggerLogicalPlan`.
      
      Others:
      - A typo fix in an example.
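      
      A toy sketch of the idea behind such a rule (stand-in classes, not Spark's Catalyst `EventTimeWatermark` node or the real `EliminateEventTimeWatermark` rule):
      
      ```scala
      object EliminateWatermarkSketch {
        sealed trait Plan { def isStreaming: Boolean = false }
        case class BatchScan(table: String) extends Plan
        case class EventTimeWatermark(eventTimeCol: String, delay: String, child: Plan) extends Plan {
          override def isStreaming: Boolean = child.isStreaming
        }
      
        def eliminate(plan: Plan): Plan = plan match {
          // Batch query: the watermark carries no meaning, so drop the node and keep its child.
          case EventTimeWatermark(_, _, child) if !child.isStreaming => child
          case other => other
        }
      
        def main(args: Array[String]): Unit = {
          assert(eliminate(EventTimeWatermark("eventTime", "10 minutes", BatchScan("t"))) == BatchScan("t"))
        }
      }
      ```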
      
      ## How was this patch tested?
      
      Added a new unit test.
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #17896 from uncleGen/SPARK-20373.
      c0189abc
    • Revert "[SPARK-20311][SQL] Support aliases for table value functions" · f79aa285
      Yin Huai authored
      This reverts commit 714811d0.
      f79aa285
    • Revert "[SPARK-12297][SQL] Hive compatibility for Parquet Timestamps" · ac1ab6b9
      Reynold Xin authored
      This reverts commit 22691556.
      
      See JIRA ticket for more information.
      ac1ab6b9
    • [SPARK-20627][PYSPARK] Drop the hadoop distribution name from the Python version · 1b85bcd9
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      Drop the hadoop distribution name from the Python version (PEP 440 - https://www.python.org/dev/peps/pep-0440/). We've been using the local version string to disambiguate between different hadoop versions packaged with PySpark, but PEP 440 states that local versions should not be used when publishing upstream. Since we no longer make PySpark pip packages for different hadoop versions, we can simply drop the hadoop information. If at a later point we need to start publishing different hadoop versions we can look at making different packages or similar.
      
      ## How was this patch tested?
      
      Ran `make-distribution` locally
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #17885 from holdenk/SPARK-20627-remove-pip-local-version-string.
      1b85bcd9
    • [SPARK-19876][BUILD] Move Trigger.java to java source hierarchy · 25ee816e
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Simply moves `Trigger.java` to `src/main/java` from `src/main/scala`
      See https://github.com/apache/spark/pull/17219
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17921 from srowen/SPARK-19876.2.
      25ee816e
    • [SPARK-20674][SQL] Support registering UserDefinedFunction as named UDF · d099f414
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      For some reason we don't have an API to register a UserDefinedFunction as a named UDF. It is a no-brainer to add one, in addition to the existing register functions we have.
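      
      A usage sketch of the kind of registration this enables (hedged: the exact overload depends on the Spark version; `plus_one` is a hypothetical UDF used only for illustration):
      
      ```scala
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions.udf
      
      object NamedUdfSketch {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().master("local[*]").appName("udf-demo").getOrCreate()
          // Build an anonymous UserDefinedFunction first...
          val plusOne = udf((x: Long) => x + 1)
          // ...then register it under a name so it is also usable from SQL.
          spark.udf.register("plus_one", plusOne)
          spark.range(3).createOrReplaceTempView("t")
          spark.sql("SELECT plus_one(id) FROM t").show()
          spark.stop()
        }
      }
      ```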
      
      ## How was this patch tested?
      Added a test case in UDFSuite for the new API.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #17915 from rxin/SPARK-20674.
      d099f414
    • [SPARK-20548][FLAKY-TEST] share one REPL instance among REPL test cases · f561a76b
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      `ReplSuite.newProductSeqEncoder with REPL defined class` was flaky and frequently threw OOM exceptions. By analyzing the heap dump, we found the reason: in each test case of `ReplSuite`, we create a REPL instance, which creates a classloader and loads a lot of classes related to `SparkContext`. For more details, please see https://github.com/apache/spark/pull/17833#issuecomment-298711435.
      
      In this PR, we create a new test suite, `SingletonReplSuite`, which shares one REPL instance among all the test cases. Then we move most of the tests from `ReplSuite` to `SingletonReplSuite`, to avoid creating a lot of REPL instances and reduce memory footprint.
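      
      The general pattern is a shared fixture created once per suite; a hedged sketch (plain ScalaTest-style code with a stand-in for the REPL, not the actual `SingletonReplSuite`):
      
      ```scala
      import org.scalatest.{BeforeAndAfterAll, FunSuite}
      
      class SharedFixtureSuiteSketch extends FunSuite with BeforeAndAfterAll {
        // Stand-in for the REPL instance; in the real suite this would be the interpreter.
        private var repl: StringBuilder = _
      
        override def beforeAll(): Unit = {
          super.beforeAll()
          repl = new StringBuilder   // expensive setup happens exactly once per suite
        }
      
        override def afterAll(): Unit = {
          try repl = null finally super.afterAll()
        }
      
        test("case 1 reuses the shared instance") { assert(repl != null) }
        test("case 2 reuses the shared instance") { assert(repl != null) }
      }
      ```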
      
      ## How was this patch tested?
      
      test only change
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #17844 from cloud-fan/flaky-test.
      f561a76b
    • [SPARK-20355] Add per application spark version on the history server header page · 181261a8
      Sanket authored
      ## What changes were proposed in this pull request?
      
      The Spark version for a specific application is not currently displayed on the history page. It would be nice to show the Spark version in the UI when we click on a specific application.
      Currently there seems to be a way, since SparkListenerLogStart records the application's Spark version, so it should be trivial to listen to this event and surface the version in the UI.
      For example:
      <img width="1439" alt="screen shot 2017-04-06 at 3 23 41 pm" src="https://cloud.githubusercontent.com/assets/8295799/25092650/41f3970a-2354-11e7-9b0d-4646d0adeb61.png">
      <img width="1399" alt="screen shot 2017-04-17 at 9 59 33 am" src="https://cloud.githubusercontent.com/assets/8295799/25092743/9f9e2f28-2354-11e7-9605-f2f1c63f21fe.png">
      
      {"Event":"SparkListenerLogStart","Spark Version":"2.0.0"}
      Modified the history server's SparkUI to listen to the SparkListenerLogStart event, extract the Spark version, and display it.
      
      ## How was this patch tested?
      Manual testing of the UI page. The UI screenshots are attached above.
      
      
      Author: Sanket <schintap@untilservice-lm>
      
      Closes #17658 from redsanket/SPARK-20355.
      181261a8
    • [SPARK-20311][SQL] Support aliases for table value functions · 714811d0
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR adds parsing rules to support aliases for table value functions.
      
      ## How was this patch tested?
      Added tests in `PlanParserSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #17666 from maropu/SPARK-20311.
      714811d0
    • [SPARK-20667][SQL][TESTS] Cleanup the cataloged metadata after completing the... · 0d00c768
      Xiao Li authored
      [SPARK-20667][SQL][TESTS] Cleanup the cataloged metadata after completing the package of sql/core and sql/hive
      
      ## What changes were proposed in this pull request?
      
      So far, we do not drop all the cataloged objects after each test package. Sometimes, we might hit strange test case errors because the previous test suite did not drop the cataloged/temporary objects (tables/functions/databases). At least, we can first clean up the environment when completing the packages of `sql/core` and `sql/hive`.
      
      ## How was this patch tested?
      N/A
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17908 from gatorsmile/reset.
      0d00c768
    • [SPARK-20606][ML] ML 2.2 QA: Remove deprecated methods for ML · b8733e0a
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Remove ML methods we deprecated in 2.1.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #17867 from yanboliang/spark-20606.
      b8733e0a
    • [SPARK-20615][ML][TEST] SparseVector.argmax throws IndexOutOfBoundsException · be53a783
      Jon McLean authored
      ## What changes were proposed in this pull request?
      
      Added a check for the number of defined values. Previously the argmax function assumed that at least one value was defined if the vector size was greater than zero.
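      
      A hedged sketch of the guard (toy code over parallel index/value arrays, not the MLlib `SparseVector` implementation):
      
      ```scala
      object SparseArgmaxSketch {
        // Toy stand-in for a sparse vector: logical `size`, plus parallel arrays of the
        // explicitly stored indices and values (every other entry is an implicit 0.0).
        def argmax(size: Int, indices: Array[Int], values: Array[Double]): Int = {
          if (size == 0) {
            -1    // empty vector: there is no valid index
          } else if (values.isEmpty) {
            0     // nothing stored: all entries are 0.0, so index 0 is a maximum
          } else {
            // Normal path: pick the stored index with the largest value. (The full fix also
            // considers an implicit zero beating all-negative stored values; omitted here.)
            indices(values.indices.maxBy(i => values(i)))
          }
        }
      
        def main(args: Array[String]): Unit = {
          // Previously this shape (size > 0 but no stored values) threw IndexOutOfBoundsException.
          assert(argmax(5, Array.empty[Int], Array.empty[Double]) == 0)
        }
      }
      ```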
      
      ## How was this patch tested?
      
      Tests were added to the existing VectorsSuite to cover this case.
      
      Author: Jon McLean <jon.mclean@atsid.com>
      
      Closes #17877 from jonmclean/vectorArgmaxIndexBug.
      be53a783
    • [SPARK-20587][ML] Improve performance of ML ALS recommendForAll · 10b00aba
      Nick Pentreath authored
      This PR is a `DataFrame` version of #17742 for [SPARK-11968](https://issues.apache.org/jira/browse/SPARK-11968), for improving the performance of `recommendAll` methods.
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #17845 from MLnick/ml-als-perf.
      10b00aba
    • [SPARK-11968][MLLIB] Optimize MLLIB ALS recommendForAll · 80794247
      Peng authored
      The recommendForAll of MLlib ALS is very slow, and GC is a key problem of the current method.
      The task uses the following code to keep the temporary result:
      `val output = new Array[(Int, (Int, Double))](m*n)`
      with m = n = 4096 (the default value; there is no way to set it),
      so `output` is about 4k * 4k * (4 + 4 + 8) bytes = 256 MB. This is a large allocation that causes serious GC problems, and it frequently OOMs.
      
      Actually, we don't need to save the whole temporary result. Suppose we recommend topK (topK is about 10 or 20) products for each user; then we only need 4k * topK * (4 + 4 + 8) bytes to hold the temporary result.
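      
      A minimal sketch of the bounded top-K idea (illustrative only, not the MLlib code): keep at most topK candidates in a small min-heap instead of materializing all m * n scored pairs.
      
      ```scala
      import scala.collection.mutable
      
      object TopKSketch {
        // Keep at most k (item, score) pairs while streaming over scored items, using a
        // min-heap keyed on score, so memory stays O(k) rather than O(number of items).
        def topK(scores: Iterator[(Int, Double)], k: Int): List[(Int, Double)] = {
          // Min-heap: the weakest of the currently kept candidates sits at the head.
          val heap = mutable.PriorityQueue.empty[(Int, Double)](Ordering.by((p: (Int, Double)) => -p._2))
          scores.foreach { case (item, score) =>
            if (heap.size < k) heap.enqueue((item, score))
            else if (score > heap.head._2) { heap.dequeue(); heap.enqueue((item, score)) }
          }
          // Drain the heap; dequeue yields worst-first, so reverse to get best-first.
          val out = mutable.ArrayBuffer.empty[(Int, Double)]
          while (heap.nonEmpty) out += heap.dequeue()
          out.reverse.toList
        }
      
        def main(args: Array[String]): Unit = {
          val recs = topK(Iterator((1, 0.3), (2, 0.9), (3, 0.1), (4, 0.7)), k = 2)
          assert(recs.map(_._1) == List(2, 4))
        }
      }
      ```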
      
      Test environment:
      3 workers, each with 10 cores, 30 GB memory, and 1 executor.
      Data: 480,000 users and 17,000 items.
      
      | BlockSize     | 1024 | 2048 | 4096 | 8192 |
      | ------------- | ---- | ---- | ---- | ---- |
      | Old method    | 245s | 332s | 488s | OOM  |
      | This solution | 121s | 118s | 117s | 120s |
      
      The existing UT.
      
      Author: Peng <peng.meng@intel.com>
      Author: Peng Meng <peng.meng@intel.com>
      
      Closes #17742 from mpjlu/OptimizeAls.
      80794247
    • [SPARK-20661][SPARKR][TEST][FOLLOWUP] SparkR tableNames() test fails · b952b44a
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Change it to check for a relative count, as done in this test for the catalog APIs: https://github.com/apache/spark/blame/master/R/pkg/inst/tests/testthat/test_sparkSQL.R#L3355
      
      ## How was this patch tested?
      
      Unit tests; this needs to be combined with another commit containing the SQL change to check.
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17905 from felixcheung/rtabletests.
      b952b44a
  3. May 08, 2017
    • [SPARK-20661][SPARKR][TEST] SparkR tableNames() test fails · 2abfee18
      Hossein authored
      ## What changes were proposed in this pull request?
      Cleaning existing temp tables before running tableNames tests
      
      ## How was this patch tested?
      SparkR Unit tests
      
      Author: Hossein <hossein@databricks.com>
      
      Closes #17903 from falaki/SPARK-20661.
      2abfee18
    • [SPARK-20605][CORE][YARN][MESOS] Deprecate unused AM and executor port configurations · 829cd7b8
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      After SPARK-10997, the client-mode Netty RpcEnv no longer needs to start a server, so the port configurations are not used any more. Here we propose to remove these two configurations: "spark.executor.port" and "spark.am.port".
      
      ## How was this patch tested?
      
      Existing UTs.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #17866 from jerryshao/SPARK-20605.
      829cd7b8
    • [SPARK-20621][DEPLOY] Delete deprecated config parameter in 'spark-env.sh' · aeb2ecc0
      Xianyang Liu authored
      ## What changes were proposed in this pull request?
      
      Currently, `spark.executor.instances` is deprecated in `spark-env.sh`, because we suggest configuring it in `spark-defaults.conf` or another config file. This parameter is also useless even if you set it in `spark-env.sh`, so this patch removes it.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Xianyang Liu <xianyang.liu@intel.com>
      
      Closes #17881 from ConeyLiu/deprecatedParam.
      aeb2ecc0
    • [SPARK-20596][ML][TEST] Consolidate and improve ALS recommendAll test cases · 58518d07
      Nick Pentreath authored
      Existing test cases for `recommendForAllX` methods (added in [SPARK-19535](https://issues.apache.org/jira/browse/SPARK-19535)) test `k < num items` and `k = num items`. Technically we should also test that `k > num items` returns the same results as `k = num items`.
      
      ## How was this patch tested?
      
      Updated existing unit tests.
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #17860 from MLnick/SPARK-20596-als-rec-tests.
      58518d07
    • [SPARK-19956][CORE] Optimize a location order of blocks with topology information · 15526653
      Xianyang Liu authored
      ## What changes were proposed in this pull request?
      
      When calling BlockManager's getLocations method, we only compare the hosts of the data blocks. Non-local data blocks are chosen at random, which may select a replica in a different rack. So this patch additionally sorts the locations by rack.
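      
      A hedged sketch of the ordering idea (illustrative types, not `BlockManager.getLocations` itself): prefer replicas on the same host, then the same rack, then everything else.
      
      ```scala
      object LocationOrderSketch {
        final case class Location(host: String, rack: Option[String])
      
        def sortLocations(locs: Seq[Location], selfHost: String, selfRack: Option[String]): Seq[Location] = {
          // Split into local-host, same-rack, and off-rack groups, preserving order within groups.
          val (local, nonLocal) = locs.partition(_.host == selfHost)
          val (sameRack, offRack) = nonLocal.partition(l => selfRack.isDefined && l.rack == selfRack)
          local ++ sameRack ++ offRack
        }
      
        def main(args: Array[String]): Unit = {
          val sorted = sortLocations(
            Seq(Location("h3", Some("r2")), Location("h2", Some("r1")), Location("h1", Some("r1"))),
            selfHost = "h1", selfRack = Some("r1"))
          assert(sorted.map(_.host) == Seq("h1", "h2", "h3"))
        }
      }
      ```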
      
      ## How was this patch tested?
      
      New test case.
      
      Author: Xianyang Liu <xianyang.liu@intel.com>
      
      Closes #17300 from ConeyLiu/blockmanager.
      15526653
    • [SPARK-20519][SQL][CORE] Modify to prevent some possible runtime exceptions · 0f820e2b
      liuxian authored
      Signed-off-by: liuxian <liu.xian3@zte.com.cn>
      
      ## What changes were proposed in this pull request?
      
      When the input parameter is null, a runtime exception may occur.
      
      ## How was this patch tested?
      Existing unit tests
      
      Author: liuxian <liu.xian3@zte.com.cn>
      
      Closes #17796 from 10110346/wip_lx_0428.
      0f820e2b
    • [SPARKR][DOC] fix typo in vignettes · 2fdaeb52
      Wayne Zhang authored
      ## What changes were proposed in this pull request?
      Fix typo in vignettes
      
      Author: Wayne Zhang <actuaryzhang@uber.com>
      
      Closes #17884 from actuaryzhang/typo.
      2fdaeb52
    • [SPARK-20380][SQL] Unable to set/unset table comment property using ALTER... · 42cc6d13
      sujith71955 authored
      [SPARK-20380][SQL] Unable to set/unset table comment property using ALTER TABLE SET/UNSET TBLPROPERTIES ddl
      
      ### What changes were proposed in this pull request?
      The table comment was not getting set/unset by the **ALTER TABLE SET/UNSET TBLPROPERTIES** query,
      e.g. `ALTER TABLE table_with_comment SET TBLPROPERTIES ("comment" = "modified comment")`.
      When the user alters the table properties and adds/updates the table comment, the comment field of the **CatalogTable** instance is not updated, and the old comment (if any) is still shown to the user. To handle this issue, this PR updates the comment field of **CatalogTable** with the newly added/modified comment, along with the other table-level properties, when the user executes an **ALTER TABLE SET TBLPROPERTIES** query.
      
      This PR also takes care of unsetting the table comment when the user executes an **ALTER TABLE UNSET TBLPROPERTIES** query in order to unset or remove the table comment,
      e.g. `ALTER TABLE table_comment UNSET TBLPROPERTIES IF EXISTS ('comment')`.
      
      ### How was this patch tested?
      Added test cases in **SQLQueryTestSuite** that verify the table comment via a DESC FORMATTED query after adding/modifying the comment through **AlterTableSetPropertiesCommand** and unsetting it through **AlterTableUnsetPropertiesCommand**.
      
      Author: sujith71955 <sujithchacko.2010@gmail.com>
      
      Closes #17649 from sujith71955/alter_table_comment.
      42cc6d13
    • [SPARK-20626][SPARKR] address date test warning with timezone on windows · c24bdaab
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      set timezone on windows
      
      ## How was this patch tested?
      
      unit test, AppVeyor
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17892 from felixcheung/rtimestamptest.
      c24bdaab
  4. May 07, 2017
    • [SPARK-12297][SQL] Hive compatibility for Parquet Timestamps · 22691556
      Imran Rashid authored
      ## What changes were proposed in this pull request?
      
      This change allows timestamps in parquet-based hive tables to behave as a "floating time", without a timezone, as timestamps do for other file formats. If the storage timezone is the same as the session timezone, this conversion is a no-op. When data is read from a hive table, the table property is *always* respected. This allows Spark to keep its behavior when reading old data, yet read newly written data correctly (whatever the source of the data is).
      
      Spark inherited the original behavior from Hive, but Hive is also updating behavior to use the same  scheme in HIVE-12767 / HIVE-16231.
      
      The default for Spark remains unchanged; created tables do not include the new table property.
      
      This will only apply to hive tables; nothing is added to parquet metadata to indicate the timezone, so data that is read or written directly from parquet files will never have any conversions applied.
      
      ## How was this patch tested?
      
      Added a unit test which creates tables, reads and writes data, under a variety of permutations (different storage timezones, different session timezones, vectorized reading on and off).
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #16781 from squito/SPARK-12297.
      22691556
    • [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucketBy · f53a8207
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Adds Python wrappers for `DataFrameWriter.bucketBy` and `DataFrameWriter.sortBy` ([SPARK-16931](https://issues.apache.org/jira/browse/SPARK-16931))
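      
      For reference, a hedged sketch of the existing Scala `DataFrameWriter` API that these Python wrappers mirror (bucketed output has to go through `saveAsTable`; the table name here is just an example):
      
      ```scala
      import org.apache.spark.sql.SparkSession
      
      object BucketBySketch {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().master("local[*]").appName("bucketBy-demo").getOrCreate()
          import spark.implicits._
      
          Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
            .write
            .bucketBy(4, "id")            // hash rows into 4 buckets by `id`
            .sortBy("id")                 // sort rows within each bucket
            .saveAsTable("bucketed_demo") // bucketing metadata requires saveAsTable, not save()
      
          spark.stop()
        }
      }
      ```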
      
      ## How was this patch tested?
      
      Unit tests covering new feature.
      
      __Note__: Based on work of GregBowyer (f49b9a23468f7af32cb53d2b654272757c151725)
      
      CC HyukjinKwon
      
      Author: zero323 <zero323@users.noreply.github.com>
      Author: Greg Bowyer <gbowyer@fastmail.co.uk>
      
      Closes #17077 from zero323/SPARK-16931.
      f53a8207
    • [SPARK-20550][SPARKR] R wrapper for Dataset.alias · 1f73d358
      zero323 authored
      ## What changes were proposed in this pull request?
      
      - Add SparkR wrapper for `Dataset.alias`.
      - Adjust roxygen annotations for `functions.alias` (including example usage).
      
      ## How was this patch tested?
      
      Unit tests, `check_cran.sh`.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17825 from zero323/SPARK-20550.
      1f73d358
    • [MINOR][SQL][DOCS] Improve unix_timestamp's scaladoc (and typo hunting) · 500436b4
      Jacek Laskowski authored
      ## What changes were proposed in this pull request?
      
      * Docs are consistent (across different `unix_timestamp` variants and their internal expressions)
      * typo hunting
      
      ## How was this patch tested?
      
      local build
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #17801 from jaceklaskowski/unix_timestamp.
      500436b4
    • [SPARK-20543][SPARKR][FOLLOWUP] Don't skip tests on AppVeyor · 7087e011
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      add environment
      
      ## How was this patch tested?
      
      wait for appveyor run
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17878 from felixcheung/appveyorrcran.
      7087e011
    • [SPARK-7481][BUILD] Add spark-hadoop-cloud module to pull in object store access. · 2cf83c47
      Steve Loughran authored
      ## What changes were proposed in this pull request?
      
      Add a new `spark-hadoop-cloud` module and maven profile to pull in object store support from `hadoop-openstack`, `hadoop-aws` and `hadoop-azure` (Hadoop 2.7+) JARs, along with their dependencies, fixing up the dependencies so that everything works, in particular Jackson.
      
      It restores `s3n://` access to S3, adds its `s3a://` replacement, OpenStack `swift://` and azure `wasb://`.
      
      There's a documentation page, `cloud_integration.md`, which covers the basic details of using Spark with object stores, referring the reader to the supplier's own documentation, with specific warnings on security and the possible mismatch between a store's behavior and that of a filesystem. In particular, users are advised to be very cautious when trying to use an object store as the destination of data, and to consult the documentation of the storage supplier and the connector.
      
      (this is the successor to #12004; I can't re-open it)
      
      ## How was this patch tested?
      
      Downstream tests exist in [https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples](https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples)
      
      Those verify that the dependencies are sufficient to allow downstream applications to work with s3a, azure wasb and swift storage connectors, and perform basic IO & dataframe operations thereon. All seems well.
      
      Manually clean build & verify that assembly contains the relevant aws-* hadoop-* artifacts on Hadoop 2.6; azure on a hadoop-2.7 profile.
      
      SBT build: `build/sbt -Phadoop-cloud -Phadoop-2.7 package`
      maven build `mvn install -Phadoop-cloud -Phadoop-2.7`
      
      This PR *does not* update `dev/deps/spark-deps-hadoop-2.7` or `dev/deps/spark-deps-hadoop-2.6`, because unless the hadoop-cloud profile is enabled, no extra JARs show up in the dependency list. The dependency check in Jenkins isn't setting the property, so the new JARs aren't visible.
      
      Author: Steve Loughran <stevel@apache.org>
      Author: Steve Loughran <stevel@hortonworks.com>
      
      Closes #17834 from steveloughran/cloud/SPARK-7481-current.
      2cf83c47
    • [SPARK-20484][MLLIB] Add documentation to ALS code · 88e6d750
      Daniel Li authored
      ## What changes were proposed in this pull request?
      
      This PR adds documentation to the ALS code.
      
      ## How was this patch tested?
      
      Existing tests were used.
      
      mengxr srowen
      
      This contribution is my original work.  I have the license to work on this project under the Spark project’s open source license.
      
      Author: Daniel Li <dan@danielyli.com>
      
      Closes #17793 from danielyli/spark-20484.
      88e6d750
    • [SPARK-20518][CORE] Supplement the new blockidsuite unit tests · 37f963ac
      caoxuewen authored
      ## What changes were proposed in this pull request?
      
      This PR adds new unit tests covering ShuffleDataBlockId, ShuffleIndexBlockId, TempShuffleBlockId, and TempLocalBlockId.
      
      ## How was this patch tested?
      
      The new unit test.
      
      Author: caoxuewen <cao.xuewen@zte.com.cn>
      
      Closes #17794 from heary-cao/blockidsuite.
      37f963ac
    • [SPARK-18777][PYTHON][SQL] Return UDF from udf.register · 63d90e7d
      zero323 authored
      ## What changes were proposed in this pull request?
      
      - Move udf wrapping code from `functions.udf` to `functions.UserDefinedFunction`.
      - Return wrapped udf from `catalog.registerFunction` and dependent methods.
      - Update docstrings in `catalog.registerFunction` and `SQLContext.registerFunction`.
      - Unit tests.
      
      ## How was this patch tested?
      
      - Existing unit tests and docstests.
      - Additional tests covering new feature.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17831 from zero323/SPARK-18777.
      63d90e7d
    • [SPARK-20557][SQL] Support JDBC data type Time with Time Zone · cafca54c
      Xiao Li authored
      ### What changes were proposed in this pull request?
      
      This PR adds support for the JDBC data type TIME WITH TIME ZONE, which can be converted to TIMESTAMP.
      
      In addition, before this PR, for unsupported data types, we simply output the type number instead of the type name.
      
      ```
      java.sql.SQLException: Unsupported type 2014
      ```
      After this PR, the message is like
      ```
      java.sql.SQLException: Unsupported type TIMESTAMP_WITH_TIMEZONE
      ```
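      
      For illustration, one hedged way to turn a JDBC type code into a readable name for such a message (not necessarily how Spark's `JdbcUtils` does it) is reflection over `java.sql.Types`:
      
      ```scala
      object JdbcTypeNameSketch {
        // Look up the constant name in java.sql.Types whose value equals the given type code.
        def typeName(sqlType: Int): String =
          classOf[java.sql.Types].getFields
            .find(f => f.getType == java.lang.Integer.TYPE && f.getInt(null) == sqlType)
            .map(_.getName)
            .getOrElse(sqlType.toString)
      
        def main(args: Array[String]): Unit = {
          // 2014 is java.sql.Types.TIMESTAMP_WITH_TIMEZONE (JDBC 4.2), the code from the old message.
          println(s"Unsupported type ${typeName(2014)}")  // Unsupported type TIMESTAMP_WITH_TIMEZONE
        }
      }
      ```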
      
      - Also upgrade the H2 version to `1.4.195` which has the type fix for "TIMESTAMP WITH TIMEZONE". However, it is not fully supported. Thus, we capture the exception, but we still need it to partially test the support of "TIMESTAMP WITH TIMEZONE", because Docker tests are not regularly run.
      
      ### How was this patch tested?
      Added test cases.
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17835 from gatorsmile/h2.
      cafca54c
  5. May 05, 2017