  1. Jun 20, 2016
    • Eric Liang's avatar
      [SPARK-16025][CORE] Document OFF_HEAP storage level in 2.0 · 07367533
      Eric Liang authored
The OFF_HEAP storage level has changed since 1.6: it now stores data off-heap using Spark's own off-heap memory support instead of in Tachyon.
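For illustration, a minimal sketch of using the documented level (not part of this patch, which only updates documentation); it assumes off-heap memory has been enabled and sized via `spark.memory.offHeap.enabled` and `spark.memory.offHeap.size`, and that `sc` is an existing SparkContext (e.g. in spark-shell).

```scala
// Hedged sketch: persist an RDD with the OFF_HEAP storage level in Spark 2.0.
// Assumes spark.memory.offHeap.enabled=true and spark.memory.offHeap.size are set.
import org.apache.spark.storage.StorageLevel

val data = sc.parallelize(1 to 1000000)
data.persist(StorageLevel.OFF_HEAP)
data.count()  // materializes the RDD into Spark's off-heap store (no Tachyon involved)
```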
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #13744 from ericl/spark-16025.
      07367533
    • hyukjinkwon's avatar
      [SPARK-16044][SQL] input_file_name() returns empty strings in data sources based on NewHadoopRDD · 4f7f1c43
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR makes `input_file_name()` function return the file paths not empty strings for external data sources based on `NewHadoopRDD`, such as [spark-redshift](https://github.com/databricks/spark-redshift/blob/cba5eee1ab79ae8f0fa9e668373a54d2b5babf6b/src/main/scala/com/databricks/spark/redshift/RedshiftRelation.scala#L149) and [spark-xml](https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/util/XmlFile.scala#L39-L47).
      
Running the code below against such external data sources:
      
      ```scala
      df.select(input_file_name).show()
      ```
      
      will produce
      
      - **Before**
        ```
      +-----------------+
      |input_file_name()|
      +-----------------+
      |                 |
      +-----------------+
      ```
      
      - **After**
        ```
      +--------------------+
      |   input_file_name()|
      +--------------------+
      |file:/private/var...|
      +--------------------+
      ```
      
      ## How was this patch tested?
      
      Unit tests in `ColumnExpressionSuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #13759 from HyukjinKwon/SPARK-16044.
      4f7f1c43
    • Xiangrui Meng's avatar
      [SPARK-16074][MLLIB] expose VectorUDT/MatrixUDT in a public API · 18a8a9b1
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
Both VectorUDT and MatrixUDT are private APIs because UserDefinedType itself is private in Spark. However, in order to let developers implement their own transformers and estimators, we should expose both types in a public API to simplify the implementation of transformSchema, transform, etc. Otherwise, they need to get the data types using reflection.
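As an illustration, a hedged sketch of how a custom transformer's `transformSchema` might use the exposed types; it assumes the public accessor `org.apache.spark.ml.linalg.SQLDataTypes` (with `VectorType`/`MatrixType`) introduced by this change, and the column names are hypothetical.

```scala
import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.types.{StructField, StructType}

// Validate that the (hypothetical) input column is a vector column and append a
// vector-typed output column, without reflecting on the private VectorUDT class.
def transformSchema(schema: StructType): StructType = {
  require(schema("features").dataType == SQLDataTypes.VectorType,
    "Column 'features' must be of vector type.")
  StructType(schema.fields :+ StructField("scaledFeatures", SQLDataTypes.VectorType, nullable = false))
}
```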
      
      ## How was this patch tested?
      
      Unit tests in Scala and Java.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #13789 from mengxr/SPARK-16074.
      18a8a9b1
    • gatorsmile's avatar
      [SPARK-16056][SPARK-16057][SPARK-16058][SQL] Fix Multiple Bugs in Column... · d9a3a2a0
      gatorsmile authored
      [SPARK-16056][SPARK-16057][SPARK-16058][SQL] Fix Multiple Bugs in Column Partitioning in JDBC Source
      
      #### What changes were proposed in this pull request?
      This PR is to fix the following bugs:
      
      **Issue 1: Wrong Results when lowerBound is larger than upperBound in Column Partitioning**
      ```scala
      spark.read.jdbc(
        url = urlWithUserAndPass,
        table = "TEST.seq",
        columnName = "id",
        lowerBound = 4,
        upperBound = 0,
        numPartitions = 3,
        connectionProperties = new Properties)
      ```
      **Before code changes:**
The returned results are wrong because the generated partitions are wrong:
      ```
        Part 0 id < 3 or id is null
        Part 1 id >= 3 AND id < 2
        Part 2 id >= 2
      ```
      **After code changes:**
An `IllegalArgumentException` is thrown:
      ```
      Operation not allowed: the lower bound of partitioning column is larger than the upper bound. lowerBound: 5; higherBound: 1
      ```
      **Issue 2: numPartitions is more than the number of key values between upper and lower bounds**
      ```scala
      spark.read.jdbc(
        url = urlWithUserAndPass,
        table = "TEST.seq",
        columnName = "id",
        lowerBound = 1,
        upperBound = 5,
        numPartitions = 10,
        connectionProperties = new Properties)
      ```
      **Before code changes:**
The returned results are correct, but the generated partitions are very inefficient, e.g.:
      ```
      Partition 0: id < 1 or id is null
      Partition 1: id >= 1 AND id < 1
      Partition 2: id >= 1 AND id < 1
      Partition 3: id >= 1 AND id < 1
      Partition 4: id >= 1 AND id < 1
      Partition 5: id >= 1 AND id < 1
      Partition 6: id >= 1 AND id < 1
      Partition 7: id >= 1 AND id < 1
      Partition 8: id >= 1 AND id < 1
      Partition 9: id >= 1
      ```
      **After code changes:**
`numPartitions` is adjusted and the correct answers are returned:
      ```
      Partition 0: id < 2 or id is null
      Partition 1: id >= 2 AND id < 3
      Partition 2: id >= 3 AND id < 4
      Partition 3: id >= 4
      ```
      **Issue 3: java.lang.ArithmeticException when numPartitions is zero**
```scala
      spark.read.jdbc(
        url = urlWithUserAndPass,
        table = "TEST.seq",
        columnName = "id",
        lowerBound = 0,
        upperBound = 4,
        numPartitions = 0,
        connectionProperties = new Properties)
      ```
      **Before code changes:**
      Got the following exception:
      ```
        java.lang.ArithmeticException: / by zero
      ```
      **After code changes:**
A correct answer is returned by disabling column partitioning when `numPartitions` is less than or equal to zero.
      
      #### How was this patch tested?
      Added test cases to verify the results
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #13773 from gatorsmile/jdbcPartitioning.
      d9a3a2a0
    • Reynold Xin's avatar
      [SPARK-13792][SQL] Limit logging of bad records in CSV data source · c775bf09
      Reynold Xin authored
      ## What changes were proposed in this pull request?
This pull request adds a new option (maxMalformedLogPerPartition) to the CSV reader to limit the maximum number of log messages Spark generates per partition for malformed records.
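A minimal usage sketch, assuming the new option is set like any other CSV reader option (the path is hypothetical and `spark` is an existing SparkSession):

```scala
// Hedged sketch: cap per-partition warnings about malformed lines at 10.
val df = spark.read
  .option("mode", "DROPMALFORMED")
  .option("maxMalformedLogPerPartition", "10")
  .csv("/path/to/data.csv")
```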
      
      The error log looks something like
      ```
      16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
      16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
      16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
      16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
      16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
      16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
      16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
      16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
      16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
      16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
      16/06/20 18:50:14 WARN CSVRelation: More than 10 malformed records have been found on this partition. Malformed records from now on will not be logged.
      ```
      
      Closes #12173
      
      ## How was this patch tested?
      Manually tested.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13795 from rxin/SPARK-13792.
      c775bf09
    • Dongjoon Hyun's avatar
      [SPARK-15294][R] Add `pivot` to SparkR · 217db56b
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR adds `pivot` function to SparkR for API parity. Since this PR is based on https://github.com/apache/spark/pull/13295 , mhnatiuk should be credited for the work he did.
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (including new testcase.)
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13786 from dongjoon-hyun/SPARK-15294.
      217db56b
    • Davies Liu's avatar
      [SPARK-16086] [SQL] fix Python UDF without arguments (for 1.6) · a46553cb
      Davies Liu authored
      
Fix a bug for Python UDFs that do not take any arguments.
      
      Added regression tests.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #13793 from davies/fix_no_arguments.
      
      (cherry picked from commit abe36c53)
      Signed-off-by: default avatarDavies Liu <davies.liu@gmail.com>
      a46553cb
    • Narine Kokhlikyan's avatar
      remove duplicated docs in dapply · e2b7eba8
      Narine Kokhlikyan authored
      ## What changes were proposed in this pull request?
      Removed unnecessary duplicated documentation in dapply and dapplyCollect.
      
      In this pull request I created separate R docs for dapply and dapplyCollect - kept dapply's documentation separate from dapplyCollect's and referred from one to another via a link.
      
      ## How was this patch tested?
      Existing test cases.
      
      Author: Narine Kokhlikyan <narine@slice.com>
      
      Closes #13790 from NarineK/dapply-docs-fix.
      e2b7eba8
    • Bryan Cutler's avatar
      [SPARK-16079][PYSPARK][ML] Added missing import for... · a42bf555
      Bryan Cutler authored
      [SPARK-16079][PYSPARK][ML] Added missing import for DecisionTreeRegressionModel used in GBTClassificationModel
      
      ## What changes were proposed in this pull request?
      
      Fixed missing import for DecisionTreeRegressionModel used in GBTClassificationModel trees method.
      
      ## How was this patch tested?
      
      Local tests
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #13787 from BryanCutler/pyspark-GBTClassificationModel-import-SPARK-16079.
      a42bf555
    • Kousuke Saruta's avatar
      [SPARK-16061][SQL][MINOR] The property... · 6daa8cf1
      Kousuke Saruta authored
      [SPARK-16061][SQL][MINOR] The property "spark.streaming.stateStore.maintenanceInterval" should be renamed to "spark.sql.streaming.stateStore.maintenanceInterval"
      
      ## What changes were proposed in this pull request?
      The property spark.streaming.stateStore.maintenanceInterval should be renamed and harmonized with other properties related to Structured Streaming like spark.sql.streaming.stateStore.minDeltasForSnapshot.
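A hedged usage sketch with the renamed property (the interval value is arbitrary):

```scala
import org.apache.spark.sql.SparkSession

// Hedged sketch: set the renamed property when building a session.
val spark = SparkSession.builder()
  .appName("state-store-maintenance")
  .config("spark.sql.streaming.stateStore.maintenanceInterval", "60s")
  .getOrCreate()
```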
      
      ## How was this patch tested?
      Existing unit tests.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #13777 from sarutak/SPARK-16061.
      6daa8cf1
    • Tathagata Das's avatar
      [SPARK-15982][SPARK-16009][SPARK-16007][SQL] Harmonize the behavior of... · b99129cc
      Tathagata Das authored
      [SPARK-15982][SPARK-16009][SPARK-16007][SQL] Harmonize the behavior of DataFrameReader.text/csv/json/parquet/orc
      
      ## What changes were proposed in this pull request?
      
      Issues with current reader behavior.
- `text()` without args returns an empty DF with no columns -> inconsistent; it's expected that `text()` always returns a DF with a `value` string field
- `textFile()` without args fails with an exception for the above reason: it expects the DF returned by `text()` to have a `value` field.
- `orc()` does not take varargs, which is inconsistent with the others
- `json(single-arg)` was removed, but that caused source compatibility issues - [SPARK-16009](https://issues.apache.org/jira/browse/SPARK-16009)
- user-specified schema was not respected when `text/csv/...` were used with no args - [SPARK-16007](https://issues.apache.org/jira/browse/SPARK-16007)
      
      The solution I am implementing is to do the following.
- For each format, there will be a single-argument method and a vararg method. For json, parquet, csv, and text, this means adding json(string), etc. For orc, this means adding orc(varargs). See the sketch after this list.
- Remove the special handling of text(), csv(), etc. that returns an empty dataframe with no fields. Rather, pass the empty sequence of paths on to the data source and let each data source handle it properly. E.g., the text data source should return an empty DF with schema (value: string)
      - Deduped docs and fixed their formatting.
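The sketch below illustrates the harmonized variants described above; the paths are hypothetical and `spark` is an existing SparkSession.

```scala
// Hedged sketch of the harmonized reader methods.
val df1 = spark.read.text("logs/2016-06-20")                     // single-path variant
val df2 = spark.read.text("logs/2016-06-19", "logs/2016-06-20")  // varargs variant
val ds  = spark.read.textFile("logs/2016-06-20")                 // Dataset[String], schema: value (string)
val df3 = spark.read.orc("warehouse/t1", "warehouse/t2")         // orc() now also accepts varargs
```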
      
      ## How was this patch tested?
      Added new unit tests for Scala and Java tests
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #13727 from tdas/SPARK-15982.
      b99129cc
    • Cheng Lian's avatar
      [SPARK-15863][SQL][DOC] Initial SQL programming guide update for Spark 2.0 · 6df8e388
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      Initial SQL programming guide update for Spark 2.0. Contents like 1.6 to 2.0 migration guide are still incomplete.
      
      We may also want to add more examples for Scala/Java Dataset typed transformations.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #13592 from liancheng/sql-programming-guide-2.0.
      6df8e388
    • Dongjoon Hyun's avatar
      [SPARK-14995][R] Add `since` tag in Roxygen documentation for SparkR API methods · d0eddb80
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR adds `since` tags to Roxygen documentation according to the previous documentation archive.
      
      https://home.apache.org/~dongjoon/spark-2.0.0-docs/api/R/
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13734 from dongjoon-hyun/SPARK-14995.
      d0eddb80
    • Sean Owen's avatar
      [MINOR] Closing stale pull requests. · 92514232
      Sean Owen authored
      Closes #13114
      Closes #10187
      Closes #13432
      Closes #13550
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #13781 from srowen/CloseStalePR.
      92514232
    • Felix Cheung's avatar
      [SPARK-15159][SPARKR] SparkSession roxygen2 doc, programming guide, example updates · 359c2e82
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      roxygen2 doc, programming guide, example updates
      
      ## How was this patch tested?
      
      manual checks
      shivaram
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #13751 from felixcheung/rsparksessiondoc.
      359c2e82
    • Dongjoon Hyun's avatar
      [SPARK-16053][R] Add `spark_partition_id` in SparkR · b0f2fb5b
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR adds `spark_partition_id` virtual column function in SparkR for API parity.
      
      The following is just an example to illustrate a SparkR usage on a partitioned parquet table created by `spark.range(10).write.mode("overwrite").parquet("/tmp/t1")`.
      ```r
      > collect(select(read.parquet('/tmp/t1'), c('id', spark_partition_id())))
         id SPARK_PARTITION_ID()
      1   3                    0
      2   4                    0
      3   8                    1
      4   9                    1
      5   0                    2
      6   1                    3
      7   2                    4
      8   5                    5
      9   6                    6
      10  7                    7
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (including new testcase).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13768 from dongjoon-hyun/SPARK-16053.
      b0f2fb5b
    • Felix Cheung's avatar
      [SPARKR] fix R roxygen2 doc for count on GroupedData · aee1420e
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      fix code doc
      
      ## How was this patch tested?
      
      manual
      
      shivaram
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #13782 from felixcheung/rcountdoc.
      aee1420e
    • Felix Cheung's avatar
      [SPARK-16028][SPARKR] spark.lapply can work with active context · 46d98e0a
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
Allow `spark.lapply` and `setLogLevel` to use the active Spark context instead of requiring one to be passed in.
      
      ## How was this patch tested?
      
      unit test
      
      shivaram thunterdb
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #13752 from felixcheung/rlapply.
      46d98e0a
    • Dongjoon Hyun's avatar
      [SPARK-16051][R] Add `read.orc/write.orc` to SparkR · c44bf137
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This issue adds `read.orc/write.orc` to SparkR for API parity.
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (with new testcases).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13763 from dongjoon-hyun/SPARK-16051.
      c44bf137
    • Felix Cheung's avatar
      [SPARK-16029][SPARKR] SparkR add dropTempView and deprecate dropTempTable · 36e812d4
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Add dropTempView and deprecate dropTempTable
      
      ## How was this patch tested?
      
      unit tests
      
      shivaram liancheng
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #13753 from felixcheung/rdroptempview.
      36e812d4
    • Dongjoon Hyun's avatar
      [SPARK-16059][R] Add `monotonically_increasing_id` function in SparkR · 96134248
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR adds `monotonically_increasing_id` column function in SparkR for API parity.
      After this PR, SparkR supports the followings.
      
      ```r
      > df <- read.json("examples/src/main/resources/people.json")
      > collect(select(df, monotonically_increasing_id(), df$name, df$age))
        monotonically_increasing_id()    name age
      1                             0 Michael  NA
      2                             1    Andy  30
      3                             2  Justin  19
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (with added testcase).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13774 from dongjoon-hyun/SPARK-16059.
      96134248
    • Shixiong Zhu's avatar
      [SPARK-16050][TESTS] Remove the flaky test: ConsoleSinkSuite · 5cfabec8
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
ConsoleSinkSuite just collects content from stdout and compares it with the expected string. However, because Spark may not stop some background threads at once, there is a race condition in which other threads write logs to **stdout** while ConsoleSinkSuite is running, causing the suite to fail.

Therefore, this PR just deletes `ConsoleSinkSuite`. If we want to test ConsoleSink in the future, we should refactor it to be testable without depending on stdout.
      
      ## How was this patch tested?
      
      Just removed a flaky test.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13776 from zsxwing/SPARK-16050.
      5cfabec8
    • Yin Huai's avatar
      [SPARK-16030][SQL] Allow specifying static partitions when inserting to data source tables · 905f774b
      Yin Huai authored
      ## What changes were proposed in this pull request?
This PR adds static partition support to the INSERT statement when the target table is a data source table.
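For illustration, a hedged sketch of the supported statement shape (table and column names are hypothetical; `spark` is an existing SparkSession):

```scala
// Hedged sketch: INSERT with a static partition spec into a partitioned data source table.
spark.sql(
  """INSERT INTO TABLE sales
    |PARTITION (country = 'US', year = 2016)
    |SELECT amount FROM staged_sales""".stripMargin)
```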
      
      ## How was this patch tested?
      New tests in InsertIntoHiveTableSuite and DataSourceAnalysisSuite.
      
      **Note: This PR is based on https://github.com/apache/spark/pull/13766. The last commit is the actual change.**
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #13769 from yhuai/SPARK-16030-1.
      905f774b
  2. Jun 19, 2016
    • Yin Huai's avatar
      [SPARK-16036][SPARK-16037][SPARK-16034][SQL] Follow up code clean up and improvement · 6d0f921a
      Yin Huai authored
      ## What changes were proposed in this pull request?
      This PR is the follow-up PR for https://github.com/apache/spark/pull/13754/files and https://github.com/apache/spark/pull/13749. I will comment inline to explain my changes.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #13766 from yhuai/caseSensitivity.
      6d0f921a
    • Matei Zaharia's avatar
      [SPARK-16031] Add debug-only socket source in Structured Streaming · 4f17fddc
      Matei Zaharia authored
      ## What changes were proposed in this pull request?
      
      This patch adds a text-based socket source similar to the one in Spark Streaming for debugging and tutorials. The source is clearly marked as debug-only so that users don't try to run it in production applications, because this type of source cannot provide HA without storing a lot of state in Spark.
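A minimal usage sketch, assuming the source is exposed under the short name `socket` with `host`/`port` options and that `spark` is an existing SparkSession (e.g. in spark-shell, with `nc -lk 9999` running locally):

```scala
// Hedged sketch: read lines from a local socket for debugging and echo them to the console.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val query = lines.writeStream.format("console").start()
```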
      
      ## How was this patch tested?
      
      Unit tests and manual tests in spark-shell.
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #13748 from mateiz/socket-source.
      4f17fddc
    • wm624@hotmail.com's avatar
[SPARK-16040][MLLIB][DOC] spark.mllib PIC document extra line of reference · 5930d7a2
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
In the 2.0 documentation, the line "A full example that produces the experiment described in the PIC paper can be found under examples/." is redundant.

The page already says "Find full example code at "examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala" in the Spark repo.".

Removing the first line keeps this page consistent with the other documents.
      
      ## How was this patch tested?
      
      Manual test
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #13755 from wangmiao1981/doc.
      5930d7a2
    • Prashant Sharma's avatar
      [SPARK-15942][REPL] Unblock `:reset` command in REPL. · 1b3a9b96
      Prashant Sharma authored
## What changes were proposed in this pull request?
      (Paste from JIRA issue.)
As a follow-up to SPARK-15697, the `:reset` command has the following semantics.
On `:reset` we forget everything the user has done, but not the initialization of Spark. To avoid confusion, we show a message that `spark` and `sc` are not erased; in fact they are in the same state as the user's previous operations left them.
While doing the above, I felt that this is not usually what "reset" means. But an accidental shutdown of a cluster can be very costly, so maybe in that sense this is less surprising and still useful.
      
      ## How was this patch tested?
      
      Manually, by calling `:reset` command, by both altering the state of SparkContext and creating some local variables.
      
      Author: Prashant Sharma <prashant@apache.org>
      Author: Prashant Sharma <prashsh1@in.ibm.com>
      
      Closes #13661 from ScrapCodes/repl-reset-command.
      1b3a9b96
    • Davies Liu's avatar
      [SPARK-15613] [SQL] Fix incorrect days to millis conversion due to Daylight Saving Time · 001a5896
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
Internally, we use an Int to represent a date (the number of days since 1970-01-01). When we convert that into a unix timestamp (milliseconds since the epoch in UTC), we obtain the timezone offset using local millis (milliseconds since 1970-01-01 in that timezone), but TimeZone.getOffset() expects a unix timestamp, so the result can be off by one hour (depending on whether Daylight Saving Time (DST) is in effect).

This PR changes the code to use a best-effort approximation of the POSIX timestamp to look up the offset. Around a DST change, some local times are not defined (for example, 2016-03-13 02:00:00 PST) and others map to multiple valid instants in UTC (for example, 2016-11-06 01:00:00), so this best-effort approximation should be enough in practice.
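For illustration, a hedged sketch of the flawed conversion described above (not the patched code); the day count is illustrative and the comments state the assumption that it falls on a DST-transition day.

```scala
import java.util.TimeZone

// Hedged sketch of the old, flawed approach: the zone offset is looked up with
// "local millis" even though TimeZone.getOffset expects a real UTC epoch timestamp.
// Near a DST transition the looked-up offset may be wrong by one hour.
def naiveDaysToMillis(days: Int, tz: TimeZone): Long = {
  val localMillis = days.toLong * 86400000L
  localMillis - tz.getOffset(localMillis)  // offset queried with the wrong kind of timestamp
}

val tz = TimeZone.getTimeZone("America/Los_Angeles")
println(naiveDaysToMillis(16873, tz))      // 16873 ~ 2016-03-13, a DST-transition day in this zone
```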
      
      ## How was this patch tested?
      
      Added regression tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13652 from davies/fix_timezone.
      001a5896
  3. Jun 18, 2016
    • Sean Zhong's avatar
      [SPARK-16034][SQL] Checks the partition columns when calling... · ce3b98ba
      Sean Zhong authored
      [SPARK-16034][SQL] Checks the partition columns when calling dataFrame.write.mode("append").saveAsTable
      
      ## What changes were proposed in this pull request?
      
      `DataFrameWriter` can be used to append data to existing data source tables. It becomes tricky when partition columns used in `DataFrameWriter.partitionBy(columns)` don't match the actual partition columns of the underlying table. This pull request enforces the check so that the partition columns of these two always match.
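A hedged sketch of the enforced check (the table name is hypothetical and `spark` is an existing SparkSession):

```scala
// Hedged sketch: partitionBy() columns of an append must match the table's partition columns.
val df = spark.range(10).selectExpr("id", "id % 2 AS part")

df.write.partitionBy("part").saveAsTable("t_partition_check")                  // creates a table partitioned by `part`
df.write.mode("append").partitionBy("part").saveAsTable("t_partition_check")   // OK: partition columns match
df.write.mode("append").partitionBy("id").saveAsTable("t_partition_check")     // now fails: mismatched partition columns
```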
      
      ## How was this patch tested?
      
      Unit test.
      
      Author: Sean Zhong <seanzhong@databricks.com>
      
      Closes #13749 from clockfly/SPARK-16034.
      ce3b98ba
    • Wenchen Fan's avatar
      [SPARK-16036][SPARK-16037][SQL] fix various table insertion problems · 3d010c83
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
The current table insertion has some weird behaviours:

1. inserting into a partitioned table with mismatched columns produces a confusing error message for Hive tables and wrong results for data source tables
2. inserting into a partitioned table without a partition list produces wrong results for Hive tables.
      
      This PR fixes these 2 problems.
      
      ## How was this patch tested?
      
      new test in hive `SQLQuerySuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13754 from cloud-fan/insert2.
      3d010c83
    • Josh Howes's avatar
      [SPARK-15973][PYSPARK] Fix GroupedData Documentation · e574c997
      Josh Howes authored
      *This contribution is my original work and that I license the work to the project under the project's open source license.*
      
      ## What changes were proposed in this pull request?
      
      Documentation updates to PySpark's GroupedData
      
      ## How was this patch tested?
      
      Manual Tests
      
      Author: Josh Howes <josh.howes@gmail.com>
      Author: Josh Howes <josh.howes@maxpoint.com>
      
      Closes #13724 from josh-howes/bugfix/SPARK-15973.
      e574c997
    • Andrew Or's avatar
      [SPARK-16023][SQL] Move InMemoryRelation to its own file · 35a2f3c0
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Improve readability of `InMemoryTableScanExec.scala`, which has too much stuff in it.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #13742 from andrewor14/move-inmemory-relation.
      35a2f3c0
    • Jeff Zhang's avatar
      [SPARK-15803] [PYSPARK] Support with statement syntax for SparkSession · 898cb652
      Jeff Zhang authored
      ## What changes were proposed in this pull request?
      
      Support with statement syntax for SparkSession in pyspark
      
      ## How was this patch tested?
      
Manually verified. Although I could add a unit test for it, it would affect other unit tests because the SparkContext is stopped after the with statement.
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #13541 from zjffdu/SPARK-15803.
      898cb652
    • andreapasqua's avatar
      [SPARK-16035][PYSPARK] Fix SparseVector parser assertion for end parenthesis · 4c64e88d
      andreapasqua authored
      ## What changes were proposed in this pull request?
      The check on the end parenthesis of the expression to parse was using the wrong variable. I corrected that.
      ## How was this patch tested?
      Manual test
      
      Author: andreapasqua <andrea@radius.com>
      
      Closes #13750 from andreapasqua/sparse-vector-parser-assertion-fix.
      4c64e88d
  4. Jun 17, 2016
    • Shixiong Zhu's avatar
      [SPARK-16020][SQL] Fix complete mode aggregation with console sink · d0ac0e6f
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
We cannot use `limit` on a DataFrame in ConsoleSink because it would use the wrong planner. This PR instead collects the `DataFrame` and calls `show` on a batch DataFrame built from the result. This is fine since ConsoleSink is only for debugging.
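For illustration, a hedged end-to-end sketch of the case this fixes (a complete-mode aggregation streamed to the console sink); the host/port are hypothetical and `spark` is an existing SparkSession.

```scala
// Hedged sketch: complete-mode aggregation written to the console sink.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val counts = lines.groupBy("value").count()

val query = counts.writeStream
  .outputMode("complete")  // the mode that previously hit the wrong planner
  .format("console")
  .start()
```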
      
      ## How was this patch tested?
      
      Manually confirmed ConsoleSink now works with complete mode aggregation.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13740 from zsxwing/complete-console.
      d0ac0e6f
    • Felix Cheung's avatar
      [SPARK-15159][SPARKR] SparkR SparkSession API · 8c198e24
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      This PR introduces the new SparkSession API for SparkR.
      `sparkR.session.getOrCreate()` and `sparkR.session.stop()`
      
      "getOrCreate" is a bit unusual in R but it's important to name this clearly.
      
The SparkR implementation follows these principles:
      - SparkSession is the main entrypoint (vs SparkContext; due to limited functionality supported with SparkContext in SparkR)
      - SparkSession replaces SQLContext and HiveContext (both a wrapper around SparkSession, and because of API changes, supporting all 3 would be a lot more work)
      - Changes to SparkSession is mostly transparent to users due to SPARK-10903
- Full backward compatibility is expected - users should be able to initialize everything just as in Spark 1.6.1 (`sparkR.init()`), but with a deprecation warning
      - Mostly cosmetic changes to parameter list - users should be able to move to `sparkR.session.getOrCreate()` easily
- An advanced syntax with named parameters (aka varargs aka "...") is supported; that should be closer to the Builder syntax in Scala/Python (which unfortunately does not work in R because it would look like this: `enableHiveSupport(config(config(master(appName(builder(), "foo"), "local"), "first", "value"), "next", "value"))`)
      - Updating config on an existing SparkSession is supported, the behavior is the same as Python, in which config is applied to both SparkContext and SparkSession
- Some SparkSession changes are not matched in SparkR, mostly because they would be breaking API changes: `catalog` object, `createOrReplaceTempView`
      - Other SQLContext workarounds are replicated in SparkR, eg. `tables`, `tableNames`
- `sparkR` shell is updated to use the SparkSession entrypoint (`sqlContext` is removed, just like with Scala/Python)
      - All tests are updated to use the SparkSession entrypoint
      - A bug in `read.jdbc` is fixed
      
      TODO
      - [x] Add more tests
      - [ ] Separate PR - update all roxygen2 doc coding example
      - [ ] Separate PR - update SparkR programming guide
      
      ## How was this patch tested?
      
      unit tests, manual tests
      
      shivaram sun-rui rxin
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #13635 from felixcheung/rsparksession.
      8c198e24
    • Xiangrui Meng's avatar
      [SPARK-15946][MLLIB] Conversion between old/new vector columns in a DataFrame (Python) · edb23f9e
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
      This PR implements python wrappers for #13662 to convert old/new vector columns in a DataFrame.
      
      ## How was this patch tested?
      
      doctest in Python
      
      cc: yanboliang
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #13731 from mengxr/SPARK-15946.
      edb23f9e
    • GayathriMurali's avatar
      [SPARK-15129][R][DOC] R API changes in ML · af2a4b08
      GayathriMurali authored
      ## What changes were proposed in this pull request?
      
      Make user guide changes to SparkR documentation for all changes that happened in 2.0 to Machine Learning APIs
      
      Author: GayathriMurali <gayathri.m@intel.com>
      
      Closes #13285 from GayathriMurali/SPARK-15129.
      af2a4b08
    • Cheng Lian's avatar
      [SPARK-16033][SQL] insertInto() can't be used together with partitionBy() · 10b67144
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
When inserting into an existing partitioned table, the partitioning columns should always be determined by the catalog metadata of the existing target table. Extra `partitionBy()` calls don't make sense and can mess up existing data, because newly inserted data may end up with a wrong partitioning directory layout.
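A hedged sketch of the rule described above (table/column names are hypothetical and `spark` is an existing SparkSession):

```scala
// Hedged sketch: for an existing partitioned table, the layout comes from catalog
// metadata, so combining partitionBy() with insertInto() is rejected.
val df = spark.range(10).selectExpr("id", "id % 2 AS part")

df.write.partitionBy("part").saveAsTable("events_sketch")    // create a table partitioned by `part`
df.write.insertInto("events_sketch")                         // OK: partitioning taken from the table metadata
df.write.partitionBy("part").insertInto("events_sketch")     // now rejected instead of risking a corrupted layout
```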
      
      ## How was this patch tested?
      
      New test case added in `InsertIntoHiveTableSuite`.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #13747 from liancheng/spark-16033-insert-into-without-partition-by.
      10b67144
    • hyukjinkwon's avatar
      [SPARK-15916][SQL] JDBC filter push down should respect operator precedence · ebb9a3b6
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
This PR fixes the problem that operator precedence is messed up when pushing where-clause expressions down to the JDBC layer.
      
      **Case 1:**
      
For the SQL `select * from table where (a or b) and c`, the where clause is wrongly converted to the JDBC where clause `a or (b and c)` after filter push-down. The consequence is that JDBC may return fewer or more rows than expected.
      
      **Case 2:**
      
For the SQL `select * from table where always_false_condition`, the result table may not be empty if the JDBC RDD is partitioned using per-partition where clauses:
      ```
spark.read.jdbc(url, table, predicates = Array("partition 1 where clause", "partition 2 where clause", ...))
      ```
      
      ## How was this patch tested?
      
      Unit test.
      
This PR also closes #13640.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      Author: Sean Zhong <seanzhong@databricks.com>
      
      Closes #13743 from clockfly/SPARK-15916.
      ebb9a3b6